pleurotus's comments

Super cool to read, but can someone eli5 what Gaussian splatting is (and/or radiance fields?), specifically in relation to how the article talks about it finally being "mature enough"? What's changed that makes this possible now?


1. Create a point cloud from a scene (either via lidar, or via photogrammetry from multiple images)

2. Replace each point of the point cloud with a fuzzy ellipsoid that has a bunch of parameters for its position + size + orientation + view-dependent color (via spherical harmonics up to some low order)

3. If you render these ellipsoids using a differentiable renderer, then you can subtract the resulting image from the ground truth (i.e. your original photos), and calculate the partial derivatives of the error with respect to each of the millions of ellipsoid parameters that you fed into the renderer.

4. Now you can run gradient descent using the differentiable renderer, which makes your fuzzy ellipsoids converge to something closely reproducing the ground truth images (from multiple angles).

5. Since the ellipsoids started at the 3D point cloud's positions, the 3D structure of the scene will likely be preserved during gradient descent, thus the resulting scene will support novel camera angles with plausible-looking results.
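
To make steps 3–5 concrete, here is a toy 2D sketch of that optimization loop in PyTorch. It is heavily simplified and assumes things the real pipeline does differently: isotropic blobs instead of oriented ellipsoids, additive blending instead of proper front-to-back alpha compositing, and a synthetic gradient image standing in for the ground-truth photos.

    import torch

    H = W = 64
    N = 200
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1)                      # (H, W, 2) pixel coordinates

    pos     = torch.rand(N, 2, requires_grad=True)           # blob centers
    log_sig = torch.full((N,), -3.0, requires_grad=True)     # log of blob radius
    color   = torch.rand(N, 3, requires_grad=True)           # RGB per blob
    logit_a = torch.zeros(N, requires_grad=True)             # opacity (pre-sigmoid)

    def render():
        # differentiable "renderer": every pixel is a weighted sum of blob colors
        d2 = ((pix[None] - pos[:, None, None]) ** 2).sum(-1)                 # (N, H, W)
        sig = torch.exp(log_sig)[:, None, None]
        w = torch.sigmoid(logit_a)[:, None, None] * torch.exp(-d2 / (2 * sig ** 2))
        return (w[..., None] * color[:, None, None, :]).sum(0).clamp(0, 1)   # (H, W, 3)

    target = torch.stack([xs, ys, 1 - xs], dim=-1)            # stand-in for a ground-truth photo
    opt = torch.optim.Adam([pos, log_sig, color, logit_a], lr=0.02)
    for step in range(500):
        loss = ((render() - target) ** 2).mean()              # error vs. ground truth
        opt.zero_grad(); loss.backward(); opt.step()          # gradient descent on blob parameters

The real method does the same thing with millions of 3D ellipsoids, many photos with known camera poses, and a custom CUDA rasterizer, but the loop has the same shape.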


You... you must have been quite some 5 year old.


ELI5 has meant friendly simplified explanations (not responses aimed at literal five-year-olds) since forever, at least on the subreddit where the concept originated.

Now, perhaps referring to differentiability isn't layperson-accessible, but this is HN after all. I found it to be the perfect degree of simplification personally.


Some things would be literally impossible to properly explain to a 5 year old.


If one actually tried to explain it to a five-year-old, they could use things like analogy, simile, metaphor, and other forms of rhetoric. This was just a straight-up technical explanation.


Lol. Def not for 5 year olds but it's about exactly what I needed

How about this:

Take a lot of pictures of a scene from different angles, do some crazy math, and then you can later pretend to zoom and pan the camera around however you want


Sure, but does that explanation really help anyone? Imo it might scare people off actually diving into things; the math isn't too crazy.


Anybody sufficiently interested would press further, not back away.


Saying "math" (even using it in a dismissive tl;dr) is immensely helpful. Specifically, I've never encountered these terms before:

- point cloud
- fuzzy ellipsoid
- view-dependent color
- spherical harmonics
- low order
- differentiable renderer (what makes it differentiable? A renderer creates images, right?)
- subtract the resulting image from the ground truth (good to know this means your original photos, but how do you subtract images from images?)
- millions of ellipsoid parameters (the explanation previously mentioned 4 parameters by name. Where are the millions coming from?)
- gradient descent (I've heard of this in AI, but usually ignore it because I haven't gotten deep enough into it to need to understand what it means)
- 3D point cloud's positions (are all point clouds 3d? The point cloud mentioned earlier wasn't. Or was it? Is this the same point cloud?)

In other words, you've explained this at far too high a level for me. Given that the request was for ELI5, I expected an explanation that I could actually follow, without knowing any specific terminology. Do disregard specifics and call it math. Don't just call it math and skip past it entirely: call it math and explain what you're actually doing with the math, rather than trying to explain the math you're doing; same for all the other words. If a technical term is only needed once in a conversation, then don't use it.

Given that I actually do know what photogrammetry is at a basic level, I can make a best-effort translation here, but it's purely from 100% guessing rather than actually understanding:

1. Create a 3d scan of a real-life scene or object. It uses radar (intentionally incorrect term, more familiar) or multiple photographs at different angles to see the 3 dimensional shape.

2. For some reason, break the scan up into smaller shapes.

This is where my understanding goes to nearly 0:

3-5: somehow, looking at the difference between a rendering of your 3d scene and a picture of the actual scene allows you to correct the errors in the 3d scene to make it more realistic. Using complex math works better and having the computer do it is less effort than manually correcting the models in your 3d scene.


Thanks.

How hard is it to handle cases where the starting positions of the ellipsoids in 3D are not correct (too far off)? How common is such a scenario with the state of the art? E.g., with only a stereoscopic image pair, the correspondences are often inaccurate.

Thanks.


I assume that the differentiable renderer is only given the camera position and viewing angle at any one time (in order to be able to generalize to new viewing angles)?

Is it a fully connected NN?


No. There are no neural networks here. The renderer is just a function that takes a bunch of ellipsoid parameters and outputs a bunch of pixels. You render the scene, then subtract the ground truth pixels from the result, and sum the squared differences to get the total error. Then you ask the question "how would the error change if the X position of ellipsoid #1 was changed slightly?" (then repeat for all ellipsoid parameters, not just the X position, and all ellipsoids, not just ellipsoid #1). In other words, compute the partial derivative of the error with respect to each ellipsoid parameter. This gives you a gradient that you can use to adjust the ellipsoids to decrease the error (i.e. get closer to the ground truth image).
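
As a purely numerical illustration of that "nudge one parameter and see how the error changes" idea (this is not the actual method, which differentiates through the rasterizer automatically, and the "renderer" here is a made-up toy):

    import numpy as np

    def render(params):
        # toy stand-in renderer: a blurry 1D bump whose center and height are the parameters
        x = np.linspace(0, 1, 100)
        center, height = params
        return height * np.exp(-((x - center) ** 2) / 0.01)

    def error(params, ground_truth):
        return ((render(params) - ground_truth) ** 2).sum()   # sum of squared differences

    ground_truth = render(np.array([0.6, 1.0]))    # pretend this is the photo
    params = np.array([0.4, 0.5])                  # current guess

    eps = 1e-5
    grad = np.zeros_like(params)
    for i in range(len(params)):                   # "how does the error change if parameter i moves slightly?"
        bumped = params.copy(); bumped[i] += eps
        grad[i] = (error(bumped, ground_truth) - error(params, ground_truth)) / eps

    params -= 1e-4 * grad                          # one gradient-descent step toward the ground truth

Automatic differentiation gives you the same gradient for millions of parameters in one backward pass, instead of one re-render per parameter.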


Great explanation/simplification. Top quality contribution.


And what about the "mature enough" part? How has it changed / progressed recently?


The field is advancing rapidly. New research papers have been published daily for a few years now. The best news feed I've found on the topic is

https://radiancefields.com/

https://x.com/RadianceFields alt: https://xcancel.com/RadianceFields


Thanks for the explanation!


Or: Matrix bullet time with more viewpoints and less quality.


Gaussian splatting is a way to record 3-dimensional video. You capture a scene from many angles simultaneously and then combine all of those into a single representation. Ideally, that representation is good enough that you can then, post-production, simulate camera angles you didn't originally record.

For example, the camera orbits around the performers in this music video are difficult to imagine in real space. Even if you could pull it off using robotic motion control arms, it would require that the entire choreography is fixed in place before filming. This video clearly takes advantage of being able to direct whatever camera motion the artist wanted in the 3d virtual space of the final composed scene.

To do this, the representation needs to estimate the radiance field, i.e. the amount and color of light visible at every point in your 3d volume, viewed from every angle. It's not possible to do this at high resolution by breaking that space up into voxels; those scale badly, O(n^3). You could attempt to guess at some mesh geometry and paint textures onto it compatible with the camera views, but that's difficult to automate.
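
Rough, illustrative numbers for the scaling point (assuming 4-byte floats, ~2M splats, and the roughly 59 parameters per splat of the original formulation: position, scale, rotation, opacity, and spherical-harmonic color coefficients):

    voxels = 1024 ** 3 * 4 * 4     # 1024^3 grid, RGBA per voxel, no view dependence: ~17 GB
    splats = 2_000_000 * 59 * 4    # ~2M splats x 59 floats each:                     ~0.5 GB
    print(voxels / 1e9, splats / 1e9)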

Gaussian splatting estimates these radiance fields by assuming that the radiance is built from millions of fuzzy, colored balls positioned, stretched, and rotated in space. These are the Gaussian splats.

Once you have that representation, constructing a novel camera angle is as simple as positioning and angling your virtual camera and then recording the colors and positions of all the splats that are visible.
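
A hypothetical, heavily simplified sketch of what "recording the splats the camera sees" means for a single pixel: project each splat center through a pinhole camera, treat it as an isotropic 2D Gaussian on screen, and alpha-composite front to back. The real renderer projects the full 3D covariance and runs as a tile-based GPU rasterizer; none of these names come from an actual library.

    import numpy as np

    def shade_pixel(pixel_xy, splats, K):
        """splats: dicts with 'center' (3, camera space), 'radius', 'color' (3,), 'opacity'.
        K: 3x3 pinhole intrinsics matrix."""
        ordered = sorted(splats, key=lambda s: s["center"][2])        # near to far
        color, transmittance = np.zeros(3), 1.0
        for s in ordered:
            x, y, z = s["center"]
            if z <= 0:
                continue                                              # behind the camera
            u = K[0, 0] * x / z + K[0, 2]                             # perspective projection
            v = K[1, 1] * y / z + K[1, 2]
            sigma = K[0, 0] * s["radius"] / z                         # apparent size shrinks with depth
            d2 = (pixel_xy[0] - u) ** 2 + (pixel_xy[1] - v) ** 2
            alpha = s["opacity"] * np.exp(-d2 / (2 * sigma ** 2))     # fuzzy falloff from the center
            color += transmittance * alpha * s["color"]               # front-to-back compositing
            transmittance *= 1.0 - alpha
            if transmittance < 1e-3:
                break                                                 # pixel is effectively opaque
        return color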

It turns out that this approach is pretty amenable to techniques similar to modern deep learning. You basically train the positions/shapes/rotations of the splats via gradient descent. It's mostly been explored in research labs but lately production-oriented tools have been built for popular 3d motion graphics tools like Houdini, making it more available.


Thanks for the explanation! It makes a lot of sense that voxels would scale as badly as they do, especially if you want to increase resolution. Am I right in assuming that the reason this scales a lot better is because the Gaussian splats, once there's enough "resolution" of them, can provide the estimates for how light works reasonably well at most distances? What I'm getting at is: can I see Gaussian splats vs voxels similarly to pixels vs vector graphics in images?


I think, yes, with greater splat density—and, critically, more and better inputs to train on, others have stated that these performances were captured with 56 RealSense D455fs—then splats will more accurately estimate light at more angles and distances. I think it's likely that during capture they had to make some choices about lighting and bake those in, so you might still run into issues matching lighting to your shots, but still.

https://www.realsenseai.com/products/real-sense-depth-camera...

That said, I don't think splats:voxels is like pixels:vector graphics. Maybe a closer analogy would be that pixels:vectors is like voxels:3d mesh modeling. You might imagine a sophisticated animated character being created and then animated using motion capture techniques.

But notice where these things fall apart, too. SVG shines when it's not just estimating the true form, but literally is it (fonts, simplified graphics made from simple strokes). If you try to estimate a photo using SVG it tends to get messy. Similar problems arise when reconstructing a 3d mesh from real-world data.

I agree that splats are a bit like pixels, though. They're samples of color and light in 3d (2d) space. They represent the source more faithfully when they're more densely sampled.

The difference is that a splat is sampled irregularly, just where it's needed within the scene. That makes it more efficient at representing most useful 3d scenes (i.e., ones where there are a few subjects and objects in mostly empty space). It just uses data where that data has an impact.


Are meshes not used instead of gaussian splats only due to robustness reasons? I.e., if there were a piece of software that could reliably turn a colored point cloud into a textured mesh, would that be preferable?


Photogrammetry has been around for a long time now. It uses pretty much the same inputs to create meshes from a collection of images of a scene.

It works well for what it does. But, it's mostly only effective for opaque, diffuse, solid surfaces. It can't handle transparency, reflection or "fuzz". Capturing material response is possible, but requires expensive setups.

A scene like this poodle https://superspl.at/view?id=6d4b84d3 or this bee https://superspl.at/view?id=cf6ac78e would be pretty much impossible with photogrammetry and very difficult with manual, traditional, polygon workflows. Those are not videos. Spin them around.


It’s not only for robustness. Splats are volumetric and don’t have topology constraints, and both of those things are desirable. The volume capability is sometimes used for volume effects like fog and clouds, but it also gives splats a very graceful way to handle high frequency geometry - higher frequency detail than the capture resolution - that mesh photogrammetry can’t handle (hair, fur, grass, foliage, cloth, etc.). It depends entirely on the resolution of the capture, of course. I’m not saying meshes can’t be used to model hair or other fine details, they can obviously, but in practice you will never get a decent mesh out of, say, iPhone headshots, while splats will work and capture hair pretty well. There are hair-specific capture methods that are decent, but no general mesh capture methods that’ll do hair and foliage and helicopters and buildings.

BTW I believe there is software that can turn point clouds into textured meshes reliably; multiple techniques even, depending on what your goals are.
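
For example, Poisson surface reconstruction is one of those techniques, and (if I recall the API correctly) Open3D exposes it in a few lines; texturing/coloring the result is a separate step:

    import open3d as o3d

    pcd = o3d.io.read_point_cloud("scan.ply")      # hypothetical input file
    pcd.estimate_normals()                         # Poisson needs oriented normals
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
    o3d.io.write_triangle_mesh("mesh.ply", mesh)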


Not everything can be represented by textured meshes used in traditional photogrammetry (think Google Street View)

This includes sparse areas like fences, vegetation and the likes, but more importantly any material properties like reflections, specularity, opacity, etc.

Here's a few great examples: https://superspl.at/view?id=cf6ac78e

https://superspl.at/view?id=c67edb74


> Gaussian splatting is a way to record 3-dimensional video.

I would say it's a 3D photo, not a 3D video. But there are already extensions to dynamic scenes with movement.


See 4D splatting.


Brain dances!


It’s a point cloud where each point is a semitransparent blob that can have a view-dependent color: the color changes depending on the direction you look at it from, allowing it to capture reflections, iridescence…

You generate the point clouds from multiple images of a scene or an object and some machine learning magic
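
The view-dependent color part is usually done with low-order spherical harmonics: each blob stores a few coefficients per color channel, and the color you see is the SH basis evaluated in the viewing direction, weighted by those coefficients. A degree-1 sketch (the real thing typically goes up to degree 3, and sign/offset conventions vary by implementation):

    import numpy as np

    C0 = 0.28209479177387814      # sqrt(1 / (4*pi)), degree-0 real SH constant
    C1 = 0.4886025119029199       # sqrt(3 / (4*pi)), degree-1 real SH constant

    def sh_color(coeffs, view_dir):
        """coeffs: (4, 3) SH coefficients per RGB channel; view_dir: unit 3-vector."""
        x, y, z = view_dir
        basis = np.array([C0, C1 * y, C1 * z, C1 * x])   # degree-0 and degree-1 basis functions
        return np.clip(basis @ coeffs + 0.5, 0.0, 1.0)   # +0.5 offset follows the common 3DGS convention

    # Same blob, different colors from different viewing directions:
    coeffs = np.random.randn(4, 3) * 0.2
    print(sh_color(coeffs, np.array([0.0, 0.0, 1.0])))
    print(sh_color(coeffs, np.array([1.0, 0.0, 0.0])))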


This 2-minute video is a great intro to the topic https://www.youtube.com/watch?v=HVv_IQKlafQ

I think this tech has become "production-ready" recently due to a combination of research progress (the seminal paper was published in 2023 https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/) and improvements to differentiable programming libraries (e.g. PyTorch) and GPU hardware.


For the ELI5, Gaussian splatting represents the scene as millions of tiny, blurry colored blobs in 3D space and renders by quickly "splatting" them onto the screen, making it much faster than computing an image by querying a neural network model the way NeRF-style radiance fields do.

I'm not up on how things have changed recently



This is a REALLY good video explaining it. https://www.youtube.com/watch?v=eekCQQYwlgA


I found this VFX breakdown of the recent Superman movie to have a great explanation of what it is and what it makes possible: https://youtu.be/eyAVWH61R8E?t=232

tl;dr eli5: Instead of capturing spots of color as they would appear to a camera, they capture spots of color and where they exist in the world. By combining multiple cameras doing this, you can make a 3D world from footage that you can then move a virtual camera around.
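
A minimal sketch of the "spots of color and where they exist in the world" part, assuming a standard pinhole depth camera (the intrinsics fx, fy, cx, cy come from calibration; nothing here is specific to the tools used on the film):

    import numpy as np

    def backproject(u, v, depth, fx, fy, cx, cy):
        # turn one depth-camera pixel into a 3D point in camera coordinates
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.array([x, y, depth])

    # Each camera then transforms its points into a shared world frame using its
    # known pose, and the merged, colored point cloud is what gets splatted.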


I also spoke to the vfx team from Superman on how they achieved the reconstructions! (I’m also the author for the Helicopter article here).

https://radiancefields.com/gaussian-splatting-in-superman


As far as I read in Ladybird's blog updates, the issue is less the formalised specs and more that other browsers break the specs, so websites adjust, and you need to take that non-compliance with the specs into account in your design.


Yeah, they mention it in the article: most network connections are restricted, but not connections to Anthropic. To spell out the obvious, that's because Claude needs to talk to its own servers. But here they show that you can get it to talk to its own servers and still put documents into another user's account by using a different API key, all in a way that you, as an end user, wouldn't really see while it's happening.


> Not to mention that a liar doesn't necessarily have to mean someone who tells a falsehood in every single statement.

FTA: >Note: this question was originally set in a maths exam, so the answer assumes some basic assumptions about formal logic. A liar is someone who only says false statements.

I think it's pretty clear how they're defining it.


Ah I see that now, thanks. It was under the ad banner so I missed it first time around.


That's more a flaw of classification systems, though. Even if they comprise a distinct life form, that does not mean they need to have a unique species. Consider lichen, which comprise two (or more!) separate "species", a distinction that becomes meaningless when they cannot survive on their own, or even if they could, not in a form recognizable in any way as what they were when they were part of the symbiotic system.


I mean at that point what do we consider multicellular organisms? Did you see what was going on with those frog skin cells and "xenobots"? Also, our gut bacteria kinda makes us a symbiote at a larger scale.


Idk man if your splash page links to Wikipedia to explain your project maybe rewrite it instead


I think this is what they are referring to: https://time.com/6247678/openai-chatgpt-kenya-workers/


That seems more likely than my initial interpretation, in which case the moral and ethical implications of doing that just so you can have a "Summarize with AI" button or other such features in your web browser are obviously much worse.

