My understanding is that those models create Gaussian splats from a text prompt, kinda like a 3D version of nano banana. I'm not doing that (yet). What I'm doing is creating splats from a set of photos, aka "splat training", and then rendering the splat as a static scene (dynamic scenes are a work in progress) on the Quest headset. This is pretty well-worn territory with a lot of good implementations, but I have my own: a trainer in C++/CUDA (originally based on SpeedySplat, which was written in Python, but now completely rewritten, with not much of SpeedySplat left) and a renderer in C++/OpenXR for the Quest (originally based on an LLM-made port of 3DGS.cpp to OpenXR, but 100% rewritten now), so I can easily integrate techniques from research.
How would editing work?
Do you think these will win over video world models like Genie?
Have you played with DiamondWM and other open-source video world models?