Nice and concise description of quadtrees implemented with classical pointer chasing data structures.
A faster and (arguably) simpler way to construct quad/octrees is using morton codes, sorting and searching with flat arrays [0]. It's also probably easier to implement.
The gist of it is that you quantize the coordinates, do bit interleaving to construct morton code and then sort. The sorting is using numerical keys so you can use a radix sort for O(n) complexity which is much faster on a GPU (but on single CPU core a comparison based sort will probably win if the array isn't huge).
Now everything on the top half of the space is in the beginning of the array and the bottom half is in the end of the array (assuming yxyxyxyx morton code bits). Each of those halves is then split left/right and then each level below that alternates between vertical and horizontal splits.
To find the split point in the array, look at the first and last entry of the (sub)array you're looking at, and look for the first differing bit with (first ^ last).leading_zeros(). Then binary search for the first entry where that bit is high.
To traverse the quad/octree, repeat this process with the two halves you found. This can be done without recursion in O(1) memory using a fixed size stack because you know the depth of the "recursion" is at most half the number of bits in the morton code.
If you used radix sorting for building the array, you can avoid the binary search if you store the histograms from the counting phase of the sorting. Storing the whole histogram may be too much, but just a few highest order bits can already help.
Although I've found experimentally that for small (less that 10000 objects) just sorting and searching is faster if the whole array fits in L2 cache. On my 2015 laptop a single core can sort 10k objects with 64 bit keys in 1 millisecond. Traversing the tree for frustum culling is about 5x faster than just going through the entire array because a lot of geometry can be discarded very quickly.
With a good comparison based sort rebuilding the array after some changes is mighty fast because modern sorting algorithms are much faster for "almost sorted" inputs.
For range queries ("Find everything in the region" in the article), you can probably get better performance by using the BIGMIN/LITMAX method [1].
Now here's a brain teaser to delight your Friday: the article and the method I describe above is for storing points/centroids, but often you need to store axis aligned boxes instead (AABB). There is a very clever trick for this that I discovered independently but later found in some research papers. Can you come up with a (very) small change in the algorithm above to extend it to AABBs instead of centroids?
Discretize the AABB with a certain length proportional to the morton code grid size so that each sample point will land in a distinct but continuous sequence of morton codes enclosing the AABB, compute the hash of each of those points, then dump all of those morton codes into the sort and reverse lookup the other AABBs with the same morton code :)
The problem comes at scale, when you have many AABBs with different sizes. Memory pressure can be punishing and you are bottlenecked by the network for distributed solutions. Is there a faster way with fewer points? I would love to know!
For each AABB, you take the morton code of the minimum and maximum and store them in the array.
But for sorting and searching, construct a sort key from the pair of morton codes so that the most significant bits are the common prefix of the two morton codes and the least significant bits are the "level" (length of common prefix in bits).
E.g. for a 3d octree with 64 bit keys, you'd have 1 unused bit, 57 = 19 * 3 bits of morton code prefix and 6 bits of level. A 2d quadtree with 32b keys would have 1 unused bit, 26 = 2 * 13 bits of prefix and 5 bits of level.
You could also store just the sort key, but I've stored both AABB morton codes so that I can reconstruct the quantized AABBs for culling.
When you sort by this key, you'll get a linear octree where the beginning of the array is all the AABBs that overlap the subdivision plane, then everything on the left side of the plane and the rest is right side of the plane. When traversing the tree, you need two searches per node to find the entries belonging to this node and to find the split point between the left and the right side.
Having the "level" in the least significant bits works as a tiebreaker in the sorting to ensure that entries closer to the root of the tree are in the beginning of the array.
I wrote about this solution vaguely the last time this link came up (wasn't that long ago either, I think?) and you've filled in all the details and the essential Karras paper link; great post :)
Vulkan gives all the tools to avoid any "lag spikes" from shader compiling. In fact, causing them is much more difficult than OpenGL where they could happen in surprising places (and only on certain hardware).
The issue is two fold:
1. Some engines produce a lot of shader permutations. Some AAA titles can have 60000 different shaders compiled.
2. Some GPU rasterizer states (such as color blending) are implemented as shader epilogues.
In Vulkan 1.0 almost all of the pipeline state had to be pre-baked into a pipeline state object compiled ahead of time. This lead to a "shader permutation explosion" where different states need different pipelines.
This requires the game engine to either a) compile all the pipeline combinations ahead of time (slow loading time) or b) compile them as needed (lag spikes).
The core issue for this was solved years ago and now most of the pipeline states can be added to the command buffers ("dynamic states"). This solves the permutation explosion. But at the same time it opens the door for issue 2: some states (blending in particular) can cause a state-based recompile (like ye olde OpenGL days) at runtime.
The only solution to the second problem is not to use the dynamic states that trigger recompiling. That's basically only blending as far as I know. You can't even have dynamic blend state on all GPUs.
For maximum developer flexibility there's the shader object extension that allows mixing and matching shaders and pipeline states any way you want. This will cause state based recompiles at unpredictable times but it's an opt-in feature and easy to avoid if lag spikes are not wanted.
tl;dr: shader recompilation is easy to avoid in Vulkan but porting legacy engine code or art content may take you off the happy path.
> look up the extension on vulkan.gpuinfo.org and see ... currently 0.3% of all devices support it.
Afaik the extension isn't even finalized yet and they are pre-releasing it to gather feedback.
And you can't use gpuinfo for assessing how widely available something is or isn't. The stats contain reports from old drivers too so the numbers you see are no indication of hardware support.
To assess how widely supported something is, you need to look at gpuinfo, sort by date or driver version and cross reference something like steam hardware survey.
You can use descriptor heaps with existing bindless shaders if you configure the optional "root signature".
However it looks like it's simpler to change your shaders (if you can) to use the new GLSL/SPIR-V functionality (or Slang) and don't specify the root signature at all (it's complex and verbose).
Descriptor heaps really reduce the amount of setup code needed, with pipeline layouts gone you can drop like third of the code needed to get started.
Having quite recently written a (still experimental) Vulkan backend for sokol_gfx.h, my impression is that starting with `VK_EXT_descriptor_buffer` (soon-ish to be replaced with `VK_EXT_descriptor_heap`), the "core API" is in pretty good shape now (with the remaining problem that all the outdated and depreciated sediment layers are still part of the core API, this should really be kicked out - e.g. when I explicitly request a specific API version like 1.4 I don't care about any features that have been deprecated in versions up to 1.4 and I don't care about any extensions that have been incorporated into the core API up until 1.4, so I'd really like to have them at least not show up in the Vulkan header so that code completion cannot sneak in outdated code (like EXT/KHR postfixes for things that have been moved into core).
The current OpenGL-like sediment-layer-model (e.g. never remove old stuff) is extremely confusing when not following Vulkan development very closely since 2016, since there's often 5 ways to do the same thing, 3 of which are deprecated - but finding out whether a feature is deprecated is its own sidequest.
What I actually wrestled with most was getting the outer frame-loop right without validation layer errors. I feel like this should be the next thing which the "Eye of Khronos" should focus on.
All official tutorial/example code I've tried doesn't run without swapchain-sync-related validation errors on one or another configuration. Even this 'best practices' example code which demonstrates how to do the frame-loop scaffolding correctly produces valiation layer errors, so it's also quite useless:
What's worse: different hardware/driver combos produce different validation layer errors (even in the swapchain-code which really shouldn't have different implementations across GPU vendors - e.g. shouldn't Khronos provide common reference code for those GPU-independent parts of drivers?). I wonder if there is actually any Vulkan code out there which is completely validation-layer-clean across all possible configs (I seriously doubt it).
Also the VK_[EXT/KHR]_swapchain_maintenance1 extension which is supposed to fix all those little warts has such a low coverage that it's not worth supporting (but it should really be part of the core API by now - the extension is from 2019).
Anyway... baby steps into the right direction, only a shame that it took a decade ;)
> Isn't the idea that 99% of people use a toolkit atop of Vulkan?
This idea creates a serious chicken-egg-problem.
Two or three popular engine code bases sitting on top of Vulkan isn't enough 'critical mass' to get robust and high performance Vulkan drivers. When there's so little diversity in the code hammering on the Vulkan API it's unlikely that all the little bugs and performance problems lurking in the drivers will be triggered and fixed, especially when most Unity or Unreal game projects will simply select the D3D11 or D3D12 backend since their main target platform on PC is Windows.
Similar problem to when GLQuake was the only popular OpenGL game, as soon as your own code used the GL API in a slightly different way than Quake did all kinds of things broke since those GL drivers only properly implemented and tested the GL subset used by GLQuake, and with the specific function call patterns of GLQuake.
From what I've seen so far, the MESA Vulkan drivers on Linux seem to be in much better shape than the average Windows Vulkan driver. The only explanation I have for this is that there are hardly any Windows games running on top of Vulkan (instead they use D3D11 or D3D12), while running those same D3D11/D3D12 games on Linux via Proton always goes through the Vulkan driver. So on Linux there may be more 'evolutionary pressure' to get high quality Vulkan drivers indirectly via D3D11/D3D12 games that run via Proton.
You might be unaware of this, but Vulkan Video Decode is slowly but surely replacing the disparate bespoke video decode acceleration on almost all platforms.
Vulkan is mature. It has been used in production since 2013 (!) in the form of Mantle. I have no idea why all the Vulkan doomsayers here think it still needs a half-to-whole decade to be 'useful'.
280 games over 10 years really isn't impressive (2.5x less than even D3D8 which was an unpopular 'inbetween' D3D version and only relevant for about 2 years). D3D12 (890 games) isn't great either when compared to D3D11 (4.6k) or D3D9 (3.3k), it really demonstrates what a massive failure the modern 3D APIs are for real-world usage :/
I don't think those lists are complete, but they seem to show the right relative amount of 3D API usage across PC games.
I’m just pointing out that Vulkan is supported on all major modern engines, internal and public. Some also go so far as to do DX12 (fine, it’s a similar feeling API) but what’s really amazing is taking all of those games that run on OpenGL, DirectX, etc and forcing them to run on Vulkan…
Proton is amazing and Wine project deserves your support.
Video games are entertainment. In the old days you inserted a cartridge or optical disc into a physical device. You play the game, finish it and then move on. They are always self contained experiences with a custom UI independent of the OS.
In the best case, explicit Linux support does not affect the experience in a positive or negative way. In the worst case, explicit Linux support means the game can't be played anymore.
Doing it this way actually makes games more stable on Linux. Often, Linux ports of games would be riddled with bugs because the QA just isn't worth it. Especially because desktop Linux is always in a fast flux of changes. Hence the joke that "Win32 is the only stable Linux ABI."
Now game studios can just develop for windows, work out all the bugs. And then Proton has a broad set of compatibility patches that can be applied to those Windows games.
Doing it this way also unlocks a gigantic library of old games that otherwise would have been unplayable on Linux.
>Like, these days game devs just use Unreal Engine
This is not true in the slightest. There are loads of custom 3D engines across many many companies/hobbyists. Vulkan has been out for a decade now, there are likely Vulkan backends in many (if not most) of them.
Many people need something in-between heavy frameworks and engines or oppinionated wrappers with questionable support on top of Vulkan; and Vulkan itself. OpenGL served that purpose perfectly, but it's unfortunately abandoned.
Isn't that what the Zink, ANGLE, or GLOVE projects meant to provide? Allow you to program in OpenGL, which is then automatically translated to Vulkan for you.
DirectX 9 is long term stable so I don't see the issue...
No current gen console supports it. Mac is stuck on OpenGL 4.1 (you can't even compile anything OpenGL on a Mac without hacks). Devices like Android run Vulkan more and more and are sunsetting OpenGLES. No, OpenGL is dead. Vulkan/Metal/NVN/DX12/WebGPU are the current.
Also, if we're talking Switch 2, you have Vulkan support on it so odds are you would choose 2x-10x performance gains over OpenGL. It isn't that powerful. My 6 year old iPhone is on par.
People generally don't realize how much more CPU efficient calls are with Vulkan, DX12 or Metal. And especially on handhelds, the SoC is always balancing its power envelope between the CPU and GPU so the more efficient you are with the CPU, the more power you can dedicate to the GPU.
The aforementioned abstraction layers exist. You had dismissed those as only suitable for backporting. Can you justify that? What exactly is wrong with using a long term stable API whether via the native driver or an abstraction layer?
Edit: By the same logic you could argue that C89 is dead for new projects but that's obviously not true. C89 is eternal and so is OpenGL now that we've got decent hardware independent implementations.
I don't see the point of those when I can just directly use OpenGL. Any translation layer typically comes with limitations or issues. Also, I'm not that glued to OpenGL, I do think it's a terrible API, but there just isn't anything better yet. I wanted Vulkan to be something better, but I'm not going to use an API with entirely pointless complexity with zero performance benefits for my use cases.
The one on Vulkan.org recently got updated to use dynamic rendering and a bunch of the newest features (plus modern C++, Slang instead of glsl, etc...).
The GPS jamming maps are based on commercial air traffic flying in the area.
While that gives some ideas of how widespread the jamming is, it won't give accurate information about the range (air traffic avoids areas with jamming) of the interference or any information from places where there is no commercial air traffic (war zones, etc).
Using printf in shaders is awesome, it makes a huge difference when writing and debugging shaders. Vulkan and GLSL (and Slang) have a usable printf out of the box, but HLSL and D3D do not.
Afaik the way it works in Vulkan is that all the string formatting is actually done on the CPU. The GPU writes only writes the data to buffers with structs based on the format string.
All the shader prints are captured by tools such as Renderdoc, so you can easily find the vertex or pixel that printed something and then replay the shader execution in a debugger.
I only wish that we would've had this 20 years ago, it would have saved me so much time, effort and frustration.
Finding which pixel to debug, or just dumping some info from the pixel under mouse cursor (for example) is better done with a simple printf. Then you can pick up the offending pixel/vertex/mesh/compute in the debugger if you still need it.
You get both, a debugger and printf related tooling in Renderdoc and it's better than either of those alone.
I've been writing a lot of GPU code over the past few years (and the few decades before it) and shader printf has been a huge productivity booster.
Maybe this would be a suitable application for "Fibonacci hashing" [0][1], which is a trick to assign a hash table bucket from a hash value. Instead of just taking the modulo with the hash table size, it first multiplies the hash with a constant value 2^64/phi where phi is the golden ratio, and then takes the modulo.
There may be better constants than 2^64/phi, perhaps some large prime number with roughly equal number of one and zero bits could also work.
This will prevent bucket collisions on hash table resizing that may lead to "accidentally quadratic" behavior [2], while not requiring rehashing with a different salt.
I didn't do detailed analysis on whether it helps on hash table merging too, but I think it would.
> This will prevent bucket collisions on hash table resizing
Fibonacci hashing is really adding another multiplicative hashing step followed by dropping the bottom bits using a shift operation instead of the top bits using an and operation. Since it still works by dropping bits, items that were near before the resize will still be near after the resize and it won't really change anything.
A faster and (arguably) simpler way to construct quad/octrees is using morton codes, sorting and searching with flat arrays [0]. It's also probably easier to implement.
The gist of it is that you quantize the coordinates, do bit interleaving to construct morton code and then sort. The sorting is using numerical keys so you can use a radix sort for O(n) complexity which is much faster on a GPU (but on single CPU core a comparison based sort will probably win if the array isn't huge).
Now everything on the top half of the space is in the beginning of the array and the bottom half is in the end of the array (assuming yxyxyxyx morton code bits). Each of those halves is then split left/right and then each level below that alternates between vertical and horizontal splits.
To find the split point in the array, look at the first and last entry of the (sub)array you're looking at, and look for the first differing bit with (first ^ last).leading_zeros(). Then binary search for the first entry where that bit is high.
To traverse the quad/octree, repeat this process with the two halves you found. This can be done without recursion in O(1) memory using a fixed size stack because you know the depth of the "recursion" is at most half the number of bits in the morton code.
If you used radix sorting for building the array, you can avoid the binary search if you store the histograms from the counting phase of the sorting. Storing the whole histogram may be too much, but just a few highest order bits can already help.
Although I've found experimentally that for small (less that 10000 objects) just sorting and searching is faster if the whole array fits in L2 cache. On my 2015 laptop a single core can sort 10k objects with 64 bit keys in 1 millisecond. Traversing the tree for frustum culling is about 5x faster than just going through the entire array because a lot of geometry can be discarded very quickly.
With a good comparison based sort rebuilding the array after some changes is mighty fast because modern sorting algorithms are much faster for "almost sorted" inputs.
For range queries ("Find everything in the region" in the article), you can probably get better performance by using the BIGMIN/LITMAX method [1].
Now here's a brain teaser to delight your Friday: the article and the method I describe above is for storing points/centroids, but often you need to store axis aligned boxes instead (AABB). There is a very clever trick for this that I discovered independently but later found in some research papers. Can you come up with a (very) small change in the algorithm above to extend it to AABBs instead of centroids?
[0] Karras: Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees - https://research.nvidia.com/sites/default/files/pubs/2012-06... [1] https://en.wikipedia.org/wiki/Z-order_curve#Use_with_one-dim...
reply