It would be great if Meta were able to support more open source projects, especially those they benefit from.
For example they use AsmJit in a lot of projects (both internal and open-source) and it's now unmaintained because of funding issues. Maybe they have internal forks by now too.
The problems related to PostgreSQL are pretty much all described here. It's very difficult to do low-latency queries if you cannot cache the compiled code and do it over and over again. And once your JIT is slow you need logic to decide whether to interpret or compile.
I think it would be the best to start interpreting the query and start compilation in another thread, and once the compilation is finished and the interpreter is still running, stop the interpreter and run the JIT-compiled code. This would give you the best latency, because there would be no waiting for the JIT compiler.
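A minimal sketch of that idea in Python, with hypothetical stand-ins: `interpret_row` plays the interpreter, `compile_query` plays the JIT compiler (here it just returns a Python function), and `run_tiered` is an illustrative name, not any database's API. The query starts in the interpreter immediately and switches to the compiled function as soon as the background thread signals that it is ready.

```python
import threading


def interpret_row(row):
    # Hypothetical slow tier: evaluates the expression for one row.
    return row * 2 + 1


def compile_query():
    # Hypothetical stand-in for a JIT compiler: returns a callable that
    # computes the same result as the interpreter.
    return lambda row: row * 2 + 1


def run_tiered(rows):
    compiled = {}
    ready = threading.Event()

    def compile_in_background():
        compiled["fn"] = compile_query()
        ready.set()  # publish the compiled code to the executor

    worker = threading.Thread(target=compile_in_background)
    worker.start()

    results = []
    tier_counts = {"interpreted": 0, "compiled": 0}
    for row in rows:
        if ready.is_set():
            # Compilation finished while we were running: switch tiers.
            results.append(compiled["fn"](row))
            tier_counts["compiled"] += 1
        else:
            # No waiting for the compiler: keep interpreting.
            results.append(interpret_row(row))
            tier_counts["interpreted"] += 1

    worker.join()
    return results, tier_counts


results, tier_counts = run_tiered(range(10_000))
print(tier_counts)
```

How many rows each tier handles depends on timing, so the split varies from run to run; the results are identical either way.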
> It's very difficult to do low-latency queries if you cannot cache the compiled code
This is not too difficult, it just requires a different execution style. Salesforce's Hyper for example very heavily relies on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has an extremely low start-up latency; an interpreter exists but is practically never used.
Postgres is very robust and very powerful, but simply not designed for fast execution of queries.
Disclosure: I work in the group that develops Umbra.
If I recall the research papers on Umbra correctly, it also uses AsmJit as a JIT backend, which means that theoretically the compilation times would be comparable if you only consider code-emitting overhead.
The problem will always be queries where the compilation is orders of magnitude more expensive than the query itself. I can imagine an indexed lookup of one or a few entries, etc... Accessing indexed entries like these is very well optimized by SQL query engines, and it possibly makes no sense to JIT-optimize it.
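One way to picture that decision, as a rough sketch: compile only when the estimated compile time plus the compiled run time beats plain interpretation. The function name and all the millisecond figures below are made up for illustration; a real engine would take the estimates from its planner's cost model.

```python
def should_compile(interp_ms, compiled_ms, compile_ms):
    """Return True when JIT compilation is expected to pay for itself.

    All three arguments are estimates in milliseconds (hypothetical inputs).
    """
    return compile_ms + compiled_ms < interp_ms


# Indexed point lookup: the query is far cheaper than compiling it.
print(should_compile(interp_ms=0.05, compiled_ms=0.03, compile_ms=1.0))   # False

# Large scan: compilation cost is amortized over many rows.
print(should_compile(interp_ms=800.0, compiled_ms=300.0, compile_ms=5.0))  # True
```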
Interesting... AsmJit is pretty fast for compilation, but about 3x slower than sljit. The only way I can see to make it fast enough, in theory (i.e. without slowing down point-lookup queries and such), would be to fuse planning with code generation - i.e. a single-pass plan builder + compiler, essentially. Not sure if Umbra tries to do that, and AsmJit is not the best choice for it anyway, but with sljit it could be on par with the interpreter even for the fastest queries, I believe. Pretty hard (likely impossible) to implement though, planning is inherently a non-linear process...
Because pg_jitter uses AsmJit's Compiler, which also allocates registers. That's much more work than using hardcoded physical registers in the SLJIT case. There is always a cost to such comfort.
I think AsmJit's strength is the completeness of its backends, as you can emit nice SIMD code with it (like AVX-512). But the performance could be better, of course - making it 2x faster would be possible.
There are other issues with that auto-allocation. I tested all 3 backends on very large queries (hundreds of KB per query). Performance of all of them (+LLVM, but -sljit) was abysmal - the compiler overhead was in seconds to tens(!) of seconds. They have some non-linear components in their optimization algorithms, while sljit scaled linearly and was almost as fast as for smaller queries. So yes, it gives higher run-time performance, but the cost of that performance grows non-linearly with code size and complexity, while you can still have good performance with manual allocation. I also don't believe you can make AsmJit 2x faster without sacrificing that auto-allocation algorithm.
AsmJit has only one place where a lot of time is spent - bin-packing. It's the least optimized part, which has quadratic complexity (at the moment), which starts to show when you have like hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function that has 16MB alone, with 65k labels and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is time to just call `emit()`).
If this function is optimized, or switched to some other implementation when there are tens of thousands of virtual registers, you would get orders of magnitude faster compilation.
But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example we are talking about 25ms when it comes to a single function having 1MB of machine code, and sub-ms time when you generate tens of KB of machine code.
So from my perspective the ability to generate SIMD code that the CPU would execute fast in inner loops is much more valuable than anything else. Any workload that is CPU-bound just deserves this. The question is how CPU-bound the workload is. I would imagine databases like postgres would be more memory-bound if you are processing huge rows and accessing only a very tiny part of each row - that's why columnar databases are so popular, but of course they have different problems.
I worked on one project that tried to deal with this by using buckets and hashing: there would be 16 buckets, and each column would go into one of them, to keep the columns closer to each other, so the query engine needs to load only the buckets used in the query. But we are talking about gigabytes of raw throughput per core in this case.
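Roughly, the idea looks like the sketch below. It is not the project's actual code: the CRC32 hash, the function names, and the column names are assumptions chosen only to show how a query ends up touching a subset of the 16 buckets.

```python
import zlib

NUM_BUCKETS = 16


def bucket_of(column_name):
    # Stable hash (CRC32) so a column always maps to the same bucket.
    return zlib.crc32(column_name.encode("utf-8")) % NUM_BUCKETS


def buckets_for_query(columns_used):
    # The engine only needs to load the buckets that hold referenced columns.
    return {bucket_of(name) for name in columns_used}


# Hypothetical table and query.
table_columns = ["id", "name", "email", "price", "created_at", "notes"]
layout = {name: bucket_of(name) for name in table_columns}
needed = buckets_for_query(["id", "price"])
print(layout)
print(f"load {len(needed)} of {NUM_BUCKETS} buckets")
```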
I have a test of a 200 KB query that AsmJit takes 7 seconds to compile (that's not too bad; both LLVM and MIR take ~20s), while sljit does it in 50ms. 200 KB is a pathological case, but it's not unheard of in the area I'm working on. It's realistic, although a rare case.
Over the last 10-15 years most OLTP workloads have become CPU-bound, because the active datasets of most real databases fully fit in memory. There are exceptions, of course.
That's interesting - 200 KB should not be a big deal for it - maybe it uses something that I usually don't, like many function calls, or an insane number of branches, etc... I would be interested in that case, but I'm not sure whether I would be able to blindly improve AsmJit without a comprehensive test.
Definitely good to know though. When it comes to low-latency compilation my personal goal is to make it even faster when generating small functions.
SLJIT is a bit smarter than just using hardcoded registers. It's multi-platform anyway, so it uses registers when they are available on the target platform and falls back to memory when they are not. That's why performance can differ between Windows and Linux on x64, for example - a different number of available registers.
Indeed, but this also means that you would get drastically different performance on platforms that have more physical registers vs on platforms that have fewer. For example x86_64 only has 16 GP registers, while AArch64 has 32 - if you use 25 registers without any analysis and just go to stack with 10 of them, the difference could be huge.
But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.
It's definitely different, and for Postgres specifically, they may complement each other. SLJit can be used for low latency queries where codegen time is more important than optimizations, also for other platforms like s390x / PPC / SPARC, etc. AsmJit can be used for SIMD optimizations for x86_64 and ARM64. MIR is kinda in the middle - it does auto-allocations of registers, doesn't support SIMD, but also it's multiplatform. The only thing that doesn't fit well here is LLVM :). It has some advantages in some edge cases, but... It really needs a separate provider, the current one is bad. I'll probably create another LLVM backend for pg_jitter in the future to utilize it properly...
> I think it would be the best to start interpreting the query and start compilation in another thread
This technique is known as a "tiered JIT". It's how production virtual machines operate for high-level languages like JavaScript.
There can be many tiers, like an interpreter, baseline compiler, optimizing compiler, etc. The runtime switches into the faster tier once it becomes ready.
It’s also common for JITs to sprout a tier and shed a tier over time, as the last and first tiers shift in cost/benefit. If the first tier works better you delay the other tiers. If the last tier gets faster (in run time or code optimization) you engage it sooner, or strip the middle tier entirely and hand half that budget to the last tier.
The idea with parallel compilation is interesting. Worth considering, in some cases. The only problem with it is the same as too much parallelization - you can exhaust your CPU resources much faster. But with some sort of smart scheduling it should work. I'll think about it, thanks!
How does JIT kill compatibility if it's only enabled on x86 and AArch64? You can compile Blend2D without it and it would just work.
So no, it doesn't kill any compatibility - it only shows a different approach.
BTW GPU-only renderers suck, and many renderers that have GPU and CPU engines suck when the GPU is not available or has bugs. Strong CPU rendering performance is just necessary for any kind of library if you want true compatibility across various platforms.
I have seen broken rendering on GPU many, many times without any ability to switch to CPU. And the biggest problem is that the more exotic the HW you run it on, the less chance there is that somebody would be able to fix it (talking about GPUs).
You need to rerun the benchmarks if you want fresh numbers. The post was written when Blend2D didn't have JIT for AArch64, which penalized it a bit. Also on X86_64 the numbers are really good for Blend2D, which beats Blaze in some tests. So it's not black&white.
And please keep in mind that Blend2D is not really in development anymore - it has no funding so the project is basically done.
> And please keep in mind that Blend2D is not really in development anymore - it has no funding so the project is basically done.
That's such a shame. Thanks a lot for Blend2D! I wish companies were less greedy and would fund amazing projects like yours. Unfortunately, I do think that everyone is a bit obsessed with GPUs nowadays. For 2D rendering the CPU is great, especially if you want predictable results and avoid having to deal with the countless driver bugs that plague every GPU vendor.
Skia is definitely not a good example at all. Skia started as a CPU renderer, and added GPU rendering later, which heavily relies on caching. Vello, for example, takes a completely different approach compared to Skia.
NV path rendering is a joke. nVidia thought that ALL graphics would be rendered on GPU within 2 years of making the presentation, and it has been 2 decades and 2D CPU renderers still shine.
Right. The question is: does Skia grow its broad and useful toolkit with an eye toward further GPU optimization? Or does Vello (broadened and perhaps burdened by Rust and the shader-obsessive crowd) grow a broad and useful API?
There's also the issue of just how many billions of line segments you really need to draw every 1/120th of a second at 8K resolution, but I'll leave those discussions to dark-gray Discord forums rendered by Skia in a browser.
> There's also the issue of just how many billions of line segments you really need to draw every 1/120th of a second at 8K resolution
IMO, one of the biggest benefits of a high performance renderer would be power savings (very important for laptops and phones). If I can run the same work but use half the power, then by all means I'd be happy to deal with the complications that the GPU brings. AFAIK though, no one really cares about that and even efforts like Vello are just targeting fps gains, which do correlate with reduced power consumption but only indirectly.
Adding power draw into the mix is pretty interesting. Just because a GPU can render something 2x faster in a particular test doesn't mean you have consumed 50% less power, especially when we talk about dedicated GPUs that can have power draw in the hundreds of watts.
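To put numbers on it: energy is power multiplied by time, so a faster renderer only saves energy if its power draw isn't proportionally higher. The wattages and frame times below are invented for illustration, not measurements.

```python
def energy_joules(power_watts, seconds):
    # Energy = power x time.
    return power_watts * seconds


# Hypothetical numbers: the GPU finishes in half the time but draws far more power.
cpu = energy_joules(power_watts=5.0, seconds=0.010)
gpu = energy_joules(power_watts=100.0, seconds=0.005)

print(f"CPU: {cpu:.3f} J, GPU: {gpu:.3f} J")
print("GPU uses less energy" if gpu < cpu else "CPU uses less energy")
```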
Historically 2D rendering on CPU was pretty much single-threaded. Skia is single-threaded, Cairo too, Qt mostly (they offload gradient rendering to threads, but it's painfully slow for small gradients, worse than single-threaded), AGG is single-threaded, etc...
In the end only Blend2D, Blaze, and now Vello can use multiple threads on CPU, so finally CPU vs GPU comparisons can be made more fairly - and power draw is definitely a nice property of a benchmark. BTW Blend2D was probably the first library to offer multi-threaded rendering on CPU (just an option to pass to the rendering context, same API).
As far as I know, nobody has done a good benchmark of CPU vs GPU 2D renderers - it's very hard to do a completely unbiased comparison, and you would be surprised how good the CPU is in this mix. Modern CPU cores consume maybe a few watts and you can render to a 4K framebuffer with that single CPU core. Add text rendering to the mix and the numbers would start to be very interesting. Also GPU memory allocation should be included, because rendering fonts on GPU means pre-processing them as well, etc...
2D is just very hard. On CPU and GPU you would be solving slightly different problems, but doing it right is an insane amount of work, research, and experimentation.
On my Apple M1 Pro, the Vello CPU renderer is competitive with the GPU renderers on simple scenes, but falls behind on more complex ones, and especially seems to struggle with large raster images. This is also without a glyph cache (so re-rasterizing every glyph every time, although there is a hinting cache), which isn't implemented yet. This is dependent on multi-threading being enabled and can consume largish portions of all-core CPU while it runs. Skia raster (CPU) gets similarish numbers, which is quite impressive if that is single-threaded.
I think Vello CPU would always struggle with raster images, because it does a bounds check for every pixel fetched from a source image. They have at least described this behavior somewhere in Vello PRs.
The obsession with memory safety just doesn't pay off in some cases - if you can batch 64 pixels at once with SIMD, it just cannot be compared to a per-pixel processor that has a branch in its path.
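The difference can be sketched as follows, in plain Python rather than SIMD and not taken from Vello or Blend2D: the first function tests every pixel fetch, the second clips the span once and then copies the valid run as a batch. The function names and the fallback pixel value are assumptions for illustration.

```python
def fetch_span_per_pixel(src_row, x0, count, fallback=0):
    # One bounds check (a branch) for every pixel fetched.
    out = []
    for x in range(x0, x0 + count):
        if 0 <= x < len(src_row):
            out.append(src_row[x])
        else:
            out.append(fallback)
    return out


def fetch_span_batched(src_row, x0, count, fallback=0):
    # Clip the span once, then copy the in-bounds run without per-pixel checks.
    start = max(x0, 0)
    end = min(x0 + count, len(src_row))
    if start >= end:
        return [fallback] * count
    left_pad = start - x0
    right_pad = (x0 + count) - end
    return [fallback] * left_pad + list(src_row[start:end]) + [fallback] * right_pad


row = list(range(100, 108))  # 8 source pixels
assert fetch_span_per_pixel(row, -2, 12) == fetch_span_batched(row, -2, 12)
```

Both functions stay within bounds; the batched one simply pays for the check once per span instead of once per pixel.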
It's an argument you can make in any performance effort. But I think the "let's save power using GPUs" ship sailed even before Microsoft started buying nuclear reactors to power them.
Blend2D doesn't benchmark against GPU renderers - the benchmarking page compares CPU renderers. I have seen comparisons in the past, but it's pretty difficult to do a good CPU vs GPU benchmark.
Then paint to a regular buffer and do a memcpy to the framebuffer that has no cache at the end of each frame, possibly only copying a region/tiles you want to update.
All the libraries that exist are designed to work like this.
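A sketch of that pattern, with a `bytearray` standing in for the uncached framebuffer. The sizes, the one-byte-per-pixel format, and the function name are assumptions for illustration: all drawing happens in ordinary memory, and only the rows of the dirty rectangle are copied at the end of the frame.

```python
WIDTH, HEIGHT = 64, 32  # hypothetical size, 1 byte per pixel

back_buffer = bytearray(WIDTH * HEIGHT)  # regular cached memory: draw here
framebuffer = bytearray(WIDTH * HEIGHT)  # stand-in for the uncached framebuffer


def flush_region(dst, src, x, y, w, h, stride=WIDTH):
    # Copy only the dirty rectangle, one contiguous row slice at a time.
    for row in range(y, y + h):
        start = row * stride + x
        dst[start:start + w] = src[start:start + w]


# Draw a 4x3 rectangle into the back buffer.
for row in range(10, 13):
    for col in range(8, 12):
        back_buffer[row * WIDTH + col] = 255

# End of frame: push just the region that changed.
flush_region(framebuffer, back_buffer, x=8, y=10, w=4, h=3)
assert framebuffer == back_buffer
```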
It's not. SSA and an optimizing pipeline were never the goal of AsmJit, actually. You emit your SIMD code as you want it and no optimizer or other transformations mess with it - that's the goal, and it works great for use-cases that don't need an additional optimizing pipeline.
However, it can do the mentioned cross-compilation. AsmJit is not dependent on host architecture in any way - you can generate AArch64 on X86 and vice versa. It's of course more optimized for JIT so it offers many tools to help with creating your own lightweight JIT compilers and running the code you generate.
So, no, it's not an LLVM alternative, but it's also not a trivial assembling engine. It has a unique position as it optimizes for low-latency code generation, which LLVM doesn't.
Interesting, thanks. I've seen that there is a kind of IR; does it support the same IR code to run on different targets? Is SIMD part of this abstraction?
Blend2D has C-API and no dependencies - it doesn't even need a C++ standard library - so generally it's not an issue to build it and use it anywhere.
There is a different problem though. While many people working on Vello are paid full time, Blend2D lacks funding and what you see today was developed independently. So, the development is super slow and that's the reason that Blend2D will most likely never have the features other libraries have.