*Optimizing uint64_t Digit Counting: A New Method that Beats Lemire's by Up to 27%*
In the quest to improve the performance of my high-speed JSON library, JSONIFIER, I recently stumbled upon a breakthrough in optimizing the calculation of digit counts for `uint64_t` values. Lemire's method of using the `lzcnt` instruction to determine the number of digits in a 32-bit unsigned integer is widely regarded as efficient, but I've developed a new method for 64-bit unsigned integers (`uint64_t`) that achieves even faster results, with significant gains across compilers and platforms.
### The Existing Method: Lemire’s Approach
Lemire’s method, known for its efficiency, calculates the number of digits in a `uint32_t` by leveraging the `lzcnt` instruction, which counts the leading zero bits of a value; subtracting that count from 31 gives the index of the most significant set bit. This is combined with a static lookup table that maps the result to the corresponding number of digits.
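Here’s the code for Lemire’s method:

```cpp
// Index of the most significant set bit; `x | 1` keeps the lzcnt input nonzero.
JSONIFIER_INLINE int int_log2(uint32_t x) {
    return 31 - simd_internal::lzcnt(x | 1);
}

// Each (elided) table entry packs the digit count into the upper 32 bits,
// biased so that adding x carries past bit 32 exactly when x reaches the
// next power of ten.
JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
    static uint64_t table[] = { ... };
    return (x + table[int_log2(x)]) >> 32;
}
```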
While this approach works well for 32-bit integers, the need for a faster solution for `uint64_t` led me to create an alternative method: a more streamlined approach that replaces the packed 64-bit lookup table with two small, simpler tables.
### My New Method: RTC-64-Bit Digit Counting
I’ve designed a new approach for 64-bit unsigned integers that uses similar logic but optimizes the process by storing a precomputed digit count for each possible leading-zero count and applying a single threshold check. The result is faster execution with reduced computational overhead.
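Here's the code for the new method:

```cpp
// Precomputed digit counts, one per possible lzcnt result (contents elided).
JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };

// Largest value representable in a given number of digits (contents elided).
JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };

JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) {
    // Provisional count from the leading-zero count alone...
    const uint64_t originalDigitCount{ digitCounts[simd_internal::lzcnt(inputValue)] };
    // ...plus a branchless +1 when the value crosses the next power of ten.
    return originalDigitCount + static_cast<uint64_t>(inputValue > digitCountThresholds[originalDigitCount]);
}
```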
This method uses a static array holding the precomputed digit count for each possible leading-zero count, and a second array of threshold values that pin down the exact number of digits in a `uint64_t`. The key observation is that all values sharing a leading-zero count span at most two decimal lengths, so a single comparison against the right threshold resolves the ambiguity; the whole computation is two loads, an add, and a compare, with no branches.
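For reference, here's a minimal sketch of how such tables could be generated at compile time. jsonifier hard-codes its tables, so the sizes and helper names below are my own, inferred from the indexing in `fastDigitCount` above:

```cpp
#include <array>
#include <cstdint>

// Number of decimal digits in v, computed the slow way (compile time only).
constexpr int decimalDigits(uint64_t v) {
    int digits = 1;
    while (v >= 10) { v /= 10; ++digits; }
    return digits;
}

// digitCounts[lz] = digit count of the smallest value whose lzcnt is lz.
constexpr std::array<uint8_t, 65> makeDigitCounts() {
    std::array<uint8_t, 65> counts{};
    for (int lz = 0; lz < 64; ++lz) {
        counts[lz] = static_cast<uint8_t>(decimalDigits(uint64_t{1} << (63 - lz)));
    }
    counts[64] = 1; // lzcnt(0) == 64; zero still prints as one digit
    return counts;
}

// digitCountThresholds[d] = 10^d - 1, the largest value with at most d digits.
constexpr std::array<uint64_t, 20> makeDigitCountThresholds() {
    std::array<uint64_t, 20> thresholds{};
    uint64_t powerOfTen = 1;
    for (int d = 0; d < 20; ++d) {
        thresholds[d] = powerOfTen - 1;
        if (d < 19) powerOfTen *= 10; // 10^19 fits in uint64_t; 10^20 would not
    }
    return thresholds;
}
```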
### [Benchmark Results](https://github.com/RealTimeChris/BenchmarkSuite/blob/digit-c...)
I ran performance benchmarks comparing my new RTC-64-bit method with Lemire’s approach and the traditional `log10` method across various platforms and compilers. The results were consistently impressive:
#### GCC/Ubuntu:
- *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*.
- *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.
#### Clang/Ubuntu:
- *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*.
- *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.
#### MSVC/Windows:
- *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*.
- *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.
#### Clang/MacOS:
- *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*.
- *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.
### Key Takeaways
The RTC-64-bit method not only delivers improved performance on modern hardware but also reduces overhead by swapping the packed 64-bit lookup table for two small, cache-friendly arrays. This is especially beneficial in high-performance applications where every cycle counts, such as JSON serialization and parsing.
FYI, Hacker's Delight uses a slight variation of Lemire's. He multiplies by 19/64 (so the divisor is a shift) and has the highest integer at the end of his table (i.e. 2^64 - 1). He also uses a shift instead of a conditional, so his code is branchless.
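For reference, a sketch of that variant adapted to 64 bits; the book gives a 32-bit `ilog10`, so the 64-bit table and names here are my own reconstruction:

```cpp
#include <bit>      // std::countl_zero (C++20), standing in for lzcnt
#include <cstdint>

// Branchless digit count in the Hacker's Delight style described above.
inline int digitCountHD(uint64_t x) {
    static const uint64_t table[] = {
        0ULL, 9ULL, 99ULL, 999ULL, 9999ULL, 99999ULL, 999999ULL,
        9999999ULL, 99999999ULL, 999999999ULL, 9999999999ULL,
        99999999999ULL, 999999999999ULL, 9999999999999ULL,
        99999999999999ULL, 999999999999999ULL, 9999999999999999ULL,
        99999999999999999ULL, 999999999999999999ULL, 9999999999999999999ULL,
        UINT64_MAX // highest integer at the end, so the correction never overshoots
    };
    int y = (19 * (63 - std::countl_zero(x | 1))) >> 6; // 19/64 ~= log10(2)
    y += static_cast<int>((table[y + 1] - x) >> 63);    // borrow bit replaces the branch
    return y + 1; // y is floor(log10(x)); digit count is y + 1
}
```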
Generally I prefer this kind of approach. But I suspect your digit counting is a very hot path, so your tables will generally be in cache. So your approach will likely win anyway.
Very nice work. But could you please explain why counting digits quickly (or even slowly) is useful for a JSON serializer? This is lower level than I'm used to working at. Is this so you can alloc the right amount of memory, or something else?
Yes. And for string formatting in general, such as sprintf calculating a buffer size. Also for purposes of calculating length for string padding and alignment.
I think it is the case: first allocate the space, then write.
However, I am not sure how significant the gain is here. Actually printing the number requires a division for every digit (or every 2 or 3 digits with tricks), which I believe is much slower than even naive digit counting, and you will have to do it eventually.
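For what it's worth, a digit count also tells the writer where the number ends, so digits can be written backwards into their final position. A minimal sketch of that pattern (names are illustrative, not jsonifier's actual API):

```cpp
#include <cstdint>

// Write `value` right-to-left into a span whose length came from a digit
// count, returning the position just past the number. One division per
// digit, as noted above.
inline char* writeUint64(char* out, uint64_t value, uint64_t digits) {
    char* end = out + digits;
    char* pos = end;
    do {
        *--pos = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value != 0);
    return end;
}
```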
Maybe it is worth it if there is parallelism involved: write the JSON with blank spaces for the numbers, then have another thread write the numbers. Writing the numbers is computationally expensive because of the divisions, so if you have a lot of large numbers, it may be worth dedicating some cores to this task.
Because JSON is not well suited as a fast machine exchange format, and people are trying to make it work anyway.
If you ask me, JSON serializers should be simple and slow, and if you need speed, change the wire format to something intrinsically fast.
I understand that this is not the approach people will take; most developers prefer a simple interface with a complex implementation over a complex interface with a simple implementation. But my opinion is that implementation simplicity is what matters in the end.
Another way of seeing this: whatever way people are using JSON, reducing the execution time (a good enough proxy for energy consumption) across the whole fleet of machines running the parser (or serializer) is worthwhile. Especially if it has zero impact on the code calling this parser/serializer (so no second-order energy spent because of the change breaking some API or contract).
Honestly? Most of the time you can use an approximation. At worst you allocate a single extra byte, unless your integers are hundreds of bits long (I worked this out once but I forget where I left the details). Just multiply by the ratio of the logs, or something like that.
I'd assume that the serializer is writing directly into the output buffer, so you'd have to shift everything left by one character if you overallocate. With the added checks for this, it might be faster to compute it precisely the first time.
I mean the constant logs of the bases; you get to hard-code this in the algorithm.
One interesting coincidence is that log₁₀(2) ~= log₂(10) - 3, thus the 83 ~= 78 below.
If you use an 8-bit shift, depending on which direction you're going, you multiply the input length by 851 (which is 3*256 + 83, for digits to bits) or 78 (for bits to digits), then shift right by 8. This starts being wasteful at 23 bits, but is pretty cheap on 8-bit CPUs.
For a more accurate approximation (slackening at 289 bits), multiply by 217706 (which is 3<<16 + 21098) or 19729, then >> 16.
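A sketch of those shift-based estimates as I read them (the round-up constants are my own addition; the parent doesn't specify a rounding mode):

```cpp
#include <cstdint>

// Upper bound on decimal digits for a value `bits` bits wide.
// 78/256 ~= log10(2); the +255 rounds up so the estimate never undershoots.
constexpr uint32_t maxDigitsForBits(uint32_t bits) {
    return (bits * 78 + 255) >> 8;
}

// Upper bound on bits for a `digits`-digit decimal number.
// 851/256 = 3 + 83/256 ~= log2(10).
constexpr uint32_t maxBitsForDigits(uint32_t digits) {
    return (digits * 851 + 255) >> 8;
}

// The more accurate 16-bit-shift variants mentioned above.
constexpr uint32_t maxDigitsForBits16(uint32_t bits) {
    return (bits * 19729 + 65535) >> 16;
}
constexpr uint32_t maxBitsForDigits16(uint32_t digits) {
    return (digits * 217706 + 65535) >> 16;
}
```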
Here is a JSON parser/serializer that I've been working on for about a year or so now. There are benchmarks available at https://github.com/RealTimeChris/Json-Performance comparing it to the other two most performant C++ parsers/serializers. Let me know what you think! Cheers!
Congrats. Will check it out. What is special about your parser? Any high-level design/architecture decisions that you can share? From your graphs, it seems to perform much better than simdjson, which seems to be the state of the art in C++ land.
Hey everyone - so I've taken simdjson's algorithm and improved upon it by resaturating the CPU registers with data instead of operating on only 64 bytes' worth of string data at a time, as well as incorporating some compile-time hash maps for the keys of the data/memory locations being parsed. Let me know what you think! Also, here are some benchmarks: https://github.com/RealTimeChris/Json-Performance