*Optimizing uint64_t Digit Counting: A New Method that Beats Lemire's by Up to 27%*
In the quest to improve the performance of my high-speed JSON library, JSONIFIER, I recently stumbled upon a breakthrough in optimizing the calculation of digit counts for `uint64_t` values. Lemire's method of using the `lzcnt` instruction to determine the number of digits in a 32-bit unsigned integer is widely regarded as efficient, but I've developed a new method for 64-bit unsigned integers (`uint64_t`) that achieves even faster results, with significant gains across compilers and platforms.
### The Existing Method: Lemire’s Approach
Lemire’s method, known for its efficiency, calculates the number of digits in a `uint32_t` by leveraging the `lzcnt` instruction, which counts the leading zero bits of a value; subtracting that count from 31 gives the index of the most significant set bit. This is combined with a static lookup table that maps the result to the corresponding number of digits.
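Here’s the code for Lemire’s method:

```cpp
// Index of the most significant set bit; `x | 1` keeps the lzcnt input nonzero.
JSONIFIER_INLINE int int_log2(uint32_t x) {
    return 31 - simd_internal::lzcnt(x | 1);
}

// Each (elided) table entry packs the digit count into the upper 32 bits,
// biased so that adding x carries past bit 32 exactly when x reaches the
// next power of ten.
JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
    static uint64_t table[] = { ... };
    return (x + table[int_log2(x)]) >> 32;
}
```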
While this approach works well for 32-bit integers, the need for a faster solution for `uint64_t` led me to create an alternative method: a more streamlined approach that replaces the packed 64-bit lookup table with two small, simpler tables.
### My New Method: RTC-64-Bit Digit Counting
I’ve designed a new approach for 64-bit unsigned integers that uses similar logic but optimizes the process by storing a precomputed digit count for each possible leading-zero count and applying a single threshold check. The result is faster execution with reduced computational overhead.
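Here's the code for the new method:

```cpp
// Precomputed digit counts, one per possible lzcnt result (contents elided).
JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };

// Largest value representable in a given number of digits (contents elided).
JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };

JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) {
    // Provisional count from the leading-zero count alone...
    const uint64_t originalDigitCount{ digitCounts[simd_internal::lzcnt(inputValue)] };
    // ...plus a branchless +1 when the value crosses the next power of ten.
    return originalDigitCount + static_cast<uint64_t>(inputValue > digitCountThresholds[originalDigitCount]);
}
```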
This method uses a static array holding the precomputed digit count for each possible leading-zero count, and a second array of threshold values that pin down the exact number of digits in a `uint64_t`. The key observation is that all values sharing a leading-zero count span at most two decimal lengths, so a single comparison against the right threshold resolves the ambiguity; the whole computation is two loads, an add, and a compare, with no branches.
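For reference, here's a minimal sketch of how such tables could be generated at compile time. jsonifier hard-codes its tables, so the sizes and helper names below are my own, inferred from the indexing in `fastDigitCount` above:

```cpp
#include <array>
#include <cstdint>

// Number of decimal digits in v, computed the slow way (compile time only).
constexpr int decimalDigits(uint64_t v) {
    int digits = 1;
    while (v >= 10) { v /= 10; ++digits; }
    return digits;
}

// digitCounts[lz] = digit count of the smallest value whose lzcnt is lz.
constexpr std::array<uint8_t, 65> makeDigitCounts() {
    std::array<uint8_t, 65> counts{};
    for (int lz = 0; lz < 64; ++lz) {
        counts[lz] = static_cast<uint8_t>(decimalDigits(uint64_t{1} << (63 - lz)));
    }
    counts[64] = 1; // lzcnt(0) == 64; zero still prints as one digit
    return counts;
}

// digitCountThresholds[d] = 10^d - 1, the largest value with at most d digits.
constexpr std::array<uint64_t, 20> makeDigitCountThresholds() {
    std::array<uint64_t, 20> thresholds{};
    uint64_t powerOfTen = 1;
    for (int d = 0; d < 20; ++d) {
        thresholds[d] = powerOfTen - 1;
        if (d < 19) powerOfTen *= 10; // 10^19 fits in uint64_t; 10^20 would not
    }
    return thresholds;
}
```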
### [Benchmark Results](https://github.com/RealTimeChris/BenchmarkSuite/blob/digit-c...)
I ran performance benchmarks comparing my new RTC-64-bit method with Lemire’s approach and the traditional `log10` method across various platforms and compilers. The results were consistently impressive:
#### GCC/Ubuntu:
- *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*.
- *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.
#### Clang/Ubuntu:
- *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*.
- *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.
#### MSVC/Windows:
- *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*.
- *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.
#### Clang/MacOS:
- *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*.
- *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.
### Key Takeaways
The RTC-64-bit method not only delivers improved performance on modern hardware but also reduces overhead by swapping the packed 64-bit lookup table for two small, cache-friendly arrays. This is especially beneficial in high-performance applications where every cycle counts, such as JSON serialization and parsing.
FYI, Hacker's Delight uses a slight variation of Lemire's. He multiplies by 19/64 (so the divisor is a shift) and has the highest integer at the end of his table (i.e. 2^64 - 1). He also uses a shift instead of a conditional, so his code is branchless.
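For reference, a sketch of that variant adapted to 64 bits; the book gives a 32-bit `ilog10`, so the 64-bit table and names here are my own reconstruction:

```cpp
#include <bit>      // std::countl_zero (C++20), standing in for lzcnt
#include <cstdint>

// Branchless digit count in the Hacker's Delight style described above.
inline int digitCountHD(uint64_t x) {
    static const uint64_t table[] = {
        0ULL, 9ULL, 99ULL, 999ULL, 9999ULL, 99999ULL, 999999ULL,
        9999999ULL, 99999999ULL, 999999999ULL, 9999999999ULL,
        99999999999ULL, 999999999999ULL, 9999999999999ULL,
        99999999999999ULL, 999999999999999ULL, 9999999999999999ULL,
        99999999999999999ULL, 999999999999999999ULL, 9999999999999999999ULL,
        UINT64_MAX // highest integer at the end, so the correction never overshoots
    };
    int y = (19 * (63 - std::countl_zero(x | 1))) >> 6; // 19/64 ~= log10(2)
    y += static_cast<int>((table[y + 1] - x) >> 63);    // borrow bit replaces the branch
    return y + 1; // y is floor(log10(x)); digit count is y + 1
}
```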
Generally I prefer this kind of approach. But I suspect your digit counting is a very hot path, so your tables will generally be in cache. So your approach will likely win anyway.
Very nice work. But could you please explain why counting digits quickly (or even slowly) is useful for a JSON serializer? This is lower level than I'm used to working at. Is this so you can alloc the right amount of memory, or something else?
Yes. And for string formatting in general, such as sprintf calculating a buffer size. Also for purposes of calculating length for string padding and alignment.
I think it is the case: first allocate the space, then write.
However, I am not sure how significant the gain is here. Actually printing the number requires a division for every digit (or every 2 or 3 digits with tricks), which I believe is much slower than even naive digit counting, and you will have to do it eventually.
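For what it's worth, a digit count also tells the writer where the number ends, so digits can be written backwards into their final position. A minimal sketch of that pattern (names are illustrative, not jsonifier's actual API):

```cpp
#include <cstdint>

// Write `value` right-to-left into a span whose length came from a digit
// count, returning the position just past the number. One division per
// digit, as noted above.
inline char* writeUint64(char* out, uint64_t value, uint64_t digits) {
    char* end = out + digits;
    char* pos = end;
    do {
        *--pos = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value != 0);
    return end;
}
```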
Maybe it is worth it if there is parallelism involved: write the JSON with blank spaces for the numbers, then have another thread write the numbers. Writing the numbers is computationally expensive because of the divisions, so if you have a lot of large numbers, it may be worth dedicating some cores to this task.
Because JSON is not well suited as a fast machine exchange format, and people are trying to make it work anyway.
If you ask me, JSON serializers should be simple and slow, and if you need speed, change the wire format to something intrinsically fast.
I understand that this is not the approach people will take; most developers prefer a simple interface with a complex implementation over a complex interface with a simple implementation. But my opinion is that implementation simplicity is what matters in the end.
Another way of seeing this: whatever way people are using JSON, reducing the execution time (a good enough proxy for energy consumption) across the whole fleet of machines running the parser (or serializer) is worthwhile. Especially if it has zero impact on the code calling this parser/serializer (so no second-order energy spent because of the change breaking some API or contract).
Honestly? Most of the time you can use an approximation. At worst you allocate a single extra byte, unless your integers are hundreds of bits long (I worked this out once but I forget where I left the details). Just multiply by the ratio of the logs, or something like that.
I'd assume that the serializer is writing directly into the output buffer, so you'd have to shift everything left by one character if you overallocate. With the added checks for this, it might be faster to compute it precisely the first time.
I mean the constant logs of the bases; you get to hard-code this in the algorithm.
One interesting coincidence is that log₁₀(2) ~= log₂(10) - 3, thus the 83 ~= 78 below.
If you use an 8-bit shift, depending on which direction you're going, you multiply the input length by 851 (which is 3*256 + 83, for digits to bits) or 78 (for bits to digits), then shift right by 8. This starts being wasteful at 23 bits, but is pretty cheap on 8-bit CPUs.
For a more accurate approximation (slackening at 289 bits), multiply by 217706 (which is 3<<16 + 21098) or 19729, then >> 16.
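A sketch of those shift-based estimates as I read them (the round-up constants are my own addition; the parent doesn't specify a rounding mode):

```cpp
#include <cstdint>

// Upper bound on decimal digits for a value `bits` bits wide.
// 78/256 ~= log10(2); the +255 rounds up so the estimate never undershoots.
constexpr uint32_t maxDigitsForBits(uint32_t bits) {
    return (bits * 78 + 255) >> 8;
}

// Upper bound on bits for a `digits`-digit decimal number.
// 851/256 = 3 + 83/256 ~= log2(10).
constexpr uint32_t maxBitsForDigits(uint32_t digits) {
    return (digits * 851 + 255) >> 8;
}

// The more accurate 16-bit-shift variants mentioned above.
constexpr uint32_t maxDigitsForBits16(uint32_t bits) {
    return (bits * 19729 + 65535) >> 16;
}
constexpr uint32_t maxBitsForDigits16(uint32_t digits) {
    return (digits * 217706 + 65535) >> 16;
}
```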
Here is a JSON parser/serializer that I've been working on for about a year or so now. There are benchmarks available at https://github.com/RealTimeChris/Json-Performance comparing it to the other two most performant C++ parsers/serializers. Let me know what you think! Cheers!
Congrats. Will check it out. What is special about your parser? Any high-level design/architecture decisions that you can share? From your graphs, it seems to perform much better than simdjson, which seems to be the state of the art in C++ land.
Hey everyone - so I've taken simdjson's algorithm and improved upon it by resaturating the CPU registers with data instead of operating on only 64 bytes' worth of string data at a time, as well as incorporating some compile-time hash maps for the keys of the data/memory locations being parsed. Let me know what you think! Also, here are some benchmarks: https://github.com/RealTimeChris/Json-Performance