Hacker News

Thanks for the reply and clarification! That’s an impressive throughput. Am I interpreting the behavior correctly to think that: your code has one branch per int and that branch will be predicted well when your integers are all of similar magnitude?

I never fully grokked the Zipf distribution, but am I correct to conclude that the worst case for vu128 would be alternating integers on either side of the branch condition?

Thanks for the link to the slides, also clever!



  > Am I interpreting the behavior correctly to think that: your code has
  > one branch per int and that branch will be predicted well when your
  > integers are all of similar magnitude?
Yep, that's generally correct. Branch misprediction seems to be harmless either way -- interleaving values on either side of the branch doesn't noticeably affect throughput on my machine (results may vary).

  > I never fully grokked the Zipf distribution but am I correct to conclude
  > the worst case for vu128 would be alternating integers on either side of
  > the branch condition?
It depends on whether you're measuring throughput by values/sec or bytes/sec.

In terms of bytes/sec, the worst case is a set of integers exclusively in the range [0xF0, 0xFF] -- big enough to skip the fast path, small enough that each byte has to go through the full bit-masking.

In terms of values/sec, the worst case is very large integers, because writing 8 bytes (or 16, for `encode_u128`) to a buffer takes longer than writing one byte.

However, regardless of measurement, the worst case for vu128 is faster than VLQ/LEB128 -- except for distributions containing only values in [0x00, 0x7F], in which case they're all equivalent.
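To make the fast and slow paths concrete, here's a rough sketch of a vu128-style encoder. This is an assumed layout for illustration only, not the actual implementation: values below 0xF0 get a single byte, and everything else gets a (0xF0 | byte-length) prefix followed by the value's little-endian bytes.

```rust
// Hypothetical sketch, not the real vu128 code.
fn encode_u64_sketch(buf: &mut Vec<u8>, value: u64) {
    if value < 0xF0 {
        // fast path: one byte, one branch
        buf.push(value as u8);
    } else {
        // slow path: prefix byte encodes the byte length,
        // followed by the value's bytes in little-endian order
        let len = 8 - (value.leading_zeros() as usize / 8);
        buf.push(0xF0 | len as u8);
        buf.extend_from_slice(&value.to_le_bytes()[..len]);
    }
}
```

Under this assumed layout, values in [0xF0, 0xFF] are exactly the ones that miss the fast path yet still only carry one payload byte, which matches the bytes/sec worst case described above.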

---

Also, after some comments on Lobsters about cases where space efficiency in the u16 range really matters, I'm experimenting with a version that tries to match or exceed VLQ/LEB128 for size efficiency at all bit lengths.

It has three more branches, but in practice this hasn't turned out to matter much -- indefinite loops seem to be slower than a fixed set of branches, so it's still significantly faster than VLQ/LEB128 on most sets of integers.
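For contrast, the VLQ/LEB128 encoding being compared against is a data-dependent loop over 7-bit groups, with a continuation bit per byte -- a rough sketch (function name is hypothetical):

```rust
// Standard LEB128/VLQ encoding: 7 payload bits per byte, high
// bit set on every byte except the last. The loop count depends
// on the value's magnitude.
fn encode_leb128(buf: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            buf.push(byte); // final byte: continuation bit clear
            break;
        }
        buf.push(byte | 0x80); // more bytes follow
    }
}
```

The data-dependent loop count is what makes this harder for the CPU than a short, fixed chain of branches.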

  // like LEB128 but continuation bits are packed into the
  // leading byte -- sometimes called "prefixed varint"
  
  0xxxxxxx                            //  7 bits of payload
  10xxxxxx xxxxxxxx                   // 14 bits of payload
  110xxxxx xxxxxxxx xxxxxxxx          // 21 bits of payload
  1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // 28 bits of payload

  // For payloads > 28 bits, use (0xF0 | length) type prefix,
  // same as original post.
  1111____  xxxxxxxx xxxxxxxx [...]
It seems to be more fussy about compiler optimizations, though: https://github.com/rust-lang/rust/issues/125543
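For illustration, a minimal sketch of an encoder for the prefixed-varint layout above. Payload byte order is assumed big-endian (high bits in the leading byte, as the diagram suggests), and the function name is hypothetical:

```rust
// Prefixed varint: continuation bits packed into the leading byte.
fn encode_prefixed(buf: &mut Vec<u8>, value: u32) {
    if value < 1 << 7 {
        buf.push(value as u8);                       // 0xxxxxxx
    } else if value < 1 << 14 {
        buf.push(0b1000_0000 | (value >> 8) as u8);  // 10xxxxxx
        buf.push(value as u8);
    } else if value < 1 << 21 {
        buf.push(0b1100_0000 | (value >> 16) as u8); // 110xxxxx
        buf.push((value >> 8) as u8);
        buf.push(value as u8);
    } else if value < 1 << 28 {
        buf.push(0b1110_0000 | (value >> 24) as u8); // 1110xxxx
        buf.push((value >> 16) as u8);
        buf.push((value >> 8) as u8);
        buf.push(value as u8);
    } else {
        // payloads > 28 bits use a (0xF0 | length) prefix, omitted here
        unimplemented!()
    }
}
```

Unlike the LEB128 loop, the branch count here is fixed, and each arm writes a known number of bytes.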



