xavcochran's comments

xavcochran · on May 14, 2025

We utilize some of LMDB's optimizations such as the APPEND put flags. We also make use of LMDB handling duplicates as a one-to-many key instead of duplicating keys. This means we can get all values for one key in one call rather than a call for each duplicate.
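To illustrate the one-to-many layout, here is a minimal in-memory sketch. It uses only Rust's standard collections and is not LMDB's API or our storage code; `DupTable`, `put`, and `get_all` are hypothetical names chosen for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for a duplicate-sorted table: one key, many sorted values.
/// (Hypothetical type for illustration only.)
#[derive(Default)]
pub struct DupTable {
    pub entries: BTreeMap<u128, BTreeSet<u128>>,
}

impl DupTable {
    /// Add `value` under `key`; the key itself is stored only once.
    pub fn put(&mut self, key: u128, value: u128) {
        self.entries.entry(key).or_default().insert(value);
    }

    /// Return every value stored under `key` with a single lookup.
    pub fn get_all(&self, key: u128) -> Vec<u128> {
        self.entries
            .get(&key)
            .map(|values| values.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

The point is only the access pattern: the key is stored once, and one lookup returns all of its values in sorted order.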

For keys we use UUIDs, specifically the timestamped v6 UUIDs, so that they are lexicographically ordered at creation time. This means keys can be inserted into LMDB using the APPEND flag: LMDB shortcuts to the rightmost leaf in its B-tree (rather than starting at the root) and appends the new record. It can do this because the records are ordered by creation time, meaning each new record is guaranteed to be larger (in big-endian byte order) than the previous record.
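A small sketch of why the byte order matters, using plain u128 values as stand-ins for v6 UUIDs (the helper names `encode_key` and `is_append_safe` are made up for this example). Big-endian encoding keeps numeric order and byte-wise order in agreement, which is what appending at the rightmost leaf relies on.

```rust
/// Encode an id as a fixed-width, big-endian key.
/// (Hypothetical helper; the u128 stands in for a v6 UUID.)
pub fn encode_key(id: u128) -> [u8; 16] {
    id.to_be_bytes()
}

/// True when `keys` is strictly increasing under byte-wise
/// (lexicographic) comparison, i.e. each key sorts after the previous one.
pub fn is_append_safe(keys: &[[u8; 16]]) -> bool {
    keys.windows(2).all(|pair| pair[0] < pair[1])
}
```

With big-endian keys, 255 followed by 256 is append-safe. The same two ids encoded with `to_le_bytes()` are not, because the least significant byte is compared first.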

We also store the UUIDs as u128 values for two reasons. A u128 takes up 16 bytes, whereas a string UUID takes up 36 bytes. This means we store 56% less data, and LMDB has to decode 56% fewer bytes when doing code accesses.
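The 56% figure can be checked with a few lines of Rust. This is only a sketch of the arithmetic; `to_hyphenated` and `percent_saved` are illustrative helpers, not functions from our codebase.

```rust
use std::mem::size_of;

/// Bytes used by the binary form of a UUID.
pub const BINARY_LEN: usize = size_of::<u128>();

/// Render an id in the canonical hyphenated text form (8-4-4-4-12 hex digits).
pub fn to_hyphenated(id: u128) -> String {
    let hex = format!("{:032x}", id);
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}

/// Percentage of bytes saved by storing the binary form instead of the text
/// form, rounded to the nearest integer.
pub fn percent_saved(id: u128) -> u32 {
    let text_len = to_hyphenated(id).len();
    let saved = (text_len - BINARY_LEN) as f64 / text_len as f64;
    (saved * 100.0).round() as u32
}
```

The text form is 32 hex digits plus 4 hyphens, so 36 bytes, and (36 - 16) / 36 is roughly 0.556.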

For the outgoing/incoming edges for nodes, we store them as fixed sizes which means LMDB packs them in, removing the 8 byte header per Key-Value pair.

In the future, we are also going to separate the properties from the stored value as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted if the properties are empty.

You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...

xavcochran · on May 14, 2025

Looking at your benchmarks, you say inserting 1k edges takes around 500,000 ns/iteration. Is this 500,000 ns per edge insertion, or for all 1k of them?
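For context, the two readings differ by a factor of 1,000 in throughput. A quick sketch of the conversion (the function name is just for illustration):

```rust
/// Nanoseconds in one second.
pub const NS_PER_SEC: u64 = 1_000_000_000;

/// Edges inserted per second, given the time for one benchmark iteration
/// and how many edges that iteration inserts.
pub fn edges_per_second(ns_per_iteration: u64, edges_per_iteration: u64) -> u64 {
    edges_per_iteration * NS_PER_SEC / ns_per_iteration
}
```

If the 500,000 ns covers all 1k edges, that is 2,000,000 edges/second. If it is per edge, it is 2,000 edges/second.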

BlooIt · on May 19, 2025

Hello. These benchmarks are a bit outdated, we’re currently updating them this sprint.

The open-source in-memory version loads around 3 million edges/second, while the on-disk version handles about 2 million edges/second with a WAL batch size of 100, and 3 million with no WAL.

xavcochran · on May 14, 2025

thank you! any feedback would be much appreciated

xavcochran · on May 14, 2025

there is also the fact that the more dimensions you have for embedded data the more diluted the embedding becomes so it is unusual to go anywhere near the limits of vector length!

xavcochran · on May 14, 2025

We will definitely look into it. The SPLADE models look promising!

xavcochran · on May 14, 2025

SPALDE*

xavcochran · on May 14, 2025

to add to George's reply, for helix to run in the browser with WASM the storage engine has to be completely in memory. At the moment we use LMDB, which uses file-based storage, so that doesn't work in the browser. As George said, we plan on making our own storage engine, and as part of that we aim to have an in-memory implementation.

hansworst · on May 14, 2025

Not entirely sure if you could use it, but wondering if you’ve heard about the origin private file system feature of modern browsers? https://developer.mozilla.org/en-US/docs/Web/API/File_System...

xavcochran · on May 14, 2025

very interesting, will look into this. I know for a fact that you cannot compile the likes of LMDB and RocksDB to work with WASM but this looks promising for our custom storage engine to be able to make it work with the browser. Thanks for this!

xavcochran · on May 14, 2025

thanks for the question! we chose f64 as a default for now just to cover all cases, and we believed that basic vector operations would not be our bottleneck initially. As we optimize our HNSW implementation, we are going to add support for f32 and binary vectors, and drop Vec<f64/f32> in favor of [f64/f32; {num_dimensions}] to avoid unnecessary heap allocation!

quantike · on May 14, 2025

I appreciate the reply! Yeah, it sounds like the correct path forward is swapping out the type for some enum of the numeric types you want to cover.

I'd be curious if there's some benefit to the runtime-memory utilization to baking in the precision of the vector if it's known at comptime/runtime. In my own usage of vector DBs I've only ever used a single-precision (f32), and often have a single, known dimension. But if Helix is aiming for something more general purpose, then it makes sense to offer the mixing of precision and dimension in the internals.

Cheers

xavcochran · on May 15, 2025

The benefit of baking in the dimension and size of individual elements (the precision) is the fact that the size will be known at compile time meaning it can be allocated on the stack instead of being heap allocated.
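A rough sketch of what that could look like with const generics. This is not our implementation; `FixedVector` and `dot` are hypothetical names, and f32 is picked arbitrarily as the element type.

```rust
/// Fixed-dimension vector: the element type and the length are part of the
/// type, so the elements are stored inline rather than behind a heap pointer.
/// (Hypothetical type for illustration only.)
pub struct FixedVector<const N: usize> {
    pub data: [f32; N],
}

impl<const N: usize> FixedVector<N> {
    /// Dot product of two vectors with the same compile-time dimension.
    pub fn dot(&self, other: &Self) -> f32 {
        self.data
            .iter()
            .zip(other.data.iter())
            .map(|(a, b)| a * b)
            .sum()
    }
}

/// Size in bytes of a `FixedVector<N>`, known at compile time.
pub const fn inline_size<const N: usize>() -> usize {
    std::mem::size_of::<FixedVector<N>>()
}
```

A `FixedVector<4>` is exactly 16 bytes and can live on the stack as a local, whereas a `Vec<f32>` keeps its elements in a separate heap allocation. Mixing dimensions also becomes a compile error, since `FixedVector<4>` and `FixedVector<8>` are different types.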

xavcochran · on May 14, 2025

apart from the fact Cozo seems to be pretty dead, we use a different storage engine which makes our reads much faster. based on their benchmarks I estimate most of our reads to be 10x faster. I think our query language is much simpler and easier to understand than Datalog, which is what they use.

xavcochran · on May 14, 2025

Assuming you are using GPUs for model inference, the best way to set it up would be to have the DB and a separate server to send inference requests to. Note that we plan on supporting custom model endpoints on the database side, so you probably won't need the inference server in the future!

xavcochran · on May 14, 2025

Thanks for the kind words! At the moment the query language transpilation is quite unstable, but we are in the process of a large remodel which we aim to finish in the next day or so. This will make the query language compilation far more robust, and it will return helpful error messages (like the Rust compiler). The other thing is that the core traversals are currently single-threaded, so aggregating huge lists of graph items can take a bit of a hit. Note, however, that we are also implementing parallel LMDB iterators with the help of the meilisearch guys to make aggregation of large results much faster.