Right now, some things are somewhat hard-coded to be Cloudflare compatible. If you're willing to dig into the code a little, you can deploy this without Cloudflare.
In future releases, I'll make it possible to host it on VPCs and release a Dockerfile along with it, so that should help a little.
I couldn't find the right words to describe this in comparison to something like GitHub Gist. I suppose "own-your-data", since the D1 db generated is yours completely.
Happy to change the branding to be more reflective of this!
Thank you so much for giving Chonkie a chance! Just to note, Chonkie is still in beta (currently at v0.1.2), with a bunch of things planned for it. It's an initial working version that seemed promising enough to present.
I hope that you will stick with Chonkie for the journey of making the 'perfect' chunking library!
I don't fully understand what you mean by "maximum length truncation of the string", but if you're talking about splitting the text into 'chunks' whose token counts are less than a pre-specified max_token length, then yes!
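Roughly, the usage looks like this (a sketch; see the README for the exact current signature):

    from chonkie import TokenChunker

    long_text = "Chonkie splits text into token-bounded chunks. " * 200

    # chunk_size is the pre-specified max_token length per chunk
    chunker = TokenChunker(chunk_size=512)
    for chunk in chunker.chunk(long_text):
        assert chunk.token_count <= 512  # every chunk respects the limit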
I'm not sure if this is what they mean, but this is a use case that I have dealt with and had to roll my own code for:
Given a list of sentences, find the largest in-order group of sentences that fits within a max token length while retaining natural coherence.
In my case I used a fuzzy token limit: the chunker would choose a smaller group of sentences that fit into a single paragraph or a single common structure instead of cramming in every possible sentence until it ran out of room. Likewise, it would go over the limit when doing so was beneficial.
A simple example: with an alphabetized set, instead of making one chunk that ran from the A items partway into the B items, it would end at the A items with tokens to spare; or, if it only took an extra 10%, it would finish the B items. Most of the time it just used paragraph ends to close chunks instead of continuing into the middle of the next one.
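To make that concrete, here's a rough sketch of the fuzzy-limit packing. All names here are hypothetical (`coherent_chunks`, `count_tokens`, `ends_paragraph` are placeholders for whatever suits your data), not anyone's actual library code:

    def coherent_chunks(sentences, count_tokens, max_tokens, slack=0.10,
                        ends_paragraph=lambda s: s.endswith("\n")):
        """Pack sentences in order under a fuzzy token limit, preferring
        to end chunks at paragraph boundaries. Rough sketch only."""
        chunks, current, used, last_break = [], [], 0, 0
        for sent in sentences:
            n = count_tokens(sent)
            if current and used + n > max_tokens:
                # Overshoot a little (e.g. 10%) if this sentence would
                # finish off a paragraph anyway.
                if used + n <= max_tokens * (1 + slack) and ends_paragraph(sent):
                    current.append(sent)
                    chunks.append(current)
                    current, used, last_break = [], 0, 0
                    continue
                # Otherwise cut at the last paragraph boundary, leaving
                # tokens to spare, rather than splitting mid-paragraph.
                cut = last_break or len(current)
                chunks.append(current[:cut])
                current = current[cut:]
                used = sum(count_tokens(s) for s in current)
                last_break = 0
            current.append(sent)
            used += n
            if ends_paragraph(sent):
                last_break = len(current)
        if current:
            chunks.append(current)
        return chunks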
TokenChunking is really limited by the tokenizer and less by the chunking algorithm. Tiktoken tokenizers seem to do better with warm-up, which Chonkie enables by default -- and which is also what the 2nd-place library is using.
Algorithmically, there's not much difference in TokenChunking between Chonkie and LangChain or any other TokenChunking implementation you might want to use (except LlamaIndex; I don't know what mess they made to end up with a 33x slower algorithm).
If you only want TokenChunking (which I don't entirely recommend), write your own for production rather than using Chonkie or LangChain :) At the very least, don't install 80MiB packages just for TokenChunking -- Chonkie is 4x smaller than they are.
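If you go that route, the core of it is only a few lines with tiktoken (a minimal sketch, fixed-size with overlap; `token_chunks` is a made-up name):

    import tiktoken

    def token_chunks(text, max_tokens=512, overlap=64):
        """Fixed-size token chunking with overlap, using tiktoken."""
        enc = tiktoken.get_encoding("gpt2")
        tokens = enc.encode(text)
        step = max_tokens - overlap  # assumes overlap < max_tokens
        # Note: decoding raw token slices can split words at chunk edges.
        return [enc.decode(tokens[i:i + max_tokens])
                for i in range(0, len(tokens), step)]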
That's just my honest response... And these benchmarks are just the beginning; future optimizations to SemanticChunking should push the speed-up over the current 2nd-place library (2.5x right now) even higher.
Just to clarify, the 21MB is the size of the package itself! The other packages are way larger.
Memory footprint of the chunking itself would vary widely based on the dataset, and it's not something we've tested... usually other providers don't test it either, as long as it doesn't bust up the computer/server.
If saving memory during runtime is important for your application, let me know! I'd run some benchmarks for it...
Right now, we haven't worked on adding support for code -- things like comments (#, //) contain punctuation that adversely affects chunking, along with indentation and other issues.
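For example, a naive sentence splitter happily cuts inside a comment and eats the indentation:

    import re

    snippet = '''def add(a, b):
        # Returns the sum. Handles ints only.
        return a + b'''

    # Splitting on sentence-ending punctuation cuts the comment in half
    # and strips the indentation off `return a + b`.
    print(re.split(r"(?<=[.!?])\s+", snippet))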
Thanks for checking the project out!