Google's Guava Bloom filter implementation is very well tested. I suggest looking at the Guava tests on GitHub. They even use pre-seeded and manually verified filters as controls, as some responses suggested.
Or you could just use Guava's Bloom ;)
As for probabilistic testing of fp rate... The problem is that every once in a while a test will fail.
Disclaimer: I wrote a cuckoo filter library.
If you want to test fp rate check my cuckoo filter test at sanityOverFillFilter() in github.com/MGunlogson/CuckooFilter4J/blob/master/src/test/java/com/github/mgunlogson/cuckoofilter4j/TestCuckooFilter.java
The unit test basically does fuzzing, filling the filter repeatedly in the hope that any errors will eventually surface. The error bounds are pretty large, but small enough to detect any egregious failures. Importantly, my filter defaults to a random seed. Guava DOES NOT, so any tests using the same items will be deterministic. The Guava tests use this property to verify some filters that have been manually determined to be correct.
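For reference, the core of such a probabilistic false-positive test can be sketched like this. The `insert`/`contains` filter API here is hypothetical, not Guava's or any particular library's actual interface:

```python
import random

def measure_fp_rate(filter_factory, capacity=10_000, probes=100_000):
    # Fill a filter with random items, then probe with items that were
    # never inserted and count how many the filter claims to contain.
    f = filter_factory()
    rng = random.Random(42)  # deterministic here; a real fuzz run would reseed
    inserted = {rng.getrandbits(64) for _ in range(capacity)}
    for x in inserted:
        f.insert(x)
    false_positives = 0
    for _ in range(probes):
        x = rng.getrandbits(64)
        if x not in inserted and f.contains(x):
            false_positives += 1
    return false_positives / probes
```

A test would then assert the measured rate stays inside a generous bound, e.g. three times the configured target rate, accepting that a probabilistic test will still fail once in a while.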
Hi there,
I've wrestled with Cuckoo filters myself, so I took a look at your lib, and there are some things that stood out.
Fingerprint size - It allows fingerprints that are too short; below about 4 bits you can't reach a reasonable fill capacity. The paper authors only hinted at this, but check out the fill-capacity graphs on page 6. This could be why your inserts are slowing down around the 80% level, when in my experience it doesn't happen until around 93%.
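One way to pick a sane minimum: the paper gives the asymptotic bound f ≥ log2(1/ε) + log2(2b) bits for bucket size b and target false-positive rate ε. The helper below is my own sketch of that; the 4-bit floor is my reading of the fill-capacity issue, not anything from your library:

```python
import math

def min_fingerprint_bits(target_fp_rate, bucket_size=4):
    # Rough lower bound from the cuckoo filter paper:
    #   f >= log2(1/eps) + log2(2b)
    # Floored at 4 bits, below which fill capacity degrades badly.
    bound = math.log2(1.0 / target_fp_rate) + math.log2(2 * bucket_size)
    return max(4, math.ceil(bound))
```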
Modulo bias - Your filter capacity code doesn't seem to round the filter size to a power of two. This is a simple fix, but without it your array indexes will be skewed by modulo bias, possibly badly if someone picks a nasty number.
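A minimal sketch of the fix, assuming the bucket count is rounded up to a power of two so indexes can be computed with a bitmask instead of modulo (names are my own, not from your code):

```python
def next_power_of_two(n):
    # Round the requested bucket count up to a power of two so that
    # indexing can use a bitmask, which is uniform, instead of modulo,
    # which is biased when the hash range isn't a multiple of n.
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def bucket_index(item_hash, num_buckets):
    # num_buckets must be a power of two for the mask to be uniform.
    return item_hash & (num_buckets - 1)
```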
Alternate bucket positions - Your code seems to do a full hash to calculate alternate bucket positions. I know the paper mentions this, but I haven't seen anyone actually doing it :). It's a lot faster to just XOR with a mixing constant. TBH that's what most libraries are doing... whether it's a good idea is debatable :).
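A sketch of what that looks like. The constant below is Murmur2's multiplicative constant; any odd constant with good bit dispersion should behave similarly:

```python
# Partial-key cuckoo hashing: the alternate bucket is derived from the
# current bucket and the fingerprint alone, no access to the original item.
MIX = 0x5bd1e995  # Murmur2's multiplicative constant

def alt_index(index, fingerprint, num_buckets):
    # num_buckets is assumed to be a power of two so the mask is uniform.
    return (index ^ (fingerprint * MIX)) & (num_buckets - 1)
```

Because XOR is its own inverse, applying `alt_index` twice gets you back to the original bucket, which is what lets kicked items be relocated without knowing which of their two buckets they started in.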
Fingerprint zero corner case - I'm not that great at Python, but I didn't see any special handling for the rare case that the hash fingerprint is zero. In most implementations zero means an empty bucket, so this could violate the "no false negatives" aspect of the filter by making items rarely disappear when they were supposed to be inserted. Most implementations just add one to it, but I prefer to re-hash with a salt.
No victim cache - Didn't look too much into it, but I didn't see a victim slot used in your code. This will cause problems when the first insert fails. The problem is, by the time the first insert actually fails, you've relocated a bunch of different fingerprints like 500 times. It becomes unclear which fingerprint you originally tried to insert, and you're left holding a dangling item from a random place in the filter that you cannot insert. This violates the "no false negatives" mantra. Even though the filter is full it shouldn't break by deleting a random item when the first insert fails. You either need to store this last item or figure out a way to unwind your insert attempts to be able to reject the item that originally failed to insert.
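Here's a rough sketch of an insert loop with a single-item victim slot. The bucket layout and hashing are simplified stand-ins, not any real library's API:

```python
import random

class CuckooSketch:
    # Minimal illustration of insert-with-victim-slot; real filters pack
    # buckets into bits rather than using Python lists.
    MAX_KICKS = 500

    def __init__(self, num_buckets=16, bucket_size=4):
        self.buckets = [[] for _ in range(num_buckets)]
        self.bucket_size = bucket_size
        self.victim = None  # (index, fingerprint) that failed to insert

    def _alt(self, i, fp):
        return (i ^ hash(fp)) % len(self.buckets)

    def insert(self, i, fp):
        if self.victim is not None:
            return False  # filter is effectively full
        for idx in (i, self._alt(i, fp)):
            if len(self.buckets[idx]) < self.bucket_size:
                self.buckets[idx].append(fp)
                return True
        # Both buckets full: relocate ("kick") existing fingerprints.
        idx = i
        for _ in range(self.MAX_KICKS):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[idx][slot] = self.buckets[idx][slot], fp
            idx = self._alt(idx, fp)
            if len(self.buckets[idx]) < self.bucket_size:
                self.buckets[idx].append(fp)
                return True
        # Out of kicks: park the dangling fingerprint in the victim slot
        # instead of dropping it, preserving "no false negatives".
        self.victim = (idx, fp)
        return True

    def contains(self, i, fp):
        if self.victim is not None and self.victim[1] == fp:
            return True
        return fp in self.buckets[i] or fp in self.buckets[self._alt(i, fp)]
```

The key point is the last branch: the fingerprint left over after the kick chain is parked, not discarded, and subsequent inserts are rejected until it finds a home.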
Check out my Java library if you want to see how I dealt with these things. Also I have a bunch of unit tests there that I either came up with or borrowed from other Cuckoo libs. Should be pretty easy to convert some of those to python :) .
One of the authors here. Great points especially on the "No victim cache" aka insertion failure. This was mostly a test implementation for us to see how cuckoo filters work, but agree that it is critical to have these issues fixed.
No problem,
The paper authors left a lot of the specifics out... I got burned by the victim cache too. I saw one in their reference code and thought it was pointless, until weeks later when I got around to writing tests and had one of those aha moments.
The thing with Cuckoo filters is that an item's alternate position is determined solely from its current position and fingerprint.
Mathematically this is done by XOR'ing with a magic mixing constant borrowed from Murmur2 :) (or in my lib, from Murmur3). The quality of the randomness of alternate bucket positions is... probably bad, but in practice it seems to work fine.
Because of this, finding alternate positions doesn't require a rehash per se, just a few CPU instructions.
However, there IS a corner case that requires re-hash that comes up more frequently the shorter the fingerprints used. Conceptually a tag/fingerprint of zero means empty bucket, and occasionally the fingerprint hash for an item will end up being zero. In this case I've seen some libraries just add 1, which I didn't really like since it increases hash collisions slightly. Or you can do it like my lib does and just re-hash the item with a salt.
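A sketch of the rehash-with-salt approach; the hash function and salt scheme here are illustrative choices, not what any particular library uses:

```python
import hashlib

def fingerprint(item, bits=8):
    # Derive a non-zero fingerprint: zero is reserved to mean "empty slot",
    # so if the hash lands on zero, re-hash with an incrementing salt.
    salt = 0
    while True:
        data = item.encode() + salt.to_bytes(4, "big")
        digest = hashlib.sha256(data).digest()
        fp = int.from_bytes(digest[:4], "big") & ((1 << bits) - 1)
        if fp != 0:
            return fp
        salt += 1
```

Unlike the add-one trick, this keeps the fingerprint distribution uniform over the remaining 2^bits - 1 values instead of doubling the probability of one of them.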
Hm, the empty/zero issue is quite interesting. How are buckets usually represented? I mean, if it were just an array of numbers, that problem wouldn't exist. If it's e.g. multiple bit sequences "bit-or-ed" into one integer, couldn't we use one additional bit per fingerprint to indicate whether it's there?
Depends on the language, I guess. The most efficient way to represent buckets is to pack them right next to each other in a giant memory block made of bits. As for the other part of your question about using a flag: your method definitely works, but it spends an extra bit per fingerprint when you only need to reserve a single value (zero).
Using bit flags:
Using zero as empty, 4 bits can represent the numbers 0-15, so 15 values plus empty/zero. With 5 bits you can represent 0-31, so 31 values plus empty/zero. If you're always spending one of your bits on the flag, it cuts the number of distinct fingerprints roughly in half.
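The arithmetic as a tiny helper (names are my own):

```python
def distinct_fingerprints(bits, flag_bit):
    # Zero-as-empty: every bit pattern except zero is a usable fingerprint.
    # Flag bit: one bit marks occupancy, leaving bits - 1 for the value.
    return (1 << (bits - 1)) if flag_bit else (1 << bits) - 1
```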
How the filter is built in memory:
At least in Java, the best way to build the backing data structure is a giant array of longs. Really what you want is just a giant block of memory you can address by bit offset. Unfortunately, modern CPUs can only address memory at byte granularity. Because of this, buckets and fingerprints can overlap byte boundaries. To make this happen as little as possible, it's best to go the other way and index into the largest primitive the CPU supports. In Java, at least, this means you want to simulate bit-addressable memory using a long array.
My Java filter uses a BitSet implementation borrowed from Apache Lucene, because the Java builtin BitSet only allows a 32-bit index (same with Java array indices). You might think this limits you to a 2GB filter, but it's actually only around 270MB, since the 32-bit index counts bits, not bytes. Anyways, the Lucene BitSet uses a long array to simulate bit-addressable memory.
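A sketch of what that bit-packing looks like, simulated here in Python with a list of 64-bit words standing in for a Java long[]. The class and method names are made up for illustration:

```python
class PackedFingerprints:
    # Bit-packed fingerprint table backed by 64-bit words, so fingerprints
    # of any bit width sit flush against each other, even across word
    # boundaries.
    def __init__(self, num_slots, bits_per_fp):
        self.bits = bits_per_fp
        total = num_slots * bits_per_fp
        self.words = [0] * ((total + 63) // 64)

    def get(self, slot):
        pos = slot * self.bits
        w, off = divmod(pos, 64)
        val = self.words[w] >> off
        if off + self.bits > 64:
            # Fingerprint straddles a word boundary: pull the high bits
            # from the next word.
            val |= self.words[w + 1] << (64 - off)
        return val & ((1 << self.bits) - 1)

    def set(self, slot, fp):
        pos = slot * self.bits
        w, off = divmod(pos, 64)
        mask = (1 << self.bits) - 1
        self.words[w] = (self.words[w] & ~(mask << off)) | ((fp & mask) << off)
        if off + self.bits > 64:
            spill = self.bits - (64 - off)  # bits that belong in the next word
            self.words[w + 1] = (self.words[w + 1] & ~((1 << spill) - 1)) | (
                (fp & mask) >> (64 - off)
            )
        # Python ints are unbounded; trim the word back to 64 bits.
        self.words[w] &= (1 << 64) - 1
```

The boundary-straddling branches are exactly the fiddly part a flat long array forces you to handle, and the reason naive byte-aligned layouts waste space.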
Using regular arrays of booleans doesn't work in higher-level languages, unfortunately. They tend to use CPU-supported primitives underneath, so for example a boolean can take up a whole byte. Every Cuckoo filter library I know of, besides the reference implementation and mine, fails to handle this and is extremely space-inefficient :(.
Hmmm.... this might be an artifact of the python implementation?
The insert speed should be roughly constant until the filter begins moving items into their alternate buckets (when one of the two buckets for an item is full). In my implementation this doesn't usually happen until well over 90% capacity.
Insert speed will also slow slightly as the filter fills up if the implementation uses a shortcut to find the first empty bucket position by just checking whether it's zero. Conceptually, if a bucket is empty you only need to check the first position before the insert; when the filter is half full, you need to check ~50% of bucket positions before an insert. In practice, bucket operations are so fast, and buckets only hold 4 items each, that I haven't noticed a difference.
Yes, I understand why it slows down (though the additional details you've provided are helpful). My point was that slowing down is correctly described as a decrease, not an increase, in throughput.
Since you've implemented one, maybe you can answer a couple more questions I have:
> Typically, a Cuckoo filter is identified by its fingerprint and bucket size. For example, a (2,4) Cuckoo filter stores 2 bit length fingerprints and each bucket in the Cuckoo hash table can store up to 4 fingerprints.
Wouldn't that fingerprint length normally be two bytes rather than two bits? The Python sample code given appears to use bytes.
Also, when you start doing deletion, isn't there a nonzero probability of causing false negatives because of fingerprint collisions? I guess a 2 byte fingerprint would make that probability acceptably small, but it doesn't seem to be zero. Correct?
The code should use bits; the point of the filter is space efficiency, so you need to pack everything together as closely as possible. Allowing fingerprint sizes of only multiples of 8 bits doesn't make sense.
Also, from the quote, it's probably more accurate to say a Cuckoo filter is identified by its fingerprint size, bucket size, and capacity. In most implementations the bucket size is fixed at 4, so you can define a filter with just fingerprint size and capacity.
And yes, deleting can cause false negatives but only in a roundabout way. Deleting will only cause false negatives if you delete something that wasn't added previously. Because the filter has false positives, sometimes it will tell you it's seen an item that it hasn't. If you delete one of these false positives it will cause a false negative. If you only delete items that have been added before, and in the case of duplicate items, added the same number of times, you won't get any false negatives.
Essentially if you saved a log of all the inserts then replayed it as deletions you would have a perfectly empty filter every time with no false negatives. I actually have a unit test that does this in my java Cuckoo filter :), it exorcised quite a few demons from the code.
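A sketch of that replay test; the `insert`/`contains`/`delete`/`size` filter API is hypothetical and would need adapting to a real library:

```python
import random

def replay_test(filter_factory, n_items=1000):
    # Insert a log of random items, then replay the log as deletions.
    # Every inserted item must still be reported present before its delete,
    # and the filter must end up empty: any deviation is a false negative
    # or an accounting bug.
    f = filter_factory()
    items = [random.getrandbits(64) for _ in range(n_items)]
    inserted = [x for x in items if f.insert(x)]
    for x in inserted:
        assert f.contains(x), "false negative before delete"
        assert f.delete(x), "delete of a previously inserted item failed"
    assert f.size() == 0, "filter not empty after replaying inserts as deletes"
```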
I try to think of these filters as the opposite of a cache. Bloom/Cuckoo has no false negatives, and a cache has no false positives. Besides this, they are very similar conceptually and used in similar contexts.