Not sure how big the word dict is in your latest version, but you can do much better simply by reordering how you create your index.
With alphabet in order, assembling letters ABCDE: 17345.00 bytes
With alphabet in order, assembling letters EDCBA: 16949.00 bytes
With alphabet order tweaked, assembling letters EDCBA: 16309.00 bytes
Where tweaked means you build your offset as if each position was ordered like this ([::-1] means reverse if you're unfamiliar with Python).
You can also use a prefix rather than variable length encoding, this means you can use 2 bits to represent a number bigger than 2^14, rather than 3. This might hurt your ability to decode though, as you'll have bits that cross byte boundaries.
You can get much smaller using length 3 varints rather than 7 (13,110 bytes), but I presume that would perform worse on GB hardware than staying byte aligned.
Not sure what technique you're using for the answers list, but the compress5.py suggests it's doing a basic bitmap.
Base bitmap is 12972 bits, or 1622 bytes (your file lists 1619, not sure why it's 3 bytes smaller, but all the same). You can "skip encode" (I don't know the formal name for this technique) into 1232 bytes by encoding runs of three [0, 0, 0] as [0], and anything else as [1, X, X, X], saving another 390 bytes.
I tried all combinations of runs between 1 and 7, and 3 is optimal.
You can beat 15,412 bytes with a much simpler decompression algorithm than xz. Using variable length encoding on a tail-sorted delta-offset array gets down to 13,180 bytes.
Neither competes with RoadRoller (which gets down to around 12,200 and includes the code for decoding), but that takes forever to decompress and uses a ton of memory so certainly not applicable for this application.
See my other comments in this thread if you have interest!
Secondly, I made my offsets relative to the overall solution space of 26*5, ordering the words sorting from their ends, and with a little bit of twiddling the order of the alphabet to put common letters near the start:
alpha1 = "aeioustrbcdfghjklmnpqvwxyz"
alpha2 = "aeioustrbcdfghjklmnpqvwxyz"
alpha3 = "aeioustrbcdfghjklmnpqvwxyz"
alpha4 = "aeioustrbcdfghjklmnpqvwxyz"
alpha5 = "aeioustrybcdfghjklmnpqvwxz"
bitmap = []
for ia, a in enumerate(alpha1):
for ib, b in enumerate(alpha2):
for ic, c in enumerate(alpha3):
for id, d in enumerate(alpha4):
for ie, e in enumerate(alpha5):
bitmap.append(e+d+c+b+a in words)
Doing this, and then doing the variable length encoding I got the file down to 13,181 bytes (it was 13,180 bytes, but I needed to add a 7 bit termination string so that you can properly decode the file after you write it to disk, otherwise when the file rounds to the nearest byte you have random 0s that get decoded).
I'm sure with some twiddling of the alphabets some more you could save a few more bytes, but this does better than both Brotli on a ASCII trie and the Huffman Trie by almost 1KB (https://github.com/adamcw/wordle-trie-packing#all-words), so I'm very happy.
I think you might have miscalculated bits per bytes here?
8 * 17,763/64,860 = 2.19
Also, I attempted to implement this as described in this paper (variable length encoding the letters and the offsets, utilized L, and dropped F entirely because all words are the same length, N didn't make a big difference).
I achieved a naive size of 20,560 bytes, which I didn't have confidence implementing more advanced techniques outlined in the paper would get the size down sufficiently to compete with using a trie+Huffman representation (15,599 bytes, https://github.com/adamcw/wordle-trie-packing#all-words).
A trie will already run-length encode all the first letters into 26*5=130 bits pre-Huffman coding. I doubt RLE will beat that. A trie will in essence RLE every level but without needing to track the length of the run, so I suspect it'll outperform RLE at every level.
If you have a means of doing RLE that performs otherwise, I'd love to understand how it works.
FYI, turning it into 12972 by 5 and Brotli compressing achieves 15,093 bytes, which is less than if you first turn the data into an ASCII trie then Brotli compress that (14,180 bytes) (Source: https://github.com/adamcw/wordle-trie-packing#all-words).
I generated a DAWG with 12,822 nodes, which means you need 14 bits for each pointer. A trie representation can be packed much smaller because you don't need to randomly jump around the graph, you can just read it out sequentially.
With huffman coded labels and offsets, I got the size down to approximately:
- 94,761 bits for offsets.
- 56,900 bits for labels.
- 12,822 bits for indicating when you're at the end of a next chain.
I feared that (“This list is a lot shorter, so there will be fewer opportunities for savings”)
However, I think you can layout the tree so that no pointers point backwards. If so, can you make those offsets smaller by making them relative to the current point in the tree?
Also, since the list only has five-letter words, for the last letter, you don’t even need the letters themselves, just 26 bits for what letters can complete a word. That might be a saving.
Ah, I replied to myself with more information while you were also replying.
I also surmise that the short length of the words makes a DAWG just very heavy.
It's not clear to me that relative offsets would be notably smaller to the extent that would be needed. Even a hypothetical and cheated DAWG I came up with is ~33% bigger than alternatives. I've generally explored enough (see the paper in my other comment) that I don't feel that further investigations into a DAWG are likely to outperform other methods.
I can't see anything immediately that jumps out that the Crab game is doing that's special to save space, I think it just achieves better compression because you can compress larger files easier, and the words are longer with more overlapping sections.
I agree that you need to compare including the decompressor size, so I'm not sure which approach is better the Huffman trie or the one in the original article. I'm not familiar enough with GB programming to be able to suggest how much program memory would be needed to decode the Huffman Trie, it looks like it would be somewhat similar in complexity.
With 12,822 nodes, you need 57,387 bits for the labels and the Huffman table (I'm sure you could make the Huffman table more efficient, but it's only 50 bytes, so that's not helping much).
Then, to mimic their edge reordering technique but without having to actually implement all the logic, I ordered the edges by frequency and used variable length integer encoding of size 3 (this performed the best on the data set) which required 95,988 bits.
Variable length integer encoding breaks the number into 3 bit chunks, each prefixed by 1 bit to indicate if there is another 4 bit chunk to read for that number. Since the distribution of offsets is heavily skewed, optimizing the most frequent offsets into a small package is better even if rarer ones suffer from multiple prefix bits.
This is 19,171 bytes total, or substantially worse than both the original article and Huffman tries do. This isn't even counting the flag bits needed for actually traversing the graph. So even cheating, it's not clear I can get a DAWG to be within striking distance of either other approach.
I hypothesize that the reason tries and other methods perform so well here is the relatively shallow depth. All words are only length 5, so the trie doesn't ever get really deep. This also means that suffixes generally don't actually take up that much space given common ones will also pack small with Huffman coding. The size of offsets appears to be just too great relative to how much you can save by removing shared suffixes from 5 letter words.
Would love to know if there is some trick to DAWG that I'm missing that would let me get it even smaller.
I was thinking that it's probably not quite sparse enough to benefit from RLE as-is, since the number of bits you'd need for your lengths would outstrip the length of your run. If any run can be more than 128 words, then you'd need at least 8 bits for the run, making it only beneficial for runs of longer than that.
An alternative would be to make 0 mean five zeros (or some other N) and then if you hit a 1, it means the next 5 bits are to be interpreted as-is. This reduces all 5 length 0s to 1 bit, while only adding 1 bit whenever there is a bit. At worst this introduces 1 extra bit per answer. The answer to non-answer ratio is about 5 to 1, so this should definitely save space while also having a trivial decoding algorithm.
Probably not suitable for direct RLE as you say, but if you looked over the data I suspect you might find that a stepped RLE, where you interleave two or more RLEs could provide a savings. You do of course need a cache to decompress parts of the intervleaved RLE'd data into.
With alphabet in order, assembling letters ABCDE: 17345.00 bytes With alphabet in order, assembling letters EDCBA: 16949.00 bytes With alphabet order tweaked, assembling letters EDCBA: 16309.00 bytes
Where tweaked means you build your offset as if each position was ordered like this ([::-1] means reverse if you're unfamiliar with Python).
``` alpha1 = "abcdestfghijklmnopqruvwxyz"[::-1] alpha2 = "eaioustrbcdfghjklmnpqvwxyz" alpha3 = "aeioustrbcdfghjklmnpqvwxyz" alpha4 = "eaiousthrbcdfgjklmnpqvwxyz" alpha5 = "aeioustryhkbcdfgjlmnpqvwxz" ```
You can also use a prefix rather than variable length encoding, this means you can use 2 bits to represent a number bigger than 2^14, rather than 3. This might hurt your ability to decode though, as you'll have bits that cross byte boundaries.
You can get much smaller using length 3 varints rather than 7 (13,110 bytes), but I presume that would perform worse on GB hardware than staying byte aligned.