Hacker News
The Wonders of the Suffix Tree Through the Lens of Ukkonen’s Algorithm (humanreadablemag.com)
150 points by pekalicious on Oct 26, 2019 | 39 comments


So often lately, I've come here and left disappointed, reading about Wasabi and hipster sweaters. But this post is exactly the kind of content we should be pushing. Everyone who reads it will become a better hacker for it.


This site is currently running a Kickstarter to launch a new programming magazine. If you'd like to see more content like this out there, consider taking a look:

https://www.kickstarter.com/projects/pekalicious/human-reada...


Also interesting to get into are suffix arrays, which you can construct from a suffix tree. Then there is the notion of a generalized suffix array, where you merge multiple suffix arrays into one big suffix array. That lets you give fast answers to questions like: what is the longest common prefix between several tokens? The tokens of your individual suffix arrays could be letters from an alphabet (so you construct a suffix array from a word) or words (so you construct a suffix array from a sentence).
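
As a naive sketch of why this is useful (not how you'd do it at scale): concatenate two strings with a sentinel that is assumed to appear in neither, build one suffix array over the result, and the longest common substring falls out of comparing adjacent suffixes that come from different source strings.

```python
def longest_common_substring(a, b):
    """Longest common substring via a generalized suffix array.

    Assumes "\x00" occurs in neither input; it separates the two
    strings inside the concatenation.
    """
    text = a + "\x00" + b
    boundary = len(a)              # suffixes starting before this are from a
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for x, y in zip(sa, sa[1:]):
        if (x < boundary) == (y < boundary):
            continue               # both suffixes from the same string
        # longest common prefix of two lexicographically adjacent suffixes
        k = 0
        while (x + k < len(text) and y + k < len(text)
               and text[x + k] == text[y + k] and text[x + k] != "\x00"):
            k += 1
        if k > len(best):
            best = text[x:x + k]
    return best
```

The key fact doing the work: the best cross-string match is always achieved by some *adjacent* pair in sorted suffix order, so one linear scan suffices.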


And then you could get into compressed indexes such as the FM-index, which represent a text in a form that is both (typically) smaller than the original, and also supports fast substring queries.

Suffix trees are cool, but they’re just the entry point to a world of wonderful data structures and algorithms.


The Burrows-Wheeler transform was an interesting thing to see! When I used to teach at an immersive 3-month web dev program, we had a few weeks of playing with such arrays, trees, tries, dynamic programming, and text from Wikipedia.
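
For anyone who hasn't seen it, the whole transform fits in a few lines. This toy version materializes every rotation, which is quadratic; real implementations derive it from a suffix array instead. It assumes "$" does not occur in the input:

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotations.

    "$" is a sentinel assumed absent from s; it marks where the
    original string ends so the transform is invertible.
    """
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

Equal characters tend to cluster into runs in the output, which is what makes the transform compressible and is the foundation the FM-index builds on.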


In practice one should typically prefer a suffix array to a suffix tree. A suffix array is conceptually so easy!

1. Given a string, put all of its suffixes into a list. For example "derp" -> "derp", "erp", "rp", "p".

2. Sort it.

It's cheap! The number of suffixes equals the string length, and each suffix can be a mere reference into the input string. So in practice you use maybe ~2x memory compared to the input, and in big cache-friendly chunks (unlike a trie's scattered nodes).

Now you're flying. If you want to perform a substring search:

1. Binary search on the first character. You get a range of suffixes which begin with that character.

2. Within that range, perform a binary search on the second character. Now you have a range of suffixes which begin with the first two characters.

3. etc.

Powerful and easy.

By the way this technique of going character by character can and should be extended to the sort itself; look up "three-way radix quicksort" for what you never learned in school.
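
A sketch of the recipe above. For the search it compares a pattern-length prefix at each binary-search probe rather than literally narrowing one character at a time, but the resulting range is the same:

```python
def build_suffix_array(text):
    # sort the start indices by the suffix beginning at each one
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_range(text, sa, pattern):
    """Return (lo, hi) so that sa[lo:hi] are exactly the suffixes
    that start with pattern."""
    m = len(pattern)
    prefix = lambda i: text[i:i + m]
    lo, hi = 0, len(sa)
    while lo < hi:                      # left edge: first prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                      # right edge: first prefix > pattern
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo
```

The size of the returned range is the number of occurrences, and the entries themselves are the match positions.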


More like 4x or 8x the memory (of an ASCII string) if you just use pointers or size_t's. Abstractly, you need at minimum n log n bits to represent a permutation, which is what you're doing.


I took a string algorithms course with Esko Ukkonen. It was one of the most interesting and fun algorithm courses ever, and very useful.

Ukkonen's algorithm for suffix trees and Kärkkäinen-Sanders for suffix arrays are very beautiful algorithms.


> Introduced to the world in 1973 by Peter Weiner, suffix trees are a relatively new concept in computer science, with this newness likely fueling its obscurity.

That isn't really all that new; it's just about when red-black trees and B-trees were invented as well, and I wouldn't call those "obscure".


You're right, it is not new. It is slightly obscure though. Definitely comes up if you study algorithms enough, but I don't think intro courses usually even mention them, and most working programmers won't have any idea they exist.


Are there some hidden constants/considerations that prevent suffix trees from being used in the real world?

Every time I expect it to show up (e.g., for pattern matching in databases), I almost always see something else used instead (e.g., trigrams)


A couple of considerations come to mind for a database: (a) efficiency, and (b) dynamism. The tree structure isn't particularly efficient (it has a lot of space overhead, and I think it's also not as fast), so you'd prefer something else if possible. So people often use suffix arrays + LCP preprocessing instead. However, those are hard to make dynamic (your database is constantly changing), so I'd expect that still makes them tough to use in databases. Though I suspect this isn't the whole story.


Not _quite_ a suffix tree, but a similar structure called a GADDAG is sometimes used in word-generating board games (Scrabble, Words With Friends).

GADDAGs are different because they're more like "reversed prefix" trees, but they help with the issues of placing a new word from a dictionary onto the board using the existing letters of another word.

   [1]: https://en.wikipedia.org/wiki/GADDAG


A suffix array is of similar utility to a suffix tree, more compact to store, and pretty simple to compute: all you need to do is start with an array of every index into the data from start to end, then sort the indices by memcmp'ing the suffix starting at each index. There are more optimized algorithms, but the naive approach is within 2-3x of them for moderately sized data (megabytes). You need the fancier construction algorithms if you're building a suffix array of something huge like a genome.
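
A minimal sketch of that construction; Python's byte-wise slice comparison plays the role of memcmp here (note the `key` makes suffix copies, so this toy version is quadratic in memory; with real references into the data you keep the ~2x footprint):

```python
def naive_suffix_array(data: bytes):
    # each entry is a start index into data; comparing data[i:]
    # lexicographically is the memcmp described above
    return sorted(range(len(data)), key=lambda i: data[i:])
```

For b"derp" the suffixes sort as "derp", "erp", "p", "rp", giving the index order [0, 1, 3, 2].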


Just as a side note, that algorithm for suffix array construction has really bad worst cases (consider a string of all "a"s).

It won't matter for experimenting, or most likely for smallish realistic texts, but for anything where you're taking user data you'd want to step up to a slightly more clever algorithm (there's an O(n log n) worst-case one that's still very simple, called rank doubling I believe).
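
A sketch of that rank-doubling idea (also called prefix doubling). Each pass sorts suffixes by their first 2k characters, reusing the ranks from the previous pass; with a comparison sort inside the loop this is O(n log² n), and radix-sorting the rank pairs instead is what gets the textbook O(n log n):

```python
def suffix_array_doubling(s):
    """Suffix array via rank doubling ("prefix doubling")."""
    n = len(s)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in s]    # pass 1: rank = first character
    tmp = [0] * n
    k = 1
    while True:
        # sort key: (rank of first k chars, rank of the next k chars)
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # re-rank: equal keys share a rank, distinct keys increment it
        tmp[sa[0]] = 0
        for j in range(1, n):
            tmp[sa[j]] = tmp[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2
```

Unlike the naive sort, a string of all "a"s only costs log n passes here instead of quadratic comparisons.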


(Ironically, for a site called "Human Readable Magazine" I found that really hard to read.)


(Edit: I've been curious/confused by this for years.)

Can someone explain how this is fundamentally different from a b-tree of the substrings of a string? You'd store each substring by start index and each leaf is where the substring becomes unique.


The only thing suffix trees have in common with B-trees that I can see is that they're both non-binary trees.

A B-tree stores whole keys in the nodes, not fragments of values. And a B-tree tries to self-balance, so there's a lot of leeway for what keys get pushed up to the root node. B-trees split and merge nodes based on the number of items in them, trying to achieve a certain fill factor.

A suffix tree, on the other hand, is storing fragments of the values, and I believe (but have not bothered to prove) that there is one and only one minimal and valid way to organize the tree for a given set of data. Suffix trees add children when the data mandate that they do in order to remain correct, not just in order to make room.


I should phrase it differently: what does a suffix tree get you that a standard sorting tree doesn't? A standard sorting tree gets you all the neighbors in lexical order. What more do you need?


The B-tree just lets you quickly find out if a given string is in a set. The suffix tree lets you do things like finding the longest common substring very quickly, or looking for repeating motifs.


A B-tree (or any sorting/index tree) allows one to find nearby elements in whatever sort order one chooses. If one takes a B-tree of strings sorted in lexical order, one can find the neighbors of a given string X. If you start with a string X = YZ, where Y is the longest prefix that occurs in our base string-to-search and Z is the part that doesn't occur, then the (left and right) neighbors of X will be other strings containing Y, if such strings exist. A search of this sort may admittedly take O(log l) time where a suffix tree might take O(1) (for all I know), but still, a lot of specialized search trees are not that different from a B-tree plus a special sort order, as far as I can tell. For example, a B-tree plus Z-ordering gives you a spatial sorting tree.


This might be a stupid question: why is the fact it’s a _suffix_ tree important? (as opposed to building a similar structure with prefixes)


A suffix tree is a compact, trie-like representation of all the suffixes of a long string. So if you want to know whether it contains a substring, you can start at the root and just search as in a prefix tree. Since for the queried word to be an infix it must be a prefix of a suffix, you will find the word in O(|query|) time.

So yeah, it's a prefix tree, but a prefix tree of all the suffixes.

And now to answer your question: if you naively put every suffix into a regular prefix tree, you would get the same lookup performance. BUT, since the suffixes themselves share a lot of common suffixes, which the prefix tree can't compress, you get a lot more memory consumption.

The suffix tree avoids this by essentially performing compression on the suffix as well, so that you simply have pointers to shared suffixes of suffixes, which point to the root.

In the end it's like a diamond shape I guess, trie in the front, compression in the back.
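
To make the memory contrast concrete, here is the naive version described above: a plain trie holding every suffix. It already answers substring queries in O(|query|), just like a suffix tree, but it uses up to O(n²) nodes, which is exactly the redundancy a real suffix tree compresses away:

```python
def build_suffix_trie(text):
    """Naive trie of every suffix of text (nested dicts as nodes).

    Quadratic space -- for illustration only; a real suffix tree
    compresses unary paths and shares structure.
    """
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, query):
    """A substring is a prefix of some suffix, so walk from the root."""
    node = trie
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Every substring lookup walks exactly len(query) edges regardless of how long the original text is.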


Why can't one construct a prefix tree of all prefixes, then process the query from right to left (assuming it's an offline algorithm)?


Because you can walk search trees from their root but not from their leaves.


The underlying data structure (a trie[0]) is used to create both suffix trees and prefix trees. The 'default' trie is a prefix tree actually.

The reason you would want a suffix tree (over a prefix trie) is that suffix trees are quite useful for computing the longest common substring of two strings (they can do it in linear time).

[0] https://en.wikipedia.org/wiki/Trie


Peter Weiner, who first invented suffix trees, gave an answer to this question on p. 3 of "Linear Pattern Matching Algorithms"[1]: "If P does match a substring of S, then reverse(P) also matches a substring of reverse(S). This observation implies that every technique which solves a pattern matching problem working from left to right has a dual procedure which works from right to left. In what follows, we adopt a left to right view point, referring only briefly to dual concepts as appropriate."

[1]: https://pdfs.semanticscholar.org/d339/0f13eed511cde1116446e7...


The key property is that each substring is the beginning of a suffix, so if you have your suffixes ordered lexicographically, it's easy to answer pattern matching queries. Now, you could do the same with prefixes, but then you would need to reverse the query pattern, and things become less straightforward.


The prefix tree of a string is just a linked list, though—it's the string itself. It doesn't help you do anything for a single string, unlike a suffix tree, which is used for finding substrings of a single string.

Moreover, how would you even start searching for a substring using a prefix tree? You can't start at the root, because the substring wouldn't be a prefix. It doesn't help you there at all.


This is such a clear and illuminating explanation as to why _suffix_, rather than prefix trees, are so useful.


Thanks!


For most problems, a prefix tree will be just as useful.

But suppose you want to look for the best match of a string in a data stream with no predetermined length, e.g. a TCP pipe. Then a suffix tree will allow you to process each incoming character in (amortized) constant time. A prefix tree will not.


If you have keys that tend to have common prefixes (e.g. URLs in RDF), then checking them in reverse (e.g. with a suffix tree) is generally faster, as you will hit differences sooner.


I remember learning about this in a data structures class, and it is a remarkable data structure. But the author of this article needs to pick up a copy of Strunk and White. This reads like it was written by a high school junior. Awful, awful writing.


It seems like the style of the article is deliberately long-winded and poetic to match the type of content that the “human readable magazine” is going for (similar to New Yorker articles). It’s likely seeking to cater to a specific niche of reader preferences.


You've got to be kidding me. The sentences are excruciating to read. New Yorker prose is in a different universe. The writing is spare and lucid. You can't evaluate the quality of prose with "wc -w".


This was one of the most insightful articles I’ve read on HN in a while, and I’d love to see more of this content or even pay a premium to some publisher who has this category of technical deep dives.

My unsolicited feedback to you: it doesn't hurt to practice empathy. The fact that you assumed the author's primary language is English (and even if it really is, it doesn't matter) and hold them to that bar, and that you point out the data structure is something you already knew: is this the same way you provide feedback in a work environment? Don't take this as an attack, but I guarantee you would be a short-timer in my org.


[flagged]


You have unfortunately been posting many comments that break the site guidelines. Would you please review https://news.ycombinator.com/newsguidelines.html and stick to the rules from now on, so we don't have to ban you again?


English may not be their first language.



