I love how the code is such a mess. You can really tell one guy just wrote this whole thing over the span of a decade... It's just one patch on top of another and the comments are pretty amusing. Also funny to see hardcoded algorithms for pre-defined site paths and whole domains such as facebook/myspace/vimeo. This is truly a makeshift search engine on a massive scale.
Just found a little gem myself. I am working on another open source search engine[0], and needed a way to make badly behaved document filters time out.
Unfortunately, the document filter in question does spawn child processes, so the usual approach of using fork() and a monitoring process was not working. However, using ulimit like this should work:
https://github.com/gigablast/open-source-search-engine/blob/... . Hadn’t thought about spawning a new shell and letting it have control like that :)
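For anyone curious, the general idea can be sketched roughly like this. This is not the Gigablast code, just a minimal Python illustration of the same trick: start a shell, let it lower its own CPU limit with ulimit -t, then exec the filter so the limit is inherited by the filter and anything it spawns. The helper names (build_limited_command, run_with_cpu_limit) are made up for the example, and it assumes /bin/sh supports ulimit -t (bash and dash do, but POSIX does not require it).

```python
import shlex
import subprocess


def build_limited_command(argv, cpu_seconds):
    """Build a shell command line that lowers the CPU limit, then execs argv.

    Hypothetical helper: the limit is set in the shell itself, so the
    exec'd program and every child it forks inherit it.
    """
    # "&&" means the filter only starts if the limit was actually applied.
    return "ulimit -t {} && exec {}".format(int(cpu_seconds), shlex.join(argv))


def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv under /bin/sh with a per-process CPU time limit (seconds)."""
    command = build_limited_command(argv, cpu_seconds)
    return subprocess.run(
        ["/bin/sh", "-c", command],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # A well-behaved command finishes normally.
    ok = run_with_cpu_limit(["echo", "hello"], 5)
    print(ok.returncode, ok.stdout.strip())

    # A busy loop is killed by the kernel once it has used its CPU budget;
    # subprocess reports death-by-signal as a negative return code.
    runaway = run_with_cpu_limit(["sh", "-c", "while :; do :; done"], 1)
    print(runaway.returncode)
```

Two caveats: ulimit -t counts CPU time, not wall-clock time, so a filter that hangs while sleeping or waiting on I/O will not be stopped by this alone. And the limit applies to each process separately rather than to the process tree as a whole.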
Old habits perhaps? When I look back, I remember that my first books on C were full of problematic sprintf and strcpy use. It may then be easy to continue using what you first learned, even when you know better. It's basically the "Baby duck syndrome"[0] for C functions.
EDIT: Gotta say, this has some very useful pieces of code. I'm working on a niche-specific crawler and am battling the url stripping/cleanup part of it. This is very useful: https://github.com/gigablast/open-source-search-engine/blob/...
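As a rough idea of what a URL cleanup step can look like, here is a small sketch using only Python's standard library. It is not taken from the linked code; the function name clean_url and the TRACKING_PARAMS list are assumptions picked for the example, and a real crawler would need more rules than this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not change the page content.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "phpsessid", "sessionid",
}

DEFAULT_PORTS = {"http": 80, "https": 443}


def clean_url(url):
    """Return a normalized form of url for deduplication purposes."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: drops any user:password@

    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = "{}:{}".format(host, parts.port)

    path = parts.path or "/"

    # Drop tracking/session parameters, keep everything else in order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]

    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


if __name__ == "__main__":
    print(clean_url("HTTP://Example.COM:80/a/b?utm_source=x&id=7#frag"))
```

This sketch does not handle IPv6 literals, and re-encoding the query string can change how some characters are escaped, so treat it as a starting point only.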