To add to Vlad's comment above: the major win was being able to create a simple API that inherently computes the probability that an update invalidates a cache entry, taking into account the semantics of the item being cached and its relevance to the query. Doing this explicitly via heuristics leads to bespoke effort for each new or modified query.
With Spiral, we were able to approach this top down as a classification problem.
e.g.
If you have a cached query for "Friends that liked my post", the Spiral classifier quickly learns that "Post Last Viewed At" or "Post Last Modified At" is not relevant to this via the feedback from the caching code.
Pre-spiral, this was expressed via a curated blacklist/whitelist which had to be recreated if the query characteristics changed.
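To make that concrete, here's a toy sketch of the feedback-driven classification idea - purely illustrative, not Spiral's actual API or model. An online linear classifier predicts whether a write invalidates a cached query and updates its weights from the feedback the caching code reports afterwards; weights for irrelevant signals like "Post Last Viewed At" drift toward zero on their own, with no curated blacklist/whitelist.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not Spiral's real interface): an online logistic
// classifier over binary features of a write, predicting whether the
// write invalidates a cached query result.
struct OnlineClassifier {
    std::vector<double> w;  // one weight per binary feature
    double lr = 0.1;        // learning rate

    explicit OnlineClassifier(std::size_t n_features) : w(n_features, 0.0) {}

    // Predicted probability that this write invalidates the cache entry.
    double predict(const std::vector<int>& x) const {
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) s += w[i] * x[i];
        return 1.0 / (1.0 + std::exp(-s));
    }

    // Feedback from the caching code: did the cached result actually
    // become stale (label = 1) or stay valid (label = 0)?
    void learn(const std::vector<int>& x, int label) {
        const double err = label - predict(x);  // gradient of the log-loss
        for (std::size_t i = 0; i < x.size(); ++i) w[i] += lr * err * x[i];
    }
};
```

Features whose presence doesn't correlate with invalidation end up with small weights automatically, which is the behavior described above for "Post Last Viewed At".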
I think this makes sense. My mental model would have been to view this as a sharding-style concern, where items of different sizes and topics would effectively go to different caches, with hit rates determining the size of each cache.
That said, I think I see where I was mistaken in thinking that was an odd example. It was literally the example. Not just a random pedagogical one.
Could you talk a bit about the challenges you faced in developing this? And about the before-and-after benchmarks?
I tried something kinda similar to help with tuning data engineering jobs and pipelines for performance and cost, but it turned out to be a fruitless activity because too many variables affected performance. I'd produce models that seemed marginally effective, but after code changes, configuration changes, and changes to input data sizes, the models quickly became stale and ineffective.
We had two big challenges: (1) getting people to clearly specify the feedback mechanism, (2) figuring out what features to use.
(Isn’t all ML about good labels and features? :-) )
Structured ML systems require you to provide (ideally) unambiguous examples of expected behavior. In the case of Spiral (or any other online-learning system), such examples need to be generated automatically. In our experience this part took a good amount of effort: distributed-systems issues (aka race conditions and transient bugs in remote systems) made automatic generation of "clean" examples difficult. Once the bulk of these problems were addressed, the system began to operate very smoothly. Specifically, it adapted to changing conditions very well.
Spiral is designed to be a drop-in replacement for hand-coded heuristics. In other words, if you had a somewhat-working tree of if-else statements that specified your image caching policy (if size<100k and type==jpeg..), you should already have an idea of what features to use. There is a bit of work involved in translating these features into a form suitable for Spiral's classifiers. For example, if a classifier uses binary features, the file-size feature would need to be quantized (123kb -> "100-200kb bucket"). While this type of work requires forethought and effort, the runtime cost of running the classifier is very low.
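The quantization step mentioned above can be sketched like this (bucket boundaries are made up for illustration; this is not Spiral's code):

```cpp
#include <string>

// Turn a raw byte count into a coarse bucket label, which can then be
// one-hot encoded as a binary feature. Boundaries are illustrative.
inline std::string size_bucket(long bytes) {
    const long kb = bytes / 1024;
    if (kb < 100) return "0-100kb";
    if (kb < 200) return "100-200kb";
    if (kb < 500) return "200-500kb";
    return "500kb+";
}
```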
> Today, rather than specify how to compute correct responses to requests, our engineers encode the means of providing feedback to a self-tuning system.
"encod[ing] the means of providing feedback to a self-tuning system", got it, very cool!
But don't they still have to "specify how to compute correct responses to requests"?
Not OP / GP; from what I understand, this isn't an API generated by ML, it's a cache manager. "Computing correct responses to requests" refers to deciding whether or not it should read it from some caching layer and whether or not it should be cached in the future, and it does this by optimizing some parameters. The difference is something like it decides whether or not to use the API or the cache to load an image or some content, rather than saying "this request should return a response with these properties". Hopefully I'm right and this makes sense.
I built PaleoCodex as an online interactive encyclopedia for Dinosaurs and other prehistoric animals for my six year old using wikipedia data.
He is a big dinosaur buff and wants to be a paleontologist when he grows up. His project brief was "Build a pokedex... but for dinosaurs"
Still kind of a WIP (search is very basic, and the text-to-speech / Listen feature may not work on iPhones; it works on iPad/Nexus and on desktop using Chrome and Safari) and very much a weekend hack.
Built using python, tornado, wikipedia data and the skeleton html5 boilerplate. Uses ResponsiveSpeech for text to speech.
Enjoy! (and do send in your comments and suggestions)
Thanks! Working on the timeline feature. The temporal data and diet/size tags need some cleanup - tedious work via an offline tool with regex and some nltk elbow grease :)
When clicking through, I wondered how many pages that would be if printed in book form (I like printed text).
I think one could make it look really great, especially as the data is already heavily structured.
The naked pointers are inside an internal kv namespace and are a convenience/carryover from the previous implementation (I've covered the justification for the way it's written in a comment above).
The TL;DR is that it is written the way it is to achieve constant-time remove/remove_and_add_to_end operations, which are not an erase+append but more like an unlink/relink operation.
The kv::List is not intended as a generic linked-list implementation like std::list.
The reason for the specific implementation is to get constant-time removal and remove_and_add_to_end operations. Truth be told, it is a holdover from the previous implementation. I shall be writing benchmarks for this next and will revisit the decision on whether or not to use std::list soon-ish :)
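A minimal sketch of why the unlink/relink shape gives constant time - this is not the actual kv::List code, just the underlying idea: each node carries its own prev/next pointers, so given a node you can splice it out and reattach it at the tail without any traversal.

```cpp
// Illustrative intrusive doubly-linked list (not the real kv::List).
struct Node {
    int key = 0;
    Node* prev = nullptr;
    Node* next = nullptr;
};

struct List {
    Node* head = nullptr;
    Node* tail = nullptr;

    void push_back(Node* n) {
        n->prev = tail;
        n->next = nullptr;
        if (tail) tail->next = n; else head = n;
        tail = n;
    }

    // O(1) remove: splice the node out using its own links.
    void unlink(Node* n) {
        if (n->prev) n->prev->next = n->next; else head = n->next;
        if (n->next) n->next->prev = n->prev; else tail = n->prev;
        n->prev = n->next = nullptr;
    }

    // O(1) remove_and_add_to_end: an unlink/relink, not erase+append.
    void move_to_end(Node* n) {
        if (n == tail) return;
        unlink(n);
        push_back(n);
    }
};
```

Because nothing is allocated, freed, or searched for, both operations are just a handful of pointer writes.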
The : private NoCopy is just to block the copy constructors.
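For reference, a NoCopy base along those lines might look like this (my guess at the shape, using deleted copy operations):

```cpp
#include <type_traits>  // only needed for the static_assert check below

// Deriving (privately) from NoCopy deletes the derived class's
// implicitly generated copy constructor and copy assignment.
struct NoCopy {
    NoCopy() = default;
    NoCopy(const NoCopy&) = delete;
    NoCopy& operator=(const NoCopy&) = delete;
};

class Cache : private NoCopy {};  // Cache is now non-copyable

static_assert(!std::is_copy_constructible<Cache>::value,
              "copying is blocked");
```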
Header-only usually means there are one or more .h/.hpp files that you can include in your source. There's no source file or library to compile and link against.
e.g. if I had Hello.h and Hello.cpp, you'd need to either add Hello.cpp to your make/build or build a library and link to it.
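The header-only version of that hypothetical Hello example collapses everything into the header, marked inline so including it from several translation units doesn't violate the one-definition rule:

```cpp
// --- Hello.h (header-only version; no Hello.cpp to compile or link) ---
#include <string>

inline std::string hello(const std::string& name) {
    return "Hello, " + name + "!";
}
```

Users just #include "Hello.h" and build their own sources; there's nothing extra to add to the make/build.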
So in this case, header-only is really just an approach to avoid compiling a separate library? More akin to literally including the source code in your source code, in the eyes of the compiler - all in the same object/library.
Also, keep in mind that most of the library is templates, which need to go in the header files.
BTW, most of boost is header-only :) e.g. boost/noncopyable.hpp would give you a header-only equivalent of the NoCopy class I wrote.
I've always felt weird about having to include some big-arse library just to use a simple container, so for things like this I prefer to write header-only versions.
We paused and looked at each other for a minute and then we burst out laughing.