To add to Vlad's comment above: the major win was being able to create a simple API that inherently computes the probability that an update invalidates a cache entry, taking into account the semantics of the item being cached and its relevance to the query. Doing this explicitly via heuristics leads to bespoke effort for each new or modified query.
With Spiral, we were able to approach this top down as a classification problem.
e.g.
If you have a cached query for "Friends that liked my post", the Spiral classifier quickly learns that "Post Last Viewed At" or "Post Last Modified At" is not relevant to this via the feedback from the caching code.
Pre-spiral, this was expressed via a curated blacklist/whitelist which had to be recreated if the query characteristics changed.
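To make that concrete, here's a toy sketch of the feedback-driven classification idea - purely illustrative, not Spiral's actual API or model. An online linear classifier predicts whether a write invalidates a cached query and updates its weights from the feedback the caching code reports afterwards; weights for irrelevant signals like "Post Last Viewed At" drift toward zero on their own, with no curated blacklist/whitelist.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not Spiral's real interface): an online logistic
// classifier over binary features of a write, predicting whether the
// write invalidates a cached query result.
struct OnlineClassifier {
    std::vector<double> w;  // one weight per binary feature
    double lr = 0.1;        // learning rate

    explicit OnlineClassifier(std::size_t n_features) : w(n_features, 0.0) {}

    // Predicted probability that this write invalidates the cache entry.
    double predict(const std::vector<int>& x) const {
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) s += w[i] * x[i];
        return 1.0 / (1.0 + std::exp(-s));
    }

    // Feedback from the caching code: did the cached result actually
    // become stale (label = 1) or stay valid (label = 0)?
    void learn(const std::vector<int>& x, int label) {
        const double err = label - predict(x);  // gradient of the log-loss
        for (std::size_t i = 0; i < x.size(); ++i) w[i] += lr * err * x[i];
    }
};
```

Features whose presence doesn't correlate with invalidation end up with small weights automatically, which is the behavior described above for "Post Last Viewed At".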
I think this makes sense. My mental model would have been to view this as a sharding-style concern, where items of different sizes and topics would effectively go to different caches, with hit rates determining the size of each cache.
That said, I think I see where I was mistaken in thinking that was an odd example. It was literally the example. Not just a random pedagogical one.
Could you talk a bit about the challenges you faced in developing this? And about the before-and-after benchmarks?
I tried something kinda similar to help with tuning data engineering jobs and pipelines for performance and cost, but it turned out to be a fruitless activity because too many variables affected performance. I'd produce models that seemed marginally effective, but after code changes, configuration changes, and changes to input data sizes, the models quickly became stale and ineffective.
We had two big challenges: (1) getting people to clearly specify the feedback mechanism, (2) figuring out what features to use.
(Isn’t all ML about good labels and features? :-) )
Structured ML systems require you to provide (ideally) unambiguous examples of expected behavior. In the case of Spiral (or any other online-learning system), such examples need to be generated automatically. In our experience this part took a good amount of effort: distributed-systems issues (aka race conditions and transient bugs in remote systems) made automatic generation of "clean" examples difficult. Once the bulk of these problems were addressed, the system began to operate very smoothly. Specifically, it adapted to changing conditions very well.
Spiral is designed to be a drop-in replacement for hand-coded heuristics. In other words, if you had a somewhat-working tree of if-else statements that specified your image caching policy (if size<100k and type==jpeg..), you should already have an idea of what features to use. There is a bit of work involved in translating these features into a form suitable for Spiral's classifiers. For example, if a classifier uses binary features, the file-size feature would need to be quantized (123kb -> "100-200kb bucket"). While this type of work requires forethought and effort, the runtime cost of running the classifier is very low.
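The quantization step mentioned above can be sketched like this (bucket boundaries are made up for illustration; this is not Spiral's code):

```cpp
#include <string>

// Turn a raw byte count into a coarse bucket label, which can then be
// one-hot encoded as a binary feature. Boundaries are illustrative.
inline std::string size_bucket(long bytes) {
    const long kb = bytes / 1024;
    if (kb < 100) return "0-100kb";
    if (kb < 200) return "100-200kb";
    if (kb < 500) return "200-500kb";
    return "500kb+";
}
```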
> Today, rather than specify how to compute correct responses to requests, our engineers encode the means of providing feedback to a self-tuning system.
"encod[ing] the means of providing feedback to a self-tuning system", got it, very cool!
But don't they still have to "specify how to compute correct responses to requests"?
Not OP / GP; from what I understand, this isn't an API generated by ML, it's a cache manager. "Computing correct responses to requests" refers to deciding whether or not it should read it from some caching layer and whether or not it should be cached in the future, and it does this by optimizing some parameters. The difference is something like it decides whether or not to use the API or the cache to load an image or some content, rather than saying "this request should return a response with these properties". Hopefully I'm right and this makes sense.
I built PaleoCodex as an online interactive encyclopedia for Dinosaurs and other prehistoric animals for my six year old using wikipedia data.
He is a big dinosaur buff and wants to be a paleontologist when he grows up. His project brief was "Build a pokedex... but for dinosaurs"
Still kind of a WIP (search is very basic, and the text-to-speech / Listen feature may not work on iPhones; it works on iPad/Nexus and on desktop using Chrome and Safari) and very much a weekend hack.
Built using python, tornado, wikipedia data and the skeleton html5 boilerplate. Uses ResponsiveSpeech for text to speech.
Enjoy! (and do send in your comments and suggestions)
Thanks! Working on the timeline feature. The temporal data and diet/size tags need some cleanup - tedious work via an offline tool with regex and some nltk elbow grease :)
When clicking through, I wondered how many pages that would be if printed in book form (I like printed text).
I think one could make it look really great, especially as the data is already heavily structured.
The naked pointers are inside an internal kv namespace and are a convenience/carryover from the previous implementation (I've covered the justification for the way it's written in a comment above).
The TL;DR is that it is written the way it is to achieve constant-time remove/remove_and_add_to_end operations, which are not an erase+append but more like an unlink/relink operation.
The kv::List is not intended as a generic linked-list implementation like std::list.
The reason for the specific implementation is to get constant-time removal and remove_and_add_to_end operations. Truth be told, it is a holdover from the previous implementation. I shall be writing benchmarks for this next and will revisit the decision on whether or not to use std::list soon-ish :)
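A minimal sketch of why the unlink/relink shape gives constant time - this is not the actual kv::List code, just the underlying idea: each node carries its own prev/next pointers, so given a node you can splice it out and reattach it at the tail without any traversal.

```cpp
// Illustrative intrusive doubly-linked list (not the real kv::List).
struct Node {
    int key = 0;
    Node* prev = nullptr;
    Node* next = nullptr;
};

struct List {
    Node* head = nullptr;
    Node* tail = nullptr;

    void push_back(Node* n) {
        n->prev = tail;
        n->next = nullptr;
        if (tail) tail->next = n; else head = n;
        tail = n;
    }

    // O(1) remove: splice the node out using its own links.
    void unlink(Node* n) {
        if (n->prev) n->prev->next = n->next; else head = n->next;
        if (n->next) n->next->prev = n->prev; else tail = n->prev;
        n->prev = n->next = nullptr;
    }

    // O(1) remove_and_add_to_end: an unlink/relink, not erase+append.
    void move_to_end(Node* n) {
        if (n == tail) return;
        unlink(n);
        push_back(n);
    }
};
```

Because nothing is allocated, freed, or searched for, both operations are just a handful of pointer writes.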
The : private NoCopy is just to block the copy constructors.
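For reference, a NoCopy base along those lines might look like this (my guess at the shape, using deleted copy operations):

```cpp
#include <type_traits>  // only needed for the static_assert check below

// Deriving (privately) from NoCopy deletes the derived class's
// implicitly generated copy constructor and copy assignment.
struct NoCopy {
    NoCopy() = default;
    NoCopy(const NoCopy&) = delete;
    NoCopy& operator=(const NoCopy&) = delete;
};

class Cache : private NoCopy {};  // Cache is now non-copyable

static_assert(!std::is_copy_constructible<Cache>::value,
              "copying is blocked");
```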
Header-only usually means there are one or more .h/.hpp files that you can include in your source. There's no source file or library to compile and link against.
e.g. if I had Hello.h and Hello.cpp, you'd need to either add Hello.cpp to your make/build or build a library and link to it.
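The header-only version of that hypothetical Hello example collapses everything into the header, marked inline so including it from several translation units doesn't violate the one-definition rule:

```cpp
// --- Hello.h (header-only version; no Hello.cpp to compile or link) ---
#include <string>

inline std::string hello(const std::string& name) {
    return "Hello, " + name + "!";
}
```

Users just #include "Hello.h" and build their own sources; there's nothing extra to add to the make/build.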
So in this case, header-only is really just an approach to avoid compiling a separate library? More akin to literally including the source code in your source code, in the eyes of the compiler - all in the same object/library.
Also, keep in mind that most of the library is templates, which need to go in the header files.
BTW, most of boost is header-only :) e.g. boost/noncopyable.hpp would give you a header-only equivalent of the NoCopy class I wrote.
I've always felt weird about having to include some big-arse library just to use a simple container, so for things like this I prefer to write header-only versions.
We paused and looked at each other for a minute and then we burst out laughing.