Great questions and discussions. I'm definitely passionate about versioning in the context of models and data science for both data and code. I work full time on the open source Pachyderm project (pachyderm.io), and we have users versioning their data and models in our system. Basically, you can output checkpoints, weights, etc. from your modeling and have that data versioned automatically in Pachyderm. Then if you utilize that persisted model in a data pipeline for inference, you can have total provenance over which versions of which models created which results (and which training data was used to create that version of the model, etc.).
Thanks for sharing this story. My experience has been that data scientists and analysts aren't able to efficiently use Hadoop/Spark even in cases where it is warranted. These individuals generally don't like working with Java/Scala and/or haven't spent time understanding the underlying structures used (e.g., RDDs, caching, etc.). As a result, they either don't put their sophisticated modeling or analyses into production, or they hand off their application to other engineers to implement for production-size "big data." This produces all sorts of problems and inefficiencies, not the least of which is the fact that the engineers don't understand the analyses and the data scientists don't understand the implementation.
My (biased, as I work for them) opinion is that something like Pachyderm (http://pachyderm.io/) will ease some of these struggles. The philosophy of those who work on this open source project is that data people should be able to use the tooling and frameworks they like/need and be able to push their analyses to production pipelines without re-writing, without a lot of friction, and without worrying about things like data sharding and parallelism.
For example, in Pachyderm you can create a nice, simple Python/R script that is single threaded and runs nicely on your laptop. You can then put the exact same script into Pachyderm and run it in a distributed way across many workers on a cluster. Your code stays simple and approachable, while people can still push things into production infrastructure and create value.
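To make that concrete, here is a minimal sketch of what such a script could look like. Pachyderm mounts a pipeline's input data under /pfs/&lt;repo&gt; and collects whatever the script writes to /pfs/out; the repo name "training" here is made up, and the env-var indirection is just my own convention so the same file runs on a laptop against local directories:

```python
import os

# Inside a Pachyderm worker, inputs appear under /pfs/<input-repo> and
# outputs go to /pfs/out. On a laptop, point these at local directories
# (e.g. IN_DIR=./data OUT_DIR=./results). "training" is a hypothetical repo.
IN_DIR = os.environ.get("IN_DIR", "/pfs/training")
OUT_DIR = os.environ.get("OUT_DIR", "/pfs/out")

def word_counts(in_dir, out_dir):
    """Count the words in every .txt file in in_dir, writing one
    <name>.count file per input file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(in_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(in_dir, name)) as f:
            n = sum(len(line.split()) for line in f)
        with open(os.path.join(out_dir, name + ".count"), "w") as f:
            f.write(str(n))

# Run only if the input directory actually exists (laptop or cluster).
if os.path.isdir(IN_DIR):
    word_counts(IN_DIR, OUT_DIR)
```

Because the script only sees "files in a directory," Pachyderm can shard those files across many workers running the same container without the script changing at all.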
I will also throw in my (biased, as I work on the project) suggestion to take a look at Pachyderm (http://pachyderm.io/). It is open source, language agnostic, and distributed. Plus it automatically tracks the provenance of all of your data pipelines, regardless of language or parallelism over your data.
Basically you set up data pipelines, where the input/output of each stage is versioned (like "git for data"). That way you have versioned sets of your data (e.g., training data), but you also can know things like exactly what model was used to produce which result, what data was used to train that particular model, what transformations occurred on that training set, etc.
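For a sense of what "setting up a pipeline" looks like, a Pachyderm pipeline is declared with a small JSON spec that names a versioned input repo and a containerized transform; the exact field names have varied across versions, so treat this as an illustrative sketch (the repo, image, and script names are invented):

```json
{
  "pipeline": { "name": "train-model" },
  "input": {
    "pfs": { "repo": "training-data", "glob": "/*" }
  },
  "transform": {
    "image": "example/model-trainer",
    "cmd": ["python3", "/train.py"]
  }
}
```

Every commit to the "training-data" repo triggers a run, and the output lands in a versioned "train-model" repo, which is how the provenance links described above get recorded automatically.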
Things like Airflow and Luigi are, no doubt, useful for data pipelining and some workflows (depending on what language you are working with). However, by combining pipelining and data versioning in a unified way, Pachyderm naturally lets you handle provenance of complicated pipelines, have exact reproducibility, and even do interesting things like incremental processing.
Reproducibility and provenance are essential for data science workflows, and it's important to maintain them at scale. Take a look at Pachyderm some time as well if you get a chance (I work on the project, for full disclosure). We version data in addition to code for full reproducibility, even for distributed, multi-stage, multi-language pipelines.
Thanks for sharing your experience. I work with Pachyderm, which is an open source data pipelining and data versioning framework. Some things that might be relevant to this conversation are the fact that Pachyderm is language agnostic and that it keeps analyses in sync with data (because pipelines trigger off of commits to versioned data). This makes it distinct from Airflow or Luigi, for example.
Pachyderm, with its "git for big data" approach, is one of the coolest things, if not THE coolest thing, I learned about in 2016.
I only hope to get time to test it out in more depth sooner rather than later (it is one of my top goals for 2017).
Also, the pipeline feature in Pachyderm does not suffer from the "dependencies between tasks rather than data" problem that I mentioned in another post here, but properly identifies separate inputs and outputs declaratively.
Pachyderm specifies workflows in a kind of DSL AFAIK, and I'm very much interested to see if it could natively fit the bill for our complex workflows. But if not, I think we can always use it in a light-weight way to fire off scipipe workflows (instead of the applications directly), and so let scipipe take care of the complex data wiring.
We would still like to benefit from the seemingly groundbreaking "git for big data" paradigm, and from auto-executed workflows on updated data, which should enable something as impactful as on-line data analyses (auto-updated upon new data) in a manageable way.
I definitely agree here. As a data scientist, sometimes it seems like we are in the wild west as far as reproducibility and versioning of our analyses.
This seems like an interesting project that tackles some of the data versioning stuff. However, I believe that, at least in data science, we need data versioning closely tied to the analyses themselves for complete reproducibility.
That is, we need the versioning tied to the inputs/outputs of data pipeline stages, such that we can reproduce pipeline runs at any time and incrementally improve and run pipelines based on diffs in data.
As mentioned elsewhere in the comments, Pachyderm (http://pachyderm.io/) does exactly this. Working both as git for data, but also enabling data pipelining and analyses with the data versioning.
This is awesome! Great to see so much momentum on the Go data science front (see https://github.com/gopherds/resources/blob/master/tooling/RE... for more). Go provides a great integrity and deployment story for data scientists, and contributions like this are well worth the effort!