Thanks for the heads up! I tried to ensure the first letter wasn't capitalized in the input field, but never added the validation. I just pushed a fix, and the game should now recognize you as a proper winner :)
Yea, parameters per compression algorithm should indeed be added in a next version :) More compute, but definitely an improvement.
I think pandas doesn't offer zstd as an option for CSV, but I'll check once more.
EDIT: indeed, it's missing in `to_csv` - seems like an oversight.
Spot on (I briefly touch on this in the article)! This is why I try to work with cheap-to-compute features. I used to calculate how unique all values were, but ended up taking a sample instead to speed that part up for large data!
Perfect interview questions to get to the juicy details haha!
The problem is that the best choice of compression depends heavily on the data size, which is why just picking whichever algorithm wins benchmarks on a sample will be off.
However, computing certain statistics on the sample and then doing machine learning makes a lot of sense!
Obviously that's not needed for simple, cheap-to-compute features such as the number of rows/columns, which don't actually take longer to compute as the data gets larger.
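For illustration, row/column counts in pandas are metadata lookups, so they cost the same regardless of data size (a sketch; the function name and exact feature set here are illustrative, not shrynk's):

```python
import pandas as pd

def cheap_features(df: pd.DataFrame) -> dict:
    # df.shape is stored metadata, so this costs the same
    # no matter how large the DataFrame is.
    n_rows, n_cols = df.shape
    return {"n_rows": n_rows, "n_cols": n_cols}
```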
But I actually do this for predicting the uniqueness of values! You basically want to get an idea of how many unique values you have per column. But... if you naively take the count from a sample, you get completely wrong numbers.
You can see my approach here: https://github.com/kootenpv/shrynk/blob/6a8675061d82aa65fc3b...
Basically the formula calculates the uniqueness on a sample (e.g. 10000 rows), but then extrapolates the result to your actual data size. E.g. it finds 100 unique values in a sample of 10000, but in reality that means you maybe have 500 unique values.
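A hypothetical version of such an extrapolation (a sketch of the idea, not the exact formula from the linked code):

```python
import pandas as pd

def estimate_nunique(col: pd.Series, sample_size: int = 10_000) -> float:
    """Estimate the number of unique values in a column by
    extrapolating from a sample (a sketch, not shrynk's exact formula)."""
    n = min(sample_size, len(col))
    sample = col.sample(n, random_state=0)
    u = sample.nunique()
    # Linear extrapolation: assume new unique values keep appearing
    # at the rate observed in the sample (u / n per row).
    estimated = u + (u / n) * (len(col) - n)
    return min(float(len(col)), estimated)

col = pd.Series([i % 100 for i in range(50_000)])
print(estimate_nunique(col))  # → 500.0
```

On 50,000 rows where the 10,000-row sample finds 100 unique values, this estimates 100 + 0.01 * 40,000 = 500, matching the ballpark in the example above (though this sketch over-estimates when the true distinct count is already saturated).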
As the author noted after you posted this, it’s not a given that the ML algorithm is O(n). It may be constant time (by looking only at column headers and a sample of data, say).
That said, I was really more interested in practical runtimes. Like, in practice the ML may have a high startup cost (e.g., the cost of loading a model), whereas for most dataset sizes linear complexity may be fine...
Yea - that is related to genetic programming. That, and using auto-encoders for e.g. image compression, are known approaches in "AI".
I'm particularly proud of this meta approach and I am actually thinking this could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
Hyperparameter optimization currently focuses on minimizing cross-validation error, but using this concept you could put weights on accuracy, training time and prediction time (very similar to compression, where the three dimensions are size, write time and read time), and then, given a new unknown dataset, predict which model/hyperparameters to use.
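A minimal sketch of that weighted objective, z-score-normalizing each dimension across candidates so accuracy and times are comparable (candidate names, numbers and weights here are made up for illustration):

```python
import statistics

def best_candidate(results, weights=(1.0, 0.5, 0.5)):
    """Pick a model by a weighted accuracy/time trade-off.

    results: {name: (accuracy, train_seconds, predict_seconds)}
    weights: (w_accuracy, w_train, w_predict)
    """
    names = list(results)
    dims = list(zip(*(results[n] for n in names)))  # one tuple per dimension

    def zscores(values):
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values) or 1.0  # guard against zero spread
        return [(v - mean) / sd for v in values]

    z_acc, z_train, z_pred = (zscores(d) for d in dims)
    wa, wt, wp = weights
    # Accuracy counts positively; times count negatively (lower is better).
    scores = {
        name: wa * a - wt * t - wp * p
        for name, a, t, p in zip(names, z_acc, z_train, z_pred)
    }
    return max(scores, key=scores.get)

candidates = {
    "fast_model": (0.90, 1.0, 0.1),    # slightly worse, much cheaper
    "slow_model": (0.92, 100.0, 5.0),  # slightly better, expensive
}
print(best_candidate(candidates, weights=(0.2, 1.0, 1.0)))  # time-heavy weights
print(best_candidate(candidates, weights=(1.0, 0.1, 0.1)))  # accuracy-heavy weights
```

With time-heavy weights the cheap model wins; with accuracy-heavy weights the expensive one does, which is exactly the trade-off dial described above.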
> I'm particularly proud of this meta approach and I am actually thinking this could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
There is already a substantial field of Machine Learning/Meta Learning which focuses on exactly this. For example, this paper [1] from NeurIPS 2015 does exactly what you suggest.
Yea, I am aware of the meta hyperparameter approach for ML, except they only focus on accuracy instead of also including train/prediction times in the equation :) That's what I was referring to! (You can save A LOT of compute and zoom in on things that work if you can weed out slow / badly performing algorithms as part of meta-learning hyperparameters.)
To make it extra clear: by spending a lot of compute on different datasets, recording not only the accuracy but also the time it took, and then including that as a dimension, you will get even better results.