Thanks for the heads up! I tried to ensure the first letter wasn't capitalized in the input field, but never added the validation. I just pushed a fix, and the game should now recognize you as a proper winner :)
Yea, parameters per compression algorithm should indeed be added in a next version :) More compute, but definitely an improvement.
I think pandas doesn't offer zstd as an option for CSV, but I'll check once more.
EDIT: indeed, it's missing in `to_csv` - seems like an oversight.
Spot on (I briefly touch on this in the article)! This is why I try to work with cheap-to-compute features. I used to calculate how unique all values were, but ended up taking a sample instead to speed that part up for large data!
Perfect interview questions to get to the juicy details haha!
The problem is that the best choice of compression depends heavily on the data size, which is why just picking whichever algorithm wins benchmarks on a sample will be off.
However, computing certain statistics on the sample and then doing machine learning makes a lot of sense!
Obviously that's not needed for simple, cheap-to-compute features such as the number of rows/columns, which don't actually take longer to compute as the data gets larger.
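For illustration, row/column counts in pandas are metadata lookups, so they cost the same regardless of data size (a sketch; the function name and exact feature set here are illustrative, not shrynk's):

```python
import pandas as pd

def cheap_features(df: pd.DataFrame) -> dict:
    # df.shape is stored metadata, so this costs the same
    # no matter how large the DataFrame is.
    n_rows, n_cols = df.shape
    return {"n_rows": n_rows, "n_cols": n_cols}
```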
But I actually do this for predicting the uniqueness of values! You basically want to get an idea of how many unique values you have per column. But... if you naively take the count from a sample, you get completely wrong numbers.
You can see my approach here: https://github.com/kootenpv/shrynk/blob/6a8675061d82aa65fc3b...
Basically the formula calculates the uniqueness on a sample (e.g. 10000 rows), but then extrapolates the result to your actual data size. E.g. it finds 100 unique values in a sample of 10000, but in reality that means you maybe have 500 unique values.
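A hypothetical version of such an extrapolation (a sketch of the idea, not the exact formula from the linked code):

```python
import pandas as pd

def estimate_nunique(col: pd.Series, sample_size: int = 10_000) -> float:
    """Estimate the number of unique values in a column by
    extrapolating from a sample (a sketch, not shrynk's exact formula)."""
    n = min(sample_size, len(col))
    sample = col.sample(n, random_state=0)
    u = sample.nunique()
    # Linear extrapolation: assume new unique values keep appearing
    # at the rate observed in the sample (u / n per row).
    estimated = u + (u / n) * (len(col) - n)
    return min(float(len(col)), estimated)

col = pd.Series([i % 100 for i in range(50_000)])
print(estimate_nunique(col))  # → 500.0
```

On 50,000 rows where the 10,000-row sample finds 100 unique values, this estimates 100 + 0.01 * 40,000 = 500, matching the ballpark in the example above (though this sketch over-estimates when the true distinct count is already saturated).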
As the author noted after you posted this, it’s not a given that the ML algorithm is O(n). It may be constant time (by looking only at column headers and a sample of data, say).
That said, I was really more interested in practical runtimes. Like, in practice the ML may have a high startup cost (e.g., the cost of loading a model), whereas for most dataset sizes linear complexity may be fine...
Yea - that is related to genetic programming. That, and using auto-encoders for e.g. image compression, are known approaches in "AI".
I'm particularly proud of this meta approach and I am actually thinking this could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
Hyperparameter optimization currently focuses on minimizing cross-validation error, but using this concept you could put weights on accuracy, training time and prediction time (very similar to compression, where the three dimensions are size, write time and read time), and then, given a new unknown dataset, predict which model/hyperparameters to use.
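A minimal sketch of that weighted objective, z-score-normalizing each dimension across candidates so accuracy and times are comparable (candidate names, numbers and weights here are made up for illustration):

```python
import statistics

def best_candidate(results, weights=(1.0, 0.5, 0.5)):
    """Pick a model by a weighted accuracy/time trade-off.

    results: {name: (accuracy, train_seconds, predict_seconds)}
    weights: (w_accuracy, w_train, w_predict)
    """
    names = list(results)
    dims = list(zip(*(results[n] for n in names)))  # one tuple per dimension

    def zscores(values):
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values) or 1.0  # guard against zero spread
        return [(v - mean) / sd for v in values]

    z_acc, z_train, z_pred = (zscores(d) for d in dims)
    wa, wt, wp = weights
    # Accuracy counts positively; times count negatively (lower is better).
    scores = {
        name: wa * a - wt * t - wp * p
        for name, a, t, p in zip(names, z_acc, z_train, z_pred)
    }
    return max(scores, key=scores.get)

candidates = {
    "fast_model": (0.90, 1.0, 0.1),    # slightly worse, much cheaper
    "slow_model": (0.92, 100.0, 5.0),  # slightly better, expensive
}
print(best_candidate(candidates, weights=(0.2, 1.0, 1.0)))  # time-heavy weights
print(best_candidate(candidates, weights=(1.0, 0.1, 0.1)))  # accuracy-heavy weights
```

With time-heavy weights the cheap model wins; with accuracy-heavy weights the expensive one does, which is exactly the trade-off dial described above.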
> I'm particularly proud of this meta approach and I am actually thinking this could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
There is already a substantial field of Machine Learning/Meta Learning which focuses on exactly this. For example, this paper [1] from NeurIPS 2015 does exactly what you suggest.
Yea, I am aware of the meta hyperparameter approach for ML, except they only focus on accuracy instead of also including train/prediction times in the equation :) That's what I was referring to! (You can save A LOT of compute and zoom in on things that work if you can weed out slow / badly performing algorithms as part of meta-learning hyperparameters.)
To make it extra clear: by spending a lot of compute on different datasets, recording not only the accuracy but also the time it took, and then including that as a dimension, you will get even better results.