
In my case it was 2015. I was struggling with a 28GB CSV file that I needed to cut down, grabbing only 5 columns.

Tried Spark on my laptop: a waste of time. After 4 hours I killed all the processes because it still hadn't read even 25% of the file.

Same for Hadoop, Python and pandas, and a shiny new tool from Google whose name I forgot long ago.

Finally I installed Cygwin on my laptop, and 20 minutes later 'cut' gave me the results file I needed.



> I was struggling with a 28GB CSV file

Okay, sounds reasonable, that's larger than most machines' memory...

> Tried spark on my laptop

Yikes, how did we get here? Not to shame you or anything, but that's like two orders of magnitude smaller than the minimum you might consider reaching for cluster solutions...and on a laptop...I'm legitimately curious to hear the sequence of events that led you to pick up Spark for this.

In my opinion this isn't something you should be leaving the command line for. I'm partial to awk; if you needed to get, say, columns 3, 7, 6, 4 and 1 for every line (in that order):

    awk -F, '{print $3"\t"$7"\t"$6"\t"$4"\t"$1}' file.csv
...where -F, sets comma as the input delimiter and tab is used as the delimiter for the output.

While I'm at it, this is my favorite awk command, because you can use it for fast deduplication without wasting cycles/memory on sorting:

    awk '!x[$1]++' file.csv
I actually don't know of anything as concise or as fast as that last command, and it's short enough to memorize. It's definitely not intuitive, though: x is an associative array of counts keyed on $1, and the expression is true (so the line prints) only the first time a given key is seen.
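
For comparison, the same trick spelled out in Python is a sketch like this (deduplicating whole lines here, and the file name is just a placeholder):

    # remember what has been seen; print a line only on its first occurrence
    seen = set()
    with open("file.csv") as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                print(line, end="")

Still streaming, still no sort; the awk version just fits on one line.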


cut solved your problem. Let's talk about why.

cut is line-oriented, like most Unix-style filters: it never needs to hold more than one line in memory at a time.

If you say:

    pd.read_csv(f)[[x,y,z]]
It has to read and parse the entire 28GB into memory (because it is not lazily evaluated; cf Julia).

If you actually need to operate on three columns in memory and discard the rest, you should:

    pd.read_csv(f, usecols=[x,y,z])
Then you get exactly what you need, and avoid swapping.

The lack of lazy evaluation does inhibit composition: just look at the myriad options in read_csv(), some of which exist only to keep eager evaluation efficient.


While pandas' CSV parser is quite slow anyway, the reason pandas is particularly slow in this case is that it insists on applying type detection to every field of every row it reads. I have no clue how to disable it, but it's the default behaviour.

Parsing isn't actually a tough problem. https://github.com/dw/csvmonkey is a project of mine; it manages almost 2GB/sec of throughput _per thread_ on a decade-old Xeon.


Or just use Python and iterate over the file with the csv module, which caters for the strange cases, doing your computations as you go.
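
Something like this sketch, say (file names and column indices are placeholders; it streams one row at a time, so memory stays flat):

    import csv

    COLS = [0, 2, 4, 6, 7]  # hypothetical: the five columns to keep

    with open("in.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow([row[i] for i in COLS])

The csv module handles quoted fields and embedded commas, which is where a plain cut can trip up.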


28GB is really too small for Hadoop to get out of bed for, in general. Though I would wonder why it was _that_ slow with Spark (or Hadoop, for that matter).


Spark is going to try and ingest all the data, and it won’t fit in RAM. Wrong tool for the job basically.


Depends on exactly how you do it, I suppose, but it shouldn't necessarily. Most Hadoop-y work can also be accomplished in Spark without much fuss.


Because it was running on a laptop instead of on a cluster :-)


Ehh... sounds like you might've over-engineered something with Python, tbh. I've dumped 50GB+ worth of CSV into a SQLite DB with something like 20 lines of code, and it only took about 30 seconds (until I added some indexes).
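
Roughly this shape, as a sketch (table layout, file and DB names are all invented, and it assumes every row has the same five columns):

    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS rows (a TEXT, b TEXT, c TEXT, d TEXT, e TEXT)")

    with open("file.csv", newline="") as f:
        reader = csv.reader(f)
        # if the file has a header row, skip it first: next(reader)
        # executemany streams the reader into the table row by row
        conn.executemany("INSERT INTO rows VALUES (?, ?, ?, ?, ?)", reader)

    conn.commit()
    conn.close()

Once it's in SQLite you can do the column selection (or anything else) with plain SQL.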


Re: pandas, you could have done it streaming, à la the `chunksize` parameter.
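
A sketch of that, combined with usecols (column and file names are placeholders):

    import pandas as pd

    chunks = pd.read_csv("file.csv", usecols=["x", "y", "z"], chunksize=1_000_000)
    with open("out.csv", "w") as out:
        for i, chunk in enumerate(chunks):
            # write the header only once, on the first chunk
            chunk.to_csv(out, header=(i == 0), index=False)

Memory use is then bounded by the chunk size rather than the file size.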



