
In my case it was 2015. I was struggling with a 28GB CSV file that I needed to cut down, grabbing only 5 columns.

Tried Spark on my laptop: a waste of time. After 4 hours I killed all the processes because it still hadn't read even 25% of the file.

Same for Hadoop, Python and pandas, and a shiny new tool from Google whose name I forgot long ago.

Finally I installed Cygwin on my laptop, and 20 minutes later 'cut' gave me the results file I needed.



> I was struggling with a 28GB CSV file

Okay, sounds reasonable, that's larger than most machines' memory...

> Tried spark on my laptop

Yikes, how did we get here? Not to shame you or anything, but that's like two orders of magnitude smaller than the minimum you might consider reaching for cluster solutions...and on a laptop...I'm legitimately curious to hear the sequence of events that led you to pick up Spark for this.

In my opinion this isn't something you should be leaving the command line for. I'm partial to awk; if you needed to get, say, columns 3, 7, 6, 4 and 1 for every line (in that order):

    awk -F, '{print $3"\t"$7"\t"$6"\t"$4"\t"$1}' file.csv
...where -F, sets comma as the input delimiter and tab is used as the delimiter for the output.

While I'm at it, this is my favorite awk command, because you can use it for fast deduplication without wasting cycles/memory on sorting:

    awk '!x[$1]++' file.csv
I actually don't know of anything as concise or as fast as that last command, and it's short enough to memorize. It's definitely not intuitive, though: x is an associative array of counts keyed on $1, and the expression is true (so the line prints) only the first time a given key is seen.
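
For comparison, the same trick spelled out in Python is a sketch like this (deduplicating whole lines here, and the file name is just a placeholder):

    # remember what has been seen; print a line only on its first occurrence
    seen = set()
    with open("file.csv") as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                print(line, end="")

Still streaming, still no sort; the awk version just fits on one line.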


cut solved your problem. Let's talk about why.

cut is line-oriented, like most Unix-style filters: it never needs to hold more than one line in memory at a time.

If you say:

    pd.read_csv(f)[[x,y,z]]
It has to read and parse the entire 28GB into memory (because it is not lazily evaluated; cf Julia).

If you actually need to operate on three columns in memory and discard the rest, you should:

    pd.read_csv(f, usecols=[x,y,z])
Then you get exactly what you need, and avoid swapping.

The lack of lazy evaluation does inhibit composition: just look at the myriad options in read_csv(), some of which exist only to keep eager evaluation efficient.


While pandas' CSV parser is quite slow anyway, the reason pandas is particularly slow in this case is that it insists on applying type detection to every field of every row it reads. I have no clue how to disable it, but it's the default behaviour.

Parsing isn't actually a tough problem. https://github.com/dw/csvmonkey is a project of mine; it manages almost 2GB/sec of throughput _per thread_ on a decade-old Xeon.


Or just use Python and iterate over the file with the csv module, which caters for the strange cases, doing your computations as you go.
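
Something like this sketch, say (file names and column indices are placeholders; it streams one row at a time, so memory stays flat):

    import csv

    COLS = [0, 2, 4, 6, 7]  # hypothetical: the five columns to keep

    with open("in.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow([row[i] for i in COLS])

The csv module handles quoted fields and embedded commas, which is where a plain cut can trip up.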


28GB is really too small for Hadoop to get out of bed for, in general. Though I would wonder why it was _that_ slow with Spark (or Hadoop, for that matter).


Spark is going to try and ingest all the data, and it won’t fit in RAM. Wrong tool for the job basically.


Depends on exactly how you do it, I suppose, but it shouldn't necessarily. Most Hadoop-y work can also be accomplished in Spark without much fuss.


Because it was running on a laptop instead of on a cluster :-)


Ehh... sounds like you might've over-engineered something with Python, tbh. I've dumped 50GB+ worth of CSV into a SQLite DB with something like 20 lines of code, and it only took about 30 seconds (until I added some indexes).
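
Roughly this shape, as a sketch (table layout, file and DB names are all invented, and it assumes every row has the same five columns):

    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS rows (a TEXT, b TEXT, c TEXT, d TEXT, e TEXT)")

    with open("file.csv", newline="") as f:
        reader = csv.reader(f)
        # if the file has a header row, skip it first: next(reader)
        # executemany streams the reader into the table row by row
        conn.executemany("INSERT INTO rows VALUES (?, ?, ?, ?, ?)", reader)

    conn.commit()
    conn.close()

Once it's in SQLite you can do the column selection (or anything else) with plain SQL.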


Re: pandas, you could have done it streaming, à la the `chunksize` parameter.
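
A sketch of that, combined with usecols (column and file names are placeholders):

    import pandas as pd

    chunks = pd.read_csv("file.csv", usecols=["x", "y", "z"], chunksize=1_000_000)
    with open("out.csv", "w") as out:
        for i, chunk in enumerate(chunks):
            # write the header only once, on the first chunk
            chunk.to_csv(out, header=(i == 0), index=False)

Memory use is then bounded by the chunk size rather than the file size.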



