
If you have two dozen nodes, each with 6 TB of RAM plus a few petabytes of HDD storage, you're definitely going to need a big data solution.


A C/C++ program can go through that amount of data. You just have to use a simple divide-and-conquer strategy. It also depends on how the data is stored and on the system architecture, but if you have any method that lets you break the data up into chunks, or even just ranges, you can spin your program up for each chunk on every node or processing module that has quick access to the data it's processing, and then have one or more threads merge the partial results, depending on what the results look like. It also matters whether this is a one-off job or something you want to run continually.
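
Something like this, as a rough sketch of the chunk-and-merge pattern (assuming the data is just a big file of fixed-size records; the file name, record layout, thread count, and per-record work are made up for illustration):

    // Split a large file of fixed-size records into byte ranges, process each
    // range on its own thread, then merge the partial results on one thread.
    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <thread>
    #include <vector>

    struct Record {            // hypothetical fixed-size record
        uint64_t key;
        double   value;
    };

    // Each worker scans one byte range of the file and returns a partial sum.
    static double process_range(const std::string& path, std::streamoff begin,
                                std::streamoff end) {
        std::ifstream in(path, std::ios::binary);
        in.seekg(begin);
        double partial = 0.0;
        Record rec{};
        while (in.tellg() < end &&
               in.read(reinterpret_cast<char*>(&rec), sizeof rec)) {
            partial += rec.value;          // stand-in for real per-record work
        }
        return partial;
    }

    int main() {
        const std::string path = "records.bin";   // illustrative file name
        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

        std::ifstream probe(path, std::ios::binary | std::ios::ate);
        if (!probe) { std::cerr << "cannot open " << path << "\n"; return 1; }
        const std::streamoff size = probe.tellg();
        const std::streamoff records =
            size / static_cast<std::streamoff>(sizeof(Record));
        const std::streamoff per_worker =
            (records / workers) * static_cast<std::streamoff>(sizeof(Record));

        std::vector<std::thread> pool;
        std::vector<double> partials(workers, 0.0);
        for (unsigned i = 0; i < workers; ++i) {
            std::streamoff begin = i * per_worker;
            std::streamoff end   = (i + 1 == workers) ? size : begin + per_worker;
            pool.emplace_back([&, i, begin, end] {
                partials[i] = process_range(path, begin, end);
            });
        }
        for (auto& t : pool) t.join();

        // Merge step: a single thread combines the per-chunk results.
        std::cout << std::accumulate(partials.begin(), partials.end(), 0.0) << "\n";
    }

Across a cluster you'd run the same thing per node, each against the chunks it has fast access to, and merge the per-node results at the end.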

Assuming data locality is not a big issue, it can be extremely fast. Depending on the system architecture, reading from the drives can be a bottleneck, but if your system has enough drives for enough parallel reads you will tear through the data pretty quickly. In my experience, most systems or clusters holding a few petabytes have enough drives that you can read quite a lot of data in parallel.
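
The parallel-read part is mostly just one reader thread per drive, so sequential reads don't contend on a single disk. Rough sketch, with made-up mount points and a placeholder for the per-chunk work:

    // One reader thread per drive so sequential reads proceed in parallel
    // across spindles instead of contending on a single disk.
    #include <atomic>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        // Hypothetical per-drive mount points, each holding one shard of the data.
        const std::vector<std::string> shards = {
            "/data/disk0/shard.bin", "/data/disk1/shard.bin", "/data/disk2/shard.bin"};

        std::atomic<uint64_t> total_bytes{0};
        std::vector<std::thread> readers;
        for (const auto& path : shards) {
            readers.emplace_back([&, path] {
                std::ifstream in(path, std::ios::binary);
                std::vector<char> buf(1 << 20);      // 1 MiB read buffer
                while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
                    // ... per-chunk processing would go here ...
                    total_bytes += static_cast<uint64_t>(in.gcount());
                    if (!in) break;                  // short final read: done
                }
            });
        }
        for (auto& t : readers) t.join();
        std::cout << "read " << total_bytes << " bytes across "
                  << shards.size() << " drives\n";
    }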

However, the worst case is when the data references non-local pieces of data, so a processing thread has to fetch data from a different node, or data not in main memory, to finish its work. This can be a pain, because it means either the task is just not really parallelizable, or whoever originally generated the data did not take into account that certain groups of records would be referenced together. Sadly, what happens then is that communication or read costs for each thread start to dominate the cost of the computation. If it's a common task and your data is fairly static, it makes sense to start duplicating data to speed things up. Restructuring the data can also be quite helpful and pay off in the long run.
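
The restructuring usually amounts to partitioning records by the key they reference, so each partition can be processed without remote fetches. Toy sketch (record fields and partition count are made up):

    // Group records by the key they reference so each partition can be
    // processed without fetching data from another node.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Record {
        uint64_t ref_key;   // key of the other record(s) this one references
        double   payload;
    };

    // Hash-partition so that all records sharing ref_key land in the same
    // bucket; each bucket can then be stored on (or shipped to) a single node.
    std::vector<std::vector<Record>> partition_by_key(const std::vector<Record>& in,
                                                      size_t partitions) {
        std::vector<std::vector<Record>> out(partitions);
        for (const Record& r : in) {
            out[r.ref_key % partitions].push_back(r);   // simple modulo hashing
        }
        return out;
    }

    int main() {
        std::vector<Record> data = {{7, 1.0}, {7, 2.0}, {3, 5.0}, {12, 0.5}};
        auto buckets = partition_by_key(data, 4);
        for (size_t i = 0; i < buckets.size(); ++i)
            std::cout << "partition " << i << ": " << buckets[i].size() << " records\n";
    }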


Isn't all this just reinventing Hadoop and MapReduce in C++ instead of Java?


Research programmers using MPI have been dividing up problems for several decades now, but what they often don't get about Hadoop until they've spent serious time with it is that the real accomplishment is HDFS. MapReduce is straightforward entirely because of HDFS.
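
For anyone who hasn't seen it, the MPI version of dividing up a problem looks roughly like this sketch (the per-rank work and file naming are made up); the point is that nothing here tells a rank where its chunk of the data actually lives, which is exactly the part HDFS solves:

    // Each rank works on its own local slice of the data and a reduce combines
    // the partial results.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Stand-in for real per-rank work on a local chunk, e.g. a file named
        // chunk_<rank>.bin that the rank must already know how to find.
        double partial = static_cast<double>(rank + 1);

        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("combined result from %d ranks: %f\n", size, total);

        MPI_Finalize();
        return 0;
    }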



