in other words, "Too big for excel is not big data" https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html


I heard a variation on this: it's not big data until it can't fit in RAM in a single rack.


The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.

I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.


It's frustrated me for the better part of a decade that the misconception persists that "big data" begins after 2U. It's as if we're all still living during the dot-com boom and the only way to scale is buying more "pizza boxes".

Single-server setups larger than 2U but (usually) smaller than 1 rack can give tremendous bang for the buck, no matter if your "bang" is peak throughput or total storage. (And, no, I don't mean spending inordinate amounts on brand-name "SAN" gear).

There's even another category of server, arguably non-commodity, where paying roughly a 2x price premium (on the server itself, not the storage) can quadruple the CPU and RAM capacity, if not the I/O throughput, of the cheaper version.

I think the ignorance of what hardware capabilities are actually out there ended up driving well-intentioned (usually software) engineers to choose distributed systems solutions, with all their ensuing complexity.

Today, part of the driver is how few underlying hardware choices one has from "cloud" providers and how anemic the I/O performance is.

It's sad, really, since SSDs have so greatly reduced the penalty for data not fitting in RAM (while still being local). The penalty for being at the end of an ethernet, however, can be far greater than that of a spinning disk.


That's a good point, I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.

As a software engineer who builds their own desktops (and has for the last 10 years) but mostly works with AWS instances at $dayjob, are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment? Short of going full homelab, tripling my power bill, and heating my apartment up to 30C, I mean...


> I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.

That's probably better, since it'll scale a bit better with technological improvements. The problem is, it doesn't have quite the clever sound to it, especially with the numbers and dollars.

Now, the other main problem is that, though the cost of a workstation is fairly well-bounded, the cost of that medium-data server can actually vary quite widely, depending on what you need to do with that data (or, I suppose, how long you might want to retain data you don't happen to be doing anything to right at that moment).

I suppose that's part of my point: there's a misperception that, because a single server (including its attached storage) can be so expensive, to the tune of many tens of thousands of (US) dollars, it is somehow "big" and undesirable, despite a potentially close-to-linear price-to-performance curve compared to those small 1U/2U servers. Never mind doing any reasoned analysis of whether going farther up the single-server capacity/performance axis, where the price curve gets steeper, is worth it compared to the cost and complexity of a distributed solution.

> are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment?

Sadly, no great tutorials or blogs that I know of. However, I'd recommend taking a look at SuperMicro's complete-server products, primarily because, for most of them, you can find credible barebones pricing with a web search. I expect you already know how to account for other components (primarily of concern for the mobos that take only exotic CPUs).

As I alluded to in another comment, you might also look into SAS expanders (conveniently also well integrated into some, but far from all, SuperMicro chassis backplanes) and RAID/HBA cards for the direct-attached (but still external) storage.


Actually, it is sometimes faster to fetch data over the network from an SSD than it is to read from a local spinning disk. Source: current work.


Well, do notice I said the penalty "can be", not "always is", far greater.

That's primarily because I'm aware of the variability that random access injects into spinning disk performance and that 10GE is now common enough that it takes more than just a single (sequentially accessed) spinning disk to saturate a server's NIC.

Plus, if you're talking about a (single) local spinning disk, I'd argue that's a trivial/degenerate case, especially if compared to a more expensive SSD. Does my assertion stand up better if it had "of comparable cost" tacked on? Otherwise, the choice doesn't make much sense, since a local SSD is the obvious choice.

My overall point is that, though one particular workload may make a certain technology/configuration appear superior to another [1], in the general case, and perhaps most importantly in the high-performance case, you have to keep an eye on the bottlenecks, especially the ones that carry a high incremental cost of increasing their capacity.

It may be that people think the network, even 10GE now, is too cheap to be one of those bottlenecks (arguably a form of fallacy [2] number 7), but that ignores the question of aggregate (e.g. inter-switch) traffic. 40G and 100G ports can get pricey, and, at only 4x and 10x the speed of a single server port, they're far from solving fallacy number 3 at the network layer.

The other tendency I see is for people not to realize just how expensive a "server" is, by which I mean the minimum cost, before any CPUs or memory or storage. It's about $1k. The fancy, modern, distributed system designed on 40 "inexpensive" servers is already spending $40k just on chassis, motherboards, and PSUs. If the system didn't really need all 80 CPU sockets and all those DIMM sockets, that was money down the drain. What's worse, since the servers had to be "cheap", they were cargo-cult-sized at 2U with low-end backplanes, severely limiting I/O performance. Then, to expand I/O performance, more of the same servers [3] are added, not because CPU or memory is needed, but because disk slots are needed, and another $4k is spent to add capacity for 2-4 disks.

[1] This has been done on purpose for "competitive" benchmarks since forever

[2] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...

[3] Consistency in hardware is generally something I like, for supportability, except it's essentially impossible anyway, given the speed of computer product changes/refreshes, which means I think it's also foolish not to re-evaluate when it's capacity-adding time after 6-9 months.


Actually, my example is far simpler and less interesting. Having a console devkit read unordered file data from a local disk ends up being slower than reading the same data from a developer's machine's SSD over a plain gigabit network connection. It simply has to do with the random access patterns and seek latency of the spinning disk versus the great random access capabilities of an SSD. Note this is quite unoptimised reading of the data.


Yes, that is, indeed, a degenerate case, as I suspected.

Is it safe to say that such situations are often found with embedded or otherwise specialized hardware?


I am on a completely different end of the spectrum (embedded devices). How would I go about learning the capabilities of modern servers?


Good question.

At a theoretical level, as a sysadmin, I learned the theoretical capabilities by reading datasheets for CPUs, motherboards (historically, also chipsets, bridge boards, and the like, but those are much less relevant now), and storage products (HBAs/RAID cards, SAS expander chips, HDDs, SSDs). Make sure you're always aware of the actual payload bandwidth (net of overhead), actual units (base 2 or base 10), and duplex considerations (e.g. SATA).

At a more practical level, I look at various vendors' actual products, since it doesn't matter (for example) that a CPU series can support 8 sockets if the only mobos out there are 2- and 4-socket.

I also look at whatever benchmarks are out there to determine if claimed performance numbers are credible. This is where sometimes even enthusiast-targeted benchmark sites can be helpful, since there's often a close-enough (if not identical) desktop version of a server CPU out there to extrapolate from. Even SAS/SATA RAID cards get some attention, not in a configuration worthy of even "medium" data, but enough for validating marketing specs.


If you have two dozen nodes, each with 6 TB of RAM plus a few petabytes of HDD storage, you're definitely going to need a big data solution.


A C/C++ program can go through that amount of data. You just have to use a simple divide-and-conquer strategy. It also depends on how the data is stored and the system architecture, but if you have any method that lets you break the data up into chunks, or even just ranges, then spin your program up for each chunk on each node or processing module that has quick access to the data it's processing. Then take those results and have one or more threads merge them, depending on the resulting data. It also depends on whether this is a one-off job or a continual one.

Assuming the locality of data is not a big issue, it can be extremely fast. However, depending on the system architecture, reading from the drives can be a bottleneck. If your system has enough drives for enough parallel reads, you will tear through the data pretty quickly. Moreover, in my experience, most systems or clusters with a few petabytes have enough drives that one can read quite a lot of data in parallel.

However, the worst is when the data references non-local pieces of data, so your processing thread has to fetch data from a different node, or data not in main memory, to finish processing. This can be a pain, since it means either the task is just not really parallelizable, or the person who originally generated the data did not take into account that certain groups of data may be referenced together. Sadly, what happens here is that communication or read costs for each thread start to dominate the cost of the computation. If it's a common task and your data is fairly static, it makes sense to start duplicating data to speed things up. Restructuring the data can also be quite helpful and pay off in the long run.


Isn't all this just reinventing Hadoop and MapReduce in C++ instead of Java?


Research programmers using MPI have been dividing up problems for several decades now, but what they often don't get about Hadoop, until they've spent serious time with it, is that the real accomplishment in it is HDFS. MapReduce is straightforward entirely because of HDFS.


Cloud, big data, "AI"... I wonder what will be the next "me too, look at me" kind of buzzword for the corporate world...


"blockchain"


Ah yes, how did I forget...


I keep coming back to that article every time someone talks about big data :)



