I don't think NoSQL people usually claim that SQL isn't scalable, just that it's...

wmf · on March 3, 2010

> You generally have to partition your data horizontally and thus give up many of the features that SQL has to offer

There are plenty of databases that will partition data without giving up any SQL features, but they cost money.
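
For context, the trade-off in the quoted line is what you tend to hit with hand-rolled, application-level sharding. A minimal Python sketch, where `NUM_SHARDS`, `shard_for`, `put`, `get` and `count_where` are all names made up for illustration and not any particular database's API:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one server


def shard_for(key: str) -> int:
    """Map a key to a shard by hashing it (simple hash partitioning)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(user_id: str, row: dict) -> None:
    shards[shard_for(user_id)][user_id] = row


def get(user_id: str):
    # A point lookup by partition key touches exactly one shard.
    return shards[shard_for(user_id)].get(user_id)


def count_where(predicate) -> int:
    # Anything that is not keyed by the partition key (aggregates, joins)
    # has to visit every shard and be merged in application code.
    return sum(1 for shard in shards for row in shard.values() if predicate(row))


put("alice", {"country": "NO"})
put("bob", {"country": "US"})
put("carol", {"country": "NO"})
norwegians = count_where(lambda row: row["country"] == "NO")
```

The databases mentioned above do the scatter-gather step for you behind a SQL interface; with manual sharding it becomes the application's problem.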

jbellis · on March 3, 2010

> There are plenty of databases that will partition data without giving up any SQL features, but they cost money.

They also either rely on a single huge SAN for storage (single point of failure + expensive as hell) like Oracle RAC, or they require specialized gear like infiniband to reduce intra-node latency like Exadata (starting price: seven figures) or they're analytics databases that are designed for huge queries with latencies to match like Vertica, ParAccel, etc. (Think minutes between data being loaded and being available to query.)

I'll take NoSQL, thanks.

strlen · on March 3, 2010

Afaik Exadata wasn't even originally meant for OLTP. Seems like another case of a high-latency analytics/warehousing system being marketed as a "distributed database". They're now claiming that they can get OLTP grade performance with SSDs on Exadata, but I don't buy it. The promotion of Exadata is ironic, given that I remember one of their engineers claiming (on his personal weblog) not too long ago that OLTP on top of shared nothing was impossible.

As for the high-latency analytics databases (Vertica, Greenplum et al), I don't see much market for them either. Their big advantage over Hadoop was claimed to be the ability of non-programmer analysts to use them (via SQL), but Hive (which now even has JDBC drivers for it, allowing it to work with existing OLAP tools) solves that problem as well.

neilc · on March 3, 2010

Why would the need for "specialized gear like infiniband to reduce intra-node latency" be limited to parallel databases? (I assume you mean "inter-node"...)

wmf · on March 3, 2010

This whole discussion is about parallel databases since that's the only way to scale beyond the performance of one machine.

neilc · on March 3, 2010

Well, replace "parallel databases" with your favorite term for the parallel databases that fall outside NoSQL (VoltDB, Exadata, sharded MySQL, etc). My point being that the alleged need for high-speed interconnects is orthogonal to SQL vs. NoSQL.

jbellis · on March 3, 2010

But it's not, because SQL databases (strictly speaking, any requiring strong consistency... which is mostly RDBMSes) are highly latency sensitive, whereas NoSQL databases like Cassandra design around that by saying "hey, you might not see the most recent write for a few ms, unless you request a higher consistency level." And most apps are fine with that. As a bonus you get multi-datacenter replication with basically the same code, another place most RDBMSes are weak.

It's a classic design hack -- redefining your goal as an easier problem.
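
To make the "higher consistency level" bit concrete, here is a toy Python model of quorum reads and writes. It is a sketch under my own assumptions (the `Replica` and `Cluster` classes and the timestamp scheme are invented for illustration), not how Cassandra is actually implemented:

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # key -> (timestamp, value); the newest timestamp wins
    data: dict = field(default_factory=dict)

    def apply(self, key, ts, value):
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)


class Cluster:
    """Toy model: N replicas, writes ack after W copies, reads ask R copies."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.clock = 0
        self.backlog = []  # writes still to be delivered to lagging replicas

    def write(self, key, value, w):
        self.clock += 1
        for replica in self.replicas[:w]:  # copies made before the ack
            replica.apply(key, self.clock, value)
        self.backlog.append((key, self.clock, value))

    def read(self, key, r):
        # Worst case for staleness: ask the r replicas that get writes last.
        answers = [rep.data[key] for rep in self.replicas[-r:] if key in rep.data]
        if not answers:
            return None
        return max(answers, key=lambda a: a[0])[1]

    def propagate(self):
        # Background replication catching up "a few ms" later.
        for key, ts, value in self.backlog:
            for replica in self.replicas:
                replica.apply(key, ts, value)
        self.backlog.clear()


cluster = Cluster(n=3)
cluster.write("user:1", "old", w=3)
cluster.write("user:1", "new", w=1)      # fast write, only one copy acked
stale = cluster.read("user:1", r=1)      # low consistency level: misses it
fresh = cluster.read("user:1", r=3)      # r + w > n: overlaps the written copy
cluster.propagate()
eventual = cluster.read("user:1", r=1)   # replicas have converged
```

Here `stale` comes back as `"old"` while `fresh` and `eventual` are `"new"`: the cheap read is allowed to lag, and the caller pays extra latency only when it asks for the overlap.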

wmf · on March 3, 2010

Oh, I see what you're saying. Yes, the interconnect is orthogonal (although you could argue that strong consistency requires more complex protocols like 2PC so interconnect latency becomes critical).
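
A back-of-the-envelope Python sketch of why that is. It models 2PC as two coordinator round trips and ignores disk flushes and processing time; the function names and the round-trip figures are illustrative assumptions, not measurements of any product:

```python
def two_phase_commit(participants, rtt_us):
    """Run one toy 2PC round; return (committed, elapsed_us).

    participants: callables returning True (vote commit) or False (vote abort).
    rtt_us: coordinator <-> participant round-trip time in microseconds.
    """
    elapsed_us = 0

    # Phase 1: PREPARE goes out, coordinator waits for every vote.
    votes = [vote() for vote in participants]
    elapsed_us += rtt_us

    # Phase 2: the decision goes out, coordinator waits for every ack.
    committed = all(votes)
    elapsed_us += rtt_us

    return committed, elapsed_us


def max_serial_commits_per_sec(rtt_us):
    """Upper bound on back-to-back commits that contend on the same row."""
    _, elapsed_us = two_phase_commit([lambda: True, lambda: True], rtt_us)
    return 1_000_000 // elapsed_us


# Illustrative round-trip times only.
slow_link = max_serial_commits_per_sec(rtt_us=500)  # 1000
fast_link = max_serial_commits_per_sec(rtt_us=50)   # 10000
```

Under these assumptions, every contended commit waits at least two round trips, so cutting interconnect latency by 10x raises the ceiling on serialized commits by the same factor.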

barrkel · on March 3, 2010

Consistency, Availability, Partition tolerance. NoSQL stores usually sacrifice consistency, and instead settle for eventual consistency. SQL (i.e. RDBMS) stores, with their usual emphasis on transactions, must necessarily sacrifice something else. SQL that doesn't hew to a hard line on consistency and transactions doesn't really have all the features of SQL. This is the distinction that matters most, IMHO, in the NoSQL strand of thinking.
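
A toy Python illustration of that choice, assuming a two-replica store whose link can be cut. `TwoNodeStore` and its `"CP"`/`"AP"` modes are invented for this sketch and do not correspond to any real product:

```python
class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Two replicas joined by a link that can be cut."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False

    def write(self, node_index, value):
        if not self.partitioned:
            for node in self.nodes:  # link is up: update both copies
                node.value = value
            return True
        if self.mode == "CP":
            return False  # refuse: stays consistent, gives up availability
        self.nodes[node_index].value = value  # accept: available, copies diverge
        return True

    def consistent(self):
        return self.nodes[0].value == self.nodes[1].value


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write(0, "v1")
    store.partitioned = True  # cut the link

cp_accepted = cp.write(0, "v2")  # False: write rejected during the partition
ap_accepted = ap.write(0, "v2")  # True: write accepted on one side only
```

After the partition, `cp.consistent()` is still true but the write was lost to the client, while `ap.consistent()` is false until the two sides are reconciled.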

wmf · on March 3, 2010

I admit that the SQL vendors (besides MySQL) made a mistake by putting ACID above scalability; that's clearly not always the right choice. However, CAP still allows a SQL database that is scalable, consistent, and available.

pashields · on March 3, 2010

No, that's exactly what CAP doesn't allow. Unless by scalable you mean non-horizontal scaling. In which case yes, but we already knew that big machines make things fast.

evgen · on March 3, 2010

> CAP still allows a SQL database that is scalable, consistent, and available

Name one please. It seems you are either fundamentally mistaken about what CAP implies or are constraining the "solution" to a clustered system that is effectively a single RDBMS hiding behind lots of tightly-coupled components.

wmf · on March 3, 2010

A tightly-coupled (whatever that means) cluster sounds like a perfectly legitimate way to scale to me.

barrkel · on March 3, 2010

And what if that datacentre goes down? And what if you want reasonable (<50ms) latencies in different parts of the world?
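
Rough numbers behind the latency question, as a Python sketch. The only physical input is the speed of light in fiber (about 200,000 km/s, roughly two thirds of c); the distances are illustrative and real routes add routing and queuing delay on top:

```python
FIBER_KM_PER_S = 200_000  # light in optical fiber, roughly two thirds of c


def min_rtt_ms(distance_km):
    """Lower bound on round-trip time over fiber, ignoring routing and queuing."""
    return 2 * distance_km * 1000 / FIBER_KM_PER_S


def max_distance_km(rtt_budget_ms):
    """Farthest a client can be from the datacentre and still meet the budget."""
    return rtt_budget_ms * FIBER_KM_PER_S / (2 * 1000)


print(min_rtt_ms(10_000))   # 100.0
print(max_distance_km(50))  # 5000.0
```

So a single site cannot give a 50 ms round trip to anyone more than about 5,000 km away, no matter how fast the cluster inside it is.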

freetard · on March 3, 2010

Except for postgresql.

wanderr · on March 5, 2010

It certainly seems that the NoSQL movement is largely fueled by two facts: 1. MySQL sucks 2. Oracle is expensive

I'd love to see Postgresql get more attention, as I feel that they have scaling up and scaling out handled fairly well, whereas MySQL/InnoDB has a hard time even scaling up (which is why the Drizzle project even exists).

wmf · on March 3, 2010

Does Postgres partition across a cluster? That's what we're talking about here.

akvadrako · on March 3, 2010

Yes. This is what Skype does.

tpz · on March 3, 2010

I'm not even so sure that they are unnecessarily complicated to scale, or that you should need or have to replace the features you list. I am sure, however, that everything you list ends up needing to happen if the solution chosen doesn't apply well to the problem at hand.

When you really dig deep down into each and every article on this subject, whether for or against NoSQL, the most important (and yet unstated) fact is this:

It isn't that RDBMS systems scale or don't scale, or that NoSQL systems scale or don't scale; it's that any solution which prioritizes (just for example) consistency and availability is not going to scale effectively for a problem that instead prioritizes availability and partition tolerance.

I am willing to bet that any time a problem and a given solution don't align on the two attributes they've respectively prioritized from CAP, there will be a claim that the solution "doesn't scale". The reality is just that the solution wasn't applicable to the problem at hand. If instead one evaluates solutions which match the problem's CAP priorities, the solutions will scale effectively (modulo their individual pros and cons relative to the other options within the evaluated CAP-priority-matching set of possible solutions, of course.)