"Such a platform can yield very satisfactory performance for tens or hundreds of thousands of active users"
There are 253 million Internet users in China alone. What happens when your site needs to scale from 0.001% of them using it simultaneously (roughly 2,500 users) to 1% of them (roughly 2.5 million)? Within a month?
"Of course if you index poorly or create some horrendous joins"
Which in the Twitter and Facebook cases is exactly what they have to do on many of their requests. As I've personally found out, relying on a database to do a join across a social network graph is a recipe for disaster. One day you'll be woken up because your database's query planner decided to switch from a hash join to a full table scan against tens of millions of users on every request. Then you'll be left either trying to tweak the query plan back into working order, or actually doing what you should've done in the first place: architect around message queues and custom daemons better suited to your query load.
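To make that concrete, here's a toy sketch of the fan-out-on-write shape I'm talking about (every name and data structure here is made up for illustration): new statuses land on a queue, a worker daemon copies them into precomputed per-follower timelines, and reads become a single key lookup instead of a graph join.

    # All names here are hypothetical. Fan-out-on-write: push each new status
    # through a queue and precompute per-follower timelines, instead of joining
    # the follower graph against the status table at read time.
    from collections import defaultdict, deque
    from queue import Queue

    FOLLOWERS = {"alice": ["bob", "carol"]}              # toy follower graph
    TIMELINES = defaultdict(lambda: deque(maxlen=800))   # precomputed feeds
    new_statuses = Queue()

    def post_status(author, text):
        # The web tier only enqueues; nothing on the request path touches the graph.
        new_statuses.put((author, text))

    def fanout_worker():
        # A custom daemon drains the queue and writes into each follower's timeline.
        while not new_statuses.empty():
            author, text = new_statuses.get()
            for follower in FOLLOWERS.get(author, []):
                TIMELINES[follower].appendleft((author, text))

    post_status("alice", "hello")
    fanout_worker()
    print(TIMELINES["bob"])   # reading a feed is now a key lookup, not a join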
"Even with billions upon billions of help tickets."
At 50 million tweets a day, Twitter would hit 18 billion tweets within a year. Good luck architecting a database system to handle that kind of load. That is, one in which the database system is serving all of the requests (including Twitter streams) and isn't just being used for data warehousing.
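The back-of-the-envelope, for anyone who wants to check the arithmetic:

    tweets_per_day = 50_000_000
    print(f"{tweets_per_day * 365:,} tweets in a year")   # 18,250,000,000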
"Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need"
The disconnect here is that this guy's needs are not the needs of a lot of us who are actively looking at alternatives. He is simply not familiar with the problem domain.
I don't think the original poster was making the argument that Twitter should run fine on a SQL database; in fact, I think he seemed to indicate the opposite. Namely, that large, nominally non-relational datasets that can afford to lose a little data here and there, or at the very least just take a while to save it, are really what you need for serving up a big, fresh pile o' Social Networking.
So, bringing up Twitter or Facebook really doesn't make a good case against RDBMSes as a good tool in the toolbox -- they've got a very unique set of needs that don't apply to a lot of the rest of the world. So, of course, SQL isn't the best solution when you're dealing with trillions of rows of data and don't really want to spend hundreds of millions of dollars on the infrastructure required to guarantee that you never go down and never lose a tweet.
And keep in mind, RDBMSes helped them get to the point where they could enjoy these problems; Twitter probably wouldn't exist in all its current glory if they had spent a year building it to be 'scalable' before launching.
I think a lot of people end up hating RDBMSes and SQL because of one or more of the following: (a) their only experience is with MySQL, which really isn't that awesome; (b) they've been burned by bad schema design; or (c) they don't really get relational algebra or set theory.
For an example of 'bad schema design', I once worked at a company that had indices on nearly every column of their DB, even though almost none of these ever got queried. There was one database table with five indices on three columns, and of course this was the table that logged every single HTTP request processed by the front end. Including API calls. Did I mention that this table was never queried by any part of the application?
It was a poor design decision, and sure enough, it completely torpedoed performance. But the problem wasn't the RDBMS, because it did exactly what it was told to do, no matter how asinine.
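For the curious, the pattern looked roughly like this -- a hypothetical reconstruction with made-up table and column names, but the shape is the same: a write-only log table where every insert has to maintain a pile of indexes nobody ever reads.

    # Hypothetical reconstruction: a request-log table that is write-only in
    # practice, yet carries five indexes that every single INSERT must maintain.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE request_log (
        id        INTEGER PRIMARY KEY,
        ts        TEXT,
        path      TEXT,
        client_ip TEXT
    );
    -- Five indexes over three columns, none of which any query ever uses.
    CREATE INDEX idx_log_ts      ON request_log (ts);
    CREATE INDEX idx_log_path    ON request_log (path);
    CREATE INDEX idx_log_ip      ON request_log (client_ip);
    CREATE INDEX idx_log_ts_path ON request_log (ts, path);
    CREATE INDEX idx_log_path_ip ON request_log (path, client_ip);
    """)

    # Every HTTP request (API calls included) now pays to keep all five indexes
    # current on a table nobody reads.
    db.execute("INSERT INTO request_log (ts, path, client_ip) VALUES (?, ?, ?)",
               ("2010-03-01T12:00:00Z", "/api/statuses", "10.0.0.1"))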
So, in short, RDBMSes aren't the solution to all problems, but they do solve a lot of problems adequately. NoSQL databases also serve an important role in the toolbox, but they are much more narrowly focused.
I'd also suggest that even in the social networking space, there are multiple types of data, and some of it should be stored in an ACID environment (which probably means SQL). You may not care if you can't save a few thousand (or million) status updates to your data store immediately, but you need to care a lot more about your customer profile data (accounts, passwords, etc.), and changes there should be as ACID as possible. Your advertising or subscriber billing models should probably be ACID-backed as well. In both of these cases, that probably means SQL.
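A minimal sketch of that split (all names here are hypothetical): billing changes go through a transaction and either commit entirely or not at all, while status updates just get enqueued for a best-effort store to pick up later.

    # All names hypothetical. Account/billing changes run inside a transaction;
    # status updates are merely enqueued and can tolerate delayed persistence.
    import sqlite3
    from queue import Queue

    accounts = sqlite3.connect(":memory:")
    accounts.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT, balance_cents INTEGER)")
    accounts.execute("INSERT INTO account VALUES (1, 'a@example.com', 0)")
    status_updates = Queue()   # drained later by a best-effort store

    def change_billing(account_id, delta_cents):
        # Money moves inside a transaction: the whole change commits or none of it does.
        with accounts:
            accounts.execute(
                "UPDATE account SET balance_cents = balance_cents + ? WHERE id = ?",
                (delta_cents, account_id))

    def post_status(user_id, text):
        # Best effort: losing one of these is annoying, not catastrophic.
        status_updates.put((user_id, text))

    change_billing(1, 999)
    post_status(42, "hello world")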
However, the original author's point basically boiled down to: if you define scalability as the problems you can scale an RDBMS to solve, RDBMS systems are scalable. I'm not big on arguing the finer points of someone's tautology.
The particulars of a situation determine the scalability of a solution. For a lot of us working at web scale or on interesting new problems, an RDBMS won't scale. Sometimes it won't scale within the constraints we have, but sometimes it won't scale because we won't be able to build the system we're trying to build. His example of a company-internal billing system really only served to highlight the disconnect between the crowd following along well-trod ground, and the people out front doing innovative work.
No. The author provides counterexamples to the argument, "RDBMSes don't scale."
As for the "innovative work" versus "well-trod ground," there are still businesses who need better, more innovative solutions to well-trod problems, and I, for one, am not willing to ignore money on the table. The problem I'm working on works well with a combination RDBMS/key-value system for different pieces of the puzzle.
"[SQL deal for when] "Data consistency and reliability is a primary concern"."
I'm curious: let's say you have a Twitter-scale app that must satisfy consistency and reliability as primary requirements. Is there really a NoSQL solution that can take you there without (effectively) raising the costs to the point where a scalable, money-is-no-object SQL solution would do just as well? (Kinda like how difficult-to-extract North Sea oil became economically viable once oil prices broke through a certain ceiling?)
Well, "reliability" can be sliced in a couple of different ways since that term can cover both the A & P in the CAP options and it can also mean the elimination of single points of failure and an architecture that can degrade gracefully when components fail. Some NoSQL systems let you select the mix of consistency and reliability you need at a rather fine-grained level -- one thing that does distinguish these systems from the traditional RDBMS is that you are almost never in an all or nothing situation regarding any particular part of the data space unless you explicitly want to create that choice to enable other options.
NoSQL gets scalability by giving up a huge feature set and, yes, cross-object consistency. I question the lower-reliability argument, though. Dynamo, one of the original NoSQL systems, allows each instance to specify how reliable it wants writes to be. Most of the NoSQL systems let you tune the reliability factor, just as MySQL and PostgreSQL do.
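A toy illustration of that per-write knob -- this isn't any real client API, just the idea: the caller picks W, the number of replicas that must acknowledge before the write counts as successful.

    # Not a real client API, just the idea: the caller chooses how many of the
    # N replicas must acknowledge a write before it is considered successful.
    import random

    N_REPLICAS = 3
    replicas = [dict() for _ in range(N_REPLICAS)]   # stand-ins for storage nodes

    def write(key, value, w):
        acks = 0
        for replica in replicas:
            if random.random() < 0.9:    # pretend a replica occasionally misses the write
                replica[key] = value
                acks += 1
        return acks >= w                 # durable "enough" by the caller's own standard

    print(write("status:1", "hello", w=1))                 # fast, best effort
    print(write("user:42:email", "a@example.com", w=3))    # demand every replica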
The disconnect here is that only a tiny percentage of sites on the Internet have the requirements of Twitter and Facebook. They are outliers. Using them as an example here is ridiculous.
You don't need to have Twitter's userbase to have Twitter's data problem. Look at someone like FlightCaster -- they're using a NoSQL database to handle the huge amount of data required to be useful to even their first user.
With the advent of cheap, reliable, available commodity hardware and network access, previously difficult data problems are solvable. SQL doesn't make sense for all of those problems.