Data Center Servers Suck, But Nobody Knows How Much (wired.com)
25 points by nightbrawler on Oct 9, 2012 | 27 comments


Just an echo of the NYTimes non-story.

Bizarrely, this one repudiates its own title near the end, where they contacted a person who knows what they're doing…

Over at Mozilla, Datacenter Operations Manager Derek Moore says he probably averages around 6 to 10 percent CPU utilization from his server processors, but he doesn’t see that as a problem because he cares about memory and networking.

After we contacted him, Moore took a look at the utilization rates of about 1,000 Mozilla servers. Here’s what he found: the average CPU utilization rate was 6 percent; memory utilization was 80 percent; network I/O utilization was 42 percent.

CPU use is irrelevant to most internet servers.

Over the week, I operate my car engine at about 1.2% capacity. Maybe they should write about that.


Ideally the software would be flexible enough to utilize all resources on a machine. A better resource balance would let you reduce the total number of machines and save money. Rather than having 1000 servers at 80% RAM and 6% CPU, you could have many fewer at 80% RAM and 80% CPU.

A cache gives you CPU in exchange for memory. Compression gives you memory in exchange for CPU. I imagine that almost all internet services could be improved by using more compression if they have that much idle CPU. That's essentially what http://code.google.com/p/snappy/ is for.
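
A minimal sketch of that trade (using zlib from the standard library as a stand-in for the Snappy bindings, and a made-up CompressedCache class): spend idle CPU compressing cached values so the cache holds fewer resident bytes.

  import zlib

  class CompressedCache:
      # Toy cache that stores values compressed: CPU spent, RAM saved.
      def __init__(self):
          self._store = {}                               # key -> compressed bytes

      def put(self, key, value: bytes):
          self._store[key] = zlib.compress(value, 6)     # pay CPU on the way in

      def get(self, key) -> bytes:
          return zlib.decompress(self._store[key])       # pay CPU on the way out

  cache = CompressedCache()
  cache.put("page:/front", b"<html>..." * 1000)
  assert cache.get("page:/front").startswith(b"<html>")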

But the root problem is that our software isn't flexible enough. Capitalism is basically working in this case. Right now we don't have the programming technology to adjust software to fit a machine well -- it's cheaper just to throw more servers at it. Re-optimizing and testing software is very, very expensive. But computing is taking up a growing portion of the world's power, and software is getting better and more flexible.

Eventually the economics will adjust so that we're incentivized to get much higher utilization.


CPU helps with two things: throughput and latency. Having idle CPU means you're not constrained on CPU for maximizing throughput. But that doesn't mean you'd be just as well off with a slower CPU at a higher usage rate, because then you'd be sacrificing latency.

It might not be a large proportion out of all the other sources of latency, but it is one.


I can't say I understand that. If RAM utilization so vastly outpaces CPU utilization, there's a significant resource-use inefficiency at play.

Machines consume a baseline amount of power whether they're used or not; that power usage obviously increases with utilization, but ideally you'd have full utilization across the board.

If memory usage is so much higher than CPU usage, I have to wonder what it is that Mozilla is doing wrong with their architecture. Are they using pre-fork-style servers? Are they just provisioning poorly? What is it?

> CPU use is irrelevant to most internet servers.

Why? The CPU is used when the machine does anything. Ideally you're operating the machines at full capacity, less overhead to handle load spikes.

> Over the week, I operate my car engine at about 1.2% capacity. Maybe they should write about that.

What you're doing is inefficient, and they do write about that. The solution is called car sharing and public transportation.


>If memory usage is so much higher than CPU usage, I have to wonder what it is that Mozilla is doing wrong with their architecture. Are they using pre-fork-style servers? Are they just provisioning poorly? What is it?

Or are they just serving up web pages to users? That's RAM and bandwidth heavy, but very CPU light. You still need the machines to scale your load, but you're not going to be using the CPU.

Realistically, for just about any application, you're going to be RAM-bound before you're CPU-bound. The exceptions (off the top of my head) are scientific computing and video rendering, both of which are CPU-heavy and very deterministic in their behavior, which allows for heavy optimization around L2 and L3 cache misses.


> Or are they just serving up web pages to users? That's RAM and bandwidth heavy, but very CPU light. You still need the machines to scale your load, but you're not going to be using the CPU.

That depends very much on the efficiency of your software architecture. A well-architected web app can scale RAM and CPU utilization much more closely together than something modeled on zero-shared-state independent processes.

Additionally, even if your scaling model of RAM before CPU is the only possible one, that doesn't make the utilization efficient; it implies that higher efficiency could still be reached by scaling up RAM per machine.


What you're doing is inefficient, and they do write about that. The solution is called car sharing and public transportation.

I drive 10000 miles per year. That's about average, but nowhere near running an engine at full power 24x7.

I don't have two spare hours each day to cover the public transit time delta (4-8 times longer than driving). Maybe if I laid off HN…


To make the comparison fair, you would have to have your engine idling the rest of the time. As part of a group of 1000 cars.


This whole article feels like it was written in 2008, like the report it refers to. Many of the conservative, boring corporations that used to have lots of underutilized servers in their data centers have been actively moving towards virtualization and cloud-like solutions since then.


Most cloud hosts are memory-bound on the hypervisor as well - this is a consequence of the applications being memory-bound. Low CPU utilization rates are not at all uncommon in cloud environments. Storage bottlenecks are also very common.


CPU use is irrelevant to most internet servers.

If that's the case, then why aren't they using lower power, lower performance CPUs? If their servers are only running at 10% CPU load, then maybe the CPU in their servers is significantly overpowered for their needs and they could get by with some less powerful CPUs in order to save power and money.


Just because you don't need the cycles all of the time doesn't mean you should get rid of them.


True. But it does feel like there's some sort of opportunity here if someone made an architecture tuned to this sort of scenario.


Conflating utilization with efficiency is a mistake made so often and so reliably that it alone probably explains why reporting on this is so sparse.

To understand whether a system is inefficient, I would have to know whether the same computing could be performed at a lower total cost of ownership. The utilization of specific resources is an input, but only for understanding the computing demand. The reason it is so common to "throw servers at it" is that hardware and even power are very cheap compared to human labor costs, and there are significant labor efficiencies in deploying infrastructure in batches.


"After we contacted him, Moore took a look at the utilization rates of about 1,000 Mozilla servers. Here’s what he found: the average CPU utilization rate was 6 percent; memory utilization was 80 percent; network I/O utilization was 42 percent."

Sounds like they are stuffing as much data as possible into RAM for performance reasons, which causes the servers to be memory-bound, and as such they have more CPU available than they need relative to the amount of RAM.

I am (and I think a lot of other people are) in the same situation.


This is a political issue, as the article already indicates: even though you could buy very energy-efficient but worse-performing ARM-based servers with (depending on your needs) a networking/memory/IO focus, it is easier to take the 'no one ever got fired for buying IBM' route. If something goes wrong, it's easier to say that you bought the fastest thing money can buy (who cares if it's running at 6% CPU utilization) than to have gone the 'experimental' route with low-energy server equipment, risking downtime that could be blamed on your decision.

The problem, of course, is that if you go below the enterprises it's probably even worse: high-powered servers with cPanel running at most 400 accounts per server could be replaced by a microcontroller costing a few cents, on average. I ran servers which had 300-400 accounts on them and where the TOTAL across all sites was a few hundred requests per month. I think this is the case for most of the hundreds of thousands of servers running at HostGator, GoDaddy etc. The whole problem is 'peaks'. Management will ask you, 'nice story, but what about peaks?' So you just buy the fastest thing there is, slap VMware on it and hope for the best. I think power management like in mobile phones would not be bad for this purpose: most of the time you switch off all cores but one and drop the clock speed; when a peak comes, you power back up fully.
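
A rough sketch of that mobile-phone-style policy on Linux (hypothetical helper names and threshold; the sysfs paths and available governors vary by kernel and machine, and writing to them needs root):

  import glob

  def set_governor(governor: str):
      # 'powersave' and 'performance' are common cpufreq governors.
      for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
          with open(path, "w") as f:
              f.write(governor)

  def set_extra_cores(online: bool):
      # cpu0 usually cannot be taken offline; toggle the rest.
      for path in glob.glob("/sys/devices/system/cpu/cpu[1-9]*/online"):
          with open(path, "w") as f:
              f.write("1" if online else "0")

  def on_load_sample(requests_per_sec: float, peak_threshold: float = 500.0):
      if requests_per_sec > peak_threshold:   # peak: power up fully
          set_extra_cores(True)
          set_governor("performance")
      else:                                   # quiet: one core, low clock
          set_extra_cores(False)
          set_governor("powersave")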


At the enterprise datacenter level it is not uncommon for sysadmins to use VMware's Distributed Resource Scheduler (DRS) coupled with Distributed Power Management (DPM) to balance workloads and power down underutilized hosts. There are probably similar solutions from other vendors. (Disclosure: VMware employee.)


I'm sure this has been said many times, but "utilization" is not the same as needing the capacity to be available - we could run some servers on a 386 and achieve 90% utilization, but then it would take two minutes to view a web page!

Also, CPUs are known to "step down" their clock speed to be power-efficient when not under load - this isn't reflected in a utilization percentage.

Also, as someone else mentioned, cooling is one of the bigger costs. Personally, I can't wait to see stuff like this used in the industry: http://techcrunch.com/2012/06/26/this-fanless-heatsink-is-th...


Sigh, I wonder why people write stories like this.

Data Center servers don't suck, and I'd bet that most folks running them understand what their utilization is. Blekko has over 1500 servers in its Santa Clara facility and we know pretty much exactly how utilized they are, but that is because we designed the system that way.

It's funny how things have come full circle. Back in the '70s you might have had a big mainframe in the machine room; it was so expensive that the accounting department required you to get maximum use out of the asset, so you had batch jobs that ran 24/7. You charged people by the kilo-core-second, using economics to maximize the value you got from it.

Then minicomputers, and later large multi-CPU microcomputer servers (think Sun E10000 or the IBM PowerPC series), replaced mainframes. They didn't cost as much, so the pressure to 'get the return' was a bit lower; you could run them at 50% utilization and they still cost less than you'd expect to pay for equivalent mainframe power.

Then came the dot-com explosion, and suddenly folks were 'co-locating' a server in a data center because it was cheaper to get decent bandwidth there than to run it the last mile to where your business was. But you didn't need a whole lot of space for a couple of servers, just a few 'U' (1.75" each of vertical space) in a 19" rack. And gee, some folks said: why bring your own server, we can take one machine, put a half dozen web sites on it, and then you could pay like 1/6th the cost of 4U of rack space in the colo. Life was good (as long as you weren't co-resident with a porn or warez site :-)

Then, at the turn of the century, the Sandia 'cheap supercomputer' and NASA Beowulf papers came out, everyone wanted to put a bunch of 'white box' servers in racks to create their own 'Linux cluster', and the era of 'grid' computing was born.

The interesting thing about 'grid' computing, though, was that you could buy 128 generic machines for about $200K which would outperform a $1.2M big-box server. The accountants were writing these things off over 3 years, so the big-box server cost the company $400K/year in depreciation while the server farm cost maybe $70K/year (if you include switches and the like), so it really didn't matter to the accountants if the server farm was less 'utilized': the dollars were so much lower and the compute needs were met.

Now that brings us up to the near-present. These 'server farms' provided compute at hitherto unheard-of low costs, and access to the web became much more ubiquitous. That set up the situation where even if a service only earned you a few dollars per 1,000 requests to this farm, then, like a real farm harvesting corn, you made it up in volume. Drive web traffic to this array of machines (which have a fixed cost to operate) and turn electrons into gold. If you can get above about $5 revenue per thousand queries ($5 RPM), you can pretty much profitably run your business from any modern data center.

But what if you can't get $5 RPM? Or your traffic is diurnal and you get $5 RPM during the day but $0.13 RPM at night? Then your calculation gets more complex. And of course, what if you have 300 servers, 150 of which are serving and 150 of which are 'development'? Then you really have to cover the cost of all of them from the revenue generated by the 'serving' ones.
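
A back-of-the-envelope sketch of that calculation (every number here is invented), just to show where the fixed fleet cost and the diurnal revenue enter:

  serving_servers = 150
  dev_servers     = 150
  cost_per_server = 250.0            # $/month, fully loaded: power, space, amortized hardware
  monthly_cost    = (serving_servers + dev_servers) * cost_per_server

  day_queries   = 60_000_000         # queries/month earned at the daytime rate
  night_queries = 40_000_000         # queries/month earned at the overnight rate
  revenue = (day_queries / 1000) * 5.00 + (night_queries / 1000) * 0.13

  # The 'development' half of the fleet and the $0.13 RPM night traffic both
  # have to be carried by the $5 RPM daytime queries.
  print(f"cost ${monthly_cost:,.0f}/mo, revenue ${revenue:,.0f}/mo")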

Once you start getting into the "business" of web infrastructure it gets a bit more complicated (well, there are more things to consider; the math is pretty much basic arithmetic). And 'efficiency' suddenly becomes something you can put a price on.

Once you get to that point, you can point at utilization and say 'those 3 hours of high utilization made me $X', and suddenly the accountants are interested again. Companies like Google, whose business is information 'crops', were way ahead of others in computing these numbers; Amazon too, because they price and sell this stuff as EC2 and S3 and the rest of AWS, and they need to know which prices are 'good' and which are 'bad.' It is 'new' to older businesses that have yet to switch over to this model. And that is where a lot of folks are making their money: you pay one price for the 'cloud' which is cheaper than you had been paying, so you don't analyze what it would cost to do your own 'cloud'-type deployment. That will go away (probably 5 to 10 years from now) as folks use the savings in that infrastructure to be more competitive.


Let us not forget that running servers at 100% CPU/Memory/Network IO (whatever) is bad too. Sudden spikes or a machine crashing will kill your service.


Sure, but they said 80% RAM and 6% CPU utilization. By both your logic and theirs, it's better to run at 50% RAM and 50% CPU.


Yeah, totally agree.


Data centers have to be designed to handle peak capacity, not average capacity. For example, if a retail business's web site couldn't handle the peak loads in the month before Christmas without degrading response time, they'd go out of business.


CPU performance is simply far beyond what we need in most cases.

What's the average CPU utilization of desktops/laptops across the world? I wouldn't be surprised if it was even lower.

My laptop:

  Cpu(s):  1.4%us,  0.1%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st


See my comment above -- this is true now, but I believe it's because our software and programming methodology aren't flexible enough to use resources efficiently. Whenever you write code you're essentially hard-coding the balance between CPU and memory. That doesn't cause much of an issue on desktop machines, but with the growing number of data centers, it will start to be economical to have more flexibility in our code.


The real issue is the lack of testing on the triple-redundant backup power systems. Utilization is meaningless on a server that isn't turned on.


Hosting companies are in the business of selling servers. In my experience most customers' dedicated servers are hugely underutilised, but customers only get the higher level of support or service if they take out dedicated tin.





