21st Century Server Choices

Lots of people base their server form-factor choice on what they are used to buying: critical database applications get a high-end server, less critical applications a midrange one. High-end machines used to find a home at larger companies, while cheaper servers were typically attractive to SMEs. I am oversimplifying, but those are the clichés that pop up when you speak of server choices.

Dividing the market into who should or should not buy high-end servers is so... 20th century. Server buying decisions today are a lot more flexible and exciting for those who keep an open mind. In the world of virtualization, your servers are just resource pools of networking, storage, and processing. Do you buy ten cheap 1U servers, four higher-performance 2U servers, one “low cable count” blade chassis, or two high-end servers to satisfy the needs of your services?

A highly available service can be set up with cheap and simple server nodes, as Google and many others show us every day. On the other side of the coin, you might be able to consolidate all your services onto just a few high-end machines, reducing the management costs while at the same time taking advantage of the advanced RAS features these kinds of machines offer. It takes a detailed study to determine which strategy is best for your particular situation, so we are not saying that one strategy beats all the others. The point is that the choice between cheap clustered nodes and a few high-end machines cannot be made by simply looking at the size of the company you work for or the "mission critical level" of your service. There are corner cases where the choice is clear, but that is not true for the majority of virtualized datacenters.

So is buying high-end servers instead of two or three times as many 2-socket systems an interesting strategy for your virtualized cluster if you are not willing to pay a premium for RAS features? Until very recently, the answer was simple: no. High-end quad-socket systems were easily three times as expensive or more, yet never offered twice the performance of dual-socket systems. There are many reasons for that. If we focus on Intel, the MP series were always based on mature but not cutting-edge technology. Also, quad-socket systems have more cache coherency overhead, and the engineering choices favor reliability and expandability over performance. That results in slower but larger memory subsystems, and sometimes lower clock speeds too. The result was that the performance advantage of the quad-socket system was in many cases minimal.

At the end of 2006, dual Xeon X5300 systems were more than a match for quad Xeon X7200 systems. And more recently, dual Xeon 5500 servers made the massive Xeon 7400 servers look slow. The most important reason these high-end systems were still bought was their superior RAS features. Another reason is that some decision makers never really bothered to read the benchmarks carefully and simply assumed that a quad-socket system would automatically be faster, since that is what the OEM account manager told them. You cannot even blame them: a modern CIO has to bury his head in financial documents, must solve HR problems, and is constantly trying to explain to upper management why the complex IT systems are not aligned with the business goals. Getting the CIO down from the “management penthouse” to the “cave down under”, also called the datacenter, is no easy task. But I digress.

Virtualization can shatter the old boundaries between midrange and high-end servers. High-end systems can be interesting for the rest of us, the people who do not normally consider these expensive machines. The condition is that the high-end systems can consolidate more services than the dual-socket systems, so performance must be much better. How much better? If we just focus on capital investment, we get the figures below.

Type       Server     CPUs        Memory              Approx. Price
Midrange   Dell R710  2x X5670    18 x 4GB = 72GB     $9,000
Midrange   Dell R710  2x X5670    16 x 8GB = 128GB    $13,000
High-end   Dell R910  4x X7550    64 x 4GB = 256GB    $32,000
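
A minimal back-of-the-envelope sketch of those price ratios (using just the approximate prices from the table above, no TCO factors):

    # How much faster must one high-end R910 be than one midrange R710
    # to break even on purchase price alone? Prices from the table above.
    r910_price = 32000
    for config, price in [("R710 72GB", 9000), ("R710 128GB", 13000)]:
        print(f"vs {config}: {r910_price / price:.1f}x the price")
    # Output: 3.6x the price of the 72GB R710, 2.5x the 128GB one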

So these numbers seem to suggest that we need roughly 2.5 to 3.5 times better performance. In reality, that does not need to be the case. The TCO of two high-end servers is most likely a bit better than that of four midrange servers. The individual components such as the PSUs, fans, and motherboard should be more reliable and thus result in less downtime and less time spent replacing those components. Even if that is not the case, it is statistically more likely that a component fails in a cluster with more servers, and thus more components. Fewer cables and fewer hypervisor updates should also help. Of course, the time spent managing the VMs is probably more or less the same.
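
To illustrate the statistics with a quick hypothetical sketch: if each server independently has, say, a 4% chance of a serious component failure per year (a made-up figure for illustration), the odds of dealing with at least one failure grow quickly with the node count:

    # Probability of at least one server failure per year in a cluster,
    # assuming each node fails independently with probability p.
    # The 4% annual failure rate is a made-up figure for illustration.
    p = 0.04
    for nodes in (2, 4, 10):
        chance = 1 - (1 - p) ** nodes
        print(f"{nodes} servers: {chance:.1%} chance of at least one failure")
    # 2 servers: 7.8%, 4 servers: 15.1%, 10 servers: 33.5%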

While a full TCO calculation is not the goal of this article, it is pretty clear to us that a high-end system should outperform midrange dual-socket systems by at least a factor of two to be an economical choice in a virtualization cluster where hardware RAS capabilities are not the top priority. There is a strong trend toward guaranteeing the availability of the (virtual) machine with easy-to-configure and relatively cheap software techniques such as VMware's HA and Fault Tolerance. The availability of your service is then guaranteed by application-level high availability such as Microsoft's clustering services, load-balanced web servers, Oracle failover, and other similar (but still affordable) techniques.

The ultimate goal is not keeping individual pieces of hardware running but keeping your services running. Of course, hardware that fails too frequently will place a lot of stress on the rest of your cluster, so that is another reason to consider this high-end hardware... if it delivers price/performance-wise. Let us take a closer look at the hardware.

Comments

  • fynamo - Wednesday, August 11, 2010

    WHERE ARE THE POWER CONSUMPTION CHARTS??????

    Awesome article, but complete FAIL because of lack of power consumption charts. This is only half the picture -- and I dare say it's the less important half.
  • davegraham - Wednesday, August 11, 2010

    +1 on this.
  • JohanAnandtech - Thursday, August 12, 2010

    Agreed. But we only got a comparable system a few days before this article was going to be posted, so I kept the power consumption numbers for the next article.
  • watersb - Wednesday, August 11, 2010

    Wow, you IT Guys are a cranky bunch! :-)

    I am impressed with the vApus client-simulation testing, and I'm humbled by the complexity of enterprise-server testing.

    A former sysadmin, I've been an ignorant programmer for lo these past 10 years. Reading all these comments makes me feel like I'm hanging out on the bench in front of the general store.

    Yeah, I'm getting off your lawn now...
  • Scy7ale - Wednesday, August 11, 2010

    Does this also apply to consumer HDDs? If so, is it a bad idea to have an intake fan in front of the drives to cool them, as many consumer/gaming cases have now?
  • JohanAnandtech - Thursday, August 12, 2010

    Cold air comes from the bottom of the server aisle, sometimes as low as 20°C (68°F), and gets blown at high speed over the disks. Several studies now show that this is not optimal for an HDD. In your desktop, the temperature of the air that is blown over the HDD should be higher, as the fans are normally slower. But yes, it is not good to keep your hard disk at temperatures lower than 30°C. Use HDDSentinel or SpeedFan to check on this; 30-45°C is acceptable.
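
    On Linux, a rough sketch of that check can be scripted around smartmontools (assuming smartctl is installed and the disk is /dev/sda; a hypothetical example, not a tested tool):

        import subprocess

        # Read the drive's SMART attributes; the Temperature_Celsius row
        # holds the current temperature in its RAW_VALUE column.
        out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "Temperature_Celsius" in line:
                temp = int(line.split()[9])  # RAW_VALUE column
                verdict = "OK" if 30 <= temp <= 45 else "outside the 30-45C range"
                print(f"Drive temperature: {temp}C ({verdict})")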
  • Scy7ale - Monday, August 16, 2010

    Good to know, thanks! I don't think this is widely understood.
  • brenozan - Thursday, August 12, 2010

    http://en.wikipedia.org/wiki/UltraSPARC_T2
    2 sockets =~ 153GHz
    4 sockets =~ 306GHz
    Like the T1, the T2 supports the Hyper-Privileged execution mode. The SPARC Hypervisor runs in this mode and can partition a T2 system into 64 Logical Domains, and a two-way SMP T2 Plus system into 128 Logical Domains, each of which can run an independent operating system instance.

    Why did Sun not dominate the world in 2007 when it launched the T2? Besides the two 10G Ethernet interfaces built into the processor, it had the most advanced architecture that I know of; see
    http://www.opensparc.net/opensparc-t2/download.htm...
  • don_k - Thursday, August 12, 2010

    "why SUN did not dominate the world in 2007 when it launched the T2?"

    Because it's not actually that good :) My company bought a few T2s, and after about a week of benchmarking and testing it was obvious that they are very, very slow. Sure, you get lots and lots of threads, but each of those threads is oh so very slow. You would not _want_ to run 128 instances of Solaris, one on each thread, because each of those instances would be virtually unusable.

    We used them as web servers... good for that. Or as file servers where you don't need to do any CPU-intensive work.

    The theory is fine and all, but you obviously have never used a T2, or you would not be wondering why it failed.
  • JohanAnandtech - Thursday, August 12, 2010

    "http://en.wikipedia.org/wiki/UltraSPARC_T2
    2 sockets =~ 153GHz
    4 sockets =~ 306GHz"

    You are multiplying threads by clock speed. IIRC, the T2 is a fine-grained multithreaded CPU where 8 (!!) threads share the two pipelines of *one* core.

    Compare that with the Nehalem core, where 2 threads share 4 "pipelines" (sustained decode/issue/execute/retire) per cycle. So basically, a dual-socket T2 is nothing more than 16 relatively weak cores which can execute 2 instructions per clock cycle at most, or 32 instructions per cycle in total. The only advantage of having 8 threads per core is that (with enough independent software threads) the T2 is able to come relatively close to that kind of throughput.

    A dual six-core Xeon has a maximum throughput of 12 cores x 4 instructions, or 48 instructions per cycle. As the Xeon has only 2 threads per core, it is less likely that the CPU will ever come close to that kind of output (in business apps). On the other hand, it performs excellently when you have a number of dependent threads, or simply not enough threads in parallel. The T2 will only perform well if you have enough independent threads.
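
    To make that arithmetic explicit (a quick sketch, using the core counts and issue widths described above):

        # Peak instruction throughput per clock cycle, from the figures above.
        t2_dual   = 2 * 8 * 2  # 2 sockets x 8 cores x 2-wide issue = 32
        xeon_dual = 2 * 6 * 4  # 2 sockets x 6 cores x 4-wide issue = 48
        print(f"Dual T2 peak:   {t2_dual} instructions per cycle")
        print(f"Dual Xeon peak: {xeon_dual} instructions per cycle")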
