Marketing and benchmarking can lead to some pretty smart but very misleading results. As promised, I shall give you a very good example of it, one promoted by some of the smartest people in the industry, in fact. But first, a bit of background.
 
If you read our article about the nuts and bolts of virtualization, you might remember that paravirtualization is one of the best ideas in the virtualization space. Basically, a paravirtualized guest OS is adapted so that it hands over control to the hypervisor whenever necessary. One of the big advantages is that the hypervisor has to intervene far less often than it would with unmodified guest OSes. Xen, the flagship of paravirtualization, has a lot of other tricks for delivering excellent performance, such as the fact that drivers in the guest OS are linked to real native Linux drivers (in domain 0).

The basic concept and technology behind Xen are - in theory - superior to other hypervisors which make use of emulation and/or binary translation. It is one of the weapons XenSource (and other Xen-based virtualization solutions) can leverage against the undisputed king of virtualization land, VMware.

XenSource came up with a number of benchmarks which were meant to prove that Xen was by far superior performance-wise. "Near Native Performance" became the new battle cry when the Xen benchmarketing team stormed the VMware stronghold. Unfortunately, some of the benchmarks are nothing more than marketing and have little to do with reality.
 
To be honest, Xen can in fact offer superior performance in some virtualization scenarios; we'll show you in one of our upcoming detailed articles. And the other virtualization vendors have come up with, and will keep coming up with, some pretty insane benchmarks too.
 
The purpose of exposing this benchmark is to help you recognize bad virtualization benchmarks. We have the greatest respect for Ian Pratt and his team, as they made one of the most innovative and valuable contributions to the IT market. But that doesn't mean we shouldn't be critical of the benchmarks they present :-)
 
Here it is:

 
Yes, the benchmark is already a year old, but it is still a very important one to XenSource. Simon Crosby (CTO) talks about it in a blog post of July 2nd, 2008:
 "....at a time when Xen already offers Linux a typical overhead of under 1% (SPECJBB)..."
 
So what is wrong? Several things.
 
No work for the hypervisor
SPECjbb 2005 hardly touches the hypervisor. Less than 1% of the CPU time is spent in the kernel [1]. To put this in perspective: even a CPU-intensive load such as SPECint spends about 5% of its time in the kernel, and a typical OLTP workload makes the OS work for 20 to 30% of the CPU time.
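If you want to verify this for your own workload, you can sample the kernel (system) CPU counters while the benchmark runs, much as vmstat does. Below is a minimal sketch in Java, assuming a Linux guest with the standard /proc/stat layout; the class name and the ten-second sampling window are our own arbitrary choices.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Rough sketch: estimate how much CPU time goes to the kernel (what vmstat
// reports as "sy") by sampling the aggregate "cpu" line of /proc/stat twice.
public class KernelTimeShare {

    // Returns { user+nice, system+irq+softirq, total } jiffies so far.
    private static long[] readCpu() throws Exception {
        String[] f = Files.readAllLines(Paths.get("/proc/stat")).get(0).trim().split("\\s+");
        long user = Long.parseLong(f[1]) + Long.parseLong(f[2]);
        long sys  = Long.parseLong(f[3]) + Long.parseLong(f[6]) + Long.parseLong(f[7]);
        long total = 0;
        for (int i = 1; i < f.length; i++) total += Long.parseLong(f[i]);
        return new long[] { user, sys, total };
    }

    public static void main(String[] args) throws Exception {
        long[] a = readCpu();
        Thread.sleep(10_000);                 // sample while the benchmark is running
        long[] b = readCpu();
        long user = b[0] - a[0], sys = b[1] - a[1], total = b[2] - a[2];
        System.out.printf("kernel share of all CPU time:  %.1f%%%n", 100.0 * sys / total);
        System.out.printf("kernel share of busy CPU time: %.1f%%%n", 100.0 * sys / (user + sys));
    }
}
```

If the kernel share stays around 1%, as it does for SPECjbb, you know in advance that the hypervisor will barely be exercised.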
  
The hypervisor hardly ever intervenes, so a SPECjbb test is one of the worst ways of showing how powerful your virtualization technology is. It is like sitting in a Porsche in a traffic jam and saying to your companion: "Do you feel how much horsepower is available in my newest supercar?"
 
No I/O
SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation) evaluates the performance of server-side Java by emulating a three-tier client/server system with the emphasis on the middle tier. Instead of a potentially disk-intensive database system, SPECjbb uses tables of objects implemented with Java Collections, so no separate database is needed.
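To make that concrete, here is a toy sketch (our own code, not SPECjbb source; all class and method names are invented) of what "tables of objects, implemented by Java Collections" boils down to: every "database" operation is just a put or get on an in-memory map in the JVM heap, so no disk or network is ever touched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration of an in-memory "table": the whole data set lives in the
// JVM heap, so each transaction is pure CPU + memory work.
public class InMemoryOrderTable {
    static final class Order {
        final long id;
        final int quantity;
        Order(long id, int quantity) { this.id = id; this.quantity = quantity; }
    }

    private final Map<Long, Order> orders = new HashMap<>();   // the "table"
    private final AtomicLong nextId = new AtomicLong();

    long newOrder(int quantity) {
        long id = nextId.incrementAndGet();
        orders.put(id, new Order(id, quantity));               // "insert" = a map put
        return id;
    }

    Order lookup(long id) { return orders.get(id); }           // "select" = a map get

    public static void main(String[] args) {
        InMemoryOrderTable warehouse = new InMemoryOrderTable();
        long id = warehouse.newOrder(3);
        System.out.println("order " + id + ", quantity " + warehouse.lookup(id).quantity);
    }
}
```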
 
For native testing, that is wonderful. It means that you do not have to set up an extremely expensive disk system (contrary to TPC-C), and it makes the benchmark a lot easier and faster to run. SPECjbb 2005 gives you an idea of how your specific CPU + memory + JVM + JVM-tuning combination can perform. If your own Java application looks a lot like SPECjbb, keeping the disk system out of the benchmark is not such a bad idea: you can always size your disk system later.
 
In a virtualized benchmark scenario, however, excluding (disk) I/O is a very bad idea. Depending on the virtualization scenario (RAID card, available drivers, 32 vs. 64 bit, etc.), the hypervisor overhead of accessing the disk can range from insignificant to "a complete performance disaster".
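If you want to get a feel for that overhead on your own setup, a crude but telling check is a small synchronous-write test run once natively and once inside a VM on the same hardware. Below is a minimal sketch (our own toy code, not part of any benchmark suite; the file name, block size and write count are arbitrary):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Tiny synchronous-write microbenchmark: compare the native result with the
// result inside a VM to see what the hypervisor's I/O path costs.
public class SyncWriteBench {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("syncwrite.tmp");
        int writes = 2000;
        ByteBuffer block = ByteBuffer.allocate(4096);          // one 4 KB block per write
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long start = System.nanoTime();
            for (int i = 0; i < writes; i++) {
                block.rewind();
                ch.write(block, (long) i * block.capacity());
                ch.force(false);                                // force each write to disk
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d synchronous 4 KB writes in %.2f s (%.0f IOPS)%n",
                    writes, seconds, writes / seconds);
        }
    }
}
```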
 
To make a long story short, SPECjbb 2005 produces no disk or network activity, and those are the two main factors that make some virtualization solutions stumble and fall. Cut those out of the benchmark and the value of your "hypervisor comparison" drops like a bank share in a credit crisis.
 
Native performance is too low
One of the weaknesses of the current Xen 3.x is that it does not support large pages. The use of large pages improves the performance of server workloads, and SPECjbb is no exception: it is well known that large pages can boost the performance of SPECjbb 2005 by about 20%. So it is pretty clear that large pages were not enabled when XenSource tested native performance: performance was lower than it could be. In the real world, if performance really matters, large pages will be enabled, especially now that both the Windows and Linux platforms make it so much easier.
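If you want to avoid the same trap in your own native runs, it is worth checking that large pages are actually reserved before starting the JVM with -XX:+UseLargePages (the JVM typically falls back to ordinary small pages when none are available). A minimal sketch, assuming a Linux host that exposes the HugePages counters in /proc/meminfo:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Quick pre-flight check: are any large (huge) pages reserved on this Linux
// host? They are reserved with e.g. "sysctl -w vm.nr_hugepages=1024" (as root),
// and the JVM only benefits when it is started with -XX:+UseLargePages.
public class LargePageCheck {
    public static void main(String[] args) throws Exception {
        long hugePages = Files.lines(Paths.get("/proc/meminfo"))
                .filter(l -> l.startsWith("HugePages_Total"))
                .mapToLong(l -> Long.parseLong(l.replaceAll("\\D+", "")))
                .findFirst()
                .orElse(0);
        System.out.println(hugePages == 0
                ? "No huge pages reserved: the run will use small pages."
                : "Huge pages reserved: " + hugePages);
    }
}
```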
 
So what does the XenSource SPECjbb benchmark prove? That if the hypervisor sees almost no action and you have an application that does no I/O whatsoever (unlike any real-world server application), virtualized performance is very close to native performance ... native performance that is not well tuned, that is. In other words: the message that this benchmark brings is close to meaningless.
  
Don't get us wrong. This is not a post against virtualizing your workloads. Most of the virtualized workloads out there perform very close to native, and all the goodies that virtualization brings (fast provisioning and incredible cost savings, to name a few) make it more than worth paying a small performance penalty. But quite a few of the benchmarks out there are quite misleading. If a vendor really wants to show how powerful its hypervisor is, it has to show a benchmark that stresses the hypervisor, not one that leaves it alone. We will be showing quite a few such benchmarks soon.
[1] Measured with vmstat in Linux in our lab. "Evaluating Non-deterministic Multi-threaded Commercial Workloads" (2002) by Alaa R. Alameldeen, Carl J. Mauer, Min Xu, Pacia J. Harper, Milo M. K. Martin, Daniel J. Sorin, Mark D. Hill, and David A. Wood shows similar results for SPECjbb 2000.
Comments

  • jeffb - Saturday, August 23, 2008

    Johan, gmyx, and Bert all make good points. I'd like to add to those, last one first:

    Bert, thanks for linking to my blog! You're certainly free to look at it in the same way as Johan, but that blog didn't make any claims about "typical" performance. Instead it was narrowly focused on SPECjbb2005 for 2 reasons: to counteract FUD about ESX performance on that particular workload, and to show the strength of the ESX scheduler under a CPU over-committed scenario. To see the big picture, you'll need to look at it as one data point among many.

    gmyx, of course I completely agree with you. Multi-VM tests are very important, especially when the total number of virtual CPUs exceeds the number of CPU cores. However, it is not easy to design a multi-VM benchmark with varied workloads. I admit to some bias here, but the only success in this area I know of is VMmark: http://www.vmware.com/products/vmmark/

    Johan, thanks for the insightful article! I especially like the bits on large pages and the importance of getting good native performance first. My only quibble: there are other ways to see virtualization overhead besides workloads with a significant kernel component. Some virtualization products may look beautiful running one benchmark in a single VM, but fall apart when many VMs are run in a resource-constrained environment, even if the workloads don't have a big kernel component. I suggest tests with CPU and memory over-commitment to see what happens when you try to squeeze the most out of a virtualization solution.
  • Bert Bouwhuis - Wednesday, August 20, 2008

    Last Monday, VMware published similar benchmark results using SPECjbb2005 (see http://blogs.vmware.com/performance/2008/08/esx-ru...). Should we look at those the same way?
  • davecason - Monday, August 18, 2008

    Johan,

    Change "proof" to "prove"

    This:
    Xensource came up with a number of benchmarks which had to proof that Xen was by far superior
    Should be this:
    Xensource came up with a number of benchmarks which had to prove that Xen was by far superior

    Feel free to knock out this comment after the fix.
  • gmyx - Monday, August 18, 2008

    The question is: what happens with multiple VMs running the code? This to me is the real benchmark. I know many servers where I work that we consolidated into a single machine because they were not very busy.

    Doing a 1 VM to 1 host comparison is, in my opinion, not very useful. I would like to know how a VM scales from a 1-to-1 config to 2-to-1 and higher.

    I remember taking a MS course a year ago that had us running 4 VMs at once, a PDC, a SDC, a W2K client and a WXP client. It was not very fast on Virtual PC 2004/2007.

    Looking forward to your VM comparison article.
  • skiboysteve - Sunday, August 17, 2008

    great article.
