A Closer Look at the Server Node

We’ve arrived at the heart of the server node: the SoC. Calxeda licensed ARM IP and built its own SoC around it, dubbed the Calxeda EnergyCore ECX-1000 SoC. This version is produced by TSMC at 40nm and runs at 1.1GHz to 1.4GHz.

Let’s start with a familiar block on the SoC (black): the external I/O controller. The chip has a SATA 2.0 controller capable of 3Gb/s, a General Purpose Media Controller (GPMC) providing SD and eMMC access, a PCIe controller, and an Ethernet controller providing up to 10Gbit speeds. PCIe connectivity cannot be used in this system, but Calxeda can make custom designs of the "motherboard" to let customers attach PCIe cards if requested.

Another component we have to introduce before arriving at the actual processor is the EnergyCore Management Engine (ECME). This is an SoC in its own right, not unlike a BMC you’d find in conventional servers. The ECME, powered by a Cortex-M3, provides firmware management, sensor readouts and controls the processor. In true BMC fashion, it can be controlled via an IPMI command set, currently implemented in Calxeda’s own version of ipmitool. If you want to shell into a node, you can use the ECME's Serial-over-LAN feature, yet it does not provide any KVM-like environment; there simply is no (mouse-controlled) graphical interface.

The Processor Complex

Having four 32-bit Cortex-A9 cores, each with 32 KB instruction and 32 KB data L1 per-core caches, the processor block is somewhat similar to what we find inside modern smarphones. One difference is that this SoC contains a 4MB ECC enabled L2 cache, while most smartphone SoCs have a 1MB L2 cache.

These four Cortex-A9 cores operate between 1.1GHz and 1.4GHz, with NEON extensions for optimized SIMD processing, a dedicated FPU, and “TrustZone” technology, comparable to the NX/XD extension from x86 CPUs. The Cortex-A9 can decode two instructions per clock and dispatch up to four. This compares well with the Atom (2/2) but of course is nowhere near the current Xeon "Sandy Bridge" E5 (4/5 decode, 6 issue). But the real kicker for this SoC is its power usage, which Calxeda claims to be as low as 5W for the whole server node under load at 1.1GHz and only 0.5W when idling.

The Fabric

The last block in the Soc is the EC Fabric Switch. The EC Fabric switch is an 8X8 crossbar switch that links to five XAUI ports. These external links are used to connect to the rest of the fabric (adjacent server nodes and the SFPs) or to connect SATA 2 ports. The OS on top of server nodes sees two 10Gbit Ethernet interfaces.

As Calxeda advertises their offerings with scaling out as one of the major features, they have created fast and high volume links between each node. The fabric has a number of link topology options and specific optimizations to provide speed when needed or save power when the application does not need high bandwidth. For example, the links of the fabric can be set to operate at 1, 2.5, 5 and 10Gb/s.

A big plus for their approach is that you do not need expensive 10Gbit top-of-rack switches linking up each node; instead you just need to plug in a cable between two boxes making the fabric span across. Please note that this is not the same as in virtualized switches, where the CPU is busy handling the layer-2 traffic; the fabric switch is actually a physical, distributed layer-2 switch that operates completely autonomously—the CPU complex doesn’t need to be powered on for the switch to work.

It's a Cluster, Not a Server Software Support & The ARM Server CPU
Comments Locked


View All Comments

  • JohanAnandtech - Wednesday, March 13, 2013 - link

    Ok, good question. I'll look into it, as I am definitely considering a follow-up
  • skyroski - Wednesday, March 13, 2013 - link

    I make performance oriented web apps for a living and I was looking forward to this performance test very much. However, I was quite disappointed at how you have done the "real world" test.

    If you're serving a single site you would never put a Xeon through the performance penalties of virtualisation, so I deem your real world results flawed/unusable.

    Basically, if I was to consider buying a Calxeda server tomorrow, I want to know if I can serve a site faster/better by using the "cluster in a box" solution which ARM's partners are going for or if a single Xeon server with standardised dedicated hardware will serve me and my businesses better.

    The other thing that I would have also tested is SSL request performance because Intel has AES-NI built in and I believe ARM has something similar? I would say the majority of request today for a serious web app/site will be traffic using the SSL protocol, so that would also be one of those deciding factors I would look at.

    If I was a cloud host provider your comparison may contain some truth as their business model would be to presumably let each ARM node out as a VPS alternative, but that isn't what you were testing were you?
  • JohanAnandtech - Wednesday, March 13, 2013 - link

    1. The single site: it is not meant to be an environment of one single site. The reason why we use the same site over and over again, is that it makes it easier to interpret the results and more repeatable. Consider a hosting provider who host many similar - but not the same - LAMP sites.
    The repeatable part is the part that most people don't understand very well: we don't just hit the same URL over and over again. We perform real user interactions and randomize them in realworld patterns (like logging in first and then several real actions) and then getting a repeatable benchmark gets very complex.
    2. The SSL comment is definitely good feedback. We are currently writing the connection code for such SSL websites but also need to find one or more good examples. If your site is a good example, maybe we can use yours (even under NDA if necessary) ?
    3. Lastly, the virtualization overhead of ESXi 5 is very small.
  • Kurge - Wednesday, March 13, 2013 - link

    You know, you can host multiple different LAMP sites on bare metal ;)
  • klmccaughey - Wednesday, March 13, 2013 - link

    It won't be LAMP sites any more though - take a trawl through something like the Linode forums to get an idea of what people are building. You are talking higher concurrency and more likely nginx.

    Someone made a valid comment about database sharding - for web apps this is much more likely as people try to make sure they have failover.

    Whilst initially very disappointed, if you imaging the refresh on the ARM cores over the next 2 years (and considering the rate of change due to the phone market) you might actualy be looking at a beast of a machine in two or three iterations. Imagine if you could buy these off the shelf for under $10k: That feels to me like mission critical failover systems in a box. I can see this taking off in a couple of years.
  • klmccaughey - Wednesday, March 13, 2013 - link

    And kudos for the review - I look forward to the follow-up. This is a space that needs watching!
  • Silma - Thursday, March 14, 2013 - link

    True but do you think Intel will stop product development for the next 3 years? In addition who will have the best fabs then? My guess is Intel.
  • Krysto - Monday, March 18, 2013 - link

    I don't know how fast it actually is, but relative to the ARMv7 architecture, AES should be up to 10x faster on ARMv8.
  • kfreund - Wednesday, March 13, 2013 - link

    Nice job, Johan. Can't wait to see your next one; we will be sure to get you an A15 based system as soon as we get it out! Let the debates begin!
  • kfreund - Wednesday, March 13, 2013 - link

    Regarding Stream performance, this is a known limitation of A9; it just can't handle a lot of concurrent memory requests. A15 will nearly triple the memory bandwidth at same DDR rate.

Log in

Don't have an account? Sign up now