128 Cores Mesh Setup & Memory Subsystem

Starting off the testing, one thing that is extremely intriguing about Ampere’s implementation of their Altra designs is the fact that they’re achieving more than 64 cores whilst still using Arm’s CMN-600 mesh network IP. In our more recent coverage earlier this year of Arm’s newer upcoming CMN-700 mesh network, we wrote about the fundamental structure of the CMN mesh and its building blocks, such as RN-F, HN-F, and components such as CALs.

In a typical deployment, a mesh consists of cross-points (XPs) whose RN-Fs (fully coherent request nodes) connect either directly to a CPU, or to a CAL (component aggregation layer) which can house two CPUs.

Our initial confusion last year with the Quicksilver designs was that 80 cores was more than what the CMN would actually support when configured with the maximum mesh size and two cores per CAL per XP – at least officially. Ampere back then was generally coy about talking about the mesh setup, but in more recent discussions, Arm and Ampere have divulged that it’s also possible to house the CPUs inside of a DSU (DynamIQ Shared Unit), the same cluster design that we find in mobile Arm SoCs with Cortex CPUs.

Ampere has since confirmed the mesh setup in regards to the CPUs: instead of attaching cores directly to the mesh via RN-Fs, or even via a CAL on each XP, they are employing two DSUs, each with two Neoverse-N1 cores, connected to a CAL, connected to a single XP. That means each mesh cross-point houses four cores, vastly reducing the mesh size needed to get to such core counts – this is valid both for the Quicksilver 80-core designs and the new Mystique 128-core designs. The only difference with the Mystique design is that Ampere has now simply increased the mesh size (we still don’t have official confirmation on the exact setup here).
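The arithmetic behind that arrangement can be sketched in a few lines. This is only a back-of-the-envelope illustration: the constants come from the description above, while the function names are our own, and the result is merely a lower bound on core-bearing cross-points, since a real mesh also needs XPs for memory controllers, I/O and other nodes.

```python
import math

# Figures taken from the description above.
CORES_PER_DSU = 2   # each DSU holds two Neoverse-N1 cores
DSUS_PER_CAL = 2    # two DSUs attach to one CAL
CALS_PER_XP = 1     # one such CAL per cross-point


def cores_per_xp():
    """Cores hanging off a single mesh cross-point."""
    return CORES_PER_DSU * DSUS_PER_CAL * CALS_PER_XP


def min_core_xps(total_cores):
    """Lower bound on the cross-points needed just to attach the cores.

    Hypothetical helper: it ignores XPs used for anything other than CPUs.
    """
    return math.ceil(total_cores / cores_per_xp())


for name, cores in (("Quicksilver", 80), ("Mystique", 128)):
    print(f"{name}: {cores} cores -> at least {min_core_xps(cores)} core-bearing XPs")
```

With four cores per XP, the core-bearing part of the mesh only has to grow from 20 to 32 cross-points to go from 80 to 128 cores.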

From a topology perspective, the Altra Max is still a massive monolithic 128-core chip, with competitive core-to-core latencies within the same socket. Our better understanding of the use of DSUs in the design now also explains the unusually low latency figures of 26ns which only occur within core pairs – these would presumably be two sibling cores housed within a single DSU, so coherency and communications between the two don’t have to go out into the mesh, which incurs higher latencies.

We had discussed Ampere’s quite high inter-socket latencies in our review of the Altra last year. As a fresh reminder, this is because the design doesn’t have a single coherency protocol that spans from the mesh network to the remote mesh of the other socket – instead it has to go through an intermediary cache-coherency protocol translation for inter-socket communication, CCIX in this case. In particular this wasn’t very efficient when two cores within a socket have to work on a remote socket cache line – the communication between cores in a DSU is very efficient here, however between cores in a mesh it means doing a round-trip to the remote socket, resulting in pretty awful latencies.

 

The good news for the new Altra Max design is that Ampere was able to vastly improve the inter-socket communication overhead by optimising the CCIX stack. The results are that socket-to-socket core latencies have gone down from ~350ns to ~240ns, and the aforementioned core-to-core latencies within a socket on a remote cache line from ~650ns to ~450ns – still pretty bad, but undoubtedly a large improvement.
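To put those figures in relative terms, here is a minimal sketch of the calculation. The latencies are the approximate values quoted above, and the helper name is our own.

```python
def reduction_pct(before_ns, after_ns):
    """Relative latency reduction, in percent of the original figure."""
    return (before_ns - after_ns) / before_ns * 100.0


# Approximate latencies quoted above, in nanoseconds (before, after).
cases = {
    "socket-to-socket": (350, 240),
    "same socket, remote cache line": (650, 450),
}

for label, (before, after) in cases.items():
    print(f"{label}: {before}ns -> {after}ns "
          f"({reduction_pct(before, after):.1f}% lower)")
```

Both cases come out at roughly a 31% reduction, which fits with the improvement coming from the same place, the CCIX path, in each scenario.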

Latencies within a socket can be higher at the extremes, simply due to the larger mesh. Ampere has boosted the mesh frequency from 1800MHz to 2000MHz in this generation, so there is a slight boost there, as well as in the associated bandwidth.


Looking at the memory latencies of the new part, and comparing the Q80-33 to the M128-30 results at 64KB page size, the first thing that is noticeable is that the new Altra Max system now only has 16MB of SLC, or system level cache, half of the 32MB of the Quicksilver design. This was one of the compromises the company decided to make when increasing the core count and mesh in the Mystique design.

L3/SLC latencies are also slightly up from 30 to 33.6ns. Some of that is the 10% slower CPU clock, but most of it is because of the larger mesh, with more wire distance and more cross-points for data to travel across.

One thing that we hadn’t covered in our initial review was the chip running regular 4K pages – the most surprising aspect here is not the fact that things look a bit different due to the 4K pages themselves, but rather that the prefetchers now behave totally differently. In our first review we believed that Ampere had intentionally disabled the prefetchers due to the sheer core count of the system, but looking at the 4K page results here, they appear to be in line with the behaviour we saw in Amazon’s Graviton2. Notably, at 64KB pages the area/region prefetcher no longer pulls in whole pages in patterns which have strong region locality, such as the “R per R page” pattern (random cache lines within a page, followed by random page traversal). Ampere confirmed that this was not an intentional configuration at 64KB pages, though we didn’t have an exact explanation for it. I theorise it’s maybe a microarchitectural aspect of the N1 cores trying to avoid increased cache pressure at larger page sizes.
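For readers unfamiliar with the “R per R page” pattern, the sketch below generates one such access order. It is our own illustration of the description above rather than the actual test tool: the 64-byte cache line is an assumption, and the function name and parameters are hypothetical.

```python
import random

CACHE_LINE = 64  # bytes; assumed cache-line size


def r_per_r_page(buffer_bytes, page_bytes, seed=0):
    """Return byte offsets that touch every cache line in the buffer once.

    Pages are visited in random order, and within each page all cache
    lines are visited in random order before moving to the next page.
    """
    rng = random.Random(seed)
    lines_per_page = page_bytes // CACHE_LINE
    pages = list(range(buffer_bytes // page_bytes))
    rng.shuffle(pages)

    order = []
    for page in pages:
        lines = list(range(lines_per_page))
        rng.shuffle(lines)
        order.extend(page * page_bytes + line * CACHE_LINE for line in lines)
    return order


offsets = r_per_r_page(buffer_bytes=4 * 4096, page_bytes=4096)
print(len(offsets), "accesses; first few offsets:", offsets[:4])
```

Because every access stays inside the current page until it is exhausted, a region prefetcher that pulls in the whole page hides most of the latency. Passing a `page_bytes` of 64KB instead of 4KB changes how large that region is.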

This weird behaviour also explains the discrepancy in scores between Graviton2 and Altra in SPEC’s 507.cactuBSSN_r, which is actually due to whether the prefetchers are working or not at 64KB versus 4KB pages.


It’s still possible to run the chip in either monolithic, hemisphere, or quadrant modes, segmenting the memory accesses between the various memory controller channels on the chip, as well as the SLC. Unfortunately, at 128 cores and only 16MB of SLC, the quadrant mode results in only 4MB of SLC per quadrant, which is quite minuscule for a desktop machine, much less a server system. Each core still has 1MB of L2, however as we’ll see later in the tests, there are real-world implications of such tiny SLC sizes.
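A quick sketch of how the SLC divides up in each mode, assuming the cache is split evenly between partitions (which is what the 4MB quadrant figure implies). The per-core numbers are our own derived illustration, and the names are hypothetical.

```python
# Number of partitions per mode; an even SLC split is assumed.
MODES = {"monolithic": 1, "hemisphere": 2, "quadrant": 4}

# (core count, total SLC in MB) for the two parts discussed above.
PARTS = {"Q80-33": (80, 32), "M128-30": (128, 16)}


def slc_per_partition_mb(total_slc_mb, mode):
    """SLC visible to one partition in the given mode."""
    return total_slc_mb / MODES[mode]


def slc_per_core_mb(total_slc_mb, cores):
    """Average SLC share per core, independent of the partitioning mode."""
    return total_slc_mb / cores


for part, (cores, slc) in PARTS.items():
    shares = ", ".join(
        f"{mode}: {slc_per_partition_mb(slc, mode):g}MB" for mode in MODES
    )
    print(f"{part}: {shares}; "
          f"{slc_per_core_mb(slc, cores) * 1024:g}KB of SLC per core")
```

On a per-core basis the M128-30 ends up with less than a third of the SLC share of the Q80-33, well below the 1MB of private L2 each core has.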

In terms of DRAM bandwidth, the Altra system on paper is equal to AMD’s EPYC Rome or Milan, or Intel’s newest Ice Lake-SP parts, due to all of them running 8-channel DDR4-3200. Ampere’s advantage comes from the fact that it is able to detect streaming memory workloads and automatically transform them into non-temporal writes, avoiding an extra memory read due to RFO (read for ownership) operations that “normal” designs have to go through. Intel’s newest Ice Lake-SP design has a somewhat similar optimisation, though working more on a cache-line basis and seemingly not able to extract as much bandwidth efficiency as the Arm design. AMD currently lacks any such optimisation, and software has to explicitly use non-temporal writes to be able to fully extract the most out of the memory subsystem – which isn’t as optimal as a generic workload-agnostic optimisation that Ampere or Intel currently employ.
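The paper numbers and the cost of RFO can be sketched as follows. This is a simplified model under stated assumptions: a 64-bit (8-byte) data bus per DDR4 channel, and an RFO that adds exactly one extra read for every written cache line. The function names are our own.

```python
def peak_dram_gbs(channels=8, mega_transfers=3200, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers * bus_bytes / 1000.0


def useful_fraction(reads, writes, rfo):
    """Share of DRAM traffic that a kernel counts as useful.

    reads/writes are streams per element; with RFO, each written line is
    first read from memory, adding one extra read per write.
    """
    actual = reads + writes + (writes if rfo else 0)
    return (reads + writes) / actual


peak = peak_dram_gbs()
print(f"8-channel DDR4-3200 peak: {peak:.1f} GB/s")

# A copy kernel: one read stream, one write stream.
for label, rfo in (("with RFO", True), ("non-temporal writes", False)):
    frac = useful_fraction(reads=1, writes=1, rfo=rfo)
    print(f"copy, {label}: at most {peak * frac:.1f} GB/s of useful bandwidth")
```

In this model a copy loses a third of its achievable bandwidth to RFO reads, and a pure write stream loses half, which is why converting streaming stores to non-temporal writes matters so much in bandwidth tests.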

Between the Q80-33 and M128-30, we’re seeing bandwidth curves that roughly match – up to a certain core count. The new M128-30 naturally goes further to 128 cores, but the resulting aggregate bandwidth also goes further down due to resource contention on the SoC – something very important to keep in mind as we explore more detailed workload results on the next pages.

 

At lower core count loads, we’re seeing the M128-30 bandwidth exceed that of the Q80-33 even though it’s at lower CPU frequencies. Again, this is likely due to the fact that the mesh is now running 11% faster in frequency on the new design. AMD’s EPYC Milan still has access to the most per-core bandwidth in low thread situations.

 


  • mode_13h - Thursday, October 7, 2021 - link

    > x86 still commands 99% of the server market.

    Depends on what you consider the "server market", but AWS is very rapidly switching over. Others will follow.

    Lots of cloud compute just depends on density and power-efficiency. And here's where ARM has a real advantage.
  • Wilco1 - Thursday, October 7, 2021 - link

    According to https://www.itjungle.com/2021/09/13/the-cacophony-... Arm server revenue has been 4-5% over the last few quarters.
  • schujj07 - Friday, October 8, 2021 - link

    Anything under 10% market share in the server world is basically considered a niche player. Right now AMD is over 10% so they are finally seen as an actual player in the market.
  • Spunjji - Friday, October 8, 2021 - link

    Pointing at current market share that resulted from a lack of viable ARM competition isn't a great argument for your prediction that ARM will not gain market share, especially when you're being presented with evidence of viable ARM competition.
  • mode_13h - Thursday, October 7, 2021 - link

    > Before AMD can disrupt Intel in the server,

    *before* ? This is already happening! You can clearly see it in AMD's server marketshare, as well as the price structure of Ice Lake.

    > And now Intel is coming back with Saphire Rapids. Doesn't look good for AMD.

    AMD has Genoa, V-Cache, and who knows what else in the pipeline. Oh, and they can also build an ARM core just as good as anyone (with the possible exceptions of Apple and Nuvia/Qualcomm).
  • yetanotherhuman - Friday, October 8, 2021 - link

    Not even in slight agreement. Different architecture.
  • eastcoast_pete - Thursday, October 7, 2021 - link

    Thanks Andrei, great analysis! IMO, the biggest problem Ampere and other firms that develop server CPUs based on ARM designs is that their natural customers - large, cloud-type providers - pretty much all have their own, in-house designed ARM-based CPUs, and won't buy thousands of third party CPUs unless they do something their own can't do, or nowhere near as well. AWS, Google, MS, and Apple still buy x86 CPUs from Intel or AMD because there is a customer demand for those instances, but also try to shift as much as they can to their own, home-grown ARM server systems. In this regard, has anyone heard any updates about the ARM designs supposedly in development at MS? Maybe Ampere can get themselves bought out by them?
  • name99 - Friday, October 8, 2021 - link

    “own house-designed ARM-based CPU’s”?
    We obviously have Graviton. Apple seem a reasonable bet at some point. Maybe a large Chinese player.

    Do we have any evidence (as opposed to hypotheses and rumors) of Google, Facebook, Microsoft, or most of China? Or other smaller but still large players like Yandex or Cloudflare?
  • Sivar - Thursday, October 7, 2021 - link

    This is a proper old-school deep CPU review.
  • vegemeister - Thursday, October 7, 2021 - link

    Text says Intel Xeon 8380 is running at 205 W power limit, but the table says 270 W. Which is it? I assume 270 W like ARK says?
