Intel Xeon Sapphire Rapids: How To Go Monolithic with Tiles
by Dr. Ian Cutress on August 31, 2021 10:00 AM EST
One of the critical deficits Intel has to its competition in its server platform is core count – other companies are enabling more cores by one of two routes: smaller cores, or individual chiplets connected together. At its Architecture Day 2021, Intel has disclosed features about its next-gen Xeon Scalable platform, one of which is the move to a tiled architecture. Intel is set to combine four tiles/chiplets through its fast embedded bridges, leading to better CPU scalability at higher core counts. As part of the disclosure, Intel also expanded on its new Advanced Matrix Extension (AMX) technology, CXL 1.1 support, DDR5, PCIe 5.0, and an Accelerator Interfacing Architecture that may lead to custom Xeon CPUs in the future.
What is Sapphire Rapids?
Built on an Intel 7 process, Sapphire Rapids (SPR) will be Intel’s next-generation Xeon Scalable server processor for its Eagle Stream platform. Using its latest Golden Cove processor cores which we detailed last week, Sapphire Rapids will bring together a number of key technologies for Intel: Acceleration Engines, native half-precision FP16 support, DDR5, 300-Series Optane DC Persistent Memory, PCIe 5.0, CXL 1.1, a wider and faster UPI, its newest bridging technology (EMIB), new QoS and telemetry, HBM, and workload specialized acceleration.
Set to launch in 2022, Sapphire Rapids will be Intel’s first modern CPU product to take advantage of a multi-die architecture that aims to minimize latency and maximize bandwidth through its Embedded Multi-Die Interconnect Bridge technology. This allows for more high-performance cores (Intel hasn’t said how many just yet), with the focus on ‘metrics that matter for its customer base, such as node performance and data center performance’. Intel is calling SPR the ‘Biggest Leap in DC Capabilities in a Decade’.
The headline benefits are easy to rattle off. PCIe 5.0 is an upgrade over the previous generation Ice Lake PCIe 4.0, and we move from six 64-bit memory controllers of DDR4 to eight 64-bit memory controllers of DDR5. But the bigger improvements are in the cores, the accelerators, and the packaging.
Golden Cove: A High-Performance Core with AMX and AIA
By using the same core design on its enterprise platform Sapphire Rapids and consumer platform Alder Lake, there are some of the same synergies we saw back in the early 2000s when Intel did the same thing. We covered Golden Cove in detail in our Alder Lake architecture deep dive, however here’s a quick recap:
The new core, according to Intel, will offer a 19% IPC gain in single-thread workloads compared to Cypress Cove, which was Intel’s backport of Ice Lake. This comes down to some big core changes, including:
- 16B → 32B length decode
- 4-wide → 6-wide decode
- 5K → 12K branch targets
- 2.25K → 4K μop cache
- 5 → 6 wide allocation
- 10 → 12 execution ports
- 352 → 512-entry reorder buffer
The goal of any core is to process more things faster, and the newest generation tries to do it better than before. A lot of Intel’s changes make sense, and those wanting the deeper details are encouraged to read our deep dive.
There are some major differences between the consumer version of this core in Alder Lake and the server version in Sapphire Rapids. The most obvious one is that the consumer version does not have AVX-512, whereas SPR will have it enabled. SPR also has a 2 MB private L2 cache per core, whereas the consumer model has 1.25 MB. Beyond this, we’re talking about Advanced Matrix Extensions (AMX) and a new Accelerator Interface Architecture (AIA).
So far in Intel’s CPU cores we have scalar operation (normal) and vector operation (AVX, AVX2, AVX-512). The next stage up from that is a dedicated matrix solver, or something akin to a tensor core in a GPU. This is what AMX does, by adding a new expandable register file with dedicated AMX instructions in the form of TMUL instructions.
AMX uses eight 1024-bit registers for basic data operators, and through memory references, the TMUL instructions will operate on tiles of data using those tile registers. The TMUL is supported through a dedicated Engine Coprocessor built into the core (of which each core has one), and the basis behind AMX is that TMUL is only one such co-processor. Intel has designed AMX to be wider-ranging than simply this – in the event that Intel goes deeper with its silicon multi-die strategy, at some point we could see custom accelerators being enabled through AMX.
Intel confirmed that we shouldn’t see any frequency dips worse than AVX – there are new fine-grained power controllers per core for when vector and matrix instructions are invoked.
This feeds quite nicely into discussing AIA, the new accelerator interface. Typically when using add-in accelerator cards, commands must navigate between kernel and user space, set up memory, and direct any virtualization between multiple hosts. The way Intel is describing its new Acceleration Engine interface is akin to talking to a PCIe device as if it were simply an accelerator on board the CPU, even though it’s attached through PCIe.
Initially, Intel will have two capable AIA bits of hardware.
Intel Quick Assist Technology (QAT) is one we’ve seen before, as it was showcased inside special variants of Skylake Xeon’s chipset (that required a PCIe 3.0 x16 link) as well as an add-in PCIe card. This version will support up to 400 Gb/s symmetric cryptography, or up to 160 Gb/s compression plus 160 Gb/s decompression simultaneously, double the previous version.
The other is Intel’s Data Streaming Accelerator (DSA). Intel has had documentation about DSA on the web since 2019, stating that it is a high-performance data copy and transformation accelerator for streaming data from storage and memory or to other parts of the system through a DMA remapping hardware unit/IOMMU. DSA has been a request from specific hyperscaler customers, who are looking to deploy it within their own internal cloud infrastructure, and Intel is keen to point out that some customers will use DSA, some will use Intel’s new Infrastructure Processing Unit, while some will use both, depending on what level of integration or abstraction they are interested in. Intel told us that DSA is an upgrade over the Crystal Beach DMA engine which was present on the Purley (SKL+CLX) platforms.
On top of all this, Sapphire Rapids also supports AVX512_FP16 instructions for half-precision, mostly for AI workloads as part of its DLBoost strategy (Intel was quite quiet on DLBoost during the event). These FP16 commands can also be used as part of AMX, alongside INT8 and BF16 support. Intel now also supports CLDEMOTE for cache-line management.
A Side Word about CXL
Throughout the presentations of Sapphire Rapids, Intel has been keen to highlight it will support CXL 1.1 at launch. CXL is a connectivity standard designed to handle much more than what PCIe does – aside from simply acting as a data transfer from host to device, CXL has three branches to support, known as IO, Cache, and Memory. As defined in the CXL 1.0 and 1.1 standards, these three form the basis of a new way to connect a host with a device.
Naturally it was our expectation that all CXL 1.1 devices would support all three of these standards. It wasn’t until Hot Chips, several days later, that we learned Sapphire Rapids is only supporting part of the CXL standard, specifically CXL.io and CXL.cache, but CXL.memory would not be part of SPR. We're not sure to what extent this means SPR isn't CXL 1.1 compliant, or what it means for CXL 1.1 devices - without CXL.mem, as per the diagram above, all Intel loses is Type-2 support. Perhaps this is more of an indication that the market around CXL is better served by CXL 2.0, which will no doubt come in a later product.
On the next page, we look at Intel's new tiled architecture for Sapphire Rapids.
SystemsBuilder - Tuesday, August 31, 2021
Page 1, Golden Cove: A High-Performance Core with AMX and AIA, text under the AMX picture:
"AMX uses eight 1024-bit registers for basic data operators" should be 1024 BYTE (or 1KByte) not 1024-bit.
AMX has 8 configurable 1KB so-called T registers, i.e. each of the 8 T registers can be configured up to a maximum size of 1KByte, but can also be made smaller through row and column parameters (you set the tile configuration for each tile with the STTILECFG assembly instruction: i.e. rows, columns, BF16/INT8 data type etc).
For more details see AMX section in this document:
SystemsBuilder - Tuesday, August 31, 2021
Can't edit so I have to use a comment to clarify: LDTILECFG is used for setting the tile file configuration of all 8 tiles (# of rows and # of columns per T register; the data type is not set by this instruction), while STTILECFG is used for reading out the current tile file configuration and storing it in memory.
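To make the LDTILECFG/STTILECFG distinction concrete, here is a minimal sketch of the 64-byte memory block those two instructions are described as loading from and storing to. The byte offsets are an assumption based on Intel's published ISA extensions reference, not something stated in the article or the comment, and should be checked against the current documentation. The helper names `pack_tilecfg` and `unpack_tilecfg` are hypothetical and only model the data layout; they do not execute any AMX instruction.

```python
import struct

# Assumed layout of the 64-byte tile configuration block (not from the article):
#   byte  0      palette id
#   byte  1      start_row
#   bytes 2-15   reserved (zero)
#   bytes 16-47  bytes per row for each tile, 16-bit little-endian, 16 slots
#   bytes 48-63  row count for each tile, 8-bit, 16 slots
# Only the first 8 slots are used, matching the 8 T registers.
TILECFG_FORMAT = "<BB14x16H16B"
NUM_TILES = 8
MAX_ROWS = 16
MAX_ROW_BYTES = 64


def pack_tilecfg(shapes, palette=1):
    """Build the block LDTILECFG would read.

    shapes is a list of (rows, bytes_per_row), one entry per tile.
    Note that no data type is recorded, only rows and row width.
    """
    if len(shapes) > NUM_TILES:
        raise ValueError("at most 8 tiles can be configured")
    for rows, row_bytes in shapes:
        if not (0 <= rows <= MAX_ROWS and 0 <= row_bytes <= MAX_ROW_BYTES):
            raise ValueError("tile shape exceeds 16 rows x 64 bytes")
    padding = [0] * (16 - len(shapes))
    colsb = [row_bytes for _, row_bytes in shapes] + padding
    rows = [r for r, _ in shapes] + padding
    return struct.pack(TILECFG_FORMAT, palette, 0, *colsb, *rows)


def unpack_tilecfg(blob):
    """Decode a block such as the one STTILECFG would write to memory."""
    fields = struct.unpack(TILECFG_FORMAT, blob)
    palette = fields[0]
    colsb = fields[2:18]
    rows = fields[18:34]
    return palette, [(rows[i], colsb[i]) for i in range(NUM_TILES)]


# Configure all 8 tiles at their maximum size of 16 rows x 64 bytes = 1 KByte.
cfg = pack_tilecfg([(16, 64)] * 8)
print(len(cfg), unpack_tilecfg(cfg))
```

In this model the configuration carries only row counts and row widths, which is consistent with the point above that the data type is not set by LDTILECFG.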
Ian Cutress - Tuesday, August 31, 2021
My slide from Intel architecture day says 1 Kb = 1 kilo-bit. It literally says that in the slide above the paragraph you're referencing.
So either a typo in the slide, or a typo in the AMX doc.
SystemsBuilder - Tuesday, August 31, 2021
It's a typo from Intel on the slides that you unfortunately propagated.
Should be 1KByte not 1Kb (as in 1 Kbit).
Yeah, this presentation was not one of Intel's finest moments...
Just read the full spec here: https://software.intel.com/content/www/us/en/devel...
There is significantly more detail in the full documentation, including all sorts of limitations, for instance on the number of rows (max 16), which complicates INT8 matrices just as an example... What I would have liked is to be able to fully configure the # of rows and # of columns within the 1KByte for a given data type, to fully use each T register's 1KByte size. We now need to have rectangular NxM matrix tiles instead of the preferable square NxN matrix tiles (and fit them into 16xM = 1024 bytes, solve for M). Symmetric N x N tiles make algorithms easier...
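The "solve for M" arithmetic can be worked through in a few lines. This is a sketch using only the figures quoted in the comment (1 KByte per T register, at most 16 rows); the element sizes of 1 byte for INT8 and 2 bytes for BF16 are standard, and the function names are hypothetical.

```python
TILE_BYTES = 1024          # 1 KByte per T register, per the comment
MAX_ROWS = 16              # row limit, per the comment
ELEMENT_BYTES = {"INT8": 1, "BF16": 2}


def max_tile_shape(dtype):
    """Return (rows, columns) of the largest tile for a data type."""
    row_bytes = TILE_BYTES // MAX_ROWS      # 16 x M = 1024 bytes, so M = 64 bytes
    return MAX_ROWS, row_bytes // ELEMENT_BYTES[dtype]


def largest_square_tile(dtype):
    """Return (N, bytes used) for the largest square N x N tile."""
    rows, cols = max_tile_shape(dtype)
    n = min(rows, cols)
    return n, n * n * ELEMENT_BYTES[dtype]


for dtype in ELEMENT_BYTES:
    rows, cols = max_tile_shape(dtype)
    n, used = largest_square_tile(dtype)
    print(f"{dtype}: full tile {rows}x{cols}, square tile {n}x{n} uses {used} of {TILE_BYTES} bytes")
```

Under these assumptions a full INT8 tile is 16x64 and a full BF16 tile is 16x32, while a square 16x16 tile would occupy only a quarter (INT8) or half (BF16) of the register, which is the inefficiency the comment describes.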
SystemsBuilder - Tuesday, August 31, 2021
Ian, to be clear, the Intel AMX spec in the Intel doc (https://software.intel.com/content/www/us/en/devel...) spends the entire chapter 3 (25 pages) discussing AMX in detail, stating multiple times that each T register is 1KByte and the whole register file size is 8KByte, and also detailing each assembly instruction etc.
Additionally, the first rev of this document was published last summer and the latest rev was published in June this year. During this whole time the T register 1KByte size has never changed (but more details have been included with each revision over the past 12 months).
Further, glibc and various compilers have already included AMX extensions based on this spec. It would be quite catastrophic for them if Intel suddenly cut the T reg size to 1024 bits.
Also, T reg size is not really new news. https://fuse.wikichip.org/news/3600/the-x86-advanc... published a pretty good article already last summer about this (also stating T regs are 1Kbyte).
Lastly, it makes no logical sense to only have 1024bit (128Bytes) tile regs because it is just too small.
Hence, you can safely assume that intel messed up on the slide and adjust your article accordingly. If you still don't believe it, ask intel yourself.
schujj07 - Tuesday, August 31, 2021
One of the rumors for Gen 4 Epyc is 12 channels of DDR5. Now this is just a rumor so it HAS to be taken with a grain of salt. However, if Epyc goes 12 channels, Arm goes 12 channels, and SPR is at 8 channels we could see another instance like Gen 1 & 2 Xeon Scalable not having RAM parity. While going DDR5 does increase the bandwidth, I don't think it does enough to justify not increasing the channels at the same time.
JayNor - Wednesday, September 1, 2021
The four stacks of HBM, each with 8 channels DDR should take care of Intel's bandwidth issues for AI operations.
schujj07 - Wednesday, September 1, 2021
Bandwidth might be OK for AI with HBM on SPR. One thing to remember is that most of these are going to be running on hypervisors. 6 channel RAM became immediately an issue with Xeon Scalable (especially with their old 1TB/socket limit without L series CPUs where you could only get 768GB RAM). If they only have 8 channels when everyone else has 12 channels you cannot put as much RAM into a system for cheap. Most servers are dual socket and if you are using a hypervisor RAM capacity matters A LOT. If you can have 1.5TB (dual socket with 64GB DIMMs) instead of 1TB (dual socket with 64GB DIMMs) that makes a huge difference for running VMs. All the hosts in my datacenter run with 1TB RAM & dual 32c/64t CPUs. We are not CPU limited but we are RAM limited on each host. While VMware can do RAM compression/ballooning, once you start over provisioning RAM you will start running into performance issues. I've read that after about 10-15% over provisioning on RAM you start getting pretty major performance loss. I've experienced VMs basically stall out (like what happened in the early 2000s when your computer used 512MB RAM and you only had 384MB RAM) at a 50% over provision. Basically depending on the workload bandwidth isn't everything.
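The capacity figures in this comment follow from simple multiplication. The sketch below reproduces them; the DIMMs-per-channel values are an assumption inferred from the quoted totals (one DIMM per channel for the 1TB and 1.5TB cases, two per channel for the 768GB per-socket case), and `max_ram_gb` is a hypothetical helper name.

```python
def max_ram_gb(sockets, channels_per_socket, dimm_gb, dimms_per_channel=1):
    """Total installable RAM in GB for a given memory topology."""
    return sockets * channels_per_socket * dimms_per_channel * dimm_gb


# Single socket, 6 channels, two 64GB DIMMs per channel (assumed): the 768GB figure.
print(max_ram_gb(1, 6, 64, dimms_per_channel=2))

# Dual socket, 8 channels per socket, one 64GB DIMM per channel (assumed): 1TB.
print(max_ram_gb(2, 8, 64))

# Dual socket, 12 channels per socket, one 64GB DIMM per channel (assumed): 1.5TB.
print(max_ram_gb(2, 12, 64))
```

With the same DIMM size and count per channel, going from 8 to 12 channels raises capacity by 50%, independent of any bandwidth difference.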
Spunjji - Tuesday, September 7, 2021
At what cost, though?
schujj07 - Tuesday, September 7, 2021
If you have to ask you cannot afford it.