Golden Cove Microarchitecture (P-Core) Examined

Section by Andrei Frumusanu

Within Alder Lake and Sapphire Rapids, we find Intel’s brand-new performance core (P-Core) microarchitecture, called Golden Cove. The new microarchitecture and core design follows the Willow Cove core in Tiger Lake, the Sunny Cove core in Ice Lake, and the derivative Cypress Cove core in Rocket Lake.

Although the new design is a follow-up to the last few years of core microarchitectures, Intel sees Golden Cove as a major step-function update, with massive revamps of the fundamental building blocks of the CPU, going as far as to say it will enable performance for the next decade of compute, meaning Intel seemingly plans to use it as the basis for the evolution of its cores for at least a few years to come.

The last time we saw major upgrades to Intel's core microarchitecture was with Ice Lake's Sunny Cove, which made notable increases to the design's out-of-order execution capabilities and depth, while Willow Cove and Cypress Cove were more iterative designs focusing on the memory subsystem.

Golden Cove, in comparison, makes gigantic changes to the microarchitecture's front-end – in fact, Intel states that this is the largest microarchitectural upgrade to the core family in a decade, similar in magnitude to what Skylake brought many years ago.

Starting off with the most directly obvious change: Intel is moving from a 4-wide decode machine to a 6-wide microarchitecture, a first amongst x86 designs and a major design focus point. Over the last few years there has been plenty of discussion about decoder widths and the nature of x86's variable-length instruction set, which makes it difficult to design wider decoders compared to, say, a fixed-length ISA like Arm's, where adding decoders is relatively easier to do. Notably, last year AMD's Mike Clark noted that while it's not a fundamental limitation, going beyond 4 decoders creates practical drawbacks in the form of added complexity and, most importantly, added pipeline stages. For Golden Cove, Intel has decided to push forward with these changes, and one compromise that had to be made is that the design now adds a stage to the mispredict penalty of the microarchitecture, with the best case going up from 16 cycles to 17 cycles. We asked if there is still a special-case decoder layout as in previous generations (such as the 1 complex + 3 simple decoder setup), however the company wouldn't delve deeper into the details at this point in time. To feed the decoders, the fetch bandwidth going into them has been doubled from 16 bytes per cycle to 32 bytes per cycle.
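
As an aside, the cost of a mispredicted branch is easy to see from software. The sketch below is not Intel's methodology, just a generic microbenchmark (assuming GCC or Clang on x86-64, built at -O1 so the compiler keeps the branch rather than converting it to a cmov) that times the same loop with an unpredictable and then a perfectly predictable branch; the per-element gap is roughly the mispredict rate multiplied by the mispredict penalty.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sums elements above a threshold; with random data the branch is ~50/50 and
// mispredicts constantly, with sorted data it becomes almost perfectly predictable.
long long sum_above(const std::vector<int>& data, int threshold) {
    long long sum = 0;
    for (int v : data) {
        if (v >= threshold)   // the branch under test
            sum += v;
    }
    return sum;
}

double time_ms(const std::vector<int>& data) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long long sink = sum_above(data, 128);
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int& v : data) v = rng() % 256;   // uniform 0..255, threshold 128 -> ~50% taken

    double random_ms = time_ms(data);      // hard-to-predict branch
    std::sort(data.begin(), data.end());
    double sorted_ms = time_ms(data);      // same work, predictable branch
    printf("random: %.1f ms, sorted: %.1f ms\n", random_ms, sorted_ms);
}
```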

Intel states that the decoder is clock-gated 80% of the time, with the core instead relying on the µOP cache. That structure has also seen extremely large changes this generation: it has grown from 2.25K entries to 4K entries, almost doubling, mimicking a similarly large increase we saw in the move from AMD's Zen to Zen 2, increasing the hit rate and further avoiding the more costly decode path.

Although the I-cache has remained the same in size at 32KB, Intel has improved effective latency for larger code footprints by doubling the L1 iTLB from 128 entries to 256 entries (and from 16 to 32 entries for 2M/4M hugepages); with 4KB pages, 256 entries cover 1MB of code before the core has to fall back to the second-level TLB.

Because the general misprediction penalty has gone up this generation, Intel is counter-acting it by massively improving branch prediction accuracy. First of all, the L2 BTB (branch target buffer) has more than doubled, growing from 5K entries to 12K entries, making this by far the biggest such structure in the industry that we're aware of (Zen 3: 6.5K, Cortex-A77 and newer: 8K, Apple: unknown). The branch predictors themselves are also said to be smarter, with improved accuracy, although vendors are always quite secretive about these designs, so not many details were divulged.

The decoders and the µOP cache feed a µOP queue. Beyond the 50% wider decode, the µOP cache also sees increased bandwidth, now able to emit 8 µOPs per cycle instead of 6. The µOP queue itself has seen a structural change: when only a single SMT logical thread is loaded, the structure can be aggregated and the usable depth doubles from 70 entries to 144 entries, while with both threads active it goes up slightly from 70 to 72 entries per thread.

In the mid-core there have also been quite large changes. The core here widens from 5-wide to 6-wide, meaning that in theory this is at most a 6-IPC microarchitecture, though we're talking about µOPs here and not actual instructions.

Intel notes that it is doing an increased amount of dependency resolution at the allocation stage, eliminating instructions (for example move eliminations and zeroing idioms) that would otherwise have to be emitted to the back-end execution resources. Intel didn't want to elaborate much here, only stating that this capability has increased quite a lot generation over generation.

The out-of-order instruction window of Golden Cove is massively larger than Sunny Cove's, with the re-order buffer (ROB) growing from 352 to 512 entries. That is over double that of AMD's Zen 3 µarch, and really only second to Apple's core microarchitecture, which we've measured at around ~630 instructions. Traditionally, we've been told that increasing the ROB has diminishing returns and correlates with quite an increase in power consumption, but on the other hand we've seen Apple achieve extremely large out-of-order windows. It's not clear exactly how Intel managed to grow the structure by such a large degree in a single jump, but it's a very impressive and admittedly unexpected design characteristic of the Golden Cove cores.
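
For reference, out-of-order window sizes such as that ~630-instruction Apple figure are typically estimated with microbenchmarks rather than disclosed. The sketch below is a rough, simplified version of that technique (not AnandTech's or Intel's exact harness): two independent pointer chases miss to DRAM, separated by N filler NOPs, and the point where N no longer fits in the window shows up as a jump in time per iteration. It assumes GCC or Clang on x86-64, -O2, and enough RAM for two ~256MB buffers.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Emit N one-byte NOPs; they occupy ROB entries but few other resources.
template <int N>
inline void filler_nops() {
    asm volatile(".rept %c0\n\tnop\n\t.endr" ::"i"(N));
}

// Build a random cyclic pointer chase over n pointers (~8*n bytes).
std::vector<void*> make_chain(size_t n, unsigned seed) {
    std::vector<void*> buf(n);
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937_64{seed});
    for (size_t i = 0; i + 1 < n; ++i) buf[idx[i]] = &buf[idx[i + 1]];
    buf[idx[n - 1]] = &buf[idx[0]];
    return buf;
}

// Two independent DRAM misses per iteration, each followed by N filler NOPs.
// While "miss A + N fillers + miss B" fits in the OoO window the two misses
// overlap; once N grows past the window they serialise and ns/iter jumps.
template <int N>
double ns_per_iter(void** a, void** b, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        a = (void**)*a; filler_nops<N>();
        b = (void**)*b; filler_nops<N>();
    }
    auto t1 = std::chrono::steady_clock::now();
    asm volatile("" ::"r"(a), "r"(b));   // keep the chases alive
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    const size_t n = 1 << 25;   // 32M pointers = 256MB per chain, well past the LLC
    auto chainA = make_chain(n, 1), chainB = make_chain(n, 2);
    void** a = (void**)&chainA[0];
    void** b = (void**)&chainB[0];
    const long iters = 100000;
    printf("N=128: %.0f ns/iter\n", ns_per_iter<128>(a, b, iters));
    printf("N=256: %.0f ns/iter\n", ns_per_iter<256>(a, b, iters));
    printf("N=384: %.0f ns/iter\n", ns_per_iter<384>(a, b, iters));
    printf("N=512: %.0f ns/iter\n", ns_per_iter<512>(a, b, iters));
    printf("N=640: %.0f ns/iter\n", ns_per_iter<640>(a, b, iters));
}
```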

Moving onto the back-end, Intel has increased its execution width from 10 ports to 12 ports. Intel continues to use the merged execution port / reservation station design that characterises its cores, in contrast to competing µarchs, and even Intel's own E-core microarchitectures, which use separate integer and FP/vector ports and pipelines.

On the integer side of things, there's now a fifth execution port and pipeline with simple ALU and LEA capabilities, which in theory makes this the widest x86 core in raw ALU throughput, although in mixed code, where some of these ports are also occupied by other operations, the effective ALU throughput will be somewhat lower than that peak.
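
Peak ALU throughput is something that can be probed from software. The following is a minimal sketch (assuming GCC or Clang inline assembly on x86-64, not a rigorous methodology) that issues six independent single-cycle adds per loop iteration; dividing the achieved adds-per-second by the core clock gives an estimate of sustained adds per cycle, although the loop overhead itself also occupies ALU slots, so the result lands a bit under the theoretical maximum.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t a = 0, b = 0, c = 0, d = 0, e = 0, f = 0;
    const uint64_t iters = 1'000'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iters; ++i) {
        // Six independent single-cycle adds per iteration; inline asm stops the
        // compiler from folding or vectorising them away.
        asm volatile(
            "add $1, %0\n\t" "add $1, %1\n\t" "add $1, %2\n\t"
            "add $1, %3\n\t" "add $1, %4\n\t" "add $1, %5\n\t"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d), "+r"(e), "+r"(f));
    }
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    // Divide adds-per-second by the core clock (in Hz) to estimate adds/cycle;
    // the loop's own increment/compare/branch also consumes ALU slots, so the
    // figure will land a little under the core's true peak.
    printf("%.2f billion adds/s\n", 6.0 * iters / secs / 1e9);
}
```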

On the FP capabilities side of things, there are now dedicated FADD units, which are more efficient and lower latency than using the FMA units (when applicable). The FMA units have also gained support for FP16 data types, although as this falls under the AVX-512 umbrella, consumer cores won't see that capability.
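
For what the new data type looks like from software, here's a minimal sketch using the AVX512-FP16 intrinsics, assuming a compiler that supports them (e.g. GCC 12+ or a recent Clang with -mavx512fp16) and a CPU with the extension enabled; consumer Alder Lake parts won't run it.

```cpp
#include <cstdio>
#include <cstring>
#include <immintrin.h>

int main() {
#ifdef __AVX512FP16__
    // 32 half-precision lanes per 512-bit register.
    __m512h a = _mm512_set1_ph((_Float16)1.5f);
    __m512h b = _mm512_set1_ph((_Float16)2.0f);
    __m512h c = _mm512_fmadd_ph(a, b, b);   // c = a*b + b, computed natively in FP16
    _Float16 lanes[32];
    std::memcpy(lanes, &c, sizeof(lanes));
    printf("lane 0 = %f\n", (float)lanes[0]);   // expect 5.0
#else
    puts("built without AVX512-FP16 support");
#endif
    return 0;
}
```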

In general, besides the added 5th ALU port/pipeline, the back-end execution resources of Golden Cove appear to be somewhat more conservative compared to the more substantial front-end changes of the core.

On the load & store front, we see the addition of a dedicated execution port with a load AGU, increasing the possible number of loads per cycle from 2 to 3, matching the upgrades we've seen from many other microarchitectures over the last few years.

For AVX-512-enabled implementations of the µarch, the core is able to load up to 1Kbit (two 512-bit loads) per cycle from the L1, while consumer cores still see a 50% improvement for AVX2, with three 256-bit loads per cycle.

Intel, like many others, has noted that modern workloads exhibit increased memory-level parallelism, so the company has increased the number of outstanding memory transactions from the core by growing the L1D fill buffers from 12 to 16, and by increasing the load and store buffers by undisclosed amounts. Similarly to AMD, to better address sparse memory access workloads, the µarch has seen a doubling of page walkers from 2 to 4, which look up page table entries on TLB misses.

The L1 DTLB has grown from 64 entries to 96 entries. Intel didn't mention the L2 TLB, which presumably means it remains at 2048 entries.

Intel has also improved the prefetchers, noting things such as “better stride prefetching in L1”, though beyond that the company hasn't divulged many other details. We've seen a lot of advances in the industry in terms of prefetchers, so it'll be interesting to see how Golden Cove compares to the competition here.

The L2 cache in Golden Cove remains at 1.25MB for consumer core variants, while the server variant grows it to 2MB, compared to 1.25MB in the current Ice Lake-SP chips. The number of outstanding requests for parallel data accesses to the memory subsystem has grown from 32 misses to 48, further helping to increase the MLP capabilities of the core.

Intel again notes that there have been prefetching improvements in the L2. What the company calls “full-line-write predictive bandwidth optimisation” is likely the same mechanism we saw in our review of Ice Lake-SP, where the core can greatly improve bandwidth by avoiding RFO reads of cache lines that are going to be fully rewritten, which should greatly speed up common generic operations such as memory copies.
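
The hardware optimisation is transparent to software, but the reason it matters can be illustrated with the existing software-side equivalent: non-temporal streaming stores, which likewise skip the RFO read. A rough sketch (assuming GCC or Clang with -O2 -mavx2, and a buffer far larger than the caches):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <immintrin.h>

using FillFn = void (*)(uint8_t*, size_t);

// Regular AVX2 stores: each destination line is first read for ownership (RFO),
// then written back, roughly doubling the memory traffic of a pure fill.
static void fill_regular(uint8_t* dst, size_t bytes) {
    __m256i v = _mm256_set1_epi8((char)0xAB);
    for (size_t i = 0; i < bytes; i += 32)
        _mm256_store_si256(reinterpret_cast<__m256i*>(dst + i), v);
}

// Non-temporal stores: write-combine full lines and skip the RFO read entirely.
static void fill_streaming(uint8_t* dst, size_t bytes) {
    __m256i v = _mm256_set1_epi8((char)0xAB);
    for (size_t i = 0; i < bytes; i += 32)
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), v);
    _mm_sfence();   // make the streaming stores globally visible
}

static double gbps(FillFn fill, uint8_t* dst, size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    fill(dst, bytes);
    auto t1 = std::chrono::steady_clock::now();
    return bytes / std::chrono::duration<double>(t1 - t0).count() / 1e9;
}

int main() {
    const size_t bytes = 1ull << 30;   // 1GB, far larger than any cache level
    uint8_t* dst = static_cast<uint8_t*>(std::aligned_alloc(64, bytes));
    if (!dst) return 1;
    std::memset(dst, 0, bytes);        // fault the pages in before timing
    printf("regular stores:   %.1f GB/s\n", gbps(fill_regular, dst, bytes));
    printf("streaming stores: %.1f GB/s\n", gbps(fill_streaming, dst, bytes));
    std::free(dst);
}
```

On most systems the streaming variant lands noticeably higher, because with ordinary stores every destination line generates both a read and a writeback on the memory interface; that write-only behaviour is, presumably, what the new predictive hardware achieves without any code changes.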

The aggregated changes of the new Golden Cove microarchitecture amount to a geomean IPC increase of 19% versus Cypress Cove (Rocket Lake), measured over a set of workloads including SPEC CPU 2017, SYSmark 25, Crossmark, PCMark 10, WebXPRT3, and Geekbench 5.4.1. We can see in the graph that there are outlier workloads with up to +60% IPC, but also low outliers where the new design doesn't improve things much or even sees regressions, which is odd.
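
For clarity on how a single geomean figure like that is produced from such a mixed set of workloads, here's a trivial sketch; the per-workload ratios in it are made-up placeholders, not Intel's or our data.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Geometric mean of a set of ratios, computed via logs for numerical stability.
double geomean(const std::vector<double>& xs) {
    double log_sum = 0.0;
    for (double x : xs) log_sum += std::log(x);
    return std::exp(log_sum / xs.size());
}

int main() {
    // Per-workload IPC ratios (new core / old core). These are made-up
    // placeholder numbers purely to show the calculation, not measured data.
    std::vector<double> ipc_ratios = {1.60, 1.25, 1.18, 1.10, 1.02, 0.97};
    printf("geomean uplift: %+.1f%%\n", (geomean(ipc_ratios) - 1.0) * 100.0);
}
```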

The front-end changes of Golden Cove indeed look like a major step-function: 50% wider decode, 33% wider µOP cache output alongside almost double the capacity, a 45% deeper re-order buffer, and massive branch prediction improvements. It's possible that the comparatively conservative back-end execution improvements held back the overall end result.

19% IPC is still a very impressive figure, and one of Intel's largest jumps in a decade, but the company is also comparing against Cypress Cove rather than its latest-generation design, which is Willow Cove inside the Tiger Lake chips. We would have liked Intel to at least give us both comparisons, so that we'd have an idea of both the desktop and the mobile improvements.

Comments

  • name99 - Thursday, August 19, 2021 - link

    "Intel’s Thread Director controller puts an embedded microcontroller inside the processor such that it can monitor what each thread is doing and what it needs out of its performance metrics. It will look at the ratio of loads, stores, branches, average memory access times, patterns, and types of instructions."

    People might be interested to know that Apple has done this for years (I don't know about ARM).

    The Apple scheme has many moving parts but these include
    - tracking how much work is done by Fetch, Decode and Execute. The first two can estimate based on number of instructions, the third takes account of the type of instruction.

    - the scheme is even sophisticated enough (at least the patent talks about this) that the weights given to each of these pieces are variable to match the characteristics of the manufactured chip. Each SoC is tested and the precise weights are fused into the chip after testing.

    - this means that the SoC can calculate things like instantaneous power usage. This is used at the overall SoC level (to limit battery current draw) and at the per execution unit level (eg to halt the SIMD pipeline for a cycle every few cycles if some thermal or power constraint is being exceeded). You will notice this is the equivalent of Intel's frequency throttling for AVX512, but much nicer because it is done on demand, purely to the level needed, and without slowing down the rest of the core or without a slow transition between faster and slower frequencies.

    - there is also tracking of where L1 cache fills come from. If a lot come from the E cores, the E-core frequency is boosted. If a lot come from DRAM, then the fabric frequency and DRAM frequency are boosted.

    - behind everything, for *most purposes* the primary metric is nJ/instruction. The scheduler does many things in the obvious way you would expect (background threads on E cores, track progress vs deadline and ramp core performance up or down depending on how that is doing); but some non-obvious things are that code that is inefficient (ie nJ/instruction is too high) and that is not otherwise protected by the OS will be pushed to lower frequency or to an E-core. This might sound bad, but mainly what it's saying is
    + if you're constantly waiting on DRAM, then running the core at high frequency does you no good anyway
    + if you're not running very wide (hard to predict branches, or long dependency chains) you can't take advantage of the big core anyway, so why waste power keeping you there?

    Presumably Intel's scheme at least covers all these sorts of bases.

    One complication Apple has, that I assume Intel/Windows will not have (but it's not clear) is the use of clustering. Clustering sounds great, as does that huge low latency shared cache. But it comes at the cost of, as far as I can tell, a common frequency for the entire cluster. (If CPUs were at different frequencies, there'd have to be a cross-frequency-domain stage when communicating with the shared L2, and that would add noticeable latency).
    So the OS scheduler doesn't just have the job of scheduling each thread to the optimal core at optimal DVFS, it also has to pack 4 optimal [as a unit] threads to a cluster...
    I can't tell if Intel's scheme runs their small cores that way, as a cluster of 4 sharing an L2 (and thus sharing frequency). If so, how the OS scheduler handles this is something to keep an eye on for both Windows and Linux.

    BTW there are very recently published patents that suggest Apple may be moving away from this, to a scheme of private L2s and a shared per-cluster L3!
    https://patents.google.com/patent/US10942850B2
    That's something to keep an eye on for the A15 and M2...
  • mode_13h - Friday, August 20, 2021 - link

    Thanks for the info.

    What do you mean by "nJ/instruction" ? Is that the ratio of branches vs. non-branch instructions? If not, then what does it have to do with DRAM latency? Or was that a reference to the prior paragraph?

    Where do you read this stuff?
  • name99 - Friday, August 20, 2021 - link

    nanoJoules/instruction. ie energy per instruction

    This info is acquired from reading massive numbers of Apple patents, validated as much as possible by experiments run on M1.
  • mode_13h - Saturday, August 21, 2021 - link

    Wow. My eyes glaze over, trying to read patents. I'm sure there are better and worse ones, but they're often written in ways that drain the joy out of the most interesting ideas.

    Thanks for sharing!
  • jospoortvliet - Sunday, August 22, 2021 - link

    Indeed super interesting!
  • mode_13h - Thursday, August 19, 2021 - link

    I wonder if they did anything to the decoder around SMT or multiple instruction streams. In Tremont, it seemed like the way they used the 6-wide decoder was as two 3-wide decoders, where each would work on a separate branch target.

    > the L2 BTB (branch target buffer) has grown to well over double with the
    > structure increased from 5K entries to 12K entries

    Can someone refresh us on the function of a BTB? Is it like a cache that stores the target address of each recent branch instruction, so that speculative execution doesn't have to wait for the target to be computed (if not a fixed target)?

    > actually eliminating instructions that otherwise would have to actually
    > emitted to the back-end execution resources.

    Huh? Seems like an editing error. Can anyone elaborate?

    > Intel still continues to use merged execution port / reservation station design

    Someone please remind us what a reservation station is?

    > On the integer side of things, there’s now a fifth execution port and pipeline with
    > simple ALU and LEA capabilities

    In this case, I presume LEA means "load effective address" and is used to compute memory addresses (potentially involving a multiply, an add, a constant offset?). Is that correct? And does the above statement mean that each of those ports can do simple ALU *or* LEA operations?

    > Intel has improved the prefetchers

    Yes, and the article text didn't even mention the bullet point in the slide about feedback-based prefetch-throttling! I'm reminded of how ARM's N2 can throttle back prefetching, during periods of memory contention. Perhaps Intel came to the same conclusion that overzealous prefetchers can starve cores running memory-intensive routines, in highly-threaded workloads.

    > full-line-write predictive bandwidth optimisation ... where the core can greatly improve
    > bandwidth by avoiding RFO reads of cache lines that are going to be fully rewritten

    Yes, I've been wanting this for about 2 decades.

    > We can see in the graph ... low outliers where the new design doesn’t improve thing
    > much or even sees regressions, which is odd.

    Maybe those were affected by the disabling of AVX-512? Or were those benchmarks performed on a fully-enabled core?

    > +25% wider µOP output

    If this is referring to expanding uOP cache outputs from 6 -> 8, that's a 33% improvement!
  • name99 - Thursday, August 19, 2021 - link

    "Can someone refresh us on the function of a BTB? "
    It's hard to be sure because I can never tell the extent to which Intel is doing things the old comfortable way, or the most sensible new way. I'll tell you what Apple do.
    Intel presumably implement a subset of these features, but I don't know how good a subset. You need ALL the pieces to run sustained "random" code 8-wide as Apple does.

    (a) You want a predictor for indirect branches (think things like virtual function or procPtr calls). How to construct such a predictor is interesting but we will just assume it exists. This may have been what the original Branch Target Buffer was say in the early 1990s, but it is NOT what we have in mind today.

    (b) To run a modern wide OoO machine optimally, you want to be able to process a TAKEN branch per cycle. (Code has a branch every ~6 instructions, a taken branch every ~10 instructions. If you want to run 8 wide...)
    - This means you need to pull a new run of instructions (ie loaded from a new address) every cycle.
    - This, in turn, means that you really need to run Fetch as an asynchronous process. A Fetch Engine every cycle predicts the next Fetch Address and the number of instructions to Fetch. (On Apple this can be at least as wide as 16 instructions in one cycle if everything lines up correctly.) These instructions are placed in the Fetch Queue and at the other end Decode reads 8/cycle from this queue. Making Fetch async from the rest of the machine means that you can sometimes pull in 16 instructions into the queue, sometimes you just pull in three or four instructions, sometimes none while you wait for a cache miss. But hopefully the queue between Fetch and Decode buffers much of this variation.

    - BUT asynchronous Fetch means Fetch is on its own regarding how it proceeds. Basically what you want is
    + a very fast (single cycle!) Next Fetch Predictor that produces the next fetch address and (ideally) also a Fetch Width
    But a fast such predictor is of limited accuracy.
    So the second essential you need is very high quality predictors that correct the Next Fetch Predictor. As long as you correct a misFetch before the instruction stream hits Rename life is fairly easy. Correcting after Rename is tough (you have to undo resource allocations), correcting after Issue is hopeless and you have to flush.
    The Apple numbers are that their high quality predictors (Branch Prediction and Indirect Branch Prediction) are TAGE based, phenomenally accurate, and take up to 5 cycles to generate a result. That just works out (of course!)

    So the idea is that the Next Fetch Predictor generates a stream of Fetch's which results in a stream of, let's call them cars of instructions, proceeding from I-cache, through the Fetch Queue, through Decode. At any point one of the better quality predictors can step in and derail all the cars after a certain point, so that Fetch restarts. Obviously this isn't great, you've lost up to five cycles of work, but it's a lot better than a full machine flush!

    OK, within this framework, I believe that what Intel today calls the BTB is essentially doing the same job as what I am calling the Next Fetch Predictor.

    BTW there are an insane number of tweaks and improvement Apple have made to the above scheme over the years. These include
    - a separate Return stack used by the Next Fetch predictor to deal with extremely rapid call/return pairs (eg call, work, return is three cycles; all done before the code has even hit decode, so totally out of sync with the "full accuracy" Return stack)
    - Decode (ie the earliest stage possible) does what it can to keep the machinery on track. Decode detects any sort of mismatch between decoded branches and the next car of instructions and, if so, gets Fetch to resteer. This is easily done for unconditional branches, and can also be done for a few other weird cases (like some mismatched call/return pairs). Decode also updates the Return stack.
    - pre-decode (ie when an instruction line is moved from L2 to L1) does a bunch of stuff to mark significant points (eg where branches are) in a cache line. This in turn is referenced the first time the Next Fetch Predictor encounters these new lines.
    - for certain cases (most obviously when the Next Fetch Predictor has an indirect branch marked as low confidence) Fetch pauses until some of the upstream machinery can suggest an address. The idea is that for low confidence indirect branches, you're so unlikely to guess correctly why even waste energy trying?

    Apart from all these, there's a whole other set of machinery that handles loops and the transition from "random" code to loops. These include both an L0 cache and a trace cache. (That's right kids, a trace cache!)
    There's also a whole set of ideas for saving power at every stage of this process. For example the Next Fetch Predictor, along with the obvious things it is recording (next fetch address, and fetch width) also records two items obvious in retrospect -- the physical address (so no TLB lookup necessary) and even the cache way (so no way prediction necessary, and the correct way -- and only that way) can be fired up on cache access. The loop buffer, L0, and the trace cache are additional ways to run Fetch on energy fumes for code that meets the specific requirements, so that various of TLB, way prediction, multi-way lookup, branch predictor, etc etc can all be powered down.
  • mode_13h - Friday, August 20, 2021 - link

    Thanks for the info. I wonder where you find such detailed descriptions!

    > L0 cache

    Just another name for a uop cache?

    > That's right kids, a trace cache!

    So, a trace cache stores an entire string of instructions, even across one or more branches? Does it hold instructions before or after decode?

    > and even the cache way

    You mean the cache set?

    > The loop buffer

    What's a loop buffer? Sort of like a trace cache, for loops?
  • name99 - Friday, August 20, 2021 - link

    Think of the steps required to run normal code, as I described above. Then consider various simple loops.

    Suppose you have a straight line loop, say 40 instructions in the loop body, no branches. Then you can omit branch prediction, TLB, cache -- just repeatedly run the same code from a straight buffer. That's essentially a loop buffer.

    Now suppose that your loop body has a few branches in it, but they are predictable, maybe something like
    if(loop counter even){path1} else {path2}
    Now what you want is something like a trace cache that's holding the two paths path1 and path2, and a very simple predictor that's telling which of these to choose each iteration. You can still avoid the costs of a real branch predictor and a real cache.

    Now suppose you have a loop with moderately complicated branches, not terrible but not that easy to predict either. You can't avoid the cost of branch prediction now (as I said, to validate the guess of the Next Fetch Predictor) but you can avoid much of the cost of the cache by moving the loop body into an L0 cache which will be essentially a small direct-mapped cache. Being smaller, and direct-mapped, it will use less energy/access than the full I-cache. (And you probably will also access it virtually rather than physically, so also avoid TLB costs.)

    cache way:
    Recall that a DIRECT-MAPPED cache has only a single place where a line can go -- grab some bits from the middle of an address, they define an index, the line goes at that index. This is fast and cheap, but means you have a problem if you frequently want to access two addresses with the same index (ie same middle bits in their addresses).
    n-way set-associative cache means you now have, n (may be 2, 4, 8 or some other number) of slots associated with a given index. So if you have 8 slots, you can hold 8 lines with that same index, ie 8 addresses with those same middle bits.
    BUT how do you know WHICH of those 8 lines you want? Ahh.

    That gets into the business of matching tags, way prediction and a whole mess of other stuff that you need to read in a textbook. But the relevance to what I was saying is that which of these 8 possible lines is of interest is called a WAY. So by storing the cache way, you can access a cache with the speed (avoid cache tag lookup) and energy (no need to precharge the tags) of a direct-mapped cache.
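
    To make that concrete, here's a toy address split for a hypothetical 32KB, 8-way cache with 64-byte lines (numbers made up, C++ just for illustration):

    ```cpp
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kLineBytes = 64;   // low 6 bits of the address = byte offset in the line
    constexpr uint64_t kWays      = 8;
    constexpr uint64_t kSets      = 32 * 1024 / kLineBytes / kWays;   // 64 sets -> 6 index bits

    int main() {
        uint64_t addr   = 0x7ffd12345678ull;
        uint64_t offset = addr % kLineBytes;
        uint64_t set    = (addr / kLineBytes) % kSets;   // picks 1 of the 64 sets
        uint64_t tag    = addr / kLineBytes / kSets;     // compared against the tags of all 8 ways in that set
        printf("offset=%llu set=%llu tag=0x%llx\n",
               (unsigned long long)offset, (unsigned long long)set, (unsigned long long)tag);
        // A direct-mapped cache is just the kWays == 1 case: the set index alone says
        // where the line lives. Remembering which way (0..7) held the line last time
        // lets you skip the 8 tag compares, which is the energy trick described above.
    }
    ```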
  • GeoffreyA - Saturday, August 21, 2021 - link

    Great information. I believe on the Intel side, Nehalem added something like that, the LSD.
