IMO AMD could have shipped at least a 16 core die for Epyc to reduce the insane amount of off die interconnection. I suspect that Rome will perform well only for a selected number of workloads. I don't think Intel is concerned too much with Rome's arrival.
@Gondalf: It is a tradeoff, but "This is an indication that disavantages are bigger than advantages" was true; not anymore. Intel is also doing multi-die packaging with Cooper/Ice lake based on rumors and have plans to use EMIB-based solutions in future (Sapphire Rapids). The reality is that latest and greatest nodes are taking longer to achieve yield parity with older nodes. You then have two choices: wait a couple of years for yields to catch up such that you can build a high core-count server part OR do what AMD is doing. So, the tradeoff AMD is making is that the extra cost (power, latency) from MCM is offset by power efficiency of 7nm. And when 7nm can yield a monolithic 32c die, AMD can jump to 5nm or whatever is the next node. More cores per socket, less power per core even when adjusted for the extra power from MCM.
Plus, the argument that this is a bad idea because nobody else does it is a pretty weak argument. Nobody made $600 phones before Apple tried.
By all accounts that I have heard, AMD is gaining significant traction in cloud and HPC with Rome. The cloud folks run VMs or containers with few threads in each so don't really need a large unified cache that Intel provides (the biggest -ve for AMD). Basically what is called 'scale-out'. Then there's legacy enterprise ('scale-up') where AMD may be at a disadvantage (oracle, SAP databases). These workloads have a large number of threads with coherent shared memory. Whether this matters in the real world is hard to say. Clearly AMD has decided that it can ignore legacy database markets and focus on cloud, which is what, 40-50% of server volume?
“Leveraging interposers does lower power consumption vs. traditional wire bonding that is used in the Rome packaging.”
I don’t think actual wire bonding has been used for cpus in about 20 years; the Pentium Pro in the late 90’s was the last wire bonded Intel chip, I think. They moved to flip-chip with the Pentium 2 or perhaps a second revision of the Pentium 2.
I don’t think they will have ridiculously large silicon interposers. Interposers are made the same way silicon chips are so they are limited by the reticle size. Techniques do exist to make them larger than that, but the cost probably skyrockets. I would guess that the most likely configuration is to place gpus with HBM on separate interposers and then connect them with infinity fabric through the PCB. They probably want a large gpu and 4 stacks of HBM for each gpu which is quite large. Although, for HPC 2 smaller gpus often perform the same as one larger GPU with 2x the compute and 2x the bandwidth, so it is unclear how big the gpus actually are and how many HBM stacks they will use. They will probably have the gpus and cpus mounted on a single board though. It is a lot easier to route 3 IF connections to each interposer with them on the same board.
The next step for Epyc seems to be to replace the IO die with an interposer. I don’t know if that will be Zen 4 though. It may come later. Using an interposer would allow using multiple much smaller 7 nm chiplets for the actual logic. An active interposer would be even better since they could place the larger transistors needed for driving off interposer links in the interposer and reserve the 7nm chips just for logic. Using an interposer for the IO die would allow better power consumption since the logic would all be 7 nm chiplets. An active interposer would free up a lot of extra area on the Epyc package. That could allow things like memory controller chiplets with very large caches on die and such. They could also move things around again and do something like place cpu cores and memory controllers onto one interposer and all of the other IO onto another chip since it isn’t as latency sensitive. An active interposer with the external memory interface would be interesting. Such a device might have 7 nm chiplet(s) with the memory controller and the IF switch and the cpu core chiplets on a single interposer. That would probably be pushing the reticle size limits with a passive interposer though. Even if the reticle size limits aren’t an issue, what if something goes wrong soldering the chiplets onto the interposer? Four gpus, a bunch of HBM, cpu chiplets, and possibly a bunch of other chiplets would be a lot to lose in one soldering operation.
There are a lot of different options once silicon interposers come into play, but I would still expect that they would use separate interposers for each gpu and the cpu and just connect them together with infinity fabric. If the cpu uses a silicon interposer then one possible customization is to place the slingshot interface directly on the interposer. If it is using an active interposer, then it could actually be the exact same chiplets as a regular Epyc with just an added die for slingshot and/or a different interposer with slingshot integrated.
Good interview, thank you! I wish that Mr Norrod was a bit more forthcoming in some areas, but what can you do...
Some small typos caught my eye. "Lisa eluded to this earlier" should be "Lisa alluded to this earlier". "commodity markets have a very set of economic rules" ... is there a word or so missing here? "I think that that Intel has made" should be "I think that Intel has made".
I wonder if we will see mining using the Frontier supercomputer custom chips? Wouldn't 1 CPU to 4 GPUs make a nice dense setup, as in a blade rack footprint?
With PCIe bifurcation common on consumer CPUs , and mining needing neither CPU grunt nor interface bandwidth, there is no need to waste money on server-grade hardware. e.g. the ASUS B250 MINING EXPERT will happily host 19 GPUs and it plus a cheap CPU will likely cost a fraction of what a single Frontier CPU alone would cost.
They probably will not be available to buy and they will be ridiculously expensive compared to other solutions for mining. They will probably mount the gpus and cpus onto a single board like a compute blade. It will need to be quite large though to support a lot of system memory and 5 devices that might consume 200 Watts each. The power delivery would take up quite a bit of space. These are packed very densely into the racks given some of the per cabinet power numbers. They may be made to plug into a backplane.
I am not smart at this stuff, but regarding 3D packaging, is it possible they could add L3 / L4 cache which is at least stacked? Doesn't that run a lot cooler?
HBM is 2.5D packaging (memory chip is next to gpu). The problem with heat and 3D package (chip covering chip) is when either chip is hot. Cpu and gpu run very hot. Stacking cache or memory on top traps that heat, threatening to damage the chip and cook the one on top.
48 Comments
Ninhalem - Monday, June 24, 2019 - link
Well that certainly helps inform my buying decisions for the next couple months. I was hoping to see Rome be available outside of OEM channels by at least the end of Q3 (FY Q4), but I can't wait until Q4 2019 or Q1 2020. At least you can drop Rome into a Naples motherboard with a BIOS update. Now to find a decent server motherboard.
RobJoy - Monday, June 24, 2019 - link
You know big buyers like cloud services and supercomputer services get first dibs on these new juicy CPUs.
You, as a solitary consumer, are secondary, if not tertiary.
quantumshadow44 - Monday, June 24, 2019 - link
so why publish it on solitary site?
Oliseo - Tuesday, June 25, 2019 - link
Basically it's a post to show how much better that person is than the OP. But it's obvious to others it's just an insecurity thing.
Irata - Tuesday, June 25, 2019 - link
Being a customer that purchases 1000s of your product versus one comes with perks?
And customers who need to plan and develop an entire infrastructure around this product get earlier insight?
Who'd have thought ?
Jackbender - Monday, June 24, 2019 - link
Slight typo I guess, the article ends on a comma rather than a period.
Great discussion and a good read otherwise.
azfacea - Monday, June 24, 2019 - link
He is right, NVM is bs for cloud. It's the denser memory promise that is where all the potential is. Even at 300ns or 500ns it's still fine, but the problem with Optane is that bandwidth, and especially random bandwidth, is in the toilet as well compared to DRAM. 2 or 3x more dense is simply not gonna cut it.
ksec - Tuesday, June 25, 2019 - link
I said the exact same thing and all I got was attacked from left and right. And while DRAM pricing dropped, it still hasn't hit bottom yet; the manufacturers are just not making the extreme profits they were used to. We have 256GB per DIMM, and in the future 512GB per DIMM by 2022. That is 16TB of memory on a 2S system. NVM will need to improve on (fulfil) its performance promise, and cost will need to come way down, for it to be an attractive option. And judging from Micron's word, it is not going to be cost effective soon.
SLVR - Monday, June 24, 2019 - link
A very strong interview by Ian & Forrest!
Eris_Floralia - Monday, June 24, 2019 - link
The last few sentences... ouch.
edzieba - Monday, June 24, 2019 - link
Though rather hilarious given he had just finished excusing AMD for eschewing a standardised interconnect in favour of one customised for a better fit to the task. Exactly what Intel did with the Optane DIMMs once it became clear DDR4 was not suitable for an NVRAM interface without significant protocol modification. DDR5 was ratified long before any of that could be fed back upstream, so DDR6 would likely be the first viable standard that could be designed around native NVRAM support. The terribly named 'Optane Memory' on the consumer side uses standard NVMe over PCIe and can be treated as any other NVMe drive.
edzieba - Monday, June 24, 2019 - link
"given he had just finished excusing AMD for eschewing a standardised interconnect in favour of one customised for a better fit to the task."
Skipping over CCIX for Infinity Fabric, for those who skipped over the article to the comments.
IanCutress - Monday, June 24, 2019 - link
That customized interconnect for Frontier is designed for a custom CPU + GPU system isolated to one supercomputer. Optane is designed to be used throughout the industry - at least on Intel.
edzieba - Tuesday, June 25, 2019 - link
There are also Radeon Instinct cards that run over IF rather than CCIX (MI50 & MI60).
The two situations are almost the inverse: AMD have selected their own proprietary interconnect over an existing standard for performance reasons. Intel have selected their own proprietary interconnect version because no standard existed and existing interfaces simply did not work (same as Everspin did for their MRAM when facing the same issue with DDR).
deltaFx2 - Tuesday, June 25, 2019 - link
Radeon Instinct, I believe, CAN run on IF if hooked up to AMD CPUs. Otherwise, they default to PCIe 4 (or maybe CCIX, not sure). The test for an open system is whether there is vendor lock-in. If a standard did not exist, Intel has the clout to define one and people will build for it (like it finally did with Thunderbolt after keeping it proprietary for close to a decade). Obviously that would allow Micron and others to be compatible with Intel CPUs, and for AMD and others to be compatible with Optane. Intel is clearly using Optane as a competitive advantage. Can AMD or ARM build an Optane-compatible interface to allow Optane modules to be plugged in? Can anyone else use Optane? I think you know the answer.
AMD could make IF open like they did with HyperTransport. However, there were already 3 open coherent interconnect proposals out there well before IF came about: CAPI, GenZ, CCIX. Intel has its own version called Omni-Path. Does the world need another interconnect specification and if so what problem does it uniquely solve? What uptake do you expect to see if IF were made public? Could they have used CCIX instead of IF as their interconnect? Probably, but these designs were probably done 5-6 years ago.
deltaFx2 - Tuesday, June 25, 2019 - link
Oh, forgot, Intel is now backing CXL. So 4 open standards. Why does Intel think the world needs CXL? My guess is, not invented here.
jamescox - Thursday, June 27, 2019 - link
It is in AMD’s and Intel’s best interest to use a proprietary connection that is only available to their own GPUs. The two planned exascale supercomputers in the US are single vendor. One is AMD CPUs and GPUs while the other will be Intel CPUs and GPUs. If they wanted to use Nvidia, they would need to fall back to PCI Express for use with AMD or Intel CPUs. This is why Nvidia has been talking about ARM for supercomputers. They have ARM processors such that they could make high performance ARM cores with NVLink.
extide - Monday, June 24, 2019 - link
Yeah I think Frontier is really a separate specific case here. I think Forrest would have liked to see Intel create an open NVDIMM standard that others could easily use instead of just one which is proprietary to them.
Kevin G - Monday, June 24, 2019 - link
Lot of interesting discussion. Just as much was said as couldn't be said, because you were asking the right questions.
Though it was obviously not directly stated, I would say that it is implied that AMD is going to be pursuing more advanced packaging techniques for their future server CPUs. Leveraging interposers does lower power consumption vs. traditional wire bonding that is used in the Rome packaging. It is interesting to hear that the amount of power being spent on IO is not as large as is being portrayed outside of AMD. This would imply that a change to interposers or similar packaging technologies is further down the road than believed.
Frontier comments are interesting and I wish more pressure was put on this point. With AMD's chiplet strategy, the semi-custom part may only be the IO die to account for the desired number of memory channels, PCIe lanes and Infinity Fabric links. The CPU die side of Frontier may just be the commodity die. To follow up my comments on packaging, this also could be what makes Frontier custom: the CPU and IO die could be part of a massive interposer package that includes the four GPU dies and HBM stacks. Cooling could be a problem but such packaging is feasible for a project like Frontier. AMD has options here and that should be followed up.
The other interesting take is the thoughts on non-volatile memory. The pricing scenario vs. DRAM today is correct, but I would question how much impact Optane would have if DRAM pricing was near its peak as it was ~18 months ago. Intel certainly would have had greater demand than they do right now but the flip side is their capability to match that demand: DRAM pricing would only be impacted by Optane if Intel could meet their own demand.
I would disagree about the comments regarding non-volatile memory and commits. It is indeed true that for enterprise DB systems you want that commit to be replicated to provide redundancy. However, that is true for Optane *AND* existing systems today: data should never rest at a single point. Thus the cost equation between Optane and existing technologies wouldn't change in this regard as the comparison on both sides should inherently include the concept of redundancy as part of the baseline. How that is done may change but that has recently happened already with the transition from SAS based primary storage pools to NVMe.
guycoder - Monday, June 24, 2019 - link
Maybe AMD are directly integrating the Cray Slingshot interconnect into the IO die for Frontier? Also, given the acquisition of Cray by HPE that might lead to some interesting possibilities for HPE in the server space to separate themselves commodity wise.
Kevin G - Tuesday, June 25, 2019 - link
That idea has been floated around as part of the custom nature of Frontier's CPU.
As for HPE, they are one of the few with an Intel chipset license which was inherited when they purchased SGI. It is how they are able to build a 32 socket Xeon right now.
Doing something different with AMD for the more commodity parts is kinda difficult. AMD hasn't left many options open. Technically Rome should scale to quad socket by merit of how many IF links there are and how many IO dies would be at play. It just isn't clear if AMD even has support for this. Vendors can build dual socket Rome boxes with 160 and 192 active PCIe lanes for accelerators now.
I am also curious whether, if a Vega 20 card were put into a Rome system, it would be possible to switch from PCIe to IF over the existing physical link. That would be an interesting play for AMD.
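The 160 and 192 lane figures above can be reproduced with a little arithmetic. The sketch below is only an illustration: it assumes each Rome socket exposes 128 high-speed lanes and that every socket-to-socket Infinity Fabric link takes an x16 group away from PCIe on each socket. Neither number is stated in the comment, and the function name is made up for this example.

```python
# Hypothetical helper: rough lane accounting for a dual socket Rome system.
LANES_PER_SOCKET = 128   # assumption: high-speed lanes per Rome socket
LANES_PER_IF_LINK = 16   # assumption: one x16 group per inter-socket IF link


def dual_socket_pcie_lanes(if_links: int) -> int:
    """PCIe lanes left for devices when `if_links` x16 links join the two sockets."""
    lanes_used_per_socket = if_links * LANES_PER_IF_LINK
    if if_links < 1 or lanes_used_per_socket > LANES_PER_SOCKET:
        raise ValueError("link count does not fit the assumed lane budget")
    # Both sockets give up the same number of lanes to the inter-socket links.
    return 2 * (LANES_PER_SOCKET - lanes_used_per_socket)


if __name__ == "__main__":
    for links in (4, 3, 2):
        print(links, "IF links ->", dual_socket_pcie_lanes(links), "PCIe lanes")
```

Under these assumptions, dropping from four inter-socket links to three or two is what frees the extra lanes, at the cost of socket-to-socket bandwidth.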
Opencg - Monday, June 24, 2019 - link
I think just like people who simplify things down to "what node is it on" you are missing the point on chiplet strategy. AMD has used the appropriate chiplet approach on all of its products. This isn't the place for you to vent your feelings about ryzen chiplets in the form of garbled logic. Get your mind right.
Oliseo - Tuesday, June 25, 2019 - link
Projecting much?
Kevin G - Tuesday, June 25, 2019 - link
I would disagree in that we have yet to fully see the full benefits of AMD's chiplet strategy. So far they only have two IO dies where they put a selective amount of CPU dies into the same package. Not a bad start but far from its full potential.
Need to move to PCIe 5.0 and/or DDR5? Only the IO die needs to be changed. Need to add a GPU to make it a full mobile SoC? Link a GPU die to the IO die via IF without having to redesign the CPU or IO portions that have already been in production for months. Need to add fabric for a super computer? Link it to the IO die via IF at the cost of a CPU die in the same package.
The chiplet strategy is more than just being able to mix process nodes (for the record that is a good thing too, and I would also cite the ability to combat poor foundry yields for the amount of silicon being used as a third major benefit). It is about being able to rapidly re-use what works with minimal validation to quickly produce custom products to fit market demands. We are only in the beginning stages of this strategy.
jamescox - Thursday, June 27, 2019 - link
The mobile part will probably be a single die APU. Where power consumption is the most important, there is still an advantage in having everything on one die. I don’t know how they will launch that since they already have 12nm APUs marketed a Ryzen 3000. Perhaps the 7nm mobile APU will be the first Ryzen 4000 part with desktop Ryzen 4000 parts (Zen 2+) coming a bit later.The main benefit of the chiplet strategy is the yields and the lower cost (they aren’t separate obviously). Intel can’t really compete with Ryzen 3000 parts with 14 nm parts. AMD will have a $200 part with 32 MB L3 and a $500 part with 64 MB L3. Intel 14 nm currently only goes up to 38.5 MB and those are expensive chips. Some of that is 7 nm vs intel 14 nm but having two or 3 chiplets is actually a really big part of AMD’s advantage. Intel will eventually get a 10 nm die on the market, but it will almost certainly still be a monolithic die. That will waste 10 nm die area on IO and also, a monolithic die with a lot of cores will be quite a bit larger and therefore lower yields than AMD’s tiny chiplets. If the competing intel part comes out at maybe 250 square mm, then that is going to be a lot more expensive than AMD’s 2 80 square mm chiplets (much better yields) and a cheap 14 nm IO die.
The comparison is even worse for intel with Epyc. Intel has their two die 56 core marketing part with 2 x 38.5 MB L3 cache, but being on 14 nm it consumes way too much power and it is ridiculously expensive. AMD has about 1000 square mm of silicon for a 64-core Epyc. You get a massive 256 MB L3 cache which intel doesn’t even come close to. The intel part is, I believe, 2 almost 700 square mm die, so around 1400 square mm total and probably bad yields due to the size. A 64 core part just cannot be done as a monolithic die. Even 2 die is still probably not doable on 10 nm due to worse yields.
So while it is true that they can mix and match parts with using separate die, that really isn’t that important. What is important is that it allows a 16 core desktop part for $749 while slower intel parts go for 2 to 3x that cost. Same thing with Epyc. The intel parts are ridiculously more expensive and at this point, Intel doesn’t even have a real competitor beyond 28 core. The dual 28 core parts are more marketing than real products. AMD is probably planning a new cpu chiplet for the DDR5 time frame anyway. I don’t know if they want to support both DDR4 and DDR5 in the next generation, but that would probably be easily possible by just making a new IO die while keeping compatibility with the old one. That would be great if they can do that but it isn’t why people are excited about Ryzen 3000.
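To put rough numbers on the yield argument above, here is a minimal sketch using the textbook Poisson yield model, where the fraction of defect-free dies is exp(-area × defect density). The die areas (80, 250 and 700 square mm) are the ones quoted in this comment; the defect density `D0` is a made-up illustrative value, not a real foundry figure, and the function names are hypothetical. It also applies the same defect density to every die and ignores the IO die, wafer edge losses and partially good dies that get sold as cut-down parts.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a simple Poisson model."""
    area_cm2 = area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_die(area_mm2: float, defects_per_cm2: float) -> float:
    """Wafer area (mm^2) spent per defect-free die, ignoring edge losses."""
    return area_mm2 / poisson_yield(area_mm2, defects_per_cm2)


# Assumed defect density in defects per cm^2. Illustrative only.
D0 = 0.5

if __name__ == "__main__":
    dies = [
        ("80 mm^2 chiplet", 80.0),
        ("250 mm^2 monolithic die", 250.0),
        ("700 mm^2 monolithic die", 700.0),
    ]
    for label, area in dies:
        y = poisson_yield(area, D0)
        cost = silicon_per_good_die(area, D0)
        print(f"{label}: yield {y:.1%}, {cost:.0f} mm^2 of wafer per good die")

    # Two good chiplets versus one good 250 mm^2 die.
    two_chiplets = 2 * silicon_per_good_die(80.0, D0)
    print(f"two chiplets: {two_chiplets:.0f} mm^2 of wafer per good pair")
```

Whatever value you pick for `D0`, the shape is the same: yield falls off exponentially with area, so the wafer area spent per good die grows much faster than the die itself. The absolute numbers depend entirely on the assumed defect density.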
Gondalf - Wednesday, July 3, 2019 - link
You have well underlined the advantages of chiplet design; too bad you have forgotten to underline the disadvantages on performance. This is very common, AMD makes this mistake too.
All server CPU suppliers carefully avoid the granular AMD solution, even ARM server suppliers are all on monolithic solutions. This is an indication that disadvantages are bigger than advantages.
IMO AMD could have shipped at least a 16 core die for Epyc to reduce the insane amount of off die interconnection. I suspect that Rome will perform well only for a selected number of workloads.
I don't think Intel is concerned too much with Rome arrival.
deltaFx2 - Thursday, July 4, 2019 - link
@Gondalf: It is a tradeoff, but "This is an indication that disadvantages are bigger than advantages" was true; not anymore. Intel is also doing multi-die packaging with Cooper/Ice lake based on rumors and has plans to use EMIB-based solutions in future (Sapphire Rapids). The reality is that latest and greatest nodes are taking longer to achieve yield parity with older nodes. You then have two choices: wait a couple of years for yields to catch up such that you can build a high core-count server part OR do what AMD is doing. So, the tradeoff AMD is making is that the extra cost (power, latency) from MCM is offset by power efficiency of 7nm. And when 7nm can yield a monolithic 32c die, AMD can jump to 5nm or whatever is the next node. More cores per socket, less power per core even when adjusted for the extra power from MCM.
Plus, the argument that this is a bad idea because nobody else does it is a pretty weak argument. Nobody made $600 phones before Apple tried.
By all accounts that I have heard, AMD is gaining significant traction in cloud and HPC with Rome. The cloud folks run VMs or containers with few threads in each so don't really need a large unified cache that Intel provides (the biggest -ve for AMD). Basically what is called 'scale-out'. Then there's legacy enterprise ('scale-up') where AMD may be at a disadvantage (oracle, SAP databases). These workloads have a large number of threads with coherent shared memory. Whether this matters in the real world is hard to say. Clearly AMD has decided that it can ignore legacy database markets and focus on cloud, which is what, 40-50% of server volume?
deltaFx2 - Thursday, July 4, 2019 - link
PS: Naples had an additional wrinkle of being NUMA and apps running on it had to be NUMA-aware. This is not the case with Rome.
jamescox - Thursday, June 27, 2019 - link
“Leveraging interposers does lower power consumption vs. traditional wire bonding that is used in the Rome packaging.”
I don’t think actual wire bonding has been used for cpus in about 20 years; the Pentium Pro in the late 90’s was the last wire bonded Intel chip, I think. They moved to flip-chip with the Pentium 2 or perhaps a second revision of the Pentium 2.
I don’t think they will have ridiculously large silicon interposers. Interposers are made the same way silicon chips are so they are limited by the reticle size. Techniques do exist to make them larger than that, but the cost probably skyrockets. I would guess that the most likely configuration is to place gpus with HBM on separate interposers and then connect them with infinity fabric through the PCB. They probably want a large gpu and 4 stacks of HBM for each gpu which is quite large. Although, for HPC 2 smaller gpus often perform the same as one larger GPU with 2x the compute and 2x the bandwidth, so it is unclear how big the gpus actually are and how many HBM stacks they will use. They will probably have the gpus and cpus mounted on a single board though. It is a lot easier to route 3 IF connections to each interposer with them on the same board.
The next step for Epyc seems to be to replace the IO die with an interposer. I don’t know if that will be Zen 4 though. It may come later. Using an interposer would allow using multiple much smaller 7 nm chiplets for the actual logic. An active interposer would be even better since they could place the larger transistors needed for driving off-interposer links in the interposer and reserve the 7nm chips just for logic. Using an interposer for the IO die would allow better power consumption since the logic would all be 7 nm chiplets. An active interposer would free up a lot of extra area on the Epyc package. That could allow things like memory controller chiplets with very large caches on die and such. They could also move things around again and do something like place cpu cores and memory controllers onto one interposer and all of the other IO onto another chip since it isn’t as latency sensitive. An active interposer with the external memory interface would be interesting. Such a device might have 7 nm chiplet(s) with the memory controller and the IF switch and the cpu core chiplets on a single interposer. That would probably be pushing the reticle size limits with a passive interposer though. Even if the reticle size limits aren’t an issue, what if something goes wrong soldering the chiplets onto the interposer? Four gpus, a bunch of HBM, cpu chiplets, and possibly a bunch of other chiplets would be a lot to lose in one soldering operation.
There are a lot of different options once silicon interposers come into play, but I would still expect that they would use separate interposers for each gpu and the cpu and just connect them together with infinity fabric. If the cpu uses a silicon interposer then one possible customization is to place the slingshot interface directly on the interposer. If it is using an active interposer, then it could actually be the exact same chiplets as a regular Epyc with just an added die for slingshot and/or a different interposer with slingshot integrated.
Carmen00 - Monday, June 24, 2019 - link
Good interview, thank you! I wish that Mr Norrod was a bit more forthcoming in some areas, but what can you do...
Some small typos caught my eye. "Lisa eluded to this earlier" should be "Lisa alluded to this earlier". "commodity markets have a very set of economic rules" ... is there a word or so missing here? "I think that that Intel has made" should be "I think that Intel has made".
FreckledTrout - Monday, June 24, 2019 - link
I wonder if we will see mining using the Frontier supercomputer custom chips? Wouldn't 1 CPU to 4 GPUs make a nice dense setup, as in a blade rack footprint?
edzieba - Monday, June 24, 2019 - link
With PCIe bifurcation common on consumer CPUs, and mining needing neither CPU grunt nor interface bandwidth, there is no need to waste money on server-grade hardware. e.g. the ASUS B250 MINING EXPERT will happily host 19 GPUs and it plus a cheap CPU will likely cost a fraction of what a single Frontier CPU alone would cost.
jamescox - Thursday, June 27, 2019 - link
They probably will not be available to buy and they will be ridiculously expensive compared to other solutions for mining. They will probably mount the gpus and cpus onto a single board like a compute blade. It will need to be quite large though to support a lot of system memory and 5 devices that might consume 200 Watts each. The power delivery would take up quite a bit of space. These are packed very densely into the racks given some of the per cabinet power numbers. They may be made to plug into a backplane.
V900 - Monday, June 24, 2019 - link
AMD really need to up their GPU game.
It’s quite telling that the GPUs for the Frontier supercomputer are designed and delivered by... Nvidia!
extide - Monday, June 24, 2019 - link
Yeah, which means they have made Nvidia GPUs able to work with Infinity Fabric. Very interesting...
GreenReaper - Tuesday, June 25, 2019 - link
Clearly AMD is putting its fabric where its mouth is and providing open standards.
koopahermit - Monday, June 24, 2019 - link
This part confuses me. Was it recently changed to Nvidia? It was announced that Frontier was going to use Radeon Instinct back in May.
deltaFx2 - Monday, June 24, 2019 - link
It was never Nvidia. V900 is mistaken. The cpu and gpu are AMD. Extide was being sarcastic.
AbRASiON - Monday, June 24, 2019 - link
I am not smart at this stuff, but regarding 3D packaging, is it possible they could add L3 / L4 cache which is at least stacked? Doesn't that run a lot cooler?
Processor with 2GB local HBM or something?
frenchy_2001 - Wednesday, June 26, 2019 - link
HBM is 2.5D packaging (memory chip is next to gpu).
The problem with heat and 3D package (chip covering chip) is when either chip is hot.
Cpu and gpu run very hot. Stacking cache or memory on top traps that heat, threatening to damage the chip and cook the one on top.
KAlmquist - Tuesday, June 25, 2019 - link
@ Kevin G:
I think Forrest Norrod's point about data redundancy had to do with performance, not cost.
Mugur - Tuesday, June 25, 2019 - link
Sorry, but the picture with Ian and the AMD SVP is hilarious... Either Ian is very short, or the AMD guy very tall (NBA tall). :-)
IanCutress - Tuesday, June 25, 2019 - link
Both.
WatcherCK - Tuesday, June 25, 2019 - link
Why is Forrest standing on a box (in the first image)?
IanCutress - Tuesday, June 25, 2019 - link
He's not. He's that tall. I'm as tall as Anand.
katsetus - Wednesday, June 26, 2019 - link
A Big Guy and Cutress, Ian (Anandtech).
Massive Baneposting potential.