Supermicro Ultra SYS-120U-TNR Review: Testing Dual 10nm Ice Lake Xeon in 1U
by Dr. Ian Cutress on July 22, 2021 9:00 AM EST
With the launch of Intel’s Ice Lake Xeon Scalable platform comes a new socket and a range of features that vendors like Supermicro have to design for. The server and enterprise market is so vast that every design can come in a range of configurations and settings; however, one of the key elements is balancing compute density with memory and accelerator support. The SYS-120U-TNR we are testing today is a dense system with lots of trimmings, all within 1U, which Supermicro is aiming at virtualization workloads, HPC, Cloud, Software Defined Storage, and 5G. This system can be equipped with up to 80 cores, 12 TB of memory (DRAM plus Optane), and four PCIe 4.0 accelerators, defining a high-end solution from Supermicro.
Servers: General Purpose or Hyper Focused?
Because the server and enterprise market is both expansive and highly optimized, vendors like Supermicro have to decide how to partition their offerings. Smaller vendors might choose to target one particular customer, or go for a general purpose design, whereas the larger vendors can have a wide portfolio of systems for different verticals. Supermicro falls into this latter category, designing targeted systems with large customers, but also offering ‘standard’ systems that can do a bit of everything while still delivering good total cost of ownership (TCO) over the lifetime of the system.
Server size compared to a standard 2.5-inch SATA SSD
When considering a ‘standard’ enterprise system, in the past we have typically observed a dual socket design in a 2U (3.5-inch, 8.9cm height) chassis, which allows for a sufficient cooling design along with a number of add-in accelerators such as GPUs or enhanced networking, or space on the front panel for storage or additional cooling. The system we’re testing today, the SYS-120U-TNR, certainly fits this ‘standard’ definition, although Supermicro goes a step further and optimizes for density by cramming everything into a 1U chassis.
With only 1.75 inches (4.4cm) of vertical clearance on offer, cooling becomes a priority, which means substantial heatsinks and fast-moving airflow backed by eight powerful 40mm fans (double thick, at 56mm deep), which run at up to 30k RPM with PWM control. The SYS-120U-TNR we’re testing supports two Ice Lake Xeon processors at up to 40 cores and 270 W each, as well as add-in accelerators (one dual-slot full-height plus two single-slot full-height), and comes equipped with either dual 1200W Titanium or dual 800W Titanium power supplies, indicating that it is ready should a customer want to fill it with plenty of hardware. As seen in the image above and on the right of the image below, Supermicro uses plastic baffles to ensure that airflow through the heatsinks and memory is as laminar as possible.
LGA-4189 Socket with 1U Heatsink and 16 DDR4 slots
Even within the 1U form factor, Supermicro has enabled full memory support for Ice Lake Xeon, giving each processor sixteen DDR4-3200 memory slots (32 in total), capable of supporting up to 12 TB of memory when paired with Intel’s Optane DCPMM 200-series.
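The 12 TB figure follows from the module sizes listed in the specification table further down. As a quick check, here is a minimal Python sketch of that arithmetic; the variable and function names are illustrative, and 1 TB is taken as 1024 GB.

```python
# Capacity arithmetic for the two maximum-memory configurations in the spec table.
SLOTS_PER_SOCKET = 16   # DDR4 slots per processor
SOCKETS = 2
LRDIMM_GB = 256         # largest LRDIMM listed in the spec table
OPTANE_GB = 512         # largest Optane module listed in the spec table


def total_tb(modules):
    """modules is a list of (count, size_gb) pairs; returns capacity in TB."""
    return sum(count * size_gb for count, size_gb in modules) / 1024


total_slots = SLOTS_PER_SOCKET * SOCKETS

# Every slot filled with a 256 GB LRDIMM.
dram_only_tb = total_tb([(total_slots, LRDIMM_GB)])

# Half the slots with 512 GB Optane, the other half with 256 GB LRDIMMs.
mixed_tb = total_tb([
    (total_slots // 2, OPTANE_GB),
    (total_slots // 2, LRDIMM_GB),
])

print(f"{total_slots} slots, DRAM only: {dram_only_tb} TB, with Optane: {mixed_tb} TB")
```

The DRAM-only case tops out at 8 TB; the 12 TB headline number requires half of the slots to hold Optane modules.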
At the front are twelve 2.5-inch SATA/NVMe PCIe 4.0 x4 hot-swappable drive bays, with six connected to each processor. Tracing where all the PCIe lanes from each processor go gets confusing very quickly.
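As a rough sanity check on the lane budget, the Python sketch below adds up the lanes consumed by the front drive bays and by the x16 slots listed in the specification table. The figure of 64 PCIe 4.0 lanes per socket is an assumption taken from Intel's published Ice Lake Xeon platform specifications, not from this review, and the board's actual per-socket routing is not modeled.

```python
# System-wide PCIe lane tally. The per-socket lane count is an ASSUMPTION
# (Intel's published Ice Lake Xeon figure); it is not stated in the review.
LANES_PER_SOCKET = 64
SOCKETS = 2

NVME_BAYS = 12          # front 2.5-inch bays
LANES_PER_BAY = 4       # each bay is PCIe 4.0 x4
BAYS_PER_SOCKET = NVME_BAYS // SOCKETS

X16_SLOTS = 4           # 1 low profile + 1 internal low profile + 2 full height
LANES_PER_SLOT = 16

drive_lanes_per_socket = BAYS_PER_SOCKET * LANES_PER_BAY
drive_lanes = NVME_BAYS * LANES_PER_BAY
slot_lanes = X16_SLOTS * LANES_PER_SLOT
budget = LANES_PER_SOCKET * SOCKETS

# Whatever is left is available for items such as the networking riser.
remaining = budget - drive_lanes - slot_lanes

print(f"drives: {drive_lanes} lanes ({drive_lanes_per_socket} per socket)")
print(f"x16 slots: {slot_lanes} lanes")
print(f"remaining of {budget}: {remaining} lanes")
```

Under these assumptions, a fully populated NVMe front panel plus four x16 slots fits within the two-socket budget with some lanes to spare for the networking riser.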
By default the system comes without network connectivity, with only a BMC connection for admin control. Network options require an Ultra add-in riser card for dual 10GBase-T (X710-AT2), or dual 10GBase-T plus dual 10GbE SFP+ (X710-TM4). With the PCIe slots, any other networking option can be configured, but Supermicro also lists the complete no-NIC option for air-gapped systems. The system also has three USB 3.0 ports (2 rear, 1 front), a rear VGA output, a rear COM port, and two SuperDOM ports internally.
Admin control comes from the Aspeed AST2600, which supports IPMI v2.0, Redfish API, Intel Node Manager, Supermicro’s Update Manager, and Supermicro’s SuperDoctor 5 monitoring interface.
The configuration Supermicro sent to us for review contains the following:
- Supermicro SYS-120U-TNR
- Dual Intel Xeon Gold 6330 CPUs (2x28-core, 2.5-3.1 GHz, 2x205W, 2x$1894)
- 512 GB of DDR4-3200 ECC RDIMMs (16 x 32 GB)
- Dual Kioxia CD6-R 1.92TB PCIe 4.0x4 NVMe U.2
- Dual 10GBase-T via X710-AT2
Full support for the system includes:
|CPUs||Dual Socket P+ (LGA-4189); Support 3rd Gen Ice Lake Xeon; Up to 270W TDP, 40C/80T; 7+1 Phase Design Per Socket|
|DRAM||32 DDR4-3200 ECC Slots; Support RDIMM, LRDIMM; Up to 8 TB: 32 x 256 GB LRDIMM; Up to 12 TB: 16 x 512 GB Optane + 16 x 256 GB LRDIMM|
|Storage||12 x SATA Front Panel; Optional PCIe 4.0 x4 NVMe Cabling|
|PCIe||PCIe 4.0 x16 Low Profile; PCIe 4.0 x16 Low Profile (Internal); 2 x PCIe 4.0 x16 Full Height (10.5-inch length); Ultra Riser for Networking|
|Networking||None by default; Optional X710-AT2 dual 10GBase-T; Optional X710-TM4 dual 10GBase-T + SFP+|
|IO||RJ45 BMC via Aspeed AST2600; 3 USB 3.0 Ports (2 rear, 1 front); 1 x COM; 2 x SuperDOM|
|Fans||8 x 40mm double thick 30k RPM with control; 2 Shrouds, 1 per CPU socket+DRAM|
|Power||1200W Titanium Redundant, Max 100A|
|Management||IPMI 2.0 via Aspeed AST2600; Supermicro OOB License included; Intel Node Manager; KVM with Dedicated LAN; ACPI Power Management|
|Optional||2x M.2 RAID Carrier; Broadcom Cache Vaults; Intel VROC RAID Key; RAID Cards + Cabling; Ultra Riser Cards|
|Note||Sold as assembled system to resellers (2 CPU, 4xDDR, 1xStorage, 1xNIC)|
We reached out to Supermicro for some insight into how this system might be configured for the different verticals.
|Supermicro Ultra-E SYS-120U-TNR
|Cloud Computing||handles all mainstream configs|
|Software Defined Storage||+ or 2U|
|5G/Telco||Ultra-E Short-Depth Version|
Read on for our benchmark results.
Elstar - Saturday, July 24, 2021
> All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.
AVX-512, as an instruction set, was a huge leap forward compared to AVX/AVX2. So much so that Intel created the AVX-512VL extension that allows one to use AVX-512 instructions on vectors smaller than 512-bits. As a vector programmer, here are the things I like about AVX-512:
1) Dedicated mask registers and every instruction can take an optional mask for zeroing/merging results
2) AVX-512 instructions can broadcast from memory without requiring a separate instruction.
3) More registers (not just wider)
Also, and this is kind of hard to explain, but AVX/AVX2 as an instruction set is really annoying because it acts like two SSE units. So for example, you can't permute (or "shuffle" in Intel parlance) the contents of an AVX2 register as a whole. You can only permute the two 128-bit halves as if they were/are two SSE registers fused together. AVX-512 doesn't repeat this half-assed design approach.
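To make the masking and lane points above concrete, here is a scalar Python model of the semantics described in the comment. It is only a sketch: it does not use real intrinsics, the function names `masked_add`, `shuffle_in_lanes`, and `permute_full` are hypothetical, and the vectors are plain lists standing in for registers.

```python
def masked_add(a, b, mask, src=None):
    """Element-wise a + b under a bit mask (bit i controls element i).

    src=None models zero-masking: inactive elements become 0.
    Otherwise models merge-masking: inactive elements keep src[i].
    """
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (mask >> i) & 1:
            out.append(x + y)
        else:
            out.append(0 if src is None else src[i])
    return out


def shuffle_in_lanes(vec, idx, lane_width):
    """Shuffle where each output element may only pick from its own lane."""
    out = []
    for i, j in enumerate(idx):
        base = (i // lane_width) * lane_width   # start of this element's lane
        out.append(vec[base + (j % lane_width)])
    return out


def permute_full(vec, idx):
    """Permute where any output element may pick any input element."""
    return [vec[j % len(vec)] for j in idx]


a, b = [1, 2, 3, 4], [10, 20, 30, 40]
zeroed = masked_add(a, b, 0b0101)                    # elements 0 and 2 active
merged = masked_add(a, b, 0b0101, src=[9, 9, 9, 9])  # inactive elements keep 9

vec = list(range(8))          # eight 32-bit elements, i.e. two 128-bit lanes
reverse = [7, 6, 5, 4, 3, 2, 1, 0]
in_lane = shuffle_in_lanes(vec, reverse, lane_width=4)
whole = permute_full(vec, reverse)

print(zeroed, merged)
print(in_lane, whole)
```

With the same reversing index, the in-lane model only reverses each half separately, while the full permute reverses the whole register, which is the difference the comment is pointing at.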
mode_13h - Sunday, July 25, 2021
> 1) Dedicated mask registers and every instruction can take an optional
> mask for zeroing/merging results
This seems like the only major win. The rest are just chipping at the margins.
More registers is a win for cases like fitting a larger convolution kernel or matrix row/column in registers, but I think it's really the GP registers that are under the most pressure.
AVX-512 is not without its downsides, which have been well-documented.
Spunjji - Monday, July 26, 2021
@Elstar - Interesting info. Just makes me more curious as to how many of these things might be benefiting the 3DPM workload specifically. Another good reason for more people to get eyes on the code!
Dolda2000 - Saturday, July 24, 2021
> All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.
I don't remember where it was posted any longer (it was in the comment section of some article over a year ago), but apparently 3DPM makes heavy use of wide (I don't recall exactly how wide) integer multiplications, which were made available in vectorized form in AVX-512.
dwbogardus - Saturday, July 24, 2021
Performance optimization is converged upon from two different directions: 1) the code users run to perform a task, and 2) the compute hardware upon which the code is intended to run. As an Intel engineer, for some time I was in a performance evaluation group. We ran many thousands of simulations of all kinds to quantify the performance of our processor and chipset designs before they ever went to silicon. This was in addition to our standard pre-silicon validation. Pre-silicon performance validation was to demonstrate that the expected performance was being delivered. You may rest assured that every major silicon architectural revision or addition to the silicon and power consumption was justified by demonstrated performance improvements. Once the hardware is optimized, then the coders dive into optimizing their code to take best advantage of the improved hardware. It is sort of like "double-bounded successive approximation" toward a higher performance target from both HW and SW directions. No surprise that benchmarks are optimized to the latest and highest-performing hardware.
GeoffreyA - Sunday, July 25, 2021
Fair enough. But what if the legacy code path, in this case AVX2, were suboptimal?
mode_13h - Sunday, July 25, 2021
> You may rest assured that every major silicon architectural revision
> or addition to the silicon and power consumption was justified
> by demonstrated performance improvements.
Well, it looks like you folks failed on AVX-512 -- at least, in Skylake/Cascade Lake:
I experienced this firsthand, when we had performance problems with Intel's own OpenVINO framework. When we reported this to Intel, they confirmed that performance would be improved by disabling AVX-512. We applied *their* patch, effectively reverting it to AVX2, and our performance improved substantially.
I know AVX-512 helps in some cases, but it's demonstrably false to suggest that AVX-512 is *only* an improvement.
However, that was never the point in contention. The question was how well 3DPM performs with an AVX2 code path that's optimized to the same degree as the AVX-512 path. I fully expect AVX-512 would still be faster, but probably more in line with what we've seen with other benchmarks. I'd guess probably less than 2x.
mode_13h - Thursday, July 22, 2021
> a modern dual socket server in a home rack with some good CPUs
> can no longer be tested without ear protection.
When I saw the title of this review, that was my first thought. I feel for you, and sure wouldn't like to work in a room with these machines!
firstname.lastname@example.org - Thursday, July 22, 2021
Why is this still relevant? You can buy CPU 'cards' and stick them in a chassis, using less power and costing as much or less.
mode_13h - Friday, July 23, 2021
Are you referring to blade servers? But they don't have the ability to host PCIe cards or a dozen SSDs like this thing does. I'm also not sure how their power budget compares, nor how much RAM they can have.
Anyway, if all you needed was naked CPU power, without storage or peripherals, then I think OCP has some solutions for even higher density. However, not everyone is just looking to scale massive amounts of raw compute.