CPU Performance

We’ll begin our Kirin 960 performance evaluation by investigating the A73’s integer and floating-point IPC with some synthetic tests. Then we’ll see how the changes to its memory system affect memory latency and bandwidth. Finally, after completing the lower-level tests, we’ll see how Huawei’s Mate 9 and its Kirin 960 SoC perform when running some real-world workloads.

Our first look at the A73’s integer performance comes from SPECint2000, the integer component of the SPEC CPU2000 benchmark developed by the Standard Performance Evaluation Corporation. This collection of single-threaded tests allows us to compare IPC for competing CPU microarchitectures. The scores below are not officially validated numbers, which requires the test to be supervised by SPEC, but we’ve done our best to choose appropriate compiler flags and to get the tests to pass internal validation.

SPECint2000 - Estimated Scores
ARMv8 / AArch64
  Kirin 960 Kirin 950
(% Advantage)
Exynos 7420
(% Advantage)
Snapdragon 821
(% Advantage)
164.zip 1217 1094
175.vpr 4118 3889
176.gcc 2157 1864
181.mcf 1118 664
186.crafty 2222 2083
197.parser 1395 1208
252.eon 3421 3333
253.perlmk 1748 1651
254.gap 1930 1667
255.vortex 2111 1863
256.bzip2 1402 1220
300.twolf 2479 2521

The Kirin 960’s A73 CPU is about 11% faster on average than the Kirin 950’s A72. In addition to the front-end changes discussed on the previous page and the changes to the memory system discussed in the next section, the A73’s integer pipelines have undergone a few tweaks as well. Where the A72 had 3 integer ALUs—2 simple ALUs for basic operations such as addition and shifting and 1 dedicated multi-cycle ALU for complex operations such as multiplication, division, and multiply-accumulate—the A73 only has 2 integer ALUs that are capable of performing both basic and complex operations. This affects performance in different ways. For example, because only one of the A73’s ALUs can handle multiplication while the other handles division, the time to execute multiply or division operations sees no change; however, while an ALU is occupied with a multi-cycle instruction, it cannot execute simple instructions like the A72’s dedicated pipelines can, leading to a potential performance loss. Multiply-accumulate operations, which require both of the A73’s pipelines, incur a similar penalty. It’s not all bad, however. Workloads that perform parallel arithmetic or use certain other complex instructions can see double the execution throughput on A73 versus A72.

Note that the table above does not account for differences in CPU frequency. The Kirin 960’s frequency advantage over the Kirin 950 and Snapdragon 821 is less than 3%, making these numbers easier to compare, but its advantage over the Exynos 7420 is a little over 12%. The chart below accounts for this by dividing the estimated SPECint2000 ratio score by CPU frequency, making IPC comparisons easier.

SPECint2000 64b/32b Estimated Ratio/MHz

Despite the substantial microarchitectural differences between the A73 and A72, the A73’s integer IPC is only 11% higher than the A72’s. This is likely the result of improvements in one area being partially offset by regressions in another. Still, assuming ARM’s power reduction claims hold true, this is not a bad result.

The gap between the A73 and A57 increases to 29%. The integer performance for Qualcomm’s custom Kryo core is well behind ARM’s A73 and A72 cores, essentially matching the A57’s IPC.

Geekbench 4 - Integer Performance
Single Threaded
  Kirin 960 Kirin 950
(% Advantage)
Exynos 7420
(% Advantage)
Snapdragon 821
(% Advantage)
AES 911.3 MB/s 935.6 MB/s
795.8 MB/s
559.1 MB/s
LZMA 3.03 MB/s 2.87 MB/s
2.28 MB/s
2.20 MB/s
JPEG 16.1 Mpixels/s 15.5 Mpixels/s
14.1 Mpixels/s
21.6 Mpixels/s
Canny 22.5 Mpixels/s 26.8 Mpixels/s
23.6 Mpixels/s
30.3 Mpixels/s
Lua 1.70 MB/s 1.55 MB/s
1.20 MB/s
1.47 MB/s
Dijkstra 1.53 MTE/s 1.14 MTE/s
0.92 MTE/s
1.39 MTE/s
SQLite 51.6 Krows/s 43.5 Krows/s
34.0 Krows/s
36.7 Krows/s
HTML5 Parse 8.30 MB/s 6.79 MB/s
6.37 MB/s
7.61 MB/s
HTML5 DOM 2.17 Melems/s 1.92 Melems/s
1.26 Melems/s
0.37 Melems/s
Histogram Equalization 48.7 Mpixels/s 57.0 Mpixels/s
50.6 Mpixels/s
51.2 Mpixels/s
PDF Rendering 44.8 Mpixels/s 45.5 Mpixels/s
39.7 Mpixels/s
53.0 Mpixels/s
LLVM 194.4 functions/s 167.9 functions/s
128.6 functions/s
113.5 functions/s
Camera 5.45 images/s 5.45 images/s
4.95 images/s
7.19 images/s

The updated Geekbench 4 workloads give us a second look at integer IPC. Similar to the SPECint2000 results, we see Kirin 960 showing 5% to 15% gains over Kirin 950 in several of the tests, but there’s a bit more variation overall. The Kirin 960 is actually slower than Kirin 950 in some tests, and, in the case of Canny and Histogram Equalization, its A73 is even slower than the Exynos 7420’s A57. It also falls behind Qualcomm’s Kryo in the JPEG, PDF Rendering, and Camera tests. The tests where the Kirin 960 does well—HTML5 Parse, HTML5 DOM, and SQLite—are very common workloads, though, which should translate into better real-world performance.

Geekbench 4  (Single Threaded) Integer Score/MHz

The chart above accounts for differences in CPU frequency, making it easier to directly compare IPC. Overall the A73 shows only about a 4% improvement over the A72 and about a 12% improvement over the A57 in this group of workloads, considerably less than what we saw in SPECint2000; however, with margins ranging from 33.5% in Dijkstra to -16.1% in Canny, it’s impossible to make any sweeping statements about the A73’s integer performance being better or worse than the A72’s.

Qualcomm’s Kryo CPU falls just behind the A57 once again despite posting better results in many of the Geekbench integer tests. Its poor performance in LLVM and HTML5 DOM weighs heavily on its overall score.

I’ve also included results for ARM’s in-order A53 companion core. The A73’s integer IPC is 1.7x to 2x higher overall, which illustrates why octa-core A53 SoCs are so much slower, particularly in Web browsing, than designs that use 2-4 big cores (A73/A72/A57) instead of 4 additional A53s.

Geekbench 4 - Floating Point Performance
Single Threaded
  Kirin 960 Kirin 950
(% Advantage)
Exynos 7420
(% Advantage)
Snapdragon 821
(% Advantage)
N-Body Physics 838.4 Kpairs/s 896.9 Kpairs/s
634.5 Kpairs/s
1156.7 Kpairs/s
Rigid Body Physics 5891.4 FPS 6497.4 FPS
4662.7 FPS
7171.3 FPS
Ray Tracing 221.9 Kpixels/s 216.9 Kpixels/s
136.1 Kpixels/s
298.3 Kpixels/s
HDR 7.46 Mpixels/s 7.57 Mpixels/s
7.17 Mpixels/s
10.8 Mpixels/s
Gaussian Blur 23.6 Mpixels/s 28.6 Mpixels/s
24.4 Mpixels/s
48.5 Mpixels/s
Speech Recognition 12.8 Words/s 8.9 Words/s
10.2 Words/s
10.9 Words/s
Face Detection 501.2 Ksubs/s 518.9 Ksubs/s
435.5 Ksubs/s
685.0 Ksubs/s

With the exception of SFFT and Speech Recognition, the Kirin 960 is generally a little slower than the Kirin 950 in Geekbench 4’s floating-point workloads. This is a bit of a surprise considering that the A73’s NEON execution units are relatively unchanged from the A72’s design, with reduced latency for specific instructions improving NEON performance by 5%, according to ARM. These results are even harder to interpret after factoring in the A73’s lower-latency front end and improvements to its fetch block and memory subsystems. It’s possible that some of these tests are limited by the A73’s narrower decode stage, but given the variation in workloads, this is probably not true for every case. It will be interesting to see if A73 implementations from other SoC vendors show similar results.

Geekbench 4 (Single Threaded) Floating Point Score/MHz

After accounting for the differences in CPU frequency, floating-point IPC for the Kirin 960’s A73 is 3% to 5% lower overall than the A72 but about 3% higher than the older A57. These results, which are a geometric mean of the floating-point subtest scores, are certainly closer to what I would expect, but hide the large performance variation from one workload to the next.

It’s pretty obvious that floating-point performance was Qualcomm’s focus for its custom Kryo core. While integer IPC was no better than ARM’s A57, Kryo’s floating-point IPC is 23% higher than the A72 in Geekbench 4, with particularly strong results in the Gaussian Blur and HDR tests.

Introduction Memory and System Performance


View All Comments

  • Eden-K121D - Tuesday, March 14, 2017 - link

    Samsung only Reply
  • Meteor2 - Wednesday, March 15, 2017 - link

    I think the 820 acquitted itself well here. The 835 could be even better. Reply
  • name99 - Tuesday, March 14, 2017 - link

    "Despite the substantial microarchitectural differences between the A73 and A72, the A73’s integer IPC is only 11% higher than the A72’s."

    Well, sure, if you're judging by Intel standards...
    Apple has been able to sustainabout a 15% increase in IPC from A7 through A8, A9, and A10, while also ramping up frequency aggressively, maintaining power, and reducing throttling. But sure, not a BAD showing by ARM, the real issue is will they keep delivering this sort of improvement at least annually?

    Of more technical interest:
    - the largest jump is in mcf. This is a strongly memory-bound benchmark, which suggests a substantially improved prefetcher. In particular simplistic prefetchers struggle with it, suggesting a move beyond just next-line and stride prefetchers (or at least the smarts to track where these are doing more harm than good and switch them off.) People agree?

    - twolf appears to have the hardest branches to predict of the set, with vpr coming up second. So it's POSSIBLE (?) that their relative shortcomings reflect changes in the branch/fetch engine that benefit
    most apps but hurt specifically weird branching patterns?

    One thing that ARM has not made clear is where instruction fusion occurs, and so how it impacts the two-decode limit. If, for example, fusion is handled (to some extent anyway) as a pre-decode operation when lines are pulled into L1I, and if fusion possibilities are being aggressively pursued [basically all the ideas that people have floated --- compare+branch, large immediate calculation, op+storage (?), short (+8) branch+op => predication like POWER8 (?)] there could be a SUBSTANTIAL fraction of fused instruction going through the system so that the 2-wide decode is basically as good as the 3-wide of A72?
  • fanofanand - Wednesday, March 15, 2017 - link

    Once WinArm (or whatever they want to call it) is released, we will FINALLY be able to compare apples to apples when it comes to these designs. Right now there are mountains of speculation, but few people actually know where things are at. We will see just how performant Apple's cores are once they can be accurately compared to Ryzen/Core designs. I have the feeling a lot of Apple worshippers are going to be sorely disappointed. Time will tell. Reply
  • name99 - Wednesday, March 15, 2017 - link

    We can compare Apple's ARM cores to the Intel cores in Apple laptops today, with both GeekBench and Safari. The best matchup I can find is this:

    (I'd prefer to compare against the MacBook 12" 2016 edition with Skylake, but for some reason there seem to be no GB4 results for that.)

    This compares an iPhone (so ~5W max power?) against a Broadwell that turbo's up to 3.1 GHz (GB tends to run everything at the max turbo speed bcs it allows the core to cool between the [short] tests), and with TDP of 15W.

    Even so, the performance is comparable. When you normalize for frequency, you get that A10 is about 20% better IPC, so drops down to maybe 15% better IPC for Skylake.
    Of course that A10 runs at a lower (peak) frequency --- but also at much lower power.

    There's every reason to believe that the A10X will beat absolutely the equivalent Skylake chip in this class (not just m-class but also U-class), running at a frequency of ?between 3 and 3.5GHz? while retaining that 15-20% IPC advantage over Skylake and at a power of ?<10W?
    Hopefully we'll see in a few weeks --- the new iPads should be released either end March or beginning April.

    Point is --- I don't see why we need to wait for WinARM server --- specially since MS has made no commitment to selling WinARM to the public, all they've committed to is using ARM for Azure.
    Comparing GB4 or Safari on Apple devices gives us comparable compilers, comparable browsers, comparable OSs, comparable hardware design skill. I don't see what a Windows equivalent brings to the table that adds more value.
  • joms_us - Wednesday, March 15, 2017 - link

    Bwahaha keep dreamin iTard, GB is your most trusted benchmark. =D

    Why don't you run both machine with A10 and Celeron released in 2010. You will see how pathetic your A10 is in realworld apps.
  • name99 - Wednesday, March 15, 2017 - link

    When I was 10 years old, I was in the car and my father and his friend were discussing some technical chemistry. I was bored with this professional talk of pH and fractionation and synthesis, so after my father described some particular reagent he'd mixed up, I chimed in with "and then you drank it?", to which my father said "Oh be quiet. Listen to the adults and you might learn something." While some might have treated this as a horrible insult, the cause of all their later failures in life, I personally took it as serious advice and tried (somewhat successfully) to abide by it, to my great benefit.
    Thanks Dad!

    Relevance to this thread is an exercise left to the reader.
  • joms_us - Wednesday, March 15, 2017 - link

    Even the latest Ryzen is just barely equal or faster than Skylake clock per clock so what makes you think a worthless low-powered mobile chip will surpass them? A10 is not even better than SD821 on real-world apps comparison. Again real-world apps not Antutu, not Geekbench. Reply
  • zodiacfml - Wednesday, March 15, 2017 - link

    Intel's chips are smaller than Apple's. Apple also has the luxury to spend much on the SoC. Reply
  • Andrei Frumusanu - Tuesday, March 14, 2017 - link

    Stamp of approval. Reply

Log in

Don't have an account? Sign up now