Background: How GPUs Work

Seeing as how this is our first in-depth architecture article on a SoC GPU design (specifically as opposed to PC-derived designs like Intel and NVIDIA), we felt it best to start at the beginning. For our regular GPU readers the following should be redundant, but if you’ve ever wanted to learn a bit more about how a GPU works, this is your place to start.

GPUs, like most complex processors, are composed of a number of different functional units responsible for the different aspects of computation and rendering. We have functional units that setup geometry data, frequently called geometry engine, geometry processors, or polymorph engines. We have memory subsystems that provide caching and access to external memory. We have rendering backends (ROPs or pixel co-processors) that take computed geometry and pixels to blend them and finalize them. We have texture mapping units (TMUs) that fetch textures and texels to place them within a scene. And of course we have shaders, the compute cores that do much of the heavy lifting in today’s games.

Perhaps the most basic question even from a simple summary of the functional units in a GPU is why there are so many different functional units in the first place. While conceptually virtually all of these steps (except memory) can be done in software – and hence done in something like a shader – GPU designers don’t do that for performance and power reasons. So-called fixed function hardware (such as ROPs) exists because it’s far more efficient to do certain tasks with hardware that is tightly optimized for the job, rather than doing it with flexible hardware such as shaders. For a given task flexible hardware is bigger and consumes more power than fixed function hardware, hence the need to do as much work in power/space efficient fixed function hardware as is possible. As such the portions of the rendering process that need flexibility will take place in shaders, while other aspects that are by their nature consistent and fixed take place in fixed function units.

The bulk of the information Imagination is sharing with us today is with respect to shaders, so that’s what we’ll focus on today. On a die area basis and power basis the shader blocks are the biggest contributors to rendering. Though every functional unit is important for its job, it’s in the shaders that most of the work takes place for rendering, and the proportion of that work that is bottlenecked by shaders increases with every year and with every generation, as increasingly complex shader programs are created.

So with that in mind, let’s start with a simple question: just what is a shader?

At its most fundamental level, a shader core is a flexible mathematics pipeline; it is a single computational resource that accepts instructions (a shader program) and executes it in order to manipulate the pixels and polygon vertices within a scene. An individual shader core goes by many names depending on who makes it: AMD has Stream Processors, NVIDIA has CUDA cores, and Imagination has Pipelines. At the same time how a shader core is built and configured depends on the architecture and its design goals, so while there are always similarities it is rare that shader cores are identical.

On a lower technical level, a shader core itself contains several elements. It contains decoders, dispatchers, operand collectors, results collectors, and more. But the single most important element, and the element we’re typically fixated on, is the Arithmetic Logic Unit (ALU). ALUs are the most fundamental building blocks in a GPU, and are the base unit that actually performs the mathematical operations called for as part of a shader program.


And an Imgination PVR Rogue Series 6XT Pipeline

The number of ALUs within a shader core in turn depends on the design of the shader core. To use NVIDIA as an example again, they have 2 ALUs – an FP32 floating point ALU and an integer ALU – either of which is in operation as a shader program requires. In other designs such as Imagination’s Rogue Series 6XT, a single shader core can have up to 7 ALUs, in which multiple ALUs can be used simultaneously. From a practical perspective we typically count shader cores when discussing architectures, but it is at times important to remember that the number of ALUs within a shader core can vary.

When it comes to shader cores, GPU designs will implement hundreds and up to thousands of these shader cores. Graphics rendering is what we call an embarrassingly parallel process, as there are potentially millions of pixels in a scene, most of which can be operated in in a semi-independent or fully-independent manner. As a result a GPU will implement a large number of shader cores to work on multiple pixels in parallel. The use of a “wide” design is well suited for graphics rendering as it allows each shader core to be clocked relatively low, saving power while achieving work in bulk. A shader core may only operate at a few hundred megahertz, but because there are so many of them the aggregate throughput of a GPU can be enormous, which is just what we need for graphics rendering (and some classes of compute workloads, as it turns out).

A collection of Kepler CUDA cores, 192 in all

The final piece of the puzzle then is how these shader cores are organized. Like all processors, the shader cores in a GPU are fed by a “thread” of instructions, one instruction following another until all the necessary operations are complete for that program. In terms of shader organization there is a tradeoff between just how independent a shader core is, and how much space/power it takes up. In a perfectly ideal scenario, each and every shader core would be fully independent, potentially working on something entirely different than any of its neighbors. But in the real world we do not do that because it is space and power inefficient, and as it turns out it’s unnecessary.

Neighboring pixels may be independent – that is, their outcome doesn’t depend on the outcome of their neighbors – but in rendering a scene, most of the time we’re going to be applying the same operations to large groups of pixels. So rather than grant the shader cores true independence, they are grouped up together for the purpose of having all of them executing threads out of the same collection of threads. This setup is power and space efficient as the collection of shader cores take up less power and less space since they don’t need the intelligence to operate completely independently of each other.

The flow of threads within a wavefront/warp

Not unlike the construction of a shader core, how shader cores are grouped together will depend on the design. The most common groupings are either 16 or 32 shader cores. Smaller groupings are more performance efficient (you have fewer shader cores sitting idle if you can’t fill all of them with identical threads), while larger groupings are more space/power efficient since you can group more shader cores together under the control of a single instruction scheduler.

Finally, these groupings of threads can go by several different names. NVIDIA uses the term warp, AMD uses the term wavefront, and the official OpenGL terminology is the workgroup. Workgroup is technically the most accurate term, however it’s also the most ambiguous; lots of things in the world are called workgroups. Imagination doesn’t have an official name for their workgroups, so our preference is to stick with the term wavefront, since its more limited use makes it easier to pick up on the context of the discussion.

Summing things up then, we have ALUs, the most basic building block in a GPU shader design. From those ALUs we build up a shader core, and then we group those shader cores into a array of (typically) 16 or 32 shader cores. Finally, those arrays are fed threads of instructions, one thread per shader core, which like the shader cores are grouped together. We call these thread groups wavefronts.

And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unfied Shading Cluster.

Imagination's PowerVR Rogue Architecture Explored Imagination’s PowerVR Rogue Series 6/6XT USCs Dissected
Comments Locked


View All Comments

  • MrPoletski - Sunday, March 9, 2014 - link

    The reason PoewrVR left the desktop market was simply because ST Microelectronics sold their graphics division. The KYRO was an extremely successful card and would have continued to be. IIRC Via tried to buy it up and carry on selling the KYRO 3 but could not reach a licence deal with STMicro - who claimed copyright on the chip design (the non powervr parts)
  • Scali - Monday, February 24, 2014 - link

    They could... But Imagination is just like ARM: they don't build GPUs themselves, they only license the designs.
    It has been possible for years to license a PowerVR design and scale it up to an interesting desktop GPU. It's just that so far, no company has done that. Probably too big a risk to take, trying to compete with giants such as nVidia and AMD.
    The last desktop PowerVR cards mainly failed because of poor software support. Aside from the drivers not being all that mature, there was also the problem that many games made assumptions that simply would not hold on a TBDR architecture, and rendering bugs were the result.
    If you were to build a PowerVR-based desktop solution today, chances are that you run into quite a lot of incompatibilities with existing games.
  • iwod - Monday, February 24, 2014 - link

    I didn't want the word Apple in it to retrain from trolls and flame war, so i didn't write it out clearly the first time.
    The sole reason why PowerVR failed in the first place were their Drivers And the same reason why most other GPU company failed as well. Much like S3. Drivers in the GPU market means literally everything. It doesn't matter if their GPU is insanely great if it doesn't run any of the latest games and error upon error it simply wont sell. Unlike CPU which you actually get down to the mental programming.

    Nvidia famously pointed out they have much more software engineers then Hardware. Writing a decent performing drivers takes time and money. Hence why not many GPU manufacturer survive. Most of them dont have enough resources to scale. Same goes with PowerVR. I still remember my Kyro Graphics Card I love, until it doesn't work on games I want to play.

    But this time it is different. The Mobile Market has already exceed the PC market and will likely exceed the total GPU shipped in PC + Console Combined! Since the drivers you are writing for Mobile iOS can in many case effectively be used on MacOSX as well. That is why using PowerVR on Mac makes an appealing case.

    May be the industry leader view Tablet / Mobile Phone + Console being the next trend, while PC & Mac will simply relinquish from Gaming?
  • Scali - Tuesday, February 25, 2014 - link

    "The sole reason why PowerVR failed in the first place were their Drivers"

    As I said, it was not necessarily the drivers themselves. A nice example is 3DMark2001. Some scenes did not work correctly because of illegal assumptions about z-buffer contents. When 3DMark2001SE was released, one of the changes was that it now worked correctly on Kyro cards.

    It is unclear where PowerVR stands today, since both their hardware and the 3D APIs and engines have changed massively. The only thing we know for sure is that there are various engines and games that work correctly on iPhone/iPad.
  • Sushisamurai - Monday, February 24, 2014 - link

    Typo: "one thread per shader care, which like the shader cores are grouped together into what we call wavefronts." Should be shader core?

    In "Background: how GPU's work"
  • Ryan Smith - Monday, February 24, 2014 - link

    Indeed it was. Thank you for pointing that out.
  • chinmaythosar - Monday, February 24, 2014 - link

    i wonder how AAPL will handle the FP16 cores ... they are moving to 64bit in CPUs and they would have hoped to move to FP64 in GPU ... it would have given them real talking point in the keynote for iPad 6 (or whatever they call it) .. " next-gen 192 core GPU FP64 architechture ..4x graphics power etc etc .. :P
  • MrSpadge - Saturday, March 1, 2014 - link

    Not sure what AAPL is, but pure FP64 for graphics would be horrible. You don't need the precision but waste lot's of die space and power.
  • xeizo - Monday, February 24, 2014 - link

    They would be more competitive and interesting to use if they published open drivers instead of "open architecture" pics .... :(
  • Eckej - Monday, February 24, 2014 - link

    Couple of small errors/typos:
    Under How Rogues Get Executed: Wavefronts & Superscalar ILP - the diagram should probably have 16, not 20 pipelines - looks like an extra row slipped in!

    The page before: " With Series 6, Imagination has an interesting setup where there FP16 ALUs can process up to 3 operations in one cycle." There should read their.

    Bottom of page 2 "And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unfired Shading Cluster." - Unfired should read Unified.

    Sorry to be picky.

Log in

Don't have an account? Sign up now