Secrets of the PlayStation 4: Seriously modified Radeon, turbocharged APU design

For months, there have been rumours that the PS4 wouldn’t just be more powerful than Microsoft’s upcoming Xbox – it would be capable of certain compute workloads that Redmond’s next-generation console wouldn’t be able to touch. In an interview last week, Sony lead hardware architect Mark Cerny shed some major light on what these capabilities look like. We’ve taken his comments and combined them with what we’ve learned from other sources to build a model of how the PS4 is likely organised, and what it can do.

First, we now know the PS4 is a single system on a chip (SoC) design. According to Cerny, all eight CPU cores, the GPU, and a number of other custom units are on the same die. Typically, when we talk about SoC design, we distinguish between “on-die” and “on-package.”

Components are on-package if they’re part of a finished processor but aren’t fabbed as a single unit. The Wii U, for example, has the CPU and GPU on-package, but not on-die. Building the entire PS4 SoC as a monolithic die could cut costs in the long term and improve performance, but is riskier in the short term.

An overhauled GPU

According to Cerny, the GPU powering the PS4 is an ATI Radeon with “a large number of modifications.” From the GPU’s perspective, the large RAM pool doesn’t count as innovative. The PS4 has a unified pool of 8GB of RAM, but AMD’s Graphics Core Next GPU architecture (hereafter abbreviated to GCN) already ships with 6GB of GDDR5 aboard workstation cards. The biggest change to the graphics processor is Sony’s modification to the command processor, described as follows:

“The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands – the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that’s in the system.”

That’s a fairly bold statement. Let’s look at the relevant portion of the HD 7970’s structure:

Here, you can see the Asynchronous Compute Engines and the GPU Command Processor. AMD has always said that it could add more Asynchronous Compute Engine blocks to this structure to facilitate a greater degree of parallelisation, but we think Cerny mixed his apples and oranges here, possibly on purpose. First, he refers to specific hardware blocks, then segues into discussing queue depths. AMD released a different slide in its early GCN unveils that may shed some additional light on this topic.

Each ACE can fetch queue information from the Command Processor and can switch between asynchronous compute tasks depending on what’s coming next. GCN was designed with some support for out-of-order processing, and it sounds as though Sony has expanded the chip’s ability to monitor and schedule how tasks are executed. It’s entirely possible that Sony has added additional ACEs to GCN to support a greater amount of asynchronous computing capability, but simply stuffing the front of the chip with 62 additional ACEs wouldn’t magically make more execution resources available.
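To make that last point concrete, here is a toy model of command queues feeding a fixed number of execution slots. It is a minimal sketch under our own assumptions: the round-robin arbitration, the slot count, and the names `run`, `make_queues` and `EXECUTION_SLOTS` are all hypothetical, and none of it reflects Sony's or AMD's actual scheduling hardware.

```python
from collections import deque

EXECUTION_SLOTS = 4  # hypothetical: tasks the GPU can start per tick


def make_queues(num_queues, total_tasks):
    """Spread total_tasks as evenly as possible across num_queues."""
    queues = [deque() for _ in range(num_queues)]
    for task_id in range(total_tasks):
        queues[task_id % num_queues].append(task_id)
    return queues


def run(queues, slots_per_tick):
    """Drain every queue and return the number of ticks it took.

    Arbitration is plain round-robin over the queues, which is an
    assumption made for illustration only.
    """
    ticks = 0
    cursor = 0
    n = len(queues)
    while any(queues):
        issued = 0
        empty_seen = 0  # consecutive empty queues visited this tick
        while issued < slots_per_tick and empty_seen < n:
            queue = queues[cursor]
            cursor = (cursor + 1) % n
            if queue:
                queue.popleft()
                issued += 1
                empty_seen = 0
            else:
                empty_seen += 1
        ticks += 1
    return ticks


two_queue_ticks = run(make_queues(2, 256), EXECUTION_SLOTS)
many_queue_ticks = run(make_queues(64, 256), EXECUTION_SLOTS)
print(two_queue_ticks, many_queue_ticks)
```

In this model the same 256 tasks take the same number of ticks whether they arrive through two queues or 64, because the execution slots are the bottleneck. What the model leaves out is the benefit more queues could bring: a finer-grained choice of what to run next alongside graphics work.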

Now we turn our attention to the memory architecture. We know the PS4 uses a 256-bit memory bus and Cerny specifies 176GBps of bandwidth. That works out to a GDDR5 clock speed of 1375MHz (an effective 5.5Gbps per pin), which is comfortably within the current range of GDDR5 products already on the market. We’ve put together a set of what we consider to be the three most likely structures, along with their strengths and their weaknesses.
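The 1375MHz figure can be checked with a few lines of arithmetic. This sketch assumes decimal gigabytes and relies on GDDR5 moving four data bits per pin for every cycle of its command clock; the function name `gddr5_clock_mhz` is ours.

```python
def gddr5_clock_mhz(bandwidth_bytes_per_s, bus_width_bits, transfers_per_clock=4):
    """Return the GDDR5 command clock implied by a bandwidth and bus width."""
    bytes_per_transfer = bus_width_bits // 8           # 256 bits -> 32 bytes
    transfers_per_s = bandwidth_bytes_per_s / bytes_per_transfer
    return transfers_per_s / transfers_per_clock / 1e6


clock = gddr5_clock_mhz(176e9, 256)
print(f"{clock:.0f} MHz")
```

Running it the other way round is a useful sanity check: 1375MHz times four transfers per clock times 32 bytes per transfer gives back 176GBps.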

Option 1: A supercharged APU-style design

AMD has published a great deal of information on Llano and Trinity’s APU design. Llano and Trinity share a common structure that looks like this:

In Llano and Trinity, the CPU-GPU communication path varies a great deal depending on which kind of data is being communicated. The solid line (Onion) is a lower-bandwidth bus (2x16B) that allows the GPU to snoop the CPU cache. The dotted lines are the Radeon Memory Bus (Garlic). This is a direct link between the GPU and the Unified North Bridge architecture, which contains the memory controller.

In the Gamasutra interview, Cerny states the following: “We added another bus to the GPU that allows it to read directly from system memory or write directly to system memory… As a result, if the data that’s being passed back and forth between CPU and GPU is small, you don’t have issues with synchronisation between them anymore… We can pass almost 20 gigabytes a second down that bus. That’s not very small in today’s terms – it’s larger than the PCIe on most PCs!”

That sounds almost exactly like Garlic, with some additional HSA features baked in. Remember, the point of HSA is to allow the CPU and GPU to share a common set of pointers and swap data more efficiently. It suggests that the PS4’s interconnect structure looks something like this:

This simplified structure shows the GPU with the lion’s share of access to memory bandwidth. Both the Onion and Garlic interfaces are faster than they were in Llano, and they’re tied to much faster memory, but they function in the same basic way. This is the most logical design based on what AMD has done before – it incorporates the direct memory bus that Cerny discusses, and it would be the easiest system for AMD to design given the firm’s limited resources.

The disadvantage is that it’s not particularly efficient. This table of available CPU-GPU bandwidth in Llano based on the type of operation being conducted indicates the problem:

Much of the anecdotal information on the PS4 suggests that the chip is designed for a much greater degree of sharing than Llano or Trinity. We suspect that Option 1 would integrate HSA-like features with an APU-style design. AMD would have had to make a number of improvements to bring these various capabilities up to more uniform bandwidth/latency, but these were improvements the company was planning to make with HSA in any case.
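Cerny's comparison of the "almost 20 gigabytes a second" bus with PCIe is easy to put numbers on. The sketch below computes per-direction bandwidth for an x16 slot from the published PCIe 2.0 and 3.0 signalling rates and line encodings. Whether Cerny's figure is per direction or aggregate is not stated in the interview, so treat the comparison as indicative only; the helper name is ours.

```python
def pcie_gbytes_per_s(gigatransfers_per_s, payload_bits, encoded_bits, lanes=16):
    """Usable bandwidth in GB/s, per direction, for a PCIe link."""
    usable_gbits_per_lane = gigatransfers_per_s * payload_bits / encoded_bits
    return usable_gbits_per_lane / 8 * lanes


pcie2_x16 = pcie_gbytes_per_s(5, 8, 10)      # PCIe 2.0: 5 GT/s, 8b/10b encoding
pcie3_x16 = pcie_gbytes_per_s(8, 128, 130)   # PCIe 3.0: 8 GT/s, 128b/130b encoding
ps4_bus = 20                                 # "almost 20", per Cerny

print(f"PCIe 2.0 x16: {pcie2_x16:.2f} GB/s")
print(f"PCIe 3.0 x16: {pcie3_x16:.2f} GB/s")
print(f"PS4 bus:      ~{ps4_bus} GB/s")
```

On those numbers, the quoted figure sits above a single direction of either PCIe generation.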

Option 2: Harkening back to R600 and a modern ring bus

AMD could have opted for a ring bus. Ring buses are great for joining multiple components in a high bandwidth, low latency configuration where data is shared across multiple elements. Intel uses a ring bus for Sandy Bridge and Ivy Bridge, and AMD’s first programmable GPU (R600) used one as well. The advantage of a ring bus is that it’d be simple. Not every component needs the same amount of memory bandwidth (the estimated 176GBps of memory bandwidth would be wasted on the CPUs) so you end up with 20GBps of bandwidth for the CPU cores and 176GBps of bandwidth for the GPU.

Sony has some experience with ring buses – the PS3’s Cell Architecture used one to manage communication between the various processing elements – but we don’t think this is a likely approach for the PS4. There’s no particular problem that a ring bus would solve, and no specific use-case that strongly suggests AMD would adopt one. Intel has used a ring bus in Sandy Bridge and Ivy Bridge, but these GPUs are tiny compared to the 18 CU design that’s built into the PlayStation 4.

The PS4 has multiple independent blocks that aren’t shown in our example above; it uses dedicated hardware for audio processing and zlib compression. A ring bus would be the easy way to connect those elements together. However, this approach wouldn’t explain Cerny’s comment about a dedicated GPU bus to main memory. On a ring bus, there is no “dedicated” path as such, and while you could theoretically build one, your CPUs would still be taking a circuitous path (obviating the advantage of a direct link).
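The "circuitous path" point can be illustrated by counting hops on a bidirectional ring. The ordering of stops below is purely hypothetical (we have no information on how such a ring would be laid out), and `ring_hops` is our own helper.

```python
# Hypothetical ring layout; the real block order, if a ring exists at all,
# is unknown.
RING = ["cpu_module_0", "cpu_module_1", "audio", "zlib",
        "memory_controller", "gpu"]


def ring_hops(stops, src, dst):
    """Shortest number of hops between two stops on a bidirectional ring."""
    n = len(stops)
    clockwise = (stops.index(dst) - stops.index(src)) % n
    return min(clockwise, n - clockwise)


for block in ("gpu", "cpu_module_0", "cpu_module_1"):
    print(block, ring_hops(RING, block, "memory_controller"))
```

Even with the GPU placed next to the memory controller, its traffic still shares the ring with every other stop, and the CPU modules sit two or three hops away in this layout.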

There’s one more option to consider…

Option 3: Re-architect the HD 7000’s memory controller

We already know that the PS4 features a drastically widened Asynchronous Compute Engine (ACE). Could Sony have redesigned GCN’s memory controller to suit its needs? On the surface, this looks promising. The memory controller inside GCN is already capable of talking to GDDR5. It’s high bandwidth, designed for low latency, and modular.

According to the GCN whitepaper: “The memory controllers tie the GPU together and provide data to nearly every part of the system. The command processor, ACEs, L2 cache, RBEs, DMA Engines, PCI Express, video accelerators, Crossfire interconnects and display controllers all have access to local graphics memory. Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels.”
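If the PS4 kept the controller organisation the whitepaper describes (an assumption on our part, not something Sony has confirmed), the 256-bit bus and 176GBps figure quoted earlier would break down as follows:

```python
BUS_WIDTH_BITS = 256         # PS4 memory bus width
CONTROLLER_WIDTH_BITS = 64   # per the GCN whitepaper
CHANNELS_PER_CONTROLLER = 2  # two independent 32-bit GDDR5 channels each
TOTAL_BANDWIDTH_GBPS = 176   # Cerny's figure

controllers = BUS_WIDTH_BITS // CONTROLLER_WIDTH_BITS
channels = controllers * CHANNELS_PER_CONTROLLER
per_channel_gbps = TOTAL_BANDWIDTH_GBPS / channels

print(f"{controllers} controllers, {channels} channels, "
      f"{per_channel_gbps:.0f} GB/s per channel")
```

That would mean four memory controllers and eight 32-bit channels, each carrying 22GBps at peak.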

Unfortunately, it’s hard to get much insight into what AMD’s crossbar fabric looks like; most diagrams just describe it as a grey bar. Nonetheless, we can visualise how the SoC would be linked at the highest level.

Again, the devil is in the details. AMD designed GCN’s memory controller to handle a wide variety of compute tasks but has never fielded an APU with anything like this architecture. The HD 7000 design would have needed an overhaul to make this work.

According to AMD’s GCN whitepaper, the crossbar sits in front of the GPU’s L2 cache. GCN is designed to allow up to four compute units (CUs) to share a common pool of L2 cache, which is why the crossbar is positioned the way it is.

Move the crossbar to the unified north bridge, and you’ve got to hang a great deal of GPU cache off it at some point. This isn’t shown in our diagram above because it’s not clear where such cache would fit. It also doesn’t address Cerny’s comments about a direct GPU bus to memory – a crossbar solution would obviate the need for such a thing.

The advantage of Option 3 is that it leverages existing AMD IP and repurposes a high bandwidth, low latency interface that’s already designed to communicate with a number of disparate devices. That makes it more likely than Option 2, which is more of a theoretical “AMD could build this” than a practical proposal with any basis in shipping AMD products. AMD’s whitepaper talks about the flexibility of the HD 7000 family’s memory controller and its ability to handle “asynchronous compute” tasks. These are exactly the sort of features Sony is implementing.

Wider compute engines, supercharged buses

After considering the alternatives, Option 1 still seems the most likely approach. Llano and Trinity don’t incorporate the kind of HSA capabilities that Cerny says the PS4 offers, but they could be modified to include them. We know AMD’s first HSA-capable APU will be Kaveri, which will supposedly ship by the end of the year, and it makes sense that the PS4, at least, might incorporate similar capabilities in the same time frame.

HSA (or HSA-like capabilities) could emerge as an important distinction between the Xbox Durango and PS4. It’s possible that Microsoft’s own hardware will also be capable of sharing compute tasks the way Sony is emphasising, but Microsoft has played its cards much closer to its chest. It’s worth noting, however, that all of the Xbox leaks to date have pointed to a conventional CPU-GPU architecture built on exactly the Llano/Trinity model.

Durango, in other words, sounds just like Kabini. And Kabini, according to AMD, doesn’t have next-generation HSA support. Cerny is painting the PS4 as a hybrid that, while also based on AMD’s Jaguar core, incorporates additional features.

Will this be enough to put the PS4 ahead of the Xbox in the next generation? That’s unclear.

The PS3 tried a similar strategy and failed miserably. But this time around, Sony may have picked better targets for its optimisations. It looks as though the PS4 will have specific capabilities that go beyond simply having more graphics cores or a faster CPU – and that could give it the advantage.

Thanks to Peter Bright at Ars Technica for helping to hash out some of the underlying possibilities and for diagram assistance.

Image Credit: Digitoll