Why AMD should take note of Intel’s playbook and ditch Steamroller

AMD’s Kaveri is, in many respects, a huge step forward. The new APU’s low-power performance is excellent, its integrated graphics torpedo anything Intel offers at an equivalent price point, and it includes support for features like Mantle, HSA, and TrueAudio. Yet, despite these lauded capabilities, there’s a clear problem sitting in the middle of Kaveri like a turd in the proverbial punchbowl: Steamroller.

Despite significant improvement in the low-power segment, Steamroller remains fundamentally incapable of matching Intel clock-for-clock. HSA might one day help address the problem, but it’ll be years before HSA-compatible software is readily available.

It’s time to take a page from Intel’s book and dump the core. The good news is, much in the same way that Intel’s Pentium M core would eventually replace NetBurst, AMD already has a core that’s capable of stepping into Steamroller’s shoes – it just needs to be fine-tuned for the role.

How killing the Pentium 4 saved Intel

The Pentium M (codename: Banias) was created because Intel recognised that the Pentium 4 could not address the mobile market effectively. The Pentium M design team took Intel’s older Pentium III core (Tualatin) and optimised it for high efficiency and low power.

Banias used the P4’s quad-pumped front-side bus, added support for SSE2, and inherited the sophisticated branch prediction unit that the P4 relied on to keep its 20-stage pipeline fed. Over the next few years, as it became increasingly clear that the P4 had run out of steam, Intel cross-pollinated between the two architectures. Efficiency-boosting technologies like SpeedStep and the Pentium M’s indirect branch predictor were ported to the P4. In the long run, it was the Pentium M that gave Intel a path to the Core 2 Duo and Nehalem architectures – not the broken, fundamentally flawed Pentium 4.

Could Kabini’s Jaguar core do something similar? Let’s find out.

Calculating relative efficiency between Kabini, Kaveri, and Richland

The simplest way to measure the efficiency of the chips is to divide each one’s benchmark score in a given application by (CPU Frequency * Core Count). This normalises for both clock speed and core count, and gives us a measure of intrinsic core performance. The next step is to express Kabini’s clock-and-core normalised figure as a percentage of its big-core rival’s. In a test like Cinebench, a result of less than 100 per cent indicates that Kabini is less efficient than its big-core rival, while a result of greater than 100 per cent means Kabini is more efficient.
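To make the arithmetic concrete, here is a minimal Python sketch of that normalisation. The function names are ours, and the scores, clocks, and core counts are placeholder values chosen purely for illustration. They are not measured results for any chip.

```python
def normalised_score(score, clock_ghz, cores):
    """Benchmark score divided by (CPU frequency * core count)."""
    return score / (clock_ghz * cores)


def relative_efficiency(small_core, big_core):
    """Small-core normalised score as a percentage of the big-core one.

    Each argument is a (score, clock_ghz, cores) tuple.
    Below 100 means the small core does less work per clock per core;
    above 100 means it does more.
    """
    return 100.0 * normalised_score(*small_core) / normalised_score(*big_core)


# Placeholder inputs for illustration only, NOT real benchmark data.
small_core = (2.0, 2.0, 4)  # (score, GHz, cores)
big_core = (3.6, 4.0, 4)

print(f"{relative_efficiency(small_core, big_core):.1f} per cent")
```

Note that this treats performance as scaling linearly with both clock speed and core count, which is a simplification. Real workloads rarely scale perfectly with either.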

Our test data was drawn from both our own tests and test results published at other major industry sites. The first set of efficiency figures compares only real-world results (10 tests in total), while the second set is based on results in 18 synthetic and real-world tests. Even if we omit the synthetic tests, where Kabini does quite well, the core is still extremely competitive with AMD’s “big core” architecture, with an efficiency gap of less than 10 per cent. More importantly, there’s low-hanging fruit that would narrow that gap further: turning the L2 cache back up to full speed would help, as would more aggressive branch prediction.

Why ditching Steamroller is the right move for AMD

It’s true that the Steamroller core is now more efficient than Kabini, something Piledriver never managed, but AMD is still paying some significant costs for that performance. Namely, the following…

It’s (comparatively) huge: We know from previous AMD disclosures that one Jaguar CPU core is 3.1mm sq. Our best estimates put Steamroller at 9.6mm sq, or more than three times the size of a Jaguar CPU. Neither of these figures includes cache, which would tilt the equation still further towards Kabini as the more efficient solution. A “big core” version of Jaguar would be significantly larger than the current chip, but AMD has room to improve Kabini’s performance while still coming in below Steamroller’s size.
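As a quick check on the size comparison, the short Python sketch below works through the two figures quoted above. The variable names are ours, and the Steamroller number is this article’s estimate, not an AMD disclosure.

```python
JAGUAR_MM2 = 3.1       # one Jaguar core, per AMD disclosures (cache excluded)
STEAMROLLER_MM2 = 9.6  # this article's estimate (cache excluded)

# How many times larger Steamroller is than a single Jaguar core.
area_ratio = STEAMROLLER_MM2 / JAGUAR_MM2

# Area a beefed-up Jaguar could grow by before matching Steamroller.
headroom_mm2 = STEAMROLLER_MM2 - JAGUAR_MM2

print(f"Area ratio: {area_ratio:.2f}x")
print(f"Growth headroom: {headroom_mm2:.1f}mm sq")
```

The headroom figure is only a rough ceiling. It says nothing about how much performance each additional square millimetre would actually buy.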

It’s a poor fit for modern foundries: The overwhelming emphasis today at both TSMC and GlobalFoundries is on delivering low-power, high-efficiency parts. Bulldozer was designed with the assumption that GlobalFoundries would deliver an aggressive process node with specific characteristics that would allow AMD’s new chip to hit its frequency targets. As recently as 2011, GF was talking about its plans for multiple 20nm process nodes, including a Super High Performance (SHP) line for AMD’s future parts.

Today, such plans are dust. TSMC and GF are both implementing one 20nm process node, and it’s not designed for high-performance chips. Bulldozer’s reliance on high frequencies will make it difficult for AMD to hit its 65W target for the next generation of Excavator parts.

Kabini is a better fit for AMD’s long-term goals: It’s Jaguar, not Steamroller, that powers the new consoles from Sony and Microsoft. The die space AMD would save by moving to Jaguar from Steamroller could be used to solve the bandwidth bottleneck that plagues AMD’s integrated GPUs. Photo estimates of the Xbox One die suggest that its two memory controllers are 16-17mm sq. It’s not clear if AMD would need memory controllers this large, but even if it did, Jaguar cores allow for that kind of allocation without blowing the reticle size or budget. Meanwhile, a Jaguar-derived CPU retains better scaling characteristics than its Steamroller cousin.

If heterogeneous computing takes off in enterprise and server the way AMD hopes it will, Kabini-derived SoCs could marry 4-12 svelte x86 CPUs with massive on-die GPU compute engines, with the whole system tied back to main memory over a quad-channel DDR4 solution.

The best-case timeline

During its Q3 2013 conference call, AMD CEO Rory Read noted that the company would begin taping out new 20nm designs “in the next couple of quarters.” That means no Excavator core until 2015 or so. While I think AMD should absolutely build a Jaguar variant to target the current Steamroller market, it takes time to do so. Banias debuted in March 2003, but the first Core 2 Duo didn’t arrive until August 2006. If AMD got started immediately, they might plausibly have a design ready to go for either GloFo’s 14nm-XM or TSMC’s 16nm FinFET.

It took Intel multiple iterations of the Pentium M (pictured above) to close the gap between it and the P4, but the end result was a far better processor. AMD made significant IPC gains with Steamroller, but most of those improvements were choked off by frequency cuts, and the problem isn’t getting any better. The chance that AMD will be able to move IPC substantially forward while also dramatically improving performance per watt is exceedingly small. I’d congratulate the Steamroller designers for their achievements, but it doesn’t change the fact that this is the wrong architecture to drive AMD’s future products.

AMD already has a solution to this problem. Question is, will they use it?