From Larrabee to Knights Ferry
Intel’s MIC (Many Integrated Core, pronounced “Mike”) began as a hybrid GPU/HPC (high-performance computing) product known as Larrabee. Intel officially announced Larrabee in 2007 and soon claimed that the card would usher in a new era of ray-traced video games and incredible performance. Intel eventually shelved its GPU ambitions once it became clear that Larrabee wasn’t going to match the high-end hardware then available from Teams Green and Red, and rebranded Larrabee as an HPC-only part. The new design was dubbed Knights Ferry, and Intel began shipping it to HPC developers in 2010.
So how much of Larrabee is left in Intel’s MIC? It depends on where you look. All of the display hardware and integrated logic necessary to drive a GPU is gone, but the number-crunching capabilities of the cores themselves appear largely unchanged. One known difference: while Larrabee and Knights Ferry (KNF) focused on single-precision floating-point math, the upcoming Knights Corner will offer strong double-precision support as well. Compare Larrabee’s block diagram, above, with Knights Ferry, below.
A knight’s tale
So let’s talk about Knights Corner/Xeon Phi. Xeon Phi scales up the Knights Ferry design; Intel isn’t giving many details yet, but we know the architecture will pack 50 or more cores and at least 8GB of RAM. In this space, total available memory is an important feature. Knights Ferry, with its 32 cores and a maximum of 2GB of RAM, could only offer 64MB of RAM per core; a 50-core Xeon Phi with 8-16GB of RAM would offer between 163MB and 327MB per core.
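The per-core memory figures above are simple division. A quick sketch, with the caveat that the configurations below are the figures quoted in the text rather than confirmed Xeon Phi specs:

```c
#include <assert.h>

/* RAM per core in MB, given total card RAM in GB and the core count.
   Configurations used here are the article's quoted figures, not
   confirmed Xeon Phi specs. */
static double mb_per_core(double ram_gb, int cores) {
    return ram_gb * 1024.0 / cores;
}
```

Running the numbers: `mb_per_core(2, 32)` gives Knights Ferry’s 64MB per core, while a 50-core Xeon Phi lands at ≈163.8MB with 8GB of RAM and ≈327.7MB with 16GB.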
It’s logical to think Intel’s core counts and RAM loadout will vary depending on yields and customer needs. Customers with large, regular datasets might see better scaling from a 50-core chip with 16GB of RAM, while smaller datasets might do best with an 8GB card and 64 cores. The layout of the Aubrey Isle die at the heart of Knights Ferry, pictured above, makes a 64-core target chip a strong possibility, with varying numbers of cores disabled to improve yields.
The cores at the heart of Intel’s first Xeon Phi are based on the P54C revision of the original Pentium, and appear largely unchanged from the design Intel planned to use for Larrabee. Despite some squabbling from Team Green, we recommend not conflating “based on” with “hampered by.” Intel returned to the P5 microarchitecture for Larrabee because it made good sense to do so – but Knights Corner isn’t a bunch of early-1990s hardware glued to a PCB.
Intel has added 64-bit support, larger on-die caches (the Pentium Classic never had an on-die L2, or an L1 with one-cycle latency), a 512-bit bi-directional ring bus that ties the entire architecture together, advanced power-management circuitry, and 32 512-bit vector registers. It’s those vector registers that give Xeon Phi its oomph – a top-end Core i7 today has just 16 256-bit AVX registers.
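In lane terms, a 512-bit register holds 16 single-precision values to AVX’s eight, so each vector instruction does twice the work. A loop like the hypothetical kernel below is the sort of code the compiler maps onto those registers:

```c
#include <assert.h>

/* Values per vector register: register width / element width.
   Knights Corner: 512/32 = 16 floats; AVX: 256/32 = 8 floats. */
static int lanes(int reg_bits, int elem_bits) {
    return reg_bits / elem_bits;
}

/* A classic vectorisable kernel (SAXPY): the compiler can turn each
   group of 16 iterations into a single 512-bit vector operation. */
static void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Double the lanes per instruction, and four times as many registers for the compiler to juggle, is where the “oomph” comes from.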
From a computational perspective, calling Knights Corner a “modified Pentium” is like calling the starship Enterprise a modified space shuttle. The updated P54C core is better thought of as a launch gantry; it’s the framework Intel used for creating something new, not the vehicle itself.
Is Knights Corner x86 compatible? Mostly – or, perhaps more accurately, it’s x86-compatible enough. Intel’s software blog states the following: “Programs written in high-level languages (C, C++, Fortran, etc) can easily remain portable despite any ISA or ABI [application binary interface] differences. Programming efforts will centre on exploiting the high degree of parallelism through vectorisation and scaling: Vectorisation to utilise Knights Corner vector instructions and scaling to use more than 50 cores. This has the familiarity of optimising to use a highly-parallel SMP system based on CPUs.”
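In practice, the “vectorisation and scaling” Intel describes looks like ordinary OpenMP code. A minimal, hypothetical sketch: the pragma spreads the outer loop across the card’s 50+ cores, while the compiler vectorises the inner arithmetic (compiled without OpenMP support, the pragma is simply ignored and the code runs serially):

```c
#include <assert.h>

/* Hypothetical kernel: scale every element of a rows x cols matrix.
   "Scaling": #pragma omp parallel for splits the rows across cores.
   "Vectorisation": the compiler maps the inner loop onto the
   512-bit vector units. */
static void scale_matrix(int rows, int cols, float *m, float factor) {
    #pragma omp parallel for
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            m[r * cols + c] *= factor;
}
```

The point of Intel’s quote is that this is exactly what you would write for a multi-socket Xeon system; targeting Knights Corner is, in principle, largely a recompile.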
There are a handful of x86/x86-64 instructions, including a few fairly common ones, that KNC won’t support. The vector instructions that KNF and KNC introduced are also unique to those parts – KNC doesn’t support traditional SIMD extensions like MMX, SSE, or AVX… yet. That “yet” is important, because it’s virtually guaranteed that Intel will cross-pollinate its instruction sets at some point in the future. The Transactional Synchronisation Extensions (TSX) set to debut in Haswell might be extremely useful for Knights Corner’s successor.
Intel hasn’t allowed anyone to release hard data on Knights Ferry performance, and there’s absolutely no info available on Xeon Phi. However, when we spoke with Dr. Glenn Brook, a research engineer with NICS (National Institute for Computational Sciences), he indicated that Intel’s KNF delivered a solid combination of strong scaling, power efficiency, and overall performance. The various papers highlighted at the TACC-Intel Highly Parallel Computing Symposium this past spring are generally optimistic regarding KNC’s ability to deliver capabilities that the HPC industry will find useful.
It’s important to note that the gains weren’t universal. A team from the National Centre for Atmospheric Research presented a paper on KNF’s performance in their climate models and reported that: “In general, we observed excellent scalability with both models on KNF. However, single-thread performance is poor with or without vectorisation enabled, which may indeed be the cause of the exceedingly favourable scalability.”
Intel’s support for OpenMP/MPI, and the fine-grained control of processor resources available on KNF, were also highlighted as significant features. One of the major challenges in HPC research is apparently the need to juggle small jobs with large ones in order to make the most efficient use of available computing resources. Knights Corner is designed to address that concern; the card can be deployed as part of an HPC cluster, or treated as a stand-alone co-processor for crunching specific tasks.
It’s also designed to be flexible in terms of software and hardware. We expect, for example, to see some form of Turbo Boost debut on KNC, though likely under a different name. Scaling clock speeds depending on total co-processor workload is one way Intel can improve performance, and it’s one way of offsetting relatively low single-thread efficiency.
KNC can communicate with other MIC cards across the PCIe bus and can even kick off jobs on other boards, but the PCIe bus offers only a fraction of the bandwidth the card has access to internally. KNF was a PCIe 2.0 solution, and it wouldn’t be surprising if Knights Corner keeps that bus rate – moving to PCIe 3.0 would double effective bandwidth, but even that wouldn’t come close to offsetting the higher access latency and lower bandwidth of going off-card compared with keeping data local.
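A back-of-the-envelope cost model shows why doubling the bus rate helps less than it sounds. Transfer time is roughly a fixed latency plus bytes over bandwidth; the figures used below (PCIe 2.0 x16 ≈ 8GB/s, PCIe 3.0 x16 ≈ 16GB/s, microsecond-scale latency) are illustrative assumptions, not measurements:

```c
#include <assert.h>

/* Rough data-movement cost: fixed latency plus bytes / bandwidth.
   Bandwidth in GB/s, latency in seconds. All figures fed into this
   model are illustrative assumptions, not measured numbers. */
static double transfer_seconds(double bytes, double gb_per_s,
                               double latency_s) {
    return latency_s + bytes / (gb_per_s * 1e9);
}
```

Doubling the bus rate halves only the bandwidth term; the latency term is untouched, and the card’s local memory delivers an order of magnitude more bandwidth with no bus hop at all.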
As important as benchmarks and hard performance figures are, KNC’s real advantage probably won’t come from these metrics.
Software: Intel’s ace in the hole
Most supercomputing articles focus almost entirely on hardware, because software support isn’t as sexy as racks of big iron jammed with high-end components. After reading several papers released at TACC-Intel and discussing the matter with Dr. Brook, I’m increasingly convinced that this hardware-centric approach to coverage misses the larger issues of the field.
One of our questions for Dr. Brook was how much Intel’s vaunted x86 compatibility actually mattered to scientists working in the real world. His response was as follows: “It is also important to realise that many NSF-funded researchers are funded to conduct fundamental science not to port, optimise, or develop scientific codes… As such, many researchers view code development as an overhead cost that should be avoided or minimised when possible, and most are extremely reluctant to abandon years (or decades) of existing code development to transition to a significantly different programming model.” (Emphasis added).
An institution’s decision to adopt MIC or CUDA may be driven less by horsepower and more by accessibility, even in cases where one option offers substantially better real-world performance. Raw compute performance has increased by an order of magnitude in the past seven years. The fastest computer from the June 2005 Top500 list is ranked number 125 today – but the third fastest machine from 2005 doesn’t even make the list. Humans, unfortunately, haven’t gotten any upgrades in the intervening period. Efficiently managing the distribution of all that compute performance remains an extremely challenging task, which is part of why the CPU remains the basic currency of the HPC realm.
This is where Intel’s biggest advantage over Nvidia will come into play, and it’s going to be a damned hard card for the GPU designer to counter. If you program on x86 and you need a performance monitor, thread analyser, optimised library, or compiler, Intel has you covered. The company’s tools aren’t just numerous; they’re sophisticated, and they offer substantial under-the-hood visibility. That visibility is important when it comes to optimising workloads for cluster computing. Latencies and chip-to-chip communication levels that are entirely manageable in a 4S (quad-socket) system can overwhelm a network when scaled across nodes.
Of the 58 Top500 computers with co-processor/accelerators installed, 53 of them use Nvidia hardware. That’s a huge jump from June 2011, when just 12 systems used the company’s CUDA cards, but it pales in comparison to the 374 systems using Intel CPUs. According to Intel, optimising for Xeon Phi can often improve performance on traditional Xeons, even though the two processors don’t share vectorisation capabilities. Unlike Intel, Nvidia can’t build a “pure” Tesla box in which both CPUs and GPUs carry their own brand, and that gives Intel a natural advantage.
None of this, however, means that Nvidia is doomed. Far from it. I wouldn’t be surprised if the upcoming Kepler-based K20 Tesla is capable of beating Knights Corner’s performance in some real-world tests. NV’s upcoming Tesla might also have a size advantage; Xeon Phi’s die is going to be absolutely enormous. Visual estimates on Intel’s Knights Ferry pegged its die size at close to 700 mm2; Knights Corner is built on a much smaller process (22nm vs. 45nm) but it also adds cores. Yield management will be critical for both companies, which is probably part of the reason Nvidia is waiting until Q4 to launch the part.
The coming war between Intel and Nvidia for the supercomputing market will have a real impact on consumer products, even if it takes several years for the research to trickle down. The HPC industry is struggling to deal with problems of power efficiency, interconnect scaling, storage speed, processor utilisation, and communication latency. The mobile phone and tablet markets, meanwhile, are fighting with the very same issues, with the added headache of battery life thrown in. Advances at the top of the market will increasingly shape the bottom (and vice versa). The battle to be the company whose hardware powers both spheres of influence is about to kick off in earnest.