There are undoubtedly challenges facing the supercomputing industry when it comes to the efforts to hit exascale compute levels by the end of the decade. The problem has been widely discussed at supercomputing conferences, so we’re not surprised that Horst Simon, the Deputy Director at the Lawrence Berkeley National Laboratory’s NERSC (National Energy Research Scientific Computing Centre), has spent a significant amount of time talking about the problems with reaching exascale speeds.
But putting up $2,000 (£1,300) of his own money in a bet that we won’t hit exascale by 2020? That caught us off guard.
The exascale rethink
Simon lays out, in a 60-plus-page slideshow, why he doesn’t think we’ll hit the exascale threshold within seven years. The bottom line is this: Hitting exascale compute levels requires a fundamental rethink of virtually the entire computation process.
One of the biggest problems standing in our way is power – not just the power required to run a task on a CPU, but the power required to share that data across the chip, node, and cluster. Data has to be written back to RAM, then shared across multiple systems. Caches must be kept coherent, calculation results written to storage, and new information loaded into RAM.
Power efficiency, measured on a per-core basis, is expected to continue improving for multi-core and many-core architectures, but interconnect power consumption hasn’t scaled nearly as well. This leads to a long-term problem – by 2018, it’ll cost more energy to move the data for a floating-point operation off-die than to perform the calculation locally. This might seem like a trivial problem – and at the consumer end of the market (and by consumer, I mean anything up to a dual-socket workstation), that’s true enough.
But that’s precisely where exascale-level problems stab our concept of efficiency in the back. One exaflop is one quintillion (10^18) floating-point operations per second. When you’re working with 18 zeroes, picojoules suddenly start adding up. It’s theoretically possible to hit exascale computing with current levels of technology if you can afford to dedicate 100 megawatts of power to the task, but the challenge is to bring exascale into the 20-30 megawatt range.
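To see why picojoules matter at this scale, here is a back-of-the-envelope sketch in Python. It assumes a sustained rate of exactly 10^18 operations per second and spreads the machine's entire power budget evenly across those operations; the function name is purely illustrative and does not come from Simon's slides.

```python
EXAFLOP = 10**18  # floating-point operations per second


def picojoules_per_flop(power_megawatts, flops_per_second=EXAFLOP):
    """Average energy available per operation, in picojoules.

    Simplifying assumption: the whole power budget is divided evenly
    across operations, so the figure has to cover data movement,
    memory and storage traffic as well as the arithmetic itself.
    """
    watts = power_megawatts * 1e6              # 1 MW = 1e6 J/s
    joules_per_flop = watts / flops_per_second
    return joules_per_flop * 1e12              # 1 J = 1e12 pJ


if __name__ == "__main__":
    for megawatts in (100, 30, 20):
        budget = picojoules_per_flop(megawatts)
        print(f"{megawatts:>3} MW -> {budget:.0f} pJ per operation")
```

Under these assumptions, every megawatt in the budget works out to one picojoule per operation, which is why a few picojoules spent shuttling a value between chips or nodes is enough to blow the target.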
Nvidia, Intel, and AMD (maybe) to the rescue… or not
Many-core architectures have been billed as the solution to the short-term scaling problem. The good news is that they definitely helped – the most efficient systems in the Top 500 list of supercomputers are all using Intel, Nvidia, or (in one case) AMD hardware to hit their targets.
The bad news is that this upgrade is basically a one-time deal. Supercomputing clusters that don’t currently use many-core architectures will continue to adopt them and will see benefits from doing so. Nvidia and Intel will deliver new products that improve performance-per-watt and increase efficiency. These improvements, however, are going to be iterative. That doesn’t mean many-core chips aren’t going to drive new records for performance efficiency – but it does mean that such improvements, in and of themselves, aren’t going to be enough to bridge the exascale gap.
But here’s the thing: What if the focus on “exascale” is actually the wrong way to look at the problem? It’s not an unprecedented question: The PC market clung to MHz as a primary measure of performance for more than two decades until Intel broke the system with the Pentium 4. Ironically, that break turned out to be a good idea long term. The marketing campaigns that companies like Apple and AMD launched in the late 1990s and early 2000s began communicating the message that MHz was a poor performance metric a few years before the multi-core transition and Core 2 Duo launch dumped MHz as a focus.
FLOPS has persisted as a metric in supercomputing even as core counts and system density have risen, but the peak performance of a supercomputer may be a poor measure of its usefulness. The ability to efficiently utilise a subset of the system’s total performance capability is extremely important. In the long term, FLOPS are easier than moving data across nodes. Taking advantage of parallelism becomes even more important. Keeping data local is a better way to save power than spreading the workload across nodes, because as node counts rise, concurrency consumes an increasing percentage of total system power.
Why exascale matters
The final slides of the presentation show why exascale processing is so important, even if Simon doesn’t think we’ll get there in the next seven years. Better climate simulation, battery technology research, and internal combustion engine design all depend on it. Safeguarding the nation’s aging nuclear weapon reserves is a critical task for the next decade. And if we ever intend to simulate a human brain, exascale is a necessity.
Right now, our best simulations are at 4.5 per cent of human scale, running at 1/83 real-time speed. At one exaflop, we’re (potentially) up to human-scale and real-time processing, if we can leverage conventional CMOS effectively. But to put that in perspective, compare our brain’s efficiency against the projected best-case scenario for an exaflop system.
This is something John Hewitt covered extensively in an article earlier this year. The human brain is mind-bogglingly efficient when compared to even the best semiconductors we’ll be able to build in 2020. Despite this, being able to simulate a mind – even at one million times the power consumption – would be an enormous step forward for technology. The overall point of Simon’s talk wasn’t that exascale is impossible, but that it’s vital, despite the high price tag, slow advancement, and the need to revisit virtually every assumption that’s driven supercomputing forward for the past 30 years.
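The brain figures quoted above can be sanity-checked with a short Python sketch. Two things here are assumptions rather than claims from Simon's presentation: that simulation cost scales linearly with both model size and simulation speed, and that the brain draws roughly 20 watts (a commonly cited estimate). The 20 megawatt figure is the lower end of the exascale power target mentioned earlier.

```python
def required_speedup(fraction_of_scale, fraction_of_real_time):
    """Extra compute needed to reach full scale in real time.

    Assumes cost grows linearly with model size and with simulation
    speed, which is a simplification.
    """
    return (1 / fraction_of_scale) * (1 / fraction_of_real_time)


def power_ratio(machine_watts, brain_watts):
    """How many times more power the machine draws than the brain."""
    return machine_watts / brain_watts


if __name__ == "__main__":
    # 4.5 per cent of human scale at 1/83 real-time speed (from the article)
    speedup = required_speedup(0.045, 1 / 83)
    print(f"roughly {speedup:,.0f}x more compute needed")

    # 20 MW exascale target vs an assumed 20 W brain
    ratio = power_ratio(20e6, 20)
    print(f"roughly {ratio:,.0f}x the brain's power draw")
```

With a 20 watt brain and a 20 megawatt machine, the ratio comes out at about one million, in line with the "one million times the power consumption" figure above.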