Adapteva launches 64-core Epiphany-IV

Many-core start-up Adapteva has announced its latest creation: a 64-core microprocessor which draws less than 2W of power, making it - the company claims - the most energy-efficient microprocessor in the world. Andreas Olofsson, the company's chief executive, talks us through what's new with the company.

We first covered Adapteva's dramatic performance claims back in May, when the company announced a third-generation chip design based on the Epiphany architecture. Simply put, Epiphany is a many-core co-processor designed to sit next to a host processor and accelerate highly-parallel tasks.

It's not a new approach - floating-point acceleration units were once common add-ons for slower systems, before becoming integrated into the CPU itself in modern designs - but Adapteva's mesh-based design promises dramatic performance-per-watt gains.

The company's Epiphany-III 16-core design, a 1GHz co-processor offering full C/C++ programmability in a general-purpose architecture, offered dramatic performance in a 0.25W power envelope. At the time, Olofsson predicted that a 64-core implementation would require around 1W of power - and while the company's latest product doesn't quite hit that level, it's still an impressive achievement.

"65nm was basically a five-year-old process, and we had to normalise our numbers to compete with the leading edge processors," Olofsson admitted during his interview with thinq_, referring to his company's initial Epiphany chip. "Now that we're at 28nm we're basically on-par in terms of technology, so our architecture shines that much more."

Dubbed the Epiphany-IV, Adapteva's latest chip uses a 28nm process size along with microarchitecture improvements to pack 64 superscalar RISC cores into a single chip drawing just 1.7W of power for an overall efficiency of 70 gigaflops per watt.

"It's not quite 4x performance," Olofsson explained. "We have four times the cores, but we've taken the frequency down a little bit from 1GHz to 800MHz. Basically, about a 2x factor in performance comes from energy efficiency gains from the process size, and some other things are done in the microarchitecture at the design level that make that even better."

Despite not hitting a four-fold boost over the company's 16-core chip, the company's performance claims are breathtaking: at 70 gigaflops per watt, the Epiphany-IV chip blows past the 50 gigaflops per watt goal of Intel's exascale computing project as described by Kirk Skaugen at the International Supercomputing Conference earlier this year. Even though Olofsson's chip only hits those figures in single-precision mode, compared with the double-precision required of scientific computing systems, it's an achievement of which he's proud.
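
The "not quite 4x" figure can be checked with simple arithmetic. The sketch below assumes that peak throughput scales linearly with core count and clock speed, which is an idealisation rather than a measured result; the function name is made up for illustration.

```python
# Relative peak throughput versus the 16-core, 1GHz Epiphany-III,
# ASSUMING throughput scales linearly with cores and clock frequency.
def relative_throughput(cores, clock_ghz, base_cores=16, base_clock_ghz=1.0):
    return (cores * clock_ghz) / (base_cores * base_clock_ghz)

# Epiphany-IV: four times the cores, clock reduced from 1GHz to 800MHz.
speedup = relative_throughput(cores=64, clock_ghz=0.8)
print(f"{speedup:.1f}x")  # prints "3.2x"
```

On that assumption the 64-core part offers roughly 3.2 times the raw throughput of its predecessor, before any of the microarchitecture gains Olofsson mentions are counted.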

"We believe that the goal is too modest. Getting fifty gigaflops per watt in 2018 is too low a goal," Olofsson boasted. "I know there are some people who have spoken to the contrary, that it's too difficult a goal, but we have 70 gigaflops per watt single-precision floating-point proven today, and that's on 28nm.

"By 2018, people are projecting between seven and 11nm, so just by scaling alone we're going to way exceed that goal at the processor level. Obviously there are overheads for memory, power, and hard disk and things like that, but I think that when people look at the majority of the power right now they're talking about the CPU core.

"So, yeah, we believe that - at least at the processor level - we can get well beyond 100 gigaflops per watt by 2018, which should make the goal pretty easily achievable by then. We can put together systems today that are tuned for low power that would meet the goal today, for single precision."

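To put those efficiency figures in context, here is a rough sketch of the processor-only power needed to sustain one exaflop at each level Olofsson mentions. The one-exaflop target and the helper name are assumptions for illustration, and the sum ignores the memory, storage and power-delivery overheads he refers to, as well as the gap between single and double precision.

```python
EXAFLOP = 1e18  # floating-point operations per second (assumed target)

# Processor-only power, in megawatts, to sustain target_flops
# at a given efficiency expressed in gigaflops per watt.
def power_megawatts(gflops_per_watt, target_flops=EXAFLOP):
    watts = target_flops / (gflops_per_watt * 1e9)
    return watts / 1e6

# 50 = the 2018 goal, 70 = Adapteva's claim, 100 = Olofsson's projection.
for efficiency in (50, 70, 100):
    print(f"{efficiency} gigaflops/W -> {power_megawatts(efficiency):.1f} MW")
```

At the 50 gigaflops per watt goal the processors alone would draw 20MW; at 70 gigaflops per watt the same arithmetic gives a little over 14MW, and at 100 gigaflops per watt it gives 10MW.
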
There are other roadblocks on the route to exascale computing, Olofsson admits. "There's fault tolerance, the programming model - how do you program with a million-level parallelism? We're putting a thousand cores, possibly, on a chip, and finding a thousand-level parallelism - how are we going to program that?

"We're working on providing the tools for it - I think that the CUDA and OpenCL, what the GPUs have done, is a great example of how you can take a very powerful architecture and abstract away all the underlying complexity. We're a much, much simpler architecture - and much more general purpose - than a GPU, so I don't see why we can't propose a similar programming model going forward. It works."

As a small five-man start-up, Olofsson's company has flexibility on its side, but lacks the resources of some of its competitors. "I'm going to choose my words carefully here," Olofsson joked when the topic turned to Intel and its mesh-based Many Integrated Cores project. "I'm going to compare Intel to a thousand-tonne gorilla. Somebody as big as Intel is always a bit of a threat, but certainly it's validation as well.

"If you compare the MIC architecture to ours, they've chosen the Pentium architecture as their baseline processor - so they're staying with the x86 instruction set - and they've chosen a certain network on a chip. So, I think it's another validation point - just like Tilera, which is a Boston-based company that are doing server-level chips with a mesh-connected many-core architecture.

"I think that we're seeing a lot of companies move to mesh-based bus architecture, and that's great. And for me, that's the only way to do it, for scaling reasons - it's the only real approach that scales. So, it's a great validation, but at the same time Intel is an incredible company, and they have resources that our five-person company does not have, so it's obviously a threat as well."

Despite this, Olofsson argues that his company has a major technology advantage which helps to keep costs down. "Our big advantage is really in silicon cost," he explained. "Even if we have slightly higher production costs per wafer, we have such a small die area we can get many more dies per wafer and we gain that way.

"For example, this latest product - the 64-core 28nm part - is only 10 square millimetres. If you know the numbers that people generally talk about for GPUs, for FPGAs, and CPUs, they're well above 100 square millimetres. Even the A5 from Apple in 40nm is over 120 square millimetres, so this is 10 square millimetres.

"In terms of silicon cost, we have an enormous advantage. And then of course you have to add on to that testing, and packaging, and logistics and things like that and there we don't have a clear... We don't have a technology advantage, so there we're on a level playing field. But at least in the silicon area, the silicon cost, we have a huge advantage."

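The die-area argument can be sketched with a common textbook approximation for gross dies per wafer. The 300mm wafer diameter is an assumption, as is the function name, and the estimate ignores yield, scribe lines and the per-wafer cost difference between process nodes.

```python
import math

# Gross (pre-yield) dies per wafer: wafer area divided by die area,
# minus an approximate correction for partial dies lost at the edge.
# The 300mm default diameter is an ASSUMPTION, not a figure from Adapteva.
def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    radius = wafer_diameter_mm / 2
    whole = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(whole - edge_loss)

small = gross_dies_per_wafer(10)    # the 10 square millimetre Epiphany-IV
large = gross_dies_per_wafer(120)   # a 120 square millimetre die for comparison
print(small, large, round(small / large, 1))
```

On these assumptions the smaller die yields more than ten times as many candidates per wafer as the larger one, which is the effect Olofsson describes.
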
For Adapteva's next milestone - a design win from a tablet or smartphone manufacturer, where a low-power co-processor that can accelerate common tasks without putting undue strain on the battery would be attractive - Olofsson admits that even such gains in silicon costs might not be enough. "If we're talking about parts that need to cost a dollar, on that order, then certainly you need to be a tier one vendor to get the cost down that low. That's why we're looking at licensing for the mobile and smartphone market.

"We see our focus as extreme computing," explained Olofsson, "the guys who need the most flops per watt - and the two markets that, interestingly enough, respond to this are the exascale folks that are really pushing the envelope because they can't find enough electricity to feed their systems at the high end, and the smartphone and tablet guys. We are in evaluation for tablet-type chips right now.

"We make products - they're fully-qualified products - but we really see the mass adoption as coming from one of the big vendors. In the mobile space we are not providing chips - we are providing IP for the SoC integrators to incorporate - we are exactly like ARM for the mobile space."

While reaching profitability - thanks to a large licensing win earlier this year - is a great validation of the strength of Adapteva's technology, Olofsson admits that it hasn't always been easy. "Being a disruptive technology, we're always selling in to a market where there's an incumbent, and there's a way of doing things, and we're proposing a different way of doing things - and one that, going forward, I believe is going to be much, much better.

"But right now, people are still kind of managing with their existing solutions whether it's a general-purpose processor, a GPU, an FPGA... We have to sell people and convince people that this is the right way of doing things, and they have to transfer over from their existing code to our way of doing things. It's not a huge undertaking, but it costs some effort on the part of the customer."

The Epiphany-IV chip, due to sample in Q1 2012, offers 64 RISC processing cores running at 800MHz in a package containing a network-on-chip mesh architecture with distributed memory. While third-party benchmarks aren't yet available, Adapteva claims 25GB/s local memory bandwidth, 6.4GB/s processor bandwidth per core, 64KB of local memory at each processor node, and 70 gigaflops per watt in single-precision mode. Pricing is yet to be revealed.