Tilera launches Gx-3000 series many-core chips

Intel's MIC programme might be generating a lot of buzz in the supercomputing community, but it's going to be a while before the technology becomes affordable for all. Thankfully, there are alternatives such as Tilera. We chat to Bob Doud to find out what the company is all about.

Intel's Many Integrated Core (MIC) architecture, which is due to hit government supply lists next year in the form of the Knights Corner x86 accelerator card, is likely to be most people's first encounter with 'many-core' technology. Intel didn't coin the term, however: Tilera has been working on many-core processors for quite some time.

"We use the term 'many-core processors' - that's beginning to take hold - to represent companies that have chips that really go beyond the eight cores or even the sixteen cores that we've seen from some of our competitors, and getting into a world where maybe you stop counting cores and just look at distributed computing," Doud explained during an interview with thinq_. Unlike Intel, however, Tilera is already delivering on that promise.

"We've been shipping product since 2008 in production," Doud proudly proclaimed, "so we have two generations that have been shipping to customers, and we're launching the third generation, the Gx series, this year - starting to sample the 36-core version next quarter."

That TILE-Gx series, announced today, promises to cut power consumption by around 80 per cent compared with the company's existing chips, with each core consuming a mere 0.5W. Each core runs at 1.5GHz, and the initial product line will include 36-, 64- and 100-core models.

Tilera TILE Gx block diagram

The design comes from the mind of company founder Anant Agarwal, a professor at MIT and one of the designers behind the MIPS CPU architecture. "In 2002 he was awarded a grant from DARPA and the National Science Foundation to build a silicon chip with 16 cores on it - so really very leading, even at that time when others were doing just two-core processors," Doud recalled. "So, they actually built a 16-core processor called RAW and that we see as Rev. 0, or the predecessor, of the Tile Architecture."

That architecture is a clever beast: rather than the ring or bus architecture of a modern multi-core processor like an Intel Core i7 or an AMD Opteron, Tilera - as the company name suggests - uses a layout of tiles that makes it easy to add increasing numbers of cores to the design.

"We've created building blocks that you can lay out like the tiles on your kitchen floor, and using a two-dimensional mesh you can interconnect them," explained Doud. "It's a very compelling approach, because from a physical design perspective you simply design a single core with a tile, and then you simply repeat them across the die - and the interconnections and bandwidth just come along for the ride.

"With other architectures like buses and rings, which is what most of the other guys are using, it's not so simple," Doud claimed. "You really have to lay out the bus very carefully, pipeline it, put registering at the necessary spots to get it to perform and to get data to move across the die at the speeds that you wish - it's a very manual process of getting that bus or ring architected for the specific core count. With the tile architecture, we literally use the exact same tile whether it's a 16-core device or a 100-core device - we just lay out that many of them."
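
Doud's point that bandwidth comes "along for the ride" can be illustrated with a little counting. The Python sketch below is not Tilera's design data: it assumes each chip is a square grid of tiles (4x4, 6x6, 8x8, 10x10), counts the links in just one mesh network, and compares the result with a simple ring. The function names are invented for illustration.

```python
import math


def mesh_side(cores: int) -> int:
    """Side length of a square mesh; assumes cores is a perfect square."""
    side = math.isqrt(cores)
    if side * side != cores:
        raise ValueError(f"{cores} cores do not form a square mesh")
    return side


def mesh_link_count(cores: int) -> int:
    """Tile-to-tile links in one mesh network.

    Each of the `side` rows has side - 1 horizontal links, and each of the
    `side` columns has side - 1 vertical links.
    """
    side = mesh_side(cores)
    return 2 * side * (side - 1)


def mesh_bisection_links(cores: int) -> int:
    """Links severed when an even-sided mesh is cut into two equal halves."""
    side = mesh_side(cores)
    if side % 2:
        raise ValueError("sketch only handles even side lengths")
    return side  # one link per row crosses the cut


RING_BISECTION_LINKS = 2  # cutting a ring in half always severs two links

for cores in (16, 36, 64, 100):
    print(
        f"{cores:>3} cores: {mesh_link_count(cores):>3} mesh links, "
        f"{mesh_bisection_links(cores):>2} across the middle "
        f"(ring: {RING_BISECTION_LINKS})"
    )
```

Under these assumptions the number of links crossing the middle of the die grows with the side of the grid, whereas a ring stays at two however many cores are attached, which is one way to read Doud's remark that a bus or ring has to be re-engineered for each core count.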

Those interconnects, with five leaving the tile at each cardinal point, offer an impressive amount of bandwidth for data transfer and inter-core communication. "Tilera uses five independent mesh interconnects anywhere from 32 bits to 128 bits wide," Doud explained. "Those independent five meshes are used, functionally, to do different things: one of the wide ones, the 128-bit ones, is used to undertake memory requests to the memory subsystems from the core; another network is used for cache-to-cache transfers, so basically it's used as part of the L3 cache layer; and another network is used for IO requests.

"There's even a network that's dedicated simply for userspace access," Doud revealed. "So in other words, the basic workings of the chip are not going on that network but if the user wants to write a program that communicates core-to-core, they have their own private mesh to do that with that doesn't really interfere with memory or IO operations."

Tilera TILE Gx performance comparison

With around 200 terabits per second of aggregate bandwidth to play with in the company's 100-core implementation, there's plenty of scope for pushing data around the chip as quickly as possible. That's an important point, as many-core architectures can post impressive synthetic results for on-chip calculations, only for real-world performance to be crippled by the wait while data is fed to each core.

Unlike Intel with its supercomputer-oriented MIC architecture, Tilera is trying to bring the benefits of many-core computing to as many users as possible. With commercial products already on the market, Doud has seen the interest in the company's products grow.

"We have customers in networking - all manner of networking equipment whether it be security equipment, deep packet inspection, packet forensics, quality of service, and so on. Multimedia: we are very good at doing audio and video processing on our chips - we have many of the characteristics of a DSP, at the same time as we can do general-purpose compute.

"More recently, we've added on the market of cloud computing and, in fact, having a socket in a standard rackmount server, and there we can do all the standard applications: web apps, and data mining, search, social media, that sort of thing, and we do it at a very compelling compute per watt. That's really the traction there, that we're saving a lot of money on operational expenses as well as on the cooling and power supply design and things like that."

Doud is in complete agreement with Intel's Tony Neal-Graves on one matter: for a many-core architecture to succeed, it has to be familiar. "The x86 architecture is held up as the standard architecture for programming ease, since most engineers have learned on x86 platforms," explained Doud. "So, we drive towards that, and we feel that we have a very straightforward programming model: very standards-based, very much comfortable, like programming with your good-old familiar x86 processor."

That flexibility is key to Tilera's survival of the recent economic downturn, Doud believes. "There are other companies that are doing various types of many-core things, but most of them do tend to be somewhat boutique," he explained. "They tend to be building some sort of specialist acceleration for some sort of signal processing activity, whether it's video, or maybe wireless processing like the physical layer DSP stuff. Frankly a number of those companies also died over the last few years over the downturn. There were at least two, maybe three companies that just didn't make it through, and again our view of why did Tilera make it through and why did they not is basically because they were very focused on a certain market."

The TILE-Gx 3000 series, Tilera's most recent product, offers a range of devices capable of general-purpose compute: the entry model is the Gx3036, a 36-core unit with 12MB of cache drawing 20W with availability in Q3 2011; the mid-range part is the Gx3064, a 64-core unit with 20MB of cache drawing 35W; while the top-end chip, the Gx3100, offers a hundred cores with 32MB of cache in a 48W TDP. Both the higher-end models are due to launch in Q1 2012.
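
Those figures can be checked against the 0.5W-per-core claim with simple division. The Python sketch below uses only the core counts, cache sizes and power figures quoted above; the class and function names are illustrative, and dividing whole-chip power by the core count is a crude assumption, since the chip's power budget also covers memory controllers and I/O.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GxPart:
    """One TILE-Gx 3000 part, using the figures quoted in the article."""
    name: str
    cores: int
    cache_mb: int
    watts: int


PARTS = [
    GxPart("Gx3036", cores=36, cache_mb=12, watts=20),
    GxPart("Gx3064", cores=64, cache_mb=20, watts=35),
    GxPart("Gx3100", cores=100, cache_mb=32, watts=48),
]


def watts_per_core(part: GxPart) -> float:
    """Whole-chip power divided evenly across cores (a rough estimate)."""
    return part.watts / part.cores


def cache_kb_per_core(part: GxPart) -> float:
    """Total cache divided evenly across cores, in KB (1MB = 1024KB)."""
    return part.cache_mb * 1024 / part.cores


for part in PARTS:
    print(
        f"{part.name}: {watts_per_core(part):.2f}W per core, "
        f"{cache_kb_per_core(part):.0f}KB of cache per core"
    )
```

On this rough basis the three parts work out at about 0.56W, 0.55W and 0.48W per core respectively, which is in line with the "mere 0.5W" figure.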

"We're also beginning our early development work on a 28nm family that we call Stratton, that will ultimately go up to over 200 cores," Doud told thinq_. A drop in process size from the current 40nm process to 28nm would bring significant gains in efficiency and performance, pushing the company's processors to still greater heights.

With Tilera already shipping its existing product ranges and the new high-performance TILE-Gx 3000 series just around the corner - not to mention the Stratton chips - Intel's hopes for pushing its MIC architecture down the channel from supercomputing to the server room look far from certain.