In a rather unexpected turn of events, Nvidia’s CEO, Jen-Hsun Huang, gave a group interview to the media assembled at GTC 2010, where he owned up to the problems the company faced with the first batch of Fermi samples.
Nico Ernst from Golem.de documented this historic moment on video and has posted it to his site.
When asked about the reasons behind Fermi’s delays and problems, Jen-Hsun had an actual explanation at hand. According to him, the interconnects between the SM clusters and memory were good only on paper. Once Nvidia received the samples back from TSMC, it was confronted with interconnects packed so tightly that they formed one huge traffic jam of electrical signals, leaving the key components of Fermi completely “deaf and mute”. The fabric that connects all of this was broken.
As Jen-Hsun put it: “We found a major breakdown between the models, the tools and reality. So when we first got the first Fermi back, that piece of fabric, so imagine we’re all processors, all of us seem to be working. But we can’t talk to each other. It’s like we’re all deaf, we’re all mute and deaf. We can’t hear each other, we can’t talk to each other. And we found out it’s because this connection between us is completely broken.”
The company was forced to re-engineer the part in order to get it working, causing the delays. But how did this come to happen? The CEO explained that the problem stemmed from the engineers at Nvidia not knowing exactly whose responsibility it was to develop the interconnect, so the company ended up with two departments (physics and architecture) doing the same thing in two different ways. This eluded company management until the first silicon turned up broken. It was a clear project management failure that should, you might imagine, cause some heads to roll at Nvidia (not that we’ve seen any).
“It turns out the reason why the fabric failed isn’t because it was hard, but because it sat between the responsibility of two groups,” Huang confessed. “The fabric is complicated in the sense that there’s an architectural component, there’s a logic design component and there’s a physics component. My engineers who know physics and my engineers who know architecture are in two different organisations, and so you see this underlap of responsibility… ‘is it my job or your job?’ If you’d just simply moved it from one side to the other side they’d been more than happy to pick up the slack, but we let it sit right in the middle. ‘Let’s be both of our jobs’… that’s a bad answer.”
Yes, it is. In both project management and engineering there is always a roadmap with tasks clearly assigned to each department, as well as deadlines and meetings to discuss progress on the project. The interconnects are a complex weave of metal that carries signals back and forth, and it is really important to get them right.
The take-home message is: something went wrong, we slapped ourselves on the wrist and we won’t do it again. So shareholders should be happy enough with the company’s performance.
We’re left asking whether it would have been too much trouble to get this “confession” out in the open earlier, and whether doing so would really have been that big a deal.