At an event in China, Microsoft Research chief Rick Rashid has demonstrated a real-time English-to-Mandarin speech-to-speech translation engine. Not only is the translation very accurate, but the software also preserves the user’s accent and intonation. We’re not just talking about a digitised, robotic translator here – this is firmly within the realms of Doctor Who or Star Trek universal translation.
The best way to appreciate this technology is to watch the video below. The first six minutes or so show Rick Rashid explaining the fundamental difficulty of computer translation, and the last few minutes demonstrate the English-to-Mandarin speech-to-speech translation engine in action.
Sadly I don’t speak Chinese, so I can’t vouch for the accuracy of the translation, but the audience – some 2,000 Chinese students – seems rather impressed. A professional English/Chinese interpreter also remarked to me that the computer translation is surprisingly good; not quite up to the level of human translation, but it’s getting close.
There is, of course, a lot of technological wizardry occurring behind the scenes. For a start, the software needs to be trained – both with a few hours of native, spoken Chinese, and an hour of Rick Rashid’s spoken English.
From this, the software essentially breaks the speech down into the smallest components (phonemes), and then mushes them together with the Chinese equivalent, creating a big map of English to Mandarin sounds. Then, during the actual on-stage presentation, the software converts his speech into text (as you see on the left screen), his text into Mandarin text (right screen), and then the Rashid/Chinese mash-up created during the training process is used to turn that text into spoken words.
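The three on-stage steps (speech to text, text to Mandarin text, Mandarin text to speech) can be sketched as a simple chain of functions. This is only a rough illustration of the data flow, not Microsoft's implementation: every name here (`translate_speech`, `TranslationResult`, the `toy_*` helpers and `TOY_LEXICON`) is hypothetical, and the toy stages merely shuffle bytes and look words up in a two-entry dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationResult:
    source_text: str   # what the recogniser heard (the left screen in the demo)
    target_text: str   # the translated text (the right screen in the demo)
    audio_out: bytes   # synthesised speech in the target language


def translate_speech(
    audio_in: bytes,
    recognise: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesise: Callable[[str], bytes],
) -> TranslationResult:
    """Run the three stages in order, keeping the intermediate text."""
    source_text = recognise(audio_in)      # stage 1: speech -> English text
    target_text = translate(source_text)   # stage 2: English text -> Mandarin text
    audio_out = synthesise(target_text)    # stage 3: Mandarin text -> speech
    return TranslationResult(source_text, target_text, audio_out)


# Toy stand-ins, assumed purely for illustration. A real system would plug in
# trained recognition, translation and voice-synthesis models here.
TOY_LEXICON = {"hello": "你好", "world": "世界"}


def toy_recognise(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return "".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_synthesise(text: str) -> bytes:
    # Pretend the "speech" is just the encoded text.
    return text.encode("utf-8")


result = translate_speech(b"hello world", toy_recognise, toy_translate, toy_synthesise)
print(result.source_text)
print(result.target_text)
```

Passing the stages in as functions keeps the sketch honest about what it doesn't know: the voice model built during training would live entirely inside whatever is supplied as `synthesise`.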
The end result definitely has a strong hint of digitised, robotic Microsoft Sam, but it’s surprising just how much of Rashid’s accent, timbre, and intonation is preserved.
In terms of accuracy, Microsoft says that the complete system has an error rate of roughly one word in eight – an improvement of more than 30 per cent over the previous best of roughly one word in five. Such a dramatic improvement was enabled by the use of Deep Neural Networks, a machine learning technique pioneered by Geoffrey Hinton of the University of Toronto. A Deep Neural Network is basically an artificial neural network (software that models thousands of interconnected “neurons”), but with those neurons stacked in many layers, an arrangement loosely inspired by the way the human brain processes information.
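As a quick sanity check on those figures, here is the arithmetic, assuming the two round numbers (one word in five, one word in eight) are taken literally. Both are approximations, so the result is indicative rather than an official Microsoft figure, and the helper name `relative_reduction` is made up for this example.

```python
from fractions import Fraction


def relative_reduction(old_rate: Fraction, new_rate: Fraction) -> Fraction:
    """Fraction of the old error rate that has been eliminated."""
    return (old_rate - new_rate) / old_rate


old_rate = Fraction(1, 5)   # previous best: roughly one word in five wrong
new_rate = Fraction(1, 8)   # new system: roughly one word in eight wrong

reduction = relative_reduction(old_rate, new_rate)
print(f"old error rate: {float(old_rate):.1%}")       # 20.0%
print(f"new error rate: {float(new_rate):.1%}")       # 12.5%
print(f"relative reduction: {float(reduction):.1%}")  # 37.5%
```

In other words, going from one error in five words to one in eight removes a little over a third of the mistakes.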
Moving forward, the big question is when Microsoft Research’s speech-to-speech translation software will actually find its way to market – and yes, in case you were wondering, the software isn’t limited to English and Chinese; all 26 languages supported by the Microsoft Speech Platform can be used, including Mandarin-to-English.
The most obvious use would be on your Windows Phone 8 (or 9?) smartphone, or Skype: you could call up a company in China or Germany or Brazil, speak normally in English, and they would hear your voice in their local language. You could also use your smartphone as a universal translator while travelling. As you can see below, Microsoft was toying with real-time phone-to-phone translation all the way back in 2010:
Presumably Microsoft is working on such applications – but it’s probably being held back by practical considerations, such as the processing power required to do speech-to-speech translation, or providing an easy-to-use interface for the training/learning process. The training process itself might require more processing power than a home user can feasibly provide, too. There’s always the cloud, though!