View from a Rhino House: worlds & words apart

If you’re a language teacher, now is the time to consider a move to a new career, maybe pig farming or becoming a Trappist monk; anything where talking to people is not necessary – avoiding all those unpleasant memories & broken dreams. Almost-real-time speech conversion from one language to another has arrived. Microsoft Research demonstrated not only how to convert spoken English into Mandarin with just a few seconds’ delay, but also how to output that Mandarin speech with the rhythms & intonations of the original speaker. The technology was demonstrated by Microsoft’s research chief Rick Rashid in Tianjin, China on 25th October (as part of the ill-starred Windows 8 & “Surface tablet” launches), but the news initially got lost in the bear-fight over responsibility for the general “Asian Launches Cock-up”.

Rashid spoke a few English sentences into MSR’s new speech-recognition, translation & generation system, & reports suggest that the Mandarin output stunned a crowd of 2,000 academics.

The system’s “whizz-bang” capability stems from a series of improvements throughout the speech-to-speech process. Software like Dragon has, after many years of effort, at last begun to make inroads, & create opportunities, for speech recognition in offices, & the next generation of tools based on it, like Apple’s Siri, recognizes spoken questions & searches for answers on the web. Microsoft’s Kinect has also recently had a speech interface added.

While such systems misrecognize words at an average rate of around 20%, MSR’s trick is to use a neural-network-based system that reduces word-recognition errors to around 12%. That means the translation engine (Bing Translate) has a far better chance of creating intelligible Mandarin input to feed into the speaking engine.
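For the curious, those 20% & 12% figures refer to the word error rate (WER), the standard yardstick for speech recognizers: the number of substituted, inserted & deleted words divided by the length of the reference transcript. Here’s a minimal sketch of how that metric is computed – the function & the toy sentences are my own illustration, not MSR’s code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, counted over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five: a 20% error rate.
print(word_error_rate("please call stella right away",
                      "please call stellar right away"))  # → 0.2
```

So a drop from 20% to 12% means the recognizer garbles roughly one word in eight instead of one in five – enough, apparently, to tip machine translation from gibberish into something usable.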

But the “goddam!!” factor is the generation of Mandarin speech in a voice recognizably like the speaker’s own: if you can preserve the speaker’s vocal rhythms & intonations in the translation, their meaning (it is claimed) will be more apparent & the conversation will be more effective. This was achieved for the Tianjin presentation by having Rashid work with a machine-learning algorithm for an hour, rather than the more usual recitation of a standard text that software like Dragon asks for.

Just think of how many wars will start once we can all understand exactly what each politician really said!

What did he say?