As someone who has spent decades doing research in automatic speech recognition (ASR), I had become resigned to people I met having no understanding of what I was working on. "Oh, you mean voiceprints?" or "Is that for reading for the blind?" were typical responses. Imagine my surprise, then, when, having told the immigration officer at LA International airport recently that I worked in ASR, he immediately asked what I thought of Siri. He then went on to give a sophisticated and informed opinion himself.
Siri on the iPhone 4S is an extremely impressive piece of technology, as well as being a commercial success for Apple. It has finally put ASR firmly into the consciousness of the general public. You might expect me, as an ASR specialist, to be delighted. In many ways I am. And yet I feel simultaneously uneasy.
Since the 70s, the public reception of commercial ASR has been cyclic: excitement, disappointment, derision, then excitement again as newer technology appears. My hope is that this time the cycle has been broken, and my fear is that it might not have been.
To assess the chances of permanent success this time, let's look at the reasons why the public's enthusiasm has previously turned sour. I suggest there are three. The first two are unreasonable public expectations and unrealistic marketing claims. (I'll come to the third later.)
Understanding speech in our own language seems effortless. In quiet conditions we perceive the speech signal to be perfectly clear and unambiguous. But this is an illusion perpetrated by our own brain, which processes a patchy, inconsistent acoustic signal and presents to our consciousness a cleaned-up stream of words. Since we aren't aware of our extraordinary skill in understanding speech, we are wrongly convinced that it must be easy to find the trick to automating the process, and this primes us to believe claims of a spectacular technical breakthrough.
Marketing people exploit this unreasonable expectation to persuade customers that their product has made the long-awaited breakthrough. When the early automatic dictation products allowing continuous speech appeared, IBM's advertising people had a man lying flat on his back dictating, while Dragon Systems called their product NaturallySpeaking. Personally, I think that Dragon NaturallySpeaking is a wonderful product and I'm using it to produce this article, but even the latest, much improved, version doesn't allow one to speak totally naturally, and it's a good idea to keep an eye on the screen rather than stare at the ceiling as the IBM adverts suggested.
A decade on, the TV adverts for Siri appear to be equally exaggerated, turning a fine technical achievement into a likely disappointment. One of them tells the user that his day is looking good because he has only two meetings. If Siri really does this, it is an absurd overstatement of intelligence by the Siri interpreter, since it cannot possibly know whether the user is a salesman at a trade show, who would be disappointed to have only two meetings set up, or is someone who normally has no meetings, hates meetings, and is horrified to have two on the same day.
Thus Siri may disappoint even when the speech recognition is prompt and accurate, and it can indeed be prompt and accurate when the data link is working well and the servers aren't overloaded. These provisos may be surprising to some people who believe that everything is handled locally in the iPhone, but Siri's speech recognition and interpretation happen across the network on a server. In many parts of the UK and the US 3G coverage can be poor or non-existent, while LTE isn't going to be available in the UK any time soon. Moreover, Siri has sometimes been a victim of its own success: when the servers become overloaded, the response is slow or even fails to arrive.
Finally, even with realistic expectations of Siri's intelligence, a good network signal and servers that aren't overloaded, there's still a disappointment in store for some users, namely those who don't speak with a standard General American accent, or the equivalent for the other languages supported. In Singapore, for example, the local pronunciation of English is problematic for Siri. Moreover, questions about local cuisine (nasi goreng, laksa, Yum Ka Kaya Toast ...) won't be understood however they are pronounced.
So what's to be done if disillusionment is to be avoided this time? Let's hope first that the publicity becomes more realistic and emphasises the genuinely useful side of Siri.
Perhaps Siri will, over time, extend its service in areas such as Singapore to cover the local vocabulary and local pronunciation. (SingTel, the dominant network service provider in Singapore, has already launched its own speech concierge service called deF!nd, which, while lacking the full power of Siri, does support the local vocabulary and local pronunciation and accesses SingTel's own InSing database covering all the restaurants, companies, etc. in the island state.)
Finally, many of the simpler services supported by Siri can be done entirely locally on a smartphone. Even when a query needs information from the Internet, local speech recognition on the smartphone can remove much of the response delay and risk of server overload.
Let us hope that developments such as these soon appear widely and prevent the enthusiasm for automatic speech recognition generated by Siri's appearance fading into the scepticism that has dogged this technology for so long.
If that hope comes true, then the advances exemplified by Siri and Google Voice Search may come to be properly recognised for the remarkable technical achievements that they are, and, more importantly, automatic speech recognition will no longer be a curiosity but rather a key means of communicating with devices including smartphones, televisions and cars.
Even an ability to respond to a limited set of spoken commands and enquiries can be extremely useful, provided that the response is reliable and rapid. Let's stop daydreaming about talking to our devices as though they were human beings and instead take advantage of the real utility that current automatic speech recognition offers.