Dialects are an intrinsic part of who we are, and can represent home to many of us in an increasingly global world. Many wonder, though: how do we train speech recognition devices to understand such unique regional languages?
So what is a dialect? It’s a tricky question, and answering it can get you into all sorts of political trouble in some areas of the world! In the past, central authorities were often sceptical of communities that claimed to have their own (regional) language, preferring to speak of a mere “dialect.” Conversely, smaller countries with a big neighbour often insisted they spoke their own language, not just a dialect of the neighbour’s language. Luckily, linguistic variation today is often seen as a precious treasure of cultural heritage, but in many places linguist Max Weinreich’s summary that “a language is a dialect with an army and navy” is still valid. Avoiding those issues, I will use “dialect” in a pragmatic way, to encompass regional languages and accents.
Looking back over the more than 20 years I have spoken to customers and others about Automatic Speech Recognition (ASR), the most frequently asked question definitely was, “Do your systems speak dialect X?”, where “X” may have been Bavarian, Scottish English, Swiss German, Canadian French – and many other examples.
After many centuries of authorities trying to discourage the use of dialects, many people today are actually proud of their ability to speak one. Recall, for instance, a trip to an exotic place where you recognised somebody from your birthplace just by listening to how they spoke; it’s a welcome feeling. Even governments exploit this today: the German state of Baden-Württemberg (which prides itself on being the birthplace of many inventors and scientists, like Karl Benz, Johannes Kepler and Albert Einstein, and, coincidentally, is also the home of Nuance’s Ulm office) coined the amusing quip: “We can do everything. Except [speak] High German.”
Obviously, the quip is not quite true, in that most speakers of dialects also speak the “standard” form of their language and apply what linguists call “code switching.” Depending on the social setting, speakers switch between standard language (in a formal setting) and dialect (at home or with friends). Dialect, similarly, can be a tool with which you signal to somebody that they are welcome in your home, or that they will remain a stranger because they don’t speak your dialect. The same mechanism may be at work in those numerous radio spots or YouTube videos where people make fun of ASR, which supposedly does not speak or recognise a dialect; hence the video of Scots in a lift.
The issue of schools
A second reason why people may doubt that ASR works well with dialect may be related to the long history of dialects not being an acceptable language to use in school (at least in some countries). Clearly dialects deviate from the rules of the standard language, as codified in the grammar book, and that somehow encouraged the myth that dialects do not have any rules, are “irregular” and are therefore difficult to capture in a machine. But from a linguistic viewpoint, that is really just a myth: granted, dialects sometimes don’t have a written form, but for linguists spoken language is more important anyway, with written language only being a secondary derivation. And in the spoken form, dialects are as regular as any other language; they are neither worse nor more difficult, neither better nor easier, than “standard” languages.
Machine Learning, especially Deep Learning based on Neural Nets, can deal with the variety of having several dialects and a standard form in one population. As long as you make sure all dialects are reflected in your training data (and we make sure they are; in the UK, for example, we use more than 20 defined dialect regions), the resulting models will reflect all the ways of pronouncing the phonemes (or sounds) of a language. We make sure to include words that are special to a dialect (again using the UK as an example, different areas refer to a bread roll as a cob, a barm cake or a bun), and where pronunciation differences go beyond isolated phonemes, we reflect that in the pronunciation dictionary.
For instance, our UK English language pack recognises 52 different pronunciations of the word “Heathrow” so our airline customers can cater to those whose first language isn’t English. When differences become too big, we create separate models in some cases. Users of Dragon speech recognition software can choose between variations of English and between Flemish (for Belgium) and Dutch (for the Netherlands).
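To make the idea of a pronunciation dictionary with several variants per word more concrete, here is a minimal sketch in Python. It is an illustration only and not how any Nuance product stores its data: the class name `PronunciationLexicon`, its methods and the phoneme strings are all assumptions made up for this example.

```python
from collections import defaultdict


class PronunciationLexicon:
    """Maps each word to every accepted phoneme sequence (hypothetical sketch)."""

    def __init__(self):
        self._variants = defaultdict(set)  # word -> set of phoneme tuples
        self._words = defaultdict(set)     # phoneme tuple -> set of words

    def add(self, word, phonemes):
        """Register one pronunciation, given as space-separated phoneme symbols."""
        key = tuple(phonemes.split())
        self._variants[word.lower()].add(key)
        self._words[key].add(word.lower())

    def pronunciations(self, word):
        """All known pronunciations of a word, in a stable sorted order."""
        return sorted(self._variants.get(word.lower(), set()))

    def words_for(self, phonemes):
        """All words that can be realised by the given phoneme sequence."""
        return sorted(self._words.get(tuple(phonemes.split()), set()))


lexicon = PronunciationLexicon()
# Illustrative, made-up phoneme strings; not taken from any real language pack.
lexicon.add("Heathrow", "h iy th r ow")  # citation form
lexicon.add("Heathrow", "h iy f r ow")   # "th" realised as "f"
lexicon.add("Heathrow", "iy th r ow")    # initial "h" dropped

# Dialect-specific words for the same thing can be grouped under one concept.
BREAD_ROLL_TERMS = {"cob", "barm cake", "bun"}
```

A recogniser built on such a lexicon can accept any of the registered variants as the same word, which is why adding variants widens coverage without requiring a separate model.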
Under the hood
Occasionally this is done “under the hood,” so to speak. Even in the Dragon US English version, there are several dialect models. We use a classifier (another application of Machine Learning) to detect which “package” best fits the user’s dialect and use that package for recognition. We also verify that it works by measuring accuracy gains per variant. For example, compared with the previous version, Dragon Professional Individual English reduces errors by 22.5 per cent for speakers of English with a Hispanic accent, 16.5 per cent for southern (US) dialects, 13.5 per cent for Australian English, 18.8 per cent for UK English, 17.4 per cent for Indian English and 17.4 per cent for Southeast Asian speakers of English.
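The two steps described above, picking the best-fitting dialect model and measuring the gain as an error reduction, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Dragon’s actual implementation: the function names, the model names and the scores are hypothetical, and the classifier is assumed to return one log-score per dialect model.

```python
import math


def select_dialect_model(log_scores):
    """Pick the best-scoring dialect model from hypothetical classifier output.

    log_scores maps a model name to a log-score (higher means a better fit).
    Returns the winning name and its share of the normalised probability mass.
    """
    peak = max(log_scores.values())
    # Subtract the peak before exponentiating to keep the numbers stable.
    weights = {name: math.exp(s - peak) for name, s in log_scores.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def relative_error_reduction(old_error_rate, new_error_rate):
    """Error reduction in per cent, relative to the old error rate."""
    if old_error_rate <= 0:
        raise ValueError("old_error_rate must be positive")
    return (old_error_rate - new_error_rate) / old_error_rate * 100


# Made-up scores for three assumed US English dialect packages.
scores = {"us_general": -120.0, "us_southern": -112.5, "us_hispanic": -118.0}
best_model, confidence = select_dialect_model(scores)
```

Note that an error reduction of this kind is relative: with purely hypothetical numbers, going from 10.0 to 7.75 errors per 100 words is a 22.5 per cent reduction, even though the absolute change is only 2.25 points.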
Finally, we have adaptation to help us with the challenge: dictation software like Dragon will adapt over time to a user’s specific dialect. ASR may still not work for every dialect every time, particularly when real-world usage deviates from what we anticipated during training. However, speech recognition accuracy across a number of languages has risen considerably, to upwards of 99 per cent, as evidenced by the broad and global integration of our cloud-based ASR and NLU, used by thousands of apps in cars, IoT devices, smartphones etc.
Linguistic variety is as important to us as it is to you, which is why we support more than 80 languages (including regional languages like Catalan and Basque, which we developed in cooperation with regional governments), and as I have outlined, we do a lot more to cover variation and dialects beyond that number. So, we welcome the challenge of dialect, even if it’s in the form of an amusing YouTube spoof.
Nils Lenke, Senior Director, Corporate Research, Nuance Communications
Image Credit: Flickr / Daniel Gasienica