
The world doesn’t only have one accent per language, so why does speech recognition?

(Image credit: Shutterstock/polkadot_photo)

When we think of speech recognition technology, we typically think of voice assistants. It actually goes well beyond that. We just have to look at the pandemic to understand how critical voice is to keeping us connected and how important it is to have the right technology in place for businesses to capitalize on it. 

In particular, contact centers across industries have been inundated with voice calls, whether it’s people wanting to manage their finances or track an online purchase. The problem arises when heavy accents, dialects or even different languages meet current transcription models, which ultimately fall flat.

Customers should not be expected to change how they speak in the hope of being understood. The question is then: in today’s voice-driven world, why do so many speech recognition engines still not have access to a wider pool of data covering different accents and dialects?

How to train a speech engine 

To answer that question, we first have to understand how speech recognition engines are created. To train a machine to recognize how a human speaks, researchers need to collect thousands of hours of audio and corresponding transcripts. 

The first issue stems from the availability of known, diverse data, which, if not already available, is time-consuming to gather and label. Accents and dialects of a given language are excluded if the data representing them is not available. Of course, accents and dialects within a language are not as easily classified as the languages themselves: telling a London accent from a Scottish one, for example, is harder than telling German from English. In the UK alone, there are at least 56 main ‘accent types’ across the regions.

Another example is Spanish. With approximately 500 million speakers globally, Spanish is the second most natively spoken language in the world and fourth most spoken language overall. However, only 10 percent of the global Spanish speaking population is in Spain, while the other 90 percent are in the United States, Mexico, Central and South America, Asia, and Africa. This results in numerous accents, dialects and regional variations which must be understood at speed without having to switch between models specific to each variation or country.

Not only does this mean that the vast majority of people face being misunderstood when calling in to contact centers, for example, it also raises issues with accuracy when it comes to analytics. Using human intervention to fully understand what has been captured is time-consuming and, of course, costly for businesses. As a result, only three percent of interactions are analyzed by contact centers on average, according to Call Centre Helper, leaving 97 percent of potentially valuable customer information untouched.

Instead, speech recognition engines need to be trained on varied sets of audio and language data, using many hours of spoken data from global sources. The time and investment spent on gathering more of these datasets mean automatic speech recognition (ASR) engines can learn from a huge and diverse training corpus. The resulting models can then be applied to a much wider range of applications and use cases.

However, there are difficulties with collecting and processing these vast datasets. As mentioned earlier, the first problem is that diverse datasets are difficult to gather and classify, and are not always tagged with the accent or dialect, or with other speaker characteristics such as age, gender or acoustic environment. The second problem lies in incorporating this data during training: the quantity of data available varies widely by accent and dialect, and its influence has to be balanced, because that balance in turn affects the quality of recognition and transcription. Much more research needs to be done.
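To make the balancing problem concrete, here is a minimal sketch of one common idea: sampling training utterances with weights that favor under-represented accents. It is an illustration under stated assumptions, not a description of how any particular engine is trained. The function name `accent_sampling_weights`, the `smoothing` parameter and the accent labels are all hypothetical, and the sketch assumes every utterance already carries an accent tag, which, as noted above, is often not the case.

```python
from collections import Counter


def accent_sampling_weights(accent_labels, smoothing=1.0):
    """Return one sampling weight per utterance, normalized to sum to 1.

    Each utterance is weighted by count(accent) ** -smoothing.
    smoothing=0 keeps the natural distribution of the data;
    smoothing=1 gives every accent the same total probability mass.
    (Hypothetical helper: the name and parameter are illustrative.)
    """
    counts = Counter(accent_labels)
    raw = [counts[accent] ** -smoothing for accent in accent_labels]
    total = sum(raw)
    return [weight / total for weight in raw]


# Toy corpus: eight utterances tagged "london", two tagged "glasgow".
labels = ["london"] * 8 + ["glasgow"] * 2
weights = accent_sampling_weights(labels, smoothing=1.0)
```

Weights like these could be passed to Python's `random.choices(utterances, weights=weights, k=batch_size)` when drawing training batches. Values of `smoothing` between 0 and 1 trade off between following the raw data and fully equalizing accents; which setting works best is an empirical question.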

One language pack to rule them all 

Speech recognition has advanced hugely in recent years thanks to step-change improvements, which is notable for a field that is used to marginal gains. Traditionally, though, building a new language pack takes months and is very expensive and labor-intensive. It involves gathering high volumes of data, building a one-off system and continually refining it with input from experts in that language.

Having one language pack per accent or region is an outdated approach in our increasingly connected world. It is also expensive, inefficient and hard to scale. A better solution, though, might not be as difficult as it seems.

As most languages that are part of the same ‘family’ have inherent similarities in their fundamental sounds and grammatical structures, patterns are easier to learn. This means ASR engines can use machine learning algorithms to recognize these patterns, which in turn significantly reduces the time and data required to build a new language pack. As mentioned before, the obstacle is the lack of data and of advanced research in this space.

Better for businesses 

Any-context speech recognition technology which draws on varied data can boost accuracy levels and result in cost savings when it comes to transcription and understanding. Not only this, but by being able to capture customer voice data in an accurate and clear way, analysis and investigations can happen quickly and easily, which ultimately saves businesses time and money and helps to keep customer satisfaction levels high. 
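One way to check whether accuracy really holds up across accents is to measure it per accent rather than as a single average. The sketch below computes word error rate (WER), a standard transcription metric: the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The function names, accent labels and sample sentences are hypothetical and only illustrate the idea.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def wer_by_accent(samples):
    """samples: iterable of (accent, reference, hypothesis) tuples."""
    totals = {}
    for accent, reference, hypothesis in samples:
        errors, words = word_errors(reference, hypothesis)
        seen_errors, seen_words = totals.get(accent, (0, 0))
        totals[accent] = (seen_errors + errors, seen_words + words)
    # Skip accents with no reference words to avoid dividing by zero.
    return {a: e / w for a, (e, w) in totals.items() if w > 0}


# Hypothetical evaluation data: same sentence, two accent groups.
samples = [
    ("accent_a", "track my online purchase", "track my online purchase"),
    ("accent_b", "track my online purchase", "track my own line purchase"),
]
rates = wer_by_accent(samples)
```

A large gap between accent groups in a report like this is a sign that the training data behind the model is not diverse enough.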

There is also a regulatory risk – in businesses where calls need to be monitored for compliance purposes, linguistic errors could prove to be very costly if they cause a breach of regulations. Take for example contact centers, where something as simple as a misinterpretation of language could result in the illegal provision of a service or product to a customer, causing both financial and reputational damage to a given business. 

This is why speech recognition software needs to be trained on a variety of datasets which encompass dialects and accents, particularly if a business has a global reach. Using effective ASR technology can ensure that opportunities are not missed due to simple oversights and, more importantly, that such oversights do not land a company in legal hot water.

Whilst call analytics, using any-context speech recognition both in real time and in post-processing, does not instantly solve every compliance problem, it provides an opportunity to highlight issues immediately, either through hints or even by rerouting the call to a supervisor before matters escalate.
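As a rough illustration of the hint-or-reroute idea, the sketch below scans each transcribed segment of a call for phrases a compliance team might want to watch. Everything in it is an assumption made for the example: the phrase lists, the function name `review_segment` and the three outcomes are hypothetical, and a real deployment would need far more than simple phrase matching.

```python
import re

# Hypothetical phrase lists; a real compliance team would define its own.
HINT_PHRASES = ("guaranteed return",)
REROUTE_PHRASES = ("no need to read the terms",)


def _normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(cleaned.split())


def review_segment(transcript_segment):
    """Return 'reroute', 'hint' or 'ok' for one transcribed segment."""
    text = _normalize(transcript_segment)
    # Check the more serious category first so it takes priority.
    if any(phrase in text for phrase in REROUTE_PHRASES):
        return "reroute"
    if any(phrase in text for phrase in HINT_PHRASES):
        return "hint"
    return "ok"


# Example: a segment as it might arrive from a real-time transcript.
action = review_segment("That is a Guaranteed  return, honestly.")
```

The check is only as good as the transcript it reads: if an accent causes a flagged phrase to be transcribed incorrectly, a rule like this never fires, which is why recognition accuracy and compliance are so closely linked.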

We are already seeing a shift to a speech-enabled future where voice is the primary form of communication. Users shouldn’t have to adjust the way they speak so that speech recognition systems can understand them. 

In particular, automatic speech recognition technology helps organizations and enterprises around the world to further automate their processes and workflows in order to streamline business operations and accelerate business growth. It brings numerous benefits from a customer experience perspective, allowing businesses to become more in tune with customers based on interactions that reduce customer churn and personalize service.

Thuy Le, Senior Product Manager, Speechmatics

Thuy Le has two decades’ worth of experience in technology and developing innovative ideas. Thuy has worked for a number of startups, non-profit organisations and multinational tech companies.