The spread of coronavirus has cast a long shadow on the long-standing design principles behind our shared physical spaces, and the mechanical and digital components within them. We’ve grown accustomed to twisting doorknobs, pressing buttons and tapping touch screens to navigate the world, from gyms to airports to residential complexes.
Our nimble fingers aren’t the only precise tools that evolution has gifted to us. We also have voices and language. While voice control has often been associated with futuristic, high-technology environments - from smart homes to Star Trek - the notion of using our voices to manipulate the world around us has actually been around for a long time. The classic phrase “Open Sesame” first appeared hundreds of years ago in a story within Arabian Nights, serving as a magical passphrase to open the door to a cave full of treasure.
As we revisit the digital infrastructure behind our physical world, what’s old is new again. Wake words - “Alexa,” “Hey Google,” “Hey Siri” - have become our new magical passphrases. And there will be many more.
It’s becoming increasingly clear that a lasting impact of the coronavirus outbreak will be a mass voice-enablement of our shared world, reducing the viral transmission risks inherent in today’s public buildings and devices. We’re already seeing these innovations in China, with the rise of voice-activated elevators and voice-enabled electronic medical records.
In order to achieve this at scale, consumers, businesses and the tech industry will need to overcome a number of challenges. Here’s a look at what will be required to bring about this paradigm shift in public-facing technology.
Establishing shared standards and design languages
Mainstream voice assistants today - Alexa, Siri, Google Assistant - have not been built around a set of common standards, outside of the basic “wake word” functionality. Switching between assistants can be clunky even for people in the voice industry.
When voice capabilities are embedded in a plethora of new devices that the general public must be comfortable operating, there is a greater need for standardisation of design patterns and interaction models. This spans everything from indicating when a microphone is “hot” and expecting a user utterance, to providing clear “fallback intents” for moments of misunderstanding, to establishing principles for when certain information is presented auditorily, visually, or both. The Open Voice Network is a recently launched organisation with the aim of defining these sorts of standards for multi-platform, multi-device voice assistance, which is a welcome head start.
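To make the idea of a “fallback intent” concrete, here is a minimal sketch in Python. Every name in it is hypothetical (the confidence threshold, the Recognition type, the elevator intents); it assumes an imagined language-understanding engine that reports an intent name and a confidence score, and it does not represent any particular assistant or any standard proposed by the Open Voice Network.

```python
from dataclasses import dataclass

# Hypothetical cut-off: below this, the device admits it did not understand.
CONFIDENCE_THRESHOLD = 0.6


@dataclass
class Recognition:
    intent: str        # intent name guessed by a hypothetical NLU engine
    confidence: float  # 0.0 to 1.0


# Hypothetical intents for a voice-enabled elevator:
# intent name -> (example phrase the user can say, spoken reply)
ELEVATOR_INTENTS = {
    "open_door": ("open the door", "Opening the door."),
    "select_floor": ("take me to floor three", "Going to your floor."),
}


def respond(recognition: Recognition, intents: dict) -> str:
    """Return the spoken reply, using a fallback when the device is unsure."""
    known = recognition.intent in intents
    if known and recognition.confidence >= CONFIDENCE_THRESHOLD:
        return intents[recognition.intent][1]
    # Fallback intent: admit the misunderstanding and offer example phrases.
    examples = "; ".join(phrase for phrase, _ in intents.values())
    return f"Sorry, I didn't catch that. You can say: {examples}."
```

The design choice worth noting is that the fallback does more than apologise: it tells the user what the device can do, which is the kind of pattern that shared standards could make consistent from one device to the next.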
Accelerating the user learning curve
Humans are remarkably adaptable, but it takes some time and training to change old habits and learn new behaviours. While voice tech advocates are often quick to claim that there is “no manual required” for an interface that taps into our natural language, there is a definite learning curve for humans when talking to machines.
Until there are significant advances across the entire voice tech stack - from speech recognition to natural language understanding - we users must do our part to increase the odds of being understood. The good news is that many users are getting a crash course in this in the comfort of their homes and cars, which are hubs for mainstream voice assistants.
Rethinking acoustic infrastructure
A major obstacle to the adoption of voice technology in shared spaces is ambient noise. How can microphones isolate what a given user is saying from the din of other voices and sounds around them?
Many companies, including Microsoft, have been putting significant effort behind new models of “noise-robust” automatic speech recognition (ASR). It’s likely that advances in ASR alone won’t be sufficient. We’ll also need to consider the environments in which we place voice-enabled devices, and the noise-dampening techniques and components that could be leveraged, from sound baffles to noise barriers.
Increasing use of voice biometric technology
Our voices are uniquely our own; they can serve as reliable tools for identifying who we are. Banks, including HSBC and Barclays, have been using speaker recognition technologies for years as a means of authenticating a customer’s identity. For scenarios where there is a higher risk of fraud - for example, withdrawing money from an ATM - two-factor authentication will likely be required, but voice is a strong candidate to be one of the two factors.
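The policy described above, voice alone for routine requests and voice plus a second factor where the fraud risk is higher, can be sketched in a few lines of Python. The threshold value and all names are assumptions made for illustration; this is not how HSBC, Barclays or any other bank implements speaker recognition.

```python
# Hypothetical minimum similarity between the live voice and the enrolled voiceprint.
VOICE_MATCH_THRESHOLD = 0.8


def is_authorised(voice_score: float, high_risk: bool,
                  second_factor_ok: bool = False) -> bool:
    """Decide whether to proceed with a request.

    voice_score: similarity score (0.0 to 1.0) from a hypothetical
        speaker-recognition engine.
    high_risk: True for actions such as withdrawing cash.
    second_factor_ok: True if a second factor (for example a PIN)
        has already been verified.
    """
    voice_ok = voice_score >= VOICE_MATCH_THRESHOLD
    if not high_risk:
        # Low-risk requests: a voice match alone is enough.
        return voice_ok
    # High-risk requests: voice is only one of the two required factors.
    return voice_ok and second_factor_ok
```

Under these assumptions, a balance enquiry with a strong voice match would go through on voice alone, while a cash withdrawal with the same match would still wait for the second factor.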
Deciding between mainstream vs. “owned” assistants
Lastly, device-makers will need to consider which voice system should power their customer interactions. Does it make sense to embed software and hardware from one of the big tech giants inside your fleet of kiosks, or is it better to assemble your stack from the ground up, perhaps embracing a far more narrowly focused voice solution? Will customers want to use the voice assistant they use at home and in the car in public places as well, logging into their own personal instance of it? The same dilemma currently facing car-makers - to partner with big tech or to do something unique on their own, in their brand’s own image - will soon be facing the manufacturers of a host of other devices.
It won’t be easy to voice-enable the world, but we’re closer than many think. One sure bet: the demand to use technology to reduce personal and public health risk won’t be going away any time soon. For device-makers whose products live in public settings and rely on manual inputs, now is the critical time to revisit product strategy with this truth in mind.
Eric Turkington, VP of Strategic Partnerships, RAIN