Ever since humans evolved language, speech has proven to be the most efficient way for us to communicate, from the simplest requests to the most complex ideas. Now, with advances in technology, speech is poised to become the next major transformation of the user interface.
Just as human speech developed from basic sounds into words, and then combined those words into phrases and sentences, interfaces have developed along similar lines in the last 50 years. Command line interfaces (CLIs) in the 1960s allowed users to interact with computers using keyboards to 'speak' one- or two-word commands from a rigid, fixed set. These gave way to graphical user interfaces (GUIs) in the 1970s that made extensive use of the mouse and of metaphors like physical desks and rubbish bins – phrases of a sort – to make computers more accessible and less rigid.
No more adapting to limitations
With the development of multi-input touchscreens from the 1980s through today, we have extended the GUI: we use our fingers to swipe, pinch, and zoom, much as we emphasise our spoken words with hand gestures. But we still find ourselves having to adapt to the machine’s limitations.
The next interface will transform this relationship: Instead of humans using the machine’s words and metaphors, we will use our free-flowing natural language to interact with our devices. It’s both the most intuitive choice and one that’s been anticipated for years.
Two primary technical challenges have so far prevented us from reaching this transformation: speech recognition and natural language processing. They are distinct problems. Speech recognition is the ability of a machine to correctly discern the individual words you are saying. Natural language processing is the ability of the machine to understand the meaning of those words as they are spoken together.
Both of these fields have been transformed by deep learning, an area of machine learning inspired by the structure and interaction of neurons in the brain. As a result, both areas have seen a doubling in performance and are close to rivaling human ability. All you need to do is consider how well something like Amazon’s Echo works to see how far we’ve come in the last couple of years. Use a device like that for a short time and you’ll understand why voice interfaces will proliferate quickly: It’s a very different experience to speak in a normal way and have a device both understand your accent and comprehend what to do as efficiently as it does.
In addition to being a great natural user interface (NUI), voice can be extended far beyond where we find GUIs today. It requires only a tiny microphone, about 0.5 mm wide, so it can be incorporated into almost any device. This is very useful for the Internet of Things (IoT), where you need to communicate with devices that have ever-shrinking form factors and limited real estate. At CES 2016, voice was enabling watches, thermostats and even alarm clocks. In fact, some type of voice interaction is already present in a lot of places, from smartphones to connected cars and personal assistants, and it’s simplifying your everyday activities. Where you once were forced to use a cable remote to laboriously type the title of an on-demand movie you wanted to watch, it’s now possible on voice-activated systems to just say 'Watch 2001: A Space Odyssey' and it’s done.
Technology and devices haven’t fully caught up to human capabilities just yet, but we are on the cusp of a change that will let us stop adapting to technology and simply speak to accomplish a task. The keyboard may not disappear completely, but we’re getting close to a future where it will be a rare sight, like a punch card or CRT monitor. The era of voice is coming, and it’s going to be as beautiful and elegant as it is productive.
Vijay Balasubramaniyan, CEO & Co-founder at Pindrop Security