How’s your ‘paperless office’? Where did you park your flying car? Will you be taking your next vacation on the moon? Perhaps you could call your robot housemaid and ask it to grab a drink from the fridge. Or maybe you could ask Siri to make dinner reservations for you at your favourite restaurant.
Technology has provided us with many lifestyle benefits, but when it fails to deliver on its promises it fails spectacularly, and we can be an unforgiving lot, with very long memories.
One of the most obvious letdowns is one most of us will have had experience of at one time or another, either by way of the voice dialling feature on our smartphone, or via the voice prompts used on many telephone customer service lines. It’s how we expected to be communicating with technology by 2013, but how close are we to achieving speech recognition technology that actually works?
Engineers and scientists working in the field of speech recognition cautiously predict that the technology will be line ball with humans in terms of understanding multi-language speech by the magic date of 2020.
Maybe. And even if we hit that date, it’s hard not to feel that speech recognition technology is the application that has, so far, most seriously disappointed the consumer.
The role of artificial intelligence
Although we often speak (or swear) at our computers, the thought of them conversing with us may be somewhat disturbing for many people. The ultimate computer with artificial intelligence was HAL, the onboard computer in the science fiction classic, 2001: A Space Odyssey.
HAL could not only understand and respond to human speech, but also determined what was best for the future of the human race, even if that meant losing a few people along the way!
While the computers of today do not have the capabilities of HAL, there is a new era of computers incorporating the use of artificial intelligence to enable speech recognition. Such voice technology systems are no longer just an emerging technology, but are being used by companies such as BMW, Dell, LG and Fisher & Paykel.
Automobile maker Ford has recently demonstrated an advanced voice technology system that allows you to communicate fairly naturally with your car. Its vehicles can be equipped with ‘conversational’ speech interface technology. The system uses text-to-speech technology that sounds like you are talking to another person, not a robot.
So what can you talk about with your car?
Want to play music? The system asks what type of music and then will list what artists are available.
Need to make a call? Tell your car to call John Smith and if there is more than one John Smith listed, the car will ask you which one should be called. Also controlled through voice recognition are the car’s navigation system, climate control and retractable roof. Ford’s conversational speech technology has a vocabulary of over 50,000 words.
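The "which John Smith?" exchange described above can be sketched in a few lines of Python. This is only an illustration of the disambiguation idea, not Ford's implementation: the `Contact` class, its `label` field, the `resolve_call` function and the phone numbers are all hypothetical, and the spoken name is assumed to have been transcribed to text already.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    label: str   # hypothetical tag such as "mobile" or "work"
    number: str


def resolve_call(spoken_name, contacts):
    """Return ("call", contact), ("ask", prompt) or ("none", message)."""
    wanted = spoken_name.strip().lower()
    matches = [c for c in contacts if c.name.lower() == wanted]
    if not matches:
        return ("none", f"I couldn't find {spoken_name}.")
    if len(matches) == 1:
        # Exactly one match, so the call can go ahead.
        return ("call", matches[0])
    # More than one match: ask the driver which entry they meant.
    options = ", ".join(c.label for c in matches)
    return ("ask", f"Which {spoken_name}: {options}?")


# Made-up address book for the example.
contacts = [
    Contact("John Smith", "mobile", "555-0101"),
    Contact("John Smith", "work", "555-0102"),
    Contact("Jane Doe", "home", "555-0103"),
]

print(resolve_call("John Smith", contacts))
print(resolve_call("Jane Doe", contacts))
```

With two John Smiths in the list the function returns a follow-up question rather than placing a call, which is the behaviour the article describes.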
Electronics manufacturers in particular are exploring the exciting potential of voice recognition technology across their broad product portfolios.
At the IFA 2013 Expo in Berlin, LG showcased voice recognition technology via a range of products, including:
The LG Android smartphone (due 2014) with always-on voice commands and the ability to differentiate between general conversation and requests/commands specifically directed at it.
A fridge that can provide spoken recommendations on dishes to cook based on which ingredients are available in the refrigerator.
LG’s robotic RoboKing Square vacuum cleaner that can be controlled with a smartphone using either onscreen controls or voice commands.
Speech recognition v Natural Language detection
Speech recognition technology has been around for decades, but it has always been awkward and far less accurate than typing on a keyboard.
Plus, the user has always needed to ‘think’ like a computer rather than converse naturally with the technology, as in a normal chat between two people.
Achieving this kind of ‘conversational understanding’ has been the Holy Grail of speech recognition technology, and it has taken major leaps forward in the last few years.
Often referred to as ‘Natural Language’ detection, the technology is now appearing in products as diverse as televisions, smartphones, appliances and high-end motor cars.
These products don’t just respond to simple verbal commands; you can ask them to search the web, find an airline arrival time, or recommend a movie with a particular actor for you to watch.
This kind of interaction is here now, and is set to reinvent how technology is used.
Vlad Sejnoha, the chief technology officer of Nuance Communications, one of the leading developers in voice recognition and birthplace of Apple’s digital assistant, Siri, says “We’re at a transition point where voice and natural language understanding are suddenly at the forefront.
“I think voice recognition is really going to upend the current user interface, not just in computers but in a broad variety of devices – it’s really the ‘next big thing.'”
Progress has come about thanks in part to the steady evolution of the technologies needed to help machines understand the nuances of human speech, including machine learning and statistical data-mining techniques.
Sophisticated voice technology is already commonplace in call centres, where it lets users navigate through menus and even helps identify irate customers who should be handed off to a human customer service representative.
Now the rapid rise of powerful mobile devices such as tablets and smartphones is making voice interfaces even more useful and commonplace.
Eugene Kim, project leader for speech recognition technology at LG Electronics Advanced Research Institute in Korea, says today’s smartphones pack as much processing power as the high-end PCs he worked with in the mid-’00s. Smartphones also have connections to the cloud, where powerful servers handle the processing involved in voice recognition and understanding language.
“The combination of more accessible data and more computing power means you can do things today that you just couldn’t do previously,” says Kim.
“Today we have access to more sophisticated statistical models with a higher degree of accuracy.”
The Siri way
One of the most high-profile examples of a mobile voice interface is, of course, Apple’s Siri, the voice-activated personal assistant built into the iPhone and iPad.
Voice functionality is also built into the Android and Windows Phone platforms, and most other mobile systems, and many apps now incorporate voice recognition too. While these interfaces still have considerable limitations, it appears we are inching closer to machine interfaces we can actually converse with naturally.
According to LG’s DK Jung, Chief Research Engineer at LG Electronics in Korea, the company has been inspired by the success of voice recognition software on its mobile phones and TVs, and is now focusing on speech interfaces in many more of the company’s products. Jung says “LGE feels that there is a broad range of products and services that are now ripe for voice recognition innovation.”
Eugene Kim explained that “The basis of LG’s voice recognition software is that it works in tandem with Google’s Android OS language-based systems. The collaboration with Google means that LG can embed the software within its own proprietary ‘Werniche’ engine, which works to process sentences and accurately extract linguistic meanings.”
According to Kim, “This new technology delivers both Natural Language Understanding, which allows for intelligible processing of sentences, and Dialogue Management, which accesses a vast learning database of language-based information used for responses to verbal requests.”
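The two stages Kim describes, understanding a sentence and then managing the reply, can be illustrated with a deliberately tiny Python sketch. It is not LG's engine: the regular-expression patterns, intent names and canned responses below are assumptions made up for the example, and a real system would rely on statistical models and a far larger response database.

```python
import re

# Hypothetical intent patterns standing in for a trained language model.
INTENT_PATTERNS = {
    "find_movie": re.compile(
        r"(?:movie|film) with (?P<actor>[\w .'-]+)", re.I),
    "flight_arrival": re.compile(
        r"when does flight (?P<flight>[A-Za-z]{2}\d+) (?:arrive|land)", re.I),
}

# Hypothetical response templates standing in for the response database.
RESPONSES = {
    "find_movie": "Searching for movies starring {actor}.",
    "flight_arrival": "Checking the arrival time for flight {flight}.",
    "unknown": "Sorry, I didn't understand that.",
}


def understand(utterance):
    """Natural language understanding: map text to an intent and its slots."""
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots = {k: v.strip() for k, v in match.groupdict().items()}
            return {"intent": intent, "slots": slots}
    return {"intent": "unknown", "slots": {}}


def respond(parsed):
    """Dialogue management: choose a reply for the recognised intent."""
    return RESPONSES[parsed["intent"]].format(**parsed["slots"])


for request in ("Recommend a movie with Tom Hanks",
                "When does flight QF12 arrive?",
                "Make me a sandwich"):
    print(respond(understand(request)))
```

Keeping the two steps in separate functions mirrors the split in Kim's description: the first extracts meaning from the sentence, the second decides what to say back.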
Smart TV and Natural language
The introduction of Smart TV with its myriad connectivity and content options – social networking, web surfing, apps for games, movies and catch-up TV, networked media sharing – has highlighted the limitations of the traditional TV remote control handset.
Fine for changing channels and volume levels, the 60-year-old control solution delivers the extra functionality of a Smart TV only with an unacceptable level of user complexity, and it’s given the industry impetus to find more intuitive types of human interface.
There has been some dalliance with gesture control, but it appears that Sony, Samsung, LG and Panasonic are opting for voice control as the standard interface. For example, through enhanced voice control capabilities, Samsung’s range of Smart TVs can understand more than 300 commands with much better language and contextual recognition rates.
Samsung has even invested in and collaborated with Australia’s Macquarie University Phonetics Department to localise the voice recognition for the range of Australian accents.
Accents, algorithms and the cloud
Indeed, accents are another wall of complexity for any voice recognition technology to scale.
If you’ve ever struggled to understand English spoken in a heavy foreign accent, where the words may run into each other, you can appreciate that the algorithms capable of interpreting all the variations of language need to be hugely sophisticated. Consider too that almost all languages have their regional accents and dialects, and that English can be spoken with a Slavic accent, Japanese with an Australian accent, and German with an American accent.
The technology to overcome these challenges needs to be seriously clever, and right now, all the elements necessary for realising speech recognition that actually works are starting to align.
The confluence of the increased processing power of hardware devices, huge storage capacity and anywhere-anytime access to the web now makes it possible to create and use the superior algorithms needed to cope with the complex demands of Natural Language detection.
When Siri or Google Voice Search is accessed, for example, the huge voice recognition engine and database aren’t resident on an individual device, but are managed via cloud-based infrastructure. With ever more powerful smartphones, tablets, TVs and other devices, that database can be accessed, interpreted and relayed to users more quickly and accurately than before.
And as computing power improves and data infrastructure expands, voice recognition technology’s usefulness will only increase.
Remember Nuance’s chief technology officer, Vlad Sejnoha? He believes that, within a few years, mobile voice interfaces will be much more pervasive and powerful.
“I should just be able to talk to a device without touching it,” he says. “It will constantly be listening for trigger words, and will just do it – pop up a calendar, or ready a text message, or access a browser that’s navigating you to where you want to go.”
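The idea of a device that is always listening for trigger words can be sketched roughly as follows. Everything here is an assumption for illustration: the wake word `computer`, the trigger-to-action table and the `handle_transcript` function are hypothetical, and the input is taken to be words that a speech recogniser has already transcribed.

```python
# Hypothetical trigger words and the actions they would launch.
TRIGGERS = {
    "calendar": "open_calendar",
    "text": "compose_message",
    "navigate": "start_navigation",
}
WAKE_WORD = "computer"  # assumed wake word; real products choose their own


def handle_transcript(words):
    """Scan recognised words and act on triggers only after the wake word."""
    actions = []
    awake = False
    for word in words:
        token = word.lower().strip(".,!?")
        if token == WAKE_WORD:
            awake = True
        elif awake and token in TRIGGERS:
            actions.append(TRIGGERS[token])
            awake = False  # go back to passive listening
    return actions


speech = "I checked my calendar yesterday. Computer, show my calendar"
print(handle_transcript(speech.split()))
```

The first mention of "calendar" is ignored because it comes up in ordinary conversation; only the one that follows the wake word is treated as a command.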
Then there’s the next big leap – wearable technology. Imagine instructing your Google Glass to take a photo, then having that photo upload directly to your social network. Sejnoha says they are actively planning how speech technology would have to be designed to run on wearable computers.
There’s little doubt that, sooner rather than later, voice recognition will evolve into ‘speech understanding’.
The algorithms that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words, handle linguistic nuance and eventually combine visual cues with voice recognition.
Although it is a huge leap in terms of computing grunt and software sophistication, some researchers argue that the development of speech recognition offers the most direct line from the (clunky) computers of today to true artificial intelligence, reversing the idea that we need artificial intelligence to achieve speech recognition.