Speech recognition: the flop that’s about to be fab

100% human

Progress has come about thanks in part to steady, evolutionary advances in the technologies needed to help machines understand the nuances of human speech, including machine learning and statistical data-mining techniques.

Sophisticated voice technology is already commonplace in call centres, where it lets users navigate through menus and even helps identify irate customers who should be handed off to a human customer service representative.

Now the rapid rise of powerful mobile devices such as tablets and smartphones is making voice interfaces even more useful and commonplace.

One of LG's research and development facilities.

Eugene Kim, project leader for speech recognition technology at LG Electronics Advanced Research Institute in Korea, says today’s smartphones pack as much processing power as the high-end PCs he worked with in the mid-’00s. Smartphones also have connections to the cloud, where powerful servers provide the processing power involved in recognising voices and understanding language.

“The combination of more accessible data and more computing power means you can do things today that you just couldn’t do previously,” says Kim.

“Today we have access to more sophisticated statistical models with a higher degree of accuracy.”

The Siri way

One of the most high-profile examples of a mobile voice interface is, of course, Apple’s Siri, the voice-activated personal assistant built into the iPhone and iPad.

Voice functionality is also built into Android, Windows Phone and most other mobile platforms, and many apps now incorporate voice recognition too. While these interfaces still have considerable limitations, it appears we are inching closer to machine interfaces we can actually converse with naturally.

According to DK Jung, chief research engineer at LG Electronics in Korea, the company has been inspired by the success of voice recognition software on its mobile phones and TVs, and is now focusing on speech interfaces in many more of its products. Jung says, “LGE feels that there is a broad range of products and services that are now ripe for voice recognition innovation.”

Eugene Kim explains: “The basis of LG’s voice recognition software is that it works in tandem with Google’s Android OS language-based systems. The collaboration with Google means that LG can embed the software within its own proprietary ‘Werniche’ engine, which works to process sentences and accurately extract linguistic meanings.”

According to Kim, “This new technology delivers both Natural Language Understanding, which allows for intelligible processing of sentences, and Dialogue Management, which accesses a vast learning database of language-based information used for responses to verbal requests.”

Smart TV and natural language

The introduction of Smart TV, with its myriad connectivity and content options (social networking, web surfing, apps for games, movies and catch-up TV, networked media sharing), has highlighted the limitations of the traditional TV remote control handset.

Fine for changing channels and volume levels, the 60-year-old control solution delivers the extra functionality of a Smart TV only with an unacceptable level of user complexity, and that has given the industry the impetus to find more intuitive types of human interface.

There has been some dalliance with gesture control, but it appears that Sony, Samsung, LG and Panasonic are opting for voice control as the standard interface. For example, through enhanced voice control capabilities, Samsung’s range of Smart TVs can understand more than 300 commands with much better language and contextual recognition rates.

Samsung has even invested in and collaborated with Australia’s Macquarie University Phonetics Department to localise the voice recognition for the range of Australian accents.

Samsung's use of voice analysis in its TVs could easily spread to interpreting Australian voices on phones.

Accents, algorithms and the cloud

Indeed, accents are another wall of complexity for any voice recognition technology to scale.

If you’ve ever struggled to understand English spoken in a heavy foreign accent, where the words may run into each other, you can appreciate that the algorithms capable of interpreting all the variations of language need to be hugely sophisticated. Consider, too, that almost all languages have their regional accents and dialects, and that English can be spoken with a Slavic accent, Japanese with an Australian accent, and German with an American accent.

The technology to overcome these challenges needs to be seriously clever, and right now, all the elements necessary for realising speech recognition that actually works are starting to align.

The confluence of more powerful hardware, huge storage capacity and anywhere-anytime access to the web now makes it possible to create and use the superior algorithms needed to cope with the complex demands of natural language detection.

There is plenty of power to interpret voice commands with cloud computing.

When Siri or Google Voice Search is accessed, for example, the huge voice recognition engine and database aren’t resident on the individual device, but are managed via cloud-based infrastructure. With ever more powerful smartphones, tablets, TVs and other devices, that database can be accessed, and its results interpreted and relayed to users, more quickly and accurately than before.

And as computing power improves and data infrastructure expands, voice recognition technology’s usefulness will only increase.

Remember Nuance’s chief technology officer, Vlad Sejnoha? He believes that, within a few years, mobile voice interfaces will be much more pervasive and powerful.

“I should just be able to talk to a device without touching it,” he says. “It will constantly be listening for trigger words, and will just do it – pop up a calendar, or ready a text message, or access a browser that’s navigating you to where you want to go.”

Then there’s the next big leap – wearable technology. Imagine instructing your Google Glass to take a photo, then having that photo upload directly to your social network. Sejnoha says the company is actively planning how speech technology would have to be designed to run on wearable computers.

Google Glass may present a different way of seeing the world, but it will listen to your voice commands as one of the ways to control it.

Final word

There’s little doubt that, sooner rather than later, voice recognition will evolve into ‘speech understanding’.

The algorithms that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words, handle linguistic nuance and eventually combine visual cues with voice recognition.

Although it is a huge leap in terms of computing grunt and software sophistication, some researchers argue that the development of speech recognition offers the most direct line from the (clunky) computers of today to true artificial intelligence, reversing the idea that we need artificial intelligence to achieve speech recognition.