MAKER OF VOICE RECOGNITION SOFTWARE Dragon NaturallySpeaking, Nuance, took some time to demonstrate to us its latest use for the technology - a platform for mobile and embedded devices, and very clever it was, too.
The company has two very different products: one that lets users of mobile devices dictate SMS messages, e-mails and documents, and another for embedded gadgets, such as in-car systems and GPS units. Ford and FIAT have licensed this technology.
The advantage of the former is that it is now possible to reply to that ever so important message you receive without hurting your fingers, crashing your car, or having 2 rsrt 2 txt spk, looking like somebody with a mental age of five.
Nuance's software managed to beat the reigning world champion texter, Ben Cook, at entering the phrase "The razor-toothed piranhas of the genera Serrasalmus and Pygocentrus are the most ferocious freshwater fish in the world. In reality they seldom attack a human." It did so in a time of just over 16 seconds, well under half Cook's time of 42.2 seconds.
This particular hack has used the desktop version of Dragon to help with his RSI, however, and while it is impressively accurate, it is a bit of a resource hog. Naturally, we were interested to understand more about how the mobile version would cope with the massive drop in available processor power, storage and memory.
The answer is rather cunning, in fact: rather than trying to optimise a full voice recognition program for the small and hugely varying resources of a mobile phone, it has split the process between the handset and a server, and it works as follows.
1) An app running on your Symbian, UIQ, Brew, WinCE or Crackberry device listens to your speech and splits the sound into its phones - the individual components that make up how a word is pronounced. This takes much less bandwidth than the full audio stream.
2) The compressed information is sent over the phone's data link to a server which then performs the voice recognition and returns what you've said as XML. It was claimed that a dual Xeon could handle 250 simultaneous dictation streams.
3) As on the desktop, if something is not clear, for example if you've said a homophone (one of two words that sound the same but are spelled differently), the software will try to determine the best one from the context, and you can select an individual word and choose an alternative from a drop-down list.
4) As you use the service and correct its mistakes, the corrections are fed back and it continues to learn the way you speak.
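The steps above can be sketched in a few lines of Python. This is a rough illustration only: Nuance did not show us what its XML looks like, so the element names, the score attribute and the correction record below are all invented for the purpose, as are the function names. It shows the handset side of steps 2 to 4: read the server's reply, show the most likely spelling of each word, and swap in an alternative when the user picks one from the list.

```python
import xml.etree.ElementTree as ET

# Hypothetical server reply. The <result>/<word>/<alt> layout and the
# "score" attribute are assumptions made for this sketch, not Nuance's format.
SAMPLE_REPLY = """
<result>
  <word><alt score="0.30">there</alt><alt score="0.62">their</alt></word>
  <word><alt score="0.97">car</alt></word>
  <word><alt score="0.91">is</alt></word>
  <word><alt score="0.88">here</alt><alt score="0.12">hear</alt></word>
</result>
"""


def parse_reply(xml_text):
    """Return, for each recognised word, its (spelling, score) candidates, best first."""
    root = ET.fromstring(xml_text.strip())
    words = []
    for word in root.findall("word"):
        alts = [(alt.text, float(alt.get("score"))) for alt in word.findall("alt")]
        alts.sort(key=lambda pair: pair[1], reverse=True)
        words.append(alts)
    return words


def best_text(words):
    """Join the current first-choice spelling of every word."""
    return " ".join(alts[0][0] for alts in words)


def correct(words, index, chosen):
    """Promote the spelling the user picked and return a record of the correction.

    The returned dict stands in for whatever would be sent back to the
    server so that it can keep learning (step 4); its shape is invented.
    """
    alts = words[index]
    rejected = alts[0][0]
    for position, (spelling, _score) in enumerate(alts):
        if spelling == chosen:
            alts.insert(0, alts.pop(position))
            return {"index": index, "rejected": rejected, "chosen": chosen}
    raise ValueError(f"{chosen!r} was not offered for word {index}")


words = parse_reply(SAMPLE_REPLY)
print(best_text(words))            # first guess from the server's scores
feedback = correct(words, 0, "there")
print(best_text(words))            # after the user picks from the drop-down
print(feedback)
```

Keeping every alternative in the reply, rather than just the winner, is what would make the drop-down list cheap: the handset needs no further round trip to the server to offer the other spellings.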
However, in order to get an accurate idea of your voice, the software needs to be trained, and this is done by reading about 90 seconds of pre-defined text containing the full range of individual phones. The server will then store this information on whether you prefer to say tom-ay-to, tom-ah-to or I'll have the steak, please.
By using a more powerful processor together with a tried and tested software base, it didn't appear to suffer from the problems I've experienced with using even simple voice commands on any phone I've owned, and it was indeed possible to dictate e-mails with excellent accuracy in a much shorter time than typing them.
Recognising that its average user may not be used to dictating in a smooth, continuous fashion, Nuance has made the software intelligently remove any spurious "er" or "um" sounds. However, such solutions do tend to suffer badly from external noise, so it will inevitably be much easier to use in a quiet environment.
The trade-off with the server-based approach is that using it will eat into your data allowance, although Nuance claim that the compression gets the data down to 5kb/s. There is also an increasing trend for networks to offer unlimited data as an option: Three and T-Mobile already offer this in the UK.
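For the curious, the crudest possible version of the filler removal looks something like the Python below. It is only a sketch working on finished text: the real thing presumably does its filtering on the sound itself, and both the list of fillers and the function name are our own assumptions.

```python
# Assumed list of hesitation sounds; a real system would not be this naive.
FILLERS = {"er", "erm", "um", "uh"}


def strip_fillers(transcript):
    """Drop whole-word hesitation sounds from a dictated transcript."""
    kept = []
    for token in transcript.split():
        bare = token.strip(".,!?;:").lower()  # ignore attached punctuation and case
        if bare in FILLERS:
            continue
        kept.append(token)
    return " ".join(kept)


print(strip_fillers("um I will be er ten minutes late"))
```

Matching whole words only means that "her" or "summer" survive untouched, which is about the limit of this approach's intelligence.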
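To put the 5kb/s figure in context, here is some back-of-envelope arithmetic in Python. It takes the claimed rate and the claimed 250 streams per dual Xeon server at face value, ignores any protocol overhead, and rounds the piranha dictation to 16 seconds. We were not told whether "kb" means kilobits or kilobytes, so the sketch works out both; the function names are ours.

```python
RATE_KB_PER_S = 5          # claimed compressed rate, unit as quoted ("kb/s")
STREAMS_PER_SERVER = 250   # claimed simultaneous dictation streams per dual Xeon


def upload_size(seconds, rate=RATE_KB_PER_S):
    """Data sent for `seconds` of speech, in the same 'kb' unit as the rate."""
    return seconds * rate


def size_in_kilobytes(seconds, rate_is_kilobits):
    """Convert the upload to kilobytes under either reading of 'kb'."""
    kb = upload_size(seconds)
    return kb / 8 if rate_is_kilobits else kb


def server_ingress(streams=STREAMS_PER_SERVER, rate=RATE_KB_PER_S):
    """Incoming data rate for one fully loaded server, in 'kb' per second."""
    return streams * rate


print(upload_size(16))               # 16 s of dictation: 80 "kb"
print(size_in_kilobytes(16, True))   # if kilobits: 10.0 kilobytes
print(size_in_kilobytes(16, False))  # if kilobytes: 80 kilobytes
print(server_ingress())              # 250 streams: 1250 "kb" per second
```

Either way, a single dictated message comes out at tens of kilobytes at most before overhead.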
Developers of mobile apps will be able to take advantage of the voice recognition functionality, either through the web and VoiceXML, or through an SDK.
As well as using it for texts or e-mails, you can use the voice recognition to command your phone, search the web, or to search through its MP3 library for particular songs or playlists. Finally, a text to speech engine will read out e-mails to you, so you don't need to look at the phone at all.
Phones and cars will shortly be available with the embedded version of the software, which does not use server-based recognition but rather a simpler, general recognition engine providing basic limited-vocabulary SMS dictation, phone control and music searching.
Towards the end of 2007 or early 2008 we'll have the full-featured version. The company are in talks with the phone manufacturers and mobile networks to provide it both as a built-in option on new phones and, potentially, as Java applets available for popular phones. µ