Speech recognition heads to portable media players

The structure of an application follows the type of user interface it uses. The first interactive apps in the PC-DOS days were text-based console apps. An application would ask the user a set of questions, one at a time, ending with the ubiquitous "Are you sure? Y/N". A poor user who mistyped an item would press N and have to fill in the whole list all over again.

GUIs gave the initiative to the users, who could fill in (or not) fields in any order. Validation could occur on each item as it was entered. A big step forward, if you had a PC handy.

VUIs (voice user interfaces) are a whole different animal. In a VUI you must activate the grammar before you ask the question, and that grammar must contain all the possible answers. This complicates the user interface because some data values are open-ended. Consider getting a mailing address from the user.

The state is easy; there is a fixed set of them. The zip code is more open-ended, but there is still an underlying pattern (a 5-digit number in the US, AlphaNumAlpha-NumAlphaNum in Canada) that can be used to create a grammar.
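To make the "underlying pattern" idea concrete, here is a minimal sketch that expresses the two patterns as regular expressions. A real speech platform would use its own grammar format rather than a regex, and the function name `classify_postal_code` is made up for illustration.

```python
import re

# US zip code: exactly five digits.
US_ZIP = re.compile(r"[0-9]{5}")

# Canadian postal code: Alpha Num Alpha, optional space or hyphen, Num Alpha Num.
CA_POSTAL = re.compile(r"[A-Z][0-9][A-Z][ -]?[0-9][A-Z][0-9]")


def classify_postal_code(text):
    """Return "US", "CA", or None depending on which pattern the text fits."""
    cleaned = text.strip().upper()
    if US_ZIP.fullmatch(cleaned):
        return "US"
    if CA_POSTAL.fullmatch(cleaned):
        return "CA"
    return None
```

The point is only that the set of valid answers can be described by a compact rule instead of an explicit list; it says nothing about whether a given code has actually been assigned.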

Street addresses are completely open-ended. They have, if you'll pardon the pun, a large address space.

Here's a sampling:
1. Dr. Martin Luther King Jr. Ave is the longest street in Albuquerque
2. Ho Road in Carefree, AZ meets Hum Rd at the corner of Ho and Hum
3. Akaaka Street is in Oahu
4. not to mention the dreaded Welsh names like Gwernymynydd

The only feasible approach for getting a street address is divide-and-conquer. Ask for the zip code first and then, using census data, have a separate grammar for every zip code. Suddenly, your simple feature of getting the caller's address requires determining every street name in the country! As discussed before, this is a perfect job for third-party speech objects.
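The divide-and-conquer step can be sketched roughly as follows. The table `STREETS_BY_ZIP`, its placeholder zip codes, and the function `street_grammar_for` are all assumptions for illustration; a real system would load this data from a census-derived database or a third-party speech object.

```python
# Hypothetical lookup table: zip code -> street names in that zip code.
# The zip codes are placeholders, not the real codes for these streets.
STREETS_BY_ZIP = {
    "00001": ["Dr. Martin Luther King Jr. Ave"],
    "00002": ["Hum Road", "Ho Road", "Ho Road"],
}


def street_grammar_for(zip_code):
    """Return the list of street phrases to activate for one zip code.

    Duplicates are removed and the result is sorted so that the same
    zip code always yields the same grammar.
    """
    streets = STREETS_BY_ZIP.get(zip_code, [])
    return sorted(set(streets))


# The dialog asks for the zip code first, then activates only the
# streets for that answer before asking for the street name.
active_grammar = street_grammar_for("00002")
```

An unknown zip code yields an empty grammar here; a real dialog would need some fallback for that case, such as transferring the caller to an operator.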

The structure of speech application code reflects this issue. Much of the validation code of GUIs now becomes grammar generation code that runs at the start of the dialog. When the speech dialog ends, there's not much validation to do, since the user was picking from lists that we generated. Of course, dynamic grammar generation creates problems of its own: caching and avoiding unnecessary grammar reloads.

Apps that do well in an "everything is a listbox" world are the ones that already know about the user. Existing customers call in, enter an account number, and the app already knows their phone number, address, and GPS coordinates.

Two US firms have outlined ambitious plans to let users talk to their digital media players, telling them what they want to hear next.

Music library firm Gracenote has teamed up with ScanSoft to offer a control system intended to give people hands-free access to their digital music collections on the move and make the need for thumbs a thing of the past.

“Voice command-and-control unlocks the potential of devices that can store large digital music collections,” said Ross Blanchard, vice president of business development for Gracenote.

“These applications will radically change the car entertainment experience, allowing drivers to enjoy their entire music collections without ever taking their hands off the steering wheel,” he added.

If the Gracenote name sounds familiar, it's because it currently provides music library information and ID3 tagging for millions of different albums for music download services such as Apple’s iTunes and Windows Media Player.

“Speech is a natural fit for today’s consumer devices, particularly in mobile environments, and the increasing portability of large libraries of music and video files make speech a necessary interface for safety and convenience for entertainment devices,” stated Alan Schwartz, vice president of SpeechWorks, a division of ScanSoft.
“Pairing our voice technologies with Gracenote’s vast music and video database will bring the benefits of speech technologies to a host of consumer devices and enable people to access their media in ways they’ve never imagined.”

Targeted products include car entertainment, portable media players and home entertainment devices such as media servers. The companies estimate that fully-integrated solutions for hardware and software platforms will be available in the fourth quarter of 2005.

However, the companies have not commented on which players will use the new software.