Odd things are afoot. IBM is starting a major research push to make computers understand (“recognize,” actually) speech better than people do by 2010. That’s ambitious. They may do it, though.
It’s important here to understand the difference between “speech recognition,” which is what IBM is up to, and the “understanding” of speech. Speech recognition means knowing what words were spoken, nothing more. If you say to a speech-recognizer, “What is art?” it will print out exactly that. It won’t have the foggiest idea what it means. Making a computer understand speech, in the sense of analyzing grammar and handling context, is a bear of a problem. Recognizing speech is a different and more tractable problem.
Pretty good speech-recognition software is available for PCs now, at least for recognizing clear English in a quiet environment. The leaders are Naturally Speaking, from Dragon Systems, and ViaVoice, from IBM. They work pretty well, anyway.
I’ve used Naturally Speaking. After installing it, you read into a microphone passages it displays on-screen, so that it can learn your voice and inflections. With occasional mistakes, whose frequency depends on a number of factors, it sure enough prints what I say. Mostly.
But it is not speaker-independent: It will work only for those to whose voices it has been trained. Speaker-independence, meaning that the software will quickly adapt itself to anyone’s voice, is an important goal. IBM (and everybody else in the field) is chasing it.
Another big problem is noisy environments. Today’s software is thrown off by smallish amounts of background noise. The trading floor at the New York Stock Exchange would be hopeless. But that’s the kind of problem the project is tackling. How to make it work?
I talked to David Nahamoo, who is honcho of IBM’s effort. He said that one promising approach is to combine lip reading with acoustic analysis. Today computers with cameras are quite able to watch the movements of a speaker’s mouth. (Remember the flap over face-recognizing cameras at football games.) As an artificial example, suppose a speaker in a noisy environment said, “Pool.” The acoustic-analysis software might miss part of it, and not know whether the speaker had said “pool” or “ghoul.” Because the lip positions differ, the camera would allow the computer to make the distinction.
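The “pool”/“ghoul” example can be sketched as a simple weighted vote between the two channels. The scores and the weighting scheme below are invented for illustration; a real system would get them from acoustic and lip-shape models, and would combine them in more sophisticated ways.

```python
# Sketch of audio-visual fusion for the "pool" vs. "ghoul" example.
# All probabilities here are made up for illustration; a real recognizer
# would produce them from acoustic and lip-reading (viseme) models.

def fuse(acoustic, visual, weight=0.5):
    """Combine per-word scores from the two channels.

    `weight` is the share given to the visual channel; one plausible
    design is to raise it as background noise increases.
    """
    words = acoustic.keys() & visual.keys()
    return {w: (1 - weight) * acoustic[w] + weight * visual[w] for w in words}

# In noise, the acoustic scores barely separate the two words...
acoustic_scores = {"pool": 0.51, "ghoul": 0.49}
# ...but the lips differ sharply: "p" closes them, "g" does not.
visual_scores = {"pool": 0.90, "ghoul": 0.10}

scores = fuse(acoustic_scores, visual_scores)
best = max(scores, key=scores.get)
print(best)  # the camera tips the decision to "pool"
```

The point of the sketch is only that a nearly useless acoustic margin becomes a decisive one once the visual evidence is folded in.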
Another approach is to use an array of microphones in different positions and combine their outputs to overcome ambient noise. (For tech-heads: a phased array with phase-shifters for beam steering, which, Mr. Nahamoo said, might be slaved to a camera subsystem to follow a moving speaker.)
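For the tech-heads, the idea behind the array can be shown with a toy delay-and-sum beamformer: each microphone’s signal is delayed so that sound from the speaker’s direction adds in phase, while off-axis noise adds incoherently. The geometry, sample rate, and signals below are invented for the sketch.

```python
import math

# Toy delay-and-sum beamformer for a line of microphones.
# Geometry and signals are invented; real arrays do this with
# fractional delays or frequency-domain phase shifts.

SPEED_OF_SOUND = 343.0  # meters per second
SAMPLE_RATE = 16000     # samples per second

def steering_delays(mic_positions, angle_deg):
    """Per-microphone delays (in whole samples) that align a plane
    wave arriving from angle_deg off broadside."""
    theta = math.radians(angle_deg)
    delays = [x * math.sin(theta) / SPEED_OF_SOUND for x in mic_positions]
    ref = min(delays)
    return [round((d - ref) * SAMPLE_RATE) for d in delays]

def delay_and_sum(channels, delays):
    """Shift each channel by its delay and average; the steered
    direction adds coherently, everything else tends to cancel."""
    n = min(len(c) - d for c, d in zip(channels, delays))
    return [sum(c[d + i] for c, d in zip(channels, delays)) / len(channels)
            for i in range(n)]

mics = [0.0, 0.05, 0.10]        # mic positions along a line, in meters
d = steering_delays(mics, 30.0)  # steer 30 degrees off broadside
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(64)]
channels = [[0.0] * di + signal for di in d]  # farther mics hear it later
out = delay_and_sum(channels, d)  # coherent sum recovers the signal
```

Slaving the steering angle to a camera, as Mr. Nahamoo suggested, would just mean recomputing the delays as the tracked speaker moves.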
Another technique, used now but susceptible to improvement, is analysis of patterns of words. For example, suppose I said, “I had gone,” and the computer, because of noise in the background, heard, “I (blur) gone.” Very few single words fit grammatically between “I” and “gone.” The choices are pretty much “I am gone” and “I had gone,” with “had” being statistically more likely. This works reasonably well now. It is curious to be dictating to your computer, see it make a mistake, and then five words later see it go back and correct the error earlier in the sentence because it has figured out that it was probably wrong. (There are also fantastically mathematical methods of recognizing speech that we will studiously ignore here.)
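The word-pattern trick amounts to counting how often pairs of words occur together in large bodies of text. A minimal sketch of filling the “(blur)” in “I (blur) gone,” with made-up counts standing in for a real corpus:

```python
# Toy word-pattern disambiguation: choose the word for the "(blur)" in
# "I (blur) gone" from word-pair counts. The counts are invented; a
# real recognizer estimates them from very large text collections.

bigram_counts = {
    ("I", "had"): 900, ("I", "am"): 700, ("I", "ate"): 50,
    ("had", "gone"): 400, ("am", "gone"): 60, ("ate", "gone"): 0,
}

def score(prev_word, candidate, next_word):
    """How well does `candidate` bridge prev_word ... next_word?"""
    left = bigram_counts.get((prev_word, candidate), 0)
    right = bigram_counts.get((candidate, next_word), 0)
    return left * right  # zero if either transition is unattested

candidates = ["had", "am", "ate"]
best = max(candidates, key=lambda w: score("I", w, "gone"))
print(best)  # "had": plausible on both sides and statistically common
```

The delayed self-correction the column describes falls out naturally: once more words arrive, the counts on the right-hand side change, and a choice that looked good five words ago can be revised.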
Another problem needing work, said Mr. Nahamoo, is what he calls “spontaneous speech.” People don’t speak English, but rather a sort of linguistic spaghetti resembling English. E.g., “He was like uh you know jeetjet and I say …” The “uhs,” “wells,” “likeyouknows,” digressions and grammatical train wrecks of real speech make things more difficult.
Note, though, that to be useful, speech recognition doesn’t have to be perfect. Humans make mistakes. It just has to be good enough; and if, as IBM hopes, computers can do it better than people, that will certainly be good enough.
Now, what would smoothly functioning, speaker-independent, noisy-area speech-recognition be good for? A number of uses come to mind. For example, controlling machines in noisy circumstances when your hands are busy. Or better electronic-answering systems for telephones, so you could say, “Give me accounting,” instead of listening to incompetently designed menus, futilely pushing buttons and swearing.
How comfortable will ordinary people be in talking to machines? I think the jury is still out. Maybe we’ll get used to it. But maybe not. I’d guess that dictation by computer will never fly because people won’t like announcing their thoughts to the office. Acceptance may require more than technical virtuosity. It looks as if we are about to find out.