Apple's human interface research - speech recognition

Written by David Tebbutt, MacUser 11/91 item 03 - scanned

The potential for speech recognition on the Mac is vast. In the third of a series on Apple's Advanced Technology Group, MacUser sees how researchers are tackling usability issues early on.

The first article in this series was about a single project - handheld computers. The second dealt with a variety of projects - the virtual sphere, placement of objects in 3D space, and the animation of the user interface. This third article concerns research by Apple's Human Interface Group (HIG) into speech recognition and its application in the workplace. Previous articles focused on HIG's achievements; this one concentrates on issues and their resolution.

Eric Hulteen, the team leader in HIG, is responsible for speech research. He is also the project leader for a cross-functional team, within the Advanced Technology Group (ATG), of people working in human interface, natural language, speech recognition and speech synthesis. Their job is to explore tasks and applications in which voice interaction is most appropriate and compelling; to build working systems and do wide-scale testing; and to use the results of this work to define a path for the integration of speech into the user interface and to influence Apple's speech technology developments.

Hulteen has worked with speech systems for over 15 years. He is co-author of Put That There, a speech- and gesture-driven object placement system. He has also worked on graphical interfaces for editing audio waveforms, and on speech interfaces for 3D object manipulation. Hulteen worked with Nicholas Negroponte at MIT and Alan Kay at Atari, and joined Apple in 1985, shortly before the departure of Steve Jobs.

According to Hulteen: "Jobs had absolute power. He decided when the interface was right". An interface group was formed to fill the void left by Jobs. Hulteen was one of the three original members of that team. The group to which he now belongs was formed later in ATG by S Joy Mountford.

Integrating Speech

The researchers have wrestled for several years with how to integrate speech into the Mac (and machines that follow). To succeed, Hulteen says "It has to be integrated well. Recording, playback, synthesis, recognition and understanding need to be as fundamental to the platform as bitmap display and the mouse was in 1984." This raises a number of questions. How should the hardware platform be modified for such a compute-intensive task? How could speech facilities be made accessible to developers? And how should the user interface be changed to accept this new modality?

Delegate Responsibility

A much-touted feature of the graphical, direct manipulation interface of the Mac is WYSIWYG. But what you see is all you get. Hulteen's speech work is leading the researchers into the new, complementary paradigm of delegation - of telling the machine the result you want, rather than doing it yourself. Hulteen draws a parallel with flying to Boston rather than driving: to fly, you delegate responsibility to the airline. Hulteen believes Apple should have a group dedicated to researching the concept of delegation.

Speech does not lend itself well to direct control of the interface. Users won't sit there, saying "up, up, up, right a bit," and so on. That kind of control is best left to the mouse. On the other hand, speech does lend itself well to delegation. A spoken command such as "call my wife" could summon a telephone dialler, look up "wife", try the number, call an alternative if there is no reply, and notify the user when the connection has been made.
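The "call my wife" sequence above can be sketched in a few lines of Python. This is only an illustration of delegation, not a description of any Apple software: the directory contents, the telephone numbers and the `dial` function (which stands in for the dialler and reports whether the call was answered) are all invented for the example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical directory: a spoken keyword maps to numbers to try in order.
DIRECTORY: Dict[str, List[str]] = {
    "wife": ["555-0101", "555-0102"],  # invented primary and alternative numbers
}


def delegate_call(command: str,
                  directory: Dict[str, List[str]],
                  dial: Callable[[str], bool]) -> str:
    """Carry out a spoken "call my X" request and report the outcome.

    `dial` is an assumed stand-in for the telephone dialler; it returns
    True if the call to the given number was answered.
    """
    words = command.lower().split()
    # Look up the first word of the command that is a known keyword.
    keyword: Optional[str] = next((w for w in words if w in directory), None)
    if keyword is None:
        return "no directory entry matches the command"
    # Try the main number first, then each alternative if there is no reply.
    for number in directory[keyword]:
        if dial(number):
            # Notify the user once the connection has been made.
            return f"connected to {keyword} on {number}"
    return f"no reply from {keyword}"


# Simulate a dialler where only the alternative number answers.
print(delegate_call("call my wife", DIRECTORY, lambda n: n == "555-0102"))
```

The user states the result wanted once; the lookup, the retry on the alternative number and the notification all happen without further instruction.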

After 15 years in the field, Hulteen has very clear ideas about speech research and the goals of the organisations that do it. He points out that IBM's efforts in speech were driven by the desire to have a listening typewriter. AT&T is interested in speaker-independent systems that work over the telephone. The security agencies need keyword recognisers that can trigger tape recorders or wake listeners up. The Japanese have a very complicated written language (Kanji), so they'd love a substitute for the keyboard.

In the Mac world, Articulate Systems took out a licence on recognition technology from Dragon Systems, the same technology that was licensed by IBM, and built a peripheral, Voice Navigator, with voice control of the graphical user interface. Hulteen believes this was the right first step, and one which Apple should have taken. He says: "There is a small role for speech in direct manipulation." Voice Navigator uses speech to trigger macros, which is a way of delegating. Hulteen says that Apple "should take this as a warning that others are as bright as we think we are". But he points out that Voice Navigator cannot get deep enough into the operating system to accomplish the integration of speech.

"The user interface is what's important about the Mac. Apple wants to integrate speech technology into the interface to help users use the machine better and for developers to integrate speech better," says Hulteen. To achieve this, Apple recruited Kai-Fu Lee, an expert on speech recognition. He developed what Hulteen describes as "the world's leading recogniser" at Carnegie Mellon University.

Many of today's speech recognisers have to be trained to recognise a particular voice and require their users to separate each word. From a user's point of view, this is asking too much. The alternative is continuous speech recognition, regardless of speaker. One of Kai-Fu Lee's first jobs is to see whether his previous work can be implemented on a Mac. "If he can, we win. If he can't, then it's just a question of time," says Hulteen, referring to the increasing power of new machines.

For Hulteen, "adding speech is at least as significant as the mouse and bitmap. It needs significant commitment at the corporate level." He points out that the Mac interface was the brainchild of Xerox PARC (Palo Alto Research Center) and Steve Jobs, and that, since then, we have only seen evolution of the idea. Speech could be the next big step.

The emphasis of the speech project is now on recognition. The work is spread around several groups, with Kai-Fu Lee's group focussing on speech technology, while HIG looks at applications and user interfaces. Hulteen's group is responsible for using today's technology to implement concrete, specific applications, and to find out how users interact with them. This is a move away from other technology demonstrations, which are motivated by the desire to show off technology, rather than a desire to show real users using it.

Hulteen suggests that speech can be used for obvious things, like speaking to buttons and dialogs (the graphical elements of direct-manipulation interfaces) and holding dialogues with the user - the machine asks the questions, the user answers. But there's much more to speech than this: it is central to supporting delegation at the interface. And Hulteen points out that simply implementing speech to support delegation, leaving the mouse and bitmap for manipulation, is not enough. Everything must be integrated, and the user must have the choice of using speech or conventional input methods.

The team's research was implemented in a telephony and remote access project. The idea was to bring telephone control to the screen, allowing remote access to the same functions. One of the applications, created by Chris DiGiano, uses direct manipulation to facilitate call conferencing and transfer on a telephone. It was chosen to demonstrate that speech and direct manipulation are more powerful together than either is alone.

Microphones were rejected in favour of a telephone handset for this application because people are used to speaking into telephones for this task. Speaking into microphones can be embarrassing - Hulteen calls it "mike fright". He says: "People feel weird talking to small fibrous objects floating in space." And microphones must be carefully placed to avoid picking up too much extra noise.

Hulteen is aware that with a handset the hands aren't free and you don't want to pick up the phone in order to use speech. He says: "This is, however, not a consideration in a telephony application, since users normally pick up the handset anyway."

Apple likes to find real-life metaphors to help users understand new concepts. One of the problems the speech team faces is that people don't have an abstract view of a telephone's functions, so the telephone itself can't be used as a model. The metaphor of a doctor's office was also rejected. They are now working on a more abstract model of lines, diallers and conference areas. While not meeting with universal approval, this will do until further user testing suggests something better.

Another of the telephony prototypes, developed by Lisa Stifelman, uses Voice Navigator. Hulteen picks up a telephone, the system suppresses the dial tone so that his speech can be heard, and he says: "Lewis Knapp". The application searches Hulteen's personal directory for Knapp's telephone number and uses a synthesiser to speak "calling Lewis Knapp" in the earpiece of the telephone handset. The connection is made across Apple's internal telephone system and the two men speak to each other. The prototype system can cope with keywords (such as "wife", "office" or "home") and additional telephone numbers for each directory entry. In theory, it could be accessed remotely.
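One way to picture a directory with keywords and several numbers per entry is sketched below in Python. It is a guess at the general idea, not the prototype's actual design: the layout of `PERSONAL_DIRECTORY`, the invented numbers, the `resolve` function, and the assumptions that a keyword trails the name and that "office" is the default are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Assumption: "office" and "home" act as labels that may follow a spoken name.
LABELS = ("office", "home")

# Hypothetical personal directory; every number here is invented.
PERSONAL_DIRECTORY: Dict[str, Dict[str, str]] = {
    "lewis knapp": {"office": "555-0110", "home": "555-0111"},
}


def resolve(utterance: str,
            directory: Dict[str, Dict[str, str]],
            default_label: str = "office") -> Optional[Tuple[str, str]]:
    """Map recognised words to (number to dial, text to synthesise).

    Returns None when the name or the requested label is not in the directory.
    """
    words = utterance.lower().split()
    label = default_label
    # A trailing keyword selects one of the entry's additional numbers.
    if words and words[-1] in LABELS:
        label = words.pop()
    name = " ".join(words)
    numbers = directory.get(name)
    if numbers is None or label not in numbers:
        return None
    # The confirmation is what a synthesiser would speak in the earpiece.
    return numbers[label], f"calling {name.title()}"


print(resolve("Lewis Knapp", PERSONAL_DIRECTORY))
print(resolve("Lewis Knapp home", PERSONAL_DIRECTORY))
```

Keeping the lookup separate from dialling and synthesis means the same function could serve a caller at the desk or, in principle, one dialling in remotely.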

The other side of the project is remote access. Hulteen feels the phrase "the utility of the Mac falls to zero at a distance of two feet" is appropriate. For distances up to 12 feet, a microphone might fit the bill. But what about further? Once again, the telephone is the most appropriate method. If the computer can understand and act on telephone instructions, then it doesn't really matter if the speaker is at the machine or calling from an airport lounge on the other side of the world.

The Mac could be directed to carry out tasks and deliver the results over the telephone. You could get it to send faxes, or give it a list of people you need to talk to. The Mac could find each person in turn, allow you to speak to them, and regain control after each conversation. The Mac could even call directory enquiries and ask for a number - but direct access to the directory database would be better.

User-Centred Application

The team is exploring applications that are user-centred, as opposed to technology-centred, and is meeting some resistance to this. Hulteen says: "Technologists can't let go of the need to make it look good. We need to wrest it from their grasp and show it to people who couldn't give a damn, and see what they make of it." In the end, the user will determine if the technology actually gets used.

Susan Brennan, a psychologist on the team and a professor of psychology at the State University of New York at Stony Brook, has developed a dialogue manager that gives users evidence on which parts of the speech input the computer would understand at different points in a dialogue. By conducting these studies before the application actually exists, usability issues can be addressed early.

In this study, users spoke into a telephone and the experimenter rapidly typed their input. Then the experimenter used buttons on a HyperCard control panel to provide appropriate sounds and messages in response. Users' behaviour was recorded and analysed later.

This technique helped determine the kinds of feedback messages that users need to keep track of the telephone application and use it confidently, the kinds of things users are likely to say in this situation, and how users expect the telephony application to behave.
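The bookkeeping behind such a study can be illustrated with a short Python sketch. It is not the HyperCard panel used by the team: the `Session` class, the button names and the canned responses are invented, and the sketch only shows how typed input and chosen responses might be logged for later analysis.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical canned responses, standing in for the control-panel buttons.
RESPONSES = {
    "ack": "beep",
    "confirm": "calling",
    "repeat": "please repeat",
}


@dataclass
class Session:
    """Log of one simulated dialogue: (what the user said, button pressed)."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, typed_input: str, button: str) -> str:
        """Store the experimenter's transcription and return the response played."""
        response = RESPONSES[button]
        self.turns.append((typed_input.lower().strip(), button))
        return response

    def common_phrases(self, n: int = 3) -> List[Tuple[str, int]]:
        """Most frequent user utterances, for analysis after the session."""
        return Counter(text for text, _ in self.turns).most_common(n)


session = Session()
session.record("Call Lewis", "confirm")
session.record("call lewis ", "confirm")
session.record("hang up", "ack")
print(session.common_phrases(1))
```

A log of this kind would show both what users tend to say and which feedback they were given at each point, the two things the study set out to learn.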

As well as continuing to develop the desktop applications of speech technology, the team will be looking at handheld computers and exploring the possibilities of speech there. They will look at how these machines can be used away from the office and how they will integrate into the office.

MacUser would like to thank Eric A Hulteen, Lewis Knapp, Lisa Stifelman, Chris DiGiano, Susan Brennan and S Joy Mountford for their help in researching this article.