Automated Speech Recognition Annotated Bibliography

Michael A. Grasso - Speech Recognition Annotated Bibliography

Research Agenda and Foundational Material
Comparison Studies and Evaluations
Direct Manipulation Interfaces
Speech Recognition and Natural Language Processing
Medical Speech-Driven Data Collection
Medical Speech-Driven Dictation
Speech in Virtual Environments
Mapping GUIs to Auditory Interfaces
World Wide Web, Networks and Servers
Application Development Frameworks

Research Agenda and Foundational Material

Cole R et al. The Challenge of Spoken Language Systems: Research Directions for the Nineties. IEEE Transactions on Speech and Audio Processing, 3(1):1-21, January 1995.

Eight areas were identified where basic research is needed to produce spoken language systems. These include robust speech recognition, training, spontaneous speech, dialogue models, natural language response, speech synthesis and generation, multilingual systems, and multimodal systems. In each area, the paper identifies key research challenges, the infrastructure needed to support research, and the expected benefits.

Bradford JH. The Human Factors of Speech-Based Interfaces: A Research Agenda. SIGCHI Bulletin, 27(2):61-67, April 1995.

The unique aspects of speech technology are compared to traditional interaction techniques. Various human factors relating to speech technology are then discussed in the context of a proposed research agenda for speech-based interfaces.

Tucker P, Jones DM. Voice as Interface: An Overview. International Journal of Human-Computer Interaction 3(2):145-170, 1991.

An overview of speech-enabled computer interfaces using speech recognition and speech synthesis is given. Human factors and implementation guidelines are covered to explain the predominance of visual/manual interfaces over speech and to suggest that much can be done to improve the usability of speech systems.

Peacocke RD, Graf DH. An Introduction to Speech and Speaker Recognition, Computer, 23(8):26-33, 1990.

This paper gives a taxonomy of speech-recognition technology and an assessment of the state of the art. Characteristics such as speaker dependence and vocabulary size are discussed, as is the basic architecture of a speech recognition system. Finally, applications of speech recognition and speaker recognition are surveyed.

Zue V, Research Overview, 1995 Annual Research Summary, Laboratory for Computer Science, Massachusetts Institute of Technology.

The Laboratory for Computer Science's strategy has been to develop human language technology within the context of real application domains. This strategy helps to illuminate critical technical issues, demonstrates the usefulness of the technology, and allows real people to access information and solve real problems. The Voyager urban navigation and exploration system has served as their primary platform for multilingual conversational systems. Pegasus, a spoken language interface to the SABRE reservation system, has been used for the development of displayless conversational systems. The Galaxy effort is a distributed architecture for accessing on-line information using spoken input.

Helander M, Moody TS, Joost MG. Systems Design for Automated Speech Recognition. Handbook of Human-Computer Interaction, 301-319, Elsevier Science, North-Holland, 1990.

Baber C, Noyes JM. Interactive Speech Technology. Taylor and Francis Ltd, Bristol PA, 1993.

Streeter LA. Applying Speech Synthesis to User Interfaces. Handbook of Human-Computer Interaction, 321-343, Elsevier Science, North-Holland, 1990.

Comparison Studies and Evaluations

Karl L, Pettey M, Shneiderman B. Speech-Activated versus Mouse-Activated Commands for Word Processing Applications: An Empirical Evaluation. Technical Report CAR-TR-630, Center for Automation Research, University of Maryland, College Park, MD, July 1992.

An experiment was conducted to demonstrate the utility of speech input for command activation during word processing. It was believed that speech would be superior to the mouse with respect to the activation of commands. Also, word processing was considered a hands-busy application, since the user would have to interrupt typing of text in order to execute word processing commands. Speech-activated commands were found to be faster than mouse-activated commands, with similar error rates. However, when using speech, subjects made significantly more memorization errors. Subjects reacted positively to speech input and preferred it over the mouse for command activation, but expressed concerns about recognition accuracy, background noise, inadequate feedback, and slow response time.

Poon AD, Fagan LM. PEN-Ivory: The Design and Evaluation of a Pen-Based Computer System for Structured Data Entry. Report KSL-94-30, Knowledge Systems Laboratory, Medical Computer Science, Stanford University School of Medicine, Stanford, CA.

A pen-based computer system that uses structured data entry for creating patient progress notes was used to empirically evaluate the merits of alternative user-interface characteristics for pen-based systems. These characteristics included scrolling vs. paging, dynamic vs. fixed palette, and showing all vs. showing a subset of findings. Paging, fixed palette, and showing all findings were faster. Future research includes generalizing these results to other domains. Also important is studying the willingness of physicians to use pen-based systems compared to other modalities.

Damper, R.I., Tranchant, M. A. Speech versus Keying in Command and Control: Effect of Concurrent Tasking. International Journal of Human-Computer Studies, 45, (1996), pp. 337-348.

Based on work in the early 1980s, command and control is generally believed to be one specific application where speech input holds great advantages over keyed data entry. This study questioned this assumption with the belief that the early work was biased against keyed entry. Recent work has shown that speech input is somewhat slower and significantly more error-prone than keyed input for command and control tasks. However, this recent study varied somewhat from work in the early 1980s. The objective of this work was to add concurrent, secondary tasking, similar to the earlier work, to see if this could explain the earlier belief that speech was superior to keyed input in this area. The results showed that speech was not significantly faster for the primary task, somewhat faster for the secondary task, and still significantly more error-prone than keyed input. However, if recognition errors are discounted, speech shows a clear superiority over keying. This suggests that as speech recognizers increase in accuracy, speech has potential for the future, especially for high-workload situations involving concurrent tasks.

DeHaemer MJ, Wright G, Dillon TW. Automated Speech Recognition for Spreadsheet Tasks: Performance Effects for Experts and Novices. International Journal of Human-Computer Interaction 6(3):299-318, 1994.

The performance of spreadsheet users was compared using keyboard and speech recognition as modes of input, using a discrete, speaker-dependent recognizer. The speech interface consisted of 3 grammars for cursor control, data entry, and commands. Keyboard input was significantly faster for both expert and novice users. As speech alone seems limited, future work is needed to explore multimodal interfaces. It will therefore be important to identify optimal subtasks for voice input in multimodal input environments.

Molnar, K. K., Kletke, M. G. The Impacts on User Performance and Satisfaction of a Voice-Based Front-End Interface for a Standard Software Tool. International Journal of Human-Computer Studies, 45, (1996), pp. 287-303.

This study empirically compared the effect on performance and satisfaction of a menu-driven and speech-driven front-end interface to a commonly used spreadsheet software package. The results suggest that there is a significant relationship between performance and the user expertise, and between performance and the type of interface. In general, performance and satisfaction increased with the menu-driven interface.

Dillon TW, McDowell D, Norcio AF, DeHaemer MJ. Nursing Acceptance of a Speech-Input Interface: A Preliminary Investigation. Computers in Nursing 12(6):264-271, November/December 1994.

Speech recognition has the potential to increase productivity and reduce data entry errors by providing automation at the source of data collection. With this in mind, a study was undertaken to investigate the acceptance of speech input for nursing applications. A moderately sized, speaker-dependent, connected, and structured vocabulary was used to develop and test a prototype application. To demonstrate hands-busy and eyes-busy operation, subjects used the application to perform a cardiovascular physical assessment on volunteer patients with no visual computer feedback. Results showed a positive reaction to the speech interface by nurses and that the more a user interacts with speech input, the more acceptable it becomes.

Dillon TW, Norcio AF, DeHaemer MJ. Spoken Language Interaction: Effects of Vocabulary Size and Experience on User Efficiency and Acceptability. Proceedings of the Fifth International Conference on Human-Computer Interaction, Orlando, Florida, August 8-13, 1993.

This paper describes a study used to determine the effects of vocabulary size and interface experience on the performance and acceptance of the user. Several volunteer student nurses were asked to perform a hands/eyes-busy task while interacting with a speaker-dependent, connected-speech recognition system with audio output. The results suggested that as a subject gained experience with the interface, the time to complete each task decreased. Also, those using a larger, more inclusive vocabulary had far fewer non-recognized phrases. Subjective ratings of user satisfaction and acceptance indicated that as users interact with the interface, they find it more acceptable.

Casali S. P., Williges B. H., Dryden R. D. Effects of Recognition Accuracy and Vocabulary Size of a Speech Recognition System on Task Performance and User Acceptance. Human Factors, 32, 2, (1990), pp. 183-196.

The purpose of this study was to determine the effects of recognizer accuracy, vocabulary size, and subject age on the speed and acceptance of a speech recognition system. With a limited vocabulary, if a word needed to be entered that was not in the vocabulary, the subject was required to spell the word out. The task domain was data entry for an apparel store's inventory control system. It was observed that decreasing the accuracy and the vocabulary size both significantly increased the time needed to complete the task. Age by itself did not affect time. However, increased age combined with decreased accuracy did increase the time needed. User acceptance varied with accuracy level, but was not affected by vocabulary size. Also, older participants gave higher acceptance ratings.

Schmandt C, Ackerman MS, and Hindus D. Augmenting a Window System with Speech Input, Computer, 23(8):50-56, 1990.

An overview is given of a system developed at the Massachusetts Institute of Technology that uses speech as an auxiliary channel to support window navigation. Xspeak provides a speech interface to X Windows by allowing navigational tasks usually performed with a mouse to be controlled by speech instead. The effort was developed with the assumption that speech input is more valuable when it is combined with other input devices. Note that X Windows uses a three-dimensional spatial metaphor with window navigation controlled by a two-dimensional mouse. When there are many overlapping windows, it therefore can be difficult to reach some applications directly with the mouse. Initial testing revealed that while speech was not faster than the mouse for simple change-of-focus tasks, the advantage shifted toward speech if the window needed was partly or completely hidden. Another observation was that the users most inclined to choose speech input increased the number of overlapping windows or the degree of overlap.

Direct Manipulation Interfaces

Buxton B. HCI and the Inadequacies of Direct Manipulation Systems. SIGCHI Bulletin, 25(1):21-22, January 1993.

An argument is given that direct manipulation (DM) interfaces are inadequate for supporting transactions fundamental to applications such as word processing, CAD, and database queries. For example, one fundamental inadequacy of DM interfaces is the problem of object identification. The selection of object and operation is fundamental to the DM interface. However, this only works for simple operations on objects that are visible. Finally, a challenge is given to see the DM interface not as an end in itself, but as a starting point and to address the shortcomings of existing DM systems.

Carr D, Plaisant C, Hasegawa H. The Design of a Telepathology Workstation: Exploring Remote Images. Technical Report CAR-TR-708, Center for Automation Research, University of Maryland, College Park, MD, May 1994.

Direct manipulation (DM) within a telepathology system requires a different approach to overcome specific limitations. These include time delays, incomplete feedback, and unanticipated interferences. This paper reviews attempts to use a track ball with constant feedback to identify the destination of the move of the microscope. This proved to be an improvement over issuing relative start/stop commands, which often result in overshooting the target because of time delays.

Dillon TW, Emurian HH. Reports of Visual Fatigue Resulting From Use of a Video Display Unit. Computers in Human Behavior, 11(1):77-84, 1995.

This paper reviews four typical methods of gathering self-reported visual fatigue symptoms that appear within the human factors literature. These include qualitative experiences, yes/no questions, answers based on an ordinal scale, and health-related surveys. No standardized data-gathering tool has been used for research in this area, and this must be recognized as a major methodological problem.

Jacob RJK, Sibert LE, McFarlane DC, Mullen MP. Integrality and Separability of Input Devices. ACM Transactions on Computer-Human Interaction, 1(1):3-26, March 1994.

The researchers tested the hypothesis that performance improves when the perceptual structure of the task matches the control structure of the input device. A two-dimensional mouse and a three-dimensional tracker were used as input devices. Two input tasks with three inputs each were used, one where the inputs were integral (x location, y location and size) and the other where the inputs were separable (x location, y location and color). Common sense might say that a three-dimensional tracker is a logical superset of a two-dimensional mouse and therefore always as good and sometimes better. Instead, the results showed that the tracker performance was better when the inputs were perceptually integral, while the mouse performed better when the inputs were separable.

Shneiderman B. Sparks of Innovation in Human-Computer Interaction, Ablex Publishing Corporation, Norwood, NJ, 1993.

Wright P, Lickorish A, Milroy R. Remembering While Mousing: The Cognitive Costs of Mouse Clicks. SIGCHI Bulletin, 26(1):41-45, January 1994.

To conserve space, a common practice is to remove information from computer displays that readers will only need intermittently. This information is often accessible by a single mouse click. This paper describes a study demonstrating that this practice impairs people's memory for other task components due to increased cognitive costs. These findings suggest that software should be designed with additional memory support for users with small screens, and they also help to explain the success of icon bars and ribbon displays, which give people immediate access to the functions they frequently need.

Mahach, K. R., Boehm-Davis, D., Holt, R. The Effects of Mice and Pull-Down Menus Versus Command-Driven Interfaces on Writing. International Journal of Human-Computer Interaction, 7, 3, (1995), pp. 213-234.

This study compared a menu-driven interface to a command-driven interface in the context of writing essays with a word processor. Subjects who used the command-driven interface scored better on organization of their papers, creativity, number of supporting arguments, grammar, spelling, and letter grade than did their mouse counterparts. It was suggested that searching for menu items, physically locating the mouse, and having the menu block part of the screen may indicate that the mouse interface with pull-down menus undermines working memory during the writing task. Note that command-driven interface subjects used a series of function keys for the same editing tasks.

Speech Recognition and Natural Language Processing

Oviatt S. Predicting Spoken Disfluencies During Human-Computer Interaction. Computer Speech and Language 9(1):19-35, January 1995.

A predictive model was developed to account for spontaneous spoken disfluencies. The goal was to understand how to reduce or eliminate disfluent speech through proactive interface design. A Wizard-of-Oz simulation was used to study (1) communication modality - speech-only, pen-only, combined pen/voice; and (2) presentation format - structured, unconstrained, where the structured format consisted of linguistic and graphical clues to guide the user and unconstrained did not. The study revealed that the number of disfluencies is proportional to the length of spoken utterances. Also, with respect to presentation format, the unconstrained format produced a greater degree of disfluencies than highly structured interaction. Future work is needed to explore other ways to reduce disfluencies, aimed at supporting more robust spoken language processing.

Oviatt SL, Cohen PR, Wang M. Toward Interface Design for Human Language Technology: Modality and Structure as Determinants of Linguistic Complexity. Speech Communication, 15:283-300, 1994.

This paper describes a study that examines how input modality and presentation structure influence the linguistic complexity observed in people's spoken and written input to an interactive system. Participants were asked to enter data using speech-only, writing-only, and combined pen/voice exchanges. It was shown that a more structured interface reduced the number of words, length of utterances, and amount of information integrated into a single utterance. When comparing input modality, writing also effectively reduced wordiness and utterance length, but did not yield a parallel reduction in syntactic ambiguity. Modality and presentation structure both contributed to lexical composition. For example, written input contained fewer negatives and more abbreviations.

Marsh E, Wauchope K, Gurney JO. Human-Machine Dialogue for Multi-Modal Decision Support Systems. Technical Report AIC-94-032, NCARAI, US Naval Research Laboratory, Washington, DC.

An overview of specific research projects on multimodal dialogue at the Navy Center for Applied Research in Artificial Intelligence (NCARAI) is given. The first part describes the application of discourse modeling to graphical user interfaces to promote multimodal interactions. The second part describes the integration of graphics and natural language in a fully implemented multimodal/multimedia interface.

Cohen, PR. The Role of Natural Language in a Multimodal Interface. Proceedings of the ACM Symposium on User Interface Software and Technology, Monterey California, ACM Press, November 15-18, 1992.

This paper describes an effort to employ direct manipulation and natural language in multimodal software. The strengths and weaknesses of each modality are given along with arguments as to why they are complementary with respect to reference selection, temporal information, and anaphora. The goal is not simply to make available two or more input modalities, but to integrate them to produce more efficient communication. Based on this objective, a prototype multimodal system was developed that used an integrated direct manipulation and natural language interface. Several examples were cited where the combination of language and mouse input was thought to be more productive than either modality alone.

Oviatt SL, Cohen PR. Spoken Language in Interpreted Telephone Dialogues. Computer Speech and Language, 6:277-302, 1992.

The results of a study are presented that focuses on issues related to the development of automatic telephone interpretation systems. Using an interpreter, a group of English speakers made telephone calls to Japanese confederates who did not speak English. They then made calls to Japanese confederates fluent in English. The results of the interpreted and non-interpreted dialogues were used to help provide a basis for predicting performance and dialogue patterns likely to be encountered by automatic telephone interpretation systems. For example, interpreters played an active role in both directing the content and organizing the flow of service-oriented calls. Over time, interpreters substantially increased their use of explicit third-party references. Also, interpreted calls were organized into a series of extended subdialogues between the interpreter and each of the two primary speakers.

Oviatt SL, Cohen PR. Discourse Structure and Performance Efficiency in Interactive and Non-Interactive Spoken Modalities. Computer Speech and Language, 5:297-326, 1991.

It is important to understand how limitations on speaker interaction influence spoken discourse patterns in different types of tasks. This paper describes a study that compares telephone dialogue with audiotape monologue. Both provide an audio-only interface, with the former being interactive and the latter non-interactive. Participants were asked to provide technical instructions using these methods. The study revealed that non-interactive speech can be prone to excessive elaboration and repetition that results in less integrated and less structured communication. Interactive speech followed a more synchronized collaboration with relatively brief descriptions.

Medical Speech-Driven Data Collection

McMillan PJ, Harris JG. Datavoice: A Microcomputer-Based General Purpose Voice-Controlled Data-Collection System, Comput Biol Med, 20(6):415-419, 1990.

Data collection can be a bottleneck in many computer applications. One system, designed to assist in the collection of stereological data, combines speech input with a digitizing pad. Each data set consists of an object name, recorded by voice, followed by X and Y coordinates, entered with a digitizing pad. The system is used for boundary analysis and histomorphometry of bone and skin. It has a small speaker-dependent vocabulary (less than 50 words) for object names and voice commands, and recognizes discrete speech. This simple interface has low computational requirements and therefore is thought to have a high chance of success. It allows the user to choose between eight control words and between 20 object names. The combination of speech and a digitizing pad was shown to accelerate the data collection process.

Feldman CA, Stevens D. Pilot Study on the Feasibility of a Computerized Speech Recognition Charting System, Community Dent Oral Epidemiol, 18:213-215, 1990.

This paper describes a feasibility study using speech recognition to record clinical data during dental examination. Systems of this type would eliminate the need for a dental assistant to record results. Although speech input was shown to be slower, when the time needed to transfer written results into the computer was considered, the speech method was faster. Speech input also had more errors, although the difference was not statistically significant. Overall the study suggested that speech recognition may be a viable alternative to traditional charting methods.

Smith NT, Brian RA, Pettus DC, Jones BR, Quinn ML, Sarnat L. Recognition Accuracy with a Voice-Recognition System Designed for Anesthesia Record Keeping, J Clin Monit, 6(4):299-306, 1990.

The results of a speech interface for an anesthetist's record keeping system are discussed. Anesthetists are responsible for recording information on drugs administered during medical procedures. Often this information is recorded after the procedure. However, a long interval between the event and its recording can compromise the completeness and accuracy of the manual record. The prototype voice entry system allowed collection of the data during the medical procedure, while the anesthetist's hands are busy. The system used a vocabulary of around 300 words. Preliminary testing showed an accuracy rate of 96%, even in a noisy operating room.

O'Hara SP, Bryant TN, et al. Speech Recognition and the Clinical Microbiology Laboratory. Medical Laboratory Sciences 49:20-26, 1992.

In the clinical pathology laboratory, most information is collected automatically from automated machinery. However, in less automated departments, such as microbiology and histopathology, additional automation of data entry can be achieved through speech-driven systems. A speaker-dependent, mobile system was tested for on-line collection of data from microbiological examinations. The system was evaluated with respect to accuracy, speech recognition, reproducibility, speed, user friendliness, and cost effectiveness. The system performed well and may be a practical alternative to more conventional means of data entry.

Ikerira H, Matsumoto T, Iinuma TA, et al. Analysis of Bone Scintigram Data Using Speech Recognition Reporting System, Radiation Medicine, 8(1):8-12, 1990.

This paper discusses the results of using speech recognition to support hands-busy data collection during the analysis of bone scintigraphic data. Such diagrams are analyzed to study metastases of malignant tumors. A speech system was developed to allow doctors to enter the results of image readings into the computer while looking at the images instead of the terminal. In 580 voice-entered reports, response time was shortened in comparison with dictation or writing by hand.

Issacs E, Wulfman CE, Rohn JA, Lane CD, Fagan LM. Graphical Access to Medical Expert System: IV. Experiments to Determine the Role of Spoken Input, Methods Inform Med, 32(1):18-32, 1993.

The results of a "Wizard of Oz" simulation designed to understand how clinicians might want to speak to a medical decision-support system are presented. The system tested was ONCOCIN, a program that provides therapy advice for patients on complex cancer therapy protocols. It is believed that one cannot simply add speech input on top of an existing text or graphical interface. Instead, an understanding of user expectations and limitations of current speech recognition systems is required. On the whole, the researchers were able to build simple grammars from the "Wizard of Oz" simulations that described the vast majority of what the physicians said while operating the software. In addition, the data gathered resulted in guidelines for the correction of miscommunications based on the part of the utterance that was not understood.

Wulfman CE, Rua M, Lane CD, Shortliffe EH, Fagan LM. Graphical Access to Medical Expert System: V. Integration with Continuous-Speech Recognition, Methods Inform Med, 32(1):33-46, 1993.

This paper describes three prototype computer systems for clinical record keeping that use a combination of window-based graphics and continuous speech in their user interfaces. It was noted that the use of template-based dictation with fill-in forms worked well only when the documentation task was limited to a few standardized reports. Template-based reporting may be inadequate in clinical domains, because the required documentation is less standardized. At the same time, current speech recognition technology does not permit the processing of free-form natural language. The prototype systems explore methods that circumvent shortcomings in the current technology while maintaining the flexibility and naturalness of speech.

Medical Speech-Driven Dictation

Klatt EC. Voice-Activated Dictation for Autopsy Pathology. Comp Biol Med, 21(6):429-433, 1991.

A voice-activated system for dictation and report preparation is described for autopsy pathology. The system uses a large-vocabulary, speaker-adaptive approach to generate template-based reports using fill-in forms, trigger phrases, and free-form speech. The system has the potential to reduce personnel time in dictation and transcription and to increase the accuracy of reports. The amount of time to generate reports decreased when compared to manual typing. It was noted that a greater degree of computer literacy was required and that the need for typed input was not eliminated.

Hollbrook JA. Generating Medical Documentation Through Voice Input: The Emergency Room, Top Health Rec Manage, 12(3):58-63, 1992.

The application of automatic speech recognition (ASR) dictation systems to the emergency room is discussed. ASR systems have the potential to deal with several reporting problems in emergency medicine. Handwritten reports are often illegible and incomplete. Dictation and transcription is more complete, but is often expensive and unavailable. ASR reports have been shown to have a greater impact on quality and completeness than either handwritten or transcribed reports.

Massey BT, Geenen JE, Hogan WJ. Evaluation of a Voice Recognition System for Generation of Therapeutic ERCP Reports. Gastrointest Endosc, 37(6):617-620, 1991.

This paper describes the evaluation of a computerized report-generation system (EndoSpeak) using voice recognition technology for producing therapeutic ERCP reports. Thirty ERCP cases were evaluated using EndoSpeak and standard dictation to develop reports. Dictated reports were judged to have a higher information content than EndoSpeak and were also generated faster. These results appear to be related to the greater complexity of therapeutic ERCP procedures, which minimized the ability to use trigger phrases in EndoSpeak.

Speech in Virtual Environments

Everett SS, Wauchope K, Perez MA. A Natural Language Interface for Virtual Reality Systems. Technical Report AIC-94-046, NCARAI, US Naval Research Laboratory, Washington, DC.

This paper describes a prototype virtual reality system which uses a natural language interface based on off-the-shelf speech recognition and speech synthesis hardware. Natural language was shown to address a number of problems with virtual environments: the user's hands and eyes are occupied in the virtual world; language is better suited to abstract manipulations than joysticks or gloves; and speech output minimizes dependencies on textual displays which can be difficult to read on immersive presentation equipment.

Bolt RA, Herranz E. Two-Handed Gesture in Multi-Modal Natural Dialog. Proceedings of the ACM Symposium on User Interface Software and Technology, Monterey California, ACM Press, November 15-18, 1992.

An agenda for multimodal natural dialog is given, which focuses on not just natural language input, but concurrent input from gesture and gaze as well. Previous work identified several situations where two-handed input would provide an advantage over one-handed input when dealing with issues of scale and visibility. With this in mind, a prototype system was developed that supported coverbal gestures with two-handed input, gaze and speech recognition. The results of this prototype revealed that the exploration and expansion of two-handed gestural input as an integral part of multimodal natural language input is eminently possible.

Mapping GUIs to Auditory Interfaces

Mynatt ED, Edwards WK. Mapping GUIs to Auditory Interfaces. Proceedings of the ACM Symposium on User Interface Software and Technology, Monterey California, ACM Press, November 15-18, 1992.

This paper describes the Mercator Project, an effort to provide transparent mappings between X Windows graphical interfaces and auditory interfaces. A framework for auditory interfaces is described along with the architecture used to convert graphical information into the auditory interface.

Mynatt ED, Johnson E. Multimodal Access to Graphical User Interfaces for People who are Blind. Technical Report, College of Computing, Georgia Institute of Technology, Atlanta, GA.

This paper describes the Mercator Project, an effort to provide transparent access to X Windows applications for computer users who are blind or severely visually impaired. The goal of the system architecture is to create and store a semantic model built from meaningful information captured from X applications. Specific issues related to multimodal access are discussed. These include the concept of Auditory Interface Components (AICs), which correspond roughly to widgets, and auditory navigation to allow users to scan the interface and operate on interface objects.

Mynatt ED, Weber G. Nonvisual Presentation of Graphical User Interfaces: Contrasting Two Approaches. Technical Report, College of Computing, Georgia Institute of Technology, Atlanta, GA.

Two contrasting designs that provide nonvisual access to graphical user interfaces are discussed - the Mercator Project and GUIB (Textual and Graphical User Interfaces for Blind People). Mercator replaces the spatial graphical interface with a hierarchical auditory interface that uses non-speech auditory cues to convey iconic information presented in the graphical user interface. GUIB continues the use of the spatial metaphor of the graphical interface based on the spatial location of objects on specialized I/O devices - a tactile pad and braille display.

The Mercator Project: Providing Access to Graphical User Interfaces for Computer Users Who Are Blind. Technical Report, College of Computing, Georgia Institute of Technology, Atlanta, GA.

An introduction to the Mercator Project is given. The goal is to provide access to X Windows applications for blind computer users. The approach is that while an unmodified graphical application is running, an outside agent collects information about the application interface and then translates this information into an auditory interface. A list of related publications, project members, and sponsorship is also given.

Mynatt ED. Auditory Presentation of Graphical User Interfaces. Technical Report, College of Computing, Georgia Institute of Technology, Atlanta, GA.

This paper presents work in the design of auditory interfaces to provide access to graphical user interfaces for people who are blind. An overview of the Mercator Project is given along with design strategies for auditory interfaces. Two major problems are addressed: monitoring the graphical interfaces of X Windows applications without modifying the existing applications, and providing a methodology for translating the graphical information into a nonvisual interface. Auditory output was chosen over a tactile interface, because it is thought that the auditory modality is closer to the capabilities of the visual interface. The Mercator interface translates information at the level of the interface components like menus, dialog boxes, and push buttons, instead of at the pixel level.

World Wide Web, Networks and Servers

House D. Spoken-Language Access to Multimedia (SLAM): A Multimodal Interface to the World-Wide Web. Master's Thesis, Dept. of Computer Science and Engineering, Oregon Graduate Institute, 1995.

A multimodal interface to the World-Wide Web was developed. Speech recognition and direct manipulation were used as complementary modalities. As a result, speech input allowed access to information that was not directly available with mouse-based systems. Also, speech recognition was handled by the Web server, not the client. However, HTML documents must be modified to make them speech-enabled. Future work includes adding speech support for commands, hot icons, and text-entry forms, not just for navigation. Additional work is needed to perform user studies that gather empirical evidence on the usability of this interface, and to support speech-only access for visually impaired users or for access by telephone.

Arons B. Tools for Building Asynchronous Servers to Support Speech and Audio Applications. Proceedings of the ACM Symposium on User Interface Software and Technology, Monterey California, ACM Press, November 15-18, 1992.

A software architecture is presented that can be used to support multimedia resources in a distributed client/server model. In such an environment, an application would communicate with multiple servers for resources like audio and speech recognition. A tool is described for rapidly prototyping distributed asynchronous servers and applications, with an emphasis on providing support for highly interactive user interfaces, temporal media, and multi-modal I/O.

Raman TV. Information of the NII is not just for Viewing! Technical Report, Digital Equipment Corp., Cambridge, MA, March 1995.

Arguments are given on how best to ensure accessibility to information repositories on the World Wide Web (WWW). Accessible WWW clients in themselves are not enough. Instead, information must be stored in display-independent formats such as HTML, rather than visual-presentation formats like PostScript.

Application Development Frameworks

Yankelovich N. SpeechActs and the Design of Speech Interfaces. Workshop on the Future of Speech and Audio in the Interface, ACM CHI'94 Conference on Human Factors in Computing Systems, Boston, MA, April 24-28, 1994.

The SpeechActs research project is focused on the creation of tools for building speech applications as well as the definition of effective speech user interfaces (SUIs). Several design principles used for SUI applications were introduced. Dialogs are designed so that the computer initiates the conversation, but does not take control of it. The system should maintain context to allow users to speak more naturally. Feedback must be complete, but also balanced with brevity. Finally, the user should be able to interrupt the synthesizer. Future directions include indicating to the user the boundaries of the functionality of the current applications, detecting user-initiated corrections, the design of context-sensitive help, and allowing users to interactively add vocabulary by speaking.

Yankelovich N. Talking vs. Taking: Speech Access to Remote Computers. ACM CHI'94 Conference Companion, Boston, MA, April 24-28, 1994.

This paper describes an effort to add remote speech-driven access to computers through the telephone. This work is an outgrowth of previous efforts which used a telephone with a touch-tone interface. The telephone is seen as an ideal interface which addresses two problems with accessing office applications: a lack of portability and a lack of remote access. A successful prototype application was developed for a calendar program to demonstrate proof of concept.

Yankelovich N, Baatz E. SpeechActs: A Framework for Building Speech Applications. Technical Report, Sun Microsystems Laboratories, Inc., Chelmsford, MA.

SpeechActs is a framework for building speech applications. It supports multiple speech recognizers and synthesizers along with a set of tools and APIs for the development and integration of these components into speech applications. The first goal of this effort is to supply general, robust speech capabilities to application writers in a fashion that minimizes the amount of speech-related skills they have to master. The second goal is to support multiple applications. The third goal is to use third-party tools whenever possible. A series of structured usability tests is planned to improve the quality of the application interfaces, create a set of interface guidelines, and collect qualitative data on users' reactions to interacting with a conversational system.

Yankelovich N, Levow GA, Marx M. Designing SpeechActs: Issues in Speech User Interfaces. ACM CHI'95 Conference on Human Factors in Computing Systems, Denver CO, May 7-11, 1995.

A number of challenging issues related to the development of speech interfaces are presented based on early work with SpeechActs. These include adhering to conversational conventions, such as avoiding explicit prompts for user input whenever possible and adding explicit clue phrases when changing subdialogs. It was also noted that existing graphical user interfaces would not transfer successfully to speech, so speech interfaces should be developed from scratch. Techniques were discussed for handling errors, such as progressive assistance and implicit verification. Finally, challenges related to the nature of speech, such as speed, persistence, and a lack of visual feedback, were discussed.

Martin P, Kehler A. SpeechActs: A Testbed for Continuous Speech Applications. Technical Report, Sun Microsystems Laboratories, Inc., Chelmsford, MA.

An overview of the SpeechActs system is given. It supports a variety of off-the-shelf speech recognizers and text-to-speech generators. The main goal is to produce a generalized interface scheme in which new components can be easily substituted. The essential components include third-party speech recognition and text-to-speech modules along with several internally developed tools. The natural language interpreter provides tools to easily develop natural language interfaces. The unified grammar compiler creates grammars which support a variety of third-party speech recognition systems. The discourse manager coordinates user interactions between a co-existing suite of different speech applications.