APPROVAL SHEET

Title of Dissertation: Speech Input in Multimodal Environments: 
Effects of Perceptual Structure on Speed, Accuracy, 
and Acceptance 
Name of Candidate: Michael A. Grasso
Doctor of Philosophy, 1997
Dissertation and Abstract Approved: Dr. Timothy W. Finin
Professor
Computer Science and Electrical Engineering 
Dissertation and Abstract Approved: Dr. David S. Ebert
Assistant Professor
Computer Science and Electrical Engineering 
Date Approved: May 12, 1997

 


Curriculum Vitae

Name: Michael A. Grasso

Degree and Date to be Conferred: Ph.D., 1997

Professional Publications:

Michael A. Grasso, David Ebert, Tim Finin. The Effect of Perceptual Structure on Multimodal Speech Recognition Interfaces. ACM Transactions on Computer-Human Interaction, under review.

Michael A. Grasso, David Ebert, Tim Finin. Acceptance of a Speech Interface for Biomedical Data Collection. 1997 AMIA Annual Fall Symposium, under review.

Michael A. Grasso and Tim Finin. Task Integration in Multimodal Speech Recognition Environments. Crossroads, 3(3):19-22, Spring 1997.

Michael A. Grasso. Speech Input in Multimodal Environments: A Proposal to Study the Effects of Reference Visibility, Reference Number, and Task Integration. Technical Report TR CS-96-09, University of Maryland Baltimore County, Department of Computer Science and Electrical Engineering, 1996.

Michael A. Grasso. Automated Speech Recognition in Medical Applications. M.D. Computing, 12(1):16-23, 1995.

Michael A. Grasso and Clare T. Grasso. Feasibility Study of Voice-Driven Data Collection in Animal Drug Toxicology Studies. Computers in Biology and Medicine, 24(4):289-294, 1994.

Professional Positions Held:
 

1988 - Present  President/Senior Computer Scientist.
Segue Biomedical Computing, Laurel, Maryland. 
1992 - Present  Instructor of Computer Science, Part-Time.
University of Maryland Baltimore County, Department of Continuing Education.
1991 - 1993  Instructor of Computer Science, Part-Time.
Howard Community College, Columbia, Maryland.
1987 - 1988  Senior Programmer/Analyst.
Program Resources, Inc., Annapolis, Maryland. 
1984 - 1985  Microbiology Technician and Computer Programmer.
Johns Hopkins University, Baltimore, Maryland. 
1981 - 1984  Medical Technologist and Microbiology Technician.
Part-time positions and internships held while completing undergraduate education.

 


Abstract

Title of Dissertation: Speech Input in Multimodal Environments:
Effects of Perceptual Structure on Speed, Accuracy,
and Acceptance
Michael A. Grasso, Doctor of Philosophy, 1997
Dissertation Directed By:  Dr. Timothy W. Finin, Professor
Computer Science and Electrical Engineering 

Dr. David S. Ebert, Assistant Professor
Computer Science and Electrical Engineering

A framework of complementary behavior has been identified which maintains that direct manipulation and speech interface modalities have reciprocal strengths and weaknesses. This suggests that user interface performance and acceptance may increase by adopting a multimodal approach that combines speech and direct manipulation. Based on this concept and the theory of perceptual structures, this work examined the hypothesis that the speed, accuracy, and acceptance of a multimodal speech and direct manipulation interface would increase when the modalities match the perceptual structure of the input attributes.

A software prototype to collect histopathology data was developed with two interfaces to test this hypothesis. The first interface used speech and direct manipulation in a way that did not match the perceptual structure of the attributes, while the second interface used speech and direct manipulation in a way that best matched the perceptual structure. A group of 20 clinical and veterinary pathologists evaluated the prototype in an experimental setting using a repeated measures design. The independent variables were interface order and task order, and the dependent variables were task completion time, speech errors, mouse errors, diagnosis errors, and user acceptance.

The results of this experiment support the hypothesis that the perceptual structure of an input task is an important consideration when designing multimodal computer interfaces. Task completion time improved by 22.5%, speech errors were reduced by 36%, and user acceptance increased by 6.7% with the computer interface that best matched the perceptual structure of the input attributes. Mouse errors increased slightly and diagnosis errors decreased slightly, but these changes were not statistically significant. There was no relationship between user acceptance and time, suggesting that speed is not the predominant factor in determining approval. User acceptance was related to speech recognition errors, suggesting that recognition accuracy is critical to user satisfaction. User acceptance was also shown to be related to domain errors, suggesting that the more domain expertise a person has, the more he or she will embrace the computer interface.

 


Speech Input in Multimodal Environments:

Effects of Perceptual Structure

on Speed, Accuracy, and Acceptance

by

Michael A. Grasso

Dissertation submitted to the faculty of the Graduate School
of the University of Maryland in partial fulfillment
of the requirements of the degree of
Doctor of Philosophy
1997

(c) Copyright Michael A. Grasso 1997

 


Dedication

To my parents, Silvio and Angela Grasso, who taught me to dream, and to my wife, Clare, for believing.

 


Acknowledgment

The completion of this dissertation was made possible through the support and cooperation of many individuals. Thanks to my advisors, Timothy W. Finin and David S. Ebert, who provided thoughtful guidance and encouragement through what seemed to be a never-ending process. Thanks also to Tulay Adali, Charles K. Nicholas, and Anthony F. Norcio, the other members of my committee, for helping me to understand the significance of this research with respect to medical informatics, software engineering, and human-computer interaction, respectively.

Several people provided technical assistance. Judy Fetters from the National Center for Toxicological Research (NCTR) consulted with me on the software prototype and in the selection of tissue slides. Alan Warbritton from NCTR scanned the slides for the prototype. Lowell Groninger, Greg Trafton, and Clare Grasso helped with the experiment design and statistical analysis. The support staff at Speech Systems, Inc. helped with grammar development and answered countless technical questions.

Finally, thanks to those who graciously participated in this study from the University of Maryland Medical Center, the Baltimore Veterans Affairs Medical Center, the Johns Hopkins Medical Institutions, and the Food and Drug Administration. Special thanks to Dr. John Strandberg, Dr. Michael Lipsky, and Dr. Jules Berman for helping to identify participants.

 


Table of Contents

1. Introduction
1.1 Speech Recognition Systems
1.1.1 Historical Perspective
1.1.2 Speaker Dependence
1.1.3 Continuity of Speech
1.1.4 Vocabulary Size
1.1.5 Human Factors of Speech Interfaces
1.2 Direct Manipulation
1.3 The Problem
1.4 Significance of this Study
1.5 Research Questions
2. Literature Survey
2.1 Multimodal Speech Recognition Interfaces
2.1.1 Multimodal Access to the World-Wide Web
2.1.2 Integrated Multimodal Interface
2.1.3 Multimodal Window Navigation
2.2 Reference Attributes
2.2.1 Reference Visibility
2.2.2 Vocabulary Size
2.3 Multimodal Input Tasks
2.3.1 Theory of Perceptual Structures
2.3.2 Integrality of Input Devices
2.3.3 Integrating Input Modalities
2.4 Motivations of Speech in Medical Informatics
2.4.1 Template-Based Reporting
2.4.2 Natural Language Processing
2.4.3 Speech in Multimodal Environments
2.4.4 Hands-Busy Data Collection
2.5 Data Collection in Animal Toxicology Studies
2.6 Preliminary Work
2.6.1 Materials
2.6.2 Methods
2.6.3 Results and Discussion
2.6.4 Conclusion
3. Methodology
3.1 Independent Variables
3.2 Dependent Variables
3.3 Subjects
3.4 Procedure
3.5 Materials
3.6 Statistical Analysis
3.7 Schedule and Deliverables
4. Experimental Results
4.1 Task Completion Times
4.2 Errors
4.3 Acceptability
4.4 Correlation
5. Discussion and Conclusion
5.1 Findings
5.2 Relationships
5.2.1 Baseline Interface versus Perceptually Structured Interface
5.2.2 Relationships to Task Completion Time
5.2.3 Relationships with Acceptability Index
5.3 Summary
5.4 Future Research Directions
5.5 Conclusion
6. Appendices
6.1 Sample Memorandum to Request for Volunteers
6.2 Pre-Experiment Questionnaire
6.3 Post-Experiment Questionnaire
6.4 Pathology Nomenclature
6.5 Perceptually Structured Interface Vocabulary
6.6 Baseline Interface Vocabulary
6.7 Perceptually Structured Interface Transcript
6.8 Baseline Interface Transcript
6.9 Task Completion Time Scores
6.10 Speech Errors
6.11 Mouse Errors
6.12 Diagnosis Errors
6.13 Acceptability Scores
7. References

 

 

 


List of Tables

Table 1: Complementary Strengths of Direct Manipulation and Speech
Table 2: Proposed Applications for Direct Manipulation and Speech
Table 3: Reference Attributes and Interface Tasks
Table 4: Integral and Separable Input Attributes
Table 5: Ratio of Written to Total Input
Table 6: Contrastive Pattern of Modality Use
Table 7: Predicted Modalities for Computer-Human Interface Improvements
Table 8: Possible Interface Combinations for the Software Prototype
Table 9: Adjective Pairs used in the User Acceptance Survey
Table 10: Subject Demographics
Table 11: Subject Groupings for the Experiment
Table 12: Tissue Slide Diagnoses
Table 13: Experimental Procedure
Table 14: Research Schedule
Table 15: Deliverables
Table 16: Times for the Baseline and Perceptually Structured Interfaces
Table 17: ANOVA for Baseline and Perceptually Structured Interfaces
Table 18: Single Factor ANOVA for Baseline Groups
Table 19: Single Factor ANOVA for Perceptually Structured Groups
Table 20: Single Factor ANOVA for Slide Group 1 Groups
Table 21: Single Factor ANOVA for Slide Group 2 Groups
Table 22: Baseline and Perceptually Structured Error Rates
Table 23: Two-Factor ANOVA for AI
Table 24: Pearson Correlation Coefficients for Dependent Variables

 


List Of Figures

Figure 1: Synergistic versus Integrated Interface Tasks
Figure 2: Sample Data Entry Screen
Figure 3: Comparison of Mean Task Completion Times
Figure 4: Comparison of Mean Errors
Figure 5: Comparison of Acceptability Index by Question
Figure 6: No Correlation Between Time and Acceptability Index
Figure 7: Correlation Between Average AI and Total Speech Errors
Figure 8: Correlation Between Average AI and Total Diagnosis Errors
