1. Introduction

For many applications, the human-computer interface has become a limiting factor. One such limitation is the demand for intuitive interfaces for non-technical users, a key obstacle to the widespread acceptance of computer automation [Landau, Norwich, and Evans 1989]. In addition, data entry has become the bottleneck of many applications in the field of medical informatics. This is due to hands-busy or eyes-busy restrictions during tasks such as patient care and microscopy.

An approach that addresses both of these limitations is to develop interface techniques using automated speech recognition. Speech is a natural form of communication that is pervasive, efficient, and can be used at a distance. However, widespread acceptance of speech as a human computer interface has yet to occur. This effort seeks to cultivate the speech modality by evaluating the use of speech in multimodal environments. To characterize the complementary behavior of speech and direct manipulation, several questions relating to the effects of reference visibility, reference predictability, reference number, and task integration are discussed. The specific focus of this effort is an empirical study of the effect of perceptual structure on the speed, accuracy, and acceptance of a multimodal speech and direct manipulation interface.

1.1 Speech Recognition Systems

Speech recognition systems provide computers with the ability to identify spoken words and phrases. Note that speech recognition focuses on word identification, not word understanding. The latter is part of natural language processing, which is a separate research area. This can be compared to entering characters into a computer using a keyboard. The computer can identify the characters which are typed. However, there is no implicit understanding by the computer as to what these characters mean.

1.1.1 Historical Perspective

The first speech recognition system was developed in 1952 on an analog computer using discrete speech to recognize the digits 0 through 9 with a speaker-dependent template matching algorithm [Davis, Biddulph, and Balashek 1953]. Recognition accuracy was reported to be 98%. Later that decade, a system with similar attributes was developed that recognized consonants and vowels [Dudley and Balashek 1958]. In the 1960s, research in speech recognition moved to digital computers, which became the basis for speech recognition technology to the present day [Lea 1993].

Despite rapid progress early on, limitations in computer architectures prevented any significant commercial speech recognition system development. Even though the data transfer rate of speech is only about 50 bits per second, the computational requirements associated with extracting this information are enormous. Over the last decade, a number of commercial systems have been successfully developed [Voice Processing Magazine 1993]. However, since true natural language processing is still several years away, a successful speech-driven system must allow for restrictions in the current technology. These restrictions include speaker dependence, continuity of speech, and vocabulary size [Bergeron and Locke 1990; Peacocke and Graf 1990].

1.1.2 Speaker Dependence

Speaker-dependent systems are those which require some type of user training before they can be put to use. Speech recognition systems typically use a pattern matching algorithm, where the spoken words are compared with predefined templates to find the best match. Before this can occur, the user must create templates by saying each word in the vocabulary two or three times. Representative word phrases may also be read aloud, to identify how certain words will be spoken in context. A speech model consists of all the templates for a given vocabulary. Each operator of a speaker-dependent system must create a speech model by training the system to recognize his or her way of saying every word in the vocabulary. Depending on the vocabulary size, training can take from a few minutes to several hours.

Speaker-independent systems use generic models to recognize speech from any user. Generic models are created by combining existing templates from a variety of speakers. This approach is advantageous in that it does not require individual operators to train the system to recognize their voices. However, because the templates are not user-specific, accuracy rates are usually lower.

An alternative is the speaker-adaptive approach, which uses a generic model to eliminate initial training and then automatically generates user-specific models for each operator over time. Although initial training is eliminated, recognition accuracy is diminished until the system develops an adequate user-specific model.

1.1.3 Continuity of Speech

Continuous speech systems can recognize words spoken in a natural rhythm. Although this approach seems more desirable at first glance, continuous speech is harder to process because of the difficulty of identifying word boundaries - as in "youth in Asia" and "euthanasia." Variability in articulation, such as the tendency to drop consonants or blur distinctions between them - as in "want it" and "wanted" - can result in further misunderstanding. To increase accuracy, speech models for continuous speech systems include information on representative word combinations and context rules.

Isolated word systems require a deliberate pause between each word. Pausing after each word is unnatural and can be tiring. However, accuracy rates are usually higher with isolated word systems than with systems using continuous speech. Isolated systems are therefore thought to work best with vocabularies that consist mainly of individual command words.

1.1.4 Vocabulary Size

The vocabularies of various speech recognition systems can range from 20 to more than 40,000 words. Large vocabularies cause difficulties in maintaining accuracy, but small vocabularies restrict the speaker. In addition, large vocabularies are likely to contain ambiguous words, which in speech recognition systems are words with pattern-matching templates that the computer will treat as similar - such as the words "tree" and "three."

Grammar rules can be added to impose constraints on the allowable sequences of words. These are especially important to offset technical limitations due to continuous speech or large vocabularies. A tightly constrained grammar is one in which only a small number of words can legally follow any given word, based on context of phrase structure. Keeping the list of candidate words small can increase recognition accuracy and decrease latency time during pattern matching, especially with large vocabularies. However, too many grammar rules can reduce the naturalness of communication.

1.1.5 Human Factors of Speech Interfaces

Along with technical characteristics of speech recognition systems, it is important to understand the human factors of speech as an interface modality. A criticism by Newell is that some researchers act as if the only bars to widespread adoption of speech interfaces are these technical limitations. Only occasional consideration is given to dialog design and other aspects necessary for an effective and efficient human interface [Newell 1992]. These comments highlight the importance of studying speech recognition interfaces as a human-computer interaction problem.

Speech is a unique modality with several profound qualitative differences from traditional user interface channels. The most significant is that speech is temporary. Once uttered, auditory information is no longer available to the user. This may place extra memory burdens on the user and severely limit the ability to scan, review and cross-reference information. A related limitation is that it is hard to represent spatial information, since the fleeting nature of speech makes it difficult to observe and manipulate the relative position of objects.

Speech can be used at a distance which makes it ideal for hands-busy and eyes-busy situations. It is omnidirectional and therefore can communicate with multiple users. However, this has implications related to privacy, security and may add to environmental noise in the workplace.

Finally, more than other modalities, there is the possibility of anthropomorphism when using speech recognition. It has been documented that users tend to overestimate the capabilities of a system if a speech interface is used and that users are more tempted to treat the device as another person [Jones, Hapeshi, and Frankish 1990].

1.2 Direct Manipulation

Direct manipulation interfaces, made popular by the Apple Macintosh and Microsoft Windows graphical user interfaces, are based on a number of principles [Shneiderman 1993].

Visual display of objects of interest.
Selection by pointing, instead of typing.
Rapid, incremental, and reversible actions.
Immediate and continuous feedback of results and actions.

The display in a direct manipulation interface should indicate a complete image of the application's environment, including its current state, what errors have occurred, and what actions are appropriate. A virtual representation of reality is created, which can be manipulated by the user. For example, the typical word processor today can display a document in its final form with fonts, graphics, and other characteristics exactly as they will appear when printed. Another example is the file manager that displays directories as a tree structure and files as icons.

In a direct manipulation environment, the computer is operated by direct engagement with the user interface. The commands themselves are physical actions, such as pointing, clicking, dragging, and sliding. For example, to delete a file, the user points to its icon and drags it to the trash can. Once the file is deleted, the user is given immediate confirmation by the fact that the file icon is no longer on the screen or that the trash can now appears to have something in it.

This approach has several potential advantages. The direct manipulation interface is based on intuitive metaphors with a consistent look-and-feel that enhances a user's ability to learn another program quickly. A hierarchy of menus makes available options clear and minimizes the need to learn cryptic command languages. Users can immediately see the results of their actions, making error detection more natural and minimizing the need for error messages. Finally, users may gain more confidence and are more in control since they initiate commands by physical actions.

In contrast to this, arguments have been made that direct manipulation interfaces are inadequate for supporting transactions fundamental to applications such as word processing, CAD, and database queries [Buxton 1993; Cohen and Oviatt 1994]. These comments were made in reference to the limited means of object identification and how the non-declarative aspects of direct manipulation can result in an interface that is too low-level. Shneiderman [1993] points to ambiguity in the meanings of icons and limitations in screen display space as problems with direct manipulation.

1.3 The Problem

It has been suggested that direct manipulation and speech recognition interfaces have complementary strengths and weaknesses which could be leveraged in multimodal user interfaces [Cohen and Oviatt 1994; House 1995; Cohen 1992]. By combining the two modalities, the strengths of one could be used to offset the weaknesses of the other. For simplicity, speech recognition will deal with the identification of spoken words, not necessarily natural language recognition, and direct manipulation will deal with mouse input.

The complementary advantages of direct manipulation and speech recognition are summarized in Table 1. Note that the advantages of one are the weaknesses of the other. For example, direct engagement provides an interactive environment which is thought to result in increased user acceptance and allow the computer to become transparent as users concentrate on their tasks [Shneiderman 1983]. However, the computer can only become totally transparent if the interface allows hands-free and eyes-free operation. Speech recognition interfaces provide this, but intuitive physical actions no longer drive the interface.

*Table 1*: Complementary Strengths of Direct Manipulation and Speech
*Direct Manipulation*	*Speech Recognition*
Direct engagement Simple, intuitive actions Consistent look and feel No reference ambiguity	Hands/eyes free operation Complex actions possible Reference does not depend on location Multiple ways to refer to entities

One of the key strengths of direct manipulation is that these physical commands are based on simple actions. One example of this are visual database interfaces based on the direct manipulation modality. An early example of a visual database interface is Query-by-Example [Zloof 1977], developed at IBM. Such interfaces rely on visual representations of the database structure, possibly with sliders and other mouse-driven interface objects to input query information [Ahlberg, Williamson, and Shneiderman 1992]. However, this method works best with databases consisting of well-formed ordinal data. Since the interface is directly tied to the actual underlying format of the database, it is considered too low level [Cohen and Oviatt 1994]. In contrast to this, the declarative nature of speech recognition interfaces and their ability to use anaphoric references should make them more appropriate for complex actions.

The consistent look and feel of direct manipulation interfaces is believed to provide a foundation for allowing novices to learn the basic functionality of these programs quickly by generalizing the commonality between applications. The limitation of this approach is its increased dependence on the visual display of information. When there are only a few interface objects, it is easy to arrange them in a consistent manner. However, this approach quickly breaks down when there are dozens of interface objects to manipulate. Speech interfaces do not have such limitations, but the abstract characteristic of speech makes it difficult to employ the concept of look-and-feel in the same way.

Direct manipulation interfaces do not have problems with reference ambiguity. When the user selects an object, the computer will not misinterpret this selection as some other object. The down side to this is that there is only one way to reference an object. A problem with direct manipulation stated earlier is that not all objects have easily distinguishable references. In other words, while selecting an object is unambiguous to the computer, the actual meanings of these references may be obscure to the user. Speech interfaces have the opposite characteristic. Since objects can be referred to in multiple ways, the meanings of various references should be less ambiguous to the user. However, due to recognition errors or grammar limitations, there is a greater chance the computer may not recognize this reference correctly.

Taking these observations into account, Cohen and Oviatt [1994] made the following statement with respect to the complementary benefits of direct manipulation and natural language. Note that this dissertation deals with the identification of words through speech recognition, not necessarily natural language interaction.

Theoretically, direct manipulation should be beneficial when the objects to be manipulated are on the screen, their identity is known, and there are not too many objects from which to select. Natural language interaction with computers offers potential benefits when users need to identify objects, actions, and events from sets too large to be displayed and/or examined individually and when users need to invoke actions at future times that must be described.

For example, direct manipulation interfaces are believed to be best used for specifying simple actions when all references are visible and the number of references are limited, while speech recognition interfaces are better at specifying more complex actions when references are numerous and not visible. This is summarized in Table 2.

*Table 2*: Proposed Applications for Direct Manipulation and Speech
*Direct Manipulation*	*Speech Recognition*
Visible References Limited References Simple Actions	Non-Visible References Multiple References Complex Actions

Based on these observations, a series of questions have been proposed to evaluate the effect of reference visibility, reference number, and task integration on the speed, accuracy and acceptance of direct manipulation and speech recognition systems. Such empirical results can be used to assist with the integration of speech with direction manipulation in multimodal environments. Due to time constraints, only the question on task integration was evaluated as part of this dissertation.

Relying on anecdotal arguments, expected results are that simple actions on a limited number of visible references would favor direct manipulation and complex actions on numerous, non-visible references would favor speech recognition. Intuitively, it is clear that direct manipulation interfaces are adversely affected by references which are not visible, since you must be able to see a reference in order to select it. In the same way, it is clear that speech recognition systems do not have this limitation, since any item can be referenced regardless of whether it is visible or not. Also, the declarative nature of speech recognition interfaces should allow the specification of more complex operations. However, this dissertation hypothesizes that this model of complementary behavior is only true under certain conditions related to the characteristics of the reference attributes and the type of interface task.

These original observations focused mainly on reference visibility. There are other attributes that may impact the speed, accuracy, and acceptance of both direct manipulation and speech recognition interfaces. The number of references is alluded to, however, only in the context of limiting visibility, such as when there are so many references that they all cannot be visible at the same time.

Also, regardless of reference attributes, the speed, accuracy, and acceptance may be impacted by how well the control structure of the input device matches the perceptual structure of the input task (whether the input attributes are perceived as integral or separable). It was reported that the performance of a unimodal, graphical interface improves when the structure of the perceptual space matches the control space of the input device [Jacob et al. 1994]. An appropriate follow-on question - and the focus of this study - is the effect of perceptual structure on multimodal tasks. A summary of reference attributes and interface tasks is in Table 3.

*Table 3*: Reference Attributes and Interface Tasks
*Reference Attributes*
Visible	The references are directly observable by the user and not obscured by other screen objects.
Numerous	The number of valid references available to the user are many.
Predictable	The references are sorted or otherwise familiar to the user.
Distinguishable	The references can be easily differentiated from each other.
*Interface Tasks*
Integral	The input attributes cannot be attended to individually.
Simple	The task is implicit based on reference selection.
Spatial	The task is based on dimensional input.
Declarative	The task requires a description or anaphoric reference.
Computational	The task requires the input of numbers or formulas.

1.4 Significance of this Study

There are three areas in which this research will contribute in a significant way to the understanding of speech recognition interfaces in human-computer interaction.

Replace anecdotal arguments with scientific evidence.
Identify situations where speech is the preferred modality.
Increase our understanding of speech in multimodal environments.
Address the data entry bottleneck in medical informatics.

The literature is filled with anecdotal arguments about the applicability of speech recognition interface. Shneiderman [1992] points out four such areas: when the hands are busy, the eyes are busy, mobility is required, and in harsh environments. Cohen and Oviatt [1994] suggest a similar set of conditions: when the user's hands or eyes are busy, only a limited keyboard or screen is available, the user is disabled, pronunciation is the subject matter of computer use, and when natural language interaction is preferred.

The first area where this research will contribute is to help replace these anecdotal arguments on the applicability of speech and the complementary advantages of direct manipulation and speech recognition with scientific evidence. Such a framework for research in human-computer interaction has been identified by Shneiderman [1993] as a foundational approach. By emphasizing controlled experiments which yield more objective and reliable results, arguments about "user friendly systems" are replaced with a more scientific approach.

The second area where this research will contribute is by identifying those situations where speech is the preferred interface modality. Note that the anecdotal arguments on the applicability of speech, while intuitive, have a particular bias. That is, they imply that speech is always a second choice that is only appropriate when traditional keyboard and screen interfaces are impractical. While acknowledging this bias, Bradford [1995] states that there are almost certainly applications where speech is the more natural medium and calls for comparative studies to determine where and when speech functions most effectively as a user interface. Cohen and Oviatt [1994] state that no principled methods exist which can predict those circumstances where speech will be the most effective, efficient, or the preferred interface modality. Still others point out that there is still a lack of theoretical work and empirical results [Carbonell 1994], and the need for rigorous scientific investigation into the applicability of speech as an interface medium [Damper 1993].

The third area where this research will contribute is by increasing our understanding as to when and under what conditions speech can be integrated with mouse input in multimodal environments. Cole et al. [1995] note the role that spoken language should ultimately play in multimodal systems is not well understood and calls for the development of theoretical models from which predictions can be made about the strengths, weaknesses, and overall performance of different types of unimodal and multimodal systems. The focus of this research is user perception of the input task based on the theory of perceptual structures. Such research is needed to understand how people select and integrate different modalities in the context of different types of human-computer interaction [Oviatt and Olsen 1994].

The objective of this dissertation was to the study the effect of the perceptual structure of multidimensional input tasks on the speed, accuracy and acceptance of multimodal direct manipulation and speech recognition systems. Such empirical results can be used to assist with the integration of speech in multimodal environments.

The fourth area where this research will contribute is by addressing the data entry bottleneck in medical informatics [Grasso and Grasso 1994; Dillon, McDowell, Norcio, DeHaemer 1994; McMillan and Harris 1990]. Histopathologic data collection in animal toxicology studies was chosen as the application domain for user testing. It includes several significant hands-busy and eyes-busy restrictions. It is based on a highly structured, specialized, and moderately sized vocabulary based on an accepted medical nomenclature. These and other characteristics make it a prototypical data collection task, similar to those required in biomedical research and clinical trials. Also, the input tasks mainly involve reference identification, with little declarative, spatial, or computational data entry required, which should eliminate any built-in bias toward either modality.

1.5 Research Questions

The three proposed studies are based on the following three research questions. Included with each research question is a summary of the literature review from Section 2, predicted results, and null hypotheses for statistical evaluation.

Only question number one on task integration has been studied as part of this doctoral research project. The other two questions were discussed, but not studied. This was to ensure that this research effort was completed in a reasonable amount of time.

Question 1

What multidimensional tasks can best be integrated with multimodal speech and direct manipulation?

Literature

The performance of multidimensional, unimodal input tasks is affected by whether the dimensions are perceived as integral or separable. Users are more likely to switch from one modality to another when there is a change in functionality or context.

Predicted Results

The speed, accuracy, and acceptance of multidimensional, multimodal input will increase when the attributes of the task are perceived as separable, and for unimodal input will increase when the attributes are perceived as integral.

Null Hypothesis 1

The integrality of input attributes has no effect on the speed of the user.

Null Hypothesis 2

The integrality of input attributes has no effect on the accuracy of the user.

Null Hypothesis 3

The integrality of input attributes has no effect on acceptance by the user.

Question 2

How does the lack of visible references affect the speed, accuracy, and acceptance of speech and direct manipulation interfaces?

Literature

Direct manipulation interfaces perform better with visible references while speech interfaces perform better with non-visible references.

Predicted Results

Decreasing visibility has a negative impact on the speed, accuracy, and acceptance of both direct manipulation and speech interfaces. The negative impact on speech interfaces is greater than or equal to that of direct manipulation, except when those references have a high degree of predictability.

Null Hypothesis 4

Reference visibility has no effect on the speed of the user.

Null Hypothesis 5

Reference visibility has no effect on the accuracy of the user.

Null Hypothesis 6

Reference visibility has no effect on acceptance by the user.

Question 3

How does increasing the number of references affect the speed, accuracy, and acceptance of speech and direct manipulation interfaces?

Literature

Direct manipulation interfaces perform better with fewer references while speech interfaces perform better when there are numerous references.

Predicted Results

Increasing the number of references has a negative impact on the speed, accuracy, and acceptance of both direct manipulation and speech interfaces. The negative impact on speech interfaces is greater than or equal to that of direct manipulation, except when those references have a high degree of predictability.

Null Hypothesis 7

The number of references has no effect on the speed of the user.

Null Hypothesis 8

The number of references has no effect on the accuracy of the user.

Null Hypothesis 9

The number of references has no effect on acceptance by the user.

Go to next section.