
5. Discussion and Conclusion

The results of this experiment support the hypothesis that the perceptual structure of an input task is an important consideration when designing multimodal computer interfaces. For multimodal speech and direct manipulation biomedical interfaces, the speed, accuracy, and acceptance of multidimensional input tasks improved when the attributes were perceived as separable. For unimodal interfaces, speed, accuracy, and acceptance improved when the inputs were perceived as integral. This chapter reviews the research findings, identifies possible relationships, summarizes the results, and outlines future research directions.

5.1 Findings

Three null hypotheses were identified before the study began. Two were rejected in favor of the predicted results; the third was rejected only in part.

The first null hypothesis stated: (H1_0) The integrality of input attributes has no effect on the speed of the user. As reported in Section 4.1, a significant improvement in task completion time was observed when integral input attributes used the same modality and separable attributes used different modalities. The improvement in total time was 41.468 seconds, or about 22.5% (t(19) = 4.791, p < .001, two-tailed). Of the 20 participants, 18 saw improvement with the perceptually structured interface. Strengthening this finding was a significant ANOVA result indicating that times from the baseline and perceptually structured groups came from different populations. ANOVA also showed that interface order (baseline, perceptually structured) and task order (slide group 1, slide group 2) had no significant effect on the results. The null hypothesis was rejected in support of an alternate hypothesis based on the predicted results: (H1_A) The speed of multidimensional, multimodal interfaces will increase when the attributes of the task are perceived as separable, and for unimodal interfaces will increase when the attributes of the task are perceived as integral.
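The paired comparison described above can be illustrated with a short sketch. The fragment below is not part of the original analysis: the per-subject times are synthetic placeholders drawn to resemble the reported magnitudes, and scipy's ttest_rel merely stands in for whatever statistical package was actually used.

    # Hypothetical sketch of the paired timing comparison; the values are
    # synthetic placeholders, not the study's data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # One total task-completion time (seconds) per subject and interface.
    baseline_times = rng.normal(loc=184.0, scale=40.0, size=20)
    structured_times = baseline_times - rng.normal(loc=41.5, scale=30.0, size=20)

    # Paired (repeated-measures) t-test across the 20 subjects.
    t_stat, p_value = stats.ttest_rel(baseline_times, structured_times)

    mean_diff = np.mean(baseline_times - structured_times)
    pct_improvement = 100.0 * mean_diff / np.mean(baseline_times)

    print(f"mean improvement: {mean_diff:.1f} s ({pct_improvement:.1f}%)")
    print(f"paired t(19) = {t_stat:.3f}, p = {p_value:.4f}")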

The second null hypothesis stated: (H2_0) The integrality of input attributes has no effect on the accuracy of the user. As reported in Section 4.2, there were 1.95 fewer speech errors with the perceptually structured group, a 36% improvement, and 16 of the 20 subjects made fewer errors using the perceptually structured interface. The reduction in speech errors was significant (paired t(19) = 2.924, p < .01, two-tailed). Mouse errors were slightly lower with the baseline group and diagnosis errors were slightly lower with the perceptually structured group, but neither difference was significant.

Mouse errors may not have followed the predicted results simply because so few were recorded. Across all subjects, there were only 16 mouse errors compared to 175 speech errors. A mouse error was recorded only when a subject clicked on the wrong item in a list and later changed his or her mind, which was a rare event.

There were 77 diagnosis errors, but these also did not follow the predicted results. Diagnosis errors were really a measure of a subject's expertise in identifying tissue types and reactions. The findings suggest that there is no relationship between the perceptual structure of the input task and the user's ability to apply domain expertise. However, this cannot be concluded with confidence, because the experiment was deliberately designed not to measure that ability: subjects were allowed to review the tissue slides before the actual test.

The null hypothesis was accepted in part: (H2'_0) The integrality of input attributes has no effect on the accuracy of the user with respect to mouse errors and the application of domain expertise. The null hypothesis was rejected with respect to speech errors in support of the modified alternate hypothesis: (H2'_A) With respect to speech input, the accuracy of multidimensional, multimodal interfaces will increase when the attributes of the task are perceived as separable, and for unimodal tasks will increase when the attributes of the task are perceived as integral.

The third null hypothesis stated: (H3_0) The integrality of input attributes has no effect on acceptance by the user. As reported in Section 4.3, once the outlier was removed, the overall acceptability index (AI) was 3.97 for the baseline group and 3.70 for the perceptually structured group. Since a lower AI corresponds to higher acceptance, this was a moderate improvement of 6.7%, which was significant (2x13 ANOVA, p < .05). The null hypothesis was rejected in support of predicted results based on the alternate hypothesis: (H3_A) The acceptance of multidimensional, multimodal interfaces will increase when the attributes of the task are perceived as separable, and for unimodal tasks will increase when the attributes of the task are perceived as integral.
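As a rough illustration, the percentage improvement follows directly from the two group means, and an interface-by-question repeated-measures ANOVA of the kind reported above could be reproduced along the lines of the sketch below. The long-format layout, the 19 retained subjects, and all rating values are assumptions made for illustration, not the study's data; statsmodels' AnovaRM stands in for the analysis package actually used.

    # Hypothetical sketch of the 2 x 13 repeated-measures analysis of the
    # acceptability index (AI); all ratings below are synthetic placeholders.
    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Improvement implied by the two reported group means (lower AI is better);
    # the dissertation's 6.7% presumably reflects unrounded means.
    baseline_ai, structured_ai = 3.97, 3.70
    print(f"AI improvement: {100 * (baseline_ai - structured_ai) / baseline_ai:.1f}%")

    # Long-format table: one rating per subject x interface x questionnaire item.
    rng = np.random.default_rng(1)
    rows = []
    for subject in range(19):  # 20 subjects minus the removed outlier
        for interface, mean in [("baseline", baseline_ai),
                                ("structured", structured_ai)]:
            for question in range(13):
                rows.append({"subject": subject, "interface": interface,
                             "question": question,
                             "rating": rng.normal(mean, 0.8)})
    ratings = pd.DataFrame(rows)

    # Two-way repeated-measures ANOVA: interface x question.
    print(AnovaRM(ratings, depvar="rating", subject="subject",
                  within=["interface", "question"]).fit())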

One difficult aspect of collecting subjective data on user acceptance was that the prototype being tested was not a complete system. Subjects could view tissue slides on the screen, but were limited in other ways. The prototype allowed only one visual plane of the original slide to be examined, while pathologists typically require four such images to make a diagnosis. The zoom feature was limited. Also, the 800x600 TFT panel was not an ideal display for viewing the detail required to make diagnoses. A more complete system was developed, as described under Preliminary Work in Section 2.6. While these and other features could have been added to the prototype, they might have interfered with the independent variables being manipulated in the actual experiment. Nevertheless, some subjects seemed to find it difficult to subjectively evaluate a software prototype with limited functionality.

Another difficulty was with speech recognition accuracy. In informal testing during software development, the PE500+ accuracy rate was greater than 95% in push-to-speak mode but only about 80% in voice-activated mode. During the actual experiment, accuracy was 53% for the baseline interface and 64% for the perceptually structured interface. As described earlier, the voice-activated mode was used to avoid unwanted side effects from pressing a button to speak. However, the decreased accuracy frustrated some of the subjects, one of whom compared it to "yelling at your three-year-old: it doesn't always work."

The perceptual structure of the input attributes used in this experiment may have been more subjective than originally anticipated. While most subjects who stated a preference selected the perceptually structured interface, some selected the baseline interface. In written comments, they viewed the morphology as the main term, with the site and the qualifier both modifying it. Under that interpretation, the baseline interface actually becomes the more perceptually structured one, since it uses separate modalities for the QM and SM input tasks and a single modality for SQ.

5.2 Relationships

The Pearson correlation coefficients, shown in Table 24, reveal possible relationships between the dependent variables. The following discussion examines why such relationships may exist.
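Table 24 is not reproduced here, but a correlation matrix of this kind is straightforward to compute. The sketch below assumes one row per subject with columns for time, the three error counts, and the acceptability index; the values are synthetic placeholders, and pandas/scipy simply stand in for the statistics package actually used.

    # Hypothetical sketch of the Pearson correlations between dependent
    # variables; all per-subject values are synthetic placeholders.
    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 20  # subjects

    measures = pd.DataFrame({
        "time": rng.normal(184, 40, n),
        "speech_errors": rng.poisson(5, n),
        "mouse_errors": rng.poisson(0.5, n),
        "diagnosis_errors": rng.poisson(2, n),
        "acceptability_index": rng.normal(3.8, 0.5, n),
    })

    # Full Pearson correlation matrix, analogous to Table 24.
    print(measures.corr(method="pearson").round(2))

    # Significance test for a single pair, e.g. time versus speech errors.
    r, p = stats.pearsonr(measures["time"], measures["speech_errors"])
    print(f"time vs. speech errors: r = {r:.2f}, p = {p:.3f}")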

5.2.1 Baseline Interface versus Perceptually Structured Interface

The positive correlation of time between the baseline interface and the perceptually structured interface was anticipated: a subject who works slowly (or quickly) will likely do so regardless of the interface. The positive correlation of diagnosis errors between the baseline and perceptually structured interfaces suggests that a subject's ability to apply domain knowledge was not affected by the interface. Again, this probably reflects the fact that subjects were allowed to review the slides before the actual test. The lack of correlation for mouse errors makes sense, since very few mouse errors were recorded.

The lack of correlation for speech errors was notable. A positive correlation would imply that a subject who made errors with one interface was predisposed to making errors with the other. The absence of a correlation is consistent with the finding that users were more likely to make speech errors with the baseline interface, which did not match the perceptual structure of the input task.

5.2.2 Relationships to Task Completion Time

One would expect an increase in speech errors to result in an increase in task completion time, since it takes time to correct errors. Two of the coefficients in this group showed a significant positive correlation: time versus speech errors for the perceptually structured interface and time versus speech errors for both interfaces. The other two showed a positive correlation that approached, but did not reach, significance.

Again, one would expect an increase in mouse errors to result in an increase in task completion time. Two of the coefficients in this group showed a significant positive correlation and two did not. However, because so few mouse errors were recorded, nothing was inferred from these results.

No correlation was observed between task completion time and diagnosis errors. Normally, one could assume that a lack of domain knowledge would lead to a higher task completion time. For this experiment, subjects were allowed to review slides before the actual test. This was to ensure that the experiment was measuring data entry time and other attributes of user interface performance, and not the ability of participants to read tissue slides. Finding no correlation suggests that this goal was accomplished.

No correlation was observed between task completion time and the acceptability index. This result was similar to that of Dillon [1995], who saw no correlation between time and acceptance except with expert users. Unlike Dillon, however, additional analysis here found no correlation between time and acceptance even among expert users. This is not necessarily a contradiction, because the two studies identified experts in different ways. Dillon classified a subject as an expert or novice retrospectively, based on the person's work experience and education, and treated expertise as an independent variable. In contrast, this dissertation treated expertise as a dependent variable and measured it prospectively, with expertise inversely proportional to the number of domain errors observed during the experiment.
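One way to make that prospective measure concrete is sketched below. The inverse-count scoring and the median split used to define "experts" are illustrative assumptions, as are the per-subject values; the dissertation does not specify this exact formulation.

    # Hypothetical sketch of a prospective expertise measure and the
    # expert-subgroup check; the per-subject values are synthetic placeholders.
    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(3)
    subjects = pd.DataFrame({
        "time": rng.normal(184, 40, 20),
        "acceptability_index": rng.normal(3.8, 0.5, 20),
        "diagnosis_errors": rng.poisson(2, 20),
    })

    # Expertise inversely proportional to observed domain (diagnosis) errors.
    subjects["expertise"] = 1.0 / (1.0 + subjects["diagnosis_errors"])

    # Assume "experts" are subjects at or below the median error count.
    experts = subjects[subjects["diagnosis_errors"]
                       <= subjects["diagnosis_errors"].median()]

    r, p = stats.pearsonr(experts["time"], experts["acceptability_index"])
    print(f"experts: time vs. acceptability index, r = {r:.2f}, p = {p:.3f}")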

5.2.3 Relationships with Acceptability Index

Between the acceptability index and speech errors, a significant positive correlation was observed for two of the four groups. This suggests that an increase in speech errors increases the likelihood that the user will be displeased with the interface. No correlation was found between the acceptability index and mouse errors; again, this was most likely due to the small number of recorded mouse errors. Recall that for the acceptability index, a lower score corresponds to higher user acceptance.

A significant positive correlation was observed between the acceptability index and diagnosis errors. Three of the four coefficients showed this correlation, and the fourth was close to significance. This finding suggests that the more domain expertise a person has, the more likely he or she is to approve of the computer interface.

5.3 Summary

A research hypothesis was proposed for multimodal speech and direct manipulation biomedical interfaces. It stated that multimodal multidimensional interfaces work best when the input attributes are perceived as separable, and that unimodal multidimensional interfaces work best when the inputs are perceived as integral. This was based on previous research that extended the theory of perceptual structure [Garner 1972] to show that performance of multidimensional, unimodal, graphical environments improves when the structure of the perceptual space matches the control space of the input device [Jacob et al. 1994]. Also influencing this dissertation was the finding that contrastive functionality can drive a user's preference of input devices in multimodal interfaces [Oviatt and Olsen 1994] and the framework for complementary behavior between natural language and direct manipulation [Cohen 1992].

The results of this experiment support the hypothesis when using a multimodal interface on multidimensional biomedical tasks. Task completion speed, accuracy, and user acceptance all improved when a single modality was used to enter attributes that were integral and two modalities were used to enter attributes that were separable. A software prototype was developed with two interfaces to test this hypothesis. The first was a baseline interface that used speech and mouse input in a way that did not match the perceptual structure of the attributes, while the second interface used speech and mouse input in a way that best matched the perceptual structure.

A group of 20 clinical and veterinary pathologists evaluated the interfaces in an experimental setting, where data on task completion time, speech errors, mouse errors, diagnosis errors, and user acceptance were collected. Task completion time improved by 22.5%, speech errors were reduced by 36%, and user acceptance increased by 6.7% for the interface that best matched the perceptual structure of the attributes. Mouse errors decreased slightly and diagnosis errors increased slightly for the baseline interface, but these differences were not statistically significant. There was no relationship between user acceptance and time, suggesting that speed is not the predominant factor in determining approval. User acceptance was shown to be related to speech recognition errors, suggesting that recognition accuracy is crucial to user satisfaction. User acceptance was also shown to be related to domain errors, suggesting that the more domain expertise a person has, the more he or she will embrace the computer interface.

5.4 Future Research Directions

Additional studies on domain expertise and on minimizing speech errors would be helpful. This effort successfully reduced the rate of speech errors by applying principles based on perceptual structure; others have reported a reduction in spoken disfluencies by applying other user interface techniques [Oviatt 1996]. Given the strong relationship between user acceptance and domain expertise, additional research on how to build domain knowledge into the user interface would also be worthwhile.

As outlined under Research Questions, in Section 1.5, other experimental studies were proposed to further evaluate the framework for complementary behavior between speech and direct manipulation. This includes studies on the effect of the reference number, reference predictability, and reference visibility on the speed, accuracy, and acceptance of speech and direct manipulation interfaces. Additional research on speech input in multimodal environments, like this study, would also be of interest.

Preliminary work, described in Section 2.6, listed several areas of future research for speech interfaces in hands-busy, eyes-busy biomedical environments. Some of these, such as reducing speech training requirements, are being addressed by new technology. A key area warranting further study is how to improve audible feedback in eyes-busy tasks to reduce dependence on visual displays. A possible research goal could be to develop a fully functional speech-driven system incorporating results from the preliminary study and this dissertation that can be evaluated in production environments.

5.5 Conclusion

In conclusion, this study demonstrated that matching a multidimensional multimodal interface to the perceptual structure of the input attributes can increase the performance, accuracy, and user acceptance of the interface. User acceptance was influenced more by accuracy than speed. In addition, factors unrelated to the software itself affected acceptance, such as the level of domain expertise. It is hoped that these empirical results add to our understanding of how best to incorporate speech into multimodal environments and help in the development of systems to collect and manage biomedical information.
