
Do You Understand Me? The Impact of Speech Recognition Errors on Learning Gains

Sidney K. D’Mello, Brandon King, O’Meed Entezari, Patrick Chipman, and Arthur Graesser

Institute for Intelligent Systems, University of Memphis

sdmello@memphis.edu

Abstract

The paper presents two studies where participants learned computer literacy by having a spoken conversation with AutoTutor, an intelligent tutoring system with conversational dialogue. In the first study, 30 students completed a pre-test, a 35-minute training session, and a post-test. The results revealed that significant learning occurred despite considerable speech recognition errors. The second study utilized a repeated-measures design where 23 participants completed a pre-test, a 3-part tutorial session (speech input, text input, no intervention), and a post-test. Learning gains for the speech and typed conditions were on par and significantly higher than the control. Speech recognition error rates were not correlated with learning. We address the implications of our findings for learning environments that support spoken student responses.

Introduction

The field of human-computer interaction may be radically transformed by computer systems equipped with robust automatic speech recognition (ASR) for speech-to-text transcription, coupled with effective natural language processing techniques that link the recognized text to specific computer actions. Since humans primarily communicate through speech and a host of non-verbal cues, computer systems that are able to recognize and respond to these communication channels will presumably provide a more effective, meaningful, and naturalistic interaction experience.

One type of computer technology that would greatly benefit from verbal input is intelligent tutoring systems (ITSs) that promote one-on-one tutoring. It is hypothesized that the implementation of spoken dialogues in ITSs will result in added learning benefits (Pon-Barry et al., 2004; Litman et al., 2006), at least when compared to text-based input. However, this needs to be substantiated by further empirical research. The one study that compared text-based to speech-based ITSs reported no significant differences in learning gains between the two groups (Litman et al., 2006). However, the students who interacted through speech showed a significant increase in some attributes, such as the length of student utterances, that have been shown to be highly correlated with learning in tutoring environments (Rosé et al., 2003). Another important benefit of spoken interactions is that the learner models constructed by tutoring systems (in order to personalize their responses to a learner) can be substantially enhanced by incorporating information related to learner affect, motivation levels, self-efficacy, etc. Online measurement of these constructs is possible by examining the paralinguistic features of speech (for a review, see Pantic & Rothkrantz, 2003).


Nevertheless, barring a few exceptions (Litman et al., 2006; Mostow & Aist, 2001; Schultz et al., 2003), almost all ITSs have involved text-based communication (Litman et al., 2006). This can be attributed to the technological limitations of ASR systems, such as significant recognition errors, long training times, and substantial computational resources. These problems are further complicated in learning environments in which students typically struggle to come up with answers, thereby plaguing their speech with pauses, mispronunciations, and imperfect phrasings, which are all factors that increase ASR errors.

This trend is changing, as recent advances in the speed and accuracy of speech recognition systems have made speech-based ITSs technologically feasible. However, despite these advances, the performance of these systems on conversational speech in real-world environments is far from optimal, and it is unlikely that perfect recognition will ever be achieved, at least not in the near future. Furthermore, it is difficult to establish a suitable upper bound on ASR performance because recognition accuracies vary according to the particular ASR engine and speech corpus used for a given evaluation.

Therefore, it is an open question whether incorporating speech input will result in an improvement in learning gains or whether speech recognition errors will adversely interfere with learning. We tested this question by conducting two studies in which learners spoke their answers while they were tutored on computer literacy topics (the Internet, hardware, and operating systems) with AutoTutor. AutoTutor is an intelligent tutoring system that helps learners construct explanations to difficult questions by interacting with them in natural language and by utilizing simulation environments (Graesser et al., 1999). In previous versions of AutoTutor, learners typed, rather than spoke, their contributions.

Method

Study 1: Learners Interacting with the Spoken Input Version of AutoTutor

The participants were 30 undergraduates at the University of Memphis. The procedure consisted of a pre-test, an interaction with AutoTutor (35 minutes), and a post-test. The pre-test and post-test consisted of multiple-choice questions that had been used in previous AutoTutor research.

We used the commercially available Dragon NaturallySpeaking (v. 6) speech recognition system for speech-to-text translation. The recognizer was supplied with a dictionary that included common words (e.g., and, because) plus 2550 additional content words that were specific to the computer literacy topic (e.g., RAM, ROM, floppy).

Results

Calibrating ASR Accuracy

We parsed AutoTutor’s log files to obtain a transcript of the tutorial interactions. A sample of 352 individual student dialogue moves was randomly selected from the total set of 1615 dialogue moves (95% confidence level, sampling error 4.62). We ensured that there was an equal distribution of speech excerpts from each participant. To evaluate the accuracy of the speech recognition system, the automatically recognized text of each student dialogue turn in the sample was compared to a manual transcription of that turn, which was prepared by a human from the audio recordings of the tutorial interactions.
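The reported sampling error is consistent with the standard margin-of-error formula for a proportion with a finite-population correction. A quick arithmetic check, assuming the conventional p = .5 and z = 1.96 (the paper does not state these values explicitly):

```python
# Check of the reported 4.62 sampling error: margin of error for a
# proportion with finite-population correction. p = .5 and z = 1.96
# (95% confidence) are assumptions; the paper does not state them.
import math

n, N = 352, 1615  # sampled dialogue moves, total dialogue moves
p, z = 0.5, 1.96  # most conservative proportion; 95% z-score
moe = z * math.sqrt(p * (1 - p) / n * (N - n) / (N - 1))
print(round(100 * moe, 2))  # -> 4.62
```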


Word recognition rate (WRR), a standard metric used to assess the reliability of ASR systems, was used to assess recognition accuracy. WRR was computed as

WRR = 1 − (S + D + I) / N,

where S, D, and I are the number of substitutions, deletions, and insertions in the automatically recognized text when compared to the manually transcribed text of N words, after the two strings are aligned using dynamic string alignment at the word level. The WRR for our ASR system was .45, which lies somewhere between poor and fair when compared to automatic recognition of natural conversational speech.
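For concreteness, here is a minimal sketch of how S + D + I can be obtained with a standard word-level Levenshtein alignment; this is an illustration, not the specific alignment tool used in the study:

```python
# Word recognition rate via word-level edit distance: the minimum number
# of substitutions, deletions, and insertions (S + D + I) needed to turn
# the recognized hypothesis into the reference transcript of N words.
def wrr(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = edits needed to align ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return 1 - d[n][m] / n

# One substitution and one deletion over a 6-word reference -> WRR = .67
print(round(wrr("the hard disk stores data permanently",
                "the hard disc stores data"), 2))
```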

A follow-up analysis was also performed to compare the accuracy of all words with that of content words (RAM, CPU, etc.) that are highly specific to the computer literacy topic. Our results revealed that recognition accuracy for content words (M_CONTENT = .784) was significantly higher than the corresponding score for all words taken together (M_ALL = .693), F(1, 29) = 28.7, MSe = .004, p < .001.

Effects of ASR Errors on Learning

We computed pre-test and post-test scores for each participant before and after the interaction with AutoTutor. As expected, the post-test scores were significantly higher than the pre-test scores, t(29) = 2.75, p < .05, Se = .044. So, clearly, learning occurred during the 35-minute interaction with AutoTutor, showing an effect size of 0.50 sigma (approximately half a letter grade).
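The sigma effect size is a standardized mean difference (Cohen's d). A sketch of the computation, assuming the pre-test standard deviation as the standardizer (the paper does not specify which SD it used):

```python
# "Sigma" effect size for pre/post gains: standardized mean difference.
# Standardizing by the pre-test SD is an assumption; the paper does not
# name its standardizer (pre-test, pooled, or control-group SD).
from statistics import mean, stdev

def sigma_effect_size(pre: list, post: list) -> float:
    return (mean(post) - mean(pre)) / stdev(pre)

# Hypothetical proportion-correct scores for five learners:
pre = [0.40, 0.55, 0.35, 0.50, 0.45]
post = [0.55, 0.60, 0.50, 0.65, 0.55]
print(round(sigma_effect_size(pre, post), 2))  # -> 1.52
```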

These results are significant because they indicate that learning gains can be achieved irrespective of speech recognition errors. Correlation analyses did not reveal any explicit relationship between recognition errors and learning gains. This implies that errors associated with automatic speech recognition had no impact on learning.

Discussion Study 1

Our findings from Study 1 can be summarized as follows. First, although the word recognition rate was quite low, content words were detected with a reasonable degree of accuracy. Second, there was no relation between error rates and learning gains. This implies that AutoTutor is largely insensitive to the recognition errors and can appropriately interpret students’ responses.

Method

Study 2: Interacting with Typed and Spoken Versions of AutoTutor

The repeated-measures experimental design with 23 participants consisted of a pre-test, a tutorial session, and a post-test. Each participant was assigned to all three tutor conditions: speech input, text input, and control (no intervention). The order in which students used the two input methods (speech first, text second, or text first, speech second) was counterbalanced across participants.

Results

Calibrating ASR Accuracy

A human manually transcribed all the spoken utterances from a transcript of the tutorial interactions. A word recognition rate of .47 was obtained in this study, which was similar to that obtained in Study 1. Also as in Study 1, recognition accuracy for content words (M_CONTENT = .92) was significantly higher than the corresponding score for all words taken together (M_ALL = .84), F(1, 22) = 54.1, MSe = .001, p < .001.

Effects of ASR Errors on Learning

Performance on the pre-test and post-test was measured as the proportion of correct answers. Learning gains for each condition were computed as the difference between post-test and pre-test scores.

Learning gains for the speech (M_SPEECH = .27) and typed conditions (M_TYPED = .33) were on par with one another (i.e., no significant difference), and both were significantly higher than the control (M_CONTROL = .04), F(2, 44) = 9.59, MSe = 0.06, p < .001. Effect sizes for the speech and typed conditions were 1.2 and 1.5 sigma, respectively, which indicates that considerable learning (> 1 letter grade) occurred with the AutoTutor intervention.

Correlation analyses did not reveal any explicit relationship between recognition errors and learning gains for the speech condition. This implies that errors associated with automatic speech recognition had no impact on learning.

Discussion Study 2

The major findings of Study 1 were replicated in Study 2. In particular, both studies revealed that although overall ASR accuracy was quite low, this had no impact on learning gains. The high precision with which content words were recognized presumably compensated for ASR errors. Another intriguing finding in Study 2 was that learning gains associated with spoken input were not significantly different from those for typed input. This suggests that typed input may have benefits of its own, such as the possibility of planning responses in advance and of actively evaluating and altering conversational contributions prior to submission (Quinlan, 2004).

General Discussion

Our major finding is that significant speech recognition errors had no impact on learning. Learning gains obtained with the new speech recognition version of AutoTutor are consistent with six previous experiments (with the typed input version of the tutor) in which learning gains were evaluated on approximately 500 college students (VanLehn et al., 2007). This implies that AutoTutor is able to compensate for partial failures in the quality of its interpretation of the learner’s input. This was confirmed by a follow-up analysis (not reported here) in which learners’ automatically recognized and manually transcribed contributions were compared to ideal answers from AutoTutor’s curriculum scripts. No significant differences were discovered between the match of the recognized text to the ideal answers and the match of the transcribed text to the ideal answers.

We believe that AutoTutor’s ability to compensate for significant speech recognition errors arises from its use of shallow natural language processing (NLP) algorithms, such as Latent Semantic Analysis (Landauer, McNamara, Dennis, & Kintsch, 2007) and content word overlap, coupled with a soft constraint-satisfaction model to evaluate the conceptual quality of students’ contributions. According to soft constraint-satisfaction models, the performance of an intelligent system should not rely on the integrity of any one level or module, but rather should reflect the confluence of several levels/modules that are statistically combined.
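To illustrate the flavor of this approach, the sketch below blends two shallow signals, content-word overlap and a semantic similarity score, into a single soft match value. The weights and the lsa_similarity stand-in are hypothetical; this is not AutoTutor's actual implementation:

```python
# Illustrative sketch of soft constraint satisfaction: several shallow
# NLP signals are statistically combined, so no single noisy module
# (e.g., the ASR channel) is decisive. Weights are hypothetical.

def content_word_overlap(student: str, ideal: str, content_words: set) -> float:
    # Fraction of the ideal answer's content words present in the student turn.
    s = {w for w in student.lower().split() if w in content_words}
    i = {w for w in ideal.lower().split() if w in content_words}
    return len(s & i) / len(i) if i else 0.0

def soft_match(student: str, ideal: str, content_words: set,
               lsa_similarity, w_sem: float = 0.6,
               w_overlap: float = 0.4) -> float:
    # Weighted blend: misrecognized function words barely move either
    # component, because accurately recognized content words dominate.
    return (w_sem * lsa_similarity(student, ideal)
            + w_overlap * content_word_overlap(student, ideal, content_words))
```

In a real system, lsa_similarity would be the cosine between the LSA vectors of the student turn and the ideal answer; here it is simply passed in as a function.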


The success of shallow NLP techniques coupled with soft constraint-satisfaction models in compensating for ASR errors has important design implications for ITSs that aspire to incorporate speech input. We have shown that if an evaluation module can adequately compensate for ASR errors, then such errors would have little to no impact on learning.

Similar to Litman et al. (2006), we also discovered that speech input does not provide additional learning gains when compared to typed input. This may lead researchers to question the utility of migrating from typed to speech input. Our position is that the major advantage of incorporating speech input into an ITS is that monitoring the paralinguistic features of speech provides a rich trace of information related to the social dynamics, attitudes, urgency, and emotions of the learner. Of particular relevance are the affective states of the learner, because learning inevitably involves failure and a host of affective responses. We have previously documented that increased levels of boredom are negatively correlated with learning, while engagement and confusion are positively correlated with learning (Craig et al., 2004; Graesser et al., 2007). Consequently, a learning environment that is responsive to the affective and cognitive states of a learner will be more effective in promoting learning.

We are exploring the possibility of automatically detecting learner emotions from acoustic-prosodic features of speech as part of a larger project aimed at transforming AutoTutor into an affect-sensitive ITS (D’Mello, Picard, & Graesser, 2007). Our hope is that an affect-sensitive AutoTutor with voice input will broaden the communicative bandwidth between the learner and the computer, thereby providing a more naturalistic interaction experience. This will presumably optimize learner satisfaction and positively impact learning.

References

Craig, S., Graesser, A., Sullins, J., & Gholson, B. (2004). Affect and learning: An exploratory look into the role of affect in learning. Journal of Educational Media, 29, 241-250.

D’Mello, S. K., Picard, R., & Graesser, A. C. (2007). Toward an affect-sensitive AutoTutor. IEEE Intelligent Systems, 22(4), 53-61.

Graesser, A. C., Wiemer-Hastings, K., Wiemer-Hastings, P., Kreuz, R., & the TRG (1999). AutoTutor: A simulation of a human tutor. Journal of Cognitive Systems Research, 1, 35-51.

Graesser, A. C., Chipman, P., King, B., McDaniel, B., & D’Mello, S. (2007). Emotions and learning with AutoTutor. In R. Luckin et al. (Eds.), 13th International Conference on Artificial Intelligence in Education (AIED 2007) (pp. 569-571). Amsterdam: IOS Press.

Landauer, T., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.

Litman, D. J., Rosé, C. P., Forbes-Riley, K., VanLehn, K., Bhembe, D., & Silliman, S. (2006). Spoken versus typed human and computer dialogue tutoring. International Journal of Artificial Intelligence in Education, 16, 145-170.

Mostow, J., & Aist, G. (2001). Evaluating tutors that listen: An overview of Project LISTEN. In K. Forbus & P. Feltovich (Eds.), Smart machines in education: The coming revolution in educational technology (pp. 169-234). Cambridge, MA: MIT/AAAI Press.

Pantic, M., & Rothkrantz, L. J. M. (2003). Towards an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, Special Issue on Multimodal Human-Computer Interaction (HCI), 91(9), 1370-1390.

Pon-Barry, H., Clark, B., Schultz, K., Bratt, E., & Peters, S. (2004). Advantages of spoken language interaction in tutorial dialogue systems. In J. C. Lester et al. (Eds.), 7th International Conference on Intelligent Tutoring Systems (pp. 390-400). Berlin: Springer-Verlag.

Quinlan, T. (2004). Speech recognition technology and students with writing difficulties: Improving fluency. Journal of Educational Psychology, 96(2), 337-346.

Rosé, C. P., Bhembe, D., Siler, S., Srivastava, R., & VanLehn, K. (2003). The role of why questions in effective human tutoring. In U. Hoppe, F. Verdejo, & J. Kay (Eds.), Artificial Intelligence in Education. Amsterdam: IOS Press.

Schultz, K., Bratt, E. O., Clark, B., Peters, S., Pon-Barry, H., & Treeratpituk, P. (2003). A scalable, reusable spoken conversational tutor: SCoT. In Proceedings of the AIED 2003 Workshop on Tutorial Dialogue Systems: With a View toward the Classroom (pp. 367-377).

VanLehn, K., Graesser, A. C., Jackson, G. T., Jordan, P., Olney, A., & Rosé, C. P. (2007). When are tutorial dialogues more effective than reading? Cognitive Science, 31, 3-62.
