
Do You Understand Me? The Impact of Speech Recognition Errors on Learning Gains

Sidney K. D’Mello, Brandon King, O’Meed Entezari, Patrick Chipman, and Arthur Graesser

Institute for Intelligent Systems, University of Memphis

sdmello@memphis.edu

Abstract

The paper presents two studies where participants learned computer literacy by having a spoken conversation with AutoTutor, an intelligent tutoring system with conversational dialogue. In the first study, 30 students completed a pre-test, a 35-minute training session, and a post-test. The results revealed that significant learning occurred despite considerable speech recognition errors. The second study utilized a repeated-measures design where 23 participants completed a pre-test, a 3-part tutorial session (speech input, text input, no intervention), and a post-test. Learning gains for the speech and typed conditions were on par and significantly higher than the control. Speech recognition error rates were not correlated with learning. We address the implications of our findings for learning environments that support spoken student responses.

Introduction

The field of human-computer interaction may be radically transformed by computer systems equipped with robust automatic speech recognition (ASR) for speech-to-text transcription, coupled with effective natural language processing techniques that link the recognized text to specific computer actions. Since humans primarily communicate through speech and a host of non-verbal cues, computer systems that are able to recognize and respond to these communication channels will presumably provide a more effective, meaningful, and naturalistic interaction experience.

One type of computer technology that would greatly benefit from verbal input is intelligent tutoring systems (ITSs) that promote one-on-one tutoring. It is hypothesized that the implementation of spoken dialogues in ITSs will result in added learning benefits (Pon-Barry et al., 2004; Litman et al., 2006), at least when compared to text-based input. However, this needs to be substantiated by further empirical research. The one study that compared text-based to speech-based ITSs reported no significant differences in learning gains between the two groups (Litman et al., 2006). However, the students who interacted through speech showed a significant increase in some attributes, such as the length of student utterances, that have been shown to be highly correlated with learning in tutoring environments (Rosé et al., 2003). Another important benefit of spoken interactions is that the learner models constructed by tutoring systems (in order to personalize their responses to a learner) can be substantially enhanced by incorporating information related to learner affect, motivation levels, self-efficacy, etc. Online measurement of these constructs is possible by examining the paralinguistic features of speech (for a review, see Pantic & Rothkrantz, 2003).


Nevertheless, barring a few exceptions (Litman et al., 2006; Mostow & Aist, 2001; Schultz et al., 2003), almost all ITSs have involved text-based communication (Litman et al., 2006). This can be attributed to the technological limitations of ASR systems, such as significant recognition errors, long training times, and substantial computational resources. These problems are further complicated in learning environments in which students typically struggle to come up with answers, thereby plaguing their speech with pauses, mispronunciations, and imperfect phrasings, which are all factors that increase ASR errors.

This trend is changing, as recent advances in the speed and accuracy of speech recognition systems have made speech-based ITSs technologically feasible. However, despite these advances, the performance of these systems on conversational speech in real-world environments is far from optimal, and it is unlikely that perfect recognition will ever be achieved, at least not in the near future. Furthermore, it is difficult to establish a suitable upper bound on ASR performance because recognition accuracies vary according to the particular ASR engine and speech corpus used for a given evaluation.

Therefore, it is an open question whether incorporating speech input will result in an improvement in learning gains or whether speech recognition errors will adversely interfere with learning. We tested this question by conducting two studies in which learners spoke their answers while they were tutored on computer literacy topics (the Internet, hardware, and operating systems) with AutoTutor. AutoTutor is an intelligent tutoring system that helps learners construct explanations to difficult questions by interacting with them in natural language and by utilizing simulation environments (Graesser et al., 1999). In previous versions of AutoTutor, learners typed, rather than spoke, their contributions.

Method

Study 1: Learners Interacting with the Spoken Input Version of AutoTutor

The participants were 30 undergraduates at the University of Memphis. The procedure consisted of a pre-test, an interaction with AutoTutor (35 minutes), and a post-test. The pre-test and post-test consisted of multiple-choice questions that had been used in previous AutoTutor research.

We used the commercially available Dragon NaturallySpeaking (v. 6) speech recognition system for speech-to-text translation. The recognizer was supplied with a dictionary that included common words (e.g., and, because) plus 2550 additional content words that were specific to the computer literacy topic (e.g., RAM, ROM, floppy).

Results

Calibrating ASR Accuracy

We parsed AutoTutor’s log files to obtain a transcript of the tutorial interactions. A sample of 352 individual student dialogue moves was randomly selected from the total set of 1615 dialogue moves (95% confidence level, sampling error 4.62). We ensured that there was an equal distribution of speech excerpts from each participant. To evaluate the accuracy of the speech recognition system, the automatically recognized text of each student dialogue turn in the sample was compared to a manual transcription of that turn, which was prepared by a human from the audio recordings of the tutorial interactions.
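The reported sampling error is consistent with the standard margin-of-error formula for a proportion with a finite-population correction. A quick arithmetic check, assuming the conventional p = .5 and z = 1.96 (the paper does not state these values explicitly):

```python
# Check of the reported 4.62 sampling error: margin of error for a
# proportion with finite-population correction. p = .5 and z = 1.96
# (95% confidence) are assumptions; the paper does not state them.
import math

n, N = 352, 1615  # sampled dialogue moves, total dialogue moves
p, z = 0.5, 1.96  # most conservative proportion; 95% z-score
moe = z * math.sqrt(p * (1 - p) / n * (N - n) / (N - 1))
print(round(100 * moe, 2))  # -> 4.62
```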


Word recognition rate (WRR), a standard metric used to assess the reliability of ASR systems, was used to assess recognition accuracy. WRR was computed as

WRR = 1 − (S + D + I) / N,

where S, D, and I are the number of substitutions, deletions, and insertions in the automatically recognized text when compared to the manually transcribed text of N words, after the two strings are aligned using dynamic string alignment at the word level. The WRR for our ASR system was .45, which lies somewhere between poor and fair when compared to automatic recognition of natural conversational speech.
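For concreteness, here is a minimal sketch of how S + D + I can be obtained with a standard word-level Levenshtein alignment; this is an illustration, not the specific alignment tool used in the study:

```python
# Word recognition rate via word-level edit distance: the minimum number
# of substitutions, deletions, and insertions (S + D + I) needed to turn
# the recognized hypothesis into the reference transcript of N words.
def wrr(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = edits needed to align ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return 1 - d[n][m] / n

# One substitution and one deletion over a 6-word reference -> WRR = .67
print(round(wrr("the hard disk stores data permanently",
                "the hard disc stores data"), 2))
```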

A follow-up analysis was also performed to compare the accuracy of all words with that of content words (RAM, CPU, etc.) that are highly specific to the computer literacy topic. Our results revealed that recognition accuracy for content words (M_CONTENT = .784) was significantly higher than the corresponding score for all words taken together (M_ALL = .693), F(1, 29) = 28.7, MSe = .004, p < .001.

Effects of ASR Errors on Learning

We computed pre-test and post-test scores for each participant before and after the interaction with AutoTutor. As expected, the post-test scores were significantly higher than the pre-test scores, t(29) = 2.75, p < .05, Se = .044. So, clearly, learning occurred during the 35-minute interaction with AutoTutor, showing an effect size of 0.50 sigma (approximately half a letter grade).
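The sigma effect size is a standardized mean difference (Cohen's d). A sketch of the computation, assuming the pre-test standard deviation as the standardizer (the paper does not specify which SD it used):

```python
# "Sigma" effect size for pre/post gains: standardized mean difference.
# Standardizing by the pre-test SD is an assumption; the paper does not
# name its standardizer (pre-test, pooled, or control-group SD).
from statistics import mean, stdev

def sigma_effect_size(pre: list, post: list) -> float:
    return (mean(post) - mean(pre)) / stdev(pre)

# Hypothetical proportion-correct scores for five learners:
pre = [0.40, 0.55, 0.35, 0.50, 0.45]
post = [0.55, 0.60, 0.50, 0.65, 0.55]
print(round(sigma_effect_size(pre, post), 2))  # -> 1.52
```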

These results are significant because they indicate that learning gains can be achieved irrespective of speech recognition errors. Correlation analyses did not reveal any explicit relationship between recognition errors and learning gains. This implies that errors associated with automatic speech recognition had no impact on learning.

Discussion Study 1

Our findings from Study 1 can be summarized as follows. First, although the word recognition rate was quite low, content words were detected with a reasonable degree of accuracy. Second, there was no relation between error rates and learning gains. This implies that AutoTutor is largely insensitive to the recognition errors and can appropriately interpret students’ responses.

Method

Study 2: Interacting with Typed and Spoken Versions of AutoTutor

The repeated-measures experimental design with 23 participants consisted of a pre-test, a tutorial session, and a post-test. Each participant was assigned to all three tutor conditions: speech input, text input, and control (no intervention). The order in which students used the two input methods (speech first, text second, or text first, speech second) was counterbalanced across participants.

Results

Calibrating ASR Accuracy

A human manually transcribed all the spoken utterances from a transcript of the tutorial interactions. A word recognition rate of .47 was obtained in this study, which was similar to that obtained in Study 1. Also as in Study 1, recognition accuracy for content words (M_CONTENT = .92) was significantly higher than the corresponding score for all words taken together (M_ALL = .84), F(1, 22) = 54.1, MSe = .001, p < .001.

Effects of ASR Errors on Learning

Performance on the pre-test and post-test was measured as the proportion of correct answers. Learning gains for each condition were computed as the difference between post-test and pre-test scores.

Learning gains for the speech (M_SPEECH = .27) and typed conditions (M_TYPED = .33) were on par with one another (i.e., no significant difference), and both were significantly higher than the control (M_CONTROL = .04), F(2, 44) = 9.59, MSe = 0.06, p < .001. Effect sizes for the speech and typed conditions were 1.2 and 1.5 sigma, respectively, which indicates that considerable learning (> 1 letter grade) occurred with the AutoTutor intervention.

Correlation analyses did not reveal any explicit relationship between recognition errors and learning gains for the speech condition. This implies that errors associated with automatic speech recognition had no impact on learning.

Discussion Study 2

The major findings of Study 1 were replicated in Study 2. In particular, both studies revealed that although overall ASR accuracy was quite low, this had no impact on learning gains. The high precision with which content words were recognized presumably compensated for ASR errors. Another intriguing finding in Study 2 was that learning gains associated with spoken input were not significantly different from those for typed input. This suggests that typed input may have benefits of its own, such as the possibility of planning responses in advance and of actively evaluating and altering conversational contributions prior to submission (Quinlan, 2004).

General Discussion

Our major finding is that significant speech recognition errors had no impact on learning. Learning gains obtained with the new speech recognition version of AutoTutor are consistent with six previous experiments (with the typed input version of the tutor) in which learning gains were evaluated on approximately 500 college students (VanLehn et al., 2007). This implies that AutoTutor is able to compensate for partial failures in the quality of its interpretation of the learner’s input. This was confirmed by a follow-up analysis (not reported here) in which learners’ automatically recognized and manually transcribed contributions were compared to ideal answers from AutoTutor’s curriculum scripts. No significant differences were discovered between the match of the recognized text to the ideal answers and the match of the transcribed text to the ideal answers.

We believe that AutoTutor’s ability to compensate for significant speech recognition errors arises from its use of shallow natural language processing (NLP) algorithms, such as Latent Semantic Analysis (Landauer, McNamara, Dennis, & Kintsch, 2007) and content word overlap, coupled with a soft constraint-satisfaction model to evaluate the conceptual quality of students’ contributions. According to soft constraint-satisfaction models, the performance of an intelligent system should not rely on the integrity of any one level or module, but rather should reflect the confluence of several levels/modules that are statistically combined.
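To illustrate the flavor of this approach, the sketch below blends two shallow signals, content-word overlap and a semantic similarity score, into a single soft match value. The weights and the lsa_similarity stand-in are hypothetical; this is not AutoTutor's actual implementation:

```python
# Illustrative sketch of soft constraint satisfaction: several shallow
# NLP signals are statistically combined, so no single noisy module
# (e.g., the ASR channel) is decisive. Weights are hypothetical.

def content_word_overlap(student: str, ideal: str, content_words: set) -> float:
    # Fraction of the ideal answer's content words present in the student turn.
    s = {w for w in student.lower().split() if w in content_words}
    i = {w for w in ideal.lower().split() if w in content_words}
    return len(s & i) / len(i) if i else 0.0

def soft_match(student: str, ideal: str, content_words: set,
               lsa_similarity, w_sem: float = 0.6,
               w_overlap: float = 0.4) -> float:
    # Weighted blend: misrecognized function words barely move either
    # component, because accurately recognized content words dominate.
    return (w_sem * lsa_similarity(student, ideal)
            + w_overlap * content_word_overlap(student, ideal, content_words))
```

In a real system, lsa_similarity would be the cosine between the LSA vectors of the student turn and the ideal answer; here it is simply passed in as a function.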


The success of shallow NLP techniques coupled with soft constraint-satisfaction models in compensating for ASR errors has important design implications for ITSs that aspire to incorporate speech input. We have shown that if an evaluation module can adequately compensate for ASR errors, then such errors would have little to no impact on learning.

Similar to Litman et al. (2006), we also discovered that speech input does not provide additional learning gains when compared to typed input. This may lead researchers to question the utility of migrating from typed to speech input. Our position is that the major advantage of incorporating speech input into an ITS is that monitoring the paralinguistic features of speech provides a rich trace of information related to the social dynamics, attitudes, urgency, and emotions of the learner. Of particular relevance are the affective states of the learner, because learning inevitably involves failure and a host of affective responses. We have previously documented that increased levels of boredom are negatively correlated with learning, while engagement and confusion are positively correlated with learning (Craig et al., 2004; Graesser et al., 2007). Consequently, a learning environment that is responsive to the affective and cognitive states of a learner will be more effective in promoting learning.

We are exploring the possibility of automatically detecting learner emotions from acoustic-prosodic features of speech as part of a larger project aimed at transforming AutoTutor into an affect-sensitive ITS (D’Mello, Picard, & Graesser, 2007). Our hope is that an affect-sensitive AutoTutor with voice input will broaden the communicative bandwidth between the learner and the computer, thereby providing a more naturalistic interaction experience. This will presumably optimize learner satisfaction and positively impact learning.

References

Craig, S., Graesser, A., Sullins, J., & Gholson, B. (2004). Affect and learning: An exploratory look into the role of affect in learning. Journal of Educational Media, 29, 241-250.

D’Mello, S. K., Picard, R., & Graesser, A. C. (2007). Toward an affect-sensitive AutoTutor. IEEE Intelligent Systems, 22(4), 53-61.

Graesser, A. C., Wiemer-Hastings, K., Wiemer-Hastings, P., Kreuz, R., & the TRG (1999). AutoTutor: A simulation of a human tutor. Journal of Cognitive Systems Research, 1, 35-51.

Graesser, A. C., Chipman, P., King, B., McDaniel, B., & D’Mello, S. (2007). Emotions and learning with AutoTutor. In R. Luckin et al. (Eds.), 13th International Conference on Artificial Intelligence in Education (AIED 2007) (pp. 569-571). Amsterdam: IOS Press.

Landauer, T., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.

Litman, D. J., Rosé, C. P., Forbes-Riley, K., VanLehn, K., Bhembe, D., & Silliman, S. (2006). Spoken versus typed human and computer dialogue tutoring. International Journal of Artificial Intelligence in Education, 16, 145-170.

Mostow, J., & Aist, G. (2001). Evaluating tutors that listen: An overview of Project LISTEN. In K. Forbus & P. Feltovich (Eds.), Smart machines in education: The coming revolution in educational technology (pp. 169-234). Cambridge, MA: MIT/AAAI Press.

Pantic, M., & Rothkrantz, L. J. M. (2003). Towards an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, Special Issue on Multimodal Human-Computer Interaction (HCI), 91(9), 1370-1390.

Pon-Barry, H., Clark, B., Schultz, K., Bratt, E., & Peters, S. (2004). Advantages of spoken language interaction in tutorial dialogue systems. In J. C. Lester et al. (Eds.), 7th International Conference on Intelligent Tutoring Systems (pp. 390-400). Berlin: Springer-Verlag.

Quinlan, T. (2004). Speech recognition technology and students with writing difficulties: Improving fluency. Journal of Educational Psychology, 96(2), 337-346.

Rosé, C. P., Bhembe, D., Siler, S., Srivastava, R., & VanLehn, K. (2003). The role of why questions in effective human tutoring. In U. Hoppe, F. Verdejo, & J. Kay (Eds.), Artificial Intelligence in Education. Amsterdam: IOS Press.

Schultz, K., Bratt, E. O., Clark, B., Peters, S., Pon-Barry, H., & Treeratpituk, P. (2003). A scalable, reusable spoken conversational tutor: SCoT. In Proceedings of the AIED 2003 Workshop on Tutorial Dialogue Systems: With a View toward the Classroom (pp. 367-377).

VanLehn, K., Graesser, A. C., Jackson, G. T., Jordan, P., Olney, A., & Rosé, C. P. (2007). When are tutorial dialogues more effective than reading? Cognitive Science, 31, 3-62.
