Detecting User Engagement in Everyday Conversations

Chen Yu
Department of Computer Science
University of Rochester
Rochester, NY 14627
yu@cs.rochester.edu

Paul M. Aoki and Allison Woodruff
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304
{aoki,woodruff}@acm.org

Abstract

This paper presents a novel application of speech emotion recognition and engagement detection in computer-mediated voice communications systems. We utilize machine learning techniques, such as support vector machines (SVMs) and hidden Markov models (HMMs), to classify users' emotions in speech. We argue that using prosodic features alone is not sufficient for modeling users' engagement in conversations. In light of this, a multilevel architecture based on coupled HMMs is proposed to detect user engagement in spontaneous speech. The first level is comprised of SVM-based classifiers that are able to recognize discrete emotion types as well as arousal and valence states. A high-level HMM then uses those emotional states as input to estimate user engagement in conversations by decoding the internal states of the HMM. We report experimental results for the LDC Emotional Prosody and CALLFRIEND speech corpora.
1. Introduction

Telephones are used more and more often for everyday communication among family members and friends, and the spread of mobile phones has accelerated this trend. With the availability of these communications services, more and more people talk to each other remotely. Most work on computer-mediated voice communications systems for mobile computing focuses on practical issues, such as the bandwidth needed to support continuous network connections, and robustness. In this work, we are interested in developing an intelligent user interface for future voice communication systems. One critical issue is that computers need to automatically detect users' status from speech. We expect that computers can not only understand what people say (speech recognition), but also detect the degree of involvement in communication, such as users' feelings and interest toward the topics under discussion and toward the other participants. Our interest builds on prior work on social audio spaces and engagement in conversations [1].
In the context of this kind of remote communication, the primary input to a computational system is users' voices. Speech communication is not merely an exchange of words. In addition to carrying linguistic information that consists of words and the rules of language, a user's speech provides implicit messages such as the speaker's gender, age, and physical condition, as well as the speaker's emotion and attitude toward the topic, the dialog partner, or the situation. In this work, we attempt to develop a computer system that is able to extract non-linguistic information from the user's speech and adjust user communication channels and human-computer interfaces in response to the user's engagement in conversations.
The correlations between acoustic features (e.g., prosody) and emotional states have been studied in speech production and phonetics (for a review, see [2]). Recently, there has been growing interest in automatic speech emotion recognition. Dellaert et al. [3] implemented a method based on the majority voting of subspace specialists to classify acted spoken utterances into four types. Batliner et al. [4] provided a comparative study of recognizing two emotions, "neutral vs. anger," expressed by actors and naive subjects. Ang et al. [5] investigated the use of prosody for the detection of frustration and annoyance in natural dialogues. The method described by Lee et al. [6] combined acoustic features with emotionally salient keywords to categorize spoken utterances into two sets: negative and non-negative. A good review of emotion recognition can be found in [7].
The novelty of this work is to estimate users' engagement in dialogue by considering multiple cues. The central idea is that in everyday conversations, a participant's engagement state is influenced by his/her previous engagement state, his/her current emotional state, and the other participants' engagement states. Emotional states are closely related to engagement levels. In addition, we argue that the change of engagement levels should be considered in a continuous mode: it is rare that people abruptly change their interest in a topic or a speaker. Furthermore, a participant's engagement is naturally influenced by the other people in a conversation. For instance, a deeply engaged speaker usually makes the audience more involved in the conversation. The advantages of integrating multiple cues to estimate users' engagement are twofold. First, we can achieve better accuracy by considering multiple information sources, so that noise in one channel can be removed during the integration process. Second, when some information is not available, we can still compute users' engagement based on partial information. For example, a user may just listen to the speaker's talk without uttering any speech himself/herself. In this case, we cannot estimate the listener's engagement if the method depends purely on the acoustic features of his/her speech. In our method, however, we can utilize the listener's previous engagement state and the talker's engagement level to estimate the listener's involvement in the communication.
In light of the above analysis, we propose a multilevel structure using support vector machine (SVM) and hidden Markov model (HMM) techniques, as shown in Figure 1. The first level of the architecture is comprised of SVM classifiers that take acoustic features as input and predict users' emotional states, such as arousal levels or discrete emotion types. Since emotion and engagement are conveyed on a continuous scale, those emotional states are then used as input to a higher-level HMM that models the dynamic change of users' engagement in conversations. In addition, we apply the coupled HMM (CHMM) technique to capture the joint behaviors of the participants and model the influence of individual participants on each other. In this way, the method decodes users' engagement states in conversations by seamlessly integrating low-level prosodic, temporal, and cross-participant cues. In the rest of the paper, Section 2 presents our method of speech emotion recognition and Section 3 describes engagement detection based on the CHMM. The experimental results are reported and discussed in Section 4.

Figure 1: Overview of our approach to engagement detection. [Diagram: for each participant, feature extraction feeds an SVM classifier; the classifier outputs feed a coupled HMM, which produces the decoded engagement state sequence.]
2. Speech emotion recognition

The first level classifies emotions based on certain attributes of spoken communication. We focus on seven discrete emotion types in this study: anger, panic, sadness, happy, interest, boredom, and neutral. In addition to labeling emotions as discrete categories, some researchers (e.g., in [7]) prefer to characterize emotions in terms of continuous dimensions. The two most commonly considered dimensions are arousal and valence. Arousal refers to the degree of intensity of the affect and ranges from sleep to excitement. Valence describes the pleasantness of the stimuli, such as positive (happy) or negative (sad). This work explores the classification of discrete emotion types as well as arousal and valence levels.

Note that the practical motivations of this work require us to build a speaker-independent emotion detection system that has to deal with the enormous speaker variability in speech. People's willingness to display emotional responses and the way they convey affective messages using speech vary widely across individuals. Thus, we need to find a set of robust features that are not only closely correlated with emotional categories but also invariant across different speakers.
We first segment continuous speech into spoken utterances. Then we use the Praat software package to extract the prosodic and energy profiles of each spoken utterance, which carry a large amount of emotional information. Next, seven kinds of acoustic features are extracted from each spoken utterance:

• Fundamental frequency (pitch): mean, maximum, minimum, standard deviation, range, 25th percentile, and 75th percentile.
• Derivative of pitch: mean, maximum, minimum, standard deviation, range, mean of the absolute pitch derivative, and standard deviation of the absolute derivative.
• Duration of pitch: ratio of the duration of voiced and unvoiced regions, mean number of frames in voiced regions, standard deviation of the number of frames in voiced regions, number of voiced regions, ratio of voiced and unvoiced frames, maximal number of frames in a voiced region, and mean of the maximum pitch in every region.
• Energy: mean, standard deviation, maximum, median, and energy in frequency bands (around 2 kHz).
• Derivative of energy: mean, standard deviation, maximum, median, and minimum.
• Duration of energy in non-silent regions: mean number of frames, standard deviation of the number of frames, ratio of non-silent frames, and maximum number of frames.
• Formants: the first three formant frequencies (F1, F2, F3) and their bandwidths.
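For illustration, the pitch statistics above can be computed from a frame-level F0 track. The sketch below assumes a hypothetical per-frame pitch array of the kind Praat produces (unvoiced frames marked 0); it is not the paper's actual feature extractor:

```python
import numpy as np

def pitch_features(f0):
    """Summary statistics of a frame-level F0 track (Hz).
    Unvoiced frames are marked 0 and excluded from the statistics."""
    voiced = f0[f0 > 0]
    return {
        "mean": voiced.mean(),
        "max": voiced.max(),
        "min": voiced.min(),
        "std": voiced.std(),
        "range": voiced.max() - voiced.min(),
        "p25": np.percentile(voiced, 25),
        "p75": np.percentile(voiced, 75),
    }

# Hypothetical 10-frame F0 track: two voiced regions separated by silence.
f0 = np.array([0.0, 210.0, 220.0, 230.0, 0.0, 0.0, 180.0, 190.0, 200.0, 0.0])
feats = pitch_features(f0)
```

The same pattern extends directly to the energy and duration features, which are likewise summary statistics over frames or regions.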
We now have multidimensional affective features for each spoken utterance. The curse of dimensionality in high-dimensional classification is well known in machine learning, and pruning irrelevant features holds more promise for generalizable classification. We transformed the original feature space into a lower-dimensional space using the RELIEF-F algorithm for feature selection [8]. As mentioned above, we want to develop a speaker-independent emotion recognition system that needs to deal with speaker variation. In practice, due to the big differences in prosodic features between male and female speakers, we divided users into two groups based on gender and used different sets of prosodic features for each group. The top 7 features for arousal level classification are as follows:
• Male: range of F2, range of pitch, maximum of pitch, energy >2000 Hz, maximum of voiced durations, standard deviation of the derivative of energy, and maximum of energy.
• Female: mean of pitch, range of the derivative of pitch, mean duration of voiced regions, energy ...

3. Engagement detection

The low-level SVM classifiers predict arousal levels from the acoustic features of each utterance; the outputs of those classifiers are then used as the observations of the high-level HMM. The HMM is comprised of five hidden states corresponding to degrees of engagement in conversations and models the temporal continuity of user engagement.
In addition, we applied a CHMM to describe the influence of the engagement state of one participant on the others. In the CHMM, each chain has 5 hidden states corresponding to engagement levels. The observations are the arousal levels of the participants in a conversation.
Given the sequences of arousal levels and engagement levels of the participants, the training procedure of the CHMM needs to estimate three kinds of probabilities:

• p(o_i | s_j) is the probability of observing arousal level i in state j; it is a multinomial distribution and can be learned by simply counting the expected frequency of arousal levels occurring in state j.
• p(s_j^m | s_i^m) is the transition probability of taking the transition from state s_i to state s_j within chain m.
• p(s_j^m | s_i^n) is the cross-participant influence probability of taking the transition to state s_j in chain m given state s_i in chain n.
Note that currently the observations are discrete values (quantized arousal levels, etc.) and are modeled by multinomial distributions. We could utilize Gaussian mixture models for continuous observations of the CHMM in case the low-level classifiers provide probabilistic categories.
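The counting procedure described above can be sketched as follows for one chain's parameters. The toy sequences and the add-one smoothing are illustrative assumptions, not details from the paper:

```python
def _normalize(rows):
    # Convert counts to conditional probabilities (each row sums to 1).
    return [[c / sum(row) for c in row] for row in rows]

def estimate_chmm_probs(states_a, states_b, obs_a, n_states=5, n_obs=5):
    """Frequency-counting estimates, with add-one smoothing, for chain A:
    emission p(o|s), within-chain transition p(s_t|s_{t-1}),
    and cross-chain influence p(s_t in A | s_{t-1} in B)."""
    emit = [[1.0] * n_obs for _ in range(n_states)]
    trans = [[1.0] * n_states for _ in range(n_states)]
    cross = [[1.0] * n_states for _ in range(n_states)]
    for s, o in zip(states_a, obs_a):
        emit[s][o] += 1.0
    for prev, nxt in zip(states_a, states_a[1:]):
        trans[prev][nxt] += 1.0
    for prev_b, nxt_a in zip(states_b, states_a[1:]):
        cross[prev_b][nxt_a] += 1.0
    return _normalize(emit), _normalize(trans), _normalize(cross)

# Toy labeled sequences: engagement states and arousal observations, 0-4 scale.
states_a = [0, 0, 1, 2, 2, 3]
states_b = [1, 1, 2, 2, 3, 3]
obs_a    = [0, 1, 1, 2, 3, 3]
emit, trans, cross = estimate_chmm_probs(states_a, states_b, obs_a)
```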
In the testing phase, acoustic features are fed into the low-level SVM classifiers, and the output arousal states are fed into the high-level CHMM. The decoded state sequences of the CHMM, obtained using the Viterbi algorithm, indicate the engagement states of the participants. Formally, assume that the CHMM consists of two chains corresponding to the two participants in a conversation, and let s_t^1 and s_t^2 be the engagement states of participant 1 and participant 2 at time t, respectively. o_t^1 and o_t^2 are the observations (arousal levels, etc.) of the participants. The model predicts the current state s_t^1 based on its own previous state s_{t-1}^1, the cross-channel influence s_{t-1}^2, and the new observation of arousal level o_t^1. Specifically, the probability of the combination of the two participants' states is as follows:

p(s_t^1, s_t^2) = p(s_t^1 | s_{t-1}^1) p(s_t^2 | s_{t-1}^2) p(s_t^1 | s_{t-1}^2) p(s_t^2 | s_{t-1}^1) p(o_t^1 | s_t^1) p(o_t^2 | s_t^2)

In this way, our method is able to estimate the two participants' engagement states simultaneously given the raw speech data in the conversation.
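Under the factored joint probability above, decoding can be carried out with a Viterbi pass over the product state space of the two chains. The sketch below makes simplifying assumptions (two engagement states instead of five, shared parameter tables for both chains, neutral coupling in the demo); all names are illustrative:

```python
import itertools

def viterbi_chmm(obs_a, obs_b, init, trans, cross, emit):
    """Joint Viterbi decoding for a two-chain coupled HMM.
    Each joint state is a pair (i, j); the transition score is the product of
    within-chain and cross-chain factors, matching the factored probability."""
    n = len(init)
    joint = list(itertools.product(range(n), range(n)))
    # Initialization with the first observations of both chains.
    delta = {(i, j): init[i] * init[j] * emit[i][obs_a[0]] * emit[j][obs_b[0]]
             for i, j in joint}
    back = []
    for t in range(1, len(obs_a)):
        new_delta, ptr = {}, {}
        for i, j in joint:
            score, best = max(
                (delta[(pi, pj)] * trans[pi][i] * trans[pj][j]
                 * cross[pj][i] * cross[pi][j], (pi, pj))
                for pi, pj in joint)
            new_delta[(i, j)] = score * emit[i][obs_a[t]] * emit[j][obs_b[t]]
            ptr[(i, j)] = best
        delta, back = new_delta, back + [ptr]
    # Backtrack from the best final joint state.
    state = max(delta, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

init  = [0.5, 0.5]
trans = [[0.7, 0.3], [0.3, 0.7]]   # sticky engagement dynamics
cross = [[0.5, 0.5], [0.5, 0.5]]   # neutral coupling, for a simple demo
emit  = [[0.9, 0.1], [0.1, 0.9]]   # observed arousal tends to match the state
path = viterbi_chmm([0, 0, 1, 1], [0, 1, 1, 1], init, trans, cross, emit)
```

With five states and an asymmetric `cross` table, the same routine decodes both participants' engagement sequences jointly, exactly as the equation prescribes.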
4. Experiments and results

Several issues make the evaluation of computational assessments of emotion challenging. First, data from real-life scenarios is often difficult to acquire; much of the research on emotion in speech uses actors/actresses to simulate specific emotional states (as in LDC's data set). Second, emotional categories are quite ambiguous in their definitions, and different researchers propose different sets of categories. Third, even when there is agreement on a clear definition of emotion, labeling emotional speech is not straightforward. In a conversation, a speaker can be thought of as encoding his/her emotions in speech, and listeners can be thought of as decoding the emotional information from speech. However, the speaker and listeners may not agree on the emotion expressed or perceived in an utterance. Similarly, different listeners may infer different emotional states from the same utterance. All these factors make data collection and evaluation for emotion and engagement recognition much more complicated than for other statistical pattern recognition problems, such as visual object recognition and text mining, in which the ground truth is known.
4.1. Data and data coding

In this work, we used two sets of English-language speech corpora obtained from the Linguistic Data Consortium (LDC):

The LDC EMOTIONAL PROSODY corpus was produced by six professional actors/actresses expressing 14 discrete emotion types, with approximately 25 spoken utterances per emotion type. In our experiments, we focused on the seven discrete emotion types that are most important in our application: anger, panic, sadness, happy, interest, boredom, and neutral. Half of the utterances were used as training data and the other half as testing data.

The LDC CALLFRIEND corpus was collected by consensual recording of social telephone conversations between friends. We selected four dialogues that contained a range of affect and extracted usable subsets of approximately 10 minutes from each. Segmenting the subsets into utterances produced a total of 1011 utterances from four female speakers and 877 utterances from four male speakers. Five labelers were asked to listen to the individual utterances and provide four separate labels for each utterance: a discrete emotion type as a categorical value, and numerical values (on a discretized 1–5 scale) for each of arousal, valence, and engagement. We based the final labels for each utterance on the consensus of all the labelers.
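A consensus label of this kind can be computed as, for example, the modal label across annotators. The paper does not specify its exact consensus rule, so the majority vote and tie-breaking below are assumptions for illustration:

```python
from collections import Counter

def consensus_label(labels):
    """Majority vote across annotators; ties are broken toward the
    smaller label value so the result is deterministic."""
    counts = Counter(labels)
    best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best[0]

# Hypothetical arousal ratings (1-5 scale) from five labelers.
ratings = [3, 4, 3, 3, 2]
final = consensus_label(ratings)  # -> 3
```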
4.2. Results of emotion recognition

Table 1 shows the results of categorizing the emotional states of individual spoken utterances, including discrete emotion types as well as arousal and valence states. The recognition rates in the second and third rows (5 discrete types) show that, for the same feature extraction and machine learning method, performance on acted speech in speaker-dependent mode is 75%, but the accuracy drops to 51% in speaker-independent mode on spontaneous speech. Since many studies of speech emotion recognition focus on acted speech and speaker-dependent recognition, this comparison indicates the challenge of using emotion recognition in real-life settings and of generalizing methods developed under artificial conditions to person-independent natural scenarios. In addition, the difference between the first and second rows illustrates the influence of the number of emotion types on the classification results. The accuracy of recognizing arousal levels is reasonably good on natural speech data in speaker-independent mode, although the recognition rates for valence levels are not as good as for arousal. This is in line with related psychological studies, which suggest that valence is more difficult to recognize from acoustic cues alone than arousal.
Table 1: Classification with support vector machine. SI: speaker independent, SD: speaker dependent

features                EMOTIONAL PROSODY   CALLFRIEND
7 discrete types (SI)   69%                 -
5 discrete types (SI)   60%                 51%
5 discrete types (SD)   75%                 62%
5 arousal levels (SI)   -                   58%
3 arousal levels (SI)   -                   67%
3 valence levels (SI)   -                   54%
4.3. Results of engagement detection in continuous speech

Table 2 shows the results of detecting users' engagement states on a scale from 1 to 5. As a baseline, we trained an SVM classifier to directly categorize spoken utterances based on prosodic features and obtained 47% accuracy. A much better result (61%) was achieved using the multilevel structure. This significant difference reflects the facts that low-level prosodic features in speech are not sufficient indicators of users' engagement states, and that the model encodes the inherent continuity of the dynamics of users' engagement states by treating the problem in a continuous mode. Next, we included cross-participant influence by using the CHMM and achieved a smaller further improvement in performance. This is because the degree to which the participants in a conversation affect each other varies across people; it is therefore unlikely that we could completely encode this complex interaction with a simple model and limited training data. Considering that the data comes from spontaneous speech in telephone calls and the method does not encode any speaker-dependent information, the results are reasonably good and promising.
Table 2: Results of detecting engagement in continuous speech

            isolated SVM   HMM   coupled HMM
accuracy    47%            61%   63%
5. Conclusions

In this work, we proposed using affective information encoded in speech to estimate users' engagement in computer-mediated voice communications systems for mobile computing. To our knowledge, this is the first work that attempts to estimate users' engagement in telephone conversations. We tested this idea by developing a machine learning system that performs engagement detection in everyday dialogue and achieves reasonably good results.

The main technical contribution of this work is to estimate users' engagement with a novel multilevel structure. Compared with previous studies that focus on classifying users' emotional states based on individual spoken utterances, our method models the emotion recognition and engagement detection problem in a continuous mode. In addition, we encode the joint behavior of the participants in a conversation by utilizing a CHMM. We demonstrated that our method achieves much better results than one based only on the low-level acoustic signals of individual spoken utterances. A natural extension of the current work is to extract additional affective information from speech, such as valence and linguistic information, and include it among the observations of the high-level HMM to improve the overall performance of engagement detection.
6. References

[1] P. M. Aoki, M. Romaine, M. H. Szymanski, J. D. Thornton, D. Wilson, and A. Woodruff, "The Mad Hatter's Cocktail Party: A social mobile audio space supporting multiple conversations," in Proc. ACM SIGCHI Conf. ACM, 2003, pp. 425–432.
[2] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, no. 1–2, pp. 227–256, 2003.
[3] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. 4th ICSLP, vol. 3. IEEE, 1996, pp. 1970–1973.
[4] A. Batliner, K. Fisher, R. Huber, J. Spilker, and E. Noth, "Desperately seeking emotions: Actors, wizards, and human beings," in Proc. ISCA Workshop on Speech and Emotion. ISCA, 2000.
[5] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stolcke, "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," in Proc. 7th ICSLP, vol. 3. ISCA, 2002, pp. 2037–2040.
[6] C. M. Lee, S. Narayanan, and R. Pieraccini, "Combining acoustic and language information for emotion recognition," in Proc. 7th ICSLP, vol. 2. ISCA, 2002, pp. 873–876.
[7] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 32–80, 2001.
[8] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1–2, pp. 23–69, 2003.
[9] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.