Detecting User Engagement in Everyday Conversations

Chen Yu
Department of Computer Science
University of Rochester
Rochester, NY 14627
yu@cs.rochester.edu

Paul M. Aoki and Allison Woodruff
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304
{aoki,woodruff}@acm.org

Abstract

This paper presents a novel application of speech emotion recognition and engagement detection in computer-mediated voice communications systems. We utilize machine learning techniques, such as support vector machines (SVM) and hidden Markov models (HMM), to classify users' emotions in speech. We argue that using prosodic features alone is not sufficient for modeling users' engagement in conversations. In light of this, a multilevel architecture based on coupled HMMs is proposed to detect user engagement in spontaneous speech. The first level is comprised of SVM-based classifiers that are able to recognize discrete emotion types as well as arousal and valence states. A high-level HMM then uses those emotional states as input to estimate user engagement in conversations by decoding the internal states of the HMM. We report experimental results for the LDC Emotional Prosody and CALLFRIEND speech corpora.

1. Introduction

Telephones are used more and more often for everyday communication among family members and friends, and the spread of mobile phones has accelerated this trend. With the availability of these communication services, more and more people talk to each other remotely. Most work on computer-mediated voice communication systems for mobile computing focuses on practical issues, such as the bandwidth needed to support continuous network connections and robustness. In this work, we are interested in developing an intelligent user interface for future voice communication systems. One critical issue is that computers need to automatically detect the user's state from speech: we expect computers not only to understand what people say (speech recognition), but also to detect the degree of involvement in the communication, such as participants' feelings and interest toward the topics under discussion and toward each other. This interest is motivated in part by prior work on social mobile audio spaces, in which sensing engagement in conversations could help such systems better support their users [1].

In the context of this kind of remote communication, the primary input to a computational system is users' voices. Speech communication is not merely an exchange of words. In addition to carrying linguistic information, which consists of words and the rules of language, a user's speech provides implicit messages such as the speaker's gender, age, and physical condition, as well as the speaker's emotion and attitude toward the topic, the dialog partner, or the situation. In this work, we attempt to develop a computer system that is able to extract non-linguistic information from the user's speech and adjust communication channels and human-computer interfaces in response to the user's engagement in conversations.

The correlations between acoustic features (e.g., prosody) and emotional states have been studied in speech production and phonetics (for a review, see [2]). Recently, there has been growing interest in automatic speech emotion recognition. Dellaert et al. [3] implemented a method based on the majority voting of subspace specialists to classify acted spoken utterances into four types. Batliner et al. [4] provided a comparative study of recognizing two emotions, "neutral vs. anger," expressed by actors and naive subjects. Ang et al. [5] investigated the use of prosody for the detection of frustration and annoyance in natural dialogues. The method described by Lee et al. [6] combined acoustic features with emotionally salient keywords to categorize spoken utterances into two sets: negative and non-negative. A good review of emotion recognition can be found in [7].

The novelty of this work is to estimate users' engagement in dialogue by considering multiple cues. The central idea is that in everyday conversations, a participant's engagement state is influenced by his/her previous engagement state, his/her current emotional state, and the other participants' engagement states. It has been shown that emotional states are closely related to engagement levels (cite???). In addition, we argue that changes in engagement level should be modeled continuously: it is rare for people to abruptly change their interest in a topic or a speaker. Furthermore, a participant's engagement is naturally influenced by the other people in a conversation; for instance, a deeply engaged speaker usually makes the audience more involved. The advantages of integrating multiple cues to estimate users' engagement are twofold. First, we can obtain better accuracy by considering multiple sources of information, so that noise in one channel can be compensated for during the integration process. Second, when some information is not available, we can still estimate users' engagement from partial information. For example, a user may just listen to the speaker without uttering any speech; in this case, we cannot estimate the listener's engagement if the method depends purely on acoustic features extracted from his/her own speech. In our method, however, we can use the listener's previous engagement state and the talker's engagement level to estimate the listener's involvement in the communication.

In light of the above analysis, we propose a multilevel structure using support vector machine (SVM) and hidden Markov model (HMM) techniques, as shown in Figure 1. The first level of the architecture comprises SVM classifiers that take acoustic features as input and predict users' emotional states, such as arousal levels or discrete emotion types. Those emotional states are then used as input to a higher-level HMM that models the dynamic change of users' engagement in conversations, since emotion and engagement are conveyed on a continuous scale.


Figure 1: Overview of our approach to engagement detection: per-participant feature extraction and SVM classifiers feed a coupled HMM, which outputs the decoded engagement state sequence.

In addition, we apply the coupled HMM (CHMM) technique to capture the joint behavior of the participants and to model the influence of individual participants on each other. In this way, the method decodes users' engagement states in conversations by seamlessly integrating low-level prosodic, temporal, and cross-participant cues. In the rest of the paper, Section 2 presents our method of speech emotion recognition and Section 3 describes engagement detection based on the CHMM. The experimental results are reported and discussed in Section 4.

2. Speech emotion recognition

The first level classifies emotions through certain attributes of spoken communication. We focus on seven discrete emotion types in this study: anger, panic, sadness, happy, interest, boredom and neutral. In addition to labeling emotions as discrete categories, some researchers (e.g., in [7]) prefer to characterize emotions in terms of continuous dimensions. The two most commonly considered dimensions are arousal and valence. Arousal refers to the degree of intensity of the affect and ranges from sleep to excitement. Valence describes the pleasantness of the stimuli, such as positive (happy) and negative (sad). This work explores the classification of discrete emotion types as well as arousal and valence levels.
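As a concrete illustration of this classification step (not the authors' code), the following scikit-learn sketch trains a multi-class SVM on utterance-level acoustic feature vectors; the feature matrix X and label vector y below are random placeholders standing in for the features described later in this section and the labels described above.

    # Hedged sketch: multi-class SVM over utterance-level acoustic features.
    # X and y are hypothetical placeholders, not the paper's data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))       # placeholder acoustic feature vectors
    y = rng.integers(1, 6, size=200)     # placeholder arousal labels (1-5)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # RBF-kernel SVM; features are standardized because prosodic statistics
    # live on very different scales. SVC handles multi-class labels directly.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    clf.fit(X_tr, y_tr)
    print("accuracy:", clf.score(X_te, y_te))

The same pipeline applies unchanged whether y holds the seven discrete emotion labels, quantized arousal levels, or valence levels.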

Note that the practical motivations of this work require us to build a speaker-independent emotion detection system that has to deal with the enormous speaker variability in speech. People's willingness to display emotional responses and the way that they convey affective messages using speech vary widely across individuals. Thus, we need to find a set of robust features that are not only closely correlated with emotional categories but also invariant across different speakers.

We first segment continuous speech into spoken utterances. Then we use the Praat software package to extract the prosodic and energy profiles of each spoken utterance, which carry a large amount of emotional information. Next, seven kinds of acoustic features are extracted from each spoken utterance (a brief sketch of computing a subset of these statistics follows the list):

• Fundamental frequency (pitch): mean, maximum, minimum, standard deviation, range, 25th percentile, and 75th percentile.

• Derivative of pitch: mean, maximum, minimum, standard deviation, range, mean of the absolute pitch derivative, and standard deviation of the absolute derivative.

• Duration of pitch: ratio of the durations of voiced and unvoiced regions, mean number of frames in voiced regions, standard deviation of the number of frames in voiced regions, number of voiced regions, ratio of voiced to unvoiced frames, maximal number of frames in a voiced region, and mean of the maximum pitch in every region.

• Energy: mean, standard deviation, maximum, median, and energy in frequency bands (2 kHz).

• Derivative of energy: mean, standard deviation, maximum, median, and minimum.

• Duration of energy in non-silent regions: mean number of frames, standard deviation of the number of frames, ratio of non-silent frames, and maximum number of frames.

• Formants: the first three formant frequencies (F1, F2, F3) and their bandwidths.
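The paper extracts these profiles with Praat; as a minimal illustration (ours, not the authors' implementation), the numpy sketch below computes the pitch and pitch-derivative statistics from a frame-level F0 contour, assuming the Praat convention that unvoiced frames are marked with zeros.

    import numpy as np

    def pitch_stats(f0):
        """Pitch and delta-pitch statistics for one utterance.

        f0: 1-D array of frame-level F0 values in Hz, with 0 for unvoiced
        frames. Returns a 14-dimensional vector covering the first two
        bullets above."""
        voiced = f0[f0 > 0]
        if voiced.size < 2:
            return np.zeros(14)
        d = np.diff(voiced)                  # frame-to-frame pitch derivative
        feats = [
            voiced.mean(), voiced.max(), voiced.min(), voiced.std(),
            voiced.max() - voiced.min(),
            np.percentile(voiced, 25), np.percentile(voiced, 75),
            d.mean(), d.max(), d.min(), d.std(), d.max() - d.min(),
            np.abs(d).mean(), np.abs(d).std(),
        ]
        return np.array(feats)

The energy, duration, and formant statistics follow the same pattern from the corresponding Praat intensity and formant listings.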

We now have a multidimensional affective feature vector for each spoken utterance. The curse of dimensionality in high-dimensional classification is well known in machine learning: pruning irrelevant features holds more promise for a classifier that generalizes. We therefore transformed the original feature space into a lower-dimensional space using the RELIEF-F algorithm for feature selection [8] (a simplified sketch of this kind of feature scoring follows the list below). As mentioned above, we want to develop a speaker-independent emotion recognition system, which needs to deal with speaker variation. In practice, because prosodic features differ considerably between male and female speakers, we divided users into two groups by gender and used a different set of prosodic features for each group. The top 7 features for arousal level classification are as follows:

• Male: range of F2, range of pitch, maximum of pitch, energy >2000 Hz, maximum of voiced durations, standard deviation of the energy derivative, and maximum of energy.

• Female: mean of pitch, range of the pitch derivative, mean duration of voiced regions, energy …
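As a simplified illustration of this feature-scoring step (not the implementation from [8]; full ReliefF also weights nearest misses by class priors), the following numpy sketch computes ReliefF-style relevance scores from which the top 7 features per gender group could be selected.

    import numpy as np

    def relieff_scores(X, y, n_neighbors=10, n_probes=200, seed=0):
        """Simplified ReliefF: reward features that separate nearest misses,
        penalize features that separate nearest hits."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        span = X.max(axis=0) - X.min(axis=0)
        span[span == 0] = 1.0
        Xs = (X - X.min(axis=0)) / span          # scale features to [0, 1]
        w = np.zeros(d)
        for i in rng.choice(n, size=min(n_probes, n), replace=False):
            dist = np.abs(Xs - Xs[i]).sum(axis=1)
            dist[i] = np.inf                     # exclude the probe itself
            hits = np.where(y == y[i])[0]
            misses = np.where(y != y[i])[0]
            if misses.size == 0:
                continue
            hits = hits[np.argsort(dist[hits])][:n_neighbors]
            misses = misses[np.argsort(dist[misses])][:n_neighbors]
            w -= np.abs(Xs[hits] - Xs[i]).mean(axis=0)
            w += np.abs(Xs[misses] - Xs[i]).mean(axis=0)
        return w                                 # higher score = more relevant

    # top_7 = np.argsort(relieff_scores(X, y))[::-1][:7]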


3. Engagement detection

The outputs of those classifiers are then used as the observations of the high-level HMM. The HMM comprises five hidden states corresponding to degrees of engagement in conversations and models the temporal continuity of user engagement.

In addition, we applied a CHMM to describe the influence of the engagement state of one participant on the others. In the CHMM, each chain has five hidden states corresponding to engagement levels. The observations are the arousal levels of the participants in a conversation.

Given the sequences of arousal levels and engagement levels of the participants, the training procedure of the CHMM needs to estimate three kinds of probabilities:

• p(o_i | s_j), the probability of observing arousal level i in state j; this is a multinomial distribution and can be learned by simply counting the expected frequency of arousal level i occurring in state j.

• p(s^m_j | s^m_i), the within-chain transition probability of moving from state s_i to state s_j in chain m.

• p(s^m_j | s^n_i), the cross-participant influence probability of transitioning to state s_j in chain m given state s_i in chain n.

Note that currently the observations are discrete values (quantized arousal levels, etc.) and are modeled by multinomial distributions; a counting-based estimation sketch is given below. Gaussian mixture models could be used for continuous CHMM observations if the low-level classifiers provide probabilistic outputs.
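As an illustration of the counting-based estimation described above (our sketch, not the authors' code), the following assumes fully labeled training sequences of engagement states (0-4) and quantized arousal observations for each participant, with add-epsilon smoothing so that no probability is exactly zero.

    import numpy as np

    def estimate_chmm_params(states, obs, n_states=5, n_obs=5, eps=1e-3):
        """states, obs: lists (one per chain/participant) of equal-length
        integer sequences. Returns emission, within-chain transition, and
        cross-chain influence tables estimated by normalized counts."""
        n_chains = len(states)
        emit = np.full((n_chains, n_states, n_obs), eps)      # p(o | s), per chain
        trans = np.full((n_chains, n_states, n_states), eps)  # p(s_t | s_{t-1}), same chain
        cross = np.full((n_chains, n_chains, n_states, n_states), eps)  # p(s_t^m | s_{t-1}^n)
        T = len(states[0])
        for m in range(n_chains):
            for t in range(T):
                emit[m, states[m][t], obs[m][t]] += 1
                if t > 0:
                    trans[m, states[m][t - 1], states[m][t]] += 1
                    for n in range(n_chains):
                        if n != m:
                            cross[n, m, states[n][t - 1], states[m][t]] += 1
        # normalize each conditional distribution over its last axis
        emit /= emit.sum(axis=2, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
        cross /= cross.sum(axis=3, keepdims=True)
        return emit, trans, cross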

In the testing phase, acoustic features are fed into the low-level SVM classifiers, and the output arousal states are fed into the high-level CHMM. The decoded state sequences of the CHMM, obtained using the Viterbi algorithm, indicate the engagement states of the participants. Formally, assume that the CHMM consists of two chains corresponding to the two participants in a conversation, and let s^1_t and s^2_t be the engagement states of participant 1 and participant 2 at time t, respectively; o^1_t and o^2_t are their observations (arousal levels, etc.). The model predicts the current state s^1_t from the participant's own previous state s^1_{t-1}, the cross-chain influence s^2_{t-1}, and the new arousal observation o^1_t. Specifically, the probability of a combination of the two participants' states is

    p(s^1_t, s^2_t) = p(s^1_t | s^1_{t-1}) p(s^2_t | s^2_{t-1}) p(s^1_t | s^2_{t-1}) p(s^2_t | s^1_{t-1}) p(o^1_t | s^1_t) p(o^2_t | s^2_t)

In this way, our method is able to estimate both participants' engagement states simultaneously given the raw speech data of the conversation.
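The paper does not spell out the decoder in detail, so the following is one plausible realization under the factorization above (a sketch, not the authors' method): exact Viterbi decoding over the joint state space of the two chains (25 pairs for five engagement levels), reusing the tables from the counting sketch above.

    import numpy as np

    def decode_joint(obs1, obs2, emit, trans, cross, n_states=5):
        """Viterbi decoding over joint engagement states of two chains,
        scoring each transition with the factored probability
        p(s1|s1') p(s2|s2') p(s1|s2') p(s2|s1') p(o1|s1) p(o2|s2).
        obs1, obs2: integer observation sequences (e.g., arousal levels)."""
        T = len(obs1)
        pairs = [(a, b) for a in range(n_states) for b in range(n_states)]
        S = len(pairs)
        logd = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        for j, (a, b) in enumerate(pairs):       # uniform prior over joint states
            logd[0, j] = np.log(emit[0, a, obs1[0]]) + np.log(emit[1, b, obs2[0]])
        for t in range(1, T):
            for j, (a, b) in enumerate(pairs):
                e = np.log(emit[0, a, obs1[t]]) + np.log(emit[1, b, obs2[t]])
                best, arg = -np.inf, 0
                for i, (pa, pb) in enumerate(pairs):
                    score = (logd[t - 1, i]
                             + np.log(trans[0, pa, a]) + np.log(trans[1, pb, b])
                             + np.log(cross[1, 0, pb, a]) + np.log(cross[0, 1, pa, b]))
                    if score > best:
                        best, arg = score, i
                logd[t, j] = best + e
                back[t, j] = arg
        path = [int(np.argmax(logd[-1]))]        # backtrack the best joint path
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()
        return [pairs[j] for j in path]          # list of (s1_t, s2_t)

Exhaustive joint decoding is tractable here because each chain has only five states; approximate inference would be needed for many participants or larger state spaces.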

4. Experiments and results

Several issues make the evaluation of computational assessment of emotion challenging. First, data from real-life scenarios is often difficult to acquire; much of the research on emotion in speech uses actors/actresses to simulate specific emotional states (as in LDC's data set). Second, emotional categories are quite ambiguous in their definitions, and different researchers propose different sets of categories. Third, even when there is agreement on a clear definition of emotion, labeling emotional speech is not straightforward. In a conversation, a speaker can be thought of as encoding his/her emotions in speech and the listeners as decoding the emotional information from speech. However, the speaker and listeners may not agree on the emotion expressed or perceived in an utterance. Similarly, different listeners may infer different emotional states from the same utterance. All these factors make data collection and evaluation for emotion and engagement recognition much more complicated than for other statistical pattern recognition problems, such as visual object recognition and text mining, in which the ground truth is known.

4.1. Data and data coding

In this work, we used two sets of English-language speech corpora obtained from the Linguistic Data Consortium (LDC).

The LDC EMOTIONAL PROSODY corpus was produced by six professional actors/actresses expressing 14 discrete emotion types. There are approximately 25 spoken utterances per discrete emotion type. In our experiments, we focused on the seven discrete emotion types that are most important in our application: anger, panic, sadness, happy, interest, boredom and neutral. Half of the utterances were used as training data and the other half as testing data.

The LDC CALLFRIEND corpus was collected by the consensual recording of social telephone conversations between friends. We selected four dialogues that contained a range of affect and extracted usable subsets of approximately 10 minutes from each. Segmenting the subsets into utterances produced a total of 1011 utterances from four female speakers and 877 utterances from four male speakers. Five labelers were asked to listen to the individual utterances and provide four separate labels for each utterance: a discrete emotion type as a categorical value, and numerical values (on a discretized 1–5 scale) for each of arousal, valence and engagement. We based the final labels for each utterance on the consensus of all the labelers.
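The paper does not specify how the consensus was computed; as a purely illustrative sketch, one plausible scheme takes a majority vote for the categorical emotion label and the median for the 1–5 scales.

    from collections import Counter
    from statistics import median

    def consensus(labels):
        """labels: list of dicts, one per labeler, e.g.
        {"emotion": "interest", "arousal": 4, "valence": 3, "engagement": 4}.
        Majority vote for the categorical label, median for the 1-5 scales."""
        emotion = Counter(l["emotion"] for l in labels).most_common(1)[0][0]
        scales = {k: int(median(l[k] for l in labels))
                  for k in ("arousal", "valence", "engagement")}
        return {"emotion": emotion, **scales}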

4.2. Results of emotion recognition

Table 1 shows the results of categorizing the emotional states of individual spoken utterances, including discrete emotion types as well as arousal and valence states. The recognition rates in the second and third rows (5 discrete types) show that, for the same feature extraction and machine learning method, performance on acted speech in speaker-dependent mode is 75%, but accuracy drops to 51% in speaker-independent mode on spontaneous speech. Since many studies of speech emotion recognition focus on acted speech and speaker-dependent recognition, this comparison indicates the challenge of applying emotion recognition in real-life settings and of generalizing methods developed under artificial conditions to person-independent natural scenarios. In addition, the difference between the first and second rows illustrates the influence of the number of emotion types on the classification results. The accuracy of recognizing arousal levels is reasonably good on natural speech data in speaker-independent mode, although the recognition rates for valence levels are not as good as those for arousal. This is in line with related psychological studies, which have shown that valence is harder to recognize from acoustic cues alone than arousal (cite???).

Table 1: Classification with support vector machine. SI: speaker independent, SD: speaker dependent.

features                  EMOTIONAL PROSODY    CALLFRIEND
7 discrete types (SI)     69%                  -
5 discrete types (SI)     60%                  51%
5 discrete types (SD)     75%                  62%
5 arousal levels (SI)     -                    58%
3 arousal levels (SI)     -                    67%
3 valence levels (SI)     -                    54%


4.3. Results of engagement detection in continuous speech

Table 2 shows the results of detecting users' engagement states on a scale from 1 to 5. As a comparison, we trained an SVM classifier to directly categorize spoken utterances based on prosodic features and obtained 47% accuracy. A much better result (61%) was achieved by using the multilevel structure. The significant difference reflects the facts that low-level prosodic features are not sufficient indicators of users' engagement states and that the model encodes the inherent continuity of the dynamics of users' engagement by treating the problem in a continuous mode. Next, we included cross-participant influence by using the CHMM and achieved a smaller additional improvement. That is because the degree to which the participants in a conversation affect each other varies across people, so it is unlikely that we could completely encode this complex interaction with a simple model and limited training data. Considering that the data are spontaneous speech from telephone calls and that the method does not encode any speaker-dependent information, the results are reasonably good and promising.

Table 2: Results of detecting engagement in continuous speech.

            isolated SVM    HMM    coupled HMM
accuracy    47%             61%    63%

5. Conclusions

In this work, we proposed to use affective information encoded in speech to estimate users' engagement in computer-mediated voice communication systems for mobile computing. To our knowledge, this is the first work that attempts to estimate users' engagement in telephone conversations. We tested this idea by developing a machine learning system that performs engagement detection in everyday dialogue and achieves reasonably good results.

The main technical contribution of this work is to estimate users' engagement with a novel multilevel structure. Compared with previous studies that focus on classifying users' emotional states based on individual spoken utterances, our method models emotion recognition and engagement detection in a continuous mode. In addition, we encode the joint behavior of the participants in a conversation by using a CHMM. We demonstrated that our method achieves much better results than one based only on the low-level acoustic signals of individual spoken utterances. A natural extension of the current work is to extract additional affective information from speech, such as valence and linguistic information, and include it among the observations of the high-level HMM to improve the overall performance of engagement detection.

6. References

[1] P. M. Aoki, M. Romaine, M. H. Szymanski, J. D. Thornton, D. Wilson, and A. Woodruff, "The Mad Hatter's Cocktail Party: A social mobile audio space supporting multiple conversations," in Proc. ACM SIGCHI Conf. ACM, 2003, pp. 425–432.

[2] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, no. 1–2, pp. 227–256, 2003.

[3] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. 4th ICSLP, vol. 3. IEEE, 1996, pp. 1970–1973.

[4] A. Batliner, K. Fisher, R. Huber, J. Spilker, and E. Noth, "Desperately seeking emotions: Actors, wizards, and human beings," in Proc. ISCA Workshop on Speech and Emotion. ISCA, 2000.

[5] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stolcke, "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," in Proc. 7th ICSLP, vol. 3. ISCA, 2002, pp. 2037–2040.

[6] C. M. Lee, S. Narayanan, and R. Pieraccini, "Combining acoustic and language information for emotion recognition," in Proc. 7th ICSLP, vol. 2. ISCA, 2002, pp. 873–876.

[7] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 32–80, 2001.

[8] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1–2, pp. 23–69, 2003.

[9] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
