Detecting User Engagement in Everyday Conversations

Chen Yu
Department of Computer Science
University of Rochester
Rochester, NY 14627
yu@cs.rochester.edu

Paul M. Aoki and Allison Woodruff
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304
{aoki,woodruff}@acm.org

Abstract

This paper presents a novel application of speech emotion recognition and engagement detection in computer-mediated voice communications systems. We utilize machine learning techniques, such as support vector machines (SVMs) and hidden Markov models (HMMs), to classify users' emotions in speech. We argue that using prosodic features alone is not sufficient for modeling users' engagement in conversations. In light of this, a multilevel architecture based on coupled HMMs is proposed to detect user engagement in spontaneous speech. The first level is comprised of SVM-based classifiers that are able to recognize discrete emotion types as well as arousal and valence states. A high-level HMM then uses those emotional states as input to estimate user engagement in conversations by decoding the internal states of the HMM. We report experimental results for the LDC Emotional Prosody and CALLFRIEND speech corpora.
1. Introduction

Telephones are used more and more often for everyday communication among family members and friends, and the spread of mobile phones has accelerated this trend. With the availability of these communications services, more and more people talk to each other remotely. Most work on computer-mediated voice communications systems for mobile computing focuses on practical issues, such as the bandwidth needed to support continuous network connections, and robustness. In this work, we are interested in developing an intelligent user interface for future voice communication systems. One critical issue is that computers need to automatically detect users' status from speech. We expect that computers can not only understand what people say (speech recognition), but also detect the degree of involvement in communication, such as users' feelings and interest toward the topics under discussion and toward the other participants. Our interest builds on prior work on social audio spaces and engagement in conversations [1].
In the context of this kind of remote communication, the primary input to a computational system is users' voices. Speech communication is not merely an exchange of words. In addition to carrying linguistic information that consists of words and the rules of language, a user's speech provides implicit messages such as the speaker's gender, age, and physical condition, as well as the speaker's emotion and attitude toward the topic, the dialog partner, or the situation. In this work, we attempt to develop a computer system that is able to extract non-linguistic information from the user's speech and adjust user communication channels and human-computer interfaces in response to the user's engagement in conversations.
The correlations between acoustic features (e.g., prosody) and emotional states have been studied in speech production and phonetics (for a review, see [2]). Recently, there has been growing interest in automatic speech emotion recognition. Dellaert et al. [3] implemented a method based on the majority voting of subspace specialists to classify acted spoken utterances into four types. Batliner et al. [4] provided a comparative study of recognizing two emotions, "neutral vs. anger," expressed by actors and naive subjects. Ang et al. [5] investigated the use of prosody for the detection of frustration and annoyance in natural dialogues. The method described by Lee et al. [6] combined acoustic features with emotionally salient keywords to categorize spoken utterances into two sets: negative and non-negative. A good review of emotion recognition can be found in [7].
The novelty of this work is to estimate users' engagement in dialogue by considering multiple cues. The central idea is that in everyday conversations, a participant's engagement state is influenced by his/her previous engagement state, his/her current emotional state, and the other participants' engagement states. Emotional states are closely related to engagement levels. In addition, we argue that the change of engagement levels should be considered in a continuous mode: it is rare that people abruptly change their interest in a topic or a speaker. Furthermore, a participant's engagement is naturally influenced by the other people in a conversation. For instance, a deeply engaged speaker usually makes the audience more involved in the conversation. The advantages of integrating multiple cues to estimate users' engagement are twofold. First, we can achieve better accuracy by considering multiple information sources, so that noise in one channel can be removed during the integration process. Second, when some information is not available, we can still compute users' engagement based on partial information. For example, a user may just listen to the speaker's talk without uttering any speech himself/herself. In this case, we cannot estimate the listener's engagement if the method depends purely on the acoustic features of his/her speech. In our method, however, we can utilize the listener's previous engagement state and the talker's engagement level to estimate the listener's involvement in the communication.
In light of the above analysis, we propose a multilevel structure using support vector machine (SVM) and hidden Markov model (HMM) techniques, as shown in Figure 1. The first level of the architecture is comprised of SVM classifiers that take acoustic features as input and predict users' emotional states, such as arousal levels or discrete emotion types. Since emotion and engagement are conveyed on a continuous scale, those emotional states are then used as input to a higher-level HMM that models the dynamic change of users' engagement in conversations. In addition, we apply the coupled HMM (CHMM) technique to capture the joint behaviors of the participants and model the influence of individual participants on each other. In this way, the method decodes users' engagement states in conversations by seamlessly integrating low-level prosodic, temporal, and cross-participant cues. In the rest of the paper, Section 2 presents our method of speech emotion recognition and Section 3 describes engagement detection based on the CHMM. The experimental results are reported and discussed in Section 4.

Figure 1: Overview of our approach to engagement detection. [Diagram: for each participant, feature extraction feeds an SVM classifier; the classifier outputs feed a coupled HMM, which produces the decoded engagement state sequence.]
2. Speech emotion recognition

The first level classifies emotions based on certain attributes of spoken communication. We focus on seven discrete emotion types in this study: anger, panic, sadness, happy, interest, boredom, and neutral. In addition to labeling emotions as discrete categories, some researchers (e.g., in [7]) prefer to characterize emotions in terms of continuous dimensions. The two most commonly considered dimensions are arousal and valence. Arousal refers to the degree of intensity of the affect and ranges from sleep to excitement. Valence describes the pleasantness of the stimuli, such as positive (happy) or negative (sad). This work explores the classification of discrete emotion types as well as arousal and valence levels.

Note that the practical motivations of this work require us to build a speaker-independent emotion detection system that has to deal with the enormous speaker variability in speech. People's willingness to display emotional responses and the way they convey affective messages using speech vary widely across individuals. Thus, we need to find a set of robust features that are not only closely correlated with emotional categories but also invariant across different speakers.
We first segment continuous speech into spoken utterances. Then we use the Praat software package to extract the prosodic and energy profiles of each spoken utterance, which carry a large amount of emotional information. Next, seven kinds of acoustic features are extracted from each spoken utterance:

• Fundamental frequency (pitch): mean, maximum, minimum, standard deviation, range, 25th percentile, and 75th percentile.
• Derivative of pitch: mean, maximum, minimum, standard deviation, range, mean of the absolute pitch derivative, and standard deviation of the absolute derivative.
• Duration of pitch: ratio of the duration of voiced and unvoiced regions, mean number of frames in voiced regions, standard deviation of the number of frames in voiced regions, number of voiced regions, ratio of voiced and unvoiced frames, maximal number of frames in a voiced region, and mean of the maximum pitch in every region.
• Energy: mean, standard deviation, maximum, median, and energy in frequency bands (around 2 kHz).
• Derivative of energy: mean, standard deviation, maximum, median, and minimum.
• Duration of energy in non-silent regions: mean number of frames, standard deviation of the number of frames, ratio of non-silent frames, and maximum number of frames.
• Formants: the first three formant frequencies (F1, F2, F3) and their bandwidths.
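For illustration, the pitch statistics above can be computed from a frame-level F0 track. The sketch below assumes a hypothetical per-frame pitch array of the kind Praat produces (unvoiced frames marked 0); it is not the paper's actual feature extractor:

```python
import numpy as np

def pitch_features(f0):
    """Summary statistics of a frame-level F0 track (Hz).
    Unvoiced frames are marked 0 and excluded from the statistics."""
    voiced = f0[f0 > 0]
    return {
        "mean": voiced.mean(),
        "max": voiced.max(),
        "min": voiced.min(),
        "std": voiced.std(),
        "range": voiced.max() - voiced.min(),
        "p25": np.percentile(voiced, 25),
        "p75": np.percentile(voiced, 75),
    }

# Hypothetical 10-frame F0 track: two voiced regions separated by silence.
f0 = np.array([0.0, 210.0, 220.0, 230.0, 0.0, 0.0, 180.0, 190.0, 200.0, 0.0])
feats = pitch_features(f0)
```

The same pattern extends directly to the energy and duration features, which are likewise summary statistics over frames or regions.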
We now have multidimensional affective features for each spoken utterance. The curse of dimensionality in high-dimensional classification is well known in machine learning, and pruning irrelevant features holds more promise for generalizable classification. We transformed the original feature space into a lower-dimensional space using the RELIEF-F algorithm for feature selection [8]. As mentioned above, we want to develop a speaker-independent emotion recognition system that needs to deal with speaker variation. In practice, due to the big differences in prosodic features between male and female speakers, we divided users into two groups based on gender and used different sets of prosodic features for each group. The top 7 features for arousal level classification are as follows:
• Male: range of F2, range of pitch, maximum of pitch, energy >2000 Hz, maximum of voiced durations, standard deviation of the derivative of energy, and maximum of energy.
• Female: mean of pitch, range of the derivative of pitch, mean duration of voiced regions, energy ...

3. Engagement detection

The low-level SVM classifiers predict arousal levels from the acoustic features of each utterance; the outputs of those classifiers are then used as the observations of the high-level HMM. The HMM is comprised of five hidden states corresponding to degrees of engagement in conversations and models the temporal continuity of user engagement.
In addition, we applied a CHMM to describe the influence of the engagement state of one participant on the others. In the CHMM, each chain has 5 hidden states corresponding to engagement levels. The observations are the arousal levels of the participants in a conversation.
Given the sequences of arousal levels and engagement levels of the participants, the training procedure of the CHMM needs to estimate three kinds of probabilities:

• p(o_i | s_j) is the probability of observing arousal level i in state j; it is a multinomial distribution and can be learned by simply counting the expected frequency of arousal levels occurring in state j.
• p(s_j^m | s_i^m) is the transition probability of taking the transition from state s_i to state s_j within chain m.
• p(s_j^m | s_i^n) is the cross-participant influence probability of taking the transition to state s_j in chain m given state s_i in chain n.
Note that currently the observations are discrete values (quantized arousal levels, etc.) and are modeled by multinomial distributions. We could utilize Gaussian mixture models for continuous observations of the CHMM in case the low-level classifiers provide probabilistic categories.
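The counting procedure described above can be sketched as follows for one chain's parameters. The toy sequences and the add-one smoothing are illustrative assumptions, not details from the paper:

```python
def _normalize(rows):
    # Convert counts to conditional probabilities (each row sums to 1).
    return [[c / sum(row) for c in row] for row in rows]

def estimate_chmm_probs(states_a, states_b, obs_a, n_states=5, n_obs=5):
    """Frequency-counting estimates, with add-one smoothing, for chain A:
    emission p(o|s), within-chain transition p(s_t|s_{t-1}),
    and cross-chain influence p(s_t in A | s_{t-1} in B)."""
    emit = [[1.0] * n_obs for _ in range(n_states)]
    trans = [[1.0] * n_states for _ in range(n_states)]
    cross = [[1.0] * n_states for _ in range(n_states)]
    for s, o in zip(states_a, obs_a):
        emit[s][o] += 1.0
    for prev, nxt in zip(states_a, states_a[1:]):
        trans[prev][nxt] += 1.0
    for prev_b, nxt_a in zip(states_b, states_a[1:]):
        cross[prev_b][nxt_a] += 1.0
    return _normalize(emit), _normalize(trans), _normalize(cross)

# Toy labeled sequences: engagement states and arousal observations, 0-4 scale.
states_a = [0, 0, 1, 2, 2, 3]
states_b = [1, 1, 2, 2, 3, 3]
obs_a    = [0, 1, 1, 2, 3, 3]
emit, trans, cross = estimate_chmm_probs(states_a, states_b, obs_a)
```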
In the testing phase, acoustic features are fed into the low-level SVM classifiers, and the output arousal states are fed into the high-level CHMM. The decoded state sequences of the CHMM, obtained using the Viterbi algorithm, indicate the engagement states of the participants. Formally, assume that the CHMM consists of two chains corresponding to the two participants in a conversation, and let s_t^1 and s_t^2 be the engagement states of participant 1 and participant 2 at time t, respectively. o_t^1 and o_t^2 are the observations (arousal levels, etc.) of the participants. The model predicts the current state s_t^1 based on its own previous state s_{t-1}^1, the cross-channel influence s_{t-1}^2, and the new observation of arousal level o_t^1. Specifically, the probability of the combination of the two participants' states is as follows:

p(s_t^1, s_t^2) = p(s_t^1 | s_{t-1}^1) p(s_t^2 | s_{t-1}^2) p(s_t^1 | s_{t-1}^2) p(s_t^2 | s_{t-1}^1) p(o_t^1 | s_t^1) p(o_t^2 | s_t^2)

In this way, our method is able to estimate the two participants' engagement states simultaneously given the raw speech data in the conversation.
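Under the factored joint probability above, decoding can be carried out with a Viterbi pass over the product state space of the two chains. The sketch below makes simplifying assumptions (two engagement states instead of five, shared parameter tables for both chains, neutral coupling in the demo); all names are illustrative:

```python
import itertools

def viterbi_chmm(obs_a, obs_b, init, trans, cross, emit):
    """Joint Viterbi decoding for a two-chain coupled HMM.
    Each joint state is a pair (i, j); the transition score is the product of
    within-chain and cross-chain factors, matching the factored probability."""
    n = len(init)
    joint = list(itertools.product(range(n), range(n)))
    # Initialization with the first observations of both chains.
    delta = {(i, j): init[i] * init[j] * emit[i][obs_a[0]] * emit[j][obs_b[0]]
             for i, j in joint}
    back = []
    for t in range(1, len(obs_a)):
        new_delta, ptr = {}, {}
        for i, j in joint:
            score, best = max(
                (delta[(pi, pj)] * trans[pi][i] * trans[pj][j]
                 * cross[pj][i] * cross[pi][j], (pi, pj))
                for pi, pj in joint)
            new_delta[(i, j)] = score * emit[i][obs_a[t]] * emit[j][obs_b[t]]
            ptr[(i, j)] = best
        delta, back = new_delta, back + [ptr]
    # Backtrack from the best final joint state.
    state = max(delta, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

init  = [0.5, 0.5]
trans = [[0.7, 0.3], [0.3, 0.7]]   # sticky engagement dynamics
cross = [[0.5, 0.5], [0.5, 0.5]]   # neutral coupling, for a simple demo
emit  = [[0.9, 0.1], [0.1, 0.9]]   # observed arousal tends to match the state
path = viterbi_chmm([0, 0, 1, 1], [0, 1, 1, 1], init, trans, cross, emit)
```

With five states and an asymmetric `cross` table, the same routine decodes both participants' engagement sequences jointly, exactly as the equation prescribes.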
4. Experiments and results

Several issues make the evaluation of computational assessments of emotion challenging. First, data from real-life scenarios is often difficult to acquire; much of the research on emotion in speech uses actors/actresses to simulate specific emotional states (as in LDC's data set). Second, emotional categories are quite ambiguous in their definitions, and different researchers propose different sets of categories. Third, even when there is agreement on a clear definition of emotion, labeling emotional speech is not straightforward. In a conversation, a speaker can be thought of as encoding his/her emotions in speech, and listeners can be thought of as decoding the emotional information from speech. However, the speaker and listeners may not agree on the emotion expressed or perceived in an utterance. Similarly, different listeners may infer different emotional states from the same utterance. All these factors make data collection and evaluation for emotion and engagement recognition much more complicated than for other statistical pattern recognition problems, such as visual object recognition and text mining, in which the ground truth is known.
4.1. Data and data coding

In this work, we used two sets of English-language speech corpora obtained from the Linguistic Data Consortium (LDC):

The LDC EMOTIONAL PROSODY corpus was produced by six professional actors/actresses expressing 14 discrete emotion types, with approximately 25 spoken utterances per emotion type. In our experiments, we focused on the seven discrete emotion types that are most important in our application: anger, panic, sadness, happy, interest, boredom, and neutral. Half of the utterances were used as training data and the other half as testing data.

The LDC CALLFRIEND corpus was collected by consensual recording of social telephone conversations between friends. We selected four dialogues that contained a range of affect and extracted usable subsets of approximately 10 minutes from each. Segmenting the subsets into utterances produced a total of 1011 utterances from four female speakers and 877 utterances from four male speakers. Five labelers were asked to listen to the individual utterances and provide four separate labels for each utterance: a discrete emotion type as a categorical value, and numerical values (on a discretized 1–5 scale) for each of arousal, valence, and engagement. We based the final labels for each utterance on the consensus of all the labelers.
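A consensus label of this kind can be computed as, for example, the modal label across annotators. The paper does not specify its exact consensus rule, so the majority vote and tie-breaking below are assumptions for illustration:

```python
from collections import Counter

def consensus_label(labels):
    """Majority vote across annotators; ties are broken toward the
    smaller label value so the result is deterministic."""
    counts = Counter(labels)
    best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best[0]

# Hypothetical arousal ratings (1-5 scale) from five labelers.
ratings = [3, 4, 3, 3, 2]
final = consensus_label(ratings)  # -> 3
```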
4.2. Results of emotion recognition

Table 1 shows the results of categorizing the emotional states of individual spoken utterances, including discrete emotion types as well as arousal and valence states. The recognition rates in the second and third rows (5 discrete types) show that, for the same feature extraction and machine learning method, performance on acted speech in speaker-dependent mode is 75%, but the accuracy drops to 51% in speaker-independent mode on spontaneous speech. Since many studies of speech emotion recognition focus on acted speech and speaker-dependent recognition, this comparison indicates the challenge of using emotion recognition in real-life settings and of generalizing methods developed under artificial conditions to person-independent natural scenarios. In addition, the difference between the first and second rows illustrates the influence of the number of emotion types on the classification results. The accuracy of recognizing arousal levels is reasonably good on natural speech data in speaker-independent mode, although the recognition rates for valence levels are not as good as for arousal. This is in line with related psychological studies, which suggest that valence is more difficult to recognize from acoustic cues alone than arousal.
Table 1: Classification with support vector machine. SI: speaker independent, SD: speaker dependent

features                EMOTIONAL PROSODY   CALLFRIEND
7 discrete types (SI)   69%                 -
5 discrete types (SI)   60%                 51%
5 discrete types (SD)   75%                 62%
5 arousal levels (SI)   -                   58%
3 arousal levels (SI)   -                   67%
3 valence levels (SI)   -                   54%
4.3. Results of engagement detection in continuous speech

Table 2 shows the results of detecting users' engagement states on a scale from 1 to 5. As a baseline, we trained an SVM classifier to directly categorize spoken utterances based on prosodic features and obtained 47% accuracy. A much better result (61%) was achieved using the multilevel structure. This significant difference reflects the facts that low-level prosodic features in speech are not sufficient indicators of users' engagement states, and that the model encodes the inherent continuity of the dynamics of users' engagement states by treating the problem in a continuous mode. Next, we included cross-participant influence by using the CHMM and achieved a smaller further improvement in performance. This is because the degree to which the participants in a conversation affect each other varies across people; it is therefore unlikely that we could completely encode this complex interaction with a simple model and limited training data. Considering that the data comes from spontaneous speech in telephone calls and the method does not encode any speaker-dependent information, the results are reasonably good and promising.
Table 2: Results of detecting engagement in continuous speech

            isolated SVM   HMM   coupled HMM
accuracy    47%            61%   63%
5. Conclusions

In this work, we proposed using affective information encoded in speech to estimate users' engagement in computer-mediated voice communications systems for mobile computing. To our knowledge, this is the first work that attempts to estimate users' engagement in telephone conversations. We tested this idea by developing a machine learning system that performs engagement detection in everyday dialogue and achieves reasonably good results.

The main technical contribution of this work is to estimate users' engagement with a novel multilevel structure. Compared with previous studies that focus on classifying users' emotional states based on individual spoken utterances, our method models the emotion recognition and engagement detection problem in a continuous mode. In addition, we encode the joint behavior of the participants in a conversation by utilizing a CHMM. We demonstrated that our method achieves much better results than one based only on the low-level acoustic signals of individual spoken utterances. A natural extension of the current work is to extract additional affective information from speech, such as valence and linguistic information, and include it among the observations of the high-level HMM to improve the overall performance of engagement detection.
6. References

[1] P. M. Aoki, M. Romaine, M. H. Szymanski, J. D. Thornton, D. Wilson, and A. Woodruff, "The Mad Hatter's Cocktail Party: A social mobile audio space supporting multiple conversations," in Proc. ACM SIGCHI Conf. ACM, 2003, pp. 425–432.
[2] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, no. 1–2, pp. 227–256, 2003.
[3] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in Proc. 4th ICSLP, vol. 3. IEEE, 1996, pp. 1970–1973.
[4] A. Batliner, K. Fisher, R. Huber, J. Spilker, and E. Noth, "Desperately seeking emotions: Actors, wizards, and human beings," in Proc. ISCA Workshop on Speech and Emotion. ISCA, 2000.
[5] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stolcke, "Prosody-based automatic detection of annoyance and frustration in human-computer dialog," in Proc. 7th ICSLP, vol. 3. ISCA, 2002, pp. 2037–2040.
[6] C. M. Lee, S. Narayanan, and R. Pieraccini, "Combining acoustic and language information for emotion recognition," in Proc. 7th ICSLP, vol. 2. ISCA, 2002, pp. 873–876.
[7] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 32–80, 2001.
[8] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, no. 1–2, pp. 23–69, 2003.
[9] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.