Download Master Thesis - EMBOTS - DFKI


SAARLAND UNIVERSITY
Faculty of Natural Science and Technology I
Department of Computer Science
Master's Program in Computer Science

Master's Thesis

Embodied Presentation Teams:
A plan-based approach for affective sports commentary in real-time

submitted by
Ivan Gregor
on March 1, 2010

Supervisor
Prof. Wolfgang Wahlster

Advisor
Dr. Michael Kipp

Reviewers
Prof. Wolfgang Wahlster
Dr. Michael Kipp


Statement

I hereby confirm that this thesis is my own work and that I have documented all sources used.

Signed:

Date:

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Signed:

Date:


Abstract

Virtual agents are essential representatives of multimodal user interfaces. This thesis presents the IVAN system (Intelligent Interactive Virtual Agent Narrators), which generates affective real-time commentary on a tennis game given as an annotated video. The system employs two distinguishable virtual agents with different roles (TV commentator, expert), personality profiles, and positive, neutral, or negative attitudes towards the players. The system uses an HTN planner to generate dialogues, which makes it possible to plan large dialogue contributions and to generate alternative plans. The system can also interrupt the current discourse when a more important event happens. The current affect of the virtual agents is conveyed through lexical selection, facial expression, and gestures. The system integrates background knowledge about the players and the tournament as well as pre-defined user questions. We have focused on dialogue planning, knowledge processing, and behaviour control of the virtual agents; commercial products have been used as the audio-visual component of the system.

A demo version of the IVAN system was accepted for GALA 2009, which was part of the 9th International Conference on Intelligent Virtual Agents. We have verified that an HTN planner can be employed to generate affective commentary on a continuous sports event in real-time. However, while HTN planning is well suited to generating large dialogue contributions, expert systems are better suited to producing commentary on a rapidly changing environment. Most parts of the system are domain-dependent; however, the same architecture can be reused to implement applications such as interactive tutoring systems, tourist guides, or guides for the blind.



Acknowledgements

First of all, I would like to thank Michael Kipp and Jan Miksatko for being very helpful and inspiring supervisors. Thanks as well to the DFKI for providing the opportunity to work on this project, the necessary equipment, and funding to attend the GALA competition and the IVA conference. Thank you also to Charamel GmbH and Nuance Communications, Inc., for providing the Charamel virtual agents Mark and Gloria and the RealSpeak Solo software with the Tom and Serena voices, respectively. Finally, I would like to thank my parents for being very supportive during my studies in Prague and Saarbruecken.



Contents

Abstract i
Acknowledgements ii
List of Figures v
List of Tables vii

1 Introduction 1
1.1 Motivation 1
1.2 GALA 2009 Challenge 2
1.3 IVAN System 5
1.4 Research Aims 6

2 Related Work 8
2.1 ERIC 8
2.1.1 The Affect Module 9
2.1.2 The Natural Language Generation Module 9
2.2 DEIRA 10
2.3 Spectators 11
2.4 STEVE 11
2.5 Presentation Teams 12
2.5.1 Design of Presentation Teams 13
2.5.2 Inhabited Marketplace 13
2.5.3 Rocco II 14

3 Methods for Controlling Behaviour of Virtual Agents 16
3.1 Hierarchical Task Network Planning 16
3.1.1 Example of a Planning Task 17
3.1.2 Java Simple Hierarchical Ordered Planner (JSHOP) 19
3.1.3 JSHOP Language 20
3.2 Expert Systems 22
3.3 Statecharts 24

4 Generating Dialogue 26
4.1 Commentary Planning 26
4.1.1 Motivation 26
4.1.2 Dialogue Planning 28
4.1.3 Planning Tree 31
4.1.4 Commentary Excerpt 33
4.2 Affect 34
4.2.1 Motivation 34
4.2.2 Planning with Attitude 35
4.2.3 OCC Generated Emotions 36
4.2.4 Discussion 39

5 Architecture 41
5.1 System Overview 41
5.1.1 Design Aims 42
5.1.2 System Architecture 42
5.1.3 Off-the-shelf Components 44
5.2 Tennis Simulator 45
5.3 Plan Generation 47
5.3.1 Event Manager 48
5.3.2 Background Knowledge 52
5.3.3 Discourse Planner 53
5.4 Plan Execution 55
5.4.1 Template Manager 56
5.4.2 Avatar Manager 58
5.4.3 Output Manager 60

6 Discussion 62
6.1 Comparison with the ERIC system 62
6.2 Evaluation in Terms of Research Aims 63
6.3 Comparison JSHOP vs Jess 66

7 Conclusion 68
7.1 Summary 68
7.2 Future Work 69

A Commentary Excerpt 72


List of Figures

1.1 Event Position Specification 3
1.2 Example of an ANVIL File 4
2.1 ERIC commenting on a Horse Race 9
2.2 DEIRA (Dynamic Engaging Intelligent Reporter Agent) 10
2.3 STEVE in a 3D Simulated Student's Work Environment 12
2.4 Example of a Planning Method (Dialogue Scheme) to Discuss an Attribute Value 14
2.5 Excerpt of the Domain Knowledge 14
2.6 Gerd and Metze commenting on a RoboCup Soccer Game 15
3.1 Example of a Planning Task - HTN 18
3.2 Example of a Planning Task - Generated Plan 18
3.3 JSHOP Input Generation Process 19
3.4 Sample JSHOP Axiom 20
3.5 Sample JSHOP Operator 21
3.6 Sample JSHOP Method 22
3.7 Overview of the COHIBIT System 25
4.1 Example of a Planning Method 28
4.2 Example of a Compound Task Decomposition 30
4.3 Possible Decompositions of a Compound Task 31
4.4 Decomposition of the Goal Task "Comment" 32
4.5 Decomposition of the Subgoal Task Comment on Rally 32
4.6 Decomposition of the Goal Task "Comment" that leads to a Subgoal Task Drop Volley 33
4.7 Emotion Module GUI 38
5.1 IVAN Architecture 43
5.2 Dataflow 43
5.3 Charamel Virtual Agents Mark and Gloria 45
5.4 Tennis Simulator 46
5.5 Tennis Simulator GUI 46
5.6 IVAN Architecture - Plan Generation 47
5.7 Dataflow - Plan Generation 48
5.8 States of the Tennis Game 49
5.9 Tennis Score Counting using a Finite State Machine 50
5.10 Hierarchy of Facts from which an Ace can be deduced 52
5.11 JSHOP Input Generation Process 55
5.12 IVAN Architecture - Plan Execution 56
5.13 Dataflow - Plan Execution 56


List of Tables

1.1 Tennis Events 3
1.2 Event Position Specification 3
1.3 Track Element Specification 4
4.1 Dialogue Schemes 29
4.2 Example of Generated Dialogues based on different Appraisals 36
4.3 Description of the eight Basic OCC Emotions 37
4.4 Five Personality Traits 37
4.5 Example of Events that elicit respective Emotions 38
5.1 Description of the Tennis Counting Terminology 50
5.2 Example of high-level facts deduced from low-level facts 52
5.3 Examples of Facts deduced from the Background Knowledge 53


Chapter 1

Introduction

This thesis presents the IVAN system (Intelligent Interactive Virtual Agent Narrators), which provides affective commentary on a continuous sports event in real-time. We have employed two virtual agents that engage in dialogues to comment on a tennis game that was given as the GALA 2009 challenge (see section 1.2). The virtual agents can have different attitudes towards the players, and their current affective state can be conveyed through lexical selection, facial expression, and gestures. We have focused on the knowledge processing, dialogue planning, and behaviour control of the virtual agents, and have used commercial software as the audio-visual component of the system. In the following sections, we explain why it is beneficial to employ virtual agents, describe our task as given by the GALA 2009 challenge, outline the IVAN system, and state our research aims.

1.1 Motivation

Multimodal user interfaces are becoming more and more important in human-machine communication. Essential representatives of such interfaces are virtual agents, which aim to act like humans in the way they employ gestures, gaze, facial expression, posture, and prosody to convey facts in face-to-face communication with a user [1]. Face-to-face interaction over such a rich communication channel is often considered an exclusively human domain; for instance, when people have something important to say, they say it in person. To generate such complex behaviour, it is important to endow a virtual agent with emotions, since the agent then becomes more believable to humans, and a system that employs such agents becomes more entertaining and enjoyable for its users [2]. Virtual agents can be employed in many fields, such as computer games, tutoring systems, virtual training environments [3], storytelling systems [4, 5], advertisement, automated presenters [6, 7, 8, 9], and commentators [10, 11].


In this thesis, we have focused on commentary agents. Moreover, we have employed a presentation team [6], i.e., several distinguishable virtual agents with different personality profiles, roles, and goals. This enriches the communication strategies, and the information being conveyed can be distributed across several virtual agents in the form of a dialogue. It is particularly important to endow the virtual agents of a presentation team with emotions, since this makes them more distinguishable, and distinct virtual agents can better represent different roles and opposing points of view. A presentation team is also more advantageous than a single virtual agent because its performance is more entertaining for the audience, provides better understanding, and improves recall of the presented information.

An additional advantage of virtual commentary agents is that they can run locally on a user's computer, so the commentary can be partly customized: the user can adjust the basic settings of a commentary. Employing virtual agents as a presentation team to comment on a sports event is therefore an attractive approach.

1.2 GALA 2009 Challenge

In this section, we introduce our task, which was given as the GALA 2009 challenge (Gathering of Animated Lifelike Agents).1 The GALA event is a part of the annual International Conference on Intelligent Virtual Agents (IVA).2 The aim of GALA is to encourage students to implement a system that provides behaviourally complex commentary on a continuous stream of events in real-time. The challenge of GALA 2009 was to provide a commentary on a tennis game that was given as an annotated video. In previous years, the GALA challenge was to comment on a horse race produced by a horse race simulator.

The events that occur in the video of a tennis game are manually annotated with the ANVIL tool [12] and stored in an ANVIL file. The ANVIL file contains timestamped events that are grouped into tracks, where each track contains events with the same source; specifically, there is one track for the ball and one track for each player. Table 1.1 lists all events that can be annotated.

Each event is further specified with the place on the tennis court where it happened. Table 1.2 contains the attributes that specify the position of a ball or a player, and Figure 1.1 depicts these tags in a picture of a tennis court.

1 http://hmi.ewi.utwente.nl/gala
2 http://iva09.dfki.de/



Player events: throw, serve, forehand, backhand, forehand-volley, backhand-volley, smash, miss
Ball events: shot, cross net, hit net, hit tape, bounce, fault, out

Table 1.1: Tennis Events

Position side: server, receiver
Position longitudinal: net, mid court, baseline
Position lateral: left, middle, right
Position height: low, middle, high

Table 1.2: Event Position Specification

Figure 1.1: Event Position Specification

Each event with its timestamp and position specification stands for a track element. Table 1.3 lists the attributes of each track element.



Ball track element: timestamp, ball event, position lateral, position longitudinal, position side, position height
Player track element: timestamp, player event, position lateral, position longitudinal

Table 1.3: Track Element Specification

Figure 1.2: Example of an ANVIL File

Figure 1.2 shows two excerpts from an ANVIL file. The left column is an example of a ball track and the right column is an example of a track of the first player. As we can see, each track consists of track elements, where each track element represents one event. Furthermore, each track element has a start time and an end time. Whilst the start time of an event corresponds to its timestamp, the end time of an event can be omitted, since all events can be considered instantaneous. The ball track describes that a ball was shot on the right side of the baseline on the server side at time 7.49 sec, then the ball crossed the net in the middle, bounced in the middle of the mid-court on the receiver side, and was then shot on the right side of the baseline. The player track describes that the player is throwing a ball on the right side of the baseline at time 7.4 sec, then he is serving. Later on, the player is playing a forehand on the right side of the baseline, and then a backhand from the left side of the baseline.
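The track layout above can be sketched as a small data structure. The following Python sketch (class and field names are our own, not part of ANVIL or IVAN, and timestamps after the first are illustrative) builds the ball track described in Figure 1.2 and looks up which side the ball bounced on.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackElement:
    """One annotated event; end times are omitted since events are instantaneous."""
    timestamp: float              # start time in seconds
    event: str                    # e.g. "shot", "bounce" (ball) or "serve" (player)
    lateral: str                  # left | middle | right
    longitudinal: str             # net | mid court | baseline
    side: Optional[str] = None    # server | receiver (ball track only)
    height: Optional[str] = None  # low | middle | high (ball track only)

# Ball track from the Figure 1.2 walkthrough (only 7.49 s is from the text)
ball_track = [
    TrackElement(7.49, "shot", "right", "baseline", side="server"),
    TrackElement(8.10, "cross net", "middle", "net"),
    TrackElement(8.60, "bounce", "middle", "mid court", side="receiver"),
    TrackElement(9.20, "shot", "right", "baseline"),
]

bounces = [e for e in ball_track if e.event == "bounce"]
print(bounces[0].side)  # -> receiver
```

Grouping elements per source like this mirrors the ANVIL notion of a track: a time-ordered list of elements that share the same attribute set.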



1.3 IVAN System

In this section, we introduce the IVAN system (Intelligent Interactive Virtual Agent Narrators) [13], which we have developed to produce affective, behaviourally complex commentary on a continuous sports event in real-time. The system was employed to comment on a tennis game that was given as the GALA 2009 challenge. We have employed a presentation team in the sense of André and Rist [6], in our case two virtual agents with different roles (TV commentator, expert) reflecting two different presentation styles, different attitudes towards the players (positive, neutral, negative), and different personality profiles, to jointly comment on a tennis game. One virtual agent can interrupt the other or himself/herself when a more important event happens. The system also integrates background knowledge about the players and the tournament. Moreover, the user can pose one of the pre-defined questions at any time. We have focused on the knowledge processing, dialogue planning, and behaviour control of the virtual agents, and have used commercial software as the audio-visual component of the system.

The IVAN system consists of several modules that run in separate threads and communicate via shared queues. We employed an HTN planner to generate dialogues, statecharts to simulate the basic states of the game, and expert systems to maintain the emotional state of each virtual agent. When the system starts, the tennis simulator reads an ANVIL [12] file that contains the description of a tennis game and sends timestamped events (e.g. a player plays a forehand, the ball hits the net) at the time they occur to the input interface of the core system. The core system transforms these elementary events into low-level facts (e.g. which player just scored) that form the knowledge base for the HTN planner and the emotion module. Generated plans that represent possible dialogues are transformed into individual utterances and annotated with gestures. The current emotional state of a virtual agent is used to derive his/her facial expression. Annotated utterances, along with the corresponding facial expression tags, are sent to the audio-visual component, which creates the multimodal output of the system.
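The threads-plus-shared-queues pipeline described above can be sketched as follows. This is our own minimal Python illustration, not IVAN code: the event names, fact tuples, and utterance strings are invented, and each stage is reduced to a single rule.

```python
import queue
import threading

events = queue.Queue()      # tennis simulator -> core system
facts = queue.Queue()       # low-level facts -> planner / emotion module
utterances = queue.Queue()  # planned output -> audio-visual component

def event_manager():
    """Transform elementary events into low-level facts (much simplified)."""
    while True:
        ev = events.get()
        if ev is None:              # shutdown marker, pass it downstream
            facts.put(None)
            return
        if ev == "ace":             # e.g. a winning serve becomes a 'scored' fact
            facts.put(("scored", "player1"))

def planner():
    """Turn facts into utterances (stands in for the HTN planner stage)."""
    while True:
        fact = facts.get()
        if fact is None:
            utterances.put(None)
            return
        utterances.put(f"What a point by {fact[1]}!")

threads = [threading.Thread(target=event_manager), threading.Thread(target=planner)]
for t in threads:
    t.start()
events.put("ace")
events.put(None)  # shutdown marker
for t in threads:
    t.join()
out = utterances.get()
print(out)  # -> What a point by player1!
```

The shutdown marker travelling through every queue is one conventional way to let such a staged pipeline drain and terminate cleanly.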

When the system runs, our two virtual agents engage in dialogues to comment on the tennis game or on background facts. A virtual agent is happy if his/her favourite player is doing well and unhappy if s/he is losing. A virtual agent comments in a positive way on a player s/he likes and on events that lead towards the victory of his/her favourite player, and in a negative way on a player s/he dislikes and on events that hinder the victory of his/her favourite player. The current affect of a virtual agent is conveyed by lexical selection, facial expression, and gestures.
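The attitude-dependent appraisal just described amounts to reading off the sign of an event's valence from the agent's attitude towards the player it favours. The following sketch is our own illustration of that idea (the function and values are not the IVAN implementation).

```python
def appraise(event_benefits: str, attitude: dict) -> int:
    """Return +1 / 0 / -1: positive if the event helps a liked player.

    event_benefits: the player the event favours, e.g. "player1"
    attitude: the agent's attitude per player (+1 positive, 0 neutral, -1 negative)
    """
    return attitude.get(event_benefits, 0)

# A commentator who likes player1 and dislikes player2
commentator = {"player1": +1, "player2": -1}
print(appraise("player1", commentator))  # -> 1  (favourite scores: positive comment)
print(appraise("player2", commentator))  # -> -1 (rival scores: negative comment)
```

The resulting valence could then drive lexical selection (e.g. "brilliant shot" vs. "lucky shot") as well as facial expression and gesture choice.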



1.4 Research Aims

In this section, we describe our four main research aims. They will be discussed in section 6.2, Evaluation in Terms of Research Aims, after we have described the architecture of the whole system.

• Dialogue Planning for Real-time Commentary

In this master's thesis, we wanted to investigate how an HTN planner can be employed to generate real-time commentary on a continuous sports event in the form of a dialogue between two virtual agents. An example of a real-time commentary system that uses expert systems to control one virtual agent is ERIC [10]; however, it may be too reactive, i.e., individual utterances are triggered at particular knowledge states, so ERIC cannot generate larger contributions. In addition, expert systems cannot generate alternative plans, so HTN planning offers more variability. We therefore wanted to examine an HTN planner, which promises to be a good strategy for generating elaborate, large, and coherent dialogue contributions.

• Reactivity

The system should be able to react quickly to new events that happen during the tennis game. Moreover, when an event happens that is more important than the one the virtual agents are currently commenting on, the system should be able to interrupt the current discourse and comment on the new event. The interruption should be graceful and have a smooth transition.

• Behavioural Complexity

The virtual agents of our presentation team should ideally behave like human tennis commentators and produce interesting, suitable, and believable commentary. They should use the whole range of communication channels to convey facts about the tennis game: they should generate a variety of dialogues along with synchronized hand and body gestures, and have facial expressions appropriate to their current emotional states. Moreover, if we allow the user to interact with the system, the system becomes more engaging. This behavioural complexity ensures the believability of the virtual characters; without the traits mentioned above, the virtual agents would look unrealistic.

• Affective Behaviours

The virtual agents should react affectively to the events that occur in the tennis game according to their (positive, neutral, or negative) attitudes towards the players. Their emotional state should be derived from appraisals of the events that happen during the tennis game, and their current affect should be conveyed by lexical selection, facial expression, and gestures. Endowing virtual agents with emotions increases their believability and makes them better accepted by users.


Chapter 2

Related Work

In this chapter, we describe several examples of virtual agent applications that are relevant to our work. We introduce ERIC, an affective, rule-based sports commentary agent that won GALA 2007 as a horse race commentator, and DEIRA, another horse race reporter. Then we present the project Spectators, which participated in GALA 2009 (see section 1.2); it employs several autonomous affective virtual agents that jointly watch a tennis game as ordinary tennis spectators. To introduce HTN planning (see section 3.1), which we have employed in our system to generate dialogues, we describe STEVE, which uses HTN planning to help students perform physical procedural tasks in a 3D simulated student's work environment. Since we employed a presentation team [6] in our system, we also describe the general design of presentation teams and two applications that employ them.

2.1 ERIC

ERIC [10, 14] won GALA 2007 as a horse race commentator.1 ERIC is a generic rule-based framework for affective real-time commentary developed at DFKI. The system was tested in two domains: a horse race and a tank battle game, where the horse race was given in the form of a horse race simulator supplied by GALA 2007. The simulator sends the speed and the position of each horse to ERIC every second via a socket. ERIC receives events from the horse race simulator and produces coherent natural language along with non-verbal behaviour. The visual output is represented by a virtual agent that has lip movement synchronized to speech, can express various facial expressions, and can perform many different gestures. ERIC employs the same avatar engine as our system. The graphical output of ERIC is shown in Figure 2.1.

1 http://hmi.ewi.utwente.nl/gala/finalists 2007/



Figure 2.1: ERIC commenting on a Horse Race

ERIC consists of several modules. We describe the two most interesting, the Affect module and the Natural Language Generation module, in detail.

2.1.1 The Affect Module

The affect module receives facts from the world and assigns appraisals to each event, action, and object according to goals, desires, and cause-effect relations. The appraisal of an event, action, or object is then sent in the form of a specific tag to the ALMA module [15], which maintains the commentator's affective state. ALMA considers three types of affect: emotions (short-term), mood (medium-term), and personality (long-term). Emotions are bound to specific events and decay over time. Mood represents the average of the emotional state across time. Personality is defined by the Big Five [16], i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism. Personality is used to compute the initial mood and influences the intensity and decay of emotions. The affective state of a virtual agent influences utterance, gesture, and facial expression selection.
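The short-term/medium-term split can be illustrated with a toy model. The formula and numbers below are our own simplification, not ALMA's actual computation: each emotion decays exponentially from its elicited intensity, and mood is taken as the average of recent emotion intensities.

```python
def emotion_intensity(initial: float, age_s: float, half_life_s: float = 5.0) -> float:
    """Exponentially decaying emotion intensity (toy model of short-term affect)."""
    return initial * 0.5 ** (age_s / half_life_s)

def mood(intensities: list) -> float:
    """Medium-term mood as the average of recent emotion intensities."""
    return sum(intensities) / len(intensities) if intensities else 0.0

# A joy emotion elicited with intensity 0.8 has halved after one half-life (5 s):
print(emotion_intensity(0.8, 5.0))  # -> 0.4
print(mood([0.8, 0.4, 0.0]))        # average of recent intensities
```

In this reading, personality would set the starting mood and the per-emotion parameters (initial intensity and half-life), which matches the role the text assigns to the Big Five profile.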

2.1.2 The Natural Language Generation Module

This module uses a template-based algorithm to generate utterances. Each template corresponds to a rule in a rule-based engine. Each such rule has conditions that can be partitioned into four groups: facts that must be known, facts that must be unknown, facts that must be true, and facts that must be false. For each template there is at least one utterance that contains flat text and slots for variables. First, all candidate templates are generated, then the corresponding utterances are retrieved, and finally one of the most coherent utterances is chosen. Discourse coherence is ensured by Centering Theory [17], which, in simplified terms, says that a discourse is coherent if every two consecutive utterances are coherent. Thus, each template defines its topic and a list of all topics that a coherent following sentence may have. After a template has been chosen, the next template is chosen so that its topic is among the possible topics for a coherent following sentence of the previous template.
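A minimal sketch of this selection scheme (the template data, fact names, and helper are ours, not ERIC's): candidate templates are filtered by the four condition groups of their rule, then restricted to topics that keep the discourse coherent with the previous template.

```python
def candidates(templates, known, true_facts):
    """Filter templates by the four condition groups of their rule."""
    out = []
    for t in templates:
        if (t["known"] <= known and not (t["unknown"] & known)
                and t["true"] <= true_facts and not (t["false"] & true_facts)):
            out.append(t)
    return out

templates = [
    {"name": "overtake", "topic": "position",
     "next_topics": {"position", "speed"},
     "known": {"pos(h1)"}, "unknown": set(), "true": {"leads(h1)"}, "false": set()},
    {"name": "weather", "topic": "weather",
     "next_topics": {"weather"},
     "known": set(), "unknown": set(), "true": set(), "false": set()},
]

known = {"pos(h1)"}
true_facts = {"leads(h1)"}
last = templates[0]  # the previous utterance talked about 'position'

# Keep only candidates whose topic is a coherent follow-up to the last template
coherent = [t for t in candidates(templates, known, true_facts)
            if t["topic"] in last["next_topics"]]
print([t["name"] for t in coherent])  # -> ['overtake']
```

Both templates survive the condition filter here, but the weather template is rejected by the centering constraint because "weather" is not among the topics the previous template allows next.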

This system is the most closely related to our work, since the overall goal of ERIC is the same as ours. A comparison of the IVAN system and ERIC is given in section 6.1.

2.2 DEIRA

DEIRA [11] (Dynamic Engaging Intelligent Reporter Agent) is another commentary agent that participated in GALA 2007 as a horse race reporter.2 DEIRA employs an expert system to generate affective commentary in real-time. The system maintains the affective state of the reporter according to his personality and the events that occur in the horse race. The current affect is represented by a vector of four values (tension, surprise, amusement, pity) and is conveyed by the lexical selection and facial expression of the reporter. The graphical output of the system is shown in Figure 2.2.

Figure 2.2: DEIRA (Dynamic Engaging Intelligent Reporter Agent)

2 http://hmi.ewi.utwente.nl/gala/finalists 2007/



2.3 Spectators

The project Spectators [18] participated in GALA 2009 (see section 1.2). The system consists of several autonomous virtual agents that are watching a tennis game. The spectators can have different attitudes towards the teams, where an attitude can be positive or neutral. Each spectator has a euphoria factor that determines how much the spectator's mood state changes when an important event happens in the tennis game; the euphoria factor represents the spectator's personality trait. The mood of a spectator is expressed by his facial expression, typical animations, and speech. The spectators' moods are: euphoric, happy, slightly happy, neutral, slightly sad, sad, and disappointed. Furthermore, the position of the ball is interpolated so that the spectators can gaze at the ball within a rally, and the voice of a referee is incorporated to announce the score in the conventional way.

However, the system focuses only on non-verbal behaviour, i.e., neither the spectators nor the referee comment on the game as tennis commentators would. The system essentially consists of only a limited set of rules that trigger the respective animations. Our system and the Spectators project could therefore be combined to generate a complex scene of a tennis game with both tennis commentators and spectators.

2.4 STEVE<br />

STEVE (Soar Training Expert for Virtual Environments) [3] is a sample application that uses the same method as our system to control the behaviour of virtual agents, namely HTN planning (see section 3.1). STEVE is a virtual agent that helps students perform physical procedural tasks in a 3D simulated work environment. STEVE can either demonstrate procedural tasks or monitor students while they are performing tasks and provide assistance if they need help or ask questions. Each task consists of a set of partially ordered steps, where a step can be a primitive action or a composite action; this creates a hierarchical structure in which some steps of a task can also be reused to solve other tasks. Therefore, STEVE employs a Hierarchical Task Network to define the particular tasks. STEVE consists of a perception, a cognition, and a motor control module. The perception module monitors the state of the virtual world and maintains a coherent representation of it. In each loop of its decision cycle, the cognition module gets the current snapshot of the world from the perception module, chooses appropriate goals, and then constructs and executes plans. The motor control module gets high-level commands from the cognition module
3 http://hmi.ewi.utwente.nl/gala/finalists 2009/



to control voice, locomotion, gaze, gestures, and object manipulation. The graphical output of STEVE is shown in Figure 2.3.

Figure 2.3: STEVE in a 3D Simulated Student’s Work Environment<br />

Our system, like STEVE, uses an HTN planner to generate speech and can interact with users via user questions. We were also inspired by STEVE's execution cycle and its concept of snapshots of the world. In comparison to STEVE, our system employs two virtual agents, maintains their affective states, and generates affective commentary. On the other hand, our system generates shorter contributions, does not offer elaborate user interaction, and our virtual agents cannot move in the virtual environment.

2.5 Presentation Teams<br />

We employed a presentation team [6, 7, 8, 9] in our system to comment on a tennis game. In this section, we briefly describe the general design of presentation teams and then focus on two projects that employ them. The first project is the Inhabited Marketplace, where a car seller and customers have different preferences (e.g. running costs, prestige) and character profiles; they engage in dialogues to discuss the attributes of a car that the customers are interested in. The second project is Rocco II, where two soccer fans, who can have different attitudes towards the teams and different character profiles, jointly watch a RoboCup soccer game and comment on it.



2.5.1 Design of Presentation Teams<br />

The idea of presentation teams is to generate presentations automatically on the fly. A presentation team consists of at least two virtual agents that convey information in the style of a performance observed by the user. This approach is believed to be more entertaining and to provide better understanding than a system with a single presenter. The virtual agents' roles, character profiles, and dialogue types are chosen depending on the discourse purpose. Moreover, the characters should be distinguishable, i.e., they should differ in audio-visual appearance, expertise, interests, and personality. Distinct agents can also better express opposing roles. There are two basic approaches to generating the dialogue [19]. Agents with scripted behaviour correspond to actors in a play who can still improvise a little at performance time, i.e., their behaviour is first generated as a script (containing slots for variables that can be substituted at runtime) and executed later on. In contrast, autonomous agents have no script; they generate their dialogue contributions on the fly, i.e., they pursue their own communicative goals and react to the dialogue contributions of the other characters. First, we present a project that employs agents with scripted behaviour, and then a project that employs autonomous agents.

2.5.2 Inhabited Marketplace<br />

The Inhabited Marketplace project employs a presentation team to present facts along with an evaluation under constraints. Each character's profile is defined by agreeableness (agreeable, neutral, disagreeable), extraversion (extravert, neutral, introvert), and valence (positive, neutral, negative). The presentation team consists of a car seller and customers, each of whom can prefer a different dimension (e.g. environment, economy, prestige, or running costs). The aim of each customer is to discuss all attributes that have a positive or negative impact on the dimension they are interested in. Furthermore, the dialogue is also driven by the characters' personality traits, e.g., an extravert will start the conversation, or an introvert will use less direct speech. The dialogue is generated by an HTN planner (see section 3.1), i.e., the goal task is successively decomposed by planning methods into individual utterances. An example of a planning method that represents a particular dialogue scheme is shown in Figure 2.4. The method represents a scenario in which two agents discuss a feature of an object. It applies if the feature has a negative impact on some dimension and if this relationship can easily be inferred. Thus, any disagreeable buyer produces a negative comment referring to this dimension, e.g., to the dimension running costs, given the facts contained in Figure 2.5.



Figure 2.4: Example of a Planning Method (Dialogue Scheme) to Discuss an Attribute<br />

Value<br />

Figure 2.5: Excerpt of the Domain Knowledge

2.5.3 Rocco II

Gerd and Metze are two soccer fans that comment on a RoboCup soccer game. They can have different attitudes towards the teams, and their character profiles are defined by extraversion (extravert, neutral, introvert), openness (open, neutral, not open), and valence (positive, neutral, negative). The project focuses on the following dispositions: arousal (calm, neutral, excited) and valence. The system performs incremental event recognition [20], proceeding from a high-level analysis of the scene, over recognized events, to the basis for the commentary, where this basis additionally contains background knowledge about the game and the teams. The system employs two autonomous agents that use template-based natural language generation to produce the commentary on the fly. Furthermore, an agent can interrupt himself if a more important event happens. The templates are strings with slots for variables. Each template carries several tags, for instance: verbosity (the number of words), bias (positive, neutral, negative), formality (formal, normal, colloquial), and floridity (dry, normal, flowery language). The candidate templates are filtered in four steps in the execution cycle:
steps in the execution cycle:



1. under time pressure, pass only short templates

2. eliminate templates that were used recently

3. pass only templates expressing the speaker's attitude

4. choose templates according to the speaker's personality
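The four filtering steps can be sketched as a small pipeline. This is an illustrative simplification: the tag values, thresholds, and the personality rule below are assumptions for the example, not Rocco II's actual logic.

```python
# Sketch of the four-step template filter. Templates are dicts tagged with
# verbosity, bias, and floridity, mirroring the tags described in the text.
def filter_templates(templates, recently_used, speaker, time_pressure):
    if time_pressure:                                        # 1. only short templates
        templates = [t for t in templates if t["verbosity"] <= 5]
    templates = [t for t in templates                        # 2. drop recently used ones
                 if t["text"] not in recently_used]
    templates = [t for t in templates                        # 3. match speaker's attitude
                 if t["bias"] == speaker["attitude"]]
    style = "flowery" if speaker["extravert"] else "dry"     # 4. personality (assumed rule)
    return [t for t in templates if t["floridity"] == style]
```

A usage example: for an extravert speaker with a positive attitude under time pressure, a recently used template and templates with the wrong bias or style are all filtered out.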

The agents’ emotions are influenced by the current state of the game. Emotions can<br />

be expressed by the speed and pitch range of the speech along with different hand and<br />

body gestures. The graphical output of the system is shown in Figure 2.6.<br />

Figure 2.6: Gerd and Metze commenting on a RoboCup Soccer Game

Similar to our system, Gerd and Metze can have different attitudes towards the teams (players) and different personality profiles, and the system integrates background knowledge about the game and the teams and allows interruptions. In contrast to our system, Rocco II employs two autonomous agents that use template-based natural language generation to produce the commentary on the fly. While our templates can be categorized only according to bias, Rocco II uses a wide range of templates categorized by verbosity, bias, formality, and floridity. Thus, the system can generate more reactive and elaborate commentary than ours. The system also maintains the emotional state of the virtual agents, which can be expressed by prosody and by hand and body gestures. On the one hand, our system does not integrate prosody; on the other hand, our virtual agents have more elaborate facial expressions and gestures.


Chapter 3<br />

Methods for Controlling Behaviour of Virtual Agents

In this chapter, we will introduce three basic methods for controlling the behaviour of virtual agents that we have employed in our system. The most important method is HTN planning, which we have employed to generate dialogues for our presentation team (see section 4.1). The second method is expert systems, which we have used to define emotion-eliciting conditions in the emotion module (see section 4.2.3). The third method is statecharts, where we have used three simple finite state machines to model the basic states of the system (see section 5.3.1). Note that each of these methods can also be used on its own for natural language generation (e.g. see ERIC in section 2.1, which uses an expert system).
uses the expert systems).<br />

3.1 Hierarchical Task Network Planning<br />

In our system, we have employed Hierarchical Task Network (HTN) planning to generate the dialogues for our presentation team (see section 4.1). In general, planning is employed for problem solving and can be applied in many different domains to save time and money, e.g., in air transport, flight control, control of space probes, army missions, maintenance of complex machines (e.g. submarines), disaster relief, or tutoring systems (e.g. see STEVE in section 2.4) [21].

HTN planning is a variant of automated planning. First, we will introduce STRIPS-like planning [22] (where STRIPS stands for Stanford Research Institute Problem Solver) and then compare it to HTN planning. The input of a STRIPS-like planner consists of a set of facts that describe the initial state of the world, a set of goal facts, and a set of planning operators that correspond to actions that can modify the current state of the world. Let us denote the set of facts that describe the current state of the world as the Base. A planning operator has a list of preconditions, a delete list, and an add list. A planning operator can be applied if its preconditions are contained in the Base. After a planning operator is applied, all facts in its delete list are deleted from the Base and all facts in its add list are added to the Base. The STRIPS-like planner reaches the goal state of the world when the Base contains all goal facts. Once started, the planner searches for a sequence of planning operators that successively transforms the initial state of the world into its goal state. The output of the planner is a plan (or a list of all possible plans) consisting of a list of planning operators such that applying these operators successively to the initial state of the world yields the goal state. While a STRIPS-like planner may try to apply any planning operator at any step of the planning process to reach the goal state, an HTN planner may only try the planning operators that the HTN makes available at that particular step of the planning process.
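The STRIPS-like mechanics described above can be sketched as a naive depth-limited search. The operators below are a made-up toy domain, not an example from the thesis.

```python
# Sketch of STRIPS-like planning: the Base is a set of facts; an operator
# applies when its preconditions are contained in the Base, then its delete
# list is removed and its add list is inserted. (Toy domain for illustration.)
OPERATORS = {
    "pick-up":  {"pre": {"hand-empty", "on-table"},
                 "del": {"hand-empty", "on-table"}, "add": {"holding"}},
    "put-down": {"pre": {"holding"},
                 "del": {"holding"}, "add": {"hand-empty", "on-table"}},
}

def plan(base, goal, depth=4):
    """Find a sequence of operators turning `base` into a state containing `goal`."""
    if goal <= base:                       # all goal facts hold in the Base
        return []
    if depth == 0:                         # bound the naive search
        return None
    for name, op in OPERATORS.items():
        if op["pre"] <= base:              # preconditions contained in the Base
            rest = plan((base - op["del"]) | op["add"], goal, depth - 1)
            if rest is not None:
                return [name] + rest
    return None
```

Unlike an HTN planner, this search may try any applicable operator at any step; the HTN restricts which operators are even considered at each point of the decomposition.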

HTN planning is based on task decomposition, i.e., compound tasks are decomposed into subtasks, where each subtask is either a compound task on a lower level of the planning hierarchy or a primitive task that corresponds to an action that can be executed in the real world. Note that the primitive tasks in HTN planning correspond to the planning operators in STRIPS-like planning. The description of the world (called the planning domain in HTN planning terminology) is given as a Hierarchical Task Network, and the planning goal (called the planning problem) is given as a list of goal tasks and a list of facts that describe the initial state of the world. The resulting plan is a list of primitive tasks such that performing these primitive tasks in sequence accomplishes the goal tasks. In the following text, we will show an example of a planning task, introduce JSHOP 1 as the implementation of an HTN planner that we have employed in our system to generate the dialogues for our presentation team (see section 4.1), and finally define some basic constructs of the JSHOP language.

3.1.1 Example of a Planning Task<br />

Let us consider the example of a planning task depicted in Figure 3.1 to demonstrate a typical task for an HTN planner [23]. The Hierarchical Task Network represents ways to travel from x to y, more precisely, how to accomplish the goal task travel(x,y). We can either take a taxi for a short distance or fly for a long distance. (There may be other ways to travel that we do not consider here.) Thus, to accomplish the compound goal task travel(x,y) we have to fulfil one of

1 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/



its compound subtasks, namely travel by taxi or travel by air. In the first case (travel by taxi) we must first get a taxi, then ride the taxi from x to y, and finally pay for it. In the second case (travel by air) we must first buy a ticket from airport(x) to airport(y), then travel from x to airport(x), fly from airport(x) to airport(y), and eventually travel from airport(y) to y. Thus, to fulfil the compound task travel by taxi or travel by air we have to satisfy all of its respective subtasks. Note that once the planner starts, it first finds out whether it is possible to travel by taxi, and if not, it backtracks and tries the option of travelling by air.

Figure 3.1: Example of a Planning Task - HTN<br />

The resulting plan for travelling from UMD (University of Maryland) to MIT is depicted in Figure 3.2. First we have to buy a ticket from the BWI (Baltimore Washington International) airport to the Logan airport, then take a taxi from UMD to the BWI airport, then fly from the BWI airport to the Logan airport, and finally take a taxi from the Logan airport to MIT.

Figure 3.2: Example of a Planning Task - generated Plan
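The decomposition can be sketched recursively: traveling to and from the airports is itself a travel task, solved here by the taxi method. This is an illustrative sketch; the distances are made up, the airports are written generically as airport(x), and a real JSHOP domain would express the two branches as methods with preconditions and backtracking.

```python
# Sketch of the travel decomposition. A short distance triggers the
# travel-by-taxi method; otherwise the travel-by-air method decomposes
# into ticket buying, two recursive travel subtasks, and a flight.
DISTANCE = {("UMD", "airport(UMD)"): 10, ("airport(MIT)", "MIT"): 5}

def travel(x, y):
    """Decompose travel(x, y) into a list of primitive tasks."""
    if DISTANCE.get((x, y), float("inf")) <= 50:          # travel-by-taxi method
        return [f"get-taxi({x})", f"ride({x},{y})", "pay-driver"]
    ax, ay = f"airport({x})", f"airport({y})"             # travel-by-air method
    return ([f"buy-ticket({ax},{ay})"] + travel(x, ax)
            + [f"fly({ax},{ay})"] + travel(ay, y))
```

For travel("UMD", "MIT") this yields the same shape of plan as Figure 3.2: buy a ticket, taxi to the departure airport, fly, and taxi to the final destination.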



3.1.2 Java Simple Hierarchical Ordered Planner (JSHOP)<br />

In the following text, we will introduce the Java Simple Hierarchical Ordered Planner (JSHOP) 2 [24, 25], the implementation of an HTN planner that we have employed in our system. JSHOP is a Java implementation of a domain-independent Hierarchical Task Network (HTN) planner, developed at the University of Maryland, that is based on ordered task decomposition. Planning is conducted by problem reduction, i.e., the planner recursively decomposes tasks into subtasks and stops when it reaches primitive tasks that can be performed directly by planning operators. Compound task decomposition is realized by methods that define how to decompose compound tasks into subtasks. Since more than one method may be applicable to a compound task, the planner can backtrack, i.e., it can try several methods to decompose a compound task. As a consequence, the planner can find more than one suitable plan.

The Input of JSHOP consists of the description of a planning domain and a planning problem. The planning domain constitutes the world description, i.e., it consists of planning methods, planning operators, and axioms. The planning problem consists of a list of tasks and a list of facts that hold in the initial state of the world. The planning domain description is stored in a domain file and the problem description in a problem file. The Output of JSHOP is a list of suitable plans, where each plan consists of a list of primitive tasks and each primitive task corresponds to an action that can be executed in the real world (e.g. utter an utterance, or move object O from place X to place Y).

Figure 3.3: JSHOP Input Generation Process<br />

To Run the Planner, we first have to generate Java code from the respective domain and problem files, which are written in a special Lisp-like syntax. JSHOP is implemented

2 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/



in this way since this approach makes it possible to perform certain optimizations and to produce Java code that is tailored to a particular domain and problem description [26]. See Figure 3.3. (The generated Domain Description Java file is compiled together with the Domain-Independent Templates, which results in a Domain-Specific Planner; the generated Java Problem file is compiled as well. In the end, we can run the planner, which outputs all possible Solution Plans.)

3.1.3 JSHOP Language<br />

In the following text, we will describe the most important JSHOP constructs, namely:<br />

axioms, planning operators, and planning methods. See the JSHOP manual [27] for<br />

more details on the whole syntax of the language. JSHOP contains many constructs<br />

characteristic for an HTN planner (e.g. symbols, terms, call terms, logical atoms, logical<br />

expressions, implication, universal quantification, assignment, call expressions, logical<br />

preconditions, task atoms, task list, axioms, operators, and methods). Furthermore, it<br />

is possible to write user defined functions in Java.<br />

Axioms<br />

An axiom is an expression of the form:

(:- a [name1] L1 [name2] L2 ... [namen] Ln)

where the head of the axiom is a logical atom a and its tail is a list of pairs (name, logical precondition); a is true if L1 is true, or if L1, ..., Lk-1 are all false and Lk is true (for k <= n). The names of the logical preconditions are optional, but they can improve readability. Figure 3.4 shows an example of an axiom: a place ?x is within walking distance if the weather is good and ?x is within two miles of home, or if the weather is bad and ?x is within one mile of home.

Figure 3.4: Sample JSHOP Axiom
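The ordered semantics of this axiom can be sketched as follows. The facts here are made up for illustration; the point is that the second precondition is only tried once the first one has failed.

```python
# Sketch of the walking-distance axiom's ordered preconditions.
def in_walking_distance(place, weather, miles_from_home):
    # first precondition: good weather and within two miles of home
    if weather == "good" and miles_from_home.get(place, float("inf")) <= 2:
        return True
    # second precondition, tried only if the first failed:
    # bad weather and within one mile of home
    if weather == "bad" and miles_from_home.get(place, float("inf")) <= 1:
        return True
    return False
```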



Operators<br />

An operator has the following form:

(:operator h P D A [c])

where h is the operator's head, P is the operator's precondition, D is the operator's delete list, A is the operator's add list, and c is the operator's cost (the default cost is 1). Let us denote the set of facts that describe the current state of the world as the facts base. The operator can be applied if the preconditions in P are satisfied. After the operator has been applied, all facts contained in D are deleted from the facts base and all facts contained in A are added to it. Figure 3.5 shows an example of a planning operator: we can drive a ?truck from an ?old-loc to a ?location if the ?truck is at the ?old-loc. After the operator has been applied, the fact (at ?truck ?old-loc) is deleted from the facts base and the new fact (at ?truck ?location) is added.
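The effect of this operator on the facts base can be sketched as follows. For simplicity the variables are passed in as already-bound values, whereas JSHOP would bind ?truck, ?old-loc, and ?location by matching the precondition against the facts base.

```python
# Sketch of applying the drive-truck operator: check the precondition,
# then apply the delete list and the add list to the facts base.
def drive_truck(facts, truck, old_loc, location):
    if ("at", truck, old_loc) not in facts:   # precondition not satisfied
        return None
    facts = set(facts)
    facts.discard(("at", truck, old_loc))     # delete list
    facts.add(("at", truck, location))        # add list
    return facts
```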

Figure 3.5: Sample JSHOP Operator

Methods

A method is a list of the form:

(:method h [name1] L1 T1 [name2] L2 T2 ... [namen] Ln Tn)

where h is the method’s head; each Li is a precondition; each Ti is a list of tasks; each<br />

namei is a respective optional name. The compound task specified by the method can<br />

be performed by performing all tasks in the list Ti if the precondition Li is satisfied and<br />

for all preconditions Lk such that k < i holds that they are not satisfied. Figure 3.6<br />

presents an example of a method. The task specified by this method is to eat a ?food.<br />

If we have a fork then we eat the ?food with a fork. If we do not have a fork but we<br />

have a spoon then we eat the ?food with a spoon.
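The branch selection of this method can be sketched as follows. This is illustrative: a real JSHOP method decomposes the task into further subtasks rather than returning strings.

```python
# Sketch of the eat method: branches are tried in order, and the first
# branch whose precondition is satisfied determines the task list.
def eat(food, facts):
    if "have-fork" in facts:
        return [f"eat-with-fork({food})"]
    if "have-spoon" in facts:               # tried only if we have no fork
        return [f"eat-with-spoon({food})"]
    return None  # no branch applies: the method fails and the planner backtracks
```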



Figure 3.6: Sample JSHOP Method

3.2 Expert Systems

Expert systems can also be employed to generate commentary on a sports event, as shown by ERIC (see section 2.1). Nevertheless, we have employed an expert system only in the emotion module, to define emotion-eliciting conditions (see section 4.2.3). Expert systems are used in many domains to "replace" human experts. The know-how of human experts is first stored in the system; afterwards, the system can be queried by users, who always get consistent answers. A disadvantage of such a system is that it is not well suited to changing environments. Expert systems are used, for instance, in the following domains: financial services, accounting, production, process control, medicine, or human resources. Examples of expert systems are CLIPS (C Language Integrated Production System) [28] and its Java reimplementation Jess (Java Expert System Shell) [29], which we have employed in our system.

Expert systems reason about the world using knowledge that consists of facts and rules. While the facts describe the current world in terms of assertions, the rules define how to modify the facts base (knowledge base), e.g., how to deduce new facts from already known facts; each rule has the form of an if-then clause. Note that it is also possible to retract or modify facts as a result of a rule being fired. The inference loop of a typical expert system consists of the following three steps:

1. Match the left-hand sides of the rules against the facts and move the matched rules onto the agenda.

2. Order the rules on the agenda according to some conflict resolution strategy (e.g. at random).

3. Execute the right-hand sides of the rules on the agenda in the order decided in step (2).



The inference loop ends when no new facts can be inferred. After the inference process ends, we know which rules have been fired, and the facts base contains all initial and inferred facts that have not been retracted. In the following text, we will present the implementation of an expert system that we have employed in our system.
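The match-resolve-act loop can be sketched as follows. This is a strong simplification: rules here are plain (name, condition-facts, derived-facts) triples, each rule fires at most once, and conflict resolution just takes the first match, whereas a real engine like Jess supports variables, retraction, and configurable strategies.

```python
# Sketch of a forward-chaining inference loop over set-based rules.
def infer(facts, rules):
    """Fire applicable rules until no new facts can be inferred."""
    facts, fired = set(facts), set()
    while True:
        agenda = [r for r in rules                   # 1. match LHS against facts
                  if r[1] <= facts and r[0] not in fired]
        if not agenda:                               # nothing new: loop ends
            return facts
        name, lhs, rhs = agenda[0]                   # 2. trivial conflict resolution
        fired.add(name)
        facts |= rhs                                 # 3. act: assert the new facts
```

Chained rules fall out naturally: a fact asserted by one rule can satisfy the condition of another in a later iteration.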

Java Expert System Shell (Jess)<br />

Jess [29] is a fast Java implementation of an expert system developed at Sandia National Laboratories. Although it has a rich Lisp-like syntax, we will show only two examples: one that defines an unordered fact and another that defines a rule. See [29] for details on the complete syntax of the language.

Unordered Fact - Every fact corresponds to a particular template. The definition of a template starts with the keyword deftemplate, followed by a template name and an optional documentation comment. The following template is an example of how to define an automobile. The template contains four slots: the manufacturer, the model, the year of production as an integer, and the colour, where red is the default colour.

(deftemplate automobile
  "A specific car."
  (slot make)
  (slot model)
  (slot year (type INTEGER))
  (slot color (default red)))

The following command asserts a concrete Volkswagen Golf that was produced in 2009<br />

and is of the default red colour.<br />

(assert (automobile (model Golf)(make Volkswagen)(year 2009)))<br />

Rule - Consider the following templates. The first template defines an agent that has a name and can be hungry; the second template defines the current time.

(deftemplate agent
  "A hungry agent"
  (slot name)
  (slot hungry))



(deftemplate current_time
  "The current time"
  (slot ctime (type FLOAT)))

The following commands assert the agent George, who is hungry, and the current time, which is half past twelve.

(assert (agent (name George)(hungry TRUE)))
(assert (current_time (ctime 12.5)))

Consider the following rules, which are chained: the first rule opens the cafeteria if the current time is between noon and one o'clock, and the second rule sends every hungry agent to lunch once the cafeteria is open.

(defrule open_cafeteria
  (current_time {(12.0 <= ctime) && (ctime <= 13.0)})
  =>
  (assert (cafeteria_open)))

(defrule go_to_lunch
  (agent (name ?n) (hungry TRUE))
  (cafeteria_open)
  =>
  (printout t ?n " goes to lunch." crlf))


3.3 Statecharts

The third method we have employed is statecharts; in our system, we have used three simple finite state machines to model the basic states of the system (see section 5.3.1). However, statecharts can also be used to generate speech. An

example of a tool that makes it possible to control virtual agents using statecharts is SceneMaker [30]. A user can create an arbitrary statechart with SceneMaker to describe the behaviour of virtual agents. A scene is stored in every node of the statechart. A scene can, for instance, describe a dialogue between two virtual agents, i.e., the scene is written in a theatre script-like language and consists of utterances annotated with gestures. A statechart can also contain several types of edges that define the transitions between nodes (e.g. a timeout edge, a conditional edge, or a probability edge).
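Edge evaluation in such a statechart can be sketched as follows. The node and edge data are made up for illustration, and SceneMaker's actual edge types and semantics are richer than this.

```python
# Sketch of picking the next statechart node: the first outgoing edge
# whose condition fires determines the transition, otherwise we stay put.
def next_node(node, context):
    for kind, guard, target in node["edges"]:
        if kind == "conditional" and guard(context):    # conditional edge
            return target
        if kind == "timeout" and context["elapsed"] >= guard:  # timeout edge
            return target
    return node["name"]   # no edge fired: remain in the current node
```

A usage example: an "idle" node could transition to "greet" when a user appears, or to "screensaver" after 30 seconds of inactivity.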

The difference between SceneMaker and our approach is that while SceneMaker performs one of the pre-defined scenes at a node, we first run the HTN planner to generate the scene, and then the scene is performed. Nevertheless, we have employed only three simple finite state machines to maintain the basic states of our system; the logic itself is implemented in the domain description of the HTN planner.

SceneMaker was employed in several projects: CrossTalk [31], VirtualHuman [32], IDEAS4Games [33], and COHIBIT [34, 35]. For instance, the purpose of the COHIBIT project is to provide knowledge about car technology and virtual agents in an entertaining way. Two virtual agents interact with users and give them advice on how to build a car from different car pieces. The system is informed about the presence of users via cameras, and about the location and orientation of car pieces via RFID technology. An overview of the COHIBIT system is depicted in Figure 3.7.

Figure 3.7: Overview of the COHIBIT system


Chapter 4<br />

Generating Dialogue<br />

In this chapter, we will explain how we generate affective commentary on a tennis game for our two virtual agents. First, we will describe how we generate dialogues using an HTN planner. Then, we will describe how we generate a piece of dialogue that conveys a particular attitude of a virtual agent towards a player, how we maintain the affective state of a virtual agent, and how a particular affect can be conveyed by different modalities.

4.1 Commentary Planning<br />

In this section, we will describe how we generate the dialogues for our presentation team, which consists of two virtual agents. We have employed the JSHOP planner (see section 3.1) to generate the commentary, where the generated plans correspond to possible dialogues in which the presentation team can be engaged. The planner is triggered in particular states of the tennis game, gets facts that describe the current state of the game, and outputs all possible plans. A detailed description of the states in which the planner is triggered, of the input facts, and of how the generated plans are executed will be given in Chapter 5. Thus, in this section we will focus only on dialogue generation, i.e., on which dialogues our commentary team can be engaged in, in distinct states of the tennis game, according to the facts that describe the game and the background of the players and the tournament.
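The trigger cycle just outlined can be sketched as follows. The state names and the planner stub are illustrative assumptions; the actual trigger states and input facts are detailed in Chapter 5.

```python
# Sketch of the planner trigger: the planner runs only in game states
# that leave room for commentary, taking the current facts as input
# and returning all possible plans (candidate dialogues).
TRIGGER_STATES = {"rally-finished", "game-finished", "pause"}  # assumed names

def on_state_change(state, facts, planner):
    """Run the planner when the game enters a commentary-friendly state."""
    if state not in TRIGGER_STATES:
        return []
    return planner(facts)
```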

4.1.1 Motivation<br />

The overall goal of our system is to automatically generate interesting, suitable, coherent, and affective commentary from different points of view (depending on the commentators' attitudes to the players) in real time. To investigate what real tennis commentators




say during a game, we have analysed several tennis games from YouTube 1. We found that there are usually two commentators, where the second commentator is usually a former tennis player or an expert in the field who can always provide additional background information. We also found that the commentary is to some extent driven by the states of the game, e.g., nobody talks while the serving player concentrates before the serve, the commentators engage in small talk about the players' background when there is nothing else to comment on, and they usually summarize every rally after it finishes. Thus, a statechart approach such as the one presented in the SceneMaker project (see section 3.3) would also be convenient; we have therefore employed finite state machines to decide when to run the planner according to the states of the tennis game.

We have also noticed that the information conveyed by a sports commentator often adds<br />

little to what an ordinary spectator can perceive while s/he is watching<br />

the same tennis game. Since we wanted our commentary to be more sophisticated, we<br />

have taken inspiration from the TennisEarth 2 web page that describes tennis matches (rally<br />

by rally) for tennis fans who have not seen them. As a consequence, the commentary on<br />

TennisEarth is more elaborate and a valuable source of inspiration for us. We also wanted to incorporate<br />

more background knowledge since a standard tennis match is usually long-winded and<br />

there is often nothing to comment on; we have therefore made use of the OnCourt 3 project<br />

as a source of the background knowledge about players and tennis tournaments.<br />

As we have already stated, the commentators have positive, neutral, or negative attitudes<br />

to the players. Since the standard live commentary is usually balanced, except for<br />

particular international tournaments, we had to add respective bias to our utterances.<br />

Let us note that biased utterances usually convey particular affects. To deal with the<br />

real-time requirement, we had to make sure that the dialogues are not too long. However,<br />

we can predict the time we have at our disposal for a commentary according to the state<br />

of the tennis game. For instance, we always have more time to comment on a just finished<br />

game than on an event that happens within a rally. Nevertheless, these predictions<br />

are only rough approximations, thus we had to allow interruptions, i.e., to interrupt<br />

the current plan if a more relevant event happens. The coherence of the commentary is<br />

ensured by the dialogue planning that is elaborated in the next section.<br />

1 http://www.youtube.com/<br />

2 http://www.tennisearth.com/<br />

3 http://www.oncourt.info/
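
The time-budget heuristic described above can be sketched as follows; the concrete budgets and priorities are invented for illustration and are not the values used in IVAN.<br />

```java
public class CommentaryScheduler {
    // Rough prediction of the speaking time (in seconds) available in a
    // given game state; the numbers are illustrative only.
    public static int timeBudget(String state) {
        switch (state) {
            case "GAME_FINISHED":     return 25; // plenty of time between games
            case "RALLY_FINISHED":    return 10; // short summary before the next serve
            case "RALLY_IN_PROGRESS": return 4;  // only quick remarks mid-rally
            default:                  return 8;
        }
    }

    // Since the budgets are only rough approximations, a running plan is
    // interrupted when a strictly more important event arrives.
    public static boolean shouldInterrupt(int currentPriority, int newPriority) {
        return newPriority > currentPriority;
    }
}
```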



4.1.2 Dialogue Planning<br />

To represent our presentation team, we have employed two virtual agents that have<br />

different roles, attitudes to the players, and audio-visual appearance. The first com-<br />

mentator is the Charamel virtual agent Mark that represents a TV tennis commentator<br />

and the second Charamel virtual agent is Gloria that represents a tennis expert. (See<br />

section 5.1.3 for more details on the Charamel avatar engine.) While Mark should<br />

concentrate on simple facts concerning the tennis game, Gloria should rather elaborate on<br />

these facts. Let us remember that all dialogues are based on commentators’ attitudes<br />

to the players that can be positive, neutral, or negative.<br />

Dialogue Schemes<br />

We were inspired by the dialogue schemes presented in the project Presentation Teams<br />

(see section 2.5.2). A dialogue scheme is a generic representation of a piece of dialogue<br />

that can be generated under certain conditions by a planner. Let us note that dialogue<br />

schemes correspond to the methods in HTN planning. Let us also remember that<br />

in HTN planning, the compound goal task is decomposed by planning methods into<br />

subtasks, where each subtask is either a planning operator that corresponds to a<br />

template (that represents an utterance) or a compound task that is further decomposed<br />

by planning methods. Consider the planning method depicted in Figure 4.1.<br />

Figure 4.1: Example of a Planning Method<br />

Let us assume that player ?P1 has played a winning return (i.e. player ?P2 has lost<br />

the rally) and the subgoal task deduced by the planner from the goal task according<br />

to the current state of the game is the compound task “comment on rally”. Thus, we



can satisfy the compound task “comment on rally” by performing the BODY of the<br />

planning method if the PRECONDITIONS of the planning method can be satisfied (i.e.<br />

?A is a commentator, ?B is an expert, player ?P1 has played a winning return, player<br />

?P2 has lost the rally, ?A and ?B both have a positive attitude to player ?P1 ). Figure<br />

4.1 also presents an example of a possible dialogue that can be generated by applying<br />

this planning method assuming that the BODY of the planning method consists only of<br />

two planning operators (i.e. not compound tasks), and variables ?P1, ?P2, ?A and ?B<br />

stand for the respective players, commentator, and expert. We have already stated that<br />

all dialogue schemes are based on commentators’ attitudes to the players, nevertheless<br />

the semantics of a dialogue scheme can take one of the forms defined in Table 4.1.<br />

Whilst the left column defines the individual dialogue schemes, the right column presents an<br />

example of a possible generated dialogue for each dialogue scheme.<br />

Dialogue Scheme Example of a Generated Dialogue<br />

A: argument for/against X A: “That serve was really phenomenal!”<br />

B: contrary B: “Well, that is a little exaggerated!”<br />

A: argument for/against X A: “Blake is in great shape as usual.”<br />

B: contrary B: “But he already produced several unforced errors.”<br />

A: override A: “Still, he is the best player on the court.”<br />

A: argue for X A: “Excellent return by Safin.”<br />

B: elaborate on X B: “Unreachable for Blake”.<br />

A: background fact X A: “Blake’s brother Thomas is a well-known player.”<br />

B: evidence of X B: “His best ranking was the 141st place in 2002.”<br />

A: background fact X A: “Roddick has been injured four times recently.”<br />

B: consequence of X B: “It will be hard to break through today.”<br />

Table 4.1: Dialogue Schemes<br />
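
In JSHOP terms, a scheme such as those in Table 4.1 fires only if the PRECONDITIONS of the corresponding planning method hold. A minimal sketch of that test, with invented fact strings standing in for the planner’s world state:<br />

```java
import java.util.List;

// Hedged sketch (not the actual JSHOP machinery): a planning method is
// applicable only if every one of its preconditions is among the facts
// that currently describe the world.
public class MethodMatcher {
    public static boolean applies(List<String> preconditions, List<String> facts) {
        return facts.containsAll(preconditions);
    }
}
```

When several methods are applicable at once, the planner can produce an alternative plan for each of them, which is how the system obtains more than one candidate dialogue.<br />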

Planning Large Dialogue Contributions<br />

We have already shown how to generate a simple dialogue. In the following text, we<br />

will describe how to generate large dialogue contributions that consist of several simple<br />

dialogues. Consider a part of a planning tree that is depicted in Figure 4.2 where all<br />

nodes stand for compound tasks. Imagine that a game has finished and the subgoal task<br />

of the planner deduced from the goal task is the compound task “comment on just fin-<br />

ished game”. Hence, to satisfy the compound task “comment on just finished game”, we<br />

have to satisfy all its compound subtasks, namely: Introduction, Body, and Conclusion.<br />

Similarly, to satisfy the compound task Body, we have to satisfy all its compound sub-<br />

tasks, namely: comment on score, comment on winning team, and comment on losing<br />

team. The decomposition of the compound subtasks comment on winning team and



Figure 4.2: Example of a Compound Task Decomposition<br />

comment on losing team are analogous. Every leaf of the subtree depicted in Figure<br />

4.2 corresponds to at least one planning method that decomposes respective compound<br />

task. The compound task decomposition is accomplished by a planning method that<br />

stands for a dialogue scheme or by a planning method that represents a hierarchy of di-<br />

alogue schemes, i.e., the compound task can be decomposed by a planning method into<br />

several dialogue schemes in dependence on the facts that hold in the current description<br />

of the world (e.g. commentators’ attitudes to the players). The following list presents<br />

a possible generated dialogue that summarizes a game that has just finished (where C<br />

and E stand for a commentator and an expert, respectively).<br />

Introduction<br />

E: “What a relief!”<br />

C : “Tight game, let’s summarize it.”<br />

Comment on Score<br />

C : “Blake and Roddick won the first game.”<br />

E: “That’s unbelievable that they broke opponents’ serve!”<br />

C : “That was spectacular!”<br />

Comment on winning team - Highlights<br />

C : “Blake and Roddick played an excellent game.”<br />

E: “Well, they played several excellent winning returns.”<br />

Comment on winning team - Difficulties<br />

C : “Can you say something about difficulties of Blake and Roddick?”



E: “They were already trailing.”<br />

C : “But they recovered.”<br />

Comment on winning team - Odds<br />

C : “Are Blake and Roddick going to win the match?”<br />

E: “They are my favourites!”<br />

Comment on losing team - Difficulties<br />

C : “What difficulties did Safin and Ferrer have?”<br />

E: “They made many unforced errors.”<br />

Comment on losing team - Odds<br />

C : “Do Safin and Ferrer have any chance to win?”<br />

E: “Well, they can still break through.”<br />

Conclusion<br />

C : “Let’s see the next game.”<br />

E: “Definitely.”<br />
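
The decomposition that produces such a sectioned summary can be sketched as a recursive expansion of compound tasks. The method table below is a toy stand-in for the JSHOP planning domain (task names abbreviated, preconditions and dialogue-scheme choice omitted):<br />

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal illustration of hierarchical decomposition: compound tasks are
// expanded depth-first by their (single) method; tasks without a method
// are treated as primitive and appear in the resulting plan in order.
public class TaskDecomposer {
    private static final Map<String, List<String>> METHODS = new HashMap<>();
    static {
        METHODS.put("finished-game", List.of("introduction", "body", "conclusion"));
        METHODS.put("body", List.of("score", "winning-team", "losing-team"));
    }

    public static List<String> decompose(String task) {
        List<String> plan = new ArrayList<>();
        if (!METHODS.containsKey(task)) {
            plan.add(task); // primitive task: emit directly
            return plan;
        }
        for (String sub : METHODS.get(task)) {
            plan.addAll(decompose(sub));
        }
        return plan;
    }
}
```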

4.1.3 Planning Tree<br />

In this section, we will describe our planning tree that represents the hierarchy of all<br />

dialogues that can be generated. The planning tree is defined as a Hierarchical Task<br />

Network (HTN) in the planning domain of the JSHOP planner (see section 3.1). The<br />

root of the planning tree is the goal task, any internal node of the planning tree is a<br />

compound task (i.e. a possible subgoal task), and every leaf of the planning tree is<br />

either a primitive task that corresponds to a template (that represents an utterance) or<br />

a reference to a particular compound task that is an internal node of the planning tree.<br />

Let us consider Figure 4.3. To satisfy a compound task, we have to either satisfy all its<br />

descendants (1), one arbitrary descendant that can be satisfied (2), or we have to satisfy<br />

the first descendant that can be satisfied (3).<br />

Figure 4.3: Possible Decompositions of a Compound Task
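
The three decomposition modes can be sketched as follows (illustrative only; the actual planner operates on JSHOP methods rather than string lists):<br />

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the three decomposition modes from Figure 4.3: satisfy ALL
// descendants (1), ANY one satisfiable descendant (2), or the FIRST
// descendant that can be satisfied (3).
public class Decomposition {
    public static boolean all(List<String> tasks, Predicate<String> satisfiable) {
        return tasks.stream().allMatch(satisfiable);
    }

    public static boolean any(List<String> tasks, Predicate<String> satisfiable) {
        return tasks.stream().anyMatch(satisfiable);
    }

    // Returns the first satisfiable descendant, or null when none applies.
    public static String first(List<String> tasks, Predicate<String> satisfiable) {
        return tasks.stream().filter(satisfiable).findFirst().orElse(null);
    }
}
```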



The root of our planning tree is the goal task “Comment”. Figure 4.4 depicts how the<br />

goal task “Comment” is decomposed into subgoal tasks depending on the state of<br />

the game, e.g., the presentation team is engaged in dialogues to introduce the upcoming<br />

game if the game is just at the beginning, or it summarizes a rally just after the rally<br />

finishes.<br />

Figure 4.4: Decomposition of the Goal Task “Comment”<br />

Figure 4.5 shows the further decomposition of the compound task Comment on rally<br />

that is a subgoal task of the goal task “Comment”. Thus, our presentation team is<br />

commenting on the result of the last rally depending on its outcome, e.g., the pre-<br />

sentation team can comment on an excellent ace or a winning return played by a player.<br />

Figure 4.5: Decomposition of the Subgoal Task Comment on rally<br />

Figure 4.6 depicts the whole decomposition path from the goal task “Comment” to the<br />

subgoal task “Drop Volley” which results in a commentary on a rally that finished with<br />

a winning return that was a drop volley (i.e. the player won the rally with a ball that he<br />

played before it bounced, placing it just behind the net).



Figure 4.6: Decomposition of the Goal Task “Comment” that leads to a Subgoal<br />

Task Drop Volley<br />

4.1.4 Commentary Excerpt<br />

In this section, we will show an example of a generated dialogue where the players of<br />

the serving team are Blake and Roddick and the players of the receiving team are<br />

Safin and Ferrer. In this example, the dialogues are unbiased, i.e., the attitude of the<br />

commentators is neutral since we would like to show how detailed the commentary can<br />

be supposing that there is enough time to utter it. The state of the game and the<br />

subgoal of the planner are mentioned before each dialogue. Let us note that C stands<br />

for a commentator and E stands for a tennis expert. Another commentary excerpt is<br />

shown in Appendix A.<br />

Beginning - Introduction to the upcoming game<br />

C : “Ladies and Gentlemen! Welcome to the Wimbledon semi-final in doubles.”<br />

E: “We will guide you through the match in which James Blake and Andy Roddick<br />

are playing versus Marat Safin and David Ferrer.”<br />

C : “Enjoy the show!”<br />

Rally in Progress - Serving Player’s Background<br />

C : “Roddick has been injured four times since last year.”<br />

E: “It will be hard to break through today.”<br />

Rally in Progress - Comment on a nice shot<br />

E: “What a shot!”<br />

Rally finished - Summarize the rally (score: 15:0)<br />

C : “What a forehand by Roddick!”<br />



E: “Roddick hit an excellent forehand-volley right into the left corner.”<br />

C : “Roddick took advantage of a weak forehand return from Safin.”<br />

Rally in Progress - Players’ Background<br />

C : “James Blake’s brother Thomas also plays tennis.”<br />

E: “His best ranking was in 2002 when he occupied the 141st place in doubles.”<br />

Rally in Progress - Comment on a nice shot<br />

C : “What a shot by Roddick!”<br />

Rally finished - Summarize the rally (score: 30:0)<br />

C : “What a long rally!”<br />

E: “Ended by an inaccurate backhand-volley by Safin.”<br />

C : “30:0”<br />

E: “Blake and Roddick are holding their serve so far.”<br />

Rally in Progress - Background<br />

E: “The weather is cloudy today.”<br />

C : “Hopefully it won’t be raining.”<br />

Rally finished - Summarize the rally (score: 30:15)<br />

C : “Nice high lob by Safin.”<br />

E: “Too high for Roddick.”<br />

C : “Caused unforced error by Blake.”<br />

4.2 Affect<br />

In the following sections, we will explain why it is important to generate affective com-<br />

mentary on a tennis game and how affect can be conveyed by different modalities.<br />

We will explain two methods that we have employed to generate affective commentary<br />

on a tennis game and discuss the pros and cons of this approach.<br />

4.2.1 Motivation<br />

In this section, we will clarify how important it is to incorporate emotions into the<br />

commentary and how the affect can be expressed. In general, the virtual agents are<br />

better accepted by users if they are endowed with emotions [2]. Different personality<br />

profiles and affect make virtual agents more distinguishable, which is beneficial to the<br />

creation of presentation teams. We were inspired by the concept of the presentation<br />

teams described in section 2.5. Thus, we have employed two distinct virtual agents that



have different roles (commentator, expert), attitudes to the players (positive, neutral,<br />

negative), and personality profiles (defined by: optimistic, choleric, extravert, neurotic,<br />

social). Two affective virtual agents can also better represent opposing opinions and<br />

are more entertaining than only one presenter. Moreover, the user should better recall<br />

conveyed facts.<br />

There can be many exciting moments in a tennis game as well, e.g., to win a tennis game<br />

a player must have at least four points in total and two points more than the opponent,<br />

thus the finish of a tennis game can be quite thrilling since there can be many game and<br />

break points (i.e. situations when the serving or receiving player needs only one point<br />

to win the game). Therefore, our virtual agents should affectively react to the events<br />

that, e.g., lead to the victory of their favourite player or that lower the odds to win. The<br />

current affect of a virtual agent can be expressed by dialogue scheme selection, lexical<br />

selection (i.e. choice of an appropriate utterance according to the current affect), gaze,<br />

facial expression, and hand and body gestures.<br />

4.2.2 Planning with Attitude<br />

In this section, we will describe how a particular affect can be conveyed via the choice<br />

of a corresponding dialogue scheme where a dialogue scheme is a generic definition of<br />

a piece of dialogue (see section 4.1.2). As we have already stated, a virtual agent can<br />

have positive, neutral, or negative attitude to a player. Let us note that almost every<br />

topic of the commentary is related to a specific event (e.g. a player has just scored, a<br />

player has lost the lead). Thus, every such event can be appraised by a virtual agent as<br />

desirable or undesirable according to his/her attitude to the players (e.g. it is desirable<br />

when my favourite player gets a point or undesirable when he loses the lead). Hence, a<br />

virtual agent will comment in a positive way on a desirable event and in a negative way<br />

on an undesirable event. Each event is also usually connected with a particular player,<br />

thus a virtual agent will comment in a positive way on actions of a player s/he likes and<br />

in a negative way on actions of a player s/he dislikes. A virtual agent that has a neutral<br />

attitude to a player will comment in a neutral way on events that are connected with the<br />

respective player.<br />
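
This appraisal rule reduces to a small decision, sketched below; the enum and parameter names are ours, and real events in IVAN may carry more structure than a single boolean:<br />

```java
// Derive the appraisal of an event from the agent's attitude to the
// involved player and from whether the event favours that player.
public class Appraisal {
    public enum Attitude { POSITIVE, NEUTRAL, NEGATIVE }
    public enum Valence { DESIRABLE, NEUTRAL, UNDESIRABLE }

    public static Valence appraise(Attitude toPlayer, boolean eventFavoursPlayer) {
        if (toPlayer == Attitude.NEUTRAL) return Valence.NEUTRAL;
        boolean likes = (toPlayer == Attitude.POSITIVE);
        // A liked player succeeding, or a disliked player failing, is desirable.
        return (likes == eventFavoursPlayer) ? Valence.DESIRABLE : Valence.UNDESIRABLE;
    }
}
```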

Let us consider a dialogue that consists of two utterances that are uttered by two virtual<br />

agents. Let us assume that the dialogue is either related to an event that can be<br />

appraised as positive, neutral, or negative, or the event is related to a player to which a<br />

virtual agent has positive, neutral, or negative attitude. Table 4.2 presents examples of<br />

possible generated dialogues where A and B stand for respective commentators. The first<br />

column represents a particular combination of appraisals of an event or a combination of



attitudes to a player that is related to a particular event. The second column represents<br />

a dialogue scheme of a possible dialogue where X stands for a player’s action or a fact.<br />

The third column represents an example of a generated dialogue.<br />

Appraisal Dialogue Scheme Example of a Generated Dialogue<br />

A: positive A: argue for X A: “Outstanding ace by Blake!”<br />

B: positive B: support X B: “Blake hits blistering serve down the line!”<br />

A: positive A: argue for X A: “Excellent forehand by Safin!”<br />

B: negative B: play down X B: “That’s a bit overstated.”<br />

A: negative A: point out fault X A: “Safin failed to get the ball over the net.”<br />

B: positive B: excuse X B: “Safin just overhit the serve.”<br />

A: neutral A: convey fact X A: “The score is already 30:0.”<br />

B: negative B: consequence of X B: “Safin and Ferrer are real losers as usual!”<br />

A: neutral A: convey fact X A: “Deuce again.”<br />

B: neutral B: elaborate on fact X B: “Safin and Ferrer got back on board.”<br />

Table 4.2: Example of Generated Dialogues based on different Appraisals<br />

Thus, we have shown how a particular affect can be conveyed via the choice of an<br />

appropriate dialogue scheme. Let us note that the pieces of a generated dialogue are<br />

individual utterances where an utterance is usually uttered by a virtual agent in a<br />

particular situation that is correlated with a particular affect. Therefore, we annotated<br />

each utterance with default gesture and facial expression tags to seamlessly convey a<br />

particular affect by an utterance. Nevertheless, these tags are only defaults and can be<br />

substituted by other tags generated by other modules. For instance, the facial expression<br />

can be also set according to the current affective state of a virtual agent generated by<br />

the emotion module that is described in the next section.<br />
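
One possible (invented) shape for such an annotated utterance template, with default tags that another module may override:<br />

```java
// Sketch of an utterance template annotated with default gesture and
// facial-expression tags; the tag vocabulary is hypothetical.
public class Utterance {
    private final String text;
    private final String gestureTag;
    private String faceTag;

    public Utterance(String text, String defaultGesture, String defaultFace) {
        this.text = text;
        this.gestureTag = defaultGesture;
        this.faceTag = defaultFace;
    }

    // E.g. the emotion module substitutes the default facial expression.
    public void overrideFace(String tag) {
        this.faceTag = tag;
    }

    public String render() {
        return "[" + gestureTag + "][" + faceTag + "] " + text;
    }
}
```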

4.2.3 OCC Generated Emotions<br />

In this section, we will describe the emotion module that models the affective state<br />

of each virtual agent according to the OCC (Ortony, Collins, Clore) cognitive model<br />

of emotions [36, 37]. We simulate eight basic OCC emotions that are relevant to the<br />

tennis commentary. These emotions are explained in Table 4.3. The emotion module is<br />

initialized with the personality of each virtual agent that is defined by five personality<br />

traits listed in Table 4.4.



OCC Emotion Description<br />

JOY Something happened that I wanted to happen.<br />

DISTRESS Something happened that I did not want to happen.<br />

HOPE Something may happen that I really want to occur.<br />

FEAR Something may happen that I wish to never occur.<br />

RELIEF Something bad did not happen.<br />

DISAPPOINTMENT Something did not happen that I really wanted to occur.<br />

SATISFACTION Something happened that I really wanted to occur.<br />

FEAR-CONFIRMED Something bad did actually happen.<br />

Table 4.3: Description of the eight Basic OCC Emotions<br />

Personality Trait<br />

optimistic<br />

choleric<br />

extravert<br />

neurotic<br />

social<br />

Table 4.4: Five Personality Traits<br />

The input of the emotion module consists of facts that our system deduces from the elementary<br />

events received from the tennis game. The main functionality of the emotion module 4 is<br />

implemented in Jess (see section 3.2). The goals and antigoals of a virtual agent are<br />

deduced from his/her attitude to the players, e.g., virtual agent A that has a positive<br />

attitude to player P wants P to win the game, conversely, virtual agent B that has a<br />

negative attitude to player P wants P to lose the game. The events that happen in the<br />

tennis game are appraised as desirable if they lead to the goal or undesirable if they<br />

hinder the goal. The conditions that elicit emotions based on the events that happen in<br />

the tennis game are called emotion eliciting conditions. The appraisals of the emotion<br />

eliciting conditions then generate particular emotions with respective intensities where<br />

the initial intensity of a particular emotion depends on the personality of the respective<br />

virtual agent. The affective state of a virtual agent is represented by a vector of<br />

emotion intensities where, for instance, the emotion with the highest intensity can be<br />

considered as the output of the emotion module. Since the emotions decay over time,<br />

the emotion module maintains the emotion decay using, e.g., a linear decay function.<br />

Table 4.5 shows examples of events that elicit respective emotions.<br />

4 The definitions of the OCC emotions (in source file occ.clp) were provided by Michael Kipp (<strong>DFKI</strong>).



OCC Emotion Event<br />

JOY My favourite player scored.<br />

DISTRESS My favourite player lost a point.<br />

HOPE My favourite player is now leading.<br />

FEAR My favourite player is now trailing.<br />

RELIEF My favourite player settled the score.<br />

DISAPPOINTMENT My favourite player lost the lead.<br />

SATISFACTION My favourite player won the game.<br />

FEAR-CONFIRMED My favourite player lost the game.<br />

Table 4.5: Example of Events that elicit respective Emotions<br />
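
Elicitation, decay, and output of the emotion module can be sketched as follows; the uniform linear decay rate and the use of the maximum when an emotion is re-elicited are our illustrative choices, not the thesis values:<br />

```java
import java.util.HashMap;
import java.util.Map;

// Simplified emotion vector with linear decay; the prevailing emotion
// serves as the module's output.
public class EmotionState {
    private final Map<String, Double> intensity = new HashMap<>();

    // Raise an emotion when its eliciting condition is appraised; keep the
    // stronger value if the emotion is already active.
    public void elicit(String emotion, double initialIntensity) {
        intensity.merge(emotion, initialIntensity, Math::max);
    }

    // Linear decay applied once per tick (e.g. every second).
    public void decay(double rate) {
        intensity.replaceAll((emotion, value) -> Math.max(0.0, value - rate));
    }

    // The emotion with the highest non-zero intensity, or null when all decayed.
    public String dominant() {
        return intensity.entrySet().stream()
                .filter(e -> e.getValue() > 0.0)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```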

Figure 4.7 depicts the GUI of the emotion module. The left part of the chart depicts<br />

the current intensities of respective emotions for the first virtual agent and the right<br />

part of the chart depicts corresponding data for the second virtual agent. The dynamic<br />

bar chart was created using the JFreeChart 5 library. There is also a log for each virtual<br />

agent that lists all events that have caused a particular emotion from the beginning of<br />

the tennis game. (Let us remark that Figure 4.7 depicts only the last two events.) Each<br />

log entry consists of the emotion name, initial intensity, and the cause description.<br />

Figure 4.7: Emotion Module GUI<br />

5 Andreas Viklund. The JFreeChart Class Library. http://www.jfree.org/jfreechart/



The output of the emotion module is currently employed to set and update the facial<br />

expression of each virtual agent every second. Nevertheless, it could also be used for the<br />

gesture and lexical selection or as an input of the planner (if we had dialogue schemes<br />

based on the OCC emotions).<br />

4.2.4 Discussion<br />

In this section, we will explain why we have employed two methods to simulate emotions<br />

and which other options we have considered. As we have already stated, all dialogue<br />

schemes are based on virtual agents’ attitudes to the players. Nevertheless, we could<br />

have based the dialogue schemes also on the virtual agents’ current emotions. In this<br />

case, we would have first derived the current emotion for each virtual agent and then<br />

we would have tried to find an appropriate dialogue scheme. Nevertheless, in this case,<br />

we would have had to face a substantial growth in the number of dialogue schemes<br />

and a subsequent growth in the number of templates that represent individual<br />

utterances, since we would have needed dialogue schemes for every meaningful combination<br />

of emotions that the virtual agents can have.<br />

However, we noticed that the positive appraisals usually correspond to emotions such as<br />

joy, hope, satisfaction, and relief, and that the negative appraisals usually correspond<br />

to emotions such as distress, disappointment, fear, and fear-confirmed. Therefore, we<br />

could simplify the design of the planning domain and base the dialogue schemes only on<br />

virtual agents’ attitudes to the players and derive the specific emotion in a separate emotion<br />

module. Such a specific emotion can be expressed by other modalities (e.g. facial<br />

expression, gaze, gestures, lexical selection) other than the dialogue scheme selection.<br />

Nevertheless, if we had had the specific emotion of each virtual agent as an input of the<br />

planner, we could have also generated plans where the emotions could have changed at<br />

some point as a reaction to what the other agent would have said. However, this option<br />

is not useful in our case since both virtual agents share the same knowledge about the<br />

tennis game, and the emotion of a virtual agent should correspond to the current state<br />

of the game and not substantially change, for instance, from joy to distress if the virtual<br />

agent’s favourite player is winning but the other virtual agent has just said something<br />

bad about the winner.<br />

Nevertheless, the option to change the emotion at some point of a plan would be useful if<br />

the virtual agents had different knowledge about the tennis game such that an utterance<br />

uttered by one virtual agent could have substantially changed the emotion of the other<br />

virtual agent (e.g. one virtual agent would have made the other virtual agent happy if<br />

s/he had told him/her that his/her favourite player had just won the game). To change



the emotion at some point in a plan would also be useful if the plans were longer, which<br />

in our case is only the commentary on a just finished game, but it is hard to imagine<br />

that a virtual agent that is very happy because his favourite player has just won the<br />

game would have changed his/her emotion, e.g., from joy to distress because the other<br />

virtual agent said something bad about a player s/he likes.<br />

We have written a separate emotion module since we wanted to simulate the emotional<br />

state of each virtual agent more precisely, e.g., we wanted to maintain the emotion decay<br />

which would be infeasible in the planner. We could also have used some off-the-shelf<br />

software to simulate the emotional state of each virtual agent. Nevertheless, we wanted<br />

to simulate the emotions in a transparent way so that we could clearly see which event<br />

had elicited which emotion and which emotion currently prevailed. We also wanted to<br />

have full control over the module (i.e. we can adjust the computation of the initial<br />

intensities of individual OCC emotions in dependence on the personality, we can define<br />

our decay function, and we have control over the input and output tags). Therefore, we<br />

did not use any “black box” such as ALMA [15], although ALMA is in general a good<br />

choice to simulate the affective state of a virtual agent since it additionally maintains<br />

the history and emotion blending.<br />

The emotion module and the planner run independently. The planner cannot update the<br />

emotion module since not every plan that is generated is also executed. Additionally,<br />

the time of the plan generation and the time of the plan execution are different. The<br />

emotion module could have passed the current emotional states of the virtual agents to<br />

the planner, nevertheless we do not need the exact emotional state of the virtual agents<br />

in the planner since our dialogue schemes are based only on virtual agents’ attitudes to<br />

the players.


Chapter 5<br />

Architecture<br />

In this chapter, we will introduce individual modules of our system and describe how<br />

they cooperate to generate a commentary on a tennis game for our presentation team<br />

based on elementary events that are produced by a tennis simulator in real-time. The<br />

system consists of several modules that are running in separate threads and communicate<br />

via shared queues. For each module, we will describe its task and how it communicates<br />

with other modules, i.e., what the input and output of a particular module are. First,<br />

we will introduce the tennis simulator that produces elementary events (e.g. a player<br />

plays a forehand, the ball crosses the net, the ball lands out). Then, we will describe<br />

the plan generation, i.e., how we generate plans based on the knowledge deduced from<br />

the elementary events received from the tennis simulator, where a plan represents a particular<br />

dialogue. Afterwards, we will explain how these generated plans are executed, i.e., how<br />

we select plans from all the plans generated in the previous step. Our presentation team<br />

is then engaged in dialogues that correspond to the selected plans.<br />

5.1 System Overview<br />

In the following sections, we will present the main design aims, introduce the overall<br />

architecture of the system, and present the off-the-shelf components that are employed<br />

in the system. We will discuss the advantages of the system’s modular architecture,<br />

how to ensure reactivity, and the need for extensibility. Finally, we will<br />

briefly introduce individual modules of the system and how they cooperate to produce<br />

a commentary on a tennis game.<br />



Chapter 5. Architecture 42<br />

5.1.1 Design Aims<br />

The system was designed with three main design aims, namely: modularity, reactivity,<br />

and extensibility, that will be described below.<br />

Modularity<br />

The overall system is broken down into individual modules, where each module provides<br />

a clearly defined interface and functionality. Each module runs in a separate thread<br />

and asynchronously communicates with other modules via shared queues. This approach<br />

is advantageous since each module can be tested separately and possibly replaced by<br />

another module that implements the same interface.<br />
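
This communication pattern can be sketched with Java’s standard BlockingQueue; the module and event names are illustrative:<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two modules in separate threads sharing a queue: a "simulator" thread
// posts events while the consumer blocks on take() until events arrive.
public class QueuedModules {
    public static List<String> relay(List<String> produced) {
        BlockingQueue<String> shared = new ArrayBlockingQueue<>(16);
        Thread simulator = new Thread(() -> {
            for (String event : produced) {
                try {
                    shared.put(event); // blocks if the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        simulator.start();
        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < produced.size(); i++) {
                received.add(shared.take()); // blocks until an event arrives
            }
            simulator.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }
}
```

A bounded queue gives backpressure for free: put blocks when the queue is full and take blocks when it is empty, which keeps the modules decoupled while bounding memory.<br />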

Reactivity<br />

The system should be able to react quickly to new events. Evidently, reactivity is<br />

closely related to modularity, which facilitates not only parallel execution on multi-<br />

core platforms but also the possibility of interruptions, i.e., one module can cause the<br />

interruption of another module by sending an asynchronous message. The response time<br />

of each module must be reasonably bound as well.<br />

Extensibility<br />

Since we wanted to participate in GALA 2009 (see section 1.2), we had to rapidly<br />

develop a demo application at that time. As a consequence, the overall design had to<br />

allow for a simple initial implementation and subsequent refinement. This aim<br />

is also related to the modularity since individual modules can be added, replaced, or<br />

separately improved.<br />

5.1.2 System Architecture<br />

In the following text, we will briefly explain how we generate the commentary of our presentation team based on the elementary events (e.g. a player serves, the ball hits the net) produced by the tennis simulator. We will introduce the individual modules of the system and describe how they communicate. Figure 5.1 depicts the overall architecture of the IVAN system and Figure 5.2 describes the dataflow, which starts with the elementary events produced by the tennis simulator and ends with the multimodal output rendered by the Charamel avatar engine (see section 5.1.3).

The tennis simulator sends elementary events to the event manager. The event manager receives these elementary events (such as the ball crossing the net or the ball bouncing) and deduces low-level facts (e.g. a rally has finished). These derived low-level

facts are stored in the knowledge base.

Figure 5.1: IVAN Architecture

Figure 5.2: Dataflow

The event manager also decides when to run the discourse planner, based on the global state of the game. In other words, the event manager has the role of a perception unit: it receives events from the outside world and maintains a coherent representation of it in the form of the knowledge base. The discourse

planner, triggered by the event manager, gets facts from the knowledge base, generates all possible plans, and passes them to the output manager, where each plan represents a possible dialogue. Some facts can also be deduced during the planning process and stored in the knowledge base (e.g. statistics used to generate the commentary that summarizes the game). The output manager maintains the plan execution: it chooses one plan to execute, matches planning operators with templates, adds gesture annotations, and sends appropriate



commands to the avatar manager, which transforms them into avatar-engine-specific commands. More precisely, each planning operator is mapped onto a template, where a template represents a set of possible annotated utterances. Thus, a planning operator is resolved to an annotated utterance chosen at random among all utterances that correspond to the respective template. Furthermore, the avatar manager maintains the state of the dialogue (e.g. who is speaking at the moment or how long it will take to finish the current utterance), which can be used, for instance, to decide when to interrupt the current discourse.

There is also the emotion module, which separately maintains the emotional state of each virtual agent. For instance, the facial expression of each virtual agent is updated every second according to the current emotional state that is stored in the knowledge base. Let us note that the knowledge base also contains background facts about the game and players, the virtual agents' roles (commentator or expert), personality profiles, and attitudes (positive, neutral, or negative) towards the players.

5.1.3 Off-the-shelf Components<br />

We have used two commercial products as the audio-visual components of the system: Charamel 1 to visualize the virtual agents and RealSpeak Solo 2 as the text-to-speech (TTS) engine. We will describe both software toolkits in the following paragraphs.

Charamel Avatar Engine<br />

Charamel is a standalone application that communicates via a socket and can visualize several virtual agents at the same time. Individual virtual agents are controlled via the scripting language CharaScript. The virtual agents can express 14 different facial expressions (e.g. smile, happy, disappointed, angry, sad) with varying intensities. Their lip movement is synchronized with speech produced by the RealSpeak Solo TTS. The virtual agents can play back around one hundred pre-fabricated gesture clips that can be tweaked using many different parameters (e.g. velocity, start time, end time, interpolation time). Moreover, the transitions between any two consecutive gestures or facial expressions are interpolated; the virtual agents also perform idle gestures while no other gestures are triggered, in order to look natural. Figure 5.3 depicts the two Charamel virtual agents Mark and Gloria that were employed in the system.

1 http://www.charamel.com/<br />

2 http://www.nuance.com/realspeak/solo/



Figure 5.3: Charamel Virtual Agents Mark and Gloria

RealSpeak Solo TTS Engine

RealSpeak Solo is a TTS engine that gets commands from Charamel to vocalize the desired utterances. While the TTS engine vocalizes an utterance, it sends tags back to Charamel, which enables synchronized lip movement of the speaking virtual agent. RealSpeak Solo supports several male and female voices. We employed the British female voice Serena for the Charamel virtual agent Gloria and the American male voice Tom for Mark.

5.2 Tennis Simulator<br />

The GALA 2009 challenge was given as a static ANVIL file that describes a tennis game (see section 1.2). Since we wanted to test our system as if it were a real-time application, we wrote a tennis simulator that first reads an ANVIL file and then simulates the game in real time. Although we consider the tennis simulator a part of our system, it can easily be reused in other systems since it communicates via a socket. Moreover, only a minor modification is needed to simulate any game that is given as an ANVIL file (with a corresponding video). In the following text, we will describe our tennis simulator in detail.

The architecture of the tennis simulator is shown in Figure 5.4. The tennis simulator first reads a video file and its annotation, which is stored in an ANVIL file. The video is



Figure 5.4: Tennis Simulator<br />

opened in a video player that is implemented using the Java Media Framework API 3; the timestamped events, read from the ANVIL file, are stored in a priority queue. When the simulator is started, it sends the events to a socket one by one at the times they occur. Since the time of the simulation is determined by the video player, it is possible to pause the simulation or to skip forward. It is also possible to fire one of the pre-defined question events at any time.
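The priority-queue dispatch described above can be sketched as follows; the Event type and field names are illustrative, not the simulator's actual classes.

```java
import java.util.PriorityQueue;

// Sketch of the simulator's event dispatch: timestamped events are kept in a
// priority queue ordered by time and fired once the (video) clock reaches them.
public class EventDispatchSketch {
    public static class Event implements Comparable<Event> {
        public final long timeMs;
        public final String name;

        public Event(long timeMs, String name) {
            this.timeMs = timeMs;
            this.name = name;
        }

        @Override
        public int compareTo(Event other) {
            return Long.compare(timeMs, other.timeMs);
        }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();

    public void add(Event e) {
        queue.add(e);
    }

    /** Returns the next event due at or before the given clock time, or null. */
    public Event poll(long clockMs) {
        Event head = queue.peek();
        if (head != null && head.timeMs <= clockMs) {
            return queue.poll();
        }
        return null;
    }
}
```

Driving `poll` from the video player's clock rather than wall time is what makes pausing and skipping forward work for free.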

Figure 5.5 shows the GUI of the tennis simulator. A user first chooses an input file. S/he can decide whether the video will be displayed in the video player and whether the start of the simulation will be postponed or moved forward; then the simulation can be started.

Figure 5.5: Tennis Simulator GUI<br />

3 http://java.sun.com/javase/technologies/desktop/media/jmf/



5.3 Plan Generation<br />

In this section, we will describe how we generate plans that correspond to possible dialogues from the elementary events generated by the tennis simulator. Figure 5.6 highlights in color the part of the system that is responsible for the plan generation, and Figure 5.7 shows which part of the dataflow is covered in this section. First, we will describe the event manager, which receives elementary events from the tennis simulator, deduces low-level facts from them, and stores these facts in the knowledge base, where the low-level facts, together with the background knowledge, the virtual agents' roles, personality profiles, and attitudes towards the players, form a coherent representation of the outside world. Then, we will describe the discourse planner, which is triggered by the event manager, gets facts from the knowledge base, and outputs all possible plans; these plans are subsequently passed to the output manager, which maintains the plan execution described in section 5.4.

Figure 5.6: IVAN Architecture - Plan Generation



Figure 5.7: Dataflow - Plan Generation

5.3.1 Event Manager

In this section, we will describe the event manager, which has the role of a “perception unit” since it receives events from the outside world and maintains a coherent representation of it in the knowledge base. More precisely, the event manager receives elementary events from the tennis simulator and deduces low-level facts that are stored in the knowledge base. It also maintains the overall state and score of the match and decides when to run the discourse planner. The elementary events (e.g. a player plays a backhand, the ball lands out) that the event manager receives from the tennis simulator were defined in detail in the GALA 2009 scenario (see section 1.2); moreover, an elementary event can also be a pre-defined question event issued by the user. Let us recall that a tennis match consists of sets, a set consists of games, and a game consists of rallies. However, for the sake of simplicity, we consider only one tennis game. Since we cannot run the discourse planner every time we receive an elementary event, we first describe the basic states of the tennis game, which are modelled using finite state machines, and then identify at which states we run the discourse planner. After that, we explain which low-level facts are deduced by the event manager, stored in the knowledge base, and subsequently made available to the discourse planner.

States<br />

The two finite state machines that we have employed to model the basic states of the tennis game are depicted in Figure 5.8. Both finite state machines run in parallel; the initial state is marked in red and the transitions correspond to particular sequences of elementary events.

Let us first look at the finite state machine on the left side. We start in the state beginning; after a player throws a ball to serve, we move to the state game in progress; and after the game finishes, we move to the state game finished. The state machine

on the right side starts in the state game not in progress.

Figure 5.8: States of the Tennis Game

After a player throws a ball to serve, we move to the state rally beginning. A player can throw a ball several times before he actually serves, but once he serves we move to the state rally in progress. After the ball hits the net, lands out, or bounces twice, we reach the state rally finished. Then, if the game is finished, we return to the state game not in progress; otherwise we wait until a player throws a ball to serve and move to the state rally beginning. Both finite state machines could be merged into one, but keeping them separate makes them easier to understand. Two facts derived from the respective finite state machines are also stored in the knowledge base.
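The right-hand state machine of Figure 5.8 can be sketched as a Java enum-based transition function; the state and event names below paraphrase the text and are not IVAN's actual identifiers.

```java
// Sketch of the rally-level finite state machine described above.
public class RallyStateMachine {
    public enum State { GAME_NOT_IN_PROGRESS, RALLY_BEGINNING, RALLY_IN_PROGRESS, RALLY_FINISHED }
    public enum Event { BALL_THROWN, SERVE, BALL_HITS_NET, BALL_LANDS_OUT, BALL_BOUNCES_TWICE, GAME_FINISHED }

    private State state = State.GAME_NOT_IN_PROGRESS;

    public State getState() { return state; }

    /** Applies one elementary event and returns the resulting state. */
    public State handle(Event e) {
        switch (state) {
            case GAME_NOT_IN_PROGRESS:
                if (e == Event.BALL_THROWN) state = State.RALLY_BEGINNING;
                break;
            case RALLY_BEGINNING:
                // a player may throw the ball several times before actually serving
                if (e == Event.SERVE) state = State.RALLY_IN_PROGRESS;
                break;
            case RALLY_IN_PROGRESS:
                if (e == Event.BALL_HITS_NET || e == Event.BALL_LANDS_OUT
                        || e == Event.BALL_BOUNCES_TWICE) state = State.RALLY_FINISHED;
                break;
            case RALLY_FINISHED:
                if (e == Event.GAME_FINISHED) state = State.GAME_NOT_IN_PROGRESS;
                else if (e == Event.BALL_THROWN) state = State.RALLY_BEGINNING;
                break;
        }
        return state;
    }
}
```

Repeated BALL_THROWN events in RALLY_BEGINNING leave the state unchanged, matching the observation that a player may throw the ball several times before serving.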

The event manager triggers the discourse planner at certain states of the tennis game. The following list shows the specific states at which the discourse planner is triggered, together with some examples of goals that the discourse planner can derive at the respective states. (Let us note that additional states could be added if desired.)

• beginning - do some introduction to the upcoming game<br />

• rally finished - summarize just finished rally<br />

• game finished - discuss just finished game<br />

• rally beginning & a player has thrown the ball already twice - a player is nervous,<br />

a player concentrates<br />

• rally in progress - comment on the serving player’s background<br />

• rally in progress & a volley or a smash was played - nice shot, risky shot<br />

• rally in progress & the ball hit the tape - luck, inaccuracy<br />

• a question event occurred - answer the question



Score<br />

The score of the game is also maintained in the event manager using a point counter for<br />

each player and a finite state machine depicted in Figure 5.9. If a player wins a rally<br />

s/he gets one point. A player wins the game if he has at least 4 points in total and at<br />

least 2 points more than the opponent. After both players reach at least 3 points and<br />

the game is not over yet, the score is either deuce or advantage. Table 5.1 explains how<br />

the tennis score is counted for one player in the tennis terminology. Let us note that<br />

the same player is serving within one game and that the score is read with the serving<br />

player’s score first.<br />
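The counting rules above can be sketched as a small Java class; the class and method names are illustrative, not the event manager's actual code.

```java
// Sketch of tennis score calling for a single game, following the rules in the
// text and the terminology of Table 5.1 (player 1 is assumed to serve).
public class TennisScore {
    private int p1, p2;  // points won by each player

    public void pointTo(int player) {
        if (player == 1) p1++; else p2++;
    }

    public String call() {
        // a player wins with at least 4 points and a 2-point margin
        if ((p1 >= 4 || p2 >= 4) && Math.abs(p1 - p2) >= 2) {
            return "game " + (p1 > p2 ? "player 1" : "player 2");
        }
        if (p1 >= 3 && p2 >= 3) {
            if (p1 == p2) return "deuce";
            return "advantage " + (p1 > p2 ? "player 1" : "player 2");
        }
        // the serving player's score is read first
        return term(p1) + "-" + term(p2);
    }

    private static String term(int points) {
        switch (points) {
            case 0: return "love";
            case 1: return "fifteen";
            case 2: return "thirty";
            default: return "forty";
        }
    }
}
```

Keeping the two point counters alongside the finite state machine of Figure 5.9 gives the event manager both the current call and the raw counts to derive score facts from.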

Figure 5.9: Tennis Score Counting using a Finite State Machine<br />

Score Explanation<br />

“love/zero” 0 points<br />

“fifteen” 1 point<br />

“thirty” 2 points<br />

“forty” 3 points<br />

“deuce” at least 3 points have been scored by each player, scores are equal<br />

“advantage” for the leading player: at least 3 points have been scored by each player and one player has one point more

Table 5.1: Description of the Tennis Counting Terminology

Facts

We will now explain which low-level facts are deduced by the event manager from the elementary events and stored in the knowledge base. The reason why we perform the deduction of the low-level facts at this level, in the event manager, is that it substantially facilitates the planning domain design. Working with the elementary events in the planning domain would be quite cumbersome and unsuitable if we want to achieve reasonable latency. As we have already mentioned, the state of the game and the score are maintained in the event manager, thus the respective facts are also stored in the knowledge base. While the knowledge base contains only the current state of the game, it contains all facts that describe the score from the beginning of the game. To distinguish between individual score facts and to rank them, we introduce the concept of score generations, i.e., the first score fact has generation 0, the second score fact has generation 1, etc. From consecutive score facts we can deduce, e.g., whether a player has lost the lead or equalized. (Let us note that the concept of generations is often used in computer science to distinguish among data that originate at consecutive steps of an algorithm.)

Rally Snapshots<br />

All events that occur in the tennis game are partitioned into so-called rally snapshots. We will now describe which low-level facts are derived from a rally snapshot and stored in the knowledge base. Each rally snapshot has a generation that is defined similarly to the score generation. (Let us note that the rally generation and the score generation differ in general since, e.g., the first fault is a rally without a score change.) The low-level facts are deduced for each rally snapshot and stored in the knowledge base. If the planner is triggered in the middle of a rally, the knowledge base contains only the facts deduced from the elementary events of the current, incomplete rally snapshot. The following list outlines which specific low-level facts are deduced from a rally snapshot and stored in the knowledge base:

• how many times the ball crossed the net

• a list of the heights at which the ball crossed the net

• a list of pairs (player, shot) ordered from the beginning of the rally to its end

• the position where the last ball that was in the field first bounced

• the position where the last ball that was out bounced

• whether the ball crossed the net before it landed out

• which player missed the last ball

• how many times the serving player had thrown the ball before he served

Table 5.2 contains three examples that show which high-level facts can be deduced from<br />

the low-level facts listed above. Figure 5.10 depicts a hierarchy of facts that shows how<br />

an ace can be deduced.



high-level fact a list of low-level facts

ace the ball crossed the net once, bounced in the field, state - rally finished

lob the ball crossed the net at a high position, bounced at the baseline

drop the ball crossed the net at a low position, bounced at the net

Table 5.2: Example of high-level facts deduced from low-level facts

Figure 5.10: Hierarchy of Facts from which an Ace can be deduced<br />
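The ace rule of Figure 5.10 reduces to a conjunction of three low-level facts; the parameter names in the following sketch are illustrative.

```java
// Sketch of the ace deduction: an ace holds when the serve crossed the net
// exactly once, bounced in the field, and the rally finished on that serve.
public class AceRule {
    public static boolean isAce(int netCrossings, boolean bouncedInField, boolean rallyFinished) {
        return netCrossings == 1 && bouncedInField && rallyFinished;
    }
}
```

The lob and drop rules in Table 5.2 have the same shape: each high-level fact is a conjunction over the rally-snapshot facts listed above, which is what makes them convenient preconditions for the planning domain.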

Comparison to Related Work<br />

The event manager is to some extent similar to STEVE's perception module (see section 2.4) since it also maintains a coherent representation of the state of the world. Our approach is also similar to SceneMaker (see section 3.3), which employs statecharts to control virtual agents, with the difference that while SceneMaker performs, e.g., a pre-defined scene (i.e. a dialogue where utterances are annotated with gestures) at a certain state, we run the planner to generate the scene.

5.3.2 Background Knowledge<br />

The background knowledge about the players and the game is incorporated to produce<br />

commentary when, for instance, there is currently nothing else to comment on. We will<br />

show some examples of background facts that are stored in the knowledge base. The<br />

background knowledge is stored in several static CSV (Comma Separated Values) files<br />

that could be alternatively replaced with a relational database. After the system starts,<br />

all CSV files are read and the background knowledge they contain is transformed into facts that are stored in the knowledge base. Table 5.3 shows some examples of facts that can be deduced from the background knowledge.

Background knowledge Example of a deduced fact

Player’s details A sister of a player is also a tennis professional.<br />

Ranking A player is leading the ATP score.<br />

Style A player is playing risky as usual.<br />

Injury A player has been four times injured recently.<br />

Player’s results A player won two matches in a row.<br />

Tournament details The tournament is played in London on grass.

Table 5.3: Examples of Facts deduced from the Background Knowledge<br />
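Turning one CSV line into facts might look as follows; the column layout ("name,ranking,style") and the Lisp-like fact syntax are assumptions for illustration, not IVAN's actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of transforming one line of a background-knowledge CSV file
// into facts for the knowledge base.
public class BackgroundFacts {
    public static List<String> factsFromCsvLine(String line) {
        String[] cols = line.split(",");
        List<String> facts = new ArrayList<>();
        // hypothetical columns: player name, ATP ranking, playing style
        facts.add("(ranking " + cols[0] + " " + cols[1] + ")");
        facts.add("(style " + cols[0] + " " + cols[2] + ")");
        return facts;
    }
}
```

Since the files are static, this transformation runs once at startup; swapping the CSV reader for a relational database, as suggested above, would only change this loading step.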

5.3.3 Discourse Planner<br />

The discourse planner is responsible for the plan generation, where a plan represents a dialogue. The discourse planner is triggered by the event manager at particular states of the game. It gets all facts from the knowledge base and outputs all possible plans, which are subsequently passed to the output manager. We will describe the input of the planner, the planner itself, and the representation of the planner's output. Let us note that the concept of the dialogue generation has already been described in Chapter 4.

Input<br />

The input of the planner consists of a planning task and a list of facts that describe<br />

the initial state of the world. The planning task is the same all the time, namely, the<br />

compound task “comment”, since the planner decides each time what it should comment<br />

on according to the supplied facts. The list of facts varies and contains all the facts that<br />

are stored in the knowledge base, i.e., it contains the following types of facts:<br />

• the current state of the game<br />

• scores of the game<br />

• rally snapshots<br />

• background knowledge (see section 5.3.2)<br />

• commentators’ (positive, neutral, negative) attitudes to the players<br />

• roles (commentator, expert)<br />

• a question (a fact identifying that there is a question to be answered)



The Planner<br />

We have employed JSHOP (Java Simple Hierarchical Ordered Planner) as the HTN planner that produces the commentary on a tennis game. See section 3.1 for more details on JSHOP. As described above, the planner gets its input in the form of a problem description and outputs all possible plans. How these plans are generated has already been described in detail in Chapter 4. Since JSHOP is an offline planner, we had to modify it to run online. In the following, we will describe what makes JSHOP an offline planner, how we modified it to run online, and how JSHOP could have been employed without modification, since we also considered and implemented this option.

JSHOP as an Offline Planner - The drawback of JSHOP is that it requires the problem description to be generated and compiled prior to running the planner, assuming that the problem description changes whereas the domain description remains the same during the system run. As we can see, there is a costly compilation step before each run of the planner. See section 3.1, where we explained the JSHOP input generation process in detail. Let us also note that the planner does not have its own working memory, in the sense that every time it is run, all facts have to be supplied again.

JSHOP as an Online Planner - We investigated how the problem description Java file is generated from the JSHOP problem file and found a way to bypass the compilation step described above. We have written a universal problem description Java class that is compiled only once and fully replaces the problem description Java file that would be generated by JSHOP, i.e., an instance of this class accepts the discourse planner's problem description as Java objects and serves as the input of JSHOP as if the problem file had been generated by JSHOP. This approach is fast: the plan generation takes only about 50-150 ms.

Alternative Use of JSHOP as an Online Planner - JSHOP can also be used as an online planner without modification. However, this approach is quite costly since the compilation step takes about 1 second each time and also consumes a lot of CPU resources. Figure 5.11 shows the individual steps of this alternative approach, which will be described below. The discourse planner uses its own problem description representation, which is first transformed into the JSHOP problem file (using a special Lisp-like syntax); then the respective Java file is generated and compiled. After that, we make use of a convenient Java feature, namely that one class implementation can be replaced by another at runtime, i.e., one *.class file can be replaced by another during the system run. Thus, at the end of the process depicted in Figure 5.11, we have a *.class representation of the problem description and the planner can be started.

Let us note that we use this approach to compile the domain description once at the beginning, when the system starts. In this case, the process starts with the JSHOP domain file, from which the corresponding Java file is generated, compiled, and replaced at runtime.
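The compile-and-load step can be sketched with the standard javax.tools API; this is an illustrative reconstruction (it requires a JDK, not just a JRE, at runtime), not the thesis's actual code, and the class and method names are invented.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.File;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of generating a Java source file, compiling it at runtime, and
// loading the fresh *.class with a new class loader.
public class RuntimeReloadSketch {
    public static Object compileAndLoad(String className, String source) throws Exception {
        Path dir = Files.createTempDirectory("jshop");
        File src = new File(dir.toFile(), className + ".java");
        try (PrintWriter out = new PrintWriter(src)) {
            out.print(source);
        }

        // requires a JDK; on a plain JRE getSystemJavaCompiler() returns null
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        javac.run(null, null, null, src.getPath());

        // a fresh class loader picks up the newly compiled *.class file
        try (URLClassLoader loader = new URLClassLoader(new URL[]{ dir.toUri().toURL() })) {
            return loader.loadClass(className).getDeclaredConstructor().newInstance();
        }
    }
}
```

Using a fresh class loader per compilation is what effectively "replaces" one *.class by another during the system run, since the old definition simply becomes unreachable.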

Figure 5.11: JSHOP Input Generation Process

Output

The output of the planner is the so-called planning response, which contains: a list of all possible plans, the time when the planner was triggered, and the respective state of the game. Each plan in the list contains: a priority, a semantic token, and a list of planning operators. The semantic tokens are strings that identify plans. For instance, the semantic tokens can be used to avoid repetitions, where we disallow consecutive execution of two plans with the same semantic token. The list of planning operators corresponds to a dialogue, where each planning operator stands for one template (which corresponds to an utterance). Moreover, some facts can also be deduced during the planning process and stored in the knowledge base for the next run of the planner; for instance, statistics that summarize the game (e.g. the number of outs, winning returns, and aces for each player). These facts can then be used, for instance, to generate the commentary on a just finished game.
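The planning response described above can be sketched as a plain data structure; the field names paraphrase the text and are not IVAN's actual identifiers.

```java
import java.util.List;

// Sketch of the planner's output: a planning response bundling all possible
// plans with the trigger time and the game state they were generated for.
public class PlanningResponse {
    public final long triggeredAtMillis;
    public final String gameState;
    public final List<Plan> plans;

    public PlanningResponse(long triggeredAtMillis, String gameState, List<Plan> plans) {
        this.triggeredAtMillis = triggeredAtMillis;
        this.gameState = gameState;
        this.plans = plans;
    }

    public static class Plan {
        public final int priority;
        public final String semanticToken;   // identifies the plan's content
        public final List<String> operators; // each operator maps onto one template

        public Plan(int priority, String semanticToken, List<String> operators) {
            this.priority = priority;
            this.semanticToken = semanticToken;
            this.operators = operators;
        }
    }

    // the repetition rule from the text: never execute two plans with the
    // same semantic token in a row
    public static boolean allowedAfter(Plan candidate, Plan previous) {
        return previous == null || !previous.semanticToken.equals(candidate.semanticToken);
    }
}
```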

5.4 Plan Execution<br />

In this section, we will describe how we execute the plans that are generated by the discourse planner, i.e., how we select the plans that will be executed or, more precisely, in which dialogues the virtual agents will be engaged. Figure 5.12 highlights in color the part of the system that is responsible for the plan execution, and Figure 5.13 shows which part of the dataflow is covered in this section. First, we will describe the template manager, which maps each planning operator of a plan onto a particular utterance that is furthermore annotated with gesture tags. Then, we will describe the avatar manager, which serves as an interface to the Charamel avatar engine. Finally, we will describe the output manager, which is responsible for the plan execution, i.e., it decides which plans will be executed and when.

Figure 5.12: IVAN Architecture - Plan Execution

Figure 5.13: Dataflow - Plan Execution

5.4.1 Template Manager

Let us recall that each plan corresponds to a dialogue, where a plan consists of a list of planning operators (primitive tasks) and each planning operator corresponds to a template that contains a set of possible utterances that can be uttered by a virtual agent. In this section, we will describe how a planning operator is mapped onto a particular utterance that can additionally be annotated with gesture tags. The template manager contains over 220 different templates and maps each planning operator onto a particular template, where each template usually has several slots that are substituted with the parameters of the respective planning operator. Each template contains 1-3 variants of an utterance; which utterance is chosen is decided at random for the sake of higher variability.

Moreover, there are default gesture and facial expression tags in every utterance, since each utterance is more or less bound to a particular situation that is correlated with a certain emotion. The facial expression tags can be, for instance: Smile, Happy, Surprise, Angry, or Sad, with different intensities. The gesture tags can be, for instance: Disagree, DontKnow, Disappointed, Surprise, Oops, or OhYes. Each gesture tag is stored in a so-called gesticon and is mapped onto a set of 1-3 possible gestures that can be directly performed by a virtual agent in a particular situation. Every time the gesticon is queried to find a mapping for a given gesture tag, it chooses one gesture from the corresponding set of possible gestures at random to achieve higher variability.
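A gesticon lookup with random selection can be sketched as follows; the injected Random and any clip names used with it are illustrative, not Charamel's actual clip paths.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the gesticon: each gesture tag maps onto a set of 1-3 concrete
// gesture clips, one of which is chosen at random for variability.
public class Gesticon {
    private final Map<String, List<String>> entries;
    private final Random random;

    public Gesticon(Map<String, List<String>> entries, Random random) {
        this.entries = entries;
        this.random = random;
    }

    /** Resolves a gesture tag to one of its concrete clips, chosen at random. */
    public String lookup(String tag) {
        List<String> clips = entries.get(tag);
        return clips.get(random.nextInt(clips.size()));
    }
}
```

Injecting the Random instance keeps the random choice testable while preserving the variability described above.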

Furthermore, there are two duration tags for each utterance: the first denotes the number of milliseconds needed to utter it with a male voice, and the second is the respective duration for a female voice. These tags can be used to estimate the duration of an utterance in case it is not provided by the text-to-speech engine. Let us note that the gesture and facial expression tags stand only for default values, i.e., they can be filtered out and substituted by other tags generated by other modules.

Example<br />

In the following text, we will show an example of how a planning operator is mapped onto a particular utterance. Imagine that the server has served and the receiver has returned the ball in such a way that the server failed to return it. One planning operator (more precisely, an operator's head) of the generated plan can be, for instance:

briskly_returned_serve ?server ?receiver ?receiver_shot<br />

Here, the first string is the operator's name and the strings that begin with a question mark are variables that are substituted into the slots of a template. The planning operator's head contains three variables: ?server refers to the serving player, ?receiver refers to the receiving player, and ?receiver_shot refers to the type of shot that the receiving player played. There is a corresponding template in the template manager that contains three slots matching the three variables of the planning operator. The template consists of two utterances:



{EmotionSurprise} {ExplainTo} ?receiver surprised ?server with an accurate<br />

?receiver_shot return.<br />

{EmotionSurprise} {Play} ?receiver generated a ?receiver_shot {Look} return<br />

that was out of ?server’s reach.<br />

The facial expression and gesture tags are annotated in curly brackets. The facial<br />

expression tags start with the prefix Emotion whereas all other tags are gesture tags.<br />

Let us assume that: the second utterance has been chosen at random, the variable<br />

substitutions are known, and the respective gesture tags have been chosen from the<br />

gesticon at random. Thus, we get the following substitutions:<br />

?server := Safin<br />

?receiver := Federer<br />

?receiver_shot := forehand<br />

{EmotionSurprise} := $(Emotion,surprise,0.9,500,1000,3000)<br />

{Play} := $(Motion,interaction/bye/bye01,400,500,0,10000,1.5)<br />

{Look} := $(Motion,presentation/look/lookto_right02,400,500,0,1200,0.8)<br />

where the facial expression and gesture tags are mapped onto the avatar engine specific<br />

tags (see the Charamel manual [38] for more details). After we apply the substitutions<br />

we get the following annotated utterance that can be directly sent to the Charamel<br />

avatar engine.<br />

$(Emotion,surprise,0.9,500,1000,3000)<br />

$(Motion,interaction/bye/bye01,400,500,0,10000,1.5)<br />

Federer generated a forehand<br />

$(Motion,presentation/look/lookto_right02,400,500,0,1200,0.8)<br />

return that was out of Safin’s reach.<br />

After a Charamel virtual agent gets this utterance, s/he looks surprised, s/he makes a<br />

hand movement as if s/he played a ball with a tennis racket, and then s/he gazes at the<br />

other virtual agent.<br />
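The slot substitution used in the example can be sketched as follows; the ?slot syntax mirrors the worked example above, while the helper itself is our illustration. Note that longer slot names must be substituted first, so that ?receiver does not clobber the prefix of ?receiver_shot.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of template slot substitution for an utterance variant.
public class TemplateFill {
    public static String fill(String template, Map<String, String> substitutions) {
        // substitute longer slot names first so that ?receiver does not
        // clobber the prefix of ?receiver_shot
        List<String> keys = substitutions.keySet().stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .collect(Collectors.toList());
        String result = template;
        for (String key : keys) {
            result = result.replace(key, substitutions.get(key));
        }
        return result;
    }
}
```

The gesture and facial expression tags in curly brackets can be handled the same way, with the gesticon supplying the avatar-engine-specific replacement strings.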

5.4.2 Avatar Manager<br />

The avatar manager serves as an interface of the Charamel avatar engine. In the fol-<br />

lowing text, we will describe how we have incorporated this module into our system


Chapter 5. Architecture 59<br />

and which functionality it provides. The avatar manager is placed between the output<br />
manager and the Charamel avatar engine. The output manager decides which plan<br />
will be executed, i.e., which utterance will be uttered and when, whereas the Charamel<br />
avatar engine displays the two virtual agents that represent our commentary team and<br />
accepts commands to control their behaviour. Thus, the role of the avatar manager is<br />
to transform commands from the output manager into Charamel-specific commands.<br />
Furthermore, it maintains the state of the dialogue, which can be exploited by the output<br />
manager. An annotated utterance, a gesture, or a facial expression can be sent to the<br />
avatar manager. The dialogue state obtained from the avatar manager comprises:<br />
which virtual agent is currently speaking, how long s/he has<br />
already been speaking, how much time is needed to finish the current utterance, and<br />
which gesture or facial expression was last set for each virtual agent.<br />
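This dialogue-state snapshot can be sketched as a small value type. All field and method names below are hypothetical; the real API is IVAN/Charamel-specific.<br />

```java
// Sketch of the dialogue-state snapshot obtainable from the avatar manager.
// Field and method names are hypothetical illustrations.
public record DialogueState(boolean speaking, String speakerId,
                            long speakingForMs, long remainingMs,
                            String lastGesture, String lastFacialExpression) {

    // Convenience check the output manager can use before sending an utterance.
    public boolean freeToSpeak() {
        return !speaking;
    }
}
```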

Recall that all commands sent to the avatar manager or to the Charamel avatar<br />
engine are sent in a non-blocking manner (i.e., the sender never waits until a<br />
command is completed). Thus, the output manager must first get the current state<br />
of the dialogue and then decide which command to send to the avatar manager. For<br />
instance, if nobody is speaking, it can immediately send an annotated utterance to<br />
the Charamel avatar engine. If somebody is speaking, it knows who is speaking and how<br />
long it will take to finish the current utterance; the output manager can then<br />
decide whether to wait or to send a new utterance right away. For instance, it should wait<br />
if the utterance being uttered will finish within a second. Nonetheless, if<br />
somebody is speaking and the avatar manager gets a command to utter another<br />
utterance, it interrupts the virtual agent that is speaking and starts uttering the<br />
new utterance.<br />

There can be two kinds of interruptions: self-interruption or an interruption by the<br />
other agent. Gaze gestures and interruption utterances (e.g. “Wait!” or “Look!”) are<br />
used to make the interruptions smoother. As we have already stated, the length of an<br />
utterance is stored in the template manager for each template; nevertheless, this length<br />
is not accurate, since the exact length of an utterance depends on the slot substitutions<br />
in the template (e.g. the ?name “Ray” is shorter than “Richard”). Thus, the Charamel<br />
avatar engine is always queried for the actual length of an utterance. However, it can<br />
take up to one second to get the response, so the estimated length stored in the<br />
template manager is used as long as the actual length returned by the Charamel avatar<br />
engine is unknown. A gesture or a facial expression can be sent to the Charamel avatar<br />
engine at any time; a new gesture or facial expression is smoothly interpolated with<br />
the previous one.
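The estimated-vs-actual length fallback can be sketched as follows (hypothetical names; a minimal illustration of the idea, not the IVAN implementation):<br />

```java
// Sketch of the length-fallback idea: use the template's estimated duration
// until the avatar engine reports the actual one. Names are hypothetical.
public class UtteranceTiming {
    private final long estimatedMs; // stored with the template
    private Long actualMs;          // reported later by the engine; null until known

    public UtteranceTiming(long estimatedMs) {
        this.estimatedMs = estimatedMs;
    }

    public void reportActual(long ms) {
        this.actualMs = ms;
    }

    // Remaining speaking time, given how long the utterance has been playing.
    public long remainingMs(long elapsedMs) {
        long total = (actualMs != null) ? actualMs : estimatedMs;
        return Math.max(0, total - elapsedMs);
    }
}
```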



Since the avatar manager communicates with the Charamel avatar engine via a socket<br />
(see the Charamel manual [38]), we have to deal with latency of up to one second,<br />
which can cause unwanted delays in the commentary. Another shortcoming of the<br />
Charamel avatar engine is that a virtual agent that is speaking cannot be interrupted at<br />
a specific position in an utterance, since the exact state of the virtual agent is unknown.<br />
We can only estimate the position in an utterance from the time elapsed since its<br />
beginning. Therefore, we cannot prevent an utterance from being interrupted in the<br />
middle of a word.<br />

5.4.3 Output Manager<br />

The output manager is responsible for plan execution, i.e., it decides in which dia-<br />
logues the virtual agents will be engaged. In the following text, we will explain the<br />
functionality of the output manager in detail. The output manager gets plans from the<br />
discourse planner, chooses one plan to execute, maps planning operators onto templates,<br />
and sends the respective annotated utterances to the avatar manager, which transforms<br />
them into Charamel-specific commands. Thus, the output manager decides which plan<br />
to execute and when. Furthermore, the output manager can interrupt the current plan<br />
and run a new one, while the interrupted plan can be resumed later. The decision when<br />
to interrupt a plan is based on heuristics. Moreover, the output manager keeps a plan<br />
history that prevents repetition, so that one plan is not executed twice in a row.<br />

Decision Loop<br />

The functionality of the output manager is implemented in the decision loop, which<br />
maintains the state of the plan being executed, the stack of candidate plans, and the<br />
plan history. The decision loop consists of the following steps:<br />

1. Try to get new plans.<br />

2. If there are new plans then select one and put it on the stack of candidate plans.<br />

3. Remove old plans from the stack of candidate plans.<br />

4. Get the status of the dialogue engine.<br />

5. In the case that nobody is speaking, we can perform one of the following actions:<br />

• The plan that is being executed continues with the next utterance.<br />

• The plan that has been interrupted starts again.<br />

• The current plan is interrupted by a new one.<br />

• A new plan is started.


Chapter 5. Architecture 61<br />

6. In the case that somebody is speaking and there is a newer plan on the stack of<br />

candidate plans, we decide according to heuristics whether the current plan will<br />

be interrupted or not.<br />
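One iteration of steps 4–6 can be sketched as a pure decision function. This is a deliberately simplified skeleton with hypothetical names and a simple interruption heuristic; the real IVAN loop adds the plan history and further heuristics.<br />

```java
// Skeleton of one iteration of the output manager's decision loop
// (steps 4-6). Names and the interruption heuristic are hypothetical.
public class DecisionLoop {

    public enum Action { CONTINUE_CURRENT, RESUME_INTERRUPTED, INTERRUPT, START_NEW, WAIT }

    public static Action decide(boolean someoneSpeaking, long remainingMs,
                                boolean hasCurrentPlan, boolean hasInterruptedPlan,
                                boolean newerCandidate, int newCandidatePriority,
                                int currentPlanPriority) {
        if (someoneSpeaking) {
            // Step 6: interrupt only for a clearly more important plan,
            // and not when the current utterance is about to finish anyway.
            if (newerCandidate && newCandidatePriority > currentPlanPriority
                    && remainingMs > 1000) {
                return Action.INTERRUPT;
            }
            return Action.WAIT;
        }
        // Step 5: nobody is speaking.
        if (hasCurrentPlan)     return Action.CONTINUE_CURRENT;
        if (hasInterruptedPlan) return Action.RESUME_INTERRUPTED;
        if (newerCandidate)     return Action.START_NEW;
        return Action.WAIT;
    }
}
```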

A plan is selected (at step 2) according to its priority and a least-recently-used<br />
strategy, such that plans with high priority and plans that have not been executed<br />
recently are preferred. To ensure that the stack of candidate plans contains only plans<br />
that are up-to-date (at step 3), we go through the plans and filter out old plans depending<br />
on their semantic tokens. For instance, a plan that contains background facts<br />
(e.g. that the serving player is leading the ATP score) does not age as quickly as a<br />
plan related to an event that happened in the middle of a rally (e.g. when a<br />
player played a smash).<br />
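The selection criterion at step 2 can be sketched as a comparator over priority and last-execution time. The `Plan` fields below are hypothetical; in IVAN the candidates are JSHOP plans.<br />

```java
import java.util.Comparator;
import java.util.List;

// Sketch of step 2: prefer high-priority plans, breaking ties in favour of
// the least recently executed one. Fields are hypothetical.
public class PlanSelector {

    public static class Plan {
        public final String name;
        public final int priority;        // higher = more urgent
        public final long lastExecutedMs; // timestamp of last execution, 0 if never

        public Plan(String name, int priority, long lastExecutedMs) {
            this.name = name;
            this.priority = priority;
            this.lastExecutedMs = lastExecutedMs;
        }
    }

    public static Plan select(List<Plan> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt((Plan p) -> p.priority)
                        .thenComparingLong(p -> -p.lastExecutedMs))
                .orElse(null);
    }
}
```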

Each time the output manager gets new plans, it has to decide on the basis of heuristics<br />
whether to interrupt the current plan and continue with a new one. The output<br />
manager makes use of the state of the dialogue to know the approximate time needed<br />
to finish the current utterance and how long the current plan has already been running.<br />
For instance, the current plan will not be interrupted if it finishes within a<br />
second or if it was started only a moment ago. Interruptions also cannot occur too<br />
often. Depending on their semantic tokens, some plans should be executed as soon as<br />
possible (e.g. a comment referring to an ace) while others can be executed with a<br />
certain delay (e.g. a comment on a player’s background). Furthermore, an interrupted<br />
plan can be run again if it is still up-to-date and was not almost finished when it was<br />
interrupted.


Chapter 6<br />

Discussion<br />

In this chapter, we will compare the IVAN system with ERIC, evaluate our system in<br />
terms of the research aims, and discuss two basic tools (JSHOP and Jess) that can<br />
both be employed to generate affective commentary on a continuous sports event in<br />
real-time.<br />

6.1 Comparison with the ERIC system<br />

In this section, we will compare our system with ERIC (see section 2.1), since ERIC<br />
is most closely related to our work. ERIC is an affective commentary virtual agent<br />
that won GALA 2007 1 as a horse race reporter. The overall goal of ERIC is the same<br />
as ours, with the difference that ERIC is a monologic system employing one<br />
virtual agent, whereas we have employed a presentation team consisting of two virtual<br />
agents to comment on a sports event. Our virtual agents have different roles (TV<br />
commentator, expert) and can have different attitudes to the players (positive, neutral,<br />
negative). A presentation team is believed to be more entertaining for the audience<br />
than a single presenter and enriches the communication strategies, since our virtual<br />
agents can be engaged in dialogues and represent opposing points of view.<br />

ERIC employs an expert system to generate speech, where utterances reflect its<br />
current knowledge state and discourse coherence is ensured by centering theory.<br />
Nevertheless, ERIC may be too reactive, i.e., individual utterances are uttered at<br />
particular knowledge states, and ERIC cannot generate larger contributions. Hence,<br />
we have employed an HTN planner to generate the dialogues, which enabled us to plan<br />
large dialogue contributions, with discourse coherence ensured by the planner.<br />

1 http://hmi.ewi.utwente.nl/gala/finalists 2007/<br />




In contrast to ERIC, we have also implemented the possibility of interruptions, i.e.,<br />
the current discourse can be interrupted if a more important event happens. However,<br />
there is always a certain trade-off between reactivity, i.e., a reactive commentary with<br />
frequent interruptions, and discourse coherence, i.e., a commentary with large and<br />
coherent dialogue contributions that does not comment on every event.<br />

While ERIC uses ALMA to maintain his affective state, we use two methods: one that<br />
generates affective dialogues based on the virtual agents’ attitudes to the players, and<br />
another that maintains the affective state of each virtual agent in the emotion module.<br />
Whereas ALMA might appear to be a “black box”, the generation of affective<br />
dialogues and the simulation of the affective states of our virtual agents are more<br />
transparent: we can adjust the computation of the initial intensities of individual<br />
OCC emotions depending on personality, we can define our own decay function, and<br />
we have full control over the input tags and output of our emotion module. We can also<br />
always say which event has caused a virtual agent’s current emotion, or why a virtual<br />
agent is commenting in a positive or negative way on an event or a player.<br />
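For illustration, the kind of decay function we have full control over could be a simple exponential decay of an OCC emotion’s intensity. The half-life constant below is hypothetical, not taken from IVAN.<br />

```java
// Sketch of an adjustable decay function for an OCC emotion intensity:
// the intensity halves every HALF_LIFE_MS milliseconds (hypothetical value).
public class EmotionDecay {
    static final double HALF_LIFE_MS = 3000.0;

    public static double decayedIntensity(double initial, long elapsedMs) {
        return initial * Math.pow(0.5, elapsedMs / HALF_LIFE_MS);
    }
}
```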

In comparison to ERIC, our virtual agents have gestures that are better synchronized<br />
with speech, use more elaborate idle gestures (provided by Charamel), can gaze at each<br />
other, and can interact with a user via pre-defined questions. Whilst ERIC was designed<br />
to be domain independent and was tested in two different domains, our system has only<br />
been designed to comment on a tennis game; nevertheless, the same architecture can be<br />
used to produce affective commentary in other domains.<br />

6.2 Evaluation in Terms of Research Aims<br />

In this section, we will compare our research aims, listed in section 1.4, with the system<br />
that we have implemented.<br />

Dialogue Planning for Real-time Commentary and Reactivity<br />

We have employed JSHOP as an HTN planner to produce commentary on a contin-<br />
uous sports event in real-time. The motivation to use an HTN planner was to generate<br />
large dialogue contributions and to avoid being too reactive (in the sense described<br />
in section 6.1). It also seemed to be a good strategy for generating dialogues. First,<br />
JSHOP gets all facts that describe the current state of the world and outputs all possible<br />
plans (dialogues). Then, in the decision loop, one plan is selected and executed. The<br />
problem arises when an important event happens in the middle of the execution of a<br />
plan (dialogue) that comments on another event. In this case, our system can either<br />
interrupt the execution of the current plan or wait until the current plan finishes. This



problem would be solved by dynamic replanning, i.e., by modifying the current plan on<br />
the fly. Since JSHOP does not support dynamic replanning, we can only either wait<br />
until the current plan finishes or interrupt it. However, even if JSHOP supported<br />
dynamic replanning, it would not be sufficient, since the Charamel avatar engine does<br />
not indicate its exact state, e.g., we cannot interrupt an utterance at a specific position.<br />
Moreover, if we sent an utterance word by word to the Charamel avatar engine, it<br />
would not be uttered in a coherent way. Thus, the planner would need to work with<br />
whole utterances, which would not be optimal: we would have to wait until the current<br />
utterance had been uttered, and only then continue with an utterance of the modified<br />
plan created by dynamic replanning.

Therefore, there is always a certain trade-off between reactivity and discourse coherence.<br />
We can either interrupt plans (dialogues) often to be reactive, or we can delay the com-<br />
mentary on some events, or even ignore some events, to get large, coherent dialogue<br />
contributions. Nevertheless, we have noticed that real-life tennis commentators do<br />
not comment on every event; when the game is not interesting, they engage<br />
in small talk to amuse the audience by talking about the players’ background.<br />
Thus, we have implemented a compromise that uses heuristics to decide when<br />
to interrupt the discourse. The resulting commentary is partly reactive, but since we<br />
cannot interrupt the discourse too often, our commentary sometimes has delays or does<br />
not consider some events.<br />

There is also always a certain trade-off between reactive commentary that uses short<br />
utterances and elaborate, more detailed commentary that is less reactive. Since we<br />
wanted to produce more interesting and detailed commentary to convey more facts, our<br />
utterances are rather long.<br />

We have assumed that HTN planning is convenient for producing commentary on<br />
sports events that unfold rather slowly (e.g. a live tennis game). However, the test<br />
files provided by GALA 2009 were generated by the Wii 2 software, which produced tennis<br />
games that unfolded more quickly than a standard live tennis game. Hence,<br />
there was a slight mismatch between the input we anticipated and the input we actually<br />
got. Nevertheless, our system was able to produce the commentary even under these<br />
conditions.<br />

The reactivity of the system also partly depends on the response time of the avatar<br />
engine and the speed at which the virtual agents talk. Slightly faster speech and a<br />
lower response time of the avatar engine (which is sometimes up to one second) would<br />
lead to better results in terms of reactivity.<br />

2 http://wii.com/



Behavioural Complexity and Affectivity<br />

Our virtual agents provide affective commentary on a tennis game according to their<br />
(positive, neutral, negative) attitudes to the players and according to the events that<br />
occur during the tennis game. The current affect of a virtual agent is expressed by<br />
dialogue scheme selection, lexical selection, facial expression, and gestures. A user can<br />
recognize which virtual agent is in favour of which player and whether the virtual agent’s<br />
favourite player is doing well or not. For instance, the virtual agent’s facial expression<br />
alone can reveal whether his/her favourite player is leading. The virtual agents<br />
also have gestures synchronized with speech and can interact with a user in the form of<br />
pre-defined questions.<br />

The variability of dialogues is ensured by the planner, which always outputs all possible<br />
plans (dialogues), and by the random selection of utterances and gestures within partic-<br />
ular templates. However, there is always a certain trade-off between a few polished,<br />
suitable, and specific dialogues and a large variety of general dialogues. Since we wanted<br />
to have specific commentary for GALA, we have preferred the first option. Nevertheless,<br />
more variety could be achieved by adding more dialogue schemes and more variants<br />
of utterances and gestures to the respective templates. The dialogue schemes could also<br />
be based on the different types of OCC emotions that are maintained for each virtual<br />
agent in a simple emotion module, which would also increase the variability and<br />
affectivity of the commentary.<br />

We have used two methods to produce affective commentary: one that generates affective<br />
dialogues based on the virtual agents’ attitudes to the players, and another that maintains<br />
the affective state of each virtual agent in the emotion module. Thus, the user can see<br />
which event elicited which emotion and why a virtual agent is commenting in a positive<br />
or negative way.<br />

Generalizability<br />

Although our system was not designed to be domain independent, we will describe be-<br />
low which modifications would be necessary to change the domain. The tennis simulator<br />
would need only a minor modification to simulate any sports event given as an ANVIL<br />
file. We would need to define new states at which the discourse planner is triggered by<br />
the event manager. We would also need to define the snapshots of the world and which<br />
low-level facts would be derived from the respective snapshots. The pre-processing of<br />
the background facts is done in a generic way, so we would only need to provide the<br />
corresponding input CSV files. While the Java code in the discourse planner is domain<br />
independent, the definition of the Hierarchical Task Network in the planning domain<br />
would need to be rewritten, except for the part that concerns the background knowledge<br />
(e.g. injury,



weather). We would also need to add corresponding templates and change some heuris-<br />
tics in the output manager, e.g., to determine under which conditions a plan can be<br />
interrupted. We would also need to define the respective emotion-eliciting conditions in<br />
the emotion module. The avatar manager is domain independent. Thus, the most<br />
complex task would be to rewrite the domain description of the planner and to add the<br />
respective templates.<br />

6.3 Comparison JSHOP vs Jess<br />

In this section, we will compare two approaches, i.e., HTN planning (see section 3.1)<br />
and expert systems (see section 3.2), that can be used to generate a commentary<br />
on a sports event as defined by GALA 2009 (see section 1.2). We will focus on two tools,<br />
namely JSHOP 3, a representative of HTN planners that we have employed<br />
in our system to generate dialogues, and Jess 4, a representative of expert<br />
systems that was used, e.g., in ERIC (see section 2.1) to generate speech. Whereas<br />
HTN planning is well suited to planning larger contributions (e.g. dialogue planning),<br />
expert systems are more suitable for producing shorter comments that reflect the current<br />
state of the world. In the following text, we will compare JSHOP and Jess in terms of<br />
their expressive power, usability, and user-friendliness.<br />

• Variability<br />

Variability is important, e.g., for dialogue planning, since the virtual agents<br />
should not be engaged in the same dialogues all the time. In logistics, it is<br />
also convenient to have more than one way to deliver a package, since not all<br />
paths cost the same; the cheapest path should be chosen, and some paths<br />
can also be dynamically added to or deleted from the domain. The advantage of<br />
planning is that it finds all solutions to a problem, while an expert system outputs<br />
only one. (More precisely, while a planner is backtracking to find all possible plans,<br />
it can try several substitutions of a variable. In contrast, once a<br />
rule fires in an expert system, a variable is substituted and cannot be changed.)<br />
Nevertheless, it is possible to set the random conflict resolution strategy in a rule-based<br />
system, which resembles choosing a plan at random among all possible<br />
plans output by a planner. Thus, variability can be achieved in rule-based<br />
systems to some extent as well.<br />

• Priority<br />

We can assign a cost to each planning operator in the planning domain such that<br />

3 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/<br />

4 Jess (Java Expert System Shell) http://www.jessrules.com/



the cost of a plan is equal to the sum of the costs of all planning operators that<br />
the plan contains. After the planner outputs all possible plans, we can choose the<br />
most or least expensive plan according to our preferences. If the cost corresponds<br />
to the length of a path, we will probably choose the shortest one. If the cost<br />
corresponds to the amount of money that we get when we execute the plan, we<br />
will presumably choose the most profitable plan. In an expert system, we can<br />
assign a salience value to each rule, which specifies how urgently the rule should be<br />
fired; in the case that the salience values of two rules are the same, the current conflict<br />
resolution strategy decides which rule will be fired first. This is how<br />
rule-based systems can prioritize some outcomes. Nevertheless, the use of<br />
salience values should be avoided, since it makes the execution of the rules very<br />
difficult to monitor.<br />
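The cost-based selection described above can be sketched as follows (a minimal illustration with a hypothetical plan representation; JSHOP reports operator costs with each plan it finds):<br />

```java
import java.util.Comparator;
import java.util.List;

// Sketch of choosing among complete plans by summed operator cost.
// A plan is represented simply as its list of operator costs (hypothetical).
public class PlanCost {

    public static double cost(List<Double> operatorCosts) {
        return operatorCosts.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Pick the cheapest plan, e.g. when cost corresponds to path length.
    public static List<Double> cheapest(List<List<Double>> plans) {
        return plans.stream().min(Comparator.comparingDouble(PlanCost::cost)).orElse(null);
    }
}
```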

• Expressive Power<br />

Jess offers substantially more constructs than JSHOP. We will show two examples<br />
of constructs that are defined in Jess but not in JSHOP, where it would be<br />
advantageous to have them in JSHOP as well. First, JSHOP does not support<br />
unordered facts; thus, if we want to work with only one slot of a fact, we have to<br />
consider all its slots, since JSHOP supports only ordered facts. Second, it is quite<br />
cumbersome to count the number of facts that match a certain condition in JSHOP,<br />
although this can be worked around by recursion. In Jess, this task can be solved<br />
intuitively using the accumulate construct.<br />
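The task in the second example — counting facts that match a condition, which Jess’s accumulate expresses declaratively and JSHOP must encode via recursion — corresponds to the following in plain Java (an analogy with hypothetical names, not Jess or JSHOP code):<br />

```java
import java.util.List;

// Java analogue of counting facts that match a condition. Jess's accumulate
// construct does this declaratively over the fact base; JSHOP would need a
// recursive encoding over ordered facts.
public class FactCounter {

    public record Shot(String player, String type) {}

    public static long countForehands(List<Shot> facts, String player) {
        return facts.stream()
                .filter(s -> s.player().equals(player) && s.type().equals("forehand"))
                .count();
    }
}
```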

• Online vs Offline Execution<br />

We have already pointed out that JSHOP runs offline (see Figure 3.3). Thus, after<br />
any change in the domain or problem file, the respective Java file has to be<br />
generated and compiled before the planner can actually be run. In contrast,<br />
Jess runs online, i.e., after the Jess rule-based engine is initialized, it<br />
can be run several times, while facts and rules can be added to or retracted from its<br />
fact base in the meantime.<br />

• Development Environment<br />

Jess can be better integrated into a development environment than JSHOP, since<br />
there is a plugin that integrates Jess into the Eclipse IDE 5, which facilitates de-<br />
velopment, e.g., it offers a Jess editor that highlights the Lisp-like Jess syntax<br />
and marks errors. In comparison, JSHOP is provided as a Java library;<br />
nevertheless, the input JSHOP files can be edited as text files in the Eclipse IDE as<br />
well.<br />

5 http://www.eclipse.org/


Chapter 7<br />

Conclusion<br />

7.1 Summary<br />

In this thesis, we have presented the architecture of the IVAN system (Intelligent In-<br />
teractive Virtual Agent Narrators), which generates affective commentary on a tennis<br />
game in real-time, where the input is given as an annotated video provided by GALA<br />
2009. A demo version of the IVAN system was accepted for GALA 2009 1, which<br />
was part of the 9th International Conference on Intelligent Virtual Agents (IVA) 2.<br />
The system employs two virtual agents with different attitudes to the players that are<br />
engaged in dialogues to comment on a tennis game. We have focused on the knowledge<br />
processing, dialogue planning, and behaviour control of the virtual agents. Commercial<br />
products have been employed for the audio-visual component of the system.<br />
Most parts of the system are domain dependent; however, the same architecture can<br />
be reused to implement applications such as an interactive tutoring system, a tourist<br />
guide, or a guide for the blind.<br />

The system consists of several modules. We have employed an HTN planner to plan the<br />
dialogues, an expert system to define the appraisals of the emotion-eliciting conditions in<br />
the emotion module, and finite state machines to simulate the basic states of the system.<br />
Our two virtual agents can have positive, neutral, or negative attitudes to the players. The<br />
system uses two methods to generate affective multimodal output. In the first method,<br />
the dialogue schemes in the HTN planner are selected according to the desirability<br />
of particular events for the respective virtual agents. In the second method, the system<br />
maintains the affective state of each virtual agent in the emotion module, according<br />
to the OCC cognitive model of emotions [36], based on the appraisals of the events

1 http://hmi.ewi.utwente.nl/gala/finalists 2009/<br />

2 http://iva09.dfki.de/<br />




that happen in a tennis game. The current affect of the virtual agents is expressed<br />
by lexical selection, facial expression, and gestures. Furthermore, the system integrates<br />
background knowledge about the players and the tournament and allows the user to ask<br />
one of the pre-defined questions at any time.<br />

We have employed JSHOP 3 as an HTN planner to generate dialogues for our two<br />
virtual agents. We have verified that JSHOP can be employed to generate affective<br />
commentary on a continuous sports event in real-time. However, HTN planning is<br />
best suited to generating large dialogue contributions; if the environment changed<br />
rapidly and we wanted to consider most of the events that occur in it, it would be<br />
more appropriate to use an expert system, as in ERIC [10].<br />

7.2 Future Work<br />

In the following paragraphs, we will outline which modifications could be made to im-<br />

prove our system in the future.<br />

EMBR<br />

We could integrate EMBR (A Realtime Animation Engine for Interactive Embodied<br />
Agents) [39]. EMBR has more advanced behaviour control; e.g., it offers more<br />
precise gaze that can express particular emotions, whereas the Charamel virtual agents<br />
(see section 5.1.3) can only turn the head to gaze at the other virtual agent. We did not<br />
employ EMBR since it had not been released at that time and it also offered<br />
only one virtual agent, whereas we needed two distinguishable characters.<br />

Prosody<br />

We could also integrate a prosody module if we had an appropriate TTS engine that<br />
provided the option to set the respective parameters. Then, we could use the<br />
current emotional state of a virtual agent, as simulated by the emotion module (see<br />
section 4.2.3), to set the respective parameters of the TTS engine. We have not implemented<br />
a prosody module since the RealSpeak Solo TTS 4 did not provide the option to change<br />
the respective parameters.<br />

ALMA<br />

We could use ALMA [15] to maintain the emotional state of each virtual agent, since<br />
ALMA, in addition to what our emotion module provides, maintains an emotion history<br />
and performs emotion blending. We could then anticipate smoother transitions between<br />
the individual emotional states of a virtual agent. Nevertheless, we did not employ<br />
ALMA since we wanted to have full control<br />
3 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/<br />

4 http://www.nuance.com/realspeak/solo/



over the emotion module so that we could, e.g., adjust the computation of the initial<br />
intensities of individual OCC emotions depending on personality and define our<br />
own decay function.<br />

Affect<br />

We could try to base some dialogue schemes on particular OCC emotions output<br />
by our emotion module. In this way, we would get more affective and suitable dialogues.<br />
Nevertheless, it would entail a lot of work, since we would also have to write many<br />
utterances expressing particular emotions. Note that, to work with a<br />
reasonable number of templates, we can have either many general affective dialogues<br />
or many specific dialogues that express particular emotions in a limited way. In our<br />
case, we have chosen the second option; thus, our dialogue schemes are only based on<br />
the virtual agents’ (positive, neutral, negative) attitudes to the players.<br />

We could also base the selection of particular utterances and gestures in the templates<br />
on the current emotional state of a virtual agent, as maintained by our emotion<br />
module. A particular utterance and gesture would be chosen according to the current<br />
emotional state of the virtual agent. The current affect could also, for instance, influence<br />
the velocity of particular gestures. In this way, we would get more affective dialogues.<br />
Nevertheless, we did not implement this feature since it would have required writing<br />
many different affective utterances. We have also assumed that it is sufficient for<br />
the utterances to convey only the virtual agents’ (positive, neutral, negative) attitudes<br />
to the players.<br />

Dynamic Replanning<br />

We could try another planner (e.g. HOTRiDE [40]) that supports dynamic re-<br />
planning, since the only way we can change the plan (dialogue) now is to interrupt<br />
the current plan and start a new one. Nevertheless, dynamic replanning seems<br />
to be quite difficult to implement. One reason why we did not try such a planner is<br />
that the Charamel avatar engine (see section 5.1.3) does not indicate the exact state of<br />
the discourse; thus, such a planner would have to work with whole utterances, which<br />
would not be optimal. The precondition for employing such a planner is an<br />
avatar engine that indicates exactly what has been uttered so far at any point in<br />
time.<br />

Evaluation<br />

A more elaborate evaluation of the system could be done. We could perform an experiment<br />
to find out what a user remembers from the commentary with and without virtual<br />
agents. However, live tennis commentators are usually hidden so that the audience<br />
can concentrate on the tennis game. Though we would in general expect the<br />
commentary with the virtual agents to perform better, it could easily happen that the users



would concentrate more on the video of the tennis game and remember more without<br />
virtual agents, since the virtual agents might distract them. We have<br />
not performed this sort of evaluation since it was not clear how to interpret the possible<br />
results.<br />

We could also compare our commentary with a live commentary. Nevertheless, in com-<br />
parison to our system, real commentators are usually hidden and their commentaries<br />
are not biased. Our system was also partly optimized for GALA 2009 (see section<br />
1.2), which was slightly different from a live tennis game since it used Wii 5 videos of tennis<br />
games. Live tennis commentary is also often very elaborate; thus, our system cannot<br />
compete with such a commentary in terms of variability.<br />

Other Domains<br />

We could reuse the architecture of the system to implement a system in other domains,<br />

e.g., other long-running sports events, interactive tutoring systems, tour guides, or<br />

guides for the blind.<br />

5 http://wii.com/


Appendix A<br />

Commentary Excerpt<br />

The following list shows a commentary excerpt, where C stands for the<br />

commentator and E stands for the tennis expert.<br />

C : “Ladies and Gentlemen. Welcome to the Wimbledon semi-final in doubles.”<br />

E: “We will guide you through the match in which James Blake and Andy Roddick are<br />

playing versus Marat Safin and David Ferrer.”<br />

C : “Enjoy the show!”<br />

C : “The weather is cloudy.”<br />

E: “I hope it won’t be raining.”<br />

C : “Oops, Roddick scored!”<br />

E: “Roddick hits an excellent forehand-volley right into the left corner.”<br />

C : “Roddick has been unbeatable recently.”<br />

E: “What a control by Roddick!”<br />

C : “The score is already 30:0.”<br />

E: “Players Safin and Ferrer are real losers as usual!”<br />

E: “None of the players are playing on their favourite surface.”<br />

C : “It should harm performance of Blake and Rod..”<br />




Interruption<br />

E: “What an unexpected lob!”<br />

C : “It’s a bad idea to try to return a lob at the net.”<br />

C : “The brother of Blake Thom..”<br />

Interruption<br />

C : “Look!”<br />

C : “Blake missed the huge backhand return by Ferrer! ”<br />

C : “The brother of Blake Thomas is also playing tennis.”<br />

E: “His best ranking was in 2002.”<br />

C : “The score is already 30:40.”<br />

E: “No, Blake and Roddick are facing the break point.”<br />

C : “Ferrer had a hard time recently.”<br />

C : “Deuce.”<br />

E: “Great recovery by Blake and Roddick!”<br />

C : “Roddick has been unbeatable recently.”<br />

E: “No doubt he is a genius.”<br />

C : “Off forehand by Blake drifts outside the left sideline.”<br />

E: “Blake just overcooked his forehand.”<br />

E: “Blake concentrates on his serve.”<br />

C : “Roddick has been injured four times since last year.”<br />

E: “It’s amazing how he ...”<br />

Interruption<br />

C : “What a relief!”<br />

E: “Oh, no!”<br />

C : “Tight game, let’s summarize it.”<br />

C : “Safin and Ferrer won the first game.”<br />

E: “That’s unbelievable that they broke opponents’ serve!”



C : “That was quite obvious!”<br />

E: “Safin and Ferrer played below par!”<br />

C : “There were some excellent shots!”<br />

E: “Well there might have been some bright sides.”<br />

C : “Of course there were!”<br />

C : “Did Safin and Ferrer have any difficulties?”<br />

E: “They were already trailing.”<br />

C : “We have seen a nice recovery.”<br />

C : “Let’s see the next game.”<br />

E: “Definitely.”


Bibliography<br />

[1] Justine Cassell, Tim Bickmore, Lee Campbell, Hannes Vilhjalmsson, and Hao Yan.<br />

”Human conversation as a system framework: Designing embodied conversational<br />

agents”. In Embodied Conversational Agents, pages 29–63. MIT Press, Cambridge,<br />

2000.<br />

[2] Jonathan Gratch and Stacy Marsella. Tears and fears: modeling emotions and<br />

emotional behaviors in synthetic agents. In Proceedings of the fifth international<br />

conference on Autonomous agents, pages 278 – 285. ACM Press, Montreal, Quebec,<br />

Canada, 2001.<br />

[3] Jeff Rickel and W. Lewis Johnson. Animated agents for procedural training in<br />

virtual reality: Perception, cognition, and motor control. Applied Artificial<br />

Intelligence, 13:343–382, 1998.<br />

[4] Marc Cavazza, Fred Charles, and Steven J. Mead. Interacting with virtual char-<br />

acters in interactive storytelling. In Proceedings of the first international joint<br />

conference on Autonomous agents and multiagent systems, pages 318–325. ACM<br />

Press, Bologna, Italy, 2002.<br />

[5] Mark Riedl, C.J. Saretto, and R. Michael Young. Managing interaction between<br />

users and agents in a multi-agent storytelling environment. In Proceedings of the 2nd<br />

International Joint Conference on Autonomous Agents and Multi Agent Systems.<br />

Melbourne, 2003.<br />

[6] Elisabeth Andre, Thomas Rist, Susanne van Mulken, Martin Klesen, and Stephan<br />

Baldes. The automated design of believable dialogues for animated presentation<br />

teams. In Embodied Conversational Agents, pages 220–225, Cambridge, 2000. MIT<br />

Press.<br />

[7] Elisabeth Andre and Thomas Rist. Presenting through performing: On the use of<br />

multiple Life-Like characters in Knowledge-Based presentation systems. In 2000<br />

International Conference on Intelligent User Interfaces, pages 1–8. ACM Press,<br />

New York, 2000.<br />




[8] Elisabeth Andre, Thomas Rist, and Jochen Muller. Integrating reactive and scripted<br />

behaviors in a Life-Like presentation agent. In Proceedings of the Second Inter-<br />

national Conference on Autonomous Agents (Agents 1998), pages 261–268. ACM<br />

Press, New York, 1998.<br />

[9] Elisabeth Andre, Kim Binsted, Kumiko Tanaka-Ishii, Sean Luke, Gerd Herzog,<br />

and Thomas Rist. Three RoboCup simulation league commentator systems. AI<br />

Magazine, 22:57–66, 2000.<br />

[10] Martin Strauss and Michael Kipp. ERIC: a generic rule-based framework for an<br />

affective embodied commentary agent. 2007.<br />

[11] Francois L. A. Knoppel, Almer S. Tigelaar, Danny Oude Bos, Thijs Alofs, and<br />

Zsofia Ruttkay. Trackside DEIRA: a dynamic engaging intelligent reporter agent.<br />

In Proceedings of the 7th international joint conference on Autonomous agents and<br />

multiagent systems (AAMAS). Portugal, 2008.<br />

[12] Michael Kipp. ANVIL: a generic annotation tool for multimodal dialogue. In<br />

Proceedings of Eurospeech, pages 1367–1370, Aalborg, 2001.<br />

[13] Ivan Gregor, Michael Kipp, and Jan Miksatko. IVAN: intelligent interactive virtual<br />

agent narrators. In Proceedings of the 9th International Conference on Intelligent<br />

Virtual Agents (IVA-09), pages 560–561. Springer, Amsterdam, 2009.<br />

[14] Martin Strauss. Realtime generation of multimodal affective sports commentary<br />

for embodied agents, 2007.<br />

[15] Patrick Gebhard. ALMA - a layered model of affect. In Proceedings of the Fourth In-<br />

ternational Joint Conference on Autonomous Agents and Multiagent Systems (AA-<br />

MAS 05), pages 29–36. Utrecht, 2005.<br />

[16] Lewis R. Goldberg. An alternative description of personality: The Big-Five fac-<br />

tor structure. In Journal of Personality and Social Psychology, volume 59, pages<br />

1216–1229. 1990.<br />

[17] Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Centering: A frame-<br />

work for modeling the local coherence of discourse. In Computational Linguistics,<br />

volume 21, pages 203–225. 1995.<br />

[18] Ionut Damian, Kathrin Janowski, and Dominik Sollfrank. Spectators, a joy to<br />

watch. In Proceedings of the 9th International Conference on Intelligent Virtual<br />

Agents (IVA-09), pages 558–559. Springer, Amsterdam, 2009.



[19] Elisabeth Andre and Thomas Rist. Controlling the behavior of animated pre-<br />

sentation agents in the interface: Scripting versus instructing. In AI Magazine,<br />

volume 22, pages 53–66. AAAI Press, 2001.<br />

[20] Elisabeth Andre, Gerd Herzog, and Thomas Rist. Generating multimedia presen-<br />

tations for RoboCup soccer games. In RoboCup-97: Robot Soccer World Cup I<br />

(Lecture Notes in Computer Science). Springer, 1998.<br />

[21] Dana Nau, Tsz-Chiu Au, Okhtay Ilghami, Ugur Kuter, Hector Munoz-Avila,<br />

J. William Murdock, Dan Wu, and Fusun Yaman. Applications of SHOP and<br />

SHOP2, 2004.<br />

[22] Richard Fikes and Nils Nilsson. STRIPS: a new approach to the application of<br />

theorem proving to problem solving. In Artificial Intelligence, volume 2, pages<br />

189–208. 1971.<br />

[23] Dana S. Nau, Stephen J. J. Smith, and Kutluhan Erol. Control strategies in HTN<br />

planning: Theory versus practice. In AAAI-98/IAAI-98 Proceedings, pages 1127–<br />

1133. 1998.<br />

[24] Dana Nau, Hector Munoz-Avila, Yue Cao, Amnon Lotem, and Steven Mitchell.<br />

Total-Order planning with partially ordered subtasks. In Proceedings of the Sev-<br />

enteenth International Joint Converence on Artificial Intelligence (IJCAI-2001).<br />

Seattle, 2001.<br />

[25] Dana Nau, Yue Cao, Amnon Lotem, and Hector Munoz-Avila. SHOP: simple hier-<br />

archical ordered planner. In International Joint Conference on Artificial Intelligence<br />

(IJCAI-99), pages 968–973, Stockholm, 1999.<br />

[26] Okhtay Ilghami and Dana S. Nau. A general approach to synthesize Problem-<br />

Specific planners, 2003.<br />

[27] Okhtay Ilghami. Documentation for JSHOP2. 2006.<br />

[28] Gary Riley. CLIPS: a tool for building expert systems, 2008. URL http:<br />

//clipsrules.sourceforge.net/.<br />

[29] Ernest Friedman-Hill. Jess, the rule engine for the java platform, 2009. URL<br />

http://www.jessrules.com/.<br />

[30] Patrick Gebhard, Michael Kipp, Martin Klesen, and Thomas Rist. Authoring scenes<br />

for adaptive, interactive performances. In Proceedings of the Second International<br />

Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-03),<br />

pages 725–732. ACM Press, New York, 2003.



[31] Martin Klesen, Michael Kipp, Patrick Gebhard, and Thomas Rist. Staging exhibi-<br />

tions: Methods and tools for modelling narrative structure to produce interactive<br />

performances with virtual actors. In Virtual Reality. Special Issue on Storytelling<br />

in Virtual Environments, volume 7, pages 17–29. Springer-Verlag, 2003.<br />

[32] Norbert Reithinger, Patrick Gebhard, Markus Lockelt, Alassane Ndiaye, Norbert<br />

Pfleger, and Martin Klesen. VirtualHuman: Dialogic and affective interaction with<br />

virtual characters. In Proceedings of the 8th International Conference on Multimodal<br />

Interfaces (ICMI’06), pages 51–58. Canada, 2006.<br />

[33] Patrick Gebhard, Marc Schroder, Marcela Charfuelan, Christoph Endres, Michael<br />

Kipp, Sathish Pammi, Martin Rumpler, and Oytun Turk. IDEAS4Games: building<br />

expressive virtual characters for computer games. In Proceedings of the 8th Interna-<br />

tional Conference on Intelligent Virtual Agents (IVA’08), pages 426–440. Springer,<br />

2008.<br />

[34] Patrick Gebhard and Susanne Karsten. On-Site evaluation of the interactive CO-<br />

HIBIT museum exhibit. In Proceedings of the 9th International Conference on<br />

Intelligent Virtual Agents (IVA-09), pages 174–180. Springer, Amsterdam, 2009.<br />

[35] Michael Kipp, Kerstin H. Kipp, Alassane Ndiaye, and Patrick Gebhard. Evaluating<br />

the tangible interface and virtual characters in the interactive COHIBIT exhibit,<br />

2006.<br />

[36] Andrew Ortony, Gerald L. Clore, and Allan Collins. The Cognitive Structure of<br />

Emotions. Cambridge University Press, 1988.<br />

[37] Christoph Bartneck. Integrating the OCC model of emotions in embodied charac-<br />

ters. In Proceedings of the Workshop on Virtual Conversational Characters: Appli-<br />

cations, Methods, and Research Challenges. Melbourne, 2002.<br />

[38] Alexander Reinecke, Christian Dold, and Thomas Koch. Charamel Avatar Player<br />

Interface. 2009.<br />

[39] Alexis Heloir and Michael Kipp. EMBR - a realtime animation engine for interactive<br />

embodied agents. In Proceedings of the 9th International Conference on Intelligent<br />

Virtual Agents (IVA-09), pages 393–404. Springer, Amsterdam, 2009.<br />

[40] N. Fazil Ayan, Ugur Kuter, Fusun Yaman, and Robert P. Goldman. HOTRiDE:<br />

hierarchical ordered task replanning in dynamic environments. In Proceedings of<br />

the ICAPS-07 Workshop on Planning and Plan Execution for Real-World Systems<br />

- Principles and Practices for Planning in Execution. Providence, Rhode Island,<br />

USA, 2007.
