Download Master Thesis - EMBOTS - DFKI


SAARLAND UNIVERSITY
Faculty of Natural Science and Technology I
Department of Computer Science
Master's Program in Computer Science

Master's Thesis

Embodied Presentation Teams:
A plan-based approach for affective sports commentary in real-time

submitted by
Ivan Gregor
on March 1, 2010

Supervisor
Prof. Wolfgang Wahlster

Advisor
Dr. Michael Kipp

Reviewers
Prof. Wolfgang Wahlster
Dr. Michael Kipp


Statement

I hereby confirm that this thesis is my own work and that I have documented all sources used.

Signed:

Date:

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Signed:

Date:


Abstract

Virtual agents are essential representatives of multimodal user interfaces. This thesis presents the IVAN system (Intelligent Interactive Virtual Agent Narrators), which generates affective real-time commentary on a tennis game given as an annotated video. The system employs two distinguishable virtual agents with different roles (TV commentator, expert), personality profiles, and positive, neutral, or negative attitudes towards the players. The system uses an HTN planner to generate dialogues, which makes it possible to plan large dialogue contributions and to generate alternative plans. The system can also interrupt the current discourse when a more important event happens. The current affect of the virtual agents is conveyed through lexical selection, facial expression, and gestures. The system integrates background knowledge about the players and the tournament as well as pre-defined user questions. We have focused on dialogue planning, knowledge processing, and behaviour control of the virtual agents; commercial products have been used as the audio-visual component of the system.

A demo version of the IVAN system was accepted for GALA 2009, which was part of the 9th International Conference on Intelligent Virtual Agents. We have verified that an HTN planner can be employed to generate affective commentary on a continuous sports event in real-time. However, while HTN planning is well suited to generating large dialogue contributions, expert systems are better suited to producing commentary on a rapidly changing environment. Most parts of the system are domain-dependent; however, the same architecture can be reused to implement applications such as interactive tutoring systems, tourist guides, or guides for the blind.



Acknowledgements

First of all, I would like to thank Michael Kipp and Jan Miksatko for being very helpful and inspiring supervisors. Thanks as well to the DFKI for providing the opportunity to work on this project, the necessary equipment, and funding to attend the GALA competition and the IVA conference. Thank you also to Charamel GmbH and Nuance Communications, Inc., for providing the Charamel virtual agents Mark and Gloria and the RealSpeak Solo software with the Tom and Serena voices, respectively. Finally, I would like to thank my parents for being very supportive during my studies in Prague and Saarbruecken.



Contents

Abstract i
Acknowledgements ii
List of Figures v
List of Tables vii

1 Introduction 1
1.1 Motivation 1
1.2 GALA 2009 Challenge 2
1.3 IVAN System 5
1.4 Research Aims 6

2 Related Work 8
2.1 ERIC 8
2.1.1 The Affect Module 9
2.1.2 The Natural Language Generation Module 9
2.2 DEIRA 10
2.3 Spectators 11
2.4 STEVE 11
2.5 Presentation Teams 12
2.5.1 Design of Presentation Teams 13
2.5.2 Inhabited Marketplace 13
2.5.3 Rocco II 14

3 Methods for Controlling Behaviour of Virtual Agents 16
3.1 Hierarchical Task Network Planning 16
3.1.1 Example of a Planning Task 17
3.1.2 Java Simple Hierarchical Ordered Planner (JSHOP) 19
3.1.3 JSHOP Language 20
3.2 Expert Systems 22
3.3 Statecharts 24

4 Generating Dialogue 26
4.1 Commentary Planning 26
4.1.1 Motivation 26
4.1.2 Dialogue Planning 28
4.1.3 Planning Tree 31
4.1.4 Commentary Excerpt 33
4.2 Affect 34
4.2.1 Motivation 34
4.2.2 Planning with Attitude 35
4.2.3 OCC Generated Emotions 36
4.2.4 Discussion 39

5 Architecture 41
5.1 System Overview 41
5.1.1 Design Aims 42
5.1.2 System Architecture 42
5.1.3 Off-the-shelf Components 44
5.2 Tennis Simulator 45
5.3 Plan Generation 47
5.3.1 Event Manager 48
5.3.2 Background Knowledge 52
5.3.3 Discourse Planner 53
5.4 Plan Execution 55
5.4.1 Template Manager 56
5.4.2 Avatar Manager 58
5.4.3 Output Manager 60

6 Discussion 62
6.1 Comparison with the ERIC system 62
6.2 Evaluation in Terms of Research Aims 63
6.3 Comparison JSHOP vs Jess 66

7 Conclusion 68
7.1 Summary 68
7.2 Future Work 69

A Commentary Excerpt 72


List of Figures

1.1 Event Position Specification 3
1.2 Example of an ANVIL File 4
2.1 ERIC commenting on a Horse Race 9
2.2 DEIRA (Dynamic Engaging Intelligent Reporter Agent) 10
2.3 STEVE in a 3D Simulated Student's Work Environment 12
2.4 Example of a Planning Method (Dialogue Scheme) to Discuss an Attribute Value 14
2.5 Excerpt of the Domain Knowledge 14
2.6 Gerd and Metze commenting on a RoboCup Soccer Game 15
3.1 Example of a Planning Task - HTN 18
3.2 Example of a Planning Task - Generated Plan 18
3.3 JSHOP Input Generation Process 19
3.4 Sample JSHOP Axiom 20
3.5 Sample JSHOP Operator 21
3.6 Sample JSHOP Method 22
3.7 Overview of the COHIBIT System 25
4.1 Example of a Planning Method 28
4.2 Example of a Compound Task Decomposition 30
4.3 Possible Decompositions of a Compound Task 31
4.4 Decomposition of the Goal Task "Comment" 32
4.5 Decomposition of the Subgoal Task Comment on Rally 32
4.6 Decomposition of the Goal Task "Comment" that leads to a Subgoal Task Drop Volley 33
4.7 Emotion Module GUI 38
5.1 IVAN Architecture 43
5.2 Dataflow 43
5.3 Charamel Virtual Agents Mark and Gloria 45
5.4 Tennis Simulator 46
5.5 Tennis Simulator GUI 46
5.6 IVAN Architecture - Plan Generation 47
5.7 Dataflow - Plan Generation 48
5.8 States of the Tennis Game 49
5.9 Tennis Score Counting using a Finite State Machine 50
5.10 Hierarchy of Facts from which an Ace can be deduced 52
5.11 JSHOP Input Generation Process 55
5.12 IVAN Architecture - Plan Execution 56
5.13 Dataflow - Plan Execution 56


List of Tables

1.1 Tennis Events 3
1.2 Event Position Specification 3
1.3 Track Element Specification 4
4.1 Dialogue Schemes 29
4.2 Example of Generated Dialogues based on different Appraisals 36
4.3 Description of the eight Basic OCC Emotions 37
4.4 Five Personality Traits 37
4.5 Example of Events that elicit respective Emotions 38
5.1 Description of the Tennis Counting Terminology 50
5.2 Example of high-level facts deduced from low-level facts 52
5.3 Examples of Facts deduced from the Background Knowledge 53


Chapter 1

Introduction

This thesis presents the IVAN system (Intelligent Interactive Virtual Agent Narrators), which provides affective commentary on a continuous sports event in real-time. We have employed two virtual agents that engage in dialogues to comment on a tennis game that was given as the GALA 2009 challenge (see section 1.2). The virtual agents can have different attitudes towards the players, and their current affective state can be conveyed through lexical selection, facial expression, and gestures. We have focused on the knowledge processing, dialogue planning, and behaviour control of the virtual agents, and have used commercial software as the audio-visual component of the system. In the following sections, we explain why it is beneficial to employ virtual agents, describe our task as given by the GALA 2009 challenge, outline the IVAN system, and state our research aims.

1.1 Motivation

Multimodal user interfaces are becoming more and more important in human-machine communication. Essential representatives of such interfaces are virtual agents, which aim to act like humans in the way they employ gestures, gaze, facial expression, posture, and prosody to convey facts in face-to-face communication with a user [1]. Face-to-face interaction over such a rich communication channel is often considered an exclusively human domain; for instance, when people have something important to say, they say it in person. To generate such complex behaviour, it is important to endow a virtual agent with emotions, since the agent then becomes more believable to humans, and a system that employs such agents becomes more entertaining and enjoyable for its users [2]. Virtual agents can be employed in many fields, such as computer games, tutoring systems, virtual training environments [3], storytelling systems [4, 5], advertisement, automated presenters [6, 7, 8, 9], and commentators [10, 11].


In this thesis, we have focused on commentary agents. Moreover, we have employed a presentation team [6], i.e., several distinguishable virtual agents with different personality profiles, roles, and goals. This enriches the communication strategies, and the information being conveyed can be distributed across several virtual agents in the form of a dialogue. It is particularly important to endow the virtual agents of a presentation team with emotions, since this makes them more distinguishable, and distinct virtual agents can better represent different roles and opposing points of view. A presentation team is also more advantageous than a single virtual agent because its performance is more entertaining for the audience, provides better understanding, and improves recall of the presented information.

An additional advantage of virtual commentary agents is that they can run locally on a user's computer, so the commentary can be partly customized: the user can adjust the basic settings of a commentary. Employing virtual agents as a presentation team to comment on a sports event is therefore an attractive approach.

1.2 GALA 2009 Challenge

In this section, we introduce our task, which was given as the GALA 2009 challenge (Gathering of Animated Lifelike Agents).1 The GALA event is a part of the annual International Conference on Intelligent Virtual Agents (IVA).2 The aim of GALA is to encourage students to implement a system that provides behaviourally complex commentary on a continuous stream of events in real-time. The challenge of GALA 2009 was to provide a commentary on a tennis game that was given as an annotated video. In previous years, the GALA challenge was to comment on a horse race produced by a horse race simulator.

The events that occur in the video of a tennis game are manually annotated with the ANVIL tool [12] and stored in an ANVIL file. The ANVIL file contains timestamped events that are grouped into tracks, where each track contains events with the same source; specifically, there is one track for the ball and one track for each player. Table 1.1 lists all events that can be annotated.

Each event is further specified with the place on the tennis court where it happened. Table 1.2 contains the attributes that specify the position of a ball or a player, and Figure 1.1 depicts these tags in a picture of a tennis court.

1 http://hmi.ewi.utwente.nl/gala
2 http://iva09.dfki.de/



Player events: throw, serve, forehand, backhand, forehand-volley, backhand-volley, smash, miss
Ball events: shot, cross net, hit net, hit tape, bounce, fault, out

Table 1.1: Tennis Events

Position side: server, receiver
Position longitudinal: net, mid court, baseline
Position lateral: left, middle, right
Position height: low, middle, high

Table 1.2: Event Position Specification

Figure 1.1: Event Position Specification

Each event with its timestamp and position specification stands for a track element. Table 1.3 lists the attributes of each track element.



Ball track element: timestamp, ball event, position lateral, position longitudinal, position side, position height
Player track element: timestamp, player event, position lateral, position longitudinal

Table 1.3: Track Element Specification

Figure 1.2: Example of an ANVIL File

Figure 1.2 shows two excerpts from an ANVIL file. The left column is an example of a ball track and the right column is an example of a track of the first player. As we can see, each track consists of track elements, where each track element represents one event. Furthermore, each track element has a start time and an end time. Whilst the start time of an event corresponds to its timestamp, the end time of an event can be omitted, since all events can be considered instantaneous. The ball track describes that a ball was shot on the right side of the baseline on the server side at time 7.49 sec, then the ball crossed the net in the middle, bounced in the middle of the mid-court on the receiver side, and was then shot on the right side of the baseline. The player track describes that the player is throwing a ball on the right side of the baseline at time 7.4 sec, then he is serving. Later on, the player is playing a forehand on the right side of the baseline, and then a backhand from the left side of the baseline.
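The track layout above can be sketched as a small data structure. The following Python sketch (class and field names are our own, not part of ANVIL or IVAN, and timestamps after the first are illustrative) builds the ball track described in Figure 1.2 and looks up which side the ball bounced on.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackElement:
    """One annotated event; end times are omitted since events are instantaneous."""
    timestamp: float              # start time in seconds
    event: str                    # e.g. "shot", "bounce" (ball) or "serve" (player)
    lateral: str                  # left | middle | right
    longitudinal: str             # net | mid court | baseline
    side: Optional[str] = None    # server | receiver (ball track only)
    height: Optional[str] = None  # low | middle | high (ball track only)

# Ball track from the Figure 1.2 walkthrough (only 7.49 s is from the text)
ball_track = [
    TrackElement(7.49, "shot", "right", "baseline", side="server"),
    TrackElement(8.10, "cross net", "middle", "net"),
    TrackElement(8.60, "bounce", "middle", "mid court", side="receiver"),
    TrackElement(9.20, "shot", "right", "baseline"),
]

bounces = [e for e in ball_track if e.event == "bounce"]
print(bounces[0].side)  # -> receiver
```

Grouping elements per source like this mirrors the ANVIL notion of a track: a time-ordered list of elements that share the same attribute set.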



1.3 IVAN System

In this section, we introduce the IVAN system (Intelligent Interactive Virtual Agent Narrators) [13], which we have developed to produce affective, behaviourally complex commentary on a continuous sports event in real-time. The system was employed to comment on a tennis game that was given as the GALA 2009 challenge. We have employed a presentation team in the sense of André and Rist [6], in our case two virtual agents with different roles (TV commentator, expert) reflecting two different presentation styles, different attitudes towards the players (positive, neutral, negative), and different personality profiles, to jointly comment on a tennis game. One virtual agent can interrupt the other or himself/herself when a more important event happens. The system also integrates background knowledge about the players and the tournament. Moreover, the user can pose one of the pre-defined questions at any time. We have focused on the knowledge processing, dialogue planning, and behaviour control of the virtual agents, and have used commercial software as the audio-visual component of the system.

The IVAN system consists of several modules that run in separate threads and communicate via shared queues. We employed an HTN planner to generate dialogues, statecharts to simulate the basic states of the game, and expert systems to maintain the emotional state of each virtual agent. When the system starts, the tennis simulator reads an ANVIL [12] file that contains the description of a tennis game and sends timestamped events (e.g. a player plays a forehand, the ball hits the net) at the time they occur to the input interface of the core system. The core system transforms these elementary events into low-level facts (e.g. which player just scored) that form the knowledge base for the HTN planner and the emotion module. Generated plans that represent possible dialogues are transformed into individual utterances and annotated with gestures. The current emotional state of a virtual agent is used to derive his/her facial expression. Annotated utterances, along with the corresponding facial expression tags, are sent to the audio-visual component, which creates the multimodal output of the system.
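The threads-plus-shared-queues pipeline described above can be sketched as follows. This is our own minimal Python illustration, not IVAN code: the event names, fact tuples, and utterance strings are invented, and each stage is reduced to a single rule.

```python
import queue
import threading

events = queue.Queue()      # tennis simulator -> core system
facts = queue.Queue()       # low-level facts -> planner / emotion module
utterances = queue.Queue()  # planned output -> audio-visual component

def event_manager():
    """Transform elementary events into low-level facts (much simplified)."""
    while True:
        ev = events.get()
        if ev is None:              # shutdown marker, pass it downstream
            facts.put(None)
            return
        if ev == "ace":             # e.g. a winning serve becomes a 'scored' fact
            facts.put(("scored", "player1"))

def planner():
    """Turn facts into utterances (stands in for the HTN planner stage)."""
    while True:
        fact = facts.get()
        if fact is None:
            utterances.put(None)
            return
        utterances.put(f"What a point by {fact[1]}!")

threads = [threading.Thread(target=event_manager), threading.Thread(target=planner)]
for t in threads:
    t.start()
events.put("ace")
events.put(None)  # shutdown marker
for t in threads:
    t.join()
out = utterances.get()
print(out)  # -> What a point by player1!
```

The shutdown marker travelling through every queue is one conventional way to let such a staged pipeline drain and terminate cleanly.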

When the system runs, our two virtual agents engage in dialogues to comment on the tennis game or on background facts. A virtual agent is happy if his/her favourite player is doing well and unhappy if s/he is losing. A virtual agent comments in a positive way on a player s/he likes and on events that lead towards the victory of his/her favourite player, and in a negative way on a player s/he dislikes and on events that hinder the victory of his/her favourite player. The current affect of a virtual agent is conveyed by lexical selection, facial expression, and gestures.
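The attitude-dependent appraisal just described amounts to reading off the sign of an event's valence from the agent's attitude towards the player it favours. The following sketch is our own illustration of that idea (the function and values are not the IVAN implementation).

```python
def appraise(event_benefits: str, attitude: dict) -> int:
    """Return +1 / 0 / -1: positive if the event helps a liked player.

    event_benefits: the player the event favours, e.g. "player1"
    attitude: the agent's attitude per player (+1 positive, 0 neutral, -1 negative)
    """
    return attitude.get(event_benefits, 0)

# A commentator who likes player1 and dislikes player2
commentator = {"player1": +1, "player2": -1}
print(appraise("player1", commentator))  # -> 1  (favourite scores: positive comment)
print(appraise("player2", commentator))  # -> -1 (rival scores: negative comment)
```

The resulting valence could then drive lexical selection (e.g. "brilliant shot" vs. "lucky shot") as well as facial expression and gesture choice.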



1.4 Research Aims

In this section, we describe our four main research aims. They will be discussed in section 6.2, Evaluation in Terms of Research Aims, after we have described the architecture of the whole system.

• Dialogue Planning for Real-time Commentary

In this master's thesis, we wanted to investigate how an HTN planner can be employed to generate real-time commentary on a continuous sports event in the form of a dialogue between two virtual agents. An example of a real-time commentary system that uses expert systems to control one virtual agent is ERIC [10]; however, it may be too reactive, i.e., individual utterances are triggered at particular knowledge states, so ERIC cannot generate larger contributions. In addition, expert systems cannot generate alternative plans, so HTN planning offers more variability. We therefore wanted to examine an HTN planner, which promises to be a good strategy for generating elaborate, large, and coherent dialogue contributions.

• Reactivity

The system should be able to react quickly to new events that happen during the tennis game. Moreover, when an event happens that is more important than the one the virtual agents are currently commenting on, the system should be able to interrupt the current discourse and comment on the new event. The interruption should be graceful and have a smooth transition.

• Behavioural Complexity

The virtual agents of our presentation team should ideally behave like human tennis commentators and produce interesting, suitable, and believable commentary. They should use the whole range of communication channels to convey facts about the tennis game: they should generate a variety of dialogues along with synchronized hand and body gestures, and have facial expressions appropriate to their current emotional states. Moreover, if we allow the user to interact with the system, the system becomes more engaging. This behavioural complexity ensures the believability of the virtual characters; without the traits mentioned above, the virtual agents would look unrealistic.

• Affective Behaviours

The virtual agents should react affectively to the events that occur in the tennis game according to their (positive, neutral, or negative) attitudes towards the players. Their emotional state should be derived from appraisals of the events that happen during the tennis game, and their current affect should be conveyed by lexical selection, facial expression, and gestures. Endowing virtual agents with emotions increases their believability and makes them better accepted by users.


Chapter 2

Related Work

In this chapter, we describe several examples of virtual agent applications that are relevant to our work. We introduce ERIC, an affective, rule-based sports commentary agent that won GALA 2007 as a horse race commentator, and DEIRA, another horse race reporter. Then we present the project Spectators, which participated in GALA 2009 (see section 1.2); it employs several autonomous affective virtual agents that jointly watch a tennis game as ordinary tennis spectators. To introduce HTN planning (see section 3.1), which we have employed in our system to generate dialogues, we describe STEVE, which uses HTN planning to help students perform physical procedural tasks in a 3D simulated student's work environment. Since we employed a presentation team [6] in our system, we also describe the general design of presentation teams and two applications that employ them.

2.1 ERIC

ERIC [10, 14] won GALA 2007 as a horse race commentator.1 ERIC is a generic rule-based framework for affective real-time commentary developed at DFKI. The system was tested in two domains: a horse race and a tank battle game, where the horse race was given in the form of a horse race simulator supplied by GALA 2007. The simulator sends the speed and the position of each horse to ERIC every second via a socket. ERIC receives events from the horse race simulator and produces coherent natural language along with non-verbal behaviour. The visual output is represented by a virtual agent that has lip movement synchronized to speech, can express various facial expressions, and can perform many different gestures. ERIC employs the same avatar engine as our system. The graphical output of ERIC is shown in Figure 2.1.

1 http://hmi.ewi.utwente.nl/gala/finalists 2007/



Figure 2.1: ERIC commenting on a Horse Race

ERIC consists of several modules. We describe the two most interesting, the Affect module and the Natural Language Generation module, in detail.

2.1.1 The Affect Module

The affect module receives facts from the world and assigns appraisals to each event, action, and object according to goals, desires, and cause-effect relations. The appraisal of an event, action, or object is then sent in the form of a specific tag to the ALMA module [15], which maintains the commentator's affective state. ALMA considers three types of affect: emotions (short-term), mood (medium-term), and personality (long-term). Emotions are bound to specific events and decay over time. Mood represents the average of the emotional state across time. Personality is defined by the Big Five [16], i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism. Personality is used to compute the initial mood and influences the intensity and decay of emotions. The affective state of a virtual agent influences utterance, gesture, and facial expression selection.
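The short-term/medium-term split can be illustrated with a toy model. The formula and numbers below are our own simplification, not ALMA's actual computation: each emotion decays exponentially from its elicited intensity, and mood is taken as the average of recent emotion intensities.

```python
def emotion_intensity(initial: float, age_s: float, half_life_s: float = 5.0) -> float:
    """Exponentially decaying emotion intensity (toy model of short-term affect)."""
    return initial * 0.5 ** (age_s / half_life_s)

def mood(intensities: list) -> float:
    """Medium-term mood as the average of recent emotion intensities."""
    return sum(intensities) / len(intensities) if intensities else 0.0

# A joy emotion elicited with intensity 0.8 has halved after one half-life (5 s):
print(emotion_intensity(0.8, 5.0))  # -> 0.4
print(mood([0.8, 0.4, 0.0]))        # average of recent intensities
```

In this reading, personality would set the starting mood and the per-emotion parameters (initial intensity and half-life), which matches the role the text assigns to the Big Five profile.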

2.1.2 The Natural Language Generation Module

This module uses a template-based algorithm to generate utterances. Each template corresponds to a rule in a rule-based engine. Each such rule has conditions that can be partitioned into four groups: facts that must be known, facts that must be unknown, facts that must be true, and facts that must be false. For each template there is at least one utterance that contains flat text and slots for variables. First, all candidate templates are generated, then the corresponding utterances are retrieved, and finally one of the most coherent utterances is chosen. Discourse coherence is ensured by Centering Theory [17], which, in simplified terms, says that a discourse is coherent if every two consecutive utterances are coherent. Thus, each template defines its topic and a list of all topics that a coherent following sentence may have. After a template has been chosen, the next template is chosen so that its topic is among the possible topics for a coherent following sentence of the previous template.
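A minimal sketch of this selection scheme (the template data, fact names, and helper are ours, not ERIC's): candidate templates are filtered by the four condition groups of their rule, then restricted to topics that keep the discourse coherent with the previous template.

```python
def candidates(templates, known, true_facts):
    """Filter templates by the four condition groups of their rule."""
    out = []
    for t in templates:
        if (t["known"] <= known and not (t["unknown"] & known)
                and t["true"] <= true_facts and not (t["false"] & true_facts)):
            out.append(t)
    return out

templates = [
    {"name": "overtake", "topic": "position",
     "next_topics": {"position", "speed"},
     "known": {"pos(h1)"}, "unknown": set(), "true": {"leads(h1)"}, "false": set()},
    {"name": "weather", "topic": "weather",
     "next_topics": {"weather"},
     "known": set(), "unknown": set(), "true": set(), "false": set()},
]

known = {"pos(h1)"}
true_facts = {"leads(h1)"}
last = templates[0]  # the previous utterance talked about 'position'

# Keep only candidates whose topic is a coherent follow-up to the last template
coherent = [t for t in candidates(templates, known, true_facts)
            if t["topic"] in last["next_topics"]]
print([t["name"] for t in coherent])  # -> ['overtake']
```

Both templates survive the condition filter here, but the weather template is rejected by the centering constraint because "weather" is not among the topics the previous template allows next.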

This system is the most closely related to our work, since the overall goal of ERIC is the same as ours. A comparison of the IVAN system and ERIC is given in section 6.1.

2.2 DEIRA

DEIRA [11] (Dynamic Engaging Intelligent Reporter Agent) is another commentary agent that participated in GALA 2007 as a horse race reporter.2 DEIRA employs an expert system to generate affective commentary in real-time. The system maintains the affective state of the reporter according to his personality and the events that occur in the horse race. The current affect is represented by a vector of four values (tension, surprise, amusement, pity) and is conveyed by the lexical selection and facial expression of the reporter. The graphical output of the system is shown in Figure 2.2.

Figure 2.2: DEIRA (Dynamic Engaging Intelligent Reporter Agent)

2 http://hmi.ewi.utwente.nl/gala/finalists 2007/



2.3 Spectators

The project Spectators [18] participated in GALA 2009 (see section 1.2). The system consists of several autonomous virtual agents that are watching a tennis game. The spectators can have different attitudes towards the teams, where an attitude can be positive or neutral. Each spectator has a euphoria factor that determines how much the spectator's mood state changes when an important event happens in the tennis game; the euphoria factor represents the spectator's personality trait. The mood of a spectator is expressed by his facial expression, typical animations, and speech. The spectators' moods are: euphoric, happy, slightly happy, neutral, slightly sad, sad, and disappointed. Furthermore, the position of the ball is interpolated so that the spectators can gaze at the ball within a rally, and the voice of a referee is incorporated to announce the score in the conventional way.

However, the system focuses only on non-verbal behaviour, i.e., neither the spectators nor the referee comment on the game as tennis commentators would. The system essentially consists of only a limited set of rules that trigger the respective animations. Our system and the Spectators project could therefore be combined to generate a complex scene of a tennis game with both tennis commentators and spectators.

2.4 STEVE<br />

STEVE (Soar Training Expert for Virtual Environments) [3] is a sample application that uses the same method as our system to control the behaviour of virtual agents, namely HTN planning (see section 3.1). STEVE is a virtual agent that helps students perform physical procedural tasks in a 3D simulated work environment. STEVE can either demonstrate procedural tasks or monitor students while they are performing tasks and provide assistance if they need help or ask questions. Each task consists of a set of partially ordered steps, where a step can be a primitive action or a composite action; this creates a hierarchical structure in which some steps of a task can also be reused to solve other tasks. Therefore, STEVE employs a Hierarchical Task Network to define the particular tasks. STEVE consists of a perception, a cognition, and a motor control module. The perception module monitors the state of the virtual world and maintains a coherent representation of it. In each loop of its decision cycle, the cognition module gets the current snapshot of the world from the perception module, chooses appropriate goals, and then constructs and executes plans. The motor control module gets high-level commands from the cognition module
3 http://hmi.ewi.utwente.nl/gala/finalists 2009/



to control voice, locomotion, gaze, gestures, and object manipulation. The graphical output of STEVE is shown in Figure 2.3.

Figure 2.3: STEVE in a 3D Simulated Student’s Work Environment<br />

Our system, like STEVE, uses an HTN planner to generate speech and can interact with users via user questions. We were also inspired by STEVE's execution cycle and its concept of snapshots of the world. In comparison to STEVE, our system employs two virtual agents, maintains their affective states, and generates affective commentary. On the other hand, our system generates shorter contributions, does not offer elaborate user interaction, and our virtual agents cannot move in the virtual environment.

2.5 Presentation Teams<br />

We employed a presentation team [6, 7, 8, 9] in our system to comment on a tennis game. In this section, we briefly describe the general design of presentation teams and then focus on two projects that employ them. The first project is the Inhabited Marketplace, where a car seller and customers have different preferences (e.g. running costs, prestige) and character profiles; they engage in dialogues to discuss the attributes of a car that the customers are interested in. The second project is Rocco II, where two soccer fans, who can have different attitudes towards the teams and different character profiles, jointly watch a RoboCup soccer game and comment on it.



2.5.1 Design of Presentation Teams<br />

The idea of presentation teams is to generate presentations automatically on the fly. A presentation team consists of at least two virtual agents that convey information in the style of a performance observed by the user. This approach is believed to be more entertaining and to provide better understanding than a system with a single presenter. The virtual agents' roles, character profiles, and dialogue types are chosen depending on the discourse purpose. Moreover, the characters should be distinguishable, i.e., they should differ in audio-visual appearance, expertise, interests, and personality. Distinct agents can also better express opposing roles. There are two basic approaches to generating the dialogue [19]. Agents with scripted behaviour correspond to actors in a play who can still improvise a little at performance time, i.e., their behaviour is first generated as a script (containing slots for variables that can be substituted at runtime) and executed later on. In contrast, autonomous agents have no script; they generate their dialogue contributions on the fly, i.e., they pursue their own communicative goals and react to the dialogue contributions of the other characters. First, we present a project that employs agents with scripted behaviour, and then a project that employs autonomous agents.

2.5.2 Inhabited Marketplace<br />

The Inhabited Marketplace project employs a presentation team to present facts along with an evaluation under constraints. Each character's profile is defined by agreeableness (agreeable, neutral, disagreeable), extraversion (extravert, neutral, introvert), and valence (positive, neutral, negative). The presentation team consists of a car seller and customers, each of whom can prefer a different dimension (e.g. environment, economy, prestige, or running costs). The aim of each customer is to discuss all attributes that have a positive or negative impact on the dimension they are interested in. Furthermore, the dialogue is also driven by the characters' personality traits, e.g., an extravert will start the conversation, or an introvert will use less direct speech. The dialogue is generated by an HTN planner (see section 3.1), i.e., the goal task is successively decomposed by planning methods into individual utterances. An example of a planning method that represents a particular dialogue scheme is shown in Figure 2.4. The method represents a scenario in which two agents discuss a feature of an object. It applies if the feature has a negative impact on some dimension and if this relationship can easily be inferred. Thus, any disagreeable buyer produces a negative comment referring to this dimension, e.g., to the dimension running costs, given the facts contained in Figure 2.5.



Figure 2.4: Example of a Planning Method (Dialogue Scheme) to Discuss an Attribute<br />

Value<br />

Figure 2.5: Excerpt of the Domain Knowledge

2.5.3 Rocco II

Gerd and Metze are two soccer fans that comment on a RoboCup soccer game. They can have different attitudes towards the teams, and their character profiles are defined by extraversion (extravert, neutral, introvert), openness (open, neutral, not open), and valence (positive, neutral, negative). The project focuses on the following dispositions: arousal (calm, neutral, excited) and valence. The system performs incremental event recognition [20], proceeding from a high-level analysis of the scene, over recognized events, to the basis for the commentary, where this basis additionally contains background knowledge about the game and the teams. The system employs two autonomous agents that use template-based natural language generation to produce the commentary on the fly. Furthermore, an agent can interrupt himself if a more important event happens. The templates are strings with slots for variables. Each template carries several tags, for instance: verbosity (the number of words), bias (positive, neutral, negative), formality (formal, normal, colloquial), and floridity (dry, normal, flowery language). The candidate templates are filtered in four steps in the execution cycle:
steps in the execution cycle:



1. under time pressure, pass only short templates

2. eliminate templates that were used recently

3. pass only templates expressing the speaker's attitude

4. choose templates according to the speaker's personality
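The four filtering steps can be sketched as a small pipeline. This is an illustrative simplification: the tag values, thresholds, and the personality rule below are assumptions for the example, not Rocco II's actual logic.

```python
# Sketch of the four-step template filter. Templates are dicts tagged with
# verbosity, bias, and floridity, mirroring the tags described in the text.
def filter_templates(templates, recently_used, speaker, time_pressure):
    if time_pressure:                                        # 1. only short templates
        templates = [t for t in templates if t["verbosity"] <= 5]
    templates = [t for t in templates                        # 2. drop recently used ones
                 if t["text"] not in recently_used]
    templates = [t for t in templates                        # 3. match speaker's attitude
                 if t["bias"] == speaker["attitude"]]
    style = "flowery" if speaker["extravert"] else "dry"     # 4. personality (assumed rule)
    return [t for t in templates if t["floridity"] == style]
```

A usage example: for an extravert speaker with a positive attitude under time pressure, a recently used template and templates with the wrong bias or style are all filtered out.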

The agents’ emotions are influenced by the current state of the game. Emotions can<br />

be expressed by the speed and pitch range of the speech along with different hand and<br />

body gestures. The graphical output of the system is shown in Figure 2.6.<br />

Figure 2.6: Gerd and Metze commenting on a RoboCup Soccer Game

Similar to our system, Gerd and Metze can have different attitudes towards the teams (players) and different personality profiles, and the system integrates background knowledge about the game and the teams and allows interruptions. In contrast to our system, Rocco II employs two autonomous agents that use template-based natural language generation to produce the commentary on the fly. While our templates can be categorized only according to bias, Rocco II uses a wide range of templates categorized by verbosity, bias, formality, and floridity. Thus, the system can generate more reactive and elaborate commentary than ours. The system also maintains the emotional state of the virtual agents, which can be expressed by prosody and by hand and body gestures. On the one hand, our system does not integrate prosody; on the other hand, our virtual agents have more elaborate facial expressions and gestures.


Chapter 3<br />

Methods for Controlling Behaviour of Virtual Agents

In this chapter, we will introduce three basic methods for controlling the behaviour of virtual agents that we have employed in our system. The most important method is HTN planning, which we have employed to generate dialogues for our presentation team (see section 4.1). The second method is expert systems, which we have used to define emotion-eliciting conditions in the emotion module (see section 4.2.3). The third method is statecharts, where we have used three simple finite state machines to model the basic states of the system (see section 5.3.1). Note that each of these methods can also be used on its own for natural language generation (e.g. see ERIC in section 2.1, which uses an expert system).
uses the expert systems).<br />

3.1 Hierarchical Task Network Planning<br />

In our system, we have employed Hierarchical Task Network (HTN) planning to generate the dialogues for our presentation team (see section 4.1). In general, planning is employed for problem solving and can be applied in many different domains to save time and money, e.g., in air transport, flight control, control of space probes, army missions, maintenance of complex machines (e.g. submarines), disaster relief, or tutoring systems (e.g. see STEVE in section 2.4) [21].

HTN planning is a variant of automated planning. First, we will introduce STRIPS-like planning [22] (where STRIPS stands for Stanford Research Institute Problem Solver) and then compare it to HTN planning. The input of a STRIPS-like planner consists of a set of facts that describe the initial state of the world, a set of goal facts, and a set of planning operators that correspond to actions that can modify the current state of the world. Let us denote the set of facts that describe the current state of the world as the Base. A planning operator has a list of preconditions, a delete list, and an add list. A planning operator can be applied if its preconditions are contained in the Base. After a planning operator is applied, all facts in its delete list are deleted from the Base and all facts in its add list are added to the Base. The STRIPS-like planner reaches the goal state of the world when the Base contains all goal facts. Once started, the planner searches for a sequence of planning operators that successively transforms the initial state of the world into its goal state. The output of the planner is a plan (or a list of all possible plans) consisting of a list of planning operators such that applying these operators successively to the initial state of the world yields the goal state. While a STRIPS-like planner may try to apply any planning operator at any step of the planning process to reach the goal state, an HTN planner may only try the planning operators that the HTN makes available at that particular step of the planning process.
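The STRIPS-like mechanics described above can be sketched as a naive depth-limited search. The operators below are a made-up toy domain, not an example from the thesis.

```python
# Sketch of STRIPS-like planning: the Base is a set of facts; an operator
# applies when its preconditions are contained in the Base, then its delete
# list is removed and its add list is inserted. (Toy domain for illustration.)
OPERATORS = {
    "pick-up":  {"pre": {"hand-empty", "on-table"},
                 "del": {"hand-empty", "on-table"}, "add": {"holding"}},
    "put-down": {"pre": {"holding"},
                 "del": {"holding"}, "add": {"hand-empty", "on-table"}},
}

def plan(base, goal, depth=4):
    """Find a sequence of operators turning `base` into a state containing `goal`."""
    if goal <= base:                       # all goal facts hold in the Base
        return []
    if depth == 0:                         # bound the naive search
        return None
    for name, op in OPERATORS.items():
        if op["pre"] <= base:              # preconditions contained in the Base
            rest = plan((base - op["del"]) | op["add"], goal, depth - 1)
            if rest is not None:
                return [name] + rest
    return None
```

Unlike an HTN planner, this search may try any applicable operator at any step; the HTN restricts which operators are even considered at each point of the decomposition.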

HTN planning is based on task decomposition, i.e., compound tasks are decomposed into subtasks, where each subtask is either a compound task on a lower level of the planning hierarchy or a primitive task that corresponds to an action that can be executed in the real world. Note that the primitive tasks in HTN planning correspond to the planning operators in STRIPS-like planning. The description of the world (called the planning domain in HTN planning terminology) is given as a Hierarchical Task Network, and the planning goal (called the planning problem) is given as a list of goal tasks and a list of facts that describe the initial state of the world. The resulting plan is a list of primitive tasks such that performing these primitive tasks in sequence accomplishes the goal tasks. In the following text, we will show an example of a planning task, introduce JSHOP 1 as the implementation of an HTN planner that we have employed in our system to generate the dialogues for our presentation team (see section 4.1), and finally define some basic constructs of the JSHOP language.

3.1.1 Example of a Planning Task<br />

Let us consider the example of a planning task depicted in Figure 3.1 to demonstrate a typical task for an HTN planner [23]. The Hierarchical Task Network represents ways to travel from x to y, more precisely, how to accomplish the goal task travel(x,y). We can either take a taxi for a short distance or fly for a long distance. (There may be other ways to travel that we do not consider here.) Thus, to accomplish the compound goal task travel(x,y) we have to fulfil one of

1 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/



its compound subtasks, namely travel by taxi or travel by air. In the first case (travel by taxi) we must first get a taxi, then ride the taxi from x to y, and finally pay for it. In the second case (travel by air) we must first buy a ticket from airport(x) to airport(y), then travel from x to airport(x), fly from airport(x) to airport(y), and eventually travel from airport(y) to y. Thus, to fulfil the compound task travel by taxi or travel by air we have to satisfy all of its respective subtasks. Note that once the planner starts, it first finds out whether it is possible to travel by taxi, and if not, it backtracks and tries the option of travelling by air.

Figure 3.1: Example of a Planning Task - HTN<br />

The resulting plan for travelling from UMD (University of Maryland) to MIT is depicted in Figure 3.2. First we have to buy a ticket from the BWI (Baltimore Washington International) airport to the Logan airport, then take a taxi from UMD to the BWI airport, then fly from the BWI airport to the Logan airport, and finally take a taxi from the Logan airport to MIT.

Figure 3.2: Example of a Planning Task - generated Plan
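The decomposition can be sketched recursively: traveling to and from the airports is itself a travel task, solved here by the taxi method. This is an illustrative sketch; the distances are made up, the airports are written generically as airport(x), and a real JSHOP domain would express the two branches as methods with preconditions and backtracking.

```python
# Sketch of the travel decomposition. A short distance triggers the
# travel-by-taxi method; otherwise the travel-by-air method decomposes
# into ticket buying, two recursive travel subtasks, and a flight.
DISTANCE = {("UMD", "airport(UMD)"): 10, ("airport(MIT)", "MIT"): 5}

def travel(x, y):
    """Decompose travel(x, y) into a list of primitive tasks."""
    if DISTANCE.get((x, y), float("inf")) <= 50:          # travel-by-taxi method
        return [f"get-taxi({x})", f"ride({x},{y})", "pay-driver"]
    ax, ay = f"airport({x})", f"airport({y})"             # travel-by-air method
    return ([f"buy-ticket({ax},{ay})"] + travel(x, ax)
            + [f"fly({ax},{ay})"] + travel(ay, y))
```

For travel("UMD", "MIT") this yields the same shape of plan as Figure 3.2: buy a ticket, taxi to the departure airport, fly, and taxi to the final destination.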



3.1.2 Java Simple Hierarchical Ordered Planner (JSHOP)<br />

In the following text, we will introduce the Java Simple Hierarchical Ordered Planner (JSHOP) 2 [24, 25], the implementation of an HTN planner that we have employed in our system. JSHOP is a Java implementation of a domain-independent Hierarchical Task Network (HTN) planner, developed at the University of Maryland, that is based on ordered task decomposition. Planning is conducted by problem reduction, i.e., the planner recursively decomposes tasks into subtasks and stops when it reaches primitive tasks that can be performed directly by planning operators. Compound task decomposition is realized by methods that define how to decompose compound tasks into subtasks. Since more than one method may be applicable to a compound task, the planner can backtrack, i.e., it can try several methods to decompose a compound task. As a consequence, the planner can find more than one suitable plan.

The Input of JSHOP consists of the description of a planning domain and a planning problem. The planning domain constitutes the world description, i.e., it consists of planning methods, planning operators, and axioms. The planning problem consists of a list of tasks and a list of facts that hold in the initial state of the world. The planning domain description is stored in a domain file and the problem description in a problem file. The Output of JSHOP is a list of suitable plans, where each plan consists of a list of primitive tasks and each primitive task corresponds to an action that can be executed in the real world (e.g. utter an utterance, or move object O from place X to place Y).

Figure 3.3: JSHOP Input Generation Process<br />

To Run the Planner, we first have to generate Java code from the respective domain and problem files, which are written in a special Lisp-like syntax. JSHOP is implemented

2 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/



in this way since this approach makes it possible to perform certain optimizations and to produce Java code that is tailored to a particular domain and problem description [26]. See Figure 3.3. (The generated Domain Description Java file is compiled together with the Domain-Independent Templates, which results in a Domain-Specific Planner; the generated Java Problem file is compiled as well. In the end, we can run the planner, which outputs all possible Solution Plans.)

3.1.3 JSHOP Language<br />

In the following text, we will describe the most important JSHOP constructs, namely:<br />

axioms, planning operators, and planning methods. See the JSHOP manual [27] for<br />

more details on the whole syntax of the language. JSHOP contains many constructs<br />

characteristic for an HTN planner (e.g. symbols, terms, call terms, logical atoms, logical<br />

expressions, implication, universal quantification, assignment, call expressions, logical<br />

preconditions, task atoms, task list, axioms, operators, and methods). Furthermore, it<br />

is possible to write user defined functions in Java.<br />

Axioms<br />

An axiom is an expression of the form:

(:- a [name1] L1 [name2] L2 ... [namen] Ln)

where the head of the axiom is a logical atom a and its tail is a list of pairs (name, logical precondition); a is true if L1 is true, or if L1, ..., Lk-1 are all false and Lk is true (for k <= n). The names of the logical preconditions are optional, but they can improve readability. Figure 3.4 shows an example of an axiom: a place ?x is within walking distance if the weather is good and ?x is within two miles of home, or if the weather is bad and ?x is within one mile of home.

Figure 3.4: Sample JSHOP Axiom
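The ordered semantics of this axiom can be sketched as follows. The facts here are made up for illustration; the point is that the second precondition is only tried once the first one has failed.

```python
# Sketch of the walking-distance axiom's ordered preconditions.
def in_walking_distance(place, weather, miles_from_home):
    # first precondition: good weather and within two miles of home
    if weather == "good" and miles_from_home.get(place, float("inf")) <= 2:
        return True
    # second precondition, tried only if the first failed:
    # bad weather and within one mile of home
    if weather == "bad" and miles_from_home.get(place, float("inf")) <= 1:
        return True
    return False
```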



Operators<br />

An operator has the following form:

(:operator h P D A [c])

where h is the operator's head, P is the operator's precondition, D is the operator's delete list, A is the operator's add list, and c is the operator's cost (the default cost is 1). Let us denote the set of facts that describe the current state of the world as the facts base. The operator can be applied if the preconditions in P are satisfied. After the operator has been applied, all facts contained in D are deleted from the facts base and all facts contained in A are added to it. Figure 3.5 shows an example of a planning operator: we can drive a ?truck from an ?old-loc to a ?location if the ?truck is at the ?old-loc. After the operator has been applied, the fact (at ?truck ?old-loc) is deleted from the facts base and the new fact (at ?truck ?location) is added.
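The effect of this operator on the facts base can be sketched as follows. For simplicity the variables are passed in as already-bound values, whereas JSHOP would bind ?truck, ?old-loc, and ?location by matching the precondition against the facts base.

```python
# Sketch of applying the drive-truck operator: check the precondition,
# then apply the delete list and the add list to the facts base.
def drive_truck(facts, truck, old_loc, location):
    if ("at", truck, old_loc) not in facts:   # precondition not satisfied
        return None
    facts = set(facts)
    facts.discard(("at", truck, old_loc))     # delete list
    facts.add(("at", truck, location))        # add list
    return facts
```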

Figure 3.5: Sample JSHOP Operator

Methods

A method is a list of the form:

(:method h [name1] L1 T1 [name2] L2 T2 ... [namen] Ln Tn)

where h is the method’s head; each Li is a precondition; each Ti is a list of tasks; each<br />

namei is a respective optional name. The compound task specified by the method can<br />

be performed by performing all tasks in the list Ti if the precondition Li is satisfied and<br />

for all preconditions Lk such that k < i holds that they are not satisfied. Figure 3.6<br />

presents an example of a method. The task specified by this method is to eat a ?food.<br />

If we have a fork then we eat the ?food with a fork. If we do not have a fork but we<br />

have a spoon then we eat the ?food with a spoon.
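The branch selection of this method can be sketched as follows. This is illustrative: a real JSHOP method decomposes the task into further subtasks rather than returning strings.

```python
# Sketch of the eat method: branches are tried in order, and the first
# branch whose precondition is satisfied determines the task list.
def eat(food, facts):
    if "have-fork" in facts:
        return [f"eat-with-fork({food})"]
    if "have-spoon" in facts:               # tried only if we have no fork
        return [f"eat-with-spoon({food})"]
    return None  # no branch applies: the method fails and the planner backtracks
```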



Figure 3.6: Sample JSHOP Method

3.2 Expert Systems

Expert systems can also be employed to generate commentary on a sports event, as shown by ERIC (see section 2.1). Nevertheless, we have employed an expert system only in the emotion module, to define emotion-eliciting conditions (see section 4.2.3). Expert systems are used in many domains to "replace" human experts. The know-how of human experts is first stored in the system; afterwards, the system can be queried by users, who always get consistent answers. A disadvantage of such a system is that it is not well suited to changing environments. Expert systems are used, for instance, in the following domains: financial services, accounting, production, process control, medicine, or human resources. Examples of expert systems are CLIPS (C Language Integrated Production System) [28] and its Java reimplementation Jess (Java Expert System Shell) [29], which we have employed in our system.

Expert systems reason about the world using knowledge that consists of facts and rules. While the facts describe the current world in terms of assertions, the rules define how to modify the facts base (knowledge base), e.g., how to deduce new facts from already known facts; each rule has the form of an if-then clause. Note that it is also possible to retract or modify facts as a result of a rule being fired. The inference loop of a typical expert system consists of the following three steps:

1. Match the left-hand sides of the rules against the facts and move the matched rules onto the agenda.

2. Order the rules on the agenda according to some conflict resolution strategy (e.g. at random).

3. Execute the right-hand sides of the rules on the agenda in the order decided in step (2).



The inference loop ends when no new facts can be inferred. After the inference process ends, we know which rules have been fired, and the facts base contains all initial and inferred facts that have not been retracted. In the following text, we will present the implementation of an expert system that we have employed in our system.
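The match-resolve-act loop can be sketched as follows. This is a strong simplification: rules here are plain (name, condition-facts, derived-facts) triples, each rule fires at most once, and conflict resolution just takes the first match, whereas a real engine like Jess supports variables, retraction, and configurable strategies.

```python
# Sketch of a forward-chaining inference loop over set-based rules.
def infer(facts, rules):
    """Fire applicable rules until no new facts can be inferred."""
    facts, fired = set(facts), set()
    while True:
        agenda = [r for r in rules                   # 1. match LHS against facts
                  if r[1] <= facts and r[0] not in fired]
        if not agenda:                               # nothing new: loop ends
            return facts
        name, lhs, rhs = agenda[0]                   # 2. trivial conflict resolution
        fired.add(name)
        facts |= rhs                                 # 3. act: assert the new facts
```

Chained rules fall out naturally: a fact asserted by one rule can satisfy the condition of another in a later iteration.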

Java Expert System Shell (Jess)<br />

Jess [29] is a fast Java implementation of an expert system developed at Sandia National Laboratories. Although it has a rich Lisp-like syntax, we will show only two examples: one that defines an unordered fact and another that defines a rule. See [29] for details on the complete syntax of the language.

Unordered Fact - Every fact corresponds to a particular template. The definition of a template starts with the keyword deftemplate, followed by a template name and an optional documentation comment. The following template is an example of how to define an automobile. The template contains four slots: the manufacturer, the model, the year of production as an integer, and the colour, where red is the default colour.

(deftemplate automobile
  "A specific car."
  (slot make)
  (slot model)
  (slot year (type INTEGER))
  (slot color (default red)))

The following command asserts a concrete Volkswagen Golf that was produced in 2009<br />

and is of the default red colour.<br />

(assert (automobile (model Golf)(make Volkswagen)(year 2009)))<br />

Rule - Consider the following templates. The first template defines an agent that has a name and can be hungry; the second template defines the current time.

(deftemplate agent
  "A hungry agent"
  (slot name)
  (slot hungry))



(deftemplate current_time
  "The current time"
  (slot ctime (type FLOAT)))

The following commands assert the agent George, who is hungry, and the current time, which is half past twelve.

(assert (agent (name George)(hungry TRUE)))
(assert (current_time (ctime 12.5)))

Consider the following rules, which are chained: the first rule opens the cafeteria if the current time is between noon and one o'clock, and the second rule sends every hungry agent to lunch once the cafeteria is open.

(defrule open_cafeteria
  (current_time {(12.0 <= ctime) && (ctime <= 13.0)})
  =>
  (assert (cafeteria_open)))

(defrule go_to_lunch
  (agent (name ?n) (hungry TRUE))
  (cafeteria_open)
  =>
  (printout t ?n " goes to lunch." crlf))


3.3 Statecharts

The third method we have employed is statecharts; in our system, we have used three simple finite state machines to model the basic states of the system (see section 5.3.1). However, statecharts can also be used to generate speech. An

example of a tool that makes it possible to control virtual agents using statecharts is SceneMaker [30]. A user can create an arbitrary statechart with SceneMaker to describe the behaviour of virtual agents. A scene is stored in every node of the statechart. A scene can, for instance, describe a dialogue between two virtual agents, i.e., the scene is written in a theatre script-like language and consists of utterances annotated with gestures. A statechart can also contain several types of edges that define the transitions between nodes (e.g. a timeout edge, a conditional edge, or a probability edge).
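Edge evaluation in such a statechart can be sketched as follows. The node and edge data are made up for illustration, and SceneMaker's actual edge types and semantics are richer than this.

```python
# Sketch of picking the next statechart node: the first outgoing edge
# whose condition fires determines the transition, otherwise we stay put.
def next_node(node, context):
    for kind, guard, target in node["edges"]:
        if kind == "conditional" and guard(context):    # conditional edge
            return target
        if kind == "timeout" and context["elapsed"] >= guard:  # timeout edge
            return target
    return node["name"]   # no edge fired: remain in the current node
```

A usage example: an "idle" node could transition to "greet" when a user appears, or to "screensaver" after 30 seconds of inactivity.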

The difference between SceneMaker and our approach is that while SceneMaker performs one of the pre-defined scenes at a node, we first run the HTN planner to generate the scene, and then the scene is performed. Nevertheless, we have employed only three simple finite state machines to maintain the basic states of our system; the logic itself is implemented in the domain description of the HTN planner.

SceneMaker was employed in several projects: CrossTalk [31], VirtualHuman [32], IDEAS4Games [33], and COHIBIT [34, 35]. For instance, the purpose of the COHIBIT project is to provide knowledge about car technology and virtual agents in an entertaining way. Two virtual agents interact with users and give them advice on how to build a car from different car pieces. The system is informed about the presence of users via cameras, and about the location and orientation of car pieces via RFID technology. An overview of the COHIBIT system is depicted in Figure 3.7.

Figure 3.7: Overview of the COHIBIT system


Chapter 4<br />

Generating Dialogue<br />

In this chapter, we will explain how we generate affective commentary on a tennis game for our two virtual agents. First, we will describe how we generate dialogues using an HTN planner. Then, we will describe how we generate a piece of dialogue that conveys a particular attitude of a virtual agent towards a player, how we maintain the affective state of a virtual agent, and how a particular affect can be conveyed by different modalities.

4.1 Commentary Planning<br />

In this section, we will describe how we generate the dialogues for our presentation team, which consists of two virtual agents. We have employed the JSHOP planner (see section 3.1) to generate the commentary, where the generated plans correspond to possible dialogues in which the presentation team can be engaged. The planner is triggered in particular states of the tennis game, gets facts that describe the current state of the game, and outputs all possible plans. A detailed description of the states in which the planner is triggered, of the input facts, and of how the generated plans are executed will be given in Chapter 5. Thus, in this section we will focus only on dialogue generation, i.e., on which dialogues our commentary team can be engaged in, in distinct states of the tennis game, according to the facts that describe the game and the background of the players and the tournament.
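The trigger cycle just outlined can be sketched as follows. The state names and the planner stub are illustrative assumptions; the actual trigger states and input facts are detailed in Chapter 5.

```python
# Sketch of the planner trigger: the planner runs only in game states
# that leave room for commentary, taking the current facts as input
# and returning all possible plans (candidate dialogues).
TRIGGER_STATES = {"rally-finished", "game-finished", "pause"}  # assumed names

def on_state_change(state, facts, planner):
    """Run the planner when the game enters a commentary-friendly state."""
    if state not in TRIGGER_STATES:
        return []
    return planner(facts)
```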

4.1.1 Motivation<br />

The overall goal of our system is to automatically generate interesting, suitable, coherent, and affective commentary from different points of view (depending on the commentators' attitudes to the players) in real time. To investigate what real tennis commentators




say during a game, we have analysed several tennis games from YouTube 1. We found that there are usually two commentators, where the second commentator is usually a former tennis player or an expert in the field who can always provide additional background information. We also found that the commentary is to some extent driven by the states of the game, e.g., nobody talks while the serving player concentrates before the serve, the commentators engage in small talk about the players' background when there is nothing else to comment on, and they usually summarize every rally after it finishes. Thus, a statechart approach such as the one presented in the SceneMaker project (see section 3.3) would also be convenient; we have therefore employed finite state machines to decide when to run the planner according to the states of the tennis game.

We have also noticed that the information conveyed by a sports commentator often adds<br />

little to what an ordinary spectator can perceive while s/he is watching<br />

the same tennis game. Since we wanted our commentary to be more sophisticated, we<br />

have taken inspiration from the TennisEarth 2 web page that describes tennis matches (rally<br />

by rally) for tennis fans who have not seen them. As a consequence, the commentary on<br />

TennisEarth is more elaborate and a valuable source of inspiration for us. We also wanted to incorporate<br />

more background knowledge since a standard tennis match is usually long-winded and<br />

there is often nothing to comment on; we have therefore made use of the OnCourt 3 project<br />

as a source of the background knowledge about players and tennis tournaments.<br />

As we have already stated, the commentators have positive, neutral, or negative attitudes<br />

to the players. Since the standard live commentary is usually balanced, except for<br />

particular international tournaments, we had to add respective bias to our utterances.<br />

Let us note that biased utterances usually convey particular affects. To deal with the<br />

real-time requirement, we had to make sure that the dialogues are not too long. However,<br />

we can predict the time we have at our disposal for a commentary according to the state<br />

of the tennis game. For instance, we always have more time to comment on a just finished<br />

game than on an event that happens within a rally. Nevertheless, these predictions<br />

are only rough approximations, thus we had to allow interruptions, i.e., to interrupt<br />

the current plan if a more relevant event happens. The coherence of the commentary is<br />

ensured by the dialogue planning that is elaborated in the next section.<br />

1 http://www.youtube.com/<br />

2 http://www.tennisearth.com/<br />

3 http://www.oncourt.info/
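
The time-budget heuristic described above can be sketched as follows; the concrete budgets and priorities are invented for illustration and are not the values used in IVAN.<br />

```java
public class CommentaryScheduler {
    // Rough prediction of the speaking time (in seconds) available in a
    // given game state; the numbers are illustrative only.
    public static int timeBudget(String state) {
        switch (state) {
            case "GAME_FINISHED":     return 25; // plenty of time between games
            case "RALLY_FINISHED":    return 10; // short summary before the next serve
            case "RALLY_IN_PROGRESS": return 4;  // only quick remarks mid-rally
            default:                  return 8;
        }
    }

    // Since the budgets are only rough approximations, a running plan is
    // interrupted when a strictly more important event arrives.
    public static boolean shouldInterrupt(int currentPriority, int newPriority) {
        return newPriority > currentPriority;
    }
}
```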



4.1.2 Dialogue Planning<br />

To represent our presentation team, we have employed two virtual agents that have<br />

different roles, attitudes to the players, and audio-visual appearance. The first com-<br />

mentator is the Charamel virtual agent Mark that represents a TV tennis commentator<br />

and the second Charamel virtual agent is Gloria that represents a tennis expert. (See<br />

section 5.1.3 for more details on the Charamel avatar engine.) While Mark should<br />

concentrate on simple facts concerning the tennis game, Gloria should rather elaborate on<br />

these facts. Let us remember that all dialogues are based on commentators’ attitudes<br />

to the players that can be positive, neutral, or negative.<br />

Dialogue Schemes<br />

We were inspired by the dialogue schemes presented in the project Presentation Teams<br />

(see section 2.5.2). A dialogue scheme is a generic representation of a piece of dialogue<br />

that can be generated under certain conditions by a planner. Let us note that dialogue<br />

schemes correspond to the methods in HTN planning. Let us also remember that<br />

in HTN planning, the compound goal task is decomposed by planning methods into<br />

subtasks, where each subtask is either a planning operator that corresponds to a<br />

template (that represents an utterance) or a compound task that is further decomposed<br />

by planning methods. Consider the planning method depicted in Figure 4.1.<br />

Figure 4.1: Example of a Planning Method<br />

Let us assume that player ?P1 has played a winning return (i.e. player ?P2 has lost<br />

the rally) and the subgoal task deduced by the planner from the goal task according<br />

to the current state of the game is the compound task “comment on rally”. Thus, we



can satisfy the compound task “comment on rally” by performing the BODY of the<br />

planning method if the PRECONDITIONS of the planning method can be satisfied (i.e.<br />

?A is a commentator, ?B is an expert, player ?P1 has played a winning return, player<br />

?P2 has lost the rally, ?A and ?B both have a positive attitude to player ?P1 ). Figure<br />

4.1 also presents an example of a possible dialogue that can be generated by applying<br />

this planning method assuming that the BODY of the planning method consists only of<br />

two planning operators (i.e. not compound tasks), and variables ?P1, ?P2, ?A and ?B<br />

stand for the respective players, commentator, and expert. We have already stated that<br />

all dialogue schemes are based on commentators’ attitudes to the players, nevertheless<br />

the semantics of a dialogue scheme can take one of the forms defined in Table 4.1.<br />

Whilst the left column defines the individual dialogue schemes, the right column presents an<br />

example of a possible generated dialogue for each dialogue scheme.<br />

Dialogue Scheme Example of a Generated Dialogue<br />

A: argument for/against X A: “That serve was really phenomenal!”<br />

B: contrary B: “Well, that is a little exaggerated!”<br />

A: argument for/against X A: “Blake is in great shape as usual.”<br />

B: contrary B: “But he already produced several unforced errors.”<br />

A: override A: “Still, he is the best player on the court.”<br />

A: argue for X A: “Excellent return by Safin.”<br />

B: elaborate on X B: “Unreachable for Blake”.<br />

A: background fact X A: “Blake’s brother Thomas is a well-known player.”<br />

B: evidence of X B: “His best ranking was the 141st place in 2002.”<br />

A: background fact X A: “Roddick has been injured four times recently.”<br />

B: consequence of X B: “It will be hard to break through today.”<br />

Table 4.1: Dialogue Schemes<br />
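
In JSHOP terms, a scheme such as those in Table 4.1 fires only if the PRECONDITIONS of the corresponding planning method hold. A minimal sketch of that test, with invented fact strings standing in for the planner’s world state:<br />

```java
import java.util.List;

// Hedged sketch (not the actual JSHOP machinery): a planning method is
// applicable only if every one of its preconditions is among the facts
// that currently describe the world.
public class MethodMatcher {
    public static boolean applies(List<String> preconditions, List<String> facts) {
        return facts.containsAll(preconditions);
    }
}
```

When several methods are applicable at once, the planner can produce an alternative plan for each of them, which is how the system obtains more than one candidate dialogue.<br />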

Planning Large Dialogue Contributions<br />

We have already shown how to generate a simple dialogue. In the following text, we<br />

will describe how to generate large dialogue contributions that consist of several simple<br />

dialogues. Consider a part of a planning tree that is depicted in Figure 4.2 where all<br />

nodes stand for compound tasks. Imagine that a game has finished and the subgoal task<br />

of the planner deduced from the goal task is the compound task “comment on just fin-<br />

ished game”. Hence, to satisfy the compound task “comment on just finished game”, we<br />

have to satisfy all its compound subtasks, namely: Introduction, Body, and Conclusion.<br />

Similarly, to satisfy the compound task Body, we have to satisfy all its compound sub-<br />

tasks, namely: comment on score, comment on winning team, and comment on losing<br />

team. The decomposition of the compound subtasks comment on winning team and



Figure 4.2: Example of a Compound Task Decomposition<br />

comment on losing team are analogous. Every leaf of the subtree depicted in Figure<br />

4.2 corresponds to at least one planning method that decomposes respective compound<br />

task. The compound task decomposition is accomplished by a planning method that<br />

stands for a dialogue scheme or by a planning method that represents a hierarchy of di-<br />

alogue schemes, i.e., the compound task can be decomposed by a planning method into<br />

several dialogue schemes in dependence on the facts that hold in the current description<br />

of the world (e.g. commentators’ attitudes to the players). The following list presents<br />

a possible generated dialogue that summarizes a game that has just finished (where C<br />

and E stand for a commentator and an expert, respectively).<br />

Introduction<br />

E: “What a relief!”<br />

C : “Tight game, let’s summarize it.”<br />

Comment on Score<br />

C : “Blake and Roddick won the first game.”<br />

E: “That’s unbelievable that they broke opponents’ serve!”<br />

C : “That was spectacular!”<br />

Comment on winning team - Highlights<br />

C : “Blake and Roddick played an excellent game.”<br />

E: “Well, they played several excellent winning returns.”<br />

Comment on winning team - Difficulties<br />

C : “Can you say something about difficulties of Blake and Roddick?”



E: “They were already trailing.”<br />

C : “But they recovered.”<br />

Comment on winning team - Odds<br />

C : “Are Blake and Roddick going to win the match?”<br />

E: “They are my favourites!”<br />

Comment on losing team - Difficulties<br />

C : “What difficulties did Safin and Ferrer have?”<br />

E: “They made many unforced errors.”<br />

Comment on losing team - Odds<br />

C : “Do Safin and Ferrer have any chance to win?”<br />

E: “Well, they can still break through.”<br />

Conclusion<br />

C : “Let’s see the next game.”<br />

E: “Definitely.”<br />
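
The decomposition that produces such a sectioned summary can be sketched as a recursive expansion of compound tasks. The method table below is a toy stand-in for the JSHOP planning domain (task names abbreviated, preconditions and dialogue-scheme choice omitted):<br />

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal illustration of hierarchical decomposition: compound tasks are
// expanded depth-first by their (single) method; tasks without a method
// are treated as primitive and appear in the resulting plan in order.
public class TaskDecomposer {
    private static final Map<String, List<String>> METHODS = new HashMap<>();
    static {
        METHODS.put("finished-game", List.of("introduction", "body", "conclusion"));
        METHODS.put("body", List.of("score", "winning-team", "losing-team"));
    }

    public static List<String> decompose(String task) {
        List<String> plan = new ArrayList<>();
        if (!METHODS.containsKey(task)) {
            plan.add(task); // primitive task: emit directly
            return plan;
        }
        for (String sub : METHODS.get(task)) {
            plan.addAll(decompose(sub));
        }
        return plan;
    }
}
```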

4.1.3 Planning Tree<br />

In this section, we will describe our planning tree that represents the hierarchy of all<br />

dialogues that can be generated. The planning tree is defined as a Hierarchical Task<br />

Network (HTN) in the planning domain of the JSHOP planner (see section 3.1). The<br />

root of the planning tree is the goal task, any internal node of the planning tree is a<br />

compound task (i.e. a possible subgoal task), and every leaf of the planning tree is<br />

either a primitive task that corresponds to a template (that represents an utterance) or<br />

a reference to a particular compound task that is an internal node of the planning tree.<br />

Let us consider Figure 4.3. To satisfy a compound task, we have to either satisfy all its<br />

descendants (1), one arbitrary descendant that can be satisfied (2), or we have to satisfy<br />

the first descendant that can be satisfied (3).<br />

Figure 4.3: Possible Decompositions of a Compound Task
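
The three decomposition modes can be sketched as follows (illustrative only; the actual planner operates on JSHOP methods rather than string lists):<br />

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the three decomposition modes from Figure 4.3: satisfy ALL
// descendants (1), ANY one satisfiable descendant (2), or the FIRST
// descendant that can be satisfied (3).
public class Decomposition {
    public static boolean all(List<String> tasks, Predicate<String> satisfiable) {
        return tasks.stream().allMatch(satisfiable);
    }

    public static boolean any(List<String> tasks, Predicate<String> satisfiable) {
        return tasks.stream().anyMatch(satisfiable);
    }

    // Returns the first satisfiable descendant, or null when none applies.
    public static String first(List<String> tasks, Predicate<String> satisfiable) {
        return tasks.stream().filter(satisfiable).findFirst().orElse(null);
    }
}
```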



The root of our planning tree is the goal task “Comment”. Figure 4.4 depicts how the<br />

goal task “Comment” is decomposed into subgoal tasks depending on the state of<br />

the game, e.g., the presentation team is engaged in dialogues to introduce the upcoming<br />

game if the game is just at the beginning, or it summarizes a rally just after the rally<br />

finishes.<br />

Figure 4.4: Decomposition of the Goal Task “Comment”<br />

Figure 4.5 shows the further decomposition of the compound task Comment on rally<br />

that is a subgoal task of the goal task “Comment”. Thus, our presentation team is<br />

commenting on the result of the last rally depending on its outcome, e.g., the pre-<br />

sentation team can comment on an excellent ace or a winning return played by a player.<br />

Figure 4.5: Decomposition of the Subgoal Task Comment on rally<br />

Figure 4.6 depicts the whole decomposition path from the goal task “Comment” to the<br />

subgoal task “Drop Volley” which results in a commentary on a rally that finished with<br />

a winning return that was a drop volley (i.e. the player won the rally with a ball that he<br />

played before it bounced, placing it just behind the net).



Figure 4.6: Decomposition of the Goal Task “Comment” that leads to a Subgoal<br />

Task Drop Volley<br />

4.1.4 Commentary Excerpt<br />

In this section, we will show an example of a generated dialogue where the players of<br />

the serving team are Blake and Roddick and the players of the receiving team are<br />

Safin and Ferrer. In this example, the dialogues are unbiased, i.e., the attitude of the<br />

commentators is neutral since we would like to show how detailed the commentary can<br />

be supposing that there is enough time to utter it. The state of the game and the<br />

subgoal of the planner are mentioned before each dialogue. Let us note that C stands<br />

for a commentator and E stands for a tennis expert. Another commentary excerpt is<br />

shown in Appendix A.<br />

Beginning - Introduction to the upcoming game<br />

C : “Ladies and Gentlemen! Welcome to the Wimbledon semi-final in doubles.”<br />

E: “We will guide you through the match in which James Blake and Andy Roddick<br />

are playing versus Marat Safin and David Ferrer.”<br />

C : “Enjoy the show!”<br />

Rally in Progress - Serving Player’s Background<br />

C : “Roddick has been injured four times since last year.”<br />

E: “It will be hard to break through today.”<br />

Rally in Progress - Comment on a nice shot<br />

E: “What a shot!”<br />

Rally finished - Summarize the rally (score: 15:0)<br />

C : “What a forehand by Roddick!”<br />



E: “Roddick hit an excellent forehand-volley right into the left corner.”<br />

C : “Roddick took advantage of a weak forehand return from Safin.”<br />

Rally in Progress - Players’ Background<br />

C : “James Blake’s brother Thomas also plays tennis.”<br />

E: “His best ranking was in 2002 when he occupied the 141st place in doubles.”<br />

Rally in Progress - Comment on a nice shot<br />

C : “What a shot by Roddick!”<br />

Rally finished - Summarize the rally (score: 30:0)<br />

C : “What a long rally!”<br />

E: “Ended by an inaccurate backhand-volley by Safin.”<br />

C : “30:0”<br />

E: “Blake and Roddick are holding their serve so far.”<br />

Rally in Progress - Background<br />

E: “The weather is cloudy today.”<br />

C : “Hopefully it won’t be raining.”<br />

Rally finished - Summarize the rally (score: 30:15)<br />

C : “Nice high lob by Safin.”<br />

E: “Too high for Roddick.”<br />

C : “Caused unforced error by Blake.”<br />

4.2 Affect<br />

In the following sections, we will explain why it is important to generate affective com-<br />

mentary on a tennis game and how affect can be conveyed by different modalities.<br />

We will explain two methods that we have employed to generate affective commentary<br />

on a tennis game and discuss the pros and cons of this approach.<br />

4.2.1 Motivation<br />

In this section, we will clarify how important it is to incorporate emotions into the<br />

commentary and how the affect can be expressed. In general, the virtual agents are<br />

better accepted by users if they are endowed with emotions [2]. Different personality<br />

profiles and affect make virtual agents more distinguishable, which is beneficial to the<br />

creation of presentation teams. We were inspired by the concept of the presentation<br />

teams described in section 2.5. Thus, we have employed two distinct virtual agents that



have different roles (commentator, expert), attitudes to the players (positive, neutral,<br />

negative), and personality profiles (defined by: optimistic, choleric, extravert, neurotic,<br />

social). Two affective virtual agents can also better represent opposing opinions and<br />

are more entertaining than only one presenter. Moreover, the user should better recall<br />

conveyed facts.<br />

There can be many exciting moments in a tennis game as well, e.g., to win a tennis game<br />

a player must have at least four points in total and two points more than the opponent,<br />

thus the finish of a tennis game can be quite thrilling since there can be many game and<br />

break points (i.e. situations when the serving or receiving player needs only one point<br />

to win the game). Therefore, our virtual agents should affectively react to the events<br />

that, e.g., lead to the victory of their favourite player or that lower the odds to win. The<br />

current affect of a virtual agent can be expressed by dialogue scheme selection, lexical<br />

selection (i.e. choice of an appropriate utterance according to the current affect), gaze,<br />

facial expression, and hand and body gestures.<br />

4.2.2 Planning with Attitude<br />

In this section, we will describe how a particular affect can be conveyed via the choice<br />

of a corresponding dialogue scheme where a dialogue scheme is a generic definition of<br />

a piece of dialogue (see section 4.1.2). As we have already stated, a virtual agent can<br />

have positive, neutral, or negative attitude to a player. Let us note that almost every<br />

topic of the commentary is related to a specific event (e.g. a player has just scored, a<br />

player has lost the lead). Thus, every such event can be appraised by a virtual agent as<br />

desirable or undesirable according to his/her attitude to the players (e.g. it is desirable<br />

when my favourite player gets a point or undesirable when he loses the lead). Hence, a<br />

virtual agent will comment in a positive way on a desirable event and in a negative way<br />

on an undesirable event. Each event is also usually connected with a particular player,<br />

thus a virtual agent will comment in a positive way on actions of a player s/he likes and<br />

in a negative way on actions of a player s/he dislikes. A virtual agent that has a neutral<br />

attitude to a player will comment in a neutral way on events that are connected with the<br />

respective player.<br />
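
This appraisal rule reduces to a small decision, sketched below; the enum and parameter names are ours, and real events in IVAN may carry more structure than a single boolean:<br />

```java
// Derive the appraisal of an event from the agent's attitude to the
// involved player and from whether the event favours that player.
public class Appraisal {
    public enum Attitude { POSITIVE, NEUTRAL, NEGATIVE }
    public enum Valence { DESIRABLE, NEUTRAL, UNDESIRABLE }

    public static Valence appraise(Attitude toPlayer, boolean eventFavoursPlayer) {
        if (toPlayer == Attitude.NEUTRAL) return Valence.NEUTRAL;
        boolean likes = (toPlayer == Attitude.POSITIVE);
        // A liked player succeeding, or a disliked player failing, is desirable.
        return (likes == eventFavoursPlayer) ? Valence.DESIRABLE : Valence.UNDESIRABLE;
    }
}
```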

Let us consider a dialogue that consists of two utterances that are uttered by two virtual<br />

agents. Let us assume that the dialogue is either related to an event that can be<br />

appraised as positive, neutral, or negative, or the event is related to a player to which a<br />

virtual agent has positive, neutral, or negative attitude. Table 4.2 presents examples of<br />

possible generated dialogues where A and B stand for respective commentators. The first<br />

column represents a particular combination of appraisals of an event or a combination of



attitudes to a player that is related to a particular event. The second column represents<br />

a dialogue scheme of a possible dialogue where X stands for a player’s action or a fact.<br />

The third column represents an example of a generated dialogue.<br />

Appraisal Dialogue Scheme Example of a Generated Dialogue<br />

A: positive A: argue for X A: “Outstanding ace by Blake!”<br />

B: positive B: support X B: “Blake hits blistering serve down the line!”<br />

A: positive A: argue for X A: “Excellent forehand by Safin!”<br />

B: negative B: play down X B: “That’s a bit overstated.”<br />

A: negative A: point out fault X A: “Safin failed to get the ball over the net.”<br />

B: positive B: excuse X B: “Safin just overhit the serve.”<br />

A: neutral A: convey fact X A: “The score is already 30:0.”<br />

B: negative B: consequence of X B: “Safin and Ferrer are real losers as usual!”<br />

A: neutral A: convey fact X A: “Deuce again.”<br />

B: neutral B: elaborate on fact X B: “Safin and Ferrer got back on board.”<br />

Table 4.2: Example of Generated Dialogues based on different Appraisals<br />

Thus, we have shown how a particular affect can be conveyed via the choice of an<br />

appropriate dialogue scheme. Let us note that the pieces of a generated dialogue are<br />

individual utterances where an utterance is usually uttered by a virtual agent in a<br />

particular situation that is correlated with a particular affect. Therefore, we annotated<br />

each utterance with default gesture and facial expression tags to seamlessly convey a<br />

particular affect by an utterance. Nevertheless, these tags are only defaults and can be<br />

substituted by other tags generated by other modules. For instance, the facial expression<br />

can be also set according to the current affective state of a virtual agent generated by<br />

the emotion module that is described in the next section.<br />
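
One possible (invented) shape for such an annotated utterance template, with default tags that another module may override:<br />

```java
// Sketch of an utterance template annotated with default gesture and
// facial-expression tags; the tag vocabulary is hypothetical.
public class Utterance {
    private final String text;
    private final String gestureTag;
    private String faceTag;

    public Utterance(String text, String defaultGesture, String defaultFace) {
        this.text = text;
        this.gestureTag = defaultGesture;
        this.faceTag = defaultFace;
    }

    // E.g. the emotion module substitutes the default facial expression.
    public void overrideFace(String tag) {
        this.faceTag = tag;
    }

    public String render() {
        return "[" + gestureTag + "][" + faceTag + "] " + text;
    }
}
```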

4.2.3 OCC Generated Emotions<br />

In this section, we will describe the emotion module that models the affective state<br />

of each virtual agent according to the OCC (Ortony, Collins, Clore) cognitive model<br />

of emotions [36, 37]. We simulate eight basic OCC emotions that are relevant to the<br />

tennis commentary. These emotions are explained in Table 4.3. The emotion module is<br />

initialized with the personality of each virtual agent that is defined by five personality<br />

traits listed in Table 4.4.



OCC Emotion Description<br />

JOY Something happened that I wanted to happen.<br />

DISTRESS Something happened that I did not want to happen.<br />

HOPE Something may happen that I really want to occur.<br />

FEAR Something may happen that I wish to never occur.<br />

RELIEF Something bad did not happen.<br />

DISAPPOINTMENT Something did not happen that I really wanted to occur.<br />

SATISFACTION Something happened that I really wanted to occur.<br />

FEAR-CONFIRMED Something bad did actually happen.<br />

Table 4.3: Description of the eight Basic OCC Emotions<br />

Personality Trait<br />

optimistic<br />

choleric<br />

extravert<br />

neurotic<br />

social<br />

Table 4.4: Five Personality Traits<br />

The input of the emotion module consists of facts that our system deduces from the elementary<br />

events received from the tennis game. The main functionality of the emotion module 4 is<br />

implemented in Jess (see section 3.2). The goals and antigoals of a virtual agent are<br />

deduced from his/her attitude to the players, e.g., virtual agent A that has a positive<br />

attitude to player P wants P to win the game, conversely, virtual agent B that has a<br />

negative attitude to player P wants P to lose the game. The events that happen in the<br />

tennis game are appraised as desirable if they lead to the goal or undesirable if they<br />

hinder the goal. The conditions that elicit emotions based on the events that happen in<br />

the tennis game are called emotion eliciting conditions. The appraisals of the emotion<br />

eliciting conditions then generate particular emotions with respective intensities where<br />

the initial intensity of a particular emotion depends on the personality of the respective<br />

virtual agent. The affective state of a virtual agent is represented by a vector of<br />

emotion intensities where, for instance, the emotion with the highest intensity can be<br />

considered as the output of the emotion module. Since the emotions decay over time,<br />

the emotion module maintains the emotion decay using, e.g., a linear decay function.<br />

Table 4.5 shows examples of events that elicit respective emotions.<br />

4 The definitions of the OCC emotions (in source file occ.clp) were provided by Michael Kipp (<strong>DFKI</strong>).



OCC Emotion Event<br />

JOY My favourite player scored.<br />

DISTRESS My favourite player lost a point.<br />

HOPE My favourite player is now leading.<br />

FEAR My favourite player is now trailing.<br />

RELIEF My favourite player settled the score.<br />

DISAPPOINTMENT My favourite player lost the lead.<br />

SATISFACTION My favourite player won the game.<br />

FEAR-CONFIRMED My favourite player lost the game.<br />

Table 4.5: Example of Events that elicit respective Emotions<br />
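
Elicitation, decay, and output of the emotion module can be sketched as follows; the uniform linear decay rate and the use of the maximum when an emotion is re-elicited are our illustrative choices, not the thesis values:<br />

```java
import java.util.HashMap;
import java.util.Map;

// Simplified emotion vector with linear decay; the prevailing emotion
// serves as the module's output.
public class EmotionState {
    private final Map<String, Double> intensity = new HashMap<>();

    // Raise an emotion when its eliciting condition is appraised; keep the
    // stronger value if the emotion is already active.
    public void elicit(String emotion, double initialIntensity) {
        intensity.merge(emotion, initialIntensity, Math::max);
    }

    // Linear decay applied once per tick (e.g. every second).
    public void decay(double rate) {
        intensity.replaceAll((emotion, value) -> Math.max(0.0, value - rate));
    }

    // The emotion with the highest non-zero intensity, or null when all decayed.
    public String dominant() {
        return intensity.entrySet().stream()
                .filter(e -> e.getValue() > 0.0)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```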

Figure 4.7 depicts the GUI of the emotion module. The left part of the chart depicts<br />

the current intensities of respective emotions for the first virtual agent and the right<br />

part of the chart depicts corresponding data for the second virtual agent. The dynamic<br />

bar chart was created using the JFreeChart 5 library. There is also a log for each virtual<br />

agent that lists all events that have caused a particular emotion from the beginning of<br />

the tennis game. (Let us remark that Figure 4.7 depicts only the last two events.) Each<br />

log entry consists of the emotion name, initial intensity, and the cause description.<br />

Figure 4.7: Emotion Module GUI<br />

5 Andreas Viklund. The JFreeChart Class Library. http://www.jfree.org/jfreechart/



The output of the emotion module is currently employed to set and update the facial<br />

expression of each virtual agent every second. Nevertheless, it could also be used for the<br />

gesture and lexical selection or as an input of the planner (if we had dialogue schemes<br />

based on the OCC emotions).<br />

4.2.4 Discussion<br />

In this section, we will explain why we have employed two methods to simulate emotions<br />

and which other options we have considered. As we have already stated, all dialogue<br />

schemes are based on virtual agents’ attitudes to the players. Nevertheless, we could<br />

have based the dialogue schemes also on the virtual agents’ current emotions. In this<br />

case, we would have first derived the current emotion for each virtual agent and then<br />

we would have tried to find an appropriate dialogue scheme. Nevertheless, in this case,<br />

we would have had to face a substantial growth in the number of dialogue schemes<br />

and a subsequent growth in the number of templates that represent individual<br />

utterances, since we would have needed dialogue schemes for every meaningful combination<br />

of emotions that the virtual agents can have.<br />

However, we noticed that the positive appraisals usually correspond to emotions such as<br />

joy, hope, satisfaction, and relief, and that the negative appraisals usually correspond<br />

to emotions such as distress, disappointment, fear, and fear-confirmed. Therefore, we<br />

could simplify the design of the planning domain and base the dialogue schemes only on<br />

virtual agents’ attitudes to the players and derive the specific emotion in a separate emotion<br />

module. Such a specific emotion can be expressed by other modalities (e.g. facial<br />

expression, gaze, gestures, lexical selection) other than the dialogue scheme selection.<br />

Nevertheless, if we had had the specific emotion of each virtual agent as an input of the<br />

planner, we could have also generated plans where the emotions could have changed at<br />

some point as a reaction to what the other agent would have said. However, this option<br />

is not useful in our case since both virtual agents share the same knowledge about the<br />

tennis game, and the emotion of a virtual agent should correspond to the current state<br />

of the game and not substantially change, for instance, from joy to distress if the virtual<br />

agent’s favourite player is winning but the other virtual agent has just said something<br />

bad about the winner.<br />

Nevertheless, the option to change the emotion at some point of a plan would be useful if<br />

the virtual agents had different knowledge about the tennis game such that an utterance<br />

uttered by one virtual agent could have substantially changed the emotion of the other<br />

virtual agent (e.g. one virtual agent would have made the other virtual agent happy if<br />

s/he had told him/her that his/her favourite player had just won the game). To change



the emotion at some point in a plan would also be useful if the plans were longer, which<br />

in our case is only the commentary on a just finished game, but it is hard to imagine<br />

that a virtual agent that is very happy because his favourite player has just won the<br />

game would have changed his/her emotion, e.g., from joy to distress because the other<br />

virtual agent said something bad about a player s/he likes.<br />

We have written a separate emotion module since we wanted to simulate the emotional<br />

state of each virtual agent more precisely, e.g., we wanted to maintain the emotion decay<br />

which would be infeasible in the planner. We could also have used some off-the-shelf<br />

software to simulate the emotional state of each virtual agent. Nevertheless, we wanted<br />

to simulate the emotions in a transparent way so that we could clearly see which event<br />

had elicited which emotion and which emotion currently prevailed. We also wanted to<br />

have full control over the module (i.e. we can adjust the computation of the initial<br />

intensities of individual OCC emotions in dependence on the personality, we can define<br />

our decay function, and we have control over the input and output tags). Therefore, we<br />

did not use any “black box” such as ALMA [15], although ALMA is in general a good<br />

choice to simulate the affective state of a virtual agent since it additionally maintains<br />

the history and emotion blending.<br />

The emotion module and the planner run independently. The planner cannot update the<br />

emotion module since not every plan that is generated is also executed. Additionally,<br />

the time of the plan generation and the time of the plan execution are different. The<br />

emotion module could have passed the current emotional states of the virtual agents to<br />

the planner, nevertheless we do not need the exact emotional state of the virtual agents<br />

in the planner since our dialogue schemes are based only on virtual agents’ attitudes to<br />

the players.


Chapter 5<br />

Architecture<br />

In this chapter, we will introduce individual modules of our system and describe how<br />

they cooperate to generate a commentary on a tennis game for our presentation team<br />

based on elementary events that are produced by a tennis simulator in real-time. The<br />

system consists of several modules that are running in separate threads and communicate<br />

via shared queues. For each module, we will describe its task and how it communicates<br />

with other modules, i.e., what the input and output of a particular module are. First,<br />

we will introduce the tennis simulator that produces elementary events (e.g. a player<br />

plays a forehand, the ball crosses the net, the ball lands out). Then, we will describe<br />

the plan generation, i.e., how we generate plans based on the knowledge deduced from<br />

the elementary events received from the tennis simulator, where a plan represents a particular<br />

dialogue. Afterwards, we will explain how these generated plans are executed, i.e., how<br />

we select plans from all the plans generated in the previous step. Our presentation team<br />

is then engaged in dialogues that correspond to the selected plans.<br />

5.1 System Overview<br />

In the following sections, we will present the main design aims, introduce the overall<br />

architecture of the system, and present the off-the-shelf components that are employed<br />

in the system. We will discuss the advantages of the system’s modular architecture,<br />

how to ensure reactivity, and the need for extensibility. Finally, we will<br />

briefly introduce individual modules of the system and how they cooperate to produce<br />

a commentary on a tennis game.<br />



Chapter 5. Architecture 42<br />

5.1.1 Design Aims<br />

The system was designed with three main design aims, namely: modularity, reactivity,<br />

and extensibility, that will be described below.<br />

Modularity<br />

The overall system is broken down into individual modules, where each module provides<br />

a clearly defined interface and functionality. Each module runs in a separate thread<br />

and asynchronously communicates with other modules via shared queues. This approach<br />

is advantageous since each module can be tested separately and possibly replaced by<br />

another module that implements the same interface.<br />
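
This communication pattern can be sketched with Java’s standard BlockingQueue; the module and event names are illustrative:<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two modules in separate threads sharing a queue: a "simulator" thread
// posts events while the consumer blocks on take() until events arrive.
public class QueuedModules {
    public static List<String> relay(List<String> produced) {
        BlockingQueue<String> shared = new ArrayBlockingQueue<>(16);
        Thread simulator = new Thread(() -> {
            for (String event : produced) {
                try {
                    shared.put(event); // blocks if the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        simulator.start();
        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < produced.size(); i++) {
                received.add(shared.take()); // blocks until an event arrives
            }
            simulator.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }
}
```

A bounded queue gives backpressure for free: put blocks when the queue is full and take blocks when it is empty, which keeps the modules decoupled while bounding memory.<br />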

Reactivity<br />

The system should be able to react quickly to new events. Evidently, reactivity is<br />

closely related to modularity, which facilitates not only parallel execution on multi-<br />

core platforms but also the possibility of interruptions, i.e., one module can cause the<br />

interruption of another module by sending an asynchronous message. The response time<br />

of each module must be reasonably bound as well.<br />

Extensibility<br />

Since we wanted to participate in GALA 2009 (see section 1.2), we had to rapidly<br />

develop a demo application at that time. As a consequence, the overall design had to<br />

allow for a simple initial implementation and subsequent refinement. This aim<br />

is also related to the modularity since individual modules can be added, replaced, or<br />

separately improved.<br />

5.1.2 System Architecture<br />

In the following text, we will briefly explain how we generate the commentary of our presentation team based on the elementary events (e.g. a player serves, the ball hits the net) produced by the tennis simulator. We will introduce the individual modules of the system and describe how they communicate. Figure 5.1 depicts the overall architecture of the IVAN system and Figure 5.2 describes the dataflow, which starts with the elementary events produced by the tennis simulator and ends with the multimodal output rendered by the Charamel avatar engine (see section 5.1.3).

The tennis simulator sends elementary events to the event manager. The event manager receives these elementary events (such as the ball crossing the net or the ball bouncing) and deduces low-level facts (e.g. a rally has finished). These derived low-level

facts are stored in the knowledge base.

Figure 5.1: IVAN Architecture

Figure 5.2: Dataflow

The event manager also decides when to run the discourse planner, based on the global state of the game. In other words, the event manager has the role of a perception unit: it receives events from the outside world and maintains a coherent representation of it in the form of the knowledge base. The discourse

planner, triggered by the event manager, gets facts from the knowledge base, generates all possible plans, and passes them to the output manager, where each plan represents a possible dialogue. Some facts can also be deduced during the planning process and stored in the knowledge base (e.g. statistics used to generate the commentary that summarizes the game). The output manager maintains the plan execution: it chooses one plan to execute, matches planning operators with templates, adds gesture annotations, and sends appropriate



commands to the avatar manager, which transforms them into avatar-engine-specific commands. More precisely, each planning operator is mapped onto a template, where a template represents a set of possible annotated utterances. Thus, a planning operator is resolved to an annotated utterance chosen at random among all utterances that correspond to the respective template. Furthermore, the avatar manager maintains the state of the dialogue (e.g. who is speaking at the moment or how long it will take to finish the current utterance), which can be used, for instance, to decide when to interrupt the current discourse.

There is also the emotion module, which separately maintains the emotional state of each virtual agent. For instance, the facial expression of each virtual agent is updated every second according to the current emotional state that is stored in the knowledge base. Let us note that the knowledge base also contains background facts about the game and players, the virtual agents' roles (commentator or expert), personality profiles, and attitudes (positive, neutral, or negative) towards the players.

5.1.3 Off-the-shelf Components<br />

We have used two commercial products as the audio-visual components of the system: Charamel 1 to visualize the virtual agents and RealSpeak Solo 2 as the text-to-speech (TTS) engine. We will describe both software toolkits in the following paragraphs.

Charamel Avatar Engine<br />

Charamel is a standalone application that communicates via a socket and can visualize several virtual agents at the same time. Individual virtual agents are controlled via the scripting language CharaScript. The virtual agents can express 14 different facial expressions (e.g. smile, happy, disappointed, angry, sad) with varying intensities. Their lip movement is synchronized with speech produced by the RealSpeak Solo TTS. The virtual agents can play back around one hundred pre-fabricated gesture clips that can be tweaked using many different parameters (e.g. velocity, start time, end time, interpolation time). Moreover, the transitions between any two consecutive gestures or facial expressions are interpolated; the virtual agents also perform idle gestures while no other gestures are triggered, in order to look natural. Figure 5.3 depicts the two Charamel virtual agents Mark and Gloria that were employed in the system.

1 http://www.charamel.com/<br />

2 http://www.nuance.com/realspeak/solo/



Figure 5.3: Charamel Virtual Agents Mark and Gloria

RealSpeak Solo TTS Engine

RealSpeak Solo is a TTS engine that gets commands from Charamel to vocalize the desired utterances. While the TTS engine vocalizes an utterance, it sends tags back to Charamel, which enables synchronized lip movement of the speaking virtual agent. RealSpeak Solo supports several male and female voices. We employed the British female voice Serena for the Charamel virtual agent Gloria and the American male voice Tom for Mark.

5.2 Tennis Simulator<br />

The GALA 2009 challenge was given as a static ANVIL file that describes a tennis game (see section 1.2). Since we wanted to test our system as if it were a real-time application, we wrote a tennis simulator that first reads an ANVIL file and then simulates the game in real time. Although we consider the tennis simulator a part of our system, it can easily be reused in other systems since it communicates via a socket. Moreover, only a minor modification is needed to simulate any game that is given as an ANVIL file (with a corresponding video). In the following text, we will describe our tennis simulator in detail.

The architecture of the tennis simulator is shown in Figure 5.4. The tennis simulator first reads a video file and its annotation, which is stored in an ANVIL file. The video is



Figure 5.4: Tennis Simulator<br />

opened in a video player that is implemented using the Java Media Framework API 3; the timestamped events, read from the ANVIL file, are stored in a priority queue. When the simulator is started, it sends the events to a socket one by one at the times they occur. Since the time of the simulation is determined by the video player, it is possible to pause the simulation or to skip forward. It is also possible to fire one of the pre-defined question events at any time.
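The priority-queue dispatch described above can be sketched as follows; the Event type and field names are illustrative, not the simulator's actual classes.

```java
import java.util.PriorityQueue;

// Sketch of the simulator's event dispatch: timestamped events are kept in a
// priority queue ordered by time and fired once the (video) clock reaches them.
public class EventDispatchSketch {
    public static class Event implements Comparable<Event> {
        public final long timeMs;
        public final String name;

        public Event(long timeMs, String name) {
            this.timeMs = timeMs;
            this.name = name;
        }

        @Override
        public int compareTo(Event other) {
            return Long.compare(timeMs, other.timeMs);
        }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();

    public void add(Event e) {
        queue.add(e);
    }

    /** Returns the next event due at or before the given clock time, or null. */
    public Event poll(long clockMs) {
        Event head = queue.peek();
        if (head != null && head.timeMs <= clockMs) {
            return queue.poll();
        }
        return null;
    }
}
```

Driving `poll` from the video player's clock rather than wall time is what makes pausing and skipping forward work for free.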

Figure 5.5 shows the GUI of the tennis simulator. A user first chooses an input file. S/he can decide whether the video will be displayed in the video player and whether the start of the simulation will be postponed or moved forward; then the simulation can be started.

Figure 5.5: Tennis Simulator GUI<br />

3 http://java.sun.com/javase/technologies/desktop/media/jmf/



5.3 Plan Generation<br />

In this section, we will describe how we generate plans that correspond to possible dialogues from the elementary events generated by the tennis simulator. Figure 5.6 highlights in color the part of the system that is responsible for the plan generation, and Figure 5.7 shows which part of the dataflow is covered in this section. First, we will describe the event manager, which receives elementary events from the tennis simulator, deduces low-level facts from them, and stores these facts in the knowledge base, where the low-level facts, together with the background knowledge, the virtual agents' roles, personality profiles, and attitudes towards the players, form a coherent representation of the outside world. Then, we will describe the discourse planner, which is triggered by the event manager, gets facts from the knowledge base, and outputs all possible plans; these plans are subsequently passed to the output manager, which maintains the plan execution described in section 5.4.

Figure 5.6: IVAN Architecture - Plan Generation



Figure 5.7: Dataflow - Plan Generation

5.3.1 Event Manager

In this section, we will describe the event manager, which has the role of a “perception unit” since it receives events from the outside world and maintains a coherent representation of it in the knowledge base. More precisely, the event manager receives elementary events from the tennis simulator and deduces low-level facts that are stored in the knowledge base. It also maintains the overall state and score of the match and decides when to run the discourse planner. The elementary events (e.g. a player plays a backhand, the ball lands out) that the event manager receives from the tennis simulator were defined in detail in the GALA 2009 scenario (see section 1.2); moreover, an elementary event can also be a pre-defined question event issued by the user. Let us recall that a tennis match consists of sets, a set consists of games, and a game consists of rallies. However, for the sake of simplicity, we consider only one tennis game. Since we cannot run the discourse planner every time we receive an elementary event, we first describe the basic states of the tennis game, which are modelled using finite state machines, and then identify at which states we run the discourse planner. After that, we explain which low-level facts are deduced by the event manager, stored in the knowledge base, and subsequently made available to the discourse planner.

States<br />

The two finite state machines that we have employed to model the basic states of the tennis game are depicted in Figure 5.8. Both finite state machines run in parallel; the initial state is marked in red and the transitions correspond to particular sequences of elementary events.

Let us first look at the finite state machine on the left side. We start in the state beginning; after a player throws a ball to serve, we move to the state game in progress; and after the game finishes, we move to the state game finished. The state machine

on the right side starts in the state game not in progress.

Figure 5.8: States of the Tennis Game

After a player throws a ball to serve, we move to the state rally beginning. A player can throw a ball several times before he actually serves, but once he serves we move to the state rally in progress. After the ball hits the net, lands out, or bounces twice, we reach the state rally finished. Then, if the game is finished, we return to the state game not in progress; otherwise we wait until a player throws a ball to serve and move to the state rally beginning. Both finite state machines could be merged into one, but keeping them separate makes them easier to understand. Two facts derived from the respective finite state machines are also stored in the knowledge base.
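The right-hand state machine of Figure 5.8 can be sketched as a Java enum-based transition function; the state and event names below paraphrase the text and are not IVAN's actual identifiers.

```java
// Sketch of the rally-level finite state machine described above.
public class RallyStateMachine {
    public enum State { GAME_NOT_IN_PROGRESS, RALLY_BEGINNING, RALLY_IN_PROGRESS, RALLY_FINISHED }
    public enum Event { BALL_THROWN, SERVE, BALL_HITS_NET, BALL_LANDS_OUT, BALL_BOUNCES_TWICE, GAME_FINISHED }

    private State state = State.GAME_NOT_IN_PROGRESS;

    public State getState() { return state; }

    /** Applies one elementary event and returns the resulting state. */
    public State handle(Event e) {
        switch (state) {
            case GAME_NOT_IN_PROGRESS:
                if (e == Event.BALL_THROWN) state = State.RALLY_BEGINNING;
                break;
            case RALLY_BEGINNING:
                // a player may throw the ball several times before actually serving
                if (e == Event.SERVE) state = State.RALLY_IN_PROGRESS;
                break;
            case RALLY_IN_PROGRESS:
                if (e == Event.BALL_HITS_NET || e == Event.BALL_LANDS_OUT
                        || e == Event.BALL_BOUNCES_TWICE) state = State.RALLY_FINISHED;
                break;
            case RALLY_FINISHED:
                if (e == Event.GAME_FINISHED) state = State.GAME_NOT_IN_PROGRESS;
                else if (e == Event.BALL_THROWN) state = State.RALLY_BEGINNING;
                break;
        }
        return state;
    }
}
```

Repeated BALL_THROWN events in RALLY_BEGINNING leave the state unchanged, matching the observation that a player may throw the ball several times before serving.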

The event manager triggers the discourse planner at certain states of the tennis game. The following list shows the specific states at which the discourse planner is triggered, together with some examples of goals that the discourse planner can derive at the respective states. (Let us note that additional states could be added if desired.)

• beginning - do some introduction to the upcoming game<br />

• rally finished - summarize just finished rally<br />

• game finished - discuss just finished game<br />

• rally beginning & a player has thrown the ball already twice - a player is nervous,<br />

a player concentrates<br />

• rally in progress - comment on the serving player’s background<br />

• rally in progress & a volley or a smash was played - nice shot, risky shot<br />

• rally in progress & the ball hit the tape - luck, inaccuracy<br />

• a question event occurred - answer the question



Score<br />

The score of the game is also maintained in the event manager using a point counter for<br />

each player and a finite state machine depicted in Figure 5.9. If a player wins a rally<br />

s/he gets one point. A player wins the game if he has at least 4 points in total and at<br />

least 2 points more than the opponent. After both players reach at least 3 points and<br />

the game is not over yet, the score is either deuce or advantage. Table 5.1 explains how<br />

the tennis score is counted for one player in the tennis terminology. Let us note that<br />

the same player is serving within one game and that the score is read with the serving<br />

player’s score first.<br />
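The counting rules above can be sketched as a small Java class; the class and method names are illustrative, not the event manager's actual code.

```java
// Sketch of tennis score calling for a single game, following the rules in the
// text and the terminology of Table 5.1 (player 1 is assumed to serve).
public class TennisScore {
    private int p1, p2;  // points won by each player

    public void pointTo(int player) {
        if (player == 1) p1++; else p2++;
    }

    public String call() {
        // a player wins with at least 4 points and a 2-point margin
        if ((p1 >= 4 || p2 >= 4) && Math.abs(p1 - p2) >= 2) {
            return "game " + (p1 > p2 ? "player 1" : "player 2");
        }
        if (p1 >= 3 && p2 >= 3) {
            if (p1 == p2) return "deuce";
            return "advantage " + (p1 > p2 ? "player 1" : "player 2");
        }
        // the serving player's score is read first
        return term(p1) + "-" + term(p2);
    }

    private static String term(int points) {
        switch (points) {
            case 0: return "love";
            case 1: return "fifteen";
            case 2: return "thirty";
            default: return "forty";
        }
    }
}
```

Keeping the two point counters alongside the finite state machine of Figure 5.9 gives the event manager both the current call and the raw counts to derive score facts from.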

Figure 5.9: Tennis Score Counting using a Finite State Machine<br />

Score Explanation<br />

“love/zero” 0 points<br />

“fifteen” 1 point<br />

“thirty” 2 points<br />

“forty” 3 points<br />

“deuce” at least 3 points have been scored by each player, scores are equal<br />

“advantage” for the leading player: at least 3 points have been scored by each player and one player has one point more

Table 5.1: Description of the Tennis Counting Terminology

Facts

We will now explain which low-level facts are deduced by the event manager from the elementary events and stored in the knowledge base. The reason why we perform the deduction of the low-level facts at this level, in the event manager, is that it substantially facilitates the planning domain design. Working with the elementary events in the planning domain would be quite cumbersome and unsuitable if we want to achieve reasonable latency. As we have already mentioned, the state of the game and the score are maintained in the event manager, thus the respective facts are also stored in the knowledge base. While the knowledge base contains only the current state of the game, it contains all facts that describe the score from the beginning of the game. To distinguish between individual score facts and to rank them, we introduce the concept of score generations, i.e., the first score fact has generation 0, the second score fact has generation 1, etc. From consecutive score facts we can deduce, e.g., whether a player has lost the lead or equalized. (Let us note that the concept of generations is often used in computer science to distinguish among data that originate at consecutive steps of an algorithm.)

Rally Snapshots<br />

All events that occur in the tennis game are partitioned into so-called rally snapshots. We will now describe which low-level facts are derived from a rally snapshot and stored in the knowledge base. Each rally snapshot has a generation that is defined similarly to the score generation. (Let us note that the rally generation and the score generation differ in general since, e.g., the first fault is a rally without a score change.) The low-level facts are deduced for each rally snapshot and stored in the knowledge base. If the planner is triggered in the middle of a rally, the knowledge base contains only the facts deduced from the elementary events of the current, incomplete rally snapshot. The following list outlines which specific low-level facts are deduced from a rally snapshot and stored in the knowledge base:

• how many times the ball crossed the net

• a list of the heights at which the ball crossed the net

• a list of pairs (player, shot) ordered from the beginning of the rally to its end

• the position where the last ball that was in the field first bounced

• the position where the last ball that was out bounced

• whether the ball crossed the net before it landed out

• which player missed the last ball

• how many times the serving player had thrown the ball before he served

Table 5.2 contains three examples that show which high-level facts can be deduced from<br />

the low-level facts listed above. Figure 5.10 depicts a hierarchy of facts that shows how<br />

an ace can be deduced.



high-level fact a list of low-level facts

ace the ball crossed the net once, bounced in the field, state - rally finished

lob the ball crossed the net at a high position, bounced at the baseline

drop the ball crossed the net at a low position, bounced at the net

Table 5.2: Example of high-level facts deduced from low-level facts

Figure 5.10: Hierarchy of Facts from which an Ace can be deduced<br />
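The ace rule of Figure 5.10 reduces to a conjunction of three low-level facts; the parameter names in the following sketch are illustrative.

```java
// Sketch of the ace deduction: an ace holds when the serve crossed the net
// exactly once, bounced in the field, and the rally finished on that serve.
public class AceRule {
    public static boolean isAce(int netCrossings, boolean bouncedInField, boolean rallyFinished) {
        return netCrossings == 1 && bouncedInField && rallyFinished;
    }
}
```

The lob and drop rules in Table 5.2 have the same shape: each high-level fact is a conjunction over the rally-snapshot facts listed above, which is what makes them convenient preconditions for the planning domain.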

Comparison to Related Work<br />

The event manager is to some extent similar to STEVE's perception module (see section 2.4) since it also maintains a coherent representation of the state of the world. Our approach is also similar to SceneMaker (see section 3.3), which employs statecharts to control virtual agents, with the difference that while SceneMaker performs, e.g., a pre-defined scene (i.e. a dialogue where utterances are annotated with gestures) at a certain state, we run the planner to generate the scene.

5.3.2 Background Knowledge<br />

The background knowledge about the players and the game is incorporated to produce<br />

commentary when, for instance, there is currently nothing else to comment on. We will<br />

show some examples of background facts that are stored in the knowledge base. The<br />

background knowledge is stored in several static CSV (Comma Separated Values) files<br />

that could be alternatively replaced with a relational database. After the system starts,<br />

all CSV files are read and the background knowledge they contain is transformed into facts that are stored in the knowledge base. Table 5.3 shows some examples of facts that can be deduced from the background knowledge.

Background knowledge Example of a deduced fact

Player’s details A sister of a player is also a tennis professional.<br />

Ranking A player is leading the ATP score.<br />

Style A player is playing risky as usual.<br />

Injury A player has been four times injured recently.<br />

Player’s results A player won two matches in a row.<br />

Tournament details The tournament is played in London on grass.

Table 5.3: Examples of Facts deduced from the Background Knowledge<br />
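Turning one CSV line into facts might look as follows; the column layout ("name,ranking,style") and the Lisp-like fact syntax are assumptions for illustration, not IVAN's actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of transforming one line of a background-knowledge CSV file
// into facts for the knowledge base.
public class BackgroundFacts {
    public static List<String> factsFromCsvLine(String line) {
        String[] cols = line.split(",");
        List<String> facts = new ArrayList<>();
        // hypothetical columns: player name, ATP ranking, playing style
        facts.add("(ranking " + cols[0] + " " + cols[1] + ")");
        facts.add("(style " + cols[0] + " " + cols[2] + ")");
        return facts;
    }
}
```

Since the files are static, this transformation runs once at startup; swapping the CSV reader for a relational database, as suggested above, would only change this loading step.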

5.3.3 Discourse Planner<br />

The discourse planner is responsible for the plan generation, where a plan represents a dialogue. The discourse planner is triggered by the event manager at particular states of the game. It gets all facts from the knowledge base and outputs all possible plans, which are subsequently passed to the output manager. We will describe the input of the planner, the planner itself, and the representation of the planner's output. Let us note that the concept of the dialogue generation has already been described in Chapter 4.

Input<br />

The input of the planner consists of a planning task and a list of facts that describe<br />

the initial state of the world. The planning task is the same all the time, namely, the<br />

compound task “comment”, since the planner decides each time what it should comment<br />

on according to the supplied facts. The list of facts varies and contains all the facts that<br />

are stored in the knowledge base, i.e., it contains the following types of facts:<br />

• the current state of the game<br />

• scores of the game<br />

• rally snapshots<br />

• background knowledge (see section 5.3.2)<br />

• commentators’ (positive, neutral, negative) attitudes to the players<br />

• roles (commentator, expert)<br />

• a question (a fact identifying that there is a question to be answered)



The Planner<br />

We have employed JSHOP (Java Simple Hierarchical Ordered Planner) as the HTN planner that produces the commentary on a tennis game. See section 3.1 for more details on JSHOP. As described above, the planner gets its input in the form of a problem description and outputs all possible plans. How these plans are generated has already been described in detail in Chapter 4. Since JSHOP is an offline planner, we had to modify it to run online. In the following, we will describe what makes JSHOP an offline planner, how we modified it to run online, and how JSHOP could have been employed without modification, since we also considered and implemented this option.

JSHOP as an Offline Planner - The drawback of JSHOP is that it requires the problem description to be generated and compiled prior to running the planner, assuming that the problem description changes whereas the domain description remains the same during the system run. As we can see, there is a costly compilation step before each run of the planner. See section 3.1, where we explained the JSHOP input generation process in detail. Let us also note that the planner does not have its own working memory, in the sense that every time it is run, all facts have to be supplied again.

JSHOP as an Online Planner - We investigated how the problem description Java file is generated from the JSHOP problem file and found a way to bypass the compilation step described above. We have written a universal problem description Java class that is compiled only once and fully replaces the problem description Java file that would be generated by JSHOP, i.e., an instance of this class accepts the discourse planner's problem description as Java objects and serves as the input of JSHOP as if the problem file had been generated by JSHOP. This approach is fast: the plan generation takes only about 50-150 ms.

Alternative Use of JSHOP as an Online Planner - JSHOP can also be used as an online planner without modification. However, this approach is quite costly since the compilation step takes about 1 second each time and also consumes a lot of CPU resources. Figure 5.11 shows the individual steps of this alternative approach, which will be described below. The discourse planner uses its own problem description representation, which is first transformed into the JSHOP problem file (using a special Lisp-like syntax); then the respective Java file is generated and compiled. After that, we make use of a convenient Java feature, namely that one class implementation can be replaced by another at runtime, i.e., one *.class file can be replaced by another during the system run. Thus, at the end of the process depicted in Figure 5.11, we have a *.class representation of the problem description and the planner can be started.

Let us note that we use this approach to compile the domain description once at the beginning, when the system starts. In this case, the process starts with the JSHOP domain file, from which the corresponding Java file is generated, compiled, and replaced at runtime.
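The compile-and-load step can be sketched with the standard javax.tools API; this is an illustrative reconstruction (it requires a JDK, not just a JRE, at runtime), not the thesis's actual code, and the class and method names are invented.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.File;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of generating a Java source file, compiling it at runtime, and
// loading the fresh *.class with a new class loader.
public class RuntimeReloadSketch {
    public static Object compileAndLoad(String className, String source) throws Exception {
        Path dir = Files.createTempDirectory("jshop");
        File src = new File(dir.toFile(), className + ".java");
        try (PrintWriter out = new PrintWriter(src)) {
            out.print(source);
        }

        // requires a JDK; on a plain JRE getSystemJavaCompiler() returns null
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        javac.run(null, null, null, src.getPath());

        // a fresh class loader picks up the newly compiled *.class file
        try (URLClassLoader loader = new URLClassLoader(new URL[]{ dir.toUri().toURL() })) {
            return loader.loadClass(className).getDeclaredConstructor().newInstance();
        }
    }
}
```

Using a fresh class loader per compilation is what effectively "replaces" one *.class by another during the system run, since the old definition simply becomes unreachable.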

Figure 5.11: JSHOP Input Generation Process

Output

The output of the planner is the so-called planning response, which contains: a list of all possible plans, the time when the planner was triggered, and the respective state of the game. Each plan in the list contains: a priority, a semantic token, and a list of planning operators. The semantic tokens are strings that identify plans. For instance, the semantic tokens can be used to avoid repetitions, where we disallow consecutive execution of two plans with the same semantic token. The list of planning operators corresponds to a dialogue, where each planning operator stands for one template (which corresponds to an utterance). Moreover, some facts can also be deduced during the planning process and stored in the knowledge base for the next run of the planner; for instance, statistics that summarize the game (e.g. the number of outs, winning returns, and aces for each player). These facts can then be used, for instance, to generate the commentary on a just finished game.
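The planning response described above can be sketched as a plain data structure; the field names paraphrase the text and are not IVAN's actual identifiers.

```java
import java.util.List;

// Sketch of the planner's output: a planning response bundling all possible
// plans with the trigger time and the game state they were generated for.
public class PlanningResponse {
    public final long triggeredAtMillis;
    public final String gameState;
    public final List<Plan> plans;

    public PlanningResponse(long triggeredAtMillis, String gameState, List<Plan> plans) {
        this.triggeredAtMillis = triggeredAtMillis;
        this.gameState = gameState;
        this.plans = plans;
    }

    public static class Plan {
        public final int priority;
        public final String semanticToken;   // identifies the plan's content
        public final List<String> operators; // each operator maps onto one template

        public Plan(int priority, String semanticToken, List<String> operators) {
            this.priority = priority;
            this.semanticToken = semanticToken;
            this.operators = operators;
        }
    }

    // the repetition rule from the text: never execute two plans with the
    // same semantic token in a row
    public static boolean allowedAfter(Plan candidate, Plan previous) {
        return previous == null || !previous.semanticToken.equals(candidate.semanticToken);
    }
}
```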

5.4 Plan Execution<br />

In this section, we will describe how we execute the plans that are generated by the discourse planner, i.e., how we select the plans that will be executed or, more precisely, in which dialogues the virtual agents will be engaged. Figure 5.12 highlights in color the part of the system that is responsible for the plan execution, and Figure 5.13 shows which part of the dataflow is covered in this section. First, we will describe the template manager, which maps each planning operator of a plan onto a particular utterance that is furthermore annotated with gesture tags. Then, we will describe the avatar manager, which serves as an interface to the Charamel avatar engine. Finally, we will describe the output manager, which is responsible for the plan execution, i.e., it decides which plans will be executed and when.

Figure 5.12: IVAN Architecture - Plan Execution

Figure 5.13: Dataflow - Plan Execution

5.4.1 Template Manager

Let us recall that each plan corresponds to a dialogue, where a plan consists of a list of planning operators (primitive tasks) and each planning operator corresponds to a template that contains a set of possible utterances that can be uttered by a virtual agent. In this section, we will describe how a planning operator is mapped onto a particular utterance that can additionally be annotated with gesture tags. The template manager contains over 220 different templates and maps each planning operator onto a particular template, where each template usually has several slots that are substituted with the parameters of the respective planning operator. Each template contains 1-3 variants of an utterance; which utterance is chosen is decided at random for the sake of higher variability.

Moreover, there are default gesture and facial expression tags in every utterance, since each utterance is more or less bound to a particular situation that is correlated with a certain emotion. The facial expression tags can be, for instance: Smile, Happy, Surprise, Angry, or Sad, with different intensities. The gesture tags can be, for instance: Disagree, DontKnow, Disappointed, Surprise, Oops, or OhYes. Each gesture tag is stored in a so-called gesticon and is mapped onto a set of 1-3 possible gestures that can be directly performed by a virtual agent in a particular situation. Every time the gesticon is queried to find a mapping for a given gesture tag, it chooses one gesture from the corresponding set of possible gestures at random to achieve higher variability.
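A gesticon lookup with random selection can be sketched as follows; the injected Random and any clip names used with it are illustrative, not Charamel's actual clip paths.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the gesticon: each gesture tag maps onto a set of 1-3 concrete
// gesture clips, one of which is chosen at random for variability.
public class Gesticon {
    private final Map<String, List<String>> entries;
    private final Random random;

    public Gesticon(Map<String, List<String>> entries, Random random) {
        this.entries = entries;
        this.random = random;
    }

    /** Resolves a gesture tag to one of its concrete clips, chosen at random. */
    public String lookup(String tag) {
        List<String> clips = entries.get(tag);
        return clips.get(random.nextInt(clips.size()));
    }
}
```

Injecting the Random instance keeps the random choice testable while preserving the variability described above.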

Furthermore, there are two duration tags for each utterance: the first denotes the number of milliseconds needed to utter it with a male voice, and the second is the respective duration for a female voice. These tags can be used to estimate the duration of an utterance in case it is not provided by the text-to-speech engine. Let us note that the gesture and facial expression tags stand only for default values, i.e., they can be filtered out and substituted by other tags generated by other modules.

Example<br />

In the following text, we will show an example of how a planning operator is mapped onto a particular utterance. Imagine that the server has served and the receiver has returned the ball in such a way that the server failed to return it. One planning operator (more precisely, an operator's head) of the generated plan can be, for instance:

briskly_returned_serve ?server ?receiver ?receiver_shot<br />

Here, the first string is the operator's name and the strings that begin with a question mark are variables that are substituted into the slots of a template. The planning operator's head contains three variables: ?server refers to the serving player, ?receiver refers to the receiving player, and ?receiver_shot refers to the type of shot that the receiving player played. There is a corresponding template in the template manager that contains three slots matching the three variables of the planning operator. The template consists of two utterances:



{EmotionSurprise} {ExplainTo} ?receiver surprised ?server with an accurate<br />

?receiver_shot return.<br />

{EmotionSurprise} {Play} ?receiver generated a ?receiver_shot {Look} return<br />

that was out of ?server’s reach.<br />

The facial expression and gesture tags are annotated in curly brackets. The facial<br />

expression tags start with the prefix Emotion whereas all other tags are gesture tags.<br />

Let us assume that: the second utterance has been chosen at random, the variable<br />

substitutions are known, and the respective gesture tags have been chosen from the<br />

gesticon at random. Thus, we get the following substitutions:<br />

?server := Safin<br />

?receiver := Federer<br />

?receiver_shot := forehand<br />

{EmotionSurprise} := $(Emotion,surprise,0.9,500,1000,3000)<br />

{Play} := $(Motion,interaction/bye/bye01,400,500,0,10000,1.5)<br />

{Look} := $(Motion,presentation/look/lookto_right02,400,500,0,1200,0.8)<br />

where the facial expression and gesture tags are mapped onto the avatar engine specific<br />

tags (see the Charamel manual [38] for more details). After we apply the substitutions<br />

we get the following annotated utterance that can be directly sent to the Charamel<br />

avatar engine.<br />

$(Emotion,surprise,0.9,500,1000,3000)<br />

$(Motion,interaction/bye/bye01,400,500,0,10000,1.5)<br />

Federer generated a forehand<br />

$(Motion,presentation/look/lookto_right02,400,500,0,1200,0.8)<br />

return that was out of Safin’s reach.<br />

After a Charamel virtual agent gets this utterance, s/he looks surprised, s/he makes a<br />

hand movement as if s/he played a ball with a tennis racket, and then s/he gazes at the<br />

other virtual agent.<br />
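The slot substitution used in the example can be sketched as follows; the ?slot syntax mirrors the worked example above, while the helper itself is our illustration. Note that longer slot names must be substituted first, so that ?receiver does not clobber the prefix of ?receiver_shot.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of template slot substitution for an utterance variant.
public class TemplateFill {
    public static String fill(String template, Map<String, String> substitutions) {
        // substitute longer slot names first so that ?receiver does not
        // clobber the prefix of ?receiver_shot
        List<String> keys = substitutions.keySet().stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .collect(Collectors.toList());
        String result = template;
        for (String key : keys) {
            result = result.replace(key, substitutions.get(key));
        }
        return result;
    }
}
```

The gesture and facial expression tags in curly brackets can be handled the same way, with the gesticon supplying the avatar-engine-specific replacement strings.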

5.4.2 Avatar Manager<br />

The avatar manager serves as an interface of the Charamel avatar engine. In the fol-<br />

lowing text, we will describe how we have incorporated this module into our system


Chapter 5. Architecture 59<br />

and which functionality it provides. The avatar manager is placed between the output<br />
manager and the Charamel avatar engine. The output manager decides which plan<br />
will be executed, i.e., which utterance will be uttered and when, whereas the Charamel<br />
avatar engine displays the two virtual agents that represent our commentary team and<br />
accepts commands to control their behaviour. Thus, the role of the avatar manager is<br />
to transform commands from the output manager into Charamel-specific commands.<br />
Furthermore, it maintains the state of the dialogue, which can be exploited by the output<br />
manager. An annotated utterance, a gesture, or a facial expression can be sent to the<br />
avatar manager. The dialogue state obtained from the avatar manager comprises:<br />
which virtual agent is currently speaking, how long s/he has<br />
already been speaking, how much time is needed to finish the current utterance, and<br />
which gesture or facial expression was last set for each virtual agent.<br />
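This dialogue-state snapshot can be sketched as a small value type. All field and method names below are hypothetical; the real API is IVAN/Charamel-specific.<br />

```java
// Sketch of the dialogue-state snapshot obtainable from the avatar manager.
// Field and method names are hypothetical illustrations.
public record DialogueState(boolean speaking, String speakerId,
                            long speakingForMs, long remainingMs,
                            String lastGesture, String lastFacialExpression) {

    // Convenience check the output manager can use before sending an utterance.
    public boolean freeToSpeak() {
        return !speaking;
    }
}
```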

Recall that all commands sent to the avatar manager or to the Charamel avatar<br />
engine are sent in a non-blocking manner (i.e., the sender never waits until a<br />
command is completed). Thus, the output manager must first get the current state<br />
of the dialogue and then decide which command to send to the avatar manager. For<br />
instance, if nobody is speaking, it can immediately send an annotated utterance to<br />
the Charamel avatar engine. If somebody is speaking, it knows who is speaking and how<br />
long it will take to finish the current utterance; the output manager can then<br />
decide whether to wait or to send a new utterance right away. For instance, it should wait<br />
if the utterance being uttered will finish within a second. Nonetheless, if<br />
somebody is speaking and the avatar manager gets a command to utter another<br />
utterance, it interrupts the virtual agent that is speaking and starts uttering the<br />
new utterance.<br />

There can be two kinds of interruptions: self-interruption or an interruption by the<br />
other agent. Gaze gestures and interruption utterances (e.g. “Wait!” or “Look!”) are<br />
used to make the interruptions smoother. As we have already stated, the length of an<br />
utterance is stored in the template manager for each template; nevertheless, this length<br />
is not accurate, since the exact length of an utterance depends on the slot substitutions<br />
in the template (e.g. the ?name “Ray” is shorter than “Richard”). Thus, the Charamel<br />
avatar engine is always queried for the actual length of an utterance. However, it can<br />
take up to one second to get the response, so the estimated length stored in the<br />
template manager is used as long as the actual length returned by the Charamel avatar<br />
engine is unknown. A gesture or a facial expression can be sent to the Charamel avatar<br />
engine at any time; a new gesture or facial expression is smoothly interpolated with<br />
the previous one.
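The estimated-vs-actual length fallback can be sketched as follows (hypothetical names; a minimal illustration of the idea, not the IVAN implementation):<br />

```java
// Sketch of the length-fallback idea: use the template's estimated duration
// until the avatar engine reports the actual one. Names are hypothetical.
public class UtteranceTiming {
    private final long estimatedMs; // stored with the template
    private Long actualMs;          // reported later by the engine; null until known

    public UtteranceTiming(long estimatedMs) {
        this.estimatedMs = estimatedMs;
    }

    public void reportActual(long ms) {
        this.actualMs = ms;
    }

    // Remaining speaking time, given how long the utterance has been playing.
    public long remainingMs(long elapsedMs) {
        long total = (actualMs != null) ? actualMs : estimatedMs;
        return Math.max(0, total - elapsedMs);
    }
}
```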



Since the avatar manager communicates with the Charamel avatar engine via a socket<br />
(see the Charamel manual [38]), we have to deal with latency of up to one second,<br />
which can cause unwanted delays in the commentary. Another shortcoming of the<br />
Charamel avatar engine is that a virtual agent that is speaking cannot be interrupted at<br />
a specific position in an utterance, since the exact state of the virtual agent is unknown.<br />
We can only estimate the position in an utterance from the time elapsed since its<br />
beginning. Therefore, we cannot prevent an utterance from being interrupted in the<br />
middle of a word.<br />

5.4.3 Output Manager<br />

The output manager is responsible for plan execution, i.e., it decides in which dia-<br />
logues the virtual agents will be engaged. In the following text, we will explain the<br />
functionality of the output manager in detail. The output manager gets plans from the<br />
discourse planner, chooses one plan to execute, maps planning operators onto templates,<br />
and sends the respective annotated utterances to the avatar manager, which transforms<br />
them into Charamel-specific commands. Thus, the output manager decides which plan<br />
to execute and when. Furthermore, the output manager can interrupt the current plan<br />
and run a new one, while the interrupted plan can be resumed later. The decision when<br />
to interrupt a plan is based on heuristics. Moreover, the output manager keeps a plan<br />
history that prevents repetition, so that one plan is not executed twice in a row.<br />

Decision Loop<br />

The functionality of the output manager is implemented in the decision loop, which<br />
maintains the state of the plan being executed, the stack of candidate plans, and the<br />
plan history. The decision loop consists of the following steps:<br />

1. Try to get new plans.<br />

2. If there are new plans then select one and put it on the stack of candidate plans.<br />

3. Remove old plans from the stack of candidate plans.<br />

4. Get the status of the dialogue engine.<br />

5. In the case that nobody is speaking, we can perform one of the following actions:<br />

• The plan that is being executed continues with the next utterance.<br />

• The plan that has been interrupted starts again.<br />

• The current plan is interrupted by a new one.<br />

• A new plan is started.


Chapter 5. Architecture 61<br />

6. In the case that somebody is speaking and there is a newer plan on the stack of<br />

candidate plans, we decide according to heuristics whether the current plan will<br />

be interrupted or not.<br />
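One iteration of steps 4–6 can be sketched as a pure decision function. This is a deliberately simplified skeleton with hypothetical names and a simple interruption heuristic; the real IVAN loop adds the plan history and further heuristics.<br />

```java
// Skeleton of one iteration of the output manager's decision loop
// (steps 4-6). Names and the interruption heuristic are hypothetical.
public class DecisionLoop {

    public enum Action { CONTINUE_CURRENT, RESUME_INTERRUPTED, INTERRUPT, START_NEW, WAIT }

    public static Action decide(boolean someoneSpeaking, long remainingMs,
                                boolean hasCurrentPlan, boolean hasInterruptedPlan,
                                boolean newerCandidate, int newCandidatePriority,
                                int currentPlanPriority) {
        if (someoneSpeaking) {
            // Step 6: interrupt only for a clearly more important plan,
            // and not when the current utterance is about to finish anyway.
            if (newerCandidate && newCandidatePriority > currentPlanPriority
                    && remainingMs > 1000) {
                return Action.INTERRUPT;
            }
            return Action.WAIT;
        }
        // Step 5: nobody is speaking.
        if (hasCurrentPlan)     return Action.CONTINUE_CURRENT;
        if (hasInterruptedPlan) return Action.RESUME_INTERRUPTED;
        if (newerCandidate)     return Action.START_NEW;
        return Action.WAIT;
    }
}
```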

A plan is selected (at step 2) according to its priority and a least-recently-used<br />
strategy, such that plans with high priority and plans that have not been executed<br />
recently are preferred. To ensure that the stack of candidate plans contains only plans<br />
that are up-to-date (at step 3), we go through the plans and filter out old plans depending<br />
on their semantic tokens. For instance, a plan that contains background facts<br />
(e.g. that the serving player is leading the ATP score) does not age as quickly as a<br />
plan related to an event that happened in the middle of a rally (e.g. when a<br />
player played a smash).<br />
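The selection criterion at step 2 can be sketched as a comparator over priority and last-execution time. The `Plan` fields below are hypothetical; in IVAN the candidates are JSHOP plans.<br />

```java
import java.util.Comparator;
import java.util.List;

// Sketch of step 2: prefer high-priority plans, breaking ties in favour of
// the least recently executed one. Fields are hypothetical.
public class PlanSelector {

    public static class Plan {
        public final String name;
        public final int priority;        // higher = more urgent
        public final long lastExecutedMs; // timestamp of last execution, 0 if never

        public Plan(String name, int priority, long lastExecutedMs) {
            this.name = name;
            this.priority = priority;
            this.lastExecutedMs = lastExecutedMs;
        }
    }

    public static Plan select(List<Plan> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt((Plan p) -> p.priority)
                        .thenComparingLong(p -> -p.lastExecutedMs))
                .orElse(null);
    }
}
```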

Each time the output manager gets new plans, it has to decide on the basis of heuristics<br />
whether to interrupt the current plan and continue with a new one. The output<br />
manager makes use of the state of the dialogue to know the approximate time needed<br />
to finish the current utterance and how long the current plan has already been running.<br />
For instance, the current plan will not be interrupted if it finishes within a<br />
second or if it was started only a moment ago. Interruptions also cannot occur too<br />
often. Depending on their semantic tokens, some plans should be executed as soon as<br />
possible (e.g. a comment referring to an ace) while others can be executed with a<br />
certain delay (e.g. a comment on a player’s background). Furthermore, an interrupted<br />
plan can be run again if it is still up-to-date and was not almost finished when it was<br />
interrupted.


Chapter 6<br />

Discussion<br />

In this chapter, we will compare the IVAN system with ERIC, evaluate our system in<br />
terms of the research aims, and discuss two basic tools (JSHOP and Jess) that can<br />
both be employed to generate affective commentary on a continuous sports event in<br />
real-time.<br />

6.1 Comparison with the ERIC system<br />

In this section, we will compare our system with ERIC (see section 2.1), since ERIC<br />
is most closely related to our work. ERIC is an affective commentary virtual agent<br />
that won GALA 2007 1 as a horse race reporter. The overall goal of ERIC is the same<br />
as ours, with the difference that ERIC is a monologic system employing one<br />
virtual agent, whereas we have employed a presentation team consisting of two virtual<br />
agents to comment on a sports event. Our virtual agents have different roles (TV<br />
commentator, expert) and can have different attitudes to the players (positive, neutral,<br />
negative). A presentation team is believed to be more entertaining for the audience<br />
than a single presenter and enriches the communication strategies, since our virtual<br />
agents can be engaged in dialogues and represent opposing points of view.<br />

ERIC employs an expert system to generate speech, where utterances reflect its<br />
current knowledge state and discourse coherence is ensured by centering theory.<br />
Nevertheless, ERIC may be too reactive, i.e., individual utterances are uttered at<br />
particular knowledge states, and ERIC cannot generate larger contributions. Hence,<br />
we have employed an HTN planner to generate the dialogues, which enabled us to plan<br />
large dialogue contributions, with discourse coherence ensured by the planner.<br />

1 http://hmi.ewi.utwente.nl/gala/finalists 2007/<br />




In contrast to ERIC, we have also implemented the possibility of interruptions, i.e.,<br />
the current discourse can be interrupted if a more important event happens. However,<br />
there is always a certain trade-off between reactivity, i.e., a reactive commentary with<br />
frequent interruptions, and discourse coherence, i.e., a commentary with large and<br />
coherent dialogue contributions that does not comment on every event.<br />

While ERIC uses ALMA to maintain his affective state, we use two methods: one that<br />
generates affective dialogues based on the virtual agents’ attitudes to the players, and<br />
another that maintains the affective state of each virtual agent in the emotion module.<br />
Whereas ALMA might appear to be a “black box”, the generation of affective<br />
dialogues and the simulation of the affective states of our virtual agents are more<br />
transparent: we can adjust the computation of the initial intensities of individual<br />
OCC emotions depending on personality, we can define our own decay function, and<br />
we have full control over the input tags and output of our emotion module. We can also<br />
always say which event has caused a virtual agent’s current emotion, or why a virtual<br />
agent is commenting in a positive or negative way on an event or a player.<br />
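For illustration, the kind of decay function we have full control over could be a simple exponential decay of an OCC emotion’s intensity. The half-life constant below is hypothetical, not taken from IVAN.<br />

```java
// Sketch of an adjustable decay function for an OCC emotion intensity:
// the intensity halves every HALF_LIFE_MS milliseconds (hypothetical value).
public class EmotionDecay {
    static final double HALF_LIFE_MS = 3000.0;

    public static double decayedIntensity(double initial, long elapsedMs) {
        return initial * Math.pow(0.5, elapsedMs / HALF_LIFE_MS);
    }
}
```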

In comparison to ERIC, our virtual agents have gestures that are better synchronized<br />
with speech, use more elaborate idle gestures (provided by Charamel), can gaze at each<br />
other, and can interact with a user via pre-defined questions. Whilst ERIC was designed<br />
to be domain independent and was tested in two different domains, our system has only<br />
been designed to comment on a tennis game; nevertheless, the same architecture can be<br />
used to produce affective commentary in other domains.<br />

6.2 Evaluation in Terms of Research Aims<br />

In this section, we will compare our research aims, listed in section 1.4, with the system<br />
that we have implemented.<br />

Dialogue Planning for Real-time Commentary and Reactivity<br />

We have employed JSHOP as an HTN planner to produce commentary on a contin-<br />
uous sports event in real-time. The motivation to use an HTN planner was to generate<br />
large dialogue contributions and to avoid being too reactive (in the sense described<br />
in section 6.1). It also seemed to be a good strategy for generating dialogues. First,<br />
JSHOP gets all facts that describe the current state of the world and outputs all possible<br />
plans (dialogues). Then, in the decision loop, one plan is selected and executed. The<br />
problem arises when an important event happens in the middle of the execution of a<br />
plan (dialogue) that comments on another event. In this case, our system can either<br />
interrupt the execution of the current plan or wait until the current plan finishes. This



problem would be solved by dynamic replanning, i.e., by modifying the current plan on<br />
the fly. Since JSHOP does not support dynamic replanning, we can only either wait<br />
until the current plan finishes or interrupt it. However, even if JSHOP supported<br />
dynamic replanning, it would not be sufficient, since the Charamel avatar engine does<br />
not indicate its exact state, e.g., we cannot interrupt an utterance at a specific position.<br />
Moreover, if we sent an utterance word by word to the Charamel avatar engine, it<br />
would not be uttered in a coherent way. Thus, the planner would need to work with<br />
whole utterances, which would not be optimal: we would have to wait until the current<br />
utterance had been uttered, and only then continue with an utterance of the modified<br />
plan created by dynamic replanning.

Therefore, there is always a certain trade-off between reactivity and discourse coherence.<br />
We can either interrupt plans (dialogues) often to be reactive, or we can delay the com-<br />
mentary on some events, or even ignore some events, to get large, coherent dialogue<br />
contributions. Nevertheless, we have noticed that real-life tennis commentators do<br />
not comment on every event; when the game is not interesting, they engage<br />
in small talk to amuse the audience by talking about the players’ background.<br />
Thus, we have implemented a compromise that uses heuristics to decide when<br />
to interrupt the discourse. The resulting commentary is partly reactive, but since we<br />
cannot interrupt the discourse too often, our commentary sometimes has delays or does<br />
not consider some events.<br />

There is also always a certain trade-off between reactive commentary that uses short<br />
utterances and elaborate, more detailed commentary that is less reactive. Since we<br />
wanted to produce more interesting and detailed commentary to convey more facts, our<br />
utterances are rather long.<br />

We have assumed that HTN planning is convenient for producing commentary on<br />
sports events that unfold rather slowly (e.g. a live tennis game). However, the test<br />
files provided by GALA 2009 were generated by the Wii 2 software, which produced tennis<br />
games that unfolded more quickly than a standard live tennis game. Hence,<br />
there was a slight mismatch between the input we anticipated and the input we actually<br />
got. Nevertheless, our system was able to produce the commentary even under these<br />
conditions.<br />

The reactivity of the system also partly depends on the response time of the avatar<br />
engine and the speed at which the virtual agents talk. Slightly faster speech and a<br />
lower response time of the avatar engine (which is sometimes up to one second) would<br />
lead to better results in terms of reactivity.<br />

2 http://wii.com/



Behavioural Complexity and Affectivity<br />

Our virtual agents provide affective commentary on a tennis game according to their<br />
(positive, neutral, negative) attitudes to the players and according to the events that<br />
occur during the tennis game. The current affect of a virtual agent is expressed by<br />
dialogue scheme selection, lexical selection, facial expression, and gestures. A user can<br />
recognize which virtual agent is in favour of which player and whether the virtual agent’s<br />
favourite player is doing well or not. For instance, the virtual agent’s facial expression<br />
alone can reveal whether his/her favourite player is leading. The virtual agents<br />
also have gestures synchronized with speech and can interact with a user in the form of<br />
pre-defined questions.<br />

The variability of dialogues is ensured by the planner, which always outputs all possible<br />
plans (dialogues), and by the random selection of utterances and gestures within partic-<br />
ular templates. However, there is always a certain trade-off between a few polished,<br />
suitable, and specific dialogues and a large variety of general dialogues. Since we wanted<br />
to have specific commentary for GALA, we have preferred the first option. Nevertheless,<br />
more variety could be achieved by adding more dialogue schemes and more variants<br />
of utterances and gestures to the respective templates. The dialogue schemes could also<br />
be based on the different types of OCC emotions that are maintained for each virtual<br />
agent in a simple emotion module, which would also increase the variability and<br />
affectivity of the commentary.<br />

We have used two methods to produce affective commentary: one that generates affective<br />
dialogues based on the virtual agents’ attitudes to the players, and another that maintains<br />
the affective state of each virtual agent in the emotion module. Thus, the user can see<br />
which event elicited which emotion and why a virtual agent is commenting in a positive<br />
or negative way.<br />

Generalizability<br />

Although our system was not designed to be domain independent, we will describe be-<br />
low which modifications would be necessary to change the domain. The tennis simulator<br />
would need only a minor modification to simulate any sports event given as an ANVIL<br />
file. We would need to define new states at which the discourse planner is triggered by<br />
the event manager. We would also need to define the snapshots of the world and which<br />
low-level facts would be derived from the respective snapshots. The pre-processing of<br />
the background facts is done in a generic way, so we would only need to provide the<br />
corresponding input CSV files. While the Java code in the discourse planner is domain<br />
independent, the definition of the Hierarchical Task Network in the planning domain<br />
would need to be rewritten, except for the part that concerns the background knowledge<br />
(e.g. injury,



weather). We would also need to add corresponding templates and change some heuris-<br />
tics in the output manager, e.g., to determine under which conditions a plan can be<br />
interrupted. We would also need to define the respective emotion-eliciting conditions in<br />
the emotion module. The avatar manager is domain independent. Thus, the most<br />
complex task would be to rewrite the domain description of the planner and to add the<br />
respective templates.<br />

6.3 Comparison JSHOP vs Jess<br />

In this section, we will compare two approaches, i.e., HTN planning (see section 3.1)<br />
and expert systems (see section 3.2), that can be used to generate a commentary<br />
on a sports event as defined by GALA 2009 (see section 1.2). We will focus on two tools,<br />
namely JSHOP 3, a representative of HTN planners that we have employed<br />
in our system to generate dialogues, and Jess 4, a representative of expert<br />
systems that was used, e.g., in ERIC (see section 2.1) to generate speech. Whereas<br />
HTN planning is well suited to planning larger contributions (e.g. dialogue planning),<br />
expert systems are more suitable for producing shorter comments that reflect the current<br />
state of the world. In the following text, we will compare JSHOP and Jess in terms of<br />
their expressive power, usability, and user-friendliness.<br />

• Variability<br />

Variability is important, e.g., for dialogue planning, since the virtual agents<br />
should not be engaged in the same dialogues all the time. In logistics, it is<br />
also convenient to have more than one way to deliver a package, since not all<br />
paths cost the same; the cheapest path should be chosen, and some paths<br />
can also be dynamically added to or deleted from the domain. The advantage of<br />
planning is that it finds all solutions to a problem, while an expert system outputs<br />
only one. (More precisely, while a planner is backtracking to find all possible plans,<br />
it can try several substitutions of a variable. In contrast, once a<br />
rule fires in an expert system, a variable is substituted and cannot be changed.)<br />
Nevertheless, it is possible to set the random conflict resolution strategy in a rule-based<br />
system, which resembles choosing a plan at random among all possible<br />
plans output by a planner. Thus, variability can be achieved in rule-based<br />
systems to some extent as well.<br />

• Priority<br />

We can assign a cost to each planning operator in the planning domain such that<br />

3 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/<br />

4 Jess (Java Expert System Shell) http://www.jessrules.com/



the cost of a plan is equal to the sum of the costs of all planning operators that<br />
the plan contains. After the planner outputs all possible plans, we can choose the<br />
most or least expensive plan according to our preferences. If the cost corresponds<br />
to the length of a path, we will probably choose the shortest one. If the cost<br />
corresponds to the amount of money that we get when we execute the plan, we<br />
will presumably choose the most profitable plan. In an expert system, we can<br />
assign a salience value to each rule, which specifies how urgently the rule should be<br />
fired; in the case that the salience values of two rules are the same, the current conflict<br />
resolution strategy decides which rule will be fired first. This is how<br />
rule-based systems can prioritize some outcomes. Nevertheless, the use of<br />
salience values should be avoided, since it makes the execution of the rules very<br />
difficult to monitor.<br />
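The cost-based selection described above can be sketched as follows (a minimal illustration with a hypothetical plan representation; JSHOP reports operator costs with each plan it finds):<br />

```java
import java.util.Comparator;
import java.util.List;

// Sketch of choosing among complete plans by summed operator cost.
// A plan is represented simply as its list of operator costs (hypothetical).
public class PlanCost {

    public static double cost(List<Double> operatorCosts) {
        return operatorCosts.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Pick the cheapest plan, e.g. when cost corresponds to path length.
    public static List<Double> cheapest(List<List<Double>> plans) {
        return plans.stream().min(Comparator.comparingDouble(PlanCost::cost)).orElse(null);
    }
}
```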

• Expressive Power<br />

Jess offers substantially more constructs than JSHOP. We will show two examples<br />
of constructs that are defined in Jess but not in JSHOP, where it would be<br />
advantageous to have them in JSHOP as well. First, JSHOP does not support<br />
unordered facts; thus, if we want to work with only one slot of a fact, we have to<br />
consider all its slots, since JSHOP supports only ordered facts. Second, it is quite<br />
cumbersome to count the number of facts that match a certain condition in JSHOP,<br />
although this can be worked around by recursion. In Jess, this task can be solved<br />
intuitively using the accumulate construct.<br />
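The task in the second example — counting facts that match a condition, which Jess’s accumulate expresses declaratively and JSHOP must encode via recursion — corresponds to the following in plain Java (an analogy with hypothetical names, not Jess or JSHOP code):<br />

```java
import java.util.List;

// Java analogue of counting facts that match a condition. Jess's accumulate
// construct does this declaratively over the fact base; JSHOP would need a
// recursive encoding over ordered facts.
public class FactCounter {

    public record Shot(String player, String type) {}

    public static long countForehands(List<Shot> facts, String player) {
        return facts.stream()
                .filter(s -> s.player().equals(player) && s.type().equals("forehand"))
                .count();
    }
}
```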

• Online vs Offline Execution<br />

We have already pointed out that JSHOP runs offline (see Figure 3.3). Thus, after<br />
any change in the domain or problem file, the respective Java file has to be<br />
generated and compiled before the planner can actually be run. In contrast,<br />
Jess runs online, i.e., after the Jess rule-based engine is initialized, it<br />
can be run several times, while facts and rules can be added to or retracted from its<br />
fact base in the meantime.<br />

• Development Environment<br />

Jess can be better integrated into a development environment than JSHOP, since<br />
there is a plugin that integrates Jess into the Eclipse IDE 5, which facilitates de-<br />
velopment, e.g., it offers a Jess editor that highlights the Lisp-like Jess syntax<br />
and marks errors. In comparison, JSHOP is provided as a Java library;<br />
nevertheless, the input JSHOP files can be edited as text files in the Eclipse IDE as<br />
well.<br />

5 http://www.eclipse.org/


Chapter 7<br />

Conclusion<br />

7.1 Summary<br />

In this thesis, we have presented the architecture of the IVAN system (Intelligent In-<br />
teractive Virtual Agent Narrators), which generates affective commentary on a tennis<br />
game in real-time, where the input is given as an annotated video provided by GALA<br />
2009. A demo version of the IVAN system was accepted for GALA 2009 1, which<br />
was part of the 9th International Conference on Intelligent Virtual Agents (IVA) 2.<br />
The system employs two virtual agents with different attitudes to the players that are<br />
engaged in dialogues to comment on a tennis game. We have focused on the knowledge<br />
processing, dialogue planning, and behaviour control of the virtual agents. Commercial<br />
products have been employed for the audio-visual component of the system.<br />
Most parts of the system are domain dependent; however, the same architecture can<br />
be reused to implement applications such as an interactive tutoring system, a tourist<br />
guide, or a guide for the blind.<br />

The system consists of several modules. We have employed an HTN planner to plan the<br />
dialogues, an expert system to define the appraisals of the emotion-eliciting conditions in<br />
the emotion module, and finite state machines to simulate the basic states of the system.<br />
Our two virtual agents can have positive, neutral, or negative attitudes to the players. The<br />
system uses two methods to generate affective multimodal output. In the first method,<br />
the dialogue schemes in the HTN planner are selected according to the desirability<br />
of particular events for the respective virtual agents. In the second method, the system<br />
maintains the affective state of each virtual agent in the emotion module, according<br />
to the OCC cognitive model of emotions [36], based on the appraisals of the events

1 http://hmi.ewi.utwente.nl/gala/finalists 2009/<br />

2 http://iva09.dfki.de/<br />




that happen in a tennis game. The current affect of the virtual agents is expressed<br />
by lexical selection, facial expression, and gestures. Furthermore, the system integrates<br />
background knowledge about the players and the tournament and allows the user to ask<br />
one of the pre-defined questions at any time.<br />

We have employed JSHOP 3 as an HTN planner to generate dialogues for our two<br />
virtual agents. We have verified that JSHOP can be employed to generate affective<br />
commentary on a continuous sports event in real-time. However, HTN planning is<br />
best suited to generating large dialogue contributions; if the environment changed<br />
rapidly and we wanted to consider most of the events that occur in it, it would be<br />
more appropriate to use an expert system, as in ERIC [10].<br />

7.2 Future Work<br />

In the following paragraphs, we will outline which modifications could be made to im-<br />

prove our system in the future.<br />

EMBR<br />

We could integrate EMBR (A Realtime Animation Engine for Interactive Embodied<br />
Agents) [39]. EMBR has more advanced behaviour control; e.g., it offers more<br />
precise gaze that can express particular emotions, whereas the Charamel virtual agents<br />
(see section 5.1.3) can only turn the head to gaze at the other virtual agent. We did not<br />
employ EMBR since it had not been released at that time and it also offered<br />
only one virtual agent, whereas we needed two distinguishable characters.<br />

Prosody<br />

We could also integrate a prosody module if we had an appropriate TTS engine that<br />
provided the option to set the respective parameters. Then, we could use the<br />
current emotional state of a virtual agent, as simulated by the emotion module (see<br />
section 4.2.3), to set the respective parameters of the TTS engine. We have not implemented<br />
a prosody module since the RealSpeak Solo TTS 4 did not provide the option to change<br />
the respective parameters.<br />

ALMA<br />

We could use ALMA [15] to maintain the emotional state of each virtual agent, since<br />
ALMA, in addition to what our emotion module provides, maintains an emotion history<br />
and performs emotion blending. We could then anticipate smoother transitions between<br />
the individual emotional states of a virtual agent. Nevertheless, we did not employ<br />
ALMA since we wanted to have full control<br />
3 JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/<br />

4 http://www.nuance.com/realspeak/solo/



over the emotion module so that we could, e.g., adjust the computation of the initial<br />
intensities of individual OCC emotions depending on personality and define our<br />
own decay function.<br />

Affect<br />

We could try to base some dialogue schemes on particular OCC emotions output<br />
by our emotion module. In this way, we would get more affective and suitable dialogues.<br />
Nevertheless, it would entail a lot of work, since we would also have to write many<br />
utterances expressing particular emotions. Note that, to work with a<br />
reasonable number of templates, we can have either many general affective dialogues<br />
or many specific dialogues that express particular emotions in a limited way. In our<br />
case, we have chosen the second option; thus, our dialogue schemes are only based on<br />
the virtual agents’ (positive, neutral, negative) attitudes to the players.<br />

We could also base the selection of particular utterances and gestures in the templates<br />
on the current emotional state of a virtual agent, as maintained by our emotion<br />
module. A particular utterance and gesture would be chosen according to the current<br />
emotional state of the virtual agent. The current affect could also, for instance, influence<br />
the velocity of particular gestures. In this way, we would get more affective dialogues.<br />
Nevertheless, we did not implement this feature since it would have required writing<br />
many different affective utterances. We have also assumed that it is sufficient for<br />
the utterances to convey only the virtual agents’ (positive, neutral, negative) attitudes<br />
to the players.<br />

Dynamic Replanning<br />

We could try another planner (e.g. HOTRiDE [40]) that supports dynamic re-<br />
planning, since the only way we can change the plan (dialogue) now is to interrupt<br />
the current plan and start a new one. Nevertheless, dynamic replanning seems<br />
to be quite difficult to implement. One reason why we did not try such a planner is<br />
that the Charamel avatar engine (see section 5.1.3) does not indicate the exact state of<br />
the discourse; thus, such a planner would have to work with whole utterances, which<br />
would not be optimal. The precondition for employing such a planner is an<br />
avatar engine that indicates exactly what has been uttered so far at any point in<br />
time.<br />

Evaluation<br />

A more elaborate evaluation of the system could be done. We could perform an experiment<br />
to find out what a user remembers from the commentary with and without virtual<br />
agents. However, live tennis commentators are usually hidden so that the audience<br />
can concentrate on the tennis game. Though we would in general expect the<br />
commentary with the virtual agents to perform better, it could easily happen that the users



would concentrate more on the video of the tennis game and remember more without<br />
virtual agents, since the virtual agents might distract them. We have<br />
not performed this sort of evaluation since it was not clear how to interpret the possible<br />
results.<br />

We could also compare our commentary with a live commentary. Nevertheless, in com-<br />
parison to our system, real commentators are usually hidden and their commentaries<br />
are not biased. Our system was also partly optimized for GALA 2009 (see section<br />
1.2), which was slightly different from a live tennis game since it used Wii 5 videos of tennis<br />
games. Live tennis commentary is also often very elaborate; thus, our system cannot<br />
compete with such a commentary in terms of variability.<br />

Other Domains<br />

We could reuse the architecture of the system to implement a system in other domains,<br />

e.g., other long-running sports events, interactive tutoring systems, tour guides, or<br />

guides for the blind.<br />

5 http://wii.com/


Appendix A<br />

Commentary Excerpt<br />

The following list shows a commentary excerpt, where C stands for the<br />

commentator and E stands for the tennis expert.<br />

C : “Ladies and Gentlemen. Welcome to the Wimbledon semi-final in doubles.”<br />

E: “We will guide you through the match in which James Blake and Andy Roddick are<br />

playing versus Marat Safin and David Ferrer.”<br />

C : “Enjoy the show!”<br />

C : “The weather is cloudy.”<br />

E: “I hope it won’t be raining.”<br />

C : “Oops, Roddick scored!”<br />

E: “Roddick hits an excellent forehand-volley right into the left corner.”<br />

C : “Roddick has been unbeatable recently.”<br />

E: “What a control by Roddick!”<br />

C : “The score is already 30:0.”<br />

E: “Players Safin and Ferrer are real losers as usual!”<br />

E: “None of the players are playing on their favourite surface.”<br />

C : “It should harm performance of Blake and Rod..”<br />




Interruption<br />

E: “What an unexpected lob!”<br />

C : “It’s a bad idea to try to return a lob at the net.”<br />

C : “The brother of Blake Thom..”<br />

Interruption<br />

C : “Look!”<br />

C : “Blake missed the huge backhand return by Ferrer! ”<br />

C : “The brother of Blake Thomas is also playing tennis.”<br />

E: “His best ranking was in 2002.”<br />

C : “The score is already 30:40.”<br />

E: “No, Blake and Roddick are facing the break point.”<br />

C : “Ferrer had a hard time recently.”<br />

C : “Deuce.”<br />

E: “Great recovery by Blake and Roddick!”<br />

C : “Roddick has been unbeatable recently.”<br />

E: “No doubt he is a genius.”<br />

C : “Off forehand by Blake drifts outside the left sideline.”<br />

E: “Blake just overcooked his forehand.”<br />

E: “Blake concentrates on his serve.”<br />

C : “Roddick has been injured four times since last year.”<br />

E: “It’s amazing how he ...”<br />

Interruption<br />

C : “What a relief!”<br />

E: “Oh, no!”<br />

C : “Tight game, let’s summarize it.”<br />

C : “Safin and Ferrer won the first game.”<br />

E: “That’s unbelievable that they broke opponents’ serve!”



C : “That was quite obvious!”<br />

E: “Safin and Ferrer played below par!”<br />

C : “There were some excellent shots!”<br />

E: “Well there might have been some bright sides.”<br />

C : “Of course there were!”<br />

C : “Did Safin and Ferrer have any difficulties?”<br />

E: “They were already trailing.”<br />

C : “We have seen a nice recovery.”<br />

C : “Let’s see the next game.”<br />

E: “Definitely.”


Bibliography<br />

[1] Justine Cassell, Tim Bickmore, Lee Campbell, Hannes Vilhjalmsson, and Hao Yan.<br />

”Human conversation as a system framework: Designing embodied conversational<br />

agents”. In Embodied Conversational Agents, pages 29–63. MIT Press, Cambridge,<br />

2000.<br />

[2] Jonathan Gratch and Stacy Marsella. Tears and fears: modeling emotions and<br />

emotional behaviors in synthetic agents. In Proceedings of the fifth international<br />

conference on Autonomous agents, pages 278 – 285. ACM Press, Montreal, Quebec,<br />

Canada, 2001.<br />

[3] Jeff Rickel and W. Lewis Johnson. Animated agents for procedural training in<br />

virtual reality: Perception, cognition, and motor control. Applied Artificial<br />

Intelligence, 13:343–382, 1998.<br />

[4] Marc Cavazza, Fred Charles, and Steven J. Mead. Interacting with virtual char-<br />

acters in interactive storytelling. In Proceedings of the first international joint<br />

conference on Autonomous agents and multiagent systems, pages 318–325. ACM<br />

Press, Bologna, Italy, 2002.<br />

[5] Mark Riedl, C.J. Saretto, and R. Michael Young. Managing interaction between<br />

users and agents in a multi-agent storytelling environment. In Proceedings of the 2nd<br />

International Joint Conference on Autonomous Agents and Multi Agent Systems.<br />

Melbourne, 2003.<br />

[6] Elisabeth Andre, Thomas Rist, Susanne van Mulken, Martin Klesen, and Stephan<br />

Baldes. The automated design of believable dialogues for animated presentation<br />

teams. In Embodied Conversational Agents, pages 220–225, Cambridge, 2000. MIT<br />

Press.<br />

[7] Elisabeth Andre and Thomas Rist. Presenting through performing: On the use of<br />

multiple Life-Like characters in Knowledge-Based presentation systems. In 2000<br />

International Conference on Intelligent User Interfaces, pages 1–8. ACM Press,<br />

New York, 2000.<br />




[8] Elisabeth Andre, Thomas Rist, and Jochen Muller. Integrating reactive and scripted<br />

behaviors in a Life-Like presentation agent. In Proceedings of the Second Inter-<br />

national Conference on Autonomous Agents (Agents 1998), pages 261–268. ACM<br />

Press, New York, 1998.<br />

[9] Elisabeth Andre, Kim Binsted, Kumiko Tanaka-Ishii, Sean Luke, Gerd Herzog,<br />

and Thomas Rist. Three RoboCup simulation league commentator systems. AI<br />

Magazine, 22:57–66, 2000.<br />

[10] Martin Strauss and Michael Kipp. ERIC: a generic rule-based framework for an<br />

affective embodied commentary agent. 2007.<br />

[11] Francois L. A. Knoppel, Almer S. Tigelaar, Danny Oude Bos, Thijs Alofs, and<br />

Zsofia Ruttkay. Trackside DEIRA: a dynamic engaging intelligent reporter agent.<br />

In Proceedings of the 7th international joint conference on Autonomous agents and<br />

multiagent systems (AAMAS). Portugal, 2008.<br />

[12] Michael Kipp. ANVIL: a generic annotation tool for multimodal dialogue. In<br />

Proceedings of Eurospeech, pages 1367–1370, Aalborg, 2001.<br />

[13] Ivan Gregor, Michael Kipp, and Jan Miksatko. IVAN: intelligent interactive virtual<br />

agent narrators. In Proceedings of the 9th International Conference on Intelligent<br />

Virtual Agents (IVA-09), pages 560–561. Springer, Amsterdam, 2009.<br />

[14] Martin Strauss. Realtime generation of multimodal affective sports commentary<br />

for embodied agents, 2007.<br />

[15] Patrick Gebhard. ALMA - a layered model of affect. In Proceedings of the Fourth In-<br />

ternational Joint Conference on Autonomous Agents and Multiagent Systems (AA-<br />

MAS 05), pages 29–36. Utrecht, 2005.<br />

[16] Lewis R. Goldberg. An alternative description of personality: The Big-Five fac-<br />

tor structure. In Journal of Personality and Social Psychology, volume 59, pages<br />

1216–1229. 1990.<br />

[17] Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Centering: A frame-<br />

work for modeling the local coherence of discourse. In Computational Linguistics,<br />

volume 21, pages 203–225. 1995.<br />

[18] Ionut Damian, Kathrin Janowski, and Dominik Sollfrank. Spectators, a joy to<br />

watch. In Proceedings of the 9th International Conference on Intelligent Virtual<br />

Agents (IVA-09), pages 558–559. Springer, Amsterdam, 2009.



[19] Elisabeth Andre and Thomas Rist. Controlling the behavior of animated pre-<br />

sentation agents in the interface: Scripting versus instructing. In AI Magazine,<br />

volume 22, pages 53–66. AAAI Press, 2001.<br />

[20] Elisabeth Andre, Gerd Herzog, and Thomas Rist. Generating multimedia presen-<br />

tations for RoboCup soccer games. In RoboCup-97: Robot Soccer World Cup I<br />

(Lecture Notes in Computer Science). Springer, 1998.<br />

[21] Dana Nau, Tsz-Chiu Au, Okhtay Ilghami, Ugur Kuter, Hector Munoz-Avila,<br />

J. William Murdock, Dan Wu, and Fusun Yaman. Applications of SHOP and<br />

SHOP2, 2004.<br />

[22] Richard Fikes and Nils Nilsson. STRIPS: a new approach to the application of<br />

theorem proving to problem solving. In Artificial Intelligence, volume 2, pages<br />

189–208. 1971.<br />

[23] Dana S. Nau, Stephen J. J. Smith, and Kutluhan Erol. Control strategies in HTN<br />

planning: Theory versus practice. In AAAI-98/IAAI-98 Proceedings, pages 1127–<br />

1133. 1998.<br />

[24] Dana Nau, Hector Munoz-Avila, Yue Cao, Amnon Lotem, and Steven Mitchell.<br />

Total-Order planning with partially ordered subtasks. In Proceedings of the Sev-<br />

enteenth International Joint Converence on Artificial Intelligence (IJCAI-2001).<br />

Seattle, 2001.<br />

[25] Dana Nau, Yue Cao, Amnon Lotem, and Hector Munoz-Avila. SHOP: simple hier-<br />

archical ordered planner. In International Joint Conference on Artificial Intelligence<br />

(IJCAI-99), pages 968–973, Stockholm, 1999.<br />

[26] Okhtay Ilghami and Dana S. Nau. A general approach to synthesize Problem-<br />

Specific planners, 2003.<br />

[27] Okhtay Ilghami. Documentation for JSHOP2. 2006.<br />

[28] Gary Riley. CLIPS: a tool for building expert systems, 2008. URL http:<br />

//clipsrules.sourceforge.net/.<br />

[29] Ernest Friedman-Hill. Jess, the rule engine for the java platform, 2009. URL<br />

http://www.jessrules.com/.<br />

[30] Patrick Gebhard, Michael Kipp, Martin Klesen, and Thomas Rist. Authoring scenes<br />

for adaptive, interactive performances. In Proceedings of the Second International<br />

Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-03),<br />

pages 725–732. ACM Press, New York, 2003.



[31] Martin Klesen, Michael Kipp, Patrick Gebhard, and Thomas Rist. Staging exhibi-<br />

tions: Methods and tools for modelling narrative structure to produce interactive<br />

performances with virtual actors. In Virtual Reality. Special Issue on Storytelling<br />

in Virtual Environments, volume 7, pages 17–29. Springer-Verlag, 2003.<br />

[32] Norbert Reithinger, Patrick Gebhard, Markus Lockelt, Alassane Ndiaye, Norbert<br />

Pfleger, and Martin Klesen. VirtualHuman: Dialogic and affective interaction with<br />

virtual characters. In Proceedings of the 8th International Conference on Multimodal<br />

Interfaces (ICMI’06), pages 51–58. Canada, 2006.<br />

[33] Patrick Gebhard, Marc Schroder, Marcela Charfuelan, Christoph Endres, Michael<br />

Kipp, Sathish Pammi, Martin Rumpler, and Oytun Turk. IDEAS4Games: building<br />

expressive virtual characters for computer games. In Proceedings of the 8th Interna-<br />

tional Conference on Intelligent Virtual Agents (IVA’08), pages 426–440. Springer,<br />

2008.<br />

[34] Patrick Gebhard and Susanne Karsten. On-Site evaluation of the interactive CO-<br />

HIBIT museum exhibit. In Proceedings of the 9th International Conference on<br />

Intelligent Virtual Agents (IVA-09), pages 174–180. Springer, Amsterdam, 2009.<br />

[35] Michael Kipp, Kerstin H. Kipp, Alassane Ndiaye, and Patrick Gebhard. Evaluating<br />

the tangible interface and virtual characters in the interactive COHIBIT exhibit,<br />

2006.<br />

[36] Andrew Ortony, Gerald L. Clore, and Allan Collins. The Cognitive Structure of<br />

Emotions. Cambridge University Press, 1988.<br />

[37] Christoph Bartneck. Integrating the OCC model of emotions in embodied charac-<br />

ters. In Proceedings of the Workshop on Virtual Conversational Characters: Appli-<br />

cations, Methods, and Research Challenges. Melbourne, 2002.<br />

[38] Alexander Reinecke, Christian Dold, and Thomas Koch. Charamel Avatar Player<br />

Interface. 2009.<br />

[39] Alexis Heloir and Michael Kipp. EMBR - a realtime animation engine for interactive<br />

embodied agents. In Proceedings of the 9th International Conference on Intelligent<br />

Virtual Agents (IVA-09), pages 393–404. Springer, Amsterdam, 2009.<br />

[40] N. Fazil Ayan, Ugur Kuter, Fusun Yaman, and Robert P. Goldman. HOTRiDE:<br />

hierarchical ordered task replanning in dynamic environments. In Proceedings of<br />

the ICAPS-07 Workshop on Planning and Plan Execution for Real-World Systems<br />

- Principles and Practices for Planning in Execution. Providence, Rhode Island,<br />

USA, 2007.
