
Architecture of an Animation System for Human Characters

T. Pejša* and I.S. Pandžić*
* University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia
(tomislav.pejsa, igor.pandzic)@fer.hr

Abstract—Virtual human characters are found in a broad range of applications, from movies, games and networked virtual environments to teleconferencing and tutoring applications. Such applications are available on a variety of platforms, from desktop and web to mobile devices. High-quality animation is an essential prerequisite for realistic and believable virtual characters. Though researchers and application developers have ample animation techniques for virtual characters at their disposal, implementing these techniques in an existing application tends to be a daunting and time-consuming task. In this paper we present visage|SDK, a versatile framework for real-time character animation based on the MPEG-4 FBA standard. It offers a wide spectrum of features, including animation playback, lip synchronization and facial motion tracking, while facilitating rapid production of art assets and easy integration with existing graphics engines.

I. INTRODUCTION

Virtual characters have long been a staple of the entertainment industry – namely, motion pictures and electronic games – but in more recent times they have also found application in numerous other areas, such as education, communications, healthcare and business, where they appear in the roles of avatars, virtual tutors, assistants, companions etc. A category of virtual characters that has been an exceptionally active topic of research is embodied conversational agents (ECAs), characters that interact with real humans in direct, face-to-face conversations.

Virtual character applications are of great potential interest to the field of telecommunications. Well-articulated human characters are a common feature in networked virtual environments such as Second Life, Google Lively and World of Warcraft, where they appear in the roles of user avatars and non-player characters (NPCs). A potential use of virtual characters is in video conferences, where digital avatars can be used to replace video streams of human participants and thus conserve bandwidth. Until recently virtual characters have been almost exclusive to desktop and browser-based network applications, but the growing processing power of mobile platforms now allows their use in mobile applications as well.

These developments have resulted in increasing demand for high-quality visual simulation of virtual humans. This visual simulation consists of two aspects – the graphical model and animation. The latter encompasses body animation (locomotion, gestures) and facial animation (expressions, lip movements, facial gestures). While many open-source and proprietary rendering solutions deliver excellent graphical quality, their animation functionality, particularly facial animation, is often limited. Moreover, they often offer limited or no tools for production of characters and animations, requiring the user to invest a great deal of effort into setting up a suitable art pipeline.

Our system seeks to address this by delivering greater animation capabilities, while being general enough to work with any 3D engine, thus facilitating development of applications with cutting-edge visuals. Our principal contributions are:
- the design of a character animation system architecture that supports advanced animation features and provides tools for production of new character animation assets with minimal expenditure of time and effort
- a model for decoupling animation, asset production and rendering to enable fast and easy integration of the system with different graphics engines and application frameworks

Facial motion tracking, lip synchronization and other advanced features make visage|SDK especially suited for applications such as ECAs and low-bandwidth video communications. Due to the simplicity of art asset production, our system is ideal for researchers with limited resources at their disposal.

We begin with a brief summary of related work and continue with an overview of our system's features, followed by a description of the underlying architecture. Finally, we discuss our future work and planned improvements to the system.

II. RELATED WORK

Though virtual characters have been a highly active area of research for years, little effort has been made to propose a system which would integrate various aspects of their visual simulation and be easily usable in combination with different graphics engines and for a broad range of applications.

The most recent and ambitious effort is SmartBody, a modular system for animation and behavior modeling of ECAs [1]. SmartBody sports more advanced low-level animation than visage|SDK, featuring hierarchies of customizable, scheduled controllers. SmartBody also supports behavior modeling through Behavior Markup Language (BML) scripts [2]. However, SmartBody lacks some of visage|SDK's integrated functionality, namely face tracking, lip sync and visual text-to-speech, and has no built-in capabilities for character model production. It also features a less common method of interfacing with the renderer – namely, via TCP – whereas visage|SDK is statically or dynamically linked with the main engine.

The new visage|SDK system builds upon the earlier visage framework for facial animation [3], introducing new features such as body animation support and facial motion tracking. It also greatly enhances integration capabilities by enabling easy integration into other graphics engines.

Engines for simulations and electronic games typically have modular and extensible architectures, and it is common for such engines to feature third-party components. Companies such as Havok and NaturalMotion even specialize in developing modular animation and physics systems intended to be integrated into existing architectures. These architectural concepts are commonly found in non-academic literature on graphics engine design, and we found such resources to be very suitable references during the development of our system [13] [14] [15].

III. FEATURES

visage|SDK includes the following core features:
- animation playback
- lip synchronization
- visual text-to-speech (VTTS) conversion
- facial motion tracking from video

In addition to these, visage|SDK also includes functionality for automatic off-line production of character models and their preparation for real-time animation:
- face model generation from photographs
- morph target cloning

This functionality can be integrated into the user's own applications, and it is also available as full-featured stand-alone tools or plug-ins for 3D modeling software.

A. Animation playback

The visage|SDK animation system is based on the MPEG-4 Face and Body Animation (FBA) standard [4] [5], which defines a set of animation parameters (FBAPs) needed for detailed and efficient animation of virtual humans. These parameters can be divided into the following categories:
- body animation parameters (BAPs) – these parameters control individual degrees of freedom (DOFs) of the character's skeleton (e.g., r_shoulder_abduct)
- low-level facial animation parameters (FAPs) – these control movements of individual facial features (e.g., open_jaw or raise_l_i_eyebrow; see Fig. 1)
- expression – a high-level FAP which controls the facial expression (e.g., joy or sadness)
- viseme – a high-level FAP which controls the shape of the lips during speech (e.g., TH or aa)

An animation in MPEG-4 FBA is nothing more than a temporal sequence of FBAP value sets. Our system is capable of loading FBA animations from the MPEG-4 standard file format and applying them, frame by frame, to the character model. How each FBAP value is applied to the model depends on the graphics engine – visage|SDK doesn't concern itself with the details of FBAP implementation.
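To make the playback model concrete, the sketch below shows how such a temporal sequence of FBAP value sets might be represented and stepped through in C++. The types FbapFrame and FbaAnimation and their members are illustrative stand-ins, not the actual visage|SDK API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative in-memory form of an MPEG-4 FBA animation:
// a temporal sequence of FBAP value sets, one per frame.
struct FbapFrame {
    std::vector<int> baps;   // body animation parameters (skeletal DOFs)
    std::vector<int> faps;   // low-level facial animation parameters
    int expression = 0;      // high-level FAP: facial expression
    int viseme = 0;          // high-level FAP: lip shape during speech
};

struct FbaAnimation {
    std::vector<FbapFrame> frames;  // e.g. loaded from an .fba file
    double frameRate = 25.0;

    // Frame to apply at time t (seconds); assumes at least one frame
    // and clamps to the last one when t runs past the end.
    const FbapFrame& frameAt(double t) const {
        std::size_t i = static_cast<std::size_t>(t * frameRate);
        return frames[std::min(i, frames.size() - 1)];
    }
};
```

How the values of such a frame are turned into bone rotations and morph target weights is left to the engine-specific wrapper described in Section IV.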

Figure 1: MPEG-4 FBA face, marked with facial definition parameters (FDPs)

Figure 2: Face model imported from FaceGen and animated in visage|SDK

B. Lip synchronization

visage|SDK features a lip sync component for both on-line and off-line applications. The speech signal is analyzed and classified into visemes using neural networks (NNs). A genetic algorithm (GA) is used to automatically train the NNs [6] [8].

Our lip sync implementation is language-independent and has been successfully used with a number of different languages, including English, Croatian, Swedish and Japanese [7].

C. Visual text-to-speech

visage|SDK features a simple visual text-to-speech (VTTS) system based on Microsoft SAPI. It converts the SAPI output into a sequence of FBA visemes [9].

D. Facial motion tracking

The facial motion tracker tracks the facial movements of a real person from a recorded or live video stream. The motion tracking algorithm is based on active appearance models (AAMs) and doesn't require markers or special cameras – a simple, low-cost webcam is sufficient. Tracked motion is encoded as a sequence of FAP values and applied to the virtual character in real time. In addition to this functionality, the facial motion tracker also supports automatic feature detection in static 2D images, which can be used to further automate the process of face model generation from photographs (see next section) [10].

Potential applications of the system include human-computer interaction and teleconferencing, where it can be used to drive 3D avatars with the purpose of replacing video streams of human participants.

E. Face model generation from photos

The face model generator can be used to rapidly generate 3D face models. It takes a collection of orthogonal photographs of the head as input and uses them to deform a generic template face, producing a face model that matches the individual in the photographs [11]. Since the resulting models always have the same topology, the cloner can automatically generate morph targets for facial animation.

F. Facial motion cloning

The cloner copies morph targets from a source face model onto a target model [12]. For arbitrary models it requires the user to map a set of feature points (FDPs) to vertices of the model, though this step can be bypassed if the target model and the source model have identical topologies. The cloner also supports fully automated processing of face models generated by the Singular Inversions FaceGen application (Fig. 2).

IV. ARCHITECTURE

A. Components

visage|SDK has a multi-layered architecture and is composed of the following key components:
- Scene wrapper
- Animation player
- High-level components – lip sync, TTS, face tracker, character model production libraries (face model generator, facial motion cloner)

Figure 3: visage|SDK architecture

The scene wrapper provides a common, renderer-independent interface to the character model in the scene. Its main task is to interpret animation parameter values and apply them to the model. Furthermore, it aggregates information about the character model pertinent to MPEG-4 FBA – most notably mappings of FBAPs to skeleton joint transformations and mesh morph targets. This high-level model data can be loaded from and serialized to an XML-based file format called VCM (Visage Character Model). Finally, the scene wrapper also provides direct access to the model's geometry (meshes and morph targets) and joint transformations, permitting the model production components to work with any model irrespective of the underlying renderer.
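As a rough illustration of what such a renderer-independent interface could look like, the following sketch defines a hypothetical wrapper base class. The method names are illustrative rather than the actual visage|SDK API, and the geometry accessors are deliberately optional, mirroring the fact that they are only needed by the off-line production tools.

```cpp
#include <cstddef>
#include <string>

// Hypothetical renderer-independent character wrapper; a concrete
// subclass translates these calls into the target engine's API.
class SceneWrapper {
public:
    virtual ~SceneWrapper() = default;

    // Apply a joint rotation derived from BAP values to a skeleton bone.
    virtual void setJointRotation(const std::string& joint,
                                  float pitch, float yaw, float roll) = 0;

    // Apply a morph target blend weight derived from FAP values.
    virtual void setMorphWeight(const std::string& target, float weight) = 0;

    // Optional geometry access used by the off-line production tools
    // (cloner, face model generator); may be left unimplemented.
    virtual std::size_t vertexCount(const std::string& mesh) const { return 0; }
    virtual void getVertex(const std::string& mesh, std::size_t index,
                           float& x, float& y, float& z) const {}
    virtual void setVertex(const std::string& mesh, std::size_t index,
                           float x, float y, float z) {}
};
```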

The animation player is the core runtime component of the system, tasked with playing generalized FBA actions. These actions can be animations loaded from MPEG-4 .fba files, but also procedural actions such as gaze following. The animation player can play the actions in its own thread, or it can be updated manually in every frame.

High-level components include lip sync, text-to-speech and the facial motion tracker. They are implemented as FBA actions and are therefore driven by the animation player.
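The sketch below illustrates the idea of a generalized FBA action as a small interface that the player can poll every frame; it reuses the illustrative FbapFrame type from the earlier sketch and is not the actual visage|SDK class.

```cpp
// Hypothetical interface for a generalized FBA action; animation
// playback, lip sync, TTS and the face tracker would implement it.
class FbaAction {
public:
    virtual ~FbaAction() = default;
    // FBAP values this action contributes at time t (seconds).
    virtual FbapFrame evaluate(double t) = 0;
    // True once the action has nothing more to contribute.
    virtual bool finished(double t) const = 0;
};
```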

Character model production components are meant to be used off-line and so they don't interface with the animation player. They access the model's geometry via the common scene wrapper.

B. Integration with a graphics engine

When it comes to integration with graphics engines, visage|SDK is highly flexible and places only minimal requirements on the target engine. The engine should support basic character animation techniques – skeletal animation and mesh morphing – and the engine's API should provide the ability to manually set joint transformations and morph target blend weights. Animation is possible even if some of these requirements aren't met – for example, in the absence of morph target support, a facial bone rig can be used for facial animation.

Minimal integration of the system is a trivial endeavor, amounting to subclassing and implementing a single wrapper class representing the character model. Depending on the desired functionality, certain parts of the wrapper can be left unimplemented – e.g., there is no need to provide geometry access if the developer doesn't plan to use the cloner or face model generation features in their application. The 3D model itself is loaded and handled by the engine, while FBAP mappings and other information pertaining to MPEG-4 FBA are loaded from VCM files. VCM files are tied to visage|SDK rather than the graphics engine, which means they are portable and can be reused for a character model – or even different models with a similar structure – regardless of the underlying renderer. This greatly simplifies model production and reduces the interdependence of the art pipelines.
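The sketch below illustrates what such a minimal integration could look like for a hypothetical engine, reusing the illustrative SceneWrapper base class from the earlier sketch. The engine types and calls shown are placeholders, not a real engine API.

```cpp
#include <string>

// Placeholder types standing in for a real engine's scene objects.
struct EngineBone   { void setRotation(float p, float y, float r) {} };
struct EngineMesh   { void setMorphWeight(const std::string& n, float w) {} };
struct EngineEntity {
    EngineBone* findBone(const std::string& name) { return nullptr; }
    EngineMesh* mesh() { return nullptr; }
};

// Minimal integration: one subclass that forwards wrapper calls to the engine.
class MyEngineWrapper : public SceneWrapper {
public:
    explicit MyEngineWrapper(EngineEntity* entity) : entity_(entity) {}

    void setJointRotation(const std::string& joint,
                          float pitch, float yaw, float roll) override {
        if (EngineBone* bone = entity_->findBone(joint))
            bone->setRotation(pitch, yaw, roll);
    }

    void setMorphWeight(const std::string& target, float weight) override {
        if (EngineMesh* mesh = entity_->mesh())
            mesh->setMorphWeight(target, weight);
    }

    // Geometry access is omitted: it is only needed if the cloner or
    // face model generator are used in the application.
private:
    EngineEntity* entity_;
};
```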

C. Component interactions

A simplified overview of runtime component interactions is illustrated in Fig. 3. The animation process flows in the following manner:
- The application adds actions to the animation player – for example, lip sync coupled with gaze following and a set of simple repeating facial gestures (e.g. blinking).
- The animation player executes the animation loop. From each action it obtains the current frame of animation as a set of FBAP values, blends all the sets together and applies them to the character model via the wrapper.
- The scene wrapper receives the FBAP value set and interprets the values depending on the character's FBAP mappings. Typically, BAPs are converted to Euler angles and applied to bone transformations, while FAPs are interpreted as morph target blend weights.

For the cloner and the face model generator the interactions are even more straightforward, amounting to obtaining and updating the model's geometry via the model wrapper.
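Putting the pieces together, one iteration of the animation loop described above might look roughly as follows, reusing the illustrative FbapFrame, FbaAction and SceneWrapper types from the earlier sketches. The naive averaging blend and the unit conversions are placeholders for whatever the real player does, and the FbapMappings structure is a hypothetical stand-in for the data loaded from a VCM file.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative FBAP mappings, as could be loaded from a VCM file:
// which joint or morph target each parameter index drives.
struct FbapMappings {
    std::map<int, std::string> bapToJoint;  // BAP index -> skeleton joint
    std::map<int, std::string> fapToMorph;  // FAP index -> morph target
};

// One iteration of the animation loop.
void updateFrame(std::vector<std::unique_ptr<FbaAction>>& actions,
                 const FbapMappings& mappings, SceneWrapper& wrapper, double t) {
    // 1. Obtain the current FBAP value set from each active action and
    //    blend them (a naive average; the real blending may differ).
    FbapFrame blended;
    int active = 0;
    for (auto& action : actions) {
        if (action->finished(t)) continue;
        FbapFrame f = action->evaluate(t);
        blended.baps.resize(std::max(blended.baps.size(), f.baps.size()), 0);
        blended.faps.resize(std::max(blended.faps.size(), f.faps.size()), 0);
        for (std::size_t i = 0; i < f.baps.size(); ++i) blended.baps[i] += f.baps[i];
        for (std::size_t i = 0; i < f.faps.size(); ++i) blended.faps[i] += f.faps[i];
        ++active;
    }
    if (active == 0) return;
    for (int& v : blended.baps) v /= active;
    for (int& v : blended.faps) v /= active;

    // 2. Interpret the blended values through the model's FBAP mappings:
    //    BAPs become joint rotations, FAPs become morph target weights.
    //    The numeric conversion factors below are placeholders.
    for (const auto& [bap, joint] : mappings.bapToJoint)
        if (bap >= 0 && bap < static_cast<int>(blended.baps.size()))
            wrapper.setJointRotation(joint, blended.baps[bap] * 1e-5f, 0.0f, 0.0f);
    for (const auto& [fap, morph] : mappings.fapToMorph)
        if (fap >= 0 && fap < static_cast<int>(blended.faps.size()))
            wrapper.setMorphWeight(morph, blended.faps[fap] * 1e-3f);
}
```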

Figure 4: FBAPMapper – an OGRE-based application for mapping animation parameters

D. Art pipeline

As previously indicated, the art pipeline is very flexible. Characters are modeled in 3D modeling applications and exported into the target engine. Naturally, FBAPs need to be mapped to joints and morph targets of the model. This is done using a special plug-in for the 3D modeling application if one is available; otherwise it needs to be handled by a stand-alone application with appropriate 3D format support. For animations the pipeline is similar, and again a plug-in is used for export and import.

We also provide stand-alone face model and morph target production applications that use our production libraries. These applications rely on intermediate file formats (currently VRML or OGRE formats, though support for others will be added in the future) to obtain the model, while results are output via the intermediate format in combination with VCM. Fig. 4 shows a screenshot of a simple application for mapping and testing animation parameters.

V. EXAMPLES

We have so far successfully integrated our system with two open-source rendering engines, with more implementations on the way. The results are presented in this section.

A. OGRE

OGRE [16] is one of the most popular open-source, cross-platform rendering engines. Its features include a powerful object-oriented interface, support for both the OpenGL and Direct3D graphics APIs, a shader-driven architecture, material scripts, hardware-accelerated skeletal animation with manual bone control, hardware-accelerated morph target animation etc. Despite challenges encountered in implementing a wrapper around certain features, we have achieved both face and body animation in OGRE (Fig. 5 and 6).


OGRE is also notable for its extensive art pipeline, supported by exporters for nearly every modeling suite in existence. We initially encountered difficulties in loading complex character models composed of multiple meshes, because basic OGRE doesn't support file formats capable of storing entire scenes. However, this shortcoming is rectified by the community-made DotScene loader plug-in, and a COLLADA loader is also under development by the OGRE community.

B. Irrlicht

Though Irrlicht [17] doesn't boast OGRE's power, it is nonetheless popular for its small size and ease of use. Its main shortcoming with regard to our system is its lack of support for morph target animation. However, we were able to alleviate this by creating a face model with a bone rig and parametrizing it over MPEG-4 FBAPs, with very promising results (see Fig. 8).

Unlike OGRE's art pipeline, which is based on exporter plug-ins for 3D modeling applications, Irrlicht's art pipeline relies on a large number of loaders for various file formats. We found the loader for the Microsoft .x format to be the most suited to our needs and were able to successfully import several character models, both with body and face rigs (Fig. 7).

C. Upcoming implementations

We are concurrently working on integrating visage|SDK with several other engines. These include:
- StudierStube (StbES) [19] – a commercial augmented reality (AR) kit with a 3D renderer and support for character animation
- Horde3D [18] – a lightweight, open-source renderer
- Panda3D – an open-source game engine known for its intuitive Python-based API

Of these we find StbES to be the most promising, as it will enable us to deliver the power of the visage|SDK animation system to mobile platforms and combine it with StbES's extensive AR features.

Figure 5: Lip sync in OGRE

Figure 6: Body animation in OGRE

VI. CONCLUSIONS AND FUTURE WORK

Our system supports a variety of character animation features and facilitates rapid application development and art asset production. Its feature set makes it suitable for research and commercial applications such as embodied agents and avatars in networked virtual environments and telecommunications, while the flexibility of its architecture means it can be used on a variety of platforms, including mobile devices. We have successfully integrated it with popular graphics engines and plan to provide more implementations in the near future, while simultaneously striving to make integration even easier.

Furthermore, we are continually working on enhancing our system with new features. An upcoming major upgrade will introduce a new system for interactive motion control based on parametric motion graphs and add character behavior modeling capabilities via BML. Our goal is to develop a universal and modular system for powerful yet intuitive modeling of character behavior and to continue using it as the backbone of our research into high-level character control and applications involving virtual humans. We plan to release a substantial portion of our system under an open-source license.

ACKNOWLEDGMENT

This work was partly carried out within the research project "Embodied Conversational Agents as interface for networked and mobile services" supported by the Ministry of Science, Education and Sports of the Republic of Croatia. It was also partly supported by Visage Technologies. Integration of visage|SDK with OGRE, Irrlicht and other engines was done by Mile Dogan, Danijel Pobi, Nikola Banko, Luka Šverko and Mario Medvedec, undergraduate students at the Faculty of Electrical Engineering and Computing in Zagreb, Croatia.

Figure 7: Body animation in Irrlicht

Figure 8: Facial animation in Irrlicht

REFERENCES

[1] M. Thiebaux, A.N. Marshall, S. Marsella, M. Kallmann, "SmartBody: behavior realization for embodied conversational agents," in International Conference on Autonomous Agents, 2008, vol. 1, pp. 151-158.
[2] S. Kopp et al., "Towards a common framework for multimodal generation: The behavior markup language," in Intelligent Virtual Agents, 2006, pp. 205-217.
[3] I.S. Pandžić, J. Ahlberg, M. Wzorek, P. Rudol, M. Mošmondor, "Faces everywhere: towards ubiquitous production and delivery of face animation," in International Conference on Mobile and Ubiquitous Multimedia, 2003, pp. 49-55.
[4] I.S. Pandžić, R. Forchheimer, Eds., MPEG-4 Facial Animation – The Standard, Implementations and Applications, John Wiley & Sons, 2002.
[5] ISO/IEC 14496 – MPEG-4 International Standard, Moving Picture Experts Group, www.cselt.it/mpeg
[6] G. Zorić, I.S. Pandžić, "Real-time language independent lip synchronization method using a genetic algorithm," Signal Processing, special issue on Multimodal Human-Computer Interfaces, vol. 86, issue 12, pp. 3644-3656, 2006.
[7] A. Čereković et al., "Towards an embodied conversational agent talking in Croatian," in International Conference on Telecommunications, 2007, pp. 41-47.
[8] G. Zorić, I.S. Pandžić, "A real-time lip sync system using a genetic algorithm for automatic neural network configuration," in IEEE International Conference on Multimedia & Expo, 2005, vol. 6, pp. 1366-1369.
[9] C. Pelachaud, "Visual Text-to-Speech," in MPEG-4 Facial Animation – The Standard, Implementations and Applications, I.S. Pandžić, R. Forchheimer, Eds., John Wiley & Sons, 2002.
[10] G. Fanelli, M. Fratarcangeli, "A non-invasive approach for driving virtual talking heads from real facial movements," in 3DTV Conference, 2007, pp. 1-4.
[11] M. Fratarcangeli, M. Andolfi, K. Stanković, I.S. Pandžić, "Animatable face models from uncalibrated input features," unpublished.
[12] I.S. Pandžić, "Facial Motion Cloning," Graphical Models, vol. 65, issue 6, pp. 385-404, 2003.
[13] D. Eberly, 3D Game Engine Architecture, Morgan Kaufmann, Elsevier, 2005.
[14] S. Zerbst, O. Duvel, 3D Game Engine Programming, Course Technology PTR, 2004.
[15] Havok Physics Animation 6.00 User Guide, Havok, 2008.
[16] OGRE Manual v1.6, 2008, http://www.ogre3d.org/docs/manual/

[17] Nikolaus Gebhardt, Irrlicht Engine 1.5 API Documentation, 2008, http://irrlicht.sourceforge.net/docu/index.html
[18] Nicolas Schulz, Horde3D Documentation, 2009, http://www.horde3d.org/docs/manual.html
[19] Christian Doppler Laboratory, Graz University of Technology, "Handheld augmented reality," 2008, http://studierstube.icg.tugraz.ac.at/handheld_ar/
