
Proceedings of the
3rd Workshop on Multimodal Interfaces for Automotive Applications
(MIAA '11)

February 13, 2011, Palo Alto, CA, USA

organized at the International Conference on Intelligent User Interfaces (IUI '11)

Organizers:
Christoph Endres, German Research Center for Artificial Intelligence (DFKI)
Gerrit Meixner, German Research Center for Artificial Intelligence (DFKI)
Christian Müller, German Research Center for Artificial Intelligence (DFKI)


Preface

Multimodal interaction constitutes a key technology for intelligent user interfaces (IUI). The possibility to control devices and applications in a natural way enables easier access to complex functionality as well as to infotainment content. In recent years, the complexity of on-board and accessory devices, infotainment services, and driver assistance systems in cars has increased enormously. This development emphasizes the need for new concepts for advanced human-machine interfaces that support the seamless, intuitive, and efficient use of this large variety of devices and services.

A modern car already implements hundreds of functions that a user can interact with, in some cases deployed over almost a hundred embedded platforms. These numbers will grow even further for the next generation of high-class vehicles. The growing number of electronic devices integrated into cars also affects the creation of the user interface. The built-in electronic control units are able to provide valuable context information, which needs to be considered for an intelligent management of multimodal interaction inside the car. Sensor information such as vehicle speed, location (using GPS plus gyroscope and accelerometer for greater reliability), and outside temperature allows drawing conclusions about the current driving situation. Furthermore, dialog management needs to keep track of state changes of operating elements like control switches. Access to vehicle functions is also essential in order to initiate desired operations.

The goal of this workshop is to present, discuss, and outline context-aware multimodal interfaces for drivers and car passengers. The ultimate goal is to unify innovative concepts that aim towards a new dimension of ease of use.

The topics of the workshop, with a strong focus on automotive or traffic applications, are:

- speech interfaces for in-car use
- multimodal interaction
- novel multimedia interfaces and in-car entertainment
- user interface issues for assistive functionality
- audio-visual information and entertainment
- information fusion and fission
- CAN bus architectures
- experimental platforms and simulation solutions
- user-centered design applications
- multi-party interaction concepts
- integrated hardware solutions
- car2car and car2X communication
- approaches for the evaluation of novel car user interfaces
- user interfaces for navigation systems
- detection and estimation of user intentions
- novel interactive car applications
- interactive applications for drivers and passengers
- model-driven user interface development


Table of Contents

Flexible and Real-time Scenario Building for Experimental Driving Simulation Studies
George D. Park, R. Wade Allen and Theodore J. Rosenthal .......................................... 1

Contactless Gesture Recognition for Mobile Devices
Heng-Tze Cheng, An Mei Chen, Ashu Razdan and Elliot Buller ...................................... 5

One Application, One User Interface Model, Many Cars: Abstract Interaction Modeling in the Automotive Domain
Mark Poguntke and André Berton .................................................................. 9

A Novel Multimedia Session Management Approach for In-Vehicle Middleware based on DPWS
Michael Eichhorn, Martin Pfannenstein, Rainer Bodendorfer and Eckehard Steinbach .............. 13

"Hands Busy, Eyes Busy": Generating Stories from Sensor Data for Automotive Applications
Joe Reddington, Ehud Reiter, Nava Tintarev, Rolf Black and Annalu Waller ...................... 17

A Novel Taxonomy for Gestural Interaction Techniques: Considerations for Automotive Environments
Adriano Scoditti ............................................................................... 21

Navigating Haystacks at 70 mph: Intelligent Search for Intelligent In-Car Services
Ashweeni K. Beeharee, Sven Laqua and M. Angela Sasse ........................................... 25

Discover Significant Situations for User Interface Adaptations
Sandro Rodriguez Garzon and Kristof Schütt ..................................................... 29

A New Interaction Technique Based on Eye Tracking and Single Switch Scanning Systems
Pradipta Biswas and Pat Langdon ................................................................ 33

Gesture Recognition Exploration using Haartraining and KNN in a 3D Racing Game
Kamlesh Mistry and Li Zhang .................................................................... 37

Model-Based User Interface Development in the Automotive Industry
Moritz Kümmerling and Gerrit Meixner ........................................................... 41

A Robotic Wheelchair using Human Gestures and Scene Contexts
Jin Sun Ju and Eun Yi Kim ...................................................................... 45

MetaBrain: Web Information Extraction and Visualization
João Teixeira, Gabriel Barata and Daniel Gonçalves ............................................. 49

MyDash: The Biometric Digital Dashboard
Shelby S. Darnell, Ignacio Alvarez, Josh I. Ekandem, Damon L. Woodard and Juan E. Gilbert .... 53

Prototyping a Semi-Automatic In-Car Texting Assistant
Christoph Endres, Daniel Braun and Christian Müller ............................................ 57

Multimodal Summarization of Complex Sentences
Naushad UzZaman, Jeffrey P. Bigham and James F. Allen .......................................... 61


Flexible and Real-time Scenario Building for Experimental Driving Simulation Studies

George D. Park, R. Wade Allen, and Theodore J. Rosenthal
Systems Technology, Inc.
13766 Hawthorne Blvd., Hawthorne, CA
georgepark@systemstech.com

ABSTRACT

The applications and cross-disciplinary nature of driving safety require driving simulation software to be sensitive to the requirements and limitations of its users. Provided here is an introduction to the driving simulation software STISIM Drive and its unique approach towards flexible, real-time scenario building for applied experimental driving research. Several key concepts on how a user defines and builds a driving scenario and how the 3D graphics are generated in relation to the driver are discussed, along with the advantages and disadvantages of the STISIM Drive approach. References to previous user applications are provided.

Author Keywords
Driving simulation, scenario design, STISIM Drive.

ACM Classification Keywords
H5.2 Evaluation/methodology. H5.m Miscellaneous.

INTRODUCTION

Real-time, interactive (i.e., human-in-the-loop) driving simulation offers many advantages to the experimental researcher or developer interested in the areas of driving assessment, training, and research. It provides a safe and controlled environment for testing driver behaviors in relation to the independent variable(s) of interest: driver factors (e.g., age, experience, drugs/alcohol/fatigue, mental workload, and deficits related to perception, cognition, or psychomotor function), intervention factors (e.g., education and training programs), environmental factors (e.g., roadway infrastructure design, signage, weather, and traffic), and vehicle/device factors (e.g., controls/handling, dashboard design, warning systems, cell phones, and in-vehicle telematics).

Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


Given the array of applications and the cross-disciplinary nature of driving safety, simulation software needs to be sensitive to the requirements and limitations of its users. Not all users will have the background or the resources for extensive scenario building in virtual environments (VE). In addition, the end product of driving simulation is rarely the simulation itself (e.g., a video game); it is more often a means for assessing the effect of one of the aforementioned independent variables. Therefore, a method of scenario development that is flexible, rapid, and cost-effective is often critical to project success.

Databases for real-time 3D simulation have traditionally been developed in graphics programs as composite 3D models. In essence, a large, predefined virtual world is created for the user to interact with. This approach requires extensive effort and experience with graphics modeling programs to define the details required in driving simulation [1]. More user-friendly scenario building systems may use a "tile-based" system, where the developer pieces together predefined tiles of road (e.g., an intersection or street block) to create the larger virtual world [2]. The end result is a roadway environment not unlike a real-world, coordinate-based map. While this may appear to be an intuitive method of scenario development, it may not be an entirely practical means of scenario design for experimental research.

The purpose of this paper is to provide an introduction to the driving simulation software STISIM Drive and its approach towards flexible, real-time scenario building for applied experimental driving research. STISIM Drive is a PC-based, desktop driving simulator software system that is highly configurable with regard to hardware fidelity (driver displays and controls). Several key concepts on how a user defines and builds a driving scenario and how the 3D graphics are generated in relation to the driver are discussed.

SCENARIO DEFINITION LANGUAGE (SDL)

The Scenario Definition Language (SDL) is a scripting language developed for STISIM Drive to define the scenario events (i.e., what appears and happens) in a particular driving scenario run. The events are defined by ASCII text statements in a simple syntax form:

On Distance, Event, Appear Distance, Parameter1, Parameter2, ... ParameterN

The On Distance is the longitudinal distance (feet or meters, as specified by the user) driven by the driver in relation to the scenario environment at which the event will activate. At the start of a scenario, the driver's vehicle distance is generally set at zero. Event refers to a specific procedure (e.g., a roadway, building, vehicle, or pedestrian). Appear Distance refers to the longitudinal distance (ft or m), relative to the On Distance, at which the event will actually be displayed in the roadway scenery. The Parameters are the specific attributes given to the event (e.g., roadway dimensions, model type, lateral location, speed, timing, etc.). Take, for example, the following SDL statement for displaying a 3D model of a building:

500, Building, 1000, 40, B1

When the driver reaches 500 ft, a building event will be initiated. It will appear 1000 ft ahead of the driver (so technically at 1500 ft from the start of the run). The lateral position will be 40 ft to the right of the center dividing line (Parameter1 for building events). The building model type will be B1, which in the model library is defined as the café (Parameter2).

As shown in the above example, there is a single SDL event statement for each model in a particular scenario. There are also over 50 different available event types for the user to specify. While this may appear cumbersome for complex scenario designs, SDL statements can be arbitrarily arranged, since the program sorts all events according to distance during run initialization. This allows the user to group statements according to meaningful chunks of roadway (e.g., street blocks) and/or categories (e.g., roadway definition, traffic control devices, roadside objects, traffic, etc.) to make relatively efficient global scenario changes.
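As an editorial illustration of this sorting behavior, the sketch below parses a few SDL-style statements and orders them by On Distance before a run. It is a minimal approximation for exposition, not STISIM Drive's actual parser, and apart from the café example quoted above the statements are hypothetical.

# Minimal sketch (not the STISIM Drive parser): read SDL-style statements,
# split them into fields, and sort by On Distance during "run initialization".
sdl_statements = [
    "1200, Vehicle, 300, -12, V3",      # hypothetical oncoming vehicle
    "500, Building, 1000, 40, B1",      # the cafe example from the text
    "0, Roadway, 0, 12, 2",             # hypothetical two-lane road definition
]

def parse_statement(line):
    fields = [f.strip() for f in line.split(",")]
    return {
        "on_distance": float(fields[0]),
        "event": fields[1],
        "appear_distance": float(fields[2]),
        "parameters": fields[3:],
    }

events = sorted((parse_statement(s) for s in sdl_statements),
                key=lambda e: e["on_distance"])

for e in events:
    print(e["on_distance"], e["event"], e["parameters"])

Because the sort happens at initialization, the author of a scenario file is free to group statements by roadway block or by category, as described above.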

Besides 3D model events, there are SDL events that specify crash/violation settings, sound files, weather, data input/output signals, and data collection. Furthermore, the SDL allows the user to define and call subroutines referred to as previously defined events (PDEs), which are combinations of event statements that give a desired composite effect (e.g., buildings grouped around an intersection, traffic streams, vehicle/pedestrian collision events, etc.). Additional details on developing driving scenarios have been reported elsewhere [3].

EVENT TRIGGERING

Due to the inherent variability in driver behaviors and factors that may affect a driver's vehicle speed and steering (e.g., mental workload, age, experience, fatigue, risk perception), the initiation of dynamic 3D models (e.g., vehicles, pedestrians, signal lights) into action in the VE can be a complex process. This is particularly so if the intention is to create critical hazards that require an immediate driver response. For example, Figure 1 provides scenario screenshots of an amber (yellow) light intersection event (top) and a pedestrian crossing event in front of the driver (middle).

Figure 1. STISIM Drive screenshots of amber signal light intersection (top), pedestrian crossing in front of driver (middle), and construction zone (bottom).

STISIM Drive handles dynamic 3D model event triggering in several ways. In most cases, the variability in drivers' speed can be neutralized by triggering events based on headway time (i.e., time-to-collision between the object and driver). However, additional parameters can be set to ensure data integrity: longitudinal distance of the driver (or object) on the road, distance between driver and object, lateral position relationships, signal light changes, driver speed thresholds, and elapsed runtime.
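To make the headway-time idea concrete, the sketch below fires an event when the time-to-collision between driver and object drops below a threshold; the threshold value and the extra distance guard are hypothetical and only illustrate the kind of conditions described above, not STISIM Drive's internal logic.

# Hedged sketch of headway-time (time-to-collision) triggering.
def should_trigger(driver_pos_ft, driver_speed_fps, object_pos_ft,
                   headway_threshold_s=3.0, max_distance_ft=500.0):
    """Return True when the staged object should be set into motion."""
    gap_ft = object_pos_ft - driver_pos_ft
    if gap_ft <= 0 or gap_ft > max_distance_ft:   # extra distance guard (assumed)
        return False
    if driver_speed_fps <= 0:                     # stopped driver never reaches the object
        return False
    time_to_collision_s = gap_ft / driver_speed_fps
    return time_to_collision_s <= headway_threshold_s

# Example: driver at 1000 ft doing 44 ft/s (about 30 mph), pedestrian staged at 1110 ft.
print(should_trigger(1000.0, 44.0, 1110.0))   # True: headway is 2.5 s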

The simulation operator can also manually trigger events during a simulation run. Manually triggered events can comprise singular discrete events (e.g., a sound file or crossing pedestrian) or larger PDE files comprised of an array of static or dynamic 3D models. In effect, the operator can initiate whole sections of a scenario in real time depending on how the driver is behaving. For example, in Figure 1 (bottom), the operator can initiate a complete construction zone layout that places vehicles and tubes onto the road.

PARTIAL VIRTUAL ENVIRONMENT GENERATION

The STISIM Drive method for generating the simulation scenario can be described as partial (or delayed) VE generation, where only a portion of the virtual world is displayed as the driver's vehicle travels down the road. This is the basis of how the simulation is generated and how the driving scenarios are conceptually designed with the SDL.

To illustrate the concept, Figures 2a and 2b both show a vehicle approaching an intersection. In conventional simulation programs using a coordinate map-based system (Figure 2a), continuing straight or turning left/right sends the driver into different sections (A, B, or C) of the virtual world. In STISIM Drive (Figure 2b), continuing straight or turning left/right sends the driver into the same section (B). The reason for this relates back to how scenario events are defined in the SDL. Since the On Distance of an event (in this case Section B) can be specified to occur after the driver reaches a particular road distance, Section B has not been generated yet. Whether turning or not, the driver's longitudinal distance travelled is still accumulating; therefore, Section B will continue to appear in relation to the start of the scenario. Once the On Distance for an event has been reached by the driver, the event is committed to appear in accordance with its specified parameters.

Figure 2. a) Coordinate map-based VE generation. b) Partial VE generation.
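A toy update loop, with assumed names and data, can illustrate this distance-based commitment: events whose On Distance has been passed are committed regardless of the driver's turns, so every participant meets the same serial sequence of roadway sections.

# Toy illustration of partial (delayed) VE generation; not STISIM Drive internals.
# Events become committed purely by accumulated longitudinal distance.
pending = [  # hypothetical events, already sorted by On Distance
    {"on_distance": 500.0, "event": "Building"},
    {"on_distance": 800.0, "event": "Pedestrian"},
]
committed = []

def advance(accumulated_distance_ft):
    """Commit every pending event whose On Distance has been reached."""
    while pending and pending[0]["on_distance"] <= accumulated_distance_ft:
        committed.append(pending.pop(0))

# Distance keeps accumulating whether the driver goes straight or turns,
# so the same events appear in the same order for every driver.
for travelled in (300.0, 600.0, 900.0):
    advance(travelled)
    print(travelled, [e["event"] for e in committed])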

Moving into different sections when turning is not normally problematic, and it is intuitive when designing VEs in a coordinate map context. However, if the goal of the scenario is to measure driver behavior in response to a particular event (e.g., a pedestrian crossing or vehicle pullout in Section B), scenario design becomes problematic. For conventional simulation programs, unintended turning may result in system crashes when the boundaries of the VE are exceeded. Secondly, additional programming is required for Sections A and C even though the driver may not encounter them. To ensure the occurrence of a particular event for measurement, the designer must either artificially preclude vehicle turning, rely on driver compliance, add corresponding events in Sections A and C, or have an operator manually trigger the event once a driver has committed to a particular roadway section. Any of these options, while manageable, is neither parsimonious nor accounts for the inherent unpredictability of human behavior.

The advantages of partial VE generation for experimental driving research are multiple. Since the driver does not experience scenario sections based on a coordinate map system, roadway sections are essentially presented serially. This means all drivers experience the same scenario regardless of turning behaviors. Drivers cannot get disoriented or lost in the VE. Instead, the illusion of turning into different VE sections is created for the driver while roadway events are presented as intended by the researcher. In addition, counterbalancing of scenario events or whole roadway sections (using PDEs) can easily be designed to control for order effects. This method can also reduce the design requirements and development time for a particular scenario.

One of the main limitations of partial VE generation is the inability to simulate specific geography as in a coordinate map-based system. Therefore, studies involving simulation with GPS mapping and navigational tasks are problematic. Additionally, non-realistic route corrections for driver navigational errors are possible. For example, if a driver makes a wrong turn, U-turns into previously presented scenario sections are not handled well, since the program provides only a limited distance of backtracking. The driver is also not able to perform other corrective procedures normally seen in driving, such as three rights to make a left turn and vice versa. Previous system users have overcome some of these obstacles by modifying general program settings and adding a single elaborate large-scale 3D city model [4]. It should be noted that these applications would require considerable 3D modeling resources, since the system was not conceptually designed to function in this manner.

CONCLUSION

The advantages and disadvantages of the partial VE generation approach used by STISIM Drive should be weighed by users during initial study design. The flexibility of scenario design and the relatively simple scripting language (SDL) for building and modifying scenarios make it a very user-defined system that mitigates inherent driver variability. This, in conjunction with flexible hardware options, has enabled the STISIM Drive software approach to be well validated and used in nearly every aspect of driver safety research. This includes driver factor effects: ageing [5, 6], novice drivers [7], traumatic brain injury [8], and pharmaceutical effects [9]; and vehicle and device interactions: in-vehicle information devices [10], cognitive workload effects [11], and collision warning systems [12]. Successful integration of the simulation software with actual vehicle control hardware systems has also been demonstrated for steering [13] and braking systems [14]. Additional information and resources can be found on the software website (www.stisimdrive.com).

REFERENCES

1. Cremer, J., J. Kearney, and Y. Papelis, Driving simulation: Challenges for VR technology. IEEE Computer Graphics and Applications, 1996. 16(5): p. 16-20.
2. Suresh, P. and R.R. Mourant. A tile manager for deploying scenarios in virtual driving environments. In DSC 2005 North America. 2005. Orlando, FL.
3. Park, G.D., T.J. Rosenthal, and B.L. Aponso, Developing driving scenarios for research, training and clinical applications. Advances in Transportation Studies: An International Journal, 2004. 2004 Special Issue.
4. Marcotte, T.D., et al., A multimodal assessment of driving performance in HIV infection. Neurology, 2004. 63: p. 1417-1422.
5. Lee, H.C. The validity of driving simulator to measure on-road driving performance of older drivers. In 24th Conference of Australian Institutes of Transport Research (CAITR). 2002. Sydney, AUS.
6. Park, G.D., et al. Older driver simulator performance in relation to driving habits and DMV records. In 2nd International Conference on Technology and Aging. 2007. Toronto, Canada.
7. Allen, R.W., et al. A PC based simulation system for driver assessment and training. In TRB Annual Meeting. 2005. Washington, D.C.
8. Stern, E.B., et al., Discriminating between brain injured and non-disabled persons: a PC-based interactive driving simulator pilot project. Advances in Transportation Studies: An International Journal, 2004. Special Issue.
9. Kay, G. The effect of Adderall XR and Atomoxetine on simulated driving safety in young adults with ADHD. In 18th Annual US Psychiatric & Mental Health Congress. 2004. Las Vegas, NV.
10. Wang, Y., et al., The validity of driving simulation for assessing differences between in-vehicle informational interfaces: A comparison with field testing. Ergonomics, 2010. 53(3): p. 404-420.
11. Reimer, B., Impact of cognitive task complexity on drivers' visual tunneling. Transportation Research Record, 2009(2138): p. 13-19.
12. Maltz, M. and D. Shinar, Imperfect in-vehicle collision avoidance warning systems can aid drivers. Human Factors, 2004. 46(2): p. 357-366.
13. Eskandarian, A., et al. Development of an active steering control system in a car driving simulator. In SAE World Congress & Exposition. 2009. Detroit, MI.
14. Allen, R.W., et al. A hardware-in-the-loop simulation of braking capability. In DSC 2005 Europe. 2008. Monaco.


Contactless Gesture Recognition for Mobile Devices

Heng-Tze Cheng*
Electrical and Computer Engineering
Carnegie Mellon University
hengtze@cmu.edu

An Mei Chen, Ashu Razdan, Elliot Buller
Office of The Chief Scientist
Qualcomm Incorporated
{anc, arazdan, ebuller}@qualcomm.com

ABSTRACT

While gesture interfaces are becoming pervasive, most existing approaches are undesirable for mobile devices because of their high power consumption or the inconvenience of requiring users to wear or hold specific sensors. In this paper, we present a contactless gesture recognition system for mobile devices using proximity sensors. A set of infrared signal feature extraction methods and a decision-tree-based gesture classifier are proposed. The system allows a user to interact with mobile devices using intuitive gestures, without touching the screen or wearing/holding any additional device. Evaluation results show that the system is low-power and able to recognize gestures with over 98% precision in real time.

Author Keywords
Gesture recognition, proximity sensor, infrared LED

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: User Interfaces—Input devices and strategies

INTRODUCTION

Gesture-based interfaces provide an intuitive way for users to specify commands and interact with computers [6, 8]. As mobile phones and tablets become ubiquitous, there is an increasing need for intuitive user interfaces for small-sized, resource-limited mobile devices.

Most existing gesture recognition systems can be classified into three types: motion-based, touch-based, and vision-based systems. In motion-based systems [11, 4], users cannot make gestures unless they hold a mobile device or an external controller. Touch-based systems [12, 10] can accurately map finger/pen positions and moving directions on the touchscreen to different commands. However, 3D gestures are not supported because all possible gestures are confined to the 2D screen surface.

* This work was done during the author's employment at the Office of The Chief Scientist, Qualcomm Incorporated.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA.

While the first two types of systems require users to make contact with devices, vision-based systems [8, 14] using cameras and computer vision techniques allow users to make intuitive gestures without touching the device. However, most vision-based systems are computationally expensive and power-consuming, which is undesirable for resource-limited mobile devices like tablets or mobile phones.

To address these challenges, we present a contactless gesture recognition system using only two infrared proximity sensors. We propose a set of infrared feature extraction and gesture classification algorithms. Using the system as a gesture interface, a user can flip e-book pages, scroll web pages, zoom in/out, and play games on mobile devices using intuitive hand gestures, without touching, wearing, or holding any additional devices. The design also reduces the frequency of users' contact with devices, alleviating the wear and tear on screen surfaces.

The main contributions of the paper are: 1) the design and evaluation of a contactless gesture recognition system using only two proximity sensors; 2) the proposed infrared (IR) feature set and classifier for real-time gesture classification; 3) reducing the power consumption of gesture recognition.

RELATED WORK

There has been extensive research on vision-based gesture recognition [8, 14], mostly focusing on the detection of hand trajectory. Although such systems can recognize complex gestures, they can be sensitive to background objects, color, and lighting. Robustness can be improved by adding color markers to the user's hand [5], with the tradeoff of the inconvenience of wearing additional gear. Moreover, continuous video recording of a user can make one feel under surveillance and poses a threat to user privacy.

Recently, SideSight [1] proposed an around-device multi-touch interface by placing ten IR sensors on the long edges of a small mobile device. Another related work, HoverFlow [3], used six IR sensors facing the user to capture IR image maps, and then classified gestures using dynamic time warping (DTW). In this work, we reduce the number of required IR sensors to two and thus reduce the power consumption, which is mentioned as a critical issue in [1]. Even with the limited information from only two IR sensors, our system can achieve accurate gesture recognition using the proposed IR feature set and classifier.

Among motion-based systems, one recent work, uWave [4], matches accelerometer data with gesture templates using DTW. Accuracies of 98.6% and 93.5% were achieved with and without template adaptation, respectively, for user-dependent gesture recognition. However, a user needs to hold a device with an accelerometer and press a button to indicate the start and end of a gesture. In this work, we eliminate these limitations with contactless gesture recognition.

Electromyogram-based (EMG-based) systems [2, 13] are another novel way to recognize gesture patterns, using the electrical activity produced by skeletal muscles. However, a user must wear EMG sensors on the wrist at all times to perform gestures, which can be inconvenient and is not suitable for mobile device interfaces.

SYSTEM DESIGN AND METHODS

Design Considerations

Our system is designed based on four considerations: 1) Automatically detect gesture boundaries: a common challenge of gesture recognition is the uncertainty of when a gesture begins or ends. We do not require a user to press a key to indicate the presence of a gesture, since it would be inconvenient to do so. 2) Recognition must be real-time: the gesture interface must be very responsive, so no time-consuming postprocessing is allowed. 3) False alarms need to be minimized: executing a wrong command is generally worse than missing a command. 4) No user-dependent model training process for new users: although supervised learning can optimize performance for a specific user, collecting training data is time-consuming and not desirable for users.

Proximity Sensor Data Acquisition

We now describe each system component shown in Fig. 1. A proximity sensor consists of two IR LEDs and an IR receiver, which are placed underneath a plastic/glass screen surface, surrounded by optical barriers. The LEDs emit IR strobes in turns as two separate channels using time-division multiplexing. When a hand or any object is near, the receiver detects the reflection of the IR light, whose intensity increases as the object distance decreases. The light intensities of the two IR channels are sampled by the firmware at 100 Hz.

Figure 1: The architecture of the gesture recognition system (proximity sensor data, framing, cross-correlation, linear regression, and signal statistics modules feed a temporal dependency computation and a gesture classifier backed by a gesture model and a gesture history database).

Framing

Since the start and end of a gesture are not specified by the user, our program uses a moving window to scan the incoming IR intensity data and decide whether any gesture signature is observed. The data is divided into 50% overlapping frames, each of which is 140 ms long. After framing, three types of features are extracted from each frame.
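A small sketch of such framing, using only the sampling rate and window length stated above (100 Hz, 140 ms frames with 50% overlap), might look as follows; the array names and the synthetic data are illustrative only.

import numpy as np

# Sketch of 50%-overlapping framing of one IR channel (100 Hz, 140 ms frames).
SAMPLE_RATE_HZ = 100
FRAME_LEN = int(0.140 * SAMPLE_RATE_HZ)   # 14 samples per frame
HOP = FRAME_LEN // 2                       # 50% overlap -> hop of 7 samples

def frame_signal(samples):
    """Split a 1-D intensity signal into overlapping frames."""
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        frames.append(samples[start:start + FRAME_LEN])
    return np.array(frames)

# Example with synthetic data standing in for one IR channel.
channel_l = np.random.rand(300)            # 3 s of fake sensor readings
print(frame_signal(channel_l).shape)       # (41, 14)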

Infrared Feature Extraction

Inter-channel Time Delay

This feature measures the pair-wise time delay between the sensor data of the two channels, which shows how a hand approaches the IR LEDs at different instants. This corresponds to different moving directions of the hand (see Fig. 2 for an example). The time delay t_D is calculated by finding the time shift n that yields the maximum cross-correlation value of the two discrete signal sequences f and g:

t_D = \arg\max_{n} \sum_{m=-\infty}^{\infty} f^{*}(m)\, g(m+n)    (1)

Figure 2: An example of proximity sensor data and the extracted features: raw sensor data of channels L and R, the time delay measured by cross-correlation, and the slope measured by linear regression, for three left swipes, three right swipes, and push/pull gestures.
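As a rough illustration of Eq. (1), the following sketch estimates the inter-channel delay of two short frames with a discrete cross-correlation; the synthetic Gaussian bumps are placeholders for the real IR intensities, and the 3-sample shift is an assumed example.

import numpy as np

# Sketch of the inter-channel time-delay feature: Eq. (1) picks the lag n that
# maximizes the cross-correlation of the two channel frames f and g.
def inter_channel_delay(f, g):
    # np.correlate(g, f, "full")[k] = sum_m g(m+k) * f(m), so its arg-max lag
    # corresponds to the n of Eq. (1) for real-valued signals.
    corr = np.correlate(g, f, mode="full")
    lags = np.arange(-(len(f) - 1), len(g))
    return lags[np.argmax(corr)]

# Synthetic 14-sample frames (140 ms at 100 Hz): channel R sees the reflection
# 3 samples (30 ms) after channel L, as for one swipe direction.
channel_l = np.exp(-0.5 * ((np.arange(14) - 5) / 1.5) ** 2)
channel_r = np.roll(channel_l, 3)
print(inter_channel_delay(channel_l, channel_r))   # 3: channel R lags channel L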

Local Sum of Slopes

This feature estimates the local slope of the signal segment within a frame, which shows how fast the user's hand is moving toward or away from the proximity sensors. The slope is calculated by first-order linear regression and then summed with the slopes of the six previous frames. The local sum better captures the continuous trend of the slopes rather than sudden changes.
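A plausible reading of this feature, sketched below with assumed buffering details, fits a first-order line to each frame and keeps a running sum over the current and six previous slopes.

import numpy as np
from collections import deque

# Sketch of the local-sum-of-slopes feature: per-frame linear-regression slope,
# summed over the current frame and the 6 previous ones (window of 7 assumed).
class SlopeSum:
    def __init__(self, history=7):
        self.slopes = deque(maxlen=history)

    def update(self, frame):
        t = np.arange(len(frame))
        slope, _intercept = np.polyfit(t, frame, 1)   # first-order regression
        self.slopes.append(slope)
        return sum(self.slopes)

# Example: intensity rising across frames, as when a hand pushes toward the sensor.
tracker = SlopeSum()
for i in range(8):
    frame = 1000.0 * i + 50.0 * np.arange(14)         # synthetic rising frame
    print(round(tracker.update(frame), 1))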

Signal Statistics

These features are the mean and variance of the raw sensor data. A high variance can be observed when a gesture is present; conversely, when there is no hand present or a hand is merely hovering above, a low variance is observed.

Gesture Recognition Algorithm

After feature extraction, a decision-tree classifier, shown in Fig. 3, is adopted to classify the frame as one of the gestures in the predefined gesture model, or to report that no gesture is detected. We also keep a history of 7 frames to take the temporal dependency between consecutive frames into consideration. For example, when a gesture is detected, the system suppresses the output of the same gesture for 6 frames, because it is hard for a user to make the same gesture again very quickly. Once the gesture sequence history of a user is obtained, the transition probability between gestures can also be incorporated to improve the recognition accuracy.

Figure 3: Illustration of the decision-tree-based gesture classifier (frames with low variance yield no gesture; otherwise the inter-channel time delay separates left and right swipes, and the local sum of slopes separates push, pull, and no gesture).
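Read together with Fig. 3, the classification step can be sketched roughly as below; the threshold values, the sign convention for the swipe direction, and the suppression window are assumptions made for illustration rather than the paper's tuned parameters.

# Hedged sketch of the decision-tree classifier in Fig. 3 (thresholds and sign
# conventions are illustrative assumptions, not the paper's empirical values).
VAR_THRESHOLD = 1e4
DELAY_THRESHOLD = 2        # samples
SLOPE_THRESHOLD = 200.0

def classify_frame(variance, time_delay, slope_sum):
    if variance < VAR_THRESHOLD:
        return "no gesture"                      # hand absent or merely hovering
    if abs(time_delay) > DELAY_THRESHOLD:        # one channel clearly lags the other
        return "right swipe" if time_delay > 0 else "left swipe"   # assumed mapping
    if slope_sum > SLOPE_THRESHOLD:
        return "push"                            # intensity rising: hand approaching
    if slope_sum < -SLOPE_THRESHOLD:
        return "pull"                            # intensity falling: hand receding
    return "no gesture"

print(classify_frame(variance=5e4, time_delay=3, slope_sum=10.0))    # right swipe
print(classify_frame(variance=5e4, time_delay=0, slope_sum=350.0))   # push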

IMPLEMENTATION

We implemented the prototype system using the Silicon Labs Si1120 infrared proximity sensor [9]. The sensor data were transmitted to a laptop through a USB serial port. The feature extraction and gesture recognition algorithms were implemented in C++. The window sizes and thresholds were set empirically through experiments to minimize the false alarm rate of the system. A picture of the prototype system and a subject performing a gesture is shown in Fig. 4.

Figure 4: A subject performing a left-swipe gesture using the prototype sensor board (IR LED channels L and R and the IR receiver, connected via a USB port).

EVALUATION

We define four essential gestures for evaluation: left swipe, right swipe, push (hand moving vertically down toward the device), and pull (hand moving vertically up away from the device). The system is evaluated on a gesture dataset collected from five subjects, including four right-handed and one left-handed user. Their ages span from the 20s to the 40s, and one of them is female. The dataset consists of 2,000 gesture samples in total, with each user performing each of the four gestures 100 times.

Recognition Performance

We use the widely used precision/recall metrics to evaluate the recognition performance:

precision = \frac{TP}{TP + FP}    (2)

recall = \frac{TP}{TP + FN}    (3)

where TP, FP, and FN refer to true positives, false positives, and false negatives. As shown in Fig. 5, the system achieved 98% precision on average and is robust from user to user. The high precision implies a low false alarm rate, which is ideal for gesture recognition because executing a wrong command is usually worse than missing a command. The recall rate is lower than the precision because the system can miss gestures when the hand is too far from the sensor, or when a gesture is performed much more slowly than usual.

Figure 5: Precision and recall rates of gesture recognition per user: (a) precision of left/right swipe, (b) recall of left/right swipe, (c) precision of push/pull, (d) recall of push/pull.

User and System Factors

We further design two experiments on user and system factors to evaluate the robustness and limitations of the system.

User-to-Device Distance

First, we evaluate the influence of user-to-device distance on system performance. The distance is measured from the user's hand to the proximity sensors. As shown in Fig. 6, the system can achieve over 80% accuracy when the user's hand is within 3 inches. The effective range can be increased by increasing the power of the IR LEDs, with the tradeoff of a higher power consumption. One can balance this tradeoff according to the system's needs regarding user experience and battery life.

Figure 6: Recognition accuracy vs. hand-to-sensor distance.

Speed of Gesture

Next, we evaluate the system performance when users perform gestures at different speeds. In this experiment, the user listens to a specific tempo given by an electronic metronome; the first beat "tic" indicates the start of a gesture, and the second beat "toc" indicates the end of a gesture. According to our observation, most users naturally make gestures at a speed of 2 to 4 gestures per second. In other words, it usually takes 0.5 to 0.25 seconds for general users to complete a gesture. As shown in Fig. 7, the system achieves over 90% accuracy at general gesture speeds, and also maintains a robust performance of over 80% at very slow (1 gesture per second) or very fast (5 gestures per second) gesture speeds.

Figure 7: Recognition accuracy vs. speed of gesture.

Power Consumption

The system power is dominated by the power consumed by the IR LEDs (P_LED) and the control chip (P_chip):

P_{LED} + P_{chip} = f_{conv} \cdot T_{prx} \cdot (I_{LED} + I_{chip}) \cdot V_{LED}    (4)

where V, I, f_conv, and T_prx denote voltage, current, conversion frequency, and pulse width, respectively. This amounts to only 0.3 mW (idle) to 20 mW (active, when an object is in proximity) [9], much lower than the 200 mW power budget for a typical mobile device user interface as reported in [7].
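Plugging illustrative numbers into Eq. (4) shows how the duty-cycled design keeps the average power low; all values below are hypothetical, not taken from the sensor datasheet, and are chosen only to land in the reported milliwatt range.

# Worked example of Eq. (4) with hypothetical values (not Si1120 datasheet figures):
# average power = conversion frequency * pulse width * (LED + chip current) * voltage.
f_conv_hz = 100        # assumed proximity conversions per second
t_prx_s = 0.0005       # assumed 0.5 ms IR pulse width
i_led_a = 0.100        # assumed 100 mA LED drive current during a pulse
i_chip_a = 0.010       # assumed 10 mA controller current during a pulse
v_led_v = 3.3          # assumed supply voltage

p_avg_w = f_conv_hz * t_prx_s * (i_led_a + i_chip_a) * v_led_v
print(p_avg_w * 1000)  # roughly 18 mW, the same order as the reported active power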

CONCLUSION AND FUTURE WORK

We have presented a contactless gesture recognition system that allows users to make gesture inputs without touching, holding, or wearing any device. Using the proposed IR feature set and classifier, the system can recognize gestures with 98% precision and an 88% recall rate. The low power consumption and high accuracy make the system particularly desirable for deployment on resource-limited mobile consumer devices.

Our future work is to extend the configuration to multiple sensor arrays to obtain more information from the sensor data. Using the basic gesture set as building blocks, we can further recognize more compound 3D gestures as permutations of the simple ones. Hidden Markov models can also be incorporated to learn the gesture sequences performed by users.

REFERENCES

1. A. Butler, S. Izadi, and S. Hodges. SideSight: multi-"touch" interaction around small devices. In Proc. UIST, pages 201-204, 2008.
2. J. Kim, S. Mastnik, and E. André. EMG-based hand gesture recognition for realtime biosignal interfacing. In Proc. IUI, pages 30-39, 2008.
3. S. Kratz and M. Rohs. HoverFlow: exploring around-device interaction with IR distance sensors. In Proc. MobileHCI, pages 42:1-42:4, 2009.
4. J. Liu, L. Zhong, J. Wickramasuriya, and V. Vasudevan. uWave: Accelerometer-based personalized gesture recognition and its applications. Pervasive Mob. Comput., 5(6):657-675, 2009.
5. P. Mistry, P. Maes, and L. Chang. WUW - wear ur world: a wearable gestural interface. In Proc. CHI '09, pages 4111-4116, 2009.
6. S. Mitra and T. Acharya. Gesture recognition: A survey. IEEE Trans. Syst., Man and Cybern., 37(3):311-324, 2007.
7. Y. Neuvo. Cellular phones as embedded systems. In IEEE ISSCC, 2004.
8. V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. PAMI, 19(7):677-695, 1997.
9. Silicon Labs. Proximity/ambient light sensor with PWM output, 2009.
10. W. C. Westerman and J. G. Elias. System and method for packing multi-touch gestures onto a hand, April 2006.
11. A. Wilson and S. Shafer. XWand: UI for intelligent spaces. In Proc. SIGCHI Conf. Human Factors in Comput. Syst., pages 545-552, 2003.
12. J. O. Wobbrock, A. D. Wilson, and Y. Li. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In Proc. ACM UIST, pages 159-168, 2007.
13. X. Zhang et al. Hand gesture recognition and virtual game control based on 3D accelerometer and EMG sensors. In Proc. IUI, pages 401-406, 2009.
14. M. H. Yang, N. Ahuja, and M. Tabb. Extraction of 2D motion trajectories and its application to hand gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell., 24(8):1061-1074, 2002.


One Application, One User Interface Model, Many Cars: Abstract Interaction Modeling in the Automotive Domain

Mark Poguntke
Daimler AG
Wilhelm-Runge-Straße 11, 89081 Ulm
mark.poguntke@daimler.com

André Berton
Daimler AG
Wilhelm-Runge-Straße 11, 89081 Ulm
andre.berton@daimler.com

ABSTRACT

We present an approach for user interface generation based on abstract interaction modeling using UML class and state diagrams. In this way, we enable the flexible enhancement of an automotive infotainment system with new external applications. A main objective is to do this without breaching the requirements resulting from the automotive context, e.g., minimized driver distraction. We achieve consistency with the automotive interaction and design concept by transforming the abstract model to the respective user interface concept, and we illustrate this with two automotive HMI concepts.

Author Keywords
HCI, Interaction Modeling, Abstract Interaction Model, Model-driven User Interface Development.

INTRODUCTION

A typical automotive infotainment system includes navigation, audio and video players as well as a phone application. Often, the only external applications integrated into the system are Bluetooth telephony, external music players, and pre-defined internet services, for example for weather forecasts or points of interest. With current technology, the features provided initially also do not change during the lifetime of a car. Imagine buying a desktop computer and having to use it for ten years without the possibility to install new applications; this is not satisfactory. It is our goal to make automotive systems more flexible and to allow for the integration of new applications at later stages. However, the primary purpose of a car is still to provide a safe means of transportation. This implies a set of specific and very restrictive requirements for the design of the Human-Machine Interface (HMI), especially concerning the use of infotainment applications while driving.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
3rd International Workshop on Multimodal Interfaces for Automotive Applications (MIAA) in conjunction with IUI 2011, Palo Alto, CA, USA.

Minimizing driver distraction is an important requirement, and external applications can only be integrated at a later stage under the provision that this requirement is maintained. To ensure this, control over the interaction and design concept of external applications has to remain on the side of the in-car software.

Different cars and model lines often have different HMI devices and corresponding concepts, e.g., a touch-screen-based HMI or an HMI based on operation with a central control element (CCE), typically used in premium segment cars. Design concepts also differ in screen resolution, screen layout, colors, and styles. In order to seamlessly integrate external applications, we therefore also have to provide solutions for multiple modalities.

We approach the issues of keeping control over the HMI integration of external applications on the one hand, and aiming at flexibility and adaptability to serve different HMI concepts on the other hand. Our approach is based on abstract interaction modeling using the Unified Modeling Language (UML) [4] and XML-based user interface descriptions.

Automotive Requirements

The above-mentioned conditions imply that much care is needed to optimally integrate automotive user interfaces in the car. Many countries also restrict the design and use of automotive infotainment applications through certain regulations, e.g., the European Statement of Principles on HMI for in-vehicle information and communication systems (ESoP) or the guidelines of the Alliance of Automobile Manufacturers (AAM). Compliance with ergonomic standards, e.g., ISO 9241-110, is a particularly desirable goal.

Updating an automotive infotainment system with new applications that include new user interfaces is a critical modification. Unfamiliar user interfaces or inconsistencies with the in-car interaction concept may lead to driver distraction, limitations in interaction, and frustrated drivers who hold the automotive manufacturer responsible for the whole infotainment system. This emphasizes the need for a carefully developed approach for integrating new applications and their user interfaces into the car. An automated user interface generation process has to conform to restrictive and well-defined rules.


Illustrating Example

Throughout the following, we use an example scenario to illustrate the approach. An external to-do list application comprises the following functionalities: present a list, add a new entry, select an entry, and delete a selected entry.

The to-do list will be integrated into two automotive user interface concepts: a touch-screen-based HMI and a CCE-based HMI, illustrated later in more detail.

RELATED WORK

Several approaches exist that derive different user interfaces from abstract interaction representations. ConcurTaskTrees (CTT) [6] provides a notation to describe user interfaces on the level of task models. The User Interface eXtensible Markup Language (UsiXML) [9] describes a comprehensive modeling approach including transformations from abstract to concrete user interfaces based on the CAMELEON reference framework [1]. The Dialog and Interface Specification Language (DISL) is a user interface description language based on dialogue models and modality-independent presentation models [8].

In recent years attention has also been paid to the Unified Modeling Language (UML), which is a widespread industry standard for modeling software systems. Several approaches motivate the use of UML for user interface modeling [2, 3, 5, 7]. De Melo provides a detailed analysis of UML as a basis for model-based user interface development and emphasizes advantages concerning comprehensibility, universality, and tool support, amongst others [3]. We consider UML an appropriate basis, which can be adapted and extended for our approach. The availability of established tools is particularly important for use in industry. We focus this paper on demonstrating abstract interaction modeling techniques with UML and on implementing automatic transformations from an abstract model to specific automotive user interface concepts.

ABSTRACT INTERACTION MODELING APPROACH

The general approach is illustrated in Figure 1. We use the roles of an application developer and an interaction designer. An application is developed by an application developer, including a functional application interface consisting of a class diagram with attributes and operations. An interaction designer uses this interface to create an abstract interaction model using UML state charts to describe user actions and corresponding system reactions. A transformation program uses the model and generates a user interface compliant with the respective automotive HMI concept. For the transformation process, rules have to be implemented that map the abstract model elements to user interface elements for a specific concept.


Figure 1. General approach: (1) the application developer provides the application interface, (2) the interaction designer creates the abstract interaction model that is used for user interface generation.

The overall process is described and demonstrated for the to-do list example in the remainder of this paper. The definition of abstract data types and interaction elements is described in the following section.

Abstract Data Types and Interaction Elements

The application developer uses a defined set of abstract data types for the attributes to be provided. Table 1 describes an extract of these data types.

Type          Description
Boolean       Logical value true or false
String        Sequence of symbols from the underlying set or alphabet
              Properties: Empty - Boolean value indicating whether the string is empty
Collection    A collection of elements with type <Type>
              Properties: Empty - Boolean value indicating whether the collection is empty
                          Subselection - A collection of selected elements from the entire collection

Table 1. Extract of abstract data types to be used by the application developer for the application interface.

The interaction designer uses the provided attributes and a defined set of modeling elements and guidelines to create a UML state diagram. Table 2 provides an extract of the elements that can be used by the interaction designer.

The abstract data types and modeling elements are illustrated with the example to-do list application in the following section.
following section.


Element                           Meaning
State                             Defined interaction state with a set of possible interactions

do-activity within State
PRESENT <attribute>               Presentation of <attribute> to the user
PROVIDE <attribute>               Possibility for the user to provide a value for <attribute>
PROVIDE(<number>) <collection>    Possibility to provide <number> elements for <collection>

Transition with keyword ACT
ACT<action>                       The action that can be initiated by the user
ACT<action> [<condition>]         The action that can be initiated by the user if <condition> is true
ACT<action> [not <condition>]     The action that can be initiated by the user if <condition> is false

Transition with keyword SELECT
SELECT(<number>) <collection>     Selection of <number> elements from the collection <collection>

Table 2. Extract of defined elements to be used by the interaction designer in the UML state chart.

Example: To-do List<br />

The application developer provides all attributes that can be<br />

used for interaction modeling as a UML class diagram, see<br />

Figure 2. For the to-do list these are addLabel and<br />

confirmLabel, which contain texts to be presented to the<br />

user during the respective interaction steps, and a collection<br />

named entryList containing elements of the custom type<br />

Entry. The developer also provides the information that an<br />

Entry consists of one string named description. Furthermore<br />

the operations saveEntry(Entry) and deleteEntry(Entry) are<br />

provided.<br />

Figure 2. UML class diagram for the to-do list provided by the<br />

application developer as functional application interface.<br />
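To make the provided interface concrete, the following sketch restates the class diagram of Figure 2 in Python. It only illustrates the information handed to the interaction designer; the attribute and operation names are taken from the paper, while the use of Python and the list-based collection are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Entry:
    """Custom type provided by the application developer."""
    description: str  # an Entry consists of one string named description

@dataclass
class ToDoListInterface:
    """Functional application interface of the to-do list (cf. Figure 2)."""
    addLabel: str                                          # text presented during the add-entry step
    confirmLabel: str                                      # text presented during the confirmation step
    entryList: List[Entry] = field(default_factory=list)   # Collection of Entry

    def saveEntry(self, entry: Entry) -> None:
        """Operation provided to store a new entry."""
        self.entryList.append(entry)

    def deleteEntry(self, entry: Entry) -> None:
        """Operation provided to remove an existing entry."""
        self.entryList.remove(entry)
```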

The application developer furthermore provides textual<br />

descriptions of the attributes and operations. These support the interaction designer in understanding the semantics in

order to achieve correct mappings to the interaction model.<br />

The interaction designer uses the attributes when creating<br />

the abstract interaction model. Using UML, the designer could include operations from the class diagram directly in the state chart. However, we

decided to define the relations between interactions and<br />

operations outside of the state chart in a mapping table.<br />


This allows the interaction designer to create the interaction<br />

model independently of this mapping. Figure 3 illustrates a

possible interaction model for the to-do list application.<br />

Figure 3. Abstract interaction model for the to-do list using<br />

UML state charts with the defined set of model elements.<br />

The interaction designer then uses the provided operations

and defines the relations to the interaction model. This is<br />

exemplified in Table 3. The saveEntry function is<br />

connected to the ACTSave transition with the entry<br />

provided by the user in the state Add entry. The<br />

deleteEntry function is connected to the ACTYes<br />

transition and deletes the subselection of entryList, which in this case is exactly one entry selected by the user.

Application function   Relation to interaction model
saveEntry(Entry)       ACTSave
                       Entry: PROVIDE(1) entryList
deleteEntry(Entry)     ACTYes
                       Entry: entryList.Subselection

Table 3. Mapping of interactions to application logic provided by the interaction designer based on the operations provided by the application developer.
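Because the relations between interactions and operations are kept outside the state chart, the mapping of Table 3 can be stored as a simple lookup structure. The sketch below is one hypothetical encoding in Python; the transition and operation names are those of the paper, the dictionary layout is an assumption.

```python
# Mapping of interaction-model transitions to application operations (cf. Table 3).
# The parameter binding records where the argument value comes from in the model.
interaction_to_operation = {
    "ACTSave": {
        "operation": "saveEntry",
        "parameter_binding": {"Entry": "PROVIDE(1) entryList"},    # entry provided in state "Add entry"
    },
    "ACTYes": {
        "operation": "deleteEntry",
        "parameter_binding": {"Entry": "entryList.Subselection"},  # the single entry selected by the user
    },
}
```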

The next process step is to transform the abstract interaction<br />

model including the abstract data types and operations to<br />

different HMI concepts. For this example, we demonstrate<br />

two different automotive HMI concepts that are described<br />

in the following section.<br />

Example: Two Automotive HMI Concepts

We illustrate the to-do list application with two different<br />

HMI concepts which can be summarized as follows:<br />

Touch screen based HMI: The first concept is based on<br />

operation with direct input via a touch screen. Touchable<br />

buttons are used to directly interact with the system. Lists<br />

are provided and can be operated (e.g. scrolling) via touch<br />

gestures. The system provides a software keyboard<br />

appearing when text or numbers are to be entered.<br />

CCE-based HMI: The second concept is based on indirect<br />

input via a CCE that can be pushed in eight directions,<br />

turned and pressed. Selectable menu entries are used to<br />

interact with the system. These are realized as menu


containers and are arranged in a certain hierarchy. The<br />

system provides specific complex speller widgets to enable<br />

the user to enter text or numbers.<br />

In order to map the abstract model to different HMI<br />

concepts, different rule sets have to be defined. Table 4<br />

illustrates general examples of required mappings.<br />

Abstract element   Touch concept                                CCE concept
PROVIDE            Text field widget and software keyboard      Edit speller widget
ACT                Touch button                                 Menu entry in a menu container
SELECT(1)          List box with the possibility to directly    Menu container with the possibility to navigate through
                   select one entry                             the entries and select/highlight one entry

Table 4. Example mappings from abstract to specific concepts.
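The concept-specific rule sets summarized in Table 4 can likewise be expressed as data that drives the transformation. The following Python sketch shows one possible encoding; the widget descriptions are taken from the table, whereas the dictionary layout and the lookup function are assumptions.

```python
# One rule set per HMI concept: abstract model element -> concrete widget (cf. Table 4).
WIDGET_RULES = {
    "touch": {
        "PROVIDE":   "text field widget with software keyboard",
        "ACT":       "touch button",
        "SELECT(1)": "list box with direct selection of one entry",
    },
    "cce": {
        "PROVIDE":   "edit speller widget",
        "ACT":       "menu entry in a menu container",
        "SELECT(1)": "menu container with navigation and selection/highlighting of one entry",
    },
}

def map_element(concept: str, abstract_element: str) -> str:
    """Look up the concrete user interface element for an abstract model element."""
    return WIDGET_RULES[concept][abstract_element]

print(map_element("touch", "PROVIDE"))   # -> text field widget with software keyboard
```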

The requirements for arranging the different elements<br />

depending on some properties (e.g. list sizes, menu<br />

hierarchy, etc.) are provided by the HMI concept. These<br />

influence the transformation mechanism for each concept.<br />

We defined the specific transformation mechanisms<br />

including these requirements and exemplified the process<br />

with the to-do list example. The proof of concept is<br />

described in the following section.<br />

PROOF OF CONCEPT<br />

The two different HMI concepts were implemented with the<br />

respective widget and layout specifications based on XML descriptions in a pre-defined format. These specifications

were used to create rules for the transformation from the<br />

abstract model elements to the respective specific HMI<br />

layout and interaction elements. This was implemented<br />

using eXtensible Stylesheet Language Transformation<br />

(XSLT).<br />
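How such an XSLT-based transformation step could be invoked is sketched below, assuming the abstract interaction model is serialized as XML and one stylesheet exists per HMI concept. The file names and the use of the lxml library are assumptions for illustration; only the choice of XSLT follows the paper.

```python
from lxml import etree  # third-party XML/XSLT library

# Load the abstract interaction model (serialized state chart) and one concept-specific rule set.
model = etree.parse("todo_interaction_model.xml")        # hypothetical file name
touch_rules = etree.XSLT(etree.parse("touch_hmi.xsl"))   # hypothetical stylesheet for the touch concept

# Apply the rules; the result is the HMI description for the touch screen based concept.
touch_hmi = touch_rules(model)
with open("todo_touch_hmi.xml", "wb") as out:
    out.write(etree.tostring(touch_hmi, pretty_print=True))
```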

Based on the abstract model elements for the to-do list,<br />

example transformations were implemented for the two<br />

automotive HMI concepts described above. These<br />

transformations include enabled and disabled user actions,<br />

representations of collection variables (e.g. lists) with the<br />

selection of individual collection elements, and representations<br />

for presenting and providing basic data types like<br />

text strings. Example screenshots of the resulting generated<br />

HMIs are illustrated in Figure 4.

Figure 4. Screenshots for the to-do list from the demonstrator:<br />

left: touch screen based HMI, right: CCE based HMI.<br />


CONCLUSION AND FUTURE WORK<br />

We presented an abstract interaction modeling concept<br />

based on UML class diagrams and state charts. An example<br />

application was modeled and the transformation process<br />

was successfully implemented for two different automotive<br />

HMI concepts. The developed concept includes the<br />

abstraction of basic interaction possibilities and a first set of<br />

transformations for a controlled HMI generation. The<br />

demonstrated concept motivates further research and development towards more flexible and adaptive automotive infotainment systems that allow the integration of external applications after the car software has been deployed.

Covering a complete HMI concept specification including<br />

the respective transformation rule set may result in large<br />

implementations. Thus, one important issue for the future is<br />

to further improve the HMI specification process in order to<br />

minimize the effort of obtaining transformation rules. These<br />

activities will also support the definition of overall<br />

automotive industry solutions for HMI development<br />

processes, especially concerning modeling languages and<br />

definitions of interfaces between applications and the HMI.<br />

Detailed evaluations, the elaboration of further complex<br />

examples, and stepwise improvements and expansion of the<br />

rule sets are part of ongoing and future activities. The<br />

implementation of a client-server architecture is envisioned<br />

to allow a client HMI system to communicate with remote<br />

applications and other input and output devices via defined<br />

messages. This will also enable the flexible addition of<br />

interaction devices and modalities for external applications.<br />

REFERENCES<br />

1. CAMELEON Project. http://giove.cnuce.cnr.it/<br />

projects/cameleon.html (11 Nov 2010).<br />

2. Dausend, M. & Poguntke, M.: Spezifikation<br />

multimodaler Interaktionsanwendungen mit UML. In<br />

Mensch & Computer (2010), 215-224.<br />

3. De Melo, G. Modellbasierte Entwicklung von Interaktionsanwendungen,<br />

München, Germany, 2010.<br />

4. O.M.G.: UML 2.2 Superstructure Specification (2009).<br />

5. Nobrega, L., Nunes, N. J., & Coelho, H.: Mapping<br />

ConcurTaskTrees into UML 2.0. LNCS 3941 (2006).<br />

6. Paternò, F., Mancini, C., Meniconi, S.: ConcurTaskTrees: A Diagrammatic Notation for Specifying

Task Models. In Proceedings of the IFIP TC13<br />

International Conference on HCI (1997).<br />

7. Paternò, F.: Towards a UML for interactive systems.<br />

LNCS 2254 (2001), 7-18.<br />

8. Schäfer, R.: Model-Based Development of Multimodal<br />

and Multi-Device User Interfaces in Context-Aware<br />

Environments, Aachen, Germany, 2007.<br />

9. Vanderdonckt, J., Limbourg, et al.: UsiXML: A User<br />

Interface Description Language for multimodal User<br />

Interfaces. In Proc. Workshop on Multimodal<br />

Interaction WMI (2004), 1-7.


A Novel Multimedia Session Management Approach<br />

for In-Vehicle Middleware based on DPWS<br />

Michael Eichhorn*, Martin Pfannenstein*, Rainer Bodendorfer**, Eckehard Steinbach*<br />

Institute for Media Technology<br />

Technische Universität München<br />

*{firstname.lastname}@tum.de, **bodendorfer@gmx.de<br />

ABSTRACT<br />

In this paper, we present a novel multimedia session management<br />

approach for a future Ethernet/IP-based in-vehicle<br />

communication network. All network devices are available<br />

as services in a service-oriented architecture (SOA) that is<br />

established on top of the in-vehicle network. We use the Device<br />

Profile for Web Services (DPWS) as a middleware as it<br />

is designed to support resource-constrained embedded devices, as they are typical for an in-vehicle scenario. The session

management has been designed to support any type of data<br />

to be exchanged between the services. In this study, we put a<br />

particular focus on in-car video streaming and demonstrate<br />

that the proposed approach successfully supports a variety<br />

of video streaming scenarios.<br />

Author Keywords<br />

service-oriented architecture, human machine interface, session<br />

management, in-car infotainment, device integration<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />

INTRODUCTION<br />

The IT infrastructure of modern cars features a variety of<br />

electronic control units (ECU) to execute automation and<br />

control tasks which ensure the vehicle’s operation on the<br />

road. Additionally, more and more comfort and entertainment<br />

functionalities are shipped with modern vehicles, particularly

in the premium segment. The challenge that car manufacturers<br />

face today is to adapt the in-vehicle network to the<br />

increasing number of ECUs as well as their corresponding<br />

traffic, in particular, novel applications transmitting audio and video data. Therefore, car manufacturers aim for a homogenized in-vehicle network rather than installing

multiple fieldbus systems like CAN, LIN, MOST, FlexRay<br />

etc., as in today’s cars. This then also fosters new services<br />

and applications due to the ubiquitous availability of data<br />

compared to a separation of sensors and actuators across the<br />

fieldbus systems mentioned above. One promising candidate<br />

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


for such a homogeneous in-car infrastructure is Ethernet/IP<br />

as it comes with a well-proven and established set of interfaces<br />

and protocols. Additionally, not only the number of<br />

the vehicle’s internal ECUs increases, but also the number<br />

of externally connected devices. For instance, a driver as<br />

well as the passengers want to interact with the car via their<br />

personal devices, e.g., laptops, smartphones, PDAs and so<br />

on. Therefore, the requirements for a well-organized and flexible human machine interface (HMI) emerge. This HMI should be as universally applicable as possible: given the long lifecycle of a car compared to that of consumer electronic (CE) devices, it is not foreseeable which personal devices will be brought into the car in the future. For IP-based IT

infrastructures like for example in business organizations as<br />

well as on the Internet itself, where many different services<br />

are available, there is an emerging need for an arrangement<br />

of these services. This can be achieved by a service-oriented<br />

architecture (SOA), which provides a middleware with standardized

interfaces. We use such an architecture to connect<br />

ECUs and CE devices seamlessly as well as to generate an HMI that can be distributed and composed by the connected devices, as described in [3] and [4]. A user can therefore

introduce personal devices and interact with them via the<br />

in-vehicle HMI. This approach also enables novel CE devices<br />

like for example future entertainment systems or even<br />

vehicle-relevant features like a more precise GPS receiver to<br />

be integrated into the IT infrastructure of a car after it has<br />

been shipped. An increasing interaction of the driver and<br />

passengers with the car also leads to a demand for cooperative usage, especially, but not limited to, of infotainment content

like video and audio streams. For example, a driver wants<br />

to see the video of the vehicle’s rear view camera while two<br />

passengers in the back are watching a movie on two screens.<br />

As soon as the driver has finished checking the rear view camera image, the front passenger also wants to watch the

movie from the current position on or start over. This paper<br />

therefore presents a session management approach for a<br />

SOA-based in-vehicle network and is structured as follows:<br />

First, an overview of related work in this area is given. Afterwards,<br />

our system is presented and the proposed session<br />

management scheme is detailed. At the end, a summary and<br />

outlook are given.

RELATED WORK<br />

There exist some approaches towards a more flexible HMI<br />

architecture, such as Continental’s Android-based AutoLinQ

platform [1], the Neutrino RTOS by QNX [7] or<br />

Meego [9]. These platforms are supposed to act as a univer-


sal architecture compared to the car manufacturer’s specific<br />

approaches. The Extensible Messaging and Presence Protocol<br />

(XMPP) [8] is a widely used standard for text-based<br />

chatting. On top of that, the Jingle extension [5] is used to<br />

establish sessions for audio and video calls, mainly in peer-to-peer

networks.<br />

SYSTEM DESCRIPTION<br />

We consider an all-IP in-vehicle network with a SOA infrastructure<br />

on top. In order to also support embedded devices<br />

which do not feature rich processing resources, we use the<br />

Device Profile for Web Services (DPWS) [6], which is a Web<br />

Service based middleware. It is designed to operate also on<br />

resource-limited ECUs as installed in vehicles. Several services<br />

have been designed which cover automotive-specific<br />

use-cases. In the following, we consider a video streaming service which can be invoked by multiple clients. The scenario is depicted

in Figure 1. The first box (simple) shows the most elementary<br />

scenario where one client requests a video stream<br />

from one service provider, i.e., video streaming service. A<br />

multi-party interaction takes place in the “separated” scenario,

where two clients invoke the streaming service independently<br />

of each other. Both clients can also receive the<br />

same video content with the same playout time, i.e., they participate in a common session (shared). The last two scenarios can also be combined into a mixed scenario where two

clients watch the same video content and a third one requests<br />

a separate video stream or the same one with a shifted playout.<br />

Figure 1. Overview of the considered media streaming scenarios.<br />

SESSION MANAGEMENT<br />

When using a SOA as an organizational instance for multimedia<br />

systems, the software development process is eased in<br />

many ways. Nevertheless, some points have to be taken care<br />

of in order to provide an intuitive experience to the user.<br />

In a common unconstrained SOA scenario, with many service<br />

providers and consumers, the service consumer chooses<br />

a provider, often considering only technical or measurable aspects,

e.g., hardware resources, latency, and so on. However,

a human, as a service consumer, wishes to select one specific<br />

function of one specific service provider, neglecting technical<br />

aspects. Therefore, a management has to be established that covers, for example:


• Overview and selection of compatible, available services.<br />

• Independent and un-interruptible use of a service.<br />

• Possibility to share the current service with others.<br />

• A clear distinction of users, their devices and the way they<br />

are using them.<br />

In order to enable these features in a multimedia scenario<br />

based on DPWS, a session management, realized as a dedicated<br />

service, has been developed. With the introduction of<br />

a session, users can be grouped and served independently,<br />

hence supporting their desired way of use.<br />

Establishing a new session<br />

Figure 2. Establishment of a new session.

The establishment of a new session is fundamental in order<br />

to operate independently of others but, on the other hand, to be

also able to share a session. The message exchange pattern<br />

of a session establishment is depicted in Figure 2. Here, a<br />

user invokes a video streaming service by telling the client<br />

application to start a session and assigning a session name<br />

(step 1 and 2). The name of the session (Session-ID), which<br />

can be selected freely, is used to distinguish various running<br />

sessions. The Session-ID is then sent to the session service

(step 3), i.e., the video streaming device, to actually trigger<br />

the request. In order to know who is requesting a new session,<br />

this message also contains additional information like<br />

IP address and Port. This is essential to distinguish participants

and handle further service calls properly. When this<br />

message is received by the service provider, it checks if the<br />

desired Session-ID is available (step 4). With this verification,<br />

a unique assignment of Session-IDs is ensured.<br />

Furthermore, a User-ID is generated which is matched to the<br />

IP address and Port of the requesting client (step 5). Both<br />

IDs, the User-ID as well as the Session-ID, are then stored

at the service provider side. In fact, the User-ID is also assigned<br />

to the Session-ID to have a connection between sessions<br />

and users. The result is a list containing all running<br />

sessions and their participants. With the generated User-ID<br />

it is possible to retrieve information about a certain user. In<br />

future service calls of a known user, only the User-ID has to<br />

be included in order to identify a user and serve the appropriate<br />

session.<br />

The session client itself also needs to know the User-ID that<br />

he has been assigned to. For this reason, a message is sent<br />

(step 7) containing the User-ID and an error code. The error<br />

code contains the result of the verification process of the


Session-ID. The client knows about all possible results and is<br />

able to decide if the establishment of a session was successful.<br />

Finally, the session client saves his User- and Session-ID<br />

(step 8). With this message exchange, a session has been established.<br />
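A minimal sketch of the service-side bookkeeping behind this exchange is given below, with plain Python dictionaries standing in for the actual DPWS service implementation; the error codes and field names are invented for illustration.

```python
import uuid

sessions = {}   # Session-ID -> list of participating User-IDs
users = {}      # User-ID -> (IP address, port) of the requesting client

def start_session(session_id, client_ip, client_port):
    """Handle a session establishment request (cf. Figure 2)."""
    if session_id in sessions:                        # the desired Session-ID must be unique
        return {"user_id": None, "error": "SESSION_ID_IN_USE"}
    user_id = str(uuid.uuid4())                       # generate a User-ID for the requester
    users[user_id] = (client_ip, client_port)         # remember who is behind that User-ID
    sessions[session_id] = [user_id]                  # both IDs are stored on the provider side
    return {"user_id": user_id, "error": "OK"}        # reply with User-ID and error code
```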

Joining an existing session<br />

Figure 3. Requesting a session list.<br />

Another mandatory feature regarding sessions is the participation<br />

in an existing session. All information about sessions<br />

is stored on the service side. A common user on the client

side however has no knowledge about currently running sessions<br />

and assigned Session-IDs. In order to get an overview<br />

of all available sessions, a feature that handles this must be<br />

provided.<br />

Initially, the user enters a command to request a list of all<br />

ongoing sessions (Figure 3, step 1). The session client then<br />

sends a message to the session service (step 2). The session<br />

service queries an overview of all existing sessions (step 3)<br />

from its local database, and sends it back to the requesting<br />

client (step 4). Finally the client is able to display all currently<br />

running sessions to the user (step 5).<br />

This listing feature is not only implemented to join a session<br />

in the next step; it can also be used to get a general overview of

all running sessions. When a list of all available sessions is<br />

shown to the user, he can then choose one out of it in order<br />

to participate.<br />

First, he selects the appropriate command (Figure 4, step<br />

1) and enters the Session-ID of the desired session (step 2).<br />

With this given information a message is sent from the client<br />

to the service (step 3). When the message is received, the included<br />

Session-ID is checked by the service provider and the<br />

existence of the desired session is verified (step 4). The result<br />

of this verification may lead to one of the following situations; a sketch of the corresponding service-side logic follows the list.

• Unknown Session-ID:<br />

The Session-ID cannot be found among the currently

running sessions (step 4). Thus, the desired session the<br />

user wants to join does not exist and hence he cannot

participate. This will be signaled to the user with a message<br />

(step 4.1). An error code is included and can be interpreted<br />

and displayed at the session client (step 4.2). Now,<br />

the user could restart the process with another Session-ID.<br />

• Known Session-ID, no streams present:<br />

If the Session-ID is known, the appropriate session exists<br />

and the user is able to join. At this point, we assume that<br />

no video streaming is running in the desired session (step<br />


Figure 4. Participate in a session.<br />

5). Further, a User-ID is created (step 5.1) and added to<br />

the session (step 5.2). From this point on the user participates<br />

in the session. An error code, sent by a dedicated<br />

message (step 5.3), indicates the successful participation<br />

to the session client. The included User-ID will be

extracted and saved together with the Session-ID by the<br />

client (step 5.4). From now on, the user can trigger the<br />

streaming within the session.<br />

• Known Session-ID, streaming running:<br />

Of course, it is possible that a video streaming session is<br />

already running, initiated by another user. Hence, the new<br />

client has to be notified about the ongoing video stream.<br />

Therefore, metadata about the stream is gathered (step 6),<br />

a User-ID is generated (step 6.1) and added to the session<br />

(step 6.2). Now, the streaming service, which transmits<br />

the stream to all participating members of the session, is<br />

informed and updated (step 6.3) about the new member.<br />

The new client receives a message (step 6.4) with an error<br />

code, which signals a running stream, his User-ID and

metadata. The User- and Session-ID are then saved by the<br />

client (step 6.5). Next, the streaming client, which takes<br />

care of receiving and displaying the video, is started. The<br />

metadata of the received message act as a description for<br />

the expected stream.<br />
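The three outcomes of a join request can be sketched as follows; as before, plain Python stands in for the DPWS service, and the error codes as well as the sessions, users and stream_metadata dictionaries are assumptions for illustration.

```python
import uuid

sessions = {}         # Session-ID -> list of participating User-IDs
users = {}            # User-ID -> (IP address, port)
stream_metadata = {}  # Session-ID -> metadata of a running stream (codec, resolution, ...), if any

def join_session(session_id, client_ip, client_port):
    """Handle a request to participate in an existing session (cf. Figure 4)."""
    if session_id not in sessions:                    # unknown Session-ID
        return {"error": "UNKNOWN_SESSION"}
    user_id = str(uuid.uuid4())                       # create a User-ID for the new participant
    users[user_id] = (client_ip, client_port)
    sessions[session_id].append(user_id)              # add the user to the session
    metadata = stream_metadata.get(session_id)
    if metadata is None:                              # known Session-ID, no stream present
        return {"error": "OK_NO_STREAM", "user_id": user_id}
    # Known Session-ID with a running stream: return the metadata so the client can
    # start its streaming client and display the ongoing video.
    return {"error": "OK_STREAM_RUNNING", "user_id": user_id, "metadata": metadata}
```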

Leaving a session and handover<br />

The last essential functionality is leaving a session. This can<br />

be necessary if a user wants to stop the use of a device or<br />

he wants to join another session. Figure 5 shows the message<br />

flow after a successful initialization (see Figure 2). Afterwards,<br />

the participating clients subscribe to a notification


Figure 5. Leaving and handover of a session.<br />

channel (step 2) with a message (step 3) which is processed<br />

by the service (step 4).<br />

From now on, all required information is sent via the notification<br />

channel to all participating clients. In step 5, for<br />

instance, one client sends a play command (step 6) to the<br />

service provider. The contained User-ID is then verified as<br />

described in the section Establishing a new session (step 5)<br />

and the corresponding service is fired up. This is then broadcasted<br />

to the subscribed services via a notification message<br />

(step 9). In the depicted scenario, this contains the Session-<br />

ID as well as metadata to tell the clients which video properties<br />

they have to expect (codec, resolution, framerate and<br />

so on). The clients, on the other hand, check the Session-ID<br />

and prepare themselves to use the service (steps 10-12). The<br />

video streamed by the service provider can then be received<br />

and displayed.<br />

If a client wants to leave a session (step 13), he can notify the<br />

service provider via a dedicated message (step 14). The service

provider then checks the user’s ID and deletes it from the<br />

receiver and notification list (step 15). This user is then no<br />

longer part of the session. However, as shown in Figure 5,<br />

the video stream is still sent, unaffected, to the remaining client

in the session. Therefore, a handover of the session has taken<br />

place. The remaining client can control all properties of the<br />

session or close it likewise. In this case, the actual number<br />

of participants of a session reaches zero. Hence, all users are<br />

removed and the Session-ID is no longer in use and can be<br />

assigned to new sessions.<br />
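On the provider side, leaving and handover reduce to removing the User-ID and releasing the Session-ID once the participant count reaches zero. The sketch below continues the data structures of the previous sketches and is, again, only an illustration.

```python
def leave_session(session_id, user_id):
    """Handle a leave request (cf. Figure 5); reuses sessions, users and stream_metadata from above."""
    sessions[session_id].remove(user_id)        # delete the User-ID from the receiver/notification list
    users.pop(user_id, None)                    # drop the stored IP address and port of that user
    if not sessions[session_id]:                # participant count has reached zero
        del sessions[session_id]                # the Session-ID is no longer in use and can be reassigned
        stream_metadata.pop(session_id, None)
    # Otherwise the stream simply continues for the remaining participants,
    # which corresponds to the session handover described in the text.
```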

SUMMARY AND OUTLOOK<br />

In this paper, we presented a multimedia session management<br />

extension for our web protocol based HMI architecture,<br />

which has been introduced in our previous work. The<br />

session management has been realized as a dedicated service<br />

while not modifying the underlying DPWS stack. With<br />


this extension, several video streaming scenarios are covered<br />

which then provide more convenience and flexibility to the<br />

driver and the passengers of a car. Users can invoke a service<br />

separately, e.g., a video streaming service can deliver multiple

streams with a different playout time each. On the other<br />

hand, users can share one stream to watch the same video sequence<br />

on multiple screens, i.e., the playout time is the same.<br />

If there is an existing session available with a running stream<br />

and a new user wants to participate, the meta information is<br />

also indicated to the new user and his stream has the same<br />

playout time despite his late participation. Furthermore, it<br />

is possible that the initiator of a session leaves and another<br />

participating user takes over. This session handover offers<br />

high flexibility regarding connected devices, for instance, a<br />

movie that has been viewed during the trip can be continued<br />

on a mobile device afterwards.<br />

ACKNOWLEDGEMENTS<br />

This work has been supported, in part, by the BMBF funded<br />

research project SEIS (Security in Embedded IP-based Systems)<br />

[2].<br />

REFERENCES<br />

1. Continental <strong>Automotive</strong> GmbH. AutoLinQ.<br />

http://www.conti-online.com/generator/www/de/en/<br />

continental/automotive/themes/passenger cars/interior/<br />

connectivity/autolinq/pi autolinq en.html, last accessed<br />

Nov. 2010.<br />

2. EENOVA. SEIS (Security in Embedded IP-based<br />

Systems). http://www.eenova.de/projekte/seis, last<br />

accessed Feb. 2010.<br />

3. M. Eichhorn, M. Pfannenstein, D. Muhra, and<br />

E. Steinbach. A SOA-based middleware concept for<br />

in-vehicle service discovery and device integration. In<br />

Intelligent Vehicles Symposium (IV), 2010 IEEE, pages<br />

663–669. IEEE, 2010.<br />

4. M. Eichhorn, M. Pfannenstein, and E. Steinbach. A<br />

flexible in-vehicle HMI architecture based on web<br />

technologies. In International Workshop on Multimodal<br />

Interfaces for Automotive Applications (MIAA 2010),

Hong Kong, China, Feb. 2010.<br />

5. S. Ludwig, J. Beda, P. Saint-Andre, R. McQueen,<br />

S. Egan, and J. Hildebrand. Xep-0166: Jingle. XMPP<br />

Enhancement Proposal, Jabber Software Foundation,<br />

2005.<br />

6. OASIS. Devices profile for web services version 1.1.<br />

http://docs.oasis-open.org/ws-dd/dpws/wsdd-dpws-<br />

1.1-spec.html, last accessed Nov. 2010.<br />

7. QNX Software Systems. QNX Neutrino RTOS.<br />

http://www.qnx.com/products/neutrino rtos/, last<br />

accessed Nov. 2010.<br />

8. P. Saint-Andre et al. Extensible messaging and presence<br />

protocol (XMPP): Core. 2004.<br />

9. The Linux Foundation. Meego. http://meego.com/, last<br />

accessed Nov. 2010.


“Hands Busy, Eyes Busy”: Generating Stories from<br />

Sensor Data for Automotive Applications

Joe Reddington, Ehud<br />

Reiter, Nava Tintarev<br />

Department of Computing<br />

Science<br />

University of Aberdeen<br />

j.reddington, e.reiter,<br />

n.tintarev@abdn.ac.uk<br />

ABSTRACT<br />

This paper examines the potential of using natural language<br />

generation to support “hands busy, eyes busy” automotive<br />

applications. It outlines a hierarchy of complexity of output<br />

text, and the type of sensor data that may be collected. It<br />

also suggests a number of ways natural language generation<br />

can generate narrative events from sensor data for drivers.<br />

Author Keywords<br />

NLG, AAC, event generation, narrative, story, sensors, automotive<br />

applications<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />

INTRODUCTION<br />

This work examines the potential of using automatically harvested<br />

information to generate new phrases automatically,<br />

creating support for “hands busy, eyes busy” automotive applications.<br />

Of particular interest is a review of how technologies<br />

and techniques developed in an assistive technology application<br />

(the recent “How was School Today...?” project)<br />

can be applied to the automotive domain.<br />

Mobile usage while driving has been identified as a risk factor<br />

in road accidents [2, 5]. Reducing both the motivation<br />

to use such devices while driving and the length of time for<br />

which they are used would potentially reduce the number of<br />

road accidents. The position of the authors is that the use<br />

of automatic narration techniques can support communication<br />

in scenarios such as making regular deliveries or public<br />

transportation. Methodologies to enable this type of automatic<br />

text generation are under-researched and NLG can aid<br />

in this task by creating a story that is structured, relevant and<br />

flexible to the current situation, based on sensor data.<br />

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


Rolf Black, Annalu Waller<br />

School of Computing<br />

University of Dundee<br />

rolfblack,awaller@<br />

computing.dundee.ac.uk<br />

It is easy to envisage a system by which buses or delivery<br />

vans automatically send an update of location to a home<br />

server, and indeed many services offer near real-time tracking<br />

of packages from source to destination. In contrast, this<br />

work focuses on combining such messages, augmented with<br />

information from weather reports, traffic reports and other<br />

data, to form a larger message with an overall narrative.<br />

In this paper we situate the work with regard to existing<br />

work, then introduce the “How was School Today...?” project<br />

that informed this work. We go on to identify potential application<br />

areas in the automotive domain, and discuss the<br />

possible effects, risks, and advantages.<br />

RELATED WORK<br />

Our existing work sits on the boundary between Natural Language<br />

Generation (NLG), which is a subcategory of natural<br />

language processing that examines the creation of text from<br />

nonlinguistic data such as sensor readings, and Alternative<br />

and Augmentative Communication (AAC), an area examining<br />

communication for those with restrictions on speech.<br />

NLG techniques can dynamically combine and change some<br />

output depending on the changing internal state of a system<br />

[11]. A popular application area for NLG has been<br />

weather forecasting (generating textual weather forecasts from<br />

the results of a numerical atmosphere simulation model),<br />

and several weather forecast generators have been fielded<br />

and used operationally [17, 16]. A number of data-to-text<br />

systems have also been developed in the medical community,<br />

such as BabyTalk [15], which generates summaries of<br />

clinical data from a neonatal intensive care unit, and the<br />

commercial Narrative Engine [14] which summarises data<br />

acquired during a doctor/patient encounter.<br />

In this paper, we seek to focus the technology away from<br />

AAC and on the automotive domain, where natural language<br />

processing systems have been used with some success. For<br />

example, RoadSafe is an NLG system that has been operationally<br />

deployed at Aerospace and Marine International<br />

(AMI) to produce weather forecast texts for winter road maintenance.<br />

It generates forecast texts describing various weather<br />

conditions on a road network [10]. Other systems have focused<br />

more on processing language to visualise and animate<br />

3D scenes from car accident reports [3].


Figure 1. Types of input that can be collected by a mobile device: voice recording, RFID, voice, emotional embellishments<br />

Automotive research in general is well developed; of particular relevance to this work is the issue of privacy in vehicle-to-vehicle,

or vehicle-to-base communication, see e.g. [8, 9].<br />

The “How was School Today...?” project<br />

Our work is informed by the “How was School Today...?”<br />

(HWST) project [1, 6] which logged sensor data for students<br />

at a special needs school. This data included object and person<br />

interactions, voice recordings, and location information<br />

(at the room level). It also recorded positive and negative<br />

evaluations (e.g. “It was not a good day.”) input by the children.<br />

This framework has been tested as a proof-of-concept<br />

in the context of generating stories for children at the school.<br />

The students (who had no, or very limited, speech) could<br />

then relay these stories to parents or other conversation partners.<br />

For this particular domain, the types of data recorded for<br />

each user are:<br />

• Location data - each time the user entered a new room,<br />

this information was recorded. (Pre-processing removed<br />

rooms entered for less than three minutes).<br />

• Object interaction - each time the user interacted with an<br />

object that had an RFID tag, that interaction was recorded.<br />

• Person interaction - each time the user interacted with a<br />

person that had an RFID tag, that interaction was recorded.<br />

• Voice messages - staff and teachers were encouraged to<br />

record voice messages, as if the user was speaking in the<br />

first person, that described the user’s recent activities.<br />

An example set of data would be:<br />

11:36, Location, Tutorial Room<br />

11:36, Object, Money<br />

11:39, Object, Monkey Game<br />

This is converted into English text to give the story:

I played with Money and Monkey Game. This happened<br />

at a Tutorial Room.<br />
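A minimal illustration of this conversion step is sketched below with hand-written templates in Python; the record format follows the example above, while the grouping and wording rules are assumptions and do not reproduce the HWST system's actual NLG pipeline.

```python
records = [
    ("11:36", "Location", "Tutorial Room"),
    ("11:36", "Object", "Money"),
    ("11:39", "Object", "Monkey Game"),
]

def simple_story(recs):
    """Turn a short run of sensor records into a two-sentence story."""
    objects = [value for _, kind, value in recs if kind == "Object"]
    locations = [value for _, kind, value in recs if kind == "Location"]
    sentences = []
    if objects:
        sentences.append("I played with " + " and ".join(objects) + ".")
    if locations:
        sentences.append("This happened at a " + locations[0] + ".")
    return " ".join(sentences)

print(simple_story(records))
# -> I played with Money and Monkey Game. This happened at a Tutorial Room.
```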


Many of the input sensor data and techniques used in HWST<br />

can be applied to the automotive domain. Figure 1 outlines

the type of input that could be used in such a system and collected<br />

with a mobile phone, e.g. voice recordings, location,<br />

interactions with people and objects (RFID).<br />

The HWST project is in the process of introducing the Nokia<br />

6212 1 as a collection device, which may need to be supplemented

with an additional system for recording location information<br />

on the room level.<br />

Depending on the granularity of location data required, other<br />

hardware may supplement a mobile phone. GPS tracking<br />

may be more suitable for larger distances while Bluetooth or

other methods may be preferable for room-level identification.<br />

Additional sensor data may be available in a vehicle<br />

such as changes in light, temperature, etc. [4], or speed and

fuel usage.<br />

TYPES OF PRODUCIBLE CONTENT

This section categorises the potential outputs of automatically<br />

generated content into a triple-tiered hierarchy of network-based

input, sensor-based input, and the creation of narratives<br />

from sensor input. This hierarchy can be broadly arranged<br />

in terms of invasiveness of the data collection. This<br />

and other privacy concerns are key to any implementation.<br />

Network-based input<br />

Network-based input is defined as new utterances that can<br />

be determined by access to information over the Internet, or<br />

some other large information portal. An example is talking<br />

about the weather - phrases such as “It’s very warm today”,<br />

and “The snow is starting to stick!”, but this can include<br />

“There was an accident on the M14”, or “Traffic is slow<br />

around Old Trafford due to the match”.<br />

1 http://europe.nokia.com/find-products/<br />

devices/nokia-6212-classic, retrieved November<br />

2010


Sensor-based input<br />

Sensor-based input is defined as the use of single facts about<br />

the user provided by sensor data. Examples might include “I<br />

went to Leeds” - provided by GPS data, or “I just handled<br />

package 41” - provided by use of a barcode scanner in combination<br />

with an online lookup of the IDs for the packages.<br />

Although there is a concern that this sort of data collection<br />

can affect both privacy and the workload required to maintain

it, messages can be better adapted: “I got a text message<br />

from Jamie this morning, he said ‘looking forward to tomorrow’<br />

”. Voice messages are included in this category and<br />

can include information that would never be picked up by a<br />

sensor - “I helped jump-start a car and was 15 minutes late.”.<br />

Creation of narratives from sensor data<br />

This category contains those groups of messages, based on<br />

sensor data, that together relate an experience or tell a story,<br />

thus adding the problems of creating a narrative structure or<br />

consistent style to what has previously been a data-mining<br />

exercise. The importance of narrative in exchanging information<br />

is well-researched; for an NLG example, see [12].

In HWST, stories were generated using additional reasoning,<br />

such as giving more importance to events that occurred in<br />

locations which were unexpected compared to a timetable.<br />

These stories were also augmented by users with positive<br />

and negative annotations of utterances “She was nice.” (for<br />

people) or “It was not a good day.” (for the whole story) [1].<br />

The creation of multi-fact, multi-sentence messages with a<br />

structured narrative is a step forward in NLG terms, requiring

more sophisticated techniques than previous levels in<br />

the hierarchy. In particular, this moves the focus of NLG<br />

research to the tasks of document planning and document<br />

structuring, compared to text generation on the sentence level.<br />

The analysis of sensor-based data, defining one of these multi-fact and multi-sentence messages as an ‘event’, is discussed

in [6]. While the NLG techniques outlined in [11] can combine<br />

facts into plain English, a further challenge lies in defining<br />

boundaries between groups of sensor data to define separate<br />

events. The goal is to arrange the sensor-based input<br />

into a narrative structure that accurately relates events.<br />

Based on a modified version of the data recording in the<br />

HWST project, one could assume input data such as that<br />

highlighted in Figure 2. The generated text could then be:<br />

“This morning, after picking up two packages, I helped<br />

jump-start a car and was delayed by 15 minutes. Later, I<br />

arrived at the Leeds depot and delivered the packages to Mr.<br />

Roberts. The delivery went fine”.<br />
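One simple way to approach the event boundary problem sketched above is to split the sensor log wherever the gap between consecutive records exceeds a threshold and to verbalize each group afterwards. The Python sketch below only illustrates the segmentation idea; the threshold, the record format and the subset of records are assumptions and not the method of [6].

```python
from datetime import datetime, timedelta

log = [
    ("06:27:00", "Object", "Package1"),
    ("06:27:07", "Object", "Package2"),
    ("07:34:00", "Voice Recording", "I helped jump-start a car and was delayed by 15 minutes."),
    ("09:40:00", "Location", "Leeds depot"),
    ("09:40:00", "Object", "Package1"),
    ("09:40:05", "Object", "Package2"),
    ("09:40:00", "Person", "Mr. Roberts"),
]

def segment_events(records, gap_minutes=30):
    """Group records into candidate events whenever the time gap exceeds the threshold."""
    events, current = [], [records[0]]
    for prev, rec in zip(records, records[1:]):
        t_prev = datetime.strptime(prev[0], "%H:%M:%S")
        t_rec = datetime.strptime(rec[0], "%H:%M:%S")
        if t_rec - t_prev > timedelta(minutes=gap_minutes):
            events.append(current)   # close the current event at a large gap
            current = []
        current.append(rec)
    events.append(current)
    return events

for event in segment_events(log):
    print([value for _, _, value in event])   # each printed group is one candidate event
```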

APPLICATION AREAS<br />

The previous section discussed the types of text that can be<br />

generated. This section outlines several practical applications<br />

of the generated narrative text in automative applications:<br />

staying in touch; communication with head office;<br />

and accident reports. Privacy is an important consideration<br />

in any application; the people on whose behalf the story is<br />

generated should always have the possibility to read and edit<br />


06:27:00, Object, Package1<br />

06:27:07, Object, Package2<br />

07:34:00, Voice Recording, I helped jump-start a car and was delayed<br />

by 15 minutes.<br />

09:40:00, Location, Leeds depot.<br />

09:40:00, Object, Package1<br />

09:40:05, Object, Package2<br />

09:40:00, Person, Mr. Roberts<br />

09:43:00, Embellishment, Positive . . .<br />

Figure 2. Possible input data<br />

any text before it is transmitted. Moreover, any generated<br />

text can be read aloud by text-to-speech software.<br />

This would also facilitate responses to messages originally<br />

sent to a driver, allowing the original sender (who may also

be a driver) to hear the response without extra effort and reducing<br />

cognitive load.<br />

Staying in touch<br />

Many people keep in touch with mobile texts and an increasing<br />

number stay connected using social media such as Facebook<br />

and Twitter 2 . Professional drivers may feel that updating<br />

their status is important from a social as well as professional<br />

perspective. However, while driving, attention should

be on the road, and hands and eyes will be occupied by driving.<br />

An application that uses NLG to automatically update<br />

friends on one’s activities may help drivers feel connected<br />

in their everyday lives. The necessity to automatically generate<br />

such short messages is highlighted in [4] who suggest<br />

messages such as “35 centigrades? It is very hot in here!”.<br />

In particular, the work on structuring narrative produced by<br />

HWST technology allows a move from the functional single<br />

sentence update to a more expressive longer update.<br />

Work Reports<br />

The key application in this area is the generation of automatic<br />

work reports based on a driver’s sensor data. This sort<br />

of narrative can supply an employer with information about<br />

his drivers, such as the hours that they have worked and<br />

which deliveries or other tasks have been successfully executed.<br />

At the same time, the automatic generation of the text<br />

relieves the employee of the task of writing lengthy reports.<br />

Of particular use is text informing end-users of the current<br />

conditions - rather than a simple “Delayed, new ETA:15:27”<br />

message, one can imagine “When coming from a previous<br />

delivery at Hogsmeade, there was heavy traffic due to an<br />

accident in the town so the delivery has been diverted via<br />

Hogwarts and should be with you by 15:27”.<br />

Accident Reports<br />

Generated narrative stories from sensor data can also be

used to support police and ambulance staff at the scene of<br />

the accident. The generated reports can offer a human readable<br />

summary of the situation well ahead of arrival on the<br />

scene, allowing professionals to be ready once they arrive.<br />

This sort of report can help assess the degree of damage<br />

2 www.facebook.com, www.twitter.com, retrieved November 2010


incurred at an accident by considering road conditions and<br />

travel speed. This type of report could also help police (and<br />

insurance companies) assess potential accountability for a<br />

given accident. Infra-red sensors may help assess how many<br />

victims were involved in an accident as well, ensuring that<br />

all victims get pulled out of an affected vehicle.<br />

CONCLUSION AND ONGOING RESEARCH<br />

This paper describes the type of text that can be automatically<br />

generated to support drivers, and highlights three application

areas: staying in touch, communication with head<br />

office, and accident reports. Although a future goal for this<br />

research is to integrate with a commercial product, privacy<br />

and security of such systems require careful consideration.

While care has been taken to keep such concerns a key part<br />

of the research, the authors welcome any communication<br />

from parties with expertise in this area.<br />

ACKNOWLEDGEMENTS<br />

The authors are particularly grateful to the school, staff, and<br />

children. This research was supported by the UK Engineering<br />

and Physical Sciences Research Council under grants<br />

EP/F067151/1, EP/F066880/1, EP/E011764/1,<br />

EP/H022376/1, and EP/H022570/1.<br />

REFERENCES<br />

1. R. Black, J. Reddington, E. Reiter, N. Tintarev, and<br />

A. Waller. Using nlg and sensors to support personal<br />

narrative for children with complex communication<br />

needs. In Proceedings of the NAACL HLT 2010<br />

Workshop on Speech and Language Processing for<br />

Assistive Technologies, pages 1–9, Los Angeles,<br />

California, June 2010. Association for Computational<br />

Linguistics.<br />

2. F. A. Drews, H. Yazdani, C. N. Godfrey, J. M. Cooper,<br />

and D. L. Strayer. Text messaging during simulated<br />

driving. Human Factors: The Journal of the Human<br />

Factors and Ergonomics Society, 51 (5):762–770, 2009.<br />

3. S. Dupuy, A. Egges, V. Legendre, and P. Nugues.<br />

Generating a 3d simulation of a car accident from a<br />

written description in natural language: the carsim<br />

system. In Proceedings of the workshop on Temporal<br />

and spatial information processing - Volume 13, pages<br />

1:1–1:8, Morristown, NJ, USA, 2001. Association for<br />

Computational Linguistics.<br />

4. C. Endres and D. Braun. Pleopatra: A Semi-Automatic<br />

Status-Posting Prototype For Future In-Car Use. In<br />

Adjunct proceedings of the 2nd International<br />

Conference on <strong>Automotive</strong> User Interfaces and<br />

Interactive Vehicular Applications (AutomotiveUI

2010), page 7, Pittsburgh, PA, USA, November 2010.<br />

5. S. P. McEvoy, M. R. Stevenson, and M. Woodward.<br />

Phone use and crashes while driving: a representative<br />

survey of drivers in two australian states. Medical<br />

journal of Australia, 185(11/12):630–634, 2006.<br />

6. J. Reddington and N. Tintarev. Automatically<br />

generating stories from sensor data. In Intelligent User<br />

Interfaces, 2011 (to appear).<br />


7. E. Reiter, R. Turner, N. Alm, R. Black, M. Dempster,<br />

and A. Waller. Using nlg to help language-impaired<br />

users tell stories and participate in social dialogues. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG-09), 2009.

8. F. Schaub, F. Kargl, Z. Ma, and M. Weber. V-tokens for<br />

conditional pseudonymity in vanets. In IEEE Wireless<br />

Communications & Networking Conference (IEEE<br />

WCNC 2010), Sydney, Australia, 04/2010 2010. IEEE,<br />

IEEE.<br />

9. F. Schaub, Z. Ma, and F. Kargl. Privacy requirements in<br />

vehicular communication systems. Computational<br />

Science and Engineering, IEEE International<br />

Conference on, 3:139–145, 2009.<br />

10. R. Turner, Y. Sripada, and E. Reiter. Generating<br />

approximate geographic descriptions. In Proceedings of<br />

the 12th European Workshop on Natural Language<br />

Generation, ENLG ’09, pages 42–49, Morristown, NJ,<br />

USA, 2009. Association for Computational Linguistics.<br />

11. E. Reiter and R. Dale. Building natural language<br />

generation systems, Cambridge University Press, 2000.<br />

12. E. Reiter, A. Gatt, F. Portet, and M. van der Meulen.<br />

The importance of narrative and other lessons from an<br />

evaluation of an NLG system that summarises clinical<br />

data. INLG ’08, pp. 147–156, Morristown, NJ, USA,<br />

2008. Association for Computational Linguistics.<br />

13. S. Ashraf, A. Judson, I. W. Ricketts, A. Waller, N. Alm,<br />

B. Gordon, F. MacAulay, J. K. Brodie, M. Etchels,<br />

A. Warden, and A. J. Shearer. Capturing phrases for<br />

ICU-Talk, a communication aid for intubated intensive<br />

care patients. In ACM Conference on Assistive<br />

technologies, pp. 213–217, New York, NY, USA, 2002.<br />

14. M. D. Harris. Building a large-scale commercial NLG<br />

system for an EMR. In INLG ’08: Proceedings of the<br />

Fifth International Natural Language Generation<br />

Conference, pages 157–160, Morristown, NJ, USA,<br />

2008. Association for Computational Linguistics.<br />

15. A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood,<br />

W. Moncur, and S. Sripada. From data to text in the<br />

neonatal intensive care unit: Using NLG technology for<br />

decision support and information management. AI<br />

Commun., 22(3):153–186, 2009.<br />

16. E. Reiter, S. Sripada, J. Hunter, J. Yu, and I. Davy.<br />

Choosing words in computer-generated weather<br />

forecasts. Artif. Intell., 167(1-2):137–169, 2005.<br />

17. E. Goldberg, N. Driedger, and R. I. Kittredge. Using<br />

natural-language processing to produce weather<br />

forecasts. IEEE Expert: Intelligent Systems and Their<br />

Applications, 9(2):45–53, 1994.


A novel taxonomy for gestural interaction techniques:<br />

considerations for automotive environments<br />

Adriano Scoditti<br />

Laboratoire d’Informatique de Grenoble, Equipe IIHM<br />

385, rue de la Bibliotheque, BP 53, F-38041 Grenoble cedex 9, France<br />

adriano.scoditti@imag.fr<br />

ABSTRACT<br />

A large variety of gestural interaction techniques is now<br />

available. In this article, we use a new taxonomic space [18]<br />

as a comparative structure to analyze the applicability of<br />

these techniques in automotive environments. The taxonomy

plots a gestural interaction technique as a point in a<br />

space where the vertical axis denotes the semantic coverage<br />

of the technique, and the horizontal axis expresses the<br />

physical actions users are engaged in. In addition, syntactic<br />

modifiers are used to express the interpretation process of input<br />

tokens into semantics, as well as pragmatic modifiers to<br />

make explicit the level of indirection between users’ actions

and system responses. In the taxonomy, the complexity of<br />

the gestural interaction lexicon, and the syntactic/pragmatic<br />

modifiers it is decorated with, are indexes of the cognitive<br />

load users are engaged in during the interaction. The integration<br />

of modern mobile devices, complex user interfaces and<br />

gestural interaction techniques into automotive environment<br />

rise the necessity to analyze gestural interaction technique<br />

from their cognitive load point of view.<br />

Author Keywords<br />

Handheld devices and mobile computing, Input and interaction<br />

technologies, Multi-modal interfaces, Recognition and<br />

interpretation of user input (face, body, speech etc.)<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />

INTRODUCTION<br />

Last generation mobile devices are enhanced with a diversity<br />

of sensors capable of probing real world physical properties<br />

in real time. The pioneering work on sensor-based interaction<br />

techniques [8, 11, 12, 15, 16] has paved the way for<br />

an active research area [1, 20, 21]. Although these results<br />

satisfy “the gold standard of science” [19], in practice, they<br />

are too “narrow truths” [4] to support designers’ decisions and researchers’ analyses. Designers and researchers need an

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


Figure 1. Integration of last generation mobile devices in the automotive environment raises the need to analyze gestural interaction techniques

from their cognitive load point of view [?].<br />

overall systematic structure that helps them to reason, compare,<br />

elicit (and create!) the appropriate techniques for the<br />

problem at hand. Taxonomies, which provide such a structure,<br />

are good candidates for generalization in an emerging<br />

field. The challenge, however, is to provide a classification<br />

framework that is both complete and simple to use. Since<br />

completeness is illusory in a moving and prolific domain<br />

such as user interface design, we will not include it in our<br />

goals.<br />

In this article, we propose the interpretation of a new taxonomy<br />

for gestural interaction techniques [18] with considerations<br />

for automotive environments.

To develop our taxonomy, we have built a controlled vocabulary<br />

(i.e. primitives) obtained through an extensive analysis<br />

of the taxonomies that have laid the foundations for<br />

Human-Computer Interaction (HCI) more than twenty-five

years ago. For the most part, this early work in HCI has<br />

been ignored or forgotten by researchers driven by the trendy<br />

“technology push” approach.<br />

Our taxonomy is based on the following principles:<br />

(1) Interaction between a computer system and a human being<br />

is conveyed through input (output) expressions that are<br />

produced with input (output) devices, and that are compliant<br />

with an input (output) interaction language.<br />

(2) As any language, an input (output) interaction language<br />

can be defined formally in terms of semantics, syntax, and<br />

lexical units.


Figure 2. The “sliding” gesture is semantically multiplexed to achieve<br />

different meanings, depending on context.<br />

(3) The generation of an input (output) expression involves<br />

using devices whose characteristics, from the human perspective,<br />

have a strong impact on the expressiveness and<br />

the effectiveness of the user interface [5].<br />

Building on Foley’s work [9] as well as on Buxton’s pragmatics<br />

considerations of input structures [5], our taxonomy<br />

brings together the four aspects of interaction ranging<br />

from semantics to pragmatics with the appropriate humanmotivated<br />

extensions for addressing the specificity of gestural<br />

interaction based on accelerometers. In contrast to<br />

Mackinlay et al.’s semantic analysis of the design space for<br />

input devices [13], we do not consider the transformation<br />

functions that characterize the system-oriented perspective<br />

of interaction techniques.<br />

Our expectation is to provide new insights and to start<br />

promising directions for the design of novel and powerful<br />

gestural interaction techniques.<br />

A NEW TAXONOMY<br />

As shown in Figure 2, the same gesture may convey very<br />

different meanings depending on the context in which it is<br />

produced: “go to previous photo” as for the Apple’s photo<br />

album (or “go to next slide” as in Charade in [2]), “open a<br />

submenu” in Francone’s Wavelet Menu [10], or “unlock” the<br />

iPhone screen. In addition, a gesture that makes sense to the system may not be acceptable in a public social context [17], as it could carry meaning for, and be interpreted by, the surrounding public.

These observations lead us to define a new taxonomy according<br />

to the following principles: (1) Coverage of semantic,<br />

syntactic, lexical, and pragmatic issues of interaction where<br />

semantic granularity is that of Foley’s et al. interaction tasks;<br />

(2) Adoption of a user centered perspective where physical<br />

human actions are premium, leaving aside the internal<br />

computational transformations; (3) Consideration for context;<br />

(4) Coverage of both foreground and background interaction<br />

(as defined by Buxton [6]). Figure 3 shows the<br />

elements of the framework that we describe in detail next.<br />

Lexical Axis<br />

Because of our focus on users’ involvement in the interaction,<br />

the input lexicon corresponds to the physical actions<br />

users apply to devices. We divide human physical actions<br />

into two groups: (1) conscious actions that belong to the<br />


Figure 3. Our classification space for gestural interaction techniques<br />

based on accelerometers. The abscissa defines the lexicon in terms of<br />

the physical manipulations users perform with the device, with a clear<br />

separation between background and foreground interaction. The ordinate<br />

corresponds to Foley’s interaction tasks. An interaction technique<br />

is uniquely identified by an integer i and plotted as a point in this space.<br />

Each point is decorated with the pragmatic and syntactic properties of<br />

the corresponding interaction technique.<br />

foreground interaction, and (2) unconscious actions that correspond<br />

to background interaction. The foreground interaction<br />

area contains the interaction techniques that require<br />

the user to consciously manipulate the device to reach some<br />

objective (as for the sliding gesture of Figure 2). The background<br />

interaction area corresponds to the interaction techniques<br />

where the system interprets user’s unconscious actions<br />

together with contextual information to perform some<br />

system state change on behalf of the user. For example, during<br />

a phone call, the iPhone switches the screen backlight<br />

off to save battery life as the user brings the device next to

the ear.<br />

Whether human actions are performed consciously to address<br />

the system or not, our classification space characterizes<br />

these actions with two additional variables: (τ) the geometrical<br />

transformation matrix that models user’s movements in<br />

space, and (f) the frequency of these movements. The combinations<br />

of τ and f identify three sub-areas within the lexical<br />

axis: “Context”, “Affine Transformations” and “Shock”.<br />

The affine transformations group identifies the most common<br />

interaction techniques based on translations, rotations<br />

and/or scales (in this case, τ is different from the identity<br />

matrix I), and without any repetition (that is, f is equal to<br />

zero, meaning that the interaction is time driven). The sliding<br />

gesture of Figure 2 falls in this category. The shock<br />

category identifies those interaction techniques based on a<br />

combination of translations, rotations and/or scales (τ is different<br />

from the identity matrix) repeated over time (then, f<br />

is different from zero). The shake gesture exemplified by<br />

Shoogle [20] falls in this category. The context category<br />

corresponds to unconscious human manipulations that the<br />

system may interpret to feed into its own context model and,<br />

depending on this context, acts on behalf of the user. For<br />

this situation, we stipulate that τ is the Identity matrix and f<br />

is equal to zero.
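To make the three lexical sub-areas concrete, the following minimal Python sketch (our illustration, not part of the taxonomy in [18]) classifies a manipulation from the two variables introduced above, the transformation matrix τ and the movement frequency f.

    import numpy as np

    def lexical_category(tau: np.ndarray, f: float) -> str:
        """Place a manipulation on the lexical axis from the transformation
        matrix tau and the movement frequency f (illustrative rule only)."""
        is_identity = np.allclose(tau, np.eye(tau.shape[0]))
        if is_identity and f == 0:
            return "Context"                 # unconscious, background manipulation
        if not is_identity and f == 0:
            return "Affine Transformations"  # e.g. the sliding gesture of Figure 2
        if not is_identity and f != 0:
            return "Shock"                   # repeated movement, e.g. Shoogle's shake
        return "unclassified"                # identity transform with repetition is not covered

    # Example: a single translation (homogeneous 2D coordinates) with no repetition.
    slide = np.array([[1.0, 0.0, 0.3],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
    print(lexical_category(slide, f=0.0))    # -> "Affine Transformations"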


Syntactic Axis<br />

Independently from the device used, we characterize the<br />

syntactic dimension of an interaction technique with the following<br />

two variables that we call syntactic modifiers: (1) the<br />

existence (or absence) of triggers to specify the begin/end of<br />

the interaction, and (2) the control type associated with the<br />

input token, which may be position-control, speed-control<br />

or acceleration-control. As a result, given that, in our taxonomy,<br />

an interaction technique is uniquely identified by an<br />

index i, the trigger syntactic modifier is represented as an<br />

oval that surrounds the interaction technique identifier using<br />

a dashed line or a continuous line to denote, respectively, the presence (i.e., clutched) or absence (i.e., unclutched) of a trigger.

In addition, a derivative-like notation is used to convey the<br />

control type where i is decorated with an exponential number<br />

that expresses the derivative order with respect to time (i.e.,<br />

no derivative for position, first order derivative for speed,<br />

and second order derivative for acceleration).<br />

Semantic Axis<br />

As justified in our review about the foundational taxonomies<br />

developed in HCI, we re-use Foley’s interaction tasks: Select,<br />

Position, Orient, Path, Quantify, and Text [9] (See the<br />

vertical axis of Figure 3).<br />

Pragmatic Axis<br />

One of the originalities of our work is the attempt to classify<br />

gestural interaction techniques in close connection with their<br />

meaning in the user’s real world. To do this, we introduce a<br />

pragmatic modifier that expresses the directness [14, 3] of<br />

the mapping between the user’s expectation (i.e. goal) and<br />

the semantics of the interaction technique in the computer<br />

world. For indirect mapping, the identifier i of the interaction<br />

technique becomes the parameter of a function F(i)<br />

to indicate the existence of one or several reinterpretation<br />

layers, whereas for direct mapping, i does not receive any<br />

additional decoration.<br />
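As a reading aid, an interaction technique in this classification space can be captured as a small record combining the four axes. The following Python sketch is our own rendering of the notation, not the authors' implementation: the derivative order stands for position-, speed- or acceleration-control, and the directness flag corresponds to the F(i) decoration for indirect mappings.

    from dataclasses import dataclass

    FOLEY_TASKS = {"Select", "Position", "Orient", "Path", "Quantify", "Text"}

    @dataclass
    class TechniquePoint:
        """One point i in the classification space of Figure 3 (illustrative record)."""
        index: int             # unique identifier i of the technique
        lexical: str           # "Context", "Affine Transformations" or "Shock"
        task: str              # Foley interaction task (semantic axis)
        clutched: bool         # syntactic modifier: explicit trigger present (dashed oval)
        derivative_order: int  # 0 = position-, 1 = speed-, 2 = acceleration-control
        direct: bool           # pragmatic modifier: direct mapping, otherwise F(i)

        def label(self) -> str:
            assert self.task in FOLEY_TASKS
            core = f"i{self.index}^{self.derivative_order}"
            return core if self.direct else f"F({core})"

    # Example: the sliding gesture of Figure 2 modelled as a direct, unclutched,
    # position-controlled Select technique.
    slide = TechniquePoint(1, "Affine Transformations", "Select", False, 0, True)
    print(slide.label())       # -> "i1^0"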

DISCUSSION AND RESEARCH DIRECTIONS<br />

Our fine-structured, language-inspired analysis makes it possible to understand intrinsic and implicit differences even among apparently similar interaction techniques, allowing researchers to explore them more thoroughly and designers to choose the most suitable one for each case.

From the researcher’s point of view, the classification shows<br />

a transparent state of the art where each interaction technique<br />

is classified without ambiguity. Typically, reference<br />

taxonomies such as [9] or [5] do not consider the role of<br />

time (cf. frequency and duration), nor do they cover unconscious<br />

interaction (cf. background interaction) and unstructured<br />

interaction such as device shaking. In addition, they<br />

do not explicitly consider whether an interaction technique<br />

is clutched or unclutched, introducing ambiguities and mixing up different aspects of human interaction behavior.

From the designer’s point of view, the dimensions of our<br />

taxonomy can be used as a framework for decision making.<br />

For example, an unclutched interaction technique may<br />


be considered for default tasks, while different clutched interaction<br />

techniques can be multiplexed through the use of<br />

standard or ad-hoc widgets. By proposing at least one interaction technique for each of the proposed tasks while designing an application, designers will be able to offer a complete and uniform user experience similar to the WIMP one.

Furthermore, designers can predict the difficulties that final<br />

users will encounter by analyzing the pragmatic and syntactic<br />

modifiers that characterize the interaction techniques they<br />

envision. Thus, they will be able to choose interaction techniques<br />

that best suit the targeted representative users (novice,<br />

intermediate, expert).<br />

We see promising research and development directions both in the creation of widgets able to transform direct interactions into their more complex counterparts and in the definition of the elementary interactions on which to base development. The classification suggests concentrating efforts on the development of interaction techniques able to specify Path, Quantify and Text input.

Direct pragmatic interaction techniques are the most suitable for the automotive environment, in particular for drivers. The absence of indirection layers during the interaction results in lower cognitive load, thus easing the interaction and avoiding distraction.

CONCLUSIONS<br />

The characteristics on which we chose to base our analysis are inspired by the parallelism between the artificial languages offered by interfaces and the gestural languages users are accustomed to: lexicon, syntax, semantics and pragmatics. Our discussion did not go down to the system level, as we did not want to differentiate interaction techniques by their implementation characteristics (granularity, resolution function and state machine are variables already taken into account in [7, 13], with respect to which we aim to be complementary rather than a substitute).

Our approach proposes a user-centered classification able to analyze the state of the art of accelerometer-based interaction techniques from the manipulation point of view: the user performs a physical action in his or her space in order to communicate with the system. We think this is the atomic level at which we have to conceive our interfaces in order to offer system-wide coherent languages to users. This coherence will guide them toward a more agreeable, natural [5] and intuitive system with coherent, direct pragmatic distances.

We proposed the use of a parametric space where the pragmatic distance and the syntactic modifiers are indicators of the learning curve users have to climb when approaching a new interaction language.

We contextualized our approach and principles to the automotive environment, and proposed the use of the syntactic and pragmatic modifiers as discriminants for selecting the gestural interaction techniques most suitable in automotive environments.


REMARKS<br />

The content of this article refers to, and is in part an extract of, the taxonomy of accelerometer-based interaction techniques proposed by Scoditti et al. [18].

REFERENCES<br />

1. R. Ballagas, J. Borchers, M. Rohs, and J. G. Sheridan.<br />

The smart phone: A ubiquitous input device. IEEE<br />

Pervasive Computing, 5(1):70, 2006.<br />

2. T. Baudel and M. Beaudouin-Lafon. Charade: remote<br />

control of objects using free-hand gestures. Commun.<br />

ACM, 36(7):28–35, 1993.<br />

3. M. Beaudouin-Lafon. Instrumental interaction: an<br />

interaction model for designing post-wimp user<br />

interfaces. In CHI ’00: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 446–453, New York, NY, USA, 2000. ACM.<br />

4. F. P. Brooks. Grasping reality through<br />

illusion—interactive graphics serving science. In CHI<br />

’88: Proceedings of the SIGCHI conference on Human<br />

factors in computing systems, pages 1–11, New York,<br />

NY, USA, 1988. ACM.<br />

5. W. Buxton. Lexical and pragmatic considerations of<br />

input structures. SIGGRAPH Comput. Graph.,<br />

17(1):31–37, 1983.<br />

6. W. Buxton. Integrating the periphery and context: A<br />

new model of telematic. Proceedings of Graphics<br />

Interface, pages 239–246, 1995.<br />

7. S. K. Card, J. D. Mackinlay, and G. G. Robertson. A<br />

morphological analysis of the design space of input<br />

devices. ACM Trans. Inf. Syst., 9(2):99–122, 1991.<br />

8. G. W. Fitzmaurice, S. Zhai, and M. H. Chignell. Virtual<br />

reality for palmtop computers. ACM Trans. Inf. Syst.,<br />

11(3):197–218, 1993.<br />

9. J. D. Foley, V. L. Wallace, and P. Chan. The human<br />

factors of computer graphics interaction techniques.<br />

IEEE Comput. Graph. Appl., 4(11):13–48, 1984.<br />

10. J. Francone, G. Bailly, L. Nigay, and E. Lecolinet.<br />

Wavelet menu: une adaptation des marking menus pour<br />

les dispositifs mobiles. In IHM ’09: Proceedings of the<br />

21st International Conference on Association<br />

Francophone d’Interaction Homme-Machine, pages<br />

367–370, New York, NY, USA, 2009. ACM.<br />

11. K. Hinckley, J. Pierce, M. Sinclair, and E. Horvitz.<br />

Sensing techniques for mobile interaction. In UIST ’00:<br />


Proceedings of the 13th annual ACM symposium on<br />

User interface software and technology, pages 91–100,<br />

New York, NY, USA, 2000. ACM.<br />

12. G. Levin and P. Yarin. Bringing sketching tools to<br />

keychain computers with an acceleration-based<br />

interface. In CHI ’99: CHI ’99 extended abstracts on<br />

Human factors in computing systems, pages 268–269,<br />

New York, NY, USA, 1999. ACM.<br />

13. J. Mackinlay, S. K. Card, and G. G. Robertson. A<br />

semantic analysis of the design space of input devices.<br />

Hum.-Comput. Interact., 5(2):145–190, 1990.<br />

14. D. Norman. User Centered System Design; New<br />

Perspectives on Human-Computer Interaction. L.<br />

Erlbaum Associates Inc., 1986.<br />

15. K. Partridge, S. Chatterjee, V. Sazawal, G. Borriello,<br />

and R. Want. Tilttype: accelerometer-supported text<br />

entry for very small devices. In UIST ’02: Proceedings<br />

of the 15th annual ACM symposium on User interface<br />

software and technology, pages 201–204, New York,<br />

NY, USA, 2002. ACM.<br />

16. J. Rekimoto. Tilting operations for small screen<br />

interfaces. In UIST ’96: Proceedings of the 9th annual<br />

ACM symposium on User interface software and<br />

technology, pages 167–168, New York, NY, USA,<br />

1996. ACM.<br />

17. J. Rico and S. Brewster. Usable gestures for mobile<br />

interfaces: evaluating social acceptability. In CHI ’10:<br />

Proceedings of the 28th international conference on<br />

Human factors in computing systems, pages 887–896,<br />

New York, NY, USA, 2010. ACM.<br />

18. A. Scoditti, J. Coutaz, and R. Blanch. A novel<br />

taxonomy for gestural interaction techniques based on<br />

accelerometers. In IUI 2011. ACM, 2011.

19. M. Shaw. What makes good research in software<br />

engineering? International Journal of Software Tools<br />

for Technology, 4(1):1–7, 2002.<br />

20. J. Williamson, R. Murray-Smith, and S. Hughes.<br />

Shoogle: excitatory multimodal interaction on mobile<br />

devices. In CHI ’07: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 121–124, New York, NY, USA, 2007. ACM.<br />

21. A. Wilson and S. Shafer. Xwand: Ui for intelligent<br />

spaces. In CHI ’03: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 545–552, New York, NY, USA, 2003. ACM.


Navigating Haystacks at 70 mph:<br />

Intelligent Search for Intelligent In-Car Services<br />

Ashweeni K. Beeharee<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 0358<br />

a.beeharee@cs.ucl.ac.uk<br />

ABSTRACT<br />

With an explosion of in-car services, it has become not only<br />

difficult but unsafe for drivers to search and access large amounts<br />

of information using current interaction paradigms. In this paper,<br />

we present a novel approach for visualizing and exploring search<br />

results, and the potential benefits of its application to the current<br />

in-car environment. We have iteratively developed and tested a<br />

prototype system that enables the seamless and personalized<br />

exploration of information spaces. In a number of eye-tracking<br />

studies, we analyzed user satisfaction and task performance for<br />

factual and explorative search tasks. We found that most<br />

participants were faster, made fewer errors and found the system<br />

easier to use than traditional ones. We believe that this approach would improve traditional in-car interfaces for searching and accessing a large number of services with rich information. This would reduce driver inattention to the road and improve road safety.

Categories and Subject Descriptors<br />

H.5.2 [Information Interfaces and Presentation]: User<br />

Interfaces - Graphical user interfaces.<br />

General Terms<br />

Design, Experimentation, Human Factors, Intelligent Transport<br />

System Services, Road Safety, Theory<br />

Keywords<br />

Contextualization, Personalization, Exploration, Search, Context<br />

Interfaces, Contextual User Interfaces<br />

1. SafeTRIP<br />

Satellite-based communication systems [10] for use in homes<br />

[1][13] and cars have been adopted by consumers in many parts of<br />

the world. The SafeTRIP project aims to build on this success and<br />

utilize a new generation of satellite technology to improve the<br />

safety, security and environmental sustainability of road transport.<br />

SafeTRIP uses S-band satellite technology, which is optimized for<br />

two-way communication for on-board vehicle units. The S-band<br />


Sven Laqua<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 0351<br />

s.laqua@cs.ucl.ac.uk<br />


M. Angela Sasse<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 7212<br />

a.sasse@cs.ucl.ac.uk<br />

communication requires a small antenna, making it suitable for the mass market. Existing solutions that use other frequency bands (e.g. Ku-band) require larger antennas [12] and are thus less suitable for integration into vehicles or handheld devices. An

open SafeTRIP platform will be implemented to host services for<br />

improved safety and navigation, but also entertainment and<br />

advertising to vehicle occupants.<br />

Figure 1 - The SafeTRIP concept<br />

During the requirements capture, we held discussions with drivers, operators, emergency technicians, operation managers, technologists and management from road operators, insurance companies, fleet operators, freight forwarders and coach operators to understand their needs.

Figure 2 - User needs define the SafeTRIP platform

The SafeTRIP platform’s definition - based on key functionalities<br />

elicited from business (such as road operators) and individual<br />

stakeholders - is shown in Figure 2. The platform enables services<br />

that can provide access to rich information that might be useful to<br />

drivers. At the same time, this creates a risk of overloading drivers<br />

with information and distracting their attention, which should be focused on the road. In this paper, we present a new paradigm for

accessing rich media and information in a vehicle which has<br />

minimal impact on the driver’s attention while driving.


2. SafeTRIP Services<br />

From our requirements capture, a set of safety and comfort<br />

services were identified, including:<br />

• Road safety alert service – hazard and incident warning;<br />

• Speed limit service – display variable speed limits in-car;<br />

• Collaborative alert service – allow drivers to share<br />

information about road incidents and traffic information;<br />

• Entertainment service - provides access to Streaming media<br />

and TV channels;<br />

• Assistance service - remote assistance and diagnostics;<br />

• Parking guidance service - for hazardous goods vehicles and coaches;

• Location-Based services – access and present localised<br />

information to driver such as petrol stations, restaurants,<br />

hotels, local events.<br />

These services will provide numerous benefits to drivers. For instance, they will allow them to access rich and timely traffic information from various sources in the vehicle. Commercial

systems such as Coyote have proven very popular amongst drivers<br />

who share information about speed cameras in Europe. Through<br />

SafeTRIP, drivers will also be able to share information about<br />

road incidents with each other. Our user requirements capture<br />

shows that individuals are interested in accessing richer<br />

information. Through the above services, they will be able to<br />

access localized information about parking spaces, hotels and<br />

petrol stations – along with rich information – to allow drivers to<br />

search for the cheapest place to refuel or for a restaurant with a cuisine to their liking.

Whilst this type of information could have many benefits for<br />

drivers, there are risks associated with delivering them into<br />

vehicles. In 2006, a study by the U.S. Department of Transportation (DOT) reported that the leading factor in 80% of crashes and 65% of near-crashes is driver inattention [9]. The SafeTRIP platform will partly address this through a driver alertness service that monitors driver alertness and supports warnings to drivers [8].

However, with access to a large number of services, the driver’s<br />

attention will be required to:<br />

• Access a service through the navigation interface<br />

• Interact with a specific service - which may involve searches<br />

that would require further interaction from the driver<br />

Current icon-based interfaces to in-car systems and virtual<br />

keyboards are too taxing on the driver's attention – and this can only get worse with an increasing number of services. This has led us

to consider alternative paradigms for driver interaction with<br />

information delivered into vehicles.<br />

3. INFORMATION EXPLORATION<br />

In this section we describe a novel information exploration<br />

technique to search and access information on the web.<br />

Experiments have clearly demonstrated its benefits and we believe<br />

that this approach will prove beneficial for drivers searching and<br />

interacting with information in their vehicle.<br />

Approaches such as contextual search [3], search result clustering<br />

[16] or personal search [2][15] aim to overcome some of the<br />

shortcomings of “traditional” search engines. However, none of<br />

those approaches challenges the current paradigm of how users<br />

interact with search engines. To us, it is obvious that the<br />

traditional interaction model using search engine result pages<br />


(SERPs) does not work well for more complex information<br />

problems.<br />

To get a broader view, users need to consult different sources and<br />

understand contexts. Most of the time, a single resource will not<br />

be able to satisfy this need. Traditional SERPs fragment the<br />

relevant bits of information, rather than help users to contextualize<br />

them in meaningful ways. Users have to “crawl” site after site,<br />

foraging for meaningful bits [12], emulating the behavior of a<br />

search engine robot. The search engine interaction model<br />

(Figure 3, left side) illustrates users’ interaction with SERPs,<br />

moving back and forth between search results (A, B, C, D) and<br />

the actual SERP (central point).<br />

Figure 3 - Contrasting Interaction Models<br />

3.1 Information Exploration UI<br />

In contrast, users’ interaction with our information exploration<br />

interface – also referred to as Focus-Metaphor Interface (FMI) [4]<br />

- enables seamless exploration of the underlying information<br />

spaces (see Figure 3, right side). This approach combines a<br />

contextual navigation with the actual display of information (see<br />

Figure 4) and particularly facilitates orienteering behavior [14].<br />

When visualizing search results, the FMI replaces traditional<br />

search engine result pages (see Figure 4-A). Its contextual<br />

interface elements contain snippet-like information previews of<br />

the actual search results, and are arranged around the central<br />

content element which displays details of the currently selected<br />

search result (see Figure 4-B).<br />

Figure 4 - FMI prototype for social tools evaluation<br />

When selecting another contextual element, its state changes: it<br />

enlarges into a content element and moves to the centre of the<br />

screen, replacing the previously displayed search result (see<br />

Figure 4-C). This approach allows “browsing” through search<br />

results whilst preserving contextual awareness of the other search<br />

result snippets. In addition, the chosen layout enables a less


hierarchical and more concurrent display of the “top X” search<br />

results, without requiring any scrolling.<br />

However, the key strength of the FMI model becomes apparent<br />

when none of the presented search results meet the user’s<br />

information need. Rather than having to re-formulate another<br />

search query hoping for more promising search results, the user<br />

can simply pick one of the existing results that she thinks comes<br />

“closest” to what she is looking for, and request similar/related<br />

results. This enables the dynamic adaptation of contextual<br />

elements to the currently displayed content element, without<br />

requiring the user to articulate their information need precisely.<br />

This approach represents a break from traditional search behavior,<br />

as the user does not need to constantly go back to a search<br />

interface to (re-)start a new search session. Instead, an initial<br />

search query is the starting point for a seamless and personalized<br />

orienteering and exploration process that guides the user from one<br />

information nugget to the next. Although Google search provides<br />

related functionality through a link called “similar” available with<br />

some of its search result snippets, this functionality mostly works<br />

at a very abstract level (e.g. sites related by topic), but not on the<br />

actual content level. Microsoft search (live.com) provides “related<br />

searches” through a list of similar search queries. However, this<br />

functionality again seems to only work on a rather abstract level<br />

with more generic search queries.<br />

Another key benefit of the FMI model is that its layout and<br />

interaction paradigm lends itself to novel interaction techniques,<br />

such as touch or even eye-gaze. In an earlier study, we have<br />

demonstrated the effective use of our information exploration<br />

interface with eye-gaze only [6].<br />
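To illustrate the interaction model, here is a minimal Python sketch of the FMI navigation state as we read it from the description above (our illustration, not the actual FMI implementation): a central content element surrounded by contextual snippets, where selecting a snippet promotes it to the centre and, on explicit request, the ring is repopulated with related results. The related callback is a stand-in for whatever similarity backend the real system uses.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class FMIState:
        center: str                            # currently displayed search result
        context_ring: List[str]                # surrounding snippet previews
        related: Callable[[str], List[str]]    # stand-in similarity backend
        ring_size: int = 8

        def select(self, snippet: str) -> None:
            """Promote a contextual snippet to the centre; the ring otherwise
            stays unchanged, preserving contextual awareness."""
            self.context_ring.remove(snippet)
            self.context_ring.append(self.center)
            self.center = snippet

        def request_related(self) -> None:
            """On explicit user request, repopulate the ring with results
            related to the current centre (the 'closest result' strategy)."""
            self.context_ring = self.related(self.center)[: self.ring_size]

    # Usage with a trivial backend.
    state = FMIState(center="result A",
                     context_ring=["result B", "result C"],
                     related=lambda item: [f"{item} / related {k}" for k in range(8)])
    state.select("result B")
    state.request_related()
    print(state.center, state.context_ring)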

3.2 Experimentation<br />

Over 3 years, we have conducted a number of lab-based studies of<br />

various FMI prototype iterations. We evaluated the performance<br />

of and user satisfaction with our prototype against a range of<br />

existing tools, such as individual blogs, blog spaces, Google news,<br />

Google Reader and PARC’s StarTree [4][5][7].<br />

Throughout those studies, task completion times were<br />

significantly faster and error rates were significantly lower using<br />

the FMI than in blog environments (see Figure 5) and on a par<br />

with PARC’s StarTree (which only works for well-formed<br />

information spaces).<br />

Figure 5 - Cross-study comparison<br />

Participants using the FMI had short and very consistent average<br />

fixation durations, which indicate lower cognitive load than in all<br />

compared systems. User feedback through questionnaires and<br />

informal interviews confirmed the ease of use and learnability of<br />

the FMI prototypes for most users.<br />


3.3 Social Tools Study<br />

In our latest study, we used a corpus of domain-specific blog<br />

entries to evaluate a range of social tools, namely the ability to<br />

tag, rate and bookmark any of the articles. We looked at the<br />

impact of 1) ratings on contextual search snippets and 2) tags on<br />

search result presentation (see Figure 6).<br />

Figure 6 – Screenshot of FMI with social tools<br />

The eye-tracking experiment involved 21 participants, 13m/8f,<br />

20-46 years (avg. 25.7). We used a range of factual and<br />

explorative search tasks. For factual search tasks, participants had<br />

to identify a specific article; for explorative search tasks,<br />

participants had to explore a certain topic for a few minutes. In<br />

both cases, we used small scenarios to facilitate intrinsic<br />

motivation in the participants.<br />

For the contextual search result snippets, our analysis of post-experiment usability questionnaires (Likert scale, 1-6) revealed

that participants found the “5 star rating” functionality very quick<br />

and easy to use (5.5). The ability to have ratings displayed in the<br />

contextual navigation elements was rated significantly higher than<br />

the perceived impact on users’ navigational decisions (4.8 vs. 4.0,<br />

t(20) = 2.09, p < 0.02).

However, analysis of the eye-tracking data shows that participants' awareness of the actual ratings was substantial, considering the rating's small size within the contextual search snippet (see Table 1).

Table 1 - Search snippet attention distribution (relative gaze time)
Rating: 17.1 %
Title: 54.4 %
Description: 28.5 %

Within this study of social tools for the FMI, selecting a “new”<br />

central content element automatically updated the contextual<br />

elements to display the articles most similar/related to the newly activated content element. However, user feedback showed that

the automatic contextualization of relevant search snippets is too<br />

volatile for users’ taste. For future studies, we have therefore<br />

settled on a static/persistent contextual visualization that (only)<br />

adjusts to the currently displayed content element upon request by<br />

the user.<br />

4. SafeTRIP FMI<br />

With the large number of services available through SafeTRIP,<br />

searching through services and information, using traditional


methods and interfaces of in-car systems, can prove to be time-consuming. Inefficient search therefore has a detrimental impact

on the driver’s attention and thus on road safety.<br />

As FMI has proven to be an effective tool for searching and<br />

presenting information, we believe that its application to the in-car<br />

environment would be beneficial to the driver. We have identified<br />

some application areas for the SafeTRIP in-car interface that<br />

could benefit from this approach.<br />

Service Search<br />

SafeTRIP is an open platform, allowing third-party applications/services to be made available to drivers. With dozens of services already planned and new ones appearing over time, the traditional icon/menu-based interface in most in-car systems may not be appropriate. With FMI, drivers will be able to search through hundreds of services and locate the ones that are most relevant. As our studies show, precise

search criteria may be difficult to formulate – especially when<br />

searching for a new service. Also, if the user goes down the wrong<br />

search path, he can explore information sets that look relevant,<br />

without reformulating the search all over again.<br />

Search Traffic Info<br />

Typically, drivers combine traffic information from various<br />

sources to make decisions while driving. With new services in<br />

SafeTRIP, traffic information will be available from yet more<br />

sources – namely road operators, other drivers, authorities and<br />

traffic information providers. The reliability and timeliness of<br />

such information differs across sources – and drivers know how to<br />

exploit these differences. FMI can be used to provide an efficient<br />

mechanism to search for the most appropriate information, given<br />

that complete automation is unlikely as drivers use a mix of<br />

information sources based on their personal preferences.<br />

Display Traffic Info<br />

With SafeTRIP, we plan to provide rich traffic information to the<br />

drivers. On the motorway, variable speed restrictions (e.g. in the

event of a road incident) will be sent to the vehicle (instead of<br />

being displayed on a Variable Message Sign) with some details<br />

about the incident. It is expected that drivers would be more likely<br />

to respect the new speed restrictions if they are aware of the<br />

underlying reason. However, the display of rich information can

lead to information overload or inattentional blindness – causing<br />

the driver to ignore the important information in the messages.<br />

The layout of information in the FMI is designed to be<br />

minimalistic, providing as much relevant information as a user<br />

can process effectively, allowing for easy decision making and<br />

exploration of further relevant information.<br />

Entertainment Selection Interface<br />

Remote controls fitted to the steering wheel are a definite<br />

improvement that allows drivers to interact with the in-car<br />

entertainment system without taking their eyes off the road.<br />

However, with the explosion of entertainment options – both<br />

audio and video – through the SafeTRIP platform, it is likely that<br />

such solutions will quickly show their limitations. We believe that<br />

the FMI approach would allow the driver to quickly and<br />

efficiently search through the entertainment options.<br />

5. CONCLUSION<br />

It is clear to us that web-based search benefits from the FMI approach, as demonstrated by our experimental results. With the increase in the number of services available in the car – such as those provided through SafeTRIP – there is a real need for an effective and efficient way to search for and interact with those services. We therefore believe that in-car systems would greatly benefit from the FMI approach: by decreasing search time, it would improve the driver's attention on the road and contribute to road safety.

6. REFERENCES<br />

[1] Bly, S., Schilit, B., McDonald, D.W., Rosario, B., Saint-<br />

Hilaire, Y., Broken expectations in the digital home, Ext.<br />

Abstracts CHI 2006, ACM Press(2006), 568-573.<br />

[2] Cutrell, E. et al. (2006). Fast, Flexible Filtering with Phlat –<br />

Personal Search and Organization Made Easy. In<br />

Proceedings of CHI 2006, Montreal, Canada.<br />

[3] Kraft, R. et al. (2006). Searching with Context. In Proc. of<br />

International World Wide Web Conference (WWW ’06),<br />

(Edinburgh, Scotland, 2006). ACM Press.<br />

[4] Laqua, S. and Brna, P. The Focus-Metaphor Approach: A<br />

Novel Concept for the Design of Adaptive and User-Centric<br />

Interfaces. In Proc. Interact 2005, Springer (2005), 295-308.<br />

[5] Laqua, S. and Sasse, M.A. (2009). Exploring Blog Spaces: A<br />

Study of Blog Reading Experiences using Dynamic<br />

Contextual Displays. In: Proc. HCI 2009, Cambridge, UK.<br />

[6] Laqua, S., Bandara, S. U., and Sasse, M.A. (2007)<br />

GazeSpace: Eye Gaze Controlled Content Spaces. In Proc.<br />

HCI 2007, Vol.2, 21 st BCS HCI Group Conference (2007).<br />

[7] Laqua, S., Ogbechie, N., and Sasse, M.A. (2007).<br />

Contextualizing the Blogosphere: A Comparison of<br />

Traditional and Novel User Interfaces for the Web. In Proc.<br />

HCI 2007, Vol.2, 21 st BCS HCI Group Conference.<br />

[8] Lee, J. D., Hoffman, J. D., and Hayes, E. 2004. Collision<br />

warning design to mitigate driver distraction. In Proceedings<br />

of the SIGCHI Conference on Human Factors in Computing<br />

Systems (Vienna, Austria, April 24 - 29, 2004). CHI '04.<br />

ACM, New York, NY, 65-72.<br />

[9] NHTSA. The impact of Driver Inattention on Near-<br />

Crash/Crash Risk.<br />

http://www.nhtsa.gov/Research/Human+Factors/Distraction<br />

[10] Orbcomm. http://www.orbcomm.com<br />

[11] OmniTRACS. http://www.qualcomm.com<br />

[12] Pirolli, P. (2007). Information Foraging Theory. Oxford<br />

University Press.<br />

[13] Seager, W., Knoche, H., Sasse, M.A., TV-centricity -<br />

Requirements gathering for triple play services. In<br />

Interactive TV: A Shared Experience TICSP Adjunct<br />

Proceedings of EuroITV (2007), 274-278.<br />

[14] Teevan, J. et al. The perfect search engine is not enough: a<br />

study of orienteering behavior in directed search. In Proc.<br />

CHI ’04. (Vienna, Austria, 2004)<br />

[15] Teevan, J. et al. Beyond the Commons: Investigating the<br />

Value of Personalizing Web Search. In Proc. of Workshop on<br />

New Technologies for Personalized Information Access<br />

(PIA). (Edinburgh, UK, 2005).<br />

[16] Zeng, H. J. et al. Learning to Cluster Web Search Results. In<br />

Proceedings of SIGIR ’04, Sheffield, United Kingdom, 2004.<br />

ACM Press, 210-217


Discover Significant Situations for User Interface<br />

Adaptations<br />

Sandro Rodriguez Garzon<br />

Daimler Center for Automotive Information

Technology Innovations<br />

HMI Group<br />

sandro.rodriguez.garzon@dcaiti.com<br />

ABSTRACT<br />

Over the last years, environmental awareness has become an important research topic in the field of adaptive user interfaces. Especially in the area of location-based services, context-aware interfaces started using models of the environment in conjunction with sophisticated user models to filter information relevant to the user. Despite the tight coupling of context-aware computing and user modeling, little research has focused on the correlation between a user preference and the context in which that preference was inferred. Considering a user preference as a certain human-interface interaction that happens regularly within a similar context, this paper introduces a method to detect significant situations of frequent user interactions that occurred within similar environments. As an example, the paper discusses the definition of a personalization use case within the automotive environment: adapting the user interface based on discovered user-initiated radio station changes depending on the user's location.

Author Keywords<br />

context awareness, situation discovery, personalization, adaptation,<br />

intelligent user interface, temporal pattern<br />

ACM Classification Keywords<br />

H.5.2 User Interfaces: Theory and methods; H.3.3 Information<br />

Search and Retrieval: Miscellaneous<br />

INTRODUCTION<br />

Since the Active Badge Location System [7], many researchers have been interested in designing context-aware user interfaces. Considering the increase in complexity and functionality of user interfaces, several researchers identified ways and means to increase usability by displaying prefiltered information or modified user interface controls. While some approaches aimed at detecting similar interactions in order to apply personalized filters, other approaches proposed methods to build adaptation-ready user interfaces.


Kristof Schütt<br />

Daimler Center for Automotive Information

Technology Innovations<br />

HMI Group<br />

kschuett@cs.tu-berlin.de<br />

Context-aware computing focused on gathering the context of an entity to build a machine-readable model of the environment. With the help of these models it was possible to adapt a user interface dynamically, following the "one-fits-all" paradigm: a predefined rule specified how environmental factors influence the user's interaction with the system. In contrast, research in user modeling focused on obtaining an accurate model of the user by applying sophisticated data mining methods. These user models were used to build user-centric adaptive systems that take the user's needs into account. Unfortunately, most user modeling techniques were constrained to detecting application-specific user preferences. But in different contexts a user may prefer to interact with the system in different ways.

Thus, this work on the discovery of significant situations is motivated by the desire to personalize user interfaces depending on their use in similar contexts. Contextual personalization can be seen as the process of bringing together the context and the user preferences. Our intent is not to construct a context-dependent user model but to detect situations that are likely to be followed by predictable user interactions. In order for the method to be applicable in the real world, our approach ensures unsupervised processing of user interactions without the need to prompt the user for explicit feedback.

RELATED WORK<br />

A promising idea concerning location-aware service personalization is presented by Coutand [2]. Coutand uses a case-based reasoning approach to calculate similarities between records of service use, enriched with location-dependent properties, in order to deduce preferred service utilization. An approach of clustering context data in order to determine whether a current context belongs to an already sensed context is presented by Flanagan [6]. By expressing the context in a symbolic form he is able to develop an unsupervised learning algorithm that extracts and groups similar contexts as context states. This idea is very close to our work since our approach groups context as well. The difference lies in the way the environment is sensed. Our approach uses prespecified temporal event patterns to extract interaction traces that are annotated with context features. Those interaction traces are clustered according to their multiple contexts, in contrast to [6], where independent instances of context feature vectors are grouped.

The notion of a temporal event pattern as an appropriate representation of a user preference is also mentioned in [3]. Cram proposes a method to interactively detect recurring user interaction sequences in order to enhance context-aware assisting systems. Unfortunately, Cram's approach is not applicable within the automotive environment because the user has to be involved in the process of discovering regular task signatures.

DEFINITIONS<br />

Following Etzion's definition [5] of an event as an occurrence in the real world and its virtual representation, we introduce the notion of an interaction event.
DEFINITION 1. Interaction Event. An arbitrary user interaction affecting the user interface or its environment, represented by means of a virtual object.

An interaction event will be generated by the user interface<br />

and processed by the frequent interaction discovery component.<br />

The definition of the interaction event incorporates all<br />

events thrown directly or indirectly by the user interface as<br />

well as events triggered by a change of the state of the environment.<br />

The term environment will be used as the superset<br />

of context, which is defined as all elements of the environment

which the user’s computer knows about [1]. Given the<br />

following definitions<br />

DEFINITION 2. Action. The concrete occurrence of an<br />

interaction event sequence.<br />

DEFINITION 3. Situation. A period of time in which certain<br />

conditions are satisfied indicating a probable occurrence<br />

of a known action.<br />

our prototype distinguishes between the user interaction, namely the action, and the state, called situation, in which our prototype assumes it knows what the user will do next. An action is declared frequent if the number of reoccurrences exceeds a prespecified limit. A situation is significant if the predicted action is frequent.

SITUATION DETECTION<br />

The challenge lies in detecting significant situations in an arbitrary stream of interaction events. The probability that a certain action reappears in exactly the same constellation within the same context is very low. Therefore, to detect a reoccurrence of an action, a notion of similarity between actions that takes their context into account is needed. This work distinguishes between the comparison of actions comprising one event and of actions comprising multiple events. The section "Interaction Event Processing" discusses the former case, while the section "Co-Situations" examines the latter.

We decided to split the process of significant situation detection<br />

into three successive subprocesses: Action discovery,<br />

context discovery and situation discovery. In the action discovery<br />

subprocess a prespecified event pattern is searched<br />

within the stream of interaction events. A concrete sequence<br />

of events found by the event engine is declared to be an action.<br />

The context discovery subprocess collects all actions<br />

and groups them by specified context features. The result<br />


of the context discovery subprocess is made up of groups containing actions, where each group is characterized by several action-specific group properties. If the number of members of a group exceeds a limit, all members are declared frequent. This conclusion is valid because all actions of a group are assumed to be similar. The group properties

describe a common environment of all the actions that<br />

are contained within that group. To detect a situation likely<br />

to contain a reoccurrence of a frequent action it is necessary<br />

to search for an event sequence that is parametrized by<br />

the property values of the group the frequent action belongs<br />

to. This process is called situation discovery. If the process<br />

encounters a compatible interaction event sequence the user<br />

interface will be notified of the significant situation.<br />
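Read as a pipeline, the three subprocesses compose as in the following Python sketch (our illustration; the concrete steps are passed in as functions, and possible versions of them are sketched in the subsections below).

    def significant_situation_pipeline(events, discover_actions, group_actions,
                                       make_situation_checker, notify_ui):
        """Compose the three subprocesses; the concrete steps are passed in as
        functions (possible versions are sketched in the subsections below)."""
        actions = list(discover_actions(events))                 # 1. action discovery
        groups = group_actions(actions)                          # 2. context discovery
        checkers = [make_situation_checker(g) for g in groups]   # 3. situation discovery

        def on_context_update(current_station, current_segment_id):
            # Called whenever the context changes; notifies the UI when any
            # parametrized situation pattern matches the current context.
            if any(check(current_station, current_segment_id) for check in checkers):
                notify_ui()
        return on_context_update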

INTERACTION EVENT PROCESSING<br />

The process of significant situation detection is supported by use-case-specific data mining instructions. These instructions are specified beforehand by an expert and used at runtime to assist the data mining process in extracting significant situation information for specific personalization use cases. In the following discussion we use the term use case specification to subsume all instructions belonging to a certain use case. The explanation of the necessary specification steps and of the event processing itself is accompanied by an automotive example: the detection of user-initiated radio station changes depending on the user's location. The intention of the personalization is to provide the user with a system-generated proposal to change the radio station automatically at a certain location, triggered by the detection of a significant situation.

Context<br />

During runtime, every interaction event occurs within a certain environment. In our approach, the environment is represented by a fixed set of attributes and their situation-dependent values. Since the environment comprises an almost infinite number of environmental factors, it is necessary to define a subspace of environmental factors that are relevant to the specific use cases. Hence, the use case specification must contain a context definition as an enumeration of context features that will be attached to every interaction event. In the case of the radio example, two context features were identified as use-case-specific environmental factors: the name of the radio station and the unique identification number (id) of the current road segment.

Action Discovery<br />

The use case specification must also contain a prespecified<br />

interaction event pattern describing the action that should be<br />

investigated in detail. The event pattern is constructed by<br />

combining logical and temporal operators to form complex event sequence descriptions with event-specific filter criteria. Since the prototypical implementation uses Esper [4] as the underlying event processing engine, most of the available operators have their counterpart in Esper's event processing language (EPL). Considering the radio example, it is necessary to specify an accurate but generic event pattern representing the user's action of changing the radio station: the radio station change should only be taken into account if the user-initiated radio station change is not followed by any further radio station change within the next 10 minutes. During initialization the pattern is passed to the complex event processing engine, which starts looking for compatible sequences. If the engine encounters a fitting sequence of interaction events, it relays the concrete event sequence, namely the action, to the subprocess of context discovery.
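A minimal Python sketch of this action-discovery rule (our illustration, not the Esper EPL statement used by the prototype), assuming a simple event record carrying a kind, a timestamp and the attached context features:

    from dataclasses import dataclass, field

    @dataclass
    class Event:
        kind: str                   # e.g. "RadioChange" or "Location"
        timestamp: float            # seconds
        context: dict = field(default_factory=dict)   # attached context features

    QUIET_PERIOD = 600              # 10 minutes without a further station change

    def discover_actions(events):
        """Yield radio-change events that are not followed by another radio
        change within QUIET_PERIOD seconds (the action pattern of the example)."""
        changes = [e for e in events if e.kind == "RadioChange"]
        for i, ev in enumerate(changes):
            nxt = changes[i + 1] if i + 1 < len(changes) else None
            if nxt is None or nxt.timestamp - ev.timestamp > QUIET_PERIOD:
                yield ev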

Context Discovery<br />

The main task of the context discovery subprocess is to group<br />

all incoming actions based on prespecified criteria. As stated<br />

above, an action can be composed of an arbitrary number<br />

of temporally ordered events. Therefore, it is necessary to<br />

define a selection of specific events and corresponding context<br />

features of an action that are considered during a comparison<br />

between two actions. In other words, the similarity<br />

measure between two actions is calculated by comparing a fixed set of prespecified context features. The result

of the grouping process is a set of groups of similar actions.<br />

The actions will be similar regarding the environment<br />

they occurred in. In turn, the characteristics of each group<br />

can be interpreted as a description of the common environment.<br />

Considering the radio example, we are only interested<br />

in grouping actions by the radio station and the unique road<br />

segment id. Thus, each group subsumes a certain case of<br />

user behavior within a certain environment. In this sense, a<br />

group is characterized by two properties: name of radio station<br />

and road segment ids. The former property will contain<br />

the name of the radio station while the latter property will<br />

contain a subnetwork of the road network represented by a<br />

set of unique road segment ids. Actions are compared by radio station name using a string comparison and by road segment id using a network distance comparison. Radio station changes with the same radio station name but occurring in neighboring road segments are assigned to the same group. A group will only be considered in the next subprocess if it contains a sufficient number of actions. Such a group is called a significant group.
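The grouping step could look like the following Python sketch (our illustration, building on the Event record sketched above). The context keys, the group-size threshold and the numeric stand-in for the road-network distance are assumptions, not values from the paper:

    MIN_GROUP_SIZE = 5    # assumed threshold; the paper only speaks of a prespecified limit
    NEIGHBOR_RADIUS = 1   # assumed cutoff standing in for a real network distance

    def segment_distance(seg_a, seg_b):
        # Stand-in for a road-network distance between two segment ids; integer
        # ids and a numeric difference serve as a rough proxy for adjacency here.
        return abs(seg_a - seg_b)

    def group_actions(actions):
        """Group radio-change actions by station name and neighboring road segments."""
        groups = []   # each group: {"station": str, "segments": set, "actions": list}
        for act in actions:
            station = act.context["station"]
            segment = act.context["road_segment_id"]
            for g in groups:
                if g["station"] == station and any(
                        segment_distance(segment, s) <= NEIGHBOR_RADIUS
                        for s in g["segments"]):
                    g["segments"].add(segment)
                    g["actions"].append(act)
                    break
            else:
                groups.append({"station": station,
                               "segments": {segment},
                               "actions": [act]})
        # only groups with enough reoccurrences of the same behavior are significant
        return [g for g in groups if len(g["actions"]) >= MIN_GROUP_SIZE]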

Situation Discovery<br />

Finally, the situation discovery subprocess uses the characteristics<br />

of the significant groups found by the previous subprocess to parametrize a generic interaction event pattern, namely the situation pattern. The situation pattern describes a moment in which a group-specific action is expected to

reappear. The way the expert specifies the generic situation<br />

pattern is similar to the way the action itself was specified before. The difference lies in the fact that the context

features of the events within the event pattern will be constrained<br />

by the discovered group characteristics. During runtime,<br />

several significant groups will be identified by the context<br />

discovery subprocess. For each group a new instance of<br />

the situation pattern will be generated with different context<br />

feature constraints. This group-specific parametrization allows

the event engine to use each generated situation pattern<br />

instance to discover actions that occur within the common<br />

environment of the corresponding group. As a consequence,<br />

the event processing engine is able to find significant situations<br />

as a result of detecting parametrized event sequences.<br />

Considering the radio example, it is necessary to define a situation pattern that clearly describes an event sequence that is

likely to be followed by a known radio station change event.<br />

To describe such a situation we include two conditions into<br />

the generic temporal event pattern: 1. The last radio station<br />

change resulted in a switch to a radio station different from the one found in the group property "name of radio station"

and 2. the current road segment id - originating from a<br />

location event - is part of the subnetwork found in the group<br />

property ”road segment ids”. Each significant group of radio<br />

station changes will start a process of detecting a significant<br />

situation in which the road segment is similar and the current<br />

radio station is different. Triggered by the notification of a<br />

significant situation, the prototype is able to propose a radio<br />

station change.<br />
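Instantiating one parametrized check per significant group can then be sketched as follows (our illustration, reusing the group structure from the previous sketch):

    def make_situation_checker(group):
        """Return a predicate for the significant situation derived from one
        significant group: the currently tuned station differs from the group's
        station while the current road segment lies in the group's subnetwork."""
        def is_significant(current_station, current_segment_id):
            return (current_station != group["station"]
                    and current_segment_id in group["segments"])
        return is_significant

    # One checker is created per significant group; on a match the user interface
    # is notified and may propose switching to group["station"].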

CO-SITUATIONS<br />

So far, the context of only one event - the radio station change - was used to group actions. But what happens if actions are composed of multiple events and the comparison should consider two different contexts of two events? In this case the prototype would first split the grouping into separate grouping procedures for each event. A powerful application would be to detect significant situations based on several temporally ordered events, each parametrized by the properties of a different significant group. Such a new type of situation would enable the specification of a causal sequence of arbitrary situations that forms a new significant situation.

DEFINITION 4. Co-Situation. A period of time in which certain temporally ordered conditions describing multiple situations are satisfied, indicating a probable occurrence of a known action.

Let's return to the radio example and consider its application in the real world while entering and leaving a tunnel. Assume two significant groups of radio station changes were found, one at each end of the tunnel. Both groups were detected because one radio station is received poorly on one side of the tunnel and the other station poorly on the other side. Without taking into account the direction in which the car is moving, the prototype may propose a wrong radio station change while entering the tunnel. This misleading personalization happens because the location of a tunnel exit may match the location of a tunnel entrance; in this case it does not matter from which direction the driver approaches a significant situation.

In order to consider the moving direction, the event pattern of the action discovery subprocess must be extended by two location events preceding the first event that changes the radio station. The modified event pattern considers multiple temporally ordered events that may happen in different contexts. To discover co-situations, the actions are no longer grouped by the context of a single event but by the contexts of the two location events. In this sense, each detected group is finally characterized by the unique radio station and a pair of contexts: one context describing a region before entering the tunnel and another context describing a region at the exit of the tunnel.


Figure 1. Extended radio example: Relation between the environment and the order of event occurrence. (Two panels, ActionDiscovery and SituationDiscovery, show the car passing the tunnel and the prototype's reaction to the event sequence Location, Location, RadioChange: during action discovery, the prototype waits 600 sec after the radio change before reporting the encountered action; during situation discovery, once the two known contexts are found in order and the current station differs from the known one, the UI is notified.)

Given all the required properties of a regular user interaction, it is necessary to extend the situation pattern as well, so that both contexts are detected with respect to the causal order of their occurrence. The situation event pattern therefore describes an event pattern that looks for location events within the context of the tunnel entry followed by location events within the context of the tunnel exit. A significant co-situation is encountered if the car first passes the context in front of the tunnel and then the context at the exit of the tunnel while the radio is switched to a different radio station. Figure 1 visualizes the co-situation. Grouping by driving direction is not necessary, as the causal order of the grouped contexts suffices to trigger the intended radio station change.

IMPLEMENTATION<br />

As a sample user interface we implemented a prototypical in-car infotainment system based on ActionScript in combination with a context simulator. The prototype itself is implemented in Java with Esper [4] as its complex event processing engine. In order for the prototype to be independent of a particular user interface, we decided to use XML as the underlying representation language for events. To test the prototype under realistic conditions we used context recordings of several tracks to simulate the environment.
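As an illustration of how such a situation pattern could be expressed with Esper's event processing language, the following Java sketch registers one pattern instance for a single hypothetical significant group (station "SWR3", road segments 12-14). The event classes, the EPL statement and the station/segment values are assumptions made for illustration and use the classic Esper client API; the prototype's actual patterns and XML event representation are not shown here.

import com.espertech.esper.client.*;

// Minimal sketch (not the prototype's actual code): one situation-pattern instance
// parametrized by an assumed significant group.
public class SituationPatternSketch {

    // Assumed simple event beans; the real prototype exchanges XML-based events.
    public static class RadioChangeEvent {
        private final String station;
        public RadioChangeEvent(String station) { this.station = station; }
        public String getStation() { return station; }
    }
    public static class LocationEvent {
        private final int segmentId;
        public LocationEvent(int segmentId) { this.segmentId = segmentId; }
        public int getSegmentId() { return segmentId; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("RadioChangeEvent", RadioChangeEvent.class);
        config.addEventType("LocationEvent", LocationEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Condition 1: the last change switched to a station other than the group's station.
        // Condition 2: a location event lies inside the group's road-segment subnetwork.
        String epl = "select loc.segmentId as segment from pattern ["
                + " every rc=RadioChangeEvent(station != 'SWR3')"
                + " -> loc=LocationEvent(segmentId in (12, 13, 14)) ]";

        EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
        stmt.addListener((newEvents, oldEvents) ->
                System.out.println("Significant situation: propose switching to SWR3"));

        // Feeding events would then trigger the listener, e.g.:
        engine.getEPRuntime().sendEvent(new RadioChangeEvent("Other FM"));
        engine.getEPRuntime().sendEvent(new LocationEvent(13));
    }
}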

CONCLUSION AND FUTURE WORK<br />

We have presented a prototype that is able to discover significant situations based on the common environment of frequent user interactions. Supported by use case descriptions specified by an expert, the prototype detects similarities between user interactions and infers the corresponding environments needed to detect situations in which a user interaction is likely to reappear. Although we did not focus on time or space efficiency, we have to acknowledge that space consumption in particular is a critical factor for deployment within embedded systems. Since in the current implementation all detected actions need to be stored along with their context, it is necessary to subsume actions and to constrain the validity of an action. A first step towards a space-optimized solution was taken by limiting the influence of actions: if an action is too old it will be discarded. Furthermore, we limited the number of actions per situation independently of the time the action was detected.


During our work we identified some important refinements and extensions for future work. In particular, we will provide the expert with the ability to specify a minimal probability of occurrence for a significant situation as an additional trigger condition. Up to now the prototype reports a significant situation whenever a certain number of similar actions have happened within a certain context. This notion can be extended by reporting only when the probability of occurrence exceeds a user-defined limit. In order to calculate the current likelihood of an action we also have to observe situations in which the action has not been executed. Since it is nearly impossible to accumulate all non-occurrences of a use case, we will constrain the observation to situations that are already associated with a concrete action of the use case. This is possible because the boundary of the situation is naturally given by the situation pattern.

REFERENCES<br />

1. P. J. Brown. The stick-e document: a framework for<br />

creating context-aware applications. In EP, pages<br />

259–272, 1996.<br />

2. O. Coutand, S. Haseloff, S. L. Lau, and K. David. A<br />

Case-based Reasoning Approach for Personalizing<br />

Location-aware Services. In Workshop on Case-based<br />

Reasoning and Context Awareness, 2006.<br />

3. D. Cram, B. Fuchs, Y. Prié, and A. Mille. An approach<br />

to User-Centric Context-Aware Assistance based on<br />

Interaction Traces. In Int. Workshop Modeling and<br />

Reasoning in Context, pages 89–101, 2008.<br />

4. EsperTech Inc. Complex Event Processing.<br />

http://esper.codehaus.org/, Last access:<br />

30-12-2010.<br />

5. O. Etzion and P. Niblett. Event Processing in Action.<br />

Manning Publications Co., Greenwich, USA, 2010.<br />

6. J. A. Flanagan. Unsupervised clustering of context data<br />

and learning user requirements for a mobile device. In<br />

Int. and Interdisciplinary Conf. on Modeling and Using<br />

Context, pages 155–168, 2005.<br />

7. R. Want, A. Hopper, V. Falcão, and J. Gibbons. The Active Badge Location System. ACM Transactions on Information Systems, pages 91–102, 1992.


A new interaction technique based on eye tracking and<br />

single switch scanning systems<br />

Pradipta Biswas
Engineering Design Centre, Department of Engineering, University of Cambridge, UK
E-mail: pb400@cam.ac.uk

Pat Langdon
Engineering Design Centre, Department of Engineering, University of Cambridge, UK
E-mail: pml24@eng.cam.ac.uk

ABSTRACT<br />

In this paper we present a new input interaction system for people with severe disabilities. The system is based on eye gaze tracking and single switch scanning interaction techniques. It combines eye gaze tracking and scanning in a way that is faster than scanning-only systems and more comfortable to use than systems based only on eye gaze tracking, which is also supported by a user study. We also point out a few applications of the system beyond computer accessibility.

Categories and Subject Descriptors<br />

D.2.2 [Software Engineering]: Design Tools and Techniques<br />

– user interfaces; K.4.2 [Computers and Society]:<br />

Social Issues – assistive technologies for persons<br />

with disabilities<br />

General Terms<br />

Algorithms, Experimentation, Human Factors<br />

Keywords<br />

Assistive Technology, Eye gaze tracker, Scanning, Usability<br />

Evaluation.<br />

1. INTRODUCTION<br />

Many physically challenged users cannot interact with a computer through a conventional keyboard and mouse. For example, spasticity, Amyotrophic Lateral Sclerosis (ALS), and Cerebral Palsy confine movement to a very small part of the body. Two possible solutions for these users are eye gaze tracking based input systems and scanning systems. An eye gaze tracking based system removes the need for a mouse and keyboard and enables the user to control the mouse pointer using only eye gaze; such users can also use a virtual keyboard as an alternative to a normal keyboard.

Scanning is the technique of successively highlighting items on a computer screen and pressing a switch when the desired item is highlighted. Research on eye gaze tracking and scanning systems for assistive technology has mainly been explored in the field of alternative and augmentative communication (AAC) devices [7,8,11,13]. A plethora of commercial and research products are available which help people with disabilities to communicate using eye gaze tracking or scanning interfaces [11].

However, navigation to arbitrary locations on a screen has also become important as graphical user interfaces are more widely used. A review of existing scanning systems for screen navigation can be found in a separate paper [3]. The main disadvantage of these systems is that they are slow to operate. Many eye tracking based interfaces for people with disabilities use the eye gaze as a binary input, like a switch press through a blink [6, 13], but the resulting systems remain as slow as scanning systems.

Zhai [14] presents a detailed list of advantages and disadvantages of using eye gaze based pointing devices. In short, using the eye gaze to control the cursor position poses several challenges:

Strain: It is quite strenuous to control the cursor through eye gaze for a long time, as the eye muscles soon become fatigued. Fejtova and colleagues [9] reported eye strain in six out of ten able-bodied participants in their study.

Accuracy: The eye gaze tracker does not always work accurately; even the best eye trackers provide an accuracy of about 0.5° of visual angle, which often makes clicking on small targets difficult. Donegan and colleagues [5] also reported problems with the precision and speed of an eye gaze based system. Existing eye gaze based AAC systems therefore often change the screen layout and enlarge screen items, but this is not a scalable solution.

Clicking: Clicking on or selecting a target using only eye gaze is also a problem. It is generally performed through increased dwell time or blinking, but either solution increases the chance of false positives or missed clicks.

We tried to solve these problems by combining eye gaze tracking and a scanning system in a unique way. Any pointing movement has two phases [10]: an initial ballistic phase, which brings the pointer near the target, and a homing phase, consisting of one or more precise sub-movements to home in on the target. We use eye gaze tracking for the initial ballistic phase and switch to the scanning system for the homing phase and clicking. The approach is similar to the MAGIC system [14], though it replaces the regular pointing device with the scanning system. Our system works in the following way.

2. The proposed system<br />

Initially, the system moves the pointer across the screen based on the eye gaze of the user. The user sees a small button moving across the screen, placed approximately where they are looking. We extract the eye gaze position using the Tobii SDK [12] and use an averaging filter that updates the pointer position every 500 msec. The user can switch to the scanning system by pressing a key at any time during eye tracking. When they look at the target, the button (or pointer) appears near or on the target; at this point the user presses a key to switch to the scanning system for homing and clicking on the target.
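The following minimal Java sketch illustrates the kind of averaging filter described above; it is not the actual Tobii SDK integration, and the class and method names are assumptions. Raw gaze samples are accumulated and the pointer is moved to their mean position every 500 msec.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the 500 ms averaging of raw gaze samples.
class GazeAverageFilter {
    private static final long UPDATE_INTERVAL_MS = 500;
    private final Deque<int[]> samples = new ArrayDeque<>();
    private long lastUpdate = System.currentTimeMillis();

    /** Called for every raw gaze sample; returns a new pointer position every 500 ms, else null. */
    int[] onGazeSample(int x, int y) {
        samples.add(new int[] { x, y });
        long now = System.currentTimeMillis();
        if (now - lastUpdate < UPDATE_INTERVAL_MS) {
            return null;                      // keep the pointer where it is
        }
        long sumX = 0, sumY = 0;
        for (int[] s : samples) { sumX += s[0]; sumY += s[1]; }
        int[] average = { (int) (sumX / samples.size()), (int) (sumY / samples.size()) };
        samples.clear();
        lastUpdate = now;
        return average;                       // move the on-screen button here
    }
}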

We have used a particular type of scanning system, known as eight-directional scanning [3], to navigate across the screen. In the eight-directional scanning technique the pointer icon changes at regular time intervals to show one of eight directions (Up, Up-Left, Left, Left-Down, Down, Down-Right, Right, Right-Up). The user chooses a direction by pressing the switch when the pointer icon shows the required direction. Once the direction has been chosen, the pointer starts moving. When the pointer reaches the desired point on the screen, the user makes another key press to stop the pointer movement and make a click. A state chart diagram of the scanning system is shown in Figure 1, which is the same for user and device spaces in this case. A demonstration of the scanning system can be seen at http://www.youtube.com/watch?v=0eSyyXeBoXQ&feature=user.

The user can move back to the eye gaze tracking system from the scanning system by selecting the exit button in the scanning interface (Figure 2). A couple of videos of the system can be found at the following links.
Screenshot: http://www.youtube.com/watch?v=UnYVO1Ag17U
Actual usage: http://www.youtube.com/watch?v=2izAZNvj9L0
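The following sketch outlines the single-switch logic of the eight-directional scanning described above. It is an illustrative reconstruction under assumed names, not the authors' implementation: one switch press fixes the currently highlighted direction, and a second press stops the movement and issues the click.

// Sketch of the eight-directional scanning state machine (assumed names).
class EightDirectionalScanner {
    enum Direction { UP, UP_LEFT, LEFT, DOWN_LEFT, DOWN, DOWN_RIGHT, RIGHT, UP_RIGHT }
    enum State { CHOOSING_DIRECTION, MOVING }

    private State state = State.CHOOSING_DIRECTION;
    private int directionIndex = 0;            // which direction the pointer icon currently shows
    private Direction chosen;

    /** Called once per scan interval (e.g. every 1 s) while a direction is being chosen. */
    void onScanTick() {
        if (state == State.CHOOSING_DIRECTION) {
            directionIndex = (directionIndex + 1) % Direction.values().length;
        }
    }

    /** Called on every single-switch press. Returns true when a click should be issued. */
    boolean onSwitchPress() {
        if (state == State.CHOOSING_DIRECTION) {
            chosen = Direction.values()[directionIndex];
            state = State.MOVING;              // pointer now moves along 'chosen'
            return false;
        }
        state = State.CHOOSING_DIRECTION;      // stop the movement and click here
        return true;
    }

    Direction currentDirection() { return chosen; }
}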

The technique is faster than a scanning-only interface, as users can move the pointer across large distances on the screen more quickly with their eye gaze than with a single switch scanning interface alone. It is also less strenuous than eye-gaze-only interfaces, because users can switch back and forth between eye gaze tracking and scanning, which gives the eye muscles a rest. Additionally, since users do not need to home in on a target using eye gaze, they are relieved from staring at a target for a long time to home in and click on it. Finally, this technique does not depend on the accuracy of the eye tracker, as eye tracking is only used to bring the cursor near the target (as opposed to on the target), so it can be used with low cost and low accuracy webcam based eye trackers.

Figure 1. State Transition Diagram of the eight-directional scanning mechanism with a single switch

Figure 2. Screenshot of the scanning interface<br />

The only disadvantage of the technique is that it may seem slower than an eye-gaze-only interface, as users need to switch to the slower scanning technique for each pointing task. We therefore conducted the following user study to compare the speed of our system with that of eye-gaze-only pointing.

3. The study<br />

3.1. Procedure<br />

We conducted the ISO 9241 pointing task with three target widths (20, 30 and 40 pixels) and three target amplitudes (180, 240 and 300 pixels). Each participant undertook the task in two conditions: using only eye gaze for pointing, or using both eye gaze and eight-directional scanning for pointing. None of the users had used this system before and they were trained adequately before undertaking the trials. The training data was not used in the analysis.

3.2. Material<br />

We used a desktop with 12.5’ monitor having 1280 Х 800<br />

pixels running Windows 7 operating system. We used a<br />

Tobii X120 eye tracker [12] with the Tobii SDK and an<br />

averaging filter to detect points of eye gaze fixation. Figure<br />

3 shows a snapshot of the experimental set up. None<br />

of the participants have any problem with the set up.<br />

Figure 3. Experimental set up<br />

3.3. Participants<br />

We collected data from 8 able-bodied participants (7 male and 1 female) with an average age of 27. We expect the results would not be substantially different for users with disabilities because:
o We assume that disabled users who can use an eye gaze based system will have eye muscles as strong as those of able-bodied users.
o Our previous study [4] did not find any statistically significant difference between able-bodied and disabled users for the scanning interface.

3.4. Results<br />

The mean movement time was higher in the eye tracking plus scanning system, while the variance was higher in the eye-tracking-only system (Figure 4). However, the difference was not significant in an unequal variance t-test (p > 0.05). We compared the average movement time for each input modality with respect to individual participants, target width and amplitude (Figures 5, 6, and 7). It can be seen from Figure 5 that only 2 out of 8 participants (P2 and P4) took significantly more time with the eye tracking plus scanning system than with the eye-tracking-only system. There were at least three occasions on which participants failed to point at 20 pixel targets using the eye-tracking-only system. We found that the eye-tracking-only system produced significantly less (p < 0.05) movement time for the 240 pixel target amplitude, while the differences in movement time for the other combinations of target width and amplitude were not significant in an unequal variance t-test. The eye tracking plus scanning system tends to produce less movement time for the 300 pixel target amplitude (Figure 7), as the eye tracker apparently lost some accuracy in the periphery of the screen. Finally, all participants felt the eye tracking plus scanning system was more comfortable than the eye-tracking-only system because their eye muscles could rest while using the scanning system.

Figure 4. Comparing movement time (ET vs. ETSCAN).
Figure 5. Comparing movement time w.r.t. participants (average movement time in sec for participants 1-8; Only ET vs. ET & Scanning).
Figure 6. Comparing movement time w.r.t. target width (average movement time in sec for 20, 30 and 40 pixel wide targets; Only ET vs. ET & Scanning).
Figure 7. Comparing movement time w.r.t. amplitude (average movement time in sec for 180, 240 and 300 pixel target distances; Only ET vs. ET & Scanning).


3.5. Discussion<br />

The results show that using the scanning system together with the eye tracking system did not significantly change pointing time compared to the eye-gaze-only system. The high variance of the eye-tracking-only system also indicates that in some cases users took a very long time to point, which would surely frustrate them. It should be noted that we used the Tobii tracker [12] for this study, which is currently among the most accurate trackers on the market. With a low cost, low accuracy eye tracker (such as a webcam based one) the eye-tracking-only system would be harder to use, while the eye tracking plus scanning system would not suffer much, as the technique does not require high accuracy from the eye tracker. We used an averaging filter to extract points of eye gaze fixation; a better filtering algorithm [1] would increase the accuracy of both systems equally. We used a scan delay of 1 sec for the scanning system and a dwell time of 500 msec for the eye gaze tracking system in this study, both of which could be reduced to produce lower movement times for expert users. Additionally, this new technique is faster than a scanning-only system while giving more comfort and accuracy than an eye-gaze-only pointing system. Our system is less proactive than MAGIC pointing [14], as the user can manually switch eye gaze tracking or scanning on and off whenever they want with a single switch press. It seems more user friendly than Bates' system [2], as operating a push button switch is easier than operating a Polhemus InsideTrack device by elevating the shoulder. Our system can also address the challenges faced by Fejtova [9] in developing an eye gaze tracking based wheelchair, as the user can switch off eye tracking temporarily and clicking is done through the scanning system, which reduces the possibility of accidental or missed clicks. Currently we are working on integrating the system with a webcam based eye tracker to develop a low cost interaction device.

This technique can also have applications other than computer accessibility software. It can be used to provide hands-free access to a screen with multiple displays (or control screens), where the eye tracking system locates a particular portion of the screen or a control display and the scanning technique is used to operate inside that display. It would also be useful for overcoming situational impairment in interaction, such as using an electronic display in a moving vehicle, where it is difficult to use a pointing device or touch screen. The eye tracking and scanning techniques both require minimal input from the user, so the user need not disengage from their main task (such as driving the car) to interact with another device.

4. Conclusions<br />

In this paper we have introduced a new input device involving an eye gaze tracker and a scanning interface for people with severe disabilities. The system solves several problems of existing eye gaze tracking based systems by offering users more accuracy and comfort, which is also supported by a user study.


Acknowledgement<br />

We are grateful to our participants for taking part in our<br />

study. We would also like to thank Prof. Peter Robinson<br />

of University of Cambridge Computer Laboratory for his<br />

help in organizing the study.<br />

References<br />

1. Adjouadi, M. et al. Remote Eye Gaze Tracking System as a Computer Interface for Persons with Severe Motor Disability. ICCHP 2004, LNCS 3118, 2004, 761-769.
2. Bates, R. Multimodal Eye-Based Interaction for Zoomed Target Selection on a Standard Graphical User Interface. INTERACT 1999.
3. Biswas, P. and Robinson, P. A New Screen Scanning System based on Clustering Screen Objects. Journal of Assistive Technologies, Vol. 2, Issue 3, September 2008, pp. 24-31, ISSN: 1754-9450.
4. Biswas, P. and Robinson, P. The effects of hand strength on pointing performance. Designing Inclusive Interactions, Springer-Verlag, pp. 3-12, ISBN: 978-1-84996-165-3.
5. Donegan, M. et al. Understanding users and their needs. Universal Access in the Information Society 8 (2009): 259-275.
6. Duchowski, A. T. Eye Tracking Methodology. Springer-Verlag, 2007.
7. Eye Pointing, URL: http://abilitynet.wetpaint.com/page/Eye+Pointing, Accessed on 19th August 2010.
8. EyeTech Digital System, URL: http://www.eyetechds.com/assistivetech/index.htm, Accessed on 19th August 2010.
9. Fejtova, M. et al. Hands-free interaction with a computer and other technologies. Universal Access in the Information Society 8 (2009): 277-295.
10. Fitts, P.M. The Information Capacity of The Human Motor System In Controlling The Amplitude of Movement. Journal of Experimental Psychology 47 (1954): 381-391.
11. Majaranta, P. and Raiha, K. Twenty Years of Eye Typing: Systems and Design Issues. Eye Tracking Research & Application 2002, 15-22.
12. Tobii Eye Tracker, URL: http://www.imotionsglobal.com/Tobii+X120+Eye-Tracker.344.aspx, Accessed on 12th December 2008.
13. Ward, D. Dasher with an eye-tracker, URL: http://www.inference.phy.cam.ac.uk/djw30/dasher/eye.html, Accessed on 19th August 2010.
14. Zhai, S., Morimoto, C. and Ihde, S. Manual and Gaze Input Cascaded (MAGIC) Pointing. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI) 1999.


Gesture Recognition Exploration using Haartraining and<br />

KNN in a 3D Racing Game<br />

Kamlesh Mistry
School of Computing, Teesside University, UK
mistry.kamlesh@gmail.com

Li Zhang
School of Computing, Teesside University, UK
l.zhang@tees.ac.uk

ABSTRACT<br />

Automatic recognition of body language is challenging but inspiring as a natural control channel for intelligent user interfaces. In this paper we report automatic car navigation via hand gesture recognition in a 3D racing game application. We have employed Haartraining and k-nearest neighbor (KNN) algorithms to recognize hand gestures with the assistance of image processing. Our study has explored vision-based gesture tracking and dynamic gesture recognition in a real-time navigation game application. The gesture recognition system has been embedded in a 3D virtual world built with the assistance of a games engine, Irrlicht. Sound effects have also been employed in our application. We have also conducted user testing with 5 subjects to evaluate the efficiency of KNN-based gesture recognition. Evaluation results for the Haartraining-based recognition are also provided. Overall the gesture recognition performance is very promising. Our work contributes to the workshop themes on natural user interfaces in novel, intelligent interaction systems, navigation systems and assistive functionalities.

Author Keywords<br />

Gesture recognition, Haartraining, and K-nearest neighbor<br />

ACM Classification Keywords<br />

H.5.2 [User interfaces]<br />

INTRODUCTION<br />

Multimodal interaction based on the recognition and interpretation of body language and verbal input is challenging but inspiring for building efficient intelligent user interfaces. Advanced educational or entertainment applications residing in 3D virtual environments also call for such a natural communication channel to enhance the user experience. In pursuit of this research goal, we have developed a robot car with automatic navigation under the control of continuous hand gestures. In our previous work, we also produced a neural network driven automatic navigation component that enables a robot car to learn road and track conditions and handle tough turning situations successfully. Overall, we believe our developments have the potential to benefit innovative user interface development for navigation and assistive functionalities for driving in real life situations.

RELATED WORK<br />

There have been various inspiring research studies in the gesture recognition field. Billon et al. [1] reported a gesture recognition system to facilitate the communication between a virtual actor and a real human actor in a martial arts virtual game setting. Principal Component Analysis was used to generate the artificial gesture representation, which was then used for real-time gesture segmentation and recognition. Elmezain et al. [2] presented a hidden Markov model (HMM) based continuous gesture recognition system for the recognition of the Arabic numbers 0-9. Tomibayashi et al. [4] produced a wearable DJ system that enables DJs to perform freely using wearable computing and gesture recognition technologies; wearable acceleration sensors were used in their study to assist gesture recognition, and their system has been tested in real stage performances. Nam and Wohn [3] presented another HMM based space-time hand gesture recognition system, in which the HMM models the spatial variance and the time-scale variance in the hand movement to assist the recognition of continuous, connected hand movement patterns. In our work, we recognize continuous connected hand movements and gestures using two different approaches, Haartraining and KNN. Three key gestures are recognized by KNN and five key gestures are identified by Haartraining. The recognized gestures are used for real-time automobile navigation in a 3D racing game for entertainment purposes. We also provide evaluation results to demonstrate the efficiency of our approaches.

GESTURE RECOGNITION USING KNN<br />

K-nearest neighbor has been widely used for pattern recognition. We employ it in our application to recognize key gestures in real time using a webcam. Our recognition process is carried out in three steps: image pre-processing, vector generation, and final classification. At the training stage, raw images with hand gestures are collected from the webcam. First of all, these collected images are cropped. An example original image and the corresponding cropped image are shown in Figure 1, in which white pixels represent the object of interest while black pixels indicate the background. Compared with the original image, the cropped image, which is used for the training of KNN, only has a slightly different width and height.

These cropped images are then converted into binary files in order to feed them to KNN. Vector generation is used to convert the pre-processed images into training binary files with the appropriate format. Each KNN class represents a particular gesture, and all the images representing one particular gesture are stored under that KNN class. We have used the .pbm format to store all the image files for training, since this format provides ASCII characters in decimals for the width and height of each image. The names of the training files follow the pattern 'CNN.pbm', where C is the KNN class number and NN is the number of the image file stored in that class. We have used altogether 300 images representing 3 different hand gestures (100 images for each gesture) for the training of KNN. The three gestures recognized by KNN are shown in Figure 2. Thus a scalar matrix has been produced to represent all the training data.

Figure 1. An example original and its cropped image after preprocessing<br />

(from left to right).<br />

Figure 2. Three key gestures recognized by KNN, including a<br />

palm gesture (for stopping), a fist gesture (for acceleration)<br />

and a pistol-like gesture (for turning).<br />

At the testing stage, raw images collected from the webcam also need to be pre-processed before feeding them to KNN. Since the images captured from the webcam are colored 32-bit images and our training binary images are only 8-bit, we have used a skin detection algorithm to convert the captured 32-bit testing images into 8-bit ones. The following procedure is used to detect skin color. First of all, we need to access the RGB values of each pixel using the following formulas.

p = y * image->widthStep + x * image->nChannels   // byte offset of pixel (x, y) in the packed multi-channel image
blue = imageData[p]; green = imageData[p + 1]; red = imageData[p + 2]

where p is the byte offset of pixel (x, y), imageData is the array storing all the pixels of the image being processed, and x and y are the pixel coordinates. Thus imageData[p], imageData[p+1] and imageData[p+2] are the blue, green and red color values of that pixel.

Figure 3. An example image before and after skin detection<br />

processing.<br />

The following criteria are then used to detect skin color: red > 95, green > 40 and blue > 20, with max(R, G, B) - min(R, G, B) > 15 for the pixel. If a pixel fulfills these premises, we re-assign it to a white pixel with RGB value (255, 255, 255); otherwise, we re-assign it to a black pixel with RGB value (0, 0, 0). Figure 3 shows an example image before and after skin detection processing.
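A minimal sketch of this thresholding step is given below, assuming the frame is available as a packed BGR byte array with widthStep bytes per row (as in an 8-bit, 3-channel image); the class name and the surrounding I/O are assumptions made for illustration.

// Sketch of the skin-detection thresholding described above (assumed names).
class SkinDetector {
    /** Turns skin-colored pixels white and everything else black, in place. */
    static void binarize(byte[] imageData, int width, int height, int widthStep) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int p = y * widthStep + 3 * x;          // byte offset of pixel (x, y)
                int blue  = imageData[p]     & 0xFF;
                int green = imageData[p + 1] & 0xFF;
                int red   = imageData[p + 2] & 0xFF;
                int max = Math.max(red, Math.max(green, blue));
                int min = Math.min(red, Math.min(green, blue));
                boolean skin = red > 95 && green > 40 && blue > 20 && (max - min) > 15;
                byte value = (byte) (skin ? 255 : 0);   // white for skin, black otherwise
                imageData[p] = value;
                imageData[p + 1] = value;
                imageData[p + 2] = value;
            }
        }
    }
}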

In our application, the KNN algorithm is used for the recognition of the testing gestures. KNN has been used widely in pattern recognition and machine learning. It classifies a test query based on a majority vote of its neighbors, with the test query labeled as the class most common amongst its k nearest neighbors. We have used a weighted KNN in order to avoid the domination of classes with more frequent examples, as happens in basic 'majority voting' classification. Therefore, in our application, the KNN classification algorithm weights the contribution of each of the k neighbors according to their distance to the query point Xq, assigning greater weight Wi to the nearest neighbors. The following equation is used in our application:

F(X_q) = \arg\max_{v} \sum_{i=1}^{k} W_i \, \delta(v, f(X_i))

where Xq is a testing image containing a test gesture, v is the vector of the training set, Xi represents each KNN class, and δ(v, f(Xi)) represents the distance between the test query and each KNN class. Using this KNN implementation, we have successfully classified three different continuous gestures with promising accuracy rates in real-time applications (see the evaluation section for details). We also noticed that KNN's performance could be influenced by the backgrounds shown in the images. In order to avoid this problem, we have also used another approach, Haartraining, to perform gesture tracking and assist the recognition of gesture movement, providing another effective control channel for automatic car navigation without any side effects caused by the image backgrounds.
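The weighted voting described above can be sketched as follows; the data structures and the concrete inverse-square-distance weight are illustrative assumptions, not the authors' implementation.

import java.util.List;

// Sketch of distance-weighted KNN voting: each of the k nearest training images votes
// for its gesture class with a weight that decreases with its distance to the query.
class WeightedKnnSketch {
    static class Neighbor {
        final int gestureClass;   // index of the KNN class (gesture)
        final double distance;    // distance between the query image and this training image
        Neighbor(int gestureClass, double distance) { this.gestureClass = gestureClass; this.distance = distance; }
    }

    /** Returns the gesture class with the largest summed weight among the k neighbors. */
    static int classify(List<Neighbor> kNearest, int numClasses) {
        double[] votes = new double[numClasses];
        for (Neighbor n : kNearest) {
            double weight = 1.0 / (n.distance * n.distance + 1e-9);  // W_i: closer neighbors count more (assumed weighting)
            votes[n.gestureClass] += weight;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;                                                 // arg max over classes
    }
}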

GESTURE RECOGNITION USING HAARTRAINING<br />

Haartraining is well known for tasks such as face and pedestrian detection. In our application, Haartraining has been used to recognize five gesture movements: fist gestures indicating car movement up, down, left and right, and a palm gesture for halt (see Figures 4 & 5).

For the image acquisition process, we again used a webcam to collect positive and negative image samples. The positive images are those containing only objects of interest (gestures); in other words, positive images are used to identify gestures. It does not affect the training if the backgrounds of the positive images differ from each other. Negative image samples contain only backgrounds and no objects of interest; they can be any images, such as landscapes, car photos, or various textures. Negative images are usually used to improve gesture recognition performance, since they allow gestures to be recognized against arbitrary backgrounds.

In order to provide robust training, we collected 116 positive image samples and divided them into a training set of 86 samples and a testing set of 30 samples. We also collected 178 negative image samples for training purposes. Figures 4 & 5 respectively show positive sample images for the training of fist and palm positions.

Figure 4. Positive images representing the 4 key gestures (from<br />

left to right: a basic fist gesture followed by fist gestures<br />

indicating car moving left, right, up and down).<br />

Figure 5. Positive image samples for training representing<br />

palm gestures for stopping.<br />

Vector generation is also needed to convert the positive and negative images into the appropriate format to feed Haartraining at the training stage. The process is briefly explained as follows. One text file contains the names of all the negative sample files (e.g. negative1.jpg, negative2.jpg, etc.), while another text file for the positive images contains the names of all the positive images together with the number of objects and the coordinates of the bounding box around the objects of interest (e.g. positive1.jpg, number_of_object(1), 20 20 50 50 (x, y, width, height)). The 'createsamples' command was also used to create training and testing vector samples in order to avoid distortions.

The AdaBoost algorithm embedded in the Haartraining process has been used for training on the samples. AdaBoost trains a strong classifier as a linear combination of weak classifiers built from the best features of the training set. For example, if there are weak image samples with comparatively dark lighting or low contrast, the AdaBoost approach is able to improve the visibility of the objects of interest with better contrast. Finally, the Haartraining command is used to train the classifier. The evaluation results of the Haartraining approach for gesture tracking and recognition are provided in the evaluation section.

INTEGRATION WITH A 3D GAMES ENGINE<br />

The gesture recognition components based on KNN and Haartraining have been integrated with the 3D games world for the control of the car navigation. An open source games engine, Irrlicht (www.irrlicht.org), together with Newton physics, has been used to construct the 3D world environment. The OpenCV library has been used for the image processing, and the sound library IrrKlang, provided by the developers of the Irrlicht games engine, has been employed to produce sound effects.

Briefly, for the development of the games world we load the racetrack and the car as meshes and set the graphics API to OpenGL. We apply physics to the car mesh using the Newton physics library and add the racetrack to the physics entity, so that the car is the object and the track is the entity.

In order to obtain the input data for the image processing, we have used the OpenCV library. After capturing the images from the webcam, we use IplImage structures to store the image files. Overall, we collect continuous images in our application, and the collected images are used for pre-processing and classification.

For the control of the robot car using KNN, we have used<br />

the following gesture commands: a fist representing<br />

acceleration, a palm representing stopping and a pistol-like<br />

gesture representing turning. Therefore based on the output<br />

of KNN, which has used image files stored in IplImage as<br />

testing images, the robot car can navigate accordingly. For<br />

example, if the output of KNN indicates a fist gesture, then<br />

the robot car performs acceleration.<br />

If Haartraining is used to control the vehicle, we have defined the following gesture commands for navigation: a palm gesture for stopping, a fist position to the very left indicating turning left, a fist position to the very right indicating turning right, a fist position at the top indicating acceleration, and a fist position at the bottom indicating reverse movement. First of all, once a gesture has been recognized by Haartraining, we need to check on which axis and at what position the gesture was recognized. To achieve this, a Haartraining class has been implemented containing all the necessary functions, such as loading the Haar cascade, testing the cascade, and drawing a bounding box around a detected gesture. If the position of the recognized gesture is less than 100 on the x-axis (a fist gesture to the very left), the car turns left; if it is more than 500 on the x-axis (a fist gesture to the very right), the car turns right. Similar rules apply to the forward and reverse control: if the position of the recognized gesture is less than 100 on the y-axis (a fist gesture at the bottom), the car moves backwards, and if it is greater than 400 on the y-axis (a fist gesture at the top), it moves forwards. Figure 6 shows a system screenshot.

Figure 6. A system screenshot.<br />
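The position thresholds above translate into a simple mapping from a recognized gesture and its bounding-box position to a car command; the following sketch uses assumed names and reflects only the thresholds stated in the text.

// Sketch of the position-to-command mapping (x < 100: left, x > 500: right,
// y < 100: reverse, y > 400: forward); enum and method names are assumptions.
class GestureCommandMapper {
    enum Command { STOP, TURN_LEFT, TURN_RIGHT, ACCELERATE, REVERSE, NONE }

    /** Maps a recognized gesture and the position of its bounding box to a car command. */
    static Command map(boolean isPalm, int x, int y) {
        if (isPalm) return Command.STOP;       // palm gesture halts the car
        if (x < 100) return Command.TURN_LEFT;
        if (x > 500) return Command.TURN_RIGHT;
        if (y < 100) return Command.REVERSE;   // fist towards the bottom, as defined in the text
        if (y > 400) return Command.ACCELERATE;
        return Command.NONE;                   // fist near the center: keep current behavior
    }
}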

SYSTEM EVALUATION<br />

We conducted user testing with 5 male subjects aged 20-25 to evaluate the efficiency of our KNN-based gesture recognition component. The testing methodology for KNN was as follows. Each testing subject first had a warm-up session and then played the game using hand gestures for vehicle navigation. A video was recorded of the gestures performed by each testing subject so that they could be compared with the gesture sequence recognized by KNN to obtain an accuracy rate. Across the 5 testing subjects engaged in our user study, we obtained an average accuracy rate of 82%. Detailed results are provided as a confusion matrix in Table 1, with rows representing gestures performed by the testing subjects and columns showing gestures recognized by KNN.

Gestures performed by users | Recognized as fist gesture | Recognized as palm gesture | Recognized as turning gesture
Fist gesture | 90.47% | 4.76% | 4.76%
Palm gesture | 6.66% | 86.66% | 6.66%
Turning gesture | 42.85% | 0% | 57.14%
Table 1. Evaluation results for KNN.

From the KNN recognition results, we noticed that most of the errors were caused by a turning (pistol-like) gesture sometimes being recognized as a fist gesture. This is because the skin detection algorithm sometimes mixed up the background of the gesture (part of the arm) with the gesture itself. The fist and palm gestures were recognized well, with accuracy rates above 80%.

The evaluation results for Haartraining were produced with 29 positive testing images (1 image was invalid), different from the training set. The performance command (opencv_performance) was used for testing and detection purposes. Table 2 shows the evaluation results for the recognition of the testing image samples.

Image set | Correct recognition | Inaccurate recognition | Accuracy rate
Positive palm images | 9 | 7 | 56.3%
Positive fist images | 11 | 2 | 84.6%
Table 2. Evaluation results for Haartraining.

For the recognition of the palm and fist gestures using Haartraining, 9 positive images were recognized as unknown gestures: 7 palm images and 2 fist images. The main reason for these recognition errors is probably that weak images with poor lighting were included in the training set; in future work, higher-quality images will be used to improve the system's performance. Compared with existing work (Billon et al., with a >80% accuracy rate), the performance of both of our approaches is acceptable. Users also experienced effective car navigation using gestures in a real-time 3D application, so the system has the potential to improve user engagement.

CONCLUSION<br />

We have reported a 3D car navigation game application controlled via gesture recognition using both KNN and Haartraining. Although there is room for improvement, both approaches produced reasonable recognition results, with Haartraining additionally able to ignore interfering background effects. In future work, we intend to employ HMMs to extend our system with the capability of recognizing more complex (e.g. emotional) gestures to support natural interaction for automatic navigation.

REFERENCES<br />

1. Billon, R., Nédélec, A. and Tisseau, J. Gesture<br />

Recognition in Flow based on PCA Analysis using<br />

Multiagent System. In Proceedings of ACE. (2008).<br />

2. Elmezain, M., Al-Hamadi, A., Appenrodt, J. and<br />

Michaelis, B. A Hidden Markov Model-Based<br />

Continuous Gesture Recognition System for Hand<br />

Motion Trajectory. In Proceedings of 19th International<br />

Conference on Pattern Recognition, (2008). 1-4.<br />

3. Nam, Y. and Wohn, K. Recognition of Space-Time<br />

Hand-Gestures using Hidden Markov Model. In Proc. of<br />

ACM VRST96 Conference. (1996). 51-58.<br />

4. Tomibayashi, Y., Takegawa, Y., Terada, T., and<br />

Tsukamoto, M. Wearable DJ System: a New Motion-<br />

Controlled DJ System. In Proceedings of ACE. (2009).


Model-Based User Interface Development in the Automotive Industry

Moritz Kuemmerling
German Research Center for Artificial Intelligence (DFKI)
Trippstadter Strasse 122, 67663 Kaiserslautern, Germany
Moritz.Kuemmerling@dfki.de, +49 631 205 3709

Gerrit Meixner
German Research Center for Artificial Intelligence (DFKI)
Trippstadter Strasse 122, 67663 Kaiserslautern, Germany
+49 631 205 3707

ABSTRACT<br />

The time-to-market for human machine interfaces in the German automotive industry has to be reduced. The shortening of innovation cycles in other relevant industry fields and international competitors increase the pressure on German car manufacturers and their suppliers. Model-based user interface development is expected to reduce the development time significantly, thus improving the manufacturers' competitiveness. Therefore, a new domain specific modeling language for the specification of automotive human machine interfaces is being sought. Past approaches with similar objectives have either failed or have not been successfully established across the industry as a holistic solution. Within the scope of a new cooperative project whose partners, for the first time, completely cover the supply chain of user interface development in the automotive industry, a common solution is to be developed and established as an industry standard.

Keywords<br />

Model-based user interface development, automotive HMI,<br />

domain specific language.<br />

INTRODUCTION<br />

The German automotive industry has to find a way to significantly reduce the development time for human machine interfaces (HMI) in vehicles. The reasons, among others, are the continuous development of driver assistance, communication and infotainment systems, new drive concepts, and the continuous shortening of innovation cycles in the consumer electronics industry. To keep up with these technologies and with competitors catching up around the globe, future HMI-systems will become more and more complex while their development costs and time-to-market have to be reduced. However, current HMI development processes are characterized by different, inconsistent workflows and heterogeneous tool chains. The exchange of paper-based specification documents between the process participants causes media discontinuities, inhibits version management, reduces reusability and hampers communication [2].

Moreover, it is impossible to automatically test the integrity and accuracy of paper-based specification documents. The adoption of reliable and successful approaches from the field of model-based user interface development (MBUID) [9] is expected to be a suitable remedy.

For this purpose, a new industry-driven project has been set up whose partners – several German car manufacturers (OEMs), suppliers, a tool developer and the "Verband der Automobilindustrie e. V." (VDA) as an association – for the first time completely cover the supply chain of HMI development in the automotive industry.

Together, the partners aim to develop a new modeling language that will serve as an interface between the process participants, thus avoiding media discontinuities and improving the communication among the involved actors. The intention is nothing less than to establish the new modeling language not only within the project consortium but as an industry standard.

The paper is structured as follows: first we give an overview of MBUID, existing modeling languages and past projects with similar objectives. Then we point out what we expect to do differently in our project. We also explain the impact that the project will have on MBUID as a field of research. After an outlook on our next steps, the paper ends with a conclusion.

EXISTING UIDLS, PAST PROJECTS AND THEIR SHORTCOMINGS

A vast number of XML-based user interface description languages (UIDLs) already exist in the field of MBUID. Some UIDLs have already been standardized by OASIS and/or are subject to a continuous development process. Numerous projects and applications prove their practical suitability. Some examples are UsiXML [14], UIML [1] or XIML [10, 11].
UIML [1] or XIML [10, 11].<br />

The purpose of using a UIDL to develop a user interface is to systematize the HMI development process [9]. UIDLs enable the developer to systematically break down a user interface into different abstraction layers and to model these layers [8]. Thus it is, for example, possible to describe the behavior, the structure and the layout of a user interface independently of each other.

In Figure 1 we show how an automotive HMI-system can be developed using a model-based approach. In a first step, designers and engineers describe an abstract model of the later HMI-system. The abstract model is independent of any hardware platform, and the developers can focus on the user requirements. In the next step, the abstract model is extended to a more concrete one. The concrete model allows the generation of virtual prototypes, which can be used for first user tests of the later HMI-system. In the final step, the concrete models are transformed and mapped to the platform-specific requirements of the target system. The reusability of the models decreases with each step.

Figure 1. <strong>Automotive</strong> HMI development using a model-based<br />

approach.<br />

Existing UIDLs differ in terms of the supported platforms and modalities as well as in the number of predefined interaction objects that are available to describe the elements of the user interface. In the relevant literature, several authors have struggled with the challenge of clearly comparing existing UIDLs [4, 7, 13]; however, a comprehensive comparison is yet to be drawn.

In HMI development in the automotive industry a wide range of actors from many different branches are involved – computer scientists and electrical engineers work together with designers, ergonomists and psychologists in interdisciplinary teams (see Figure 2). The HMI modeling language that we want to develop shall serve as the connecting link between these actors. For this reason the modeling language has to be domain specific. Domain specific languages (DSL) are dedicated to a particular problem domain and their "vocabulary" is generally based upon common expressions that are typical for the domain. Thus DSLs are far more expressive in their domain than general-purpose languages would be. Further benefits of DSLs are a better acceptance when introducing the language as well as a better readability of DSL-based specifications, even for non-programmers.

Figure 2. Actors in the HMI development process and their<br />

specification flow.<br />

The idea of reusing best practices and existing modeling languages from the field of MBUID to develop a new domain-specific language for automotive HMI development is not completely new. In the past there have been similar approaches:

• IML (Infotainment Markup Language) [6], developed by IAV, is an XML-based modeling language for infotainment systems.

• OEM XML (later VW XML) [3] is an XML-based language that resulted from a cooperation of AUDI, BMW, Daimler, Porsche and VW. It addresses the standardized description of head-units and instrument cluster systems.

• AbstractHMI [12] is an XML-based modeling language for automotive HMI-systems. The language was developed at the University of Ulm in cooperation with Daimler.

• ICUC XML [5] is dedicated to the modeling of instrument clusters in trucks. The language was developed by Elektrobit Automotive for Daimler.

However, none of the languages presented above has established itself as an industry standard. Today there are only a few, at best partial, solutions that are used by some OEMs or suppliers. IAV gave up on the development of IML. AbstractHMI has never found its way from research to industrial application. ICUC XML can only be used via the development tool EB Guide, and OEM XML, despite the numerous partners involved in its development, is used only by VW.


WHAT TO DO DIFFERENTLY?<br />

The sustainable success of the renewed attempt strongly depends on the impact that the new modeling language and further project-related standardizations will achieve in the automotive industry. For this reason, a consistent transfer of the project results towards the industry is required. Exhibitions of the project results at the leading trade fairs in the automotive industry, such as the International Motor Show (IAA) or the International Suppliers Fair (IZB), will attract attention and contribute to the dissemination of the project results.

During the project period of three years, the project results will be continuously tested, validated and presented in the form of several demonstrators. Towards the end of the project these demonstrators will be aggregated into an overall system. This overall system shall cover and demonstrate the complete HMI development process in the automotive industry, from the first mock-up to the implementation of the target code on the hardware in the cockpit of a vehicle. In particular, model-based aspects and differences to the common development process shall be highlighted. To this purpose, the final demonstrator shall, for example, show that requests for changes in the running HMI-system can easily be realized by small manipulations in the underlying HMI specification (which is based on the domain-specific modeling language). The HMI-system is supposed to run on several OEM/supplier hardware combinations. The exchangeability of the cockpit's hardware emphasizes the wide coverage of the project results in the automotive industry.

In addition to the optimization of the HMI development process and of the communication within it, a standardized modeling language paves the way for some further improvements.

The above-mentioned inability of paper-based exchange documents to be tested automatically for integrity and accuracy often leads to bugs in the HMI-system that are first noticed in late stages of development. By leveraging the full potential of machine-readable specification documents (e.g. model-based testing, early use of virtual prototypes), cost- and time-intensive subsequent iterations and corrections can be avoided. For both suppliers and OEMs this represents a significant cost-saving potential.

The connection of the HMI-system to the application layer of the vehicle is a further significant cost factor in current development processes. As the connection to the car's application layer still requires manual processing, this step consumes resources to a similar extent as the actual development of the HMI-system. The introduction of a standardized modeling language creates the conditions for the development of a standard middleware that allows future HMI-systems to be connected more easily to the car's application layer. The consequences are a reduction of development time and better exchangeability of the hardware components.


The integration of both aspects, model-based testing and middleware, highlights the unexplored potential of model-based HMI development in the automotive industry.

IMPACT ON MBUID<br />

In the field of HMI development, a distinction is made between model-based development of human-machine interfaces at design time and at runtime. The presented project addresses the model-based development of automotive HMI-systems at design time. Thus the project is the first extensive industrial use case for model-based HMI development. The collaborative application of this method by several industrial partners allows a proof of concept, revealing strengths as well as possible weaknesses where further research is required. Furthermore, the step towards model-based HMI development at design time is a necessary one in the automotive industry: future runtime-adaptive HMI-systems require a model-based architecture. The development of such systems is necessary for a functional and efficient integration of the driver's mobile devices (iPods, mobile phones, etc.).

NEXT STEPS TO TAKE<br />

The achievement of the project objectives presented above depends on several central tasks.

Out of the numerous UIDLs without any automotive background, a few well-established examples have to be picked and compared to each other. The comparison has to be based on an appropriate use case that allows the identification of elements that can be useful for the development of the automotive modeling language (e.g. a simple interface for a music player).

In parallel, existing automotive-related UIDLs have to be carefully examined. In particular, the question why none of these languages became a standard has to be answered.

The automotive HMI development process itself will be the subject of a comprehensive analysis. Tools, processes and specification documents are examined on site at each partner, with a strong focus on the interfaces and the exchange of documents between OEMs, suppliers and tool developers. The purpose is to identify best practices and to define an abstract reference process. The latter shall be used to derive a common data model as well as the requirements for the development of the new modeling language.

CONCLUSION<br />

In this paper we summarized some of the main issues in current HMI development processes in the automotive industry. The adoption of methods from the field of MBUID is supposed to lead to machine-readable HMI specifications, thus improving the communication between the process partners. Past attempts to develop a standardized modeling language have either failed or led to isolated applications. However, long-term benefits and potential subsequent developments necessitate an industry-wide impact as well as a sustainable manifestation of the outcomes of the presented project. The first step has already been taken: for the first time, several OEMs will work together with their suppliers on the optimization of their HMI development processes.

REFERENCES<br />

1. Abrams, M., Phanouriou, C. and Batongbacal, A.<br />

UIML: An Appliance-Independent XML User Interface<br />

Language. Proc. of the 8th International World Wide<br />

Web Conference, Toronto, Canada, 1999.<br />

2. Bock, C., Görlich, D. and Zühlke, D. Using Domain-Specific Languages in the Design of HMIs: Experiences and Lessons Learned. Proc. of the Workshop on Model-Driven Development of Advanced User Interfaces, UML/MoDELS 2006, Genoa, Italy, 2006.

3. Brunhorn, J. XML-Sprache zur Beschreibung von HMIs<br />

für Infotainmentsysteme und Kombiinstrumente.<br />

Language Specification 1.0. Carmeq GmbH / OEM<br />

Arbeitskreis HMI Methodik, 2007.<br />

4. Guerrero García, J., González Calleros, J. and<br />

Vanderdonckt, J. A Theoretical Survey of User Interface<br />

Description Languages: Preliminary Results. Proc. of<br />

Joint 4th Latin American Conference on Human-<br />

Computer Interaction 7th Latin American Web<br />

Congress, Los Alamitos, USA, 2009.<br />

5. Hübner, M. and Grüll, I. ICUC-XML Format. Format<br />

Specification Revision 14. Elektrobit, 2007.<br />

6. Jud, A. Präzise Syntaxdefinition einer<br />

Modellierungstechnik für Infotainment-Systeme. Master<br />

Thesis, Technische Universität Berlin, 2007.<br />


7. Luyten, K. Dynamic User Interface Generation for<br />

Mobile and Embedded Systems with Model-Based User<br />

Interface Development. Doctoral Thesis, Transnationale<br />

Universiteit Limburg, Limburg, 2004.<br />

8. Meixner, G. Model-based Useware Engineering. W3C Workshop on Future Standards for Model-Based User Interfaces (W3C-2010), May 13-14, Rome, Italy, 2010.

9. Puerta, A. A Model-Based Interface Development<br />

Environment. IEEE Software, 14 (4), 40-47, 1997.<br />

10. Puerta, A. and Eisenstein, J. XIML: A Universal<br />

Language for User Interfaces. RedWhale Software, Palo<br />

Alto, CA USA, 2001. Retrieved September 09, 2011,<br />

from http://www.ximl.org/pages/docs.asp.<br />

11. Puerta, A. and Eisenstein, J. Developing a Multiple User<br />

Interface Representation Framework for Industry. In:<br />

Multiple User Interfaces. Cross-platform Applications<br />

and Context-Aware Interface, Wiley, 119-148, 2004.<br />

12. Reich, B. Abstrakte Beschreibung automobiler HMI-<br />

Systeme und deren Erweiterung für neue Dienste.<br />

Master Thesis, Universität Ulm, 2008.<br />

13. Souchon, N. and Vanderdonckt, J. A Review of XML-

Compliant User Interface Description Languages. Proc.<br />

of the 10th International Workshop on Interactive<br />

Systems: Design, Specification and Verification, 377-<br />

391, 2003.<br />

14. Vanderdonckt, J., Limbourg, Q. and Michotte, B.<br />

USIXML: A User Interface Description Language for<br />

Specifying Multimodal User Interfaces. Proc. of the<br />

W3C Workshop on Multimodal Interaction, 2004.


A Robotic Wheelchair using Human Gestures and<br />

Scene Contexts<br />

Jin Sun Ju, Eun Yi Kim<br />

Dept. of Advanced Technology Fusion Engineering, Konkuk University, Seoul, Korea

vocaljs@konkuk.ac.kr, eykim@konkuk.ac.kr<br />

82-2-450-4135<br />

ABSTRACT<br />

In this paper, we propose a new vision-based robotic wheelchair using human gestures and scene contexts. For easy and accurate control of the wheelchair, facial gestures are used: the direction of the robotic wheelchair is determined by the inclination of the user's face, while proceeding and stopping are determined by the shape of the user's mouth. In addition, to provide autonomous obstacle avoidance, a monocular vision-based navigation module is developed. To assess the effectiveness of the developed robotic wheelchair, several experiments were performed indoors and outdoors under various situational conditions. The results demonstrate the feasibility of our system as a mobility aid for disabled or elderly people.

Author Keywords<br />

Robotic wheelchair, gesture recognition, MLP<br />

ACM Classification Keywords<br />

H5.m. Information interfaces and presentation (e.g., HCI):<br />

Miscellaneous; I.4 Image processing and computer vision;<br />

INTRODUCTION<br />

Robotic wheelchairs are generally electric-powered wheelchairs with an embedded computer and sensors, giving them intelligence. The most important evaluation factors for such wheelchairs are safety and convenient control, so many studies have addressed intelligent interfaces and autonomous navigation [1] [2]. The intelligent interface aims at enabling handicapped users to control the wheelchair with their limited physical abilities. For such an interface, we developed a control system using face inclination and mouth shape recognition in previous work, which improves both the accuracy of recognizing the user's intention and the computational cost compared with existing approaches [3].

Navigation refers to detecting various obstacles in real environments and avoiding them. As wheelchairs are used by handicapped people, dangerous situations and accidents such as collisions with obstacles and other people can occur. Accordingly, this study focuses on developing automatic navigation techniques for obstacle detection and avoidance.

In this paper, we develop a vision-based robotic wheelchair using human gestures and scene contexts. Fig. 1(a) illustrates the prototype of the proposed robotic wheelchair and the specifications of its components. Our system consists of two modules: 1) a wheelchair control interface module and 2) a monocular vision-based navigation module. Fig. 1(b) describes the process of the proposed robotic wheelchair.

(a) (b)<br />

Figure 1. The proposed system (a) the overall architecture of our wheelchair (b) the outline of proposed wheelchair system<br />

Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA


WHEELCHAIR CONTROL INTERFACE<br />

The proposed wheelchair control interface allows the user to control the wheelchair directly by changing their face inclination and mouth shape. If the user wants the wheelchair to move forward, they just say "Go." Conversely, to stop the wheelchair, the user just says "Uhm." The direction of the wheelchair is determined by the inclination of the user's face rather than by turning the head.

Facial Feature Detection<br />

For robust detection of the facial region, we use the AdaBoost algorithm, which has recently been widely used in face detection due to its accuracy and speed [5]. It extracts Haar-like features that characterize the facial region from all possible rectangles obtained from a given image. Once a facial region is obtained, the mouth region is localized using edge information. The detection results may include some noise, which is filtered out by connected-component analysis.
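The following Python/OpenCV sketch illustrates a pipeline of this kind; it is not the authors' implementation, and the lower-third mouth search region and the Canny thresholds are assumptions made for illustration.

```python
import cv2
import numpy as np

# bundled Haar cascade shipped with the opencv-python package
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_and_mouth(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = faces[0]                              # first detected face
    lower = gray[y + 2 * h // 3: y + h, x: x + w]      # assume mouth lies in lower third
    edges = cv2.Canny(lower, 80, 160)                  # edge information
    # connected-component analysis filters out small noise blobs, as in the paper
    n, _, stats, _ = cv2.connectedComponentsWithStats(edges)
    if n <= 1:
        return (x, y, w, h), None
    biggest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    mx, my, mw, mh = stats[biggest, :4]
    return (x, y, w, h), (x + mx, y + 2 * h // 3 + my, mw, mh)
```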

Facial Feature Recognition<br />

Let ρ denote the orientation of the facial region. Then ρ can be calculated by minimizing the following inertia:

$$I(\rho)=\frac{1}{|F|}\sum_{(x,y)\in F}\left[(y-\bar{y})\cos\rho-(x-\bar{x})\sin\rho\right]^{2}\qquad(1)$$

where $F$ is the set of pixels of the facial region and $(\bar{x},\bar{y})$ is its centroid; the minimizing angle is $\rho=\tfrac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}\right)$, with $\mu_{pq}$ the central moments of $F$.

If the value of ρ is less than 0, this means that the user nods<br />

their head slanting to the left. Otherwise, it means that the<br />

user nods their head slanting to the right.<br />

To recognize the mouth shape in the current frame, template matching is performed: the current mouth region is compared with a set of mouth shape templates. These templates are obtained by K-means clustering of 114 mouth images. After localizing the mouth in the current frame, we first normalize the mouth region, calculate its matching score for all templates, and pick the template with the best matching score.
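A minimal sketch of such normalized template matching is given below; the fixed patch size and the template labels are illustrative assumptions rather than details taken from the paper.

```python
import cv2
import numpy as np

TEMPLATE_SIZE = (32, 16)   # assumed (width, height) of the normalized mouth patch

def classify_mouth(mouth_gray, templates):
    """templates: dict mapping a label (e.g. 'go', 'stop') to a grayscale template."""
    patch = cv2.resize(mouth_gray, TEMPLATE_SIZE).astype(np.float32)
    patch = (patch - patch.mean()) / (patch.std() + 1e-6)      # normalization
    best_label, best_score = None, -np.inf
    for label, template in templates.items():
        t = cv2.resize(template, TEMPLATE_SIZE).astype(np.float32)
        t = (t - t.mean()) / (t.std() + 1e-6)
        score = float((patch * t).mean())                      # correlation score
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```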

MONOCULAR VISION-BASED NAVIGATION<br />

In this module, all information about the environment in which the wheelchair is positioned is represented in the form of an occupancy map. An MLP is then used to capture the visual characteristics of the occupancy maps for the different directions.

Obstacle Detection<br />

Each cell in the occupancy map models the risk of the corresponding area by its gray level, so we first design the map to fit the environment. For the occupancy map generation, we estimate the background model by simple online learning and compare it with every frame received from a CCD camera, thereby classifying the current frame into background and non-background regions. Here we use a simplified version of the background detection method presented in [4]. The background color is estimated from only a reference area rather than the whole image. The input image is filtered by a 5×5 Gaussian filter to reduce noise and transformed into the HSI color space. From the reference area, two color histograms are calculated, for hue and intensity. These histograms are accumulated over the most recent five frames and used as the background model. The background model is continuously updated as new frames arrive. Once the background model is obtained, the classification is performed: if the intensity and hue of a pixel are below the thresholds, the pixel is considered an obstacle. In this paper, the hue and intensity thresholds are set to 60 and 80, respectively. Based on the background classification results, an occupancy map is produced, where each cell is allocated to a walking area and has a different gray level according to the occupancy of obstacles, as shown in Fig. 2.

Here, 10 gray levels are used according to the risk: the gray level of a cell is determined by 1/10 × (number of pixels classified as obstacles), and a gray value is assigned to each cell according to its risk. The brighter a grid cell, the higher the obstacle density.

Figure 2. Examples of generated occupancy maps: (a) input image, (b) obstacle classification results, (c) occupancy maps.
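A hedged Python sketch of this step is shown below. It follows the appearance-based idea of [4]: a reference strip is assumed to be free floor, hue/intensity histograms accumulated over the last five frames serve as the background model, and pixels whose bins have little support are treated as obstacles. Using HSV as a stand-in for HSI, the bottom-strip reference area, and applying the paper's thresholds (60, 80) to histogram counts are all assumptions made for illustration.

```python
import cv2
import numpy as np
from collections import deque

GRID_W, GRID_H, LEVELS = 32, 24, 10
hist_buffer = deque(maxlen=5)                 # background model: last five frames

def occupancy_map(frame_bgr, reference_rows=slice(-40, None)):
    blurred = cv2.GaussianBlur(frame_bgr, (5, 5), 0)            # 5x5 Gaussian filter
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)              # HSV as stand-in for HSI
    hue, value = hsv[..., 0], hsv[..., 2]

    ref = hsv[reference_rows]                                   # assumed free-floor strip
    h_hist = cv2.calcHist([ref], [0], None, [180], [0, 180]).ravel()
    v_hist = cv2.calcHist([ref], [2], None, [256], [0, 256]).ravel()
    hist_buffer.append((h_hist, v_hist))                        # accumulate over 5 frames
    bg_h = sum(h for h, _ in hist_buffer)
    bg_v = sum(v for _, v in hist_buffer)

    # a pixel with little histogram support is classified as an obstacle
    obstacle = ((bg_h[hue] < 60) | (bg_v[value] < 80)).astype(np.float32)

    # cell gray level ~ fraction of obstacle pixels, quantized to 10 levels
    cells = cv2.resize(obstacle, (GRID_W, GRID_H), interpolation=cv2.INTER_AREA)
    return np.floor(cells * (LEVELS - 1)).astype(np.uint8)
```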

Path generation<br />

We try to automatically extract scene contexts from the real-time video stream and use them to determine viable paths through machine learning. Here, we use an MLP to automatically capture important scene contexts among occupancy maps for different paths, as it integrates feature extraction and classification in its own architecture. Path generation is performed in two steps: an off-line learning stage and an on-line recognition stage.

Off-line learning stage<br />

In the off-line learning stage, the proposed system learns the visual properties of the occupancy maps for each direction using an MLP, so that it can recommend viable paths. In the MLP, the input layer receives the gray values of the cells of the 32×24 occupancy map. The output value of a hidden node is then obtained from the dot product of the vector of input values and the vector of weights connected to that hidden node; this is then presented to the nodes of the next layer. Although various learning techniques can be used for multi-layered networks, this study used back-propagation, where the output values are compared with the correct answer during network training to compute the value of the error function. In our system, the input layer is composed of 769 nodes and the output layer of four nodes, each of which corresponds to one of four directions {Go straight, Stop, Turn Left, Turn Right}.

On-line recognizing stage<br />

After training, the MLP is used to make decisions on the online stream. As the value of an output node is a floating-point number ranging from 0 to 1, a threshold value is required to decide on viable paths. Here, a threshold value of 0.7 was used for the MLP output nodes. Therefore, if the predicted score of an output node was larger than 0.7, the direction corresponding to that node was selected.
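A minimal numpy sketch of this decision step is given below; the hidden-layer size and weight initialization are assumptions (the paper does not state them), and only the forward pass is shown, with training assumed to use standard back-propagation. The 769 inputs are read here as the 768 cells of the 32×24 occupancy map plus one bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 769, 30, 4           # 30 hidden nodes is an assumption
W1 = rng.normal(0.0, 0.1, (N_IN, N_HID))  # placeholder weights; real ones come from training
W2 = rng.normal(0.0, 0.1, (N_HID, N_OUT))
DIRECTIONS = ["go_straight", "stop", "turn_left", "turn_right"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def viable_paths(occupancy_cells):
    """occupancy_cells: 32x24 array of gray levels 0..9 from the occupancy map."""
    x = np.append(occupancy_cells.astype(np.float32).ravel() / 9.0, 1.0)  # + bias input
    hidden = sigmoid(x @ W1)              # dot product of inputs and hidden weights
    outputs = sigmoid(hidden @ W2)        # one score in [0, 1] per direction
    return [d for d, score in zip(DIRECTIONS, outputs) if score > 0.7]
```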

EXPERIMENTS AND RESULTS<br />

To assess the effectiveness of the proposed system we performed several experiments. Experiments I and II were designed to measure the accuracy of our two main modules, reporting the accuracy of the interface and of the navigation, respectively. Experiment III was designed to assess the overall effectiveness of the system, so its performance was compared with that of an existing method.

Experiment I: To measure the accuracy of wheelchair<br />

control interface<br />

For the proposed wheelchair control interface to be practically usable in real environments, it should be robust to various illuminations and cluttered backgrounds. Thus, the proposed interface was tested indoors and outdoors, as well as across both environments.

Fig. 3 shows the facial feature detection and recognition<br />

results. As seen in Fig. 3, the proposed method accurately<br />

detected the face and mouth, confirming the robustness to<br />

time-varying illumination, and low sensitivity to a cluttered<br />

environment.<br />

Figure 3: Face and mouth detection results<br />

Table 1 shows the recognition rates of the proposed interface for the respective commands. The proposed interface shows a precision of 100% and a recall of 96.5% on average. Thus, this experiment proved that the proposed interface can accurately recognize the user's intention in real time.

Commands Recall (%) Precision (%)<br />

Turn Left 98 100<br />

Turn Right 94 100<br />

Go straight 96 100<br />

Stop 98 100<br />

Table 1. Performances in recognizing users’ commands<br />

Experiment II: To measure the accuracy of monocular<br />

vision-based navigation<br />

To fully support the mobility of severely or cognitively disabled people, a navigation system that automatically detects obstacles and avoids them is necessary, so we developed a new monocular vision-based navigation method using machine learning. To be practically usable in real environments, it should detect a variety of obstacles and be robust to situational effects such as place types and lighting conditions.

Thus, it was tested indoors and outdoors, at daytime and at night. Fig. 4 shows the results of detecting obstacles under various conditions, where the 1st to 4th columns show detection results for static obstacles indoors, and the last two columns show results for moving obstacles outdoors. In more detail, the 1st column shows the detection result for a static, thin, floating obstacle, and the 2nd column shows the result for detecting a static thick obstacle; these images were taken at daytime. The 3rd and 4th columns in Fig. 4 show the detection of a thin and small obstacle at night-time. Finally, the 5th and 6th columns show the results for detecting moving obstacles at daytime and night-time, respectively.

For a given input image (as shown in Fig. 4(a)), the obstacle detection results and generated occupancy maps are shown in Figs. 4(b) and (c), respectively. As shown in Fig. 4(c), the proposed system can accurately detect a variety of obstacles under several illumination conditions.

Figure 4. Obstacle detection results: (a) input images, (b) background detection results, (c) occupancy maps.

Table 2 summarizes the performance of our navigation system under various conditions. Although there are some differences, it showed an accuracy of 90% on average. Among the four test groups, the accuracy for Type 2 was lowest. The Type 2 experiments were performed in a shopping mall, where the marble-textured background caused strong reflections and the scene was heavily cluttered with people and stores. However, despite these problems, our system could generate viable paths that avoid collisions with obstacles.

Environments       Indoor                        Outdoor
                   Type 1        Type 2          Type 3        Type 4
Accuracy (%)       91            87              93            89
(Type 1: underground; Type 2: shopping mall; Type 3: road; Type 4: footway)
Table 2. Performance in determining viable paths

Experiment III: To prove the effectiveness of our monocular vision-based navigation by comparison with another method
To assess the validity of the monocular vision-based navigation module, its performance was compared with that of another method. Here we adopt VFH [7], as it is the most commonly used method in autonomous navigation, as mentioned in Section I (related work).

Fig. 5 shows the performance comparison of the two methods indoors and outdoors with time-varying illumination. Fig. 5(a) shows the results of the two methods under time-varying sunlight at daytime, and Fig. 5(b) shows the results under artificial lights at night-time. As can be seen in Fig. 5, the proposed method showed better performance in all cases, regardless of place type and illumination conditions. On average, the proposed method can generate avoidable paths with an accuracy of 92%, whereas VFH has an accuracy of 79%; the proposed method thus improves the accuracy by 13 percentage points.

Figure 5. Performance comparison of our system and VFH under various lighting conditions: (a) comparison under time-varying sunlight, (b) comparison under artificial lights (curves: proposed method outdoor, proposed method indoor, VFH outdoor, VFH indoor).

                   Indoor                        Outdoor
                   Daytime       Night-time      Daytime       Night-time
Proposed method    8%            10%             11%           13%
VFH                31%           30%             46%           48%
Table 3. Collision rate of the proposed method and VFH

The most important role of a navigation system is to prevent collisions, so performance should also be evaluated in this respect. Table 3 shows the collision behavior of the two methods when moving towards a goal: the proposed method detected collisions and stopped with an accuracy of 89%, whereas VFH showed an accuracy of just 61%.


As shown in Fig. 5 and Table 3, the numerical comparisons show that the proposed method provides safer mobility than VFH and is robust to situational effects such as illumination conditions and place types. Moreover, the average time taken by the proposed method to process a frame was about 56 ms, allowing it to process more than 17 frames/s; the proposed method was about 22 ms faster than VFH. Consequently, the proposed method improves collision detection and the prediction of avoidable paths compared with the existing method, thereby providing the wheelchair with safe navigation in real environments.

CONCLUSIONS<br />

In this paper, we developed a vision-based robotic wheelchair using human gestures and scene contexts. The advantages of the proposed system include the following: 1) our wheelchair control interface requires minimal user motion, namely face inclination and mouth shapes, making the proposed interface more suitable for the severely disabled; 2) by using scene contexts as well as obstacle density, our monocular vision-based navigation provides the wheelchair user with safer mobility in unknown environments; 3) the approach is also applicable to other mobile robots and assistive devices, such as ETA (Electronic Travel Aid) systems for visually impaired people, to support their safe mobility.

To prove these advantages, several experiments were performed indoors and outdoors under various situational conditions, and the system's performance was compared with an existing method. The results showed the efficiency and effectiveness of the proposed robotic wheelchair.

ACKNOWLEDGMENT
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2010-C1090-1001-0008).

REFERENCES<br />

1. Ju, J.S. Intelligent wheelchair interface using face and mouth recognition. Proc. of the International Conference on Intelligent User Interfaces, ACM, 2009.
2. DeSouza, G.N. and Kak, A.C. Vision for mobile robot navigation: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 237-267, 2002.
3. Mazo, M. and Garcia, J.C. Experiences in assisted mobility: the SIAMO project. Proc. of the IEEE Conference on Control Applications, 2002.
4. Ulrich, I. and Nourbakhsh, I. Appearance-based obstacle detection with monocular color vision. Proc. of the AAAI National Conference on Artificial Intelligence, 2000.
5. Viola, P. and Jones, M.J. Robust real-time face detection. International Journal of Computer Vision, 57(2), 137-154, 2004.


MetaBrain: Web Information Extraction and Visualization<br />

João Teixeira Gabriel Barata Daniel Gonçalves<br />

Department of Computer Science and Engineering, IST<br />

Av. Rovisco Pais, 1000 Lisbon<br />

{joao.teixeira,gabriel.barata}@ist.utl.pt, daniel.goncalves@inesc-id.pt<br />

ABSTRACT<br />

Nowadays, the web is a huge source of information on different branches of knowledge. This knowledge, however, is dispersed across many sites, making it difficult to interrelate and understand. In the past few years some approaches have been developed to ease the extraction of this information, from Open Information Extraction to simpler data mining. Usually these solutions work as standalone applications, are developed from scratch, and are brittle and very sensitive to changes in the data sources. This makes it difficult for the end user to fully explore the potential of using different algorithms together to better extract and analyze information. In this paper we propose a new approach in which users can create their own personalized information extractors and visualizations, without needing to type a single line of code, in an easy and highly flexible manner using a special-purpose interface. Since raw data is often difficult to understand, we also study how the user can create customized visualizations of the extracted data with low effort. A prototype of this concept, MetaBrain, has been implemented and tested. A preliminary heuristic evaluation demonstrates favorable results for the concept.

Author Keywords<br />

Information Extraction, visualization, user interaction.<br />

ACM Classification Keywords<br />

H.5.2 User Interfaces - Graphical user interfaces (GUI),<br />

H.5.m Miscellaneous.<br />

INTRODUCTION<br />

The versatility of the web is also its biggest problem. Since anyone is free to create their website in any way they want, there is no unifying structure for all this information. More than a huge repository of knowledge, the web contains a whole set of hidden, implicit information. The way people express their thoughts reflects a collective unconscious of trends and patterns which are not obvious at first sight. Which color does the Internet associate with the term apple? Surprisingly, white is the color that most frequently co-occurs with apple in web pages, followed by red and green. Apple Inc. and Snow White may be to blame for this.

Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA.

Traditionally, Information Extraction (IE) focuses on extracting information from specific pre-defined domains. Changing domains implies that new extraction rules need to be manually created, making it hard to scale. Manually querying search engines in order to extract large quantities of information is also not the right approach, since it is tedious and error-prone, as pointed out in [6]. A possible solution to this problem is the use of Open Information Extraction [2], which states that a large proportion of relationships are expressed through a compact set of relation-independent lexico-syntactic patterns. This is only one of several techniques [3,5,7] which allow the extraction of information from the Web using only statistics and probabilities.

Although many new tools for web IE have recently appeared, these tools are usually designed to use a single type of IE technique, with no possibility of interaction with others. It may be in the best interest of the user to use different IE techniques simultaneously, thus discovering hidden and unexpected patterns in apparently unrelated data: for example, automatically extracting a list of operating systems and seeing how popular each one is on different search engines or social networks, for different kinds of users. Another problem found in these tools is that most are developed from scratch. Currently, there is no unified framework with different IE modules available for programmers or other users to use as a basis for their IE tools. Also, state-of-the-art tools like TextRunner [1] lack advanced search options, such as selecting the search engine to use or exporting the retrieved data. These options may be important for advanced users.

Our research aims at finding ways for normal web users to access the collective unconscious that is the Internet. Given the giant number of possible extraction scenarios, this can be a very complex and difficult task. Our efforts were directed at creating the best interface to make this task as easy as possible. Since the raw data produced by these techniques is at times difficult to understand, we also analyzed several information visualization techniques, from simple bar charts to hierarchical tree-maps, with the objective of creating a good and easy way for the user to create and export customized visualizations.


In the next sections, we detail how we extract information from the web. Then we explain the design and interaction decisions behind our solution prototype. This is followed by an analysis of the results of the prototype's heuristic evaluation. Finally, we conclude with our final remarks and discuss future work.

CREATING CUSTOMIZED IE SOLUTIONS<br />

There are different approaches to extract information from<br />

the web without the use of complex natural language<br />

parsers. Different algorithms use different features to<br />

extract the information. Generally, we find three different<br />

classes of approach that use: number of results found for a<br />

given query [9]; lexico-syntactic patterns [5,6]; and word<br />

co-occurrence [8]. Next we’ll see how we can use these<br />

different classes together to create customized IE tools.<br />

Selected Information Extraction approaches<br />

The number of results can be used as a way to identify the<br />

popularity of one or more concepts on the Internet, and also<br />

to measure the validity of extracted data. For example, if<br />

“fishing water” has more results than “fishing wall” then<br />

fishing is probably more related to water than to a wall.<br />
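As a toy illustration of this heuristic, the sketch below compares two candidate terms by the number of results their co-occurrence queries return; result_count is a hypothetical stand-in for whatever search-engine API provides hit counts.

```python
def more_related(concept, candidate_a, candidate_b, result_count):
    """Return the candidate whose co-occurrence query yields more results."""
    hits_a = result_count(f'"{concept} {candidate_a}"')
    hits_b = result_count(f'"{concept} {candidate_b}"')
    return candidate_a if hits_a >= hits_b else candidate_b

# e.g. more_related("fishing", "water", "wall", result_count) -> "water"
```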

By using lexico-syntactic patterns like C {","} "such as" IList, where C is a concept and IList is a list of instances of that concept, it is possible to generate special queries to use in search engines that are able to map concepts to instances or instances to concepts.
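As a rough illustration (not the MetaBrain library itself), the sketch below turns such a pattern into a search query and mines the returned snippets for instances of the concept; fetch_snippets is a hypothetical stand-in for the search-engine API actually used.

```python
import re

def extract_instances(concept, fetch_snippets):
    """Mine 'C such as x, y and z' snippets for instances of the concept C."""
    query = f'"{concept} such as"'                         # pattern-based query
    pattern = re.compile(
        rf'{re.escape(concept)}\s*,?\s+such as\s+([^.;]+)', re.IGNORECASE)
    counts = {}
    for snippet in fetch_snippets(query):
        for match in pattern.finditer(snippet):
            for item in re.split(r',|\band\b', match.group(1)):
                item = item.strip().lower()
                if item:
                    counts[item] = counts.get(item, 0) + 1
    # instances seen more often are more likely to be valid
    return sorted(counts, key=counts.get, reverse=True)
```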

Recent work has demonstrated the validity of using term co-occurrence for opinion mining [7,8]. With the rise of micro-blogging, it is now possible to more easily extract the general Internet opinion on a given concept by looking at which words co-occur with that concept.

Putting It All Together<br />

Each one of these approaches is a way to extract a different<br />

type of information, so it would be good if we could use<br />

them together or alone, depending on what we want to<br />

extract. We can think of each one of these as a different<br />

search module. If we would like to extract a list of cities<br />

and then check their popularity online, instead of manually<br />

executing two different searches it would be good to create<br />

a single search query for the whole extraction.<br />

Because these modules are domain-independent, it is a matter of defining a way to direct one module's output to another's input. In order to do this we can standardize all three modules' main input as a single query parameter and their output (result set) as a table (Figure 1), where the rows represent the different extracted items and the columns represent the extracted information (the primary column) and some auxiliary attributes of the extraction. Looking only at the primary column of a result set, we get a list of results which can be iterated over by another search module as its input parameter. This way it is possible to easily create multi-level search queries, as illustrated in the sketch below; Figure 1 also shows the result of a multi-level search.
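The following sketch shows the standardized module interface just described: every search module takes a single query string and returns a result set whose primary column can be iterated by the next module. The module names in the usage comment are hypothetical stand-ins for the library's own modules.

```python
from typing import Callable, Dict, List

ResultSet = List[Dict[str, object]]   # each row has a primary "extracted" column

def multi_level(query: str,
                first: Callable[[str], ResultSet],
                second: Callable[[str], ResultSet]) -> ResultSet:
    """Feed each value of the first module's primary column into the second."""
    rows: ResultSet = []
    for row in first(query):                  # e.g. extract city instances
        value = str(row["extracted"])
        for sub in second(value):             # e.g. number of results per city
            merged = {"extracted": value}
            merged.update({k: v for k, v in sub.items() if k != "extracted"})
            rows.append(merged)
    return rows

# usage: multi_level("cities", extract_by_domain, count_results) would yield
# one row per extracted city together with its search-engine hit count.
```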


Figure 1. Left: result set for an extraction of city instances.<br />

Each row represents an extracted city, which is presented<br />

on the Extracted column, the table’s primary column;<br />

Right: result set for the number of results found for the<br />

different cities extracted on the left table.<br />

A prototype library was implemented with these capabilities, together with the possibility to customize each search's parameters (thresholds, search engine, etc.). Several search engines can be used, including social networks. A modular approach was used to create this library so that it can be easily extended with new search engines, IE algorithms, or simple web service APIs. Also, since some IE modules sometimes need to perform thousands of search queries, a cache system was developed to make the searches faster when possible. The direct use of this library still requires programming skills. Hence, we developed a special-purpose interface, MetaBrain, which allows even non-programmers to perform IE and visualization tasks in a more natural way.

METABRAIN PROTOTYPE<br />

With the library complete, we started looking into how we could create a GUI simple enough to allow regular Internet users to interact with it, without neglecting the advanced options required by expert users. With this in mind, we decided to use HTML and JavaScript, in order to create a very dynamic interface with standards-compliant technology that is also easy to connect with our Python library. We want users not only to extract information but also to create meaningful visualizations of the raw data. All visualizations were implemented using the Protovis framework [4].

Data Set Creation<br />

Since the use of IE tools may not be common to most users,<br />

our efforts were to simplify every possible step of the<br />

extraction process, without disregarding the needs of<br />

advanced users. By default all customization options are<br />

hidden, although easy to access, and preset to a default<br />

value. This way the only thing needed is for the users to<br />

select what they want to extract. They can choose, and at<br />

any time change, between the different available extraction<br />

modules. These modules allow for the same type of IE<br />

previously discussed plus easy access to public API<br />

services, such as location to geographic coordinates and<br />

search engine suggestions. Each module is accompanied by<br />

a quick description of its purpose and a series of possible<br />

input examples with explanations.<br />

The design philosophy we follow is to show only relevant information in the interface, so, by default, there is only one input section visible to the user. This reduces the visual noise the user has to deal with to complete the task. For a simple one-level IE the process is very straightforward: select the IE module to use, input the query parameter and search. For example, if the user wishes to extract from the Internet a list of zodiac signs, he just needs to select the Extract by Domain module and use zodiac signs as the search query. By doing this, a list of extracted zodiac signs is presented to the user, as seen in Figure 2b.

Figure 2. a) List of available extraction modules for the first input. b) Example of an extraction of the zodiac signs. c) Example of a multi-level search query. The final result will be the popularity, on the selected search engine, of every extracted city.

If the user wishes to create a multi-level search query, the interface will evolve during the process, along with the user's needs. If, at any time, the user chooses to use the result of one search as a term in another, the interface will dynamically add a new input section where the second search query can be defined. These secondary input sections are called variables and have the form %1, %2, etc. Graphically, every new query to obtain the values for each variable appears below the one in which it is used, and one level deeper in the interface (Figure 2c). This helps users to effectively resort to several variables at once without getting lost or confused.

In order to minimize the number of errors and avoid wasting the user's time, before initiating the final search query, which may take from a few seconds to minutes or hours, it is possible to do a preview search on a smaller scale. This way, the user gets a quick glimpse of the kind of results returned by the current query and can make any adjustments necessary before starting the real, long search.

To increase the possibilities of query creation, it is also possible to create data sets by importing the user's own personal data (CSV files) through our prototype. Before the data is imported, it is scanned and MetaBrain tries to guess what type of data is in each column (text, numbers, coordinates, etc.). Our guesses are then shown to the users so they can confirm them and make any changes necessary; a sketch of this step is given below. We will discuss the importance of this type of information in the next section.
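A hedged sketch of such column-type guessing follows; the sampling size and the "lat,lon" coordinate heuristic are assumptions made for illustration, not details of MetaBrain's actual implementation.

```python
import csv

def guess_column_types(path, sample_rows=50):
    """Scan a CSV file and guess a type per column: number, coordinates or text."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = {name: [] for name in header}
        for i, row in enumerate(reader):
            if i >= sample_rows:                      # only sample the first rows
                break
            for name, value in zip(header, row):
                columns[name].append(value.strip())

    def guess(values):
        def is_number(v):
            try:
                float(v)
                return True
            except ValueError:
                return False
        if values and all(is_number(v) for v in values):
            return "number"
        if values and all(v.count(",") == 1 and
                          all(is_number(p) for p in v.split(",")) for v in values):
            return "coordinates"                      # assumed "lat,lon" format
        return "text"

    return {name: guess(vals) for name, vals in columns.items()}
```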

Visualization<br />

Now that we have a good and flexible approach that allows even non-programmers to do customized IE from the Web, the next step is to provide them with the possibility to visualize this information in a more meaningful way than the one provided by simple tables. We started by identifying a set of requirements we would like the visualization creation process to follow:

• Since the table of extracted information has multiple columns, the user must be able to choose which columns she or he wants to visualize.

• The user should be able to choose from several different types of visualizations, from bar charts to sunbursts or even maps.

• All visualizations must have their own set of configuration options: bar width for bar charts, color palette for sunbursts, etc.

• During this process it must be easy to change between different visualization types while maintaining the user's previously selected preferences, if these are applicable to the new type.

• The user must always be able to preview the visualization being created. Configuration changes to the current visualization should be applied instantly, without the need to refresh.

Taking all these requirements into account, we decided to divide the visualization process into three steps: choose the data to visualize (which columns); choose the visualization type; preview and configure the visualization.

To address the first requirement, we decided to let the user choose which columns to visualize by using a drag-and-drop metaphor. On the left side of the application a vertical list of names is visible. These are the names of the different columns in the selected data set, divided by the type of data they contain; this division makes column selection easier for the user. To the right of this list are two large horizontal boxes representing the visualization's axes. The user is then able to drag columns from the left list and drop them in the axis input boxes. During the drag procedure these boxes are highlighted, making the user aware of valid drop targets. We decided to use two axes after concluding, in a study, that all the different visualizations we wanted to implement required at least two degrees of freedom.


The list of available visualizations starts empty. While the user makes column selections, these (the columns selected, their data type and their position on the axes) are used to determine which visualizations are available for the selected data. This way we can minimize errors such as the user choosing a map visualization when no geographical data is selected. When the user has finished selecting the columns and has chosen the visualization, a preview is instantly created. Also, next to the visualization, a list of configurable options (colors, scale, canvas size, etc.) appears with their default values selected. After changing any of these option values, the preview is instantly refreshed. At any time during this process the user can change the selected columns or choose a different visualization. An example of a visualization being created is shown in Figure 3. When users are satisfied with their visualization they can embed it into their website by copying a piece of code into any webpage, much like embedding a YouTube video.

HEURISTIC EVALUATION<br />

In order to test our solution we conducted a heuristic<br />

evaluation of MetaBrain, using Jakob Nielsen’s usability<br />

heuristics 1 . After a quick introduction to the purpose of our<br />

work, four usability experts proceeded to freely test the<br />

prototype for a few minutes and then received a list of four<br />

tasks to execute. In two the users were asked to extract<br />

information from the web, from given domains, and in the<br />

other two to craft specific visualizations for that<br />

information. All were successfully completed by all users.<br />

Overall, only ten usability problems of relevant severity were identified. Most were related to the data extraction interface, especially to the fact that some search queries took several minutes to finish with no indication of progress, only a looping loading sign. This problem has been solved by adding to the search interface the number of queries to be performed and how many have already been completed. All evaluation experts enjoyed the clean and

minimalistic design and the dynamic way in which they<br />

could interact with the system. After completing the tasks,<br />

some wanted to keep playing with the system, curious about<br />

what other information MetaBrain would be able to extract.<br />

This preliminary evaluation allowed us to find and correct<br />

some usability problems. It is indicative that the interface<br />

can be effective and easy to use. Further validation of this<br />

will be provided by upcoming, more formal, user tests,<br />

where we’ll take into account the number of errors and time<br />

taken to complete the tasks.<br />

CONCLUSION<br />

We have presented an interface that allows users to extract and visualize information from the web in meaningful ways. Unlike previous research, we strove to make this task as simple and flexible as possible, so that any type of user, from less to more experienced, can create customized solutions that fit their needs. A preliminary evaluation of our prototype, MetaBrain, showed positive results. Further user studies will allow us to better validate our choices.

Figure 3. Creation of a map visualization, showing Portuguese cities and their respective population size.

1 http://www.useit.com/papers/heuristic/heuristic_list.html

REFERENCES<br />

1. Banko, M., Cafarella, M.J., Soderland, S., Broadhead,<br />

M., and Etzioni, O. Open information extraction from<br />

the web. In Proc. of the IJCAI 2007.<br />

2. Banko, M. and Etzioni, O. The Tradeoffs Between<br />

Open and Traditional Relation Extraction. In Proc. of<br />

ACL-08: HLT, 28-36.<br />

3. Bollegala, D., Matsuo, Y., and Ishizuka, M.<br />

Measuring semantic similarity between words using<br />

web search engines. In Proc. WWW '07, ACM Press<br />

(2007), 757-766.<br />

4. Bostock, M. and Heer, J. Protovis: A Graphical<br />

Toolkit for Visualization. In Proc. IEEE TVCG, 15<br />

(2009), IEEE CS (2009), 1121-1128.<br />

5. Cimiano, P. and Staab, S. Learning by googling.<br />

SIGKDD Explor. Newsl., 6 (2004), 24-33.<br />

6. Etzioni, O., Cafarella, M., Downey, D., et al. Web-scale information extraction in KnowItAll (preliminary results). In Proc. WWW '04, ACM (2004), 100-110.

7. Kramer, A.D. An unobtrusive behavioral model of<br />

gross national happiness. In Proc. CHI '10, ACM<br />

(2010), 287-290.<br />

8. Ku, L., Lee, L., Wu, T., and Chen, H. Major topic<br />

detection and its application to opinion<br />

summarization. In Proc. SIGIR '05, ACM (2005), 627-<br />

628.<br />

9. Turney, P.D. Mining the Web for Synonyms: PMI-IR<br />

versus LSA on TOEFL. Machine Learning: ECML<br />

2001, Springer Berlin (2001), 491-502.



������ ���� ���������� ������������ ����� ��� �����������<br />

����� ������ ������ ��� ���� ����� �� �� �� �������� ���� ����<br />

����������� ���������� ������������ �������� ��� ��������<br />

���� ������ ����� ����������� ��������� ���������� ��� �����<br />

������ ����� ���������� ��� ������� ��� ���� �� ������� ���<br />

������� ��� �������� ���� ��������� ��� �������� ���� �����<br />

������ ������� ���� ����� �������� ����<br />

���� ��������� ������� ����������<br />

���� �� ��� ������ �������� ������� ��������� ������� �����<br />

������� ���������� ���� ����� ������� ��� ������ �� ���������<br />

��� ����� ��� ������������� ���� ��� ������� �������� ���� ���<br />

����� ��� ����������� ������ ���� ������� ��������� �����<br />

���������� ��� ��� �������� ������ �� �������� ���� ��� ������<br />

54<br />

�� ������� ������� ��� ����� �������� ������������ ������� ��<br />

�������� �� ��� ����������� ����������� ��� ������������� ���<br />

����������<br />

�� ���� ������ ������������ ������� ��������� ������� ���<br />

��������� ���� �� ���� ���� ��� �������� ���������� ��� ���<br />

���� � ���������� ��� �� �������� ��� ����������� ��������<br />

����� ������� ���������� ������� ���� ����� ��� ��������<br />

������� ������� ���� ��� ������� ��������� �������� ���������<br />

���� �� � ���������� �������� �������� ��� ������ ����� ��� ��<br />

����� ��������� ���� ��� ��������� �������� ����� ����� ���<br />

���� ���������� ������������ ��� ���� �� ������� ��� ������<br />

���� � ��������� ������ ������ ��� ���� ��������� ��� ����<br />

��������� ���� ���� �������� ����� ����� ��������� �������<br />

������� ������ ����� �� ��� ��������� ������� �� �������<br />

��� ������������ �� ��������� ���� ������������� ������� ��<br />

��������� �������������<br />

���������� ��� ��������<br />

��� ��������� ������� ��������� ����� ���� ������� � ����<br />

������ ������ �� �� ���������� ���� ��� ���� ����� ��� �������<br />

������ ��������� ���������� ��� ����� ������ ������� ����<br />

���� ���������� �� ��������� ���������� ���� ��� �������<br />

���� ������� ��������� ��� ���� �� ���� ���� ��������� ����<br />

��� ���� ���������� ��������� ���������� ��� ���� ���� ���<br />

���� �������������� �������� �� ��� �� ��������� �� � ������<br />

��� ���� ���� ���� ���� ������������<br />

��� ����������<br />

����������� �� ��� ���� ������ ���� ��� ������� ���������<br />

��������� ��� �� ���������� ��� ����� ����� ����� ������� ��<br />

���� �� ��� ����� ������ �� ������� ��������� ���������� ���<br />

��������� ���� ��� ����� ��� �� ��� �������� �� ��������� ����<br />

��� ����� ����� ��� ������ ����������� ����������� �������� ����<br />

���� ��� �� ���� �� ����� ����� ����� �� � ��� ������� ��� ��<br />

��� ����������� �� �������� ���������� ���� �� ����������<br />

����� ��� ���� ��� ������� ������ ��� ������� ��������� ������<br />

����������� ����������� ���� ��� ������� ���������� ���������<br />

����� ���������� �������� ���� ���� ��� ����������� ���<br />

���� �� ���� ���� �� ��� �� ��� ����������� ����������� ���<br />

�������� ������� ���������� �� ������ ��������� ��� ������ ��<br />

���� ����� �� ������� � ����������� ���������� ������ ���� ���<br />

�� �������� �� �� ��� ������ ����<br />

������ �����������<br />

��� ��� �� � ������ ���� ���� ��������� ����������� ���� ��<br />

������� �� �������� ������ �� �������� ����� �� �� ��������<br />

������ ���� � �������� ������� �� � ��� ���� �� ����� ��� ������<br />

��� ��� ��� ������ �� �������� ���� ��������� ��������� ��<br />

���� �� ������ �� ��� ������ ��������� ��������� ��� ���<br />

��� ��� ����� ��������� ��������� �� � ��� ��� � ���������<br />

���������� ����� ��� ��� ���� ��� ����� ��� ��� ���������<br />

������� ���������� �������� ����� ��� �������� �������� ���<br />

����� ������ ��� ������ ��������� ��� ��������� ���� �� ���<br />

������� ���� ������<br />

�������� ����������<br />

��� ��� �� ���������� �� �� ���� ��� �� ������� ������<br />

���� ����� ��� ��������� ���� � ���������� �������� �����<br />

��� �� ������������� ������� ��� ���������� ������� �����


������ �� ��� ������� ����<br />

�� ������ �� ��� �������� ���� ���� ������ ��� ������� ��<br />

��� ����� �� ��� �������� ����� ��� ���������� ���� ��� �����<br />

��� ���������� ����� ���� ���� �� �� ���������� ������ ���<br />

������ ��������������<br />

����������� �����������<br />

���� ��� ���� ����������� ��� ��� ������ �������� �����<br />

���� ��������� �� ���� ������������<br />

��� ���� �� ���� ��� ���� ���� � �������� ���� � ��� ���� ��<br />

������ ���� ����� ������� �� ��� ������� ���� ������� ����<br />

����� ������ �� ��� ��� ���������� ������� �� ��� ������� ���<br />

������� ���� �� ������ �� ��� �������� �������� �� ��� ����<br />

��� ������� �� ��� ����� ���� ��� ���� ������ ���� ��������<br />

��� ����������� ������� ��� ��� ��� ����� ���� �������� ���<br />

���� �� ��� ������ �� ��� ������� ����� �������� ��� �������<br />

��������� ������� �� ��� ������� ������� � ������ ��������<br />

���� ��� ��� ����� ������� ���� �� ������ ������ �������� ���<br />

�������<br />

��������� ������� ����� ���� ���������� ������ �� ���������<br />

��������� ��� ��� ���� �� ���� ������������ ��� ���� ������<br />

������ ��� ���� ���� ��������� ��������� ������ � ����� � ����<br />

����� �������� ��� ������ ��� ������ ��������� � �������<br />

������ ����� ���� ��� ������ ���������� �� ������� ������ ����<br />

������� ������������ �� ������� ����� ��� �� ��� ���� �� ���<br />

�������� �������� ��� ������� ������� ����� �� ������� ���<br />

���� �� ��� ������ �� ��� ������� ��� ������������� ���������<br />

��� ����� �� ������ �� ��� ��� ������� �� �������<br />

� ������ �������� ��������� � ������ ������� � ������� ���<br />

��������� �� ������ ���� ��� �������� ��� ����� ����� �<br />

������� ������� ������� ���� ��� ���� ��������� ������������<br />

���� ����������� ������� ������������ ���� ��� ������ ����<br />

����� �� ������ ��� ���� ��� ������ ���������� ��� �������<br />

��� ������� ��� ����� ������� ��� ���������� ������ �� ���<br />

���� ������� ��� ������� ����� ��� ��������� �����������<br />

��������� �� ��� ������ ��� �������� �� ������� ��� ������<br />

������ ������ �� ��� �������� ���� ����� ��� ������ �������<br />

��� ������ �� �������� ��� ����� ��� ��� ������������ ���������<br />

���� ������ ��� �������� ��� ������ ���� ���� �� ������ ���<br />

55<br />

������ �� ����������� ������� ������� ��������<br />

��������� ����������� �� ����� ��� ������� ��� ������� �����<br />

���� ������������� ���� ���������� ������ ��������� ������ �<br />

����� ��� ������� ��� ��� ��� ���� ��� ����������<br />

��� ����� ����������� ����������� ����� ���� ����� �������<br />

��� ��� ������� � �������� ��������� �� ��� �������� ���<br />

������� ��������� ������ ��� �������� ����������� ���������<br />

���� ��� ��������������� ���� ������� ��� ���� ����� ����<br />

��� ��� �������� ��� ��� ������� ������� ����� ���� �� �������<br />

��������� ������ ���� �� �������� ��� ������� ������������<br />

��� ���������� ������� ������� ����� ���������� �� ��� ����<br />

���� ������ ����� ������� �������� ����� �� ����������<br />

���� ��� ������� ���� ���� ���� �������� ��� ������������� ���<br />

������ ��� ��� ���� �������� ������ �����������<br />

�������� ����������<br />

��� ���� �������� ��������� ���� �� ���� ���������� ���<br />

�������� ����������� ��������� ����� ������ �������� ����<br />

�� � ������������� ������������ ���� �������� ������ ������<br />

��������� ������� ��� ��� �� ��� ������� ��� ��� ��������<br />

���� ������ ���� � �������� �������� ��� ������ ���� �������<br />

��� ������� �������� ��� ��� ����������� ���������� ��������<br />

������� ����� ��� ����� ������ �������� ��� ��������� ���<br />

���� ��� ���� ������� ����� ����������� �� �� ��������� ��<br />

��� ���� ����� ��� ��������� �� �� ��� �� ������� ������ ��<br />

���� ������� ��� ������ ������������ ������� ��� ����� ����<br />

��� ������������ ��� ��� ������ ���� �� ���� �� �����������<br />

��� ���� ��� ���� �� ���� �� �� ��� ������� ���������� ���<br />

����� ������������ �� ����� ����������� ���� ��� �� ����� ��<br />

��� ����� ���� ��� ��� ������� �� � ��������� �����������<br />

������������<br />

�������� ��������<br />

�������� �������� ��� ��� ��� ��� ��� ��������� �������<br />

���� ������� ��� ������ ���� �������� ��� ������� �������<br />

��� ������������ ��� ��� �������� �� ����� ����������� ���<br />

������� ��� ����� �� ��� ���� ���� ���� ��� ������� ����<br />

��� ������� �� ��� ��������� �������� ������������� ���� ���<br />

�������� ��������� ������� �� ��� �� ���� ������� ��������<br />

�������� ���������� ��� ���������� ������� ��� ������ ����<br />

���������� ����� ��������� ������������ � ��� ��������


������� ���� ���� �� ���������� ������ ��� ������ ������<br />

�������� ���� ��� �������� �������� ������� ��� ������ ����<br />

���� ������ �� ��� ���������� ������ ���� ���� �������<br />

���� �������� ���������� ������ ����� ��� ������� ��������<br />

�� ���� � ������ ������� ��� ��� �� �������� ��� ����� ��<br />

��� ���� ������ ������ �� �� ����� ��� ����� ��� �������<br />

�������� ��� ��������� ����������� ���� �������� ��������<br />

����� ���� ��� ������ �� ��� ����� �������� �� ��� ��������<br />

������������� ���� ��� ����� ����� ���������<br />

������ �� ������ ������������� ������<br />

���� ����� �������� ����� � ������� ������ �������������<br />

������ ����� �� ���� ��������� �������� ����������� ���������<br />

����� � ������ ��������� ����������� ��� ������� ��� ����� ���<br />

��� �������������� �������� �� ��� ������ ������ �������<br />

�� ��� ������� �� ���� ��� ������� ��� �������� �� ������ ����<br />

���� ���� ������� ��� ���� ������ ��� ������� ��� ��� ������<br />

���������<br />

����������<br />

�����������<br />

����� ��������� �������������� ��� ������ �������������<br />

��� �������� � �������� �� ����� ��� ����������� �� ������<br />

���� ��� ������� �� ������ �� ������ ���� ��������� ���������<br />

������ ��� ����� ����������� � ������ ���� ��� ���� �� �����<br />

���� ��������� ��� ������� ����� ��� ���� ������ ��� �������<br />

�� �������� ������� ���� ��� ��������� ������� ��������� ���<br />

������ �� ������ ���� �� ������ � ��������� ��������� ���<br />

����� ��� ������ ���� ��� ������ ���� ������� ��� ���������<br />

��� ��� ��� ��� ��������� �� ���� ���� ��� ����� ��� ����<br />

��� �������� ��� ��� �������������<br />

������� ��� ��������<br />

��� ��� ������ �������� ������� ���� ��������� ��������<br />

�� ���� ������� ��������� ������� ������ ���� �� ��������� ��<br />

��������� ������ ����� ���� �� �� ������� ���� �� ��� ��<br />

���������� �� ���� ��������� ����������� �� ���������� ����<br />

����� ������� �������� �� ������� ����� ���� �� ���������<br />

���� ������� ��� ��� �� ���� ����� �� ������ �������� �������<br />

�� � ������� ���� ��� ����� ����� �������� ���������� ����<br />

���� ��� ���� ���� ����� ������� � ���� ���� �� �����������<br />

�� �������� ��� ������ ���� ����� �������� �������� ������� ��<br />

������ ��� ������� �������� ��������� ��� �������� �� �����<br />

���� ���� �� ��������� ��� �� ������� ���������<br />

������ �������<br />

��� ��� �� � �������� ��������� ������ ���� �������� � ������<br />

����� ��������� �������� ��� � ��� �� ���� �� �������� ������<br />

��� ��� �� ���������� ���� ������� ������� ������������<br />

��� ����� ���� �� ����� � ���� ��������� ������� �����������<br />

����������� ���� ����� �������� ����������� ������ �� ����<br />

������� �� ��� ��� ������� ����������� ����� �������<br />

56<br />

������������ �� ���� ��� ������������ ���� ����� ����� ��<br />

��������� � �������� �������� ���� � ��������� ����������<br />

������������ ����� ����� �������� ��� ��� ������������ ������<br />

��� �������� �������� ��� ������ ���� ��� ��� ������ �� ���<br />

�������� ��� ���� �� ���� ���� ���� ������� �������� ��� ��<br />

���� ���� ����� ������� ��� ������ �������� ����� ����������<br />

������� ������ ��� ����� ������� �����������<br />

����������<br />

�� ���������� ��� �������� �������� ���� ���� �������<br />

���� ���� ����������� ��� ��� ��������� ����<br />

���������� ������������� ������ ��������� ����� ����<br />

�� �� ����� �� ���������� ������ ��� ��� ���������<br />

������ ������� ���� �����<br />

�� �� ����� ���������� ��� ���������� �� ��� ���������<br />

���� ������ �����<br />

�� �� ����� ������ ���� �������� ������� ���� �� ������<br />

��������������<br />

�� �� ������ ����������� �� ��� �������� ����� ���������<br />

����� �������� � ��� ���� �����<br />

�� �� �������� �� ���������� ��� �� �������<br />

����������� �� ��������� ���� ��� �������� ����������<br />

��� ���� ���������� ��� ����������� �� � ������<br />

������������� ����� ���������� ��������������<br />

������������ ��������� � ���� �����<br />

�� �� �� �� �� ������ ��� ���� ���������� ���������<br />

����������������������������������������� ����������<br />

������ �����<br />

�� �������� ��� ��� ��� ������������ �������� �����<br />

�� �� ����� �� �������� �� ��������� �� �����������<br />

�� ������ �� ����� ��� �� �������� ���� ����������<br />

������ ������ �� ������� ����� ������� �����������<br />

�������� �������� ���� ������������ ��� ���������� �<br />

����� ����� �����<br />

��� �� �� �������� ��� �������� �������� ��� �������<br />

���������� �� �����������<br />

���������������������������������������������������<br />

��������������������������������������<br />

�������������������������������� ���� ����� �����<br />

��� �� �� ������ ��� �� �� ���������� ��������������<br />

���������� �� ��������� ����������� ���������������<br />

�������� ���������� �������� ������� �� ������� ���<br />

�������� ������������� ��������� � ���� ����� ������<br />

�������� ��� ������ ���������� �� ���������� ��������<br />

��������������<br />

��� �� ������� �������� ��������������� ��������<br />

��������������� �������� ����� � ��������� ���������<br />

� �� �����<br />

��� �� ������� ��� ��������� ���� ���� ���������� ������<br />

������<br />

��� �� ������ ��� �� ��������� ����������� ������� ��������<br />

����� �� ����� ������ ������ ������ ���� ��������� ���<br />

������� �����


Prototyping a Semi-Automatic In-Car Texting Assistant<br />

Christoph Endres<br />

German Research Center<br />

for Artificial Intelligence<br />

(DFKI)

Saarbrücken, Germany<br />

christoph.endres@dfki.de<br />

ABSTRACT<br />

Texting while driving is dangerous and illegal in most countries.<br />

But social as well as business forces have led to widespread disregard of these bans and, in turn, to potentially lethal situations. We argue that, in addition to legislative regulation,

in-car texting should be made less distracting and<br />

dangerous. We offer a solution for one specific communication<br />

goal, namely staying connected to a social network. We<br />

propose a semi-automatic status-posting system and present<br />

a prototype based on a Pleo. We argue that our approach<br />

should be extended by automated answering mechanisms.<br />

The aim of this paper is to foster discussion on texting while<br />

driving. A solution for one type of semi-automatic texting is outlined; other types of texting need to be examined separately.

Author Keywords<br />

texting while driving, Pleo, semi-automatic texting

ACM Classification Keywords<br />

K.4.2 Computers and Society: Social Issues

INTRODUCTION<br />

With ubiquity and convenience as major driving factors, the spread of mobile email devices such as the BlackBerry and iPhone has grown to tens of millions of users over the last several years [13]. A sustained growth of this trend is expected in the next decade [12]. Mobile email promises seamless anywhere-anytime connectivity. Employees connect with their organizations, increasing productivity [13]. Participants in a

study on BlackBerry use by [12] emphasized the liberating<br />

nature of mobile email by showing how it allowed them the<br />

freedom to work anywhere.<br />

On the other hand, using mobile devices while driving is<br />

without doubt distracting and thus dangerous. After a surge<br />

in horrific automobile accidents in which distracted driving<br />

was proven to be a factor, 38 US states have enacted texting-while-driving bans [5]. Other countries issued similar bans.

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.

Daniel Braun<br />

Saarland University,<br />

CS Department<br />

Saarbrücken, Germany<br />

daniel.braun@dfki.de<br />


Christian Müller<br />

DFKI

Saarbrücken, Germany<br />

christian.mueller@dfki.de<br />

Figure 1. Pleo robot (Source: Ugobe)<br />

Nevertheless, people continue to text while driving. Reasons<br />

for ignoring bans on texting while driving vary, and include<br />

both business and social forces. People may be tempted to<br />

ignore texting-while-driving bans, because

• professional communication partners expect universal availability.<br />

• driving is perceived as ”dead time” that needs to be filled<br />

with small talk.<br />

• intimates and buddies expect messages to be answered promptly.

• there’s an audience to be constantly supplied with great<br />

content.<br />

In order to tackle this problem, we have to take a closer look<br />

at the different types of texting and the underlying motivation.<br />

Aside from widely known mobile email, we consider the following<br />

texting services relevant in the automotive context:<br />

SMS, Twitter (twitter.com), and Facebook (facebook.com).<br />

The latter are briefly introduced in the following.<br />

Short Message Service (SMS) is mostly used for person-to-person

messaging (chat with friends). The text is limited to<br />

160 characters but the system can segment messages that exceed<br />

the maximum length into shorter messages. [12] argue<br />

that SMS is mostly a private communication means that has


not been widely adopted by the worldwide business community.<br />
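To make the segmentation behaviour concrete, a minimal Java sketch of the idea is given below; the 160-character limit comes from the text above, while the class name and the simplification of ignoring the per-segment header overhead used by real concatenated SMS are our own assumptions.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: splits a long text into SMS-sized parts.
    public class SmsSegmenter {
        private static final int MAX_LEN = 160; // per-message limit cited above

        public static List<String> segment(String text) {
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < text.length(); i += MAX_LEN) {
                parts.add(text.substring(i, Math.min(text.length(), i + MAX_LEN)));
            }
            return parts;
        }
    }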

Microblogging sites like Twitter provide a new means of<br />

communication [10]. Twitter provides the ability to deliver<br />

the data to interested users over multiple delivery channels:<br />

cell phone, Facebook application (see below), email, or as an<br />

Instant Message. A Twitter user interested in the statuses of<br />

another user signs up to be a ”follower”. Updates or posts are<br />

made by succinctly describing one’s current status within a<br />

limit of 140 characters. According to [8], Twitter fulfills the<br />

need for an even faster mode of communication compared to<br />

regular blogging.<br />

Facebook belongs to the category of online social network<br />

(OSN) services. Its core functionality is managing connections<br />

or ”friends” [9]. However, Facebook also provides opportunities<br />

for communication and hosting of content. Facebook<br />

currently has the most users worldwide; other

OSNs are MySpace, Friendster, Bebo, hi5, and Xanga, each<br />

with over forty million registered users [10].<br />

As we pointed out earlier, legislation is unfortunately not<br />

sufficient to keep drivers from potentially lethal habits, so<br />

additional safeguards and alternative solutions need to be developed.<br />

In this paper we propose a way to circumvent manually composing Twitter messages.

OUR PROTOTYPE: PLEOPATRA<br />

The driving context and the nature of the communicative<br />

goal of Twitter lead to a limited number of likely messages,

which are usually diary-like. A typical status might be “We<br />

are already so close to Paris, but now we hit a traffic jam!”<br />

(see Figure 5). We argue that such a message could as well<br />

be generated using a set of message templates and current<br />

status information of the car, e.g. GPS position, current<br />

speed, and available traffic jam warnings. Due to its nature<br />

and complexity, a car on the street is not a very suitable environment<br />

for fast prototyping. In order to evaluate the concept<br />

on a smaller scale, we developed a prototype [4] on a Pleo<br />

toy dinosaur. Due to its complex sensors and single data bus,<br />

the Pleo can be considered a downscaled model of a modern<br />

car, which we will explain below in more detail.<br />

A Pleo is a rather sophisticated device–sometimes also referred<br />

to as an artificial lifeform–equipped with a multitude of

sensors (see Figure 1).<br />

The Pleo hardware is based on an Atmel ARM7 32-bit processor (main CPU), an NXP ARM7 32-bit microprocessor (camera, audio), and four Toshiba TMP86FH47AUG 8-bit microprocessors (motor control).

Movement is achieved through 14 motors with feedback sensors. Additional sensors are:

• A color camera with white light sensor<br />

• Two microphones<br />

• Eight touch-sensors<br />

• Four push-buttons (one under each foot)
• Tilt and shake sensors
• Infrared transmitter and receiver in the mouth
• Infrared transmitter and receiver at the head

Figure 2. Pleopatra Tools Screenshot

Pleo is also equipped with two speakers, internal flash memory, an SD card slot, and a USB interface.

We connect Pleo via its USB interface to a computer in<br />

order to communicate with it. Pleo's USB interface wraps a serial port to which we can connect using standard libraries such as RXTX [7]. To facilitate the communication, we implemented a Java API wrapping the serial protocol, called Pleopatra Tools [3] (see Figure 2); we published the library under the GPL license. Higher-level functions are included in a graphical user interface, which makes interaction with the Pleo easy. The GUI supports establishing a connection to a Pleo; storing personalized information about different Pleos, such as a photo or name, which is recognized instantly once the Pleo is connected; recording audio from the Pleo with direct playback on the PC; inspection and playback of sound, motion, and personality files; and displaying live camera images from the Pleo. The API itself furthermore offers control of motors and sensors, access to the file system, recording of audio from the Pleo in WAV format, and access to the Pleo's camera with saving of BMP images.
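The Pleopatra Tools code itself is not reproduced here, but the following minimal Java sketch illustrates the kind of RXTX boilerplate involved in talking to the wrapped serial port; the port name, baud rate, serial parameters, and the command string are assumptions for illustration and would have to match the actual device and protocol.

    import gnu.io.CommPortIdentifier;
    import gnu.io.SerialPort;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;

    public class PleoSerialSketch {
        public static void main(String[] args) throws Exception {
            // Port name and baud rate are assumptions; adjust for your setup.
            CommPortIdentifier id = CommPortIdentifier.getPortIdentifier("/dev/ttyUSB0");
            SerialPort port = (SerialPort) id.open("PleoSerialSketch", 2000);
            port.setSerialPortParams(115200, SerialPort.DATABITS_8,
                    SerialPort.STOPBITS_1, SerialPort.PARITY_NONE);

            OutputStream out = port.getOutputStream();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(port.getInputStream()));

            // Send a hypothetical query over the wrapped serial protocol
            // and echo whatever the device reports back.
            out.write("sensor status\n".getBytes());
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            port.close();
        }
    }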

Using this API we implemented a monitoring tool which<br />

constantly checks the sensor data for anything extraordinary,<br />

such as sudden darkness, very loud noise, very high or low<br />

temperature, or detection of something green, which is considered food for the Pleo. On detection, an event is triggered. Depending on the type of event, a pre-formulated message is picked from a small database and refined with actual sensor values, e.g. “35 degrees Celsius? It is very hot in here!”. These

messages are then twittered (see Figure 3) via an automated<br />

Twitter interface (jTwitter) [1]. The Twitter application is<br />

also accessible via the Pleopatra Tools’ GUI.
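The monitoring logic can be condensed into a few lines; in the Java sketch below the thresholds, the template wording, and the StatusPoster interface are illustrative assumptions, with the actual prototype posting through jTwitter [1] at the point marked in the comment.

    import java.util.HashMap;
    import java.util.Map;

    public class PleoMonitorSketch {
        /** Posting is abstracted away; the prototype delegates to jTwitter [1] here. */
        interface StatusPoster {
            void post(String message);
        }

        // Pre-formulated templates keyed by event type (wording is illustrative).
        private static final Map<String, String> TEMPLATES = new HashMap<>();
        static {
            TEMPLATES.put("HOT", "%d degrees Celsius? It is very hot in here!");
            TEMPLATES.put("DARK", "Suddenly it got very dark around me.");
        }

        /** Checks one temperature reading and posts if it is extraordinary. */
        static void checkTemperature(int celsius, StatusPoster poster) {
            // 18-23 degrees is the usual range mentioned later in the text;
            // values well outside it trigger an event.
            if (celsius > 30) {
                String message = String.format(TEMPLATES.get("HOT"), celsius);
                poster.post(message); // the refined template goes out as a tweet
            }
        }
    }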


Figure 3. Pleopatra: the first twittering dinosaur in the world<br />

The task we handled here is a typical example of a dually restricted data selection process (see Figure 4). The raw data from the sensors (e.g. motor 4 is blocked at an angle of 35 degrees) is transformed and filtered into higher-level data (e.g. somebody or something holds the front paw). The resulting data is then further filtered according to two resource limitations: first a more technical one (“what is extraordinary enough to be presented?”) and then a more cognitive one (“how much information do we want to publish?”). We will

get back to that concept in more detail later on.<br />

Figure 4. Dual restriction on data<br />

FROM DINOSAUR TO CAR<br />

We argue that a toy robot sensing its environment is comparable

to a sensor-equipped car when it comes to automatic<br />

status message generation. In order to work properly, the<br />

driver has to be identified with his Twitter ID, just as each<br />

Pleo connected to the Pleopatra Tools API must be recognized<br />

by its serial ID before starting the Twitter application.<br />

In a car environment, this could be achieved for instance by<br />

checking the Bluetooth ID of the driver's phone. Typical car

sensors are much more complex than the sensors we have<br />

seen on the Pleo robot, and access to the data is usually not

as uniform as a single USB interface. Data accessible in a<br />

car include current position, speed, heading, temperature (inside

and outside), etc.<br />


The Controller Area Network (CAN) interface standard [2]<br />

was specified by Bosch in 1991 and is nowadays widely<br />

used in cars. It was devised to enable communication between<br />

subsystems of the car, since each subsystem may need<br />

to control actuators or receive feedback from sensors. The<br />

CAN bus may be used in vehicles to establish a connection between the transmission and the engine control unit (the car's main processor), or, for example, to connect the power windows, air conditioning, seat controls, etc.

The number of pre-fabricated messages needed for useful tweet generation in a car is far higher than the few dozen messages in our Pleopatra prototype. Nevertheless, the basic principle stays the same: sensor data is monitored, exceptional values are matched to a database of pre-fabricated messages, and blanks in the message are filled with current

values. The driver then only needs to accept a message for<br />

sending, which is clearly significantly less distracting than<br />

composing a message on a mobile device.<br />

SELECTION OF RELEVANT CONTENT<br />

Selection of relevant information based on a constant sensor<br />

data or information stream is not a trivial task. In [11],<br />

Maybury presents the SumGen system, which “selects key<br />

information from an event database by reasoning about event<br />

frequencies, frequencies of relation between them, and domain<br />

specific importance measures.” The system is able to

tailor a summarized report for a stereotypical user.<br />

More recent works aim at performing such a summarization<br />

in real time in order to emulate a reporter at, for instance, a

sports event. The IVAN system [6] “generates affective commentary<br />

on a tennis game that is given as an annotated video<br />

in real-time. The system employs two distinguishable virtual<br />

agents that have different roles (TV commentator, expert),<br />

personality profiles, and positive, neutral, or negative<br />

attitudes to the players.”<br />

In our example, the information streams to be monitored are<br />

sensor data. Defining which data is “extraordinary” is rather<br />

straightforward here: if the usual environment temperature of the Pleo dinosaur ranges between 18 and 23 degrees Celsius, then 35 degrees is extraordinary. If the dinosaur does not have any input on its touch sensor on the back for 90 percent

of its time, then getting an input there is extraordinary.<br />

The interpretation of sensor data usually depends on the context.<br />

In a toy context such as our Pleopatra prototype, there is not

much variation of context. The dinosaur usually stays more<br />

or less in the same environment, and extracting information<br />

from sensor data is straightforward.<br />

In the automotive context, we have to extend our information<br />

flow example from Figure 4. The car is moving in a complex<br />

environment, so in order to double-check our interpretation of

the sensor data, we need additional environmental evidence<br />

as a second component. If the car is on the highway and<br />

moving at an extraordinarily slow speed or not moving at all, this does not necessarily mean that the driver is stuck in a traffic jam. He might just be resting in a parking lot or visiting a


fast food restaurant's drive-through. But if we do have, for instance, traffic information announcing a traffic jam on that highway to verify our interpretation, the interpretation becomes more reliable. So our first resource limitation is environmental

evidence:<br />

sensor data
+ environmental evidence
→ interpretation of the situation

The situation might be unusual or extraordinary, but to make<br />

it interesting and thus worth tweeting, another contextual<br />

component is usually needed. In our example: Being in a<br />

traffic jam could be something ordinary you encounter on<br />

your everyday commute, but being stuck close to your destination<br />

on a weekend trip is special. We add unusual context<br />

as part of the second, cognitive restriction:<br />

exceptional sensor data
+ environmental evidence
+ unusual context
→ relevant message

At the same time, user-defined parameters like the desired frequency of status posts can be used to optimize the second resource limitation according to the driver's needs.
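A minimal Java sketch of this two-stage restriction follows; the method names, the speed threshold, and the way environmental evidence and user preferences are represented are assumptions made only to make the data flow concrete.

    public class RelevanceFilterSketch {
        /** Stage 1: sensor data plus environmental evidence -> interpretation. */
        static boolean inTrafficJam(double speedKmh, boolean onHighway,
                                    boolean trafficServiceReportsJam) {
            // Low speed alone is ambiguous (parking lot, drive-through, ...);
            // the external traffic report is the environmental evidence.
            return onHighway && speedKmh < 20 && trafficServiceReportsJam;
        }

        /** Stage 2: interpretation plus unusual context -> relevant message? */
        static boolean worthPosting(boolean interpretationConfirmed,
                                    boolean unusualContext,
                                    int postsToday, int maxPostsPerDay) {
            // The user-defined posting frequency tightens the cognitive restriction.
            return interpretationConfirmed && unusualContext
                    && postsToday < maxPostsPerDay;
        }
    }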

CONCLUSION AND OUTLOOK<br />

We presented a prototype of a twittering toy dinosaur and argued<br />

that the introduced principle could - with an increased<br />

complexity and some modifications - be used for an automated<br />

generation of tweets. This automation would reduce<br />

the risk of driver distraction, especially for power users of<br />

social networks who have an urge to stay connected to their<br />

environment. This is of course just a part of the solution.<br />

Other communication goals need to be looked at and analyzed<br />

separately.<br />

As a next step, we can try to include automatic answering mechanisms. For instance, if driver A is on the way to person B, there could be an incoming tweet saying “@DriverA: Where are you?”, and based on the current status, the car could respond immediately: “I am on my way, but right now I am stuck in a traffic jam near Frankfurt, driving at less than 10 mph!”. This is just one example; the possibilities here are manifold.

REFERENCES<br />

1. JTwitter - the Java library for the Twitter API.<br />

http://www.winterwell.com/software/jtwitter.php, 2008.<br />

2. Bosch. CAN Specification, Version 2.0. http://www.semiconductors.bosch.de/media/pdf/canliteratur/can2spec.pdf, 1991.

3. C. Endres and D. Braun. Pleopatra Tools.<br />

http://www.dfki.de/pleopatra, 2009.<br />

4. C. Endres and D. Braun. Pleopatra: A Semi-Automatic<br />

Status-Posting Prototype For Future In-Car Use. In<br />

Adjunct Proceedings of the 2nd International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI 2010), page 7, Pittsburgh, PA, USA, November 2010.

Figure 5. Twittering car: car sensor data + message templates → “We’re already so close to Paris, but now we hit a traffic jam!”

5. Governors Highway Safety Association.<br />

State cell phone use and texting while driving laws.<br />

http://www.ghsa.org/html/stateinfo/laws/cellphone_laws.html,

2010.<br />

6. I. Gregory. Embodied presentation teams: A plan-based<br />

approach for affective sports commentary in real-time.<br />

Master’s thesis, Saarland University, 2010.<br />

7. K. Jarvi. RXTX : serial and parallel I/O libraries<br />

supporting Sun’s CommAPI. http://www.rxtx.org/,<br />

2006.<br />

8. A. Java, X. Song, T. Finin, and B. Tseng. Why we<br />

twitter: understanding microblogging usage and<br />

communities. In WebKDD/SNA-KDD ’07: Proceedings<br />

of the 9th WebKDD and 1st SNA-KDD 2007 workshop<br />

on Web mining and social network analysis, pages<br />

56–65, New York, NY, USA, 2007. ACM.<br />

9. A. N. Joinson. Looking at, looking up or keeping up<br />

with people?: motives and use of facebook. In CHI ’08:<br />

Proceeding of the twenty-sixth annual SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 1027–1036, New York, NY, USA, 2008. ACM.<br />

10. B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps<br />

about twitter. In WOSP ’08: Proceedings of the first<br />

workshop on Online social networks, pages 19–24,<br />

New York, NY, USA, 2008. ACM.<br />

11. M. T. Maybury. Generating summaries from event data.<br />

Inf. Process. Manage., 31:735–751, September 1995.<br />

12. C. A. Middleton and W. Cukier. Is mobile email<br />

functional or dysfunctional? two perspectives on<br />

mobile email usage. European Journal of Information<br />

Systems, 2006.<br />

13. O. Turel and A. Serenko. Is mobile email addiction<br />

overlooked? Commun. ACM, 53(5):41–43, 2010.<br />



Multimodal Summarization of Complex Sentences<br />

Naushad UzZaman<br />

Computer Science Department<br />

University of Rochester<br />

naushad@cs.rochester.edu<br />

ABSTRACT<br />

In this paper, we introduce the idea of automatically<br />

illustrating complex sentences as multimodal summaries<br />

that combine pictures, structure and simplified compressed<br />

text. By including text and structure in addition to pictures,<br />

multimodal summaries provide additional clues of what<br />

happened, who did it, to whom and how, to people who<br />

may have difficulty reading or who are looking to skim<br />

quickly. We present ROC-MMS, a system for automatically<br />

creating multimodal summaries (MMS) of complex<br />

sentences by generating pictures, textual summaries and<br />

structure. We show that pictures alone are insufficient to<br />

help people understand most sentences, especially for<br />

readers who are unfamiliar with the domain. An evaluation<br />

of ROC-MMS in the Wikipedia domain illustrates both the<br />

promise and challenge of automatically creating multimodal<br />

summaries.<br />

Author Keywords<br />

Multimodal summarization, summarization, visualization,<br />

illustration, picture, text-to-picture, automatic illustration,<br />

sentence compression, pictorial representation, AAC,<br />

augmentative and alternative communication, ROC MMS.<br />

General Terms<br />

Algorithms, Experimentation.<br />

ACM Classification Keywords<br />

H5.m. Information interfaces and presentation (e.g., HCI):<br />

Miscellaneous; I.2.7 [Artificial Intelligence]: Natural<br />

Language.<br />

INTRODUCTION<br />

Pictures, diagrams and illustrations are included in<br />

manually-created text because they help people<br />

comprehend and remember information [1]. Including<br />

alternative, supportive representations of text might help<br />

people with reading difficulties understand text better, for<br />

instance those reading text not in their first language,<br />

Permission to make digital or hard copies of all or part of this work for<br />

personal or classroom use is granted without fee provided that copies are<br />

not made or distributed for profit or commercial advantage and that<br />

copies bear this notice and the full citation on the first page. To copy<br />

otherwise, or republish, to post on servers or to redistribute to lists,<br />

requires prior specific permission and/or a fee.<br />

IUI’11, February 13–16, 2011, Palo Alto, California, USA.

Copyright 2011 ACM 978-1-4503-0419-1/11/02...$10.00.<br />

Jeffrey P. Bigham<br />

Computer Science Department<br />

University of Rochester<br />

jbigham@cs.rochester.edu<br />


James F. Allen<br />

Computer Science Department<br />

University of Rochester<br />

james@cs.rochester.edu<br />

children, older adults, or people with cognitive disabilities.<br />

Unfortunately, creating illustrations is expensive and time-consuming, and consequently most text has only a few illustrations, if any at all.

Figure 1: Multimodal summary (MMS) of the sentence, “In 1492, Genoese explorer Christopher Columbus, under contract to the Spanish crown, reached several Caribbean islands, making first contact with the indigenous people.”

In this paper we introduce ROC-

MMS, a system that automatically converts existing text to<br />

multimodal summaries (MMS) that capture the meaning of<br />

a complex sentence in a diagram containing pictures and<br />

simplified text related by structure extracted from the<br />

original sentence.<br />

Motivated by sayings like “A picture is worth a thousand words,” prior work on Automatic Illustration and Text-to-

Picture synthesis has approached the very difficult problem<br />

of generating pictorial replacements for text. Although this<br />

is an interesting challenge, existing systems have generally<br />

found success only within the domain of simple sentences<br />

of the type found in children’s books [2-4]. The problem of<br />

multimodal summarization relaxes the problem by allowing<br />

text to augment pictorial and structural information.<br />

Automatic Illustration is inherently difficult. To understand<br />

the problem better, we initially asked two annotators 1 to<br />

identify the main idea 2 (main event) and related entities<br />

(subject, object, etc) from sentences and find representative<br />

pictures. Sentences were chosen from the Wikipedia entries<br />

United States and France, and annotators were asked to<br />

include Wikipedia pictures in their illustrations. The<br />

annotators reported that it was too difficult to illustrate<br />

19.59% of the entities using Wikipedia pictures and thought<br />

1 Annotators are graduate students and not among the authors.<br />

Their annotations were used as a gold standard in our evaluation.<br />

2 In this paper, we loosely interchange between main idea, main<br />

concept and main event.


that 15.08% of entities couldn’t be represented with<br />

pictures at all (e.g. “territory”, “height of power”, “French<br />

War of religion”, etc., and temporal expressions in general).

These results suggest that it will often be difficult to find<br />

appropriate pictures and some entities are inherently unable<br />

to be illustrated easily with pictures. It can be particularly<br />

difficult to represent entities in an unfamiliar domain. For<br />

instance, if someone doesn’t know what Christopher

Columbus looks like, even a good picture of Christopher<br />

Columbus will only convey general attributes (man,<br />

possibly historical).<br />

To remedy this problem, MMSs keep both images and

representative text, unlike previous systems for automatic<br />

illustration [2-6]. In this way, we can handle cases lacking a<br />

good picture and address cases that are hard to illustrate.<br />

Presenting pictures and text together can also improve both<br />

the understanding and remembering of concepts. According<br />

to dual code theory [7], text and pictures result in two<br />

different kinds of conceptual representations. These<br />

representations may allow independent access to<br />

information and hence benefit retention. Picture and text<br />

repeat important information, and may have similar<br />

beneficial effects on memory as explicit repetitions [8, 9].<br />

Processing the information twice, once as text and once as a<br />

picture, may facilitate comprehension and memory. Finally,<br />

pictures often have a motivating effect, and text with<br />

pictures may also be more enjoyable to read, since the<br />

reader does not have to work as hard to understand the text<br />

and pictures also facilitate better comprehension of the text<br />

broadly beyond what is illustrated [10]. So our decision to include text with pictures is backed by theories suggesting that it helps people better understand and remember the content.

To keep the MMS representations simple and easy to<br />

process, we simplify text so that it retains only the most<br />

important information, instead of the full text. We define<br />

the most important information as the subject (who did it),<br />

the event (what action), object (to whom or what) and<br />

prepositions directly related to the subject, main event, or<br />

object (how). This effectively converts complex sentences<br />

into simpler sentences. In this way, the reader can read out<br />

the text as a simple sentence in addition to seeing the<br />

pictorial view, making it easier to remember and understand<br />

text, and relate it to the full, complex text if they choose,<br />

such as when searching for details abstracted out of the<br />

MMS view.<br />

MMS can potentially help a diversity of readers. For<br />

example, highly-capable readers may use MMS to skim<br />

content or understand content more easily. The alternative,<br />

simplified representation it provides may be useful for<br />

children who are learning to read and for second language<br />

learners, as seeing pictures together with text may enhance<br />

learning [11]. Furthermore, it has been previously shown<br />

that when one component of the reading process is<br />

dysfunctional, other compensating skills may become<br />

highly developed [12]. It is estimated that more than 2<br />


million people in the United States have significant communication impairments that lead them to rely on

methods other than natural speech alone for communication<br />

[13]. Automatic Illustration of texts may eventually help<br />

these people understand text better. Automatic illustration<br />

can also help to support other representations like Pictorial<br />

Temporal representation [14] or can be paired-up with<br />

screen reading applications [15], which could further<br />

benefit people who have problems reading by allowing<br />

them to see content in multiple forms while listening to it<br />

being read.<br />

We define multimodal summarization of complex sentences<br />

as the combination of illustrations and a compressed form<br />

of the sentence text in simple sentence structure. In the next<br />

section we will describe the challenges for multimodal<br />

summarization and describe related work for the required<br />

subtasks. We then describe ROC-MMS, our system for<br />

multimodal summarization and describe an evaluation of it.<br />

Finally, we discuss potential for future work.<br />

SUBTASKS AND RELATED WORK<br />

Multimodal summarization (MMS) of complex sentences<br />

gives readers the main idea of the sentence using pictures<br />

and compressed text structured as simple sentence. Creating<br />

MMSs is challenging and involves many subtasks. In this<br />

section, we will describe each of the subtasks and the<br />

related work for each subtask, and the approach taken in<br />

ROC-MMS. The general steps in the MMS approach are<br />

the following:<br />

1. Identify both the main idea of the sentence and related<br />

entities and use them to create a compressed summary.

2. Extract pictures for the entities.<br />

3. Add structure to the pictures and text.<br />
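Read as a pipeline, the three steps above can be sketched in Java as follows; every interface and method name is a placeholder chosen for illustration rather than part of the actual ROC-MMS implementation.

    import java.util.List;
    import java.util.Map;

    public class MmsPipelineSketch {
        interface SummaryExtractor {      // step 1: main idea and related entities
            Map<String, String> extract(String sentence);            // role -> text
        }
        interface PictureFinder {         // step 2: pictures for the entities
            Map<String, String> findPictures(List<String> entities); // entity -> image
        }
        interface LayoutBuilder {         // step 3: structure pictures and text
            String layout(Map<String, String> roles, Map<String, String> pictures);
        }

        static String summarize(String sentence, SummaryExtractor extractor,
                                PictureFinder finder, LayoutBuilder builder) {
            Map<String, String> roles = extractor.extract(sentence);
            Map<String, String> pictures =
                    finder.findPictures(List.copyOf(roles.values()));
            return builder.layout(roles, pictures);
        }
    }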

Identifying the main idea and related entities<br />

Natural language sentences often convey multiple ideas, but<br />

representing multiple ideas with pictures can quickly<br />

become confusing. We, therefore, chose to express only the<br />

main idea of a sentence with MMS. If readers can<br />

understand the main idea of the sentence, then they may be<br />

able to later use the original text to decipher further details.<br />

The subtask of identifying the main idea of the sentence<br />

itself has two components. First, the important idea (the<br />

main event or main action) must be extracted, and, second,<br />

the entities related to the main idea need to be extracted, as<br />

illustrated in the following example drawn from Wikipedia:<br />

“In 1492, Genoese explorer Christopher Columbus, under<br />

contract to the Spanish crown, reached several Caribbean<br />

islands, making first contact with the indigenous people.”<br />

The summary or compressed form of the sentence is<br />

“Christopher Columbus reached several Caribbean islands<br />

in 1492.” Hence, the main event or main idea in the<br />

sentence is reached and the entities related to the event


reaching are Christopher Columbus (subject), several

Caribbean islands (object) and 1492 (preposition in).<br />

A similar problem already addressed in the natural language<br />

processing community is called sentence compression [16].<br />

In sentence compression, unnecessary information is<br />

removed while retaining the grammaticality of the sentence.<br />

Sentence compression might remove related entities of the main event in the process of removing unnecessary

information. This approach also doesn’t give a simple<br />

sentence structure.<br />

Another approach is main event extraction using the<br />

TimeML annotation scheme [17]. In this scheme, the main<br />

event label corresponds to the main idea of the sentence.<br />

Most competitive systems use syntactic and semantic<br />

information and machine-learning classifiers to identify<br />

events. For an overview of recent systems in this area, see<br />

the results of TempEval-2 [18]. The main events are<br />

annotated as part of the TempEval-2 task, although results<br />

on identifying main events were not explicitly reported.<br />

In the literature on Automatic Illustration for extracting<br />

entities, a popular approach has been to first extract<br />

representative keywords and then generate images for these<br />

keywords [6]. Keyword extraction has been studied in the<br />

natural language processing/information retrieval<br />

community [19, 20]. Goldberg et al. [2, 4] extract actions<br />

(events), who did them and to whom. They don’t focus on<br />

identifying only the important idea (action) because their<br />

experimental domain only contains short and simple<br />

sentences (and are, therefore, unlikely to contain more than<br />

one event). They convert the problem of identifying entities<br />

to a sequence labeling problem and use Conditional<br />

Random Fields for classification. On the other hand,<br />

Mihalcea and Leong [3] do not try to extract the entities,<br />

but they extract the pictures word-by-word and represent<br />

them linearly. Both approaches work best on simple<br />

sentences in which order roughly matches the role of the<br />

extracted entities. The ROC-MMS system includes a full<br />

natural language parse of the complex sentence in order to<br />

extract entities regardless of the order in which they appear.<br />

Extracting Pictures for Text<br />

Once we have the event and related entities, we next extract<br />

pictures to represent each concept. The task of associating<br />

words to pictures is similar to image retrieval. Although<br />

some work uses computer vision techniques for retrieval,<br />

most work (including popular image search engines) rely<br />

primarily on the text found near images in documents to<br />

find general images [21]. ROC-MMS generally follows this<br />

approach as well, but uses additional information<br />

automatically generated from the structure of the sentence<br />

to weight its search terms.<br />

Text-to-scene conversion places objects in 3D environment<br />

and is intended to aid graphic designers. This usually works<br />

with detailed descriptive text with visual and spatial<br />

elements. One of the best-known systems of this kind is<br />

WordsEye [22]. They are usually not intended as assistive<br />


tools to communicate general text, because in that domain the texts usually describe a situation like “the house is 7 feet tall with two glass windows and a door,” and the

system will try to interpret the natural language and create<br />

the 3D environment of the described situation. In contrast,<br />

we want to take a sentence from an existing news source,<br />

Wikipedia, or a book and represent it with pictures to help<br />

people to understand the text better.<br />

Barnard and Forsyth [23] introduced the idea of auto-illustration as the inverse of auto-annotation. Joshi et al. [6]

approached this problem by considering the pair-wise<br />

reinforcement based on both visual and WordNet-based<br />

lexical similarity. This work identifies a few representative<br />

pictures for a story, which has practical applications like<br />

identifying representative pictures for news articles or different articles, but it is not appropriate for our problem.

Goldberg et al. [2, 4] built their own database of images to use for certain texts; if they cannot find an appropriate image in their database, they perform a web image search and apply some vision techniques to identify an appropriate picture. Mihalcea and Leong [3] use an in-house image database, PicNet, and other resources 3.

Adding Structure to Improve Understanding<br />

Having identified pictures and compressed text, the final<br />

step is to combine these elements in a layout structurally<br />

representative of what happened, who did it, to whom and<br />

how. To our knowledge, the only other work that attempts<br />

to address this problem is Goldberg et al. [2]. Their system<br />

identifies "who", "what action" and "to whom" by<br />

converting the problem into sequence labeling. They<br />

propose a layout represented by the sequence ABC, where<br />

A represents who did the action, B is what action was done<br />

and C is to whom. An example output of their system for<br />

“The girl rides the bus to school in the morning” is below:<br />

Figure 2: Example output of [2] illustrating the labeling of<br />

sequences where each element is assigned a picture.<br />

In this work, the textual information is ignored and<br />

represented only with pictures. Images incorrectly extracted in the previous step may confuse people more than help

them because there is no additional information to guide<br />

them to the correct interpretation. MMS includes extracted<br />

text in case of errors. With both picture and compressed<br />

text, we can represent hard-to-depict, but important, entities<br />

with text that may be ignored by prior work. We do not<br />

attempt to represent events (the action) with a picture, since<br />

this is a much more challenging task.<br />

3 http://tell.fll.purdue.edu/JapanProj/FLClipart/


This work also tries to identify the A (who), B (what action)<br />

and C (to whom) of their ABC layout by converting it to a<br />

sequence-tagging problem, which is well studied in NLP<br />

[24]. The problem with that approach is the requirement for<br />

hand-labeled training data, which is a barrier to adapting the solution to a different or more complex

domain. ROC-MMS uses dependency parsing to identify<br />

similar dependencies or related entities, without needing the<br />

hand-annotated training data.<br />

Finally, they restrict their attention to single simple<br />

sentences and their experiments were on domains that use<br />

very simple English, such as short narratives written by and<br />

for individuals with communicative disorders; one-sentence<br />

news synopses written in simple English targeting foreign<br />

language learners; and the child writing sections of the<br />

LUCY corpus. For complex sentences, they anticipate the<br />

use of text simplification to convert complex text into a set<br />

of appropriate inputs for their system. It is not clear how<br />

well they can eventually represent the complex sentences in<br />

their layout, since they are not considering “how”<br />

something happened.<br />

ROC-MMS addresses these problems for unrestricted texts<br />

that include complex and compound sentences.<br />

ROC-MMS

In this section we describe ROC-MMS and how it approaches the subtasks described in the previous section.

Identifying the main event(s)

ROC-MMS finds concepts by identifying the events and related entities, and then identifies the main event, which corresponds to the main concept or idea of the sentence.

Event extraction

Our view of events matches the TimeML temporal annotation scheme [17], which treats "event" as a cover term for situations that happen or occur.

ROC-MMS extracts events using the TRIOS system [25], which had a very competitive performance in the TempEval-2010 task for temporal information extraction [18]. The TRIOS system first parses text with the TRIPS parser [26] and uses hand-coded rules to extract events. The extraction rules are tuned for high recall and identify many more events than necessary, including a few non-events. In the next step, a classifier is used as a filter to remove the spurious events.

The main event identification classifier takes all events of a sentence as input and identifies the main event of that sentence. In one of the TempEval-2010 tasks, main events were labeled; we used that labeled data to train our main event classifier. For this classification task, we used an off-the-shelf Markov Logic Network classifier (thebeast).⁴ As features, we used lexical features (word, stem, next word, previous word, previous verbal word sequence), syntactic features (part-of-speech tag, tense, voice, polarity, TimeML aspect, modality, POS sequence, previous verbal POS sequence, next POS, previous POS) and semantic features (abstract semantic class – ontology type, TimeML class, semantic roles and their arguments) of events. The syntactic and semantic features are mostly generated from the TRIPS parser output and from other classifiers.

⁴ http://code.google.com/p/thebeast/

This classifier first identifies the main events in the sentences. We then run a second pass to make sure every sentence has at least one main event: if the classifier did not identify a main event in a sentence, we take its first verbal event as the main event. We back off to the first verbal event because it has a high baseline performance for the main-event identification task.
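The back-off pass can be sketched as follows (a minimal illustration with a hypothetical Event record; it is not the actual ROC-MMS code, which operates on TRIOS/TRIPS output):

from dataclasses import dataclass

@dataclass
class Event:
    text: str
    is_verbal: bool
    is_main: bool = False

def ensure_main_event(events):
    # Second pass: if the classifier marked no event in this sentence as
    # main, promote the first verbal event to main event.
    if not any(e.is_main for e in events):
        for e in events:
            if e.is_verbal:
                e.is_main = True
                break
    return events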

Extract entities related to the event

Instead of extracting all entities in the sentence [3], we extract only those entities related to the main event. We use the relations between the event and the related entities in the next step to structure them. From the parsed representation created by the Stanford dependency parser,⁵ we find dependencies⁶ in order to extract the subject (nominal subject - nsubj, agent), the object (direct/indirect object - dobj/iobj, passive nominal subject - nsubjpass) and other dependencies (prepositions). For easier representation, we cluster all prepositional modifiers into a single entity, but include the preposition itself in the representation.

An example will help to illustrate how we use the dependency output to extract related entities for the events. The following is the Stanford dependency parser output for the sentence, "French fur traders established outposts of New France around the Great Lakes."

amod(traders-3, French-1)
nn(traders-3, fur-2)
nsubj(established-4, traders-3)
dobj(established-4, outposts-5)
nn(France-8, New-7)
prep_of(outposts-5, France-8)
det(Lakes-12, the-10)
nn(Lakes-12, Great-11)
prep_around(established-4, Lakes-12)

The main event here is established, the subject is traders, the object is outposts and the preposition (around) is Lakes. By propagating through nn (noun compound modifier) and amod (adjectival modifier) dependencies, we extract the following entities: (subject: "French fur traders"), (object: "outposts") and (preposition: "Great Lakes"). For subject, object and prepositions, we propagate through nn and amod in this way and extract the resulting entities. The next step is to find representative pictures for the entities. If we fail to find an image for an entity, we propagate through all dependencies (instead of just nn and amod) to extract an entity phrase. For example, we would extract the phrase "outposts of New France" for the object and "the Great Lakes" for the preposition in the example above. We then search for a picture of the entity phrase instead of the entity. These steps are described in more detail next.

⁵ Stanford dependency parser: http://nlp.stanford.edu/software/lex-parser.shtml
⁶ Details on dependencies: http://nlp.stanford.edu/software/dependencies_manual.pdf
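The extraction of related entities from the dependency triples can be sketched as follows (a simplified illustration assuming the dependencies are available as (relation, (head, index), (dependent, index)) tuples; the clustering of multiple prepositional modifiers into one entity is omitted here):

from collections import defaultdict

def extract_entities(deps, main_event_idx):
    # deps: list of (relation, (head_word, head_index), (dep_word, dep_index))
    # tuples, as in the Stanford dependency output above.
    heads = {}                      # token index -> word
    modifiers = defaultdict(list)   # governor index -> [(index, word), ...] from nn/amod
    roles = {}                      # head index of each related entity -> role label
    for rel, (hw, hi), (dw, di) in deps:
        heads[hi], heads[di] = hw, dw
        if rel in ("nn", "amod"):
            modifiers[hi].append((di, dw))
        elif hi == main_event_idx:          # only dependents of the main event
            if rel in ("nsubj", "agent"):
                roles[di] = "subject"
            elif rel in ("dobj", "iobj", "nsubjpass"):
                roles[di] = "object"
            elif rel.startswith("prep_"):
                roles[di] = "preposition (%s)" % rel.split("_", 1)[1]
    entities = {}
    for idx, role in roles.items():
        # propagate through nn/amod modifiers and restore word order
        tokens = sorted(modifiers[idx] + [(idx, heads[idx])])
        entities[role] = " ".join(word for _, word in tokens)
    return entities

deps = [("amod", ("traders", 3), ("French", 1)),
        ("nn", ("traders", 3), ("fur", 2)),
        ("nsubj", ("established", 4), ("traders", 3)),
        ("dobj", ("established", 4), ("outposts", 5)),
        ("nn", ("France", 8), ("New", 7)),
        ("prep_of", ("outposts", 5), ("France", 8)),
        ("det", ("Lakes", 12), ("the", 10)),
        ("nn", ("Lakes", 12), ("Great", 11)),
        ("prep_around", ("established", 4), ("Lakes", 12))]
print(extract_entities(deps, main_event_idx=4))
# {'subject': 'French fur traders', 'object': 'outposts', 'preposition (around)': 'Great Lakes'}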

Extracting Pictures for Concepts

Image retrieval is a complicated task, even for humans, because what constitutes a representative image is subjective. As a result, we simplified the problem by restricting our image search to Wikipedia, which we have found to often produce appropriate images. This has two benefits: (i) pictures of an entity are often found on the wiki page for that entity, and (ii) Wikipedia articles often have infobox pictures selected by human editors, which tend to be correct and representative.

Finding pictures for an event ("what action" according to [2]) is much harder. When humans are asked to find pictures for events, they will often search for the event along with the subject or object. For example, for the event "conquered" in the context "Rome conquered the Gauls", an appropriate image would likely include Roman soldiers (it would be even better if it somehow indicated that the conquering occurred in Gaul). Search results for conquered alone include the following images in the top results:

Figure 3: First three results from Yahoo Image Search for the word "conquered", illustrating the difficulty of finding good representative pictures even for simple concepts.

A useful heuristic for finding better representative images is therefore to concatenate the action with the subject and object (or with just the subject or object, if the other is not available). Even so, web image search often does not return the most appropriate image for our purposes as the first result. This is fine for humans, who can glance through the top few results and pick the most appropriate one. Restricting pictures to Wikipedia is a simple way to produce better results.

Our methods for identifying pictures are described below as separate modules.

Module find_image_in_wikipage(wikiurl):
(i) Find the infobox picture.
(ii) If the infobox has multiple pictures, consider the picture with the largest width.⁷
(iii) If there is no infobox picture:
  a. Find all images.
  b. Tokenize each image filename⁸ with "_", ",", "[A-Z]", and spaces as delimiters.
  c. For each image:
    i. Find the edit distance between the tokenized filename and each word in the wiki article name.
    ii. Sum all scores; this is the relatedness score for the image.
  d. Return the picture with the highest score, together with the score.
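A minimal sketch of the relatedness scoring in step (iii), assuming a normalized string similarity in place of the raw edit distance (the exact normalization is not specified in the paper) and weighting alt-tag text by 0.25 as in footnote 8:

import re
from difflib import SequenceMatcher

def relatedness(filename, article_name, alt_text="", alt_weight=0.25):
    # Score how related an image filename (and, with lower weight, its alt
    # text) is to the wiki article name.
    def tokens(s):
        s = re.sub(r"\.[A-Za-z0-9]+$", "", s)        # drop the file extension
        s = re.sub(r"(?<!^)(?=[A-Z])", " ", s)       # split CamelCase
        return [t.lower() for t in re.split(r"[_,\s]+", s) if t]

    article_words = tokens(article_name)

    def score(text, weight):
        total = 0.0
        for tok in tokens(text):
            best = max((SequenceMatcher(None, tok, w).ratio()
                        for w in article_words), default=0.0)
            total += best * weight
        return total

    return score(filename, 1.0) + score(alt_text, alt_weight)

def best_image(images, article_name, threshold=1.0):
    # images: list of (filename, alt_text) pairs found on the wiki page.
    # Returns the highest-scoring image and its score, or None if nothing
    # clears the threshold (1.0, as in the paper).
    scored = [(f, relatedness(f, article_name, alt)) for f, alt in images]
    if not scored:
        return None
    f, s = max(scored, key=lambda x: x[1])
    return (f, s) if s >= threshold else None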

Module find_page_and_image(query):
(i) Search with "wikipedia " + query using the Yahoo search API.⁹
(ii) Keep only en.wikipedia pages.
(iii) Traverse the resulting wiki pages one by one:
  (a) Get the representative image and its score from the wiki page's URL using find_image_in_wikipage(result page).
  (b) If the resulting image's score is above a threshold (we used 1.0), return the image.

Module sentence_to_images(sentence):
(i) Extract the events, the main event, and the entities and entity phrases related to the main event (all described in the previous section).
(ii) For each of the dependencies (subject, object, prepositions):
  (a) If any word forms a main Wikipedia entry: find the image in those wiki URLs using find_image_in_wikipage(wikiurl).
  (b) If no result has been found so far and the entity does not have a wiki link: find the image using Yahoo search with find_page_and_image(entity).
  (c) If no result has been found so far and any word in the entity phrase is linked to wiki URLs: find the image in those wiki URLs using find_image_in_wikipage(wikiurl).
  (d) If no result has been found so far and the entity phrase does not have a wiki link: find the image using Yahoo search with find_page_and_image(entity phrase).

⁷ We found that when there are multiple pictures, the picture with the larger width is usually the main representative picture.
⁸ We only consider the tokenized filename because (i) Wikipedia has very descriptive image filenames, and (ii) the text descriptions next to images are not consistent: some pictures have lots of text and others have none, since descriptions are sometimes neglected by contributors when the wiki entry is not very prominent. We do also consider the alt tags of images, which are likewise sparse, so we give that score a lower weight (0.25 for alt tags versus 1.0 for the image filename score).
⁹ http://developer.yahoo.com/search/web/V1/webSearch.html
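The fallback cascade in step (ii) can be sketched as plain control flow; find_image_in_wikipage and find_page_and_image stand for the modules above and are stubbed out here, and wiki_links is a hypothetical mapping from words in the source sentence to the Wikipedia URLs they link to (multi-word link anchors are ignored in this sketch):

def find_image_in_wikipage(wikiurl):
    return None   # stub for the module described above

def find_page_and_image(query):
    return None   # stub for the module described above

def entity_to_image(entity, entity_phrase, wiki_links):
    # Each helper is assumed to return (image, score) or None.
    # (a) a word of the entity is linked to a Wikipedia article
    for word in entity.split():
        if word in wiki_links:
            result = find_image_in_wikipage(wiki_links[word])
            if result:
                return result
    # (b) nothing found and the entity itself has no wiki link: web search
    result = find_page_and_image(entity)
    if result:
        return result
    # (c) a word of the longer entity phrase is linked to a wiki article
    for word in entity_phrase.split():
        if word in wiki_links:
            result = find_image_in_wikipage(wiki_links[word])
            if result:
                return result
    # (d) last resort: web search on the entity phrase
    return find_page_and_image(entity_phrase)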

Consider the following clarifying example. The input sentence from Wikipedia is "French fur traders established outposts of New France around the Great Lakes." (Underlined words are links to other Wikipedia pages.) ROC-MMS extracts established as the main event (in this case, the only event), and the extracted entities and entity phrases are: (subject: French fur traders), (subject phrase: French fur traders), (object: outposts), (object phrase: outposts of New France), (preposition: around – Great Lakes), (preposition around phrase: the Great Lakes).

First consider the subject, French fur traders. "Fur traders" has a wiki link, but the linked page does not have an infobox. For the images on the linked page, we compute the edit-distance score between each tokenized filename and the article name (Fur trade) and select the best image according to the process described above.

Next we consider the object outposts, which does not have a wiki link. We search using Yahoo!, restricting results to Wikipedia pages, which does not return any image above the threshold in the first 10 result pages. We then check the object phrase, outposts of New France; New France has a wiki link, and we find a representative picture from that link.

In our algorithm, we search for the entity first, instead of checking wiki URLs in the entity phrase, because Wikipedia contributors sometimes fail to link entities to their wiki articles. For those cases, our yahoo_search module finds the expected wiki article. So we try this step first and, if it fails, we check the wiki links in the entity phrase, as shown in this example. Finally, the preposition (around) is Great Lakes, which links to its wiki article, and we get a representative picture for it too.

If there are multiple wiki links in an entity (or entity phrase), then we find images from all wiki links and cluster them.

Figure 4: Clustered image of Genoa and Christopher Columbus for the entity "Genoese explorer Christopher Columbus".

We also cluster all prepositions. The sentence "The modern name 'France' derives from the name of the feudal domain of the Capetian Kings of France around Paris" contains two prepositions, from and around. We extract pictures for from the name of the feudal domain of the Capetian Kings of France and also for around Paris, and then combine them.

Figure 6: Example of clustering prepositions.
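The paper does not specify how the retrieved images are combined into one clustered image; one simple reading, sketched below with Pillow, is to scale the images to a common height and paste them side by side (an assumption, not necessarily the authors' method):

from PIL import Image

def cluster_images(paths, height=200, gap=10, background=(255, 255, 255)):
    # Scale all retrieved images to a common height and paste them
    # side by side on a single canvas.
    if not paths:
        raise ValueError("no images to cluster")
    imgs = []
    for p in paths:
        im = Image.open(p).convert("RGB")
        width = max(1, int(im.width * height / im.height))
        imgs.append(im.resize((width, height)))
    total_width = sum(im.width for im in imgs) + gap * (len(imgs) - 1)
    canvas = Image.new("RGB", (total_width, height), background)
    x = 0
    for im in imgs:
        canvas.paste(im, (x, 0))
        x += im.width + gap
    return canvas

# e.g. cluster_images(["genoa.jpg", "columbus.jpg"]).save("clustered.jpg")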

Our annotators were unable to find images to represent temporal expressions, and indeed this is a difficult problem. We therefore give temporal expressions special treatment. To identify them, we use the TRIOS temporal expression identification and normalization system¹⁰ [25], which had the second best performance in TempEval-2 [18]. When we identify a time, instead of searching for a picture of it, we represent it with a generic image that conveys time and add the extracted text below it. An example is given below.

Figure 5: The representation of a temporal expression includes the extracted text and a picture. The picture conveys time generally, but not a specific time.

Structuring the images and compressed text

The final step is to combine the images and compressed text into a structured format.¹¹ Every sentence has a main event (which we do not try to represent with a picture), a subject entity, an object entity and clustered prepositions. We construct the MMS using the following visual layout of these elements.

Figure 7: Generalized visual layout for MMS.

This representation is very similar to the ABC layout [2], since the subject and object are essentially who did the action and to whom. The primary difference is that MMS includes prepositions and does not attempt to find a picture for the main event. As mentioned earlier, it is not clear from their description how they represent hard-to-depict events. It might have worked in their simple domain; however, they state that they only find pictures for easy-to-depict words, so many events may be missed as part of the filtering process. ROC-MMS makes appropriate trade-offs that enable it to create MMS diagrams for arbitrary text, even text that includes complex sentences.

¹⁰ The temporal expression normalizer is also available as open source at: http://www.cs.rochester.edu/u/naushad/temporal
¹¹ All our auto-generated diagrams are generated using the GraphViz toolkit.
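Since the diagrams are rendered with GraphViz (footnote 11), the layout step can be sketched roughly as below using the Python graphviz bindings (an assumption; the exact arrangement of Figure 7 is not reproduced, and the node styling is illustrative only):

from graphviz import Digraph

def mms_diagram(subject, event, obj, preps, out="mms"):
    # subject, obj and every element of preps are (compressed text, image
    # path) pairs; event is text only, since ROC-MMS does not look for a
    # picture of the main event.
    g = Digraph("MMS", format="png")
    g.attr(rankdir="LR")

    def picture_node(name, text, image):
        # a box showing the retrieved picture with the compressed text below it
        g.node(name, label=text, image=image, shape="box",
               labelloc="b", height="1.5", imagescale="true")

    picture_node("subject", *subject)
    g.node("event", label=event, shape="ellipse")
    picture_node("object", *obj)
    g.edge("subject", "event")
    g.edge("event", "object")
    for i, (text, image) in enumerate(preps):
        picture_node("prep%d" % i, text, image)
        g.edge("event", "prep%d" % i, style="dashed")
    g.render(out, cleanup=True)

# e.g. mms_diagram(("French fur traders", "fur_trade.jpg"), "established",
#                  ("outposts of New France", "new_france.jpg"),
#                  [("around the Great Lakes", "great_lakes.jpg")])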

One example output from our system is given below:

Figure 8: Multimodal summary (MMS) of the sentence, "French fur traders established outposts of New France around the Great Lakes; France eventually claimed much of the North American interior, down to the Gulf of Mexico."

Some sentences do not contain prepositions (or they may not be correctly extracted). In such cases, we show only the event, subject and object, as shown below.

Figure 9: MMS of the sentence, "The Carolingian dynasty ruled France until 987, when Hugh Capet, Duke of France and Count of Paris, was crowned King of France."

For sentences lacking an object, we merge the event text with the subject text and show it in the subject text field. In the following example, died (the event) is merged with Charles IV (the subject).

Figure 10: MMS of the sentence, "Charles IV (The Fair) died without an heir in 1328."

EVALUATION

Illustrating a sentence with a diagram of pictures and text is difficult; evaluating how good a diagram is may be even harder because it is very subjective. In this section, we first evaluate the subtasks of our multimodal summarization system in isolation. We then evaluate how well our representation retains the information of the overall sentence. All our evaluations are done on 44 sentences drawn from the Wikipedia articles on the United States and France.

Identifying the Main Event and Related Entities

We trained our main event identification classifier on TempEval-2 training data and tested it with 10-fold cross-validation. Our performance for main event identification was around 77.94% (F-score); the baseline of choosing the first verbal event as the main event achieves around 59.64% on the TempEval domain. We then ported the system to the Wikipedia domain and evaluated it considering each annotator as the gold standard. We calculated precision and recall for both cases; the performance is reported in Table 1.

Metric      Performance
Precision   79.10%
Recall      73.11%
F-score     75.98%

Table 1. Main event identification performance

We extract entities by first traversing the nn (noun compound modifier) and amod (adjectival modifier) dependencies of the dependency tree. If that entity results in a good picture (the matching score is above the threshold), we keep it; otherwise we traverse all dependencies of the event, resulting in a phrase. Our extracted entities often do not exactly match the annotators' entities but may partially¹² match them. We report the average performance (over both annotators) of our system on entity extraction in Table 2. We only consider cases in which our system and the annotators identified the same main event.

Metric                      Performance
Average strict precision    29.29%
Average strict recall       31.64%
Average relaxed precision   76.76%
Average relaxed recall      83.82%

Table 2. Entity extraction performance
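Strict versus relaxed matching (footnote 12) can be made concrete with a small sketch; how the per-annotator averages in Table 2 were aggregated is not fully specified, so only the matching criterion is shown:

def entities_match(sys_entity, gold_entity, relaxed):
    a, b = sys_entity.lower(), gold_entity.lower()
    return (a in b or b in a) if relaxed else a == b

def precision_recall(system_entities, gold_entities, relaxed=False):
    # Strict matching requires identical strings; relaxed matching accepts
    # one entity being a substring of the other (footnote 12).
    tp_sys = sum(any(entities_match(s, g, relaxed) for g in gold_entities)
                 for s in system_entities)
    tp_gold = sum(any(entities_match(s, g, relaxed) for s in system_entities)
                  for g in gold_entities)
    precision = tp_sys / len(system_entities) if system_entities else 0.0
    recall = tp_gold / len(gold_entities) if gold_entities else 0.0
    return precision, recall

# precision_recall(["French fur traders", "outposts"],
#                  ["French fur traders", "outposts of New France"])
# -> (0.5, 0.5) strict; with relaxed=True -> (1.0, 1.0)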

Extracting Pictures

To evaluate how well our system extracts pictures, we compared our system's output to the extractions of two human annotators. We consider cases where our system and an annotator, with relaxed matching, identified the same main event and the same entities and both extracted an image. In Table 3, we show the percentage of cases in which both extracted an image, given that both extracted the same entity. Not all extracted entities have a picture: the human annotators sometimes did not extract a picture because they thought a concept could not be illustrated with a picture, or because they thought there was no suitable picture in Wikipedia to represent that entity. Likewise, our system does not suggest a picture for an entity if no picture scores above the threshold. We also compared the two annotators with each other and report the average system performance. Our system's performance is very similar to the agreement between the two annotators.

Evaluation                        Both entities got an image
Annotator1 vs Annotator2          66.66%
Average of Annotators vs System   65.47%

Table 3. Performance of image extraction

¹² Either our entity is a substring of the annotator's entity, or vice versa. Relaxed matching is partial matching.

On these selected matching pictures, we compare our extracted image with the images extracted by the annotators. We classify our output into Same Image (the system and the annotator extracted the same image), Different Image but acceptable (e.g., for France, one extracted the French flag and the other extracted a map of France), and Bad Image (images we consider unacceptable, i.e., a wrong representation of the text). A judge (another graduate student, who was neither an annotator nor an author) performed this classification.

Evaluation                        Ann 1 vs Ann 2    Ann vs System (average)
Exact same image                  47.05%            21.51%
Different image, but acceptable   52.95%            44.15%
Different and bad image           –                 34.34%

Table 4. Performance on the quality of our extracted images

We can see that our system extracts a decent picture around 65% of the time.

How well our structure with simple compressed text helps readers understand the text

In the previous subsections, we showed our performance on the different subtasks, which eventually propagates to the final performance; but overall, how well does our system generate diagrams that convey the message of the content to users? Does automatic illustration really help text comprehension? Do human-generated illustrations help text comprehension? An illustration without text is unlikely to be useful if the domain is new to the reader, because the reader will not be able to interpret the pictures in the first place. That is why MMS diagrams include simple compressed text and a simple structure along with the event, subject, object, and prepositions.

In this section, we motivate MMS over picture-only diagrams by showing that users get a better understanding from the MMS diagrams generated by ROC-MMS than they do from diagrams containing only pictures, even when human annotators have identified the pictures.

For this evaluation, we recruited participants on Amazon Mechanical Turk.¹³ In the task shown to participants, we display our system-generated MMS diagram and ask the Turkers to explain the diagram in English. Participants were also given the option of saying that they "Can't explain the diagram." One example is shown in Figure 11.

Figure 11: ROC-MMS generated diagram for "Gaul was conquered by Rome under Julius Caesar in the 1st century BC"

Next we created a diagram using the entities and pictures selected by the human annotators (representing a gold standard), but without the structural layout or text of our MMS diagrams. Influenced by Mihalcea and Leong [3], this baseline orders the pictures of the entities in sentence order. For example, for the sentence "Gaul was conquered by Rome under Julius Caesar in the 1st century BC", we created the diagram with the picture for Gaul first, then the event conquered (as text), then the picture for Rome and finally Julius Caesar. The annotators thought 1st century BC was hard to illustrate, and so did not find a picture for it. We asked our annotators not to find pictures for events, since we do not represent events with pictures; instead we added the text for the events in the annotators' diagrams. One example diagram is shown in Figure 12.

Figure 12: Diagram using human-identified entities and pictures for "Gaul was conquered by Rome under Julius Caesar in the 1st century BC"

Although the pictures are accurate, it is quite difficult to work out the meaning of this diagram. We see two maps; many people might not understand which country or place each one is. Even if they were to somehow interpret the first one as Gaul and the second as Rome, they would likely read it wrongly as Gaul conquered Rome, because the diagram is linearly ordered instead of using a subject, event, object structure like ours. In contrast, our diagram for the same example failed to get a good representative picture for Rome, and the Stanford parser failed to find that 1st century BC is also related to the event conquered; but with structure and text, many people were able to understand the content and produced something very similar to the original summary text.

¹³ Mechanical Turk website: www.mturk.com. For this task, we paid $0.01 for explaining a diagram with text. For each sentence, we collected responses from 10 unique workers.

For each sentence, 10 different Turkers provided explanations in English of the diagrams (both those generated by our system and those built from the two annotators' selections). We used Rouge [27], the automatic evaluation toolkit for summarization, to test how well their explanations retained the information of the original sentence's summary. We generated the reference summaries using the annotators' identified entities and events, ordered linearly like the diagram. For the example given above, our annotator's reference summary was "Gaul conquered Rome Julius Caesar 1st century BC". These reference summaries are not grammatical and consist only of the main event and the important entities. The Rouge evaluation handles this well because it is based on n-gram matching and does not consider the grammaticality of sentences. For each system, we compute the average Rouge score for each sentence (averaging over the 10 Turkers' scores) and then average over all sentences. We also average the two annotators' scores and report the average annotator Rouge score.

We report both Rouge-1¹⁴ and Rouge-L, since both perform very well for evaluating very short, headline-like summaries [27]. We report precision (P), recall (R) and F-score (F).
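As a reminder of what these scores measure, Rouge-1 reduces to clipped unigram overlap between an explanation and the reference summary; a minimal sketch (the actual evaluation used the Rouge toolkit [27]):

from collections import Counter

def rouge_1(candidate, reference):
    # Clipped unigram overlap between a Turker's explanation (candidate)
    # and the reference summary.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# rouge_1("Rome conquered Gaul under Julius Caesar",
#         "Gaul conquered Rome Julius Caesar 1st century BC")
# counts 5 overlapping unigrams out of 6 candidate / 8 reference tokens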

Evaluation                            Rouge-1            Rouge-L
Explanation of annotators' diagrams   0.0892482 (F)      0.08451066 (F)
                                      0.0680995 (R)      0.0635695 (R)
                                      0.1294495 (P)      0.1260265 (P)
Explanation of the ROC-MMS diagrams   0.2405093 (F)      0.21649513 (F)
                                      0.26668 (R)        0.23619 (R)
                                      0.2190162 (P)      0.199832 (P)

Table 5. Rouge-1 and Rouge-L for explanations of the annotators' diagrams (average) and of our system's diagrams

The results match our intuition: participants did not do a very good job of explaining a diagram with a sentence when they were provided with only pictures – even though human annotators selected those pictures. Our system, on the other hand, did much better, despite the possibility of cascading errors from parsing, main event identification, entity extraction and picture identification.

¹⁴ Rouge-1 is based on unigrams and Rouge-L is based on the longest common subsequence.

Although the inclusion of text gave the MMS diagrams some advantage in the Rouge measurement, which is based on n-grams, the results still suggest that ROC-MMS is able to accurately identify the main concepts of the sentences and produce pictures that are reasonable. More broadly, this evaluation shows the advantage of adding even minimal text, as many participants were largely unable to produce accurate descriptions of the diagrams containing only pictures. Surprisingly, few participants simply wrote the text contained within the MMS diagrams, suggesting that the evaluation was more nuanced.

We believe that MMS diagrams will eventually be helpful for people who have trouble reading and understanding complex text, and may help capable readers skim documents more easily. The end goal of MMS is to improve reading comprehension; ROC-MMS represents an important step in this direction.

FUTURE WORK

We evaluated ROC-MMS on the Wikipedia domain to show that multimodal summarization can be applied to complex text in order to generate diagrams that combine text, pictures, and structure. These evaluations have shown the promise of creating MMS diagrams completely automatically for arbitrary text, and suggest numerous future research opportunities.

First, our system currently relies partly on Wikipedia. An obvious extension would be to explore its performance on raw text and adapt its modules to handle more general resources. The TRIPS parser used in ROC-MMS already identifies named entities, which we may be able to use to find better pictures for specific kinds of entities: for a person we might search for a portrait, for a country a flag or map.

Multimodal summarization sits between two extremes. One extreme would be to consider all events instead of only the main events, i.e., to represent everything with pictures and text. This may be useful for people who have trouble reading and want as much of the information as possible in a multimodal representation. The other extreme is to first apply sentence-level summarization to pick the important sentences and then apply multimodal summarization only to the selected sentences; this would represent only the important sentences and only the important information in them, which could be very useful for capable readers skimming through articles. Exploring the relative benefits along this dimension could better characterize their potential.

We simplified the problem of illustration by not representing events with pictures, because events are usually hard to depict. Future work may try to illustrate events by more intelligently searching for the event along with its subject and object. We also want to extend the proposed multimodal summarization by adding a speech modality [15].

Finally, we want to extend our evaluation to look at how MMS (and other summary techniques) improves reading comprehension for the target groups who motivated this work – specifically, people who have difficulty reading.

CONCLUSION

In this paper, we approached the problem of visualizing text as multimodal summarization. To create MMS diagrams, we automatically summarize text by extracting simple sentence structures (subject – who did it, event – what happened, object – to whom, preposition – how) and illustrate the text with pictures and compressed text together. Our evaluation showed that we achieve good performance on all of the subtasks required to create MMS diagrams, and that the MMS diagrams generated by ROC-MMS were easier to understand than human illustrations with pictures alone. Our implementation and evaluation leveraged the Wikipedia domain, but the approach embodied in ROC-MMS can be generally extended to unrestricted text.

ACKNOWLEDGMENT

We thank the three anonymous reviewers for their valuable feedback. We also thank Benjamin van Durme for his suggestion of prototyping on the Wikipedia domain, and Anna Loparev, Amal Fahad and Shantonu Hossain for help with annotation tasks.

REFERENCES

1. R. N. Carney and J. R. Levin, "Pictorial Illustrations Still Improve Students' Learning from Text," Educational Psychology Review, vol. 14, 2002.
2. B. Goldberg, et al., "Easy as ABC? Facilitating pictorial communication via semantically enhanced layout," Twelfth International Conference on Computational Natural Language Learning, 2008.
3. R. Mihalcea and B. Leong, "Toward communicating simple sentences using pictorial representations," Association of Machine Translation in the Americas, 2006.
4. J. Zhu, et al., "A text-to-picture synthesis system for augmenting communication," The Integrated Intelligence Track of the Twenty-Second AAAI Conference on Artificial Intelligence, 2007.
5. K. Barnard, et al., "Matching words and pictures," Machine Learning Research, vol. 3, pp. 1107–1135, 2003.
6. D. Joshi, et al., "The story picturing engine—a system for automatic text illustration," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 2(1), 2006.
7. Paivio, "Mental representations: A dual coding approach," New York: Oxford University Press, 1986.
8. M. Glenberg, "Component-levels theory of the effects of spacing of repetitions on recall and recognition," Memory and Cognition, vol. 7, pp. 95-112, 1979.
9. R. G. Greene, "Spacing effects in memory: Evidence for a two-process account," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 15, pp. 371-377, 1989.
10. M. Glenberg and W. E. Langston, "Comprehension of illustrated text: pictures help to build mental models," Memory and Language, vol. 31, pp. 129–151, 1992.
11. R. E. Mayer, Multimedia Learning. Cambridge, UK: Cambridge University Press, 2001.
12. U. Frith, "A developmental framework for developmental dyslexia," Annals of Dyslexia, vol. 36, pp. 69-81, 1985.
13. S. L. H. Association, "Roles and responsibilities of speech-language pathologists with respect to augmentative and alternative communication: Technical report," ASHA Supplement, vol. 24, 2004.
14. N. UzZaman, et al., "Pictorial Temporal Structure of Documents to Help People who have Trouble Reading or Understanding," International Workshop on Design to Read, CHI, Atlanta, GA, 2010.
15. J. P. Bigham, et al., "WebAnywhere: A Self-Voicing, Web-Browsing Web Application," International Conference on the World Wide Web, Beijing, China, 2008.
16. K. Knight and D. Marcu, "Summarization beyond sentence extraction: a probabilistic approach to sentence compression," Artificial Intelligence, vol. 139, pp. 91–107, 2002.
17. J. Pustejovsky, et al., "TimeML: Robust Specification of Event and Temporal Expressions in Text," New Directions in Question Answering, 2003.
18. J. Pustejovsky and M. Verhagen, "SemEval-2010 task 13: evaluating events, time expressions, and temporal relations (TempEval-2)," Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 2010.
19. Y. Matsuo and M. Ishizuka, "Keyword Extraction from a Single Document Using Word Co-Occurrence Statistical Information," International Journal on Artificial Intelligence Tools, vol. 13, pp. 157-170, 2004.
20. R. Mihalcea and P. Tarau, "TextRank: Bringing Order into Texts," Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, 2004.
21. R. Datta, et al., "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, vol. 40, pp. 1-60, 2008.
22. Coyne and R. Sproat, "WordsEye: An automatic text-to-scene conversion system," SIGGRAPH, 2001.
23. K. Barnard and D. Forsyth, "Learning the Semantics of Words and Pictures," Eighth International Conference on Computer Vision (ICCV'01), 2001.
24. J. Lafferty, et al., "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," International Conference on Machine Learning, 2001.
25. N. UzZaman and J. F. Allen, "TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from Text," International Workshop on Semantic Evaluations, ACL, 2010.
26. J. F. Allen, et al., "Deep semantic analysis of text," Symposium on Semantics in Systems for Text Processing (STEP), 2008.
27. Y. Lin, "ROUGE: A package for automatic evaluation of summaries," ACL Text Summarization Workshop, 2004.


Author's index

James F. Allen, R. Wade Allen, Ignacio Alvarez, Gabriel Barata, Ashweeni K. Beeharee, André Berton, Jeffrey P. Bigham, Pradipta Biswas, Rolf Black, Rainer Bodendorfer, Daniel Braun, Elliot Buller, An Mei Chen, Heng-Tze Cheng, Shelby S. Darnell, Michael Eichhorn, Josh I. Ekandem, Christoph Endres, Sandro Rodriguez Garzon, Juan E. Gilbert, Daniel Gonçalves, Jin Sun Ju, Eun Yi Kim, Moritz Kümmerling, Pat Langdon, Sven Laqua, Gerrit Meixner, Kamlesh Mistry, Christian Müller, George D. Park, Martin Pfannenstein, Mark Poguntke, Ashu Razdan, Joseph Reddington, Ehud Reiter, Theodore J. Rosenthal, M. Angela Sasse, Kristof Schütt, Adriano Scoditti, Eckehard Steinbach, João Teixeira, Nava Tintarev, Naushad UzZaman, Annalu Waller, Damon L. Woodard, Li Zhang
