MIAA - Automotive IUI - DFKI




Proceedings of the<br />

3rd Workshop on Multimodal Interfaces for<br />

<strong>Automotive</strong> Applications<br />

(<strong>MIAA</strong> ’11)<br />

February 13, 2011, Palo Alto, CA, USA<br />

organized at the International Conference on Intelligent User<br />

Interfaces (<strong>IUI</strong> ’11)<br />

Organizers:<br />

Christoph Endres, German Research Center for Artificial Intelligence (<strong>DFKI</strong>)<br />

Gerrit Meixner, German Research Center for Artificial Intelligence (<strong>DFKI</strong>)<br />

Christian Müller, German Research Center for Artificial Intelligence (<strong>DFKI</strong>)

Preface<br />

Multimodal interaction constitutes a key technology for intelligent user interfaces (<strong>IUI</strong>). The<br />

possibility to control devices and applications in a natural way enables easier access to complex<br />

functionality as well as infotainment contents. In recent years, the complexity of on-board and<br />

accessory devices, infotainment services, and driver assistance systems in cars has experienced an<br />

enormous increase. This development emphasizes the need for new concepts for advanced human-machine<br />

interfaces that support the seamless, intuitive and efficient use of this large variety of<br />

devices and services.<br />

A modern car already implements hundreds of functions that a user can interact with, in some cases<br />

deployed over almost a hundred embedded platforms. These numbers will grow further for the next<br />

generation of high-class vehicles. The growing number of electronic devices integrated into cars also<br />

affects the creation of the user interface. The built-in electronic control units are able to provide<br />

valuable context information, which needs to be considered for an intelligent management of<br />

multimodal interaction inside the car. Sensor information such as vehicle speed, location (using GPS<br />

plus gyroscope and accelerometer for greater reliability), outside temperature, etc., allows drawing<br />

conclusions about the current driving situation. Furthermore, dialog management needs to keep<br />

track of state changes of operating elements like control switches. Access to vehicle functions is also<br />

essential in order to initiate desired operations.<br />

The goal of this workshop is to present, discuss, and outline context-aware multimodal interfaces for<br />

drivers and car passengers. The ultimate goal of this workshop is to unify innovative concepts that<br />

aim towards a new dimension of ease of use.<br />

The topics of the workshop with a strong focus on automotive or traffic applications are:<br />

• speech interfaces for in-car use<br />

• multimodal interaction<br />

• novel multimedia interfaces and in-car entertainment<br />

• user interface issues for assistive functionality<br />

• audio-visual information and entertainment<br />

• information fusion and fission<br />

• CAN bus architectures<br />

• experimental platforms and simulation solutions<br />

• user-centered design applications<br />

• multi-party interaction concepts<br />

• integrated hardware solutions<br />

• car2car and car2X communication<br />

• approaches for the evaluation of novel car user interfaces<br />

• user interfaces for navigation systems<br />

• detection and estimation of user intentions<br />

• novel interactive car applications<br />

• interactive applications for drivers and passengers<br />

• model-driven user interface development

Table of Contents<br />

Flexible and Real-time Scenario Building for Experimental Driving Simulation Studies<br />

George D. Park, R. Wade Allen and Theodore J. Rosenthal<br />

Contactless Gesture Recognition for Mobile Devices<br />

Heng-Tze Cheng, An Mei Chen, Ashu Razdan and Elliot Buller<br />

One Application, One User Interface Model, Many Cars: Abstract Interaction Modeling in the<br />

<strong>Automotive</strong> Domain<br />

Mark Poguntke and André Berton<br />

A Novel Multimedia Session Management Approach for In-Vehicle Middleware based on DPWS<br />

Michael Eichhorn, Martin Pfannenstein, Rainer Bodendorfer and Eckehard Steinbach<br />

“Hands Busy, Eyes Busy”: Generating Stories from Sensor Data for <strong>Automotive</strong> applications<br />

Joe Reddington, Ehud Reiter, Nava Tintarev, Rolf Black and Annalu Waller<br />

A novel taxonomy for gestural interaction techniques: considerations for automotive<br />

environments<br />

Adriano Scoditti<br />

Navigating Haystacks at 70 mph: Intelligent Search for Intelligent In-Car Services<br />

Ashweeni K. Beeharee, Sven Laqua and M. Angela Sasse<br />

Discover Significant Situations for User Interface Adaptations<br />

Sandro Rodriguez Garzon and Kristof Schütt<br />

A new interaction technique based on eye tracking and single switch scanning systems<br />

Pradipta Biswas and Pat Langdon<br />

Gesture Recognition Exploration using Haartraining and KNN in a 3D Racing Game<br />

Kamlesh Mistry and Li Zhang<br />

Model-Based User Interface Development in the <strong>Automotive</strong> Industry<br />

Moritz Kümmerling and Gerrit Meixner<br />

A Robotic Wheelchair using Human Gestures and Scene Contexts<br />

Jin Sun Ju and Eun Yi Kim<br />

MetaBrain: Web Information Extraction and Visualization<br />

João Teixeira, Gabriel Barata and Daniel Gonçalves<br />

MyDash: The Biometric Digital Dashboard<br />

Shelby S. Darnell, Ignacio Alvarez, Josh I. Ekandem, Damon L. Woodard and Juan E. Gilbert<br />

Prototyping a Semi-Automatic In-Car Texting Assistant<br />

Christoph Endres, Daniel Braun and Christian Müller<br />

Multimodal Summarization of Complex Sentences<br />

Naushad UzZaman, Jeffrey P. Bigham and James F. Allen

Flexible and Real-time Scenario Building for<br />

Experimental Driving Simulation Studies<br />

George D. Park, R. Wade Allen, and Theodore J. Rosenthal<br />

Systems Technology, Inc.<br />

13766 Hawthorne Blvd., Hawthorne CA<br />

georgepark@systemstech.com<br />

ABSTRACT<br />


The applications and cross-disciplinary nature of driving<br />

safety require driving simulation software to be sensitive to<br />

the requirements and limitations of its users. Provided<br />

here is an introduction to the driving simulation software<br />

STISIM Drive and its unique approach towards flexible,<br />

real-time scenario building for applied experimental driving<br />

research. Several key concepts on how a user defines/builds<br />

a driving scenario and how the 3D graphics are generated in<br />

relation to the driver are discussed. Advantages and<br />

disadvantages of the STISIM Drive approach are discussed.<br />

References to previous user applications are provided.<br />

Author Keywords<br />

Driving simulation, scenario design, STISIM Drive.<br />

ACM Classification Keywords<br />

H5.2 Evaluation/methodology. H5.m Miscellaneous.<br />


Real-time, interactive (i.e., human-in-the-loop) driving<br />

simulation offers many advantages to the experimental<br />

researcher/developer interested in the areas of driving<br />

assessment, training, and research. It allows for a safe and<br />

controlled testing environment of driver behaviors in<br />

relation to the independent variable(s) of interest: driver<br />

factors (e.g., age, experience, drugs/alcohol/fatigue, mental<br />

workload, and deficits related to perception, cognition, or<br />

psychomotor function), intervention factors (e.g., education and<br />

training programs), environmental factors (e.g., roadway<br />

infrastructure design, signage, weather, and traffic), and<br />

vehicle/device factors (e.g., controls/handling, dashboard<br />

design, warning systems, cell phones, and in-vehicle<br />

telematics).<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />

Given the array of applications and the cross-disciplinary nature<br />

of driving safety, simulation software needs to be sensitive<br />

to the requirements and limitations of its users. Not all users<br />

will have the background or the resources for extensive<br />

scenario building in virtual environments (VE). In addition,<br />

the end product of driving simulation is rarely the<br />

simulation itself (e.g., a video game), but is more often a<br />

means for assessing the effect of one of the aforementioned<br />

independent variables. Therefore, a method of scenario<br />

development that is flexible, rapid, and cost-effective is<br />

often critical to project success.<br />

Databases for real-time 3D simulation have traditionally<br />

been developed in graphics programs as composite 3D<br />

models. In essence, a large, predefined virtual world is<br />

created for the user to interact with. This approach requires<br />

extensive effort and experience with graphics modeling<br />

programs to define the details required in driving simulation<br />

[1]. More user-friendly scenario building systems may use a<br />

“tile-based” system where the developer pieces together<br />

predefined tiles of the road (e.g., an intersection or street<br />

block) to create the larger virtual world [2]. The end result<br />

is a roadway environment not unlike a real-world,<br />

coordinate-based map. While this may appear to be an<br />

intuitive method of scenario development, it may not be an<br />

entirely practical means of scenario design for experimental<br />

research.<br />

The purpose of this paper is to provide an introduction to<br />

the driving simulation software STISIM Drive and its<br />

approach towards flexible, real-time scenario building for<br />

applied experimental driving research. STISIM Drive is a<br />

PC-based, desktop driving simulator software system that is<br />

highly configurable with regard to hardware fidelity (driver<br />

displays and controls). Several key concepts on how a user<br />

defines/builds a driving scenario and how the 3D graphics<br />

are generated in relation to the driver are discussed.<br />


The Scenario Definition Language (SDL) is a scripting<br />

language developed for STISIM Drive to define the<br />

scenario events (i.e., what appears and happens) in a<br />

particular driving scenario run. The events are defined by<br />

ASCII text statements in a simple syntax form:

On Distance, Event, Appear Distance, Parameter1,<br />

Parameter2, …, ParameterN<br />

The On Distance is defined as the longitudinal distance<br />

(feet or meters as specified by user) driven by the driver in<br />

relation to the scenario environment at which the event will<br />

activate. At the start of a scenario, the driver’s vehicle<br />

distance is generally set at zero. Event refers to a specific<br />

procedure (e.g., a roadway, building, vehicle, or<br />

pedestrian). Appear Distance refers to the longitudinal<br />

distance (ft or m) in relation to the On Distance that the<br />

event will actually be displayed in the roadway scenery.<br />

The Parameters are the specific attributes given to the<br />

event (e.g., roadway dimensions, model type, lateral<br />

location, speed, timing, etc.). Take, for example, the<br />

following SDL statement for displaying a 3D model of a<br />

building:<br />

500, Building, 1000, 40, B1<br />

When the driver reaches 500 ft, a building event will be<br />

initiated. It will appear 1000 ft ahead of the driver (so<br />

technically at 1500 ft from the start of the run). The lateral<br />

position will be 40 ft to the right of the center dividing line<br />

(Parameter1 for building events). The building model type<br />

will be B1, which the model library defines as the café<br />

(Parameter2).<br />
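
The arithmetic in this example can be reproduced with a few lines of Python. This is only a minimal sketch, not STISIM Drive's own parser: the names `SdlEvent` and `parse_statement` are hypothetical, and it assumes that every statement follows the general comma-separated form given above.<br />

```python
from dataclasses import dataclass


@dataclass
class SdlEvent:
    """Hypothetical container for one statement of the general form above."""
    on_distance: float      # distance driven at which the event activates
    event: str              # event type, e.g. "Building"
    appear_distance: float  # how far ahead of the driver the event is displayed
    parameters: list        # remaining event-specific fields, kept as strings

    @property
    def absolute_position(self) -> float:
        """Distance from the start of the run at which the event is displayed."""
        return self.on_distance + self.appear_distance


def parse_statement(line: str) -> SdlEvent:
    """Split one comma-separated statement into its named fields."""
    fields = [field.strip() for field in line.split(",")]
    return SdlEvent(
        on_distance=float(fields[0]),
        event=fields[1],
        appear_distance=float(fields[2]),
        parameters=fields[3:],
    )


building = parse_statement("500, Building, 1000, 40, B1")
print(building.absolute_position)  # 1500.0
print(building.parameters)         # ['40', 'B1']
```

The parameters are left as strings because their meaning depends on the event type; for building events the text identifies the first as the lateral position and the second as the model type.<br />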

As shown in the above example, there is a single SDL event<br />

statement for each model in a particular scenario. There are<br />

also over 50 different available event types for the user to<br />

specify. While this may appear cumbersome for complex<br />

scenario designs, SDL statements can be arbitrarily<br />

arranged since the program sorts all events according to<br />

distance during run initialization. This allows the user to<br />

group statements according to meaningful chunks of<br />

roadway (e.g., street blocks) and/or categories (e.g.,<br />

roadway definition, traffic control devices, roadside objects,<br />

traffic, etc.) to make relatively efficient global scenario<br />

changes.<br />
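
The effect of sorting at run initialization can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the program's actual logic: `sort_statements` is a hypothetical helper, the parameter values are invented, and the text does not say how statements with the same On Distance are ordered.<br />

```python
def sort_statements(lines):
    """Return statements ordered by their leading On Distance field."""
    def on_distance(line):
        return float(line.split(",")[0])
    # Python's sorted() is stable, so statements that share an On Distance
    # keep their file order here. Whether STISIM Drive behaves the same way
    # is not stated in the text.
    return sorted(lines, key=on_distance)


# Statements grouped by category rather than by position along the road.
# Event names follow the examples in the text; parameter values are invented.
scenario = [
    "1500, Building, 1000, 40, B1",
    "500, Building, 1000, 40, B1",
    "1500, Vehicle, 800, 6",
    "0, Roadway, 0, 12",
]

ordered = sort_statements(scenario)
for line in ordered:
    print(line)
```

Because ordering is recovered automatically, the author of a scenario file is free to keep, for example, all building statements together and all traffic statements together.<br />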

Besides 3D model events, there are SDL events that specify<br />

crash/violation settings, sound files, weather, data<br />

input/output signals, and data collection. Furthermore, the<br />

SDL allows the user to define and call subroutines referred<br />

to as previously defined events (PDEs), which are<br />

combinations of event statements that give a desired<br />

composite effect (e.g., buildings grouped around an<br />

intersection, traffic streams, vehicle/pedestrian collision<br />

events, etc.). Additional details on developing driving<br />

scenarios have been reported elsewhere [3].<br />


Due to the inherent variability in driver behaviors and<br />

factors that may affect a driver’s vehicle speed and steering<br />

(e.g., mental workload, age, experience, fatigue, risk<br />

perception), the initiation of dynamic 3D models (e.g.,<br />

vehicles, pedestrians, signal lights) into action in the VE<br />

can be a complex process. This is particularly so if the<br />

intention is to create critical hazards that require an<br />

immediate driver response. For example, Figure 1 provides scenario<br />

screenshots of an amber (yellow) light intersection event<br />

(top) and a pedestrian crossing event in front of the driver<br />

(middle).<br />

Figure 1. STISIM Drive screenshots of amber signal light<br />

intersection (top), pedestrian crossing in front of driver<br />

(middle), and construction zone (bottom).<br />

STISIM Drive handles dynamic 3D model event triggering<br />

in several ways. In most cases, the variability in<br />

drivers’ speed can be neutralized by triggering events based<br />

on headway time (i.e., time-to-collision between the object<br />

and driver). However, additional parameters can be set to<br />

ensure data integrity: longitudinal distance of driver (or<br />

object) on the road, distance between driver and object,<br />

lateral position relationships, signal light changes, driver<br />

speed thresholds, and elapsed runtime.<br />
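
Why a headway-time trigger neutralizes speed differences can be seen in a small Python sketch. It assumes a stationary object and a constant driver speed, with distances in feet and speeds in feet per second; the function names and the 3-second threshold are hypothetical and are not taken from STISIM Drive.<br />

```python
def headway_time(gap, driver_speed):
    """Seconds until the driver reaches the object at the current speed."""
    if driver_speed <= 0:
        return float("inf")  # a stationary driver never closes the gap
    return gap / driver_speed


def should_trigger(gap, driver_speed, threshold_s):
    """True once the headway time has dropped to the threshold."""
    return headway_time(gap, driver_speed) <= threshold_s


# A fast driver (44 ft/s) triggers the event 132 ft away; a slower driver
# (30 ft/s) triggers it only at 90 ft. Both are 3 s from the object.
print(should_trigger(gap=132.0, driver_speed=44.0, threshold_s=3.0))  # True
print(should_trigger(gap=132.0, driver_speed=30.0, threshold_s=3.0))  # False
print(should_trigger(gap=90.0, driver_speed=30.0, threshold_s=3.0))   # True
```

A distance-based trigger would instead give the faster driver less time to respond, which is the confound the time-based trigger avoids.<br />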

There is also the ability for the simulation operator to<br />

manually trigger events during a simulation run. Manually<br />

triggered events can consist of single discrete events<br />

(e.g., a sound file or crossing pedestrian) or larger PDE files

composed of an array of static or dynamic 3D models. In<br />

effect, the operator can initiate whole sections of a scenario<br />

in real time depending on how the driver is behaving. For example,<br />

in Figure 1 (bottom), the operator can initiate a complete<br />

construction zone layout that includes vehicles and tubes<br />

onto the road.<br />


The STISIM Drive method for generating the simulation<br />

scenario can be described as partial (or delayed) VE<br />

generation, where only a portion of the virtual world is<br />

displayed as the driver’s vehicle travels down the road. This<br />

is the basis of how the simulation is generated and how the<br />

driving scenarios are conceptually designed with the SDL.<br />

To illustrate the concept, Figures 2a and 2b both show a<br />

vehicle approaching an intersection. In normal simulation<br />

programs using a coordinate map-based system (Figure 2a),<br />

continuing straight or turning left/right sends the driver into<br />

different sections (A, B, or C) of the virtual world. In<br />

STISIM Drive (Figure 2b), continuing straight or turning<br />

left/right sends the driver into the same section (B). The<br />

reason for this relates back to how scenario events are<br />

defined in the SDL. Since the On Distance of an event (in<br />

this case Section B) can be specified to occur after a driver<br />

reaches a particular road distance, Section B has not been<br />

generated yet. When turning (or not turning) the driver’s<br />

longitudinal distance travelled is still accumulating;<br />

therefore, Section B will continue to appear in relation to<br />

the start of the scenario. Once the On Distance for an event<br />

has been reached by the driver, the event is committed to<br />

appear in accordance with its specified parameters.<br />
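
The accumulation logic described in this paragraph can be sketched in Python as follows. The class `PartialScenario` and its methods are hypothetical and greatly simplified; the sketch only illustrates that events are keyed to distance travelled, so the driver's heading has no effect on what appears next.<br />

```python
class PartialScenario:
    """Activates events by accumulated distance, not by map position."""

    def __init__(self, events):
        # events: list of (on_distance, name) pairs, in any order
        self.pending = sorted(events)
        self.travelled = 0.0

    def advance(self, step, heading="straight"):
        """Move the driver forward by `step`; return newly activated events.

        `heading` is accepted only to show that it has no influence: turning
        left or right accumulates distance exactly as driving straight does.
        """
        self.travelled += step
        activated = []
        while self.pending and self.pending[0][0] <= self.travelled:
            activated.append(self.pending.pop(0)[1])
        return activated


def drive(headings):
    """Drive 100 distance units per step and collect the activated events."""
    sim = PartialScenario([(300.0, "Section B"), (100.0, "Intersection")])
    seen = []
    for heading in headings:
        seen.extend(sim.advance(100.0, heading))
    return seen


print(drive(["straight", "straight", "straight"]))  # ['Intersection', 'Section B']
print(drive(["straight", "left", "left"]))          # ['Intersection', 'Section B']
```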

a) b)<br />

Figure 2. a) Coordinate map-based VE generation.<br />

b) Partial VE generation.<br />

Moving into different sections when turning is not normally<br />

problematic, and it is intuitive when designing VEs in a<br />

coordinate map context. However, if the goal of the<br />

scenario is to measure driver response to a particular event<br />

(e.g., a pedestrian crossing or vehicle pullout in section B),<br />

scenario design becomes problematic. For normal<br />

simulation programs, unintended turning may result in<br />

system crashes when the boundaries of the VE are<br />

exceeded. Secondly, additional programming is required for<br />

sections A and C even though the driver may not encounter<br />

them. To ensure the occurrence of a particular event for<br />

measurement, the designer must either artificially preclude<br />

vehicle turning, rely on driver compliance, add<br />

corresponding events in sections A and C, or have an<br />

operator manually trigger the event once a driver has<br />

committed to a particular roadway section. These<br />

options, while manageable, are not necessarily parsimonious,<br />

nor do they take into account the inherent unpredictability of human<br />

behavior.<br />

The advantages of partial VE generation for experimental<br />

driving research are multiple. Since the driver does not<br />

experience scenario sections based on a coordinate map<br />

system, roadway sections are essentially presented serially<br />

in nature. This means all drivers experience the same<br />

scenario regardless of turning behaviors. Drivers cannot get<br />

disoriented or lost in the VE. Instead, the illusion of turning<br />

into different VE sections is created for the driver while<br />

roadway events are presented as intended by the researcher.<br />

In addition, counterbalancing of scenario events or whole<br />

roadway sections (using PDEs) can then be easily designed<br />

to control for order effects. This method can also reduce the<br />

design requirements and development time for a particular<br />

scenario.<br />
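
One common way to counterbalance serially presented sections is a cyclic rotation of their order across participants. The Python sketch below shows that idea under stated assumptions: the section names are placeholders borrowed from the Figure 1 examples, the helper names are hypothetical, and a simple rotation balances position only, not which section immediately precedes which.<br />

```python
def rotated_orders(sections):
    """One presentation order per row; each section occupies each position once."""
    n = len(sections)
    return [[sections[(row + col) % n] for col in range(n)] for row in range(n)]


def order_for(participant_index, sections):
    """Cycle participants through the rows of the rotation."""
    orders = rotated_orders(sections)
    return orders[participant_index % len(orders)]


# Placeholder names standing in for PDE files that define whole road sections.
sections = ["pedestrian", "amber_light", "construction"]
for participant in range(4):
    print(participant, order_for(participant, sections))
```

Each resulting order could then be written out as a sequence of PDE calls, so that every participant drives the same sections in a different, controlled order.<br />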

One of the main limitations of partial VE generation is the<br />

inability to simulate specific geography with regard to a<br />

coordinate map-based system. Therefore, studies involving<br />

simulation with GPS mapping and navigational tasks are<br />

problematic. Additionally, the possibility of non-realistic<br />

route corrections for driver navigational errors is present.<br />

For example, if a driver makes a wrong turn, U-turns into<br />

previously presented scenario sections are not handled well<br />

since the program provides only a limited distance of<br />

backtracking. The driver is also not able to perform other<br />

corrective procedures normally seen in driving such as three<br />

rights to make a left turn and vice versa. Previous system<br />

users have overcome some of these obstacles by modifying<br />

general program settings and adding a single elaborate<br />

large-scale 3D city model [4]. It should be noted that these<br />

applications would require considerable 3D modeling<br />

resources since the system was not conceptually designed<br />

to function in this manner.<br />


The advantages and disadvantages of the partial VE<br />

generation approach used by STISIM Drive should be<br />

weighed by users during initial study design. The flexibility<br />

of scenario design and the relatively simple scripting language<br />

(SDL) for building and modifying scenarios make it a highly<br />

user-defined system that mitigates inherent driver<br />

variability. This in conjunction with flexible hardware<br />

options has enabled the STISIM Drive software approach to<br />

be well validated and used in nearly every aspect of driver<br />

safety research. This includes driver factor effects such as ageing<br />

[5, 6], novice drivers [7], traumatic brain injury [8], and<br />

pharmaceutical effects [9], as well as vehicle and device interactions such as<br />

in-vehicle information devices [10], cognitive workload<br />

effects [11], and collision warning systems [12]. Successful<br />

integration of simulation software with actual vehicle

control hardware systems has also been demonstrated for<br />

steering [13] and braking systems [14]. Additional<br />

information and resources can be found on the software<br />

website (www.stisimdrive.com).<br />

REFERENCES<br />


1. Cremer, J., J. Kearney, and Y. Papelis, Driving<br />

simulation: Challenges for VR technology. IEEE<br />

Computer Graphics and Applications, 1996. 16(5): p.<br />

16-20.<br />

2. Suresh, P. and R.R. Mourant. A tile manager for<br />

deploying scenarios in virtual driving environments. in<br />

DSC 2005 North America. 2005. Orlando, FL.<br />

3. Park, G.D., T.J. Rosenthal, and B.L. Aponso,<br />

Developing Driving Scenarios for Research, Training<br />

and Clinical Applications. Advances in Transportation<br />

Studies An International Journal, 2004. 2004 Special<br />

Issue.<br />

4. Marcotte, T.D., et al., A multimodal assessment of<br />

driving performance in HIV infection. Neurology,<br />

2004. 63: p. 1417-1422.<br />

5. Lee, H.C. The validity of driving simulator to measure<br />

on-road driving performance of older drivers. in 24th<br />

Conference of Australian Institutes of Transport<br />

Research (CAITR). 2002. Sydney, AUS.<br />

6. Park, G.D., et al. Older driver simulator performance<br />

in relation to driving habits and DMV records. in 2nd<br />

International Conference on Technology and Aging.<br />

2007. Toronto, Canada.<br />

7. Allen, R.W., et al. A PC Based Simulation System for<br />

Driver Assessment and Training. in TRB Annual<br />

Meeting. 2005. Washington, D.C.<br />

8. Stern, E.B., et al., Discriminating between brain<br />

injured and non-disabled persons: a PC-based<br />

interactive driving simulator pilot project. Advances in<br />

Transportation Studies An International Journal, 2004.<br />

Special Issue.<br />

9. Kay, G. The effect of Adderall XR and Atomoxetine on<br />

simulated driving safety in young adults with ADHD. in<br />

18th Annual US Psychiatric & Mental Health<br />

Congress. 2004. Las Vegas, NV.<br />

10. Wang, Y., et al., The validity of driving simulation for<br />

assessing differences between in-vehicle informational<br />

interfaces: A comparison with field testing.<br />

Ergonomics, 2010. 53(3): p. 404-420.<br />

11. Reimer, B., Impact of Cognitive Task Complexity on<br />

Drivers' Visual Tunneling. Transportation Research<br />

Record, 2009(2138): p. 13-19.<br />

12. Maltz, M. and D. Shinar, Imperfect in-vehicle collision<br />

avoidance warning systems can aid drivers. Human<br />

Factors, 2004. 46(2): p. 357-366.<br />

13. Eskandarian, A., et al. Development of an active<br />

steering control system in a car driving simulator. in<br />

SAE World Congress & Exposition. 2009. Detroit, MI.<br />

14. Allen, R.W., et al. A hardware-in-the-loop simulation<br />

of braking capability. in DSC 2005 Europe. 2008.<br />


Contactless Gesture Recognition for Mobile Devices<br />

Heng-Tze Cheng ∗<br />

Electrical and Computer Engineering<br />

Carnegie Mellon University<br />

hengtze@cmu.edu<br />

ABSTRACT<br />


While gesture interfaces are becoming pervasive, most existing<br />

approaches are undesirable for mobile devices because of<br />

their high power consumption or the inconvenience of requiring users<br />

to wear or hold specific sensors. In this paper, we present<br />

a contactless gesture recognition system for mobile devices<br />

using proximity sensors. A set of infrared signal feature extraction<br />

methods and a decision-tree-based gesture classifier<br />

are proposed. The system allows a user to interact with mobile<br />

devices using intuitive gestures, without touching the<br />

screen or wearing/holding any additional device. Evaluation<br />

results show that the system is low-power, and able to recognize<br />

gestures with over 98% precision in real time.<br />

Author Keywords<br />

Gesture recognition, proximity sensor, infrared LED<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: User Interfaces—Input<br />

devices and strategies<br />


Gesture-based interfaces provide an intuitive way for users<br />

to specify commands and interact with computers [6, 8]. As<br />

mobile phones and tablets become ubiquitous, there is an increasing<br />

need for intuitive user interfaces for small-sized,<br />

resource-limited mobile devices.<br />

Most existing gesture recognition systems can be classified<br />

into three types: motion-based, touch-based, and vision-based<br />

systems. For motion-based systems [11, 4], users cannot<br />

make gestures unless holding a mobile device or an external<br />

controller. Touch-based systems [12, 10] can accurately map<br />

the finger/pen positions and moving directions on the touchscreen<br />

to different commands. However, 3D gestures are not<br />

supported because all possible gestures are confined within<br />

the 2D screen surface. While the first two types of system<br />

∗ This work was done during the author’s employment at the Office of<br />

The Chief Scientist, Qualcomm Incorporated.<br />

Permission to make digital or hard copies of all or part of this work for<br />

personal or classroom use is granted without fee provided that copies are<br />

not made or distributed for profit or commercial advantage and that copies<br />

bear this notice and the full citation on the first page. To copy otherwise, or<br />

republish, to post on servers or to redistribute to lists, requires prior specific<br />

permission and/or a fee.<br />

An Mei Chen, Ashu Razdan, Elliot Buller<br />

Office of The Chief Scientist<br />

Qualcomm Incorporated<br />

{anc, arazdan, ebuller}@qualcomm.com<br />

require users to make contact with devices, vision-based systems<br />

[8, 14] using camera and computer vision techniques<br />

allow users to make intuitive gestures without touching the<br />

device. However, most vision-based systems are computationally<br />

expensive and power-consuming, which is undesirable<br />

for resource-limited mobile devices like tablets or mobile<br />

phones.<br />

To solve the existing challenges, we present a contactless<br />

gesture recognition system using only two infrared proximity<br />

sensors. We propose a set of infrared feature extraction<br />

and gesture classification algorithms. Using the system as<br />

a gesture interface, a user can flip e-book pages, scroll web<br />

pages, zoom in/out, and play games on mobile devices using<br />

intuitive hand gestures, without touching, wearing, or holding<br />

any additional devices. The design also reduces the frequency<br />

of users’ contact with devices, alleviating the wear<br />

and tear to screen surfaces.<br />

The main contributions of the paper are: 1) The design and<br />

evaluation of a contactless gesture recognition system using<br />

only two proximity sensors. 2) The proposed infrared (IR)<br />

feature set and classifier for real-time gesture classification.<br />

3) A reduction in the power consumption of gesture recognition.<br />


There has been extensive research on vision-based gesture<br />

recognition [8, 14], mostly focusing on the detection of hand<br />

trajectory. Although such systems can recognize complex gestures,<br />

they can be sensitive to background objects, color, and lighting.<br />

Robustness can be improved by adding color markers on<br />

the user’s hand [5], at the cost of the inconvenience of<br />

wearing additional gear. Moreover, continuous video recording<br />

of a user can make the user feel under surveillance and<br />

poses a threat to user privacy.<br />

Recently, SideSight [1] proposed an around-device multitouch<br />

interface by placing ten IR sensors on the long edges<br />

of a small mobile device. Another related work, HoverFlow<br />

[3], used six IR sensors facing the user to capture IR image<br />

maps, and then classified gestures using dynamic time warping<br />

(DTW). In this work, we reduce the number of the required<br />

IR sensors to two and thus reduce the power consumption,<br />

which is mentioned as a critical issue in [1]. Even<br />

using the limited information from only two IR sensors, our<br />

system can achieve accurate gesture recognition using the<br />

proposed IR feature set and the classifier.<br />

For motion-based systems, one recent work, uWave

[4], matches accelerometer data with gesture templates using<br />

DTW. Accuracies of 98.6% and 93.5% were achieved with and<br />

without template adaptation, respectively, for user-dependent<br />

gesture recognition. However, a user needs to hold a device<br />

with an accelerometer and press a button to indicate the start and<br />

end of a gesture. In this work, we eliminate these limitations<br />

with contactless gesture recognition.<br />

Electromyogram-based (EMG-based) systems [2, 13] offer another<br />

novel way to recognize gesture patterns using electrical<br />

activity produced by skeletal muscles. However, a user<br />

must wear EMG sensors on the wrist at all times to perform<br />

gestures, which can be inconvenient and unsuitable for mobile<br />

device interfaces.<br />


Design Considerations<br />

Our system is designed based on four design considerations:<br />
1) Automatically detect gesture boundaries: A common challenge<br />
of gesture recognition is the uncertainty of when<br />
a gesture begins or ends. We do not require a user to press a<br />
key to indicate the presence of a gesture, since doing so would be<br />
inconvenient. 2) Recognition must be real-time: A gesture<br />
interface must be very responsive, so no time-consuming<br />
postprocessing is allowed. 3) False alarms need to be minimized:<br />
Executing a wrong command is generally worse than<br />
missing a command. 4) No user-dependent model training<br />
process for new users: Although supervised learning can optimize<br />
the performance for a specific user, collecting training<br />
data is time-consuming and undesirable for users.<br />

Proximity Sensor Data Acquisition<br />

We now describe each system component shown in Fig. 1. A<br />
proximity sensor consists of two IR LEDs and an IR receiver,<br />
which are placed underneath a plastic/glass screen surface,<br />
surrounded by optical barriers. The LEDs emit IR strobes<br />
in turn as two separate channels using time-division multiplexing.<br />
When a hand or any other object is near, the receiver detects<br />
the reflection of the IR light, whose intensity increases<br />
as the object distance decreases. The light intensities of the<br />
two IR channels are sampled by the firmware at 100 Hz.<br />

Framing<br />

Since the start and end of a gesture are not specified by the<br />
user, our program uses a moving window to scan the input IR<br />
intensity data and decide whether any gesture signature is observed.<br />
The data is divided into 50% overlapping frames, each of<br />
which is 140 ms long. After framing, three types of features are<br />
extracted from each frame.<br />
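The framing step can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the function name `split_into_frames` is hypothetical, and the frame length of 14 samples and hop of 7 samples are derived here from the stated 100 Hz sampling rate, 140 ms frame length, and 50% overlap.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 100   # from the text: both channels are sampled at 100 Hz
FRAME_MS = 140         # from the text: each frame is 140 ms long
OVERLAP = 0.5          # from the text: frames overlap by 50%

# Derived values (assumption: frames are measured in whole samples).
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 14 samples per frame
HOP = int(FRAME_LEN * (1 - OVERLAP))            # 7 samples between frame starts


def split_into_frames(samples: Sequence[float]) -> List[List[float]]:
    """Return every complete, 50%-overlapping frame of one IR channel."""
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        frames.append(list(samples[start:start + FRAME_LEN]))
    return frames


# Synthetic ramp standing in for one second of sensor data (not real data).
one_second = [float(i) for i in range(SAMPLE_RATE_HZ)]
frames = split_into_frames(one_second)
print(len(frames), "frames of", len(frames[0]), "samples")
```

With these values a new frame starts every 70 ms, so a decision can be made at that rate without any postprocessing over longer windows.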

Infrared Feature Extraction<br />

Inter-channel Time Delay<br />

The feature measures the pair-wise time delay between the<br />

sensor data of two channels, which shows how a hand approaches<br />

the IR LEDs at different instants. This corresponds<br />

to different moving directions of hands (see Fig. 2 for example).<br />

The time delay $t_D$ is calculated by finding the time<br />
shift $n$ that yields the maximum cross-correlation value of two<br />
discrete signal sequences $f$ and $g$:<br />
$$t_D = \arg\max_{n} \sum_{m=-\infty}^{\infty} f^{*}(m)\, g(m+n) \qquad (1)$$<br />
Figure 1: The architecture of the gesture recognition system.<br />
(Components: proximity sensor data, framing, cross-correlation module, linear regression module, signal statistics module, temporal dependency computation, gesture model, gesture classifier, gesture history database. Hardware: mobile device screen, infrared LED, proximity sensor with infrared receiver.)<br />
Figure 2: An example of proximity sensor data and the features.<br />
(Panels: raw sensor data, IR intensity (lux) of Channel L and Channel R over time (s); time delay (ms) measured by cross-correlation; slope measured by linear regression. Annotated gestures: 3 left swipes, 3 right swipes, push, pull.)<br />
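A direct evaluation of Eq. (1) on finite frames might look like the sketch below. It assumes real-valued samples (so the complex conjugate has no effect), treats samples outside a frame as zero, and limits the search to a hypothetical `max_lag`; none of these choices are stated in the paper.

```python
from typing import Sequence


def cross_correlation(f: Sequence[float], g: Sequence[float], n: int) -> float:
    """Sum of f(m) * g(m + n) over all m for which both samples exist.

    Samples outside a frame are treated as zero (an assumption of this sketch).
    """
    total = 0.0
    for m in range(len(f)):
        k = m + n
        if 0 <= k < len(g):
            total += f[m] * g[k]
    return total


def time_delay(f: Sequence[float], g: Sequence[float], max_lag: int) -> int:
    """Return the lag n in [-max_lag, max_lag] that maximizes the correlation.

    A positive result means g is a delayed copy of f.
    """
    best_lag = 0
    best_value = cross_correlation(f, g, 0)
    for n in range(-max_lag, max_lag + 1):
        value = cross_correlation(f, g, n)
        if value > best_value:
            best_lag, best_value = n, value
    return best_lag


# Synthetic pulses, not measured data: channel R sees the hand 3 samples later.
channel_l = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
channel_r = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0]
lag = time_delay(channel_l, channel_r, max_lag=6)
print(lag, "samples, i.e.", lag * 10, "ms at 100 Hz")
```

The sign of the result indicates which channel saw the hand first, which is the information a swipe-direction test needs.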

Local Sum of Slopes<br />

This feature estimates the local slope of the signal segment<br />
within a frame, which shows how fast the user’s hand is moving<br />
toward or away from the proximity sensors. The slope is<br />
calculated by first-order linear regression, and then summed<br />
with the slopes of the 6 previous frames. The local sum<br />
captures the continuous trend of the slopes rather than sudden<br />
changes.<br />
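As a rough sketch of this feature, the code below fits a least-squares line to the samples of each frame and keeps a running sum over the current frame and the 6 previous ones. The names `regression_slope` and `LocalSlopeSum` are hypothetical, and the slope is expressed per sample index, which is an assumption about units.

```python
from collections import deque
from typing import Sequence


def regression_slope(frame: Sequence[float]) -> float:
    """Least-squares slope of the frame values against the sample index."""
    n = len(frame)
    mean_x = (n - 1) / 2
    mean_y = sum(frame) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(frame))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


class LocalSlopeSum:
    """Running sum of the current slope and the slopes of the previous frames."""

    def __init__(self, history: int = 6) -> None:
        # Keep the current frame plus `history` previous frames.
        self._slopes = deque(maxlen=history + 1)

    def update(self, frame: Sequence[float]) -> float:
        self._slopes.append(regression_slope(frame))
        return sum(self._slopes)


tracker = LocalSlopeSum()
rising = [0.0, 1.0, 2.0, 3.0]          # synthetic frame with slope 1 per sample
sums = [tracker.update(rising) for _ in range(8)]
print(sums)
```

Because old slopes drop out of the window, a steady approach of the hand builds up a large sum, while a single noisy frame changes it only slightly.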

Signal Statistics<br />

This feature comprises the mean and variance of the raw sensor data. A high variance<br />
can be observed when a gesture is present; in contrast,<br />
when no hand is present or a hand is hovering above the sensor,<br />
a low variance is observed.<br />

Gesture Recognition Algorithm<br />

After feature extraction, the decision-tree classifier shown in<br />
Fig. 3 is used to classify the frame as one of the gestures<br />
in the predefined gesture model, or to report that no gesture is<br />
detected. We also keep a history of 7 frames to take the temporal<br />
dependency between consecutive frames into consideration.<br />
For example, when a gesture is detected, the system suppresses<br />
the output of the same gesture for 6 frames because it is hard<br />
for a user to make the same gesture again very quickly. Once<br />
the gesture sequence history of a user is obtained, the transition<br />
probability between gestures can also be incorporated<br />
to improve the recognition accuracy.<br />
Figure 3: Illustration of the decision-tree-based gesture classifier.<br />
(Tests: Variance &lt; Threshold?; Time Delay &gt; Threshold?; inter-channel delay with branches Ch L lags and Ch R lags; local sum of slopes with branches &gt; Threshold, &lt; −Threshold, and otherwise. Leaves: Left Swipe, Right Swipe, Push, Pull, No Gesture.)<br />
Figure 4: A subject performed a left-swipe gesture using the<br />
prototype sensor board.<br />
(Labels: USB port, sensor board, IR LED Channel L, IR LED Channel R, IR receiver; start and end of a left-swipe gesture.)<br />
Figure 5: Precision and recall rate of gesture recognition.<br />
(Panels: (a) precision of left/right swipe, (b) recall of left/right swipe, (c) precision of push/pull; values in % for User1 to User5 and the average.)<br />
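The classification and suppression logic described above could be sketched as follows. This is an illustration under several loudly stated assumptions: the threshold values are placeholders, the time-delay test is applied to the absolute delay, a positive delay is taken to mean that Channel R lags and is mapped to a right swipe, and a positive slope sum is mapped to a push. The paper sets its thresholds empirically and does not list them.

```python
NO_GESTURE = "no gesture"


def classify_frame(variance: float, delay: float, slope_sum: float, *,
                   var_thr: float, delay_thr: float, slope_thr: float) -> str:
    """Decision tree over the three frame features (thresholds are hypothetical)."""
    if variance < var_thr:
        return NO_GESTURE                      # signal too flat to be a gesture
    if abs(delay) > delay_thr:
        # Assumption: delay > 0 means Channel R lags, i.e. the hand moved L to R.
        return "right swipe" if delay > 0 else "left swipe"
    if slope_sum > slope_thr:
        return "push"                          # intensity rising: hand approaching
    if slope_sum < -slope_thr:
        return "pull"                          # intensity falling: hand retreating
    return NO_GESTURE


class RepeatSuppressor:
    """Suppress repeats of the same gesture for a number of frames."""

    def __init__(self, hold_frames: int = 6) -> None:
        self.hold_frames = hold_frames
        self._last = None
        self._remaining = 0

    def filter(self, label: str) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            if label == self._last:
                return NO_GESTURE              # same gesture again too soon
        if label != NO_GESTURE:
            self._last = label
            self._remaining = self.hold_frames
        return label


suppressor = RepeatSuppressor()
outputs = [suppressor.filter("push") for _ in range(8)]
print(outputs)
```

In this sketch a push detected in eight consecutive frames is reported once, suppressed for the next 6 frames, and reported again afterwards.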


We implemented the prototype system using a Silicon Labs<br />
Si1120 infrared proximity sensor [9]. The sensor data were<br />
transmitted to a laptop through a USB serial port. The feature<br />
extraction and gesture recognition algorithms were implemented<br />
in C++. The window sizes and thresholds were set empirically<br />
through experiments to minimize the false alarm<br />
rate of the system. A picture of the prototype system and a<br />
subject performing a gesture is shown in Fig. 4.<br />


Figure 5 (d): Recall of push/pull (values in % for User1 to User5 and the average).<br />

We define four essential gestures for evaluation: left swipe,<br />
right swipe, push (hand moving vertically down<br />
toward the device), and pull (hand moving vertically up away<br />
from the device). The system is evaluated on a gesture dataset<br />
collected from five subjects, including four right-handed users and<br />
one left-handed user. Their ages range from the 20s to the 40s, and<br />
one of them is female. The dataset consists of 2,000 gesture<br />
samples in total, with each user performing each of the four<br />
gestures 100 times.<br />

Recognition Performance<br />

We use the widely used precision/recall metric to evaluate<br />
the recognition performance:<br />
$$\text{precision} = \frac{TP}{TP + FP} \qquad (2)$$<br />
$$\text{recall} = \frac{TP}{TP + FN} \qquad (3)$$<br />

where TP, FP, and FN refer to true positives, false positives, and<br />
false negatives. As shown in Fig. 5, the system achieved 98%<br />
precision on average, and is robust across users. The<br />
high precision implies a low false alarm rate, which is ideal<br />
for gesture recognition because executing a wrong command<br />
is usually worse than missing a command. The recall rate is<br />
lower than the precision because the system can miss gestures<br />
when the hand is too far from the sensor, or when a gesture<br />
is performed much more slowly than usual.<br />
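The two metrics can be computed from raw counts as in the short sketch below. The counts are made up for illustration and are not the paper's results; returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported gestures that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of performed gestures that were reported: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for one gesture class (illustration only).
tp, fp, fn = 45, 5, 15
print(f"precision = {precision(tp, fp):.2f}, recall = {recall(tp, fn):.2f}")
```

A classifier that prefers to report nothing when unsure keeps FP small and therefore precision high, while the missed gestures show up as FN and lower the recall.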

User and System Factors<br />

We further design two experiments on user and system factors<br />

to evaluate the robustness and limitations of the system.<br />

User-to-Device Distance<br />

First, we evaluate the influence of user-to-device distance on<br />

the system performance. The distance is measured from the<br />

user’s hand to the proximity sensors. As shown in Fig. 6, the<br />

system can achieve over 80% accuracy when the user’s hand<br />

is within 3 inches. The effective range can be increased by<br />
increasing the power of the IR LEDs, at the cost of higher<br />
power consumption. One can balance this tradeoff according<br />
to the system requirements on user experience and battery life.<br />

Speed of Gesture<br />

Next, we evaluate the system performance when users perform<br />
gestures at different speeds. In this experiment, the user<br />
listens to a specific tempo given by an electronic metronome;<br />
the first beat “tic” indicates the start of a gesture, and the second<br />
beat “toc” indicates the end of a gesture. According to<br />
our observation, most users naturally make gestures at a<br />
speed of 2 to 4 gestures per second. In other words, it usually<br />
takes general users 0.25 to 0.5 seconds to complete a<br />
gesture. As shown in Fig. 7, the system achieves over 90%<br />
accuracy at general gesture speeds, and also maintains a robust<br />
performance of over 80% at very slow (1 gesture per<br />
second) or very fast (5 gestures per second) gesture speeds.<br />
Figure 6: Recognition accuracy vs. hand-to-sensor distance.<br />
(Axes: accuracy in % against hand-to-sensor distance from 1 to 4 inches.)<br />
Figure 7: Recognition accuracy vs. speed of gesture.<br />
(Axes: accuracy in % against speed of gesture from 1 to 5 gestures per second.)<br />

Power Consumption<br />

The system power is dominated by the power consumed by<br />
the IR LED ($P_{LED}$) and the control chip ($P_{chip}$):<br />
$$P_{LED} + P_{chip} = f_{conv} \cdot T_{prx} \cdot (I_{LED} + I_{chip}) \cdot V_{LED} \qquad (4)$$<br />
which is only 0.3 mW (idle) to 20 mW (active, when an object<br />
is in proximity) [9], much lower than the 200-mW power<br />
budget for the typical user interface of a mobile device as reported<br />
in [7]. $V$, $I$, $f_{conv}$, and $T_{prx}$ denote voltage, current, conversion<br />
frequency, and pulse width, respectively.<br />


We have presented a contactless gesture recognition system<br />

that allows users to make gesture inputs without touching,<br />

holding, or wearing any device. Using the proposed IR feature<br />

set and classifier, the system can recognize gestures with<br />

98% precision and 88% recall rate. The low power consumption<br />

and high accuracy make the system particularly<br />

desirable for deployment on resource-limited mobile consumer<br />

devices.<br />

Our future work is to extend the configuration to multiple<br />

sensor arrays to get more information from sensor data. Using<br />

the basic gesture set as building blocks, we can further<br />

recognize more compound 3D gestures as permutations of<br />

the simple ones. A hidden Markov model can also be incorporated<br />

to learn the gesture sequences performed by users.<br />


1. A. Butler, S. Izadi, and S. Hodges. SideSight:<br />
multi-“touch” interaction around small devices. In<br />
Proc. UIST, pages 201–204, 2008.<br />

2. J. Kim, S. Mastnik, and E. André. EMG-based hand<br />

gesture recognition for realtime biosignal interfacing.<br />

In Proc. IUI, pages 30–39, 2008.<br />

3. S. Kratz and M. Rohs. Hoverflow: exploring<br />

around-device interaction with ir distance sensors. In<br />

Proc. MobileHCI, pages 42:1–42:4, 2009.<br />

4. J. Liu, L. Zhong, J. Wickramasuriya, and V. Vasudevan.<br />

uWave: Accelerometer-based personalized gesture<br />

recognition and its applications. Pervasive Mob.<br />

Comput., 5(6):657–675, 2009.<br />

5. P. Mistry, P. Maes, and L. Chang. WUW - wear ur<br />

world: a wearable gestural interface. In Proc. CHI ’09,<br />

pages 4111–4116, 2009.<br />

6. S. Mitra and T. Acharya. Gesture recognition: A<br />

survey. IEEE Trans. Syst., Man and Cybern.,<br />

37(3):311–324, 2007.<br />

7. Y. Neuvo. Cellular phones as embedded systems. In<br />

IEEE ISSCC, 2004.<br />

8. V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual<br />

interpretation of hand gestures for human-computer<br />

interaction: A review. PAMI, 19(7):677–695, 1997.<br />

9. Silicon Labs. Proximity/ambient light sensor with<br />

PWM output, 2009.<br />

10. W. C. Westerman and J. G. Elias. System and method<br />

for packing multi-touch gestures onto a hand, April<br />

2006.<br />

11. A. Wilson and S. Shafer. XWand: UI for intelligent<br />

spaces. In Proc. SIGCHI conf. Human factors in<br />

comput. syst., pages 545–552, 2003.<br />

12. J. O. Wobbrock, A. D. Wilson, and Y. Li. Gestures<br />

without libraries, toolkits or training: a $1 recognizer<br />

for user interface prototypes. In Proc. ACM UIST,<br />

pages 159–168, 2007.<br />

13. X. Zhang et al. Hand gesture recognition and virtual<br />

game control based on 3D accelerometer and EMG<br />

sensors. In Proc. IUI, pages 401–406, 2009.<br />

14. M. H. Yang, N. Ahuja, and M. Tabb. Extraction of 2D<br />

motion trajectories and its application to hand gesture<br />

recognition. IEEE Trans. Pattern Anal. Mach. Intell.,<br />

24(8):1061–1074, 2002.

One Application, One User Interface Model, Many Cars:<br />
Abstract Interaction Modeling in the Automotive Domain<br />
Mark Poguntke<br />
Daimler AG<br />
Wilhelm-Runge-Straße 11, 89081 Ulm<br />
mark.poguntke@daimler.com<br />
André Berton<br />
Daimler AG<br />
Wilhelm-Runge-Straße 11, 89081 Ulm<br />
andre.berton@daimler.com<br />

We present an approach for user interface generation based<br />
on abstract interaction modeling using UML class and state<br />
diagrams. This enables the flexible enhancement of<br />
an automotive infotainment system with new external<br />
applications. A main objective is to do this without<br />
breaching the requirements resulting from the automotive<br />
context, e.g. minimized driver distraction. We achieve<br />
consistency with the automotive interaction and design<br />
concept by transforming the abstract model to the<br />
respective user interface concept and illustrate this with two<br />
automotive HMI concepts.<br />
Author Keywords<br />
HCI, Interaction Modeling, Abstract Interaction Model,<br />
Model-driven User Interface Development.<br />

A typical automotive infotainment system includes<br />
navigation, audio and video players, and a phone<br />
application. Often, the only external applications integrated<br />
into the system are Bluetooth telephony, external music<br />
players, and pre-defined internet services, for example for weather<br />
forecasts or points of interest. With current<br />
technology, the features provided initially also do not<br />
change during the lifetime of a car. Imagine buying a<br />
desktop computer and having to use it for ten years without<br />
the possibility of installing new applications; this is not<br />
satisfactory. It is our goal to make automotive systems more<br />
flexible and allow for the integration of new applications at<br />
later stages. However, the primary purpose of a car is still<br />
to provide a safe means of transportation. This implies a set<br />
of specific and very restrictive requirements for the design<br />
of the Human-Machine Interface (HMI), especially<br />
concerning the use of infotainment applications while<br />
driving. Minimizing driver distraction is an important<br />
requirement, and external applications can only be<br />
integrated at a later stage provided that this<br />
requirement is maintained. To ensure this, control over<br />
the interaction and design concept of external applications<br />
has to remain with the in-car software.<br />
Permission to make digital or hard copies of all or part of this work for<br />
personal or classroom use is granted without fee provided that copies are<br />
not made or distributed for profit or commercial advantage and that copies<br />
bear this notice and the full citation on the first page. To copy otherwise,<br />
or republish, to post on servers or to redistribute to lists, requires prior<br />
specific permission and/or a fee.<br />
3rd International Workshop on Multimodal Interfaces for Automotive<br />
Applications (MIAA) in conjunction with IUI 2011, Palo Alto, CA, USA.<br />

Different cars and model lines often have different HMI<br />
devices and corresponding concepts, e.g. a touch<br />
screen based HMI or an HMI operated with a<br />
central control element (CCE), as typically used in premium<br />
segment cars. Also, design concepts differ in screen<br />

resolution, screen layout, colors and styles. In order to<br />

seamlessly integrate external applications, we also have to<br />

provide solutions for multiple modalities.<br />

We address both issues: keeping control over the<br />
HMI integration of external applications on the one hand,<br />
and achieving the flexibility and adaptability needed to serve different<br />
HMI concepts on the other. Our approach is based on<br />
abstract interaction modeling using the Unified Modeling<br />
Language (UML) [4] and XML-based user interface<br />
descriptions.<br />

<strong>Automotive</strong> Requirements<br />

The above-mentioned conditions call for great<br />
care in integrating user interfaces into the<br />
car. Many countries also restrict the design and use of<br />
automotive infotainment applications through regulations,<br />
e.g. the European Statement of Principles on HMI for in-vehicle<br />
information and communication systems (ESoP) or<br />
the guidelines of the Alliance of Automobile Manufacturers<br />
(AAM). Compliance with ergonomic standards, e.g.<br />
ISO 9241-110, is a particularly desirable goal.<br />

Updating an automotive infotainment system with new<br />

applications which include new user interfaces is a critical<br />

modification. Unfamiliar user interfaces or inconsistencies<br />

with the in-car interaction concept may lead to driver<br />

distraction, limitations in interaction and frustrated drivers<br />

who hold the automotive manufacturer responsible for the<br />

whole infotainment system. This emphasizes the need for a<br />

carefully developed approach for integrating new<br />

applications and their user interfaces into the car. An<br />

automated user interface generation process has to conform<br />

to restrictive and well-defined rules.

Illustrating Example<br />

Throughout the following, we use an example scenario to<br />

illustrate the approach. An external to-do list application<br />
comprises the following functionalities: present a list, add a<br />
new entry, select an entry, and delete a selected entry.<br />

The to-do list will be integrated into two automotive user<br />

interface concepts. These are a touch screen based HMI and<br />

a CCE-based HMI illustrated later in more detail.<br />


Several approaches exist that derive different user<br />

interfaces from abstract interaction representations. ConcurTaskTrees<br />
(CTT) [6] provides a notation to describe user<br />

interfaces on the level of task models. The User Interface<br />

eXtensible Markup Language (UsiXML) [9] describes a<br />

comprehensive modeling approach including<br />

transformations from abstract to concrete user interfaces<br />

based on the CAMELEON reference framework [1]. The<br />

Dialog and Interface Specification Language (DISL) is a<br />

user interface description language based on dialogue<br />

models and modality-independent presentation models [8].<br />

In recent years attention has also been paid to the Unified<br />

Modeling Language (UML), which is a widespread industry<br />

standard for modeling software systems. Several<br />

approaches motivate the use of UML for user interface<br />

modeling [2,3,5,7]. De Melo provides a detailed analysis of<br />

UML as a basis for model-based user interface development<br />

and emphasizes advantages concerning comprehensibility,<br />

universality and tool support amongst others [3]. We<br />

consider UML as an appropriate basis, which can be<br />

adapted and extended for our approach. The availability of<br />

established tools is particularly important for the use in<br />

industry. We focus this paper on demonstrating abstract<br />

interaction modeling techniques with UML and implementing<br />

automatic transformations from an abstract model<br />

to specific automotive user interface concepts.<br />


The general approach is illustrated in Figure 1. We use the<br />

roles of an application developer and an interaction<br />

designer. An application is developed by an application<br />

developer including a functional application interface<br />

consisting of a class diagram with attributes and operations.<br />

An interaction designer uses this interface to create an<br />

abstract interaction model using UML state charts to<br />

describe user actions and corresponding system reactions. A<br />

transformation program uses the model and generates a user<br />
interface compliant with the respective automotive HMI<br />
concept. For the transformation process, rules have to be<br />
implemented that map the abstract model elements to user<br />
interface elements for a specific concept.<br />

Figure 1. General approach: (1) The application developer<br />

provides the application interface, (2) the interaction designer<br />

creates the abstract interaction model that is used for user<br />

interface generation.<br />

The overall process is described and demonstrated for the<br />

to-do list example in the remainder of this paper. The<br />

definition of abstract data types and interaction elements is<br />

described in the following section.<br />

Abstract Data Types and Interaction Elements<br />

The application developer uses a defined set of abstract<br />

data types for the attributes to be provided. In Table 1 an<br />

extract of these data types is described.<br />

| Type | Description |
|---|---|
| Boolean | Logical value true or false |
| String | Sequence of symbols from the underlying set or alphabet. Properties: Empty (Boolean value whether the string is empty) |
| Collection | A collection of elements of a given type. Properties: Empty (Boolean value whether the collection is empty); Subselection (a collection of selected elements from the entire collection) |
| … | |

Table 1. Extract of abstract data types to be used<br />

by the application developer for the application interface.<br />

The interaction designer uses the provided attributes and a<br />

defined set of modeling elements and guidelines to create a<br />

UML state diagram. Table 2 provides an extract of elements<br />

that can be used by the interaction designer.<br />

The abstract data types and modeling elements are<br />

illustrated with the example application to-do list in the<br />

following section.

| Element | Meaning |
|---|---|
| State | Defined interaction state with a set of possible interactions |
| do-activity within State | |
| PRESENT | Presentation of an attribute to the user |
| PROVIDE | Possibility for the user to provide a value for an attribute |
| PROVIDE() | Possibility to provide a number of elements for a collection |
| Transition with keyword ACT | |
| ACT | The action that can be initiated by the user |
| ACT [] | The action that can be initiated by the user if the condition in brackets is true. |
| ACT [not ] | The action that can be initiated by the user if the condition in brackets is false. |
| Transition with keyword SELECT | |
| SELECT() | Selection of a number of elements from the collection. |
| … | |

Table 2. Extract of defined elements to be used<br />

by the interaction designer in the UML state chart.<br />

Example: To-do List<br />

The application developer provides all attributes that can be<br />

used for interaction modeling as a UML class diagram, see<br />

Figure 2. For the to-do list these are addLabel and<br />

confirmLabel, which contain texts to be presented to the<br />

user during the respective interaction steps, and a collection<br />

named entryList containing elements of the custom type<br />

Entry. The developer also provides the information that an<br />

Entry consists of one string named description. Furthermore<br />

the operations saveEntry(Entry) and deleteEntry(Entry) are<br />

provided.<br />
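To make the described application interface concrete, the following sketch mirrors it with Python dataclasses. The attribute and operation names (`addLabel`, `confirmLabel`, `entryList`, `Entry`, `description`, `saveEntry`, `deleteEntry`) come from the text; the class name `ToDoListInterface`, the method bodies, and the example label texts are assumptions made for illustration, since the paper specifies the interface in UML and does not show an implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """Custom type from the text: an entry consists of one string."""
    description: str


@dataclass
class ToDoListInterface:
    """Hypothetical stand-in for the functional application interface."""
    addLabel: str                      # text shown while adding an entry
    confirmLabel: str                  # text shown while confirming a deletion
    entryList: List[Entry] = field(default_factory=list)

    def saveEntry(self, entry: Entry) -> None:
        # Assumed behavior: append the new entry to the collection.
        self.entryList.append(entry)

    def deleteEntry(self, entry: Entry) -> None:
        # Assumed behavior: remove the given entry from the collection.
        self.entryList.remove(entry)


# Example label texts are invented for this sketch.
todo = ToDoListInterface(addLabel="New entry", confirmLabel="Delete entry?")
todo.saveEntry(Entry("Check tire pressure"))
print([e.description for e in todo.entryList])
```

The interaction designer only needs the attribute names and types from such an interface; the bodies of the operations stay hidden behind it.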

Figure 2. UML class diagram for the to-do list provided by the<br />

application developer as functional application interface.<br />

The application developer furthermore provides textual<br />
descriptions of the attributes and operations. These help<br />
the interaction designer understand the semantics in<br />
order to achieve correct mappings to the interaction model.<br />

The interaction designer uses the attributes when creating<br />

the abstract interaction model. Using UML the designer<br />

would have the possibility to include operations from the<br />

class diagram directly in the state chart. However, we<br />

decided to define the relations between interactions and<br />

operations outside of the state chart in a mapping table.<br />

This allows the interaction designer to create the interaction<br />

model independent from this mapping. Figure 3 illustrates a<br />

possible interaction model for the to-do list application.<br />

Figure 3. Abstract interaction model for the to-do list using<br />

UML state charts with the defined set of model elements.<br />

The interaction designer then uses the provided operations<br />
and defines their relations to the interaction model. This is<br />
exemplified in Table 3. The saveEntry function is<br />
connected to the ACTSave transition with the entry<br />
provided by the user in the state Add entry. The<br />
deleteEntry function is connected to the ACTYes<br />
transition and deletes the subselection of entryList, which<br />
in this case is exactly one entry selected by the user.<br />

| Application function | Relation to interaction model |
|---|---|
| saveEntry(Entry) | ACTSave; Entry: PROVIDE(1) entryList |
| deleteEntry(Entry) | ACTYes; Entry: entryList.Subselection |

Table 3. Mapping of interactions to application logic<br />

provided by the interaction designer based on the<br />

operations provided by the application developer.<br />
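One way to picture a mapping table that lives outside the state chart is a lookup from transition names to application functions, as sketched below. The transition names `ACTSave` and `ACTYes` and the function names are taken from Table 3; the dictionary `MAPPING`, the dispatcher `fire`, and the use of plain strings as entries are hypothetical simplifications.

```python
from typing import Callable, Dict, List

entry_list: List[str] = []   # simplified stand-in for entryList


def saveEntry(entry: str) -> None:
    entry_list.append(entry)


def deleteEntry(entry: str) -> None:
    entry_list.remove(entry)


# Mapping kept separate from the interaction model, as in Table 3.
MAPPING: Dict[str, Callable[[str], None]] = {
    "ACTSave": saveEntry,    # argument: the entry provided by the user
    "ACTYes": deleteEntry,   # argument: the entry selected by the user
}


def fire(transition: str, argument: str) -> bool:
    """Call the application function mapped to a transition, if there is one."""
    handler = MAPPING.get(transition)
    if handler is None:
        return False         # transition has no application logic attached
    handler(argument)
    return True


fire("ACTSave", "Call garage")
print(entry_list)
```

Keeping the table separate means the state chart can be edited or reused without touching the binding to the application logic.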

The next process step is to transform the abstract interaction<br />

model including the abstract data types and operations to<br />

different HMI concepts. For this example, we demonstrate<br />

two different automotive HMI concepts that are described<br />

in the following section.<br />

Example: Two <strong>Automotive</strong> HMI Concepts<br />

We illustrate the to-do list application with two different<br />

HMI concepts which can be summarized as follows:<br />

Touch screen based HMI: The first concept is based on<br />

operation with direct input via a touch screen. Touchable<br />

buttons are used to directly interact with the system. Lists<br />

are provided and can be operated (e.g. scrolling) via touch<br />

gestures. The system provides a software keyboard<br />

appearing when text or numbers are to be entered.<br />

CCE-based HMI: The second concept is based on indirect<br />

input via a CCE that can be pushed in eight directions,<br />

turned and pressed. Selectable menu entries are used to<br />

interact with the system. These are realized as menu

containers and are arranged in a certain hierarchy. The<br />

system provides specific complex speller widgets to enable<br />

the user to enter text or numbers.<br />

In order to map the abstract model to different HMI<br />

concepts, different rule sets have to be defined. Table 4<br />

illustrates general examples of required mappings.<br />

| Abstract element | Touch concept | CCE concept |
|---|---|---|
| PROVIDE | Text field widget and software keyboard | Edit speller widget |
| ACT | Touch button | Menu entry in a menu container |
| SELECT(1) | List box with the possibility to directly select one entry | Menu container with the possibility to navigate through the entries and select/highlight one entry |

Table 4. Example mappings from abstract to specific concepts.<br />
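The idea of one rule set per HMI concept can be sketched as a pair of lookup tables built from Table 4. This Python sketch is only an analogy: the concept keys `"touch"` and `"cce"`, the dictionary `RULES`, and the function `transform` are hypothetical, and the sketch ignores layout requirements such as list sizes and menu hierarchy.

```python
from typing import Dict, List

# Rule sets transcribed from Table 4; the keys "touch" and "cce" are made up.
RULES: Dict[str, Dict[str, str]] = {
    "touch": {
        "PROVIDE": "text field widget and software keyboard",
        "ACT": "touch button",
        "SELECT(1)": "list box with direct selection of one entry",
    },
    "cce": {
        "PROVIDE": "edit speller widget",
        "ACT": "menu entry in a menu container",
        "SELECT(1)": "menu container with navigation and selection of one entry",
    },
}


def transform(abstract_elements: List[str], concept: str) -> List[str]:
    """Map each abstract element to the widget of the chosen HMI concept."""
    rules = RULES[concept]
    return [rules[element] for element in abstract_elements]


model = ["SELECT(1)", "ACT", "PROVIDE"]   # one abstract model, two targets
for concept in ("touch", "cce"):
    print(concept, "->", transform(model, concept))
```

The same abstract model yields two different concrete user interfaces, which is the property the transformation rules are meant to provide.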

The requirements for arranging the different elements<br />

depending on some properties (e.g. list sizes, menu<br />

hierarchy, etc.) are provided by the HMI concept. These<br />

influence the transformation mechanism for each concept.<br />

We defined the specific transformation mechanisms<br />

including these requirements and exemplified the process<br />

with the to-do list example. The proof of concept is<br />

described in the following section.<br />


The two different HMI concepts were implemented with the<br />
respective widget and layout specifications based on XML descriptions<br />
in a pre-defined format. These specifications<br />
were used to create rules for the transformation from the<br />
abstract model elements to the respective specific HMI<br />
layout and interaction elements. This was implemented<br />
using eXtensible Stylesheet Language Transformations<br />
(XSLT).<br />

Based on the abstract model elements for the to-do list,<br />

example transformations were implemented for the two<br />

automotive HMI concepts described above. These<br />

transformations include enabled and disabled user actions,<br />

representations of collection variables (e.g. lists) with the<br />

selection of individual collection elements, and representations<br />

for presenting and providing basic data types like<br />

text strings. Example screenshots of the resulting generated<br />

HMIs are illustrated in Figure 4.<br />

Figure 4. Screenshots for the to-do list from the demonstrator:<br />

left: touch screen based HMI, right: CCE based HMI.<br />


We presented an abstract interaction modeling concept<br />

based on UML class diagrams and state charts. An example<br />

application was modeled and the transformation process<br />

was successfully implemented for two different automotive<br />

HMI concepts. The developed concept includes the<br />

abstraction of basic interaction possibilities and a first set of<br />

transformations for a controlled HMI generation. The<br />
demonstrated concept motivates further research and<br />
development toward more flexible and adaptive automotive<br />
infotainment systems that allow the integration of<br />
external applications after deployment of the car software.<br />

Covering a complete HMI concept specification including<br />

the respective transformation rule set may result in large<br />

implementations. Thus, one important issue for the future is<br />

to further improve the HMI specification process in order to<br />

minimize the effort of obtaining transformation rules. These<br />

activities will also support the definition of overall<br />

automotive industry solutions for HMI development<br />

processes, especially concerning modeling languages and<br />

definitions of interfaces between applications and the HMI.<br />

Detailed evaluations, the elaboration of further complex<br />

examples, and stepwise improvements and expansion of the<br />

rule sets are part of ongoing and future activities. The<br />

implementation of a client-server architecture is envisioned<br />

to allow a client HMI system to communicate with remote<br />

applications and other input and output devices via defined<br />

messages. This will also enable the flexible addition of<br />

interaction devices and modalities for external applications.<br />


1. CAMELEON Project. http://giove.cnuce.cnr.it/<br />

projects/cameleon.html (11 Nov 2010).<br />

2. Dausend, M. & Poguntke, M.: Spezifikation<br />

multimodaler Interaktionsanwendungen mit UML. In<br />

Mensch & Computer (2010), 215-224.<br />

3. De Melo, G. Modellbasierte Entwicklung von Interaktionsanwendungen,<br />

München, Germany, 2010.<br />

4. O.M.G.: UML 2.2 Superstructure Specification (2009).<br />

5. Nobrega, L., Nunes, N. J., & Coelho, H.: Mapping<br />

ConcurTaskTrees into UML 2.0. LNCS 3941 (2006).<br />

6. Paternò, F., Mancini, C., & Meniconi, S.: ConcurTaskTrees:<br />

A Diagrammatic Notation for Specifying<br />

Task Models. In Proceedings of the IFIP TC13<br />

International Conference on HCI (1997).<br />

7. Paternò, F.: Towards a UML for interactive systems.<br />

LNCS 2254 (2001), 7-18.<br />

8. Schäfer, R.: Model-Based Development of Multimodal<br />

and Multi-Device User Interfaces in Context-Aware<br />

Environments, Aachen, Germany, 2007.<br />

9. Vanderdonckt, J., Limbourg, et al.: UsiXML: A User<br />

Interface Description Language for multimodal User<br />

Interfaces. In Proc. Workshop on Multimodal<br />

Interaction WMI (2004), 1-7.

A Novel Multimedia Session Management Approach<br />

for In-Vehicle Middleware based on DPWS<br />

Michael Eichhorn*, Martin Pfannenstein*, Rainer Bodendorfer**, Eckehard Steinbach*<br />

Institute for Media Technology<br />

Technische Universität München<br />

*{firstname.lastname}@tum.de, **bodendorfer@gmx.de<br />


In this paper, we present a novel multimedia session management<br />

approach for a future Ethernet/IP-based in-vehicle<br />

communication network. All network devices are available<br />

as services in a service-oriented architecture (SOA) that is<br />

established on top of the in-vehicle network. We use the Devices<br />

Profile for Web Services (DPWS) as middleware because it<br />

is designed to support resource-constrained embedded devices,<br />

which are typical of an in-vehicle scenario. The session<br />

management has been designed to support any type of data<br />

to be exchanged between the services. In this study, we put a<br />

particular focus on in-car video streaming and demonstrate<br />

that the proposed approach successfully supports a variety<br />

of video streaming scenarios.<br />

Author Keywords<br />

service-oriented architecture, human machine interface, session<br />

management, in-car infotainment, device integration<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />


The IT infrastructure of modern cars features a variety of<br />

electronic control units (ECU) to execute automation and<br />

control tasks which ensure the vehicle’s operation on the<br />

road. Additionally, more and more comfort and entertainment<br />

functions are shipped with modern vehicles, particularly<br />

in the premium segment. The challenge that car manufacturers<br />

face today is to adapt the in-vehicle network to the<br />

increasing number of ECUs and the traffic they generate,<br />

in particular from novel applications transmitting audio and<br />

video data. Car manufacturers therefore aim for a homogeneous<br />

in-vehicle network rather than installing<br />

multiple fieldbus systems such as CAN, LIN, MOST, and FlexRay,<br />

as in today’s cars. This also fosters new services<br />

and applications due to the ubiquitous availability of data<br />

compared to a separation of sensors and actuators across the<br />

fieldbus systems mentioned above. One promising candidate<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />


for such a homogeneous in-car infrastructure is Ethernet/IP<br />

as it comes with a well-proven and established set of interfaces<br />

and protocols. Additionally, not only the number of<br />

the vehicle’s internal ECUs increases, but also the number<br />

of externally connected devices. For instance, a driver as<br />

well as the passengers want to interact with the car via their<br />

personal devices, e.g., laptops, smartphones, PDAs and so<br />

on. This creates the need for a well-organized and<br />

flexible human machine interface (HMI). The HMI<br />

should be as universally applicable as possible: because the<br />

lifecycle of a car is considerably longer than that of consumer electronic<br />

(CE) devices, it is not foreseeable which personal devices<br />

will be brought into the car in the future. IP-based IT<br />

infrastructures, for example in business organizations or<br />

on the Internet itself, offer many different services,<br />

and these services need to be organized.<br />

This can be achieved by a service-oriented<br />

architecture (SOA), which provides a middleware with standardized<br />

interfaces. We use such an architecture to connect<br />

ECUs and CE devices seamlessly as well as generate<br />

an HMI that can be distributed and composed by the devices<br />

connected as described in [3] and [4]. A user can therefore<br />

introduce personal devices and interact with them via the<br />

in-vehicle HMI. This approach also enables novel CE devices<br />

like for example future entertainment systems or even<br />

vehicle-relevant features like a more precise GPS receiver to<br />

be integrated into the IT infrastructure of a car after it has<br />

been shipped. Increasing interaction of the driver and<br />

passengers with the car also creates demand for cooperative<br />

usage, especially of (but not limited to) infotainment systems<br />

such as video and audio streams. For example, a driver wants<br />

to see the video of the vehicle’s rear view camera while two<br />

passengers in the back are watching a movie on two screens.<br />

As soon as the driver has finished checking the rear view camera<br />

image, the passenger in the front also wants to watch the<br />

movie, either from the current position or from the start. This paper<br />

therefore presents a session management approach for a<br />

SOA-based in-vehicle network and is structured as follows:<br />

First, an overview of related work in this area is given. Afterwards,<br />

our system is presented and the proposed session<br />

management scheme is detailed. Finally, a summary and<br />

an outlook are given.<br />


Several approaches aim at a more flexible HMI<br />

architecture, for example Continental’s Android-based AutoLinQ<br />

platform [1], the Neutrino RTOS by QNX [7], or<br />

Meego [9]. These platforms are intended to act as a universal<br />

architecture, in contrast to manufacturer-specific<br />

approaches. The Extensible Messaging and Presence Protocol<br />

(XMPP) [8] is a widely used standard for text-based<br />

chat. On top of that, the Jingle extension [5] is used to<br />

establish sessions for audio and video calls, mainly in peer-to-peer<br />

networks.<br />


We consider an all-IP in-vehicle network with a SOA infrastructure<br />

on top. In order to also support embedded devices<br />

which do not feature rich processing resources, we use the<br />

Devices Profile for Web Services (DPWS) [6], which is a Web<br />

Service based middleware. It is designed to also operate on<br />

resource-limited ECUs such as those installed in vehicles. Several services<br />

have been designed which cover automotive-specific<br />

use-cases. In the following, we consider a video streaming service<br />

that can be invoked by multiple clients. The scenarios are depicted<br />

in Figure 1. The first box (simple) shows the most elementary<br />

scenario, in which one client requests a video stream<br />

from one service provider, i.e., the video streaming service.<br />

Multi-party interaction takes place in the “separated” scenario,<br />

where two clients invoke the streaming service independently<br />

of each other. Both clients can also receive the<br />

same video content with the same playout time, i.e., they<br />

participate in a common session (shared). The last two scenarios<br />

can also be combined into a mixed scenario in which two<br />

clients watch the same video content and a third one requests<br />

a separate video stream or the same one with a shifted playout.<br />

Figure 1. Overview of the considered media streaming scenarios.<br />
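The four scenarios can be told apart from the set of active streams alone. The following minimal sketch assumes that each client is described by the content it receives and a playout offset in seconds; the function name `classify` and this representation are illustrative assumptions, not part of the presented system.

```python
from collections import Counter


def classify(assignments):
    """Name the scenario for a mapping client -> (content, playout offset in s)."""
    if len(assignments) == 1:
        return "simple"
    # Clients with identical content and playout time share one session.
    group_sizes = Counter(assignments.values()).values()
    shared = any(size > 1 for size in group_sizes)
    independent = len(group_sizes) > 1
    if shared and independent:
        return "mixed"
    return "shared" if shared else "separated"
```

For example, two rear-seat clients on `("movie", 0)` together with a third client on `("movie", 600)` would be classified as mixed, since the third client receives the same content with a shifted playout.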


When using a SOA as an organizational instance for multimedia<br />

systems, the software development process is eased in<br />

many ways. Nevertheless, some points have to be taken care<br />

of in order to provide an intuitive experience to the user.<br />

In a common unconstrained SOA scenario with many service<br />

providers and consumers, the service consumer often chooses<br />

a provider based only on technical or measurable aspects,<br />

e.g., hardware resources or latency. However,<br />

a human service consumer wishes to select one specific<br />

function of one specific service provider, regardless of technical<br />

aspects. Therefore, a management layer has to be established<br />

that covers, for example:<br />


• Overview and selection of compatible, available services.<br />

• Independent and un-interruptible use of a service.<br />

• Possibility to share the current service with others.<br />

• A clear distinction of users, their devices and the way they<br />

are using them.<br />

In order to enable these features in a multimedia scenario<br />

based on DPWS, a session management, realized as a dedicated<br />

service, has been developed. With the introduction of<br />

a session, users can be grouped and served independently,<br />

hence supporting their desired way of use.<br />

Establishing a new session<br />

Figure 2. Establishment of a new session.<br />

Establishing a new session is fundamental for operating<br />

independently of others while still being<br />

able to share a session. The message exchange pattern<br />

of a session establishment is depicted in Figure 2. Here, a<br />

user invokes a video streaming service by telling the client<br />

application to start a session and assigning a session name<br />

(steps 1 and 2). The name of the session (Session-ID), which<br />

can be selected freely, is used to distinguish running<br />

sessions. The Session-ID is then sent to the session service<br />

(step 3), i.e., the video streaming device, to actually trigger<br />

the request. In order to know who is requesting a new session,<br />

this message also contains additional information like<br />

IP-address and Port. This is essential to distinguish participants<br />

and handle further service calls properly. When this<br />

message is received by the service provider, it checks if the<br />

desired Session-ID is available (step 4). With this verification,<br />

a unique assignment of Session-IDs is ensured.<br />

Furthermore, a User-ID is generated which is matched to the<br />

IP address and Port of the requesting client (step 5). Both<br />

IDs, the User-ID as well as the Session-ID are then stored<br />

at the service provider side. In fact, the User-ID is also assigned<br />

to the Session-ID to have a connection between sessions<br />

and users. The result is a list containing all running<br />

sessions and their participants. With the generated User-ID<br />

it is possible to retrieve information about a certain user. In<br />

future service calls of a known user, only the User-ID has to<br />

be included in order to identify a user and serve the appropriate<br />

session.<br />

The session client itself also needs to know the User-ID that<br />

it has been assigned. For this reason, a message is sent<br />

(step 7) containing the User-ID and an error code. The error<br />

code contains the result of the verification of the<br />

Session-ID. The client knows all possible results and can<br />

decide whether the session was established successfully.<br />

Finally, the session client saves its User-ID and Session-ID<br />

(step 8). With this message exchange, a session has been established.<br />
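The exchange described above can be sketched as follows. This is a simplified in-process model under stated assumptions, not the DPWS-based implementation: the service is called directly instead of via Web Service messages, and the class names, method names, and numeric error codes are hypothetical.

```python
import itertools

OK, SESSION_ID_IN_USE = 0, 1  # hypothetical error codes


class SessionService:
    def __init__(self):
        self.sessions = {}  # Session-ID -> set of User-IDs
        self.users = {}     # User-ID -> (IP address, port)
        self._next_user_id = itertools.count(1)

    def establish(self, session_id, ip, port):
        # Step 4: the requested Session-ID must not be in use yet.
        if session_id in self.sessions:
            return SESSION_ID_IN_USE, None
        # Step 5: generate a User-ID and match it to the client address.
        user_id = next(self._next_user_id)
        self.users[user_id] = (ip, port)
        # Store both IDs and assign the user to the session.
        self.sessions[session_id] = {user_id}
        # Step 7: the reply carries the error code and the User-ID.
        return OK, user_id


class SessionClient:
    def __init__(self, service, ip, port):
        self.service, self.ip, self.port = service, ip, port
        self.user_id = self.session_id = None

    def start_session(self, session_id):
        # Step 3: send Session-ID together with IP address and port.
        code, user_id = self.service.establish(session_id, self.ip, self.port)
        if code == OK:
            # Step 8: remember both IDs for later service calls.
            self.user_id, self.session_id = user_id, session_id
        return code
```

A second client that requests an already used Session-ID receives the (assumed) `SESSION_ID_IN_USE` code and stores nothing, which corresponds to the uniqueness check in step 4.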

Joining an existing session<br />

Figure 3. Requesting a session list.<br />

Another mandatory feature is participation<br />

in an existing session. All information about sessions<br />

is stored on the service side. A user on the client<br />

side, however, has no knowledge of currently running sessions<br />

and assigned Session-IDs. A listing feature is therefore<br />

needed to give the user an overview of all available<br />

sessions.<br />

Initially, the user enters a command to request a list of all<br />

ongoing sessions (Figure 3, step 1). The session client then<br />

sends a message to the session service (step 2). The session<br />

service queries an overview of all existing sessions (step 3)<br />

from its local database, and sends it back to the requesting<br />

client (step 4). Finally the client is able to display all currently<br />

running sessions to the user (step 5).<br />

This listing feature is not only a prerequisite for joining a session<br />

in the next step; it can also be used to get a general overview of<br />

all running sessions. Once the list of available sessions is<br />

shown, the user can choose one of them in order<br />

to participate.<br />

First, he selects the appropriate command (Figure 4, step<br />

1) and enters the Session-ID of the desired session (step 2).<br />

With this given information a message is sent from the client<br />

to the service (step 3). When the message is received, the included<br />

Session-ID is checked by the service provider and the<br />

existence of the desired session is verified (step 4). The result<br />

of this verification may lead to the following situations.<br />

• Unknown Session-ID:<br />

The Session-ID cannot be found among the currently<br />

running sessions (step 4). Thus, the session the<br />

user wants to join does not exist and he cannot<br />

participate. This is signaled to the user with a message<br />

(step 4.1). An error code is included and can be interpreted<br />

and displayed at the session client (step 4.2). Now,<br />

the user could restart the process with another Session-ID.<br />

• Known Session-ID, no streams present:<br />

If the Session-ID is known, the appropriate session exists<br />

and the user is able to join. At this point, we assume that<br />

no video streaming is running in the desired session (step<br />


Figure 4. Participate in a session.<br />

5). Further, a User-ID is created (step 5.1) and added to<br />

the session (step 5.2). From this point on the user participates<br />

in the session. An error code, sent by a dedicated<br />

message (step 5.3), indicates the successful participation<br />

to the session client. The included User-ID will be<br />

extracted and saved together with the Session-ID by the<br />

client (step 5.4). From now on, the user can trigger the<br />

streaming within the session.<br />

• Known Session-ID, streaming running:<br />

Of course, it is possible that a video streaming session is<br />

already running, initiated by another user. Hence, the new<br />

client has to be notified about the ongoing video stream.<br />

Therefore, metadata about the stream is gathered (step 6),<br />

a User-ID is generated (step 6.1) and added to the session<br />

(step 6.2). Now, the streaming service, which transmits<br />

the stream to all participating members of the session, is<br />

informed and updated (step 6.3) about the new member.<br />

The new client receives a message (step 6.4) with an error<br />

code, which signals a running stream, his User-ID and<br />

metadata. The User- and Session-ID are then saved by the<br />

client (step 6.5). Next, the streaming client, which takes<br />

care of receiving and displaying the video, is started. The<br />

metadata of the received message act as a description for<br />

the expected stream.<br />
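The listing request and the three possible outcomes of a join request can be summarized in a short sketch. As before, this is an in-process model under assumptions: the result codes, the class name, and the metadata fields are hypothetical, and informing the streaming service about the new member (step 6.3) is only indicated by a comment.

```python
import itertools

JOINED_IDLE, JOINED_STREAMING, UNKNOWN_SESSION = 0, 1, 2  # hypothetical codes


class SessionService:
    def __init__(self):
        self.members = {}  # Session-ID -> list of User-IDs
        self.streams = {}  # Session-ID -> metadata of a running stream
        self._next_user_id = itertools.count(1)

    def list_sessions(self):
        # Overview of all running sessions, as requested in Figure 3.
        return sorted(self.members)

    def join(self, session_id):
        # Step 4: verify that the desired session exists.
        if session_id not in self.members:
            return UNKNOWN_SESSION, None, None
        # Create a User-ID and add it to the session.
        user_id = next(self._next_user_id)
        self.members[session_id].append(user_id)
        metadata = self.streams.get(session_id)
        if metadata is None:
            # Known Session-ID, no stream present.
            return JOINED_IDLE, user_id, None
        # Known Session-ID, stream running: a real service would now
        # also inform the streaming service about the new receiver.
        return JOINED_STREAMING, user_id, metadata
```

Returning the stream metadata only in the third case reflects that a late joiner needs a description of the stream it is about to receive, whereas a client joining an idle session does not.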

Leaving a session and handover<br />

The last essential functionality is leaving a session. This can<br />

be necessary if a user wants to stop using a device or<br />

wants to join another session. Figure 5 shows the message<br />

flow after a successful initialization (see Figure 2). Afterwards,<br />

the participating clients subscribe to a notification

Figure 5. Leaving and handover of a session.<br />

channel (step 2) with a message (step 3) which is processed<br />

by the service (step 4).<br />

From now on, all required information is sent via the notification<br />

channel to all participating clients. In step 5, for<br />

instance, one client sends a play command (step 6) to the<br />

service provider. The contained User-ID is then verified as<br />

described in the section Establishing a new session (step 5)<br />

and the corresponding service is started. This is then broadcast<br />

to the subscribed services via a notification message<br />

(step 9). In the depicted scenario, this contains the Session-<br />

ID as well as metadata to tell the clients which video properties<br />

they have to expect (codec, resolution, framerate and<br />

so on). The clients, on the other hand, check the Session-ID<br />

and prepare themselves to use the service (steps 10-12). The<br />

video streamed by the service provider can then be received<br />

and displayed.<br />
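The notification mechanism described above can be sketched with plain callbacks. The sketch assumes that the notification channel can be modelled as a list of in-process callbacks; class names, method names, and the metadata fields used in the usage example are illustrative assumptions rather than the actual DPWS eventing interface.

```python
class StreamingService:
    """Notification channel modelled as in-process callbacks (an assumption)."""

    def __init__(self):
        self.user_sessions = {}  # User-ID -> Session-ID
        self.subscribers = {}    # User-ID -> notification callback

    def subscribe(self, user_id, session_id, callback):
        # Steps 2-4: the client subscribes and the service registers it.
        self.user_sessions[user_id] = session_id
        self.subscribers[user_id] = callback

    def play(self, user_id, metadata):
        # The User-ID in the play command identifies the session to serve.
        session_id = self.user_sessions.get(user_id)
        if session_id is None:
            return False
        # Step 9: notify all subscribers; each one filters by Session-ID.
        notification = {"session_id": session_id, "metadata": metadata}
        for callback in self.subscribers.values():
            callback(notification)
        return True


class StreamingClient:
    def __init__(self, session_id):
        self.session_id = session_id
        self.expected_stream = None

    def on_notification(self, notification):
        # Steps 10-12: check the Session-ID and prepare for the stream.
        if notification["session_id"] == self.session_id:
            self.expected_stream = notification["metadata"]
```

Filtering on the client side keeps the service simple: it sends one notification to every subscriber, and only clients whose stored Session-ID matches prepare themselves for the announced stream.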

If a client wants to leave a session (step 13), he can notify the<br />

service provider via a dedicated message (step 14). The service<br />

provider then checks the user’s ID and deletes it from the<br />

receiver and notification list (step 15). This user is then no<br />

longer part of the session. However, as shown in Figure 5,<br />

the video stream continues to be sent unaffected to the remaining client<br />

in the session. A handover of the session has thus taken<br />

place. The remaining client can control all properties of the<br />

session or leave it as well. In that case, the number<br />

of participants in the session reaches zero. Hence, all users are<br />

removed and the Session-ID is no longer in use and can be<br />

assigned to new sessions.<br />
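Leaving a session and the implicit handover can be sketched in a few lines. The sketch assumes a single membership set per session that serves as both receiver list and notification list; the class and method names are hypothetical.

```python
class SessionService:
    def __init__(self):
        # Session-ID -> set of User-IDs receiving stream and notifications.
        self.sessions = {}

    def leave(self, session_id, user_id):
        # Step 15: check the User-ID and delete it from the session.
        members = self.sessions.get(session_id)
        if members is None or user_id not in members:
            return False
        members.remove(user_id)
        # Remaining members keep the session (handover). Once the last
        # member has left, the Session-ID is released for reuse.
        if not members:
            del self.sessions[session_id]
        return True
```

No explicit handover message is needed in this model: the session simply continues to exist as long as at least one participant remains.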


In this paper, we presented a multimedia session management<br />

extension for our web protocol based HMI architecture,<br />

which has been introduced in our previous work. The<br />

session management has been realized as a dedicated service<br />

while not modifying the underlying DPWS stack. With<br />


this extension, several video streaming scenarios are covered<br />

which then provide more convenience and flexibility to the<br />

driver and the passengers of a car. Users can invoke a service<br />

separately, e.g., a video streaming service can deliver multiple<br />

streams with a different playout time each. On the other<br />

hand, users can share one stream to watch the same video sequence<br />

on multiple screens, i.e., the playout time is the same.<br />

If there is an existing session available with a running stream<br />

and a new user wants to participate, the meta information is<br />

also indicated to the new user and his stream has the same<br />

playout time despite his late participation. Furthermore, it<br />

is possible that the initiator of a session leaves and another<br />

participating user takes over. This session handover offers<br />

high flexibility regarding connected devices, for instance, a<br />

movie that has been viewed during the trip can be continued<br />

on a mobile device afterwards.<br />


This work has been supported, in part, by the BMBF funded<br />

research project SEIS (Security in Embedded IP-based Systems)<br />

[2].<br />


1. Continental <strong>Automotive</strong> GmbH. AutoLinQ.<br />

http://www.conti-online.com/generator/www/de/en/<br />

continental/automotive/themes/passenger cars/interior/<br />

connectivity/autolinq/pi autolinq en.html, last accessed<br />

Nov. 2010.<br />

2. EENOVA. SEIS (Security in Embedded IP-based<br />

Systems). http://www.eenova.de/projekte/seis, last<br />

accessed Feb. 2010.<br />

3. M. Eichhorn, M. Pfannenstein, D. Muhra, and<br />

E. Steinbach. A SOA-based middleware concept for<br />

in-vehicle service discovery and device integration. In<br />

Intelligent Vehicles Symposium (IV), 2010 IEEE, pages<br />

663–669. IEEE, 2010.<br />

4. M. Eichhorn, M. Pfannenstein, and E. Steinbach. A<br />

flexible in-vehicle HMI architecture based on web<br />

technologies. In International Workshop on Multimodal<br />

Interfaces for <strong>Automotive</strong> Applications (<strong>MIAA</strong>2010),<br />

Hong Kong, China, Feb. 2010.<br />

5. S. Ludwig, J. Beda, P. Saint-Andre, R. McQueen,<br />

S. Egan, and J. Hildebrand. XEP-0166: Jingle. XMPP<br />

Enhancement Proposal, Jabber Software Foundation,<br />

2005.<br />

6. OASIS. Devices profile for web services version 1.1.<br />

http://docs.oasis-open.org/ws-dd/dpws/wsdd-dpws-<br />

1.1-spec.html, last accessed Nov. 2010.<br />

7. QNX Software Systems. QNX Neutrino RTOS.<br />

http://www.qnx.com/products/neutrino rtos/, last<br />

accessed Nov. 2010.<br />

8. P. Saint-Andre et al. Extensible messaging and presence<br />

protocol (XMPP): Core. 2004.<br />

9. The Linux Foundation. Meego. http://meego.com/, last<br />

accessed Nov. 2010.

“Hands Busy, Eyes Busy”: Generating Stories from<br />

Sensor Data for <strong>Automotive</strong> applications<br />

Joe Reddington, Ehud<br />

Reiter, Nava Tintarev<br />

Department of Computing<br />

Science<br />

University of Aberdeen<br />

j.reddington, e.reiter,<br />

n.tintarev@abdn.ac.uk<br />


This paper examines the potential of using natural language<br />

generation to support “hands busy, eyes busy” automotive<br />

applications. It outlines a hierarchy of complexity of output<br />

text, and the type of sensor data that may be collected. It<br />

also suggests a number of ways natural language generation<br />

can generate narrative events from sensor data for drivers.<br />

Author Keywords<br />

NLG, AAC, event generation, narrative, story, sensors, automotive<br />

applications<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />


This work examines the potential of using automatically harvested<br />

information to generate new phrases automatically,<br />

creating support for “hands busy, eyes busy” automotive applications.<br />

Of particular interest is a review of how technologies<br />

and techniques developed in an assistive technology application<br />

(the recent “How was School Today...?” project)<br />

can be applied to the automotive domain.<br />

Mobile usage while driving has been identified as a risk factor<br />

in road accidents [2, 5]. Reducing both the motivation<br />

to use such devices while driving and the length of time for<br />

which they are used would potentially reduce the number of<br />

road accidents. The position of the authors is that the use<br />

of automatic narration techniques can support communication<br />

in scenarios such as making regular deliveries or public<br />

transportation. Methodologies to enable this type of automatic<br />

text generation are under-researched and NLG can aid<br />

in this task by creating a story that is structured, relevant and<br />

flexible to the current situation, based on sensor data.<br />


Rolf Black, Annalu Waller<br />

School of Computing<br />

University of Dundee<br />

rolfblack,awaller@<br />

computing.dundee.ac.uk<br />

It is easy to envisage a system by which buses or delivery<br />

vans automatically send an update of location to a home<br />

server, and indeed many services offer near real-time tracking<br />

of packages from source to destination. In contrast, this<br />

work focuses on combining such messages, augmented with<br />

information from weather reports, traffic reports and other<br />

data, to form a larger message with an overall narrative.<br />

In this paper we situate the work with regard to existing<br />

work, then introduce the “How was School Today...?” project<br />

that informed this work. We go on to identify potential application<br />

areas in the automotive domain, and discuss the<br />

possible effects, risks, and advantages.<br />


Our existing work sits on the boundary between Natural Language<br />

Generation (NLG), which is a subcategory of natural<br />

language processing that examines the creation of text from<br />

nonlinguistic data such as sensor readings, and Alternative<br />

and Augmentative Communication (AAC), an area examining<br />

communication for those with restrictions on speech.<br />

NLG techniques can dynamically combine and change some<br />

output depending on the changing internal state of a system<br />

[11]. A popular application area for NLG has been<br />

weather forecasting (generating textual weather forecasts from<br />

the results of a numerical atmosphere simulation model),<br />

and several weather forecast generators have been fielded<br />

and used operationally [17, 16]. A number of data-to-text<br />

systems have also been developed in the medical community,<br />

such as BabyTalk [15], which generates summaries of<br />

clinical data from a neonatal intensive care unit, and the<br />

commercial Narrative Engine [14] which summarises data<br />

acquired during a doctor/patient encounter.<br />

In this paper, we seek to focus the technology away from<br />

AAC and on the automotive domain, where natural language<br />

processing systems have been used with some success. For<br />

example, RoadSafe is an NLG system that has been operationally<br />

deployed at Aerospace and Marine International<br />

(AMI) to produce weather forecast texts for winter road maintenance.<br />

It generates forecast texts describing various weather<br />

conditions on a road network [10]. Other systems have focused<br />

more on processing language to visualise and animate<br />

3D scenes from car accident reports [3].

Figure 1. Types of input that can be collected by a mobile device: voice recording, RFID, voice, emotional embellishments<br />

<strong>Automotive</strong> research in general is well developed; of particular<br />

relevance to this work is the issue of privacy in vehicle-to-vehicle<br />

or vehicle-to-base communication; see e.g. [8, 9].<br />

The “How was School Today...?” project<br />

Our work is informed by the “How was School Today...?”<br />

(HWST) project [1, 6] which logged sensor data for students<br />

at a special needs school. This data included object and person<br />

interactions, voice recordings, and location information<br />

(at the room level). It also recorded positive and negative<br />

evaluations (e.g. “It was not a good day.”) input by the children.<br />

This framework has been tested as a proof-of-concept<br />

in the context of generating stories for children at the school.<br />

The students (who had no, or very limited, speech) could<br />

then relay these stories to parents or other conversation partners.<br />

For this particular domain, the types of data recorded for<br />

each user are:<br />

• Location data - each time the user entered a new room,<br />

this information was recorded. (Pre-processing removed<br />

rooms entered for less than three minutes).<br />

• Object interaction - each time the user interacted with an<br />

object that had an RFID tag, that interaction was recorded.<br />

• Person interaction - each time the user interacted with a<br />

person that had an RFID tag, that interaction was recorded.<br />

• Voice messages - staff and teachers were encouraged to<br />

record voice messages, as if the user was speaking in the<br />

first person, that described the user’s recent activities.<br />
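The pre-processing of location data mentioned in the first item can be sketched as follows. This is a minimal illustration, assuming that visits are given as chronological `(time, room)` pairs and that the end of the recording is known; the function name and data layout are assumptions, not the HWST implementation.

```python
from datetime import datetime, timedelta

MIN_STAY = timedelta(minutes=3)


def filter_short_visits(visits, end_of_day):
    """Drop rooms that were entered for less than three minutes.

    visits is a chronological list of (HH:MM, room) pairs; a visit ends
    when the next room is entered or, for the last one, at end_of_day.
    """
    times = [datetime.strptime(time, "%H:%M") for time, _ in visits]
    times.append(datetime.strptime(end_of_day, "%H:%M"))
    return [
        visit
        for visit, start, end in zip(visits, times, times[1:])
        if end - start >= MIN_STAY
    ]
```

With this filter, a corridor that is only passed through on the way to another room does not appear in the generated story.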

An example set of data would be:<br />

11:36, Location, Tutorial Room<br />

11:36, Object, Money<br />

11:39, Object, Monkey Game<br />

Which is converted into English text to give the story:<br />

I played with Money and Monkey Game. This happened<br />

at a Tutorial Room.<br />
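The conversion from the example log to the example story can be reproduced with a small template-based sketch. It only covers the two record types shown above and uses fixed sentence templates; the actual HWST system applies further reasoning, so the function names and templates here are illustrative assumptions.

```python
LOG = """\
11:36, Location, Tutorial Room
11:36, Object, Money
11:39, Object, Monkey Game"""


def parse(log):
    """Split each log line into a (time, kind, value) tuple."""
    return [tuple(part.strip() for part in line.split(",")) for line in log.splitlines()]


def tell_story(records):
    """Fill two fixed sentence templates from object and location records."""
    objects = [value for _, kind, value in records if kind == "Object"]
    locations = [value for _, kind, value in records if kind == "Location"]
    sentences = []
    if objects:
        sentences.append("I played with " + " and ".join(objects) + ".")
    if locations:
        sentences.append("This happened at a " + locations[-1] + ".")
    return " ".join(sentences)


story = tell_story(parse(LOG))
print(story)
```

For the three records above, the sketch yields the same two sentences as the story quoted in the text. Longer object lists would need proper aggregation (commas and a final "and"), which is omitted here.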


Much of the input sensor data and many of the techniques used in HWST<br />

can be applied to the automotive domain. Figure 1 outlines<br />

the type of input that could be used in such a system and collected<br />

with a mobile phone, e.g. voice recordings, location,<br />

interactions with people and objects (RFID).<br />

The HWST project is in the process of introducing the Nokia<br />

6212 (see footnote 1) as a collection device, which may need to be supplemented<br />

with an additional system for recording location information<br />

at the room level.<br />

Depending on the granularity of location data required, other<br />

hardware may supplement a mobile phone. GPS tracking<br />

may be more suitable for larger distances, while Bluetooth or<br />

other methods may be preferable for room-level identification.<br />

Additional sensor data may be available in a vehicle,<br />

such as changes in light or temperature [4], or speed and<br />

fuel usage.<br />


This section categorises the potential outputs of automatically<br />

generated content into a three-tier hierarchy of network-based<br />

input, sensor-based input, and the creation of narratives<br />

from sensor input. This hierarchy can be broadly arranged<br />

in terms of invasiveness of the data collection. This<br />

and other privacy concerns are key to any implementation.<br />

Network-based input<br />

Network-based input is defined as new utterances that can<br />

be determined by access to information over the Internet, or<br />

some other large information portal. An example is talking<br />

about the weather - phrases such as “It’s very warm today”,<br />

and “The snow is starting to stick!”, but this can include<br />

“There was an accident on the M14”, or “Traffic is slow<br />

around Old Trafford due to the match”.<br />

1 http://europe.nokia.com/find-products/<br />

devices/nokia-6212-classic, retrieved November<br />


Sensor-based input<br />

Sensor-based input is defined as the use of single facts about<br />

the user provided by sensor data. Examples might include “I<br />

went to Leeds” - provided by GPS data, or “I just handled<br />

package 41” - provided by use of a barcode scanner in combination<br />

with an online lookup of the IDs for the packages.<br />

Although this sort of data collection raises concerns about<br />

both privacy and the workload required to maintain it,<br />

messages can be better adapted: “I got a text message<br />

from Jamie this morning, he said ‘looking forward to tomorrow’<br />

”. Voice messages are included in this category and<br />

can include information that would never be picked up by a<br />

sensor - “I helped jump-start a car and was 15 minutes late.”<br />

Creation of narratives from sensor data<br />

This category contains those groups of messages, based on<br />

sensor data, that together relate an experience or tell a story,<br />

thus adding the problems of creating a narrative structure or<br />

consistent style to what has previously been a data-mining<br />

exercise. The importance of narrative in exchanging information<br />

is well researched; for an NLG example, see [12].<br />

In HWST, stories were generated using additional reasoning,<br />

such as giving more importance to events that occurred in<br />

locations which were unexpected compared to a timetable.<br />

These stories were also augmented by users with positive<br />

and negative annotations of utterances, such as “She was nice.” (for<br />

people) or “It was not a good day.” (for the whole story) [1].<br />
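
The timetable-based reasoning described above can be illustrated with a minimal sketch. It assumes a timetable of time slots with expected locations and gives a higher weight to any event whose location differs from the expected one. The class names, the weight of 2.0, and the sample data are hypothetical and are not taken from the HWST system.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class TimetableSlot:
    start: time
    end: time
    location: str


@dataclass
class LocationEvent:
    at: time
    location: str


def expected_location(timetable, at):
    """Return the location the timetable predicts for time `at`, or None."""
    for slot in timetable:
        if slot.start <= at < slot.end:
            return slot.location
    return None


def importance(event, timetable, unexpected_weight=2.0):
    """Score an event; off-timetable locations get a higher (assumed) weight."""
    expected = expected_location(timetable, event.at)
    if expected is not None and expected != event.location:
        return unexpected_weight
    return 1.0


# Hypothetical sample data for illustration only.
timetable = [
    TimetableSlot(time(9, 0), time(10, 0), "Classroom"),
    TimetableSlot(time(10, 0), time(11, 0), "Hall"),
]
events = [
    LocationEvent(time(9, 15), "Classroom"),
    LocationEvent(time(10, 30), "Library"),
]

# Events in unexpected locations are ranked first.
ranked = sorted(events, key=lambda e: importance(e, timetable), reverse=True)
```

A real system would likely combine several such signals; the single weight here only shows where the timetable comparison enters the ranking.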

The creation of multi-fact, multi-sentence messages with a<br />

structured narrative is a step forward in NLG terms, requiring<br />

more sophisticated techniques than previous levels in<br />

the hierarchy. In particular, this moves the focus of NLG<br />

research to the tasks of document planning and document<br />

structuring, compared to text generation on the sentence level.<br />

The analysis of sensor-based data, which defines one of these multi-fact<br />

and multi-sentence messages as an ‘event’, is discussed<br />

in [6]. While the NLG techniques outlined in [11] can combine<br />

facts into plain English, a further challenge lies in defining<br />

boundaries between groups of sensor data to define separate<br />

events. The goal is to arrange the sensor-based input<br />

into a narrative structure that accurately relates events.<br />

Based on a modified version of the data recording in the<br />

HWST project, one could assume input data such as that<br />

highlighted in Figure 2. The generated text could then be:<br />

“This morning, after picking up two packages, I helped<br />

jump-start a car and was delayed by 15 minutes. Later, I<br />

arrived at the Leeds depot and delivered the packages to Mr.<br />

Roberts. The delivery went fine”.<br />
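
As a rough illustration of the event-boundary problem, the sketch below parses records in the style of Figure 2 and starts a new event whenever the time gap to the previous record exceeds a threshold. The five-minute threshold, the function names, and the flat sentence template are assumptions made for this example; they are not the method of [6] or [11], and the embellishment record is left out.

```python
from datetime import datetime, timedelta

# Records in the style of Figure 2 (embellishment line omitted).
RAW_LOG = """\
06:27:00, Object, Package1
06:27:07, Object, Package2
07:34:00, Voice Recording, I helped jump-start a car and was delayed by 15 minutes.
09:40:00, Location, Leeds depot
09:40:00, Object, Package1
09:40:05, Object, Package2
09:40:00, Person, Mr. Roberts
"""


def parse_log(text):
    """Parse 'HH:MM:SS, Kind, Value' lines into (datetime, kind, value) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, kind, value = (part.strip() for part in line.split(",", 2))
        records.append((datetime.strptime(stamp, "%H:%M:%S"), kind, value))
    # Sort by timestamp only; the sort is stable for equal timestamps.
    return sorted(records, key=lambda r: r[0])


def group_into_events(records, max_gap=timedelta(minutes=5)):
    """Start a new event whenever the gap to the previous record exceeds max_gap."""
    events = []
    for record in records:
        if events and record[0] - events[-1][-1][0] <= max_gap:
            events[-1].append(record)
        else:
            events.append([record])
    return events


def describe(event):
    """Render one event as a flat sentence (no real document planning)."""
    start = event[0][0].strftime("%H:%M")
    facts = [f"{kind.lower()} '{value}'" for _, kind, value in event]
    return f"At {start}: " + ", ".join(facts) + "."


for event in group_into_events(parse_log(RAW_LOG)):
    print(describe(event))
```

The output is a list of facts per event rather than a narrative; turning such groups into text like the example above is the document planning and structuring task the section describes.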


The previous section discussed the types of text that can be<br />

generated. This section outlines several practical applications<br />

of the generated narrative text in automotive applications:<br />

staying in touch; communication with head office;<br />

and accident reports. Privacy is an important consideration<br />

in any application; the people on whose behalf the story is<br />

generated should always have the possibility to read and edit<br />

06:27:00, Object, Package1<br />

06:27:07, Object, Package2<br />

07:34:00, Voice Recording, I helped jump-start a car and was delayed by 15 minutes.<br />

09:40:00, Location, Leeds depot.<br />

09:40:00, Object, Package1<br />

09:40:05, Object, Package2<br />

09:40:00, Person, Mr. Roberts<br />

09:43:00, Embellishment, Positive . . .<br />

Figure 2. Possible input data<br />

any text before it is transmitted. Moreover, any generated<br />

text can be read aloud by text-to-speech software.<br />

This would also facilitate responses to messages originally<br />

sent to a driver, allowing the original sender (which may also<br />

be a driver) to hear the response without extra effort and reducing<br />

cognitive load.<br />

Staying in touch<br />

Many people keep in touch with mobile texts and an increasing<br />

number stay connected using social media such as Facebook<br />

and Twitter 2 . Professional drivers may feel that updating<br />

their status is important from a social as well as professional<br />

perspective. However, while driving, attention should<br />

be on the road, and hands and eyes will be occupied by driving.<br />

An application that uses NLG to automatically update<br />

friends on one’s activities may help drivers feel connected<br />

in their everyday lives. The necessity to automatically generate<br />

such short messages is highlighted in [4], whose authors suggest<br />

messages such as “35 centigrades? It is very hot in here!”.<br />

In particular, the work on structuring narrative produced by<br />

HWST technology allows a move from the functional single<br />

sentence update to a more expressive longer update.<br />

Work Reports<br />

The key application in this area is the generation of automatic<br />

work reports based on a driver’s sensor data. This sort<br />

of narrative can supply an employer with information about<br />

his drivers, such as the hours that they have worked and<br />

which deliveries or other tasks have been successfully executed.<br />

At the same time, the automatic generation of the text<br />

relieves the employee of the task of writing lengthy reports.<br />

Of particular use is text informing end-users of the current<br />

conditions - rather than a simple “Delayed, new ETA:15:27”<br />

message, one can imagine “When coming from a previous<br />

delivery at Hogsmeade, there was heavy traffic due to an<br />

accident in the town so the delivery has been diverted via<br />

Hogwarts and should be with you by 15:27”.<br />

Accident Reports<br />

Narrative stories generated from sensor data can also be<br />

used to support police and ambulance staff at the scene of<br />

an accident. The generated reports can offer a human-readable<br />

summary of the situation well ahead of arrival on the<br />

scene, allowing professionals to be ready once they arrive.<br />

This sort of report can help assess the degree of damage<br />

2 www.facebook.com, www.twitter.com, retrieved November 2010

incurred at an accident by considering road conditions and<br />

travel speed. This type of report could also help police (and<br />

insurance companies) assess potential accountability for a<br />

given accident. Infra-red sensors may help assess how many<br />

victims were involved in an accident as well, ensuring that<br />

all victims get pulled out of an affected vehicle.<br />


This paper described the types of text that can be automatically<br />

generated to support drivers, and highlighted three application<br />

areas: staying in touch, communication with head<br />

office, and accident reports. Although a future goal for this<br />

research is to integrate with a commercial product, privacy<br />

and security of such systems require careful consideration.<br />

While care has been taken to keep such concerns a key part<br />

of the research, the authors welcome any communication<br />

from parties with expertise in this area.<br />


The authors are particularly grateful to the school, staff, and<br />

children. This research was supported by the UK Engineering<br />

and Physical Sciences Research Council under grants<br />

EP/F067151/1, EP/F066880/1, EP/E011764/1,<br />

EP/H022376/1, and EP/H022570/1.<br />


1. R. Black, J. Reddington, E. Reiter, N. Tintarev, and<br />

A. Waller. Using NLG and sensors to support personal<br />

narrative for children with complex communication<br />

needs. In Proceedings of the NAACL HLT 2010<br />

Workshop on Speech and Language Processing for<br />

Assistive Technologies, pages 1–9, Los Angeles,<br />

California, June 2010. Association for Computational<br />

Linguistics.<br />

2. F. A. Drews, H. Yazdani, C. N. Godfrey, J. M. Cooper,<br />

and D. L. Strayer. Text messaging during simulated<br />

driving. Human Factors: The Journal of the Human<br />

Factors and Ergonomics Society, 51 (5):762–770, 2009.<br />

3. S. Dupuy, A. Egges, V. Legendre, and P. Nugues.<br />

Generating a 3D simulation of a car accident from a<br />

written description in natural language: the CarSim<br />

system. In Proceedings of the workshop on Temporal<br />

and spatial information processing - Volume 13, pages<br />

1:1–1:8, Morristown, NJ, USA, 2001. Association for<br />

Computational Linguistics.<br />

4. C. Endres and D. Braun. Pleopatra: A Semi-Automatic<br />

Status-Posting Prototype For Future In-Car Use. In<br />

Adjunct proceedings of the 2nd International<br />

Conference on <strong>Automotive</strong> User Interfaces and<br />

Interactive Vehicular Applications (<strong>Automotive</strong>UI<br />

2010), page 7, Pittsburgh, PA, USA, November 2010.<br />

5. S. P. McEvoy, M. R. Stevenson, and M. Woodward.<br />

Phone use and crashes while driving: a representative<br />

survey of drivers in two Australian states. Medical<br />

Journal of Australia, 185(11/12):630–634, 2006.<br />

6. J. Reddington and N. Tintarev. Automatically<br />

generating stories from sensor data. In Intelligent User<br />

Interfaces, 2011 (to appear).<br />

7. E. Reiter, R. Turner, N. Alm, R. Black, M. Dempster,<br />

and A. Waller. Using NLG to help language-impaired<br />

users tell stories and participate in social dialogues. In<br />

Proceedings of the 12th European Workshop on<br />

Natural Language Generation (ENLG-09), 2009.<br />

8. F. Schaub, F. Kargl, Z. Ma, and M. Weber. V-tokens for<br />

conditional pseudonymity in vanets. In IEEE Wireless<br />

Communications & Networking Conference (IEEE<br />

WCNC 2010), Sydney, Australia, April 2010. IEEE.<br />

9. F. Schaub, Z. Ma, and F. Kargl. Privacy requirements in<br />

vehicular communication systems. Computational<br />

Science and Engineering, IEEE International<br />

Conference on, 3:139–145, 2009.<br />

10. R. Turner, Y. Sripada, and E. Reiter. Generating<br />

approximate geographic descriptions. In Proceedings of<br />

the 12th European Workshop on Natural Language<br />

Generation, ENLG ’09, pages 42–49, Morristown, NJ,<br />

USA, 2009. Association for Computational Linguistics.<br />

11. E. Reiter and R. Dale. Building natural language<br />

generation systems, Cambridge University Press, 2000.<br />

12. E. Reiter, A. Gatt, F. Portet, and M. van der Meulen.<br />

The importance of narrative and other lessons from an<br />

evaluation of an NLG system that summarises clinical<br />

data. INLG ’08, pp. 147–156, Morristown, NJ, USA,<br />

2008. Association for Computational Linguistics.<br />

13. S. Ashraf, A. Judson, I. W. Ricketts, A. Waller, N. Alm,<br />

B. Gordon, F. MacAulay, J. K. Brodie, M. Etchels,<br />

A. Warden, and A. J. Shearer. Capturing phrases for<br />

ICU-Talk, a communication aid for intubated intensive<br />

care patients. In ACM Conference on Assistive<br />

technologies, pp. 213–217, New York, NY, USA, 2002.<br />

14. M. D. Harris. Building a large-scale commercial NLG<br />

system for an EMR. In INLG ’08: Proceedings of the<br />

Fifth International Natural Language Generation<br />

Conference, pages 157–160, Morristown, NJ, USA,<br />

2008. Association for Computational Linguistics.<br />

15. A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood,<br />

W. Moncur, and S. Sripada. From data to text in the<br />

neonatal intensive care unit: Using NLG technology for<br />

decision support and information management. AI<br />

Commun., 22(3):153–186, 2009.<br />

16. E. Reiter, S. Sripada, J. Hunter, J. Yu, and I. Davy.<br />

Choosing words in computer-generated weather<br />

forecasts. Artif. Intell., 167(1-2):137–169, 2005.<br />

17. E. Goldberg, N. Driedger, and R. I. Kittredge. Using<br />

natural-language processing to produce weather<br />

forecasts. IEEE Expert: Intelligent Systems and Their<br />

Applications, 9(2):45–53, 1994.

A novel taxonomy for gestural interaction techniques:<br />

considerations for automotive environments<br />

Adriano Scoditti<br />

Laboratoire d’Informatique de Grenoble, Equipe IIHM<br />

385, rue de la Bibliotheque, BP 53, F-38041 Grenoble cedex 9, France<br />

adriano.scoditti@imag.fr<br />


A large variety of gestural interaction techniques is now<br />

available. In this article, we use a new taxonomic space [18]<br />

as a comparative structure to analyze the applicability of<br />

these techniques to automotive environments. The taxonomy<br />

plots a gestural interaction technique as a point in a<br />

space where the vertical axis denotes the semantic coverage<br />

of the technique, and the horizontal axis expresses the<br />

physical actions users are engaged in. In addition, syntactic<br />

modifiers are used to express the interpretation process of input<br />

tokens into semantics, as well as pragmatic modifiers to<br />

make explicit the level of indirection between users’ actions<br />

and system responses. In the taxonomy, the complexity of<br />

the gestural interaction lexicon, and the syntactic/pragmatic<br />

modifiers it is decorated with, are indicators of the cognitive<br />

load users experience during the interaction. The integration<br />

of modern mobile devices, complex user interfaces and<br />

gestural interaction techniques into automotive environments<br />

raises the need to analyze gestural interaction techniques<br />

from the point of view of their cognitive load.<br />

Author Keywords<br />

Handheld devices and mobile computing, Input and interaction<br />

technologies, Multi-modal interfaces, Recognition and<br />

interpretation of user input (face, body, speech etc.)<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />


Latest-generation mobile devices are enhanced with a diversity<br />

of sensors capable of probing real world physical properties<br />

in real time. The pioneering work on sensor-based interaction<br />

techniques [8, 11, 12, 15, 16] has paved the way for<br />

an active research area [1, 20, 21]. Although these results<br />

satisfy “the gold standard of science” [19], in practice, they<br />

are too “narrow truths” [4] to support designers’ decisions<br />

and researchers’ analysis. Designers and researchers need an<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />

Figure 1. The integration of latest-generation mobile devices into automotive<br />

environments raises the need to analyze gestural interaction techniques<br />

from the point of view of their cognitive load [?].<br />

overall systematic structure that helps them to reason, compare,<br />

elicit (and create!) the appropriate techniques for the<br />

problem at hand. Taxonomies, which provide such a structure,<br />

are good candidates for generalization in an emerging<br />

field. The challenge, however, is to provide a classification<br />

framework that is both complete and simple to use. Since<br />

completeness is illusory in a moving and prolific domain<br />

such as user interface design, we will not include it in our<br />

goals.<br />

In this article, we propose an interpretation of a new taxonomy<br />

for gestural interaction techniques [18], with considerations<br />

for automotive environments.<br />

To develop our taxonomy, we have built a controlled vocabulary<br />

(i.e. primitives) obtained through an extensive analysis<br />

of the taxonomies that have laid the foundations for<br />

Human-Computer Interaction (HCI) more than twenty-five<br />

years ago. For the most part, this early work in HCI has<br />

been ignored or forgotten by researchers driven by the trendy<br />

“technology push” approach.<br />

Our taxonomy is based on the following principles:<br />

(1) Interaction between a computer system and a human being<br />

is conveyed through input (output) expressions that are<br />

produced with input (output) devices, and that are compliant<br />

with an input (output) interaction language.<br />

(2) Like any language, an input (output) interaction language<br />

can be defined formally in terms of semantics, syntax, and<br />

lexical units.

Figure 2. The “sliding” gesture is semantically multiplexed to achieve<br />

different meanings, depending on context.<br />

(3) The generation of an input (output) expression involves<br />

using devices whose characteristics, from the human perspective,<br />

have a strong impact on the expressiveness and<br />

the effectiveness of the user interface [5].<br />

Building on Foley’s work [9] as well as on Buxton’s pragmatic<br />

considerations of input structures [5], our taxonomy<br />

brings together the four aspects of interaction ranging<br />

from semantics to pragmatics with the appropriate human-motivated<br />

extensions for addressing the specificity of gestural<br />

interaction based on accelerometers. In contrast to<br />

Mackinlay et al.’s semantic analysis of the design space for<br />

input devices [13], we do not consider the transformation<br />

functions that characterize the system-oriented perspective<br />

of interaction techniques.<br />

Our expectation is to provide new insights and to start<br />

promising directions for the design of novel and powerful<br />

gestural interaction techniques.<br />


As shown in Figure 2, the same gesture may convey very<br />

different meanings depending on the context in which it is<br />

produced: “go to previous photo” as in Apple’s photo<br />

album (or “go to next slide” as in Charade in [2]), “open a<br />

submenu” in Francone’s Wavelet Menu [10], or “unlock” the<br />

iPhone screen. In addition, a gesture that makes sense to the<br />

system may not be acceptable in a public social context [17],<br />

as it could carry meaning for, and be interpreted by, the public itself.<br />

These observations lead us to define a new taxonomy according<br />

to the following principles: (1) Coverage of semantic,<br />

syntactic, lexical, and pragmatic issues of interaction, where<br />

semantic granularity is that of Foley et al.’s interaction tasks;<br />

(2) Adoption of a user-centered perspective where physical<br />

human actions take precedence, leaving aside the internal<br />

computational transformations; (3) Consideration for context;<br />

(4) Coverage of both foreground and background interaction<br />

(as defined by Buxton [6]). Figure 3 shows the<br />

elements of the framework that we describe in detail next.<br />

Lexical Axis<br />

Because of our focus on users’ involvement in the interaction,<br />

the input lexicon corresponds to the physical actions<br />

users apply to devices. We divide human physical actions<br />

into two groups: (1) conscious actions that belong to the<br />

Figure 3. Our classification space for gestural interaction techniques<br />

based on accelerometers. The abscissa defines the lexicon in terms of<br />

the physical manipulations users perform with the device, with a clear<br />

separation between background and foreground interaction. The ordinate<br />

corresponds to Foley’s interaction tasks. An interaction technique<br />

is uniquely identified by an integer i and plotted as a point in this space.<br />

Each point is decorated with the pragmatic and syntactic properties of<br />

the corresponding interaction technique.<br />

foreground interaction, and (2) unconscious actions that correspond<br />

to background interaction. The foreground interaction<br />

area contains the interaction techniques that require<br />

the user to consciously manipulate the device to reach some<br />

objective (as for the sliding gesture of Figure 2). The background<br />

interaction area corresponds to the interaction techniques<br />

where the system interprets the user’s unconscious actions<br />

together with contextual information to perform some<br />

system state change on behalf of the user. For example, during<br />

a phone call, the iPhone switches the screen backlight<br />

off to save battery life as the user brings the device next to<br />

the ear.<br />

Whether human actions are performed consciously to address<br />

the system or not, our classification space characterizes<br />

these actions with two additional variables: (τ) the geometrical<br />

transformation matrix that models the user’s movements in<br />

space, and (f) the frequency of these movements. The combinations<br />

of τ and f identify three sub-areas within the lexical<br />

axis: “Context”, “Affine Transformations” and “Shock”.<br />

The affine transformations group identifies the most common<br />

interaction techniques based on translations, rotations<br />

and/or scales (in this case, τ is different from the identity<br />

matrix I), and without any repetition (that is, f is equal to<br />

zero, meaning that the interaction is time driven). The sliding<br />

gesture of Figure 2 falls in this category. The shock<br />

category identifies those interaction techniques based on a<br />

combination of translations, rotations and/or scales (τ is different<br />

from the identity matrix) repeated over time (then, f<br />

is different from zero). The shake gesture exemplified by<br />

Shoogle [20] falls in this category. The context category<br />

corresponds to unconscious human manipulations that the<br />

system may interpret to feed into its own context model and,<br />

depending on this context, act on behalf of the user. For<br />

this situation, we stipulate that τ is the identity matrix and f<br />

is equal to zero.
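
The three sub-areas of the lexical axis follow directly from the two variables τ and f, which the following minimal sketch makes explicit. The function names, the numerical tolerance, and the plain list-of-lists matrix representation are assumptions made for illustration. The text does not define the case where τ is the identity and f is non-zero, so the sketch rejects it.

```python
def is_identity(matrix, tol=1e-9):
    """Return True if `matrix` (a square list of lists) is the identity within tol."""
    n = len(matrix)
    return all(
        len(row) == n
        and all(abs(value - (1.0 if r == c else 0.0)) <= tol
                for c, value in enumerate(row))
        for r, row in enumerate(matrix)
    )


def lexical_category(tau, f):
    """Map (tau, f) to one of the three sub-areas of the lexical axis."""
    identity = is_identity(tau)
    repeated = f != 0
    if identity and not repeated:
        return "Context"
    if not identity and not repeated:
        return "Affine Transformations"
    if not identity and repeated:
        return "Shock"
    # tau = I with f != 0 is not described in the text (assumption: reject it).
    raise ValueError("tau = I with f != 0 is not covered by the three sub-areas")


# Hypothetical example: a 2D translation in homogeneous coordinates.
slide = [[1, 0, 5],
         [0, 1, 0],
         [0, 0, 1]]
print(lexical_category(slide, 0))
print(lexical_category(slide, 3.0))
```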

Syntactic Axis<br />

Independently from the device used, we characterize the<br />

syntactic dimension of an interaction technique with the following<br />

two variables that we call syntactic modifiers: (1) the<br />

existence (or absence) of triggers to specify the beginning/end of<br />

the interaction, and (2) the control type associated with the<br />

input token, which may be position-control, speed-control<br />

or acceleration-control. As a result, given that, in our taxonomy,<br />

an interaction technique is uniquely identified by an<br />

index i, the trigger syntactic modifier is represented as an<br />

oval that surrounds the interaction technique identifier, drawn with<br />

a dashed line or a continuous line to denote, respectively, the<br />

presence (i.e., clutched) or absence (i.e., unclutched) of a trigger.<br />

In addition, a derivative-like notation is used to convey the<br />

control type, where i is decorated with a superscript number<br />

that expresses the derivative order with respect to time (i.e.,<br />

no derivative for position, first order derivative for speed,<br />

and second order derivative for acceleration).<br />
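
The two syntactic modifiers can be captured in a small data structure, sketched below. Since the original notation is graphical, the text rendering (for example `7^1 (dashed oval)`) is a hypothetical stand-in, and showing no superscript for position control is an assumption based on "no derivative for position".

```python
from dataclasses import dataclass
from enum import IntEnum


class ControlType(IntEnum):
    """Control type, valued by its derivative order with respect to time."""
    POSITION = 0
    SPEED = 1
    ACCELERATION = 2


@dataclass(frozen=True)
class SyntacticModifiers:
    clutched: bool        # True if a trigger delimits the interaction
    control: ControlType


def render(i, modifiers):
    """Return a plain-text stand-in for the decorated identifier i."""
    order = int(modifiers.control)
    # Assumption: position control (order 0) carries no superscript.
    label = str(i) if order == 0 else f"{i}^{order}"
    # Dashed outline denotes the presence of a trigger, solid its absence.
    outline = "dashed oval" if modifiers.clutched else "solid oval"
    return f"{label} ({outline})"


print(render(7, SyntacticModifiers(clutched=True, control=ControlType.SPEED)))
```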

Semantic Axis<br />

As justified in our review of the foundational taxonomies<br />

developed in HCI, we re-use Foley’s interaction tasks: Select,<br />

Position, Orient, Path, Quantify, and Text [9] (see the<br />

vertical axis of Figure 3).<br />

Pragmatic Axis<br />

One of the originalities of our work is the attempt to classify<br />

gestural interaction techniques in close connection with their<br />

meaning in the user’s real world. To do this, we introduce a<br />

pragmatic modifier that expresses the directness [14, 3] of<br />

the mapping between the user’s expectation (i.e. goal) and<br />

the semantics of the interaction technique in the computer<br />

world. For indirect mapping, the identifier i of the interaction<br />

technique becomes the parameter of a function F(i)<br />

to indicate the existence of one or several reinterpretation<br />

layers, whereas for direct mapping, i does not receive any<br />

additional decoration.<br />
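
The pragmatic modifier and the semantic axis can be combined in a short sketch: each technique carries one of Foley's six interaction tasks and a flag for direct or indirect mapping, and indirect techniques are labelled F(i). The `Technique` class, the `direct_only` helper, and the sample entries are hypothetical and only illustrate the notation.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """Foley's interaction tasks, used as the semantic axis."""
    SELECT = "Select"
    POSITION = "Position"
    ORIENT = "Orient"
    PATH = "Path"
    QUANTIFY = "Quantify"
    TEXT = "Text"


@dataclass(frozen=True)
class Technique:
    i: int          # identifier of the interaction technique
    task: Task      # semantic coverage
    direct: bool    # True if the goal maps directly to the semantics

    def label(self):
        """Indirect mappings are wrapped as F(i); direct ones stay undecorated."""
        return str(self.i) if self.direct else f"F({self.i})"


def direct_only(techniques):
    """Keep only techniques without reinterpretation layers."""
    return [t for t in techniques if t.direct]


# Hypothetical entries for illustration only.
catalogue = [
    Technique(1, Task.SELECT, direct=True),
    Technique(2, Task.PATH, direct=False),
]
print([t.label() for t in catalogue])
```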


Our fine-structured, language-inspired analysis makes it possible to understand<br />

intrinsic and implicit differences even among apparently<br />

similar interaction techniques, allowing researchers<br />

to better explore them and designers to choose the most<br />

suitable one for each case.<br />

From the researcher’s point of view, the classification shows<br />

a transparent state of the art where each interaction technique<br />

is classified without ambiguity. Typically, reference<br />

taxonomies such as [9] or [5] do not consider the role of<br />

time (cf. frequency and duration), nor do they cover unconscious<br />

interaction (cf. background interaction) and unstructured<br />

interaction such as device shaking. In addition, they<br />

do not explicitly consider whether an interaction technique<br />

is clutched or unclutched, introducing ambiguities and mixing<br />

up different aspects of human interaction behavior.<br />

From the designer’s point of view, the dimensions of our<br />

taxonomy can be used as a framework for decision making.<br />

For example, an unclutched interaction technique may<br />

be considered for default tasks, while different clutched interaction<br />

techniques can be multiplexed through the use of<br />

standard or ad-hoc widgets. By proposing at least one interaction<br />

technique for each of the proposed tasks while designing<br />

an application, designers will be able to offer a complete<br />

and uniform user experience similar to the WIMP one.<br />

Furthermore, designers can predict the difficulties that final<br />

users will encounter by analyzing the pragmatic and syntactic<br />

modifiers that characterize the interaction techniques they<br />

envision. Thus, they will be able to choose interaction techniques<br />

that best suit the targeted representative users (novice,<br />

intermediate, expert).<br />

We think promising research and development directions lie<br />

both in the creation of widgets able to transform direct<br />

interactions into their more complex counterparts and in<br />

the definition of the elementary interactions on which to base<br />

development. The classification suggests concentrating<br />

efforts on the development of interaction techniques<br />

able to specify Path, Quantify and Text input.<br />

Pragmatically direct interaction techniques are the most suitable<br />

for automotive environments, in particular for drivers.<br />

The absence of indirection layers during the interaction implies<br />

a lower cognitive load, thus easing the interaction and<br />

avoiding distraction.<br />


The characteristics on which we chose to base our analysis<br />

are those inspired by the parallel between the artificial<br />

languages proposed by interaction techniques and the gestural<br />

languages users are used to: lexicon, syntax, semantics and<br />

pragmatics. Our discussion did not extend to the system level,<br />

as we did not want to differentiate interaction techniques by<br />

their implementation characteristics (granularity, resolution<br />

function and state machine are variables that have already been taken<br />

into account in [7, 13], to which we want our work to be complementary<br />

rather than a substitute).<br />

Our approach proposed a user-centered classification able to<br />

analyze the state of the art of accelerometer-based interaction<br />

techniques from the manipulation point of view: the user<br />

performs a physical action in space in order to communicate<br />

with the system. We think this is the atomic level at<br />

which we have to conceive our interfaces in order to propose<br />

system-wide coherent languages to users. This coherence<br />

will lead them to a more agreeable, natural [5]<br />

and intuitive system, characterized by coherence and direct pragmatic<br />

distances.<br />

We proposed the use of a parametric space where the pragmatic<br />

distance and the syntactic modifiers are indicators of<br />

the learning curve users have to climb when approaching a<br />

new interaction language.<br />

We contextualized our approach and principles for automotive<br />

environments. We proposed the use of the syntactic<br />

and pragmatic modifiers as discriminants for identifying the most appropriate<br />

gestural interaction techniques for automotive environments.<br />



The content of this article refers to, and is in part an extract<br />

of, the taxonomy of accelerometer-based interaction techniques<br />

proposed by Scoditti et al. [18].<br />


1. R. Ballagas, J. Borchers, M. Rohs, and J. G. Sheridan.<br />

The smart phone: A ubiquitous input device. IEEE<br />

Pervasive Computing, 5(1):70, 2006.<br />

2. T. Baudel and M. Beaudouin-Lafon. Charade: remote<br />

control of objects using free-hand gestures. Commun.<br />

ACM, 36(7):28–35, 1993.<br />

3. M. Beaudouin-Lafon. Instrumental interaction: an<br />

interaction model for designing post-WIMP user<br />

interfaces. In CHI ’00: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 446–453, New York, NY, USA, 2000. ACM.<br />

4. F. P. Brooks. Grasping reality through<br />

illusion—interactive graphics serving science. In CHI<br />

’88: Proceedings of the SIGCHI conference on Human<br />

factors in computing systems, pages 1–11, New York,<br />

NY, USA, 1988. ACM.<br />

5. W. Buxton. Lexical and pragmatic considerations of<br />

input structures. SIGGRAPH Comput. Graph.,<br />

17(1):31–37, 1983.<br />

6. W. Buxton. Integrating the periphery and context: A<br />

new model of telematics. In Proceedings of Graphics<br />

Interface, pages 239–246, 1995.<br />

7. S. K. Card, J. D. Mackinlay, and G. G. Robertson. A<br />

morphological analysis of the design space of input<br />

devices. ACM Trans. Inf. Syst., 9(2):99–122, 1991.<br />

8. G. W. Fitzmaurice, S. Zhai, and M. H. Chignell. Virtual<br />

reality for palmtop computers. ACM Trans. Inf. Syst.,<br />

11(3):197–218, 1993.<br />

9. J. D. Foley, V. L. Wallace, and P. Chan. The human<br />

factors of computer graphics interaction techniques.<br />

IEEE Comput. Graph. Appl., 4(11):13–48, 1984.<br />

10. J. Francone, G. Bailly, L. Nigay, and E. Lecolinet.<br />

Wavelet menu: une adaptation des marking menus pour<br />

les dispositifs mobiles. In IHM ’09: Proceedings of the<br />

21st International Conference on Association<br />

Francophone d’Interaction Homme-Machine, pages<br />

367–370, New York, NY, USA, 2009. ACM.<br />

11. K. Hinckley, J. Pierce, M. Sinclair, and E. Horvitz.<br />

Sensing techniques for mobile interaction. In UIST ’00:<br />

Proceedings of the 13th annual ACM symposium on<br />

User interface software and technology, pages 91–100,<br />

New York, NY, USA, 2000. ACM.<br />

12. G. Levin and P. Yarin. Bringing sketching tools to<br />

keychain computers with an acceleration-based<br />

interface. In CHI ’99: CHI ’99 extended abstracts on<br />

Human factors in computing systems, pages 268–269,<br />

New York, NY, USA, 1999. ACM.<br />

13. J. Mackinlay, S. K. Card, and G. G. Robertson. A<br />

semantic analysis of the design space of input devices.<br />

Hum.-Comput. Interact., 5(2):145–190, 1990.<br />

14. D. Norman. User Centered System Design; New<br />

Perspectives on Human-Computer Interaction. L.<br />

Erlbaum Associates Inc., 1986.<br />

15. K. Partridge, S. Chatterjee, V. Sazawal, G. Borriello,<br />

and R. Want. TiltType: accelerometer-supported text<br />

entry for very small devices. In UIST ’02: Proceedings<br />

of the 15th annual ACM symposium on User interface<br />

software and technology, pages 201–204, New York,<br />

NY, USA, 2002. ACM.<br />

16. J. Rekimoto. Tilting operations for small screen<br />

interfaces. In UIST ’96: Proceedings of the 9th annual<br />

ACM symposium on User interface software and<br />

technology, pages 167–168, New York, NY, USA,<br />

1996. ACM.<br />

17. J. Rico and S. Brewster. Usable gestures for mobile<br />

interfaces: evaluating social acceptability. In CHI ’10:<br />

Proceedings of the 28th international conference on<br />

Human factors in computing systems, pages 887–896,<br />

New York, NY, USA, 2010. ACM.<br />

18. A. Scoditti, J. Coutaz, and R. Blanch. A novel<br />

taxonomy for gestural interaction techniques based on<br />

accelerometers. In <strong>IUI</strong> 2011. ACM, 2011.<br />

19. M. Shaw. What makes good research in software<br />

engineering? International Journal of Software Tools<br />

for Technology, 4(1):1–7, 2002.<br />

20. J. Williamson, R. Murray-Smith, and S. Hughes.<br />

Shoogle: excitatory multimodal interaction on mobile<br />

devices. In CHI ’07: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 121–124, New York, NY, USA, 2007. ACM.<br />

21. A. Wilson and S. Shafer. XWand: UI for intelligent<br />

spaces. In CHI ’03: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 545–552, New York, NY, USA, 2003. ACM.

Navigating Haystacks at 70 mph:<br />

Intelligent Search for Intelligent In-Car Services<br />

Ashweeni K. Beeharee<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 0358<br />

a.beeharee@cs.ucl.ac.uk<br />

Sven Laqua<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 0351<br />

s.laqua@cs.ucl.ac.uk<br />

M. Angela Sasse<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 7212<br />

a.sasse@cs.ucl.ac.uk<br />


With an explosion of in-car services, it has become not only<br />

difficult but unsafe for drivers to search and access large amounts<br />

of information using current interaction paradigms. In this paper,<br />

we present a novel approach for visualizing and exploring search<br />

results, and the potential benefits of its application to the current<br />

in-car environment. We have iteratively developed and tested a<br />

prototype system that enables the seamless and personalized<br />

exploration of information spaces. In a number of eye-tracking<br />

studies, we analyzed user satisfaction and task performance for<br />

factual and explorative search tasks. We found that most<br />

participants were faster, made fewer errors and found the system<br />

easier to use than traditional ones. We believe that this approach<br />

would improve traditional in-car interfaces for searching and<br />

accessing a large number of services with rich information. This would<br />

reduce driver inattention to the road and improve road safety.<br />

Categories and Subject Descriptors<br />

H.5.2 [Information Interfaces and Presentation]: User<br />

Interfaces - Graphical user interfaces.<br />

General Terms<br />

Design, Experimentation, Human Factors, Intelligent Transport<br />

System Services, Road Safety, Theory<br />

Keywords<br />

Contextualization, Personalization, Exploration, Search, Context<br />

Interfaces, Contextual User Interfaces<br />

Permission to make digital or hard copies of all or part of this work for<br />

personal or classroom use is granted without fee provided that copies are<br />

not made or distributed for profit or commercial advantage and that<br />

copies bear this notice and the full citation on the first page. To copy<br />

otherwise, or republish, to post on servers or to redistribute to lists,<br />

requires prior specific permission and/or a fee.<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA<br />

1. SafeTRIP<br />

Satellite-based communication systems [10] for use in homes<br />

[1][13] and cars have been adopted by consumers in many parts of<br />

the world. The SafeTRIP project aims to build on this success and<br />

utilize a new generation of satellite technology to improve the<br />

safety, security and environmental sustainability of road transport.<br />

SafeTRIP uses S-band satellite technology, which is optimized for<br />

two-way communication for on-board vehicle units. S-band<br />

communication requires only a small antenna, making it suitable for the<br />

mass market. Existing solutions that use other frequency bands<br />

(e.g. Ku-band) require larger antennas [12], making them less<br />

suitable for integration in vehicles or in handheld devices. An<br />

open SafeTRIP platform will be implemented to host services for<br />

improved safety and navigation, as well as entertainment and<br />

advertising for vehicle occupants.<br />

Figure 1 - The SafeTRIP concept<br />

During the requirements capture, we held discussions with drivers,<br />

operators, emergency technicians, operation managers,<br />

technologists and management from road operators, insurance<br />

companies, fleet operators, freight forwarders and coach operators<br />

to understand their needs.<br />

Figure 2 - User needs define the SafeTRIP platform<br />

The SafeTRIP platform’s definition - based on key functionalities<br />

elicited from business (such as road operators) and individual<br />

stakeholders - is shown in Figure 2. The platform enables services<br />

that can provide access to rich information that might be useful to<br />

drivers. At the same time, this creates a risk of overloading drivers<br />

with information and distracting their attention, which should be<br />

focussed on the road. In this paper, we present a new paradigm for<br />

accessing rich media and information in a vehicle that has<br />

minimal impact on the driver’s attention while driving.

2. SafeTRIP Services<br />

From our requirements capture, a set of safety and comfort<br />

services were identified, including:<br />

• Road safety alert service – hazard and incident warning;<br />

• Speed limit service – display variable speed limits in-car;<br />

• Collaborative alert service – allow drivers to share<br />

information about road incidents and traffic information;<br />

• Entertainment service - provides access to streaming media<br />

and TV channels;<br />

• Assistance service - remote assistance and diagnostics;<br />

• Parking guidance service - for hazardous goods vehicles and<br />

coaches;<br />

• Location-Based services – access and present localised<br />

information to driver such as petrol stations, restaurants,<br />

hotels, local events.<br />

These services will provide numerous benefits to drivers. For<br />

instance, they will allow drivers to access rich and timely traffic<br />

information from various sources in the vehicle. Commercial<br />

systems such as Coyote have proven very popular amongst drivers<br />

who share information about speed cameras in Europe. Through<br />

SafeTRIP, drivers will also be able to share information about<br />

road incidents with each other. Our user requirements capture<br />

shows that individuals are interested in accessing richer<br />

information. Through the above services, they will be able to<br />

access localized information about parking spaces, hotels and<br />

petrol stations, along with rich details that allow them to<br />

search for the cheapest place to refuel or for a restaurant with<br />

a cuisine of their liking.<br />

Whilst this type of information could have many benefits for<br />

drivers, there are risks associated with delivering it into<br />

vehicles. In 2006, a study by the U.S. Department of Transportation<br />

(DOT) reported that driver inattention is the leading factor in 80%<br />

of crashes and 65% of near-crashes [9]. The SafeTRIP platform<br />

will partly address this through a driver alertness service that<br />

monitors driver alertness and supports warnings to drivers [8].<br />

However, with access to a large number of services, the driver’s<br />

attention will be required to:<br />

• Access a service through the navigation interface<br />

• Interact with a specific service - which may involve searches<br />

that would require further interaction from the driver<br />

Current icon-based interfaces to in-car systems and virtual<br />

keyboards are too taxing on the driver’s attention, and this can only<br />

get worse with an increasing number of services. This has led us<br />

to consider alternative paradigms for driver interaction with<br />

information delivered into vehicles.<br />


In this section we describe a novel information exploration<br />

technique to search and access information on the web.<br />

Experiments have clearly demonstrated its benefits and we believe<br />

that this approach will prove beneficial for drivers searching and<br />

interacting with information in their vehicle.<br />

Approaches such as contextual search [3], search result clustering<br />

[16] or personal search [2][15] aim to overcome some of the<br />

shortcomings of “traditional” search engines. However, none of<br />

those approaches challenges the current paradigm of how users<br />

interact with search engines. To us, it is obvious that the<br />

traditional interaction model using search engine result pages<br />

(SERPs) does not work well for more complex information<br />

problems.<br />

To get a broader view, users need to consult different sources and<br />

understand contexts. Most of the time, a single resource will not<br />

be able to satisfy this need. Traditional SERPs fragment the<br />

relevant bits of information, rather than help users to contextualize<br />

them in meaningful ways. Users have to “crawl” site after site,<br />

foraging for meaningful bits [12], emulating the behavior of a<br />

search engine robot. The search engine interaction model<br />

(Figure 3, left side) illustrates users’ interaction with SERPs,<br />

moving back and forth between search results (A, B, C, D) and<br />

the actual SERP (central point).<br />

Figure 3 - Contrasting Interaction Models<br />

3.1 Information Exploration UI<br />

In contrast, users’ interaction with our information exploration<br />

interface – also referred to as Focus-Metaphor Interface (FMI) [4]<br />

- enables seamless exploration of the underlying information<br />

spaces (see Figure 3, right side). This approach combines a<br />

contextual navigation with the actual display of information (see<br />

Figure 4) and particularly facilitates orienteering behavior [14].<br />

When visualizing search results, the FMI replaces traditional<br />

search engine result pages (see Figure 4-A). Its contextual<br />

interface elements contain snippet-like information previews of<br />

the actual search results, and are arranged around the central<br />

content element which displays details of the currently selected<br />

search result (see Figure 4-B).<br />

Figure 4 - FMI prototype for social tools evaluation<br />

When selecting another contextual element, its state changes: it<br />

enlarges into a content element and moves to the centre of the<br />

screen, replacing the previously displayed search result (see<br />

Figure 4-C). This approach allows “browsing” through search<br />

results whilst preserving contextual awareness of the other search<br />

result snippets. In addition, the chosen layout enables a less

hierarchical and more concurrent display of the “top X” search<br />

results, without requiring any scrolling.<br />

However, the key strength of the FMI model becomes apparent<br />

when none of the presented search results meet the user’s<br />

information need. Rather than having to re-formulate another<br />

search query hoping for more promising search results, the user<br />

can simply pick one of the existing results that she thinks comes<br />

“closest” to what she is looking for, and request similar/related<br />

results. This enables the dynamic adaptation of contextual<br />

elements to the currently displayed content element, without<br />

requiring the user to articulate their information need precisely.<br />
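The interaction described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' prototype: the class and function names (`Result`, `FocusView`, `related_by_shared_words`), the toy corpus and the word-overlap similarity are all assumptions made for the example. It also assumes that the previously focused result takes the slot of the selected contextual element, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Result:
    title: str
    snippet: str


@dataclass
class FocusView:
    """One central content element surrounded by contextual previews."""
    focus: Result
    context: List[Result]
    # Hypothetical similarity lookup: (result, how many) -> related results.
    find_related: Callable[[Result, int], List[Result]]

    def select(self, index: int) -> None:
        """Move a contextual element to the centre (assumed: old focus takes its slot)."""
        self.focus, self.context[index] = self.context[index], self.focus

    def request_related(self) -> None:
        """On user request, replace the contextual elements with results related to the focus."""
        self.context = self.find_related(self.focus, len(self.context))


# Toy corpus, invented for illustration only.
CORPUS = [
    Result("Cheap petrol near M25", "petrol prices motorway"),
    Result("Motorway services guide", "motorway restaurants parking"),
    Result("Hotel deals", "hotels parking city"),
    Result("Petrol price tracker", "petrol prices daily"),
]


def related_by_shared_words(target: Result, limit: int) -> List[Result]:
    """Toy similarity: rank the other results by the number of shared snippet words."""
    words = set(target.snippet.split())
    others = [r for r in CORPUS if r != target]
    others.sort(key=lambda r: len(words & set(r.snippet.split())), reverse=True)
    return others[:limit]


view = FocusView(focus=CORPUS[2], context=[CORPUS[0], CORPUS[1]],
                 find_related=related_by_shared_words)
view.select(0)          # the petrol result becomes the central element
view.request_related()  # contextual elements now follow the new focus
```

Keeping `select` and `request_related` separate mirrors the point made in the text: the user can browse the current results without any change to the context, and only asks for related results when none of them fits.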

This approach represents a break from traditional search behavior,<br />

as the user does not need to constantly go back to a search<br />

interface to (re-)start a new search session. Instead, an initial<br />

search query is the starting point for a seamless and personalized<br />

orienteering and exploration process that guides the user from one<br />

information nugget to the next. Although Google search provides<br />

related functionality through a link called “similar” available with<br />

some of its search result snippets, this functionality mostly works<br />

at a very abstract level (e.g. sites related by topic), but not on the<br />

actual content level. Microsoft search (live.com) provides “related<br />

searches” through a list of similar search queries. However, this<br />

functionality again seems to only work on a rather abstract level<br />

with more generic search queries.<br />

Another key benefit of the FMI model is that its layout and<br />

interaction paradigm lends itself to novel interaction techniques,<br />

such as touch or even eye-gaze. In an earlier study, we have<br />

demonstrated the effective use of our information exploration<br />

interface with eye-gaze only [6].<br />

3.2 Experimentation<br />

Over 3 years, we have conducted a number of lab-based studies of<br />

various FMI prototype iterations. We evaluated the performance<br />

of and user satisfaction with our prototype against a range of<br />

existing tools, such as individual blogs, blog spaces, Google news,<br />

Google Reader and PARC’s StarTree [4][5][7].<br />

Throughout those studies, task completion times were<br />

significantly faster and error rates were significantly lower using<br />

the FMI than in blog environments (see Figure 5) and on a par<br />

with PARC’s StarTree (which only works for well-formed<br />

information spaces).<br />

Figure 5 - Cross-study comparison<br />

Participants using the FMI had short and very consistent average<br />

fixation durations, which indicate lower cognitive load than in all<br />

compared systems. User feedback through questionnaires and<br />

informal interviews confirmed the ease of use and learnability of<br />

the FMI prototypes for most users.<br />

3.3 Social Tools Study<br />

In our latest study, we used a corpus of domain-specific blog<br />

entries to evaluate a range of social tools, namely the ability to<br />

tag, rate and bookmark any of the articles. We looked at the<br />

impact of 1) ratings on contextual search snippets and 2) tags on<br />

search result presentation (see Figure 6).<br />

Figure 6 – Screenshot of FMI with social tools<br />

The eye-tracking experiment involved 21 participants, 13m/8f,<br />

20-46 years (avg. 25.7). We used a range of factual and<br />

explorative search tasks. For factual search tasks, participants had<br />

to identify a specific article; for explorative search tasks,<br />

participants had to explore a certain topic for a few minutes. In<br />

both cases, we used small scenarios to facilitate intrinsic<br />

motivation in the participants.<br />

For the contextual search result snippets, our analysis of post-experiment<br />

usability questionnaires (Likert scale, 1-6) revealed<br />

that participants found the “5 star rating” functionality very quick<br />

and easy to use (5.5). The ability to have ratings displayed in the<br />

contextual navigation elements was rated significantly higher than<br />

the perceived impact on users’ navigational decisions (4.8 vs. 4.0,<br />

t(20) = 2.09, p < 0.02).<br />

However, analysis of the eye-tracking data shows that participants’<br />

awareness of the ratings was substantial, considering the<br />

actual size of the rating element within the contextual search snippet (see Table 1).<br />

Table 1 - Search snippet attention distribution<br />

Snippet element | Attention distribution (relative gaze time)<br />

Rating | 17.1 %<br />

Title | 54.4 %<br />

Description | 28.5 %<br />

Within this study of social tools for the FMI, selecting a “new”<br />

central content element automatically updated the contextual<br />

elements to display the most similar/related articles to the newly<br />

activated content element. However, user feedback showed that<br />

the automatic contextualization of relevant search snippets is too<br />

volatile for users’ taste. For future studies, we have therefore<br />

settled on a static/persistent contextual visualization that (only)<br />

adjusts to the currently displayed content element upon request by<br />

the user.<br />

4. SafeTRIP FMI<br />

With the large number of services available through SafeTRIP,<br />

searching through services and information, using traditional

methods and interfaces in in-car systems, can prove to be time<br />

consuming. Inefficient search therefore has a detrimental impact<br />

on the driver’s attention and thus on road safety.<br />

As FMI has proven to be an effective tool for searching and<br />

presenting information, we believe that its application to the in-car<br />

environment would be beneficial to the driver. We have identified<br />

some application areas for the SafeTRIP in-car interface that<br />

could benefit from this approach.<br />

Service Search<br />

SafeTRIP is an open platform, allowing third party<br />

applications/services to be made available to the drivers. With<br />

typically dozens of services planned already and new ones<br />

appearing with time, the traditional icon/menu based interface in<br />

most in-car systems may not be appropriate. With FMI,<br />

drivers will be able to search through hundreds of services and locate<br />

the ones that are most relevant. As our studies show, precise<br />

search criteria may be difficult to formulate – especially when<br />

searching for a new service. Also, if the user goes down the wrong<br />

search path, he can explore information sets that look relevant,<br />

without reformulating the search all over again.<br />

Search Traffic Info<br />

Typically, drivers combine traffic information from various<br />

sources to make decisions while driving. With new services in<br />

SafeTRIP, traffic information will be available from yet more<br />

sources – namely road operators, other drivers, authorities and<br />

traffic information providers. The reliability and timeliness of<br />

such information differs across sources – and drivers know how to<br />

exploit these differences. FMI can be used to provide an efficient<br />

mechanism to search for the most appropriate information, given<br />

that complete automation is unlikely as drivers use a mix of<br />

information sources based on their personal preferences.<br />

Display Traffic Info<br />

With SafeTRIP, we plan to provide rich traffic information to the<br />

drivers. On the motorway, variable speed restrictions (e.g. in the<br />

event of a road incident) will be sent to the vehicle (instead of<br />

being displayed on a Variable Message Sign) with some details<br />

about the incident. It is expected that drivers would be more likely<br />

to respect the new speed restrictions if they are aware of the<br />

underlying reason. However, the display of rich information can<br />

lead to information overload or inattentional blindness – causing<br />

the driver to ignore the important information in the messages.<br />

The layout of information in the FMI is designed to be<br />

minimalistic, providing as much relevant information as a user<br />

can process effectively, allowing for easy decision making and<br />

exploration of further relevant information.<br />

Entertainment Selection Interface<br />

Remote controls fitted to the steering wheel are a definite<br />

improvement that allows drivers to interact with the in-car<br />

entertainment system without taking their eyes off the road.<br />

However, with the explosion of entertainment options – both<br />

audio and video – through the SafeTRIP platform, it is likely that<br />

such solutions will quickly show their limitations. We believe that<br />

the FMI approach would allow the driver to quickly and<br />

efficiently search through the entertainment options.<br />


It is clear to us that web-based search benefits from the FMI<br />

approach, as demonstrated by the results obtained from<br />

experimentation. With the increase in the number of services<br />

available in the car, such as those offered through SafeTRIP, there is<br />

a real need for an effective and efficient way to search and interact<br />

with those services. We therefore believe that in-car systems<br />

would greatly benefit from the FMI approach by decreasing<br />

search time, thereby improving the driver’s attention to the road and<br />

contributing towards road safety.<br />


[1] Bly, S., Schilit, B., McDonald, D.W., Rosario, B., Saint-Hilaire,<br />

Y., Broken expectations in the digital home, Ext.<br />

Abstracts CHI 2006, ACM Press (2006), 568-573.<br />

[2] Cutrell, E. et al. (2006). Fast, Flexible Filtering with Phlat –<br />

Personal Search and Organization Made Easy. In<br />

Proceedings of CHI 2006, Montreal, Canada.<br />

[3] Kraft, R. et al. (2006). Searching with Context. In Proc. of<br />

International World Wide Web Conference (WWW ’06),<br />

(Edinburgh, Scotland, 2006). ACM Press.<br />

[4] Laqua, S. and Brna, P. The Focus-Metaphor Approach: A<br />

Novel Concept for the Design of Adaptive and User-Centric<br />

Interfaces. In Proc. Interact 2005, Springer (2005), 295-308.<br />

[5] Laqua, S. and Sasse, M.A. (2009). Exploring Blog Spaces: A<br />

Study of Blog Reading Experiences using Dynamic<br />

Contextual Displays. In: Proc. HCI 2009, Cambridge, UK.<br />

[6] Laqua, S., Bandara, S. U., and Sasse, M.A. (2007)<br />

GazeSpace: Eye Gaze Controlled Content Spaces. In Proc.<br />

HCI 2007, Vol. 2, 21st BCS HCI Group Conference (2007).<br />

[7] Laqua, S., Ogbechie, N., and Sasse, M.A. (2007).<br />

Contextualizing the Blogosphere: A Comparison of<br />

Traditional and Novel User Interfaces for the Web. In Proc.<br />

HCI 2007, Vol. 2, 21st BCS HCI Group Conference.<br />

[8] Lee, J. D., Hoffman, J. D., and Hayes, E. 2004. Collision<br />

warning design to mitigate driver distraction. In Proceedings<br />

of the SIGCHI Conference on Human Factors in Computing<br />

Systems (Vienna, Austria, April 24 - 29, 2004). CHI '04.<br />

ACM, New York, NY, 65-72.<br />

[9] NHTSA. The impact of Driver Inattention on<br />

Near-Crash/Crash Risk.<br />

http://www.nhtsa.gov/Research/Human+Factors/Distraction<br />

[10] Orbcomm. http://www.orbcomm.com<br />

[11] OmniTRACS. http://www.qualcomm.com<br />

[12] Pirolli, P. (2007). Information Foraging Theory. Oxford<br />

University Press.<br />

[13] Seager, W., Knoche, H., Sasse, M.A., TV-centricity -<br />

Requirements gathering for triple play services. In<br />

Interactive TV: A Shared Experience TICSP Adjunct<br />

Proceedings of EuroITV (2007), 274-278.<br />

[14] Teevan, J. et al. The perfect search engine is not enough: a<br />

study of orienteering behavior in directed search. In Proc.<br />

CHI ’04. (Vienna, Austria, 2004)<br />

[15] Teevan, J. et al. Beyond the Commons: Investigating the<br />

Value of Personalizing Web Search. In Proc. of Workshop on<br />

New Technologies for Personalized Information Access<br />

(PIA). (Edinburgh, UK, 2005).<br />

[16] Zeng, H. J. et al. Learning to Cluster Web Search Results. In<br />

Proceedings of SIGIR ’04, Sheffield, United Kingdom, 2004.<br />

ACM Press, 210-217.

Discover Significant Situations for User Interface<br />

Adaptations<br />

Sandro Rodriguez Garzon<br />

Daimler Center for <strong>Automotive</strong> Information<br />

Technology Innovations<br />

HMI Group<br />

sandro.rodriguez.garzon@dcaiti.com<br />


In recent years, environmental awareness has become an important<br />

research topic in the field of adaptive user interfaces.<br />

Especially in the research area of location-based services,<br />

context-aware interfaces started using models of the environment<br />

in conjunction with sophisticated user models to<br />

filter user-relevant information. Despite the tight coupling<br />

of context-aware computing and user modeling, little<br />

research has focused on the correlations between a user preference<br />

and the context in which the user preference was inferred.<br />

Considering a user preference as a certain human-interface<br />

interaction that happens regularly within a similar<br />

context, this paper introduces a method to detect significant<br />

situations of frequent user interactions that occurred within<br />

similar environments. As an example, the paper discusses<br />

a definition of a personalization use case within the automotive<br />

environment: Adapting the user interface based on<br />

discovered user initiated radio station changes depending on<br />

the user’s location.<br />

Author Keywords<br />

context awareness, situation discovery, personalization, adaptation,<br />

intelligent user interface, temporal pattern<br />

ACM Classification Keywords<br />

H.5.2 User Interfaces: Theory and methods; H.3.3 Information<br />

Search and Retrieval: Miscellaneous<br />


Since the Active Badge Location System [7] many researchers<br />

have been interested in designing context-aware user interfaces.<br />

Considering the increase in complexity and functionality<br />

of user interfaces, several researchers identified ways<br />

and means to increase usability by displaying prefiltered<br />

information or modified user interface controls. While some<br />

approaches aimed at detecting similar interactions to apply<br />

personalized filters, other approaches tried to propose methods<br />

to build adaptation-ready user interfaces.<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />

Kristof Schütt<br />

Daimler Center for <strong>Automotive</strong> Information<br />

Technology Innovations<br />

HMI Group<br />

kschuett@cs.tu-berlin.de<br />

Context-aware computing focused on gathering the context<br />

of an entity to build a machine-readable model of the environment.<br />

With the help of these models it was possible<br />

to adapt a user interface dynamically, following the “one-fits-all”<br />

paradigm. Hence, a predefined rule specified<br />

how environmental factors influence the user interaction<br />

with the system. In contrast, the research in user modeling<br />

focused on gathering an accurate model of the user by applying<br />

sophisticated data mining methods. These user models<br />

were used to build user-centric adaptive systems that take<br />

the user’s needs into account. Unfortunately, most of the user<br />

modeling techniques were constrained to detect application<br />

specific user preferences. But in different contexts a user<br />

may prefer to interact with the system in different ways.<br />

Thus, this work on the discovery of significant situations is<br />

motivated by the desire to personalize user interfaces depending<br />

on their use in similar contexts. Contextual personalization<br />

can be seen as the process of bringing together<br />

the context with the user preferences. Our intent is not to<br />

construct a context dependent user model but to detect situations<br />

that might be followed by predictable user interactions.<br />

In order for the method to be applicable in the real<br />

world, our approach ensures unsupervised processing of<br />

user interactions without the need to prompt the user for explicit<br />

feedback.<br />


A promising idea concerning location-aware service personalization<br />

is presented by Coutand [2]. Coutand uses a case-based<br />

reasoning approach to calculate similarities between<br />

records of service use enriched by location dependent properties<br />

to deduce preferred service utilization. An approach<br />

of clustering context data in order to determine if an actual<br />

context belongs to an already sensed context is presented by<br />

Flanagan [6]. By expressing the context in a symbolic form<br />

he is able to develop an unsupervised learning algorithm to<br />

extract and group similar contexts as context states. This<br />

idea is very close to our work since our approach groups context<br />

as well. The difference lies in the way the environment<br />

is sensed. Our approach uses prespecified temporal event<br />

patterns to extract interaction traces that are annotated with<br />

context features. Those interaction traces are clustered concerning<br />

their multiple contexts in contrast to [6], where independent<br />

instances of context feature vectors are grouped.<br />

The notion of a temporal event pattern as an appropriate<br />

representation of a user preference is also mentioned in [3].

Cram proposes a method to interactively detect recurring<br />

user interaction sequences to enhance context-aware assisting<br />

systems. Unfortunately, Cram’s approach is not applicable<br />

within the automotive environment because the user has to<br />

be involved in the process of discovering regular task signatures.<br />


Following Etzion’s definition [5] of an event as an occurrence<br />

in the real world and its virtual representation, we<br />

introduce the notion of an interaction event.<br />

DEFINITION 1. Interaction Event. An arbitrary user interaction<br />

affecting the user interface or its environment represented<br />

by means of a virtual object.<br />

An interaction event will be generated by the user interface<br />

and processed by the frequent interaction discovery component.<br />

The definition of the interaction event incorporates all<br />

events thrown directly or indirectly by the user interface as<br />

well as events triggered by a change of the state of the environment.<br />

The term environment will be used as the superset<br />

of context which is defined as all elements of the environment<br />

which the user’s computer knows about [1]. Given the<br />

following definitions<br />

DEFINITION 2. Action. The concrete occurrence of an<br />

interaction event sequence.<br />

DEFINITION 3. Situation. A period of time in which certain<br />

conditions are satisfied indicating a probable occurrence<br />

of a known action.<br />

our prototype distinguishes between the user interaction, called<br />

action, and the state, called situation, in which our prototype<br />

assumes it knows what the user will do next. An action is<br />

declared to be frequent if the number of recurrences exceeds<br />

a prespecified limit. A situation is significant if the<br />

predictable action is frequent.<br />
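As a rough illustration of how Definitions 1 to 3 and the two criteria above fit together, the following Python sketch models them as plain data types. It is not the authors' prototype. The field names, the event name `station_change`, the condition `near_home` and the rule that two actions count as the same when their event names match in order are all assumptions made for the example. The sketch also counts occurrences of an action, which is one possible reading of the frequency criterion.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InteractionEvent:
    """Virtual object representing a user interaction or an environment change."""
    name: str         # hypothetical event name, e.g. "station_change"
    timestamp: float  # seconds since recording started (assumed unit)


@dataclass(frozen=True)
class Action:
    """A concrete occurrence of an interaction event sequence."""
    events: Tuple[InteractionEvent, ...]

    @property
    def signature(self) -> Tuple[str, ...]:
        # Simplification: actions are compared by their ordered event names only.
        return tuple(event.name for event in self.events)


@dataclass(frozen=True)
class Situation:
    """Conditions under which a known action is expected to occur."""
    conditions: Tuple[str, ...]
    expected: Tuple[str, ...]  # signature of the predictable action


def is_frequent(signature: Tuple[str, ...], observed: List[Action], limit: int) -> bool:
    """An action is frequent if its number of occurrences exceeds the limit."""
    counts = Counter(action.signature for action in observed)
    return counts[signature] > limit


def is_significant(situation: Situation, observed: List[Action], limit: int) -> bool:
    """A situation is significant if its predictable action is frequent."""
    return is_frequent(situation.expected, observed, limit)


# Three recorded station changes, checked against a limit of 2.
observed = [
    Action((InteractionEvent("station_change", t),)) for t in (10.0, 95.0, 180.0)
]
commute = Situation(conditions=("near_home",), expected=("station_change",))
significant = is_significant(commute, observed, limit=2)
```

Comparing actions by event names alone ignores their context, which is exactly the limitation that the similarity notion discussed in the following section is meant to address.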


The challenge lies in the detection of significant situations<br />

out of an arbitrary stream of interaction events. The probability<br />

of a certain action to reappear in the same constellation<br />

within the same context is very low. Therefore, to detect a recurrence<br />

of an action, a notion of similarity between actions<br />

that considers their context is needed. This work distinguishes<br />

between a comparison of actions comprising one event and<br />

multiple events. The section “Interaction Event Processing”<br />

discusses the former case while the section “Co-Situations” examines<br />

the latter case.<br />

We decided to split the process of significant situation detection<br />

into three successive subprocesses: Action discovery,<br />

context discovery and situation discovery. In the action discovery<br />

subprocess a prespecified event pattern is searched<br />

within the stream of interaction events. A concrete sequence<br />

of events found by the event engine is declared to be an action.<br />

The context discovery subprocess collects all actions<br />

and groups them by specified context features. The result<br />

of the context discovery subprocess is made up of groups<br />

containing actions whereupon each group is characterized<br />

by several action specific group properties. If the number<br />

of members of a group exceeds a limit, all members will be<br />

declared as frequent. This conclusion is valid because all actions<br />

of a group are assumed to be similar. The group properties<br />

describe a common environment of all the actions that<br />

are contained within that group. To detect a situation likely<br />

to contain a reoccurrence of a frequent action it is necessary<br />

to search for an event sequence that is parametrized by<br />

the property values of the group the frequent action belongs<br />

to. This process is called situation discovery. If the process<br />

encounters a compatible interaction event sequence the user<br />

interface will be notified of the significant situation.<br />


The process of significant situation detection is supported<br />

by use case specific data mining instructions. These instructions<br />

will be specified beforehand by an expert and used during<br />

the runtime process to assist the data mining process to<br />

extract significant situation information for specific personalization<br />

use cases. In the following discussion we use the<br />

term use case specification to subsume all instructions belonging<br />

to a certain use case. The explanation of the necessary<br />

specification steps and the event processing itself is<br />

accompanied by an automotive example: Detection of user<br />

initiated radio station changes depending on the user’s location.<br />

The intention of the personalization is to provide the<br />

user with a system generated proposal to change the radio<br />

station automatically at a certain location triggered by the<br />

detection of a significant situation.<br />

Context<br />

During runtime, every interaction event occurs within a certain<br />

environment. Considering our approach, the environment<br />

is represented by a fixed set of attributes and their situation<br />

dependent values. Since the environment comprises an<br />

almost infinite number of environmental factors it is necessary<br />

to define a subspace of the environmental factors that<br />

are relevant to the specific use cases. Hence, the use case<br />

specification must contain a context definition as an enumeration<br />

of context features that will be attached to every interaction<br />

event. In case of the radio example, two context<br />

features were identified as use case specific environmental<br />

factors: name of the radio station and a unique identification<br />

number (id) of the current road segment.<br />

Action Discovery<br />

The use case specification must also contain a prespecified<br />

interaction event pattern describing the action that should be<br />

investigated in detail. The event pattern is constructed by<br />

combining logical and temporal operators to form complex<br />

event sequence descriptions with event specific filter criteria.<br />

Since the prototypical implementation uses Esper [4] as<br />

the underlying event processing engine most of the available<br />

operators have their counterpart in Esper’s event processing<br />

language (EPL). Considering the radio example it is necessary<br />

to specify an accurate but generic event pattern representing<br />

the user’s action of changing the radio station: The<br />

radio station change should only be taken into account in

case the user initiated radio station change is not followed by<br />

any further radio station change within the next 10 minutes.<br />

During initialization the pattern will be passed to the complex<br />

event processing engine to start looking for compatible<br />

sequences. If the engine encounters a fitting sequence<br />

of interaction events it relays the concrete event sequence,<br />

namely action, to the subprocess of context discovery.<br />
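The action pattern of the radio example can be sketched in plain Python as follows. This is only an illustration of the filtering logic, not Esper EPL and not the prototype's code; the `Event` type, its field names and `discover_actions` are hypothetical names introduced for this sketch.<br />

```python
from dataclasses import dataclass
from typing import List, Optional

QUIET_PERIOD_S = 600  # 10 minutes without a further radio station change


@dataclass(frozen=True)
class Event:
    """Hypothetical interaction event with its attached context features."""
    kind: str         # e.g. "RadioChange" or "Location"
    timestamp: float  # seconds since the start of the recording
    station: str      # context feature: name of the radio station
    segment_id: int   # context feature: id of the current road segment


def discover_actions(events: List[Event], now: float) -> List[Event]:
    """Return radio changes not followed by another change within 10 minutes.

    `events` must be ordered by timestamp. `now` is the current stream time,
    so a change is reported only after its quiet period has fully elapsed.
    """
    changes = [e for e in events if e.kind == "RadioChange"]
    actions = []
    for i, change in enumerate(changes):
        deadline = change.timestamp + QUIET_PERIOD_S
        if now < deadline:
            continue  # still waiting: the user may change the station again
        nxt: Optional[Event] = changes[i + 1] if i + 1 < len(changes) else None
        if nxt is None or nxt.timestamp > deadline:
            actions.append(change)
    return actions
```

Each returned event carries its context features, so it can be handed directly to a grouping step.<br />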

Context Discovery<br />

The main task of the context discovery subprocess is to group<br />

all incoming actions based on prespecified criteria. As stated<br />

above, an action can be composed of an arbitrary number<br />

of temporally ordered events. Therefore, it is necessary to<br />

define a selection of specific events and corresponding context<br />

features of an action that are considered during a comparison<br />

between two actions. In other words, the similarity<br />

measure between two actions is calculated by a comparison<br />

between a fixed set of prespecified context features. The result<br />

of the grouping process is a set of groups of similar actions.<br />

The actions will be similar regarding the environment<br />

they occurred in. In turn, the characteristics of each group<br />

can be interpreted as a description of the common environment.<br />

Considering the radio example, we are only interested<br />

in grouping actions by the radio station and the unique road<br />

segment id. Thus, each group subsumes a certain case of<br />

user behavior within a certain environment. In this sense, a<br />

group is characterized by two properties: name of radio station<br />

and road segment ids. The former property will contain<br />

the name of the radio station while the latter property will<br />

contain a subnetwork of the road network represented by a<br />

set of unique road segment ids. Actions will be compared<br />

by the radio station name doing a string comparison and by<br />

the road segment id doing a network distance comparison.<br />

Radio station changes containing the same name of a radio<br />

station but occurring in neighboring road segments are assigned<br />

to the same group. A group will only be considered<br />

in the next subprocess if it contains a sufficient amount of<br />

actions. Such a group is called a significant group.<br />
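A minimal sketch of this grouping step, assuming that the road network is given as an adjacency map and that the network distance is measured in hops between road segments. The names `hop_distance`, `group_actions` and `significant_groups` as well as the thresholds `max_hops` and `min_actions` are hypothetical; the paper does not specify the distance measure or the limits.<br />

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Action = Tuple[str, int]  # (radio station name, road segment id)


def hop_distance(graph: Dict[int, Set[int]], a: int, b: int) -> float:
    """Breadth-first hop count between two road segments (inf if unreachable)."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def group_actions(actions: List[Action], graph: Dict[int, Set[int]],
                  max_hops: int = 1) -> List[dict]:
    """Group actions with equal station names on neighboring road segments."""
    groups: List[dict] = []
    for station, segment in actions:
        for g in groups:
            same_station = g["station"] == station  # string comparison
            nearby = any(hop_distance(graph, segment, s) <= max_hops
                         for s in g["segments"])     # network distance
            if same_station and nearby:
                g["segments"].add(segment)
                g["actions"].append((station, segment))
                break
        else:
            groups.append({"station": station, "segments": {segment},
                           "actions": [(station, segment)]})
    return groups


def significant_groups(groups: List[dict], min_actions: int = 3) -> List[dict]:
    """Keep only groups that contain a sufficient number of actions."""
    return [g for g in groups if len(g["actions"]) >= min_actions]
```

The two group properties of the radio example correspond to the `station` and `segments` entries of each group.<br />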

Situation Discovery<br />

Finally, the situation discovery subprocess uses the characteristics<br />

of the significant groups found by the previous subprocess<br />

to parametrize a generic interaction event pattern,<br />

namely the situation pattern. The situation pattern describes<br />

a moment in which a group specific action is expected to<br />

reappear. The way the expert specifies the generic situation<br />

pattern is similar to the way the specification of the action<br />

was done before. The difference lies in the fact that the context<br />

features of the events within the event pattern will be constrained<br />

by the discovered group characteristics. During runtime,<br />

several significant groups will be identified by the context<br />

discovery subprocess. For each group a new instance of<br />

the situation pattern will be generated with different context<br />

feature constraints. This group specific parametrization allows<br />

the event engine to use each generated situation pattern<br />

instance to discover actions that occur within the common<br />

environment of the corresponding group. As a consequence,<br />

the event processing engine is able to find significant situations<br />

as a result of detecting parametrized event sequences.<br />

Considering the radio example it is necessary to define a situation<br />

pattern that clearly describes an event sequence that is<br />

likely to be followed by a known radio station change event.<br />

To describe such a situation we include two conditions into<br />

the generic temporal event pattern: 1. The last radio station<br />

change resulted in a switch to a radio station that is different<br />

to the one found in the group property ”name of radio station”<br />

and 2. the current road segment id - originating from a<br />

location event - is part of the subnetwork found in the group<br />

property ”road segment ids”. Each significant group of radio<br />

station changes will start a process of detecting a significant<br />

situation in which the road segment is similar and the current<br />

radio station is different. Triggered by the notification of a<br />

significant situation, the prototype is able to propose a radio<br />

station change.<br />
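The two conditions of the situation pattern can be sketched as a predicate that is instantiated once per significant group. This is an illustrative plain-Python sketch, not the prototype's event pattern; `SignificantGroup`, `make_situation_pattern` and `proposals` are hypothetical names.<br />

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class SignificantGroup:
    station: str                 # group property "name of radio station"
    segment_ids: FrozenSet[int]  # group property "road segment ids"


def make_situation_pattern(group: SignificantGroup) -> Callable[[str, int], bool]:
    """Create one situation pattern instance parametrized by a group."""
    def matches(current_station: str, current_segment: int) -> bool:
        # Condition 1: the last station change led to a different station.
        # Condition 2: the current road segment lies in the group's subnetwork.
        return (current_station != group.station
                and current_segment in group.segment_ids)
    return matches


def proposals(groups: List[SignificantGroup], current_station: str,
              current_segment: int) -> List[str]:
    """Stations to propose for the current radio state and location event."""
    patterns = [(g, make_situation_pattern(g)) for g in groups]
    return [g.station for g, p in patterns
            if p(current_station, current_segment)]
```

A non-empty result corresponds to the notification of a significant situation, on which the user interface may propose the station change.<br />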


So far the context of only one event - radio station change -<br />

was observed to group actions. But what happens if actions<br />

are composed of multiple events and the comparison should<br />

consider two different contexts of two events? In this case<br />

the prototype would primarily split the grouping into separate<br />

grouping procedures for each event. A powerful application<br />

would be to detect significant situations based on<br />

several temporally ordered events each being parametrized<br />

by the properties of different significant groups. Such a new<br />

type of situation would enable the specification of a causal<br />

sequence of arbitrary situations to form a new significant situation.<br />

DEFINITION 4. Co-Situation. A period of time in which<br />

certain temporally ordered conditions describing multiple<br />

situations are satisfied indicating a probable occurrence of<br />

a known action.<br />

Let’s go back to the radio example and consider how it applies<br />

in the real world<br />

while entering and leaving a tunnel. Assume that two significant<br />

groups of radio station changes were found, situated at<br />

both tunnel exits. Both groups were detected due to the fact<br />

that one radio station is being received poorly on one side of<br />

the tunnel and vice versa. Without taking account of the direction<br />

the car is moving the prototype may propose a wrong<br />

radio station change while entering the tunnel. This misleading<br />

personalization happens because the location of a tunnel<br />

exit may match the location of a tunnel entrance. So far,<br />

the prototype does not consider how the driver approaches a<br />

significant situation.<br />

In order to consider the moving direction the event pattern<br />

of the action discovery subprocess must be extended by two<br />

location events preceding the first event that changes the radio<br />

station. The modified event pattern will consider multiple<br />

temporally ordered events that may happen in different<br />

contexts. To discover co-situations, the actions will no<br />

longer be grouped by only one context of a certain event<br />

but grouped by the contexts of the two location events. In<br />

this sense, each detected group is finally characterizable by<br />

the unique radio station and a pair of contexts. One context<br />

that describes a region before entering the tunnel and<br />

another context describing a region at the exit of the tunnel.

[Figure 1 shows two panels, ActionDiscovery and SituationDiscovery. Each relates the environment (a car passing through a tunnel) to the event sequence (Location and RadioChange events) and to the prototype's reactions. ActionDiscovery: a location event is found and the prototype starts looking for a following location event; when the second location event is found it looks for a RadioChange event; when the RadioChange event is found it waits 600 sec before reporting the encountered action. SituationDiscovery: a known context is found and the prototype starts looking for the second known context expected to follow the first; when the second known context is found and the radio station is unequal to the known radio station, the UI is notified.]<br />

Figure 1. Extended radio example: Relation between the environment and the order of event occurrence.<br />

Given all needed properties of a regular user interaction, it<br />

is necessary to extend the situation pattern as well to detect<br />

both contexts with respect to the causal order of their occurrence.<br />

Therefore, the situation event pattern will describe an<br />

event pattern that looks for location events within the context<br />

of the tunnel entry followed by location events within<br />

the context of the tunnel exit. A significant co-situation will<br />

be encountered if the car initially passes the context in front<br />

of the tunnel and then the context at the exit of the tunnel<br />

while the radio is switched to a different radio station.<br />

Figure 1 visualizes the co-situation. The task of grouping<br />

the driving direction is not necessary as long as the causal<br />

order of grouped contexts helps to trigger the intended radio<br />

station change.<br />
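The ordered detection of two contexts can be sketched as a small state machine. This is a simplified illustration under the assumption that each context is a set of road segment ids; `CoSituationDetector` and its field names are hypothetical and do not stem from the prototype.<br />

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class CoSituationDetector:
    station: str               # radio station of the significant group
    entry_ctx: FrozenSet[int]  # road segments of the first context
    exit_ctx: FrozenSet[int]   # road segments of the second context
    armed: bool = False        # True once the first context has been seen

    def on_location(self, segment_id: int, current_station: str) -> bool:
        """Feed one location event; return True if the UI should be notified."""
        if segment_id in self.entry_ctx:
            self.armed = True  # first known context found
            return False
        if self.armed and segment_id in self.exit_ctx:
            self.armed = False  # second known context found in the right order
            return current_station != self.station
        return False
```

Because the second context only counts after the first one, a car driving in the opposite direction does not trigger the proposal; that direction would be covered by a second detector with swapped contexts.<br />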


As a sample user interface we implemented a prototypical<br />

in-car-infotainment system based on ActionScript in combination<br />

with a context simulator. The prototype itself is implemented<br />

in Java with Esper [4] as its complex event processing<br />

engine. In order for the prototype to be independent<br />

of a certain user interface we decided to use XML as the underlying<br />

representational language for an event. To test the<br />

prototype under realistic conditions we used context records<br />

of several tracks to simulate the environment.<br />


We have presented a prototype that is able to discover significant<br />

situations based on the common environment of frequent<br />

user interactions. Supported by use case descriptions<br />

specified by an expert, the prototype detects similarities between<br />

user interactions and infers the corresponding environments<br />

needed to detect situations in which a user interaction<br />

is likely to reappear. Although we did not put our focus<br />

on time as well as space efficiency we have to acknowledge<br />

that in particular the space consumption is a critical factor<br />

affecting the application within embedded systems. Since<br />

in the current implementation all detected actions need to be<br />

stored along with their context it is necessary to subsume actions<br />

and to constrain the validity of an action. A first step<br />

towards a space optimized solution was done by limiting the<br />

influence of actions. If an action is too old it will be discarded.<br />

Furthermore, we limited the number of actions per situation<br />

independently of the time the action was detected.<br />
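Both limits can be sketched as a single pruning function. The representation of a stored action and the parameters `max_age_s` and `max_count` are assumptions made for this sketch; the paper does not state concrete values, nor which actions are dropped when the count limit is exceeded (here the oldest ones are).<br />

```python
from typing import List, Tuple

StoredAction = Tuple[float, str, int]  # (timestamp in s, station, road segment id)


def prune_actions(actions: List[StoredAction], now: float,
                  max_age_s: float, max_count: int) -> List[StoredAction]:
    """Drop actions older than max_age_s, then keep the newest max_count."""
    fresh = [a for a in actions if now - a[0] <= max_age_s]
    fresh.sort(key=lambda a: a[0])
    # Guard against max_count == 0, since fresh[-0:] would return everything.
    return fresh[-max_count:] if max_count > 0 else []
```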

During our work we identified some important refinements<br />

and extensions for future work. In particular, we will provide<br />

the expert with the ability to specify a minimal probability of<br />

occurrence for a significant situation as an additional trigger<br />

condition. Up to now the prototype reports a significant situation<br />

in case a certain number of similar actions happened<br />

within a certain context. This notion can be extended by<br />

reporting only in the case the probability of occurrence exceeds<br />

a user defined limit. In order to calculate the current<br />

likelihood of an action we also have to observe situations in<br />

which an action has not been executed. Since it is nearly impossible<br />

to accumulate all non occurrences of a use case we<br />

will constrain the observation to situations that are already<br />

associated with a concrete action of the use case. This is<br />

possible because the boundary of the situation is naturally<br />

given by the situation pattern.<br />
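One possible reading of this planned trigger condition is a relative frequency: the number of times the action was executed in a situation divided by the number of times the situation was observed. The following sketch assumes exactly that; the function names and both thresholds are hypothetical, since this extension is described only as future work.<br />

```python
def occurrence_probability(executed: int, not_executed: int) -> float:
    """Relative frequency of the action among all observations of a situation."""
    total = executed + not_executed
    return executed / total if total else 0.0


def should_report(executed: int, not_executed: int,
                  min_count: int, min_probability: float) -> bool:
    """Report only if the action is frequent and sufficiently likely."""
    return (executed >= min_count
            and occurrence_probability(executed, not_executed) >= min_probability)
```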


1. P. J. Brown. The stick-e document: a framework for<br />

creating context-aware applications. In EP, pages<br />

259–272, 1996.<br />

2. O. Coutand, S. Haseloff, S. L. Lau, and K. David. A<br />

Case-based Reasoning Approach for Personalizing<br />

Location-aware Services. In Workshop on Case-based<br />

Reasoning and Context Awareness, 2006.<br />

3. D. Cram, B. Fuchs, Y. Prié, and A. Mille. An approach<br />

to User-Centric Context-Aware Assistance based on<br />

Interaction Traces. In Int. Workshop Modeling and<br />

Reasoning in Context, pages 89–101, 2008.<br />

4. EsperTech Inc. Complex Event Processing.<br />

http://esper.codehaus.org/, Last access:<br />

30-12-2010.<br />

5. O. Etzion and P. Niblett. Event Processing in Action.<br />

Manning Publications Co., Greenwich, USA, 2010.<br />

6. J. A. Flanagan. Unsupervised clustering of context data<br />

and learning user requirements for a mobile device. In<br />

Int. and Interdisciplinary Conf. on Modeling and Using<br />

Context, pages 155–168, 2005.<br />

7. R. Want, A. Hopper, V. Falcão, and J. Gibbons. The Active<br />

Badge Location System. ACM Transactions on<br />

Information Systems, pages 91–102, 1992.

A new interaction technique based on eye tracking and<br />

single switch scanning systems<br />

Pradipta Biswas<br />

Engineering Design Centre<br />

Department of Engineering<br />

University of Cambridge, UK<br />

E-mail: pb400@cam.ac.uk<br />


In this paper we have presented a new input interaction<br />

system for people with severe disabilities. The new system<br />

works based on eye gaze tracking and single switch<br />

scanning interaction techniques. It combines eye gaze<br />

tracking and scanning in a way that is faster<br />

than scanning-only systems and more comfortable<br />

to use than systems based only on eye gaze tracking, as<br />

supported by a user study. We also point<br />

out a few applications of the system besides computer<br />

accessibility.<br />

Categories and Subject Descriptors<br />

D.2.2 [Software Engineering]: Design Tools and Techniques<br />

– user interfaces; K.4.2 [Computers and Society]:<br />

Social Issues – assistive technologies for persons<br />

with disabilities<br />

General Terms<br />

Algorithms, Experimentation, Human Factors<br />

Keywords<br />

Assistive Technology, Eye gaze tracker, Scanning, Usability<br />

Evaluation.<br />


Many physically challenged users cannot interact with a<br />

computer through a conventional keyboard and mouse.<br />

For example, spasticity, Amyotrophic Lateral Sclerosis<br />

(ALS), and Cerebral Palsy confine movement to a very<br />

small part of the body. Two possible solutions for these<br />

users are eye gaze tracking based input systems and<br />

scanning systems. An eye gaze tracking based system removes<br />

the need for a mouse and keyboard and enables the user<br />

to control the mouse pointer using only eye gaze. They<br />

can also use a virtual keyboard as an alternative to a normal keyboard.<br />

Permission to make digital or hard copies of all or part of this work for<br />

personal or classroom use is granted without fee provided that copies are<br />

not made or distributed for profit or commercial advantage and that<br />

copies bear this notice and the full citation on the first page. To copy<br />

otherwise, or republish, to post on servers or to redistribute to lists,<br />

requires prior specific permission and/or a fee.<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />

Pat Langdon<br />

Engineering Design Centre<br />

Department of Engineering<br />

University of Cambridge, UK<br />

E-mail: pml24@eng.cam.ac.uk<br />

Scanning is the technique of successively highlighting<br />

items on a computer screen and pressing a switch when<br />

the desired item is highlighted. Research on eye gaze<br />

tracking systems for assistive technology and on scanning<br />

systems has mainly been carried out in the field of alternative<br />

and augmentative communication (AAC) devices<br />

[7,8,11,13]. A plethora of commercial and research products<br />

are available that help people with disabilities to<br />

communicate using eye gaze tracking or scanning interfaces<br />

[11].<br />

However, navigation to arbitrary locations on a screen has<br />

also become important as graphical user interfaces are<br />

more widely used. A review of existing scanning systems<br />

for screen navigation can be found in a separate paper [3].<br />

The main disadvantage of these systems is that they are slow<br />

to operate. Many eye tracking based interfaces for people<br />

with disabilities use the eye gaze as a binary input, like a<br />

switch press, triggered by a blink [6, 13]. But the resulting<br />

systems remain as slow as scanning systems.<br />

Zhai [14] presents a detailed list of advantages and disadvantages<br />

of using eye gaze based pointing devices. In<br />

short, using the eye gaze for controlling the cursor position<br />

poses several challenges:<br />

Strain: It is quite strenuous to control the cursor through<br />

eye gaze for a long time as the eye muscles soon become<br />

fatigued. Fejtova and colleagues [9] reported eye strain in<br />

six out of ten able bodied participants in their study.<br />

Accuracy: An eye gaze tracker does not always work<br />

accurately; even the best eye trackers provide an accuracy<br />

of 0.5° of visual angle. This often makes clicking on<br />

small targets difficult. Donegan and colleagues [5] also<br />

reported problems in precision and speed of an eye gaze<br />

based system. Existing AAC systems based on eye gaze therefore<br />

often change the screen layout and enlarge screen items,<br />

but this is not a scalable solution.<br />

Clicking: Clicking or selecting a target using only eye<br />

gaze is also a problem. It is generally performed through<br />

increased dwell time or blinking. But either solution increases<br />

the chance of false positives or missed clicks.<br />

We tried to solve this problem by combining eye gaze

tracking and a scanning system in a unique way. Any<br />

pointing movement has two phases [10]:<br />

An initial ballistic phase, which brings one near the target.<br />

A homing phase, which consists of one or more precise<br />

sub-movements to home on the target.<br />

We use eye gaze tracking for the initial ballistic<br />

phase and switch to the scanning system for the homing<br />

phase and clicking. The approach is similar to the<br />

MAGIC system [14] though it replaces the regular pointing<br />

device with the scanning system. Our system works in<br />

the following way.<br />

2. The proposed system<br />

Initially, the system moves the pointer across the screen<br />

based on the eye gaze of the user. The user sees a small<br />

button moving across the screen and the button is placed<br />

approximately where they are looking on the screen. We<br />

extract the eye gaze position by using the Tobii SDK [12]<br />

and we use an averaging filter that updates the pointer position<br />

every 500 msec. The users can switch to the scanning<br />

system by pressing a key at any time during eye tracking.<br />

When they look at the target, the button (or pointer) appears<br />

near or on the target. At this point, the user is supposed<br />

to press a key to switch to the scanning system<br />

for homing and clicking on the target.<br />
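The averaging step can be sketched as follows, assuming that gaze samples arrive as (timestamp, x, y) tuples and that one pointer position is emitted per 500 msec window. The sketch does not use the Tobii SDK; `Sample` and `average_filter` are hypothetical names, and the windowing scheme is an assumption, since the paper does not describe the filter in detail.<br />

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, float, float]  # (timestamp in ms, x in pixels, y in pixels)


def average_filter(samples: Iterable[Sample],
                   window_ms: float = 500.0) -> List[Tuple[float, float]]:
    """Average gaze samples over consecutive windows of window_ms each."""
    positions: List[Tuple[float, float]] = []
    t0 = None      # timestamp of the first sample
    bucket = None  # index of the window currently being filled
    xs: List[float] = []
    ys: List[float] = []
    for t, x, y in samples:
        if t0 is None:
            t0 = t
        b = int((t - t0) // window_ms)
        if bucket is not None and b != bucket:
            # The window is complete: emit one averaged pointer position.
            positions.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            xs, ys = [], []
        bucket = b
        xs.append(x)
        ys.append(y)
    if xs:
        positions.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return positions
```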

We have used a particular type of scanning system,<br />

known as eight-directional scanning [3], to navigate across<br />

the screen. In the eight-directional scanning technique the<br />

pointer icon is changed at regular time intervals to show<br />

one of eight directions (Up, Up-Left, Left, Left-Down,<br />

Down, Down-Right, Right, Right-Up). The user can<br />

choose a direction by pressing the switch when the<br />

pointer icon shows the required direction. After getting<br />

the direction choice, the pointer starts moving. When the<br />

pointer reaches the desired point on the screen, the user<br />

has to make another key press to stop the pointer movement<br />

and make a click. A state chart diagram of the scanning<br />

system is shown in Figure 1, which is the same for user<br />

and device spaces in this case. A demonstration of the<br />

scanning system can be seen at<br />

http://www.youtube.com/watch?v=0eSyyXeBoXQ&feature=user.<br />
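The behaviour described above can be sketched as a two-state machine driven by a timer and a single switch. This is a simplified reading of the text, not a reproduction of the state chart in Figure 1: the class name, the step size, the use of one timer for both scanning and moving, and the omission of the exit button and of screen borders are assumptions of this sketch.<br />

```python
from typing import List, Optional, Tuple

# Direction order as listed in the text; screen y grows downwards.
DIRECTIONS: List[Tuple[str, Tuple[int, int]]] = [
    ("Up", (0, -1)), ("Up-Left", (-1, -1)), ("Left", (-1, 0)),
    ("Left-Down", (-1, 1)), ("Down", (0, 1)), ("Down-Right", (1, 1)),
    ("Right", (1, 0)), ("Right-Up", (1, -1)),
]


class EightDirectionalScanner:
    def __init__(self, x: int, y: int, step: int = 5) -> None:
        self.x, self.y, self.step = x, y, step
        self.state = "SCANNING"  # or "MOVING"
        self.index = 0           # index of the direction currently shown

    @property
    def direction(self) -> str:
        return DIRECTIONS[self.index][0]

    def tick(self) -> None:
        """Timer tick: show the next direction, or move the pointer one step."""
        if self.state == "SCANNING":
            self.index = (self.index + 1) % len(DIRECTIONS)
        else:
            dx, dy = DIRECTIONS[self.index][1]
            self.x += dx * self.step
            self.y += dy * self.step

    def press(self) -> Optional[Tuple[int, int]]:
        """Switch press: choose the shown direction, or stop and click."""
        if self.state == "SCANNING":
            self.state = "MOVING"
            return None
        self.state = "SCANNING"
        self.index = 0
        return (self.x, self.y)  # position of the click
```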

The user can move back to the eye gaze tracking system<br />

from the scanning system by selecting the exit button in<br />

the scanning interface (Figure 2). A couple of videos of<br />

the system can be found at the following links.<br />

Screenshot: http://www.youtube.com/watch?v=UnYVO1Ag17U<br />

Actual usage: http://www.youtube.com/watch?v=2izAZNvj9L0<br />

The technique is faster than a scanning-only interface<br />

as users can move the pointer over a large distance on the<br />

screen more quickly with their eye gaze than with a single<br />

switch scanning interface alone. It is also less strenuous than<br />

eye gaze only interfaces because users can<br />

switch back and forth between eye gaze tracking and<br />

scanning, which gives rest to the eye muscles. Additionally,<br />

since they need not home on a target using eye<br />

gaze, they are relieved from looking at a target for a long<br />

time to home and click on it. Finally, this technique depends<br />

less on the accuracy of the eye tracker as eye<br />

tracking is only used to bring the cursor near the target (as<br />

opposed to on the target), so it can be used with low cost<br />

and low accuracy web cam based eye trackers.<br />

Figure 1. State Transition Diagram of the eight-directional<br />

scanning mechanism with a single switch<br />

Figure 2. Screenshot of the scanning interface<br />

The only disadvantage of the technique is that it seems<br />

slower than an eye gaze only interface, as users need to<br />

switch to the slower scanning technique for each<br />

pointing task. We therefore conducted the following user study<br />

to compare the speed of our system with that of<br />

eye gaze only pointing.<br />

3. The study<br />

3.1. Procedure<br />

We conducted the ISO 9241 pointing task with combinations<br />

of three target widths (20, 30 and 40 pixels)<br />

and three target amplitudes (180, 240 and 300 pixels). Each participant<br />

undertook the task in two conditions: using only<br />

eye gaze for pointing, or using both eye gaze and eight-directional<br />

scanning for pointing. None of the users had used<br />

this system before, and all were trained adequately before<br />

undertaking the trials. The training data were not used in<br />

the analysis.<br />
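To give a feeling for how demanding these targets are, the following sketch computes the index of difficulty in the Shannon formulation, ID = log2(A / W + 1), which is commonly used with ISO 9241 style pointing tasks. The paper does not report indices of difficulty, and pairing every width with every amplitude is an assumption of this sketch.<br />

```python
import math
from itertools import product
from typing import Dict, Tuple

WIDTHS = (20, 30, 40)         # target widths in pixels
AMPLITUDES = (180, 240, 300)  # target amplitudes in pixels


def index_of_difficulty(amplitude: float, width: float) -> float:
    """Shannon formulation of the index of difficulty, in bits."""
    return math.log2(amplitude / width + 1)


def difficulty_table() -> Dict[Tuple[int, int], float]:
    """Index of difficulty for every (amplitude, width) pairing."""
    return {(a, w): index_of_difficulty(a, w)
            for a, w in product(AMPLITUDES, WIDTHS)}
```

Under these assumptions the easiest pairing is the 40 pixel target at 180 pixels and the hardest is the 20 pixel target at 300 pixels, at exactly 4 bits.<br />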

3.2. Material<br />

We used a desktop computer with a 12.5″ monitor with 1280 × 800<br />

pixels, running the Windows 7 operating system. We used a<br />

Tobii X120 eye tracker [12] with the Tobii SDK and an<br />

averaging filter to detect points of eye gaze fixation. Figure<br />

3 shows a snapshot of the experimental set up. None<br />

of the participants had any problem with the set up.<br />

Figure 3. Experimental set up<br />

3.3. Participants<br />

We collected data from 8 able bodied participants (7 male<br />

and 1 female) with an average age of 27. We do not expect the results<br />

to differ for users with disabilities because<br />

o We assume that users with disabilities who can use an eye gaze<br />

based system have eye muscles as strong as those of<br />

able bodied users.<br />

o Our previous study [4] did not find any statistically<br />

significant difference between able bodied<br />

and disabled users for the scanning interface.<br />

3.4. Results<br />

The mean movement time was higher for the eye tracking plus<br />

scanning system, while the variance was higher for the eye<br />

tracking only system (Figure 4). However, the difference in means was<br />

not significant in an unequal variance t-test (p > 0.05). We<br />

compared the average movement time for each input modality<br />

with respect to individual participants,<br />

target width and amplitude (Figures 5, 6, and 7). It<br />

can be seen from Figure 5 that only 2 out of 8 participants<br />

(P2 and P4) took significantly more time with the eye<br />

tracking plus scanning system than with the eye tracking only<br />

system. There were at least three occasions when participants<br />

failed to point at 20 pixel targets using the eye tracking only<br />

system. We found that the eye tracking only system produced<br />

significantly shorter (p < 0.05) movement times for the 240<br />

pixel target amplitude, while the differences in movement<br />

times for the other combinations of target width and amplitude<br />

were not significant in an unequal variance t-test.<br />

The eye tracking plus scanning system tended to produce<br />

shorter movement times for the 300 pixel target amplitude (Figure<br />

7) as the eye tracker apparently lost some accuracy in the<br />

periphery of the screen. Finally, all participants felt that the eye<br />

tracking plus scanning system was more comfortable than<br />

the eye tracking only system because their eye<br />

muscles could rest while using the scanning system.<br />
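The unequal variance t-test used above is Welch's t-test. As a reminder of what it computes, the following sketch derives the t statistic and the Welch-Satterthwaite degrees of freedom from two samples using only the standard library. The function name `welch_t` is hypothetical, and any data passed to it here is illustrative rather than taken from the study; obtaining a p-value additionally requires the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.<br />

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) of Welch's t-test."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```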

[Figures 4 to 7 are plots of movement time for the two conditions, eye tracking only (Only ET) and eye tracking plus scanning (ET & Scanning). Figure 4 is a box plot of the two conditions (ET and ETSCAN, N = 64). Figures 5 to 7 plot Average Movement Time (in sec) against Participants (1 to 8), Width of Target (20, 30 and 40 pixels) and Distance to Target (180, 240 and 300 pixels), respectively.]<br />

Figure 4. Comparing movement time<br />

Figure 5. Comparing movement time w.r.t. participants<br />

Figure 6. Comparing movement time w.r.t. target width<br />

Figure 7. Comparing movement time w.r.t. amplitude<br />

3.5. Discussion<br />

The results show that using the scanning system with the<br />

eye tracking system did not increase pointing time significantly<br />

compared to the eye gaze only system. The high<br />

variance of the eye tracking only system also indicates<br />

that in some cases users took a very long time to point,<br />

which is likely to frustrate them. It should be noted that<br />

we used the Tobii tracker [12] for this study, which is currently<br />

the best on the market for accuracy. With a low cost and low accuracy<br />

eye tracker (like a web cam based one), the eye tracking<br />

only system will be harder to use, while the eye<br />

tracker plus scanning system will not suffer much as the<br />

technique does not need high accuracy from the eye<br />

tracker. We used an averaging filter to extract points of eye<br />

gaze fixation; a better filtering algorithm [1]<br />

should increase the accuracy of both systems equally. We<br />

have used a scan delay of 1 s for the scanning system<br />

and a dwell time of 500 ms for the eye gaze tracking<br />

system in this study; both can be reduced further to<br />

shorten movement time for expert users. Additionally,<br />

this new technique is faster than a scanning-only system<br />

while giving more comfort and accuracy than an<br />

eye-gaze-only pointing system. Our system is less<br />

proactive than MAGIC pointing [14], as the user can<br />

manually switch either eye gaze tracking or<br />

scanning on and off at any time with a single switch<br />

press. It also seems more user-friendly than Bates’ system [2],<br />

as operating a push-button switch is easier than operating<br />

a Polhemus InsideTrack device by elevating the shoulder.<br />

Our system can also address the challenges faced by Fejtova et al.<br />

[9] in developing an eye-gaze-controlled wheelchair, as<br />

the user can switch off eye tracking temporarily and clicking<br />

is done through the scanning system, which reduces<br />

the likelihood of accidental or missed clicks. We<br />

are currently integrating the system with a webcam-based<br />

eye tracker to develop a low-cost interaction device.<br />

This technique also has applications beyond computer<br />

accessibility software. It can be used to provide<br />

hands-free access to a setup with multiple displays (or<br />

control screens), where the eye tracking system locates<br />

a particular portion of the screen or control display and<br />

the scanning technique is used to operate inside that<br />

display. It would also help to overcome situational<br />

impairments, such as using an electronic display<br />

in a moving vehicle, where it is difficult to use a pointing<br />

device or touch screen. The eye tracking and scanning<br />

techniques both require minimal input from the user, so<br />

the user need not disengage from the main task (such as<br />

driving the car) to interact with another device.<br />
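
The averaging filter and the dwell-time selection mentioned in this discussion can be illustrated with a minimal Python sketch. Only the 500 ms dwell time is taken from the study; the window size, the sample format (timestamp in ms, x, y), the target rectangle and both function names are assumptions made for illustration, not the implementation used in the study.

```python
from collections import deque


def moving_average(points, window=5):
    """Smooth a sequence of (x, y) gaze points with a sliding mean."""
    buf = deque(maxlen=window)  # keeps only the most recent `window` points
    smoothed = []
    for x, y in points:
        buf.append((x, y))
        n = len(buf)
        smoothed.append((sum(p[0] for p in buf) / n,
                         sum(p[1] for p in buf) / n))
    return smoothed


def dwell_selected(samples, target, dwell_ms=500):
    """Return True once the gaze stays inside `target` for dwell_ms.

    samples: list of (t_ms, x, y); target: (left, top, width, height).
    """
    left, top, width, height = target
    entered = None  # time at which the gaze entered the target
    for t, x, y in samples:
        inside = left <= x < left + width and top <= y < top + height
        if not inside:
            entered = None          # leaving the target resets the timer
        elif entered is None:
            entered = t
        if entered is not None and t - entered >= dwell_ms:
            return True
    return False
```

Lowering `dwell_ms` in such a scheme is what would shorten movement time for expert users, at the cost of more unintended selections.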

4. Conclusions<br />

In this paper we have introduced a new input device that combines<br />

an eye gaze tracker with a scanning interface for<br />

people with severe disabilities. The system addresses a few<br />

problems of existing eye-gaze-tracking-based systems by<br />

offering more accuracy and comfort to users, a claim<br />

supported by our user study.<br />

Acknowledgement<br />

We are grateful to our participants for taking part in our<br />

study. We would also like to thank Prof. Peter Robinson<br />

of University of Cambridge Computer Laboratory for his<br />

help in organizing the study.<br />

References<br />

1. Adjouadi M. et al., Remote Eye Gaze Tracking<br />

System as a Computer Interface for Persons with<br />

Severe Motor Disability. ICCHP 2004, LNCS<br />

3118 2004. 761-769.<br />

2. Bates R., Multimodal Eye-Based Interaction for<br />

Zoomed Target Selection on a Standard Graphical<br />

User Interface. INTERACT 1999.<br />

3. Biswas P. and Robinson P., A New Screen<br />

Scanning System based on Clustering Screen<br />

Objects, Journal of Assistive Technologies, Vol.<br />

2 Issue 3 September 2008, pp. 24-31, ISSN:<br />

1754-9450<br />

4. Biswas P. and Robinson P., The effects of hand<br />

strength on pointing performance, Designing Inclusive<br />

Interactions, Springer-Verlag, pp. 3-12,<br />

ISBN: 978-1-84996-165-3<br />

5. Donegan M. et al., Understanding users and<br />

their needs, Universal Access in the Information<br />

Society 8 (2009): 259-275<br />

6. Duchowski A. T., Eye Tracking Methodology.<br />

Springer-Verlag, 2007.<br />

7. Eye Pointing, URL:<br />

http://abilitynet.wetpaint.com/page/Eye+Pointing, Accessed on 19th August<br />

2010.<br />

8. EyeTech Digital System, URL: http://www.eyetechds.com/assistivetech/index.htm,<br />

Accessed on 19th August 2010.<br />

9. Fejtova M. et al., Hands-free interaction with a<br />

computer and other technologies, Universal Access<br />

in the Information Society 8 (2009): 277-295<br />

10. Fitts P.M., The Information Capacity of The<br />

Human Motor System In Controlling The Amplitude<br />

of Movement. Journal of Experimental Psychology<br />

47 (1954): 381-391.<br />

11. Majaranta P. and Raiha K. Twenty Years of Eye<br />

Typing: Systems and Design Issues. Eye Tracking<br />

Research & Application 2002. 15-22.<br />

12. Tobii Eye Tracker, URL:<br />

http://www.imotionsglobal.com/Tobii+X120+Eye-Tracker.344.aspx Accessed on 12th December<br />

2008<br />

13. Ward D., Dasher with an eye-tracker, URL:<br />

http://www.inference.phy.cam.ac.uk/djw30/dasher/eye.html, Accessed on 19th August 2010.<br />

14. Zhai S., Morimoto C. and Ihde S., Manual and<br />

Gaze Input Cascaded (MAGIC) Pointing. ACM<br />

SIGCHI Conference on Human Factors in Computing<br />

System (CHI) 1999.

Gesture Recognition Exploration using Haartraining and<br />

KNN in a 3D Racing Game<br />

Kamlesh Mistry<br />

School of Computing<br />

Teesside University, UK<br />

mistry.kamlesh@gmail.com<br />

Abstract<br />


Automatic recognition of body language is challenging but<br />

inspiring as a natural control channel for intelligent user<br />

interfaces. In this paper we report automatic car navigation<br />

via hand gesture recognition in a 3D racing game<br />

application. We have employed Haartraining and k-nearest<br />

neighbor (KNN) algorithms to recognize hand gestures with<br />

the assistance of image processing. Our study has explored<br />

vision-based gesture tracking and dynamic gesture<br />

recognition in a real-time navigation game application. The<br />

gesture recognition system has been embedded in a 3D<br />

virtual world built with the assistance of a games engine,<br />

Irrlicht. Sound effects have also been employed in our<br />

application. We have also conducted user testing with 5<br />

testing subjects to evaluate the effectiveness of KNN-based<br />

gesture recognition. Evaluation results for the Haartraining-based<br />

recognition have also been provided. Overall the<br />

gesture recognition performance is very promising. Our<br />

work contributes to the workshop themes on natural user<br />

interfaces in novel, intelligent interaction systems,<br />

navigation systems and assistive functionalities.<br />

Author Keywords<br />

Gesture recognition, Haartraining, and K-nearest neighbor<br />

ACM Classification Keywords<br />

H.5.2 [User interfaces]<br />

Introduction<br />


Multimodal interaction based on the recognition and<br />

interpretation of body language and verbal input is<br />

challenging but inspiring for the building of efficient<br />

intelligent user interfaces. Advanced educational or<br />

entertainment applications residing in 3D virtual<br />

environments also require such a natural communication<br />

channel to enhance user experience. In pursuit of this<br />

research goal, we have developed a robot car with<br />

automatic navigation under the control of continuous hand<br />

gestures. In our previous work, we also produced a<br />

neural-network-driven automatic navigation component to enable a<br />

robot car to learn road and track conditions and handle tough<br />

turning situations successfully. Overall, we believe our<br />

developments have the potential to benefit the development of<br />

innovative user interfaces for navigation and assistive<br />

functionalities for driving in real-life situations.<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />

Li Zhang<br />

School of Computing<br />

Teesside University, UK<br />

l.zhang@tees.ac.uk<br />


There have been various inspiring research studies<br />

conducted in the gesture recognition field. Billon et al. [1]<br />

have reported a gesture recognition system to facilitate the<br />

communication between a virtual actor and a real human<br />

actor in a martial arts virtual game setting. Principal<br />

Component Analysis has been used to generate the artificial<br />

gesture representation, which was used for real-time gesture<br />

segmentation and recognition. Elmezain et al. [2] presented<br />

a hidden Markov model (HMM) based continuous gesture<br />

recognition system for the recognition of Arabic numerals 0-9.<br />

Tomibayashi et al. [4] have also produced a wearable DJ<br />

system to enable DJs to perform freely by using wearable<br />

computing and gesture recognition technologies. Wearable<br />

acceleration sensors have been used in their study to assist<br />

gesture recognition. Their system has been tested in real stage<br />

performances. Nam and Wohn [3] have presented<br />

another HMM based space-time hand gesture recognition<br />

system. In their system, HMM has been used to model the<br />

spatial variance and the time-scale variance in the hand<br />

movement to assist the recognition of the continuous,<br />

connected hand movement patterns. In our work, we<br />

attempt to recognize continuous connected hand<br />

movements and gestures using two different approaches,<br />

Haartraining and KNN. Three key gestures are<br />

recognized by KNN and five key gestures are<br />

identified by Haartraining. The recognized gestures are<br />

also used for real-time automobile navigation in a 3D<br />

racing game for entertainment purposes. We also provide<br />

evaluation results to assess the effectiveness of both approaches.<br />


K-nearest neighbor has been widely used for pattern<br />

recognition. We adopt it in our application to recognize<br />

key gestures in real time using a webcam. Our recognition<br />

process is carried out in three steps: image<br />

pre-processing, vector generation, and final classification.<br />

At the training stage, raw images of hand gestures are<br />

collected from the webcam. First, these collected images<br />

are cropped. An example original image and the<br />

corresponding cropped image are shown in Figure 1, in<br />

which white pixels represent the object of<br />

interest and black pixels indicate the<br />

background. Compared with the original image, the<br />

cropped image, which is used for the training of KNN,<br />

differs only slightly in width and height.<br />

These cropped images are then converted into binary files<br />

in order to feed them to KNN. Vector generation has been<br />

used to convert the pre-processed images into the training<br />

binary files with the appropriate format. We have used each<br />

KNN class to represent a particular gesture. All the images<br />

representing one particular gesture have been stored under<br />

that particular KNN class. We have used the .pbm format to<br />

store all the image files for training, since this format<br />

stores the width and height of each image as ASCII<br />

decimal numbers. The training files are named<br />

‘CNN.pbm’, where C is the KNN class number and<br />

NN is the number of the image file within that class. We<br />

have used 300 images in total, representing 3 different<br />

hand gestures (100 images for each gesture), for the training<br />

of KNN. The three gestures recognized by KNN are shown<br />

in Figure 2. A scalar matrix has thus been produced to<br />

represent all the training data.<br />
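
As a rough illustration of the file convention described above, the following Python sketch splits a ‘CNN.pbm’ file name into class number and image number and turns an ASCII PBM image into a flat feature vector. It assumes the plain-text (P1) variant of PBM and a one-digit class followed by a two-digit image number; both function names are hypothetical and this is not the authors' code.

```python
def parse_training_name(filename):
    """Split a 'CNN.pbm' name into (class_number, image_number)."""
    stem, ext = filename.rsplit(".", 1)
    if ext.lower() != "pbm" or len(stem) != 3 or not stem.isdigit():
        raise ValueError(f"unexpected training file name: {filename}")
    return int(stem[0]), int(stem[1:])


def parse_ascii_pbm(text):
    """Parse a plain (P1) PBM image into (width, height, flat_bits)."""
    # Drop comments, which run from '#' to the end of the line.
    lines = [line.split("#", 1)[0] for line in text.splitlines()]
    tokens = " ".join(lines).split()
    if tokens[0] != "P1":
        raise ValueError("not a plain PBM (P1) image")
    width, height = int(tokens[1]), int(tokens[2])
    # Pixel bits may be written without whitespace, so join and re-split.
    bits = [int(ch) for ch in "".join(tokens[3:])]
    if len(bits) != width * height:
        raise ValueError("pixel count does not match header")
    return width, height, bits
```

Stacking the flat bit vectors of all training files row by row would give one matrix for the whole training set. Note that the PBM format itself encodes black as 1 and white as 0.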

Figure 1. An example original and its cropped image after preprocessing<br />

(from left to right).<br />

Figure 2. Three key gestures recognized by KNN, including a<br />

palm gesture (for stopping), a fist gesture (for acceleration)<br />

and a pistol-like gesture (for turning).<br />

At the testing stage, raw images collected from the webcam also<br />

need to be pre-processed before being fed to KNN.<br />

Since the images captured from the webcam are 32-bit color<br />

images and our training binary images are only 8-bit, we<br />

have used a skin detection algorithm to convert the captured<br />

32-bit testing images into 8-bit ones. The following<br />

procedure is used to detect skin color. First,<br />

we access the RGB values of each pixel using the<br />

following formulas.<br />

p = y * image->widthStep + x * image->nChannels<br />

blue = imageData[p]; green = imageData[p + 1]; red = imageData[p + 2]<br />

where p is the byte offset of pixel P, widthStep is the number of<br />

bytes per image row and nChannels is the number of bytes per pixel.<br />

imageData is the array that stores all<br />

the pixels of the image being processed. Thus imageData[p],<br />

imageData[p+1] and imageData[p+2] are the blue, green, and<br />

red color values for pixel P, and x and y are the coordinates of<br />

the pixel P.<br />
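
To make the indexing concrete, here is a small Python sketch of how the byte offset of a pixel is computed in an interleaved blue-green-red buffer with padded rows, the layout used by OpenCV's IplImage. Because each pixel occupies several bytes, the column index is scaled by the channel count. The buffer contents and the helper name are made up for illustration.

```python
def bgr_at(image_data, width_step, n_channels, x, y):
    """Return (blue, green, red) of pixel (x, y) from a flat byte buffer.

    width_step is the number of bytes per image row (including padding);
    n_channels is the number of bytes per pixel.
    """
    p = y * width_step + x * n_channels
    return image_data[p], image_data[p + 1], image_data[p + 2]


# A 2x2 three-channel image whose rows are padded to 8 bytes each.
data = bytes([
    10, 20, 30,   40, 50, 60,   0, 0,   # row 0: pixels (0,0) and (1,0)
    70, 80, 90,   15, 25, 35,   0, 0,   # row 1: pixels (0,1) and (1,1)
])
```

With a 32-bit image the same function would be called with `n_channels=4`; the fourth byte of each pixel is simply skipped.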

Figure 3. An example image before and after skin detection<br />

processing.<br />

Then the following criteria are used to detect skin<br />

color: red > 95, green > 40 and blue > 20, and max(RGB<br />

values for pixel P) - min(RGB values for pixel P) > 15. If<br />

a pixel fulfills all of these conditions, we re-assign it to a<br />

white pixel with the RGB value (255, 255, 255); otherwise,<br />

we re-assign it to a black pixel with the RGB value (0, 0, 0).<br />

Figure 3 shows an example image before and after skin<br />

detection processing.<br />
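
The thresholding rule can be written down in a few lines. The Python sketch below applies exactly the four conditions quoted in the text to each pixel; representing the result as a single-channel mask with the values 255 and 0 is an assumption made here to match the 8-bit output, and the function names are illustrative.

```python
def is_skin(red, green, blue):
    """Apply the RGB threshold rule quoted in the text to one pixel."""
    return (red > 95 and green > 40 and blue > 20
            and max(red, green, blue) - min(red, green, blue) > 15)


def skin_mask(pixels):
    """Map rows of (red, green, blue) tuples to 255 (skin) or 0 (other)."""
    return [[255 if is_skin(r, g, b) else 0 for (r, g, b) in row]
            for row in pixels]
```

Because the rule looks only at color, skin-colored regions other than the hand also end up white in the mask.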

In our application, the KNN algorithm is used for the<br />

recognition of the testing gestures. KNN is widely used<br />

in pattern recognition and machine learning. It<br />

classifies a test query by a majority vote of its<br />

neighbors: the query is labeled with the class most<br />

common amongst its k nearest neighbors. We have used a<br />

weighted KNN in order to avoid the domination of<br />

classes with more frequent examples that occurs in<br />

basic ‘majority voting’ classification. In our<br />

application, the KNN classifier therefore weights<br />

the contribution of each of the k neighbors according to<br />

its distance to the query point Xq, assigning greater<br />

weight Wi to nearer neighbors. The following equation<br />

has been used in our application.<br />

$$F(X_q) = \arg\max_{v} \sum_{i=1}^{k} W_i \, \delta(v, f(X_i))$$<br />

where Xq is a testing image containing a test gesture; v is<br />

the vector of the training set and Xi represents each KNN<br />

class. δ(v, f(Xi)) represents the distance between the test<br />

query and each KNN class. Using this KNN implementation,<br />

we have successfully classified 3 different continuous<br />

gestures with promising accuracy rates in real-time<br />

applications (see the evaluation section for details). We also<br />

noticed that KNN’s performance can be influenced by the<br />

backgrounds shown in the images. To avoid this<br />

problem, we have also used another approach, Haartraining,<br />

to perform gesture tracking and assist the recognition of<br />

gesture movement, providing another effective<br />

control channel for automatic car navigation that is<br />

not affected by image<br />

backgrounds.<br />
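
A distance-weighted vote of this kind can be sketched as follows. In the standard textbook formulation, δ(v, f(Xi)) equals 1 when the i-th neighbor belongs to class v and 0 otherwise, so each neighbor adds its weight Wi to the score of its own class. The paper does not state the weighting function; the inverse squared Euclidean distance used below is a common choice and an assumption here, as are the function name and the toy data.

```python
import math
from collections import defaultdict


def weighted_knn(query, training, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors.

    training: list of (feature_vector, label) pairs.
    """
    # Euclidean distance from the query to every training vector.
    dists = sorted(
        (math.dist(query, vec), label) for vec, label in training
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        if d == 0.0:
            return label  # exact match: its label wins outright
        votes[label] += 1.0 / (d * d)  # assumed weight W_i = 1 / d^2
    return max(votes, key=votes.get)


# Toy data: two nearby "fist" samples and three distant "palm" samples.
training = [((0, 0), "fist"), ((0, 1), "fist"),
            ((5, 5), "palm"), ((5, 6), "palm"), ((6, 5), "palm")]
```

With `k=5` and the query `(1, 1)`, an unweighted majority vote would favor "palm" (three neighbors against two), whereas the weighted vote favors the two much closer "fist" samples.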


Haartraining is well known for tasks such as face<br />

and pedestrian detection. In our application, Haartraining<br />

is used to recognize five gesture movements:<br />

fist gestures indicating car movement up, down, left and<br />

right, and a palm gesture for halt (see Figures 4 & 5).<br />

For the image acquisition process, we have also used a<br />

webcam to collect positive and negative image samples.<br />

Positive images are those containing objects<br />

of interest (gestures). In other words, positive images are<br />

used to identify gestures. Training is not affected<br />

even if the backgrounds of the positive images<br />

differ from each other. Negative image samples<br />

contain only backgrounds and no objects of interest. They<br />

can be any images, such as landscape images, car photos,<br />

and various textures. Negative images are usually used to<br />

improve gesture recognition performance (since they allow<br />

gestures to be recognized against any background).<br />

In order to provide robust training, we have collected 116<br />

positive image samples. Then we divided the positive<br />

samples into training and testing sets. The former contains<br />

86 image samples and the latter has 30 samples. We have<br />

also collected 178 negative image samples for the training<br />

purposes. Figures 4 & 5 respectively show positive sample<br />

images for the training of fist and palm positions.<br />

Figure 4. Positive images representing the 4 key gestures (from<br />

left to right: a basic fist gesture followed by fist gestures<br />

indicating car moving left, right, up and down).<br />

Figure 5. Positive image samples for training representing<br />

palm gestures for stopping.<br />

Vector generation is also needed to convert the positive and<br />

negative images into the appropriate format to feed<br />

Haartraining at the training stage. The process is briefly<br />

explained as follows. A text file has been produced to<br />

contain the names of all the negative sample files (e.g.<br />

negative1.jpg; negative2.jpg etc), while another text file for<br />

positive image files has also been created with the names of<br />

all the positive images, number of objects and coordinates<br />

of bounding box over the objects of interest (e.g.<br />

positive1.jpg, number_of_object(1), 20 20 50 50 (x, y,<br />

width, height)). The ‘createsamples’ command was also<br />

used to create training and testing vector samples in order to<br />

avoid distortions.<br />
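
The two description files can be produced with a few lines of code. The Python sketch below formats entries in the layout given in the example above (file name, number of objects, then x, y, width and height for each bounding box), which is also the layout OpenCV's sample-creation tools expect. The function names and file names are placeholders, not those used by the authors.

```python
def positive_entry(image_name, boxes):
    """Format one line of the positive-sample description file.

    boxes: list of (x, y, width, height) bounding boxes in the image.
    """
    coords = " ".join(f"{x} {y} {w} {h}" for x, y, w, h in boxes)
    return f"{image_name} {len(boxes)} {coords}"


def negative_listing(image_names):
    """Negative samples are listed one file name per line."""
    return "\n".join(image_names)
```

For the example in the text, `positive_entry("positive1.jpg", [(20, 20, 50, 50)])` gives the line `positive1.jpg 1 20 20 50 50`.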

The AdaBoost algorithm embedded in the ‘createsamples’ command<br />

has been used for the training of the samples. AdaBoost<br />

trains a strong classifier as a linear<br />

combination of weak classifiers built on the best features<br />

from the training set. For example, if there are weak image samples<br />

with comparatively poor lighting or low contrast, the AdaBoost<br />

approach is able to improve the visibility of the objects of<br />

interest with better contrast. Finally, the Haartraining<br />

command is used to train the classifier. The evaluation<br />

results of the Haartraining approach for gesture tracking and<br />

recognition are also provided in the evaluation section.<br />


The produced gesture recognition components using KNN<br />

and Haartraining have been integrated with the 3D games<br />

world for the control of the car navigation. An open source<br />

games engine, Irrlicht (www.irrlicht.org), and Newton<br />

physics, have been used to construct the 3D world<br />

environment. The OpenCV library has been used for 3D<br />

world image processing. Also, the sound library, IrrKlang,<br />

provided by the developers of the Irrlicht games engine, has<br />

been employed to produce sound effects.<br />

Briefly, to develop the game world, we load<br />

the racetrack and car as meshes, and set the graphics<br />

API to OpenGL. We also apply physics to the car mesh by<br />

using the Newton physics library. Then we add the<br />

racetrack into the physics entity so that the car is the object<br />

with the track as the entity.<br />

In order to obtain the input data for the image processing,<br />

we have used the OpenCV library. After capturing the<br />

images from the webcam, we used IplImage for storing the<br />

image files. Overall, we have collected continuous images<br />

for our application and the collected images have been used<br />

for pre-processing and classification.<br />

For the control of the robot car using KNN, we have used<br />

the following gesture commands: a fist representing<br />

acceleration, a palm representing stopping and a pistol-like<br />

gesture representing turning. Therefore based on the output<br />

of KNN, which has used image files stored in IplImage as<br />

testing images, the robot car can navigate accordingly. For<br />

example, if the output of KNN indicates a fist gesture, then<br />

the robot car performs acceleration.<br />

When Haartraining is used to control the vehicle, we<br />

define the following gesture commands for<br />

navigation: a palm gesture for stopping, a fist position to<br />

the very left indicating turning left, a fist position to the<br />

very right representing turning right, a fist position to the<br />

top indicating acceleration and a fist position to the bottom<br />

indicating reverse movement. First of all, if a gesture has been<br />

recognized by Haartraining, we need to check on which<br />

axis and at what position the gesture is recognized. In order<br />

to achieve the recognition, a Haartraining class has been<br />

implemented containing all the necessary functions such as<br />

loading the Haarcascade, testing the cascade, and drawing a<br />

bounding box on a desired gesture. If the position of the<br />

recognized gesture is less than 100 on the x-axis (a fist gesture<br />

to the very left) then the car will turn left. Otherwise, if it is<br />

more than 500 on the x-axis (a fist gesture to the very right)<br />

then the car will turn right. Similar rules apply to the<br />

forward and reverse control: if the position of the<br />

recognized gesture is less than 100 on the y-axis (a fist gesture<br />

to the bottom) then the car will move backwards, and if it is<br />

greater than 400 on the y-axis (a fist gesture to the top) then it<br />

will move forwards. Figure 6 shows a system screenshot.<br />

Figure 6. A system screenshot.<br />
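
The position-based control rules amount to a small decision function, sketched here in Python. The thresholds (100 and 500 on the x-axis, 100 and 400 on the y-axis) and the orientation of the y-axis follow the description above; the gesture labels, the command names, the "none" fallback and the choice to check the x-axis before the y-axis are assumptions made for illustration.

```python
def haar_command(gesture, x, y):
    """Map a detected gesture and its position to a driving command.

    Thresholds and axis orientation follow the description in the text.
    """
    if gesture == "palm":
        return "stop"
    if gesture != "fist":
        return "none"
    if x < 100:       # fist at the very left
        return "left"
    if x > 500:       # fist at the very right
        return "right"
    if y < 100:       # fist at the bottom, as stated in the text
        return "reverse"
    if y > 400:       # fist at the top, as stated in the text
        return "forward"
    return "none"     # fist in the neutral central region
```

A fist detected in the central region returns no command, which corresponds to the basic fist gesture in Figure 4.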


We conducted user testing with 5 subjects (males aged 20-25)<br />

to evaluate the effectiveness of our KNN-based gesture recognition<br />

component. The testing methodology for<br />

KNN was as follows. Each testing<br />

subject first had a warm-up session and then<br />

played the game using hand gestures for<br />

vehicle navigation. A video was recorded of the<br />

gestures performed by each testing subject so that they could<br />

be compared with the gesture sequence recognized<br />

by KNN to obtain an accuracy rate. With the 5 testing<br />

subjects in our user study, we obtained an average<br />

accuracy rate of 82%. Detailed results are given as a<br />

confusion matrix in Table 1, with rows<br />

representing gestures performed by testing subjects and<br />

columns showing gestures recognized by KNN.<br />

Gestures performed by users | Recognized by KNN as fist gesture | Recognized by KNN as palm gesture | Recognized by KNN as turning gesture<br />
Fist gesture | 90.47% | 4.76% | 4.76%<br />
Palm gesture | 6.66% | 86.66% | 6.66%<br />
Turning gesture | 42.85% | 0% | 57.14%<br />

Table 1. Evaluation results for KNN.<br />

From the recognition results of KNN, we noticed that most<br />

of the errors occurred when a<br />

pistol-like turning gesture was recognized as a fist<br />

gesture. This is because the skin detection algorithm<br />

sometimes confused the background of the gesture (part of<br />

the arm) with the gesture itself. Fist and palm gestures<br />

were recognized well, with accuracy rates above 80%.<br />

Haartraining was evaluated<br />

on 29 positive testing images (1 image was invalid),<br />

distinct from the training set. The performance command<br />

(opencv_performance) was used for testing and detection.<br />

Table 2 shows the evaluation results for the<br />

recognition of the testing image samples.<br />

Testing images | Correct recognition | Inaccurate recognition | Accuracy rate<br />
Positive palm images | 9 | 7 | 56.3%<br />
Positive fist images | 11 | 2 | 84.6%<br />

Table 2. Evaluation results for Haartraining.<br />

For the recognition of the palm and fist gestures using<br />

Haartraining, 9 positive images were recognized as<br />

unknown gestures: 7 palm images and 2 fist images. The<br />

main cause of the recognition errors is<br />

probably that weak images with poor lighting were included in the<br />

training set. In future work, high-quality images will be<br />

used instead to improve our system’s performance.<br />
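
The accuracy rates in Table 2 follow directly from the counts, as the short Python sketch below shows. The per-gesture counts are taken from Table 2; the overall rate across all 29 valid test images is not reported in the paper and is derived here only as an illustration.

```python
def accuracy_rate(correct, incorrect):
    """Return the share of correctly recognized samples, in percent."""
    total = correct + incorrect
    return 100.0 * correct / total


# Counts taken from Table 2.
palm = accuracy_rate(9, 7)    # 9 of 16 palm images
fist = accuracy_rate(11, 2)   # 11 of 13 fist images
overall = accuracy_rate(9 + 11, 7 + 2)  # all 29 valid test images
```

The palm and fist values agree with the 56.3% and 84.6% reported in the table after rounding.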

Compared with existing work (Billon et al. [1], with a >80%<br />

accuracy rate), the performance of both<br />

approaches is acceptable. Users also experienced effective<br />

car navigation using gestures in a real-time 3D application.<br />

The system thus has the potential to improve users’ engagement.<br />


We reported a 3D car navigation game application controlled via<br />

gesture recognition using both KNN and Haartraining.<br />

Although there is room for improvement, both approaches<br />

produced reasonable recognition results, with Haartraining<br />

able to ignore interfering background<br />

effects. In future work, we also intend to<br />

employ HMMs to extend our system with the capability of<br />

recognizing more complex (e.g. emotional) gestures to<br />

assist natural interaction for automatic navigation.<br />

References<br />


1. Billon, R., Nédélec, A. and Tisseau, J. Gesture<br />

Recognition in Flow based on PCA Analysis using<br />

Multiagent System. In Proceedings of ACE. (2008).<br />

2. Elmezain, M., Al-Hamadi, A., Appenrodt, J. and<br />

Michaelis, B. A Hidden Markov Model-Based<br />

Continuous Gesture Recognition System for Hand<br />

Motion Trajectory. In Proceedings of 19th International<br />

Conference on Pattern Recognition, (2008). 1-4.<br />

3. Nam, Y. and Wohn, K. Recognition of Space-Time<br />

Hand-Gestures using Hidden Markov Model. In Proc. of<br />

ACM VRST96 Conference. (1996). 51-58.<br />

4. Tomibayashi, Y., Takegawa, Y., Terada, T., and<br />

Tsukamoto, M. Wearable DJ System: a New Motion-<br />

Controlled DJ System. In Proceedings of ACE. (2009).

Model-Based User Interface Development<br />

in the <strong>Automotive</strong> Industry<br />

Moritz Kuemmerling<br />

German Research Center for Artificial Intelligence<br />

(<strong>DFKI</strong>)<br />

Trippstadter Strasse 122<br />

67663 Kaiserslautern, Germany<br />

Moritz.Kuemmerling@dfki.de<br />

+49 631 205 3709<br />

Abstract<br />


The time-to-market for human machine interfaces in the<br />

German automotive industry has to be reduced. The<br />

shortening of innovation cycles in other relevant industry<br />

fields and international competitors increase the pressure on<br />

German car manufacturers and their suppliers. Model-based<br />

user interface development is expected to reduce the<br />

development time significantly, thus improving the<br />

manufacturers’ competitiveness. Therefore, a new domain-specific<br />

modeling language for the specification of<br />

automotive human machine interfaces is being sought. Past<br />

approaches with similar objectives have either failed or<br />

have not been successfully established across the industry<br />

as a holistic solution. Within the scope of a new cooperative<br />

project whose partners, for the first time, completely cover the<br />

supply chain of user interface development in the automotive<br />

industry, a common solution is to be<br />

developed and established as an industry standard.<br />

Keywords<br />

Model-based user interface development, automotive HMI,<br />

domain specific language.<br />

Introduction<br />


The German automotive industry has to find a way to<br />

significantly reduce the development time for human<br />

machine interfaces (HMI) in vehicles. The reasons, among<br />

others, are the continuous development of driver assistance,<br />

communication and infotainment systems, new drive<br />

concepts as well as the continuous shortening of innovation<br />

cycles in the consumer electronics industry. To keep up<br />

with these technologies and with competitors catching up<br />

around the globe, future HMI systems will be more and<br />

more complex while their development costs and time-to-market<br />

have to be reduced. However, current HMI<br />

development processes are characterized by different,<br />

inconsistent workflows and heterogeneous tool chains. The<br />

exchange of paper-based specification documents between<br />

the process participants causes media discontinuity, inhibits<br />

version management, reduces reusability and hampers<br />

communication [2].<br />

Gerrit Meixner<br />

German Research Center for Artificial Intelligence<br />

(<strong>DFKI</strong>)<br />

Trippstadter Strasse 122<br />

67663 Kaiserslautern, Germany<br />

Moritz.Kuemmerling@dfki.de<br />

+49 631 205 3707<br />

Moreover it is impossible to automatically test the integrity<br />

and accuracy of paper based specification documents. The<br />

adoption of reliable and successful approaches from the<br />

field of model-based user interface development (MBUID)<br />

[9] is expected to be a proper remedy.<br />

To this end, a new industry-driven project has been<br />

set up whose partners – several German car<br />

manufacturers (OEMs), suppliers, a tool developer and the<br />

“Verband der Automobilindustrie e. V.” (VDA) as an<br />

association – for the first time completely cover the supply chain<br />

of HMI development in the automotive industry.<br />

Together, the partners aim to develop a new modeling<br />

language that will serve as an interface between the process<br />

participants thus avoiding media discontinuity and<br />

improving the communication among the involved actors.<br />

The intention is nothing less than to establish a new modeling<br />

language not only within the project consortium but as an<br />

industry standard.<br />

The paper is structured as follows: First, we give an<br />

overview of MBUID, existing modeling languages and<br />

past projects with similar objectives. Then we point out<br />

what we expect to do differently in our project. We also<br />

explain the impact that the project will have on MBUID as<br />

a field of research. After an outlook on our next steps, the<br />

paper ends with a conclusion.<br />



A vast number of XML-based user interface modeling<br />

languages (UIDLs) already exist in the field of MBUID.<br />

Some of the UIDLs are already standardized by OASIS<br />

and/or are subject to a continuous development<br />

process. Numerous projects and applications demonstrate their<br />

practical suitability. Some examples are UsiXML [14],<br />

UIML [1] and XIML [10, 11].<br />

The purpose of using a UIDL to develop a user interface is<br />

to systematize the HMI development process [9]. UIDLs<br />

enable the developer to systematically break down a user<br />

interface into different abstraction layers and to model these<br />

layers [8]. Thus it is for example possible to describe the<br />

behavior, the structure and the layout of a user interface<br />

independently of each other.<br />

In Figure 1 we show how an automotive HMI-system can<br />

be developed using a model-based approach. In a first step,<br />

designers and engineers describe an abstract model of the<br />

later HMI-system. The abstract model is independent from<br />

any hardware-platform and the developers can put their<br />

focus on the user requirements. In the next step, the abstract<br />

model is extended to a more concrete one. The concrete<br />

model allows the generation of virtual prototypes which can<br />

be used for first user tests of the later HMI-system. In the<br />

final step the concrete models are transformed and mapped<br />

to the platform-specific requirements of the target system.<br />

The reusability of the models decreases with each step.<br />

Figure 1. <strong>Automotive</strong> HMI development using a model-based<br />

approach.<br />
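The three steps described above (abstract model, concrete model, platform-specific mapping) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the role-to-widget table and the `head_unit` platform name are hypothetical and do not come from any of the UIDLs cited here.

```python
from dataclasses import dataclass

@dataclass
class AbstractElement:
    """Platform-independent description: what the user can do or see."""
    name: str
    role: str  # hypothetical roles: "command" or "output"

@dataclass
class ConcreteElement:
    """Concrete description: which kind of widget realizes the element."""
    name: str
    widget: str

# Hypothetical mapping from abstract roles to concrete widget types.
WIDGETS = {"command": "Button", "output": "Label"}

def to_concrete(model):
    """Step 2: refine the abstract model into a concrete one."""
    return [ConcreteElement(e.name, WIDGETS[e.role]) for e in model]

def to_platform(model, platform):
    """Step 3: map the concrete model to a (hypothetical) target platform."""
    return [f"{platform}.{e.widget}('{e.name}')" for e in model]

# A tiny music player interface as the abstract starting point.
abstract = [AbstractElement("play", "command"), AbstractElement("title", "output")]
concrete = to_concrete(abstract)
for line in to_platform(concrete, "head_unit"):
    print(line)
```

Only the last function depends on the target platform, which mirrors the observation that the reusability of the models decreases with each step.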

Existing UIDLs differ in terms of the supported platforms and modalities as well as in the number of predefined interaction objects that are available to describe the elements of the user interface. In the relevant literature, several authors have struggled with the challenge of a clear comparison of existing UIDLs [4, 7, 13]. However, a comprehensive comparison is yet to be drawn.<br />

In automotive HMI development, a wide range of actors from many different disciplines is involved: computer scientists and electrical engineers work together with designers, ergonomists and psychologists in interdisciplinary teams (see Figure 2). The HMI modeling language that we want to develop shall serve as the connective link between these actors. For this reason the modeling language has to be domain specific. Domain-specific languages (DSLs) are dedicated to a particular problem domain and their “vocabulary” is generally based upon common expressions that are typical for the domain. Thus DSLs are far more expressive in their domain than general-purpose languages would be. Further benefits of DSLs are better acceptance when the language is introduced as well as better readability of DSL-based specifications, even for non-programmers.<br />

Figure 2. Actors in the HMI development process and their<br />

specification flow.<br />

The idea of reusing best practices and existing modeling<br />

languages from the field of MBUID to develop a new domain-specific<br />

language for automotive HMI development is<br />

not completely new. In the past there have been similar<br />

approaches:<br />

• IML (Infotainment Markup Language) [6], developed by IAV, is an XML-based modeling language for infotainment systems.<br />

• OEM XML (later VW XML) [3] is an XML-based language that resulted from a cooperation of AUDI, BMW, Daimler, Porsche and VW. It addresses the standardized description of head-units and instrument cluster systems.<br />

• AbstractHMI [12] is an XML-based modeling language for automotive HMI-systems. The language was developed at the University of Ulm in cooperation with Daimler.<br />

• ICUC XML [5] is dedicated to the modeling of instrument clusters in trucks. The language was developed by Elektrobit <strong>Automotive</strong> for Daimler.<br />

However, none of the languages presented above has been established as an industry standard. Today there are only a few, at best partial, solutions that are used by some OEMs or suppliers. IAV gave up on the development of IML. AbstractHMI has never found its way from research to industrial application. ICUC XML can only be used via the development tool EB Guide, and OEM XML is used, despite the numerous partners involved in its development, only by VW.


The sustainable success of the renewed attempt strongly depends on the impact that the new modeling language and further project-related standardizations will achieve in the automotive industry. For this reason a consistent transfer of the project results to the industry is required. Exhibitions of the project results at the leading trade fairs of the automotive industry, such as the International Motor Show (IAA) or the International Suppliers Fair (IZB), will attract attention and contribute to the dissemination of the project results.<br />

During the project period of three years the project results will be continuously tested, validated and presented in the form of several demonstrators. Towards the end of the project these demonstrators will be aggregated into an overall system. This overall system shall cover and demonstrate the complete HMI development process in the automotive industry, from the first mock-up to the implementation of the target code on the hardware in the cockpit of a vehicle. In particular, model-based aspects and differences to the common development process shall be highlighted. For this purpose, the final demonstrator shall, for example, show that change requests for the running HMI-system can easily be realized by small modifications of the underlying HMI specification (which is based on the domain-specific modeling language). The HMI-system is supposed to run on several OEM/supplier hardware combinations. The exchangeability of the cockpit’s hardware emphasizes the wide coverage of the project results in the automotive industry.<br />

In addition to optimizing the HMI development process and the communication within it, a standardized modeling language paves the way for some further improvements.<br />

Paper-based exchange documents cannot be tested automatically for integrity and accuracy, as mentioned above, which often leads to bugs in the HMI-system that are first noticed in late stages of development. By leveraging the full potential of machine-readable specification documents (e.g. model-based testing, early use of virtual prototypes), cost- and time-intensive subsequent iterations and corrections can be avoided. For both suppliers and OEMs, this would offer significant potential for cost savings.<br />
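As a minimal sketch of what an automatic integrity test on a machine-readable specification could look like, the following example checks that every screen transition refers to a screen that is actually defined. The XML vocabulary (`hmi`, `screen`, `transition`) is hypothetical and invented for this illustration; it is not taken from any of the modeling languages discussed in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical HMI specification; the last transition points to a
# screen ("settings") that is never defined.
SPEC = """
<hmi>
  <screen id="main"/>
  <screen id="player"/>
  <transition from="main" to="player"/>
  <transition from="player" to="settings"/>
</hmi>
"""

def dangling_transitions(xml_text):
    """Return (from, to) pairs that reference an undefined screen."""
    root = ET.fromstring(xml_text.strip())
    screens = {s.get("id") for s in root.iter("screen")}
    return [
        (t.get("from"), t.get("to"))
        for t in root.iter("transition")
        if t.get("from") not in screens or t.get("to") not in screens
    ]

for source, target in dangling_transitions(SPEC):
    print(f"transition {source} -> {target} references an undefined screen")
```

A check of this kind can run on every change to the specification, whereas the same error in a paper-based document would typically only surface during integration.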

The connection of the HMI-system to the application layer of the vehicle is a further significant cost factor in current development processes. As the connection to the car’s application layer still requires manual processing, this step consumes resources to a similar extent as the actual development of the HMI-system. The introduction of a standardized modeling language creates the conditions for the development of a standard middleware that allows future HMI-systems to be connected more easily to the car’s application layer. The consequences are a reduction of development time and a better exchangeability of the hardware components.<br />

The integration of both aspects, model-based testing and middleware, points out the unexplored potential of model-based HMI development in the automotive industry.<br />


In the field of HMI development a distinction is made between model-based development of human-machine interfaces at design time and at runtime. The presented project addresses the model-based development of automotive HMI-systems at design time. Thus the project is the first extensive industrial use case for model-based HMI development. The collaborative application of this method by several industrial partners allows a proof of concept that reveals strengths as well as possible weaknesses where further research is required. Furthermore, the step towards model-based HMI development at design time is a necessary one in the automotive industry. Future runtime-adaptive HMI-systems require a model-based architecture. The development of such systems is necessary for a functional and efficient integration of the driver’s mobile devices (iPods, mobile phones etc.).<br />


The achievement of the above presented project’s<br />

objectives depends on some central tasks.<br />

Out of the numerous UIDLs without any automotive background, a few well-established examples have to be picked and compared to each other. The comparison has to be based on an appropriate use case (e.g. a simple interface for a music player) that allows the identification of elements that can be useful for the development of the automotive modeling language.<br />

In parallel, existing automotive related UIDLs have to be<br />

carefully examined. In particular, the question why none of<br />

the languages became a standard has to be answered.<br />

The automotive HMI development process itself will be the subject of a comprehensive analysis. Tools, processes and specification documents are examined on site at each partner with a strong focus on the interfaces and the exchange of documents between OEMs, suppliers and tool developers. The purpose is to identify best practices and to define an abstract reference process. The latter shall be used to derive a common data model as well as the requirements for the development of the new modeling language.<br />


In this paper we summarized some of the main issues in current HMI development processes in the automotive industry. The adoption of methods from the field of MBUID is supposed to lead to machine-readable HMI specifications, thus improving the communication between the process partners. Past attempts to develop a standardized modeling language have either failed or led to isolated applications. However, long-term benefits and potential subsequent developments necessitate an industry-wide impact as well as a sustainable establishment of the outcomes of the presented project. The first step has already been taken, as for the first time several OEMs will work together with their suppliers on the optimization of their HMI development processes.<br />


1. Abrams, M., Phanouriou, C. and Batongbacal, A.<br />

UIML: An Appliance-Independent XML User Interface<br />

Language. Proc. of the 8th International World Wide<br />

Web Conference, Toronto, Canada, 1999.<br />

2. Bock, C., Görlich, D. and Zühlke, D. Using Domain-Specific Languages in the Design of HMIs: Experiences and Lessons Learned. Proc. of Workshop: Model-Driven Development of Advanced User Interfaces, UML/MoDELS 2006, Genoa, Italy, 2006.<br />

3. Brunhorn, J. XML-Sprache zur Beschreibung von HMIs<br />

für Infotainmentsysteme und Kombiinstrumente.<br />

Language Specification 1.0. Carmeq GmbH / OEM<br />

Arbeitskreis HMI Methodik, 2007.<br />

4. Guerrero García, J., González Calleros, J. and<br />

Vanderdonckt, J. A Theoretical Survey of User Interface<br />

Description Languages: Preliminary Results. Proc. of<br />

Joint 4th Latin American Conference on Human-<br />

Computer Interaction 7th Latin American Web<br />

Congress, Los Alamitos, USA, 2009.<br />

5. Hübner, M. and Grüll, I. ICUC-XML Format. Format<br />

Specification Revision 14. Elektrobit, 2007.<br />

6. Jud, A. Präzise Syntaxdefinition einer<br />

Modellierungstechnik für Infotainment-Systeme. Master<br />

Thesis, Technische Universität Berlin, 2007.<br />

7. Luyten, K. Dynamic User Interface Generation for<br />

Mobile and Embedded Systems with Model-Based User<br />

Interface Development. Doctoral Thesis, Transnationale<br />

Universiteit Limburg, Limburg, 2004.<br />

8. Meixner, G. Model-based Useware Engineering. In: W3C Workshop on Future Standards for Model-Based User Interfaces (W3C-2010), May 13-14, Rome, Italy, 2010.<br />

9. Puerta, A. A Model-Based Interface Development<br />

Environment. IEEE Software, 14 (4), 40-47, 1997.<br />

10. Puerta, A. and Eisenstein, J. XIML: A Universal<br />

Language for User Interfaces. RedWhale Software, Palo<br />

Alto, CA USA, 2001. Retrieved September 09, 2011,<br />

from http://www.ximl.org/pages/docs.asp.<br />

11. Puerta, A. and Eisenstein, J. Developing a Multiple User<br />

Interface Representation Framework for Industry. In:<br />

Multiple User Interfaces. Cross-platform Applications<br />

and Context-Aware Interface, Wiley, 119-148, 2004.<br />

12. Reich, B. Abstrakte Beschreibung automobiler HMI-<br />

Systeme und deren Erweiterung für neue Dienste.<br />

Master Thesis, Universität Ulm, 2008.<br />

13. Souchon, N. and Vanderdonckt, J. A Review of XML-Compliant User Interface Description Languages. Proc. of the 10th International Workshop on Interactive Systems: Design, Specification and Verification, 377-391, 2003.<br />

14. Vanderdonckt, J., Limbourg, Q. and Michotte, B.<br />

USIXML: A User Interface Description Language for<br />

Specifying Multimodal User Interfaces. Proc. of the<br />

W3C Workshop on Multimodal Interaction, 2004.

A Robotic Wheelchair using Human Gestures and<br />

Scene Contexts<br />

Jin Sun Ju, Eun Yi Kim<br />

Dept. of advanced technology fusion Engineering, Konkuk University, Seoul, Korea<br />

vocaljs@konkuk.ac.kr, eykim@konkuk.ac.kr<br />

82-2-450-4135<br />


In this paper, we propose a new vision-based robotic wheelchair that uses human gestures and scene contexts. For easy and accurate control of the wheelchair, facial gestures are used: the direction of the wheelchair is determined by the inclination of the user’s face, while proceeding and stopping are determined by the shape of the user’s mouth. In addition, a monocular vision-based navigation is developed to provide autonomous obstacle avoidance.<br />

To assess the effectiveness of the developed robotic wheelchair, several experiments were performed indoors and outdoors under various situational effects. The results demonstrate the feasibility of our system as a mobility aid for disabled or elderly people.<br />

Author Keywords<br />

Robotic wheelchair, gesture recognition, MLP<br />

ACM Classification Keywords<br />

H5.m. Information interfaces and presentation (e.g., HCI):<br />

Miscellaneous; I.4 Image processing and computer vision;<br />


Robotic wheelchairs are generally electric powered wheelchairs with an embedded computer and sensors, which give them intelligence. The most important evaluation factors for such wheelchairs are safety and convenient control, so many studies have addressed intelligent interfaces and autonomous navigation [1] [2]. The intelligent interface aims at letting handicapped users control the wheelchair with their limited physical abilities. For such an interface, we developed a control system using face inclination and mouth shape recognition in our previous work, which improves the accuracy of recognizing the user’s intention and reduces the computational cost compared with existing approaches [3].<br />

Navigation refers to detecting various obstacles in real environments and avoiding them. As the wheelchairs are used by handicapped people, dangerous situations and accidents such as collisions with obstacles and other people can occur. Accordingly, this study focuses on developing automatic navigation techniques for obstacle detection and avoidance.<br />

In this paper, we develop a vision-based robotic wheelchair that uses human gestures and scene contexts. Fig. 1(a) illustrates the prototype of the proposed robotic wheelchair and the specifications of its components. Our system consists of two modules: 1) a wheelchair control interface module and 2) a monocular vision-based navigation module. Fig. 1(b) describes the process of the proposed robotic wheelchair.<br />

Figure 1. The proposed system: (a) the overall architecture of our wheelchair, (b) the outline of the proposed wheelchair system<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA<br />



The proposed wheelchair control interface allows the user to control the wheelchair directly by changing their face inclination and mouth shape. If the user wants the wheelchair to move forward, they just say “Go.” Conversely, to stop the wheelchair the user just says “Uhm.” The direction of the wheelchair is determined by the inclination of the user’s face rather than by turning the head or face.<br />

Facial Feature Detection<br />

For robust detection of the facial region, we use the AdaBoost algorithm, which is widely used in face detection due to its accuracy and speed [5]. It extracts the Haar-like features that can explain the facial region from all possible rectangles obtained from a given image. Once a facial region is obtained, the mouth region is localized using edge information. The detection results may include some noise, which is filtered out by connected component analysis.<br />

Facial Feature Recognition<br />

Let ρ denote the orientation of the facial region. Then ρ can be calculated by minimizing the following inertia.<br />

[Equation (1): inertia of the facial region as a function of ρ; the formula did not survive text extraction]<br />

If the value of ρ is less than 0, the user is tilting their head to the left. Otherwise, the user is tilting their head to the right.<br />
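The orientation that minimizes the inertia of a region can be illustrated with the standard closed-form solution based on second-order central moments. This is a sketch under assumptions: it uses the textbook axis-of-least-inertia formula rather than the authors' exact Equation (1), the function name and the binary-mask input are hypothetical, and the mapping of the sign of the angle to "left" or "right" depends on the image coordinate convention, which the paper does not state.

```python
import math

def region_orientation(mask):
    """Angle (radians) of the axis of least inertia of a binary region.

    mask is a list of rows; truthy entries belong to the region.
    """
    points = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not points:
        raise ValueError("mask contains no region pixels")
    n = len(points)
    cx = sum(x for x, _ in points) / n  # centroid
    cy = sum(y for _, y in points) / n
    # Second-order central moments of the region.
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    mu11 = sum((x - cx) * (y - cy) for x, y in points)
    # Closed-form minimizer of the moment of inertia about an axis
    # through the centroid.
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)

# A diagonal stroke in image coordinates (y grows downward).
diagonal = [[1 if x == y else 0 for x in range(5)] for y in range(5)]
print(math.degrees(region_orientation(diagonal)))
```

A horizontal stroke yields an angle of zero under the same convention, so the sign of the result distinguishes the two tilt directions.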

To recognize the mouth shape in the current frame, template matching is performed, in which the current mouth region is compared with mouth shape templates. These templates are obtained by K-means clustering from 114 mouth images. After localizing the mouth in the current frame, we first normalize the mouth region, calculate its matching score for all templates, and pick the template with the best matching score.<br />
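The matching step can be sketched as follows. The paper does not name the matching score, so normalized cross-correlation is assumed here purely for illustration; the template values and the function names are hypothetical, and the labels "go" and "uhm" only echo the two mouth shapes mentioned earlier.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equally sized gray-value lists."""
    if len(a) != len(b):
        raise ValueError("regions must have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0:
        return 0.0  # flat region: no correlation defined, treat as no match
    return sum(p * q for p, q in zip(da, db)) / denom

def best_template(mouth, templates):
    """Return the label of the template with the highest matching score."""
    return max(templates, key=lambda label: ncc(mouth, templates[label]))

# Toy 2x2 "images", flattened row by row (hypothetical values).
templates = {"go": [0, 9, 9, 0], "uhm": [9, 0, 0, 9]}
print(best_template([1, 8, 7, 1], templates))
```

Because the score subtracts the mean and divides by the norm, it is insensitive to uniform brightness changes, which is one common reason to normalize the mouth region before matching.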


In this module, all information about the environment in which the wheelchair is located is represented in the form of an occupancy map. An MLP is then used to capture the visual characteristics of the occupancy maps for the different directions.<br />

Obstacle Detection<br />

Each cell in the occupancy map models the risk of the corresponding area by its gray level, so we first design the map to fit the environment. To generate the occupancy map, we estimate the background model by simple online learning and compare it with every frame received from a CCD camera, thereby classifying the current frame into background and other regions. Here we use a simplified version of the background detection method presented in [4]. The background color is estimated from only the reference area rather than the whole image. The input image is filtered by a 5×5 Gaussian filter to reduce noise and transformed into the HSI color space. From the reference area, two color histograms are calculated, one for hue and one for intensity. These histograms are accumulated over the five most recent frames and used as the background model, which is continuously updated as new frames arrive. Once the background model is obtained, the classification is performed. If the intensity and hue of a pixel are below the thresholds, the pixel is considered an obstacle. In this paper, the hue and intensity thresholds are set to 60 and 80, respectively. Based on the background classification results, an occupancy map is produced in which each cell is allocated to a walking area and takes a different gray level according to the occupancy of obstacles, as shown in Fig. 2.<br />

Here, 10 gray levels are used according to the risk. The gray level of a cell is determined by 1/10 × (# of pixels classified as obstacles), and a gray color is assigned according to this risk. The brighter a grid cell, the higher its obstacle density.<br />
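The thresholding and the gray-level rule above can be sketched as follows. The thresholds 60 and 80 and the factor 1/10 come from the text; the 10×10 pixel cell size, the clamping to the highest of the 10 levels and the function names are assumptions made for illustration. The background model itself is omitted, and the hue and intensity inputs are taken as given per-pixel values.

```python
HUE_THRESHOLD = 60        # from the text
INTENSITY_THRESHOLD = 80  # from the text
GRAY_LEVELS = 10          # from the text

def classify_obstacles(hue, intensity):
    """Mark a pixel 1 (obstacle) when hue and intensity are both below threshold."""
    return [
        [1 if h < HUE_THRESHOLD and i < INTENSITY_THRESHOLD else 0
         for h, i in zip(hue_row, int_row)]
        for hue_row, int_row in zip(hue, intensity)
    ]

def occupancy_map(obstacles, cell=10):
    """Gray level per cell: (number of obstacle pixels) / 10, clamped to 0..9.

    The cell size of 10x10 pixels is an assumption, chosen so that the
    count divided by 10 spans the 10 gray levels.
    """
    rows, cols = len(obstacles) // cell, len(obstacles[0]) // cell
    grid = []
    for r in range(rows):
        grid_row = []
        for c in range(cols):
            count = sum(
                obstacles[y][x]
                for y in range(r * cell, (r + 1) * cell)
                for x in range(c * cell, (c + 1) * cell)
            )
            grid_row.append(min(GRAY_LEVELS - 1, count // 10))
        grid.append(grid_row)
    return grid

# 10 rows x 20 columns: dark left half (obstacle), bright right half.
hue = [[30] * 10 + [100] * 10 for _ in range(10)]
intensity = [[40] * 10 + [200] * 10 for _ in range(10)]
print(occupancy_map(classify_obstacles(hue, intensity)))
```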

Figure 2. Examples of generated occupancy maps: (a) input image, (b) obstacle classification results, (c) occupancy maps<br />

Path generation<br />

We automatically extract the scene contexts from the real-time video stream and use them to determine viable paths through machine learning. Here, we use an MLP to automatically capture important scene contexts among the occupancy maps for different paths, as it integrates feature extraction and classification in its own architecture. The path generation is performed in two steps: an off-line learning stage and an on-line recognition stage.<br />

Off-line learning stage<br />

In the off-line learning stage, the proposed system uses an MLP to learn the visual properties of the occupancy maps for each direction, so that it can recommend viable paths. In the MLP, the input layer receives the gray values of the pixels of the 32×24 occupancy map. The output value of a hidden node is then obtained from the dot product of the vector of input values and the vector of weights connected to that hidden node. This value is then presented to the nodes of the next layer. Although various learning techniques can be used for multi-layered networks, this study used back-propagation, where the output values are compared with the correct answer during network training to compute the value of the error function. In our system, the input layer is composed of 769 nodes and the output layer is composed of four nodes, each of which corresponds to one of four directions {Go straight, Stop, Turn Left, Turn Right}.<br />
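The forward pass described above can be sketched as follows. Several details are assumptions rather than facts from the paper: the hidden layer size (16), the sigmoid activation, the scaling of gray levels to the range 0 to 1, and the reading of the 769 input nodes as 32×24 = 768 gray values plus one constant bias input. The weights are random and untrained, so the scores carry no meaning; back-propagation training is not shown.

```python
import math
import random

DIRECTIONS = ["Go straight", "Stop", "Turn Left", "Turn Right"]
INPUTS = 32 * 24 + 1  # 768 gray values + 1 bias input (assumed reading of 769)
HIDDEN = 16           # hypothetical; the paper does not state the hidden size

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """Each node outputs the sigmoid of the dot product of inputs and its weights."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def random_weights(nodes, fan_in):
    return [[random.uniform(-0.1, 0.1) for _ in range(fan_in)] for _ in range(nodes)]

def forward(occupancy, w_hidden, w_out):
    """Map a 24-row x 32-column occupancy map (levels 0..9) to four scores."""
    x = [level / 9.0 for row in occupancy for level in row] + [1.0]
    if len(x) != INPUTS:
        raise ValueError("expected a 32x24 occupancy map")
    return layer(layer(x, w_hidden), w_out)

random.seed(0)
w_hidden = random_weights(HIDDEN, INPUTS)
w_out = random_weights(len(DIRECTIONS), HIDDEN)
empty_map = [[0] * 32 for _ in range(24)]
for name, score in zip(DIRECTIONS, forward(empty_map, w_hidden, w_out)):
    print(f"{name}: {score:.3f}")
```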

On-line recognizing stage<br />

After training, the MLP is used to make decisions on the live video stream. As the value of an output node is a floating-point number ranging from 0 to 1, a threshold value is required to decide on viable paths. Here, a threshold value of 0.7 was used for the MLP output nodes. Therefore, if the score of an output node was larger than 0.7, the direction corresponding to that node was selected.<br />
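The decision rule can be written in a few lines. The threshold 0.7 and the four directions come from the text; the function name and the example scores are hypothetical.

```python
DIRECTIONS = ["Go straight", "Stop", "Turn Left", "Turn Right"]
THRESHOLD = 0.7  # from the text

def viable_directions(scores):
    """Select every direction whose output node score exceeds the threshold."""
    return [d for d, s in zip(DIRECTIONS, scores) if s > THRESHOLD]

# Hypothetical output node scores for one frame.
print(viable_directions([0.91, 0.12, 0.75, 0.70]))
```

Because each output node is thresholded independently, more than one direction, or none at all, can be selected for a given frame; the text does not say how such cases are resolved.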


To assess the effectiveness of the proposed system, we performed several experiments. Experiments I and II were designed to measure the accuracy of our two main modules, the interface and the navigation, respectively. Experiment III was designed to assess the effectiveness of the navigation by comparing its performance with that of an existing method.<br />

Experiment I: To measure the accuracy of wheelchair<br />

control interface<br />

For the proposed wheelchair control interface to be practically usable in real environments, it should be robust to varying illumination and cluttered backgrounds. Thus, the proposed interface was tested indoors, outdoors, and across both environments.<br />

Fig. 3 shows the facial feature detection and recognition results. As seen in Fig. 3, the proposed method accurately detected the face and mouth, confirming its robustness to time-varying illumination and its low sensitivity to a cluttered environment.<br />

Figure 3: Face and mouth detection results<br />

Table 1 shows the recognition rates of the proposed interface for the respective commands. The proposed interface shows a precision of 100% and a recall of 96.5% on average. Thus, this experiment showed that the proposed interface can accurately recognize the user’s intention in real time.<br />

Command | Recall (%) | Precision (%)<br />
Turn Left | 98 | 100<br />
Turn Right | 94 | 100<br />
Go straight | 96 | 100<br />
Stop | 98 | 100<br />

Table 1. Performances in recognizing users’ commands<br />

Experiment II: To measure the accuracy of monocular<br />

vision-based navigation<br />

To fully support the mobility of severely disabled or cognitively disabled people, a navigation system that automatically detects and avoids obstacles is necessary, so we developed a new monocular vision-based navigation using machine learning.<br />

To be practically usable in real environments, it should detect a variety of obstacles and should also be robust to situational effects such as place types and lighting conditions.<br />

Thus, it was tested indoors and outdoors, at daytime and at night. Fig. 4 shows the obstacle detection results under various conditions, where the 1st to 4th columns show the detection results for static obstacles indoors, and the last two columns show the results for moving obstacles outdoors. In more detail, the 1st column shows the detection result for a static, thin and floating obstacle, and the 2nd column shows the result for a static thick obstacle. These images were taken at daytime. The 3rd and 4th columns in Fig. 4 show the detection of a thin and small obstacle at night. Finally, the 5th and 6th columns show the results for moving obstacles at daytime and at night, respectively.<br />

For a given input image (Fig. 4(a)), the obstacle detection results and the generated occupancy maps are shown in Figs. 4(b) and (c), respectively. As shown in Fig. 4(c), the proposed system can accurately detect a variety of obstacles under several illumination conditions.<br />

Figure 4. Obstacle detection results: (a) input image, (b) background detection results, (c) occupancy maps<br />

Table 2 summarizes the performance of our navigation system under various conditions. Although there are some differences, it showed an accuracy of 90% on average. Among the four test groups, the accuracy for Type 2 was the lowest. The Type 2 experiments were performed in a shopping mall, where the marble-textured background caused strong reflections and the scene was heavily cluttered by people and stores. Despite these problems, our system can generate viable paths that avoid collisions with obstacles.<br />

Environment | Indoor, Type 1 | Indoor, Type 2 | Outdoor, Type 3 | Outdoor, Type 4<br />
Accuracy (%) | 91 | 87 | 93 | 89<br />
Type 1: underground, Type 2: shopping mall, Type 3: road, Type 4: footway<br />

Table 2. Performances in determining viable paths<br />

Experiment III: To prove the effectiveness of our<br />

monocular vision-based navigation by comparison with<br />

other method<br />

To assess the validity of the monocular vision-based navigation module, its performance was compared with that of another method. Here we adopt VFH [7], as it is the most commonly used method in automatic navigation, as mentioned in Section I (related work).<br />

Fig. 5 shows the performance comparison of the two methods indoors and outdoors with time-varying illumination. Fig. 5(a) shows the results of the two methods under time-varying sunlight at daytime, and Fig. 5(b) shows the results under artificial light at night. As can be seen in Fig. 5, the proposed method performed better in all cases, regardless of place type and illumination conditions. On average, the proposed method generates avoidable paths with an accuracy of 92%, whereas VFH has an accuracy of 79%. Consequently, the proposed method improves the accuracy by 13 percentage points.<br />

Figure 5: Performance comparison of our system and VFH under various lighting conditions: (a) comparison under time-varying sunlight, (b) comparison under artificial light. Curves: the proposed method outdoors, the proposed method indoors, VFH outdoors, VFH indoors.<br />

Method | Indoor, daytime | Indoor, nighttime | Outdoor, daytime | Outdoor, nighttime<br />
Proposed method | 8% | 10% | 11% | 13%<br />
VFH | 31% | 30% | 46% | 48%<br />

Table 3. Collision rate of proposed method and VFH<br />

The most important role of a navigation system is to prevent collisions, so its performance should be evaluated in this respect. Table 3 shows the collision rates of the two methods when moving to a goal. The proposed method detected impending collisions and stopped with an accuracy of 89%, whereas VFH achieved an accuracy of just 61%.<br />

As shown in Fig. 5 and Table 3, the numerical comparisons show that the proposed method provides safer mobility than VFH and is robust to situational effects such as illumination conditions and place types. Moreover, the average time taken by the proposed method to process a frame was about 56 ms, which allows it to process more than 17 frames/s. The proposed method was about 22 ms faster than VFH.<br />

Consequently, the proposed method improves the detection of collisions and the prediction of avoidable paths compared with the existing method, thereby providing a wheelchair with safe navigation in real environments.<br />
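The reported figures can be cross-checked with a few lines of arithmetic. The collision rates are taken from Table 3 and the frame time from the text; reading the avoidance accuracy as 100% minus the mean collision rate is an assumption about how the 89% and 61% figures were derived.

```python
# Collision rates (%) from Table 3: indoor day, indoor night, outdoor day, outdoor night.
proposed = [8, 10, 11, 13]
vfh = [31, 30, 46, 48]

def mean(values):
    return sum(values) / len(values)

# Assumed reading: avoidance accuracy = 100% - mean collision rate.
avoid_proposed = 100 - mean(proposed)
avoid_vfh = 100 - mean(vfh)

# Frame rate implied by the reported 56 ms per frame.
frames_per_second = 1000 / 56

print(f"proposed: {avoid_proposed:.2f}%  VFH: {avoid_vfh:.2f}%")
print(f"frame rate: {frames_per_second:.2f} frames/s")
```

Under this reading the values come out at 89.5% and 61.25%, in line with the 89% and 61% quoted in the text, and the frame rate is just under 18 frames/s.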


In this paper, we developed a vision-based robotic wheelchair that uses human gestures and scene contexts. The advantages of the proposed system include the following: 1) Our wheelchair control interface requires only minimal user motion, such as face inclination and mouth shapes, which makes the proposed interface more suitable for the severely disabled. 2) By using scene contexts as well as obstacle density, our monocular vision-based navigation provides the wheelchair user with safer mobility in unknown environments. 3) It can feasibly be used in other mobile robots and other assistive devices, such as ETA (Electronic Travel Aids) systems that provide safe mobility for visually impaired people.<br />

To demonstrate these advantages, several experiments were performed indoors and outdoors under various situational effects, and the system’s performance was compared with an existing method. The results showed the efficiency and effectiveness of the proposed robotic wheelchair.<br />


This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2010-C1090-1001-0008).<br />


1. J. S. Ju, Intelligent wheelchair interface using face and mouth recognition. International Conference on Intelligent User Interfaces, ACM, (2009).02<br />

2. Guilherme N. DeSouza, Avinash C. Kak, Vision for mobile robot navigation: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2002) 237-267<br />

3. Mazo, M., Garcia, J. C., Experiences in assisted mobility: the SIAMO project, IEEE Control Applications, (2002).<br />

4. Iwan Ulrich, Illah Nourbakhsh, Appearance-based obstacle detection with monocular color vision, AAAI National Conference on Artificial Intelligence, (2000)<br />

5. Paul Viola, Michael J. Jones, Robust real-time face detection, International Journal of Computer Vision, (2004), 137-154

MetaBrain: Web Information Extraction and Visualization<br />

João Teixeira Gabriel Barata Daniel Gonçalves<br />

Department of Computer Science and Engineering, IST<br />

Av. Rovisco Pais, 1000 Lisbon<br />

{joao.teixeira,gabriel.barata}@ist.utl.pt, daniel.goncalves@inesc-id.pt<br />


Nowadays, the web is a huge source of information on<br />

different branches of knowledge. This knowledge, however,<br />

is dispersed across many sites, making it difficult to<br />

interrelate and understand. In the past few years some<br />

approaches have been developed to ease the extraction of<br />

this information, from Open Information Extraction to<br />

simpler data mining. Usually these solutions work as<br />

standalone applications, are developed from scratch, and<br />

are brittle: they are very sensitive to changes in the data sources.<br />

This makes it difficult for the final user to fully explore the<br />

potential of using different algorithms together to better<br />

extract and analyze information. In this paper we propose a<br />

new approach where users can create their own<br />

personalized information extractors and visualizations,<br />

without needing to type a single line of code, in an easy and<br />

highly flexible manner using a special-purpose interface.<br />

Since raw data is often difficult to understand, we also<br />

study how the user can create customized visualizations of<br />

this extracted data with low effort. A prototype of this<br />

concept, MetaBrain, has been implemented and tested.<br />

A preliminary heuristic evaluation demonstrates favorable<br />

results for the concept.<br />

Author Keywords<br />

Information Extraction, visualization, user interaction.<br />

ACM Classification Keywords<br />

H.5.2 User Interfaces - Graphical user interfaces (GUI),<br />

H.5.m Miscellaneous.<br />


The versatility of the web is also its biggest problem. Since<br />

anyone is free to create their website in any way they want,<br />

there is no unifying structure for all this information. More<br />

than a huge repository of knowledge, the web contains a<br />

whole set of hidden implicit information. The way people<br />

express their thoughts reflects an unconscious collective of<br />

trends and patterns which are not obvious at first sight.<br />

What color does the Internet relate to the term apple?<br />

Surprisingly, white is the color that most frequently co-occurs<br />

with apple in web pages, next to red and green.<br />

Apple Inc. and Snow White may be to blame for this.<br />

Copyright is held by the author/owner(s).<br />

<strong>MIAA</strong> 2011, February 13, 2011, Palo Alto, CA, USA.<br />


Traditionally, Information Extraction (IE) focuses on<br />

extracting information from specific pre-defined domains.<br />

Changing domains implies that new extraction rules need to<br />

be manually created, making it hard to scale. Manually<br />

querying search engines in order to extract large quantities<br />

of information is also not the right approach, since it is<br />

tedious and error-prone as pointed out by [6]. A possible<br />

solution to this problem is the use of Open Information<br />

Extraction [2], which relies on the observation that a large number of<br />

relationships are expressed through a compact set of<br />

relation-independent lexico-syntactic patterns. This is only<br />

one of several techniques [3,5,7] which allow the extraction<br />

of information from the Web using only statistics and<br />

probabilities.<br />

Although many new tools for web IE have recently<br />

appeared, these tools are usually designed to use a single<br />

type of IE technique with no possibility of interaction with<br />

others. It may be in the best interest of the user to use<br />

different IE techniques simultaneously, thus discovering<br />

hidden and unexpected patterns in apparently unrelated<br />

data. For example, a user could automatically<br />

extract a list of operating systems and see how popular<br />

each one is on different search engines or social networks,<br />

for different kinds of users. Another problem found in these<br />

tools is that most are developed from scratch. Currently,<br />

there is no unified framework with different IE modules<br />

available for programmers or other users to use as a basis<br />

for their IE tools. Also, state of the art tools like<br />

TextRunner [1] lack advanced search options, like the<br />

selection of search engine to use, or the possibility to<br />

extract the retrieved data. These options may be important<br />

for advanced users.<br />

Our research aims at finding ways for normal web users to<br />

access the collective unconscious that is the Internet. Given<br />

the giant number of possible extraction scenarios this can<br />

be a very complex and difficult task. Our efforts were<br />

directed at creating the best interface to make this task as<br />

easy as possible. Since raw data from these techniques is,<br />

at times, difficult to understand, we also analyzed several<br />

information visualization techniques, from simple bar<br />

charts to hierarchical tree-maps, with the objective of<br />

creating a good and easy way for the user to create and export<br />

their customized visualizations.

In the next sections, we detail how we extract information<br />

from the web. Then we explain our design and interaction<br />

decisions for our solution prototype. This is followed by an<br />

analysis of the results of the prototype’s heuristic evaluation.<br />

Finally, we conclude with our final remarks and discuss<br />

future work.<br />


There are different approaches to extract information from<br />

the web without the use of complex natural language<br />

parsers. Different algorithms use different features to<br />

extract the information. Generally, we find three<br />

classes of approach, based on: the number of results found for a<br />

given query [9]; lexico-syntactic patterns [5,6]; and word<br />

co-occurrence [8]. Next we’ll see how we can use these<br />

different classes together to create customized IE tools.<br />

Selected Information Extraction approaches<br />

The number of results can be used as a way to identify the<br />

popularity of one or more concepts on the Internet, and also<br />

to measure the validity of extracted data. For example, if<br />

“fishing water” has more results than “fishing wall” then<br />

fishing is probably more related to water than to a wall.<br />
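The comparison described above can be sketched in a few lines of Python. This is only an illustration: the hit counts are invented, and `fake_hit_count` is a hypothetical stand-in for whatever search engine API would supply real result counts.

```python
from typing import Callable, Dict, List

# Invented result counts; a real tool would ask a search engine instead.
FAKE_HIT_COUNTS: Dict[str, int] = {
    "fishing water": 1_200_000,
    "fishing wall": 45_000,
}


def fake_hit_count(query: str) -> int:
    """Return a made-up number of results for a query (0 if unknown)."""
    return FAKE_HIT_COUNTS.get(query, 0)


def more_related(term: str, candidates: List[str],
                 hit_count: Callable[[str], int] = fake_hit_count) -> str:
    """Pick the candidate whose joint query with `term` has the most results."""
    return max(candidates, key=lambda c: hit_count(f"{term} {c}"))


best = more_related("fishing", ["water", "wall"])
```

Passing the counting function as a parameter keeps the comparison independent of any particular search engine.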

By using lexico-syntactic patterns like C{,} “such as ”<br />

IList, where C is a concept and IList is a list of instances<br />

from that concept, it is possible to generate special queries<br />

to use in search engines that will be able to map concepts to<br />

instances or instances to concepts.<br />
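As a rough sketch of how such a pattern might be used, the Python below builds a quoted query for a concept and pulls the instance list out of a returned text snippet. The function names and the regular expressions are assumptions made for illustration, not the prototype's actual implementation.

```python
import re
from typing import List


def build_query(concept: str) -> str:
    """Build a quoted search query for the pattern: C "such as" IList."""
    return f'"{concept} such as"'


def extract_instances(concept: str, snippet: str) -> List[str]:
    """Extract the list that follows '<concept>[,] such as' in a snippet."""
    pattern = re.compile(
        re.escape(concept) + r",?\s+such as\s+([^.;:!?]+)", re.IGNORECASE
    )
    match = pattern.search(snippet)
    if not match:
        return []
    # Split on commas (optionally followed by and/or) and on bare and/or.
    parts = re.split(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+", match.group(1))
    return [p.strip() for p in parts if p.strip()]


instances = extract_instances(
    "cities", "We visited cities such as Lisbon, Porto, and Faro."
)
```

Mapping instances back to concepts would use the same pattern with the roles reversed, searching for text such as "such as Lisbon" and reading the word before it.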

Recent work has sought to validate the use of<br />

term co-occurrence for opinion mining [7,8]. With<br />

the rise of micro-blogging usage, it is now possible to more<br />

easily extract the general Internet opinion of a given<br />

concept by looking at what words co-occur with that<br />

concept.<br />
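A minimal Python sketch of the co-occurrence idea follows. It assumes the posts have already been collected; the stopword list and the example posts are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stopword list; a real tool would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "my", "i", "and", "of", "to", "this", "so"}


def cooccurring_words(concept: str, posts: Iterable[str],
                      top: int = 3) -> List[Tuple[str, int]]:
    """Count words that appear in the same post as `concept`."""
    counts: Counter = Counter()
    concept = concept.lower()
    for post in posts:
        words = set(re.findall(r"[a-z']+", post.lower()))
        if concept in words:
            # Each post contributes at most one count per word.
            counts.update(words - STOPWORDS - {concept})
    return counts.most_common(top)


posts = [
    "My new phone is great",
    "This phone is great and fast",
    "The weather is great",
]
top_words = cooccurring_words("phone", posts, top=1)
```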

Putting It All Together<br />

Each one of these approaches is a way to extract a different<br />

type of information, so it would be good if we could use<br />

them together or alone, depending on what we want to<br />

extract. We can think of each one of these as a different<br />

search module. If we would like to extract a list of cities<br />

and then check their popularity online, instead of manually<br />

executing two different searches it would be good to create<br />

a single search query for the whole extraction.<br />

Because these modules are domain independent, it is a<br />

matter of defining a way to direct one module’s output to<br />

another’s input. To do this, we can standardize all<br />

three modules’ main input as a single query parameter and<br />

their output (result set) as a table (Figure 1), where the rows<br />

represent the different extracted items and the<br />

columns represent the extracted information (primary<br />

column) and some auxiliary attributes of the extraction.<br />

Looking at only the primary column of a result set we get a<br />

list of results which can be iterated by another search<br />

module as its input parameter. This way it is possible to<br />

easily create multi-level search queries. Figure 1 also shows<br />

a result of a multi-level search.<br />


Figure 1. Left: result set for an extraction of city instances.<br />

Each row represents an extracted city, which is presented<br />

on the Extracted column, the table’s primary column;<br />

Right: result set for the number of results found for the<br />

different cities extracted on the left table.<br />
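The standardized interface described above (a single query string in, a table out, with the primary column feeding the next module) can be sketched in Python as follows. The two modules here are hypothetical stand-ins backed by canned data, and the column name "Extracted" is taken from the Figure 1 caption.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
ResultSet = List[Row]            # one dict per row; "Extracted" is the primary column
Module = Callable[[str], ResultSet]


def extract_by_domain(query: str) -> ResultSet:
    """Hypothetical instance extractor backed by canned data."""
    canned = {"cities": ["Lisbon", "Porto"]}
    return [{"Extracted": name, "Query": query} for name in canned.get(query, [])]


def count_results(query: str) -> ResultSet:
    """Hypothetical popularity module; the count is only a placeholder."""
    return [{"Extracted": query, "Results": len(query) * 1000}]


def primary_column(result_set: ResultSet) -> List[str]:
    """Return the primary column of a result set as a plain list."""
    return [str(row["Extracted"]) for row in result_set]


def chain(first: Module, second: Module, query: str) -> ResultSet:
    """Feed each primary-column value of `first` into `second`."""
    combined: ResultSet = []
    for value in primary_column(first(query)):
        combined.extend(second(value))
    return combined


table = chain(extract_by_domain, count_results, "cities")
```

Because every module shares the same signature, a chain of any depth is built by applying `chain` repeatedly.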

A prototype library was implemented with these<br />

capabilities and also the possibility to customize each<br />

search’s parameters (thresholds, search engine, etc.). Several<br />

search engines can be used, including social networks. A<br />

modular approach was used to create this library so that<br />

it can be easily extended with new search engines, IE<br />

algorithms, or simple web service APIs. Also, since some<br />

IE modules sometimes need to perform thousands of search<br />

queries, a cache system was developed to make the searches<br />

faster when possible. The direct use of this library still<br />

requires programming skills. Hence, we developed a<br />

special-purpose interface, MetaBrain, which allows even<br />

non-programmers to perform IE and visualization tasks in a<br />

more natural way.<br />
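The paper does not describe how the cache system works. One simple possibility, shown here purely as an assumption, is to memoize results by search engine and query so that a repeated search never triggers a second network call. `slow_search` is a hypothetical stand-in for a real request.

```python
from typing import Callable, Dict, Tuple


class CachedSearch:
    """Wrap a search function and remember results per (engine, query)."""

    def __init__(self, search: Callable[[str, str], int]) -> None:
        self._search = search
        self._cache: Dict[Tuple[str, str], int] = {}
        self.misses = 0  # number of times the real search was called

    def __call__(self, engine: str, query: str) -> int:
        key = (engine, query)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._search(engine, query)
        return self._cache[key]


def slow_search(engine: str, query: str) -> int:
    """Stand-in for a network call; returns a made-up result count."""
    return len(engine) + len(query)


cached = CachedSearch(slow_search)
cached("engine-a", "lisbon")
cached("engine-a", "lisbon")  # second call is served from the cache
```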


With the library complete, we started looking into how we<br />

could create a GUI simple enough to allow regular Internet<br />

users to interact with it, without neglecting all the advanced<br />

options required by expert users. With this in mind, we<br />

decided to use HTML and JavaScript, in order to create a<br />

very dynamic interface with standards-compliant<br />

technology. Such an interface is also easy to connect to our Python<br />

library. We want users not only to extract information<br />

but also to create meaningful visualizations of the<br />

raw data. All these visualizations were implemented using<br />

the Protovis framework [4].<br />

Data Set Creation<br />

Since most users may be unfamiliar with IE tools,<br />

we strove to simplify every possible step of the<br />

extraction process, without disregarding the needs of<br />

advanced users. By default all customization options are<br />

hidden, although easy to access, and preset to a default<br />

value. This way the only thing needed is for the users to<br />

select what they want to extract. They can choose, and at<br />

any time change, between the different available extraction<br />

modules. These modules allow for the same types of IE<br />

previously discussed, plus easy access to public API<br />

services, such as converting a location to geographic coordinates and<br />

retrieving search engine suggestions. Each module is accompanied by<br />

a quick description of its purpose and a series of possible<br />

input examples with explanations.<br />

The design philosophy we follow is to only show relevant<br />

information in the interface so, by default, there is only one<br />

input section visible to the user. This reduces the visual<br />

noise the user faces while completing a task. For a simple one-level IE<br />

the process is very straightforward: select the IE module to<br />

use, input the query parameter and search. For example, if<br />

the user wishes to extract from the Internet a list of zodiac<br />

signs, he just needs to select the Extract by Domain module<br />

and use zodiac signs as the search query. By doing this, a<br />

list of extracted zodiac signs is presented to the user, as<br />

seen in Figure 2b.<br />

Figure 2. a) List of available extraction modules for the first input. b) Example of an extraction of the zodiac signs. c) Example of<br />

a multi-level search query. The final result will be the popularity, on the selected search engine, of every extracted city.<br />

If the user wishes to create a multi-level search query, the<br />

interface will evolve during the process, along with the<br />

user’s needs. If, at any time, the user chooses to use the<br />

result of one search as a term in another, the interface will<br />

dynamically add a new input section where the second<br />

search query can be defined. These secondary input<br />

sections are called variables and have the form of %1, %2,<br />

etc. Graphically, every new query to obtain the values for<br />

each variable appears below the one in which it is used, and<br />

one level deeper on the interface (Figure 2c).<br />

This helps users to effectively resort to<br />

several variables at once without getting lost or confused.<br />
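To make the variable mechanism concrete, the Python sketch below expands a query template containing %1, %2, etc. into one query per combination of values. It assumes the values for each variable have already been produced by their own sub-queries; the function name and the way combinations are enumerated are illustrative choices, not taken from the prototype.

```python
from itertools import product
from typing import Dict, List


def expand_query(template: str, variables: Dict[str, List[str]]) -> List[str]:
    """Substitute every combination of variable values into the template."""
    # Replace longer names first so that %10 is not mistaken for %1.
    names = sorted(variables, key=len, reverse=True)
    queries = []
    for values in product(*(variables[name] for name in names)):
        query = template
        for name, value in zip(names, values):
            query = query.replace(name, value)
        queries.append(query)
    return queries


queries = expand_query("%1 weather", {"%1": ["Lisbon", "Porto"]})
```

The number of generated queries is the product of the sizes of the value lists, which is why a multi-level search can take a long time to complete.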

In order to minimize the number of errors and avoid wasting the<br />

user’s time, before initiating the final search query,<br />

which may take from a few seconds to minutes or hours, it<br />

is possible to do a preview search on a smaller scale. This<br />

way, the user gets a quick glimpse of the kind of results<br />

returned by the current query and can make any<br />

adjustments necessary before starting the real long search.<br />
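The paper does not say how the preview is scaled down. One plausible approach, assumed here for illustration only, is to run just the first few of the generated queries:

```python
from itertools import islice
from typing import Callable, Iterable, List


def run_queries(queries: Iterable[str], search: Callable[[str], int],
                preview: int = 0) -> List[int]:
    """Run all queries, or only the first `preview` ones when preview > 0."""
    selected = islice(queries, preview) if preview > 0 else queries
    return [search(q) for q in selected]


# `len` stands in for a real search call in this example.
preview_results = run_queries(["a", "bb", "ccc"], len, preview=2)
```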

To increase the possibilities of query creation it is also<br />

possible to create Data Sets by importing users’ own<br />

personal data (CSV) through our prototype. Before the data<br />

is imported it is scanned and MetaBrain tries to guess what<br />

type of data is in each column (text, numbers, coordinates,<br />

etc.). Our guesses are then shown to the users so they can<br />

confirm and make any changes necessary. We’ll discuss the<br />

importance of this type of information in the next section.<br />
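A simple way to guess column types can be sketched in Python with the standard `csv` module. The three type names follow the examples in the text, but the detection rules (every non-empty value must parse as a number, or match a "latitude,longitude" pattern) are assumptions made for this sketch.

```python
import csv
import io
import re
from typing import Dict, List

# Assumed coordinate format: "lat,lon" as two decimal numbers.
COORD = re.compile(r"^\s*-?\d+(\.\d+)?\s*,\s*-?\d+(\.\d+)?\s*$")


def is_number(value: str) -> bool:
    """Return True if the value parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def guess_type(values: List[str]) -> str:
    """Guess one column's type from its non-empty values."""
    filled = [v for v in values if v.strip()]
    if filled and all(is_number(v) for v in filled):
        return "number"
    if filled and all(COORD.match(v) for v in filled):
        return "coordinates"
    return "text"


def guess_column_types(csv_text: str) -> Dict[str, str]:
    """Map each CSV column name to its guessed type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: guess_type([row[col] or "" for row in rows]) for col in columns}


sample = (
    "city,population,location\n"
    'Lisbon,545000,"38.72,-9.14"\n'
    'Porto,232000,"41.15,-8.61"\n'
)
types = guess_column_types(sample)
```

Because the result is a plain dictionary, the guesses can be shown to the user and corrected before the import is finalized.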

Visualization<br />

Now that we have a good and flexible approach that allows<br />

even non-programmers to do customized IE from the Web,<br />

the next step is to provide them with the possibility to<br />

visualize this information in a more meaningful way than<br />

the one provided by simple tables. We started by<br />

identifying a set of requisites we would like the<br />

visualization creation process to follow:<br />

• Since the table of extracted information has multiple<br />

columns, the user must be able to choose which columns<br />

she or he wants to visualize;<br />

• The user should be able to choose from several different<br />

types of visualizations, from bar charts to sunbursts or<br />

even maps;<br />

• Each visualization must have its own set of configuration<br />

options: bar width for the bar charts, palette color for<br />

the sunbursts, etc.;<br />

• During this process it must be easy to change between<br />

different visualization types while maintaining the user’s<br />

previously selected preferences, if these are applicable to<br />

the new type;<br />

• The user must always be able to preview the visualization<br />

being created. Configuration changes to the current<br />

visualization should be applied instantly, without the<br />

need to refresh.<br />

Taking all these requisites into account, we decided to<br />

divide the visualization process into 3 steps: choose data to<br />

visualize (which columns); choose the visualization type;<br />

preview and configure the visualization.<br />

To address the first requisite we decided to let the user<br />

choose which columns to visualize by using a drag and drop<br />

metaphor. On the left side of the application a vertical list<br />

of names is visible. These are the names of the different<br />

columns in the selected data set, divided<br />

by the type of data they contain. This division makes<br />

the column selection easier for the user. On the right side of<br />

this list are two large horizontal boxes, representing the<br />

visualization’s axes. The user is then able to drag columns<br />

from the left list and drop them in the axis input boxes.<br />

During the drag procedure these boxes are highlighted,<br />

making the user aware of valid drop targets.<br />

We decided to use two axes after concluding, in a study,<br />

that all the different visualizations we wanted to implement<br />

required at least two degrees of freedom.

The available visualizations list starts empty. While the user<br />

makes column selections, these (columns selected, their<br />

data type and position in the axis) are used to verify what<br />

visualizations are available for this selected data. This way<br />

we can minimize the errors of the user choosing a map<br />

visualization type when no geographical data is selected.<br />
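This filtering step can be sketched in Python as a lookup from each visualization to the data types it needs on each axis. The requirement table below is hypothetical; the paper does not list the actual rules used by MetaBrain.

```python
from typing import Dict, List, Tuple

# Hypothetical requirements: (type needed on first axis, type needed on second axis).
REQUIREMENTS: Dict[str, Tuple[str, str]] = {
    "bar chart": ("text", "number"),
    "map": ("coordinates", "number"),
}


def available_visualizations(first_axis: List[str],
                             second_axis: List[str]) -> List[str]:
    """Return the visualizations whose required types are on the matching axes."""
    return [
        name
        for name, (need_first, need_second) in REQUIREMENTS.items()
        if need_first in first_axis and need_second in second_axis
    ]


# Nothing selected yet, so the list of visualizations starts empty.
initial = available_visualizations([], [])
```

With a text column on the first axis and a numeric column on the second, only the bar chart is offered; the map appears only once a coordinates column is selected.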

When the user has finished selecting the columns and has<br />

chosen the visualization, a preview is instantly created. Also,<br />

next to the visualization a list of configurable options<br />

(colors, scale, canvas size, etc.) appears with their default<br />

values selected. After changing any of these option values<br />

the preview is instantly refreshed. At any time during this<br />

process the user can change the selected columns or choose<br />

a different visualization. An example of a visualization<br />

being created is shown in Figure 3. When the users are<br />

satisfied with their visualization they can embed this<br />

visualization into their website by copying a piece of code<br />

into any webpage, much like embedding a YouTube video.<br />


In order to test our solution we conducted a heuristic<br />

evaluation of MetaBrain, using Jakob Nielsen’s usability<br />

heuristics 1 . After a quick introduction to the purpose of our<br />

work, four usability experts proceeded to freely test the<br />

prototype for a few minutes and then received a list of four<br />

tasks to execute. In two the users were asked to extract<br />

information from the web, from given domains, and in the<br />

other two to craft specific visualizations for that<br />

information. All were successfully completed by all users.<br />

Overall, only ten usability problems of relevant severity<br />

were identified. Most were related to the data extraction<br />

interface, especially the fact that some search queries<br />

took several minutes to finish with no indication<br />

of progress, only a looping loading sign. This problem has<br />

been solved by adding to the search interface the number of<br />

queries to be performed and how many have already been<br />

completed. All evaluation experts enjoyed the clean and<br />

minimalistic design and the dynamic way in which they<br />

could interact with the system. After completing the tasks,<br />

some wanted to keep playing with the system, curious about<br />

what other information MetaBrain would be able to extract.<br />

This preliminary evaluation allowed us to find and correct<br />

some usability problems. It indicates that the interface<br />

can be effective and easy to use. Further validation of this<br />

will be provided by upcoming, more formal, user tests,<br />

where we’ll take into account the number of errors and time<br />

taken to complete the tasks.<br />


We have presented an interface that allows us to extract and<br />

visualize information from the web in meaningful ways.<br />

Unlike previous research, we strove to make this task as<br />

simple and flexible as possible so that all types of users,<br />

from less to more experienced, can create customized<br />

solutions that fit their needs. A preliminary evaluation of<br />

our prototype, MetaBrain, showed positive results. Further<br />

user studies will allow us to better validate our choices.<br />

1 http://www.useit.com/papers/heuristic/heuristic_list.html<br />

Figure 3. Creation of a map visualization, showing Portuguese<br />

cities and their respective population size.<br />


1. Banko, M., Cafarella, M.J., Soderland, S., Broadhead,<br />

M., and Etzioni, O. Open information extraction from<br />

the web. In Proc. of the IJCAI 2007.<br />

2. Banko, M. and Etzioni, O. The Tradeoffs Between<br />

Open and Traditional Relation Extraction. In Proc. of<br />

ACL-08: HLT, 28-36.<br />

3. Bollegala, D., Matsuo, Y., and Ishizuka, M.<br />

Measuring semantic similarity between words using<br />

web search engines. In Proc. WWW '07, ACM Press<br />

(2007), 757-766.<br />

4. Bostock, M. and Heer, J. Protovis: A Graphical<br />

Toolkit for Visualization. In Proc. IEEE TVCG, 15<br />

(2009), IEEE CS (2009), 1121-1128.<br />

5. Cimiano, P. and Staab, S. Learning by googling.<br />

SIGKDD Explor. Newsl., 6 (2004), 24-33.<br />

6. Etzioni, O., Cafarella, M., Downey, D., et al. Web-scale<br />

information extraction in KnowItAll: (preliminary<br />

results). In Proc. WWW '04, ACM (2004), 100-110.<br />

7. Kramer, A.D. An unobtrusive behavioral model of<br />

gross national happiness. In Proc. CHI '10, ACM<br />

(2010), 287-290.<br />

8. Ku, L., Lee, L., Wu, T., and Chen, H. Major topic<br />

detection and its application to opinion<br />

summarization. In Proc. SIGIR '05, ACM (2005), 627-<br />

628.<br />

9. Turney, P.D. Mining the Web for Synonyms: PMI-IR<br />

versus LSA on TOEFL. Machine Learning: ECML<br />

2001, Springer Berlin (2001), 491-502.


���� ��� ���� ����������� ��� ��� ������ �������� �����<br />

���� ��������� �� ���� ������������<br />

��� ���� �� ���� ��� ���� ���� � �������� ���� � ��� ���� ��<br />

������ ���� ����� ������� �� ��� ������� ���� ������� ����<br />

����� ������ �� ��� ��� ���������� ������� �� ��� ������� ���<br />

������� ���� �� ������ �� ��� �������� �������� �� ��� ����<br />

��� ������� �� ��� ����� ���� ��� ���� ������ ���� ��������<br />

��� ����������� ������� ��� ��� ��� ����� ���� �������� ���<br />

���� �� ��� ������ �� ��� ������� ����� �������� ��� �������<br />

��������� ������� �� ��� ������� ������� � ������ ��������<br />

���� ��� ��� ����� ������� ���� �� ������ ������ �������� ���<br />

�������<br />

��������� ������� ����� ���� ���������� ������ �� ���������<br />

��������� ��� ��� ���� �� ���� ������������ ��� ���� ������<br />

������ ��� ���� ���� ��������� ��������� ������ � ����� � ����<br />

����� �������� ��� ������ ��� ������ ��������� � �������<br />

������ ����� ���� ��� ������ ���������� �� ������� ������ ����<br />

������� ������������ �� ������� ����� ��� �� ��� ���� �� ���<br />

�������� �������� ��� ������� ������� ����� �� ������� ���<br />

���� �� ��� ������ �� ��� ������� ��� ������������� ���������<br />

��� ����� �� ������ �� ��� ��� ������� �� �������<br />

� ������ �������� ��������� � ������ ������� � ������� ���<br />

��������� �� ������ ���� ��� �������� ��� ����� ����� �<br />

������� ������� ������� ���� ��� ���� ��������� ������������<br />

���� ����������� ������� ������������ ���� ��� ������ ����<br />

����� �� ������ ��� ���� ��� ������ ���������� ��� �������<br />

��� ������� ��� ����� ������� ��� ���������� ������ �� ���<br />

���� ������� ��� ������� ����� ��� ��������� �����������<br />

��������� �� ��� ������ ��� �������� �� ������� ��� ������<br />

������ ������ �� ��� �������� ���� ����� ��� ������ �������<br />

��� ������ �� �������� ��� ����� ��� ��� ������������ ���������<br />

���� ������ ��� �������� ��� ������ ���� ���� �� ������ ���<br />

55<br />

������ �� ����������� ������� ������� ��������<br />

��������� ����������� �� ����� ��� ������� ��� ������� �����<br />

���� ������������� ���� ���������� ������ ��������� ������ �<br />

����� ��� ������� ��� ��� ��� ���� ��� ����������<br />

��� ����� ����������� ����������� ����� ���� ����� �������<br />

��� ��� ������� � �������� ��������� �� ��� �������� ���<br />

������� ��������� ������ ��� �������� ����������� ���������<br />

���� ��� ��������������� ���� ������� ��� ���� ����� ����<br />

��� ��� �������� ��� ��� ������� ������� ����� ���� �� �������<br />

��������� ������ ���� �� �������� ��� ������� ������������<br />

��� ���������� ������� ������� ����� ���������� �� ��� ����<br />

���� ������ ����� ������� �������� ����� �� ����������<br />

���� ��� ������� ���� ���� ���� �������� ��� ������������� ���<br />

������ ��� ��� ���� �������� ������ �����������<br />

�������� ����������<br />

��� ���� �������� ��������� ���� �� ���� ���������� ���<br />

�������� ����������� ��������� ����� ������ �������� ����<br />

�� � ������������� ������������ ���� �������� ������ ������<br />

��������� ������� ��� ��� �� ��� ������� ��� ��� ��������<br />

���� ������ ���� � �������� �������� ��� ������ ���� �������<br />

��� ������� �������� ��� ��� ����������� ���������� ��������<br />

������� ����� ��� ����� ������ �������� ��� ��������� ���<br />

���� ��� ���� ������� ����� ����������� �� �� ��������� ��<br />

��� ���� ����� ��� ��������� �� �� ��� �� ������� ������ ��<br />

���� ������� ��� ������ ������������ ������� ��� ����� ����<br />

��� ������������ ��� ��� ������ ���� �� ���� �� �����������<br />

��� ���� ��� ���� �� ���� �� �� ��� ������� ���������� ���<br />

����� ������������ �� ����� ����������� ���� ��� �� ����� ��<br />

��� ����� ���� ��� ��� ������� �� � ��������� �����������<br />

������������<br />

�������� ��������<br />

�������� �������� ��� ��� ��� ��� ��� ��������� �������<br />

���� ������� ��� ������ ���� �������� ��� ������� �������<br />

��� ������������ ��� ��� �������� �� ����� ����������� ���<br />

������� ��� ����� �� ��� ���� ���� ���� ��� ������� ����<br />

��� ������� �� ��� ��������� �������� ������������� ���� ���<br />

�������� ��������� ������� �� ��� �� ���� ������� ��������<br />

�������� ���������� ��� ���������� ������� ��� ������ ����<br />

���������� ����� ��������� ������������ � ��� ��������

������� ���� ���� �� ���������� ������ ��� ������ ������<br />

�������� ���� ��� �������� �������� ������� ��� ������ ����<br />

���� ������ �� ��� ���������� ������ ���� ���� �������<br />

���� �������� ���������� ������ ����� ��� ������� ��������<br />

�� ���� � ������ ������� ��� ��� �� �������� ��� ����� ��<br />

��� ���� ������ ������ �� �� ����� ��� ����� ��� �������<br />

�������� ��� ��������� ����������� ���� �������� ��������<br />

����� ���� ��� ������ �� ��� ����� �������� �� ��� ��������<br />

������������� ���� ��� ����� ����� ���������<br />

������ �� ������ ������������� ������<br />

���� ����� �������� ����� � ������� ������ �������������<br />

������ ����� �� ���� ��������� �������� ����������� ���������<br />

����� � ������ ��������� ����������� ��� ������� ��� ����� ���<br />

��� �������������� �������� �� ��� ������ ������ �������<br />

�� ��� ������� �� ���� ��� ������� ��� �������� �� ������ ����<br />

���� ���� ������� ��� ���� ������ ��� ������� ��� ��� ������<br />

���������<br />

����������<br />

�����������<br />

����� ��������� �������������� ��� ������ �������������<br />

��� �������� � �������� �� ����� ��� ����������� �� ������<br />

���� ��� ������� �� ������ �� ������ ���� ��������� ���������<br />

������ ��� ����� ����������� � ������ ���� ��� ���� �� �����<br />

���� ��������� ��� ������� ����� ��� ���� ������ ��� �������<br />

�� �������� ������� ���� ��� ��������� ������� ��������� ���<br />

������ �� ������ ���� �� ������ � ��������� ��������� ���<br />

����� ��� ������ ���� ��� ������ ���� ������� ��� ���������<br />

��� ��� ��� ��� ��������� �� ���� ���� ��� ����� ��� ����<br />

��� �������� ��� ��� �������������<br />

������� ��� ��������<br />

��� ��� ������ �������� ������� ���� ��������� ��������<br />

�� ���� ������� ��������� ������� ������ ���� �� ��������� ��<br />

��������� ������ ����� ���� �� �� ������� ���� �� ��� ��<br />

���������� �� ���� ��������� ����������� �� ���������� ����<br />

����� ������� �������� �� ������� ����� ���� �� ���������<br />

���� ������� ��� ��� �� ���� ����� �� ������ �������� �������<br />

�� � ������� ���� ��� ����� ����� �������� ���������� ����<br />

���� ��� ���� ���� ����� ������� � ���� ���� �� �����������<br />

�� �������� ��� ������ ���� ����� �������� �������� ������� ��<br />

������ ��� ������� �������� ��������� ��� �������� �� �����<br />

���� ���� �� ��������� ��� �� ������� ���������<br />

������ �������<br />

��� ��� �� � �������� ��������� ������ ���� �������� � ������<br />

����� ��������� �������� ��� � ��� �� ���� �� �������� ������<br />

��� ��� �� ���������� ���� ������� ������� ������������<br />

��� ����� ���� �� ����� � ���� ��������� ������� �����������<br />

����������� ���� ����� �������� ����������� ������ �� ����<br />

������� �� ��� ��� ������� ����������� ����� �������<br />

56<br />

������������ �� ���� ��� ������������ ���� ����� ����� ��<br />

��������� � �������� �������� ���� � ��������� ����������<br />

������������ ����� ����� �������� ��� ��� ������������ ������<br />

��� �������� �������� ��� ������ ���� ��� ��� ������ �� ���<br />

�������� ��� ���� �� ���� ���� ���� ������� �������� ��� ��<br />

���� ���� ����� ������� ��� ������ �������� ����� ����������<br />

������� ������ ��� ����� ������� �����������<br />

����������<br />

�� ���������� ��� �������� �������� ���� ���� �������<br />

���� ���� ����������� ��� ��� ��������� ����<br />

���������� ������������� ������ ��������� ����� ����<br />

�� �� ����� �� ���������� ������ ��� ��� ���������<br />

������ ������� ���� �����<br />

�� �� ����� ���������� ��� ���������� �� ��� ���������<br />

���� ������ �����<br />

�� �� ����� ������ ���� �������� ������� ���� �� ������<br />

��������������<br />

�� �� ������ ����������� �� ��� �������� ����� ���������<br />

����� �������� � ��� ���� �����<br />

�� �� �������� �� ���������� ��� �� �������<br />

����������� �� ��������� ���� ��� �������� ����������<br />

��� ���� ���������� ��� ����������� �� � ������<br />

������������� ����� ���������� ��������������<br />

������������ ��������� � ���� �����<br />

�� �� �� �� �� ������ ��� ���� ���������� ���������<br />

����������������������������������������� ����������<br />

������ �����<br />

�� �������� ��� ��� ��� ������������ �������� �����<br />

�� �� ����� �� �������� �� ��������� �� �����������<br />

�� ������ �� ����� ��� �� �������� ���� ����������<br />

������ ������ �� ������� ����� ������� �����������<br />

�������� �������� ���� ������������ ��� ���������� �<br />

����� ����� �����<br />

��� �� �� �������� ��� �������� �������� ��� �������<br />

���������� �� �����������<br />

���������������������������������������������������<br />

��������������������������������������<br />

�������������������������������� ���� ����� �����<br />

��� �� �� ������ ��� �� �� ���������� ��������������<br />

���������� �� ��������� ����������� ���������������<br />

�������� ���������� �������� ������� �� ������� ���<br />

�������� ������������� ��������� � ���� ����� ������<br />

�������� ��� ������ ���������� �� ���������� ��������<br />

��������������<br />

��� �� ������� �������� ��������������� ��������<br />

��������������� �������� ����� � ��������� ���������<br />

� �� �����<br />

��� �� ������� ��� ��������� ���� ���� ���������� ������<br />

������<br />

��� �� ������ ��� �� ��������� ����������� ������� ��������<br />

����� �� ����� ������ ������ ������ ���� ��������� ���<br />

������� �����

Prototyping a Semi-Automatic In-Car Texting Assistant<br />

Christoph Endres<br />

German Research Center<br />

for Artificial Intelligence<br />

(DFKI)<br />

Saarbrücken, Germany<br />

christoph.endres@dfki.de<br />


Abstract<br />

Texting while driving is dangerous and illegal in most countries.<br />

However, both social and business forces have led to<br />

widespread disregard of those bans and, in turn, to potentially<br />

lethal situations. We argue that, in addition to legislative regulations,<br />

in-car texting should be made less distracting and<br />

dangerous. We offer a solution for one specific communication<br />

goal, namely staying connected to a social network. We<br />

propose a semi-automatic status-posting system and present<br />

a prototype based on a Pleo. We argue that our approach<br />

should be extended by automated answering mechanisms.<br />

The aim of this paper is to foster discussion on texting while<br />

driving. A solution for one type of semi-automatic texting<br />

is outlined; other types of texting need to be looked at separately.<br />

Author Keywords<br />

texting while driving, Pleo, semi-automatic texting<br />

ACM Classification Keywords<br />

K4.2 Computers and society: Social Issues<br />


Introduction<br />

Driven by ubiquity and convenience, the<br />

spread of mobile email devices such as BlackBerry, iPhone,<br />

and others has grown to tens of millions over the last several<br />

years [13]. Middleton and Cukier [12] expect sustained growth of this trend<br />

in the next decade. Mobile email promises seamless anywhere,<br />

anytime connectivity. Employees stay connected with their<br />

organizations, which increases productivity [13]. Participants in a<br />

study on BlackBerry use by Middleton and Cukier [12] emphasized the liberating<br />

nature of mobile email by showing how it allowed them the<br />

freedom to work anywhere.<br />

On the other hand, using mobile devices while driving is<br />

without doubt distracting and thus dangerous. After a surge<br />

in horrific automobile accidents in which distracted driving<br />

was proven to be a factor, 38 US states have enacted texting-while-driving<br />

bans [5]. Other countries have issued similar bans.<br />

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.<br />

Daniel Braun<br />

Saarland University,<br />

CS Department<br />

Saarbrücken, Germany<br />

daniel.braun@dfki.de<br />


Christian Müller<br />

DFKI<br />

Saarbrücken, Germany<br />

christian.mueller@dfki.de<br />

Figure 1. Pleo robot (Source: Ugobe)<br />

Nevertheless, people continue to text while driving. Reasons<br />

for ignoring bans on texting while driving vary, and include<br />

both business and social forces. People may be tempted to<br />

ignore texting-while-driving bans because<br />

• professional communication partners expect universal availability.<br />

• driving is perceived as “dead time” that needs to be filled<br />

with small talk.<br />

• intimates and buddies expect a prompt reply to their messages.<br />

• there is an audience to be constantly supplied with great<br />

content.<br />

In order to tackle this problem, we have to take a closer look<br />

at the different types of texting and the underlying motivation.<br />

Aside from widely known mobile email, we consider the following<br />

texting services relevant in the automotive context:<br />

SMS, Twitter (twitter.com), and Facebook (facebook.com).<br />

These are briefly introduced in the following.<br />

Short Message Service (SMS) is mostly used for person-to-person<br />

messaging (chat with friends). The text is limited to<br />

160 characters, but the system can segment messages that exceed<br />

the maximum length into shorter messages. Middleton and Cukier [12] argue<br />

that SMS is mostly a private means of communication that has<br />

not been widely adopted by the worldwide business community.<br />

Microblogging sites like Twitter provide a new means of<br />

communication [10]. Twitter provides the ability to deliver<br />

the data to interested users over multiple delivery channels:<br />

cell phone, Facebook application (see below), email, or as an<br />

Instant Message. A Twitter user interested in the statuses of<br />

another user signs up to be a “follower”. Updates or posts are<br />

made by succinctly describing one’s current status within a<br />

limit of 140 characters. According to [8], Twitter fulfills the<br />

need for an even faster mode of communication compared to<br />

regular blogging.<br />

Facebook belongs to the category of online social network<br />

(OSN) services. Its core functionality is managing connections<br />

or “friends” [9]. However, Facebook also provides opportunities<br />

for communication and hosting of content. Facebook<br />

currently has the most users worldwide; other<br />

OSNs include MySpace, Friendster, Bebo, hi5, and Xanga, each<br />

with over forty million registered users [10].<br />

As we pointed out earlier, legislation is unfortunately not<br />

sufficient to keep drivers from potentially lethal habits, so<br />

additional safeguards and alternative solutions need to be developed.<br />

In this paper, we propose a way to avoid composing<br />

Twitter messages manually while driving.<br />


The driving context and the nature of the communicative<br />

goal of Twitter lead to a limited number of likely messages,<br />

which are usually diary-like. A typical status might be “We<br />

are already so close to Paris, but now we hit a traffic jam!”<br />

(see Figure 5). We argue that such a message could as well<br />

be generated using a set of message templates and current<br />

status information of the car, e.g. GPS position, current<br />

speed, and available traffic jam warnings. Due to its nature<br />

and complexity, a car on the street is not a very suitable environment<br />

for fast prototyping. In order to evaluate the concept<br />

on a smaller scale, we developed a prototype [4] on a Pleo<br />

toy dinosaur. Due to its complex sensors and single data bus,<br />

the Pleo can be considered a downscaled model of a modern<br />

car, which we will explain below in more detail.<br />
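
The template idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the template texts, the situation keys, and the `generate_status` helper are assumptions made for this example.<br />

```python
from string import Template

# Hypothetical templates keyed by situation; wording and keys are illustrative.
TEMPLATES = {
    "jam_near_destination": Template(
        "We are already so close to $city, but now we hit a traffic jam!"
    ),
    "slow_traffic": Template("Crawling along at $speed_kmh km/h near $city."),
}

MAX_LENGTH = 140  # Twitter's status limit as given in the text


def generate_status(situation, car_status):
    """Fill the template for `situation` with values from `car_status`."""
    message = TEMPLATES[situation].substitute(car_status)
    if len(message) > MAX_LENGTH:
        raise ValueError("status exceeds the character limit")
    return message


car_status = {"city": "Paris", "speed_kmh": 8}
print(generate_status("jam_near_destination", car_status))
```

Keeping the templates separate from the status values means that new message types only require a new database entry, not new code.<br />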

A Pleo is a rather sophisticated device, sometimes also referred<br />

to as an artificial lifeform, equipped with a multitude of<br />

sensors (see Figure 1).<br />

The Pleo hardware is based on an Atmel ARM7 32-bit processor<br />

(main CPU), an NXP ARM7 32-bit microprocessor (camera,<br />

audio), and four Toshiba TMP86FH47AUG 8-bit microprocessors<br />

(motor control).<br />

Movement is achieved through 14 motors with feedback<br />

sensors. Additional sensors are:<br />

• A color camera with white light sensor<br />

• Two microphones<br />

• Eight touch sensors<br />

• Four push-buttons (one under each foot)<br />

• Tilt and shake sensors<br />

• Infrared transmitter and receiver in the mouth<br />

• Infrared transmitter and receiver at the head<br />

Figure 2. Pleopatra Tools Screenshot<br />

Pleo is also equipped with two speakers, internal<br />

flash memory, an SD card slot, and a USB interface.<br />

We connect Pleo via its USB interface to a computer in<br />

order to communicate with it. Pleo’s USB interface wraps<br />

a serial port to which we can connect using standard<br />

libraries such as RXTX [7]. To facilitate communication,<br />

we implemented an API in Java that wraps the serial protocol.<br />

It is called Pleopatra Tools [3] (see Figure 2). We<br />

published the library under the GPL license. Higher-level functions<br />

are included in a graphical user interface, which makes<br />

interaction with the Pleo easy. These include establishing a<br />

connection to a Pleo; storing personalized information about<br />

different Pleos, such as a photo or name, so that a Pleo is recognized<br />

instantly once it is connected; recording audio from<br />

Pleo with direct playback on the PC; inspecting and playing back<br />

sound, motion, and personality files; and displaying<br />

live camera images from Pleo. The API itself furthermore<br />

offers control of motors and sensors, access to the<br />

file system, recording audio from Pleo in WAV format, and<br />

access to Pleo’s camera for saving BMP images.<br />

Using this API we implemented a monitoring tool which<br />

constantly checks the sensor data for anything extraordinary,<br />

such as sudden darkness, very loud noise, very high or low<br />

temperature, detection of something green (which is considered<br />

food for Pleo), etc. On detection, an event is triggered.<br />

Depending on the type of event, a pre-formulated message is<br />

picked from a small database and refined with actual sensor<br />

values, e.g. “35 centigrades? It is very hot in here!”. These<br />

messages are then posted to Twitter (see Figure 3) via an automated<br />

Twitter interface (JTwitter) [1]. The Twitter application is<br />

also accessible via the Pleopatra Tools GUI.<br />

Figure 3. Pleopatra: the first twittering dinosaur in the world<br />
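
The monitoring step can be illustrated with a short sketch. It is written in Python for brevity, whereas the actual tool is built on the Java API; the thresholds, sensor keys, and all message texts except the temperature example quoted above are assumptions.<br />

```python
# Hypothetical rules: (event type, sensor key, predicate on the reading).
RULES = [
    ("too_hot", "temperature_c", lambda v: v > 30),
    ("too_cold", "temperature_c", lambda v: v < 10),
    ("sudden_darkness", "light_level", lambda v: v < 0.05),
    ("loud_noise", "sound_db", lambda v: v > 90),
]

# Stand-in for the small message database; {value} is replaced by the reading.
MESSAGES = {
    "too_hot": "{value} centigrades? It is very hot in here!",
    "too_cold": "Only {value} centigrades? I am freezing!",
    "sudden_darkness": "Who turned off the lights?",
    "loud_noise": "{value} dB? That is far too loud!",
}


def detect_events(readings):
    """Return (event type, value) pairs for every rule that fires."""
    events = []
    for event, key, is_extraordinary in RULES:
        if key in readings and is_extraordinary(readings[key]):
            events.append((event, readings[key]))
    return events


def compose_messages(readings):
    """Turn detected events into messages refined with sensor values."""
    return [MESSAGES[e].format(value=v) for e, v in detect_events(readings)]


print(compose_messages({"temperature_c": 35, "light_level": 0.8, "sound_db": 40}))
```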

The task we handled here is a typical example of a dual-restricted<br />

data selection process (see Figure 4). The raw data<br />

from the sensors (e.g. motor 4 is blocked at an angle of 35<br />

degrees) is transformed and filtered into higher-level<br />

data (e.g. somebody/something holds the front paw). The<br />

resulting data is then further filtered according to two resource<br />

limitations: first a more technical one (“what is extraordinary<br />

enough to be presented?”) and then a more cognitive one<br />

(“how much information do we want to publish?”). We will<br />

get back to that concept in more detail later on.<br />

Figure 4. Dual restriction on data<br />
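
A minimal sketch of this two-stage selection is given below. The `rarity` score, the motor-to-body-part mapping, and both default limits are assumptions introduced for illustration; the paper does not specify how either restriction is computed.<br />

```python
# Hypothetical mapping from motor index to the body part it moves.
MOTOR_TO_PART = {4: "front paw"}


def interpret(raw):
    """Transform one raw reading into a higher-level observation, or None."""
    if raw.get("sensor") == "motor" and raw.get("blocked"):
        part = MOTOR_TO_PART.get(raw["index"], "limb")
        return {
            "fact": f"somebody/something holds the {part}",
            "rarity": raw.get("rarity", 0.0),  # assumed score in [0, 1]
        }
    return None


def select(raw_readings, min_rarity=0.9, max_items=1):
    """Apply both restrictions and return the facts worth publishing."""
    observations = [o for o in map(interpret, raw_readings) if o is not None]
    # Restriction 1 (technical): what is extraordinary enough to be presented?
    candidates = [o for o in observations if o["rarity"] >= min_rarity]
    # Restriction 2 (cognitive): how much information do we want to publish?
    candidates.sort(key=lambda o: o["rarity"], reverse=True)
    return [o["fact"] for o in candidates[:max_items]]


readings = [
    {"sensor": "motor", "index": 4, "blocked": True, "angle": 35, "rarity": 0.95},
    {"sensor": "motor", "index": 2, "blocked": True, "angle": 10, "rarity": 0.40},
    {"sensor": "motor", "index": 7, "blocked": False, "angle": 0, "rarity": 0.99},
]
print(select(readings))
```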


We argue that a toy robot sensing its environment is comparable<br />

to a sensor-equipped car when it comes to automatic<br />

status message generation. For the system to work properly, the<br />

driver has to be identified by his or her Twitter ID, just as each<br />

Pleo connected to the Pleopatra Tools API must be recognized<br />

by its serial ID before starting the Twitter application.<br />

In a car environment, this could be achieved for instance by<br />

checking the Bluetooth ID of the driver’s phone. Typical car<br />

sensors are much more complex than the sensors we have<br />

seen on the Pleo robot, and access to data is usually not<br />

as uniform as through a single USB interface. Data accessible in a<br />

car includes current position, speed, heading, temperature (inside<br />

and outside), etc.<br />


The Controller Area Network (CAN) interface standard [2]<br />

was specified by Bosch in 1991 and is nowadays widely<br />

used in cars. It was devised to enable communication between<br />

subsystems of the car, since each subsystem may need<br />

to control actuators or receive feedback from sensors. The<br />

CAN bus may be used in vehicles to establish a connection<br />

between the transmission and the engine control unit (the car’s main<br />

processor), or, for example, to connect the window openers,<br />

air conditioning, seat control, etc.<br />
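
To give an impression of what reading a value from the CAN bus involves, the following sketch decodes a speed signal from a single data frame. A standard CAN 2.0A frame carries an 11-bit identifier and up to eight data bytes; the identifier `0x123`, the byte layout, and the scaling used here are purely hypothetical, since real assignments are manufacturer-specific.<br />

```python
import struct
from dataclasses import dataclass


@dataclass
class CanFrame:
    can_id: int   # 11-bit identifier of a standard CAN 2.0A frame
    data: bytes   # payload of 0 to 8 bytes


SPEED_FRAME_ID = 0x123  # hypothetical identifier for a speed message


def decode_speed_kmh(frame):
    """Return the speed in km/h, or None if this is not a speed frame."""
    if frame.can_id != SPEED_FRAME_ID or len(frame.data) < 2:
        return None
    # Assumed layout: big-endian unsigned 16-bit value in bytes 0 and 1.
    (raw,) = struct.unpack_from(">H", frame.data, 0)
    return raw / 100.0  # assumed scaling: 0.01 km/h per unit


print(decode_speed_kmh(CanFrame(0x123, bytes([0x03, 0x20]))))
```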

The number of pre-fabricated messages needed for useful<br />

tweet generation in a car is by far higher than the few dozen<br />

messages in our Pleopatra prototype. Nevertheless, the<br />

basic principle stays the same: sensor data is monitored, exceptional<br />

values are matched to a database of pre-fabricated<br />

messages, and blanks in the message are filled with current<br />

values. The driver then only needs to accept a message for<br />

sending, which we expect to be significantly less distracting than<br />

composing a message on a mobile device.<br />


Selection of relevant information based on a constant sensor<br />

data or information stream is not a trivial task. In [11],<br />

Maybury presents the SumGen system, which “selects key<br />

information from an event database by reasoning about event<br />

frequencies, frequencies of relation between them, and domain<br />

specific importance measures.” The system is able to<br />

tailor a summarized report for a stereotypical user.<br />

More recent works aim at performing such a summarization<br />

in real time in order to emulate a reporter at, for instance, a<br />

sports event. The IVAN system [6] “generates affective commentary<br />

on a tennis game that is given as an annotated video<br />

in real-time. The system employs two distinguishable virtual<br />

agents that have different roles (TV commentator, expert),<br />

personality profiles, and positive, neutral, or negative<br />

attitudes to the players.”<br />

In our example, the information streams to be monitored are<br />

sensor data. Defining which data is “extraordinary” is rather<br />

straightforward here: If the usual environment temperature<br />

of the Pleo dinosaur ranges between 18 and 23 degrees centigrade,<br />

then 35 degrees centigrade is extraordinary. If the dinosaur does not<br />

have any input on the touch sensor on its back for 90 percent<br />

of the time, then getting an input there is extraordinary.<br />
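
Both criteria from this paragraph can be expressed in a few lines. The sketch assumes that the usual range is known in advance and that a history of past binary sensor samples is available; the function names and the 10 percent default are illustrative choices derived from the 90 percent figure in the text.<br />

```python
def is_outside_usual_range(value, usual_low, usual_high):
    """A reading is extraordinary if it leaves the usual range."""
    return value < usual_low or value > usual_high


def activity_ratio(history):
    """Fraction of past samples in which a binary sensor was active."""
    return sum(history) / len(history) if history else 0.0


def is_rare_activation(active_now, history, max_ratio=0.1):
    """An activation is extraordinary if the sensor is usually inactive."""
    return active_now and activity_ratio(history) <= max_ratio


# Temperature example from the text: usual range 18 to 23 degrees centigrade.
print(is_outside_usual_range(35, 18, 23))

# Back touch sensor that was active in only 5 of the last 100 samples.
back_touch_history = [False] * 95 + [True] * 5
print(is_rare_activation(True, back_touch_history))
```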

The interpretation of sensor data usually depends on the context.<br />

In a toy context such as our Pleopatra prototype, there is not<br />

much variation of context. The dinosaur usually stays more<br />

or less in the same environment, and extracting information<br />

from sensor data is straightforward.<br />

In the automotive context, we have to extend our information<br />

flow example from Figure 4. The car is moving in a complex<br />

environment, so in order to double-check our interpretation of<br />

the sensor data, we need additional environmental evidence<br />

as a second component. If the car is on the highway and<br />

moving at an extraordinarily slow speed or even not at all, it<br />

does not necessarily mean that the driver is stuck in a traffic<br />

jam. The driver might just as well be resting in a parking lot or visiting a<br />

fast food restaurant’s drive-through. But if we have, for<br />

instance, traffic information announcing a traffic jam on that<br />

highway to verify our interpretation, the interpretation becomes<br />

more reliable. So our first resource limitation is environmental<br />

evidence:<br />

sensor data<br />

+ environmental evidence<br />

= interpretation of the situation<br />
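
The first restriction can be sketched as a function that only commits to an interpretation when the sensor data is backed by environmental evidence. The speed threshold, the parameter names, and the three return labels are assumptions made for this example.<br />

```python
def interpret_situation(speed_kmh, on_highway, jam_reported, slow_below_kmh=20.0):
    """Combine sensor data with environmental evidence.

    Returns "traffic_jam" only when slow highway driving is confirmed by a
    traffic report, "unconfirmed" when the car is slow but no report backs
    the interpretation, and "normal" otherwise.
    """
    if not (on_highway and speed_kmh < slow_below_kmh):
        return "normal"
    return "traffic_jam" if jam_reported else "unconfirmed"


print(interpret_situation(5, on_highway=True, jam_reported=True))
print(interpret_situation(0, on_highway=True, jam_reported=False))
```

The "unconfirmed" case covers the parking lot and drive-through examples above: the sensor data alone looks like a traffic jam, but nothing verifies it.<br />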

The situation might be unusual or extraordinary, but to make<br />

it interesting and thus worth tweeting, another contextual<br />

component is usually needed. In our example: Being in a<br />

traffic jam could be something ordinary you encounter on<br />

your everyday commute, but being stuck close to your destination<br />

on a weekend trip is special. We add unusual context<br />

as part of the second, cognitive restriction:<br />

exceptional sensor data<br />

+ environmental evidence<br />

+ unusual context<br />

= relevant message<br />

At the same time, user-defined parameters like the desired frequency<br />

of status posts can be used to optimize the second<br />

resource limitation according to the driver’s needs.<br />
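
One way to combine the three components with a user-defined posting frequency is sketched below. The `PostingPolicy` class and its minimum-interval parameter are hypothetical; the paper only states that such parameters can be used, not how.<br />

```python
from dataclasses import dataclass


@dataclass
class PostingPolicy:
    min_interval_s: float                  # user-defined gap between posts
    last_post_time: float = float("-inf")  # no post has been made yet

    def should_post(self, exceptional, evidence, unusual_context, now):
        """Decide whether a message is relevant and allowed right now."""
        relevant = exceptional and evidence and unusual_context
        if relevant and now - self.last_post_time >= self.min_interval_s:
            self.last_post_time = now
            return True
        return False


policy = PostingPolicy(min_interval_s=600)
print(policy.should_post(True, True, True, now=0))     # relevant, first post
print(policy.should_post(True, True, True, now=300))   # too soon after last post
print(policy.should_post(True, True, False, now=1000)) # ordinary context
```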


We presented a prototype of a twittering toy dinosaur and argued<br />

that the introduced principle could, with increased<br />

complexity and some modifications, be used for the automated<br />

generation of tweets in a car. This automation would reduce<br />

the risk of driver distraction, especially for power users of<br />

social networks who have an urge to stay connected to their<br />

environment. This is of course just a part of the solution.<br />

Other communication goals need to be looked at and analyzed<br />

separately.<br />

In a next step, we can try to include automatic answering<br />

mechanisms. For instance, if driver A is on the way to person<br />

B, there could be an incoming tweet saying “@DriverA:<br />

Where are you?” and, based on the current status, the car<br />

could respond immediately: “I am on my way, but right now<br />

I am stuck in a traffic jam near Frankfurt, driving at less than<br />

10mph!”. This is just one example; the possibilities here are<br />

manifold.<br />
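
The answering example can be sketched as a simple pattern match on the incoming tweet. The regular expression, the status keys, and the fallback reply are assumptions; a real system would need far more robust language understanding.<br />

```python
import re

# Matches tweets such as "@DriverA: Where are you?" (pattern is illustrative).
WHERE_QUESTION = re.compile(r"^@(\w+):?\s+where are you\??\s*$", re.IGNORECASE)


def auto_reply(tweet, driver_handle, status):
    """Return an automatic answer, or None if the tweet is not understood."""
    match = WHERE_QUESTION.match(tweet.strip())
    if match is None or match.group(1).lower() != driver_handle.lower():
        return None
    if status["in_traffic_jam"]:
        return (
            "I am on my way, but right now I am stuck in a traffic jam "
            f"near {status['near']}, driving at less than "
            f"{status['max_speed_mph']}mph!"
        )
    return f"I am on my way, currently near {status['near']}."


status = {"in_traffic_jam": True, "near": "Frankfurt", "max_speed_mph": 10}
print(auto_reply("@DriverA: Where are you?", "DriverA", status))
```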


References<br />

1. JTwitter - the Java library for the Twitter API.<br />

http://www.winterwell.com/software/jtwitter.php, 2008.<br />

2. Bosch. CAN specification, 2.0.<br />

http://www.semiconductors.bosch.de/media/pdf/canliteratur/can2spec.pdf, 1991.<br />

3. C. Endres and D. Braun. Pleopatra Tools.<br />

http://www.dfki.de/pleopatra, 2009.<br />

4. C. Endres and D. Braun. Pleopatra: A Semi-Automatic<br />

Status-Posting Prototype For Future In-Car Use. In<br />

Adjunct proceedings of the 2nd International<br />

Conference on Automotive User Interfaces and<br />

Interactive Vehicular Applications (AutomotiveUI<br />

2010), page 7, Pittsburgh, PA, USA, November 2010.<br />

Figure 5. Twittering car (speech bubble: “We’re already so close to Paris, but now we hit a traffic jam!”)<br />

5. Governors Highway Safety Association.<br />

State cell phone use and texting while driving laws.<br />

http://www.ghsa.org/html/stateinfo/laws/cellphone_laws.html,<br />

2010.<br />

6. I. Gregory. Embodied presentation teams: A plan-based<br />

approach for affective sports commentary in real-time.<br />

Master’s thesis, Saarland University, 2010.<br />

7. K. Jarvi. RXTX : serial and parallel I/O libraries<br />

supporting Sun’s CommAPI. http://www.rxtx.org/,<br />

2006.<br />

8. A. Java, X. Song, T. Finin, and B. Tseng. Why we<br />

twitter: understanding microblogging usage and<br />

communities. In WebKDD/SNA-KDD ’07: Proceedings<br />

of the 9th WebKDD and 1st SNA-KDD 2007 workshop<br />

on Web mining and social network analysis, pages<br />

56–65, New York, NY, USA, 2007. ACM.<br />

9. A. N. Joinson. Looking at, looking up or keeping up<br />

with people?: motives and use of facebook. In CHI ’08:<br />

Proceeding of the twenty-sixth annual SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 1027–1036, New York, NY, USA, 2008. ACM.<br />

10. B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps<br />

about twitter. In WOSP ’08: Proceedings of the first<br />

workshop on Online social networks, pages 19–24,<br />

New York, NY, USA, 2008. ACM.<br />

11. M. T. Maybury. Generating summaries from event data.<br />

Inf. Process. Manage., 31:735–751, September 1995.<br />

12. C. A. Middleton and W. Cukier. Is mobile email<br />

functional or dysfunctional? two perspectives on<br />

mobile email usage. European Journal of Information<br />

Systems, 2006.<br />

13. O. Turel and A. Serenko. Is mobile email addiction<br />

overlooked? Commun. ACM, 53(5):41–43, 2010.<br />


Multimodal Summarization of Complex Sentences<br />

Naushad UzZaman<br />

Computer Science Department<br />

University of Rochester<br />

naushad@cs.rochester.edu<br />


Abstract<br />

In this paper, we introduce the idea of automatically<br />

illustrating complex sentences as multimodal summaries<br />

that combine pictures, structure and simplified compressed<br />

text. By including text and structure in addition to pictures,<br />

multimodal summaries provide additional clues of what<br />

happened, who did it, to whom and how, to people who<br />

may have difficulty reading or who are looking to skim<br />

quickly. We present ROC-MMS, a system for automatically<br />

creating multimodal summaries (MMS) of complex<br />

sentences by generating pictures, textual summaries and<br />

structure. We show that pictures alone are insufficient to<br />

help people understand most sentences, especially for<br />

readers who are unfamiliar with the domain. An evaluation<br />

of ROC-MMS in the Wikipedia domain illustrates both the<br />

promise and challenge of automatically creating multimodal<br />

summaries.<br />

Author Keywords<br />

Multimodal summarization, summarization, visualization,<br />

illustration, picture, text-to-picture, automatic illustration,<br />

sentence compression, pictorial representation, AAC,<br />

augmentative and alternative communication, ROC MMS.<br />

General Terms<br />

Algorithms, Experimentation.<br />

ACM Classification Keywords<br />

H5.m. Information interfaces and presentation (e.g., HCI):<br />

Miscellaneous; I.2.7 [Artificial Intelligence]: Natural<br />

Language.<br />


Pictures, diagrams and illustrations are included in<br />

manually-created text because they help people<br />

comprehend and remember information [1]. Including<br />

alternative, supportive representations of text might help<br />

people with reading difficulties understand text better, for<br />

instance those reading text not in their first language,<br />

Permission to make digital or hard copies of all or part of this work for<br />

personal or classroom use is granted without fee provided that copies are<br />

not made or distributed for profit or commercial advantage and that<br />

copies bear this notice and the full citation on the first page. To copy<br />

otherwise, or republish, to post on servers or to redistribute to lists,<br />

requires prior specific permission and/or a fee.<br />

<strong>IUI</strong>’11, February 13–16, 2011, Palo Alto, California, USA.<br />

Copyright 2011 ACM 978-1-4503-0419-1/11/02...$10.00.<br />

Jeffrey P. Bigham<br />

Computer Science Department<br />

University of Rochester<br />

jbigham@cs.rochester.edu<br />

61<br />

James F. Allen<br />

Computer Science Department<br />

University of Rochester<br />

james@cs.rochester.edu<br />

children, older adults, or people with cognitive disabilities.<br />

Unfortunately, creating illustrations is expensive and timeconsuming,<br />

and consequently most text has only a few<br />

Figure 1: Multimodal summary (MMS) of the sentence, “In<br />

1492, Genoese explorer Christopher Columbus, under contract<br />

to the Spanish crown, reached several Caribbean islands,<br />

making first contact with the indigenous people.”<br />

illustrations, if any at all. In this paper we introduce ROC-<br />

MMS, a system that automatically converts existing text to<br />

multimodal summaries (MMS) that capture the meaning of<br />

a complex sentence in a diagram containing pictures and<br />

simplified text related by structure extracted from the<br />

original sentence.<br />

Motivated by sayings like “A picture is worth a thousand<br />

words,” prior work on Automatic Illustration and<br />

Text-to-Picture synthesis has approached the very difficult problem<br />

of generating pictorial replacements for text. Although this<br />

is an interesting challenge, existing systems have generally<br />

found success only within the domain of simple sentences<br />

of the type found in children’s books [2-4]. The problem of<br />

multimodal summarization relaxes the problem by allowing<br />

text to augment pictorial and structural information.<br />

Automatic Illustration is inherently difficult. To understand<br />

the problem better, we initially asked two annotators¹ to<br />

identify the main idea² (main event) and related entities<br />

(subject, object, etc.) from sentences and find representative<br />

pictures. Sentences were chosen from the Wikipedia entries<br />

United States and France, and annotators were asked to<br />

include Wikipedia pictures in their illustrations. The<br />

annotators reported that it was too difficult to illustrate<br />

19.59% of the entities using Wikipedia pictures and thought<br />

that 15.08% of entities couldn’t be represented with<br />

pictures at all (e.g. “territory”, “height of power”, “French<br />

War of religion”, etc., and temporal expressions in general).<br />

These results suggest that it will often be difficult to find<br />

appropriate pictures and that some entities are inherently<br />

hard to illustrate with pictures. It can be particularly<br />

difficult to represent entities in an unfamiliar domain. For<br />

instance, if someone doesn’t know what Christopher<br />

Columbus looks like, even a good picture of Christopher<br />

Columbus will only convey general attributes (man,<br />

possibly historical).<br />

¹ Annotators are graduate students and not among the authors.<br />

Their annotations were used as a gold standard in our evaluation.<br />

² In this paper, we use main idea, main concept and main event<br />

interchangeably.<br />

To remedy this problem, MMSs keep both images and<br />

representative text, unlike previous systems for automatic<br />

illustration [2-6]. In this way, we can handle cases lacking a<br />

good picture and address cases that are hard to illustrate.<br />

Presenting pictures and text together can also improve both<br />

the understanding and remembering of concepts. According<br />

to dual coding theory [7], text and pictures result in two<br />

different kinds of conceptual representations. These<br />

representations may allow independent access to<br />

information and hence benefit retention. Picture and text<br />

repeat important information, and may have similar<br />

beneficial effects on memory as explicit repetitions [8, 9].<br />

Processing the information twice, once as text and once as a<br />

picture, may facilitate comprehension and memory. Finally,<br />

pictures often have a motivating effect: text with pictures<br />

may be more enjoyable to read, since the reader does not<br />

have to work as hard to understand it, and pictures also<br />

facilitate comprehension of the text broadly, beyond what<br />

is illustrated [10]. Our decision to include text alongside<br />

pictures is therefore backed by theories suggesting that the<br />

combination helps people understand and remember.<br />

To keep the MMS representations simple and easy to<br />

process, we simplify text so that it retains only the most<br />

important information, instead of the full text. We define<br />

the most important information as the subject (who did it),<br />

the event (what action), object (to whom or what) and<br />

prepositions directly related to the subject, main event, or<br />

object (how). This effectively converts complex sentences<br />

into simpler sentences. In this way, the reader can read out<br />

the text as a simple sentence in addition to seeing the<br />

pictorial view, making it easier to remember and understand<br />

text, and relate it to the full, complex text if they choose,<br />

such as when searching for details abstracted out of the<br />

MMS view.<br />

MMS can potentially help a diversity of readers. For<br />

example, highly-capable readers may use MMS to skim<br />

content or understand content more easily. The alternative,<br />

simplified representation it provides may be useful for<br />

children who are learning to read and for second language<br />

learners, as seeing pictures together with text may enhance<br />

learning [11]. Furthermore, it has been previously shown<br />

that when one component of the reading process is<br />

dysfunctional, other compensating skills may become<br />

highly developed [12]. It is estimated that more than 2<br />

million people in the United States have significant<br />

communication impairments that lead them to rely on<br />

methods other than natural speech alone for communication<br />

[13]. Automatic Illustration of texts may eventually help<br />

these people understand text better. Automatic illustration<br />

can also help to support other representations like Pictorial<br />

Temporal representation [14] or can be paired-up with<br />

screen reading applications [15], which could further<br />

benefit people who have problems reading by allowing<br />

them to see content in multiple forms while listening to it<br />

being read.<br />

We define multimodal summarization of complex sentences<br />

as the combination of illustrations and a compressed form<br />

of the sentence text in simple sentence structure. In the next<br />

section we will describe the challenges for multimodal<br />

summarization and describe related work for the required<br />

subtasks. We then describe ROC-MMS, our system for<br />

multimodal summarization and describe an evaluation of it.<br />

Finally, we discuss potential for future work.<br />


Multimodal summarization (MMS) of complex sentences<br />

gives readers the main idea of the sentence using pictures<br />

and compressed text structured as a simple sentence. Creating<br />

MMSs is challenging and involves many subtasks. In this<br />

section, we will describe each of the subtasks and the<br />

related work for each subtask, and the approach taken in<br />

ROC-MMS. The general steps in the MMS approach are<br />

the following:<br />

1. Identify both the main idea of the sentence and related<br />

entities and use them to create a compressed summary.<br />

2. Extract pictures for the entities.<br />

3. Add structure to the pictures and text.<br />

Identifying the main idea and related entities<br />

Natural language sentences often convey multiple ideas, but<br />

representing multiple ideas with pictures can quickly<br />

become confusing. We, therefore, chose to express only the<br />

main idea of a sentence with MMS. If readers can<br />

understand the main idea of the sentence, then they may be<br />

able to later use the original text to decipher further details.<br />

The subtask of identifying the main idea of the sentence<br />

itself has two components. First, the important idea (the<br />

main event or main action) must be extracted, and, second,<br />

the entities related to the main idea need to be extracted, as<br />

illustrated in the following example drawn from Wikipedia:<br />

“In 1492, Genoese explorer Christopher Columbus, under<br />

contract to the Spanish crown, reached several Caribbean<br />

islands, making first contact with the indigenous people.”<br />

The summary or compressed form of the sentence is<br />

“Christopher Columbus reached several Caribbean islands<br />

in 1492.” Hence, the main event or main idea in the<br />

sentence is reached and the entities related to the event<br />

reaching are Christopher Columbus (subject), several<br />

Caribbean islands (object) and 1492 (preposition in).<br />
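
As a rough illustration of this compressed form, the main event and its related entities can be held in a small record and read back out as a simple sentence. The sketch below is illustrative only and is not the ROC-MMS implementation; the class name `CompressedSentence` and its fields are hypothetical, and the word order (subject, event, object, then prepositions) is an assumption taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class CompressedSentence:
    """Main event plus the entities directly related to it (hypothetical record)."""
    subject: str
    event: str
    obj: str = ""
    # (preposition, entity) pairs, e.g. ("in", "1492")
    prepositions: list = field(default_factory=list)

    def to_text(self) -> str:
        """Read the record back out as a simple sentence."""
        parts = [self.subject, self.event]
        if self.obj:
            parts.append(self.obj)
        for prep, entity in self.prepositions:
            parts.append(f"{prep} {entity}")
        return " ".join(parts) + "."


columbus = CompressedSentence(
    subject="Christopher Columbus",
    event="reached",
    obj="several Caribbean islands",
    prepositions=[("in", "1492")],
)
print(columbus.to_text())
```

For the Columbus sentence this reproduces the compressed form quoted above.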

A similar problem already addressed in the natural language<br />

processing community is called sentence compression [16].<br />

In sentence compression, unnecessary information is<br />

removed while retaining the grammaticality of the sentence.<br />

Sentence compression might remove related entities of<br />

the main event in the process of removing unnecessary<br />

information. This approach also doesn’t give a simple<br />

sentence structure.<br />

Another approach is main event extraction using the<br />

TimeML annotation scheme [17]. In this scheme, the main<br />

event label corresponds to the main idea of the sentence.<br />

Most competitive systems use syntactic and semantic<br />

information and machine-learning classifiers to identify<br />

events. For an overview of recent systems in this area, see<br />

the results of TempEval-2 [18]. The main events are<br />

annotated as part of the TempEval-2 task, although results<br />

on identifying main events were not explicitly reported.<br />

In the literature on Automatic Illustration for extracting<br />

entities, a popular approach has been to first extract<br />

representative keywords and then generate images for these<br />

keywords [6]. Keyword extraction has been studied in the<br />

natural language processing/information retrieval<br />

community [19, 20]. Goldberg et al. [2, 4] extract actions<br />

(events), who did them and to whom. They don’t focus on<br />

identifying only the important idea (action) because their<br />

experimental domain only contains short and simple<br />

sentences (and are, therefore, unlikely to contain more than<br />

one event). They convert the problem of identifying entities<br />

to a sequence labeling problem and use Conditional<br />

Random Fields for classification. On the other hand,<br />

Mihalcea and Leong [3] do not try to extract the entities,<br />

but they extract the pictures word-by-word and represent<br />

them linearly. Both approaches work best on simple<br />

sentences in which order roughly matches the role of the<br />

extracted entities. The ROC-MMS system includes a full<br />

natural language parse of the complex sentence in order to<br />

extract entities regardless of the order in which they appear.<br />

Extracting Pictures for Text<br />

Once we have the event and related entities, we next extract<br />

pictures to represent each concept. The task of associating<br />

words to pictures is similar to image retrieval. Although<br />

some work uses computer vision techniques for retrieval,<br />

most work (including popular image search engines) relies<br />

primarily on the text found near images in documents to<br />

find general images [21]. ROC-MMS generally follows this<br />

approach as well, but uses additional information<br />

automatically generated from the structure of the sentence<br />

to weight its search terms.<br />

Text-to-scene conversion places objects in a 3D environment<br />

and is intended to aid graphic designers. It usually works<br />

with detailed descriptive text containing visual and spatial<br />

elements. One of the best-known systems of this kind is<br />

WordsEye [22]. Such systems are usually not intended as<br />

assistive tools to communicate general text: their input<br />

typically describes a scene, such as “the house<br />

is 7 foot tall with two glass window and a door”, and the<br />

system interprets the natural language to create<br />

a 3D environment of the described situation. In contrast,<br />

we want to take a sentence from an existing news source,<br />

Wikipedia, or a book and represent it with pictures to help<br />

people understand the text better.<br />

Barnard and Forsyth [23] introduced the idea of<br />

auto-illustration as the inverse of auto-annotation. Joshi et al. [6]<br />

approached this problem by considering pair-wise<br />

reinforcement based on both visual and WordNet-based<br />

lexical similarity. Their work identifies a few representative<br />

pictures for a story, which has practical applications such as<br />

identifying representative pictures for news articles or<br />

other articles, but it is not appropriate for our problem.<br />

Goldberg et al. [2, 4] built their own database of images to<br />

use for certain text; if they could not find an appropriate<br />

image in their database, they ran a web image search and<br />

applied vision techniques to identify the appropriate<br />

picture. Mihalcea and Leong [3] use an in-house image<br />

database, PicNet, and other resources³.<br />

Adding Structure to Improve Understanding<br />

Having identified pictures and compressed text, the final<br />

step is to combine these elements in a layout structurally<br />

representative of what happened, who did it, to whom and<br />

how. To our knowledge, the only other work that attempts<br />

to address this problem is Goldberg et al. [2]. Their system<br />

identifies "who", "what action" and "to whom" by<br />

converting the problem into sequence labeling. They<br />

propose a layout represented by the sequence ABC, where<br />

A represents who did the action, B is what action was done<br />

and C is to whom. An example output of their system for<br />

“The girl rides the bus to school in the morning” is below:<br />

Figure 2: Example output of [2] illustrating the labeling of<br />

sequences where each element is assigned a picture.<br />

In that work, the text itself is discarded and the sentence is<br />

represented only with pictures. Images incorrectly extracted<br />

in the previous step may confuse people more than help<br />

them, because there is no additional information to guide<br />

them to the correct interpretation. MMS includes the extracted<br />

text to guard against such errors. With both pictures and<br />

compressed text, we can also use text to represent<br />

hard-to-depict but important entities that prior work may<br />

ignore. We do not attempt to represent events (the action)<br />

with a picture, since this is a much more challenging task.<br />

³ http://tell.fll.purdue.edu/JapanProj/FLClipart/<br />

Goldberg et al. also identify the A (who), B (what action)<br />

and C (to whom) of their ABC layout by converting the task to a<br />

sequence-tagging problem, which is well studied in NLP<br />

[24]. The problem with that approach is its requirement for<br />

hand-labeled training data, which is a barrier to<br />

adapting the solution to a different or more complex<br />

domain. ROC-MMS uses dependency parsing to identify<br />

similar dependencies or related entities without needing<br />

hand-annotated training data.<br />

Finally, they restrict their attention to single simple<br />

sentences and their experiments were on domains that use<br />

very simple English, such as short narratives written by and<br />

for individuals with communicative disorders; one-sentence<br />

news synopses written in simple English targeting foreign<br />

language learners; and the child writing sections of the<br />

LUCY corpus. For complex sentences, they anticipate the<br />

use of text simplification to convert complex text into a set<br />

of appropriate inputs for their system. It is not clear how<br />

well they can eventually represent the complex sentences in<br />

their layout, since they are not considering “how”<br />

something happened.<br />

ROC-MMS addresses these problems for unrestricted texts<br />

that include complex and compound sentences.<br />

ROC-MMS<br />

In this section we will describe ROC-MMS, and how it<br />

approaches the subtasks described in the previous section.<br />

Identifying the main event(s)<br />

ROC-MMS finds concepts by identifying the events and<br />

related entities, and then identifies the main event to<br />

identify the main concept or the main idea.<br />

Event extraction<br />

Our notion of an event matches the TimeML temporal<br />

annotation scheme [17], which treats event as a cover<br />

term for situations that happen or occur.<br />

ROC-MMS extracts events using the TRIOS system [25],<br />

which had a very competitive performance in the TempEval<br />

2010 task for temporal information extraction [18]. The<br />

TRIOS system first parses text with the TRIPS parser [26]<br />

and uses hand-coded rules to extract events. The extraction<br />

rules are tuned for high recall and identify many more<br />

events than is necessary, including a few non-events. In the<br />

next step, a classifier is used as a filter to remove<br />

unnecessary events.<br />

The main event identification classifier takes all events for<br />

a sentence as input and identifies the main event from the<br />

sentence. In one of the tasks for TempEval 2010, main<br />

events were labeled. We used that labeled data to train our<br />

main event classifier. For this classification task, we used<br />

an off-the-shelf Markov Logic Network classifier<br />

(thebeast)⁴. As features, we used lexical features (word,<br />

stem, next word, previous word, previous verbal word<br />

sequence), syntactic features (part-of-speech tag, tense,<br />

voice, polarity, TimeML aspect, modality, pos sequence,<br />

previous verbal pos sequence, next pos, previous pos) and<br />

semantic features (abstract semantic class – ontology type,<br />

TimeML class, semantic roles and their arguments) of<br />

events. The syntactic and semantic features are mostly<br />

generated from TRIPS parser output, with some produced by<br />

other classifiers.<br />

⁴ http://code.google.com/p/thebeast/<br />

The classifier first identifies the main events in each<br />

sentence. We then run a second pass to ensure that every<br />

sentence has at least one main event: if the classifier did not<br />

identify a main event in a sentence, we take the first verbal<br />

event as the main event of that sentence. We back off to the<br />

first verbal event because it has a high baseline<br />

performance for the main-event identification task.<br />
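
The back-off rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ROC-MMS code: the `Event` record and its fields are hypothetical, events are assumed to be listed in sentence order, and "verbal" is approximated by Penn Treebank tags starting with `VB`.

```python
from dataclasses import dataclass


@dataclass
class Event:
    word: str
    pos: str               # part-of-speech tag, e.g. "VBD" or "NN"
    is_main: bool = False  # label assigned by the main-event classifier


def main_events_with_backoff(events):
    """Return classifier-selected main events, else the first verbal event."""
    selected = [e for e in events if e.is_main]
    if selected:
        return selected
    for e in events:  # events are assumed to be in sentence order
        if e.pos.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ
            return [e]
    return []  # no verbal event; the text does not specify this case


# The classifier marked nothing as main, so the rule backs off to "reached".
events = [Event("reached", "VBD"), Event("making", "VBG"), Event("contact", "NN")]
print([e.word for e in main_events_with_backoff(events)])
```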

Extract entities related to the event<br />

Instead of extracting all entities in the sentence [3], we<br />

extract only those entities related to the main event. We use<br />

the relations between the event and the related entities in<br />

the next step to structure them. From the parse produced<br />

by the Stanford dependency parser⁵, we use the<br />

dependencies⁶ to extract the subject (nominal subject -<br />

nsubj, agent), object (direct/indirect object - dobj/iobj,<br />

passive nominal subject - nsubjpass) and other<br />

dependencies (prepositions). For easier representation,<br />

we cluster all prepositional modifiers into a single entity,<br />

but include the preposition when representing it.<br />

An example will help to illustrate how we use the<br />

dependency output to extract related entities for the events.<br />

The following is the Stanford dependency parser output for<br />

the sentence, “French fur traders established outposts of<br />

New France around the Great Lakes.”<br />

amod(traders-3, French-1)<br />

nn(traders-3, fur-2)<br />

nsubj(established-4, traders-3)<br />

dobj(established-4, outposts-5)<br />

nn(France-8, New-7)<br />

prep_of(outposts-5, France-8)<br />

det(Lakes-12, the-10)<br />

nn(Lakes-12, Great-11)<br />

prep_around(established-4, Lakes-12)<br />

The main event here is established, the subject is traders,<br />

the object is outposts and the preposition (around) is Lakes.<br />

By propagating through nn (noun compound<br />

modifier) and amod (adjectival modifier)<br />

dependencies, we extract the following entities: (subject:<br />

“French fur traders”), (object: “outposts”) and (preposition:<br />

“Great Lakes”). For subject, object and prepositions, we<br />

propagate through the nn and amod in this way and extract<br />

the resulting entities. The next step is to find the<br />

representative pictures for the entities. If we fail to find an<br />

image for any entity, we propagate through all<br />

dependencies (instead of just nn and amod) to extract an<br />

entity phrase. For example, we would extract the phrase<br />

“outposts of New France” for the object and “the Great<br />

Lakes” for the preposition in the above example. We then<br />

search for the picture of the entity phrase, instead of the<br />

entity. These steps are described in more detail next.<br />

⁵ Stanford dependency parser:<br />

http://nlp.stanford.edu/software/lex-parser.shtml.<br />

⁶ Details on dependencies:<br />
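
The propagation step can be sketched directly on the dependency listing above. The code below is an illustrative sketch, not the ROC-MMS implementation: all function names are hypothetical, and because collapsed dependencies such as prep_of drop the preposition token, the entity phrase is approximated here (an assumption) by the contiguous token span covering everything below the head.

```python
import re
from collections import defaultdict

DEP_LINE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

PARSE = """\
amod(traders-3, French-1)
nn(traders-3, fur-2)
nsubj(established-4, traders-3)
dobj(established-4, outposts-5)
nn(France-8, New-7)
prep_of(outposts-5, France-8)
det(Lakes-12, the-10)
nn(Lakes-12, Great-11)
prep_around(established-4, Lakes-12)
"""
SENTENCE = ("French fur traders established outposts of New France "
            "around the Great Lakes").split()


def read_dependencies(text):
    """Parse 'rel(head-i, dep-j)' lines into (rel, head_index, dep_index)."""
    deps = []
    for line in text.splitlines():
        m = DEP_LINE.fullmatch(line.strip())
        if m:
            deps.append((m.group(1), int(m.group(3)), int(m.group(5))))
    return deps


def expand(head, deps, relations=None):
    """Token indices reachable from head, optionally via given relations only."""
    children = defaultdict(list)
    for rel, h, d in deps:
        if relations is None or rel in relations:
            children[h].append(d)
    seen, stack = {head}, [head]
    while stack:
        for d in children[stack.pop()]:
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen


def entity(head, deps, tokens):
    """Short entity: the head plus its nn and amod modifiers, in sentence order."""
    idx = expand(head, deps, {"nn", "amod"})
    return " ".join(tokens[i - 1] for i in sorted(idx))


def entity_phrase(head, deps, tokens):
    """Longer phrase: contiguous span covering all dependents of the head."""
    idx = expand(head, deps)
    return " ".join(tokens[min(idx) - 1:max(idx)])


def related_entities(deps, tokens, event_index):
    """Map each role of the main event to an (entity, entity phrase) pair."""
    roles = {"nsubj": "subject", "agent": "subject",
             "dobj": "object", "iobj": "object", "nsubjpass": "object"}
    found = {}
    for rel, h, d in deps:
        if h != event_index:
            continue
        role = roles.get(rel) or (rel if rel.startswith("prep_") else None)
        if role:
            found[role] = (entity(d, deps, tokens),
                           entity_phrase(d, deps, tokens))
    return found


deps = read_dependencies(PARSE)
for role, pair in related_entities(deps, SENTENCE, event_index=4).items():
    print(role, pair)
```

Keeping the short entity and the longer phrase side by side mirrors the fall-back described above: the phrase is only needed when no image is found for the entity.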

Extracting Pictures for Concepts<br />

Image retrieval is a complicated task, even for humans,<br />

because what constitutes a representative image is<br />

subjective. As a result, we simplified the problem by<br />

restricting our image search to Wikipedia, which we have<br />

found to often produce appropriate images. This has the<br />

following two benefits: (i) pictures of an entity are often<br />

found on the wiki page for that entity, and (ii) Wikipedia<br />

articles often have infobox pictures selected by human<br />

editors that are often correct and representative.<br />

Finding pictures for an event (“what action” according to<br />

[2]) is much harder. When humans are asked to find<br />

pictures for events, they will often search for the event<br />

along with subject or object. For example, for the event<br />

“conquered” in the context “Rome conquered the Gauls”,<br />

an appropriate image would likely include Roman soldiers<br />

(it would be even better if it somehow indicated that the<br />

conquering occurred in Gaul). Search results for conquered<br />

alone include the following images in the top results:<br />

Figure 3: First three results from Yahoo Image Search for the<br />

word “conquered” illustrating the difficulty in finding good<br />

representative pictures even for simple concepts.<br />

A useful heuristic for finding better representative images is<br />

therefore to concatenate the action with the subject and<br />

object (if available, or just subject or object, if the other one<br />

is not available). Often web image search results still do not<br />

return the most appropriate images for our use as the first<br />

result. This can be fine for humans, who may glance<br />

through the top few results and pick the most appropriate<br />

one. Restricting pictures only to Wikipedia is a simple way<br />

to produce better results.<br />
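
The query heuristic amounts to joining whichever parts are available. The helper below is a hypothetical illustration; the function name and the subject-event-object ordering are assumptions, since the text only says the terms are concatenated.

```python
def event_image_query(event, subject=None, obj=None):
    """Concatenate the action with whichever of subject and object is available."""
    return " ".join(part for part in (subject, event, obj) if part)


print(event_image_query("conquered", subject="Rome", obj="the Gauls"))
print(event_image_query("conquered", subject="Rome"))
```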

Our methods for identifying the pictures are described<br />

below with different modules.<br />

Module find_image_in_wikipage(wikiurl):<br />

(i) Find the infobox picture.<br />

(ii) If the infobox has multiple pictures, take the<br />

picture with the largest width⁷.<br />

(iii) If there is no infobox picture:<br />

    a. Find all images.<br />

    b. Tokenize each image filename⁸ with "_", ",",<br />

    "[A-Z]", and spaces as delimiters.<br />

    c. For each image:<br />

        i. Find the edit distance between the<br />

        tokenized filename and each word in the<br />

        wiki article name.<br />

        ii. Sum all scores; the sum is the relatedness<br />

        score for the image.<br />

    d. Return the picture with the highest score, together<br />

    with that score.<br />
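
Step (iii) of this module can be sketched as follows. This is an illustration under loudly stated assumptions rather than the authors' scoring function: the text names edit distance but selects the highest score, so the sketch substitutes a normalized similarity (`difflib.SequenceMatcher.ratio`, where 1.0 means identical), takes the best-matching filename token for each article word, and adds the alt-text score at weight 0.25. The filenames in the usage lines are made up.

```python
import re
from difflib import SequenceMatcher


def tokenize_filename(filename):
    """Split a filename on '_', ',', spaces, and before capital letters."""
    stem = filename.rsplit(".", 1)[0]                 # drop the extension
    stem = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)  # break camelCase
    return [t.lower() for t in re.split(r"[_,\s]+", stem) if t]


def similarity(a, b):
    """Stand-in for an edit-distance-based score; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()


def relatedness(filename, article_name, alt_text=""):
    """Sum over article words of the best token match; alt text weighs 0.25."""
    name_tokens = tokenize_filename(filename)
    alt_tokens = [t.lower() for t in alt_text.split()]
    score = 0.0
    for word in article_name.lower().split():
        score += max((similarity(word, t) for t in name_tokens), default=0.0)
        score += 0.25 * max((similarity(word, t) for t in alt_tokens),
                            default=0.0)
    return score


def best_image(filenames, article_name):
    """Return (filename, score) for the highest-scoring image, or (None, 0.0)."""
    if not filenames:
        return None, 0.0
    score, filename = max((relatedness(f, article_name), f) for f in filenames)
    return filename, score


# Hypothetical filenames for an article named "Fur trade".
print(tokenize_filename("Fur_traders_in_Canada_1777.jpg"))
print(best_image(["Map_of_Canada.png", "Fur_trade.jpg"], "Fur trade"))
```

Under this scoring a filename that contains every word of the article name scores at least one point per word, which makes a fixed threshold such as 1.0 easy to interpret.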

Module find_page_and_image(query):<br />

(i) Search with “wikipedia ” + query using the Yahoo<br />

search API⁹.<br />

(ii) Keep only en.wikipedia pages.<br />

(iii) Traverse the resulting wiki pages one by one:<br />

    (a) Get the representative image and its score<br />

    from the wiki page’s URL using the module<br />

    find_image_in_wikipage(result page).<br />

    (b) If the resulting image's score is above the<br />

    threshold (we used 1.0), return the<br />

    image.<br />

Module sentence_to_images(sentence):<br />

(i) Extract the events, the main event, and the entities and<br />

entity phrases related to the main event (all<br />

described in the previous section).<br />

(ii) For each of the dependencies (subject, object,<br />

prepositions):<br />

    (a) If any word in the entity links to a main Wikipedia entry:<br />

    find the image in those wiki URLs<br />

    using find_image_in_wikipage(wikiurl).<br />

    (b) If no result has been found so far and the entity<br />

    doesn't have a wiki link:<br />

    find the image using Yahoo search<br />

    with find_page_and_image(entity).<br />

    (c) If no result has been found so far and any word in the<br />

    entity phrase is linked to wiki URLs:<br />

    find the image in those wiki URLs<br />

    using find_image_in_wikipage(wikiurl).<br />

    (d) If no result has been found so far and the entity phrase<br />

    doesn’t have a wiki link:<br />

    find the image using Yahoo search<br />

    with find_page_and_image(entity phrase).<br />

⁷ We found that when there are multiple pictures, the picture with<br />

the larger width is usually the main representative picture.<br />

⁸ We rely mainly on the tokenized filename because (i)<br />

Wikipedia has very descriptive image filenames, and (ii) text<br />

descriptions next to images are not consistent: some pictures have<br />

lots of text and others have none, since contributors sometimes<br />

neglect it when the wiki entry is not too interesting.<br />

We also consider the alt tags of images, but these are very sparse,<br />

so we give that score a lower weight (we used 0.25 for alt tags<br />

and 1.0 for the image filename score).<br />

⁹ http://developer.yahoo.com/search/web/V1/webSearch.html<br />
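
The lookup order in the last two modules can be sketched with the page scraper and the web search passed in as plain functions. Everything here is a hypothetical stand-in rather than the ROC-MMS code: `find_in_page(url)` is assumed to return an `(image, score)` pair, `web_search(query)` a list of result URLs, and link detection is approximated by substring matching on the linked anchor text.

```python
def find_page_and_image(query, web_search, find_in_page, threshold=1.0):
    """First English Wikipedia result whose best image scores above threshold."""
    for url in web_search("wikipedia " + query):
        if "en.wikipedia" not in url:
            continue
        image, score = find_in_page(url)
        if image is not None and score > threshold:
            return image
    return None


def image_for_entity(entity, phrase, wiki_links, find_in_page, search):
    """Try lookups (a) to (d) in order; return the first image found, else None.

    wiki_links maps linked anchor text to its Wikipedia URL.
    search(query) returns an image or None, e.g. find_page_and_image.
    """
    def from_links(text):
        for anchor, url in wiki_links.items():
            if anchor in text:
                image, _ = find_in_page(url)
                if image is not None:
                    return image
        return None

    entity_linked = any(anchor in entity for anchor in wiki_links)
    phrase_linked = any(anchor in phrase for anchor in wiki_links)

    image = from_links(entity)                # (a) links inside the entity
    if image is None and not entity_linked:
        image = search(entity)                # (b) web search on the entity
    if image is None and phrase_linked:
        image = from_links(phrase)            # (c) links inside the phrase
    if image is None and not phrase_linked:
        image = search(phrase)                # (d) web search on the phrase
    return image


# Made-up data so the sketch runs without network access.
PAGES = {"https://en.wikipedia.org/wiki/New_France": ("New_France_map.png", 2.0)}


def find_in_page(url):
    return PAGES.get(url, (None, 0.0))


def web_search(query):
    return []  # pretend the search returns no pages


image = image_for_entity(
    "outposts", "outposts of New France",
    {"New France": "https://en.wikipedia.org/wiki/New_France"},
    find_in_page,
    lambda q: find_page_and_image(q, web_search, find_in_page),
)
print(image)
```

Passing the scraper and search in as arguments keeps the ordering logic testable on its own, independent of any particular search API.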

Consider the following clarifying example. The input<br />

sentence from Wikipedia is “French fur traders established<br />

outposts of New France around the Great Lakes.”<br />

(Underlined words are links to other Wikipedia pages).<br />

ROC-MMS identifies the main event (in this case,<br />

the only event) as established, and the extracted entities and<br />

entity phrases are: (subject: French fur traders), (subject<br />

phrase: French fur traders), (object: outposts), (object<br />

phrase: outposts of New France), (preposition: around –<br />

Great Lakes), (preposition around phrase: the Great Lakes).<br />

First consider the subject, French fur traders. “Fur traders”<br />

has a wiki link, but the page does not have an infobox. For<br />

the images on the linked page, we compute the edit distance<br />

between each tokenized filename and the article name (Fur<br />

trade) and select the best image according to the process<br />

described previously.<br />

Next we consider the object, outposts, which does not have a<br />

wiki link. We search using Yahoo!, restricting results to Wikipedia<br />

pages, which doesn’t return any images above threshold in<br />

the first 10 resulting pages. We then check the object phrase,<br />

outposts of New France; New France has a wiki link,<br />

and we find a representative picture from that link.<br />

In our algorithm, we search for the entity first, instead of<br />

checking wiki URLs in the entity phrase, because<br />

Wikipedia contributors sometimes fail to link entities to<br />

their wiki articles. For those cases, our find_page_and_image<br />

module finds the expected wiki article. So we try this step<br />

first and, if it fails, we then check the wiki links in the entity<br />

phrase, as shown in this example. Finally, the preposition (around)<br />

is Great Lakes, which links to its wiki article, and we get the<br />

representative picture for that too.<br />

If there are multiple wiki links in an entity (or entity phrase)<br />

then we find images from all wiki links and cluster them.<br />

Figure 4: Clustered image of Genoa and Christopher<br />

Columbus for entity “Genoese explorer Christopher Columbus”.<br />

We also cluster all prepositions. The sentence “The modern<br />

name ‘France’ derives from the name of the feudal domain<br />

of the Capetian Kings of France around Paris” contains two<br />

prepositions, from and around. We extract pictures for from<br />

the name of the feudal domain of the Capetian Kings of<br />

France and also for around Paris, and then combine them.<br />

Figure 6: Example of clustering prepositions.<br />

Our annotators were unable to find images to represent<br />

temporal expressions, and indeed this is a difficult problem.<br />

To handle that problem, we give special treatment to<br />

temporal expressions. To identify temporal expressions, we<br />

use the TRIOS temporal expression identification and<br />

normalization system¹⁰ [25], which had the second best<br />

performance in TempEval-2 [18]. When we identify a time,<br />

instead of searching for a picture of it, we represent it with<br />

a generic picture that conveys time and add the extracted<br />

text below it. One example is given below.<br />

Figure 5: The representation of a temporal expression includes<br />

the extracted text and a picture. The picture conveys time<br />

generally, but not a specific time.<br />

Structuring the images and compressed text<br />

The final step is to combine the image and compressed text<br />

into a structured format¹¹. Every sentence has a main event,<br />

which we don’t try to represent with pictures, a subject<br />

entity, object entity and clustered prepositions. We<br />

construct MMS using the following visual layout of these<br />

elements.<br />

Figure 7: Generalized visual layout for MMS.<br />

This representation is very similar to the ABC layout [2], since<br />

the subject and object are essentially who did the action and<br />

to whom; however, the primary difference is that MMS<br />

includes prepositions and does not attempt to find a picture<br />

for the main event. As mentioned earlier, it is not clear from<br />

the description of the ABC layout how it represents<br />

hard-to-depict events. It might have worked in their simple<br />

domain; however, the authors explained that they only find<br />

pictures for easy-to-depict words. Many events can be<br />

missed as part of the filtering process.<br />

ROC-MMS makes appropriate trade-offs that enable it to<br />

create MMS diagrams for arbitrary text, even text that<br />

includes complex sentences.<br />

10 The temporal expression normalizer is also available as open<br />

source at: http://www.cs.rochester.edu/u/naushad/temporal<br />

11 All our auto-generated diagrams are generated using GraphViz<br />

One example output from our system is given below:<br />

Figure 8: Multimodal summary (MMS) of the sentence,<br />

“French fur traders established outposts of New France around<br />

the Great Lakes; France eventually claimed much of the North<br />

American interior, down to the Gulf of Mexico.”<br />

Some sentences do not contain prepositions (or they<br />

may not be correctly extracted). In such cases, we show<br />

only the event, subject and object, as shown below.<br />

Figure 9: MMS of the sentence, “The Carolingian dynasty ruled<br />

France until 987, when Hugh Capet, Duke of France and Count<br />

of Paris, was crowned King of France.”<br />

For sentences lacking an object, we merge the event text<br />

with the subject text and show it in the subject text field. In the<br />

following example, died (event) is merged with Charles<br />

IV (subject).<br />

Figure 10: MMS of the sentence, “Charles IV ( The Fair ) died<br />

without an heir in 1328 .”<br />
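The three layout cases described above (full layout, no prepositions, no object) can be sketched in a few lines. This is a minimal sketch, not the ROC-MMS implementation: it assumes the subject, event, object and clustered prepositions have already been extracted, it leaves out the pictures, and it emits GraphViz DOT text, since footnote 11 names GraphViz as the drawing tool. The function name `mms_to_dot`, the node identifiers and the node shapes are illustrative assumptions.

```python
from typing import Optional, Sequence


def _quote(text: str) -> str:
    """Return text as a double-quoted DOT string with quotes escaped."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def mms_to_dot(subject: str, event: str, obj: Optional[str] = None,
               prepositions: Sequence[str] = ()) -> str:
    """Build DOT source for one sentence (hypothetical helper).

    With an object: subject -> event -> object.
    Without an object: the event text is merged into the subject field.
    Each prepositional phrase hangs off the event (or the merged subject).
    """
    lines = ["digraph MMS {", "  rankdir=LR;", "  node [shape=box];"]
    if obj is None:
        merged = subject + " " + event
        lines.append(f"  subj [label={_quote(merged)}];")
        anchor = "subj"
    else:
        lines.append(f"  subj [label={_quote(subject)}];")
        lines.append(f"  evt [shape=plaintext, label={_quote(event)}];")
        lines.append(f"  obj [label={_quote(obj)}];")
        lines.append("  subj -> evt;")
        lines.append("  evt -> obj;")
        anchor = "evt"
    for i, phrase in enumerate(prepositions):
        lines.append(f"  prep{i} [label={_quote(phrase)}];")
        lines.append(f"  {anchor} -> prep{i} [style=dashed];")
    lines.append("}")
    return "\n".join(lines)


# The Figure 10 sentence has no object, so "died" joins the subject text.
print(mms_to_dot("Charles IV", "died", None, ["without an heir", "in 1328"]))
```

The resulting text can be passed to the GraphViz `dot` command to render a picture; attaching images to the nodes is left out of this sketch.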


Illustrating a sentence with a diagram of pictures and text is<br />

difficult; evaluating how good a diagram is may be even<br />

harder because it is very subjective. In this evaluation<br />

section, we first evaluate the subtasks of our multimodal<br />

summarization system in isolation. We then evaluate how<br />

well our representation retains the overall information of<br />

the sentence. All our evaluations are done on 44<br />

sentences drawn from the Wikipedia articles on the United<br />

States and France.<br />

Identifying the Main Event and Related Entities<br />

We trained our main event identification classifier on<br />

TempEval-2 training data and tested it with 10-fold<br />

cross-validation. Our performance for main event<br />

identification was 77.94% (F-score). The baseline of<br />

choosing the first verbal event as the main event achieves<br />

59.64% on the TempEval domain. We ported that system to<br />

the Wikipedia domain and evaluated it considering each<br />

annotator as the gold standard. We calculated precision and<br />

recall for both cases; the performance is reported in Table 1.<br />

Metric Performance<br />

Precision 79.10%<br />

Recall 73.11%<br />

Fscore 75.98%<br />

Table 1. Main event identification performance<br />
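As a quick consistency check on Table 1, the F-score can be recomputed from the reported precision and recall. The sketch below assumes the balanced F-score (harmonic mean of precision and recall); the paper says only "Fscore", so that choice is an assumption.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from Table 1, in percent.
f = f_score(79.10, 73.11)
print(f"{f:.2f}")
```

Recomputing from the rounded inputs gives a value within about 0.01 of the reported 75.98%. A gap that small is what rounding of the inputs, or averaging the F-score over the two annotators, would be expected to produce.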

We extract entities by first traversing the nn (noun<br />

compound modifier) and amod (adjectival modifier)<br />

dependencies of the dependency tree. If that entity results in<br />

a good picture (the matching score is above threshold), we<br />

keep it; otherwise we traverse through all dependencies of<br />

the event, resulting in a phrase. Our extracted entities often<br />

do not exactly match the annotators’ entities but may<br />

partially 12 match them. We report the average<br />

performance (considering both annotators) of our system on<br />

entity extraction in Table 2. We only consider cases in<br />

which our system and the annotators identified the same<br />

main event.<br />
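The two-stage entity extraction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: dependencies are given as plain `(head, relation, dependent)` index triples rather than parser output, `picture_score` is a hypothetical callback standing in for the image matching score, the threshold value is arbitrary, and the fallback is read as "follow every dependency below the entity head".

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Edge = Tuple[int, str, int]  # (head index, relation name, dependent index)


def _children(edges: List[Edge]) -> Dict[int, List[Tuple[str, int]]]:
    """Index the edges by head so that traversal is a dictionary lookup."""
    table: Dict[int, List[Tuple[str, int]]] = {}
    for head, rel, dep in edges:
        table.setdefault(head, []).append((rel, dep))
    return table


def _collect(head: int, kids: Dict[int, List[Tuple[str, int]]],
             allowed: Optional[Set[str]] = None) -> List[int]:
    """Token indices of head and its descendants via the allowed relations.

    When allowed is None, every relation is followed.
    """
    found, stack = [], [head]
    while stack:
        node = stack.pop()
        found.append(node)
        for rel, dep in kids.get(node, []):
            if allowed is None or rel in allowed:
                stack.append(dep)
    return sorted(found)


def extract_entity(tokens: List[str], edges: List[Edge], head: int,
                   picture_score: Callable[[str], float],
                   threshold: float = 0.5) -> str:
    """Try the short nn/amod entity first; fall back to the full phrase."""
    kids = _children(edges)
    short = " ".join(tokens[i] for i in _collect(head, kids, {"nn", "amod"}))
    if picture_score(short) > threshold:
        return short
    return " ".join(tokens[i] for i in _collect(head, kids))


tokens = ["outposts", "of", "New", "France"]
edges = [(0, "prep", 1), (1, "pobj", 3), (3, "nn", 2)]
# With a picture score below the threshold, the whole phrase is returned.
print(extract_entity(tokens, edges, 0, lambda text: 0.0))
```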

Metric Performance<br />

Average strict precision 29.29%<br />

Average strict recall 31.64%<br />

Average relaxed precision 76.76%<br />

Average relaxed recall 83.82%<br />

Table 2. Entity extraction performance<br />
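The gap between the strict and relaxed rows of Table 2 follows from how a match is defined. The sketch below illustrates the two matching rules and the resulting precision and recall. It follows the footnote's definition of relaxed matching (one entity is a substring of the other); the lowercasing, the whitespace normalization, the function names and the example entities are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def strict_match(a: str, b: str) -> bool:
    """Entities match only when they are identical after normalization."""
    return _norm(a) == _norm(b)


def relaxed_match(a: str, b: str) -> bool:
    """Entities match when one is a (non-empty) substring of the other."""
    x, y = _norm(a), _norm(b)
    return bool(x) and bool(y) and (x in y or y in x)


def precision_recall(system: List[str], gold: List[str],
                     match: Callable[[str, str], bool]) -> Tuple[float, float]:
    """Precision over system entities, recall over gold entities."""
    sys_hits = sum(any(match(s, g) for g in gold) for s in system)
    gold_hits = sum(any(match(s, g) for s in system) for g in gold)
    precision = sys_hits / len(system) if system else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    return precision, recall


system = ["French fur traders", "outposts"]
gold = ["fur traders", "outposts of New France", "the Great Lakes"]
print(precision_recall(system, gold, strict_match))
print(precision_recall(system, gold, relaxed_match))
```

In this made-up example no entity matches strictly, while relaxed matching credits both system entities. Plain substring matching works on characters, so it will also accept accidental overlaps inside words; a token-level variant would be stricter.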

Extracting Pictures<br />

For evaluating how well our system extracts pictures, we<br />

compared our system output to extractions by two human<br />

annotators. We consider cases where our system and the<br />

annotator, with relaxed matching, identified the same main<br />

event and same entities and both extracted an image. In<br />

Table 3, we show the percentage of cases when both<br />

sides extracted an image, given that both<br />

extracted the same entity. Not all extracted entities have a<br />

picture: human annotators sometimes did not extract a<br />

picture because they thought a concept could not be<br />

illustrated with a picture, and sometimes because they found<br />

no suitable picture in Wikipedia to represent that entity.<br />

Our system likewise did not suggest a picture for an entity<br />

if no picture was found with a score above threshold. We<br />

also compared the two annotators with each other and show<br />

the average system performance against them. We can see<br />

that our system's rate is very similar to the rate between<br />

the two annotators.<br />

12 Either our entity is substring of annotator’s entity, or vice versa.<br />

Relaxed matching is partial matching.<br />

Evaluation | Both got an image for the same entity<br />

Annotator1 vs Annotator2 | 66.66%<br />

Average of Annotators vs System | 65.47%<br />

Table 3. Performance of Image Extraction<br />

On these selected matching pictures, we compare our<br />

extracted image with the images extracted by the<br />

annotators. We classify our output into Same Image (if both<br />

the system and annotators extracted the same image),<br />

Different Image but acceptable (e.g. for France, one<br />

extracted the French flag and the other extracted a map of<br />

France) and finally Bad Image (images from our system<br />

that we consider unacceptable, i.e. a wrong representation<br />

of the text). A judge, another graduate student who was<br />

neither an annotator nor an author, performed this<br />

classification.<br />

Evaluation | Ann 1 vs Ann 2 | Ann vs System (Average)<br />

Exact same image | 47.05% | 21.51%<br />

Different image, but acceptable | 52.95% | 44.15%<br />

Different and bad image | | 34.34%<br />

Table 4. Performance on quality of our extracted images<br />

We can see that our system extracts decent pictures around<br />

65% of the time.<br />

How well our structure with simple compressed text<br />

helps to understand text better<br />

In the previous subsections, we showed our performance in<br />

the different subtasks, which eventually propagates to the<br />

final performance; but overall how well does our system<br />

generate diagrams that convey the message of the content to<br />

the users? Does automatic illustration really help text<br />

comprehension? Do human-generated illustrations help for<br />

text comprehension? An illustration without text is unlikely<br />

to be useful if the domain is new to the reader because the<br />

reader won’t be able to interpret the pictures in the first<br />

place. That’s why MMS diagrams include simple<br />

compressed text and the simple structure along with the<br />

event, subject, object, and prepositions.<br />

In this section, we motivate MMS over picture-only<br />

diagrams by showing that users get a better understanding<br />

from the MMS diagrams generated by ROC-MMS than<br />

they do for diagrams containing only pictures, even when<br />

human annotators have identified the pictures.<br />

For this evaluation, we recruited participants on Amazon<br />

Mechanical Turk 13. In the task shown to participants, we<br />

showed our system-generated MMS diagram and asked the<br />

turkers to explain the diagram in English text. Participants<br />

were also given the option of saying that they “Can’t<br />

explain the diagram.” One example is shown in Figure 11.<br />

Figure 11: ROC-MMS generated diagram for “Gaul was<br />

conquered by Rome under Julius Caesar in the 1st century BC”<br />

Next we created a diagram using entities and pictures<br />

selected by human annotators (representing a gold<br />

standard), but we did not add the structural layout or text of<br />

our MMS diagram. Influenced by Mihalcea and Leong [3],<br />

our baseline ordered the pictures of the entities in the order<br />

of the sentence. For example, for the sentence, “Gaul was<br />

conquered by Rome under Julius Caesar in the 1st century<br />

BC”, we created the diagram with first a picture for Gaul,<br />

then the event conquered (in text), then a picture for Rome<br />

and finally Julius Caesar. The annotators thought 1st century<br />

BC was hard to illustrate, and so did not find a picture for it.<br />

We asked our annotators not to find pictures for events,<br />

since we do not represent events with pictures; instead, we<br />

added the text for events to the annotators’ diagrams.<br />

One example diagram is shown in Figure 12.<br />

Figure 12: Diagram using human identified entities and<br />

pictures for “Gaul was conquered by Rome under Julius Caesar<br />

in the 1st century BC”<br />

Although the pictures are accurate, it is quite difficult to<br />

find the meaning of this diagram. We see two maps; many<br />

people might not understand which country or place each<br />

one shows. Even if they were to somehow interpret the first<br />

one as Gaul and the second as Rome, they would read it<br />

wrongly as Gaul conquered Rome, because it is linearly<br />

ordered instead of using a subject, event, object structure<br />

like ours. In contrast, our diagram for the same example<br />

failed to get a good representative picture for Rome, and the<br />

Stanford parser failed to find that 1st century BC is also<br />

related to the event conquered; but with structure and text,<br />

many people were able to understand the content and<br />

produced something very similar to the original summary text.<br />

13 Mechanical Turk website: www.mturk.com. For this task, we<br />

paid $0.01 for explaining the diagram with text. For each sentence,<br />

we collected responses from 10 unique workers.<br />

For each sentence, 10 different turkers explained each<br />

diagram (both those generated by our system and those of<br />

the two annotators) in English text.<br />

We used Rouge [27], the automatic<br />

evaluation toolkit for summarization, to test how well their<br />

explanations retained the information of the original<br />

sentence’s summary. We generate the reference summaries<br />

using annotators’ identified entities and events and ordered<br />

them linearly like the diagram. For the example given<br />

above, our annotator’s reference sentence summary was<br />

“Gaul conquered Rome Julius Caesar 1st century BC”.<br />

These reference summary sentences are not grammatical<br />

and consist only of the main event and the important<br />

entities. The Rouge evaluation handles this well because it<br />

is based on n-gram matching and does not consider the<br />

grammaticality of sentences. For each system, we get the<br />

average Rouge score for each sentence (averaging over the<br />

10 turkers’ scores) and then average over all sentences. We<br />

also average the two annotators’ scores and report the<br />

average annotator Rouge score.<br />

In reporting our performance, we report both Rouge-1 and<br />

Rouge-L, since Rouge-1 14 and Rouge-L perform very well<br />

in evaluating very short summaries (headline-like<br />

summaries) [27]. We report<br />

precision (P), recall (R) and F-score (F).<br />

Evaluation | Rouge-1 (F / R / P) | Rouge-L (F / R / P)<br />

Explanation of Annotators’ diagrams | 0.0892482 / 0.0680995 / 0.1294495 | 0.08451066 / 0.0635695 / 0.1260265<br />

Explanation of the ROC-MMS diagrams | 0.2405093 / 0.26668 / 0.2190162 | 0.21649513 / 0.23619 / 0.199832<br />

Table 5. Rouge-1 and Rouge-L for explanation of annotators<br />

diagram (average) and our system diagram<br />
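To make the two measures in Table 5 concrete, the sketch below computes a simplified Rouge-1 (clipped unigram overlap) and Rouge-L (longest common subsequence) for one explanation against one reference, plus the two-level averaging described above. It is a simplified illustration and not the ROUGE toolkit [27]: it assumes lowercasing, whitespace tokenization, no stemming or stop-word handling, and a balanced F-score. The candidate explanation is invented for the example.

```python
from collections import Counter
from statistics import mean
from typing import List, Sequence, Tuple


def _prf(overlap: int, cand_len: int, ref_len: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score from an overlap count."""
    p = overlap / cand_len if cand_len else 0.0
    r = overlap / ref_len if ref_len else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Unigram overlap, counting each reference word at most as often as it occurs."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _prf(overlap, len(cand), len(ref))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Longest-common-subsequence overlap, which rewards matching word order."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _prf(lcs_length(cand, ref), len(cand), len(ref))


def mean_of_means(scores_by_sentence: List[List[float]]) -> float:
    """Average over turkers within each sentence, then over sentences."""
    return mean(mean(scores) for scores in scores_by_sentence)


reference = "Gaul conquered Rome Julius Caesar 1st century BC"
candidate = "Rome conquered Gaul under Julius Caesar"  # invented explanation
print(rouge_1(candidate, reference))
print(rouge_l(candidate, reference))
```

For this invented explanation, five of the six words occur in the reference, so Rouge-1 precision is 5/6 and recall is 5/8. Because Gaul and Rome are swapped, the longest common subsequence has only three words, so Rouge-L precision drops to 0.5 and recall to 0.375.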

The results match our intuition that participants did not do a<br />

very good job explaining the diagram with a sentence when<br />

they were provided with only pictures, even though human<br />

annotators selected these pictures. On the other hand, our<br />

system, despite the possibility of cascading errors from<br />

parsing, main event identification, entity extraction and<br />

identifying appropriate pictures, did a lot better (a Rouge-1<br />

F-score of about 0.24 versus 0.09).<br />

14 Rouge-1 is based on unigram and Rouge-L is based on longest<br />

common subsequence.<br />

Although the inclusion of text gave the MMS diagrams a bit<br />

of an advantage in the Rouge score measurement because it<br />

is based on n-grams, the result suggests that ROC-MMS is<br />

able to accurately identify the main concepts of the<br />

sentences and select pictures that are reasonable. More<br />

broadly, this evaluation shows the advantage of adding even<br />

minimal text, as many participants were largely unable to<br />

produce accurate descriptions of the diagrams containing<br />

only pictures. Surprisingly, few participants simply wrote the<br />

text contained within the MMS diagrams, suggesting that<br />

the evaluation was more nuanced.<br />

We believe that MMS diagrams will eventually be helpful<br />

for people who have trouble reading and understanding<br />

complex text and may help capable readers more easily<br />

skim documents. The end goal of MMS will be its ability to<br />

improve reading comprehension; ROC-MMS represents an<br />

important step in this direction.<br />


We evaluated ROC-MMS in the Wikipedia domain to show that<br />

multimodal summarization can be applied to complex text<br />

in order to generate diagrams that combine text, pictures,<br />

and structure. These evaluations have shown the promise of<br />

creating MMS diagrams completely automatically for<br />

arbitrary text, and suggest numerous future research<br />

opportunities.<br />

First, our system currently relies partly on Wikipedia. An<br />

obvious extension would be to explore its performance in<br />

raw text, and adapt its modules to handle more general<br />

resources. The TRIPS parser used in ROC-MMS already<br />

identifies named entities, which we may be able to use to find<br />

better pictures for specific kinds of entities, e.g., for people<br />

we might search for a portrait, and for a country, a flag or map.<br />

Multimodal summarization is in the middle of two<br />

extremes. One would be to consider all events, instead of<br />

main events, i.e. represent everything with pictures and text.<br />

This may be useful for people who have trouble reading and<br />

want to get as much information in multimodal<br />

representation as possible. The other extreme is to apply<br />

summarization to pick the important sentences first and<br />

then apply multimodal summarization only to the selected<br />

sentences. In this way, it will represent the important<br />

sentences and only the important information in those<br />

sentences. This could be very useful for capable readers to<br />

skim through articles. Exploring the relative benefits along<br />

this dimension could better characterize their potential.<br />

We simplified the problem of illustration by not<br />

representing events with pictures because events are usually<br />

hard to depict. Future work may try to illustrate events by<br />

more intelligently searching for events along with the

subject and object. We also want to extend the proposed<br />

multimodal summarization by adding speech modality [15].<br />

Finally, we want to extend our evaluation to look at how<br />

MMS (and other summary techniques) improve reading<br />

comprehension for the target groups who motivated this<br />

work – specifically people who have difficulty reading.<br />


In this paper, we approached the problem of visualizing text<br />

as multimodal summarization. To create MMS diagrams,<br />

we automatically summarize text by extracting simple<br />

sentence structures (subject – who did it, event – what<br />

happened, object – to whom, preposition – how) and<br />

illustrate the text with pictures and compressed text<br />

together. Our evaluation showed that we achieve good<br />

performance on all of the subtasks required to create MMS<br />

diagrams, and that the MMS diagrams generated by ROC-<br />

MMS were easier to understand than human illustrations<br />

with pictures alone. Our implementation and evaluation<br />

leveraged the Wikipedia domain, but the approach<br />

embodied in ROC-MMS can be generally extended to<br />

unrestricted text.<br />


We thank the three anonymous reviewers for their valuable<br />

feedback. We also thank Benjamin van Durme for his<br />

suggestion of prototyping on the Wikipedia domain, and<br />

Anna Loparev, Amal Fahad and Shantonu Hossain for help<br />

with annotation tasks.<br />


1. R. N. Carney and J. R. Levin, "Pictorial Illustrations Still<br />

Improve Students' Learning from Text," Educational<br />

Psychology Review, vol. 14, 2002.<br />

2. B. Goldberg, et al., "Easy as ABC? Facilitating pictorial<br />

communication via semantically enhanced layout.,"<br />

Twelfth International Conference on Computational<br />

Natural Language Learning, 2008.<br />

3. R. Mihalcea and B. Leong, "Toward communicating<br />

simple sentences using pictorial representations,"<br />

presented at the Association of Machine Translation in the<br />

Americas., 2006.<br />

4. J. Zhu, et al., "A text-to-picture synthesis system for<br />

augmenting communication.," in The Integrated<br />

Intelligence Track of the Twenty-Second AAAI<br />

Conference on Artificial Intelligence, 2007.<br />

5. K. Barnard, et al., "Matching words and pictures.,"<br />

Machine Learning Research, vol. 3, pp. 1107–1135, 2003.<br />

6. D. Joshi, et al., "The story picturing engine—a system for<br />

automatic text illustration.," ACM Transactions on<br />

Multimedia Computing, Communications, and<br />

Applications, vol. 2(1), 2006.<br />

7. Paivio, "Mental representations: A dual coding approach,"<br />

New York: Oxford University Press., 1986.<br />

8. M. Glenberg, "Component-levels theory of the effects of<br />

spacing of repetitions on recall and recognition.," Memory<br />

and Cognition, vol. 7, pp. 95-112, 1979.<br />

9. R. G. Greene, "Spacing effects in memory: Evidence for a<br />

two-process account.," Journal of Experimental<br />

Psychology: Learning, Memory, and Cognition, vol. 15,<br />

pp. 371-377, 1989.<br />

10. M. Glenberg and W. E. Langston, "Comprehension of<br />

illustrated text: pictures help to build mental models.,"<br />

Memory and Language, vol. 31, pp. 129–151, 1992.<br />

11. R. E. Mayer, Multimedia learning. Cambridge, UK:<br />

Cambridge University Press., 2001.<br />

12. U. Frith, "A developmental framework for developmental<br />

dyslexia," Annals of Dyslexia, vol. 36, pp. 69-81, 1985.<br />

13. S. L. H. Association, "Roles and responsibilities of speech-<br />

language pathologists with respect to augmentative and<br />

alternative communication: Technical report," ASHA<br />

Supplement, vol. 24, 2004.<br />

14. N. UzZaman, et al., "Pictorial Temporal Structure of<br />

Documents to Help People who have Trouble Reading or<br />

Understanding. ," International Workshop on Design to<br />

Read, CHI, Atlanta, GA, 2010.<br />

15. J. P. Bigham, et al., "WebAnywhere: A Self-Voicing,<br />

Web-Browsing Web Application," International<br />

Conference on the World Wide Web, Beijing, China, 2008.<br />

16. K. Knight and D. Marcu, "Summarization beyond sentence<br />

extraction: a probabilistic approach to sentence<br />

compression," Artificial Intelligence, vol. 139, pp. 91–107,<br />

2002.<br />

17. J. Pustejovsky, et al., "TimeML: Robust Specification of<br />

Event and Temporal Expressions in Text. ," in New<br />

Directions in Question Answering, 2003.<br />

18. J. Pustejovsky and M. Verhagen, "SemEval-2010 task 13:<br />

evaluating events, time expressions, and temporal relations<br />

(TempEval-2)," Workshop on Semantic Evaluations:<br />

Recent Achievements and Future Directions, 2010.<br />

19. Y. Matsuo and M. Ishizuka, "Keyword Extraction from a<br />

Single Document Using Word Co-Occurrence Statistical<br />

Information," International Journal on Artificial<br />

Intelligence Tools, vol. 13, pp. 157-170, 2004.<br />

20. R. Mihalcea and P. Tarau, "TextRank: Bringing Order into<br />

Texts," Proceedings of the Conference on Empirical<br />

Methods in Natural Language Processing (EMNLP 2004),<br />

Barcelona, Spain, 2004.<br />

21. R. Datta, et al., "Image retrieval: Ideas, influences, and<br />

trends of the new age," ACM Comput. Surv., vol. 40, pp. 1-<br />

60, 2008.<br />

22. Coyne and R. Sproat, "WordsEye: An automatic text-to-scene<br />

conversion system," SIGGRAPH, 2001.<br />

23. K. Barnard and D. Forsyth, "Learning the Semantics of<br />

Words and Pictures," Eighth International Conference on<br />

Computer Vision (ICCV'01), 2001.<br />

24. J. Lafferty, et al., "Conditional random fields: Probabilistic<br />

models for segmenting and labeling sequence data,"<br />

International Conference on Machine Learning, 2001.<br />

25. N. UzZaman and J. F. Allen, "TRIPS and TRIOS System<br />

for TempEval-2: Extracting Temporal Information from<br />

Text," International Workshop on Semantic Evaluations,<br />

ACL 2010.<br />

26. J. F. Allen, et al., "Deep semantic analysis of text,"<br />

Symposium on Semantics in Systems for Text Processing<br />

(STEP), 2008.<br />

27. Y. Lin, "ROUGE: A package for automatic evaluation of<br />

summaries," ACL Text Summarization Workshop, 2004.

Author’s index<br />

James F. Allen Moritz Kümmerling<br />

R. Wade Allen Pat Langdon<br />

Ignacio Alvarez Sven Laqua<br />

Gabriel Barata Gerrit Meixner<br />

Ashweeni K. Beeharee Kamlesh Mistry<br />

André Berton Christian Müller<br />

Jeffrey P. Bigham George D. Park<br />

Pradipta Biswas Martin Pfannenstein<br />

Rolf Black Mark Poguntke<br />

Rainer Bodendorfer Ashu Razdan<br />

Daniel Braun Joseph Reddington<br />

Elliot Buller Ehud Reiter<br />

An Mei Chen Theodore J. Rosenthal<br />

Heng-Tze Cheng M. Angela Sasse<br />

Shelby S. Darnell Kristof Schütt<br />

Michael Eichhorn Adriano Scoditti<br />

Josh I. Ekandem Eckehard Steinbach<br />

Christoph Endres João Teixeira<br />

Sandro Rodriguez Garzon Nava Tintarev<br />

Juan E. Gilbert Naushad UzZaman<br />

Daniel Gonçalves Annalu Waller<br />

Jin Sun Ju Damon L. Woodard<br />

Eun Yi Kim Li Zhang
