
Proceedings of the
3rd Workshop on Multimodal Interfaces for Automotive Applications
(MIAA '11)

February 13, 2011, Palo Alto, CA, USA

organized at the International Conference on Intelligent User Interfaces (IUI '11)

Organizers:
Christoph Endres, German Research Center for Artificial Intelligence (DFKI)
Gerrit Meixner, German Research Center for Artificial Intelligence (DFKI)
Christian Müller, German Research Center for Artificial Intelligence (DFKI)


Preface

Multimodal interaction constitutes a key technology for intelligent user interfaces (IUI). The possibility to control devices and applications in a natural way enables easier access to complex functionality as well as to infotainment content. In recent years, the complexity of on-board and accessory devices, infotainment services, and driver assistance systems in cars has increased enormously. This development emphasizes the need for new concepts for advanced human-machine interfaces that support the seamless, intuitive, and efficient use of this large variety of devices and services.

A modern car already implements hundreds of functions that a user can interact with, in some cases deployed over almost a hundred embedded platforms. These numbers will grow even further for the next generation of high-class vehicles. The growing number of electronic devices integrated into cars also affects the creation of the user interface. The built-in electronic control units are able to provide valuable context information, which needs to be considered for an intelligent management of multimodal interaction inside the car. Sensor information such as vehicle speed, location (using GPS plus gyroscope and accelerometer for greater reliability), and outside temperature allows drawing conclusions about the current driving situation. Furthermore, dialog management needs to keep track of state changes of operating elements like control switches. Access to vehicle functions is also essential in order to initiate desired operations.

The goal of this workshop is to present, discuss, and outline context-aware multimodal interfaces for drivers and car passengers. The ultimate goal is to unify innovative concepts that aim towards a new dimension of ease of use.

The topics of the workshop, with a strong focus on automotive or traffic applications, are:

- speech interfaces for in-car use
- multimodal interaction
- novel multimedia interfaces and in-car entertainment
- user interface issues for assistive functionality
- audio-visual information and entertainment
- information fusion and fission
- CAN bus architectures
- experimental platforms and simulation solutions
- user-centered design applications
- multi-party interaction concepts
- integrated hardware solutions
- car2car and car2X communication
- approaches for the evaluation of novel car user interfaces
- user interfaces for navigation systems
- detection and estimation of user intentions
- novel interactive car applications
- interactive applications for drivers and passengers
- model-driven user interface development


Table of Contents

Flexible and Real-time Scenario Building for Experimental Driving Simulation Studies
George D. Park, R. Wade Allen and Theodore J. Rosenthal .......................................... 1

Contactless Gesture Recognition for Mobile Devices
Heng-Tze Cheng, An Mei Chen, Ashu Razdan and Elliot Buller ...................................... 5

One Application, One User Interface Model, Many Cars: Abstract Interaction Modeling in the Automotive Domain
Mark Poguntke and André Berton .................................................................. 9

A Novel Multimedia Session Management Approach for In-Vehicle Middleware based on DPWS
Michael Eichhorn, Martin Pfannenstein, Rainer Bodendorfer and Eckehard Steinbach .............. 13

"Hands Busy, Eyes Busy": Generating Stories from Sensor Data for Automotive Applications
Joe Reddington, Ehud Reiter, Nava Tintarev, Rolf Black and Annalu Waller ...................... 17

A Novel Taxonomy for Gestural Interaction Techniques: Considerations for Automotive Environments
Adriano Scoditti ............................................................................... 21

Navigating Haystacks at 70 mph: Intelligent Search for Intelligent In-Car Services
Ashweeni K. Beeharee, Sven Laqua and M. Angela Sasse ........................................... 25

Discover Significant Situations for User Interface Adaptations
Sandro Rodriguez Garzon and Kristof Schütt ..................................................... 29

A New Interaction Technique Based on Eye Tracking and Single Switch Scanning Systems
Pradipta Biswas and Pat Langdon ................................................................ 33

Gesture Recognition Exploration using Haartraining and KNN in a 3D Racing Game
Kamlesh Mistry and Li Zhang .................................................................... 37

Model-Based User Interface Development in the Automotive Industry
Moritz Kümmerling and Gerrit Meixner ........................................................... 41

A Robotic Wheelchair using Human Gestures and Scene Contexts
Jin Sun Ju and Eun Yi Kim ...................................................................... 45

MetaBrain: Web Information Extraction and Visualization
João Teixeira, Gabriel Barata and Daniel Gonçalves ............................................. 49

MyDash: The Biometric Digital Dashboard
Shelby S. Darnell, Ignacio Alvarez, Josh I. Ekandem, Damon L. Woodard and Juan E. Gilbert .... 53

Prototyping a Semi-Automatic In-Car Texting Assistant
Christoph Endres, Daniel Braun and Christian Müller ............................................ 57

Multimodal Summarization of Complex Sentences
Naushad UzZaman, Jeffrey P. Bigham and James F. Allen .......................................... 61


Flexible and Real-time Scenario Building for Experimental Driving Simulation Studies

George D. Park, R. Wade Allen, and Theodore J. Rosenthal
Systems Technology, Inc.
13766 Hawthorne Blvd., Hawthorne, CA
georgepark@systemstech.com

ABSTRACT

The applications and cross-disciplinary nature of driving safety require driving simulation software to be sensitive to the requirements and limitations of its users. Provided here is an introduction to the driving simulation software STISIM Drive and its unique approach towards flexible, real-time scenario building for applied experimental driving research. Several key concepts on how a user defines and builds a driving scenario and how the 3D graphics are generated in relation to the driver are discussed, along with the advantages and disadvantages of the STISIM Drive approach. References to previous user applications are provided.

Author Keywords
Driving simulation, scenario design, STISIM Drive.

ACM Classification Keywords
H5.2 Evaluation/methodology. H5.m Miscellaneous.

INTRODUCTION

Real-time, interactive (i.e., human-in-the-loop) driving simulation offers many advantages to the experimental researcher or developer interested in the areas of driving assessment, training, and research. It provides a safe and controlled environment for testing driver behaviors in relation to the independent variable(s) of interest: driver factors (e.g., age, experience, drugs/alcohol/fatigue, mental workload, and deficits related to perception, cognition, or psychomotor function), intervention factors (e.g., education and training programs), environmental factors (e.g., roadway infrastructure design, signage, weather, and traffic), and vehicle/device factors (e.g., controls/handling, dashboard design, warning systems, cell phones, and in-vehicle telematics).

Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


Given the array of applications and the cross-disciplinary nature of driving safety, simulation software needs to be sensitive to the requirements and limitations of its users. Not all users will have the background or the resources for extensive scenario building in virtual environments (VE). In addition, the end product of driving simulation is rarely the simulation itself (e.g., a video game); it is more often a means for assessing the effect of one of the aforementioned independent variables. Therefore, a method of scenario development that is flexible, rapid, and cost-effective is often critical to project success.

Databases for real-time 3D simulation have traditionally been developed in graphics programs as composite 3D models. In essence, a large, predefined virtual world is created for the user to interact with. This approach requires extensive effort and experience with graphics modeling programs to define the details required in driving simulation [1]. More user-friendly scenario building systems may use a "tile-based" system, where the developer pieces together predefined tiles of road (e.g., an intersection or street block) to create the larger virtual world [2]. The end result is a roadway environment not unlike a real-world, coordinate-based map. While this may appear to be an intuitive method of scenario development, it may not be an entirely practical means of scenario design for experimental research.

The purpose of this paper is to provide an introduction to the driving simulation software STISIM Drive and its approach towards flexible, real-time scenario building for applied experimental driving research. STISIM Drive is a PC-based, desktop driving simulator software system that is highly configurable with regard to hardware fidelity (driver displays and controls). Several key concepts on how a user defines and builds a driving scenario and how the 3D graphics are generated in relation to the driver are discussed.

SCENARIO DEFINITION LANGUAGE (SDL)

The Scenario Definition Language (SDL) is a scripting language developed for STISIM Drive to define the scenario events (i.e., what appears and happens) in a particular driving scenario run. The events are defined by ASCII text statements in a simple syntax form:

On Distance, Event, Appear Distance, Parameter1, Parameter2, ... ParameterN

The On Distance is the longitudinal distance (feet or meters, as specified by the user) driven by the driver in relation to the scenario environment at which the event will activate. At the start of a scenario, the driver's vehicle distance is generally set at zero. Event refers to a specific procedure (e.g., a roadway, building, vehicle, or pedestrian). Appear Distance refers to the longitudinal distance (ft or m), relative to the On Distance, at which the event will actually be displayed in the roadway scenery. The Parameters are the specific attributes given to the event (e.g., roadway dimensions, model type, lateral location, speed, timing, etc.). Take, for example, the following SDL statement for displaying a 3D model of a building:

500, Building, 1000, 40, B1

When the driver reaches 500 ft, a building event will be initiated. It will appear 1000 ft ahead of the driver (so technically at 1500 ft from the start of the run). The lateral position will be 40 ft to the right of the center dividing line (Parameter1 for building events). The building model type will be B1, which in the model library is defined as the café (Parameter2).

As shown in the above example, there is a single SDL event statement for each model in a particular scenario. There are also over 50 different available event types for the user to specify. While this may appear cumbersome for complex scenario designs, SDL statements can be arbitrarily arranged, since the program sorts all events according to distance during run initialization. This allows the user to group statements according to meaningful chunks of roadway (e.g., street blocks) and/or categories (e.g., roadway definition, traffic control devices, roadside objects, traffic, etc.) to make relatively efficient global scenario changes.
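As an editorial illustration of this sorting behavior, the sketch below parses a few SDL-style statements and orders them by On Distance before a run. It is a minimal approximation for exposition, not STISIM Drive's actual parser, and apart from the café example quoted above the statements are hypothetical.

# Minimal sketch (not the STISIM Drive parser): read SDL-style statements,
# split them into fields, and sort by On Distance during "run initialization".
sdl_statements = [
    "1200, Vehicle, 300, -12, V3",      # hypothetical oncoming vehicle
    "500, Building, 1000, 40, B1",      # the cafe example from the text
    "0, Roadway, 0, 12, 2",             # hypothetical two-lane road definition
]

def parse_statement(line):
    fields = [f.strip() for f in line.split(",")]
    return {
        "on_distance": float(fields[0]),
        "event": fields[1],
        "appear_distance": float(fields[2]),
        "parameters": fields[3:],
    }

events = sorted((parse_statement(s) for s in sdl_statements),
                key=lambda e: e["on_distance"])

for e in events:
    print(e["on_distance"], e["event"], e["parameters"])

Because the sort happens at initialization, the author of a scenario file is free to group statements by roadway block or by category, as described above.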

Besides 3D model events, there are SDL events that specify crash/violation settings, sound files, weather, data input/output signals, and data collection. Furthermore, the SDL allows the user to define and call subroutines referred to as previously defined events (PDEs), which are combinations of event statements that give a desired composite effect (e.g., buildings grouped around an intersection, traffic streams, vehicle/pedestrian collision events, etc.). Additional details on developing driving scenarios have been reported elsewhere [3].

EVENT TRIGGERING

Due to the inherent variability in driver behaviors and factors that may affect a driver's vehicle speed and steering (e.g., mental workload, age, experience, fatigue, risk perception), the initiation of dynamic 3D models (e.g., vehicles, pedestrians, signal lights) into action in the VE can be a complex process. This is particularly so if the intention is to create critical hazards that require an immediate driver response. For example, Figure 1 provides scenario screenshots of an amber (yellow) light intersection event (top) and a pedestrian crossing event in front of the driver (middle).

Figure 1. STISIM Drive screenshots of amber signal light intersection (top), pedestrian crossing in front of driver (middle), and construction zone (bottom).

STISIM Drive handles dynamic 3D model event triggering in several ways. In most cases, the variability in drivers' speed can be neutralized by triggering events based on headway time (i.e., time-to-collision between the object and driver). However, additional parameters can be set to ensure data integrity: longitudinal distance of the driver (or object) on the road, distance between driver and object, lateral position relationships, signal light changes, driver speed thresholds, and elapsed runtime.
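To make the headway-time idea concrete, the sketch below fires an event when the time-to-collision between driver and object drops below a threshold; the threshold value and the extra distance guard are hypothetical and only illustrate the kind of conditions described above, not STISIM Drive's internal logic.

# Hedged sketch of headway-time (time-to-collision) triggering.
def should_trigger(driver_pos_ft, driver_speed_fps, object_pos_ft,
                   headway_threshold_s=3.0, max_distance_ft=500.0):
    """Return True when the staged object should be set into motion."""
    gap_ft = object_pos_ft - driver_pos_ft
    if gap_ft <= 0 or gap_ft > max_distance_ft:   # extra distance guard (assumed)
        return False
    if driver_speed_fps <= 0:                     # stopped driver never reaches the object
        return False
    time_to_collision_s = gap_ft / driver_speed_fps
    return time_to_collision_s <= headway_threshold_s

# Example: driver at 1000 ft doing 44 ft/s (about 30 mph), pedestrian staged at 1110 ft.
print(should_trigger(1000.0, 44.0, 1110.0))   # True: headway is 2.5 s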

The simulation operator can also manually trigger events during a simulation run. Manually triggered events can comprise singular discrete events (e.g., a sound file or crossing pedestrian) or larger PDE files comprised of an array of static or dynamic 3D models. In effect, the operator can initiate whole sections of a scenario in real time depending on how the driver is behaving. For example, in Figure 1 (bottom), the operator can initiate a complete construction zone layout that places vehicles and tubes onto the road.

PARTIAL VIRTUAL ENVIRONMENT GENERATION

The STISIM Drive method for generating the simulation scenario can be described as partial (or delayed) VE generation, where only a portion of the virtual world is displayed as the driver's vehicle travels down the road. This is the basis of how the simulation is generated and how the driving scenarios are conceptually designed with the SDL.

To illustrate the concept, Figures 2a and 2b both show a vehicle approaching an intersection. In conventional simulation programs using a coordinate map-based system (Figure 2a), continuing straight or turning left/right sends the driver into different sections (A, B, or C) of the virtual world. In STISIM Drive (Figure 2b), continuing straight or turning left/right sends the driver into the same section (B). The reason for this relates back to how scenario events are defined in the SDL. Since the On Distance of an event (in this case Section B) can be specified to occur after the driver reaches a particular road distance, Section B has not been generated yet. Whether turning or not, the driver's longitudinal distance travelled is still accumulating; therefore, Section B will continue to appear in relation to the start of the scenario. Once the On Distance for an event has been reached by the driver, the event is committed to appear in accordance with its specified parameters.

Figure 2. a) Coordinate map-based VE generation. b) Partial VE generation.
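A toy update loop, with assumed names and data, can illustrate this distance-based commitment: events whose On Distance has been passed are committed regardless of the driver's turns, so every participant meets the same serial sequence of roadway sections.

# Toy illustration of partial (delayed) VE generation; not STISIM Drive internals.
# Events become committed purely by accumulated longitudinal distance.
pending = [  # hypothetical events, already sorted by On Distance
    {"on_distance": 500.0, "event": "Building"},
    {"on_distance": 800.0, "event": "Pedestrian"},
]
committed = []

def advance(accumulated_distance_ft):
    """Commit every pending event whose On Distance has been reached."""
    while pending and pending[0]["on_distance"] <= accumulated_distance_ft:
        committed.append(pending.pop(0))

# Distance keeps accumulating whether the driver goes straight or turns,
# so the same events appear in the same order for every driver.
for travelled in (300.0, 600.0, 900.0):
    advance(travelled)
    print(travelled, [e["event"] for e in committed])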

Moving into different sections when turning is not normally problematic, and it is intuitive when designing VEs in a coordinate map context. However, if the goal of the scenario is to measure driver behavior in response to a particular event (e.g., a pedestrian crossing or vehicle pullout in Section B), scenario design becomes problematic. For conventional simulation programs, unintended turning may result in system crashes when the boundaries of the VE are exceeded. Secondly, additional programming is required for Sections A and C even though the driver may not encounter them. To ensure the occurrence of a particular event for measurement, the designer must either artificially preclude vehicle turning, rely on driver compliance, add corresponding events in Sections A and C, or have an operator manually trigger the event once a driver has committed to a particular roadway section. Any of these options, while manageable, is neither parsimonious nor accounts for the inherent unpredictability of human behavior.

The advantages of partial VE generation for experimental driving research are multiple. Since the driver does not experience scenario sections based on a coordinate map system, roadway sections are essentially presented serially. This means all drivers experience the same scenario regardless of turning behaviors. Drivers cannot get disoriented or lost in the VE. Instead, the illusion of turning into different VE sections is created for the driver while roadway events are presented as intended by the researcher. In addition, counterbalancing of scenario events or whole roadway sections (using PDEs) can easily be designed to control for order effects. This method can also reduce the design requirements and development time for a particular scenario.

One of the main limitations of partial VE generation is the inability to simulate specific geography as in a coordinate map-based system. Therefore, studies involving simulation with GPS mapping and navigational tasks are problematic. Additionally, non-realistic route corrections for driver navigational errors are possible. For example, if a driver makes a wrong turn, U-turns into previously presented scenario sections are not handled well, since the program provides only a limited distance of backtracking. The driver is also not able to perform other corrective procedures normally seen in driving, such as three rights to make a left turn and vice versa. Previous system users have overcome some of these obstacles by modifying general program settings and adding a single elaborate large-scale 3D city model [4]. It should be noted that these applications would require considerable 3D modeling resources, since the system was not conceptually designed to function in this manner.

CONCLUSION

The advantages and disadvantages of the partial VE generation approach used by STISIM Drive should be weighed by users during initial study design. The flexibility of scenario design and the relatively simple scripting language (SDL) for building and modifying scenarios make it a very user-defined system that mitigates inherent driver variability. This, in conjunction with flexible hardware options, has enabled the STISIM Drive software approach to be well validated and used in nearly every aspect of driver safety research. This includes driver factor effects: ageing [5, 6], novice drivers [7], traumatic brain injury [8], and pharmaceutical effects [9]; and vehicle and device interactions: in-vehicle information devices [10], cognitive workload effects [11], and collision warning systems [12]. Successful integration of the simulation software with actual vehicle control hardware systems has also been demonstrated for steering [13] and braking systems [14]. Additional information and resources can be found on the software website (www.stisimdrive.com).

REFERENCES

1. Cremer, J., J. Kearney, and Y. Papelis, Driving simulation: Challenges for VR technology. IEEE Computer Graphics and Applications, 1996. 16(5): p. 16-20.
2. Suresh, P. and R.R. Mourant. A tile manager for deploying scenarios in virtual driving environments. In DSC 2005 North America. 2005. Orlando, FL.
3. Park, G.D., T.J. Rosenthal, and B.L. Aponso, Developing driving scenarios for research, training and clinical applications. Advances in Transportation Studies: An International Journal, 2004. 2004 Special Issue.
4. Marcotte, T.D., et al., A multimodal assessment of driving performance in HIV infection. Neurology, 2004. 63: p. 1417-1422.
5. Lee, H.C. The validity of driving simulator to measure on-road driving performance of older drivers. In 24th Conference of Australian Institutes of Transport Research (CAITR). 2002. Sydney, AUS.
6. Park, G.D., et al. Older driver simulator performance in relation to driving habits and DMV records. In 2nd International Conference on Technology and Aging. 2007. Toronto, Canada.
7. Allen, R.W., et al. A PC based simulation system for driver assessment and training. In TRB Annual Meeting. 2005. Washington, D.C.
8. Stern, E.B., et al., Discriminating between brain injured and non-disabled persons: a PC-based interactive driving simulator pilot project. Advances in Transportation Studies: An International Journal, 2004. Special Issue.
9. Kay, G. The effect of Adderall XR and Atomoxetine on simulated driving safety in young adults with ADHD. In 18th Annual US Psychiatric & Mental Health Congress. 2004. Las Vegas, NV.
10. Wang, Y., et al., The validity of driving simulation for assessing differences between in-vehicle informational interfaces: A comparison with field testing. Ergonomics, 2010. 53(3): p. 404-420.
11. Reimer, B., Impact of cognitive task complexity on drivers' visual tunneling. Transportation Research Record, 2009(2138): p. 13-19.
12. Maltz, M. and D. Shinar, Imperfect in-vehicle collision avoidance warning systems can aid drivers. Human Factors, 2004. 46(2): p. 357-366.
13. Eskandarian, A., et al. Development of an active steering control system in a car driving simulator. In SAE World Congress & Exposition. 2009. Detroit, MI.
14. Allen, R.W., et al. A hardware-in-the-loop simulation of braking capability. In DSC 2005 Europe. 2008. Monaco.


Contactless Gesture Recognition for Mobile Devices

Heng-Tze Cheng*
Electrical and Computer Engineering
Carnegie Mellon University
hengtze@cmu.edu

An Mei Chen, Ashu Razdan, Elliot Buller
Office of The Chief Scientist
Qualcomm Incorporated
{anc, arazdan, ebuller}@qualcomm.com

ABSTRACT

While gesture interfaces are becoming pervasive, most existing approaches are undesirable for mobile devices because of their high power consumption or the inconvenience of requiring users to wear or hold specific sensors. In this paper, we present a contactless gesture recognition system for mobile devices using proximity sensors. A set of infrared signal feature extraction methods and a decision-tree-based gesture classifier are proposed. The system allows a user to interact with mobile devices using intuitive gestures, without touching the screen or wearing/holding any additional device. Evaluation results show that the system is low-power and able to recognize gestures with over 98% precision in real time.

Author Keywords
Gesture recognition, proximity sensor, infrared LED

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: User Interfaces—Input devices and strategies

INTRODUCTION

Gesture-based interfaces provide an intuitive way for users to specify commands and interact with computers [6, 8]. As mobile phones and tablets become ubiquitous, there is an increasing need for intuitive user interfaces for small-sized, resource-limited mobile devices.

Most existing gesture recognition systems can be classified into three types: motion-based, touch-based, and vision-based systems. In motion-based systems [11, 4], users cannot make gestures unless they hold a mobile device or an external controller. Touch-based systems [12, 10] can accurately map finger/pen positions and moving directions on the touchscreen to different commands. However, 3D gestures are not supported because all possible gestures are confined to the 2D screen surface.

* This work was done during the author's employment at the Office of The Chief Scientist, Qualcomm Incorporated.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA.

While the first two types of systems require users to make contact with devices, vision-based systems [8, 14] using cameras and computer vision techniques allow users to make intuitive gestures without touching the device. However, most vision-based systems are computationally expensive and power-consuming, which is undesirable for resource-limited mobile devices like tablets or mobile phones.

To address these challenges, we present a contactless gesture recognition system using only two infrared proximity sensors. We propose a set of infrared feature extraction and gesture classification algorithms. Using the system as a gesture interface, a user can flip e-book pages, scroll web pages, zoom in/out, and play games on mobile devices using intuitive hand gestures, without touching, wearing, or holding any additional devices. The design also reduces the frequency of users' contact with devices, alleviating the wear and tear on screen surfaces.

The main contributions of the paper are: 1) the design and evaluation of a contactless gesture recognition system using only two proximity sensors; 2) the proposed infrared (IR) feature set and classifier for real-time gesture classification; 3) reducing the power consumption of gesture recognition.

RELATED WORK

There has been extensive research on vision-based gesture recognition [8, 14], mostly focusing on the detection of hand trajectory. Although such systems can recognize complex gestures, they can be sensitive to background objects, color, and lighting. Robustness can be improved by adding color markers to the user's hand [5], with the tradeoff of the inconvenience of wearing additional gear. Moreover, continuous video recording of a user can make one feel under surveillance and poses a threat to user privacy.

Recently, SideSight [1] proposed an around-device multi-touch interface by placing ten IR sensors on the long edges of a small mobile device. Another related work, HoverFlow [3], used six IR sensors facing the user to capture IR image maps, and then classified gestures using dynamic time warping (DTW). In this work, we reduce the number of required IR sensors to two and thus reduce the power consumption, which is mentioned as a critical issue in [1]. Even with the limited information from only two IR sensors, our system can achieve accurate gesture recognition using the proposed IR feature set and classifier.

Among motion-based systems, one recent work, uWave [4], matches accelerometer data with gesture templates using DTW. Accuracies of 98.6% and 93.5% were achieved with and without template adaptation, respectively, for user-dependent gesture recognition. However, a user needs to hold a device with an accelerometer and press a button to indicate the start and end of a gesture. In this work, we eliminate these limitations with contactless gesture recognition.

Electromyogram-based (EMG-based) systems [2, 13] are another novel way to recognize gesture patterns, using the electrical activity produced by skeletal muscles. However, a user must wear EMG sensors on the wrist at all times to perform gestures, which can be inconvenient and is not suitable for mobile device interfaces.

SYSTEM DESIGN AND METHODS

Design Considerations

Our system is designed based on four considerations: 1) Automatically detect gesture boundaries: a common challenge of gesture recognition is the uncertainty of when a gesture begins or ends. We do not require a user to press a key to indicate the presence of a gesture, since it would be inconvenient to do so. 2) Recognition must be real-time: the gesture interface must be very responsive, so no time-consuming postprocessing is allowed. 3) False alarms need to be minimized: executing a wrong command is generally worse than missing a command. 4) No user-dependent model training process for new users: although supervised learning can optimize performance for a specific user, collecting training data is time-consuming and not desirable for users.

Proximity Sensor Data Acquisition

We now describe each system component shown in Fig. 1. A proximity sensor consists of two IR LEDs and an IR receiver, which are placed underneath a plastic/glass screen surface, surrounded by optical barriers. The LEDs emit IR strobes in turns as two separate channels using time-division multiplexing. When a hand or any object is near, the receiver detects the reflection of the IR light, whose intensity increases as the object distance decreases. The light intensities of the two IR channels are sampled by the firmware at 100 Hz.

Figure 1: The architecture of the gesture recognition system (proximity sensor data, framing, cross-correlation, linear regression, and signal statistics modules feed a temporal dependency computation and a gesture classifier backed by a gesture model and a gesture history database).

Framing

Since the start and end of a gesture are not specified by the user, our program uses a moving window to scan the incoming IR intensity data and decide whether any gesture signature is observed. The data is divided into 50% overlapping frames, each of which is 140 ms long. After framing, three types of features are extracted from each frame.
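A small sketch of such framing, using only the sampling rate and window length stated above (100 Hz, 140 ms frames with 50% overlap), might look as follows; the array names and the synthetic data are illustrative only.

import numpy as np

# Sketch of 50%-overlapping framing of one IR channel (100 Hz, 140 ms frames).
SAMPLE_RATE_HZ = 100
FRAME_LEN = int(0.140 * SAMPLE_RATE_HZ)   # 14 samples per frame
HOP = FRAME_LEN // 2                       # 50% overlap -> hop of 7 samples

def frame_signal(samples):
    """Split a 1-D intensity signal into overlapping frames."""
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, HOP):
        frames.append(samples[start:start + FRAME_LEN])
    return np.array(frames)

# Example with synthetic data standing in for one IR channel.
channel_l = np.random.rand(300)            # 3 s of fake sensor readings
print(frame_signal(channel_l).shape)       # (41, 14)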

Infrared Feature Extraction

Inter-channel Time Delay

This feature measures the pair-wise time delay between the sensor data of the two channels, which shows how a hand approaches the IR LEDs at different instants. This corresponds to different moving directions of the hand (see Fig. 2 for an example). The time delay t_D is calculated by finding the time shift n that yields the maximum cross-correlation value of the two discrete signal sequences f and g:

t_D = \arg\max_{n} \sum_{m=-\infty}^{\infty} f^{*}(m)\, g(m+n)    (1)

Figure 2: An example of proximity sensor data and the extracted features: raw sensor data of channels L and R, the time delay measured by cross-correlation, and the slope measured by linear regression, for three left swipes, three right swipes, and push/pull gestures.
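As a rough illustration of Eq. (1), the following sketch estimates the inter-channel delay of two short frames with a discrete cross-correlation; the synthetic Gaussian bumps are placeholders for the real IR intensities, and the 3-sample shift is an assumed example.

import numpy as np

# Sketch of the inter-channel time-delay feature: Eq. (1) picks the lag n that
# maximizes the cross-correlation of the two channel frames f and g.
def inter_channel_delay(f, g):
    # np.correlate(g, f, "full")[k] = sum_m g(m+k) * f(m), so its arg-max lag
    # corresponds to the n of Eq. (1) for real-valued signals.
    corr = np.correlate(g, f, mode="full")
    lags = np.arange(-(len(f) - 1), len(g))
    return lags[np.argmax(corr)]

# Synthetic 14-sample frames (140 ms at 100 Hz): channel R sees the reflection
# 3 samples (30 ms) after channel L, as for one swipe direction.
channel_l = np.exp(-0.5 * ((np.arange(14) - 5) / 1.5) ** 2)
channel_r = np.roll(channel_l, 3)
print(inter_channel_delay(channel_l, channel_r))   # 3: channel R lags channel L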

Local Sum of Slopes

This feature estimates the local slope of the signal segment within a frame, which shows how fast the user's hand is moving toward or away from the proximity sensors. The slope is calculated by first-order linear regression and then summed with the slopes of the six previous frames. The local sum better captures the continuous trend of the slopes rather than sudden changes.
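A plausible reading of this feature, sketched below with assumed buffering details, fits a first-order line to each frame and keeps a running sum over the current and six previous slopes.

import numpy as np
from collections import deque

# Sketch of the local-sum-of-slopes feature: per-frame linear-regression slope,
# summed over the current frame and the 6 previous ones (window of 7 assumed).
class SlopeSum:
    def __init__(self, history=7):
        self.slopes = deque(maxlen=history)

    def update(self, frame):
        t = np.arange(len(frame))
        slope, _intercept = np.polyfit(t, frame, 1)   # first-order regression
        self.slopes.append(slope)
        return sum(self.slopes)

# Example: intensity rising across frames, as when a hand pushes toward the sensor.
tracker = SlopeSum()
for i in range(8):
    frame = 1000.0 * i + 50.0 * np.arange(14)         # synthetic rising frame
    print(round(tracker.update(frame), 1))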

Signal Statistics

These features are the mean and variance of the raw sensor data. A high variance can be observed when a gesture is present; conversely, when there is no hand present or a hand is merely hovering above, a low variance is observed.

Gesture Recognition Algorithm

After feature extraction, a decision-tree classifier, shown in Fig. 3, is adopted to classify the frame as one of the gestures in the predefined gesture model, or to report that no gesture is detected. We also keep a history of 7 frames to take the temporal dependency between consecutive frames into consideration. For example, when a gesture is detected, the system suppresses the output of the same gesture for 6 frames, because it is hard for a user to make the same gesture again very quickly. Once the gesture sequence history of a user is obtained, the transition probability between gestures can also be incorporated to improve the recognition accuracy.

Figure 3: Illustration of the decision-tree-based gesture classifier (frames with low variance yield no gesture; otherwise the inter-channel time delay separates left and right swipes, and the local sum of slopes separates push, pull, and no gesture).
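Read together with Fig. 3, the classification step can be sketched roughly as below; the threshold values, the sign convention for the swipe direction, and the suppression window are assumptions made for illustration rather than the paper's tuned parameters.

# Hedged sketch of the decision-tree classifier in Fig. 3 (thresholds and sign
# conventions are illustrative assumptions, not the paper's empirical values).
VAR_THRESHOLD = 1e4
DELAY_THRESHOLD = 2        # samples
SLOPE_THRESHOLD = 200.0

def classify_frame(variance, time_delay, slope_sum):
    if variance < VAR_THRESHOLD:
        return "no gesture"                      # hand absent or merely hovering
    if abs(time_delay) > DELAY_THRESHOLD:        # one channel clearly lags the other
        return "right swipe" if time_delay > 0 else "left swipe"   # assumed mapping
    if slope_sum > SLOPE_THRESHOLD:
        return "push"                            # intensity rising: hand approaching
    if slope_sum < -SLOPE_THRESHOLD:
        return "pull"                            # intensity falling: hand receding
    return "no gesture"

print(classify_frame(variance=5e4, time_delay=3, slope_sum=10.0))    # right swipe
print(classify_frame(variance=5e4, time_delay=0, slope_sum=350.0))   # push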

IMPLEMENTATION

We implemented the prototype system using the Silicon Labs Si1120 infrared proximity sensor [9]. The sensor data were transmitted to a laptop through a USB serial port. The feature extraction and gesture recognition algorithms were implemented in C++. The window sizes and thresholds were set empirically through experiments to minimize the false alarm rate of the system. A picture of the prototype system and a subject performing a gesture is shown in Fig. 4.

Figure 4: A subject performing a left-swipe gesture using the prototype sensor board (IR LED channels L and R and the IR receiver, connected via a USB port).

EVALUATION

We define four essential gestures for evaluation: left swipe, right swipe, push (hand moving vertically down toward the device), and pull (hand moving vertically up away from the device). The system is evaluated on a gesture dataset collected from five subjects, including four right-handed and one left-handed user. Their ages span from the 20s to the 40s, and one of them is female. The dataset consists of 2,000 gesture samples in total, with each user performing each of the four gestures 100 times.

Recognition Performance

We use the widely used precision/recall metrics to evaluate the recognition performance:

precision = \frac{TP}{TP + FP}    (2)

recall = \frac{TP}{TP + FN}    (3)

where TP, FP, and FN refer to true positives, false positives, and false negatives. As shown in Fig. 5, the system achieved 98% precision on average and is robust from user to user. The high precision implies a low false alarm rate, which is ideal for gesture recognition because executing a wrong command is usually worse than missing a command. The recall rate is lower than the precision because the system can miss gestures when the hand is too far from the sensor, or when a gesture is performed much more slowly than usual.

Figure 5: Precision and recall rates of gesture recognition per user: (a) precision of left/right swipe, (b) recall of left/right swipe, (c) precision of push/pull, (d) recall of push/pull.

User and System Factors

We further design two experiments on user and system factors to evaluate the robustness and limitations of the system.

User-to-Device Distance

First, we evaluate the influence of user-to-device distance on system performance. The distance is measured from the user's hand to the proximity sensors. As shown in Fig. 6, the system can achieve over 80% accuracy when the user's hand is within 3 inches. The effective range can be increased by increasing the power of the IR LEDs, with the tradeoff of a higher power consumption. One can balance this tradeoff according to the system's needs regarding user experience and battery life.

Figure 6: Recognition accuracy vs. hand-to-sensor distance.

Speed of Gesture

Next, we evaluate the system performance when users perform gestures at different speeds. In this experiment, the user listens to a specific tempo given by an electronic metronome; the first beat "tic" indicates the start of a gesture, and the second beat "toc" indicates the end of a gesture. According to our observation, most users naturally make gestures at a speed of 2 to 4 gestures per second. In other words, it usually takes 0.5 to 0.25 seconds for general users to complete a gesture. As shown in Fig. 7, the system achieves over 90% accuracy at general gesture speeds, and also maintains a robust performance of over 80% at very slow (1 gesture per second) or very fast (5 gestures per second) gesture speeds.

Figure 7: Recognition accuracy vs. speed of gesture.

Power Consumption

The system power is dominated by the power consumed by the IR LEDs (P_LED) and the control chip (P_chip):

P_{LED} + P_{chip} = f_{conv} \cdot T_{prx} \cdot (I_{LED} + I_{chip}) \cdot V_{LED}    (4)

where V, I, f_conv, and T_prx denote voltage, current, conversion frequency, and pulse width, respectively. This amounts to only 0.3 mW (idle) to 20 mW (active, when an object is in proximity) [9], much lower than the 200 mW power budget for a typical mobile device user interface as reported in [7].
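Plugging illustrative numbers into Eq. (4) shows how the duty-cycled design keeps the average power low; all values below are hypothetical, not taken from the sensor datasheet, and are chosen only to land in the reported milliwatt range.

# Worked example of Eq. (4) with hypothetical values (not Si1120 datasheet figures):
# average power = conversion frequency * pulse width * (LED + chip current) * voltage.
f_conv_hz = 100        # assumed proximity conversions per second
t_prx_s = 0.0005       # assumed 0.5 ms IR pulse width
i_led_a = 0.100        # assumed 100 mA LED drive current during a pulse
i_chip_a = 0.010       # assumed 10 mA controller current during a pulse
v_led_v = 3.3          # assumed supply voltage

p_avg_w = f_conv_hz * t_prx_s * (i_led_a + i_chip_a) * v_led_v
print(p_avg_w * 1000)  # roughly 18 mW, the same order as the reported active power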

CONCLUSION AND FUTURE WORK

We have presented a contactless gesture recognition system that allows users to make gesture inputs without touching, holding, or wearing any device. Using the proposed IR feature set and classifier, the system can recognize gestures with 98% precision and an 88% recall rate. The low power consumption and high accuracy make the system particularly desirable for deployment on resource-limited mobile consumer devices.

Our future work is to extend the configuration to multiple sensor arrays to obtain more information from the sensor data. Using the basic gesture set as building blocks, we can further recognize more compound 3D gestures as permutations of the simple ones. Hidden Markov models can also be incorporated to learn the gesture sequences performed by users.

REFERENCES

1. A. Butler, S. Izadi, and S. Hodges. SideSight: multi-"touch" interaction around small devices. In Proc. UIST, pages 201-204, 2008.
2. J. Kim, S. Mastnik, and E. André. EMG-based hand gesture recognition for realtime biosignal interfacing. In Proc. IUI, pages 30-39, 2008.
3. S. Kratz and M. Rohs. HoverFlow: exploring around-device interaction with IR distance sensors. In Proc. MobileHCI, pages 42:1-42:4, 2009.
4. J. Liu, L. Zhong, J. Wickramasuriya, and V. Vasudevan. uWave: Accelerometer-based personalized gesture recognition and its applications. Pervasive Mob. Comput., 5(6):657-675, 2009.
5. P. Mistry, P. Maes, and L. Chang. WUW - wear ur world: a wearable gestural interface. In Proc. CHI '09, pages 4111-4116, 2009.
6. S. Mitra and T. Acharya. Gesture recognition: A survey. IEEE Trans. Syst., Man and Cybern., 37(3):311-324, 2007.
7. Y. Neuvo. Cellular phones as embedded systems. In IEEE ISSCC, 2004.
8. V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. PAMI, 19(7):677-695, 1997.
9. Silicon Labs. Proximity/ambient light sensor with PWM output, 2009.
10. W. C. Westerman and J. G. Elias. System and method for packing multi-touch gestures onto a hand, April 2006.
11. A. Wilson and S. Shafer. XWand: UI for intelligent spaces. In Proc. SIGCHI Conf. Human Factors in Comput. Syst., pages 545-552, 2003.
12. J. O. Wobbrock, A. D. Wilson, and Y. Li. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In Proc. ACM UIST, pages 159-168, 2007.
13. X. Zhang et al. Hand gesture recognition and virtual game control based on 3D accelerometer and EMG sensors. In Proc. IUI, pages 401-406, 2009.
14. M. H. Yang, N. Ahuja, and M. Tabb. Extraction of 2D motion trajectories and its application to hand gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell., 24(8):1061-1074, 2002.


One Application, One User Interface Model, Many Cars: Abstract Interaction Modeling in the Automotive Domain

Mark Poguntke
Daimler AG
Wilhelm-Runge-Straße 11, 89081 Ulm
mark.poguntke@daimler.com

André Berton
Daimler AG
Wilhelm-Runge-Straße 11, 89081 Ulm
andre.berton@daimler.com

ABSTRACT

We present an approach for user interface generation based on abstract interaction modeling using UML class and state diagrams. In this way, we enable the flexible enhancement of an automotive infotainment system with new external applications. A main objective is to do this without breaching the requirements resulting from the automotive context, e.g., minimized driver distraction. We achieve consistency with the automotive interaction and design concept by transforming the abstract model to the respective user interface concept, and we illustrate this with two automotive HMI concepts.

Author Keywords
HCI, Interaction Modeling, Abstract Interaction Model, Model-driven User Interface Development.

INTRODUCTION

A typical automotive infotainment system includes navigation, audio and video players as well as a phone application. Often, the only external applications integrated into the system are Bluetooth telephony, external music players, and pre-defined internet services, for example for weather forecasts or points of interest. With current technology, the features provided initially also do not change during the lifetime of a car. Imagine buying a desktop computer and having to use it for ten years without the possibility to install new applications; this is not satisfactory. It is our goal to make automotive systems more flexible and to allow for the integration of new applications at later stages. However, the primary purpose of a car is still to provide a safe means of transportation. This implies a set of specific and very restrictive requirements for the design of the Human-Machine Interface (HMI), especially concerning the use of infotainment applications while driving.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
3rd International Workshop on Multimodal Interfaces for Automotive Applications (MIAA) in conjunction with IUI 2011, Palo Alto, CA, USA.

Minimizing driver distraction is an important requirement, and external applications can only be integrated at a later stage under the provision that this requirement is maintained. To ensure this, control over the interaction and design concept of external applications has to remain on the side of the in-car software.

Different cars and model lines often have different HMI devices and corresponding concepts, e.g., a touch-screen-based HMI or an HMI based on operation with a central control element (CCE), typically used in premium segment cars. Design concepts also differ in screen resolution, screen layout, colors, and styles. In order to seamlessly integrate external applications, we therefore also have to provide solutions for multiple modalities.

We approach the issues of keeping control over the HMI integration of external applications on the one hand, and aiming at flexibility and adaptability to serve different HMI concepts on the other hand. Our approach is based on abstract interaction modeling using the Unified Modeling Language (UML) [4] and XML-based user interface descriptions.

Automotive Requirements

The above-mentioned conditions imply that much care is needed to optimally integrate automotive user interfaces in the car. Many countries also restrict the design and use of automotive infotainment applications through certain regulations, e.g., the European Statement of Principles on HMI for in-vehicle information and communication systems (ESoP) or the guidelines of the Alliance of Automobile Manufacturers (AAM). Compliance with ergonomic standards, e.g., ISO 9241-110, is a particularly desirable goal.

Updating an automotive infotainment system with new applications that include new user interfaces is a critical modification. Unfamiliar user interfaces or inconsistencies with the in-car interaction concept may lead to driver distraction, limitations in interaction, and frustrated drivers who hold the automotive manufacturer responsible for the whole infotainment system. This emphasizes the need for a carefully developed approach for integrating new applications and their user interfaces into the car. An automated user interface generation process has to conform to restrictive and well-defined rules.


Illustrating Example

Throughout the following, we use an example scenario to illustrate the approach. An external to-do list application comprises the following functionalities: present a list, add a new entry, select an entry, and delete a selected entry.

The to-do list will be integrated into two automotive user interface concepts: a touch-screen-based HMI and a CCE-based HMI, illustrated later in more detail.

RELATED WORK

Several approaches exist that derive different user interfaces from abstract interaction representations. ConcurTaskTrees (CTT) [6] provides a notation to describe user interfaces on the level of task models. The User Interface eXtensible Markup Language (UsiXML) [9] describes a comprehensive modeling approach including transformations from abstract to concrete user interfaces based on the CAMELEON reference framework [1]. The Dialog and Interface Specification Language (DISL) is a user interface description language based on dialogue models and modality-independent presentation models [8].

In recent years attention has also been paid to the Unified Modeling Language (UML), which is a widespread industry standard for modeling software systems. Several approaches motivate the use of UML for user interface modeling [2, 3, 5, 7]. De Melo provides a detailed analysis of UML as a basis for model-based user interface development and emphasizes advantages concerning comprehensibility, universality, and tool support, amongst others [3]. We consider UML an appropriate basis, which can be adapted and extended for our approach. The availability of established tools is particularly important for use in industry. We focus this paper on demonstrating abstract interaction modeling techniques with UML and on implementing automatic transformations from an abstract model to specific automotive user interface concepts.

ABSTRACT INTERACTION MODELING APPROACH

The general approach is illustrated in Figure 1. We use the roles of an application developer and an interaction designer. An application is developed by an application developer, including a functional application interface consisting of a class diagram with attributes and operations. An interaction designer uses this interface to create an abstract interaction model using UML state charts to describe user actions and corresponding system reactions. A transformation program uses the model and generates a user interface compliant with the respective automotive HMI concept. For the transformation process, rules have to be implemented that map the abstract model elements to user interface elements for a specific concept.


Figure 1. General approach: (1) the application developer provides the application interface, (2) the interaction designer creates the abstract interaction model that is used for user interface generation.

The overall process is described and demonstrated for the to-do list example in the remainder of this paper. The definition of abstract data types and interaction elements is described in the following section.

Abstract Data Types and Interaction Elements

The application developer uses a defined set of abstract data types for the attributes to be provided. Table 1 describes an extract of these data types.

Type          Description
Boolean       Logical value true or false
String        Sequence of symbols from the underlying set or alphabet
              Properties: Empty - Boolean value indicating whether the string is empty
Collection    A collection of elements with type <Type>
              Properties: Empty - Boolean value indicating whether the collection is empty
                          Subselection - A collection of selected elements from the entire collection

Table 1. Extract of abstract data types to be used by the application developer for the application interface.

The interaction designer uses the provided attributes and a defined set of modeling elements and guidelines to create a UML state diagram. Table 2 provides an extract of the elements that can be used by the interaction designer.

The abstract data types and modeling elements are illustrated with the example to-do list application in the following section.
following section.


Element                           Meaning
State                             Defined interaction state with a set of possible interactions

do-activity within State
PRESENT <attribute>               Presentation of <attribute> to the user
PROVIDE <attribute>               Possibility for the user to provide a value for <attribute>
PROVIDE(<number>) <collection>    Possibility to provide <number> elements for <collection>

Transition with keyword ACT
ACT<action>                       The action that can be initiated by the user
ACT<action> [<condition>]         The action that can be initiated by the user if <condition> is true
ACT<action> [not <condition>]     The action that can be initiated by the user if <condition> is false

Transition with keyword SELECT
SELECT(<number>) <collection>     Selection of <number> elements from the collection <collection>

Table 2. Extract of defined elements to be used by the interaction designer in the UML state chart.

Example: To-do List<br />

The application developer provides all attributes that can be<br />

used for interaction modeling as a UML class diagram, see<br />

Figure 2. For the to-do list these are addLabel and<br />

confirmLabel, which contain texts to be presented to the<br />

user during the respective interaction steps, and a collection<br />

named entryList containing elements of the custom type<br />

Entry. The developer also provides the information that an<br />

Entry consists of one string named description. Furthermore<br />

the operations saveEntry(Entry) and deleteEntry(Entry) are<br />

provided.<br />

Figure 2. UML class diagram for the to-do list provided by the<br />

application developer as functional application interface.<br />
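To make the provided interface concrete, the following sketch restates the class diagram of Figure 2 in Python. It only illustrates the information handed to the interaction designer; the attribute and operation names are taken from the paper, while the use of Python and the list-based collection are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Entry:
    """Custom type provided by the application developer."""
    description: str  # an Entry consists of one string named description

@dataclass
class ToDoListInterface:
    """Functional application interface of the to-do list (cf. Figure 2)."""
    addLabel: str                                          # text presented during the add-entry step
    confirmLabel: str                                      # text presented during the confirmation step
    entryList: List[Entry] = field(default_factory=list)   # Collection of Entry

    def saveEntry(self, entry: Entry) -> None:
        """Operation provided to store a new entry."""
        self.entryList.append(entry)

    def deleteEntry(self, entry: Entry) -> None:
        """Operation provided to remove an existing entry."""
        self.entryList.remove(entry)
```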

The application developer furthermore provides textual<br />

descriptions of the attributes and operations. These support the interaction designer in understanding the semantics in

order to achieve correct mappings to the interaction model.<br />

The interaction designer uses the attributes when creating<br />

the abstract interaction model. Using UML, the designer could include operations from the class diagram directly in the state chart. However, we

decided to define the relations between interactions and<br />

operations outside of the state chart in a mapping table.<br />


This allows the interaction designer to create the interaction<br />

model independently of this mapping. Figure 3 illustrates a

possible interaction model for the to-do list application.<br />

Figure 3. Abstract interaction model for the to-do list using<br />

UML state charts with the defined set of model elements.<br />

The interaction designer then uses the provided operations

and defines the relations to the interaction model. This is<br />

exemplified in Table 3. The saveEntry function is<br />

connected to the ACTSave transition with the entry<br />

provided by the user in the state Add entry. The<br />

deleteEntry function is connected to the ACTYes<br />

transition and deletes the subselection of entryList, which in this case is exactly one entry selected by the user.

Application function   Relation to interaction model
saveEntry(Entry)       ACTSave
                       Entry: PROVIDE(1) entryList
deleteEntry(Entry)     ACTYes
                       Entry: entryList.Subselection

Table 3. Mapping of interactions to application logic provided by the interaction designer based on the operations provided by the application developer.
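Because the relations between interactions and operations are kept outside the state chart, the mapping of Table 3 can be stored as a simple lookup structure. The sketch below is one hypothetical encoding in Python; the transition and operation names are those of the paper, the dictionary layout is an assumption.

```python
# Mapping of interaction-model transitions to application operations (cf. Table 3).
# The parameter binding records where the argument value comes from in the model.
interaction_to_operation = {
    "ACTSave": {
        "operation": "saveEntry",
        "parameter_binding": {"Entry": "PROVIDE(1) entryList"},    # entry provided in state "Add entry"
    },
    "ACTYes": {
        "operation": "deleteEntry",
        "parameter_binding": {"Entry": "entryList.Subselection"},  # the single entry selected by the user
    },
}
```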

The next process step is to transform the abstract interaction<br />

model including the abstract data types and operations to<br />

different HMI concepts. For this example, we demonstrate<br />

two different automotive HMI concepts that are described<br />

in the following section.<br />

Example: Two Automotive HMI Concepts

We illustrate the to-do list application with two different<br />

HMI concepts which can be summarized as follows:<br />

Touch screen based HMI: The first concept is based on<br />

operation with direct input via a touch screen. Touchable<br />

buttons are used to directly interact with the system. Lists<br />

are provided and can be operated (e.g. scrolling) via touch<br />

gestures. The system provides a software keyboard<br />

appearing when text or numbers are to be entered.<br />

CCE-based HMI: The second concept is based on indirect<br />

input via a CCE that can be pushed in eight directions,<br />

turned and pressed. Selectable menu entries are used to<br />

interact with the system. These are realized as menu


containers and are arranged in a certain hierarchy. The<br />

system provides specific complex speller widgets to enable<br />

the user to enter text or numbers.<br />

In order to map the abstract model to different HMI<br />

concepts, different rule sets have to be defined. Table 4<br />

illustrates general examples of required mappings.<br />

Abstract element   Touch concept                                CCE concept
PROVIDE            Text field widget and software keyboard      Edit speller widget
ACT                Touch button                                 Menu entry in a menu container
SELECT(1)          List box with the possibility to directly    Menu container with the possibility to navigate through
                   select one entry                             the entries and select/highlight one entry

Table 4. Example mappings from abstract to specific concepts.
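The concept-specific rule sets summarized in Table 4 can likewise be expressed as data that drives the transformation. The following Python sketch shows one possible encoding; the widget descriptions are taken from the table, whereas the dictionary layout and the lookup function are assumptions.

```python
# One rule set per HMI concept: abstract model element -> concrete widget (cf. Table 4).
WIDGET_RULES = {
    "touch": {
        "PROVIDE":   "text field widget with software keyboard",
        "ACT":       "touch button",
        "SELECT(1)": "list box with direct selection of one entry",
    },
    "cce": {
        "PROVIDE":   "edit speller widget",
        "ACT":       "menu entry in a menu container",
        "SELECT(1)": "menu container with navigation and selection/highlighting of one entry",
    },
}

def map_element(concept: str, abstract_element: str) -> str:
    """Look up the concrete user interface element for an abstract model element."""
    return WIDGET_RULES[concept][abstract_element]

print(map_element("touch", "PROVIDE"))   # -> text field widget with software keyboard
```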

The requirements for arranging the different elements<br />

depending on some properties (e.g. list sizes, menu<br />

hierarchy, etc.) are provided by the HMI concept. These<br />

influence the transformation mechanism for each concept.<br />

We defined the specific transformation mechanisms<br />

including these requirements and exemplified the process<br />

with the to-do list example. The proof of concept is<br />

described in the following section.<br />

PROOF OF CONCEPT<br />

The two different HMI concepts were implemented with the<br />

respective widget and layout specifications based on XML descriptions in a pre-defined format. These specifications

were used to create rules for the transformation from the<br />

abstract model elements to the respective specific HMI<br />

layout and interaction elements. This was implemented<br />

using eXtensible Stylesheet Language Transformation<br />

(XSLT).<br />
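How such an XSLT-based transformation step could be invoked is sketched below, assuming the abstract interaction model is serialized as XML and one stylesheet exists per HMI concept. The file names and the use of the lxml library are assumptions for illustration; only the choice of XSLT follows the paper.

```python
from lxml import etree  # third-party XML/XSLT library

# Load the abstract interaction model (serialized state chart) and one concept-specific rule set.
model = etree.parse("todo_interaction_model.xml")        # hypothetical file name
touch_rules = etree.XSLT(etree.parse("touch_hmi.xsl"))   # hypothetical stylesheet for the touch concept

# Apply the rules; the result is the HMI description for the touch screen based concept.
touch_hmi = touch_rules(model)
with open("todo_touch_hmi.xml", "wb") as out:
    out.write(etree.tostring(touch_hmi, pretty_print=True))
```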

Based on the abstract model elements for the to-do list,<br />

example transformations were implemented for the two<br />

automotive HMI concepts described above. These<br />

transformations include enabled and disabled user actions,<br />

representations of collection variables (e.g. lists) with the<br />

selection of individual collection elements, and representations<br />

for presenting and providing basic data types like<br />

text strings. Example screenshots of the resulting generated<br />

HMIs are illustrated in Figure 4.

Figure 4. Screenshots for the to-do list from the demonstrator:<br />

left: touch screen based HMI, right: CCE based HMI.<br />


CONCLUSION AND FUTURE WORK<br />

We presented an abstract interaction modeling concept<br />

based on UML class diagrams and state charts. An example<br />

application was modeled and the transformation process<br />

was successfully implemented for two different automotive<br />

HMI concepts. The developed concept includes the<br />

abstraction of basic interaction possibilities and a first set of<br />

transformations for a controlled HMI generation. The<br />

demonstrated concept motivates further research and development towards more flexible and adaptive automotive infotainment systems that allow the integration of external applications after the car software has been deployed.

Covering a complete HMI concept specification including<br />

the respective transformation rule set may result in large<br />

implementations. Thus, one important issue for the future is<br />

to further improve the HMI specification process in order to<br />

minimize the effort of obtaining transformation rules. These<br />

activities will also support the definition of overall<br />

automotive industry solutions for HMI development<br />

processes, especially concerning modeling languages and<br />

definitions of interfaces between applications and the HMI.<br />

Detailed evaluations, the elaboration of further complex<br />

examples, and stepwise improvements and expansion of the<br />

rule sets are part of ongoing and future activities. The<br />

implementation of a client-server architecture is envisioned<br />

to allow a client HMI system to communicate with remote<br />

applications and other input and output devices via defined<br />

messages. This will also enable the flexible addition of<br />

interaction devices and modalities for external applications.<br />

REFERENCES<br />

1. CAMELEON Project. http://giove.cnuce.cnr.it/<br />

projects/cameleon.html (11 Nov 2010).<br />

2. Dausend, M. & Poguntke, M.: Spezifikation<br />

multimodaler Interaktionsanwendungen mit UML. In<br />

Mensch & Computer (2010), 215-224.<br />

3. De Melo, G. Modellbasierte Entwicklung von Interaktionsanwendungen,<br />

München, Germany, 2010.<br />

4. O.M.G.: UML 2.2 Superstructure Specification (2009).<br />

5. Nobrega, L., Nunes, N. J., & Coelho, H.: Mapping<br />

ConcurTaskTrees into UML 2.0. LNCS 3941 (2006).<br />

6. Paternò, F., Mancini, C., Meniconi, S.: ConcurTaskTrees: A Diagrammatic Notation for Specifying

Task Models. In Proceedings of the IFIP TC13<br />

International Conference on HCI (1997).<br />

7. Paternò, F.: Towards a UML for interactive systems.<br />

LNCS 2254 (2001), 7-18.<br />

8. Schäfer, R.: Model-Based Development of Multimodal<br />

and Multi-Device User Interfaces in Context-Aware<br />

Environments, Aachen, Germany, 2007.<br />

9. Vanderdonckt, J., Limbourg, et al.: UsiXML: A User<br />

Interface Description Language for multimodal User<br />

Interfaces. In Proc. Workshop on Multimodal<br />

Interaction WMI (2004), 1-7.


A Novel Multimedia Session Management Approach<br />

for In-Vehicle Middleware based on DPWS<br />

Michael Eichhorn*, Martin Pfannenstein*, Rainer Bodendorfer**, Eckehard Steinbach*<br />

Institute for Media Technology<br />

Technische Universität München<br />

*{firstname.lastname}@tum.de, **bodendorfer@gmx.de<br />

ABSTRACT<br />

In this paper, we present a novel multimedia session management<br />

approach for a future Ethernet/IP-based in-vehicle<br />

communication network. All network devices are available<br />

as services in a service-oriented architecture (SOA) that is<br />

established on top of the in-vehicle network. We use the Device<br />

Profile for Web Services (DPWS) as a middleware as it<br />

is designed to support resource-constrained embedded devices, as they are typical for an in-vehicle scenario. The session

management has been designed to support any type of data<br />

to be exchanged between the services. In this study, we put a<br />

particular focus on in-car video streaming and demonstrate<br />

that the proposed approach successfully supports a variety<br />

of video streaming scenarios.<br />

Author Keywords<br />

service-oriented architecture, human machine interface, session<br />

management, in-car infotainment, device integration<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />

INTRODUCTION<br />

The IT infrastructure of modern cars features a variety of<br />

electronic control units (ECU) to execute automation and<br />

control tasks which ensure the vehicle’s operation on the<br />

road. Additionally, more and more comfort and entertainment<br />

functionalities are shipped with modern vehicles, particularly

in the premium segment. The challenge that car manufacturers<br />

face today is to adapt the in-vehicle network to the<br />

increasing number of ECUs as well as their corresponding<br />

traffic, in particular, novel applications transmitting audio and video data. Therefore, car manufacturers aim for a homogenized in-vehicle network rather than installing

multiple fieldbus systems like CAN, LIN, MOST, FlexRay<br />

etc., as in today’s cars. This then also fosters new services<br />

and applications due to the ubiquitous availability of data<br />

compared to a separation of sensors and actuators across the<br />

fieldbus systems mentioned above. One promising candidate<br />

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


for such a homogeneous in-car infrastructure is Ethernet/IP<br />

as it comes with a well-proven and established set of interfaces<br />

and protocols. Additionally, not only the number of<br />

the vehicle’s internal ECUs increases, but also the number<br />

of externally connected devices. For instance, a driver as<br />

well as the passengers want to interact with the car via their<br />

personal devices, e.g., laptops, smartphones, PDAs and so<br />

on. Therefore, the requirements for a well-organized and flexible human machine interface (HMI) emerge. This HMI should be as universally applicable as possible: given the long lifecycle of a car compared to that of consumer electronic (CE) devices, it is not foreseeable which personal devices will be brought into the car in the future. For IP-based IT

infrastructures like for example in business organizations as<br />

well as on the Internet itself, where many different services<br />

are available, there is an emerging need for an arrangement<br />

of these services. This can be achieved by a service-oriented<br />

architecture (SOA), which provides a middleware with standardized

interfaces. We use such an architecture to connect<br />

ECUs and CE devices seamlessly as well as to generate an HMI that can be distributed and composed by the connected devices, as described in [3] and [4]. A user can therefore

introduce personal devices and interact with them via the<br />

in-vehicle HMI. This approach also enables novel CE devices<br />

like for example future entertainment systems or even<br />

vehicle-relevant features like a more precise GPS receiver to<br />

be integrated into the IT infrastructure of a car after it has<br />

been shipped. An increasing interaction of the driver and<br />

passengers with the car also leads to a demand for cooperative usage, especially, but not limited to, of infotainment content

like video and audio streams. For example, a driver wants<br />

to see the video of the vehicle’s rear view camera while two<br />

passengers in the back are watching a movie on two screens.<br />

As soon as the driver has finished checking the rear view camera image, the front passenger also wants to watch the

movie from the current position on or start over. This paper<br />

therefore presents a session management approach for a<br />

SOA-based in-vehicle network and is structured as follows:<br />

First, an overview of related work in this area is given. Afterwards,<br />

our system is presented and the proposed session<br />

management scheme is detailed. At the end, a summary and<br />

outlook are given.

RELATED WORK<br />

There exist some approaches towards a more flexible HMI<br />

architecture, such as Continental’s Android-based AutoLinQ

platform [1], the Neutrino RTOS by QNX [7] or<br />

Meego [9]. These platforms are supposed to act as a univer-


sal architecture compared to the car manufacturer’s specific<br />

approaches. The Extensible Messaging and Presence Protocol<br />

(XMPP) [8] is a widely used standard for text-based<br />

chatting. On top of that, the Jingle extension [5] is used to<br />

establish sessions for audio and video calls, mainly in peer-to-peer

networks.<br />

SYSTEM DESCRIPTION<br />

We consider an all-IP in-vehicle network with a SOA infrastructure<br />

on top. In order to also support embedded devices<br />

which do not feature rich processing resources, we use the<br />

Device Profile for Web Services (DPWS) [6], which is a Web<br />

Service based middleware. It is designed to operate also on<br />

resource-limited ECUs as installed in vehicles. Several services<br />

have been designed which cover automotive-specific<br />

use-cases. In the following, we consider a video streaming service which can be invoked by multiple clients. The scenario is depicted

in Figure 1. The first box (simple) shows the most elementary<br />

scenario where one client requests a video stream<br />

from one service provider, i.e., video streaming service. A<br />

multi-party interaction takes place in the “separated” scenario,

where two clients invoke the streaming service independently<br />

of each other. Both clients can also receive the<br />

same video content with the same playout time, i.e., they participate in a common session (shared). The last two scenarios can also be combined into a mixed scenario where two

clients watch the same video content and a third one requests<br />

a separate video stream or the same one with a shifted playout.<br />

Figure 1. Overview of the considered media streaming scenarios.<br />

SESSION MANAGEMENT<br />

When using a SOA as an organizational instance for multimedia<br />

systems, the software development process is eased in<br />

many ways. Nevertheless, some points have to be taken care<br />

of in order to provide an intuitive experience to the user.<br />

In a common unconstrained SOA scenario, with many service<br />

providers and consumers, the service consumer chooses<br />

a provider, often considering only technical or measurable aspects,

e.g., hardware resources, latency, and so on. However,

a human, as a service consumer, wishes to select one specific<br />

function of one specific service provider, neglecting technical<br />

aspects. Therefore, a management has to be established that covers, for example:


• Overview and selection of compatible, available services.<br />

• Independent and un-interruptible use of a service.<br />

• Possibility to share the current service with others.<br />

• A clear distinction of users, their devices and the way they<br />

are using them.<br />

In order to enable these features in a multimedia scenario<br />

based on DPWS, a session management, realized as a dedicated<br />

service, has been developed. With the introduction of<br />

a session, users can be grouped and served independently,<br />

hence supporting their desired way of use.<br />

Establishing a new session<br />

Figure 2. Establishment of a new session.

The establishment of a new session is fundamental in order<br />

to operate independently of others but, on the other hand, to be

also able to share a session. The message exchange pattern<br />

of a session establishment is depicted in Figure 2. Here, a<br />

user invokes a video streaming service by telling the client<br />

application to start a session and assigning a session name<br />

(step 1 and 2). The name of the session (Session-ID), which<br />

can be selected freely, is used to distinguish various running<br />

sessions. The Session-ID is then sent to the session service

(step 3), i.e., the video streaming device, to actually trigger<br />

the request. In order to know who is requesting a new session,<br />

this message also contains additional information like<br />

IP address and Port. This is essential to distinguish participants

and handle further service calls properly. When this<br />

message is received by the service provider, it checks if the<br />

desired Session-ID is available (step 4). With this verification,<br />

a unique assignment of Session-IDs is ensured.<br />

Furthermore, a User-ID is generated which is matched to the<br />

IP address and Port of the requesting client (step 5). Both<br />

IDs, the User-ID as well as the Session-ID, are then stored

at the service provider side. In fact, the User-ID is also assigned<br />

to the Session-ID to have a connection between sessions<br />

and users. The result is a list containing all running<br />

sessions and their participants. With the generated User-ID<br />

it is possible to retrieve information about a certain user. In<br />

future service calls of a known user, only the User-ID has to<br />

be included in order to identify a user and serve the appropriate<br />

session.<br />

The session client itself also needs to know the User-ID that<br />

he has been assigned to. For this reason, a message is sent<br />

(step 7) containing the User-ID and an error code. The error<br />

code contains the result of the verification process of the


Session-ID. The client knows about all possible results and is<br />

able to decide if the establishment of a session was successful.<br />

Finally, the session client saves his User- and Session-ID<br />

(step 8). With this message exchange, a session has been established.<br />
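A minimal sketch of the service-side bookkeeping behind this exchange is given below, with plain Python dictionaries standing in for the actual DPWS service implementation; the error codes and field names are invented for illustration.

```python
import uuid

sessions = {}   # Session-ID -> list of participating User-IDs
users = {}      # User-ID -> (IP address, port) of the requesting client

def start_session(session_id, client_ip, client_port):
    """Handle a session establishment request (cf. Figure 2)."""
    if session_id in sessions:                        # the desired Session-ID must be unique
        return {"user_id": None, "error": "SESSION_ID_IN_USE"}
    user_id = str(uuid.uuid4())                       # generate a User-ID for the requester
    users[user_id] = (client_ip, client_port)         # remember who is behind that User-ID
    sessions[session_id] = [user_id]                  # both IDs are stored on the provider side
    return {"user_id": user_id, "error": "OK"}        # reply with User-ID and error code
```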

Joining an existing session<br />

Figure 3. Requesting a session list.<br />

Another mandatory feature regarding sessions is the participation<br />

in an existing session. All information about sessions<br />

is stored on the service side. A common user on the client

side however has no knowledge about currently running sessions<br />

and assigned Session-IDs. In order to get an overview<br />

of all available sessions, a feature that handles this must be<br />

provided.<br />

Initially, the user enters a command to request a list of all<br />

ongoing sessions (Figure 3, step 1). The session client then<br />

sends a message to the session service (step 2). The session<br />

service queries an overview of all existing sessions (step 3)<br />

from its local database, and sends it back to the requesting<br />

client (step 4). Finally the client is able to display all currently<br />

running sessions to the user (step 5).<br />

This listing feature is not only implemented to join a session<br />

in the next step; it can also be used to get a general overview of

all running sessions. When a list of all available sessions is<br />

shown to the user, he can then choose one out of it in order<br />

to participate.<br />

First, he selects the appropriate command (Figure 4, step<br />

1) and enters the Session-ID of the desired session (step 2).<br />

With this given information a message is sent from the client<br />

to the service (step 3). When the message is received, the included<br />

Session-ID is checked by the service provider and the<br />

existence of the desired session is verified (step 4). The result<br />

of this verification may lead to one of the following situations; a sketch of the corresponding service-side logic follows the list.

• Unknown Session-ID:<br />

The Session-ID cannot be found among the currently

running sessions (step 4). Thus, the desired session the<br />

user wants to join does not exist and hence he cannot

participate. This will be signaled to the user with a message<br />

(step 4.1). An error code is included and can be interpreted<br />

and displayed at the session client (step 4.2). Now,<br />

the user could restart the process with another Session-ID.<br />

• Known Session-ID, no streams present:<br />

If the Session-ID is known, the appropriate session exists<br />

and the user is able to join. At this point, we assume that<br />

no video streaming is running in the desired session (step<br />


Figure 4. Participate in a session.<br />

5). Further, a User-ID is created (step 5.1) and added to<br />

the session (step 5.2). From this point on the user participates<br />

in the session. An error code, sent by a dedicated<br />

message (step 5.3), indicates the successful participation<br />

to the session client. The included User-ID will be

extracted and saved together with the Session-ID by the<br />

client (step 5.4). From now on, the user can trigger the<br />

streaming within the session.<br />

• Known Session-ID, streaming running:<br />

Of course, it is possible that a video streaming session is<br />

already running, initiated by another user. Hence, the new<br />

client has to be notified about the ongoing video stream.<br />

Therefore, metadata about the stream is gathered (step 6),<br />

a User-ID is generated (step 6.1) and added to the session<br />

(step 6.2). Now, the streaming service, which transmits<br />

the stream to all participating members of the session, is<br />

informed and updated (step 6.3) about the new member.<br />

The new client receives a message (step 6.4) with an error<br />

code, which signals a running stream, his User-ID and

metadata. The User- and Session-ID are then saved by the<br />

client (step 6.5). Next, the streaming client, which takes<br />

care of receiving and displaying the video, is started. The<br />

metadata of the received message act as a description for<br />

the expected stream.<br />
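The three outcomes of a join request can be sketched as follows; as before, plain Python stands in for the DPWS service, and the error codes as well as the sessions, users and stream_metadata dictionaries are assumptions for illustration.

```python
import uuid

sessions = {}         # Session-ID -> list of participating User-IDs
users = {}            # User-ID -> (IP address, port)
stream_metadata = {}  # Session-ID -> metadata of a running stream (codec, resolution, ...), if any

def join_session(session_id, client_ip, client_port):
    """Handle a request to participate in an existing session (cf. Figure 4)."""
    if session_id not in sessions:                    # unknown Session-ID
        return {"error": "UNKNOWN_SESSION"}
    user_id = str(uuid.uuid4())                       # create a User-ID for the new participant
    users[user_id] = (client_ip, client_port)
    sessions[session_id].append(user_id)              # add the user to the session
    metadata = stream_metadata.get(session_id)
    if metadata is None:                              # known Session-ID, no stream present
        return {"error": "OK_NO_STREAM", "user_id": user_id}
    # Known Session-ID with a running stream: return the metadata so the client can
    # start its streaming client and display the ongoing video.
    return {"error": "OK_STREAM_RUNNING", "user_id": user_id, "metadata": metadata}
```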

Leaving a session and handover<br />

The last essential functionality is leaving a session. This can<br />

be necessary if a user wants to stop the use of a device or<br />

he wants to join another session. Figure 5 shows the message<br />

flow after a successful initialization (see Figure 2). Afterwards,<br />

the participating clients subscribe to a notification


Figure 5. Leaving and handover of a session.<br />

channel (step 2) with a message (step 3) which is processed<br />

by the service (step 4).<br />

From now on, all required information is sent via the notification<br />

channel to all participating clients. In step 5, for<br />

instance, one client sends a play command (step 6) to the<br />

service provider. The contained User-ID is then verified as<br />

described in the section Establishing a new session (step 5)<br />

and the corresponding service is fired up. This is then broadcasted<br />

to the subscribed services via a notification message<br />

(step 9). In the depicted scenario, this contains the Session-<br />

ID as well as metadata to tell the clients which video properties<br />

they have to expect (codec, resolution, framerate and<br />

so on). The clients, on the other hand, check the Session-ID<br />

and prepare themselves to use the service (steps 10-12). The<br />

video streamed by the service provider can then be received<br />

and displayed.<br />

If a client wants to leave a session (step 13), he can notify the<br />

service provider via a dedicated message (step 14). The service

provider then checks the user’s ID and deletes it from the<br />

receiver and notification list (step 15). This user is then no<br />

longer part of the session. However, as shown in Figure 5,<br />

the video stream is still sent, unaffected, to the remaining client

in the session. Therefore, a handover of the session has taken<br />

place. The remaining client can control all properties of the<br />

session or close it likewise. In this case, the actual number<br />

of participants of a session reaches zero. Hence, all users are<br />

removed and the Session-ID is no longer in use and can be<br />

assigned to new sessions.<br />
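On the provider side, leaving and handover reduce to removing the User-ID and releasing the Session-ID once the participant count reaches zero. The sketch below continues the data structures of the previous sketches and is, again, only an illustration.

```python
def leave_session(session_id, user_id):
    """Handle a leave request (cf. Figure 5); reuses sessions, users and stream_metadata from above."""
    sessions[session_id].remove(user_id)        # delete the User-ID from the receiver/notification list
    users.pop(user_id, None)                    # drop the stored IP address and port of that user
    if not sessions[session_id]:                # participant count has reached zero
        del sessions[session_id]                # the Session-ID is no longer in use and can be reassigned
        stream_metadata.pop(session_id, None)
    # Otherwise the stream simply continues for the remaining participants,
    # which corresponds to the session handover described in the text.
```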

SUMMARY AND OUTLOOK<br />

In this paper, we presented a multimedia session management<br />

extension for our web protocol based HMI architecture,<br />

which has been introduced in our previous work. The<br />

session management has been realized as a dedicated service<br />

while not modifying the underlying DPWS stack. With<br />


this extension, several video streaming scenarios are covered<br />

which then provide more convenience and flexibility to the<br />

driver and the passengers of a car. Users can invoke a service<br />

separately, e.g., a video streaming service can deliver multiple

streams with a different playout time each. On the other<br />

hand, users can share one stream to watch the same video sequence<br />

on multiple screens, i.e., the playout time is the same.<br />

If there is an existing session available with a running stream<br />

and a new user wants to participate, the meta information is<br />

also indicated to the new user and his stream has the same<br />

playout time despite his late participation. Furthermore, it<br />

is possible that the initiator of a session leaves and another<br />

participating user takes over. This session handover offers<br />

high flexibility regarding connected devices, for instance, a<br />

movie that has been viewed during the trip can be continued<br />

on a mobile device afterwards.<br />

ACKNOWLEDGEMENTS<br />

This work has been supported, in part, by the BMBF funded<br />

research project SEIS (Security in Embedded IP-based Systems)<br />

[2].<br />

REFERENCES<br />

1. Continental <strong>Automotive</strong> GmbH. AutoLinQ.<br />

http://www.conti-online.com/generator/www/de/en/<br />

continental/automotive/themes/passenger cars/interior/<br />

connectivity/autolinq/pi autolinq en.html, last accessed<br />

Nov. 2010.<br />

2. EENOVA. SEIS (Security in Embedded IP-based<br />

Systems). http://www.eenova.de/projekte/seis, last<br />

accessed Feb. 2010.<br />

3. M. Eichhorn, M. Pfannenstein, D. Muhra, and<br />

E. Steinbach. A SOA-based middleware concept for<br />

in-vehicle service discovery and device integration. In<br />

Intelligent Vehicles Symposium (IV), 2010 IEEE, pages<br />

663–669. IEEE, 2010.<br />

4. M. Eichhorn, M. Pfannenstein, and E. Steinbach. A<br />

flexible in-vehicle HMI architecture based on web<br />

technologies. In International Workshop on Multimodal<br />

Interfaces for Automotive Applications (MIAA 2010),

Hong Kong, China, Feb. 2010.<br />

5. S. Ludwig, J. Beda, P. Saint-Andre, R. McQueen,<br />

S. Egan, and J. Hildebrand. Xep-0166: Jingle. XMPP<br />

Enhancement Proposal, Jabber Software Foundation,<br />

2005.<br />

6. OASIS. Devices profile for web services version 1.1.<br />

http://docs.oasis-open.org/ws-dd/dpws/wsdd-dpws-<br />

1.1-spec.html, last accessed Nov. 2010.<br />

7. QNX Software Systems. QNX Neutrino RTOS.<br />

http://www.qnx.com/products/neutrino rtos/, last<br />

accessed Nov. 2010.<br />

8. P. Saint-Andre et al. Extensible messaging and presence<br />

protocol (XMPP): Core. 2004.<br />

9. The Linux Foundation. Meego. http://meego.com/, last<br />

accessed Nov. 2010.


“Hands Busy, Eyes Busy”: Generating Stories from<br />

Sensor Data for Automotive Applications

Joe Reddington, Ehud<br />

Reiter, Nava Tintarev<br />

Department of Computing<br />

Science<br />

University of Aberdeen<br />

j.reddington, e.reiter,<br />

n.tintarev@abdn.ac.uk<br />

ABSTRACT<br />

This paper examines the potential of using natural language<br />

generation to support “hands busy, eyes busy” automotive<br />

applications. It outlines a hierarchy of complexity of output<br />

text, and the type of sensor data that may be collected. It<br />

also suggests a number of ways natural language generation<br />

can generate narrative events from sensor data for drivers.<br />

Author Keywords<br />

NLG, AAC, event generation, narrative, story, sensors, automotive<br />

applications<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />

INTRODUCTION<br />

This work examines the potential of using automatically harvested<br />

information to generate new phrases automatically,<br />

creating support for “hands busy, eyes busy” automotive applications.<br />

Of particular interest is a review of how technologies<br />

and techniques developed in an assistive technology application<br />

(the recent “How was School Today...?” project)<br />

can be applied to the automotive domain.<br />

Mobile usage while driving has been identified as a risk factor<br />

in road accidents [2, 5]. Reducing both the motivation<br />

to use such devices while driving and the length of time for<br />

which they are used would potentially reduce the number of<br />

road accidents. The position of the authors is that the use<br />

of automatic narration techniques can support communication<br />

in scenarios such as making regular deliveries or public<br />

transportation. Methodologies to enable this type of automatic<br />

text generation are under-researched and NLG can aid<br />

in this task by creating a story that is structured, relevant and<br />

flexible to the current situation, based on sensor data.<br />

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


Rolf Black, Annalu Waller<br />

School of Computing<br />

University of Dundee<br />

rolfblack,awaller@<br />

computing.dundee.ac.uk<br />

It is easy to envisage a system by which buses or delivery<br />

vans automatically send an update of location to a home<br />

server, and indeed many services offer near real-time tracking<br />

of packages from source to destination. In contrast, this<br />

work focuses on combining such messages, augmented with<br />

information from weather reports, traffic reports and other<br />

data, to form a larger message with an overall narrative.<br />

In this paper we situate the work with regard to existing<br />

work, then introduce the “How was School Today...?” project<br />

that informed this work. We go on to identify potential application<br />

areas in the automotive domain, and discuss the<br />

possible effects, risks, and advantages.<br />

RELATED WORK<br />

Our existing work sits on the boundary between Natural Language<br />

Generation (NLG), which is a subcategory of natural<br />

language processing that examines the creation of text from<br />

nonlinguistic data such as sensor readings, and Alternative<br />

and Augmentative Communication (AAC), an area examining<br />

communication for those with restrictions on speech.<br />

NLG techniques can dynamically combine and change some<br />

output depending on the changing internal state of a system<br />

[11]. A popular application area for NLG has been<br />

weather forecasting (generating textual weather forecasts from<br />

the results of a numerical atmosphere simulation model),<br />

and several weather forecast generators have been fielded<br />

and used operationally [17, 16]. A number of data-to-text<br />

systems have also been developed in the medical community,<br />

such as BabyTalk [15], which generates summaries of<br />

clinical data from a neonatal intensive care unit, and the<br />

commercial Narrative Engine [14] which summarises data<br />

acquired during a doctor/patient encounter.<br />

In this paper, we seek to focus the technology away from<br />

AAC and on the automotive domain, where natural language<br />

processing systems have been used with some success. For<br />

example, RoadSafe is an NLG system that has been operationally<br />

deployed at Aerospace and Marine International<br />

(AMI) to produce weather forecast texts for winter road maintenance.<br />

It generates forecast texts describing various weather<br />

conditions on a road network [10]. Other systems have focused<br />

more on processing language to visualise and animate<br />

3D scenes from car accident reports [3].


Figure 1. Types of input that can be collected by a mobile device: voice recording, RFID, voice, emotional embellishments<br />

Automotive research in general is well developed; of particular relevance to this work is the issue of privacy in vehicle-to-vehicle,

or vehicle-to-base communication, see e.g. [8, 9].<br />

The “How was School Today...?” project<br />

Our work is informed by the “How was School Today...?”<br />

(HWST) project [1, 6] which logged sensor data for students<br />

at a special needs school. This data included object and person<br />

interactions, voice recordings, and location information<br />

(at the room level). It also recorded positive and negative<br />

evaluations (e.g. “It was not a good day.”) input by the children.<br />

This framework has been tested as a proof-of-concept<br />

in the context of generating stories for children at the school.<br />

The students (who had no, or very limited, speech) could<br />

then relay these stories to parents or other conversation partners.<br />

For this particular domain, the types of data recorded for<br />

each user are:<br />

• Location data - each time the user entered a new room,<br />

this information was recorded. (Pre-processing removed<br />

rooms entered for less than three minutes).<br />

• Object interaction - each time the user interacted with an<br />

object that had an RFID tag, that interaction was recorded.<br />

• Person interaction - each time the user interacted with a<br />

person that had an RFID tag, that interaction was recorded.<br />

• Voice messages - staff and teachers were encouraged to<br />

record voice messages, as if the user was speaking in the<br />

first person, that described the user’s recent activities.<br />

An example set of data would be:<br />

11:36, Location, Tutorial Room<br />

11:36, Object, Money<br />

11:39, Object, Monkey Game<br />

This is converted into English text to give the story:

I played with Money and Monkey Game. This happened<br />

at a Tutorial Room.<br />
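A minimal illustration of this conversion step is sketched below with hand-written templates in Python; the record format follows the example above, while the grouping and wording rules are assumptions and do not reproduce the HWST system's actual NLG pipeline.

```python
records = [
    ("11:36", "Location", "Tutorial Room"),
    ("11:36", "Object", "Money"),
    ("11:39", "Object", "Monkey Game"),
]

def simple_story(recs):
    """Turn a short run of sensor records into a two-sentence story."""
    objects = [value for _, kind, value in recs if kind == "Object"]
    locations = [value for _, kind, value in recs if kind == "Location"]
    sentences = []
    if objects:
        sentences.append("I played with " + " and ".join(objects) + ".")
    if locations:
        sentences.append("This happened at a " + locations[0] + ".")
    return " ".join(sentences)

print(simple_story(records))
# -> I played with Money and Monkey Game. This happened at a Tutorial Room.
```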


Many of the input sensor data and techniques used in HWST<br />

can be applied to the automotive domain. Figure 1 outlines

the type of input that could be used in such a system and collected<br />

with a mobile phone, e.g. voice recordings, location,<br />

interactions with people and objects (RFID).<br />

The HWST project is in the process of introducing the Nokia<br />

6212 1 as a collection device, which may need to be supplemented

with an additional system for recording location information<br />

on the room level.<br />

Depending on the granularity of location data required, other<br />

hardware may supplement a mobile phone. GPS tracking<br />

may be more suitable for larger distances while Bluetooth or

other methods may be preferable for room-level identification.<br />

Additional sensor data may be available in a vehicle<br />

such as changes in light, temperature, etc. [4], or speed and

fuel usage.<br />

TYPES OF PRODUCIBLE CONTENT

This section categorises the potential outputs of automatically<br />

generated content into a triple-tiered hierarchy of network-based

input, sensor-based input, and the creation of narratives<br />

from sensor input. This hierarchy can be broadly arranged<br />

in terms of invasiveness of the data collection. This<br />

and other privacy concerns are key to any implementation.<br />

Network-based input<br />

Network-based input is defined as new utterances that can<br />

be determined by access to information over the Internet, or<br />

some other large information portal. An example is talking<br />

about the weather - phrases such as “It’s very warm today”,<br />

and “The snow is starting to stick!”, but this can include<br />

“There was an accident on the M14”, or “Traffic is slow<br />

around Old Trafford due to the match”.<br />

1 http://europe.nokia.com/find-products/<br />

devices/nokia-6212-classic, retrieved November<br />

2010


Sensor-based input<br />

Sensor-based input is defined as the use of single facts about<br />

the user provided by sensor data. Examples might include “I<br />

went to Leeds” - provided by GPS data, or “I just handled<br />

package 41” - provided by use of a barcode scanner in combination<br />

with an online lookup of the IDs for the packages.<br />

Although there is a concern that this sort of data collection<br />

can affect both privacy and the workload required to maintain

it, messages can be better adapted: “I got a text message<br />

from Jamie this morning, he said ‘looking forward to tomorrow’<br />

”. Voice messages are included in this category and<br />

can include information that would never be picked up by a<br />

sensor - “I helped jump-start a car and was 15 minutes late.”.<br />

Creation of narratives from sensor data<br />

This category contains those groups of messages, based on<br />

sensor data, that together relate an experience or tell a story,<br />

thus adding the problems of creating a narrative structure or<br />

consistent style to what has previously been a data-mining<br />

exercise. The importance of narrative in exchanging information<br />

is well-researched; for an NLG example, see [12].

In HWST, stories were generated using additional reasoning,<br />

such as giving more importance to events that occurred in<br />

locations which were unexpected compared to a timetable.<br />

These stories were also augmented by users with positive<br />

and negative annotations of utterances “She was nice.” (for<br />

people) or “It was not a good day.” (for the whole story) [1].<br />

The creation of multi-fact, multi-sentence messages with a<br />

structured narrative is a step forward in NLG terms, requiring

more sophisticated techniques than previous levels in<br />

the hierarchy. In particular, this moves the focus of NLG<br />

research to the tasks of document planning and document<br />

structuring, compared to text generation on the sentence level.<br />

The analysis of sensor-based data, defining one of these multi-fact and multi-sentence messages as an ‘event’, is discussed

in [6]. While the NLG techniques outlined in [11] can combine<br />

facts into plain English, a further challenge lies in defining<br />

boundaries between groups of sensor data to define separate<br />

events. The goal is to arrange the sensor-based input<br />

into a narrative structure that accurately relates events.<br />

Based on a modified version of the data recording in the<br />

HWST project, one could assume input data such as that<br />

highlighted in Figure 2. The generated text could then be:<br />

“This morning, after picking up two packages, I helped<br />

jump-start a car and was delayed by 15 minutes. Later, I<br />

arrived at the Leeds depot and delivered the packages to Mr.<br />

Roberts. The delivery went fine”.<br />
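One simple way to approach the event boundary problem sketched above is to split the sensor log wherever the gap between consecutive records exceeds a threshold and to verbalize each group afterwards. The Python sketch below only illustrates the segmentation idea; the threshold, the record format and the subset of records are assumptions and not the method of [6].

```python
from datetime import datetime, timedelta

log = [
    ("06:27:00", "Object", "Package1"),
    ("06:27:07", "Object", "Package2"),
    ("07:34:00", "Voice Recording", "I helped jump-start a car and was delayed by 15 minutes."),
    ("09:40:00", "Location", "Leeds depot"),
    ("09:40:00", "Object", "Package1"),
    ("09:40:05", "Object", "Package2"),
    ("09:40:00", "Person", "Mr. Roberts"),
]

def segment_events(records, gap_minutes=30):
    """Group records into candidate events whenever the time gap exceeds the threshold."""
    events, current = [], [records[0]]
    for prev, rec in zip(records, records[1:]):
        t_prev = datetime.strptime(prev[0], "%H:%M:%S")
        t_rec = datetime.strptime(rec[0], "%H:%M:%S")
        if t_rec - t_prev > timedelta(minutes=gap_minutes):
            events.append(current)   # close the current event at a large gap
            current = []
        current.append(rec)
    events.append(current)
    return events

for event in segment_events(log):
    print([value for _, _, value in event])   # each printed group is one candidate event
```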

APPLICATION AREAS<br />

The previous section discussed the types of text that can be<br />

generated. This section outlines several practical applications<br />

of the generated narrative text in automative applications:<br />

staying in touch; communication with head office;<br />

and accident reports. Privacy is an important consideration<br />

in any application; the people on whose behalf the story is<br />

generated should always have the possibility to read and edit<br />


06:27:00, Object, Package1<br />

06:27:07, Object, Package2<br />

07:34:00, Voice Recording, I helped jump-start a car and was delayed<br />

by 15 minutes.<br />

09:40:00, Location, Leeds depot.<br />

09:40:00, Object, Package1<br />

09:40:05, Object, Package2<br />

09:40:00, Person, Mr. Roberts<br />

09:43:00, Embellishment, Positive . . .<br />

Figure 2. Possible input data<br />

any text before it is transmitted. Moreover, any generated<br />

text can be read aloud by text-to-speech software.<br />

This would also facilitate responses to messages originally<br />

sent to a driver, allowing the original sender (who may also

be a driver) to hear the response without extra effort and reducing<br />

cognitive load.<br />

Staying in touch<br />

Many people keep in touch with mobile texts and an increasing<br />

number stay connected using social media such as Facebook<br />

and Twitter 2 . Professional drivers may feel that updating<br />

their status is important from a social as well as professional<br />

perspective. However, while driving, attention should

be on the road, and hands and eyes will be occupied by driving.<br />

An application that uses NLG to automatically update<br />

friends on one’s activities may help drivers feel connected<br />

in their everyday lives. The necessity to automatically generate<br />

such short messages is highlighted in [4] who suggest<br />

messages such as “35 centigrades? It is very hot in here!”.<br />

In particular, the work on structuring narrative produced by<br />

HWST technology allows a move from the functional single<br />

sentence update to a more expressive longer update.<br />

Work Reports<br />

The key application in this area is the generation of automatic<br />

work reports based on a driver’s sensor data. This sort<br />

of narrative can supply an employer with information about<br />

his drivers, such as the hours that they have worked and<br />

which deliveries or other tasks have been successfully executed.<br />

At the same time, the automatic generation of the text<br />

relieves the employee of the task of writing lengthy reports.<br />

Of particular use is text informing end-users of the current<br />

conditions - rather than a simple “Delayed, new ETA:15:27”<br />

message, one can imagine “When coming from a previous<br />

delivery at Hogsmeade, there was heavy traffic due to an<br />

accident in the town so the delivery has been diverted via<br />

Hogwarts and should be with you by 15:27”.<br />

Accident Reports<br />

Generated narrative stories from sensor data can also be

used to support police and ambulance staff at the scene of<br />

the accident. The generated reports can offer a human readable<br />

summary of the situation well ahead of arrival on the<br />

scene, allowing professionals to be ready once they arrive.<br />

This sort of report can help assess the degree of damage<br />

2 www.facebook.com, www.twitter.com, retrieved November 2010


incurred at an accident by considering road conditions and<br />

travel speed. This type of report could also help police (and<br />

insurance companies) assess potential accountability for a<br />

given accident. Infra-red sensors may help assess how many<br />

victims were involved in an accident as well, ensuring that<br />

all victims get pulled out of an affected vehicle.<br />

CONCLUSION AND ONGOING RESEARCH<br />

This paper describes the type of text that can be automatically<br />

generated to support drivers, and highlights three application

areas: staying in touch, communication with head<br />

office, and accident reports. Although a future goal for this<br />

research is to integrate with a commercial product, privacy<br />

and security of such systems require careful consideration.

While care has been taken to keep such concerns a key part<br />

of the research, the authors welcome any communication<br />

from parties with expertise in this area.<br />

ACKNOWLEDGEMENTS<br />

The authors are particularly grateful to the school, staff, and<br />

children. This research was supported by the UK Engineering<br />

and Physical Sciences Research Council under grants<br />

EP/F067151/1, EP/F066880/1, EP/E011764/1,<br />

EP/H022376/1, and EP/H022570/1.<br />

REFERENCES<br />

1. R. Black, J. Reddington, E. Reiter, N. Tintarev, and<br />

A. Waller. Using nlg and sensors to support personal<br />

narrative for children with complex communication<br />

needs. In Proceedings of the NAACL HLT 2010<br />

Workshop on Speech and Language Processing for<br />

Assistive Technologies, pages 1–9, Los Angeles,<br />

California, June 2010. Association for Computational<br />

Linguistics.<br />

2. F. A. Drews, H. Yazdani, C. N. Godfrey, J. M. Cooper,<br />

and D. L. Strayer. Text messaging during simulated<br />

driving. Human Factors: The Journal of the Human<br />

Factors and Ergonomics Society, 51 (5):762–770, 2009.<br />

3. S. Dupuy, A. Egges, V. Legendre, and P. Nugues.<br />

Generating a 3d simulation of a car accident from a<br />

written description in natural language: the carsim<br />

system. In Proceedings of the workshop on Temporal<br />

and spatial information processing - Volume 13, pages<br />

1:1–1:8, Morristown, NJ, USA, 2001. Association for<br />

Computational Linguistics.<br />

4. C. Endres and D. Braun. Pleopatra: A Semi-Automatic<br />

Status-Posting Prototype For Future In-Car Use. In<br />

Adjunct proceedings of the 2nd International<br />

Conference on <strong>Automotive</strong> User Interfaces and<br />

Interactive Vehicular Applications (AutomotiveUI

2010), page 7, Pittsburgh, PA, USA, November 2010.<br />

5. S. P. McEvoy, M. R. Stevenson, and M. Woodward.<br />

Phone use and crashes while driving: a representative<br />

survey of drivers in two australian states. Medical<br />

journal of Australia, 185(11/12):630–634, 2006.<br />

6. J. Reddington and N. Tintarev. Automatically<br />

generating stories from sensor data. In Intelligent User<br />

Interfaces, 2011 (to appear).<br />


7. E. Reiter, R. Turner, N. Alm, R. Black, M. Dempster,<br />

and A. Waller. Using nlg to help language-impaired<br />

users tell stories and participate in social dialogues. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG-09), 2009.

8. F. Schaub, F. Kargl, Z. Ma, and M. Weber. V-tokens for<br />

conditional pseudonymity in vanets. In IEEE Wireless<br />

Communications & Networking Conference (IEEE<br />

WCNC 2010), Sydney, Australia, 04/2010 2010. IEEE,<br />

IEEE.<br />

9. F. Schaub, Z. Ma, and F. Kargl. Privacy requirements in<br />

vehicular communication systems. Computational<br />

Science and Engineering, IEEE International<br />

Conference on, 3:139–145, 2009.<br />

10. R. Turner, Y. Sripada, and E. Reiter. Generating<br />

approximate geographic descriptions. In Proceedings of<br />

the 12th European Workshop on Natural Language<br />

Generation, ENLG ’09, pages 42–49, Morristown, NJ,<br />

USA, 2009. Association for Computational Linguistics.<br />

11. E. Reiter and R. Dale. Building natural language<br />

generation systems, Cambridge University Press, 2000.<br />

12. E. Reiter, A. Gatt, F. Portet, and M. van der Meulen.<br />

The importance of narrative and other lessons from an<br />

evaluation of an NLG system that summarises clinical<br />

data. INLG ’08, pp. 147–156, Morristown, NJ, USA,<br />

2008. Association for Computational Linguistics.<br />

13. S. Ashraf, A. Judson, I. W. Ricketts, A. Waller, N. Alm,<br />

B. Gordon, F. MacAulay, J. K. Brodie, M. Etchels,<br />

A. Warden, and A. J. Shearer. Capturing phrases for<br />

ICU-Talk, a communication aid for intubated intensive<br />

care patients. In ACM Conference on Assistive<br />

technologies, pp. 213–217, New York, NY, USA, 2002.<br />

14. M. D. Harris. Building a large-scale commercial NLG<br />

system for an EMR. In INLG ’08: Proceedings of the<br />

Fifth International Natural Language Generation<br />

Conference, pages 157–160, Morristown, NJ, USA,<br />

2008. Association for Computational Linguistics.<br />

15. A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood,<br />

W. Moncur, and S. Sripada. From data to text in the<br />

neonatal intensive care unit: Using NLG technology for<br />

decision support and information management. AI<br />

Commun., 22(3):153–186, 2009.<br />

16. E. Reiter, S. Sripada, J. Hunter, J. Yu, and I. Davy.<br />

Choosing words in computer-generated weather<br />

forecasts. Artif. Intell., 167(1-2):137–169, 2005.<br />

17. E. Goldberg, N. Driedger, and R. I. Kittredge. Using<br />

natural-language processing to produce weather<br />

forecasts. IEEE Expert: Intelligent Systems and Their<br />

Applications, 9(2):45–53, 1994.


A novel taxonomy for gestural interaction techniques:<br />

considerations for automotive environments<br />

Adriano Scoditti<br />

Laboratoire d’Informatique de Grenoble, Equipe IIHM<br />

385, rue de la Bibliotheque, BP 53, F-38041 Grenoble cedex 9, France<br />

adriano.scoditti@imag.fr<br />

ABSTRACT<br />

A large variety of gestural interaction techniques is now<br />

available. In this article, we use a new taxonomic space [18]<br />

as a comparative structure to analyze the applicability of<br />

these techniques in automotive environments. The taxonomy

plots a gestural interaction technique as a point in a<br />

space where the vertical axis denotes the semantic coverage<br />

of the technique, and the horizontal axis expresses the<br />

physical actions users are engaged in. In addition, syntactic<br />

modifiers are used to express the interpretation process of input<br />

tokens into semantics, as well as pragmatic modifiers to<br />

make explicit the level of indirection between users’ actions

and system responses. In the taxonomy, the complexity of<br />

the gestural interaction lexicon, and the syntactic/pragmatic<br />

modifiers it is decorated with, are indexes of the cognitive<br />

load users are engaged in during the interaction. The integration<br />

of modern mobile devices, complex user interfaces and<br />

gestural interaction techniques into automotive environment<br />

rise the necessity to analyze gestural interaction technique<br />

from their cognitive load point of view.<br />

Author Keywords<br />

Handheld devices and mobile computing, Input and interaction<br />

technologies, Multi-modal interfaces, Recognition and<br />

interpretation of user input (face, body, speech etc.)<br />

ACM Classification Keywords<br />

H.5.2 Information Interfaces and Presentation: Miscellaneous<br />

INTRODUCTION<br />

Last generation mobile devices are enhanced with a diversity<br />

of sensors capable of probing real world physical properties<br />

in real time. The pioneering work on sensor-based interaction<br />

techniques [8, 11, 12, 15, 16] has paved the way for<br />

an active research area [1, 20, 21]. Although these results<br />

satisfy “the gold standard of science” [19], in practice, they<br />

are too “narrow truths” [4] to support designers’ decisions and researchers’ analyses. Designers and researchers need an

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.


Figure 1. Integration of last generation mobile devices in the automotive environment raises the need to analyze gestural interaction techniques

from their cognitive load point of view [?].<br />

overall systematic structure that helps them to reason, compare,<br />

elicit (and create!) the appropriate techniques for the<br />

problem at hand. Taxonomies, which provide such a structure,<br />

are good candidates for generalization in an emerging<br />

field. The challenge, however, is to provide a classification<br />

framework that is both complete and simple to use. Since<br />

completeness is illusory in a moving and prolific domain<br />

such as user interface design, we will not include it in our<br />

goals.<br />

In this article, we propose the interpretation of a new taxonomy<br />

for gestural interaction techniques [18] with considerations<br />

for automotive environments.

To develop our taxonomy, we have built a controlled vocabulary<br />

(i.e. primitives) obtained through an extensive analysis<br />

of the taxonomies that have laid the foundations for<br />

Human-Computer Interaction (HCI) more than twenty-five

years ago. For the most part, this early work in HCI has<br />

been ignored or forgotten by researchers driven by the trendy<br />

“technology push” approach.<br />

Our taxonomy is based on the following principles:<br />

(1) Interaction between a computer system and a human being<br />

is conveyed through input (output) expressions that are<br />

produced with input (output) devices, and that are compliant<br />

with an input (output) interaction language.<br />

(2) As any language, an input (output) interaction language<br />

can be defined formally in terms of semantics, syntax, and<br />

lexical units.


Figure 2. The “sliding” gesture is semantically multiplexed to achieve<br />

different meanings, depending on context.<br />

(3) The generation of an input (output) expression involves<br />

using devices whose characteristics, from the human perspective,<br />

have a strong impact on the expressiveness and<br />

the effectiveness of the user interface [5].<br />

Building on Foley’s work [9] as well as on Buxton’s pragmatics<br />

considerations of input structures [5], our taxonomy<br />

brings together the four aspects of interaction ranging<br />

from semantics to pragmatics with the appropriate humanmotivated<br />

extensions for addressing the specificity of gestural<br />

interaction based on accelerometers. In contrast to<br />

Mackinlay et al.’s semantic analysis of the design space for<br />

input devices [13], we do not consider the transformation<br />

functions that characterize the system-oriented perspective<br />

of interaction techniques.<br />

Our expectation is to provide new insights and to start<br />

promising directions for the design of novel and powerful<br />

gestural interaction techniques.<br />

A NEW TAXONOMY<br />

As shown in Figure 2, the same gesture may convey very<br />

different meanings depending on the context in which it is<br />

produced: “go to previous photo” as for the Apple’s photo<br />

album (or “go to next slide” as in Charade in [2]), “open a<br />

submenu” in Francone’s Wavelet Menu [10], or “unlock” the<br />

iPhone screen. In addition, a gesture that makes sense to the system may not be acceptable in a public social context [17], as it could carry meaning for, and be interpreted by, the surrounding public.

These observations lead us to define a new taxonomy according<br />

to the following principles: (1) Coverage of semantic,<br />

syntactic, lexical, and pragmatic issues of interaction where<br />

semantic granularity is that of Foley’s et al. interaction tasks;<br />

(2) Adoption of a user centered perspective where physical<br />

human actions are premium, leaving aside the internal<br />

computational transformations; (3) Consideration for context;<br />

(4) Coverage of both foreground and background interaction<br />

(as defined by Buxton [6]). Figure 3 shows the<br />

elements of the framework that we describe in detail next.<br />

Lexical Axis<br />

Because of our focus on users’ involvement in the interaction,<br />

the input lexicon corresponds to the physical actions<br />

users apply to devices. We divide human physical actions<br />

into two groups: (1) conscious actions that belong to the<br />


Figure 3. Our classification space for gestural interaction techniques<br />

based on accelerometers. The abscissa defines the lexicon in terms of<br />

the physical manipulations users perform with the device, with a clear<br />

separation between background and foreground interaction. The ordinate<br />

corresponds to Foley’s interaction tasks. An interaction technique<br />

is uniquely identified by an integer i and plotted as a point in this space.<br />

Each point is decorated with the pragmatic and syntactic properties of<br />

the corresponding interaction technique.<br />

foreground interaction, and (2) unconscious actions that correspond<br />

to background interaction. The foreground interaction<br />

area contains the interaction techniques that require<br />

the user to consciously manipulate the device to reach some<br />

objective (as for the sliding gesture of Figure 2). The background<br />

interaction area corresponds to the interaction techniques<br />

where the system interprets user’s unconscious actions<br />

together with contextual information to perform some<br />

system state change on behalf of the user. For example, during<br />

a phone call, the iPhone switches the screen backlight<br />

off to save battery life as the user brings the device next to

the ear.<br />

Whether human actions are performed consciously to address<br />

the system or not, our classification space characterizes<br />

these actions with two additional variables: (τ) the geometrical<br />

transformation matrix that models user’s movements in<br />

space, and (f) the frequency of these movements. The combinations<br />

of τ and f identify three sub-areas within the lexical<br />

axis: “Context”, “Affine Transformations” and “Shock”.<br />

The affine transformations group identifies the most common<br />

interaction techniques based on translations, rotations<br />

and/or scales (in this case, τ is different from the identity<br />

matrix I), and without any repetition (that is, f is equal to<br />

zero, meaning that the interaction is time driven). The sliding<br />

gesture of Figure 2 falls in this category. The shock<br />

category identifies those interaction techniques based on a<br />

combination of translations, rotations and/or scales (τ is different<br />

from the identity matrix) repeated over time (then, f<br />

is different from zero). The shake gesture exemplified by<br />

Shoogle [20] falls in this category. The context category<br />

corresponds to unconscious human manipulations that the<br />

system may interpret to feed into its own context model and,<br />

depending on this context, acts on behalf of the user. For<br />

this situation, we stipulate that τ is the Identity matrix and f<br />

is equal to zero.
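To make the three lexical sub-areas concrete, the following minimal Python sketch (our illustration, not part of the taxonomy in [18]) classifies a manipulation from the two variables introduced above, the transformation matrix τ and the movement frequency f.

    import numpy as np

    def lexical_category(tau: np.ndarray, f: float) -> str:
        """Place a manipulation on the lexical axis from the transformation
        matrix tau and the movement frequency f (illustrative rule only)."""
        is_identity = np.allclose(tau, np.eye(tau.shape[0]))
        if is_identity and f == 0:
            return "Context"                 # unconscious, background manipulation
        if not is_identity and f == 0:
            return "Affine Transformations"  # e.g. the sliding gesture of Figure 2
        if not is_identity and f != 0:
            return "Shock"                   # repeated movement, e.g. Shoogle's shake
        return "unclassified"                # identity transform with repetition is not covered

    # Example: a single translation (homogeneous 2D coordinates) with no repetition.
    slide = np.array([[1.0, 0.0, 0.3],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
    print(lexical_category(slide, f=0.0))    # -> "Affine Transformations"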


Syntactic Axis<br />

Independently from the device used, we characterize the<br />

syntactic dimension of an interaction technique with the following<br />

two variables that we call syntactic modifiers: (1) the<br />

existence (or absence) of triggers to specify the begin/end of<br />

the interaction, and (2) the control type associated with the<br />

input token, which may be position-control, speed-control<br />

or acceleration-control. As a result, given that, in our taxonomy,<br />

an interaction technique is uniquely identified by an<br />

index i, the trigger syntactic modifier is represented as an<br />

oval that surrounds the interaction technique identifier using<br />

a dashed line or a continuous line to denote, respectively, the presence (i.e., clutched) or absence (i.e., unclutched) of a trigger.

In addition, a derivative-like notation is used to convey the<br />

control type where i is decorated with an exponential number<br />

that expresses the derivative order with respect to time (i.e.,<br />

no derivative for position, first order derivative for speed,<br />

and second order derivative for acceleration).<br />

Semantic Axis<br />

As justified in our review about the foundational taxonomies<br />

developed in HCI, we re-use Foley’s interaction tasks: Select,<br />

Position, Orient, Path, Quantify, and Text [9] (See the<br />

vertical axis of Figure 3).<br />

Pragmatic Axis<br />

One of the originalities of our work is the attempt to classify<br />

gestural interaction techniques in close connection with their<br />

meaning in the user’s real world. To do this, we introduce a<br />

pragmatic modifier that expresses the directness [14, 3] of<br />

the mapping between the user’s expectation (i.e. goal) and<br />

the semantics of the interaction technique in the computer<br />

world. For indirect mapping, the identifier i of the interaction<br />

technique becomes the parameter of a function F(i)<br />

to indicate the existence of one or several reinterpretation<br />

layers, whereas for direct mapping, i does not receive any<br />

additional decoration.<br />
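As a reading aid, an interaction technique in this classification space can be captured as a small record combining the four axes. The following Python sketch is our own rendering of the notation, not the authors' implementation: the derivative order stands for position-, speed- or acceleration-control, and the directness flag corresponds to the F(i) decoration for indirect mappings.

    from dataclasses import dataclass

    FOLEY_TASKS = {"Select", "Position", "Orient", "Path", "Quantify", "Text"}

    @dataclass
    class TechniquePoint:
        """One point i in the classification space of Figure 3 (illustrative record)."""
        index: int             # unique identifier i of the technique
        lexical: str           # "Context", "Affine Transformations" or "Shock"
        task: str              # Foley interaction task (semantic axis)
        clutched: bool         # syntactic modifier: explicit trigger present (dashed oval)
        derivative_order: int  # 0 = position-, 1 = speed-, 2 = acceleration-control
        direct: bool           # pragmatic modifier: direct mapping, otherwise F(i)

        def label(self) -> str:
            assert self.task in FOLEY_TASKS
            core = f"i{self.index}^{self.derivative_order}"
            return core if self.direct else f"F({core})"

    # Example: the sliding gesture of Figure 2 modelled as a direct, unclutched,
    # position-controlled Select technique.
    slide = TechniquePoint(1, "Affine Transformations", "Select", False, 0, True)
    print(slide.label())       # -> "i1^0"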

DISCUSSION AND RESEARCH DIRECTIONS<br />

Our fine-structured, language-inspired analysis makes it possible to understand intrinsic and implicit differences even among apparently similar interaction techniques, allowing researchers to explore them more thoroughly and designers to choose the most suitable one for each case.

From the researcher’s point of view, the classification shows<br />

a transparent state of the art where each interaction technique<br />

is classified without ambiguity. Typically, reference<br />

taxonomies such as [9] or [5] do not consider the role of<br />

time (cf. frequency and duration), nor do they cover unconscious<br />

interaction (cf. background interaction) and unstructured<br />

interaction such as device shaking. In addition, they<br />

do not explicitly consider whether an interaction technique<br />

is clutched or unclutched, introducing ambiguities and mixing up different aspects of human interaction behavior.

From the designer’s point of view, the dimensions of our<br />

taxonomy can be used as a framework for decision making.<br />

For example, an unclutched interaction technique may<br />


be considered for default tasks, while different clutched interaction<br />

techniques can be multiplexed through the use of<br />

standard or ad-hoc widgets. By proposing at least one interaction technique for each of the proposed tasks while designing an application, designers will be able to offer a complete and uniform user experience similar to the WIMP one.

Furthermore, designers can predict the difficulties that final<br />

users will encounter by analyzing the pragmatic and syntactic<br />

modifiers that characterize the interaction techniques they<br />

envision. Thus, they will be able to choose interaction techniques<br />

that best suit the targeted representative users (novice,<br />

intermediate, expert).<br />

We see promising research and development directions both in the creation of widgets able to transform direct interactions into their more complex counterparts and in the definition of the elementary interactions on which to base development. The classification suggests concentrating efforts on the development of interaction techniques able to specify Path, Quantify and Text input.

Direct pragmatic interaction techniques are the most suitable for the automotive environment, in particular for drivers. The absence of indirection layers during the interaction results in lower cognitive load, thus easing the interaction and avoiding distraction.

CONCLUSIONS<br />

The characteristics on which we chose to base our analysis are inspired by the parallelism between the artificial languages offered by interfaces and the gestural languages users are accustomed to: lexicon, syntax, semantics and pragmatics. Our discussion did not go down to the system level, as we did not want to differentiate interaction techniques by their implementation characteristics (granularity, resolution function and state machine are variables already taken into account in [7, 13], with respect to which we aim to be complementary rather than a substitute).

Our approach proposes a user-centered classification able to analyze the state of the art of accelerometer-based interaction techniques from the manipulation point of view: the user performs a physical action in his or her space in order to communicate with the system. We think this is the atomic level at which we have to conceive our interfaces in order to offer system-wide coherent languages to users. This coherence will guide them toward a more agreeable, natural [5] and intuitive system with coherent, direct pragmatic distances.

We proposed the use of a parametric space where the pragmatic distance and the syntactic modifiers are indicators of the learning curve users have to climb when approaching a new interaction language.

We contextualized our approach and principles to the automotive environment, and proposed the use of the syntactic and pragmatic modifiers as discriminants for selecting the gestural interaction techniques most suitable in automotive environments.


REMARKS<br />

The content of this article refers to, and is in part an extract of, the taxonomy of accelerometer-based interaction techniques proposed by Scoditti et al. [18].

REFERENCES<br />

1. R. Ballagas, J. Borchers, M. Rohs, and J. G. Sheridan.<br />

The smart phone: A ubiquitous input device. IEEE<br />

Pervasive Computing, 5(1):70, 2006.<br />

2. T. Baudel and M. Beaudouin-Lafon. Charade: remote<br />

control of objects using free-hand gestures. Commun.<br />

ACM, 36(7):28–35, 1993.<br />

3. M. Beaudouin-Lafon. Instrumental interaction: an<br />

interaction model for designing post-wimp user<br />

interfaces. In CHI ’00: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 446–453, New York, NY, USA, 2000. ACM.<br />

4. F. P. Brooks. Grasping reality through<br />

illusion—interactive graphics serving science. In CHI<br />

’88: Proceedings of the SIGCHI conference on Human<br />

factors in computing systems, pages 1–11, New York,<br />

NY, USA, 1988. ACM.<br />

5. W. Buxton. Lexical and pragmatic considerations of<br />

input structures. SIGGRAPH Comput. Graph.,<br />

17(1):31–37, 1983.<br />

6. W. Buxton. Integrating the periphery and context: A<br />

new model of telematic. Proceedings of Graphics<br />

Interface, pages 239–246, 1995.<br />

7. S. K. Card, J. D. Mackinlay, and G. G. Robertson. A<br />

morphological analysis of the design space of input<br />

devices. ACM Trans. Inf. Syst., 9(2):99–122, 1991.<br />

8. G. W. Fitzmaurice, S. Zhai, and M. H. Chignell. Virtual<br />

reality for palmtop computers. ACM Trans. Inf. Syst.,<br />

11(3):197–218, 1993.<br />

9. J. D. Foley, V. L. Wallace, and P. Chan. The human<br />

factors of computer graphics interaction techniques.<br />

IEEE Comput. Graph. Appl., 4(11):13–48, 1984.<br />

10. J. Francone, G. Bailly, L. Nigay, and E. Lecolinet.<br />

Wavelet menu: une adaptation des marking menus pour<br />

les dispositifs mobiles. In IHM ’09: Proceedings of the<br />

21st International Conference on Association<br />

Francophone d’Interaction Homme-Machine, pages<br />

367–370, New York, NY, USA, 2009. ACM.<br />

11. K. Hinckley, J. Pierce, M. Sinclair, and E. Horvitz.<br />

Sensing techniques for mobile interaction. In UIST ’00:<br />


Proceedings of the 13th annual ACM symposium on<br />

User interface software and technology, pages 91–100,<br />

New York, NY, USA, 2000. ACM.<br />

12. G. Levin and P. Yarin. Bringing sketching tools to<br />

keychain computers with an acceleration-based<br />

interface. In CHI ’99: CHI ’99 extended abstracts on<br />

Human factors in computing systems, pages 268–269,<br />

New York, NY, USA, 1999. ACM.<br />

13. J. Mackinlay, S. K. Card, and G. G. Robertson. A<br />

semantic analysis of the design space of input devices.<br />

Hum.-Comput. Interact., 5(2):145–190, 1990.<br />

14. D. Norman. User Centered System Design; New<br />

Perspectives on Human-Computer Interaction. L.<br />

Erlbaum Associates Inc., 1986.<br />

15. K. Partridge, S. Chatterjee, V. Sazawal, G. Borriello,<br />

and R. Want. Tilttype: accelerometer-supported text<br />

entry for very small devices. In UIST ’02: Proceedings<br />

of the 15th annual ACM symposium on User interface<br />

software and technology, pages 201–204, New York,<br />

NY, USA, 2002. ACM.<br />

16. J. Rekimoto. Tilting operations for small screen<br />

interfaces. In UIST ’96: Proceedings of the 9th annual<br />

ACM symposium on User interface software and<br />

technology, pages 167–168, New York, NY, USA,<br />

1996. ACM.<br />

17. J. Rico and S. Brewster. Usable gestures for mobile<br />

interfaces: evaluating social acceptability. In CHI ’10:<br />

Proceedings of the 28th international conference on<br />

Human factors in computing systems, pages 887–896,<br />

New York, NY, USA, 2010. ACM.<br />

18. A. Scoditti, J. Coutaz, and R. Blanch. A novel<br />

taxonomy for gestural interaction techniques based on<br />

accelerometers. In IUI 2011. ACM, 2011.

19. M. Shaw. What makes good research in software<br />

engineering? International Journal of Software Tools<br />

for Technology, 4(1):1–7, 2002.<br />

20. J. Williamson, R. Murray-Smith, and S. Hughes.<br />

Shoogle: excitatory multimodal interaction on mobile<br />

devices. In CHI ’07: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 121–124, New York, NY, USA, 2007. ACM.<br />

21. A. Wilson and S. Shafer. Xwand: Ui for intelligent<br />

spaces. In CHI ’03: Proceedings of the SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 545–552, New York, NY, USA, 2003. ACM.


Navigating Haystacks at 70 mph:<br />

Intelligent Search for Intelligent In-Car Services<br />

Ashweeni K. Beeharee<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 0358<br />

a.beeharee@cs.ucl.ac.uk<br />

ABSTRACT<br />

With an explosion of in-car services, it has become not only<br />

difficult but unsafe for drivers to search and access large amounts<br />

of information using current interaction paradigms. In this paper,<br />

we present a novel approach for visualizing and exploring search<br />

results, and the potential benefits of its application to the current<br />

in-car environment. We have iteratively developed and tested a<br />

prototype system that enables the seamless and personalized<br />

exploration of information spaces. In a number of eye-tracking<br />

studies, we analyzed user satisfaction and task performance for<br />

factual and explorative search tasks. We found that most<br />

participants were faster, made fewer errors and found the system<br />

easier to use than traditional ones. We believe that this approach would improve traditional in-car interfaces for searching and accessing a large number of services with rich information. This would reduce driver inattention to the road and improve road safety.

Categories and Subject Descriptors<br />

H.5.2 [Information Interfaces and Presentation]: User<br />

Interfaces - Graphical user interfaces.<br />

General Terms<br />

Design, Experimentation, Human Factors, Intelligent Transport<br />

System Services, Road Safety, Theory<br />

Keywords<br />

Contextualization, Personalization, Exploration, Search, Context<br />

Interfaces, Contextual User Interfaces<br />

1. SafeTRIP<br />

Satellite-based communication systems [10] for use in homes<br />

[1][13] and cars have been adopted by consumers in many parts of<br />

the world. The SafeTRIP project aims to build on this success and<br />

utilize a new generation of satellite technology to improve the<br />

safety, security and environmental sustainability of road transport.<br />

SafeTRIP uses S-band satellite technology, which is optimized for<br />

two-way communication for on-board vehicle units. The S-band<br />


Sven Laqua<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 0351<br />

s.laqua@cs.ucl.ac.uk<br />


M. Angela Sasse<br />

University College London<br />

Department of Computer Science<br />

Gower Street, London, WC1E 6BT<br />

+44 (0)20 7679 7212<br />

a.sasse@cs.ucl.ac.uk<br />

communication requires a small antenna, making it suitable for the mass market. Existing solutions that use other frequency bands (e.g. Ku-band) require larger antennas [12] and are thus less suitable for integration into vehicles or handheld devices. An

open SafeTRIP platform will be implemented to host services for<br />

improved safety and navigation, but also entertainment and<br />

advertising to vehicle occupants.<br />

Figure 1 - The SafeTRIP concept<br />

During the requirements capture, we held discussions with drivers, operators, emergency technicians, operation managers, technologists and management from road operators, insurance companies, fleet operators, freight forwarders and coach operators to understand their needs.

Figure 2 - User needs define the SafeTRIP platform

The SafeTRIP platform’s definition - based on key functionalities<br />

elicited from business (such as road operators) and individual<br />

stakeholders - is shown in Figure 2. The platform enables services<br />

that can provide access to rich information that might be useful to<br />

drivers. At the same time, this creates a risk of overloading drivers<br />

with information and distracting their attention, which should be focused on the road. In this paper, we present a new paradigm for

accessing rich media and information in a vehicle which has<br />

minimal impact on the driver’s attention while driving.


2. SafeTRIP Services<br />

From our requirements capture, a set of safety and comfort<br />

services were identified, including:<br />

• Road safety alert service – hazard and incident warning;<br />

• Speed limit service – display variable speed limits in-car;<br />

• Collaborative alert service – allow drivers to share<br />

information about road incidents and traffic information;<br />

• Entertainment service - provides access to Streaming media<br />

and TV channels;<br />

• Assistance service - remote assistance and diagnostics;<br />

• Parking guidance service - for hazardous goods vehicles and coaches;

• Location-Based services – access and present localised<br />

information to driver such as petrol stations, restaurants,<br />

hotels, local events.<br />

These services will provide numerous benefits to drivers. For instance, they will allow them to access rich and timely traffic information from various sources in the vehicle. Commercial

systems such as Coyote have proven very popular amongst drivers<br />

who share information about speed cameras in Europe. Through<br />

SafeTRIP, drivers will also be able to share information about<br />

road incidents with each other. Our user requirements capture<br />

shows that individuals are interested in accessing richer<br />

information. Through the above services, they will be able to<br />

access localized information about parking spaces, hotels and<br />

petrol stations – along with rich information – to allow drivers to<br />

search for the cheapest place to refuel or for a restaurant with a cuisine to their liking.

Whilst this type of information could have many benefits for<br />

drivers, there are risks associated with delivering them into<br />

vehicles. In 2006, a study by the U.S. Department of Transportation (DOT) reported that the leading factor in 80% of crashes and 65% of near-crashes is driver inattention [9]. The SafeTRIP platform will partly address this through a driver alertness service that monitors driver alertness and supports warnings to drivers [8].

However, with access to a large number of services, the driver’s<br />

attention will be required to:<br />

• Access a service through the navigation interface<br />

• Interact with a specific service - which may involve searches<br />

that would require further interaction from the driver<br />

Current icon-based interfaces to in-car systems and virtual<br />

keyboards are too taxing on the driver's attention – and this can only get worse with an increasing number of services. This has led us

to consider alternative paradigms for driver interaction with<br />

information delivered into vehicles.<br />

3. INFORMATION EXPLORATION<br />

In this section we describe a novel information exploration<br />

technique to search and access information on the web.<br />

Experiments have clearly demonstrated its benefits and we believe<br />

that this approach will prove beneficial for drivers searching and<br />

interacting with information in their vehicle.<br />

Approaches such as contextual search [3], search result clustering<br />

[16] or personal search [2][15] aim to overcome some of the<br />

shortcomings of “traditional” search engines. However, none of<br />

those approaches challenges the current paradigm of how users<br />

interact with search engines. To us, it is obvious that the<br />

traditional interaction model using search engine result pages<br />


(SERPs) does not work well for more complex information<br />

problems.<br />

To get a broader view, users need to consult different sources and<br />

understand contexts. Most of the time, a single resource will not<br />

be able to satisfy this need. Traditional SERPs fragment the<br />

relevant bits of information, rather than help users to contextualize<br />

them in meaningful ways. Users have to “crawl” site after site,<br />

foraging for meaningful bits [12], emulating the behavior of a<br />

search engine robot. The search engine interaction model<br />

(Figure 3, left side) illustrates users’ interaction with SERPs,<br />

moving back and forth between search results (A, B, C, D) and<br />

the actual SERP (central point).<br />

Figure 3 - Contrasting Interaction Models<br />

3.1 Information Exploration UI<br />

In contrast, users’ interaction with our information exploration<br />

interface – also referred to as Focus-Metaphor Interface (FMI) [4]<br />

- enables seamless exploration of the underlying information<br />

spaces (see Figure 3, right side). This approach combines a<br />

contextual navigation with the actual display of information (see<br />

Figure 4) and particularly facilitates orienteering behavior [14].<br />

When visualizing search results, the FMI replaces traditional<br />

search engine result pages (see Figure 4-A). Its contextual<br />

interface elements contain snippet-like information previews of<br />

the actual search results, and are arranged around the central<br />

content element which displays details of the currently selected<br />

search result (see Figure 4-B).<br />

Figure 4 - FMI prototype for social tools evaluation<br />

When selecting another contextual element, its state changes: it<br />

enlarges into a content element and moves to the centre of the<br />

screen, replacing the previously displayed search result (see<br />

Figure 4-C). This approach allows “browsing” through search<br />

results whilst preserving contextual awareness of the other search<br />

result snippets. In addition, the chosen layout enables a less


hierarchical and more concurrent display of the “top X” search<br />

results, without requiring any scrolling.<br />

However, the key strength of the FMI model becomes apparent<br />

when none of the presented search results meet the user’s<br />

information need. Rather than having to re-formulate another<br />

search query hoping for more promising search results, the user<br />

can simply pick one of the existing results that she thinks comes<br />

“closest” to what she is looking for, and request similar/related<br />

results. This enables the dynamic adaptation of contextual<br />

elements to the currently displayed content element, without<br />

requiring the user to articulate their information need precisely.<br />

This approach represents a break from traditional search behavior,<br />

as the user does not need to constantly go back to a search<br />

interface to (re-)start a new search session. Instead, an initial<br />

search query is the starting point for a seamless and personalized<br />

orienteering and exploration process that guides the user from one<br />

information nugget to the next. Although Google search provides<br />

related functionality through a link called “similar” available with<br />

some of its search result snippets, this functionality mostly works<br />

at a very abstract level (e.g. sites related by topic), but not on the<br />

actual content level. Microsoft search (live.com) provides “related<br />

searches” through a list of similar search queries. However, this<br />

functionality again seems to only work on a rather abstract level<br />

with more generic search queries.<br />

Another key benefit of the FMI model is that its layout and<br />

interaction paradigm lends itself to novel interaction techniques,<br />

such as touch or even eye-gaze. In an earlier study, we have<br />

demonstrated the effective use of our information exploration<br />

interface with eye-gaze only [6].<br />
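To illustrate the interaction model, here is a minimal Python sketch of the FMI navigation state as we read it from the description above (our illustration, not the actual FMI implementation): a central content element surrounded by contextual snippets, where selecting a snippet promotes it to the centre and, on explicit request, the ring is repopulated with related results. The related callback is a stand-in for whatever similarity backend the real system uses.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class FMIState:
        center: str                            # currently displayed search result
        context_ring: List[str]                # surrounding snippet previews
        related: Callable[[str], List[str]]    # stand-in similarity backend
        ring_size: int = 8

        def select(self, snippet: str) -> None:
            """Promote a contextual snippet to the centre; the ring otherwise
            stays unchanged, preserving contextual awareness."""
            self.context_ring.remove(snippet)
            self.context_ring.append(self.center)
            self.center = snippet

        def request_related(self) -> None:
            """On explicit user request, repopulate the ring with results
            related to the current centre (the 'closest result' strategy)."""
            self.context_ring = self.related(self.center)[: self.ring_size]

    # Usage with a trivial backend.
    state = FMIState(center="result A",
                     context_ring=["result B", "result C"],
                     related=lambda item: [f"{item} / related {k}" for k in range(8)])
    state.select("result B")
    state.request_related()
    print(state.center, state.context_ring)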

3.2 Experimentation<br />

Over 3 years, we have conducted a number of lab-based studies of<br />

various FMI prototype iterations. We evaluated the performance<br />

of and user satisfaction with our prototype against a range of<br />

existing tools, such as individual blogs, blog spaces, Google news,<br />

Google Reader and PARC’s StarTree [4][5][7].<br />

Throughout those studies, task completion times were<br />

significantly faster and error rates were significantly lower using<br />

the FMI than in blog environments (see Figure 5) and on a par<br />

with PARC’s StarTree (which only works for well-formed<br />

information spaces).<br />

Figure 5 - Cross-study comparison<br />

Participants using the FMI had short and very consistent average<br />

fixation durations, which indicate lower cognitive load than in all<br />

compared systems. User feedback through questionnaires and<br />

informal interviews confirmed the ease of use and learnability of<br />

the FMI prototypes for most users.<br />


3.3 Social Tools Study<br />

In our latest study, we used a corpus of domain-specific blog<br />

entries to evaluate a range of social tools, namely the ability to<br />

tag, rate and bookmark any of the articles. We looked at the<br />

impact of 1) ratings on contextual search snippets and 2) tags on<br />

search result presentation (see Figure 6).<br />

Figure 6 – Screenshot of FMI with social tools<br />

The eye-tracking experiment involved 21 participants, 13m/8f,<br />

20-46 years (avg. 25.7). We used a range of factual and<br />

explorative search tasks. For factual search tasks, participants had<br />

to identify a specific article; for explorative search tasks,<br />

participants had to explore a certain topic for a few minutes. In<br />

both cases, we used small scenarios to facilitate intrinsic<br />

motivation in the participants.<br />

For the contextual search result snippets, our analysis of post-experiment usability questionnaires (Likert scale, 1-6) revealed

that participants found the “5 star rating” functionality very quick<br />

and easy to use (5.5). The ability to have ratings displayed in the<br />

contextual navigation elements was rated significantly higher than<br />

the perceived impact on users’ navigational decisions (4.8 vs. 4.0,<br />

t(20) = 2.09, p < 0.02).

However, analysis of the eye-tracking data shows that participants' awareness of the actual ratings was substantial, considering the rating's small size within the contextual search snippet (see Table 1).

Table 1 - Search snippet attention distribution (relative gaze time)
Rating: 17.1 %
Title: 54.4 %
Description: 28.5 %

Within this study of social tools for the FMI, selecting a “new”<br />

central content element automatically updated the contextual<br />

elements to display the articles most similar/related to the newly activated content element. However, user feedback showed that

the automatic contextualization of relevant search snippets is too<br />

volatile for users’ taste. For future studies, we have therefore<br />

settled on a static/persistent contextual visualization that (only)<br />

adjusts to the currently displayed content element upon request by<br />

the user.<br />

4. SafeTRIP FMI<br />

With the large number of services available through SafeTRIP,<br />

searching through services and information, using traditional


methods and interfaces of in-car systems, can prove to be time-consuming. Inefficient search therefore has a detrimental impact

on the driver’s attention and thus on road safety.<br />

As FMI has proven to be an effective tool for searching and<br />

presenting information, we believe that its application to the in-car<br />

environment would be beneficial to the driver. We have identified<br />

some application areas for the SafeTRIP in-car interface that<br />

could benefit from this approach.<br />

Service Search<br />

SafeTRIP is an open platform, allowing third-party applications/services to be made available to drivers. With dozens of services already planned and new ones appearing over time, the traditional icon/menu-based interface in most in-car systems may not be appropriate. With FMI, drivers will be able to search through hundreds of services and locate the ones that are most relevant. As our studies show, precise

search criteria may be difficult to formulate – especially when<br />

searching for a new service. Also, if the user goes down the wrong<br />

search path, he can explore information sets that look relevant,<br />

without reformulating the search all over again.<br />

Search Traffic Info<br />

Typically, drivers combine traffic information from various<br />

sources to make decisions while driving. With new services in<br />

SafeTRIP, traffic information will be available from yet more<br />

sources – namely road operators, other drivers, authorities and<br />

traffic information providers. The reliability and timeliness of<br />

such information differs across sources – and drivers know how to<br />

exploit these differences. FMI can be used to provide an efficient<br />

mechanism to search for the most appropriate information, given<br />

that complete automation is unlikely as drivers use a mix of<br />

information sources based on their personal preferences.<br />

Display Traffic Info<br />

With SafeTRIP, we plan to provide rich traffic information to the<br />

drivers. On the motorway, variable speed restrictions (e.g. in the

event of a road incident) will be sent to the vehicle (instead of<br />

being displayed on a Variable Message Sign) with some details<br />

about the incident. It is expected that drivers would be more likely<br />

to respect the new speed restrictions if they are aware of the<br />

underlying reason. However, the display of rich information can

lead to information overload or inattentional blindness – causing<br />

the driver to ignore the important information in the messages.<br />

The layout of information in the FMI is designed to be<br />

minimalistic, providing as much relevant information as a user<br />

can process effectively, allowing for easy decision making and<br />

exploration of further relevant information.<br />

Entertainment Selection Interface<br />

Remote controls fitted to the steering wheel are a definite<br />

improvement that allows drivers to interact with the in-car<br />

entertainment system without taking their eyes off the road.<br />

However, with the explosion of entertainment options – both<br />

audio and video – through the SafeTRIP platform, it is likely that<br />

such solutions will quickly show their limitations. We believe that<br />

the FMI approach would allow the driver to quickly and<br />

efficiently search through the entertainment options.<br />

5. CONCLUSION<br />

It is clear to us that web-based search benefits from the FMI approach, as demonstrated by our experimental results. With the increase in the number of services available in the car – such as those provided through SafeTRIP – there is a real need for an effective and efficient way to search for and interact with those services. We therefore believe that in-car systems would greatly benefit from the FMI approach: by decreasing search time, it would improve the driver's attention on the road and contribute to road safety.

6. REFERENCES<br />

[1] Bly, S., Schilit, B., McDonald, D.W., Rosario, B., Saint-<br />

Hilaire, Y., Broken expectations in the digital home, Ext.<br />

Abstracts CHI 2006, ACM Press(2006), 568-573.<br />

[2] Cutrell, E. et al. (2006). Fast, Flexible Filtering with Phlat –<br />

Personal Search and Organization Made Easy. In<br />

Proceedings of CHI 2006, Montreal, Canada.<br />

[3] Kraft, R. et al. (2006). Searching with Context. In Proc. of<br />

International World Wide Web Conference (WWW ’06),<br />

(Edinburgh, Scotland, 2006). ACM Press.<br />

[4] Laqua, S. and Brna, P. The Focus-Metaphor Approach: A<br />

Novel Concept for the Design of Adaptive and User-Centric<br />

Interfaces. In Proc. Interact 2005, Springer (2005), 295-308.<br />

[5] Laqua, S. and Sasse, M.A. (2009). Exploring Blog Spaces: A<br />

Study of Blog Reading Experiences using Dynamic<br />

Contextual Displays. In: Proc. HCI 2009, Cambridge, UK.<br />

[6] Laqua, S., Bandara, S. U., and Sasse, M.A. (2007)<br />

GazeSpace: Eye Gaze Controlled Content Spaces. In Proc.<br />

HCI 2007, Vol.2, 21 st BCS HCI Group Conference (2007).<br />

[7] Laqua, S., Ogbechie, N., and Sasse, M.A. (2007).<br />

Contextualizing the Blogosphere: A Comparison of<br />

Traditional and Novel User Interfaces for the Web. In Proc.<br />

HCI 2007, Vol.2, 21 st BCS HCI Group Conference.<br />

[8] Lee, J. D., Hoffman, J. D., and Hayes, E. 2004. Collision<br />

warning design to mitigate driver distraction. In Proceedings<br />

of the SIGCHI Conference on Human Factors in Computing<br />

Systems (Vienna, Austria, April 24 - 29, 2004). CHI '04.<br />

ACM, New York, NY, 65-72.<br />

[9] NHTSA. The impact of Driver Inattention on Near-<br />

Crash/Crash Risk.<br />

http://www.nhtsa.gov/Research/Human+Factors/Distraction<br />

[10] Orbcomm. http://www.orbcomm.com<br />

[11] OmniTRACS. http://www.qualcomm.com<br />

[12] Pirolli, P. (2007). Information Foraging Theory. Oxford<br />

University Press.<br />

[13] Seager, W., Knoche, H., Sasse, M.A., TV-centricity -<br />

Requirements gathering for triple play services. In<br />

Interactive TV: A Shared Experience TICSP Adjunct<br />

Proceedings of EuroITV (2007), 274-278.<br />

[14] Teevan, J. et al. The perfect search engine is not enough: a<br />

study of orienteering behavior in directed search. In Proc.<br />

CHI ’04. (Vienna, Austria, 2004)<br />

[15] Teevan, J. et al. Beyond the Commons: Investigating the<br />

Value of Personalizing Web Search. In Proc. of Workshop on<br />

New Technologies for Personalized Information Access<br />

(PIA). (Edinburgh, UK, 2005).<br />

[16] Zeng, H. J. et al. Learning to Cluster Web Search Results. In<br />

Proceedings of SIGIR ’04, Sheffield, United Kingdom, 2004.<br />

ACM Press, 210-217


Discover Significant Situations for User Interface<br />

Adaptations<br />

Sandro Rodriguez Garzon<br />

Daimler Center for Automotive Information

Technology Innovations<br />

HMI Group<br />

sandro.rodriguez.garzon@dcaiti.com<br />

ABSTRACT<br />

Over the last years, environmental awareness has become an important research topic in the field of adaptive user interfaces. Especially in the area of location-based services, context-aware interfaces started using models of the environment in conjunction with sophisticated user models to filter information relevant to the user. Despite the tight coupling of context-aware computing and user modeling, little research has focused on the correlation between a user preference and the context in which that preference was inferred. Considering a user preference as a certain human-interface interaction that happens regularly within a similar context, this paper introduces a method to detect significant situations of frequent user interactions that occurred within similar environments. As an example, the paper discusses the definition of a personalization use case within the automotive environment: adapting the user interface based on discovered user-initiated radio station changes depending on the user's location.

Author Keywords<br />

context awareness, situation discovery, personalization, adaptation,<br />

intelligent user interface, temporal pattern<br />

ACM Classification Keywords<br />

H.5.2 User Interfaces: Theory and methods; H.3.3 Information<br />

Search and Retrieval: Miscellaneous<br />

INTRODUCTION<br />

Since the Active Badge Location System [7], many researchers have been interested in designing context-aware user interfaces. Considering the increase in complexity and functionality of user interfaces, several researchers identified ways and means to increase usability by displaying prefiltered information or modified user interface controls. While some approaches aimed at detecting similar interactions in order to apply personalized filters, other approaches proposed methods to build adaptation-ready user interfaces.


Kristof Schütt<br />

Daimler Center for Automotive Information

Technology Innovations<br />

HMI Group<br />

kschuett@cs.tu-berlin.de<br />

Context-aware computing focused on gathering the context of an entity to build a machine-readable model of the environment. With the help of these models it was possible to adapt a user interface dynamically, following the "one-fits-all" paradigm: a predefined rule specified how environmental factors influence the user's interaction with the system. In contrast, research in user modeling focused on obtaining an accurate model of the user by applying sophisticated data mining methods. These user models were used to build user-centric adaptive systems that take the user's needs into account. Unfortunately, most user modeling techniques were constrained to detecting application-specific user preferences. But in different contexts a user may prefer to interact with the system in different ways.

Thus, this work on the discovery of significant situations is motivated by the desire to personalize user interfaces depending on their use in similar contexts. Contextual personalization can be seen as the process of bringing together the context and the user preferences. Our intent is not to construct a context-dependent user model but to detect situations that are likely to be followed by predictable user interactions. In order for the method to be applicable in the real world, our approach ensures unsupervised processing of user interactions without the need to prompt the user for explicit feedback.

RELATED WORK<br />

A promising idea concerning location-aware service personalization is presented by Coutand [2]. Coutand uses a case-based reasoning approach to calculate similarities between records of service use, enriched with location-dependent properties, in order to deduce preferred service utilization. An approach of clustering context data in order to determine whether a current context belongs to an already sensed context is presented by Flanagan [6]. By expressing the context in a symbolic form he is able to develop an unsupervised learning algorithm that extracts and groups similar contexts as context states. This idea is very close to our work since our approach groups context as well. The difference lies in the way the environment is sensed. Our approach uses prespecified temporal event patterns to extract interaction traces that are annotated with context features. Those interaction traces are clustered according to their multiple contexts, in contrast to [6], where independent instances of context feature vectors are grouped.

The notion of a temporal event pattern as an appropriate representation of a user preference is also mentioned in [3]. Cram proposes a method to interactively detect recurring user interaction sequences in order to enhance context-aware assisting systems. Unfortunately, Cram's approach is not applicable within the automotive environment because the user has to be involved in the process of discovering regular task signatures.

DEFINITIONS<br />

Following Etzion's definition [5] of an event as an occurrence in the real world and its virtual representation, we introduce the notion of an interaction event.
DEFINITION 1. Interaction Event. An arbitrary user interaction affecting the user interface or its environment, represented by means of a virtual object.

An interaction event will be generated by the user interface<br />

and processed by the frequent interaction discovery component.<br />

The definition of the interaction event incorporates all<br />

events thrown directly or indirectly by the user interface as<br />

well as events triggered by a change of the state of the environment.<br />

The term environment will be used as the superset<br />

of context, which is defined as all elements of the environment

which the user’s computer knows about [1]. Given the<br />

following definitions<br />

DEFINITION 2. Action. The concrete occurrence of an<br />

interaction event sequence.<br />

DEFINITION 3. Situation. A period of time in which certain<br />

conditions are satisfied indicating a probable occurrence<br />

of a known action.<br />

our prototype distinguishes between the user interaction, namely the action, and the state, called situation, in which our prototype assumes it knows what the user will do next. An action is declared frequent if the number of reoccurrences exceeds a prespecified limit. A situation is significant if the predicted action is frequent.

SITUATION DETECTION<br />

The challenge lies in detecting significant situations in an arbitrary stream of interaction events. The probability that a certain action reappears in exactly the same constellation within the same context is very low. Therefore, to detect a reoccurrence of an action, a notion of similarity between actions that takes their context into account is needed. This work distinguishes between the comparison of actions comprising one event and of actions comprising multiple events. The section "Interaction Event Processing" discusses the former case, while the section "Co-Situations" examines the latter.

We decided to split the process of significant situation detection<br />

into three successive subprocesses: Action discovery,<br />

context discovery and situation discovery. In the action discovery<br />

subprocess a prespecified event pattern is searched<br />

within the stream of interaction events. A concrete sequence<br />

of events found by the event engine is declared to be an action.<br />

The context discovery subprocess collects all actions<br />

and groups them by specified context features. The result<br />


of the context discovery subprocess is made up of groups containing actions, where each group is characterized by several action-specific group properties. If the number of members of a group exceeds a limit, all members are declared frequent. This conclusion is valid because all actions of a group are assumed to be similar. The group properties

describe a common environment of all the actions that<br />

are contained within that group. To detect a situation likely<br />

to contain a reoccurrence of a frequent action it is necessary<br />

to search for an event sequence that is parametrized by<br />

the property values of the group the frequent action belongs<br />

to. This process is called situation discovery. If the process<br />

encounters a compatible interaction event sequence the user<br />

interface will be notified of the significant situation.<br />
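Read as a pipeline, the three subprocesses compose as in the following Python sketch (our illustration; the concrete steps are passed in as functions, and possible versions of them are sketched in the subsections below).

    def significant_situation_pipeline(events, discover_actions, group_actions,
                                       make_situation_checker, notify_ui):
        """Compose the three subprocesses; the concrete steps are passed in as
        functions (possible versions are sketched in the subsections below)."""
        actions = list(discover_actions(events))                 # 1. action discovery
        groups = group_actions(actions)                          # 2. context discovery
        checkers = [make_situation_checker(g) for g in groups]   # 3. situation discovery

        def on_context_update(current_station, current_segment_id):
            # Called whenever the context changes; notifies the UI when any
            # parametrized situation pattern matches the current context.
            if any(check(current_station, current_segment_id) for check in checkers):
                notify_ui()
        return on_context_update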

INTERACTION EVENT PROCESSING<br />

The process of significant situation detection is supported by use-case-specific data mining instructions. These instructions are specified beforehand by an expert and used at runtime to assist the data mining process in extracting significant situation information for specific personalization use cases. In the following discussion we use the term use case specification to subsume all instructions belonging to a certain use case. The explanation of the necessary specification steps and of the event processing itself is accompanied by an automotive example: the detection of user-initiated radio station changes depending on the user's location. The intention of the personalization is to provide the user with a system-generated proposal to change the radio station automatically at a certain location, triggered by the detection of a significant situation.

Context<br />

During runtime, every interaction event occurs within a certain environment. In our approach, the environment is represented by a fixed set of attributes and their situation-dependent values. Since the environment comprises an almost infinite number of environmental factors, it is necessary to define a subspace of environmental factors that are relevant to the specific use cases. Hence, the use case specification must contain a context definition as an enumeration of context features that will be attached to every interaction event. In the case of the radio example, two context features were identified as use-case-specific environmental factors: the name of the radio station and the unique identification number (id) of the current road segment.

Action Discovery<br />

The use case specification must also contain a prespecified<br />

interaction event pattern describing the action that should be<br />

investigated in detail. The event pattern is constructed by<br />

combining logical and temporal operators to form complex event sequence descriptions with event-specific filter criteria. Since the prototypical implementation uses Esper [4] as the underlying event processing engine, most of the available operators have their counterpart in Esper's event processing language (EPL). Considering the radio example, it is necessary to specify an accurate but generic event pattern representing the user's action of changing the radio station: the radio station change should only be taken into account if the user-initiated radio station change is not followed by any further radio station change within the next 10 minutes. During initialization the pattern is passed to the complex event processing engine, which starts looking for compatible sequences. If the engine encounters a fitting sequence of interaction events, it relays the concrete event sequence, namely the action, to the subprocess of context discovery.
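A minimal Python sketch of this action-discovery rule (our illustration, not the Esper EPL statement used by the prototype), assuming a simple event record carrying a kind, a timestamp and the attached context features:

    from dataclasses import dataclass, field

    @dataclass
    class Event:
        kind: str                   # e.g. "RadioChange" or "Location"
        timestamp: float            # seconds
        context: dict = field(default_factory=dict)   # attached context features

    QUIET_PERIOD = 600              # 10 minutes without a further station change

    def discover_actions(events):
        """Yield radio-change events that are not followed by another radio
        change within QUIET_PERIOD seconds (the action pattern of the example)."""
        changes = [e for e in events if e.kind == "RadioChange"]
        for i, ev in enumerate(changes):
            nxt = changes[i + 1] if i + 1 < len(changes) else None
            if nxt is None or nxt.timestamp - ev.timestamp > QUIET_PERIOD:
                yield ev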

Context Discovery<br />

The main task of the context discovery subprocess is to group<br />

all incoming actions based on prespecified criteria. As stated<br />

above, an action can be composed of an arbitrary number<br />

of temporally ordered events. Therefore, it is necessary to<br />

define a selection of specific events and corresponding context<br />

features of an action that are considered during a comparison<br />

between two actions. In other words, the similarity<br />

measure between two actions is calculated by comparing a fixed set of prespecified context features. The result

of the grouping process is a set of groups of similar actions.<br />

The actions will be similar regarding the environment<br />

they occurred in. In turn, the characteristics of each group<br />

can be interpreted as a description of the common environment.<br />

Considering the radio example, we are only interested<br />

in grouping actions by the radio station and the unique road<br />

segment id. Thus, each group subsumes a certain case of<br />

user behavior within a certain environment. In this sense, a<br />

group is characterized by two properties: name of radio station<br />

and road segment ids. The former property will contain<br />

the name of the radio station while the latter property will<br />

contain a subnetwork of the road network represented by a<br />

set of unique road segment ids. Actions are compared by radio station name using a string comparison and by road segment id using a network distance comparison. Radio station changes with the same radio station name but occurring in neighboring road segments are assigned to the same group. A group will only be considered in the next subprocess if it contains a sufficient number of actions. Such a group is called a significant group.
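The grouping step could look like the following Python sketch (our illustration, building on the Event record sketched above). The context keys, the group-size threshold and the numeric stand-in for the road-network distance are assumptions, not values from the paper:

    MIN_GROUP_SIZE = 5    # assumed threshold; the paper only speaks of a prespecified limit
    NEIGHBOR_RADIUS = 1   # assumed cutoff standing in for a real network distance

    def segment_distance(seg_a, seg_b):
        # Stand-in for a road-network distance between two segment ids; integer
        # ids and a numeric difference serve as a rough proxy for adjacency here.
        return abs(seg_a - seg_b)

    def group_actions(actions):
        """Group radio-change actions by station name and neighboring road segments."""
        groups = []   # each group: {"station": str, "segments": set, "actions": list}
        for act in actions:
            station = act.context["station"]
            segment = act.context["road_segment_id"]
            for g in groups:
                if g["station"] == station and any(
                        segment_distance(segment, s) <= NEIGHBOR_RADIUS
                        for s in g["segments"]):
                    g["segments"].add(segment)
                    g["actions"].append(act)
                    break
            else:
                groups.append({"station": station,
                               "segments": {segment},
                               "actions": [act]})
        # only groups with enough reoccurrences of the same behavior are significant
        return [g for g in groups if len(g["actions"]) >= MIN_GROUP_SIZE]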

Situation Discovery<br />

Finally, the situation discovery subprocess uses the characteristics<br />

of the significant groups found by the previous subprocess to parametrize a generic interaction event pattern, namely the situation pattern. The situation pattern describes a moment in which a group-specific action is expected to

reappear. The way the expert specifies the generic situation<br />

pattern is similar to the way the action itself was specified before. The difference lies in the fact that the context

features of the events within the event pattern will be constrained<br />

by the discovered group characteristics. During runtime,<br />

several significant groups will be identified by the context<br />

discovery subprocess. For each group a new instance of<br />

the situation pattern will be generated with different context<br />

feature constraints. This group-specific parametrization allows

the event engine to use each generated situation pattern<br />

instance to discover actions that occur within the common<br />

environment of the corresponding group. As a consequence,<br />

the event processing engine is able to find significant situations<br />

as a result of detecting parametrized event sequences.<br />

Considering the radio example, it is necessary to define a situation pattern that clearly describes an event sequence that is

likely to be followed by a known radio station change event.<br />

To describe such a situation we include two conditions into<br />

the generic temporal event pattern: 1. The last radio station<br />

change resulted in a switch to a radio station different from the one found in the group property "name of radio station"

and 2. the current road segment id - originating from a<br />

location event - is part of the subnetwork found in the group<br />

property ”road segment ids”. Each significant group of radio<br />

station changes will start a process of detecting a significant<br />

situation in which the road segment is similar and the current<br />

radio station is different. Triggered by the notification of a<br />

significant situation, the prototype is able to propose a radio<br />

station change.<br />
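Instantiating one parametrized check per significant group can then be sketched as follows (our illustration, reusing the group structure from the previous sketch):

    def make_situation_checker(group):
        """Return a predicate for the significant situation derived from one
        significant group: the currently tuned station differs from the group's
        station while the current road segment lies in the group's subnetwork."""
        def is_significant(current_station, current_segment_id):
            return (current_station != group["station"]
                    and current_segment_id in group["segments"])
        return is_significant

    # One checker is created per significant group; on a match the user interface
    # is notified and may propose switching to group["station"].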

CO-SITUATIONS<br />

So far, the context of only one event - the radio station change - was used to group actions. But what happens if actions are composed of multiple events and the comparison should consider two different contexts of two events? In this case the prototype would first split the grouping into separate grouping procedures for each event. A powerful application would be to detect significant situations based on several temporally ordered events, each parametrized by the properties of a different significant group. Such a new type of situation would enable the specification of a causal sequence of arbitrary situations that forms a new significant situation.

DEFINITION 4. Co-Situation. A period of time in which certain temporally ordered conditions describing multiple situations are satisfied, indicating a probable occurrence of a known action.

Let's return to the radio example and consider its application in the real world while entering and leaving a tunnel. Assume two significant groups of radio station changes were found, one at each end of the tunnel. Both groups were detected because one radio station is received poorly on one side of the tunnel and the other station poorly on the other side. Without taking into account the direction in which the car is moving, the prototype may propose a wrong radio station change while entering the tunnel. This misleading personalization happens because the location of a tunnel exit may match the location of a tunnel entrance; in this case it does not matter from which direction the driver approaches a significant situation.

In order to consider the moving direction, the event pattern of the action discovery subprocess must be extended by two location events preceding the first event that changes the radio station. The modified event pattern considers multiple temporally ordered events that may happen in different contexts. To discover co-situations, the actions are no longer grouped by the context of a single event but by the contexts of the two location events. In this sense, each detected group is finally characterized by the unique radio station and a pair of contexts: one context describing a region before entering the tunnel and another context describing a region at the exit of the tunnel.


Figure 1. Extended radio example: Relation between the environment and the order of event occurrence. (Two panels, ActionDiscovery and SituationDiscovery, show the car passing the tunnel and the prototype's reaction to the event sequence Location, Location, RadioChange: during action discovery, the prototype waits 600 sec after the radio change before reporting the encountered action; during situation discovery, once the two known contexts are found in order and the current station differs from the known one, the UI is notified.)

Given all the required properties of a regular user interaction, it is necessary to extend the situation pattern as well, so that both contexts are detected with respect to the causal order of their occurrence. The situation event pattern therefore describes an event pattern that looks for location events within the context of the tunnel entry followed by location events within the context of the tunnel exit. A significant co-situation is encountered if the car first passes the context in front of the tunnel and then the context at the exit of the tunnel while the radio is switched to a different radio station. Figure 1 visualizes the co-situation. Grouping by driving direction is not necessary, as the causal order of the grouped contexts suffices to trigger the intended radio station change.

IMPLEMENTATION<br />

As a sample user interface we implemented a prototypical in-car infotainment system based on ActionScript in combination with a context simulator. The prototype itself is implemented in Java with Esper [4] as its complex event processing engine. In order for the prototype to be independent of a particular user interface, we decided to use XML as the underlying representation language for events. To test the prototype under realistic conditions we used context recordings of several tracks to simulate the environment.
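As an illustration of how such a situation pattern could be expressed with Esper's event processing language, the following Java sketch registers one pattern instance for a single hypothetical significant group (station "SWR3", road segments 12-14). The event classes, the EPL statement and the station/segment values are assumptions made for illustration and use the classic Esper client API; the prototype's actual patterns and XML event representation are not shown here.

import com.espertech.esper.client.*;

// Minimal sketch (not the prototype's actual code): one situation-pattern instance
// parametrized by an assumed significant group.
public class SituationPatternSketch {

    // Assumed simple event beans; the real prototype exchanges XML-based events.
    public static class RadioChangeEvent {
        private final String station;
        public RadioChangeEvent(String station) { this.station = station; }
        public String getStation() { return station; }
    }
    public static class LocationEvent {
        private final int segmentId;
        public LocationEvent(int segmentId) { this.segmentId = segmentId; }
        public int getSegmentId() { return segmentId; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("RadioChangeEvent", RadioChangeEvent.class);
        config.addEventType("LocationEvent", LocationEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Condition 1: the last change switched to a station other than the group's station.
        // Condition 2: a location event lies inside the group's road-segment subnetwork.
        String epl = "select loc.segmentId as segment from pattern ["
                + " every rc=RadioChangeEvent(station != 'SWR3')"
                + " -> loc=LocationEvent(segmentId in (12, 13, 14)) ]";

        EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
        stmt.addListener((newEvents, oldEvents) ->
                System.out.println("Significant situation: propose switching to SWR3"));

        // Feeding events would then trigger the listener, e.g.:
        engine.getEPRuntime().sendEvent(new RadioChangeEvent("Other FM"));
        engine.getEPRuntime().sendEvent(new LocationEvent(13));
    }
}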

CONCLUSION AND FUTURE WORK<br />

We have presented a prototype that is able to discover significant situations based on the common environment of frequent user interactions. Supported by use case descriptions specified by an expert, the prototype detects similarities between user interactions and infers the corresponding environments needed to detect situations in which a user interaction is likely to reappear. Although we did not focus on time or space efficiency, we have to acknowledge that space consumption in particular is a critical factor for deployment within embedded systems. Since in the current implementation all detected actions need to be stored along with their context, it is necessary to subsume actions and to constrain the validity of an action. A first step towards a space-optimized solution was taken by limiting the influence of actions: if an action is too old it will be discarded. Furthermore, we limited the number of actions per situation independently of the time the action was detected.


During our work we identified some important refinements and extensions for future work. In particular, we will provide the expert with the ability to specify a minimal probability of occurrence for a significant situation as an additional trigger condition. Up to now the prototype reports a significant situation whenever a certain number of similar actions have happened within a certain context. This notion can be extended by reporting only when the probability of occurrence exceeds a user-defined limit. In order to calculate the current likelihood of an action we also have to observe situations in which the action has not been executed. Since it is nearly impossible to accumulate all non-occurrences of a use case, we will constrain the observation to situations that are already associated with a concrete action of the use case. This is possible because the boundary of the situation is naturally given by the situation pattern.

REFERENCES<br />

1. P. J. Brown. The stick-e document: a framework for<br />

creating context-aware applications. In EP, pages<br />

259–272, 1996.<br />

2. O. Coutand, S. Haseloff, S. L. Lau, and K. David. A<br />

Case-based Reasoning Approach for Personalizing<br />

Location-aware Services. In Workshop on Case-based<br />

Reasoning and Context Awareness, 2006.<br />

3. D. Cram, B. Fuchs, Y. Prié, and A. Mille. An approach<br />

to User-Centric Context-Aware Assistance based on<br />

Interaction Traces. In Int. Workshop Modeling and<br />

Reasoning in Context, pages 89–101, 2008.<br />

4. EsperTech Inc. Complex Event Processing.<br />

http://esper.codehaus.org/, Last access:<br />

30-12-2010.<br />

5. O. Etzion and P. Niblett. Event Processing in Action.<br />

Manning Publications Co., Greenwich, USA, 2010.<br />

6. J. A. Flanagan. Unsupervised clustering of context data<br />

and learning user requirements for a mobile device. In<br />

Int. and Interdisciplinary Conf. on Modeling and Using<br />

Context, pages 155–168, 2005.<br />

7. R. Want, A. Hopper, V. Falcão, and J. Gibbons. The Active Badge Location System. ACM Transactions on Information Systems, pages 91–102, 1992.


A new interaction technique based on eye tracking and<br />

single switch scanning systems<br />

Pradipta Biswas
Engineering Design Centre, Department of Engineering, University of Cambridge, UK
E-mail: pb400@cam.ac.uk

Pat Langdon
Engineering Design Centre, Department of Engineering, University of Cambridge, UK
E-mail: pml24@eng.cam.ac.uk

ABSTRACT<br />

In this paper we present a new input interaction system for people with severe disabilities. The system is based on eye gaze tracking and single switch scanning interaction techniques. It combines eye gaze tracking and scanning in a way that is faster than scanning-only systems and more comfortable to use than systems based only on eye gaze tracking, which is also supported by a user study. We also point out a few applications of the system beyond computer accessibility.

Categories and Subject Descriptors<br />

D.2.2 [Software Engineering]: Design Tools and Techniques<br />

– user interfaces; K.4.2 [Computers and Society]:<br />

Social Issues – assistive technologies for persons<br />

with disabilities<br />

General Terms<br />

Algorithms, Experimentation, Human Factors<br />

Keywords<br />

Assistive Technology, Eye gaze tracker, Scanning, Usability<br />

Evaluation.<br />

1. INTRODUCTION<br />

Many physically challenged users cannot interact with a computer through a conventional keyboard and mouse. For example, spasticity, Amyotrophic Lateral Sclerosis (ALS), and Cerebral Palsy confine movement to a very small part of the body. Two possible solutions for these users are eye gaze tracking based input systems and scanning systems. An eye gaze tracking based system removes the need for a mouse and keyboard and enables the user to control the mouse pointer using only eye gaze; such users can also use a virtual keyboard as an alternative to a normal keyboard.

Scanning is the technique of successively highlighting items on a computer screen and pressing a switch when the desired item is highlighted. Research on eye gaze tracking and scanning systems for assistive technology has mainly been explored in the field of alternative and augmentative communication (AAC) devices [7,8,11,13]. A plethora of commercial and research products are available which help people with disabilities to communicate using eye gaze tracking or scanning interfaces [11].

However, navigation to arbitrary locations on a screen has also become important as graphical user interfaces are more widely used. A review of existing scanning systems for screen navigation can be found in a separate paper [3]. The main disadvantage of these systems is that they are slow to operate. Many eye tracking based interfaces for people with disabilities use the eye gaze as a binary input, like a switch press through a blink [6, 13], but the resulting systems remain as slow as scanning systems.

Zhai [14] presents a detailed list of advantages and disadvantages of using eye gaze based pointing devices. In short, using the eye gaze to control the cursor position poses several challenges:

Strain: It is quite strenuous to control the cursor through eye gaze for a long time, as the eye muscles soon become fatigued. Fejtova and colleagues [9] reported eye strain in six out of ten able-bodied participants in their study.

Accuracy: The eye gaze tracker does not always work accurately; even the best eye trackers provide an accuracy of about 0.5° of visual angle, which often makes clicking on small targets difficult. Donegan and colleagues [5] also reported problems with the precision and speed of an eye gaze based system. Existing eye gaze based AAC systems therefore often change the screen layout and enlarge screen items, but this is not a scalable solution.

Clicking: Clicking on or selecting a target using only eye gaze is also a problem. It is generally performed through increased dwell time or blinking, but either solution increases the chance of false positives or missed clicks.

We tried to solve these problems by combining eye gaze tracking and a scanning system in a unique way. Any pointing movement has two phases [10]: an initial ballistic phase, which brings the pointer near the target, and a homing phase, consisting of one or more precise sub-movements to home in on the target. We use eye gaze tracking for the initial ballistic phase and switch to the scanning system for the homing phase and clicking. The approach is similar to the MAGIC system [14], though it replaces the regular pointing device with the scanning system. Our system works in the following way.

2. The proposed system<br />

Initially, the system moves the pointer across the screen based on the eye gaze of the user. The user sees a small button moving across the screen, placed approximately where they are looking. We extract the eye gaze position using the Tobii SDK [12] and use an averaging filter that updates the pointer position every 500 msec. The user can switch to the scanning system by pressing a key at any time during eye tracking. When they look at the target, the button (or pointer) appears near or on the target; at this point the user presses a key to switch to the scanning system for homing and clicking on the target.
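The following minimal Java sketch illustrates the kind of averaging filter described above; it is not the actual Tobii SDK integration, and the class and method names are assumptions. Raw gaze samples are accumulated and the pointer is moved to their mean position every 500 msec.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the 500 ms averaging of raw gaze samples.
class GazeAverageFilter {
    private static final long UPDATE_INTERVAL_MS = 500;
    private final Deque<int[]> samples = new ArrayDeque<>();
    private long lastUpdate = System.currentTimeMillis();

    /** Called for every raw gaze sample; returns a new pointer position every 500 ms, else null. */
    int[] onGazeSample(int x, int y) {
        samples.add(new int[] { x, y });
        long now = System.currentTimeMillis();
        if (now - lastUpdate < UPDATE_INTERVAL_MS) {
            return null;                      // keep the pointer where it is
        }
        long sumX = 0, sumY = 0;
        for (int[] s : samples) { sumX += s[0]; sumY += s[1]; }
        int[] average = { (int) (sumX / samples.size()), (int) (sumY / samples.size()) };
        samples.clear();
        lastUpdate = now;
        return average;                       // move the on-screen button here
    }
}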

We have used a particular type of scanning system, known as eight-directional scanning [3], to navigate across the screen. In the eight-directional scanning technique the pointer icon changes at regular time intervals to show one of eight directions (Up, Up-Left, Left, Left-Down, Down, Down-Right, Right, Right-Up). The user chooses a direction by pressing the switch when the pointer icon shows the required direction. Once the direction has been chosen, the pointer starts moving. When the pointer reaches the desired point on the screen, the user makes another key press to stop the pointer movement and make a click. A state chart diagram of the scanning system is shown in Figure 1, which is the same for user and device spaces in this case. A demonstration of the scanning system can be seen at http://www.youtube.com/watch?v=0eSyyXeBoXQ&feature=user.

The user can move back to the eye gaze tracking system from the scanning system by selecting the exit button in the scanning interface (Figure 2). A couple of videos of the system can be found at the following links.
Screenshot: http://www.youtube.com/watch?v=UnYVO1Ag17U
Actual usage: http://www.youtube.com/watch?v=2izAZNvj9L0
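The following sketch outlines the single-switch logic of the eight-directional scanning described above. It is an illustrative reconstruction under assumed names, not the authors' implementation: one switch press fixes the currently highlighted direction, and a second press stops the movement and issues the click.

// Sketch of the eight-directional scanning state machine (assumed names).
class EightDirectionalScanner {
    enum Direction { UP, UP_LEFT, LEFT, DOWN_LEFT, DOWN, DOWN_RIGHT, RIGHT, UP_RIGHT }
    enum State { CHOOSING_DIRECTION, MOVING }

    private State state = State.CHOOSING_DIRECTION;
    private int directionIndex = 0;            // which direction the pointer icon currently shows
    private Direction chosen;

    /** Called once per scan interval (e.g. every 1 s) while a direction is being chosen. */
    void onScanTick() {
        if (state == State.CHOOSING_DIRECTION) {
            directionIndex = (directionIndex + 1) % Direction.values().length;
        }
    }

    /** Called on every single-switch press. Returns true when a click should be issued. */
    boolean onSwitchPress() {
        if (state == State.CHOOSING_DIRECTION) {
            chosen = Direction.values()[directionIndex];
            state = State.MOVING;              // pointer now moves along 'chosen'
            return false;
        }
        state = State.CHOOSING_DIRECTION;      // stop the movement and click here
        return true;
    }

    Direction currentDirection() { return chosen; }
}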

The technique is faster than a scanning-only interface, as users can move the pointer across large distances on the screen more quickly with their eye gaze than with a single switch scanning interface alone. It is also less strenuous than eye-gaze-only interfaces, because users can switch back and forth between eye gaze tracking and scanning, which gives the eye muscles a rest. Additionally, since users do not need to home in on a target using eye gaze, they are relieved from staring at a target for a long time to home in and click on it. Finally, this technique does not depend on the accuracy of the eye tracker, as eye tracking is only used to bring the cursor near the target (as opposed to on the target), so it can be used with low cost and low accuracy webcam based eye trackers.

Figure 1. State Transition Diagram of the eight-directional scanning mechanism with a single switch

Figure 2. Screenshot of the scanning interface<br />

The only disadvantage of the technique is that it may seem slower than an eye-gaze-only interface, as users need to switch to the slower scanning technique for each pointing task. We therefore conducted the following user study to compare the speed of our system with that of eye-gaze-only pointing.

3. The study<br />

3.1. Procedure<br />

We conducted the ISO 9241 pointing task with three target widths (20, 30 and 40 pixels) and three target amplitudes (180, 240 and 300 pixels). Each participant undertook the task in two conditions: using only eye gaze for pointing, or using both eye gaze and eight-directional scanning for pointing. None of the users had used this system before and they were trained adequately before undertaking the trials. The training data was not used in the analysis.

3.2. Material<br />

We used a desktop with 12.5’ monitor having 1280 Х 800<br />

pixels running Windows 7 operating system. We used a<br />

Tobii X120 eye tracker [12] with the Tobii SDK and an<br />

averaging filter to detect points of eye gaze fixation. Figure<br />

3 shows a snapshot of the experimental set up. None<br />

of the participants have any problem with the set up.<br />

Figure 3. Experimental set up<br />

3.3. Participants<br />

We collected data from 8 able-bodied participants (7 male and 1 female) with an average age of 27. We expect the results would not be substantially different for users with disabilities because:
o We assume that disabled users who can use an eye gaze based system will have eye muscles as strong as those of able-bodied users.
o Our previous study [4] did not find any statistically significant difference between able-bodied and disabled users for the scanning interface.

3.4. Results<br />

The mean movement time was higher in the eye tracking plus scanning system, while the variance was higher in the eye-tracking-only system (Figure 4). However, the difference was not significant in an unequal variance t-test (p > 0.05). We compared the average movement time for each input modality with respect to individual participants, target width and amplitude (Figures 5, 6, and 7). It can be seen from Figure 5 that only 2 out of 8 participants (P2 and P4) took significantly more time with the eye tracking plus scanning system than with the eye-tracking-only system. There were at least three occasions on which participants failed to point at 20 pixel targets using the eye-tracking-only system. We found that the eye-tracking-only system produced significantly less (p < 0.05) movement time for the 240 pixel target amplitude, while the differences in movement time for the other combinations of target width and amplitude were not significant in an unequal variance t-test. The eye tracking plus scanning system tends to produce less movement time for the 300 pixel target amplitude (Figure 7), as the eye tracker apparently lost some accuracy in the periphery of the screen. Finally, all participants felt the eye tracking plus scanning system was more comfortable than the eye-tracking-only system because their eye muscles could rest while using the scanning system.

Figure 4. Comparing movement time (ET vs. ETSCAN).
Figure 5. Comparing movement time w.r.t. participants (average movement time in sec for participants 1-8; Only ET vs. ET & Scanning).
Figure 6. Comparing movement time w.r.t. target width (average movement time in sec for 20, 30 and 40 pixel wide targets; Only ET vs. ET & Scanning).
Figure 7. Comparing movement time w.r.t. amplitude (average movement time in sec for 180, 240 and 300 pixel target distances; Only ET vs. ET & Scanning).


3.5. Discussion<br />

The results show that using the scanning system together with the eye tracking system did not significantly change pointing time compared to the eye-gaze-only system. The high variance of the eye-tracking-only system also indicates that in some cases users took a very long time to point, which would surely frustrate them. It should be noted that we used the Tobii tracker [12] for this study, which is currently among the most accurate trackers on the market. With a low cost, low accuracy eye tracker (such as a webcam based one) the eye-tracking-only system would be harder to use, while the eye tracking plus scanning system would not suffer much, as the technique does not require high accuracy from the eye tracker. We used an averaging filter to extract points of eye gaze fixation; a better filtering algorithm [1] would increase the accuracy of both systems equally. We used a scan delay of 1 sec for the scanning system and a dwell time of 500 msec for the eye gaze tracking system in this study, both of which could be reduced to produce lower movement times for expert users. Additionally, this new technique is faster than a scanning-only system while giving more comfort and accuracy than an eye-gaze-only pointing system. Our system is less proactive than MAGIC pointing [14], as the user can manually switch eye gaze tracking or scanning on and off whenever they want with a single switch press. It seems more user friendly than Bates' system [2], as operating a push button switch is easier than operating a Polhemus InsideTrack device by elevating the shoulder. Our system can also address the challenges faced by Fejtova [9] in developing an eye gaze tracking based wheelchair, as the user can switch off eye tracking temporarily and clicking is done through the scanning system, which reduces the possibility of accidental or missed clicks. Currently we are working on integrating the system with a webcam based eye tracker to develop a low cost interaction device.

This technique can also have applications other than computer accessibility software. It can be used to provide hands-free access to a screen with multiple displays (or control screens), where the eye tracking system locates a particular portion of the screen or a control display and the scanning technique is used to operate inside that display. It would also be useful for overcoming situational impairment in interaction, such as using an electronic display in a moving vehicle, where it is difficult to use a pointing device or touch screen. The eye tracking and scanning techniques both require minimal input from the user, so the user need not disengage from their main task (such as driving the car) to interact with another device.

4. Conclusions<br />

In this paper we have introduced a new input device involving an eye gaze tracker and a scanning interface for people with severe disabilities. The system solves several problems of existing eye gaze tracking based systems by offering users more accuracy and comfort, which is also supported by a user study.


Acknowledgement<br />

We are grateful to our participants for taking part in our<br />

study. We would also like to thank Prof. Peter Robinson<br />

of University of Cambridge Computer Laboratory for his<br />

help in organizing the study.<br />

References<br />

1. Adjouadi, M. et al. Remote Eye Gaze Tracking System as a Computer Interface for Persons with Severe Motor Disability. ICCHP 2004, LNCS 3118, 2004, 761-769.
2. Bates, R. Multimodal Eye-Based Interaction for Zoomed Target Selection on a Standard Graphical User Interface. INTERACT 1999.
3. Biswas, P. and Robinson, P. A New Screen Scanning System based on Clustering Screen Objects. Journal of Assistive Technologies, Vol. 2, Issue 3, September 2008, pp. 24-31, ISSN: 1754-9450.
4. Biswas, P. and Robinson, P. The effects of hand strength on pointing performance. Designing Inclusive Interactions, Springer-Verlag, pp. 3-12, ISBN: 978-1-84996-165-3.
5. Donegan, M. et al. Understanding users and their needs. Universal Access in the Information Society 8 (2009): 259-275.
6. Duchowski, A. T. Eye Tracking Methodology. Springer-Verlag, 2007.
7. Eye Pointing, URL: http://abilitynet.wetpaint.com/page/Eye+Pointing, Accessed on 19th August 2010.
8. EyeTech Digital System, URL: http://www.eyetechds.com/assistivetech/index.htm, Accessed on 19th August 2010.
9. Fejtova, M. et al. Hands-free interaction with a computer and other technologies. Universal Access in the Information Society 8 (2009): 277-295.
10. Fitts, P.M. The Information Capacity of The Human Motor System In Controlling The Amplitude of Movement. Journal of Experimental Psychology 47 (1954): 381-391.
11. Majaranta, P. and Raiha, K. Twenty Years of Eye Typing: Systems and Design Issues. Eye Tracking Research & Application 2002, 15-22.
12. Tobii Eye Tracker, URL: http://www.imotionsglobal.com/Tobii+X120+Eye-Tracker.344.aspx, Accessed on 12th December 2008.
13. Ward, D. Dasher with an eye-tracker, URL: http://www.inference.phy.cam.ac.uk/djw30/dasher/eye.html, Accessed on 19th August 2010.
14. Zhai, S., Morimoto, C. and Ihde, S. Manual and Gaze Input Cascaded (MAGIC) Pointing. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI) 1999.


Gesture Recognition Exploration using Haartraining and<br />

KNN in a 3D Racing Game<br />

Kamlesh Mistry
School of Computing, Teesside University, UK
mistry.kamlesh@gmail.com

Li Zhang
School of Computing, Teesside University, UK
l.zhang@tees.ac.uk

ABSTRACT<br />

Automatic recognition of body language is challenging but inspiring as a natural control channel for intelligent user interfaces. In this paper we report automatic car navigation via hand gesture recognition in a 3D racing game application. We have employed Haartraining and k-nearest neighbor (KNN) algorithms to recognize hand gestures with the assistance of image processing. Our study has explored vision-based gesture tracking and dynamic gesture recognition in a real-time navigation game application. The gesture recognition system has been embedded in a 3D virtual world built with the assistance of a games engine, Irrlicht. Sound effects have also been employed in our application. We have also conducted user testing with 5 subjects to evaluate the efficiency of KNN-based gesture recognition. Evaluation results for the Haartraining-based recognition are also provided. Overall the gesture recognition performance is very promising. Our work contributes to the workshop themes on natural user interfaces in novel, intelligent interaction systems, navigation systems and assistive functionalities.

Author Keywords<br />

Gesture recognition, Haartraining, and K-nearest neighbor<br />

ACM Classification Keywords<br />

H.5.2 [User interfaces]<br />

INTRODUCTION<br />

Multimodal interaction based on the recognition and interpretation of body language and verbal input is challenging but inspiring for building efficient intelligent user interfaces. Advanced educational or entertainment applications residing in 3D virtual environments also call for such a natural communication channel to enhance the user experience. In pursuit of this research goal, we have developed a robot car with automatic navigation under the control of continuous hand gestures. In our previous work, we also produced a neural network driven automatic navigation component that enables a robot car to learn road and track conditions and handle tough turning situations successfully. Overall, we believe our developments have the potential to benefit innovative user interface development for navigation and assistive functionalities for driving in real life situations.

RELATED WORK<br />

There have been various inspiring research studies in the gesture recognition field. Billon et al. [1] reported a gesture recognition system to facilitate the communication between a virtual actor and a real human actor in a martial arts virtual game setting. Principal Component Analysis was used to generate the artificial gesture representation, which was then used for real-time gesture segmentation and recognition. Elmezain et al. [2] presented a hidden Markov model (HMM) based continuous gesture recognition system for the recognition of the Arabic numbers 0-9. Tomibayashi et al. [4] produced a wearable DJ system that enables DJs to perform freely using wearable computing and gesture recognition technologies; wearable acceleration sensors were used in their study to assist gesture recognition, and their system has been tested in real stage performances. Nam and Wohn [3] presented another HMM based space-time hand gesture recognition system, in which the HMM models the spatial variance and the time-scale variance in the hand movement to assist the recognition of continuous, connected hand movement patterns. In our work, we recognize continuous connected hand movements and gestures using two different approaches, Haartraining and KNN. Three key gestures are recognized by KNN and five key gestures are identified by Haartraining. The recognized gestures are used for real-time automobile navigation in a 3D racing game for entertainment purposes. We also provide evaluation results to demonstrate the efficiency of our approaches.

GESTURE RECOGNITION USING KNN<br />

K-nearest neighbor has been widely used for pattern recognition. We employ it in our application to recognize key gestures in real time using a webcam. Our recognition process is carried out in three steps: image pre-processing, vector generation, and final classification. At the training stage, raw images with hand gestures are collected from the webcam. First of all, these collected images are cropped. An example original image and the corresponding cropped image are shown in Figure 1, in which white pixels represent the object of interest while black pixels indicate the background. Compared with the original image, the cropped image, which is used for the training of KNN, only has a slightly different width and height.

These cropped images are then converted into binary files in order to feed them to KNN. Vector generation is used to convert the pre-processed images into training binary files with the appropriate format. Each KNN class represents a particular gesture, and all the images representing one particular gesture are stored under that KNN class. We have used the .pbm format to store all the image files for training, since this format provides ASCII characters in decimals for the width and height of each image. The names of the training files follow the pattern 'CNN.pbm', where C is the KNN class number and NN is the number of the image file stored in that class. We have used altogether 300 images representing 3 different hand gestures (100 images for each gesture) for the training of KNN. The three gestures recognized by KNN are shown in Figure 2. Thus a scalar matrix has been produced to represent all the training data.

Figure 1. An example original and its cropped image after preprocessing<br />

(from left to right).<br />

Figure 2. Three key gestures recognized by KNN, including a<br />

palm gesture (for stopping), a fist gesture (for acceleration)<br />

and a pistol-like gesture (for turning).<br />

At the testing stage, raw images collected from the webcam also need to be pre-processed before feeding them to KNN. Since the images captured from the webcam are colored 32-bit images and our training binary images are only 8-bit, we have used a skin detection algorithm to convert the captured 32-bit testing images into 8-bit ones. The following procedure is used to detect skin color. First of all, we need to access the RGB values of each pixel using the following formulas.

p = y * image->widthStep + x * image->nChannels   // byte offset of pixel (x, y) in the packed multi-channel image
blue = imageData[p]; green = imageData[p + 1]; red = imageData[p + 2]

where p is the byte offset of pixel (x, y), imageData is the array storing all the pixels of the image being processed, and x and y are the pixel coordinates. Thus imageData[p], imageData[p+1] and imageData[p+2] are the blue, green and red color values of that pixel.

Figure 3. An example image before and after skin detection<br />

processing.<br />

The following criteria are then used to detect skin color: red > 95, green > 40 and blue > 20, with max(R, G, B) - min(R, G, B) > 15 for the pixel. If a pixel fulfills these premises, we re-assign it to a white pixel with RGB value (255, 255, 255); otherwise, we re-assign it to a black pixel with RGB value (0, 0, 0). Figure 3 shows an example image before and after skin detection processing.
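A minimal sketch of this thresholding step is given below, assuming the frame is available as a packed BGR byte array with widthStep bytes per row (as in an 8-bit, 3-channel image); the class name and the surrounding I/O are assumptions made for illustration.

// Sketch of the skin-detection thresholding described above (assumed names).
class SkinDetector {
    /** Turns skin-colored pixels white and everything else black, in place. */
    static void binarize(byte[] imageData, int width, int height, int widthStep) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int p = y * widthStep + 3 * x;          // byte offset of pixel (x, y)
                int blue  = imageData[p]     & 0xFF;
                int green = imageData[p + 1] & 0xFF;
                int red   = imageData[p + 2] & 0xFF;
                int max = Math.max(red, Math.max(green, blue));
                int min = Math.min(red, Math.min(green, blue));
                boolean skin = red > 95 && green > 40 && blue > 20 && (max - min) > 15;
                byte value = (byte) (skin ? 255 : 0);   // white for skin, black otherwise
                imageData[p] = value;
                imageData[p + 1] = value;
                imageData[p + 2] = value;
            }
        }
    }
}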

In our application, the KNN algorithm is used for the recognition of the testing gestures. KNN has been used widely in pattern recognition and machine learning. It classifies a test query based on a majority vote of its neighbors, with the test query labeled as the class most common amongst its k nearest neighbors. We have used a weighted KNN in order to avoid the domination of classes with more frequent examples, as happens in basic 'majority voting' classification. Therefore, in our application, the KNN classification algorithm weights the contribution of each of the k neighbors according to their distance to the query point Xq, assigning greater weight Wi to the nearest neighbors. The following equation is used in our application:

F(X_q) = \arg\max_{v} \sum_{i=1}^{k} W_i \, \delta(v, f(X_i))

where Xq is a testing image containing a test gesture, v is the vector of the training set, Xi represents each KNN class, and δ(v, f(Xi)) represents the distance between the test query and each KNN class. Using this KNN implementation, we have successfully classified three different continuous gestures with promising accuracy rates in real-time applications (see the evaluation section for details). We also noticed that KNN's performance could be influenced by the backgrounds shown in the images. In order to avoid this problem, we have also used another approach, Haartraining, to perform gesture tracking and assist the recognition of gesture movement, providing another effective control channel for automatic car navigation without any side effects caused by the image backgrounds.
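The weighted voting described above can be sketched as follows; the data structures and the concrete inverse-square-distance weight are illustrative assumptions, not the authors' implementation.

import java.util.List;

// Sketch of distance-weighted KNN voting: each of the k nearest training images votes
// for its gesture class with a weight that decreases with its distance to the query.
class WeightedKnnSketch {
    static class Neighbor {
        final int gestureClass;   // index of the KNN class (gesture)
        final double distance;    // distance between the query image and this training image
        Neighbor(int gestureClass, double distance) { this.gestureClass = gestureClass; this.distance = distance; }
    }

    /** Returns the gesture class with the largest summed weight among the k neighbors. */
    static int classify(List<Neighbor> kNearest, int numClasses) {
        double[] votes = new double[numClasses];
        for (Neighbor n : kNearest) {
            double weight = 1.0 / (n.distance * n.distance + 1e-9);  // W_i: closer neighbors count more (assumed weighting)
            votes[n.gestureClass] += weight;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;                                                 // arg max over classes
    }
}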

GESTURE RECOGNITION USING HAARTRAINING<br />

Haartraining is well known for tasks such as face and pedestrian detection. In our application, Haartraining has been used to recognize five gesture movements: fist gestures indicating car movement up, down, left and right, and a palm gesture for halt (see Figures 4 & 5).

For the image acquisition process, we again used a webcam to collect positive and negative image samples. The positive images are those containing only objects of interest (gestures); in other words, positive images are used to identify gestures. It does not affect the training if the backgrounds of the positive images differ from each other. Negative image samples contain only backgrounds and no objects of interest; they can be any images, such as landscapes, car photos, or various textures. Negative images are usually used to improve gesture recognition performance, since they allow gestures to be recognized against arbitrary backgrounds.

In order to provide robust training, we collected 116 positive image samples and divided them into a training set of 86 samples and a testing set of 30 samples. We also collected 178 negative image samples for training purposes. Figures 4 & 5 respectively show positive sample images for the training of fist and palm positions.

Figure 4. Positive images representing the 4 key gestures (from<br />

left to right: a basic fist gesture followed by fist gestures<br />

indicating car moving left, right, up and down).<br />

Figure 5. Positive image samples for training representing<br />

palm gestures for stopping.<br />

Vector generation is also needed to convert the positive and negative images into the appropriate format to feed Haartraining at the training stage. The process is briefly explained as follows. One text file contains the names of all the negative sample files (e.g. negative1.jpg, negative2.jpg, etc.), while another text file for the positive images contains the names of all the positive images together with the number of objects and the coordinates of the bounding box around the objects of interest (e.g. positive1.jpg, number_of_object(1), 20 20 50 50 (x, y, width, height)). The 'createsamples' command was also used to create training and testing vector samples in order to avoid distortions.

The AdaBoost algorithm embedded in the Haartraining process has been used for training on the samples. AdaBoost trains a strong classifier as a linear combination of weak classifiers built from the best features of the training set. For example, if there are weak image samples with comparatively dark lighting or low contrast, the AdaBoost approach is able to improve the visibility of the objects of interest with better contrast. Finally, the Haartraining command is used to train the classifier. The evaluation results of the Haartraining approach for gesture tracking and recognition are provided in the evaluation section.

INTEGRATION WITH A 3D GAMES ENGINE<br />

The gesture recognition components based on KNN and Haartraining have been integrated with the 3D games world for the control of the car navigation. An open source games engine, Irrlicht (www.irrlicht.org), together with Newton physics, has been used to construct the 3D world environment. The OpenCV library has been used for the image processing, and the sound library IrrKlang, provided by the developers of the Irrlicht games engine, has been employed to produce sound effects.

Briefly, for the development of the games world we load the racetrack and the car as meshes and set the graphics API to OpenGL. We apply physics to the car mesh using the Newton physics library and add the racetrack to the physics entity, so that the car is the object and the track is the entity.

In order to obtain the input data for the image processing, we have used the OpenCV library. After capturing the images from the webcam, we use IplImage structures to store the image files. Overall, we collect continuous images in our application, and the collected images are used for pre-processing and classification.

For the control of the robot car using KNN, we have used<br />

the following gesture commands: a fist representing<br />

acceleration, a palm representing stopping and a pistol-like<br />

gesture representing turning. Therefore based on the output<br />

of KNN, which has used image files stored in IplImage as<br />

testing images, the robot car can navigate accordingly. For<br />

example, if the output of KNN indicates a fist gesture, then<br />

the robot car performs acceleration.<br />

If Haartraining is used to control the vehicle, we have defined the following gesture commands for navigation: a palm gesture for stopping, a fist position to the very left indicating turning left, a fist position to the very right indicating turning right, a fist position at the top indicating acceleration, and a fist position at the bottom indicating reverse movement. First of all, once a gesture has been recognized by Haartraining, we need to check on which axis and at what position the gesture was recognized. To achieve this, a Haartraining class has been implemented containing all the necessary functions, such as loading the Haar cascade, testing the cascade, and drawing a bounding box around a detected gesture. If the position of the recognized gesture is less than 100 on the x-axis (a fist gesture to the very left), the car turns left; if it is more than 500 on the x-axis (a fist gesture to the very right), the car turns right. Similar rules apply to the forward and reverse control: if the position of the recognized gesture is less than 100 on the y-axis (a fist gesture at the bottom), the car moves backwards, and if it is greater than 400 on the y-axis (a fist gesture at the top), it moves forwards. Figure 6 shows a system screenshot.

Figure 6. A system screenshot.<br />
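The position thresholds above translate into a simple mapping from a recognized gesture and its bounding-box position to a car command; the following sketch uses assumed names and reflects only the thresholds stated in the text.

// Sketch of the position-to-command mapping (x < 100: left, x > 500: right,
// y < 100: reverse, y > 400: forward); enum and method names are assumptions.
class GestureCommandMapper {
    enum Command { STOP, TURN_LEFT, TURN_RIGHT, ACCELERATE, REVERSE, NONE }

    /** Maps a recognized gesture and the position of its bounding box to a car command. */
    static Command map(boolean isPalm, int x, int y) {
        if (isPalm) return Command.STOP;       // palm gesture halts the car
        if (x < 100) return Command.TURN_LEFT;
        if (x > 500) return Command.TURN_RIGHT;
        if (y < 100) return Command.REVERSE;   // fist towards the bottom, as defined in the text
        if (y > 400) return Command.ACCELERATE;
        return Command.NONE;                   // fist near the center: keep current behavior
    }
}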

SYSTEM EVALUATION<br />

We conducted user testing with 5 male subjects aged 20-25 to evaluate the efficiency of our KNN-based gesture recognition component. The testing methodology for KNN was as follows. Each testing subject first had a warm-up session and then played the game using hand gestures for vehicle navigation. A video was recorded of the gestures performed by each testing subject so that they could be compared with the gesture sequence recognized by KNN to obtain an accuracy rate. Across the 5 testing subjects engaged in our user study, we obtained an average accuracy rate of 82%. Detailed results are provided as a confusion matrix in Table 1, with rows representing gestures performed by the testing subjects and columns showing gestures recognized by KNN.

Gestures performed by users | Recognized as fist gesture | Recognized as palm gesture | Recognized as turning gesture
Fist gesture | 90.47% | 4.76% | 4.76%
Palm gesture | 6.66% | 86.66% | 6.66%
Turning gesture | 42.85% | 0% | 57.14%
Table 1. Evaluation results for KNN.

From the KNN recognition results, we noticed that most of the errors were caused by a turning (pistol-like) gesture sometimes being recognized as a fist gesture. This is because the skin detection algorithm sometimes mixed up the background of the gesture (part of the arm) with the gesture itself. The fist and palm gestures were recognized well, with accuracy rates above 80%.

The evaluation results for Haartraining were produced with 29 positive testing images (1 image was invalid), different from the training set. The performance command (opencv_performance) was used for testing and detection purposes. Table 2 shows the evaluation results for the recognition of the testing image samples.

Image set | Correct recognition | Inaccurate recognition | Accuracy rate
Positive palm images | 9 | 7 | 56.3%
Positive fist images | 11 | 2 | 84.6%
Table 2. Evaluation results for Haartraining.

For the recognition of the palm and fist gestures using Haartraining, 9 positive images were recognized as unknown gestures: 7 palm images and 2 fist images. The main reason for these recognition errors is probably that weak images with poor lighting were included in the training set; in future work, higher-quality images will be used to improve the system's performance. Compared with existing work (Billon et al., with a >80% accuracy rate), the performance of both of our approaches is acceptable. Users also experienced effective car navigation using gestures in a real-time 3D application, so the system has the potential to improve user engagement.

CONCLUSION<br />

We have reported a 3D car navigation game application controlled via gesture recognition using both KNN and Haartraining. Although there is room for improvement, both approaches produced reasonable recognition results, with Haartraining additionally able to ignore interfering background effects. In future work, we intend to employ HMMs to extend our system with the capability of recognizing more complex (e.g. emotional) gestures to support natural interaction for automatic navigation.

REFERENCES<br />

1. Billon, R., Nédélec, A. and Tisseau, J. Gesture<br />

Recognition in Flow based on PCA Analysis using<br />

Multiagent System. In Proceedings of ACE. (2008).<br />

2. Elmezain, M., Al-Hamadi, A., Appenrodt, J. and<br />

Michaelis, B. A Hidden Markov Model-Based<br />

Continuous Gesture Recognition System for Hand<br />

Motion Trajectory. In Proceedings of 19th International<br />

Conference on Pattern Recognition, (2008). 1-4.<br />

3. Nam, Y. and Wohn, K. Recognition of Space-Time<br />

Hand-Gestures using Hidden Markov Model. In Proc. of<br />

ACM VRST96 Conference. (1996). 51-58.<br />

4. Tomibayashi, Y., Takegawa, Y., Terada, T., and<br />

Tsukamoto, M. Wearable DJ System: a New Motion-<br />

Controlled DJ System. In Proceedings of ACE. (2009).


Model-Based User Interface Development in the Automotive Industry

Moritz Kuemmerling
German Research Center for Artificial Intelligence (DFKI)
Trippstadter Strasse 122, 67663 Kaiserslautern, Germany
Moritz.Kuemmerling@dfki.de, +49 631 205 3709

Gerrit Meixner
German Research Center for Artificial Intelligence (DFKI)
Trippstadter Strasse 122, 67663 Kaiserslautern, Germany
+49 631 205 3707

ABSTRACT<br />

The time-to-market for human machine interfaces in the German automotive industry has to be reduced. The shortening of innovation cycles in other relevant industry fields and international competitors increase the pressure on German car manufacturers and their suppliers. Model-based user interface development is expected to reduce the development time significantly, thus improving the manufacturers' competitiveness. Therefore, a new domain specific modeling language for the specification of automotive human machine interfaces is being sought. Past approaches with similar objectives have either failed or have not been successfully established across the industry as a holistic solution. Within the scope of a new cooperative project whose partners, for the first time, completely cover the supply chain of user interface development in the automotive industry, a common solution is to be developed and established as an industry standard.

Keywords<br />

Model-based user interface development, automotive HMI,<br />

domain specific language.<br />

INTRODUCTION<br />

The German automotive industry has to find a way to significantly reduce the development time for human machine interfaces (HMI) in vehicles. The reasons, among others, are the continuous development of driver assistance, communication and infotainment systems, new drive concepts, and the continuous shortening of innovation cycles in the consumer electronics industry. To keep up with these technologies and with competitors catching up around the globe, future HMI-systems will become more and more complex while their development costs and time-to-market have to be reduced. However, current HMI development processes are characterized by different, inconsistent workflows and heterogeneous tool chains. The exchange of paper-based specification documents between the process participants causes media discontinuities, inhibits version management, reduces reusability and hampers communication [2].

Moreover, it is impossible to automatically test the integrity and accuracy of paper-based specification documents. The adoption of reliable and successful approaches from the field of model-based user interface development (MBUID) [9] is expected to be a suitable remedy.

For this purpose, a new industry-driven project has been set up whose partners – several German car manufacturers (OEMs), suppliers, a tool developer and the "Verband der Automobilindustrie e. V." (VDA) as an association – for the first time completely cover the supply chain of HMI development in the automotive industry.

Together, the partners aim to develop a new modeling language that will serve as an interface between the process participants, thus avoiding media discontinuities and improving the communication among the involved actors. The intention is nothing less than to establish the new modeling language not only within the project consortium but as an industry standard.

The paper is structured as follows: first we give an overview of MBUID, existing modeling languages and past projects with similar objectives. Then we point out what we expect to do differently in our project. We also explain the impact that the project will have on MBUID as a field of research. After an outlook on our next steps, the paper ends with a conclusion.

EXISTING UIDLS, PAST PROJECTS AND THEIR SHORTCOMINGS

A vast number of XML-based user interface description languages (UIDLs) already exist in the field of MBUID. Some UIDLs have already been standardized by OASIS and/or are subject to a continuous development process. Numerous projects and applications prove their practical suitability. Some examples are UsiXML [14], UIML [1] or XIML [10, 11].
UIML [1] or XIML [10, 11].<br />

The purpose of using a UIDL to develop a user interface is to systematize the HMI development process [9]. UIDLs enable the developer to systematically break down a user interface into different abstraction layers and to model these layers [8]. Thus it is, for example, possible to describe the behavior, the structure and the layout of a user interface independently of each other.

In Figure 1 we show how an automotive HMI-system can be developed using a model-based approach. In a first step, designers and engineers describe an abstract model of the later HMI-system. The abstract model is independent of any hardware platform, and the developers can focus on the user requirements. In the next step, the abstract model is extended to a more concrete one. The concrete model allows the generation of virtual prototypes, which can be used for first user tests of the later HMI-system. In the final step, the concrete models are transformed and mapped to the platform-specific requirements of the target system. The reusability of the models decreases with each step.

Figure 1. <strong>Automotive</strong> HMI development using a model-based<br />

approach.<br />

Existing UIDLs differ in terms of the supported platforms and modalities as well as in the number of predefined interaction objects that are available to describe the elements of the user interface. In the relevant literature, several authors have struggled with the challenge of clearly comparing existing UIDLs [4, 7, 13]; however, a comprehensive comparison is yet to be drawn.

In HMI development in the automotive industry a wide range of actors from many different branches are involved – computer scientists and electrical engineers work together with designers, ergonomists and psychologists in interdisciplinary teams (see Figure 2). The HMI modeling language that we want to develop shall serve as the connecting link between these actors. For this reason the modeling language has to be domain specific. Domain specific languages (DSL) are dedicated to a particular problem domain and their "vocabulary" is generally based upon common expressions that are typical for the domain. Thus DSLs are far more expressive in their domain than general-purpose languages would be. Further benefits of DSLs are a better acceptance when introducing the language as well as a better readability of DSL-based specifications, even for non-programmers.

Figure 2. Actors in the HMI development process and their<br />

specification flow.<br />

The idea of reusing best practices and existing modeling languages from the field of MBUID to develop a new domain-specific language for automotive HMI development is not completely new. In the past there have been similar approaches:

• IML (Infotainment Markup Language) [6], developed by IAV, is an XML-based modeling language for infotainment systems.

• OEM XML (later VW XML) [3] is an XML-based language that resulted from a cooperation of AUDI, BMW, Daimler, Porsche and VW. It addresses the standardized description of head-units and instrument cluster systems.

• AbstractHMI [12] is an XML-based modeling language for automotive HMI-systems. The language was developed at the University of Ulm in cooperation with Daimler.

• ICUC XML [5] is dedicated to the modeling of instrument clusters in trucks. The language was developed by Elektrobit Automotive for Daimler.

However, none of the languages presented above has established itself as an industry standard. Today there are only a few, at best partial, solutions that are used by some OEMs or suppliers. IAV gave up on the development of IML. AbstractHMI has never found its way from research to industrial application. ICUC XML can only be used via the development tool EB Guide, and OEM XML, despite the numerous partners involved in its development, is used only by VW.


WHAT TO DO DIFFERENTLY?<br />

The sustainable success of the renewed attempt strongly depends on the impact that the new modeling language and further project-related standardizations will achieve in the automotive industry. For this reason, a consistent transfer of the project results towards the industry is required. Exhibitions of the project results at the leading trade fairs in the automotive industry, such as the International Motor Show (IAA) or the International Suppliers Fair (IZB), will attract attention and contribute to the dissemination of the project results.

During the project period of three years, the project results will be continuously tested, validated and presented in the form of several demonstrators. Towards the end of the project these demonstrators will be aggregated into an overall system. This overall system shall cover and demonstrate the complete HMI development process in the automotive industry, from the first mock-up to the implementation of the target code on the hardware in the cockpit of a vehicle. In particular, model-based aspects and differences to the common development process shall be highlighted. To this purpose, the final demonstrator shall, for example, show that requests for changes in the running HMI-system can easily be realized by small manipulations in the underlying HMI specification (which is based on the domain-specific modeling language). The HMI-system is supposed to run on several OEM/supplier hardware combinations. The exchangeability of the cockpit's hardware emphasizes the wide coverage of the project results in the automotive industry.

In addition to the optimization of the HMI development process and of the communication within it, a standardized modeling language paves the way for some further improvements.

The above-mentioned inability of paper-based exchange documents to be tested automatically for integrity and accuracy often leads to bugs in the HMI-system that are first noticed in late stages of development. By leveraging the full potential of machine-readable specification documents (e.g. model-based testing, early use of virtual prototypes), cost- and time-intensive subsequent iterations and corrections can be avoided. For both suppliers and OEMs this represents a significant cost-saving potential.

The connection of the HMI-system to the application layer of the vehicle is a further significant cost factor in current development processes. As the connection to the car's application layer still requires manual processing, this step consumes resources to a similar extent as the actual development of the HMI-system. The introduction of a standardized modeling language creates the conditions for the development of a standard middleware that allows future HMI-systems to be connected more easily to the car's application layer. The consequences are a reduction of development time and better exchangeability of the hardware components.


The integration of both aspects, model-based testing and middleware, highlights the unexplored potential of model-based HMI development in the automotive industry.

IMPACT ON MBUID<br />

In the field of HMI development, a distinction is made between model-based development of human-machine interfaces at design time and at runtime. The presented project addresses the model-based development of automotive HMI-systems at design time. Thus the project is the first extensive industrial use case for model-based HMI development. The collaborative application of this method by several industrial partners allows a proof of concept, revealing strengths as well as possible weaknesses where further research is required. Furthermore, the step towards model-based HMI development at design time is a necessary one in the automotive industry: future runtime-adaptive HMI-systems require a model-based architecture. The development of such systems is necessary for a functional and efficient integration of the driver's mobile devices (iPods, mobile phones, etc.).

NEXT STEPS TO TAKE<br />

The achievement of the project objectives presented above depends on several central tasks.

Out of the numerous UIDLs without any automotive background, a few well-established examples have to be picked and compared to each other. The comparison has to be based on an appropriate use case that allows the identification of elements that can be useful for the development of the automotive modeling language (e.g. a simple interface for a music player).

In parallel, existing automotive-related UIDLs have to be carefully examined. In particular, the question why none of these languages became a standard has to be answered.

The automotive HMI development process itself will be the subject of a comprehensive analysis. Tools, processes and specification documents are examined on site at each partner, with a strong focus on the interfaces and the exchange of documents between OEMs, suppliers and tool developers. The purpose is to identify best practices and to define an abstract reference process. The latter shall be used to derive a common data model as well as the requirements for the development of the new modeling language.

CONCLUSION<br />

In this paper we summarized some of the main issues in current HMI development processes in the automotive industry. The adoption of methods from the field of MBUID is supposed to lead to machine-readable HMI specifications, thus improving the communication between the process partners. Past attempts to develop a standardized modeling language have either failed or led to isolated applications. However, long-term benefits and potential subsequent developments necessitate an industry-wide impact as well as a sustainable manifestation of the outcomes of the presented project. The first step has already been taken: for the first time, several OEMs will work together with their suppliers on the optimization of their HMI development processes.

REFERENCES<br />

1. Abrams, M., Phanouriou, C. and Batongbacal, A.<br />

UIML: An Appliance-Independent XML User Interface<br />

Language. Proc. of the 8th International World Wide<br />

Web Conference, Toronto, Canada, 1999.<br />

2. Bock, C., Görlich, D. and Zühlke, D. Using Domain-Specific Languages in the Design of HMIs: Experiences and Lessons Learned. Proc. of the Workshop on Model-Driven Development of Advanced User Interfaces, UML/MoDELS 2006, Genoa, Italy, 2006.

3. Brunhorn, J. XML-Sprache zur Beschreibung von HMIs<br />

für Infotainmentsysteme und Kombiinstrumente.<br />

Language Specification 1.0. Carmeq GmbH / OEM<br />

Arbeitskreis HMI Methodik, 2007.<br />

4. Guerrero García, J., González Calleros, J. and<br />

Vanderdonckt, J. A Theoretical Survey of User Interface<br />

Description Languages: Preliminary Results. Proc. of<br />

Joint 4th Latin American Conference on Human-<br />

Computer Interaction 7th Latin American Web<br />

Congress, Los Alamitos, USA, 2009.<br />

5. Hübner, M. and Grüll, I. ICUC-XML Format. Format<br />

Specification Revision 14. Elektrobit, 2007.<br />

6. Jud, A. Präzise Syntaxdefinition einer<br />

Modellierungstechnik für Infotainment-Systeme. Master<br />

Thesis, Technische Universität Berlin, 2007.<br />


7. Luyten, K. Dynamic User Interface Generation for<br />

Mobile and Embedded Systems with Model-Based User<br />

Interface Development. Doctoral Thesis, Transnationale<br />

Universiteit Limburg, Limburg, 2004.<br />

8. Meixner, G. Model-based Useware Engineering. W3C Workshop on Future Standards for Model-Based User Interfaces (W3C-2010), May 13-14, Rome, Italy, 2010.

9. Puerta, A. A Model-Based Interface Development<br />

Environment. IEEE Software, 14 (4), 40-47, 1997.<br />

10. Puerta, A. and Eisenstein, J. XIML: A Universal<br />

Language for User Interfaces. RedWhale Software, Palo<br />

Alto, CA USA, 2001. Retrieved September 09, 2011,<br />

from http://www.ximl.org/pages/docs.asp.<br />

11. Puerta, A. and Eisenstein, J. Developing a Multiple User<br />

Interface Representation Framework for Industry. In:<br />

Multiple User Interfaces. Cross-platform Applications<br />

and Context-Aware Interface, Wiley, 119-148, 2004.<br />

12. Reich, B. Abstrakte Beschreibung automobiler HMI-<br />

Systeme und deren Erweiterung für neue Dienste.<br />

Master Thesis, Universität Ulm, 2008.<br />

13. Souchon, N. and Vanderdonckt, J. A Review of XML-

Compliant User Interface Description Languages. Proc.<br />

of the 10th International Workshop on Interactive<br />

Systems: Design, Specification and Verification, 377-<br />

391, 2003.<br />

14. Vanderdonckt, J., Limbourg, Q. and Michotte, B.<br />

USIXML: A User Interface Description Language for<br />

Specifying Multimodal User Interfaces. Proc. of the<br />

W3C Workshop on Multimodal Interaction, 2004.


A Robotic Wheelchair using Human Gestures and<br />

Scene Contexts<br />

Jin Sun Ju, Eun Yi Kim<br />

Dept. of Advanced Technology Fusion Engineering, Konkuk University, Seoul, Korea

vocaljs@konkuk.ac.kr, eykim@konkuk.ac.kr<br />

82-2-450-4135<br />

ABSTRACT<br />

In this paper, we propose a new vision-based robotic wheelchair using human gestures and scene contexts. For easy and accurate control of the wheelchair, facial gestures are used: the direction of the robotic wheelchair is determined by the inclination of the user's face, while proceeding and stopping are determined by the shape of the user's mouth. In addition, to provide autonomous obstacle avoidance, a monocular vision-based navigation module is developed. To assess the effectiveness of the developed robotic wheelchair, several experiments were performed indoors and outdoors under various situational conditions. The results demonstrate the feasibility of our system as a mobility aid for disabled or elderly people.

Author Keywords<br />

Robotic wheelchair, gesture recognition, MLP<br />

ACM Classification Keywords<br />

H5.m. Information interfaces and presentation (e.g., HCI):<br />

Miscellaneous; I.4 Image processing and computer vision;<br />

INTRODUCTION<br />

Robotic wheelchairs are generally electric-powered wheelchairs with an embedded computer and sensors, giving them intelligence. The most important evaluation factors for such wheelchairs are safety and convenient control, so many studies have addressed intelligent interfaces and autonomous navigation [1] [2]. The intelligent interface aims at enabling handicapped users to control the wheelchair with their limited physical abilities. For such an interface, we developed a control system using face inclination and mouth shape recognition in previous work, which improves both the accuracy of recognizing the user's intention and the computational cost compared with existing approaches [3].

Navigation refers to detecting various obstacles in real environments and avoiding them. As wheelchairs are used by handicapped people, dangerous situations and accidents such as collisions with obstacles and other people can occur. Accordingly, this study focuses on developing automatic navigation techniques for obstacle detection and avoidance.

In this paper, we develop a vision-based robotic wheelchair using human gestures and scene contexts. Fig. 1(a) illustrates the prototype of the proposed robotic wheelchair and the specifications of its components. Our system consists of two modules: 1) a wheelchair control interface module and 2) a monocular vision-based navigation module. Fig. 1(b) describes the process of the proposed robotic wheelchair.

(a) (b)<br />

Figure 1. The proposed system (a) the overall architecture of our wheelchair (b) the outline of proposed wheelchair system<br />

Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA


WHEELCHAIR CONTROL INTERFACE<br />

The proposed wheelchair control interface allows the user to control the wheelchair directly by changing their face inclination and mouth shape. If the user wants the wheelchair to move forward, they just say "Go." Conversely, to stop the wheelchair, the user just says "Uhm." The direction of the wheelchair is determined by the inclination of the user's face rather than by turning the head.

Facial Feature Detection<br />

For robust detection of the facial region, we use the AdaBoost algorithm, which has recently been widely used in face detection due to its accuracy and speed [5]. It extracts Haar-like features that characterize the facial region from all possible rectangles obtained from a given image. Once a facial region is obtained, the mouth region is localized using edge information. The detection results may include some noise, which is filtered out by connected-component analysis.
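The following Python/OpenCV sketch illustrates a pipeline of this kind; it is not the authors' implementation, and the lower-third mouth search region and the Canny thresholds are assumptions made for illustration.

```python
import cv2
import numpy as np

# bundled Haar cascade shipped with the opencv-python package
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_and_mouth(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = faces[0]                              # first detected face
    lower = gray[y + 2 * h // 3: y + h, x: x + w]      # assume mouth lies in lower third
    edges = cv2.Canny(lower, 80, 160)                  # edge information
    # connected-component analysis filters out small noise blobs, as in the paper
    n, _, stats, _ = cv2.connectedComponentsWithStats(edges)
    if n <= 1:
        return (x, y, w, h), None
    biggest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    mx, my, mw, mh = stats[biggest, :4]
    return (x, y, w, h), (x + mx, y + 2 * h // 3 + my, mw, mh)
```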

Facial Feature Recognition<br />

Let ρ denote the orientation of the facial region. Then ρ can be calculated by minimizing the following inertia:

$$I(\rho)=\frac{1}{|F|}\sum_{(x,y)\in F}\left[(y-\bar{y})\cos\rho-(x-\bar{x})\sin\rho\right]^{2}\qquad(1)$$

where $F$ is the set of pixels of the facial region and $(\bar{x},\bar{y})$ is its centroid; the minimizing angle is $\rho=\tfrac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}\right)$, with $\mu_{pq}$ the central moments of $F$.

If the value of ρ is less than 0, this means that the user nods<br />

their head slanting to the left. Otherwise, it means that the<br />

user nods their head slanting to the right.<br />

To recognize the mouth shape in the current frame, template matching is performed: the current mouth region is compared with a set of mouth shape templates. These templates are obtained by K-means clustering of 114 mouth images. After localizing the mouth in the current frame, we first normalize the mouth region, calculate its matching score for all templates, and pick the template with the best matching score.
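A minimal sketch of such normalized template matching is given below; the fixed patch size and the template labels are illustrative assumptions rather than details taken from the paper.

```python
import cv2
import numpy as np

TEMPLATE_SIZE = (32, 16)   # assumed (width, height) of the normalized mouth patch

def classify_mouth(mouth_gray, templates):
    """templates: dict mapping a label (e.g. 'go', 'stop') to a grayscale template."""
    patch = cv2.resize(mouth_gray, TEMPLATE_SIZE).astype(np.float32)
    patch = (patch - patch.mean()) / (patch.std() + 1e-6)      # normalization
    best_label, best_score = None, -np.inf
    for label, template in templates.items():
        t = cv2.resize(template, TEMPLATE_SIZE).astype(np.float32)
        t = (t - t.mean()) / (t.std() + 1e-6)
        score = float((patch * t).mean())                      # correlation score
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```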

MONOCULAR VISION-BASED NAVIGATION<br />

In this module, all information about the environment in which the wheelchair is positioned is represented in the form of an occupancy map. An MLP is then used to capture the visual characteristics of the occupancy maps for the different directions.

Obstacle Detection<br />

Each cell in the occupancy map models the risk of the corresponding area by its gray level, so we first design the map to fit the environment. For the occupancy map generation, we estimate the background model by simple online learning and compare it with every frame received from a CCD camera, thereby classifying the current frame into background and non-background regions. Here we use a simplified version of the background detection method presented in [4]. The background color is estimated from only a reference area rather than the whole image. The input image is filtered by a 5×5 Gaussian filter to reduce noise and transformed into the HSI color space. From the reference area, two color histograms are calculated, for hue and intensity. These histograms are accumulated over the most recent five frames and used as the background model. The background model is continuously updated as new frames arrive. Once the background model is obtained, the classification is performed: if the intensity and hue of a pixel are below the thresholds, the pixel is considered an obstacle. In this paper, the hue and intensity thresholds are set to 60 and 80, respectively. Based on the background classification results, an occupancy map is produced, where each cell is allocated to a walking area and has a different gray level according to the occupancy of obstacles, as shown in Fig. 2.

Here, 10 gray levels are used according to the risk: the gray level of a cell is determined by 1/10 × (number of pixels classified as obstacles), and a gray value is assigned to each cell according to its risk. The brighter a grid cell, the higher the obstacle density.

Figure 2. Examples of generated occupancy maps: (a) input image, (b) obstacle classification results, (c) occupancy maps.
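A hedged Python sketch of this step is shown below. It follows the appearance-based idea of [4]: a reference strip is assumed to be free floor, hue/intensity histograms accumulated over the last five frames serve as the background model, and pixels whose bins have little support are treated as obstacles. Using HSV as a stand-in for HSI, the bottom-strip reference area, and applying the paper's thresholds (60, 80) to histogram counts are all assumptions made for illustration.

```python
import cv2
import numpy as np
from collections import deque

GRID_W, GRID_H, LEVELS = 32, 24, 10
hist_buffer = deque(maxlen=5)                 # background model: last five frames

def occupancy_map(frame_bgr, reference_rows=slice(-40, None)):
    blurred = cv2.GaussianBlur(frame_bgr, (5, 5), 0)            # 5x5 Gaussian filter
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)              # HSV as stand-in for HSI
    hue, value = hsv[..., 0], hsv[..., 2]

    ref = hsv[reference_rows]                                   # assumed free-floor strip
    h_hist = cv2.calcHist([ref], [0], None, [180], [0, 180]).ravel()
    v_hist = cv2.calcHist([ref], [2], None, [256], [0, 256]).ravel()
    hist_buffer.append((h_hist, v_hist))                        # accumulate over 5 frames
    bg_h = sum(h for h, _ in hist_buffer)
    bg_v = sum(v for _, v in hist_buffer)

    # a pixel with little histogram support is classified as an obstacle
    obstacle = ((bg_h[hue] < 60) | (bg_v[value] < 80)).astype(np.float32)

    # cell gray level ~ fraction of obstacle pixels, quantized to 10 levels
    cells = cv2.resize(obstacle, (GRID_W, GRID_H), interpolation=cv2.INTER_AREA)
    return np.floor(cells * (LEVELS - 1)).astype(np.uint8)
```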

Path generation<br />

We try to automatically extract scene contexts from the real-time video stream and use them to determine viable paths through machine learning. Here, we use an MLP to automatically capture important scene contexts among occupancy maps for different paths, as it integrates feature extraction and classification in its own architecture. Path generation is performed in two steps: an off-line learning stage and an on-line recognition stage.

Off-line learning stage<br />

In the off-line learning stage, the proposed system learns the visual properties of the occupancy maps for each direction using an MLP, so that it can recommend viable paths. In the MLP, the input layer receives the gray values of the cells of the 32×24 occupancy map. The output value of a hidden node is then obtained from the dot product of the vector of input values and the vector of weights connected to that hidden node; this is then presented to the nodes of the next layer. Although various learning techniques can be used for multi-layered networks, this study used back-propagation, where the output values are compared with the correct answer during network training to compute the value of the error function. In our system, the input layer is composed of 769 nodes and the output layer of four nodes, each of which corresponds to one of four directions {Go straight, Stop, Turn Left, Turn Right}.

On-line recognizing stage<br />

After training, the MLP is used to make decisions on the online stream. As the value of an output node is a floating-point number ranging from 0 to 1, a threshold value is required to decide on viable paths. Here, a threshold value of 0.7 was used for the MLP output nodes. Therefore, if the predicted score of an output node was larger than 0.7, the direction corresponding to that node was selected.
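A minimal numpy sketch of this decision step is given below; the hidden-layer size and weight initialization are assumptions (the paper does not state them), and only the forward pass is shown, with training assumed to use standard back-propagation. The 769 inputs are read here as the 768 cells of the 32×24 occupancy map plus one bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 769, 30, 4           # 30 hidden nodes is an assumption
W1 = rng.normal(0.0, 0.1, (N_IN, N_HID))  # placeholder weights; real ones come from training
W2 = rng.normal(0.0, 0.1, (N_HID, N_OUT))
DIRECTIONS = ["go_straight", "stop", "turn_left", "turn_right"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def viable_paths(occupancy_cells):
    """occupancy_cells: 32x24 array of gray levels 0..9 from the occupancy map."""
    x = np.append(occupancy_cells.astype(np.float32).ravel() / 9.0, 1.0)  # + bias input
    hidden = sigmoid(x @ W1)              # dot product of inputs and hidden weights
    outputs = sigmoid(hidden @ W2)        # one score in [0, 1] per direction
    return [d for d, score in zip(DIRECTIONS, outputs) if score > 0.7]
```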

EXPERIMENTS AND RESULTS<br />

To assess the effectiveness of the proposed system we performed several experiments. Experiments I and II were designed to measure the accuracy of our two main modules, reporting the accuracy of the interface and of the navigation, respectively. Experiment III was designed to assess the overall effectiveness of the system, so its performance was compared with that of an existing method.

Experiment I: To measure the accuracy of wheelchair<br />

control interface<br />

For the proposed wheelchair control interface to be practically usable in real environments, it should be robust to various illuminations and cluttered backgrounds. Thus, the proposed interface was tested indoors and outdoors, as well as across both environments.

Fig. 3 shows the facial feature detection and recognition<br />

results. As seen in Fig. 3, the proposed method accurately<br />

detected the face and mouth, confirming the robustness to<br />

time-varying illumination, and low sensitivity to a cluttered<br />

environment.<br />

Figure 3: Face and mouth detection results<br />

Table 1 shows the recognition rates of the proposed interface for the respective commands. The proposed interface shows a precision of 100% and a recall of 96.5% on average. Thus, this experiment proved that the proposed interface can accurately recognize the user's intention in real time.

Commands Recall (%) Precision (%)<br />

Turn Left 98 100<br />

Turn Right 94 100<br />

Go straight 96 100<br />

Stop 98 100<br />

Table 1. Performances in recognizing users’ commands<br />

Experiment II: To measure the accuracy of monocular<br />

vision-based navigation<br />

To fully support the mobility of severely or cognitively disabled people, a navigation system that automatically detects obstacles and avoids them is necessary, so we developed a new monocular vision-based navigation method using machine learning. To be practically usable in real environments, it should detect a variety of obstacles and be robust to situational effects such as place types and lighting conditions.

Thus, it was tested indoors and outdoors, at daytime and at night. Fig. 4 shows the results of detecting obstacles under various conditions, where the 1st to 4th columns show detection results for static obstacles indoors, and the last two columns show results for moving obstacles outdoors. In more detail, the 1st column shows the detection result for a static, thin, floating obstacle, and the 2nd column shows the result for detecting a static thick obstacle; these images were taken at daytime. The 3rd and 4th columns in Fig. 4 show the detection of a thin and small obstacle at night-time. Finally, the 5th and 6th columns show the results for detecting moving obstacles at daytime and night-time, respectively.

For a given input image (as shown in Fig. 4(a)), the obstacle detection results and generated occupancy maps are shown in Figs. 4(b) and (c), respectively. As shown in Fig. 4(c), the proposed system can accurately detect a variety of obstacles under several illumination conditions.

Figure 4. Obstacle detection results: (a) input images, (b) background detection results, (c) occupancy maps.

Table 2 summarizes the performance of our navigation system under various conditions. Although there are some differences, it showed an accuracy of 90% on average. Among the four test groups, the accuracy for Type 2 was lowest. The Type 2 experiments were performed in a shopping mall, where the marble-textured background caused strong reflections and the scene was heavily cluttered with people and stores. However, despite these problems, our system could generate viable paths that avoid collisions with obstacles.

Environments       Indoor                        Outdoor
                   Type 1        Type 2          Type 3        Type 4
Accuracy (%)       91            87              93            89
(Type 1: underground; Type 2: shopping mall; Type 3: road; Type 4: footway)
Table 2. Performance in determining viable paths

Experiment III: To prove the effectiveness of our monocular vision-based navigation by comparison with another method
To assess the validity of the monocular vision-based navigation module, its performance was compared with that of another method. Here we adopt VFH [7], as it is the most commonly used method in autonomous navigation, as mentioned in Section I (related work).

Fig. 5 shows the performance comparison of the two methods indoors and outdoors with time-varying illumination. Fig. 5(a) shows the results of the two methods under time-varying sunlight at daytime, and Fig. 5(b) shows the results under artificial lights at night-time. As can be seen in Fig. 5, the proposed method showed better performance in all cases, regardless of place type and illumination conditions. On average, the proposed method can generate avoidable paths with an accuracy of 92%, whereas VFH has an accuracy of 79%; the proposed method thus improves the accuracy by 13 percentage points.

Figure 5. Performance comparison of our system and VFH under various lighting conditions: (a) comparison under time-varying sunlight, (b) comparison under artificial lights (curves: proposed method outdoor, proposed method indoor, VFH outdoor, VFH indoor).

                   Indoor                        Outdoor
                   Daytime       Night-time      Daytime       Night-time
Proposed method    8%            10%             11%           13%
VFH                31%           30%             46%           48%
Table 3. Collision rate of the proposed method and VFH

The most important role of a navigation system is to prevent collisions, so performance should also be evaluated in this respect. Table 3 shows the collision behavior of the two methods when moving towards a goal: the proposed method detected collisions and stopped with an accuracy of 89%, whereas VFH showed an accuracy of just 61%.


As shown in Fig. 5 and Table 3, the numerical comparisons show that the proposed method provides safer mobility than VFH and is robust to situational effects such as illumination conditions and place types. Moreover, the average time taken by the proposed method to process a frame was about 56 ms, allowing it to process more than 17 frames/s; the proposed method was about 22 ms faster than VFH. Consequently, the proposed method improves collision detection and the prediction of avoidable paths compared with the existing method, thereby providing the wheelchair with safe navigation in real environments.

CONCLUSIONS<br />

In this paper, we developed a vision-based robotic wheelchair using human gestures and scene contexts. The advantages of the proposed system include the following: 1) our wheelchair control interface requires minimal user motion, namely face inclination and mouth shapes, making the proposed interface more suitable for the severely disabled; 2) by using scene contexts as well as obstacle density, our monocular vision-based navigation provides the wheelchair user with safer mobility in unknown environments; 3) the approach is also applicable to other mobile robots and assistive devices, such as ETA (Electronic Travel Aid) systems for visually impaired people, to support their safe mobility.

To prove these advantages, several experiments were performed indoors and outdoors under various situational conditions, and the system's performance was compared with an existing method. The results showed the efficiency and effectiveness of the proposed robotic wheelchair.

ACKNOWLEDGMENT
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2010-C1090-1001-0008).

REFERENCES<br />

1. Ju, J.S. Intelligent wheelchair interface using face and mouth recognition. Proc. of the International Conference on Intelligent User Interfaces, ACM, 2009.
2. DeSouza, G.N. and Kak, A.C. Vision for mobile robot navigation: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 237-267, 2002.
3. Mazo, M. and Garcia, J.C. Experiences in assisted mobility: the SIAMO project. Proc. of the IEEE Conference on Control Applications, 2002.
4. Ulrich, I. and Nourbakhsh, I. Appearance-based obstacle detection with monocular color vision. Proc. of the AAAI National Conference on Artificial Intelligence, 2000.
5. Viola, P. and Jones, M.J. Robust real-time face detection. International Journal of Computer Vision, 57(2), 137-154, 2004.


MetaBrain: Web Information Extraction and Visualization<br />

João Teixeira Gabriel Barata Daniel Gonçalves<br />

Department of Computer Science and Engineering, IST<br />

Av. Rovisco Pais, 1000 Lisbon<br />

{joao.teixeira,gabriel.barata}@ist.utl.pt, daniel.goncalves@inesc-id.pt<br />

ABSTRACT<br />

Nowadays, the web is a huge source of information on different branches of knowledge. This knowledge, however, is dispersed across many sites, making it difficult to interrelate and understand. In the past few years some approaches have been developed to ease the extraction of this information, from Open Information Extraction to simpler data mining. Usually these solutions work as standalone applications, are developed from scratch, and are brittle and very sensitive to changes in the data sources. This makes it difficult for the end user to fully explore the potential of using different algorithms together to better extract and analyze information. In this paper we propose a new approach in which users can create their own personalized information extractors and visualizations, without needing to type a single line of code, in an easy and highly flexible manner using a special-purpose interface. Since raw data is often difficult to understand, we also study how the user can create customized visualizations of the extracted data with low effort. A prototype of this concept, MetaBrain, has been implemented and tested. A preliminary heuristic evaluation demonstrates favorable results for the concept.

Author Keywords<br />

Information Extraction, visualization, user interaction.<br />

ACM Classification Keywords<br />

H.5.2 User Interfaces - Graphical user interfaces (GUI),<br />

H.5.m Miscellaneous.<br />

INTRODUCTION<br />

The versatility of the web is also its biggest problem. Since anyone is free to create their website in any way they want, there is no unifying structure for all this information. More than a huge repository of knowledge, the web contains a whole set of hidden, implicit information. The way people express their thoughts reflects a collective unconscious of trends and patterns which are not obvious at first sight. Which color does the Internet associate with the term apple? Surprisingly, white is the color that most frequently co-occurs with apple in web pages, followed by red and green. Apple Inc. and Snow White may be to blame for this.

Copyright is held by the author/owner(s).
MIAA 2011, February 13, 2011, Palo Alto, CA, USA.

Traditionally, Information Extraction (IE) focuses on extracting information from specific pre-defined domains. Changing domains implies that new extraction rules need to be manually created, making it hard to scale. Manually querying search engines in order to extract large quantities of information is also not the right approach, since it is tedious and error-prone, as pointed out in [6]. A possible solution to this problem is the use of Open Information Extraction [2], which states that a large proportion of relationships are expressed through a compact set of relation-independent lexico-syntactic patterns. This is only one of several techniques [3,5,7] which allow the extraction of information from the Web using only statistics and probabilities.

Although many new tools for web IE have recently appeared, these tools are usually designed to use a single type of IE technique, with no possibility of interaction with others. It may be in the best interest of the user to use different IE techniques simultaneously, thus discovering hidden and unexpected patterns in apparently unrelated data: for example, automatically extracting a list of operating systems and seeing how popular each one is on different search engines or social networks, for different kinds of users. Another problem found in these tools is that most are developed from scratch. Currently, there is no unified framework with different IE modules available for programmers or other users to use as a basis for their IE tools. Also, state-of-the-art tools like TextRunner [1] lack advanced search options, such as selecting the search engine to use or exporting the retrieved data. These options may be important for advanced users.

Our research aims at finding ways for normal web users to access the collective unconscious that is the Internet. Given the giant number of possible extraction scenarios, this can be a very complex and difficult task. Our efforts were directed at creating the best interface to make this task as easy as possible. Since the raw data produced by these techniques is at times difficult to understand, we also analyzed several information visualization techniques, from simple bar charts to hierarchical tree-maps, with the objective of creating a good and easy way for the user to create and export customized visualizations.


In the next sections, we detail how we extract information from the web. Then we explain the design and interaction decisions behind our solution prototype. This is followed by an analysis of the results of the prototype's heuristic evaluation. Finally, we conclude with our final remarks and discuss future work.

CREATING CUSTOMIZED IE SOLUTIONS<br />

There are different approaches to extract information from<br />

the web without the use of complex natural language<br />

parsers. Different algorithms use different features to<br />

extract the information. Generally, we find three different<br />

classes of approach that use: number of results found for a<br />

given query [9]; lexico-syntactic patterns [5,6]; and word<br />

co-occurrence [8]. Next we’ll see how we can use these<br />

different classes together to create customized IE tools.<br />

Selected Information Extraction approaches<br />

The number of results can be used as a way to identify the<br />

popularity of one or more concepts on the Internet, and also<br />

to measure the validity of extracted data. For example, if<br />

“fishing water” has more results than “fishing wall” then<br />

fishing is probably more related to water than to a wall.<br />
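As a toy illustration of this heuristic, the sketch below compares two candidate terms by the number of results their co-occurrence queries return; result_count is a hypothetical stand-in for whatever search-engine API provides hit counts.

```python
def more_related(concept, candidate_a, candidate_b, result_count):
    """Return the candidate whose co-occurrence query yields more results."""
    hits_a = result_count(f'"{concept} {candidate_a}"')
    hits_b = result_count(f'"{concept} {candidate_b}"')
    return candidate_a if hits_a >= hits_b else candidate_b

# e.g. more_related("fishing", "water", "wall", result_count) -> "water"
```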

By using lexico-syntactic patterns like C {","} "such as" IList, where C is a concept and IList is a list of instances of that concept, it is possible to generate special queries to use in search engines that are able to map concepts to instances or instances to concepts.
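As a rough illustration (not the MetaBrain library itself), the sketch below turns such a pattern into a search query and mines the returned snippets for instances of the concept; fetch_snippets is a hypothetical stand-in for the search-engine API actually used.

```python
import re

def extract_instances(concept, fetch_snippets):
    """Mine 'C such as x, y and z' snippets for instances of the concept C."""
    query = f'"{concept} such as"'                         # pattern-based query
    pattern = re.compile(
        rf'{re.escape(concept)}\s*,?\s+such as\s+([^.;]+)', re.IGNORECASE)
    counts = {}
    for snippet in fetch_snippets(query):
        for match in pattern.finditer(snippet):
            for item in re.split(r',|\band\b', match.group(1)):
                item = item.strip().lower()
                if item:
                    counts[item] = counts.get(item, 0) + 1
    # instances seen more often are more likely to be valid
    return sorted(counts, key=counts.get, reverse=True)
```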

Recent work has demonstrated the validity of using term co-occurrence for opinion mining [7,8]. With the rise of micro-blogging, it is now possible to more easily extract the general Internet opinion on a given concept by looking at which words co-occur with that concept.

Putting It All Together<br />

Each one of these approaches is a way to extract a different<br />

type of information, so it would be good if we could use<br />

them together or alone, depending on what we want to<br />

extract. We can think of each one of these as a different<br />

search module. If we would like to extract a list of cities<br />

and then check their popularity online, instead of manually<br />

executing two different searches it would be good to create<br />

a single search query for the whole extraction.<br />

Because these modules are domain-independent, it is a matter of defining a way to direct one module's output to another's input. In order to do this we can standardize all three modules' main input as a single query parameter and their output (result set) as a table (Figure 1), where the rows represent the different extracted items and the columns represent the extracted information (the primary column) and some auxiliary attributes of the extraction. Looking only at the primary column of a result set, we get a list of results which can be iterated over by another search module as its input parameter. This way it is possible to easily create multi-level search queries, as illustrated in the sketch below; Figure 1 also shows the result of a multi-level search.
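The following sketch shows the standardized module interface just described: every search module takes a single query string and returns a result set whose primary column can be iterated by the next module. The module names in the usage comment are hypothetical stand-ins for the library's own modules.

```python
from typing import Callable, Dict, List

ResultSet = List[Dict[str, object]]   # each row has a primary "extracted" column

def multi_level(query: str,
                first: Callable[[str], ResultSet],
                second: Callable[[str], ResultSet]) -> ResultSet:
    """Feed each value of the first module's primary column into the second."""
    rows: ResultSet = []
    for row in first(query):                  # e.g. extract city instances
        value = str(row["extracted"])
        for sub in second(value):             # e.g. number of results per city
            merged = {"extracted": value}
            merged.update({k: v for k, v in sub.items() if k != "extracted"})
            rows.append(merged)
    return rows

# usage: multi_level("cities", extract_by_domain, count_results) would yield
# one row per extracted city together with its search-engine hit count.
```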


Figure 1. Left: result set for an extraction of city instances.<br />

Each row represents an extracted city, which is presented<br />

on the Extracted column, the table’s primary column;<br />

Right: result set for the number of results found for the<br />

different cities extracted on the left table.<br />

A prototype library was implemented with these capabilities, together with the possibility to customize each search's parameters (thresholds, search engine, etc.). Several search engines can be used, including social networks. A modular approach was used to create this library so that it can be easily extended with new search engines, IE algorithms, or simple web service APIs. Also, since some IE modules sometimes need to perform thousands of search queries, a cache system was developed to make the searches faster when possible. The direct use of this library still requires programming skills. Hence, we developed a special-purpose interface, MetaBrain, which allows even non-programmers to perform IE and visualization tasks in a more natural way.

METABRAIN PROTOTYPE<br />

With the library complete, we started looking into how we could create a GUI simple enough to allow regular Internet users to interact with it, without neglecting the advanced options required by expert users. With this in mind, we decided to use HTML and JavaScript, in order to create a very dynamic interface with standards-compliant technology that is also easy to connect with our Python library. We want users not only to extract information but also to create meaningful visualizations of the raw data. All visualizations were implemented using the Protovis framework [4].

Data Set Creation<br />

Since the use of IE tools may not be common to most users,<br />

our efforts were to simplify every possible step of the<br />

extraction process, without disregarding the needs of<br />

advanced users. By default all customization options are<br />

hidden, although easy to access, and preset to a default<br />

value. This way the only thing needed is for the users to<br />

select what they want to extract. They can choose, and at<br />

any time change, between the different available extraction<br />

modules. These modules allow for the same type of IE<br />

previously discussed plus easy access to public API<br />

services, such as location to geographic coordinates and<br />

search engine suggestions. Each module is accompanied by<br />

a quick description of its purpose and a series of possible<br />

input examples with explanations.<br />

The design philosophy we follow is to show only relevant information in the interface, so, by default, there is only one input section visible to the user. This reduces the visual noise the user has to deal with to complete the task. For a simple one-level IE the process is very straightforward: select the IE module to use, input the query parameter and search. For example, if the user wishes to extract from the Internet a list of zodiac signs, he just needs to select the Extract by Domain module and use zodiac signs as the search query. By doing this, a list of extracted zodiac signs is presented to the user, as seen in Figure 2b.

Figure 2. a) List of available extraction modules for the first input. b) Example of an extraction of the zodiac signs. c) Example of a multi-level search query. The final result will be the popularity, on the selected search engine, of every extracted city.

If the user wishes to create a multi-level search query, the interface will evolve during the process, along with the user's needs. If, at any time, the user chooses to use the result of one search as a term in another, the interface will dynamically add a new input section where the second search query can be defined. These secondary input sections are called variables and have the form %1, %2, etc. Graphically, every new query to obtain the values for each variable appears below the one in which it is used, and one level deeper in the interface (Figure 2c). This helps users to effectively resort to several variables at once without getting lost or confused.

In order to minimize the number of errors and avoid wasting the user's time, before initiating the final search query, which may take from a few seconds to minutes or hours, it is possible to do a preview search on a smaller scale. This way, the user gets a quick glimpse of the kind of results returned by the current query and can make any adjustments necessary before starting the real, long search.

To increase the possibilities of query creation, it is also possible to create data sets by importing the user's own personal data (CSV files) through our prototype. Before the data is imported, it is scanned and MetaBrain tries to guess what type of data is in each column (text, numbers, coordinates, etc.). Our guesses are then shown to the users so they can confirm them and make any changes necessary; a sketch of this step is given below. We will discuss the importance of this type of information in the next section.
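A hedged sketch of such column-type guessing follows; the sampling size and the "lat,lon" coordinate heuristic are assumptions made for illustration, not details of MetaBrain's actual implementation.

```python
import csv

def guess_column_types(path, sample_rows=50):
    """Scan a CSV file and guess a type per column: number, coordinates or text."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = {name: [] for name in header}
        for i, row in enumerate(reader):
            if i >= sample_rows:                      # only sample the first rows
                break
            for name, value in zip(header, row):
                columns[name].append(value.strip())

    def guess(values):
        def is_number(v):
            try:
                float(v)
                return True
            except ValueError:
                return False
        if values and all(is_number(v) for v in values):
            return "number"
        if values and all(v.count(",") == 1 and
                          all(is_number(p) for p in v.split(",")) for v in values):
            return "coordinates"                      # assumed "lat,lon" format
        return "text"

    return {name: guess(vals) for name, vals in columns.items()}
```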

Visualization<br />

Now that we have a good and flexible approach that allows even non-programmers to do customized IE from the Web, the next step is to provide them with the possibility to visualize this information in a more meaningful way than the one provided by simple tables. We started by identifying a set of requirements we would like the visualization creation process to follow:

• Since the table of extracted information has multiple columns, the user must be able to choose which columns she or he wants to visualize.

• The user should be able to choose from several different types of visualizations, from bar charts to sunbursts or even maps.

• All visualizations must have their own set of configuration options: bar width for bar charts, color palette for sunbursts, etc.

• During this process it must be easy to change between different visualization types while maintaining the user's previously selected preferences, if these are applicable to the new type.

• The user must always be able to preview the visualization being created. Configuration changes to the current visualization should be applied instantly, without the need to refresh.

Taking all these requirements into account, we decided to divide the visualization process into three steps: choose the data to visualize (which columns); choose the visualization type; preview and configure the visualization.

To address the first requirement, we decided to let the user choose which columns to visualize by using a drag-and-drop metaphor. On the left side of the application a vertical list of names is visible. These are the names of the different columns in the selected data set, divided by the type of data they contain; this division makes column selection easier for the user. To the right of this list are two large horizontal boxes representing the visualization's axes. The user is then able to drag columns from the left list and drop them in the axis input boxes. During the drag procedure these boxes are highlighted, making the user aware of valid drop targets. We decided to use two axes after concluding, in a study, that all the different visualizations we wanted to implement required at least two degrees of freedom.


The list of available visualizations starts empty. While the user makes column selections, these (the columns selected, their data type and their position on the axes) are used to determine which visualizations are available for the selected data. This way we can minimize errors such as the user choosing a map visualization when no geographical data is selected. When the user has finished selecting the columns and has chosen the visualization, a preview is instantly created. Also, next to the visualization, a list of configurable options (colors, scale, canvas size, etc.) appears with their default values selected. After changing any of these option values, the preview is instantly refreshed. At any time during this process the user can change the selected columns or choose a different visualization. An example of a visualization being created is shown in Figure 3. When users are satisfied with their visualization they can embed it into their website by copying a piece of code into any webpage, much like embedding a YouTube video.

HEURISTIC EVALUATION<br />

In order to test our solution we conducted a heuristic<br />

evaluation of MetaBrain, using Jakob Nielsen’s usability<br />

heuristics 1 . After a quick introduction to the purpose of our<br />

work, four usability experts proceeded to freely test the<br />

prototype for a few minutes and then received a list of four<br />

tasks to execute. In two the users were asked to extract<br />

information from the web, from given domains, and in the<br />

other two to craft specific visualizations for that<br />

information. All were successfully completed by all users.<br />

Overall, only ten usability problems of relevant severity were identified. Most were related to the data extraction interface, especially to the fact that some search queries took several minutes to finish with no indication of progress, only a looping loading sign. This problem has been solved by adding to the search interface the number of queries to be performed and how many have already been completed. All evaluation experts enjoyed the clean and

minimalistic design and the dynamic way in which they<br />

could interact with the system. After completing the tasks,<br />

some wanted to keep playing with the system, curious about<br />

what other information MetaBrain would be able to extract.<br />

This preliminary evaluation allowed us to find and correct<br />

some usability problems. It is indicative that the interface<br />

can be effective and easy to use. Further validation of this<br />

will be provided by upcoming, more formal, user tests,<br />

where we’ll take into account the number of errors and time<br />

taken to complete the tasks.<br />

CONCLUSION<br />

We have presented an interface that allows users to extract and visualize information from the web in meaningful ways. Unlike previous research, we strove to make this task as simple and flexible as possible, so that any type of user, from less to more experienced, can create customized solutions that fit their needs. A preliminary evaluation of our prototype, MetaBrain, showed positive results. Further user studies will allow us to better validate our choices.

Figure 3. Creation of a map visualization, showing Portuguese cities and their respective population size.

1 http://www.useit.com/papers/heuristic/heuristic_list.html

REFERENCES<br />

1. Banko, M., Cafarella, M.J., Soderland, S., Broadhead,<br />

M., and Etzioni, O. Open information extraction from<br />

the web. In Proc. of the IJCAI 2007.<br />

2. Banko, M. and Etzioni, O. The Tradeoffs Between<br />

Open and Traditional Relation Extraction. In Proc. of<br />

ACL-08: HLT, 28-36.<br />

3. Bollegala, D., Matsuo, Y., and Ishizuka, M.<br />

Measuring semantic similarity between words using<br />

web search engines. In Proc. WWW '07, ACM Press<br />

(2007), 757-766.<br />

4. Bostock, M. and Heer, J. Protovis: A Graphical<br />

Toolkit for Visualization. In Proc. IEEE TVCG, 15<br />

(2009), IEEE CS (2009), 1121-1128.<br />

5. Cimiano, P. and Staab, S. Learning by googling.<br />

SIGKDD Explor. Newsl., 6 (2004), 24-33.<br />

6. Etzioni, O., Cafarella, M., Downey, D., et al. Web-scale information extraction in KnowItAll (preliminary results). In Proc. WWW '04, ACM (2004), 100-110.

7. Kramer, A.D. An unobtrusive behavioral model of<br />

gross national happiness. In Proc. CHI '10, ACM<br />

(2010), 287-290.<br />

8. Ku, L., Lee, L., Wu, T., and Chen, H. Major topic<br />

detection and its application to opinion<br />

summarization. In Proc. SIGIR '05, ACM (2005), 627-<br />

628.<br />

9. Turney, P.D. Mining the Web for Synonyms: PMI-IR<br />

versus LSA on TOEFL. Machine Learning: ECML<br />

2001, Springer Berlin (2001), 491-502.



������ ���� ���������� ������������ ����� ��� �����������<br />

����� ������ ������ ��� ���� ����� �� �� �� �������� ���� ����<br />

����������� ���������� ������������ �������� ��� ��������<br />

���� ������ ����� ����������� ��������� ���������� ��� �����<br />

������ ����� ���������� ��� ������� ��� ���� �� ������� ���<br />

������� ��� �������� ���� ��������� ��� �������� ���� �����<br />

������ ������� ���� ����� �������� ����<br />

���� ��������� ������� ����������<br />

���� �� ��� ������ �������� ������� ��������� ������� �����<br />

������� ���������� ���� ����� ������� ��� ������ �� ���������<br />

��� ����� ��� ������������� ���� ��� ������� �������� ���� ���<br />

����� ��� ����������� ������ ���� ������� ��������� �����<br />

���������� ��� ��� �������� ������ �� �������� ���� ��� ������<br />

54<br />

�� ������� ������� ��� ����� �������� ������������ ������� ��<br />

�������� �� ��� ����������� ����������� ��� ������������� ���<br />

����������<br />

�� ���� ������ ������������ ������� ��������� ������� ���<br />

��������� ���� �� ���� ���� ��� �������� ���������� ��� ���<br />

���� � ���������� ��� �� �������� ��� ����������� ��������<br />

����� ������� ���������� ������� ���� ����� ��� ��������<br />

������� ������� ���� ��� ������� ��������� �������� ���������<br />

���� �� � ���������� �������� �������� ��� ������ ����� ��� ��<br />

����� ��������� ���� ��� ��������� �������� ����� ����� ���<br />

���� ���������� ������������ ��� ���� �� ������� ��� ������<br />

���� � ��������� ������ ������ ��� ���� ��������� ��� ����<br />

��������� ���� ���� �������� ����� ����� ��������� �������<br />

������� ������ ����� �� ��� ��������� ������� �� �������<br />

��� ������������ �� ��������� ���� ������������� ������� ��<br />

��������� �������������<br />

���������� ��� ��������<br />

��� ��������� ������� ��������� ����� ���� ������� � ����<br />

������ ������ �� �� ���������� ���� ��� ���� ����� ��� �������<br />

������ ��������� ���������� ��� ����� ������ ������� ����<br />

���� ���������� �� ��������� ���������� ���� ��� �������<br />

���� ������� ��������� ��� ���� �� ���� ���� ��������� ����<br />

��� ���� ���������� ��������� ���������� ��� ���� ���� ���<br />

���� �������������� �������� �� ��� �� ��������� �� � ������<br />

��� ���� ���� ���� ���� ������������<br />

��� ����������<br />

����������� �� ��� ���� ������ ���� ��� ������� ���������<br />

��������� ��� �� ���������� ��� ����� ����� ����� ������� ��<br />

���� �� ��� ����� ������ �� ������� ��������� ���������� ���<br />

��������� ���� ��� ����� ��� �� ��� �������� �� ��������� ����<br />

��� ����� ����� ��� ������ ����������� ����������� �������� ����<br />

���� ��� �� ���� �� ����� ����� ����� �� � ��� ������� ��� ��<br />

��� ����������� �� �������� ���������� ���� �� ����������<br />

����� ��� ���� ��� ������� ������ ��� ������� ��������� ������<br />

����������� ����������� ���� ��� ������� ���������� ���������<br />

����� ���������� �������� ���� ���� ��� ����������� ���<br />

���� �� ���� ���� �� ��� �� ��� ����������� ����������� ���<br />

�������� ������� ���������� �� ������ ��������� ��� ������ ��<br />

���� ����� �� ������� � ����������� ���������� ������ ���� ���<br />

�� �������� �� �� ��� ������ ����<br />

������ �����������<br />

��� ��� �� � ������ ���� ���� ��������� ����������� ���� ��<br />

������� �� �������� ������ �� �������� ����� �� �� ��������<br />

������ ���� � �������� ������� �� � ��� ���� �� ����� ��� ������<br />

��� ��� ��� ������ �� �������� ���� ��������� ��������� ��<br />

���� �� ������ �� ��� ������ ��������� ��������� ��� ���<br />

��� ��� ����� ��������� ��������� �� � ��� ��� � ���������<br />

���������� ����� ��� ��� ���� ��� ����� ��� ��� ���������<br />

������� ���������� �������� ����� ��� �������� �������� ���<br />

����� ������ ��� ������ ��������� ��� ��������� ���� �� ���<br />

������� ���� ������<br />

�������� ����������<br />

��� ��� �� ���������� �� �� ���� ��� �� ������� ������<br />

���� ����� ��� ��������� ���� � ���������� �������� �����<br />

��� �� ������������� ������� ��� ���������� ������� �����


������ �� ��� ������� ����<br />

�� ������ �� ��� �������� ���� ���� ������ ��� ������� ��<br />

��� ����� �� ��� �������� ����� ��� ���������� ���� ��� �����<br />

��� ���������� ����� ���� ���� �� �� ���������� ������ ���<br />

������ ��������������<br />

����������� �����������<br />

���� ��� ���� ����������� ��� ��� ������ �������� �����<br />

���� ��������� �� ���� ������������<br />

��� ���� �� ���� ��� ���� ���� � �������� ���� � ��� ���� ��<br />

������ ���� ����� ������� �� ��� ������� ���� ������� ����<br />

����� ������ �� ��� ��� ���������� ������� �� ��� ������� ���<br />

������� ���� �� ������ �� ��� �������� �������� �� ��� ����<br />

��� ������� �� ��� ����� ���� ��� ���� ������ ���� ��������<br />

��� ����������� ������� ��� ��� ��� ����� ���� �������� ���<br />

���� �� ��� ������ �� ��� ������� ����� �������� ��� �������<br />

��������� ������� �� ��� ������� ������� � ������ ��������<br />

���� ��� ��� ����� ������� ���� �� ������ ������ �������� ���<br />

�������<br />

��������� ������� ����� ���� ���������� ������ �� ���������<br />

��������� ��� ��� ���� �� ���� ������������ ��� ���� ������<br />

������ ��� ���� ���� ��������� ��������� ������ � ����� � ����<br />

����� �������� ��� ������ ��� ������ ��������� � �������<br />

������ ����� ���� ��� ������ ���������� �� ������� ������ ����<br />

������� ������������ �� ������� ����� ��� �� ��� ���� �� ���<br />

�������� �������� ��� ������� ������� ����� �� ������� ���<br />

���� �� ��� ������ �� ��� ������� ��� ������������� ���������<br />

��� ����� �� ������ �� ��� ��� ������� �� �������<br />

� ������ �������� ��������� � ������ ������� � ������� ���<br />

��������� �� ������ ���� ��� �������� ��� ����� ����� �<br />

������� ������� ������� ���� ��� ���� ��������� ������������<br />

���� ����������� ������� ������������ ���� ��� ������ ����<br />

����� �� ������ ��� ���� ��� ������ ���������� ��� �������<br />

��� ������� ��� ����� ������� ��� ���������� ������ �� ���<br />

���� ������� ��� ������� ����� ��� ��������� �����������<br />

��������� �� ��� ������ ��� �������� �� ������� ��� ������<br />

������ ������ �� ��� �������� ���� ����� ��� ������ �������<br />

��� ������ �� �������� ��� ����� ��� ��� ������������ ���������<br />

���� ������ ��� �������� ��� ������ ���� ���� �� ������ ���<br />

55<br />

������ �� ����������� ������� ������� ��������<br />

��������� ����������� �� ����� ��� ������� ��� ������� �����<br />

���� ������������� ���� ���������� ������ ��������� ������ �<br />

����� ��� ������� ��� ��� ��� ���� ��� ����������<br />

��� ����� ����������� ����������� ����� ���� ����� �������<br />

��� ��� ������� � �������� ��������� �� ��� �������� ���<br />

������� ��������� ������ ��� �������� ����������� ���������<br />

���� ��� ��������������� ���� ������� ��� ���� ����� ����<br />

��� ��� �������� ��� ��� ������� ������� ����� ���� �� �������<br />

��������� ������ ���� �� �������� ��� ������� ������������<br />

��� ���������� ������� ������� ����� ���������� �� ��� ����<br />

���� ������ ����� ������� �������� ����� �� ����������<br />

���� ��� ������� ���� ���� ���� �������� ��� ������������� ���<br />

������ ��� ��� ���� �������� ������ �����������<br />

�������� ����������<br />

��� ���� �������� ��������� ���� �� ���� ���������� ���<br />

�������� ����������� ��������� ����� ������ �������� ����<br />

�� � ������������� ������������ ���� �������� ������ ������<br />

��������� ������� ��� ��� �� ��� ������� ��� ��� ��������<br />

���� ������ ���� � �������� �������� ��� ������ ���� �������<br />

��� ������� �������� ��� ��� ����������� ���������� ��������<br />

������� ����� ��� ����� ������ �������� ��� ��������� ���<br />

���� ��� ���� ������� ����� ����������� �� �� ��������� ��<br />

��� ���� ����� ��� ��������� �� �� ��� �� ������� ������ ��<br />

���� ������� ��� ������ ������������ ������� ��� ����� ����<br />

��� ������������ ��� ��� ������ ���� �� ���� �� �����������<br />

��� ���� ��� ���� �� ���� �� �� ��� ������� ���������� ���<br />

����� ������������ �� ����� ����������� ���� ��� �� ����� ��<br />

��� ����� ���� ��� ��� ������� �� � ��������� �����������<br />

������������<br />

�������� ��������<br />

�������� �������� ��� ��� ��� ��� ��� ��������� �������<br />

���� ������� ��� ������ ���� �������� ��� ������� �������<br />

��� ������������ ��� ��� �������� �� ����� ����������� ���<br />

������� ��� ����� �� ��� ���� ���� ���� ��� ������� ����<br />

��� ������� �� ��� ��������� �������� ������������� ���� ���<br />

�������� ��������� ������� �� ��� �� ���� ������� ��������<br />

�������� ���������� ��� ���������� ������� ��� ������ ����<br />

���������� ����� ��������� ������������ � ��� ��������


������� ���� ���� �� ���������� ������ ��� ������ ������<br />

�������� ���� ��� �������� �������� ������� ��� ������ ����<br />

���� ������ �� ��� ���������� ������ ���� ���� �������<br />

���� �������� ���������� ������ ����� ��� ������� ��������<br />

�� ���� � ������ ������� ��� ��� �� �������� ��� ����� ��<br />

��� ���� ������ ������ �� �� ����� ��� ����� ��� �������<br />

�������� ��� ��������� ����������� ���� �������� ��������<br />

����� ���� ��� ������ �� ��� ����� �������� �� ��� ��������<br />

������������� ���� ��� ����� ����� ���������<br />

������ �� ������ ������������� ������<br />

���� ����� �������� ����� � ������� ������ �������������<br />

������ ����� �� ���� ��������� �������� ����������� ���������<br />

����� � ������ ��������� ����������� ��� ������� ��� ����� ���<br />

��� �������������� �������� �� ��� ������ ������ �������<br />

�� ��� ������� �� ���� ��� ������� ��� �������� �� ������ ����<br />

���� ���� ������� ��� ���� ������ ��� ������� ��� ��� ������<br />

���������<br />

����������<br />

�����������<br />

����� ��������� �������������� ��� ������ �������������<br />

��� �������� � �������� �� ����� ��� ����������� �� ������<br />

���� ��� ������� �� ������ �� ������ ���� ��������� ���������<br />

������ ��� ����� ����������� � ������ ���� ��� ���� �� �����<br />

���� ��������� ��� ������� ����� ��� ���� ������ ��� �������<br />

�� �������� ������� ���� ��� ��������� ������� ��������� ���<br />

������ �� ������ ���� �� ������ � ��������� ��������� ���<br />

����� ��� ������ ���� ��� ������ ���� ������� ��� ���������<br />

��� ��� ��� ��� ��������� �� ���� ���� ��� ����� ��� ����<br />

��� �������� ��� ��� �������������<br />

������� ��� ��������<br />

��� ��� ������ �������� ������� ���� ��������� ��������<br />

�� ���� ������� ��������� ������� ������ ���� �� ��������� ��<br />

��������� ������ ����� ���� �� �� ������� ���� �� ��� ��<br />

���������� �� ���� ��������� ����������� �� ���������� ����<br />

����� ������� �������� �� ������� ����� ���� �� ���������<br />

���� ������� ��� ��� �� ���� ����� �� ������ �������� �������<br />

�� � ������� ���� ��� ����� ����� �������� ���������� ����<br />

���� ��� ���� ���� ����� ������� � ���� ���� �� �����������<br />

�� �������� ��� ������ ���� ����� �������� �������� ������� ��<br />

������ ��� ������� �������� ��������� ��� �������� �� �����<br />

���� ���� �� ��������� ��� �� ������� ���������<br />

������ �������<br />

��� ��� �� � �������� ��������� ������ ���� �������� � ������<br />

����� ��������� �������� ��� � ��� �� ���� �� �������� ������<br />

��� ��� �� ���������� ���� ������� ������� ������������<br />

��� ����� ���� �� ����� � ���� ��������� ������� �����������<br />

����������� ���� ����� �������� ����������� ������ �� ����<br />

������� �� ��� ��� ������� ����������� ����� �������<br />

56<br />

������������ �� ���� ��� ������������ ���� ����� ����� ��<br />

��������� � �������� �������� ���� � ��������� ����������<br />

������������ ����� ����� �������� ��� ��� ������������ ������<br />

��� �������� �������� ��� ������ ���� ��� ��� ������ �� ���<br />

�������� ��� ���� �� ���� ���� ���� ������� �������� ��� ��<br />

���� ���� ����� ������� ��� ������ �������� ����� ����������<br />

������� ������ ��� ����� ������� �����������<br />

����������<br />

�� ���������� ��� �������� �������� ���� ���� �������<br />

���� ���� ����������� ��� ��� ��������� ����<br />

���������� ������������� ������ ��������� ����� ����<br />

�� �� ����� �� ���������� ������ ��� ��� ���������<br />

������ ������� ���� �����<br />

�� �� ����� ���������� ��� ���������� �� ��� ���������<br />

���� ������ �����<br />

�� �� ����� ������ ���� �������� ������� ���� �� ������<br />

��������������<br />

�� �� ������ ����������� �� ��� �������� ����� ���������<br />

����� �������� � ��� ���� �����<br />

�� �� �������� �� ���������� ��� �� �������<br />

����������� �� ��������� ���� ��� �������� ����������<br />

��� ���� ���������� ��� ����������� �� � ������<br />

������������� ����� ���������� ��������������<br />

������������ ��������� � ���� �����<br />

�� �� �� �� �� ������ ��� ���� ���������� ���������<br />

����������������������������������������� ����������<br />

������ �����<br />

�� �������� ��� ��� ��� ������������ �������� �����<br />

�� �� ����� �� �������� �� ��������� �� �����������<br />

�� ������ �� ����� ��� �� �������� ���� ����������<br />

������ ������ �� ������� ����� ������� �����������<br />

�������� �������� ���� ������������ ��� ���������� �<br />

����� ����� �����<br />

��� �� �� �������� ��� �������� �������� ��� �������<br />

���������� �� �����������<br />

���������������������������������������������������<br />

��������������������������������������<br />

�������������������������������� ���� ����� �����<br />

��� �� �� ������ ��� �� �� ���������� ��������������<br />

���������� �� ��������� ����������� ���������������<br />

�������� ���������� �������� ������� �� ������� ���<br />

�������� ������������� ��������� � ���� ����� ������<br />

�������� ��� ������ ���������� �� ���������� ��������<br />

��������������<br />

��� �� ������� �������� ��������������� ��������<br />

��������������� �������� ����� � ��������� ���������<br />

� �� �����<br />

��� �� ������� ��� ��������� ���� ���� ���������� ������<br />

������<br />

��� �� ������ ��� �� ��������� ����������� ������� ��������<br />

����� �� ����� ������ ������ ������ ���� ��������� ���<br />

������� �����


Prototyping a Semi-Automatic In-Car Texting Assistant<br />

Christoph Endres<br />

German Research Center<br />

for Artificial Intelligence<br />

(DFKI)

Saarbrücken, Germany<br />

christoph.endres@dfki.de<br />

ABSTRACT<br />

Texting while driving is dangerous and illegal in most countries.<br />

But social as well as business forces have led to widespread disregard of these bans and, in turn, to potentially lethal situations. We argue that, in addition to legislative regulation,

in-car texting should be made less distracting and<br />

dangerous. We offer a solution for one specific communication<br />

goal, namely staying connected to a social network. We<br />

propose a semi-automatic status-posting system and present<br />

a prototype based on a Pleo. We argue that our approach<br />

should be extended by automated answering mechanisms.<br />

The aim of this paper is to foster discussion on texting while<br />

driving. A solution for one type of semi-automatic texting is outlined; other types of texting need to be examined separately.

Author Keywords<br />

texting while driving, Pleo, semi-automatic texting

ACM Classification Keywords<br />

K.4.2 Computers and Society: Social Issues

INTRODUCTION<br />

With ubiquity and convenience as major driving factors, the spread of mobile email devices such as the BlackBerry and iPhone has grown to tens of millions of users over the last several years [13]. A sustained growth of this trend is expected in the next decade [12]. Mobile email promises seamless anywhere-anytime connectivity. Employees connect with their organizations, increasing productivity [13]. Participants in a

study on BlackBerry use by [12] emphasized the liberating<br />

nature of mobile email by showing how it allowed them the<br />

freedom to work anywhere.<br />

On the other hand, using mobile devices while driving is<br />

without doubt distracting and thus dangerous. After a surge<br />

in horrific automobile accidents in which distracted driving<br />

was proven to be a factor, 38 US states have enacted texting-while-driving bans [5]. Other countries issued similar bans.

Copyright is held by the author/owner(s).<br />

MIAA 2011, February 13, 2011, Palo Alto, CA, USA.

Daniel Braun<br />

Saarland University,<br />

CS Department<br />

Saarbrücken, Germany<br />

daniel.braun@dfki.de<br />


Christian Müller<br />

DFKI

Saarbrücken, Germany<br />

christian.mueller@dfki.de<br />

Figure 1. Pleo robot (Source: Ugobe)<br />

Nevertheless, people continue to text while driving. Reasons<br />

for ignoring bans on texting while driving vary, and include<br />

both business and social forces. People may be tempted to<br />

ignore texting-while-driving bans, because

• professional communication partners expect universal availability.<br />

• driving is perceived as ”dead time” that needs to be filled<br />

with small talk.<br />

• intimates and buddies expect messages to be answered promptly.

• there’s an audience to be constantly supplied with great<br />

content.<br />

In order to tackle this problem, we have to take a closer look<br />

at the different types of texting and the underlying motivation.<br />

Aside from widely known mobile email, we consider the following<br />

texting services relevant in the automotive context:<br />

SMS, Twitter (twitter.com), and Facebook (facebook.com).<br />

The latter are briefly introduced in the following.<br />

Short Message Service (SMS) is mostly used for person-to-person

messaging (chat with friends). The text is limited to<br />

160 characters but the system can segment messages that exceed<br />

the maximum length into shorter messages. [12] argue<br />

that SMS is mostly a private communication means that has


not been widely adopted by the worldwide business community.<br />
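To make the segmentation behaviour concrete, a minimal Java sketch of the idea is given below; the 160-character limit comes from the text above, while the class name and the simplification of ignoring the per-segment header overhead used by real concatenated SMS are our own assumptions.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: splits a long text into SMS-sized parts.
    public class SmsSegmenter {
        private static final int MAX_LEN = 160; // per-message limit cited above

        public static List<String> segment(String text) {
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < text.length(); i += MAX_LEN) {
                parts.add(text.substring(i, Math.min(text.length(), i + MAX_LEN)));
            }
            return parts;
        }
    }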

Microblogging sites like Twitter provide a new means of<br />

communication [10]. Twitter provides the ability to deliver<br />

the data to interested users over multiple delivery channels:<br />

cell phone, Facebook application (see below), email, or as an<br />

Instant Message. A Twitter user interested in the statuses of<br />

another user signs up to be a ”follower”. Updates or posts are<br />

made by succinctly describing one’s current status within a<br />

limit of 140 characters. According to [8], Twitter fulfills the<br />

need for an even faster mode of communication compared to<br />

regular blogging.<br />

Facebook belongs to the category of online social network<br />

(OSN) services. Its core functionality is managing connections<br />

or ”friends” [9]. However, Facebook also provides opportunities<br />

for communication and hosting of content. Facebook<br />

currently has the most users worldwide; other

OSNs are MySpace, Friendster, Bebo, hi5, and Xanga, each<br />

with over forty million registered users [10].<br />

As we pointed out earlier, legislation is unfortunately not<br />

sufficient to keep drivers from potentially lethal habits, so<br />

additional safeguards and alternative solutions need to be developed.<br />

In this paper we propose a way to circumvent manually composing Twitter messages.

OUR PROTOTYPE: PLEOPATRA<br />

The driving context and the nature of the communicative<br />

goal of Twitter lead to a limited number of likely messages,

which are usually diary-like. A typical status might be “We<br />

are already so close to Paris, but now we hit a traffic jam!”<br />

(see Figure 5). We argue that such a message could as well<br />

be generated using a set of message templates and current<br />

status information of the car, e.g. GPS position, current<br />

speed, and available traffic jam warnings. Due to its nature<br />

and complexity, a car on the street is not a very suitable environment<br />

for fast prototyping. In order to evaluate the concept<br />

on a smaller scale, we developed a prototype [4] on a Pleo<br />

toy dinosaur. Due to its complex sensors and single data bus,<br />

the Pleo can be considered a downscaled model of a modern<br />

car, which we will explain below in more detail.<br />

A Pleo is a rather sophisticated device–sometimes also referred<br />

to as an artificial lifeform–equipped with a multitude of

sensors (see Figure 1).<br />

The Pleo hardware is based on an Atmel ARM7 32-bit processor (main CPU), an NXP ARM7 32-bit microprocessor (camera, audio), and four Toshiba TMP86FH47AUG 8-bit microprocessors (motor control).

Movement is achieved through 14 motors with feedback sensors. Additional sensors are:

• A color camera with white light sensor<br />

• Two microphones<br />

• Eight touch-sensors<br />

• Four push-buttons (one under each foot)
• Tilt and shake sensors
• Infrared transmitter and receiver in the mouth
• Infrared transmitter and receiver at the head

Figure 2. Pleopatra Tools Screenshot

Pleo is also equipped with two speakers, internal flash memory, an SD card slot, and a USB interface.

We connect Pleo via its USB interface to a computer in<br />

order to communicate with it. Pleo's USB interface wraps a serial port to which we can connect using standard libraries such as RXTX [7]. To facilitate the communication, we implemented a Java API wrapping the serial protocol, called Pleopatra Tools [3] (see Figure 2); we published the library under the GPL license. Higher-level functions are included in a graphical user interface, which makes interaction with the Pleo easy. The GUI supports establishing a connection to a Pleo; storing personalized information about different Pleos, such as a photo or name, which is recognized instantly once the Pleo is connected; recording audio from the Pleo with direct playback on the PC; inspection and playback of sound, motion, and personality files; and displaying live camera images from the Pleo. The API itself furthermore offers control of motors and sensors, access to the file system, recording of audio from the Pleo in WAV format, and access to the Pleo's camera with saving of BMP images.
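The Pleopatra Tools code itself is not reproduced here, but the following minimal Java sketch illustrates the kind of RXTX boilerplate involved in talking to the wrapped serial port; the port name, baud rate, serial parameters, and the command string are assumptions for illustration and would have to match the actual device and protocol.

    import gnu.io.CommPortIdentifier;
    import gnu.io.SerialPort;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;

    public class PleoSerialSketch {
        public static void main(String[] args) throws Exception {
            // Port name and baud rate are assumptions; adjust for your setup.
            CommPortIdentifier id = CommPortIdentifier.getPortIdentifier("/dev/ttyUSB0");
            SerialPort port = (SerialPort) id.open("PleoSerialSketch", 2000);
            port.setSerialPortParams(115200, SerialPort.DATABITS_8,
                    SerialPort.STOPBITS_1, SerialPort.PARITY_NONE);

            OutputStream out = port.getOutputStream();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(port.getInputStream()));

            // Send a hypothetical query over the wrapped serial protocol
            // and echo whatever the device reports back.
            out.write("sensor status\n".getBytes());
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            port.close();
        }
    }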

Using this API we implemented a monitoring tool which<br />

constantly checks the sensor data for anything extraordinary,<br />

such as sudden darkness, very loud noise, very high or low<br />

temperature, or detection of something green, which is considered food for the Pleo. On detection, an event is triggered. Depending on the type of event, a pre-formulated message is picked from a small database and refined with actual sensor values, e.g. “35 degrees Celsius? It is very hot in here!”. These

messages are then twittered (see Figure 3) via an automated<br />

Twitter interface (jTwitter) [1]. The Twitter application is<br />

also accessible via the Pleopatra Tools’ GUI.
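The monitoring logic can be condensed into a few lines; in the Java sketch below the thresholds, the template wording, and the StatusPoster interface are illustrative assumptions, with the actual prototype posting through jTwitter [1] at the point marked in the comment.

    import java.util.HashMap;
    import java.util.Map;

    public class PleoMonitorSketch {
        /** Posting is abstracted away; the prototype delegates to jTwitter [1] here. */
        interface StatusPoster {
            void post(String message);
        }

        // Pre-formulated templates keyed by event type (wording is illustrative).
        private static final Map<String, String> TEMPLATES = new HashMap<>();
        static {
            TEMPLATES.put("HOT", "%d degrees Celsius? It is very hot in here!");
            TEMPLATES.put("DARK", "Suddenly it got very dark around me.");
        }

        /** Checks one temperature reading and posts if it is extraordinary. */
        static void checkTemperature(int celsius, StatusPoster poster) {
            // 18-23 degrees is the usual range mentioned later in the text;
            // values well outside it trigger an event.
            if (celsius > 30) {
                String message = String.format(TEMPLATES.get("HOT"), celsius);
                poster.post(message); // the refined template goes out as a tweet
            }
        }
    }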


Figure 3. Pleopatra: the first twittering dinosaur in the world<br />

The task we handled here is a typical example of a dually restricted data selection process (see Figure 4). The raw data from the sensors (e.g. motor 4 is blocked at an angle of 35 degrees) is transformed and filtered into higher-level data (e.g. somebody or something holds the front paw). The resulting data is then further filtered according to two resource limitations: first a more technical one (“what is extraordinary enough to be presented?”) and then a more cognitive one (“how much information do we want to publish?”). We will

get back to that concept in more detail later on.<br />

Figure 4. Dual restriction on data<br />

FROM DINOSAUR TO CAR<br />

We argue that a toy robot sensing its environment is comparable

to a sensor-equipped car when it comes to automatic<br />

status message generation. In order to work properly, the<br />

driver has to be identified with his Twitter ID, just as each<br />

Pleo connected to the Pleopatra Tools API must be recognized<br />

by its serial ID before starting the Twitter application.<br />

In a car environment, this could be achieved for instance by<br />

checking the Bluetooth ID of the driver's phone. Typical car

sensors are much more complex than the sensors we have<br />

seen on the Pleo robot, and access to the data is usually not

as uniform as a single USB interface. Data accessible in a<br />

car include current position, speed, heading, temperature (inside

and outside), etc.<br />


The Controller Area Network (CAN) interface standard [2]<br />

was specified by Bosch in 1991 and is nowadays widely<br />

used in cars. It was devised to enable communication between<br />

subsystems of the car, since each subsystem may need<br />

to control actuators or receive feedback from sensors. The<br />

CAN bus may be used in vehicles to establish a connection between the transmission and the engine control unit (the car's main processor), or, for example, to connect the power windows, air conditioning, seat controls, etc.

The number of pre-fabricated messages needed for useful tweet generation in a car is far higher than the few dozen messages in our Pleopatra prototype. Nevertheless, the basic principle stays the same: sensor data is monitored, exceptional values are matched to a database of pre-fabricated messages, and blanks in the message are filled with current

values. The driver then only needs to accept a message for<br />

sending, which is clearly significantly less distracting than<br />

composing a message on a mobile device.<br />

SELECTION OF RELEVANT CONTENT<br />

Selection of relevant information based on a constant sensor<br />

data or information stream is not a trivial task. In [11],<br />

Maybury presents the SumGen system, which “selects key<br />

information from an event database by reasoning about event<br />

frequencies, frequencies of relation between them, and domain<br />

specific importance measures.” The system is able to

tailor a summarized report for a stereotypical user.<br />

More recent works aim at performing such a summarization<br />

in real time in order to emulate a reporter at, for instance, a

sports event. The IVAN system [6] “generates affective commentary<br />

on a tennis game that is given as an annotated video<br />

in real-time. The system employs two distinguishable virtual<br />

agents that have different roles (TV commentator, expert),<br />

personality profiles, and positive, neutral, or negative<br />

attitudes to the players.”<br />

In our example, the information streams to be monitored are<br />

sensor data. Defining which data is “extraordinary” is rather<br />

straightforward here: if the usual environment temperature of the Pleo dinosaur ranges between 18 and 23 degrees Celsius, then 35 degrees is extraordinary. If the dinosaur does not have any input on its touch sensor on the back for 90 percent

of its time, then getting an input there is extraordinary.<br />

The interpretation of sensor data usually depends on the context.<br />

In a toy context such as our Pleopatra prototype, there is not

much variation of context. The dinosaur usually stays more<br />

or less in the same environment, and extracting information<br />

from sensor data is straightforward.<br />

In the automotive context, we have to extend our information<br />

flow example from Figure 4. The car is moving in a complex<br />

environment, so in order to double-check our interpretation of

the sensor data, we need additional environmental evidence<br />

as a second component. If the car is on the highway and<br />

moving at an extraordinarily slow speed or not moving at all, this does not necessarily mean that the driver is stuck in a traffic jam. He might just be resting in a parking lot or visiting a


fast food restaurant's drive-through. But if we do have, for instance, traffic information announcing a traffic jam on that highway to verify our interpretation, the interpretation becomes more reliable. So our first resource limitation is environmental

evidence:<br />

sensor data
+ environmental evidence
→ interpretation of the situation

The situation might be unusual or extraordinary, but to make<br />

it interesting and thus worth tweeting, another contextual<br />

component is usually needed. In our example: Being in a<br />

traffic jam could be something ordinary you encounter on<br />

your everyday commute, but being stuck close to your destination<br />

on a weekend trip is special. We add unusual context<br />

as part of the second, cognitive restriction:<br />

exceptional sensor data
+ environmental evidence
+ unusual context
→ relevant message

At the same time, user-defined parameters like the desired frequency of status posts can be used to optimize the second resource limitation according to the driver's needs.
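A minimal Java sketch of this two-stage restriction follows; the method names, the speed threshold, and the way environmental evidence and user preferences are represented are assumptions made only to make the data flow concrete.

    public class RelevanceFilterSketch {
        /** Stage 1: sensor data plus environmental evidence -> interpretation. */
        static boolean inTrafficJam(double speedKmh, boolean onHighway,
                                    boolean trafficServiceReportsJam) {
            // Low speed alone is ambiguous (parking lot, drive-through, ...);
            // the external traffic report is the environmental evidence.
            return onHighway && speedKmh < 20 && trafficServiceReportsJam;
        }

        /** Stage 2: interpretation plus unusual context -> relevant message? */
        static boolean worthPosting(boolean interpretationConfirmed,
                                    boolean unusualContext,
                                    int postsToday, int maxPostsPerDay) {
            // The user-defined posting frequency tightens the cognitive restriction.
            return interpretationConfirmed && unusualContext
                    && postsToday < maxPostsPerDay;
        }
    }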

CONCLUSION AND OUTLOOK<br />

We presented a prototype of a twittering toy dinosaur and argued<br />

that the introduced principle could - with an increased<br />

complexity and some modifications - be used for an automated<br />

generation of tweets. This automation would reduce<br />

the risk of driver distraction, especially for power users of<br />

social networks who have an urge to stay connected to their<br />

environment. This is of course just a part of the solution.<br />

Other communication goals need to be looked at and analyzed<br />

separately.<br />

As a next step, we can try to include automatic answering mechanisms. For instance, if driver A is on the way to person B, there could be an incoming tweet saying “@DriverA: Where are you?”, and based on the current status, the car could respond immediately: “I am on my way, but right now I am stuck in a traffic jam near Frankfurt, driving at less than 10 mph!”. This is just one example; the possibilities here are manifold.

REFERENCES<br />

1. JTwitter - the Java library for the Twitter API.<br />

http://www.winterwell.com/software/jtwitter.php, 2008.<br />

2. Bosch. CAN Specification, Version 2.0. http://www.semiconductors.bosch.de/media/pdf/canliteratur/can2spec.pdf, 1991.

3. C. Endres and D. Braun. Pleopatra Tools.<br />

http://www.dfki.de/pleopatra, 2009.<br />

4. C. Endres and D. Braun. Pleopatra: A Semi-Automatic<br />

Status-Posting Prototype For Future In-Car Use. In<br />

Adjunct Proceedings of the 2nd International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI 2010), page 7, Pittsburgh, PA, USA, November 2010.

Figure 5. Twittering car: car sensor data + message templates → “We’re already so close to Paris, but now we hit a traffic jam!”

5. Governors Highway Safety Association.<br />

State cell phone use and texting while driving laws.<br />

http://www.ghsa.org/html/stateinfo/laws/cellphone_laws.html,

2010.<br />

6. I. Gregory. Embodied presentation teams: A plan-based<br />

approach for affective sports commentary in real-time.<br />

Master’s thesis, Saarland University, 2010.<br />

7. K. Jarvi. RXTX : serial and parallel I/O libraries<br />

supporting Sun’s CommAPI. http://www.rxtx.org/,<br />

2006.<br />

8. A. Java, X. Song, T. Finin, and B. Tseng. Why we<br />

twitter: understanding microblogging usage and<br />

communities. In WebKDD/SNA-KDD ’07: Proceedings<br />

of the 9th WebKDD and 1st SNA-KDD 2007 workshop<br />

on Web mining and social network analysis, pages<br />

56–65, New York, NY, USA, 2007. ACM.<br />

9. A. N. Joinson. Looking at, looking up or keeping up<br />

with people?: motives and use of facebook. In CHI ’08:<br />

Proceeding of the twenty-sixth annual SIGCHI<br />

conference on Human factors in computing systems,<br />

pages 1027–1036, New York, NY, USA, 2008. ACM.<br />

10. B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps<br />

about twitter. In WOSP ’08: Proceedings of the first<br />

workshop on Online social networks, pages 19–24,<br />

New York, NY, USA, 2008. ACM.<br />

11. M. T. Maybury. Generating summaries from event data.<br />

Inf. Process. Manage., 31:735–751, September 1995.<br />

12. C. A. Middleton and W. Cukier. Is mobile email<br />

functional or dysfunctional? two perspectives on<br />

mobile email usage. European Journal of Information<br />

Systems, 2006.<br />

13. O. Turel and A. Serenko. Is mobile email addiction<br />

overlooked? Commun. ACM, 53(5):41–43, 2010.<br />



Multimodal Summarization of Complex Sentences<br />

Naushad UzZaman<br />

Computer Science Department<br />

University of Rochester<br />

naushad@cs.rochester.edu<br />

ABSTRACT<br />

In this paper, we introduce the idea of automatically<br />

illustrating complex sentences as multimodal summaries<br />

that combine pictures, structure and simplified compressed<br />

text. By including text and structure in addition to pictures,<br />

multimodal summaries provide additional clues of what<br />

happened, who did it, to whom and how, to people who<br />

may have difficulty reading or who are looking to skim<br />

quickly. We present ROC-MMS, a system for automatically<br />

creating multimodal summaries (MMS) of complex<br />

sentences by generating pictures, textual summaries and<br />

structure. We show that pictures alone are insufficient to<br />

help people understand most sentences, especially for<br />

readers who are unfamiliar with the domain. An evaluation<br />

of ROC-MMS in the Wikipedia domain illustrates both the<br />

promise and challenge of automatically creating multimodal<br />

summaries.<br />

Author Keywords<br />

Multimodal summarization, summarization, visualization,<br />

illustration, picture, text-to-picture, automatic illustration,<br />

sentence compression, pictorial representation, AAC,<br />

augmentative and alternative communication, ROC MMS.<br />

General Terms<br />

Algorithms, Experimentation.<br />

ACM Classification Keywords<br />

H5.m. Information interfaces and presentation (e.g., HCI):<br />

Miscellaneous; I.2.7 [Artificial Intelligence]: Natural<br />

Language.<br />

INTRODUCTION<br />

Pictures, diagrams and illustrations are included in<br />

manually-created text because they help people<br />

comprehend and remember information [1]. Including<br />

alternative, supportive representations of text might help<br />

people with reading difficulties understand text better, for<br />

instance those reading text not in their first language,<br />

Permission to make digital or hard copies of all or part of this work for<br />

personal or classroom use is granted without fee provided that copies are<br />

not made or distributed for profit or commercial advantage and that<br />

copies bear this notice and the full citation on the first page. To copy<br />

otherwise, or republish, to post on servers or to redistribute to lists,<br />

requires prior specific permission and/or a fee.<br />

IUI’11, February 13–16, 2011, Palo Alto, California, USA.

Copyright 2011 ACM 978-1-4503-0419-1/11/02...$10.00.<br />

Jeffrey P. Bigham<br />

Computer Science Department<br />

University of Rochester<br />

jbigham@cs.rochester.edu<br />


James F. Allen<br />

Computer Science Department<br />

University of Rochester<br />

james@cs.rochester.edu<br />

children, older adults, or people with cognitive disabilities.<br />

Unfortunately, creating illustrations is expensive and time-consuming, and consequently most text has only a few illustrations, if any at all.

Figure 1: Multimodal summary (MMS) of the sentence, “In 1492, Genoese explorer Christopher Columbus, under contract to the Spanish crown, reached several Caribbean islands, making first contact with the indigenous people.”

In this paper we introduce ROC-

MMS, a system that automatically converts existing text to<br />

multimodal summaries (MMS) that capture the meaning of<br />

a complex sentence in a diagram containing pictures and<br />

simplified text related by structure extracted from the<br />

original sentence.<br />

Motivated by sayings like “A picture is worth a thousand words,” prior work on Automatic Illustration and Text-to-

Picture synthesis has approached the very difficult problem<br />

of generating pictorial replacements for text. Although this<br />

is an interesting challenge, existing systems have generally<br />

found success only within the domain of simple sentences<br />

of the type found in children’s books [2-4]. The problem of<br />

multimodal summarization relaxes the problem by allowing<br />

text to augment pictorial and structural information.<br />

Automatic Illustration is inherently difficult. To understand<br />

the problem better, we initially asked two annotators 1 to<br />

identify the main idea 2 (main event) and related entities<br />

(subject, object, etc) from sentences and find representative<br />

pictures. Sentences were chosen from the Wikipedia entries<br />

United States and France, and annotators were asked to<br />

include Wikipedia pictures in their illustrations. The<br />

annotators reported that it was too difficult to illustrate<br />

19.59% of the entities using Wikipedia pictures and thought<br />

1 Annotators are graduate students and not among the authors.<br />

Their annotations were used as a gold standard in our evaluation.<br />

2 In this paper, we loosely interchange between main idea, main<br />

concept and main event.


that 15.08% of entities couldn’t be represented with<br />

pictures at all (e.g. “territory”, “height of power”, “French<br />

War of religion”, etc., and temporal expressions in general).

These results suggest that it will often be difficult to find<br />

appropriate pictures and some entities are inherently unable<br />

to be illustrated easily with pictures. It can be particularly<br />

difficult to represent entities in an unfamiliar domain. For<br />

instance, if someone doesn’t know what Christopher

Columbus looks like, even a good picture of Christopher<br />

Columbus will only convey general attributes (man,<br />

possibly historical).<br />

To remedy this problem, MMSs keep both images and

representative text, unlike previous systems for automatic<br />

illustration [2-6]. In this way, we can handle cases lacking a<br />

good picture and address cases that are hard to illustrate.<br />

Presenting pictures and text together can also improve both<br />

the understanding and remembering of concepts. According<br />

to dual code theory [7], text and pictures result in two<br />

different kinds of conceptual representations. These<br />

representations may allow independent access to<br />

information and hence benefit retention. Picture and text<br />

repeat important information, and may have similar<br />

beneficial effects on memory as explicit repetitions [8, 9].<br />

Processing the information twice, once as text and once as a<br />

picture, may facilitate comprehension and memory. Finally,<br />

pictures often have a motivating effect, and text with<br />

pictures may also be more enjoyable to read, since the<br />

reader does not have to work as hard to understand the text<br />

and pictures also facilitate better comprehension of the text<br />

broadly beyond what is illustrated [10]. So our decision to include text with pictures is backed by theories suggesting that it helps people better understand and remember the content.

To keep the MMS representations simple and easy to<br />

process, we simplify text so that it retains only the most<br />

important information, instead of the full text. We define<br />

the most important information as the subject (who did it),<br />

the event (what action), object (to whom or what) and<br />

prepositions directly related to the subject, main event, or<br />

object (how). This effectively converts complex sentences<br />

into simpler sentences. In this way, the reader can read out<br />

the text as a simple sentence in addition to seeing the<br />

pictorial view, making it easier to remember and understand<br />

text, and relate it to the full, complex text if they choose,<br />

such as when searching for details abstracted out of the<br />

MMS view.<br />

MMS can potentially help a diversity of readers. For<br />

example, highly-capable readers may use MMS to skim<br />

content or understand content more easily. The alternative,<br />

simplified representation it provides may be useful for<br />

children who are learning to read and for second language<br />

learners, as seeing pictures together with text may enhance<br />

learning [11]. Furthermore, it has been previously shown<br />

that when one component of the reading process is<br />

dysfunctional, other compensating skills may become<br />

highly developed [12]. It is estimated that more than 2<br />


million people in the United States have significant communication impairments that lead them to rely on

methods other than natural speech alone for communication<br />

[13]. Automatic Illustration of texts may eventually help<br />

these people understand text better. Automatic illustration<br />

can also help to support other representations like Pictorial<br />

Temporal representation [14] or can be paired-up with<br />

screen reading applications [15], which could further<br />

benefit people who have problems reading by allowing<br />

them to see content in multiple forms while listening to it<br />

being read.<br />

We define multimodal summarization of complex sentences<br />

as the combination of illustrations and a compressed form<br />

of the sentence text in simple sentence structure. In the next<br />

section we will describe the challenges for multimodal<br />

summarization and describe related work for the required<br />

subtasks. We then describe ROC-MMS, our system for<br />

multimodal summarization and describe an evaluation of it.<br />

Finally, we discuss potential for future work.<br />

SUBTASKS AND RELATED WORK<br />

Multimodal summarization (MMS) of complex sentences<br />

gives readers the main idea of the sentence using pictures<br />

and compressed text structured as simple sentence. Creating<br />

MMSs is challenging and involves many subtasks. In this<br />

section, we will describe each of the subtasks and the<br />

related work for each subtask, and the approach taken in<br />

ROC-MMS. The general steps in the MMS approach are<br />

the following:<br />

1. Identify both the main idea of the sentence and related<br />

entities and use them to create a compressed summary.

2. Extract pictures for the entities.<br />

3. Add structure to the pictures and text.<br />
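Read as a pipeline, the three steps above can be sketched in Java as follows; every interface and method name is a placeholder chosen for illustration rather than part of the actual ROC-MMS implementation.

    import java.util.List;
    import java.util.Map;

    public class MmsPipelineSketch {
        interface SummaryExtractor {      // step 1: main idea and related entities
            Map<String, String> extract(String sentence);            // role -> text
        }
        interface PictureFinder {         // step 2: pictures for the entities
            Map<String, String> findPictures(List<String> entities); // entity -> image
        }
        interface LayoutBuilder {         // step 3: structure pictures and text
            String layout(Map<String, String> roles, Map<String, String> pictures);
        }

        static String summarize(String sentence, SummaryExtractor extractor,
                                PictureFinder finder, LayoutBuilder builder) {
            Map<String, String> roles = extractor.extract(sentence);
            Map<String, String> pictures =
                    finder.findPictures(List.copyOf(roles.values()));
            return builder.layout(roles, pictures);
        }
    }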

Identifying the main idea and related entities<br />

Natural language sentences often convey multiple ideas, but<br />

representing multiple ideas with pictures can quickly<br />

become confusing. We, therefore, chose to express only the<br />

main idea of a sentence with MMS. If readers can<br />

understand the main idea of the sentence, then they may be<br />

able to later use the original text to decipher further details.<br />

The subtask of identifying the main idea of the sentence<br />

itself has two components. First, the important idea (the<br />

main event or main action) must be extracted, and, second,<br />

the entities related to the main idea need to be extracted, as<br />

illustrated in the following example drawn from Wikipedia:<br />

“In 1492, Genoese explorer Christopher Columbus, under<br />

contract to the Spanish crown, reached several Caribbean<br />

islands, making first contact with the indigenous people.”<br />

The summary or compressed form of the sentence is<br />

“Christopher Columbus reached several Caribbean islands<br />

in 1492.” Hence, the main event or main idea in the<br />

sentence is reached and the entities related to the event


reaching are Christopher Columbus (subject), several

Caribbean islands (object) and 1492 (preposition in).<br />

A similar problem already addressed in the natural language<br />

processing community is called sentence compression [16].<br />

In sentence compression, unnecessary information is<br />

removed while retaining the grammaticality of the sentence.<br />

Sentence compression might remove related entities of the main event in the process of removing unnecessary

information. This approach also doesn’t give a simple<br />

sentence structure.<br />

Another approach is main event extraction using the<br />

TimeML annotation scheme [17]. In this scheme, the main<br />

event label corresponds to the main idea of the sentence.<br />

Most competitive systems use syntactic and semantic<br />

information and machine-learning classifiers to identify<br />

events. For an overview of recent systems in this area, see<br />

the results of TempEval-2 [18]. The main events are<br />

annotated as part of the TempEval-2 task, although results<br />

on identifying main events were not explicitly reported.<br />

In the literature on Automatic Illustration for extracting<br />

entities, a popular approach has been to first extract<br />

representative keywords and then generate images for these<br />

keywords [6]. Keyword extraction has been studied in the<br />

natural language processing/information retrieval<br />

community [19, 20]. Goldberg et al. [2, 4] extract actions<br />

(events), who did them and to whom. They don’t focus on<br />

identifying only the important idea (action) because their<br />

experimental domain only contains short and simple<br />

sentences (and are, therefore, unlikely to contain more than<br />

one event). They convert the problem of identifying entities<br />

to a sequence labeling problem and use Conditional<br />

Random Fields for classification. On the other hand,<br />

Mihalcea and Leong [3] do not try to extract the entities,<br />

but they extract the pictures word-by-word and represent<br />

them linearly. Both approaches work best on simple<br />

sentences in which order roughly matches the role of the<br />

extracted entities. The ROC-MMS system includes a full<br />

natural language parse of the complex sentence in order to<br />

extract entities regardless of the order in which they appear.<br />

Extracting Pictures for Text<br />

Once we have the event and related entities, we next extract<br />

pictures to represent each concept. The task of associating<br />

words to pictures is similar to image retrieval. Although<br />

some work uses computer vision techniques for retrieval,<br />

most work (including popular image search engines) rely<br />

primarily on the text found near images in documents to<br />

find general images [21]. ROC-MMS generally follows this<br />

approach as well, but uses additional information<br />

automatically generated from the structure of the sentence<br />

to weight its search terms.<br />

Text-to-scene conversion places objects in 3D environment<br />

and is intended to aid graphic designers. This usually works<br />

with detailed descriptive text with visual and spatial<br />

elements. One of the best-known systems of this kind is<br />

WordsEye [22]. They are usually not intended as assistive<br />


tools to communicate general text, because in that domain the texts usually describe a situation like “the house is 7 feet tall with two glass windows and a door,” and the

system will try to interpret the natural language and create<br />

the 3D environment of the described situation. In contrast,<br />

we want to take a sentence from an existing news source,<br />

Wikipedia, or a book and represent it with pictures to help<br />

people to understand the text better.<br />

Barnard and Forsyth [23] introduced the idea of auto-illustration as the inverse of auto-annotation. Joshi et al. [6]

approached this problem by considering the pair-wise<br />

reinforcement based on both visual and WordNet-based<br />

lexical similarity. This work identifies a few representative<br />

pictures for a story, which has practical applications like<br />

identifying representative pictures for news articles or different articles, but it is not appropriate for our problem.

Goldberg et al. [2, 4] built their own database of images to use for certain texts; if they cannot find an appropriate image in their database, they perform a web image search and apply some vision techniques to identify an appropriate picture. Mihalcea and Leong [3] use an in-house image database, PicNet, and other resources 3.

Adding Structure to Improve Understanding<br />

Having identified pictures and compressed text, the final<br />

step is to combine these elements in a layout structurally<br />

representative of what happened, who did it, to whom and<br />

how. To our knowledge, the only other work that attempts<br />

to address this problem is Goldberg et al. [2]. Their system<br />

identifies "who", "what action" and "to whom" by<br />

converting the problem into sequence labeling. They<br />

propose a layout represented by the sequence ABC, where<br />

A represents who did the action, B is what action was done<br />

and C is to whom. An example output of their system for<br />

“The girl rides the bus to school in the morning” is below:<br />

Figure 2: Example output of [2] illustrating the labeling of<br />

sequences where each element is assigned a picture.<br />

In this work, the textual information is ignored and<br />

represented only with pictures. Images incorrectly extracted in the previous step may confuse people more than help

them because there is no additional information to guide<br />

them to the correct interpretation. MMS includes extracted<br />

text in case of errors. With both picture and compressed<br />

text, we can represent hard-to-depict, but important, entities<br />

with text that may be ignored by prior work. We do not<br />

attempt to represent events (the action) with a picture, since<br />

this is a much more challenging task.<br />

3 http://tell.fll.purdue.edu/JapanProj/FLClipart/


This work also tries to identify the A (who), B (what action)<br />

and C (to whom) of their ABC layout by converting it to a<br />

sequence-tagging problem, which is well studied in NLP<br />

[24]. The problem with that approach is the requirement for<br />

hand-labeled training data, which is a barrier to adapting the solution to a different or more complex

domain. ROC-MMS uses dependency parsing to identify<br />

similar dependencies or related entities, without needing the<br />

hand-annotated training data.<br />

Finally, they restrict their attention to single simple<br />

sentences and their experiments were on domains that use<br />

very simple English, such as short narratives written by and<br />

for individuals with communicative disorders; one-sentence<br />

news synopses written in simple English targeting foreign<br />

language learners; and the child writing sections of the<br />

LUCY corpus. For complex sentences, they anticipate the<br />

use of text simplification to convert complex text into a set<br />

of appropriate inputs for their system. It is not clear how<br />

well they can eventually represent the complex sentences in<br />

their layout, since they are not considering “how”<br />

something happened.<br />

ROC-MMS addresses these problems for unrestricted texts<br />

that include complex and compound sentences.<br />

ROC-MMS

In this section we describe ROC-MMS and how it approaches the subtasks described in the previous section.

Identifying the main event(s)

ROC-MMS finds concepts by identifying the events and related entities, and then identifies the main event, which corresponds to the main concept or idea of the sentence.

Event extraction

Our view of events matches the TimeML temporal annotation scheme [17], which treats "event" as a cover term for situations that happen or occur.

ROC-MMS extracts events using the TRIOS system [25], which had a very competitive performance in the TempEval-2010 task for temporal information extraction [18]. The TRIOS system first parses text with the TRIPS parser [26] and uses hand-coded rules to extract events. The extraction rules are tuned for high recall and identify many more events than necessary, including a few non-events. In the next step, a classifier is used as a filter to remove the spurious events.

The main event identification classifier takes all events of a sentence as input and identifies the main event of that sentence. In one of the TempEval-2010 tasks, main events were labeled; we used that labeled data to train our main event classifier. For this classification task, we used an off-the-shelf Markov Logic Network classifier (thebeast).⁴ As features, we used lexical features (word, stem, next word, previous word, previous verbal word sequence), syntactic features (part-of-speech tag, tense, voice, polarity, TimeML aspect, modality, POS sequence, previous verbal POS sequence, next POS, previous POS) and semantic features (abstract semantic class – ontology type, TimeML class, semantic roles and their arguments) of events. The syntactic and semantic features are mostly generated from the TRIPS parser output and from other classifiers.

⁴ http://code.google.com/p/thebeast/

This classifier first identifies the main events in the sentences. We then run a second pass to make sure every sentence has at least one main event: if the classifier did not identify a main event in a sentence, we take its first verbal event as the main event. We back off to the first verbal event because it has a high baseline performance for the main-event identification task.
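The back-off pass can be sketched as follows (a minimal illustration with a hypothetical Event record; it is not the actual ROC-MMS code, which operates on TRIOS/TRIPS output):

from dataclasses import dataclass

@dataclass
class Event:
    text: str
    is_verbal: bool
    is_main: bool = False

def ensure_main_event(events):
    # Second pass: if the classifier marked no event in this sentence as
    # main, promote the first verbal event to main event.
    if not any(e.is_main for e in events):
        for e in events:
            if e.is_verbal:
                e.is_main = True
                break
    return events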

Extract entities related to the event

Instead of extracting all entities in the sentence [3], we extract only those entities related to the main event. We use the relations between the event and the related entities in the next step to structure them. From the parsed representation created by the Stanford dependency parser,⁵ we find dependencies⁶ in order to extract the subject (nominal subject - nsubj, agent), the object (direct/indirect object - dobj/iobj, passive nominal subject - nsubjpass) and other dependencies (prepositions). For easier representation, we cluster all prepositional modifiers into a single entity, but include the preposition itself in the representation.

An example will help to illustrate how we use the dependency output to extract related entities for the events. The following is the Stanford dependency parser output for the sentence, "French fur traders established outposts of New France around the Great Lakes."

amod(traders-3, French-1)
nn(traders-3, fur-2)
nsubj(established-4, traders-3)
dobj(established-4, outposts-5)
nn(France-8, New-7)
prep_of(outposts-5, France-8)
det(Lakes-12, the-10)
nn(Lakes-12, Great-11)
prep_around(established-4, Lakes-12)

The main event here is established, the subject is traders, the object is outposts and the preposition (around) is Lakes. By propagating through nn (noun compound modifier) and amod (adjectival modifier) dependencies, we extract the following entities: (subject: "French fur traders"), (object: "outposts") and (preposition: "Great Lakes"). For subject, object and prepositions, we propagate through nn and amod in this way and extract the resulting entities. The next step is to find representative pictures for the entities. If we fail to find an image for an entity, we propagate through all dependencies (instead of just nn and amod) to extract an entity phrase. For example, we would extract the phrase "outposts of New France" for the object and "the Great Lakes" for the preposition in the example above. We then search for a picture of the entity phrase instead of the entity. These steps are described in more detail next.

⁵ Stanford dependency parser: http://nlp.stanford.edu/software/lex-parser.shtml
⁶ Details on dependencies: http://nlp.stanford.edu/software/dependencies_manual.pdf
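The extraction of related entities from the dependency triples can be sketched as follows (a simplified illustration assuming the dependencies are available as (relation, (head, index), (dependent, index)) tuples; the clustering of multiple prepositional modifiers into one entity is omitted here):

from collections import defaultdict

def extract_entities(deps, main_event_idx):
    # deps: list of (relation, (head_word, head_index), (dep_word, dep_index))
    # tuples, as in the Stanford dependency output above.
    heads = {}                      # token index -> word
    modifiers = defaultdict(list)   # governor index -> [(index, word), ...] from nn/amod
    roles = {}                      # head index of each related entity -> role label
    for rel, (hw, hi), (dw, di) in deps:
        heads[hi], heads[di] = hw, dw
        if rel in ("nn", "amod"):
            modifiers[hi].append((di, dw))
        elif hi == main_event_idx:          # only dependents of the main event
            if rel in ("nsubj", "agent"):
                roles[di] = "subject"
            elif rel in ("dobj", "iobj", "nsubjpass"):
                roles[di] = "object"
            elif rel.startswith("prep_"):
                roles[di] = "preposition (%s)" % rel.split("_", 1)[1]
    entities = {}
    for idx, role in roles.items():
        # propagate through nn/amod modifiers and restore word order
        tokens = sorted(modifiers[idx] + [(idx, heads[idx])])
        entities[role] = " ".join(word for _, word in tokens)
    return entities

deps = [("amod", ("traders", 3), ("French", 1)),
        ("nn", ("traders", 3), ("fur", 2)),
        ("nsubj", ("established", 4), ("traders", 3)),
        ("dobj", ("established", 4), ("outposts", 5)),
        ("nn", ("France", 8), ("New", 7)),
        ("prep_of", ("outposts", 5), ("France", 8)),
        ("det", ("Lakes", 12), ("the", 10)),
        ("nn", ("Lakes", 12), ("Great", 11)),
        ("prep_around", ("established", 4), ("Lakes", 12))]
print(extract_entities(deps, main_event_idx=4))
# {'subject': 'French fur traders', 'object': 'outposts', 'preposition (around)': 'Great Lakes'}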

Extracting Pictures for Concepts

Image retrieval is a complicated task, even for humans, because what constitutes a representative image is subjective. As a result, we simplified the problem by restricting our image search to Wikipedia, which we have found to often produce appropriate images. This has two benefits: (i) pictures of an entity are often found on the wiki page for that entity, and (ii) Wikipedia articles often have infobox pictures selected by human editors, which tend to be correct and representative.

Finding pictures for an event ("what action" according to [2]) is much harder. When humans are asked to find pictures for events, they will often search for the event along with the subject or object. For example, for the event "conquered" in the context "Rome conquered the Gauls", an appropriate image would likely include Roman soldiers (it would be even better if it somehow indicated that the conquering occurred in Gaul). Search results for conquered alone include the following images in the top results:

Figure 3: First three results from Yahoo Image Search for the word "conquered", illustrating the difficulty of finding good representative pictures even for simple concepts.

A useful heuristic for finding better representative images is therefore to concatenate the action with the subject and object (or with just the subject or object, if the other is not available). Even so, web image search often does not return the most appropriate image for our purposes as the first result. This is fine for humans, who can glance through the top few results and pick the most appropriate one. Restricting pictures to Wikipedia is a simple way to produce better results.

Our methods for identifying pictures are described below as separate modules.

Module find_image_in_wikipage(wikiurl):
(i) Find the infobox picture.
(ii) If the infobox has multiple pictures, consider the picture with the largest width.⁷
(iii) If there is no infobox picture:
  a. Find all images.
  b. Tokenize each image filename⁸ with "_", ",", "[A-Z]", and spaces as delimiters.
  c. For each image:
    i. Find the edit distance between the tokenized filename and each word in the wiki article name.
    ii. Sum all scores; this is the relatedness score for the image.
  d. Return the picture with the highest score, together with the score.
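A minimal sketch of the relatedness scoring in step (iii), assuming a normalized string similarity in place of the raw edit distance (the exact normalization is not specified in the paper) and weighting alt-tag text by 0.25 as in footnote 8:

import re
from difflib import SequenceMatcher

def relatedness(filename, article_name, alt_text="", alt_weight=0.25):
    # Score how related an image filename (and, with lower weight, its alt
    # text) is to the wiki article name.
    def tokens(s):
        s = re.sub(r"\.[A-Za-z0-9]+$", "", s)        # drop the file extension
        s = re.sub(r"(?<!^)(?=[A-Z])", " ", s)       # split CamelCase
        return [t.lower() for t in re.split(r"[_,\s]+", s) if t]

    article_words = tokens(article_name)

    def score(text, weight):
        total = 0.0
        for tok in tokens(text):
            best = max((SequenceMatcher(None, tok, w).ratio()
                        for w in article_words), default=0.0)
            total += best * weight
        return total

    return score(filename, 1.0) + score(alt_text, alt_weight)

def best_image(images, article_name, threshold=1.0):
    # images: list of (filename, alt_text) pairs found on the wiki page.
    # Returns the highest-scoring image and its score, or None if nothing
    # clears the threshold (1.0, as in the paper).
    scored = [(f, relatedness(f, article_name, alt)) for f, alt in images]
    if not scored:
        return None
    f, s = max(scored, key=lambda x: x[1])
    return (f, s) if s >= threshold else None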

Module find_page_and_image(query):
(i) Search with "wikipedia " + query using the Yahoo search API.⁹
(ii) Keep only en.wikipedia pages.
(iii) Traverse the resulting wiki pages one by one:
  (a) Get the representative image and its score from the wiki page's URL using find_image_in_wikipage(result page).
  (b) If the resulting image's score is above a threshold (we used 1.0), return the image.

Module sentence_to_images(sentence):
(i) Extract the events, the main event, and the entities and entity phrases related to the main event (all described in the previous section).
(ii) For each of the dependencies (subject, object, prepositions):
  (a) If any word forms a main Wikipedia entry: find the image in those wiki URLs using find_image_in_wikipage(wikiurl).
  (b) If no result has been found so far and the entity does not have a wiki link: find the image using Yahoo search with find_page_and_image(entity).
  (c) If no result has been found so far and any word in the entity phrase is linked to wiki URLs: find the image in those wiki URLs using find_image_in_wikipage(wikiurl).
  (d) If no result has been found so far and the entity phrase does not have a wiki link: find the image using Yahoo search with find_page_and_image(entity phrase).

⁷ We found that when there are multiple pictures, the picture with the larger width is usually the main representative picture.
⁸ We only consider the tokenized filename because (i) Wikipedia has very descriptive image filenames, and (ii) the text descriptions next to images are not consistent: some pictures have lots of text and others have none, since descriptions are sometimes neglected by contributors when the wiki entry is not very prominent. We do also consider the alt tags of images, which are likewise sparse, so we give that score a lower weight (0.25 for alt tags versus 1.0 for the image filename score).
⁹ http://developer.yahoo.com/search/web/V1/webSearch.html
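The fallback cascade in step (ii) can be sketched as plain control flow; find_image_in_wikipage and find_page_and_image stand for the modules above and are stubbed out here, and wiki_links is a hypothetical mapping from words in the source sentence to the Wikipedia URLs they link to (multi-word link anchors are ignored in this sketch):

def find_image_in_wikipage(wikiurl):
    return None   # stub for the module described above

def find_page_and_image(query):
    return None   # stub for the module described above

def entity_to_image(entity, entity_phrase, wiki_links):
    # Each helper is assumed to return (image, score) or None.
    # (a) a word of the entity is linked to a Wikipedia article
    for word in entity.split():
        if word in wiki_links:
            result = find_image_in_wikipage(wiki_links[word])
            if result:
                return result
    # (b) nothing found and the entity itself has no wiki link: web search
    result = find_page_and_image(entity)
    if result:
        return result
    # (c) a word of the longer entity phrase is linked to a wiki article
    for word in entity_phrase.split():
        if word in wiki_links:
            result = find_image_in_wikipage(wiki_links[word])
            if result:
                return result
    # (d) last resort: web search on the entity phrase
    return find_page_and_image(entity_phrase)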

Consider the following clarifying example. The input sentence from Wikipedia is "French fur traders established outposts of New France around the Great Lakes." (Underlined words are links to other Wikipedia pages.) ROC-MMS extracts established as the main event (in this case, the only event), and the extracted entities and entity phrases are: (subject: French fur traders), (subject phrase: French fur traders), (object: outposts), (object phrase: outposts of New France), (preposition: around – Great Lakes), (preposition around phrase: the Great Lakes).

First consider the subject, French fur traders. "Fur traders" has a wiki link, but the linked page does not have an infobox. For the images on the linked page, we compute the edit-distance score between each tokenized filename and the article name (Fur trade) and select the best image according to the process described above.

Next we consider the object outposts, which does not have a wiki link. We search using Yahoo!, restricting results to Wikipedia pages, which does not return any image above the threshold in the first 10 result pages. We then check the object phrase, outposts of New France; New France has a wiki link, and we find a representative picture from that link.

In our algorithm, we search for the entity first, instead of checking wiki URLs in the entity phrase, because Wikipedia contributors sometimes fail to link entities to their wiki articles. For those cases, our yahoo_search module finds the expected wiki article. So we try this step first and, if it fails, we check the wiki links in the entity phrase, as shown in this example. Finally, the preposition (around) is Great Lakes, which links to its wiki article, and we get a representative picture for it too.

If there are multiple wiki links in an entity (or entity phrase), then we find images from all wiki links and cluster them.

Figure 4: Clustered image of Genoa and Christopher Columbus for the entity "Genoese explorer Christopher Columbus".

We also cluster all prepositions. The sentence "The modern name 'France' derives from the name of the feudal domain of the Capetian Kings of France around Paris" contains two prepositions, from and around. We extract pictures for from the name of the feudal domain of the Capetian Kings of France and also for around Paris, and then combine them.

Figure 6: Example of clustering prepositions.
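The paper does not specify how the retrieved images are combined into one clustered image; one simple reading, sketched below with Pillow, is to scale the images to a common height and paste them side by side (an assumption, not necessarily the authors' method):

from PIL import Image

def cluster_images(paths, height=200, gap=10, background=(255, 255, 255)):
    # Scale all retrieved images to a common height and paste them
    # side by side on a single canvas.
    if not paths:
        raise ValueError("no images to cluster")
    imgs = []
    for p in paths:
        im = Image.open(p).convert("RGB")
        width = max(1, int(im.width * height / im.height))
        imgs.append(im.resize((width, height)))
    total_width = sum(im.width for im in imgs) + gap * (len(imgs) - 1)
    canvas = Image.new("RGB", (total_width, height), background)
    x = 0
    for im in imgs:
        canvas.paste(im, (x, 0))
        x += im.width + gap
    return canvas

# e.g. cluster_images(["genoa.jpg", "columbus.jpg"]).save("clustered.jpg")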

Our annotators were unable to find images to represent temporal expressions, and indeed this is a difficult problem. We therefore give temporal expressions special treatment. To identify them, we use the TRIOS temporal expression identification and normalization system¹⁰ [25], which had the second best performance in TempEval-2 [18]. When we identify a time, instead of searching for a picture of it, we represent it with a generic image that conveys time and add the extracted text below it. An example is given below.

Figure 5: The representation of a temporal expression includes the extracted text and a picture. The picture conveys time generally, but not a specific time.

Structuring the images and compressed text

The final step is to combine the images and compressed text into a structured format.¹¹ Every sentence has a main event (which we do not try to represent with a picture), a subject entity, an object entity and clustered prepositions. We construct the MMS using the following visual layout of these elements.

Figure 7: Generalized visual layout for MMS.

This representation is very similar to the ABC layout [2], since the subject and object are essentially who did the action and to whom. The primary difference is that MMS includes prepositions and does not attempt to find a picture for the main event. As mentioned earlier, it is not clear from their description how they represent hard-to-depict events. It might have worked in their simple domain; however, they state that they only find pictures for easy-to-depict words, so many events may be missed as part of the filtering process. ROC-MMS makes appropriate trade-offs that enable it to create MMS diagrams for arbitrary text, even text that includes complex sentences.

¹⁰ The temporal expression normalizer is also available as open source at: http://www.cs.rochester.edu/u/naushad/temporal
¹¹ All our auto-generated diagrams are generated using the GraphViz toolkit.
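Since the diagrams are rendered with GraphViz (footnote 11), the layout step can be sketched roughly as below using the Python graphviz bindings (an assumption; the exact arrangement of Figure 7 is not reproduced, and the node styling is illustrative only):

from graphviz import Digraph

def mms_diagram(subject, event, obj, preps, out="mms"):
    # subject, obj and every element of preps are (compressed text, image
    # path) pairs; event is text only, since ROC-MMS does not look for a
    # picture of the main event.
    g = Digraph("MMS", format="png")
    g.attr(rankdir="LR")

    def picture_node(name, text, image):
        # a box showing the retrieved picture with the compressed text below it
        g.node(name, label=text, image=image, shape="box",
               labelloc="b", height="1.5", imagescale="true")

    picture_node("subject", *subject)
    g.node("event", label=event, shape="ellipse")
    picture_node("object", *obj)
    g.edge("subject", "event")
    g.edge("event", "object")
    for i, (text, image) in enumerate(preps):
        picture_node("prep%d" % i, text, image)
        g.edge("event", "prep%d" % i, style="dashed")
    g.render(out, cleanup=True)

# e.g. mms_diagram(("French fur traders", "fur_trade.jpg"), "established",
#                  ("outposts of New France", "new_france.jpg"),
#                  [("around the Great Lakes", "great_lakes.jpg")])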

One example output from our system is given below:

Figure 8: Multimodal summary (MMS) of the sentence, "French fur traders established outposts of New France around the Great Lakes; France eventually claimed much of the North American interior, down to the Gulf of Mexico."

Some sentences do not contain prepositions (or they may not be correctly extracted). In such cases, we show only the event, subject and object, as shown below.

Figure 9: MMS of the sentence, "The Carolingian dynasty ruled France until 987, when Hugh Capet, Duke of France and Count of Paris, was crowned King of France."

For sentences lacking an object, we merge the event text with the subject text and show it in the subject text field. In the following example, died (the event) is merged with Charles IV (the subject).

Figure 10: MMS of the sentence, "Charles IV (The Fair) died without an heir in 1328."

EVALUATION

Illustrating a sentence with a diagram of pictures and text is difficult; evaluating how good a diagram is may be even harder because it is very subjective. In this section, we first evaluate the subtasks of our multimodal summarization system in isolation. We then evaluate how well our representation retains the information of the overall sentence. All our evaluations are done on 44 sentences drawn from the Wikipedia articles on the United States and France.

Identifying the Main Event and Related Entities

We trained our main event identification classifier on TempEval-2 training data and tested it with 10-fold cross-validation. Our performance for main event identification was around 77.94% (F-score); the baseline of choosing the first verbal event as the main event achieves around 59.64% on the TempEval domain. We then ported the system to the Wikipedia domain and evaluated it considering each annotator as the gold standard. We calculated precision and recall for both cases; the performance is reported in Table 1.

Metric      Performance
Precision   79.10%
Recall      73.11%
F-score     75.98%

Table 1. Main event identification performance

We extract entities by first traversing the nn (noun compound modifier) and amod (adjectival modifier) dependencies of the dependency tree. If that entity results in a good picture (the matching score is above the threshold), we keep it; otherwise we traverse all dependencies of the event, resulting in a phrase. Our extracted entities often do not exactly match the annotators' entities but may partially¹² match them. We report the average performance (over both annotators) of our system on entity extraction in Table 2. We only consider cases in which our system and the annotators identified the same main event.

Metric                      Performance
Average strict precision    29.29%
Average strict recall       31.64%
Average relaxed precision   76.76%
Average relaxed recall      83.82%

Table 2. Entity extraction performance
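Strict versus relaxed matching (footnote 12) can be made concrete with a small sketch; how the per-annotator averages in Table 2 were aggregated is not fully specified, so only the matching criterion is shown:

def entities_match(sys_entity, gold_entity, relaxed):
    a, b = sys_entity.lower(), gold_entity.lower()
    return (a in b or b in a) if relaxed else a == b

def precision_recall(system_entities, gold_entities, relaxed=False):
    # Strict matching requires identical strings; relaxed matching accepts
    # one entity being a substring of the other (footnote 12).
    tp_sys = sum(any(entities_match(s, g, relaxed) for g in gold_entities)
                 for s in system_entities)
    tp_gold = sum(any(entities_match(s, g, relaxed) for s in system_entities)
                  for g in gold_entities)
    precision = tp_sys / len(system_entities) if system_entities else 0.0
    recall = tp_gold / len(gold_entities) if gold_entities else 0.0
    return precision, recall

# precision_recall(["French fur traders", "outposts"],
#                  ["French fur traders", "outposts of New France"])
# -> (0.5, 0.5) strict; with relaxed=True -> (1.0, 1.0)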

Extracting Pictures

To evaluate how well our system extracts pictures, we compared our system's output to the extractions of two human annotators. We consider cases where our system and an annotator, with relaxed matching, identified the same main event and the same entities and both extracted an image. In Table 3, we show the percentage of cases in which both extracted an image, given that both extracted the same entity. Not all extracted entities have a picture: the human annotators sometimes did not extract a picture because they thought a concept could not be illustrated with a picture, or because they thought there was no suitable picture in Wikipedia to represent that entity. Likewise, our system does not suggest a picture for an entity if no picture scores above the threshold. We also compared the two annotators with each other and report the average system performance. Our system's performance is very similar to the agreement between the two annotators.

Evaluation                        Both entities got an image
Annotator1 vs Annotator2          66.66%
Average of Annotators vs System   65.47%

Table 3. Performance of image extraction

¹² Either our entity is a substring of the annotator's entity, or vice versa. Relaxed matching is partial matching.

On these selected matching pictures, we compare our extracted image with the images extracted by the annotators. We classify our output into Same Image (the system and the annotator extracted the same image), Different Image but acceptable (e.g., for France, one extracted the French flag and the other extracted a map of France), and Bad Image (images we consider unacceptable, i.e., a wrong representation of the text). A judge (another graduate student, who was neither an annotator nor an author) performed this classification.

Evaluation                        Ann 1 vs Ann 2    Ann vs System (average)
Exact same image                  47.05%            21.51%
Different image, but acceptable   52.95%            44.15%
Different and bad image           –                 34.34%

Table 4. Performance on the quality of our extracted images

We can see that our system extracts a decent picture around 65% of the time.

How well our structure with simple compressed text helps readers understand the text

In the previous subsections, we showed our performance on the different subtasks, which eventually propagates to the final performance; but overall, how well does our system generate diagrams that convey the message of the content to users? Does automatic illustration really help text comprehension? Do human-generated illustrations help text comprehension? An illustration without text is unlikely to be useful if the domain is new to the reader, because the reader will not be able to interpret the pictures in the first place. That is why MMS diagrams include simple compressed text and a simple structure along with the event, subject, object, and prepositions.

In this section, we motivate MMS over picture-only diagrams by showing that users get a better understanding from the MMS diagrams generated by ROC-MMS than they do from diagrams containing only pictures, even when human annotators have identified the pictures.

For this evaluation, we recruited participants on Amazon Mechanical Turk.¹³ In the task shown to participants, we display our system-generated MMS diagram and ask the Turkers to explain the diagram in English. Participants were also given the option of saying that they "Can't explain the diagram." One example is shown in Figure 11.

Figure 11: ROC-MMS generated diagram for "Gaul was conquered by Rome under Julius Caesar in the 1st century BC"

Next we created a diagram using the entities and pictures selected by the human annotators (representing a gold standard), but without the structural layout or text of our MMS diagrams. Influenced by Mihalcea and Leong [3], this baseline orders the pictures of the entities in sentence order. For example, for the sentence "Gaul was conquered by Rome under Julius Caesar in the 1st century BC", we created the diagram with the picture for Gaul first, then the event conquered (as text), then the picture for Rome and finally Julius Caesar. The annotators thought 1st century BC was hard to illustrate, and so did not find a picture for it. We asked our annotators not to find pictures for events, since we do not represent events with pictures; instead we added the text for the events in the annotators' diagrams. One example diagram is shown in Figure 12.

Figure 12: Diagram using human-identified entities and pictures for "Gaul was conquered by Rome under Julius Caesar in the 1st century BC"

Although the pictures are accurate, it is quite difficult to work out the meaning of this diagram. We see two maps; many people might not understand which country or place each one is. Even if they were to somehow interpret the first one as Gaul and the second as Rome, they would likely read it wrongly as Gaul conquered Rome, because the diagram is linearly ordered instead of using a subject, event, object structure like ours. In contrast, our diagram for the same example failed to get a good representative picture for Rome, and the Stanford parser failed to find that 1st century BC is also related to the event conquered; but with structure and text, many people were able to understand the content and produced something very similar to the original summary text.

¹³ Mechanical Turk website: www.mturk.com. For this task, we paid $0.01 for explaining a diagram with text. For each sentence, we collected responses from 10 unique workers.

For each sentence, 10 different Turkers provided explanations in English of the diagrams (both those generated by our system and those built from the two annotators' selections). We used Rouge [27], the automatic evaluation toolkit for summarization, to test how well their explanations retained the information of the original sentence's summary. We generated the reference summaries using the annotators' identified entities and events, ordered linearly like the diagram. For the example given above, our annotator's reference summary was "Gaul conquered Rome Julius Caesar 1st century BC". These reference summaries are not grammatical and consist only of the main event and the important entities. The Rouge evaluation handles this well because it is based on n-gram matching and does not consider the grammaticality of sentences. For each system, we compute the average Rouge score for each sentence (averaging over the 10 Turkers' scores) and then average over all sentences. We also average the two annotators' scores and report the average annotator Rouge score.

We report both Rouge-1¹⁴ and Rouge-L, since both perform very well for evaluating very short, headline-like summaries [27]. We report precision (P), recall (R) and F-score (F).
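As a reminder of what these scores measure, Rouge-1 reduces to clipped unigram overlap between an explanation and the reference summary; a minimal sketch (the actual evaluation used the Rouge toolkit [27]):

from collections import Counter

def rouge_1(candidate, reference):
    # Clipped unigram overlap between a Turker's explanation (candidate)
    # and the reference summary.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# rouge_1("Rome conquered Gaul under Julius Caesar",
#         "Gaul conquered Rome Julius Caesar 1st century BC")
# counts 5 overlapping unigrams out of 6 candidate / 8 reference tokens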

Evaluation                            Rouge-1            Rouge-L
Explanation of annotators' diagrams   0.0892482 (F)      0.08451066 (F)
                                      0.0680995 (R)      0.0635695 (R)
                                      0.1294495 (P)      0.1260265 (P)
Explanation of the ROC-MMS diagrams   0.2405093 (F)      0.21649513 (F)
                                      0.26668 (R)        0.23619 (R)
                                      0.2190162 (P)      0.199832 (P)

Table 5. Rouge-1 and Rouge-L for explanations of the annotators' diagrams (average) and of our system's diagrams

The results match our intuition: participants did not do a very good job of explaining a diagram with a sentence when they were provided with only pictures – even though human annotators selected those pictures. Our system, on the other hand, did much better, despite the possibility of cascading errors from parsing, main event identification, entity extraction and picture identification.

¹⁴ Rouge-1 is based on unigrams and Rouge-L is based on the longest common subsequence.

Although the inclusion of text gave the MMS diagrams some advantage in the Rouge measurement, which is based on n-grams, the results still suggest that ROC-MMS is able to accurately identify the main concepts of the sentences and produce pictures that are reasonable. More broadly, this evaluation shows the advantage of adding even minimal text, as many participants were largely unable to produce accurate descriptions of the diagrams containing only pictures. Surprisingly, few participants simply wrote the text contained within the MMS diagrams, suggesting that the evaluation was more nuanced.

We believe that MMS diagrams will eventually be helpful for people who have trouble reading and understanding complex text, and may help capable readers skim documents more easily. The end goal of MMS is to improve reading comprehension; ROC-MMS represents an important step in this direction.

FUTURE WORK

We evaluated ROC-MMS on the Wikipedia domain to show that multimodal summarization can be applied to complex text in order to generate diagrams that combine text, pictures, and structure. These evaluations have shown the promise of creating MMS diagrams completely automatically for arbitrary text, and suggest numerous future research opportunities.

First, our system currently relies partly on Wikipedia. An obvious extension would be to explore its performance on raw text and adapt its modules to handle more general resources. The TRIPS parser used in ROC-MMS already identifies named entities, which we may be able to use to find better pictures for specific kinds of entities: for a person we might search for a portrait, for a country a flag or map.

Multimodal summarization sits between two extremes. One extreme would be to consider all events instead of only the main events, i.e., to represent everything with pictures and text. This may be useful for people who have trouble reading and want as much of the information as possible in a multimodal representation. The other extreme is to first apply sentence-level summarization to pick the important sentences and then apply multimodal summarization only to the selected sentences; this would represent only the important sentences and only the important information in them, which could be very useful for capable readers skimming through articles. Exploring the relative benefits along this dimension could better characterize their potential.

We simplified the problem of illustration by not representing events with pictures, because events are usually hard to depict. Future work may try to illustrate events by more intelligently searching for the event along with its subject and object. We also want to extend the proposed multimodal summarization by adding a speech modality [15].

Finally, we want to extend our evaluation to look at how MMS (and other summary techniques) improves reading comprehension for the target groups who motivated this work – specifically, people who have difficulty reading.

CONCLUSION

In this paper, we approached the problem of visualizing text as multimodal summarization. To create MMS diagrams, we automatically summarize text by extracting simple sentence structures (subject – who did it, event – what happened, object – to whom, preposition – how) and illustrate the text with pictures and compressed text together. Our evaluation showed that we achieve good performance on all of the subtasks required to create MMS diagrams, and that the MMS diagrams generated by ROC-MMS were easier to understand than human illustrations with pictures alone. Our implementation and evaluation leveraged the Wikipedia domain, but the approach embodied in ROC-MMS can be generally extended to unrestricted text.

ACKNOWLEDGMENT

We thank the three anonymous reviewers for their valuable feedback. We also thank Benjamin van Durme for his suggestion of prototyping on the Wikipedia domain, and Anna Loparev, Amal Fahad and Shantonu Hossain for help with annotation tasks.

REFERENCES

1. R. N. Carney and J. R. Levin, "Pictorial Illustrations Still Improve Students' Learning from Text," Educational Psychology Review, vol. 14, 2002.
2. B. Goldberg, et al., "Easy as ABC? Facilitating pictorial communication via semantically enhanced layout," Twelfth International Conference on Computational Natural Language Learning, 2008.
3. R. Mihalcea and B. Leong, "Toward communicating simple sentences using pictorial representations," Association of Machine Translation in the Americas, 2006.
4. J. Zhu, et al., "A text-to-picture synthesis system for augmenting communication," The Integrated Intelligence Track of the Twenty-Second AAAI Conference on Artificial Intelligence, 2007.
5. K. Barnard, et al., "Matching words and pictures," Machine Learning Research, vol. 3, pp. 1107–1135, 2003.
6. D. Joshi, et al., "The story picturing engine—a system for automatic text illustration," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 2(1), 2006.
7. Paivio, "Mental representations: A dual coding approach," New York: Oxford University Press, 1986.
8. M. Glenberg, "Component-levels theory of the effects of spacing of repetitions on recall and recognition," Memory and Cognition, vol. 7, pp. 95-112, 1979.
9. R. G. Greene, "Spacing effects in memory: Evidence for a two-process account," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 15, pp. 371-377, 1989.
10. M. Glenberg and W. E. Langston, "Comprehension of illustrated text: pictures help to build mental models," Memory and Language, vol. 31, pp. 129–151, 1992.
11. R. E. Mayer, Multimedia Learning. Cambridge, UK: Cambridge University Press, 2001.
12. U. Frith, "A developmental framework for developmental dyslexia," Annals of Dyslexia, vol. 36, pp. 69-81, 1985.
13. S. L. H. Association, "Roles and responsibilities of speech-language pathologists with respect to augmentative and alternative communication: Technical report," ASHA Supplement, vol. 24, 2004.
14. N. UzZaman, et al., "Pictorial Temporal Structure of Documents to Help People who have Trouble Reading or Understanding," International Workshop on Design to Read, CHI, Atlanta, GA, 2010.
15. J. P. Bigham, et al., "WebAnywhere: A Self-Voicing, Web-Browsing Web Application," International Conference on the World Wide Web, Beijing, China, 2008.
16. K. Knight and D. Marcu, "Summarization beyond sentence extraction: a probabilistic approach to sentence compression," Artificial Intelligence, vol. 139, pp. 91–107, 2002.
17. J. Pustejovsky, et al., "TimeML: Robust Specification of Event and Temporal Expressions in Text," New Directions in Question Answering, 2003.
18. J. Pustejovsky and M. Verhagen, "SemEval-2010 task 13: evaluating events, time expressions, and temporal relations (TempEval-2)," Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 2010.
19. Y. Matsuo and M. Ishizuka, "Keyword Extraction from a Single Document Using Word Co-Occurrence Statistical Information," International Journal on Artificial Intelligence Tools, vol. 13, pp. 157-170, 2004.
20. R. Mihalcea and P. Tarau, "TextRank: Bringing Order into Texts," Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, 2004.
21. R. Datta, et al., "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, vol. 40, pp. 1-60, 2008.
22. Coyne and R. Sproat, "WordsEye: An automatic text-to-scene conversion system," SIGGRAPH, 2001.
23. K. Barnard and D. Forsyth, "Learning the Semantics of Words and Pictures," Eighth International Conference on Computer Vision (ICCV'01), 2001.
24. J. Lafferty, et al., "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," International Conference on Machine Learning, 2001.
25. N. UzZaman and J. F. Allen, "TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from Text," International Workshop on Semantic Evaluations, ACL, 2010.
26. J. F. Allen, et al., "Deep semantic analysis of text," Symposium on Semantics in Systems for Text Processing (STEP), 2008.
27. Y. Lin, "ROUGE: A package for automatic evaluation of summaries," ACL Text Summarization Workshop, 2004.


Author's index

James F. Allen, R. Wade Allen, Ignacio Alvarez, Gabriel Barata, Ashweeni K. Beeharee, André Berton, Jeffrey P. Bigham, Pradipta Biswas, Rolf Black, Rainer Bodendorfer, Daniel Braun, Elliot Buller, An Mei Chen, Heng-Tze Cheng, Shelby S. Darnell, Michael Eichhorn, Josh I. Ekandem, Christoph Endres, Sandro Rodriguez Garzon, Juan E. Gilbert, Daniel Gonçalves, Jin Sun Ju, Eun Yi Kim, Moritz Kümmerling, Pat Langdon, Sven Laqua, Gerrit Meixner, Kamlesh Mistry, Christian Müller, George D. Park, Martin Pfannenstein, Mark Poguntke, Ashu Razdan, Joseph Reddington, Ehud Reiter, Theodore J. Rosenthal, M. Angela Sasse, Kristof Schütt, Adriano Scoditti, Eckehard Steinbach, João Teixeira, Nava Tintarev, Naushad UzZaman, Annalu Waller, Damon L. Woodard, Li Zhang
