
Exploring the Use of Speech Input by Blind People on Mobile Devices

Shiri Azenkot
Computer Science & Engineering | DUB Group
University of Washington
Seattle, WA 98195 USA
shiri@cs.washington.edu

Nicole B. Lee
Human Centered Design & Engineering | DUB Group
University of Washington
Seattle, WA 98195 USA
nikki@nicoleblee.com

ABSTRACT

Much recent work has explored the challenge of nonvisual text entry on mobile devices. While researchers have attempted to solve this problem with gestures, we explore a different modality: speech. We conducted a survey with 169 blind and sighted participants to investigate how often, what for, and why blind people used speech for input on their mobile devices. We found that blind people used speech more often and input longer messages than sighted people. We then conducted a study with 8 blind people to observe how they used speech input on an iPod compared with the on-screen keyboard with VoiceOver. We found that speech was nearly 5 times as fast as the keyboard. While participants were mostly satisfied with speech input, editing recognition errors was frustrating. Participants spent an average of 80.3% of their time editing. Finally, we propose challenges for future work, including more efficient eyes-free editing and better error detection methods for reviewing text.

Categories and Subject Descriptors

H.5.2 [Information Interfaces and Presentation]: User Interfaces – Input devices and strategies, Voice I/O; K.4.2 [Computers and Society]: Social issues – assistive technologies for persons with disabilities.

Keywords

Dictation, text entry, eyes-free, mobile devices.

1. INTRODUCTION

In the past few years, touchscreen devices have become widely adopted by blind people thanks to screen readers like Apple's VoiceOver. A blind user can explore a touchscreen by touching it as VoiceOver speaks the labels of UI elements being touched. To select an element, the user performs a second gesture such as a double tap. While this interaction makes an interface generally accessible, finding and selecting keys on an on-screen keyboard is slow and error-prone. Azenkot et al. [5] found that the mean text entry rate on an iPhone with VoiceOver was only 4.5 words per minute (WPM).

Many researchers have attempted to alleviate the challenge of nonvisual text entry using larger keys [6,22] or multi-touch taps [5,10,18]. To our knowledge, no one has explored speech as an eyes-free input modality. Yet speaking is natural, fast, and nonvisual, and automatic speech recognition (ASR) is already well integrated on mobile platforms. Mobile device users can enter text with speech in any text entry scenario on iOS and Android devices (see Figure 1) and perform actions such as searching the internet and creating calendar events with Google Voice Search [11] and Siri [4].

Figure 1. The Android (left) and iPhone (right) keyboards have a DICTATE button to the left of the SPACE key that enables users to dictate text instead of using the on-screen keyboard.

While the act of speaking is eyes-free, the process of dictating, reviewing, and editing text is complex and requires additional input. Both reviewing and editing may be challenging for blind people with today's mobile technology. A sighted person can review a speech recognizer's output by simply reading the text on the screen. Meanwhile, a blind person can use VoiceOver to review the text with speech output. Detecting recognition errors with speech output may be difficult, however, since the recognizer is likely to output words that sound like those the user said, but are incorrect. For example, ASR is known to have difficulty with segmentation, and may recognize the words "recognize speech" as the similar-sounding "wreck a nice beach."

After identifying recognition errors, blind people may face challenges correcting these errors. Prior work has shown that sighted users spend the majority of their time correcting errors when dictating text. Karat et al. [14] found that sighted people spent 66% of their time editing ASR output on a desktop dictation system. The editing process for blind users on mobile devices is likely to be much slower, since users may resort to using the on-screen keyboard to correct and edit ASR output. Perhaps the combined difficulty of reviewing and editing ASR output will outweigh the ease and speed of speech.

In this paper, we explore the patterns and challenges of the use of speech input by blind people on mobile devices. Our ultimate goal is to improve the experience of nonvisual speech input, but first we study current use to identify specific challenges that the accessibility community can address. We conducted a survey with 169 people (105 sighted and 64 blind and low-vision) to learn how often people use speech input, what they use it for, and how much they like it. We then conducted a laboratory study with 8 blind people to observe how they use speech to compose paragraphs. We wanted to discover what techniques people used to review and edit ASR output and how effective these techniques were.

In our survey, we found that blind people used speech for input more frequently and for longer messages than sighted people. Blind people were also more satisfied with speech than sighted people, probably because the comparative advantage of speech over keyboard input was far greater for them than for sighted people. Our laboratory study showed that speech was nearly five times as fast as the on-screen keyboard, but editing recognition errors was frustrating. Participants spent an average of 80.3% of their time reviewing and editing their text. Most edits were performed using the BACKSPACE key and reentering characters with the keyboard. Six out of eight participants in the study preferred speech to keyboard entry.

Our main contribution is our findings from the survey and study. We also contribute specific research challenges for the community to explore that can improve nonvisual text entry using both speech and touch input.

2. RELATED WORK

To our knowledge, we are the first to explore speech input for blind people in the human-computer interaction literature. Speech has mostly been studied as a form of nonvisual output (e.g., [24,28]) rather than nonvisual input. There has been some work on hands-free dictation, both for people with motor impairments [26] and for the general population [12,14].

Prior work on speech input interaction focuses on error correction, the "Achilles' heel of speech technology" [23]. Desktop dictation systems such as Dragon NaturallySpeaking [21], which gained popularity in the late 1990s, use speech commands for cursor navigation and error correction. Users speak commands such as "move left" and "undo" to edit text or reposition the cursor. Karat et al. [14] found that novice users entered text at a rate of only 13.6 WPM with a commercial desktop dictation system and 32.5 WPM with a keyboard and mouse. This striking discrepancy was due to (1) cascades of errors triggered by a user's correction and (2) spiral depth, a user's repeated attempts to speak a word that is not correctly recognized.

Some work has aimed to alleviate the difficulty of error correction through touch or stylus input (see [23] for a review). Suhm et al. [29] present a system where users touch the word they want to correct, eliminating speech-based navigation. Their system also supports small gestures, such as striking out a word, as shortcuts. Martin et al. [19] enable users to correct errors in preliminary recognition results with a mouse click. These systems make error correction more efficient but have high visual demands and are not appropriate for blind users.

There is little recent academic work on speech-based input systems. Voice Typing, introduced by Kumar et al. in 2012 [16], displays recognized text as the user dictates short phrases. Users can correct recognition errors with a marking menu. Kumar et al. found that correcting errors with Voice Typing required less effort than correcting errors with the iPhone's dictation model.

The iOS and Android dictation systems resemble the literature described above. Dictation on the iPhone follows an open-loop interaction model, where the recognizer outputs text only after the user completes the dictation. Android, in contrast, currently has incremental speech recognition, displaying recognized text as the user speaks. With VoiceOver, iOS dictation seems to be accessible to blind people, but it is unclear how incremental recognition affects accessibility on Android, especially since Android is not as generally accessible as iOS.

While there is no known work on nonvisual speech input, there has been significant interest in nonvisual touch-based input. In 2008, Slide Rule [13] introduced an accessible eyes-free interaction technique for touchscreen devices that was later adopted by VoiceOver. Since then, researchers and developers have been trying to improve the experience of eyes-free text entry with gesture-based methods. Several text entry techniques were proposed that were based on Braille, including Perkinput [5], BrailleTouch [10,27], BrailleType [22], and TypeInBraille [18]. Methods that were not based on Braille [9,25,34], including No-Look Notes [6] and NavTouch [22], did not achieve performance comparable to the former set. Despite the large amount of work in this area, entry rates for blind users remain relatively slow: at the end of a longitudinal study, Perkinput users entered text at a rate of 7.3 WPM (with a 2.1% error rate). BrailleTouch users, who were expert users of Braille keyboards, entered text at a rate of 23.1 WPM, but with an error rate of 4.8%.

Several eyes-free text entry methods were proposed and evaluated with sighted people (e.g., [30]), but they may not be appropriate for blind users. As Kane et al. found [14], blind people have different preferences and performance abilities with touchscreen gestures than sighted people.

3. SURVEY: PATTERNS OF SPEECH INPUT AMONG BLIND AND SIGHTED PEOPLE

We conducted a survey to determine how often blind people use speech input on their mobile devices, what they use it for, and how they feel about it. We surveyed both blind and sighted people to evaluate the nonvisual experience against a baseline of the common (visual) use case.

3.1 Methods

We surveyed 54 blind participants, 10 low-vision participants, and 105 sighted participants. There were 31 female and 33 male blind/low-vision (BLV) participants, with an average age of 40 (range: 18 to 66). Sighted participants were younger, with 51 males and 54 females and an average age of 32 (range: 20 to 66). We recruited participants by sending emails to mailing lists related to our university and to blindness organizations. Participants did not receive compensation for completing the survey.

The survey included a maximum of 9 questions. The first three questions asked for demographic information: age, gender, and disability. The next question asked:

Have you recently used dictation instead of a keyboard to enter text on a smartphone?

Examples of dictation include:
- Asking Siri a question, e.g., "what's the weather like today?"
- Giving Siri a command, e.g., "call John Johnson"
- Dictating an email or text message

Required.
[ ] Yes
[ ] No

If the participant answered "No" to the question above, the survey concluded with a final question that asked why not. If the participant answered "Yes," she was asked to recall a specific instance in which she used dictation. She was then asked several questions about this instance, such as when the instance occurred and what she dictated (a question to Siri, a text message, etc.). The penultimate question presented the user with three statements and asked her to rate how she felt about each statement on a Likert scale. The statements were:

• Dictation on a smartphone is accurate.
• Using dictation on a smartphone (including the time it takes to correct errors) is fast relative to an on-screen keyboard.
• I am satisfied with dictation on my smartphone.

The survey concluded with a prompt for "other comments" and a free-form text box for the response.

Surveys were completed on the Internet, and responses were anonymized.

To analyze the results, we graphed the data and computed descriptive statistics for all questions. We used Wilcoxon rank-sum tests to compare means between Likert scale responses. We modeled the data with one factor, SightAbility, with two levels: BLV and Sighted. The measures corresponded to the three Likert response statements: Accurate, Fast, and Satisfied.
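To make this concrete, the sketch below (ours, not the authors' analysis code, with invented 5-point responses) runs a rank-sum comparison for one statement. Note that R's wilcox.test reports the rank-sum statistic W quoted in the results below, whereas SciPy's ranksums returns a normal-approximation z statistic, so the two report statistics on different scales.

```python
# Hypothetical sketch: Wilcoxon rank-sum test comparing BLV vs. sighted
# Likert responses for one statement (e.g., "Satisfied").
from scipy.stats import ranksums

blv_satisfied = [5, 4, 5, 3, 4, 5, 4, 2]      # invented responses
sighted_satisfied = [3, 2, 4, 3, 2, 3, 4, 1]  # invented responses

stat, p = ranksums(blv_satisfied, sighted_satisfied)
print(f"z = {stat:.2f}, p = {p:.3f}")
```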

3.2 Results

The survey responses showed that BLV people used dictation far more frequently than sighted people: 58 of 64 BLV participants (90.6%) and 58 of 105 sighted participants (55.2%) had used dictation recently. Among BLV participants, only one used speech input on an Android device; the rest used it on iOS devices. Among sighted people, 21 participants used speech input on an Android device and 34 on an iOS device. Most BLV people had used speech input within the last day, while most sighted people had used it within the last week.

Both BLV and sighted participants used speech most often for composing text messages. Table 1 shows the kinds of messages participants composed. Many more BLV than sighted people used speech to compose emails. Figure 2 supports this finding, showing that BLV people composed longer messages.

Table 1. Number of responses for a survey question.

What did you use speech input for?                            BLV  Sighted
A command (e.g., "Call bob smith")                              8       14
A question (e.g., "Siri, what's the weather like today?")      13       14
An email                                                       12        4
A text message                                                 20       19
Other                                                           5        7

Figure 3 shows the means and standard deviations (SDs) of participants' Likert scale responses to the penultimate question of the survey. Histograms of the data showed the distributions of responses were roughly normal, so the means and SDs represent the responses appropriately. As Figure 3 shows, BLV people were more satisfied with speech, and rated it faster relative to the on-screen keyboard, than sighted people did. This resulted in a significant effect of SightAbility on Satisfaction (W = 2296, p < 0.001) and Speed (W = 2240, p = 0.001). There was no significant effect of SightAbility on Accuracy, but there was a strong trend (W = 1977, p = 0.067). Perhaps BLV people encountered fewer recognition errors because they had more practice using speech for input.

Figure 2. Survey responses to the question, "About how long was your dictated text?" Counts of BLV and sighted responses for messages of 1–5, 6–10, and more than 10 words.

Figure 3. Mean survey responses to the Accurate, Fast, and Satisfied statements for BLV and sighted participants, on a 5-point Likert scale: 1 is strongly disagree, and 5 is strongly agree. Responses were roughly normally distributed.

When prompted for other comments, many participants noted the challenges of editing the recognizer's output and of speaking in noisy environments. One blind participant explained,

"Accuracy in noisy environments is the biggest challenge I feel. I prefer to dictate short commands and text, saving long e-mail responses for a standard computer. Editing can be a challenge."

Three sighted participants felt awkward speaking to their device. Another sighted participant echoed this concern, feeling frustrated by the lack of feedback: "I find it hard to talk to the device. Do you yell at it and hope it understands better?"

Participants who had not used speech for input recently were mostly concerned with accuracy and errors. Some were also concerned about privacy or social appropriateness, since other people can hear what they say when they speak to their devices. Some sighted participants said that they simply "don't need" to use speech for input or had not yet figured out how to use it.

3.3 Discussion

The survey results suggest that speech is already a widely used eyes-free alternative to keyboard input. Blind people seem more satisfied with speech than sighted people. This is probably because keyboard input with VoiceOver is so much slower than standard keyboard input that sighted people do not feel speech input offers a significant advantage.

We were surprised that nearly half of the blind participants reported that their recent speech input message was over 10 words long. From anecdotal experiences and formative work, it seems that it would be difficult to review and edit a message that was more than 10 words long. On the other hand, more blind people entered text messages with speech than emails, suggesting that perhaps speech input was preferred for shorter and more casual messages.

The survey findings raised new questions. Exactly how long were the messages blind people entered with speech? The results suggested speech input was much faster than keyboard input, but by how much? How did participants review and edit their text, and how much time did they spend doing so? In the next section, we describe the study we conducted to answer these questions.

4. STUDY: OBSERVING THE USE OF SPEECH INPUT BY BLIND PEOPLE

After finding that nearly all of our blind survey participants used speech for input, we wanted to learn more about their experience using it. We conducted a laboratory study to observe blind people composing paragraphs with dictation and the accessible on-screen keyboard.

4.1 Methods

4.1.1 Participants

We recruited eight blind participants (five males, three females) with an average age of 44 (range: 22 to 61). We required that participants be blind and use a smart mobile device such as a smartphone. All participants owned iPhones, which they used many times a day. Two participants had owned their phones for about a year, and the rest had owned theirs for two or more years. None had functional vision, and all used VoiceOver to interact with their phones.

Since we wanted to observe how blind people use speech input in their daily lives, we also required that participants have experience using speech on their mobile devices. Six participants used speech input every day, while the remaining two used it weekly.

4.1.2 Procedure & Apparatus

Each participant completed one session lasting about one hour and fifteen minutes. At the beginning of a session, we asked participants for demographic information, and asked how often and for what they used speech input on their mobile devices. We then showed participants the iPod Touch 5 used for the study and adjusted the keyboard and VoiceOver settings to match each participant's preferences (e.g., disabling auto-correct).

We then asked participants to compose paragraphs using either speech or the on-screen keyboard for input. When composing with speech, participants used the DICTATE button on the keyboard, located to the left of the SPACE key (shown in Figure 1). They could use the keyboard when editing the ASR output, but we required them to use speech for the initial composition process. Participants composed text on a simple website we developed that included only a prompt, a textarea widget, and a submit button. When they completed a paragraph, they clicked submit.
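The paper does not reproduce the site itself; the sketch below is a minimal stand-in under our own assumptions (Flask as the framework, a hard-coded prompt, and a tab-separated log file) showing the three elements described: a prompt, a textarea, and a submit button.

```python
# Minimal stand-in (not the study's actual code) for the composition page:
# one prompt, one textarea, one submit button, with submissions logged.
import time
from flask import Flask, request

app = Flask(__name__)

PAGE = """
<p>Tell us about a book you read recently. What did you like about it?</p>
<form method="post">
  <textarea name="paragraph" rows="8" cols="60"></textarea>
  <input type="submit" value="Submit">
</form>
"""

@app.route("/", methods=["GET", "POST"])
def compose():
    if request.method == "POST":
        # Timestamp each submitted paragraph for later timing analysis.
        with open("responses.tsv", "a") as log:
            log.write(f"{time.time()}\t{request.form['paragraph']}\n")
        return "<p>Thank you! Your paragraph was submitted.</p>"
    return PAGE

if __name__ == "__main__":
    app.run()
```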

We presented participants with short prompts such as, "Tell us about a book you read recently. What did you like about it?" In addition to telling them which input method to use (speech or keyboard), we gave them two guidelines:

1. Enter about 4 to 8 sentences in response to the prompt.
2. Use professional language, as though you were emailing a potential employer.

We hoped the first guideline would encourage participants to enter reasonably long paragraphs that would reveal more interesting behavior, and the second would encourage them to write complete sentences with proper grammar. We recommended that participants review and edit their text as they normally would (with both input modalities). We wanted to study speech input for long and formal paragraphs because we aim to make it usable for a variety of contexts, not just short, casual text messages.

Our procedure differed from standard text entry studies in at least two ways. First, in standard text entry studies (e.g., [5,17,22,27,30,33]), participants transcribe sets of phrases. This approach is not appropriate for speech input, however, because people speak differently when reading text: speech recognizers are trained on conversational speech, not read-aloud phrases, so recognizers would likely produce more errors if participants read phrases out loud. Second, most text entry studies do not permit participants to edit text post hoc (i.e., insert, delete, or replace text after repositioning the cursor). Since post-hoc editing is a common part of the composition process, especially with dictation, we included it in our study design.

After a participant entered one paragraph with each modality, we decided whether to ask him or her to enter a second set of paragraphs, depending on the amount of time remaining in the study. We counterbalanced the order of input methods and alternated methods between successive paragraphs for each participant (if he or she entered more than one paragraph). We concluded each study with a 15-minute semi-structured interview. The interview included open-ended questions about what participants liked and disliked about using speech for input. We also asked them to respond to three statements on a 7-point Likert scale (1 is strongly agree, 7 is strongly disagree). The statements were:

1. Entering text with speech was fast compared to entering text with the on-screen keyboard.
2. Entering text with speech was frustrating compared with entering text with the on-screen keyboard.
3. I am satisfied with using speech to enter text compared to using the on-screen keyboard.

We recorded audio for each session, along with a video capture of the iPod Touch's screen. We mirrored the iPod Touch screen onto the researcher's computer using the built-in AirPlay [2] client on the iPod and a Mac application called AirServer [1].

4.1.3 Design & Analysis

Participants composed a total of 22 paragraphs: 11 with speech input and 11 with keyboard input. Only three participants entered two paragraphs with each method; the rest entered one paragraph with each method. The average length of a paragraph was 66.4 words (SD = 41.6).


The study used a within-subjects design with a single factor, Method, with two levels: Speech and Keyboard. We calculated entry speed in words per minute (WPM), using the formula [30]:

WPM = (|T| − 1) × (60 / S) × (1 / 5)    (1)

where T is the transcribed string entered, |T| is the length of T in characters, and S is the time in seconds from the entry of the first character to the final keystroke. We included post-hoc editing in the entry rate calculation, so S included the time taken to review and edit a paragraph.
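In code, Equation 1 looks like the following sketch (ours, with made-up numbers); the 1/5 factor reflects the convention that one "word" is five characters, including spaces.

```python
# Entry rate per Equation 1: WPM = ((|T| - 1) / S) * 60 * (1 / 5),
# where |T| is the length of the transcribed string in characters and
# S is the elapsed time in seconds (including review and editing).
def words_per_minute(transcribed: str, seconds: float) -> float:
    return (len(transcribed) - 1) / seconds * 60 * (1 / 5)

# Example: a 330-character paragraph entered in 100 seconds -> 39.5 WPM.
print(round(words_per_minute("x" * 330, 100.0), 1))
```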

We evaluated accuracy in two ways. First, we computed the error rate of the speech recognizer for all text entered with speech. Second, we computed the error rate of the final transcriptions for both speech and keyboard input. We measured the error rate in terms of the word error rate (WER), a standard metric used to evaluate speech recognition systems [19]. The WER is the word-level edit distance (i.e., the number of word insertions, deletions, and replacements) between a reference text and a transcription, normalized by the length of the reference text. For evaluating recognition errors, the reference text is the offline, human-perceived transcript of the participant's speech. For evaluating the final, edited keyboard and speech text, determining the reference text is less straightforward. Since we are uncertain of the user's intended entry (this was a composition task, not a transcription task), it is difficult to identify errors. As such, we considered a word to be an error if it was (1) a misspelled word, (2) a non sequitur that made no sense in the context of the sentence, or (3) a clear grammatical error, such as a repeated word or a short sentence fragment. Since participants' input was conversational, classifying words as errors was relatively straightforward, given the aforementioned categories.
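As an illustration, the WER defined above can be computed with a word-level dynamic-programming edit distance; the sketch below is our own and reuses the segmentation example from the introduction.

```python
# Word error rate: word-level edit distance (insertions, deletions,
# replacements) between a reference and a transcription, normalized by
# the number of words in the reference.
def word_error_rate(reference: str, transcription: str) -> float:
    ref, hyp = reference.split(), transcription.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement
    return d[len(ref)][len(hyp)] / len(ref)

# Four word edits over a two-word reference: WER = 2.0.
print(word_error_rate("recognize speech", "wreck a nice beach"))
```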

We used two-sided t-tests to compare text entry rates, since the WPM measure was roughly normally distributed. For error rates, which were not normally distributed, we used Wilcoxon rank-sum tests to compare the means between input methods.
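A sketch of both comparisons with invented per-task values (pairing the 11 speech and keyboard tasks is our assumption; it matches the 10 degrees of freedom reported below):

```python
# Hypothetical sketch of the two significance tests described above.
from scipy.stats import ranksums, ttest_rel

# Invented per-task entry rates (11 tasks per method); roughly normal,
# so a two-sided paired t-test (df = 10) is appropriate.
speech_wpm   = [19.5, 34.6, 30.1, 12.4, 25.3, 9.8, 14.2, 22.1, 18.0, 16.7, 11.9]
keyboard_wpm = [4.3, 6.7, 5.9, 3.1, 5.0, 2.8, 4.9, 5.5, 3.9, 4.1, 3.6]
print(ttest_rel(keyboard_wpm, speech_wpm))

# Invented final-text error rates; skewed, so use a rank-sum test instead.
speech_wer   = [0.00, 0.00, 0.01, 0.05, 0.02, 0.36, 0.03, 0.00, 0.04, 0.06, 0.02]
keyboard_wer = [0.04, 0.03, 0.05, 0.02, 0.06, 0.01, 0.03, 0.08, 0.02, 0.05, 0.04]
print(ranksums(speech_wer, keyboard_wer))
```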

4.2 Results

There was high variability in speech input behavior among participants, so we explain some of the results in terms of individual performance. P5 and P6 did not know how to reposition the cursor, so their ability to edit was limited, impacting both speed and accuracy. P5's accuracy was also affected by her strong foreign accent, although she still used speech for input weekly. P3, on the other hand, had many years of experience with dictation systems on different platforms and spoke with no hesitation and clear diction.

4.2.1 Entry Time and Accuracy

The entry rate for speech input was much higher than for keyboard input. With speech input, participants entered text at a rate of 19.5 WPM (SD = 10.1), while they entered only 4.3 WPM (SD = 1.5) with the keyboard. As expected, this resulted in a significant effect of Method on WPM (t(10) = -5.07, p < 0.001). The maximum entry rate was achieved by P3, at 34.6 WPM with speech, while the maximum speed for keyboard input was achieved by P7, at 6.7 WPM.

When using speech input, participants spent most of their time reviewing and editing recognition errors. On average, this amounted to 80.3% (SD = 10.2) of the composition time. P1 spent 94.2% (about 10 minutes) of his time reviewing and editing the recognizer's output, more than any other participant. Figure 4 shows the amount of time participants spent dictating vs. reviewing and editing their speech input compositions. Each bar in the plot represents one composed paragraph (i.e., one task).

Figure 4. Time in minutes spent entering ("inline entry," red) vs. reviewing and editing (blue) text with speech input. Each column represents one composed paragraph (i.e., a task), labeled by participant and task number (1.1 through 8.1).

While participants spent most of their time reviewing and editing text when using speech input, they spent little time on these activities when using the keyboard. On average, participants spent only 9.0% (SD = 11.7) of their time reviewing and editing keyboard output. This amounted to an average of 1.2 minutes of review and edit time with the keyboard, compared to 5.4 minutes with speech input. Participants made most of their edits in the keyboard condition inline, by deleting text with the BACKSPACE key and reentering it.

The number of errors produced by the ASR largely determined the amount of reviewing and editing participants performed and, in turn, their overall rate of entry. The average WER for the iOS ASR was 10.2% (SD = 10.4), ranging from 0% for P3 to 35.6% for P5. Figure 5 shows the entry rate of each paragraph input with speech as a function of the ASR's WER. As expected, there was a negative correlation, with an outlier point at the far right for P5.

Figure 5. Entry rate vs. speech recognition error rate for speech input. The labels of the plotted points indicate the participant (1–8) and paragraph number (1–2). [Scatter plot: x-axis Speech Recognition Word Error Rate (WER), 0.0–0.4; y-axis Words Per Minute (WPM), 0–30.]

Participants corrected most of the speech recognizer's errors, yielding a mean WER of 3.2% (SD = 4.6) for the final text input with speech. The mean WER for text input with the keyboard was slightly higher, at 4.0% (SD = 3.3). There was no significant effect of Method on the WER of the final text (W = 77, n.s.).

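As a reference for these figures, WER is conventionally computed as the word-level edit distance between the reference text and the recognized text, divided by the length of the reference. A minimal Python sketch of the standard measure (an illustration, not the scoring script used in the study):

    # Word-level WER: minimum insertions, deletions, and substitutions
    # needed to turn the hypothesis into the reference, divided by the
    # reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    # A homophone substitution counts as one word error:
    print(wer("lost my sight", "lost my site"))  # 1 of 3 words: ~0.33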


4.2.2 Reviewing and Editing Text

In this section, we describe the reviewing and editing techniques that participants used when composing paragraphs. After entering several sentences, most participants reviewed their text by making VoiceOver read it word by word. If a word did not sound correct, they read it character by character. To do this, participants used the VoiceOver rotor [3], which enables users to read text one unit at a time. A unit can be a line, word, or character. To move the cursor from one unit to the next, a user swipes down on the screen. The user can also swipe up to move to the previous unit. In this way, participants moved the VoiceOver cursor around the text area to verify their input. When reviewing, participants often read words several times, then iterated through their characters.
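
Conceptually, the rotor turns review into stepping a cursor through the text at a chosen granularity. A minimal sketch of that interaction model (an illustration only, not VoiceOver's implementation):

    import re

    # Conceptual model of rotor review: split the text into units at the
    # selected granularity, then step a cursor through them one swipe at
    # a time.
    def units(text: str, granularity: str) -> list[str]:
        if granularity == "line":
            return text.splitlines()
        if granularity == "word":
            return re.findall(r"\S+", text)
        return list(text)  # character granularity

    words = units("lost my site", "word")
    cursor = 0
    cursor += 1           # swipe down: move to the next unit
    print(words[cursor])  # a screen reader would now speak "my"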

Some ASR errors were difficult to detect using the VoiceOver text-to-speech because they sounded similar to the user's original speech. For example, P1 dictated the phrase "lost my sight," but the speech recognizer output the phrase "lost my site." P1 did not detect this error. Another participant, P6, said, "you can hike," but the speech recognizer output, "you can't hike." P6 did not detect the error either. Some grammatical errors were also not detected in the review process. VoiceOver's text-to-speech does not differentiate upper- and lower-case letters, so there were some uncorrected case errors. The composed paragraphs also included extra spaces between words and sentences that participants did not seem to be aware of.
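
Errors of this last kind are inaudible to text-to-speech but trivial for software to find. A hedged sketch of an automated proofing pass that could flag them during review (a hypothetical helper, not an existing VoiceOver feature):

    import re

    # Proofing pass for errors that text-to-speech tends to hide:
    # doubled spaces and stray mid-sentence capitals.
    def silent_errors(text: str) -> list[str]:
        issues = []
        for m in re.finditer(r" {2,}", text):
            issues.append(f"extra space at offset {m.start()}")
        # A capital letter directly preceded by a lowercase word is
        # suspicious; sentence-initial capitals are not flagged.
        for m in re.finditer(r"(?<=[a-z] )[A-Z]", text):
            issues.append(f"unexpected capital at offset {m.start()}")
        return issues

    # Flags the doubled space and the stray capital in "Fun":
    print(silent_errors("We hiked  far. It was Fun."))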

We observed that VoiceOver did not alert participants to passages that were recognized with low confidence. Occasionally, the iOS speech recognizer underlined a word or phrase that was recognized with low confidence and would present the user with alternative recognition options if the phrase was touched. This information was not accessible. Also, we observed that VoiceOver communicated punctuation marks in the text in a subtle manner, by varying prosody and pausing. This made it difficult for participants to detect punctuation marks in the recognized text when reviewing it.
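
The underlining shows that the recognizer holds per-word confidence internally. A sketch of how that signal could be surfaced nonvisually, assuming access to (word, confidence) pairs (the data format and threshold are assumptions for illustration):

    # Given (word, confidence) pairs from a recognizer, yield the words a
    # screen reader could announce as uncertain or let the user jump to.
    def uncertain_words(segments, threshold=0.6):
        for index, (word, confidence) in enumerate(segments):
            if confidence < threshold:
                yield index, word

    hypothesis = [("you", 0.97), ("can't", 0.41), ("hike", 0.93)]
    print(list(uncertain_words(hypothesis)))  # [(1, "can't")]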

The review process took longer than we expected, but participants spent much more time editing the ASR output. Figure 6 shows the number of edits participants performed for each paragraph input with speech. The total number of edits performed during speech input tasks was 96. Edits included inserting, deleting, and replacing characters, words, or strings of words. Only 15 (15.6%) of these edits were done using speech input. Although inputting speech was much faster than using the keyboard, participants preferred using the keyboard to edit the ASR output. Several participants inserted words or phrases with speech but deleted erroneous text with the keyboard's BACKSPACE key. Also, there was no way to move the cursor using speech commands, so participants preferred to continue using touch to make character- or word-level edits.

We observed three editing techniques for speech input in our study. We describe them below.

H<strong>on</strong>e in, delete, and reenter. The first technique was <str<strong>on</strong>g>the</str<strong>on</strong>g> most<br />

comm<strong>on</strong> and was used <str<strong>on</strong>g>by</str<strong>on</strong>g> all participants except P5 and P6 who<br />

did not know how to use VoiceOver gestures. This technique<br />

involved (1) moving <str<strong>on</strong>g>the</str<strong>on</strong>g> cursor to a desired positi<strong>on</strong> using <str<strong>on</strong>g>the</str<strong>on</strong>g><br />

VoiceOver gestures describe previously, (2) using <str<strong>on</strong>g>the</str<strong>on</strong>g> BACKSPACE<br />

key to delete unwanted characters, words, or even phrases, and<br />

finally (3) to enter characters using <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard.<br />

H<strong>on</strong>e in, select, and reenter. The sec<strong>on</strong>d editing technique<br />

seems more efficient than <str<strong>on</strong>g>the</str<strong>on</strong>g> first, but was <strong>on</strong>ly used <str<strong>on</strong>g>by</str<strong>on</strong>g> P8. It<br />

involved (1) using VoiceOver gestures to move <str<strong>on</strong>g>the</str<strong>on</strong>g> cursor to a<br />

desired positi<strong>on</strong>, (2) choosing <str<strong>on</strong>g>the</str<strong>on</strong>g> EDIT opti<strong>on</strong> from <str<strong>on</strong>g>the</str<strong>on</strong>g> VoiceOver<br />

rotor, (3) selecting <str<strong>on</strong>g>the</str<strong>on</strong>g> word (<strong>on</strong>e <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> EDIT opti<strong>on</strong>s, al<strong>on</strong>g with<br />

COPY and PASTE), and (4) re-enter <str<strong>on</strong>g>the</str<strong>on</strong>g> word with <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard. P8<br />

sometimes re-entered <str<strong>on</strong>g>the</str<strong>on</strong>g> word with speech input. This technique<br />

required fewer key presses than “h<strong>on</strong>e in, delete, and reenter.”<br />

Delete and start over. P5 and P6 did not know how to reposition the cursor, so they used this technique. They deleted entered text with the DELETE key, starting from the end (where the cursor was located by default), and re-entered it with speech. They both said that they often used speech for short messages, and deleted everything and started over if there was a recognition error. This was the least flexible and, most likely, the least efficient editing technique for longer messages.
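
The efficiency gap between the first two techniques can be made concrete with rough keystroke arithmetic; the counts below are illustrative assumptions, not measurements from the study:

    # Rough cost, in gestures and key presses, of replacing one
    # n-character word, where g is the number of rotor gestures assumed
    # to be needed to reach the word.
    def hone_delete_reenter(g: int, n: int) -> int:
        return g + n + n  # navigate, n BACKSPACE presses, n characters

    def hone_select_reenter(g: int, n: int) -> int:
        return g + 2 + n  # navigate, rotor EDIT + select word, n characters

    # Replacing a 5-character word reached in 4 gestures:
    print(hone_delete_reenter(4, 5))  # 14
    print(hone_select_reenter(4, 5))  # 11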

Figure 6. Number of edits performed with each modality for each composed paragraph (i.e., task). [Stacked bar chart: x-axis Participant.task (1.1–8.1), y-axis Number of Edits, 0–30; legend: Keyboard edits, Speech edits.]

4.2.3 Qualitative User Feedback

Two of the 8 (25%) participants in our study preferred keyboard input over speech input. These were P1 and P5, who were the only two participants who used speech weekly rather than daily. All participants mentioned speed as the primary benefit of using speech for input. P1 and P5 preferred the keyboard because of the challenge of editing: P5 said she knew there were mistakes and didn't know how to fix them. P1 said that "although [with] speech you can get more volume of text in there…but my weakness is efficiently editing. that's the downside of speech."

P3 and P8 said it was easy for them to express their thoughts verbally, having had prior experience with dictation systems, unlike P1 and P2, who found it more difficult. P8 also said that speech input helped him avoid spelling mistakes, which auto-correct did not fix reliably. He explained:

    I rely on dictation more because my spelling is not the greatest and feel like can compose more of coherent sentence verbally than by writing, especially on an iOS device where it's difficult to correct mistakes. On the computer, you can see with spell check and correct. but on iOS device it's easier not to have to worry about it.

While most preferred speech, all participants found certain aspects of speech input frustrating. All participants except P3 cited editing as a source of frustration. P8 wanted to be able to do "inline" editing as he spoke. P7 wanted an easier way to edit with speech rather than using the keyboard, which she did for most edits. P7 and P3 mentioned the problem of dictating words that were out-of-vocabulary, such as names. Figure 7 shows Likert scale responses to three statements from the interviews at the end of the study sessions, showing varying levels of satisfaction and frustration. Figure 7 also shows that all participants felt inputting text with speech was much faster than with the keyboard.

Figure 7. Responses to three statements on a 7-point Likert scale (1 is strongly agree, 7 is strongly disagree). [Three histograms of response frequencies:] "Speech is fast compared to the keyboard," mean = 1.6 (SD = 0.9); "Speech is frustrating compared to the keyboard," mean = 4.8 (SD = 2.1); "I'm satisfied with speech compared to the keyboard," mean = 3.5 (SD = 1.9).

5. Discussion

Our study showed that speech input is an efficient entry method for blind people compared to the on-screen keyboard, yet it is impeded by the time required to review and edit ASR output. People can speak intelligibly at a rate of about 150 WPM [31], but the average entry rate of blind people using speech in our study was just 19.5 WPM. Nonetheless, this was comparable to the entry rate of sighted people using the on-screen keyboard of a smartphone, as found in prior work [7]. Furthermore, we found that the error rate of speech input was no higher than that of keyboard input for participants in our study. It is important to note, however, that we measured accuracy only in terms of the WER, which does not necessarily correlate with the intelligibility of text [19]. The WER penalizes small and major errors in a word equally, but it is the standard measure for evaluating ASR accuracy.
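
For context on how such entry rates are computed: the convention is to derive WPM from the length of the final transcribed string and the trial time, with one "word" standardized as five characters, including spaces [32]. A minimal sketch, with hypothetical numbers:

    # Standard entry-rate measure: words per minute, where one "word"
    # is five characters including spaces [32].
    def words_per_minute(transcribed: str, seconds: float) -> float:
        # len - 1 because timing conventionally starts at the first
        # character produced.
        return ((len(transcribed) - 1) / seconds) * (60.0 / 5.0)

    # Hypothetical: a 300-character paragraph completed in 3 minutes,
    # including review and editing, lands near our observed speech rate.
    print(words_per_minute("x" * 300, 180.0))  # ~19.9 WPM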

Six of the eight participants (75%) preferred speech over the keyboard because of speed, but all participants faced challenges when using speech input. Editing was the primary challenge; participants spent the majority (80%) of their time editing the text output by the recognizer. Surprisingly, their most common editing technique was highly inefficient in terms of keystrokes. Participants deleted characters with BACKSPACE and then reentered them with the keyboard. It was unclear why they did not select whole words to replace them, or use speech for editing more than the keyboard. Perhaps some participants did not know how to select whole words with VoiceOver. They may have preferred to edit text with the keyboard because it was more predictable, preventing additional errors.

Study responses were less positive than our survey responses. This was probably because, in our study, we asked participants to enter paragraphs that were longer and more formal than many smartphone communications. For example, a text message input by a survey participant was probably less than four sentences long and not as formal as an email that one would write to a potential employer (referring to the guidelines we gave participants in the study). Speech is currently better suited for short, casual messages, probably because of the difficulty of identifying and correcting errors. We believe the research community should facilitate the process of correcting content and grammar to make speech input more versatile.

Our study also uncovered interesting keyboard input behavior with VoiceOver. Since this was not our focus, we did not document challenges with keyboard input rigorously, but we observed several interesting trends. Surprisingly, some participants did not use the auto-correct feature, which could have improved their speed and accuracy. They found it difficult to monitor and dismiss auto-correct suggestions. We also observed that VoiceOver did not communicate punctuation clearly, and some minor grammatical issues, such as extra spaces between words, were only noticeable when reviewing text character by character. VoiceOver had a setting that made it speak punctuation marks, but no participant used it. VoiceOver also did not communicate misspelled words, which were identified visually with an underline. Enabling participants to more easily identify punctuation, grammar, and spelling issues would likely improve efficiency and composition quality for both keyboard and speech input.

Throughout the paper, we have compared speech input with the de facto standard accessible input method for touchscreens: on-screen keyboard input with VoiceOver. However, there are input alternatives commonly used by both blind and sighted people that should be considered when evaluating speech. Several study participants used a small external keyboard with hard keys; one participant used the keyboard on his Braille display, which connected to his iPhone; and one participant used the on-screen input method Fleksy [9], one of many gesture-based text entry methods (see Related Work for others). These alternatives are more private than speech, and probably more reliable in noisy environments. It would be interesting to compare these methods to speech in the future.

6. Challenges for Future Research

We distill our findings into a set of challenges for researchers interested in nonvisual text entry. These challenges can be incorporated into both speech and gesture-based input methods.

1. Text selection – a better method for nonvisual selection of text. This can also include other edit operations, such as cut, copy, and paste.

2. Cursor positioning – an easier way to move a cursor around a text area; enable a user to easily hone in on errors.

3. Error detection – an easier way to detect errors such as spelling mistakes, letter case errors, and low-confidence ASR output.

4. Auto-correct – a study of how well auto-correct works for nonvisual use, and a way to make it more effective.

7. C<strong>on</strong>clusi<strong>on</strong><br />

We have explored <str<strong>on</strong>g>the</str<strong>on</strong>g> patterns and challenges <str<strong>on</strong>g>of</str<strong>on</strong>g> speech input for<br />

blind mobile device users through a survey and a laboratory<br />

study. We found that speech input is a popular alternative to <str<strong>on</strong>g>the</str<strong>on</strong>g><br />

<strong>on</strong>-screen keyboard with VoiceOver, yet people face challenges<br />

when reviewing and editing a speech recognizer’s output, <str<strong>on</strong>g>of</str<strong>on</strong>g>ten<br />

resorting to using <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard. We hope this work will enable<br />

text entry researchers to better understand <str<strong>on</strong>g>the</str<strong>on</strong>g> patterns and<br />

challenges <str<strong>on</strong>g>of</str<strong>on</strong>g> current n<strong>on</strong>visual text entry, and spur fur<str<strong>on</strong>g>the</str<strong>on</strong>g>r<br />

research in <str<strong>on</strong>g>the</str<strong>on</strong>g> <strong>on</strong> n<strong>on</strong>visual speech input.<br />

8. ACKNOWLEDGMENTS

We thank Gina-Anne Levow, Richard Ladner, Rochelle H. Ng, and Simone Schaffer. This work was supported in part by AT&T Labs and the National Science Foundation under grant No. 1116051.

9. REFERENCES

1. AirServer. http://www.airserver.com/
2. Apple Inc. AirPlay. http://www.apple.com/airplay/
3. Apple Inc. Chapter 11: Using VoiceOver Gestures. From VoiceOver Getting Started. http://www.apple.com/voiceover/info/guide/_1137.html#vo28035
4. Apple Inc. Siri. http://www.apple.com/ios/siri/
5. Azenkot, S., Wobbrock, J.O., Prasain, S., and Ladner, R.E. Input finger detection for nonvisual touch screen text entry in Perkinput. Proc. GI '12, 121–129.
6. Bonner, M., Brudvik, J., Abowd, G., and Edwards, K. (2010). No-Look Notes: Accessible eyes-free multi-touch text entry. Proc. Pervasive '10, 409–426.
7. Castellucci, S. and MacKenzie, I.S. (2011). Gathering text entry metrics on Android devices. Proc. CHI EA '11, 1507–1512.
8. Fischer, A.R.H., Price, K.J., and Sears, A. Speech-based text entry for mobile handheld devices: An analysis of efficacy and error correction techniques for server-based solutions. International Journal of Human-Computer Interaction 19, 3 (2005), 279–304.
9. Fleksy App by Syntellia. http://fleksy.com/
10. Frey, B., Southern, C., and Romero, M. BrailleTouch: Mobile texting for the visually impaired. Proc. UAHCI '11, 19–25.
11. Google. Voice Search Anywhere. http://www.google.com/insidesearch/features/voicesearch/index-chrome.html
12. Halverson, C., Horn, D., Karat, C., and Karat, J. The beauty of errors: Patterns of error correction in desktop speech systems. IOS Press (1999), 133–140.
13. Kane, S.K., Bigham, J.P., and Wobbrock, J.O. (2008). Slide Rule: Making mobile touch screens accessible to blind people using multi-touch interaction techniques. Proc. ASSETS '08, 73–80.
14. Kane, S.K., Wobbrock, J.O., and Ladner, R.E. (2011). Usable gestures for blind people: Understanding preference and performance. Proc. CHI '11, 413–422.
15. Karat, C.-M., Halverson, C., Horn, D., and Karat, J. Patterns of entry and correction in large vocabulary continuous speech recognition systems. Proc. CHI '99, 568–575.
16. Kumar, A., Paek, T., and Lee, B. Voice typing: A new speech interaction model for dictation on touchscreen devices. Proc. CHI '12, 2277–2286.
17. MacKenzie, I.S. and Zhang, S.X. (1999). The design and evaluation of a high-performance soft keyboard. Proc. CHI '99, 25–31.
18. Mascetti, S., Bernareggi, C., and Belotti, M. (2011). TypeInBraille: A braille-based typing application for touchscreen devices. Proc. ASSETS '11, 295–296.
19. Martin, T.B. and Welch, J.R. Practical speech recognizers and some performance effectiveness parameters. In Trends in Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, USA, 1980.
20. Mishra, T., Ljolje, A., and Gilbert, M. (2011). Predicting human perceived accuracy of ASR systems. Proc. INTERSPEECH '11, 1945–1948.
21. Nuance. Dragon NaturallySpeaking Software. http://www.nuance.com/dragon/index.htm
22. Oliveira, J., Guerreiro, T., Nicolau, H., Jorge, J., and Gonçalves, D. (2011). Blind people and mobile touch-based text-entry: Acknowledging the need for different flavors. Proc. ASSETS '11, 179–186.
23. Oviatt, S. Taming recognition errors with a multimodal interface. Commun. ACM 43, 9 (2000), 45–51.
24. Pitt, I. and Edwards, A.D.N. (1996). Improving the usability of speech-based interfaces for blind users. Proc. ASSETS '96, 124–130.
25. Sánchez, J. and Aguayo, F. (2006). Mobile messenger for the blind. Proc. UAAI '06, 369–385.
26. Sears, A., Karat, C.-M., Oseitutu, K., Karimullah, A., and Feng, J. Productivity, satisfaction, and interaction strategies of individuals with spinal cord injuries and traditional users interacting with speech recognition software. Universal Access in the Information Society 1, 1 (2001), 4–15.
27. Southern, C., Clawson, J., Frey, B., Abowd, G., and Romero, M. (2012). An evaluation of BrailleTouch: Mobile touchscreen text entry for the visually impaired. Proc. MobileHCI '12, 317–326.
28. Stent, A., Syrdal, A., and Mishra, T. (2011). On the intelligibility of fast synthesized speech for individuals with early-onset blindness. Proc. ASSETS '11, 211–218.
29. Suhm, B., Myers, B., and Waibel, A. Multi-modal error correction for speech user interfaces. ACM TOCHI 8, 1 (2001), 60–98.
30. Tinwala, H. and MacKenzie, I.S. (2009). Eyes-free text entry on a touchscreen phone. Proc. TIC-STH '09, 83–88.
31. Williams, J.R. (1998). Guidelines for the use of multimedia in instruction. Proc. Human Factors and Ergonomics Society 42nd Annual Meeting, 1447–1451.
32. Wobbrock, J.O. (2007). Measures of text entry performance. In Text Entry Systems: Mobility, Accessibility, Universality, I.S. MacKenzie and K. Tanaka-Ishii (eds.). San Francisco: Morgan Kaufmann, 47–74.
33. Wobbrock, J.O. and Myers, B.A. Analyzing the input stream for character-level errors in unconstrained text entry evaluations. ACM TOCHI 13, 4 (2006), 458–489.
34. Yfantidis, G. and Evreinov, G. Adaptive blind interaction technique for touchscreens. Universal Access in the Information Society 4 (2006), 328–337.
