Exploring the Use of Speech Input by Blind People on Mobile Devices

Shiri Azenkot
Computer Science & Engineering | DUB Group
University of Washington
Seattle, WA 98195 USA
shiri@cs.washington.edu

Nicole B. Lee
Human Centered Design & Engineering | DUB Group
University of Washington
Seattle, WA 98195 USA
nikki@nicoleblee.com
ABSTRACT
Much recent work has explored the challenge of nonvisual text entry on mobile devices. While researchers have attempted to solve this problem with gestures, we explore a different modality: speech. We conducted a survey with 169 blind and sighted participants to investigate how often, what for, and why blind people used speech for input on their mobile devices. We found that blind people used speech more often and input longer messages than sighted people. We then conducted a study with 8 blind people to observe how they used speech input on an iPod compared with the on-screen keyboard with VoiceOver. We found that speech was nearly 5 times as fast as the keyboard. While participants were mostly satisfied with speech input, editing recognition errors was frustrating. Participants spent an average of 80.3% of their time editing. Finally, we propose challenges for future work, including more efficient eyes-free editing and better error detection methods for reviewing text.
Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces – Input devices and strategies, Voice I/O; K.4.2 [Computers and Society]: Social issues – assistive technologies for persons with disabilities.

Keywords
Dictation, text entry, eyes-free, mobile devices.
1. INTRODUCTION
In the past few years, touchscreen devices have become widely adopted by blind people thanks to screen readers like Apple's VoiceOver. A blind user can explore a touchscreen by touching it as VoiceOver speaks the labels of UI elements being touched. To select an element, the user performs a second gesture such as a double tap. While this interaction makes an interface generally accessible, finding and selecting keys on an on-screen keyboard is slow and error-prone. Azenkot et al. [5] found that the mean text entry rate on an iPhone with VoiceOver was only 4.5 words per minute (WPM).
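Entry rates like the 4.5 WPM above follow the standard text-entry convention of one "word" per five characters, including spaces. A minimal sketch of that computation (the character count and duration below are illustrative values, not figures from the study):

```python
# Standard text-entry metric: one "word" is defined as five
# characters, including spaces.
def words_per_minute(num_chars: int, seconds: float) -> float:
    words = num_chars / 5.0
    minutes = seconds / 60.0
    return words / minutes

# Illustrative example: 45 characters transcribed in 120 seconds.
print(words_per_minute(45, 120))  # → 4.5
```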
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASSETS'13, Oct 22–24, 2013, Bellevue, WA, USA.
Copyright 2010 ACM 1-58113-000-0/00/0010 …$15.00.

Many researchers have attempted to alleviate the challenge of nonvisual text entry using larger keys [6,22] and multi-touch taps [5,10,18]. To our knowledge, no one has explored speech as an eyes-free input modality. Yet speaking is natural, fast, and nonvisual, and automatic speech recognition (ASR) is already well integrated on mobile platforms. Mobile device users can enter text with speech in any text entry scenario on iOS and Android devices (see Figure 1) and perform actions such as searching the internet and creating calendar events with Google Voice Search [11] and Siri [4].
Figure 1. The Android (left) and iPhone (right) keyboards have a DICTATE button to the left of the SPACE key that enables users to dictate text instead of using the on-screen keyboard.
While the act of speaking is eyes-free, the process of dictating, reviewing, and editing text is complex and requires additional input. Both reviewing and editing may be challenging for blind people with today's mobile technology. A sighted person can review a speech recognizer's output by simply reading the text on the screen. Meanwhile, a blind person can use VoiceOver to review the text with speech output. Detecting recognition errors with speech output may be difficult, however, since the recognizer is likely to output words that sound like those the user said, but are incorrect. For example, ASR is known to have difficulty with segmentation, and may recognize the words "recognize speech" as the similar-sounding "wreck a nice beach."

After identifying recognition errors, blind people may face challenges correcting these errors. Prior work has shown that sighted users spend the majority of their time correcting errors when dictating text. Karat et al. [14] found that sighted people spent 66% of their time editing ASR output on a desktop dictation system. The editing process for blind users on mobile devices is likely to be much slower, since users may resort to using the on-screen keyboard to correct and edit ASR output. Perhaps the combined difficulty of reviewing and editing ASR output will outweigh the ease and speed of speech.
In this paper, we explore the patterns and challenges of speech input use by blind people on mobile devices. Our ultimate goal is to improve the experience of nonvisual speech input, but first we study current use to identify specific challenges that the accessibility community can address. We conducted a survey with 169 people (105 sighted and 64 blind and low-vision) to learn how often people use speech input, what they use it for, and how much they like it. We then conducted a laboratory study with 8 blind people to observe how they use speech to compose paragraphs. We wanted to discover what techniques people used to review and edit ASR output and how effective these techniques were.

In our survey, we found that blind people used speech for input more frequently and for longer messages than sighted people. Blind people were also more satisfied with speech than sighted people, probably because the comparative advantage of speech over keyboard input was far greater for them than for sighted people. Our laboratory study showed that speech was nearly five times as fast as the on-screen keyboard, but editing recognition errors was frustrating. Participants spent an average of 80.3% of their time reviewing and editing their text. Most edits were performed using the BACKSPACE key and reentering characters with the keyboard. Six of the eight participants in the study preferred speech to keyboard entry.

Our main contribution is our findings from the survey and study. We also contribute specific research challenges for the community to explore that can improve nonvisual text entry using both speech and touch input.
2. RELATED WORK
To our knowledge, we are the first to explore speech input for blind people in the human-computer interaction literature. Speech has mostly been studied as a form of nonvisual output (e.g., [24,28]) rather than nonvisual input. There has been some work on hands-free dictation, both for people with motor impairments [26] and for the general population [12,14].

Prior work on speech input interaction focuses on error correction, the "Achilles' heel of speech technology" [23]. Desktop dictation systems such as Dragon NaturallySpeaking [21], which gained popularity in the late 1990s, use speech commands for cursor navigation and error correction. Users speak commands such as "move left" and "undo" to edit text or reposition the cursor. Karat et al. [14] found that novice users entered text at a rate of only 13.6 WPM with a commercial desktop dictation system and 32.5 WPM with a keyboard and mouse. This striking discrepancy was due to (1) cascades of errors triggered by a user's correction and (2) spiral depth, a user's repeated attempts to speak a word that is not correctly recognized.

Some work has aimed to alleviate the difficulty of error correction through touch or stylus input (see [23] for a review). Suhm et al. [29] present a system where users touch the word they want to correct, eliminating speech-based navigation. Their system also supports small gestures, such as striking out a word, as shortcuts. Martin et al. [19] enable users to correct errors in preliminary results with a mouse click. These systems make error correction more efficient but have high visual demands and are not appropriate for blind users.
There is little recent academic work on speech-based input systems. Voice Typing, introduced by Kumar et al. in 2012 [16], displays recognized text as the user dictates short phrases. Users can correct recognition errors with a marking menu. Kumar et al. found that correcting errors with Voice Typing required less effort than correcting errors with the iPhone's dictation model.

The iOS and Android dictation systems resemble the literature described above. Dictation on the iPhone follows an open-loop interaction model, where the recognizer outputs text only after the user completes the dictation. Android, in contrast, currently has incremental speech recognition, displaying recognized text as the user speaks. With VoiceOver, iOS dictation seems to be accessible to blind people, but it is unclear how incremental recognition affects accessibility on Android, especially since Android is not as generally accessible as iOS.

While there is no known work on nonvisual speech input, there has been significant interest in nonvisual touch-based input. In 2008, Slide Rule [13] introduced an accessible eyes-free interaction technique for touchscreen devices, which was later adopted by VoiceOver. Since then, researchers and developers have been trying to improve the experience of eyes-free text entry with gesture-based methods. Several text entry techniques were proposed that were based on Braille, including Perkinput [5], BrailleTouch [10,27], BrailleType [22], and TypeInBraille [18]. Methods that were not based on Braille [9,25,34], including No-Look Notes [6] and NavTouch [22], did not achieve performance comparable to the former set. Despite the large amount of work in this area, entry rates for blind users remain relatively slow: at the end of a longitudinal study, Perkinput users entered text at a rate of 7.3 WPM (with a 2.1% error rate). BrailleTouch users, who were expert users of Braille keyboards, entered text at a rate of 23.1 WPM, but with an error rate of 4.8%.

Several eyes-free text entry methods were proposed and evaluated with sighted people (e.g., [30]), but they may not be appropriate for blind users. As Kane et al. found [14], blind people have different preferences and performance abilities with touchscreen gestures than sighted people.
3. SURVEY: PATTERNS OF SPEECH INPUT AMONG BLIND AND SIGHTED PEOPLE
We conducted a survey to determine how often blind people use speech input on their mobile devices, what they use it for, and how they feel about it. We surveyed both blind and sighted people to evaluate the nonvisual experience against a baseline of the common (visual) use case.
3.1 Methods
We surveyed 54 blind participants, 10 low-vision participants, and 105 sighted participants. There were 31 female and 33 male blind/low-vision (BLV) participants, with an average age of 40 (age range: 18 to 66). Sighted participants were younger, with 51 males and 54 females and an average age of 32 (age range: 20 to 66). We sent emails to mailing lists related to our university and to blindness organizations to recruit participants. Participants did not receive compensation for completing the survey.
The survey included at most 9 questions. The first three questions asked for demographic information: age, gender, and disability. The next question asked:

Have you recently used dictation instead of a keyboard to enter text on a smartphone?
Examples of dictation include:
- Asking Siri a question, e.g., "what's the weather like today?"
- Giving Siri a command, e.g., "call John Johnson"
- Dictating an email or text message
Required.
[ ] Yes
[ ] No
If the participant answered "No" to the question above, the survey concluded with a final question that asked why not. If the participant answered "Yes," she was asked to recall a specific instance in which she used dictation. She was then asked several questions about this instance, such as when the instance occurred and what she dictated (a question to Siri, a text message, etc.). The penultimate question presented the user with three statements and asked her to rate how she felt about each statement on a Likert scale. The statements were:

• Dictation on a smartphone is accurate.
• Using dictation on a smartphone (including the time it takes to correct errors) is fast relative to an on-screen keyboard.
• I am satisfied with dictation on my smartphone.

The survey concluded with a prompt for "other comments" and a free-form text box for the response.

Surveys were completed on the Internet and responses were anonymized.

To analyze the results, we graphed the data and computed descriptive statistics for all questions. We used Wilcoxon rank-sum tests to compare Likert scale responses between groups. We modeled the data with one factor, SightAbility, with two levels: BLV and Sighted. The measures corresponded to the three Likert response statements: Accurate, Fast, and Satisfied.
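The rank-sum comparison above can be sketched in pure Python. This is an illustrative implementation using the normal approximation for the two-sided p-value; the function name and the sample Likert values are hypothetical, not the study's data or analysis code.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (W, p), where W is the rank sum of sample x and tied
    values receive their average rank.
    """
    combined = sorted((v, i) for i, v in enumerate(x + y))
    n = len(combined)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to cover the run of values tied with combined[i].
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    nx, ny = len(x), len(y)
    w = sum(ranks[:nx])                    # rank sum of sample x
    mean = nx * (nx + ny + 1) / 2.0        # E[W] under the null
    sd = math.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return w, p

# Hypothetical 5-point Likert responses for two groups:
blv = [5, 5, 4, 5, 3, 4, 5, 4]
sighted = [3, 4, 2, 3, 4, 3, 2, 3]
w, p = rank_sum_test(blv, sighted)
```

In practice a statistics package (e.g., `scipy.stats.ranksums`, which also uses the normal approximation) would be used; the sketch makes the ranking and tie-handling explicit.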
3.2 Results
The survey responses showed that BLV people used dictation far more frequently than sighted people. Interestingly, 58 BLV participants (90.6%) and 58 sighted participants (55.2%) had used dictation recently. Among BLV participants, only one used speech input on an Android device; the rest used it on iOS devices. Among sighted people, 21 participants used speech input on an Android device and 34 on an iOS device. Most BLV people had used speech input within the last day, while most sighted people had used it within the last week.

Both BLV and sighted participants used speech most for composing text messages. Table 1 shows the kinds of messages participants composed. Many more BLV than sighted people used speech to compose emails. Figure 2 supports this finding, showing that BLV people composed longer messages.
Table 1. Number of responses for a survey question.

What did you use speech input for?                       BLV   Sighted
A command (e.g., "Call Bob Smith")                         8        14
A question (e.g., "Siri, what's the weather like
today?")                                                  13        14
An email                                                  12         4
A text message                                            20        19
Other                                                      5         7
Figure 3 shows the means and standard deviations (SDs) of participant Likert scale responses to the penultimate question of the survey. Histograms of the data showed the distributions of responses were roughly normal, so the means and SDs represent the responses appropriately. As Figure 3 shows, BLV people were more satisfied with speech, and thought it was faster than the on-screen keyboard, compared with sighted people. This resulted in a significant effect of SightAbility on Satisfaction (W = 2296, p < 0.001) and Speed (W = 2240, p = 0.001). There was no significant effect of SightAbility on Accuracy, but there was a strong trend (W = 1977, p = 0.067). Perhaps BLV people encountered fewer recognition errors because they had more practice using speech for input.
[Figure 2: response counts (0 to 30) for BLV and sighted participants across three message-length bins: 1-5, 6-10, and >10 words.]
Figure 2. Survey responses to the question, "About how long was your dictated text?"
[Figure 3: mean ratings (0 to 5) for Accurate, Fast, and Satisfied, for BLV and sighted participants.]
Figure 3. Survey responses on a 5-point Likert scale: 1 is strongly disagree, and 5 is strongly agree. Responses were roughly normally distributed.
When prompted for o<str<strong>on</strong>g>the</str<strong>on</strong>g>r comments, many participants noted <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
challenge <str<strong>on</strong>g>of</str<strong>on</strong>g> editing <str<strong>on</strong>g>the</str<strong>on</strong>g> recognizer’s output, and speaking in<br />
noisy envir<strong>on</strong>ments. One blind participant explained,<br />
Accuracy in noisy envir<strong>on</strong>ments is <str<strong>on</strong>g>the</str<strong>on</strong>g> biggest<br />
challenge I feel. I prefer to dictate short commands<br />
and text, saving l<strong>on</strong>g e-mail resp<strong>on</strong>ses for a standard<br />
computer. Editing can be a challenge.<br />
Three sighted participants felt awkward speaking to their device. Another sighted participant echoed this concern, feeling frustrated with the lack of feedback: "I find it hard to talk to the device. Do you yell at it and hope it understands better?"
Participants who had not used speech for input recently were mostly concerned with accuracy and errors. Some were also concerned about privacy or social appropriateness, since other people can hear what they say when they speak to their devices. Some sighted participants said that they simply "don't need" to use speech for input or haven't figured out how to use it yet.
3.3 Discussion
The survey results suggest that speech is already a widely used eyes-free alternative to keyboard input. Blind people seem more satisfied with speech than sighted people, probably because keyboard input with VoiceOver is so much slower than standard keyboard input that sighted people do not feel speech input offers a significant advantage.
We were surprised that nearly half of the blind participants reported that their recent speech input message was over 10 words long. From anecdotal experiences and formative work, it seems that it would be difficult to review and edit a message that was more than 10 words long. On the other hand, more blind people entered text messages with speech than emails, suggesting that perhaps speech input was preferred for shorter and more casual messages.
The survey findings raised new questions. Exactly how long were the messages blind people entered with speech? The results suggested speech input was much faster than keyboard input, but by how much? How did participants review and edit their text, and how much time did they spend doing so? In the next section, we describe the study we conducted to answer these questions.
4. STUDY: OBSERVING THE USE OF SPEECH INPUT BY BLIND PEOPLE
After finding that nearly all of our blind survey participants used speech for input, we wanted to learn more about their experience using it. We conducted a laboratory study to observe blind people composing paragraphs with dictation and the accessible on-screen keyboard.
4.1 Methods
4.1.1 Participants
We recruited eight blind participants (five males, three females) with an average age of 44 (age range: 22 to 61). We required that participants be blind and use a smart mobile device such as a smartphone. All participants owned iPhones, which they used many times a day. Two participants had owned their phones for about a year, and the rest had owned their phones for two or more years. All had no functional vision and used VoiceOver to interact with their phones.
Since we wanted to observe how blind people use speech input in their daily lives, we also required that participants have experience using speech on their mobile devices. Six participants used speech input every day, while the remaining two participants used speech input weekly.
4.1.2 Procedure & Apparatus
Participants completed one session in the study, lasting about one hour and fifteen minutes. At the beginning of a session, we asked participants for demographic information, and asked them how often they used speech for input on their mobile devices and for what purposes. We then showed participants an iPod Touch 5 that we used for the study and adjusted the keyboard and VoiceOver settings to match the participant's preferences (e.g., disable auto-correct).
We then asked participants to compose paragraphs using either speech or the on-screen keyboard for input. When composing with speech, participants used the DICTATE button on the keyboard, located to the left of the SPACE key (shown in Figure 1). They could use the keyboard when editing the ASR output, but we required them to use speech for the initial composition process. Participants composed text on a simple website we developed that included only a prompt, a textarea widget, and a submit button. When they completed a paragraph, they clicked submit.
We presented participants with short prompts such as, "Tell us about a book you read recently. What did you like about it?" In addition to telling them which input method to use (speech or keyboard), we gave them two guidelines:
1. Enter about 4 to 8 sentences in response to the prompt.
2. Use professional language, as though you were emailing a potential employer.
We hoped the first guideline would encourage participants to type reasonably long paragraphs that would reveal more interesting behavior, and that the second guideline would encourage them to write complete sentences with proper grammar. We recommended that participants review and edit their text as they normally would (with both input modalities). We wanted to study speech input for long and formal paragraphs because we aim to make it usable for a variety of contexts, not just short, casual text messages.
Our procedure differed from standard text entry studies in at least two ways. First, in standard text entry studies (e.g., [5,17,22,27,30,33]), participants transcribe sets of phrases. This approach is not appropriate for speech input, however, because people speak differently when reading text. Speech recognizers are trained on conversational speech, not read-aloud phrases. Recognizers would be likely to produce more errors if participants read phrases out loud. Second, most text entry studies do not permit participants to edit text post-hoc (i.e., insert, delete, or replace text after repositioning the cursor). Since post-hoc editing is a common part of the composition process, especially with dictation, we included it in our study design.
After a participant entered one paragraph with each modality, we decided whether to ask him or her to enter a second set of paragraphs, depending on the amount of time remaining in the study. We counterbalanced the order of input methods and alternated methods between successive paragraphs for each participant (if he or she entered more than one paragraph). We concluded each study with a 15-minute semi-structured interview.
The interview included open-ended questions about what participants liked and disliked about using speech for input. We also asked them to respond to three statements on a 7-point Likert scale (1 is strongly agree, 7 is strongly disagree). The statements were:
1. Entering text with speech was fast compared to entering text with the on-screen keyboard.
2. Entering text with speech was frustrating compared with entering text with the on-screen keyboard.
3. I am satisfied with using speech to enter text compared to using the on-screen keyboard.
We recorded audio for each study, along with a video capture of the iPod Touch's screen. We mirrored the iPod Touch screen onto the researcher's computer using the built-in AirPlay [2] client on the iPod and a Mac application called AirServer [1].
4.1.3 Design & Analysis
Participants composed a total of 22 paragraphs, 11 with speech input and 11 with keyboard input. Only three participants entered two paragraphs with each method; the rest entered one paragraph with each method. The average length of a paragraph was 66.4 words (SD = 41.6).
The study was an 8 x 1 design, with a single factor, Method, with two levels, Speech and Keyboard. We calculated entry speed in terms of Words per Minute (WPM), using the formula [30]:

WPM = ((|T| − 1) / S) × 60 × (1/5)    (1)

where T is the transcribed string entered, |T| is the length of T, and S is the time in seconds from the entry of the first character to the final keystroke. We included post-hoc editing in the entry rate calculation, so S included the time taken to review and edit a paragraph.
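As a concrete illustration, the WPM formula above can be sketched in a few lines of Python (the function name and example values are ours, not from the study):

```python
def words_per_minute(transcribed: str, seconds: float) -> float:
    """Entry rate in WPM: (|T| - 1) characters entered over S seconds,
    scaled to minutes, with one "word" defined as five characters."""
    return (len(transcribed) - 1) / seconds * 60 * (1 / 5)

# Hypothetical example: a 66-character paragraph entered in 40 seconds
rate = words_per_minute("x" * 66, 40.0)  # -> 19.5 WPM
```

Note that the first character is excluded because timing starts when it is entered, so only the remaining |T| − 1 characters fall inside the measured interval.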
We evaluated accuracy in two ways. First, we computed the error rate of the speech recognizer for all text entered with speech. Second, we computed the error rate of the final transcriptions for both speech and keyboard input. We measured the error rate in terms of the Word Error Rate (WER), a standard metric used to evaluate speech recognition systems [19]. The WER is the word edit distance (i.e., the number of word insertions, deletions, and replacements) between a reference and transcription text, normalized by the length of the reference text. For evaluating recognition errors, the reference text is the offline, human-perceived transcript of the participant's speech. For evaluating the final, edited keyboard and speech text, determining the reference text is less straightforward. Since we are uncertain of the user's intended entry (this was a composition and not a transcription task), it is difficult to identify errors. As such, we considered a word to be an error if it was (1) a misspelled word, (2) a non sequitur that made no sense in the context of the sentence, or (3) a clear grammatical error such as a repeated word or a short sentence fragment. Since participants' input was conversational, classifying words as errors was relatively straightforward, given the aforementioned categories.
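The WER is a word-level Levenshtein distance; a minimal Python sketch of the computation (our own illustration, not the study's implementation) follows, using one of the recognition errors observed in the study:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word edit distance (insertions, deletions, substitutions)
    between reference and hypothesis, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit-distance table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# P1 dictated "lost my sight"; the recognizer output "lost my site":
# one of three words is wrong, so WER = 1/3
wer = word_error_rate("lost my sight", "lost my site")
```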
We used two-sided t-tests to compare text entry rates, since the WPM was roughly normally distributed. For error rates, which were not normally distributed, we used Wilcoxon rank-sum tests to compare the means between input methods.
4.2 Results
There was high variability in the speech input behavior among participants, so we explain some of the results in terms of individual performance. P5 and P6 did not know how to reposition the cursor, so their ability to edit was limited, impacting both speed and accuracy. P5's accuracy was also affected by her thick foreign accent, although she still used speech for input weekly. On the other hand, P3 had many years of experience with dictation systems on different platforms and spoke with no hesitation and clear diction.
4.2.1 Entry Time and Accuracy
The entry rate for speech input was much higher than for keyboard input. With speech input, participants entered text at a rate of 19.5 WPM (SD = 10.1), while they entered only 4.3 WPM (SD = 1.5) with the keyboard. As expected, this resulted in a significant effect of Method on WPM (t(10) = −5.07, p < 0.001). The maximum entry rate was achieved by P3, at 34.6 WPM with speech, while the maximum speed for keyboard input was achieved by P7, at 6.7 WPM.
When inputting speech, participants spent most of their time reviewing and editing recognition errors. On average, this amounted to 80.3% (SD = 10.2) of the composition time. P1 spent 94.2% (about 10 minutes) of his time reviewing and editing the recognizer's output, more than any other participant. Figure 4 shows the amount of time participants spent dictating vs. reviewing and editing their speech input compositions. Each bar in the plot represents one composed paragraph (i.e., one task).
Figure 4. Time spent entering (red) and reviewing and editing (blue) text, in minutes. Each column represents one composed paragraph (i.e., a task), labeled participant.task (1.1 through 8.1).
While participants spent most of their time reviewing and editing text when using speech input, they spent little time on these activities when using the keyboard. On average, participants spent only 9.0% (SD = 11.7) of their time reviewing and editing keyboard output. This amounted to an average of 1.2 minutes of review and edit time with the keyboard compared to 5.4 minutes of review and edit time with speech input. Participants made most of their edits in the keyboard condition inline, by deleting text with the BACKSPACE key and reentering it.
The number of errors produced by the ASR largely determined the amount of reviewing and editing participants performed, and, in turn, their overall rate of entry. The average WER for the iOS ASR was 10.2% (SD = 10.4), ranging from 0% for P3 to 35.6% for P5. Figure 5 shows the entry rates of each paragraph input with speech as a function of the ASR's WER. As expected, there was a negative correlation, with an outlier point in the far right for P5.
Figure 5. Entry rate (words per minute) vs. speech recognition word error rate (WER) for speech input. The labels of the plotted points indicate the participant (1–8) and paragraph number (1–2).
Participants corrected most of the speech recognizer's errors, yielding a mean WER of 3.2% (SD = 4.6) for the final text input with speech. The mean WER for text input with the keyboard was slightly higher, at 4.0% (SD = 3.3). There was no significant effect of Method on the WER of the final text (W = 77, n.s.).
4.2.2 Reviewing and Editing Text
In this section, we describe the reviewing and editing techniques that participants used when composing paragraphs. After entering several sentences, most participants reviewed their text by making VoiceOver read it word by word. If a word did not sound correct, they read it character by character. To do this, participants used the VoiceOver rotor [3], which enables users to read text one unit at a time. A unit can be a line, word, or character. To move the cursor from one unit to the next, a user swipes down on the screen. The user can also swipe up to move to the previous unit. In this way, participants moved the VoiceOver cursor around the text in the text area to verify their input. When reviewing, participants often read words several times, then iterated through their characters.
Some ASR errors were difficult to detect using the VoiceOver text-to-speech because they sounded similar to the user's original speech. For example, P1 dictated the phrase "lost my sight," but the speech recognizer output the phrase "lost my site." P1 did not detect this error. Another participant, P6, said, "you can hike," but the speech recognizer output, "you can't hike." P6 did not detect the error either. Some grammatical errors were also not detected in the review process. VoiceOver's text-to-speech does not differentiate upper- and lower-case letters, so there were some uncorrected case errors. The composed paragraphs also included extra spaces between words and sentences that participants did not seem to be aware of.
We observed that VoiceOver did not alert participants to passages that were recognized with low confidence. Occasionally, the iOS speech recognizer underlined a word or phrase that was recognized with low confidence, and would present the user with alternative recognition options if the phrase was touched. This information was not accessible. Also, we observed that VoiceOver communicated punctuation marks in the text in a subtle manner, by varying prosody and pausing. This made it difficult for participants to detect punctuation marks in the recognized text when reviewing it.
The review process took longer than we expected, but participants
spent much more time editing the ASR output. Figure 6 shows the
number of edits participants performed for each paragraph input
with speech. The total number of edits performed during speech
input tasks was 96. Edits included inserting, deleting, and
replacing characters, words, or strings of words. Only 15 (15.6%)
of these edits were done using speech input. Although inputting
speech was much faster than using the keyboard, participants
preferred using the keyboard to edit the ASR output. Several
participants inserted words or phrases with speech but deleted
erroneous text with the keyboard's BACKSPACE key. Also, there
was no way to move the cursor using speech commands, so
participants preferred to continue using touch to make character-
or word-level edits.
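To make the edit tally concrete, per-modality counts like those behind Figure 6 can be computed from a per-edit event log. The log format and the event values below are hypothetical illustrations, not the study's actual data:

```python
from collections import Counter

# Hypothetical (participant.task, modality) edit events -- illustrative
# labels and counts only, not the study's actual log.
edit_log = [
    ("1.1", "keyboard"), ("1.1", "keyboard"), ("1.1", "speech"),
    ("2.1", "keyboard"), ("2.1", "speech"),
]

# Tally edits by modality across all tasks.
by_modality = Counter(modality for _task, modality in edit_log)
total_edits = sum(by_modality.values())
# Fraction of all edits performed with speech (15/96 = 15.6% in the study).
speech_share = by_modality["speech"] / total_edits
```

Grouping by the `participant.task` key instead of the modality would reproduce the per-task breakdown shown in Figure 6.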
We observed three editing techniques for speech input in our
study. We describe them as follows.
Hone in, delete, and reenter. The first technique was the most
common and was used by all participants except P5 and P6, who
did not know how to use VoiceOver gestures. This technique
involved (1) moving the cursor to a desired position using the
VoiceOver gestures described previously, (2) using the BACKSPACE
key to delete unwanted characters, words, or even phrases, and
finally (3) entering characters using the keyboard.
Hone in, select, and reenter. The second editing technique
seems more efficient than the first, but was only used by P8. It
involved (1) using VoiceOver gestures to move the cursor to a
desired position, (2) choosing the EDIT option from the VoiceOver
rotor, (3) selecting the word (one of the EDIT options, along with
COPY and PASTE), and (4) re-entering the word with the keyboard. P8
sometimes re-entered the word with speech input. This technique
required fewer key presses than “hone in, delete, and reenter.”
Delete and start over. P5 and P6 did not know how to reposition
the cursor, so they used this technique. They deleted entered text
with the DELETE key, starting from the end (where the cursor was
located by default), and re-entered it with speech. They both said
that they often used speech for short messages, and deleted
everything and started over if there was a recognition error. This
was the least flexible and, most likely, the least efficient editing
technique for longer messages.
[Figure: bar chart of keyboard edits vs. speech edits (y-axis: number of edits, 0–30) for each participant.task, 1.1 through 8.1.]
Figure 6. Number of edits performed with each
modality for each composed paragraph (i.e., task).
4.2.3 Qualitative User Feedback
Two of the 8 (25%) participants in our study preferred keyboard
input over speech input. These were P1 and P5, who were the
only two participants who used speech weekly, rather than daily.
All participants mentioned speed as the primary benefit of using
speech for input. P1 and P5 preferred the keyboard because of the
challenge of editing: P5 said she knew there were mistakes and
didn’t know how to fix them. P1 said that “although [with] speech
you can get more volume of text in there…but my weakness is
efficiently editing. that’s the downside of speech.”
P3 and P8 said it was easy for them to express their thoughts
verbally, having had prior experience with dictation systems,
unlike P1 and P2, who found it more difficult. P8 also said that
speech input helped him avoid spelling mistakes; auto-correct
was not sufficient for correcting them:
“I rely on dictation more because my spelling is not the
greatest and feel like can compose more of coherent
sentence verbally than by writing, especially on an
iOS device where it’s difficult to correct mistakes. On
the computer, you can see with spell check and
correct. but on iOS device it’s easier not to have to
worry about it.” (P8)
While most preferred speech, all participants found certain
aspects of speech input frustrating. All participants except P3
cited editing as a source of frustration. P8 would like to be able to
do “inline” editing as he spoke. P7 wanted an easier way to edit
with speech rather than using the keyboard, which she did for
most edits. P7 and P3 mentioned the problem of dictating words
that were out-of-vocabulary, such as names. Figure 7 shows
Likert scale responses to three statements from the interviews at
the end of the study sessions, showing varying levels of
satisfaction and frustration. Figure 7 also shows that all
participants felt inputting text with speech was much faster than
with the keyboard.
[Figure: three histograms of response frequencies on the 1–7 scale, one per statement:
“Speech is fast compared to the keyboard.” Mean = 1.6 (SD = 0.9).
“Speech is frustrating compared to the keyboard.” Mean = 4.8 (SD = 2.1).
“I'm satisfied with speech compared to the keyboard.” Mean = 3.5 (SD = 1.9).]
Figure 7. Responses to three statements on a
7-point Likert scale (1 is strongly agree, 7 is
strongly disagree).
5. Discussion
Our study showed that speech input is an efficient entry method
for blind people compared to the on-screen keyboard, yet it is
impeded by the time required to review and edit ASR output.
People can speak intelligibly at a rate of about 150 WPM [31],
but the average entry rate of blind people using speech in our
study was just 19.5 WPM. Nonetheless, this was comparable to
the entry rate of sighted people using the on-screen keyboard of a
smartphone, as found in prior work [7]. Furthermore, we found
that the error rate of speech input was no higher than that of
keyboard input for participants in our study. It is important to
note, however, that we measured accuracy only in terms of the
WER, which does not necessarily correlate with the intelligibility
of text [19]. The WER penalizes equally for small and major
errors in a word, but it is the standard measure for evaluating
ASR accuracy.
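For concreteness, the two measures used above can be sketched in a few lines. This is a generic illustration of WER (word-level edit distance normalized by reference length) and of the conventional five-characters-per-word entry rate, not the paper's evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)


def wpm(transcribed: str, seconds: float) -> float:
    """Text entry rate using the standard 5-characters-per-word
    convention (see Wobbrock [32])."""
    return (len(transcribed) - 1) / seconds * 60.0 / 5.0
```

Note that WER treats a one-letter slip and a completely wrong word identically, which is why it may not track intelligibility, as the paragraph above observes.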
Six of the eight participants (75%) preferred speech over the
keyboard because of speed, but all participants faced challenges
when using speech input. Editing was the primary challenge;
participants spent the majority (80%) of their time editing the text
output by the recognizer. Surprisingly, their most common editing
technique was highly inefficient in terms of keystrokes.
Participants deleted characters with BACKSPACE and then re-entered
them with the keyboard. It was unclear why they did not
select whole words to replace them, or use speech for editing
more than the keyboard. Perhaps some participants did not know
how to select whole words with VoiceOver. They may have
preferred to edit text with the keyboard because it was more
predictable, preventing additional errors.
Study responses were less positive than our survey responses.
This was probably because, in our study, we asked participants to
enter paragraphs that were longer and more formal than many
smartphone communications. For example, a text message input
by a survey participant was probably less than four sentences long
and not as formal as an email that one would write to a potential
employer (referring to the guidelines we gave participants in the
study). Speech is currently better suited for short, casual
messages, probably because of the difficulty of identifying and
correcting errors. We believe the research community should
facilitate the process of correcting content and grammar to make
speech input more versatile.
Our study also uncovered interesting keyboard input behavior
with VoiceOver. Since this was not our focus, we did not
document challenges with keyboard input rigorously, but we
observed several interesting trends. Surprisingly, some
participants did not use the auto-correct feature, which could
have improved their speed and accuracy. They found it difficult
to monitor and dismiss auto-correct suggestions. We also
observed that VoiceOver did not communicate punctuation
clearly, and some minor grammatical issues, such as extra spaces
between words, were only noticeable when reviewing text
character by character. VoiceOver had a setting in which it spoke
punctuation marks, but not one participant used this setting.
VoiceOver also did not communicate misspelled words that were
visually identified with an underline. Enabling participants to
more easily identify punctuation, grammar, and spelling issues
would likely improve efficiency and composition quality for both
keyboard and speech input.
Throughout the paper, we have compared speech input with the
de facto standard accessible input method for touchscreens:
on-screen keyboard input with VoiceOver. However, there are input
alternatives that are commonly used by both blind and sighted
people that should be considered when evaluating speech. Several
study participants used a small external keyboard with hard keys;
one participant used the keyboard on his Braille display, which
connected to his iPhone; one participant used the on-screen input
method Fleksy [9], one of many gesture-based text entry methods
(see Related Work for others). These alternatives are more private
than speech, and probably more reliable in noisy environments. It
would be interesting to compare these methods to speech in the
future.
6. Challenges for Future Research
We distill our findings into a set of challenges for researchers
interested in nonvisual text entry. These challenges can be
incorporated into both speech and gesture-based input methods.
1. Text selection – a better method for nonvisual selection
of text. This can also include other edit operations, such
as cut, copy, and paste.
2. Cursor positioning – an easier way to move a cursor
around a text area; enable a user to easily hone in on
errors.
3. Error detection – an easier way to detect errors such as
spelling mistakes, letter case errors, and low-confidence
ASR output.
4. Auto-correct – a study of how well auto-correct works
for nonvisual use, and a way to make it more effective.
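As one possible illustration of the error-detection challenge, a nonvisual review tool could surface the recognizer's per-word confidence so a screen reader can mark uncertain words with a distinct cue. The (word, confidence) format and the threshold below are assumptions for the sketch, not data or a design from this study:

```python
# Assumed cutoff for "low confidence"; a real system would tune this.
ASR_CONFIDENCE_THRESHOLD = 0.8


def flag_low_confidence(words, threshold=ASR_CONFIDENCE_THRESHOLD):
    """Return the recognized text with low-confidence words bracketed,
    so a screen reader could speak them with a distinct audio cue.

    `words` is a list of hypothetical (word, confidence) pairs, as many
    ASR APIs expose them.
    """
    return " ".join(f"[{w}?]" if conf < threshold else w for w, conf in words)


# Hypothetical recognizer output with per-word confidences.
hypothesis = [("meet", 0.95), ("me", 0.90), ("at", 0.97), ("noone", 0.42)]
```

In this sketch, only the low-confidence word "noone" would be flagged for the user's review, letting them hone in on the likely error instead of re-reading the whole message.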
7. Conclusion
We have explored the patterns and challenges of speech input for
blind mobile device users through a survey and a laboratory
study. We found that speech input is a popular alternative to the
on-screen keyboard with VoiceOver, yet people face challenges
when reviewing and editing a speech recognizer’s output, often
resorting to using the keyboard. We hope this work will enable
text entry researchers to better understand the patterns and
challenges of current nonvisual text entry, and spur further
research on nonvisual speech input.
8. ACKNOWLEDGMENTS
We thank Gina-Anne Levow, Richard Ladner, Rochelle H. Ng,
and Simone Schaffer. This work was supported in part by AT&T
Labs and the National Science Foundation under grant No.
1116051.
9. REFERENCES
1. AirServer. http://www.airserver.com/.
2. Apple Inc., AirPlay. http://www.apple.com/airplay/
3. Apple Inc., Chapter 11: Using VoiceOver Gestures. From
VoiceOver Getting Started. http://www.apple.com/
voiceover/info/guide/_1137.html#vo28035.
4. Apple Inc., Siri. http://www.apple.com/ios/siri/
5. Azenkot, S., Wobbrock, J.O., Prasain, S., and Ladner, R.E.
Input finger detection for nonvisual touch screen text entry
in Perkinput. Proc. GI ’12, 121–129.
6. Bonner, M., Brudvik, J., Abowd, G., and Edwards, K. (2010).
No-Look Notes: Accessible Eyes-Free Multi-Touch Text
Entry. Proc. Pervasive ’10, 409–426.
7. Castellucci, S. and MacKenzie, I.S. (2011). Gathering text
entry metrics on Android devices. Proc. CHI EA ’11, 1507–1512.
8. Fischer, A.R.H., Price, K.J., and Sears, A. Speech-based
text entry for mobile handheld devices: An analysis of
efficacy and error correction techniques for server-based
solutions. International Journal of Human-Computer
Interaction 19, 3 (2005), 279–304.
9. Fleksy App by Syntellia. http://fleksy.com/
10. Frey, B., Southern, C., and Romero, M. BrailleTouch:
Mobile texting for the visually impaired. Proc. UAHCI ’11,
19–25.
11. Google. Voice Search Anywhere. http://www.google.com/
insidesearch/features/voicesearch/index-chrome.html
12. Halverson, C., Horn, D., Karat, C., and Karat, J. The beauty
of errors: patterns of error correction in desktop speech
systems. IOS Press (1999), 133–140.
13. Kane, S.K., Bigham, J.P., and Wobbrock, J.O. (2008). Slide
Rule: Making mobile touch screens accessible to blind
people using multi-touch interaction techniques. Proc.
ASSETS ’08, 73–80.
14. Kane, S.K., Wobbrock, J.O., and Ladner, R.E. (2011).
Usable gestures for blind people: understanding preference
and performance. Proc. CHI ’11, 413–422.
15. Karat, C.-M., Halverson, C., Horn, D., and Karat, J. Patterns
of entry and correction in large vocabulary continuous
speech recognition systems. Proc. CHI ’99, 568–575.
16. Kumar, A., Paek, T., and Lee, B. Voice typing: a new speech
interaction model for dictation on touchscreen devices. Proc.
CHI ’12, 2277–2286.
17. MacKenzie, I.S. and Zhang, S.X. (1999). The design and
evaluation of a high-performance soft keyboard. Proc. CHI
’99, 25–31.
18. Mascetti, S., Bernareggi, C., and Belotti, M. (2011).
TypeInBraille: a braille-based typing application for
touchscreen devices. Proc. ASSETS ’11, 295–296.
19. Martin, T.B. and Welch, J.R. Practical speech recognizers
and some performance effectiveness parameters. In Trends
in Speech Recognition. Prentice Hall, Englewood Cliffs, NJ,
USA, 1980.
20. Mishra, T., Ljolje, A., and Gilbert, M. (2011). Predicting
human perceived accuracy of ASR systems. Proc.
INTERSPEECH ’11, 1945–1948.
21. Nuance. Dragon NaturallySpeaking Software.
http://www.nuance.com/dragon/index.htm
22. Oliveira, J., Guerreiro, T., Nicolau, H., Jorge, J., and
Gonçalves, D. (2011). Blind people and mobile touch-based
text-entry: acknowledging the need for different flavors.
Proc. ASSETS ’11, 179–186.
23. Oviatt, S. Taming recognition errors with a multimodal
interface. Commun. ACM 43, 9 (2000), 45–51.
24. Pitt, I. and Edwards, A.D.N. (1996). Improving the usability
of speech-based interfaces for blind users. Proc. ASSETS
’96, 124–130.
25. Sánchez, J. and Aguayo, F. (2006). Mobile messenger for
the blind. Proc. UAAI ’06, 369–385.
26. Sears, A., Karat, C.-M., Oseitutu, K., Karimullah, A., and
Feng, J. Productivity, satisfaction, and interaction strategies
of individuals with spinal cord injuries and traditional users
interacting with speech recognition software. Universal
Access in the Information Society 1, 1 (2001), 4–15.
27. Southern, C., Clawson, J., Frey, B., Abowd, G., and Romero,
M. (2012). An evaluation of BrailleTouch: mobile touchscreen
text entry for the visually impaired. Proc. MobileHCI ’12,
317–326.
28. Stent, A., Syrdal, A., and Mishra, T. (2011). On the
intelligibility of fast synthesized speech for individuals with
early-onset blindness. Proc. ASSETS ’11, 211–218.
29. Suhm, B., Myers, B., and Waibel, A. Multi-modal error
correction for speech user interfaces. ACM TOCHI 8, 1
(2001), 60–98.
30. Tinwala, H. and MacKenzie, I.S. (2009). Eyes-free text
entry on a touchscreen phone. Proc. TIC-STH ’09, 83–88.
31. Williams, J.R. (1998). Guidelines for the use of multimedia
in instruction. Proc. Human Factors and Ergonomics Society
42nd Annual Meeting, 1447–1451.
32. Wobbrock, J.O. (2007). Measures of text entry
performance. In Text Entry Systems: Mobility,
Accessibility, Universality, I.S. MacKenzie and K. Tanaka-Ishii
(eds.). San Francisco: Morgan Kaufmann, 47–74.
33. Wobbrock, J.O. and Myers, B.A. Analyzing the input stream
for character-level errors in unconstrained text entry
evaluations. ACM TOCHI 13, 4 (2006), 458–489.
34. Yfantidis, G. and Evreinov, G. Adaptive blind interaction
technique for touchscreens. Universal Access in the
Information Society 4, 2006, 328–337.