Exploring the Use of Speech Input by Blind People on ... - Washington
Exploring the Use of Speech Input by Blind People on ... - Washington
Exploring the Use of Speech Input by Blind People on ... - Washington
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
The study was an 8 x 1 design, with a single factor Method with<br />
two levels, <str<strong>on</strong>g>Speech</str<strong>on</strong>g> and Keyboard. We calculated entry speed in<br />
terms <str<strong>on</strong>g>of</str<strong>on</strong>g> Words per Minute (WPM), using <str<strong>on</strong>g>the</str<strong>on</strong>g> formula [30]:<br />
WPM = T −1 •60• 1 S 5<br />
where T is <str<strong>on</strong>g>the</str<strong>on</strong>g> transcribed string entered, |T| is <str<strong>on</strong>g>the</str<strong>on</strong>g> length <str<strong>on</strong>g>of</str<strong>on</strong>g> T,<br />
and S is <str<strong>on</strong>g>the</str<strong>on</strong>g> time in sec<strong>on</strong>ds from <str<strong>on</strong>g>the</str<strong>on</strong>g> entry <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> first character to<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> final keystroke. We included post-hoc editing in <str<strong>on</strong>g>the</str<strong>on</strong>g> entry rate<br />
calculati<strong>on</strong>, so S included <str<strong>on</strong>g>the</str<strong>on</strong>g> time taken to review and edit a<br />
paragraph.<br />
We evaluated accuracy in two ways. First, we computed <str<strong>on</strong>g>the</str<strong>on</strong>g> error<br />
rate <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> speech recognizer for all text entered with speech.<br />
Sec<strong>on</strong>d, we computed <str<strong>on</strong>g>the</str<strong>on</strong>g> error rate <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> final transcripti<strong>on</strong>s for<br />
both speech and keyboard inputs. We measured <str<strong>on</strong>g>the</str<strong>on</strong>g> error rate in<br />
terms <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> Word Error Rate (WER), a standard metric used to<br />
evaluate speech recogniti<strong>on</strong> systems [19]. The WER is <str<strong>on</strong>g>the</str<strong>on</strong>g> word<br />
edit distance (i.e., <str<strong>on</strong>g>the</str<strong>on</strong>g> number <str<strong>on</strong>g>of</str<strong>on</strong>g> word inserti<strong>on</strong>s, deleti<strong>on</strong>s, and<br />
replacements) between a reference and transcripti<strong>on</strong> text,<br />
normalized <str<strong>on</strong>g>by</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> length <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> reference text. For evaluating<br />
recogniti<strong>on</strong> errors, <str<strong>on</strong>g>the</str<strong>on</strong>g> reference text is <str<strong>on</strong>g>the</str<strong>on</strong>g> <str<strong>on</strong>g>of</str<strong>on</strong>g>fline, humanperceived<br />
transcript <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> participant’s speech. For evaluating <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
final, edited keyboard and speech text, determining <str<strong>on</strong>g>the</str<strong>on</strong>g> reference<br />
text is less straight-forward. Since we are uncertain <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> user’s<br />
intended entry (this was a compositi<strong>on</strong> and not a transcripti<strong>on</strong><br />
task), it is difficult to identify errors. As such, we c<strong>on</strong>sidered a<br />
word to be an error if it was (1) a misspelled word, (2) a n<strong>on</strong><br />
sequitur that made no sense in <str<strong>on</strong>g>the</str<strong>on</strong>g> c<strong>on</strong>text <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> sentence, or (3) a<br />
clear grammatical error such as a repeated word, or a short<br />
sentence fragment. Since participants’ input was c<strong>on</strong>versati<strong>on</strong>al,<br />
classifying words as errors was relatively straight-forward, given<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> aforementi<strong>on</strong>ed categories.<br />
We used two-sided t-tests to compare text entry rates, since <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
WPM was roughly normally distributed. For error rates, which<br />
were not normally distributed, we used Wilcox<strong>on</strong> Rank Sums<br />
tests to compare <str<strong>on</strong>g>the</str<strong>on</strong>g> means between input methods.<br />
4.2 Results<br />
There was high variability in <str<strong>on</strong>g>the</str<strong>on</strong>g> speech input behavior am<strong>on</strong>g<br />
participants, so we explain some <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> results in terms <str<strong>on</strong>g>of</str<strong>on</strong>g><br />
individual performance. P5 and P6 did not know how to<br />
repositi<strong>on</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g> cursor, so <str<strong>on</strong>g>the</str<strong>on</strong>g>ir ability to edit was limited,<br />
impacting both speed and accuracy. P5’s accuracy was also<br />
affected <str<strong>on</strong>g>by</str<strong>on</strong>g> her thick foreign accent, although she still used<br />
speech for input weekly. On <str<strong>on</strong>g>the</str<strong>on</strong>g> o<str<strong>on</strong>g>the</str<strong>on</strong>g>r hand, P3 had many years <str<strong>on</strong>g>of</str<strong>on</strong>g><br />
experience with dictati<strong>on</strong> systems <strong>on</strong> different platforms and<br />
spoke with no hesitati<strong>on</strong> and clear dicti<strong>on</strong>.<br />
4.2.1 Entry Time and Accuracy<br />
The entry rate for speech input was much higher than for<br />
keyboard input. With speech input, participants entered text at a<br />
rate <str<strong>on</strong>g>of</str<strong>on</strong>g> 19.5 WPM (SD = 10.1), while <str<strong>on</strong>g>the</str<strong>on</strong>g>y entered <strong>on</strong>ly 4.3 WPM<br />
(SD = 1.5) with <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard. As expected, this resulted in a<br />
significant effect <str<strong>on</strong>g>of</str<strong>on</strong>g> Method <strong>on</strong> WPM (t (10) =-5.07, p < 0.001). The<br />
maximum entry rate was achieved <str<strong>on</strong>g>by</str<strong>on</strong>g> P3, at 34.6 WPM with<br />
speech, while <str<strong>on</strong>g>the</str<strong>on</strong>g> maximum speed for keyboard input was<br />
achieved <str<strong>on</strong>g>by</str<strong>on</strong>g> P7, at 6.7 WPM.<br />
When inputting speech, participants spent most <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g>ir time<br />
reviewing and editing recogniti<strong>on</strong> errors. On average, this<br />
amounted to 80.3% (SD = 10.2) <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> compositi<strong>on</strong> time. P1 spent<br />
94.2% (about 10 minutes) <str<strong>on</strong>g>of</str<strong>on</strong>g> his time reviewing and editing <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
(1)<br />
recognizer’s output—more than any o<str<strong>on</strong>g>the</str<strong>on</strong>g>r participant. Figure 4<br />
shows <str<strong>on</strong>g>the</str<strong>on</strong>g> amount <str<strong>on</strong>g>of</str<strong>on</strong>g> time participants spent dictating vs.<br />
reviewing and editing <str<strong>on</strong>g>the</str<strong>on</strong>g>ir speech input compositi<strong>on</strong>s. Each bar<br />
in <str<strong>on</strong>g>the</str<strong>on</strong>g> plot represents <strong>on</strong>e composed paragraph (i.e., <strong>on</strong>e task).<br />
Time (Minutes)<br />
0 5 10 15 20<br />
Inline entry<br />
Review & edits<br />
1.1 2.1 2.2 3.1 3.2 4.1 5.1 6.1 7.1 7.2 8.1<br />
Participant.task<br />
Figure 4. Time spent entering (red) and reviewing<br />
and editing (blue) text. Each column represents <strong>on</strong>e<br />
composed paragraph (i.e., a task).<br />
While participants spent most <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g>ir time reviewing and editing<br />
text when using speech input, <str<strong>on</strong>g>the</str<strong>on</strong>g>y spent little time <strong>on</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g>se<br />
activities when using <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard. On average, participants spent<br />
<strong>on</strong>ly 9.0% (SD = 11.7) <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g>ir time reviewing and editing<br />
keyboard output. This amounted to an average <str<strong>on</strong>g>of</str<strong>on</strong>g> 1.2 minutes <str<strong>on</strong>g>of</str<strong>on</strong>g><br />
review and edit time with <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard compared to 5.4 minutes<br />
<str<strong>on</strong>g>of</str<strong>on</strong>g> review and edit time with speech input. Participants made most<br />
<str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g>ir edits in <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard c<strong>on</strong>diti<strong>on</strong> inline, <str<strong>on</strong>g>by</str<strong>on</strong>g> deleting text<br />
with <str<strong>on</strong>g>the</str<strong>on</strong>g> BACKSPACE key and reentering it.<br />
The number <str<strong>on</strong>g>of</str<strong>on</strong>g> errors produced <str<strong>on</strong>g>by</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> ASR largely determined<br />
<str<strong>on</strong>g>the</str<strong>on</strong>g> amount <str<strong>on</strong>g>of</str<strong>on</strong>g> reviewing and editing participants performed, and,<br />
in turn, <str<strong>on</strong>g>the</str<strong>on</strong>g>ir overall rate <str<strong>on</strong>g>of</str<strong>on</strong>g> entry. The average WER for <str<strong>on</strong>g>the</str<strong>on</strong>g> iOS<br />
ASR was 10.2% (SD = 10.4), ranging from 0% for P3 to 35.6%<br />
for P5. Figure 5 shows <str<strong>on</strong>g>the</str<strong>on</strong>g> entry rates <str<strong>on</strong>g>of</str<strong>on</strong>g> each paragraph input<br />
with speech as a functi<strong>on</strong> <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> ASR’s WER. As expected, <str<strong>on</strong>g>the</str<strong>on</strong>g>re<br />
was a negative correlati<strong>on</strong>, with an outlier point in <str<strong>on</strong>g>the</str<strong>on</strong>g> far right<br />
for P5.<br />
Word Per Minute (WPM)<br />
0 5 10 20 30<br />
●<br />
3.1,3.2<br />
●<br />
●<br />
●<br />
8.1<br />
2.1<br />
7.2<br />
●<br />
●<br />
7.1<br />
6.1 ● 2.2<br />
● 1.1<br />
● 4.1<br />
0.0 0.1 0.2 0.3 0.4<br />
<str<strong>on</strong>g>Speech</str<strong>on</strong>g> Recogniti<strong>on</strong> Word Error Rate (WER)<br />
Figure 5. Entry rate vs. speech recogniti<strong>on</strong> error rate for<br />
speech input. The labels <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> plotted points indicate <str<strong>on</strong>g>the</str<strong>on</strong>g><br />
participant (1 – 8) and paragraph number (1 – 2).<br />
Participants corrected most <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> speech recognizer’s errors,<br />
yielding a mean WER <str<strong>on</strong>g>of</str<strong>on</strong>g> 3.2% (SD = 4.6) for <str<strong>on</strong>g>the</str<strong>on</strong>g> final text input<br />
with speech. The mean WER for text input with <str<strong>on</strong>g>the</str<strong>on</strong>g> keyboard was<br />
slightly higher, at 4.0% (SD = 3.3). There was no significant<br />
effect <str<strong>on</strong>g>of</str<strong>on</strong>g> Method <strong>on</strong> <str<strong>on</strong>g>the</str<strong>on</strong>g> WER <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>the</str<strong>on</strong>g> final text (W = 77, n.s.).<br />
●<br />
5.1