Exploring the Use of Speech Input by Blind People on ... - Washington


homes.cs.washington.edu
28.02.2014 Views

4.2.2 Reviewing and Editing Text

In this section, we describe the reviewing and editing techniques that participants used when composing paragraphs. After entering several sentences, most participants reviewed their text by making VoiceOver read it word by word. If a word did not sound correct, they read it character by character. To do this, participants used the VoiceOver rotor [3], which enables users to read text one unit at a time. A unit can be a line, word, or character. To move the cursor from one unit to the next, a user swipes down on the screen. The user can also swipe up to move to the previous unit. In this way, participants moved the VoiceOver cursor around the text in the text area to verify their input. When reviewing, participants often read words several times, then iterated through their characters.

Some ASR errors were difficult to detect using the VoiceOver text-to-speech because they sounded similar to the user’s original speech. For example, P1 dictated the phrase “lost my sight,” but the speech recognizer output the phrase, “lost my site.” P1 did not detect this error. Another participant, P6, said, “you can hike,” but the speech recognizer output, “you can’t hike.” P6 did not detect the error either.

Some grammatical errors were also not detected in the review process. VoiceOver’s text-to-speech does not differentiate upper- and lower-case letters, so there were some uncorrected case errors. The composed paragraphs also included extra spaces between words and sentences that participants did not seem to be aware of.

We observed that VoiceOver did not alert participants to passages that were recognized with low confidence.
Occasionally, the iOS speech recognizer underlined a word or phrase that was recognized with low confidence, and would present the user with alternative recognition options if the phrase was touched. This information was not accessible. Also, we observed that VoiceOver communicated punctuation marks in the text in a subtle manner, by varying prosody and pausing. This made it difficult for participants to detect punctuation marks in the recognized text when reviewing it.

The review process took longer than we expected, but participants spent much more time editing the ASR output. Figure 6 shows the number of edits participants performed for each paragraph input with speech. The total number of edits performed during speech input tasks was 96. Edits included inserting, deleting, and replacing characters, words, or strings of words. Only 15 (15.6%) of these edits were done using speech input. Although inputting speech was much faster than using the keyboard, participants preferred using the keyboard to edit the ASR output. Several participants inserted words or phrases with speech but deleted erroneous text with the keyboard’s BACKSPACE key. Also, there was no way to move the cursor using speech commands, so participants preferred to continue using touch to make character- or word-level edits.

We observed three editing techniques for speech input in our study. We describe them as follows.

Hone in, delete, and reenter. The first technique was the most common and was used by all participants except P5 and P6, who did not know how to use VoiceOver gestures.
This technique involved (1) moving the cursor to a desired position using the VoiceOver gestures described previously, (2) using the BACKSPACE key to delete unwanted characters, words, or even phrases, and finally (3) entering characters using the keyboard.

Hone in, select, and reenter. The second editing technique seems more efficient than the first, but was only used by P8. It involved (1) using VoiceOver gestures to move the cursor to a desired position, (2) choosing the EDIT option from the VoiceOver rotor, (3) selecting the word (one of the EDIT options, along with COPY and PASTE), and (4) re-entering the word with the keyboard. P8 sometimes re-entered the word with speech input. This technique required fewer key presses than “hone in, delete, and reenter.”

Delete and start over. P5 and P6 did not know how to reposition the cursor, so they used this technique. They deleted entered text with the DELETE key, starting from the end (where the cursor was located by default) and re-entered with speech. They both said that they often used speech for short messages, and deleted everything and started over if there was a recognition error. This was the least flexible and, most likely, the least efficient editing technique for longer messages.

[Figure 6 omitted. Axes: Number of Edits (0 to 30) and Participant.task (1.1, 2.1, 2.2, 3.1, 3.2, 4.1, 5.1, 6.1, 7.1, 7.2, 8.1); series: Keyboard edits, Speech edits.]
Figure 6. Number of edits performed with each modality for each composed paragraph (i.e., task).

4.2.3 Qualitative User Feedback

Two of the 8 (25%) participants in our study preferred keyboard input over speech input. These were P1 and P5, who were the only two participants who used speech weekly, rather than daily.
All participants mentioned speed as the primary benefit of using speech for input. P1 and P5 preferred the keyboard because of the challenge of editing: P5 said she knew there were mistakes and didn’t know how to fix them. P1 said that “although [with] speech you can get more volume of text in there…but my weakness is efficiently editing. that’s the downside of speech.”

P3 and P8 said it was easy for them to express their thoughts verbally, having had prior experience with dictation systems, unlike P1 and P2 who found it more difficult. P8 also said that speech input helped him avoid spelling mistakes; auto-correct was not sufficient for correcting them.

    I rely on dictation more because my spelling is not the greatest and feel like can compose more of coherent sentence verbally than by writing, especially on an iOS device where it’s difficult to correct mistakes. On the computer, you can see with spell check and correct. but on iOS device it’s easier not to have to worry about it.

While most preferred speech, all participants found certain aspects of speech input frustrating. All participants except P3 cited editing as a source of frustration. P8 would like to be able to do “inline” editing as he spoke. P7 wanted an easier way to edit with speech rather than using the keyboard, which she did for most edits. P7 and P3 mentioned the problem of dictating words that were out-of-vocabulary, such as names.

Figure 7 shows Likert scale responses to three statements from the interviews at the end of the study sessions, showing varying levels of satisfaction and frustration. Figure 7 also shows that all participants felt inputting text with speech was much faster than with the keyboard.

[Figure 7 omitted. Three panels plotting response frequency against the 7-point scale, one per statement:
"Speech is fast compared to the keyboard." Mean = 1.6 (SD = 0.9).
"Speech is frustrating compared to the keyboard." Mean = 4.8 (SD = 2.1).
"I'm satisfied with speech compared to the keyboard." Mean = 3.5 (SD = 1.9).]
Figure 7. Responses to three statements on a 7-point Likert scale (1 is strongly agree, 7 is strongly disagree).

5. Discussion

Our study showed that speech input is an efficient entry method for blind people compared to the on-screen keyboard, yet it is impeded by the time required to review and edit ASR output. People can speak intelligibly at a rate of about 150 WPM [31], but the average entry rate of blind people using speech in our study was just 19.5 WPM. Nonetheless, this was comparable to the entry rate of sighted people using the on-screen keyboard of a smartphone, as found in prior work [7]. Furthermore, we found that the error rate of speech input was no higher than that of keyboard input for participants in our study. It is important to note, however, that we measured accuracy only in terms of the WER, which does not necessarily correlate with the intelligibility of text [19]. The WER penalizes equally for small and major errors in a word, but it is the standard measure for evaluating ASR accuracy.

Six of the eight participants (75%) preferred speech over the keyboard because of speed, but all participants faced challenges when using speech input.
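
To make the point about the WER concrete, the following is a minimal sketch of a word error rate computation, assuming the common definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the scoring procedure used in the study. It reuses the two recognition errors reported in Section 4.2.2.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly;
    real ASR scoring tools usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A homophone error and a meaning-reversing error receive the same score.
minor = word_error_rate("lost my sight", "lost my site")
major = word_error_rate("you can hike", "you can’t hike")
print(round(minor, 3), round(major, 3))
```

Under this definition both examples count as one substituted word out of three, so the measure cannot distinguish an error that leaves the sentence intelligible from one that reverses its meaning.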
Editing was the primary challenge; participants spent the majority (80%) of their time editing the text output by the recognizer. Surprisingly, their most common editing technique was highly inefficient in terms of keystrokes. Participants deleted characters with BACKSPACE and then reentered them with the keyboard. It was unclear why they did not select whole words to replace them, or use speech for editing more than the keyboard. Perhaps some participants did not know how to select whole words with VoiceOver. They may have preferred to edit text with the keyboard because it was more predictable, preventing additional errors.

Study responses were less positive than our survey responses were. This was probably because, in our study, we asked participants to enter paragraphs that were longer and more formal than many smartphone communications. For example, a text message input by a survey participant was probably less than four sentences long and not as formal as an email that one would write to a potential employer (referring to the guidelines we gave participants in the study). Speech is currently better suited for short, casual messages, probably because of the difficulty of correcting and identifying errors. We believe the research community should facilitate the process of correcting content and grammar to make speech input more versatile.

Our study also uncovered interesting keyboard input behavior with VoiceOver. Since this was not our focus, we did not document challenges with keyboard input rigorously, but observed several interesting trends. Surprisingly, some participants did not use the auto-correct feature, which could have improved their speed and accuracy. They found it difficult to monitor and dismiss auto-correct suggestions.
We also observed that VoiceOver did not communicate punctuation clearly, and some minor grammatical issues, such as extra spaces between words, were only noticeable when reviewing text character by character. VoiceOver had a setting in which it speaks punctuation marks, but not one participant used this setting. VoiceOver also did not communicate misspelled words that were visually identified with an underline. Enabling participants to more easily identify punctuation, grammar, and spelling issues would likely improve efficiency and composition quality for both keyboard and speech input.

Throughout the paper, we have compared speech input with the de facto standard accessible input method for touchscreens: on-screen keyboard input with VoiceOver. However, there are input alternatives that are commonly used by both blind and sighted people that should be considered when evaluating speech. Several study participants used a small external keyboard with hard keys; one participant used the keyboard on his Braille display, which connected to his iPhone; one participant used the on-screen input method Fleksy [9], one of many gesture-based text entry methods (see Related Work for others). These alternatives are more private than speech, and probably more reliable in noisy environments. It would be interesting to compare these methods to speech in the future.

6. Challenges for Future Research

We distill our findings into a set of challenges for researchers interested in nonvisual text entry. These challenges can be incorporated into both speech and gesture-based input methods.

1. Text selection – a better method for nonvisual selection of text. This can also include other edit operations, such as cut, copy, and paste.
2. Cursor positioning – an easier way to move a cursor around a text area; enable a user to easily hone in on errors.

