User Interface Design for Computer-Assisted Speech Transcription
<strong>Interface</strong> <strong>Design</strong> Strategies <strong>for</strong> <strong>Computer</strong>-<strong>Assisted</strong> <strong>Speech</strong><br />
<strong>Transcription</strong><br />
Saturnino Luz<br />
Dept. of <strong>Computer</strong> Science<br />
Trinity College Dublin<br />
Ireland<br />
luzs@cs.tcd.ie<br />
Masood Masoodian<br />
Dept. of <strong>Computer</strong> Science<br />
The University of Waikato<br />
New Zealand<br />
masood@cs.waikato.ac.nz<br />
Chris Deering<br />
Dept. of <strong>Computer</strong> Science<br />
Trinity College Dublin<br />
Ireland<br />
deeringc@cs.tcd.ie<br />
Bill Rogers<br />
Dept. of <strong>Computer</strong> Science<br />
The University of Waikato<br />
New Zealand<br />
coms0108@cs.waikato.ac.nz<br />
ABSTRACT<br />
A set of user interface design techniques <strong>for</strong> computer-assisted<br />
speech transcription is presented and evaluated<br />
with respect to task per<strong>for</strong>mance and usability. These techniques<br />
include error-correction mechanisms which originated<br />
in dictation systems and audio editors as well as new techniques<br />
developed by us which exploit specific characteristics<br />
of existing speech recognition technologies in order to facilitate<br />
transcription in settings that typically yield considerable<br />
recognition inaccuracy, such as when the speech to be<br />
transcribed was produced by different speakers. In particular,<br />
we describe a mechanism <strong>for</strong> dynamic propagation of<br />
user feedback which progressively adapts the system to different<br />
speakers and lexical contexts. Results of usability and<br />
per<strong>for</strong>mance evaluation trials indicate that feedback propagation,<br />
menu-based correction coupled with keyboard interaction<br />
and text-driven audio playback are positively perceived<br />
by users and result in improved transcript accuracy.<br />
Categories and Subject Descriptors<br />
H.5.2 [<strong>User</strong> <strong>Interface</strong>s]: Natural Language; H.5.2 [<strong>User</strong> <strong>Interface</strong>s]: Voice I/O; H.5.1<br />
[In<strong>for</strong>mation <strong>Interface</strong>s and Presentation]: Multimedia<br />
In<strong>for</strong>mation Systems<br />
General Terms<br />
Human Factors<br />
Keywords<br />
Automatic speech recognition; error correction; computer-assisted<br />
speech transcription<br />
OZCHI 2008, December 8-12, 2008, Cairns, QLD, Australia. Copyright<br />
the author(s) and CHISIG. Additional copies can be ordered from CHISIG<br />
(secretary@chisig.org)<br />
OZCHI 2008 Proceedings ISBN: 0-9803063-4-5<br />
1. INTRODUCTION<br />
Automatic speech recognition (ASR) is a mature technology<br />
which has found application in diverse areas. However,<br />
speech transcriptions generated by modern ASR systems are<br />
often far from perfect. Although <strong>for</strong> certain applications,<br />
such as interactive voice response, potentially adverse effects<br />
of high recognition error rates can be effectively mitigated<br />
through appropriate interaction design, transcription applications<br />
usually require more substantial user involvement in<br />
order to produce satisfactory results. This paper focuses<br />
on systems that combine ASR and user interface design<br />
techniques to support effective creation of long transcripts<br />
by novice transcribers, also known as computer-assisted (or<br />
computer-aided) speech transcription systems.<br />
The importance of computer-aided speech transcription, as<br />
opposed to fully-automated (ASR) transcription, and the<br />
need <strong>for</strong> effective user interfaces to support it have been<br />
increasingly recognised. This is especially true of application<br />
domains, such as medical and legal settings 1 , where<br />
high accuracy demands make human post-editing of transcripts<br />
a practical necessity. In the area of health in<strong>for</strong>matics,<br />
<strong>for</strong> instance, this trend towards computer-aided transcription<br />
is apparent even in the titles of articles published<br />
in this area over the years. They have ranged from an optimistic<br />
“<strong>Computer</strong>-based speech recognition as a replacement<br />
<strong>for</strong> medical transcription” [19], in 1998, to a more realistic<br />
“<strong>Computer</strong>-based speech recognition as an alternative to<br />
medical transcription” [5], in 2001, to a representative of the<br />
emerging consensus in this area: “<strong>Speech</strong> recognition as a<br />
transcription aid” [14], published in 2003 (emphasis added).<br />
1 In court record production, the term “computer-assisted<br />
transcription” is also used to denote a method of stenographic<br />
reporting in which a computer functions as an enhanced<br />
stenographic machine [2]. Such methods do not usually<br />
involve ASR and there<strong>for</strong>e bear no relation to the techniques<br />
described in this paper.<br />
In addition to health care and law applications, the considerable<br />
growth in availability of recorded speech and multimedia<br />
resources on the Internet, ranging from videos of<br />
various sorts to dedicated news channels and repositories of<br />
recorded broadcasts, has created a demand <strong>for</strong> fast and effective<br />
transcription also in these domains. Automated speech<br />
recognition systems can play an important role in meeting<br />
this demand. Requirements <strong>for</strong> these kinds of transcription<br />
tasks vary from application to application. In the case of<br />
news broadcasts, <strong>for</strong> instance, approximately accurate but<br />
quickly produced “rush transcripts” are often sufficient [22].<br />
However, except perhaps <strong>for</strong> applications in the area of in<strong>for</strong>mation<br />
retrieval, where the transcript is not actually meant<br />
to be read in its entirety (e.g. [6, 18]; see also [7] <strong>for</strong> a survey)<br />
a minimum of understandability is a basic requirement.<br />
While state-of-the-art speech recognition systems can deliver<br />
intelligible transcripts, human intervention is often required.<br />
The approach to computer-assisted transcription we propose<br />
and investigate in this paper essentially consists in the following.<br />
An ASR system is initially employed to generate a<br />
(typically imperfect but time-aligned) transcription of a long<br />
speech recording. From that point on, fine tuning is per<strong>for</strong>med<br />
through iterations of human-driven error correction<br />
steps supported by an interactive system. This approach<br />
there<strong>for</strong>e shifts the system’s focus from automatic recognition<br />
to user interface support <strong>for</strong> error correction. We explore<br />
a number of user interface design strategies to support<br />
the creation of time-aligned transcriptions of speech recordings.<br />
These techniques have been implemented through a<br />
prototype “transcript editor” system and evaluated <strong>for</strong> usability<br />
and effectiveness.<br />
2. RELATED WORK<br />
<strong>User</strong> interface support <strong>for</strong> error correction in speech applications<br />
varies greatly. Error correction methods employed<br />
in dictation systems [21], <strong>for</strong> instance, are very different from<br />
repair in dialogue systems [10]. Such techniques usually involve<br />
error recognition via the visual modality and repair<br />
through speech interaction (“respeaking”) [1] or, more recently,<br />
through combinations of modalities [21, 8, 11]. <strong>Interface</strong><br />
design <strong>for</strong> computer-assisted transcription [17] seems<br />
to have received less attention than dialogue or dictation systems.<br />
However, techniques developed <strong>for</strong> those types of systems,<br />
especially dictation systems, might be profitably employed<br />
in speech transcription. It is somewhat surprising<br />
that, although techniques like alternative lists [1, 16] and<br />
multimodal interaction [21] can apparently benefit transcription<br />
tasks, little usability and human factors research has<br />
been done <strong>for</strong> such applications. One could speculate that<br />
this is due to the fact that the distinction between ASR-enhanced<br />
transcription of pre-recorded speech and dictation<br />
is blurred by the focus on error-correction strategies. Focusing<br />
entirely on error-correction obscures certain aspects<br />
of transcription tasks that differ considerably from dictation<br />
and could be exploited through user interface design in order<br />
to improve task per<strong>for</strong>mance. One such aspect is that,<br />
in computer-assisted transcription, one can take advantage<br />
of the fact that the entire speech to be transcribed is often<br />
available (in recorded <strong>for</strong>m) when the transcription task<br />
starts, allowing the system to propagate changes dynamically<br />
and take maximum advantage of user feedback.<br />
The usual <strong>for</strong>m of representation of the speech signal on the<br />
user interface is as a wave<strong>for</strong>m or a spectrogram stretching<br />
along a timeline. Some speech transcription systems, notably<br />
those used by speech scientists <strong>for</strong> transcribing and<br />
segmenting speech records <strong>for</strong> further processing or corpora<br />
production, employ this interface style, allowing access to<br />
the audio to be transcribed via horizontal scrolling. The<br />
Transcriber [4] is a system that uses that strategy. Since its<br />
primary goal is annotation of the speech signal, the local<br />
nature of the transcription (and alignment) suits the sliding-window<br />
view af<strong>for</strong>ded by the timeline representation. Other<br />
systems that specifically target annotation tasks, but not<br />
necessarily at the speech signal level, such as the Elan tool<br />
[15], also employ a similar interface style. In the latter case,<br />
however, since the typical use of such systems is dialogue annotation,<br />
it is not clear whether displaying the annotation<br />
on a single horizontal timeline would not prove a hindrance.<br />
A departure from the timeline metaphor is illustrated by<br />
the system described in [17], which employs a table-like<br />
visualisation to display the alternatives considered by the<br />
ASR component as a confusion matrix. This system incorporates<br />
correction propagation within sentence boundaries<br />
through re-scoring of the recognition lattice. The authors<br />
report good results in terms of accuracy and usability when<br />
the system is tested against a baseline (keyboard input only)<br />
system, even though the tabular presentation restricts the<br />
amount of text that can be displayed on the screen. More<br />
recently, a system [11] has been proposed which allows users<br />
to correct an ASR-generated transcript through handwriting<br />
input. This system is designed <strong>for</strong> non-expert transcribers<br />
and does not necessarily rely on a timeline metaphor. Although<br />
the system aligns the audio signal to the text strings<br />
it generates internally, no feedback of this alignment operation<br />
is presented to the user.<br />
Finally, a two-stage strategy has been proposed [9] <strong>for</strong><br />
transcription which combines automatic speech recognition<br />
(ASR) and a human transcriber. It consists in first manually<br />
generating an approximate transcript and then employing<br />
ASR to align this manually generated transcript to<br />
the speech signal, often correcting transcription errors in the<br />
process. This strategy is there<strong>for</strong>e the opposite of the approach<br />
we describe in this paper. Furthermore, it targets a<br />
different user group in that it is aimed at experienced human<br />
transcribers, who are able to manually produce reasonably<br />
accurate transcripts in real time. Our system, on the other<br />
hand, caters <strong>for</strong> the novice or occasional transcriber.<br />
3. DYNAMIC TRANSCRIPTION AND ERROR CORRECTION<br />
Unlike most audio editing and transcription tools [4], ours<br />
has a user interface that is based on a text editor metaphor<br />
augmented with components <strong>for</strong> visualisation and interactive<br />
manipulation of the input speech signal. The user<br />
“loads” the transcript onto the editor screen by selecting an<br />
audio file and running it through the recogniser. The system<br />
then allows the user to correct transcription errors manually.<br />
<strong>User</strong> feedback in the <strong>for</strong>m of error correction is then used to<br />
dynamically update the recogniser’s vocabulary and speaker<br />
model, thus reducing the likelihood of later occurrences of<br />
the same transcription errors during the recognition process.<br />
As the transcription task progresses user feedback repeatedly<br />
causes such dynamic updates, thus gradually reducing<br />
the number of transcription errors at each iteration and<br />
there<strong>for</strong>e the amount of manual intervention required.<br />
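The feedback loop described above can be illustrated with a deliberately simplified sketch. It does not update a recogniser's vocabulary or speaker model; instead, as a stand-in assumption, it records each user correction in a hypothetical `learned` mapping and rewrites later occurrences of the same misrecognised form. The function name, the mapping and the sample words are all illustrative and do not come from the prototype.

```python
def apply_correction(words, index, corrected, learned):
    """Record a user correction and propagate it to later words.

    `learned` maps a misrecognised form to the user's correction. It is a
    toy stand-in (an assumption) for updating the recogniser's vocabulary
    and speaker model and then re-scoring the remaining audio.
    Returns the indices after `index` that were changed by propagation.
    """
    learned[words[index]] = corrected
    words[index] = corrected
    changed = []
    for i in range(index + 1, len(words)):
        if words[i] in learned:
            words[i] = learned[words[i]]
            changed.append(i)
    return changed


# Hypothetical ASR output in which "Gregson" was misrecognised twice.
words = ["philip", "met", "greg", "and", "greg", "left"]
learned = {}
changed = apply_correction(words, 2, "Gregson", learned)
```

Each correction enlarges `learned`, so later passes require fewer manual interventions, which mirrors (in a much cruder form) the gradual error reduction described in the text.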
Figure 1: The user interface of the transcription editor prototype<br />
The speech transcript editor is shown in Figure 1. The editor<br />
shows the wave<strong>for</strong>m representation of the input signal<br />
above the transcription (Figure 1, b). Depending on the<br />
nature of the application (e.g. creation of rush transcripts<br />
versus creation of an aligned speech corpus), the user may<br />
reduce or increase the zooming level of the wave<strong>for</strong>m component<br />
(i.e. time scale and wave<strong>for</strong>m amplitude), and add<br />
speech and text boundary markers. These markers can be<br />
altered, word sequences can be merged into a single word,<br />
and words can be split into sequences. Words that have<br />
been assigned low confidence scores by the recogniser are<br />
highlighted on the screen. Selecting such words activates a<br />
menu of alternative transcriptions. The user can also select<br />
specific segments and play the portions of the speech<br />
recording associated with the selected segments. This has<br />
been carefully integrated with the keyboard commands to<br />
allow the user to smoothly move between the listen/check<br />
and transcript correction processes.<br />
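As a rough illustration of the editing model in this paragraph, the sketch below represents a time-aligned transcript as a list of tokens, flags low-confidence tokens for highlighting, and merges a word sequence into a single word. The `Token` fields, the 0.5 threshold and the function names are assumptions made for the example; the prototype's actual data structures are not described in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    text: str
    start: float       # start time in seconds
    end: float         # end time in seconds
    confidence: float  # recogniser confidence, assumed to lie in [0, 1]
    alternatives: List[str] = field(default_factory=list)


def low_confidence(tokens, threshold=0.5):
    """Indices of the tokens an editor would highlight (threshold is hypothetical)."""
    return [i for i, t in enumerate(tokens) if t.confidence < threshold]


def merge(tokens, first, last, text):
    """Replace tokens[first..last] with one token spanning their time range."""
    span = tokens[first:last + 1]
    merged = Token(text, span[0].start, span[-1].end, 1.0)  # user-confirmed
    return tokens[:first] + [merged] + tokens[last + 1:]


tokens = [
    Token("upper", 0.0, 0.3, 0.40, ["up"]),
    Token("hand", 0.3, 0.6, 0.45),
    Token("wins", 0.6, 1.0, 0.90),
]
highlighted = low_confidence(tokens)
tokens = merge(tokens, 0, 1, "upperhand")
```

Keeping the start and end times on every token is what allows the audio for any selected segment to be played back directly from the text.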
According to the classification of errors that occur in ASR-enabled<br />
and human transcription presented in [23], transcription<br />
errors can be due to: speaker mispronunciation<br />
(annunciation errors), out-of-vocabulary terms (dictionary<br />
errors), misrecognition of the appropriate <strong>for</strong>m of a word<br />
(suffix errors), substitution of homonyms (homonym errors),<br />
omission of spoken words (deletion errors), spelling errors,<br />
failure to capture the meaning of sentences in context (nonsense<br />
errors), and ambiguous expressions which could lead to<br />
misunderstanding on the part of the reader (critical errors).<br />
Our system incorporates distinct prevention and correction<br />
mechanisms to deal with these potential sources of errors.<br />
Prevention of critical, nonsense and deletion errors is<br />
facilitated by the running-text layout of the transcription,<br />
which helps the human transcriber visualise the broader context<br />
of the passages being transcribed. Correction of annunciation<br />
and dictionary errors is dealt with directly through<br />
the correction propagation functionality. Menu-based correction<br />
and auto-completion mechanisms attempt to minimise<br />
spelling, suffix and homonym errors. Annunciation,<br />
spelling and homonym error correction is also addressed<br />
through re-scoring of the recognition lattice or, in certain<br />
cases, re-recognition. If the file is large enough, user preferences<br />
<strong>for</strong> certain spelling variants can be recorded so that<br />
their scores can be boosted in the existing word lattices,<br />
potentially altering the transcript, or incorporated into the<br />
language model <strong>for</strong> future runs of the recogniser.<br />
Commercial human transcription services often find it helpful<br />
to be able to “clean up” the audio as well as the transcript<br />
by removing noise, speech dysfluencies, and sometimes even<br />
whole sentences [9]. Our speech transcript editor supports<br />
this type of task by offering the option of editing the audio<br />
at the same time as the transcript.<br />
Other <strong>for</strong>ms of support <strong>for</strong> audio file editing and annotation<br />
found in tools such as Audacity [3], Wavesurfer [20] and<br />
Transcriber [4] (e.g. support <strong>for</strong> various signal processing<br />
algorithms, spectrogram display, multilevel annotation etc)<br />
have not been implemented in the current version of the system.<br />
There<strong>for</strong>e this aspect of the tool has not been explicitly<br />
addressed in the evaluation reported below. However, we believe<br />
the interaction style we propose to be well suited<br />
to audio alignment and editing tasks such as those targeted<br />
by existing specialised audio annotation tools.<br />
4. EVALUATION<br />
A two-stage evaluation method was employed in order to<br />
assess the effectiveness of the user interface techniques implemented<br />
in the transcript editor. The first stage consisted<br />
of an analytical evaluation of the effects of propagating user<br />
corrections on transcription accuracy. This was followed by<br />
a second stage consisting of a series of user trials designed<br />
to assess how the new interface functionality is perceived<br />
by the users, as well as the potential impact of that functionality<br />
on the accuracy of the final transcript and on user<br />
per<strong>for</strong>mance at transcription tasks.<br />
The motivation <strong>for</strong> this two-stage strategy is admittedly<br />
practical. Since dynamic correction propagation can only<br />
be expected to have a measurable effect on relatively large<br />
transcripts, a meaningful empirical evaluation of this feature<br />
would require subjects to work on much longer tasks than<br />
the limited amount of time available <strong>for</strong> an experiment conducted<br />
in a usability laboratory would allow. We there<strong>for</strong>e<br />
decided to restrict the assessment of this feature to running<br />
a full-featured version of the speech transcript editor on a<br />
large audio file with focus on misrecognition due to pronunciation<br />
variants and out-of-vocabulary words. On the<br />
other hand, the assessment of the usability of error correction<br />
strategies implemented at the user interface level was<br />
per<strong>for</strong>med on a stripped-down version of the system which<br />
had the dynamic correction propagation disabled, in order<br />
to prevent potential effects of this feature from interfering<br />
with the results. A more comprehensive longitudinal study<br />
of the fully functional system in more realistic settings is<br />
planned as future work.<br />
[Figure 2 plot: accuracy (y-axis, 0.3 to 0.9) per speaker (x-axis: USF1, USF2, USM1, USM2, CANM, SCOM), be<strong>for</strong>e and after proper noun correction]<br />
4.1 Analytical evaluation: the effect of propagating<br />
user corrections<br />
The purpose of this preliminary evaluation exercise was to<br />
assess the effects of dynamic propagation of word additions<br />
on the accuracy of the system’s underlying speech recogniser.<br />
For this test we focused on the fact that one of the<br />
main difficulties speech recognisers face is recognising proper<br />
nouns, which tend to occur regularly in cases such as broadcast<br />
news, medical and legal applications. This happens because<br />
such nouns are often out-of-vocabulary words (dictionary<br />
errors) or vary greatly in pronunciation (certain types<br />
of spelling and homonym errors).<br />
For this experiment, six audio files were selected from the<br />
CMU ARCTIC databases 2 . The speech files chosen were<br />
produced by two US female speakers, two US male speakers,<br />
a Canadian male speaker and a Scottish male speaker. Each<br />
speech file contained 1,132 sentences, consisting of around<br />
10,000 words. Five proper nouns (“Philip”, “Saxon”, “Gregson”,<br />
“Jeanne”, “MacDougall”) were chosen which together<br />
occurred 76 times in each of the recordings (accounting <strong>for</strong><br />
0.8% of the transcribed content). One instance of each of<br />
these nouns from one of the US female speaker’s recording<br />
was used to train the system. In the case of the words<br />
“Gregson” and “Jeanne” which were not present in the system’s<br />
initial vocabulary, the respective entries and phonetic<br />
spellings had to be added during the test.<br />
Accuracy scores <strong>for</strong> the tested files and targeted vocabulary<br />
items per speaker be<strong>for</strong>e and after just four iterations of<br />
error correction are shown in Figure 2. Addition of out-of-vocabulary<br />
words, specification of alternative phonetic<br />
spellings and training of the recogniser on the selected words<br />
improves recognition across different speakers on average<br />
from 41% to 90%. This clearly indicates the benefits of<br />
adding words and training on even a single case of each<br />
word. Further investigation of the errors after word additions<br />
and training in this small test set shows that 100%<br />
accuracy can be achieved by additionally including the possessives<br />
<strong>for</strong> each of the missing nouns. While it is unlikely<br />
that such optimal outcomes will hold <strong>for</strong> speech recorded<br />
under less favourable conditions, the test nevertheless shows<br />
that correction propagation is a viable repair technique <strong>for</strong><br />
speech transcription.<br />
2 http://festvox.org/cmu_arctic/<br />
Figure 2: Accuracy of proper noun recognition by<br />
speaker be<strong>for</strong>e and after four error-correction iterations<br />
4.2 Usability evaluation<br />
In order to further assess the effectiveness and usability<br />
of the user interface features implemented, we designed a<br />
within-subjects experiment in which each subject was asked<br />
to transcribe two short audio recordings using alternately<br />
a baseline ASR-assisted transcription system and a transcription<br />
prototype with functionality similar to that of the<br />
system described in Section 3, except <strong>for</strong> the wave<strong>for</strong>m visualisation<br />
and error propagation capabilities. These different<br />
versions were simply identified to the users by the neutral<br />
terms “green editor” and “blue editor”, respectively, so as to<br />
avoid introducing a bias.<br />
The green editor consists of a plain text editor coupled to<br />
an audio playback component (Figure 3) which allows playing,<br />
pausing and jumping to different parts of the recording<br />
through a slider. The user transcription process is initiated<br />
when the user selects an audio file and presses the green arrow<br />
at the bottom of the editor’s window. The audio file<br />
is run through the speech recogniser and the recognition results<br />
appear on the editor pane, along with sentence markers<br />
and “noise phones” such as +COUGH+, +UH+ etc. The subjects<br />
were instructed to simply ignore those symbols. Once transcribed<br />
sentences start to appear on the editor pane, users<br />
can begin correcting them by listening to the audio (using<br />
the audio playback component) and freely editing the transcript<br />
using keyboard and (possibly) mouse.<br />
The functionality tested there<strong>for</strong>e encompassed: the editor-controllable<br />
playback feature, the highlighting of low-confidence<br />
segments, the drop-down menus <strong>for</strong> choosing<br />
among recognition alternatives, the auto-complete facility,<br />
the word merge and split functions and the keyboard shortcuts.<br />
As with the full-fledged prototype described in Section<br />
3, clicking on a highlighted segment (or selected sequence<br />
of segments) on the blue editor activates a menu<br />
which displays the recognition alternatives according to<br />
the current ASR word lattice as well as the possible actions<br />
allowed by the editor depending on the context. These actions<br />
include merging of selected words into a single word,<br />
splitting of a single word into a sequence of words and manual<br />
correction. Both the word splitting and the manual editing<br />
components have word completion capabilities which simultaneously<br />
access the recogniser’s vocabulary and the<br />
phonetic symbol table. The system also allows the user to<br />
enter their own phonetic spellings, overriding the initial suggestion,<br />
if needed.<br />
Figure 3: The green editor<br />
We hypothesised that the transcript correction strategy<br />
users would adopt when using the green editor would be<br />
to re-play the audio sequentially while inspecting the initial<br />
transcription results, stop the audio playback to make<br />
corrections as needed, resume the playback, and so on.<br />
The blue editor (Figure 4) implements essentially the same<br />
functionality as the fully-fledged system, except <strong>for</strong> dynamic<br />
error propagation and audio wave<strong>for</strong>m display, as mentioned<br />
above. Display of aligned wave<strong>for</strong>ms has been omitted because<br />
this feature targets more specialised user groups, such<br />
as linguists and speech scientists, and would find no natural<br />
counterpart in the baseline (green) transcriber. Although<br />
the wave<strong>for</strong>m components are not displayed, the blue editor<br />
can display timestamps around each token, as shown in Figure<br />
4, thus providing feedback on how speech and text are<br />
actually aligned. Similarly, the word splitting component<br />
(Figure 5) allows the user to find the correct alignment by<br />
moving a slider along the timeline below the word segments.<br />
The system initially attempts to guess the relative length<br />
of each typed subsequence by looking up a table of phonetic<br />
symbols and allocating time proportionally to the duration<br />
of the symbol sequence. Where precise alignment to<br />
the audio signal is needed, the user can define the sequence<br />
alignment by adjusting it through the horizontal slider while<br />
replaying the audio. For this trial, however, users were told<br />
that alignment was not part of the task and that these features<br />
could be ignored. Accordingly, the editor used in the<br />
trials was set up so as to hide the timestamps.<br />
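The initial guess made by the word splitting component can be sketched as follows. The text says time is allocated in proportion to the duration of each part's phonetic symbol sequence; this example uses the number of symbols as a proxy for that duration, which is a simplifying assumption. The `phones` lexicon, the function name and the sample times are hypothetical.

```python
def split_span(start, end, parts, phones):
    """Divide the time span [start, end] among the words in `parts`.

    `phones` maps each word to its list of phonetic symbols (hypothetical
    lexicon). Each word receives time proportional to its symbol count,
    used here as a proxy for the duration of the symbol sequence.
    Returns a list of (word, start, end) tuples.
    """
    counts = [len(phones[p]) for p in parts]
    total = sum(counts)
    result, t = [], start
    for part, n in zip(parts, counts):
        nxt = t + (end - start) * n / total
        result.append((part, t, nxt))
        t = nxt
    # Pin the final boundary to `end` to avoid floating-point drift.
    word, seg_start, _ = result[-1]
    result[-1] = (word, seg_start, end)
    return result


phones = {"upper": ["AH", "P", "ER"], "hand": ["HH", "AE", "N", "D"]}
segments = split_span(1.0, 2.4, ["upper", "hand"], phones)
```

A user who needs precise alignment would then adjust the boundary between the segments, which corresponds to moving the slider in the word split component.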
Figure 4: The blue editor<br />
In contrast to the green editor, we expected users to adopt<br />
a less sequential approach to transcript correction when using<br />
the blue editor. A non-sequential correction strategy is<br />
one where the user selects target areas <strong>for</strong> correction based<br />
on their salience (e.g. words highlighted as low-confidence<br />
recognition results by the ASR system) rather than proofreading<br />
from beginning to end sentence by sentence. That<br />
users would adopt such a strategy seems to be naturally<br />
suggested by the facts that: (1) the words scoring lowest<br />
according to the ASR are highlighted on the screen, thus<br />
indicating that alternative suggestions can be selected from<br />
a menu and (2) the audio can be easily played back <strong>for</strong> any<br />
arbitrarily selected text segment.<br />
Figure 5: The word split component (blue editor)<br />
All trials were conducted in a computer lab, using the same<br />
hardware configuration. A total of 14 subjects took part in<br />
the evaluation. Twelve of these subjects were undergraduate<br />
students and the remaining two were post-doctoral researchers,<br />
all native speakers of English. They were given<br />
headphones and two printed documents containing brief descriptions<br />
of the blue and green editors. The trial started<br />
with an explanation of the task and a short tutorial on how<br />
to operate each version of the software. The subjects were<br />
then given a short audio file to transcribe using both editors<br />
to become familiar with their functionality. This familiarisation<br />
phase used a sample audio file different from the audio<br />
file used <strong>for</strong> the actual trial. Once the users felt com<strong>for</strong>table<br />
enough with the software they were asked to transcribe<br />
one audio file using the green editor and another using the<br />
blue editor. The order in which the editors were to be used<br />
and the audio file to be used with each editor were selected<br />
randomly <strong>for</strong> each trial so as to minimise potential order<br />
effects.<br />
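One way to carry out the random assignment described above is sketched below. The editor and file labels are placeholders, and the paper does not say how the randomisation was actually implemented, so this is only an illustration of the idea.

```python
import random


def assign_conditions(editors, audio_files, rng=random):
    """Randomly choose the editor order and the audio file paired with each editor.

    Returns a list of (editor, audio_file) pairs in the order they are to be used.
    """
    order = list(editors)
    files = list(audio_files)
    rng.shuffle(order)
    rng.shuffle(files)
    return list(zip(order, files))


# Placeholder labels; a seeded generator makes a session plan reproducible.
plan = assign_conditions(["green", "blue"], ["file_a", "file_b"], random.Random(2008))
```

Shuffling the editors and the files independently means that, across subjects, neither editor is systematically used first or paired with one particular recording.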
We assessed transcription task accuracy <strong>for</strong> these trials in<br />
terms of word error rate (WER), macro precision ($\pi^M$) and<br />
macro recall ($\rho^M$) scores. The WER score is defined as the<br />
ratio between the minimum number of (word) insertions,<br />
deletions and substitutions needed to convert the transcript<br />
into its aligned reference text (Levenshtein distance) and<br />
the total number of words in the reference text. Alternative<br />
spellings of certain reference words and phrases were allowed<br />
in the user-produced transcripts during alignment to avoid<br />
arbitrarily penalising legitimate spellings entered through<br />
free typing (green editor) as opposed to menus. Examples<br />
of such allowed alternatives include: “thorough going” and<br />
“thorough-going”, “upper hand” and “upperhand”, etc. The<br />
precision and recall scores were also included in order to provide<br />
a more nuanced view of user per<strong>for</strong>mance, following the<br />
approach proposed in [13]. Precision is defined <strong>for</strong> a single<br />
word as the proportion of occurrences of that word in the<br />
transcript which matches the corresponding word positions<br />
in the reference. Formally, $\pi_i = \frac{|T_i \cap R_i|}{|T_i|}$, where $T_i$ and $R_i$ are<br />
the sets of positions occupied by word i in the transcript and<br />
reference text respectively. Conversely, recall indicates the<br />
proportion of the reference that is correctly matched in the<br />
transcribed text, or $\rho_i = \frac{|T_i \cap R_i|}{|R_i|}$. The global macro scores<br />
are obtained by averaging over the scores <strong>for</strong> each word in<br />
the vocabulary of the transcript (<strong>for</strong> $\pi^M$) and reference (<strong>for</strong><br />
$\rho^M$).<br />
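The three measures defined above can be sketched in a few lines. The WER function computes a word-level Levenshtein distance and divides it by the reference length. The precision and recall function assumes that the transcript and the reference have already been aligned position by position into lists of equal length, with `None` marking a gap; that alignment step, the function names and the sample sentences are assumptions of this example, and the handling of allowed alternative spellings is omitted.

```python
from collections import defaultdict


def wer(transcript, reference):
    """Word error rate: word-level Levenshtein distance / reference length."""
    prev = list(range(len(reference) + 1))
    for i, t in enumerate(transcript, 1):
        cur = [i]
        for j, r in enumerate(reference, 1):
            cur.append(min(prev[j] + 1,              # delete a transcript word
                           cur[j - 1] + 1,           # insert a reference word
                           prev[j - 1] + (t != r)))  # substitute or match
        prev = cur
    return prev[-1] / len(reference)


def macro_precision_recall(transcript, reference):
    """Macro-averaged precision and recall over position-aligned word lists.

    Both lists are assumed to have equal length; None marks an alignment gap.
    """
    t_pos, r_pos = defaultdict(set), defaultdict(set)
    for k, (t, r) in enumerate(zip(transcript, reference)):
        if t is not None:
            t_pos[t].add(k)
        if r is not None:
            r_pos[r].add(k)
    # Precision averages over the transcript vocabulary, recall over the reference's.
    precision = sum(len(p & r_pos.get(w, set())) / len(p)
                    for w, p in t_pos.items()) / len(t_pos)
    recall = sum(len(p & t_pos.get(w, set())) / len(p)
                 for w, p in r_pos.items()) / len(r_pos)
    return precision, recall


reference = "the cat sat on the mat".split()
transcript = "the cat sit on the mat".split()
error_rate = wer(transcript, reference)
precision, recall = macro_precision_recall(transcript, reference)
```

Because the macro scores average per-word values, a rare word that is always wrong lowers the score as much as a frequent one, which is why they complement the WER.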
The two audio files used in this experiment were selected<br />
from the Vox<strong>for</strong>ge corpus 3 and assessed as being of similar<br />
length and difficulty. They contained readings of English<br />
texts by a native speaker. The total numbers of words in<br />
the reference transcripts <strong>for</strong> these files were 288 and 278.<br />
The initial word error rates (WER) <strong>for</strong> ASR-generated transcripts<br />
<strong>for</strong> these files (i.e. prior to user intervention) were<br />
23.5% and 21.5% .<br />
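The WER figures quoted here follow the definition given earlier: word-level Levenshtein distance divided by the number of words<br />
in the reference. The sketch below shows one way such a score could be computed. It is an illustration under simplifying<br />
assumptions, not the scoring code used in the study: both texts are taken to be already tokenised into word lists, and the<br />
allowance for alternative spellings described above is not modelled.<br />

```python
def word_error_rate(transcript, reference):
    """Word-level Levenshtein distance divided by the reference length.

    transcript, reference: lists of word tokens.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the transcript words seen
    # so far (before the current one) and the first j reference words.
    prev = list(range(len(reference) + 1))
    for i, t_word in enumerate(transcript, start=1):
        curr = [i]
        for j, r_word in enumerate(reference, start=1):
            substitution = prev[j - 1] + (0 if t_word == r_word else 1)
            drop_transcript_word = prev[j] + 1
            add_reference_word = curr[j - 1] + 1
            curr.append(min(substitution, drop_transcript_word, add_reference_word))
        prev = curr
    return prev[-1] / len(reference)


# Toy data, not taken from the experiment: one substitution in six words.
hyp = "the cat sat on a mat".split()
ref = "the cat sat on the mat".split()
print(f"{word_error_rate(hyp, ref):.1%}")
```

Because the denominator is the reference length, a transcript with many spurious insertions can score above 100% under this definition.<br />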
Users were allowed to spend as much time as they needed<br />
on the task. They were asked to try to transcribe as accurately<br />
as possible, and instructed to press the save button<br />
when they were satisfied with the transcript they had<br />
produced. The trials were visually monitored by the experimenter.<br />
Transcription times were recorded automatically by<br />
the system and accuracy statistics were compiled at the end<br />
of the experiments. At the end of each session, users were<br />
asked to fill out an evaluation questionnaire, and this was<br />
followed by a brief interview. The questionnaire included<br />
questions that asked the users to rate (on a 5-point Likert<br />
scale ranging from “strongly disagree” to “strongly agree”)<br />
certain features of the blue editor in terms of helpfulness<br />
and usefulness, compared to the green editor. These questions<br />
are shown in Table 1.<br />
³ http://www.voxforge.org/<br />
Table 1: Questionnaire regarding the usability of<br />
transcript correction mechanisms<br />
1. The ability to choose between alternative words<br />
from a menu was helpful<br />
2. Being able to play back the audio for selected<br />
parts of the text was helpful<br />
3. Using the menu-based editing in the blue editor<br />
was as efficient as typing freely in the green<br />
editor<br />
4. The auto-complete function was helpful<br />
5. I found the keyboard shortcuts in the blue editor<br />
beneficial<br />
6. I enjoyed using the blue version<br />
7. I enjoyed using the green version<br />
In addition, users were asked to estimate their usage of keyboard<br />
shortcuts and the menu <strong>for</strong> correcting words, and to<br />
indicate which of the editors they preferred using, if any.<br />
[Figure 6 plot not reproduced. Axis ticks: 0 to 10. Response categories: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree. Bar labels: q0, q1, q2, q3, q4, q7, q8.]<br />
Figure 6: Answers to the questionnaire regarding<br />
the usability of error correction mechanisms.<br />
4.2.1 Results<br />
The majority of users expressed a preference for the blue<br />
editor: 8 users (57.1%) stated that they preferred using the<br />
blue editor, while only 4 users (28.6%) preferred the green<br />
editor, with two users not expressing any preference. These<br />
overall preferences are consistent with the users’ assessment<br />
of the usability of individual components. The responses to<br />
the questions listed in Table 1 are summarised in Figure 6.<br />
The numbers at the bottom of the graph correspond to the<br />
question numbers shown in the aforementioned table.<br />
Figure 6 shows that the ability to play back the audio for<br />
selected parts of the text (question 2) was the feature the<br />
users found most valuable. The pop-up menus and keyboard<br />
shortcuts (questions 1 and 5) were also positively evaluated.<br />
Users were divided as to whether using menu-based editing<br />
in the blue editor was as efficient as typing freely in the<br />
green editor (question 3), with a trend towards disagreement.<br />
Perhaps surprisingly, most users were neutral with<br />
respect to the auto-complete functionality. The following is<br />
typical of the comments made by the users: “I felt the Blue<br />
version was more efficient than typing freely in the Green<br />
editor. Both editors were useful - but I preferred the Blue<br />
with right menu options. I found the keyboard shortcuts<br />
less useful than expected - but I expect I would use them<br />
more, with practice.”<br />
User observations showed that, as we had hypothesised, the<br />
preferred strategy adopted by users when using the green<br />
editor was to play the audio and correct sequentially. Contrary<br />
to our expectation, however, a predominantly sequential<br />
strategy was also adopted when the blue editor was used.<br />
This is probably because, not having heard the<br />
audio recording prior to the beginning of the transcript correction<br />
task, users could not assess how reliable the ASR-generated<br />
transcript was to begin with, and therefore were<br />
unable to take full advantage of the visual cues and random<br />
audio access facilities provided by the blue editor. This may also<br />
explain why the users were slightly faster when using<br />
the green editor, despite the fact that most users preferred<br />
the blue editor.<br />
Table 2 summarises the performance results in terms of average<br />
word error rate (WER) for the final transcripts, the<br />
percentage of errors corrected per editor, the macro precision<br />
and recall scores, and the transcription speeds expressed<br />
in words per minute (WPM). As regards task performance,<br />
the results indicate that the user interface components<br />
employed by the blue editor produced more accurate final<br />
transcripts overall, at the expense of transcription time. Furthermore,<br />
the larger difference between editors in the $\pi_M$ scores (0.85 against 0.76)<br />
than in the $\rho_M$ scores (0.87 against 0.85) indicates<br />
that most remaining errors in transcripts produced with the<br />
green editor were due to user failure in spotting misrecognitions.<br />
Table 2: User performance comparison<br />
Measure | blue editor | green editor<br />
Text 1<br />
WER | 4.49% | 7.83%<br />
% corrected | 82% | 68%<br />
speed (WPM) | 22.4 | 28.4<br />
Text 2<br />
WER | 3.81% | 3.30%<br />
% corrected | 86.4% | 88.3%<br />
speed (WPM) | 24.2 | 30.1<br />
Total<br />
WER | 4.30% | 5.89%<br />
$\pi_M$ | 0.85 | 0.76<br />
$\rho_M$ | 0.87 | 0.85<br />
% corrected | 83.4% | 76.7%<br />
speed (WPM) | 23.3 | 29.1<br />
The speed results can be partially explained by the task<br />
setup, which precluded listening to the audio while ASR<br />
was taking place, as explained above. Another contributing<br />
factor, however, is the perceived “ease” of one of the texts<br />
(text 2), which required fewer edits and therefore less menu-based<br />
interaction. As Table 2 and Figure 7 show, users<br />
editing text 2 with the green editor in general took less time<br />
than those who used the blue editor, even though the error<br />
rates for the former were quite spread out while those for the latter tended<br />
to cluster towards the lower end of the error scale. For text<br />
1, the completion time differences were less evident but the<br />
blue editor was clearly superior in terms of accuracy.<br />
[Figure 7 plot not reproduced. Series: text 1, blue; text 2, blue; text 1, green; text 2, green. Axis ticks: 0.3 to 1.0 and 0.01 to 0.07.]<br />
Figure 7: Error rates per user (after user editing)<br />
against task completion time expressed as ratio of<br />
completion time to maximum completion time per<br />
text.<br />
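The time axis described in this caption is a per-text normalisation: each user's completion time is divided by the longest<br />
completion time recorded for the same text. A minimal sketch of that calculation follows. The data layout, the function name<br />
and the sample values are assumptions made for illustration and do not come from the experiment.<br />

```python
def normalise_completion_times(times_by_text):
    """Divide each completion time by the maximum time recorded for its text.

    times_by_text: {text_id: {user_id: completion time in seconds}}
    Returns the same structure with ratios in the interval (0, 1].
    """
    ratios = {}
    for text_id, times in times_by_text.items():
        if not times:
            continue  # no trials recorded for this text
        longest = max(times.values())
        if longest <= 0:
            raise ValueError("completion times must be positive")
        ratios[text_id] = {user: t / longest for user, t in times.items()}
    return ratios


# Hypothetical timings in seconds.
sample = {
    "text 1": {"u1": 600, "u2": 450},
    "text 2": {"u3": 500, "u4": 400},
}
print(normalise_completion_times(sample))
```

Normalising within each text keeps the two texts comparable on a single axis even if one took longer to transcribe than the other.<br />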
5. CONCLUSION<br />
The results of evaluation show that the error correction<br />
mechanisms implemented by the prototypes described in<br />
this paper (and related systems) can potentially improve<br />
user performance at computer-assisted transcription tasks.<br />
Dynamic propagation of corrections can improve performance<br />
by modifying the working transcript continuously as<br />
error correction feedback is received by the system. Other<br />
user interface improvements such as menu-based correction<br />
(coupled with keyboard shortcuts) and text-driven playback<br />
are also well received by users and likely to result in more<br />
accurate transcripts.<br />
However, the results of usability evaluation also suggest<br />
that certain design improvements need to be made in order<br />
for the new functionality (blue editor) to match the naturalness<br />
of free typing (green editor) <strong>for</strong> transcript correction<br />
tasks. This is indicated by the perception expressed<br />
by some users (through the questionnaires and interviews)<br />
that menus and other mouse-activated components tended<br />
to slow them down while producing little benefit. For users<br />
with this sort of profile, which is likely to include expert<br />
users and professional transcribers, interface designs that<br />
allow for keyword-controlled menus activated automatically<br />
as the cursor moves over low-scoring words and merge-split<br />
functions embedded in the editor pane might help increase<br />
transcription speed.<br />
We plan to conduct further experiments with a more focused<br />
user group and task (e.g. speech corpus production<br />
[4]) in order to evaluate the segmentation, alignment and<br />
waveform display features. We have also devised alternative<br />
forms of animated visualisation for transcripts (described in<br />
detail elsewhere [12]) which we intend to incorporate into<br />
the transcription editor in order to encourage users to adopt<br />
a non-sequential correction strategy, which could be particularly<br />
useful in contexts where transcripts serve an information<br />
retrieval purpose in conjunction with other audiovisual<br />
sources [16].<br />
6. ACKNOWLEDGEMENTS<br />
S. Luz’s research was part-funded by Science Foundation<br />
Ireland through the CNGL/CSET grant.<br />
7. REFERENCES<br />
[1] W. A. Ainsworth and S. R. Pratt. Feedback strategies<br />
for error correction in speech recognition systems.<br />
International Journal of Man-Machine Studies,<br />
36(6):833–842, June 1992.<br />
[2] A. Artegian. The technology-augmented court record.<br />
In Proceedings of the Fifth National Court Technology<br />
Conference, 1997.<br />
[3] Audacity sound editor. http://audacity.sf.net/.<br />
accessed 2nd July 2008.<br />
[4] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman.<br />
Transcriber: development and use of a tool for<br />
assisting speech corpora production. Speech<br />
Communication, 33(1-2):5–22, 2001.<br />
[5] S. Borowitz. Computer-based speech recognition as an<br />
alternative to medical transcription. Journal of the<br />
American Medical Informatics Association, 8:101–102,<br />
2001.<br />
[6] M.-M. Bouamrane and S. Luz. An analytical<br />
evaluation of search by content and interaction<br />
patterns on multimodal meeting records. Multimedia<br />
Systems, 13(2):89–102, 2007.<br />
[7] J. Foote. An overview of audio information retrieval.<br />
Multimedia Systems, 7(1):2–10, 1999.<br />
[8] C. A. Halverson, D. B. Horn, C.-M. Karat, and<br />
J. Karat. The beauty of errors: Patterns of error<br />
correction in desktop speech systems. In Proceedings<br />
of INTERACT’99: Human-Computer Interaction,<br />
pages 133–140, 1999.<br />
[9] T. Hazen. Automatic alignment and error correction<br />
of human generated transcripts for long speech<br />
recordings. In Proceedings of Interspeech’06, pages<br />
1606–1609, Pittsburgh, Pennsylvania, 2006.<br />
[10] K. S. Hone and C. Baber. Modelling the effects of<br />
constraint upon speech-based human-computer<br />
interaction. International Journal of Human-Computer<br />
Studies, 50(1):85–107, 1999.<br />
[11] P. Liu and F. K. Soong. Word graph based speech<br />
recognition error correction by handwriting input. In<br />
ICMI ’06: Proceedings of the 8th international<br />
conference on Multimodal interfaces, pages 339–346,<br />
New York, NY, USA, 2006. ACM Press.<br />
[12] S. Luz, M. Masoodian, and B. Rogers. Interactive<br />
visualisation techniques for dynamic speech<br />
transcription, correction and training. In Procs. of the<br />
9th ACM SIGCHI-NZ annual Conf. on<br />
Computer-Human Interaction, pages 9–16. ACM<br />
Press, 2008.<br />
[13] I. McCowan, D. Moore, J. Dines, D. Gatica-Perez,<br />
M. Flynn, P. Wellner, and H. Bourlard. On the use of<br />
information retrieval measures for speech recognition<br />
evaluation. Technical Report IDIAP-RR-04-73,<br />
IDIAP, 2004.<br />
[14] D. N. Mohr, D. W. Turner, G. R. Pond, J. S. Kamath,<br />
C. B. De Vos, and P. C. Carpenter. Speech recognition<br />
as a transcription aid: A randomized comparison with<br />
standard transcription. Journal of the American<br />
Medical Informatics Association, 10(1):85–93, 2003.<br />
[15] MPI. ELAN: EUDICO Linguistic Annotator. Max<br />
Planck Institute for Psycholinguistics, March 2005.<br />
http://www.mpi.nl/tools/elan.html.<br />
[16] C. Munteanu, R. Baecker, and G. Penn. Collaborative<br />
editing <strong>for</strong> improved usefulness and usability of<br />
transcript-enhanced webcasts. In CHI ’08: Proceeding<br />
of the twenty-sixth annual SIGCHI conference on<br />
Human factors in computing systems, pages 373–382.<br />
ACM Press, 2008.<br />
[17] H. Nanjo and T. Kawahara. Towards an efficient<br />
archive of spontaneous speech: Design of<br />
computer-assisted speech transcription system. The<br />
Journal of the Acoustical Society of America,<br />
120:3042, 2006.<br />
[18] D. D. Palmer, M. Ostendorf, and J. D. Burger. Robust<br />
in<strong>for</strong>mation extraction from automatically generated<br />
speech transcriptions. Speech Communication,<br />
32(1-2):95–109, 2000.<br />
[19] D. Rosenthal, F. Chew, D. Dupuy, S. Kattapuram,<br />
W. Palmer, R. Yap, and L. Levine. Computer-based<br />
speech recognition as a replacement for medical<br />
transcription. American Journal of Roentgenology,<br />
170(1):23–5, 1998.<br />
[20] K. Sjölander and J. Beskow. Wavesurfer - an open<br />
source speech tool. In Proceedings of the 6th<br />
International Conference on Spoken Language<br />
Processing. ISCA, 2000.<br />
[21] B. Suhm, B. Myers, and A. Waibel. Multimodal error<br />
correction for speech user interfaces. ACM Trans.<br />
Comput.-Hum. Interact., 8(1):60–98, 2001.<br />
[22] C. L. Wayne. Topic detection and tracking (TDT):<br />
Overview and perspectives. In Proceedings of the<br />
DARPA Broadcast News Transcription and<br />
Understanding Workshop, Lansdowne Conference<br />
Resort, Lansdowne, Virginia, USA, Feb. 1998.<br />
[23] A. Zafar, B. Mamlin, S. Perkins, A. M. Belsito, J. M.<br />
Overhage, and C. J. McDonald. A simple error<br />
classification system for understanding sources of error<br />
in automatic speech recognition and human<br />
transcription. International Journal of Medical<br />
Informatics, 73:719–730, Sep 2004.<br />