Oscillations, Waves, and Interactions - GWDG


3.4 Transfer to running speech

Speech research 29

The voice analysis was originally based on stationary vowels. In clinical diagnostics, however, a voice analysis from continuous speech is required in order to assess the vocal disease objectively under normal stress and to treat it optimally. Stationary phonation corresponds more to a singing voice than to the more natural running speech. For a comprehensive description of voice quality, the analysis of running speech is therefore an essential extension of the analysis of stationary phonation. The methods of vowel analysis should be partially transferable to voiced intervals in running speech. For this purpose, a method was developed to recognize such intervals automatically. The main difficulty was that linguistically voiced sounds are not necessarily realized as voiced in the case of strong voice disorders.

3.4.1 Determination of voiced and unvoiced intervals

A voiced/unvoiced classification by (e.g.) zero-crossing and correlation techniques applied directly to the speech signal would, for strongly disturbed voices, recognize too few voiced intervals. A consideration of the spectral envelope (formant structure), which depends little on the actual glottal excitation, is therefore preferable. The method uses a 3-layer perceptron with sigmoid activation function (values 0 to 1) as classifier. The template vectors for its input were formed as follows (numbers refer to Fig. 3):

The speech signals, digitized at 48 kHz, were downsampled to 12 kHz and decomposed into overlapping Hann-windowed 40 ms intervals with a 10 ms frame shift (3,4). Pauses are eliminated based on an empirical energy threshold. An LPC analysis of 12th order (autocorrelation method; preemphasis 0.9735) yields a model spectrum (5), which is converted to 19 critical bands (Bark scale) by summation in overlapping trapezoidal windows (6). The result is compressed with exponent 0.23 and normalized by its maximum over time and critical bands (7,8). The LPC order and method were optimized to yield minimal misclassification.
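The feature-extraction steps above can be sketched in Python. This is a minimal sketch using only numpy: the input is assumed to be already downsampled to 12 kHz, the pause-elimination step is omitted, and the overlapping trapezoidal Bark windows are simplified to rectangular per-band summation; all function names are illustrative.

```python
import numpy as np

FS = 12000         # sampling rate after downsampling (Hz)
FRAME_LEN = 480    # 40 ms at 12 kHz
FRAME_SHIFT = 120  # 10 ms frame shift
LPC_ORDER = 12
PREEMPHASIS = 0.9735
N_BANDS = 19
NFFT = 512

def hz_to_bark(f):
    # Zwicker's approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def lpc_autocorr(frame, order):
    """LPC coefficients by the autocorrelation method (Levinson-Durbin)."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[1:i][::-1])) / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def bark_spectrum(frame):
    """LPC model spectrum of one frame, summed into 19 Bark bands."""
    a = lpc_autocorr(frame * np.hanning(len(frame)), LPC_ORDER)
    spec = 1.0 / (np.abs(np.fft.rfft(a, NFFT)) ** 2 + 1e-12)
    freqs = np.fft.rfftfreq(NFFT, 1.0 / FS)
    band_of_bin = np.minimum(hz_to_bark(freqs).astype(int), N_BANDS - 1)
    return np.array([spec[band_of_bin == b].sum() for b in range(N_BANDS)])

def template_vectors(signal):
    """Preemphasis, framing, Bark spectra, compression, global normalization."""
    x = np.append(signal[0], signal[1:] - PREEMPHASIS * signal[:-1])
    frames = [x[i:i + FRAME_LEN]
              for i in range(0, len(x) - FRAME_LEN + 1, FRAME_SHIFT)]
    spectra = np.array([bark_spectrum(f) for f in frames]) ** 0.23
    return spectra / spectra.max()  # maximum over time and critical bands
```

For one second of signal this yields 97 frames, each a 19-element template vector of compressed band values in [0, 1].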

The optimal values of the perceptron parameters (number of hidden cells, learning rate, classification threshold, number of iterations, training material) were determined in extensive experiments (6750 different cases, about 12000 spectra). Twelve hidden cells worked best. As the training method for the perceptron (9), an accelerated backpropagation [24] was employed with learning rate 0.01 and momentum term 0.8. The classification threshold at the output is 0.45; the desired net outputs for training are 0.1 (unvoiced) and 0.9 (voiced). The weights are initialized with random numbers in the range 0 to 1. Three perceptrons with different initial weights were used in parallel, and their recognition scores were averaged.
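The classifier stage can be sketched as follows. This is a minimal numpy sketch, not the original implementation: it uses the parameters stated above (a 19-12-1 sigmoid net, learning rate 0.01, momentum 0.8, random initial weights in [0, 1), targets 0.1/0.9, averaging three nets, threshold 0.45), but replaces the accelerated backpropagation of [24] with plain batch gradient descent with momentum; class and function names are illustrative.

```python
import numpy as np

class Perceptron3:
    """19-12-1 sigmoid perceptron trained with momentum backpropagation."""

    def __init__(self, n_in=19, n_hidden=12, lr=0.01, momentum=0.8, seed=0):
        rng = np.random.default_rng(seed)
        # weights initialized with random numbers in the range 0 to 1
        self.w1, self.b1 = rng.random((n_in, n_hidden)), rng.random(n_hidden)
        self.w2, self.b2 = rng.random((n_hidden, 1)), rng.random(1)
        self.lr, self.mu = lr, momentum
        self.v = [np.zeros_like(p) for p in (self.w1, self.b1, self.w2, self.b2)]

    @staticmethod
    def _sig(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(self, x):
        self.h = self._sig(x @ self.w1 + self.b1)   # hidden layer
        return self._sig(self.h @ self.w2 + self.b2)  # output in (0, 1)

    def train_step(self, x, target):
        o = self.forward(x)
        # squared-error backpropagation through the sigmoids
        do = (o - target) * o * (1.0 - o)
        dh = (do @ self.w2.T) * self.h * (1.0 - self.h)
        grads = [x.T @ dh, dh.sum(0), self.h.T @ do, do.sum(0)]
        for i, (p, g) in enumerate(zip((self.w1, self.b1, self.w2, self.b2), grads)):
            self.v[i] = self.mu * self.v[i] - self.lr * g  # momentum update
            p += self.v[i]

def classify_voiced(nets, x, threshold=0.45):
    """Average the recognition scores of several nets, then threshold."""
    score = np.mean([net.forward(x) for net in nets], axis=0)
    return score >= threshold
```

In use, three instances with different seeds would be trained on the labeled Bark spectra and queried jointly through `classify_voiced`.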

Since our own speech data were unlabeled, they could not serve for training. Instead, the training set initially consisted of 32 phonetically segmented texts (16 readings of "Nordwind und Sonne", 16 of "Berlingeschichte") from 16 different normal speakers in the German Phondat database, a total of 154550 labeled Bark spectra, excluding pauses. The training used up to 5000 iterations. For testing, different subsets of 30 of the 32 texts were used for training and the remaining two for testing. The error score amounted to 4.8%. With the above threshold (0.45), only 25% of these errors are false voiced classifications; a false unvoiced classification is less detrimental. As mis-
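The rotating 30/2 train/test scheme can be sketched as follows. This sketch assumes the 32 texts are held out in disjoint pairs, which the text does not specify; `rotating_splits` is an illustrative name.

```python
def rotating_splits(n_texts=32, n_test=2):
    """Yield (train, test) index lists: each rotation holds out one pair of
    texts and trains the perceptron on the remaining 30."""
    idx = list(range(n_texts))
    for start in range(0, n_texts, n_test):
        yield idx[:start] + idx[start + n_test:], idx[start:start + n_test]
```

Each of the 16 rotations trains on 30 texts and scores the two held-out ones, so every text is tested exactly once.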
