Classification Techniques used in Speech Recognition Applications: A Review
M. A. Anusuya*1, S. K. Katti*2
*Department of Computer Science and Engineering
SJCE, Mysore, INDIA
1 anusuya_ma@yahoo.co.in, 2 skkatti@indiatimes.com
Abstract—The classification phase is one of the most active research and application areas of speech recognition. The literature is vast and growing. This paper summarizes some of the most important developments in the classification procedures used in speech recognition applications and presents the state of the art of classification techniques. Different classification techniques and their parameter estimation methods, properties, advantages and disadvantages, along with their application areas, are discussed for each classification method. Our purpose is to provide a synthesis of the published research in the area of speech recognition and to stimulate further research interest and effort in the identified topics. This paper presents an overview of several pattern classification methods available in the literature for speech recognition applications.
Keywords—Classification, Classifiers, Taxonomy, Bayes decision theory, Acoustic Phonetic approach, Template matching, Dynamic Time Warping (DTW), Vector Quantization (VQ), Hidden Markov Model (HMM), Artificial Neural Network (ANN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gaussian Mixture Modeling, Clustering techniques, Evaluations, Applications.
I. INTRODUCTION
CLASSIFICATION is one of the most frequently encountered decision-making tasks of human activity [1]. A classification problem arises when an object needs to be assigned to a predefined group or class on the basis of a number of observed attributes of that object. Many problems in business, science, industry and medicine can be treated as classification problems. The goal of this paper is to survey the core concepts and techniques of classification and analysis, a field with its roots in statistics and decision theory. Although significant progress has been made in classification and the related areas of speech recognition, a number of issues in applying classification techniques remain and have not been solved completely. In this paper, some theoretical as well as empirical issues of speech recognition classification methods are reviewed and discussed. The vast range of research topics and the extensive literature make it impossible for one review to cover all of the work in the field. This review aims to provide a summary of the most important advances in general classification methods.
Pattern recognition techniques are used to automatically classify physical objects (1D, 2D or 3D) or abstract multidimensional patterns (n points in d dimensions) into known or possibly unknown categories. A number of commercial pattern recognition systems exist for speech recognition, character recognition, handwriting recognition, document classification, fingerprint classification, speech and speaker recognition, white blood cell (leukocyte) classification and military target recognition, among others. Most machine vision systems employ pattern recognition techniques to identify objects for sorting, inspection and assembly. The most widely used classifiers are the nearest-neighbour rule, kernel methods such as SVM, KNN algorithms, Gaussian mixture modeling, the naïve Bayes classifier and decision trees.
1.1. Classification method design:
Classification is the final stage of pattern recognition. This is the stage where an automated system declares that the input object belongs to a particular category. There are many classification methods in the field. Classification method designs are based on the following concepts.
i) Classification
Assigning a class to a measurement, or equivalently, identifying the probabilistic source of a measurement. The only statistical model that is needed is the conditional model of the class variable given the measurement. This conditional model can be obtained from a joint model or it can be learned directly. The former approach is generative, since it models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models. The latter approach is discriminative, since it focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbour classifiers.
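As a compact illustration of the two strategies (the notation below is introduced here for exposition and does not appear in the original text): a generative classifier models the class-conditional density p(x|C) and the prior P(C) and classifies through Bayes' rule, whereas a discriminative classifier such as logistic regression models the posterior P(C|x) directly.

```latex
% Generative: model each class, then invert with Bayes' rule
\hat{C}(x) \;=\; \arg\max_{C} \; p(x \mid C)\, P(C)

% Discriminative (two-class logistic regression): model the posterior directly
P(C_1 \mid x) \;=\; \sigma\!\left(w^{\top} x + b\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```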
ii) Model selection
Choosing the parametric family for density estimation is an important part of model selection. This is harder than parameter estimation, since we have to take into account every member of each family in order to choose the best family.
a. Member-roster concept: Under this template-matching concept, a set of patterns belonging to the same class is stored in a classification system. When an unknown pattern is given as input, it is compared with the existing patterns and placed under the matching pattern class.
b. Common property concept: In this concept, the common properties of the patterns are stored in a classification system. When an unknown pattern arrives, the system checks its
extracted common properties against the common properties of the existing classes and places the pattern/object under the class with similar common properties.
c. Clustering concept: Here, the patterns of the targeted classes are represented as vectors whose components are real numbers. Using their clustering properties, we can easily classify an unknown pattern. If the target vectors are far apart in geometrical arrangement, it is easy to classify the unknown patterns. If they are nearby, or if there is any overlap in the cluster arrangement, more complex algorithms are needed to classify the unknown patterns. One simple algorithm based on the clustering concept is Minimum Distance Classification. This method computes the distance between the unknown pattern and the desired set of known patterns, determines which known pattern is closest to the unknown and, finally, places the unknown pattern under the known pattern to which it has minimum distance. This algorithm works well when the target patterns are far apart.
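As a minimal sketch of the minimum-distance idea just described (the names and the toy prototype vectors are illustrative assumptions, not taken from the paper), each class is represented by a prototype vector and an unknown pattern is assigned to the class whose prototype is nearest:

```python
import numpy as np

def minimum_distance_classify(x, prototypes):
    """Assign x to the class whose prototype vector is closest (Euclidean distance)."""
    distances = {label: np.linalg.norm(x - p) for label, p in prototypes.items()}
    return min(distances, key=distances.get)

# Toy prototypes, e.g. mean feature vectors of two well-separated classes
prototypes = {
    "class_A": np.array([0.0, 0.0]),
    "class_B": np.array([5.0, 5.0]),
}
print(minimum_distance_classify(np.array([0.8, 1.1]), prototypes))  # -> class_A
```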
1.2. Classifier design:
Classifiers are functions that use pattern matching to determine the closest match. After an optimal feature subset is selected, a classifier can be designed using various approaches. Roughly speaking, there are three different approaches [1,2]. The first approach is the simplest and most intuitive one and is based on the concept of similarity; template matching is an example. The second is a probabilistic approach. It includes methods based on the Bayes decision rule and on maximum-likelihood or density estimators. Three well-known methods are the K-nearest neighbour (KNN) rule, the Parzen window classifier and branch-and-bound (BnB) methods. The third approach is to construct decision boundaries directly by optimizing a certain error criterion. Examples are Fisher's linear discriminant, multilayer perceptrons, decision trees and support vector machines. Determining a suitable classifier for a given problem is still more an art than a science.
The first class of classifiers uses some similarity metric and assigns class labels so as to maximize similarity. Probabilistic methods, of which the Bayesian classifier is the best known, depend on the prior probabilities of the classes and the class-conditional densities of the instances. In addition to Bayesian classifiers, logistic classifiers belong to this type. Logistic classifiers deal with the unknown parameters based on maximum likelihood [3]; further details on logistic classifiers can be found in [4]. Geometric classifiers build decision boundaries by directly minimizing an error criterion. An example of these classifiers is Fisher's linear discriminant, which mainly aims to reduce the feature space to lower dimensions in the case of a huge number of features. It minimizes the mean squared error between the class labels and the tested instance. Neural networks are also examples of geometric classifiers.
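A short, hedged sketch of the similarity-based/probabilistic family discussed above, here a k-nearest-neighbour rule (the training data and the value of k are illustrative only):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Tiny illustrative training set: two clusters with labels "A" and "B"
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [4.9, 5.1], [5.2, 4.8]])
train_y = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.1, 0.2]), train_X, train_y, k=3))  # -> "A"
```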
1.3. Classification taxonomy:
Based on the available literature, Figure 1 and Figure 1a show the taxonomy of the different classifiers used for various speech recognition applications, organized first by classification technique and, alternatively, by the underlying density functions.
Figure 1a. Taxonomy based on class-conditional densities
2. Knowledge Based classification Method
Human knowledge of speech has to be expressed in terms of explicit rules. Acoustic-phonetic rules describe the words of the lexicon, the syntax of the language and so on, and this knowledge rests on phonetic and linguistic principles. Basically, two such approaches to speech recognition exist.
They are
• Acoustic Phonetic Approach
• Artificial Intelligence Approach
2.1. Acoustic Phonetic Approach [5]:
The acoustic phonetic approach is based on the theory of acoustic phonetics, which postulates that there exist finite, distinctive phonetic units in spoken language and that these phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum,
over time [5]. Even though the acoustic properties of phonetic units are highly variable, both across speakers and with neighbouring phonetic units (the so-called co-articulation of sounds), it is assumed that the rules governing this variability are straightforward and can readily be learned and applied in practical situations. Hence the first step in the acoustic phonetic approach to speech recognition is called the segmentation and labeling phase, because it involves segmenting the speech signal into discrete (in time) regions where the acoustic properties of the signal are representative of one (or possibly several) phonetic units (or classes), and then attaching one or more phonetic labels to each segmented region according to its acoustic properties. To actually do speech recognition, a second step attempts to determine a valid word (or string of words) from the sequence of phonetic labels produced in the first step that is consistent with the constraints of the speech recognition task (i.e. the words are drawn from a given vocabulary, the word sequence makes syntactic sense and has semantic meaning, etc.).
To illustrate the steps involved in the acoustic phonetic approach to speech recognition, consider the phoneme lattice shown in Figure 2. (A phoneme lattice is the result of the segmentation and labeling step of the recognition process, and represents a sequential set of phonemes that are likely matches to the spoken input speech.) The problem is to decode the phoneme lattice into a word string (one or more words) such that every instant of time is included in one of the phonemes in the lattice, and such that the word (or word sequence) is valid according to the rules of English syntax. (The symbol SIL stands for silence or a pause between sounds or words; the vertical position in the lattice, at any time, is a measure of the goodness of the acoustic match to the phonetic unit, with the highest unit having the best match.) With a modest amount of searching, one can derive the appropriate phonetic string SIL-AO-L-AX-B-AW-T corresponding to the word string "all about," with the phonemes L, AX, and B having been second or third choices in the lattice, and all other phonemes having been first choices.
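A hedged, highly simplified sketch of this kind of lexical access over a phoneme lattice follows (the lattice structure, the toy lexicon and the greedy matching rule are assumptions made for illustration; practical systems use probabilistic search over many competing hypotheses):

```python
# Each time slot holds phoneme candidates ranked best-first, loosely mimicking Figure 2.
lattice = [
    ["SIL"], ["AO"], ["R", "L"], ["IH", "AX"], ["P", "B"], ["AW"], ["T"],
]

lexicon = {
    "all":   ["AO", "L"],
    "about": ["AX", "B", "AW", "T"],
}

def word_fits(phonemes, lattice, start):
    """Check whether a word's phoneme string can be traced slot-by-slot from 'start'."""
    if start + len(phonemes) > len(lattice):
        return False
    return all(p in lattice[start + i] for i, p in enumerate(phonemes))

# Greedy decoding: skip silences, try every lexicon word at each position.
pos, decoded = 0, []
while pos < len(lattice):
    if lattice[pos] == ["SIL"]:
        pos += 1
        continue
    for word, phonemes in lexicon.items():
        if word_fits(phonemes, lattice, pos):
            decoded.append(word)
            pos += len(phonemes)
            break
    else:
        pos += 1  # no word matched at this slot; move on
print(decoded)  # -> ['all', 'about']
```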
This simple example illustrates well the difficulty of decoding phonetic units into word strings. This is the so-called lexical access problem. The real problem with the acoustic phonetic approach to speech recognition is the difficulty of obtaining a reliable phoneme lattice for the lexical access stage.
Fig. 3 shows a block diagram of the acoustic phonetic approach to speech recognition. The first step in the processing (a step common to all approaches to speech recognition) is the speech analysis system (the so-called feature measurement method), which provides an appropriate (spectral) representation of the characteristics of the time-varying speech signal. The most common techniques of spectral analysis are the class of filter bank methods and the class of linear predictive coding (LPC) methods. Both of these methods provide spectral descriptions of the speech over time. The next step in the processing is the feature-detection stage. The idea here is to convert the spectral measurements into a set of features that describe the broad acoustic properties of the different phonetic units. Among the features proposed for recognition are nasality (presence or absence of nasal resonance), frication (presence or absence of random excitation in the speech), formant locations (frequencies of the first three resonances), voiced/unvoiced classification (periodic or aperiodic excitation), and ratios of high- and low-frequency energy. Many proposed features are inherently binary (e.g. nasality, frication, voiced/unvoiced); others are continuous (e.g. formant locations, energy ratios). The feature detection stage usually consists of a set of detectors that operate in parallel and use appropriate processing and logic to decide on the presence or absence, or value, of a feature. The algorithms used for individual feature detectors are sometimes sophisticated ones that do a lot of signal processing, and sometimes they are rather trivial estimation procedures.
The third step in the procedure is the segmentation and labeling phase, whereby the system tries to find stable regions (where the features change very little over the region) and then to label each segmented region according to how well the features within that region match those of individual phonetic units. This stage is the heart of the acoustic phonetic recognizer and is the most difficult one to carry out reliably; hence various control strategies are used to limit the range of segmentation points and label possibilities. For example, for individual word recognition, the constraint that a word contains at least two phonetic units and no more than six phonetic units means that the control strategy need only consider solutions with between 1 and 5 internal segmentation points. Furthermore, the labeling strategy can exploit lexical constraints on words to consider only words with n phonetic units whenever the segmentation gives n-1 segmentation points. These constraints are often powerful ones that reduce the search space and significantly increase the performance (accuracy of segmentation and labeling) of the system.
The result of the segmentation and labeling step is usually a phoneme lattice (of the type shown in Figure 2), from which a lexical access procedure determines the best matching word or sequence of words. Other types of lattices (e.g. syllable, word) can also be derived by integrating vocabulary and syntax constraints into the control strategy, as discussed above. The quality of the match of the features within a segment to the phonetic units can be used to assign probabilities to the labels, which can then be used in a probabilistic lexical access procedure. The final output of the recognizer is the word or word sequence that best matches, in some well-defined sense, the sequence of phonetic units in the phoneme lattice.
Figure 3. Block diagram of an acoustic phonetic speech recognition system
2.1.1. General Discussion on the Acoustic Phonetic Approach:
A typical acoustic phonetic approach to ASR has the following steps (this is similar to the overview of the acoustic-phonetic approach presented by Rabiner (Rabiner and Juang, 1993), but it is defined here more broadly):
1. Speech is analyzed using any of the spectral analysis methods - Short Time Fourier Transform (STFT), Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), etc. - using overlapping frames with a typical size of 10-25 ms and a typical overlap of 5 ms.
2. Acoustic correlates of phonetic features are extracted from the spectral representation. For example, low-frequency energy may be calculated as an acoustic correlate of sonorancy, zero-crossing rate may be calculated as a correlate of frication, and so on (a small sketch of steps 1-2 is given after this list).
3. Speech is segmented by either finding transient locations using the spectral change across two consecutive frames, or using the acoustic correlates of source or manner classes to find segments with stable manner classes. The former approach, that is, finding acoustically stable regions using the locations of spectral change, has been followed by Glass et al. (Glass and Zue, 1988). The latter method of using broad manner class scores to segment the signal has been used by a number of researchers (Bitar, 1997; Liu, 1996; Fohr et al.; Carbonell et al., 1987). Multiple segmentations may be generated instead of a single representation, for example the dendrograms in the speech recognition method proposed by Glass (Glass and Zue, 1988). (The system built by Glass et al. is included here as an acoustic phonetic system because it fits the broad definition of the acoustic-phonetic approach, but this system uses very little knowledge of acoustic phonetics.)
4. Further analysis of the individual segmentations is carried out next, to either recognize each segment as a phoneme directly or find the presence or absence of individual phonetic features and use these intermediate decisions to find the phonemes. When multiple segmentations are generated instead of a single segmentation, a number of different phoneme sequences may be generated. The phoneme sequences that match the vocabulary and grammar constraints are used to decide upon the spoken utterance by combining the acoustic and language scores.
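A hedged sketch of steps 1-2 above (the frame sizes, the choice of correlates and the synthetic signal are illustrative assumptions, not the implementation used in the cited work): the waveform is cut into overlapping frames, and simple acoustic correlates such as low-frequency energy (a sonorancy cue) and zero-crossing rate (a frication cue) are computed per frame.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, step_ms=10):
    """Slice a waveform into overlapping analysis frames."""
    frame_len, step = int(fs * frame_ms / 1000), int(fs * step_ms / 1000)
    return np.array([x[i:i + frame_len]
                     for i in range(0, len(x) - frame_len + 1, step)])

def acoustic_correlates(frames, fs):
    """Per-frame low-frequency energy (sonorancy cue) and zero-crossing rate (frication cue)."""
    feats = []
    for fr in frames:
        spec = np.abs(np.fft.rfft(fr * np.hamming(len(fr))))
        freqs = np.fft.rfftfreq(len(fr), 1.0 / fs)
        low_energy = np.sum(spec[freqs < 400.0] ** 2)    # energy below 400 Hz
        zcr = np.mean(np.abs(np.diff(np.sign(fr)))) / 2  # zero-crossing rate
        feats.append((low_energy, zcr))
    return np.array(feats)

fs = 16000
t = np.arange(0, 0.3, 1.0 / fs)
signal = np.sin(2 * np.pi * 150 * t)                 # toy "voiced" segment
feats = acoustic_correlates(frame_signal(signal, fs), fs)
print(feats.shape)                                    # (number_of_frames, 2)
```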
2.1.2. Hurdles/Challenges in the acoustic-phonetic approach:
A number of problems have been associated with the acoustic-phonetic approach in the literature. Rabiner (Rabiner and Juang, 1993) lists at least five such problems or hurdles that have made the use of the approach minimal in the ASR community. The problems with the acoustic phonetic approach, and some ideas for solving them, provide much of the motivation for the present work. These documented problems of the acoustic-phonetic approach are now listed, and it is argued that either insufficient effort has gone into solving these problems or that the problems are not unique to the acoustic-phonetic approach.
a) It has been argued that the difficulty of properly decoding phonetic units into words and sentences grows dramatically with an increase in the rate of phoneme insertion, deletion and substitution. This argument assumes that phoneme units are recognized in a first pass with no knowledge of language and vocabulary constraints. This has been true for many of the acoustic phonetic methods, but it is not necessarily so, since vocabulary and grammar constraints may be used to constrain the speech segmentation paths (Glass et al., 1996).
b) Extensive knowledge of the acoustic manifestations of phonetic units is required, and the lack of completeness of this knowledge has been pointed out as a drawback of the knowledge-based approach. While it is true that the knowledge is incomplete, there is no reason to believe that the standard signal representations, for example Mel-Frequency Cepstral Coefficients (MFCCs), used in state-of-the-art ASR methods are sufficient to capture all the acoustic manifestations of the speech sounds. Although the knowledge is not complete, a number of efforts to find acoustic correlates of phonetic features have obtained excellent results. Most recently, there has been significant development in the research on the acoustic correlates of place of stop consonants and fricatives (Stevens et al., 1999; Ali, 1999; Bitar, 1997), nasal detection (Pruthi and Espy-Wilson, 2003), and semivowel classification (Espy-Wilson, 1994). The knowledge from these sources may be adequate to start building an acoustic-phonetic speech recognizer to carry out word recognition tasks, and that was the focus of this work. It should be noted that, because of the physical significance of the knowledge-based acoustic measurements, it is easy to pinpoint the source of recognition errors in the recognition system. Such an error analysis is close to impossible with MFCC-like front ends.
c) The third argument against the acoustic-phonetic approach is that the choice of phonetic features and their acoustic correlates is not optimal. It is true that linguists may not agree with each other on the optimal set of phonetic features, but finding the best set of features is a task that can be carried out instead of turning to other ASR methods. The phonetic feature
set used in this work will be based on distinctive feature theory, and it will be optimal in that sense.
d) Another drawback of the acoustic-phonetic approach, as pointed out in (Rabiner and Juang, 1993), is that the design of the sound classifiers is not optimal. This argument probably assumes that binary decision trees with hard, knowledge-based thresholds are used to carry out the decisions in the acoustic phonetic approach. Statistical pattern recognition methods can equally well be used for these decisions, and such a design is no less optimal.
2.1.3. Advantages and Disadvantages:
a) Advantages:
1) Not all acoustic-phonetic features are used for every decision.
2) Since the acoustic-phonetic features have a strong physical interpretation, it is easy to pinpoint the source of error in such a recognition system; in particular, it is easy to tell whether the pattern matcher has failed.
3) The method can easily take advantage of the years of research that have gone into acoustic phonetics, as well as into signal processing based on human auditory models.
b) Disadvantages:
The chosen phonemes are not only the first choices in the phonetic sequence, but also second (B and AX) and third (L) choices. Therefore, matching a phonetic sequence with a word or a group of words is not straightforward. In fact, this is the main disadvantage of this approach.
2.1.4. Applications:
1) Acoustic phonetic approach to speech recognition: Application to the Semivowels
2) Models of Phonetic Recognition: The Role of Analysis by Synthesis in Phonetic Recognition
3) The Influence of Phonetic Context on the Acoustic Properties of Stops
4) The Role of Syllable Structure in the Acoustic Realizations of Stops
5) A Semivowel Recognition System
6) Two-Dimensional Characterization of the Speech Signal and Its Potential Applications to Speech Processing
7) Recognition of Words from their Spellings: Integration of Multiple Knowledge Sources
2.2. Artificial Intelligence Approach [5]:
Historically there are two main approaches to AI. The classical approach (designing the AI) is based on symbolic reasoning - a mathematical approach in which ideas and concepts are represented by symbols such as words, phrases or sentences, which are then processed according to the rules of logic. The connectionist approach (letting the AI develop) is based on artificial neural networks, which imitate the way neurons work, and on genetic algorithms, which imitate inheritance and fitness to evolve better solutions to a problem with every generation.
The AI approach [5] to speech recognition is a hybrid of the acoustic phonetic approach and the pattern recognition approach, in that it exploits ideas and concepts of both methods. The artificial intelligence approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features. In particular, among the techniques used within this class of methods are: an expert system for segmentation and labeling, so that this crucial and most difficult step can be performed with more than just the acoustic information used by pure acoustic phonetic methods (in particular, methods that integrate phonemic, lexical, syntactic, semantic and even pragmatic knowledge into the expert system have been proposed and studied); learning and adapting over time (i.e. the concept that knowledge is often both static and dynamic, and that models must adapt to the dynamic component of the data); and the use of neural networks for learning the relationships between phonetic events and all known inputs (including acoustic, lexical, syntactic, semantic, etc.), as well as for discrimination between similar sound classes.
The basic idea of the artificial intelligence approach to speech recognition is to compile and incorporate knowledge from a variety of knowledge sources and to bring it to bear on the problem at hand. Thus, for example, the AI approach to segmentation and labeling would be to augment the generally used acoustic knowledge with phonemic knowledge, lexical knowledge, syntactic knowledge, semantic knowledge, and even pragmatic knowledge. The different knowledge sources required are as follows:
a) Acoustic knowledge - evidence of which sounds (predefined phonetic units) are spoken, on the basis of spectral measurements and the presence or absence of features.
b) Lexical knowledge - the combination of acoustic evidence so as to postulate words, as specified by a lexicon that maps sounds into words (or, equivalently, decomposes words into sounds).
c) Syntactic knowledge - the combination of words to form grammatically correct strings (according to a language model), such as sentences or phrases.
d) Semantic knowledge - understanding of the task domain so as to be able to validate sentences (or phrases) that are consistent with the task being performed, or which are consistent with previously decoded sentences.
e) Pragmatic knowledge - the inference ability necessary to resolve ambiguity of meaning based on the ways in which words are generally used.
2.2.1. Advantages and Disadvantages of the Artificial Intelligence approach:
a) Advantages:
i) AI has made some progress at imitating "subsymbolic" problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures
inside human and animal brains that give rise to this skill; and step-by-step reasoning that humans were often assumed to use when they solve puzzles, play board games or make logical deductions.
ii) By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.
iii) The search for more efficient problem-solving algorithms is a high priority for AI research.
b) Disadvantages:
i) For difficult problems, most of the algorithms in the artificial intelligence approach require enormous computational resources - most suffer from "combinatorial explosion": the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size.
ii) Intelligent systems are not like humans.
2.2.2. Applications:
1. Artificial intelligence approach to speech recognition
2. AI approach to chemical inference
3. AI approach to cognitive linguistics
4. AI approach to VLSI design
5. AI approach to machine learning
6. AI approach to reservoir
7. AI approach to automated office
3. Bayes Decision Theory:
Bayesian decision making refers to choosing the most likely class, given the value of the feature or features. The probabilities of class membership are calculated from Bayes' theorem. If the feature value is denoted by x and a class of interest is C, then P(x) is the probability distribution for feature x in the entire population and P(C) is the prior probability that a random sample is a member of class C. P(x|C) is the conditional probability of obtaining feature value x given that the sample belongs to class C. Bayes' theorem then gives the posterior probability that a sample with feature value x belongs to class C, denoted P(C|x), on the basis of the values of P(x|C), P(C) and P(x): P(C|x) = P(x|C)P(C)/P(x).
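A minimal sketch of this decision rule, assuming one-dimensional Gaussian class-conditional densities and illustrative priors (none of these numbers come from the paper):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Evaluate a univariate Gaussian density p(x | C)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Illustrative class models: prior P(C) and parameters of p(x|C)
classes = {
    "C1": {"prior": 0.6, "mean": 0.0, "var": 1.0},
    "C2": {"prior": 0.4, "mean": 3.0, "var": 1.5},
}

def bayes_decide(x):
    """Pick the class maximizing the posterior P(C|x) = p(x|C) P(C) / P(x)."""
    scores = {c: m["prior"] * gaussian_pdf(x, m["mean"], m["var"])
              for c, m in classes.items()}
    evidence = sum(scores.values())                    # P(x)
    posteriors = {c: s / evidence for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

print(bayes_decide(2.0))
```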
4. Database classification method:
In this classification, the patterns are stored in a database and the test signal is compared against the patterns stored in the database. Since the collection of trained patterns is stored in a database, this method is called the database classification method. It has been categorized as one of the important classification methods, namely the pattern recognition approach. In turn, this pattern recognition approach has been divided into two methods: template/DTW/supervised classification and unsupervised classification. Each of these methods is discussed in detail in the sections below.
4.1. Introduction to the Pattern Recognition approach:
Pattern recognition as a field of study developed significantly in the 1960s. It is very much an interdisciplinary subject, covering developments in the areas of statistics, engineering, artificial intelligence, computer science, psychology and physiology, among others. Watanabe [74] defines a pattern "as opposite of chaos"; it is an entity, vaguely defined, that could be given a name.
Pattern recognition is concerned with the classification of objects into categories, especially by machine. A strong emphasis is placed on the statistical theory of discrimination, but clustering also receives some attention. Hence it can be summed up in a single word: 'classification', both supervised (using class information to design a classifier - i.e. discrimination) and unsupervised (allocating to groups without class information - i.e. clustering). Its ultimate goal is to optimally extract patterns based on certain conditions and to separate one class from the others. Pattern recognition has often been achieved using linear and quadratic discriminants [6], the k-nearest neighbour classifier [7], the Parzen density estimator [8], template matching [9] and Neural Networks [10]. These methods are basically statistical. The problem in using these recognition methods lies in constructing the classification rule without having any idea of the distribution of the measurements in the different groups. The Support Vector Machine (SVM) [11] has gained prominence in the field of pattern classification and is competing strongly with other techniques such as template matching and Neural Networks for pattern recognition.
4.1.1. General Process of Pattern Recognition:
A pattern is a pair comprising an observation and a meaning. Pattern recognition is inferring meaning from observation. Designing a pattern recognition system is establishing a mapping from measurement space into the space of potential meanings. The basic components of pattern recognition are pre-processing, feature extraction and selection, classifier design, and optimization.
4.1.1a. Pre-processing:
The role of pre-processing is to segment the interesting pattern from the background. Generally, noise filtering, smoothing and normalization should be done in this step. The pre-processing also defines a compact representation of the pattern.
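As a hedged illustration of typical pre-processing operations of this kind (the pre-emphasis coefficient and the normalization choice are common defaults assumed here, not taken from the paper):

```python
import numpy as np

def preprocess(x, pre_emphasis=0.97):
    """Simple speech pre-processing: DC removal, pre-emphasis, peak normalization."""
    x = x - np.mean(x)                                   # remove DC offset
    x = np.append(x[0], x[1:] - pre_emphasis * x[:-1])   # boost high frequencies
    return x / (np.max(np.abs(x)) + 1e-12)               # scale to roughly [-1, 1]

fs = 16000
noisy = np.sin(2 * np.pi * 200 * np.arange(fs) / fs) + 0.01 * np.random.randn(fs)
clean = preprocess(noisy)
```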
4.1.1b. Feature Selection and Extraction:
Features should be easily computed, robust, insensitive to various distortions and variations in the signal, and rotationally invariant. Two kinds of features are used in pattern recognition problems. One kind of feature has a clear physical meaning, such as geometric, structural or statistical features. The other kind of feature has no physical meaning; such features are called mapping features. The advantage of physical features is that they need not deal with irrelevant features. The advantage of mapping features is that they make classification easier, because clear boundaries
will be obtained between the classes, but at the cost of increased computational complexity.
i) Feature selection is the task of selecting the best subset from the input space. Its ultimate goal is to select the optimal feature subset that can achieve the highest accuracy. Feature extraction, in contrast, is applied when no physical features can be obtained. Most feature selection algorithms involve a combinatorial search through the whole space. Usually, heuristic methods such as hill climbing have to be adopted, because the size of the input space is exponential in the number of features. Other methods divide the feature space into several subspaces which can be searched easily. There are basically two types of feature selection methods: filter and wrapper [12]. Filter methods select the best features according to some prior knowledge, without considering the bias of the subsequent induction algorithm, so these methods are performed independently of the classification algorithm and its error criteria (a small sketch of a filter-style selection rule is given after this list).
ii) In feature extraction, most methods are supervised. These approaches need some prior knowledge and labelled training samples. Two kinds of supervised methods are used: linear feature extraction and nonlinear feature extraction. Linear feature extraction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), projection pursuit, and Independent Component Analysis (ICA). Nonlinear feature extraction methods include kernel PCA, PCA networks, nonlinear PCA, nonlinear auto-associative networks, Multi-Dimensional Scaling (MDS), the Self-Organizing Map (SOM), and so forth.
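A compact, hedged sketch of these two stages, using a variance-based filter ranking for feature selection and PCA for linear feature extraction (the scoring rule, the number of retained features and the random toy data are illustrative assumptions only):

```python
import numpy as np

def filter_select(X, k):
    """Filter-style selection: rank features by variance and keep the top k (classifier-independent)."""
    scores = X.var(axis=0)
    return np.argsort(scores)[::-1][:k]

def pca_extract(X, n_components):
    """Linear feature extraction with PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top

X = np.random.randn(100, 10)           # 100 samples, 10 raw features
selected = filter_select(X, k=5)       # indices of the 5 highest-variance features
Z = pca_extract(X[:, selected], n_components=2)
print(selected, Z.shape)               # e.g. [...], (100, 2)
```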
4.1.1c. Classifier design:
After an optimal feature subset is selected, a classifier can be designed using various approaches. Roughly speaking, there are three different approaches [1]. The first approach is the simplest and most intuitive one and is based on the concept of similarity; template matching is an example. The second is a probabilistic approach. It includes methods based on the Bayes decision rule and on maximum-likelihood or density estimators. Three well-known methods are the K-nearest neighbour (KNN) rule, the Parzen window classifier and branch-and-bound (BnB) methods. The third approach is to construct decision boundaries directly by optimizing a certain error criterion. Examples are Fisher's linear discriminant, multilayer perceptrons, decision trees and support vector machines [13].
4.1.1d. Optimization:
Optimization is not a separate step; it is combined with several parts of the pattern recognition process. In pre-processing, optimization guarantees that the input patterns have the best quality [13]. In the feature selection and extraction part, optimal feature subsets are obtained using optimization techniques. Finally, the classification error rate is lowered in the classification part.
4.1.2. Steps in statistical pattern recognition:
i) Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.
ii) Data collection: making measurements on appropriate variables and recording details of the data collection procedure (ground truth).
iii) Initial examination of the data: checking the data, calculating summary statistics and producing plots in order to get a feel for the structure.
iv) Feature selection or feature extraction: selecting variables from the measured set that are appropriate for the task. These new variables may be obtained by a linear or nonlinear transformation of the original set (feature extraction). To some extent, the division between feature extraction and classification is artificial.
v) Unsupervised pattern classification or clustering: this may be viewed as exploratory data analysis and it may provide a successful conclusion to a study. On the other hand, it may be a means of pre-processing the data for a supervised classification procedure.
vi) Apply discrimination or regression procedures as appropriate: the classifier is designed using a training set of exemplar patterns.
vii) Assessment of results: this may involve applying the trained classifier to an independent test set of labeled patterns.
viii) Interpretation: the above is necessarily an iterative process; the analysis of the results may pose further hypotheses that require further data collection. Also, the cycle may be terminated at different stages: the questions posed may be answered by an initial examination of the data, or it may be discovered that the data cannot answer the initial question and the problem must be reformulated.
A block diagram of the canonic pattern recognition approach to speech recognition is shown in Figure 4; the recognition process has four steps, namely:
1. Parameter Estimation, in which a sequence of measurements is made on the input signal to define the test pattern. For speech signals the feature measurements are usually the output of some type of spectral analysis technique, such as a filter bank analyzer, a linear predictive coding analysis, or a discrete Fourier transform (DFT) analysis.
2. Pattern Training, in which one or more test patterns corresponding to speech sounds of the same class are used to create a pattern representative of the features of that class. The resulting pattern, generally called the reference pattern, can be an exemplar or template, derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the reference pattern.
3. Pattern Comparison, in which the unknown test pattern is compared with each (sound) class reference pattern and a measure of similarity (distance) between the test pattern and each reference pattern is computed. To compare speech patterns (which consist of a sequence of spectral vectors), we require both a local distance measure, in which local distance
is defined as the spectral "distance" between two well-defined spectral vectors, and a global time alignment procedure (often called a dynamic time warping algorithm), which compensates for different rates of speaking (time scales) of the two patterns.
4. Decision Logic, in which the reference pattern similarity scores are used to decide which reference pattern (or possibly which sequence of reference patterns) best matches the unknown test pattern. The factors that distinguish different pattern recognition approaches are the types of feature measurement, the choice of templates or models for the reference patterns, and the method used to create reference patterns and classify unknown test patterns.
Figure 4. Pattern recognition approach to speech recognition
4.1.3. Pattern recognition approach:
The four best known pattern recognition approaches are: i) the template approach, ii) the statistical approach, iii) the syntactic or structural approach, and iv) the neural network approach. These models are not necessarily independent, and sometimes the same pattern recognition methods exist with different interpretations. Attempts have been made to design hybrid systems involving multiple models [75]. A brief description and comparison is given below and discussed in Table 1.
TABLE 1. Pattern recognition models
4.1.4. Examples of pattern recognition applications:
Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally demanding. These applications include data mining, bioinformatics, etc., as shown in Table 2.
TABLE 2. Examples of pattern recognition applications
5. Template based approach:
One of the simplest and earliest approaches to pattern<br />
recognition is the template approach. Match<strong>in</strong>g is a generic<br />
operation <strong>in</strong> pattern recognition which is <strong>used</strong> to determ<strong>in</strong>e the<br />
similarity between two entities of the same type. In template<br />
match<strong>in</strong>g the template or prototype of the pattern to be<br />
recognized is available. The pattern to be recognized is<br />
matched aga<strong>in</strong>st the stored template tak<strong>in</strong>g <strong>in</strong>to account all<br />
allowable pose and scale changes.<br />
The major pattern recognition techniques for speech<br />
recognition are template method and Dynamic Time warp<strong>in</strong>g<br />
method(DTW). Template based approaches to speech<br />
recognition have provided a family of techniques that have<br />
advanced the field considerably dur<strong>in</strong>g the last six decades.<br />
The underly<strong>in</strong>g idea is simple. A collection of prototypical<br />
speech patterns are stored as reference patterns represent<strong>in</strong>g<br />
the dictionary of candidate words. Recognition is then
carried out by match<strong>in</strong>g an unknown spoken utterance with<br />
each of these reference templates and select<strong>in</strong>g the category of<br />
the best match<strong>in</strong>g pattern. Usually templates for entire words<br />
are constructed. This has the advantage that, errors due to<br />
segmentation or classification of smaller acoustically more<br />
variable units such as phonemes can be avoided. In turn, each<br />
word must have its own full reference template; template<br />
preparation and match<strong>in</strong>g become prohibitively expensive or<br />
impractical as vocabulary size <strong>in</strong>creases beyond a few<br />
hundred words. One key idea <strong>in</strong> template method is to derive<br />
typical sequences of speech frames for a pattern (a word) via<br />
some averag<strong>in</strong>g procedure, and to rely on the use of local<br />
spectral distance measures to compare patterns. Another key<br />
idea is to use some form of dynamic programm<strong>in</strong>g to<br />
temporally align patterns to account for differences in
speak<strong>in</strong>g rates across talkers as well as across repetitions of<br />
the word by the same talker.<br />
5.1. Introduction:<br />
A template is the representation of an actual segment of<br />
speech. It consists of a sequence of consecutive acoustic<br />
feature vectors (or frames), a transcription of the sounds or<br />
words it represents (typically one or more phonetic symbols),<br />
knowledge of neighbour<strong>in</strong>g templates (a template number if<br />
no templates overlap), and a tag with meta-<strong>in</strong>formation. The<br />
term template is often <strong>used</strong> for two fundamentally different<br />
concepts: either for the representation of a s<strong>in</strong>gle segment of<br />
speech with a known transcription, or for some sort of<br />
average of a number of different segments of speech. Both<br />
types of templates can be <strong>used</strong> <strong>in</strong> the DTW algorithm to<br />
compare them with a segment of <strong>in</strong>put speech. Us<strong>in</strong>g the latter<br />
type has the obvious advantage of reduc<strong>in</strong>g the number of<br />
templates and be<strong>in</strong>g more robust to outliers [14]. However,<br />
the averag<strong>in</strong>g is a model build<strong>in</strong>g step, which makes it more<br />
ak<strong>in</strong> to HMMs than to true example based recognition.<br />
5.1.2 Similarity and Distance methods <strong>used</strong> <strong>in</strong> Template<br />
approach:<br />
The first type of classifier uses the similarity between patterns to decide on a classification, so a similarity measure first has to be defined. The nearest mean classifier represents each class by the mean of its feature vectors; any unlabeled feature vector is then assigned to the class whose mean is nearest. Template matching uses a stored template to define each class label and finds the most similar template for classification. Another important classifier of this type uses the Nearest Neighbor (NN) algorithm [15, 16]. The data are represented as points in space, and classification is based on the Euclidean distance of the data to the labeled classes. For k-NN, the classifier examines the k nearest points and decides in favor of the majority.
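A minimal Python sketch of these two ideas is given below; the function names and the tiny two-class data set are hypothetical and serve only to illustrate the nearest-mean and k-NN decision rules.

```python
import numpy as np
from collections import Counter

def nearest_mean_classify(train_X, train_y, x):
    """Assign x to the class whose mean feature vector is closest (Euclidean)."""
    classes = np.unique(train_y)
    means = {c: train_X[train_y == c].mean(axis=0) for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(x - means[c]))

def knn_classify(train_X, train_y, x, k=3):
    """Assign x to the majority class among its k nearest training points."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy example with two hypothetical classes of 2-D feature vectors.
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y = np.array(["classA", "classA", "classB", "classB"])
print(nearest_mean_classify(X, y, np.array([0.1, 0.2])))   # classA
print(knn_classify(X, y, np.array([0.95, 1.05]), k=3))     # classB
```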
5.2 Advantages and Disadvantages of Template Method:<br />
a) Advantages:<br />
1. An <strong>in</strong>tr<strong>in</strong>sic advantage of template based recognition<br />
is that, it is not required to model the speech process.<br />
This is very convenient, s<strong>in</strong>ce our understand<strong>in</strong>g of<br />
speech is still limited, especially with respect to its<br />
transient nature.<br />
2. The ma<strong>in</strong> advantage is precisely the use of long<br />
temporal context: all the frames of the keyword<br />
template, as well as the <strong>in</strong>formation about their<br />
relative position, are <strong>used</strong> dur<strong>in</strong>g the Dynamic Time<br />
Warp<strong>in</strong>g (DTW) procedure. This provides an implicit<br />
model<strong>in</strong>g of co-articulation effects or speaker<br />
dependencies [9].<br />
3. This has the advantage that, errors due to<br />
segmentation or classification of smaller acoustically<br />
more variable units such as phonemes can be avoided.<br />
b) Disadvantages:<br />
1) Template match<strong>in</strong>g approaches fail to take advantage<br />
of large amount of tra<strong>in</strong><strong>in</strong>g data.<br />
2) They cannot model acoustic variabilities, except <strong>in</strong> a<br />
coarse way by assign<strong>in</strong>g multiple templates to each<br />
word;<br />
3) In practice they are limited to whole-word models,<br />
because it's hard to record or segment a sample<br />
shorter than a word - so templates are useful only <strong>in</strong><br />
small systems which can afford the luxury of us<strong>in</strong>g<br />
whole-word models.<br />
4) Each word must have its own full reference template;<br />
template preparation and match<strong>in</strong>g become<br />
prohibitively expensive or impractical as vocabulary<br />
size <strong>in</strong>creases beyond a few hundred words.<br />
5) It is difficult to test on large data sets.
6) A template must be supplied for every pattern.
5.3. Applications of template method:<br />
i) A multi-scale template method for shape detection with biomedical<br />
applications<br />
ii) Template match<strong>in</strong>g framework for detect<strong>in</strong>g geometrically<br />
transformed objects.<br />
iii) Template matching is one way of performing operations such as object recognition, identification or classification, and detection. Various template matching methods are reported in the literature, but they vary from application to application; no standard method has been developed yet.
iv) Adaptive template-match<strong>in</strong>g method for vessel wall<br />
boundary detection <strong>in</strong> brachial artery ultrasound (US) scans.<br />
6. Dynamic Time Warp<strong>in</strong>g(DTW):<br />
Dynamic time warp<strong>in</strong>g is an algorithm for measur<strong>in</strong>g<br />
similarity between two sequences which may vary <strong>in</strong> time or<br />
speed. For <strong>in</strong>stance, similarities <strong>in</strong> walk<strong>in</strong>g patterns would be<br />
detected, even if <strong>in</strong> one video, the person was walk<strong>in</strong>g slowly<br />
and if <strong>in</strong> another, he or she were walk<strong>in</strong>g more quickly, or<br />
even if there were accelerations and decelerations dur<strong>in</strong>g the<br />
course of one observation. A well known application has been<br />
automatic speech recognition, to cope with different speak<strong>in</strong>g<br />
speeds. In general, DTW is a method that allows a computer
to f<strong>in</strong>d an optimal match between two given sequences (e.g.<br />
time series) with certa<strong>in</strong> restrictions. The sequences are<br />
"warped" non-l<strong>in</strong>early <strong>in</strong> the time dimension to determ<strong>in</strong>e a<br />
measure of their similarity <strong>in</strong>dependent of certa<strong>in</strong> non-l<strong>in</strong>ear<br />
variations <strong>in</strong> the time dimension. This sequence alignment<br />
method is often <strong>used</strong> <strong>in</strong> the context of hidden Markov models.<br />
One example of the restrictions imposed on the match<strong>in</strong>g of<br />
the sequences is on the monotonicity of the mapp<strong>in</strong>g <strong>in</strong> the<br />
time dimension. Cont<strong>in</strong>uity is less important <strong>in</strong> DTW than <strong>in</strong><br />
other pattern match<strong>in</strong>g algorithms; DTW is an algorithm<br />
particularly suited to match<strong>in</strong>g sequences with miss<strong>in</strong>g<br />
<strong>in</strong>formation, provided there are long enough segments for<br />
match<strong>in</strong>g to occur. The optimization process is performed<br />
us<strong>in</strong>g dynamic programm<strong>in</strong>g, hence the name.<br />
Moreover, with<strong>in</strong> a word, there will be variation <strong>in</strong> the length<br />
of <strong>in</strong>dividual phonemes: Cassidy might be uttered with a long<br />
/A/ and short f<strong>in</strong>al /i/ or with a short /A/ and long /i/. The<br />
match<strong>in</strong>g process needs to compensate for length differences<br />
and take account of the non-l<strong>in</strong>ear nature of the length<br />
differences with<strong>in</strong> the words. The Dynamic Time Warp<strong>in</strong>g<br />
algorithm achieves this goal; it f<strong>in</strong>ds an optimal match<br />
between two sequences of feature vectors which allows for<br />
stretched and compressed sections of the sequence.<br />
6.1. Concepts of Dynamic Time Warp<strong>in</strong>g:<br />
Dynamic Time Warp<strong>in</strong>g is a pattern match<strong>in</strong>g algorithm<br />
with a non-l<strong>in</strong>ear time normalization effect. It is based on<br />
Bellman's pr<strong>in</strong>ciple of optimality [17], which implies that,<br />
given an optimal path w from A to B and a po<strong>in</strong>t C ly<strong>in</strong>g<br />
somewhere on this path, the path segments AC and CB are<br />
optimal paths from A to C and from C to B respectively. The<br />
dynamic time warp<strong>in</strong>g algorithm [18] creates an alignment<br />
between two sequences of feature vectors, (T_1, T_2, ..., T_N) and (S_1, S_2, ..., S_M). A distance d(i, j) can be evaluated between any two feature vectors T_i and S_j; this distance is referred to as the local distance. In DTW the global distance D(i, j) of any two feature vectors T_i and S_j is computed recursively by adding the local distance d(i, j) to the global distance already evaluated for the best predecessor. The best predecessor is the one that gives the minimum global distance D(i, j) at row i and column j:

$D(i,j) = d(i,j) + \min\big[\,D(i-1,j),\; D(i-1,j-1),\; D(i,j-1)\,\big]$    (1)
The computational complexity can be reduced by impos<strong>in</strong>g<br />
constra<strong>in</strong>ts that prevent the selection of sequences that cannot<br />
be optimal [18]. Global constra<strong>in</strong>ts affect the maximal overall<br />
stretch<strong>in</strong>g or compression. Local constra<strong>in</strong>ts affect the set of<br />
predecessors from which the best predecessor is chosen.<br />
Dynamic Time Warping (DTW) is used to establish a time-scale alignment between two patterns. It results in a time warping vector w describing the time alignment of segments of the two signals: it assigns a certain segment of the source signal to each of a set of regularly spaced synthesis instants in the target signal.
6.1.1. The DTW Grid:
We can arrange the two sequences of observations on the<br />
sides of a grid (Figure 5) with the unknown sequence on the<br />
bottom (six observations <strong>in</strong> the example) and the stored<br />
template up the left hand side (eight observations). Both<br />
sequences start on the bottom left of the grid. Inside each cell<br />
a distance measure is <strong>used</strong> for compar<strong>in</strong>g the correspond<strong>in</strong>g<br />
elements of the two sequences.<br />
Figure 5. An example DTW grid<br />
To f<strong>in</strong>d the best match between these two sequences we can<br />
f<strong>in</strong>d a path through the grid which m<strong>in</strong>imizes the total distance<br />
between them. The path shown <strong>in</strong> blue <strong>in</strong> Figure 5 gives an<br />
example. Here, the first and second elements of each sequence<br />
match together while the third element of the <strong>in</strong>put also<br />
matches best aga<strong>in</strong>st the second element of the stored pattern.<br />
This corresponds to a section of the stored pattern be<strong>in</strong>g<br />
stretched <strong>in</strong> the <strong>in</strong>put. Similarly, the fourth element of the<br />
<strong>in</strong>put matches both the second and third elements of the stored<br />
sequence: here a section of the stored sequence has been<br />
compressed <strong>in</strong> the <strong>in</strong>put sequence. Once an overall best path<br />
has been found the total distance between the two sequences<br />
can be calculated for this stored template.<br />
The procedure for comput<strong>in</strong>g this overall distance measure is<br />
to f<strong>in</strong>d all possible routes through the grid and for each one of<br />
these compute the overall distance. The overall distance is<br />
given <strong>in</strong> Sakoe and Chiba, Equation 1, as the m<strong>in</strong>imum of the<br />
sum of the distances between <strong>in</strong>dividual elements on the path<br />
divided by the sum of the warping function. The division is to make paths of different lengths comparable.
It should be apparent that for any reasonably sized sequences,<br />
the number of possible paths through the grid will be very<br />
large. In addition, many of the distance measures could be<br />
avoided s<strong>in</strong>ce the first element of the <strong>in</strong>put is unlikely to<br />
match the last element of the template for example. The DTW<br />
algorithm is designed to exploit some observations about the<br />
likely solution to make the comparison between sequences<br />
more efficient.<br />
6.1.2. Optimization in DTW:
The major optimizations to the DTW algorithm arise from<br />
observations on the nature of good paths through the grid.<br />
These are outl<strong>in</strong>ed <strong>in</strong> Sakoe and Chiba and can be summarized<br />
as:<br />
• Monotonic condition: the path will not turn back on<br />
itself, both the i and j <strong>in</strong>dexes either stay the same or<br />
<strong>in</strong>crease, they never decrease.<br />
• Cont<strong>in</strong>uity condition: The path advances one step at a<br />
time. Both i and j can only <strong>in</strong>crease by 1 on each step<br />
along the path.<br />
• Boundary condition: the path starts at the bottom left<br />
and ends at the top right.<br />
• Adjustment w<strong>in</strong>dow condition: a good path is<br />
unlikely to wander very far from the diagonal. The<br />
distance that the path is allowed to wander is the<br />
w<strong>in</strong>dow length r.<br />
• Slope constra<strong>in</strong>t condition: The path should not be<br />
too steep or too shallow. This prevents very short<br />
sequences match<strong>in</strong>g very long ones. The condition is<br />
expressed as a ratio n/m, where m is the number of steps in the x direction and n is the number in the y direction. After m steps in x you must make a step in y, and vice versa.
By applying these observations we can restrict the moves that can be made from any point in the path and so restrict the number of paths that need to be considered. For example, with a slope constraint of P=1, if a path has already moved one square up it must next move either diagonally or to the right. The power of the DTW algorithm goes beyond these observations, though. Instead of finding all possible routes through the grid which satisfy these constraints, the DTW algorithm works by keeping track of the cost of the best path to each point in the grid. During the match process the lowest cost path is not known, but it can be traced back once the end point is reached.
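As a minimal sketch of this procedure (assuming Euclidean local distances, the three standard predecessors of eq. (1), and an optional adjustment window; all function and variable names here are illustrative, not from the paper), the dynamic programming recursion can be written as follows:

```python
import numpy as np

def dtw_distance(T, S, window=None):
    """Global DTW distance between two sequences of feature vectors.

    Implements D(i, j) = d(i, j) + min(D(i-1, j), D(i-1, j-1), D(i, j-1)),
    optionally restricted to an adjustment window |i - j| <= window.
    """
    N, M = len(T), len(S)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        lo = 1 if window is None else max(1, i - window)
        hi = M if window is None else min(M, i + window)
        for j in range(lo, hi + 1):
            # local distance d(i, j) between the two frames
            d = np.linalg.norm(np.asarray(T[i - 1]) - np.asarray(S[j - 1]))
            D[i, j] = d + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[N, M]
```

Dividing the returned value by the length of the warping path (or by N + M) makes scores from templates of different lengths comparable, as discussed above.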
6.2. Advantages and Disadvantages of Dynamic Time<br />
Warp<strong>in</strong>g:<br />
a)Advantages:<br />
1) Works well for small number of templates (
7. Supervised versus unsupervised <strong>Classification</strong>/Learn<strong>in</strong>g<br />
<strong>Techniques</strong>:<br />
There are two ma<strong>in</strong> divisions of classification procedure <strong>in</strong><br />
pattern recognition: supervised classification (or<br />
discrim<strong>in</strong>ation) and unsupervised classification (sometimes <strong>in</strong><br />
the statistics literature simply referred to as classification or<br />
clustering). In supervised classification, a set of data samples (each consisting of measurements on a set of variables) is provided together with associated labels, the class types; these are used as exemplars in the classifier design. In unsupervised classification the data are not labeled, and the aim is to find groups in the data and the features that distinguish one group from another. Clustering techniques can also be used as part of a supervised classification scheme by defining prototypes: a clustering scheme may be applied to the data of each class separately, and representative samples for each group within the class (the group means, for example) are then used as the prototypes for that class.
7.1. Supervised Learn<strong>in</strong>g:<br />
In automatic pattern recognition, the term supervised<br />
learn<strong>in</strong>g/classification refers to the process of design<strong>in</strong>g a<br />
pattern classifier by us<strong>in</strong>g a tra<strong>in</strong><strong>in</strong>g set of patterns of known<br />
class to determ<strong>in</strong>e the choice of a specific decision mak<strong>in</strong>g<br />
technique for classify<strong>in</strong>g additional similar samples <strong>in</strong> future.<br />
The classifier <strong>in</strong> other words is designed us<strong>in</strong>g the tra<strong>in</strong><strong>in</strong>g<br />
data. To provide an unprejudiced estimate of the classifier's accuracy on new data, it must be tested on a separate test set of patterns for which the class of each pattern is known.
Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. In the supervised learning process, two types of analysis are carried out, namely parametric and non-parametric decision making (classification) methods.
7.1.1.There are several ways <strong>in</strong> which the standard supervised<br />
learn<strong>in</strong>g problem can be generalized:<br />
1. Semi-supervised learn<strong>in</strong>g: In this sett<strong>in</strong>g, the desired<br />
output values are provided only for a subset of the<br />
tra<strong>in</strong><strong>in</strong>g data. The rema<strong>in</strong><strong>in</strong>g data is unlabeled.<br />
2. Active learn<strong>in</strong>g: Instead of assum<strong>in</strong>g that all of the<br />
tra<strong>in</strong><strong>in</strong>g examples are given at the start, active<br />
learn<strong>in</strong>g algorithms <strong>in</strong>teractively collect new<br />
examples, typically by mak<strong>in</strong>g queries to a human<br />
user. Often, the queries are based on unlabeled data,<br />
which is a scenario that comb<strong>in</strong>es semi-supervised<br />
learn<strong>in</strong>g with active learn<strong>in</strong>g.<br />
3. Structured prediction: When the desired output value<br />
is a complex object, such as a parse tree or a labeled<br />
graph, then standard methods must be extended.<br />
4. Learn<strong>in</strong>g to rank: When the <strong>in</strong>put is a set of objects<br />
and the desired output is a rank<strong>in</strong>g of those objects,<br />
then aga<strong>in</strong> the standard methods must be extended.<br />
7.2. Advantages and Disadvantages of supervised learn<strong>in</strong>g:<br />
a)Advantages:<br />
1) Rules are written for you automatically. This is useful for<br />
large document sets.<br />
b) Disadvantages:<br />
1) It assigns documents to categories before generat<strong>in</strong>g the<br />
rules.<br />
2) Rules may not be as specific or accurate as those you write yourself.
3) It is prone to overfitting.
7.3. Challenges <strong>in</strong> supervised learn<strong>in</strong>g:<br />
In the classification problem the goal of the learning is to minimize the error with respect to the given inputs. These inputs, called the "training set", are the examples from which the agent tries to learn. But learning the training set well is not necessarily the best thing to do, since not all training sets have their inputs classified correctly. This can lead to problems if the algorithm used is powerful enough to memorize even the apparently "special cases" that do not fit the more general principles. This, too, can lead to overfitting, and it is a challenge to find algorithms that are both powerful enough to learn complex functions and robust enough to produce generalizable results.
7.4. Applications:<br />
• Bio<strong>in</strong>formatics<br />
• Database market<strong>in</strong>g<br />
• Handwrit<strong>in</strong>g recognition<br />
• Information retrieval<br />
o Learn<strong>in</strong>g to rank<br />
• Object recognition <strong>in</strong> computer vision<br />
• Optical character recognition<br />
• Spam detection<br />
• Pattern recognition<br />
• <strong>Speech</strong> recognition<br />
• Forecast<strong>in</strong>g Fraudulent F<strong>in</strong>ancial Statements<br />
8. Introduction to parametric representation:<br />
Parametric representation - Parametric statistics is a branch<br />
of statistics that assumes data come from a type of probability<br />
distribution and makes <strong>in</strong>ferences about the parameters of the<br />
distribution. Most well-known elementary statistical methods<br />
are parametric. Generally speak<strong>in</strong>g parametric methods make<br />
more assumptions than non-parametric methods. If those extra<br />
assumptions are correct, parametric methods can produce<br />
more accurate and precise estimates. They are said to have<br />
more statistical power. However, if those assumptions are<br />
<strong>in</strong>correct, parametric methods can be very mislead<strong>in</strong>g. For that<br />
reason they are often not considered robust. On the other hand,<br />
parametric formulae are often simpler to write down and<br />
faster to compute. In some, but def<strong>in</strong>itely not all cases, their<br />
simplicity makes up for their non-robustness, especially if<br />
care is taken to exam<strong>in</strong>e diagnostic statistics. Parametric<br />
decision mak<strong>in</strong>g refers to the situation <strong>in</strong> which we know or<br />
are willing to assume the general form of the probability
distribution function or density function for each class but not<br />
the values of the parameters such as the mean and variance.<br />
Before us<strong>in</strong>g these densities the values of the parameters have<br />
to be estimated.<br />
Most important parametric method <strong>used</strong> <strong>in</strong> speech recognition<br />
application is the hidden Markov Model.<br />
Stochastic model<strong>in</strong>g [97] entails the use of probabilistic<br />
models to deal with uncerta<strong>in</strong> or <strong>in</strong>complete <strong>in</strong>formation. In<br />
speech recognition, uncerta<strong>in</strong>ty and <strong>in</strong>completeness arise from<br />
many sources; for example, confusable sounds, speaker<br />
variabilities, contextual effects, and homophone words. Thus, stochastic models are a particularly suitable approach to speech recognition. The most popular stochastic approach today is hidden Markov modeling. A hidden Markov model is characterized by a finite state Markov model and a set of output distributions. The transition parameters in the Markov chain model temporal variabilities, while the parameters in the output distributions model spectral variabilities. These two types of variabilities are the essence of speech recognition.
8.1.Hidden Markov Model (statistical approach):<br />
Hidden Markov Models (HMMs) have dom<strong>in</strong>ated [19]<br />
automatic speech recognition for at least the last decade. The<br />
model’s success lies <strong>in</strong> its mathematical simplicity; efficient<br />
and robust algorithms have been developed to facilitate its<br />
practical implementation. However, there is noth<strong>in</strong>g uniquely<br />
speech-oriented about acoustic-based HMMs. Standard<br />
HMMs model speech as a series of stationary regions <strong>in</strong> some<br />
representation of the acoustic signal. <strong>Speech</strong> is a cont<strong>in</strong>uous<br />
process though, and ideally should be modeled as such.<br />
Furthermore, HMMs assume that state and phone boundaries<br />
are strictly synchronized with events <strong>in</strong> the parameter space,<br />
whereas in fact different acoustic and articulatory parameters
do not necessarily change value simultaneously at boundaries.<br />
8.1.1.Markov Models<br />
A Markov model is a probabilistic process over a f<strong>in</strong>ite set,<br />
{S_1, ..., S_k}, usually called its states. Each state transition generates a character from the alphabet of the process. We are interested in matters such as the probability of a given state coming up next, pr(x_t = S_i), and this may depend on the prior history up to t-1. In computing, such processes, if they are
reasonably complex and <strong>in</strong>terest<strong>in</strong>g, they are usually called<br />
Probabilistic F<strong>in</strong>ite State Automata (PFSA) or Probabilistic<br />
F<strong>in</strong>ite State Mach<strong>in</strong>es (PFSM) because of their close l<strong>in</strong>ks to<br />
determ<strong>in</strong>istic and non-determ<strong>in</strong>istic f<strong>in</strong>ite state automata as<br />
<strong>used</strong> <strong>in</strong> formal language theory.<br />
8.1.2. Types of Hidden Markov Models<br />
8.1.2a. Discrete HMMs:<br />
HMMs can be classified accord<strong>in</strong>g to the nature of the<br />
elements of the B matrix, which are distribution functions.<br />
Distributions are def<strong>in</strong>ed on f<strong>in</strong>ite spaces <strong>in</strong> the so called<br />
discrete HMMs. In this case, observations are vectors of<br />
symbols <strong>in</strong> a f<strong>in</strong>ite alphabet of N different elements. For each<br />
one of the Q vector components, a discrete density<br />
{w(k)/k=1,….N} is def<strong>in</strong>ed, and the distribution is obta<strong>in</strong>ed<br />
by multiply<strong>in</strong>g the probabilities of each component. Notice<br />
that this def<strong>in</strong>ition assumes that the different components are<br />
<strong>in</strong>dependent. Fig.6 shows an example of a discrete HMM with<br />
one-dimensional observations. Distributions are associated<br />
with model transitions.<br />
Figure 6: Example of a discrete HMM. A transition<br />
probability and an output distribution on the symbol set is<br />
associated with every transition.<br />
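For illustration, such a discrete HMM can be written down directly; the numbers below are invented purely as a toy example, and the small sampling routine simply generates a state/observation sequence from the model (each transition emitting one symbol, as described above).

```python
import numpy as np

# Toy discrete HMM with N = 2 states and an alphabet of 3 symbols (values invented).
pi = np.array([0.7, 0.3])            # initial state probabilities
A  = np.array([[0.8, 0.2],           # A[i, j] = P(next state j | current state i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],      # B[j, k] = P(symbol v_k | state j)
               [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)

def sample(pi, A, B, T=5):
    """Generate a state sequence and the observation sequence it emits."""
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)
    for _ in range(T):
        states.append(int(s))
        obs.append(int(rng.choice(B.shape[1], p=B[s])))
        s = rng.choice(len(A), p=A[s])
    return states, obs

print(sample(pi, A, B))
```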
8.1.2b. Continuous HMMs:
Another possibility is to def<strong>in</strong>e distributions as probability<br />
densities on cont<strong>in</strong>uous observation spaces. In this case,<br />
strong restrictions have to be imposed on the functional form<br />
of the distributions, <strong>in</strong> order to have a manageable number of<br />
statistical parameters to estimate. The most popular approach<br />
is to characterize the model transitions with mixtures of base<br />
densities g of a family G hav<strong>in</strong>g a simple parametric form.<br />
The base densities g є G are usually Gaussian or Laplacian,<br />
and can be parameterized by the mean vector and the<br />
covariance matrix. HMMs with these k<strong>in</strong>ds of distributions<br />
are usually referred to as continuous HMMs. In order to model complex distributions in this way, a large number of base densities has to be used in every mixture. This may require a very large
tra<strong>in</strong><strong>in</strong>g corpus of data for the estimation of the distribution<br />
parameters. Problems aris<strong>in</strong>g when the available corpus is not<br />
large enough can be alleviated by shar<strong>in</strong>g distributions among<br />
transitions of different models.<br />
8.1.2c. Semi-Cont<strong>in</strong>uous HMMs :<br />
In semi-cont<strong>in</strong>uous HMMs, all mixtures are expressed <strong>in</strong> terms<br />
of a common set of base densities. Different mixtures are<br />
characterized only by different weights. A common<br />
generalization of semi-cont<strong>in</strong>uous model<strong>in</strong>g consists of<br />
<strong>in</strong>terpret<strong>in</strong>g the <strong>in</strong>put vector y as composed of several<br />
components, each of which is associated
with a different set of base distributions. The components are<br />
assumed to be statistically <strong>in</strong>dependent; hence the<br />
distributions associated with model transitions are products of<br />
the component density functions. Computation of probabilities<br />
with discrete models is faster than with cont<strong>in</strong>uous models,<br />
nevertheless it is possible to speed up the mixture densities<br />
computation by apply<strong>in</strong>g vector quantization (VQ) on the<br />
Gaussians of the mixtures. Parameters of statistical models are
estimated by iterative learn<strong>in</strong>g algorithms <strong>in</strong> which the<br />
likelihood of a set of tra<strong>in</strong><strong>in</strong>g data is guaranteed to <strong>in</strong>crease at<br />
each step.<br />
8.2. HMM Constra<strong>in</strong>ts/Limitations for <strong>Speech</strong> <strong>Recognition</strong><br />
Systems:<br />
HMM have different constra<strong>in</strong>ts depend<strong>in</strong>g on the nature of<br />
the problem that has to be modeled. The ma<strong>in</strong> constra<strong>in</strong>ts<br />
needed <strong>in</strong> the implementation of speech Recognizers can be<br />
summarized <strong>in</strong> the follow<strong>in</strong>g assumptions [20]:<br />
1 – First order Markov cha<strong>in</strong> :<br />
In this assumption the probability of transition to a state<br />
depends only on the current state<br />
2 – Stationary state transitions:
This assumption states that the state transitions are time independent, so that we will have $a_{ij} = P(s_t = j \mid s_{t-1} = i)$ for all t.
3 – Observations <strong>in</strong>dependence:<br />
This assumption presumes that the observation emitted within a certain state depends only on the underlying Markov chain of states, without considering the effect of the occurrence of the other observations. Although this assumption is a poor one and deviates from reality, it works fine in modeling the speech signal. This assumption implies that

$P(o_t \mid o_{t-p}, \ldots, o_{t-1}, s_t) = P(o_t \mid s_t)$,

where p represents the considered history of the observation sequence. Then we will have

$P(O \mid S, \lambda) = \prod_{t=1}^{T} P(o_t \mid s_t) = \prod_{t=1}^{T} b_{s_t}(o_t)$.
4 – Left-Right topology constraint.
5 – Probability constraints:
Since we are dealing with probabilities, the usual normalization constraints apply to the model parameters; if the observations are discrete, the last integration becomes a summation.
6 – HMMs are well defined only for processes that are a function of one independent variable, such as time; they do not work satisfactorily for two variables.
7- The Maximum likelihood tra<strong>in</strong><strong>in</strong>g criterion <strong>used</strong> <strong>in</strong> HMM<br />
leads to poor discrim<strong>in</strong>ation between the acoustic models<br />
given limited tra<strong>in</strong><strong>in</strong>g data and correspond<strong>in</strong>gly limited<br />
models. Discrim<strong>in</strong>ation can be improved us<strong>in</strong>g the Maximum<br />
Mutual Information(MMI) tra<strong>in</strong><strong>in</strong>g criterion but this is more<br />
complex and difficult to implement properly. Because HMMs<br />
suffer from all these weaknesses, they can obta<strong>in</strong> good<br />
performances only by rely<strong>in</strong>g on context dependent phone<br />
models, i.e. tri-phone models.
8.3.Three Basic Problems for HMMs:<br />
There are three basic problems to be solved for HMMs[21].<br />
The parameter estimation problem is to tra<strong>in</strong> speech and<br />
speaker models, the evaluation problem is to compute<br />
likelihood functions for recognition and the decod<strong>in</strong>g<br />
problem is to determ<strong>in</strong>e the best fitt<strong>in</strong>g(unobservable) state<br />
sequence [Rabiner and Juang 1993, Huang et al. 1990].
i)The parameter estimation problem: This problem<br />
determ<strong>in</strong>es the optimal model parameters λ of the HMM<br />
accord<strong>in</strong>g to given optimization criterion. A variant of the EM<br />
algorithm, known as the Baum Welch algorithm, yields an<br />
iterative procedure to re-estimate the model parameters λ<br />
us<strong>in</strong>g the ML criterion [Baum 1972,Baum and Sell<br />
1968,Baum and Eagon 1967]. In the Baum-Welch algorithm,<br />
the unobservable data are the state sequence S and the<br />
observable data are the observation sequence O. The Q-<br />
function for the HMM is as follows<br />
$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})$    (2)
Computing P(S|O,λ) [Rabiner and Juang 1993, Huang et al. 1990], we obtain

$Q(\lambda, \bar{\lambda}) = \sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} P(s_t, s_{t+1} \mid O, \lambda)\, \log\big[\, \bar{a}_{s_t s_{t+1}}\, \bar{b}_{s_{t+1}}(o_{t+1}) \,\big]$    (3)

where $\bar{\pi}_{s_1}$ is denoted by $\bar{a}_{s_0 s_1}$ for simplicity. Regrouping eq. (3) into three terms for the π, A, B coefficients, and applying
Lagrange multipliers, we obtain the HMM parameter estimation equations.

• For discrete HMMs:

$\bar{\pi}_i = \gamma_1(i)$,

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$    (4)

$\bar{b}_j(k) = \dfrac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$    (5)

where

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$,

$\xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$
• For continuous HMMs: the estimation equations for the π and A distributions are unchanged, but the output distribution B is estimated via the Gaussian mixture parameters given in eq. (6):

$\bar{\omega}_{jk} = \dfrac{\sum_{t=1}^{T} \eta_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{K} \eta_t(j,k)}, \qquad \bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \eta_t(j,k)\, x_t}{\sum_{t=1}^{T} \eta_t(j,k)}, \qquad \bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T} \eta_t(j,k)\, (x_t - \mu_{jk})(x_t - \mu_{jk})'}{\sum_{t=1}^{T} \eta_t(j,k)}$    (6)

where

$\eta_t(j,k) = \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \cdot \dfrac{\omega_{jk}\, N(x_t, \mu_{jk}, \Sigma_{jk})}{\sum_{k=1}^{K} \omega_{jk}\, N(x_t, \mu_{jk}, \Sigma_{jk})}$    (7)
Note that for practical implementation, a scal<strong>in</strong>g procedure<br />
[Rab<strong>in</strong>er and Juang 1993] is required to avoid number<br />
underflow on computers with ord<strong>in</strong>ary float<strong>in</strong>g-po<strong>in</strong>t number<br />
representations.<br />
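As a minimal numpy sketch of these re-estimation quantities (assuming the forward and backward variables of eqs. (9) and (10) are already available as T x N arrays; the function name and array layout are our own illustrative choices), the posteriors ξ_t(i, j) and γ_t(i) used in eqs. (4) and (5) can be computed as follows:

```python
import numpy as np

def posteriors(alpha, beta, A, B, obs):
    """Compute xi_t(i, j) and gamma_t(i) from forward/backward variables.

    alpha, beta : (T, N) arrays from eqs. (9) and (10)
    A           : (N, N) transition matrix
    B           : (N, K) discrete output distributions
    obs         : length-T sequence of observation symbol indices
    """
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # numerator of xi_t(i, j): alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()                           # normalise by P(O | lambda)
    gamma = alpha * beta
    gamma = gamma / gamma.sum(axis=1, keepdims=True)      # gamma_t(i) = P(s_t = i | O, lambda)
    return xi, gamma
```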
ii) The evaluation problem: How can we efficiently compute P(O/λ), the probability that the observation sequence O was produced by the model λ?
For solv<strong>in</strong>g this problem, we obta<strong>in</strong><br />
$P(O \mid \lambda) = \sum_{\text{all } S} P(O, S \mid \lambda) = \sum_{s_1, s_2, \ldots, s_T} \pi_{s_1} b_{s_1}(o_1)\, a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)$    (8)
An interpretation of the computation in (8) is the following. At time t=1 we are in state s_1 with probability π_{s_1}, and generate the symbol o_1 with probability b_{s_1}(o_1). A transition is made from state s_1 at time t=1 to state s_2 at time t=2 with probability a_{s_1 s_2}, and we generate the symbol o_2 with probability b_{s_2}(o_2). This process continues in this manner until the last transition at time T, from state s_{T-1} to state s_T, is made with probability a_{s_{T-1} s_T}, and we generate the symbol o_T with probability b_{s_T}(o_T). Figure 7 shows an N-state left-to-right HMM with ∆i set to 1.
Fig.7 The Markov generation Model<br />
To reduce computations, the forward and the backward variables are used. The forward variable α_t(i) is defined as

$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, s_t = i \mid \lambda)$,

which can be computed iteratively as

$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

and

$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(o_{t+1}), \quad 1 \le j \le N,\; 1 \le t \le T-1$    (9)
and the backward variable β_t(i) is defined as

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid s_t = i, \lambda)$,

which can be computed iteratively as

$\beta_T(i) = 1, \quad 1 \le i \le N$

and

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\; t = T-1, \ldots, 1$    (10)
Using these variables, the probability P(O/λ) can be computed from the forward variable, from the backward variable, or from both, as follows:

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$    (11)
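A minimal numpy sketch of this forward computation (eqs. (9) and (11)) is shown below; the function name is illustrative, and for long utterances the scaling procedure mentioned earlier, or a log-space formulation, should be used instead of this direct version.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                          # alpha_1(i) = pi_i b_i(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]  # recursion of eq. (9)
    return alpha, alpha[-1].sum()                         # P(O|lambda) = sum_i alpha_T(i), eq. (11)
```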
iii) The decod<strong>in</strong>g Problem: Given the observation sequence<br />
O and the model λ, how do we choose a correspond<strong>in</strong>g state<br />
sequence S that is optimal <strong>in</strong> some sense?<br />
This problem attempts to uncover the hidden part of the model.<br />
There are several possible ways to solve this problem, but the<br />
most widely <strong>used</strong> criterion is to f<strong>in</strong>d the s<strong>in</strong>gle best state<br />
sequence; this can be implemented by the Viterbi algorithm.
In practice, it is preferable to base recognition on the<br />
maximum likelihood state sequence s<strong>in</strong>ce this generalizes<br />
easily to the cont<strong>in</strong>uous speech case. This likelihood is<br />
computed us<strong>in</strong>g the same algorithm as forward algorithm<br />
except that the summation is replaced by a maximum<br />
operation.<br />
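For illustration, a minimal log-domain Viterbi decoder along the lines just described might look as follows; the names and structure are ours, and the max operation replaces the summation of the forward recursion exactly as stated above.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the single best state sequence and its log-likelihood."""
    T, N = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, N))                  # best log-score ending in state i at time t
    psi = np.zeros((T, N), dtype=int)         # back-pointers
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA          # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]                   # best final state
    for t in range(T - 1, 0, -1):                      # trace the back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())
```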
Comparison of Template and HMM methods:
Compared to template based approach, hidden Markov<br />
model<strong>in</strong>g is more general and has a firmer mathematical<br />
foundation. A template based model is simply a cont<strong>in</strong>uous<br />
density HMM, with identity covariance matrices and a slope<br />
constra<strong>in</strong>ed topology. Although templates can be tra<strong>in</strong>ed on<br />
fewer <strong>in</strong>stances, they lack the probabilistic formulation of full<br />
HMMs and typically underperform HMMs. Compared to<br />
knowledge based approaches; HMMs enable easy <strong>in</strong>tegration<br />
of knowledge sources <strong>in</strong>to a compiled architecture. A negative<br />
side effect of this is that HMMs do not provide much <strong>in</strong>sight<br />
on the recognition process. As a result, it is often difficult to<br />
analyze the errors of an HMM system <strong>in</strong> an attempt to<br />
improve its performance. Nevertheless, prudent <strong>in</strong>corporation<br />
of knowledge has significantly improved HMM based systems.<br />
8.4. Advantages and Disadvantages of HMM:<br />
a) Advantages:<br />
1) One of the most important advantages of HMMs is that they can easily be extended to deal with strong tasks.
2) In the tra<strong>in</strong><strong>in</strong>g stages, HMMs are dynamically assembled<br />
accord<strong>in</strong>g to the class sequence. For example, if the class<br />
sequence was my hat, then two models for each word<br />
would be l<strong>in</strong>ked, with the last state of the first l<strong>in</strong>k<strong>in</strong>g to<br />
the first state of the second. The re-estimation algorithm<br />
is then applied as usual. Once tra<strong>in</strong><strong>in</strong>g on that <strong>in</strong>stance is<br />
complete, the models are unl<strong>in</strong>ked aga<strong>in</strong>. When<br />
recognition is attempted, large HMMs are assembled<br />
from the smaller <strong>in</strong>dividual models. This is done by<br />
convert<strong>in</strong>g from a grammar <strong>in</strong>to a graph representation,<br />
then replac<strong>in</strong>g each node <strong>in</strong> the graph with the appropriate<br />
model. This process is called ``embedded re-estimation''.<br />
To f<strong>in</strong>d out what the class sequence was, the most<br />
probable path is calculated. The path traversed<br />
corresponds to a sequence of classes, which is our f<strong>in</strong>al<br />
classification.<br />
3) Because each HMM uses only positive data, they scale<br />
well. New words can be added without affect<strong>in</strong>g learnt<br />
HMMs. It is also possible to set up HMMs <strong>in</strong> such a way<br />
that they can learn <strong>in</strong>crementally. As mentioned above,<br />
grammar and other constructs can be built <strong>in</strong>to the system<br />
by us<strong>in</strong>g embedded re-estimation. This gives the<br />
opportunity for the <strong>in</strong>clusion of high-level doma<strong>in</strong><br />
knowledge, which is important for tasks like speech<br />
recognition where a great deal of doma<strong>in</strong> knowledge is<br />
available.<br />
4) Architecture-Basic characteristics of the mathematical<br />
frame work are useful for speech recognition.<br />
5) Completeness: advantages of the underly<strong>in</strong>g approach<br />
over specific knowledge based approaches<br />
6) Flexibility: Ways <strong>in</strong> which speech knowledge can be<br />
<strong>in</strong>corporated <strong>in</strong>to HMMs <strong>in</strong> the<br />
form of constra<strong>in</strong>ts on the basic flexible structure.<br />
b)Disadvantages:<br />
i) They make very large assumptions about the data.<br />
ii) They make the Markovian assumption: that the emission<br />
and the transition probabilities depend only on the current<br />
state. This has subtle effects; for example, the probability of<br />
staying in a given state falls off exponentially.
iii) The Gaussian mixture assumption for continuous-density hidden Markov models is a huge one; we cannot always assume that the values are distributed in a normal manner.
iv) The number of parameters that need to be set <strong>in</strong> an HMM<br />
is huge.<br />
v) The Viterbi algorithm allocates frames to states; the frames associated with a state can often change, causing further susceptibility of the parameters. Those involved in HMMs
often use the technique of ``parameter-ty<strong>in</strong>g'' to reduce the<br />
number of variables that need to be learnt by forc<strong>in</strong>g the<br />
emission probabilities <strong>in</strong> one state to be the same as those <strong>in</strong><br />
another. For example, if one had two words: cat and mad, then<br />
the parameters of the states associated with the ``a'' sound<br />
could be tied together.<br />
vi) As a result of the above, the amount of data that is required<br />
to tra<strong>in</strong> an HMM is very large. This can be seen by<br />
consider<strong>in</strong>g typical speech recognition corpora that are <strong>used</strong><br />
for tra<strong>in</strong><strong>in</strong>g. The TIMIT database for <strong>in</strong>stance, has a total of<br />
630 readers read<strong>in</strong>g a text; the ISOLET database for isolated<br />
letter recognition has 300 examples per letter. Many other<br />
doma<strong>in</strong>s do not have such large datasets readily available.<br />
vii) HMMs only use positive data to tra<strong>in</strong>. In other words,<br />
HMM tra<strong>in</strong><strong>in</strong>g <strong>in</strong>volves maximiz<strong>in</strong>g the observed probabilities.<br />
viii) In some doma<strong>in</strong>s, the number of states and transitions can<br />
be found us<strong>in</strong>g an educated guess or trial and error, <strong>in</strong> general,<br />
there is no way to determ<strong>in</strong>e this. Furthermore, the states and<br />
transitions depend on the class be<strong>in</strong>g learnt.<br />
ix) The concept learnt by a hidden Markov model is the<br />
emission and transition probabilities. If one is try<strong>in</strong>g to<br />
understand the concept learnt by the hidden Markov model,<br />
then this concept representation is difficult to understand. In<br />
speech recognition, this issue is of little significance, but <strong>in</strong><br />
other doma<strong>in</strong>s, it may be even more important than accuracy.<br />
x) First-order HMMs make Markovian assumptions of conditional dependence (i.e. being in a state depends upon the previous state).
xi) HMMs are well defined for processes that are a function of one independent variable, such as time, and are therefore essentially one dimensional.
xii) One major limitation of the statistical models is that they<br />
work well only when the underly<strong>in</strong>g assumptions are satisfied.<br />
The effectiveness of these methods depends to a large extent<br />
on the various assumptions or conditions under which the<br />
models are developed.<br />
8.5. Applications:<br />
1) First application of Markov Cha<strong>in</strong>s was made by Andrey<br />
Markov himself <strong>in</strong> the area of language model<strong>in</strong>g.<br />
2) Another example of Markov cha<strong>in</strong>s application <strong>in</strong><br />
l<strong>in</strong>guistics is stochastic language model<strong>in</strong>g.<br />
3) Use of Markov chains to generate random numbers that belong exactly to the desired distribution.
4) HMM for f<strong>in</strong>ancial economic applications<br />
5) HMM for Signature verification<br />
6) HMM for <strong>Speech</strong> and speaker recognition<br />
7) Hidden Markov Model <strong>in</strong> Intrusion Detection Systems<br />
8) HMM <strong>in</strong> bio <strong>in</strong>formatics<br />
9) HMM applications <strong>in</strong> bar code read<strong>in</strong>g<br />
10) HMM applications <strong>in</strong> computer vision<br />
9. Non-parametric techniques:
In most real problems, even the types of the density functions of interest are unknown. Looking at histograms, scatter
plots or tables of the data, or the application of statistical<br />
procedures may suggest that a particular type of the class<br />
density may be <strong>used</strong>, or they may <strong>in</strong>dicate that the data are not<br />
well fit by any of the standard types of densities or<br />
distributions. In this case, non parametric techniques are<br />
needed. There are different classification methods <strong>in</strong> non<br />
parametric techniques namely vector quantization, Artificial<br />
Neural Network, Support vector mach<strong>in</strong>es, K-Nearest<br />
Neighbor method and Gaussian Mixture Model<strong>in</strong>g methods.<br />
These methods are discussed <strong>in</strong> the follow<strong>in</strong>g sections.<br />
9.1. Advantages and disadvantages <strong>in</strong> Non Parametric<br />
Method:<br />
a) Advantages:<br />
(1) Nonparametric tests make less stringent demands on the
data. For standard parametric procedures to be valid, certa<strong>in</strong><br />
underly<strong>in</strong>g conditions or assumptions must be met,<br />
particularly for smaller sample sizes.<br />
(2) Nonparametric procedures can sometimes be <strong>used</strong> to get a<br />
quick answer with little calculation.<br />
3) Nonparametric methods provide an air of objectivity when<br />
there is no reliable (universally recognized) underly<strong>in</strong>g scale<br />
for the orig<strong>in</strong>al data and there is some concern that the results<br />
of standard parametric techniques would be criticized for their<br />
dependence on an artificial metric.<br />
4) One of the key advantages of non-parametric techniques is<br />
that they do not make any statistical assumptions about data.<br />
b) Disadvantages:<br />
1) The major disadvantage of nonparametric techniques is<br />
conta<strong>in</strong>ed <strong>in</strong> its name. Because the procedures are<br />
nonparametric, there are no parameters to describe and it<br />
becomes more difficult to make quantitative statements about<br />
the actual difference between populations.<br />
2) The second disadvantage is that nonparametric procedures<br />
throw away <strong>in</strong>formation. Because <strong>in</strong>formation is discarded,<br />
nonparametric procedures can never be as powerful (able to<br />
detect exist<strong>in</strong>g differences) as their parametric counterparts<br />
when parametric tests can be <strong>used</strong>.<br />
9.3. Applications of Non parametric methods:<br />
1) <strong>Speech</strong> recognition applications<br />
2) Chi-square applications<br />
3) Efficiency analysis of the models<br />
4) Analysis of Hedonic Models<br />
5) Data m<strong>in</strong><strong>in</strong>g<br />
6) Cl<strong>in</strong>ical applications<br />
10. Vector quantization [5]:<br />
Vector Quantization(VQ)[97] is often applied to ASR. It is a<br />
system for mapp<strong>in</strong>g a sequence of cont<strong>in</strong>uous or discrete<br />
vectors <strong>in</strong>to a digital sequence suitable for communication<br />
over or storage in a digital channel. The goal of this system is data compression: to reduce the bit rate so as to minimize communication channel capacity or digital storage memory requirements while maintaining the necessary fidelity of the data.
10.1. Introduction to vector quantization:<br />
Vector quantization is a classical quantization technique<br />
from signal process<strong>in</strong>g which allows the model<strong>in</strong>g of<br />
probability density functions by the distribution of prototype<br />
vectors. It was orig<strong>in</strong>ally <strong>used</strong> for data compression. It works<br />
by divid<strong>in</strong>g a large set of po<strong>in</strong>ts (vectors) <strong>in</strong>to groups hav<strong>in</strong>g<br />
approximately the same number of po<strong>in</strong>ts closest to them.<br />
Each group is represented by its centroid po<strong>in</strong>t, as <strong>in</strong> k-means<br />
and some other cluster<strong>in</strong>g algorithms.<br />
The density match<strong>in</strong>g property of vector quantization is<br />
powerful, especially for identify<strong>in</strong>g the density of large and<br />
high-dimensioned data. S<strong>in</strong>ce data po<strong>in</strong>ts are represented by<br />
the <strong>in</strong>dex of their closest centroid, commonly occurr<strong>in</strong>g data<br />
have low error, and rare data high error. This is why VQ is<br />
suitable for lossy data compression. It can also be <strong>used</strong> for<br />
lossy data correction and density estimation. Vector<br />
quantization is based on the competitive learning paradigm, so it is closely related to the self-organizing map model.
The results of either the filter-bank analysis or the LPC analysis are a series of vectors characteristic of the time-varying spectral properties of the speech signal. The spectral vectors are denoted v_l, l = 1, 2, ..., L, where each vector is typically p-dimensional. If we compare the information rate of the vector representation to that of the raw speech waveform, we see that the spectral analysis has significantly reduced the required information rate: a rate of 160,000 bps (for example, 10,000 samples per second at 16 bits per sample) is required to store the speech samples in uncompressed format. For the spectral analysis, consider vectors of dimension p = 10 using 100 spectral vectors per second. If we again represent each spectral component to 16-bit precision, the required storage is about 100 x 10 x 16 bps, or 16,000 bps, about a 10-to-1 reduction over the uncompressed signal. Such compressions in storage rate are impressive.
Based on the concept of ultimately needing only a single spectral representation for each basic speech unit, it may be possible to further reduce the raw spectral representation of speech to vectors drawn from a small, finite number of unique spectral vectors, each corresponding to one of the basic speech units. This ideal representation is of course impractical because there is so much variability in the spectral properties of each of the basic speech units. However, the concept of building a codebook of distinct analysis vectors, albeit with significantly more code words than the basic set of phonemes, remains an attractive idea and is the basis behind the set of techniques commonly called vector quantization methods. Based on this line of reasoning, assume that we require a codebook with about 1024 unique spectral vectors. Then to represent an arbitrary spectral vector, all we need is a 10-bit number: the index of the codebook vector that best matches the input vector. Assuming a rate of 100 spectral vectors per second, we see that a total bit rate of about 1000 bps is required to represent the spectral vectors of a speech signal. This rate is about 1/16th the rate required by the continuous spectral vectors. Hence the vector quantization representation is potentially an extremely efficient representation of the spectral information in the speech signal.
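To make the arithmetic above concrete, the short Python sketch below recomputes the three storage rates and the resulting compression ratios; the sampling rate and precisions are the illustrative figures quoted above, not prescriptions.

import math

# bit-rate comparison for raw samples, spectral vectors, and VQ indices
sample_rate = 10000                                # assumed samples per second
bits_per_sample = 16
raw_rate = sample_rate * bits_per_sample           # 160,000 bps, uncompressed

vectors_per_second = 100
dimension = 10                                     # p = 10 spectral components
bits_per_component = 16
spectral_rate = vectors_per_second * dimension * bits_per_component   # 16,000 bps

codebook_size = 1024
bits_per_index = int(math.log2(codebook_size))     # 10-bit codebook index
vq_rate = vectors_per_second * bits_per_index      # 1,000 bps

print(raw_rate, spectral_rate, vq_rate)
print("spectral vs raw:", raw_rate / spectral_rate)    # about 10 : 1
print("VQ vs spectral:", spectral_rate / vq_rate)      # about 16 : 1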
10.1.1. Elements of a vector quantization implementation:
To build a vector quantizer and implement a VQ analysis procedure we need the following:
• a large set of spectral analysis vectors v_1, v_2, ..., v_L, which form a training set. The training set is used to create the optimal set of codebook vectors for representing the spectral variability observed in the training set. If we denote the size of the VQ codebook as M = 2^B vectors, then we require L >> M so as to be able to find the best set of M codebook vectors in a robust manner. In practice, it has been found that L should be at least 10M in order to train a VQ codebook that works reasonably well.
• a measure of similarity, or distance, between a pair of spectral analysis vectors, so as to be able to cluster the training-set vectors as well as to associate or classify arbitrary spectral vectors into unique codebook entries. We denote the spectral distance d(v_i, v_j) between two vectors v_i and v_j as d_ij. We defer a discussion of spectral distance measures.
• a centroid computation procedure. On the basis of the partitioning that classifies the L training-set vectors into M clusters, we choose the M codebook vectors as the centroids of each of the M clusters.
• a classification procedure for arbitrary speech spectral analysis vectors that chooses the codebook vector closest to the input vector and uses the codebook index as the resulting spectral representation. This is often referred to as the nearest-neighbor labeling or optimal encoding procedure. The classification procedure is essentially a quantizer that accepts as input a speech spectral vector and provides as output the codebook index of the codebook vector that best matches the input. Figure 8 shows the basic VQ training and classification structure.
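A minimal sketch of these four elements follows, assuming NumPy is available and using plain k-means as the clustering step (the text above does not prescribe a particular training algorithm; the LBG algorithm is a common alternative). The Euclidean distance stands in for the spectral distance measure d(v_i, v_j), and all data values are illustrative.

import numpy as np

def train_codebook(training_vectors, M, iterations=20, seed=0):
    """Cluster L training vectors into M codebook vectors (the centroids)."""
    rng = np.random.default_rng(seed)
    L = len(training_vectors)
    codebook = training_vectors[rng.choice(L, size=M, replace=False)]
    for _ in range(iterations):
        # nearest-neighbor labeling of every training vector
        d = np.linalg.norm(training_vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # centroid computation for every cluster
        for m in range(M):
            members = training_vectors[labels == m]
            if len(members) > 0:
                codebook[m] = members.mean(axis=0)
    return codebook

def encode(vector, codebook):
    """Return the index of the codebook vector closest to the input vector."""
    return int(np.linalg.norm(codebook - vector, axis=1).argmin())

# toy usage: L = 10 * M training vectors of dimension p = 10
p, M = 10, 64
train = np.random.randn(10 * M, p)
cb = train_codebook(train, M)
idx = encode(np.random.randn(p), cb)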
Figure 8. The basic VQ training and classification structure.
10.2. Advantages and Disadvantages of Vector<br />
Quantization:<br />
a)Advantages:<br />
1) Reduced storage for spectral analysis <strong>in</strong>formation. This<br />
efficiency can be exploited <strong>in</strong> a number of ways <strong>in</strong> practical<br />
vector quantization based speech recognition systems.<br />
2) Reduced computation for determ<strong>in</strong><strong>in</strong>g similarity of spectral<br />
analysis vectors. In speech recognition a major component of<br />
the computation is the determ<strong>in</strong>ation of spectral similarity<br />
between a pair of vectors. Based on the vector quantization<br />
representation this spectral similarity computation is often<br />
reduced to a table lookup of similarities between pairs of<br />
codebook vectors.<br />
3) Discrete representation of speech sounds. By associating a phonetic label (or possibly a set of phonetic labels, or a phonetic class) with each codebook vector, the process of
choos<strong>in</strong>g a best codebook vector to represent a given spectral<br />
vector becomes equivalent to assign<strong>in</strong>g a phonetic label to<br />
each spectral frame of speech. A range of recognition systems<br />
exist that exploit these labels so as to recognize speech <strong>in</strong> an<br />
efficient manner.<br />
4) Vector quantization lowers the bit rate of the signal being quantized, thus making it more bandwidth-efficient than scalar quantization. This, however, contributes to its implementation complexity (computation and storage).
b) Disadvantages:<br />
1) An inherent spectral distortion in representing the actual analysis vector. Since there are a finite number of codebook vectors, the process of choosing the "best" representation of a given spectral vector is inherently equivalent to quantizing the vector and leads, by definition, to a certain level of quantization error. As the size of the codebook increases, the quantization error decreases. However, with any finite codebook there will always be some non-zero level of quantization error.
2) The storage required for codebook vectors is often non-trivial. The larger the codebook (so as to reduce quantization error), the more storage is required for the codebook entries. For codebook sizes of 1000 or larger, the storage is often non-trivial; hence an inherent trade-off exists among quantization error, processing for choosing the codebook vector, and storage of codebook vectors, and practical designs balance each of these three factors.
3) VQ has a low prediction gain for the vector predictor, due to the behavior of the autocorrelation function of speech with increasing lag.
10.3. Applications:<br />
i) Image and voice compression<br />
ii) <strong>Speech</strong> <strong>Recognition</strong> application<br />
iii) Image cod<strong>in</strong>g<br />
iv) VQ for neural gas network<br />
v) VQ is <strong>used</strong> for lossy data compression, lossy<br />
data correction, and density estimation<br />
11. Artificial Neural Network (ANN)[5]:<br />
A variety of knowledge sources need to be established in the AI approach to speech recognition. Two key concepts of artificial intelligence here are automatic knowledge acquisition (learning and adaptation). One way in which these concepts have been implemented is via the neural network approach. Fig. 9 shows an example of a neural network model.
11.1. Basics of Neural Networks:<br />
A neural network, also called a connectionist model or a parallel distributed processing (PDP) model, is basically a dense interconnection of simple, nonlinear computational elements. It is assumed that there are N inputs, labeled x_1, x_2, ..., x_N, which are summed with weights w_1, w_2, ..., w_N, thresholded, and then nonlinearly compressed to give the output y, defined as

y = f( Σ_{i=1}^{N} w_i x_i − φ )   -----(12)

where φ is an internal threshold or offset, and f is a nonlinearity such as the hard limiter f(x) = +1 if x ≥ 0 and −1 if x < 0, or a sigmoid. The sigmoid nonlinearities are used most often because they are continuous and differentiable. The biological basis of the neural network is a model by McCullough and Pitts [22] of neurons in the human nervous system.
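A minimal sketch of the computation in equation (12), assuming NumPy and using the sigmoid nonlinearity mentioned above; the input values, weights, and threshold below are arbitrary illustrative numbers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(x, w, phi, f=sigmoid):
    """y = f( sum_i w_i * x_i - phi ), as in equation (12)."""
    return f(np.dot(w, x) - phi)

x = np.array([0.5, -1.2, 3.0])      # N = 3 inputs
w = np.array([0.8, 0.1, -0.4])      # weights
phi = 0.2                           # internal threshold / offset
y = neuron_output(x, w, phi)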
11.1.1 Neural Network topologies:<br />
There are several issues <strong>in</strong> the design of the so called artificial<br />
neural networks which model various physical phenomena,<br />
where we def<strong>in</strong>e an ANN as an arbitrary connection of simple<br />
computational elements. One key issue <strong>in</strong> network topologies<br />
– that is, how the simple computational elements are<br />
<strong>in</strong>terconnected. There are three standard and well known<br />
topologies.<br />
i) S<strong>in</strong>gle/multilayer perceptrons<br />
ii) Hopfield or recurrent networks<br />
iii) Kohonen or Self-organiz<strong>in</strong>g networks<br />
In the s<strong>in</strong>gle/multilayer perceptron, the outputs of one or more<br />
simple computational elements at one layer form the <strong>in</strong>puts to<br />
a new set of simple computational elements of the next layer.<br />
The single-layer perceptron has N inputs connected to M outputs in the output layer, as shown in Fig. 9. The three-layer perceptron has two hidden layers between the input and output layers. The single-layer perceptron can separate static patterns into classes with class boundaries characterized by hyperplanes in the (x_1, x_2, ..., x_N) space. Similarly, a multilayer perceptron with at least one hidden layer can realize an arbitrary set of decision regions in the (x_1, x_2, ..., x_N) space. Thus, for example, if the inputs to a multilayer perceptron are the first two speech resonances (F1 and F2), the network can implement a set of decision regions that partition the (F1, F2) space into the 10 steady-state vowels.
The Hopfield network is a recurrent network <strong>in</strong> which the<br />
<strong>in</strong>put to each computational element <strong>in</strong>cludes both <strong>in</strong>puts as<br />
well as outputs. Thus, with the input and output indexed by time, x_i(t) and y_i(t), and the weight connecting the ith node and the jth node denoted by w_ij, the basic equation for the ith recurrent computational element is

y_i(t) = f[ x_i(t) + Σ_j w_ij y_j(t−1) − φ ]   (13)

for a recurrent network with N inputs and N outputs. The most important property of the Hopfield network is that w_ij = w_ji; when the recurrent computation of eq. (13) is performed asynchronously, for an arbitrary constant input, the network will eventually settle to a fixed point where y_i(t) = y_i(t−1) for all i. These fixed relaxation points represent stable configurations of the network and can be used in applications that have a fixed set of patterns to be matched, in the form of a content-addressable or associative memory. A recurrent network has a stable set of attractors and repellers, each forming a fixed point in the input space. Every input vector x is either attracted to one of the fixed points or repelled from another of the fixed points. The strength of this type of network is its ability to correctly classify noisy versions of the patterns that form the stable fixed points.
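A small sketch of the asynchronous recurrent update of equation (13), assuming NumPy, a symmetric weight matrix (w_ij = w_ji) and a hard-limiter nonlinearity; the external input and threshold terms are omitted for brevity, and the stored pattern and Hebbian outer-product weights are purely illustrative.

import numpy as np

def hopfield_recall(W, x, steps=50):
    """Asynchronously update y_i = sgn( sum_j w_ij * y_j ) until a fixed point."""
    y = x.copy()
    for _ in range(steps):
        prev = y.copy()
        for i in np.random.permutation(len(y)):   # asynchronous update order
            y[i] = 1 if W[i] @ y >= 0 else -1
        if np.array_equal(y, prev):               # fixed point: y(t) == y(t-1)
            break
    return y

# store one bipolar pattern with the Hebbian rule, then recall a noisy version
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)                          # symmetric weights, zero diagonal
noisy = pattern.copy()
noisy[0] *= -1
recovered = hopfield_recall(W, noisy)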
The third popular type of neural network topology is the<br />
Kohonen, self organiz<strong>in</strong>g feature map, which is a cluster<strong>in</strong>g<br />
procedure for provid<strong>in</strong>g a codebook of stable patterns <strong>in</strong> the<br />
<strong>in</strong>put space that characterize an arbitrary <strong>in</strong>put vector, by a<br />
small number of representative clusters.<br />
Figure 9. Simplified view of an artificial neural network.
11.1.2. Network Characteristics:<br />
Four model characteristics must be specified to implement an<br />
arbitrary neural network. Fig.9 shows the architecture of the<br />
simple neural network model.<br />
a) Number and type of <strong>in</strong>puts: The issues <strong>in</strong>volved <strong>in</strong> the<br />
choice of <strong>in</strong>puts to a neural network are similar to those<br />
<strong>in</strong>volved <strong>in</strong> the choice of features for any pattern classification<br />
system. They must provide the <strong>in</strong>formation required to make<br />
the decision required of the network.<br />
b) Connectivity of the network: This issue <strong>in</strong>volves the size of<br />
the network, that is, the number of hidden layers and the number of nodes in each layer between the input and output. There is no good rule of thumb as to how large (or small)
such hidden layers must be. Intuition says that if the hidden<br />
layers are large, then it will be difficult to tra<strong>in</strong> the network.<br />
Similarly, if the hidden layers are too small the network may<br />
not be able to accurately classify the entire desired <strong>in</strong>put<br />
pattern.<br />
c) Choice of offset: The choice of the threshold φ for each computational element must be made as part of the training procedure, which chooses values for the interconnection weights w_ij and the offsets φ.
d) Choice of nonlinearity: Experience indicates that the exact choice of the nonlinearity f is not very important in terms of the network performance. However, f must be continuous and differentiable for the training algorithm to be applicable.
11.2. Training of Neural Network Parameters:
To completely specify a neural network, values for the weighting coefficients and the offset threshold of each computational element must be determined, based on a labeled set of training data. A labeled training set means an association between a set of Q input vectors x_1, x_2, ..., x_Q and a set of Q desired output vectors y_1, y_2, ..., y_Q, where each input x_q is paired with its desired output y_q. For multilayer perceptrons a simple, iterative, convergent procedure exists for choosing a set of parameters whose value asymptotically approaches a stationary point with a certain
optimality property (e.g., a local minimum of the mean squared error). This procedure, called back-propagation learning, is a simple stochastic gradient technique. For a simple, single-layer network, the training algorithm can be realized via the following convergence steps:
Perceptron Convergence Procedure
1. Initialization: At time t = 0, set w_ij(0) and φ_j to small random values (where w_ij are the weighting coefficients connecting the ith input node and the jth output node, φ_j is the offset of a particular computational element, and the w_ij are functions of time).
2. Acquire input: At time t, obtain a new input x = {x_1, x_2, ..., x_N} with the desired output y^x = {y^x_1, y^x_2, ..., y^x_M}.
3. Calculate output: y_j = f( Σ_{i=1}^{N} w_ij(t) x_i − φ_j ).
4. Adapt weights: Update the weights as w_ij(t+1) = w_ij(t) + η(t) [y^x_j − y_j] x_i, where η(t) is a gain (learning-rate) term.
5. Iteration: Iterate steps 2-4 until the weights converge, i.e., w_ij(t+1) = w_ij(t).
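A compact sketch of the convergence procedure above for a single-layer network, assuming NumPy; the gain term η(t) is held constant and the hard limiter is used as f, both simplifying assumptions, and the toy labeled training set (labels given by the sign of the first input) is purely illustrative.

import numpy as np

def train_single_layer(X, Y, eta=0.1, epochs=100, seed=0):
    """Steps 1-5: initialize, present inputs, compute outputs, adapt weights, iterate."""
    rng = np.random.default_rng(seed)
    N, M = X.shape[1], Y.shape[1]
    W = rng.normal(scale=0.01, size=(N, M))        # w_ij, small random values
    phi = rng.normal(scale=0.01, size=M)           # offsets, fixed here
    f = lambda a: np.where(a >= 0, 1.0, -1.0)      # hard-limiter nonlinearity
    for _ in range(epochs):
        W_old = W.copy()
        for x, y_desired in zip(X, Y):
            y = f(x @ W - phi)                     # step 3: calculate output
            W += eta * np.outer(x, y_desired - y)  # step 4: adapt weights
        if np.allclose(W, W_old):                  # step 5: stop when weights settle
            break
    return W, phi

# toy labeled set: output is the sign of the first input component
X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.5, 0.3], [-0.5, -2.0]])
Y = np.array([[1.0], [1.0], [-1.0], [-1.0]])
W, phi = train_single_layer(X, Y)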
11.3. Difference between Neural Networks and<br />
Conventional Classifiers:<br />
The difference between the neural network classifier and the conventional classifier is given in Table A.
TABLE A
Differences between neural network and conventional classifiers
1) Neural network: estimates posterior probabilities. Conventional classifier: based on Bayes decision theory, using posterior probabilities.
2) Neural network: nonlinear, model-free method. Conventional classifier: linear, model-based method.
3) Neural network: uses a discriminant function. Conventional classifier: uses a probabilistic function.
4) Neural network: minimizes the total number of misclassification errors. Conventional classifier: minimizes the classification error.
5) Neural network: data driven and self-adapting. Conventional classifier: data driven but not self-adapting.
Statistical pattern classifiers are based on the Bayes decision<br />
theory <strong>in</strong> which posterior probabilities play a central role. The<br />
fact that neural networks can <strong>in</strong> fact provide estimates of<br />
posterior probability implicitly establishes the l<strong>in</strong>k between<br />
neural networks and statistical classifiers. The direct<br />
comparison between them may not be possible s<strong>in</strong>ce neural<br />
networks are nonl<strong>in</strong>ear model-free method while statistical<br />
methods are basically l<strong>in</strong>ear and model based. By appropriate<br />
cod<strong>in</strong>g of the desired output membership values, we may let<br />
neural networks directly model some discrim<strong>in</strong>ant functions.<br />
For example, in a two-group classification problem, the desired output may be coded as 1 if the object is from class 1 and −1 if it is from class 2. The neural network then estimates the following discriminant function:

g(x) = P(w_1 | x) − P(w_2 | x)   ---(14)

The discriminating rule is simply: assign x to w_1 if g(x) > 0, and to w_2 if g(x) < 0.
Logistic regression has been widely used in medical diagnosis and epidemiologic studies [32]. It
is often preferred over discrim<strong>in</strong>ant analysis <strong>in</strong> practice<br />
[33,34]. In addition, the model can be <strong>in</strong>terpreted as posterior<br />
probability or odds ratio. It is a simple fact that when the<br />
logistic transfer function is <strong>used</strong> for the output nodes, simple<br />
neural networks without hidden layers are identical to logistic<br />
regression models. Another connection is that the maximum<br />
likelihood function of logistic regression is essentially the<br />
cross-entropy cost function which is often <strong>used</strong> <strong>in</strong> tra<strong>in</strong><strong>in</strong>g<br />
neural network classifiers. Schumacher et al. [35] make a<br />
detailed comparison between neural networks and logistic<br />
regression. They f<strong>in</strong>d that the added model<strong>in</strong>g flexibility of<br />
neural networks due to hidden layers does not automatically<br />
guarantee their superiority over logistic regression because of<br />
the possible over fitt<strong>in</strong>g and other <strong>in</strong>herent problems with<br />
neural networks [36]. L<strong>in</strong>ks between neural and other<br />
conventional classifiers have been illustrated by<br />
[37,38,39,40,41,42,43]. Ripley [44,45] empirically compares<br />
neural networks with various classifiers such as classification<br />
tree, projection pursuit regression, l<strong>in</strong>ear vector quantization,<br />
multivariate adaptive regression spl<strong>in</strong>es and nearest neighbor<br />
methods.<br />
A large number of studies have been devoted to empirical<br />
comparisons between neural and conventional classifiers. The<br />
most comprehensive one can be found <strong>in</strong> Michie et al. [46]<br />
which reports a large-scale comparative study—the StatLog<br />
project. In this project, three general classification approaches<br />
of neural networks, statistical classifiers and mach<strong>in</strong>e learn<strong>in</strong>g<br />
with 23 methods are compared us<strong>in</strong>g more than 20 different<br />
real data sets. Their general conclusion is that no s<strong>in</strong>gle<br />
classifier is the best for all data sets although the feed forward<br />
neural networks do have good performance over a wide range<br />
of problems.<br />
Neural networks have also been compared with decision trees [47,48,49,50], discriminant analysis [51], [52], [53], [54], [55], CART [56], k-nearest-neighbor [57], and linear programming methods. Although classification costs are difficult to assign in
real problems, ignor<strong>in</strong>g the unequal misclassification risk for<br />
different groups may have significant impact on the practical<br />
use of the classification. It should be po<strong>in</strong>ted out that a neural<br />
classifier which m<strong>in</strong>imizes the total number of<br />
misclassification errors may not be useful for situations where<br />
different misclassification errors carry highly uneven<br />
consequences or costs.<br />
11.4. Advantages and Disadvantages of Neural Networks:<br />
a) Advantages:
1) Neural networks are data driven and self-adaptive: they can adjust themselves to the data without any explicit specification of a functional or distributional form for the underlying model.
2) They possess a self-organization mechanism.
3) They have fault-tolerance capabilities; robustness follows from network redundancy, and they can readily implement a massive degree of parallel computation.
4) A neural network can perform tasks that a linear program cannot.
5) When an element of the neural network fails, the network can continue operating because of its parallel nature.
6) A neural network learns and does not need to be reprogrammed.
7) It can be applied to a wide range of applications.
8) The connectionist structure can be used to model the local feature vector conditioned on the Markov process.
9) There is no need to assume an underlying data distribution, as is usually done in statistical modeling.
10) Neural networks are applicable to multivariate non-linear problems.
11) They are universal functional approximators, in that neural networks can approximate any function with arbitrary accuracy [58], [59], [60].
12) Neural networks are nonlinear models, which makes them flexible in modeling real-world complex relationships.
13) Neural networks are able to estimate the posterior probabilities, which provides the basis for establishing classification rules and performing statistical analysis [61].
14) The connection weights of the network need not be constrained to be fixed; they can be adapted in real time to improve performance.
15) Because of the nonlinearity within each computational element, a sufficiently large neural network can approximate any nonlinearity or nonlinear dynamical system.
16) They can adapt to unknown situations.
17) They exhibit autonomous learning due to learning and generalization.
b. Disadvantages:<br />
1) The neural network needs tra<strong>in</strong><strong>in</strong>g to operate.<br />
2) The architecture of a neural network is different from the architecture of microprocessors and therefore needs to be emulated.
3) Requires high processing time for large neural networks.
4) Minimizing overfitting requires a great deal of computational effort.
5) The individual relations between the input variables and the output variables are not developed by engineering judgment, so the model tends to be a black box or input/output table without an analytical basis.
6) The sample size has to be large.<br />
7) Large complexity of the network structure.<br />
11.5. Applications:<br />
S<strong>in</strong>ce neural networks are best at identify<strong>in</strong>g patterns or trends<br />
<strong>in</strong> data, they are well suited for prediction or forecast<strong>in</strong>g needs<br />
<strong>in</strong>clud<strong>in</strong>g:<br />
1) Sales forecasting, industrial process control, customer research, data validation, risk management, target marketing
2) Modeling and diagnosing the cardiovascular system
3) Medicine / medical diagnosis
4) Business / marketing
5) Electronic noses
6) Speech recognition
7) Credit evaluation
8) Speech and speaker applications
9) Fault detection
10) Prediction: learning from past experiences; weather prediction
11) Classification: image processing, risk management
12) Recognition: character / handwritten recognition
13) Data association
14) Data conceptualization
15) Data filtering
16) Planning
12. Support Vector Machines (SVM):
During the last decade, a new tool appeared in the field of machine learning that has proved able to cope with hard classification problems in several fields of application: the Support Vector Machine (SVM). An SVM is essentially a binary nonlinear classifier capable of guessing whether an input vector x belongs to class 1 (the desired output would then be y = +1) or to class 2 (y = −1). This algorithm was first proposed in [63] in 1992, and it is a nonlinear version of a much older linear algorithm, the optimal hyperplane decision rule (also known as the generalized portrait algorithm), which was introduced in the sixties.
SVMs are effective discriminative classifiers with several outstanding characteristics [62], namely: their solution is the one with maximum margin; they are capable of dealing with samples of very high dimensionality; and their convergence to the minimum of the associated cost function is guaranteed. A Support Vector Machine performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.
12.1. Introduction to Support Vector Machine (SVM) models:
One of the powerful tools for pattern recognition that uses a discriminative approach is the SVM [97]. SVMs use linear and nonlinear separating hyperplanes for data classification. Since SVMs can only classify fixed-length data vectors, the method cannot be readily applied to tasks involving variable-length data classification; variable-length data have to be transformed to fixed-length vectors before SVMs can be used. An SVM is a generalized linear classifier with maximum-margin fitting functions. This fitting function provides regularization, which helps the classifier generalize better; the classifier tends to ignore many of the features. Conventional statistical and neural network methods control model complexity by using a small number of features (the problem dimensionality or the number of hidden units). An SVM controls the model complexity by controlling the VC dimension of its model. This method is independent of dimensionality and can utilize spaces of very large dimension, which permits constructing a very large number of non-linear features and then performing adaptive feature selection during training. By shifting all non-linearity to the features, the SVM can use a linear model for which the VC dimension is known. For example, a support vector machine can be used as a regularized radial basis function classifier.
Figure 10. The support vector machine process.
These characteristics have made SVMs very popular and<br />
successful. In the parlance of SVM literature, a predictor<br />
variable is called an attribute, and a transformed attribute that<br />
is <strong>used</strong> to def<strong>in</strong>e the hyper plane is called a feature. The task<br />
of choos<strong>in</strong>g the most suitable representation is known as<br />
feature selection. A set of features that describes one case (i.e.,<br />
a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. Figure 10 presents an overview of the SVM process.
12.1.1 SVM formulation:<br />
Given a set of separable data, the goal is to find the optimal decision function. It can easily be seen that there is an infinite number of optimal solutions to this problem, in the sense that infinitely many functions can separate the training samples with zero errors. Since the function must also generalize to unseen samples, an additional criterion is used to find the best solution among those with zero errors. If the probability densities of the classes were known, we could apply the maximum a posteriori (MAP) criterion to find the optimal solution. In most practical cases this information is not available, so a simpler criterion is adopted: among those functions without training errors, we choose the one with the maximum margin, the margin being the distance between the closest sample and the decision boundary defined by that function. Of course, optimality in the sense of maximum margin does not necessarily imply optimality in the sense of minimizing the number of errors in test, but it is a simple criterion that yields solutions which, in practice, turn out to be the best ones for many problems [64].

As can be inferred from Figure 11, the nonlinear discriminant function f(x_i) can be written as:

f(x_i) = w^T Φ(x_i) + b   ...(14a)

where Φ(·) is a nonlinear function which maps the vector x_i into what is called a feature space of higher (possibly infinite) dimensionality, where the classes are assumed to be linearly separable. The vector w represents the separating hyperplane in such a space. It is worth noting that the meaning of feature space here has nothing to do with the space of the speech features, which within the kernel-methods nomenclature belongs to the input space. On the other hand, r_x denotes the distance between the transformed sample and the separating hyperplane, and ||w|| is the Euclidean norm of w. We call support vectors those samples closest to the decision boundary; these vectors define the margin and are the only samples needed to find the solution. Hence, the goal of finding the optimum classifier is achieved by minimizing ||w||^2 / 2 with the restriction of all samples being correctly classified, i.e.:

y_i (w^T Φ(x_i) + b) ≥ 1 for every training sample x_i   ...(15)

This can be formulated as a problem of quadratic optimization. In order to get a classifier with a better generalization ability and capable of handling the non-separable case, we should allow a number of misclassified data. This is accomplished by introducing a penalty term in the function to be minimized:

minimize (1/2)||w||^2 + C Σ_i ξ_i, subject to y_i (w^T Φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0   ...(16)

Figure 11. Soft margin decision boundary.

Here the x_i are the training vectors corresponding to the labels y_i, and the variables ξ_i are called slack variables; they allow a certain amount of errors and make solutions possible in the non-separable case. A sample verifies 0 < ξ_i ≤ 1 when it is well classified but inside the margin, and ξ_i > 1 when it is wrongly classified. The C term, on the other hand, expresses the trade-off between the number of training errors and the generalization capability. This problem is usually solved by introducing the restrictions into the function to be optimized using Lagrange multipliers, leading to the maximization of the Wolfe dual:

maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j Φ(x_i)^T Φ(x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0   ...(17)

This problem is quadratic and convex, so its convergence to a global minimum is guaranteed using quadratic programming (QP) schemes. The resulting decision boundary w will be given by:

w = Σ_i α_i y_i Φ(x_i)   ...(18)

According to (18), only vectors with an associated α_i ≠ 0 will contribute to determining the weight vector w and, therefore, the separating boundary. These are the support vectors that, as mentioned before, define the separation border and the margin. Generally, the function Φ(·) is not explicitly known (in fact, in most cases its evaluation would be impossible, as the feature-space dimensionality can be infinite). The solution only needs to evaluate the dot products Φ(x_i)^T Φ(x_j), which, by using what has been called the kernel trick, can be evaluated using a kernel function K(x_i, x_j). Many of the SVM implementations compute this function for every pair of input samples, producing a kernel matrix that is stored in memory. By using this method and replacing w in equation (14a) by the expression in (18), the form that an SVM finally adopts is the following:

f(x) = Σ_i α_i y_i K(x_i, x) + b   ...(19)

The most widely used kernel functions are:
• the simple linear kernel, K(x_i, x_j) = x_i^T x_j   ...(20)
• the radial basis function (RBF) kernel, K(x_i, x_j) = exp(−γ ||x_i − x_j||^2)   ...(21), where γ is proportional to the inverse of the variance of the Gaussian function and whose associated feature space is of infinite dimensionality;
• the polynomial kernel, K(x_i, x_j) = (x_i^T x_j + 1)^p   ...(22), whose associated feature space consists of polynomials up to degree p; and
• the sigmoid kernel, K(x_i, x_j) = tanh(γ x_i^T x_j + c)   ...(23)
It is worth mentioning that there are some conditions that a function must satisfy in order to be used as a kernel. These are often denominated KKT (Karush-Kuhn-Tucker) conditions [65] and can be reduced to checking that the kernel matrix is symmetric and positive semi-definite.
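As a concrete illustration of the formulation above, the following sketch trains a soft-margin SVM with an RBF kernel using scikit-learn, an assumed dependency (the survey itself does not reference any particular toolkit); C and gamma correspond to the trade-off term C and the RBF parameter γ discussed above, and the data are illustrative.

import numpy as np
from sklearn.svm import SVC

# toy two-class data standing in for fixed-length feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
               rng.normal(+1.0, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # soft margin + RBF kernel, as in eq. (21)
clf.fit(X, y)

print(len(clf.support_))                    # number of support vectors (alpha_i != 0)
print(clf.predict([[0.2, -0.3]]))           # sign of the decision function of eq. (19)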
12.2. Advantages and Disadvantages of SVM:<br />
a. Advantages:<br />
1) Follows l<strong>in</strong>ear discrim<strong>in</strong>ants <strong>in</strong> its learn<strong>in</strong>g criterion.<br />
2) It m<strong>in</strong>imizes the number of misclassifications <strong>in</strong> any<br />
possible set of samples and this is known as Risk<br />
M<strong>in</strong>imization (RM).<br />
3) It m<strong>in</strong>imizes the number of misclassifications with<strong>in</strong> the<br />
tra<strong>in</strong><strong>in</strong>g set and this is known as Empirical Risk M<strong>in</strong>imization<br />
(ERM).<br />
4) They have a unique solution and their convergence is guaranteed (the solution is found by minimizing a convex function). This is an advantage compared to other classifiers, such as ANNs, which often fall into local minima or do not converge to a stable solution.
5) Since only the kernel matrix is involved in the minimization process, SVMs can deal with input vectors of very high dimensionality, as long as the corresponding kernels can be calculated; they can deal with vectors of thousands of dimensions.
6) The <strong>in</strong>put vectors of an SVM with the formulation must<br />
have a fixed size.<br />
7) The important advantage of SVM is that it offers a<br />
possibility to tra<strong>in</strong> generalizable, nonl<strong>in</strong>ear classifiers <strong>in</strong> high<br />
dimensional spaces us<strong>in</strong>g a small tra<strong>in</strong><strong>in</strong>g set.<br />
8) SVMs generalization error is not related to the <strong>in</strong>put<br />
dimensionality of the problem but to the marg<strong>in</strong> with which it<br />
separates the data. That is why SVMs can have good<br />
performance even with a large number of <strong>in</strong>puts.<br />
b. Disadvantages:<br />
1) Most implementations of SVM algorithm require<br />
comput<strong>in</strong>g and stor<strong>in</strong>g <strong>in</strong> memory the complete kernel matrix<br />
of all the input samples. This task has a space complexity of O(n^2), and is one of the main problems of these algorithms that prevents their application to very large speech databases.
2) The optimality of the solution found can depend on the<br />
kernel that has been <strong>used</strong>, and there is no method to know a<br />
priori which will be the best kernel for a concrete task.<br />
3) The best value for the parameter C is unknown a priori.<br />
12.3. Applications:<br />
1) SVM <strong>in</strong> speech and speaker recognition<br />
2) SVM <strong>in</strong> f<strong>in</strong>ancial applications<br />
3) SVM <strong>in</strong> computational biology<br />
4) SVM <strong>in</strong> bio<strong>in</strong>formatics/biological applications<br />
5) SVM <strong>in</strong> text classification<br />
6) SVM <strong>in</strong> chemistry<br />
13. K-Nearest Neighbor Method:<br />
A more general version of the nearest neighbor technique [66] bases the classification of an unknown sample on the votes of its k nearest neighbors rather than on only its single nearest neighbor. The k-nearest neighbor classification procedure is denoted k-NN. If the costs of error are equal for each class, the estimated class of an unknown sample is chosen to be the class that is most commonly represented in the collection of its k nearest neighbors.
13.1. <strong>Classification</strong> concept of KNN:<br />
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of
<strong>in</strong>stance-based learn<strong>in</strong>g, or lazy learn<strong>in</strong>g where the function is<br />
only approximated locally and all computation is deferred<br />
until classification. The k-nearest neighbor algorithm is<br />
amongst the simplest of all mach<strong>in</strong>e learn<strong>in</strong>g algorithms: an<br />
object is classified by a majority vote of its neighbors, with<br />
the object be<strong>in</strong>g assigned to the class most common amongst<br />
its k nearest neighbors (k is a positive <strong>in</strong>teger, typically small).<br />
If k = 1, then the object is simply assigned to the class of its<br />
nearest neighbor.<br />
<strong>Classification</strong> (generalization) us<strong>in</strong>g an <strong>in</strong>stance-based<br />
classifier can be a simple matter of locat<strong>in</strong>g the nearest<br />
neighbor <strong>in</strong> <strong>in</strong>stance space and label<strong>in</strong>g the unknown <strong>in</strong>stance<br />
with the same class label as that of the located (known)<br />
neighbor. This approach is often referred to as a nearest<br />
neighbor classifier. More robust models can be achieved by<br />
locat<strong>in</strong>g k, where k > 1, neighbors and lett<strong>in</strong>g the majority<br />
vote decide the outcome of the class label<strong>in</strong>g. A higher value<br />
of k results <strong>in</strong> a smoother, less locally sensitive, function. The<br />
nearest neighbor classifier can be regarded as a special case<br />
of the more general k-nearest neighbors classifier, hereafter<br />
referred to as a k-NN classifier.<br />
The same method can be <strong>used</strong> for regression, by simply<br />
assign<strong>in</strong>g the property value for the object to be the average of<br />
the values of its k nearest neighbors. It can be useful to weight<br />
the contributions of the neighbors, so that the nearer neighbors<br />
contribute more to the average than the more distant ones. (A<br />
common weight<strong>in</strong>g scheme is to give each neighbor a weight<br />
of 1/d, where d is the distance to the neighbor. This scheme is<br />
a generalization of l<strong>in</strong>ear <strong>in</strong>terpolation.). The neighbors are<br />
taken from a set of objects for which the correct classification<br />
(or, <strong>in</strong> the case of regression, the value of the property) is<br />
known. This can be thought of as the tra<strong>in</strong><strong>in</strong>g set for the<br />
algorithm, though no explicit tra<strong>in</strong><strong>in</strong>g step is required. Nearest<br />
neighbor rules <strong>in</strong> effect compute the decision boundary <strong>in</strong> an<br />
implicit manner. It is also possible to compute the decision<br />
boundary itself explicitly, and to do so <strong>in</strong> an efficient manner<br />
so that the computational complexity is a function of the<br />
boundary complexity.<br />
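A minimal sketch of the k-NN decision rule described above, assuming NumPy, Euclidean distance, and majority voting, with the optional 1/d weighting mentioned earlier in this section; the toy training set is illustrative only.

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3, weighted=False):
    """Label x by a (possibly distance-weighted) vote of its k nearest neighbors."""
    d = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(d)[:k]
    if not weighted:
        return Counter(train_y[nearest]).most_common(1)[0][0]
    votes = {}
    for i in nearest:                        # weight each neighbor by 1/d
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (d[i] + 1e-12)
    return max(votes, key=votes.get)

# toy usage with two classes
train_X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
train_y = np.array([0, 0, 0, 1, 1, 1])
label = knn_classify(np.array([0.5, 0.5]), train_X, train_y, k=3)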
13.1.1. Assumptions <strong>in</strong> KNN:<br />
Before us<strong>in</strong>g KNN, some of the assumptions <strong>in</strong> KNN are to be<br />
considered.<br />
• KNN assumes that the data is <strong>in</strong> a feature space.<br />
More exactly, the data po<strong>in</strong>ts are <strong>in</strong> a metric space.<br />
The data can be scalars or possibly even<br />
multidimensional vectors. S<strong>in</strong>ce the po<strong>in</strong>ts are <strong>in</strong><br />
feature space, they have a notion of distance – This<br />
need not necessarily be Euclidean distance although<br />
it is the one commonly <strong>used</strong>.<br />
• Each of the training data consists of a set of vectors and a class label associated with each vector. In the simplest case, the label will be either + or − (for positive or negative classes), but KNN can work equally well with an arbitrary number of classes.
• A single number "k" is also given. This number decides how many neighbors (where neighbors are defined based on the distance metric) influence the classification. This is usually an odd number if the number of classes is 2. If k = 1, then the algorithm is simply called the nearest neighbor algorithm.
13.1.2. Parameter selection <strong>in</strong> KNN:<br />
• The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbor algorithm.
• Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling [4]. Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
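One way to pick k empirically, sketched below with scikit-learn's cross-validation utilities (an assumed dependency, not something the survey prescribes); only odd candidate values of k are tried, as suggested above for two-class problems, and the data are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def best_k(X, y, candidates=(1, 3, 5, 7, 9), folds=5):
    """Return the odd k with the highest mean cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=folds).mean()
              for k in candidates}
    return max(scores, key=scores.get), scores

# toy usage with random two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(2, 1, (40, 4))])
y = np.array([0] * 40 + [1] * 40)
k, scores = best_k(X, y)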
13.1.3. Properties <strong>in</strong> KNN:<br />
• The naive version of the algorithm is easy to<br />
implement by comput<strong>in</strong>g the distances from the test<br />
sample to all stored vectors, but it is computationally<br />
<strong>in</strong>tensive, especially when the size of the tra<strong>in</strong><strong>in</strong>g set<br />
grows. Many nearest neighbor search algorithms<br />
have been proposed over the years; these generally<br />
seek to reduce the number of distance evaluations<br />
actually performed. Us<strong>in</strong>g an appropriate nearest<br />
neighbor search algorithm makes k-NN<br />
computationally tractable even for large data sets.<br />
• The nearest neighbor algorithm has some strong<br />
consistency results. As the amount of data<br />
approaches <strong>in</strong>f<strong>in</strong>ity, the algorithm is guaranteed to<br />
yield an error rate no worse than twice the Bayes<br />
error rate (the m<strong>in</strong>imum achievable error rate given<br />
the distribution of the data). k-nearest neighbor is<br />
guaranteed to approach the Bayes error rate, for some<br />
value of k (where k <strong>in</strong>creases as a function of the<br />
number of data po<strong>in</strong>ts). Various improvements to k-<br />
nearest neighbor methods are possible by us<strong>in</strong>g<br />
proximity graphs.<br />
13.1.4. KNN for Density Estimation:
Although classification remains the primary application of KNN, it can also be used for density estimation. Since KNN is non-parametric, it can estimate arbitrary distributions. The idea is very similar to the use of a Parzen window. Instead of using a hypercube of fixed size and a kernel function, it does the estimation as follows: to estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. The density is then estimated using the formula

p(x) \approx \frac{k}{nV}   …(24)

where n is the total number of samples and V is the volume of the hypercube. Notice that the numerator is essentially a constant and the density is governed by the volume. The intuition is this: suppose the density at x is very high. Then we can find k points near x very quickly, and these points are also very close to x (by the definition of high density). This means the volume of the hypercube is small and the resulting density estimate is high. Suppose instead that the density around x is very low. Then the volume of the hypercube needed to encompass the k nearest neighbors is large and, consequently, the ratio is low. The volume plays a role similar to the bandwidth parameter in kernel density estimation.
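The following is a minimal Python/NumPy sketch of the k-NN density estimate of Equation (24). The use of the Chebyshev (max-coordinate) distance to realise the growing hypercube, and the toy Gaussian data, are assumptions made for the example.

import numpy as np

def knn_density(data, x, k):
    """k-NN density estimate at point x: grow a hypercube centred at x
    until it captures k samples, then return p(x) ~= k / (n * V)."""
    n, d = data.shape
    # A hypercube of half-side r centred at x contains exactly the points
    # whose Chebyshev (max-coordinate) distance to x is <= r.
    cheb = np.max(np.abs(data - x), axis=1)
    r = np.sort(cheb)[k - 1]          # half-side needed to capture k points
    volume = (2.0 * r) ** d           # hypercube volume
    return k / (n * volume)

# Toy usage: the estimate should be higher near the cluster at the origin
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(1000, 2))
print(knn_density(data, np.array([0.0, 0.0]), k=10))
print(knn_density(data, np.array([3.0, 3.0]), k=10))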
13.2. Some Basic Observations regard<strong>in</strong>g K-NN:<br />
1. If the points are d-dimensional, then the straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2. KNN can be analyzed <strong>in</strong> two ways – One way is that KNN<br />
tries to estimate the posterior probability of the po<strong>in</strong>t to be<br />
labeled (and apply Bayesian decision theory based on the<br />
posterior probability). An alternate way is that KNN<br />
calculates the decision surface (either implicitly or explicitly)<br />
and then uses it to decide on the class of the new po<strong>in</strong>ts.<br />
3. There are many possible ways to apply weights in KNN; one popular example is Shepard's method (inverse-distance weighting).
4. Even though the naive method takes O(dn) time per query, it is very hard to do better unless further assumptions are made. There are efficient data structures such as the KD-tree which can reduce the query time, but they do so at the cost of increased training time and implementation complexity (see the sketch after this list).
5. In KNN, k is usually chosen as an odd number if the<br />
number of classes is 2.<br />
6. The choice of k is very critical: a small value of k means that noise will have a higher influence on the result, while a large value makes it computationally expensive and defeats the basic philosophy behind KNN (that points that are near are likely to have similar densities or classes). A simple heuristic is to set k ≈ √n, where n is the number of training samples.
7. There are some interesting data structures and algorithms when we apply KNN on graphs, such as the Euclidean minimum spanning tree and the nearest neighbor graph.
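As a brief illustration of the KD-tree speed-up mentioned in observation 4 above, the sketch below uses SciPy's cKDTree (an assumption about the available library; any nearest-neighbor index would serve) to answer k-NN queries without scanning all n stored points for every query.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))        # stored training vectors
tree = cKDTree(points)                  # build once: the "training" cost

query = rng.random((5, 3))
dist, idx = tree.query(query, k=5)      # 5 nearest neighbors per query point
print(idx.shape)                        # (5, 5): neighbor indices into `points`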
13.3. Advantages and Disadvantages of K-NN:<br />
a)Advantages:<br />
1) Its high degree of local sensitivity lets the classifier adapt closely to the local structure of the training data.
2) It follows a non-parametric architecture
3) It is a simple and powerful algorithm
4) KNN is one of the common methods used to estimate the bandwidth (e.g. in adaptive mean shift)
b) Disadvantages:<br />
1) The downside of this simple approach is the lack of<br />
robustness that characterizes the result<strong>in</strong>g classifiers.<br />
2) It is Memory <strong>in</strong>tensive,<br />
3) Its classification/estimation is slow<br />
4) For large training sets, it requires a large amount of memory and is slow when making a prediction
5) It needs a similarity measure and attributes that "match" the target function
6) The k-nearest neighbor algorithm is sensitive to the local<br />
structure of the data.<br />
7) Computational complexity grows with the complexity of the decision boundary.
8) Lack of generalization means that KNN keeps all the<br />
tra<strong>in</strong><strong>in</strong>g data.<br />
9) The accuracy of the k-NN algorithm can be severely<br />
degraded by the presence of noisy or irrelevant features, or if<br />
the feature scales are not consistent with their importance.<br />
10) Prediction accuracy can quickly degrade as the number of attributes grows.
The drawback of <strong>in</strong>creas<strong>in</strong>g the value of k is of course that as<br />
k approaches n, where n is the size of the <strong>in</strong>stance base, the<br />
performance of the classifier will approach that of the most<br />
straightforward statistical basel<strong>in</strong>e, the assumption that all<br />
unknown <strong>in</strong>stances belong to the class most frequently<br />
represented <strong>in</strong> the tra<strong>in</strong><strong>in</strong>g data.<br />
13.4. Applications:<br />
The nearest neighbor search problem arises <strong>in</strong> numerous fields<br />
of application, <strong>in</strong>clud<strong>in</strong>g:<br />
• Pattern recognition - <strong>in</strong> particular for optical<br />
character recognition<br />
• Statistical classification- see k-nearest neighbor<br />
algorithm<br />
• Computer vision<br />
• Databases - e.g. content-based image retrieval<br />
• Cod<strong>in</strong>g theory - see maximum likelihood decod<strong>in</strong>g<br />
• Data compression - see MPEG-2 standard<br />
• Recommendation systems<br />
• Internet market<strong>in</strong>g - see contextual advertis<strong>in</strong>g and<br />
behavioral target<strong>in</strong>g<br />
• DNA sequenc<strong>in</strong>g<br />
• Spell check<strong>in</strong>g - suggest<strong>in</strong>g correct spell<strong>in</strong>g<br />
• Plagiarism detection<br />
• Contact search<strong>in</strong>g algorithms <strong>in</strong> FEA<br />
• Similarity scores for predict<strong>in</strong>g career paths of<br />
professional athletes.<br />
• Cluster analysis - assignment of a set of observations<br />
<strong>in</strong>to subsets (called clusters) so that observations <strong>in</strong><br />
the same cluster are similar <strong>in</strong> some sense, usually<br />
based on Euclidean distance<br />
• Gene Expression<br />
• Prote<strong>in</strong>-Prote<strong>in</strong> <strong>in</strong>teraction and 3D structure<br />
prediction<br />
• Nearest Neighbor based Content Retrieval<br />
14. Gaussian Mixture Model (GMM):<br />
Gaussian Mixture Models (GMMs) are among the most<br />
statistically mature methods for cluster<strong>in</strong>g (though they are<br />
also <strong>used</strong> <strong>in</strong>tensively for density estimation).<br />
14.1.Introduction:<br />
A Gaussian Mixture Model (GMM) is a parametric<br />
probability density function represented as a weighted sum of<br />
Gaussian component densities. GMMs are commonly <strong>used</strong> as<br />
a parametric model of the probability distribution of<br />
cont<strong>in</strong>uous measurements or features <strong>in</strong> a biometric system,<br />
such as vocal-tract related spectral features <strong>in</strong> a speaker<br />
recognition system. GMM parameters are estimated from<br />
tra<strong>in</strong><strong>in</strong>g data us<strong>in</strong>g the iterative Expectation-Maximization<br />
(EM) algorithm or Maximum A Posteriori (MAP) estimation<br />
from a well-tra<strong>in</strong>ed prior model.<br />
A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation

p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i)   …(25)

where x is a D-dimensional continuous-valued data vector (i.e. measurement or features), w_i, i = 1, . . . , M, are the mixture weights, and g(x|\mu_i, \Sigma_i), i = 1, . . . , M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form

g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)' \, \Sigma_i^{-1} (x - \mu_i) \right\}   …(26)

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint that \sum_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights from all component densities. These parameters are collectively represented by the notation

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, . . . , M.   …(27)
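To make Equations (25)–(27) concrete, here is a small Python/NumPy sketch that evaluates a diagonal-covariance GMM density at a point; the parameter values are arbitrary placeholders rather than estimates from any data set.

import numpy as np

def gmm_density(x, weights, means, variances):
    """Evaluate p(x | lambda) = sum_i w_i g(x | mu_i, Sigma_i) for a
    diagonal-covariance GMM.  x: (D,), means/variances: (M, D), weights: (M,)."""
    D = x.shape[0]
    diff = x - means                                   # (M, D)
    norm = (2.0 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1))
    expo = np.exp(-0.5 * np.sum(diff ** 2 / variances, axis=1))
    return float(np.sum(weights * expo / norm))

# Toy 2-component model in D = 2 dimensions (placeholder parameters)
weights = np.array([0.6, 0.4])                         # must sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gmm_density(np.array([0.0, 0.0]), weights, means, variances))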
There are several variants on the GMM shown in Equation (25). The covariance matrices, \Sigma_i, can be full rank or constrained to be diagonal. Additionally, parameters can be
shared, or tied, among the Gaussian components, such as<br />
hav<strong>in</strong>g a common covariance matrix for all components. The<br />
choice of model configuration (number of components, full or<br />
diagonal covariance matrices, and parameter ty<strong>in</strong>g) is often<br />
determ<strong>in</strong>ed by the amount of data available for estimat<strong>in</strong>g the<br />
GMM parameters and how the GMM is <strong>used</strong> <strong>in</strong> a particular<br />
biometric application. It is also important to note that, because the component Gaussians act together to model the overall feature density, full covariance matrices are not necessary even if the features are not statistically independent.
The l<strong>in</strong>ear comb<strong>in</strong>ation of diagonal covariance basis<br />
Gaussians is capable of model<strong>in</strong>g the correlations between<br />
feature vector elements. The effect of us<strong>in</strong>g a set of M full<br />
covariance matrix Gaussians can be equally obta<strong>in</strong>ed by us<strong>in</strong>g<br />
a larger set of diagonal covariance Gaussians. GMMs are<br />
often <strong>used</strong> <strong>in</strong> biometric systems, most notably <strong>in</strong> speaker<br />
recognition systems, due to their capability of represent<strong>in</strong>g a<br />
large class of sample distributions. One of the powerful<br />
attributes of the GMM is its ability to form smooth<br />
approximations to arbitrarily shaped densities. The classical uni-modal Gaussian model represents a feature distribution by a position (mean vector) and an elliptic shape (covariance matrix), while a vector quantizer (VQ) or nearest neighbor
model represents a distribution by a discrete set of<br />
characteristic templates [67]. A GMM acts as a hybrid<br />
between these two models by us<strong>in</strong>g a discrete set of Gaussian<br />
functions, each with their own mean and covariance matrix, to<br />
allow a better modeling capability. Figure 12 compares the densities obtained using a uni-modal Gaussian model, a GMM and a VQ model.
Figure 12. Comparison of distribution modeling: (a) histogram of a single cepstral coefficient from a 25 second utterance by a male speaker; (b) maximum likelihood uni-modal Gaussian model; (c) GMM and its 10 underlying component densities; (d) histogram of the data assigned to the VQ centroid locations of a 10-element codebook.
Figure 12 shows the histogram of a s<strong>in</strong>gle feature from a<br />
speaker recognition system (a s<strong>in</strong>gle cepstral value from a 25<br />
second utterance by a male speaker); plot (b) shows a unimodal<br />
Gaussian model of this feature distribution; plot (c)<br />
shows a GMM and its ten underly<strong>in</strong>g component densities;<br />
and plot (d) shows a histogram of the data assigned to the VQ<br />
centroid locations of a 10 element codebook. The GMM not<br />
only provides a smooth overall distribution fit, its components<br />
also clearly detail the multi-modal nature of the density.<br />
The use of a GMM for represent<strong>in</strong>g feature distributions <strong>in</strong> a<br />
biometric system may also be motivated by the <strong>in</strong>tuitive<br />
notion that the <strong>in</strong>dividual component densities may model<br />
some underly<strong>in</strong>g set of hidden classes. For example, <strong>in</strong><br />
speaker recognition, it is reasonable to assume the acoustic<br />
space of spectral related features correspond<strong>in</strong>g to a speaker’s<br />
broad phonetic events, such as vowels, nasals or fricatives.<br />
These acoustic classes reflect some general speaker dependent<br />
vocal tract configurations that are useful for characteriz<strong>in</strong>g<br />
speaker identity. The spectral shape of the i th acoustic class<br />
can <strong>in</strong> turn be represented by the mean µ i of the i th component<br />
density, and variations of the average spectral shape can be<br />
represented by the covariance matrix \Sigma_i. Because all the
features <strong>used</strong> to tra<strong>in</strong> the GMM are unlabeled, the acoustic<br />
classes are hidden <strong>in</strong> that the class of an observation is<br />
unknown. A GMM can also be viewed as a s<strong>in</strong>gle-state HMM<br />
with a Gaussian mixture observation density, or an ergodic<br />
Gaussian observation HMM with fixed, equal transition<br />
probabilities. Assum<strong>in</strong>g <strong>in</strong>dependent feature vectors, the<br />
observation density of feature vectors drawn from these<br />
hidden acoustic classes is a Gaussian mixture [68, 69].<br />
14.2. Maximum Likelihood Parameter Estimation:
Given training vectors and a GMM configuration, one can estimate the parameters of the GMM, \lambda, which in some sense best match the distribution of the training feature vectors. There are several techniques available for estimating the parameters of a GMM [70]. The most popular and well-established method is maximum likelihood (ML) estimation. The aim of ML estimation is to find the model parameters which maximize the likelihood of the GMM given the training data. For a sequence of T training vectors X = {x_1, . . . , x_T}, the GMM likelihood, assuming independence between the vectors, can be written as

p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)   …(28)

Unfortunately, this expression is a non-linear function of the parameters \lambda and direct maximization is not possible.
However, ML parameter estimates can be obta<strong>in</strong>ed iteratively<br />
us<strong>in</strong>g a special case of the expectation-maximization (EM)<br />
algorithm [71]. The basic idea of the EM algorithm is,<br />
beg<strong>in</strong>n<strong>in</strong>g with an <strong>in</strong>itial model λ, to estimate a new model λ¯,<br />
such that p(X| λ¯) ≥ p(X| λ). The new model then becomes the<br />
<strong>in</strong>itial model for the next iteration and the process is repeated<br />
until some convergence threshold is reached. The <strong>in</strong>itial<br />
model is typically derived by us<strong>in</strong>g some form of b<strong>in</strong>ary VQ<br />
estimation. On each EM iteration, the follow<strong>in</strong>g re-estimation<br />
formulas are <strong>used</strong>, which guarantees a monotonic <strong>in</strong>crease <strong>in</strong><br />
the model’s likelihood value,<br />
The a posteriori probability for component i is given by<br />
…(29)<br />
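The following sketch implements one possible reading of the EM re-estimation loop above for a diagonal-covariance GMM in Python/NumPy. The random initialization (in place of the binary VQ initialization the text mentions), the variance floor and the toy data are assumptions made for the example, so this is illustrative rather than a production recipe.

import numpy as np

def em_gmm_diag(X, M, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM with M components to X (T x D) by EM."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(M, 1.0 / M)                          # mixture weights
    mu = X[rng.choice(T, M, replace=False)].copy()   # random initial means
    var = np.tile(X.var(axis=0), (M, 1))             # shared initial variances

    for _ in range(n_iter):
        # E-step: posterior Pr(i | x_t, lambda), as in Equation (29)
        diff = X[:, None, :] - mu[None, :, :]                      # (T, M, D)
        logg = (-0.5 * np.sum(diff ** 2 / var, axis=2)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))   # (T, M)
        logp = np.log(w) + logg
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)                    # (T, M)

        # M-step: re-estimation of weights, means and variances
        n_i = post.sum(axis=0)                                     # (M,)
        w = n_i / T
        mu = (post.T @ X) / n_i[:, None]
        var = (post.T @ (X ** 2)) / n_i[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)                                # floor variances
    return w, mu, var

# Toy usage with two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
w, mu, var = em_gmm_diag(X, M=2)
print(w, mu, sep="\n")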
14.3. Maximum A Posteriori (MAP) Parameter Estimation<br />
In addition to estimat<strong>in</strong>g GMM parameters via the EM<br />
algorithm, the parameters may also be estimated us<strong>in</strong>g<br />
Maximum A Posteriori (MAP) estimation. MAP estimation is<br />
<strong>used</strong>, for example, <strong>in</strong> speaker recognition applications to<br />
derive a speaker model by adapting from a universal
background model (UBM) [72] as shown <strong>in</strong> fig.13. It is also<br />
<strong>used</strong> <strong>in</strong> other pattern recognition tasks where limited labeled<br />
tra<strong>in</strong><strong>in</strong>g data is <strong>used</strong> to adapt a prior, general model. Like the<br />
EM algorithm, the MAP estimation is a two step estimation<br />
process. The first step is identical to the "Expectation" step of the EM algorithm, where estimates of the sufficient statistics of the training data are computed for each mixture in the prior model. Unlike the second step of the EM algorithm, for adaptation these "new" sufficient statistic estimates are then combined with the "old" sufficient statistics from the prior mixture parameters using a data-dependent mixing coefficient. The data-dependent mixing coefficient is designed so that mixtures with high counts of new data rely more on the new sufficient statistics for final parameter estimation, and mixtures with low counts of new data rely more on the old sufficient statistics for final parameter estimation.

The specifics of the adaptation are as follows. Given a prior model and training vectors from the desired class, X = {x_1, . . . , x_T}, we first determine the probabilistic alignment of the training vectors into the prior mixture components (Figure 13(a)). That is, for mixture i in the prior model, we compute \Pr(i \mid x_t, \lambda_{prior}) as in Equation (29). We then compute the sufficient statistics for the weight, mean and variance parameters:

n_i = \sum_{t=1}^{T} \Pr(i \mid x_t, \lambda_{prior}), \quad
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t, \lambda_{prior}) \, x_t, \quad
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t, \lambda_{prior}) \, x_t^2   …(30)

This is the same as the "Expectation" step in the EM algorithm.

Figure 13. Pictorial example of the two steps in adapting a hypothesized speaker model. (a) The training vectors (x's) are probabilistically mapped into the UBM (prior) mixtures. (b) The adapted mixture parameters are derived using the statistics of the new data and the UBM (prior) mixture parameters. The adaptation is data dependent, so UBM (prior) mixture parameters are adapted by different amounts.

Lastly, these new sufficient statistics from the training data are used to update the prior sufficient statistics for mixture i to create the adapted parameters for mixture i (Figure 13(b)) with the equations:

\hat{w}_i = \left[ \alpha_i^{w} n_i / T + (1 - \alpha_i^{w}) w_i \right] \gamma, \quad
\hat{\mu}_i = \alpha_i^{m} E_i(x) + (1 - \alpha_i^{m}) \mu_i, \quad
\hat{\sigma}_i^2 = \alpha_i^{v} E_i(x^2) + (1 - \alpha_i^{v})(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2   …(31)

The adaptation coefficients controlling the balance between old and new estimates are \alpha_i^{w}, \alpha_i^{m} and \alpha_i^{v} for the weights, means and variances, respectively. The scale factor, \gamma, is computed over all adapted mixture weights to ensure they sum to unity. Note that the sufficient statistics, not the derived parameters such as the variance, are being adapted. For each mixture and each parameter, a data-dependent adaptation coefficient

\alpha_i^{\rho} = \frac{n_i}{n_i + r^{\rho}}, \quad \rho \in \{w, m, v\}   …(32)

is used in the above equations, where r^{\rho} is a fixed "relevance" factor for parameter \rho. It is common in speaker recognition applications to use one adaptation coefficient for all parameters and further to adapt only certain GMM parameters, such as only the mean vectors. Using a data-dependent adaptation coefficient allows mixture-dependent adaptation of parameters. If a mixture component has a low probabilistic count, n_i, of new data, then \alpha_i^{\rho} \to 0, causing de-emphasis of the new (potentially under-trained) parameters and emphasis of the old (better-trained) parameters. For mixture components with high probabilistic counts, \alpha_i^{\rho} \to 1, causing the use of the new class-dependent parameters. The relevance factor is a way of controlling how much new data should be observed in a mixture before the new parameters begin replacing the old parameters. This approach should thus be robust to limited training data.
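As an illustrative sketch of the adaptation just described, the following Python/NumPy function performs mean-only MAP adaptation of a diagonal-covariance prior GMM, following the relevance-factor form of Equations (30)–(32); the relevance factor value, the toy data and the function names are assumptions made for the example.

import numpy as np

def map_adapt_means(X, w, mu, var, relevance=16.0):
    """Adapt only the mean vectors of a prior (UBM-like) diagonal GMM to new
    data X (T x D), using the coefficient alpha_i = n_i / (n_i + r)."""
    # Posterior alignment Pr(i | x_t, prior), as in Equation (29)
    diff = X[:, None, :] - mu[None, :, :]
    logg = (-0.5 * np.sum(diff ** 2 / var, axis=2)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))
    logp = np.log(w) + logg
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)

    # Sufficient statistics (Equation (30)) and adaptation coefficient (32)
    n_i = post.sum(axis=0)                                  # probabilistic counts
    E_x = (post.T @ X) / np.maximum(n_i, 1e-10)[:, None]    # first-order stats
    alpha = n_i / (n_i + relevance)                         # alpha_i in [0, 1)

    # Adapted means (mean line of Equation (31)); weights/variances kept fixed
    return alpha[:, None] * E_x + (1.0 - alpha[:, None]) * mu

# Toy usage: adapt a 2-component prior toward data centred away from it
rng = np.random.default_rng(0)
X_new = rng.normal(1.0, 1.0, (100, 2))
w0 = np.array([0.5, 0.5])
mu0 = np.zeros((2, 2))
var0 = np.ones((2, 2))
print(map_adapt_means(X_new, w0, mu0, var0))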
14.4. Advantages and Disadvantages of GMM:<br />
a. Advantages:<br />
1) Less time consum<strong>in</strong>g when applied to a large set of data.<br />
2) It is text <strong>in</strong>dependent<br />
3) It is easy to implement<br />
4) It follows a probabilistic framework (robust)
5) It is computationally efficient.<br />
b. Disadvantages:<br />
1) Its ability to track time-evolving patterns is limited.
2) It cannot exclude exponential functions.<br />
14.5. Applications:<br />
1) Used <strong>in</strong> Speaker identification<br />
2) Used <strong>in</strong> Image segmentation<br />
3) Used <strong>in</strong> model<strong>in</strong>g video sequences<br />
4) Used <strong>in</strong> Musical Instrument Identification <strong>in</strong> Polyphonic<br />
Music<br />
5) Used <strong>in</strong> Extraction of melodic l<strong>in</strong>es from audio record<strong>in</strong>gs<br />
6) Used <strong>in</strong> Speaker verification/speaker identification<br />
15. Unsupervised classification Method:<br />
In unsupervised classification, the goal is harder because there<br />
are no pre-determ<strong>in</strong>ed categorizations. There are actually two<br />
approaches to unsupervised learn<strong>in</strong>g. The first approach is to<br />
teach the agent not by giv<strong>in</strong>g explicit categorizations, but by<br />
us<strong>in</strong>g some sort of reward system to <strong>in</strong>dicate success. This<br />
type of tra<strong>in</strong><strong>in</strong>g will generally fit <strong>in</strong>to the decision problem<br />
framework because the goal is not to produce a classification<br />
but to make decisions that maximize rewards. This approach<br />
nicely generalizes to the real world, where agents might be<br />
rewarded for do<strong>in</strong>g certa<strong>in</strong> actions.<br />
A second approach of unsupervised learn<strong>in</strong>g is called<br />
cluster<strong>in</strong>g. In this type of learn<strong>in</strong>g, the goal is not to maximize<br />
a utility function, but simply to f<strong>in</strong>d similarities <strong>in</strong> the tra<strong>in</strong><strong>in</strong>g<br />
data. The assumption is often that the clusters discovered will<br />
match reasonably well with an <strong>in</strong>tuitive classification. This<br />
method is commonly <strong>used</strong> <strong>in</strong> most of the applications<br />
especially <strong>in</strong> speech recognition applications. Hence this<br />
method is discussed <strong>in</strong> detail.<br />
In other terms, unsupervised learning is defined as the learning method where the computer does not get any feedback or guidance while learning; no guidelines are provided. It means that, unlike supervised learning, patterns are not labeled or classified beforehand.
15.2. Advantages and Disadvantages of Unsupervised<br />
classification:<br />
a)Advantages:<br />
1) There is no need to provide either classification rules or sample documents as a training set.
2) Unsupervised classification techniques are used when we
do not have a clear idea of rules or classifications. One<br />
possible scenario is to use unsupervised classification to<br />
provide an <strong>in</strong>itial set of categories, and to subsequently build<br />
on these through supervised classification.<br />
b)Disadvantages:<br />
1) Cluster<strong>in</strong>g might result <strong>in</strong> unexpected group<strong>in</strong>gs, s<strong>in</strong>ce the<br />
cluster<strong>in</strong>g operation is not user-def<strong>in</strong>ed, but based on an<br />
<strong>in</strong>ternal algorithm.<br />
2) Rules that create the clusters are not seen.<br />
3) The cluster<strong>in</strong>g operation is CPU <strong>in</strong>tensive and can take at<br />
least the same time as <strong>in</strong>dex<strong>in</strong>g.<br />
4) Suffers from over fitt<strong>in</strong>g<br />
15.1 Introduction to Cluster<strong>in</strong>g<br />
Cluster<strong>in</strong>g is the unsupervised classification of patterns<br />
(observations, data items, or feature vectors) <strong>in</strong>to groups<br />
(clusters). The cluster<strong>in</strong>g problem has been addressed <strong>in</strong> many<br />
contexts and by researchers <strong>in</strong> many discipl<strong>in</strong>es; this reflects<br />
its broad appeal and usefulness as one of the steps <strong>in</strong><br />
exploratory data analysis. However, clustering is a difficult combinatorial problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur.
In mach<strong>in</strong>e learn<strong>in</strong>g, unsupervised learn<strong>in</strong>g is a class of<br />
problems <strong>in</strong> which one seeks to determ<strong>in</strong>e how the data are<br />
organized. Many methods employed here are based on data<br />
m<strong>in</strong><strong>in</strong>g methods <strong>used</strong> to preprocess data. It is dist<strong>in</strong>guished<br />
from supervised learn<strong>in</strong>g (and re<strong>in</strong>forcement learn<strong>in</strong>g) <strong>in</strong> that<br />
the learner is given only unlabeled examples. Unsupervised<br />
learn<strong>in</strong>g is closely related to the problem of density estimation<br />
<strong>in</strong> statistics. However unsupervised learn<strong>in</strong>g also encompasses<br />
many other techniques that seek to summarize and expla<strong>in</strong> key<br />
features of the data.One form of unsupervised learn<strong>in</strong>g is<br />
cluster<strong>in</strong>g. Another example is bl<strong>in</strong>d source separation based<br />
on Independent Component Analysis (ICA).<br />
There are two broad classes of classification procedures: supervised classification and unsupervised classification. The supervised
classification is the essential tool <strong>used</strong> for extract<strong>in</strong>g<br />
quantitative <strong>in</strong>formation from remotely sensed image data<br />
[Richards, 1993, p85]. Us<strong>in</strong>g this method, the analyst has<br />
available sufficient known pixels to generate representative<br />
parameters for each class of <strong>in</strong>terest. This step is called<br />
tra<strong>in</strong><strong>in</strong>g. Once tra<strong>in</strong>ed, the classifier is then <strong>used</strong> to attach<br />
labels to all the image pixels accord<strong>in</strong>g to the tra<strong>in</strong>ed<br />
parameters. The most commonly <strong>used</strong> supervised<br />
classification is maximum likelihood classification (MLC),<br />
which assumes that each spectral class can be described by a<br />
multivariate normal distribution. Therefore, MLC takes
advantage of both the mean vectors and the multivariate<br />
spreads of each class, and can identify those elongated classes.<br />
However, the effectiveness of maximum likelihood<br />
classification depends on reasonably accurate estimation of<br />
the mean vector m and the covariance matrix for each spectral<br />
class data [Richards, 1993, p189]. What's more, it assumes that the classes are unimodally distributed in multivariate space. When the classes are multimodally distributed, we cannot get accurate results. The other broad class of classification is unsupervised classification. It doesn't require the analyst to have foreknowledge of the classes, and it mainly uses some clustering algorithm to classify image data [Richards, 1993,
p85]. These procedures can be <strong>used</strong> to determ<strong>in</strong>e the number<br />
and location of the uni-modal spectral classes. One of the<br />
most commonly <strong>used</strong> unsupervised classifications is the<br />
migrat<strong>in</strong>g means cluster<strong>in</strong>g classifier (MMC). This method is<br />
based on label<strong>in</strong>g each pixel to unknown cluster centers and<br />
then mov<strong>in</strong>g from one cluster center to another <strong>in</strong> a way that<br />
the SSE measure of the preceding section is reduced [Richards, 1993, p231].
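As a sketch of the migrating-means idea (essentially k-means: labels are assigned to the nearest centre, then each centre migrates to the mean of its members so that the SSE is reduced), the following Python/NumPy example uses synthetic data; the initialization and iteration count are assumptions for illustration.

import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Migrating-means / k-means: assign each sample to its nearest centre,
    then move each centre to the mean of its members, reducing the SSE."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: label each sample with the nearest centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: migrate each centre to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
centres, labels = kmeans(X, k=2)
print(centres)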
15.1.1.Data cluster<strong>in</strong>g:<br />
Data analysis underlies many comput<strong>in</strong>g applications, either<br />
<strong>in</strong> a design phase or as part of their on-l<strong>in</strong>e operations. Data<br />
analysis procedures can be dichotomized as either exploratory<br />
or confirmatory, based on the availability of appropriate<br />
models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements
based on either (i) goodness-of-fit to a postulated model, or (ii)<br />
natural group<strong>in</strong>gs (cluster<strong>in</strong>g) revealed through analysis.<br />
Cluster analysis is the organization of a collection of patterns<br />
(usually represented as a vector of measurements, or a po<strong>in</strong>t <strong>in</strong><br />
a multidimensional space) <strong>in</strong>to clusters based on similarity.<br />
Intuitively, patterns with<strong>in</strong> a valid cluster are more similar to<br />
each other than they are to a pattern belong<strong>in</strong>g to a different<br />
cluster. An example of cluster<strong>in</strong>g is depicted <strong>in</strong> Figure 14. The<br />
<strong>in</strong>put patterns are shown <strong>in</strong> Figure 14(a), and the desired<br />
clusters are shown <strong>in</strong> Figure 14 (b). Here, po<strong>in</strong>ts belong<strong>in</strong>g to<br />
the same cluster are given the same label. The variety of<br />
techniques for represent<strong>in</strong>g data, measur<strong>in</strong>g proximity<br />
(similarity) between data elements, and group<strong>in</strong>g data<br />
elements has produced a rich and often confus<strong>in</strong>g assortment<br />
of cluster<strong>in</strong>g methods.<br />
Figure 14. Data cluster<strong>in</strong>g<br />
It is important to understand the difference between cluster<strong>in</strong>g<br />
(unsupervised classification) and discrim<strong>in</strong>ant analysis<br />
(supervised classification). In supervised classification, we are<br />
provided with a collection of labeled (preclassified) patterns;<br />
the problem is to label a newly encountered, yet unlabeled,<br />
pattern. Typically, the given labeled (tra<strong>in</strong><strong>in</strong>g) patterns are<br />
<strong>used</strong> to learn the descriptions of classes which <strong>in</strong> turn are <strong>used</strong><br />
to label a new pattern. In the case of cluster<strong>in</strong>g, the problem is<br />
to group a given collection of unlabeled patterns <strong>in</strong>to<br />
mean<strong>in</strong>gful clusters. In a sense, labels are associated with<br />
clusters also, but these category labels are data driven; that is,<br />
they are obta<strong>in</strong>ed solely from the data. Cluster<strong>in</strong>g is useful <strong>in</strong><br />
several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data
m<strong>in</strong><strong>in</strong>g, document retrieval, image segmentation, and pattern<br />
classification. However, <strong>in</strong> many such problems, there is little<br />
prior <strong>in</strong>formation (e.g., statistical models) available about the<br />
data, and the decision-maker must make as few assumptions<br />
about the data as possible. It is under these restrictions that<br />
cluster<strong>in</strong>g methodology is particularly appropriate for the<br />
exploration of <strong>in</strong>terrelationships among the data po<strong>in</strong>ts to<br />
make an assessment (perhaps prelim<strong>in</strong>ary) of their structure.<br />
The term “cluster<strong>in</strong>g” is <strong>used</strong> <strong>in</strong> several research communities<br />
to describe methods for group<strong>in</strong>g of unlabeled data.<br />
These communities have different term<strong>in</strong>ologies and<br />
assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a
dilemma regard<strong>in</strong>g the scope of this survey. The production of<br />
a truly comprehensive survey would be a monumental task<br />
given the sheer mass of literature <strong>in</strong> this area. The<br />
accessibility of the survey might also be questionable given<br />
the need to reconcile very different vocabularies and<br />
assumptions regard<strong>in</strong>g cluster<strong>in</strong>g <strong>in</strong> the various communities.<br />
The goal of this paper is to survey the core concepts and<br />
techniques <strong>in</strong> the large subset of cluster analysis with its roots<br />
<strong>in</strong> statistics and decision theory. Where appropriate,<br />
references will be made to key concepts and techniques<br />
aris<strong>in</strong>g from cluster<strong>in</strong>g methodology <strong>in</strong> the mach<strong>in</strong>e-learn<strong>in</strong>g<br />
and other communities. The audience for this paper <strong>in</strong>cludes<br />
practitioners <strong>in</strong> the pattern recognition and image analysis<br />
communities (who should view it as a summarization of<br />
current practice), practitioners <strong>in</strong> the mach<strong>in</strong>e-learn<strong>in</strong>g<br />
communities (who should view it as a snapshot of a closely<br />
related field with a rich history of well understood techniques),<br />
and the broader audience of scientific professionals (who<br />
should view it as an accessible <strong>in</strong>troduction to a mature field<br />
that is mak<strong>in</strong>g important contributions to comput<strong>in</strong>g<br />
application areas).<br />
15.1.2 Components of a Cluster<strong>in</strong>g Task<br />
Typical pattern cluster<strong>in</strong>g activity <strong>in</strong>volves the follow<strong>in</strong>g steps<br />
[Ja<strong>in</strong> and Dubes 1988]:<br />
(1) Pattern representation (optionally <strong>in</strong>clud<strong>in</strong>g feature<br />
extraction and/or selection),<br />
(2) Def<strong>in</strong>ition of a pattern proximity measure appropriate to<br />
the data doma<strong>in</strong>,<br />
(3) Cluster<strong>in</strong>g or group<strong>in</strong>g,<br />
(4) Data abstraction (if needed), and<br />
(5) Assessment of output (if needed).<br />
Figure 15 depicts a typical sequenc<strong>in</strong>g of the first three of<br />
these steps, <strong>in</strong>clud<strong>in</strong>g a feedback path where the group<strong>in</strong>g<br />
process output could affect subsequent feature extraction and<br />
similarity computations. Pattern representation refers to the<br />
number of classes, the number of available patterns, and the<br />
number, type, and scale of the features available to the<br />
cluster<strong>in</strong>g algorithm. Some of this <strong>in</strong>formation may not be<br />
controllable by the practitioner.<br />
Figure 15 Stages <strong>in</strong> Cluster<strong>in</strong>g<br />
15.1.3. Advantages and Disadvantages of cluster<strong>in</strong>g:<br />
a)Advantages:<br />
1. High performance<br />
2. Large capacity<br />
3. High availability<br />
4. Incremental growth<br />
b) Disadvantages:<br />
1. Complexity<br />
2. Inability to recover from database corruption<br />
15.1.4. Applications of Cluster<strong>in</strong>g:<br />
1. Cluster<strong>in</strong>g <strong>in</strong> the design of neural networks<br />
2. Information Retrieval
3. Data mining
4. Speech and speaker recognition
16. SIMILARITY MEASURES<br />
S<strong>in</strong>ce similarity is fundamental to the def<strong>in</strong>ition of a cluster, a<br />
measure of the similarity between two patterns drawn from<br />
the same feature space is essential to most cluster<strong>in</strong>g<br />
procedures. Because of the variety of feature types and scales,<br />
the distance measure (or measures) must be chosen carefully.<br />
It is most common to calculate the dissimilarity between two<br />
patterns us<strong>in</strong>g a distance measure def<strong>in</strong>ed on the feature space.<br />
We will focus on the well-known distance measures <strong>used</strong> for<br />
patterns whose features are all continuous. The most popular metric for continuous features is the Euclidean distance

d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \| x_i - x_j \|_2   …(33)

which is a special case (p = 2) of the Minkowski metric

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} | x_{i,k} - x_{j,k} |^p \right)^{1/p} = \| x_i - x_j \|_p   …(34)
The Euclidean distance has an <strong>in</strong>tuitive appeal as it is<br />
commonly <strong>used</strong> to evaluate the proximity of objects <strong>in</strong> two or<br />
three-dimensional space. It works well when a data set has<br />
“compact” or “isolated” clusters [Mao and Ja<strong>in</strong> 1996]. The<br />
drawback to direct use of the M<strong>in</strong>kowski metrics is the<br />
tendency of the largest-scaled feature to dom<strong>in</strong>ate the others.<br />
Solutions to this problem <strong>in</strong>clude normalization of the<br />
cont<strong>in</strong>uous features (to a common range or variance) or other<br />
weight<strong>in</strong>g schemes. L<strong>in</strong>ear correlation among features can<br />
also distort distance measures; this distortion can be alleviated<br />
by apply<strong>in</strong>g a whiten<strong>in</strong>g transformation to the data or by us<strong>in</strong>g<br />
the squared Mahalanobis distance

d_M(x_i, x_j) = (x_i - x_j) \, \Sigma^{-1} (x_i - x_j)^{T}   …(35)

where the patterns x_i and x_j are assumed to be row vectors,
and ∑ is the sample covariance matrix of the patterns or the<br />
known covariance matrix of the pattern generation process;<br />
d M (. , .). assigns different weights to different features based<br />
on their variances and pair wise l<strong>in</strong>ear correlations. Here, it is<br />
implicitly assumed that class conditional densities are<br />
unimodal and characterized by multidimensional spread, i.e.,<br />
that the densities are multivariate Gaussian. The regularized<br />
Mahalanobis distance was <strong>used</strong> <strong>in</strong> Mao and Ja<strong>in</strong> [1996] to<br />
extract hyper ellipsoidal clusters. Recently, several researchers<br />
[Huttenlocher et al. 1993; Dubuisson and Ja<strong>in</strong> 1994] have<br />
<strong>used</strong> the Hausdorff distance <strong>in</strong> a po<strong>in</strong>t set match<strong>in</strong>g context.<br />
Some cluster<strong>in</strong>g algorithms work on a matrix of proximity<br />
values <strong>in</strong>stead of on the orig<strong>in</strong>al pattern set. It is useful <strong>in</strong> such<br />
situations to pre-compute all the n(n-1)/2 pair wise distance<br />
values for the n patterns and store them <strong>in</strong> a (symmetric)<br />
matrix. Computation of distances between patterns with some<br />
or all features be<strong>in</strong>g non cont<strong>in</strong>uous is problematic, s<strong>in</strong>ce the<br />
different types of features are not comparable and (as an<br />
extreme example) the notion of proximity is effectively<br />
b<strong>in</strong>ary- valued for nom<strong>in</strong>al-scaled features. Nonetheless,<br />
practitioners (especially those <strong>in</strong> mach<strong>in</strong>e learn<strong>in</strong>g, where<br />
mixed-type patterns are common) have developed proximity<br />
measures for heterogeneous type patterns. A recent example is<br />
Wilson and Mart<strong>in</strong>ez [1997], which proposes a comb<strong>in</strong>ation<br />
of a modified M<strong>in</strong>kowski metric for cont<strong>in</strong>uous features and a<br />
distance based on counts (population) for nom<strong>in</strong>al attributes.<br />
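For the continuous-feature distances introduced above (Equations (33)–(35)), the following Python/NumPy sketch computes Euclidean, Minkowski and squared Mahalanobis distances; the example patterns and the sample covariance are arbitrary illustrations.

import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance (Equation (34)); p = 2 gives the Euclidean distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def sq_mahalanobis(x, y, cov):
    """Squared Mahalanobis distance (Equation (35)) with covariance matrix cov."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

# Arbitrary example patterns and a sample covariance from a small data set
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 5.5], [4.0, 8.0]])
cov = np.cov(X, rowvar=False)
a, b = X[0], X[3]
print(minkowski(a, b, p=2))        # Euclidean
print(minkowski(a, b, p=1))        # city-block
print(sq_mahalanobis(a, b, cov))   # accounts for feature scales/correlation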
A variety of other metrics have been reported <strong>in</strong> Diday and<br />
Simon [1976] and Ich<strong>in</strong>o and Yaguchi [1994] for comput<strong>in</strong>g<br />
the similarity between patterns represented us<strong>in</strong>g quantitative<br />
as well as qualitative features. Patterns can also be represented<br />
us<strong>in</strong>g str<strong>in</strong>g or tree structures [Knuth 1973]. Str<strong>in</strong>gs are <strong>used</strong><br />
<strong>in</strong> syntactic cluster<strong>in</strong>g [Fu and Lu 1977]. Several measures of<br />
similarity between str<strong>in</strong>gs are described <strong>in</strong> Baeza-Yates<br />
[1992]. A good summary of similarity measures between trees<br />
is given by Zhang [1995]. A comparison of syntactic and<br />
statistical approaches for pattern recognition us<strong>in</strong>g several<br />
criteria was presented <strong>in</strong> Tanaka [1995] and the conclusion<br />
was that syntactic methods are <strong>in</strong>ferior <strong>in</strong> every aspect.<br />
Therefore, we do not consider syntactic methods further <strong>in</strong><br />
this paper. There are some distance measures reported <strong>in</strong> the<br />
literature [Gowda and Krishna 1977; Jarvis and Patrick 1973]<br />
that take <strong>in</strong>to account the effect of surround<strong>in</strong>g or neighbor<strong>in</strong>g<br />
po<strong>in</strong>ts. These surround<strong>in</strong>g po<strong>in</strong>ts are called context <strong>in</strong><br />
Michalski and Stepp [1983]. The similarity between two<br />
po<strong>in</strong>ts xi and xj, given this context, is given by<br />
s(x_i, x_j) = f(x_i, x_j, \mathcal{E})   …(36)

where \mathcal{E} is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)   …(37)

where NN(x_i, x_j) is the neighbor number of x_j with respect to x_i. Figures 16 and 17 give an example. In Figure 16, the nearest neighbor of A is B, and B's nearest neighbor is A. So NN(A, B) = NN(B, A) = 1 and the MND between A and B is 2. However, NN(B, C) = 1 but NN(C, B) = 2, and therefore MND(B, C) = 3. Figure 17 was obtained from Figure 16 by adding three new points D, E, and F. Now MND(B, C) = 3 (as before), but MND(A, B) = 5. The MND between A and B has increased by
<strong>in</strong>troduc<strong>in</strong>g additional po<strong>in</strong>ts, even though A and B have not<br />
moved. The MND is not a metric (it does not satisfy the<br />
triangle <strong>in</strong>equality [Zhang 1995]). In spite of this, MND has<br />
been successfully applied <strong>in</strong> several cluster<strong>in</strong>g applications<br />
[Gowda and Diday 1992]. This observation supports the<br />
viewpo<strong>in</strong>t that the dissimilarity does not need to be a metric.<br />
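The mutual neighbor distance of Equation (37) can be computed directly from neighbor ranks; the short Python/NumPy sketch below does so for a few points placed on a line purely for illustration.

import numpy as np

def neighbor_number(points, i, j):
    """NN(x_i, x_j): the rank of x_j among the neighbors of x_i (1 = nearest)."""
    d = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(d)                     # order[0] is x_i itself (distance 0)
    return int(np.where(order == j)[0][0])    # position in the ordering = rank

def mnd(points, i, j):
    """Mutual neighbor distance, Equation (37): NN(x_i, x_j) + NN(x_j, x_i)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])  # A, B, C on a line
print(mnd(pts, 0, 1))   # A and B are each other's nearest neighbor -> 2
print(mnd(pts, 1, 2))   # NN(B, C) = 2 and NN(C, B) = 1 -> 3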
Watanabe’s theorem of the ugly duckl<strong>in</strong>g [Watanabe 1985]<br />
states: “Insofar as we use a f<strong>in</strong>ite set of predicates that are<br />
capable of dist<strong>in</strong>guish<strong>in</strong>g any two objects considered, the<br />
number of predicates shared by any two such objects is<br />
constant, <strong>in</strong>dependent of the choice of objects.” This implies<br />
that it is possible to make any two arbitrary patterns equally<br />
similar by encod<strong>in</strong>g them with a sufficiently large number of<br />
features. As a consequence, any two arbitrary patterns are<br />
equally similar, unless we use some additional doma<strong>in</strong><br />
<strong>in</strong>formation. For example, <strong>in</strong> the case of conceptual cluster<strong>in</strong>g<br />
[Michalski and Stepp 1983], the similarity between x i and x j<br />
is def<strong>in</strong>ed as<br />
s(x_i, x_j) = f(x_i, x_j, \mathcal{C}, \mathcal{E})   …(38)

where \mathcal{C} is a set of pre-defined concepts. This notion is
illustrated with the help of Figure 18. Here, the Euclidean<br />
distance between po<strong>in</strong>ts A and B is less than that between B<br />
and C. However, B and C can be viewed as “more similar”<br />
than A and B because B and C belong to the same concept<br />
(ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure.

Figure 18. Conceptual similarities between points
17. CLUSTERING TECHNIQUES:<br />
Different approaches to cluster<strong>in</strong>g data can be described with<br />
the help of the hierarchy shown <strong>in</strong> Figure 19 (other<br />
taxonometric representations of cluster<strong>in</strong>g methodology are<br />
possible; ours is based on the discussion <strong>in</strong> Ja<strong>in</strong> and Dubes<br />
[1988]). At the top level, there is a dist<strong>in</strong>ction between<br />
hierarchical and partitional approaches (hierarchical methods<br />
produce a nested series of partitions, while partitional methods<br />
produce only one). The taxonomy shown <strong>in</strong> Figure 19 must be<br />
supplemented by a discussion of cross-cutt<strong>in</strong>g issues that may<br />
(<strong>in</strong> pr<strong>in</strong>ciple) affect all of the different approaches regardless<br />
of their placement <strong>in</strong> the taxonomy.<br />
—Agglomerative vs. divisive: This aspect relates to<br />
algorithmic structure and operation. An agglomerative<br />
approach beg<strong>in</strong>s with each pattern <strong>in</strong> a dist<strong>in</strong>ct (s<strong>in</strong>gleton)<br />
cluster, and successively merges clusters together until a<br />
stopp<strong>in</strong>g criterion is satisfied. A divisive method beg<strong>in</strong>s with<br />
all patterns <strong>in</strong> a s<strong>in</strong>gle cluster and performs splitt<strong>in</strong>g until a<br />
stopp<strong>in</strong>g criterion is met.<br />
—Monothetic vs. polythetic: This aspect relates to the<br />
sequential or simultaneous use of features <strong>in</strong> the cluster<strong>in</strong>g<br />
process. Most algorithms are polythetic; that is, all features<br />
enter <strong>in</strong>to the computation of distances between patterns, and<br />
decisions are based on those distances. A simple monothetic<br />
algorithm reported <strong>in</strong> Anderberg [1973] considers features<br />
sequentially to divide the given collection of patterns. This is<br />
illustrated <strong>in</strong> Figure 20. Here, the collection is divided <strong>in</strong>to<br />
two groups us<strong>in</strong>g feature x1; the vertical broken l<strong>in</strong>e V is the<br />
separat<strong>in</strong>g l<strong>in</strong>e. Each of these clusters is further divided<br />
<strong>in</strong>dependently us<strong>in</strong>g feature x2, as depicted by the broken<br />
l<strong>in</strong>es H1 and H2. The major problem with this algorithm is<br />
that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in
<strong>in</strong>formation retrieval applications [Salton 1991]), the number<br />
of clusters generated by this algorithm is so large that the data<br />
set is divided <strong>in</strong>to un<strong>in</strong>terest<strong>in</strong>gly small and fragmented<br />
clusters.<br />
—Hard vs. fuzzy: A hard cluster<strong>in</strong>g algorithm allocates each<br />
pattern to a s<strong>in</strong>gle cluster dur<strong>in</strong>g its operation and <strong>in</strong> its output.<br />
A fuzzy cluster<strong>in</strong>g method assigns degrees of membership <strong>in</strong><br />
several clusters to each <strong>in</strong>put pattern. A fuzzy cluster<strong>in</strong>g can<br />
be converted to a hard cluster<strong>in</strong>g by assign<strong>in</strong>g each pattern to<br />
the cluster with the largest measure of membership.<br />
—Determ<strong>in</strong>istic vs. stochastic: This issue is most relevant to<br />
partitional approaches designed to optimize a squared error<br />
function. This optimization can be accomplished us<strong>in</strong>g<br />
traditional techniques or through a random search of the state<br />
space consist<strong>in</strong>g of all possible label<strong>in</strong>gs.<br />
—Incremental vs. non-<strong>in</strong>cremental: This issue arises when<br />
the pattern set to be clustered is large, and constra<strong>in</strong>ts on<br />
execution time or memory space affect the architecture of the<br />
algorithm. The early history of cluster<strong>in</strong>g methodology does<br />
not conta<strong>in</strong> many examples of cluster<strong>in</strong>g algorithms designed<br />
to work with large data sets, but the advent of data m<strong>in</strong><strong>in</strong>g has<br />
fostered the development of cluster<strong>in</strong>g algorithms that<br />
m<strong>in</strong>imize the number of scans through the pattern set, reduce<br />
the number of patterns exam<strong>in</strong>ed dur<strong>in</strong>g execution, or reduce<br />
the size of data structures <strong>used</strong> <strong>in</strong> the algorithm’s operations.<br />
A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.
Figure 19. A taxonomy of clustering approaches

Figure 20. Monothetic partitional clustering

17.1. Hierarchical Clustering Algorithms:
The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 20. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 20 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 21. The dendrogram can be broken at different levels to yield different clusterings of the data.

Figure 20. Points falling in three clusters

Figure 21. The dendrogram obtained using the single-link algorithm

Figure 21. Two concentric clusters

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum
In either case, two clusters are merged to form a larger cluster based on a minimum-distance criterion. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]: it has a tendency to produce clusters that are straggly or elongated. Figures 22 and 23 show two clusters separated by a "bridge" of noisy patterns. The single-link algorithm produces the clusters shown in Figure 22, whereas the complete-link algorithm obtains the clustering shown in Figure 23. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled "*". On the other hand, the single-link algorithm is more versatile than the complete-link algorithm. For example, the single-link algorithm can extract the concentric clusters shown in Figure 21, but the complete-link algorithm cannot. From a pragmatic viewpoint, however, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].
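To make the two linkage definitions concrete, the following minimal Python sketch (the helper names and toy points are ours, not the paper's) computes both inter-cluster distances: single-link takes the minimum pairwise distance between the clusters, complete-link the maximum.

from math import dist  # Euclidean distance (Python 3.8+)

def single_link_distance(cluster_a, cluster_b):
    # minimum distance over all pairs (one pattern from each cluster)
    return min(dist(a, b) for a in cluster_a for b in cluster_b)

def complete_link_distance(cluster_a, cluster_b):
    # maximum distance over all pairs of patterns from the two clusters
    return max(dist(a, b) for a in cluster_a for b in cluster_b)

if __name__ == "__main__":
    c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    c2 = [(4.0, 4.0), (5.0, 4.0)]
    print(single_link_distance(c1, c2))    # distance of the closest pair
    print(complete_link_distance(c1, c2))  # distance of the farthest pair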
17.2. Agglomerative Single-Link Clustering Algorithm:
(1) Place each pattern in its own cluster. Construct a list of inter-pattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.
(2) Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.
(3) The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a partition (clustering) identified by the simply connected components of the corresponding graph.
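The steps above can be sketched directly in Python. The sketch below is our own simplification, not the paper's code: it replaces the explicit threshold graphs with a union-find structure and records the partition produced each time a new dissimilarity level connects previously separate components.

from math import dist
from itertools import combinations

def single_link_levels(points):
    parent = list(range(len(points)))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # step (1): sorted list of inter-pattern distances
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    levels = []  # (dissimilarity, clusters) pairs of the nested hierarchy
    for d, i, j in edges:          # step (2): connect pairs closer than d
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            groups = {}
            for k in range(len(points)):
                groups.setdefault(find(k), []).append(k)
            levels.append((d, list(groups.values())))
        if len({find(k) for k in range(len(points))}) == 1:
            break                  # all patterns are in one connected graph
    return levels                  # step (3): cut at any recorded level

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    for d, clusters in single_link_levels(pts):
        print(round(d, 2), clusters)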
17.3. Agglomerative Complete-Link Clustering Algorithm:
(1) Place each pattern in its own cluster. Construct a list of inter-pattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.
(2) Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.
(3) The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a partition (clustering) identified by the completely connected components of the corresponding graph.
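In practice the same nested hierarchies are usually obtained from an off-the-shelf routine. The sketch below assumes SciPy is available (SciPy is our assumption; the paper does not name a library): linkage builds the hierarchy and fcluster cuts it at a chosen dissimilarity level, as in step (3) of both algorithms above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

patterns = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]], dtype=float)

# 'complete' uses the maximum pairwise distance between clusters;
# 'single' would give the single-link hierarchy instead.
Z = linkage(patterns, method="complete")

# cut the nested hierarchy at dissimilarity level 3.0 to form a partition
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)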
Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.
17.4. Hierarchical Agglomerative Clustering Algorithm:
(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.
(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.
(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.
Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
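A minimal sketch of the generic agglomerative procedure is given below; it is our illustration, not the paper's code. Inter-cluster proximities are recomputed from the pattern-level proximity matrix at every iteration (a transparent but inefficient variant of the update in step 2), and choosing min or max as the combining rule yields single-link or complete-link behaviour.

from math import dist, inf

def agglomerate(points, num_clusters, linkage="single"):
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]            # step (1)
    prox = [[dist(points[i], points[j]) for j in range(len(points))]
            for i in range(len(points))]
    while len(clusters) > num_clusters:                     # step (3)
        # step (2): find and merge the most similar pair of clusters
        best = (inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(prox[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [[points[i] for i in c] for c in clusters]

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    print(agglomerate(pts, 3, linkage="complete"))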
17.5. Partitional Algorithms:
A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters; a seminal paper [Dubes 1987] provides guidance on this key design decision. Partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.
17.5.1. Squared Error Algorithms:
The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering L of a pattern set X (containing K clusters) is
e^2(X, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2     .....(39)
where x_i^{(j)} is the ith pattern belonging to the jth cluster, n_j is the number of patterns in the jth cluster, and c_j is the centroid of the jth cluster. The k-means is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial
partition and keeps reassign<strong>in</strong>g the patterns to clusters based<br />
on the similarity between the pattern and the cluster centers<br />
until a convergence criterion is met (e.g., there is no<br />
reassignment of any pattern from one cluster to another, or the<br />
squared error ceases to decrease significantly after some<br />
number of iterations). The k-means algorithm is popular<br />
because it is easy to implement, and its time complexity is<br />
O(n), where n is the number of patterns. A major problem<br />
with this algorithm is that it is sensitive to the selection of the<br />
<strong>in</strong>itial partition and may converge to a local m<strong>in</strong>imum of the<br />
criterion function value if the <strong>in</strong>itial partition is not properly<br />
chosen. Figure 24 shows seven two-dimensional patterns. If<br />
we start with patterns A, B, and C as the <strong>in</strong>itial means around<br />
which the three clusters are built, then we end up with the<br />
partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The<br />
squared error criterion value is much larger for this partition<br />
than for the best partition {{A, B, C}, {D, E}, {F, G}} shown<br />
by rectangles, which yields the global m<strong>in</strong>imum value of the<br />
squared error criterion function for a cluster<strong>in</strong>g conta<strong>in</strong><strong>in</strong>g<br />
three clusters. The correct three-cluster solution is obta<strong>in</strong>ed by<br />
choos<strong>in</strong>g, for example, A, D, and F as the <strong>in</strong>itial cluster means.<br />
17.5.2.Squared Error Cluster<strong>in</strong>g Method:<br />
(1) Select an <strong>in</strong>itial partition of the patterns with a fixed<br />
number of clusters and cluster centers. (2) Assign each pattern<br />
to its closest cluster center and compute the new cluster<br />
centers as the centroids of the clusters. Repeat this step until<br />
convergence is achieved, i.e., until the cluster membership is<br />
stable. (3) Merge and split clusters based on some heuristic<br />
<strong>in</strong>formation, optionally repeat<strong>in</strong>g step 2.<br />
17.5.3. k-Means Clustering Algorithm:
(1) Choose k cluster centers to coincide with k randomly chosen patterns or k randomly defined points inside the hyper-volume containing the pattern set.
(2) Assign each pattern to the closest cluster center.
(3) Recompute the cluster centers using the current cluster memberships.
(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
Figure 24 The k-means algorithm is sensitive to the <strong>in</strong>itial<br />
partitions.<br />
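The four k-means steps can be written down in a few lines. The following plain-Python sketch (the toy data and the random seeding are our assumptions) stops when the assignments no longer change; different seeds can lead to different local minima, which is exactly the sensitivity Figure 24 illustrates.

import random
from math import dist

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                       # step (1)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centers[j]))
                          for p in points]                # step (2)
        if new_assignment == assignment:                  # step (4): converged
            break
        assignment = new_assignment
        for j in range(k):                                # step (3)
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    centers, labels = k_means(pts, 3)
    print(centers, labels)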
Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value. Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA algorithm [Ball and Hall 1965] employs this technique of merging and splitting clusters. If ISODATA is given the "ellipse" partitioning shown in Figure 24 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small, and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.
Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973]; Symon [1977] describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.
17.5.4. Graph-Theoretic Clustering:
The best-known graph-theoretic divisive clustering algorithm is based on constructing the minimal spanning tree (MST) of the data [Zahn 1971] and then deleting the MST edges with the largest lengths to generate clusters. Figure 25 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD, with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units. The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969], which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colourability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981]. A graph-oriented approach for non-
hierarchical structures and overlapping clusters is presented in Ozawa [1985]. The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbours. The DG contains all the neighbourhood information contained in the MST and the relative neighbourhood graph (RNG) [Toussaint 1980].
Figure 25 Using the minimal spanning tree to form clusters
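The MST-based divisive procedure can be sketched as follows (our illustration, using Prim's algorithm to build the tree): the longest MST edges are deleted and the surviving connected groups are reported as clusters.

from math import dist

def mst_edges(points):
    # Prim's algorithm: grow the tree from point 0
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        d, i, j = min((dist(points[a], points[b]), a, b)
                      for a in in_tree for b in range(n) if b not in in_tree)
        in_tree.add(j)
        edges.append((d, i, j))
    return edges

def mst_clusters(points, num_clusters):
    edges = sorted(mst_edges(points))            # keep the shortest MST edges,
    keep = edges[:len(points) - num_clusters]    # i.e. drop the longest ones
    groups = {i: {i} for i in range(len(points))}
    for _, i, j in keep:                         # merge points joined by kept edges
        merged = groups[i] | groups[j]
        for k in merged:
            groups[k] = merged
    return {frozenset(g) for g in groups.values()}

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    for cluster in mst_clusters(pts, 3):
        print(sorted(cluster))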
17.6. Mixture-Resolving and Mode-Seeking Algorithms:
The mixture resolv<strong>in</strong>g approach to cluster analysis has been<br />
addressed <strong>in</strong> a number of ways. The underly<strong>in</strong>g assumption is<br />
that the patterns to be clustered are drawn from one of several<br />
distributions, and the goal is to identify the parameters of each<br />
and (perhaps) their number. Most of the work <strong>in</strong> this area has<br />
assumed that the <strong>in</strong>dividual components of the mixture density<br />
are Gaussian, and <strong>in</strong> this case the parameters of the <strong>in</strong>dividual<br />
Gaussians are to be estimated by the procedure. Traditional<br />
approaches to this problem <strong>in</strong>volve obta<strong>in</strong><strong>in</strong>g (iteratively) a<br />
maximum likelihood estimate of the parameter vectors of the<br />
component densities [Ja<strong>in</strong> and Dubes 1988]. More recently,<br />
the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977]
for miss<strong>in</strong>g-data problems) has been applied to the problem of<br />
parameter estimation. A recent book [Mitchell 1997] provides<br />
an accessible description of the technique. In the EM<br />
framework, the parameters of the component densities are<br />
unknown, as are the mix<strong>in</strong>g parameters, and these are<br />
estimated from the patterns. The EM procedure beg<strong>in</strong>s with<br />
an <strong>in</strong>itial estimate of the parameter vector and iteratively<br />
rescores the patterns aga<strong>in</strong>st the mixture density produced by<br />
the parameter vector. The rescored patterns are then <strong>used</strong> to<br />
update the parameter estimates. In a cluster<strong>in</strong>g context, the<br />
scores of the patterns (which essentially measure their<br />
likelihood of be<strong>in</strong>g drawn from particular components of the<br />
mixture) can be viewed as h<strong>in</strong>ts at the class of the pattern.<br />
Those patterns, placed (by their scores) <strong>in</strong> a particular<br />
component, would therefore be viewed as belong<strong>in</strong>g to the<br />
same cluster. Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988].
Inspired by the Parzen w<strong>in</strong>dow approach to nonparametric<br />
density estimation, the correspond<strong>in</strong>g cluster<strong>in</strong>g procedure<br />
searches for b<strong>in</strong>s with large counts <strong>in</strong> a multidimensional<br />
histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.
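As a concrete illustration of the mixture-resolving approach, the sketch below runs EM for a one-dimensional Gaussian mixture (it assumes NumPy and synthetic data; the paper itself gives no implementation). The E-step rescores the patterns against the current mixture, and the M-step re-estimates the means, variances, and mixing weights from those scores.

import numpy as np

def em_gmm_1d(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each pattern
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the component parameters from the scores
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return means, variances, weights

if __name__ == "__main__":
    data = np.concatenate([np.random.normal(0, 1, 200),
                           np.random.normal(6, 1, 200)])
    print(em_gmm_1d(data, 2))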
17.7. Nearest Neighbour Clustering:
Since proximity plays a key role in our intuitive notion of a
cluster, nearest neighbour distances can serve as the basis of<br />
cluster<strong>in</strong>g procedures. An iterative procedure was proposed <strong>in</strong><br />
Lu and Fu [1978]; it assigns each unlabeled pattern to the<br />
cluster of its nearest labelled neighbour pattern, provided the<br />
distance to that labelled neighbour is below a threshold. The<br />
process continues until all patterns are labelled or no additional labelling occurs. The mutual neighbourhood value
(described earlier <strong>in</strong> the context of distance computation) can<br />
also be <strong>used</strong> to grow clusters from near neighbours.<br />
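A minimal sketch of this nearest-neighbour labelling procedure is given below; the seed labels, the threshold, and the helper names are our assumptions rather than details from Lu and Fu [1978].

from math import dist

def nearest_neighbour_clustering(points, seed_labels, threshold):
    labels = dict(seed_labels)   # e.g. {index: cluster_id} for a few labelled seeds
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            if i in labels:
                continue
            # nearest labelled neighbour of pattern i
            j = min(labels, key=lambda q: dist(p, points[q]), default=None)
            if j is not None and dist(p, points[j]) < threshold:
                labels[i] = labels[j]
                changed = True
    return labels   # patterns farther than the threshold stay unlabelled

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    print(nearest_neighbour_clustering(pts, {0: "A", 3: "B"}, threshold=2.0))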
17.8. Fuzzy Cluster<strong>in</strong>g:<br />
Traditional cluster<strong>in</strong>g approaches generate partitions; <strong>in</strong> a<br />
partition, each pattern belongs to one and only one cluster.<br />
Hence, the clusters <strong>in</strong> a hard cluster<strong>in</strong>g are disjo<strong>in</strong>t. Fuzzy<br />
cluster<strong>in</strong>g extends this notion to associate each pattern with<br />
every cluster us<strong>in</strong>g a membership function [Zadeh 1965]. The<br />
output of such algorithms is a cluster<strong>in</strong>g, but not a partition.<br />
We give a high-level partitional fuzzy cluster<strong>in</strong>g algorithm<br />
below.<br />
17.8.1. Fuzzy Clustering Algorithm:
(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N x K membership matrix U. An element u_{ik} of this matrix represents the grade of membership of object x_i in cluster c_k. Typically, u_{ik} is in [0,1].
(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is
E^2(X, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \| x_i - c_k \|^2     ...(40)
where c_k is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U.
(3) Repeat step 2 until entries in U do not change significantly.
In fuzzy clustering, each cluster is a fuzzy set of all the patterns.
Figure 26 illustrates the idea. The rectangles enclose two<br />
“hard” clusters <strong>in</strong> the data: H1 ={1,2,3,4,5} and H2={6,7,8,9}<br />
A fuzzy cluster<strong>in</strong>g algorithm might produce the two fuzzy<br />
clusters F1 and F2 depicted by ellipses. The patterns will have<br />
membership values in [0,1] for each cluster. For example, fuzzy cluster F1 could be compactly described as a set of ordered pairs (i, mu_i), one pair per pattern.
Figure 26 Fuzzy clusters
Figure 27 Representation of a cluster by points
The ordered pairs (i, mu_i) in each cluster give the ith pattern and its membership value mu_i in that cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership values.
Fuzzy set theory was <strong>in</strong>itially applied to cluster<strong>in</strong>g <strong>in</strong> Rusp<strong>in</strong>i<br />
[1969]. The book by Bezdek [1981] is a good source for<br />
material on fuzzy cluster<strong>in</strong>g. The most popular fuzzy<br />
cluster<strong>in</strong>g algorithm is the fuzzy c-means (FCM) algorithm.<br />
Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of
membership functions is the most important problem <strong>in</strong> fuzzy<br />
cluster<strong>in</strong>g; different choices <strong>in</strong>clude those based on similarity<br />
decomposition and centroids of clusters. A generalization of<br />
the FCM algorithm was proposed by Bezdek [1981] through a<br />
family of objective functions. A fuzzy c-shell algorithm and<br />
an adaptive variant for detect<strong>in</strong>g circular and elliptical<br />
boundaries was presented <strong>in</strong> Dave [1992].<br />
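For completeness, here is a minimal sketch of the fuzzy c-means iteration (assuming NumPy and the common fuzzifier m = 2; the paper does not prescribe an implementation): the cluster centers are computed from the memberships and the memberships from the distances to the centers, so every pattern keeps a membership in [0, 1] for every cluster.

import numpy as np

def fuzzy_c_means(X, k, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)          # initial fuzzy partition
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # membership update: inverse-distance ratios raised to 2/(m-1)
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return centers, U

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]], dtype=float)
    centers, U = fuzzy_c_means(X, 2)
    print(centers)
    print(np.round(U, 2))   # one membership value per pattern and cluster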
17.9. Representation of Clusters:<br />
In applications where the number of classes or clusters <strong>in</strong> a<br />
data set must be discovered, a partition of the data set is the<br />
end product. Here, a partition gives an idea about the<br />
separability of the data po<strong>in</strong>ts <strong>in</strong>to clusters and whether it is<br />
mean<strong>in</strong>gful to employ a supervised classifier that assumes a<br />
given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form
to achieve data abstraction. Even though the construction of a<br />
cluster representation is an important step <strong>in</strong> decision mak<strong>in</strong>g,<br />
it has not been exam<strong>in</strong>ed closely by researchers. The notion of<br />
cluster representation was <strong>in</strong>troduced <strong>in</strong> Duran and Odell<br />
[1974] and was subsequently studied <strong>in</strong> Diday and Simon<br />
[1976] and Michalski et al. [1981]. They suggested the<br />
follow<strong>in</strong>g representation schemes:<br />
(1) Represent a cluster of po<strong>in</strong>ts by their centroid or by a set<br />
of distant po<strong>in</strong>ts <strong>in</strong> the cluster. Figure 27 depicts these two<br />
ideas.<br />
(2) Represent clusters us<strong>in</strong>g nodes <strong>in</strong> a classification tree.<br />
(3) Represent clusters by using conjunctive logical expressions. For example, the conjunctive expression (X1 > 3) AND (X2 < 2) in Figure 28 stands for the logical statement "X1 is greater than 3" and "X2 is less than 2". Use of the centroid to represent a
cluster is the most popular scheme. It works well when the<br />
clusters are compact or isotropic. However, when the clusters<br />
are elongated or non-isotropic, then this scheme fails to<br />
represent them properly. In such a case, the use of a collection<br />
of boundary po<strong>in</strong>ts <strong>in</strong> a cluster captures its shape well. The<br />
number of po<strong>in</strong>ts <strong>used</strong> to represent a cluster should <strong>in</strong>crease as<br />
the complexity of its shape increases. The two different representations illustrated in Figure 28 are equivalent. Every
path <strong>in</strong> a classification tree from the root node to a leaf node<br />
corresponds to a conjunctive statement. An important<br />
limitation of the typical use of the simple conjunctive concept<br />
representations is that they can describe only rectangular or<br />
isotropic clusters <strong>in</strong> the feature space. Data abstraction is<br />
useful <strong>in</strong> decision mak<strong>in</strong>g because of the follow<strong>in</strong>g: (1) It<br />
gives a simple and <strong>in</strong>tuitive description of clusters which is<br />
easy for human comprehension. In both conceptual clustering [Michalski et al. 1981] and symbolic clustering [Gowda and Diday 1992], this representation is obtained without using any additional step; these algorithms generate the clusters as well as their descriptions.
Figure 28 Representation of clusters by a classification tree or<br />
by conjunctive statements<br />
A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers. (2) It helps in
achiev<strong>in</strong>g data compression that can be exploited further by a<br />
computer [Murty and Krishna 1980]. Figure 19(a) shows<br />
samples belong<strong>in</strong>g to two cha<strong>in</strong>-like clusters labeled 1 and 2.<br />
A partitional cluster<strong>in</strong>g like the k-means algorithm cannot<br />
separate these two structures properly. The s<strong>in</strong>gle-l<strong>in</strong>k<br />
algorithm works well on this data, but is computationally<br />
expensive. So a hybrid approach may be <strong>used</strong> to exploit the<br />
desirable properties of both these algorithms. We obta<strong>in</strong> 8 sub<br />
clusters of the data us<strong>in</strong>g the computationally efficient) k-<br />
means algorithm. Each of these sub clusters can be<br />
represented by their centroids as shown <strong>in</strong> Figure 19(a). Now<br />
the s<strong>in</strong>gle- l<strong>in</strong>k algorithm can be applied on these centroids<br />
alone to cluster them <strong>in</strong>to 2 groups. The result<strong>in</strong>g groups are<br />
shown <strong>in</strong> Figure 19(b). Here, a data reduction is achieved by<br />
represent<strong>in</strong>g the sub clusters by their centroids. (3) It <strong>in</strong>creases<br />
the efficiency of the decision mak<strong>in</strong>g task. In a cluster based<br />
document retrieval technique [Salton 1991], a large collection<br />
of documents is clustered and each of the clusters is<br />
represented us<strong>in</strong>g its centroid. In order to retrieve documents<br />
relevant to a query, the query is matched with the cluster<br />
centroids rather than with all the documents. This helps <strong>in</strong><br />
retriev<strong>in</strong>g relevant documents efficiently. Also <strong>in</strong> several<br />
applications <strong>in</strong>volv<strong>in</strong>g large data sets, cluster<strong>in</strong>g is <strong>used</strong> to<br />
perform <strong>in</strong>dex<strong>in</strong>g, which helps <strong>in</strong> efficient decision mak<strong>in</strong>g<br />
[Dorai and Ja<strong>in</strong> 1995].<br />
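The hybrid scheme of point (2) can be sketched as follows (assuming NumPy and SciPy, with synthetic chain-like data standing in for the data of Figure 19(a)): k-means first produces a handful of sub-clusters, and single-link clustering is then applied to the sub-cluster centroids only.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two chain-like clusters (synthetic stand-in for the data of the example)
t = np.linspace(0, 4, 100)
chain1 = np.c_[t, np.sin(t)] + 0.05 * rng.standard_normal((100, 2))
chain2 = np.c_[t, np.sin(t) + 2.0] + 0.05 * rng.standard_normal((100, 2))
data = np.vstack([chain1, chain2])

# step 1: 8 sub-clusters with the computationally efficient k-means
centroids, sub_labels = kmeans2(data, 8, minit="points")

# step 2: single-link clustering of the 8 centroids into 2 groups
Z = linkage(centroids, method="single")
centroid_groups = fcluster(Z, t=2, criterion="maxclust")

# each point inherits the group of its sub-cluster centroid
point_groups = centroid_groups[sub_labels]
print(np.bincount(point_groups))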
18. Evaluation of classification techniques:<br />
A framework to evaluate and analyze classification techniques was proposed in a Fractal white paper [79]; it is covered in this paper using the following criteria:
• Statistical assumptions<br />
• Data needs<br />
• Complexity of deployment<br />
• Model Performance<br />
• Model build<strong>in</strong>g time<br />
18.1 Statistical Assumptions:<br />
All parametric techniques make statistical assumptions about<br />
data. In most real life cases, these assumptions cannot be fully<br />
met. Pragmatism mixed with caution should help <strong>in</strong> gett<strong>in</strong>g<br />
the best of a model<strong>in</strong>g technique. If there are multicoll<strong>in</strong>earity<br />
issues with data, for example, one should<br />
def<strong>in</strong>itely explore the use of non-parametric techniques like<br />
neural networks or genetic algorithms for a possible superior<br />
fit compared with parametric statistical methods. Similarly, if<br />
the sample has a skewed good-bad mix, discriminant and k-NN techniques are likely to underperform vis-à-vis the other
techniques. A presence of complex non-l<strong>in</strong>ear relationships<br />
with<strong>in</strong> data precludes the use of l<strong>in</strong>ear techniques. In such<br />
situations, recursive partition<strong>in</strong>g and non-parametric<br />
techniques are likely to outperform most parametric statistical<br />
techniques.<br />
18.2 Data Needs:<br />
All techniques perform better if they are exposed to large<br />
sample of representative data. Equal number of good and bad<br />
observations also can help <strong>in</strong> model build<strong>in</strong>g. However, <strong>in</strong><br />
most practical situations, availability of enough data po<strong>in</strong>ts on<br />
both event types is difficult. Non-parametric and recursive<br />
partition<strong>in</strong>g techniques usually tend to be more data hungry<br />
than parametric techniques. As discussed <strong>in</strong> the previous<br />
section, discriminant analysis and k-NN techniques are strongly sensitive to the good-bad mix in the data. k-NN is also
sensitive to the presence of irrelevant variables <strong>in</strong> model<br />
build<strong>in</strong>g. All non-parametric techniques have a tendency to<br />
over fit the model when number of variables <strong>used</strong> for model<br />
build<strong>in</strong>g is large. In these cases, it might be a useful idea to<br />
run parametric statistical techniques and delete unimportant<br />
variables from analysis before proceed<strong>in</strong>g to use nonparametric<br />
techniques.<br />
Table 4 - A comparative study of credit scoring techniques
Source: Monserrat Guillen, "Count data models for a credit scoring system", 1992
18.3. Model Build<strong>in</strong>g Time:<br />
Model build<strong>in</strong>g is an iterative process. It requires<br />
experimentation with alternative predictor variables and<br />
several different transformations. Model build<strong>in</strong>g is also a<br />
multi stage process and at each stage, many variables could be<br />
dropped or altered for seek<strong>in</strong>g a better fit. The time taken to<br />
tra<strong>in</strong> a model, can <strong>in</strong>fluence the choice of technique <strong>in</strong> some<br />
cases. Parametric techniques take relatively less time for<br />
comput<strong>in</strong>g a model. L<strong>in</strong>ear models are the friendliest <strong>in</strong> this<br />
respect. Non-parametric techniques, on the other hand, can<br />
take inordinate amounts of time for model training. k-NN is an O(n²) process and can thus take a large amount of time on large training data. Recursive partitioning techniques may take less time than non-parametric techniques but are slower
compared to logistic regression. Model building time is also important because re-calibration of models might be undertaken frequently in the light of additional data.
18.4. Transparency:<br />
Transparency of the model plays an important role <strong>in</strong> the<br />
acceptance of the model by users. The black-box approach of<br />
non-parametric techniques is probably the most important<br />
ground for us<strong>in</strong>g other techniques. <strong>Classification</strong> trees provide<br />
the most user friendly and <strong>in</strong>tuitive output amongst<br />
classification techniques. Parametric models are also<br />
transparent and show the contribution of each variable to the<br />
score. In cases of multicoll<strong>in</strong>earity, logistic and l<strong>in</strong>ear models<br />
might not truly reflect the importance of each variable. This is<br />
because another correlated variable might have accounted for<br />
the dependent variable by virtue of having entered the
model earlier.<br />
18.5. Deployment:<br />
Practical considerations of deployment might sometimes rule<br />
out the use of some techniques. Deployment of nonparametric<br />
techniques can be cumbersome and might require<br />
writ<strong>in</strong>g of programm<strong>in</strong>g code or use of proprietary software<br />
components. Deployment of parametric and classification tree<br />
models is relatively simpler. An organization should measure<br />
the <strong>in</strong>cremental profit from a model versus the <strong>in</strong>cremental<br />
deployment cost and effort to decide on the choice of model<br />
for deployment.<br />
19. Survey of classification techniques <strong>used</strong> <strong>in</strong> different<br />
speech recognition applications:<br />
TABLE 5
Classification techniques adopted in different speech recognition applications
The abbreviations used in this table are as follows:
20. Comparison of Classification techniques:
We summarize the most commonly used classifiers in Table 6. Many of them represent, in fact, an entire family of classifiers and allow the user to modify the associated parameters and criterion functions. All of these classifiers are admissible, in the sense that there exist classification problems for which each of them is the best choice; the StatLog project, which compared many of them, showed a large variability in their relative performances, demonstrating that there is no overall optimal classification rule.
TABLE 6
Classification methods
21. Some well-known clustering algorithms are listed in Table 7 [81].
TABLE 7
Clustering algorithms
22. Conclusions:
In this overview paper, different classification techniques have been discussed. At the beginning of the paper, a taxonomy of the classification techniques was presented and explained. For each method, the advantages, disadvantages, and various application areas have been given. The purpose of this paper is to give young researchers a brief account of the classification techniques used in the area of speech recognition. The contribution of this paper is a survey of the different classification methods applied to different speech recognition applications, together with their evaluation criteria; comments on and properties of the different classification methods and of the different clustering algorithms are also discussed.
Acknowledgements:<br />
Thanks are due to Prof. G. Krishna, Professor (Retd.), Indian Institute of Science, Bangalore, and Dr. M. Narshima Murthy, Professor, Dept. of Automation and Computer Science, Indian Institute of Science, Bangalore, for useful discussions while preparing this manuscript.
REFERENCES:<br />
1)Anil K.Ja<strong>in</strong>, “Statistical Pattern <strong>Recognition</strong>”,IEEE<br />
Transactions On Pattern Analysis And Mach<strong>in</strong>e Intelligence,<br />
Vol. 22, No. 1, January 2000<br />
2)A. K. Ja<strong>in</strong>, R. P. W. Du<strong>in</strong>, and J. Mao, “Statistical Pattern<br />
<strong>Recognition</strong>: A Review”, IEEE Trans. on Pattern Analysis and<br />
Mach<strong>in</strong>e Intelligence, 22(1):4-37, January 2000.<br />
3) J. A. Bilmes, “A Gentle Tutorial on the EM Algorithm and<br />
its Application to Parameter Estimation for Gaussian Mixture<br />
and Hidden Markov Models, Technical Report TR-97-021”,<br />
International Computer Science Institute, University of<br />
California, Berkeley, April 1998.<br />
4) J. A. Anderson, P. R. Krishnaiah et.al “Logistic<br />
Discrim<strong>in</strong>ation, Handbook of Statistics”, , vol. 2, pp. 169-191,<br />
Amsterdam: North Holland,1982.<br />
5) Rabiner and Juang, "Fundamentals of Speech Recognition", Pearson Education, 1993.
6) R.A. Fisher, “The Use of Multiple Measurements <strong>in</strong><br />
Taxonomic Problems”, Annals of Eugenics, vol. 7, part II, pp.<br />
179-188, 1936.<br />
7) Dasarathy, B.V,“M<strong>in</strong>imal consistent set (MCS)<br />
identification for optimal nearest neighbor decision systems<br />
design”, IEEE Transactions on Systems, Man and cybernetics,<br />
Vol. 24, Issue: 3, pp:511 – 517, March 1994.<br />
8) Girolami, M and Chao He “Probability density estimation<br />
from optimally condensed data samples” pattern Analysis and<br />
Mach<strong>in</strong>e Intelligence, IEEE Transactions on, Volume: 25,<br />
Issue: 10 , pp:1253 – 1264,Oct. 2003.<br />
9) Meijer, B.R.; “Rules and algorithms for the design of<br />
templates for template match<strong>in</strong>g”, Pattern recognition, 1992.<br />
Vol.1. Conference A: Computer Vision and Applications, 11th<br />
IAPR International Conference on, pp: 760 – 763, Aug. 1992.<br />
10) Hush, D.R., Horne B.G. “Progress <strong>in</strong> supervised neural<br />
networks”, Signal Process<strong>in</strong>g Magaz<strong>in</strong>e, IEEE, Vol. 10, Issue:<br />
1, pp:8 – 39, Jan. 1993.<br />
11)Vapnik, V., “The Nature of Statistical Learn<strong>in</strong>g Theory”,<br />
Spr<strong>in</strong>ger, 1995.<br />
12) Julia Neumann, Christoph Schnorr, “SVM-based feature<br />
selection by direct objective m<strong>in</strong>imization”, 2004.<br />
13) Lihong Zheng and Xiangjian , “<strong>Classification</strong> <strong>Techniques</strong><br />
<strong>in</strong> Pattern <strong>Recognition</strong>”, University of Technology, , Australia<br />
2007.<br />
14) L. R. Rabiner, J. G. Wilpon, A. M. Quinn, and S. G.
Terrace, “On the application of embedded digit tra<strong>in</strong><strong>in</strong>g to<br />
speaker <strong>in</strong>dependent connected digit recognition,” IEEE<br />
Transactions on Acoustics, <strong>Speech</strong> and Signal Process<strong>in</strong>g, vol.<br />
32, no. 2, pp. 272–280, April 1984.<br />
15) T. M. Cover and P. E. Hart, “Nearest neighbor pattern<br />
classification", IEEE Transactions on Information Theory, vol. IT-13, pp. 21-27, 1967.
16) Y. Chenyz Y. Hungyz C. Fuhz, “Fast Algorithm for<br />
Nearest Neighbor Search Based on a Lower Bound Tree”,<br />
Proceed<strong>in</strong>gs of the 8th International Conference on Computer<br />
Vision, Vancouver, Canada, July 2001.<br />
17) R. Bellman and S. Dreyfus, “Applied Dynamic<br />
Programm<strong>in</strong>g”, Pr<strong>in</strong>ceton, NJ, Pr<strong>in</strong>ceton University Press,<br />
1962.<br />
18) H. Silverman and D. Morgan, “The application of<br />
dynamic programm<strong>in</strong>g to connected speech<br />
recognition” ,IEEE ASSP Magaz<strong>in</strong>e, vol. 7, no. 3,pp. 6-25,<br />
1990.<br />
19) Alex Waibel and Kai-Fu Lee, “Read<strong>in</strong>gs of speech<br />
recognition”,Morgan Kaufmann Publishers, San<br />
Mateo,Calif,1990.<br />
20) Rab<strong>in</strong>er and Jung, “ HMM Tutorial”, IEEE Transactions<br />
on Acoustics, <strong>Speech</strong> and Signal Process<strong>in</strong>g, vol. 39, no. 5, pp.<br />
272–280, April 1984.<br />
21) Dat.Tat.Tran, “Fuzzy approaches to speech and speaker<br />
recognition", a Ph.D. thesis submitted to the University of Canberra, Australia, May 2000.
22) W. S. McCulloch and W. H. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
23) P. Gall<strong>in</strong>ari, S. Thiria, R. Badran, and F. Fogelman-Soulie,<br />
“On the relationships between discrim<strong>in</strong>ant analysis and<br />
multilayer perceptrons,”Neural Networks, vol. 4, pp. 349–360,<br />
1991.<br />
24) H. Asoh and N. Otsu, “An approximation of nonl<strong>in</strong>ear<br />
discrim<strong>in</strong>ant analysis by multilayer neural networks,” <strong>in</strong> Proc.<br />
Int. Jo<strong>in</strong>t Conf. Neural Networks, San Diego, CA, 1990, pp.<br />
III-211–III-216.<br />
25) A. R.Webb and D. Lowe, “The optimized <strong>in</strong>ternal<br />
representation of multilayer classifier networks performs<br />
nonl<strong>in</strong>ear discrim<strong>in</strong>ant analysis,” Neural Networks, vol. 3, no.<br />
4, pp. 367–375, 1990.<br />
26) G. S. Lim, M. Alder, and P. Had<strong>in</strong>gham, “Adaptive<br />
quadratic neural nets”, Pattern <strong>Recognition</strong>. Letter, vol. 13, pp.<br />
325–329, 1992.<br />
27) S. Raudys, “Evolution and generalization of a s<strong>in</strong>gle<br />
neuron: I. S<strong>in</strong>glelayer perceptron as seven statistical<br />
classifiers”, Neural Networks, vol. 11, pp. 283–296, 1998.<br />
28) S.Raudys,“Evolution and generalization of a s<strong>in</strong>gle<br />
neurone: II. Complexity of statistical classifiers and sample<br />
size considerations,” Neural Networks, vol. 11, pp. 297–313,<br />
1998.<br />
29) F. Kanaya and S. Miyake, “Bayes statistical behavior and<br />
valid generalization of pattern classify<strong>in</strong>g neural networks,”<br />
IEEE Trans. Neural Networks, vol. 2, no. 4, pp. 471–475,<br />
1991.<br />
30) S. Miyake and F. Kanaya, “A neural network approach to<br />
a Bayesian statistical decision problem”, IEEE Trans. Neural<br />
Networks, vol. 2, pp. 538–540, 1991<br />
31) D. G. Kle<strong>in</strong>baum, L. L. Kupper, and L. E. Chambless,<br />
“Logistic regression analysis of epidemiologic data”, Theory<br />
Practice, Commun. Statist. A, vol. 11, pp. 485–547, 1982.<br />
32) F. E. Harreli and K. L. Lee, “A comparison of the<br />
discrim<strong>in</strong>ant analysis and logistic regression under<br />
multivariate normality”, <strong>in</strong> Biostatistics: Statistics <strong>in</strong><br />
Biomedical, Public Health, and Environmental Sciences, P. K.<br />
Sen, Ed, Amsterdam, The Netherlands: North Holland, 1985.<br />
33) S. J. Press and S. Wilson, “Choos<strong>in</strong>g between logistic<br />
regression and discrim<strong>in</strong>ant analysis”, J. Amer. Statist. Assoc.,<br />
vol. 73, pp. 699–705,1978.<br />
34) M. Schumacher, R. Robner, andW. Vach, “Neural<br />
networks and logistic regression: Part I”, Comput. Statist.<br />
Data Anal., vol. 21, pp. 661–682,1996.<br />
35) W. Vach, R. Robner, and M. Schumacher, “Neural<br />
networks and logistic regression: Part II”, Comput. Statist.<br />
Data Anal., vol. 21, pp. 683–701,1996.<br />
36) B. Cheng and D. Titter<strong>in</strong>gton, “Neural networks: A review<br />
from a statistical perspective”, Statist. Sci., vol. 9, no. 1, pp.<br />
2–54, 1994.<br />
37) A. Ciampi and Y. Lechevallier, “Statistical models as<br />
build<strong>in</strong>g blocks of neural networks”, Commun. Statist., vol. 26,<br />
no. 4, pp. 991–1009, 1997.<br />
38) L. Holmstrom, P. Koist<strong>in</strong>en, J. Laaksonen, and E. Oja,<br />
“Neural and statistical classifiers-taxonomy and two case<br />
studies”, IEEE Trans. Neural Networks, vol. 8, pp. 5–17, 1997.<br />
39) A. Ripley, “Statistical aspects of neural networks”, <strong>in</strong><br />
Networks and Chaos—Statistical and Probabilistic Aspects, O.<br />
E. Barndorff-Nielsen, J. L. Jensen, andW. S. Kendall, Eds.<br />
London, U.K.: Chapman & Hall, 1993, pp. 40–123<br />
40) “Neural networks and related methods for classification”,<br />
J. R.Statist. Soc. B, vol. 56, no. 3, pp. 409–456, 1994.<br />
41) I. Sethi and M. Otten, “Comparison between entropy net<br />
and decision tree classifiers”, <strong>in</strong> Proc. Int. Jo<strong>in</strong>t Conf. Neural<br />
Networks, vol. 3, 1990, pp. 63–68.<br />
42) P. E. Utgoff, “Perceptron trees: A case study <strong>in</strong> hybrid<br />
concept representation”, Connect. Sci., vol. 1, pp. 377–391,<br />
1989.<br />
43) A. Ripley, “Statistical aspects of neural networks”, <strong>in</strong><br />
Networks and Chaos—Statistical and Probabilistic Aspects, O.<br />
E. Barndorff-Nielsen, J. L. Jensen, andW. S. Kendall, Eds.<br />
London, U.K.: Chapman & Hall, 1993, pp. 40–123.<br />
44) J. R.Statist. Soc. B, “Neural networks and related methods<br />
for classification,” International journal on Pattern recognition,<br />
vol. 56, no. 3, pp. 409–456, 1994.<br />
45) D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds.,<br />
“Mach<strong>in</strong>e Learn<strong>in</strong>g,Neural, and Statistical <strong>Classification</strong>”,<br />
London, U.K.: Ellis Horwood,1994.<br />
46) D. E. Brown, V. Corruble, and C. L. Pittard, “A<br />
comparison of decision tree classifiers with back propagation<br />
neural networks for multimodal classification problems”,<br />
Pattern <strong>Recognition</strong>., vol. 26, pp. 953–961, 1993.<br />
47) S. P. Curram and J. M<strong>in</strong>gers, “Neural networks, decision<br />
tree <strong>in</strong>duction and discrim<strong>in</strong>ant analysis: An empirical<br />
comparison”, J. Oper. Res. Soc., vol. 45, no. 4, pp. 440–450,<br />
1994.<br />
48) A. Hart, “Us<strong>in</strong>g neural networks for classification tasks—<br />
Some experiments on datasets and practical advice”, J. Oper.<br />
Res. Soc., vol. 43, pp.215–226, 1992.<br />
49) T. S. Lim,W. Y. Loh, and Y. S. Shih, “An empirical<br />
comparison of decision trees and other classification methods”,<br />
Dept. Statistics, Univ.Wiscons<strong>in</strong>, Madison, Tech. Rep. 979,<br />
1998.<br />
50) E. Patwo, M. Y. Hu, and M. S. Hung, “Two-group<br />
classification us<strong>in</strong>g neural networks”, Decis. Sci., vol. 24, no.<br />
4, pp. 825–845, 1993.<br />
51) M. S. Sanchez and L. A. Sarabia, “Efficiency of multilayered<br />
feed-forward neural networks on classification <strong>in</strong><br />
relation to l<strong>in</strong>ear discrim<strong>in</strong>ant analysis, quadratic discrim<strong>in</strong>ant<br />
analysis and regularized discrim<strong>in</strong>ant analysis”, Chemometr.<br />
Intell. Labor.Syst., vol. 28, pp. 287–303, 1995.<br />
52) V. Subramanian, M. S. Hung, and M. Y. Hu, “An<br />
experimental evaluation of neural networks for classification”,<br />
Comput. Oper. Res., vol. 20,pp. 769–782, 1993.<br />
53) R. Kohavi and D. H. Wolpert, “Bias plus variance<br />
decomposition for zero-one loss functions,” <strong>in</strong> Proc. 13th Int.<br />
Conf. Mach<strong>in</strong>e Learn<strong>in</strong>g,1996, pp. 275–283.<br />
54) L. Atlas, R. Cole, J. Connor, M. El-Sharkawi, R. J. Marks<br />
II, Y. Muthusamy, and E. Barnard, “Performance comparisons<br />
between back propagation networks and classification trees on<br />
three real-world applications”, <strong>in</strong> Advances <strong>in</strong> Neural<br />
Information Process<strong>in</strong>g Systems, D. S. Touretzky, Ed. San<br />
Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 622–629.<br />
55) T. G. Dietterich and G. Bakiri, “Solv<strong>in</strong>g multiclass<br />
learn<strong>in</strong>g problems via error-correct<strong>in</strong>g output codes”, J. Artif.<br />
Intell. Res., vol. 2, pp. 263–286, 1995.<br />
56) W. Y. Huang and R. P. Lippmann, “Comparisons between<br />
neural net and conventional classifiers”, IEEE 1st Int. Conf.<br />
Neural Networks, San Diego, CA, 1987, pp. 485–493.<br />
57) E. Patwo, M. Y. Hu, and M. S. Hung, “Two-group<br />
classification us<strong>in</strong>g neural networks”, Decis. Sci., vol. 24, no.<br />
4, pp. 825–845, 1993<br />
58) G. Cybenko, “Approximation by super-positions of a<br />
sigmoidal function”, Math. Contr. Signals Syst., vol. 2, pp.<br />
303–314, 1989.<br />
59) K. Hornik, “Approximation capabilities of multilayer feed<br />
forward networks”, Neural Networks, vol. 4, pp. 251–257,<br />
1991.<br />
60] K. Hornik, M. St<strong>in</strong>chcombe, and H. White, “Multilayer<br />
feed forward networks are universal approximators”, Neural<br />
Networks, vol. 2, pp. 359–366, 1989.<br />
61) M. D. Richard and R. Lippmann, “Neural network<br />
classifiers estimate Bayesian a posteriori probabilities”,<br />
Neural Comput., vol. 3, pp. 461–483, 1991.<br />
62) R. Solera-Ure˜na, J. Padrell-Sendra et.al, “SVMs for<br />
Automatic <strong>Speech</strong> <strong>Recognition</strong>: A Survey” Signal Theory and<br />
Communications Department EPS-Universidad Carlos III de<br />
Madrid,Avda. de la Universidad, 30, 28911-Legan´es<br />
(Madrid), SPAIN<br />
63) B.E. Boser, I. Guyon, and V. Vapnik, “ A tra<strong>in</strong><strong>in</strong>g<br />
algorithm for optimal marg<strong>in</strong> classifiers”, Computational<br />
Learn<strong>in</strong>g Theory, pages 144–152, 1992.<br />
64) F. P´erez-Cruz and O. Bousquet, “ Kernel Methods and<br />
Their Potential Use <strong>in</strong> Signal Process<strong>in</strong>g”. IEEE Signal<br />
Process<strong>in</strong>g Magaz<strong>in</strong>e, 21(3):57–65, 2004.<br />
65) R. Fletcher., “Practical Methods of Optimization”. Wiley-<br />
Interscience, New York, NY (USA), 1987.<br />
66) Earl Gosh et.al.,, “Pattern recognition”, School of<br />
computer science, -Tele-communciations and <strong>in</strong>formation<br />
system, DePaul University, Prentice Hall of India, New Delhi.<br />
67) . Gray, R. “ Vector Quantization”, IEEE ASSP<br />
Magaz<strong>in</strong>epp. 4–29, 1984.<br />
68) Reynolds, D.A., “A Gaussian Mixture Model<strong>in</strong>g<br />
Approach to Text-Independent Speaker Identification”, PhD<br />
thesis, Georgia Institute of Technology ,1992.<br />
69) Reynolds, D.A., Rose, R.C, “ Robust Text-Independent<br />
Speaker Identification us<strong>in</strong>g Gaussian Mixture Speaker<br />
Models”, IEEE Transactions on Acoustics, <strong>Speech</strong>, and Signal<br />
Process<strong>in</strong>g 3(1) (1995) 72–83.<br />
70) McLachlan, G., ed. “ Mixture Models” Marcel Dekker,<br />
New York, NY,1988.<br />
71) Dempster, A., Laird, N., Rub<strong>in</strong>, D., “Maximum<br />
Likelihood from Incomplete Data via the EM Algorithm”,<br />
Journal of the Royal Statistical Society 39(1) 1–38, 1977.<br />
72) Reynolds, D.A., Quatieri, T.F., Dunn, R.B, “Speaker<br />
Verification Us<strong>in</strong>g Adapted Gaussian Mixture Models”,<br />
Digital Signal Process<strong>in</strong>g Vol.10,pp. 19–41, 2000 .<br />
73) A.K.Ja<strong>in</strong>,M.N.Murthy and P.J.FLYNN, “Data Cluster<strong>in</strong>g:<br />
A Review”, The Ohio State University, ACM Comput<strong>in</strong>g<br />
Surveys, Vol. 31, No. 3, September 1999.<br />
74) S.Watanabe, “Pattern recognition: Human and<br />
mechanical”,Wiley,Newyork-1985.<br />
75)K.S.Fu, “A step towards unification of syntactic and<br />
statistical pattern recognition”, IEEE Trans. On Pattern<br />
<strong>Recognition</strong> and Mach<strong>in</strong>e Intelligence, Vol.5,no.2,pp 200-<br />
205,March 1983.<br />
76) Anil K.Ja<strong>in</strong> et.al, “Statistical pattern recognition: a<br />
Review”, IEEE Trans. on Pattern Analysis and Mach<strong>in</strong>e<br />
<strong>in</strong>telligence, Vol.22,no.1,PP.4-37,Jan 2000.<br />
77) Monserrat, Guillen, Manuel, Artis, "Count data models for<br />
credit scor<strong>in</strong>g system”, Third Meet<strong>in</strong>g on the European<br />
Conference Series <strong>in</strong> Quantitative Economics and<br />
Econometrics on Econometrics of Duration, Count and<br />
Transition Models, Paris, December 1992.<br />
78) Thomas, Lyn. C, “A Survey of Credit and Behavioral<br />
Scor<strong>in</strong>g; Forecast<strong>in</strong>g f<strong>in</strong>ancial risk of lend<strong>in</strong>g to consumers”,<br />
University of Ed<strong>in</strong>burgh,2000.<br />
79) A Fractal Whitepaper, “Comparative Analysis of<br />
<strong>Classification</strong> <strong>Techniques</strong>”, September 2003.<br />
80) D. Michie et al., "Machine Learning, Neural and Statistical Classification", Ellis Horwood, New York, 1994.
81) A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, 1988.