
ISSN: 2229-6093

M. A. Anusuya et al., Int. J. Comp. Tech. Appl., Vol 2 (4), 910-954

Classification Techniques used in Speech Recognition Applications: A Review

M. A. Anusuya*1, S. K. Katti*2

*Department of Computer Science and Engineering, SJCE, Mysore, INDIA

1 anusuya_ma@yahoo.co.in, 2 skkatti@indiatimes.com

Abstract— The classification phase is one of the most active research and application areas of speech recognition. The literature is vast and growing. This paper summarizes some of the most important developments in the classification procedures of speech recognition applications, and also presents the state of the art of classification techniques. Different classification techniques and their parameter estimation methods, properties, advantages, and disadvantages, along with their application areas, are discussed for each classification method. Our purpose is to provide a synthesis of the published research in the area of speech recognition and to stimulate further research interest and effort in the identified topics. This paper presents an overview of several pattern classification methods available in the literature for speech recognition applications.

Keywords— Classification, Classifiers, Taxonomy, Bayes decision theory, Acoustic Phonetic approach, Template matching, Dynamic Time Warping (DTW), Vector Quantization (VQ), Hidden Markov Model (HMM), Artificial Neural Network (ANN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gaussian Mixture Modeling, Clustering techniques, Evaluations, Applications.

I. INTRODUCTION

CLASSIFICATION is one of the most frequently encountered decision-making tasks of human activity [1]. A classification problem occurs when an object needs to be assigned to a predefined group or class based on a number of observed attributes related to that object. Many problems in business, science, industry, and medicine can be treated as classification problems. The goal of this paper is to survey the core concepts and techniques in the large subset of classification and analysis methods that have their roots in statistics and decision theory. Although significant progress has been made in classification-related areas of speech recognition, a number of issues in applying classification techniques remain and have not been solved completely or successfully. In this paper, some theoretical as well as empirical issues of speech recognition classification methods are reviewed and discussed. The vast research topics and extensive literature make it impossible for one review to cover all of the work in the field. This review aims to provide a summary of the most important advances in general classification methods.

Pattern recognition techniques are used to automatically classify physical objects (1D, 2D or 3D) or abstract multidimensional patterns (n points in d dimensions) into known or possibly unknown categories. A number of commercial pattern recognition systems exist for speech recognition, character recognition, handwriting recognition, document classification, fingerprint classification, speech and speaker recognition, white blood cell (leukocyte) classification, and military target recognition, among others. Most machine vision systems employ pattern recognition techniques to identify objects for sorting, inspection, and assembly. The most widely used classifiers are the nearest neighbour, kernel methods such as SVM, KNN algorithms, Gaussian mixture modeling, the naïve Bayes classifier, and decision trees.

1.1. Classification method design:

Classification is the final stage of pattern recognition. This is the stage where an automated system declares that the input object belongs to a particular category. There are many classification methods in the field. Classification method designs are based on the following concepts.

i) Classification

Classification means assigning a class to a measurement or, equivalently, identifying the probabilistic source of a measurement. The only statistical model that is needed is the conditional model of the class variable given the measurement. This conditional model can be obtained from a joint model, or it can be learned directly. The former approach is generative, since it models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models. The latter approach is discriminative, since it focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbor.
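
As a minimal illustration of this distinction (a sketch added for this review, not part of the cited literature), the code below fits a generative classifier (Gaussian naïve Bayes, which models the measurements within each class) and a discriminative classifier (logistic regression, which models the class boundary directly) on the same toy two-class data; the data, class labels, and use of scikit-learn are assumptions made only for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(x | class)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(class | x)

rng = np.random.default_rng(0)
# Toy 2-D feature vectors for two hypothetical classes drawn from different Gaussians.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([2.5, 2.5], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

generative = GaussianNB().fit(X, y)              # learns class priors and per-class densities
discriminative = LogisticRegression().fit(X, y)  # learns the decision boundary directly

x_test = np.array([[1.0, 1.5]])
print(generative.predict(x_test), discriminative.predict(x_test))
```

Both routes give a class decision; the generative model additionally provides a model of the measurements themselves, which is what allows it to exploit prior knowledge and cope with missing data, as noted above.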

ii) Model selection

Choosing the parametric family for density estimation is an important part of model selection. This is harder than parameter estimation, since we have to take into account every member of each family in order to choose the best family.

a. Member-roster concept: Under this template-matching concept, a set of patterns belonging to the same pattern class is stored in a classification system. When an unknown pattern is given as input, it is compared with the existing patterns and placed under the matching pattern class.

b. Common property concept: In this concept, the common properties of patterns are stored in a classification system. When an unknown pattern arrives, the system checks its
extracted common property against the common properties of the existing classes and places the pattern/object under the class that has similar common properties.

c. Clustering concept: Here, the patterns of the targeted classes are represented as vectors whose components are real numbers. Using their clustering properties, we can easily classify an unknown pattern. If the target vectors are far apart in their geometrical arrangement, it is easy to classify the unknown patterns. If they are nearby, or if there is any overlap in the cluster arrangement, we need more complex algorithms to classify the unknown patterns. One simple algorithm based on the clustering concept is Minimum Distance Classification. This method computes the distance between the unknown pattern and the desired set of known patterns, determines which known pattern is closest to the unknown, and finally places the unknown pattern under the known pattern to which it has minimum distance. This algorithm works well when the target patterns are far apart.
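
A minimal sketch of Minimum Distance Classification follows, assuming each known class is summarized by a single pattern vector; the class labels and feature values are hypothetical and serve only to illustrate the rule described above.

```python
import numpy as np

def minimum_distance_classify(unknown, class_patterns):
    """Assign `unknown` to the class whose known pattern is closest (Euclidean distance)."""
    distances = {label: np.linalg.norm(unknown - pattern)
                 for label, pattern in class_patterns.items()}
    return min(distances, key=distances.get)

# Hypothetical known patterns (e.g., averaged feature vectors for two word classes).
patterns = {"yes": np.array([0.9, 0.1, 0.2]),
            "no":  np.array([0.1, 0.8, 0.7])}
print(minimum_distance_classify(np.array([0.8, 0.2, 0.3]), patterns))  # -> yes
```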

1.2. Classifiers design:

Classifiers are functions that use pattern matching to determine the closest match. After an optimal feature subset is selected, a classifier can be designed using various approaches. Roughly speaking, there are three different approaches [1, 2]. The first approach is the simplest and most intuitive one, based on the concept of similarity; template matching is an example. The second is a probabilistic approach. It includes methods based on the Bayes decision rule and maximum likelihood or density estimators. Three well-known methods are the K-nearest neighbor (KNN) classifier, the Parzen window classifier, and branch-and-bound (BnB) methods. The third approach is to construct decision boundaries directly by optimizing a certain error criterion. Examples are Fisher's linear discriminant, multilayer perceptrons, decision trees, and support vector machines. Determining a suitable classifier for a given problem is still more an art than a science.

The first class of classifiers uses some similarity metric and assigns class labels so as to maximize the similarity. Probabilistic methods, of which the Bayesian classifier is the best known, depend on the prior probabilities of the classes and the class-conditional densities of the instances. In addition to Bayesian classifiers, logistic classifiers belong to this type. Logistic classifiers deal with unknown parameters based on maximum likelihood [3]. Further details on logistic classifiers can be found in [4]. Geometric classifiers build decision boundaries by directly minimizing an error criterion. An example of these classifiers is Fisher's linear discriminant, which mainly aims to reduce the feature space to lower dimensions in the case of a huge number of features. It minimizes the mean squared error between the class labels and the tested instance. Additionally, neural networks are examples of geometric classifiers.
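
To make the geometric route concrete, the sketch below (an illustration on assumed toy data, not an implementation from the reviewed literature) computes a two-class Fisher linear discriminant: the projection direction w is obtained from the within-class scatter matrix and the class means, and a new sample is classified by comparing its projection against the midpoint of the projected class means.

```python
import numpy as np

def fisher_discriminant(X0, X1):
    """Two-class Fisher linear discriminant: w solves (S0 + S1) w = m1 - m0."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - m0).T @ (X0 - m0)      # within-class scatter of class 0
    S1 = (X1 - m1).T @ (X1 - m1)      # within-class scatter of class 1
    w = np.linalg.solve(S0 + S1, m1 - m0)
    threshold = 0.5 * (m0 + m1) @ w   # midpoint of the projected class means
    return w, threshold

rng = np.random.default_rng(1)
X0 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # toy samples of class 0
X1 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))   # toy samples of class 1
w, t = fisher_discriminant(X0, X1)
x_new = np.array([2.5, 0.8])
print("class 1" if x_new @ w > t else "class 0")
```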

1.3. Classification taxonomy:

Based on the available literature, Figure 1 and Figure 1a show the taxonomy of different classifiers used for various speech recognition applications, organized by classification technique and by density function. Alternatively, the taxonomy can be represented on the basis of the density approach, as follows.

Figure 1a. Taxonomy based on class-conditional densities

2. Knowledge-Based Classification Method

In this method, human knowledge of speech has to be expressed in terms of explicit rules. Acoustic-phonetic rules describe the words of the lexicon, the syntax of the language, and so on, and deal with phonetic and linguistic principles. Basically, there exist two approaches of this kind to speech recognition. They are:

• Acoustic Phonetic Approach

• Artificial Intelligence Approach

2.1. Acoustic Phonetic Approach [5]:

The acoustic phonetic approach is based on the theory of acoustic phonetics, which postulates that there exist finite, distinctive phonetic units in spoken language and that these phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum,
over time [5]. Even though the acoustic properties of phonetic units are highly variable, both across speakers and with neighboring phonetic units (the so-called co-articulation of sounds), it is assumed that the rules governing the variability are straightforward and can readily be learned and applied in practical situations. Hence the first step in the acoustic phonetic approach to speech recognition is called the segmentation and labeling phase, because it involves segmenting the speech signal into discrete (in time) regions where the acoustic properties of the signal are representative of one (or possibly several) phonetic units (or classes), and then attaching one or more phonetic labels to each segmented region according to its acoustic properties. To actually do speech recognition, a second step attempts to determine a valid word (or string of words) from the sequence of phonetic labels produced in the first step that is consistent with the constraints of the speech recognition task (i.e., the words are drawn from a given vocabulary, the word sequence makes syntactic sense and has semantic meaning, etc.).

To illustrate the steps involved in the acoustic phonetic approach to speech recognition, consider the phoneme lattice shown in Figure 2. (A phoneme lattice is the result of the segmentation and labeling step of the recognition process, and represents a sequential set of phonemes that are likely matches to the spoken input speech.) The problem is to decode the phoneme lattice into a word string (one or more words) such that every instant of time is included in one of the phonemes in the lattice, and such that the word (or word sequence) is valid according to the rules of English syntax. (The symbol SIL stands for silence or a pause between sounds or words; the vertical position in the lattice, at any time, is a measure of the goodness of the acoustic match to the phonetic unit, with the highest unit having the best match.) With a modest amount of searching, one can derive the appropriate phonetic string SIL-AO-L-AX-B-AW-T corresponding to the word string "all about," with the phonemes L, AX, and B having been second or third choices in the lattice, and all other phonemes having been first choices.

This simple example illustrates well the difficulty of decoding phonetic units into word strings. This is the so-called lexical access problem. The real problem with the acoustic phonetic approach to speech recognition is the difficulty of getting a reliable phoneme lattice for the lexical access stage.
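
For illustration only, the sketch below performs a drastically simplified form of lexical access: it assumes the lattice has already been reduced to a single best phonetic string and segments that string into words using a small hypothetical lexicon. The lexicon contents and the helper function are assumptions introduced for this example and are not taken from the paper.

```python
from functools import lru_cache

# Hypothetical toy lexicon mapping words to phoneme sequences (illustration only).
LEXICON = {"all": ("AO", "L"), "about": ("AX", "B", "AW", "T")}

def lexical_access(phones):
    """Segment a phonetic string into a word sequence covered by the lexicon."""
    phones = tuple(p for p in phones if p != "SIL")   # drop silences

    @lru_cache(maxsize=None)
    def parse(i):
        if i == len(phones):
            return []
        for word, pron in LEXICON.items():
            if phones[i:i + len(pron)] == pron:
                rest = parse(i + len(pron))
                if rest is not None:
                    return [word] + rest
        return None   # no valid word sequence covers this suffix

    return parse(0)

print(lexical_access(["SIL", "AO", "L", "AX", "B", "AW", "T"]))  # -> ['all', 'about']
```

A real lexical access procedure would, of course, operate on the full lattice, score competing hypotheses, and apply syntactic constraints, as described above.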

Fig. 3 shows a block diagram of the acoustic phonetic approach to speech recognition. The first step in the processing (a step common to all approaches to speech recognition) is the speech analysis system (the so-called feature measurement method), which provides an appropriate (spectral) representation of the characteristics of the time-varying speech signal. The most common techniques of spectral analysis are the class of filter bank methods and the class of linear predictive coding (LPC) methods. Both of these methods provide spectral descriptions of the speech over time. The next step in the processing is the feature-detection stage. The idea here is to convert the spectral measurements into a set of features that describe the broad acoustic properties of the different phonetic units. Among the features proposed for recognition are nasality (presence or absence of nasal resonance), frication (presence or absence of random excitation in the speech), formant locations (frequencies of the first three resonances), voiced/unvoiced classification (periodic or aperiodic excitation), and ratios of high- and low-frequency energy. Many proposed features are inherently binary (e.g., nasality, frication, voiced/unvoiced); others are continuous (e.g., formant locations, energy ratios). The feature detection stage usually consists of a set of detectors that operate in parallel and use appropriate processing and logic to make the decision as to the presence or absence, or the value, of a feature. The algorithms used for individual feature detectors are sometimes sophisticated ones that do a lot of signal processing, and sometimes they are rather trivial estimation procedures.
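
As a rough, hypothetical illustration of such detectors (not a description of any particular system reviewed here), the sketch below splits a waveform into overlapping frames and computes two simple per-frame measures commonly used as crude correlates: the zero-crossing rate (related to frication) and a low-band to high-band energy ratio (related to voicing). The frame sizes, the 1 kHz band split, and the toy signal are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def simple_feature_detectors(x, fs, frame_ms=25, hop_ms=10, split_hz=1000):
    """Per-frame zero-crossing rate and low/high band energy ratio."""
    frame_len, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    low = spec[:, freqs < split_hz].sum(axis=1)
    high = spec[:, freqs >= split_hz].sum(axis=1) + 1e-10
    return zcr, low / high

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 150 * t) + 0.05 * np.random.randn(fs)  # toy "voiced" signal
zcr, ratio = simple_feature_detectors(x, fs)
print(zcr[:3], ratio[:3])
```

Real feature detectors combine many such measures with additional processing and logic, as noted above.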

The third step in the procedure is the segmentation and labeling phase, whereby the system tries to find stable regions (where the features change very little over the region) and then to label each segmented region according to how well the features within that region match those of individual phonetic units. This stage is the heart of the acoustic phonetic recognizer and is the most difficult one to carry out reliably; hence various control strategies are used to limit the range of segmentation points and label possibilities. For example, for individual word recognition, the constraint that a word contains at least two phonetic units and no more than six phonetic units means that the control strategy need only consider solutions with between 1 and 5 internal segmentation points. Furthermore, the labeling strategy can exploit lexical constraints on words to consider only words with n phonetic units whenever the segmentation gives n-1 segmentation points. These constraints are often powerful ones that reduce the search space and significantly increase the performance (accuracy of segmentation and labeling) of the system.

The result of the segmentation and labeling step is usually a phoneme lattice (of the type shown in Figure 2), from which a lexical access procedure determines the best matching word or sequence of words. Other types of lattices (e.g., syllable, word) can also be derived by integrating vocabulary and syntax constraints into the control strategy, as discussed above. The quality of the match of the features within a segment to the phonetic units can be used to assign probabilities to the labels, which can then be used in a probabilistic lexical access procedure. The final output of the recognizer is the word or word sequence that best matches, in some well-defined sense, the sequence of phonetic units in the phoneme lattice.



Figure 3. Block diagram of an acoustic phonetic speech recognition system

2.1.1. General Discussion on the Acoustic Phonetic Approach:

A typical acoustic phonetic approach to ASR has the following steps (this is similar to the overview of the acoustic-phonetic approach presented by Rabiner (Rabiner and Juang, 1993), but it is defined here more broadly):

1. Speech is analyzed using any of the spectral analysis methods - Short Time Fourier Transform (STFT), Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), etc. - using overlapping frames with a typical size of 10-25 ms and a typical overlap of 5 ms.

2. Acoustic correlates of phonetic features are extracted from the spectral representation. For example, low-frequency energy may be calculated as an acoustic correlate of sonorancy, the zero-crossing rate may be calculated as a correlate of frication, and so on.

3. Speech is segmented by either finding transient locations using the spectral change across two consecutive frames, or using the acoustic correlates of source or manner classes to find the segments with stable manner classes (a minimal sketch of the spectral-change strategy is given after this list). The former approach, that is, finding acoustically stable regions using the locations of spectral change, has been followed by Glass et al. (Glass and Zue, 1988). The latter method of using broad manner class scores to segment the signal has been used by a number of researchers (Bitar, 1997; Liu, 1996; Fohr et al.; Carbonell et al., 1987). Multiple segmentations may be generated instead of a single representation, for example, the dendrograms in the speech recognition method proposed by Glass (Glass and Zue, 1988). (The system built by Glass et al. is included here as an acoustic phonetic system because it fits the broad definition of the acoustic-phonetic approach, but this system uses very little knowledge of acoustic phonetics.)

4. Further analysis of the individual segmentations is carried out next, either to recognize each segment directly as a phoneme, or to find the presence or absence of individual phonetic features and use these intermediate decisions to find the phonemes. When multiple segmentations are generated instead of a single segmentation, a number of different phoneme sequences may be generated. The phoneme sequences that match the vocabulary and grammar constraints are then used to decide upon the spoken utterance by combining the acoustic and language scores.
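
The following sketch illustrates the first segmentation strategy mentioned in step 3, placing boundaries where the spectral change between consecutive frames is large. The framing, normalization, and threshold rule are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def segment_by_spectral_change(frames, threshold=None):
    """Place segment boundaries where the spectral change between two
    consecutive frames is large (one of the strategies described above)."""
    spec = np.abs(np.fft.rfft(frames, axis=1))
    spec /= spec.sum(axis=1, keepdims=True) + 1e-10         # normalise each spectrum
    change = np.linalg.norm(np.diff(spec, axis=0), axis=1)   # frame-to-frame distance
    if threshold is None:
        threshold = change.mean() + 2 * change.std()          # ad-hoc threshold
    return np.where(change > threshold)[0] + 1

# Toy frames: a low-frequency region followed by a high-frequency region.
t = np.arange(400) / 16000.0
frames = np.vstack([np.sin(2 * np.pi * 200 * t)] * 30 +
                   [np.sin(2 * np.pi * 3000 * t)] * 30)
print(segment_by_spectral_change(frames))  # boundary expected near frame 30
```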

2.1.2. Hurdles/Challenges in the acoustic-phonetic approach:

A number of problems have been associated with the acoustic-phonetic approach in the literature. Rabiner (Rabiner and Juang, 1993) lists at least five such problems or hurdles that have made the use of the approach minimal in the ASR community. The problems with the acoustic phonetic approach, and some ideas for solving them, provide much of the motivation for the present work. These documented problems of the acoustic-phonetic approach are now listed, and it is argued that either insufficient effort has gone into solving these problems or that the problems are not unique to the acoustic-phonetic approach.

a) It has been argued that the difficulty of properly decoding phonetic units into words and sentences grows dramatically with an increase in the rate of phoneme insertion, deletion, and substitution. This argument makes the assumption that phoneme units are recognized in the first pass with no knowledge of language and vocabulary constraints. This has been true for many of the acoustic phonetic methods, but it is not necessary, since vocabulary and grammar constraints may be used to constrain the speech segmentation paths (Glass et al., 1996).

b) Extensive knowledge of the acoustic manifestations of phonetic units is required, and the lack of completeness of this knowledge has been pointed out as a drawback of the knowledge-based approach. While it is true that the knowledge is incomplete, there is no reason to believe that the standard signal representations, for example, Mel-Frequency Cepstral Coefficients (MFCCs), used in state-of-the-art ASR methods are sufficient to capture all the acoustic manifestations of the speech sounds. Although the knowledge is not complete, a number of efforts to find acoustic correlates of phonetic features have obtained excellent results. Most recently, there has been significant development in the research on the acoustic correlates of place of stop consonants and fricatives (Stevens et al., 1999; Ali, 1999; Bitar, 1997), nasal detection (Pruthi and Espy-Wilson, 2003), and semivowel classification (Espy-Wilson, 1994). The knowledge from these sources may be adequate to start building an acoustic-phonetic speech recognizer to carry out word recognition tasks, and that was the focus of this work. It should be noted that, because of the physical significance of the knowledge-based acoustic measurements, it is easy to pinpoint the source of recognition errors in the recognition system. Such an error analysis is close to impossible with MFCC-like front-ends.

c) The third argument against the acoustic-phonetic approach is that the choice of phonetic features and their acoustic correlates is not optimal. It is true that linguists may not agree with each other on the optimal set of phonetic features, but finding the best set of features is a task that can be carried out instead of turning to other ASR methods. The phonetic feature
set used in this work will be based on distinctive feature theory, and it will be optimal in that sense.

d) Another drawback of the acoustic-phonetic approach, as pointed out in (Rabiner and Juang, 1993), is that the design of the sound classifiers is not optimal. This argument probably assumes that binary decision trees with hard knowledge-based thresholds are used to carry out the decisions in the acoustic phonetic approach. Statistical pattern recognition methods can be used to carry out these decisions instead, in which case the classification is no less optimal.

2.1.3. Advantages and Disadvantages:

a) Advantages:

1) Not all acoustic-phonetic features are used for every decision.

2) Since the acoustic-phonetic features have a strong physical interpretation, it is easy to pinpoint the source of error in such a recognition system; it is easy to tell whether the pattern matcher has failed.

3) The method can easily take advantage of the years of research that have gone into acoustic phonetics, as well as into signal processing based on human auditory models.

b) Disadvantages:

The chosen phonemes are not only the first choices in the phonetic sequence, but also second (B and AX) and third (L) choices. Therefore, matching a phonetic sequence with a word or a group of words is not obvious. In fact, this is the main disadvantage of this approach.

2.1.4. Applications:

1) Acoustic Phonetic Approach to Speech Recognition: Application to the Semivowels
2) Models of Phonetic Recognition: The Role of Analysis by Synthesis in Phonetic Recognition
3) The Influence of Phonetic Context on the Acoustic Properties of Stops
4) The Role of Syllable Structure in the Acoustic Realizations of Stops
5) A Semivowel Recognition System
6) Two-Dimensional Characterization of the Speech Signal and Its Potential Applications to Speech Processing
7) Recognition of Words from their Spellings: Integration of Multiple Knowledge Sources

2.2. Artificial Intelligence Approach [5]:

Historically, there are two main approaches to AI: the classical approach (designing the AI), based on symbolic reasoning - a mathematical approach in which ideas and concepts are represented by symbols such as words, phrases or sentences, which are then processed according to the rules of logic - and the connectionist approach (letting the AI develop), based on artificial neural networks, which imitate the way neurons work, and on genetic algorithms, which imitate inheritance and fitness to evolve better solutions to a problem with every generation.

The AI approach [5] to speech recognition is a hybrid of the acoustic phonetic approach and the pattern recognition approach, in that it exploits the ideas and concepts of both methods. The artificial intelligence approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features. In particular, the techniques used within this class of methods include: the use of an expert system for segmentation and labeling, so that this crucial and most difficult step can be performed with more than just the acoustic information used by pure acoustic phonetic methods (in particular, methods that integrate phonemic, lexical, syntactic, semantic, and even pragmatic knowledge into the expert system have been proposed and studied); learning and adapting over time (i.e., the concept that knowledge is often both static and dynamic, and that models must adapt to the dynamic component of the data); and the use of neural networks for learning the relationships between phonetic events and all known inputs (including acoustic, lexical, syntactic, semantic, etc.), as well as for discrimination between similar sound classes.

The basic idea of the artificial intelligence approach to speech recognition is to compile and incorporate knowledge from a variety of knowledge sources and to bring it to bear on the problem at hand. Thus, for example, the AI approach to segmentation and labeling would be to augment the generally used acoustic knowledge with phonemic knowledge, lexical knowledge, syntactic knowledge, semantic knowledge, and even pragmatic knowledge. The different knowledge sources required are as follows:

a) Acoustic knowledge - evidence of which sounds (predefined phonetic units) are spoken, on the basis of spectral measurements and the presence or absence of features.

b) Lexical knowledge - the combination of acoustic evidence so as to postulate words, as specified by a lexicon that maps sounds into words (or, equivalently, decomposes words into sounds).

c) Syntactic knowledge - the combination of words to form grammatically correct strings (according to a language model), such as sentences or phrases.

d) Semantic knowledge - understanding of the task domain so as to be able to validate sentences (or phrases) that are consistent with the task being performed, or which are consistent with previously decoded sentences.

e) Pragmatic knowledge - the inference ability necessary for resolving ambiguity of meaning based on the ways in which words are generally used.

2.2.1. Advantages and Disadvantages of the Artificial Intelligence Approach:

a) Advantages:

i) AI has made some progress at imitating "subsymbolic" problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures
inside human and animal brains that give rise to this skill; and step-by-step reasoning of the kind humans were often assumed to use when they solve puzzles, play board games or make logical deductions.

ii) By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.

iii) The search for more efficient problem-solving algorithms is a high priority for AI research.

b) Disadvantages:

i) For difficult problems, most of the algorithms in the artificial intelligence approach require enormous computational resources - most suffer from a "combinatorial explosion": the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size.

ii) Intelligent systems are not like humans.

2.2.2. Applications:

1. Artificial intelligence approach to speech recognition
2. AI approach to chemical inference
3. AI approach to cognitive linguistics
4. AI approach to VLSI design
5. AI approach to machine learning
6. AI approach to reservoir
7. AI approach to automated office

3. Bayes Decision Theory:

Bayesian decision making refers to choosing the most likely class, given the value of the feature or features. The probabilities of class membership are calculated from Bayes' theorem. If the feature value is denoted by x and a class of interest is C, then P(x) is the probability distribution for feature x in the entire population and P(C) is the prior probability that a random sample is a member of class C. P(x|C) is the conditional probability of obtaining feature value x given that the sample is a member of class C. Bayes' theorem then gives the posterior probability that a sample belongs to class C given that it has feature value x, denoted P(C|x), on the basis of the values of P(x|C), P(C) and P(x):

P(C|x) = P(x|C) P(C) / P(x).
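
A minimal numerical sketch of this decision rule follows, assuming (purely for illustration) that each class-conditional density P(x|C) is a one-dimensional Gaussian with known mean and standard deviation; the class names, priors, and parameter values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """One-dimensional Gaussian density, used here as the class-conditional P(x|C)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def bayes_classify(x, classes):
    """classes: dict label -> (prior P(C), mean, std). Returns the label with the
    largest posterior P(C|x) = P(x|C) P(C) / P(x), plus all posteriors."""
    joint = {c: prior * gaussian_pdf(x, mu, sd) for c, (prior, mu, sd) in classes.items()}
    evidence = sum(joint.values())                       # P(x)
    posterior = {c: j / evidence for c, j in joint.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical 1-D feature (e.g., a formant frequency in Hz); values are illustrative only.
classes = {"vowel": (0.6, 700.0, 150.0), "consonant": (0.4, 300.0, 120.0)}
print(bayes_classify(520.0, classes))
```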

4. Database Classification Method:

In this classification, the patterns are stored in a database and the test signal is compared against the patterns stored in the database. Since the collection of trained patterns is stored in a database, this method is called the database classification method. It has been categorized as one of the important classification methods, i.e., it is classified as the pattern recognition approach. In turn, the pattern recognition approach has been divided into two kinds of methods, namely, template/DTW/supervised and unsupervised classification methods. Each of these methods is discussed in detail in the sections below.

4.1. Introduction to the Pattern Recognition approach:

Pattern recognition as a field of study developed significantly in the 1960s. It is very much an interdisciplinary subject, covering developments in the areas of statistics, engineering, artificial intelligence, computer science, psychology and physiology, among others. Watanabe [74] defines a pattern "as opposite of chaos"; it is an entity, vaguely defined, that could be given a name.

Pattern recognition is concerned with the classification of objects into categories, especially by machine. A strong emphasis is placed on the statistical theory of discrimination, but clustering also receives some attention. Hence it can be summed up in a single word: 'classification', both supervised (using class information to design a classifier, i.e., discrimination) and unsupervised (allocating to groups without class information, i.e., clustering). Its ultimate goal is to optimally extract patterns based on certain conditions and to separate one class from the others. Pattern recognition has often been achieved using linear and quadratic discriminants [6], the k-nearest neighbor classifier [7] or the Parzen density estimator [8], template matching [9] and Neural Networks [10]. These methods are basically statistical. The problem in using these recognition methods lies in constructing the classification rule without having any idea of the distribution of the measurements in the different groups. The Support Vector Machine (SVM) [11] has gained prominence in the field of pattern classification, and SVMs compete strongly with other techniques, such as template matching and Neural Networks, for pattern recognition.

4.1.1. General Process of Pattern Recognition:

A pattern is a pair comprising an observation and a meaning. Pattern recognition is inferring meaning from observation. Designing a pattern recognition system is establishing a mapping from the measurement space into the space of potential meanings. The basic components of pattern recognition are pre-processing, feature extraction and selection, classifier design, and optimization.

4.1.1a. Pre-processing:

The role of pre-processing is to segment the pattern of interest from the background. Generally, noise filtering, smoothing and normalization are done in this step. The pre-processing also defines a compact representation of the pattern.

4.1.1b. Feature Selection and Extraction:

Features should be easily computed, robust, insensitive to various distortions and variations in the signal, and rotationally invariant. Two kinds of features are used in pattern recognition problems. One kind of feature has a clear physical meaning, such as geometric, structural, and statistical features. The other kind of feature has no physical meaning; these are called mapping features. The advantage of physical features is that they need not deal with irrelevant features. The advantage of mapping features is that they make classification easier, because clearer boundaries
are obtained between classes, but at the cost of increased computational complexity.

i) Feature selection means selecting the best subset from the input space. Its ultimate goal is to select the optimal feature subset that can achieve the highest accuracy, whereas feature extraction is applied in situations where no physical features can be obtained. Most feature selection algorithms involve a combinatorial search through the whole space. Usually, heuristic methods such as hill climbing have to be adopted, because the size of the input space is exponential in the number of features. Other methods divide the feature space into several subspaces which can be searched more easily.

There are basically two types of feature selection methods: filter and wrapper [12]. Filter methods select the best features according to some prior knowledge, without considering the bias of the subsequent induction algorithm; these methods therefore operate independently of the classification algorithm and its error criteria.

ii) In feature extraction, most methods are supervised. These approaches need some prior knowledge and labelled training samples. Two kinds of supervised methods are used: linear feature extraction and nonlinear feature extraction. Linear feature extraction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), projection pursuit, and Independent Component Analysis (ICA). Nonlinear feature extraction methods include kernel PCA, the PCA network, nonlinear PCA, the nonlinear auto-associative network, Multi-Dimensional Scaling (MDS), the Self-Organizing Map (SOM), and so forth.
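
As a small illustration of linear feature extraction (a sketch with invented data, not an algorithm taken from the references above), the code below implements PCA through a singular value decomposition and projects a set of hypothetical feature vectors onto the top principal components.

```python
import numpy as np

def pca(X, n_components):
    """Project feature vectors onto the top principal components (linear feature extraction)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions (eigenvectors of the covariance matrix).
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T, components

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))            # 200 hypothetical 10-D feature vectors
X_reduced, components = pca(X, n_components=3)
print(X_reduced.shape)                    # (200, 3)
```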

4.1.1c. Classifiers design:

After an optimal feature subset is selected, a classifier can be designed using various approaches. Roughly speaking, there are three different approaches [1]. The first approach is the simplest and most intuitive one, based on the concept of similarity; template matching is an example. The second is a probabilistic approach, which includes methods based on the Bayes decision rule and maximum likelihood or density estimators. Three well-known methods are the K-nearest neighbor (KNN) classifier, the Parzen window classifier, and branch-and-bound (BnB) methods. The third approach is to construct decision boundaries directly by optimizing a certain error criterion. Examples are Fisher's linear discriminant, multilayer perceptrons, decision trees, and support vector machines [13].
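
Of the methods listed under the second approach, the K-nearest neighbor rule is the simplest to sketch. The toy implementation below (training points and labels are invented for the example) assigns a test vector the majority label among its k nearest training samples.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Label `x` by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["classA", "classA", "classB", "classB"])
print(knn_classify(np.array([0.95, 1.05]), X_train, y_train, k=3))  # -> classB
```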

4.1.1d. Optimization:

Optimization is not a separate step; it is combined with several parts of the pattern recognition process. In pre-processing, optimization ensures that the input patterns have the best quality [13]. Then, in the feature selection and extraction part, optimal feature subsets are obtained by means of optimization techniques. Furthermore, the final classification error rate is lowered in the classification part.

4.1.2. Steps in statistical pattern recognition:

i) Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.

ii) Data collection: making measurements on appropriate variables and recording details of the data collection procedure (ground truth).

iii) Initial examination of the data: checking the data, calculating summary statistics and producing plots in order to get a feel for the structure.

iv) Feature selection or feature extraction: selecting variables from the measured set that are appropriate for the task. These new variables may be obtained by a linear or nonlinear transformation of the original set (feature extraction). To some extent, the division between feature extraction and classification is artificial.

v) Unsupervised pattern classification or clustering: this may be viewed as exploratory data analysis and may provide a successful conclusion to a study. On the other hand, it may be a means of pre-processing the data for a supervised classification procedure.

vi) Applying discrimination or regression procedures as appropriate: the classifier is designed using a training set of exemplar patterns.

vii) Assessment of results: this may involve applying the trained classifier to an independent test set of labeled patterns.

viii) Interpretation: the above is necessarily an iterative process; the analysis of the results may pose further hypotheses that require further data collection. Also, the cycle may be terminated at different stages: the questions posed may be answered by an initial examination of the data, or it may be discovered that the data cannot answer the initial question and the problem must be reformulated.

A block diagram of the canonic pattern recognition approach to speech recognition is shown in Figure 4. The recognition process has four steps, namely:

1. Parameter Estimation: in which a sequence of measurements is made on the input signal to define the test pattern. For speech signals the feature measurements are usually the output of some type of spectral analysis technique, such as a filter bank analyzer, a linear predictive coding analysis, or a discrete Fourier transform (DFT) analysis.

2. Pattern Training: in which one or more test patterns corresponding to speech sounds of the same class are used to create a pattern representative of the features of that class. The resulting pattern, generally called the reference pattern, can be an exemplar or template derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the reference pattern.

3. Pattern Comparison: in which the unknown test pattern is compared with each (sound) class reference pattern and a measure of similarity (distance) between the test pattern and each reference pattern is computed. To compare speech patterns (which consist of sequences of spectral vectors), we require both a local distance measure, in which the local distance
is defined as the spectral "distance" between two well-defined spectral vectors, and a global time alignment procedure (often called a dynamic time warping algorithm), which compensates for different rates of speaking (time scales) between the two patterns.

4. Decision Logic: in which the reference pattern similarity scores are used to decide which reference pattern (or possibly which sequence of reference patterns) best matches the unknown test pattern. The factors that distinguish different pattern recognition approaches are the types of feature measurement, the choice of templates or models for the reference patterns, and the method used to create reference patterns and classify unknown test patterns.

Figure 4. Pattern recognition approach to speech recognition

4.1.3. Pattern recognition approach:

The four best known pattern recognition approaches are: i) the template approach, ii) the statistical approach, iii) the syntactic or structural approach, and iv) the neural network approach. These models are not necessarily independent, and sometimes the same pattern recognition methods exist with different interpretations. Attempts have been made to design hybrid systems involving multiple models [75]. A brief description and comparison is given below and discussed in Table 1.

TABLE 1
Pattern recognition models

4.1.4. Examples of pattern recognition applications:

Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally demanding. These applications include data mining, bioinformatics, etc., as shown in Table 2.

TABLE 2
Examples of pattern recognition applications

5. Template based approach:

One of the simplest and earliest approaches to pattern recognition is the template approach. Matching is a generic operation in pattern recognition which is used to determine the similarity between two entities of the same type. In template matching, a template or prototype of the pattern to be recognized is available. The pattern to be recognized is matched against the stored template, taking into account all allowable pose and scale changes.

The major pattern recognition techniques for speech<br />

recognition are template method and Dynamic Time warp<strong>in</strong>g<br />

method(DTW). Template based approaches to speech<br />

recognition have provided a family of techniques that have<br />

advanced the field considerably dur<strong>in</strong>g the last six decades.<br />

The underly<strong>in</strong>g idea is simple. A collection of prototypical<br />

speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then

carried out by match<strong>in</strong>g an unknown spoken utterance with<br />

each of these reference templates and select<strong>in</strong>g the category of<br />

the best match<strong>in</strong>g pattern. Usually templates for entire words<br />

are constructed. This has the advantage that, errors due to<br />

segmentation or classification of smaller acoustically more<br />

variable units such as phonemes can be avoided. In turn, each<br />

word must have its own full reference template; template<br />

preparation and match<strong>in</strong>g become prohibitively expensive or<br />

impractical as vocabulary size <strong>in</strong>creases beyond a few<br />

hundred words. One key idea <strong>in</strong> template method is to derive<br />

typical sequences of speech frames for a pattern (a word) via<br />

some averag<strong>in</strong>g procedure, and to rely on the use of local<br />

spectral distance measures to compare patterns. Another key<br />

idea is to use some form of dynamic programm<strong>in</strong>g to<br />

temporally align patterns to account for differences in

speak<strong>in</strong>g rates across talkers as well as across repetitions of<br />

the word by the same talker.<br />


5.1. Introduction:<br />

A template is the representation of an actual segment of<br />

speech. It consists of a sequence of consecutive acoustic<br />

feature vectors (or frames), a transcription of the sounds or<br />

words it represents (typically one or more phonetic symbols),<br />

knowledge of neighbour<strong>in</strong>g templates (a template number if<br />

no templates overlap), and a tag with meta-<strong>in</strong>formation. The<br />

term template is often <strong>used</strong> for two fundamentally different<br />

concepts: either for the representation of a s<strong>in</strong>gle segment of<br />

speech with a known transcription, or for some sort of<br />

average of a number of different segments of speech. Both<br />

types of templates can be <strong>used</strong> <strong>in</strong> the DTW algorithm to<br />

compare them with a segment of <strong>in</strong>put speech. Us<strong>in</strong>g the latter<br />

type has the obvious advantage of reduc<strong>in</strong>g the number of<br />

templates and be<strong>in</strong>g more robust to outliers [14]. However,<br />

the averag<strong>in</strong>g is a model build<strong>in</strong>g step, which makes it more<br />

ak<strong>in</strong> to HMMs than to true example based recognition.<br />

Template based approach to speech recognition have provided<br />

a family of techniques that have advanced the field<br />

considerably dur<strong>in</strong>g the last six decades. The underly<strong>in</strong>g idea<br />

is simple. A collection of prototypical speech patterns are<br />

stored as reference patterns represent<strong>in</strong>g the dictionary of<br />

candidate’s words. <strong>Recognition</strong> is then carried out by<br />

match<strong>in</strong>g an unknown spoken utterance with each of these<br />

reference templates and select<strong>in</strong>g the category of the best<br />

match<strong>in</strong>g pattern. Usually templates for entire words are<br />

constructed. Template preparation and match<strong>in</strong>g become<br />

prohibitively expensive or impractical as vocabulary size<br />

<strong>in</strong>creases beyond a few hundred words. One key idea <strong>in</strong><br />

template method is to derive typical sequences of speech<br />

frames for a pattern (a word) via some averag<strong>in</strong>g procedure,<br />

and to rely on the use of local spectral distance measures to<br />

compare patterns. Another key idea is to use some form of<br />

dynamic programming to temporally align patterns to

account for differences <strong>in</strong> speak<strong>in</strong>g rates across talkers as well<br />

as across repetitions of the word by the same talker.<br />

5.1.2 Similarity and Distance methods <strong>used</strong> <strong>in</strong> Template<br />

approach:<br />

The first type of classifier uses the similarity between patterns to decide on a good classification. First, similarity has to be defined. The nearest mean classifier represents the features of a class as a vector and represents the class by the mean of those feature vectors. Thus, any unlabeled feature vector is assigned to the class with the nearest mean value. Template matching uses a template to define class labels and tries to find the most similar template for classification. Another important classifier of this type uses the Nearest Neighbor (NN) algorithm [15, 16]. The data are represented as points in space, and classification is based on the Euclidean distance of the data to the labeled classes. For k-NN, the classifier checks the k nearest points and decides in favor of the majority.

5.2 Advantages and Disadvantages of Template Method:<br />

a) Advantages:<br />

1. An <strong>in</strong>tr<strong>in</strong>sic advantage of template based recognition<br />

is that, it is not required to model the speech process.<br />

This is very convenient, s<strong>in</strong>ce our understand<strong>in</strong>g of<br />

speech is still limited, especially with respect to its<br />

transient nature.<br />

2. The ma<strong>in</strong> advantage is precisely the use of long<br />

temporal context: all the frames of the keyword<br />

template, as well as the <strong>in</strong>formation about their<br />

relative position, are <strong>used</strong> dur<strong>in</strong>g the Dynamic Time<br />

Warp<strong>in</strong>g (DTW) procedure. This provides an implicit<br />

model<strong>in</strong>g of co-articulation effects or speaker<br />

dependencies [9].<br />

3. This has the advantage that, errors due to<br />

segmentation or classification of smaller acoustically<br />

more variable units such as phonemes can be avoided.<br />

b) Disadvantages:<br />

1) Template match<strong>in</strong>g approaches fail to take advantage<br />

of large amount of tra<strong>in</strong><strong>in</strong>g data.<br />

2) They cannot model acoustic variabilities, except <strong>in</strong> a<br />

coarse way by assign<strong>in</strong>g multiple templates to each<br />

word;<br />

3) In practice they are limited to whole-word models,<br />

because it's hard to record or segment a sample<br />

shorter than a word - so templates are useful only <strong>in</strong><br />

small systems which can afford the luxury of us<strong>in</strong>g<br />

whole-word models.<br />

4) Each word must have its own full reference template;<br />

template preparation and match<strong>in</strong>g become<br />

prohibitively expensive or impractical as vocabulary<br />

size <strong>in</strong>creases beyond a few hundred words.<br />

5) It is difficult to test on large data sets.
6) A template must be supplied for each pattern.

5.3. Applications of template method:<br />

i) A multi-scale template method for shape detection with biomedical<br />

applications<br />

ii) Template match<strong>in</strong>g framework for detect<strong>in</strong>g geometrically<br />

transformed objects.<br />

iii) Template matching is one way of performing operations such as object recognition, identification or classification, and detection. Various template matching methods are described in the literature, but they vary from application to application; no standard method has been developed yet.

iv) Adaptive template-match<strong>in</strong>g method for vessel wall<br />

boundary detection <strong>in</strong> brachial artery ultrasound (US) scans.<br />

6. Dynamic Time Warp<strong>in</strong>g(DTW):<br />

Dynamic time warp<strong>in</strong>g is an algorithm for measur<strong>in</strong>g<br />

similarity between two sequences which may vary <strong>in</strong> time or<br />

speed. For <strong>in</strong>stance, similarities <strong>in</strong> walk<strong>in</strong>g patterns would be<br />

detected, even if <strong>in</strong> one video, the person was walk<strong>in</strong>g slowly<br />


and if <strong>in</strong> another, he or she were walk<strong>in</strong>g more quickly, or<br />

even if there were accelerations and decelerations dur<strong>in</strong>g the<br />

course of one observation. A well known application has been<br />

automatic speech recognition, to cope with different speak<strong>in</strong>g<br />

speeds. In general, DTW is a method that allows a computer

to f<strong>in</strong>d an optimal match between two given sequences (e.g.<br />

time series) with certa<strong>in</strong> restrictions. The sequences are<br />

"warped" non-l<strong>in</strong>early <strong>in</strong> the time dimension to determ<strong>in</strong>e a<br />

measure of their similarity <strong>in</strong>dependent of certa<strong>in</strong> non-l<strong>in</strong>ear<br />

variations <strong>in</strong> the time dimension. This sequence alignment<br />

method is often <strong>used</strong> <strong>in</strong> the context of hidden Markov models.<br />

One example of the restrictions imposed on the match<strong>in</strong>g of<br />

the sequences is on the monotonicity of the mapp<strong>in</strong>g <strong>in</strong> the<br />

time dimension. Cont<strong>in</strong>uity is less important <strong>in</strong> DTW than <strong>in</strong><br />

other pattern match<strong>in</strong>g algorithms; DTW is an algorithm<br />

particularly suited to match<strong>in</strong>g sequences with miss<strong>in</strong>g<br />

<strong>in</strong>formation, provided there are long enough segments for<br />

match<strong>in</strong>g to occur. The optimization process is performed<br />

us<strong>in</strong>g dynamic programm<strong>in</strong>g, hence the name.<br />

Moreover, with<strong>in</strong> a word, there will be variation <strong>in</strong> the length<br />

of <strong>in</strong>dividual phonemes: Cassidy might be uttered with a long<br />

/A/ and short f<strong>in</strong>al /i/ or with a short /A/ and long /i/. The<br />

match<strong>in</strong>g process needs to compensate for length differences<br />

and take account of the non-l<strong>in</strong>ear nature of the length<br />

differences with<strong>in</strong> the words. The Dynamic Time Warp<strong>in</strong>g<br />

algorithm achieves this goal; it f<strong>in</strong>ds an optimal match<br />

between two sequences of feature vectors which allows for<br />

stretched and compressed sections of the sequence.<br />

6.1. Concepts of Dynamic Time Warp<strong>in</strong>g:<br />

Dynamic Time Warp<strong>in</strong>g is a pattern match<strong>in</strong>g algorithm<br />

with a non-l<strong>in</strong>ear time normalization effect. It is based on<br />

Bellman's pr<strong>in</strong>ciple of optimality [17], which implies that,<br />

given an optimal path w from A to B and a po<strong>in</strong>t C ly<strong>in</strong>g<br />

somewhere on this path, the path segments AC and CB are<br />

optimal paths from A to C and from C to B respectively. The<br />

dynamic time warp<strong>in</strong>g algorithm [18] creates an alignment<br />

between two sequences of feature vectors, (T 1 , T 2 ,.....T N ) and<br />

(S 1 , S 2 ,....,S M ). A distance d(i, j) can be evaluated between any<br />

two feature vectors Ti and Sj . This distance is referred to as<br />

the local distance. In DTW the global distance D(i,j) of any<br />

two feature vectors Ti and Sj is computed recursively by<br />

add<strong>in</strong>g its local distance d(i,j) to the evaluated global distance<br />

for the best predecessor. The best predecessor is the one that<br />

gives the m<strong>in</strong>imum global distance D(i,j) at row i and column<br />

j:<br />

D(i, j) = d(i, j) + \min\left[\, D(i-1, j),\; D(i-1, j-1),\; D(i, j-1) \,\right] \qquad (1)

The computational complexity can be reduced by impos<strong>in</strong>g<br />

constra<strong>in</strong>ts that prevent the selection of sequences that cannot<br />

be optimal [18]. Global constra<strong>in</strong>ts affect the maximal overall<br />

stretch<strong>in</strong>g or compression. Local constra<strong>in</strong>ts affect the set of<br />

predecessors from which the best predecessor is chosen.<br />

Dynamic Time Warp<strong>in</strong>g (DTW) is <strong>used</strong> to establish a time<br />

scale alignment between two patterns. It results <strong>in</strong> a time<br />

warp<strong>in</strong>g vector w, describ<strong>in</strong>g the time alignment of segments<br />

of the two signals; it assigns a certain segment of the source

signal to each of a set of regularly spaced synthesis <strong>in</strong>stants <strong>in</strong><br />

the target signal.<br />

6.1.1. The DTW Grid:

We can arrange the two sequences of observations on the<br />

sides of a grid (Figure 5) with the unknown sequence on the<br />

bottom (six observations <strong>in</strong> the example) and the stored<br />

template up the left hand side (eight observations). Both<br />

sequences start on the bottom left of the grid. Inside each cell<br />

a distance measure is <strong>used</strong> for compar<strong>in</strong>g the correspond<strong>in</strong>g<br />

elements of the two sequences.<br />

Figure 5. An example DTW grid<br />

To f<strong>in</strong>d the best match between these two sequences we can<br />

f<strong>in</strong>d a path through the grid which m<strong>in</strong>imizes the total distance<br />

between them. The path shown <strong>in</strong> blue <strong>in</strong> Figure 5 gives an<br />

example. Here, the first and second elements of each sequence<br />

match together while the third element of the <strong>in</strong>put also<br />

matches best aga<strong>in</strong>st the second element of the stored pattern.<br />

This corresponds to a section of the stored pattern be<strong>in</strong>g<br />

stretched <strong>in</strong> the <strong>in</strong>put. Similarly, the fourth element of the<br />

<strong>in</strong>put matches both the second and third elements of the stored<br />

sequence: here a section of the stored sequence has been<br />

compressed <strong>in</strong> the <strong>in</strong>put sequence. Once an overall best path<br />

has been found the total distance between the two sequences<br />

can be calculated for this stored template.<br />

The procedure for comput<strong>in</strong>g this overall distance measure is<br />

to f<strong>in</strong>d all possible routes through the grid and for each one of<br />

these compute the overall distance. The overall distance is<br />

given <strong>in</strong> Sakoe and Chiba, Equation 1, as the m<strong>in</strong>imum of the<br />

sum of the distances between <strong>in</strong>dividual elements on the path<br />

divided by the sum of the warping function. The division is to

make paths of different lengths comparable.<br />

It should be apparent that for any reasonably sized sequences,<br />

the number of possible paths through the grid will be very<br />

large. In addition, many of the distance measures could be<br />

avoided s<strong>in</strong>ce the first element of the <strong>in</strong>put is unlikely to<br />

match the last element of the template for example. The DTW<br />

algorithm is designed to exploit some observations about the<br />


likely solution to make the comparison between sequences<br />

more efficient.<br />

6.1.2. Optimization in DTW:

The major optimizations to the DTW algorithm arise from<br />

observations on the nature of good paths through the grid.<br />

These are outl<strong>in</strong>ed <strong>in</strong> Sakoe and Chiba and can be summarized<br />

as:<br />

• Monotonic condition: the path will not turn back on<br />

itself, both the i and j <strong>in</strong>dexes either stay the same or<br />

<strong>in</strong>crease, they never decrease.<br />

• Cont<strong>in</strong>uity condition: The path advances one step at a<br />

time. Both i and j can only <strong>in</strong>crease by 1 on each step<br />

along the path.<br />

• Boundary condition: the path starts at the bottom left<br />

and ends at the top right.<br />

• Adjustment w<strong>in</strong>dow condition: a good path is<br />

unlikely to wander very far from the diagonal. The<br />

distance that the path is allowed to wander is the<br />

w<strong>in</strong>dow length r.<br />

• Slope constraint condition: the path should not be too steep or too shallow. This prevents very short sequences matching very long ones. The condition is expressed as a ratio P = n/m, where n is the number of steps in the x direction and m is the number in the y direction. After m steps in x you must make a step in y, and vice versa.

By apply<strong>in</strong>g these observations it can restrict the moves that<br />

can be made from any po<strong>in</strong>t <strong>in</strong> the path and so restrict the<br />

number of paths that need to be considered. For example, with<br />

a slope constra<strong>in</strong>t of P=1, if a path has already moved one<br />

square up it must next move either diagonally or to the<br />

right.The power of the DTW algorithm goes beyond these<br />

observations though. Instead of f<strong>in</strong>d<strong>in</strong>g all possible routes<br />

through the grid which satisfy these constra<strong>in</strong>ts, the DTW<br />

algorithm works by keep<strong>in</strong>g track of the cost of the best path<br />

to each po<strong>in</strong>t <strong>in</strong> the grid. Dur<strong>in</strong>g the match process there will<br />

be no idea about the lowest cost path ; but this can be traced<br />

back when we reach the end po<strong>in</strong>t.<br />
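The following Python sketch, offered as an illustration rather than a definitive implementation, applies the recursion of equation (1) with the monotonicity and continuity moves and an optional Sakoe-Chiba adjustment window of half-width r; the function name and toy sequences are illustrative, and no slope constraint or path-length normalization is included:

```python
import numpy as np

def dtw_distance(T, S, r=None):
    """Global DTW distance between feature sequences T (N x d) and S (M x d).

    D(i, j) = d(i, j) + min(D(i-1, j), D(i-1, j-1), D(i, j-1)),
    optionally restricted to a Sakoe-Chiba band of half-width r
    (if used, r should be at least |N - M| so the end point stays reachable).
    """
    N, M = len(T), len(S)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        lo, hi = (1, M) if r is None else (max(1, i - r), min(M, i + r))
        for j in range(lo, hi + 1):
            d = np.linalg.norm(T[i - 1] - S[j - 1])     # local distance
            D[i, j] = d + min(D[i - 1, j],              # stretch
                              D[i - 1, j - 1],          # diagonal
                              D[i, j - 1])              # compress
    return D[N, M]

# Illustrative use: two short "spectral vector" sequences of different length.
T = np.array([[0.0], [1.0], [2.0], [3.0]])
S = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])
print(dtw_distance(T, S))   # small distance despite the different lengths
```

In practice the accumulated distance is normally also divided by the warping-function length, as described above, so that paths of different lengths remain comparable.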

6.2. Advantages and Disadvantages of Dynamic Time<br />

Warp<strong>in</strong>g:<br />

a)Advantages:<br />

1) Works well for a small number of templates.



7. Supervised versus unsupervised <strong>Classification</strong>/Learn<strong>in</strong>g<br />

<strong>Techniques</strong>:<br />

There are two ma<strong>in</strong> divisions of classification procedure <strong>in</strong><br />

pattern recognition: supervised classification (or<br />

discrim<strong>in</strong>ation) and unsupervised classification (sometimes <strong>in</strong><br />

the statistics literature simply referred to as classification or<br />

cluster<strong>in</strong>g). In supervised classification a set of data samples<br />

(each consist<strong>in</strong>g of measurements on a set of variables) with<br />

associated labels, the class types. These are <strong>used</strong> as exemplars<br />

<strong>in</strong> the classifier design. In unsupervised classification, the data<br />

are not labeled and <strong>in</strong>tended to f<strong>in</strong>d groups <strong>in</strong> the data and the<br />

features that dist<strong>in</strong>guish one group from another. Cluster<strong>in</strong>g<br />

techniques can also be <strong>used</strong> as part of a supervised<br />

classification scheme by def<strong>in</strong><strong>in</strong>g prototypes. A cluster<strong>in</strong>g<br />

scheme may be applied to the data for each class separately<br />

and representative samples for each group with<strong>in</strong> the class is<br />

(the group means, for example) <strong>used</strong> as the prototypes for that<br />

class.<br />
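A minimal sketch of this prototype-based use of clustering, assuming scikit-learn's KMeans is available, might look as follows; the data, the number of prototypes per class and the function names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def class_prototypes(X, y, per_class=2):
    """Cluster each class separately; the cluster means become its prototypes."""
    protos = {}
    for label in np.unique(y):
        km = KMeans(n_clusters=per_class, n_init=10, random_state=0)
        km.fit(X[y == label])
        protos[label] = km.cluster_centers_
    return protos

def nearest_prototype(x, protos):
    """Assign x to the class owning the closest prototype."""
    return min(protos, key=lambda c: np.linalg.norm(protos[c] - x, axis=1).min())

# Illustrative data: two classes, each made of two sub-groups.
X = np.array([[0, 0], [0.2, 0.1], [3, 3], [3.1, 2.9],
              [10, 10], [10.2, 9.9], [13, 13], [12.9, 13.1]], dtype=float)
y = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

protos = class_prototypes(X, y, per_class=2)
print(nearest_prototype(np.array([2.8, 3.2]), protos))   # -> A
```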

7.1. Supervised Learn<strong>in</strong>g:<br />

In automatic pattern recognition, the term supervised<br />

learn<strong>in</strong>g/classification refers to the process of design<strong>in</strong>g a<br />

pattern classifier by us<strong>in</strong>g a tra<strong>in</strong><strong>in</strong>g set of patterns of known<br />

class to determ<strong>in</strong>e the choice of a specific decision mak<strong>in</strong>g<br />

technique for classify<strong>in</strong>g additional similar samples <strong>in</strong> future.<br />

The classifier, in other words, is designed using the training data. To provide an unprejudiced estimate of the classifier's accuracy on new data, it must be tested on a separate test set of patterns for which the class of each pattern is known.

Supervised learn<strong>in</strong>g is fairly common <strong>in</strong> classification<br />

problems because the goal is often to get the computer to learn<br />

a classification system that it has created. In the supervised<br />

learn<strong>in</strong>g process, two types parametric analysis is done<br />

namely parametric and non parametric decision mak<strong>in</strong>g<br />

methods or classification methods.<br />

7.1.1.There are several ways <strong>in</strong> which the standard supervised<br />

learn<strong>in</strong>g problem can be generalized:<br />

1. Semi-supervised learn<strong>in</strong>g: In this sett<strong>in</strong>g, the desired<br />

output values are provided only for a subset of the<br />

tra<strong>in</strong><strong>in</strong>g data. The rema<strong>in</strong><strong>in</strong>g data is unlabeled.<br />

2. Active learn<strong>in</strong>g: Instead of assum<strong>in</strong>g that all of the<br />

tra<strong>in</strong><strong>in</strong>g examples are given at the start, active<br />

learn<strong>in</strong>g algorithms <strong>in</strong>teractively collect new<br />

examples, typically by mak<strong>in</strong>g queries to a human<br />

user. Often, the queries are based on unlabeled data,<br />

which is a scenario that comb<strong>in</strong>es semi-supervised<br />

learn<strong>in</strong>g with active learn<strong>in</strong>g.<br />

3. Structured prediction: When the desired output value<br />

is a complex object, such as a parse tree or a labeled<br />

graph, then standard methods must be extended.<br />

4. Learn<strong>in</strong>g to rank: When the <strong>in</strong>put is a set of objects<br />

and the desired output is a rank<strong>in</strong>g of those objects,<br />

then aga<strong>in</strong> the standard methods must be extended.<br />


7.2. Advantages and Disadvantages of supervised learn<strong>in</strong>g:<br />

a)Advantages:<br />

1) Rules are written for you automatically, which is useful for large document sets.
b) Disadvantages:
1) It assigns documents to categories before generating the rules.
2) Rules may not be as specific or accurate as those you would write yourself.
3) It is prone to overfitting.

7.3. Challenges <strong>in</strong> supervised learn<strong>in</strong>g:<br />

An important aspect of the classification problem is the goal of the learning, which is to minimize the error with respect to the given inputs. These inputs, called the "training set", are the examples from which the agent tries to learn. But learning the training set well is not necessarily the best thing to do: not all training sets will have their inputs classified correctly. This can lead to problems if the algorithm used is powerful enough to memorize even the apparently "special cases" that don't fit the more general principles. This, too, can lead to overfitting, and it is a challenge to find algorithms that are both powerful enough to learn complex functions and robust enough to produce generalizable results.

7.4. Applications:<br />

• Bio<strong>in</strong>formatics<br />

• Database market<strong>in</strong>g<br />

• Handwrit<strong>in</strong>g recognition<br />

• Information retrieval<br />

o Learn<strong>in</strong>g to rank<br />

• Object recognition <strong>in</strong> computer vision<br />

• Optical character recognition<br />

• Spam detection<br />

• Pattern recognition<br />

• <strong>Speech</strong> recognition<br />

• Forecast<strong>in</strong>g Fraudulent F<strong>in</strong>ancial Statements<br />

8. Introduction to parametric representation:<br />

Parametric representation - Parametric statistics is a branch<br />

of statistics that assumes data come from a type of probability<br />

distribution and makes <strong>in</strong>ferences about the parameters of the<br />

distribution. Most well-known elementary statistical methods<br />

are parametric. Generally speak<strong>in</strong>g parametric methods make<br />

more assumptions than non-parametric methods. If those extra<br />

assumptions are correct, parametric methods can produce<br />

more accurate and precise estimates. They are said to have<br />

more statistical power. However, if those assumptions are<br />

<strong>in</strong>correct, parametric methods can be very mislead<strong>in</strong>g. For that<br />

reason they are often not considered robust. On the other hand,<br />

parametric formulae are often simpler to write down and<br />

faster to compute. In some, but def<strong>in</strong>itely not all cases, their<br />

simplicity makes up for their non-robustness, especially if<br />

care is taken to exam<strong>in</strong>e diagnostic statistics. Parametric<br />

decision mak<strong>in</strong>g refers to the situation <strong>in</strong> which we know or<br />


will<strong>in</strong>g to assume the general form of the probability<br />

distribution function or density function for each class but not<br />

the values of the parameters such as the mean and variance.<br />

Before us<strong>in</strong>g these densities the values of the parameters have<br />

to be estimated.<br />

Most important parametric method <strong>used</strong> <strong>in</strong> speech recognition<br />

application is the hidden Markov Model.<br />

Stochastic model<strong>in</strong>g [97] entails the use of probabilistic<br />

models to deal with uncerta<strong>in</strong> or <strong>in</strong>complete <strong>in</strong>formation. In<br />

speech recognition, uncerta<strong>in</strong>ty and <strong>in</strong>completeness arise from<br />

many sources; for example, confusable sounds, speaker<br />

variability s, contextual effects, and homophones words. Thus,<br />

stochastic models are particularly suitable approach to speech<br />

recognition. The most popular stochastic approach today is<br />

hidden Markov model<strong>in</strong>g. A hidden Markov model is<br />

characterized by a f<strong>in</strong>ite state markov model and a set of<br />

output distributions. The transition parameters <strong>in</strong> the Markov<br />

cha<strong>in</strong> models, temporal variabilities, while the parameters <strong>in</strong><br />

the output distribution model, spectral variabilities. These two<br />

types of variabilities are the essence of speech recognition.<br />

8.1.Hidden Markov Model (statistical approach):<br />

Hidden Markov Models (HMMs) have dom<strong>in</strong>ated [19]<br />

automatic speech recognition for at least the last decade. The<br />

model’s success lies <strong>in</strong> its mathematical simplicity; efficient<br />

and robust algorithms have been developed to facilitate its<br />

practical implementation. However, there is noth<strong>in</strong>g uniquely<br />

speech-oriented about acoustic-based HMMs. Standard<br />

HMMs model speech as a series of stationary regions <strong>in</strong> some<br />

representation of the acoustic signal. <strong>Speech</strong> is a cont<strong>in</strong>uous<br />

process though, and ideally should be modeled as such.<br />

Furthermore, HMMs assume that state and phone boundaries<br />

are strictly synchronized with events <strong>in</strong> the parameter space,<br />

whereas <strong>in</strong> fact different acoustic and articulator parameters<br />

do not necessarily change value simultaneously at boundaries.<br />

8.1.1.Markov Models<br />

A Markov model is a probabilistic process over a finite set, {S_1, ..., S_k}, usually called its states. Each state transition generates a character from the alphabet of the process. We are interested in matters such as the probability of a given state coming up next, Pr(x_t = S_i), which may depend on the prior history up to t-1. In computing, such processes, if they are reasonably complex and interesting, are usually called Probabilistic Finite State Automata (PFSA) or Probabilistic Finite State Machines (PFSM) because of their close links to the deterministic and non-deterministic finite state automata used in formal language theory.

8.1.2. Types of Hidden Markov Models<br />

8.1.2a. Discrete HMMs:<br />

HMMs can be classified accord<strong>in</strong>g to the nature of the<br />

elements of the B matrix, which are distribution functions.<br />

Distributions are def<strong>in</strong>ed on f<strong>in</strong>ite spaces <strong>in</strong> the so called<br />

discrete HMMs. In this case, observations are vectors of<br />

symbols <strong>in</strong> a f<strong>in</strong>ite alphabet of N different elements. For each<br />

one of the Q vector components, a discrete density<br />

{w(k)/k=1,….N} is def<strong>in</strong>ed, and the distribution is obta<strong>in</strong>ed<br />

by multiply<strong>in</strong>g the probabilities of each component. Notice<br />

that this def<strong>in</strong>ition assumes that the different components are<br />

<strong>in</strong>dependent. Fig.6 shows an example of a discrete HMM with<br />

one-dimensional observations. Distributions are associated<br />

with model transitions.<br />

Figure 6: Example of a discrete HMM. A transition<br />

probability and an output distribution on the symbol set are

associated with every transition.<br />

8.1.2b. Continuous HMMs:

Another possibility is to def<strong>in</strong>e distributions as probability<br />

densities on cont<strong>in</strong>uous observation spaces. In this case,<br />

strong restrictions have to be imposed on the functional form<br />

of the distributions, <strong>in</strong> order to have a manageable number of<br />

statistical parameters to estimate. The most popular approach<br />

is to characterize the model transitions with mixtures of base<br />

densities g of a family G hav<strong>in</strong>g a simple parametric form.<br />

The base densities g є G are usually Gaussian or Laplacian,<br />

and can be parameterized by the mean vector and the<br />

covariance matrix. HMMs with these k<strong>in</strong>ds of distributions<br />

are usually referred to as continuous HMMs. In order to model complex distributions in this way, a large number of base densities has to be used in every mixture. This may require a very large training corpus for the estimation of the distribution parameters. Problems arising when the available corpus is not large enough can be alleviated by sharing distributions among the transitions of different models.
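As an illustration of such mixture output distributions, the following sketch evaluates a continuous-HMM emission density b_j(x) as a weighted sum of diagonal-covariance Gaussians; the weights, means and variances are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Diagonal-covariance Gaussian density N(x; mean, diag(var))."""
    d = len(mean)
    norm = (2 * np.pi) ** (-d / 2) * np.prod(var) ** (-0.5)
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

def mixture_emission(x, weights, means, variances):
    """b_j(x) = sum_k w_jk * N(x; mu_jk, Sigma_jk) for one HMM state j."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Illustrative 2-component mixture for a single state (2-D feature vectors).
weights   = [0.6, 0.4]
means     = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(mixture_emission(np.array([0.5, 0.2]), weights, means, variances))
```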

8.1.2c. Semi-Cont<strong>in</strong>uous HMMs :<br />

In semi-cont<strong>in</strong>uous HMMs, all mixtures are expressed <strong>in</strong> terms<br />

of a common set of base densities. Different mixtures are<br />

characterized only by different weights. A common<br />

generalization of semi-continuous modeling consists of interpreting the input vector y as composed of several components, each of which is associated

with a different set of base distributions. The components are<br />

assumed to be statistically <strong>in</strong>dependent; hence the<br />

distributions associated with model transitions are products of<br />

the component density functions. Computation of probabilities<br />

with discrete models is faster than with cont<strong>in</strong>uous models,<br />


nevertheless, it is possible to speed up the mixture density computation by applying vector quantization (VQ) to the Gaussians of the mixtures. Parameters of statistical models are estimated by iterative learning algorithms in which the likelihood of a set of training data is guaranteed to increase at each step.

8.2. HMM Constra<strong>in</strong>ts/Limitations for <strong>Speech</strong> <strong>Recognition</strong><br />

Systems:<br />

HMMs have different constraints depending on the nature of the problem that has to be modeled. The main constraints needed in the implementation of speech recognizers can be summarized in the following assumptions [20]:

1 – First order Markov chain:
The probability of transition to a state depends only on the current state:
P(s_{t+1} = j \mid s_t = i, s_{t-1}, \ldots, s_1) = P(s_{t+1} = j \mid s_t = i)

2 – Stationary states' transition:
This assumption testifies that the state transitions are time independent, and accordingly we will have:
a_{ij} = P(s_{t+1} = j \mid s_t = i) \quad \text{for all } t

3 – Observations independence:
This assumption presumes that the observations emitted within a certain state depend only on the underlying Markov chain of the states, without considering the effect of the occurrence of the other observations. Although this assumption is a poor one and deviates from reality, it works fine in modeling the speech signal. This assumption implies that:
P(o_t \mid o_{t-1}, \ldots, o_{t-p}, s_t) = P(o_t \mid s_t)
where p represents the considered history of the observation sequence. Then we will have:
P(O \mid S, \lambda) = \prod_{t=1}^{T} P(o_t \mid s_t) = \prod_{t=1}^{T} b_{s_t}(o_t)

4 – Left-Right topology constraint:
Transitions are allowed only from a state to itself or to later states, i.e. a_{ij} = 0 for j < i.

5 – Probability constraints:
Since our problem is dealing with probabilities, we have the following extra constraints:
\sum_{j=1}^{N} a_{ij} = 1, \qquad \int b_j(o)\, do = 1, \qquad \sum_{i=1}^{N} \pi_i = 1
If the observations are discrete then the last integration will be a summation.

6 – HMMs are well defined only for processes that are a function of one independent variable, such as time; they do not work satisfactorily for two variables.

7 – The maximum likelihood training criterion used in HMMs leads to poor discrimination between the acoustic models, given limited training data and correspondingly limited models. Discrimination can be improved using the Maximum Mutual Information (MMI) training criterion, but this is more complex and difficult to implement properly. Because HMMs suffer from all these weaknesses, they can obtain good performance only by relying on context-dependent phone models, i.e. tri-phone models.

8.3.Three Basic Problems for HMMs:<br />

There are three basic problems to be solved for HMMs[21].<br />

The parameter estimation problem is to tra<strong>in</strong> speech and<br />

speaker models, the evaluation problem is to compute<br />

likelihood functions for recognition and the decod<strong>in</strong>g<br />

problem is to determine the best fitting (unobservable) state sequence [Rabiner and Juang 1993, Huang et al. 1990].

i)The parameter estimation problem: This problem<br />

determines the optimal model parameters λ of the HMM according to a given optimization criterion. A variant of the EM algorithm, known as the Baum-Welch algorithm, yields an

iterative procedure to re-estimate the model parameters λ<br />

us<strong>in</strong>g the ML criterion [Baum 1972,Baum and Sell<br />

1968,Baum and Eagon 1967]. In the Baum-Welch algorithm,<br />

the unobservable data are the state sequence S and the<br />

observable data are the observation sequence O. The Q-<br />

function for the HMM is as follows<br />

Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda}) \qquad (2)

Computing P(S|O,λ) [Rabiner and Juang 1993, Huang et al. 1990], we obtain

Q(\lambda, \bar{\lambda}) = \sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} P(s_t, s_{t+1} \mid O, \lambda)\, \log\!\left[\bar{a}_{s_t s_{t+1}}\, \bar{b}_{s_{t+1}}(o_{t+1})\right] \qquad (3)

where \bar{\pi}_{s_1} is denoted by \bar{a}_{s_0 s_1} for simplicity. Regrouping eq. (3) into three terms for the π, A and B coefficients, and applying Lagrange multipliers, we obtain the HMM parameter estimation equations.

For the discrete HMM:

\bar{\pi}_i = \gamma_1(i), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad (4)

\bar{b}_j(k) = \frac{\sum_{t=1,\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \qquad (5)

where

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \qquad \xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}

For the continuous HMM, the estimation equations for the π and A distributions are unchanged, but the output distribution B is estimated via the Gaussian mixture parameters represented in eq. (6):

\bar{\omega}_{jk} = \frac{\sum_{t=1}^{T} \eta_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{K} \eta_t(j,k)}, \qquad \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \eta_t(j,k)\, x_t}{\sum_{t=1}^{T} \eta_t(j,k)}, \qquad \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \eta_t(j,k)\,(x_t - \bar{\mu}_{jk})(x_t - \bar{\mu}_{jk})'}{\sum_{t=1}^{T} \eta_t(j,k)} \qquad (6)

where

\eta_t(j,k) = \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \cdot \frac{\omega_{jk}\, N(x_t;\, \mu_{jk}, \Sigma_{jk})}{\sum_{k=1}^{K} \omega_{jk}\, N(x_t;\, \mu_{jk}, \Sigma_{jk})} \qquad (7)

Note that for practical implementation, a scal<strong>in</strong>g procedure<br />

[Rab<strong>in</strong>er and Juang 1993] is required to avoid number<br />

underflow on computers with ord<strong>in</strong>ary float<strong>in</strong>g-po<strong>in</strong>t number<br />

representations.<br />

ii) The evaluation problem: How can we efficiently compute P(O|λ), the probability that the observation sequence O was produced by the model λ?

For solving this problem, we obtain

P(O \mid \lambda) = \sum_{\text{all } S} P(O, S \mid \lambda) = \sum_{s_1, s_2, \ldots, s_T} \pi_{s_1} b_{s_1}(o_1)\, a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T) \qquad (8)

An interpretation of the computation in (8) is the following. At time t = 1 we are in state s_1 with probability \pi_{s_1} and generate the symbol o_1 with probability b_{s_1}(o_1). A transition is then made from state s_1 at time t = 1 to state s_2 at time t = 2 with probability a_{s_1 s_2}, and we generate the symbol o_2 with probability b_{s_2}(o_2). This process continues in this manner until the last transition, at time T, from state s_{T-1} to state s_T, made with probability a_{s_{T-1} s_T}, after which we generate the symbol o_T with probability b_{s_T}(o_T). Figure 7 shows an N-state left-to-right HMM with ∆i set to 1.

Fig.7 The Markov generation Model<br />


To reduce computation, the forward and backward variables are used. The forward variable \alpha_t(i) is defined as

\alpha_t(i) = P(o_1, o_2, \ldots, o_t, s_t = i \mid \lambda),

which can be computed iteratively as

\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N

\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1}), \quad 1 \le j \le N,\ 1 \le t \le T-1 \qquad (9)

and the backward variable \beta_t(i) is defined as

\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid s_t = i, \lambda),

which can be computed iteratively as

\beta_T(i) = 1, \quad 1 \le i \le N

\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ t = T-1, \ldots, 1 \qquad (10)

Using these variables, the probability P(O|λ) can be computed from the forward variable, from the backward variable, or from both, as follows:

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \qquad (11)
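A minimal numpy sketch of the forward and backward recursions (9)-(10), and of the three equivalent expressions for P(O|λ) in (11), is given below; the toy model parameters are illustrative and no scaling is applied:

```python
import numpy as np

# Illustrative 2-state discrete HMM: pi, A (transitions), B (output probs).
pi = np.array([0.8, 0.2])
A  = np.array([[0.6, 0.4],
               [0.3, 0.7]])
B  = np.array([[0.7, 0.3],      # B[i, k] = b_i(v_k)
               [0.2, 0.8]])
O  = [0, 1, 1, 0]               # observation sequence (symbol indices)

N, T = len(pi), len(O)
alpha = np.zeros((T, N))
beta  = np.zeros((T, N))

alpha[0] = pi * B[:, O[0]]                          # alpha_1(i) = pi_i b_i(o_1)
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]  # eq. (9)

beta[T - 1] = 1.0                                   # beta_T(i) = 1
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])    # eq. (10)

# Eq. (11): all three expressions give the same likelihood P(O | lambda).
print(alpha[T - 1].sum())
print((pi * B[:, O[0]] * beta[0]).sum())
print((alpha[1] * beta[1]).sum())                   # any t gives the same value
```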

iii) The decod<strong>in</strong>g Problem: Given the observation sequence<br />

O and the model λ, how do we choose a correspond<strong>in</strong>g state<br />

sequence S that is optimal <strong>in</strong> some sense?<br />

This problem attempts to uncover the hidden part of the model.<br />

There are several possible ways to solve this problem, but the most widely used criterion is to find the single best state sequence, which can be found with the Viterbi algorithm. In practice, it is preferable to base recognition on the maximum likelihood state sequence, since this generalizes easily to the continuous speech case. This likelihood is computed using the same recursion as the forward algorithm, except that the summation is replaced by a maximization.
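A minimal sketch of this decoding step is given below: the same recursion as the forward pass, with the summation replaced by a maximization, followed by back-tracking to recover the best state sequence (the toy parameters match the evaluation sketch above):

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Most likely state sequence for observations O under model (pi, A, B)."""
    N, T = len(pi), len(O)
    delta = np.zeros((T, N))          # best path score ending in state i at t
    psi   = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)            # best predecessor state
            delta[t, j] = scores[psi[t, j]] * B[j, O[t]]
    # Back-track from the best final state.
    path = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), delta[T - 1].max()

pi = np.array([0.8, 0.2])
A  = np.array([[0.6, 0.4], [0.3, 0.7]])
B  = np.array([[0.7, 0.3], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 1, 1, 0]))
```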

Comparison of template and HMM methods:

Compared to template based approach, hidden Markov<br />

model<strong>in</strong>g is more general and has a firmer mathematical<br />

foundation. A template based model is simply a cont<strong>in</strong>uous<br />

density HMM, with identity covariance matrices and a slope<br />

constra<strong>in</strong>ed topology. Although templates can be tra<strong>in</strong>ed on<br />

fewer <strong>in</strong>stances, they lack the probabilistic formulation of full<br />

HMMs and typically underperform HMMs. Compared to<br />

knowledge based approaches; HMMs enable easy <strong>in</strong>tegration<br />

of knowledge sources <strong>in</strong>to a compiled architecture. A negative<br />

side effect of this is that HMMs do not provide much <strong>in</strong>sight<br />

on the recognition process. As a result, it is often difficult to<br />

analyze the errors of an HMM system <strong>in</strong> an attempt to<br />

improve its performance. Nevertheless, prudent <strong>in</strong>corporation<br />

of knowledge has significantly improved HMM based systems.<br />

8.4. Advantages and Disadvantages of HMM:<br />

a) Advantages:<br />

1) One of the most important advantages of HMMs is that

they can easily be extended to deal with strong tasks.<br />

2) In the tra<strong>in</strong><strong>in</strong>g stages, HMMs are dynamically assembled<br />

accord<strong>in</strong>g to the class sequence. For example, if the class<br />

sequence was my hat, then two models for each word<br />

would be l<strong>in</strong>ked, with the last state of the first l<strong>in</strong>k<strong>in</strong>g to<br />

the first state of the second. The re-estimation algorithm<br />

is then applied as usual. Once tra<strong>in</strong><strong>in</strong>g on that <strong>in</strong>stance is<br />

complete, the models are unl<strong>in</strong>ked aga<strong>in</strong>. When<br />

recognition is attempted, large HMMs are assembled<br />

from the smaller <strong>in</strong>dividual models. This is done by<br />

convert<strong>in</strong>g from a grammar <strong>in</strong>to a graph representation,<br />

then replac<strong>in</strong>g each node <strong>in</strong> the graph with the appropriate<br />

model. This process is called ``embedded re-estimation''.<br />

To f<strong>in</strong>d out what the class sequence was, the most<br />

probable path is calculated. The path traversed<br />

corresponds to a sequence of classes, which is our f<strong>in</strong>al<br />

classification.<br />

3) Because each HMM uses only positive data, they scale<br />

well. New words can be added without affect<strong>in</strong>g learnt<br />

HMMs. It is also possible to set up HMMs <strong>in</strong> such a way<br />

that they can learn <strong>in</strong>crementally. As mentioned above,<br />

grammar and other constructs can be built <strong>in</strong>to the system<br />

by us<strong>in</strong>g embedded re-estimation. This gives the<br />

opportunity for the <strong>in</strong>clusion of high-level doma<strong>in</strong><br />

knowledge, which is important for tasks like speech<br />

recognition where a great deal of doma<strong>in</strong> knowledge is<br />

available.<br />

4) Architecture: the basic characteristics of the mathematical framework are useful for speech recognition.
5) Completeness: the underlying approach has advantages over specific knowledge-based approaches.
6) Flexibility: speech knowledge can be incorporated into HMMs in the form of constraints on the basic flexible structure.

b)Disadvantages:<br />

i) They make very large assumptions about the data.<br />

ii) They make the Markovian assumption: that the emission and the transition probabilities depend only on the current state. This has subtle effects; for example, the probability of staying in a given state falls off exponentially.
iii) The Gaussian mixture assumption for continuous-density hidden Markov models is a huge one. We cannot always assume that the values are distributed in a normal manner.

iv) The number of parameters that need to be set <strong>in</strong> an HMM<br />

is huge.<br />


v) As the Viterbi algorithm allocates frames to states, the frames associated with a state can often change, causing further susceptibility in the parameters. Those involved in HMMs

often use the technique of ``parameter-ty<strong>in</strong>g'' to reduce the<br />

number of variables that need to be learnt by forc<strong>in</strong>g the<br />

emission probabilities <strong>in</strong> one state to be the same as those <strong>in</strong><br />

another. For example, if one had two words: cat and mad, then<br />

the parameters of the states associated with the ``a'' sound<br />

could be tied together.<br />

vi) As a result of the above, the amount of data that is required<br />

to tra<strong>in</strong> an HMM is very large. This can be seen by<br />

consider<strong>in</strong>g typical speech recognition corpora that are <strong>used</strong><br />

for tra<strong>in</strong><strong>in</strong>g. The TIMIT database for <strong>in</strong>stance, has a total of<br />

630 readers read<strong>in</strong>g a text; the ISOLET database for isolated<br />

letter recognition has 300 examples per letter. Many other<br />

doma<strong>in</strong>s do not have such large datasets readily available.<br />

vii) HMMs only use positive data to tra<strong>in</strong>. In other words,<br />

HMM tra<strong>in</strong><strong>in</strong>g <strong>in</strong>volves maximiz<strong>in</strong>g the observed probabilities.<br />

viii) In some doma<strong>in</strong>s, the number of states and transitions can<br />

be found us<strong>in</strong>g an educated guess or trial and error, <strong>in</strong> general,<br />

there is no way to determ<strong>in</strong>e this. Furthermore, the states and<br />

transitions depend on the class be<strong>in</strong>g learnt.<br />

ix) The concept learnt by a hidden Markov model is the<br />

emission and transition probabilities. If one is try<strong>in</strong>g to<br />

understand the concept learnt by the hidden Markov model,<br />

then this concept representation is difficult to understand. In<br />

speech recognition, this issue is of little significance, but <strong>in</strong><br />

other doma<strong>in</strong>s, it may be even more important than accuracy.<br />

x) First-order HMMs make Markovian assumptions of conditional dependence (i.e. being in a state depends only upon the previous state).
xi) HMMs are well defined only for processes that are a function of one independent variable, such as time, i.e. one-dimensional processes.

xii) One major limitation of the statistical models is that they<br />

work well only when the underly<strong>in</strong>g assumptions are satisfied.<br />

The effectiveness of these methods depends to a large extent<br />

on the various assumptions or conditions under which the<br />

models are developed.<br />

8.5. Applications:<br />

1) First application of Markov Cha<strong>in</strong>s was made by Andrey<br />

Markov himself <strong>in</strong> the area of language model<strong>in</strong>g.<br />

2) Another example of Markov cha<strong>in</strong>s application <strong>in</strong><br />

l<strong>in</strong>guistics is stochastic language model<strong>in</strong>g.<br />

3) Use of Markov cha<strong>in</strong>s to generate random numbers that<br />

belong exactly to the desired distribution.

4) HMM for f<strong>in</strong>ancial economic applications<br />

5) HMM for Signature verification<br />

6) HMM for <strong>Speech</strong> and speaker recognition<br />

7) Hidden Markov Model <strong>in</strong> Intrusion Detection Systems<br />

8) HMM <strong>in</strong> bio <strong>in</strong>formatics<br />

9) HMM applications <strong>in</strong> bar code read<strong>in</strong>g<br />

10) HMM applications <strong>in</strong> computer vision<br />

9. Non-parameter techniques:<br />

In most real problems, even the types of the density functions<br />

of the <strong>in</strong>terest are unknown. Look<strong>in</strong>g at histograms, scatter<br />

plots or tables of the data, or the application of statistical<br />

procedures may suggest that a particular type of the class<br />

density may be <strong>used</strong>, or they may <strong>in</strong>dicate that the data are not<br />

well fit by any of the standard types of densities or<br />

distributions. In this case, non parametric techniques are<br />

needed. There are different classification methods <strong>in</strong> non<br />

parametric techniques namely vector quantization, Artificial<br />

Neural Network, Support vector mach<strong>in</strong>es, K-Nearest<br />

Neighbor method and Gaussian Mixture Model<strong>in</strong>g methods.<br />

These methods are discussed <strong>in</strong> the follow<strong>in</strong>g sections.<br />

9.1. Advantages and disadvantages <strong>in</strong> Non Parametric<br />

Method:<br />

a) Advantages:<br />

(1) Nonparametric tests make less stringent demands of the

data. For standard parametric procedures to be valid, certa<strong>in</strong><br />

underly<strong>in</strong>g conditions or assumptions must be met,<br />

particularly for smaller sample sizes.<br />

(2) Nonparametric procedures can sometimes be <strong>used</strong> to get a<br />

quick answer with little calculation.<br />

3) Nonparametric methods provide an air of objectivity when<br />

there is no reliable (universally recognized) underly<strong>in</strong>g scale<br />

for the orig<strong>in</strong>al data and there is some concern that the results<br />

of standard parametric techniques would be criticized for their<br />

dependence on an artificial metric.<br />

4) One of the key advantages of non-parametric techniques is<br />

that they do not make any statistical assumptions about data.<br />

b) Disadvantages:
1) The major disadvantage of nonparametric techniques is contained in the name: because the procedures are nonparametric, there are no parameters to describe, and it becomes more difficult to make quantitative statements about the actual difference between populations.
2) The second disadvantage is that nonparametric procedures throw away information. Because information is discarded, nonparametric procedures can never be as powerful (able to detect existing differences) as their parametric counterparts when parametric tests can be used.

9.3. Applications of Non-parametric methods:
1) Speech recognition applications
2) Chi-square applications
3) Efficiency analysis of models


4) Analysis of hedonic models
5) Data mining
6) Clinical applications

10. Vector quantization [5]:
Vector Quantization (VQ) [97] is often applied to ASR. It is a system for mapping a sequence of continuous or discrete vectors into a digital sequence suitable for communication over, or storage in, a digital channel. The goal of this system is data compression: to reduce the bit rate so as to minimize communication channel capacity or digital storage memory requirements while maintaining the necessary fidelity of the data.

10.1. Introduction to vector quantization:
Vector quantization is a classical quantization technique from signal processing which allows the modeling of probability density functions by the distribution of prototype vectors. It was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms.
The density matching property of vector quantization is powerful, especially for identifying the density of large and high-dimensional data. Since data points are represented by the index of their closest centroid, commonly occurring data have low error and rare data have high error. This is why VQ is suitable for lossy data compression. It can also be used for lossy data correction and density estimation. Vector quantization is based on the competitive learning paradigm, so it is closely related to the self-organizing map model.

The result of either filter bank analysis or LPC analysis is a series of vectors characteristic of the time-varying spectral characteristics of the speech signal. The spectral vectors are denoted as v_l, l = 1, 2, ..., L, where each vector is typically a p-dimensional vector. If we compare the information rate of the vector representation to that of the raw speech waveform, we see that the spectral analysis has significantly reduced the required information rate: about 160,000 bps (for example, 10 kHz sampling at 16-bit precision) is required to store the speech samples in uncompressed format. For the spectral analysis, consider vectors of dimension p = 10 using 100 spectral vectors per second. If we again represent each spectral component to 16-bit precision, the required storage is about 100 x 10 x 16 bps, or 16,000 bps, about a 10-to-1 reduction over the uncompressed signal. Such compressions in storage rate are impressive.
Based on the concept of ultimately needing only a single spectral representation for each basic speech unit, it may be possible to further reduce the raw spectral representation of speech to vectors drawn from a small finite number of unique spectral vectors, each corresponding to one of the basic speech units. This ideal representation is of course impractical, because there is so much variability in the spectral properties of each of the basic speech units. However, the concept of building a codebook of distinct analysis vectors, albeit with significantly more code words than the basic set of phonemes, remains an attractive idea and is the basis behind the set of techniques commonly called vector quantization methods. Following this line of reasoning, assume that we require a codebook with about 1024 unique spectral vectors. Then, to represent an arbitrary spectral vector, all we need is a 10-bit number: the index of the codebook vector that best matches the input vector. Assuming a rate of 100 spectral vectors per second, a total bit rate of about 1000 bps is required to represent the spectral vectors of a speech signal. This rate is about 1/16th the rate required by the continuous spectral vectors. Hence the vector quantization representation is potentially an extremely efficient representation of the spectral information in the speech signal.
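As a quick check of the rates quoted above, the following short Python sketch reproduces the arithmetic; the 10 kHz / 16-bit waveform figures are assumptions used only to make the 160,000 bps number concrete:

    # Storage-rate comparison for raw samples, spectral vectors and VQ indices.
    # Assumed figures: 10 kHz sampling, 16-bit samples, 100 spectral vectors/s,
    # 10 components per vector, 16 bits per component, a 1024-entry codebook.

    waveform_bps = 10_000 * 16      # 160,000 bps for uncompressed samples
    spectral_bps = 100 * 10 * 16    # 16,000 bps for continuous spectral vectors
    vq_bps       = 100 * 10         # 1,000 bps for 10-bit codebook indices

    print(f"waveform: {waveform_bps} bps")
    print(f"spectral vectors: {spectral_bps} bps "
          f"({waveform_bps // spectral_bps}:1 reduction)")
    print(f"VQ indices: {vq_bps} bps "
          f"({spectral_bps // vq_bps}:1 further reduction)")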

10.1.1. Elements of a vector quantization implementation:
To build a vector quantizer and implement a VQ analysis procedure we need the following:
• A large set of spectral analysis vectors v_1, v_2, ..., v_L, which forms a training set. The training set is used to create the optimal set of codebook vectors for representing the spectral variability observed in the training set. If we denote the size of the VQ codebook as M = 2^B vectors, then we require L >> M so as to be able to find the best set of M codebook vectors in a robust manner. In practice, it has been found that L should be at least 10M in order to train a VQ codebook that works reasonably well.
• A measure of similarity, or distance, between a pair of spectral analysis vectors, so as to be able to cluster the training set vectors as well as to associate or classify arbitrary spectral vectors into unique codebook entries. We denote the spectral distance d(v_i, v_j) between two vectors v_i and v_j as d_ij. We defer a discussion of spectral distance measures.
• A centroid computation procedure. On the basis of the partitioning that classifies the L training set vectors into M clusters, we choose the M codebook vectors as the centroids of each of the M clusters.
• A classification procedure for arbitrary speech spectral analysis vectors that chooses the codebook vector closest to the input vector and uses the codebook index as the resulting spectral representation. This is often referred to as the nearest-neighbor labeling or optimal encoding procedure. The classification procedure is essentially a quantizer that accepts as input a speech spectral vector and provides as output the codebook index of the codebook vector that best matches the input. Figure 8 shows the basic VQ training and classification structure; a minimal code sketch of the same training and labeling loop is given below.
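The following NumPy sketch illustrates, under simplifying assumptions (squared Euclidean distance as the spectral distance measure d(v_i, v_j), and a plain k-means loop in place of a full LBG codebook design), how the four elements listed above fit together: a training set, a distance measure, a centroid computation, and a nearest-neighbor classification procedure. The function names train_codebook and quantize are illustrative only.

    import numpy as np

    def train_codebook(train_vectors, M, iterations=20, seed=0):
        """K-means style codebook training: returns M centroid code vectors.
        train_vectors: (L, p) array of spectral analysis vectors, with L >> M."""
        rng = np.random.default_rng(seed)
        L = train_vectors.shape[0]
        codebook = train_vectors[rng.choice(L, size=M, replace=False)]  # initial code vectors
        for _ in range(iterations):
            # distance measure: squared Euclidean distance to every code vector
            d = ((train_vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)          # nearest-neighbor partition of the training set
            for m in range(M):
                members = train_vectors[labels == m]
                if len(members):               # centroid computation for each cluster
                    codebook[m] = members.mean(axis=0)
        return codebook

    def quantize(vectors, codebook):
        """Classification procedure: index of the closest code vector for each input."""
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    # Example: L = 5000 training vectors of dimension p = 10, codebook size M = 64 (L >= 10M)
    train = np.random.randn(5000, 10)
    cb = train_codebook(train, M=64)
    indices = quantize(np.random.randn(3, 10), cb)   # each frame becomes a codebook index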


Figure 8: The basic VQ training and classification structure

10.2. Advantages and Disadvantages of Vector Quantization:
a) Advantages:
1) Reduced storage for spectral analysis information. This efficiency can be exploited in a number of ways in practical vector quantization based speech recognition systems.
2) Reduced computation for determining the similarity of spectral analysis vectors. In speech recognition, a major component of the computation is the determination of spectral similarity between a pair of vectors. With the vector quantization representation, this spectral similarity computation is often reduced to a table lookup of similarities between pairs of codebook vectors.
3) Discrete representation of speech sounds. By associating a phonetic label (or possibly a set of phonetic labels, or a phonetic class) with each codebook vector, the process of choosing the best codebook vector to represent a given spectral vector becomes equivalent to assigning a phonetic label to each spectral frame of speech. A range of recognition systems exist that exploit these labels so as to recognize speech in an efficient manner.
4) Vector quantization lowers the bit rate of the signal being quantized, thus making it more bandwidth efficient than scalar quantization. This, however, contributes to its implementation complexity (computation and storage).

b) Disadvantages:
1) An inherent spectral distortion in representing the actual analysis vector. Since there are a finite number of codebook vectors, the process of choosing the "best" representation of a given spectral vector is inherently equivalent to quantizing the vector and leads, by definition, to a certain level of quantization error. As the size of the codebook increases, the quantization error decreases; however, with any finite codebook there will always be some non-zero level of quantization error.
2) The storage required for the codebook vectors is often non-trivial. The larger the codebook (so as to reduce quantization error), the more storage is required for the codebook entries. For codebook sizes of 1000 or larger the storage is often non-trivial, so an inherent trade-off exists among quantization error, processing for choosing the codebook vector, and storage of codebook vectors, and practical designs balance each of these three factors.
3) VQ has the low prediction gain of the vector predictor, due to the decay of the autocorrelation function of speech with increasing lag.

10.3. Applications:
i) Image and voice compression
ii) Speech recognition applications
iii) Image coding
iv) VQ for the neural gas network
v) VQ for lossy data compression, lossy data correction, and density estimation

11. Artificial Neural Network (ANN) [5]:
A variety of knowledge sources need to be established in the AI approach to speech recognition. Therefore, two key concepts of artificial intelligence are automatic knowledge acquisition (learning) and adaptation. One way in which these concepts have been implemented is via the neural network approach. Fig. 9 shows an example of a neural network model.

11.1. Basics of Neural Networks:
A neural network, also called a connectionist model or a parallel distributed processing (PDP) model, is basically a dense interconnection of simple, nonlinear computational elements. It is assumed that there are N inputs, labeled x_1, x_2, ..., x_N, which are summed with weights w_1, w_2, ..., w_N, thresholded, and then nonlinearly compressed to give the output y, defined as

    y = f( Σ_{i=1}^{N} w_i x_i − φ )    ...(12)

where φ is an internal threshold or offset, and f is a nonlinearity of one of the types given below:
1. a hard limiter, f(x) = +1 if x ≥ 0, or −1 if x < 0; or
2. a sigmoid, e.g. f(x) = 1 / (1 + e^(−x)).
The sigmoid nonlinearities are used most often because they are continuous and differentiable. The biological basis of the neural network is a model by McCulloch and Pitts [22] of neurons in the human nervous system.
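A minimal Python sketch of the single computational element defined by equation (12), using the sigmoid nonlinearity; the weight and threshold values below are arbitrary illustrations:

    import numpy as np

    def sigmoid(x):
        # continuous, differentiable nonlinearity, as preferred for training
        return 1.0 / (1.0 + np.exp(-x))

    def neuron_output(x, w, phi, f=sigmoid):
        """y = f( sum_i w_i * x_i - phi ), the basic computational element of eq. (12)."""
        return f(np.dot(w, x) - phi)

    x = np.array([0.5, -1.2, 0.3])     # N = 3 inputs
    w = np.array([0.8,  0.1, -0.4])    # weights
    print(neuron_output(x, w, phi=0.2))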

11.1.1. Neural Network topologies:
There are several issues in the design of the so-called artificial neural networks which model various physical phenomena, where we define an ANN as an arbitrary connection of simple computational elements. One key issue is the network topology, that is, how the simple computational elements are interconnected. There are three standard and well-known topologies:
i) single/multilayer perceptrons
ii) Hopfield or recurrent networks
iii) Kohonen or self-organizing networks

In the single/multilayer perceptron, the outputs of one or more simple computational elements at one layer form the inputs to a new set of simple computational elements of the next layer. The single-layer perceptron has N inputs connected to M outputs in the output layer, as shown in Fig. 9. A three-layer perceptron has two hidden layers between the input and output layers. The single-layer perceptron can separate static patterns into classes with class boundaries characterized by hyperplanes in the (x_1, x_2, ..., x_N) space. Similarly, a multilayer perceptron with at least one hidden layer can realize an arbitrary set of decision regions in the (x_1, x_2, ..., x_N) space. Thus, for example, if the inputs to a multilayer perceptron are the first two speech resonances (F1 and F2), the network can implement a set of decision regions that partition the (F1, F2) space into the ten steady-state vowels.

The Hopfield network is a recurrent network in which the input to each computational element includes both inputs and outputs. Thus, with the input and output indexed by time, x_i(t) and y_i(t), and the weight connecting the i-th node and the j-th node denoted by w_ij, the basic equation for the i-th recurrent computational element is

    y_i(t) = f[ x_i(t) + Σ_j w_ij y_j(t−1) − φ ]    ...(13)

for a recurrent network with N inputs and N outputs. The most important property of the Hopfield network is that, when w_ij = w_ji and the recurrent computation of eq. (13) is performed asynchronously for an arbitrary constant input, the network will eventually settle to a fixed point where y_i(t) = y_i(t−1) for all i. These fixed relaxation points represent stable configurations of the network and can be used in applications that have a fixed set of patterns to be matched, in the form of a content-addressable or associative memory. A recurrent network has a stable set of attractors and repellers, each forming a fixed point in the input space. Every input vector x is either attracted to one of the fixed points or repelled from another of the fixed points. The strength of this type of network is its ability to correctly classify noisy versions of the patterns that form the stable fixed points.
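A small sketch of the asynchronous Hopfield update of equation (13), using a hard-limiting nonlinearity and a symmetric weight matrix (w_ij = w_ji); the network and input values are arbitrary illustrations, and the loop simply runs until the outputs stop changing:

    import numpy as np

    def hopfield_relax(x, W, phi=0.0, max_steps=100):
        """Asynchronously apply y_i(t) = f( x_i(t) + sum_j w_ij * y_j(t-1) - phi )
        with f = sign, until a fixed point y_i(t) = y_i(t-1) is reached."""
        y = np.sign(x).astype(float)
        for _ in range(max_steps):
            y_prev = y.copy()
            for i in range(len(y)):                  # asynchronous update, one node at a time
                y[i] = 1.0 if x[i] + W[i] @ y - phi >= 0 else -1.0
            if np.array_equal(y, y_prev):            # stable configuration reached
                break
        return y

    W = np.array([[ 0.0, 1.0, -0.5],
                  [ 1.0, 0.0,  0.3],
                  [-0.5, 0.3,  0.0]])                # symmetric weights, zero diagonal
    print(hopfield_relax(np.array([0.2, -0.7, 0.1]), W))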

The third popular type of neural network topology is the Kohonen self-organizing feature map, which is a clustering procedure for providing a codebook of stable patterns in the input space that characterize an arbitrary input vector by a small number of representative clusters.

Figure 9: Simplified view of an artificial neural network

11.1.2. Network Characteristics:
Four model characteristics must be specified to implement an arbitrary neural network. Fig. 9 shows the architecture of a simple neural network model.
a) Number and type of inputs: The issues involved in the choice of inputs to a neural network are similar to those involved in the choice of features for any pattern classification system. The inputs must provide the information required to make the decision required of the network.
b) Connectivity of the network: This issue involves the size of the network, that is, the number of hidden layers and the number of nodes in each layer between the input and output. There is no good rule of thumb as to how large (or small) such hidden layers must be. Intuition says that if the hidden layers are too large, it will be difficult to train the network; similarly, if the hidden layers are too small, the network may not be able to accurately classify all of the desired input patterns.
c) Choice of offset: The choice of the threshold φ for each computational element must be made as part of the training procedure, which chooses values for the interconnection weights w_ij and the offsets φ.
d) Choice of nonlinearity: Experience indicates that the exact choice of the nonlinearity f is not very important in terms of network performance. However, f must be continuous and differentiable for the training algorithm to be applicable.

11.2. Training of Neural Network Parameters:
To completely specify a neural network, values for the weighting coefficients and the offset threshold of each computational element must be determined, based on a labeled set of training data. A labeled training set means an association between a set of Q input vectors x_1, x_2, ..., x_Q and a set of Q output vectors y_1, y_2, ..., y_Q, where each input x_q is paired with its desired output y_q. For multilayer perceptrons a simple, iterative, convergent procedure exists for choosing a set of parameters whose value asymptotically approaches a stationary point with a certain optimality property (e.g., a local minimum of the mean squared error). This procedure, called back-propagation learning, is a simple stochastic gradient technique. For a simple, single-layer network, the training algorithm can be realized via the following convergence steps (a minimal code sketch of this procedure is given after the steps):

Perceptron Convergence Procedure
1. Initialization: At time t = 0, set w_ij(0) and φ_j to small random values (where w_ij are the weighting coefficients connecting the i-th input node and the j-th output node, φ_j is the offset of a particular computational element, and the w_ij are functions of time).
2. Acquire input: At time t, obtain a new input x = {x_1, x_2, ..., x_N} with the desired output y^x = {y^x_1, y^x_2, ..., y^x_M}.
3. Calculate output: y_j = f( Σ_{i=1}^{N} w_ij(t) x_i − φ_j ).
4. Adapt weights: Update the weights as w_ij(t+1) = w_ij(t) + η(t) [ y^x_j − y_j ] x_i, where η(t) is a gain (learning-rate) term.
5. Iteration: Iterate steps 2-4 until the weights no longer change, i.e. w_ij(t+1) = w_ij(t).
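A minimal NumPy sketch of the convergence procedure above for a single-layer network, assuming a hard-limiting output nonlinearity and a fixed gain term η(t) = η; the data in the example are arbitrary:

    import numpy as np

    def train_perceptron(X, Y, eta=0.1, max_epochs=100, seed=0):
        """Single-layer perceptron training following steps 1-5:
        X is (Q, N) inputs, Y is (Q, M) desired outputs in {-1, +1}."""
        rng = np.random.default_rng(seed)
        Q, N = X.shape
        M = Y.shape[1]
        W = rng.normal(scale=0.01, size=(M, N))          # step 1: small random weights
        phi = rng.normal(scale=0.01, size=M)             # and offsets
        for _ in range(max_epochs):
            changed = False
            for x, y_des in zip(X, Y):                   # step 2: present each labeled input
                y = np.where(W @ x - phi >= 0, 1.0, -1.0)    # step 3: calculate output
                err = y_des - y
                if np.any(err != 0):
                    W += eta * np.outer(err, x)          # step 4: adapt weights
                    phi -= eta * err
                    changed = True
            if not changed:                              # step 5: stop when weights no longer change
                break
        return W, phi

    # Example: learn a linearly separable 2-class problem with one output node
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
    Y = np.array([[1.0], [1.0], [-1.0], [-1.0]])
    W, phi = train_perceptron(X, Y)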

11.3. Difference between Neural Networks and Conventional Classifiers:
The differences between a neural network classifier and a conventional classifier are given in Table A.

TABLE A: Differences between a neural network and a conventional classifier

Sl. No. | Neural Network                                          | Conventional classifier
1       | Estimates posterior probability                         | Based on Bayes decision theory using posterior probability
2       | Nonlinear, model-free method                            | Linear, model-based method
3       | Uses a discriminant function                            | Uses a probabilistic function
4       | Minimizes the total number of misclassification errors  | Minimizes classification error
5       | Data driven and self-adapting                           | Data driven but not self-adapting

Statistical pattern classifiers are based on Bayes decision theory, in which posterior probabilities play a central role. The fact that neural networks can in fact provide estimates of posterior probability implicitly establishes the link between neural networks and statistical classifiers. A direct comparison between them may not be possible, since neural networks are a nonlinear, model-free method while statistical methods are basically linear and model based. By appropriate coding of the desired output membership values, we may let neural networks directly model certain discriminant functions. For example, in a two-group classification problem, suppose the desired output is coded as 1 if the object is from class 1 and as −1 if it is from class 2. The neural network then estimates the following discriminant function:

    g(x) = P(ω_1 | x) − P(ω_2 | x)    ...(14)

The discriminating rule is simply: assign x to ω_1 if g(x) > 0, or to ω_2 if g(x) < 0.



Logistic regression has been widely used in applications such as medical diagnosis and epidemiologic studies [32]. Logistic regression is often preferred over discriminant analysis in practice [33,34]. In addition, the model can be interpreted as a posterior probability or an odds ratio. It is a simple fact that when the logistic transfer function is used for the output nodes, simple neural networks without hidden layers are identical to logistic regression models. Another connection is that the maximum likelihood function of logistic regression is essentially the cross-entropy cost function which is often used in training neural network classifiers. Schumacher et al. [35] make a detailed comparison between neural networks and logistic regression. They find that the added modeling flexibility of neural networks due to hidden layers does not automatically guarantee their superiority over logistic regression, because of possible overfitting and other inherent problems with neural networks [36]. Links between neural and other conventional classifiers have been illustrated by [37,38,39,40,41,42,43]. Ripley [44,45] empirically compares neural networks with various classifiers such as classification trees, projection pursuit regression, linear vector quantization, multivariate adaptive regression splines and nearest neighbor methods.

A large number of studies have been devoted to empirical comparisons between neural and conventional classifiers. The most comprehensive one can be found in Michie et al. [46], which reports a large-scale comparative study, the StatLog project. In this project, three general classification approaches (neural networks, statistical classifiers and machine learning), comprising 23 methods, are compared using more than 20 different real data sets. The general conclusion is that no single classifier is best for all data sets, although feed-forward neural networks do have good performance over a wide range of problems.

Neural networks have also been compared with decision trees [47,48,49,50], discriminant analysis [51], [52], [53], [54], [55], CART [56], k-nearest-neighbor [57], and linear programming methods. Although classification costs are difficult to assign in real problems, ignoring the unequal misclassification risk for different groups may have a significant impact on the practical use of the classification. It should be pointed out that a neural classifier which minimizes the total number of misclassification errors may not be useful in situations where different misclassification errors carry highly uneven consequences or costs.

11.4. Advantages and Disadvantages of Neural Networks:
a) Advantages:
1) Neural networks are data driven and self-adaptive (learning).
2) They possess a self-organization mechanism.
3) They have fault-tolerance capabilities.
4) A neural network can perform tasks that a linear program cannot.
5) When an element of the neural network fails, the network can continue without any problem because of its parallel nature.
6) A neural network learns and does not need to be reprogrammed.
7) It can be implemented in any application without any particular problem.
8) The connectionist structure can be used to model the local feature vector conditioned on the Markov process.
9) There is no need to assume an underlying data distribution, as is usually done in statistical modeling.
10) Neural networks are applicable to multivariate non-linear problems.
11) Neural networks are data driven, self-adaptive methods in that they can adjust themselves to the data without any explicit specification of functional or distributional form for the underlying model.
12) They are universal functional approximators, in that neural networks can approximate any function with arbitrary accuracy.
13) Neural networks are nonlinear models, which makes them flexible in modeling real-world complex relationships.
14) Neural networks are able to estimate the posterior probabilities, which provides the basis for establishing classification rules and performing statistical analysis.
15) They can adjust themselves to the data without any explicit specification of functional or distributional form for the underlying model.
16) They are universal functional approximators, in that neural networks can approximate any function with arbitrary accuracy [58], [59], [60].
17) Neural networks are nonlinear models, which makes them flexible in modeling real-world complex relationships.
18) Neural networks are able to estimate the posterior probabilities, which provide the basis for establishing classification rules and performing statistical analysis [61].
19) They can readily implement a massive degree of parallel computation.
20) They intrinsically possess a great deal of robustness or fault tolerance.
21) The connection weights of the network need not be constrained to be fixed; they can be adapted in real time to improve performance.
22) Because of the nonlinearity within each computational element, a sufficiently large neural network can approximate any nonlinearity or nonlinear dynamical system.
23) They can adapt to unknown situations.
24) Robustness: fault tolerance due to network redundancy.
25) Autonomous learning, due to their learning and generalization capabilities.

b. Disadvantages:
1) The neural network needs training to operate.
2) The architecture of a neural network is different from the architecture of microprocessors and therefore needs to be emulated.
3) Large neural networks require high processing time.
4) Minimizing overfitting requires a great deal of computational effort.
5) The individual relations between the input variables and the output variables are not developed by engineering judgment, so the model tends to be a black box or an input/output table without an analytical basis.


6) The sample size has to be large.
7) Large complexity of the network structure.

11.5. Applications:
Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, including:
1) Sales forecasting, industrial process control, customer research, data validation, risk management, target marketing
2) Modeling and diagnosing the cardiovascular system
3) Medicine / medical diagnosis
4) Business / marketing
5) Electronic noses
6) Speech recognition
7) Credit evaluation
8) Speech and speaker applications
9) Fault detection
10) Prediction: learning from past experience; weather prediction
11) Classification: image processing, risk management
12) Recognition: character / handwriting recognition
13) Data association
14) Data conceptualization
15) Data filtering
16) Planning

12. Support Vector Machines (SVM):
One of the powerful tools for pattern recognition that uses a discriminative approach is the SVM [97]. SVMs use linear and nonlinear separating hyperplanes for data classification. However, since SVMs can only classify fixed-length data vectors, the method cannot be readily applied to tasks involving variable-length data classification; variable-length data have to be transformed to fixed-length vectors before SVMs can be used. An SVM is a generalized linear classifier with maximum-margin fitting functions. This fitting function provides regularization, which helps the classifier generalize better, and the classifier tends to ignore many of the features. Conventional statistical and neural network methods control model complexity by using a small number of features (the problem dimensionality or the number of hidden units); the SVM controls model complexity by controlling the VC dimension of its model. This method is independent of dimensionality and can utilize spaces of very large dimension, which permits the construction of a very large number of non-linear features and then performing adaptive feature selection during training. By shifting all non-linearity to the features, the SVM can use a linear model for which the VC dimension is known. For example, a support vector machine can be used as a regularized radial basis function classifier.

12.1. Introduction to Support Vector Machine (SVM) Models:
During the last decade a new tool appeared in the field of machine learning that has proved able to cope with hard classification problems in several fields of application: the Support Vector Machine (SVM). An SVM is essentially a binary nonlinear classifier capable of guessing whether an input vector x belongs to class 1 (the desired output would then be y = +1) or to class 2 (y = −1). This algorithm was first proposed in [63] in 1992, and it is a nonlinear version of a much older linear algorithm, the optimal hyperplane decision rule (also known as the generalized portrait algorithm), which was introduced in the sixties.
SVMs are effective discriminative classifiers with several outstanding characteristics [62], namely: their solution is the one with maximum margin; they are capable of dealing with samples of very high dimensionality; and their convergence to the minimum of the associated cost function is guaranteed. A Support Vector Machine performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.
These characteristics have made SVMs very popular and successful. In the parlance of the SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. Figure 10 presents an overview of the SVM process.

Fig. 10: Support vector machine process

12.1.1. SVM formulation:
Given a set of separable data, the goal is to find the optimal decision function. It can easily be seen that there is an infinite number of solutions to this problem, in the sense that they all separate the training samples with zero errors. Since the function must also generalize to unseen samples, an additional criterion is needed to find the best solution among those with zero errors. If the probability densities of the classes were known, the maximum a posteriori (MAP) criterion could be applied to find the optimal solution. In most practical cases this information is not available, so a simpler criterion is adopted: among those functions without training errors, choose the one with maximum margin, the margin being the distance between the closest sample and the decision boundary defined by that function. Of course, optimality in the sense of maximum margin does not necessarily imply optimality in the sense of minimizing the number of errors in test, but it is a simple criterion that yields solutions which, in practice, turn out to be the best ones for many problems [64].

The nonlinear discriminant function f(x_i) can be written as

    f(x_i) = w^T φ(x_i) + b    ...(14a)

where φ(·) is a nonlinear function which maps the vector x_i into what is called a feature space of higher (possibly infinite) dimensionality, in which the classes are assumed to be linearly separable, and w represents the separating hyperplane in that space. It is worth noting that the meaning of feature space here has nothing to do with the space of the speech features, which in kernel-methods nomenclature belongs to the input space. The distance between a transformed sample and the separating hyperplane is r_x = f(x_i) / ||w||, where ||w|| is the Euclidean norm of w. The support vectors are the samples closest to the decision boundary; they define the margin and are the only samples needed to find the solution. Hence, the optimum classifier is achieved by minimizing ||w|| with the restriction that all samples are correctly classified, i.e.

    minimize (1/2) ||w||^2  subject to  y_i ( w^T φ(x_i) + b ) ≥ 1 for every sample (x_i, y_i)    ...(15)

This can be formulated as a problem of quadratic optimization. In order to get a classifier with a better generalization ability, and one capable of handling the non-separable case, a number of misclassified data must be allowed. This is accomplished by introducing a penalty term in the function to be minimized:

    minimize (1/2) ||w||^2 + C Σ_i ξ_i  subject to  y_i ( w^T φ(x_i) + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0    ...(16)

Figure 11: Soft margin decision

Here the (x_i, y_i) are the training vectors with their corresponding labels, and the variables ξ_i are called slack variables; they allow a certain amount of errors and so make solutions possible in the non-separable case: 0 < ξ_i ≤ 1 holds for samples that are correctly classified but lie inside the margin, and ξ_i > 1 for samples that are wrongly classified (see Figure 11). The C term, on the other hand, expresses the trade-off between the number of training errors and the generalization capability. This problem is usually solved by introducing the restrictions into the function to be optimized using Lagrange multipliers, leading to the maximization of the Wolfe dual:

    maximize  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j φ(x_i)^T φ(x_j)  subject to  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0    ...(17)

This problem is quadratic and convex, so its convergence to a global minimum is guaranteed using quadratic programming (QP) schemes. The resulting decision boundary w is given by

    w = Σ_i α_i y_i φ(x_i)    ...(18)

According to (18), only vectors with an associated α_i ≠ 0 contribute to the weight vector w and, therefore, to the separating boundary. These are the support vectors that, as mentioned before, define the separation border and the margin. Generally, the function φ(·) is not explicitly known (in fact, in most cases its evaluation would be impossible, as the feature-space dimensionality can be infinite). Since only the dot products φ(x_i)^T φ(x_j) need to be evaluated, they can be computed, using what is called the kernel trick, through a kernel function K(x_i, x_j). Many SVM implementations compute this function for every pair of input samples, producing a kernel matrix that is stored in memory. By using this method and replacing w in equation (14a) by the expression in (18), the form that an SVM finally adopts is the following:

    f(x) = Σ_i α_i y_i K(x_i, x) + b    ...(19)

The most widely used kernel functions are:
• the simple linear kernel,
    K(x_i, x_j) = x_i^T x_j    ...(20)
• the radial basis function (RBF) kernel,
    K(x_i, x_j) = exp( −γ ||x_i − x_j||^2 )    ...(21)
  where γ is proportional to the inverse of the variance of the Gaussian function and whose associated feature space is of infinite dimensionality;
• the polynomial kernel,
    K(x_i, x_j) = ( x_i^T x_j + 1 )^p    ...(22)
  whose associated feature space consists of polynomials up to degree p; and
• the sigmoid kernel,
    K(x_i, x_j) = tanh( γ x_i^T x_j + c )    ...(23)

It is worth mentioning that there are some conditions that a function should satisfy to be used as a kernel. These are often denominated KKT (Karush-Kuhn-Tucker) conditions [65] and can be reduced to checking that the kernel matrix is symmetric and positive semi-definite.
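A short sketch, under the assumption of an RBF kernel as in (21), of how the decision function (19) is evaluated once the multipliers α_i, the labels y_i, the support vectors and the bias b are known. The values below are placeholders rather than the result of an actual QP training run; in practice the α_i and b come from a quadratic-programming solver or a library such as LIBSVM.

    import numpy as np

    def rbf_kernel(x1, x2, gamma=0.5):
        # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), equation (21)
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    def svm_decision(x, support_vectors, alphas, labels, b, gamma=0.5):
        """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, equation (19);
        the sign of f(x) gives the predicted class (+1 or -1)."""
        s = sum(a * y * rbf_kernel(sv, x, gamma)
                for a, y, sv in zip(alphas, labels, support_vectors))
        return s + b

    # Placeholder support set (would normally be produced by QP training)
    support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
    alphas = np.array([0.7, 0.7, 0.4])     # only the non-zero multipliers are kept
    labels = np.array([+1, -1, -1])
    b = 0.1

    x_new = np.array([0.8, 0.9])
    print("f(x) =", svm_decision(x_new, support_vectors, alphas, labels, b))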

12.2. Advantages and Disadvantages of SVM:
a. Advantages:
1) The SVM follows linear discriminants in its learning criterion.
2) It aims to minimize the number of misclassifications over any possible set of samples, which is known as Risk Minimization (RM).
3) In practice it minimizes the number of misclassifications within the training set, which is known as Empirical Risk Minimization (ERM).
4) SVMs have a unique solution and their convergence is guaranteed (the solution is found by minimizing a convex function). This is an advantage compared to other classifiers such as ANNs, which often fall into local minima or do not converge to a stable solution.
5) Since only the kernel matrix is involved in the minimization process, SVMs can deal with input vectors of very high dimensionality, as long as the corresponding kernels can be calculated; they can therefore deal with vectors of thousands of dimensions.
6) The input vectors of an SVM, with this formulation, must have a fixed size.
7) An important advantage of the SVM is that it offers the possibility of training generalizable, nonlinear classifiers in high-dimensional spaces using a small training set.
8) The SVM generalization error is not related to the input dimensionality of the problem but to the margin with which it separates the data. That is why SVMs can have good performance even with a large number of inputs.
b. Disadvantages:
1) Most implementations of the SVM algorithm require computing and storing in memory the complete kernel matrix of all the input samples. This task has a space complexity of O(n^2), and it is one of the main problems of these algorithms, preventing their application to very large speech databases.
2) The optimality of the solution found can depend on the kernel that has been used, and there is no method to know a priori which kernel will be best for a concrete task.
3) The best value for the parameter C is unknown a priori.

12.3. Applications:
1) SVM in speech and speaker recognition
2) SVM in financial applications
3) SVM in computational biology
4) SVM in bioinformatics/biological applications
5) SVM in text classification
6) SVM in chemistry

13. K-Nearest Neighbor Method:
A more general version of the nearest neighbor technique [66] bases the classification of an unknown sample on the votes of its k nearest neighbors rather than on only its single nearest neighbor. The k-nearest neighbor classification procedure is denoted by k-NN. If the costs of error are equal for each class, the estimated class of an unknown sample is chosen to be the class that is most commonly represented in the collection of its k nearest neighbors.


13.1. Classification concept of KNN:
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor.
Classification (generalization) using an instance-based classifier can be a simple matter of locating the nearest neighbor in instance space and labeling the unknown instance with the same class label as that of the located (known) neighbor. This approach is often referred to as a nearest neighbor classifier. More robust models can be achieved by locating k (with k > 1) neighbors and letting the majority vote decide the outcome of the class labeling. A higher value of k results in a smoother, less locally sensitive decision function. The nearest neighbor classifier can be regarded as a special case of the more general k-nearest neighbors classifier, hereafter referred to as a k-NN classifier.
The same method can be used for regression, by simply assigning the property value of the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. (A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor; this scheme is a generalization of linear interpolation.) The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. Nearest neighbor rules in effect compute the decision boundary in an implicit manner. It is also possible to compute the decision boundary itself explicitly, and to do so in an efficient manner, so that the computational complexity is a function of the boundary complexity.
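A minimal NumPy sketch of the k-NN voting rule just described, using Euclidean distance; the training data in the example are arbitrary:

    import numpy as np
    from collections import Counter

    def knn_classify(x, train_X, train_y, k=3):
        """Classify x by a majority vote among its k nearest training samples."""
        distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))   # distance to every stored vector
        nearest = np.argsort(distances)[:k]                     # indices of the k closest neighbors
        votes = Counter(train_y[i] for i in nearest)
        return votes.most_common(1)[0][0]

    train_X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
    train_y = np.array(['+', '+', '-', '-'])
    print(knn_classify(np.array([1.1, 0.9]), train_X, train_y, k=3))   # expected '+'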

13.1.1. Assumptions in KNN:
Before using KNN, some of its assumptions have to be considered.
• KNN assumes that the data are in a feature space. More exactly, the data points are in a metric space. The data can be scalars or possibly even multidimensional vectors. Since the points are in a feature space, they have a notion of distance; this need not necessarily be Euclidean distance, although that is the one commonly used.
• Each item of the training data consists of a vector and the class label associated with that vector. In the simplest case the label will be either + or − (for positive or negative classes), but KNN can work equally well with an arbitrary number of classes.
• A single number "k" is also given. This number decides how many neighbors (where neighbors are defined based on the distance metric) influence the classification. This is usually an odd number if the number of classes is 2. If k = 1, then the algorithm is simply called the nearest neighbor algorithm.

13.1.2. Parameter selection in KNN:
• The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make the boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example cross-validation (see the sketch after this list). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
• Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling [4]. Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
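As a hedged illustration of the cross-validation heuristic mentioned in the first bullet above, the sketch below scores a few candidate values of k by leave-one-out accuracy on a small synthetic training set; the candidate grid, the two-Gaussian toy data and the unweighted vote are assumptions made only for the example.

```python
import numpy as np

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of a plain (unweighted) k-NN classifier."""
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                  # exclude the held-out point itself
        idx = np.argsort(d)[:k]
        labels, counts = np.unique(y[idx], return_counts=True)
        correct += int(labels[np.argmax(counts)] == y[i])
    return correct / len(X)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
best_k = max([1, 3, 5, 7, 9], key=lambda k: loo_accuracy(X, y, k))
print("empirically best k:", best_k)
```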

13.1.3. Properties of KNN:
• The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes k-NN computationally tractable even for large data sets.
• The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). k-nearest neighbor is guaranteed to approach the Bayes error rate for some value of k (where k increases as a function of the number of data points). Various improvements to k-nearest neighbor methods are possible by using proximity graphs.

13.1.4. KNN for Density Estimation:
Although classification remains the primary application of KNN, it can also be used for density estimation. Since KNN is non-parametric, it can estimate the density of arbitrary distributions. The idea is very similar to the use of Parzen windows: instead of using a fixed hypercube and a kernel function, the estimation proceeds as follows. To estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. The density is then estimated as

p(x) = k / (n V)        …(24)

where n is the total number of data points and V is the volume of the hypercube. Notice that the numerator is essentially a constant and the density is governed by the volume. The intuition is this: suppose the density at x is very high. Then k points near x can be found very quickly, and those points are also very close to x (by the definition of high density); the volume of the hypercube is therefore small and the resulting density estimate is high. Suppose instead that the density around x is very low. Then the volume of the hypercube needed to encompass the k nearest neighbors is large and, consequently, the ratio is low. The volume plays a role similar to the bandwidth parameter in kernel density estimation.
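A minimal sketch of this estimator, under the assumption that the hypercube is centered at x and sized by the Chebyshev (maximum-coordinate) distance to the k-th nearest training point; the sample data and the choice k = 25 are illustrative only.

```python
import numpy as np

def knn_density(X, x, k=5):
    """k-NN density estimate p(x) = k / (n * V), where V is the volume of the
    smallest axis-aligned hypercube centered at x containing k training points."""
    n, d = X.shape
    r = np.sort(np.max(np.abs(X - x), axis=1))[k - 1]   # half side-length of the hypercube
    V = (2.0 * r) ** d                                  # hypercube volume
    return k / (n * V)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))                 # samples from a 2-D standard normal
print(knn_density(X, np.zeros(2), k=25))       # should land near 1/(2*pi) ~ 0.159
```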

13.2. Some Basic Observations regarding K-NN:
1. If the points are d-dimensional, then the straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2. KNN can be analyzed in two ways. One view is that KNN tries to estimate the posterior probability of the point to be labeled (and then applies Bayesian decision theory based on that posterior probability). An alternative view is that KNN computes the decision surface (either implicitly or explicitly) and then uses it to decide on the class of new points.
3. There are many possible ways to apply weights in KNN; one popular example is Shepard's method.
4. Even though the naive method takes O(dn) time, it is very hard to do better unless additional assumptions are used. There are efficient data structures such as the KD-tree which can reduce the query time, but they do so at the cost of increased training time and complexity (see the sketch after this list).
5. In KNN, k is usually chosen as an odd number if the number of classes is 2.
6. The choice of k is very critical. A small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and defeats the basic philosophy behind KNN (that points that are near are likely to have similar densities or classes). A simple approach is to set k ≈ √n.
7. There are some interesting data structures and algorithms when we apply KNN to graphs: the Euclidean minimum spanning tree and the nearest neighbor graph.
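To make item 4 concrete, the following hedged sketch contrasts the brute-force O(dn) query with a KD-tree query using SciPy's cKDTree; the data sizes are arbitrary, and the agreement check assumes continuous random data with no distance ties.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 3))                 # stored training points
query = rng.normal(size=(3,))

# Brute force: O(dn) distance evaluations for every query.
d = np.linalg.norm(X - query, axis=1)
brute_idx = np.argsort(d)[:5]

# KD-tree: built once (the extra "training" cost), then queried much faster
# for low-dimensional data.
tree = cKDTree(X)
dist, tree_idx = tree.query(query, k=5)
assert set(tree_idx) == set(brute_idx)
```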

13.3. Advantages and Disadvantages of K-NN:
a) Advantages:
1) It has a high degree of local sensitivity, so its decisions adapt closely to the local structure of the training data.
2) It follows a non-parametric architecture.
3) It is a simple and powerful algorithm.
4) KNN is one of the common methods used to estimate the bandwidth (e.g. in adaptive mean shift).
b) Disadvantages:
1) The downside of this simple approach is the lack of robustness that characterizes the resulting classifiers; the same local sensitivity makes them susceptible to noise in the training data.
2) It is memory intensive.
3) Its classification/estimation is slow.
4) For large training sets it requires a large amount of memory and is slow when making a prediction.
5) It needs a similarity measure and attributes that "match" the target function.
6) The k-nearest neighbor algorithm is sensitive to the local structure of the data.
7) Its computational complexity is a function of the complexity of the decision boundary.
8) Lack of generalization means that KNN keeps all the training data.
9) The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance.
10) Prediction accuracy can quickly degrade when the number of attributes grows.
The drawback of increasing the value of k is of course that, as k approaches n, where n is the size of the instance base, the performance of the classifier approaches that of the most straightforward statistical baseline: the assumption that all unknown instances belong to the class most frequently represented in the training data.


13.4. Applications:
The nearest neighbor search problem arises in numerous fields of application, including:
• Pattern recognition, in particular optical character recognition
• Statistical classification (see the k-nearest neighbor algorithm)
• Computer vision
• Databases, e.g. content-based image retrieval
• Coding theory (see maximum likelihood decoding)
• Data compression (see the MPEG-2 standard)
• Recommendation systems
• Internet marketing (see contextual advertising and behavioral targeting)
• DNA sequencing
• Spell checking: suggesting correct spellings
• Plagiarism detection
• Contact searching algorithms in FEA
• Similarity scores for predicting career paths of professional athletes
• Cluster analysis: assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance
• Gene expression
• Protein-protein interaction and 3D structure prediction
• Nearest neighbor based content retrieval

14. Gaussian Mixture Model (GMM):
Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering (though they are also used intensively for density estimation).

14.1. Introduction:
A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model.
A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation

p(x|\lambda) = \sum_{i=1}^{M} w_i \, g(x|\mu_i, \Sigma_i)        …(25)

where x is a D-dimensional continuous-valued data vector (i.e. measurement or features), w_i, i = 1, . . . , M, are the mixture weights, and g(x|\mu_i, \Sigma_i), i = 1, . . . , M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form

g(x|\mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_i)' \, \Sigma_i^{-1} (x - \mu_i) \right)        …(26)

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint \sum_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities. These parameters are collectively represented by the notation

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, . . . , M.        …(27)
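A minimal sketch of evaluating the density defined by equations (25)-(26), restricted to diagonal covariance matrices and computed in the log domain for numerical stability; the array shapes and parameter names are assumptions made for the example.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM at a single point x.

    weights:   (M,)   mixture weights summing to one
    means:     (M, D) component mean vectors
    variances: (M, D) per-dimension variances (diagonal covariances)
    """
    D = means.shape[1]
    # log of each D-variate diagonal Gaussian, eq. (26) with a diagonal Sigma_i
    log_g = (-0.5 * D * np.log(2 * np.pi)
             - 0.5 * np.sum(np.log(variances), axis=1)
             - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    # log of the weighted sum in eq. (25), via the log-sum-exp trick
    a = np.log(weights) + log_g
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))
```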

There are several variants of the GMM defined above. The covariance matrices \Sigma_i can be full rank or constrained to be diagonal. Additionally, parameters can be shared, or tied, among the Gaussian components, such as having a common covariance matrix for all components. The choice of model configuration (number of components, full or diagonal covariance matrices, and parameter tying) is often determined by the amount of data available for estimating the GMM parameters and by how the GMM is used in a particular biometric application. It is also important to note that, because the component Gaussians act together to model the overall feature density, full covariance matrices are not necessary even if the features are not statistically independent. The linear combination of diagonal-covariance basis Gaussians is capable of modeling the correlations between feature vector elements, and the effect of using a set of M full-covariance Gaussians can be equally obtained by using a larger set of diagonal-covariance Gaussians. GMMs are often used in biometric systems, most notably in speaker recognition systems, due to their capability of representing a large class of sample distributions. One of the powerful attributes of the GMM is its ability to form smooth approximations to arbitrarily shaped densities. The classical uni-modal Gaussian model represents a feature distribution by a position (mean vector) and an elliptic shape (covariance matrix), while a vector quantizer (VQ) or nearest neighbor model represents a distribution by a discrete set of characteristic templates [67]. A GMM acts as a hybrid between these two models by using a discrete set of Gaussian functions, each with its own mean and covariance matrix, to allow a better modeling capability. Figure 12 compares the densities obtained using a uni-modal Gaussian model, a GMM and a VQ model.


Figure 12: Comparison of distribution modeling. (a) Histogram of a single cepstral coefficient from a 25 second utterance by a male speaker; (b) maximum likelihood uni-modal Gaussian model; (c) GMM and its 10 underlying component densities; (d) histogram of the data assigned to the VQ centroid locations of a 10 element codebook.

Plot (a) of Figure 12 shows the histogram of a single feature from a speaker recognition system (a single cepstral value from a 25 second utterance by a male speaker); plot (b) shows a uni-modal Gaussian model of this feature distribution; plot (c) shows a GMM and its ten underlying component densities; and plot (d) shows a histogram of the data assigned to the VQ centroid locations of a 10 element codebook. The GMM not only provides a smooth overall distribution fit, its components also clearly detail the multi-modal nature of the density.

The use of a GMM for representing feature distributions in a biometric system may also be motivated by the intuitive notion that the individual component densities may model some underlying set of hidden classes. For example, in speaker recognition, it is reasonable to assume that the acoustic space of spectral-related features corresponds to a speaker's broad phonetic events, such as vowels, nasals or fricatives. These acoustic classes reflect some general speaker-dependent vocal tract configurations that are useful for characterizing speaker identity. The spectral shape of the i-th acoustic class can in turn be represented by the mean \mu_i of the i-th component density, and variations of the average spectral shape can be represented by the covariance matrix \Sigma_i. Because all the features used to train the GMM are unlabeled, the acoustic classes are hidden in that the class of an observation is unknown. A GMM can also be viewed as a single-state HMM with a Gaussian mixture observation density, or as an ergodic Gaussian observation HMM with fixed, equal transition probabilities. Assuming independent feature vectors, the observation density of feature vectors drawn from these hidden acoustic classes is a Gaussian mixture [68, 69].

14.2. Maximum Likelihood Parameter Estimation:
Given training vectors and a GMM configuration, we wish to estimate the parameters of the GMM, \lambda, which in some sense best match the distribution of the training feature vectors. There are several techniques available for estimating the parameters of a GMM [70]. The most popular and well-established method is maximum likelihood (ML) estimation. The aim of ML estimation is to find the model parameters which maximize the likelihood of the GMM given the training data. For a sequence of T training vectors X = \{x_1, . . . , x_T\}, the GMM likelihood, assuming independence between the vectors, can be written as

p(X|\lambda) = \prod_{t=1}^{T} p(x_t|\lambda)        …(28)

Unfortunately, this expression is a non-linear function of the parameters \lambda, and direct maximization is not possible. However, ML parameter estimates can be obtained iteratively using a special case of the expectation-maximization (EM) algorithm [71]. The basic idea of the EM algorithm is, beginning with an initial model \lambda, to estimate a new model \bar{\lambda} such that p(X|\bar{\lambda}) \geq p(X|\lambda). The new model then becomes the initial model for the next iteration, and the process is repeated until some convergence threshold is reached. The initial model is typically derived by using some form of binary VQ estimation. On each EM iteration, the following re-estimation formulas are used, which guarantee a monotonic increase in the model's likelihood value:

mixture weights:  \bar{w}_i = \frac{1}{T} \sum_{t=1}^{T} \Pr(i|x_t, \lambda)
means:  \bar{\mu}_i = \frac{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)\, x_t}{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)}
variances:  \bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} \Pr(i|x_t, \lambda)} - \bar{\mu}_i^2

The a posteriori probability for component i is given by

\Pr(i|x_t, \lambda) = \frac{w_i \, g(x_t|\mu_i, \Sigma_i)}{\sum_{k=1}^{M} w_k \, g(x_t|\mu_k, \Sigma_k)}        …(29)
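The EM update just described can be sketched compactly for the diagonal-covariance case; the array shapes, the variance floor and the log-domain E-step are assumptions made for this illustration rather than details from the text.

```python
import numpy as np

def em_step(X, w, mu, var):
    """One EM re-estimation pass for a diagonal-covariance GMM.

    X: (T, D) training vectors; w: (M,) weights; mu, var: (M, D).
    Returns updated (w, mu, var)."""
    T, D = X.shape
    M = w.shape[0]
    # E-step: a posteriori probabilities Pr(i | x_t, lambda), eq. (29)
    log_g = np.empty((T, M))
    for i in range(M):
        log_g[:, i] = (-0.5 * D * np.log(2 * np.pi)
                       - 0.5 * np.sum(np.log(var[i]))
                       - 0.5 * np.sum((X - mu[i]) ** 2 / var[i], axis=1))
    a = np.log(w) + log_g
    post = np.exp(a - a.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)              # (T, M) responsibilities
    # M-step: re-estimate weights, means and diagonal variances
    n = post.sum(axis=0)                                 # soft counts per component
    w_new = n / T
    mu_new = (post.T @ X) / n[:, None]
    var_new = (post.T @ (X ** 2)) / n[:, None] - mu_new ** 2
    return w_new, mu_new, np.maximum(var_new, 1e-6)      # floor the variances
```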

14.3. Maximum A Posteriori (MAP) Parameter Estimation:
In addition to estimating GMM parameters via the EM algorithm, the parameters may also be estimated using Maximum A Posteriori (MAP) estimation. MAP estimation is used, for example, in speaker recognition applications to derive a speaker model by adapting from a universal background model (UBM) [72], as shown in Figure 13. It is also used in other pattern recognition tasks where limited labeled training data is used to adapt a prior, general model. Like the EM algorithm, MAP estimation is a two-step estimation process. The first step is identical to the "Expectation" step of the EM algorithm, where estimates of the sufficient statistics of the training data are computed for each mixture in the prior model. Unlike the second step of the EM algorithm, for adaptation these "new" sufficient statistic estimates are then combined with the "old" sufficient statistics from the prior mixture parameters using a data-dependent mixing coefficient. The data-dependent mixing coefficient is designed so that mixtures with high counts of new data rely more on the new sufficient statistics for final parameter estimation, and mixtures with low counts of new data rely more on the old sufficient statistics.

Figure 13: Pictorial example of the two steps in adapting a hypothesized speaker model. (a) The training vectors (x's) are probabilistically mapped into the UBM (prior) mixtures. (b) The adapted mixture parameters are derived using the statistics of the new data and the UBM (prior) mixture parameters. The adaptation is data dependent, so UBM (prior) mixture parameters are adapted by different amounts.

The specifics of the adaptation are as follows. Given a prior model and training vectors from the desired class, X = \{x_1, . . . , x_T\}, we first determine the probabilistic alignment of the training vectors into the prior mixture components (Figure 13(a)). That is, for mixture i in the prior model, we compute \Pr(i|x_t, \lambda_{prior}), as in Equation (29). This is the same as the "Expectation" step in the EM algorithm. We then compute the sufficient statistics for the weight, mean and variance parameters:

n_i = \sum_{t=1}^{T} \Pr(i|x_t, \lambda_{prior}), \quad
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i|x_t, \lambda_{prior})\, x_t, \quad
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i|x_t, \lambda_{prior})\, x_t^2        …(30)

Lastly, these new sufficient statistics from the training data are used to update the prior sufficient statistics for mixture i, creating the adapted parameters for mixture i (Figure 13(b)) with the equations:

\hat{w}_i = \left[ \alpha_i^{w} n_i / T + (1 - \alpha_i^{w})\, w_i \right] \gamma, \quad
\hat{\mu}_i = \alpha_i^{m} E_i(x) + (1 - \alpha_i^{m})\, \mu_i, \quad
\hat{\sigma}_i^2 = \alpha_i^{v} E_i(x^2) + (1 - \alpha_i^{v})\,(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2        …(31)

The adaptation coefficients controlling the balance between old and new estimates are \{\alpha_i^{w}, \alpha_i^{m}, \alpha_i^{v}\} for the weights, means and variances, respectively. The scale factor \gamma is computed over all adapted mixture weights to ensure they sum to unity. Note that the sufficient statistics, not the derived parameters such as the variance, are being adapted. For each mixture and each parameter, a data-dependent adaptation coefficient

\alpha_i^{p} = \frac{n_i}{n_i + \gamma^{p}}, \quad p \in \{w, m, v\}        …(32)

is used in the above equations, where \gamma^{p} is a fixed "relevance" factor for parameter p. It is common in speaker recognition applications to use one adaptation coefficient for all parameters, and further to adapt only certain GMM parameters, such as only the mean vectors. Using a data-dependent adaptation coefficient allows mixture-dependent adaptation of parameters. If a mixture component has a low probabilistic count n_i of new data, then \alpha_i^{p} \to 0, causing the de-emphasis of the new (potentially under-trained) parameters and the emphasis of the old (better trained) parameters. For mixture components with high probabilistic counts, \alpha_i^{p} \to 1, causing the use of the new class-dependent parameters. The relevance factor is a way of controlling how much new data should be observed in a mixture before the new parameters begin replacing the old parameters. This approach should thus be robust to limited training data.
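As a hedged sketch of this adaptation scheme, the fragment below performs mean-only MAP adaptation of a diagonal-covariance UBM toward new data, following equations (30)-(32); the relevance factor value of 16 and the mean-only restriction are illustrative assumptions.

```python
import numpy as np

def map_adapt_means(X, w, mu, var, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (UBM) to data X."""
    T, D = X.shape
    M = w.shape[0]
    # Probabilistic alignment Pr(i | x_t, lambda_prior), as in eq. (29)
    log_g = np.empty((T, M))
    for i in range(M):
        log_g[:, i] = (-0.5 * D * np.log(2 * np.pi)
                       - 0.5 * np.sum(np.log(var[i]))
                       - 0.5 * np.sum((X - mu[i]) ** 2 / var[i], axis=1))
    a = np.log(w) + log_g
    post = np.exp(a - a.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Sufficient statistics, eq. (30)
    n = post.sum(axis=0)                                   # soft counts n_i
    Ex = (post.T @ X) / np.maximum(n[:, None], 1e-12)      # E_i(x)
    # Data-dependent adaptation coefficient, eq. (32)
    alpha = n / (n + relevance)
    # Adapted means, the mean update of eq. (31)
    return alpha[:, None] * Ex + (1.0 - alpha[:, None]) * mu
```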

14.4. Advantages and Disadvantages of GMM:
a. Advantages:
1) It is less time consuming when applied to a large set of data.
2) It is text independent.
3) It is easy to implement.
4) It follows a probabilistic framework (robust).
5) It is computationally efficient.
b. Disadvantages:
1) It is slow at tracking time-evolving patterns.
2) It cannot exclude exponential functions.


14.5. Applications:
1) Speaker identification
2) Image segmentation
3) Modeling video sequences
4) Musical instrument identification in polyphonic music
5) Extraction of melodic lines from audio recordings
6) Speaker verification/speaker identification

15. Unsupervised Classification Method:
In unsupervised classification the goal is harder, because there are no pre-determined categorizations. There are two broad approaches to unsupervised learning. The first approach is to teach the agent not by giving explicit categorizations, but by using some sort of reward system to indicate success. This type of training generally fits into the decision-problem framework, because the goal is not to produce a classification but to make decisions that maximize rewards. This approach generalizes nicely to the real world, where agents might be rewarded for doing certain actions.
The second approach to unsupervised learning is called clustering. In this type of learning the goal is not to maximize a utility function, but simply to find similarities in the training data. The assumption is often that the clusters discovered will match reasonably well with an intuitive classification. This method is commonly used in many applications, especially in speech recognition, and is therefore discussed here in detail.
In other terms, unsupervised learning is defined as the learning method in which the computer does not get any feedback or guidance while learning, and no guidelines are provided; unlike supervised learning, patterns are not labeled or classified beforehand.

15.2. Advantages and Disadvantages of Unsupervised Classification:
a) Advantages:
1) There is no need to provide either classification rules or sample documents as a training set.
2) Unsupervised classification techniques are used when we do not have a clear idea of the rules or classifications. One possible scenario is to use unsupervised classification to provide an initial set of categories, and to subsequently build on these through supervised classification.
b) Disadvantages:
1) Clustering might result in unexpected groupings, since the clustering operation is not user-defined but based on an internal algorithm.
2) The rules that create the clusters are not visible.
3) The clustering operation is CPU intensive and can take at least as long as indexing.
4) It suffers from overfitting.

15.1. Introduction to Clustering:
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult combinatorial problem, and differences in assumptions and contexts among different communities have made the transfer of useful generic concepts and methodologies slow to occur.
In machine learning, unsupervised learning is a class of problems in which one seeks to determine how the data are organized. Many methods employed here are based on data mining methods used to preprocess data. It is distinguished from supervised learning (and reinforcement learning) in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics; however, it also encompasses many other techniques that seek to summarize and explain key features of the data. One form of unsupervised learning is clustering. Another example is blind source separation based on Independent Component Analysis (ICA).

There are two broad classes of classification procedures: supervised classification and unsupervised classification. Supervised classification is the essential tool used for extracting quantitative information from remotely sensed image data [Richards, 1993, p85]. Using this method, the analyst has available sufficient known pixels to generate representative parameters for each class of interest. This step is called training. Once trained, the classifier is then used to attach labels to all the image pixels according to the trained parameters. The most commonly used supervised classification is maximum likelihood classification (MLC), which assumes that each spectral class can be described by a multivariate normal distribution. MLC therefore takes advantage of both the mean vectors and the multivariate spreads of each class, and can identify elongated classes. However, the effectiveness of maximum likelihood classification depends on a reasonably accurate estimation of the mean vector m and the covariance matrix for each spectral class [Richards, 1993, p189]. What is more, it assumes that the classes are unimodally distributed in multivariate space; when the classes are multimodally distributed, accurate results cannot be obtained. The other broad class is unsupervised classification. It does not require the analyst to have foreknowledge of the classes, and mainly uses a clustering algorithm to classify the image data [Richards, 1993, p85]. These procedures can be used to determine the number and location of the uni-modal spectral classes. One of the most commonly used unsupervised classifications is the migrating means clustering classifier (MMC). This method is based on labeling each pixel to unknown cluster centers and then moving from one cluster center to another in a way that the SSE measure of the preceding section is reduced [Richards, 1993, p231].
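As a hedged illustration of this family of iterative, SSE-reducing clustering procedures, the sketch below implements plain k-means (Lloyd's algorithm) in Python; it is offered as a generic example of migrating cluster means, not as the specific MMC procedure described by Richards.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternately assign points to the nearest center and move
    each center to the mean of its assigned points, reducing the SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
        labels = np.argmin(d, axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```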


15.1.1. Data clustering:
Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 14. The input patterns are shown in Figure 14(a), and the desired clusters are shown in Figure 14(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

Figure 14: Data clustering

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn descriptions of the classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data. Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points, to make an assessment (perhaps preliminary) of their structure.

The term "clustering" is used in several research communities to describe methods for the grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities. The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities. The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

15.1.2. Components of a Clustering Task:
Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:
(1) Pattern representation (optionally including feature extraction and/or selection),
(2) Definition of a pattern proximity measure appropriate to the data domain,
(3) Clustering or grouping,
(4) Data abstraction (if needed), and
(5) Assessment of output (if needed).
Figure 15 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations. Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the practitioner.

Figure 15: Stages in Clustering


15.1.3. Advantages and Disadvantages of clustering:
a) Advantages:
1. High performance
2. Large capacity
3. High availability
4. Incremental growth
b) Disadvantages:
1. Complexity
2. Inability to recover from database corruption
15.1.4. Applications of Clustering:
1. Clustering in the design of neural networks
2. Information retrieval
3. Data mining
4. Speech and speaker recognition

16. SIMILARITY MEASURES
Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous. The most popular metric for continuous features is the Euclidean distance

d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \| x_i - x_j \|_2        …(33)

which is a special case (p = 2) of the Minkowski metric

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \| x_i - x_j \|_p        …(34)

The Euclidean distance has an intuitive appeal, as it is commonly used to evaluate the proximity of objects in two- or three-dimensional space. It works well when a data set has "compact" or "isolated" clusters [Mao and Jain 1996]. The drawback of direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

d_M(x_i, x_j) = (x_i - x_j)\, \Sigma^{-1} (x_i - x_j)^T        …(35)

where the patterns x_i and x_j are assumed to be row vectors, and \Sigma is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; d_M(\cdot, \cdot) assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that the class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.
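A brief hedged sketch of the three metrics of equations (33)-(35); the example patterns and the use of the sample covariance for the Mahalanobis case are assumptions made only for illustration.

```python
import numpy as np

def minkowski(xi, xj, p=2):
    """Minkowski distance of eq. (34); p = 2 gives the Euclidean distance (33)."""
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def sq_mahalanobis(xi, xj, cov):
    """Squared Mahalanobis distance of eq. (35) for a given covariance matrix."""
    diff = xi - xj
    return float(diff @ np.linalg.inv(cov) @ diff)

# Example: patterns as rows of X, covariance estimated from the sample.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 2.5], [0.5, 1.0]])
cov = np.cov(X, rowvar=False)
print(minkowski(X[0], X[1]), minkowski(X[0], X[1], p=1))
print(sq_mahalanobis(X[0], X[1], cov))
```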

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to pre-compute all the n(n-1)/2 pairwise distance values for the n patterns and store them in a (symmetric) matrix. Computation of distances between patterns with some or all features being non-continuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992], and a good summary of similarity measures between trees is given by Zhang [1995]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995], and the conclusion was that syntactic methods are inferior in every aspect; therefore, we do not consider syntactic methods further in this paper.

There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called the context in Michalski and Stepp [1983]. The similarity between two points x_i and x_j, given this context, is given by

s(x_i, x_j) = f(x_i, x_j, \mathcal{E})        …(36)

where \mathcal{E} is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)        …(37)

where NN(x_i, x_j) is the neighbor number of x_j with respect to x_i. Figures 16 and 17 give an example. In Figure 16, the nearest neighbor of A is B, and B's nearest neighbor is A. So NN(A, B) = NN(B, A) = 1 and the MND between A and B is 2. However, NN(B, C) = 1 but NN(C, B) = 2, and therefore MND(B, C) = 3. Figure 17 was obtained from Figure 16 by adding three new points D, E, and F. Now MND(B, C) = 3 (as before), but MND(A, B) = 5. The MND between A and B has increased by introducing additional points, even though A and B have not moved. The MND is not a metric (it does not satisfy the triangle inequality [Zhang 1995]). In spite of this, MND has been successfully applied in several clustering applications [Gowda and Diday 1992]. This observation supports the viewpoint that the dissimilarity does not need to be a metric.

Watanabe's theorem of the ugly duckling [Watanabe 1985] states: "Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects." This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar unless we use some additional domain information. For example, in the case of conceptual clustering [Michalski and Stepp 1983], the similarity between x_i and x_j is defined as

s(x_i, x_j) = f(x_i, x_j, \mathcal{C}, \mathcal{E})        …(38)

where \mathcal{C} is a set of pre-defined concepts. This notion is illustrated with the help of Figure 18. Here, the Euclidean distance between points A and B is less than that between B and C. However, B and C can be viewed as "more similar" than A and B, because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure.

Figure 18: Conceptual similarity between points
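To make the mutual neighbor distance of equation (37) concrete, the short sketch below computes NN(x_i, x_j) as the rank of x_j among the neighbors of x_i (1 = nearest) and adds the two ranks; the three collinear example points and the use of Euclidean distance for ranking are illustrative assumptions.

```python
import numpy as np

def neighbor_number(X, i, j):
    """NN(x_i, x_j): the rank of x_j among the neighbors of x_i (1 = nearest)."""
    order = np.argsort(np.linalg.norm(X - X[i], axis=1))   # order[0] is x_i itself
    return int(np.where(order == j)[0][0])

def mnd(X, i, j):
    """Mutual neighbor distance of eq. (37)."""
    return neighbor_number(X, i, j) + neighbor_number(X, j, i)

# Three points A, B, C; with this spacing MND(A, B) = 2 and MND(B, C) = 3,
# matching the values discussed in the text.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
print(mnd(X, 0, 1), mnd(X, 1, 2))
```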

17. CLUSTERING TECHNIQUES:
Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 19 (other taxonometric representations of clustering methodology are possible; ours is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one). The taxonomy shown in Figure 19 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.
—Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster and successively merges clusters together until a stopping criterion is satisfied (see the sketch after this list). A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.
—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg [1973] considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 20. Here, the collection is divided into two groups using feature x1; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature x2, as depicted by the broken lines H1 and H2. The major problem with this algorithm is that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.
—Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.
—Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.
—Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of the data structures used in the algorithm's operations.
A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.
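As referenced in the agglomerative/divisive item above, the agglomerative strategy can be sketched in a few lines. The fragment below performs naive agglomerative clustering with a selectable single-link or complete-link merging criterion (both are discussed in Section 17.1); the stopping rule by target cluster count and the function names are illustrative assumptions.

```python
import numpy as np

def agglomerative(X, n_clusters=2, linkage="single"):
    """Naive agglomerative clustering: start from singleton clusters and
    repeatedly merge the closest pair under the chosen linkage."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances

    def cluster_dist(a, b):
        block = D[np.ix_(a, b)]
        return block.min() if linkage == "single" else block.max()

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]     # merge the closest pair of clusters
        del clusters[j]
    return clusters
```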

Figure 20 Monothetic partitional clustering

Figure 20 Points falling in three clusters

Figure 19 A taxonomy of clustering approaches

17.1. Hierarchical Clustering Algorithms:
The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 20. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 20 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 21. The dendrogram can be broken at different levels to yield different clusterings of the data.

Figure 21 The dendrogram obtained using the single-link algorithm

Figure 21 Two concentric clusters

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are the most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum-distance criteria. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]: it has a tendency to produce clusters that are straggly or elongated. There are two clusters in Figures 22 and 23 separated by a "bridge" of noisy patterns. The single-link algorithm produces the clusters shown in Figure 22, whereas the complete-link algorithm obtains the clustering shown in Figure 23. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled "*". The single-link algorithm is, on the other hand, more versatile than the complete-link algorithm: for example, it can extract the concentric clusters shown in Figure 21, which the complete-link algorithm cannot. From a pragmatic viewpoint, however, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].

17.2. Agglomerative Single-Link Clustering Algorithm:
(1) Place each pattern in its own cluster. Construct a list of inter-pattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order. (2) Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step. (3) The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a partition (clustering) identified by the simply connected components in the corresponding graph.

17.3. Agglomerative Complete-Link Clustering Algorithm:
(1) Place each pattern in its own cluster. Construct a list of inter-pattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order. (2) Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a completely connected graph, stop. (3) The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a partition (clustering) identified by the completely connected components in the corresponding graph. A short illustration of both linkage rules is given after this description.
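The following is a minimal sketch of the single-link and complete-link rules described above, using SciPy's standard hierarchical clustering routines rather than the explicit graph construction; the data set, the distance threshold, and the variable names are hypothetical and chosen only for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical two-dimensional pattern set (rows are patterns).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.8]])

# Single-link: cluster distance = minimum pairwise pattern distance.
Z_single = linkage(X, method='single', metric='euclidean')
# Complete-link: cluster distance = maximum pairwise pattern distance.
Z_complete = linkage(X, method='complete', metric='euclidean')

# Cutting the dendrogram at a dissimilarity level yields a partition.
labels_single = fcluster(Z_single, t=1.0, criterion='distance')
labels_complete = fcluster(Z_complete, t=1.0, criterion='distance')
print(labels_single, labels_complete)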

Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.

17.4. Hierarchical Agglomerative Clustering Algorithm:
(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster. (2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation. (3) If all patterns are in one cluster, stop. Otherwise, go to step 2. Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters. A naive sketch of the generic agglomerative procedure follows.
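The three steps above can be written down directly as a naive proximity-matrix procedure. The sketch below is a minimal illustration, assuming Euclidean distances and a selectable update rule (min for single-link, max for complete-link); it is quadratic in memory and intended only to mirror the description, not to replace an optimized library routine. The function name and return format are hypothetical.

import numpy as np

def agglomerate(X, link=min):
    """Generic hierarchical agglomerative clustering over patterns X (n x d).

    Returns the merge history as (cluster_id_a, cluster_id_b, distance).
    `link` chooses the proximity-matrix update rule:
    min -> single-link, max -> complete-link.
    """
    ids = list(range(len(X)))
    # Step 1: proximity matrix over singleton clusters.
    D = {(i, j): float(np.linalg.norm(X[i] - X[j]))
         for i in ids for j in ids if i < j}
    active = set(ids)
    merges = []
    while len(active) > 1:
        # Step 2: find and merge the most similar pair of clusters.
        (a, b), d = min(D.items(), key=lambda kv: kv[1])
        merges.append((a, b, d))
        active.remove(b)
        # Update proximities: dist(a u b, c) = link(dist(a,c), dist(b,c)).
        for c in active:
            if c == a:
                continue
            dac = D[(min(a, c), max(a, c))]
            dbc = D[(min(b, c), max(b, c))]
            D[(min(a, c), max(a, c))] = link(dac, dbc)
        # Remove all entries involving the absorbed cluster b.
        D = {k: v for k, v in D.items() if b not in k}
        # Step 3 is implicit: the loop ends when one cluster remains.
    return merges

# Hypothetical usage: single-link merge history for five random patterns.
print(agglomerate(np.random.rand(5, 2)))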

17.5. Partitional Algorithms:
A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key design decision. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.

17.5.1. Squared Error Algorithms:
The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering L of a pattern set X (containing K clusters) is

e^2(X, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2   .....(39)

where x_i^{(j)} is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster. The k-means algorithm is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g., there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations). The k-means algorithm is popular because it is easy to implement, and its time complexity is O(n), where n is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 24 shows seven two-dimensional patterns. If we start with patterns A, B, and C as the initial means around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The squared error criterion value is much larger for this partition than for the best partition {{A, B, C}, {D, E}, {F, G}} shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means.

Figure 24 The k-means algorithm is sensitive to the initial partition

17.5.2. Squared Error Clustering Method:
(1) Select an initial partition of the patterns with a fixed number of clusters and cluster centers. (2) Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable. (3) Merge and split clusters based on some heuristic information, optionally repeating step 2.

17.5.3. k-Means Clustering Algorithm:
(1) Choose k cluster centers to coincide with k randomly chosen patterns or k randomly defined points inside the hyper-volume containing the pattern set. (2) Assign each pattern to the closest cluster center. (3) Recompute the cluster centers using the current cluster memberships. (4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error. A minimal sketch of these four steps is given below.
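The following is a minimal NumPy sketch of steps (1)-(4) and of the squared error criterion of Eq. (39); the sample data, the value of k, and the tolerance are hypothetical, and a practical run would repeat the procedure from several random initializations and keep the partition with the lowest squared error, as discussed in Section 17.5.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means on patterns X (n x d) with a squared error criterion."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k cluster centers as randomly chosen patterns.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the cluster centers from current memberships.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # Squared error criterion, Eq. (39).
        error = float(((X - centers[labels]) ** 2).sum())
        # Step 4: stop when the decrease in squared error is minimal.
        if prev_error - error < tol:
            break
        prev_error = error
    return labels, centers, error

# Hypothetical usage on two well-separated groups of random patterns.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centers, sse = kmeans(X, k=2)
print(sse)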

Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value. Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA algorithm [Ball and Hall 1965] employs this technique of merging and splitting clusters. If ISODATA is given the "ellipse" partitioning shown in Figure 24 as an initial partitioning, it will produce the optimal three-cluster partitioning: ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small, and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}. Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973]; Symon [1977] describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters. A rough sketch of the split-and-merge idea is given below.
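The sketch below is a rough, hypothetical illustration of the split-and-merge refinement described above, applied to an existing k-means style partition: clusters whose variance exceeds one threshold are split along their dimension of largest spread, and pairs of clusters whose centroids are closer than another threshold are merged. The thresholds, the function name, and the splitting heuristic are assumptions for illustration only; ISODATA itself uses additional parameters and rules.

import numpy as np

def split_and_merge(X, labels, var_threshold=4.0, merge_threshold=1.0):
    """One split-and-merge pass over an integer label array (hypothetical thresholds)."""
    labels = labels.copy()
    # Split: any cluster whose total variance is too large is cut in two
    # along the coordinate with the largest spread.
    for j in np.unique(labels):
        members = np.where(labels == j)[0]
        pts = X[members]
        if pts.var(axis=0).sum() > var_threshold and len(members) > 1:
            dim = pts.var(axis=0).argmax()
            half = members[pts[:, dim] > np.median(pts[:, dim])]
            labels[half] = labels.max() + 1
    # Merge: any two clusters with very close centroids are fused.
    ids = list(np.unique(labels))
    centroids = {j: X[labels == j].mean(axis=0) for j in ids}
    for a in ids:
        for b in ids:
            if a < b and np.linalg.norm(centroids[a] - centroids[b]) < merge_threshold:
                labels[labels == b] = a
    return labels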

17.5.4. Graph-Theoretic Clustering:
The best-known graph-theoretic divisive clustering algorithm is based on constructing the minimal spanning tree (MST) of the data [Zahn 1971] and then deleting the MST edges with the largest lengths to generate clusters. Figure 25 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units. The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969], which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colourability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981]. A graph-oriented approach for non-hierarchical structures and overlapping clusters is presented in Ozawa [1985].

Figure 25 Using the minimal spanning tree to form clusters

The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbours. The DG contains all the neighbourhood information contained in the MST and the relative neighbourhood graph (RNG) [Toussaint 1980]. A minimal MST-based clustering sketch is given below.
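The MST-based divisive procedure above can be written compactly with SciPy's sparse-graph routines: build the MST over pairwise Euclidean distances, delete the longest edges, and read the clusters off as connected components. The data and the number of clusters below are hypothetical; only standard scipy.spatial and scipy.sparse.csgraph calls are used.

import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.sparse import csr_matrix

def mst_clusters(X, n_clusters=2):
    """Divisive MST clustering: cut the (n_clusters - 1) longest MST edges."""
    D = squareform(pdist(X))                      # dense pairwise Euclidean distances
    mst = minimum_spanning_tree(csr_matrix(D))    # sparse MST with n-1 edges
    mst = mst.toarray()
    # Remove the largest edges to split the tree into n_clusters components.
    for _ in range(n_clusters - 1):
        i, j = np.unravel_index(mst.argmax(), mst.shape)
        mst[i, j] = 0.0
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

# Hypothetical usage on two well-separated groups of points.
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 8.0])
print(mst_clusters(X, n_clusters=2))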

17.6. Mixture-Resolving and Mode-Seeking Algorithms:
The mixture-resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each distribution and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988]. More recently, the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm for missing-data problems [Dempster et al. 1977]) has been applied to the problem of parameter estimation. A recent book [Mitchell 1997] provides an accessible description of the technique. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster. Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.
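As a sketch of mixture-resolving clustering of this kind, one can fit a Gaussian mixture by EM and treat each pattern's most likely component as its cluster. The example below uses scikit-learn's GaussianMixture purely as an illustration; the data, the number of components, and the random seed are hypothetical, and the same idea could equally be coded as an explicit E-step/M-step loop.

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical patterns drawn from two Gaussian components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.5, size=(100, 2))])

# EM estimates the component means, covariances, and mixing weights.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

labels = gmm.predict(X)            # hard cluster assignment per pattern
posteriors = gmm.predict_proba(X)  # per-component "scores" used during rescoring
print(gmm.means_, labels[:5], posteriors[:2])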

17.7. Nearest Neighbour Clustering:
Since proximity plays a key role in our intuitive notion of a cluster, nearest-neighbour distances can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978]; it assigns each unlabelled pattern to the cluster of its nearest labelled neighbour pattern, provided the distance to that labelled neighbour is below a threshold. The process continues until all patterns are labelled or no additional labellings occur. The mutual neighbourhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbours. A small sketch of this labelling procedure follows.
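The following is a minimal sketch of the procedure just described: starting from a few labelled seed patterns, each unlabelled pattern is given the label of its nearest labelled neighbour whenever that distance is below a threshold. The seed labels, the threshold, and the function name are hypothetical.

import numpy as np

def nearest_neighbour_clustering(X, seed_labels, threshold):
    """Grow clusters from labelled seeds; seed_labels uses -1 for unlabelled patterns."""
    labels = np.asarray(seed_labels).copy()
    changed = True
    while changed:                      # repeat until no additional labelling occurs
        changed = False
        for i in np.where(labels == -1)[0]:
            labelled = np.where(labels != -1)[0]
            d = np.linalg.norm(X[labelled] - X[i], axis=1)
            j = labelled[d.argmin()]
            if d.min() < threshold:     # adopt the nearest labelled neighbour's cluster
                labels[i] = labels[j]
                changed = True
    return labels

# Hypothetical usage: two seed patterns labelled 0 and 1, the rest unlabelled.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
seeds = np.full(len(X), -1)
seeds[0], seeds[20] = 0, 1
print(nearest_neighbour_clustering(X, seeds, threshold=3.0))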

17.8. Fuzzy Clustering:
Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function [Zadeh 1965]. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.

17.8.1. Fuzzy Clustering Algorithm:
(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N x K membership matrix U. An element u_{ik} of this matrix represents the grade of membership of object x_i in cluster c_k. Typically, u_{ik} \in [0,1]. (2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

E^2(X, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \lVert x_i - c_k \rVert^2   .....(40)

where c_k is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U. (3) Repeat step 2 until entries in U do not change significantly.

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 26 illustrates the idea. The rectangles enclose two "hard" clusters in the data: H1 = {1,2,3,4,5} and H2 = {6,7,8,9}. A fuzzy clustering algorithm might produce the two fuzzy clusters F1 and F2 depicted by ellipses. The patterns will have membership values in [0,1] for each cluster. For example, fuzzy cluster F1 could be compactly described as a set of ordered pairs.

Figure 26 Fuzzy clusters

The ordered pairs (i, μ_i) in each cluster represent the ith pattern and its membership value μ_i in the cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

Fuzzy set theory was initially applied to clustering in Ruspini [1969]. The book by Bezdek [1981] is a good source for material on fuzzy clustering. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and centroids of clusters. A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries were presented in Dave [1992].
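A compact sketch of the fuzzy c-means iteration mentioned above is given below: it alternates between updating the cluster centers as membership-weighted centroids and updating the membership matrix U, and stops when U no longer changes significantly (step 3 of the algorithm in Section 17.8.1). The fuzzifier m = 2, the data, and the tolerances are hypothetical choices for illustration.

import numpy as np

def fuzzy_c_means(X, K, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means: returns the membership matrix U (N x K) and the centers."""
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((N, K))
    U /= U.sum(axis=1, keepdims=True)                 # each row sums to 1
    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted centroids
        # Distances of every pattern to every fuzzy cluster center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard FCM membership update for fuzzifier m.
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:             # entries in U no longer change significantly
            return U_new, centers
        U = U_new
    return U, centers

# Hypothetical usage; thresholding U at 0.5 would give a hard clustering.
X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4.0])
U, centers = fuzzy_c_means(X, K=2)
print(centers, (U > 0.5).sum(axis=0))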

17.9. Representation of Clusters:
In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski et al. [1981]. They suggested the following representation schemes:
(1) Represent a cluster of points by their centroid or by a set of distant points in the cluster. Figure 27 depicts these two ideas.

Figure 27 Representation of a cluster by points

(2) Represent clusters using nodes in a classification tree.
(3) Represent clusters by using conjunctive logical expressions. For example, the expression in Figure 28 stands for the logical statement 'X1 is greater than 3' and 'X2 is less than 2'.

Figure 28 Representation of clusters by a classification tree or by conjunctive statements

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 28 are equivalent: every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.

Data abstraction is useful in decision making because of the following: (1) It gives a simple and intuitive description of clusters which is easy for human comprehension. In both conceptual clustering [Michalski] and the clustering of Gowda and Diday [1992], this representation is obtained without using an additional step; these algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set, and these rules can be used to build fuzzy classifiers and fuzzy controllers. (2) It helps in achieving data compression that can be exploited further by a computer [Murty and Krishna 1980]. Figure 19(a) shows samples belonging to two chain-like clusters labeled 1 and 2. A partitional clustering like the k-means algorithm cannot separate these two structures properly. The single-link algorithm works well on this data, but is computationally expensive. So a hybrid approach may be used to exploit the desirable properties of both these algorithms. We obtain 8 sub-clusters of the data using the (computationally efficient) k-means algorithm. Each of these sub-clusters can be represented by its centroid, as shown in Figure 19(a). Now the single-link algorithm can be applied on these centroids alone to cluster them into 2 groups. The resulting groups are shown in Figure 19(b). Here, a data reduction is achieved by representing the sub-clusters by their centroids. (3) It increases the efficiency of the decision making task. In a cluster-based document retrieval technique [Salton 1991], a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also, in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making [Dorai and Jain 1995].

18. Evaluation of classification techniques:
A framework to evaluate classification techniques and to analyze them was proposed in a Fractal white paper [79]; it is covered in this paper using the following criteria:
• Statistical assumptions
• Data needs
• Complexity of deployment
• Model performance
• Model building time

18.1. Statistical Assumptions:
All parametric techniques make statistical assumptions about the data. In most real-life cases, these assumptions cannot be fully met. Pragmatism mixed with caution should help in getting the best out of a modeling technique. If there are multicollinearity issues with the data, for example, one should definitely explore the use of non-parametric techniques like neural networks or genetic algorithms for a possibly superior fit compared with parametric statistical methods. Similarly, if the sample has a skewed good-bad mix, discriminant and k-NN techniques are likely to underperform vis-à-vis the other techniques. A presence of complex non-linear relationships within the data precludes the use of linear techniques. In such situations, recursive partitioning and non-parametric techniques are likely to outperform most parametric statistical techniques.

18.2. Data Needs:
All techniques perform better if they are exposed to a large sample of representative data. An equal number of good and bad observations can also help in model building. However, in most practical situations, the availability of enough data points on both event types is difficult. Non-parametric and recursive partitioning techniques usually tend to be more data hungry than parametric techniques. As discussed in the previous section, discriminant analysis and k-NN techniques are strongly sensitive to the good-bad mix in the data. k-NN is also sensitive to the presence of irrelevant variables in model building. All non-parametric techniques have a tendency to overfit the model when the number of variables used for model building is large. In these cases, it might be useful to run parametric statistical techniques and delete unimportant variables from the analysis before proceeding to use non-parametric techniques.

Table 4 - A comparative study of credit scoring techniques
Source: Monserrat, Guillen, Count data models for a credit scoring system, 1992

18.3. Model Building Time:
Model building is an iterative process. It requires experimentation with alternative predictor variables and several different transformations. Model building is also a multi-stage process, and at each stage many variables could be dropped or altered in seeking a better fit. The time taken to train a model can influence the choice of technique in some cases. Parametric techniques take relatively little time for computing a model; linear models are the friendliest in this respect. Non-parametric techniques, on the other hand, can take inordinate amounts of time for model training. k-NN is an O(n^2) process and thus can take large amounts of time for large training data. Recursive partitioning techniques may take less time than non-parametric techniques but are slower compared to logistic regression. Model building time is also important because recalibration of models might be undertaken frequently in the light of additional data.

18.4. Transparency:
Transparency of the model plays an important role in the acceptance of the model by users. The black-box nature of non-parametric techniques is probably the most important ground for using other techniques. Classification trees provide the most user-friendly and intuitive output among classification techniques. Parametric models are also transparent and show the contribution of each variable to the score. In cases of multicollinearity, logistic and linear models might not truly reflect the importance of each variable, because another correlated variable might have accounted for the dependent variable by virtue of having entered the model earlier.

18.5. Deployment:
Practical considerations of deployment might sometimes rule out the use of some techniques. Deployment of non-parametric techniques can be cumbersome and might require writing programming code or using proprietary software components. Deployment of parametric and classification tree models is relatively simpler. An organization should weigh the incremental profit from a model against the incremental deployment cost and effort to decide on the choice of model for deployment.

19. Survey of classification techniques used in different speech recognition applications:
TABLE 5
Classification techniques adopted in different speech recognition applications
The abbreviations used in this table are as follows.


20. Comparison of classification techniques:
We summarize the most commonly used classifiers in Table 6. Many of them represent, in fact, an entire family of classifiers and allow the user to modify the associated parameters and criterion functions. All of these classifiers are admissible, in the sense that there exist some classification problems for which each of them is the best choice. The StatLog project [80] showed a large variability in their relative performances, proving that there is no overall optimal classification rule.

TABLE 6
Classification methods

21. Some of the well-known clustering algorithms are listed in Table 7 [81].

TABLE 7
Clustering algorithms

22. Conclusions:
In this overview paper, different classification techniques have been discussed. At the beginning of the paper, the taxonomy of the classification techniques was presented and explained. For each method, the advantages, disadvantages, and various application areas have been presented. The purpose of this paper is to provide, in brief, all the classification techniques used in the area of speech recognition for young researchers. The contributions of this paper are a survey of the different classification methods applied to different speech recognition applications, together with their evaluation criteria; comments on and properties of the different classification methods; and properties of and comments on the different clustering algorithms.

Acknowledgements:
Thanks are due to Prof. G. Krishna, Professor (Retd.), Indian Institute of Science, Bangalore, and Dr. M. Narshima Murthy, Professor, Dept. of Automation and Computer Science, Indian Institute of Science, Bangalore, for useful discussions while preparing this manuscript.


REFERENCES:
1) Anil K. Jain, "Statistical Pattern Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000.
2) A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review", IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(1):4-37, January 2000.
3) J. A. Bilmes, "A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", Technical Report TR-97-021, International Computer Science Institute, University of California, Berkeley, April 1998.
4) J. A. Anderson, P. R. Krishnaiah et al., "Logistic Discrimination", Handbook of Statistics, vol. 2, pp. 169-191, Amsterdam: North Holland, 1982.
5) Rabiner and Jung, "Fundamentals of Speech Recognition", Pearson Education, 1993.
6) R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, vol. 7, part II, pp. 179-188, 1936.
7) Dasarathy, B. V., "Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design", IEEE Transactions on Systems, Man and Cybernetics, Vol. 24, Issue 3, pp. 511-517, March 1994.
8) Girolami, M. and Chao He, "Probability density estimation from optimally condensed data samples", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, Issue 10, pp. 1253-1264, Oct. 2003.
9) Meijer, B. R., "Rules and algorithms for the design of templates for template matching", Pattern Recognition, 1992, Vol. 1, Conference A: Computer Vision and Applications, 11th IAPR International Conference on, pp. 760-763, Aug. 1992.
10) Hush, D. R., Horne, B. G., "Progress in supervised neural networks", IEEE Signal Processing Magazine, Vol. 10, Issue 1, pp. 8-39, Jan. 1993.
11) Vapnik, V., "The Nature of Statistical Learning Theory", Springer, 1995.
12) Julia Neumann, Christoph Schnorr, "SVM-based feature selection by direct objective minimization", 2004.
13) Lihong Zheng and Xiangjian, "Classification Techniques in Pattern Recognition", University of Technology, Australia, 2007.
14) L. R. Rabiner, J. G. Wilpon, A. M. Quinn, and S. G. Terrace, "On the application of embedded digit training to speaker independent connected digit recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 2, pp. 272-280, April 1984.
15) T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification", IEEE Transactions on Information Theory, vol. IT-13, pp. 21-27, 1967.
16) Y. Chen, Y. Hung, C. Fuh, "Fast Algorithm for Nearest Neighbor Search Based on a Lower Bound Tree", Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, July 2001.
17) R. Bellman and S. Dreyfus, "Applied Dynamic Programming", Princeton, NJ: Princeton University Press, 1962.
18) H. Silverman and D. Morgan, "The application of dynamic programming to connected speech recognition", IEEE ASSP Magazine, vol. 7, no. 3, pp. 6-25, 1990.
19) Alex Waibel and Kai-Fu Lee, "Readings in Speech Recognition", Morgan Kaufmann Publishers, San Mateo, Calif., 1990.
20) Rabiner and Jung, "HMM Tutorial", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 39, no. 5, pp. 272-280, April 1984.
21) Dat Tat Tran, "Fuzzy approaches to speech and speaker recognition", Ph.D. thesis, University of Canberra, Australia, May 2000.
22) W. S. McCulloch and W. H. Pitts, "A Calculus of Ideas Immanent in Nervous Activity", Bull. Math. Biophysics, 5, 115-133, 1943.
23) P. Gallinari, S. Thiria, R. Badran, and F. Fogelman-Soulie, "On the relationships between discriminant analysis and multilayer perceptrons", Neural Networks, vol. 4, pp. 349-360, 1991.
24) H. Asoh and N. Otsu, "An approximation of nonlinear discriminant analysis by multilayer neural networks", in Proc. Int. Joint Conf. Neural Networks, San Diego, CA, 1990, pp. III-211-III-216.
25) A. R. Webb and D. Lowe, "The optimized internal representation of multilayer classifier networks performs nonlinear discriminant analysis", Neural Networks, vol. 3, no. 4, pp. 367-375, 1990.
26) G. S. Lim, M. Alder, and P. Hadingham, "Adaptive quadratic neural nets", Pattern Recognition Letters, vol. 13, pp. 325-329, 1992.
27) S. Raudys, "Evolution and generalization of a single neuron: I. Single-layer perceptron as seven statistical classifiers", Neural Networks, vol. 11, pp. 283-296, 1998.
28) S. Raudys, "Evolution and generalization of a single neurone: II. Complexity of statistical classifiers and sample size considerations", Neural Networks, vol. 11, pp. 297-313, 1998.
29) F. Kanaya and S. Miyake, "Bayes statistical behavior and valid generalization of pattern classifying neural networks", IEEE Trans. Neural Networks, vol. 2, no. 4, pp. 471-475, 1991.
30) S. Miyake and F. Kanaya, "A neural network approach to a Bayesian statistical decision problem", IEEE Trans. Neural Networks, vol. 2, pp. 538-540, 1991.
31) D. G. Kleinbaum, L. L. Kupper, and L. E. Chambless, "Logistic regression analysis of epidemiologic data: Theory and practice", Commun. Statist. A, vol. 11, pp. 485-547, 1982.
32) F. E. Harrell and K. L. Lee, "A comparison of the discriminant analysis and logistic regression under multivariate normality", in Biostatistics: Statistics in Biomedical, Public Health, and Environmental Sciences, P. K. Sen, Ed., Amsterdam, The Netherlands: North Holland, 1985.
33) S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis", J. Amer. Statist. Assoc., vol. 73, pp. 699-705, 1978.
34) M. Schumacher, R. Robner, and W. Vach, "Neural networks and logistic regression: Part I", Comput. Statist. Data Anal., vol. 21, pp. 661-682, 1996.
35) W. Vach, R. Robner, and M. Schumacher, "Neural networks and logistic regression: Part II", Comput. Statist. Data Anal., vol. 21, pp. 683-701, 1996.
36) B. Cheng and D. Titterington, "Neural networks: A review from a statistical perspective", Statist. Sci., vol. 9, no. 1, pp. 2-54, 1994.
37) A. Ciampi and Y. Lechevallier, "Statistical models as building blocks of neural networks", Commun. Statist., vol. 26, no. 4, pp. 991-1009, 1997.
38) L. Holmstrom, P. Koistinen, J. Laaksonen, and E. Oja, "Neural and statistical classifiers - taxonomy and two case studies", IEEE Trans. Neural Networks, vol. 8, pp. 5-17, 1997.
39) A. Ripley, "Statistical aspects of neural networks", in Networks and Chaos - Statistical and Probabilistic Aspects, O. E. Barndorff-Nielsen, J. L. Jensen, and W. S. Kendall, Eds., London, U.K.: Chapman & Hall, 1993, pp. 40-123.
40) "Neural networks and related methods for classification", J. R. Statist. Soc. B, vol. 56, no. 3, pp. 409-456, 1994.
41) I. Sethi and M. Otten, "Comparison between entropy net and decision tree classifiers", in Proc. Int. Joint Conf. Neural Networks, vol. 3, 1990, pp. 63-68.
42) P. E. Utgoff, "Perceptron trees: A case study in hybrid concept representation", Connect. Sci., vol. 1, pp. 377-391, 1989.
43) A. Ripley, "Statistical aspects of neural networks", in Networks and Chaos - Statistical and Probabilistic Aspects, O. E. Barndorff-Nielsen, J. L. Jensen, and W. S. Kendall, Eds., London, U.K.: Chapman & Hall, 1993, pp. 40-123.
44) J. R. Statist. Soc. B, "Neural networks and related methods for classification", International Journal on Pattern Recognition, vol. 56, no. 3, pp. 409-456, 1994.
45) D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., "Machine Learning, Neural, and Statistical Classification", London, U.K.: Ellis Horwood, 1994.
46) D. E. Brown, V. Corruble, and C. L. Pittard, "A comparison of decision tree classifiers with back propagation neural networks for multimodal classification problems", Pattern Recognition, vol. 26, pp. 953-961, 1993.
47) S. P. Curram and J. Mingers, "Neural networks, decision tree induction and discriminant analysis: An empirical comparison", J. Oper. Res. Soc., vol. 45, no. 4, pp. 440-450, 1994.
48) A. Hart, "Using neural networks for classification tasks - Some experiments on datasets and practical advice", J. Oper. Res. Soc., vol. 43, pp. 215-226, 1992.
49) T. S. Lim, W. Y. Loh, and Y. S. Shih, "An empirical comparison of decision trees and other classification methods", Dept. Statistics, Univ. Wisconsin, Madison, Tech. Rep. 979, 1998.
50) E. Patwo, M. Y. Hu, and M. S. Hung, "Two-group classification using neural networks", Decis. Sci., vol. 24, no. 4, pp. 825-845, 1993.
51) M. S. Sanchez and L. A. Sarabia, "Efficiency of multi-layered feed-forward neural networks on classification in relation to linear discriminant analysis, quadratic discriminant analysis and regularized discriminant analysis", Chemometr. Intell. Labor. Syst., vol. 28, pp. 287-303, 1995.
52) V. Subramanian, M. S. Hung, and M. Y. Hu, "An experimental evaluation of neural networks for classification", Comput. Oper. Res., vol. 20, pp. 769-782, 1993.
53) R. Kohavi and D. H. Wolpert, "Bias plus variance decomposition for zero-one loss functions", in Proc. 13th Int. Conf. Machine Learning, 1996, pp. 275-283.
54) L. Atlas, R. Cole, J. Connor, M. El-Sharkawi, R. J. Marks II, Y. Muthusamy, and E. Barnard, "Performance comparisons between back propagation networks and classification trees on three real-world applications", in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed., San Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 622-629.
55) T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes", J. Artif. Intell. Res., vol. 2, pp. 263-286, 1995.
56) W. Y. Huang and R. P. Lippmann, "Comparisons between neural net and conventional classifiers", IEEE 1st Int. Conf. Neural Networks, San Diego, CA, 1987, pp. 485-493.
57) E. Patwo, M. Y. Hu, and M. S. Hung, "Two-group classification using neural networks", Decis. Sci., vol. 24, no. 4, pp. 825-845, 1993.
58) G. Cybenko, "Approximation by superpositions of a sigmoidal function", Math. Contr. Signals Syst., vol. 2, pp. 303-314, 1989.
59) K. Hornik, "Approximation capabilities of multilayer feedforward networks", Neural Networks, vol. 4, pp. 251-257, 1991.
60) K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators", Neural Networks, vol. 2, pp. 359-366, 1989.
61) M. D. Richard and R. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities", Neural Comput., vol. 3, pp. 461-483, 1991.
62) R. Solera-Ureña, J. Padrell-Sendra et al., "SVMs for Automatic Speech Recognition: A Survey", Signal Theory and Communications Department, EPS-Universidad Carlos III de Madrid, Avda. de la Universidad, 30, 28911-Leganés (Madrid), Spain.
63) B. E. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers", Computational Learning Theory, pp. 144-152, 1992.
64) F. Pérez-Cruz and O. Bousquet, "Kernel Methods and Their Potential Use in Signal Processing", IEEE Signal Processing Magazine, 21(3):57-65, 2004.
65) R. Fletcher, "Practical Methods of Optimization", Wiley-Interscience, New York, NY (USA), 1987.
66) Earl Gosh et al., "Pattern Recognition", School of Computer Science, Telecommunications and Information Systems, DePaul University, Prentice Hall of India, New Delhi.
67) Gray, R., "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, 1984.
68) Reynolds, D. A., "A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification", PhD thesis, Georgia Institute of Technology, 1992.
69) Reynolds, D. A., Rose, R. C., "Robust Text-Independent Speaker Identification using Gaussian Mixture Speaker Models", IEEE Transactions on Acoustics, Speech, and Signal Processing 3(1) (1995) 72-83.
70) McLachlan, G., Ed., "Mixture Models", Marcel Dekker, New York, NY, 1988.
71) Dempster, A., Laird, N., Rubin, D., "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society 39(1) 1-38, 1977.
72) Reynolds, D. A., Quatieri, T. F., Dunn, R. B., "Speaker Verification Using Adapted Gaussian Mixture Models", Digital Signal Processing, Vol. 10, pp. 19-41, 2000.
73) A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", The Ohio State University, ACM Computing Surveys, Vol. 31, No. 3, September 1999.
74) S. Watanabe, "Pattern Recognition: Human and Mechanical", Wiley, New York, 1985.
75) K. S. Fu, "A step towards unification of syntactic and statistical pattern recognition", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 5, no. 2, pp. 200-205, March 1983.
76) Anil K. Jain et al., "Statistical pattern recognition: A Review", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, no. 1, pp. 4-37, Jan 2000.
77) Monserrat, Guillen, Manuel, Artis, "Count data models for a credit scoring system", Third Meeting on the European Conference Series in Quantitative Economics and Econometrics on Econometrics of Duration, Count and Transition Models, Paris, December 1992.
78) Thomas, Lyn C., "A Survey of Credit and Behavioral Scoring: Forecasting financial risk of lending to consumers", University of Edinburgh, 2000.
79) A Fractal Whitepaper, "Comparative Analysis of Classification Techniques", September 2003.
80) D. Michie et al., "Machine Learning and Neural and Statistical Classification", Ellis Horwood, New York, 1994.
81) A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, 1988.
