Classification Techniques used in Speech Recognition Applications: A Review
M. A. Anusuya*1, S. K. Katti*2
*Department of Computer Science and Engineering
SJCE, Mysore, INDIA
1 anusuya_ma@yahoo.co.in, 2 skkatti@indiatimes.com
Abstract—The classification phase is one of the most active research and application areas of speech recognition. The literature is vast and growing. This paper summarizes some of the most important developments in the classification procedures used in speech recognition applications and presents the state of the art of classification techniques. Different classification techniques and their parameter estimation methods, properties, advantages and disadvantages, along with their application areas, are discussed for each classification method. Our purpose is to provide a synthesis of the published research in the area of speech recognition and to stimulate further research interest and effort in the identified topics. This paper presents an overview of several pattern classification methods available in the literature for speech recognition applications.
Keywords—Classification, Classifiers, Taxonomy, Bayes decision theory, Acoustic Phonetic approach, Template matching, Dynamic Time Warping (DTW), Vector Quantization (VQ), Hidden Markov Model (HMM), Artificial Neural Network (ANN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gaussian Mixture Modeling, Clustering techniques, Evaluations, Applications.
I. INTRODUCTION
CLASSIFICATION is one of the most frequently encountered decision-making tasks of human activity [1]. A classification problem arises when an object needs to be assigned to a predefined group or class on the basis of a number of observed attributes of that object. Many problems in business, science, industry and medicine can be treated as classification problems. The goal of this paper is to survey the core concepts and techniques of classification and analysis, a field with its roots in statistics and decision theory. Although significant progress has been made in classification and the related areas of speech recognition, a number of issues in applying classification techniques remain and have not been solved completely. In this paper, some theoretical as well as empirical issues of speech recognition classification methods are reviewed and discussed. The vast range of research topics and the extensive literature make it impossible for one review to cover all of the work in the field. This review aims to provide a summary of the most important advances in general classification methods.
Pattern recognition techniques are used to automatically classify physical objects (1D, 2D or 3D) or abstract multidimensional patterns (n points in d dimensions) into known or possibly unknown categories. A number of commercial pattern recognition systems exist for speech recognition, character recognition, handwriting recognition, document classification, fingerprint classification, speech and speaker recognition, white blood cell (leukocyte) classification and military target recognition, among others. Most machine vision systems employ pattern recognition techniques to identify objects for sorting, inspection and assembly. The most widely used classifiers are the nearest-neighbour rule, kernel methods such as SVM, KNN algorithms, Gaussian mixture modeling, the naïve Bayes classifier and decision trees.
1.1. Classification method design:
Classification is the final stage of pattern recognition. This is the stage where an automated system declares that the input object belongs to a particular category. There are many classification methods in the field. Classification method designs are based on the following concepts.
i) Classification
Assigning a class to a measurement, or equivalently, identifying the probabilistic source of a measurement. The only statistical model that is needed is the conditional model of the class variable given the measurement. This conditional model can be obtained from a joint model or it can be learned directly. The former approach is generative, since it models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models. The latter approach is discriminative, since it focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbour classifiers.
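As a compact illustration of the two strategies (the notation below is introduced here for exposition and does not appear in the original text): a generative classifier models the class-conditional density p(x|C) and the prior P(C) and classifies through Bayes' rule, whereas a discriminative classifier such as logistic regression models the posterior P(C|x) directly.

```latex
% Generative: model each class, then invert with Bayes' rule
\hat{C}(x) \;=\; \arg\max_{C} \; p(x \mid C)\, P(C)

% Discriminative (two-class logistic regression): model the posterior directly
P(C_1 \mid x) \;=\; \sigma\!\left(w^{\top} x + b\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```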
ii) Model selection
Choosing the parametric family for density estimation is an important part of model selection. This is harder than parameter estimation, since we have to take into account every member of each family in order to choose the best family.
a. Member-roster concept: Under this template-matching concept, a set of patterns belonging to the same class is stored in a classification system. When an unknown pattern is given as input, it is compared with the existing patterns and placed under the matching pattern class.
b. Common property concept: In this concept, the common properties of the patterns are stored in a classification system. When an unknown pattern arrives, the system checks its
extracted common properties against the common properties of the existing classes and places the pattern/object under the class with similar common properties.
c. Clustering concept: Here, the patterns of the targeted classes are represented as vectors whose components are real numbers. Using their clustering properties, we can easily classify an unknown pattern. If the target vectors are far apart in geometrical arrangement, it is easy to classify the unknown patterns. If they are nearby, or if there is any overlap in the cluster arrangement, more complex algorithms are needed to classify the unknown patterns. One simple algorithm based on the clustering concept is Minimum Distance Classification. This method computes the distance between the unknown pattern and the desired set of known patterns, determines which known pattern is closest to the unknown and, finally, places the unknown pattern under the known pattern to which it has minimum distance. This algorithm works well when the target patterns are far apart.
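As a minimal sketch of the minimum-distance idea just described (the names and the toy prototype vectors are illustrative assumptions, not taken from the paper), each class is represented by a prototype vector and an unknown pattern is assigned to the class whose prototype is nearest:

```python
import numpy as np

def minimum_distance_classify(x, prototypes):
    """Assign x to the class whose prototype vector is closest (Euclidean distance)."""
    distances = {label: np.linalg.norm(x - p) for label, p in prototypes.items()}
    return min(distances, key=distances.get)

# Toy prototypes, e.g. mean feature vectors of two well-separated classes
prototypes = {
    "class_A": np.array([0.0, 0.0]),
    "class_B": np.array([5.0, 5.0]),
}
print(minimum_distance_classify(np.array([0.8, 1.1]), prototypes))  # -> class_A
```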
1.2. Classifier design:
Classifiers are functions that use pattern matching to determine the closest match. After an optimal feature subset is selected, a classifier can be designed using various approaches. Roughly speaking, there are three different approaches [1,2]. The first approach is the simplest and most intuitive one and is based on the concept of similarity; template matching is an example. The second is a probabilistic approach. It includes methods based on the Bayes decision rule and on maximum-likelihood or density estimators. Three well-known methods are the K-nearest neighbour (KNN) rule, the Parzen window classifier and branch-and-bound (BnB) methods. The third approach is to construct decision boundaries directly by optimizing a certain error criterion. Examples are Fisher's linear discriminant, multilayer perceptrons, decision trees and support vector machines. Determining a suitable classifier for a given problem is still more an art than a science.
The first class of classifiers uses some similarity metric and assigns class labels so as to maximize similarity. Probabilistic methods, of which the Bayesian classifier is the best known, depend on the prior probabilities of the classes and the class-conditional densities of the instances. In addition to Bayesian classifiers, logistic classifiers belong to this type. Logistic classifiers deal with the unknown parameters based on maximum likelihood [3]; further details on logistic classifiers can be found in [4]. Geometric classifiers build decision boundaries by directly minimizing an error criterion. An example of these classifiers is Fisher's linear discriminant, which mainly aims to reduce the feature space to lower dimensions in the case of a huge number of features. It minimizes the mean squared error between the class labels and the tested instance. Neural networks are also examples of geometric classifiers.
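A short, hedged sketch of the similarity-based/probabilistic family discussed above, here a k-nearest-neighbour rule (the training data and the value of k are illustrative only):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Tiny illustrative training set: two clusters with labels "A" and "B"
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [4.9, 5.1], [5.2, 4.8]])
train_y = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.1, 0.2]), train_X, train_y, k=3))  # -> "A"
```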
1.3. Classification taxonomy:
Based on the available literature, Figure 1 and Figure 1a show the taxonomy of the different classifiers used for various speech recognition applications, organized first by classification technique and, alternatively, by the underlying density functions.
Figure 1a. Taxonomy based on class-conditional densities
2. Knowledge Based classification Method
Human knowledge of speech has to be expressed in terms of explicit rules. Acoustic-phonetic rules describe the words of the lexicon, the syntax of the language and so on, and this knowledge rests on phonetic and linguistic principles. Basically, two such approaches to speech recognition exist.
They are
• Acoustic Phonetic Approach
• Artificial Intelligence Approach
2.1. Acoustic Phonetic Approach [5]:
The acoustic phonetic approach is based on the theory of acoustic phonetics, which postulates that there exist finite, distinctive phonetic units in spoken language and that these phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum,
over time [5]. Even though the acoustic properties of phonetic units are highly variable, both across speakers and with neighbouring phonetic units (the so-called co-articulation of sounds), it is assumed that the rules governing this variability are straightforward and can readily be learned and applied in practical situations. Hence the first step in the acoustic phonetic approach to speech recognition is called the segmentation and labeling phase, because it involves segmenting the speech signal into discrete (in time) regions where the acoustic properties of the signal are representative of one (or possibly several) phonetic units (or classes), and then attaching one or more phonetic labels to each segmented region according to its acoustic properties. To actually do speech recognition, a second step attempts to determine a valid word (or string of words) from the sequence of phonetic labels produced in the first step that is consistent with the constraints of the speech recognition task (i.e. the words are drawn from a given vocabulary, the word sequence makes syntactic sense and has semantic meaning, etc.).
To illustrate the steps involved in the acoustic phonetic approach to speech recognition, consider the phoneme lattice shown in Figure 2. (A phoneme lattice is the result of the segmentation and labeling step of the recognition process, and represents a sequential set of phonemes that are likely matches to the spoken input speech.) The problem is to decode the phoneme lattice into a word string (one or more words) such that every instant of time is included in one of the phonemes in the lattice, and such that the word (or word sequence) is valid according to the rules of English syntax. (The symbol SIL stands for silence or a pause between sounds or words; the vertical position in the lattice, at any time, is a measure of the goodness of the acoustic match to the phonetic unit, with the highest unit having the best match.) With a modest amount of searching, one can derive the appropriate phonetic string SIL-AO-L-AX-B-AW-T corresponding to the word string "all about," with the phonemes L, AX, and B having been second or third choices in the lattice, and all other phonemes having been first choices.
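A hedged, highly simplified sketch of this kind of lexical access over a phoneme lattice follows (the lattice structure, the toy lexicon and the greedy matching rule are assumptions made for illustration; practical systems use probabilistic search over many competing hypotheses):

```python
# Each time slot holds phoneme candidates ranked best-first, loosely mimicking Figure 2.
lattice = [
    ["SIL"], ["AO"], ["R", "L"], ["IH", "AX"], ["P", "B"], ["AW"], ["T"],
]

lexicon = {
    "all":   ["AO", "L"],
    "about": ["AX", "B", "AW", "T"],
}

def word_fits(phonemes, lattice, start):
    """Check whether a word's phoneme string can be traced slot-by-slot from 'start'."""
    if start + len(phonemes) > len(lattice):
        return False
    return all(p in lattice[start + i] for i, p in enumerate(phonemes))

# Greedy decoding: skip silences, try every lexicon word at each position.
pos, decoded = 0, []
while pos < len(lattice):
    if lattice[pos] == ["SIL"]:
        pos += 1
        continue
    for word, phonemes in lexicon.items():
        if word_fits(phonemes, lattice, pos):
            decoded.append(word)
            pos += len(phonemes)
            break
    else:
        pos += 1  # no word matched at this slot; move on
print(decoded)  # -> ['all', 'about']
```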
This simple example illustrates well the difficulty of decoding phonetic units into word strings. This is the so-called lexical access problem. The real problem with the acoustic phonetic approach to speech recognition is the difficulty of obtaining a reliable phoneme lattice for the lexical access stage.
Fig. 3 shows a block diagram of the acoustic phonetic approach to speech recognition. The first step in the processing (a step common to all approaches to speech recognition) is the speech analysis system (the so-called feature measurement method), which provides an appropriate (spectral) representation of the characteristics of the time-varying speech signal. The most common techniques of spectral analysis are the class of filter bank methods and the class of linear predictive coding (LPC) methods. Both of these methods provide spectral descriptions of the speech over time. The next step in the processing is the feature-detection stage. The idea here is to convert the spectral measurements into a set of features that describe the broad acoustic properties of the different phonetic units. Among the features proposed for recognition are nasality (presence or absence of nasal resonance), frication (presence or absence of random excitation in the speech), formant locations (frequencies of the first three resonances), voiced/unvoiced classification (periodic or aperiodic excitation), and ratios of high- and low-frequency energy. Many proposed features are inherently binary (e.g. nasality, frication, voiced/unvoiced); others are continuous (e.g. formant locations, energy ratios). The feature detection stage usually consists of a set of detectors that operate in parallel and use appropriate processing and logic to decide on the presence or absence, or value, of a feature. The algorithms used for individual feature detectors are sometimes sophisticated ones that do a lot of signal processing, and sometimes they are rather trivial estimation procedures.
The third step in the procedure is the segmentation and labeling phase, whereby the system tries to find stable regions (where the features change very little over the region) and then to label each segmented region according to how well the features within that region match those of individual phonetic units. This stage is the heart of the acoustic phonetic recognizer and is the most difficult one to carry out reliably; hence various control strategies are used to limit the range of segmentation points and label possibilities. For example, for individual word recognition, the constraint that a word contains at least two phonetic units and no more than six phonetic units means that the control strategy need only consider solutions with between 1 and 5 internal segmentation points. Furthermore, the labeling strategy can exploit lexical constraints on words to consider only words with n phonetic units whenever the segmentation gives n-1 segmentation points. These constraints are often powerful ones that reduce the search space and significantly increase the performance (accuracy of segmentation and labeling) of the system.
The result of the segmentation and labeling step is usually a phoneme lattice (of the type shown in Figure 2), from which a lexical access procedure determines the best matching word or sequence of words. Other types of lattices (e.g. syllable, word) can also be derived by integrating vocabulary and syntax constraints into the control strategy, as discussed above. The quality of the match of the features within a segment to the phonetic units can be used to assign probabilities to the labels, which can then be used in a probabilistic lexical access procedure. The final output of the recognizer is the word or word sequence that best matches, in some well-defined sense, the sequence of phonetic units in the phoneme lattice.
Figure 3. Block diagram of an acoustic phonetic speech recognition system
2.1.1. General Discussion on the Acoustic Phonetic Approach:
A typical acoustic phonetic approach to ASR has the following steps (this is similar to the overview of the acoustic-phonetic approach presented by Rabiner (Rabiner and Juang, 1993), but it is defined here more broadly):
1. Speech is analyzed using any of the spectral analysis methods - Short Time Fourier Transform (STFT), Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), etc. - using overlapping frames with a typical size of 10-25 ms and a typical overlap of 5 ms.
2. Acoustic correlates of phonetic features are extracted from the spectral representation. For example, low-frequency energy may be calculated as an acoustic correlate of sonorancy, zero-crossing rate may be calculated as a correlate of frication, and so on (a small sketch of steps 1-2 is given after this list).
3. Speech is segmented by either finding transient locations using the spectral change across two consecutive frames, or using the acoustic correlates of source or manner classes to find segments with stable manner classes. The former approach, that is, finding acoustically stable regions using the locations of spectral change, has been followed by Glass et al. (Glass and Zue, 1988). The latter method of using broad manner class scores to segment the signal has been used by a number of researchers (Bitar, 1997; Liu, 1996; Fohr et al.; Carbonell et al., 1987). Multiple segmentations may be generated instead of a single representation, for example the dendrograms in the speech recognition method proposed by Glass (Glass and Zue, 1988). (The system built by Glass et al. is included here as an acoustic phonetic system because it fits the broad definition of the acoustic-phonetic approach, but this system uses very little knowledge of acoustic phonetics.)
4. Further analysis of the individual segmentations is carried out next, to either recognize each segment as a phoneme directly or find the presence or absence of individual phonetic features and use these intermediate decisions to find the phonemes. When multiple segmentations are generated instead of a single segmentation, a number of different phoneme sequences may be generated. The phoneme sequences that match the vocabulary and grammar constraints are used to decide upon the spoken utterance by combining the acoustic and language scores.
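A hedged sketch of steps 1-2 above (the frame sizes, the choice of correlates and the synthetic signal are illustrative assumptions, not the implementation used in the cited work): the waveform is cut into overlapping frames, and simple acoustic correlates such as low-frequency energy (a sonorancy cue) and zero-crossing rate (a frication cue) are computed per frame.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, step_ms=10):
    """Slice a waveform into overlapping analysis frames."""
    frame_len, step = int(fs * frame_ms / 1000), int(fs * step_ms / 1000)
    return np.array([x[i:i + frame_len]
                     for i in range(0, len(x) - frame_len + 1, step)])

def acoustic_correlates(frames, fs):
    """Per-frame low-frequency energy (sonorancy cue) and zero-crossing rate (frication cue)."""
    feats = []
    for fr in frames:
        spec = np.abs(np.fft.rfft(fr * np.hamming(len(fr))))
        freqs = np.fft.rfftfreq(len(fr), 1.0 / fs)
        low_energy = np.sum(spec[freqs < 400.0] ** 2)    # energy below 400 Hz
        zcr = np.mean(np.abs(np.diff(np.sign(fr)))) / 2  # zero-crossing rate
        feats.append((low_energy, zcr))
    return np.array(feats)

fs = 16000
t = np.arange(0, 0.3, 1.0 / fs)
signal = np.sin(2 * np.pi * 150 * t)                 # toy "voiced" segment
feats = acoustic_correlates(frame_signal(signal, fs), fs)
print(feats.shape)                                    # (number_of_frames, 2)
```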
2.1.2. Hurdles/Challenges in the acoustic-phonetic approach:
A number of problems have been associated with the acoustic-phonetic approach in the literature. Rabiner (Rabiner and Juang, 1993) lists at least five such problems or hurdles that have made the use of the approach minimal in the ASR community. The problems with the acoustic phonetic approach, and some ideas for solving them, provide much of the motivation for the present work. These documented problems of the acoustic-phonetic approach are now listed, and it is argued that either insufficient effort has gone into solving these problems or that the problems are not unique to the acoustic-phonetic approach.
a) It has been argued that the difficulty of properly decoding phonetic units into words and sentences grows dramatically with an increase in the rate of phoneme insertion, deletion and substitution. This argument assumes that phoneme units are recognized in a first pass with no knowledge of language and vocabulary constraints. This has been true for many of the acoustic phonetic methods, but it is not necessarily so, since vocabulary and grammar constraints may be used to constrain the speech segmentation paths (Glass et al., 1996).
b) Extensive knowledge of the acoustic manifestations of phonetic units is required, and the lack of completeness of this knowledge has been pointed out as a drawback of the knowledge-based approach. While it is true that the knowledge is incomplete, there is no reason to believe that the standard signal representations, for example Mel-Frequency Cepstral Coefficients (MFCCs), used in state-of-the-art ASR methods are sufficient to capture all the acoustic manifestations of the speech sounds. Although the knowledge is not complete, a number of efforts to find acoustic correlates of phonetic features have obtained excellent results. Most recently, there has been significant development in the research on the acoustic correlates of place of stop consonants and fricatives (Stevens et al., 1999; Ali, 1999; Bitar, 1997), nasal detection (Pruthi and Espy-Wilson, 2003), and semivowel classification (Espy-Wilson, 1994). The knowledge from these sources may be adequate to start building an acoustic-phonetic speech recognizer to carry out word recognition tasks, and that was the focus of this work. It should be noted that, because of the physical significance of the knowledge-based acoustic measurements, it is easy to pinpoint the source of recognition errors in the recognition system. Such an error analysis is close to impossible with MFCC-like front ends.
c) The third argument against the acoustic-phonetic approach is that the choice of phonetic features and their acoustic correlates is not optimal. It is true that linguists may not agree with each other on the optimal set of phonetic features, but finding the best set of features is a task that can be carried out instead of turning to other ASR methods. The phonetic feature
set used in this work will be based on distinctive feature theory, and it will be optimal in that sense.
d) Another drawback of the acoustic-phonetic approach, as pointed out in (Rabiner and Juang, 1993), is that the design of the sound classifiers is not optimal. This argument probably assumes that binary decision trees with hard, knowledge-based thresholds are used to carry out the decisions in the acoustic phonetic approach. Statistical pattern recognition methods can equally well be used for these decisions, and such a design is no less optimal.
2.1.3. Advantages and Disadvantages:
a) Advantages:
1) Not all acoustic-phonetic features are used for every decision.
2) Since the acoustic-phonetic features have a strong physical interpretation, it is easy to pinpoint the source of error in such a recognition system; in particular, it is easy to tell whether the pattern matcher has failed.
3) The method can easily take advantage of the years of research that have gone into acoustic phonetics, as well as into signal processing based on human auditory models.
b) Disadvantages:
The chosen phonemes are not only the first choices in the phonetic sequence, but also second (B and AX) and third (L) choices. Therefore, matching a phonetic sequence with a word or a group of words is not straightforward. In fact, this is the main disadvantage of this approach.
2.1.4. Applications:
1) Acoustic phonetic approach to speech recognition: Application to the Semivowels
2) Models of Phonetic Recognition: The Role of Analysis by Synthesis in Phonetic Recognition
3) The Influence of Phonetic Context on the Acoustic Properties of Stops
4) The Role of Syllable Structure in the Acoustic Realizations of Stops
5) A Semivowel Recognition System
6) Two-Dimensional Characterization of the Speech Signal and Its Potential Applications to Speech Processing
7) Recognition of Words from their Spellings: Integration of Multiple Knowledge Sources
2.2. Artificial Intelligence Approach [5]:
Historically there are two main approaches to AI. The classical approach (designing the AI) is based on symbolic reasoning - a mathematical approach in which ideas and concepts are represented by symbols such as words, phrases or sentences, which are then processed according to the rules of logic. The connectionist approach (letting the AI develop) is based on artificial neural networks, which imitate the way neurons work, and on genetic algorithms, which imitate inheritance and fitness to evolve better solutions to a problem with every generation.
The AI approach [5] to speech recognition is a hybrid of the acoustic phonetic approach and the pattern recognition approach, in that it exploits ideas and concepts of both methods. The artificial intelligence approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features. In particular, among the techniques used within this class of methods are: an expert system for segmentation and labeling, so that this crucial and most difficult step can be performed with more than just the acoustic information used by pure acoustic phonetic methods (in particular, methods that integrate phonemic, lexical, syntactic, semantic and even pragmatic knowledge into the expert system have been proposed and studied); learning and adapting over time (i.e. the concept that knowledge is often both static and dynamic, and that models must adapt to the dynamic component of the data); and the use of neural networks for learning the relationships between phonetic events and all known inputs (including acoustic, lexical, syntactic, semantic, etc.), as well as for discrimination between similar sound classes.
The basic idea of the artificial intelligence approach to speech recognition is to compile and incorporate knowledge from a variety of knowledge sources and to bring it to bear on the problem at hand. Thus, for example, the AI approach to segmentation and labeling would be to augment the generally used acoustic knowledge with phonemic knowledge, lexical knowledge, syntactic knowledge, semantic knowledge, and even pragmatic knowledge. The different knowledge sources required are as follows:
a) Acoustic knowledge - evidence of which sounds (predefined phonetic units) are spoken, on the basis of spectral measurements and the presence or absence of features.
b) Lexical knowledge - the combination of acoustic evidence so as to postulate words, as specified by a lexicon that maps sounds into words (or, equivalently, decomposes words into sounds).
c) Syntactic knowledge - the combination of words to form grammatically correct strings (according to a language model), such as sentences or phrases.
d) Semantic knowledge - understanding of the task domain so as to be able to validate sentences (or phrases) that are consistent with the task being performed, or which are consistent with previously decoded sentences.
e) Pragmatic knowledge - the inference ability necessary to resolve ambiguity of meaning based on the ways in which words are generally used.
2.2.1. Advantages and Disadvantages of the Artificial Intelligence approach:
a) Advantages:
i) AI has made some progress at imitating "subsymbolic" problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures
inside human and animal brains that give rise to this skill; and step-by-step reasoning that humans were often assumed to use when they solve puzzles, play board games or make logical deductions.
ii) By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.
iii) The search for more efficient problem-solving algorithms is a high priority for AI research.
b) Disadvantages:
i) For difficult problems, most of the algorithms in the artificial intelligence approach require enormous computational resources - most suffer from "combinatorial explosion": the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size.
ii) Intelligent systems are not like humans.
2.2.2. Applications:
1. Artificial intelligence approach to speech recognition
2. AI approach to chemical inference
3. AI approach to cognitive linguistics
4. AI approach to VLSI design
5. AI approach to machine learning
6. AI approach to reservoir
7. AI approach to automated office
3. Bayes Decision Theory:
Bayesian decision making refers to choosing the most likely class, given the value of the feature or features. The probabilities of class membership are calculated from Bayes' theorem. If the feature value is denoted by x and a class of interest is C, then P(x) is the probability distribution for feature x in the entire population and P(C) is the prior probability that a random sample is a member of class C. P(x|C) is the conditional probability of obtaining feature value x given that the sample belongs to class C. Bayes' theorem then gives the posterior probability that a sample with feature value x belongs to class C, denoted P(C|x), on the basis of the values of P(x|C), P(C) and P(x): P(C|x) = P(x|C)P(C)/P(x).
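A minimal sketch of this decision rule, assuming one-dimensional Gaussian class-conditional densities and illustrative priors (none of these numbers come from the paper):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Evaluate a univariate Gaussian density p(x | C)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Illustrative class models: prior P(C) and parameters of p(x|C)
classes = {
    "C1": {"prior": 0.6, "mean": 0.0, "var": 1.0},
    "C2": {"prior": 0.4, "mean": 3.0, "var": 1.5},
}

def bayes_decide(x):
    """Pick the class maximizing the posterior P(C|x) = p(x|C) P(C) / P(x)."""
    scores = {c: m["prior"] * gaussian_pdf(x, m["mean"], m["var"])
              for c, m in classes.items()}
    evidence = sum(scores.values())                    # P(x)
    posteriors = {c: s / evidence for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

print(bayes_decide(2.0))
```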
4. Database classification method:
In this classification, the patterns are stored in a database and the test signal is compared against the patterns stored in the database. Since the collection of trained patterns is stored in a database, this method is called the database classification method. It has been categorized as one of the important classification methods, namely the pattern recognition approach. In turn, this pattern recognition approach has been divided into two methods: template/DTW/supervised classification and unsupervised classification. Each of these methods is discussed in detail in the sections below.
4.1. Introduction to the Pattern Recognition approach:
Pattern recognition as a field of study developed significantly in the 1960s. It is very much an interdisciplinary subject, covering developments in the areas of statistics, engineering, artificial intelligence, computer science, psychology and physiology, among others. Watanabe [74] defines a pattern "as opposite of chaos"; it is an entity, vaguely defined, that could be given a name.
Pattern recognition is concerned with the classification of objects into categories, especially by machine. A strong emphasis is placed on the statistical theory of discrimination, but clustering also receives some attention. Hence it can be summed up in a single word: 'classification', both supervised (using class information to design a classifier - i.e. discrimination) and unsupervised (allocating to groups without class information - i.e. clustering). Its ultimate goal is to optimally extract patterns based on certain conditions and to separate one class from the others. Pattern recognition has often been achieved using linear and quadratic discriminants [6], the k-nearest neighbour classifier [7], the Parzen density estimator [8], template matching [9] and Neural Networks [10]. These methods are basically statistical. The problem in using these recognition methods lies in constructing the classification rule without having any idea of the distribution of the measurements in the different groups. The Support Vector Machine (SVM) [11] has gained prominence in the field of pattern classification and is competing strongly with other techniques such as template matching and Neural Networks for pattern recognition.
4.1.1. General Process of Pattern Recognition:
A pattern is a pair comprising an observation and a meaning. Pattern recognition is inferring meaning from observation. Designing a pattern recognition system is establishing a mapping from measurement space into the space of potential meanings. The basic components of pattern recognition are pre-processing, feature extraction and selection, classifier design, and optimization.
4.1.1a. Pre-processing:
The role of pre-processing is to segment the interesting pattern from the background. Generally, noise filtering, smoothing and normalization should be done in this step. The pre-processing also defines a compact representation of the pattern.
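As a hedged illustration of typical pre-processing operations of this kind (the pre-emphasis coefficient and the normalization choice are common defaults assumed here, not taken from the paper):

```python
import numpy as np

def preprocess(x, pre_emphasis=0.97):
    """Simple speech pre-processing: DC removal, pre-emphasis, peak normalization."""
    x = x - np.mean(x)                                   # remove DC offset
    x = np.append(x[0], x[1:] - pre_emphasis * x[:-1])   # boost high frequencies
    return x / (np.max(np.abs(x)) + 1e-12)               # scale to roughly [-1, 1]

fs = 16000
noisy = np.sin(2 * np.pi * 200 * np.arange(fs) / fs) + 0.01 * np.random.randn(fs)
clean = preprocess(noisy)
```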
4.1.1b. Feature Selection and Extraction:
Features should be easily computed, robust, insensitive to various distortions and variations in the signal, and rotationally invariant. Two kinds of features are used in pattern recognition problems. One kind of feature has a clear physical meaning, such as geometric, structural or statistical features. The other kind of feature has no physical meaning; such features are called mapping features. The advantage of physical features is that they need not deal with irrelevant features. The advantage of mapping features is that they make classification easier, because clear boundaries
will be obtained between the classes, but at the cost of increased computational complexity.
i) Feature selection is the task of selecting the best subset from the input space. Its ultimate goal is to select the optimal feature subset that can achieve the highest accuracy. Feature extraction, in contrast, is applied when no physical features can be obtained. Most feature selection algorithms involve a combinatorial search through the whole space. Usually, heuristic methods such as hill climbing have to be adopted, because the size of the input space is exponential in the number of features. Other methods divide the feature space into several subspaces which can be searched easily. There are basically two types of feature selection methods: filter and wrapper [12]. Filter methods select the best features according to some prior knowledge, without considering the bias of the subsequent induction algorithm, so these methods are performed independently of the classification algorithm and its error criteria (a small sketch of a filter-style selection rule is given after this list).
ii) In feature extraction, most methods are supervised. These approaches need some prior knowledge and labelled training samples. Two kinds of supervised methods are used: linear feature extraction and nonlinear feature extraction. Linear feature extraction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), projection pursuit, and Independent Component Analysis (ICA). Nonlinear feature extraction methods include kernel PCA, PCA networks, nonlinear PCA, nonlinear auto-associative networks, Multi-Dimensional Scaling (MDS), the Self-Organizing Map (SOM), and so forth.
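A compact, hedged sketch of these two stages, using a variance-based filter ranking for feature selection and PCA for linear feature extraction (the scoring rule, the number of retained features and the random toy data are illustrative assumptions only):

```python
import numpy as np

def filter_select(X, k):
    """Filter-style selection: rank features by variance and keep the top k (classifier-independent)."""
    scores = X.var(axis=0)
    return np.argsort(scores)[::-1][:k]

def pca_extract(X, n_components):
    """Linear feature extraction with PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top

X = np.random.randn(100, 10)           # 100 samples, 10 raw features
selected = filter_select(X, k=5)       # indices of the 5 highest-variance features
Z = pca_extract(X[:, selected], n_components=2)
print(selected, Z.shape)               # e.g. [...], (100, 2)
```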
4.1.1c. Classifier design:
After an optimal feature subset is selected, a classifier can be designed using various approaches. Roughly speaking, there are three different approaches [1]. The first approach is the simplest and most intuitive one and is based on the concept of similarity; template matching is an example. The second is a probabilistic approach. It includes methods based on the Bayes decision rule and on maximum-likelihood or density estimators. Three well-known methods are the K-nearest neighbour (KNN) rule, the Parzen window classifier and branch-and-bound (BnB) methods. The third approach is to construct decision boundaries directly by optimizing a certain error criterion. Examples are Fisher's linear discriminant, multilayer perceptrons, decision trees and support vector machines [13].
4.1.1d. Optimization:
Optimization is not a separate step; it is combined with several parts of the pattern recognition process. In pre-processing, optimization guarantees that the input patterns have the best quality [13]. In the feature selection and extraction part, optimal feature subsets are obtained using optimization techniques. Finally, the classification error rate is lowered in the classification part.
4.1.2. Steps in statistical pattern recognition:
i) Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.
ii) Data collection: making measurements on appropriate variables and recording details of the data collection procedure (ground truth).
iii) Initial examination of the data: checking the data, calculating summary statistics and producing plots in order to get a feel for the structure.
iv) Feature selection or feature extraction: selecting variables from the measured set that are appropriate for the task. These new variables may be obtained by a linear or nonlinear transformation of the original set (feature extraction). To some extent, the division between feature extraction and classification is artificial.
v) Unsupervised pattern classification or clustering: this may be viewed as exploratory data analysis and it may provide a successful conclusion to a study. On the other hand, it may be a means of pre-processing the data for a supervised classification procedure.
vi) Apply discrimination or regression procedures as appropriate: the classifier is designed using a training set of exemplar patterns.
vii) Assessment of results: this may involve applying the trained classifier to an independent test set of labeled patterns.
viii) Interpretation: the above is necessarily an iterative process; the analysis of the results may pose further hypotheses that require further data collection. Also, the cycle may be terminated at different stages: the questions posed may be answered by an initial examination of the data, or it may be discovered that the data cannot answer the initial question and the problem must be reformulated.
A block diagram of the canonic pattern recognition approach to speech recognition is shown in Figure 4; the recognition process has four steps, namely:
1. Parameter Estimation, in which a sequence of measurements is made on the input signal to define the test pattern. For speech signals the feature measurements are usually the output of some type of spectral analysis technique, such as a filter bank analyzer, a linear predictive coding analysis, or a discrete Fourier transform (DFT) analysis.
2. Pattern Training, in which one or more test patterns corresponding to speech sounds of the same class are used to create a pattern representative of the features of that class. The resulting pattern, generally called the reference pattern, can be an exemplar or template, derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the reference pattern.
3. Pattern Comparison, in which the unknown test pattern is compared with each (sound) class reference pattern and a measure of similarity (distance) between the test pattern and each reference pattern is computed. To compare speech patterns (which consist of a sequence of spectral vectors), we require both a local distance measure, in which local distance
is defined as the spectral "distance" between two well-defined spectral vectors, and a global time alignment procedure (often called a dynamic time warping algorithm), which compensates for different rates of speaking (time scales) of the two patterns.
4. Decision Logic, in which the reference pattern similarity scores are used to decide which reference pattern (or possibly which sequence of reference patterns) best matches the unknown test pattern. The factors that distinguish different pattern recognition approaches are the types of feature measurement, the choice of templates or models for the reference patterns, and the method used to create reference patterns and classify unknown test patterns.
Figure 4. Pattern recognition approach to speech recognition
4.1.3. Pattern recognition approach:
The four best known pattern recognition approaches are: i) the template approach, ii) the statistical approach, iii) the syntactic or structural approach, and iv) the neural network approach. These models are not necessarily independent, and sometimes the same pattern recognition methods exist with different interpretations. Attempts have been made to design hybrid systems involving multiple models [75]. A brief description and comparison is given below and discussed in Table 1.
TABLE 1. Pattern recognition models
4.1.4. Examples of pattern recognition applications:
Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally demanding. These applications include data mining, bioinformatics, etc., as shown in Table 2.
TABLE 2. Examples of pattern recognition applications
5. Template based approach:
One of the simplest and earliest approaches to pattern<br />
recognition is the template approach. Match<strong>in</strong>g is a generic<br />
operation <strong>in</strong> pattern recognition which is <strong>used</strong> to determ<strong>in</strong>e the<br />
similarity between two entities of the same type. In template<br />
match<strong>in</strong>g the template or prototype of the pattern to be<br />
recognized is available. The pattern to be recognized is<br />
matched aga<strong>in</strong>st the stored template tak<strong>in</strong>g <strong>in</strong>to account all<br />
allowable pose and scale changes.<br />
The major pattern recognition techniques for speech<br />
recognition are template method and Dynamic Time warp<strong>in</strong>g<br />
method(DTW). Template based approaches to speech<br />
recognition have provided a family of techniques that have<br />
advanced the field considerably dur<strong>in</strong>g the last six decades.<br />
The underly<strong>in</strong>g idea is simple. A collection of prototypical<br />
speech patterns are stored as reference patterns represent<strong>in</strong>g<br />
the dictionary of candidate words. Recognition is then
carried out by match<strong>in</strong>g an unknown spoken utterance with<br />
each of these reference templates and select<strong>in</strong>g the category of<br />
the best match<strong>in</strong>g pattern. Usually templates for entire words<br />
are constructed. This has the advantage that, errors due to<br />
segmentation or classification of smaller acoustically more<br />
variable units such as phonemes can be avoided. In turn, each<br />
word must have its own full reference template; template<br />
preparation and match<strong>in</strong>g become prohibitively expensive or<br />
impractical as vocabulary size <strong>in</strong>creases beyond a few<br />
hundred words. One key idea <strong>in</strong> template method is to derive<br />
typical sequences of speech frames for a pattern (a word) via<br />
some averag<strong>in</strong>g procedure, and to rely on the use of local<br />
spectral distance measures to compare patterns. Another key<br />
idea is to use some form of dynamic programm<strong>in</strong>g to<br />
temporally align patterns to account for differences in
speak<strong>in</strong>g rates across talkers as well as across repetitions of<br />
the word by the same talker.<br />
5.1. Introduction:<br />
A template is the representation of an actual segment of<br />
speech. It consists of a sequence of consecutive acoustic<br />
feature vectors (or frames), a transcription of the sounds or<br />
words it represents (typically one or more phonetic symbols),<br />
knowledge of neighbour<strong>in</strong>g templates (a template number if<br />
no templates overlap), and a tag with meta-<strong>in</strong>formation. The<br />
term template is often <strong>used</strong> for two fundamentally different<br />
concepts: either for the representation of a s<strong>in</strong>gle segment of<br />
speech with a known transcription, or for some sort of<br />
average of a number of different segments of speech. Both<br />
types of templates can be <strong>used</strong> <strong>in</strong> the DTW algorithm to<br />
compare them with a segment of <strong>in</strong>put speech. Us<strong>in</strong>g the latter<br />
type has the obvious advantage of reduc<strong>in</strong>g the number of<br />
templates and be<strong>in</strong>g more robust to outliers [14]. However,<br />
the averag<strong>in</strong>g is a model build<strong>in</strong>g step, which makes it more<br />
ak<strong>in</strong> to HMMs than to true example based recognition.<br />
5.1.2 Similarity and Distance methods <strong>used</strong> <strong>in</strong> Template<br />
approach:<br />
The first type of classifier uses the similarity between patterns to decide on a classification, so a similarity measure first has to be defined. The nearest mean classifier represents each class by the mean of its feature vectors; any unlabeled feature vector is then assigned to the class whose mean is nearest. Template matching uses a stored template to define each class label and finds the most similar template for classification. Another important classifier of this type uses the Nearest Neighbor (NN) algorithm [15, 16]. The data are represented as points in space, and classification is based on the Euclidean distance of the data to the labeled classes. For k-NN, the classifier examines the k nearest points and decides in favor of the majority.
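A minimal Python sketch of these two ideas is given below; the function names and the tiny two-class data set are hypothetical and serve only to illustrate the nearest-mean and k-NN decision rules.

```python
import numpy as np
from collections import Counter

def nearest_mean_classify(train_X, train_y, x):
    """Assign x to the class whose mean feature vector is closest (Euclidean)."""
    classes = np.unique(train_y)
    means = {c: train_X[train_y == c].mean(axis=0) for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(x - means[c]))

def knn_classify(train_X, train_y, x, k=3):
    """Assign x to the majority class among its k nearest training points."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy example with two hypothetical classes of 2-D feature vectors.
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y = np.array(["classA", "classA", "classB", "classB"])
print(nearest_mean_classify(X, y, np.array([0.1, 0.2])))   # classA
print(knn_classify(X, y, np.array([0.95, 1.05]), k=3))     # classB
```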
5.2 Advantages and Disadvantages of Template Method:<br />
a) Advantages:<br />
1. An <strong>in</strong>tr<strong>in</strong>sic advantage of template based recognition<br />
is that, it is not required to model the speech process.<br />
This is very convenient, s<strong>in</strong>ce our understand<strong>in</strong>g of<br />
speech is still limited, especially with respect to its<br />
transient nature.<br />
2. The ma<strong>in</strong> advantage is precisely the use of long<br />
temporal context: all the frames of the keyword<br />
template, as well as the <strong>in</strong>formation about their<br />
relative position, are <strong>used</strong> dur<strong>in</strong>g the Dynamic Time<br />
Warp<strong>in</strong>g (DTW) procedure. This provides an implicit<br />
model<strong>in</strong>g of co-articulation effects or speaker<br />
dependencies [9].<br />
3. This has the advantage that, errors due to<br />
segmentation or classification of smaller acoustically<br />
more variable units such as phonemes can be avoided.<br />
b) Disadvantages:<br />
1) Template match<strong>in</strong>g approaches fail to take advantage<br />
of large amount of tra<strong>in</strong><strong>in</strong>g data.<br />
2) They cannot model acoustic variabilities, except <strong>in</strong> a<br />
coarse way by assign<strong>in</strong>g multiple templates to each<br />
word;<br />
3) In practice they are limited to whole-word models,<br />
because it's hard to record or segment a sample<br />
shorter than a word - so templates are useful only <strong>in</strong><br />
small systems which can afford the luxury of us<strong>in</strong>g<br />
whole-word models.<br />
4) Each word must have its own full reference template;<br />
template preparation and match<strong>in</strong>g become<br />
prohibitively expensive or impractical as vocabulary<br />
size <strong>in</strong>creases beyond a few hundred words.<br />
5) It is difficult to test on large data sets.
6) A template must be supplied for every pattern.
5.3. Applications of template method:<br />
i) A multi-scale template method for shape detection with biomedical<br />
applications<br />
ii) Template match<strong>in</strong>g framework for detect<strong>in</strong>g geometrically<br />
transformed objects.<br />
iii) Template matching is one way of performing operations such as object recognition, identification or classification, and detection. Various template matching methods are reported in the literature, but they vary from application to application; no standard method has been developed yet.
iv) Adaptive template-match<strong>in</strong>g method for vessel wall<br />
boundary detection <strong>in</strong> brachial artery ultrasound (US) scans.<br />
6. Dynamic Time Warp<strong>in</strong>g(DTW):<br />
Dynamic time warp<strong>in</strong>g is an algorithm for measur<strong>in</strong>g<br />
similarity between two sequences which may vary <strong>in</strong> time or<br />
speed. For <strong>in</strong>stance, similarities <strong>in</strong> walk<strong>in</strong>g patterns would be<br />
detected, even if <strong>in</strong> one video, the person was walk<strong>in</strong>g slowly<br />
and if <strong>in</strong> another, he or she were walk<strong>in</strong>g more quickly, or<br />
even if there were accelerations and decelerations dur<strong>in</strong>g the<br />
course of one observation. A well known application has been<br />
automatic speech recognition, to cope with different speak<strong>in</strong>g<br />
speeds. In general, DTW is a method that allows a computer
to f<strong>in</strong>d an optimal match between two given sequences (e.g.<br />
time series) with certa<strong>in</strong> restrictions. The sequences are<br />
"warped" non-l<strong>in</strong>early <strong>in</strong> the time dimension to determ<strong>in</strong>e a<br />
measure of their similarity <strong>in</strong>dependent of certa<strong>in</strong> non-l<strong>in</strong>ear<br />
variations <strong>in</strong> the time dimension. This sequence alignment<br />
method is often <strong>used</strong> <strong>in</strong> the context of hidden Markov models.<br />
One example of the restrictions imposed on the match<strong>in</strong>g of<br />
the sequences is on the monotonicity of the mapp<strong>in</strong>g <strong>in</strong> the<br />
time dimension. Cont<strong>in</strong>uity is less important <strong>in</strong> DTW than <strong>in</strong><br />
other pattern match<strong>in</strong>g algorithms; DTW is an algorithm<br />
particularly suited to match<strong>in</strong>g sequences with miss<strong>in</strong>g<br />
<strong>in</strong>formation, provided there are long enough segments for<br />
match<strong>in</strong>g to occur. The optimization process is performed<br />
us<strong>in</strong>g dynamic programm<strong>in</strong>g, hence the name.<br />
Moreover, with<strong>in</strong> a word, there will be variation <strong>in</strong> the length<br />
of <strong>in</strong>dividual phonemes: Cassidy might be uttered with a long<br />
/A/ and short f<strong>in</strong>al /i/ or with a short /A/ and long /i/. The<br />
match<strong>in</strong>g process needs to compensate for length differences<br />
and take account of the non-l<strong>in</strong>ear nature of the length<br />
differences with<strong>in</strong> the words. The Dynamic Time Warp<strong>in</strong>g<br />
algorithm achieves this goal; it f<strong>in</strong>ds an optimal match<br />
between two sequences of feature vectors which allows for<br />
stretched and compressed sections of the sequence.<br />
6.1. Concepts of Dynamic Time Warp<strong>in</strong>g:<br />
Dynamic Time Warp<strong>in</strong>g is a pattern match<strong>in</strong>g algorithm<br />
with a non-l<strong>in</strong>ear time normalization effect. It is based on<br />
Bellman's pr<strong>in</strong>ciple of optimality [17], which implies that,<br />
given an optimal path w from A to B and a po<strong>in</strong>t C ly<strong>in</strong>g<br />
somewhere on this path, the path segments AC and CB are<br />
optimal paths from A to C and from C to B respectively. The<br />
dynamic time warp<strong>in</strong>g algorithm [18] creates an alignment<br />
between two sequences of feature vectors, (T_1, T_2, ..., T_N) and (S_1, S_2, ..., S_M). A distance d(i, j) can be evaluated between any two feature vectors T_i and S_j; this distance is referred to as the local distance. In DTW the global distance D(i, j) of any two feature vectors T_i and S_j is computed recursively by adding the local distance d(i, j) to the global distance already evaluated for the best predecessor. The best predecessor is the one that gives the minimum global distance D(i, j) at row i and column j:

$D(i,j) = d(i,j) + \min\big[\,D(i-1,j),\; D(i-1,j-1),\; D(i,j-1)\,\big]$    (1)
The computational complexity can be reduced by impos<strong>in</strong>g<br />
constra<strong>in</strong>ts that prevent the selection of sequences that cannot<br />
be optimal [18]. Global constra<strong>in</strong>ts affect the maximal overall<br />
stretch<strong>in</strong>g or compression. Local constra<strong>in</strong>ts affect the set of<br />
predecessors from which the best predecessor is chosen.<br />
Dynamic Time Warping (DTW) is used to establish a time-scale alignment between two patterns. It results in a time warping vector w describing the time alignment of segments of the two signals: it assigns a certain segment of the source signal to each of a set of regularly spaced synthesis instants in the target signal.
6.1.1. The DTW Grid:
We can arrange the two sequences of observations on the<br />
sides of a grid (Figure 5) with the unknown sequence on the<br />
bottom (six observations <strong>in</strong> the example) and the stored<br />
template up the left hand side (eight observations). Both<br />
sequences start on the bottom left of the grid. Inside each cell<br />
a distance measure is <strong>used</strong> for compar<strong>in</strong>g the correspond<strong>in</strong>g<br />
elements of the two sequences.<br />
Figure 5. An example DTW grid<br />
To f<strong>in</strong>d the best match between these two sequences we can<br />
f<strong>in</strong>d a path through the grid which m<strong>in</strong>imizes the total distance<br />
between them. The path shown <strong>in</strong> blue <strong>in</strong> Figure 5 gives an<br />
example. Here, the first and second elements of each sequence<br />
match together while the third element of the <strong>in</strong>put also<br />
matches best aga<strong>in</strong>st the second element of the stored pattern.<br />
This corresponds to a section of the stored pattern be<strong>in</strong>g<br />
stretched <strong>in</strong> the <strong>in</strong>put. Similarly, the fourth element of the<br />
<strong>in</strong>put matches both the second and third elements of the stored<br />
sequence: here a section of the stored sequence has been<br />
compressed <strong>in</strong> the <strong>in</strong>put sequence. Once an overall best path<br />
has been found the total distance between the two sequences<br />
can be calculated for this stored template.<br />
The procedure for comput<strong>in</strong>g this overall distance measure is<br />
to f<strong>in</strong>d all possible routes through the grid and for each one of<br />
these compute the overall distance. The overall distance is<br />
given <strong>in</strong> Sakoe and Chiba, Equation 1, as the m<strong>in</strong>imum of the<br />
sum of the distances between <strong>in</strong>dividual elements on the path<br />
divided by the sum of the warping function. The division is to make paths of different lengths comparable.
It should be apparent that for any reasonably sized sequences,<br />
the number of possible paths through the grid will be very<br />
large. In addition, many of the distance measures could be<br />
avoided s<strong>in</strong>ce the first element of the <strong>in</strong>put is unlikely to<br />
match the last element of the template for example. The DTW<br />
algorithm is designed to exploit some observations about the<br />
likely solution to make the comparison between sequences<br />
more efficient.<br />
6.1.2. Optimization in DTW:
The major optimizations to the DTW algorithm arise from<br />
observations on the nature of good paths through the grid.<br />
These are outl<strong>in</strong>ed <strong>in</strong> Sakoe and Chiba and can be summarized<br />
as:<br />
• Monotonic condition: the path will not turn back on<br />
itself, both the i and j <strong>in</strong>dexes either stay the same or<br />
<strong>in</strong>crease, they never decrease.<br />
• Cont<strong>in</strong>uity condition: The path advances one step at a<br />
time. Both i and j can only <strong>in</strong>crease by 1 on each step<br />
along the path.<br />
• Boundary condition: the path starts at the bottom left<br />
and ends at the top right.<br />
• Adjustment w<strong>in</strong>dow condition: a good path is<br />
unlikely to wander very far from the diagonal. The<br />
distance that the path is allowed to wander is the<br />
w<strong>in</strong>dow length r.<br />
• Slope constra<strong>in</strong>t condition: The path should not be<br />
too steep or too shallow. This prevents very short<br />
sequences match<strong>in</strong>g very long ones. The condition is<br />
expressed as a ratio n/m, where m is the number of steps in the x direction and n is the number in the y direction. After m steps in x you must make a step in y, and vice versa.
By applying these observations we can restrict the moves that can be made from any point in the path and so restrict the number of paths that need to be considered. For example, with a slope constraint of P=1, if a path has already moved one square up it must next move either diagonally or to the right. The power of the DTW algorithm goes beyond these observations, though. Instead of finding all possible routes through the grid which satisfy these constraints, the DTW algorithm works by keeping track of the cost of the best path to each point in the grid. During the match process the lowest cost path is not known, but it can be traced back once the end point is reached.
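As a minimal sketch of this procedure (assuming Euclidean local distances, the three standard predecessors of eq. (1), and an optional adjustment window; all function and variable names here are illustrative, not from the paper), the dynamic programming recursion can be written as follows:

```python
import numpy as np

def dtw_distance(T, S, window=None):
    """Global DTW distance between two sequences of feature vectors.

    Implements D(i, j) = d(i, j) + min(D(i-1, j), D(i-1, j-1), D(i, j-1)),
    optionally restricted to an adjustment window |i - j| <= window.
    """
    N, M = len(T), len(S)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        lo = 1 if window is None else max(1, i - window)
        hi = M if window is None else min(M, i + window)
        for j in range(lo, hi + 1):
            # local distance d(i, j) between the two frames
            d = np.linalg.norm(np.asarray(T[i - 1]) - np.asarray(S[j - 1]))
            D[i, j] = d + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[N, M]
```

Dividing the returned value by the length of the warping path (or by N + M) makes scores from templates of different lengths comparable, as discussed above.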
6.2. Advantages and Disadvantages of Dynamic Time<br />
Warp<strong>in</strong>g:<br />
a)Advantages:<br />
1) Works well for small number of templates (
7. Supervised versus unsupervised <strong>Classification</strong>/Learn<strong>in</strong>g<br />
<strong>Techniques</strong>:<br />
There are two ma<strong>in</strong> divisions of classification procedure <strong>in</strong><br />
pattern recognition: supervised classification (or<br />
discrim<strong>in</strong>ation) and unsupervised classification (sometimes <strong>in</strong><br />
the statistics literature simply referred to as classification or<br />
clustering). In supervised classification, a set of data samples (each consisting of measurements on a set of variables) is provided together with associated labels, the class types; these are used as exemplars in the classifier design. In unsupervised classification the data are not labeled, and the aim is to find groups in the data and the features that distinguish one group from another. Clustering techniques can also be used as part of a supervised classification scheme by defining prototypes: a clustering scheme may be applied to the data of each class separately, and representative samples for each group within the class (the group means, for example) are then used as the prototypes for that class.
7.1. Supervised Learn<strong>in</strong>g:<br />
In automatic pattern recognition, the term supervised<br />
learn<strong>in</strong>g/classification refers to the process of design<strong>in</strong>g a<br />
pattern classifier by us<strong>in</strong>g a tra<strong>in</strong><strong>in</strong>g set of patterns of known<br />
class to determ<strong>in</strong>e the choice of a specific decision mak<strong>in</strong>g<br />
technique for classify<strong>in</strong>g additional similar samples <strong>in</strong> future.<br />
The classifier <strong>in</strong> other words is designed us<strong>in</strong>g the tra<strong>in</strong><strong>in</strong>g<br />
data. To provide an unprejudiced estimate of the classifier's accuracy on new data, it must be tested on a separate test set of patterns for which the class of each pattern is known.
Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. In the supervised learning process, two types of analysis are carried out, namely parametric and non-parametric decision making (classification) methods.
7.1.1.There are several ways <strong>in</strong> which the standard supervised<br />
learn<strong>in</strong>g problem can be generalized:<br />
1. Semi-supervised learn<strong>in</strong>g: In this sett<strong>in</strong>g, the desired<br />
output values are provided only for a subset of the<br />
tra<strong>in</strong><strong>in</strong>g data. The rema<strong>in</strong><strong>in</strong>g data is unlabeled.<br />
2. Active learn<strong>in</strong>g: Instead of assum<strong>in</strong>g that all of the<br />
tra<strong>in</strong><strong>in</strong>g examples are given at the start, active<br />
learn<strong>in</strong>g algorithms <strong>in</strong>teractively collect new<br />
examples, typically by mak<strong>in</strong>g queries to a human<br />
user. Often, the queries are based on unlabeled data,<br />
which is a scenario that comb<strong>in</strong>es semi-supervised<br />
learn<strong>in</strong>g with active learn<strong>in</strong>g.<br />
3. Structured prediction: When the desired output value<br />
is a complex object, such as a parse tree or a labeled<br />
graph, then standard methods must be extended.<br />
4. Learn<strong>in</strong>g to rank: When the <strong>in</strong>put is a set of objects<br />
and the desired output is a rank<strong>in</strong>g of those objects,<br />
then aga<strong>in</strong> the standard methods must be extended.<br />
7.2. Advantages and Disadvantages of supervised learn<strong>in</strong>g:<br />
a)Advantages:<br />
1) Rules are written for you automatically. This is useful for<br />
large document sets.<br />
b) Disadvantages:<br />
1) It assigns documents to categories before generat<strong>in</strong>g the<br />
rules.<br />
2) Rules may not be as specific or accurate as those you write yourself.
3) It is prone to overfitting.
7.3. Challenges <strong>in</strong> supervised learn<strong>in</strong>g:<br />
In the classification problem the goal of the learning is to minimize the error with respect to the given inputs. These inputs, called the "training set", are the examples from which the agent tries to learn. But learning the training set well is not necessarily the best thing to do, since not all training sets have their inputs classified correctly. This can lead to problems if the algorithm used is powerful enough to memorize even the apparently "special cases" that do not fit the more general principles. This, too, can lead to overfitting, and it is a challenge to find algorithms that are both powerful enough to learn complex functions and robust enough to produce generalizable results.
7.4. Applications:<br />
• Bio<strong>in</strong>formatics<br />
• Database market<strong>in</strong>g<br />
• Handwrit<strong>in</strong>g recognition<br />
• Information retrieval<br />
o Learn<strong>in</strong>g to rank<br />
• Object recognition <strong>in</strong> computer vision<br />
• Optical character recognition<br />
• Spam detection<br />
• Pattern recognition<br />
• <strong>Speech</strong> recognition<br />
• Forecast<strong>in</strong>g Fraudulent F<strong>in</strong>ancial Statements<br />
8. Introduction to parametric representation:<br />
Parametric representation - Parametric statistics is a branch<br />
of statistics that assumes data come from a type of probability<br />
distribution and makes <strong>in</strong>ferences about the parameters of the<br />
distribution. Most well-known elementary statistical methods<br />
are parametric. Generally speak<strong>in</strong>g parametric methods make<br />
more assumptions than non-parametric methods. If those extra<br />
assumptions are correct, parametric methods can produce<br />
more accurate and precise estimates. They are said to have<br />
more statistical power. However, if those assumptions are<br />
<strong>in</strong>correct, parametric methods can be very mislead<strong>in</strong>g. For that<br />
reason they are often not considered robust. On the other hand,<br />
parametric formulae are often simpler to write down and<br />
faster to compute. In some, but def<strong>in</strong>itely not all cases, their<br />
simplicity makes up for their non-robustness, especially if<br />
care is taken to exam<strong>in</strong>e diagnostic statistics. Parametric<br />
decision mak<strong>in</strong>g refers to the situation <strong>in</strong> which we know or<br />
are willing to assume the general form of the probability
distribution function or density function for each class but not<br />
the values of the parameters such as the mean and variance.<br />
Before us<strong>in</strong>g these densities the values of the parameters have<br />
to be estimated.<br />
Most important parametric method <strong>used</strong> <strong>in</strong> speech recognition<br />
application is the hidden Markov Model.<br />
Stochastic model<strong>in</strong>g [97] entails the use of probabilistic<br />
models to deal with uncerta<strong>in</strong> or <strong>in</strong>complete <strong>in</strong>formation. In<br />
speech recognition, uncerta<strong>in</strong>ty and <strong>in</strong>completeness arise from<br />
many sources; for example, confusable sounds, speaker<br />
variabilities, contextual effects, and homophone words. Thus, stochastic models are a particularly suitable approach to speech recognition. The most popular stochastic approach today is hidden Markov modeling. A hidden Markov model is characterized by a finite state Markov model and a set of output distributions. The transition parameters in the Markov chain model temporal variabilities, while the parameters in the output distributions model spectral variabilities. These two types of variabilities are the essence of speech recognition.
8.1.Hidden Markov Model (statistical approach):<br />
Hidden Markov Models (HMMs) have dom<strong>in</strong>ated [19]<br />
automatic speech recognition for at least the last decade. The<br />
model’s success lies <strong>in</strong> its mathematical simplicity; efficient<br />
and robust algorithms have been developed to facilitate its<br />
practical implementation. However, there is noth<strong>in</strong>g uniquely<br />
speech-oriented about acoustic-based HMMs. Standard<br />
HMMs model speech as a series of stationary regions <strong>in</strong> some<br />
representation of the acoustic signal. <strong>Speech</strong> is a cont<strong>in</strong>uous<br />
process though, and ideally should be modeled as such.<br />
Furthermore, HMMs assume that state and phone boundaries<br />
are strictly synchronized with events <strong>in</strong> the parameter space,<br />
whereas in fact different acoustic and articulatory parameters
do not necessarily change value simultaneously at boundaries.<br />
8.1.1.Markov Models<br />
A Markov model is a probabilistic process over a f<strong>in</strong>ite set,<br />
{S_1, ..., S_k}, usually called its states. Each state transition generates a character from the alphabet of the process. We are interested in matters such as the probability of a given state coming up next, pr(x_t = S_i), and this may depend on the prior history up to t-1. In computing, such processes, if they are
reasonably complex and <strong>in</strong>terest<strong>in</strong>g, they are usually called<br />
Probabilistic F<strong>in</strong>ite State Automata (PFSA) or Probabilistic<br />
F<strong>in</strong>ite State Mach<strong>in</strong>es (PFSM) because of their close l<strong>in</strong>ks to<br />
determ<strong>in</strong>istic and non-determ<strong>in</strong>istic f<strong>in</strong>ite state automata as<br />
<strong>used</strong> <strong>in</strong> formal language theory.<br />
8.1.2. Types of Hidden Markov Models<br />
8.1.2a. Discrete HMMs:<br />
HMMs can be classified accord<strong>in</strong>g to the nature of the<br />
elements of the B matrix, which are distribution functions.<br />
Distributions are def<strong>in</strong>ed on f<strong>in</strong>ite spaces <strong>in</strong> the so called<br />
discrete HMMs. In this case, observations are vectors of<br />
symbols <strong>in</strong> a f<strong>in</strong>ite alphabet of N different elements. For each<br />
one of the Q vector components, a discrete density<br />
{w(k)/k=1,….N} is def<strong>in</strong>ed, and the distribution is obta<strong>in</strong>ed<br />
by multiply<strong>in</strong>g the probabilities of each component. Notice<br />
that this def<strong>in</strong>ition assumes that the different components are<br />
<strong>in</strong>dependent. Fig.6 shows an example of a discrete HMM with<br />
one-dimensional observations. Distributions are associated<br />
with model transitions.<br />
Figure 6: Example of a discrete HMM. A transition<br />
probability and an output distribution on the symbol set is<br />
associated with every transition.<br />
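For illustration, such a discrete HMM can be written down directly; the numbers below are invented purely as a toy example, and the small sampling routine simply generates a state/observation sequence from the model (each transition emitting one symbol, as described above).

```python
import numpy as np

# Toy discrete HMM with N = 2 states and an alphabet of 3 symbols (values invented).
pi = np.array([0.7, 0.3])            # initial state probabilities
A  = np.array([[0.8, 0.2],           # A[i, j] = P(next state j | current state i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],      # B[j, k] = P(symbol v_k | state j)
               [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)

def sample(pi, A, B, T=5):
    """Generate a state sequence and the observation sequence it emits."""
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)
    for _ in range(T):
        states.append(int(s))
        obs.append(int(rng.choice(B.shape[1], p=B[s])))
        s = rng.choice(len(A), p=A[s])
    return states, obs

print(sample(pi, A, B))
```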
8.1.2b. Continuous HMMs:
Another possibility is to def<strong>in</strong>e distributions as probability<br />
densities on cont<strong>in</strong>uous observation spaces. In this case,<br />
strong restrictions have to be imposed on the functional form<br />
of the distributions, <strong>in</strong> order to have a manageable number of<br />
statistical parameters to estimate. The most popular approach<br />
is to characterize the model transitions with mixtures of base<br />
densities g of a family G hav<strong>in</strong>g a simple parametric form.<br />
The base densities g є G are usually Gaussian or Laplacian,<br />
and can be parameterized by the mean vector and the<br />
covariance matrix. HMMs with these k<strong>in</strong>ds of distributions<br />
are usually referred to as continuous HMMs. In order to model complex distributions in this way, a large number of base densities has to be used in every mixture. This may require a very large
tra<strong>in</strong><strong>in</strong>g corpus of data for the estimation of the distribution<br />
parameters. Problems aris<strong>in</strong>g when the available corpus is not<br />
large enough can be alleviated by shar<strong>in</strong>g distributions among<br />
transitions of different models.<br />
8.1.2c. Semi-Cont<strong>in</strong>uous HMMs :<br />
In semi-cont<strong>in</strong>uous HMMs, all mixtures are expressed <strong>in</strong> terms<br />
of a common set of base densities. Different mixtures are<br />
characterized only by different weights. A common<br />
generalization of semi-cont<strong>in</strong>uous model<strong>in</strong>g consists of<br />
<strong>in</strong>terpret<strong>in</strong>g the <strong>in</strong>put vector y as composed of several<br />
components, each of which is associated
with a different set of base distributions. The components are<br />
assumed to be statistically <strong>in</strong>dependent; hence the<br />
distributions associated with model transitions are products of<br />
the component density functions. Computation of probabilities<br />
with discrete models is faster than with cont<strong>in</strong>uous models,<br />
nevertheless it is possible to speed up the mixture densities<br />
computation by apply<strong>in</strong>g vector quantization (VQ) on the<br />
Gaussians of the mixtures. Parameters of statistical models are
estimated by iterative learn<strong>in</strong>g algorithms <strong>in</strong> which the<br />
likelihood of a set of tra<strong>in</strong><strong>in</strong>g data is guaranteed to <strong>in</strong>crease at<br />
each step.<br />
8.2. HMM Constra<strong>in</strong>ts/Limitations for <strong>Speech</strong> <strong>Recognition</strong><br />
Systems:<br />
HMM have different constra<strong>in</strong>ts depend<strong>in</strong>g on the nature of<br />
the problem that has to be modeled. The ma<strong>in</strong> constra<strong>in</strong>ts<br />
needed <strong>in</strong> the implementation of speech Recognizers can be<br />
summarized <strong>in</strong> the follow<strong>in</strong>g assumptions [20]:<br />
1 – First order Markov cha<strong>in</strong> :<br />
In this assumption the probability of transition to a state<br />
depends only on the current state<br />
2 – Stationary state transitions:
This assumption states that the state transitions are time independent, so that we will have $a_{ij} = P(s_t = j \mid s_{t-1} = i)$ for all t.
3 – Observations <strong>in</strong>dependence:<br />
This assumption presumes that the observation emitted within a certain state depends only on the underlying Markov chain of states, without considering the effect of the occurrence of the other observations. Although this assumption is a poor one and deviates from reality, it works fine in modeling the speech signal. This assumption implies that

$P(o_t \mid o_{t-p}, \ldots, o_{t-1}, s_t) = P(o_t \mid s_t)$,

where p represents the considered history of the observation sequence. Then we will have

$P(O \mid S, \lambda) = \prod_{t=1}^{T} P(o_t \mid s_t) = \prod_{t=1}^{T} b_{s_t}(o_t)$.
4 – Left-Right topology constraint.
5 – Probability constraints:
Since we are dealing with probabilities, the usual normalization constraints apply to the model parameters; if the observations are discrete, the last integration becomes a summation.
6 – HMMs are well defined only for processes that are a function of one independent variable, such as time; they do not work satisfactorily for two variables.
7- The Maximum likelihood tra<strong>in</strong><strong>in</strong>g criterion <strong>used</strong> <strong>in</strong> HMM<br />
leads to poor discrim<strong>in</strong>ation between the acoustic models<br />
given limited tra<strong>in</strong><strong>in</strong>g data and correspond<strong>in</strong>gly limited<br />
models. Discrim<strong>in</strong>ation can be improved us<strong>in</strong>g the Maximum<br />
Mutual Information(MMI) tra<strong>in</strong><strong>in</strong>g criterion but this is more<br />
complex and difficult to implement properly. Because HMMs<br />
suffer from all these weaknesses, they can obta<strong>in</strong> good<br />
performances only by rely<strong>in</strong>g on context dependent phone<br />
models, i.e. tri-phone models.
8.3.Three Basic Problems for HMMs:<br />
There are three basic problems to be solved for HMMs[21].<br />
The parameter estimation problem is to tra<strong>in</strong> speech and<br />
speaker models, the evaluation problem is to compute<br />
likelihood functions for recognition and the decod<strong>in</strong>g<br />
problem is to determ<strong>in</strong>e the best fitt<strong>in</strong>g(unobservable) state<br />
sequence [Rabiner and Juang 1993, Huang et al. 1990].
i)The parameter estimation problem: This problem<br />
determ<strong>in</strong>es the optimal model parameters λ of the HMM<br />
accord<strong>in</strong>g to given optimization criterion. A variant of the EM<br />
algorithm, known as the Baum Welch algorithm, yields an<br />
iterative procedure to re-estimate the model parameters λ<br />
us<strong>in</strong>g the ML criterion [Baum 1972,Baum and Sell<br />
1968,Baum and Eagon 1967]. In the Baum-Welch algorithm,<br />
the unobservable data are the state sequence S and the<br />
observable data are the observation sequence O. The Q-<br />
function for the HMM is as follows<br />
$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})$    (2)
Computing P(S|O,λ) [Rabiner and Juang 1993, Huang et al. 1990], we obtain

$Q(\lambda, \bar{\lambda}) = \sum_{t=0}^{T-1} \sum_{s_t} \sum_{s_{t+1}} P(s_t, s_{t+1} \mid O, \lambda)\, \log\big[\, \bar{a}_{s_t s_{t+1}}\, \bar{b}_{s_{t+1}}(o_{t+1}) \,\big]$    (3)

where $\bar{\pi}_{s_1}$ is denoted by $\bar{a}_{s_0 s_1}$ for simplicity. Regrouping eq. (3) into three terms for the π, A, B coefficients, and applying
Lagrange multipliers, we obtain the HMM parameter estimation equations.

• For discrete HMMs:

$\bar{\pi}_i = \gamma_1(i)$,

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$    (4)

$\bar{b}_j(k) = \dfrac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$    (5)

where

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$,

$\xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$
• For continuous HMMs: the estimation equations for the π and A distributions are unchanged, but the output distribution B is estimated via the Gaussian mixture parameters given in eq. (6):

$\bar{\omega}_{jk} = \dfrac{\sum_{t=1}^{T} \eta_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{K} \eta_t(j,k)}, \qquad \bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \eta_t(j,k)\, x_t}{\sum_{t=1}^{T} \eta_t(j,k)}, \qquad \bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T} \eta_t(j,k)\, (x_t - \mu_{jk})(x_t - \mu_{jk})'}{\sum_{t=1}^{T} \eta_t(j,k)}$    (6)

where

$\eta_t(j,k) = \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \cdot \dfrac{\omega_{jk}\, N(x_t, \mu_{jk}, \Sigma_{jk})}{\sum_{k=1}^{K} \omega_{jk}\, N(x_t, \mu_{jk}, \Sigma_{jk})}$    (7)
Note that for practical implementation, a scal<strong>in</strong>g procedure<br />
[Rab<strong>in</strong>er and Juang 1993] is required to avoid number<br />
underflow on computers with ord<strong>in</strong>ary float<strong>in</strong>g-po<strong>in</strong>t number<br />
representations.<br />
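As a minimal numpy sketch of these re-estimation quantities (assuming the forward and backward variables of eqs. (9) and (10) are already available as T x N arrays; the function name and array layout are our own illustrative choices), the posteriors ξ_t(i, j) and γ_t(i) used in eqs. (4) and (5) can be computed as follows:

```python
import numpy as np

def posteriors(alpha, beta, A, B, obs):
    """Compute xi_t(i, j) and gamma_t(i) from forward/backward variables.

    alpha, beta : (T, N) arrays from eqs. (9) and (10)
    A           : (N, N) transition matrix
    B           : (N, K) discrete output distributions
    obs         : length-T sequence of observation symbol indices
    """
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # numerator of xi_t(i, j): alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()                           # normalise by P(O | lambda)
    gamma = alpha * beta
    gamma = gamma / gamma.sum(axis=1, keepdims=True)      # gamma_t(i) = P(s_t = i | O, lambda)
    return xi, gamma
```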
ii) The evaluation problem: How can we efficiently compute P(O/λ), the probability that the observation sequence O was produced by the model λ?
For solv<strong>in</strong>g this problem, we obta<strong>in</strong><br />
$P(O \mid \lambda) = \sum_{\text{all } S} P(O, S \mid \lambda) = \sum_{s_1, s_2, \ldots, s_T} \pi_{s_1} b_{s_1}(o_1)\, a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)$    (8)
An interpretation of the computation in (8) is the following. At time t=1 we are in state s_1 with probability π_{s_1}, and generate the symbol o_1 with probability b_{s_1}(o_1). A transition is made from state s_1 at time t=1 to state s_2 at time t=2 with probability a_{s_1 s_2}, and we generate the symbol o_2 with probability b_{s_2}(o_2). This process continues in this manner until the last transition at time T, from state s_{T-1} to state s_T, is made with probability a_{s_{T-1} s_T}, and we generate the symbol o_T with probability b_{s_T}(o_T). Figure 7 shows an N-state left-to-right HMM with ∆i set to 1.
Fig.7 The Markov generation Model<br />
To reduce computations, the forward and the backward variables are used. The forward variable α_t(i) is defined as

$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, s_t = i \mid \lambda)$,

which can be computed iteratively as

$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

and

$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(o_{t+1}), \quad 1 \le j \le N,\; 1 \le t \le T-1$    (9)
and the backward variable β_t(i) is defined as

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid s_t = i, \lambda)$,

which can be computed iteratively as

$\beta_T(i) = 1, \quad 1 \le i \le N$

and

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\; t = T-1, \ldots, 1$    (10)
Using these variables, the probability P(O/λ) can be computed from the forward variable, from the backward variable, or from both, as follows:

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$    (11)
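A minimal numpy sketch of this forward computation (eqs. (9) and (11)) is shown below; the function name is illustrative, and for long utterances the scaling procedure mentioned earlier, or a log-space formulation, should be used instead of this direct version.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                          # alpha_1(i) = pi_i b_i(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]  # recursion of eq. (9)
    return alpha, alpha[-1].sum()                         # P(O|lambda) = sum_i alpha_T(i), eq. (11)
```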
iii) The decod<strong>in</strong>g Problem: Given the observation sequence<br />
O and the model λ, how do we choose a correspond<strong>in</strong>g state<br />
sequence S that is optimal <strong>in</strong> some sense?<br />
This problem attempts to uncover the hidden part of the model.<br />
There are several possible ways to solve this problem, but the<br />
most widely <strong>used</strong> criterion is to f<strong>in</strong>d the s<strong>in</strong>gle best state<br />
sequence; this can be implemented by the Viterbi algorithm.
In practice, it is preferable to base recognition on the<br />
maximum likelihood state sequence s<strong>in</strong>ce this generalizes<br />
easily to the cont<strong>in</strong>uous speech case. This likelihood is<br />
computed us<strong>in</strong>g the same algorithm as forward algorithm<br />
except that the summation is replaced by a maximum<br />
operation.<br />
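For illustration, a minimal log-domain Viterbi decoder along the lines just described might look as follows; the names and structure are ours, and the max operation replaces the summation of the forward recursion exactly as stated above.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the single best state sequence and its log-likelihood."""
    T, N = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, N))                  # best log-score ending in state i at time t
    psi = np.zeros((T, N), dtype=int)         # back-pointers
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA          # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]                   # best final state
    for t in range(T - 1, 0, -1):                      # trace the back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())
```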
Comparison of Template and HMM methods:
Compared to template based approach, hidden Markov<br />
model<strong>in</strong>g is more general and has a firmer mathematical<br />
foundation. A template based model is simply a cont<strong>in</strong>uous<br />
density HMM, with identity covariance matrices and a slope<br />
constra<strong>in</strong>ed topology. Although templates can be tra<strong>in</strong>ed on<br />
fewer <strong>in</strong>stances, they lack the probabilistic formulation of full<br />
HMMs and typically underperform HMMs. Compared to<br />
knowledge based approaches; HMMs enable easy <strong>in</strong>tegration<br />
of knowledge sources <strong>in</strong>to a compiled architecture. A negative<br />
side effect of this is that HMMs do not provide much <strong>in</strong>sight<br />
on the recognition process. As a result, it is often difficult to<br />
analyze the errors of an HMM system <strong>in</strong> an attempt to<br />
improve its performance. Nevertheless, prudent <strong>in</strong>corporation<br />
of knowledge has significantly improved HMM based systems.<br />
8.4. Advantages and Disadvantages of HMM:<br />
a) Advantages:<br />
1) One of the most important advantages of HMMs is that they can easily be extended to deal with strong tasks.
2) In the tra<strong>in</strong><strong>in</strong>g stages, HMMs are dynamically assembled<br />
accord<strong>in</strong>g to the class sequence. For example, if the class<br />
sequence was my hat, then two models for each word<br />
would be l<strong>in</strong>ked, with the last state of the first l<strong>in</strong>k<strong>in</strong>g to<br />
the first state of the second. The re-estimation algorithm<br />
is then applied as usual. Once tra<strong>in</strong><strong>in</strong>g on that <strong>in</strong>stance is<br />
complete, the models are unl<strong>in</strong>ked aga<strong>in</strong>. When<br />
recognition is attempted, large HMMs are assembled<br />
from the smaller <strong>in</strong>dividual models. This is done by<br />
convert<strong>in</strong>g from a grammar <strong>in</strong>to a graph representation,<br />
then replac<strong>in</strong>g each node <strong>in</strong> the graph with the appropriate<br />
model. This process is called ``embedded re-estimation''.<br />
To f<strong>in</strong>d out what the class sequence was, the most<br />
probable path is calculated. The path traversed<br />
corresponds to a sequence of classes, which is our f<strong>in</strong>al<br />
classification.<br />
3) Because each HMM uses only positive data, they scale<br />
well. New words can be added without affect<strong>in</strong>g learnt<br />
HMMs. It is also possible to set up HMMs <strong>in</strong> such a way<br />
that they can learn <strong>in</strong>crementally. As mentioned above,<br />
grammar and other constructs can be built <strong>in</strong>to the system<br />
by us<strong>in</strong>g embedded re-estimation. This gives the<br />
opportunity for the <strong>in</strong>clusion of high-level doma<strong>in</strong><br />
knowledge, which is important for tasks like speech<br />
recognition where a great deal of doma<strong>in</strong> knowledge is<br />
available.<br />
4) Architecture-Basic characteristics of the mathematical<br />
frame work are useful for speech recognition.<br />
5) Completeness: advantages of the underly<strong>in</strong>g approach<br />
over specific knowledge based approaches<br />
6) Flexibility: Ways <strong>in</strong> which speech knowledge can be<br />
<strong>in</strong>corporated <strong>in</strong>to HMMs <strong>in</strong> the<br />
form of constra<strong>in</strong>ts on the basic flexible structure.<br />
b)Disadvantages:<br />
i) They make very large assumptions about the data.<br />
ii) They make the Markovian assumption: that the emission<br />
and the transition probabilities depend only on the current<br />
state. This has subtle effects; for example, the probability of<br />
staying in a given state falls off exponentially.
iii) The Gaussian mixture assumption for continuous-density hidden Markov models is a huge one; we cannot always assume that the values are distributed in a normal manner.
iv) The number of parameters that need to be set <strong>in</strong> an HMM<br />
is huge.<br />
v) The Viterbi algorithm allocates frames to states; the frames associated with a state can often change, causing further susceptibility of the parameters. Those involved in HMMs
often use the technique of ``parameter-ty<strong>in</strong>g'' to reduce the<br />
number of variables that need to be learnt by forc<strong>in</strong>g the<br />
emission probabilities <strong>in</strong> one state to be the same as those <strong>in</strong><br />
another. For example, if one had two words: cat and mad, then<br />
the parameters of the states associated with the ``a'' sound<br />
could be tied together.<br />
vi) As a result of the above, the amount of data that is required<br />
to tra<strong>in</strong> an HMM is very large. This can be seen by<br />
consider<strong>in</strong>g typical speech recognition corpora that are <strong>used</strong><br />
for tra<strong>in</strong><strong>in</strong>g. The TIMIT database for <strong>in</strong>stance, has a total of<br />
630 readers read<strong>in</strong>g a text; the ISOLET database for isolated<br />
letter recognition has 300 examples per letter. Many other<br />
doma<strong>in</strong>s do not have such large datasets readily available.<br />
vii) HMMs only use positive data to tra<strong>in</strong>. In other words,<br />
HMM tra<strong>in</strong><strong>in</strong>g <strong>in</strong>volves maximiz<strong>in</strong>g the observed probabilities.<br />
viii) In some doma<strong>in</strong>s, the number of states and transitions can<br />
be found us<strong>in</strong>g an educated guess or trial and error, <strong>in</strong> general,<br />
there is no way to determ<strong>in</strong>e this. Furthermore, the states and<br />
transitions depend on the class be<strong>in</strong>g learnt.<br />
ix) The concept learnt by a hidden Markov model is the<br />
emission and transition probabilities. If one is try<strong>in</strong>g to<br />
understand the concept learnt by the hidden Markov model,<br />
then this concept representation is difficult to understand. In<br />
speech recognition, this issue is of little significance, but <strong>in</strong><br />
other doma<strong>in</strong>s, it may be even more important than accuracy.<br />
x) First-order HMMs make Markovian assumptions of conditional dependence (i.e. being in a state depends upon the previous state).
xi) HMMs are well defined for processes that are a function of one independent variable, such as time, and are therefore essentially one dimensional.
xii) One major limitation of the statistical models is that they<br />
work well only when the underly<strong>in</strong>g assumptions are satisfied.<br />
The effectiveness of these methods depends to a large extent<br />
on the various assumptions or conditions under which the<br />
models are developed.<br />
8.5. Applications:<br />
1) First application of Markov Cha<strong>in</strong>s was made by Andrey<br />
Markov himself <strong>in</strong> the area of language model<strong>in</strong>g.<br />
2) Another example of Markov cha<strong>in</strong>s application <strong>in</strong><br />
l<strong>in</strong>guistics is stochastic language model<strong>in</strong>g.<br />
3) Use of Markov chains to generate random numbers that belong exactly to the desired distribution.
4) HMM for f<strong>in</strong>ancial economic applications<br />
5) HMM for Signature verification<br />
6) HMM for <strong>Speech</strong> and speaker recognition<br />
7) Hidden Markov Model <strong>in</strong> Intrusion Detection Systems<br />
8) HMM <strong>in</strong> bio <strong>in</strong>formatics<br />
9) HMM applications <strong>in</strong> bar code read<strong>in</strong>g<br />
10) HMM applications <strong>in</strong> computer vision<br />
9. Non-parametric techniques:
In most real problems, even the types of the density functions of interest are unknown. Looking at histograms, scatter
plots or tables of the data, or the application of statistical<br />
procedures may suggest that a particular type of the class<br />
density may be <strong>used</strong>, or they may <strong>in</strong>dicate that the data are not<br />
well fit by any of the standard types of densities or<br />
distributions. In this case, non parametric techniques are<br />
needed. There are different classification methods <strong>in</strong> non<br />
parametric techniques namely vector quantization, Artificial<br />
Neural Network, Support vector mach<strong>in</strong>es, K-Nearest<br />
Neighbor method and Gaussian Mixture Model<strong>in</strong>g methods.<br />
These methods are discussed <strong>in</strong> the follow<strong>in</strong>g sections.<br />
9.1. Advantages and disadvantages <strong>in</strong> Non Parametric<br />
Method:<br />
a) Advantages:<br />
(1) Nonparametric tests make less stringent demands on the
data. For standard parametric procedures to be valid, certa<strong>in</strong><br />
underly<strong>in</strong>g conditions or assumptions must be met,<br />
particularly for smaller sample sizes.<br />
(2) Nonparametric procedures can sometimes be <strong>used</strong> to get a<br />
quick answer with little calculation.<br />
3) Nonparametric methods provide an air of objectivity when<br />
there is no reliable (universally recognized) underly<strong>in</strong>g scale<br />
for the orig<strong>in</strong>al data and there is some concern that the results<br />
of standard parametric techniques would be criticized for their<br />
dependence on an artificial metric.<br />
4) One of the key advantages of non-parametric techniques is<br />
that they do not make any statistical assumptions about data.<br />
b) Disadvantages:<br />
1) The major disadvantage of nonparametric techniques is<br />
conta<strong>in</strong>ed <strong>in</strong> its name. Because the procedures are<br />
nonparametric, there are no parameters to describe and it<br />
becomes more difficult to make quantitative statements about<br />
the actual difference between populations.<br />
2) The second disadvantage is that nonparametric procedures<br />
throw away <strong>in</strong>formation. Because <strong>in</strong>formation is discarded,<br />
nonparametric procedures can never be as powerful (able to<br />
detect exist<strong>in</strong>g differences) as their parametric counterparts<br />
when parametric tests can be <strong>used</strong>.<br />
9.3. Applications of Non parametric methods:<br />
1) <strong>Speech</strong> recognition applications<br />
2) Chi-square applications<br />
3) Efficiency analysis of the models<br />
4) Analysis of Hedonic Models<br />
5) Data m<strong>in</strong><strong>in</strong>g<br />
6) Cl<strong>in</strong>ical applications<br />
10. Vector quantization [5]:<br />
Vector Quantization(VQ)[97] is often applied to ASR. It is a<br />
system for mapp<strong>in</strong>g a sequence of cont<strong>in</strong>uous or discrete<br />
vectors <strong>in</strong>to a digital sequence suitable for communication<br />
over or storage in a digital channel. The goal of this system is data compression: to reduce the bit rate so as to minimize communication channel capacity or digital storage memory requirements while maintaining the necessary fidelity of the data.
10.1. Introduction to vector quantization:<br />
Vector quantization is a classical quantization technique<br />
from signal process<strong>in</strong>g which allows the model<strong>in</strong>g of<br />
probability density functions by the distribution of prototype<br />
vectors. It was orig<strong>in</strong>ally <strong>used</strong> for data compression. It works<br />
by divid<strong>in</strong>g a large set of po<strong>in</strong>ts (vectors) <strong>in</strong>to groups hav<strong>in</strong>g<br />
approximately the same number of po<strong>in</strong>ts closest to them.<br />
Each group is represented by its centroid po<strong>in</strong>t, as <strong>in</strong> k-means<br />
and some other cluster<strong>in</strong>g algorithms.<br />
The density match<strong>in</strong>g property of vector quantization is<br />
powerful, especially for identify<strong>in</strong>g the density of large and<br />
high-dimensioned data. S<strong>in</strong>ce data po<strong>in</strong>ts are represented by<br />
the <strong>in</strong>dex of their closest centroid, commonly occurr<strong>in</strong>g data<br />
have low error, and rare data high error. This is why VQ is<br />
suitable for lossy data compression. It can also be <strong>used</strong> for<br />
lossy data correction and density estimation. Vector<br />
quantization is based on the competitive learning paradigm, so it is closely related to the self-organizing map model.
The results of either the filter-bank analysis or the LPC analysis are a series of vectors characteristic of the time-varying spectral properties of the speech signal. The spectral vectors are denoted v_l, l = 1, 2, ..., L, where each vector is typically p-dimensional. If we compare the information rate of the vector representation to that of the raw speech waveform, we see that the spectral analysis has significantly reduced the required information rate: a rate of 160,000 bps (for example, 10,000 samples per second at 16 bits per sample) is required to store the speech samples in uncompressed format. For the spectral analysis, consider vectors of dimension p = 10 using 100 spectral vectors per second. If we again represent each spectral component to 16-bit precision, the required storage is about 100 x 10 x 16 bps, or 16,000 bps, about a 10-to-1 reduction over the uncompressed signal. Such compressions in storage rate are impressive.
Based on the concept of ultimately needing only a single spectral representation for each basic speech unit, it may be possible to further reduce the raw spectral representation of speech to vectors drawn from a small, finite number of unique spectral vectors, each corresponding to one of the basic speech units. This ideal representation is of course impractical because there is so much variability in the spectral properties of each of the basic speech units. However, the concept of building a codebook of distinct analysis vectors, albeit with significantly more code words than the basic set of phonemes, remains an attractive idea and is the basis behind the set of techniques commonly called vector quantization methods. Based on this line of reasoning, assume that we require a codebook with about 1024 unique spectral vectors. Then to represent an arbitrary spectral vector, all we need is a 10-bit number: the index of the codebook vector that best matches the input vector. Assuming a rate of 100 spectral vectors per second, we see that a total bit rate of about 1000 bps is required to represent the spectral vectors of a speech signal. This rate is about 1/16th the rate required by the continuous spectral vectors. Hence the vector quantization representation is potentially an extremely efficient representation of the spectral information in the speech signal.
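To make the arithmetic above concrete, the short Python sketch below recomputes the three storage rates and the resulting compression ratios; the sampling rate and precisions are the illustrative figures quoted above, not prescriptions.

import math

# bit-rate comparison for raw samples, spectral vectors, and VQ indices
sample_rate = 10000                                # assumed samples per second
bits_per_sample = 16
raw_rate = sample_rate * bits_per_sample           # 160,000 bps, uncompressed

vectors_per_second = 100
dimension = 10                                     # p = 10 spectral components
bits_per_component = 16
spectral_rate = vectors_per_second * dimension * bits_per_component   # 16,000 bps

codebook_size = 1024
bits_per_index = int(math.log2(codebook_size))     # 10-bit codebook index
vq_rate = vectors_per_second * bits_per_index      # 1,000 bps

print(raw_rate, spectral_rate, vq_rate)
print("spectral vs raw:", raw_rate / spectral_rate)    # about 10 : 1
print("VQ vs spectral:", spectral_rate / vq_rate)      # about 16 : 1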
10.1.1. Elements of a vector quantization implementation:
To build a vector quantizer and implement a VQ analysis procedure we need the following:
• a large set of spectral analysis vectors v_1, v_2, ..., v_L, which form a training set. The training set is used to create the optimal set of codebook vectors for representing the spectral variability observed in the training set. If we denote the size of the VQ codebook as M = 2^B vectors, then we require L >> M so as to be able to find the best set of M codebook vectors in a robust manner. In practice, it has been found that L should be at least 10M in order to train a VQ codebook that works reasonably well.
• a measure of similarity, or distance, between a pair of spectral analysis vectors, so as to be able to cluster the training-set vectors as well as to associate or classify arbitrary spectral vectors into unique codebook entries. We denote the spectral distance d(v_i, v_j) between two vectors v_i and v_j as d_ij. We defer a discussion of spectral distance measures.
• a centroid computation procedure. On the basis of the partitioning that classifies the L training-set vectors into M clusters, we choose the M codebook vectors as the centroids of each of the M clusters.
• a classification procedure for arbitrary speech spectral analysis vectors that chooses the codebook vector closest to the input vector and uses the codebook index as the resulting spectral representation. This is often referred to as the nearest-neighbor labeling or optimal encoding procedure. The classification procedure is essentially a quantizer that accepts as input a speech spectral vector and provides as output the codebook index of the codebook vector that best matches the input. Figure 8 shows the basic VQ training and classification structure.
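A minimal sketch of these four elements follows, assuming NumPy is available and using plain k-means as the clustering step (the text above does not prescribe a particular training algorithm; the LBG algorithm is a common alternative). The Euclidean distance stands in for the spectral distance measure d(v_i, v_j), and all data values are illustrative.

import numpy as np

def train_codebook(training_vectors, M, iterations=20, seed=0):
    """Cluster L training vectors into M codebook vectors (the centroids)."""
    rng = np.random.default_rng(seed)
    L = len(training_vectors)
    codebook = training_vectors[rng.choice(L, size=M, replace=False)]
    for _ in range(iterations):
        # nearest-neighbor labeling of every training vector
        d = np.linalg.norm(training_vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # centroid computation for every cluster
        for m in range(M):
            members = training_vectors[labels == m]
            if len(members) > 0:
                codebook[m] = members.mean(axis=0)
    return codebook

def encode(vector, codebook):
    """Return the index of the codebook vector closest to the input vector."""
    return int(np.linalg.norm(codebook - vector, axis=1).argmin())

# toy usage: L = 10 * M training vectors of dimension p = 10
p, M = 10, 64
train = np.random.randn(10 * M, p)
cb = train_codebook(train, M)
idx = encode(np.random.randn(p), cb)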
Figure 8. The basic VQ training and classification structure.
10.2. Advantages and Disadvantages of Vector<br />
Quantization:<br />
a)Advantages:<br />
1) Reduced storage for spectral analysis <strong>in</strong>formation. This<br />
efficiency can be exploited <strong>in</strong> a number of ways <strong>in</strong> practical<br />
vector quantization based speech recognition systems.<br />
2) Reduced computation for determ<strong>in</strong><strong>in</strong>g similarity of spectral<br />
analysis vectors. In speech recognition a major component of<br />
the computation is the determ<strong>in</strong>ation of spectral similarity<br />
between a pair of vectors. Based on the vector quantization<br />
representation this spectral similarity computation is often<br />
reduced to a table lookup of similarities between pairs of<br />
codebook vectors.<br />
3) Discrete representation of speech sounds. By associating a phonetic label (or possibly a set of phonetic labels, or a phonetic class) with each codebook vector, the process of
choos<strong>in</strong>g a best codebook vector to represent a given spectral<br />
vector becomes equivalent to assign<strong>in</strong>g a phonetic label to<br />
each spectral frame of speech. A range of recognition systems<br />
exist that exploit these labels so as to recognize speech <strong>in</strong> an<br />
efficient manner.<br />
4) Vector quantization lowers the bit rate of the signal being quantized, thus making it more bandwidth-efficient than scalar quantization. This, however, contributes to its implementation complexity (computation and storage).
b) Disadvantages:<br />
1) An inherent spectral distortion in representing the actual analysis vector. Since there are a finite number of codebook vectors, the process of choosing the "best" representation of a given spectral vector is inherently equivalent to quantizing the vector and leads, by definition, to a certain level of quantization error. As the size of the codebook increases, the quantization error decreases. However, with any finite codebook there will always be some non-zero level of quantization error.
2) The storage required for codebook vectors is often non-trivial. The larger the codebook (so as to reduce quantization error), the more storage is required for the codebook entries. For codebook sizes of 1000 or larger, the storage is often non-trivial; hence an inherent trade-off exists among quantization error, processing for choosing the codebook vector, and storage of codebook vectors, and practical designs balance each of these three factors.
3) VQ has a low prediction gain for the vector predictor, due to the behavior of the autocorrelation function of speech with increasing lag.
10.3. Applications:<br />
i) Image and voice compression<br />
ii) <strong>Speech</strong> <strong>Recognition</strong> application<br />
iii) Image cod<strong>in</strong>g<br />
iv) VQ for neural gas network<br />
v) VQ is <strong>used</strong> for lossy data compression, lossy<br />
data correction, and density estimation<br />
11. Artificial Neural Network (ANN)[5]:<br />
A variety of knowledge sources need to be established in the AI approach to speech recognition. Two key concepts of artificial intelligence here are automatic knowledge acquisition (learning and adaptation). One way in which these concepts have been implemented is via the neural network approach. Fig. 9 shows an example of a neural network model.
11.1. Basics of Neural Networks:<br />
A neural network, also called a connectionist model or a parallel distributed processing (PDP) model, is basically a dense interconnection of simple, nonlinear computational elements. It is assumed that there are N inputs, labeled x_1, x_2, ..., x_N, which are summed with weights w_1, w_2, ..., w_N, thresholded, and then nonlinearly compressed to give the output y, defined as

y = f( Σ_{i=1}^{N} w_i x_i − φ )   -----(12)

where φ is an internal threshold or offset, and f is a nonlinearity such as the hard limiter f(x) = +1 if x ≥ 0 and −1 if x < 0, or a sigmoid. The sigmoid nonlinearities are used most often because they are continuous and differentiable. The biological basis of the neural network is a model by McCullough and Pitts [22] of neurons in the human nervous system.
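A minimal sketch of the computation in equation (12), assuming NumPy and using the sigmoid nonlinearity mentioned above; the input values, weights, and threshold below are arbitrary illustrative numbers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(x, w, phi, f=sigmoid):
    """y = f( sum_i w_i * x_i - phi ), as in equation (12)."""
    return f(np.dot(w, x) - phi)

x = np.array([0.5, -1.2, 3.0])      # N = 3 inputs
w = np.array([0.8, 0.1, -0.4])      # weights
phi = 0.2                           # internal threshold / offset
y = neuron_output(x, w, phi)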
11.1.1 Neural Network topologies:<br />
There are several issues <strong>in</strong> the design of the so called artificial<br />
neural networks which model various physical phenomena,<br />
where we def<strong>in</strong>e an ANN as an arbitrary connection of simple<br />
computational elements. One key issue <strong>in</strong> network topologies<br />
– that is, how the simple computational elements are<br />
<strong>in</strong>terconnected. There are three standard and well known<br />
topologies.<br />
i) S<strong>in</strong>gle/multilayer perceptrons<br />
ii) Hopfield or recurrent networks<br />
iii) Kohonen or Self-organiz<strong>in</strong>g networks<br />
In the s<strong>in</strong>gle/multilayer perceptron, the outputs of one or more<br />
simple computational elements at one layer form the <strong>in</strong>puts to<br />
a new set of simple computational elements of the next layer.<br />
The single-layer perceptron has N inputs connected to M outputs in the output layer, as shown in Fig. 9. The three-layer perceptron has two hidden layers between the input and output layers. The single-layer perceptron can separate static patterns into classes with class boundaries characterized by hyperplanes in the (x_1, x_2, ..., x_N) space. Similarly, a multilayer perceptron with at least one hidden layer can realize an arbitrary set of decision regions in the (x_1, x_2, ..., x_N) space. Thus, for example, if the inputs to a multilayer perceptron are the first two speech resonances (F1 and F2), the network can implement a set of decision regions that partition the (F1, F2) space into the 10 steady-state vowels.
The Hopfield network is a recurrent network <strong>in</strong> which the<br />
<strong>in</strong>put to each computational element <strong>in</strong>cludes both <strong>in</strong>puts as<br />
well as outputs. Thus, with the input and output indexed by time, x_i(t) and y_i(t), and the weight connecting the ith node and the jth node denoted by w_ij, the basic equation for the ith recurrent computational element is

y_i(t) = f[ x_i(t) + Σ_j w_ij y_j(t−1) − φ ]   (13)

for a recurrent network with N inputs and N outputs. The most important property of the Hopfield network is that w_ij = w_ji; when the recurrent computation of eq. (13) is performed asynchronously, for an arbitrary constant input, the network will eventually settle to a fixed point where y_i(t) = y_i(t−1) for all i. These fixed relaxation points represent stable configurations of the network and can be used in applications that have a fixed set of patterns to be matched, in the form of a content-addressable or associative memory. A recurrent network has a stable set of attractors and repellers, each forming a fixed point in the input space. Every input vector x is either attracted to one of the fixed points or repelled from another of the fixed points. The strength of this type of network is its ability to correctly classify noisy versions of the patterns that form the stable fixed points.
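A small sketch of the asynchronous recurrent update of equation (13), assuming NumPy, a symmetric weight matrix (w_ij = w_ji) and a hard-limiter nonlinearity; the external input and threshold terms are omitted for brevity, and the stored pattern and Hebbian outer-product weights are purely illustrative.

import numpy as np

def hopfield_recall(W, x, steps=50):
    """Asynchronously update y_i = sgn( sum_j w_ij * y_j ) until a fixed point."""
    y = x.copy()
    for _ in range(steps):
        prev = y.copy()
        for i in np.random.permutation(len(y)):   # asynchronous update order
            y[i] = 1 if W[i] @ y >= 0 else -1
        if np.array_equal(y, prev):               # fixed point: y(t) == y(t-1)
            break
    return y

# store one bipolar pattern with the Hebbian rule, then recall a noisy version
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)                          # symmetric weights, zero diagonal
noisy = pattern.copy()
noisy[0] *= -1
recovered = hopfield_recall(W, noisy)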
The third popular type of neural network topology is the<br />
Kohonen, self organiz<strong>in</strong>g feature map, which is a cluster<strong>in</strong>g<br />
procedure for provid<strong>in</strong>g a codebook of stable patterns <strong>in</strong> the<br />
<strong>in</strong>put space that characterize an arbitrary <strong>in</strong>put vector, by a<br />
small number of representative clusters.<br />
Figure 9. Simplified view of an artificial neural network.
11.1.2. Network Characteristics:<br />
Four model characteristics must be specified to implement an<br />
arbitrary neural network. Fig.9 shows the architecture of the<br />
simple neural network model.<br />
a) Number and type of <strong>in</strong>puts: The issues <strong>in</strong>volved <strong>in</strong> the<br />
choice of <strong>in</strong>puts to a neural network are similar to those<br />
<strong>in</strong>volved <strong>in</strong> the choice of features for any pattern classification<br />
system. They must provide the <strong>in</strong>formation required to make<br />
the decision required of the network.<br />
b) Connectivity of the network: This issue <strong>in</strong>volves the size of<br />
the network, that is, the number of hidden layers and the number of nodes in each layer between the input and output. There is no good rule of thumb as to how large (or small)
such hidden layers must be. Intuition says that if the hidden<br />
layers are large, then it will be difficult to tra<strong>in</strong> the network.<br />
Similarly, if the hidden layers are too small the network may<br />
not be able to accurately classify the entire desired <strong>in</strong>put<br />
pattern.<br />
c) Choice of offset: The choice of the threshold φ for each computational element must be made as part of the training procedure, which chooses values for the interconnection weights w_ij and the offsets φ.
d) Choice of nonlinearity: Experience indicates that the exact choice of the nonlinearity f is not very important in terms of the network performance. However, f must be continuous and differentiable for the training algorithm to be applicable.
11.2. Training of Neural Network Parameters:
To completely specify a neural network, values for the weighting coefficients and the offset threshold of each computational element must be determined, based on a labeled set of training data. A labeled training set means an association between a set of Q input vectors x_1, x_2, ..., x_Q and a set of Q desired output vectors y_1, y_2, ..., y_Q, where each input x_q is paired with its desired output y_q. For multilayer perceptrons a simple, iterative, convergent procedure exists for choosing a set of parameters whose value asymptotically approaches a stationary point with a certain
optimality property (e.g., a local minimum of the mean squared error). This procedure, called back-propagation learning, is a simple stochastic gradient technique. For a simple, single-layer network, the training algorithm can be realized via the following convergence steps:
Perceptron Convergence Procedure
1. Initialization: At time t = 0, set w_ij(0) and φ_j to small random values (where w_ij are the weighting coefficients connecting the ith input node and the jth output node, φ_j is the offset of a particular computational element, and the w_ij are functions of time).
2. Acquire input: At time t, obtain a new input x = {x_1, x_2, ..., x_N} with the desired output y^x = {y^x_1, y^x_2, ..., y^x_M}.
3. Calculate output: y_j = f( Σ_{i=1}^{N} w_ij(t) x_i − φ_j ).
4. Adapt weights: Update the weights as w_ij(t+1) = w_ij(t) + η(t) [y^x_j − y_j] x_i, where η(t) is a gain (learning-rate) term.
5. Iteration: Iterate steps 2-4 until the weights converge, i.e., w_ij(t+1) = w_ij(t).
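A compact sketch of the convergence procedure above for a single-layer network, assuming NumPy; the gain term η(t) is held constant and the hard limiter is used as f, both simplifying assumptions, and the toy labeled training set (labels given by the sign of the first input) is purely illustrative.

import numpy as np

def train_single_layer(X, Y, eta=0.1, epochs=100, seed=0):
    """Steps 1-5: initialize, present inputs, compute outputs, adapt weights, iterate."""
    rng = np.random.default_rng(seed)
    N, M = X.shape[1], Y.shape[1]
    W = rng.normal(scale=0.01, size=(N, M))        # w_ij, small random values
    phi = rng.normal(scale=0.01, size=M)           # offsets, fixed here
    f = lambda a: np.where(a >= 0, 1.0, -1.0)      # hard-limiter nonlinearity
    for _ in range(epochs):
        W_old = W.copy()
        for x, y_desired in zip(X, Y):
            y = f(x @ W - phi)                     # step 3: calculate output
            W += eta * np.outer(x, y_desired - y)  # step 4: adapt weights
        if np.allclose(W, W_old):                  # step 5: stop when weights settle
            break
    return W, phi

# toy labeled set: output is the sign of the first input component
X = np.array([[1.0, 0.5], [2.0, -1.0], [-1.5, 0.3], [-0.5, -2.0]])
Y = np.array([[1.0], [1.0], [-1.0], [-1.0]])
W, phi = train_single_layer(X, Y)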
11.3. Difference between Neural Networks and<br />
Conventional Classifiers:<br />
The difference between the neural network classifier and the conventional classifier is given in Table A.
TABLE A
Differences between neural network and conventional classifiers
1) Neural network: estimates posterior probabilities. Conventional classifier: based on Bayes decision theory, using posterior probabilities.
2) Neural network: nonlinear, model-free method. Conventional classifier: linear, model-based method.
3) Neural network: uses a discriminant function. Conventional classifier: uses a probabilistic function.
4) Neural network: minimizes the total number of misclassification errors. Conventional classifier: minimizes the classification error.
5) Neural network: data driven and self-adapting. Conventional classifier: data driven but not self-adapting.
Statistical pattern classifiers are based on the Bayes decision<br />
theory <strong>in</strong> which posterior probabilities play a central role. The<br />
fact that neural networks can <strong>in</strong> fact provide estimates of<br />
posterior probability implicitly establishes the l<strong>in</strong>k between<br />
neural networks and statistical classifiers. The direct<br />
comparison between them may not be possible s<strong>in</strong>ce neural<br />
networks are nonl<strong>in</strong>ear model-free method while statistical<br />
methods are basically l<strong>in</strong>ear and model based. By appropriate<br />
cod<strong>in</strong>g of the desired output membership values, we may let<br />
neural networks directly model some discrim<strong>in</strong>ant functions.<br />
For example, in a two-group classification problem, the desired output may be coded as 1 if the object is from class 1 and −1 if it is from class 2. The neural network then estimates the following discriminant function:

g(x) = P(w_1 | x) − P(w_2 | x)   ---(14)

The discriminating rule is simply: assign x to w_1 if g(x) > 0, and to w_2 if g(x) < 0.
Logistic regression has been widely used in medical diagnosis and epidemiologic studies [32]. It
is often preferred over discrim<strong>in</strong>ant analysis <strong>in</strong> practice<br />
[33,34]. In addition, the model can be <strong>in</strong>terpreted as posterior<br />
probability or odds ratio. It is a simple fact that when the<br />
logistic transfer function is <strong>used</strong> for the output nodes, simple<br />
neural networks without hidden layers are identical to logistic<br />
regression models. Another connection is that the maximum<br />
likelihood function of logistic regression is essentially the<br />
cross-entropy cost function which is often <strong>used</strong> <strong>in</strong> tra<strong>in</strong><strong>in</strong>g<br />
neural network classifiers. Schumacher et al. [35] make a<br />
detailed comparison between neural networks and logistic<br />
regression. They f<strong>in</strong>d that the added model<strong>in</strong>g flexibility of<br />
neural networks due to hidden layers does not automatically<br />
guarantee their superiority over logistic regression because of<br />
the possible over fitt<strong>in</strong>g and other <strong>in</strong>herent problems with<br />
neural networks [36]. L<strong>in</strong>ks between neural and other<br />
conventional classifiers have been illustrated by<br />
[37,38,39,40,41,42,43]. Ripley [44,45] empirically compares<br />
neural networks with various classifiers such as classification<br />
tree, projection pursuit regression, l<strong>in</strong>ear vector quantization,<br />
multivariate adaptive regression spl<strong>in</strong>es and nearest neighbor<br />
methods.<br />
A large number of studies have been devoted to empirical<br />
comparisons between neural and conventional classifiers. The<br />
most comprehensive one can be found <strong>in</strong> Michie et al. [46]<br />
which reports a large-scale comparative study—the StatLog<br />
project. In this project, three general classification approaches<br />
of neural networks, statistical classifiers and mach<strong>in</strong>e learn<strong>in</strong>g<br />
with 23 methods are compared us<strong>in</strong>g more than 20 different<br />
real data sets. Their general conclusion is that no s<strong>in</strong>gle<br />
classifier is the best for all data sets although the feed forward<br />
neural networks do have good performance over a wide range<br />
of problems.<br />
Neural networks have also been compared with decision trees [47,48,49,50], discriminant analysis [51], [52], [53], [54], [55], CART [56], k-nearest-neighbor [57], and linear programming methods. Although classification costs are difficult to assign in
real problems, ignor<strong>in</strong>g the unequal misclassification risk for<br />
different groups may have significant impact on the practical<br />
use of the classification. It should be po<strong>in</strong>ted out that a neural<br />
classifier which m<strong>in</strong>imizes the total number of<br />
misclassification errors may not be useful for situations where<br />
different misclassification errors carry highly uneven<br />
consequences or costs.<br />
11.4. Advantages and Disadvantages of Neural Networks:<br />
a) Advantages:
1) Neural networks are data driven and self-adaptive: they can adjust themselves to the data without any explicit specification of a functional or distributional form for the underlying model.
2) They possess a self-organization mechanism.
3) They have fault-tolerance capabilities; robustness follows from network redundancy, and they can readily implement a massive degree of parallel computation.
4) A neural network can perform tasks that a linear program cannot.
5) When an element of the neural network fails, the network can continue operating because of its parallel nature.
6) A neural network learns and does not need to be reprogrammed.
7) It can be applied to a wide range of applications.
8) The connectionist structure can be used to model the local feature vector conditioned on the Markov process.
9) There is no need to assume an underlying data distribution, as is usually done in statistical modeling.
10) Neural networks are applicable to multivariate non-linear problems.
11) They are universal functional approximators, in that neural networks can approximate any function with arbitrary accuracy [58], [59], [60].
12) Neural networks are nonlinear models, which makes them flexible in modeling real-world complex relationships.
13) Neural networks are able to estimate the posterior probabilities, which provides the basis for establishing classification rules and performing statistical analysis [61].
14) The connection weights of the network need not be constrained to be fixed; they can be adapted in real time to improve performance.
15) Because of the nonlinearity within each computational element, a sufficiently large neural network can approximate any nonlinearity or nonlinear dynamical system.
16) They can adapt to unknown situations.
17) They exhibit autonomous learning due to learning and generalization.
b. Disadvantages:<br />
1) The neural network needs tra<strong>in</strong><strong>in</strong>g to operate.<br />
2) The architecture of a neural network is different from the architecture of microprocessors and therefore needs to be emulated.
3) Requires high processing time for large neural networks.
4) Minimizing overfitting requires a great deal of computational effort.
5) The individual relations between the input variables and the output variables are not developed by engineering judgment, so the model tends to be a black box or input/output table without an analytical basis.
6) The sample size has to be large.<br />
7) Large complexity of the network structure.<br />
11.5. Applications:<br />
S<strong>in</strong>ce neural networks are best at identify<strong>in</strong>g patterns or trends<br />
<strong>in</strong> data, they are well suited for prediction or forecast<strong>in</strong>g needs<br />
<strong>in</strong>clud<strong>in</strong>g:<br />
1) Sales forecasting, industrial process control, customer research, data validation, risk management, target marketing
2) Modeling and diagnosing the cardiovascular system
3) Medicine / medical diagnosis
4) Business / marketing
5) Electronic noses
6) Speech recognition
7) Credit evaluation
8) Speech and speaker applications
9) Fault detection
10) Prediction: learning from past experiences; weather prediction
11) Classification: image processing, risk management
12) Recognition: character / handwritten recognition
13) Data association
14) Data conceptualization
15) Data filtering
16) Planning
12. Support Vector Machines (SVM):
During the last decade, a new tool appeared in the field of machine learning that has proved able to cope with hard classification problems in several fields of application: the Support Vector Machine (SVM). An SVM is essentially a binary nonlinear classifier capable of guessing whether an input vector x belongs to class 1 (the desired output would then be y = +1) or to class 2 (y = −1). This algorithm was first proposed in [63] in 1992, and it is a nonlinear version of a much older linear algorithm, the optimal hyperplane decision rule (also known as the generalized portrait algorithm), which was introduced in the sixties.
SVMs are effective discriminative classifiers with several outstanding characteristics [62], namely: their solution is the one with maximum margin; they are capable of dealing with samples of very high dimensionality; and their convergence to the minimum of the associated cost function is guaranteed. A Support Vector Machine performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.
12.1. Introduction to Support Vector Machine (SVM) models:
One of the powerful tools for pattern recognition that uses a discriminative approach is the SVM [97]. SVMs use linear and nonlinear separating hyperplanes for data classification. Since SVMs can only classify fixed-length data vectors, the method cannot be readily applied to tasks involving variable-length data classification; variable-length data have to be transformed to fixed-length vectors before SVMs can be used. An SVM is a generalized linear classifier with maximum-margin fitting functions. This fitting function provides regularization, which helps the classifier generalize better; the classifier tends to ignore many of the features. Conventional statistical and neural network methods control model complexity by using a small number of features (the problem dimensionality or the number of hidden units). An SVM controls the model complexity by controlling the VC dimension of its model. This method is independent of dimensionality and can utilize spaces of very large dimension, which permits constructing a very large number of non-linear features and then performing adaptive feature selection during training. By shifting all non-linearity to the features, the SVM can use a linear model for which the VC dimension is known. For example, a support vector machine can be used as a regularized radial basis function classifier.
Figure 10. The support vector machine process.
These characteristics have made SVMs very popular and<br />
successful. In the parlance of SVM literature, a predictor<br />
variable is called an attribute, and a transformed attribute that<br />
is <strong>used</strong> to def<strong>in</strong>e the hyper plane is called a feature. The task<br />
of choos<strong>in</strong>g the most suitable representation is known as<br />
feature selection. A set of features that describes one case (i.e.,<br />
a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. Figure 10 presents an overview of the SVM process.
12.1.1 SVM formulation:<br />
Given a set of separable data, the goal is to find the optimal decision function. It can easily be seen that there is an infinite number of optimal solutions to this problem, in the sense that infinitely many functions can separate the training samples with zero errors. Since the function must also generalize to unseen samples, an additional criterion is used to find the best solution among those with zero errors. If the probability densities of the classes were known, we could apply the maximum a posteriori (MAP) criterion to find the optimal solution. In most practical cases this information is not available, so a simpler criterion is adopted: among those functions without training errors, we choose the one with the maximum margin, the margin being the distance between the closest sample and the decision boundary defined by that function. Of course, optimality in the sense of maximum margin does not necessarily imply optimality in the sense of minimizing the number of errors in test, but it is a simple criterion that yields solutions which, in practice, turn out to be the best ones for many problems [64].

As can be inferred from Figure 11, the nonlinear discriminant function f(x_i) can be written as:

f(x_i) = w^T Φ(x_i) + b   ...(14a)

where Φ(·) is a nonlinear function which maps the vector x_i into what is called a feature space of higher (possibly infinite) dimensionality, where the classes are assumed to be linearly separable. The vector w represents the separating hyperplane in such a space. It is worth noting that the meaning of feature space here has nothing to do with the space of the speech features, which within the kernel-methods nomenclature belongs to the input space. On the other hand, r_x denotes the distance between the transformed sample and the separating hyperplane, and ||w|| is the Euclidean norm of w. We call support vectors those samples closest to the decision boundary; these vectors define the margin and are the only samples needed to find the solution. Hence, the goal of finding the optimum classifier is achieved by minimizing ||w||^2 / 2 with the restriction of all samples being correctly classified, i.e.:

y_i (w^T Φ(x_i) + b) ≥ 1 for every training sample x_i   ...(15)

This can be formulated as a problem of quadratic optimization. In order to get a classifier with a better generalization ability and capable of handling the non-separable case, we should allow a number of misclassified data. This is accomplished by introducing a penalty term in the function to be minimized:

minimize (1/2)||w||^2 + C Σ_i ξ_i, subject to y_i (w^T Φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0   ...(16)

Figure 11. Soft margin decision boundary.

Here the x_i are the training vectors corresponding to the labels y_i, and the variables ξ_i are called slack variables; they allow a certain amount of errors and make solutions possible in the non-separable case. A sample verifies 0 < ξ_i ≤ 1 when it is well classified but inside the margin, and ξ_i > 1 when it is wrongly classified. The C term, on the other hand, expresses the trade-off between the number of training errors and the generalization capability. This problem is usually solved by introducing the restrictions into the function to be optimized using Lagrange multipliers, leading to the maximization of the Wolfe dual:

maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j Φ(x_i)^T Φ(x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0   ...(17)

This problem is quadratic and convex, so its convergence to a global minimum is guaranteed using quadratic programming (QP) schemes. The resulting decision boundary w will be given by:

w = Σ_i α_i y_i Φ(x_i)   ...(18)

According to (18), only vectors with an associated α_i ≠ 0 will contribute to determining the weight vector w and, therefore, the separating boundary. These are the support vectors that, as mentioned before, define the separation border and the margin. Generally, the function Φ(·) is not explicitly known (in fact, in most cases its evaluation would be impossible, as the feature-space dimensionality can be infinite). The solution only needs to evaluate the dot products Φ(x_i)^T Φ(x_j), which, by using what has been called the kernel trick, can be evaluated using a kernel function K(x_i, x_j). Many of the SVM implementations compute this function for every pair of input samples, producing a kernel matrix that is stored in memory. By using this method and replacing w in equation (14a) by the expression in (18), the form that an SVM finally adopts is the following:

f(x) = Σ_i α_i y_i K(x_i, x) + b   ...(19)

The most widely used kernel functions are:
• the simple linear kernel, K(x_i, x_j) = x_i^T x_j   ...(20)
• the radial basis function (RBF) kernel, K(x_i, x_j) = exp(−γ ||x_i − x_j||^2)   ...(21), where γ is proportional to the inverse of the variance of the Gaussian function and whose associated feature space is of infinite dimensionality;
• the polynomial kernel, K(x_i, x_j) = (x_i^T x_j + 1)^p   ...(22), whose associated feature space consists of polynomials up to degree p; and
• the sigmoid kernel, K(x_i, x_j) = tanh(γ x_i^T x_j + c)   ...(23)
It is worth mentioning that there are some conditions that a function must satisfy in order to be used as a kernel. These are often denominated KKT (Karush-Kuhn-Tucker) conditions [65] and can be reduced to checking that the kernel matrix is symmetric and positive semi-definite.
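As a concrete illustration of the formulation above, the following sketch trains a soft-margin SVM with an RBF kernel using scikit-learn, an assumed dependency (the survey itself does not reference any particular toolkit); C and gamma correspond to the trade-off term C and the RBF parameter γ discussed above, and the data are illustrative.

import numpy as np
from sklearn.svm import SVC

# toy two-class data standing in for fixed-length feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
               rng.normal(+1.0, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # soft margin + RBF kernel, as in eq. (21)
clf.fit(X, y)

print(len(clf.support_))                    # number of support vectors (alpha_i != 0)
print(clf.predict([[0.2, -0.3]]))           # sign of the decision function of eq. (19)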
12.2. Advantages and Disadvantages of SVM:<br />
a. Advantages:<br />
1) Follows l<strong>in</strong>ear discrim<strong>in</strong>ants <strong>in</strong> its learn<strong>in</strong>g criterion.<br />
2) It m<strong>in</strong>imizes the number of misclassifications <strong>in</strong> any<br />
possible set of samples and this is known as Risk<br />
M<strong>in</strong>imization (RM).<br />
3) It m<strong>in</strong>imizes the number of misclassifications with<strong>in</strong> the<br />
tra<strong>in</strong><strong>in</strong>g set and this is known as Empirical Risk M<strong>in</strong>imization<br />
(ERM).<br />
4) They have a unique solution and their convergence is guaranteed (the solution is found by minimizing a convex function). This is an advantage compared to other classifiers, such as ANNs, which often fall into local minima or do not converge to a stable solution.
5) Since only the kernel matrix is involved in the minimization process, SVMs can deal with input vectors of very high dimensionality, as long as the corresponding kernels can be calculated; they can deal with vectors of thousands of dimensions.
6) The <strong>in</strong>put vectors of an SVM with the formulation must<br />
have a fixed size.<br />
7) The important advantage of SVM is that it offers a<br />
possibility to tra<strong>in</strong> generalizable, nonl<strong>in</strong>ear classifiers <strong>in</strong> high<br />
dimensional spaces us<strong>in</strong>g a small tra<strong>in</strong><strong>in</strong>g set.<br />
8) SVMs generalization error is not related to the <strong>in</strong>put<br />
dimensionality of the problem but to the marg<strong>in</strong> with which it<br />
separates the data. That is why SVMs can have good<br />
performance even with a large number of <strong>in</strong>puts.<br />
b. Disadvantages:<br />
1) Most implementations of SVM algorithm require<br />
comput<strong>in</strong>g and stor<strong>in</strong>g <strong>in</strong> memory the complete kernel matrix<br />
of all the input samples. This task has a space complexity of O(n^2), and is one of the main problems of these algorithms that prevents their application to very large speech databases.
2) The optimality of the solution found can depend on the<br />
kernel that has been <strong>used</strong>, and there is no method to know a<br />
priori which will be the best kernel for a concrete task.<br />
3) The best value for the parameter C is unknown a priori.<br />
12.3. Applications:<br />
1) SVM <strong>in</strong> speech and speaker recognition<br />
2) SVM <strong>in</strong> f<strong>in</strong>ancial applications<br />
3) SVM <strong>in</strong> computational biology<br />
4) SVM <strong>in</strong> bio<strong>in</strong>formatics/biological applications<br />
5) SVM <strong>in</strong> text classification<br />
6) SVM <strong>in</strong> chemistry<br />
13. K-Nearest Neighbor Method:<br />
A more general version of the nearest neighbor technique [66] bases the classification of an unknown sample on the votes of its k nearest neighbors rather than on only its single nearest neighbor. The k-nearest neighbor classification procedure is denoted k-NN. If the costs of error are equal for each class, the estimated class of an unknown sample is chosen to be the class that is most commonly represented in the collection of its k nearest neighbors.
13.1. <strong>Classification</strong> concept of KNN:<br />
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of
<strong>in</strong>stance-based learn<strong>in</strong>g, or lazy learn<strong>in</strong>g where the function is<br />
only approximated locally and all computation is deferred<br />
until classification. The k-nearest neighbor algorithm is<br />
amongst the simplest of all mach<strong>in</strong>e learn<strong>in</strong>g algorithms: an<br />
object is classified by a majority vote of its neighbors, with<br />
the object be<strong>in</strong>g assigned to the class most common amongst<br />
its k nearest neighbors (k is a positive <strong>in</strong>teger, typically small).<br />
If k = 1, then the object is simply assigned to the class of its<br />
nearest neighbor.<br />
<strong>Classification</strong> (generalization) us<strong>in</strong>g an <strong>in</strong>stance-based<br />
classifier can be a simple matter of locat<strong>in</strong>g the nearest<br />
neighbor <strong>in</strong> <strong>in</strong>stance space and label<strong>in</strong>g the unknown <strong>in</strong>stance<br />
with the same class label as that of the located (known)<br />
neighbor. This approach is often referred to as a nearest<br />
neighbor classifier. More robust models can be achieved by<br />
locat<strong>in</strong>g k, where k > 1, neighbors and lett<strong>in</strong>g the majority<br />
vote decide the outcome of the class label<strong>in</strong>g. A higher value<br />
of k results <strong>in</strong> a smoother, less locally sensitive, function. The<br />
nearest neighbor classifier can be regarded as a special case<br />
of the more general k-nearest neighbors classifier, hereafter<br />
referred to as a k-NN classifier.<br />
The same method can be <strong>used</strong> for regression, by simply<br />
assign<strong>in</strong>g the property value for the object to be the average of<br />
the values of its k nearest neighbors. It can be useful to weight<br />
the contributions of the neighbors, so that the nearer neighbors<br />
contribute more to the average than the more distant ones. (A<br />
common weight<strong>in</strong>g scheme is to give each neighbor a weight<br />
of 1/d, where d is the distance to the neighbor. This scheme is<br />
a generalization of l<strong>in</strong>ear <strong>in</strong>terpolation.). The neighbors are<br />
taken from a set of objects for which the correct classification<br />
(or, <strong>in</strong> the case of regression, the value of the property) is<br />
known. This can be thought of as the tra<strong>in</strong><strong>in</strong>g set for the<br />
algorithm, though no explicit tra<strong>in</strong><strong>in</strong>g step is required. Nearest<br />
neighbor rules <strong>in</strong> effect compute the decision boundary <strong>in</strong> an<br />
implicit manner. It is also possible to compute the decision<br />
boundary itself explicitly, and to do so <strong>in</strong> an efficient manner<br />
so that the computational complexity is a function of the<br />
boundary complexity.<br />
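A minimal sketch of the k-NN decision rule described above, assuming NumPy, Euclidean distance, and majority voting, with the optional 1/d weighting mentioned earlier in this section; the toy training set is illustrative only.

import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3, weighted=False):
    """Label x by a (possibly distance-weighted) vote of its k nearest neighbors."""
    d = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(d)[:k]
    if not weighted:
        return Counter(train_y[nearest]).most_common(1)[0][0]
    votes = {}
    for i in nearest:                        # weight each neighbor by 1/d
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (d[i] + 1e-12)
    return max(votes, key=votes.get)

# toy usage with two classes
train_X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
train_y = np.array([0, 0, 0, 1, 1, 1])
label = knn_classify(np.array([0.5, 0.5]), train_X, train_y, k=3)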
13.1.1. Assumptions <strong>in</strong> KNN:<br />
Before us<strong>in</strong>g KNN, some of the assumptions <strong>in</strong> KNN are to be<br />
considered.<br />
• KNN assumes that the data is <strong>in</strong> a feature space.<br />
More exactly, the data po<strong>in</strong>ts are <strong>in</strong> a metric space.<br />
The data can be scalars or possibly even<br />
multidimensional vectors. S<strong>in</strong>ce the po<strong>in</strong>ts are <strong>in</strong><br />
feature space, they have a notion of distance – This<br />
need not necessarily be Euclidean distance although<br />
it is the one commonly <strong>used</strong>.<br />
• Each of the training data consists of a set of vectors and a class label associated with each vector. In the simplest case, the label will be either + or − (for positive or negative classes), but KNN can work equally well with an arbitrary number of classes.
• A single number "k" is also given. This number decides how many neighbors (where neighbors are defined based on the distance metric) influence the classification. This is usually an odd number if the number of classes is 2. If k = 1, then the algorithm is simply called the nearest neighbor algorithm.
13.1.2. Parameter selection <strong>in</strong> KNN:<br />
• The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbor algorithm.
• Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling [4]. Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
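One way to pick k empirically, sketched below with scikit-learn's cross-validation utilities (an assumed dependency, not something the survey prescribes); only odd candidate values of k are tried, as suggested above for two-class problems, and the data are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def best_k(X, y, candidates=(1, 3, 5, 7, 9), folds=5):
    """Return the odd k with the highest mean cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=folds).mean()
              for k in candidates}
    return max(scores, key=scores.get), scores

# toy usage with random two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(2, 1, (40, 4))])
y = np.array([0] * 40 + [1] * 40)
k, scores = best_k(X, y)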
13.1.3. Properties <strong>in</strong> KNN:<br />
• The naive version of the algorithm is easy to<br />
implement by comput<strong>in</strong>g the distances from the test<br />
sample to all stored vectors, but it is computationally<br />
<strong>in</strong>tensive, especially when the size of the tra<strong>in</strong><strong>in</strong>g set<br />
grows. Many nearest neighbor search algorithms<br />
have been proposed over the years; these generally<br />
seek to reduce the number of distance evaluations<br />
actually performed. Us<strong>in</strong>g an appropriate nearest<br />
neighbor search algorithm makes k-NN<br />
computationally tractable even for large data sets.<br />
• The nearest neighbor algorithm has some strong<br />
consistency results. As the amount of data<br />
approaches <strong>in</strong>f<strong>in</strong>ity, the algorithm is guaranteed to<br />
yield an error rate no worse than twice the Bayes<br />
error rate (the m<strong>in</strong>imum achievable error rate given<br />
the distribution of the data). k-nearest neighbor is<br />
guaranteed to approach the Bayes error rate, for some<br />
value of k (where k <strong>in</strong>creases as a function of the<br />
number of data po<strong>in</strong>ts). Various improvements to k-<br />
nearest neighbor methods are possible by us<strong>in</strong>g<br />
proximity graphs.<br />
13.1.4. KNN for Density Estimation:
Although classification remains the primary application of KNN, it can also be used for density estimation. Since KNN is non-parametric, it can estimate arbitrary distributions. The idea is very similar to the use of a Parzen window. Instead of using a hypercube of fixed size and a kernel function, it does the estimation as follows: to estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. The density is then estimated using the formula

p(x) \approx \frac{k}{nV}   …(24)

where n is the total number of samples and V is the volume of the hypercube. Notice that the numerator is essentially a constant and the density is governed by the volume. The intuition is this: suppose the density at x is very high. Then we can find k points near x very quickly, and these points are also very close to x (by the definition of high density). This means the volume of the hypercube is small and the resulting density estimate is high. Suppose instead that the density around x is very low. Then the volume of the hypercube needed to encompass the k nearest neighbors is large and, consequently, the ratio is low. The volume plays a role similar to the bandwidth parameter in kernel density estimation.
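The following is a minimal Python/NumPy sketch of the k-NN density estimate of Equation (24). The use of the Chebyshev (max-coordinate) distance to realise the growing hypercube, and the toy Gaussian data, are assumptions made for the example.

import numpy as np

def knn_density(data, x, k):
    """k-NN density estimate at point x: grow a hypercube centred at x
    until it captures k samples, then return p(x) ~= k / (n * V)."""
    n, d = data.shape
    # A hypercube of half-side r centred at x contains exactly the points
    # whose Chebyshev (max-coordinate) distance to x is <= r.
    cheb = np.max(np.abs(data - x), axis=1)
    r = np.sort(cheb)[k - 1]          # half-side needed to capture k points
    volume = (2.0 * r) ** d           # hypercube volume
    return k / (n * volume)

# Toy usage: the estimate should be higher near the cluster at the origin
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(1000, 2))
print(knn_density(data, np.array([0.0, 0.0]), k=10))
print(knn_density(data, np.array([3.0, 3.0]), k=10))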
13.2. Some Basic Observations regard<strong>in</strong>g K-NN:<br />
1. If the points are d-dimensional, then the straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2. KNN can be analyzed <strong>in</strong> two ways – One way is that KNN<br />
tries to estimate the posterior probability of the po<strong>in</strong>t to be<br />
labeled (and apply Bayesian decision theory based on the<br />
posterior probability). An alternate way is that KNN<br />
calculates the decision surface (either implicitly or explicitly)<br />
and then uses it to decide on the class of the new po<strong>in</strong>ts.<br />
3. There are many possible ways to apply weights in KNN; one popular example is Shepard's method (inverse-distance weighting).
4. Even though the naive method takes O(dn) time per query, it is very hard to do better unless further assumptions are made. There are efficient data structures such as the KD-tree which can reduce the query time, but they do so at the cost of increased training time and implementation complexity (see the sketch after this list).
5. In KNN, k is usually chosen as an odd number if the<br />
number of classes is 2.<br />
6. The choice of k is very critical: a small value of k means that noise will have a higher influence on the result, while a large value makes it computationally expensive and defeats the basic philosophy behind KNN (that points that are near are likely to have similar densities or classes). A simple heuristic is to set k ≈ √n, where n is the number of training samples.
7. There are some interesting data structures and algorithms when we apply KNN on graphs, such as the Euclidean minimum spanning tree and the nearest neighbor graph.
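As a brief illustration of the KD-tree speed-up mentioned in observation 4 above, the sketch below uses SciPy's cKDTree (an assumption about the available library; any nearest-neighbor index would serve) to answer k-NN queries without scanning all n stored points for every query.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))        # stored training vectors
tree = cKDTree(points)                  # build once: the "training" cost

query = rng.random((5, 3))
dist, idx = tree.query(query, k=5)      # 5 nearest neighbors per query point
print(idx.shape)                        # (5, 5): neighbor indices into `points`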
13.3. Advantages and Disadvantages of K-NN:<br />
a)Advantages:<br />
1) Its high degree of local sensitivity lets the classifier adapt closely to the local structure of the training data.
2) It follows a non-parametric architecture
3) It is a simple and powerful algorithm
4) KNN is one of the common methods used to estimate the bandwidth (e.g. in adaptive mean shift)
b) Disadvantages:<br />
1) The downside of this simple approach is the lack of<br />
robustness that characterizes the result<strong>in</strong>g classifiers.<br />
2) It is Memory <strong>in</strong>tensive,<br />
3) Its classification/estimation is slow<br />
4) For large training sets, it requires a large amount of memory and is slow when making a prediction
5) It needs a similarity measure and attributes that "match" the target function
6) The k-nearest neighbor algorithm is sensitive to the local<br />
structure of the data.<br />
7) Computational complexity grows with the complexity of the decision boundary.
8) Lack of generalization means that KNN keeps all the<br />
tra<strong>in</strong><strong>in</strong>g data.<br />
9) The accuracy of the k-NN algorithm can be severely<br />
degraded by the presence of noisy or irrelevant features, or if<br />
the feature scales are not consistent with their importance.<br />
10) Prediction accuracy can quickly degrade as the number of attributes grows.
The drawback of <strong>in</strong>creas<strong>in</strong>g the value of k is of course that as<br />
k approaches n, where n is the size of the <strong>in</strong>stance base, the<br />
performance of the classifier will approach that of the most<br />
straightforward statistical basel<strong>in</strong>e, the assumption that all<br />
unknown <strong>in</strong>stances belong to the class most frequently<br />
represented <strong>in</strong> the tra<strong>in</strong><strong>in</strong>g data.<br />
13.4. Applications:<br />
The nearest neighbor search problem arises <strong>in</strong> numerous fields<br />
of application, <strong>in</strong>clud<strong>in</strong>g:<br />
• Pattern recognition - <strong>in</strong> particular for optical<br />
character recognition<br />
• Statistical classification- see k-nearest neighbor<br />
algorithm<br />
• Computer vision<br />
• Databases - e.g. content-based image retrieval<br />
• Cod<strong>in</strong>g theory - see maximum likelihood decod<strong>in</strong>g<br />
• Data compression - see MPEG-2 standard<br />
• Recommendation systems<br />
• Internet market<strong>in</strong>g - see contextual advertis<strong>in</strong>g and<br />
behavioral target<strong>in</strong>g<br />
• DNA sequenc<strong>in</strong>g<br />
• Spell check<strong>in</strong>g - suggest<strong>in</strong>g correct spell<strong>in</strong>g<br />
• Plagiarism detection<br />
• Contact search<strong>in</strong>g algorithms <strong>in</strong> FEA<br />
• Similarity scores for predict<strong>in</strong>g career paths of<br />
professional athletes.<br />
• Cluster analysis - assignment of a set of observations<br />
<strong>in</strong>to subsets (called clusters) so that observations <strong>in</strong><br />
the same cluster are similar <strong>in</strong> some sense, usually<br />
based on Euclidean distance<br />
• Gene Expression<br />
• Prote<strong>in</strong>-Prote<strong>in</strong> <strong>in</strong>teraction and 3D structure<br />
prediction<br />
• Nearest Neighbor based Content Retrieval<br />
14. Gaussian Mixture Model (GMM):<br />
Gaussian Mixture Models (GMMs) are among the most<br />
statistically mature methods for cluster<strong>in</strong>g (though they are<br />
also <strong>used</strong> <strong>in</strong>tensively for density estimation).<br />
14.1.Introduction:<br />
A Gaussian Mixture Model (GMM) is a parametric<br />
probability density function represented as a weighted sum of<br />
Gaussian component densities. GMMs are commonly <strong>used</strong> as<br />
a parametric model of the probability distribution of<br />
cont<strong>in</strong>uous measurements or features <strong>in</strong> a biometric system,<br />
such as vocal-tract related spectral features <strong>in</strong> a speaker<br />
recognition system. GMM parameters are estimated from<br />
tra<strong>in</strong><strong>in</strong>g data us<strong>in</strong>g the iterative Expectation-Maximization<br />
(EM) algorithm or Maximum A Posteriori (MAP) estimation<br />
from a well-tra<strong>in</strong>ed prior model.<br />
A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation

p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i)   …(25)

where x is a D-dimensional continuous-valued data vector (i.e. measurement or features), w_i, i = 1, . . . , M, are the mixture weights, and g(x|\mu_i, \Sigma_i), i = 1, . . . , M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form

g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)' \, \Sigma_i^{-1} (x - \mu_i) \right\}   …(26)

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint that \sum_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights from all component densities. These parameters are collectively represented by the notation

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, . . . , M.   …(27)
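To make Equations (25)–(27) concrete, here is a small Python/NumPy sketch that evaluates a diagonal-covariance GMM density at a point; the parameter values are arbitrary placeholders rather than estimates from any data set.

import numpy as np

def gmm_density(x, weights, means, variances):
    """Evaluate p(x | lambda) = sum_i w_i g(x | mu_i, Sigma_i) for a
    diagonal-covariance GMM.  x: (D,), means/variances: (M, D), weights: (M,)."""
    D = x.shape[0]
    diff = x - means                                   # (M, D)
    norm = (2.0 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1))
    expo = np.exp(-0.5 * np.sum(diff ** 2 / variances, axis=1))
    return float(np.sum(weights * expo / norm))

# Toy 2-component model in D = 2 dimensions (placeholder parameters)
weights = np.array([0.6, 0.4])                         # must sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [0.5, 0.5]])
print(gmm_density(np.array([0.0, 0.0]), weights, means, variances))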
There are several variants on the GMM shown in Equation (25). The covariance matrices, \Sigma_i, can be full rank or constrained to be diagonal. Additionally, parameters can be
shared, or tied, among the Gaussian components, such as<br />
hav<strong>in</strong>g a common covariance matrix for all components. The<br />
choice of model configuration (number of components, full or<br />
diagonal covariance matrices, and parameter ty<strong>in</strong>g) is often<br />
determ<strong>in</strong>ed by the amount of data available for estimat<strong>in</strong>g the<br />
GMM parameters and how the GMM is <strong>used</strong> <strong>in</strong> a particular<br />
biometric application. It is also important to note that, because the component Gaussians act together to model the overall feature density, full covariance matrices are not necessary even if the features are not statistically independent.
The l<strong>in</strong>ear comb<strong>in</strong>ation of diagonal covariance basis<br />
Gaussians is capable of model<strong>in</strong>g the correlations between<br />
feature vector elements. The effect of us<strong>in</strong>g a set of M full<br />
covariance matrix Gaussians can be equally obta<strong>in</strong>ed by us<strong>in</strong>g<br />
a larger set of diagonal covariance Gaussians. GMMs are<br />
often <strong>used</strong> <strong>in</strong> biometric systems, most notably <strong>in</strong> speaker<br />
recognition systems, due to their capability of represent<strong>in</strong>g a<br />
large class of sample distributions. One of the powerful<br />
attributes of the GMM is its ability to form smooth<br />
approximations to arbitrarily shaped densities. The classical uni-modal Gaussian model represents a feature distribution by a position (mean vector) and an elliptic shape (covariance matrix), while a vector quantizer (VQ) or nearest neighbor
model represents a distribution by a discrete set of<br />
characteristic templates [67]. A GMM acts as a hybrid<br />
between these two models by us<strong>in</strong>g a discrete set of Gaussian<br />
functions, each with their own mean and covariance matrix, to<br />
allow a better modeling capability. Figure 12 compares the densities obtained using a uni-modal Gaussian model, a GMM and a VQ model.
Figure 12. Comparison of distribution modeling: (a) histogram of a single cepstral coefficient from a 25 second utterance by a male speaker; (b) maximum likelihood uni-modal Gaussian model; (c) GMM and its 10 underlying component densities; (d) histogram of the data assigned to the VQ centroid locations of a 10-element codebook.
Figure 12 shows the histogram of a s<strong>in</strong>gle feature from a<br />
speaker recognition system (a s<strong>in</strong>gle cepstral value from a 25<br />
second utterance by a male speaker); plot (b) shows a unimodal<br />
Gaussian model of this feature distribution; plot (c)<br />
shows a GMM and its ten underly<strong>in</strong>g component densities;<br />
and plot (d) shows a histogram of the data assigned to the VQ<br />
centroid locations of a 10 element codebook. The GMM not<br />
only provides a smooth overall distribution fit, its components<br />
also clearly detail the multi-modal nature of the density.<br />
The use of a GMM for represent<strong>in</strong>g feature distributions <strong>in</strong> a<br />
biometric system may also be motivated by the <strong>in</strong>tuitive<br />
notion that the <strong>in</strong>dividual component densities may model<br />
some underly<strong>in</strong>g set of hidden classes. For example, <strong>in</strong><br />
speaker recognition, it is reasonable to assume the acoustic<br />
space of spectral related features correspond<strong>in</strong>g to a speaker’s<br />
broad phonetic events, such as vowels, nasals or fricatives.<br />
These acoustic classes reflect some general speaker dependent<br />
vocal tract configurations that are useful for characteriz<strong>in</strong>g<br />
speaker identity. The spectral shape of the i th acoustic class<br />
can <strong>in</strong> turn be represented by the mean µ i of the i th component<br />
density, and variations of the average spectral shape can be<br />
represented by the covariance matrix \Sigma_i. Because all the
features <strong>used</strong> to tra<strong>in</strong> the GMM are unlabeled, the acoustic<br />
classes are hidden <strong>in</strong> that the class of an observation is<br />
unknown. A GMM can also be viewed as a s<strong>in</strong>gle-state HMM<br />
with a Gaussian mixture observation density, or an ergodic<br />
Gaussian observation HMM with fixed, equal transition<br />
probabilities. Assum<strong>in</strong>g <strong>in</strong>dependent feature vectors, the<br />
observation density of feature vectors drawn from these<br />
hidden acoustic classes is a Gaussian mixture [68, 69].<br />
14.2. Maximum Likelihood Parameter Estimation:
Given training vectors and a GMM configuration, one can estimate the parameters of the GMM, \lambda, which in some sense best match the distribution of the training feature vectors. There are several techniques available for estimating the parameters of a GMM [70]. The most popular and well-established method is maximum likelihood (ML) estimation. The aim of ML estimation is to find the model parameters which maximize the likelihood of the GMM given the training data. For a sequence of T training vectors X = {x_1, . . . , x_T}, the GMM likelihood, assuming independence between the vectors, can be written as

p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)   …(28)

Unfortunately, this expression is a non-linear function of the parameters \lambda and direct maximization is not possible.
However, ML parameter estimates can be obta<strong>in</strong>ed iteratively<br />
us<strong>in</strong>g a special case of the expectation-maximization (EM)<br />
algorithm [71]. The basic idea of the EM algorithm is,<br />
beg<strong>in</strong>n<strong>in</strong>g with an <strong>in</strong>itial model λ, to estimate a new model λ¯,<br />
such that p(X| λ¯) ≥ p(X| λ). The new model then becomes the<br />
<strong>in</strong>itial model for the next iteration and the process is repeated<br />
until some convergence threshold is reached. The <strong>in</strong>itial<br />
model is typically derived by us<strong>in</strong>g some form of b<strong>in</strong>ary VQ<br />
estimation. On each EM iteration, the follow<strong>in</strong>g re-estimation<br />
formulas are <strong>used</strong>, which guarantees a monotonic <strong>in</strong>crease <strong>in</strong><br />
the model’s likelihood value,<br />
The a posteriori probability for component i is given by<br />
…(29)<br />
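The following sketch implements one possible reading of the EM re-estimation loop above for a diagonal-covariance GMM in Python/NumPy. The random initialization (in place of the binary VQ initialization the text mentions), the variance floor and the toy data are assumptions made for the example, so this is illustrative rather than a production recipe.

import numpy as np

def em_gmm_diag(X, M, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM with M components to X (T x D) by EM."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(M, 1.0 / M)                          # mixture weights
    mu = X[rng.choice(T, M, replace=False)].copy()   # random initial means
    var = np.tile(X.var(axis=0), (M, 1))             # shared initial variances

    for _ in range(n_iter):
        # E-step: posterior Pr(i | x_t, lambda), as in Equation (29)
        diff = X[:, None, :] - mu[None, :, :]                      # (T, M, D)
        logg = (-0.5 * np.sum(diff ** 2 / var, axis=2)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))   # (T, M)
        logp = np.log(w) + logg
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)                    # (T, M)

        # M-step: re-estimation of weights, means and variances
        n_i = post.sum(axis=0)                                     # (M,)
        w = n_i / T
        mu = (post.T @ X) / n_i[:, None]
        var = (post.T @ (X ** 2)) / n_i[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)                                # floor variances
    return w, mu, var

# Toy usage with two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
w, mu, var = em_gmm_diag(X, M=2)
print(w, mu, sep="\n")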
14.3. Maximum A Posteriori (MAP) Parameter Estimation<br />
In addition to estimat<strong>in</strong>g GMM parameters via the EM<br />
algorithm, the parameters may also be estimated us<strong>in</strong>g<br />
Maximum A Posteriori (MAP) estimation. MAP estimation is<br />
<strong>used</strong>, for example, <strong>in</strong> speaker recognition applications to<br />
derive a speaker model by adapting from a universal
background model (UBM) [72] as shown <strong>in</strong> fig.13. It is also<br />
<strong>used</strong> <strong>in</strong> other pattern recognition tasks where limited labeled<br />
tra<strong>in</strong><strong>in</strong>g data is <strong>used</strong> to adapt a prior, general model. Like the<br />
EM algorithm, the MAP estimation is a two step estimation<br />
process. The first step is identical to the "Expectation" step of the EM algorithm, where estimates of the sufficient statistics of the training data are computed for each mixture in the prior model. Unlike the second step of the EM algorithm, for adaptation these "new" sufficient statistic estimates are then combined with the "old" sufficient statistics from the prior mixture parameters using a data-dependent mixing coefficient. The data-dependent mixing coefficient is designed so that mixtures with high counts of new data rely more on the new sufficient statistics for final parameter estimation, and mixtures with low counts of new data rely more on the old sufficient statistics for final parameter estimation.

The specifics of the adaptation are as follows. Given a prior model and training vectors from the desired class, X = {x_1, . . . , x_T}, we first determine the probabilistic alignment of the training vectors into the prior mixture components (Figure 13(a)). That is, for mixture i in the prior model, we compute \Pr(i \mid x_t, \lambda_{prior}) as in Equation (29). We then compute the sufficient statistics for the weight, mean and variance parameters:

n_i = \sum_{t=1}^{T} \Pr(i \mid x_t, \lambda_{prior}), \quad
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t, \lambda_{prior}) \, x_t, \quad
E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t, \lambda_{prior}) \, x_t^2   …(30)

This is the same as the "Expectation" step in the EM algorithm.

Figure 13. Pictorial example of the two steps in adapting a hypothesized speaker model. (a) The training vectors (x's) are probabilistically mapped into the UBM (prior) mixtures. (b) The adapted mixture parameters are derived using the statistics of the new data and the UBM (prior) mixture parameters. The adaptation is data dependent, so UBM (prior) mixture parameters are adapted by different amounts.

Lastly, these new sufficient statistics from the training data are used to update the prior sufficient statistics for mixture i to create the adapted parameters for mixture i (Figure 13(b)) with the equations:

\hat{w}_i = \left[ \alpha_i^{w} n_i / T + (1 - \alpha_i^{w}) w_i \right] \gamma, \quad
\hat{\mu}_i = \alpha_i^{m} E_i(x) + (1 - \alpha_i^{m}) \mu_i, \quad
\hat{\sigma}_i^2 = \alpha_i^{v} E_i(x^2) + (1 - \alpha_i^{v})(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2   …(31)

The adaptation coefficients controlling the balance between old and new estimates are \alpha_i^{w}, \alpha_i^{m} and \alpha_i^{v} for the weights, means and variances, respectively. The scale factor, \gamma, is computed over all adapted mixture weights to ensure they sum to unity. Note that the sufficient statistics, not the derived parameters such as the variance, are being adapted. For each mixture and each parameter, a data-dependent adaptation coefficient

\alpha_i^{\rho} = \frac{n_i}{n_i + r^{\rho}}, \quad \rho \in \{w, m, v\}   …(32)

is used in the above equations, where r^{\rho} is a fixed "relevance" factor for parameter \rho. It is common in speaker recognition applications to use one adaptation coefficient for all parameters and further to adapt only certain GMM parameters, such as only the mean vectors. Using a data-dependent adaptation coefficient allows mixture-dependent adaptation of parameters. If a mixture component has a low probabilistic count, n_i, of new data, then \alpha_i^{\rho} \to 0, causing de-emphasis of the new (potentially under-trained) parameters and emphasis of the old (better-trained) parameters. For mixture components with high probabilistic counts, \alpha_i^{\rho} \to 1, causing the use of the new class-dependent parameters. The relevance factor is a way of controlling how much new data should be observed in a mixture before the new parameters begin replacing the old parameters. This approach should thus be robust to limited training data.
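As an illustrative sketch of the adaptation just described, the following Python/NumPy function performs mean-only MAP adaptation of a diagonal-covariance prior GMM, following the relevance-factor form of Equations (30)–(32); the relevance factor value, the toy data and the function names are assumptions made for the example.

import numpy as np

def map_adapt_means(X, w, mu, var, relevance=16.0):
    """Adapt only the mean vectors of a prior (UBM-like) diagonal GMM to new
    data X (T x D), using the coefficient alpha_i = n_i / (n_i + r)."""
    # Posterior alignment Pr(i | x_t, prior), as in Equation (29)
    diff = X[:, None, :] - mu[None, :, :]
    logg = (-0.5 * np.sum(diff ** 2 / var, axis=2)
            - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))
    logp = np.log(w) + logg
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)

    # Sufficient statistics (Equation (30)) and adaptation coefficient (32)
    n_i = post.sum(axis=0)                                  # probabilistic counts
    E_x = (post.T @ X) / np.maximum(n_i, 1e-10)[:, None]    # first-order stats
    alpha = n_i / (n_i + relevance)                         # alpha_i in [0, 1)

    # Adapted means (mean line of Equation (31)); weights/variances kept fixed
    return alpha[:, None] * E_x + (1.0 - alpha[:, None]) * mu

# Toy usage: adapt a 2-component prior toward data centred away from it
rng = np.random.default_rng(0)
X_new = rng.normal(1.0, 1.0, (100, 2))
w0 = np.array([0.5, 0.5])
mu0 = np.zeros((2, 2))
var0 = np.ones((2, 2))
print(map_adapt_means(X_new, w0, mu0, var0))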
14.4. Advantages and Disadvantages of GMM:<br />
a. Advantages:<br />
1) Less time consum<strong>in</strong>g when applied to a large set of data.<br />
2) It is text <strong>in</strong>dependent<br />
3) It is easy to implement<br />
4) It follows a probabilistic framework (robust)
5) It is computationally efficient.<br />
b. Disadvantages:<br />
1) Its ability to track time-evolving patterns is limited.
2) It cannot exclude exponential functions.<br />
14.5. Applications:<br />
1) Used <strong>in</strong> Speaker identification<br />
2) Used <strong>in</strong> Image segmentation<br />
3) Used <strong>in</strong> model<strong>in</strong>g video sequences<br />
4) Used <strong>in</strong> Musical Instrument Identification <strong>in</strong> Polyphonic<br />
Music<br />
5) Used <strong>in</strong> Extraction of melodic l<strong>in</strong>es from audio record<strong>in</strong>gs<br />
6) Used <strong>in</strong> Speaker verification/speaker identification<br />
15. Unsupervised classification Method:<br />
In unsupervised classification, the goal is harder because there<br />
are no pre-determ<strong>in</strong>ed categorizations. There are actually two<br />
approaches to unsupervised learn<strong>in</strong>g. The first approach is to<br />
teach the agent not by giv<strong>in</strong>g explicit categorizations, but by<br />
us<strong>in</strong>g some sort of reward system to <strong>in</strong>dicate success. This<br />
type of tra<strong>in</strong><strong>in</strong>g will generally fit <strong>in</strong>to the decision problem<br />
framework because the goal is not to produce a classification<br />
but to make decisions that maximize rewards. This approach<br />
nicely generalizes to the real world, where agents might be<br />
rewarded for do<strong>in</strong>g certa<strong>in</strong> actions.<br />
A second approach of unsupervised learn<strong>in</strong>g is called<br />
cluster<strong>in</strong>g. In this type of learn<strong>in</strong>g, the goal is not to maximize<br />
a utility function, but simply to f<strong>in</strong>d similarities <strong>in</strong> the tra<strong>in</strong><strong>in</strong>g<br />
data. The assumption is often that the clusters discovered will<br />
match reasonably well with an <strong>in</strong>tuitive classification. This<br />
method is commonly <strong>used</strong> <strong>in</strong> most of the applications<br />
especially <strong>in</strong> speech recognition applications. Hence this<br />
method is discussed <strong>in</strong> detail.<br />
In other terms, unsupervised learning is defined as the learning method where the computer does not get any feedback or guidance while learning; no guidelines are provided. It means that, unlike supervised learning, patterns are not labeled or classified beforehand.
15.2. Advantages and Disadvantages of Unsupervised<br />
classification:<br />
a)Advantages:<br />
1) There is no need to provide either classification rules or sample documents as a training set.
2) Unsupervised classification techniques are used when we
do not have a clear idea of rules or classifications. One<br />
possible scenario is to use unsupervised classification to<br />
provide an <strong>in</strong>itial set of categories, and to subsequently build<br />
on these through supervised classification.<br />
b)Disadvantages:<br />
1) Cluster<strong>in</strong>g might result <strong>in</strong> unexpected group<strong>in</strong>gs, s<strong>in</strong>ce the<br />
cluster<strong>in</strong>g operation is not user-def<strong>in</strong>ed, but based on an<br />
<strong>in</strong>ternal algorithm.<br />
2) Rules that create the clusters are not seen.<br />
3) The cluster<strong>in</strong>g operation is CPU <strong>in</strong>tensive and can take at<br />
least the same time as <strong>in</strong>dex<strong>in</strong>g.<br />
4) Suffers from over fitt<strong>in</strong>g<br />
15.1 Introduction to Cluster<strong>in</strong>g<br />
Cluster<strong>in</strong>g is the unsupervised classification of patterns<br />
(observations, data items, or feature vectors) <strong>in</strong>to groups<br />
(clusters). The cluster<strong>in</strong>g problem has been addressed <strong>in</strong> many<br />
contexts and by researchers <strong>in</strong> many discipl<strong>in</strong>es; this reflects<br />
its broad appeal and usefulness as one of the steps <strong>in</strong><br />
exploratory data analysis. However, clustering is a difficult combinatorial problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur.
In mach<strong>in</strong>e learn<strong>in</strong>g, unsupervised learn<strong>in</strong>g is a class of<br />
problems <strong>in</strong> which one seeks to determ<strong>in</strong>e how the data are<br />
organized. Many methods employed here are based on data<br />
m<strong>in</strong><strong>in</strong>g methods <strong>used</strong> to preprocess data. It is dist<strong>in</strong>guished<br />
from supervised learn<strong>in</strong>g (and re<strong>in</strong>forcement learn<strong>in</strong>g) <strong>in</strong> that<br />
the learner is given only unlabeled examples. Unsupervised<br />
learn<strong>in</strong>g is closely related to the problem of density estimation<br />
<strong>in</strong> statistics. However unsupervised learn<strong>in</strong>g also encompasses<br />
many other techniques that seek to summarize and expla<strong>in</strong> key<br />
features of the data.One form of unsupervised learn<strong>in</strong>g is<br />
cluster<strong>in</strong>g. Another example is bl<strong>in</strong>d source separation based<br />
on Independent Component Analysis (ICA).<br />
There are two broad classes of classification procedures: supervised classification and unsupervised classification. The supervised
classification is the essential tool <strong>used</strong> for extract<strong>in</strong>g<br />
quantitative <strong>in</strong>formation from remotely sensed image data<br />
[Richards, 1993, p85]. Us<strong>in</strong>g this method, the analyst has<br />
available sufficient known pixels to generate representative<br />
parameters for each class of <strong>in</strong>terest. This step is called<br />
tra<strong>in</strong><strong>in</strong>g. Once tra<strong>in</strong>ed, the classifier is then <strong>used</strong> to attach<br />
labels to all the image pixels accord<strong>in</strong>g to the tra<strong>in</strong>ed<br />
parameters. The most commonly <strong>used</strong> supervised<br />
classification is maximum likelihood classification (MLC),<br />
which assumes that each spectral class can be described by a<br />
multivariate normal distribution. Therefore, MLC takes
advantage of both the mean vectors and the multivariate<br />
spreads of each class, and can identify those elongated classes.<br />
However, the effectiveness of maximum likelihood<br />
classification depends on reasonably accurate estimation of<br />
the mean vector m and the covariance matrix for each spectral<br />
class data [Richards, 1993, p189]. What's more, it assumes that the classes are unimodally distributed in multivariate space. When the classes are multimodally distributed, we cannot get accurate results. The other broad class of classification is unsupervised classification. It doesn't require the analyst to have foreknowledge of the classes, and it mainly uses some clustering algorithm to classify image data [Richards, 1993,
p85]. These procedures can be <strong>used</strong> to determ<strong>in</strong>e the number<br />
and location of the uni-modal spectral classes. One of the<br />
most commonly <strong>used</strong> unsupervised classifications is the<br />
migrat<strong>in</strong>g means cluster<strong>in</strong>g classifier (MMC). This method is<br />
based on label<strong>in</strong>g each pixel to unknown cluster centers and<br />
then mov<strong>in</strong>g from one cluster center to another <strong>in</strong> a way that<br />
the SSE measure of the preceding section is reduced [Richards, 1993, p231].
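As a sketch of the migrating-means idea (essentially k-means: labels are assigned to the nearest centre, then each centre migrates to the mean of its members so that the SSE is reduced), the following Python/NumPy example uses synthetic data; the initialization and iteration count are assumptions for illustration.

import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Migrating-means / k-means: assign each sample to its nearest centre,
    then move each centre to the mean of its members, reducing the SSE."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: label each sample with the nearest centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: migrate each centre to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
centres, labels = kmeans(X, k=2)
print(centres)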
15.1.1.Data cluster<strong>in</strong>g:<br />
Data analysis underlies many comput<strong>in</strong>g applications, either<br />
<strong>in</strong> a design phase or as part of their on-l<strong>in</strong>e operations. Data<br />
analysis procedures can be dichotomized as either exploratory<br />
or confirmatory, based on the availability of appropriate<br />
models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification, of measurements
based on either (i) goodness-of-fit to a postulated model, or (ii)<br />
natural group<strong>in</strong>gs (cluster<strong>in</strong>g) revealed through analysis.<br />
Cluster analysis is the organization of a collection of patterns<br />
(usually represented as a vector of measurements, or a po<strong>in</strong>t <strong>in</strong><br />
a multidimensional space) <strong>in</strong>to clusters based on similarity.<br />
Intuitively, patterns with<strong>in</strong> a valid cluster are more similar to<br />
each other than they are to a pattern belong<strong>in</strong>g to a different<br />
cluster. An example of cluster<strong>in</strong>g is depicted <strong>in</strong> Figure 14. The<br />
<strong>in</strong>put patterns are shown <strong>in</strong> Figure 14(a), and the desired<br />
clusters are shown <strong>in</strong> Figure 14 (b). Here, po<strong>in</strong>ts belong<strong>in</strong>g to<br />
the same cluster are given the same label. The variety of<br />
techniques for represent<strong>in</strong>g data, measur<strong>in</strong>g proximity<br />
(similarity) between data elements, and group<strong>in</strong>g data<br />
elements has produced a rich and often confus<strong>in</strong>g assortment<br />
of cluster<strong>in</strong>g methods.<br />
Figure 14. Data cluster<strong>in</strong>g<br />
It is important to understand the difference between cluster<strong>in</strong>g<br />
(unsupervised classification) and discrim<strong>in</strong>ant analysis<br />
(supervised classification). In supervised classification, we are<br />
provided with a collection of labeled (preclassified) patterns;<br />
the problem is to label a newly encountered, yet unlabeled,<br />
pattern. Typically, the given labeled (tra<strong>in</strong><strong>in</strong>g) patterns are<br />
<strong>used</strong> to learn the descriptions of classes which <strong>in</strong> turn are <strong>used</strong><br />
to label a new pattern. In the case of cluster<strong>in</strong>g, the problem is<br />
to group a given collection of unlabeled patterns <strong>in</strong>to<br />
mean<strong>in</strong>gful clusters. In a sense, labels are associated with<br />
clusters also, but these category labels are data driven; that is,<br />
they are obta<strong>in</strong>ed solely from the data. Cluster<strong>in</strong>g is useful <strong>in</strong><br />
several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data
m<strong>in</strong><strong>in</strong>g, document retrieval, image segmentation, and pattern<br />
classification. However, <strong>in</strong> many such problems, there is little<br />
prior <strong>in</strong>formation (e.g., statistical models) available about the<br />
data, and the decision-maker must make as few assumptions<br />
about the data as possible. It is under these restrictions that<br />
cluster<strong>in</strong>g methodology is particularly appropriate for the<br />
exploration of <strong>in</strong>terrelationships among the data po<strong>in</strong>ts to<br />
make an assessment (perhaps prelim<strong>in</strong>ary) of their structure.<br />
The term “cluster<strong>in</strong>g” is <strong>used</strong> <strong>in</strong> several research communities<br />
to describe methods for group<strong>in</strong>g of unlabeled data.<br />
These communities have different term<strong>in</strong>ologies and<br />
assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a
dilemma regard<strong>in</strong>g the scope of this survey. The production of<br />
a truly comprehensive survey would be a monumental task<br />
given the sheer mass of literature <strong>in</strong> this area. The<br />
accessibility of the survey might also be questionable given<br />
the need to reconcile very different vocabularies and<br />
assumptions regard<strong>in</strong>g cluster<strong>in</strong>g <strong>in</strong> the various communities.<br />
The goal of this paper is to survey the core concepts and<br />
techniques <strong>in</strong> the large subset of cluster analysis with its roots<br />
<strong>in</strong> statistics and decision theory. Where appropriate,<br />
references will be made to key concepts and techniques<br />
aris<strong>in</strong>g from cluster<strong>in</strong>g methodology <strong>in</strong> the mach<strong>in</strong>e-learn<strong>in</strong>g<br />
and other communities. The audience for this paper <strong>in</strong>cludes<br />
practitioners <strong>in</strong> the pattern recognition and image analysis<br />
communities (who should view it as a summarization of<br />
current practice), practitioners <strong>in</strong> the mach<strong>in</strong>e-learn<strong>in</strong>g<br />
communities (who should view it as a snapshot of a closely<br />
related field with a rich history of well understood techniques),<br />
and the broader audience of scientific professionals (who<br />
should view it as an accessible <strong>in</strong>troduction to a mature field<br />
that is mak<strong>in</strong>g important contributions to comput<strong>in</strong>g<br />
application areas).<br />
15.1.2 Components of a Cluster<strong>in</strong>g Task<br />
Typical pattern cluster<strong>in</strong>g activity <strong>in</strong>volves the follow<strong>in</strong>g steps<br />
[Ja<strong>in</strong> and Dubes 1988]:<br />
(1) Pattern representation (optionally <strong>in</strong>clud<strong>in</strong>g feature<br />
extraction and/or selection),<br />
(2) Def<strong>in</strong>ition of a pattern proximity measure appropriate to<br />
the data doma<strong>in</strong>,<br />
(3) Cluster<strong>in</strong>g or group<strong>in</strong>g,<br />
(4) Data abstraction (if needed), and<br />
(5) Assessment of output (if needed).<br />
Figure 15 depicts a typical sequenc<strong>in</strong>g of the first three of<br />
these steps, <strong>in</strong>clud<strong>in</strong>g a feedback path where the group<strong>in</strong>g<br />
process output could affect subsequent feature extraction and<br />
similarity computations. Pattern representation refers to the<br />
number of classes, the number of available patterns, and the<br />
number, type, and scale of the features available to the<br />
cluster<strong>in</strong>g algorithm. Some of this <strong>in</strong>formation may not be<br />
controllable by the practitioner.<br />
Figure 15 Stages <strong>in</strong> Cluster<strong>in</strong>g<br />
15.1.3. Advantages and Disadvantages of cluster<strong>in</strong>g:<br />
a)Advantages:<br />
1. High performance<br />
2. Large capacity<br />
3. High availability<br />
4. Incremental growth<br />
b) Disadvantages:<br />
1. Complexity<br />
2. Inability to recover from database corruption<br />
15.1.4. Applications of Cluster<strong>in</strong>g:<br />
1. Cluster<strong>in</strong>g <strong>in</strong> the design of neural networks<br />
2. Information Retrieval
3. Data mining
4. Speech and speaker recognition
16. SIMILARITY MEASURES<br />
S<strong>in</strong>ce similarity is fundamental to the def<strong>in</strong>ition of a cluster, a<br />
measure of the similarity between two patterns drawn from<br />
the same feature space is essential to most cluster<strong>in</strong>g<br />
procedures. Because of the variety of feature types and scales,<br />
the distance measure (or measures) must be chosen carefully.<br />
It is most common to calculate the dissimilarity between two<br />
patterns us<strong>in</strong>g a distance measure def<strong>in</strong>ed on the feature space.<br />
We will focus on the well-known distance measures <strong>used</strong> for<br />
patterns whose features are all continuous. The most popular metric for continuous features is the Euclidean distance

d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \| x_i - x_j \|_2   …(33)

which is a special case (p = 2) of the Minkowski metric

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} | x_{i,k} - x_{j,k} |^p \right)^{1/p} = \| x_i - x_j \|_p   …(34)
The Euclidean distance has an <strong>in</strong>tuitive appeal as it is<br />
commonly <strong>used</strong> to evaluate the proximity of objects <strong>in</strong> two or<br />
three-dimensional space. It works well when a data set has<br />
“compact” or “isolated” clusters [Mao and Ja<strong>in</strong> 1996]. The<br />
drawback to direct use of the M<strong>in</strong>kowski metrics is the<br />
tendency of the largest-scaled feature to dom<strong>in</strong>ate the others.<br />
Solutions to this problem <strong>in</strong>clude normalization of the<br />
cont<strong>in</strong>uous features (to a common range or variance) or other<br />
weight<strong>in</strong>g schemes. L<strong>in</strong>ear correlation among features can<br />
also distort distance measures; this distortion can be alleviated<br />
by apply<strong>in</strong>g a whiten<strong>in</strong>g transformation to the data or by us<strong>in</strong>g<br />
the squared Mahalanobis distance

d_M(x_i, x_j) = (x_i - x_j) \, \Sigma^{-1} (x_i - x_j)^{T}   …(35)

where the patterns x_i and x_j are assumed to be row vectors,
and ∑ is the sample covariance matrix of the patterns or the<br />
known covariance matrix of the pattern generation process;<br />
d M (. , .). assigns different weights to different features based<br />
on their variances and pair wise l<strong>in</strong>ear correlations. Here, it is<br />
implicitly assumed that class conditional densities are<br />
unimodal and characterized by multidimensional spread, i.e.,<br />
that the densities are multivariate Gaussian. The regularized<br />
Mahalanobis distance was <strong>used</strong> <strong>in</strong> Mao and Ja<strong>in</strong> [1996] to<br />
extract hyper ellipsoidal clusters. Recently, several researchers<br />
[Huttenlocher et al. 1993; Dubuisson and Ja<strong>in</strong> 1994] have<br />
<strong>used</strong> the Hausdorff distance <strong>in</strong> a po<strong>in</strong>t set match<strong>in</strong>g context.<br />
Some cluster<strong>in</strong>g algorithms work on a matrix of proximity<br />
values <strong>in</strong>stead of on the orig<strong>in</strong>al pattern set. It is useful <strong>in</strong> such<br />
situations to pre-compute all the n(n-1)/2 pair wise distance<br />
values for the n patterns and store them <strong>in</strong> a (symmetric)<br />
matrix. Computation of distances between patterns with some<br />
or all features be<strong>in</strong>g non cont<strong>in</strong>uous is problematic, s<strong>in</strong>ce the<br />
different types of features are not comparable and (as an<br />
extreme example) the notion of proximity is effectively<br />
b<strong>in</strong>ary- valued for nom<strong>in</strong>al-scaled features. Nonetheless,<br />
practitioners (especially those <strong>in</strong> mach<strong>in</strong>e learn<strong>in</strong>g, where<br />
mixed-type patterns are common) have developed proximity<br />
measures for heterogeneous type patterns. A recent example is<br />
Wilson and Mart<strong>in</strong>ez [1997], which proposes a comb<strong>in</strong>ation<br />
of a modified M<strong>in</strong>kowski metric for cont<strong>in</strong>uous features and a<br />
distance based on counts (population) for nom<strong>in</strong>al attributes.<br />
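For the continuous-feature distances introduced above (Equations (33)–(35)), the following Python/NumPy sketch computes Euclidean, Minkowski and squared Mahalanobis distances; the example patterns and the sample covariance are arbitrary illustrations.

import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance (Equation (34)); p = 2 gives the Euclidean distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def sq_mahalanobis(x, y, cov):
    """Squared Mahalanobis distance (Equation (35)) with covariance matrix cov."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

# Arbitrary example patterns and a sample covariance from a small data set
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 5.5], [4.0, 8.0]])
cov = np.cov(X, rowvar=False)
a, b = X[0], X[3]
print(minkowski(a, b, p=2))        # Euclidean
print(minkowski(a, b, p=1))        # city-block
print(sq_mahalanobis(a, b, cov))   # accounts for feature scales/correlation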
A variety of other metrics have been reported <strong>in</strong> Diday and<br />
Simon [1976] and Ich<strong>in</strong>o and Yaguchi [1994] for comput<strong>in</strong>g<br />
the similarity between patterns represented us<strong>in</strong>g quantitative<br />
as well as qualitative features. Patterns can also be represented<br />
us<strong>in</strong>g str<strong>in</strong>g or tree structures [Knuth 1973]. Str<strong>in</strong>gs are <strong>used</strong><br />
<strong>in</strong> syntactic cluster<strong>in</strong>g [Fu and Lu 1977]. Several measures of<br />
similarity between str<strong>in</strong>gs are described <strong>in</strong> Baeza-Yates<br />
[1992]. A good summary of similarity measures between trees<br />
is given by Zhang [1995]. A comparison of syntactic and<br />
statistical approaches for pattern recognition us<strong>in</strong>g several<br />
criteria was presented <strong>in</strong> Tanaka [1995] and the conclusion<br />
was that syntactic methods are <strong>in</strong>ferior <strong>in</strong> every aspect.<br />
Therefore, we do not consider syntactic methods further <strong>in</strong><br />
this paper. There are some distance measures reported <strong>in</strong> the<br />
literature [Gowda and Krishna 1977; Jarvis and Patrick 1973]<br />
that take <strong>in</strong>to account the effect of surround<strong>in</strong>g or neighbor<strong>in</strong>g<br />
po<strong>in</strong>ts. These surround<strong>in</strong>g po<strong>in</strong>ts are called context <strong>in</strong><br />
Michalski and Stepp [1983]. The similarity between two<br />
po<strong>in</strong>ts xi and xj, given this context, is given by<br />
s(x_i, x_j) = f(x_i, x_j, \mathcal{E})   …(36)

where \mathcal{E} is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)   …(37)

where NN(x_i, x_j) is the neighbor number of x_j with respect to x_i. Figures 16 and 17 give an example. In Figure 16, the nearest neighbor of A is B, and B's nearest neighbor is A. So NN(A, B) = NN(B, A) = 1 and the MND between A and B is 2. However, NN(B, C) = 1 but NN(C, B) = 2, and therefore MND(B, C) = 3. Figure 17 was obtained from Figure 16 by adding three new points D, E, and F. Now MND(B, C) = 3 (as before), but MND(A, B) = 5. The MND between A and B has increased by
<strong>in</strong>troduc<strong>in</strong>g additional po<strong>in</strong>ts, even though A and B have not<br />
moved. The MND is not a metric (it does not satisfy the<br />
triangle <strong>in</strong>equality [Zhang 1995]). In spite of this, MND has<br />
been successfully applied <strong>in</strong> several cluster<strong>in</strong>g applications<br />
[Gowda and Diday 1992]. This observation supports the<br />
viewpo<strong>in</strong>t that the dissimilarity does not need to be a metric.<br />
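The mutual neighbor distance of Equation (37) can be computed directly from neighbor ranks; the short Python/NumPy sketch below does so for a few points placed on a line purely for illustration.

import numpy as np

def neighbor_number(points, i, j):
    """NN(x_i, x_j): the rank of x_j among the neighbors of x_i (1 = nearest)."""
    d = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(d)                     # order[0] is x_i itself (distance 0)
    return int(np.where(order == j)[0][0])    # position in the ordering = rank

def mnd(points, i, j):
    """Mutual neighbor distance, Equation (37): NN(x_i, x_j) + NN(x_j, x_i)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])  # A, B, C on a line
print(mnd(pts, 0, 1))   # A and B are each other's nearest neighbor -> 2
print(mnd(pts, 1, 2))   # NN(B, C) = 2 and NN(C, B) = 1 -> 3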
Watanabe’s theorem of the ugly duckl<strong>in</strong>g [Watanabe 1985]<br />
states: “Insofar as we use a f<strong>in</strong>ite set of predicates that are<br />
capable of dist<strong>in</strong>guish<strong>in</strong>g any two objects considered, the<br />
number of predicates shared by any two such objects is<br />
constant, <strong>in</strong>dependent of the choice of objects.” This implies<br />
that it is possible to make any two arbitrary patterns equally<br />
similar by encod<strong>in</strong>g them with a sufficiently large number of<br />
features. As a consequence, any two arbitrary patterns are<br />
equally similar, unless we use some additional doma<strong>in</strong><br />
<strong>in</strong>formation. For example, <strong>in</strong> the case of conceptual cluster<strong>in</strong>g<br />
[Michalski and Stepp 1983], the similarity between x i and x j<br />
is def<strong>in</strong>ed as<br />
s(x_i, x_j) = f(x_i, x_j, \mathcal{C}, \mathcal{E})   …(38)

where \mathcal{C} is a set of pre-defined concepts. This notion is
illustrated with the help of Figure 18. Here, the Euclidean<br />
distance between po<strong>in</strong>ts A and B is less than that between B<br />
and C. However, B and C can be viewed as “more similar”<br />
than A and B because B and C belong to the same concept<br />
(ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure.

Figure 18. Conceptual similarities between points
17. CLUSTERING TECHNIQUES:<br />
Different approaches to cluster<strong>in</strong>g data can be described with<br />
the help of the hierarchy shown <strong>in</strong> Figure 19 (other<br />
taxonometric representations of cluster<strong>in</strong>g methodology are<br />
possible; ours is based on the discussion <strong>in</strong> Ja<strong>in</strong> and Dubes<br />
[1988]). At the top level, there is a dist<strong>in</strong>ction between<br />
hierarchical and partitional approaches (hierarchical methods<br />
produce a nested series of partitions, while partitional methods<br />
produce only one). The taxonomy shown <strong>in</strong> Figure 19 must be<br />
supplemented by a discussion of cross-cutt<strong>in</strong>g issues that may<br />
(<strong>in</strong> pr<strong>in</strong>ciple) affect all of the different approaches regardless<br />
of their placement <strong>in</strong> the taxonomy.<br />
—Agglomerative vs. divisive: This aspect relates to<br />
algorithmic structure and operation. An agglomerative<br />
approach beg<strong>in</strong>s with each pattern <strong>in</strong> a dist<strong>in</strong>ct (s<strong>in</strong>gleton)<br />
cluster, and successively merges clusters together until a<br />
stopp<strong>in</strong>g criterion is satisfied. A divisive method beg<strong>in</strong>s with<br />
all patterns <strong>in</strong> a s<strong>in</strong>gle cluster and performs splitt<strong>in</strong>g until a<br />
stopp<strong>in</strong>g criterion is met.<br />
—Monothetic vs. polythetic: This aspect relates to the<br />
sequential or simultaneous use of features <strong>in</strong> the cluster<strong>in</strong>g<br />
process. Most algorithms are polythetic; that is, all features<br />
enter <strong>in</strong>to the computation of distances between patterns, and<br />
decisions are based on those distances. A simple monothetic<br />
algorithm reported <strong>in</strong> Anderberg [1973] considers features<br />
sequentially to divide the given collection of patterns. This is<br />
illustrated <strong>in</strong> Figure 20. Here, the collection is divided <strong>in</strong>to<br />
two groups us<strong>in</strong>g feature x1; the vertical broken l<strong>in</strong>e V is the<br />
separat<strong>in</strong>g l<strong>in</strong>e. Each of these clusters is further divided<br />
<strong>in</strong>dependently us<strong>in</strong>g feature x2, as depicted by the broken<br />
l<strong>in</strong>es H1 and H2. The major problem with this algorithm is<br />
that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in
<strong>in</strong>formation retrieval applications [Salton 1991]), the number<br />
of clusters generated by this algorithm is so large that the data<br />
set is divided <strong>in</strong>to un<strong>in</strong>terest<strong>in</strong>gly small and fragmented<br />
clusters.<br />
—Hard vs. fuzzy: A hard cluster<strong>in</strong>g algorithm allocates each<br />
pattern to a s<strong>in</strong>gle cluster dur<strong>in</strong>g its operation and <strong>in</strong> its output.<br />
A fuzzy cluster<strong>in</strong>g method assigns degrees of membership <strong>in</strong><br />
several clusters to each <strong>in</strong>put pattern. A fuzzy cluster<strong>in</strong>g can<br />
be converted to a hard cluster<strong>in</strong>g by assign<strong>in</strong>g each pattern to<br />
the cluster with the largest measure of membership.<br />
—Determ<strong>in</strong>istic vs. stochastic: This issue is most relevant to<br />
partitional approaches designed to optimize a squared error<br />
function. This optimization can be accomplished us<strong>in</strong>g<br />
traditional techniques or through a random search of the state<br />
space consist<strong>in</strong>g of all possible label<strong>in</strong>gs.<br />
—Incremental vs. non-<strong>in</strong>cremental: This issue arises when<br />
the pattern set to be clustered is large, and constra<strong>in</strong>ts on<br />
execution time or memory space affect the architecture of the<br />
algorithm. The early history of cluster<strong>in</strong>g methodology does<br />
not conta<strong>in</strong> many examples of cluster<strong>in</strong>g algorithms designed<br />
to work with large data sets, but the advent of data m<strong>in</strong><strong>in</strong>g has<br />
fostered the development of cluster<strong>in</strong>g algorithms that<br />
m<strong>in</strong>imize the number of scans through the pattern set, reduce<br />
the number of patterns exam<strong>in</strong>ed dur<strong>in</strong>g execution, or reduce<br />
the size of data structures <strong>used</strong> <strong>in</strong> the algorithm’s operations.<br />
A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.
Figure 19. A taxonomy of clustering approaches

Figure 20. Monothetic partitional clustering

17.1. Hierarchical Clustering Algorithms:
The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 20. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and the similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 20 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 21. The dendrogram can be broken at different levels to yield different clusterings of the data.

Figure 20. Points falling in three clusters

Figure 21. The dendrogram obtained using the single-link algorithm

Figure 21. Two concentric clusters

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum
In either case, two clusters are merged to form a larger cluster based on a minimum-distance criterion. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]: it has a tendency to produce clusters that are straggly or elongated. Figures 22 and 23 show two clusters separated by a "bridge" of noisy patterns. The single-link algorithm produces the clusters shown in Figure 22, whereas the complete-link algorithm obtains the clustering shown in Figure 23. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled "*". On the other hand, the single-link algorithm is more versatile than the complete-link algorithm. For example, the single-link algorithm can extract the concentric clusters shown in Figure 21, but the complete-link algorithm cannot. From a pragmatic viewpoint, however, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].
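To make the two linkage definitions concrete, the following minimal Python sketch (the helper names and toy points are ours, not the paper's) computes both inter-cluster distances: single-link takes the minimum pairwise distance between the clusters, complete-link the maximum.

from math import dist  # Euclidean distance (Python 3.8+)

def single_link_distance(cluster_a, cluster_b):
    # minimum distance over all pairs (one pattern from each cluster)
    return min(dist(a, b) for a in cluster_a for b in cluster_b)

def complete_link_distance(cluster_a, cluster_b):
    # maximum distance over all pairs of patterns from the two clusters
    return max(dist(a, b) for a in cluster_a for b in cluster_b)

if __name__ == "__main__":
    c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    c2 = [(4.0, 4.0), (5.0, 4.0)]
    print(single_link_distance(c1, c2))    # distance of the closest pair
    print(complete_link_distance(c1, c2))  # distance of the farthest pair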
17.2. Agglomerative Single-Link Clustering Algorithm:
(1) Place each pattern in its own cluster. Construct a list of inter-pattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.
(2) Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.
(3) The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a partition (clustering) identified by the simply connected components of the corresponding graph.
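The steps above can be sketched directly in Python. The sketch below is our own simplification, not the paper's code: it replaces the explicit threshold graphs with a union-find structure and records the partition produced each time a new dissimilarity level connects previously separate components.

from math import dist
from itertools import combinations

def single_link_levels(points):
    parent = list(range(len(points)))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # step (1): sorted list of inter-pattern distances
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    levels = []  # (dissimilarity, clusters) pairs of the nested hierarchy
    for d, i, j in edges:          # step (2): connect pairs closer than d
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            groups = {}
            for k in range(len(points)):
                groups.setdefault(find(k), []).append(k)
            levels.append((d, list(groups.values())))
        if len({find(k) for k in range(len(points))}) == 1:
            break                  # all patterns are in one connected graph
    return levels                  # step (3): cut at any recorded level

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    for d, clusters in single_link_levels(pts):
        print(round(d, 2), clusters)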
17.3. Agglomerative Complete-Link Clustering Algorithm:
(1) Place each pattern in its own cluster. Construct a list of inter-pattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.
(2) Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.
(3) The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a partition (clustering) identified by the completely connected components of the corresponding graph.
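In practice the same nested hierarchies are usually obtained from an off-the-shelf routine. The sketch below assumes SciPy is available (SciPy is our assumption; the paper does not name a library): linkage builds the hierarchy and fcluster cuts it at a chosen dissimilarity level, as in step (3) of both algorithms above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

patterns = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]], dtype=float)

# 'complete' uses the maximum pairwise distance between clusters;
# 'single' would give the single-link hierarchy instead.
Z = linkage(patterns, method="complete")

# cut the nested hierarchy at dissimilarity level 3.0 to form a partition
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)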
Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.
17.4. Hierarchical Agglomerative Clustering Algorithm:
(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.
(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.
(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.
Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
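A minimal sketch of the generic agglomerative procedure is given below; it is our illustration, not the paper's code. Inter-cluster proximities are recomputed from the pattern-level proximity matrix at every iteration (a transparent but inefficient variant of the update in step 2), and choosing min or max as the combining rule yields single-link or complete-link behaviour.

from math import dist, inf

def agglomerate(points, num_clusters, linkage="single"):
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]            # step (1)
    prox = [[dist(points[i], points[j]) for j in range(len(points))]
            for i in range(len(points))]
    while len(clusters) > num_clusters:                     # step (3)
        # step (2): find and merge the most similar pair of clusters
        best = (inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(prox[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [[points[i] for i in c] for c in clusters]

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    print(agglomerate(pts, 3, linkage="complete"))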
17.5. Partitional Algorithms:
A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters; a seminal paper [Dubes 1987] provides guidance on this key design decision. Partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.
17.5.1. Squared Error Algorithms:
The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering L of a pattern set X (containing K clusters) is
e^2(X, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2     .....(39)
where x_i^{(j)} is the ith pattern belonging to the jth cluster, n_j is the number of patterns in the jth cluster, and c_j is the centroid of the jth cluster. The k-means is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial
partition and keeps reassign<strong>in</strong>g the patterns to clusters based<br />
on the similarity between the pattern and the cluster centers<br />
until a convergence criterion is met (e.g., there is no<br />
reassignment of any pattern from one cluster to another, or the<br />
squared error ceases to decrease significantly after some<br />
number of iterations). The k-means algorithm is popular<br />
because it is easy to implement, and its time complexity is<br />
O(n), where n is the number of patterns. A major problem<br />
with this algorithm is that it is sensitive to the selection of the<br />
<strong>in</strong>itial partition and may converge to a local m<strong>in</strong>imum of the<br />
criterion function value if the <strong>in</strong>itial partition is not properly<br />
chosen. Figure 24 shows seven two-dimensional patterns. If<br />
we start with patterns A, B, and C as the <strong>in</strong>itial means around<br />
which the three clusters are built, then we end up with the<br />
partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The<br />
squared error criterion value is much larger for this partition<br />
than for the best partition {{A, B, C}, {D, E}, {F, G}} shown<br />
by rectangles, which yields the global m<strong>in</strong>imum value of the<br />
squared error criterion function for a cluster<strong>in</strong>g conta<strong>in</strong><strong>in</strong>g<br />
three clusters. The correct three-cluster solution is obta<strong>in</strong>ed by<br />
choos<strong>in</strong>g, for example, A, D, and F as the <strong>in</strong>itial cluster means.<br />
17.5.2.Squared Error Cluster<strong>in</strong>g Method:<br />
(1) Select an <strong>in</strong>itial partition of the patterns with a fixed<br />
number of clusters and cluster centers. (2) Assign each pattern<br />
to its closest cluster center and compute the new cluster<br />
centers as the centroids of the clusters. Repeat this step until<br />
convergence is achieved, i.e., until the cluster membership is<br />
stable. (3) Merge and split clusters based on some heuristic<br />
<strong>in</strong>formation, optionally repeat<strong>in</strong>g step 2.<br />
17.5.3. k-Means Clustering Algorithm:
(1) Choose k cluster centers to coincide with k randomly chosen patterns or k randomly defined points inside the hyper-volume containing the pattern set.
(2) Assign each pattern to the closest cluster center.
(3) Recompute the cluster centers using the current cluster memberships.
(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
Figure 24 The k-means algorithm is sensitive to the <strong>in</strong>itial<br />
partitions.<br />
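The four k-means steps can be written down in a few lines. The following plain-Python sketch (the toy data and the random seeding are our assumptions) stops when the assignments no longer change; different seeds can lead to different local minima, which is exactly the sensitivity Figure 24 illustrates.

import random
from math import dist

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                       # step (1)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centers[j]))
                          for p in points]                # step (2)
        if new_assignment == assignment:                  # step (4): converged
            break
        assignment = new_assignment
        for j in range(k):                                # step (3)
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    centers, labels = k_means(pts, 3)
    print(centers, labels)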
Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value. Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA algorithm [Ball and Hall 1965] employs this technique of merging and splitting clusters. If ISODATA is given the "ellipse" partitioning shown in Figure 24 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small, and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.
Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973]; Symon [1977] describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.
17.5.4. Graph-Theoretic Clustering:
The best-known graph-theoretic divisive clustering algorithm is based on constructing the minimal spanning tree (MST) of the data [Zahn 1971] and then deleting the MST edges with the largest lengths to generate clusters. Figure 25 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD, with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units. The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969], which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colourability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981]. A graph-oriented approach for non-
hierarchical structures and overlapping clusters is presented in Ozawa [1985]. The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbours. The DG contains all the neighbourhood information contained in the MST and the relative neighbourhood graph (RNG) [Toussaint 1980].
Figure 25 Using the minimal spanning tree to form clusters
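The MST-based divisive procedure can be sketched as follows (our illustration, using Prim's algorithm to build the tree): the longest MST edges are deleted and the surviving connected groups are reported as clusters.

from math import dist

def mst_edges(points):
    # Prim's algorithm: grow the tree from point 0
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        d, i, j = min((dist(points[a], points[b]), a, b)
                      for a in in_tree for b in range(n) if b not in in_tree)
        in_tree.add(j)
        edges.append((d, i, j))
    return edges

def mst_clusters(points, num_clusters):
    edges = sorted(mst_edges(points))            # keep the shortest MST edges,
    keep = edges[:len(points) - num_clusters]    # i.e. drop the longest ones
    groups = {i: {i} for i in range(len(points))}
    for _, i, j in keep:                         # merge points joined by kept edges
        merged = groups[i] | groups[j]
        for k in merged:
            groups[k] = merged
    return {frozenset(g) for g in groups.values()}

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    for cluster in mst_clusters(pts, 3):
        print(sorted(cluster))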
17.6. Mixture-Resolving and Mode-Seeking Algorithms:
The mixture resolv<strong>in</strong>g approach to cluster analysis has been<br />
addressed <strong>in</strong> a number of ways. The underly<strong>in</strong>g assumption is<br />
that the patterns to be clustered are drawn from one of several<br />
distributions, and the goal is to identify the parameters of each<br />
and (perhaps) their number. Most of the work <strong>in</strong> this area has<br />
assumed that the <strong>in</strong>dividual components of the mixture density<br />
are Gaussian, and <strong>in</strong> this case the parameters of the <strong>in</strong>dividual<br />
Gaussians are to be estimated by the procedure. Traditional<br />
approaches to this problem <strong>in</strong>volve obta<strong>in</strong><strong>in</strong>g (iteratively) a<br />
maximum likelihood estimate of the parameter vectors of the<br />
component densities [Ja<strong>in</strong> and Dubes 1988]. More recently,<br />
the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977]
for miss<strong>in</strong>g-data problems) has been applied to the problem of<br />
parameter estimation. A recent book [Mitchell 1997] provides<br />
an accessible description of the technique. In the EM<br />
framework, the parameters of the component densities are<br />
unknown, as are the mix<strong>in</strong>g parameters, and these are<br />
estimated from the patterns. The EM procedure beg<strong>in</strong>s with<br />
an <strong>in</strong>itial estimate of the parameter vector and iteratively<br />
rescores the patterns aga<strong>in</strong>st the mixture density produced by<br />
the parameter vector. The rescored patterns are then <strong>used</strong> to<br />
update the parameter estimates. In a cluster<strong>in</strong>g context, the<br />
scores of the patterns (which essentially measure their<br />
likelihood of be<strong>in</strong>g drawn from particular components of the<br />
mixture) can be viewed as h<strong>in</strong>ts at the class of the pattern.<br />
Those patterns, placed (by their scores) <strong>in</strong> a particular<br />
component, would therefore be viewed as belong<strong>in</strong>g to the<br />
same cluster. Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988].
Inspired by the Parzen w<strong>in</strong>dow approach to nonparametric<br />
density estimation, the correspond<strong>in</strong>g cluster<strong>in</strong>g procedure<br />
searches for b<strong>in</strong>s with large counts <strong>in</strong> a multidimensional<br />
histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.
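As a concrete illustration of the mixture-resolving approach, the sketch below runs EM for a one-dimensional Gaussian mixture (it assumes NumPy and synthetic data; the paper itself gives no implementation). The E-step rescores the patterns against the current mixture, and the M-step re-estimates the means, variances, and mixing weights from those scores.

import numpy as np

def em_gmm_1d(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each pattern
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the component parameters from the scores
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return means, variances, weights

if __name__ == "__main__":
    data = np.concatenate([np.random.normal(0, 1, 200),
                           np.random.normal(6, 1, 200)])
    print(em_gmm_1d(data, 2))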
17.7. Nearest Neighbour Clustering:
Since proximity plays a key role in our intuitive notion of a
cluster, nearest neighbour distances can serve as the basis of<br />
cluster<strong>in</strong>g procedures. An iterative procedure was proposed <strong>in</strong><br />
Lu and Fu [1978]; it assigns each unlabeled pattern to the<br />
cluster of its nearest labelled neighbour pattern, provided the<br />
distance to that labelled neighbour is below a threshold. The<br />
process continues until all patterns are labelled or no additional labelling occurs. The mutual neighbourhood value
(described earlier <strong>in</strong> the context of distance computation) can<br />
also be <strong>used</strong> to grow clusters from near neighbours.<br />
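A minimal sketch of this nearest-neighbour labelling procedure is given below; the seed labels, the threshold, and the helper names are our assumptions rather than details from Lu and Fu [1978].

from math import dist

def nearest_neighbour_clustering(points, seed_labels, threshold):
    labels = dict(seed_labels)   # e.g. {index: cluster_id} for a few labelled seeds
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            if i in labels:
                continue
            # nearest labelled neighbour of pattern i
            j = min(labels, key=lambda q: dist(p, points[q]), default=None)
            if j is not None and dist(p, points[j]) < threshold:
                labels[i] = labels[j]
                changed = True
    return labels   # patterns farther than the threshold stay unlabelled

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
    print(nearest_neighbour_clustering(pts, {0: "A", 3: "B"}, threshold=2.0))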
17.8. Fuzzy Cluster<strong>in</strong>g:<br />
Traditional cluster<strong>in</strong>g approaches generate partitions; <strong>in</strong> a<br />
partition, each pattern belongs to one and only one cluster.<br />
Hence, the clusters <strong>in</strong> a hard cluster<strong>in</strong>g are disjo<strong>in</strong>t. Fuzzy<br />
cluster<strong>in</strong>g extends this notion to associate each pattern with<br />
every cluster us<strong>in</strong>g a membership function [Zadeh 1965]. The<br />
output of such algorithms is a cluster<strong>in</strong>g, but not a partition.<br />
We give a high-level partitional fuzzy cluster<strong>in</strong>g algorithm<br />
below.<br />
17.8.1. Fuzzy Clustering Algorithm:
(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N x K membership matrix U. An element u_{ik} of this matrix represents the grade of membership of object x_i in cluster c_k. Typically, u_{ik} is in [0,1].
(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is
E^2(X, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \| x_i - c_k \|^2     ...(40)
where c_k is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U.
(3) Repeat step 2 until entries in U do not change significantly.
In fuzzy clustering, each cluster is a fuzzy set of all the patterns.
Figure 26 illustrates the idea. The rectangles enclose two<br />
“hard” clusters <strong>in</strong> the data: H1 ={1,2,3,4,5} and H2={6,7,8,9}<br />
A fuzzy cluster<strong>in</strong>g algorithm might produce the two fuzzy<br />
clusters F1 and F2 depicted by ellipses. The patterns will have<br />
membership values in [0,1] for each cluster. For example, fuzzy cluster F1 could be compactly described as a set of ordered pairs (i, mu_i), one pair per pattern.
Figure 26 Fuzzy clusters
Figure 27 Representation of a cluster by points
The ordered pairs (i, mu_i) in each cluster give the ith pattern and its membership value mu_i in that cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership values.
Fuzzy set theory was <strong>in</strong>itially applied to cluster<strong>in</strong>g <strong>in</strong> Rusp<strong>in</strong>i<br />
[1969]. The book by Bezdek [1981] is a good source for<br />
material on fuzzy cluster<strong>in</strong>g. The most popular fuzzy<br />
cluster<strong>in</strong>g algorithm is the fuzzy c-means (FCM) algorithm.<br />
Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of
membership functions is the most important problem <strong>in</strong> fuzzy<br />
cluster<strong>in</strong>g; different choices <strong>in</strong>clude those based on similarity<br />
decomposition and centroids of clusters. A generalization of<br />
the FCM algorithm was proposed by Bezdek [1981] through a<br />
family of objective functions. A fuzzy c-shell algorithm and<br />
an adaptive variant for detect<strong>in</strong>g circular and elliptical<br />
boundaries was presented <strong>in</strong> Dave [1992].<br />
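For completeness, here is a minimal sketch of the fuzzy c-means iteration (assuming NumPy and the common fuzzifier m = 2; the paper does not prescribe an implementation): the cluster centers are computed from the memberships and the memberships from the distances to the centers, so every pattern keeps a membership in [0, 1] for every cluster.

import numpy as np

def fuzzy_c_means(X, k, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)          # initial fuzzy partition
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # membership update: inverse-distance ratios raised to 2/(m-1)
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return centers, U

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]], dtype=float)
    centers, U = fuzzy_c_means(X, 2)
    print(centers)
    print(np.round(U, 2))   # one membership value per pattern and cluster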
17.9. Representation of Clusters:<br />
In applications where the number of classes or clusters <strong>in</strong> a<br />
data set must be discovered, a partition of the data set is the<br />
end product. Here, a partition gives an idea about the<br />
separability of the data po<strong>in</strong>ts <strong>in</strong>to clusters and whether it is<br />
mean<strong>in</strong>gful to employ a supervised classifier that assumes a<br />
given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form
to achieve data abstraction. Even though the construction of a<br />
cluster representation is an important step <strong>in</strong> decision mak<strong>in</strong>g,<br />
it has not been exam<strong>in</strong>ed closely by researchers. The notion of<br />
cluster representation was <strong>in</strong>troduced <strong>in</strong> Duran and Odell<br />
[1974] and was subsequently studied <strong>in</strong> Diday and Simon<br />
[1976] and Michalski et al. [1981]. They suggested the<br />
follow<strong>in</strong>g representation schemes:<br />
(1) Represent a cluster of po<strong>in</strong>ts by their centroid or by a set<br />
of distant po<strong>in</strong>ts <strong>in</strong> the cluster. Figure 27 depicts these two<br />
ideas.<br />
(2) Represent clusters us<strong>in</strong>g nodes <strong>in</strong> a classification tree.<br />
(3) Represent clusters by using conjunctive logical expressions. For example, the conjunctive expression (X1 > 3) AND (X2 < 2) in Figure 28 stands for the logical statement "X1 is greater than 3" and "X2 is less than 2". Use of the centroid to represent a
cluster is the most popular scheme. It works well when the<br />
clusters are compact or isotropic. However, when the clusters<br />
are elongated or non-isotropic, then this scheme fails to<br />
represent them properly. In such a case, the use of a collection<br />
of boundary po<strong>in</strong>ts <strong>in</strong> a cluster captures its shape well. The<br />
number of po<strong>in</strong>ts <strong>used</strong> to represent a cluster should <strong>in</strong>crease as<br />
the complexity of its shape increases. The two different representations illustrated in Figure 28 are equivalent. Every
path <strong>in</strong> a classification tree from the root node to a leaf node<br />
corresponds to a conjunctive statement. An important<br />
limitation of the typical use of the simple conjunctive concept<br />
representations is that they can describe only rectangular or<br />
isotropic clusters <strong>in</strong> the feature space. Data abstraction is<br />
useful <strong>in</strong> decision mak<strong>in</strong>g because of the follow<strong>in</strong>g: (1) It<br />
gives a simple and <strong>in</strong>tuitive description of clusters which is<br />
easy for human comprehension. In both conceptual clustering [Michalski et al. 1981] and symbolic clustering [Gowda and Diday 1992], this representation is obtained without using any additional step; these algorithms generate the clusters as well as their descriptions.
Figure 28 Representation of clusters by a classification tree or<br />
by conjunctive statements<br />
A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers. (2) It helps in
achiev<strong>in</strong>g data compression that can be exploited further by a<br />
computer [Murty and Krishna 1980]. Figure 19(a) shows<br />
samples belong<strong>in</strong>g to two cha<strong>in</strong>-like clusters labeled 1 and 2.<br />
A partitional cluster<strong>in</strong>g like the k-means algorithm cannot<br />
separate these two structures properly. The s<strong>in</strong>gle-l<strong>in</strong>k<br />
algorithm works well on this data, but is computationally<br />
expensive. So a hybrid approach may be <strong>used</strong> to exploit the<br />
desirable properties of both these algorithms. We obta<strong>in</strong> 8 sub<br />
clusters of the data us<strong>in</strong>g the computationally efficient) k-<br />
means algorithm. Each of these sub clusters can be<br />
represented by their centroids as shown <strong>in</strong> Figure 19(a). Now<br />
the s<strong>in</strong>gle- l<strong>in</strong>k algorithm can be applied on these centroids<br />
alone to cluster them <strong>in</strong>to 2 groups. The result<strong>in</strong>g groups are<br />
shown <strong>in</strong> Figure 19(b). Here, a data reduction is achieved by<br />
represent<strong>in</strong>g the sub clusters by their centroids. (3) It <strong>in</strong>creases<br />
the efficiency of the decision mak<strong>in</strong>g task. In a cluster based<br />
document retrieval technique [Salton 1991], a large collection<br />
of documents is clustered and each of the clusters is<br />
represented us<strong>in</strong>g its centroid. In order to retrieve documents<br />
relevant to a query, the query is matched with the cluster<br />
centroids rather than with all the documents. This helps <strong>in</strong><br />
retriev<strong>in</strong>g relevant documents efficiently. Also <strong>in</strong> several<br />
applications <strong>in</strong>volv<strong>in</strong>g large data sets, cluster<strong>in</strong>g is <strong>used</strong> to<br />
perform <strong>in</strong>dex<strong>in</strong>g, which helps <strong>in</strong> efficient decision mak<strong>in</strong>g<br />
[Dorai and Ja<strong>in</strong> 1995].<br />
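The hybrid scheme of point (2) can be sketched as follows (assuming NumPy and SciPy, with synthetic chain-like data standing in for the data of Figure 19(a)): k-means first produces a handful of sub-clusters, and single-link clustering is then applied to the sub-cluster centroids only.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two chain-like clusters (synthetic stand-in for the data of the example)
t = np.linspace(0, 4, 100)
chain1 = np.c_[t, np.sin(t)] + 0.05 * rng.standard_normal((100, 2))
chain2 = np.c_[t, np.sin(t) + 2.0] + 0.05 * rng.standard_normal((100, 2))
data = np.vstack([chain1, chain2])

# step 1: 8 sub-clusters with the computationally efficient k-means
centroids, sub_labels = kmeans2(data, 8, minit="points")

# step 2: single-link clustering of the 8 centroids into 2 groups
Z = linkage(centroids, method="single")
centroid_groups = fcluster(Z, t=2, criterion="maxclust")

# each point inherits the group of its sub-cluster centroid
point_groups = centroid_groups[sub_labels]
print(np.bincount(point_groups))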
18. Evaluation of classification techniques:<br />
A framework to evaluate and analyze classification techniques was proposed in a Fractal white paper [79]; it is covered in this paper using the following criteria:
• Statistical assumptions<br />
• Data needs<br />
• Complexity of deployment<br />
• Model Performance<br />
• Model build<strong>in</strong>g time<br />
18.1 Statistical Assumptions:<br />
All parametric techniques make statistical assumptions about<br />
data. In most real life cases, these assumptions cannot be fully<br />
met. Pragmatism mixed with caution should help <strong>in</strong> gett<strong>in</strong>g<br />
the best of a model<strong>in</strong>g technique. If there are multicoll<strong>in</strong>earity<br />
issues with data, for example, one should<br />
def<strong>in</strong>itely explore the use of non-parametric techniques like<br />
neural networks or genetic algorithms for a possible superior<br />
fit compared with parametric statistical methods. Similarly, if<br />
the sample has a skewed good-bad mix, discriminant and k-NN techniques are likely to underperform vis-à-vis the other
techniques. A presence of complex non-l<strong>in</strong>ear relationships<br />
with<strong>in</strong> data precludes the use of l<strong>in</strong>ear techniques. In such<br />
situations, recursive partition<strong>in</strong>g and non-parametric<br />
techniques are likely to outperform most parametric statistical<br />
techniques.<br />
18.2 Data Needs:<br />
All techniques perform better if they are exposed to large<br />
sample of representative data. Equal number of good and bad<br />
observations also can help <strong>in</strong> model build<strong>in</strong>g. However, <strong>in</strong><br />
most practical situations, availability of enough data po<strong>in</strong>ts on<br />
both event types is difficult. Non-parametric and recursive<br />
partition<strong>in</strong>g techniques usually tend to be more data hungry<br />
than parametric techniques. As discussed <strong>in</strong> the previous<br />
section, discriminant analysis and k-NN techniques are strongly sensitive to the good-bad mix in the data. k-NN is also
sensitive to the presence of irrelevant variables <strong>in</strong> model<br />
build<strong>in</strong>g. All non-parametric techniques have a tendency to<br />
over fit the model when number of variables <strong>used</strong> for model<br />
build<strong>in</strong>g is large. In these cases, it might be a useful idea to<br />
run parametric statistical techniques and delete unimportant<br />
variables from analysis before proceed<strong>in</strong>g to use nonparametric<br />
techniques.<br />
Table 4 - A comparative study of credit scoring techniques
Source: Monserrat Guillen, "Count data models for a credit scoring system", 1992
18.3. Model Build<strong>in</strong>g Time:<br />
Model build<strong>in</strong>g is an iterative process. It requires<br />
experimentation with alternative predictor variables and<br />
several different transformations. Model build<strong>in</strong>g is also a<br />
multi stage process and at each stage, many variables could be<br />
dropped or altered for seek<strong>in</strong>g a better fit. The time taken to<br />
tra<strong>in</strong> a model, can <strong>in</strong>fluence the choice of technique <strong>in</strong> some<br />
cases. Parametric techniques take relatively less time for<br />
comput<strong>in</strong>g a model. L<strong>in</strong>ear models are the friendliest <strong>in</strong> this<br />
respect. Non-parametric techniques, on the other hand, can<br />
take inordinate amounts of time for model training. k-NN is an O(n²) process and can thus take a large amount of time on large training data. Recursive partitioning techniques may take less time than non-parametric techniques but are slower
compared to logistic regression. Model building time is also important because re-calibration of models might be undertaken frequently in the light of additional data.
18.4. Transparency:<br />
Transparency of the model plays an important role <strong>in</strong> the<br />
acceptance of the model by users. The black-box approach of<br />
non-parametric techniques is probably the most important<br />
ground for us<strong>in</strong>g other techniques. <strong>Classification</strong> trees provide<br />
the most user friendly and <strong>in</strong>tuitive output amongst<br />
classification techniques. Parametric models are also<br />
transparent and show the contribution of each variable to the<br />
score. In cases of multicoll<strong>in</strong>earity, logistic and l<strong>in</strong>ear models<br />
might not truly reflect the importance of each variable. This is<br />
because another correlated variable might have accounted for<br />
the dependent variable by virtue of having entered the
model earlier.<br />
18.5. Deployment:<br />
Practical considerations of deployment might sometimes rule<br />
out the use of some techniques. Deployment of nonparametric<br />
techniques can be cumbersome and might require<br />
writ<strong>in</strong>g of programm<strong>in</strong>g code or use of proprietary software<br />
components. Deployment of parametric and classification tree<br />
models is relatively simpler. An organization should measure<br />
the <strong>in</strong>cremental profit from a model versus the <strong>in</strong>cremental<br />
deployment cost and effort to decide on the choice of model<br />
for deployment.<br />
19. Survey of classification techniques <strong>used</strong> <strong>in</strong> different<br />
speech recognition applications:<br />
TABLE 5
Classification techniques adopted in different speech recognition applications
The abbreviations used in this table are as follows:
20. Comparison of Classification techniques:
We summarize the most commonly used classifiers in Table 6. Many of them represent, in fact, an entire family of classifiers and allow the user to modify the associated parameters and criterion functions. All of these classifiers are admissible, in the sense that there exist classification problems for which each of them is the best choice; the StatLog project, which compared many of them, showed a large variability in their relative performances, demonstrating that there is no overall optimal classification rule.
TABLE 6
Classification methods
21. Some well-known clustering algorithms are listed in Table 7 [81].
TABLE 7
Clustering algorithms
22. Conclusions:
In this overview paper, different classification techniques have been discussed. At the beginning of the paper, a taxonomy of the classification techniques was presented and explained. For each method, the advantages, disadvantages, and various application areas have been given. The purpose of this paper is to give young researchers a brief account of the classification techniques used in the area of speech recognition. The contribution of this paper is a survey of the different classification methods applied to different speech recognition applications, together with their evaluation criteria; comments on and properties of the different classification methods and of the different clustering algorithms are also discussed.
Acknowledgements:<br />
Thanks are due to Prof. G. Krishna, Professor (Retd.), Indian Institute of Science, Bangalore, and Dr. M. Narshima Murthy, Professor, Dept. of Automation and Computer Science, Indian Institute of Science, Bangalore, for useful discussions while preparing this manuscript.
REFERENCES:<br />
1)Anil K.Ja<strong>in</strong>, “Statistical Pattern <strong>Recognition</strong>”,IEEE<br />
Transactions On Pattern Analysis And Mach<strong>in</strong>e Intelligence,<br />
Vol. 22, No. 1, January 2000<br />
2)A. K. Ja<strong>in</strong>, R. P. W. Du<strong>in</strong>, and J. Mao, “Statistical Pattern<br />
<strong>Recognition</strong>: A Review”, IEEE Trans. on Pattern Analysis and<br />
Mach<strong>in</strong>e Intelligence, 22(1):4-37, January 2000.<br />
3) J. A. Bilmes, “A Gentle Tutorial on the EM Algorithm and<br />
its Application to Parameter Estimation for Gaussian Mixture<br />
and Hidden Markov Models, Technical Report TR-97-021”,<br />
International Computer Science Institute, University of<br />
California, Berkeley, April 1998.<br />
4) J. A. Anderson, P. R. Krishnaiah et.al “Logistic<br />
Discrim<strong>in</strong>ation, Handbook of Statistics”, , vol. 2, pp. 169-191,<br />
Amsterdam: North Holland,1982.<br />
5) Rabiner and Juang, "Fundamentals of Speech Recognition", Pearson Education, 1993.
6) R.A. Fisher, “The Use of Multiple Measurements <strong>in</strong><br />
Taxonomic Problems”, Annals of Eugenics, vol. 7, part II, pp.<br />
179-188, 1936.<br />
7) Dasarathy, B.V,“M<strong>in</strong>imal consistent set (MCS)<br />
identification for optimal nearest neighbor decision systems<br />
design”, IEEE Transactions on Systems, Man and cybernetics,<br />
Vol. 24, Issue: 3, pp:511 – 517, March 1994.<br />
8) Girolami, M and Chao He “Probability density estimation<br />
from optimally condensed data samples” pattern Analysis and<br />
Mach<strong>in</strong>e Intelligence, IEEE Transactions on, Volume: 25,<br />
Issue: 10 , pp:1253 – 1264,Oct. 2003.<br />
9) Meijer, B.R.; “Rules and algorithms for the design of<br />
templates for template match<strong>in</strong>g”, Pattern recognition, 1992.<br />
Vol.1. Conference A: Computer Vision and Applications, 11th<br />
IAPR International Conference on, pp: 760 – 763, Aug. 1992.<br />
10) Hush, D.R., Horne B.G. “Progress <strong>in</strong> supervised neural<br />
networks”, Signal Process<strong>in</strong>g Magaz<strong>in</strong>e, IEEE, Vol. 10, Issue:<br />
1, pp:8 – 39, Jan. 1993.<br />
11)Vapnik, V., “The Nature of Statistical Learn<strong>in</strong>g Theory”,<br />
Spr<strong>in</strong>ger, 1995.<br />
12) Julia Neumann, Christoph Schnorr, “SVM-based feature<br />
selection by direct objective m<strong>in</strong>imization”, 2004.<br />
13) Lihong Zheng and Xiangjian , “<strong>Classification</strong> <strong>Techniques</strong><br />
<strong>in</strong> Pattern <strong>Recognition</strong>”, University of Technology, , Australia<br />
2007.<br />
14) L. R. Rabiner, J. G. Wilpon, A. M. Quinn, and S. G.
Terrace, “On the application of embedded digit tra<strong>in</strong><strong>in</strong>g to<br />
speaker <strong>in</strong>dependent connected digit recognition,” IEEE<br />
Transactions on Acoustics, <strong>Speech</strong> and Signal Process<strong>in</strong>g, vol.<br />
32, no. 2, pp. 272–280, April 1984.<br />
15) T. M. Cover and P. E. Hart, “Nearest neighbor pattern<br />
classification", IEEE Transactions on Information Theory, vol. IT-13, pp. 21-27, 1967.
16) Y. Chenyz Y. Hungyz C. Fuhz, “Fast Algorithm for<br />
Nearest Neighbor Search Based on a Lower Bound Tree”,<br />
Proceed<strong>in</strong>gs of the 8th International Conference on Computer<br />
Vision, Vancouver, Canada, July 2001.<br />
17) R. Bellman and S. Dreyfus, “Applied Dynamic<br />
Programm<strong>in</strong>g”, Pr<strong>in</strong>ceton, NJ, Pr<strong>in</strong>ceton University Press,<br />
1962.<br />
18) H. Silverman and D. Morgan, “The application of<br />
dynamic programm<strong>in</strong>g to connected speech<br />
recognition” ,IEEE ASSP Magaz<strong>in</strong>e, vol. 7, no. 3,pp. 6-25,<br />
1990.<br />
19) Alex Waibel and Kai-Fu Lee, “Read<strong>in</strong>gs of speech<br />
recognition”,Morgan Kaufmann Publishers, San<br />
Mateo,Calif,1990.<br />
20) Rab<strong>in</strong>er and Jung, “ HMM Tutorial”, IEEE Transactions<br />
on Acoustics, <strong>Speech</strong> and Signal Process<strong>in</strong>g, vol. 39, no. 5, pp.<br />
272–280, April 1984.<br />
21) Dat.Tat.Tran, “Fuzzy approaches to speech and speaker<br />
recognition", a Ph.D. thesis submitted to the University of Canberra, Australia, May 2000.
22) W. S. McCulloch and W. H. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
23) P. Gall<strong>in</strong>ari, S. Thiria, R. Badran, and F. Fogelman-Soulie,<br />
“On the relationships between discrim<strong>in</strong>ant analysis and<br />
multilayer perceptrons,”Neural Networks, vol. 4, pp. 349–360,<br />
1991.<br />
24) H. Asoh and N. Otsu, “An approximation of nonl<strong>in</strong>ear<br />
discrim<strong>in</strong>ant analysis by multilayer neural networks,” <strong>in</strong> Proc.<br />
Int. Jo<strong>in</strong>t Conf. Neural Networks, San Diego, CA, 1990, pp.<br />
III-211–III-216.<br />
25) A. R.Webb and D. Lowe, “The optimized <strong>in</strong>ternal<br />
representation of multilayer classifier networks performs<br />
nonl<strong>in</strong>ear discrim<strong>in</strong>ant analysis,” Neural Networks, vol. 3, no.<br />
4, pp. 367–375, 1990.<br />
26) G. S. Lim, M. Alder, and P. Had<strong>in</strong>gham, “Adaptive<br />
quadratic neural nets”, Pattern <strong>Recognition</strong>. Letter, vol. 13, pp.<br />
325–329, 1992.<br />
27) S. Raudys, “Evolution and generalization of a s<strong>in</strong>gle<br />
neuron: I. S<strong>in</strong>glelayer perceptron as seven statistical<br />
classifiers”, Neural Networks, vol. 11, pp. 283–296, 1998.<br />
28) S.Raudys,“Evolution and generalization of a s<strong>in</strong>gle<br />
neurone: II. Complexity of statistical classifiers and sample<br />
size considerations,” Neural Networks, vol. 11, pp. 297–313,<br />
1998.<br />
29) F. Kanaya and S. Miyake, “Bayes statistical behavior and<br />
valid generalization of pattern classify<strong>in</strong>g neural networks,”<br />
IEEE Trans. Neural Networks, vol. 2, no. 4, pp. 471–475,<br />
1991.<br />
30) S. Miyake and F. Kanaya, “A neural network approach to<br />
a Bayesian statistical decision problem”, IEEE Trans. Neural<br />
Networks, vol. 2, pp. 538–540, 1991<br />
31) D. G. Kle<strong>in</strong>baum, L. L. Kupper, and L. E. Chambless,<br />
“Logistic regression analysis of epidemiologic data”, Theory<br />
Practice, Commun. Statist. A, vol. 11, pp. 485–547, 1982.<br />
32) F. E. Harreli and K. L. Lee, “A comparison of the<br />
discrim<strong>in</strong>ant analysis and logistic regression under<br />
multivariate normality”, <strong>in</strong> Biostatistics: Statistics <strong>in</strong><br />
Biomedical, Public Health, and Environmental Sciences, P. K.<br />
Sen, Ed, Amsterdam, The Netherlands: North Holland, 1985.<br />
33) S. J. Press and S. Wilson, “Choos<strong>in</strong>g between logistic<br />
regression and discrim<strong>in</strong>ant analysis”, J. Amer. Statist. Assoc.,<br />
vol. 73, pp. 699–705,1978.<br />
34) M. Schumacher, R. Robner, andW. Vach, “Neural<br />
networks and logistic regression: Part I”, Comput. Statist.<br />
Data Anal., vol. 21, pp. 661–682,1996.<br />
35) W. Vach, R. Robner, and M. Schumacher, “Neural<br />
networks and logistic regression: Part II”, Comput. Statist.<br />
Data Anal., vol. 21, pp. 683–701,1996.<br />
36) B. Cheng and D. Titter<strong>in</strong>gton, “Neural networks: A review<br />
from a statistical perspective”, Statist. Sci., vol. 9, no. 1, pp.<br />
2–54, 1994.<br />
37) A. Ciampi and Y. Lechevallier, “Statistical models as<br />
build<strong>in</strong>g blocks of neural networks”, Commun. Statist., vol. 26,<br />
no. 4, pp. 991–1009, 1997.<br />
38) L. Holmstrom, P. Koist<strong>in</strong>en, J. Laaksonen, and E. Oja,<br />
“Neural and statistical classifiers-taxonomy and two case<br />
studies”, IEEE Trans. Neural Networks, vol. 8, pp. 5–17, 1997.<br />
39) A. Ripley, “Statistical aspects of neural networks”, <strong>in</strong><br />
Networks and Chaos—Statistical and Probabilistic Aspects, O.<br />
E. Barndorff-Nielsen, J. L. Jensen, andW. S. Kendall, Eds.<br />
London, U.K.: Chapman & Hall, 1993, pp. 40–123<br />
40) “Neural networks and related methods for classification”,<br />
J. R.Statist. Soc. B, vol. 56, no. 3, pp. 409–456, 1994.<br />
41) I. Sethi and M. Otten, “Comparison between entropy net<br />
and decision tree classifiers”, <strong>in</strong> Proc. Int. Jo<strong>in</strong>t Conf. Neural<br />
Networks, vol. 3, 1990, pp. 63–68.<br />
42) P. E. Utgoff, “Perceptron trees: A case study <strong>in</strong> hybrid<br />
concept representation”, Connect. Sci., vol. 1, pp. 377–391,<br />
1989.<br />
43) A. Ripley, “Statistical aspects of neural networks”, <strong>in</strong><br />
Networks and Chaos—Statistical and Probabilistic Aspects, O.<br />
E. Barndorff-Nielsen, J. L. Jensen, andW. S. Kendall, Eds.<br />
London, U.K.: Chapman & Hall, 1993, pp. 40–123.<br />
44) J. R.Statist. Soc. B, “Neural networks and related methods<br />
for classification,” International journal on Pattern recognition,<br />
vol. 56, no. 3, pp. 409–456, 1994.<br />
45) D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds.,<br />
“Mach<strong>in</strong>e Learn<strong>in</strong>g,Neural, and Statistical <strong>Classification</strong>”,<br />
London, U.K.: Ellis Horwood,1994.<br />
46) D. E. Brown, V. Corruble, and C. L. Pittard, “A<br />
comparison of decision tree classifiers with back propagation<br />
neural networks for multimodal classification problems”,<br />
Pattern <strong>Recognition</strong>., vol. 26, pp. 953–961, 1993.<br />
47) S. P. Curram and J. M<strong>in</strong>gers, “Neural networks, decision<br />
tree <strong>in</strong>duction and discrim<strong>in</strong>ant analysis: An empirical<br />
comparison”, J. Oper. Res. Soc., vol. 45, no. 4, pp. 440–450,<br />
1994.<br />
48) A. Hart, “Us<strong>in</strong>g neural networks for classification tasks—<br />
Some experiments on datasets and practical advice”, J. Oper.<br />
Res. Soc., vol. 43, pp.215–226, 1992.<br />
49) T. S. Lim,W. Y. Loh, and Y. S. Shih, “An empirical<br />
comparison of decision trees and other classification methods”,<br />
Dept. Statistics, Univ.Wiscons<strong>in</strong>, Madison, Tech. Rep. 979,<br />
1998.<br />
50) E. Patwo, M. Y. Hu, and M. S. Hung, “Two-group<br />
classification us<strong>in</strong>g neural networks”, Decis. Sci., vol. 24, no.<br />
4, pp. 825–845, 1993.<br />
51) M. S. Sanchez and L. A. Sarabia, “Efficiency of multilayered<br />
feed-forward neural networks on classification <strong>in</strong><br />
relation to l<strong>in</strong>ear discrim<strong>in</strong>ant analysis, quadratic discrim<strong>in</strong>ant<br />
analysis and regularized discrim<strong>in</strong>ant analysis”, Chemometr.<br />
Intell. Labor.Syst., vol. 28, pp. 287–303, 1995.<br />
52) V. Subramanian, M. S. Hung, and M. Y. Hu, “An<br />
experimental evaluation of neural networks for classification”,<br />
Comput. Oper. Res., vol. 20,pp. 769–782, 1993.<br />
53) R. Kohavi and D. H. Wolpert, “Bias plus variance<br />
decomposition for zero-one loss functions,” <strong>in</strong> Proc. 13th Int.<br />
Conf. Mach<strong>in</strong>e Learn<strong>in</strong>g,1996, pp. 275–283.<br />
54) L. Atlas, R. Cole, J. Connor, M. El-Sharkawi, R. J. Marks<br />
II, Y. Muthusamy, and E. Barnard, “Performance comparisons<br />
between back propagation networks and classification trees on<br />
three real-world applications”, <strong>in</strong> Advances <strong>in</strong> Neural<br />
Information Process<strong>in</strong>g Systems, D. S. Touretzky, Ed. San<br />
Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 622–629.<br />
55) T. G. Dietterich and G. Bakiri, “Solv<strong>in</strong>g multiclass<br />
learn<strong>in</strong>g problems via error-correct<strong>in</strong>g output codes”, J. Artif.<br />
Intell. Res., vol. 2, pp. 263–286, 1995.<br />
56) W. Y. Huang and R. P. Lippmann, “Comparisons between<br />
neural net and conventional classifiers”, IEEE 1st Int. Conf.<br />
Neural Networks, San Diego, CA, 1987, pp. 485–493.<br />
57) E. Patwo, M. Y. Hu, and M. S. Hung, “Two-group<br />
classification us<strong>in</strong>g neural networks”, Decis. Sci., vol. 24, no.<br />
4, pp. 825–845, 1993<br />
58) G. Cybenko, “Approximation by super-positions of a<br />
sigmoidal function”, Math. Contr. Signals Syst., vol. 2, pp.<br />
303–314, 1989.<br />
59) K. Hornik, “Approximation capabilities of multilayer feed<br />
forward networks”, Neural Networks, vol. 4, pp. 251–257,<br />
1991.<br />
60] K. Hornik, M. St<strong>in</strong>chcombe, and H. White, “Multilayer<br />
feed forward networks are universal approximators”, Neural<br />
Networks, vol. 2, pp. 359–366, 1989.<br />
61) M. D. Richard and R. Lippmann, “Neural network<br />
classifiers estimate Bayesian a posteriori probabilities”,<br />
Neural Comput., vol. 3, pp. 461–483, 1991.<br />
62) R. Solera-Ure˜na, J. Padrell-Sendra et.al, “SVMs for<br />
Automatic <strong>Speech</strong> <strong>Recognition</strong>: A Survey” Signal Theory and<br />
Communications Department EPS-Universidad Carlos III de<br />
Madrid,Avda. de la Universidad, 30, 28911-Legan´es<br />
(Madrid), SPAIN<br />
63) B.E. Boser, I. Guyon, and V. Vapnik, “ A tra<strong>in</strong><strong>in</strong>g<br />
algorithm for optimal marg<strong>in</strong> classifiers”, Computational<br />
Learn<strong>in</strong>g Theory, pages 144–152, 1992.<br />
64) F. P´erez-Cruz and O. Bousquet, “ Kernel Methods and<br />
Their Potential Use <strong>in</strong> Signal Process<strong>in</strong>g”. IEEE Signal<br />
Process<strong>in</strong>g Magaz<strong>in</strong>e, 21(3):57–65, 2004.<br />
65) R. Fletcher., “Practical Methods of Optimization”. Wiley-<br />
Interscience, New York, NY (USA), 1987.<br />
66) Earl Gosh et.al.,, “Pattern recognition”, School of<br />
computer science, -Tele-communciations and <strong>in</strong>formation<br />
system, DePaul University, Prentice Hall of India, New Delhi.<br />
67) . Gray, R. “ Vector Quantization”, IEEE ASSP<br />
Magaz<strong>in</strong>epp. 4–29, 1984.<br />
68) Reynolds, D.A., “A Gaussian Mixture Model<strong>in</strong>g<br />
Approach to Text-Independent Speaker Identification”, PhD<br />
thesis, Georgia Institute of Technology ,1992.<br />
69) Reynolds, D.A., Rose, R.C, “ Robust Text-Independent<br />
Speaker Identification us<strong>in</strong>g Gaussian Mixture Speaker<br />
Models”, IEEE Transactions on Acoustics, <strong>Speech</strong>, and Signal<br />
Process<strong>in</strong>g 3(1) (1995) 72–83.<br />
70) McLachlan, G., ed. “ Mixture Models” Marcel Dekker,<br />
New York, NY,1988.<br />
71) Dempster, A., Laird, N., Rub<strong>in</strong>, D., “Maximum<br />
Likelihood from Incomplete Data via the EM Algorithm”,<br />
Journal of the Royal Statistical Society 39(1) 1–38, 1977.<br />
72) Reynolds, D.A., Quatieri, T.F., Dunn, R.B, “Speaker<br />
Verification Us<strong>in</strong>g Adapted Gaussian Mixture Models”,<br />
Digital Signal Process<strong>in</strong>g Vol.10,pp. 19–41, 2000 .<br />
73) A.K.Ja<strong>in</strong>,M.N.Murthy and P.J.FLYNN, “Data Cluster<strong>in</strong>g:<br />
A Review”, The Ohio State University, ACM Comput<strong>in</strong>g<br />
Surveys, Vol. 31, No. 3, September 1999.<br />
74) S.Watanabe, “Pattern recognition: Human and<br />
mechanical”,Wiley,Newyork-1985.<br />
75)K.S.Fu, “A step towards unification of syntactic and<br />
statistical pattern recognition”, IEEE Trans. On Pattern<br />
<strong>Recognition</strong> and Mach<strong>in</strong>e Intelligence, Vol.5,no.2,pp 200-<br />
205,March 1983.<br />
76) Anil K.Ja<strong>in</strong> et.al, “Statistical pattern recognition: a<br />
Review”, IEEE Trans. on Pattern Analysis and Mach<strong>in</strong>e<br />
<strong>in</strong>telligence, Vol.22,no.1,PP.4-37,Jan 2000.<br />
77) Monserrat, Guillen, Manuel, Artis, "Count data models for<br />
credit scor<strong>in</strong>g system”, Third Meet<strong>in</strong>g on the European<br />
Conference Series <strong>in</strong> Quantitative Economics and<br />
Econometrics on Econometrics of Duration, Count and<br />
Transition Models, Paris, December 1992.<br />
78) Thomas, Lyn. C, “A Survey of Credit and Behavioral<br />
Scor<strong>in</strong>g; Forecast<strong>in</strong>g f<strong>in</strong>ancial risk of lend<strong>in</strong>g to consumers”,<br />
University of Ed<strong>in</strong>burgh,2000.<br />
79) A Fractal Whitepaper, “Comparative Analysis of<br />
<strong>Classification</strong> <strong>Techniques</strong>”, September 2003.<br />
80) D. Michie et al., "Machine Learning, Neural and Statistical Classification", Ellis Horwood, New York, 1994.
81) A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, 1988.