
Pattern Classification

Chapter 2. Bayesian Decision Theory


2. Pattern Recognition: Statistical Decision Theory

• What is a pattern?
• In statistical pattern recognition, a pattern is a d-dimensional feature vector
  x = (x_1, x_2, …, x_d)^t


3. Pattern Recognition: Statistical Decision Theory

• The sea bass / salmon example
• State of nature, prior
• The state of nature is a random variable
• The catch of salmon and sea bass is equiprobable:
  P(ω_1) = P(ω_2)  (prior)
  P(ω_1) + P(ω_2) = 1  (exclusivity and exhaustivity)


4. Pattern Recognition: Statistical Decision Theory

• Decision rule with only the prior information:
  decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2
• Use of the class-conditional information:
  P(x | ω_1) and P(x | ω_2) describe the difference in lightness between the populations of sea bass and salmon


5. Pattern Recognition: Statistical Decision Theory



6. Pattern Recognition: Statistical Decision Theory

• Posterior, likelihood, evidence:

  P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)

  where, in the case of two categories,

  P(x) = Σ_{j=1}^{2} P(x | ω_j) P(ω_j)

  Posterior = (Likelihood × Prior) / Evidence
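As a minimal numerical sketch of this relation (not part of the original slides), the snippet below evaluates the two posteriors from assumed Gaussian lightness models and equal priors; all distribution parameters are illustrative assumptions.

```python
from scipy.stats import norm

# Assumed, illustrative class-conditional lightness models:
# p(x | w1) for sea bass, p(x | w2) for salmon.
priors = {"w1": 0.5, "w2": 0.5}                  # equiprobable catch: P(w1) = P(w2)
likelihoods = {"w1": norm(loc=4.0, scale=1.0),   # parameters are assumptions for illustration
               "w2": norm(loc=7.0, scale=1.5)}

def posteriors(x):
    """P(w_j | x) = p(x | w_j) P(w_j) / p(x), with p(x) = sum_j p(x | w_j) P(w_j)."""
    joint = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}   # likelihood * prior
    evidence = sum(joint.values())                                   # p(x), the evidence
    return {w: joint[w] / evidence for w in joint}

print(posteriors(5.0))   # the two posteriors sum to 1
```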


7. Pattern Recognition: Statistical Decision Theory


8. Pattern Recognition: Statistical Decision Theory

• Decision given the posterior probabilities
  x is an observation for which:
  if P(ω_1 | x) > P(ω_2 | x), the true state of nature is ω_1
  if P(ω_1 | x) < P(ω_2 | x), the true state of nature is ω_2
• Therefore, whenever we observe a particular x, the probability of error is:
  P(error | x) = P(ω_1 | x) if we decide ω_2
  P(error | x) = P(ω_2 | x) if we decide ω_1


9. Pattern Recognition: Statistical Decision Theory

• Minimizing the probability of error:
  decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2
• Therefore:
  P(error | x) = min [P(ω_1 | x), P(ω_2 | x)]
  (Bayes decision)
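Building on the posterior sketch above, the Bayes decision and its conditional probability of error can be read off directly (a continuation of the same assumed models):

```python
def bayes_decide(x):
    """Decide the class with the larger posterior; P(error | x) is the smaller posterior."""
    post = posteriors(x)                    # from the earlier sketch
    decision = max(post, key=post.get)      # argmax_j P(w_j | x)
    p_error = min(post.values())            # P(error | x) = min[P(w1 | x), P(w2 | x)]
    return decision, p_error

print(bayes_decide(5.0))
```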


10. Pattern Recognition: Probability of Error


11. Pattern Recognition: Statistical Decision Theory

Generalization of the preceding ideas:
– Use of more than one feature
– Use of more than two states of nature
– Allowing actions other than merely deciding on the state of nature
– Introducing a loss function that is more general than the probability of error


12. Pattern Recognition: Statistical Decision Theory

• Allowing actions other than classification primarily allows the possibility of rejection: refusing to make a decision in close or bad cases
• The loss function states how costly each action taken is


13. Pattern Recognition: Statistical Decision Theory

• Let {ω_1, ω_2, …, ω_c} be the set of c states of nature ("categories")
• Let {α_1, α_2, …, α_a} be the set of a possible actions
• Let λ(α_i | ω_j) be the loss incurred for taking action α_i when the state of nature is ω_j


14. Pattern Recognition: Statistical Decision Theory

• Overall risk R = sum of all R(α_i | x) for i = 1, …, a
• Minimizing R(α_i | x) for i = 1, …, a:

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x),   for i = 1, …, a
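To illustrate this step (not from the slides), the sketch below evaluates R(α_i | x) for every action under an assumed 2×2 loss matrix and example posteriors, then picks the minimum-risk action.

```python
import numpy as np

# Assumed loss matrix: lam[i, j] = loss for taking action a_i when the true state is w_j.
lam = np.array([[0.0, 2.0],    # a_1: decide w1 (costly if the fish is really w2)
                [1.0, 0.0]])   # a_2: decide w2

def conditional_risks(post):
    """R(a_i | x) = sum_j lam[i, j] * P(w_j | x), for every action a_i."""
    return lam @ post

post = np.array([0.7, 0.3])             # example posteriors P(w1 | x), P(w2 | x)
risks = conditional_risks(post)
best = int(np.argmin(risks))            # Bayes rule: take the minimum-risk action
print(risks, "take action", best + 1)
```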


15. Pattern Recognition: Statistical Decision Theory

• Select the action α_i for which R(α_i | x) is minimum
• The resulting overall risk R is then minimum; this minimum R is called the Bayes risk, the best performance that can be achieved


16. Pattern Recognition: Statistical Decision Theory

• Two-category classification
  α_1: deciding ω_1
  α_2: deciding ω_2
  λ_ij = λ(α_i | ω_j): loss incurred for deciding ω_i when the true state of nature is ω_j
• Conditional risk:
  R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
  R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)


17. Pattern Recognition: Statistical Decision Theory

Our rule is the following:
  if R(α_1 | x) < R(α_2 | x), take action α_1: "decide ω_1"

This results in the equivalent rule: decide ω_1 if

  (λ_21 - λ_11) P(x | ω_1) P(ω_1) > (λ_12 - λ_22) P(x | ω_2) P(ω_2)

and decide ω_2 otherwise.


18. Pattern Recognition: Statistical Decision Theory

• Likelihood ratio:
  The preceding rule is equivalent to the following rule:

  if  P(x | ω_1) / P(x | ω_2)  >  [(λ_12 - λ_22) / (λ_21 - λ_11)] × [P(ω_2) / P(ω_1)]

• then take action α_1 (decide ω_1)
• otherwise take action α_2 (decide ω_2)
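The same rule, sketched with the illustrative densities, losses, and priors assumed in the earlier snippets:

```python
def likelihood_ratio_decide(x, lam11=0.0, lam12=2.0, lam21=1.0, lam22=0.0):
    """Decide w1 iff p(x | w1) / p(x | w2) > [(lam12 - lam22) / (lam21 - lam11)] * P(w2) / P(w1)."""
    ratio = likelihoods["w1"].pdf(x) / likelihoods["w2"].pdf(x)
    threshold = (lam12 - lam22) / (lam21 - lam11) * priors["w2"] / priors["w1"]
    return "w1" if ratio > threshold else "w2"

print(likelihood_ratio_decide(5.0))
```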


19. Pattern Recognition: Statistical Decision Theory

• Optimal decision property:
  "If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions"


20. Pattern Recognition: Minimum-Error-Rate Classification

• Actions are decisions on classes
• If action α_i is taken and the true state of nature is ω_j, then the decision is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error, i.e. the error rate


21. Pattern Recognition: Minimum-Error-Rate Classification

Introduction of the zero-one loss function:

  λ(α_i, ω_j) = 0 if i = j, and 1 if i ≠ j,   i, j = 1, …, c

• Therefore, the conditional risk is:

  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x) = Σ_{j≠i} P(ω_j | x) = 1 - P(ω_i | x)

• "The risk corresponding to this loss function is the average probability of error"


22. Pattern Recognition: Minimum-Error-Rate Classification

• Minimizing the risk requires maximizing P(ω_i | x) (since R(α_i | x) = 1 - P(ω_i | x))
• For minimum error rate:
  decide ω_i if P(ω_i | x) > P(ω_j | x) ∀ j ≠ i


23. Pattern Recognition: Minimum-Error-Rate Classification

• Investigate the loss function:

  Let θ_λ = [(λ_12 - λ_22) / (λ_21 - λ_11)] × [P(ω_2) / P(ω_1)];
  then decide ω_1 if  P(x | ω_1) / P(x | ω_2) > θ_λ

  If λ is the zero-one loss function, i.e.

    λ = ( 0  1
          1  0 )

  then θ_λ = P(ω_2) / P(ω_1) = θ_a

  If instead

    λ = ( 0  2
          1  0 )

  then θ_λ = 2 P(ω_2) / P(ω_1) = θ_b
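A small worked check (illustrative numbers only): with P(ω_1) = 2/3 and P(ω_2) = 1/3, the zero-one loss gives θ_a = P(ω_2)/P(ω_1) = 0.5, while doubling the cost of misclassifying an ω_2 pattern gives θ_b = 2 P(ω_2)/P(ω_1) = 1.0, so a larger range of likelihood ratios is assigned to ω_2.

```python
def theta(lam, p1, p2):
    """theta_lambda = (lam12 - lam22) / (lam21 - lam11) * P(w2) / P(w1)."""
    (lam11, lam12), (lam21, lam22) = lam
    return (lam12 - lam22) / (lam21 - lam11) * p2 / p1

p1, p2 = 2 / 3, 1 / 3                        # assumed priors
print(theta([[0, 1], [1, 0]], p1, p2))       # zero-one loss: theta_a = 0.5
print(theta([[0, 2], [1, 0]], p1, p2))       # doubled cost for w2 errors: theta_b = 1.0
```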


24. Pattern Recognition: Decision Regions – Effect of Loss Functions


25. Pattern Recognition: Classifiers, Discriminant Functions and Decision Surfaces

The multicategory case:
• Set of discriminant functions g_i(x), i = 1, …, c
• The classifier assigns a feature vector x to class ω_i if:
  g_i(x) > g_j(x) ∀ j ≠ i


26. Pattern Recognition: Classifiers, Discriminant Functions and Decision Surfaces

• Let g_i(x) = -R(α_i | x)
  (maximum discriminant corresponds to minimum risk)
• For the minimum error rate, we take g_i(x) = P(ω_i | x)
  (maximum discriminant corresponds to maximum posterior)
  In this case, we can also write:
  g_i(x) = P(x | ω_i) P(ω_i), or
  g_i(x) = ln P(x | ω_i) + ln P(ω_i)   (ln: natural logarithm)


27. Pattern Recognition: General Statistical Classifier


28. Pattern Recognition: Classifiers, Discriminant Functions and Decision Surfaces

• Feature space divided into c decision regions:
  if g_i(x) > g_j(x) ∀ j ≠ i, then x is in R_i
  (R_i means: assign x to ω_i)
• The two-category case:
  A classifier is a dichotomizer with two discriminant functions g_1 and g_2
  Let g(x) ≡ g_1(x) - g_2(x)
  Decide ω_1 if g(x) > 0; otherwise decide ω_2


29. Pattern Recognition: Classifiers, Discriminant Functions and Decision Surfaces

• The computation of g(x):

  g(x) = P(ω_1 | x) - P(ω_2 | x)

  or, equivalently for the decision,

  g(x) = ln [P(x | ω_1) / P(x | ω_2)] + ln [P(ω_1) / P(ω_2)]
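Continuing the illustrative two-class models from the earlier sketches, the log-ratio form of the dichotomizer can be written directly:

```python
import math

def g(x):
    """g(x) = ln[p(x | w1) / p(x | w2)] + ln[P(w1) / P(w2)]; decide w1 if g(x) > 0."""
    return (math.log(likelihoods["w1"].pdf(x) / likelihoods["w2"].pdf(x))
            + math.log(priors["w1"] / priors["w2"]))

x = 5.0
print("decide w1" if g(x) > 0 else "decide w2")
```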


30. Pattern Recognition: Classifiers, Discriminant Functions and Decision Surfaces


31. Pattern Recognition: The Normal Density

Univariate density:
– A density which is analytically tractable
– Continuous density
– A lot of processes are asymptotically Gaussian
– Handwritten characters and speech sounds are examples of prototypes corrupted by a random process (central limit theorem)

  p(x) = (1 / (σ √(2π))) exp[ -½ ((x - µ) / σ)² ]

  where: µ = mean (or expected value) of x
         σ² = the variance of x
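A direct transcription of the univariate density as a sketch (the test values are arbitrary):

```python
import math

def univariate_normal_pdf(x, mu, sigma):
    """p(x) = 1 / (sigma * sqrt(2*pi)) * exp(-0.5 * ((x - mu) / sigma)**2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

print(univariate_normal_pdf(0.0, mu=0.0, sigma=1.0))   # peak value, about 0.3989
```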


32. Pattern Recognition: The Normal Density


33. Pattern Recognition: The Normal Density

• Multivariate density
• The multivariate normal density in d dimensions is:

  p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ -½ (x - µ)^t Σ^{-1} (x - µ) ]

  where: x = (x_1, x_2, …, x_d)^t
         µ = (µ_1, µ_2, …, µ_d)^t is the mean vector
         Σ is the d×d covariance matrix
         |Σ| and Σ^{-1} are its determinant and inverse, respectively
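A sketch of the multivariate density with an assumed 2-D mean and covariance (numpy is used for the matrix algebra):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """p(x) = exp(-0.5 * (x - mu)^t Sigma^{-1} (x - mu)) / ((2*pi)^{d/2} |Sigma|^{1/2})."""
    x, mu, Sigma = np.asarray(x, float), np.asarray(mu, float), np.asarray(Sigma, float)
    d = mu.size
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff                    # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])                       # assumed covariance matrix
print(multivariate_normal_pdf([1.0, 0.5], mu=[0.0, 0.0], Sigma=Sigma))
```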


34. Pattern Recognition: Discriminant Functions for the Normal Density

• We saw that minimum-error-rate classification can be achieved by the discriminant function
  g_i(x) = ln P(x | ω_i) + ln P(ω_i)
• Case of the multivariate normal distribution:

  g_i(x) = -½ (x - µ_i)^t Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - ½ ln |Σ_i| + ln P(ω_i)


35. Pattern Recognition: Discrimination and Classification for Different Cases

Case Σ_i = σ² I:

  g_i(x) = w_i^t x + w_i0   (linear discriminant function)

  where:
  w_i = µ_i / σ²
  w_i0 = -(1 / (2σ²)) µ_i^t µ_i + ln P(ω_i)
  (w_i0 is called the threshold for the i-th category)
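A sketch of the resulting linear machine for this case, with assumed class means, a common σ², and equal priors:

```python
import numpy as np

def linear_discriminant(mu_i, sigma2, prior_i):
    """Return (w_i, w_i0) with w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)."""
    mu_i = np.asarray(mu_i, float)
    return mu_i / sigma2, -mu_i @ mu_i / (2 * sigma2) + np.log(prior_i)

def classify(x, params):
    """Assign x to the class whose g_i(x) = w_i^t x + w_i0 is largest."""
    x = np.asarray(x, float)
    return int(np.argmax([w @ x + w0 for (w, w0) in params]))

params = [linear_discriminant([0.0, 0.0], 1.0, 0.5),   # assumed means, sigma^2, priors
          linear_discriminant([3.0, 3.0], 1.0, 0.5)]
print(classify([1.0, 1.2], params))                    # index of the winning class
```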


36. Pattern Recognition: Discrimination and Classification for Different Cases

– A classifier that uses linear discriminant functions is called "a linear machine"
– The decision surfaces for a linear machine are pieces of hyperplanes defined by: g_i(x) = g_j(x)


37. Pattern Recognition: Equal Covariances


38. Pattern Recognition: Discrimination and Classification for Different Cases

• The hyperplane separating R_i and R_j passes through the point:

  x_0 = ½ (µ_i + µ_j) - [σ² / ||µ_i - µ_j||²] ln [P(ω_i) / P(ω_j)] (µ_i - µ_j)

• It is always orthogonal to the line linking the means

  If P(ω_i) = P(ω_j), then x_0 = ½ (µ_i + µ_j)


39. Pattern Recognition: Shift in Priors


40. Pattern Recognition: Shift in Priors


41. Pattern Recognition: Discrimination and Classification for Different Cases

Case Σ_i = Σ (the covariance matrices of all classes are identical but arbitrary):

The hyperplane separating R_i and R_j passes through the point:

  x_0 = ½ (µ_i + µ_j) - [ln (P(ω_i) / P(ω_j)) / ((µ_i - µ_j)^t Σ^{-1} (µ_i - µ_j))] (µ_i - µ_j)

(The hyperplane separating R_i and R_j is generally not orthogonal to the line between the means.)


42. Pattern Recognition: Decision Surfaces


43. Pattern Recognition: Decision Surfaces


44. Pattern Recognition: Discrimination and Classification for Different Cases

Case Σ_i = arbitrary:
– The covariance matrices are different for each category

  g_i(x) = x^t W_i x + w_i^t x + w_i0

  where:
  W_i = -½ Σ_i^{-1}
  w_i = Σ_i^{-1} µ_i
  w_i0 = -½ µ_i^t Σ_i^{-1} µ_i - ½ ln |Σ_i| + ln P(ω_i)

(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
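A sketch of the general quadratic discriminant for this case, again with assumed means, covariances, and priors:

```python
import numpy as np

def quadratic_discriminant(mu_i, Sigma_i, prior_i):
    """Return (W_i, w_i, w_i0) for g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    mu_i, Sigma_i = np.asarray(mu_i, float), np.asarray(Sigma_i, float)
    Sigma_inv = np.linalg.inv(Sigma_i)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu_i
    w0 = (-0.5 * mu_i @ Sigma_inv @ mu_i
          - 0.5 * np.log(np.linalg.det(Sigma_i))
          + np.log(prior_i))
    return W, w, w0

def g_i(x, W, w, w0):
    x = np.asarray(x, float)
    return x @ W @ x + w @ x + w0

# Assumed parameters for two classes with different covariance matrices.
params = [quadratic_discriminant([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
          quadratic_discriminant([2.0, 2.0], [[2.0, 0.5], [0.5, 1.5]], 0.5)]
scores = [g_i([1.0, 1.0], *p) for p in params]
print(int(np.argmax(scores)))                          # index of the winning class
```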


45. Pattern Recognition: Decision Boundaries


46. Pattern Recognition: Decision Boundaries


47. Pattern Recognition: Decision Boundaries


48. Pattern Recognition: Bayesian Decision Theory – Discrete Features

• The components of x are binary or integer valued; x can take only one of m discrete values v_1, v_2, …, v_m
• Case of independent binary features in a two-category problem:
  Let x = (x_1, x_2, …, x_d)^t where each x_i is either 0 or 1, with probabilities:
  p_i = P(x_i = 1 | ω_1)
  q_i = P(x_i = 1 | ω_2)


49. Pattern Recognition: Bayesian Decision Theory – Discrete Features

The discriminant function in this case is:

  g(x) = Σ_{i=1}^{d} w_i x_i + w_0

  where

  w_i = ln [ p_i (1 - q_i) / (q_i (1 - p_i)) ],   i = 1, …, d

  and

  w_0 = Σ_{i=1}^{d} ln [(1 - p_i) / (1 - q_i)] + ln [P(ω_1) / P(ω_2)]

  Decide ω_1 if g(x) > 0 and ω_2 if g(x) ≤ 0.
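A sketch of this discriminant for independent binary features; the per-feature probabilities p_i, q_i and the priors below are assumptions for illustration.

```python
import numpy as np

def binary_feature_discriminant(p, q, prior1, prior2):
    """w_i = ln[p_i (1-q_i) / (q_i (1-p_i))], w_0 = sum_i ln[(1-p_i)/(1-q_i)] + ln[P(w1)/P(w2)]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    w = np.log(p * (1 - q) / (q * (1 - p)))
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)
    return w, w0

# Assumed feature probabilities p_i = P(x_i = 1 | w1) and q_i = P(x_i = 1 | w2).
w, w0 = binary_feature_discriminant(p=[0.8, 0.7, 0.6], q=[0.3, 0.4, 0.5],
                                    prior1=0.5, prior2=0.5)
x = np.array([1, 0, 1])                                  # an observed binary feature vector
gx = w @ x + w0
print("decide w1" if gx > 0 else "decide w2")
```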
