
http://www.survey.ntua.gr/main/labs/rsens/DeCETI/IRIT/MSI-FUSION/index.html

Review of Probability Theory

1. Basic Notions

Event

The sample space, S, is the set of all possible outcomes of a random experiment. A subset of the sample space is called an event. An event consisting of a single element of the sample space is called an elementary event.

Example:

• Random experiment: tossing a coin
• Sample space: {head, tail}
• An elementary event: {tail}

Probability

If the sample space consists of N outcomes and the probabilities p of the elementary events are all equal, then

p = 1 / N

An event A consisting of r elements then has the probability

P(A) = r / N

For the previous example (N = 2), we get

p = P({head}) = P({tail}) = 1 / 2
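
To make the counting rule concrete, here is a minimal Python sketch that applies p = 1/N and P(A) = r/N to a hypothetical die-rolling experiment (the die and the event chosen are illustrative, not part of the example above):

```python
# Counting rule for equally likely outcomes: p = 1/N and P(A) = r/N.
# The die example is an illustrative assumption, not from the original text.

sample_space = {1, 2, 3, 4, 5, 6}               # N = 6 equally likely outcomes
event_even = {2, 4, 6}                          # event A with r = 3 elements

p_elementary = 1 / len(sample_space)            # p = 1 / N
p_event = len(event_even) / len(sample_space)   # P(A) = r / N

print(p_elementary)   # 0.1666...
print(p_event)        # 0.5
```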

Random Variable and Probability Distribution

A random variable X is a number assigned to every outcome of an experiment.

The probability of the event X = x is denoted as P_X(x) or P_X(X = x).

P_X(x) is often called the probability mass function (or density function) or the probability distribution of X.

It follows that 0 ≤ P_X(x) ≤ 1 and Σ_x P_X(x) = 1.

Example:

Outcome   Random variable X   P_X(X = x)
head      0                   P_X(0) = 1/2
tail      1                   P_X(1) = 1/2

Expectation

1. The expectation (alias mean value, average) of a random variable X is defined as

E[X] = Σ_x x · P_X(x)

Example: for the coin toss above,

E[X] = 0 · (1/2) + 1 · (1/2) = 1/2

2. The expectation of a function F = F(X) is defined as

E[F(X)] = Σ_x F(x) · P_X(x)

Example: for the coin toss above, we get

E[F(X)] = F(0) · (1/2) + F(1) · (1/2)
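
The coin-toss example can be checked with a short Python sketch; the function F(x) = x² used at the end is only an illustrative choice, since the original example does not specify F:

```python
# Coin-toss example: probability mass function P_X and the expectations
# E[X] and E[F(X)] computed as finite sums.

P_X = {0: 0.5, 1: 0.5}   # X = 0 for head, X = 1 for tail

# The probabilities are non-negative and sum to one.
assert all(p >= 0 for p in P_X.values())
assert abs(sum(P_X.values()) - 1.0) < 1e-12

# E[X] = sum over x of x * P_X(x)
E_X = sum(x * p for x, p in P_X.items())
print(E_X)   # 0.5

# E[F(X)] = sum over x of F(x) * P_X(x); F(x) = x**2 is chosen only for illustration
F = lambda x: x ** 2
E_F = sum(F(x) * p for x, p in P_X.items())
print(E_F)   # 0.5
```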


4. Bayes' Decision Theory

Bayes' decision theory is presented in the book [Duda and Hart, 1973], from which we summarize the following fundamentals.

Bayes' decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known.

The aim of the decision is to assign an object to a class. This classification is carried out solely by means of measurements taken on the objects.

Example

Let us reconsider the problem, posed in the previous pages, of designing a classifier to separate two kinds of marble pieces, Carrara marble and Thassos marble.

Suppose that an observer, watching marble emerge from the mill, finds it so hard to predict what type will emerge next that the sequence of types of marble appears to be random.

Using decision-theoretic terminology, we say that as each piece of marble emerges, nature is in one or the other of two possible states: either the marble is Carrara marble, or Thassos marble. We let ω denote the state of nature, with:

• ω = ω_1 for Carrara,
• ω = ω_2 for Thassos.

Because the state of nature is so unpredictable, we consider ω to be a random variable.

If the mill produced as much Carrara marble as Thassos marble, we would say that the next piece of marble is equally likely to be Carrara marble or Thassos marble. More generally, we assume that there is some a priori probability P(ω_1) that the next piece is Carrara marble, and some a priori probability P(ω_2) that the next piece is Thassos marble. These a priori probabilities reflect our prior knowledge of how likely we are to see Carrara marble or Thassos marble before the marble actually appears. P(ω_1) and P(ω_2) are non-negative and sum to one.

Now, if we have to make a decision about the type of marble that will appear next, without being allowed to see it, the only information we are allowed to use is the value of the a priori probabilities. It seems reasonable to use the following decision rule:

Decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2.

In this situation, we always make the same decision, even though we know that both types of marble will appear. How well the rule works depends upon the values of the a priori probabilities:

• If P(ω_1) is very much greater than P(ω_2), our decision in favour of ω_1 will be right most of the time.
• If P(ω_1) = P(ω_2), we have only a fifty-fifty chance of being right.

In general, the probability of error is the smaller of P(ω_1) and P(ω_2).
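
A minimal sketch of this prior-only rule, using assumed numerical priors (the values 0.8 and 0.2 are illustrative, not taken from the text):

```python
# Prior-only decision rule: always decide the state of nature with the larger
# a priori probability. The numerical priors are assumptions for illustration.

P = {"omega_1 (Carrara)": 0.8, "omega_2 (Thassos)": 0.2}

decision = max(P, key=P.get)   # decide omega_1 if P(omega_1) > P(omega_2)
p_error = min(P.values())      # probability of error = the smaller prior

print(decision)   # omega_1 (Carrara)
print(p_error)    # 0.2
```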

In most circumstances, we do not have to make decisions with so little evidence. In our example, we can use the brightness measurement x as evidence, since Thassos marble is lighter than Carrara marble. Different samples of marble will yield different brightness readings, and it is natural to express this variability in probabilistic terms; we consider x as a continuous random variable whose distribution depends on the state of nature.

Let p(x | ω_j) be the state-conditional probability density function for x, i.e. the probability density function for x given that the state of nature is ω_j. Then the difference between p(x | ω_1) and p(x | ω_2) describes the difference in brightness between Carrara and Thassos marble (see Figure 16).

Figure 16: Hypothetical class-conditional probability density functions.

Suppose that we know both the a priori probabilities P(ω_j) and the conditional densities p(x | ω_j). Suppose further that we measure the brightness of a piece of marble and discover the value of x. How does this measurement influence our attitude concerning the true state of nature? The answer to this question is provided by Bayes' rule:

P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)    (3)

where

p(x) = Σ_j p(x | ω_j) P(ω_j)    (4)

Bayes' rule shows how the observation of the value x changes the a priori probability P(ω_j) into the a posteriori probability P(ω_j | x). The variation of P(ω_j | x) with x is illustrated in Figure 17.

Figure 17: A posteriori probabilities.
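
The following sketch evaluates equations (3) and (4) numerically. The Gaussian class-conditional densities and the priors stand in for the hypothetical curves of Figure 16; their parameters are assumptions made only for illustration:

```python
# Equations (3) and (4) for the two-state marble example.
# Densities and priors are hypothetical stand-ins, not values from the text.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

priors = {1: 0.6, 2: 0.4}                               # P(omega_1), P(omega_2)
likelihood = {                                          # p(x | omega_j)
    1: lambda x: gaussian_pdf(x, mean=4.0, std=1.0),    # Carrara: darker
    2: lambda x: gaussian_pdf(x, mean=7.0, std=1.0),    # Thassos: lighter
}

def posteriors(x):
    # Equation (4): p(x) = sum over j of p(x | omega_j) P(omega_j)
    p_x = sum(likelihood[j](x) * priors[j] for j in priors)
    # Equation (3): P(omega_j | x) = p(x | omega_j) P(omega_j) / p(x)
    return {j: likelihood[j](x) * priors[j] / p_x for j in priors}

post = posteriors(5.2)
print(post)                                    # posteriors for a brightness of 5.2
print(abs(sum(post.values()) - 1.0) < 1e-12)   # True: posteriors sum to one
```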

If we have an observation x for which P(ω_1 | x) is greater than P(ω_2 | x), we would be naturally inclined to decide that the true state of nature is ω_1. Similarly, if P(ω_2 | x) is greater than P(ω_1 | x), we would be naturally inclined to choose ω_2.

To justify this procedure, let us calculate the probability of error whenever we make a decision. Whenever we observe a particular x,

P(error | x) = P(ω_1 | x) if we decide ω_2, and P(ω_2 | x) if we decide ω_1.

Clearly, in every instance in which we observe the same value for x, we can minimize the probability of error by deciding:

• ω_1 if P(ω_1 | x) > P(ω_2 | x), and
• ω_2 if P(ω_2 | x) > P(ω_1 | x).

Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

P(error) = ∫ P(error | x) p(x) dx

and if, for every x, P(error | x) is as small as possible, the integral must be as small as possible.

Thus, we have justified the following Bayes' decision rule for minimizing the probability of error:

Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2.

This form of the decision rule emphasizes the role of the a posteriori probabilities. By using equation 3, we can express the rule in terms of conditional and a priori probabilities.

Note that p(x) in equation 3 is unimportant as far as making a decision is concerned. It is basically just a scale factor that assures us that

Σ_j P(ω_j | x) = 1

By eliminating this scale factor, we obtain the following completely equivalent decision rule:

Decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2); otherwise decide ω_2.
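
A short, self-contained sketch can confirm that the two forms of the rule always agree, since eliminating p(x) only removes a positive scale factor (the densities and priors are again illustrative assumptions):

```python
# Comparing the posterior form and the scale-factor-free form of the
# minimum-error decision rule. Densities and priors are hypothetical.
import math

def pdf(x, mean, std):   # hypothetical Gaussian class-conditional density
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

priors = {1: 0.6, 2: 0.4}
lik = {1: lambda x: pdf(x, 4.0, 1.0), 2: lambda x: pdf(x, 7.0, 1.0)}

def decide_posterior(x):
    # Decide omega_1 if P(omega_1 | x) > P(omega_2 | x).
    p_x = sum(lik[j](x) * priors[j] for j in priors)          # equation (4)
    post = {j: lik[j](x) * priors[j] / p_x for j in priors}   # equation (3)
    return 1 if post[1] > post[2] else 2

def decide_unnormalized(x):
    # Equivalent rule: decide omega_1 if p(x | omega_1) P(omega_1) > p(x | omega_2) P(omega_2).
    return 1 if lik[1](x) * priors[1] > lik[2](x) * priors[2] else 2

# p(x) is a positive scale factor, so the two rules agree for every x.
xs = [i / 10 for i in range(0, 121)]
print(all(decide_posterior(x) == decide_unnormalized(x) for x in xs))   # True
```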


Some additional insight can be obtained by considering a few special cases:

• If, for some x, p(x | ω_1) = p(x | ω_2), then that particular observation gives us no information about the state of nature; in this case, the decision hinges entirely on the a priori probabilities.
• On the other hand, if P(ω_1) = P(ω_2), then the states of nature are equally likely a priori; in this case the decision is based entirely on p(x | ω_j), the likelihood of ω_j with respect to x.

In general, both of these factors are important in making a decision, and Bayes' decision rule combines them to achieve the minimum probability of error.

3. Classifier

Consider a set E of elements. With Bayes' decision rule, it is possible to divide the elements of E into p classes C_1, C_2, …, C_p, from n discriminant attributes A_1, A_2, …, A_n. We must already have examples of each class in order to choose typical values for the attributes of each class.

The probability of meeting an element x ∈ E having attribute A_i, given that we consider class C_l, will be denoted by p_Ai(x | C_l).

If we put all these probabilities together for each attribute, we obtain the global probability of meeting element x, given that the class is C_l:

p(x | C_l)    (5)

Classification must allow the class of an unknown element x to be decided with the lowest risk of error. The decision in Bayes' theory chooses the class C_l for which the a posteriori membership probability p(C_l | x) is the highest:

C_l = arg max_k p(C_k | x)    (6)

According to Bayes' rule, the a posteriori probability of membership p(C_l | x) is calculated from the a priori probability of membership of element x to class C_l:

p(C_l | x) = p(x | C_l) P(C_l) / p(x)    (7)

The denominator p(x) is a normalization factor. It ensures that the probabilities p(C_l | x) sum to 1 when l varies.

Some classes appear more frequently than others, and P(C_l) denotes the a priori probability of meeting class C_l. p(x | C_l) denotes the conditional probability of meeting the element x, given that we focus on class C_l (given that class C_l is true).
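
A small sketch of equations (6) and (7) with three classes; the priors P(C_l) and the conditional probabilities p(x | C_l) are invented values used only to show the computation:

```python
# Equations (6) and (7): a posteriori membership probabilities p(C_l | x) from
# assumed priors P(C_l) and assumed conditional probabilities p(x | C_l),
# followed by the arg-max decision. All numbers are illustrative assumptions.

priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}     # P(C_l)
cond = {"C1": 0.10, "C2": 0.40, "C3": 0.25}    # p(x | C_l) for one observed element x

p_x = sum(cond[c] * priors[c] for c in priors)                # normalization factor p(x)
posterior = {c: cond[c] * priors[c] / p_x for c in priors}    # equation (7)

print(posterior)                             # the posteriors sum to 1 over the classes
chosen = max(posterior, key=posterior.get)   # equation (6): arg max of the posterior
print(chosen)                                # C2 for these numbers
```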

Parameters

The use of a Bayesian classifier implies that we know:

• the a priori probabilities P(C_l) that class C_l appears;
• and the conditional probability p(x | C_l) of being in the presence of element x, given that the observed class is C_l.

1. A priori probability P(C_l)

If the examples used to train the system to recognize each class are sufficiently numerous, then the a priori probability P(C_l) can be estimated as the frequency of appearance of this class in comparison with the other classes. This is the approach most often observed when the system is taught to recognize classes from examples.

2. Conditional probability p(x | C_l)

The estimation of the conditional probability p(x | C_l) represents the main problem. It is very difficult because it requires the estimation of the conditional probabilities for all the possible combinations of elements, given one particular class.

In reality, it is impossible to calculate these estimates. Some simplifying assumptions are often used in order to make the training of the system feasible. The assumption most frequently used is that of conditional independence, which states that the probability of two elements x_1 and x_2, given that the class is C_l, is the product of the probabilities of each element taken separately, given that the class is C_l:

p(x_1, x_2 | C_l) = p(x_1 | C_l) · p(x_2 | C_l)    (8)

With such an assumption, the conditional probability p(x | C_l) of an element x according to its attributes A_i becomes:

p(x | C_l) = Π_{i=1..n} p_Ai(x | C_l)    (9)

Bayesian decision rule

This leads to the following Bayesian decision rule, where C_l represents the class selected:

C_l = arg max_{k=1..p} [ P(C_k) · Π_{i=1..n} p_Ai(x | C_k) ]    (10)

p_Ai(x | C_l) represents the proportion of examples taking the same value as element x for attribute A_i among all examples belonging to class C_l.

The product of these probabilities, proposed by each attribute A_i for element x, is the basis for the information fusion in this Bayes rule.
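
Putting the pieces together, the following sketch implements decision rule (10) in the way described above: the priors P(C_l) are estimated as class frequencies and p_Ai(x | C_l) as the proportion of class examples sharing the attribute value. The tiny training set is invented purely for illustration:

```python
# Naive Bayes sketch of decision rule (10). Priors P(C_l) are estimated as
# class frequencies; p_Ai(x | C_l) as the proportion of training examples of
# class C_l with the same value for attribute A_i. The training data below
# are hypothetical.
from collections import Counter, defaultdict

# Each example: (tuple of attribute values A_1..A_n, class label)
training = [
    (("white", "fine"), "Thassos"),
    (("white", "coarse"), "Thassos"),
    (("grey", "fine"), "Carrara"),
    (("grey", "coarse"), "Carrara"),
    (("white", "fine"), "Carrara"),
]

class_counts = Counter(label for _, label in training)
n_total = len(training)

# attribute_counts[class][i][value] = number of class examples with that value for A_i
attribute_counts = defaultdict(lambda: defaultdict(Counter))
for values, label in training:
    for i, v in enumerate(values):
        attribute_counts[label][i][v] += 1

def p_attribute(value, i, label):
    """p_Ai(x | C_l): proportion of class-C_l examples with this value of A_i."""
    return attribute_counts[label][i][value] / class_counts[label]

def classify(values):
    """Decision rule (10): arg max over classes of P(C_l) * prod_i p_Ai(x | C_l)."""
    scores = {}
    for label, count in class_counts.items():
        score = count / n_total                 # prior P(C_l) as class frequency
        for i, v in enumerate(values):
            score *= p_attribute(v, i, label)   # fuse the attribute contributions
        scores[label] = score
    return max(scores, key=scores.get), scores

print(classify(("white", "fine")))   # the class with the largest fused score wins
```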
