Source: http://www.survey.ntua.gr/main/labs/rsens/DeCETI/IRIT/MSI-FUSION/index.html

Review of Probability Theory
1. Basic Notions

Event

The sample space, S, is the set of all possible outcomes of a random experiment. A subset of the sample space is called an event. An event consisting of a single element of the sample space is called an elementary event.

Example:
• Random experiment: tossing a coin
• Sample space: {head, tail}
• An elementary event: {tail}
Probability

If the sample space consists of N outcomes and the probabilities p of the elementary events are all equal, then

p = 1 / N

An event A consisting of r elements then has probability

P(A) = r / N

For the previous example, N = 2, so we get p = 1/2 for each elementary event.
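The counting rule above can be sketched in a few lines of Python; the sample space and event are taken from the coin-toss example, and the helper function name is ours, not from the text:

```python
# Probability of an event under equally likely outcomes: P(A) = r / N
sample_space = {"head", "tail"}   # N = 2 outcomes
event = {"tail"}                  # an elementary event, r = 1

def probability(event, sample_space):
    """P(A) = r / N for a finite sample space with equally likely outcomes."""
    return len(event) / len(sample_space)

print(probability(event, sample_space))  # 0.5
```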
Random Variable and Probability Distribution

A random variable X is a number assigned to every outcome of an experiment. The probability of the event X=x is denoted as P_X(x) or P_X(X=x). P_X(x) is often called the probability mass function (or, for continuous variables, the probability density function) or the probability distribution for X.

It follows that 0 ≤ P_X(x) ≤ 1 and Σ_x P_X(x) = 1.
Example:

Outcome   Random variable X   P_X(X=x)
head      0                   P_X(0) = 1/2
tail      1                   P_X(1) = 1/2
Expectation

1. The expectation (alias mean value, average) of a random variable X is defined as

E[X] = Σ_x x · P_X(x)

Example: for the coin toss above, E[X] = 0 · 1/2 + 1 · 1/2 = 1/2.

2. The expectation of a function F = F(X) is defined as

E[F(X)] = Σ_x F(x) · P_X(x)

Example: for a function F of the coin-toss variable X, we get E[F(X)] = F(0) · 1/2 + F(1) · 1/2.
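The two expectation sums can be computed directly; this is a minimal sketch using the coin-toss distribution, with the function `expectation` being our own illustrative name:

```python
# Expectation of a discrete random variable: E[F(X)] = sum_x F(x) * P_X(x)
pmf = {0: 0.5, 1: 0.5}   # coin toss: X = 0 for head, X = 1 for tail

def expectation(pmf, f=lambda x: x):
    """With f = identity this is E[X]; with another f it is E[F(X)]."""
    return sum(f(x) * p for x, p in pmf.items())

print(expectation(pmf))                     # E[X] = 0.5
print(expectation(pmf, f=lambda x: x**2))   # E[X^2] = 0.5
```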
4. Bayes' Decision Theory

Bayes' decision theory was developed in the book [Duda and Hart, 1973], from which we summarize the following fundamentals.

Bayes' decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known.

The aim of the decision is to assign an object to a class. This classification is carried out only by means of measurements taken over the objects.
Example

Let us reconsider the problem, posed in the previous pages, of designing a classifier to separate two kinds of marble pieces, Carrara marble and Thassos marble. Suppose that an observer, watching marble emerge from the mill, finds it so hard to predict what type will emerge next that the sequence of types of marble appears to be random.
Using decision-theoretic terminology, we say that as each piece of marble emerges, nature is in one or the other of two possible states: either the marble is Carrara marble, or Thassos marble. We let ω denote the state of nature, with:
• ω = ω_1 for Carrara,
• ω = ω_2 for Thassos.
Because the state of nature is so unpredictable, we consider ω to be a random variable.
If the mill produced as much Carrara marble as Thassos marble, we would say that the next piece of marble is equally likely to be Carrara marble or Thassos marble. More generally, we assume that there is some a priori probability P(ω_1) that the next piece is Carrara marble, and some a priori probability P(ω_2) that the next piece is Thassos marble. These a priori probabilities reflect our prior knowledge of how likely we are to see Carrara marble or Thassos marble before the marble actually appears. P(ω_1) and P(ω_2) are non-negative and sum to one.
Now, if we have to make a decision about the type of marble that will appear next, without being allowed to see it, the only information we are allowed to use is the value of the a priori probabilities. It seems reasonable to use the following decision rule:

Decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2.

In this situation, we always make the same decision, even though we know that both types of marble will appear. How well it works depends upon the values of the a priori probabilities:
• If P(ω_1) is very much greater than P(ω_2), our decision in favour of ω_1 will be right most of the time.
• If P(ω_1) = P(ω_2), we have only a fifty-fifty chance of being right.
In general, the probability of error is the smaller of P(ω_1) and P(ω_2).
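The prior-only rule can be sketched as follows; the prior values chosen here are illustrative assumptions, not taken from the text:

```python
# Prior-only decision rule: always decide the class with the larger prior.
priors = {"omega1": 0.7, "omega2": 0.3}   # illustrative values of P(omega_1), P(omega_2)

def decide_from_priors(priors):
    """Decide omega_1 if P(omega_1) > P(omega_2); otherwise omega_2."""
    return max(priors, key=priors.get)

def error_rate(priors):
    """The probability of error is the smaller of the two priors."""
    return min(priors.values())

print(decide_from_priors(priors))   # omega1
print(error_rate(priors))           # 0.3
```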
In most circumstances, we do not have to make decisions with so little evidence. In our example, we can use the brightness measurement x as evidence, since Thassos marble is lighter than Carrara marble. Different samples of marble will yield different brightness readings, and it is natural to express this variability in probabilistic terms; we consider x as a continuous random variable whose distribution depends on the state of nature.

Let p(x | ω_j) be the state-conditional probability density function for x, i.e. the probability density function for x given that the state of nature is ω_j. Then, the difference between p(x | ω_1) and p(x | ω_2) describes the difference in brightness between Carrara and Thassos marble (see Figure 16).

Figure 16: Hypothetical class-conditional probability density functions.
Suppose that we know both the a priori probabilities P(ω_j) and the conditional densities p(x | ω_j). Suppose further that we measure the brightness of a piece of marble and discover the value of x. How does this measurement influence our attitude concerning the true state of nature? The answer to this question is provided by Bayes' rule:

P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)    (3)

where

p(x) = Σ_j p(x | ω_j) P(ω_j)    (4)
Bayes' rule shows how the observation of value x changes the a priori probability P(ω_j) into the a posteriori probability P(ω_j | x). The variation of P(ω_j | x) with x is illustrated in Figure 17 for a particular choice of priors.

Figure 17: A posteriori probabilities for ω_1 and ω_2.
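Equations 3 and 4 can be sketched in Python. The Gaussian class-conditional densities and the prior values below are illustrative assumptions standing in for the hypothetical densities of Figure 16:

```python
# Bayes' rule (equation 3): P(omega_j | x) = p(x | omega_j) P(omega_j) / p(x)
from math import exp, pi, sqrt

def gaussian(x, mean, std):
    """Illustrative class-conditional density p(x | omega_j)."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

def posteriors(x, priors, likelihoods):
    """Return P(omega_j | x) for every class j (equations 3 and 4)."""
    joint = {j: likelihoods[j](x) * priors[j] for j in priors}
    evidence = sum(joint.values())                 # p(x), equation 4
    return {j: joint[j] / evidence for j in joint}

priors = {"omega1": 2/3, "omega2": 1/3}
likelihoods = {"omega1": lambda x: gaussian(x, 10.0, 2.0),   # darker Carrara
               "omega2": lambda x: gaussian(x, 14.0, 2.0)}   # lighter Thassos
post = posteriors(12.0, priors, likelihoods)
print(post)   # the posteriors always sum to 1
```

At x = 12.0 both densities are equal, so the posterior ratio reduces to the prior ratio 2:1.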
If we have an observation x for which P(ω_1 | x) is greater than P(ω_2 | x), we would be naturally inclined to decide that the true state of nature is ω_1. Similarly, if P(ω_2 | x) is greater than P(ω_1 | x), we would be naturally inclined to choose ω_2.

To justify this procedure, let us calculate the probability of error whenever we make a decision. Whenever we observe a particular x,

P(error | x) = P(ω_1 | x) if we decide ω_2, and P(error | x) = P(ω_2 | x) if we decide ω_1.

Clearly, in every instance in which we observe the same value for x, we can minimize the probability of error by deciding:
• ω_1 if P(ω_1 | x) > P(ω_2 | x), and
• ω_2 if P(ω_2 | x) > P(ω_1 | x).
Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

P(error) = ∫ P(error | x) p(x) dx

and if, for every x, P(error | x) is as small as possible, the integral must be as small as possible. Thus, we have justified the following Bayes' decision rule for minimizing the probability of error:

Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2.
This form of the decision rule emphasizes the role of the a posteriori probabilities. By using equation 3, we can express the rule in terms of conditional and a priori probabilities. Note that p(x) in equation 3 is unimportant as far as making a decision is concerned. It is basically just a scale factor that assures us that

P(ω_1 | x) + P(ω_2 | x) = 1.

By eliminating this scale factor, we obtain the following completely equivalent decision rule:

Decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2); otherwise decide ω_2.
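This equivalent rule is easy to implement, since it avoids computing p(x). The triangular densities below are illustrative assumptions, not taken from the text:

```python
# Minimum-error decision rule: pick the class maximizing p(x | omega_j) * P(omega_j).
def decide(x, priors, densities):
    """Decide omega_1 if p(x|omega_1)P(omega_1) > p(x|omega_2)P(omega_2), else omega_2."""
    scores = {j: densities[j](x) * priors[j] for j in priors}
    return max(scores, key=scores.get)

priors = {"omega1": 0.5, "omega2": 0.5}
# Illustrative triangular densities: Carrara darker (peak 10), Thassos lighter (peak 14).
densities = {"omega1": lambda x: max(0.0, 1 - abs(x - 10) / 4) / 4,
             "omega2": lambda x: max(0.0, 1 - abs(x - 14) / 4) / 4}

print(decide(9.0, priors, densities))    # omega1
print(decide(15.0, priors, densities))   # omega2
```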
Some additional insight can be obtained by considering a few special cases:
• If for some x, p(x | ω_1) = p(x | ω_2), then that particular observation gives us no information about the state of nature; in this case, the decision hinges entirely on the a priori probabilities.
• On the other hand, if P(ω_1) = P(ω_2), then the states of nature are equally likely a priori; in this case the decision is based entirely on p(x | ω_j), the likelihood of ω_j with respect to x.
In general, both of these factors are important in making a decision, and the Bayes' decision rule combines them to achieve the minimum probability of error.
3. Classifier

Consider a set E of elements. With Bayes' decision rule, it is possible to divide the elements of E into p classes C_1, C_2, …, C_p, from n discriminant attributes A_1, A_2, …, A_n. We must already have examples of each class to choose typical values for the attributes of each class.

The probability of meeting element x ∈ E, having attribute A_i, given that we consider class C_l, will be denoted by p_Ai(x | C_l). If we put all these probabilities together for each attribute, we obtain the global probability of meeting element x, given that the class is C_l:

p(x | C_l)    (5)
Classification must allow the class of an unknown element x to be decided with the lowest risk of error. Decision in Bayes' theory chooses the class C_l for which the a posteriori membership probability p(C_l | x) is the highest:

C_l = argmax_k p(C_k | x)    (6)

According to Bayes' rule, the a posteriori probability of membership p(C_l | x) is calculated from the a priori probabilities of membership of element x to class C_l:

p(C_l | x) = p(x | C_l) P(C_l) / p(x)    (7)

The denominator p(x) is a normalization factor. It ensures that the sum of the probabilities p(C_l | x) is equal to 1 when l varies.
Some classes appear more frequently than others, and P(C_l) denotes the a priori probability of meeting class C_l. p(x | C_l) denotes the conditional probability of meeting the element x, given that we focus on class C_l (given that class C_l is true).
Parameters

The use of a Bayesian classifier implies that we know:
• the a priori probabilities P(C_l) that class C_l appears;
• and the conditional probability p(x | C_l) of being in the presence of element x, given that the observation class is C_l.

1. A priori probability P(C_l)

If the examples used to teach the system to recognize each class are sufficiently numerous, then the a priori probability P(C_l) can be estimated as the frequency of appearance of this class in comparison with the other classes. This is the approach most often observed when the system is taught to recognize classes from examples.
2. Conditional probability p(x | C_l)

The estimation of the conditional probability p(x | C_l) represents the main problem. It is very difficult because it requires the estimation of the conditional probabilities for all the possible combinations of elements, given one particular class. In practice, it is impossible to calculate these estimates. Some simplifying assumptions are often used in order to make the training of the system feasible. The assumption most frequently used is that of conditional independence, which states that the probability of two elements x_1 and x_2, given that the class is C_l, is the product of the probabilities of each element taken separately, given that the class is C_l:

p(x_1, x_2 | C_l) = p(x_1 | C_l) · p(x_2 | C_l)    (8)

With such an assumption, the conditional probability p(x | C_l) of an element x according to attributes A_i becomes:

p(x | C_l) = ∏_{i=1..n} p_Ai(x | C_l)    (9)
Bayesian decision rule

This leads to the following Bayesian decision rule, where C_l represents the class selected:

C_l = argmax_k [ P(C_k) · ∏_{i=1..n} p_Ai(x | C_k) ]    (10)

p_Ai(x | C_l) represents the proportion of examples taking the value of attribute A_i of element x among all examples belonging to class C_l. The product of these probabilities contributed by each attribute A_i for element x is the basis for information fusion in this Bayes rule.
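Equations 9 and 10 together describe a naive Bayes classifier, which can be sketched as below. The toy marble examples and all function names are illustrative assumptions, not from the text:

```python
# Naive Bayes classifier sketch (equations 9 and 10): under conditional
# independence, p(x | C_l) is the product over attributes of p_Ai(x | C_l),
# and we select the class maximizing P(C_l) times that product.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    attr_counts = defaultdict(Counter)            # (class, i) -> value counts
    for attrs, label in examples:
        for i, v in enumerate(attrs):
            attr_counts[(label, i)][v] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    def p_attr(c, i, v):                          # p_Ai(x | C_l) as a frequency
        return attr_counts[(c, i)][v] / class_counts[c]
    return priors, p_attr

def classify(attrs, priors, p_attr):
    score = {}
    for c in priors:
        s = priors[c]                             # P(C_l)
        for i, v in enumerate(attrs):
            s *= p_attr(c, i, v)                  # equation 9
        score[c] = s
    return max(score, key=score.get)              # equation 10

examples = [(("light", "veined"), "Thassos"), (("light", "plain"), "Thassos"),
            (("dark", "veined"), "Carrara"), (("dark", "plain"), "Carrara")]
priors, p_attr = train(examples)
print(classify(("light", "veined"), priors, p_attr))   # Thassos
```

A practical implementation would also smooth the frequency estimates (e.g. Laplace smoothing) so that an unseen attribute value does not force a zero product.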