Probabilistic Knowledge and Probabilistic Common Knowledge

Paul Krasucki 1, Rohit Parikh 2 and Gilbert Ndjatou 3
Abstract: In this paper we develop a theory of probabilistic common knowledge and probabilistic knowledge in a group of individuals whose knowledge partitions are not wholly independent.
1 Introduction
Our purpose in this paper is to extend conventional information theory and to address the issue of measuring the amount of knowledge that n individuals have in common. Suppose, for example, that two individuals have partitions which correspond closely; then we would expect that they share a great deal. However, the conventional definition of mutual knowledge may give us the conclusion that there is no fact which is mutually known, or even known to one as being known to another.
This is unfortunate because [CM] and [HM] both give us arguments that seem to show that common knowledge (mutual knowledge if two individuals are involved) is both difficult to attain and necessary for certain tasks. If, however, we can show that probabilistic knowledge is both easier to attain and a suitable substitute in many situations, then we have made progress. See [Pa2] for a description of situations where partial knowledge is adequate for communication.
To this end, we shall develop a theory of probabilistic common knowledge which turns out to have surprising and fruitful connections both with traditional information theory and with Markov chains. To be sure, these theories have their own areas of intended application. Nonetheless, it will turn out that our mathematical theory has many points in common with these two theories.
The standard logics of knowledge tend to use Kripke models with S5 accessibility relations, one for each knower. One can easily study instead the partitions corresponding to these accessibility relations, and we shall do this. We also assume that the space W of possible worlds has a probability measure µ given with it.
1 Department of Computer Science, Rutgers-Camden
2 Department of Computer Science, CUNY Graduate Center, 33 West 42nd Street, New York, NY 10036. email: RIPBC@CUNYVM.CUNY.EDU.
3 Department of Computer Science, College of Staten Island, CUNY and CUNY Graduate Center
In Figure I below, Ann has partition A = {A_1, A_2} and Bob has partition B = {B_1, B_2}, so that each of the sets A_i, B_j has probability .5 and the intersections A_i ∩ B_j have probability .45 when i = j and .05 otherwise. The vertical line divides A_1 from A_2; the slanted line divides B_1 from B_2.
[Figure I: a square cut by the vertical line into A_1, A_2 and by the slanted line into B_1, B_2; the four regions are labelled .45, .05, .05, .45.]

Figure I
Since the meet of the partitions is trivial, there is no common knowledge in the usual sense of [Au], [HM]. In fact there is no nontrivial proposition p such that Ann knows that Bob knows p. It is clear, however, that Ann and Bob have nearly the same information, and if the partitions are themselves common knowledge, then Ann and Bob will be able to guess, with high probability, what the other knows. We would like then to say that Ann and Bob have probabilistic common knowledge, but how much? One purpose of this paper is to answer this question and to prove properties of our definition that show why the answer is plausible.
A closely related question is that of measuring indirect probabilistic knowledge. For example, we would expect that what Ann knows about Bob's knowledge is less than or equal to what Bob himself knows, and what Ann knows of Bob's knowledge of Carol is in turn less than or equal to the amount of knowledge that Bob has about Carol's knowledge. We would expect in the limit that what Ann knows about what Bob knows about what Ann knows... about what Bob knows will approach whatever ordinary common knowledge they have.
It turns out that to tackle these questions successfully, we need a third notion: the amount of information acquired when one's probabilities change as a result of new information (which does not invalidate old information). Suppose for example that I am told that a certain fruit is a peach. I may then assign a probability of .45 to the proposition that it is sweet. If I then learn that it just came off a tree, I will expect that it was probably picked for shipping, and the probability may drop to .2; but if I learn further that it fell off the tree, then it will rise to .9. In each case I am getting information which is consistent with my previous information and which causes me to revise my probabilities, but how much information am I getting?
2 Preliminaries
We start by giving some definitions, some old, some apparently new. If a space has 2^n points, all equally likely, then the amount of information gained by knowing the identity of a specific point x is n bits. If one only knows a set X in which x falls, then the information gained is less, in fact equal to I(X) = −log(µ(X)), where µ(X) is the probability 4 of X. If P = {P_1, ..., P_k} is a partition of the whole space W, then the expected information when one discovers the identity of the P_i which contains x is

    H(P) = Σ_{i=1}^k µ(P_i) I(P_i) = − Σ_{i=1}^k µ(P_i) log(µ(P_i))
These definitions so far are standard in the literature [Sh], [Ab], [Dr]. We now introduce a notion which is apparently new.
Suppose I have a partition P = {P_1, ..., P_k} whose a priori probabilities are y_1, ..., y_k, but some information that I receive causes me to change them to u_1, ..., u_k. How much information have I received?
Definition 1:

    IG(⃗u, ⃗y) = Σ_{i=1}^k u_i (log u_i − log y_i) = Σ_{i=1}^k u_i log(u_i / y_i)

Here IG stands for "information gain".
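To make the definition concrete, here is a small numerical sketch; the helper name info_gain is ours, not the paper's.

```python
import math

def info_gain(u, y):
    """IG(u, y) = sum_i u_i * (log2 u_i - log2 y_i), taking 0*log(0) to be 0."""
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

# Revising a flat prior over four cells after ruling out half of them:
print(info_gain([0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]))  # 1.0 bit
```

The gain is 1 bit, exactly the information in the excluded half of the space, which matches the bound IG ≤ −log µ(X) proved below.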
Clearly this definition needs some justification. We will first provide an intuitive explanation, and then prove some properties of this notion IG which will make it more plausible that it is the right one.
(a) Suppose that the space had 2^n points, and the distribution of probabilities that we had was the flat distribution. Then the set P_i has 2^n · y_i points 5. After we receive our information, the points are no longer equally likely, and each point in P_i has probability u_i/|P_i| = u_i/(y_i 2^n). Thus the expected information of the partition into the 2^n singleton sets is

    − Σ_{i=1}^k (y_i · 2^n) (u_i / (y_i 2^n)) log(u_i / (y_i 2^n))
4 We will use the letter µ for both absolute and relative probabilities, to save the letter p for other uses. All logs will be to base 2 and since x log(x) → 0 as x → 0 we will take x log(x) to be 0 when x is 0.
5 There is a tacit assumption here that the y_i are of the form k/2^n. But note that numbers of this form are dense in the unit interval and if we assume that the function IG is continuous, then it is sufficient to consider numbers of this form.
which comes out to

    α = n − Σ_{i=1}^k u_i (log u_i − log y_i)

Since the flat distribution had expected information n, we have gained information equal to

    n − α = n − (n − Σ_{i=1}^k u_i (log u_i − log y_i)) = Σ_{i=1}^k u_i (log u_i − log y_i) = Σ_{i=1}^k u_i log(u_i / y_i)
(b) In information theory, we have a notion of the information that two partitions P and Q share, also called their mutual information, and usually denoted by I(P; Q):

    I(P; Q) = Σ_{i,j} µ(P_i ∩ Q_j) log( µ(P_i ∩ Q_j) / (µ(P_i) · µ(Q_j)) )
We will recalculate this quantity using the function IG. If Ann has partition P, then with probability µ(P_i) she knows that P_i is true. In that case, she will revise her probabilities of Bob's partition from ⃗µ(Q) to ⃗µ(Q|P_i), and her information gain about Bob's partition is IG(⃗µ(Q|P_i), ⃗µ(Q)). Summing over all the P_i we get

    Σ_i µ(P_i) · IG(⃗µ(Q|P_i), ⃗µ(Q)) = Σ_i µ(P_i) ( Σ_j µ(Q_j|P_i) log( µ(Q_j|P_i) / µ(Q_j) ) )
and an easy calculation shows that this is the same as

    I(P; Q) = Σ_{i,j} µ(P_i ∩ Q_j) log( µ(P_i ∩ Q_j) / (µ(P_i) · µ(Q_j)) )

Since the calculation through IG gives the same result as the usual formula, this gives additional support to the claim that our formula for the information gain is the right one.
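This agreement is easy to check numerically. The sketch below is our own code, using the Figure I probabilities; it computes I(P;Q) once by the direct formula and once through IG.

```python
import math

def info_gain(u, y):
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

# Joint distribution of Figure I: mu(A_i ∩ B_j) = .45 if i == j else .05
joint = [[0.45, 0.05], [0.05, 0.45]]
pA = [sum(row) for row in joint]             # marginals of Ann's partition: (.5, .5)
pB = [sum(col) for col in zip(*joint)]       # marginals of Bob's partition: (.5, .5)

# Direct formula for mutual information I(P;Q)
I_direct = sum(joint[i][j] * math.log2(joint[i][j] / (pA[i] * pB[j]))
               for i in range(2) for j in range(2))

# Recomputed via IG: average over Ann's cells of her gain about Bob's partition
I_via_IG = sum(pA[i] * info_gain([joint[i][j] / pA[i] for j in range(2)], pB)
               for i in range(2))

print(round(I_direct, 4), round(I_via_IG, 4))  # the two computations agree
```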
3 Properties of information gain
Theorem 1: (a) IG(⃗u, ⃗v) ≥ 0 and IG(⃗u, ⃗v) = 0 iff ⃗u = ⃗v.
(b1) If ⃗p = ⃗µ(P) and if there is a set X such that u_i = µ(P_i|X) for all i, then

    IG(⃗u, ⃗p) ≤ −log(µ(X))

Thus the information received, by way of a change of probabilities, is less than or equal to the information I(X) contained in X.
(b2) Equality obtains in (b1) above iff for all i, either µ(P_i ∩ X) = µ(P_i), or else µ(P_i ∩ X) = 0. Thus if all nonempty sets involved have non-zero measure, every P_i is either a subset of X or disjoint from it.
Proof: (a) It is straightforward to show using elementary calculus that log x < (x − 1) log e except when x = 1, when the two are equal. 6 Replacing x by 1/x we get log x > (1 − 1/x) log e except again at x = 1. This yields

    IG(⃗u, ⃗v) = Σ_i u_i log(u_i / v_i) ≥ Σ_i u_i (1 − v_i/u_i) log e = ((Σ_i u_i) − (Σ_i v_i)) log e = 0

with equality holding iff, for all i, either u_i/v_i = 1, or u_i = 0. However, the case u_i = 0 cannot arise, since we know that Σ_i u_i = Σ_i v_i = 1 and u_i ≤ v_i for all i.
(b1) Let u_i = µ(P_i|X) and ⃗p = (µ(P_1), ..., µ(P_k)). Then

    IG(⃗u, ⃗p) = Σ_{i=1}^k µ(P_i|X) log( µ(P_i|X) / µ(P_i) )
              = Σ_{i=1}^k µ(P_i|X) log( µ(P_i ∩ X) / (µ(P_i) µ(X)) )
              = Σ_{i=1}^k µ(P_i|X) log( µ(P_i ∩ X) / µ(P_i) ) − Σ_{i=1}^k µ(P_i|X) log µ(X) = α + I(X)

where α = Σ_{i=1}^k µ(P_i|X) log( µ(P_i ∩ X) / µ(P_i) ) ≤ 0, since µ(P_i ∩ X)/µ(P_i) ≤ 1 for all i and Σ_{i=1}^k µ(P_i|X) = 1.
(b2) α = 0 only if, for all i, µ(P_i|X) = 0 or µ(P_i ∩ X) = µ(P_i), i.e. either P_i ∩ X = ∅ or P_i ⊆ X (X is a union of the P_i's). ✷
If we learn that one of the sets that we had initially considered possible (its probability was greater than zero) can be excluded, then our information gain is least if the probability of the excluded piece is distributed over all the other elements of the partition, proportionately to their initial probabilities. The gain is greatest when the probability of the excluded piece is shifted to a single element of the partition, and this element was initially one of the least likely elements.
Theorem 2: Let ⃗v = (v_1, ..., v_{k−1}, v_k) and ⃗u = (u_1, ..., u_{k−1}, u_k), where u_k = 0, u_i = v_i + a_i v_k for i = 1, ..., k−1, a_i ≥ 0, Σ_{i=1}^{k−1} a_i = 1, and v_k > 0. Then:
(a) IG(⃗u, ⃗v) is minimum when a_i/v_i = c is the same for all i = 1, ..., k−1, and c = 1/(1 − v_k). Moreover, this minimum value is just −log(1 − v_k).
(b) IG(⃗u, ⃗v) is maximum when a_i = 1 for some i such that v_i = min_{j=1...k−1}(v_j) and the other a_j are 0.
Proof: (a) Let ⃗a = (a_1, ..., a_{k−2}, a_{k−1}). Since Σ_{i=1}^{k−1} a_i = 1 we have a_{k−1} = 1 − Σ_{i=1}^{k−2} a_i.

6 e is of course the number whose natural log is 1. Note that log e = log_2 e = 1/ln 2. The line y = (x − 1) log e is tangent to the curve y = log x at (1, 0), and lies above it.
So we need only look at f : [0, 1]^{k−2} → R, defined by:

    f(⃗a) = IG(⃗u, ⃗v) = Σ_{i=1}^{k−2} (v_i + a_i v_k) log( (v_i + a_i v_k) / v_i )
                        + (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) log( (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) / v_{k−1} )
To find the extrema of f in [0, 1]^{k−2}, consider the partial derivatives

    ∂f/∂a_i = v_k ( log( (v_i + a_i v_k) / v_i ) − log( (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) / v_{k−1} ) )

Thus ∂f/∂a_i = 0 iff (v_i + a_i v_k)/v_i = (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j))/v_{k−1}. Recall that a_{k−1} = 1 − Σ_{i=1}^{k−2} a_i. Then we have, for all i, a_i/v_i = a_{k−1}/v_{k−1}, or a_i = c v_i where c is a constant and i ranges over 1, ..., k−1. If we add these equations and use the fact that Σ_{i=1}^{k−1} a_i = 1 and the fact that Σ_{i=1}^{k−1} v_i = 1 − v_k, we get c = 1/(1 − v_k). Now ∂f/∂a_i is an increasing function of a_i, so it is > 0 iff a_i > v_i/(1 − v_k) and it is < 0 iff a_i < v_i/(1 − v_k). Thus f has a minimum when a_i = v_i/(1 − v_k) for all i. The fact that this minimum value is −log(1 − v_k) is easily calculated by substitution. Note that this quantity is exactly equal to I(X) where X is the complement of the set P_k whose probability was v_k. Thus we have an exact correspondence with parts (b1) and (b2) of the previous theorem.
(b) To get the maximum, note that since the first derivatives ∂f/∂a_i are always increasing, and the second derivatives are all positive, the maxima can only occur at the vertices of [0, 1]^{k−1}. (If they occurred elsewhere, we could increase the value by moving in some direction.) Now the values of f at the points p_j = (0, ..., 0, 1, 0, ..., 0) (a_i = δ(i, j)) are IG(⃗u, ⃗v) = g(v_j), where g(x) = (x + v_k) log( (x + v_k) / x ). But g is a decreasing function of x, so IG(⃗u, ⃗v) is maximum when a_j = 1 for some j such that v_j is minimal. ✷
Example 1: Suppose for example that a partition {P_1, P_2, P_3, P_4} is such that all the P_i have probabilities equal to .25. If we now receive the information that P_4 is impossible, then we will have gained information approximately equal to IG((.33, .33, .33, 0), (.25, .25, .25, .25)) ≈ 3 · (.33) log(.33/.25) ≈ log(4/3) ≈ .42. Similarly if we discover instead that it is P_3 which is impossible. If, however, we only discover that the total probability of the set P_3 ∪ P_4 has decreased to .33, then our information gain is only IG((.33, .33, .17, .17), (.25, .25, .25, .25)) ≈ .08, which is much less. And this makes sense, since knowing that the set P_3 ∪ P_4 has gone down in weight tells us less than knowing that half of it is no longer to be considered, and moreover which half.
If we discover that P_4 is impossible and all the cases that we had thought to be in P_4 are in fact in P_1, then the information gain is IG((.50, .25, .25, 0), (.25, .25, .25, .25)) = (1/2) log 2, which is .5 and more than our information gain in the two previous cases.
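The three gains of Example 1 can be recomputed exactly; this sketch uses our own info_gain helper, with the paper's .33 and .17 read as the exact thirds and sixths they round.

```python
import math

def info_gain(u, y):
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

prior = [0.25, 0.25, 0.25, 0.25]
g1 = info_gain([1/3, 1/3, 1/3, 0], prior)    # P4 ruled out: log(4/3) bits
g2 = info_gain([1/3, 1/3, 1/6, 1/6], prior)  # P3 ∪ P4 shrinks to 1/3
g3 = info_gain([0.5, 0.25, 0.25, 0], prior)  # P4's mass shifted to P1
print(round(g1, 2), round(g2, 2), round(g3, 2))  # 0.42 0.08 0.5
```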
Example 2: As the following example shows, IG doesn't satisfy the triangle inequality. I.e. if we revise our probabilities from ⃗y to ⃗u and then again to ⃗v, our total gain can be less than in revising them straight from ⃗y to ⃗v. This may perhaps explain why we do not notice gradual changes, but are struck by the cumulative effect of all of them.
Take ⃗v = (0.1, 0.9), ⃗u = (0.25, 0.75), ⃗y = (0.5, 0.5). Then IG(⃗v, ⃗u) + IG(⃗u, ⃗y) ≈ 0.10 + 0.19 = 0.29, while IG(⃗v, ⃗y) ≈ 0.53. Also IG(⃗y, ⃗v) ≈ 0.74, so that IG is not symmetric.
Another way to see that this failure of the triangle inequality is reasonable is to notice that we could have gained information by first relativising to a set X, and then to another set Y, gaining information at most −log(µ(X)) and −log(µ(Y)) respectively. However, to get the cumulative information gain, we might need to relativise to X ∩ Y, whose probability might be much less than µ(X)µ(Y).
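Both failures, of the triangle inequality and of symmetry, are easy to check numerically; this is our own sketch of the vectors in Example 2.

```python
import math

def info_gain(u, y):
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

v, u, y = [0.1, 0.9], [0.25, 0.75], [0.5, 0.5]
two_steps = info_gain(v, u) + info_gain(u, y)  # gain via the detour through u
one_step = info_gain(v, y)                     # gain of the direct revision
print(two_steps < one_step)                    # True: no triangle inequality
print(abs(info_gain(y, v) - one_step) > 0.1)   # True: IG is not symmetric
```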
We have defined the mutual knowledge I(P; Q) of two partitions P, Q. If we denote their join by P + Q, then the quantity usually denoted in the literature as H(P, Q) is merely H(P + Q). The connection between mutual information and entropy is well known [Ab]:

    H(P + Q) = H(P) + H(Q) − I(P; Q)

Moreover, the equivocation H(P|Q) of P with respect to Q is defined as H(P|Q) = H(P) − I(P; Q). If i and j are agents with partitions P_i and P_j respectively, then inf(ij) will be just I(P_i; P_j).
The equivocations are non-negative, and I is symmetric, so we have:

    I(P; Q) ≤ min(H(P), H(Q))

Thus what Ann knows about Bob's knowledge is always less than or equal to what Bob knows and what Ann herself knows.
We want now to generalise these notions to more than two people, for which we will need a notion from the theory of Markov chains, namely stochastic matrices. We start by making a connection between boolean matrices and the usual notion of knowledge.
4 Common knowledge and Boolean matrices
We start by reviewing some notions from ordinary knowledge theory, [Au], [HM], [PK].

Definition 2: Suppose that {1, ..., k} are individuals and i has knowledge partition P_i. If w ∈ W then i knows E at w iff P_i(w) ⊆ E, where P_i(w) is the element of the partition P_i containing w. K_i(E) = {w | i knows E at w}. Note that K_i(E) is always a subset of E.
Write w ≈_i w′ if w and w′ are in the same element of the partition P_i (iff P_i(w) = P_i(w′)). Then i knows E at w iff for all w′, w ≈_i w′ → w′ ∈ E.
Also, it follows that i knows that j knows E at w iff w ∈ K_i(K_j(E)) iff ⋃_{l≤n} {P_j^l | P_j^l ∩ P_i(w) ≠ ∅} ⊆ E, i.e. {w′ | ∃v such that w ≈_i v ≈_j w′} ⊆ E.
Definition 3: An event E is common knowledge among a group of individuals i_1, ..., i_m at w iff (∀j_1, ..., j_k ∈ {i_1, ..., i_m}) (w ≈_{j_1} w_1, ..., w_{k−1} ≈_{j_k} w′) → (w′ ∈ E), iff for all X ∈ {K_1, ..., K_n}*, w ∈ X(E).
We now analyse knowledge and common knowledge using boolean transition matrices 7:

Definition 4: The boolean transition matrix B_ij of ij is defined by letting B_ij(k, l) = 1 if P_i^k ∩ P_j^l ≠ ∅, and 0 otherwise.
We can extend this definition to a string of individuals x = i_1 ... i_k:

Definition 5: The boolean transition matrix B_x for a string x = i_1 ... i_k is

    B_x = B_{i_1 i_2} ⊗ B_{i_2 i_3} ⊗ ... ⊗ B_{i_{k−1} i_k}

where ⊗ is defined as normalised matrix multiplication: if (B × B′)(k, l) > 0 then (B ⊗ B′)(k, l) is set to 1, otherwise it is 0. We can also define ⊗ as: (B ⊗ B′)(k, l) = ⋁_{m=1}^n (B(k, m) ∧ B′(m, l)).
We say that there is no non-trivial common knowledge iff the only event that is common knowledge at any w is the whole space W.

Fact 1: There is no non-trivial common knowledge iff for every string x including all individuals, lim_{n→∞} B_{x^n} = 1, where 1 is the matrix filled with 1's only.
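As an illustration (the helper below is our own, not from the paper), ⊗ can be computed directly, and one can watch whether repeated products reach the all-1's matrix.

```python
def bool_mult(B, Bp):
    """Normalised boolean product: (B ⊗ B')(k, l) = OR_m (B(k, m) AND B'(m, l))."""
    return [[int(any(B[k][m] and Bp[m][l] for m in range(len(Bp))))
             for l in range(len(Bp[0]))] for k in range(len(B))]

# Figure I: every cell of Ann's partition meets every cell of Bob's,
# so B_ij is all 1's and stays all 1's -- no non-trivial common knowledge.
B_ij = [[1, 1], [1, 1]]
print(bool_mult(B_ij, B_ij))  # [[1, 1], [1, 1]]

# If the two partitions coincided, B_ij would be the identity and its
# products would never fill up: common knowledge would survive.
B_id = [[1, 0], [0, 1]]
print(bool_mult(B_id, B_id))  # [[1, 0], [0, 1]]
```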
We now consider the case of stochastic matrices.
5 Information via a string of agents
When we consider boolean transition matrices, we may lose some information. If we know the probabilities of all the elements of the σ-field generated by the join of the partitions P_i, the boolean transition matrix B_ij is created by putting a 1 in position (k, l) iff µ(P_j^l | P_i^k) > 0, and 0 otherwise. We keep more of the information by having µ(P_j^l | P_i^k) in position (k, l). We denote this matrix by M_ij and we call it the transition matrix from i to j.

7 The subscripts to the matrices will denote the knowers, and the row and column will be presented explicitly as arguments. Thus B_ij(k, l) is the entry in the kth row and lth column of the matrix B_ij.
Definition 6: For every i, j, the ij-transition matrix M_ij is defined by: M_ij(a, b) = µ(P_j^b | P_i^a).
For all i, M_ii is the unit matrix of dimension equal to the size of the partition P_i.
Definition 7: If x is a string of elements of {1, ..., k} (x ∈ {1, ..., k}*, x = x_1 ... x_n), then M_x = M_{x_1 x_2} × ... × M_{x_{n−1} x_n} is the transition matrix for x.
We now define inf(ixj), where x is a sequence of agents: inf(ixj) will be the information that i has about j via x. If, e.g., i = 3, x = 1, j = 2, we should interpret inf(ixj) as the amount of information 3 has about 1's knowledge of 2.
Example 3: In our example in the introduction, if i were Ann and j were Bob, then we would get

    M_ij = | .9  .1 |
           | .1  .9 |

The matrix M_ji equals the matrix M_ij, and the matrix M_iji is

    M_iji = | .82  .18 |
            | .18  .82 |

Thus it turns out that each of Ann and Bob has .53 bits of knowledge about the other, and Ann has .32 bits of knowledge about Bob's knowledge of her.
Definition 8: Let ⃗m_l = (m_{l1}, ..., m_{lk}) be the lth row vector of the transition matrix M_ixj (m_{lt} = µ(P_j^t |_x P_i^l), where µ(P_j^t |_x P_i^l) is the probability that a point in P_i^l will end up in P_j^t after a random move within P_i^l followed by a sequence of random moves respectively within the elements of those P_{x_r} which form x). Then:

    inf(ixj) = Σ_{l=1}^k µ(P_i^l) IG(⃗m_l, ⃗µ(P_j))

where IG(⃗m_l, ⃗µ(P_j)) is the information gain of the distribution ⃗m_l over the distribution ⃗µ(P_j).
The intuitive idea is that the a priori probabilities of j's partition are ⃗µ(P_j). However, if w is in P_i^l, the lth set in i's partition, then these probabilities will be revised according to the lth row of the matrix M_ixj, and the information gain will be IG(⃗m_l, ⃗µ(P_j)). The expected information gain for i about j via x is then obtained by multiplying by the µ(P_i^l)'s and summing over all l.
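The quantities of Example 3 can be recomputed with this definition. The sketch below (helper names are ours) builds M_ij from the Figure I probabilities and evaluates inf(ij) and inf(iji).

```python
import math

def info_gain(u, y):
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

def mat_mult(M, N):
    return [[sum(M[a][m] * N[m][b] for m in range(len(N)))
             for b in range(len(N[0]))] for a in range(len(M))]

def inf_via(M, p_source, p_target):
    """inf(ixj) = sum_l mu(P_i^l) * IG(l-th row of M_ixj, mu(P_j))."""
    return sum(p * info_gain(row, p_target) for p, row in zip(p_source, M))

M_ij = [[0.9, 0.1], [0.1, 0.9]]   # Figure I; M_ji is the same matrix
M_iji = mat_mult(M_ij, M_ij)
half = [0.5, 0.5]
print([[round(e, 2) for e in row] for row in M_iji])  # [[0.82, 0.18], [0.18, 0.82]]
print(round(inf_via(M_ij, half, half), 2))            # 0.53 bits
print(round(inf_via(M_iji, half, half), 2))           # 0.32 bits
```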
Example 4: Consider M_iji. For convenience we'll denote the elements P_i^m by A_m and the elements P_j^m by B_m (so that the A's are elements of i's partition, and the B's are elements of j's partition). Therefore M_iji = M_ij × M_ji, where

    M_ij = | µ(B_1|A_1) ... µ(B_k|A_1) |      M_ji = | µ(A_1|B_1) ... µ(A_k|B_1) |
           | µ(B_1|A_2) ... µ(B_k|A_2) |             | µ(A_1|B_2) ... µ(A_k|B_2) |
           |    ...             ...    |             |    ...             ...    |
           | µ(B_1|A_k) ... µ(B_k|A_k) |             | µ(A_1|B_k) ... µ(A_k|B_k) |

M_iji is the matrix of probabilities µ(A_l |_j A_m) for l, m = 1, ..., k, where µ(A_l |_j A_m) is the probability that a point in A_m will end up in A_l after a random move within A_m followed by a random move within some B_s:

    M_iji = | µ(A_1|_j A_1) µ(A_2|_j A_1) ... µ(A_k|_j A_1) |
            | µ(A_1|_j A_2) µ(A_2|_j A_2) ... µ(A_k|_j A_2) |
            |     ...                               ...     |
            | µ(A_1|_j A_k) µ(A_2|_j A_k) ... µ(A_k|_j A_k) |
Note that for x = λ, where λ is the empty string, inf(ij) = I(P_i; P_j), as in the standard definition:

    inf(ij) = Σ_{l=1}^k µ(P_i^l) IG(⃗µ(P_j|P_i^l), ⃗µ(P_j))
            = Σ_{l=1}^k µ(P_i^l) Σ_{t=1}^k µ(P_j^t|P_i^l) log( µ(P_j^t|P_i^l) / µ(P_j^t) )
            = Σ_{l,t=1}^k µ(P_j^t ∩ P_i^l) log( µ(P_j^t ∩ P_i^l) / (µ(P_j^t) µ(P_i^l)) )
6 Properties of transition matrices
The results in this section are either from the theory of Markov chains, or easily derived from these.

Definition 9: A matrix M is stochastic if all elements of M are reals in [0, 1] and the sum of every row is 1.

Fact 2: For every x, the matrix M_x is stochastic.
Definition 10: A matrix M is regular if there is m such that ∀(k, l) M^m(k, l) > 0.
The following fact establishes a connection between regular stochastic matrices and common knowledge:

Fact 3: The matrix M_ixi is regular iff there is no common knowledge between i and the individuals from x.
Fact 4: For every regular stochastic matrix M, there is a matrix M′ such that

    lim_{n→∞} M^n = M′

M′ is stochastic, and all the rows in M′ are the same. Moreover the rate of convergence is exponential: for a given column r, let d_n(r) be the difference between the maximum and the minimum in M^n, in that column. Then there is ε < 1 such that for all columns r and all sufficiently large n, d_n(r) ≤ ε^n.
By combining the last two facts we get the following corollary:

Fact 5: If there is no common knowledge between i and the individuals in x, then

    lim_{n→∞} (M_ixi)^n = M

where M is stochastic, and all rows in M are equal to the vector ⃗u_i of probabilities of the sets in the partition P_i.
A matrix with all rows equal represents the situation that all information is lost and all that is known is the a priori probabilities.
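This convergence is easy to watch numerically: powers of the regular matrix M_iji from Example 3 flatten toward the a priori distribution (.5, .5). A sketch with our own multiply helper:

```python
def mat_mult(M, N):
    return [[sum(M[a][m] * N[m][b] for m in range(len(N)))
             for b in range(len(N[0]))] for a in range(len(M))]

M = [[0.82, 0.18], [0.18, 0.82]]  # M_iji from the Figure I example: regular
P = M
for _ in range(50):               # compute a high power of M
    P = mat_mult(P, M)
print([[round(e, 4) for e in row] for row in P])  # [[0.5, 0.5], [0.5, 0.5]]
```

The second eigenvalue of M is .82 − .18 = .64, so the rows approach (.5, .5) at the exponential rate promised by Fact 4.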
Fact 6: If L, S are stochastic matrices and all the rows of L are equal, then S × L = L, and L × S = L′, where all rows in L′ are equal (though they may be different from those of L).
Fact 7: For any stochastic matrix S and regular matrix M_ixi:

    S × lim_{n→∞} (M_ixi)^n = M′, where M′ = lim_{n→∞} (M_ixi)^n
Definition 11: For a given partition P_i and string x = x_1 x_2 ... x_k we can define a relation ≈_x between the partitions P_i and P_j: P_i^m ≈_x P_j^n iff for some w ∈ P_i^m and w′ ∈ P_j^n there are v_1, ..., v_{k−1} such that w ≈_{x_1} v_1 ≈_{x_2} ... v_{k−1} ≈_{x_k} w′.
Definition 12: ≈*_x is the transitive closure of ≈_x. It is an equivalence relation.
Fact 8: Assume that x contains all j. Then the relation ≈*_x does not depend on the particular x and we may drop the x. P_i^m ≈* P_j^n iff P_i^m and P_j^n are subsets of the same element of P⁻, where P⁻ is the meet of the partitions of all the individuals.
Observation: We can permute the elements of the partition P_i so that the elements of the same equivalence class of ≈* have consecutive numbers, and then M_ixi looks as follows:

    M_ixi = | M_1      0  |
            |    ...      |
            |  0      M_r |

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈*. All submatrices M_l are square and regular.
Note that if there is no common knowledge then ≈* has a single equivalence class. Since we can always renumber the elements of the partitions so that the transition matrix is in the form described above, we will assume from now on that the transition matrix is always given in such a form.
Fact 9: If x contains all j then

    lim_{n→∞} (M_ixi)^n = M

where M is stochastic, the submatrices M_l of M are regular (in fact positive) and all the rows within every submatrix M_l are the same.
7 Properties of inf(ixj)
Theorem 3: If there is no common knowledge and x includes all the individuals, then

    lim_{n→∞} inf(i(jxj)^n) = 0

Proof: The matrix M = lim_{n→∞} (M_jxj)^n has all rows positive and equal. Let ⃗m be a row vector of M. Then lim_{n→∞} inf(i(jxj)^n) = IG(⃗m, ⃗µ(P_j)). Since the limiting vector ⃗m is equal to the distribution ⃗µ(P_j), we get lim_{n→∞} inf(i(jxj)^n) = IG(⃗µ(P_j), ⃗µ(P_j)) = 0. ✷
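For the Figure I pair this decay can be watched directly: the transition matrix of a longer and longer alternating string i j i j ... is a higher power of M_ij, and the corresponding information tends to 0. A sketch with our own helpers:

```python
import math

def info_gain(u, y):
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

def mat_mult(M, N):
    return [[sum(M[a][m] * N[m][b] for m in range(len(N)))
             for b in range(len(N[0]))] for a in range(len(M))]

M_ij = [[0.9, 0.1], [0.1, 0.9]]   # Figure I; M_ji is the same matrix
half = [0.5, 0.5]
P, gains = M_ij, []
for _ in range(8):                # information along ij, iji, ijij, ...
    gains.append(sum(p * info_gain(row, half) for p, row in zip(half, P)))
    P = mat_mult(P, M_ij)
print([round(g, 3) for g in gains])  # strictly decreasing toward 0
```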
The last theorem can be easily generalised to the following:

Fact 10: If there is no common knowledge among the individuals in x, and i, j occur in x, then as n → ∞, inf(ix^n j) goes to zero.
8 Probabilistic common knowledge

Common knowledge is very rare. But, even if there is no common knowledge in the system, we often have probabilistic common knowledge.

Definition 13: Individuals {1, ..., n} have probabilistic common knowledge if

    ∀x ∈ {1, ..., n}* inf(x) > 0

We note that there is no probabilistic common knowledge in the system iff there is some string x such that for some i, M_xi is a matrix with all rows equal and M_xi(·, t) = µ(P_i^t) for all t.
Theorem 4: If there is common knowledge in the system then there is probabilistic common knowledge, and

    ∀x ∈ {1, ..., n}* inf(x) ≥ H(P⁻)

Proof: We know from Fact 9 that

    M_ixi = | M_1      0  |
            |    ...      |
            |  0      M_r |

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈*_x, and all submatrices M_l are square and regular. Here r is the number of elements of the partition P⁻. Suppose that the probabilities of the sets in the partition P_i are u_1, ..., u_k and that the probabilities of the partition P⁻ are w_1, ..., w_r. Each w_j is going to be the sum of those u_l where the lth set in the partition P_i is a subset of the jth set in the partition P⁻. Let ⃗m_l be the lth row of the matrix M_ixi. Then inf(ixi) is Σ_{l=1}^k u_l IG(⃗m_l, ⃗u). The row ⃗m_l consists of zeroes, except in places corresponding to subsets of the appropriate element P_j⁻ of P⁻. Then, by Theorem 2, part (a): IG(⃗m_l, ⃗u) ≥ log( 1 / (1 − (1 − w_j)) ) = −log w_j. This quantity may repeat, since several elements of P_i may be contained in P_j⁻. When we add up all the multipliers u_l that occur with log w_j, these multipliers also add up to w_j. Thus we get

    inf(ixi) ≥ Σ_{j=1}^r −w_j log(w_j) = H(P⁻)    ✷
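The bound can be sanity-checked on a small example of our own devising: four worlds, i's partition into singletons, and j's partition equal to the meet P⁻ with two blocks of weight .5 each.

```python
import math

def info_gain(u, y):
    return sum(ui * math.log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

# Hypothetical setup: worlds w1..w4 with weights u; i's partition is into
# singletons, j's partition (which here equals the meet P-) is {w1,w2}, {w3,w4}.
u = [0.3, 0.2, 0.3, 0.2]
M_iji = [[0.6, 0.4, 0.0, 0.0],   # within block {w1,w2}, weights renormalise
         [0.6, 0.4, 0.0, 0.0],
         [0.0, 0.0, 0.6, 0.4],   # within block {w3,w4}
         [0.0, 0.0, 0.6, 0.4]]
inf_iji = sum(p * info_gain(row, u) for p, row in zip(u, M_iji))
H_meet = -2 * 0.5 * math.log2(0.5)     # H(P-) = 1 bit for two blocks of weight .5
print(inf_iji >= H_meet - 1e-12)       # True, as Theorem 4 predicts
```

In this block-diagonal case the bound is attained exactly, which is the behaviour Theorem 5 describes in the limit.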
We can also show:

Theorem 5: If x contains i, j and there is common knowledge between i, j and all the components of x, then the limiting information always exists and lim_{n→∞} inf(i(jxj)^n) = H(P⁻).
We postpone the proof to the full paper.
References

[Ab] Abramson, N., Information Theory and Coding, McGraw-Hill, 1963.
[AH] Abadi, M. and Halpern, J., "Decidability and Expressiveness for First-Order Logics of Probability", Proc. 30th Annual Conference on Foundations of Computer Science, 1989, pp. 148-153.
[Au] Aumann, R., "Agreeing to Disagree", Annals of Statistics, 4, 1976, pp. 1236-1239.
[Ba] Bacchus, F., "On Probability Distributions over Possible Worlds", Proc. 4th Workshop on Uncertainty in AI, 1988, pp. 15-21.
[CM] Clark, H. H. and Marshall, C. R., "Definite Reference and Mutual Knowledge", in Elements of Discourse Understanding, ed. Joshi, Webber and Sag, Cambridge U. Press, 1981.
[Dr] Dretske, F., Knowledge and the Flow of Information, MIT Press, 1981.
[Ha] Halpern, J., "An Analysis of First-Order Logics of Probability", Proc. 11th International Joint Conference on Artificial Intelligence (IJCAI 89), 1989, pp. 1375-1381.
[HM] Halpern, J. and Moses, Y., "Knowledge and Common Knowledge in a Distributed Environment", Proc. 3rd ACM Conf. on Principles of Distributed Computing, 1984, pp. 50-61.
[KS] Kemeny, J. and Snell, L., Finite Markov Chains, Van Nostrand, 1960.
[Pa] Parikh, R., "Levels of Knowledge in Distributed Computing", Proc. IEEE Symposium on Logic in Computer Science, 1986, pp. 322-331.
[Pa2] Parikh, R., "A Utility Based Approach to Vague Predicates", to appear.
[PK] Parikh, R. and Krasucki, P., "Levels of Knowledge in Distributed Computing", research report, Brooklyn College, CUNY, 1986. Revised version of [Pa] above.
[Sh] Shannon, C., "A Mathematical Theory of Communication", Bell System Technical Journal, 27, 1948. (Reprinted in Shannon and Weaver, A Mathematical Theory of Communication, University of Illinois Press, 1964.)