
Probabilistic Knowledge and Probabilistic Common Knowledge

Paul Krasucki¹, Rohit Parikh² and Gilbert Ndjatou³

Abstract: In this paper we develop a theory of probabilistic common knowledge and probabilistic knowledge in a group of individuals whose knowledge partitions are not wholly independent.

1 Introduction

Our purpose in this paper is to extend conventional information theory and to address the issue of measuring the amount of knowledge that n individuals have in common. Suppose, for example, that two individuals have partitions which correspond closely; then we would expect that they share a great deal. However, the conventional definition of mutual knowledge may give us the conclusion that there is no fact which is mutually known, or even known to one as being known to another.

This is unfortunate because [CM] and [HM] both give us arguments that seem to show that common knowledge (mutual knowledge if two individuals are involved) is both difficult to attain and necessary for certain tasks. If, however, we can show that probabilistic knowledge is both easier to attain and a suitable substitute in many situations, then we have made progress. See [Pa2] for a description of situations where partial knowledge is adequate for communication.

To this end, we shall develop a theory of probabilistic common knowledge which turns out to have surprising and fruitful connections both with traditional information theory and with Markov chains. To be sure, these theories have their own areas of intended application. Nonetheless, it will turn out that our mathematical theory has many points in common with these two theories.

The standard logics of knowledge tend to use Kripke models with S5 accessibility relations, one for each knower. One can easily study instead the partitions corresponding to these accessibility relations, and we shall do this. We also assume that the space W of possible worlds has a probability measure µ given with it.

¹ Department of Computer Science, Rutgers-Camden.
² Department of Computer Science, CUNY Graduate Center, 33 West 42nd Street, New York, NY 10036. Email: RIPBC@CUNYVM.CUNY.EDU.
³ Department of Computer Science, College of Staten Island, CUNY and CUNY Graduate Center.



In Figure I below, Ann has partition A = {A_1, A_2} and Bob has partition B = {B_1, B_2}, so that each of the sets A_i, B_j has probability .5 and the intersections A_i ∩ B_j have probability .45 when i = j and .05 otherwise.

[Figure I. The space W drawn as a square: a vertical line divides A_1 from A_2, a slanted line divides B_1 from B_2, and the four regions A_i ∩ B_j carry the probabilities .45, .05, .05, .45.]

Since the meet of the partitions is trivial, there is no common knowledge in the usual sense of [Au], [HM]. In fact there is no nontrivial proposition p such that Ann knows that Bob knows p. It is clear, however, that Ann and Bob have nearly the same information, and if the partitions are themselves common knowledge, then Ann and Bob will be able to guess, with high probability, what the other knows. We would like then to say that Ann and Bob have probabilistic common knowledge, but how much? One purpose of this paper is to answer this question and to prove properties of our definition that show why the answer is plausible.

A closely related question is that of measuring indirect probabilistic knowledge. For example, we would expect that what Ann knows about Bob's knowledge is less than or equal to what Bob himself knows, and what Ann knows of Bob's knowledge of Carol is in turn less than or equal to the amount of knowledge that Bob has about Carol's knowledge. We would expect in the limit that what Ann knows about what Bob knows about what Ann knows... about what Bob knows will approach whatever ordinary common knowledge they have.

It turns out that to tackle these questions successfully, we need a third notion. This is the notion of the amount of information acquired when one's probabilities change as a result of new information (which does not invalidate old information). Suppose for example that I am told that a certain fruit is a peach. I may then assign a probability of .45 to the proposition that it is sweet. If I then learn that it just came off a tree, I will expect that it was probably picked for shipping, and the probability may drop to .2; but if I further learn that it fell off the tree, then it will rise to .9. In each case I am getting information, consistent with previous information and causing me to revise my probabilities, but how much information am I getting?

2 Preliminaries

We start by giving some definitions, some old, some apparently new. If a space has 2^n points, all equally likely, then the amount of information gained by knowing the identity of a specific point x is n bits. If one only knows a set X in which x falls, then the information gained is less, in fact equal to I(X) = −log(µ(X)), where µ(X) is the probability⁴ of X. If P = {P_1, ..., P_k} is a partition of the whole space W, then the expected information when one discovers the identity of the P_i which contains x is

    H(P) = Σ_{i=1}^k µ(P_i) I(P_i) = − Σ_{i=1}^k µ(P_i) log(µ(P_i))

These definitions so far are standard in the literature [Sh], [Ab], [Dr]. We now introduce a notion which is apparently new.

Suppose I have a partition P = {P_1, ..., P_k} whose a priori probabilities are y_1, ..., y_k, but some information that I receive causes me to change them to u_1, ..., u_k. How much information have I received?

Definition 1:

    IG(⃗u, ⃗y) = Σ_{i=1}^k u_i (log u_i − log y_i) = Σ_{i=1}^k u_i log(u_i / y_i)

Here IG stands for "information gain".
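As a concrete rendering of these two formulas, here is a minimal Python sketch (ours, not part of the paper); the peach probabilities are the ones used informally in the introduction.

    from math import log2

    def entropy(p):
        # H(P) = -sum mu(P_i) log2 mu(P_i), with the convention 0 * log2(0) = 0.
        return -sum(x * log2(x) for x in p if x > 0)

    def IG(u, y):
        # Information gain when a priori probabilities y are revised to u (Definition 1).
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits for four equally likely cells
    # The peach: the probability of "sweet" is revised from .45 to .2.
    print(IG([0.2, 0.8], [0.45, 0.55]))        # about 0.20 bits gained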

Clearly this definition needs some justification. We will first provide an intuitive explanation, and then prove some properties of this notion IG which will make it more plausible that it is the right one.

(a) Suppose that the space had 2^n points, and the distribution of probabilities that we had was the flat distribution. Then the set P_i has 2^n · y_i points⁵. After we receive our information, the points are no longer equally likely, and each point in P_i has probability u_i/|P_i| = u_i/(y_i 2^n). Thus the expected information of the partition of the 2^n singleton sets is

    − Σ_{i=1}^k (y_i · 2^n) · (u_i / (y_i 2^n)) · log(u_i / (y_i 2^n))

⁴ We will use the letter µ for both absolute and relative probabilities, to save the letter p for other uses. All logs will be to base 2, and since x log(x) → 0 as x → 0 we will take x log(x) to be 0 when x is 0.

⁵ There is a tacit assumption here that the y_i are of the form k/2^n. But note that numbers of this form are dense in the unit interval, and if we assume that the function IG is continuous, then it is sufficient to consider numbers of this form.



which comes out to

    α = n − Σ_{i=1}^k u_i (log u_i − log y_i)

Since the flat distribution had expected information n, we have gained information equal to

    n − α = n − (n − Σ_{i=1}^k u_i (log u_i − log y_i)) = Σ_{i=1}^k u_i (log u_i − log y_i) = Σ_{i=1}^k u_i log(u_i / y_i)

(b) In information theory, we have a notion of the information that two partitions P and Q share, also called their mutual information, and usually denoted by I(P; Q):

    I(P; Q) = Σ_{i,j} µ(P_i ∩ Q_j) log( µ(P_i ∩ Q_j) / (µ(P_i) · µ(Q_j)) )

We will recalculate this quantity using the function IG. If Ann has partition P, then with probability µ(P_i) she knows that P_i is true. In that case, she will revise her probabilities of Bob's partition from ⃗µ(Q) to ⃗µ(Q|P_i), and in that case her information gain about Bob's partition is IG(⃗µ(Q|P_i), ⃗µ(Q)). Summing over all the P_i we get

    Σ_i µ(P_i) · IG(⃗µ(Q|P_i), ⃗µ(Q)) = Σ_i µ(P_i) ( Σ_j µ(Q_j|P_i) log( µ(Q_j|P_i) / µ(Q_j) ) )

and an easy calculation shows that this is the same as

    I(P; Q) = Σ_{i,j} µ(P_i ∩ Q_j) log( µ(P_i ∩ Q_j) / (µ(P_i) · µ(Q_j)) )

Since the calculation through IG gives the same result as the usual formula, this gives additional support to the claim that our formula for the information gain is the right one.
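A numerical sanity check of this identity (ours), on the Ann/Bob joint distribution of Figure I:

    from math import log2

    # Joint distribution mu(A_i ∩ B_j) from Figure I.
    joint = [[0.45, 0.05],
             [0.05, 0.45]]
    mu_A = [sum(row) for row in joint]          # marginals of Ann's partition
    mu_B = [sum(col) for col in zip(*joint)]    # marginals of Bob's partition

    def IG(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    # Mutual information computed directly from the joint distribution.
    I_direct = sum(joint[i][j] * log2(joint[i][j] / (mu_A[i] * mu_B[j]))
                   for i in range(2) for j in range(2))

    # The same quantity as an expected information gain: with probability mu(A_i),
    # Ann revises mu(B) to mu(B | A_i).
    I_via_IG = sum(mu_A[i] * IG([joint[i][j] / mu_A[i] for j in range(2)], mu_B)
                   for i in range(2))

    print(I_direct, I_via_IG)   # both are about 0.531 bits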

3 Properties of information gain

Theorem 1: (a) IG(⃗u, ⃗v) ≥ 0, and IG(⃗u, ⃗v) = 0 iff ⃗u = ⃗v.

(b1) If ⃗p = ⃗µ(P) and if there is a set X such that u_i = µ(P_i|X) for all i, then

    IG(⃗u, ⃗p) ≤ −log(µ(X))

Thus the information received, by way of a change of probabilities, is less than or equal to the information I(X) contained in X.

(b2) Equality obtains in (b1) above iff for all i, either µ(P_i ∩ X) = µ(P_i), or else µ(P_i ∩ X) = 0. Thus if all nonempty sets involved have non-zero measure, every P_i is either a subset of X or disjoint from it.

Proof: (a) It is straightforward to show using elementary calculus that log x < (x − 1) log e except when x = 1, when the two are equal.⁶ Replacing x by 1/x we get log x > (1 − 1/x) log e, except again at x = 1. This yields

    IG(⃗u, ⃗v) = Σ_i u_i log(u_i / v_i) ≥ (Σ_i u_i (1 − v_i/u_i)) log e = ((Σ_i u_i) − (Σ_i v_i)) log e = 0

with equality holding iff, for all i, either u_i/v_i = 1, or u_i = 0. However, the case u_i = 0 cannot arise, since we know that Σ_i u_i = Σ_i v_i = 1 and u_i ≤ v_i for all i.

⁶ e is of course the number whose natural log is 1. Note that log e = log_2 e = 1/ln 2. The line y = (x − 1) log e is tangent to the curve y = log x at (1, 0), and lies above it.

(b1) Let u_i = µ(P_i|X) and ⃗p = (µ(P_1), ..., µ(P_k)). Then

    IG(⃗u, ⃗p) = Σ_{i=1}^k µ(P_i|X) log( µ(P_i|X) / µ(P_i) )
              = Σ_{i=1}^k µ(P_i|X) log( µ(P_i ∩ X) / (µ(P_i) µ(X)) )
              = Σ_{i=1}^k µ(P_i|X) log( µ(P_i ∩ X) / µ(P_i) ) − Σ_{i=1}^k µ(P_i|X) log µ(X)
              = α + I(X)

where α = Σ_{i=1}^k µ(P_i|X) log( µ(P_i ∩ X) / µ(P_i) ) ≤ 0, since µ(P_i ∩ X)/µ(P_i) ≤ 1 for all i and Σ_{i=1}^k µ(P_i|X) = 1.

(b2) α = 0 only if, for all i, µ(P_i|X) = 0 or µ(P_i ∩ X) = µ(P_i), i.e. either P_i ∩ X = ∅ or P_i ⊆ X (X is a union of the P_i's). ✷
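A numerical illustration of (b1) and (b2); the partition and the event X below are hypothetical numbers of our own choosing.

    from math import log2

    def IG(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    mu_P  = [0.4, 0.3, 0.2, 0.1]       # a priori probabilities mu(P_i)
    mu_PX = [0.2, 0.1, 0.15, 0.05]     # mu(P_i ∩ X) for an event X with mu(X) = 0.5
    mu_X  = sum(mu_PX)

    print(IG([m / mu_X for m in mu_PX], mu_P))   # about 0.06, strictly below the bound
    print(-log2(mu_X))                           # 1.0 = I(X), the bound of (b1)

    # Equality case of (b2): X = P_1 ∪ P_2, a union of cells of the partition.
    mu_X2 = mu_P[0] + mu_P[1]
    print(IG([mu_P[0] / mu_X2, mu_P[1] / mu_X2, 0, 0], mu_P), -log2(mu_X2))   # both about 0.515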

If we learn that one of the sets which we had initially considered possible (its probability was greater than zero) can be excluded, then our information gain is least if the probability of the excluded piece is distributed over all the other elements of the partition, proportionately to their initial probabilities. The gain is greatest when the probability of the excluded piece is shifted to a single element of the partition, and this element was initially one of the least likely elements.

Theorem 2: Let ⃗v = (v_1, ..., v_{k−1}, v_k) and ⃗u = (u_1, ..., u_{k−1}, u_k), where u_k = 0, u_i = v_i + a_i v_k for i = 1, ..., k − 1, a_i ≥ 0, Σ_{i=1}^{k−1} a_i = 1, and v_k > 0. Then:

(a) IG(⃗u, ⃗v) is minimum when a_i/v_i = c is the same for all i = 1, ..., k − 1, and c = 1/(1 − v_k). Moreover, this minimum value is just −log(1 − v_k).

(b) IG(⃗u, ⃗v) is maximum when a_i = 1 for some i such that v_i = min_{j=1...k−1}(v_j) and the other a_j are 0.

Proof: (a) Let ⃗a = (a_1, ..., a_{k−2}, a_{k−1}). Since Σ_{i=1}^{k−1} a_i = 1 we have a_{k−1} = 1 − Σ_{i=1}^{k−2} a_i.


So we need only look at f: [0, 1]^{k−2} → R, defined by:

    f(⃗a) = IG(⃗u, ⃗v) = Σ_{i=1}^{k−2} (v_i + a_i v_k) log( (v_i + a_i v_k) / v_i )
                        + (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) log( (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) / v_{k−1} )

To find the extrema of f in [0, 1]^{k−2}, consider the partial derivatives

    ∂f/∂a_i = v_k ( log( (v_i + a_i v_k) / v_i ) − log( (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j)) / v_{k−1} ) ) log e

Now ∂f/∂a_i = 0 iff (v_i + a_i v_k)/v_i = (v_{k−1} + v_k (1 − Σ_{j=1}^{k−2} a_j))/v_{k−1}. Recall that a_{k−1} = 1 − Σ_{i=1}^{k−2} a_i. Then we have, for all i, a_i/v_i = a_{k−1}/v_{k−1}, or a_i = c v_i where c is a constant and i ranges over 1, ..., k − 1. If we add these equations and use the fact that Σ_{i=1}^{k−1} a_i = 1 and the fact that Σ_{i=1}^{k−1} v_i = 1 − v_k, we get c = 1/(1 − v_k). Now ∂f/∂a_i is an increasing function of a_i, so it is > 0 iff a_i > v_i/(1 − v_k), and it is < 0 iff a_i < v_i/(1 − v_k). Thus f has a minimum when a_i = v_i/(1 − v_k) for all i. The fact that this minimum value is −log(1 − v_k) is easily calculated by substitution. Note that this quantity is exactly equal to I(X) where X is the complement of the set P_k whose probability was v_k. Thus we have an exact correspondence with parts (b1) and (b2) of the previous theorem.

(b) To get the maximum, note that since the first derivatives ∂f/∂a_i are always increasing, and the second derivatives are all positive, the maxima can only occur at the vertices of [0, 1]^{k−1}. (If they occurred elsewhere, we could increase the value by moving in some direction.) Now the values of f at the points p_j = (0, ..., 0, 1, 0, ..., 0) (a_i = δ(i, j)) are IG(⃗u, ⃗v) = g(v_j), where g(x) = (x + v_k) log((x + v_k)/x). But g(x) is a decreasing function of x, so IG(⃗u, ⃗v) is maximum when a_j = 1 for some j such that v_j is minimal. ✷

Example 1: Suppose for example that a partition {P_1, P_2, P_3, P_4} is such that all the P_i have probabilities equal to .25. If we now receive the information that P_4 is impossible, then we will have gained information approximately equal to IG((.33, .33, .33, 0), (.25, .25, .25, .25)) ≈ 3 · (.33) log(.33/.25) ≈ log(4/3) ≈ .42. Similarly if we discover instead that it is P_3 which is impossible. If, however, we only discover that the total probability of the set P_3 ∪ P_4 has decreased to .33, then our information gain is only IG((.33, .33, .17, .17), (.25, .25, .25, .25)) ≈ .08, which is much less. And this makes sense, since knowing that the set P_3 ∪ P_4 has gone down in weight tells us less than knowing that half of it is no longer to be considered, and moreover which half.

If we discover that P_4 is impossible and all the cases that we had thought to be in P_4 are in fact in P_1, then the information gain is IG((.50, .25, .25, 0), (.25, .25, .25, .25)) = (1/2) log 2, which is .5 and more than our information gain in the two previous cases.
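These three values are easy to check; the sketch below (ours) uses the exact fractions 1/3 and 1/6 in place of the rounded .33 and .17.

    from math import log2

    def IG(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    prior = [0.25, 0.25, 0.25, 0.25]

    # P_4 excluded, its weight spread proportionately over P_1, P_2, P_3.
    print(IG([1/3, 1/3, 1/3, 0], prior))     # log2(4/3) ≈ 0.415

    # Only the total weight of P_3 ∪ P_4 drops to 1/3.
    print(IG([1/3, 1/3, 1/6, 1/6], prior))   # ≈ 0.08

    # P_4 excluded and all of its weight shifted onto P_1.
    print(IG([0.5, 0.25, 0.25, 0], prior))   # exactly 0.5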



Example 2: As the following example shows, IG doesn't satisfy the triangle inequality. I.e. if we revise our probabilities from ⃗y to ⃗u and then again to ⃗v, our total gain can be less than revising them straight from ⃗y to ⃗v. This may perhaps explain why we do not notice gradual changes, but are struck by the cumulative effect of all of them.

Take ⃗v = (0.1, 0.9), ⃗u = (0.25, 0.75), ⃗y = (0.5, 0.5). Then IG(⃗v, ⃗u) + IG(⃗u, ⃗y) ≈ 0.10 + 0.19 = 0.29, while IG(⃗v, ⃗y) ≈ 0.53. Also IG(⃗y, ⃗v) ≈ 0.74, so that IG is not symmetric.
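A quick check of these figures (the computation is ours):

    from math import log2

    def IG(u, y):
        return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

    v, u, y = (0.1, 0.9), (0.25, 0.75), (0.5, 0.5)

    print(IG(v, u) + IG(u, y))   # ≈ 0.29, the gain accumulated in two steps
    print(IG(v, y))              # ≈ 0.53, the gain of the one-step revision
    print(IG(y, v))              # ≈ 0.74, showing that IG is not symmetric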

Another way to see that this failure of the triangle inequality is reasonable is to notice that we could have gained information by first relativising to a set X, and then to another set Y, gaining information ≤ −log(µ(X)) and ≤ −log(µ(Y)) respectively. However, to get the cumulative information gain, we might need to relativise to X ∩ Y, whose probability might be much less than µ(X)µ(Y).

We have defined the mutual information I(P; Q) of two partitions P, Q. If we denote their join as P + Q, then the quantity usually denoted in the literature as H(P, Q) is merely H(P + Q). The connection between mutual information and entropy is well known [Ab]:

    H(P + Q) = H(P) + H(Q) − I(P; Q)

Moreover, the equivocation H(P|Q) of P with respect to Q is defined as H(P|Q) = H(P) − I(P; Q). If i and j are agents with partitions P_i and P_j respectively, then inf(ij) will be just I(P_i; P_j).

The equivocations are non-negative, and I is symmetric, and so we have:

    I(P; Q) ≤ min(H(P), H(Q))

Thus what Ann knows about Bob's knowledge is always less than or equal to what Bob knows and to what Ann herself knows.
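These relations can be checked on the Ann/Bob distribution of Figure I; the sketch below (ours) computes both sides from the joint probabilities.

    from math import log2

    joint = [[0.45, 0.05],
             [0.05, 0.45]]          # mu(A_i ∩ B_j) from Figure I

    def H(p):
        return -sum(x * log2(x) for x in p if x > 0)

    mu_A = [sum(row) for row in joint]
    mu_B = [sum(col) for col in zip(*joint)]
    mu_join = [x for row in joint for x in row]   # the join A + B has four cells

    I_AB = sum(joint[i][j] * log2(joint[i][j] / (mu_A[i] * mu_B[j]))
               for i in range(2) for j in range(2))

    print(H(mu_join))                      # ≈ 1.469
    print(H(mu_A) + H(mu_B) - I_AB)        # ≈ 1.469, matching H(A + B)
    print(I_AB <= min(H(mu_A), H(mu_B)))   # True: 0.531 <= 1.0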

We want now to generalise these notions to more than two people, for which we will need a notion from the theory of Markov chains, namely stochastic matrices. We start by making a connection between boolean matrices and the usual notion of knowledge.

4 Common knowledge and Boolean matrices

We start by reviewing some notions from ordinary knowledge theory, [Au], [HM], [PK].

Definition 2: Suppose that {1, ..., k} are individuals and i has knowledge partition P_i. If w ∈ W then i knows E at w iff P_i(w) ⊆ E, where P_i(w) is the element of the partition P_i containing w. K_i(E) = {w | i knows E at w}. Note that K_i(E) is always a subset of E.

Write w ≈_i w′ if w and w′ are in the same element of the partition P_i (iff P_i(w) = P_i(w′)). Then i knows E at w iff for all w′, w ≈_i w′ → w′ ∈ E.

Also, it follows that i knows that j knows E at w iff w ∈ K_i(K_j(E)) iff ⋃_{l≤n} {P_j^l | P_j^l ∩ P_i(w) ≠ ∅} ⊆ E, i.e. {w′ | ∃v such that w ≈_i v ≈_j w′} ⊆ E.

Definition 3: An event E is common knowledge between a group of individuals i_1, ..., i_m at w iff (∀ j_1, ..., j_k ∈ {i_1, ..., i_m}) (w ≈_{j_1} w_1, ..., w_{k−1} ≈_{j_k} w′) → (w′ ∈ E), iff for all X ∈ {K_1, ..., K_n}*, w ∈ X(E).

We now analyse knowledge and common knowledge using boolean transition matrices⁷:

Definition 4: The boolean transition matrix B_ij of ij is defined by letting B_ij(k, l) = 1 if P_i^k ∩ P_j^l ≠ ∅, and 0 otherwise.

⁷ The subscripts to the matrices will denote the knowers, and the row and column will be presented explicitly as arguments. Thus B_ij(k, l) is the entry in the kth row and lth column of the matrix B_ij.

We can extend this definition to a string of individuals x = i_1 ... i_k:

Definition 5: The boolean transition matrix B_x for a string x = i_1 ... i_k is

    B_x = B_{i_1 i_2} ⊗ B_{i_2 i_3} ⊗ ... ⊗ B_{i_{k−1} i_k}

where ⊗ is defined as normalised matrix multiplication: if (B × B′)(k, l) > 0 then (B ⊗ B′)(k, l) is set to 1, otherwise it is 0. We can also define ⊗ as: (B ⊗ B′)(k, l) = ⋁_{m=1}^n (B(k, m) ∧ B′(m, l)).

We say that there is no non-trivial common knowledge iff the only event that is common knowledge at any w is the whole space W.

Fact 1: There is no non-trivial common knowledge iff for every string x including all individuals, lim_{n→∞} B_{x^n} = 1, where 1 is the matrix filled with 1's only.
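The boolean machinery above is easy to prototype. The sketch below (ours) uses a hypothetical pair of three-cell partitions whose boolean transition matrix has a block structure, so the iterated product of Fact 1 never becomes the all-ones matrix and non-trivial common knowledge exists; for the Ann/Bob example of Figure I, by contrast, B_ij is already the all-ones matrix.

    import numpy as np

    def bool_product(B1, B2):
        # Normalised boolean product of Definition 5: 1 wherever the ordinary product is positive.
        return (B1 @ B2 > 0).astype(int)

    # Hypothetical boolean transition matrix B_ij for agents i, j with
    # three-cell partitions: B_ij(k, l) = 1 iff P_i^k ∩ P_j^l ≠ ∅.
    B_ij = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 0, 1]])
    B_ji = B_ij.T

    # Iterate the boolean product along the alternating string i j i j ... (Fact 1).
    B = B_ij
    for _ in range(10):
        B = bool_product(B, B_ji)
        B = bool_product(B, B_ij)
    print(B)
    # The block structure persists and the all-ones matrix is never reached,
    # so this pair has non-trivial common knowledge.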

We now consider the case of stochastic matrices.

5 Information via a string of agents

When we consider boolean transition matrices, we may lose some information. If we know the probabilities of all the elements of the σ-field generated by the join of the partitions P_i, the boolean transition matrix B_ij is created by putting a 1 in position (k, l) iff µ(P_j^l | P_i^k) > 0, and 0 otherwise. We keep more of the information by having µ(P_j^l | P_i^k) in position (k, l). We denote this matrix by M_ij and we call it the transition matrix from i to j.


Definition 6: For every i, j, the ij-transition matrix M_ij is defined by: M_ij(a, b) = µ(P_j^b | P_i^a).

For all i, M_ii is the unit matrix of dimension equal to the size of partition P_i.

Definition 7: If x is a string of elements of {1, ..., k} (x ∈ {1, ..., k}*, x = x_1 ... x_n), then M_x = M_{x_1 x_2} × ... × M_{x_{n−1} x_n} is the transition matrix for x.

We now define inf(ixj), where x is a sequence of agents; inf(ixj) will be the information that i has about j via x. If e.g. i = 3, x = 1, j = 2, we should interpret inf(ixj) as the amount of information 3 has about 1's knowledge of 2.

Example 3: In our example in the introduction, if i were Ann and j were Bob, then we would get

    M_ij = | .9  .1 |
           | .1  .9 |

The matrix M_ji equals the matrix M_ij, and the matrix M_iji is

    M_iji = | .82  .18 |
            | .18  .82 |

Thus it turns out that each of Ann and Bob has .53 bits of knowledge about the other, and Ann has .32 bits of knowledge about Bob's knowledge of her.

Definition 8: Let ⃗m_l = (m_{l1}, ..., m_{lk}) be the lth row vector of the transition matrix M_ixj (so m_{lt} = µ(P_j^t |_x P_i^l), where µ(P_j^t |_x P_i^l) is the probability that a point in P_i^l will end up in P_j^t after a random move within P_i^l followed by a sequence of random moves within the elements of those partitions P_{x_r} which form x). Then:

    inf(ixj) = Σ_{l=1}^k µ(P_i^l) IG(⃗m_l, ⃗µ(P_j))

where IG(⃗m_l, ⃗µ(P_j)) is the information gain of the distribution ⃗m_l over the distribution ⃗µ(P_j).

The intuitive idea is that the a priori probabilities of j's partition are ⃗µ(P_j). However, if w is in P_i^l, the lth set in i's partition, then these probabilities will be revised according to the lth row of the matrix M_ixj, and the information gain will be IG(⃗m_l, ⃗µ(P_j)). The expected information gain for i about j via x is then obtained by multiplying by the µ(P_i^l)'s and summing over all l.
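A sketch (ours) of Definition 8 in code; it reproduces the figures of Example 3. The helper names inf_via and IG are our own.

    import numpy as np

    def IG(u, y):
        # Information gain of distribution u over a priori distribution y (base 2).
        u, y = np.asarray(u), np.asarray(y)
        nz = u > 0
        return float(np.sum(u[nz] * np.log2(u[nz] / y[nz])))

    def inf_via(M, mu_i, mu_j):
        # inf(ixj) = sum_l mu(P_i^l) * IG(row l of M_ixj, mu(P_j)).
        return sum(mu_i[l] * IG(M[l], mu_j) for l in range(len(mu_i)))

    M_ij = np.array([[0.9, 0.1],
                     [0.1, 0.9]])        # Ann-to-Bob transition matrix
    mu = [0.5, 0.5]                      # both partitions have cells of probability .5

    print(inf_via(M_ij, mu, mu))         # ≈ 0.53 bits: what Ann knows about Bob
    print(inf_via(M_ij @ M_ij, mu, mu))  # ≈ 0.32 bits: what Ann knows about Bob's knowledge of her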

Example 4: Consider M_iji. For convenience we'll denote elements P_i^m by A_m and elements P_j^m by B_m (so that the A's are elements of i's partition, and the B's are elements of j's partition). Therefore M_iji = M_ij × M_ji, where:

    M_ij = | µ(B_1|A_1)  ···  µ(B_k|A_1) |        M_ji = | µ(A_1|B_1)  ···  µ(A_k|B_1) |
           | µ(B_1|A_2)  ···  µ(B_k|A_2) |               | µ(A_1|B_2)  ···  µ(A_k|B_2) |
           |     ⋮        ⋱       ⋮      |               |     ⋮        ⋱       ⋮      |
           | µ(B_1|A_k)  ···  µ(B_k|A_k) |               | µ(A_1|B_k)  ···  µ(A_k|B_k) |

M_iji is the matrix whose (l, m) entry is µ(A_m |_j A_l), where µ(A_m |_j A_l) is the probability that a point in A_l will end up in A_m after a random move within A_l followed by a random move within some B_s:

    M_iji = | µ(A_1|_j A_1)  µ(A_2|_j A_1)  ···  µ(A_k|_j A_1) |
            | µ(A_1|_j A_2)  µ(A_2|_j A_2)  ···  µ(A_k|_j A_2) |
            |      ⋮              ⋮          ⋱        ⋮        |
            | µ(A_1|_j A_k)  µ(A_2|_j A_k)  ···  µ(A_k|_j A_k) |

Note that for x = λ, where λ is the empty string, inf(ij) = I(P_i; P_j), as in the standard definition:

    inf(ij) = Σ_{l=1}^k µ(P_i^l) IG(⃗µ(P_j|P_i^l), ⃗µ(P_j))
            = Σ_{l=1}^k µ(P_i^l) Σ_{t=1}^k µ(P_j^t|P_i^l) log( µ(P_j^t|P_i^l) / µ(P_j^t) )
            = Σ_{l,t=1}^k µ(P_j^t ∩ P_i^l) log( µ(P_j^t ∩ P_i^l) / (µ(P_j^t) µ(P_i^l)) )

6 Properties of transition matrices

The results in this section are either from the theory of Markov chains, or easily derived from these.

Definition 9: A matrix M is stochastic if all elements of M are reals in [0, 1] and the sum of every row is 1.

Fact 2: For every x, the matrix M_x is stochastic.
Definition 10: A matrix M is regular if there is m such that ∀(k, l) M^m(k, l) > 0.

The following fact establishes a connection between regular stochastic matrices and common knowledge:

Fact 3: The matrix M_ixi is regular iff there is no common knowledge between i and the individuals from x.

Fact 4: For every regular stochastic matrix M, there is a matrix M′ such that

    lim_{n→∞} M^n = M′

M′ is stochastic, and all the rows in M′ are the same. Moreover, the rate of convergence is exponential: for a given column r, let d_n(r) be the difference between the maximum and the minimum in M^n, in that column. Then there is ε < 1 such that for all columns r and all sufficiently large n, d_n(r) ≤ ε^n.

By combining the last two facts we get the following corollary:

Fact 5: If there is no common knowledge between i and the individuals in x, then

    lim_{n→∞} (M_ixi)^n = M

where M is stochastic, and all rows in M are equal to the vector ⃗u_i of probabilities of the sets in the partition P_i.

A matrix with all rows equal represents the situation that all information is lost and all that is known is the a priori probabilities.
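For the matrix M_iji of Example 3 this convergence is easy to watch numerically (our sketch):

    import numpy as np

    M_iji = np.array([[0.82, 0.18],
                      [0.18, 0.82]])   # Ann-Bob-Ann transition matrix from Example 3

    M = np.linalg.matrix_power(M_iji, 50)
    print(M)
    # Both rows converge to (.5, .5), the a priori probabilities of Ann's partition,
    # illustrating Facts 4 and 5: iterating the chain washes out all information.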

Fact 6: If L, S are stochastic matrices and all the rows of L are equal, then S × L = L, and L × S = L′, where all rows in L′ are equal (though they may be different from those of L).

Fact 7: For any stochastic matrix S and regular matrix M_ixi,

    S × lim_{n→∞} (M_ixi)^n = M′, where M′ = lim_{n→∞} (M_ixi)^n.

Definition 11: For a given partition P_i and string x = x_1 x_2 ... x_k we can define a relation ≈_x between the partitions P_i and P_j: P_i^m ≈_x P_j^n iff for w ∈ P_i^m and w′ ∈ P_j^n, there are v_1, ..., v_{k−1} such that v_1 ∈ P_i^m, v_{k−1} ∈ P_j^n and w ≈_{x_1} v_1 ≈_{x_2} ... v_{k−1} ≈_{x_k} w′.

Definition 12: ≈*_x is the transitive closure of ≈_x. It is an equivalence relation.

Fact 8: Assume that x contains all j. Then the relation ≈*_x does not depend on the particular x and we may drop the x. P_i^m ≈* P_j^n iff P_i^m and P_j^n are subsets of the same element of P⁻, where P⁻ is the meet of the partitions of all the individuals.

Observation: We can permute the elements of the partition P_i so that the elements of the same equivalence class of ≈* have consecutive numbers, and then M_ixi looks as follows:

    M_ixi = | M_1  ···   0  |
            |  ⋮    ⋱    ⋮  |
            |  0   ···  M_r |

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈*. All submatrices M_l are square and regular.



Note that if there is no common knowledge then ≈* has a single equivalence class. Since we can always renumber the elements of the partitions so that the transition matrix is in the form described above, we will assume from now on that the transition matrix is always given in such a form.

Fact 9: If x contains all j then

    lim_{n→∞} (M_ixi)^n = M

where M is stochastic, the submatrices M_l of M are regular (in fact positive) and all the rows within every submatrix M_l are the same.

7 Properties of inf(ixj)

Theorem 3: If there is no common knowledge and x includes all the individuals, then

    lim_{n→∞} inf(i(jxj)^n) = 0

Proof: The matrix M = lim_{n→∞} (M_jxj)^n has all rows positive and equal. Let ⃗m be a row vector of M. Then lim_{n→∞} inf(i(jxj)^n) = IG(⃗m, ⃗µ(P_j)). Since the limiting vector ⃗m is equal to the distribution ⃗µ(P_j), we get lim_{n→∞} inf(i(jxj)^n) = IG(⃗µ(P_j), ⃗µ(P_j)) = 0. ✷

The last theorem can be easily generalised to the following:

Fact 10: If there is no common knowledge among the individuals in x, and i, j occur in x, then as n → ∞, inf(ix^n j) goes to zero.
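As a numerical illustration (ours) of this section, take the Ann/Bob example and the alternating strings i j i j ... i with n round trips through Bob; the corresponding transition matrix is (M_iji)^n, and the resulting inf values tend to zero, as Theorem 3 and Fact 10 assert.

    import numpy as np

    def IG(u, y):
        nz = u > 0
        return float(np.sum(u[nz] * np.log2(u[nz] / y[nz])))

    M_ij = np.array([[0.9, 0.1],
                     [0.1, 0.9]])      # Ann/Bob example; M_ji = M_ij here
    M_iji = M_ij @ M_ij                # one Ann -> Bob -> Ann round trip
    mu_i = np.array([0.5, 0.5])        # a priori probabilities of Ann's partition

    for n in (1, 2, 5, 10):
        M = np.linalg.matrix_power(M_iji, n)   # transition matrix for n round trips
        val = sum(mu_i[l] * IG(M[l], mu_i) for l in range(2))
        print(n, round(val, 4))        # ≈ 0.32, 0.12, 0.008, 0.0001 — tending to 0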

8 Probabilistic common knowledge

Common knowledge is very rare. But even if there is no common knowledge in the system, we often have probabilistic common knowledge.

Definition 13: Individuals {1, ..., n} have probabilistic common knowledge if

    ∀x ∈ {1, ..., n}*  inf(x) > 0

We note that there is no probabilistic common knowledge in the system iff there is some string x such that for some i, M_xi is a matrix with all rows equal and M_xi(·, t) = µ(P_i^t) for all t.



Theorem 4: If there is common knowledge in the system then there is probabilistic common knowledge, and

    ∀x ∈ {1, ..., n}*  inf(x) ≥ H(P⁻)

Proof: We know from Fact 9 that

    M_ixi = | M_1  ···   0  |
            |  ⋮    ⋱    ⋮  |
            |  0   ···  M_r |

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈*_x, and all submatrices M_l are square and regular. Here r is the number of elements of the partition P⁻. Suppose that the probabilities of the sets in the partition P_i are u_1, ..., u_k and that the probabilities of the partition P⁻ are w_1, ..., w_r. Each w_j is going to be the sum of those u_l where the lth set in the partition P_i is a subset of the jth set in the partition P⁻. Let ⃗m_l be the lth row of the matrix M_ixi. Then inf(ixi) is Σ_{l=1}^k u_l IG(⃗m_l, ⃗u). The row ⃗m_l consists of zeroes, except in places corresponding to subsets of the appropriate element P_j⁻ of P⁻. Then, by Theorem 2, part (a),

    IG(⃗m_l, ⃗u) ≥ log( 1 / (1 − (1 − w_j)) ) = −log w_j

This quantity may repeat, since several elements of P_i may be contained in P_j⁻. When we add up all the multipliers u_l that occur with −log w_j, these multipliers add up to w_j. Thus we get

    inf(ixi) ≥ − Σ_{j=1}^r w_j log(w_j) = H(P⁻)   ✷
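A small numerical illustration (ours) of Theorem 4, on a hypothetical two-agent example with a nontrivial meet: W = {1, 2, 3, 4} with all points of probability .25, P_A = {{1,2},{3},{4}} and P_B = {{1},{2},{3,4}}, so that P⁻ = {{1,2},{3,4}} and H(P⁻) = 1 bit. In this example the bound is attained with equality.

    import numpy as np

    def IG(u, y):
        nz = u > 0
        return float(np.sum(u[nz] * np.log2(u[nz] / y[nz])))

    def inf_string(M, mu_first, mu_last):
        # inf for a string whose first agent has marginals mu_first and last agent mu_last.
        return sum(mu_first[l] * IG(M[l], mu_last) for l in range(len(mu_first)))

    mu_A = np.array([0.5, 0.25, 0.25])       # probabilities of {1,2}, {3}, {4}
    mu_B = np.array([0.25, 0.25, 0.5])       # probabilities of {1}, {2}, {3,4}
    M_AB = np.array([[0.5, 0.5, 0.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0]])       # mu(B_b | A_a)
    M_BA = np.array([[1.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.5, 0.5]])       # mu(A_b | B_a)

    print(inf_string(M_AB, mu_A, mu_B))          # inf(AB)  = 1.0 >= H(P-) = 1
    print(inf_string(M_AB @ M_BA, mu_A, mu_A))   # inf(ABA) = 1.0 >= H(P-) = 1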

We can also show:

Theorem 5: If x contains i, j and there is common knowledge between i, j and all the components of x, then the limiting information always exists and lim_{n→∞} inf(i(jxj)^n) = H(P⁻).

We postpone the proof to the full paper.

References

[Ab] Abramson, N., Information Theory and Coding, McGraw-Hill, 1963.

[AH] Abadi, M. and Halpern, J., "Decidability and Expressiveness for First-Order Logics of Probability", Proc. of the 30th Annual Conference on Foundations of Computer Science, 1989, pp. 148-153.

[Au] Aumann, R., "Agreeing to Disagree", Annals of Statistics, 4, 1976, pp. 1236-1239.

[Ba] Bacchus, F., "On Probability Distributions over Possible Worlds", Proceedings of the 4th Workshop on Uncertainty in AI, 1988, pp. 15-21.

[CM] Clark, H. H. and Marshall, C. R., "Definite Reference and Mutual Knowledge", in Elements of Discourse Understanding, ed. Joshi, Webber and Sag, Cambridge U. Press, 1981.

[Dr] Dretske, F., Knowledge and the Flow of Information, MIT Press, 1981.

[Ha] Halpern, J., "An Analysis of First-Order Logics of Probability", Proc. of the 11th International Joint Conference on Artificial Intelligence (IJCAI 89), 1989, pp. 1375-1381.

[HM] Halpern, J. and Moses, Y., "Knowledge and Common Knowledge in a Distributed Environment", Proc. 3rd ACM Conf. on Principles of Distributed Computing, 1984, pp. 50-61.

[KS] Kemeny, J. and Snell, L., Finite Markov Chains, Van Nostrand, 1960.

[Pa] Parikh, R., "Levels of Knowledge in Distributed Computing", Proc. IEEE Symposium on Logic in Computer Science, 1986, pp. 322-331.

[Pa2] Parikh, R., "A Utility Based Approach to Vague Predicates". To appear.

[PK] Parikh, R. and Krasucki, P., "Levels of Knowledge in Distributed Computing", Research report, Brooklyn College, CUNY, 1986. Revised version of [Pa] above.

[Sh] Shannon, C., "A Mathematical Theory of Communication", Bell System Technical Journal, 27, 1948. (Reprinted in: Shannon and Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1964.)
