
The Asymptotic Equipartition Property and Entropy Rates

Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering
National Taiwan University


Law of Large Numbers:
For independent, identically distributed (i.i.d.) random variables, $\frac{1}{n}\sum_{i=1}^{n} X_i$ is close to its expected value $EX$ for large values of n.

The Asymptotic Equipartition Property:
$-\frac{1}{n}\log P(X_1, X_2, \ldots, X_n)$ is close to the entropy $H$, where $X_1, X_2, \ldots, X_n$ are i.i.d. random variables and $P(X_1, X_2, \ldots, X_n)$ is the probability of observing the sequence $X_1, X_2, \ldots, X_n$.

Thus, the probability $P(X_1, X_2, \ldots, X_n)$ assigned to an observed sequence will be close to $2^{-nH}$.


Almost all events are almost equally surprising:
$$\Pr\left\{ (X_1, X_2, \ldots, X_n) : P(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \epsilon)} \right\} \approx 1$$
if $X_1, X_2, \ldots, X_n$ are i.i.d. $\sim P(x)$.

Thm (AEP): If $X_1, X_2, \ldots$ are i.i.d. $\sim P(x)$, then $-\frac{1}{n}\log P(X_1, X_2, \ldots, X_n) \to H(X)$ in probability.

Proof: Since the $X_i$ are i.i.d., so are the $\log P(X_i)$. By the weak law of large numbers,

$$-\frac{1}{n}\log P(X_1, X_2, \ldots, X_n) = -\frac{1}{n}\sum_{i=1}^{n} \log P(X_i) \to -E\log P(X) = H(X) \quad \text{in probability.}$$
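As a quick numerical illustration of this convergence (not part of the original slides), the Python sketch below draws i.i.d. samples from an assumed three-symbol source and compares $-\frac{1}{n}\log_2 P(X_1, \ldots, X_n)$ with the entropy $H(X)$; the pmf and the helper name aep_estimate are our own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed three-symbol source; any pmf would do for this illustration.
    p = np.array([0.5, 0.3, 0.2])
    H = -np.sum(p * np.log2(p))              # entropy H(X) in bits

    def aep_estimate(n):
        """Return -(1/n) log2 P(X_1, ..., X_n) for one i.i.d. sample of length n."""
        xs = rng.choice(len(p), size=n, p=p)
        return -np.mean(np.log2(p[xs]))      # = -(1/n) * sum_i log2 p(x_i)

    for n in (10, 100, 1000, 10000):
        print(f"n={n:6d}   -1/n log2 P = {aep_estimate(n):.4f}   H(X) = {H:.4f}")
    # The empirical quantity concentrates around H(X), so P(X_1,...,X_n) is close to 2^(-nH).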


The typical set $A_\epsilon^{(n)}$ w.r.t. P(x) is the set of sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the following property:
$$2^{-n(H(X)+\epsilon)} \le P(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$

The properties of the typical set $A_\epsilon^{(n)}$:
i. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then $H(X) - \epsilon \le -\frac{1}{n}\log P(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon$.
ii. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for n sufficiently large.
iii. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.
iv. $|A_\epsilon^{(n)}| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$ for n sufficiently large.

Thus, the typical set has probability nearly 1, all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly $2^{nH}$.


Proof:
(i) From the definition of the typical set,
$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$
Taking logarithms and dividing by n gives
$$H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon.$$

(ii) Since the probability of the event
$$\left| -\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) - H(X) \right| < \epsilon$$
tends to 1 as n → ∞, for any δ > 0 there exists an $n_0$ such that for all n ≥ $n_0$ we have
$$\Pr\left\{ \left| -\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) - H(X) \right| < \epsilon \right\} > 1 - \delta.$$
Setting δ = ε, we have
$$\Pr\left\{ A_\epsilon^{(n)} \right\} > 1 - \epsilon.$$


(iii)
$$1 = \sum_{x^n \in \mathcal{X}^n} P(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} P(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = 2^{-n(H(X)+\epsilon)} \left| A_\epsilon^{(n)} \right|,$$
hence
$$\left| A_\epsilon^{(n)} \right| \le 2^{n(H(X)+\epsilon)}.$$

(iv) For sufficiently large n, $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ (from (ii)), so
$$1 - \epsilon < \Pr\left\{ A_\epsilon^{(n)} \right\} \le \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} = 2^{-n(H(X)-\epsilon)} \left| A_\epsilon^{(n)} \right|,$$
hence
$$\left| A_\epsilon^{(n)} \right| \ge (1 - \epsilon)\, 2^{n(H(X)-\epsilon)}.$$
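Properties i-iv can also be checked by brute force for a small source. The sketch below (our illustration, assuming a Bernoulli(0.3) source, n = 16 and ε = 0.1) enumerates $A_\epsilon^{(n)}$, counts its elements and sums their probability.

    import itertools
    import numpy as np

    p1, n, eps = 0.3, 16, 0.1                       # assumed source and parameters
    H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    def log2_prob(xs):
        k = sum(xs)                                 # number of ones in the sequence
        return k * np.log2(p1) + (len(xs) - k) * np.log2(1 - p1)

    count, prob = 0, 0.0
    for xs in itertools.product((0, 1), repeat=n):
        lp = log2_prob(xs)
        if -n * (H + eps) <= lp <= -n * (H - eps):  # definition of the typical set
            count += 1
            prob += 2.0 ** lp

    print(f"|A| = {count}   (bound 2^(n(H+eps)) = {2 ** (n * (H + eps)):.0f})")
    print(f"Pr(A) = {prob:.3f}")                    # tends to 1 as n grows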


Consequences of the AEP: Data Compression

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables $\sim p(x)$.
Data compression / compaction: to find "short descriptions" for such sequences of rv's.
We divide all sequences in $\mathcal{X}^n$ into two sets: the typical set $A_\epsilon^{(n)}$ and its complement.

[Figure: the set $\mathcal{X}^n$ ($|\mathcal{X}|^n$ elements) partitioned into the non-typical set $A_\epsilon^{(n)c}$ and the typical set $A_\epsilon^{(n)}$, which contains at most $2^{n(H+\epsilon)}$ elements.]


We order all elements in each set according to some order (say lexicographic order). Each sequence in $A_\epsilon^{(n)}$ can then be represented by giving the index of the sequence in the set.
Since there are at most $2^{n(H(X)+\epsilon)}$ sequences in $A_\epsilon^{(n)}$, the indexing requires no more than $n(H(X)+\epsilon)+1$ bits. (The extra bit may be necessary because $n(H(X)+\epsilon)$ may not be an integer.)
We prefix all these sequences by a '0', giving a total length of at most $n(H(X)+\epsilon)+2$ bits to represent each sequence in $A_\epsilon^{(n)}$.


Similarly, we can index each sequence in $A_\epsilon^{(n)c}$ by using not more than $n\log|\mathcal{X}|+1$ bits. Prefixing these indices by '1', we have a code for all the sequences in $\mathcal{X}^n$.

Some features of the above coding scheme:
The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
We have used a brute-force enumeration of the atypical set, without taking into account the fact that the number of elements in $A_\epsilon^{(n)c}$ is less than the number of elements in $\mathcal{X}^n$. (Surprisingly, this is good enough to yield an efficient description.)
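A minimal sketch of this two-part code, assuming a Bernoulli(0.3) source and a small block length so the index tables can be built by exhaustive enumeration; all names and parameter values here are illustrative, not from the slides.

    import itertools
    import numpy as np

    p1, n, eps = 0.3, 12, 0.2
    H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    def is_typical(xs):
        lp = sum(np.log2(p1) if x else np.log2(1 - p1) for x in xs)
        return -n * (H + eps) <= lp <= -n * (H - eps)

    seqs     = list(itertools.product((0, 1), repeat=n))    # lexicographic order
    typical  = [xs for xs in seqs if is_typical(xs)]
    atypical = [xs for xs in seqs if not is_typical(xs)]
    idx_typ  = {xs: i for i, xs in enumerate(typical)}
    idx_atyp = {xs: i for i, xs in enumerate(atypical)}

    bits_typ  = int(np.ceil(np.log2(len(typical))))          # <= n(H+eps)+1 bits
    bits_atyp = int(np.ceil(np.log2(len(atypical))))         # <= n log2|X| + 1 bits

    def encode(xs):
        if xs in idx_typ:                                     # flag '0' + index in A
            return '0' + format(idx_typ[xs], f'0{bits_typ}b')
        return '1' + format(idx_atyp[xs], f'0{bits_atyp}b')   # flag '1' + index in A^c

    def decode(bits):
        index = int(bits[1:], 2)
        return typical[index] if bits[0] == '0' else atypical[index]

    xs = tuple(int(b) for b in np.random.default_rng(1).random(n) < p1)
    assert decode(encode(xs)) == xs                           # one-to-one, easily decodable

The codeword lengths produced here stay within the n(H(X)+ε)+2 and n log|X|+2 budgets used in the expected-length bound derived on the following slides.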


The typical sequences have short descriptions, of length approximately nH(X).

Let $l(x^n)$ denote the codeword length corresponding to $x^n = (x_1, x_2, \ldots, x_n)$. The expected codeword length is
$$E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n) = \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\, l(x^n) + \sum_{x^n \in A_\epsilon^{(n)c}} p(x^n)\, l(x^n)$$
$$\le \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\,\bigl(n(H+\epsilon)+2\bigr) + \sum_{x^n \in A_\epsilon^{(n)c}} p(x^n)\,\bigl(n\log|\mathcal{X}|+2\bigr)$$
$$= \Pr\left\{A_\epsilon^{(n)}\right\}\bigl(n(H+\epsilon)+2\bigr) + \Pr\left\{A_\epsilon^{(n)c}\right\}\bigl(n\log|\mathcal{X}|+2\bigr) \le n(H+\epsilon) + \epsilon\, n\log|\mathcal{X}| + 2 = n(H+\epsilon'),$$
where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + \frac{2}{n}$ can be made arbitrarily small by an appropriate choice of $\epsilon$ and n.


Theorem:
Let $X^n$ be i.i.d. $\sim p(x)$ and let ε > 0. There exists a code which maps sequences $x^n$ of length n into binary strings such that the mapping is one-to-one (and therefore invertible) and
$$E\left[ \frac{1}{n}\, l(X^n) \right] \le H(X) + \epsilon$$
for sufficiently large n.

One can represent sequences $X^n$ using nH(X) bits on the average!!


Markov's and Chebyshev's inequalities:

(i) Markov's inequality:
For any non-negative rv X and any δ > 0,
$$\Pr\{X \ge \delta\} \le \frac{EX}{\delta}.$$

(ii) Chebyshev's inequality:
Let Y be a random variable with mean μ and variance σ². Letting X = (Y - μ)² in Markov's inequality, then for any ε > 0,
$$\Pr\{|Y - \mu| \ge \epsilon\} \le \frac{\sigma^2}{\epsilon^2}.$$
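A small Monte Carlo check of Chebyshev's inequality (our illustration; the exponential distribution and sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)

    mu, sigma2 = 1.0, 1.0                              # mean and variance of Exp(1)
    samples = rng.exponential(scale=1.0, size=1_000_000)

    for eps in (0.5, 1.0, 2.0, 3.0):
        empirical = np.mean(np.abs(samples - mu) >= eps)
        bound = sigma2 / eps ** 2
        print(f"eps={eps}:  Pr(|Y-mu| >= eps) ~ {empirical:.4f}  <=  sigma^2/eps^2 = {bound:.4f}")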


Entropy Rate of a Stochastic Process

The AEP described above establishes that nH(X) bits suffice on the average to describe n independent and identically distributed (i.i.d.) random variables.
But what if the random variables are dependent?
We will show, just as in the i.i.d. case, that the entropy $H(X_1, X_2, \ldots, X_n)$ grows (asymptotically) linearly with n at a rate H(X), which will be called the entropy rate of the process.


Markov Chain

A stochastic process is an indexed sequence of rv's. In general, there can be an arbitrary dependence among them. The process is characterized by the joint probability mass functions
$$\Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\} = p(x_1, x_2, \ldots, x_n), \quad (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n, \; n = 1, 2, \ldots$$

Definition: A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of rv's is invariant w.r.t. shifts in the time index, i.e.,
$$\Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, X_{2+l} = x_2, \ldots, X_{n+l} = x_n\}$$
for every shift l and for all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.


A simple example of a stochastic process with dependence is one in which each rv depends on the one preceding it and is conditionally independent of all the other preceding rv's. Such a process is said to be Markov.

Definition:
A discrete stochastic process $X_1, X_2, \ldots$ is said to be a Markov chain or a Markov process if, for n = 1, 2, ...,
$$\Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\} = \Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n\}$$
for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.


In this case, the joint probability mass function of the rv's can be written as:
$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1}).$$

Definition:
The Markov chain is said to be time invariant if the conditional probability $p(x_{n+1} \mid x_n)$ does not depend on n, i.e., for n = 1, 2, ...,
$$\Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\} \quad \text{for all } a, b \in \mathcal{X}.$$


Recall:
If {X_i} is a Markov chain, then X_n is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, where $P_{ij} = \Pr\{X_{n+1} = j \mid X_n = i\}$.
If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, then the Markov chain is said to be irreducible. If the largest common factor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.
If the pmf of the rv at time n is $p(x_n)$, then the pmf at time n+1 is
$$p(x_{n+1}) = \sum_{x_n} p(x_n)\, P_{x_n x_{n+1}}.$$
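A tiny numerical sketch of this update rule, with a two-state transition matrix of our own choosing:

    import numpy as np

    # Assumed transition matrix: P[i, j] = Pr{X_{n+1} = j | X_n = i}.
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

    p = np.array([1.0, 0.0])        # initial pmf: start in state 0
    for n in range(6):
        print(f"n={n}: p = {p}")
        p = p @ P                   # p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n, x_{n+1}}
    # For this (irreducible, aperiodic) P the pmf tends to the stationary distribution (0.8, 0.2).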


A distribution on the states such that the distribution at time n+1 is the same as the distribution at time n is called a stationary distribution.
If the initial state of a Markov chain is drawn according to a stationary distribution, then the Markov chain forms a stationary process.
If the finite-state Markov chain is irreducible and aperiodic, then the stationary distribution is unique, and from any starting distribution, the distribution of X_n tends to the stationary distribution as n → ∞.


Example: Consider a two-state Markov chain with a probability transition matrix
$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}.$$
The stationary probability vector $(\mu_1, \mu_2)$ can be found by solving the equation $\mu P = \mu$ together with $\mu_1 + \mu_2 = 1$, or, more simply, by balancing probabilities.

[Figure: state transition graph of the two-state chain, with cross-transition probabilities α and β and self-loop probabilities 1-α and 1-β.]


For the stationary distribution, the net probability flow across any cut-set in the state transition graph is 0. Applying this property to the above state transition graph, we obtain
$$\mu_1 \alpha = \mu_2 \beta.$$
Since $\mu_1 + \mu_2 = 1$, the stationary distribution is
$$\mu_1 = \frac{\beta}{\alpha+\beta}, \qquad \mu_2 = \frac{\alpha}{\alpha+\beta}.$$
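A one-line check of this result for illustrative values α = 0.2 and β = 0.6 (our choice):

    import numpy as np

    alpha, beta = 0.2, 0.6
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    mu = np.array([beta, alpha]) / (alpha + beta)   # claimed stationary distribution
    print(mu, mu @ P)                               # mu @ P equals mu, so mu is stationary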


If the Markov chain has an initial state drawn according to the stationary distribution, the resulting process will be stationary. The entropy of the state X_n at time n is
$$H(X_n) = H\!\left( \frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta} \right).$$
However, this is not the rate at which the entropy $H(X_1, X_2, \ldots, X_n)$ grows.


Entropy Rate:

If we have a sequence of n rv's, a natural question to ask is: "How does the entropy of the sequence grow with n?"

Definition:
The entropy rate of a stochastic process {X_i} is defined by
$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
when the limit exists.


Examples:

1. Typewriter
A typewriter that has m equally likely output letters can produce $m^n$ sequences of length n, all of them equally likely. Hence $H(X_1, X_2, \ldots, X_n) = \log m^n$, and the entropy rate is
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \log m \ \text{bits per symbol}.$$

2. $X_1, X_2, \ldots, X_n$ are i.i.d. rv's. Then
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} \frac{n H(X_1)}{n} = H(X_1).$$


3. Sequence of independent, but not identically distributed rv's.
In this case,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i),$$
but the $H(X_i)$'s are not all equal. We can choose a sequence of distributions on $X_1, X_2, \ldots$ such that the limit of $\frac{1}{n}\sum_i H(X_i)$ does not exist.
An example of such a sequence is a random binary sequence where $p_i = P(X_i = 1)$ is not constant, but a function of i, chosen carefully so that the limit in $\lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$ does not exist.


For example, let
$$p_i = \begin{cases} 0.5, & \text{if } 2k < \log\log i \le 2k+1, \\ 0, & \text{if } 2k+1 < \log\log i \le 2k+2, \end{cases} \quad k = 0, 1, 2, \ldots$$
Then there are arbitrarily long stretches where $H(X_i) = 1$, followed by exponentially longer segments where $H(X_i) = 0$. Hence the running average of the $H(X_i)$ will oscillate between 0 and 1 and will not have a limit. Thus H(X) is not defined for this process.

Now, let's define a related quantity for the entropy rate:
$$H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$$
when the limit exists.


Note that:
$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$ : the per-symbol entropy of the n rv's.
$H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$ : the conditional entropy of the last rv given the past.
For stationary processes both limits exist and are equal.

Theorem:
For a stationary stochastic process,
H(X) = H'(X).
We will first prove that $\lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$ exists.


Lemma 1: For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots, X_1)$ is decreasing in n and has a limit H'(X).

Proof:
$$H(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_1) \le H(X_{n+1} \mid X_n, \ldots, X_2) = H(X_n \mid X_{n-1}, \ldots, X_1),$$
where the inequality holds because conditioning reduces entropy and the equality follows from stationarity. Since $H(X_n \mid X_{n-1}, \ldots, X_1)$ is a decreasing sequence of non-negative numbers, it has a limit H'(X).


Lemma 2 (Cesàro mean):
If $a_n \to a$ and $b_n = \frac{1}{n}\sum_{i=1}^{n} a_i$, then $b_n \to a$.

Proof:
Since $a_n \to a$, there exists a number N(ε) such that $|a_n - a| \le \epsilon$ for all n ≥ N(ε). Hence
$$|b_n - a| = \left| \frac{1}{n}\sum_{i=1}^{n} (a_i - a) \right| \le \frac{1}{n}\sum_{i=1}^{n} |a_i - a| \le \frac{1}{n}\sum_{i=1}^{N(\epsilon)} |a_i - a| + \frac{n - N(\epsilon)}{n}\,\epsilon \le \frac{1}{n}\sum_{i=1}^{N(\epsilon)} |a_i - a| + \epsilon$$


for all n ≥ N(ε). Since the first term goes to zero as n → ∞, we can make $|b_n - a| \le 2\epsilon$ by taking n large enough. Hence $b_n \to a$ as n → ∞.

Proof of the theorem: By the chain rule,
$$\frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1),$$
i.e., the entropy rate is the time average of the conditional entropies. But we know that the conditional entropies tend to a limit H'(X). Hence, by Lemma 2, their running average has a limit, which is equal to H'(X). Thus,
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = H'(\mathcal{X}).$$
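A quick numerical illustration of the Cesàro mean lemma used above, with an arbitrary convergent sequence of our own choosing (a_n = 1 + (-1)^n / n):

    import numpy as np

    n = np.arange(1, 100001)
    a = 1.0 + ((-1.0) ** n) / n          # a_n -> 1
    b = np.cumsum(a) / n                 # b_n = (1/n) * sum_{i<=n} a_i

    for k in (10, 100, 1000, 100000):
        print(f"n={k:6d}   a_n = {a[k-1]:.6f}   b_n = {b[k-1]:.6f}")
    # The running average b_n converges to the same limit as a_n.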


The significance of the entropy rate of a stochastic process arises from the AEP for a stationary ergodic process.

Remark:
For any stationary ergodic process,
$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(\mathcal{X})$$
with probability 1.

Markov chains:
For a stationary Markov chain, the entropy rate is given by
$$H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1).$$


Theorem:
Let {X_i} be a stationary Markov chain with stationary distribution μ and transition matrix P. Then the entropy rate is
$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}.$$

Proof:
$$H(\mathcal{X}) = H(X_2 \mid X_1) = \sum_{i} \mu_i \left( -\sum_{j} P_{ij} \log P_{ij} \right).$$

Therefore, the entropy rate of the two-state Markov chain example is
$$H(\mathcal{X}) = H(X_2 \mid X_1) = \frac{\beta}{\alpha+\beta} H(\alpha) + \frac{\alpha}{\alpha+\beta} H(\beta).$$
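A short numerical check of this formula for illustrative values α = 0.1 and β = 0.4 (our choice); it evaluates both the double sum and the closed form above.

    import numpy as np

    alpha, beta = 0.1, 0.4
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])
    mu = np.array([beta, alpha]) / (alpha + beta)       # stationary distribution

    def h2(p):                                           # binary entropy in bits
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    H_rate = -np.sum(mu[:, None] * P * np.log2(P))       # -sum_ij mu_i P_ij log2 P_ij
    print(H_rate, mu[0] * h2(alpha) + mu[1] * h2(beta))  # the two expressions agree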


Concluding Remark:
If the Markov chain is irreducible and aperiodic, then it has a unique stationary distribution on the states, and any initial distribution tends to the stationary distribution as n → ∞.


Hidden Markov Models:
Let $X_1, X_2, \ldots, X_n, \ldots$ be a stationary Markov chain, and let $Y_i = \Phi(X_i)$ be a process, each term of which is a function of the corresponding state in the Markov chain. Such functions of Markov chains occur often in practice!!
It would simplify matters greatly if $Y_1, Y_2, \ldots, Y_n$ also formed a Markov chain, but in many cases this is not true.


However, since the Markov chain is stationary, so is $Y_1, \ldots, Y_n$, and the entropy rate H(Y) is well defined.
If we wish to compute H(Y), we might compute $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ for each n and find the limit. But since the convergence can be arbitrarily slow, we will never know how close we are to the limit; we will not know when to stop. (We cannot look at the change between the values at n and n+1, since this difference may be small even when we are far away from the limit!!)


Remark:
It would be useful computationally to have upper and lower bounds converging to the limit from above and below! We can then halt the computation when the difference between the upper bound and the lower bound is small, and we will have a good estimate of the limit.

Recall: For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots, X_1)$ is decreasing in n and has a limit H'(X).
→ $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ converges monotonically to H(Y) from above: $H(Y_n \mid Y_{n-1}, \ldots, Y_1) \ge H(Y)$.


For a lower bound, we will use $H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1)$. This is a neat trick based on the idea that $X_1$ contains as much information about $Y_n$ as $Y_1, Y_0, Y_{-1}, \ldots$

Lemma: $H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) \le H(Y)$
Proof: We have, for k = 1, 2, ...,
H(Y_n | Y_{n-1}, ..., Y_2, X_1)
(a) = H(Y_n | Y_{n-1}, ..., Y_2, Y_1, X_1)
(b) = H(Y_n | Y_{n-1}, ..., Y_1, X_1, X_0, X_{-1}, ..., X_{-k})
(c) = H(Y_n | Y_{n-1}, ..., Y_1, X_1, X_0, X_{-1}, ..., X_{-k}, Y_0, ..., Y_{-k})
(d) ≤ H(Y_n | Y_{n-1}, ..., Y_1, Y_0, ..., Y_{-k})
    = H(Y_{n+k+1} | Y_{n+k}, ..., Y_1)   (stationarity)

(a) Y_1 is a function of X_1.
(b) Markovity of X.
(c) Y_i is a function of X_i.
(d) Conditioning reduces entropy.


Since the inequality is true for all k, it is true in the limit. Thus,
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le \lim_{k\to\infty} H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_1) = H(\mathcal{Y}).$$

The next lemma shows that the interval between the upper and the lower bounds decreases in length.

Lemma:
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \to 0.$$


Proof: The interval length can be written as
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1).$$
By the properties of mutual information, $I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1)$, and $I(X_1; Y_1, Y_2, \ldots, Y_n)$ increases with n. Hence $\lim_{n\to\infty} I(X_1; Y_1, Y_2, \ldots, Y_n)$ exists and
$$\lim_{n\to\infty} I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1).$$
By the chain rule,
$$H(X_1) \ge \lim_{n\to\infty} I(X_1; Y_1, Y_2, \ldots, Y_n) = \lim_{n\to\infty} \sum_{i=1}^{n} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1) = \sum_{i=1}^{\infty} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1).$$
Since this infinite sum is finite and the terms are non-negative, the terms must tend to 0, i.e.,
$$\lim_{n\to\infty} I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1) = 0.$$


Combining the previous two lemmas, we have the following theorem:

Theorem:
If $X_1, X_2, \ldots, X_n$ form a stationary Markov chain and $Y_i = \Phi(X_i)$, then
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1)$$
and
$$\lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1).$$
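These bounds can be computed by brute force for a toy example. The sketch below assumes a 3-state stationary chain and a deterministic map Φ that merges states 1 and 2, so {Y_i} is a function of a Markov chain but generally not Markov itself; the transition matrix and all names are our illustrative choices.

    import itertools
    import numpy as np

    # Assumed 3-state transition matrix (rows sum to 1) and output map.
    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
    phi = lambda x: min(x, 1)                      # merge states 1 and 2 into output symbol 1

    # Stationary distribution mu: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    mu = mu / mu.sum()

    def seq_prob(xs):
        pr = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            pr *= P[a, b]
        return pr

    def entropy_of(project, n):
        """Entropy (bits) of project(X_1, ..., X_n) under the stationary chain."""
        dist = {}
        for xs in itertools.product(range(3), repeat=n):
            key = project(xs)
            dist[key] = dist.get(key, 0.0) + seq_prob(xs)
        return -sum(p * np.log2(p) for p in dist.values() if p > 0)

    ys  = lambda xs: tuple(phi(x) for x in xs)                 # (Y_1, ..., Y_n)
    xys = lambda xs: (xs[0],) + tuple(phi(x) for x in xs[1:])  # (X_1, Y_2, ..., Y_n)

    for n in range(2, 7):
        upper = entropy_of(ys, n) - entropy_of(ys, n - 1)      # H(Y_n | Y_{n-1}, ..., Y_1)
        lower = entropy_of(xys, n) - entropy_of(xys, n - 1)    # H(Y_n | Y_{n-1}, ..., Y_2, X_1)
        print(f"n={n}:  {lower:.4f} <= H(Y) <= {upper:.4f}")
    # The lower and upper bounds squeeze toward the entropy rate H(Y) as n increases.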


In general, we could also consider the case where $Y_i$ is a stochastic function (as opposed to a deterministic function) of $X_i$.
Consider a Markov process $X_1, X_2, \ldots, X_n$, and define a new process $Y_1, Y_2, \ldots, Y_n$, where each $Y_i$ is drawn according to $p(y_i \mid x_i)$, conditionally independent of all other $X_j$, $j \ne i$; that is,
$$p(x^n, y^n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_i) \prod_{i=1}^{n} p(y_i \mid x_i).$$


[Figure: graphical model of the process: the state chain $X_1 \to X_2 \to \cdots \to X_n$ with initial distribution $p(x_1)$ and transitions $p(x_2 \mid x_1), \ldots, p(x_n \mid x_{n-1})$; each state $X_i$ emits an observation $Y_i$ according to $p(y_i \mid x_i)$.]

Such a process, called a hidden Markov model (HMM), is used extensively in speech recognition, handwriting recognition, and so on.
The same argument as that used for functions of a Markov chain carries over to HMMs, and we can lower-bound the entropy rate of an HMM by conditioning it on the underlying Markov states.


Rate Distortion Functions

Recall:
For discrete-amplitude memoryless sources,
$$H(X) = E[I(X)] = -\sum_{k=1}^{K} P(x_k) \log_2 P(x_k).$$
For discrete-amplitude sources with memory,
$$H_\infty(X) = \lim_{N\to\infty} H(X_N \mid X_{N-1}, \ldots, X_1),$$
with
$$H_\infty(X) \le H(X) \le \log_2 K,$$
where the left inequality is an equality for a source without memory and the right one for a uniform distribution.


Source redundancy = log_2 K - H(X), due to
– the non-uniform distribution of P(x_k),
– the presence of memory,
– a non-Gaussian pdf, for continuous sources.

For continuous-amplitude memoryless sources,
$$h(X) = E[-\log_2 P(X)] = -\int P(x) \log_2 P(x)\, dx.$$
For continuous-amplitude sources with memory,
$$h_\infty(X) = \lim_{N\to\infty} h(X_N \mid X_{N-1}, \ldots, X_1),$$
with
$$h_\infty(X) \le h(X) \le \frac{1}{2} \log_2\!\left( 2\pi e\, \sigma_x^2 \right),$$
where the right-hand bound is attained by the Gaussian distribution.
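As a concrete check of the Gaussian upper bound, the snippet below compares closed-form differential entropies of three distributions with the same variance (the value σ² = 2 is an arbitrary choice of ours):

    import numpy as np

    sigma2 = 2.0

    h_gauss   = 0.5 * np.log2(2 * np.pi * np.e * sigma2)    # (1/2) log2(2*pi*e*sigma^2)
    h_uniform = np.log2(np.sqrt(12 * sigma2))               # uniform pdf of width sqrt(12*sigma^2)
    h_laplace = np.log2(2 * np.e * np.sqrt(sigma2 / 2))     # Laplacian with the same variance

    print(f"Gaussian  h = {h_gauss:.4f} bits   (the upper bound)")
    print(f"Uniform   h = {h_uniform:.4f} bits")
    print(f"Laplacian h = {h_laplace:.4f} bits")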


Source and Channel Coding Theorems

According to the noisy channel coding theorem, information can be transmitted reliably (i.e., without error) over a noisy channel at any source rate, say R, below the so-called capacity C of the channel:
R < C for reliable transmission.

According to the source coding theorem, there exists a mapping from source waveforms to codewords such that for a given distortion D, R(D) bits (per source sample) are sufficient to enable waveform reconstruction with an average distortion that is arbitrarily close to D. The actual rate R has to obey:
R ≥ R(D) for fidelity given by D.

The function R(D) is called the rate distortion function. Its inverse, D(R), is called the distortion rate function.


Digital Communication of Waveforms

[Figure: block diagram Source → Waveform encoder → Digital channel (capacity C, error probability Pe) → Waveform decoder → Sink. X(n) is the encoder input (after the sampler), U(n) and V(n) are the channel input and output, and Y(n) is the decoder output (before interpolation); the analog/digital boundary lies at the waveform encoder.]

Both C and R(D) are related to the concept of mutual information. Let X and Y refer to sequences of coder input and decoder output, and let U and V refer to sequences of channel input and channel output.
I(X;Y): the average information transmitted to the destination per sample.
Distortion D and I(X;Y) depend on the type of source coding. There is a minimum of I(X;Y) that is needed for reconstruction at the destination if the average distortion must not exceed the specified upper limit D. This minimum value of I(X;Y) is R(D).


The channel capacity C is related to the average mutual information I(U;V) per sample that characterizes channel input statistics and input-output mappings across the channel, as described by appropriate conditional probabilities. For a given channel, C is the maximum of I(U;V) over all channel input statistics.

Information transmission theorem:
C ≥ R(D) for reliable transmission with fidelity D;
i.e., waveform reconstruction with fidelity D is possible after transmission over a noisy channel, provided that its capacity is greater than R(D).


The significance of the information transmission theorem is that it justifies the separation of source coding (source representation) and channel coding (information transmission) functions in waveform communication, and provides a framework wherein the channel controls the rate, but not necessarily the accuracy, of waveform reproduction.
Thus the only property of the channel that should concern the source coder is the single parameter C, and the only requirement for efficient utilization of the channel is that its input statistics (the statistics at the output of the source coder) should be chosen so as to maximize I(U;V).


[Figure: rate R versus normalized distortion D/σ_x², showing the information-theory bound R(D) together with curves for high-, medium-, and low-complexity coders.]

The rate distortion curves are monotonically non-increasing (higher information-rate representations should lead to smaller average distortions).
The distortion at rate R = 0 should be finite if the input variance is finite.
For discrete-amplitude sources, R(0) = H(X).


The Power of the Lagrange Multiplier

Formal proof of:
1. The uniform distribution of a finite discrete source gives maximal entropy.

Cost function: maximize over p
$$\sum_{i=1}^{N} p_i \log_2 \frac{1}{p_i}$$
subject to
$$\sum_{i=1}^{N} p_i = 1.$$
Form the Lagrangian
$$f(p_1, p_2, \ldots, p_N) = \sum_{i=1}^{N} p_i \log_2 \frac{1}{p_i} + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right).$$
Setting the partial derivatives to zero,
$$\frac{\partial f}{\partial p_j} = \log_2 \frac{1}{p_j} - \log_2 e + \lambda = 0 \quad \text{for } j = 1, 2, \ldots, N.$$
Since everything in the above equation other than $p_j$ is a constant, it follows that each $p_j$ is the same. If all the $p_j$ are equal, each has value 1/N, and hence the maximum entropy is
$$H(p) = \log_2 N.$$
1


2. Channel capacity via a Lagrange multiplier

[Figure: a channel with inputs X ∈ {a, b, c} and outputs Y ∈ {a', b', c'}; the edge transition probabilities shown are 1, p, p, q, q.]

Let the input probabilities be p(a) = P and p(b) = p(c) = Q, so the constraint is P + 2Q = 1.
In order to find the channel capacity, we have to choose P and Q in such a way as to maximize the mutual information I(X;Y), subject to the constraint P + 2Q = 1. Introducing a Lagrange multiplier λ for the constraint, setting the derivatives of the Lagrangian with respect to P and Q to zero, and eliminating λ gives the capacity-achieving input distribution and hence the channel capacity.
