
The Asymptotic Equipartition Property and Entropy Rates

Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering
National Taiwan University


Law of Large Numbers:
For independent, identically distributed (i.i.d.) random variables, $\frac{1}{n}\sum_{i=1}^{n} X_i$ is close to its expected value $EX$ for large values of n.

The Asymptotic Equipartition Property:
$-\frac{1}{n}\log P(X_1, X_2, \ldots, X_n)$ is close to the entropy $H$, where $X_1, X_2, \ldots, X_n$ are i.i.d. random variables and $P(X_1, X_2, \ldots, X_n)$ is the probability of observing the sequence $X_1, X_2, \ldots, X_n$.

Thus, the probability $P(X_1, X_2, \ldots, X_n)$ assigned to an observed sequence will be close to $2^{-nH}$.


Almost all events are almost equally surprising:
$$\Pr\left\{ (X_1, X_2, \ldots, X_n) : P(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \epsilon)} \right\} \approx 1$$
if $X_1, X_2, \ldots, X_n$ are i.i.d. $\sim P(x)$.

Thm (AEP): If $X_1, X_2, \ldots$ are i.i.d. $\sim P(x)$, then $-\frac{1}{n}\log P(X_1, X_2, \ldots, X_n) \to H(X)$ in probability.

Proof: Since the $X_i$ are i.i.d., so are the $\log P(X_i)$. By the weak law of large numbers,

$$-\frac{1}{n}\log P(X_1, X_2, \ldots, X_n) = -\frac{1}{n}\sum_{i=1}^{n} \log P(X_i) \to -E\log P(X) = H(X) \quad \text{in probability.}$$
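As a quick numerical illustration of this convergence (not part of the original slides), the Python sketch below draws i.i.d. samples from an assumed three-symbol source and compares $-\frac{1}{n}\log_2 P(X_1, \ldots, X_n)$ with the entropy $H(X)$; the pmf and the helper name aep_estimate are our own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed three-symbol source; any pmf would do for this illustration.
    p = np.array([0.5, 0.3, 0.2])
    H = -np.sum(p * np.log2(p))              # entropy H(X) in bits

    def aep_estimate(n):
        """Return -(1/n) log2 P(X_1, ..., X_n) for one i.i.d. sample of length n."""
        xs = rng.choice(len(p), size=n, p=p)
        return -np.mean(np.log2(p[xs]))      # = -(1/n) * sum_i log2 p(x_i)

    for n in (10, 100, 1000, 10000):
        print(f"n={n:6d}   -1/n log2 P = {aep_estimate(n):.4f}   H(X) = {H:.4f}")
    # The empirical quantity concentrates around H(X), so P(X_1,...,X_n) is close to 2^(-nH).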


The typical set $A_\epsilon^{(n)}$ w.r.t. P(x) is the set of sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the following property:
$$2^{-n(H(X)+\epsilon)} \le P(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$

The properties of the typical set $A_\epsilon^{(n)}$:
i. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then $H(X) - \epsilon \le -\frac{1}{n}\log P(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon$.
ii. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for n sufficiently large.
iii. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.
iv. $|A_\epsilon^{(n)}| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$ for n sufficiently large.

Thus, the typical set has probability nearly 1, all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly $2^{nH}$.


Proof:
(i) From the definition of the typical set,
$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$
Taking logarithms and dividing by n gives
$$H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon.$$

(ii) Since the probability of the event
$$\left| -\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) - H(X) \right| < \epsilon$$
tends to 1 as n → ∞, for any δ > 0 there exists an $n_0$ such that for all n ≥ $n_0$ we have
$$\Pr\left\{ \left| -\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) - H(X) \right| < \epsilon \right\} > 1 - \delta.$$
Setting δ = ε, we have
$$\Pr\left\{ A_\epsilon^{(n)} \right\} > 1 - \epsilon.$$


(iii)
$$1 = \sum_{x^n \in \mathcal{X}^n} P(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} P(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = 2^{-n(H(X)+\epsilon)} \left| A_\epsilon^{(n)} \right|,$$
hence
$$\left| A_\epsilon^{(n)} \right| \le 2^{n(H(X)+\epsilon)}.$$

(iv) For sufficiently large n, $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ (from (ii)), so
$$1 - \epsilon < \Pr\left\{ A_\epsilon^{(n)} \right\} \le \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} = 2^{-n(H(X)-\epsilon)} \left| A_\epsilon^{(n)} \right|,$$
hence
$$\left| A_\epsilon^{(n)} \right| \ge (1 - \epsilon)\, 2^{n(H(X)-\epsilon)}.$$
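Properties i-iv can also be checked by brute force for a small source. The sketch below (our illustration, assuming a Bernoulli(0.3) source, n = 16 and ε = 0.1) enumerates $A_\epsilon^{(n)}$, counts its elements and sums their probability.

    import itertools
    import numpy as np

    p1, n, eps = 0.3, 16, 0.1                       # assumed source and parameters
    H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    def log2_prob(xs):
        k = sum(xs)                                 # number of ones in the sequence
        return k * np.log2(p1) + (len(xs) - k) * np.log2(1 - p1)

    count, prob = 0, 0.0
    for xs in itertools.product((0, 1), repeat=n):
        lp = log2_prob(xs)
        if -n * (H + eps) <= lp <= -n * (H - eps):  # definition of the typical set
            count += 1
            prob += 2.0 ** lp

    print(f"|A| = {count}   (bound 2^(n(H+eps)) = {2 ** (n * (H + eps)):.0f})")
    print(f"Pr(A) = {prob:.3f}")                    # tends to 1 as n grows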


Consequences of the AEP: Data Compression

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables $\sim p(x)$.
Data compression / compaction: to find "short descriptions" for such sequences of rv's.
We divide all sequences in $\mathcal{X}^n$ into two sets: the typical set $A_\epsilon^{(n)}$ and its complement.

[Figure: the set $\mathcal{X}^n$ ($|\mathcal{X}|^n$ elements) partitioned into the non-typical set $A_\epsilon^{(n)c}$ and the typical set $A_\epsilon^{(n)}$, which contains at most $2^{n(H+\epsilon)}$ elements.]


We order all elements in each set according to some order (say lexicographic order). Each sequence in $A_\epsilon^{(n)}$ can then be represented by giving the index of the sequence in the set.
Since there are at most $2^{n(H(X)+\epsilon)}$ sequences in $A_\epsilon^{(n)}$, the indexing requires no more than $n(H(X)+\epsilon)+1$ bits. (The extra bit may be necessary because $n(H(X)+\epsilon)$ may not be an integer.)
We prefix all these sequences by a '0', giving a total length of at most $n(H(X)+\epsilon)+2$ bits to represent each sequence in $A_\epsilon^{(n)}$.


Similarly, we can index each sequence in $A_\epsilon^{(n)c}$ by using not more than $n\log|\mathcal{X}|+1$ bits. Prefixing these indices by '1', we have a code for all the sequences in $\mathcal{X}^n$.

Some features of the above coding scheme:
The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of the codeword that follows.
We have used a brute-force enumeration of the atypical set, without taking into account the fact that the number of elements in $A_\epsilon^{(n)c}$ is less than the number of elements in $\mathcal{X}^n$. (Surprisingly, this is good enough to yield an efficient description.)
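A minimal sketch of this two-part code, assuming a Bernoulli(0.3) source and a small block length so the index tables can be built by exhaustive enumeration; all names and parameter values here are illustrative, not from the slides.

    import itertools
    import numpy as np

    p1, n, eps = 0.3, 12, 0.2
    H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    def is_typical(xs):
        lp = sum(np.log2(p1) if x else np.log2(1 - p1) for x in xs)
        return -n * (H + eps) <= lp <= -n * (H - eps)

    seqs     = list(itertools.product((0, 1), repeat=n))    # lexicographic order
    typical  = [xs for xs in seqs if is_typical(xs)]
    atypical = [xs for xs in seqs if not is_typical(xs)]
    idx_typ  = {xs: i for i, xs in enumerate(typical)}
    idx_atyp = {xs: i for i, xs in enumerate(atypical)}

    bits_typ  = int(np.ceil(np.log2(len(typical))))          # <= n(H+eps)+1 bits
    bits_atyp = int(np.ceil(np.log2(len(atypical))))         # <= n log2|X| + 1 bits

    def encode(xs):
        if xs in idx_typ:                                     # flag '0' + index in A
            return '0' + format(idx_typ[xs], f'0{bits_typ}b')
        return '1' + format(idx_atyp[xs], f'0{bits_atyp}b')   # flag '1' + index in A^c

    def decode(bits):
        index = int(bits[1:], 2)
        return typical[index] if bits[0] == '0' else atypical[index]

    xs = tuple(int(b) for b in np.random.default_rng(1).random(n) < p1)
    assert decode(encode(xs)) == xs                           # one-to-one, easily decodable

The codeword lengths produced here stay within the n(H(X)+ε)+2 and n log|X|+2 budgets used in the expected-length bound derived on the following slides.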


The typical sequences have short descriptions, of length approximately nH(X).

Let $l(x^n)$ denote the codeword length corresponding to $x^n = (x_1, x_2, \ldots, x_n)$. The expected codeword length is
$$E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n) = \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\, l(x^n) + \sum_{x^n \in A_\epsilon^{(n)c}} p(x^n)\, l(x^n)$$
$$\le \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\,\bigl(n(H+\epsilon)+2\bigr) + \sum_{x^n \in A_\epsilon^{(n)c}} p(x^n)\,\bigl(n\log|\mathcal{X}|+2\bigr)$$
$$= \Pr\left\{A_\epsilon^{(n)}\right\}\bigl(n(H+\epsilon)+2\bigr) + \Pr\left\{A_\epsilon^{(n)c}\right\}\bigl(n\log|\mathcal{X}|+2\bigr) \le n(H+\epsilon) + \epsilon\, n\log|\mathcal{X}| + 2 = n(H+\epsilon'),$$
where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + \frac{2}{n}$ can be made arbitrarily small by an appropriate choice of $\epsilon$ and n.


Theorem:
Let $X^n$ be i.i.d. $\sim p(x)$ and let ε > 0. There exists a code which maps sequences $x^n$ of length n into binary strings such that the mapping is one-to-one (and therefore invertible) and
$$E\left[ \frac{1}{n}\, l(X^n) \right] \le H(X) + \epsilon$$
for sufficiently large n.

One can represent sequences $X^n$ using nH(X) bits on the average!!


Markov's and Chebyshev's inequalities:

(i) Markov's inequality:
For any non-negative rv X and any δ > 0,
$$\Pr\{X \ge \delta\} \le \frac{EX}{\delta}.$$

(ii) Chebyshev's inequality:
Let Y be a random variable with mean μ and variance σ². Letting X = (Y - μ)² in Markov's inequality, then for any ε > 0,
$$\Pr\{|Y - \mu| \ge \epsilon\} \le \frac{\sigma^2}{\epsilon^2}.$$
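A small Monte Carlo check of Chebyshev's inequality (our illustration; the exponential distribution and sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)

    mu, sigma2 = 1.0, 1.0                              # mean and variance of Exp(1)
    samples = rng.exponential(scale=1.0, size=1_000_000)

    for eps in (0.5, 1.0, 2.0, 3.0):
        empirical = np.mean(np.abs(samples - mu) >= eps)
        bound = sigma2 / eps ** 2
        print(f"eps={eps}:  Pr(|Y-mu| >= eps) ~ {empirical:.4f}  <=  sigma^2/eps^2 = {bound:.4f}")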


Entropy Rate of a Stochastic Process

The AEP described above establishes that nH(X) bits suffice on the average to describe n independent and identically distributed (i.i.d.) random variables.
But what if the random variables are dependent?
We will show, just as in the i.i.d. case, that the entropy $H(X_1, X_2, \ldots, X_n)$ grows (asymptotically) linearly with n at a rate H(X), which will be called the entropy rate of the process.


Markov Chain

A stochastic process is an indexed sequence of rv's. In general, there can be an arbitrary dependence among them. The process is characterized by the joint probability mass functions
$$\Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\} = p(x_1, x_2, \ldots, x_n), \quad (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n, \; n = 1, 2, \ldots$$

Definition: A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of rv's is invariant w.r.t. shifts in the time index, i.e.,
$$\Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, X_{2+l} = x_2, \ldots, X_{n+l} = x_n\}$$
for every shift l and for all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.


A simple example of a stochastic process with dependence is one in which each rv depends on the one preceding it and is conditionally independent of all the other preceding rv's. Such a process is said to be Markov.

Definition:
A discrete stochastic process $X_1, X_2, \ldots$ is said to be a Markov chain or a Markov process if, for n = 1, 2, ...,
$$\Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\} = \Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n\}$$
for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.


In this case, the joint probability mass function of the rv's can be written as:
$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1}).$$

Definition:
The Markov chain is said to be time invariant if the conditional probability $p(x_{n+1} \mid x_n)$ does not depend on n, i.e., for n = 1, 2, ...,
$$\Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\} \quad \text{for all } a, b \in \mathcal{X}.$$


Recall:
If {X_i} is a Markov chain, then X_n is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, where $P_{ij} = \Pr\{X_{n+1} = j \mid X_n = i\}$.
If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, then the Markov chain is said to be irreducible. If the largest common factor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.
If the pmf of the rv at time n is $p(x_n)$, then the pmf at time n+1 is
$$p(x_{n+1}) = \sum_{x_n} p(x_n)\, P_{x_n x_{n+1}}.$$
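A tiny numerical sketch of this update rule, with a two-state transition matrix of our own choosing:

    import numpy as np

    # Assumed transition matrix: P[i, j] = Pr{X_{n+1} = j | X_n = i}.
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

    p = np.array([1.0, 0.0])        # initial pmf: start in state 0
    for n in range(6):
        print(f"n={n}: p = {p}")
        p = p @ P                   # p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n, x_{n+1}}
    # For this (irreducible, aperiodic) P the pmf tends to the stationary distribution (0.8, 0.2).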


A distribution on the states such that the distribution at time n+1 is the same as the distribution at time n is called a stationary distribution.
If the initial state of a Markov chain is drawn according to a stationary distribution, then the Markov chain forms a stationary process.
If the finite-state Markov chain is irreducible and aperiodic, then the stationary distribution is unique, and from any starting distribution, the distribution of X_n tends to the stationary distribution as n → ∞.


Example: Consider a two-state Markov chain with a probability transition matrix
$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}.$$
The stationary probability vector $(\mu_1, \mu_2)$ can be found by solving the equation $\mu P = \mu$ together with $\mu_1 + \mu_2 = 1$, or, more simply, by balancing probabilities.

[Figure: state transition graph of the two-state chain, with cross-transition probabilities α and β and self-loop probabilities 1-α and 1-β.]


For the stationary distribution, the net probability flow across any cut-set in the state transition graph is 0. Applying this property to the above state transition graph, we obtain
$$\mu_1 \alpha = \mu_2 \beta.$$
Since $\mu_1 + \mu_2 = 1$, the stationary distribution is
$$\mu_1 = \frac{\beta}{\alpha+\beta}, \qquad \mu_2 = \frac{\alpha}{\alpha+\beta}.$$
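A one-line check of this result for illustrative values α = 0.2 and β = 0.6 (our choice):

    import numpy as np

    alpha, beta = 0.2, 0.6
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])

    mu = np.array([beta, alpha]) / (alpha + beta)   # claimed stationary distribution
    print(mu, mu @ P)                               # mu @ P equals mu, so mu is stationary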


If the Markov chain has an initial state drawn according to the stationary distribution, the resulting process will be stationary. The entropy of the state X_n at time n is
$$H(X_n) = H\!\left( \frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta} \right).$$
However, this is not the rate at which the entropy $H(X_1, X_2, \ldots, X_n)$ grows.


Entropy Rate:

If we have a sequence of n rv's, a natural question to ask is: "How does the entropy of the sequence grow with n?"

Definition:
The entropy rate of a stochastic process {X_i} is defined by
$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
when the limit exists.


Examples:

1. Typewriter
A typewriter that has m equally likely output letters can produce $m^n$ sequences of length n, all of them equally likely. Hence $H(X_1, X_2, \ldots, X_n) = \log m^n$, and the entropy rate is
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \log m \ \text{bits per symbol}.$$

2. $X_1, X_2, \ldots, X_n$ are i.i.d. rv's. Then
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} \frac{n H(X_1)}{n} = H(X_1).$$


3. Sequence of independent, but not identically distributed rv's.
In this case,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i),$$
but the $H(X_i)$'s are not all equal. We can choose a sequence of distributions on $X_1, X_2, \ldots$ such that the limit of $\frac{1}{n}\sum_i H(X_i)$ does not exist.
An example of such a sequence is a random binary sequence where $p_i = P(X_i = 1)$ is not constant, but a function of i, chosen carefully so that the limit in $\lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$ does not exist.


For example, let
$$p_i = \begin{cases} 0.5, & \text{if } 2k < \log\log i \le 2k+1, \\ 0, & \text{if } 2k+1 < \log\log i \le 2k+2, \end{cases} \quad k = 0, 1, 2, \ldots$$
Then there are arbitrarily long stretches where $H(X_i) = 1$, followed by exponentially longer segments where $H(X_i) = 0$. Hence the running average of the $H(X_i)$ will oscillate between 0 and 1 and will not have a limit. Thus H(X) is not defined for this process.

Now, let's define a related quantity for the entropy rate:
$$H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$$
when the limit exists.


Note that:
$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$ : the per-symbol entropy of the n rv's.
$H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$ : the conditional entropy of the last rv given the past.
For stationary processes both limits exist and are equal.

Theorem:
For a stationary stochastic process,
H(X) = H'(X).
We will first prove that $\lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$ exists.


Lemma 1: For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots, X_1)$ is decreasing in n and has a limit H'(X).

Proof:
$$H(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_1) \le H(X_{n+1} \mid X_n, \ldots, X_2) = H(X_n \mid X_{n-1}, \ldots, X_1),$$
where the inequality holds because conditioning reduces entropy and the equality follows from stationarity. Since $H(X_n \mid X_{n-1}, \ldots, X_1)$ is a decreasing sequence of non-negative numbers, it has a limit H'(X).


Lemma 2 (Cesàro mean):
If $a_n \to a$ and $b_n = \frac{1}{n}\sum_{i=1}^{n} a_i$, then $b_n \to a$.

Proof:
Since $a_n \to a$, there exists a number N(ε) such that $|a_n - a| \le \epsilon$ for all n ≥ N(ε). Hence
$$|b_n - a| = \left| \frac{1}{n}\sum_{i=1}^{n} (a_i - a) \right| \le \frac{1}{n}\sum_{i=1}^{n} |a_i - a| \le \frac{1}{n}\sum_{i=1}^{N(\epsilon)} |a_i - a| + \frac{n - N(\epsilon)}{n}\,\epsilon \le \frac{1}{n}\sum_{i=1}^{N(\epsilon)} |a_i - a| + \epsilon$$


for all n ≥ N(ε). Since the first term goes to zero as n → ∞, we can make $|b_n - a| \le 2\epsilon$ by taking n large enough. Hence $b_n \to a$ as n → ∞.

Proof of the theorem: By the chain rule,
$$\frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1),$$
i.e., the entropy rate is the time average of the conditional entropies. But we know that the conditional entropies tend to a limit H'(X). Hence, by Lemma 2, their running average has a limit, which is equal to H'(X). Thus,
$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = H'(\mathcal{X}).$$
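A quick numerical illustration of the Cesàro mean lemma used above, with an arbitrary convergent sequence of our own choosing (a_n = 1 + (-1)^n / n):

    import numpy as np

    n = np.arange(1, 100001)
    a = 1.0 + ((-1.0) ** n) / n          # a_n -> 1
    b = np.cumsum(a) / n                 # b_n = (1/n) * sum_{i<=n} a_i

    for k in (10, 100, 1000, 100000):
        print(f"n={k:6d}   a_n = {a[k-1]:.6f}   b_n = {b[k-1]:.6f}")
    # The running average b_n converges to the same limit as a_n.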


The significance of the entropy rate of a stochastic process arises from the AEP for a stationary ergodic process.

Remark:
For any stationary ergodic process,
$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(\mathcal{X})$$
with probability 1.

Markov chains:
For a stationary Markov chain, the entropy rate is given by
$$H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1).$$


Theorem:
Let {X_i} be a stationary Markov chain with stationary distribution μ and transition matrix P. Then the entropy rate is
$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}.$$

Proof:
$$H(\mathcal{X}) = H(X_2 \mid X_1) = \sum_{i} \mu_i \left( -\sum_{j} P_{ij} \log P_{ij} \right).$$

Therefore, the entropy rate of the two-state Markov chain example is
$$H(\mathcal{X}) = H(X_2 \mid X_1) = \frac{\beta}{\alpha+\beta} H(\alpha) + \frac{\alpha}{\alpha+\beta} H(\beta).$$
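A short numerical check of this formula for illustrative values α = 0.1 and β = 0.4 (our choice); it evaluates both the double sum and the closed form above.

    import numpy as np

    alpha, beta = 0.1, 0.4
    P = np.array([[1 - alpha, alpha],
                  [beta, 1 - beta]])
    mu = np.array([beta, alpha]) / (alpha + beta)       # stationary distribution

    def h2(p):                                           # binary entropy in bits
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    H_rate = -np.sum(mu[:, None] * P * np.log2(P))       # -sum_ij mu_i P_ij log2 P_ij
    print(H_rate, mu[0] * h2(alpha) + mu[1] * h2(beta))  # the two expressions agree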


Concluding Remark:
If the Markov chain is irreducible and aperiodic, then it has a unique stationary distribution on the states, and any initial distribution tends to the stationary distribution as n → ∞.


Hidden Markov Models:
Let $X_1, X_2, \ldots, X_n, \ldots$ be a stationary Markov chain, and let $Y_i = \Phi(X_i)$ be a process, each term of which is a function of the corresponding state in the Markov chain. Such functions of Markov chains occur often in practice!!
It would simplify matters greatly if $Y_1, Y_2, \ldots, Y_n$ also formed a Markov chain, but in many cases this is not true.


However, since the Markov chain is stationary, so is $Y_1, \ldots, Y_n$, and the entropy rate H(Y) is well defined.
If we wish to compute H(Y), we might compute $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ for each n and find the limit. But since the convergence can be arbitrarily slow, we will never know how close we are to the limit; we will not know when to stop. (We cannot look at the change between the values at n and n+1, since this difference may be small even when we are far away from the limit!!)


Remark:
It would be useful computationally to have upper and lower bounds converging to the limit from above and below! We can then halt the computation when the difference between the upper bound and the lower bound is small, and we will have a good estimate of the limit.

Recall: For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots, X_1)$ is decreasing in n and has a limit H'(X).
→ $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ converges monotonically to H(Y) from above: $H(Y_n \mid Y_{n-1}, \ldots, Y_1) \ge H(Y)$.


For a lower bound, we will use $H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1)$. This is a neat trick based on the idea that $X_1$ contains as much information about $Y_n$ as $Y_1, Y_0, Y_{-1}, \ldots$

Lemma: $H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) \le H(Y)$
Proof: We have, for k = 1, 2, ...,
H(Y_n | Y_{n-1}, ..., Y_2, X_1)
(a) = H(Y_n | Y_{n-1}, ..., Y_2, Y_1, X_1)
(b) = H(Y_n | Y_{n-1}, ..., Y_1, X_1, X_0, X_{-1}, ..., X_{-k})
(c) = H(Y_n | Y_{n-1}, ..., Y_1, X_1, X_0, X_{-1}, ..., X_{-k}, Y_0, ..., Y_{-k})
(d) ≤ H(Y_n | Y_{n-1}, ..., Y_1, Y_0, ..., Y_{-k})
    = H(Y_{n+k+1} | Y_{n+k}, ..., Y_1)   (stationarity)

(a) Y_1 is a function of X_1.
(b) Markovity of X.
(c) Y_i is a function of X_i.
(d) Conditioning reduces entropy.


Since the inequality is true for all k, it is true in the limit. Thus,
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le \lim_{k\to\infty} H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_1) = H(\mathcal{Y}).$$

The next lemma shows that the interval between the upper and the lower bounds decreases in length.

Lemma:
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \to 0.$$


Proof: The interval length can be written as
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1).$$
By the properties of mutual information, $I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1)$, and $I(X_1; Y_1, Y_2, \ldots, Y_n)$ increases with n. Hence $\lim_{n\to\infty} I(X_1; Y_1, Y_2, \ldots, Y_n)$ exists and
$$\lim_{n\to\infty} I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1).$$
By the chain rule,
$$H(X_1) \ge \lim_{n\to\infty} I(X_1; Y_1, Y_2, \ldots, Y_n) = \lim_{n\to\infty} \sum_{i=1}^{n} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1) = \sum_{i=1}^{\infty} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1).$$
Since this infinite sum is finite and the terms are non-negative, the terms must tend to 0, i.e.,
$$\lim_{n\to\infty} I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1) = 0.$$


Combining the previous two lemmas, we have the following theorem:

Theorem:
If $X_1, X_2, \ldots, X_n$ form a stationary Markov chain and $Y_i = \Phi(X_i)$, then
$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1)$$
and
$$\lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1).$$
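These bounds can be computed by brute force for a toy example. The sketch below assumes a 3-state stationary chain and a deterministic map Φ that merges states 1 and 2, so {Y_i} is a function of a Markov chain but generally not Markov itself; the transition matrix and all names are our illustrative choices.

    import itertools
    import numpy as np

    # Assumed 3-state transition matrix (rows sum to 1) and output map.
    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
    phi = lambda x: min(x, 1)                      # merge states 1 and 2 into output symbol 1

    # Stationary distribution mu: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    mu = mu / mu.sum()

    def seq_prob(xs):
        pr = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            pr *= P[a, b]
        return pr

    def entropy_of(project, n):
        """Entropy (bits) of project(X_1, ..., X_n) under the stationary chain."""
        dist = {}
        for xs in itertools.product(range(3), repeat=n):
            key = project(xs)
            dist[key] = dist.get(key, 0.0) + seq_prob(xs)
        return -sum(p * np.log2(p) for p in dist.values() if p > 0)

    ys  = lambda xs: tuple(phi(x) for x in xs)                 # (Y_1, ..., Y_n)
    xys = lambda xs: (xs[0],) + tuple(phi(x) for x in xs[1:])  # (X_1, Y_2, ..., Y_n)

    for n in range(2, 7):
        upper = entropy_of(ys, n) - entropy_of(ys, n - 1)      # H(Y_n | Y_{n-1}, ..., Y_1)
        lower = entropy_of(xys, n) - entropy_of(xys, n - 1)    # H(Y_n | Y_{n-1}, ..., Y_2, X_1)
        print(f"n={n}:  {lower:.4f} <= H(Y) <= {upper:.4f}")
    # The lower and upper bounds squeeze toward the entropy rate H(Y) as n increases.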


In general, we could also consider the case where $Y_i$ is a stochastic function (as opposed to a deterministic function) of $X_i$.
Consider a Markov process $X_1, X_2, \ldots, X_n$, and define a new process $Y_1, Y_2, \ldots, Y_n$, where each $Y_i$ is drawn according to $p(y_i \mid x_i)$, conditionally independent of all other $X_j$, $j \ne i$; that is,
$$p(x^n, y^n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_i) \prod_{i=1}^{n} p(y_i \mid x_i).$$


[Figure: graphical model of the process: the state chain $X_1 \to X_2 \to \cdots \to X_n$ with initial distribution $p(x_1)$ and transitions $p(x_2 \mid x_1), \ldots, p(x_n \mid x_{n-1})$; each state $X_i$ emits an observation $Y_i$ according to $p(y_i \mid x_i)$.]

Such a process, called a hidden Markov model (HMM), is used extensively in speech recognition, handwriting recognition, and so on.
The same argument as that used for functions of a Markov chain carries over to HMMs, and we can lower-bound the entropy rate of an HMM by conditioning it on the underlying Markov states.


Rate Distortion Functions

Recall:
For discrete-amplitude memoryless sources,
$$H(X) = E[I(X)] = -\sum_{k=1}^{K} P(x_k) \log_2 P(x_k).$$
For discrete-amplitude sources with memory,
$$H_\infty(X) = \lim_{N\to\infty} H(X_N \mid X_{N-1}, \ldots, X_1),$$
with
$$H_\infty(X) \le H(X) \le \log_2 K,$$
where the left inequality is an equality for a source without memory and the right one for a uniform distribution.


Source redundancy = log_2 K - H(X), due to
– the non-uniform distribution of P(x_k),
– the presence of memory,
– a non-Gaussian pdf, for continuous sources.

For continuous-amplitude memoryless sources,
$$h(X) = E[-\log_2 P(X)] = -\int P(x) \log_2 P(x)\, dx.$$
For continuous-amplitude sources with memory,
$$h_\infty(X) = \lim_{N\to\infty} h(X_N \mid X_{N-1}, \ldots, X_1),$$
with
$$h_\infty(X) \le h(X) \le \frac{1}{2} \log_2\!\left( 2\pi e\, \sigma_x^2 \right),$$
where the right-hand bound is attained by the Gaussian distribution.
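As a concrete check of the Gaussian upper bound, the snippet below compares closed-form differential entropies of three distributions with the same variance (the value σ² = 2 is an arbitrary choice of ours):

    import numpy as np

    sigma2 = 2.0

    h_gauss   = 0.5 * np.log2(2 * np.pi * np.e * sigma2)    # (1/2) log2(2*pi*e*sigma^2)
    h_uniform = np.log2(np.sqrt(12 * sigma2))               # uniform pdf of width sqrt(12*sigma^2)
    h_laplace = np.log2(2 * np.e * np.sqrt(sigma2 / 2))     # Laplacian with the same variance

    print(f"Gaussian  h = {h_gauss:.4f} bits   (the upper bound)")
    print(f"Uniform   h = {h_uniform:.4f} bits")
    print(f"Laplacian h = {h_laplace:.4f} bits")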


Source and Channel Coding Theorems

According to the noisy channel coding theorem, information can be transmitted reliably (i.e., without error) over a noisy channel at any source rate, say R, below the so-called capacity C of the channel:
R < C for reliable transmission.

According to the source coding theorem, there exists a mapping from source waveforms to codewords such that for a given distortion D, R(D) bits (per source sample) are sufficient to enable waveform reconstruction with an average distortion that is arbitrarily close to D. The actual rate R has to obey:
R ≥ R(D) for fidelity given by D.

The function R(D) is called the rate distortion function. Its inverse, D(R), is called the distortion rate function.


Digital Communication of Waveforms

[Figure: block diagram Source → Waveform encoder → Digital channel (capacity C, error probability Pe) → Waveform decoder → Sink. X(n) is the encoder input (after the sampler), U(n) and V(n) are the channel input and output, and Y(n) is the decoder output (before interpolation); the analog/digital boundary lies at the waveform encoder.]

Both C and R(D) are related to the concept of mutual information. Let X and Y refer to sequences of coder input and decoder output, and let U and V refer to sequences of channel input and channel output.
I(X;Y): the average information transmitted to the destination per sample.
Distortion D and I(X;Y) depend on the type of source coding. There is a minimum of I(X;Y) that is needed for reconstruction at the destination if the average distortion must not exceed the specified upper limit D. This minimum value of I(X;Y) is R(D).


The channel capacity C is related to the average mutual information I(U;V) per sample that characterizes channel input statistics and input-output mappings across the channel, as described by appropriate conditional probabilities. For a given channel, C is the maximum of I(U;V) over all channel input statistics.

Information transmission theorem:
C ≥ R(D) for reliable transmission with fidelity D;
i.e., waveform reconstruction with fidelity D is possible after transmission over a noisy channel, provided that its capacity is greater than R(D).


The significance of the information transmission theorem is that it justifies the separation of source coding (source representation) and channel coding (information transmission) functions in waveform communication, and provides a framework wherein the channel controls the rate, but not necessarily the accuracy, of waveform reproduction.
Thus the only property of the channel that should concern the source coder is the single parameter C, and the only requirement for efficient utilization of the channel is that its input statistics (the statistics at the output of the source coder) should be chosen so as to maximize I(U;V).


[Figure: rate R versus normalized distortion D/σ_x², showing the information-theory bound R(D) together with curves for high-, medium-, and low-complexity coders.]

The rate distortion curves are monotonically non-increasing (higher information-rate representations should lead to smaller average distortions).
The distortion at rate R = 0 should be finite if the input variance is finite.
For discrete-amplitude sources, R(0) = H(X).


The Power of the Lagrange Multiplier

Formal proof of:
1. The uniform distribution of a finite discrete source gives maximal entropy.

Cost function: maximize over p
$$\sum_{i=1}^{N} p_i \log_2 \frac{1}{p_i}$$
subject to
$$\sum_{i=1}^{N} p_i = 1.$$
Form the Lagrangian
$$f(p_1, p_2, \ldots, p_N) = \sum_{i=1}^{N} p_i \log_2 \frac{1}{p_i} + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right).$$
Setting the partial derivatives to zero,
$$\frac{\partial f}{\partial p_j} = \log_2 \frac{1}{p_j} - \log_2 e + \lambda = 0 \quad \text{for } j = 1, 2, \ldots, N.$$
Since everything in the above equation other than $p_j$ is a constant, it follows that each $p_j$ is the same. If all the $p_j$ are equal, each has value 1/N, and hence the maximum entropy is
$$H(p) = \log_2 N.$$
1


2. Channel capacity via a Lagrange multiplier

[Figure: a channel with inputs X ∈ {a, b, c} and outputs Y ∈ {a', b', c'}; the edge transition probabilities shown are 1, p, p, q, q.]

Let the input probabilities be p(a) = P and p(b) = p(c) = Q, so the constraint is P + 2Q = 1.
In order to find the channel capacity, we have to choose P and Q in such a way as to maximize the mutual information I(X;Y), subject to the constraint P + 2Q = 1. Introducing a Lagrange multiplier λ for the constraint, setting the derivatives of the Lagrangian with respect to P and Q to zero, and eliminating λ gives the capacity-achieving input distribution and hence the channel capacity.
