Notes from Limit Theorems 2
Mihai Nica
Notes. These are my notes from the class Limit Theorems 2 taught by Professor McKean in Spring 2012. I have tried to carefully go over the bigger theorems from the course and fill in all the details explicitly. There is also a lot of information that is folded in from other sources.
• The section on Martingales is supplemented with some notes from "A First Look at Rigorous Probability Theory" by Jeffrey S. Rosenthal, which has a really nice introduction to Martingales.
• The section on the law of the iterated logarithm is supplemented with some inequalities which I looked up on the internet, mostly Wikipedia and PlanetMath.
• In the section on the Ergodic theorem, I use a notation for continued fractions that I found on Wikipedia and like. In my pen-and-paper notes, there is also a little section about Ergodic theory for geodesics on surfaces, which is really cute. However, I couldn't figure out a good way to draw the pictures, so it hasn't been typed up yet.
• The section on Brownian Motion is supplemented by the book Brownian Motion and Martingales in Analysis by Richard Durrett, which is really wonderful. Some of the slick results are taken straight from there.
• I also include an appendix with results that I found myself reviewing as I went through this material.
Contents

Chapter 1. Martingales
1. Definitions and Examples
2. Stopping times
3. Martingale Convergence Theorem
4. Applications

Chapter 2. The Law of the Iterated Logarithm
1. First Half of the Law of the Iterated Logarithm
2. Second Half of the Law of the Iterated Logarithm

Chapter 3. Ergodic Theorem
1. Motivation
2. Birkhoff's Theorem
3. Continued Fractions

Chapter 4. Brownian Motion
1. Motivation
2. Levy's Construction
3. Construction from Durrett's Book
4. Some Properties

Chapter 5. Appendix
1. Conditional Random Variables
2. Extension Theorems
CHAPTER 1

Martingales

1. Definitions and Examples

This section on Martingales contains heavy use of conditional random variables. I do a quick review of this topic from Limit Theorems 1 in the appendix.

Definition 1.1. A sequence of random variables X_0, X_1, ... is called a martingale if E(|X_n|) < ∞ for all n and, with probability 1:

E(X_{n+1} | X_0, X_1, ..., X_n) = X_n

Intuitively, this says that the average value of X_{n+1} is the same as that of X_n, even if we are given the values of X_0 to X_n. Note that conditioning on X_0, ..., X_n is just different notation for conditioning on σ(X_0, ..., X_n), which is the sigma algebra generated by preimages of Borel sets through X_0, ..., X_n. One can make more general martingales by replacing σ(X_0, ..., X_n) with an arbitrary increasing chain of sigma algebras F_n; the results here carry over to that setting too.

Example 1.2. Sometimes martingales are called "fair games". The analogy is that the random variable X_n represents the bankroll of a gambler at time n. The game is fair because, at any point in time, the gambler's expected future fortune equals his current fortune.
Definition 1.3. A submartingale is when E(X_{n+1} | X_0, X_1, ..., X_n) ≥ X_n (i.e. the capital is increasing) and a supermartingale is when E(X_{n+1} | X_0, X_1, ..., X_n) ≤ X_n (i.e. the capital is decreasing). Most of the theorems for martingales work for submartingales, just change the inequality in the right place. To avoid confusion between sub-, super-, and ordinary martingales, we will sometimes call a martingale a "fair martingale".

Example 1.4. The simple symmetric random walk, X_n = Z_0 + Z_1 + ... + Z_n with each Z_n = ±1 with probability 1/2, is a martingale. In terms of the fair game, this is gambling on the outcome of a fair coin.
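As a quick sanity check, the fair-game property E(X_n) = E(X_0) = 0 of this walk can be probed by simulation. This is a sketch I added for illustration; the step and trial counts are arbitrary choices.

```python
import random

random.seed(0)

def simple_random_walk(n_steps, rng=random):
    """One path of the fair ±1 walk: X_0 = 0, X_k = X_{k-1} + Z_k."""
    path = [0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.choice((-1, 1)))
    return path

# Monte Carlo estimate of E(X_n); for a martingale it should match E(X_0) = 0.
n_steps, n_trials = 50, 20000
mean_end = sum(simple_random_walk(n_steps)[-1] for _ in range(n_trials)) / n_trials
print(round(mean_end, 2))  # close to 0
```

The sample mean fluctuates around 0 on the order of √(n_steps/n_trials), consistent with the martingale property.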
Remark. Using the properties of conditional expectation, we see that:

E(X_{n+2} | X_0, X_1, ..., X_n) = E(E(X_{n+2} | X_0, X_1, ..., X_{n+1}) | X_0, ..., X_n)
= E(X_{n+1} | X_0, ..., X_n)
= X_n

With a simple argument by induction, we get that in general, for m ≥ n:

E(X_m | X_0, X_1, ..., X_n) = X_n

In particular, E(X_n) = E(X_0) for every n. If τ is a random "time" (a non-negative integer) that is independent of the X_n's, then E(X_τ) is a weighted average of E(X_n)'s, so we still have E(X_τ) = E(X_0). What if τ is dependent on the X_n's? In general we cannot have equality: for the simple symmetric random walk (coin-flip betting), the choice τ = first time that X_n = −1 has E(X_τ) = −1 ≠ 0 = E(X_0). The next section gives some conditions under which the equality does hold.
2. Stopping times

Definition 2.1. For a martingale {X_n}, a non-negative integer valued random variable τ is a stopping time if it has the property that:

{τ = n} ∈ σ(X_1, X_2, ..., X_n)

Intuitively, this is saying that one can determine whether τ = n just by looking at the first n steps of the martingale.

Example 2.2. In the example of the random coin flipping, if we let τ be the first time that X_n = 10, then τ is a stopping time.

Example 2.3. We are often interested in X_τ, the value of the martingale at the random time τ. This is precisely defined as X_τ(ω) = X_{τ(ω)}(ω). Another handy rewriting is: X_τ = Σ_k X_k 1_{τ=k}.
Lemma 2.4. If {X_n} is a submartingale and τ_1, τ_2 are bounded stopping times, so that ∃M s.t. 0 ≤ τ_1 ≤ τ_2 ≤ M with probability 1, then E(X_{τ_1}) ≤ E(X_{τ_2}), with equality for fair martingales.

Proof. For fixed k, the event {τ_1 < k ≤ τ_2} can be written as {τ_1 < k ≤ τ_2} = {τ_1 ≤ k − 1} ∩ {τ_2 ≤ k − 1}^C, from which we see that {τ_1 < k ≤ τ_2} ∈ σ(X_0, X_1, ..., X_{k−1}) because τ_1 and τ_2 are both stopping times. We then have the following manipulation, using a telescoping series, linearity of the expectation, the fact that E(Y 1_A) = E(E(Y | X_0, X_1, ..., X_{k−1}) 1_A) for events A ∈ σ(X_0, X_1, ..., X_{k−1}), and finally the fact that E(X_k | X_0, X_1, ..., X_{k−1}) − X_{k−1} ≥ 0 since X_n is a (sub)martingale (with equality for fair martingales):

E(X_{τ_2}) − E(X_{τ_1}) = E(X_{τ_2} − X_{τ_1})
= E( Σ_{k=τ_1+1}^{τ_2} (X_k − X_{k−1}) )
= Σ_{k=1}^{M} E( (X_k − X_{k−1}) 1_{τ_1 < k ≤ τ_2} )
= Σ_{k=1}^{M} E( (E(X_k | X_0, ..., X_{k−1}) − X_{k−1}) 1_{τ_1 < k ≤ τ_2} )
≥ 0

with equality in the last step for fair martingales. □
Theorem 2.5. Say {X_n} is a martingale and τ a bounded stopping time (that is, ∃M s.t. 0 ≤ τ ≤ M with probability 1). Then:

E(X_τ) = E(X_0)

Proof. Let υ be the random variable which is constantly 0. This is a stopping time! So by the above lemma, since 0 ≤ υ ≤ τ ≤ M, we have that E(X_τ) = E(X_υ) = E(X_0). □
Theorem 2.6. For {X_n} a martingale and τ a stopping time which is almost surely finite (that is, P(τ < ∞) = 1), we have:

E(X_τ) = E(X_0) ⟺ E(lim_{n→∞} X_{min(τ,n)}) = lim_{n→∞} E(X_{min(τ,n)})

Proof. It suffices to show that E(X_τ) = E(lim_{n→∞} X_{min(τ,n)}) and E(X_0) = lim_{n→∞} E(X_{min(τ,n)}). The first equality holds since P(τ < ∞) = 1 gives P(lim_{n→∞} X_{min(τ,n)} = X_τ) = 1, so they agree almost surely. The second holds by the above theorem concerning bounded stopping times: for any n, min(τ, n) is a bounded stopping time, so we have E(X_{min(τ,n)}) = E(X_0), and the equality holds in the limit too. □

Remark. The above theorem can be combined with things like the monotone convergence theorem or the Lebesgue dominated convergence theorem to switch the limits and conclude that E(X_τ) = E(X_0). Here are some examples:
Example 2.7. If {X_n} is a martingale and τ a stopping time so that P(τ < ∞) = 1, E(|X_τ|) < ∞, and lim_{n→∞} E(X_n 1_{τ>n}) = 0, then E(X_τ) = E(X_0).

Proof. For any n we have: X_{min(τ,n)} = X_n 1_{τ>n} + X_τ 1_{τ≤n}. Taking expectations and then the limit as n → ∞ gives:

lim_{n→∞} E(X_{min(τ,n)}) = lim_{n→∞} E(X_n 1_{τ>n}) + lim_{n→∞} E(X_τ 1_{τ≤n})
= 0 + E(X_τ)

where the first term is 0 by hypothesis, and the second limit is justified since X_τ 1_{τ≤n} → X_τ pointwise almost surely (because P(τ < ∞) = 1), and the dominating function |X_τ| with E(|X_τ|) < ∞ lets us use the Lebesgue dominated convergence theorem to conclude the convergence of the expectations. □
Example 2.8. Suppose {X_n} is a martingale and τ a stopping time so that E(τ) < ∞ and |X_{n+1} − X_n| ≤ M < ∞ for some fixed M and for every n. Then E(X_τ) = E(X_0).

Proof. Let Y = |X_0| + Mτ. Then Y can be used as a dominating function in an application of the L.D.C.T. very similar to the above example to get the conclusion. □
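The failure mode these hypotheses rule out can be seen numerically. Below is a small sketch I added (horizon and trial counts are arbitrary): for τ = first hit of −1 by a fair walk, the truncated time min(τ, n) is bounded, so E(X_{min(τ,n)}) = 0 for every n, even though X_τ = −1 almost surely, so the limit and the expectation cannot be interchanged.

```python
import random

random.seed(1)

def stopped_walk_value(horizon, rng=random):
    """Fair ±1 walk stopped at the first hit of -1, truncated at `horizon`."""
    x = 0
    for _ in range(horizon):
        x += rng.choice((-1, 1))
        if x == -1:
            break
    return x

n_trials, horizon = 20000, 100
mean_stopped = sum(stopped_walk_value(horizon) for _ in range(n_trials)) / n_trials
print(round(mean_stopped, 2))  # ≈ 0 = E(X_0), since min(τ, horizon) is bounded
```

Most samples equal −1, but the rare paths that never hit −1 before the horizon carry large positive values that balance the mean, exactly the subtlety noted in the upcrossing lemma proof below.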
3. Martingale Convergence Theorem

The proof relies on the famous upcrossing lemma:

Lemma 3.1 [The Upcrossing Lemma]. Let {X_n} be a submartingale. For fixed α, β ∈ R with β > α, and M ∈ N, let U^{α,β}_M be the number of "upcrossings" that the martingale {X_n} makes of the interval [α, β] in the time period 1 ≤ n ≤ M. (An upcrossing is when X_n goes from being less than α initially to being more than β later. Precisely this is: U^{α,β}_M = max{k : ∃ t_1 < u_1 < ... < t_k < u_k ≤ M s.t. X_{t_i} ≤ α and X_{u_i} ≥ β ∀i}.) Then:

E(U^{α,β}_M) ≤ E(|X_M − X_0|) / (β − α)
Proof. Firstly, we remark that it suffices to prove the result when the submartingale {X_n} is replaced by {max(X_n, α)}, since this is still a submartingale, it has the same number of upcrossings as X_n, and |max(X_M, α) − max(X_0, α)| ≤ |X_M − X_0|, so the inequality is only strengthened. In other words, we assume without loss of generality that X_n ≥ α for all n. This simplification is used in exactly one spot later on to get the inequality we need.

Let us now carefully nail down where the upcrossings happen. Define u_0 = v_0 = 0 and iteratively define:

u_j = min(M, inf{k > v_{j−1} : X_k ≤ α})
v_j = min(M, inf{k > u_j : X_k ≥ β})

These record the times where the martingale crosses the interval [α, β]; the u_j's record when it first crosses to below the interval, and the v_j's record crossings to above the interval. They are also truncated at time M so that they are bounded stopping times. Moreover, since these times are strictly increasing until they hit M, it must be the case that v_M = M. We have then, using some crafty telescoping sums:

E(X_M) = E(X_{v_M})
= E(X_{v_M} − X_{u_M} + X_{u_M} − X_{v_{M−1}} + X_{v_{M−1}} − ... − X_{u_1} + X_{u_1} − X_0 + X_0)
= E(X_0) + Σ_{k=1}^{M} E(X_{v_k} − X_{u_k}) + Σ_{k=1}^{M} E(X_{u_k} − X_{v_{k−1}})

The third term is non-negative! This is because u_k and v_{k−1} are both bounded stopping times with 0 ≤ v_{k−1} ≤ u_k ≤ M, so our theorem about stopping times gives that this expectation is non-negative. (This is subtle! Most of the time (when we haven't hit time M yet) we expect X_{u_k} ≤ α while X_{v_{k−1}} ≥ β, so their difference is negative. However, because of the small probability event where v_{k−1} < M and u_k = M, we get a big positive number with small probability which balances the whole expectation. Compare to the example of a simple symmetric random walk with a truncated stopping time for τ = first time that X_n = −1.)

Now the second term has Σ_{k=1}^{M} E(X_{v_k} − X_{u_k}) ≥ E((β − α) U^{α,β}_M). This is because each upcrossing counted in U^{α,β}_M contributes at least (β − α) to the sum, null cycles (where u_k = v_k = M) contribute nothing, and the possibly one incomplete cycle (where u_k < M but v_k = M) must give a non-negative contribution to the sum by the simplification that X_n ≥ α.

Hence we have:

E(X_M) ≥ E(X_0) + (β − α) E(U^{α,β}_M) + 0

which gives the desired result, since E(X_M) − E(X_0) ≤ E(|X_M − X_0|). □
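A quick numerical sketch of the lemma (my own illustration, parameters arbitrary, not from the notes): count completed upcrossings of [α, β] = [−2, 2] by a simple symmetric walk, and compare the empirical mean against a Monte Carlo estimate of E(|X_M − X_0|)/(β − α).

```python
import random

random.seed(2)

def count_upcrossings(path, alpha, beta):
    """Completed upcrossings: path dips to <= alpha, then later rises to >= beta."""
    count, below = 0, False
    for x in path:
        if not below and x <= alpha:
            below = True
        elif below and x >= beta:
            count += 1
            below = False
    return count

def walk(m, rng=random):
    path = [0]
    for _ in range(m):
        path.append(path[-1] + rng.choice((-1, 1)))
    return path

M, alpha, beta, trials = 200, -2, 2, 5000
total_ups = total_gap = 0.0
for _ in range(trials):
    p = walk(M)
    total_ups += count_upcrossings(p, alpha, beta)
    total_gap += abs(p[-1] - p[0])
mean_ups = total_ups / trials
bound = (total_gap / trials) / (beta - alpha)
print(mean_ups <= bound)  # empirical check of the upcrossing bound
```

For this walk and interval, the empirical mean number of upcrossings sits comfortably below the bound, which matches the slack discarded in the proof (the non-negative third term and the overshoot of each completed crossing).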
Theorem 3.2 [Martingale Convergence Theorem]. Let {X_n} be a submartingale with sup_n E(|X_n|) < ∞. Then there exists a random variable X so that X_n → X almost surely. (That is, X_n(ω) → X(ω) for almost all ω ∈ Ω.)

Proof. Firstly, since sup_n E(|X_n|) < ∞, by Fatou's lemma we have E(lim inf_n |X_n|) ≤ lim inf_n E(|X_n|) ≤ sup_n E(|X_n|) < ∞, from which it follows that P(|X_n| → ∞) = 0. This ensures that the X_n cannot "leak away" probability to ±∞, which would prevent the limiting random variable from being properly normalized.

Now suppose by contradiction that P(lim inf X_n < lim sup X_n) > 0, i.e. there is a non-zero probability of X_n not converging. Then use the density of the rationals and countable subadditivity to find α and β so that P(lim inf X_n < α < β < lim sup X_n) > 0. Counting the number of upcrossings X_n makes of [α, β], we see that we must have:

P( lim_{M→∞} U^{α,β}_M = ∞ ) ≥ P(lim inf X_n < α < β < lim sup X_n) > 0

Hence E(lim_{M→∞} U^{α,β}_M) = ∞. By the monotone convergence theorem, however, we have that lim_{M→∞} E(U^{α,β}_M) = E(lim_{M→∞} U^{α,β}_M) = ∞.

But now we have reached a contradiction! For by the upcrossing lemma:

lim_{M→∞} E(U^{α,β}_M) ≤ lim_{M→∞} E(|X_M − X_0|)/(β − α) ≤ 2 sup_n E(|X_n|)/(β − α) < ∞ □
4. Applications

Theorem 4.1 [Levy]. Suppose Z is a random variable with E(|Z|) < ∞, and that {F_n} is a decreasing chain of σ-algebras, F_1 ⊃ F_2 ⊃ ... (This is saying that they are getting coarser and coarser.) Let F_∞ = ∩_n F_n. Then we have almost surely:

lim_{n→∞} E(Z|F_n) = E(Z|F_∞)

Proof. We first prove that there is an almost sure limit using the martingale convergence theorem, and then we check the defining properties of E(Z|F_∞) to verify that this is indeed the limit.

Firstly, let X_n = E(Z|F_n). Then for any fixed M ∈ N we have that the sequence X_M, X_{M−1}, ..., X_2, X_1 is a martingale. (Here we are referring to a slightly more general martingale than in our original definition: the sigma algebra σ(X_1, X_2, ...) in the definition is replaced by arbitrary increasing sigma algebras F_n. The martingale property follows from the fact that E(E(Z|F)|G) = E(Z|G) when G ⊂ F.) Notice that we had to reverse the order of the sequence to get the sigma algebras to increase (i.e. get finer and finer), so that we really have a martingale. For this reason, the martingale convergence theorem does not apply directly, but the idea of the proof will still work. Suppose by contradiction, as in the proof of the martingale convergence theorem, that P(lim inf X_n < lim sup X_n) > 0. Then, as before, find α and β so that P(lim inf X_n < α < β < lim sup X_n) > 0. Since there are then infinitely many crossings of the interval [α, β], the number of downcrossings D^{α,β}_M has P(lim_{M→∞} D^{α,β}_M = ∞) > 0, and so E(lim_{M→∞} D^{α,β}_M) = ∞. Hence, since D^{α,β}_M is increasing in M (the number of downcrossings can only increase if we wait longer), we may find an M_0 ∈ N so that E(D^{α,β}_{M_0}) > 2E(|Z|)/(β − α).
Taking now the martingale sequence X_{M_0}, X_{M_0−1}, ..., X_2, X_1, we have a violation of the upcrossing lemma, just as we did in the martingale convergence theorem.

Next, to verify that the limit is indeed E(Z|F_∞), we just need to check the two defining properties, namely that it is F_∞-measurable and that it has the correct expectation value for events in F_∞. lim_{n→∞} E(Z|F_n) is F_∞-measurable, since F_∞ ⊂ F_n for every n, meaning that E(Z|F_n) is F_∞-measurable for every n, and so the limit is too.

To see that lim_{n→∞} E(Z|F_n) takes the correct expectations for events in F_∞, notice that for any A ∈ F_∞ ⊂ F_n we have, for every n, that E(E(Z|F_n) 1_A) = E(Z 1_A) since A ∈ F_n, so in the limit lim_{n→∞} E(E(Z|F_n) 1_A) = E(Z 1_A). Hence the problem of proving that E(lim_{n→∞} E(Z|F_n) 1_A) = E(Z 1_A) is reduced to an interchange of a limit with an expectation. If Z is bounded, this is justified by the bounded convergence theorem. For Z not bounded, truncating Z by Z 1_{|Z|≤N}, with a bit more work, will give the same interchange of limits. □
Theorem 4.2 [Levy]. Suppose Z is a random variable with E(|Z|) < ∞, and that {F_n} is an increasing chain of σ-algebras, F_1 ⊂ F_2 ⊂ ... (This is saying that they are getting finer and finer.) Let F_∞ = σ(∪_n F_n). Then we have almost surely:

lim_{n→∞} E(Z|F_n) = E(Z|F_∞)

Proof. This proof is like the last one. In this case E(Z|F_n) really is a martingale (not backwards), so an almost sure limit exists by the martingale convergence theorem. Some more work here is needed... I think you get the desired property by approximation with "tame events": for A ∈ F_∞ and every ɛ > 0, there exists A_n ∈ F_n such that P(A △ A_n) < ɛ. □

Remark. This result is often known as the "Levy Zero-One Law", since a common application is to consider an event A ∈ F_∞, for which the theorem tells us that:

lim_{n→∞} P(A|F_n) = lim_{n→∞} E(1_A|F_n)
= E(1_A|F_∞)
= 1_A

where the last equality holds since A is F_∞-measurable. This says in particular that the limit of these probabilities is either 0 or 1, since these are the only two values taken on by 1_A. In this setting, the theorem gives a short proof of the Kolmogorov zero-one law.
Theorem 4.3 [Kolmogorov Zero-One Law]. Let X_1, X_2, ... be an infinite sequence of i.i.d. random variables. Define:

F_n = σ( ∪_{k=1}^{n} σ(X_k) )
F_∞ = σ( ∪_{n=1}^{∞} F_n )
F_tail = ∩_{n=1}^{∞} σ( ∪_{k=n}^{∞} σ(X_k) )
Then any event A ∈ F_tail has either P(A) = 0 or P(A) = 1. These are the events which do not depend on finitely many of the X_n's.

Proof. Let A ∈ F_tail. For any n ∈ N we have that P(A) = P(A|F_n) = E(1_A|F_n), since A ∈ F_tail does not depend on the first n variables, so its conditional expectation is a constant. We have then (as in the above "Levy 0-1" remark):

P(A) = lim_{n→∞} P(A|F_n) = 1_A

since A ∈ F_∞. So indeed, the only values of P(A) that are possible are 0 and 1. □
Theorem 4.4 [Strong Law of Large Numbers]. Suppose X_1, X_2, ... are i.i.d. with E(|X_1|) < ∞. Then we have almost surely that:

lim_{n→∞} (X_1 + X_2 + ... + X_n)/n = E(X_1)

Proof. Define S_n = X_1 + X_2 + ... + X_n, and let F_n = σ(∪_{k=n}^{∞} σ(S_k)) be the sigma algebra of the tail S_n, S_{n+1}, .... We now claim that:

E(X_1|F_n) = S_n/n

This can be seen in the following slick way. First notice that, by symmetry, we must have E(X_1|F_n) = E(X_2|F_n) = ... = E(X_n|F_n). By linearity now: Σ_{k=1}^{n} E(X_k|F_n) = E(Σ_{k=1}^{n} X_k | F_n) = E(S_n|F_n) = S_n, since S_n is F_n-measurable. Hence, since they are all equal and sum to S_n, we get E(X_1|F_n) = S_n/n as desired. By Levy's theorem now:

lim_{n→∞} S_n/n = lim_{n→∞} E(X_1|F_n) = E(X_1 | ∩_k F_k)

From here, one can use the Hewitt-Savage zero-one law (which says that permutation invariant events have a zero-one law) to see that the whole sigma algebra ∩_k F_k must be the trivial one, so then E(X_1 | ∩_k F_k) = E(X_1). Alternatively, once we have concluded that such an almost sure limit exists, one could then remark by the Kolmogorov zero-one law that the limit must be a constant (for lim_{n→∞} S_n/n does not depend on finitely many of the X_n's, so any event of the type {lim S_n/n < α} must have probability 0 or 1; by taking a sup over α we can find that it must be a constant). Combining this with the above, using the fact that conditional expectations preserve the expectation, shows the constant is indeed E(X_1). □
Theorem 4.5 [Hewitt-Savage Zero-One Law]. Let X_1, X_2, ... be an infinite sequence of i.i.d. random variables. Let A be an event which is unchanged under finite permutations of the indices of the X_i's. (E.g. for every finite permutation Π, ω = (x_1, x_2, ...) ∈ A iff Π(ω) = (x_{Π(1)}, x_{Π(2)}, ...) ∈ A, i.e. Π(A) = A.) Then P(A) = 0 or 1.

Proof. We call an event "tame" if it only depends on finitely many of the X_i's. The proof is a consequence of the fact that for any ɛ > 0, any event A can be approximated by a tame event B so that P(B △ A) < ɛ. (This is completely analogous to the fact that, for the usual Lebesgue measure on R, one can approximate any measurable set S by a finite union of open intervals I_i so that λ(∪_{i=1}^{n} I_i △ S) < ɛ. This comes from the definition of the Lebesgue measure as the inf of the outer measure with open sets, and the fact that every open set is a union of countably many intervals, of which only finitely many are needed to be within ɛ/2. In the same vein, the probability measure on the infinite sequence of events is generated by the outer measure from tame events. This is usually all packaged up in the Caratheodory extension theorem.) Once we have this tame event B, depending only on X_1, ..., X_n, we let Π be the permutation that swaps 1, ..., n with n+1, ..., 2n, so that B and Π(B) are independent events. Have then:

P(A) ≈ P(A ∩ B)
= P(Π(A) ∩ Π(B))
= P(A ∩ Π(B))
≈ P(B ∩ Π(B))
= P(B) P(Π(B))
= P(B)^2
≈ P(A)^2

where each of the approximations holds within ɛ by the choice of B. Since we can do this for every ɛ > 0, we get P(A) = P(A)^2 and the result follows. □
CHAPTER 2

The Law of the Iterated Logarithm

We will prove that for a sequence of i.i.d. random variables X_1, X_2, ... with mean 0 and variance 1, and S_n = Σ_{i=1}^{n} X_i:

P( lim sup_{n→∞} S_n/√(n log(log n)) = √2 ) = 1

This result gives us finer information about these sums than the law of large numbers or the central limit theorem. We need the theory of martingales to get Doob's inequality, and then a bunch of other sneaky tricks, like the Borel-Cantelli lemmas, to get the result. We will also need a few analytic-type estimates along the way. (Actually, our proof here will only prove the case where the X_n's are ±1 with probability 1/2 each. The result can be generalized by using even finer estimates.)
1. First Half of the Law of the Iterated Logarithm

To start, we will first prove some helpful lemmas.

Lemma 1.1 [Doob's Inequality]. For a submartingale Z_n, we have for any α > 0 that:

P( max_{0≤i≤n} Z_i ≥ α ) ≤ E(|Z_n|)/α

Proof. (Taken from Rosenthal.) Let A_k be the event {Z_k ≥ α, but Z_i < α for i < k}, i.e. that the process reaches α for the first time at time k. These are disjoint events with A = ∪_k A_k = {(max_{0≤i≤n} Z_i) ≥ α}, which is the event we want. Now consider:

α P(A) = Σ_{k=0}^{n} α P(A_k)
= Σ_{k=0}^{n} E(α 1_{A_k})
≤ Σ_{k=0}^{n} E(Z_k 1_{A_k})   since Z_k ≥ α on A_k
≤ Σ_{k=0}^{n} E(E(Z_n | Z_0, Z_1, ..., Z_k) 1_{A_k})   since it's a submartingale
= Σ_{k=0}^{n} E(Z_n 1_{A_k})
= E(Z_n 1_A)
≤ E(|Z_n|)

And the result follows. □
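Since the fair ±1 walk is in particular a submartingale, Doob's inequality applies to it directly, and both sides can be estimated by simulation. A sketch I added for illustration (n, α, and trial counts are arbitrary):

```python
import random

random.seed(3)

def max_and_end(n, rng=random):
    """Running maximum and endpoint of a fair ±1 walk of length n."""
    x = best = 0
    for _ in range(n):
        x += rng.choice((-1, 1))
        best = max(best, x)
    return best, x

n, alpha, trials = 100, 10, 20000
hits = abs_end = 0.0
for _ in range(trials):
    best, end = max_and_end(n)
    hits += best >= alpha
    abs_end += abs(end)
p_hat = hits / trials                 # estimates P(max_{0<=i<=n} S_i >= alpha)
bound = (abs_end / trials) / alpha    # estimates E(|S_n|)/alpha
print(p_hat <= bound)
```

For these parameters the hitting probability is roughly a third of the Doob bound, illustrating that the bound controls the whole path by the endpoint, at the price of some slack.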
Remark. This is a "rich man's version" of Chebyshev-type inequalities, which are proved using the same trick as in lines 3 and 4 of the inequality train above. The fact that the behavior of the whole martingale can be controlled by the end point of the martingale gives us the little extra oomph we need.
Lemma 1.2 [Hoeffding's Inequality]. Let Y be a random variable so that E(Y) = 0, and a, b ∈ R so that a ≤ Y ≤ b almost surely. Then E(e^{tY}) ≤ e^{t^2(b−a)^2/8}.

Proof. Write Y as a convex combination of a and b: Y = αb + (1 − α)a where α = (Y − a)/(b − a). By convexity of e^x, we then have:

e^{tY} ≤ ((Y − a)/(b − a)) e^{tb} + ((b − Y)/(b − a)) e^{ta}

Taking expectations (and using E(Y) = 0), we have:

E(e^{tY}) ≤ (−a/(b − a)) e^{tb} + (b/(b − a)) e^{ta} = e^{g(t(b−a))}

for g(u) = −γu + log(1 − γ + γe^u) and γ = −a/(b − a). Notice g(0) = g′(0) = 0 and g″(u) ≤ 1/4 for all u. Hence by Taylor's theorem:

g(u) = g(0) + u g′(0) + (u^2/2) g″(ξ) ≤ 0 + 0 + (u^2/2)(1/4) = u^2/8

So then E(e^{tY}) ≤ e^{g(t(b−a))} ≤ e^{t^2(b−a)^2/8}. □
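For the coin-flip steps used below, the bound can be checked exactly: with Y = ±1 (so a = −1, b = 1), E(e^{tY}) = cosh(t) and the lemma claims cosh(t) ≤ e^{t²/2}. A deterministic sketch I added:

```python
import math

# Y = ±1 with probability 1/2 each: a = -1, b = 1, so t^2 (b-a)^2 / 8 = t^2/2.
# E(e^{tY}) = (e^t + e^{-t})/2 = cosh(t); Hoeffding gives cosh(t) <= e^{t^2/2}.
ok = all(math.cosh(t) <= math.exp(t * t / 2)
         for t in (x / 10 for x in range(-50, 51)))
print(ok)  # True
```

Comparing the Taylor series term by term, cosh(t) = Σ t^{2k}/(2k)! while e^{t²/2} = Σ t^{2k}/(2^k k!), and (2k)! ≥ 2^k k!, which is why the check succeeds for every t.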
Lemma 1.3. Let X_1, X_2, ... be i.i.d. with P(X_1 = ±1) = 1/2, and S_n = Σ_{k=1}^{n} X_k. Then P(max_{k≤n} S_k > λ) ≤ e^{−λ^2/2n}.

Proof. Using Doob's inequality (applied to the submartingale e^{tS_k}) and Hoeffding's inequality, for any t > 0 we have:

P(max_{k≤n} S_k > λ) = P(max_{k≤n} e^{tS_k} > e^{tλ})
≤ e^{−tλ} E(e^{tS_n})
= e^{−tλ} E(e^{tX_1})^n
≤ e^{−tλ} e^{nt^2(b−a)^2/8}

Set t = 4λ/(n(b − a)^2) to get:

P(max_{k≤n} S_k > λ) ≤ e^{−(4λ/(n(b−a)^2))λ} e^{n(4λ/(n(b−a)^2))^2 (b−a)^2/8} = e^{−2λ^2/(n(b−a)^2)}

For simple symmetric steps we have a = −1 and b = 1, so this gives the result. □
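The resulting tail bound is easy to probe numerically. Below is a sketch I added (seed, n, and λ are arbitrary choices): the empirical frequency of {max_{k≤n} S_k > λ} sits below e^{−λ²/2n}.

```python
import math
import random

random.seed(4)

def max_prefix_sum(n, rng=random):
    """Running maximum of a fair ±1 walk of length n."""
    x = best = 0
    for _ in range(n):
        x += rng.choice((-1, 1))
        best = max(best, x)
    return best

n, lam, trials = 100, 20, 20000
p_hat = sum(max_prefix_sum(n) > lam for _ in range(trials)) / trials
bound = math.exp(-lam * lam / (2 * n))   # e^{-λ²/2n} = e^{-2} ≈ 0.135 here
print(p_hat <= bound)
```

The true probability is a few times smaller than the bound, which is consistent with the Chernoff-style argument giving the right exponential rate but a loose constant.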
Theorem 1.4. Let X_1, X_2, ... be i.i.d. with P(X_1 = ±1) = 1/2, and S_n = Σ_{k=1}^{n} X_k. Then for any ɛ > 0:

P( lim sup_{n→∞} S_n/√(n log(log n)) > √(2 + ɛ) ) = 0

Or in other words, since this holds for any value of ɛ > 0:

P( lim sup_{n→∞} S_n/√(n log(log n)) ≤ √2 ) = 1

Proof. Fix some θ > 1 (the choice will be made more precise later). We will show that, with the correct choice of θ, the events A_n = {S_k > √((2 + ɛ) k log(log k)) for some k, θ^{n−1} ≤ k < θ^n} happen only finitely many times, which will show that the limsup can't be more than √(2 + ɛ). To do this it suffices to show that P(A_n) is summable, because then the Borel-Cantelli lemma will show that A_n happens finitely often with probability 1. We have (using our previous lemma):

P(A_n) = P( S_k > √((2 + ɛ) k log(log k)) for some k, θ^{n−1} ≤ k < θ^n )
≤ P( S_k > √((2 + ɛ) θ^{n−1} log(log θ^{n−1})) for some k, θ^{n−1} ≤ k < θ^n )
≤ P( max_{k≤θ^n} S_k > √((2 + ɛ) θ^{n−1} log(log θ^{n−1})) )
≤ exp( −(2 + ɛ) θ^{n−1} log(log θ^{n−1}) / (2θ^n) )
= exp( −((2 + ɛ)/(2θ)) (log(n − 1) + log(log θ)) )
≈ exp( −(1 + ɛ/2) θ^{−1} log(n − 1) )   for large n

So choosing θ < 1 + ɛ/2 gives us that (1 + ɛ/2) θ^{−1} > 1, so this is:

P(A_n) ≤ (n − 1)^{−(1+ɛ/2)θ^{−1}}

from which we see that P(A_n) is summable (it's a p-series!). By the Borel-Cantelli lemma, this means that A_n happens only finitely many times with probability 1, which is the desired result. □
2. Second Half of the Law of the Iterated Logarithm

To prove the other half, we need some more estimates.

Lemma 2.1 [Mill's Inequality]. This is an estimate concerning the tail of the Gaussian density:

(λ/(λ^2 + 1)) e^{−λ^2/2} ≤ ∫_λ^∞ e^{−y^2/2} dy ≤ (1/λ) e^{−λ^2/2}

Proof. To prove the lower bound, we find a remarkable anti-derivative:

∫_λ^∞ e^{−y^2/2} dy ≥ ∫_λ^∞ e^{−y^2/2} (y^4 + 2y^2 − 1)/(y^4 + 2y^2 + 1) dy
= [ −(y/(y^2 + 1)) e^{−y^2/2} ]_λ^∞
= (λ/(λ^2 + 1)) e^{−λ^2/2}

The upper bound is found by using the estimate y/λ ≥ 1 in the range of integration:

∫_λ^∞ e^{−y^2/2} dy ≤ ∫_λ^∞ (y/λ) e^{−y^2/2} dy = (1/λ) [ −e^{−y^2/2} ]_λ^∞ = (1/λ) e^{−λ^2/2} □
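Both sides of Mill's inequality can be checked against the exact Gaussian tail, which the Python standard library exposes through the complementary error function: ∫_λ^∞ e^{−y²/2} dy = √(π/2)·erfc(λ/√2). (A sketch I added; the grid of λ values is arbitrary.)

```python
import math

def gaussian_tail(lam):
    """Exact ∫_λ^∞ e^{-y²/2} dy via the complementary error function."""
    return math.sqrt(math.pi / 2) * math.erfc(lam / math.sqrt(2))

ok = all(
    (lam / (lam**2 + 1)) * math.exp(-lam**2 / 2)
    <= gaussian_tail(lam)
    <= (1 / lam) * math.exp(-lam**2 / 2)
    for lam in (x / 10 for x in range(1, 51))
)
print(ok)  # True
```

The lower bound gets sharp as λ grows (both sides behave like (1/λ)e^{−λ²/2} to leading order), which is exactly why it is strong enough for the second half of the proof below.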
Theorem 2.2. Let X_1, X_2, ... be i.i.d. with P(X_1 = ±1) = 1/2, and S_n = Σ_{k=1}^{n} X_k. Then for any ɛ > 0:

P( lim sup_{n→∞} S_n/√(n log(log n)) ≥ √(2 − 2ɛ) ) = 1

Or in other words, since this holds for any value of ɛ > 0:

P( lim sup_{n→∞} S_n/√(n log(log n)) ≥ √2 ) = 1

Proof. As in the proof of the other half of the law, the idea is to prove that the appropriate events happen infinitely often using the Borel-Cantelli lemmas. Fix θ > 1 (the choice will be made precise later). Let:

B_n = { S_{θ^n} − S_{θ^{n−1}} ≥ √((2 − ɛ) θ^n log(log θ^n)) }

We will show that these occur infinitely often and then show why this gives the result. Notice that the B_n's are independent, as each B_n depends only on the values of X_k for θ^{n−1} < k ≤ θ^n, so to prove that B_n happens i.o. it suffices to show, via the second Borel-Cantelli lemma, that P(B_n) is not summable. Consider:

P(B_n) = P( S_{θ^n} − S_{θ^{n−1}} ≥ √((2 − ɛ) θ^n log(log θ^n)) )
= P( S_{θ^n − θ^{n−1}} ≥ √((2 − ɛ) θ^n log(log θ^n)) )
≈ (1/√(2π)) ∫_λ^∞ e^{−y^2/2} dy,   λ = √((2 − ɛ) θ^n log(log θ^n)) / √(θ^n − θ^{n−1})

where the first equality holds using the Markov property of the sums (equivalently, look at the definition as sums of the X_i's and the fact that the X_i's are i.i.d.), and the second is coming asymptotically from the central limit theorem as θ^n − θ^{n−1} → ∞. Now use Mill's inequality with this λ to get:

P(B_n) ≥ (1/√(2π)) (λ/(λ^2 + 1)) e^{−λ^2/2} = (1/√(2π)) (1/(λ + λ^{−1})) e^{−λ^2/2}
But now notice that λ = √((2 − ɛ) θ^n log(log θ^n)) / √(θ^n − θ^{n−1}) ≈ √(2 − ɛ) √(log n)/√(1 − θ^{−1}), so λ^2 ≈ (2 − ɛ)(log n)/(1 − θ^{−1}). So our estimate is:

P(B_n) ≥ C (√(log n) + (√(log n))^{−1})^{−1} exp( −((2 − ɛ)/(2(1 − θ^{−1}))) log n )
≥ C n^{−(1−ɛ/2)/(1−θ^{−1})} (log n)^{−1/2}

where the C's are some constants. By choosing θ large enough, (1 − ɛ/2)/(1 − θ^{−1}) < 1, and this will not be summable! We have then that B_n occurs infinitely often.
Now, we will show that these events $B_n$ occurring infinitely often is enough to see that $S_{\theta^n} \ge \sqrt{(2-2\epsilon)\theta^n \log(\log \theta^n)}$ infinitely often too. To do this we will use the first half of the law of the iterated logarithm we already proved, namely that for any $\eta > 0$ the events $\{S_k > \sqrt{(2+\eta)k\log(\log k)}\}$ happen only finitely often with probability 1. By symmetry, we'll have that the events $\{S_k < -\sqrt{(2+\eta)k\log(\log k)}\}$ happen only finitely often too. Hence the events $A_n = \{S_{\theta^{n-1}} < -\sqrt{(2+\eta)\theta^{n-1}\log(\log \theta^{n-1})}\}$ happen only finitely often with probability 1. Now, since the $B_n$'s occur infinitely often with probability 1 and the $A_n$'s occur only finitely often with probability 1, the events $B_n \cap A_n^c$ will occur infinitely often with probability 1 too. This gives us the infinite sequence we need, for on the event $B_n \cap A_n^c$ we have the inequalities:
$$S_{\theta^n} - S_{\theta^{n-1}} \ge \sqrt{(2-\epsilon)\theta^n\log(\log\theta^n)}$$
$$S_{\theta^{n-1}} \ge -\sqrt{(2+\eta)\theta^{n-1}\log(\log\theta^{n-1})}$$
Hence, with probability 1, we have that for infinitely many values of $n$:
$$S_{\theta^n} \ge \sqrt{(2-\epsilon)\theta^n\log(\log\theta^n)} + S_{\theta^{n-1}}$$
$$\ge \sqrt{(2-\epsilon)\theta^n\log(\log\theta^n)} - \sqrt{(2+\eta)\theta^{n-1}\log(\log\theta^{n-1})}$$
$$\ge \sqrt{(2-\epsilon)\theta^n\log(\log\theta^n)} - \sqrt{\frac{2+\eta}{\theta}\,\theta^{n}\log(\log\theta^{n})}$$
$$= \left(\sqrt{2-\epsilon} - \sqrt{\frac{2+\eta}{\theta}}\right)\sqrt{\theta^n\log(\log\theta^n)}$$
So by fixing $\eta$ (any choice will do) and then choosing $\theta$ large enough, we can make the coefficient $\sqrt{2-\epsilon} - \sqrt{\frac{2+\eta}{\theta}} \ge \sqrt{2-2\epsilon}$. (Note that this doesn't disrupt our previous choice of $\theta$, because that too was a choice to make $\theta$ large, so we can always find $\theta$ big enough to suit both our needs.) We then have that for infinitely many $n$:
$$\frac{S_{\theta^n}}{\sqrt{\theta^n\log(\log\theta^n)}} \ge \sqrt{2-2\epsilon}$$
So then:
$$P\left(\limsup_{n\to\infty} \frac{S_n}{\sqrt{n\log(\log n)}} \ge \sqrt{2-2\epsilon}\right) = 1$$
The two halves of the law of the iterated logarithm give the full result:
$$P\left(\limsup_{n\to\infty} \frac{S_n}{\sqrt{n\log(\log n)}} = \sqrt{2}\right) = 1$$
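As a small numerical aside (my own illustration, not from the course), the statement can be probed with a simple $\pm 1$ random walk. Convergence in the law of the iterated logarithm is notoriously slow, so at any accessible $n$ the running maximum of $S_n/\sqrt{n\log\log n}$ only loosely approaches $\sqrt{2}$; the sketch below just shows the statistic stays in the right ballpark. The function name `lil_statistic` is mine.

```python
import math
import random

random.seed(42)

def lil_statistic(n_steps):
    """Running max over n of S_n / sqrt(n log log n) for a +-1 walk."""
    s, best = 0, 0.0
    for n in range(1, n_steps + 1):
        s += random.choice((-1, 1))
        if n >= 10:  # need log log n > 0
            best = max(best, s / math.sqrt(n * math.log(math.log(n))))
    return best

stat = lil_statistic(200_000)
print(stat)  # compare with the a.s. limsup sqrt(2) ~ 1.414
```

A serious comparison would need astronomically large $n$, since the normalization grows like $\sqrt{\log\log n}$.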
CHAPTER 3<br />
Ergodic Theorem<br />
1. Motivation<br />
The study of ergodic theory was first motivated by statistical mechanics, where one is interested in the long-term averages of systems. For example, say we have some particles with positions $Q(t)$ and momenta $P(t)$ at time $t$. Let $f$ be a function on this state space; for example $f$ might be the pressure, temperature, or some other macroscopic variable. Can we find a distribution $G$ so that:
$$\lim_{T\to\infty} \frac{1}{T}\int_0^T f(Q(s), P(s))\,ds = \int f\,dG\,?$$
Gibbs et al. worked on this problem, and it turns out that $G = \frac{1}{Z}e^{-H/kT}$, with $Z$ the partition function, $H$ the Hamiltonian, $T$ the temperature, and $k$ Boltzmann's constant, has this property! These types of long-term averages can be very useful. We will start with a simple example.
Example 1.1. Let $\Omega = [0,1) = \{\theta : 0 \le \theta < 1\}$, where we think of $\Omega$ as a circle with perimeter 1 (and $\theta$ the position on the circle). For some fixed angle $\omega$, let $T : \Omega \to \Omega$ be rotation by $\omega$, that is $T(\theta) = \theta + \omega \bmod 1$. This is clearly measure preserving in the sense that for any set $B$ we have $m(B) = m(T^{-1}(B))$, where $m$ is the usual Lebesgue measure. Could it be that:
$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = \int_0^1 f(s)\,ds\,?$$
If $\omega$ is rational, this doesn't have a chance, because $T^n$ eventually cycles back to the identity, so $T^n x$ will only sample finitely many points. However, if $\omega$ is irrational, this is true! We can prove it in this case using Fourier analysis. When $f(x) = e^{2\pi i m x}$ for $m \in \mathbb{Z}$, $m \ne 0$, we have the geometric series:
$$\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = \frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m(x+n\omega)} = \frac{1}{N}\,e^{2\pi i m x}\,\frac{e^{2\pi i m N\omega}-1}{e^{2\pi i m\omega}-1} \to 0 = \int_0^1 f(s)\,ds$$
where the fact that $\omega$ is irrational ensures that $e^{2\pi i m\omega} - 1 \ne 0$. In the case $m = 0$, $f$ is constant, so of course the result holds. Now for any $f \in C^2(\Omega)$, we can expand $f$ as a Fourier series to see that the result holds. This lets us calculate, for example: if $f = 1_{(a,b)}$, notice that
$$\lim_{N\to\infty} \frac{\#\{k \le N : x + k\omega \in (a,b)\}}{N} = b - a$$
since $\frac{\#\{k\le N : x+k\omega\in(a,b)\}}{N} = \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$. By approximating $f$ by $C^2$ functions (in the $L^1$ sense) from above and below, and applying the limit calculated above, we get the result.
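This equidistribution is easy to check numerically (my own quick illustration, not from the lecture): for an irrational angle such as $\omega = \sqrt{2}-1$, the orbit $x + n\omega \bmod 1$ should land in $(a,b)$ with frequency $b-a$.

```python
import math

omega = math.sqrt(2) - 1  # an irrational rotation angle
x0, a, b, N = 0.2, 0.3, 0.7, 100_000

# Birkhoff average of f = 1_{(a,b)} along the rotation orbit
hits = sum(a < (x0 + n * omega) % 1 < b for n in range(N))
print(hits / N)  # should be close to b - a = 0.4
```

For angles with bounded continued-fraction entries, like $\sqrt{2}-1$, the discrepancy decays like $\log N / N$, so the agreement at $N = 10^5$ is already very tight.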
Is there a way we can do this kind of thing using probability methods (rather than Fourier)? The next result is a nice theorem in this direction.
2. Birkhoff’s Theorem<br />
Theorem 2.1. [Birkhoff–Khinchin Ergodic Theorem] Say $(\Omega, \mathcal{F}, P)$ is a probability space. Suppose $T : \Omega \to \Omega$ is a measure preserving map, in the sense that $P(T^{-1}(B)) = P(B)$ for all $B \in \mathcal{F}$. Let $\mathcal{F}_0 = \{A \in \mathcal{F} : T^{-1}A = A \text{ a.e.}\}$ be the field of $T$-invariant events. For $f : \Omega \to \mathbb{R}$ a random variable with $E(|f|) < \infty$, we have almost surely:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f \mid \mathcal{F}_0)$$
Corollary 2.2. In the case that $\mathcal{F}_0$ is the trivial field, $E(f\mid\mathcal{F}_0) = E(f)$ is a constant, so this is exactly the thing we had above. This happens precisely when $T^{-1}A = A \Rightarrow P(A) = 0$ or $1$. In this case we say that the map $T$ is "ergodic".
The proof of this theorem relies on the following lemma.<br />
Lemma 2.3. [Maximal Ergodic Lemma] Say $(\Omega, \mathcal{F}, P)$ is a probability space. Suppose $T : \Omega \to \Omega$ is a measure preserving map, in the sense that $P(T^{-1}(B)) = P(B)$ for all $B \in \mathcal{F}$. Say $f : \Omega \to \mathbb{R}$ is a random variable with $E(|f|) < \infty$. Let $S_n = \sum_{k=0}^{n-1} f(T^k x)$ and let $A = \{\sup_{n\ge 1} S_n > 0\}$ be the event that this is positive at some point. Then:
$$E(f 1_A) = \int_A f\,dP \ge 0$$
Proof. Define $f^+(x) = f(Tx)$ and let $m_n = \max\{0, S_1, S_2, \ldots, S_n\}$; define $m_n^+$ in the same way, replacing $f$ by $f^+$ in the definition of the $S_k$. Notice that by this definition the $m_n$'s are non-decreasing, and that the event $A = \{\sup_{n\ge1} S_n > 0\}$ is the same as saying $m_n > 0$ for $n$ large enough. For this reason, it will be enough to restrict our attention to the events $\{m_n > 0\}$. Notice that if we are on the event $\{m_n > 0\}$ then we have:
$$S_1 + m_n^+ = S_1 + \max\{0, S_1^+, S_2^+, \ldots, S_n^+\}$$
$$= S_1 + \max\{0, S_2 - S_1, S_3 - S_1, \ldots, S_{n+1} - S_1\}$$
$$= \max\{S_1, S_2, \ldots, S_{n+1}\}$$
$$= m_{n+1}$$
where we used $S_n^+ = \sum_{k=0}^{n-1} f(T^k Tx) = \sum_{k=1}^{n} f(T^k x) = S_{n+1} - S_1$ in the second equality, and we used that we're on the event $\{m_n > 0\}$ in the last step to see the last equality. We have then:
$$E\left(f 1_{\{m_n>0\}}\right) = E\left(S_1 1_{\{m_n>0\}}\right)$$
$$= E\left((m_{n+1} - m_n^+)\, 1_{\{m_n>0\}}\right)$$
$$= E\left(m_{n+1} 1_{\{m_n>0\}}\right) - E\left(m_n^+ 1_{\{m_n>0\}}\right)$$
$$\ge E\left(m_{n+1} 1_{\{m_n>0\}}\right) - E\left(m_n^+\right)$$
The last inequality holds since on the event $\{m_n = 0\}$ we have $S_1 \le 0$, so $m_n^+ = m_{n+1} - S_1 \ge m_{n+1} \ge 0$, so $E\left(m_n^+ 1_{\{m_n=0\}}\right) \ge 0$. Hence $E(m_n^+) = E\left(m_n^+ 1_{\{m_n>0\}}\right) + E\left(m_n^+ 1_{\{m_n=0\}}\right) \ge E\left(m_n^+ 1_{\{m_n>0\}}\right)$. From here, we note that $E(m_n^+) = E(m_n)$ since the map $T$ is measure preserving, and the only difference between $m_n^+$ and $m_n$ is the map $x \to Tx$. We then have:
$$E\left(f 1_{\{m_n>0\}}\right) \ge E\left(m_{n+1} 1_{\{m_n>0\}}\right) - E(m_n)$$
$$= E\left(m_{n+1} 1_{\{m_n>0\}}\right) - E\left(m_n 1_{\{m_n>0\}}\right)$$
$$= E\left((m_{n+1} - m_n) 1_{\{m_n>0\}}\right)$$
$$\ge 0$$
The second equality holds since $m_n \ge 0$ always holds (so $m_n 1_{\{m_n>0\}} = m_n$), and the last inequality holds since the $m_n$'s are non-decreasing. Finally, to get the result, notice that $\{m_n > 0\}$ increases to $\{\sup S_n > 0\}$, so by a monotone convergence theorem result, we have:
$$E\left(f 1_{\{\sup S_n > 0\}}\right) = \lim_{n\to\infty} E\left(f 1_{\{m_n>0\}}\right) \ge 0 \qquad \square$$
With this in hand, we can prove Birkhoff’s theorem:<br />
Theorem 2.4. [Birkhoff–Khinchin Ergodic Theorem] Say $(\Omega, \mathcal{F}, P)$ is a probability space. Suppose $T : \Omega \to \Omega$ is a measure preserving map, in the sense that $P(T^{-1}(B)) = P(B)$ for all $B \in \mathcal{F}$. Let $\mathcal{F}_0 = \{A \in \mathcal{F} : T^{-1}A = A \text{ a.e.}\}$ be the field of $T$-invariant events. For $f : \Omega \to \mathbb{R}$ a random variable with $E(|f|) < \infty$, we have almost surely:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f \mid \mathcal{F}_0)$$
Proof. Firstly, we will argue that $\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$ converges a.s. to some random variable, and then we (as usual) check that it has the two defining properties of conditional expectation.

Define $S_N = \sum_{n=0}^{N-1} f(T^n x)$ as before, so that we are interested in the averages $S_n/n$. Suppose by contradiction that $S_n/n$ does not converge a.s. By the usual trick with rational numbers, we can then find $a, b \in \mathbb{R}$ so that the event $A = \left\{\liminf \frac{S_n}{n} < a < b < \limsup \frac{S_n}{n}\right\}$ has $P(A) > 0$. Notice moreover that $A$ is a $T$-invariant event, i.e. $x \in A \Rightarrow Tx \in A$, since applying $T$ shifts the terms in $S_n$ by one, which does not affect the limsup or liminf of $S_n/n$. (Indeed, these don't depend on any finite number of the terms!) For this reason, we may define a new probability measure on the set $A$: we think of $(A, \tilde{\mathcal{F}}, \tilde{P})$ as a new probability space, with $\tilde{\mathcal{F}} = \{A \cap B : B \in \mathcal{F}\}$ and $\tilde{P}(E) = P(E)/P(A)$. The fact that $A$ is $T$-invariant means that $T^n x \in A$ whenever $x \in A$, so we can still talk about $S_n$ and so on on this space, and the fact that $P(A) > 0$ means that there is no problem re-normalizing like this. So we now have $\tilde{P}(A) = 1$: $A$ is the whole space.

With this new space as our framework, we let $f'(\omega) = f(\omega) - b$; then we get new sums $S'_n$ with $\frac{S'_n}{n} = \frac{S_n}{n} - b$, and then $A = \left\{\liminf \frac{S'_n}{n} < a - b < 0 < \limsup \frac{S'_n}{n}\right\}$. Notice then that $\tilde{P}\left(\limsup \frac{S'_n}{n} > 0\right) \ge \tilde{P}(A) = 1$, so $\tilde{P}(\{\sup S'_n > 0\}) = 1$ is the whole space $A$. We then have by the maximal ergodic lemma that:
$$0 \le \tilde{E}\left(f' 1_{\{\sup S'_n > 0\}}\right) = \tilde{E}(f') = \tilde{E}(f) - b$$
The same argument on $f''(\omega) = a - f(\omega)$ gives:
$$0 \le \tilde{E}\left(f'' 1_{\{\sup S''_n > 0\}}\right) = a - \tilde{E}(f)$$
But this is a contradiction now, for we have:
$$b \le \tilde{E}(f) \le a$$
which is impossible since $a < b$. This contradiction means that it's impossible to separate the liminf and the limsup like this; in other words, we have almost sure convergence.
Next it remains only to see that the random variable that this converges to is $E(f\mid\mathcal{F}_0)$. Let us denote $\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$ by $\bar{f}$. We must show that $\bar{f}$ is $\mathcal{F}_0$-measurable and that $E(\bar{f}1_A) = E(f 1_A)$ for all $A \in \mathcal{F}_0$. Firstly, notice that applying $x \to Tx$ does not change $\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$, as it only affects finitely many terms. This shows that $\bar{f}(x) = \bar{f}(Tx)$, which is the reason why $\bar{f}$ is $\mathcal{F}_0$-measurable. More formally, to see that $\bar{f}^{-1}(B)$ is $T$-invariant for any Borel set $B$, just write out the definitions (using $\bar{f}(Tx) = \bar{f}(x)$ in the middle step):
$$T^{-1}\left(\bar{f}^{-1}(B)\right) = \left\{x \in \Omega : \bar{f}(Tx) \in B\right\} = \left\{x \in \Omega : \bar{f}(x) \in B\right\} = \bar{f}^{-1}(B)$$
So indeed $\bar{f}^{-1}(B) \in \mathcal{F}_0$, meaning $\bar{f}$ is $\mathcal{F}_0$-measurable. To see that $\bar{f}$ has the right expectation values, we first prove the result for indicator functions and then use the "ladder" of integration to get the result we need. Consider that for sets $A \in \mathcal{F}_0$
and $B \in \mathcal{F}$ we have:
$$\int_A 1_B(x)\,dP = \int 1_A(x)1_B(x)\,dP$$
$$= \int 1_A(Tx)1_B(Tx)\,dP$$
$$= \int 1_A(x)1_B(Tx)\,dP$$
$$= \int_A 1_B(Tx)\,dP$$
where the second equality uses the fact that $P$ is $T$-invariant and the third equality uses the fact that $A \in \mathcal{F}_0 \Rightarrow 1_A(x) = 1_A(Tx)$. Since $\int_A 1_B(x)\,dP = \int_A 1_B(Tx)\,dP$, by following along with the construction of the Lebesgue integral starting from indicator functions, we conclude that $\int_A f(x)\,dP = \int_A f(Tx)\,dP$ for any integrable $f$.
Applying this inductively, we see that for any $N \in \mathbb{N}$:
$$\int_A \frac{1}{N}\sum_{k=0}^{N-1} f(T^k x)\,dP = \int_A f(x)\,dP$$
When $f$ is bounded, we can take the limit as $N \to \infty$ and use the bounded convergence theorem to conclude:
$$\int_A \bar{f}\,dP = \lim_{N\to\infty}\int_A \frac{1}{N}\sum_{k=0}^{N-1} f(T^k x)\,dP = \int_A f(x)\,dP$$
For general $f$, we can use a truncation argument and the monotone convergence theorem to finish the result. $\square$
Example 2.5. If we look at our first example of rotation by an angle $\omega$, we concluded (using Fourier analysis) that when $\omega$ is irrational and $f$ has a Fourier series:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = \int_0^1 f(s)\,ds$$
By Birkhoff's theorem, we know that:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f\mid\mathcal{F}_0)$$
So we conclude that $\int_0^1 f(s)\,ds = E(f\mid\mathcal{F}_0)$. Since this holds for every $f$, it must be that $\mathcal{F}_0$ is the trivial field. Notice that this improves our result a little bit, since we may now apply it to any integrable $f$, not just $f$ which are $C^2$.
Example 2.6. In the first example, we were essentially looking at $\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m(x+n\omega)}$. Now let's ask about the series $\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m(2^n x)}$. This is harder to handle with Fourier techniques, but we can still use Birkhoff's theorem. Again take $\Omega = [0,1)$ to be our space, but instead of thinking of this as a circle, think of it as the space of binary sequences (which are the binary expansions of each number between 0 and 1), $\Omega = \{0.e_1e_2\ldots : e_i \in \{0,1\}\}$. Let $T : \Omega \to \Omega$ by $T(0.e_1e_2e_3\ldots) = 0.e_2e_3\ldots$. This translates to $T(x) = 2x \bmod 1$ (this is the reason that applying it $n$ times gives $2^n x$). It's not hard to verify that this is measure preserving. By the Kolmogorov zero-one law, the field $\mathcal{F}_0$ of $T$-invariant events must be the trivial field, for by applying $T$ $N$ times, we see that an event $A \in \mathcal{F}_0$ cannot depend on the first $N$ digits $e_1, e_2, \ldots, e_N$. Since this works for any $N$, $\mathcal{F}_0$ is a subset of the tail field, which by K-0-1 is trivial. Hence, by Birkhoff's Theorem, we have:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f\mid\mathcal{F}_0) = E(f) = \int_0^1 f\,dP$$
For the Fourier basis function $f(x) = e^{2\pi i m x}$ with $m \ne 0$, this is saying that almost surely:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m(2^n x)} = 0$$
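A numerical illustration of this ergodic average (my own, not from the lecture): iterating $T(x) = 2x \bmod 1$ in floating point collapses to $0$ after about 53 steps, since each application discards one bit of the mantissa. Instead we can generate the orbit directly from random binary digits, reading each $T^n x$ off a 53-bit sliding window; Birkhoff's theorem predicts the orbit spends a fraction $b-a$ of its time in $(a,b)$.

```python
import random

def doubling_orbit(bits, window=53):
    """Orbit of T(x) = 2x mod 1 started at x = 0.b1 b2 b3 ...,
    with each T^n x read off from a sliding window of digits."""
    n_pts = len(bits) - window
    x = sum(b / 2 ** (i + 1) for i, b in enumerate(bits[:window]))
    orbit = []
    for n in range(n_pts):
        orbit.append(x)
        # drop the leading binary digit, append a new trailing one
        x = (x - bits[n] / 2) * 2 + bits[n + window] / 2 ** window
    return orbit

random.seed(0)
bits = [random.randint(0, 1) for _ in range(100_000 + 53)]
orbit = doubling_orbit(bits)
freq = sum(0.25 < x < 0.5 for x in orbit) / len(orbit)
print(freq)  # Birkhoff predicts close to 1/4 for the cell (1/4, 1/2)
```

Drawing the digits i.i.d. uniform is exactly sampling $x$ from Lebesgue measure, which is what the K-0-1 argument above is about.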
Example 2.7. We can use Birkhoff's theorem to give yet another proof of the strong law of large numbers. Let $(X_1, X_2, \ldots)$ be a sequence of i.i.d. random variables with finite mean and let $\Omega$ be the probability space for these sequences. Define $T : \Omega \to \Omega$ by $T(x_1, x_2, x_3, \ldots) = (x_2, x_3, \ldots)$. Notice that since the $X$'s are i.i.d., this is measure preserving. As in the previous example, the Kolmogorov zero-one law tells us the field $\mathcal{F}_0$ of $T$-invariant events is trivial. Let $f(x_1, x_2, \ldots) = x_1$. By Birkhoff's theorem:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N} x_n = \lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f\mid\mathcal{F}_0) = E(f) = E(X_1)$$
which is exactly the strong law of large numbers.
3. Continued Fractions<br />
One way to specify a number $x \in [0,1)$ is the binary expansion. Each binary digit tells you "which half" of the number line $x$ is in: e.g. the first digit says if it is in $[0, \frac{1}{2})$ or $[\frac{1}{2}, 1)$, and then we treat that interval like $[0,1)$ and start over again for the next digit. Another way to play this game would be to draw the harmonic series $\frac{1}{n}$ on the number line, and then specify which interval $[\frac{1}{n+1}, \frac{1}{n})$ the number is in. Call this first number $n_1$; we'll have then that $\frac{1}{n_1+1} \le x < \frac{1}{n_1}$. From this we may conclude that:
$$x = \frac{1}{n_1 + \epsilon_1}$$
for some $\epsilon_1 \in [0,1)$. Play the same game again for $\epsilon_1$, and we get:
$$x = \cfrac{1}{n_1 + \cfrac{1}{n_2 + \epsilon_2}}$$
Continuing this indefinitely gives us the "continued fraction expansion" for $x$. Since this is hard to write, we will adopt the convention that $x = [n_1; n_2; n_3; \ldots]$ means the continued fraction expansion with entries $n_1$, then $n_2$, and so on.
Proposition 3.1. If the sequence $[n_1; n_2; n_3; \ldots]$ is periodic (that is, it repeats after some finite number of steps), then $x = [n_1; n_2; n_3; \ldots]$ is algebraic.

Proof. The easiest way to see this is an example. Suppose we look at $x = [1; 1; 1; \ldots]$. Then:
$$x = \cfrac{1}{1 + \cfrac{1}{1 + \ddots}} = \frac{1}{1+x}$$
But then $x^2 + x - 1 = 0$, so $x$ is the root of a quadratic equation. In this case $x = \frac{\sqrt{5}-1}{2}$ is the golden section. In general, if the continued fraction expansion is periodic after $N$ steps, then $x$ satisfies $x = \frac{Ax+B}{Cx+D}$ for some constants $A, B, C, D$ (a fractional linear transformation, as in the next section), so $x$ is again the root of a quadratic polynomial. $\square$
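The expansion procedure above translates directly into code (a sketch of my own; the function name `cf_expansion` is mine): the entries are read off by iterating $n_k = \lfloor 1/x \rfloor$, $x \mapsto \frac{1}{x} \bmod 1$. Exact arithmetic with `Fraction` shows that rational numbers have finite expansions, while floating point recovers the all-ones expansion of the golden section for the first several entries.

```python
import math
from fractions import Fraction

def cf_expansion(x, n_terms=10):
    """First n_terms continued-fraction entries of x in (0,1),
    read off by iterating n_k = floor(1/x), x -> 1/x mod 1."""
    entries = []
    for _ in range(n_terms):
        if x == 0:
            break  # rational numbers terminate
        n = int(1 / x)
        entries.append(n)
        x = 1 / x - n
    return entries

print(cf_expansion(Fraction(16, 113)))       # [7, 16] -- finite, since 16/113 is rational
print(cf_expansion((math.sqrt(5) - 1) / 2))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

With floats the entries eventually go wrong, since the shift map roughly doubles the rounding error at each step; ten entries is well within the safe range here.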
Definition 3.2. We write $x = [n_1; n_2; n_3; \ldots]$ to mean:
$$x = \cfrac{1}{n_1 + \cfrac{1}{n_2 + \cfrac{1}{n_3 + \cdots}}}$$
Problem 3.3. Let $T : (0,1) \to (0,1)$ by $T([n_1; n_2; \ldots]) = [n_2; n_3; \ldots]$. This is the map $T(x) = \frac{1}{x} \bmod 1$. Is there a probability density $P$ we can put on $(0,1)$ so that $T$ will be measure preserving?

Proof. [Gauss] The probability density $dP = \frac{1}{\log 2}\,\frac{1}{1+x}\,dx$ will do the trick! Indeed, just notice that by the definition of $T$:
$$T^{-1}(a,b) = \bigcup_{n=1}^{\infty}\left(\frac{1}{b+n},\, \frac{1}{a+n}\right)$$
So then the requirement $P(T^{-1}(a,b)) = P(a,b)$ gives (using $\rho$ as a probability density function):
$$\int_a^b \rho(x)\,dx = \sum_{n=1}^{\infty}\int_{\frac{1}{b+n}}^{\frac{1}{a+n}} \rho(x)\,dx$$
Taking the derivative w.r.t. $b$ here gives:
$$\rho(x) = \sum_{n=1}^{\infty}\rho\left(\frac{1}{x+n}\right)\frac{1}{(x+n)^2}$$
This is hard to solve, but it's easy to verify that $\rho(x) = \frac{1}{1+x}$ works, since the LHS is $\frac{1}{1+x}$ while the RHS is:
$$\sum_{n=1}^{\infty}\rho\left(\frac{1}{x+n}\right)\frac{1}{(x+n)^2} = \sum_{n=1}^{\infty}\frac{1}{1+\frac{1}{x+n}}\cdot\frac{1}{(x+n)^2}$$
$$= \sum_{n=1}^{\infty}\frac{x+n}{1+x+n}\cdot\frac{1}{(x+n)^2}$$
$$= \sum_{n=1}^{\infty}\frac{1}{(x+n+1)(x+n)}$$
$$= \sum_{n=1}^{\infty}\left(\frac{1}{x+n} - \frac{1}{x+n+1}\right)$$
$$= \frac{1}{x+1}$$
which is a telescoping sum, so we can evaluate it exactly. The factor of $\frac{1}{\log 2}$ normalizes $\rho$ so that $\int_0^1 \rho(x)\,dx = 1$. Indeed:
$$\int_0^1 \frac{1}{\log 2}\cdot\frac{1}{x+1}\,dx = \frac{1}{\log 2}\left[\log(1+x)\right]_0^1 = \frac{\log 2 - \log 1}{\log 2} = 1 \qquad \square$$
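The invariance claim is also easy to test by Monte Carlo (my own sanity check, not from the notes): sample from the Gauss measure by inverting its CDF $F(x) = \log_2(1+x)$, apply $T$ once, and compare a distributional statistic before and after.

```python
import math
import random

random.seed(1)

# Sample from the Gauss measure dP = dx / ((1+x) log 2) by inverse CDF:
# F(x) = log2(1 + x) on [0,1], so F^{-1}(u) = 2**u - 1 for u uniform.
xs = [2 ** random.random() - 1 for _ in range(200_000)]
Txs = [(1 / x) % 1 for x in xs if x > 0]  # Gauss map T(x) = 1/x mod 1

p_before = sum(x < 0.5 for x in xs) / len(xs)
p_after = sum(x < 0.5 for x in Txs) / len(Txs)
print(p_before, p_after)  # both should be close to log2(3/2) ~ 0.585
```

If $T$ preserves the Gauss measure, $P(X < \tfrac12)$ must be the same for $X$ and $T(X)$, namely $F(\tfrac12) = \log_2\tfrac32$; that is what the two printed numbers check.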
Theorem 3.4. The shift map $T : [0,1] \to [0,1]$ given by $T([n_1; n_2; \ldots]) = [n_2; n_3; \ldots]$ is ergodic.

Proof. Fix $N \in \mathbb{N}$ and a list of integers $n_1, n_2, \ldots, n_N$. Now define:
$$n(x) := \cfrac{1}{n_1 + \cfrac{1}{n_2 + \cdots + \cfrac{1}{n_N + x}}}$$
For each choice of $n_1, n_2, \ldots, n_N$, the image of $[0,1]$ under $n(x)$ is an interval whose endpoints are $n(0)$ and $n(1)$. As $N$ increases, the interval $[n(0), n(1)]$ gets smaller and smaller. An easy proof by induction shows that $n(x)$ can be written as:
$$n(x) = \frac{Ax + B}{Cx + D}$$
for $A, B, C, D \in \mathbb{R}$ with $0 \le A \le B$ and $1 \le C \le D$ and with $AD - BC = \pm 1$, where the sign depends on the parity of $N$. Now, let $I = [n(0), n(1)]$ and let $J = (a, b)$ be an arbitrary interval.
Claim. $|I \cap T^{-N}(J)| \ge \frac{1}{2}|I||J|$ holds for all $N \in \mathbb{N}$.

Proof. Take $x \in I \cap T^{-N}(J)$. Notice that $x \in I$ if and only if $x = n(y)$ for some $y \in [0,1]$, by definition of $I$. So we can write $x$ as a continued fraction $x = [n_1; n_2; \ldots; n_{N-1}; n_N + y]$. On the other hand, $x \in T^{-N}(J)$ if and only if $T^N x \in J$. But $T^N x = T^N([n_1; n_2; \ldots; n_{N-1}; n_N + y]) = y$ by definition of $T$. This shows that $x \in T^{-N}(J)$ if and only if $y \in J$.

We have then, using the observation that $n$ is a fractional linear transformation:
$$I \cap T^{-N}(J) = \{n(y) : y \in J\} = [n(a), n(b)]$$
This shows:
$$|I \cap T^{-N}(J)| = |n(b) - n(a)|$$
$$= \left|\frac{Ab+B}{Cb+D} - \frac{Aa+B}{Ca+D}\right|$$
$$= \left|\frac{b-a}{(Ca+D)(Cb+D)}\right|$$
$$\ge \frac{|b-a|}{(C+D)^2} \quad \text{since } a, b < 1$$
$$\ge \frac{1}{2}|b-a||I|$$
$$= \frac{1}{2}|J||I|$$
The last inequality holds by writing out $|I|$ and using $AD - BC = \pm 1$ and the fact that $1 \le C \le D$, so that $C + D \le 2D$:
$$|I| = |n(0) - n(1)| = \left|\frac{A+B}{C+D} - \frac{B}{D}\right| = \frac{|AD-BC|}{D(C+D)} = \frac{1}{D(C+D)} \le \frac{2}{(C+D)^2}$$
Finally, to see that $T$ is ergodic, take any Borel set $B \in \mathcal{F}$. By approximating $B$ by intervals, the inequality from the claim still holds:
$$\left|I \cap T^{-N}B\right| \ge \frac{1}{2}|I||B|$$
Take any set $A$ now. Again, by approximating $A$ by intervals $I$, we can use the above inequality to get:
$$\left|A \cap T^{-N}B\right| \ge \frac{1}{2}|A||B|$$
This gives what we want, for if $B$ is $T$-invariant, we have $T^{-N}B = B$ for every $N$. The choice $A = B^c$ in the above gives:
$$\frac{1}{2}|B||B^c| \le \left|B^c \cap T^{-N}B\right| = |B^c \cap B| = 0$$
So $|B||B^c| = 0$, which is only possible if $|B| = 1$ or $|B| = 0$. This says that all $T$-invariant sets are either measure zero or full measure; in other words, $T$ is ergodic. $\square$
CHAPTER 4<br />
Brownian Motion<br />
1. Motivation<br />
Our aim is to discuss a stochastic process on [0, 1] (that is a probability space<br />
(Ω, F, P) and a collection of random variables Bt(ω), for t ∈ [0, 1]) which has the<br />
following properties:<br />
• B0(ω) = 0 for every ω ∈ Ω<br />
• Fix a $T \in [0,1]$, and define for $t$ with $T + t \le 1$: $B^+_t = B_{T+t} - B_T$. We want $B^+_t$ to look statistically identical to $B_t$. (This says the process has some sort of "time homogeneous" property.)
• We want $B^+_t$ as defined above to be independent of $(B_s)_{s \le T}$. (This says that the process has some sort of Markov property.)
• $E(B_t^2) < \infty$
• $E(B_t) = 0$
• Bt(ω) is continuous for every (or almost every) ω ∈ Ω.<br />
This process is supposed to describe something like a piece of dust that you can see sometimes wiggling about in a sunbeam. Notice that the time homogeneous and Markov properties together mean we can write:
$$B_T = \sum_{k=1}^{N}\left(B_{\frac{kT}{N}} - B_{\frac{(k-1)T}{N}}\right)$$
which is a sum of many independent increments. By the central limit theorem, this is suggesting that $B_t \sim N(0, \sigma^2)$ is normally distributed (to get this more rigorously would take a bit more work, since the above set-up is not exactly the set-up for the central limit theorem). This is often taken as an "axiom":
• $B_t \sim N(0, \sigma^2)$
A quick calculation shows that $\sigma^2 \propto t$. Let $f(t) = \sigma^2$ be the variance of $B_t$. Then:
$$f(t+s) = E\left((B_{t+s})^2\right)$$
$$= E\left((B_{t+s} - B_s + B_s)^2\right)$$
$$= E\left((B_{t+s} - B_s)^2\right) + E\left(B_s^2\right) + 2E\left((B_{t+s} - B_s)B_s\right)$$
$$= f(t) + f(s) + 2\cdot 0$$
where we used the time homogeneous property and the Markov property. This functional relation (for the monotone function $f$) means that $f(t)$ must be linear! $f(0) = 0$ holds since $B_0$ is known exactly. Hence $f(t) = c\cdot t$. It doesn't hurt to take $c = 1$, since anything we get can be rescaled for other values of $c$ if need be. Sometimes this is taken as the "axiom":
(1) $B_t \sim N(0, t)$
The following resulting property also turns out to be very useful:

Proposition 1.1. $E(B_a B_b) = \min(a, b)$

Proof. Suppose w.l.o.g. that $a < b$. Then: $E(B_aB_b) = E(B_a(B_b - B_a + B_a)) = E(B_a(B_b - B_a)) + E(B_a^2) = 0 + a = \min(a, b)$. $\square$
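Both the variance computation and this covariance relation can be spot-checked by Monte Carlo (my own sketch, not part of the notes), approximating the Brownian path by a rescaled $\pm 1$ random walk, $B_t \approx S_{\lfloor nt \rfloor}/\sqrt{n}$. The function name `sample_path` is mine.

```python
import math
import random

random.seed(7)

def sample_path(n, times):
    """One approximate Brownian path: B_t ~ S_{floor(n t)} / sqrt(n)
    for a +-1 random walk S_k. Returns B_t for each t in times
    (assumed increasing)."""
    ks = [math.floor(n * t) for t in times]
    s, out = 0, []
    for k in range(1, max(ks) + 1):
        s += random.choice((-1, 1))
        if k in ks:
            out.append(s / math.sqrt(n))
    return out

a, b = 0.3, 0.8
samples = [sample_path(500, (a, b)) for _ in range(5000)]
cov = sum(x * y for x, y in samples) / len(samples)
var_b = sum(y * y for _, y in samples) / len(samples)
print(cov, var_b)  # expect roughly min(a, b) = 0.3 and Var(B_b) = b = 0.8
```

The estimates carry Monte Carlo noise of a few percent at these sample sizes, but they land clearly on $\min(a,b)$ rather than, say, $ab$ or $\frac{a+b}{2}$.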
It remains to see that such a process really exists. The main difficulty is proving<br />
that the process is continuous. There is more than one way to skin the cat for this;<br />
each method is useful because it gives a different insight into what is going on.<br />
2. Levy’s Construction<br />
We will construct Brownian motion on $t \in [0,1]$ as a uniform limit of continuous functions $B^N_t$ as $N \to \infty$. Each $B^N_t$ will be an approximation of the Brownian motion that is piecewise linear between the dyadic rationals of the form $\frac{a}{2^N}$. The real trick in the construction is the remarkable observation that the corrections from $B^N_t$ to $B^{N+1}_t$ are independent of the construction so far up to level $N$, which is the crucial fact that makes the construction so nice and allows it to converge. The crucial fact about Brownian motion that makes this possible is captured in the proposition below:
Proposition 2.1. Let $B_t$ be a Brownian path and $0 < a < b < 1$. Consider the line segment joining $B_a$ and $B_b$: $l(t) = B_a + (t-a)\frac{B_b - B_a}{b-a}$. Consider the value of the Brownian path at the midpoint time, $B_{\frac{a+b}{2}}$. The difference from this point to the line $l(t)$ is independent of $B_a$ and $B_b$. That is to say, $X = B_{\frac{a+b}{2}} - l\left(\frac{a+b}{2}\right) = B_{\frac{a+b}{2}} - \frac{1}{2}B_a - \frac{1}{2}B_b$ is independent of $B_a$ and $B_b$. Moreover, $X$ is normally distributed, $X \sim N\left(0, \frac{1}{4}(b-a)\right)$.

Proof. Firstly, we notice that the random variables $X$, $B_a$ and $B_b$ have a joint normal distribution. This can be seen without much difficulty by expanding the definition of $X$ to write any linear combination of $X$, $B_a$ and $B_b$ as a linear combination of $B_{\frac{a+b}{2}}$, $B_a$ and $B_b$; from here, rewrite it as a linear combination of $B_a$, $B_{\frac{a+b}{2}} - B_a$, and $B_b - B_{\frac{a+b}{2}}$. By the hypotheses on our Brownian motion, each of these is an independent Gaussian variable, so any linear combination of them is again Gaussian. Hence any linear combination of $X$, $B_a$ and $B_b$ is Gaussian; this property is a characterization of the joint Gaussian distribution. The observation that $X$, $B_a$ and $B_b$ are jointly normal substantially simplifies the verification of their independence, as jointly normal variables are independent if and only if they are uncorrelated. From here we calculate (with the help of the useful covariance relation $E(B_sB_t) = \min(s,t)$):
$$E(B_aX) = E\left(B_a\left(B_{\frac{a+b}{2}} - \tfrac{1}{2}B_a - \tfrac{1}{2}B_b\right)\right)$$
$$= E\left(B_aB_{\frac{a+b}{2}}\right) - \tfrac{1}{2}E\left(B_a^2\right) - \tfrac{1}{2}E(B_aB_b)$$
$$= a - \tfrac{1}{2}a - \tfrac{1}{2}a$$
$$= 0$$
A similar calculation holds for $E(B_bX)$. Since these are uncorrelated and jointly normal, they are independent. A quick calculation using the covariance relation again gives $X \sim N\left(0, \frac{1}{4}(b-a)\right)$. $\square$
2. LEVY’S CONSTRUCTION 31<br />
This remarkable fact gives us a nice idea to construct Brownian motion starting<br />
with an infinite sequence of standard E(Z) = 0, E(Z2 ) = 1 i.i.d Gaussian variables<br />
(Z0, Z1, Z2, . . .). The idea is to first construct B0 = 0, B1 = Z0. Then, once B0, and<br />
B1 are constructed by the above proposition, we know that B1/2− 1 1<br />
2B0− 2B1 can be<br />
modeled by 1<br />
4Z1, so set B1/2 = 1<br />
2B1 <br />
1 + 4Z1. Once B0, B1/2, B1 are constructed,<br />
the above proposition gives us a way to get B 1 and B 3 using two more normal<br />
4 4<br />
1<br />
variables 8Z2 <br />
1 and 8Z3 and so on.<br />
The above proposition and paragraph is the basic idea. It becomes a bit of a<br />
mouthful to write it all down. A confused reader should focus on understanding<br />
the construction above before digesting the below details.<br />
To formalize the process, we let $B^N_t$ be the construction at the $N$-th level, which will have the correct values at points of the form $\frac{a}{2^N}$, $0 \le a \le 2^N$. We fill in between these points with a piecewise linear function. After some bookkeeping, the easiest way to write this down is as follows. First define some "tent" functions which make little peaks of unit height in the interval $\left[\frac{2k}{2^n}, \frac{2(k+1)}{2^n}\right]$:
$$T_{n,k}(t) = \begin{cases} 2^n\left(t - \frac{2k}{2^n}\right) & t \in \left[\frac{2k}{2^n}, \frac{2k+1}{2^n}\right] \\ 2^n\left(\frac{2k+2}{2^n} - t\right) & t \in \left[\frac{2k+1}{2^n}, \frac{2k+2}{2^n}\right] \\ 0 & t \notin \left[\frac{2k}{2^n}, \frac{2(k+1)}{2^n}\right] \end{cases}$$
Notice that for every level $n$, the range $0 \le k \le 2^{n-1} - 1$ means there are $2^{n-1}$ tents, and notice that these tents have disjoint supports and unit height.

Now, at every level of the construction we make sure that $B^N_t$ has the right value at points of the form $\frac{a}{2^N}$ by adding in the right tents, with heights distributed as scaled normal variables:
$$B^N_t = Z_0 t + \sum_{n=1}^{N}\,\sum_{k=0}^{2^{n-1}-1}\sqrt{\frac{1}{2^{n+1}}}\,Z_{n,k}\,T_{n,k}(t)$$
2n+1 Explanation of this formula: The “Z0t” is the initial level 0 construction. The<br />
sum 0 ≤ n ≤ N sums over the N levels of construction, and the sum 0 ≤ k ≤<br />
2 n−1 − 1 is over the 2 n−1 tents that get added on at the n − th level. Each tent<br />
<br />
1<br />
has a height distributed like 2n+1 1<br />
Z ∼ N(0, 2n+1 ) , where Z ∼ N(0, 1)(This is the<br />
content of the proposition above!) For convenience, we label the infinite sequence<br />
of normal variables so that Zn,k is controlling the height of the k − th tent on the<br />
n − th level.<br />
Finally we get the Brownian motion as Bt = limN→∞ BN t , which puts the<br />
Brownian motion on the same probability space as the infinite sequence of normal<br />
variables. To see that this is continuous, we show that the convergence is uniform<br />
almost surely. Since each BN t is continuous, and a uniform limit of continuous<br />
functions is continuous, this gives that Bt is continuous.<br />
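The construction is short to code up (my own sketch of the midpoint-displacement scheme described above; the function name `levy_bm` is mine). Rather than summing tent functions explicitly, it fills in the dyadic points level by level, which is the same thing evaluated at $t = a/2^N$.

```python
import math
import random

random.seed(0)

def levy_bm(levels):
    """Sample B_t at the dyadic points a / 2**levels by midpoint
    displacement: each new midpoint value is the average of its two
    neighbours plus an independent N(0, 1/2**(n+1)) correction."""
    n_pts = 2 ** levels
    B = [0.0] * (n_pts + 1)
    B[n_pts] = random.gauss(0, 1)  # B_1 = Z_0
    for n in range(1, levels + 1):
        step = n_pts // 2 ** n  # index spacing at level n
        sigma = math.sqrt(1 / 2 ** (n + 1))
        for mid in range(step, n_pts, 2 * step):
            B[mid] = (B[mid - step] + B[mid + step]) / 2 + sigma * random.gauss(0, 1)
    return B

path = levy_bm(10)  # one path sampled at t = 0, 1/1024, ..., 1
print(len(path))    # 1025 points
```

At level $n$ the neighbours are $2^{-(n-1)}$ apart in time, so the proposition gives a correction of variance $\frac{1}{4}\cdot 2^{-(n-1)} = 2^{-(n+1)}$, matching `sigma` above.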
Proposition 2.2. The family of functions $B^N_t$ converges uniformly almost surely.

Proof. As you might suspect, the trick is to use the right summable sequence with a clever application of the Borel–Cantelli lemma. Let
$$H_n = \max_{t\in[0,1]}\left|\sum_{k=0}^{2^{n-1}-1}\sqrt{\frac{1}{2^{n+1}}}\,Z_{n,k}\,T_{n,k}(t)\right|$$
be the maximum height contribution to $B_t$ at level $n$. Since the tent functions $T_{n,k}(t)$ have disjoint supports and unit height, this is $H_n = \sqrt{\frac{1}{2^{n+1}}}\max_{0\le k\le 2^{n-1}-1}|Z_{n,k}|$. We now make the following estimate:
$$P\left(H_n > 2^{-n/2}\sqrt{2n}\right) = P\left(\max_{0\le k\le 2^{n-1}-1}|Z_{n,k}| > 2\sqrt{n}\right)$$
$$\le 2^{n-1}\,P\left(|Z| > 2\sqrt{n}\right)$$
$$= 2^n\, P\left(Z > 2\sqrt{n}\right)$$
$$= \frac{2^n}{\sqrt{2\pi}}\int_{2\sqrt{n}}^{\infty}\exp\left(-\frac{x^2}{2}\right)dx$$
$$\le \frac{2^n}{\sqrt{2\pi}}\cdot\frac{1}{2\sqrt{n}}\exp\left(-\frac{(2\sqrt{n})^2}{2}\right) \quad \text{(this is Mill's ratio)}$$
$$= C\cdot\frac{1}{\sqrt{n}}\cdot\left(\frac{2}{e^2}\right)^n$$
which is a summable sequence! Hence, we know by the Borel–Cantelli lemma that this happens only finitely often almost surely. That is to say, for almost every $\omega \in \Omega$, we can find $N \in \mathbb{N}$ so that $H_n(\omega) \le 2^{-n/2}\sqrt{2n}$ for all $n > N$. But then we have that for all $q > p > N$ and any $t \in [0,1]$:
$$|B^p_t - B^q_t| = \left|\sum_{n=p+1}^{q}\,\sum_{k=0}^{2^{n-1}-1}\sqrt{\frac{1}{2^{n+1}}}\,Z_{n,k}\,T_{n,k}(t)\right|$$
$$\le \sum_{n=p+1}^{q}|H_n|$$
$$\le \sum_{n=p+1}^{q}2^{-n/2}\sqrt{2n}$$
$$\le \sum_{n=N}^{\infty}2^{-n/2}\sqrt{2n}$$
But since $2^{-n/2}\sqrt{2n}$ is summable, this tail can be made arbitrarily small, and we see then that $B^N_t$ is Cauchy in the uniform norm. Since this holds for almost every $\omega \in \Omega$, we indeed have uniform convergence almost surely. $\square$
Finally, to see that the limiting process is really what we want, we just verify that $E\left((B_t - B_s)^2\right) = |t-s|$, from which it's easy to check the properties we want. To see this, we just use the density of the dyadic rationals in $[0,1]$. The above construction fixes points of the form $\frac{a}{2^n}$ at step $n$, that is to say $B_{a/2^n} = B^n_{a/2^n}$. Hence for $t, s$ dyadic rationals, we have $E\left((B_t - B_s)^2\right) = E\left((B^n_t - B^n_s)^2\right) = |t-s|$, which is easily checked by the construction above/the earlier proposition.
3. CONSTRUCTION FROM DURRET’S BOOK 33<br />
For arbitrary t now, but s still taken to be a dyadic rational, we take a sequence of dyadic rationals t_n → t. We have then, using Fatou's lemma:

E[(B_t − B_s)^2] = E[lim_{n→∞} (B_{t_n} − B_s)^2]
≤ lim_{n→∞} E[(B_{t_n} − B_s)^2]
= lim_{n→∞} |t_n − s|
= |t − s|
Now consider, for any n ∈ N:

E[(B_t − B_s)^2] = E[((B_t − B_{t_n}) − (B_s − B_{t_n}))^2]
= E[(B_t − B_{t_n})^2] + E[(B_s − B_{t_n})^2] − 2E[(B_t − B_{t_n})(B_s − B_{t_n})]

Since this holds for any n ∈ N, we get:

E[(B_t − B_s)^2] = lim_{n→∞} ( E[(B_t − B_{t_n})^2] + E[(B_s − B_{t_n})^2] − 2E[(B_t − B_{t_n})(B_s − B_{t_n})] )
= 0 + lim_{n→∞} |t_n − s| + 0
= |t − s|

where we have observed that the two limits on either side are 0 by using E[(B_t − B_s)^2] ≤ |t − s| in a clever way. First, lim_{n→∞} E[(B_t − B_{t_n})^2] ≤ lim_{n→∞} |t − t_n| = 0, and secondly, with the help of the Hölder (Cauchy–Schwarz) inequality:

lim_{n→∞} |E[(B_t − B_{t_n})(B_s − B_{t_n})]| ≤ lim_{n→∞} √(E[(B_t − B_{t_n})^2]) · √(E[(B_s − B_{t_n})^2])
≤ lim_{n→∞} √(|t − t_n|) · √(|s − t_n|)
= 0

Once we have E[(B_t − B_s)^2] = |t − s| for arbitrary t and dyadic s, the same argument repeated again shows that E[(B_t − B_s)^2] = |t − s| holds when both t and s are arbitrary.
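The dyadic construction and the variance identity E[(B_t − B_s)^2] = |t − s| can be sanity-checked numerically. The sketch below is my own illustration (the helper name `levy_bm` is made up): it samples a path on the dyadic grid by midpoint refinement, in the spirit of the construction above, and estimates the second moment of an increment by Monte Carlo.

```python
import random

def levy_bm(n_levels, rng):
    """Sample Brownian motion on the dyadic grid {k / 2^n_levels} by midpoint
    refinement: B(0) = 0, B(1) ~ N(0, 1), and each new midpoint is the average
    of its two neighbours plus an independent N(0, d/4) kick, where d is the
    distance between the neighbours (the Brownian-bridge conditional variance)."""
    B = {0.0: 0.0, 1.0: rng.gauss(0.0, 1.0)}
    for level in range(1, n_levels + 1):
        step = 1.0 / 2 ** level
        for k in range(1, 2 ** level, 2):  # odd multiples of step: new midpoints
            t = k * step
            mean = 0.5 * (B[t - step] + B[t + step])  # neighbours are d = 2*step apart
            B[t] = mean + rng.gauss(0.0, (step / 2.0) ** 0.5)  # sd = sqrt(d/4)
    return B

rng = random.Random(0)
t, s, trials = 0.75, 0.25, 20000
total = 0.0
for _ in range(trials):
    B = levy_bm(4, rng)
    total += (B[t] - B[s]) ** 2
m2 = total / trials
print(m2)  # Monte Carlo estimate of E[(B_t - B_s)^2]; should be near |t - s| = 0.5
```

The N(0, d/4) midpoint kick is exactly the conditional variance of a Brownian bridge given its endpoints, which is what keeps the already-fixed dyadic points consistent across levels.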
3. Construction from Durrett's Book

(I call this "Durrett's construction" since I read it out of Durrett's book, "Brownian Motion and Martingales in Analysis".)

The above construction is pretty elementary and gives all the desired properties. The following construction is a bit more technical; in particular, it uses a few extension results like Caratheodory and Kolmogorov. However, it gives immediately that the Brownian motion is not only continuous, but Hölder continuous for exponents γ < 1/2. This construction uses a few "extension theorems", which are gone over briefly in the appendix.
Definition 3.1. (Constructing Brownian Motion with the Kolmogorov Extension Theorem)

The Kolmogorov Extension Theorem gives us a quick way to define a measure on the space of functions. However, since the space of functions {f : T → R} is so large, this theorem often gives us a very unwieldy space to work with, one in which we can't get our hands on the properties we want. The construction of Brownian motion below is a great example: constructing with the Kolmogorov theorem alone is bad, while if we take more care and construct on only countably many points, we get what we want.

Let

P_{t_1,t_2,...,t_n}(A_1 × A_2 × ... × A_n) = ∫_{A_1} dx_1 ∫_{A_2} dx_2 ... ∫_{A_n} dx_n Π_{k=1}^{n} p_{t_k − t_{k−1}}(x_{k−1}, x_k),

where p_t(x, y) = (2πt)^{−1/2} exp(−|y − x|²/2t) (with the conventions t_0 = 0, x_0 = 0). This is naively what you get as the distribution of (B_{t_1}, B_{t_2}, ..., B_{t_n}) if you use the Markov property and the normal distribution of the Brownian motion. By Kolmogorov, we get a measure P on the entire space of functions {f : [0, 1] → R}. This defines the Brownian motion!
Proposition 3.2. With the above description of P, it will be impossible to see that the Brownian motion is almost surely continuous, because the set of continuous functions C ⊂ {f : [0, 1] → R} is not even measurable.

Proof. Suppose by contradiction that C is measurable. Then we can find a sequence t_1, t_2, ... of times and a Borel set B ⊆ R^∞ so that C = {f : (f(t_1), f(t_2), ...) ∈ B}. (The proof of this fact comes by showing that sets of this form constitute a sigma-algebra which contains the cylinder sets used to define σ(A).) Take any continuous function f now, and alter its value at a single point t ∉ {t_1, t_2, ...} to get a function f̂ which agrees with f at {t_1, t_2, ...} but is not continuous. But then f̂ ∈ C = {f : (f(t_1), f(t_2), ...) ∈ B}, since f̂ agrees with f at every t_i, even though f̂ is not continuous, which is a contradiction. □
This result means that our construction is not good. It is better to construct the B_t as follows:

Definition 3.3. (Constructing Brownian Motion with Uniform Continuity)

Step 1. (Define on dyadic rationals.) Let P_{t_1,...,t_n} be as above. Use the countable Kolmogorov Extension Theorem to get a measure P on the set of functions Ω = {f : [0, 1] ∩ D_2 → R} from the dyadic rationals to R.

Step 2. Check that functions in Ω are almost surely Hölder continuous, i.e. for almost all f ∈ Ω, |f(t) − f(s)| ≤ C|t − s|^γ.

Step 3. Conclude that for almost every f ∈ Ω, there is a unique way to extend f to a continuous function f : [0, 1] → R, since the dyadic rationals are dense in [0, 1].

Step 1 is pretty simple, but Step 2 requires some verification and is the real heart of the problem:

Proposition 3.4. Fix γ < 1/2. For almost every f ∈ Ω, there is a constant C so that |f(t) − f(s)| ≤ C|t − s|^γ.
We first prove a lemma.

Lemma 3.5. Fix γ < 1/2. Then there exists δ > 0 so that for almost every f ∈ Ω, there is an N ∈ N (which depends on f) so that for n ≥ N we have:

|f(x) − f(y)| ≤ |x − y|^γ

whenever x = i2^{−n}, y = j2^{−n} and |x − y| ≤ (1/2)^{n(1−δ)}.

Proof. Take m ∈ N so large that m > 1/(1 − 2γ). We use the inequality E[|f(t) − f(s)|^{2m}] ≤ C_m |t − s|^m with C_m = E[|f(1)|^{2m}] (this follows since f(t) − f(s) ∼ N(0, t − s), so f(t) − f(s) has the same law as |t − s|^{1/2} f(1)). For any n ∈ N now, consider the following estimate:

P(|f(x) − f(y)| > |x − y|^γ for some x = i2^{−n}, y = j2^{−n} with |x − y| ≤ (1/2)^{n(1−δ)}) ≤ Σ |x − y|^{−2mγ} E[|f(x) − f(y)|^{2m}]

where the sum on the right hand side is taken over all the possible x, y that satisfy the inequality |x − y| ≤ (1/2)^{n(1−δ)} (there are finitely many, since we are restricting ourselves to dyadic rationals x = i2^{−n}, y = j2^{−n}). We have used a union bound and the Chebyshev inequality P(|X| > a) ≤ a^{−2m} E(|X|^{2m}) here. Now, by the above moment inequality, we have:

LHS ≤ C_m Σ |x − y|^{−2mγ} |x − y|^m
= C_m Σ |x − y|^{−2mγ + m}
≤ C_m 2^n 2^{nδ} (2^{−n(1−δ)})^{−2mγ + m}
= C_m 2^{−n(−(1+δ) + (1−δ)(−2mγ + m))}

The last bound comes in because |x − y| ≤ 2^{−n(1−δ)} for the x, y in our sum, and there are at most 2^n choices for x and 2^{nδ} choices for y once x has been fixed (remember, they are all n-th level dyadic rationals). Now, the term that appears in the exponent is:

ε = −(1 + δ) + (1 − δ)(−2mγ + m)

Since m is so large that −2mγ + m > 1, we can choose δ so small that ε > 0. We will have then that

LHS ≤ C_m 2^{−nε}

which is a summable sequence! By the Borel–Cantelli lemma, it must be the case that for almost every f ∈ Ω the event here happens only finitely many times. This is exactly the statement of the lemma which we wanted to prove. □
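The counting step in the proof ("at most 2^n choices for x and 2^{nδ} choices for y") can be checked by brute force at a small level. This snippet is my own illustration; the particular n, δ and the constant 8 are mine, chosen to absorb edge effects and the two directions of the gap.

```python
def count_close_dyadic_pairs(n, delta):
    """Ordered pairs x = i/2^n, y = j/2^n in [0,1], i != j,
    with |x - y| <= (1/2)^(n(1-delta)), i.e. |i - j| <= 2^(n*delta)."""
    max_gap = 2 ** (n * delta)
    count = 0
    for i in range(2 ** n + 1):
        for j in range(2 ** n + 1):
            if i != j and abs(i - j) <= max_gap:
                count += 1
    return count

n, delta = 8, 0.25
pairs = count_close_dyadic_pairs(n, delta)
# The count stays well within the ~2^n * 2^(n*delta) budget used in the lemma.
print(pairs, 8 * 2 ** n * 2 ** (n * delta))
```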
Proposition 3.6. Fix γ < 1/2. For almost every f ∈ Ω, there is a constant C so that |f(t) − f(s)| ≤ C|t − s|^γ.

Proof. For almost every f ∈ Ω, find δ > 0 and N ∈ N as in the lemma. Take any t, s ∈ D_2 ∩ [0, 1] with t − s < 2^{−N(1−δ)}. Choose m > N now so that 2^{−(m+1)(1−δ)} ≤ t − s ≤ 2^{−m(1−δ)}. Write now t = i2^{−m} − 2^{−q_1} − 2^{−q_2} − ... − 2^{−q_k} > (i − 1)2^{−m} and s = j2^{−m} + 2^{−r_1} + ... + 2^{−r_l} < (j + 1)2^{−m} for some choice of q's and r's with m < q_1 < ... < q_k and m < r_1 < ... < r_l. Since t − s < 2^{−m(1−δ)}, we have i2^{−m} − j2^{−m} < 2^{−m(1−δ)} (up to a harmless constant), so we can apply the result from the lemma to conclude:

|f(i2^{−m}) − f(j2^{−m})| ≤ ((2^{mδ}) 2^{−m})^γ = 2^{−m(1−δ)γ}
Now, we use the result of the lemma again many times to see that (using our clever rewriting of t):

|f(t) − f(i2^{−m})| ≤ |f(i2^{−m} − 2^{−q_1}) − f(i2^{−m})| + |f(i2^{−m} − 2^{−q_1} − 2^{−q_2}) − f(i2^{−m} − 2^{−q_1})| + ...
≤ |2^{−q_1}|^γ + ... + |2^{−q_k}|^γ
≤ Σ_{j=m+1}^{∞} (2^{−j})^γ
≤ C 2^{−γm}

since m < q_p for each p, and where we bounded the sum by the full geometric series. We similarly get a bound on |f(s) − f(j2^{−m})|. Finally then:

|f(t) − f(s)| ≤ C 2^{−γm(1−δ)} + C 2^{−γm} + C 2^{−γm}
≤ C 2^{−γm(1−δ)}
= C 2^{γ(1−δ)} (2^{−(m+1)(1−δ)})^γ
≤ C 2^{γ(1−δ)} |t − s|^γ

by the choice of m so that 2^{−(m+1)(1−δ)} ≤ t − s (here the constant C is allowed to change from line to line). □
So from here we see that the Brownian motion is almost surely Hölder continuous for exponents γ < 1/2. This result lets us find a unique extension of f(t) from the dyadic rationals to all of [0, 1] which is not only continuous, but moreover Hölder continuous for exponents γ < 1/2, which is a stronger result than our first construction gave. For ease of notation, we now change our notation a little bit: we will refer to ω ∈ Ω instead of f, and we now have a family of random variables B_t(ω) = ω(t). What we have just proven is that for fixed ω, the map t → B_t(ω) is indeed a Hölder continuous path for exponents γ < 1/2.
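The exponent-1/2 threshold can be illustrated empirically. This is my own simulation sketch (helper names `sample_path` and `holder_ratio` are made up), looking only at adjacent-increment ratios as a crude proxy for the Hölder seminorm:

```python
import math
import random

def sample_path(n_steps, rng):
    """Random-walk approximation of B on the grid {k/n}: i.i.d. N(0, 1/n) steps."""
    dt = 1.0 / n_steps
    b, path = 0.0, [0.0]
    for _ in range(n_steps):
        b += rng.gauss(0.0, math.sqrt(dt))
        path.append(b)
    return path

def holder_ratio(path, gamma):
    """max over adjacent grid points of |B(t) - B(s)| / |t - s|^gamma."""
    dt = 1.0 / (len(path) - 1)
    return max(abs(b2 - b1) for b1, b2 in zip(path, path[1:])) / dt ** gamma

rng = random.Random(1)
for n in (2 ** 6, 2 ** 10, 2 ** 14):
    path = sample_path(n, rng)
    print(n, holder_ratio(path, 0.4), holder_ratio(path, 0.6))
# Expect the gamma = 0.4 column to stay moderate as n grows, while the
# gamma = 0.6 column keeps growing, consistent with the 1/2 threshold.
```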
4. Some Properties

The following slick result shows that the Brownian motion is nowhere Hölder continuous for γ > 1/2, which in particular shows that it is nowhere differentiable.

Proposition 4.1. For γ > 1/2, the set of functions which are Hölder continuous with exponent γ at some point is a null set. In other words, the Brownian motion is almost surely nowhere Hölder continuous for exponents γ > 1/2.

Proof. Fix a γ > 1/2 and C ∈ R. Choose m ∈ N so large that γ > (m + 1)/(2m). Define the events, starting at n > m:

A_n = {ω : ∃s ∈ [0, 1] such that |B_t − B_s| ≤ C|t − s|^γ for all t ∈ [s − m/n, s + m/n]}
Define the random variables:

Y_{n,k}(ω) = max_{j=0,1,...,2m} |B((k + j)/n) − B((k + j − 1)/n)|

And finally, the events:

B_n = {ω : at least one of the Y_{n,k} ≤ 2C (m/n)^γ}

We now claim that A_n ⊂ B_n. Indeed, for ω ∈ A_n we find an s so that |B_t − B_s| ≤ C|t − s|^γ for all t ∈ [s − m/n, s + m/n]; in particular, |B_t − B_s| ≤ C (m/n)^γ for all such t. By the pigeonhole principle, inside this interval we can find k so that {k/n, (k+1)/n, (k+2)/n, ..., (k+2m)/n} ⊂ [s − m/n, s + m/n]. But then, for this k, we have:

Y_{n,k}(ω) = max_{j=0,1,...,2m} |B((k + j)/n) − B((k + j − 1)/n)|
≤ max_{j=0,1,...,2m} ( |B((k + j)/n) − B(s)| + |B(s) − B((k + j − 1)/n)| )
≤ 2C (m/n)^γ
P(An) ≤ P(Bn)<br />
≤<br />
<br />
≤<br />
k=0..n−m<br />
<br />
k=0..n−m<br />
<br />
P Yn,k ≤ 2C<br />
<br />
B<br />
P k+j<br />
n<br />
<br />
m<br />
γ n<br />
− B k+j−1<br />
n<br />
<br />
<br />
≤ 2C<br />
<br />
≤ nP |B 1<br />
n − B0|<br />
<br />
m<br />
γ2m < 2C<br />
n<br />
<br />
<br />
m<br />
γ √<br />
= nP |B1 − B0| < 2C n<br />
n<br />
<br />
2<br />
<br />
m<br />
<br />
γ<br />
2m<br />
√<br />
≤ n √2π 2C n<br />
n<br />
2m<br />
1 (<br />
= Dn 2 −γ)2m+1 = Dn m+1−2mγ → 0<br />
<br />
m<br />
γ n<br />
k+2m , . . . n } ⊂ [s −<br />
<br />
for each j = 0, 1, ..2m<br />
Where we used the independence property of disjoint intervals of the Brownian<br />
motion, the scaling relation P(Bt > a) = P(Bct > √ ca), and the easy inequality<br />
P(N(0, 1) > λ) ≤ 2λwhich comes <strong>from</strong> integrating the p.d.f.. Finally, by the choice<br />
of m so that γ > m+1<br />
2m<br />
, we know that m + 1 − 2mγ < 0 so this probability does<br />
indeed go to zero. But then, as the events An are increasing, this means that An<br />
are all zero probability events, which is the result we wanted.
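The "easy inequality" in the proof above, P(|N(0,1)| < λ) ≤ 2λ/√(2π), just bounds the density by its maximum 1/√(2π) over an interval of length 2λ. A quick numerical check (my own sketch, using the identity P(|N(0,1)| < λ) = erf(λ/√2)):

```python
import math

def prob_abs_normal_below(lam):
    """P(|N(0,1)| < lam), computed exactly via the error function."""
    return math.erf(lam / math.sqrt(2.0))

def density_bound(lam):
    """The bound 2*lam/sqrt(2*pi): interval length 2*lam times the
    maximum value 1/sqrt(2*pi) of the standard normal density."""
    return 2.0 * lam / math.sqrt(2.0 * math.pi)

for lam in (0.01, 0.1, 0.5, 1.0, 2.0):
    assert prob_abs_normal_below(lam) <= density_bound(lam) + 1e-15
    print(lam, prob_abs_normal_below(lam), density_bound(lam))
```

For small λ the two sides nearly agree, which is why the bound is sharp enough for the n → ∞ estimate in the proof.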
CHAPTER 5

Appendix

1. Conditional Random Variables

Let (Ω, F, P) be a probability space and X, Y : Ω → R random variables. B is the Borel sigma algebra of R.

Definition 1.1. We define σ(X) ⊂ F to be the sigma-algebra generated by the preimages of Borel sets through X. That is:

σ(X) = σ({X^{−1}(B) : B ∈ B})

Remark. The sub-sigma-algebra σ(X) is coarser than all of F. Intuitively, the random variable X can only "detect" events up to sets in σ(X).

Definition 1.2. Let Σ ⊂ F be a sub-sigma-algebra of F. We say a random variable X : Ω → R is Σ-measurable if X^{−1}(B) ∈ Σ for all B ∈ B. Equivalently, if σ(X) ⊂ Σ.

Example 1.3. Every random variable is always F-measurable, since σ(X) ⊂ F.
Definition 1.4. Given X and Y, we can define a new random variable Z = E(Y|X) to be the unique random variable with the following two properties:

1. Z is σ(X)-measurable.
2. For any B ∈ B we have E(Z 1_{X ∈ B}) = E(Y 1_{X ∈ B}).

Remark. The existence of this random variable is proven by applying the Radon–Nikodym theorem on the probability space restricted to the sigma-field σ(X): Z is the Radon–Nikodym derivative there of the measure S ↦ E(Y 1_S).

Remark. There is no problem with picking an arbitrary sub-sigma-algebra Σ ⊂ F instead of σ(X). The second condition is then simply that for any S ∈ Σ we have E(Z 1_S) = E(Y 1_S), which recovers the condition above when Σ = σ(X).

Remark. Z = E(Y|X) is a random variable Z : Ω → R, but it is often thought of as a function Z : R → R, whose input is the random variable X. This works because Z is σ(X)-measurable. The following two little results clear this up a bit:
Proposition 1.5. If f : R → R is measurable and Z : Ω → R is Σ-measurable, then the random variable f ◦ Z is Σ-measurable too.

Proof. For any B ∈ B we have (f ◦ Z)^{−1}(B) = Z^{−1}(f^{−1}(B)) ∈ Σ, since f^{−1}(B) ∈ B as f is measurable and Z is Σ-measurable. □

Proposition 1.6. If Z is a σ(X)-measurable random variable, then we may think of Z as a function Z̃ : R → R whose input is X.

Proof. Define Z̃ : R → R by Z̃(x) = Z(ω) for any representative ω ∈ X^{−1}({x}). We must justify why this value is independent of the choice of ω ∈ X^{−1}({x}). Indeed, for ω_1, ω_2 ∈ X^{−1}({x}), let z = Z(ω_1). Since Z is σ(X)-measurable, we have that:

Z^{−1}({z}) ∈ σ(X)
⇒ Z^{−1}({z}) = X^{−1}(B) for some B ∈ B

But then ω_1 ∈ Z^{−1}({z}) = X^{−1}(B), so that X(ω_1) ∈ B. Since X(ω_1) = X(ω_2) = x, we then have ω_2 ∈ X^{−1}(B) = Z^{−1}({z}), which means that Z(ω_1) = Z(ω_2) = z, as desired. Hence Z̃ is well defined! With this definition of Z̃, we see that Z = Z̃ ◦ X. We often conflate Z with Z̃ in practice. □
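On a finite probability space, Definition 1.4 becomes concrete: E(Y|X) just averages Y over each level set of X. The following sketch is my own illustration (names like `cond_exp` are made up), verifying both defining properties:

```python
from fractions import Fraction

# Finite probability space: Omega = {0, ..., 7} with the uniform measure.
omega = list(range(8))
p = {w: Fraction(1, 8) for w in omega}

def X(w):
    return w // 4        # X only "detects" the block {0..3} vs {4..7}

def Y(w):
    return w             # Y sees every point

def cond_exp(Y, X):
    """E(Y | X): on each level set of X, replace Y by its average there."""
    Z = {}
    for x in set(X(w) for w in omega):
        level = [w for w in omega if X(w) == x]
        avg = sum(p[w] * Y(w) for w in level) / sum(p[w] for w in level)
        for w in level:
            Z[w] = avg
    return Z

Z = cond_exp(Y, X)

# Property 1: Z is sigma(X)-measurable, i.e. constant on the level sets of X.
assert all(Z[w1] == Z[w2] for w1 in omega for w2 in omega if X(w1) == X(w2))

# Property 2: E[Z 1_{X in B}] = E[Y 1_{X in B}]; it suffices to check B = {x}.
for x in (0, 1):
    lhs = sum(p[w] * Z[w] for w in omega if X(w) == x)
    rhs = sum(p[w] * Y(w) for w in omega if X(w) == x)
    assert lhs == rhs

print(sorted(set(Z.values())))  # the two conditional averages, 3/2 and 11/2
```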
2. Extension Theorems

Theorem 2.1. [Caratheodory Extension Theorem]

Fix some (Ω, A, P_0), where Ω is a set, A is an algebra of sets (a.k.a. a field of sets), and P_0 is a finitely additive probability measure on A. Suppose we have the additional property that:

for sequences of sets A_1, A_2, ... ∈ A which are pairwise disjoint and have ∪A_n ∈ A too, we necessarily have P_0(∪A_n) = Σ P_0(A_n).

Then there is a unique extension to a probability space (Ω, σ(A), P) so that P and P_0 agree on A.

Proof. [sketch] The idea is exactly the same as the construction of the Lebesgue measure on [0, 1] from the premeasure generated by µ((a, b)) = b − a on the algebra of open sets. Define an outer measure:

P*(E) := inf_{E ⊂ ∪A_n} Σ P_0(A_n)

From here you check that P* has the right properties: countable subadditivity and monotonicity are easy, while getting P*(A) = P_0(A) for A ∈ A requires the special property we are given above. Once this is done, you can define measurable sets a la Caratheodory: E is measurable iff for all A ∈ A we have P*(A) = P*(A ∩ E) + P*(A ∩ E^c). Then you verify that σ(A) is contained in the collection of measurable sets, and declare P = P* restricted to σ(A) to be the desired measure. □

Remark. The condition needed in the theorem can be replaced with "continuity from above at ∅":

for A_1, A_2, ... ∈ A which decrease down to ∅, we necessarily have P_0(A_n) → 0.

The equivalence of these two conditions is not too difficult. The first condition is more intuitive, while the second is sometimes easier to verify in practice.
Theorem 2.2. [Countable Kolmogorov Extension Theorem]

Suppose for every n ≥ 1 we have a probability measure P_n on R^n. Suppose also that these probability measures satisfy the following consistency condition for every Borel set E ⊆ R^n:

P_{n+k}(E × R^k) = P_n(E)

Then there exists a unique measure P on the infinite product space R^∞ of sequences, so that for every Borel set E ⊆ R^n we have P(E × R × R × ...) = P_n(E).

Proof. [sketch] Take Ω = R^∞, the space of real-valued sequences. Define the field of cylinder sets to be:

A = {E × R × R × ... : E ⊆ R^n is Borel, n ≥ 1}

with finitely additive measure P_0(E × R × R × ...) := P_n(E). The given consistency condition on the P_n's shows this is well defined. To see continuity from above at ∅, suppose A_k ↓ ∅, where A_k = E_k × R × R × ... for Borel sets E_k ⊆ R^{n_k}. If P_0(A_k) did not tend to 0, one uses inner regularity of the P_n's to choose compact sets inside the bases E_k, and a diagonal/compactness argument then produces a point in ∩A_k, contradicting A_k ↓ ∅. (This step is where the real work lies; it is not automatic just from the P_n's being probability measures.) By application of the Caratheodory extension theorem, we get the desired measure! □
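For the Brownian finite-dimensional distributions, the consistency condition is easy to see at the level of covariance matrices: P_n is the mean-zero Gaussian with Σ_{ij} = min(t_i, t_j), and marginalizing out the last coordinates of a mean-zero Gaussian just deletes the corresponding rows and columns (a standard Gaussian fact). A small check of this, my own sketch with the made-up helper `bm_cov`:

```python
def bm_cov(times):
    """Covariance matrix of (B_{t_1}, ..., B_{t_n}): Sigma[i][j] = min(t_i, t_j)."""
    return [[min(s, t) for t in times] for s in times]

# Consistency condition P_{n+k}(E x R^k) = P_n(E): for these mean-zero
# Gaussian families, integrating out the last k coordinates corresponds
# to deleting the last k rows and columns of the covariance matrix.
times = [0.25, 0.5, 0.75]
extended = times + [0.9, 1.0]

full = bm_cov(extended)
marginal = [row[:len(times)] for row in full[:len(times)]]
assert marginal == bm_cov(times)
print(marginal)
```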
Theorem 2.3. [Kolmogorov Extension Theorem]

Let T ⊂ R be any interval. Suppose we have a family of probability measures P_{t_1,t_2,...,t_n} on R^n for every finite set of points t_1, t_2, ..., t_n in T. Suppose also that these probability measures satisfy the following consistency condition:

P_{t_1,...,t_n,t̂_1,...,t̂_m}(E × R^m) = P_{t_1,...,t_n}(E)

Then there exists a unique measure P on the set of functions {f : T → R} so that:

P({f : (f(t_1), f(t_2), ..., f(t_n)) ∈ E}) = P_{t_1,...,t_n}(E)

Remark. This is very similar to the countable version, but requires some more work to make it work out. However, since the space of functions {f : T → R} is so large, this theorem often gives us a very unwieldy space to work with, one in which we can't get our hands on the properties we want. The construction of Brownian motion in Chapter 4 is a great example: constructing with the uncountable Kolmogorov theorem is bad, while constructing with the countable one is good.