
CS 373: Combinatorial Algorithms

University of Illinois, Urbana-Champaign

Instructor: Jeff Erickson

Teaching Assistants:

• Spring 1999: Mitch Harris and Shripad Thite
• Summer 1999 (IMCS): Mitch Harris
• Summer 2000 (IMCS): Mitch Harris
• Fall 2000: Chris Neihengen, Ekta Manaktala, and Nick Hurlburt
• Spring 2001: Brian Ensink, Chris Neihengen, and Nick Hurlburt
• Summer 2001 (I2CS): Asha Seetharam and Dan Bullok
• Fall 2002: Erin Wolf, Gio Kao, Kevin Small, Michael Bond, Rishi Talreja, Rob McCann, and Yasutaka Furakawa

© Copyright 1999, 2000, 2001, 2002, 2003 Jeff Erickson. This work may be freely copied and distributed. It may not be sold for more than the actual cost of reproduction. This work is distributed under a Creative Commons license; see http://creativecommons.org/licenses/by-nc-sa/1.0/. For the most recent edition, see http://www.uiuc.edu/~jeffe/teaching/373/.


For junior faculty, it may be a choice between a book and tenure.
— George A. Bekey, “The Assistant Professor’s Guide to the Galaxy” (1993)

I’m writing a book. I’ve got the page numbers done.
— Stephen Wright

About These Notes

This course packet includes lecture notes, homework questions, and exam questions from the course ‘CS 373: Combinatorial Algorithms’, which I taught at the University of Illinois in Spring 1999, Fall 2000, Spring 2001, and Fall 2002. Lecture notes and videotaped lectures were also used during Summer 1999, Summer 2000, Summer 2001, and Fall 2002 as part of the UIUC computer science department’s Illinois Internet Computer Science (I2CS) program.

The recurrences handout is based on samizdat, probably written by Ari Trachtenberg, based on a paper by George Lueker, from an earlier semester taught by Ed Reingold. I wrote most of the lecture notes in Spring 1999; I revised them and added a few new notes in each following semester. Except for the infamous Homework Zero, which is entirely my doing, homework and exam problems and their solutions were written mostly by the teaching assistants: Asha Seetharam, Brian Ensink, Chris Neihengen, Dan Bullok, Ekta Manaktala, Erin Wolf, Gio Kao, Kevin Small, Michael Bond, Mitch Harris, Nick Hurlburt, Rishi Talreja, Rob McCann, Shripad Thite, and Yasu Furakawa. Lecture notes were posted to the course web site a few days (on average) after each lecture. Homeworks, exams, and solutions were also distributed over the web. I have deliberately excluded solutions from this course packet.

The lecture notes, homeworks, and exams draw heavily on the following sources, all of which I can recommend as good references.

• Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974. (This was the textbook for the algorithms classes I took as an undergrad at Rice and as a masters student at UC Irvine.)

• Sara Baase and Allen Van Gelder. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley, 2000.

• Mark de Berg, Marc van Kreveld, Mark Overmars, and Otfried Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, 1997. (This is the required textbook in my computational geometry course.)

• Thomas Cormen, Charles Leiserson, Ron Rivest, and Cliff Stein. Introduction to Algorithms, second edition. MIT Press/McGraw-Hill, 2000. (This is the required textbook for CS 373, although I never actually use it in class. Students use it as an educated second opinion. I used the first edition of this book as a teaching assistant at Berkeley.)

• Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

• Michael T. Goodrich and Roberto Tamassia. Algorithm Design: Foundations, Analysis, and Internet Examples. John Wiley & Sons, 2002.

• Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Molecular Biology. Cambridge University Press, 1997.

• Udi Manber. Introduction to Algorithms: A Creative Approach. Addison-Wesley, 1989. (I used this textbook as a teaching assistant at Berkeley.)

• Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

• Ian Parberry. Problems on Algorithms. Prentice-Hall, 1995. (This was a recommended textbook for early versions of CS 373, primarily for students who needed to strengthen their prerequisite knowledge. This book is out of print, but can be downloaded as karmaware from http://hercule.csci.unt.edu/~ian/books/poa.html.)

• Robert Sedgewick. Algorithms. Addison-Wesley, 1988. (This book and its sequels have by far the best algorithm illustrations anywhere.)

• Robert Endre Tarjan. Data Structures and Network Algorithms. SIAM, 1983.

• Class notes from my own algorithms classes at Berkeley, especially those taught by Dick Karp and Raimund Seidel.

• Various journal and conference papers (cited in the notes).

• Google.

Naturally, everything here owes a great debt to the people who taught me this algorithm stuff in the first place: Abhiram Ranade, Bob Bixby, David Eppstein, Dan Hirshberg, Dick Karp, Ed Reingold, George Lueker, Manuel Blum, Mike Luby, Michael Perlman, and Raimund Seidel. I’ve also been helped immensely by many discussions with colleagues at UIUC—Ed Reingold, Edgar Ramos, Herbert Edelsbrunner, Jason Zych, Lenny Pitt, Mahesh Viswanathan, Shang-Hua Teng, Steve LaValle, and especially Sariel Har-Peled—as well as voluminous feedback from the students and teaching assistants. I stole the overall course structure (and the idea to write up my own lecture notes) from Herbert Edelsbrunner.

We did the best we could, but I’m sure there are still plenty of mistakes, errors, bugs, gaffes, omissions, snafus, kludges, typos, mathos, grammaros, thinkos, brain farts, nonsense, garbage, cruft, junk, and outright lies, all of which are entirely Steve Skiena’s fault. I revise and update these notes every time I teach the course, so please let me know if you find a bug. (Steve is unlikely to care.)

When I’m teaching CS 373, I award extra credit points to the first student to post an explanation and correction of any error in the lecture notes to the course newsgroup (uiuc.class.cs373). Obviously, the number of extra credit points depends on the severity of the error and the quality of the correction. If I’m not teaching the course, encourage your instructor to set up a similar extra-credit scheme, and forward the bug reports to Steve (er, me)!

Of course, any other feedback is also welcome!

Enjoy!

— Jeff

It is traditional for the author to magnanimously accept the blame for whatever deficiencies remain. I don’t. Any errors, deficiencies, or problems in this book are somebody else’s fault, but I would appreciate knowing about them so as to determine who is to blame.

— Steven S. Skiena, The Algorithm Design Manual, Springer-Verlag, 1997, page x.


CS 373 Notes on Solving Recurrence Relations

1 Fun with Fibonacci numbers

Consider the reproductive cycle of bees. Each male bee has a mother but no father; each female bee has both a mother and a father. If we examine the generations we see the following family tree:

[Figure: the family tree of a male bee’s ancestors, showing the females (♀) and males (♂) in each generation]

We easily see that the number of ancestors in each generation is the sum of the two numbers before it. For example, our male bee has three great-grandparents, two grandparents, and one parent, and 3 = 2 + 1. The number of ancestors a bee has in generation n is defined by the Fibonacci sequence; we can also see this by applying the rule of sum.
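As a quick sanity check, the counting argument can be simulated directly. The following Python sketch (the function name `bee_ancestors` is illustrative, not from the notes) tracks females and males separately, using the rule that every bee has one mother but only female bees also have a father:

```python
def bee_ancestors(generations):
    """Count a male bee's ancestors in each generation.

    Every bee has one mother; only female bees also have a father.
    Generation 0 is the male bee himself.
    """
    females, males = 0, 1   # generation 0: one male, no females
    counts = []
    for _ in range(generations):
        counts.append(females + males)
        # Each bee contributes one mother; each female also contributes one father.
        females, males = females + males, females
    return counts

print(bee_ancestors(8))   # the Fibonacci numbers appear, one per generation
```

Running this prints `[1, 1, 2, 3, 5, 8, 13, 21]`, matching the family tree above.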

As a second example, consider light entering two adjacent planes of glass:

At any meeting surface (between the two panes of glass, or between the glass and air), the light may either reflect or continue straight through (refract). For example, here is the light bouncing seven times before it leaves the glass.

In general, how many different paths can the light take if we are told that it bounces n times before leaving the glass?

The answer to the question (in case you haven’t guessed) rests with the Fibonacci sequence. We can apply the rule of sum to the event E constituting all paths through the glass in n bounces. We generate two separate sub-events, E1 and E2, illustrated in the following picture.


Sub-event E1: Let E1 be the event that the first bounce is not at the boundary between the two panes. In this case, the light must first bounce off the bottom pane, or else we are dealing with the case of having zero bounces (there is only one way to have zero bounces). However, the number of remaining paths after bouncing off the bottom pane is the same as the number of paths entering through the bottom pane and bouncing n − 1 bounces more. Entering through the bottom pane is the same as entering through the top pane (but flipped over), so E1 is the number of paths of light bouncing n − 1 times.

Sub-event E2: Let E2 be the event that the first bounce is on the boundary between the two panes. In this case, we consider the two options for the light after the first bounce: it can either leave the glass (in which case we are dealing with the case of having one bounce, and there is only one way for the light to bounce once) or it can bounce yet again on the top of the upper pane, in which case it is equivalent to the light entering from the top with n − 2 bounces to take along its path.

By the rule of sum, we thus get the following recurrence relation for Fn, the number of paths in which the light can travel with exactly n bounces. There is exactly one way for the light to travel with no bounces—straight through—and exactly two ways for the light to travel with only one bounce—off the bottom and off the middle. For any n > 1, there are Fn−1 paths where the light bounces off the bottom of the glass, and Fn−2 paths where the light bounces off the middle and then off the top.

F0 = 1
F1 = 2
Fn = Fn−1 + Fn−2

Stump a professor

What is the recurrence relation for three panes of glass? This question once stumped an anonymous professor¹ in a science discipline, but now you should be able to solve it with a bit of effort. Aren’t you proud of your knowledge?

¹Not me! —Jeff
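The two-pane recurrence is easy to tabulate. A minimal Python sketch (an illustration, not part of the original handout):

```python
def bounce_paths(n):
    """Paths through two panes with exactly n bounces:
    F0 = 1, F1 = 2, Fn = F(n-1) + F(n-2)."""
    if n == 0:
        return 1
    f_prev, f_curr = 1, 2   # F0 and F1
    for _ in range(n - 1):
        f_prev, f_curr = f_curr, f_prev + f_curr
    return f_curr

print([bounce_paths(n) for n in range(7)])   # a shifted Fibonacci sequence
```

This prints `[1, 2, 3, 5, 8, 13, 21]` — the Fibonacci numbers again, shifted by two places.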


2 Sequences, sequence operators, and annihilators

We have shown that several different problems can be expressed in terms of Fibonacci sequences, but we don’t yet know how to explicitly compute the nth Fibonacci number, or even (and more importantly) roughly how big it is. We can easily write a program to compute the nth Fibonacci number, but that doesn’t help us much here. What we really want is a closed form solution for the Fibonacci recurrence—an explicit algebraic formula without conditionals, loops, or recursion.

In order to solve recurrences like the Fibonacci recurrence, we first need to understand operations on infinite sequences of numbers. Although these sequences are formally defined as functions of the form A : ℕ → ℝ, we will write them either as A = 〈a0, a1, a2, a3, a4, . . .〉 when we want to emphasize the entire sequence², or as A = 〈ai〉 when we want to emphasize a generic element. For example, the Fibonacci sequence is 〈0, 1, 1, 2, 3, 5, 8, 13, 21, . . .〉.

We can naturally define several sequence operators:

We can add or subtract any two sequences:

〈ai〉 + 〈bi〉 = 〈a0, a1, a2, . . .〉 + 〈b0, b1, b2, . . .〉 = 〈a0 + b0, a1 + b1, a2 + b2, . . .〉 = 〈ai + bi〉
〈ai〉 − 〈bi〉 = 〈a0, a1, a2, . . .〉 − 〈b0, b1, b2, . . .〉 = 〈a0 − b0, a1 − b1, a2 − b2, . . .〉 = 〈ai − bi〉

We can multiply any sequence by a constant:

c · 〈ai〉 = c · 〈a0, a1, a2, . . .〉 = 〈c · a0, c · a1, c · a2, . . .〉 = 〈c · ai〉

We can shift any sequence to the left by removing its initial element:

E〈ai〉 = E〈a0, a1, a2, a3, . . .〉 = 〈a1, a2, a3, a4, . . .〉 = 〈ai+1〉

Example: We can understand these operators better by looking at some specific examples, using the sequence T of powers of two.

T = 〈2^0, 2^1, 2^2, 2^3, . . .〉 = 〈2^i〉
ET = 〈2^1, 2^2, 2^3, 2^4, . . .〉 = 〈2^(i+1)〉
2T = 〈2 · 2^0, 2 · 2^1, 2 · 2^2, 2 · 2^3, . . .〉 = 〈2^1, 2^2, 2^3, 2^4, . . .〉 = 〈2^(i+1)〉
2T − ET = 〈2^1 − 2^1, 2^2 − 2^2, 2^3 − 2^3, 2^4 − 2^4, . . .〉 = 〈0, 0, 0, 0, . . .〉 = 〈0〉
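These operators act on finite prefixes just as well as on infinite sequences, which makes them easy to experiment with. A small Python sketch (the helper names are illustrative, not from the notes; `zip` silently truncates to the shorter sequence, which is exactly what we want after a shift):

```python
def shift(seq):
    """The operator E: drop the first element (shift left)."""
    return seq[1:]

def scale(c, seq):
    """Multiply a sequence termwise by a constant c."""
    return [c * a for a in seq]

def sub(seq_a, seq_b):
    """Termwise difference; zip truncates to the shorter prefix."""
    return [a - b for a, b in zip(seq_a, seq_b)]

T = [2**i for i in range(10)]   # a finite prefix of <2^i>
ET = shift(T)                   # <2^(i+1)>
print(sub(scale(2, T), ET))     # 2T - ET: all zeros, so (E - 2) annihilates T
```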

2.1 Properties of operators

It turns out that the distributive property holds for these operators, so we can rewrite ET − 2T as (E − 2)T. Since (E − 2)T = 〈0, 0, 0, 0, . . .〉, we say that the operator (E − 2) annihilates T, and we call (E − 2) an annihilator of T. Obviously, we can trivially annihilate any sequence by multiplying it by zero, so as a technical matter, we do not consider multiplication by 0 to be an annihilator.

What happens when we apply the operator (E − 3) to our sequence T?

(E − 3)T = ET − 3T = 〈2^(i+1)〉 − 3〈2^i〉 = 〈2^(i+1) − 3 · 2^i〉 = 〈−2^i〉 = −T

The operator (E − 3) did very little to our sequence T; it just flipped the sign of each number in the sequence. In fact, we will soon see that only (E − 2) will annihilate T, and all other simple

²It really doesn’t matter whether we start a sequence with a0 or a1 or a5 or even a−17. Zero is often a convenient starting point for many recursively defined sequences, so we’ll usually start there.


operators will affect T in very minor ways. Thus, if we know how to annihilate the sequence, we know what the sequence must look like.

In general, (E − c) annihilates any geometric sequence A = 〈a0, a0c, a0c^2, a0c^3, . . .〉 = 〈a0c^i〉:

(E − c)〈a0c^i〉 = E〈a0c^i〉 − c〈a0c^i〉 = 〈a0c^(i+1)〉 − 〈c · a0c^i〉 = 〈a0c^(i+1) − a0c^(i+1)〉 = 〈0〉

To see that this is the only operator of this form that annihilates A, let’s see the effect of the operator (E − d) for some d ≠ c:

(E − d)〈a0c^i〉 = E〈a0c^i〉 − d〈a0c^i〉 = 〈a0c^(i+1) − d · a0c^i〉 = 〈(c − d)a0c^i〉 = (c − d)〈a0c^i〉

So we have a more rigorous confirmation that an annihilator annihilates exactly one type of sequence, but multiplies other similar sequences by a constant.

We can use this fact about annihilators of geometric sequences to solve certain recurrences. For example, consider the sequence R = 〈r0, r1, r2, . . .〉 defined recursively as follows:

r0 = 3
ri+1 = 5ri

We can easily prove that the operator (E − 5) annihilates R:

(E − 5)〈ri〉 = E〈ri〉 − 5〈ri〉 = 〈ri+1〉 − 〈5ri〉 = 〈ri+1 − 5ri〉 = 〈0〉

Since (E − 5) is an annihilator for R, we must have the closed form solution ri = r0·5^i = 3 · 5^i. We can easily verify this by induction, as follows:

r0 = 3 · 5^0 = 3               [definition]
ri = 5ri−1                     [definition]
   = 5 · (3 · 5^(i−1))         [induction hypothesis]
   = 3 · 5^i                   [algebra]

2.2 Multiple operators

An operator is a function that transforms one sequence into another. Like any other function, we can apply operators one after another to the same sequence. For example, we can multiply a sequence 〈ai〉 by a constant d and then by a constant c, resulting in the sequence c(d〈ai〉) = 〈c·d·ai〉 = (cd)〈ai〉. Alternatively, we may multiply the sequence by a constant c and then shift it to the left to get E(c〈ai〉) = E〈c · ai〉 = 〈c · ai+1〉. This is exactly the same as applying the operators in the reverse order: c(E〈ai〉) = c〈ai+1〉 = 〈c · ai+1〉. We can also shift the sequence twice to the left: E(E〈ai〉) = E〈ai+1〉 = 〈ai+2〉. We will write this in shorthand as E^2〈ai〉. More generally, the operator E^k shifts a sequence k steps to the left: E^k〈ai〉 = 〈ai+k〉.

We now have the tools to solve a whole host of recurrence problems. For example, what annihilates C = 〈2^i + 3^i〉? Well, we know that (E − 2) annihilates 〈2^i〉 while leaving 〈3^i〉 essentially unscathed. Similarly, (E − 3) annihilates 〈3^i〉 while leaving 〈2^i〉 essentially unscathed. Thus, if we apply both operators one after the other, we see that (E − 2)(E − 3) annihilates our sequence C.

In general, for any integers a ≠ b, the operator (E − a)(E − b) annihilates any sequence of the form 〈c1·a^i + c2·b^i〉 but nothing else. We will often ‘multiply out’ the operators into the shorthand notation E^2 − (a + b)E + ab. It is left as an exhilarating exercise to the student to verify that this


shorthand actually makes sense—the operators (E − a)(E − b) and E^2 − (a + b)E + ab have the same effect on every sequence.

We now finally know enough to solve the recurrence for Fibonacci numbers. Specifically, notice that the recurrence Fi = Fi−1 + Fi−2 is annihilated by E^2 − E − 1:

(E^2 − E − 1)〈Fi〉 = E^2〈Fi〉 − E〈Fi〉 − 〈Fi〉
                  = 〈Fi+2〉 − 〈Fi+1〉 − 〈Fi〉
                  = 〈Fi+2 − Fi+1 − Fi〉
                  = 〈0〉

Factoring E^2 − E − 1 using the quadratic formula, we obtain

E^2 − E − 1 = (E − φ)(E − φ̂)

where φ = (1 + √5)/2 ≈ 1.618034 is the golden ratio and φ̂ = (1 − √5)/2 = 1 − φ = −1/φ. Thus, the operator (E − φ)(E − φ̂) annihilates the Fibonacci sequence, so Fi must have the form

Fi = c·φ^i + ĉ·φ̂^i

for some constants c and ĉ. We call this the generic solution to the recurrence, since it doesn’t depend at all on the base cases. To compute the constants c and ĉ, we use the base cases F0 = 0 and F1 = 1 to obtain a pair of linear equations:

F0 = 0 = c + ĉ
F1 = 1 = cφ + ĉφ̂

Solving this system of equations gives us c = 1/(2φ − 1) = 1/√5 and ĉ = −1/√5.

We now have a closed-form expression for the ith Fibonacci number:

Fi = (φ^i − φ̂^i)/√5 = (1/√5)·((1 + √5)/2)^i − (1/√5)·((1 − √5)/2)^i

With all the square roots in this formula, it’s quite amazing that Fibonacci numbers are integers. However, if we do all the math correctly, all the square roots cancel out when i is an integer. (In fact, this is pretty easy to prove using the binomial theorem.)
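The closed form is easy to check numerically. A short Python sketch (an illustration, not from the notes; it uses floating point, so we round before comparing against the exact iterative values):

```python
from math import sqrt

def fib(i):
    """Iterative Fibonacci numbers: F0 = 0, F1 = 1."""
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a

phi = (1 + sqrt(5)) / 2       # the golden ratio
phi_hat = (1 - sqrt(5)) / 2   # its conjugate, 1 - phi

# Closed form: Fi = (phi^i - phi_hat^i) / sqrt(5); the roots cancel.
for i in range(30):
    closed = (phi**i - phi_hat**i) / sqrt(5)
    assert round(closed) == fib(i)   # rounding absorbs floating-point error
```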

2.3 Degenerate cases

We can’t quite solve every recurrence yet. In our above formulation of (E − a)(E − b), we assumed that a ≠ b. What about the operator (E − a)(E − a) = (E − a)^2? It turns out that this operator annihilates sequences such as 〈i·a^i〉:

(E − a)〈i·a^i〉 = 〈(i + 1)a^(i+1) − a·i·a^i〉
              = 〈(i + 1)a^(i+1) − i·a^(i+1)〉
              = 〈a^(i+1)〉

(E − a)^2〈i·a^i〉 = (E − a)〈a^(i+1)〉 = 〈0〉

More generally, the operator (E − a)^k annihilates any sequence 〈p(i) · a^i〉, where p(i) is any polynomial in i of degree k − 1. As an example, (E − 1)^3 annihilates the sequence 〈i^2 · 1^i〉 = 〈i^2〉 = 〈1, 4, 9, 16, 25, . . .〉, since p(i) = i^2 is a polynomial of degree k − 1 = 2.

As a review, try to explain the following statements:


• (E − 1) annihilates any constant sequence 〈α〉.
• (E − 1)^2 annihilates any arithmetic sequence 〈α + βi〉.
• (E − 1)^3 annihilates any quadratic sequence 〈α + βi + γi^2〉.
• (E − 3)(E − 2)(E − 1) annihilates any sequence 〈α + β2^i + γ3^i〉.
• (E − 3)^2(E − 2)(E − 1) annihilates any sequence 〈α + β2^i + γ3^i + δi3^i〉.
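Each of these claims can be tested mechanically on a finite prefix of the sequence. A small Python sketch (the helper `annihilate` is illustrative, not from the notes):

```python
def annihilate(root, seq):
    """Apply the operator (E - root) to a finite sequence prefix."""
    return [seq[i + 1] - root * seq[i] for i in range(len(seq) - 1)]

# (E - 1)^2 annihilates the arithmetic sequence <7 + 3i>:
arith = [7 + 3 * i for i in range(10)]
once = annihilate(1, arith)       # (E - 1): leaves the constant residue <3>
twice = annihilate(1, once)       # (E - 1)^2: all zeros
print(twice)

# (E - 3)(E - 2)(E - 1) annihilates <4 + 5*2^i + 6*3^i>:
mixed = [4 + 5 * 2**i + 6 * 3**i for i in range(10)]
print(annihilate(3, annihilate(2, annihilate(1, mixed))))
```

Both lines print sequences of zeros (the prefix shrinks by one with each application of an operator).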

2.4 Summary

In summary, we have learned several operators that act on sequences, as well as a few ways of combining operators.

Operator                    Definition
Addition                    〈ai〉 + 〈bi〉 = 〈ai + bi〉
Subtraction                 〈ai〉 − 〈bi〉 = 〈ai − bi〉
Scalar multiplication       c〈ai〉 = 〈c·ai〉
Shift                       E〈ai〉 = 〈ai+1〉
k-fold shift                E^k〈ai〉 = 〈ai+k〉
Composition of operators    (X + Y)〈ai〉 = X〈ai〉 + Y〈ai〉
                            (X − Y)〈ai〉 = X〈ai〉 − Y〈ai〉
                            XY〈ai〉 = X(Y〈ai〉) = Y(X〈ai〉)

Notice that we have not defined a multiplication operator for two sequences. This is usually accomplished by convolution:

〈ai〉 ∗ 〈bi〉 = 〈 ∑_{j=0}^{i} aj·bi−j 〉

Fortunately, convolution is unnecessary for solving the recurrences we will see in this course.

We have also learned some things about annihilators, which can be summarized as follows:

Sequence                                       Annihilator
〈α〉                                            E − 1
〈α·a^i〉                                        E − a
〈α·a^i + β·b^i〉                                (E − a)(E − b)
〈α0·a0^i + α1·a1^i + · · · + αn·an^i〉          (E − a0)(E − a1) · · · (E − an)
〈αi + β〉                                       (E − 1)^2
〈(αi + β)a^i〉                                  (E − a)^2
〈(αi + β)a^i + γ·b^i〉                          (E − a)^2(E − b)
〈(α0 + α1·i + · · · + αn−1·i^(n−1))a^i〉        (E − a)^n

If X annihilates 〈ai〉, then X also annihilates c〈ai〉 for any constant c.
If X annihilates 〈ai〉 and Y annihilates 〈bi〉, then XY annihilates 〈ai〉 ± 〈bi〉.

3 Solving Linear Recurrences

3.1 Homogeneous Recurrences

The general expressions in the annihilator box above are really the most important things to remember about annihilators because they help you to solve any recurrence for which you can write down an annihilator. The general method is:


1. Write down the annihilator for the recurrence
2. Factor the annihilator
3. Determine the sequence annihilated by each factor
4. Add these sequences together to form the generic solution
5. Solve for constants of the solution by using initial conditions

Example: Let’s show the steps required to solve the following recurrence:

r0 = 1
r1 = 5
r2 = 17
ri = 7ri−1 − 16ri−2 + 12ri−3

1. Write down the annihilator. Since ri+3 − 7ri+2 + 16ri+1 − 12ri = 0, the annihilator is E^3 − 7E^2 + 16E − 12.

2. Factor the annihilator. E^3 − 7E^2 + 16E − 12 = (E − 2)^2(E − 3).

3. Determine sequences annihilated by each factor. (E − 2)^2 annihilates 〈(αi + β)2^i〉 for any constants α and β, and (E − 3) annihilates 〈γ3^i〉 for any constant γ.

4. Combine the sequences. (E − 2)^2(E − 3) annihilates 〈(αi + β)2^i + γ3^i〉 for any constants α, β, γ.

5. Solve for the constants. The base cases give us three equations in the three unknowns α, β, γ:

r0 = 1 = (α · 0 + β)2^0 + γ · 3^0 = β + γ
r1 = 5 = (α · 1 + β)2^1 + γ · 3^1 = 2α + 2β + 3γ
r2 = 17 = (α · 2 + β)2^2 + γ · 3^2 = 8α + 4β + 9γ

We can solve these equations to get α = 1, β = 0, γ = 1. Thus, our final solution is ri = i·2^i + 3^i, which we can verify by induction.
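The induction can also be checked mechanically for small cases. A short Python sketch (an illustration, not from the notes):

```python
def r_closed(i):
    """Claimed closed form for the example: r_i = i * 2^i + 3^i."""
    return i * 2**i + 3**i

# Check the base cases...
assert (r_closed(0), r_closed(1), r_closed(2)) == (1, 5, 17)
# ...and the recurrence r_i = 7 r_{i-1} - 16 r_{i-2} + 12 r_{i-3}.
for i in range(3, 20):
    assert r_closed(i) == 7 * r_closed(i - 1) - 16 * r_closed(i - 2) + 12 * r_closed(i - 3)
```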

3.2 Non-homogeneous Recurrences

A height balanced tree is a binary tree, where the heights of the two subtrees of the root differ by at most one, and both subtrees are also height balanced. To ground the recursive definition, the empty set is considered a height balanced tree of height −1, and a single node is a height balanced tree of height 0.

Let Tn be the smallest height-balanced tree of height n—how many nodes does Tn have? Well, one of the subtrees of Tn has height n − 1 (since Tn has height n) and the other has height either n − 1 or n − 2 (since Tn is height-balanced and as small as possible). Since both subtrees are themselves height-balanced, the two subtrees must be Tn−1 and Tn−2.

We have just derived the following recurrence for tn, the number of nodes in the tree Tn:

t−1 = 0    [the empty set]
t0 = 1     [a single node]
tn = tn−1 + tn−2 + 1


The final ‘+1’ is for the root of Tn.

We refer to the terms in the equation involving ti’s as the homogeneous terms and the rest as the non-homogeneous terms. (If there were no non-homogeneous terms, we would say that the recurrence itself is homogeneous.) We know that E^2 − E − 1 annihilates the homogeneous part tn = tn−1 + tn−2. Let us try applying this annihilator to the entire equation:

(E^2 − E − 1)〈ti〉 = E^2〈ti〉 − E〈ti〉 − 1〈ti〉
                  = 〈ti+2〉 − 〈ti+1〉 − 〈ti〉
                  = 〈ti+2 − ti+1 − ti〉
                  = 〈1〉

The leftover sequence 〈1, 1, 1, . . .〉 is called the residue. To obtain the annihilator for the entire recurrence, we compose the annihilator for its homogeneous part with the annihilator of its residue. Since E − 1 annihilates 〈1〉, it follows that (E^2 − E − 1)(E − 1) annihilates 〈tn〉. We can factor the annihilator into

(E − φ)(E − φ̂)(E − 1),

so our annihilator rules tell us that

tn = αφ^n + βφ̂^n + γ

for some constants α, β, γ. We call this the generic solution to the recurrence. Different recurrences can have the same generic solution.

To solve for the unknown constants, we need three equations in three unknowns. Our base cases give us two equations, and we can get a third by examining the next nontrivial case t1 = 2:

t−1 = 0 = αφ^(−1) + βφ̂^(−1) + γ = α/φ + β/φ̂ + γ
t0 = 1 = αφ^0 + βφ̂^0 + γ = α + β + γ
t1 = 2 = αφ^1 + βφ̂^1 + γ = αφ + βφ̂ + γ

Solving these equations, we find that α = (√5 + 2)/√5, β = (√5 − 2)/√5, and γ = −1. Thus,

tn = ((√5 + 2)/√5)·((1 + √5)/2)^n + ((√5 − 2)/√5)·((1 − √5)/2)^n − 1
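Both the recurrence and this closed form are easy to tabulate and compare. A short Python sketch (an illustration, not from the notes; floating point, so we round before comparing):

```python
from math import sqrt

def t_rec(n, memo={-1: 0, 0: 1}):
    """Nodes in the smallest height-balanced tree of height n, by the recurrence."""
    if n not in memo:
        memo[n] = t_rec(n - 1) + t_rec(n - 2) + 1
    return memo[n]

phi = (1 + sqrt(5)) / 2
phi_hat = (1 - sqrt(5)) / 2
alpha = (sqrt(5) + 2) / sqrt(5)
beta = (sqrt(5) - 2) / sqrt(5)

for n in range(25):
    closed = alpha * phi**n + beta * phi_hat**n - 1
    assert round(closed) == t_rec(n)   # the closed form matches the recurrence

print([t_rec(n) for n in range(8)])    # [1, 2, 4, 7, 12, 20, 33, 54]
```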

Here is the general method for non-homogeneous recurrences:

1. Write down the homogeneous annihilator, directly from the recurrence
1½. ‘Multiply’ by the annihilator for the residue
2. Factor the annihilator
3. Determine what sequence each factor annihilates
4. Add these sequences together to form the generic solution
5. Solve for constants of the solution by using initial conditions


3.3 Some more examples

In each example below, we use the base cases a0 = 0 and a1 = 1.

• an = an−1 + an−2 + 2

– The homogeneous annihilator is E^2 − E − 1.
– The residue is the constant sequence 〈2, 2, 2, . . .〉, which is annihilated by E − 1.
– Thus, the annihilator is (E^2 − E − 1)(E − 1).
– The annihilator factors into (E − φ)(E − φ̂)(E − 1).
– Thus, the generic solution is an = αφ^n + βφ̂^n + γ.
– The constants α, β, γ satisfy the equations

a0 = 0 = α + β + γ
a1 = 1 = αφ + βφ̂ + γ
a2 = 3 = αφ^2 + βφ̂^2 + γ

– Solving the equations gives us α = (√5 + 2)/√5, β = (√5 − 2)/√5, and γ = −2.

– So the final solution is an = ((√5 + 2)/√5)·((1 + √5)/2)^n + ((√5 − 2)/√5)·((1 − √5)/2)^n − 2.

(In the remaining examples, I won’t explicitly enumerate the steps like this.)
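Step 5 is always just a small linear system. Here is a Python sketch of that step for the example above (the helper `solve3` and all names are illustrative, not from the notes):

```python
from math import sqrt

phi = (1 + sqrt(5)) / 2
phi_hat = (1 - sqrt(5)) / 2

def solve3(M, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    A = [row[:] + [r] for row, r in zip(M, rhs)]   # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]

# Base cases a0 = 0, a1 = 1, a2 = 3 for the generic solution
# a_n = alpha*phi^n + beta*phi_hat^n + gamma:
M = [[1, 1, 1],
     [phi, phi_hat, 1],
     [phi**2, phi_hat**2, 1]]
alpha, beta, gamma = solve3(M, [0, 1, 3])
assert abs(alpha - (sqrt(5) + 2) / sqrt(5)) < 1e-9
assert abs(gamma - (-2)) < 1e-9
```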

• an = an−1 + an−2 + 3

The homogeneous annihilator (E^2 − E − 1) leaves a constant residue 〈3, 3, 3, . . .〉, so the annihilator is (E^2 − E − 1)(E − 1), and the generic solution is an = αφ^n + βφ̂^n + γ. Solving the equations

a0 = 0 = α + β + γ
a1 = 1 = αφ + βφ̂ + γ
a2 = 4 = αφ^2 + βφ̂^2 + γ

gives us the final solution an = ((3 + √5)/2)·((1 + √5)/2)^n + ((3 − √5)/2)·((1 − √5)/2)^n − 3.

• an = an−1 + an−2 + 2^n

The homogeneous annihilator (E^2 − E − 1) leaves an exponential residue 〈4, 8, 16, 32, . . .〉 = 〈2^(i+2)〉, which is annihilated by E − 2. Thus, the annihilator is (E^2 − E − 1)(E − 2), and the generic solution is an = αφ^n + βφ̂^n + γ2^n. The constants α, β, γ satisfy the following equations:

a0 = 0 = α + β + γ
a1 = 1 = αφ + βφ̂ + 2γ
a2 = 5 = αφ^2 + βφ̂^2 + 4γ


• an = an−1 + an−2 + n

The homogeneous annihilator (E^2 − E − 1) leaves a linear residue 〈2, 3, 4, 5, . . .〉 = 〈i + 2〉, which is annihilated by (E − 1)^2. Thus, the annihilator is (E^2 − E − 1)(E − 1)^2, and the generic solution is an = αφ^n + βφ̂^n + γ + δn. The constants α, β, γ, δ satisfy the following equations:

a0 = 0 = α + β + γ
a1 = 1 = αφ + βφ̂ + γ + δ
a2 = 3 = αφ^2 + βφ̂^2 + γ + 2δ
a3 = 7 = αφ^3 + βφ̂^3 + γ + 3δ

• an = an−1 + an−2 + n^2

The homogeneous annihilator (E^2 − E − 1) leaves a quadratic residue 〈4, 9, 16, 25, . . .〉 = 〈(i + 2)^2〉, which is annihilated by (E − 1)^3. Thus, the annihilator is (E^2 − E − 1)(E − 1)^3, and the generic solution is an = αφ^n + βφ̂^n + γ + δn + εn^2. The constants α, β, γ, δ, ε satisfy the following equations:

a0 = 0 = α + β + γ
a1 = 1 = αφ + βφ̂ + γ + δ + ε
a2 = 5 = αφ^2 + βφ̂^2 + γ + 2δ + 4ε
a3 = 15 = αφ^3 + βφ̂^3 + γ + 3δ + 9ε
a4 = 36 = αφ^4 + βφ̂^4 + γ + 4δ + 16ε

• an = an−1 + an−2 + n^2 − 2^n

The homogeneous annihilator (E^2 − E − 1) leaves the residue 〈(i + 2)^2 − 2^(i+2)〉. The quadratic part of the residue is annihilated by (E − 1)^3, and the exponential part is annihilated by (E − 2). Thus, the annihilator for the whole recurrence is (E^2 − E − 1)(E − 1)^3(E − 2), and so the generic solution is an = αφ^n + βφ̂^n + γ + δn + εn^2 + η2^n. The constants α, β, γ, δ, ε, η satisfy a system of six equations in six unknowns determined by a0, a1, . . . , a5.

a_n = a_{n−1} + a_{n−2} + φ^n

The annihilator is (E^2 − E − 1)(E − φ) = (E − φ)^2(E − φ̂), so the generic solution is a_n = αφ^n + βnφ^n + γφ̂^n. (Other recurrence-solving methods would have an "interference" problem with this equation; the operator method does not.)

Our method does not work on recurrences like a_n = a_{n−1} + 1/n or a_n = a_{n−1} + lg n, because the functions 1/n and lg n do not have annihilators. Our tool, as it stands, is limited to linear recurrences.
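The algebra above is easy to spot-check numerically. The Python sketch below (all names are mine, not part of the notes) solves for the constants in the exponential example a_n = a_{n−1} + a_{n−2} + 2^n, using the observation that substituting γ2^n into the recurrence forces γ = 4, and then compares the closed form against values computed directly from the recurrence:

```python
from math import sqrt

phi = (1 + sqrt(5)) / 2      # roots of x^2 = x + 1, i.e. of (E^2 - E - 1)
phihat = (1 - sqrt(5)) / 2

# a(n) = a(n-1) + a(n-2) + 2^n with a0 = 0, a1 = 1, computed directly.
def a(n):
    lo, hi = 0, 1
    for i in range(2, n + 1):
        lo, hi = hi, hi + lo + 2**i
    return hi if n >= 1 else lo

# Substituting gamma*2^n into the recurrence gives 4g = 2g + g + 4, so gamma = 4.
gamma = 4.0
# The base cases a0 = 0 and a1 = 1 then give a 2x2 system for alpha and beta:
#   alpha + beta             = 0 - gamma
#   alpha*phi + beta*phihat  = 1 - 2*gamma
alpha = ((1 - 2 * gamma) - phihat * (0 - gamma)) / (phi - phihat)
beta = (0 - gamma) - alpha

def closed(n):
    return alpha * phi**n + beta * phihat**n + gamma * 2**n

for n in range(25):
    assert abs(closed(n) - a(n)) < 1e-6 * max(1, a(n))
```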

4 Divide and Conquer Recurrences and the Master Theorem<br />

Divide and conquer algorithms often give us running-time recurrences of the form

    T(n) = a T(n/b) + f(n)        (1)

where a and b are constants and f(n) is some other function. The so-called 'Master Theorem' gives us a general method for solving such recurrences when f(n) is a simple polynomial.


Unfortunately, the Master Theorem doesn't work for all functions f(n), and many useful recurrences don't look like (1) at all. Fortunately, there's a general technique to solve most divide-and-conquer recurrences, even if they don't have this form. This technique is used to prove the Master Theorem, so if you remember this technique, you can forget the Master Theorem entirely (which is what I did). Throw off your chains!

I'll illustrate the technique using the generic recurrence (1). We start by drawing a recursion tree. The root of the recursion tree is a box containing the value f(n); it has a children, each of which is the root of a recursion tree for T(n/b). Equivalently, a recursion tree is a complete a-ary tree where each node at depth i contains the value f(n/b^i). The recursion stops when we get to the base case(s) of the recurrence. Since we're looking for asymptotic bounds, it turns out not to matter much what we use for the base case; for purposes of illustration, I'll assume that T(1) = f(1).

[Figure: a recursion tree for the recurrence T(n) = a T(n/b) + f(n). Each node at depth i holds f(n/b^i); the level sums, from the root down, are f(n), a f(n/b), a^2 f(n/b^2), a^3 f(n/b^3), . . . , a^L f(1).]

Now T(n) is just the sum of all values stored in the tree. Assuming that each level of the tree is full, we have

    T(n) = f(n) + a f(n/b) + a^2 f(n/b^2) + · · · + a^i f(n/b^i) + · · · + a^L f(n/b^L)

where L is the depth of the recursion tree. We easily see that L = log_b n, since n/b^L = 1. Since f(1) = Θ(1), the last non-zero term in the summation is Θ(a^L) = Θ(a^(log_b n)) = Θ(n^(log_b a)).
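For powers of b, the level-sum claim can be checked mechanically. A Python sketch (the function names are illustrative) that compares the recursive definition against the sum of level sums:

```python
# Generic divide-and-conquer recurrence T(n) = a*T(n/b) + f(n), where n is a
# power of b and T(1) = f(1).  The recursion-tree claim: T(n) equals the sum
# of the level sums a^i * f(n/b^i) for i = 0 .. L, where L = log_b(n).
def T(n, a, b, f):
    return f(n) if n == 1 else a * T(n // b, a, b, f) + f(n)

def level_sum(n, a, b, f):
    total, depth, m = 0, 0, n
    while m >= 1:
        total += a**depth * f(m)    # a^depth nodes, each holding f(m)
        if m == 1:
            break
        m //= b
        depth += 1
    return total

f = lambda n: n                      # e.g. a linear split/merge cost
assert T(1024, 2, 2, f) == level_sum(1024, 2, 2, f)
assert T(3**7, 4, 3, f) == level_sum(3**7, 4, 3, f)
```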

Now we can easily state and prove the Master Theorem, in a slightly different form than it's usually stated.

The Master Theorem. The recurrence T(n) = a T(n/b) + f(n) can be solved as follows.

If a f(n/b) = κ f(n) for some constant κ < 1, then T(n) = Θ(f(n)).
If a f(n/b) = K f(n) for some constant K > 1, then T(n) = Θ(n^(log_b a)).
If a f(n/b) = f(n), then T(n) = Θ(f(n) log_b n).

Proof: If f(n) is a constant factor larger than a f(n/b), then by induction, the sum is a descending geometric series. The sum of any geometric series is a constant times its largest term. In this case, the largest term is the first term f(n).

If f(n) is a constant factor smaller than a f(n/b), then by induction, the sum is an ascending geometric series. The sum of any geometric series is a constant times its largest term. In this case, this is the last term, which by our earlier argument is Θ(n^(log_b a)).


Finally, if a f(n/b) = f(n), then by induction, each of the L + 1 terms in the summation is equal to f(n). □

Here are a few canonical examples of the Master Theorem in action:<br />

Randomized selection: T(n) = T(3n/4) + n
Here a f(n/b) = 3n/4 is smaller than f(n) = n by a factor of 4/3, so T(n) = Θ(n).

Karatsuba's multiplication algorithm: T(n) = 3T(n/2) + n
Here a f(n/b) = 3n/2 is bigger than f(n) = n by a factor of 3/2, so T(n) = Θ(n^(log_2 3)).

Mergesort: T(n) = 2T(n/2) + n
Here a f(n/b) = f(n), so T(n) = Θ(n log n).

T(n) = 4T(n/2) + n lg n
In this case, we have a f(n/b) = 2n lg n − 2n, which is not quite twice f(n) = n lg n. However, for sufficiently large n (which is all we care about with asymptotic bounds) we have 2f(n) > a f(n/b) > 1.9 f(n). Since the level sums are bounded both above and below by ascending geometric series, the solution is T(n) = Θ(n^(log_2 4)) = Θ(n^2). (This trick will not work in the second or third cases of the Master Theorem!)
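The three cases can be told apart mechanically by evaluating the ratio a f(n/b)/f(n) at a large n. A rough Python classifier, purely illustrative and no substitute for the proof above (the thresholds 0.99 and 1.01 are arbitrary slack for the "constant factor" tests):

```python
def master_case(a, b, f, n=2.0**40):
    """Classify T(n) = a*T(n/b) + f(n) by comparing a*f(n/b) to f(n) at large n."""
    ratio = a * f(n / b) / f(n)
    if ratio < 0.99:
        return "Theta(f(n))"            # descending geometric series: root dominates
    if ratio > 1.01:
        return "Theta(n^log_b(a))"      # ascending series: leaves dominate
    return "Theta(f(n) * log n)"        # all L+1 level sums contribute equally

assert master_case(1, 4/3, lambda n: n) == "Theta(f(n))"        # selection
assert master_case(3, 2, lambda n: n) == "Theta(n^log_b(a))"    # Karatsuba
assert master_case(2, 2, lambda n: n) == "Theta(f(n) * log n)"  # mergesort
```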

Using the same recursion-tree technique, we can also solve recurrences where the Master Theorem<br />

doesn’t apply.<br />

T(n) = 2T(n/2) + n/lg n

We can't apply the Master Theorem here, because a f(n/b) = n/(lg n − 1) isn't equal to f(n) = n/lg n, but the difference isn't a constant factor. So we need to compute each of the level sums and compute their total in some other way. It's not hard to see that the sum of all the nodes in the ith level is n/(lg n − i). In particular, this means the depth of the tree is at most lg n − 1.

    T(n) = Σ_{i=0}^{lg n − 1} n/(lg n − i) = Σ_{j=1}^{lg n} n/j = n H_{lg n} = Θ(n lg lg n)
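The collapse of the level sums into n times a harmonic number can be checked exactly with rational arithmetic. A Python sketch, assuming the base case T(1) = 0 (which is what makes the sum stop at level lg n − 1):

```python
from fractions import Fraction

# T(n) = 2*T(n/2) + n/lg(n) for powers of two, with T(1) = 0 as the base case.
def T(n):
    if n == 1:
        return Fraction(0)
    k = n.bit_length() - 1          # lg n, exact for powers of two
    return 2 * T(n // 2) + Fraction(n, k)

def harmonic(m):
    return sum(Fraction(1, j) for j in range(1, m + 1))

# The level sums collapse to n * H_{lg n}, exactly:
for k in range(1, 12):
    n = 2**k
    assert T(n) == n * harmonic(k)
```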

Randomized quicksort: T(n) = T(3n/4) + T(n/4) + n

In this case, nodes in the same level of the recursion tree have different values. This makes the tree lopsided; different leaves are at different levels. However, it's not too hard to see that the nodes in any complete level (i.e., above any of the leaves) sum to n, so this is like the last case of the Master Theorem, and that every leaf has depth between log_4 n and log_{4/3} n. To derive an upper bound, we overestimate T(n) by ignoring the base cases and extending the tree downward to the level of the deepest leaf. Similarly, to derive a lower bound, we underestimate T(n) by counting only nodes in the tree up to the level of the shallowest leaf. These observations give us the upper and lower bounds n log_4 n ≤ T(n) ≤ n log_{4/3} n. Since these bounds differ by only a constant factor, we have T(n) = Θ(n log n).


Deterministic selection: T(n) = T(n/5) + T(7n/10) + n

Again, we have a lopsided recursion tree. If we look only at complete levels of the tree, we find that the level sums form a descending geometric series T(n) = n + 9n/10 + 81n/100 + · · ·, so this is like the first case of the Master Theorem. We can get an upper bound by ignoring the base cases entirely and growing the tree out to infinity, and we can get a lower bound by only counting nodes in complete levels. Either way, the geometric series is dominated by its largest term, so T(n) = Θ(n).

T(n) = √n · T(√n) + n

In this case, we have a complete recursion tree, but the degree of the nodes is no longer constant, so we have to be a bit more careful. It's not hard to see that the nodes in any level sum to n, so this is like the third Master case. The depth L satisfies the identity n^(1/2^L) = 2 (we can't get all the way down to 1 by taking square roots), so L = lg lg n and T(n) = Θ(n lg lg n).
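The claim L = lg lg n is easy to verify by simply counting square roots. A small Python sketch, stopping at 2 as in the text (the tolerance guards against floating-point fuzz at the boundary):

```python
from math import sqrt

# Repeatedly taking square roots: after L steps the value is n^(1/2^L).
# We stop when the value drops to 2 (we never reach 1), so L = lg lg n.
def sqrt_depth(n):
    depth = 0
    while n > 2.000001:          # tolerate float fuzz at the boundary
        n = sqrt(n)
        depth += 1
    return depth

for k in (1, 2, 3, 4, 5):
    n = 2.0 ** (2**k)            # n = 2^(2^k), so lg lg n = k exactly
    assert sqrt_depth(n) == k
```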

T(n) = 2√n · T(√n) + n

We still have at most lg lg n levels, but now the nodes in level i sum to 2^i n. We have an increasing geometric series of level sums, like the second Master case, so T(n) is dominated by the sum over the deepest level: T(n) = Θ(2^(lg lg n) n) = Θ(n log n).

T(n) = 4√n · T(√n) + n

Now the nodes in level i sum to 4^i n. Again, we have an increasing geometric series, like the second Master case, so we only care about the leaves: T(n) = Θ(4^(lg lg n) n) = Θ(n log^2 n). Ick!

5 Transforming Recurrences<br />

5.1 An analysis of mergesort: domain transformation<br />

Previously we gave the recurrence for mergesort as T(n) = 2T(n/2) + n, and obtained the solution T(n) = Θ(n log n) using the Master Theorem (or the recursion tree method if you, like me, can't remember the Master Theorem). This is fine if n is a power of two, but for other values of n, this recurrence is incorrect. When n is odd, the recurrence calls for us to sort a fractional number of elements! Worse yet, if n is not a power of two, we will never reach the base case T(1) = 0.

To get a recurrence that's valid for all integers n, we need to carefully add ceilings and floors:

    T(n) = T(⌈n/2⌉) + T(⌊n/2⌋) + n.

We have almost no hope of getting an exact solution here; the floors and ceilings will eventually<br />

kill us. So instead, let’s just try to get a tight asymptotic upper bound for T (n) using a technique<br />

called domain transformation. A domain transformation rewrites a function T (n) with a difficult<br />

recurrence as a nested function S(f(n)), where f(n) is a simple function and S() has an easier<br />

recurrence.<br />

First we overestimate the time bound, once by pretending that the two subproblem sizes are equal, and again to eliminate the ceiling:

    T(n) ≤ 2T(⌈n/2⌉) + n ≤ 2T(n/2 + 1) + n.


Now we define a new function S(n) = T(n + α), where α is an unknown constant, chosen so that S(n) satisfies the Master-ready recurrence S(n) ≤ 2S(n/2) + O(n). To figure out the correct value of α, we compare two versions of the recurrence for the function T(n + α):

    S(n) ≤ 2S(n/2) + O(n)     ⟹  T(n + α) ≤ 2T(n/2 + α) + O(n)
    T(n) ≤ 2T(n/2 + 1) + n    ⟹  T(n + α) ≤ 2T((n + α)/2 + 1) + n + α

For these two recurrences to be equal, we need n/2 + α = (n + α)/2 + 1, which implies that α = 2. The Master Theorem now tells us that S(n) = O(n log n), so

    T(n) = S(n − 2) = O((n − 2) log(n − 2)) = O(n log n).

A similar argument gives a matching lower bound T (n) = Ω(n log n). So T (n) = Θ(n log n) after<br />

all, just as though we had ignored the floors and ceilings from the beginning!<br />
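The floor-and-ceiling recurrence is easy to tabulate, so the Θ(n log n) sandwich can be checked directly. A Python sketch, assuming the base case T(1) = 0; the constants 1 and 2 in the final check are convenient choices, not the tightest possible:

```python
from functools import lru_cache
from math import log2

# Mergesort with explicit floors and ceilings:
#   T(n) = T(ceil(n/2)) + T(floor(n/2)) + n, with T(1) = 0.
@lru_cache(maxsize=None)
def T(n):
    if n == 1:
        return 0
    return T((n + 1) // 2) + T(n // 2) + n

# For powers of two the recursion-tree answer is exact: T(2^k) = k * 2^k.
for k in range(1, 15):
    assert T(2**k) == k * 2**k

# For every n, T(n) stays within constant factors of n lg n,
# just as the domain transformation predicts.
for n in range(2, 5000):
    assert n * log2(n) <= T(n) <= 2 * n * log2(n)
```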

Domain transformations are useful for removing floors, ceilings, and lower order terms from the<br />

arguments of any recurrence that otherwise looks like it ought to fit either the Master Theorem or<br />

the recursion tree method. But now that we know this, we don’t need to bother grinding through<br />

the actual gory details!<br />

5.2 A less trivial example<br />

There is a data structure in computational geometry called ham-sandwich trees, where the cost<br />

of doing a certain search operation obeys the recurrence T (n) = T (n/2) + T (n/4) + 1. This<br />

doesn’t fit the Master theorem, because the two subproblems have different sizes, and using the<br />

recursion tree method only gives us the loose bounds √ n ≪ T (n) ≪ n.<br />

Domain transformations save the day. If we define the new function t(k) = T(2^k), we have a new recurrence

    t(k) = t(k − 1) + t(k − 2) + 1

which should immediately remind you of Fibonacci numbers. Sure enough, after a bit of work, the annihilator method gives us the solution t(k) = Θ(φ^k), where φ = (1 + √5)/2 is the golden ratio. This implies that

    T(n) = t(lg n) = Θ(φ^(lg n)) = Θ(n^(lg φ)) ≈ Θ(n^0.69424).

It’s possible to solve this recurrence without domain transformations and annihilators—in fact,<br />

the inventors of ham-sandwich trees did so—but it’s much more difficult.<br />
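The substitution n = 2^k is easy to probe numerically: consecutive values of t(k) should grow by a factor approaching φ. A Python sketch, with arbitrary base cases t(0) = t(1) = 1 (the base cases only affect the hidden constant, not the growth rate):

```python
# Search cost in ham-sandwich trees: T(n) = T(n/2) + T(n/4) + 1.
# Substituting n = 2^k gives t(k) = t(k-1) + t(k-2) + 1, a Fibonacci-like
# recurrence: s(k) = t(k) + 1 satisfies s(k) = s(k-1) + s(k-2) exactly.
def t(k):
    a, b = 1, 1                  # t(0) = t(1) = 1, say
    for _ in range(k - 1):
        a, b = b, a + b + 1
    return b if k >= 1 else a

phi = (1 + 5**0.5) / 2

# t(k) = Theta(phi^k): consecutive ratios converge to the golden ratio,
# so T(n) = Theta(phi^(lg n)) = Theta(n^(lg phi)).
assert abs(t(40) / t(39) - phi) < 1e-6
```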

5.3 Secondary recurrences<br />

Consider the recurrence T(n) = 2T(n/3 − 1) + n with the base case T(1) = 1. We already know how to use domain transformations to get the tight asymptotic bound T(n) = Θ(n), but how would we obtain an exact solution?

First we need to figure out how the parameter n changes as we get deeper and deeper into the recurrence. For this we use a secondary recurrence. We define a sequence n_i so that

    T(n_i) = 2T(n_{i−1}) + n_i,

so n_i is the argument of T() when we are i recursion steps away from the base case n_0 = 1. The original recurrence gives us the following secondary recurrence for n_i:

    n_{i−1} = n_i/3 − 1  ⟹  n_i = 3n_{i−1} + 3.


The annihilator for this recurrence is (E − 1)(E − 3), so the generic solution is n_i = α3^i + β. Plugging in the base cases n_0 = 1 and n_1 = 6, we get the exact solution

    n_i = (5/2) · 3^i − 3/2.

Notice that our original function T(n) is only well-defined if n = n_i for some integer i ≥ 0.

Now to solve the original recurrence, we do a range transformation. If we set t_i = T(n_i), we have the recurrence

    t_i = 2t_{i−1} + (5/2) · 3^i − 3/2,

which by now we can solve using the annihilator method. The annihilator of the recurrence is (E − 2)(E − 3)(E − 1), so the generic solution is α′3^i + β′2^i + γ′. Plugging in the base cases t_0 = 1, t_1 = 8, t_2 = 37, we get the exact solution

    t_i = (15/2) · 3^i − 8 · 2^i + 3/2.

Finally, we need to substitute to get a solution for the original recurrence in terms of n, by inverting the solution of the secondary recurrence. If n = n_i = (5/2) · 3^i − 3/2, then (after a little algebra) we have

    3^i = 2n/5 + 3/5,  so  i = log_3(2n/5 + 3/5).

Substituting this into the expression for t_i, we get our exact, closed-form solution:

    T(n) = (15/2) · 3^i − 8 · 2^i + 3/2
         = (15/2) · (2n/5 + 3/5) − 8 · 2^(log_3(2n/5 + 3/5)) + 3/2
         = (3n + 9/2) − 8 · (2n/5 + 3/5)^(log_3 2) + 3/2
         = 3n − 8 · (2n/5 + 3/5)^(log_3 2) + 6

Isn't that special? Now you know why we stick to asymptotic bounds for most recurrences.
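Closed forms this ugly are exactly the ones worth testing. The Python sketch below (names are mine) evaluates the recurrence at the valid arguments n_i and compares against the closed form; floating-point exponentiation makes the comparison approximate:

```python
from math import log

# T(n) = 2*T(n/3 - 1) + n with T(1) = 1, defined on n_i = (5/2)*3^i - 3/2.
def n_(i):
    return (5 * 3**i - 3) // 2     # an exact integer for every i >= 0

def T(n):
    return 1 if n == 1 else 2 * T(n // 3 - 1) + n   # n_i is divisible by 3 for i >= 1

def closed(n):
    # 3n - 8*(2n/5 + 3/5)^(log_3 2) + 6
    return 3 * n - 8 * (2 * n / 5 + 3 / 5) ** (log(2) / log(3)) + 6

assert [n_(i) for i in range(3)] == [1, 6, 21]
for i in range(10):
    n = n_(i)
    assert abs(closed(n) - T(n)) < 1e-6 * T(n)
```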

6 References<br />

Methods for solving recurrences by annihilators, domain transformations, and secondary recurrences<br />

are nicely outlined in G. Lueker, Some Techniques for Solving Recurrences, ACM Computing<br />

Surveys 12(4):419-436, 1980. The master theorem is presented in sections 4.3 and 4.4 of CLR.<br />

Sections 1–3 and 5 of this handout were written by Ed Reingold and Ari Trachtenberg and<br />

substantially revised by Jeff Erickson. Section 4 is entirely Jeff’s fault.<br />



CS 373 Lecture 0: Introduction Fall 2002

0 Introduction (August 29)<br />

0.1 What is an algorithm?<br />

Partly because of his computational skills, Gerbert, in his later years, was made Pope by<br />

Otto the Great, Holy Roman Emperor, and took the name Sylvester II. By this time, his gift<br />

in the art of calculating contributed to the belief, commonly held throughout Europe, that<br />

he had sold his soul to the devil.<br />

— Dominic Olivastro, Ancient Puzzles, 1993<br />

This is a course about algorithms, specifically combinatorial algorithms. An algorithm is a set of<br />

simple, unambiguous, step-by-step instructions for accomplishing a specific task. Note that the<br />

word “computer” doesn’t appear anywhere in this definition; algorithms don’t necessarily have<br />

anything to do with computers! For example, here is an algorithm for singing that annoying song<br />

‘99 Bottles of Beer on the Wall’ for arbitrary values of 99:<br />

BottlesOfBeer(n):
    For i ← n down to 1
        Sing "i bottles of beer on the wall, i bottles of beer,"
        Sing "Take one down, pass it around, i − 1 bottles of beer on the wall."
    Sing "No bottles of beer on the wall, no bottles of beer,"
    Sing "Go to the store, buy some more, n bottles of beer on the wall."

Algorithms have been with us since the dawn of civilization. Here is an algorithm, popularized

(but almost certainly not discovered) by Euclid about 2500 years ago, for multiplying or dividing<br />

numbers using a ruler and compass. The Greek geometers represented numbers using line segments<br />

of the right length. In the pseudo-code below, Circle(p, q) represents the circle centered at a point<br />

p and passing through another point q; hopefully the other instructions are obvious.<br />

〈〈Construct the line perpendicular to ℓ and passing through P.〉〉
RightAngle(ℓ, P):
    Choose a point A ∈ ℓ
    A, B ← Intersect(Circle(P, A), ℓ)
    C, D ← Intersect(Circle(A, B), Circle(B, A))
    return Line(C, D)

〈〈Construct a point Z such that |AZ| = |AC||AD|/|AB|.〉〉
MultiplyOrDivide(A, B, C, D):
    α ← RightAngle(Line(A, C), A)
    E ← Intersect(Circle(A, B), α)
    F ← Intersect(Circle(A, D), α)
    β ← RightAngle(Line(E, C), F)
    γ ← RightAngle(β, F)
    return Intersect(γ, Line(A, C))

Multiplying or dividing using a compass and straight-edge.<br />




This algorithm breaks down the difficult task of multiplication into simple primitive steps:<br />

drawing a line between two points, drawing a circle with a given center and boundary point, and so<br />

on. The primitive steps need not be quite this primitive, but each primitive step must be something<br />

that the person or machine executing the algorithm already knows how to do. Notice in this example<br />

that we have made constructing a right angle a primitive operation in the MultiplyOrDivide<br />

algorithm by writing a subroutine.<br />

As a bad example, consider “Martin’s algorithm”: 1<br />

BecomeAMillionaireAndNeverPayTaxes:
    Get a million dollars.
    Don't pay taxes.
    If you get caught,
        Say "I forgot."

Pretty simple, except for that first step; it’s a doozy. A group of billionaire CEOs would consider<br />

this an algorithm, since for them the first step is both unambiguous and trivial. But for the rest<br />

of us poor slobs who don’t have a million dollars handy, Martin’s procedure is too vague to be<br />

considered an algorithm. [On the other hand, this is a perfect example of a reduction—it reduces<br />

the problem of being a millionaire and never paying taxes to the ‘easier’ problem of acquiring a<br />

million dollars. We’ll see reductions over and over again in this class. As hundreds of businessmen<br />

and politicians have demonstrated, if you know how to solve the easier problem, a reduction tells<br />

you how to solve the harder one.]<br />

Although most of the previous examples are algorithms, they’re not the kind of algorithms<br />

that computer scientists are used to thinking about. In this class, we’ll focus (almost!) exclusively<br />

on algorithms that can be reasonably implemented on a computer. In other words, each step in<br />

the algorithm must be something that either is directly supported by your favorite programming<br />

language (arithmetic, assignments, loops, recursion, etc.) or is something that you’ve already<br />

learned how to do in an earlier class (sorting, binary search, depth first search, etc.).<br />

For example, here’s the algorithm that’s actually used to determine the number of congressional<br />

representatives assigned to each state. 2 The input array P [1 .. n] stores the populations of the n<br />

states, and R is the total number of representatives. (Currently, n = 50 and R = 435.)<br />

ApportionCongress(P[1 .. n], R):
    H ← NewMaxHeap
    for i ← 1 to n
        r[i] ← 1
        Insert(H, i, P[i]/√2)
    R ← R − n
    while R > 0
        s ← ExtractMax(H)
        r[s] ← r[s] + 1
        Insert(H, s, P[s]/√(r[s](r[s] + 1)))
        R ← R − 1
    return r[1 .. n]
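For the curious, here is one way the same procedure might look in Python, using the standard heapq module (a min-heap, so priorities are negated). This is an illustrative sketch, not the Census Bureau's code:

```python
import heapq
from math import sqrt

# Huntington-Hill style apportionment, as in ApportionCongress: every state
# gets one seat, then each remaining seat goes to the state with the largest
# priority P[s] / sqrt(r[s] * (r[s] + 1)).
def apportion(populations, R):
    n = len(populations)
    reps = [1] * n
    heap = [(-p / sqrt(2), s) for s, p in enumerate(populations)]
    heapq.heapify(heap)
    for _ in range(R - n):
        _, s = heapq.heappop(heap)           # state with the largest priority
        reps[s] += 1
        heapq.heappush(heap, (-populations[s] / sqrt(reps[s] * (reps[s] + 1)), s))
    return reps

# Equal populations split evenly, and all R seats are always handed out.
assert apportion([100, 100, 100], 6) == [2, 2, 2]
assert sum(apportion([7000, 2500, 500], 10)) == 10
```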

1 S. Martin, "You Can Be A Millionaire", Saturday Night Live, January 21, 1978. Reprinted in Comedy Is Not Pretty, Warner Bros. Records, 1979.
2 The congressional apportionment algorithm is described in detail at http://www.census.gov/population/www/censusdata/apportionment/computing.html, and some earlier algorithms are described at http://www.census.gov/population/www/censusdata/apportionment/history.html.


Note that this description assumes that you know how to implement a max-heap and its basic operations<br />

NewMaxHeap, Insert, and ExtractMax. Moreover, the correctness of the algorithm<br />

doesn’t depend at all on how these operations are implemented. 3 The Census Bureau implements<br />

the max-heap as an unsorted array, probably inside an Excel spreadsheet. (You should have learned<br />

a more efficient solution in CS 225.)

So what’s a combinatorial algorithm? The distinction is fairly artificial, but basically, this means<br />

something distinct from a numerical algorithm. Numerical algorithms are used to approximate<br />

computation with ideal real numbers on finite precision computers. For example, here’s a numerical<br />

algorithm to compute the square root of a number to a given precision. (This algorithm works<br />

remarkably quickly—every iteration doubles the number of correct digits.)<br />

SquareRoot(x, ε):
    s ← 1
    while |s − x/s| > ε
        s ← (s + x/s)/2
    return s
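A direct Python transcription of SquareRoot, for experimentation; the tolerance ε is a free parameter:

```python
# Newton's iteration for square roots, as in SquareRoot(x, eps):
# each step roughly doubles the number of correct digits.
def square_root(x, eps):
    s = 1.0
    while abs(s - x / s) > eps:
        s = (s + x / s) / 2
    return s

r = square_root(2.0, 1e-12)
assert abs(r * r - 2.0) < 1e-9
```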

The output of a numerical algorithm is necessarily an approximation to some ideal mathematical<br />

object. Any number that's close enough to the ideal answer is a correct answer. Combinatorial

algorithms, on the other hand, manipulate discrete objects like arrays, lists, trees, and graphs that<br />

can be represented exactly on a digital computer.<br />

0.2 Writing down algorithms<br />

Algorithms are not programs; they should never be specified in a particular programming language.

The whole point of this course is to develop computational techniques that can be used in any programming<br />

language. 4 The idiosyncratic syntactic details of C, Java, Visual Basic, ML, Smalltalk,<br />

Intercal, or Brainfunk 5 are of absolutely no importance in algorithm design, and focusing on them<br />

will only distract you from what’s really going on. What we really want is closer to what you’d<br />

write in the comments of a real program than the code itself.<br />

On the other hand, a plain English prose description is usually not a good idea either. Algorithms

have a lot of structure—especially conditionals, loops, and recursion—that are far too easily<br />

hidden by unstructured prose. Like any language spoken by humans, English is full of ambiguities,<br />

subtleties, and shades of meaning, but algorithms must be described as accurately as possible.<br />

Finally and more seriously, many people have a tendency to describe loops informally: “Do this<br />

first, then do this second, and so on.” As anyone who has taken one of those ‘what comes next in<br />

this sequence?’ tests already knows, specifying what happens in the first couple of iterations of a<br />

loop doesn’t say much about what happens later on. 6 Phrases like ‘and so on’ or ‘do X over and<br />

over’ or ‘et cetera’ are a good indication that the algorithm should have been described in terms of<br />

3 However, as several students pointed out, the algorithm's correctness does depend on the populations of the states being not too different. If Congress had only 384 representatives, then the 2000 Census data would have given Rhode Island only one representative! If state populations continue to change at their current rates, the United States will have a serious constitutional crisis right after the 2030 Census. I can't wait!

4 See http://www.ionet.net/˜timtroyr/funhouse/beer.html for implementations of the BottlesOfBeer algorithm in over 200 different programming languages.
5 Pardon my thinly bowdlerized Anglo-Saxon. Brainfork is the well-deserved name of a programming language invented by Urban Mueller in 1993. Brainflick programs are written entirely using the punctuation characters <>+-,.[], each representing a different operation (roughly: shift left, shift right, increment, decrement, input, output, begin loop, end loop). See http://www.catseye.mb.ca/esoteric/bf/ for a complete definition of Brainfooey, sample Brainfudge programs, a Brainfiretruck interpreter (written in just 230 characters of C), and related shit.
6 See http://www.research.att.com/˜njas/sequences/.


loops or recursion, and the description should have specified what happens in a generic iteration<br />

of the loop. 7<br />

The best way to write down an algorithm is using pseudocode. Pseudocode uses the structure<br />

of formal programming languages and mathematics to break the algorithm into one-sentence steps,<br />

but those sentences can be written using mathematics, pure English, or some mixture of the two.<br />

Exactly how to structure the pseudocode is a personal choice, but here are the basic rules I follow:<br />

• Use standard imperative programming keywords (if/then/else, while, for, repeat/until, case,<br />

return) and notation (array[index], pointer→field, function(args), etc.)<br />

• The block structure should be visible from across the room. Indent everything carefully and<br />

consistently. Don’t use syntactic sugar (like C/C++/Java braces or Pascal/Algol begin/end<br />

tags) unless the pseudocode is absolutely unreadable without it.<br />

• Don’t typeset keywords in a different font or style. Changing type style emphasizes the<br />

keywords, making the reader think the syntactic sugar is actually important—it isn’t!<br />

• Each statement should fit on one line, and each line should contain only one statement. (The only exception is extremely short and similar statements like i ← i + 1; j ← j − 1; k ← 0.)

• Put each structuring statement (for, while, if) on its own line. The order of nested loops<br />

matters a great deal; make it absolutely obvious.<br />

• Use short but mnemonic algorithm and variable names. Absolutely never use pronouns!<br />

A good description of an algorithm reveals the internal structure, hides irrelevant details, and<br />

can be implemented easily by any competent programmer in any programming language, even if<br />

they don’t understand why the algorithm works. Good pseudocode, like good code, makes the<br />

algorithm much easier to understand and analyze; it also makes mistakes much easier to spot. The<br />

algorithm descriptions in the textbooks and lecture notes are good examples of what we want to<br />

see on your homeworks and exams.

0.3 Analyzing algorithms<br />

It’s not enough just to write down an algorithm and say ‘Behold!’ We also need to convince<br />

ourselves (and our graders) that the algorithm does what it’s supposed to do, and that it does it<br />

quickly.<br />

Correctness: In the real world, it is often acceptable for programs to behave correctly most of<br />

the time, on all 'reasonable' inputs. Not in this class; our standards are higher. 8 We need to prove

that our algorithms are correct on all possible inputs. Sometimes this is fairly obvious, especially<br />

for algorithms you’ve seen in earlier courses. But many of the algorithms we will discuss in this<br />

course will require some extra work to prove. Correctness proofs almost always involve induction.<br />

We like induction. Induction is our friend. 9<br />

Before we can formally prove that our algorithm does what we want it to, we have to formally<br />

state what we want the algorithm to do! Usually problems are given to us in real-world terms,<br />

7 Similarly, the appearance of the phrase 'and so on' in a proof is a good indication that the proof should have been done by induction!
8 or at least different
9 If induction is not your friend, you will have a hard time in this course.


not with formal mathematical descriptions. It’s up to us, the algorithm designer, to restate the<br />

problem in terms of mathematical objects that we can prove things about: numbers, arrays, lists,<br />

graphs, trees, and so on. We also need to determine if the problem statement makes any hidden<br />

assumptions, and state those assumptions explicitly. (For example, in the song “n Bottles of Beer<br />

on the Wall”, n is always a positive integer.) Restating the problem formally is not only required<br />

for proofs; it is also one of the best ways to really understand what the problem is asking for. The

hardest part of solving a problem is figuring out the right way to ask the question!<br />

An important distinction to keep in mind is the distinction between a problem and an algorithm.<br />

A problem is a task to perform, like “Compute the square root of x” or “Sort these n numbers”<br />

or “Keep n algorithms students awake for t minutes”. An algorithm is a set of instructions that<br />

you follow if you want to execute this task. The same problem may have hundreds of different<br />

algorithms.<br />

Running time: The usual way of distinguishing between different algorithms for the same problem<br />

is by how fast they run. Ideally, we want the fastest possible algorithm for our problem. In<br />

the real world, it is often acceptable for programs to run efficiently most of the time, on all ‘reasonable’<br />

inputs. Not in this class; our standards are different. We require algorithms that always<br />

run efficiently, even in the worst case.<br />

But how do we measure running time? As a specific example, how long does it take to sing the<br />

song BottlesOfBeer(n)? This is obviously a function of the input value n, but it also depends<br />

on how quickly you can sing. Some singers might take ten seconds to sing a verse; others might<br />

take twenty. Technology widens the possibilities even further. Dictating the song over a telegraph<br />

using Morse code might take a full minute per verse. Ripping an mp3 over the Web might take a<br />

tenth of a second per verse. Duplicating the mp3 in a computer’s main memory might take only a<br />

few microseconds per verse.<br />

Nevertheless, what’s important here is how the singing time changes as n grows. Singing<br />

BottlesOfBeer(2n) takes about twice as long as singing BottlesOfBeer(n), no matter what<br />

technology is being used. This is reflected in the asymptotic singing time Θ(n). We can measure<br />

time by counting how many times the algorithm executes a certain instruction or reaches a certain<br />

milestone in the ‘code’. For example, we might notice that the word ‘beer’ is sung three times in<br />

every verse of BottlesOfBeer, so the number of times you sing ‘beer’ is a good indication of<br />

the total singing time. For this question, we can give an exact answer: BottlesOfBeer(n) uses<br />

exactly 3n + 3 beers.<br />

There are plenty of other songs that have non-trivial singing time. This one is probably familiar<br />

to most English-speakers:<br />

NDaysOfChristmas(gifts[2 .. n]):<br />

for i ← 1 to n<br />

Sing “On the ith day of Christmas, my true love gave to me”<br />

for j ← i down to 2<br />

Sing “j gifts[j]”<br />

if i > 1<br />

Sing “and”<br />

Sing “a partridge in a pear tree.”<br />

The input to NDaysOfChristmas is a list of n − 1 gifts. It’s quite easy to show that the<br />

singing time is Θ(n²); in particular, the singer mentions the name of a gift ∑_{i=1}^{n} i = n(n + 1)/2<br />

times (counting the partridge in the pear tree). It’s also easy to see that during the first n days<br />

of Christmas, my true love gave to me exactly ∑_{i=1}^{n} ∑_{j=1}^{i} j = n(n + 1)(n + 2)/6 = Θ(n³) gifts.<br />
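These counts are easy to check mechanically. Here is a small Python sketch (a hypothetical transcription—the song is tallied rather than sung, and the function name is mine) that verifies both formulas:<br />

```python
def n_days_of_christmas_counts(n):
    """Count lyric events in NDaysOfChristmas without actually singing.

    Returns (mentions, delivered): how many times a gift is named,
    and how many gifts my true love gave to me over the first n days."""
    mentions = 0
    delivered = 0
    for i in range(1, n + 1):
        mentions += i                   # day i names gifts i, i-1, ..., 2, plus the partridge
        delivered += i * (i + 1) // 2   # day i delivers 1 + 2 + ... + i gifts
    return mentions, delivered

n = 12
mentions, delivered = n_days_of_christmas_counts(n)
assert mentions == n * (n + 1) // 2             # Θ(n²) gift mentions
assert delivered == n * (n + 1) * (n + 2) // 6  # Θ(n³) gifts delivered
```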



CS 373 Lecture 0: Introduction Fall 2002<br />

Other songs that take quadratic time to sing are “Old MacDonald”, “There Was an Old Lady Who<br />

Swallowed a Fly”, “Green Grow the Rushes O”, “The Barley Mow” (which we’ll see in Homework 1),<br />

“Echad Mi Yode’a” (“Who knows one?”), “Alouette”, “Ist das nicht ein Schnitzelbank?”, 10 etc.<br />

For details, consult your nearest pre-schooler.<br />

For a slightly more complicated example, consider the algorithm ApportionCongress. Here<br />

the running time obviously depends on the implementation of the max-heap operations, but we<br />

can certainly bound the running time as O(N + RI + (R − n)E), where N is the time for a<br />

NewMaxHeap, I is the time for an Insert, and E is the time for an ExtractMax. Under the<br />

reasonable assumption that R > 2n (on average, each state gets at least two representatives), this<br />

simplifies to O(N + R(I + E)). The Census Bureau uses an unsorted array of size n, for which<br />

N = I = Θ(1) (since we know a priori how big the array is), and E = Θ(n), so the overall running<br />

time is Θ(Rn). This is fine for the federal government, but if we want to be more efficient, we can<br />

implement the heap as a perfectly balanced n-node binary tree (or a heap-ordered array). In this<br />

case, we have N = Θ(1) and I = E = O(log n), so the overall running time is O(R log n).<br />

Incidentally, there is a faster algorithm for apportioning Congress. I’ll give extra credit to the<br />

first student who can find the faster algorithm, analyze its running time, and prove that it always<br />

gives exactly the same results as ApportionCongress.<br />

Sometimes we are also interested in other computational resources: space, disk swaps, concurrency,<br />

and so forth. We use the same techniques to analyze those resources as we use for running<br />

time.<br />

0.4 Why are we here, anyway?<br />

We will try to teach you two things in CS 373: how to think about algorithms and how to talk<br />

about algorithms. Along the way, you’ll pick up a bunch of algorithmic facts—mergesort runs in<br />

Θ(n log n) time; the amortized time to search in a splay tree is O(log n); the traveling salesman<br />

problem is NP-hard—but these aren’t the point of the course. You can always look up facts in a<br />

textbook, provided you have the intuition to know what to look for. That’s why we let you bring<br />

cheat sheets to the exams; we don’t want you wasting your study time trying to memorize all the<br />

facts you’ve seen. You’ll also practice a lot of algorithm design and analysis skills—finding useful<br />

(counter)examples, developing induction proofs, solving recurrences, using big-Oh notation, using<br />

probability, and so on. These skills are useful, but they aren’t really the point of the course either.<br />

At this point in your educational career, you should be able to pick up those skills on your own,<br />

once you know what you’re trying to do.<br />

The main goal of this course is to help you develop algorithmic intuition. How do various<br />

algorithms really work? When you see a problem for the first time, how should you attack it?<br />

How do you tell which techniques will work at all, and which ones will work best? How do you<br />

judge whether one algorithm is better than another? How do you tell if you have the best possible<br />

solution?<br />

The flip side of this goal is developing algorithmic language. It’s not enough just to understand<br />

how to solve a problem; you also have to be able to explain your solution to somebody else. I don’t<br />

mean just how to turn your algorithms into code—despite what many students (and inexperienced<br />

programmers) think, ‘somebody else’ is not just a computer. Nobody programs alone. Perhaps<br />

more importantly in the short term, explaining something to somebody else is one of the best ways<br />

of clarifying your own understanding.<br />

10 Wakko: Ist das nicht Otto von Schnitzelpusskrankengescheitmeyer?<br />

Yakko and Dot: Ja, das ist Otto von Schnitzelpusskrankengescheitmeyer!!<br />


You’ll also get a chance to develop brand new algorithms and algorithmic techniques on your<br />

own. Unfortunately, this is not the sort of thing that we can really teach you. All we can really do<br />

is lay out the tools, encourage you to practice with them, and give you feedback.<br />

Good algorithms are extremely useful, but they can also be elegant, surprising, deep, even<br />

beautiful. But most importantly, algorithms are fun!! Hopefully this class will inspire at least a<br />

few of you to come play!<br />



CS 373 Lecture 1: Divide and Conquer Fall 2002<br />

1 Divide and Conquer (September 3)<br />

1.1 MergeSort<br />

The control of a large force is the same principle as the control of a few men:<br />

it is merely a question of dividing up their numbers.<br />

— Sun Zi, The Art of War (c. 400 C.E.), translated by Lionel Giles (1910)<br />

Mergesort is one of the earliest algorithms proposed for sorting. According to Knuth, it was<br />

suggested by John von Neumann as early as 1945.<br />

1. Divide the array A[1 .. n] into two subarrays A[1 .. m] and A[m + 1 .. n], where m = ⌊n/2⌋.<br />

2. Recursively mergesort the subarrays A[1 .. m] and A[m + 1 .. n].<br />

3. Merge the newly-sorted subarrays A[1 .. m] and A[m + 1 .. n] into a single sorted list.<br />

Input: S O R T I N G E X A M P L<br />

Divide: S O R T I N G E X A M P L<br />

Recurse: I N O R S T A E G L M P X<br />

Merge: A E G I L M N O P R S T X<br />

A Mergesort example.<br />

The first step is completely trivial; we only need to compute the median index m. The second<br />

step is also trivial, thanks to our friend the recursion fairy. All the real work is done in the final<br />

step; the two sorted subarrays A[1 .. m] and A[m + 1 .. n] can be merged using a simple linear-time<br />

algorithm. Here’s a complete specification of the Mergesort algorithm; for simplicity, we separate<br />

out the merge step as a subroutine.<br />

MergeSort(A[1 .. n]):<br />

if (n > 1)<br />

m ← ⌊n/2⌋<br />

MergeSort(A[1 .. m])<br />

MergeSort(A[m + 1 .. n])<br />

Merge(A[1 .. n], m)<br />

Merge(A[1 .. n], m):<br />

i ← 1; j ← m + 1<br />

for k ← 1 to n<br />

if j > n<br />

B[k] ← A[i]; i ← i + 1<br />

else if i > m<br />

B[k] ← A[j]; j ← j + 1<br />

else if A[i] < A[j]<br />

B[k] ← A[i]; i ← i + 1<br />

else<br />

B[k] ← A[j]; j ← j + 1<br />

for k ← 1 to n<br />

A[k] ← B[k]<br />
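For concreteness, here is one possible Python transcription of the two routines above (a sketch using 0-indexed lists and a single function; the name is mine, not part of the notes):<br />

```python
def merge_sort(a):
    """MergeSort: split the list in half, recurse on both halves, then merge."""
    n = len(a)
    if n <= 1:
        return list(a)
    m = n // 2                      # m = floor(n/2)
    left = merge_sort(a[:m])
    right = merge_sort(a[m:])
    # Merge: repeatedly move the smaller front element into the output.
    b, i, j = [], 0, 0
    for _ in range(n):
        if j >= len(right):                           # right subarray exhausted
            b.append(left[i]); i += 1
        elif i >= len(left) or right[j] < left[i]:    # left exhausted, or right is smaller
            b.append(right[j]); j += 1
        else:
            b.append(left[i]); i += 1
    return b
```

Sorting the example input, `merge_sort(list("SORTINGEXAMPL"))` yields the letters A E G I L M N O P R S T X in order.<br />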

To prove that the algorithm is correct, we use our old friend induction. We can prove that<br />

Merge is correct using induction on the total size of the two subarrays A[i .. m] and A[j .. n] left to<br />

be merged into B[k .. n]. The base case, where at least one subarray is empty, is straightforward; the<br />

algorithm just copies it into B. Otherwise, the smallest remaining element is either A[i] or A[j],<br />

since both subarrays are sorted, so B[k] is assigned correctly. The remaining subarrays—either<br />


A[i + 1 .. m] and A[j .. n], or A[i .. m] and A[j + 1 .. n]—are merged correctly into B[k + 1 .. n] by the<br />

inductive hypothesis. 1 This completes the proof.<br />

Now we can prove MergeSort correct by another round of straightforward induction. 2 The<br />

base cases n ≤ 1 are trivial. Otherwise, by the inductive hypothesis, the two smaller subarrays<br />

A[1 .. m] and A[m+1 .. n] are sorted correctly, and by our earlier argument, merged into the correct<br />

sorted output.<br />

What’s the running time? Since we have a recursive algorithm, we’re going to get a recurrence<br />

of some sort. Merge clearly takes linear time, since it’s a simple for-loop with constant work per<br />

iteration. We get the following recurrence for MergeSort:<br />

T (1) = O(1), T (n) = T (⌈n/2⌉) + T (⌊n/2⌋) + O(n).<br />

1.2 Aside: Domain Transformations<br />

Except for the floor and ceiling, this recurrence falls into case (b) of the Master Theorem [CLR,<br />

§4.3]. If we simply ignore the floor and ceiling, the Master Theorem suggests the solution T (n) =<br />

O(n log n). We can easily check that this answer is correct using induction, but there is a simple<br />

method for solving recurrences like this directly, called domain transformation.<br />

First we overestimate the time bound, once by pretending that the two subproblem sizes are<br />

equal, and again to eliminate the ceiling:<br />

T (n) ≤ 2T (⌈n/2⌉) + O(n) ≤ 2T (n/2 + 1) + O(n).<br />

Now we define a new function S(n) = T (n + α), where α is a constant chosen so that S(n) satisfies<br />

the Master-ready recurrence S(n) ≤ 2S(n/2) + O(n). To figure out the appropriate value for α, we<br />

compare two versions of the recurrence for T (n + α):<br />

S(n) ≤ 2S(n/2) + O(n) =⇒ T (n + α) ≤ 2T (n/2 + α) + O(n)<br />

T (n) ≤ 2T (n/2 + 1) + O(n) =⇒ T (n + α) ≤ 2T ((n + α)/2 + 1) + O(n + α)<br />

For these two recurrences to be equal, we need n/2 + α = (n + α)/2 + 1, which implies that α = 2.<br />

The Master Theorem tells us that S(n) = O(n log n), so<br />

T (n) = S(n − 2) = O((n − 2) log(n − 2)) = O(n log n).<br />

We can use domain transformations to remove floors, ceilings, and lower order terms from any<br />

recurrence. But now that we know this, we won’t bother actually grinding through the details!<br />

1.3 QuickSort<br />

Quicksort was discovered by Tony Hoare in 1962. In this algorithm, the hard work is splitting the<br />

array into subsets so that merging the final result is trivial.<br />

1. Choose a pivot element from the array.<br />

1 “The inductive hypothesis” is just a technical nickname for our friend the recursion fairy.<br />

2 Many textbooks draw an artificial distinction between several different flavors of induction: standard/weak (‘the<br />

principle of mathematical induction’), strong (‘the second principle of mathematical induction’), complex, structural,<br />

transfinite, decaffeinated, etc. Those textbooks would call this proof “strong” induction. I don’t. All induction<br />

proofs have precisely the same structure: Pick an arbitrary object, make one or more simpler objects from it, apply<br />

the inductive hypothesis to the simpler object(s), infer the required property for the original object, and check the<br />

base cases. Induction is just recursion for proofs.<br />


2. Split the array into three subarrays containing the items less than the pivot, the pivot itself,<br />

and the items bigger than the pivot.<br />

3. Recursively quicksort the first and last subarray.<br />

Input: S O R T I N G E X A M P L<br />

Choose a pivot: S O R T I N G E X A M P L (the pivot is M)<br />

Partition: A E G I L M N R X O S P T<br />

Recurse: A E G I L M N O P R S T X<br />

A Quicksort example.<br />

Here’s a more formal specification of the Quicksort algorithm. The separate Partition subroutine<br />

takes the original position of the pivot element as input and returns the post-partition pivot<br />

position as output.<br />

QuickSort(A[1 .. n]):<br />

if (n > 1)<br />

Choose a pivot element A[p]<br />

k ← Partition(A, p)<br />

QuickSort(A[1 .. k − 1])<br />

QuickSort(A[k + 1 .. n])<br />

Partition(A[1 .. n], p):<br />

if (p = n)<br />

swap A[p] ↔ A[n]<br />

i ← 0; j ← n<br />

while (i < j)<br />

repeat i ← i + 1 until (i = j or A[i] ≥ A[n])<br />

repeat j ← j − 1 until (i = j or A[j] ≤ A[n])<br />

if (i < j)<br />

swap A[i] ↔ A[j]<br />

if (i = n)<br />

swap A[i] ↔ A[n]<br />

return i<br />
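The three-step outline can be sketched in a few lines of Python (a deliberately simple version that allocates new lists, unlike the in-place Partition above; choosing the last element as the pivot is one of the simple heuristics analyzed below):<br />

```python
def quicksort(a):
    """Quicksort in the three-subarray style: items less than the pivot,
    the pivot itself, and everything else, with the outer parts sorted
    recursively."""
    if len(a) <= 1:
        return list(a)
    pivot = a[-1]                                   # simple (worst-case-prone) pivot choice
    less = [x for x in a[:-1] if x < pivot]
    rest = [x for x in a[:-1] if x >= pivot]
    return quicksort(less) + [pivot] + quicksort(rest)
```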

Just as we did for mergesort, we need two induction proofs to show that QuickSort is correct—<br />

weak induction to prove that Partition correctly partitions the array, and then straightforward<br />

strong induction to prove that QuickSort correctly sorts assuming Partition is correct. I’ll leave<br />

the gory details as an exercise for the reader.<br />

The analysis is also similar to mergesort. Partition runs in O(n) time: j − i = n at the<br />

beginning, j − i = 0 at the end, and we do a constant amount of work each time we increment i<br />

or decrement j. For QuickSort, we get a recurrence that depends on k, the rank of the chosen<br />

pivot:<br />

T (n) = T (k − 1) + T (n − k) + O(n)<br />

If we could choose the pivot to be the median element of the array A, we would have k = ⌈n/2⌉,<br />

the two subproblems would be as close to the same size as possible, the recurrence would become<br />

T (n) = T (⌈n/2⌉ − 1) + T (⌊n/2⌋) + O(n) ≤ 2T (n/2) + O(n),<br />

and we’d have T (n) = O(n log n) by the Master Theorem.<br />

Unfortunately, although it is theoretically possible to locate the median of an unsorted array in<br />

linear time, the algorithm is incredibly complicated, and the hidden constant in the O() notation<br />

is quite large. So in practice, programmers settle for something simple, like choosing the first or<br />

last element of the array. In this case, k can be anything from 1 to n, so we have<br />

T (n) = max_{1≤k≤n} ( T (k − 1) + T (n − k) + O(n) )<br />


In the worst case, the two subproblems are completely unbalanced—either k = 1 or k = n—and<br />

the recurrence becomes T (n) ≤ T (n − 1) + O(n). The solution is T (n) = O(n²). Another common<br />

heuristic is ‘median of three’—choose three elements (usually at the beginning, middle, and end<br />

of the array), and take the middle one as the pivot. Although this is better in practice than just<br />

choosing one element, we can still have k = 2 or k = n − 1 in the worst case. With the median-of-three<br />

heuristic, the recurrence becomes T (n) ≤ T (1) + T (n − 2) + O(n), whose solution is still<br />

T (n) = O(n²).<br />

Intuitively, the pivot element will ‘usually’ fall somewhere in the middle of the array, say between<br />

n/10 and 9n/10. This suggests that the average-case running time is O(n log n). Although this<br />

intuition is correct, we are still far from a proof that quicksort is usually efficient. I’ll formalize<br />

this intuition about average cases in a later lecture.<br />

1.4 The Pattern<br />

Both mergesort and quicksort follow the same general three-step pattern of all divide and<br />

conquer algorithms:<br />

1. Split the problem into several smaller independent subproblems.<br />

2. Recurse to get a subsolution for each subproblem.<br />

3. Merge the subsolutions together into the final solution.<br />

If the size of any subproblem falls below some constant threshold, the recursion bottoms out.<br />

Hopefully, at that point, the problem is trivial, but if not, we switch to a different algorithm<br />

instead.<br />

Proving a divide-and-conquer algorithm correct usually involves strong induction. Analyzing<br />

the running time requires setting up and solving a recurrence, which often (but unfortunately not<br />

always!) can be solved using the Master Theorem, perhaps after a simple domain transformation.<br />

1.5 Multiplication<br />

Adding two n-digit numbers takes O(n) time by the standard iterative ‘ripple-carry’ algorithm,<br />

using a lookup table for each one-digit addition. Similarly, multiplying an n-digit number by a<br />

one-digit number takes O(n) time, using essentially the same algorithm.<br />

What about multiplying two n-digit numbers? At least in the United States, every grade school<br />

student (supposedly) learns to multiply by breaking the problem into n one-digit multiplications<br />

and n additions:<br />

31415962<br />

× 27182818<br />

251327696<br />

31415962<br />

251327696<br />

62831924<br />

251327696<br />

31415962<br />

219911734<br />

62831924<br />

853974377340916<br />


We could easily formalize this algorithm as a pair of nested for-loops. The algorithm runs in<br />

O(n²) time—altogether, there are O(n²) digits in the partial products, and for each digit, we spend<br />

constant time.<br />

We can do better by exploiting the following algebraic formula:<br />

(10^m a + b)(10^m c + d) = 10^{2m} ac + 10^m (bc + ad) + bd<br />

Here is a divide-and-conquer algorithm that computes the product of two n-digit numbers x and y,<br />

based on this formula. Each of the four sub-products e, f, g, h is computed recursively. The last<br />

line does not involve any multiplications, however; to multiply by a power of ten, we just shift the<br />

digits and fill in the right number of zeros.<br />

Multiply(x, y, n):<br />

if n = 1<br />

return x · y<br />

else<br />

m ← ⌈n/2⌉<br />

a ← ⌊x/10^m⌋; b ← x mod 10^m<br />

c ← ⌊y/10^m⌋; d ← y mod 10^m<br />

e ← Multiply(a, c, m)<br />

f ← Multiply(b, d, m)<br />

g ← Multiply(b, c, m)<br />

h ← Multiply(a, d, m)<br />

return 10^{2m} e + 10^m (g + h) + f<br />
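A direct Python rendering of this algorithm is below (a sketch; Python’s `divmod` conveniently computes both the quotient ⌊x/10^m⌋ and the remainder x mod 10^m in one call):<br />

```python
def multiply(x, y, n):
    """Multiply two n-digit numbers with four recursive sub-products, following
    the formula (10^m a + b)(10^m c + d) = 10^2m ac + 10^m (bc + ad) + bd."""
    if n == 1:
        return x * y
    m = (n + 1) // 2               # m = ceil(n/2)
    a, b = divmod(x, 10 ** m)      # x = 10^m * a + b
    c, d = divmod(y, 10 ** m)      # y = 10^m * c + d
    e = multiply(a, c, m)
    f = multiply(b, d, m)
    g = multiply(b, c, m)
    h = multiply(a, d, m)
    return 10 ** (2 * m) * e + 10 ** m * (g + h) + f
```

For example, `multiply(31415962, 27182818, 8)` reproduces the grade-school product 853974377340916 computed above.<br />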

You can easily prove by induction that this algorithm is correct. The running time for this algorithm<br />

is given by the recurrence<br />

T (n) = 4T (⌈n/2⌉) + O(n), T (1) = 1,<br />

which solves to T (n) = O(n 2 ) by the Master Theorem (after a simple domain transformation).<br />

Hmm. . . I guess this didn’t help after all.<br />

But there’s a trick, first suggested by Anatoliĭ Karatsuba in 1962. We can compute the middle<br />

coefficient bc + ad using only one recursive multiplication, by exploiting yet another bit of algebra:<br />

ac + bd − (a − b)(c − d) = bc + ad<br />

This trick lets use replace the last three lines in the previous algorithm as follows:<br />

FastMultiply(x, y, n):<br />

if n = 1<br />

return x · y<br />

else<br />

m ← ⌈n/2⌉<br />

a ← ⌊x/10^m⌋; b ← x mod 10^m<br />

c ← ⌊y/10^m⌋; d ← y mod 10^m<br />

e ← FastMultiply(a, c, m)<br />

f ← FastMultiply(b, d, m)<br />

g ← FastMultiply(a − b, c − d, m)<br />

return 10^{2m} e + 10^m (e + f − g) + f<br />
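The same Python sketch with Karatsuba’s substitution looks like this (note that a − b or c − d may be negative, which Python’s arbitrary-precision integers handle transparently):<br />

```python
def fast_multiply(x, y, n):
    """Karatsuba multiplication: three recursive sub-products instead of four,
    using the identity ac + bd - (a - b)(c - d) = bc + ad."""
    if n == 1:
        return x * y
    m = (n + 1) // 2                      # m = ceil(n/2)
    a, b = divmod(x, 10 ** m)             # x = 10^m * a + b
    c, d = divmod(y, 10 ** m)             # y = 10^m * c + d
    e = fast_multiply(a, c, m)
    f = fast_multiply(b, d, m)
    g = fast_multiply(a - b, c - d, m)    # may be negative; the identity still holds
    return 10 ** (2 * m) * e + 10 ** m * (e + f - g) + f
```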


The running time of Karatsuba’s FastMultiply algorithm is given by the recurrence<br />

T (n) ≤ 3T (⌈n/2⌉) + O(n), T (1) = 1.<br />

After a domain transformation, we can plug this into the Master Theorem to get the solution<br />

T (n) = O(n^{lg 3}) = O(n^{1.585}), a significant improvement over our earlier quadratic-time algorithm. 3<br />

Of course, in practice, all this is done in binary instead of decimal.<br />

We can take this idea even further, splitting the numbers into more pieces and combining them<br />

in more complicated ways, to get even faster multiplication algorithms. Ultimately, this idea leads<br />

to the development of the Fast Fourier transform, a complicated divide-and-conquer algorithm that<br />

can be used to multiply two n-digit numbers in O(n log n) time. 4 We’ll talk about Fast Fourier<br />

transforms later in the semester.<br />

1.6 Exponentiation<br />

Given a number a and a positive integer n, suppose we want to compute a^n. The standard naïve<br />

method is a simple for-loop that does n − 1 multiplications by a:<br />

SlowPower(a, n):<br />

x ← a<br />

for i ← 2 to n<br />

x ← x · a<br />

return x<br />

This iterative algorithm requires n multiplications.<br />

Notice that the input a could be an integer, or a rational, or a floating point number. In fact,<br />

it doesn’t need to be a number at all, as long as it’s something that we know how to multiply. For<br />

example, the same algorithm can be used to compute powers modulo some finite number (an operation<br />

commonly used in cryptography algorithms) or to compute powers of matrices (an operation<br />

used to evaluate recurrences and to compute shortest paths in graphs). All that’s required is that<br />

a belong to a multiplicative group. 5 Since we don’t know what kind of things we’re multiplying,<br />

we can’t know how long a multiplication takes, so we’re forced to analyze the running time in terms<br />

of the number of multiplications.<br />

There is a much faster divide-and-conquer method, using the simple formula a^n = a^{⌊n/2⌋} · a^{⌈n/2⌉}.<br />

What makes this approach more efficient is that once we compute the first factor a^{⌊n/2⌋}, we can<br />

compute the second factor a^{⌈n/2⌉} using at most one more multiplication.<br />

3 Karatsuba actually proposed an algorithm based on the formula (a + b)(c + d) − ac − bd = bc + ad. This algorithm<br />

also runs in O(n^{lg 3}) time, but the actual recurrence is a bit messier: a − b and c − d are still m-digit numbers, but<br />

a + b and c + d might have m + 1 digits. The simplification presented here is due to Donald Knuth.<br />

4 This fast algorithm for multiplying integers using FFTs was discovered by Arnold Schönhage and Volker Strassen<br />

in 1971.<br />

5 A multiplicative group (G, ⊗) is a set G and a function ⊗ : G × G → G, satisfying three axioms:<br />

1. There is a unit element 1 ∈ G such that 1 ⊗ g = g ⊗ 1 = g for any element g ∈ G.<br />

2. Any element g ∈ G has an inverse element g^{−1} ∈ G such that g ⊗ g^{−1} = g^{−1} ⊗ g = 1.<br />

3. The function is associative: for any elements f, g, h ∈ G, we have f ⊗ (g ⊗ h) = (f ⊗ g) ⊗ h.<br />


FastPower(a, n):<br />

if n = 1<br />

return a<br />

else<br />

x ← FastPower(a, ⌊n/2⌋)<br />

if n is even<br />

return x · x<br />

else<br />

return x · x · a<br />
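As a Python sketch (assuming n ≥ 1, exactly as in the pseudocode):<br />

```python
def fast_power(a, n):
    """Compute a^n from a^(n//2): one recursive call,
    then one or two more multiplications."""
    if n == 1:
        return a
    x = fast_power(a, n // 2)      # x = a^floor(n/2)
    if n % 2 == 0:
        return x * x               # a^n = x * x when n is even
    return x * x * a               # one extra factor of a when n is odd
```

Because the only operation used is multiplication, the same function works verbatim on anything multipliable—matrices, or residues when combined with a modulus.<br />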

The total number of multiplications is given by the recurrence T (n) ≤ T (⌊n/2⌋) + 2, with the<br />

base case T (1) = 0. After a domain transformation, the Master Theorem gives us the solution<br />

T (n) = O(log n).<br />

Incidentally, this algorithm is asymptotically optimal—any algorithm for computing a^n must<br />

perform Ω(log n) multiplications. In fact, when n is a power of two, this algorithm is exactly<br />

optimal. However, there are slightly faster methods for other values of n. For example, our<br />

divide-and-conquer algorithm computes a^15 in six multiplications (a^15 = a^7 · a^7 · a; a^7 = a^3 · a^3 · a;<br />

a^3 = a · a · a), but only five multiplications are necessary (a → a^2 → a^3 → a^5 → a^10 → a^15).<br />

Nobody knows of an algorithm that always uses the minimum possible number of multiplications.<br />



CS 373 Lecture 2: Dynamic Programming Fall 2002<br />

Those who cannot remember the past are doomed to repeat it.<br />

— George Santayana, The Life of Reason, Book I:<br />

Introduction and Reason in Common Sense (1905)<br />

The “Dynamic-Tension®” bodybuilding program takes only 15 minutes a day<br />

in the privacy of your room.<br />

— Charles Atlas<br />

2 Dynamic Programming (September 5 and 10)<br />

2.1 Exponentiation (Again)<br />

Last time we saw a “divide and conquer” algorithm for computing the expression a^n, given two<br />

integers a and n as input: first compute a^{⌊n/2⌋}, then a^{⌈n/2⌉}, then multiply. If we computed both<br />

factors a^{⌊n/2⌋} and a^{⌈n/2⌉} recursively, the number of multiplications would be given by the recurrence<br />

T (1) = 0, T (n) = T (⌊n/2⌋) + T (⌈n/2⌉) + 1.<br />

The solution is T (n) = n − 1, so the naïve recursive algorithm uses exactly the same number of<br />

multiplications as the naïve iterative algorithm. 1<br />

In this case, it’s obvious how to speed up the algorithm. Once we’ve computed a^{⌊n/2⌋}, we<br />

don’t need to start over from scratch to compute a^{⌈n/2⌉}; we can do it in at most one more<br />

multiplication. This same simple idea—don’t solve the same subproblem more than once—<br />

can be applied to lots of recursive algorithms to speed them up, often (as in this case) by an<br />

exponential amount. The technical name for this technique is dynamic programming.<br />

2.2 Fibonacci Numbers<br />

The Fibonacci numbers F_n, named after Leonardo Fibonacci Pisano 2 , are defined as follows: F_0 = 0,<br />

F_1 = 1, and F_n = F_{n−1} + F_{n−2} for all n ≥ 2. The recursive definition of Fibonacci numbers<br />

immediately gives us a recursive algorithm for computing them:<br />

RecFibo(n):<br />

if (n < 2)<br />

return n<br />

else<br />

return RecFibo(n − 1) + RecFibo(n − 2)<br />

How long does this algorithm take? Except for the recursive calls, the entire algorithm requires<br />

only a constant number of steps: one comparison and possibly one addition. If T (n) represents the<br />

number of recursive calls to RecFibo, we have the recurrence<br />

T (0) = 1, T (1) = 1, T (n) = T (n − 1) + T (n − 2) + 1.<br />

This looks an awful lot like the recurrence for Fibonacci numbers! In fact, it’s fairly easy to show<br />

by induction that T (n) = 2F_{n+1} − 1. In other words, computing F_n using this algorithm takes<br />

more than twice as many steps as just counting to F_n!<br />

1 But less time. If we assume that multiplying two n-digit numbers takes O(n log n) time, then the iterative<br />

algorithm takes O(n² log n) time, but this recursive algorithm takes only O(n log² n) time.<br />

2 Literally, “Leonardo, son of Bonacci, of Pisa”.<br />


Another way to see this is that RecFibo is building a big binary tree of additions, with<br />

nothing but zeros and ones at the leaves. Since the eventual output is F_n, our algorithm must<br />

call RecFibo(1) (which returns 1) exactly F_n times. A quick inductive argument implies that<br />

RecFibo(0) is called exactly F_{n−1} times. Thus, the recursion tree has F_n + F_{n−1} = F_{n+1} leaves,<br />

and therefore, because it’s a full binary tree, it must have 2F_{n+1} − 1 nodes. (See Homework Zero!)<br />
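An instrumented Python sketch (hypothetical names; the recursion is the one above, plus a call counter) confirms both the values and the node count 2F_{n+1} − 1:<br />

```python
def rec_fibo_counted(n):
    """RecFibo, instrumented to return (F_n, total number of calls made)."""
    if n < 2:
        return n, 1
    a, ca = rec_fibo_counted(n - 1)
    b, cb = rec_fibo_counted(n - 2)
    return a + b, 1 + ca + cb

def fib(n):
    """Straightforward iterative Fibonacci, used only for checking."""
    prev, curr = 1, 0              # F_{-1} = 1, F_0 = 0
    for _ in range(n):
        prev, curr = curr, curr + prev
    return curr

value, calls = rec_fibo_counted(10)
assert value == fib(10) == 55
assert calls == 2 * fib(11) - 1    # 2*89 - 1 = 177 calls just to compute F_10
```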

2.3 Aside: The Annihilator Method<br />

Just how slow is that? We can get a good asymptotic estimate for T (n) by applying the annihilator<br />

method, described in the ‘solving recurrences’ handout:<br />

〈T (n + 2) − T (n + 1) − T (n)〉 = 〈1〉<br />

〈T (n + 2)〉 = 〈T (n + 1)〉 + 〈T (n)〉 + 〈1〉<br />

(E² − E − 1) 〈T (n)〉 = 〈1〉<br />

(E² − E − 1)(E − 1) 〈T (n)〉 = 〈0〉<br />

The characteristic polynomial of this recurrence is (r² − r − 1)(r − 1), which has three roots:<br />

φ = (1 + √5)/2 ≈ 1.618, φ̂ = (1 − √5)/2 ≈ −0.618, and 1. Thus, the generic solution is<br />

T (n) = αφ^n + βφ̂^n + γ.<br />

Now we plug in a few base cases:<br />

T (0) = 1 = α + β + γ<br />

T (1) = 1 = αφ + βφ̂ + γ<br />

T (2) = 3 = αφ² + βφ̂² + γ<br />

Solving this system of linear equations gives us<br />

α = 1 + 1/√5, β = 1 − 1/√5, γ = −1,<br />

so our final solution is<br />

T (n) = (1 + 1/√5) φ^n + (1 − 1/√5) φ̂^n − 1 = Θ(φ^n).<br />
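The closed form is easy to sanity-check numerically; here is a short Python snippet comparing it against the recurrence T(0) = T(1) = 1, T(n) = T(n−1) + T(n−2) + 1 (variable names are mine):<br />

```python
from math import sqrt

phi = (1 + sqrt(5)) / 2        # golden ratio, ~1.618
phi_hat = (1 - sqrt(5)) / 2    # conjugate root, ~-0.618
alpha = 1 + 1 / sqrt(5)
beta = 1 - 1 / sqrt(5)

def t_closed(n):
    """Closed form: T(n) = alpha*phi^n + beta*phi_hat^n - 1."""
    return alpha * phi ** n + beta * phi_hat ** n - 1

# Tabulate T(n) directly from the recurrence.
t = [1, 1]
for n in range(2, 25):
    t.append(t[-1] + t[-2] + 1)

# The closed form reproduces every value (up to floating-point rounding).
assert all(round(t_closed(n)) == t[n] for n in range(25))
```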

Actually, if we only want an asymptotic bound, we only need to show that α ≠ 0, which is<br />

much easier than solving the whole system of equations. Since φ is the largest characteristic root<br />

with non-zero coefficient, we immediately have T (n) = Θ(φ^n).<br />

2.4 Memo(r)ization and Dynamic Programming<br />

The obvious reason for the recursive algorithm’s lack of speed is that it computes the same Fibonacci<br />

numbers over and over and over. A single call to RecFibo(n) results in one recursive call<br />

to RecFibo(n − 1), two recursive calls to RecFibo(n − 2), three recursive calls<br />

to RecFibo(n − 3), five recursive calls to RecFibo(n − 4), and in general, F_{k+1}<br />

recursive calls to RecFibo(n − k), for any 0 ≤ k < n. For each call, we’re recomputing<br />

some Fibonacci number from scratch.<br />

We can speed up the algorithm considerably just by writing down the results of our recursive<br />

calls and looking them up again if we need them later. This process is called memoization. 3<br />

3 “My name is Elmer J. Fudd, millionaire. I own a mansion and a yacht.”<br />


MemFibo(n):<br />

if (n < 2)<br />

return n<br />

else<br />

if F [n] is undefined<br />

F [n] ← MemFibo(n − 1) + MemFibo(n − 2)<br />

return F [n]<br />

If we actually trace through the recursive calls made by MemFibo, we find that the array F [ ]<br />

gets filled from the bottom up: first F [2], then F [3], and so on, up to F [n]. Once we realize this,<br />

we can replace the recursion with a simple for-loop that just fills up the array in that order, instead<br />

of relying on the complicated recursion to do it for us. This gives us our first explicit dynamic<br />

programming algorithm.<br />

IterFibo(n):<br />

F [0] ← 0<br />

F [1] ← 1<br />

for i ← 2 to n<br />

F [i] ← F [i − 1] + F [i − 2]<br />

return F [n]<br />

IterFibo clearly takes only O(n) time and O(n) space to compute F_n, an exponential speedup<br />

over our original recursive algorithm. We can reduce the space to O(1) by noticing that we never<br />

need more than the last two elements of the array:<br />

IterFibo2(n):<br />

prev ← 1<br />

curr ← 0<br />

for i ← 1 to n<br />

next ← curr + prev<br />

prev ← curr<br />

curr ← next<br />

return curr<br />

(This algorithm uses the non-standard but perfectly consistent base case F_{−1} = 1.)<br />
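In Python, the constant-space loop is a near-verbatim transcription (a sketch; the name is mine):<br />

```python
def iter_fibo2(n):
    """Constant-space Fibonacci: keep only the last two values,
    exactly as in IterFibo2 (with the base case F_{-1} = 1)."""
    prev, curr = 1, 0
    for _ in range(n):
        prev, curr = curr, curr + prev
    return curr

assert [iter_fibo2(i) for i in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```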

But even this isn’t the fastest algorithm for computing Fibonacci numbers. There’s a faster<br />

algorithm defined in terms of matrix multiplication, using the following wonderful fact:<br />

[ 0 1 ] [ x ]   [   y   ]<br />

[ 1 1 ] [ y ] = [ x + y ]<br />

In other words, multiplying a two-dimensional vector by the matrix [ 0 1; 1 1 ] does exactly the same<br />

thing as one iteration of the inner loop of IterFibo2. This might lead us to believe that multiplying<br />

by the matrix n times is the same as iterating the loop n times:<br />

[ 0 1 ]^n [ 1 ]   [ F_{n−1} ]<br />

[ 1 1 ]   [ 0 ] = [   F_n   ]<br />

A quick inductive argument proves this. So if we want to compute the nth Fibonacci number, all<br />

we have to do is compute the nth power of the matrix [ 0 1; 1 1 ].<br />
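Combining this fact with the fast exponentiation idea gives an algorithm using only O(log n) matrix multiplications; a Python sketch (hypothetical helper names, 2 × 2 matrices as nested tuples):<br />

```python
def mat_mult(A, B):
    """Product of two 2x2 matrices, each represented as ((a, b), (c, d))."""
    return ((A[0][0] * B[0][0] + A[0][1] * B[1][0],
             A[0][0] * B[0][1] + A[0][1] * B[1][1]),
            (A[1][0] * B[0][0] + A[1][1] * B[1][0],
             A[1][0] * B[0][1] + A[1][1] * B[1][1]))

def mat_power(M, n):
    """M^n with O(log n) matrix multiplications (the FastPower idea again, n >= 1)."""
    if n == 1:
        return M
    X = mat_power(M, n // 2)
    X2 = mat_mult(X, X)
    return X2 if n % 2 == 0 else mat_mult(X2, M)

def matrix_fibo(n):
    """F_n via the nth power of [[0,1],[1,1]]: its bottom-left entry is F_n."""
    if n == 0:
        return 0
    return mat_power(((0, 1), (1, 1)), n)[1][0]

assert [matrix_fibo(i) for i in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```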



We saw in the previous lecture, and at the beginning of this lecture, that computing the nth<br />

power of something requires only O(log n) multiplications. In this case, that means O(log n) 2 × 2<br />

matrix multiplications, but one 2 × 2 matrix multiplication can be done with only a constant number<br />

of integer multiplications and additions. By applying our earlier dynamic programming algorithm<br />

for computing exponents, we can compute F_n in only O(log n) steps.<br />

This is an exponential speedup over the standard iterative algorithm, which was already an<br />

exponential speedup over our original recursive algorithm. Right?<br />
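A sketch of the matrix algorithm in Python (names mine), computing the nth power of [0 1; 1 1] by repeated squaring:

```python
def mat_mult(a, b):
    """Multiply two 2x2 matrices, given as tuples of rows."""
    return ((a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]),
            (a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]))

def matrix_fibo(n):
    """Compute F(n) with O(log n) 2x2 matrix multiplications.

    By induction, [0 1; 1 1]^n = [F(n-1) F(n); F(n) F(n+1)].
    """
    result = ((1, 0), (0, 1))   # identity matrix
    base = ((0, 1), (1, 1))
    while n > 0:                # repeated squaring
        if n & 1:
            result = mat_mult(result, base)
        base = mat_mult(base, base)
        n >>= 1
    return result[0][1]
```

(Of course, each 'constant-time' multiplication here is really a bignum multiplication in Python.)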

2.5 Uh. . . wait a minute.<br />

Well, not exactly. Fibonacci numbers grow exponentially fast. The nth Fibonacci number is<br />

approximately n log₁₀ φ ≈ n/5 decimal digits long, or n log₂ φ ≈ 2n/3 bits. So we can’t possibly

compute Fn in logarithmic time — we need Ω(n) time just to write down the answer!<br />

I’ve been cheating by assuming we can do arbitrary-precision arithmetic in constant time. As<br />

we discussed last time, multiplying two n-digit numbers takes O(n log n) time. That means that<br />

the matrix-based algorithm’s actual running time is given by the recurrence<br />

T (n) = T (⌊n/2⌋) + O(n log n),<br />

which solves to T (n) = O(n log n) by the Master Theorem.<br />

Is this slower than our “linear-time” iterative algorithm? No! Addition isn’t free, either. Adding<br />

two n-digit numbers takes O(n) time, so the running time of the iterative algorithm is O(n²). (Do

you see why?) So our matrix algorithm really is faster than our iterative algorithm, but not<br />

exponentially faster.<br />

Incidentally, in the recursive algorithm, the extra cost of arbitrary-precision arithmetic is overwhelmed<br />

by the huge number of recursive calls. The correct recurrence is<br />

T (n) = T (n − 1) + T (n − 2) + O(n),<br />

which still has the solution O(φⁿ) by the annihilator method.

2.6 The Pattern<br />

Dynamic programming is essentially recursion without repetition. Developing a dynamic programming<br />

algorithm generally involves two separate steps.<br />

1. Formulate the problem recursively. Write down a formula for the whole problem as a<br />

simple combination of the answers to smaller subproblems.<br />

2. Build solutions to your recurrence from the bottom up. Write an algorithm that<br />

starts with the base cases of your recurrence and works its way up to the final solution by<br />

considering the intermediate subproblems in the correct order.<br />

Of course, you have to prove that each of these steps is correct. If your recurrence is wrong, or if<br />

you try to build up answers in the wrong order, your algorithm won’t work!<br />

Dynamic programming algorithms need to store the results of intermediate subproblems. This<br />

is often but not always done with some kind of table.<br />
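For example, the table in IterFibo can just as easily be filled top-down, by memoizing the recursive algorithm — a Python sketch (names mine):

```python
def memo_fibo(n, memo=None):
    """Recursion without repetition, top-down: store each F(i) the first
    time the recursion computes it, then reuse it ever after."""
    if memo is None:
        memo = {0: 0, 1: 1}     # base cases of the recurrence
    if n not in memo:
        memo[n] = memo_fibo(n - 1, memo) + memo_fibo(n - 2, memo)
    return memo[n]
```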




2.7 Edit Distance<br />

The edit distance between two words is the minimum number of letter insertions, letter deletions,<br />

and letter substitutions required to transform one word into another. For example, the edit distance<br />

between FOOD and MONEY is at most four:<br />

FOOD → MOOD → MON ∧D → MONED → MONEY<br />

A better way to display this editing process is to place the words one above the other, with a gap<br />

in the first word for every insertion, and a gap in the second word for every deletion. Columns with<br />

two different characters correspond to substitutions. Thus, the number of editing steps is just the<br />

number of columns that don’t contain the same character twice.<br />

F O O D<br />

M O N E Y<br />

It’s fairly obvious that you can’t get from FOOD to MONEY in three steps, so their edit distance is<br />

exactly four. Unfortunately, this is not so easy in general. Here’s a longer example, showing that<br />

the distance between ALGORITHM and ALTRUISTIC is at most six. Is this optimal?<br />

A L G O R I T H M<br />

A L T R U I S T I C<br />

To develop a dynamic programming algorithm to compute the edit distance between two strings,<br />

we first need to develop a recursive definition. Let’s say we have an m-character string A and an<br />

n-character string B. Then define E(i, j) to be the edit distance between the first i characters of A<br />

and the first j characters of B. The edit distance between the entire strings A and B is E(m, n).<br />

This gap representation for edit sequences has a crucial “optimal substructure” property. Suppose<br />

we have the gap representation for the shortest edit sequence for two strings. If we remove<br />

the last column, the remaining columns must represent the shortest edit sequence for<br />

the remaining substrings. We can easily prove this by contradiction. If the substrings had a<br />

shorter edit sequence, we could just glue the last column back on and get a shorter edit sequence<br />

for the original strings.<br />

There are a couple of obvious base cases. The only way to convert the empty string into a string<br />

of j characters is by doing j insertions, and the only way to convert a string of i characters into<br />

the empty string is with i deletions:<br />

E(i, 0) = i, E(0, j) = j.<br />

In general, there are three possibilities for the last column in the shortest possible edit sequence:<br />

• Deletion: The last entry in the bottom row is empty. In this case, E(i, j) = E(i − 1, j) + 1.

• Insertion: The last entry in the top row is empty. In this case, E(i, j) = E(i, j − 1) + 1.

• Substitution: Both rows have characters in the last column. If the characters are the same,<br />

we don’t actually have to pay for the substitution, so E(i, j) = E(i−1, j−1). If the characters<br />

are different, then E(i, j) = E(i − 1, j − 1) + 1.<br />

To summarize, the edit distance E(i, j) is the smallest of these three possibilities:<br />

E(i, j) = min { E(i − 1, j) + 1,  E(i, j − 1) + 1,  E(i − 1, j − 1) + [A[i] ≠ B[j]] }




[The bracket notation [P] denotes the indicator variable for the logical proposition P. Its value is 1 if P is true and 0 if P is false. This is the same as the C/C++/Java expression P ? 1 : 0.]

If we turned this recurrence directly into a recursive algorithm, we would have the following<br />

horrible double recurrence for the running time:<br />

T(m, 0) = T(0, n) = O(1),    T(m, n) = T(m, n − 1) + T(m − 1, n) + T(m − 1, n − 1) + O(1).

Yuck!! The solution for this recurrence is an exponential mess! I don’t know a general closed form,<br />

but T(n, n) = Θ((1 + √2)ⁿ). Obviously a recursive algorithm is not the way to go here.

Instead, let’s build an (m + 1) × (n + 1) table of all possible values of E(i, j). We can start by filling in

the base cases, the entries in the 0th row and 0th column, each in constant time. To fill in any<br />

other entry, we need to know the values directly above it, directly to the left, and both above and<br />

to the left. If we fill in our table in the standard way—row by row from top down, each row from<br />

left to right—then whenever we reach an entry in the matrix, the entries it depends on are already<br />

available.<br />

EditDistance(A[1 .. m], B[1 .. n]):
  for i ← 0 to m
    Edit[i, 0] ← i
  for j ← 1 to n
    Edit[0, j] ← j
  for i ← 1 to m
    for j ← 1 to n
      if A[i] = B[j]
        Edit[i, j] ← min{ Edit[i − 1, j] + 1, Edit[i, j − 1] + 1, Edit[i − 1, j − 1] }
      else
        Edit[i, j] ← min{ Edit[i − 1, j] + 1, Edit[i, j − 1] + 1, Edit[i − 1, j − 1] + 1 }
  return Edit[m, n]

Since there are Θ(mn) entries in the table, and each entry takes Θ(1) time once we know its predecessors, the total running time is Θ(mn).
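In Python, the same table-filling algorithm looks like this (function name mine; strings are 0-indexed, so A[i] becomes a[i − 1]):

```python
def edit_distance(a, b):
    """Fill an (m+1) x (n+1) table of edit distances E(i, j), row by row."""
    m, n = len(a), len(b)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        edit[i][0] = i          # i deletions turn a prefix into the empty string
    for j in range(1, n + 1):
        edit[0][j] = j          # j insertions build a prefix from nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1   # 'free' substitution?
            edit[i][j] = min(edit[i - 1][j] + 1,
                             edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + sub)
    return edit[m][n]
```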

Here’s the resulting table for ALGORITHM → ALTRUISTIC. Bold numbers indicate places where<br />

characters in the two strings are equal. The arrows represent the predecessor(s) that actually define<br />

each entry. Each direction of arrow corresponds to a different edit operation: horizontal=deletion,<br />

vertical=insertion, and diagonal=substitution. Bold diagonal arrows indicate “free” substitutions<br />

of a letter for itself. A path of arrows from the top left corner to the bottom right corner of this<br />

table represents an optimal edit sequence between the two strings. There can be many such paths.<br />




          A   L   G   O   R   I   T   H   M
      0   1   2   3   4   5   6   7   8   9
  A   1   0   1   2   3   4   5   6   7   8
  L   2   1   0   1   2   3   4   5   6   7
  T   3   2   1   1   2   3   4   4   5   6
  R   4   3   2   2   2   2   3   4   5   6
  U   5   4   3   3   3   3   3   4   5   6
  I   6   5   4   4   4   4   3   4   5   6
  S   7   6   5   5   5   5   4   4   5   6
  T   8   7   6   6   6   6   5   4   5   6
  I   9   8   7   7   7   7   6   5   5   6
  C  10   9   8   8   8   8   7   6   6   6

The edit distance between ALGORITHM and ALTRUISTIC is indeed six. There are three paths<br />

through this table from the top left to the bottom right, so there are three optimal edit sequences:<br />

A L G O R I T H M
A L T R U I S T I C

A L G O R I T H M
A L T R U I S T I C

A L G O R I T H M
A L T R U I S T I C

2.8 Danger! Greed kills!

If we’re very very very very lucky, we can bypass all the recurrences and tables and so forth, and<br />

solve the problem using a greedy algorithm. The general greedy strategy is to look for the best first

step, take it, and then continue. For example, a greedy algorithm for the edit distance problem<br />

might look for the longest common substring of the two strings, match up those substrings (since<br />

those substitutions don’t cost anything), and then recursively look for the edit distances between

the left halves and right halves of the strings. If there is no common substring—that is, if the two<br />

strings have no characters in common—the edit distance is clearly the length of the larger string.<br />

If this sounds like a hack to you, pat yourself on the back. It isn’t even close to the correct<br />

solution. Nevertheless, for many problems involving dynamic programming, many students’ first

intuition is to apply a greedy strategy. This almost never works; problems that can be solved<br />

correctly by a greedy algorithm are very rare. Everyone should tattoo the following sentence on<br />

the back of their hands, right under all the rules about logarithms and big-Oh notation:<br />

Greedy algorithms never work!<br />

Use dynamic programming instead!<br />

Well. . . hardly ever. Later in the semester we’ll see correct greedy algorithms for minimum<br />

spanning trees and shortest paths.<br />




2.9 Optimal Binary Search Trees<br />

You all remember that the cost of a successful search in a binary search tree is proportional to the<br />

depth of the target node plus one. As a result, the worst-case search time is proportional to the<br />

height of the tree. To minimize the worst-case search time, the height of the tree should be as small<br />

as possible; ideally, the tree is perfectly balanced.<br />

In many applications of binary search trees, it is more important to minimize the total cost of<br />

several searches than to minimize the worst-case cost of a single search. If x is a more ‘popular’<br />

search target than y, we can save time by building a tree where the depth of x is smaller than the<br />

depth of y, even if that means increasing the overall height of the tree. A perfectly balanced tree<br />

is not the best choice if some items are significantly more popular than others. In fact, a totally<br />

unbalanced tree of depth Ω(n) might actually be the best choice!<br />

This situation suggests the following problem. Suppose we are given a sorted array of n keys<br />

A[1 .. n] and an array of corresponding access frequencies f[1 .. n]. Over the lifetime of the search<br />

tree, we will search for the key A[i] exactly f[i] times. Our task is to build the binary search tree<br />

that minimizes the total search time.<br />

Before we think about how to solve this problem, we should first come up with the right way<br />

to describe the function we are trying to optimize! Suppose we have a binary search tree T . Let<br />

depth(T, i) denote the depth of the node in T that stores the key A[i]. Up to constant factors, the<br />

total search time S(T ) is given by the following expression:<br />

S(T) = Σ_{i=1}^{n} (depth(T, i) + 1) · f[i]

This expression is called the weighted external path length of T . We can trivially split this expression<br />

into two components:<br />

S(T) = Σ_{i=1}^{n} f[i] + Σ_{i=1}^{n} depth(T, i) · f[i].

The first term is the total number of searches, which doesn’t depend on our choice of tree at all.<br />

The second term is called the weighted internal path length of T .<br />

We can express S(T ) in terms of the recursive structure of T as follows. Suppose the root of T<br />

contains the key A[r], so the left subtree stores the keys A[1 .. r − 1], and the right subtree stores<br />

the keys A[r + 1 .. n]. We can actually define depth(T, i) recursively as follows:<br />

depth(T, i) = { depth(left(T), i) + 1    if i < r
              { 0                         if i = r
              { depth(right(T), i) + 1    if i > r

If we plug this recursive definition into our earlier expression for S(T ), we get the following:<br />

S(T) = Σ_{i=1}^{n} f[i] + Σ_{i=1}^{r−1} (depth(left(T), i) + 1) · f[i] + Σ_{i=r+1}^{n} (depth(right(T), i) + 1) · f[i]

This looks complicated, until we realize that the second and third terms look exactly like our initial expression for S(T)!

S(T) = Σ_{i=1}^{n} f[i] + S(left(T)) + S(right(T))



Now our task is to compute the tree Topt that minimizes the total search time S(T ). Suppose<br />

the root of Topt stores key A[r]. The recursive definition of S(T ) immediately implies that the left<br />

subtree left(Topt) must also be the optimal search tree for the keys A[1 .. r−1] and access frequencies<br />

f[1 .. r − 1]. Similarly, the right subtree right(Topt) must also be the optimal search tree for the<br />

keys A[r + 1 .. n] and access frequencies f[r + 1 .. n]. Thus, once we choose the correct key to store<br />

at the root, the recursion fairy will automatically construct the rest of the optimal tree for us!<br />

More formally, let S(i, j) denote the total search time for the optimal search tree containing<br />

the subarray A[i .. j]; our task is to compute S(1, n). To simplify notation a bit, let F (i, j) denote

the total frequency counts for all the keys in the subarray A[i .. j]:<br />

F(i, j) = Σ_{k=i}^{j} f[k]

We now have the following recurrence:

S(i, j) = { 0                                                     if j = i − 1
          { F(i, j) + min_{i≤r≤j} [ S(i, r − 1) + S(r + 1, j) ]    otherwise

The base case might look a little weird, but all it means is that the total cost for searching an<br />

empty set of keys is zero. We could use the base cases S(i, i) = f[i] instead, but this would lead<br />

to extra special cases when r = i or r = j. Also, the F (i, j) term is outside the min because it

doesn’t depend on the root index r.<br />

Now, if we try to evaluate this recurrence directly using a recursive algorithm, the running time<br />

will have the following evil-looking recurrence:<br />

T(n) = Θ(n) + Σ_{k=1}^{n} ( T(k − 1) + T(n − k) )

The Θ(n) term comes from computing F (1, n), which is the total number of searches. A few minutes<br />

of pain and suffering by a professional algorithm analyst gives us the solution T(n) = Θ(3ⁿ). Once

again, top-down recursion is not the way to go.<br />

In fact, we’re not even computing the access counts F (i, j) as efficiently as we could. Even if<br />

we memoize the answers in an array F [1 .. n][1 .. n], computing each value F (i, j) using a separate<br />

for-loop requires a total of O(n³) time. A better approach is to turn the recurrence

F(i, j) = { f[i]                  if i = j
          { F(i, j − 1) + f[j]     otherwise

into the following O(n²)-time dynamic programming algorithm:

InitF(f[1 .. n]):
  for i ← 1 to n
    F[i, i − 1] ← 0
    for j ← i to n
      F[i, j] ← F[i, j − 1] + f[j]

This will be used as an initialization subroutine in our final algorithm.<br />
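In Python (with f 0-indexed), the same precomputation is just a table of running sums (names mine):

```python
def init_f(f):
    """F[i][j] = f[i] + ... + f[j] in the 1-indexed key numbering of the
    notes (so key k's frequency f[k] is f[k-1] here), with F[i][i-1] = 0."""
    n = len(f)
    F = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            F[i][j] = F[i][j - 1] + f[j - 1]
    return F
```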




So now let’s compute the optimal search tree cost S(1, n) from the bottom up. We can store all<br />

intermediate results in a table S[1 .. n, 0 .. n]. Only the entries S[i, j] with j ≥ i − 1 will actually be<br />

used. The base case of the recursion tells us that any entry of the form S[i, i − 1] can immediately<br />

be set to 0. For any other entry S[i, j], we can use the following algorithm fragment, which comes<br />

directly from the recurrence:<br />

ComputeS(i, j):
  S[i, j] ← ∞
  for r ← i to j
    tmp ← S[i, r − 1] + S[r + 1, j]
    if S[i, j] > tmp
      S[i, j] ← tmp
  S[i, j] ← S[i, j] + F[i, j]

The only question left is what order to fill in the table.<br />

Each entry S[i, j] depends on all entries S[i, r − 1] and S[r + 1, j] with i ≤ r ≤ j. In other

words, every entry in the table depends on all the entries directly to the left or directly below. In<br />

order to fill the table efficiently, we must choose an order that computes all those entries before<br />

S[i, j]. There are at least three different orders that satisfy this constraint. The one that occurs to<br />

most people first is to scan through the table one diagonal at a time, starting with the trivial base<br />

cases S[i, i − 1]. The complete algorithm looks like this:<br />

OptimalSearchTree(f[1 .. n]):
  InitF(f[1 .. n])
  for i ← 1 to n
    S[i, i − 1] ← 0
  for d ← 0 to n − 1
    for i ← 1 to n − d
      ComputeS(i, i + d)
  return S[1, n]

We could also traverse the array row by row from the bottom up, traversing each row from left to<br />

right, or column by column from left to right, traversing each column from the bottom up. These

two orders give us the following algorithms:<br />

OptimalSearchTree2(f[1 .. n]):
  InitF(f[1 .. n])
  for i ← n downto 1
    S[i, i − 1] ← 0
    for j ← i to n
      ComputeS(i, j)
  return S[1, n]

OptimalSearchTree3(f[1 .. n]):
  InitF(f[1 .. n])
  for j ← 0 to n
    S[j + 1, j] ← 0
    for i ← j downto 1
      ComputeS(i, j)
  return S[1, n]

Three different orders to fill in the table S[i, j].



No matter which of these three orders we actually use, the resulting algorithm runs in Θ(n³) time and uses Θ(n²) space.
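Putting the pieces together in Python (names mine; frequencies 0-indexed), using the diagonal-by-diagonal order:

```python
def optimal_bst_cost(f):
    """Minimum total search time over all BSTs for keys with frequencies f.

    S[i][j] = optimal cost for keys i..j (1-indexed); filled one diagonal
    at a time, for Theta(n^3) total time and Theta(n^2) space.
    """
    n = len(f)
    F = [[0] * (n + 2) for _ in range(n + 2)]   # F[i][j] = f[i] + ... + f[j]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            F[i][j] = F[i][j - 1] + f[j - 1]
    S = [[0] * (n + 2) for _ in range(n + 2)]   # S[i][i-1] = 0 base cases
    for d in range(n):                          # diagonal d: j = i + d
        for i in range(1, n - d + 1):
            j = i + d
            S[i][j] = F[i][j] + min(S[i][r - 1] + S[r + 1][j]
                                    for r in range(i, j + 1))
    return S[1][n]
```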

We could have predicted this from the original recursive formulation.<br />

S(i, j) = { 0                                                     if j = i − 1
          { F(i, j) + min_{i≤r≤j} [ S(i, r − 1) + S(r + 1, j) ]    otherwise

First, the function has two arguments, each of which can take on any value between 1 and n, so we probably need a table of size O(n²). Next, there are three variables in the recurrence (i, j, and r), each of which can take any value between 1 and n, so it should take us O(n³) time to fill the table.

In general, you can get an easy estimate of the time and space bounds for any dynamic programming<br />

algorithm by looking at the recurrence. The time bound is determined by how many values<br />

all the variables can have, and the space bound is determined by how many values the parameters<br />

of the function can have. For example, the (completely made up) recurrence<br />

F(i, j, k, l, m) = min_{0≤p≤i} max_{0≤q≤j} Σ_{r=1}^{k−m} F(i − p, j − q, r, l − 1, m − r)

should immediately suggest a dynamic programming algorithm that uses O(n⁸) time and O(n⁵) space. This rule of thumb usually gives us the right time bound to shoot for.

But not always! In fact, the algorithm I’ve described is not the most efficient algorithm for<br />

computing optimal binary search trees. Let R[i, j] denote the root of the optimal search tree for<br />

A[i .. j]. Donald Knuth proved the following nice monotonicity property for optimal subtrees: if<br />

we move either end of the subarray, the optimal root moves in the same direction or not at all, or<br />

more formally:<br />

R[i, j − 1] ≤ R[i, j] ≤ R[i + 1, j] for all i and j.<br />

This (nontrivial!) observation leads to the following more efficient algorithm:<br />

FasterOptimalSearchTree(f[1 .. n]):
  InitF(f[1 .. n])
  for i ← n downto 1
    S[i, i − 1] ← 0
    R[i, i − 1] ← i
    for j ← i to n
      ComputeSandR(i, j)
  return S[1, n]

ComputeSandR(i, j):
  S[i, j] ← ∞
  for r ← R[i, j − 1] to j
    tmp ← S[i, r − 1] + S[r + 1, j]
    if S[i, j] > tmp
      S[i, j] ← tmp
      R[i, j] ← r
  S[i, j] ← S[i, j] + F[i, j]

It’s not hard to see that r increases monotonically from i to n during each iteration of the outermost for loop. Consequently, the innermost for loop iterates at most n times during a single iteration of the outermost loop, so the total running time of the algorithm is O(n²).
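A Python sketch of the O(n²) version (names mine), clamping Knuth's upper bound R[i + 1][j] to j so the base cases don't leak into the search range:

```python
def faster_optimal_bst(f):
    """Optimal BST cost in O(n^2), scanning only R[i][j-1] <= r <= R[i+1][j]."""
    n = len(f)
    F = [[0] * (n + 2) for _ in range(n + 2)]   # F[i][j] = f[i] + ... + f[j]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            F[i][j] = F[i][j - 1] + f[j - 1]
    S = [[0] * (n + 2) for _ in range(n + 2)]
    R = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 2):
        R[i][i - 1] = i                         # 'root' of an empty subarray
    for i in range(n, 0, -1):                   # row by row, bottom up
        for j in range(i, n + 1):
            best, best_r = None, i
            for r in range(R[i][j - 1], min(R[i + 1][j], j) + 1):
                tmp = S[i][r - 1] + S[r + 1][j]
                if best is None or tmp < best:
                    best, best_r = tmp, r
            S[i][j] = F[i][j] + best
            R[i][j] = best_r
    return S[1][n]
```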

If we formulate the problem slightly differently, this algorithm can be improved even further.<br />

Suppose we require the optimum external binary tree, where the keys A[1 .. n] are all stored at<br />

the leaves, and intermediate pivot values are stored at the internal nodes. An algorithm due to<br />

Te Ching Hu and Alan Tucker 4 computes the optimal binary search tree in this setting in only<br />

O(n log n) time!

4 T. C. Hu and A. C. Tucker, Optimal computer search trees and variable length alphabetic codes, SIAM J. Applied Math. 21:514–532, 1971. For a slightly simpler algorithm with the same running time, see A. M. Garsia and M. L. Wachs, A new algorithm for minimal binary search trees, SIAM J. Comput. 6:622–642, 1977. The original correctness proofs for both algorithms are rather intricate; for simpler proofs, see Marek Karpinski, Lawrence L. Larmore, and Wojciech Rytter, Correctness of constructing optimal alphabetic trees revisited, Theoretical Computer Science 180:309–324, 1997.

2.10 Optimal Triangulations of Convex Polygons<br />

A convex polygon is a circular chain of line segments, arranged so none of the corners point inwards—<br />

imagine a rubber band stretched around a bunch of nails. (This is technically not the best definition,<br />

but it’ll do for now.) A diagonal is a line segment that cuts across the interior of the polygon from<br />

one corner to another. A simple induction argument (hint, hint) implies that any n-sided convex<br />

polygon can be split into n−2 triangles by cutting along n−3 different diagonals. This collection of<br />

triangles is called a triangulation of the polygon. Triangulations are incredibly useful in computer<br />

graphics—most graphics hardware is built to draw triangles incredibly quickly, but to draw anything<br />

more complicated, you usually have to break it into triangles first.<br />

A convex polygon and two of its many possible triangulations.<br />

There are several different ways to triangulate any convex polygon. Suppose we want to find<br />

the triangulation that requires the least amount of ink to draw, or in other words, the triangulation<br />

where the total perimeter of the triangles is as small as possible. To make things concrete, let’s<br />

label the corners of the polygon from 1 to n, starting at the bottom of the polygon and going<br />

clockwise. We’ll need the following subroutines to compute the perimeter of a triangle joining three<br />

corners using their x- and y-coordinates:<br />

∆(i, j, k):
  return Dist(i, j) + Dist(j, k) + Dist(i, k)

Dist(i, j):
  return √( (x[i] − x[j])² + (y[i] − y[j])² )

In order to get a dynamic programming algorithm, we first need a recursive formulation of<br />

the minimum-length triangulation. To do that, we really need some kind of recursive definition of<br />

a triangulation! Notice that in any triangulation, exactly one triangle uses both the first corner<br />

and the last corner of the polygon. If we remove that triangle, what’s left over is two smaller<br />

triangulations. The base case of this recursive definition is a ‘polygon’ with just two corners.<br />

Notice that at any point in the recursion, we have a polygon joining a contiguous subset of the<br />

original corners.<br />


Two examples of the recursive definition of a triangulation.<br />

Building on this recursive definition, we can now recursively define the total length of the<br />

minimum-length triangulation. In the best triangulation, if we remove the ‘base’ triangle, what<br />




remains must be the optimal triangulation of the two smaller polygons. So we just have to choose the best triangle to attach to the first and last corners, and let the recursion fairy take care of the rest:

M(i, j) = { 0                                                     if j = i + 1
          { min_{i<k<j} [ ∆(i, j, k) + M(i, k) + M(k, j) ]         otherwise
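Here's the recurrence memoized in Python (names mine; corners are (x, y) pairs, 0-indexed, so the 'first' and 'last' corners are points[0] and points[n − 1]):

```python
from functools import lru_cache
from math import dist   # Euclidean distance, Python 3.8+

def min_triangulation(points):
    """Minimum total perimeter over all triangulations of a convex polygon.

    m(i, j) is zero for a single edge (j = i + 1); otherwise we pick the
    best triangle (i, k, j) and recurse on the two smaller polygons.
    """
    n = len(points)

    @lru_cache(maxsize=None)
    def m(i, j):
        if j == i + 1:
            return 0.0
        return min(dist(points[i], points[k]) + dist(points[k], points[j])
                   + dist(points[i], points[j])   # perimeter of triangle i, k, j
                   + m(i, k) + m(k, j)
                   for k in range(i + 1, j))

    return m(0, n - 1)
```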



Multiplying a p × q matrix by a q × r matrix takes O(pqr) arithmetic operations; the result is a p × r matrix. If we have three matrices to multiply, the

cost depends on which pair we multiply first. For example, suppose A and C are 1000 × 2 matrices<br />

and B is a 2 × 1000 matrix. There are two different ways to compute the threefold product ABC:<br />

• (AB)C: Computing AB takes 1000·2·1000 = 2 000 000 operations and produces a 1000×1000<br />

matrix. Multiplying this matrix by C takes 1000 · 1000 · 2 = 2 000 000 additional operations.<br />

So the total cost of (AB)C is 4 000 000 operations.<br />

• A(BC): Computing BC takes 2 · 1000 · 2 = 4000 operations and produces a 2 × 2 matrix.<br />

Multiplying A by this matrix takes 1000 · 2 · 2 = 4 000 additional operations. So the total cost<br />

of A(BC) is only 8000 operations.<br />

Now suppose we are given an array D[0 .. n] as input, indicating that each matrix Mi has<br />

D[i − 1] rows and D[i] columns. We have an exponential number of possible ways to compute the<br />

n-fold product Π_{i=1}^{n} M_i. The following dynamic programming algorithm computes the number of

arithmetic operations for the best possible parenthesization:<br />

MatrixChainMult(D[0 .. n]):
  for i ← n − 1 downto 0
    M[i, i + 1] ← 0
    for j ← i + 2 to n
      ComputeM(i, j)
  return M[0, n]

ComputeM(i, j):
  M[i, j] ← ∞
  for k ← i + 1 to j − 1
    tmp ← (D[i] · D[j] · D[k]) + M[i, k] + M[k, j]
    if M[i, j] > tmp
      M[i, j] ← tmp

The derivation of this algorithm is left as a simple exercise.
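In Python (names mine), with the half-open convention that M[i][j] is the cheapest cost for multiplying matrices i + 1 through j:

```python
def matrix_chain_cost(D):
    """Minimum number of scalar multiplications to compute the product of
    n matrices, where matrix i is D[i-1] x D[i]; the answer is M[0][n]."""
    n = len(D) - 1
    M = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):              # M[i][i+1] = 0 already
        for j in range(i + 2, n + 1):
            M[i][j] = min(D[i] * D[k] * D[j]    # cost of the final multiply
                          + M[i][k] + M[k][j]
                          for k in range(i + 1, j))
    return M[0][n]
```

On the ABC example above, matrix_chain_cost([1000, 2, 1000, 2]) returns 8000.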



CS 373 Lecture 3: Randomized Algorithms Fall 2002

The first nuts and bolts appeared in the middle 1400’s. The bolts were just screws with straight<br />

sides and a blunt end. The nuts were hand-made, and very crude. When a match was found<br />

between a nut and a bolt, they were kept together until they were finally assembled.<br />

In the Industrial Revolution, it soon became obvious that threaded fasteners made it easier<br />

to assemble products, and they also meant more reliable products. But the next big step came<br />

in 1801, with Eli Whitney, the inventor of the cotton gin. The lathe had been recently improved.<br />

Batches of bolts could now be cut on different lathes, and they would all fit the same nut.<br />

Whitney set up a demonstration for President Adams, and Vice-President Jefferson. He had<br />

piles of musket parts on a table. There were 10 similar parts in each pile. He went from pile to<br />

pile, picking up a part at random. Using these completely random parts, he quickly put together<br />

a working musket.<br />

— Karl S. Kruszelnicki (‘Dr. Karl’), Karl Trek, December 1997<br />

3 Randomized <strong>Algorithms</strong> (September 12)<br />

3.1 Nuts and Bolts<br />

Suppose we are given n nuts and n bolts of different sizes. Each nut matches exactly one bolt and<br />

vice versa. The nuts and bolts are all almost exactly the same size, so we can’t tell if one bolt is<br />

bigger than the other, or if one nut is bigger than the other. If we try to match a nut with a bolt,

however, the nut will be either too big, too small, or just right for the bolt.<br />

Our task is to match each nut to its corresponding bolt. But before we do this, let’s try to solve<br />

some simpler problems, just to get a feel for what we can and can’t do.<br />

Suppose we want to find the nut that matches a particular bolt. The obvious algorithm — test<br />

every nut until we find a match — requires exactly n − 1 tests in the worst case. We might have<br />

to check every nut except one; if we get down to the last nut without finding a match, we know

that the last nut is the one we’re looking for. 1<br />
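A Python sketch of this brute-force search (names mine), counting tests and using numbers to stand in for sizes:

```python
def find_matching_nut(nuts, bolt):
    """Test nuts in order; after n - 1 failed tests, the last nut must match.

    Returns the matching nut and the number of nut-bolt tests performed.
    """
    tests = 0
    for i, nut in enumerate(nuts):
        if i == len(nuts) - 1:
            return nut, tests   # last nut left: no test needed
        tests += 1
        if nut == bolt:
            return nut, tests
```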

Intuitively, in the ‘average’ case, this algorithm will look at approximately n/2 nuts. But what<br />

exactly does ‘average case’ mean?<br />

3.2 Deterministic vs. Randomized <strong>Algorithms</strong><br />

Normally, when we talk about the running time of an algorithm, we mean the worst-case running<br />

time. This is the maximum, over all inputs of a certain size, of the running time of the algorithm

on that input:<br />

T_worst-case(n) = max_{|X|=n} T(X).

On extremely rare occasions, we will also be interested in the best-case running time:<br />

T_best-case(n) = min_{|X|=n} T(X).

The average-case running time is best defined by the expected value, over all inputs X of a certain<br />

size, of the algorithm’s running time for X: 2<br />

T_average-case(n) = E_{|X|=n}[T(X)] = Σ_{|X|=n} T(X) · Pr[X].

1 “Whenever you lose something, it’s always in the last place you look. So why not just look there first?”<br />

2 The notation E[ ] for expectation has nothing to do with the shift operator E used in the annihilator method for<br />

solving recurrences!<br />




The problem with this definition is that we rarely, if ever, know what the probability of getting<br />

any particular input X is. We could compute average-case running times by assuming a particular<br />

probability distribution—for example, every possible input is equally likely—but this assumption<br />

doesn’t describe reality very well. Most real-life data is decidedly non-random.<br />

Instead of considering this rather questionable notion of average case running time, we will<br />

make a distinction between two kinds of algorithms: deterministic and randomized. A deterministic<br />

algorithm is one that always behaves the same way given the same input; the input completely<br />

determines the sequence of computations performed by the algorithm. Randomized algorithms, on<br />

the other hand, base their behavior not only on the input but also on several random choices. The<br />

same randomized algorithm, given the same input multiple times, may perform different computations<br />

in each invocation.<br />

This means, among other things, that the running time of a randomized algorithm on a given<br />

input is no longer fixed, but is itself a random variable. When we analyze randomized algorithms,<br />

we are typically interested in the worst-case expected running time. That is, we look at the average<br />

running time for each input, and then choose the maximum over all inputs of a certain size:<br />

T_worst-case-expected(n) = max_{|X|=n} E[T(X)].<br />

It’s important to note here that we are making no assumptions about the probability distribution<br />

of possible inputs. All the randomness is inside the algorithm, where we can control it!<br />

3.3 Back to Nuts and Bolts<br />

Let’s go back to the problem of finding the nut that matches a given bolt. Suppose we use the<br />

same algorithm as before, but at each step we choose a nut uniformly at random from the untested<br />

nuts. ‘Uniformly’ is a technical term meaning that each nut has exactly the same probability of<br />

being chosen. 3 So if there are k nuts left to test, each one will be chosen with probability 1/k. Now<br />

what’s the expected number of comparisons we have to perform? Intuitively, it should be about<br />

n/2, but let’s formalize our intuition.<br />

Let T (n) denote the number of comparisons our algorithm uses to find a match for a single bolt<br />

out of n nuts. 4 We still have some simple base cases T (1) = 0 and T (2) = 1, but when n > 2,<br />

T (n) is a random variable. T (n) is always between 1 and n − 1; its actual value depends on our<br />

algorithm’s random choices. We are interested in the expected value or expectation of T (n), which<br />

is defined as follows:<br />

E[T(n)] = Σ_{k=1}^{n−1} k · Pr[T(n) = k]<br />

If the target nut is the kth nut tested, our algorithm performs min{k, n − 1} comparisons. In<br />

particular, if the target nut is the last nut chosen, we don’t actually test it. Because we choose<br />

the next nut to test uniformly at random, the target nut is equally likely—with probability exactly<br />

1/n—to be the first, second, third, or kth nut tested, for any k. Thus:<br />

Pr[T(n) = k] = 1/n  if k < n − 1,<br />
               2/n  if k = n − 1.<br />

3 This is what most people think ‘random’ means, but they’re wrong.<br />

4 Note that for this algorithm, the input is completely specified by the number n. Since we’re choosing the nuts<br />

to test at random, even the order in which the nuts and bolts are presented doesn’t matter. That’s why I’m using<br />

the simpler notation T (n) instead of T (X).<br />


Plugging this into the definition of expectation gives us our answer.<br />

E[T(n)] = Σ_{k=1}^{n−2} k/n + 2(n − 1)/n<br />
        = Σ_{k=1}^{n−1} k/n + (n − 1)/n<br />
        = n(n − 1)/2n + 1 − 1/n<br />
        = (n + 1)/2 − 1/n<br />
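The closed form E[T(n)] = (n + 1)/2 − 1/n is easy to sanity-check by simulation (my own sketch, under my own modeling assumptions; not part of the notes):<br />

```python
import random

def average_tests(n, trials=100_000):
    """Estimate the expected number of tests to match one bolt against n nuts,
    testing the nuts in uniformly random order."""
    total = 0
    for _ in range(trials):
        target = random.randrange(n)           # index of the matching nut
        order = random.sample(range(n), n)     # uniformly random testing order
        tests = 0
        for nut in order:
            if tests == n - 1:                 # only one nut left: no test needed
                break
            tests += 1
            if nut == target:
                break
        total += tests
    return total / trials

# For n = 10 the formula (n + 1)/2 - 1/n predicts 5.4.
```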

We can get exactly the same answer by thinking of this algorithm recursively. We always have<br />

to perform at least one test. With probability 1/n, we successfully find the matching nut and halt.<br />

With the remaining probability 1 − 1/n, we recursively solve the same problem but with one fewer<br />

nut. We get the following recurrence for the expected number of tests:<br />

T(1) = 0,    E[T(n)] = 1 + ((n − 1)/n) · E[T(n − 1)]<br />

To get the solution, we define a new function t(n) = n E[T (n)] and rewrite:<br />

t(1) = 0, t(n) = n + t(n − 1)<br />

This recurrence translates into a simple summation, which we can easily solve.<br />

t(n) = Σ_{k=2}^{n} k = n(n + 1)/2 − 1   =⇒   E[T(n)] = t(n)/n = (n + 1)/2 − 1/n<br />

3.4 Finding All Matches<br />

Not let’s go back to the problem introduced at the beginning of the lecture: finding the matching<br />

nut for every bolt. The simplest algorithm simply compares every nut with every bolt, for a total<br />

of n 2 comparisons. The next thing we might try is repeatedly finding an arbitrary matched pair,<br />

using our very first nuts and bolts algorithm. This requires<br />

n<br />

i=1<br />

(i − 1) = n2 − n<br />

2<br />

comparisons in the worst case. So we save roughly a factor of two over the really stupid algorithm.<br />

Not very exciting.<br />

Here’s another possibility. Choose a pivot bolt, and test it against every nut. Then test the<br />

matching pivot nut against every other bolt. After these 2n − 1 tests, we have one matched pair,<br />

and the remaining nuts and bolts are partitioned into two subsets: those smaller than the pivot pair<br />

and those larger than the pivot pair. Finally, recursively match up the two subsets. The worst-case<br />

number of tests made by this algorithm is given by the recurrence<br />

T(n) = 2n − 1 + max_{1≤k≤n} {T(k − 1) + T(n − k)}<br />
     = 2n − 1 + T(n − 1)<br />


Along with the trivial base case T (0) = 0, this recurrence solves to<br />

T(n) = Σ_{i=1}^{n} (2i − 1) = n².<br />

In the worst case, this algorithm tests every nut-bolt pair! We could have been a little more<br />

clever—for example, if the pivot bolt is the smallest bolt, we only need n − 1 tests to partition<br />

everything, not 2n−1—but cleverness doesn’t actually help that much. We still end up with about<br />

n²/2 tests in the worst case.<br />
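One way to render the pivot algorithm in code (my own sketch; nuts and bolts are modeled as numbers, we compare a nut only with a bolt, and never two nuts or two bolts with each other):<br />

```python
def match_all(nuts, bolts):
    """Match every nut to its bolt, quicksort-style: find the pivot pair,
    partition both piles around it, then recurse on the two sides."""
    if not bolts:
        return []
    pivot_bolt = bolts[0]   # deterministic pivot; choosing it at random avoids the n^2 worst case
    pivot_nut = next(nut for nut in nuts if nut == pivot_bolt)
    left = match_all([n for n in nuts if n < pivot_bolt],
                     [b for b in bolts if b < pivot_nut])
    right = match_all([n for n in nuts if n > pivot_bolt],
                      [b for b in bolts if b > pivot_nut])
    return left + [(pivot_nut, pivot_bolt)] + right
```

Note that the matched pairs come back sorted by size, which is exactly the quicksort connection discussed below.<br />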

However, since this recursive algorithm looks almost exactly like quicksort, and everybody<br />

‘knows’ that the ‘average-case’ running time of quicksort is Θ(n log n), it seems reasonable to guess<br />

that the average number of nut-bolt comparisons is also Θ(n log n). As we shall see shortly, if the<br />

pivot bolt is always chosen uniformly at random, this intuition is exactly right.<br />

3.5 Reductions to and from Sorting<br />

The second algorithm for matching up the nuts and bolts looks exactly like quicksort. The algorithm<br />

not only matches up the nuts and bolts, but also sorts them by size.<br />

In fact, the problems of sorting and matching nuts and bolts are equivalent, in the following<br />

sense. If the bolts were sorted, we could match the nuts and bolts in O(n log n) time by performing<br />

a binary search with each nut. Thus, if we had an algorithm to sort the bolts in O(n log n) time,<br />

we would immediately have an algorithm to match the nuts and bolts, starting from scratch, in<br />

O(n log n) time. This process of assuming a solution to one problem and using it to solve another<br />

is called reduction—we can reduce the matching problem to the sorting problem in O(n log n) time.<br />

There is a reduction in the other direction, too. If the nuts and bolts were matched, we could<br />

sort them in O(n log n) time using, for example, merge sort. Thus, if we have an O(n log n) time<br />

algorithm for either sorting or matching nuts and bolts, we automatically have an O(n log n) time<br />

algorithm for the other problem.<br />
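The first reduction is easy to sketch (my own code, not from the notes; Python's `bisect` only ever compares a bolt in the list against the nut being searched for, which is a legal nut-bolt test):<br />

```python
import bisect

def match_given_sorted_bolts(nuts, sorted_bolts):
    """If the bolts are already sorted, match each nut by binary search:
    O(log n) nut-bolt comparisons per nut, O(n log n) in total."""
    return [(nut, sorted_bolts[bisect.bisect_left(sorted_bolts, nut)])
            for nut in nuts]
```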

Unfortunately, since we aren’t allowed to directly compare two bolts or two nuts, we can’t use<br />

heapsort or mergesort to sort the nuts and bolts in O(n log n) worst case time. In fact, the problem<br />

of sorting nuts and bolts deterministically in O(n log n) time was only ‘solved’ in 1995 5 , but both<br />

the algorithms and their analysis are incredibly technical, the constant hidden in the O(·) notation<br />

is extremely large, and worst of all, the solutions are nonconstructive—we know the algorithms<br />

exist, but we don’t know what they look like!<br />

Reductions will come up again later in the course when we start talking about lower bounds<br />

and NP-completeness.<br />

3.6 Recursive Analysis<br />

Intuitively, we can argue that our quicksort-like algorithm will usually choose a bolt of approximately<br />

median size, and so the average number of tests should be O(n log n). We can now finally<br />

formalize this intuition. To simplify the notation slightly, I’ll write T (n) in place of E[T (n)] everywhere.<br />

Our randomized matching/sorting algorithm chooses its pivot bolt uniformly at random from<br />

the set of unmatched bolts. Since the pivot bolt is equally likely to be the smallest, second smallest,<br />

5 János Komlós, Yuan Ma, and Endre Szemerédi, Sorting nuts and bolts in O(n log n) time, SIAM J. Discrete<br />

Math 11(3):347–372, 1998. See also Phillip G. Bradford, Matching nuts and bolts optimally, Technical Report<br />

MPI-I-95-1-025, Max-Planck-Institut für Informatik, September 1995. Bradford’s algorithm is slightly simpler.<br />


or kth smallest for any k, the expected number of tests performed by our algorithm is given by the<br />

following recurrence:<br />

T(n) = 2n − 1 + E_k[T(k − 1) + T(n − k)]<br />
     = 2n − 1 + (1/n) Σ_{k=1}^{n} (T(k − 1) + T(n − k))<br />

The base case is T (0) = 0. (We can save a few tests by setting T (1) = 1, but the analysis will be<br />

easier if we’re a little stupid.)<br />

Yuck. At this point, we could simply guess the solution, based on the incessant rumors that<br />

quicksort runs in O(n log n) time in the average case, and prove our guess correct by induction. A<br />

similar inductive proof appears in [CLR, pp. 166–167], but it was removed from the new edition<br />

[CLRS]. That’s okay; nobody ever understood that proof anyway.<br />

However, if we’re only interested in asymptotic bounds, we can afford to be a little conservative.<br />

What we’d really like is for the pivot bolt to be the median bolt, so that half the bolts are bigger<br />

and half the bolts are smaller. This isn’t very likely, but there is a good chance that the pivot bolt<br />

is close to the median bolt. Let’s say that a pivot bolt is good if it’s in the middle half of the final<br />

sorted set of bolts, that is, bigger than at least n/4 bolts and smaller than at least n/4 bolts. If the<br />

pivot bolt is good, then the worst split we can have is into one set of 3n/4 pairs and one set of n/4<br />

pairs. If the pivot bolt is bad, then our algorithm is still better than starting over from scratch.<br />

Finally, a randomly chosen pivot bolt is good with probability 1/2.<br />

These simple observations give us the following simple recursive upper bound for the expected<br />

running time of our algorithm:<br />

T(n) ≤ 2n − 1 + (1/2)(T(3n/4) + T(n/4)) + (1/2) · T(n)<br />

A little algebra simplifies this even further:<br />

T(n) ≤ 4n − 2 + T(3n/4) + T(n/4)<br />

We can easily solve this recurrence using the recursion tree method, giving us the completely<br />

unsurprising upper bound T(n) = O(n log n). Similar observations give us the matching lower<br />

bound T (n) = Ω(n log n).<br />

3.7 Iterative Analysis<br />

By making a simple change to our algorithm, which will have no effect on the number of tests, we<br />

can analyze it much more directly and exactly, without needing to solve a recurrence.<br />

The recursive subproblems solved by quicksort can be laid out in a binary tree, where each node<br />

corresponds to a subset of the nuts and bolts. In the usual recursive formulation, the algorithm<br />

partitions the nuts and bolts at the root, then the left child of the root, then the leftmost grandchild,<br />

and so forth, recursively sorting everything on the left before starting on the right subproblem.<br />

But we don’t have to solve the subproblems in this order. In fact, we can visit the nodes in<br />

the recursion tree in any order we like, as long as the root is visited first, and any other node is<br />

visited after its parent. Thus, we can recast quicksort in the following iterative form. Choose a<br />

pivot bolt, find its match, and partition the remaining nuts and bolts into two subsets. Then pick<br />

a second pivot bolt and partition whichever of the two subsets contains it. At this point, we have<br />


two matched pairs and three subsets of nuts and bolts. Continue choosing new pivot bolts and<br />

partitioning subsets, each time finding one match and increasing the number of subsets by one,<br />

until every bolt has been chosen as the pivot. At the end, every bolt has been matched, and the<br />

nuts and bolts are sorted.<br />

Suppose we always choose the next pivot bolt uniformly at random from the bolts that haven’t<br />

been pivots yet. Then no matter which subset contains this bolt, the pivot bolt is equally likely to<br />

be any bolt in that subset. That means our randomized iterative algorithm performs exactly the<br />

same set of tests as our randomized recursive algorithm, just in a different order.<br />

Now let Bi denote the ith smallest bolt, and Nj denote the jth smallest nut. For each i and j,<br />

define an indicator variable Xij that equals 1 if our algorithm compares Bi with Nj and zero<br />

otherwise. Then the total number of nut/bolt comparisons is exactly<br />

T(n) = Σ_{i=1}^{n} Σ_{j=1}^{n} X_ij.<br />

We are interested in the expected value of this double summation:<br />

E[T(n)] = E[ Σ_{i=1}^{n} Σ_{j=1}^{n} X_ij ] = Σ_{i=1}^{n} Σ_{j=1}^{n} E[X_ij].<br />

This equation uses a crucial property of random variables called linearity of expectation: for any<br />

random variables X and Y , the sum of their expectations is equal to the expectation of their sum:<br />

E[X + Y ] = E[X] + E[Y ]. To analyze our algorithm, we only need to compute the expected value<br />

of each Xij. By definition of expectation,<br />

E[Xij] = 0 · Pr[Xij = 0] + 1 · Pr[Xij = 1] = Pr[Xij = 1],<br />

so we just need to calculate Pr[Xij = 1] for all i and j.<br />

First let’s assume that i < j. The only comparisons our algorithm performs are between some<br />

pivot bolt (or its partner) and a nut (or bolt) in the same subset. The only thing that can prevent<br />

us from comparing Bi and Nj is if some intermediate bolt Bk, with i < k < j, is chosen as a pivot<br />

before Bi or Bj. In other words:<br />

Our algorithm compares Bi and Nj if and only if the first pivot chosen<br />

from the set {Bi, Bi+1, . . . , Bj} is either Bi or Bj.<br />

Since the set {Bi, Bi+1, . . . , Bj} contains j − i + 1 bolts, each of which is equally likely to be chosen<br />

first, we immediately have<br />

E[X_ij] = 2/(j − i + 1)   for all i < j.<br />

Symmetric arguments give us E[X_ij] = 2/(i − j + 1) for all i > j. Since our algorithm is a little stupid,<br />

every bolt is compared with its partner, so Xii = 1 for all i. (In fact, if a pivot bolt is the only bolt<br />

in its subset, we don’t need to compare it against its partner, but this improvement complicates<br />

the analysis.)<br />


Putting everything together, we get the following summation.<br />

E[T(n)] = Σ_{i=1}^{n} Σ_{j=1}^{n} E[X_ij]<br />
        = Σ_{i=1}^{n} E[X_ii] + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} E[X_ij]<br />
        = n + 4 Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1/(j − i + 1)<br />

This is quite a bit simpler than the recurrence we got before. In fact, with just a few lines of algebra,<br />

we can turn it into an exact, closed-form expression for the expected number of comparisons.<br />

E[T(n)] = n + 4 Σ_{i=1}^{n} Σ_{k=2}^{n−i+1} 1/k        [substitute k = j − i + 1]<br />
        = n + 4 Σ_{k=2}^{n} Σ_{i=1}^{n−k+1} 1/k        [reorder summations]<br />
        = n + 4 Σ_{k=2}^{n} (n − k + 1)/k<br />
        = n + 4 ( (n + 1) Σ_{k=2}^{n} 1/k − (n − 1) )<br />
        = n + 4((n + 1)(H_n − 1) − (n − 1))<br />
        = 4n·H_n − 7n + 4H_n<br />

Sure enough, it's Θ(n log n).<br />

*3.9 Masochistic Analysis<br />

If we’re feeling particularly masochistic, it is possible to solve the recurrence directly, all the way<br />

to an exact closed-form solution. [I’m including this only to show you it can be done; this won’t<br />

be on the test.] First we simplify the recurrence slightly by combining symmetric terms.<br />

T(n) = 2n − 1 + (1/n) Σ_{k=1}^{n} (T(k − 1) + T(n − k))<br />
     = 2n − 1 + (2/n) Σ_{k=0}^{n−1} T(k)<br />

We then convert this ‘full history’ recurrence into a ‘limited history’ recurrence by shifting, and<br />

subtracting away common terms. (I call this “Magic step #1”.) To make this slightly easier, we<br />


first multiply both sides of the recurrence by n to get rid of the fractions.<br />

nT (n) = 2n 2 n−1<br />

− n + 2<br />

<br />

T (k)<br />

k=0<br />

(n − 1)T (n − 1) = 2(n − 1) 2 − (n − 1)<br />

<br />

2n 2 −5n+3<br />

nT (n) − (n − 1)T (n − 1) = 4n − 3 + 2T (n − 1)<br />

T (n) = 4 − 3<br />

n<br />

n + 1<br />

+ T (n − 1)<br />

n<br />

n−2 <br />

+ 2<br />

To solve this limited-history recurrence, we define a new function t(n) = T (n)/(n + 1). (I call this<br />

“Magic step #2”.) This gives us an even simpler recurrence for t(n) in terms of t(n − 1):<br />

t(n) = T(n)/(n + 1)<br />
     = (1/(n + 1)) · (4 − 3/n + ((n + 1)/n)·T(n − 1))<br />
     = 4/(n + 1) − 3/(n(n + 1)) + t(n − 1)<br />
     = 7/(n + 1) − 3/n + t(n − 1)<br />

I used the technique of partial fractions (remember calculus?) to replace 1/(n(n + 1)) with 1/n − 1/(n + 1) in<br />
the last step. The base case for this recurrence is t(0) = 0. Once again, we have a recurrence that<br />

translates directly into a summation, which we can solve with just a few lines of algebra.<br />

t(n) = Σ_{i=1}^{n} (7/(i + 1) − 3/i)<br />
     = 7 Σ_{i=1}^{n} 1/(i + 1) − 3 Σ_{i=1}^{n} 1/i<br />
     = 7(H_{n+1} − 1) − 3H_n<br />
     = 4H_n − 7 + 7/(n + 1)<br />

The last step uses the recursive definition of the harmonic numbers: H_{n+1} = H_n + 1/(n + 1). Finally,<br />
substituting T(n) = (n + 1)·t(n) and simplifying gives us the exact solution to the original recurrence.<br />

T(n) = 4(n + 1)H_n − 7(n + 1) + 7 = 4n·H_n − 7n + 4H_n<br />

Surprise, surprise, we get exactly the same solution!<br />
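We can also confirm that the indicator-variable sum and the closed form 4n·H_n − 7n + 4H_n agree, with a few lines of exact arithmetic (my own sanity check, not part of the notes):<br />

```python
from fractions import Fraction

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, computed exactly."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

def expected_comparisons(n):
    """E[T(n)] = n + 4 * (sum over i < j of 1/(j - i + 1))."""
    return n + 4 * sum(Fraction(1, j - i + 1)
                       for i in range(1, n + 1)
                       for j in range(i + 1, n + 1))

def closed_form(n):
    """The exact solution: 4n*H_n - 7n + 4*H_n."""
    h = harmonic(n)
    return 4 * n * h - 7 * n + 4 * h
```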



<strong>CS</strong> <strong>373</strong> Lecture 4: Randomized Treaps Fall 2002<br />

I thought the following four [rules] would be enough, provided that I made a firm and<br />

constant resolution not to fail even once in the observance of them. The first was never<br />

to accept anything as true if I had not evident knowledge of its being so. . . . The second,<br />

to divide each problem I examined into as many parts as was feasible, and as was requisite<br />

for its better solution. The third, to direct my thoughts in an orderly way. . . establishing an<br />

order in thought even when the objects had no natural priority one to another. And the last,<br />

to make throughout such complete enumerations and such general surveys that I might be<br />

sure of leaving nothing out.<br />

— René Descartes, Discours de la Méthode (1637)<br />

4 Randomized Treaps (September 17)<br />

4.1 Treaps<br />

In this lecture, we will consider binary trees where every internal node has both a search key and a<br />

priority. In our examples, we will use letters for the search keys and numbers for the priorities. A<br />

treap is a binary tree where the inorder sequence of search keys is sorted and each node’s priority<br />

is smaller than the priorities of its children. 1 In other words, a treap is simultaneously a binary<br />

search tree for the search keys and a (min-)heap for the priorities.<br />

A treap. The top half of each node shows its search key and the bottom half shows its priority.<br />

I’ll assume from now on that all the keys and priorities are distinct. Under this assumption, we<br />

can easily prove by induction that the structure of a treap is completely determined by the search<br />

keys and priorities of its nodes. Since it’s a heap, the node v with highest priority must be the root.<br />

Since it’s also a binary search tree, any node u with key(u) < key(v) must be in the left subtree,<br />

and any node w with key(w) > key(v) must be in the right subtree. Finally, since the subtrees<br />

are treaps, by induction, their structures are completely determined. The base case is the trivial<br />

empty treap.<br />

Another way to describe the structure is that a treap is exactly the binary tree that results by<br />

inserting the nodes one at a time into an initially empty tree, in order of increasing priority, using<br />

the usual insertion algorithm. This is also easy to prove by induction.<br />

A third way interprets the keys and priorities as the coordinates of a set of points in the plane.<br />

The root corresponds to a T whose joint lies on the topmost point. The T splits the plane into three<br />

parts. The top part is (by definition) empty; the left and right parts are split recursively. This<br />

interpretation has some interesting applications in computational geometry, which (unfortunately)<br />

we probably won’t have time to talk about.<br />

1 Sometimes I hate English. Normally, ‘higher priority’ means ‘more important’, but ‘first priority’ is also more<br />

important than ‘second priority’. Maybe ‘posteriority’ would be better; one student suggested ‘unimportance’.<br />


A geometric interpretation of the same treap.<br />

Treaps were first discovered by Jean Vuillemin in 1980, but he called them Cartesian trees. 2<br />

The word ‘treap’ was first used by Edward McCreight around 1980 to describe a slightly different<br />

data structure, but he later switched to the more prosaic name priority search trees. 3 Treaps were<br />

rediscovered and used to build randomized search trees by Cecilia Aragon and Raimund Seidel in<br />

1989. 4 A different kind of randomized binary search tree, which uses random rebalancing instead<br />

of random priorities, was later discovered and analyzed by Conrado Martínez and Salvador Roura<br />

in 1996. 5<br />

4.2 Binary Search Tree Operations<br />

The search algorithm is the usual one for binary search trees. The time for a successful search is<br />

proportional to the depth of the node. The time for an unsuccessful search is proportional to the<br />

depth of either its successor or its predecessor.<br />

To insert a new node z, we start by using the standard binary search tree insertion algorithm<br />

to insert it at the bottom of the tree. At that point, the search keys still form a search tree, but the<br />

priorities may no longer form a heap. To fix the heap property, as long as z has smaller priority<br />

than its parent, perform a rotation at z. The running time is proportional to the depth of z before<br />

the rotations—we have to walk down the treap to insert z, and then walk back up the treap doing<br />

rotations. Another way to say this is that the time to insert z is roughly twice the time to perform<br />

an unsuccessful search for key(z).<br />
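A sketch of the insertion procedure (my own code and names, not from the notes; priorities default to random reals, anticipating Section 4.4):<br />

```python
import random

class Node:
    def __init__(self, key, priority=None):
        self.key = key
        self.priority = random.random() if priority is None else priority
        self.left = self.right = None

def rotate_right(v):
    u = v.left
    v.left = u.right
    u.right = v
    return u

def rotate_left(v):
    u = v.right
    v.right = u.left
    u.left = v
    return u

def insert(root, key, priority=None):
    """Insert at a leaf as in any binary search tree, then rotate the new
    node up until its parent has a smaller priority (min-heap property)."""
    if root is None:
        return Node(key, priority)
    if key < root.key:
        root.left = insert(root.left, key, priority)
        if root.left.priority < root.priority:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, priority)
        if root.right.priority < root.priority:
            root = rotate_left(root)
    return root
```

Inserting the keys from the example treap, in any order, reproduces the same structure, since the structure is determined by the keys and priorities alone.<br />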


Left to right: After inserting (S, 10), rotate it up to fix the heap property.<br />

Right to left: Before deleting (S, 10), rotate it down to make it a leaf.<br />


2 J. Vuillemin, A unifying look at data structures. Commun. ACM 23:229–239, 1980.<br />

3 E. M. McCreight. Priority search trees. SIAM J. Comput. 14(2):257–276, 1985.<br />

4 R. Seidel and C. R. Aragon. Randomized search trees. Algorithmica 16:464–497, 1996.<br />

5 C. Martínez and S. Roura. Randomized binary search trees. J. ACM 45(2):288-323, 1998. The results in this<br />

paper are virtually identical (including the constant factors!) to the corresponding results for treaps, although the<br />

analysis techniques are quite different.<br />


Deleting a node is exactly like inserting a node, but in reverse order. Suppose we want to delete<br />

node z. As long as z is not a leaf, perform a rotation at the child of z with smaller priority. This<br />

moves z down a level and its smaller-priority child up a level. The choice of which child to rotate<br />

preserves the heap property everywhere except at z. When z becomes a leaf, chop it off.<br />

We sometimes also want to split a treap T into two treaps T< and T> along some pivot key π,<br />

so that all the nodes in T< have keys less than π and all the nodes in T> have keys bigger than π.<br />

A simple way to do this is to insert a new node z with key(z) = π and priority(z) = −∞. After the<br />

insertion, the new node is the root of the treap. If we delete the root, the left and right sub-treaps<br />

are exactly the trees we want. The time to split at π is roughly twice the time to (unsuccessfully)<br />

search for π.<br />

Similarly, we may want to merge two treaps T< and T>, where every node in T< has a smaller<br />

search key than any node in T>, into one super-treap. Merging is just splitting in reverse—create<br />

a dummy root whose left sub-treap is T< and whose right sub-treap is T>, rotate the dummy node<br />

down to a leaf, and then cut it off.<br />

4.3 Analysis<br />

The cost of each of these operations is proportional to the depth d(v) of some node v in the treap.<br />

• Search: A successful search for key k takes O(d(v)) time, where v is the node with key(v) = k.<br />

For an unsuccessful search, let v − be the inorder predecessor of k (the node whose key is just<br />

barely smaller than k), and let v + be the inorder successor of k (the node whose key is just<br />

barely larger than k). Since the last node examined by the binary search is either v − or v + ,<br />

the time for an unsuccessful search is either O(d(v + )) or O(d(v − )).<br />

• Insert/Delete: Inserting a new node with key k takes either O(d(v + )) time or O(d(v − ))<br />

time, where v + and v − are the predecessor and successor of the new node. Deletion is just<br />

insertion in reverse.<br />

• Split/Merge: Splitting a treap at pivot value k takes either O(d(v + )) time or O(d(v − ))<br />

time, since it costs the same as inserting a new dummy root with search key k and priority<br />

−∞. Merging is just splitting in reverse.<br />

Since the depth of a node in a treap is Θ(n) in the worst case, each of these operations has a<br />

worst-case running time of Θ(n).<br />

4.4 Random Priorities<br />

A randomized binary search tree is a treap in which the priorities are independently and uniformly<br />

distributed continuous random variables. That means that whenever we insert a new search key<br />

into the treap, we generate a random real number between (say) 0 and 1 and use that number as<br />

the priority of the new node. The only reason we’re using real numbers is so that the probability<br />

of two nodes having the same priority is zero, since equal priorities make the analysis messy. In<br />

practice, we could just choose random integers from a large range, like 0 to 2³¹ − 1, or random<br />

floating point numbers. Also, since the priorities are independent, each node is equally likely to<br />

have the smallest priority.<br />

The cost of all the operations we discussed—search, insert, delete, split, join—is proportional to<br />

the depth of some node in the tree. Here we’ll see that the expected depth of any node is O(log n),<br />

which implies that the expected running time for any of those operations is also O(log n).<br />
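A quick experiment (my own sketch) builds random treaps directly from the structural characterization above — on each key range, the minimum-priority node is the root — and measures the average node depth:<br />

```python
import random

def depths(items, depth=0, out=None):
    """Depths of all treap nodes on a sorted range of (key, priority) pairs:
    the minimum-priority item is the root, and the two sub-ranges become
    its left and right subtreaps."""
    if out is None:
        out = []
    if items:
        r = min(range(len(items)), key=lambda i: items[i][1])
        out.append(depth)
        depths(items[:r], depth + 1, out)
        depths(items[r + 1:], depth + 1, out)
    return out

n = 1000
treap = [(key, random.random()) for key in range(n)]   # keys in sorted order
average_depth = sum(depths(treap)) / n
# 2 ln(1000) is about 13.8; the average depth should land below that.
```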


Let xk denote the node with the kth smallest search key. To analyze the expected depth, we<br />

define an indicator variable<br />

A^i_k = [ x_i is a proper ancestor of x_k ].<br />

(The superscript doesn't mean power in this case; it's just a reminder of which node is supposed to<br />

be further up in the tree.) Since the depth d(v) of v is just the number of proper ancestors of v,<br />

we have the following identity:<br />

d(x_k) = Σ_{i=1}^{n} A^i_k.<br />

Now we can express the expected depth of a node in terms of these indicator variables as follows.<br />

E[d(x_k)] = Σ_{i=1}^{n} Pr[A^i_k = 1].<br />

(Just as in our analysis of matching nuts and bolts in Lecture 3, we’re using linearity of expectation<br />

and the fact that E[X] = Pr[X = 1] for any indicator variable X.) So to compute the expected<br />

depth of a node, we just have to compute the probability that some node is a proper ancestor of<br />

some other node.<br />

Fortunately, we can do this easily once we prove a simple structural lemma. Let X(i, k) denote<br />

either the subset of treap nodes {xi, xi+1, . . . , xk} or the subset {xk, xk+1, . . . , xi}, depending on<br />

whether i < k or i > k. X(i, k) and X(k, i) always denote precisely the same subset, and this subset<br />

contains |k − i| + 1 nodes. X(1, n) = X(n, 1) contains all n nodes in the treap.<br />

Lemma 1. For all i ≠ k, xi is a proper ancestor of xk if and only if xi has the smallest priority<br />

among all nodes in X(i, k).<br />

Proof: If xi is the root, then it is an ancestor of xk, and by definition, it has the smallest priority<br />

of any node in the treap, so it must have the smallest priority in X(i, k).<br />

On the other hand, if xk is the root, then xi is not an ancestor of xk, and indeed xi does not<br />

have the smallest priority in X(i, k) — xk does.<br />

On the gripping hand 6 , suppose some other node xj is the root. If xi and xk are in different<br />

subtrees, then either i < j < k or i > j > k, so xj ∈ X(i, k). In this case, xi is not an ancestor<br />

of xk, and indeed xi does not have the smallest priority in X(i, k) — xj does.<br />

Finally, if xi and xk are in the same subtree, the lemma follows inductively (or, if you prefer,<br />

recursively), since the subtree is a smaller treap. The empty treap is the trivial base case. <br />

Since each node in X(i, k) is equally likely to have smallest priority, we immediately have the probability we wanted:<br />

\[ \Pr[A^i_k = 1] \;=\; \frac{[\,i \ne k\,]}{|k - i| + 1} \;=\; \begin{cases} \dfrac{1}{k - i + 1} & \text{if } i < k \\[1ex] 0 & \text{if } i = k \\[1ex] \dfrac{1}{i - k + 1} & \text{if } i > k \end{cases} \]<br />

6 See Larry Niven and Jerry Pournelle, The Gripping Hand, Pocket Books, 1994.<br />


To compute the expected depth of a node xk, we just plug this probability into our formula and grind through the algebra.<br />

\begin{align*} \mathrm{E}[d(x_k)] &= \sum_{i=1}^{n} \Pr[A^i_k = 1] \\ &= \sum_{i=1}^{k-1} \frac{1}{k - i + 1} \;+\; \sum_{i=k+1}^{n} \frac{1}{i - k + 1} \\ &= \sum_{j=2}^{k} \frac{1}{j} \;+\; \sum_{j=2}^{n-k+1} \frac{1}{j} \\ &= H_k - 1 + H_{n-k+1} - 1 \\ &< \ln k + \ln(n - k + 1) - 2 \\ &< 2 \ln n - 2. \end{align*}<br />

In conclusion, every search, insertion, deletion, split, and merge operation in an n-node randomized<br />

binary search tree takes O(log n) expected time.<br />

Since a treap is exactly the binary tree that results when you insert the keys in order of increasing<br />

priority, a randomized treap is the result of inserting the keys in random order. So our analysis also<br />

automatically gives us the expected depth of any node in a binary tree built by random insertions<br />

(without using priorities).<br />
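This bound is easy to sanity-check empirically. The sketch below (mine, not part of the notes) builds binary search trees by inserting keys in random order — which, as just noted, yields exactly the trees a randomized treap would build — and compares the average node depth against 2 ln n.

```python
import math
import random

def insertion_depths(keys):
    """Insert keys into an unbalanced BST in the given order and
    record the depth at which each key ends up."""
    root = None                       # a node is a list: [key, left, right]
    depths = {}
    for k in keys:
        depth, node, parent, went_left = 0, root, None, False
        while node is not None:
            parent = node
            went_left = k < node[0]
            node = node[1] if went_left else node[2]
            depth += 1
        leaf = [k, None, None]
        if parent is None:
            root = leaf
        elif went_left:
            parent[1] = leaf
        else:
            parent[2] = leaf
        depths[k] = depth
    return depths

random.seed(373)
n, trials, total = 1024, 50, 0.0
for _ in range(trials):
    keys = list(range(n))
    random.shuffle(keys)              # random insertion order = random priorities
    total += sum(insertion_depths(keys).values()) / n
avg = total / trials
print(avg, "<", 2 * math.log(n))      # average depth stays under the 2 ln n bound
```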

4.5 Randomized Quicksort (Again?!)<br />

We’ve already seen two completely different ways of describing randomized quicksort. The first<br />

is the familiar recursive one: choose a random pivot, partition, and recurse. The second is a<br />

less familiar iterative version: repeatedly choose a new random pivot, partition whatever subset<br />

contains it, and continue. But there’s a third way to describe randomized quicksort, this time in<br />

terms of binary search trees.<br />

RandomizedQuicksort:<br />

T ← an empty binary search tree<br />

insert the keys into T in random order<br />

output the inorder sequence of keys in T<br />
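As a concrete illustration (my own sketch, not part of the notes), RandomizedQuicksort can be written in a few lines of Python with an explicit unbalanced binary search tree and an inorder traversal:

```python
import random

def bst_sort(items):
    """Sort by inserting items into an unbalanced BST in random order,
    then reading the keys back inorder -- quicksort in disguise."""
    items = list(items)
    random.shuffle(items)             # random insertion order = random pivots
    root = None                       # a node is a list: [key, left, right]
    for k in items:
        if root is None:
            root = [k, None, None]
            continue
        node = root
        while True:
            side = 1 if k < node[0] else 2
            if node[side] is None:
                node[side] = [k, None, None]
                break
            node = node[side]
    out, stack, node = [], [], root
    while stack or node:              # iterative inorder traversal
        while node:
            stack.append(node)
            node = node[1]
        node = stack.pop()
        out.append(node[0])
        node = node[2]
    return out

random.seed(42)
print(bst_sort([5, 3, 8, 1, 9, 2, 7]))  # → [1, 2, 3, 5, 7, 8, 9]
```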

Our treap analysis tells us that this algorithm will run in O(n log n) expected time, since each<br />

key is inserted in O(log n) expected time.<br />

Why is this quicksort? Just like last time, all we’ve done is rearrange the order of the comparisons.<br />

Intuitively, the binary tree is just the recursion tree created by the normal version of<br />

quicksort. In the recursive formulation, we compare the initial pivot against everything else and<br />

then recurse. In the binary tree formulation, the first “pivot” becomes the root of the tree without<br />

any comparisons, but then later as each other key is inserted into the tree, it is compared against<br />

the root. Either way, the first pivot chosen is compared with everything else. The partition splits<br />

the remaining items into a left subarray and a right subarray; in the binary tree version, these are<br />

exactly the items that go into the left subtree and the right subtree. Since both algorithms define<br />

the same two subproblems, by induction, both algorithms perform the same comparisons.<br />

We even saw the probability 1/(|k − i| + 1) before, when we were talking about sorting nuts and bolts with a variant of randomized quicksort. In the more familiar setting of sorting an array of numbers, the probability that randomized quicksort compares the ith largest and kth largest elements is exactly 2/(|k − i| + 1). The binary tree version compares xi and xk if and only if xi is an ancestor of xk or vice versa, so the probabilities are exactly the same.<br />



<strong>CS</strong> <strong>373</strong> Lecture 5: Randomized Minimum Cuts Fall 2002<br />

Jaques: But, for the seventh cause; how did you find the quarrel on the seventh cause?<br />

Touchstone: Upon a lie seven times removed:–bear your body more seeming, Audrey:–as<br />

thus, sir. I did dislike the cut of a certain courtier’s beard: he sent me word, if I<br />

said his beard was not cut well, he was in the mind it was: this is called the Retort<br />

Courteous. If I sent him word again ‘it was not well cut,’ he would send me word, he<br />

cut it to please himself: this is called the Quip Modest. If again ‘it was not well cut,’<br />

he disabled my judgment: this is called the Reply Churlish. If again ‘it was not well<br />

cut,’ he would answer, I spake not true: this is called the Reproof Valiant. If again ‘it<br />

was not well cut,’ he would say I lied: this is called the Counter-cheque Quarrelsome:<br />

and so to the Lie Circumstantial and the Lie Direct.<br />

Jaques: And how oft did you say his beard was not well cut?<br />

Touchstone: I durst go no further than the Lie Circumstantial, nor he durst not give me<br />

the Lie Direct; and so we measured swords and parted.<br />

— William Shakespeare, As You Like It, Act V, Scene 4 (1600)<br />

5 Randomized Minimum Cut (September 19)<br />

5.1 Setting Up the Problem<br />

This lecture considers a problem that arises in robust network design. Suppose we have a connected<br />

multigraph 1 G representing a communications network like the UIUC telephone system, the<br />

internet, or Al-Qaeda. In order to disrupt the network, an enemy agent plans to remove some of<br />

the edges in this multigraph (by cutting wires, placing police at strategic drop-off points, or paying<br />

street urchins to ‘lose’ messages) to separate it into multiple components. Since his country is currently<br />

having an economic crisis, the agent wants to remove as few edges as possible to accomplish<br />

this task.<br />

More formally, a cut partitions the nodes of G into two nonempty subsets. The size of the cut<br />

is the number of crossing edges, which have one endpoint in each subset. Finally, a minimum cut in<br />

G is a cut with the smallest number of crossing edges. The same graph may have several minimum<br />

cuts.<br />

A multigraph whose minimum cut has three edges.<br />

This problem has a long history. The classical deterministic algorithms for this problem rely on<br />

network flow techniques, which are discussed in Chapter 26 of CLRS. The fastest such algorithms<br />

run in O(n³) time and are quite complex and difficult to understand (unless you’re already familiar<br />

with network flows). Here I’ll describe a relatively simple randomized algorithm published by David Karger 2 , who was a Ph.D. student at Stanford at the time.<br />

Karger’s algorithm uses a primitive operation called collapsing an edge. Suppose u and v are<br />

vertices that are connected by an edge in some multigraph G. To collapse the edge {u, v}, we<br />

create a new node called uv, replace any edge of the form {u, w} or {v, w} with a new edge {uv, w}, and<br />

1 A multigraph allows multiple edges between the same pair of nodes. Everything in this lecture could be rephrased<br />

in terms of simple graphs where every edge has a non-negative weight, but this would make the algorithms and analysis<br />

slightly more complicated.<br />

2 D. R. Karger ∗ . Random sampling in cut, flow, and network design problems. Proc. 26th STOC, 648–657, 1994.<br />


then delete the original vertices u and v. Equivalently, collapsing the edge shrinks the edge down<br />

to nothing, pulling the two endpoints together. The new collapsed graph is denoted G/{u, v}. We<br />

don’t allow self-loops in our multigraphs; if there are multiple edges between u and v, collapsing<br />

any one of them deletes them all.<br />

A graph G and two collapsed graphs G/{b, e} and G/{c, d}.<br />

I won’t describe how to actually implement collapsing an edge—it will be a homework exercise<br />

later in the course—but it can be done in O(n) time. Let’s just accept collapsing as a black box<br />

subroutine for now.<br />

The correctness of our algorithms will eventually boil down to the following simple observation: For any cut in G/{u, v}, there is a cut in G with exactly the same number of crossing edges. In fact, in some sense, the ‘same’ edges form the cut in both graphs. The converse is not necessarily<br />

true, however. For example, in the picture above, the original graph G has a cut of size 1, but the<br />

collapsed graph G/{c, d} does not.<br />

This simple observation has two immediate but important consequences. First, collapsing an<br />

edge cannot decrease the minimum cut size. More importantly, collapsing an edge increases the<br />

minimum cut size if and only if that edge is part of every minimum cut.<br />

5.2 Blindly Guessing<br />

Let’s start with an algorithm that tries to guess the minimum cut by randomly collapsing edges<br />

until the graph has only two vertices left.<br />

GuessMinCut(G):<br />

for i ← n downto 2<br />

pick a random edge e in G<br />

G ← G/e<br />

return the only cut in G<br />
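Here is one way to realize GuessMinCut in Python (a sketch of mine, not from the notes): a union-find structure stands in for explicit edge collapsing, and self-loops are discarded as they are drawn, so each surviving non-loop edge is chosen uniformly.

```python
import random

def guess_min_cut(edges):
    """One run of GuessMinCut: collapse uniformly random edges until only
    two super-vertices remain, then return the crossing edges."""
    parent = {}                        # union-find over the vertices
    for u, v in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)

    def find(x):                       # find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    groups = len(parent)
    pool = list(edges)
    while groups > 2:
        u, v = random.choice(pool)
        ru, rv = find(u), find(v)
        if ru == rv:
            pool.remove((u, v))        # a self-loop: discard, don't collapse
            continue
        parent[ru] = rv                # collapse the edge {u, v}
        groups -= 1
    return [(u, v) for u, v in edges if find(u) != find(v)]

# Two triangles joined by a single bridge: the minimum cut is that one edge.
random.seed(373)
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
best = min((guess_min_cut(edges) for _ in range(60)), key=len)
print(best)                            # almost certainly the bridge edge
```

Repeating the guess and keeping the smallest cut found is exactly the amplification idea developed in the next section.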

Since each collapse takes O(n) time, this algorithm runs in O(n²) time. Our earlier observations<br />

imply that as long as we never collapse an edge that lies in every minimum cut, our algorithm will<br />

actually guess correctly. But how likely is that?<br />

Suppose G has only one minimum cut—if it actually has more than one, just pick your favorite—<br />

and this cut has size k. Every vertex of G must lie on at least k edges; otherwise, we could separate<br />

that vertex from the rest of the graph with an even smaller cut. Thus, the number of incident<br />

vertex-edge pairs is at least kn. Since every edge is incident to exactly two vertices, G must have at<br />

least kn/2 edges. That implies that if we pick an edge in G uniformly at random, the probability<br />

of picking an edge in the minimum cut is at most 2/n. In other words, the probability that we<br />

don’t screw up on the very first step is at least 1 − 2/n.<br />

Once we’ve collapsed the first random edge, the rest of the algorithm proceeds recursively (with<br />

independent random choices) on the remaining (n − 1)-node graph. So the overall probability that<br />


GuessMinCut returns the true minimum cut is given by the following recurrence:<br />

\[ P(n) \;\ge\; \frac{n-2}{n} \cdot P(n-1). \]<br />

The base case for this recurrence is P(2) = 1. We can immediately expand this recurrence into a product, most of whose factors cancel out immediately.<br />

\[ P(n) \;\ge\; \prod_{i=3}^{n} \frac{i-2}{i} \;=\; \frac{\prod_{i=3}^{n} (i-2)}{\prod_{i=3}^{n} i} \;=\; \frac{\prod_{i=1}^{n-2} i}{\prod_{i=3}^{n} i} \;=\; \frac{2}{n(n-1)} \]<br />

5.3 Blindly Guessing Over and Over<br />

That’s not very good. Fortunately, there’s a simple method for increasing our chances of finding the<br />

minimum cut: run the guessing algorithm many times and return the smallest guess. Randomized<br />

algorithms folks like to call this idea amplification.<br />

KargerMinCut(G):<br />

mink ← ∞<br />

for i ← 1 to N<br />

X ← GuessMinCut(G)<br />

if |X| < mink<br />

mink ← |X|<br />

minX ← X<br />

return minX<br />
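The outer loop is a one-liner given any guessing subroutine. A hedged sketch (mine, not from the notes; `guess` is a stand-in for any randomized cut-guessing function, such as an implementation of GuessMinCut):

```python
import math
import random

def karger_min_cut(guess, edges, c=2):
    """Run a randomized cut-guessing function N = c * C(n,2) * ln n times
    and keep the smallest cut found (the amplification trick)."""
    n = len({v for e in edges for v in e})
    N = int(math.ceil(c * n * (n - 1) / 2 * math.log(n)))
    return min((guess(edges) for _ in range(N)), key=len)

# A toy stand-in guess: random bipartition of the vertices. It is a much
# weaker guesser than GuessMinCut, but it exercises the amplification loop.
def random_partition_cut(edges):
    vertices = list({v for e in edges for v in e})
    side = {v: random.random() < 0.5 for v in vertices}
    if len(set(side.values())) == 1:          # both sides must be nonempty
        side[vertices[0]] = not side[vertices[0]]
    return [(u, v) for u, v in edges if side[u] != side[v]]

random.seed(373)
square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]  # min cut = 2
print(len(karger_min_cut(random_partition_cut, square)))
```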

Both the running time and the probability of success will depend on the number of iterations N,<br />

which we haven’t specified yet.<br />

First let’s figure out the probability that KargerMinCut returns the actual minimum cut.<br />

The only way for the algorithm to return the wrong answer is if GuessMinCut fails N times in<br />

a row. Since each guess is independent, our probability of success is at least<br />

\[ 1 - \left( 1 - \frac{2}{n(n-1)} \right)^{\!N}. \]<br />

We can simplify this using one of the most important (and easy) inequalities known to mankind:<br />

\[ 1 - x \le e^{-x}. \]<br />

So our success probability is at least<br />

\[ 1 - e^{-2N/n(n-1)}. \]<br />

By making N larger, we can make this probability arbitrarily close to 1, but never equal to 1. In particular, if we set N = c\binom{n}{2}\ln n for some constant c, then KargerMinCut is correct with probability at least<br />

\[ 1 - e^{-c \ln n} = 1 - \frac{1}{n^{c}}. \]<br />



When the failure probability is a polynomial fraction, we say that the algorithm is correct with<br />

high probability. Thus, KargerMinCut computes the minimum cut of any n-node graph in<br />

O(n⁴ log n) time.<br />

If we make the number of iterations even larger, say N = n²(n − 1)/2, the success probability becomes 1 − e^{−n}. When the failure probability is exponentially small like this, we say that<br />

the algorithm is correct with very high probability. In practice, very high probability is usually<br />

overkill; high probability is enough. (Remember, there is a small but non-zero probability that<br />

your computer will transform itself into a kitten before your program is finished.)<br />

5.4 Not-So-Blindly Guessing<br />

The O(n⁴ log n) running time is actually comparable to some of the simpler flow-based algorithms,<br />

but it’s nothing to get excited about. But we can improve our guessing algorithm, and thus decrease<br />

the number of iterations in the outer loop, by observing that as the graph shrinks, the probability<br />

of collapsing an edge in the minimum cut increases. At first the probability is quite small, only<br />

2/n, but near the end of execution, when the graph has only three vertices, we have a 2/3 chance<br />

of screwing up!<br />

A simple technique for working around this increasing probability of error was developed by<br />

David Karger and Cliff Stein. 3 Their idea is to group the first several random collapses into a ‘safe’ phase,<br />

so that the cumulative probability of screwing up is small—less than 1/2, say—and a ‘dangerous’<br />

phase, which is much more likely to screw up.<br />

The safe phase shrinks the graph from n nodes to n/√2 nodes, 4 using a sequence of n − n/√2 random collapses. Following our earlier analysis, the probability that any of these safe collapses touches the minimum cut is at most<br />

\[ 1 - \prod_{i=n/\sqrt{2}+1}^{n} \frac{i-2}{i} \;=\; 1 - \frac{(n/\sqrt{2})\bigl((n/\sqrt{2}) - 1\bigr)}{n(n-1)} \;<\; \frac{1}{2}. \]<br />

Now, to get around the danger of the dangerous phase, we use amplification. Instead of running through the dangerous phase once, we run it twice and keep the best of the two answers. And of<br />

course, we treat the dangerous phase recursively, so we actually obtain a larger binary recursion<br />

tree, which gets wider and wider as we get closer to the base case, instead of a single path. More<br />

formally, the algorithm looks like this:<br />

Contract(G, m):<br />

for i ← n downto m<br />

pick a random edge e in G<br />

G ← G/e<br />

return G<br />

BetterGuess(G):<br />

X1 ← BetterGuess(Contract(G, n/ √ 2))<br />

X2 ← BetterGuess(Contract(G, n/ √ 2))<br />

return min{X1, X2}<br />

This might look like we’re just doing the same thing twice, but remember that Contract (and<br />

thus BetterGuess) is randomized. Each call to Contract contracts a different random set of<br />

edges; X1 and X2 are almost always different cuts.<br />
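The whole recursive scheme fits in a short, self-contained sketch (mine, not from the notes): it tracks only cut sizes, relabels edges after each contraction, and handles the tiny base case with repeated blind guesses rather than exact enumeration.

```python
import math
import random

def contract(edges, target):
    """Contract(G, m): collapse random edges of the multigraph until only
    `target` super-vertices remain; return the surviving, relabeled edges."""
    label = {}                         # vertex -> current super-vertex
    for u, v in edges:
        label.setdefault(u, u)
        label.setdefault(v, v)
    n = len(label)
    edges = list(edges)
    while n > target:
        u, v = random.choice(edges)
        old, new = label[u], label[v]
        if old == new:
            edges.remove((u, v))       # self-loop: drop it
            continue
        for w in label:                # merge super-vertex `old` into `new`
            if label[w] == old:
                label[w] = new
        edges = [e for e in edges if label[e[0]] != label[e[1]]]
        n -= 1
    return [(label[u], label[v]) for u, v in edges]

def better_guess(edges):
    """BetterGuess: contract safely to about n/sqrt(2) nodes, then recurse
    twice on the dangerous phase; returns an estimated minimum cut size."""
    n = len({v for e in edges for v in e})
    if n <= 6:                         # tiny base case: blind-guess a lot
        return min(len(contract(edges, 2)) for _ in range(2 * n * n))
    m = int(math.ceil(1 + n / math.sqrt(2)))
    x1 = better_guess(contract(edges, m))
    x2 = better_guess(contract(edges, m))
    return min(x1, x2)

random.seed(373)
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
best = better_guess(edges)
print(best)                            # the single bridge edge has cut size 1
```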

3 D. R. Karger ∗ and C. Stein. An Õ(n²) algorithm for minimum cuts. Proc. 25th STOC, 757–765, 1993. Yes, that Cliff Stein.<br />
4 Strictly speaking, I should say ⌊n/√2⌋, because otherwise, we’d have an irrational number of nodes! But this only adds some floors and ceilings to our recurrences, which we know that we can remove with domain transformations, so I’ll just take them out now.<br />


BetterGuess correctly returns the minimum cut unless both of the first two lines give the wrong result. A single call to BetterGuess(Contract(G, n/√2)) gives the correct answer if none of the min cut edges are Contracted and if the recursive BetterGuess is correct. So we have the following recurrence for the probability of success for an n-node graph:<br />

\[ P(n) \;\ge\; 1 - \left( 1 - \frac{1}{2}\, P\!\left( \frac{n}{\sqrt{2}} \right) \right)^{\!2} \;=\; P\!\left( \frac{n}{\sqrt{2}} \right) - \frac{1}{4}\, P\!\left( \frac{n}{\sqrt{2}} \right)^{\!2} \]<br />

I don’t know how to derive it from scratch, but we can easily prove by induction that P(n) ≥ 1/lg n satisfies this recurrence. The base case is P(2) = 1 = 1/lg 2.<br />

\begin{align*} P(n) &\ge P\!\left( \frac{n}{\sqrt{2}} \right) - \frac{1}{4}\, P\!\left( \frac{n}{\sqrt{2}} \right)^{\!2} \\ &= \frac{1}{\lg(n/\sqrt{2})} - \frac{1}{4 \lg^{2}(n/\sqrt{2})} \\ &= \frac{1}{\lg n - \frac{1}{2}} - \frac{1}{4 \left( \lg n - \frac{1}{2} \right)^{2}} \\ &= \frac{4 \lg n - 3}{4 \lg^{2} n - 4 \lg n + 1} \\ &= \frac{1}{\lg n} + \frac{1 - \frac{1}{\lg n}}{4 \lg^{2} n - 4 \lg n + 1} \\ &> \frac{1}{\lg n} \end{align*}<br />

The last step used the fact that n > 2; otherwise, that ugly fraction we threw out might be zero or<br />

negative.<br />

For the running time, we get a simple recurrence that is easily solved using the Master theorem.<br />

\[ T(n) = O(n^2) + 2\, T\!\left( \frac{n}{\sqrt{2}} \right) = O(n^2 \log n) \]<br />

So all this splitting and recursing has slowed down the guessing algorithm slightly, but the probability<br />

of failure is exponentially smaller!<br />

Now if we call BetterGuess N = c(lg n)(ln n) times, for some constant c, the overall probability of success is<br />

\[ 1 - \left( 1 - \frac{1}{\lg n} \right)^{\!c (\lg n)(\ln n)} \;\ge\; 1 - e^{-c \ln n} \;=\; 1 - \frac{1}{n^{c}}. \]<br />

In other words, we now have an algorithm that computes the minimum cut with high probability in only O(n² log³ n) time!<br />



<strong>CS</strong> <strong>373</strong> Lecture 6: Hash Tables Fall 2002<br />

6 Hash Tables (September 24)<br />

6.1 Introduction<br />

Aitch Ex<br />

Are Eye<br />

Ay Gee<br />

Bee Jay<br />

Cue Kay<br />

Dee Oh<br />

Double U Pea<br />

Ee See<br />

Ef Tee<br />

El Vee<br />

Em Wy<br />

En Yu<br />

Ess Zee<br />

— Sidney Harris, “The Alphabet in Alphabetical Order”<br />

A hash table is a data structure for storing a set of items, so that we can quickly determine whether<br />

an item is or is not in the set. The basic idea is to pick a hash function h that maps every possible<br />

item x to a small integer h(x). Then we store x in slot h(x) in an array. The array is the hash<br />

table.<br />

Let’s be a little more specific. We want to store a set of n items. Each item is an element of<br />

some finite 1 set U called the universe; we use u to denote the size of the universe, which is just<br />

the number of items in U. A hash table is an array T [0 .. m − 1], where m is another positive integer,<br />

which we call the table size. Typically, m is much smaller than u. A hash function is a function<br />

h: U → {0, 1, . . . , m − 1}<br />

that maps each possible item in U to a slot in the hash table. We say that an item x hashes to the<br />

slot T [h(x)].<br />

Of course, if u = m, then we can always just use the trivial hash function h(x) = x. In other<br />

words, use the item itself as the index into the table. This is called a direct access table (or more<br />

commonly, an array). In most applications, though, the universe of possible keys is orders of<br />

magnitude too large for this approach to be practical. Even when it is possible to allocate enough<br />

memory, we usually need to store only a small fraction of the universe. Rather than wasting lots of<br />

space, we should make m roughly equal to n, the number of items in the set we want to maintain.<br />

What we’d like is for every item in our set to hash to a different position in the array. Unfortunately,<br />

unless m = u, this is too much to hope for, so we have to deal with collisions. We say<br />

that two items x and y collide if they have the same hash value: h(x) = h(y). Since we obviously<br />

can’t store two items in the same slot of an array, we need to describe some methods for resolving<br />

collisions. The two most common methods are called chaining and open addressing.<br />

1<br />

This finiteness assumption is necessary for several of the technical details to work out, but can be ignored in<br />

practice. To hash elements from an infinite universe (for example, the positive integers), pretend that the universe<br />

is actually finite but very very large. In fact, in real practice, the universe actually is finite but very very large. For<br />

example, on most modern computers, there are only 2⁶⁴ integers (unless you use a big integer package like GMP, in which case the number of integers is closer to 2^(2^32).)<br />


6.2 Chaining<br />

In a chained hash table, each entry T [i] is not just a single item, but rather (a pointer to) a linked<br />

list of all the items that hash to T [i]. Let ℓ(x) denote the length of the list T [h(x)]. To see if<br />

an item x is in the hash table, we scan the entire list T [h(x)]. The worst-case time required to<br />

search for x is O(1) to compute h(x) plus O(1) for every element in T [h(x)], or O(1 + ℓ(x)) overall.<br />

Inserting and deleting x also take O(1 + ℓ(x)) time.<br />
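A chained table takes only a few lines of code. This sketch (mine, not from the notes) uses Python lists as chains and the built-in hash in place of a carefully chosen h:

```python
class ChainedHashSet:
    """A minimal chained hash table: each slot T[i] holds a list (the
    "chain") of all items that hash to i."""

    def __init__(self, m=8):
        self.m = m
        self.table = [[] for _ in range(m)]   # T[0 .. m-1], each a chain

    def _slot(self, x):
        return hash(x) % self.m               # stand-in for a good hash h(x)

    def insert(self, x):
        chain = self.table[self._slot(x)]
        if x not in chain:                    # O(1 + len(chain)) scan
            chain.append(x)

    def contains(self, x):
        return x in self.table[self._slot(x)]

    def delete(self, x):
        chain = self.table[self._slot(x)]
        if x in chain:
            chain.remove(x)

s = ChainedHashSet()
for word in ["treap", "cut", "hash", "heap"]:
    s.insert(word)
print(s.contains("hash"), s.contains("quicksort"))  # → True False
```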

A chained hash table with load factor 1.<br />

In the worst case, every item would be hashed to the same value, so we’d get just one long list<br />

of n items. In principle, for any deterministic hashing scheme, a malicious adversary can always<br />

present a set of items with exactly this property. In order to defeat such malicious behavior, we’d<br />

like to use a hash function that is as random as possible. Choosing a truly random hash function<br />

is completely impractical, but since there are several heuristics for producing hash functions that<br />

behave close to randomly (on real data), we will analyze the performance as though our hash<br />

function were completely random. More formally, we make the following assumption.<br />

Simple uniform hashing assumption: If x ≠ y then Pr [h(x) = h(y)] = 1/m.<br />

Let’s compute the expected value of ℓ(x) under the simple uniform hashing assumption; this will immediately imply a bound on the expected time to search for an item x. To be concrete, let’s suppose that x is not already stored in the hash table. For all items x and y, we define the indicator variable<br />

\[ C_{x,y} = \bigl[\, h(x) = h(y) \,\bigr]. \]<br />

(In case you’ve forgotten the bracket notation, C_{x,y} = 1 if h(x) = h(y) and C_{x,y} = 0 if h(x) ≠ h(y).) Since the length of T [h(x)] is precisely equal to the number of items that collide with x, we have<br />

\[ \ell(x) = \sum_{y \in S} C_{x,y}, \]<br />

where S is the set of n items stored in the table. We can rewrite the simple uniform hashing assumption as follows:<br />

\[ x \ne y \implies \mathrm{E}[C_{x,y}] = \frac{1}{m}. \]<br />

Now we just have to grind through the definitions.<br />

\[ \mathrm{E}[\ell(x)] = \sum_{y \in S} \mathrm{E}[C_{x,y}] = \frac{n}{m} \]<br />

We call this fraction n/m the load factor of the hash table. Since the load factor shows up everywhere, we will give it its own symbol α:<br />

\[ \alpha = \frac{n}{m} \]<br />

Our analysis implies that the expected time for an unsuccessful search in a chained hash table is<br />

Θ(1+α). As long as the number of items n is only a constant factor bigger than the table size m, the<br />

search time is a constant. A similar analysis gives the same expected time bound 2 for a successful<br />

search.<br />

Obviously, linked lists are not the only data structure we could use to store the chains; any data<br />

structure that can store a set of items will work. For example, if the universe U has a total ordering,<br />

we can store each chain in a balanced binary search tree. This reduces the worst-case time for a<br />

search to O(1 + log ℓ(x)), and under the simple uniform hashing assumption, the expected time for<br />

a search is O(1 + log α).<br />

Another possibility is to keep the overflow lists in hash tables! Specifically, for each T [i], we<br />

maintain a hash table Ti containing all the items with hash value i. To keep things efficient, we<br />

make sure the load factor of each secondary hash table is always a constant less than 1; this can<br />

be done with only constant amortized overhead. 3 Since the load factor is constant, a search in<br />

any secondary table always takes O(1) expected time, so the total expected time to search in the<br />

top-level hash table is also O(1).<br />

6.3 Open Addressing<br />

Another method we can use to resolve collisions is called open addressing. Here, rather than building<br />

secondary data structures, we resolve collisions by looking elsewhere in the table. Specifically, we<br />

have a sequence of hash functions 〈h0, h1, h2, . . . , hm−1〉, such that for any item x, the probe sequence<br />

〈h0(x), h1(x), . . . , hm−1(x)〉 is a permutation of 〈0, 1, 2, . . . , m − 1〉. In other words, different hash<br />

functions in the sequence always map x to different locations in the hash table.<br />

We search for x using the following algorithm, which returns the array index i if T [i] = x,<br />

‘absent’ if x is not in the table but there is an empty slot, and ‘full’ if x is not in the table and<br />

there are no empty slots.<br />

OpenAddressSearch(x):<br />

for i ← 0 to m − 1<br />

if T [hi(x)] = x<br />

return hi(x)<br />

else if T [hi(x)] = ∅<br />

return ‘absent’<br />

return ‘full’<br />

The algorithm for inserting a new item into the table is similar; only the second-to-last line is<br />

changed to T [hi(x)] ← x. Notice that for an open-addressed hash table, the load factor is never<br />

bigger than 1.<br />

Just as with chaining, we’d like the sequence of hash values to be random, and for purposes<br />

of analysis, there is a stronger uniform hashing assumption that gives us constant expected search<br />

and insertion time.<br />

Strong uniform hashing assumption: For any item x, the probe sequence 〈h0(x),<br />

h1(x), . . . , hm−1(x)〉 is equally likely to be any permutation of the set {0, 1, 2, . . . , m−1}.<br />

2 but with smaller constants hidden in the O()—see p.225 of CLR for details.<br />

3 This means that a single insertion or deletion may take more than constant time, but the total time to handle<br />

any sequence of k insertions of deletions, for any k, is O(k) time. We’ll discuss amortized running times after the<br />

first midterm. This particular result will be an easy homework problem.<br />


Let’s compute the expected time for an unsuccessful search using this stronger assumption.<br />

Suppose there are currently n elements in the hash table. Our strong uniform hashing assumption<br />

has two important consequences:<br />

• The initial hash value h0(x) is equally likely to be any integer in the set {0, 1, 2, . . . , m − 1}.<br />

• If we ignore the first probe, the remaining probe sequence 〈h1(x), h2(x), . . . , hm−1(x)〉 is<br />

equally likely to be any permutation of the smaller set {0, 1, 2, . . . , m − 1} \ {h0(x)}.<br />

The first sentence implies that the probability that T [h0(x)] is occupied is exactly n/m. The second<br />

sentence implies that if T [h0(x)] is occupied, our search algorithm recursively searches the rest of<br />

the hash table! Since the algorithm will never again probe T [h0(x)], for purposes of analysis, we<br />

might as well pretend that slot in the table no longer exists. Thus, we get the following recurrence<br />

for the expected number of probes, as a function of m and n:<br />

E[T (m, n)] = 1 + n<br />

E[T (m − 1, n − 1)].<br />

m<br />

The trivial base case is T (m, 0) = 1; if there’s nothing in the hash table, the first probe always hits<br />

an empty slot. We can now easily prove by induction that E[T (m, n)] ≤ m/(m − n) :<br />

\begin{align*} \mathrm{E}[T(m, n)] &= 1 + \frac{n}{m}\, \mathrm{E}[T(m-1, n-1)] \\ &\le 1 + \frac{n}{m} \cdot \frac{m-1}{m-n} && \text{[induction hypothesis]} \\ &< 1 + \frac{n}{m} \cdot \frac{m}{m-n} && \text{[$m - 1 < m$]} \\ &= \frac{m}{m-n} && \text{[algebra]} \end{align*}<br />

Rewriting this in terms of the load factor α = n/m, we get E[T (m, n)] ≤ 1/(1 − α) . In other words,<br />

the expected time for an unsuccessful search is O(1), unless the hash table is almost completely<br />

full.<br />

In practice, however, we can’t generate truly random probe sequences, so we use one of the<br />

following heuristics:<br />

• Linear probing: We use a single hash function h(x), and define hi(x) = (h(x) + i) mod m.<br />

This is nice and simple, but collisions tend to make items in the table clump together badly,<br />

so this is not really a good idea.<br />

• Quadratic probing: We use a single hash function h(x), and define hi(x) = (h(x) + i²) mod m. Unfortunately, for certain values of m, the sequence of hash values 〈hi(x)〉 does not hit<br />

every possible slot in the table; we can avoid this problem by making m a prime number.<br />

(That’s often a good idea anyway.) Although quadratic probing does not suffer from the same<br />

clumping problems as linear probing, it does have a weaker clustering problem: If two items<br />

have the same initial hash value, their entire probe sequences will be the same.<br />

• Double hashing: We use two hash functions h(x) and h ′ (x), and define hi as follows:<br />

hi(x) = (h(x) + i · h ′ (x)) mod m<br />

To guarantee that this can hit every slot in the table, the stride function h ′ (x) and the<br />

table size m must be relatively prime. We can guarantee this by making m prime, but<br />


a simpler solution is to make m a power of 2 and choose a stride function that is always<br />

odd. Double hashing avoids the clustering problems of linear and quadratic probing. In fact,<br />

the actual performance of double hashing is almost the same as predicted by the uniform<br />

hashing assumption, at least when m is large and the component hash functions h and h ′ are<br />

sufficiently random. This is the method of choice! 4<br />
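A minimal open-addressed table with double hashing (my sketch, not from the notes; the second hash function shown is an arbitrary stand-in): m is a power of 2 and the stride is forced to be odd, so every probe sequence is a full permutation of the slots.

```python
class DoubleHashTable:
    """Open addressing with double hashing: the probe sequence is
    h_i(x) = (h(x) + i * h'(x)) mod m, with m a power of 2 and h'(x)
    odd, so the stride is relatively prime to the table size."""
    EMPTY = object()

    def __init__(self, m=16):
        assert m & (m - 1) == 0, "m must be a power of 2"
        self.m = m
        self.table = [self.EMPTY] * m

    def _h(self, x):
        return hash(x) % self.m

    def _h2(self, x):
        # An arbitrary stand-in stride function, forced odd with | 1.
        return (hash((x, "stride")) % self.m) | 1

    def _probe(self, x):
        h, step = self._h(x), self._h2(x)
        for i in range(self.m):
            yield (h + i * step) % self.m

    def insert(self, x):
        for slot in self._probe(x):
            if self.table[slot] is self.EMPTY or self.table[slot] == x:
                self.table[slot] = x
                return
        raise RuntimeError("table is full")

    def contains(self, x):
        for slot in self._probe(x):
            if self.table[slot] == x:
                return True
            if self.table[slot] is self.EMPTY:
                return False      # 'absent'
        return False              # 'full', and x was not found

t = DoubleHashTable()
for k in range(10):
    t.insert(k)
print(t.contains(3), t.contains(99))  # → True False
```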

6.4 Deleting from an Open-Addressed Hash Table<br />

Deleting an item x from an open-addressed hash table is a bit more difficult than in a chained hash<br />

table. We can’t simply clear out the slot in the table, because we may need to know that T [h(x)]<br />

is occupied in order to find some other item!<br />

Instead, we should delete more or less the way we did with scapegoat trees. When we delete<br />

an item, we mark the slot that used to contain it as a wasted slot. A sufficiently long sequence of<br />

insertions and deletions could eventually fill the table with marks, leaving little room for any real<br />

data and causing searches to take linear time.<br />

However, we can still get good amortized performance by using two rebuilding rules. First, if<br />

the number of items in the hash table exceeds m/4, double the size of the table (m ← 2m) and<br />

rehash everything. Second, if the number of wasted slots exceeds m/2, clear all the marks and<br />

rehash everything in the table. Rehashing everything takes m steps to create the new hash table<br />

and O(n) expected steps to hash each of the n items. By charging a $4 tax for each insertion and<br />

a $2 tax for each deletion, we expect to have enough money to pay for any rebuilding.<br />

In conclusion, the expected amortized cost of any insertion or deletion is O(1), under the uniform<br />

hashing assumption. Notice that we’re doing two very different kinds of averaging here. On the one<br />

hand, we are averaging the possible costs of each individual search over all possible probe sequences<br />

(‘expected’). On the other hand, we are also averaging the costs of the entire sequence of operations<br />

to ‘smooth out’ the cost of rebuilding (‘amortized’). Both randomization and amortization are<br />

necessary to get this constant time bound.<br />
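The deletion scheme above can be sketched as follows. This is my own minimal Python rendering, not the course's reference code: wasted slots are marked rather than cleared, the table doubles when items exceed m/4, and the marks are flushed by rehashing in place when they exceed m/2. The stride choice (odd stride, m a power of 2) follows the double-hashing discussion.

```python
# Open-addressed hash table with lazy deletion and the two rebuild rules
# from the text. Sentinels distinguish never-used slots from wasted ones.
EMPTY, MARK = object(), object()

class OpenTable:
    def __init__(self, m=8):
        self.m, self.n, self.wasted = m, 0, 0
        self.slots = [EMPTY] * m

    def _probe(self, x):
        h = hash(x) % self.m
        stride = (hash((x, 'stride')) % self.m) | 1   # always odd
        for i in range(self.m):
            yield (h + i * stride) % self.m

    def insert(self, x):                # assumes x is not already present
        for j in self._probe(x):
            if self.slots[j] is EMPTY or self.slots[j] is MARK:
                if self.slots[j] is MARK:
                    self.wasted -= 1    # reclaiming a wasted slot
                self.slots[j] = x
                self.n += 1
                break
        if self.n > self.m // 4:        # rebuild rule 1: too many items
            self._rehash(2 * self.m)

    def delete(self, x):
        for j in self._probe(x):
            s = self.slots[j]
            if s is EMPTY:
                return                  # x is not in the table
            if s is not MARK and s == x:
                self.slots[j] = MARK    # mark the slot as wasted
                self.n -= 1
                self.wasted += 1
                break
        if self.wasted > self.m // 2:   # rebuild rule 2: too many marks
            self._rehash(self.m)

    def find(self, x):
        for j in self._probe(x):
            if self.slots[j] is EMPTY:
                return False
            if self.slots[j] is not MARK and self.slots[j] == x:
                return True
        return False

    def _rehash(self, m):
        items = [s for s in self.slots if s is not EMPTY and s is not MARK]
        self.m, self.n, self.wasted = m, 0, 0
        self.slots = [EMPTY] * m
        for x in items:
            self.insert(x)
```

Note that find must skip over marks but stop at a truly empty slot; that is exactly why deletion cannot simply clear the slot.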

6.5 Universal Hashing<br />

Now I’ll describe how to generate hash functions that (at least in expectation) satisfy the uniform<br />

hashing assumption. We say that a set H of hash functions is universal if it satisfies the following<br />

property: For any items x ≠ y, if a hash function h is chosen uniformly at random from the<br />

set H, then Pr[h(x) = h(y)] = 1/m. Note that this probability holds for any items x and y; the<br />

randomness is entirely in choosing a hash function from the set H.<br />

To simplify the following discussion, I'll assume that the universe U contains exactly m² items,<br />

each represented as a pair (x, x ′ ) of integers between 0 and m − 1. (Think of the items as two-digit<br />

numbers in base m.) I will also assume that m is a prime number.<br />

For any integers 0 ≤ a, b ≤ m − 1, define the function ha,b : U → {0, 1, . . . , m − 1} as follows:<br />

Then the set<br />

ha,b(x, x ′ ) = (ax + bx ′ ) mod m.<br />

H = {ha,b | 0 ≤ a, b ≤ m − 1}<br />

of all such functions is universal. To prove this, we need to show that for any pair of distinct items<br />

(x, x′) ≠ (y, y′), exactly m of the m² functions in H cause a collision.<br />

4 ...unless your hash tables are really huge, in which case linear probing has far better cache behavior, especially<br />

when the load factor is small.<br />




Choose two items (x, x′) ≠ (y, y′), and assume without loss of generality 5 that x ≠ y. A function<br />

ha,b ∈ H causes a collision between (x, x ′ ) and (y, y ′ ) if and only if<br />

ha,b(x, x′) = ha,b(y, y′)<br />

(ax + bx′) mod m = (ay + by′) mod m<br />

ax + bx′ ≡ ay + by′ (mod m)<br />

a(x − y) ≡ b(y′ − x′) (mod m)<br />

a ≡ b(y′ − x′)/(x − y) (mod m).<br />

In the last step, we are using the fact that m is prime and x − y ≠ 0, which implies that x − y has a<br />

unique multiplicative inverse modulo m. 6 Now notice for each possible value of b, the last identity<br />

defines a unique value of a such that ha,b causes a collision. Since there are m possible values for<br />

b, there are exactly m hash functions ha,b that cause a collision, which is exactly what we needed<br />

to prove.<br />

Thus, if we want to achieve the constant expected time bounds described earlier, we should<br />

choose a random element of H as our hash function, by generating two numbers a and b uniformly<br />

at random between 0 and m − 1. (Notice that this is exactly the same as choosing an element of U<br />

uniformly at random.)<br />

One perhaps undesirable ‘feature’ of this construction is that we have a small chance of choosing<br />

the trivial hash function h0,0, which maps everything to 0. So in practice, if we happen to pick<br />

a = b = 0, we should reject that choice and pick new random numbers. By taking h0,0 out of<br />

consideration, we reduce the probability of a collision from 1/m to (m − 1)/(m 2 − 1) = 1/(m + 1).<br />

In other words, the set H \ {h0,0} is slightly better than universal.<br />

This construction can be generalized easily to larger universes. Suppose u = m^r for some<br />

constant r, so that each element x ∈ U can be represented by a vector (x0, x1, . . . , xr−1) of integers<br />

between 0 and m − 1. (Think of x as an r-digit number written in base m.) Then for each vector<br />

a = (a0, a1, . . . , ar−1), define the corresponding hash function ha as follows:<br />

ha(x) = (a0x0 + a1x1 + · · · + ar−1xr−1) mod m.<br />

Then the set of all m^r such functions is universal.<br />
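The collision count in the proof above is easy to verify by brute force. This is a quick empirical check I've added (the items chosen are arbitrary): for any fixed pair of distinct items, exactly m of the m² functions ha,b should collide, so the collision probability is m/m² = 1/m.

```python
# Brute-force check of the universal family H = {h_ab} with m = 7.
m = 7

def h(a, b, x, x2):
    """h_ab applied to the item (x, x'), exactly as defined in the text."""
    return (a * x + b * x2) % m

item1, item2 = (3, 5), (4, 1)    # arbitrary distinct items (x, x')
collisions = sum(1 for a in range(m) for b in range(m)
                 if h(a, b, *item1) == h(a, b, *item2))
assert collisions == m           # exactly m of the m^2 functions collide
```

Here x − y = −1 has a multiplicative inverse mod 7, so for each of the 7 choices of b there is exactly one colliding a, matching the proof.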

5 ‘Without loss of generality’ is a phrase that appears (perhaps too) often in combinatorial proofs. What it means<br />

is that we are considering one of many possible cases, but once we see the proof for one case, the proofs for all the<br />

other cases are obvious thanks to some inherent symmetry. For this proof, we are not explicitly considering what<br />

happens when x = y and x′ ≠ y′.<br />

6 For example, the multiplicative inverse of 12 modulo 17 is 10, since 12 · 10 = 120 ≡ 1 (mod 17).<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture A: Skip Lists Fall 2002<br />

A Skip Lists<br />

For example, creating shortcuts by sprinkling a few diversely connected<br />

individuals throughout a large organization could dramatically speed up<br />

information flow between departments. On the other hand, because only<br />

a few random shortcuts are necessary to make the world small, subtle<br />

changes to networks have alarming consequences for the rapid spread of<br />

computer viruses, pernicious rumors, and infectious diseases.<br />

— Ivars Peterson, Science News, August 22, 1998.<br />

This lecture is about a probabilistic data structure called skip lists, first discovered by Bill Pugh in<br />

the late 1980’s. 1 Skip lists have many of the desirable properties of balanced binary search trees,<br />

but their structure is completely different.<br />

A.1 Shortcuts<br />

At a high level, a skip list is just a sorted, singly linked list with some shortcuts. To do a search<br />

in a normal singly-linked list of length n, we obviously need to look at n items in the worst case.<br />

To speed up this process, we can make a second-level list that contains roughly half the items from<br />

the original list. Specifically, for each item in the original list, we duplicate it with probability 1/2.<br />

We then string together all the duplicates into a second sorted linked list, and add a pointer from<br />

each duplicate back to its original. Just to be safe, we also add sentinel nodes at the beginning and<br />

end of both lists.<br />

[Figure] A linked list with some randomly-chosen shortcuts.<br />

Now we can find a value x in this augmented structure using a two-stage algorithm. First,<br />

we scan for x in the shortcut list, starting at the −∞ sentinel node. If we find x, we’re done.<br />

Otherwise, we reach some value bigger than x and we know that x is not in the shortcut list. Let<br />

w be the largest item less than x in the shortcut list. In the second phase, we scan for x in the<br />

original list, starting from w. Again, if we reach a value bigger than x, we know that x is not in<br />

the data structure.<br />

[Figure] Searching for 5 in a list with shortcuts.<br />

Since each node appears in the shortcut list with probability 1/2, the expected number of nodes<br />

examined in the first phase is at most n/2. Only one of the nodes examined in the second phase<br />

has a duplicate. The probability that any node is followed by at least k nodes without duplicates is 2^−k, so<br />

the expected number of nodes examined in the second phase is at most 1 + ∑_{k≥1} 2^−k = 2. Thus,<br />

by adding these random shortcuts, we’ve reduced the cost of a search from n to n/2 + 2, roughly<br />

a factor of two in savings.<br />

1 William Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM 33(6):668–676, 1990.<br />




A.2 Skip lists<br />

Now there’s an obvious improvement—add shortcuts to the shortcuts, and repeat recursively.<br />

That’s exactly how skip lists are constructed. For each node in the original list, we flip a coin<br />

over and over until we get tails. For each heads, we make a duplicate of the node. The duplicates<br />

are stacked up in levels, and the nodes on each level are strung together into sorted linked lists.<br />

Each node v stores a search key (key(v)), a pointer to its next lower copy (down(v)), and a pointer<br />

to the next node in its level (right(v)).<br />

[Figure] A skip list is a linked list with recursive random shortcuts.<br />

The search algorithm for skip lists is very simple. Starting at the leftmost node L in the highest<br />

level, we scan through each level as far as we can without passing the target value x, and then<br />

proceed down to the next level. The search ends when we either reach a node with search key x or<br />

fail to find x on the lowest level.<br />

SkipListFind(x, L):<br />

v ← L<br />

while (v ≠ Null and key(v) ≠ x)<br />

if key(right(v)) > x<br />

v ← down(v)<br />

else<br />

v ← right(v)<br />

return v<br />

[Figure] Searching for 5 in a skip list.<br />
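The structure and SkipListFind can be rendered directly in Python. This is my own sketch, not the course's reference implementation: each node stores key(v), right(v), and down(v), with ±∞ sentinels on every level, and insertion flips coins to decide how far to promote each new key.

```python
import random

NEG, POS = float('-inf'), float('inf')   # sentinel keys

class Node:
    __slots__ = ('key', 'right', 'down')
    def __init__(self, key, right=None, down=None):
        self.key, self.right, self.down = key, right, down

class SkipList:
    def __init__(self):
        self.head = Node(NEG, Node(POS))   # one empty level

    def find(self, x):
        # SkipListFind from the text: scan right, then drop down a level.
        v = self.head
        while v is not None and v.key != x:
            if v.right.key > x:
                v = v.down
            else:
                v = v.right
        return v                           # None if x is absent

    def insert(self, x):
        # Collect the rightmost node with key < x on every level, top-down.
        preds, v = [], self.head
        while v is not None:
            while v.right.key < x:
                v = v.right
            preds.append(v)
            v = v.down
        preds.reverse()                    # bottom level first
        below, level = None, 0
        while True:
            if level == len(preds):        # promoted past the top level
                self.head = Node(NEG, Node(POS), self.head)
                preds.append(self.head)
            pred = preds[level]
            below = Node(x, pred.right, below)
            pred.right = below
            level += 1
            if random.random() < 0.5:      # tails: stop promoting
                break

random.seed(373)
sl = SkipList()
keys = random.sample(range(10000), 200)
for k in keys:
    sl.insert(k)
assert all(sl.find(k) is not None for k in keys)
assert sl.find(-1) is None
```

The coin-flipping loop in insert is exactly the "flip until tails" rule described above; each heads adds one more level to the new key's tower.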

Intuitively, since each level of the skip list has about half the number of nodes as the previous<br />

level, the total number of levels should be about O(log n). Similarly, each time we add another level<br />

of random shortcuts to the skip list, we cut the search time in half except for a constant overhead.<br />

So after O(log n) levels, we should have a search time of O(log n). Let’s formalize each of these two<br />

intuitive observations.<br />




A.3 Number of Levels<br />

The actual values of the search keys don’t affect the skip list analysis, so let’s assume the keys<br />

are the integers 1 through n. Let L(x) be the number of levels of the skip list that contain some<br />

search key x, not counting the bottom level. Each new copy of x is created with probability 1/2<br />

from the previous level, essentially by flipping a coin. We can compute the expected value of L(x)<br />

recursively—with probability 1/2, we flip tails and L(x) = 0; and with probability 1/2, we flip<br />

heads, increase L(x) by one, and recurse:<br />

E[L(x)] = (1/2) · 0 + (1/2) · (1 + E[L(x)])<br />

Solving this equation gives us E[L(x)] = 1.<br />

In order to analyze the expected cost of a search, however, we need a bound on the number<br />

of levels L = maxx L(x). Unfortunately, we can’t compute the average of a maximum the way<br />

we would compute the average of a sum. Instead, we will derive a stronger result, showing that<br />

the depth is O(log n) with high probability. ‘High probability’ is a technical term that means the<br />

probability is at least 1 − 1/n c for some constant c ≥ 1.<br />

In order for a search key x to appear on the kth level, we must have flipped k heads in a row,<br />

so Pr[L(x) ≥ k] = 2^−k. In particular,<br />

Pr[L(x) ≥ 2 lg n] = 2^(−2 lg n) = 1/n².<br />

(There's nothing special about the number 2 here.) The skip list has at least 2 lg n levels if and<br />

only if L(x) ≥ 2 lg n for at least one of the n search keys.<br />

Pr[L ≥ 2 lg n] = Pr[(L(1) ≥ 2 lg n) ∨ (L(2) ≥ 2 lg n) ∨ · · · ∨ (L(n) ≥ 2 lg n)]<br />

Since Pr[A ∨ B] ≤ Pr[A] + Pr[B] for any random events A and B, we can simplify this as follows:<br />

Pr[L ≥ 2 lg n] ≤ ∑_{x=1}^{n} Pr[L(x) ≥ 2 lg n] = ∑_{x=1}^{n} 1/n² = 1/n.<br />

So with high probability, a skip list has O(log n) levels.<br />
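The level distribution is easy to check by simulation. This is my own quick experiment (the parameters are arbitrary): L(x) is the number of heads before the first tails, and the maximum over n keys concentrates near lg n.

```python
import math
import random

# Simulate L(x) for n keys and take the maximum, repeatedly.
random.seed(373)
n = 1024

def L():
    """Number of heads flipped before the first tails."""
    k = 0
    while random.random() < 0.5:
        k += 1
    return k

trials = [max(L() for _ in range(n)) for _ in range(100)]

# E[L(x)] = 1, but the maximum over n = 1024 keys sits near lg n = 10;
# the loose bounds below hold with overwhelming probability.
assert all(5 <= t <= 4 * math.log2(n) for t in trials)
```

Typical runs give maxima of 10 to 15 for n = 1024, consistent with the 2 lg n high-probability bound.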

A.4 Logarithmic Search Time<br />


It’s a little easier to analyze the cost of a search if we imagine running the algorithm backwards.<br />

takes the output from SkipListFind as input and traces back through the data<br />

structure to the upper left corner. Skip lists don’t really have up and left pointers, but we’ll pretend<br />

that they do so we don’t have to write ‘v down(v) ← ’ or ‘v right(v) ← ’. 2<br />

2<br />

SkipListFind<br />

(v):<br />

while (v = L)<br />

if up(v) exists<br />

v ← up(v)<br />

else<br />

v ← left(v)<br />

3<br />

SkipListFind<br />

Leonardo da Vinci used to write everything this way, but not because he wanted to keep his discoveries secret.<br />

He just had really bad arthritis in his right hand!



Now for every node v in the skip list, up(v) exists with probability 1/2. So for purposes of<br />

analysis, the reversed search SkipListFindR is equivalent to the following algorithm:<br />

FlipWalk(v):<br />

while (v ≠ L)<br />

if CoinFlip = Heads<br />

v ← up(v)<br />

else<br />

v ← left(v)<br />

Obviously, the expected number of heads is exactly the same as the expected number of tails.<br />

Thus, the expected running time of this algorithm is twice the expected number of upward jumps.<br />

Since we already know that the number of upward jumps is O(log n) with high probability, we can<br />

conclude that the expected search time is O(log n).<br />



<strong>CS</strong> <strong>373</strong> Lecture 7: Amortized Analysis Fall 2002<br />

7 Amortized Analysis (October 3)<br />

7.1 Incrementing a Binary Counter<br />

The goode workes that men don whil they ben in good lif al amortised by<br />

synne folwyng.<br />

— Geoffrey Chaucer, “The Persones [Parson’s] Tale” (c.1400)<br />

I will gladly pay you Tuesday for a hamburger today.<br />

— J. Wellington Wimpy, “Thimble Theatre” (1931)<br />

I want my two dollars!<br />

— Johnny Gasparini [Demian Slade], “Better Off Dead” (1985)<br />

One of the questions in Homework Zero asked you to prove that any number could be written in<br />

binary. Although some of you (correctly) proved this using strong induction—pulling off either<br />

the least significant bit or the most significant bit and letting the recursion fairy convert the<br />

remainder—the most common proof used weak induction as follows:<br />

Proof: Base case: 1 = 2^0.<br />

Inductive step: Suppose we have a set of distinct powers of two whose sum is n. If we add 2^0<br />

to this set, we get a 'set' of powers of two whose sum is n + 1, but there might be two copies of 2^0.<br />

To fix this, as long as there are two copies of any 2^i, delete them both and add 2^(i+1). The value<br />

of the sum is unchanged by this process, since 2^(i+1) = 2^i + 2^i. Since each iteration decreases the<br />

number of powers of two in our ‘set’, this process must eventually terminate. At the end of this<br />

process, we have a set of distinct powers of two whose sum is n + 1. <br />

Here’s a more formal (and shorter!) description of the algorithm to add one to a binary numeral.<br />

The input B is an array of bits, where B[i] = 1 if and only if 2 i appears in the sum.<br />

Increment(B):<br />

i ← 0<br />

while B[i] = 1<br />

B[i] ← 0<br />

i ← i + 1<br />

B[i] ← 1<br />

We’ve already argued that Increment must terminate, but how quickly? Obviously, the running<br />

time depends on the array of bits passed as input. If the first k bits are all 1s, then Increment<br />

takes Θ(k) time. Thus, if the number represented by B is between 0 and n, Increment takes<br />

Θ(log n) time in the worst case, since the binary representation for n is exactly ⌊lg n⌋ + 1 bits long.<br />

7.2 Counting from 0 to n: The Aggregate Method<br />

Now suppose we want to use Increment to count from 0 to n. If we only use the worst-case running<br />

time for each call, we get an upper bound of O(n log n) on the total running time. Although this<br />

bound is correct, it isn’t the best we can do. The easiest way to get a tighter bound is to observe<br />

that we don’t need to flip Θ(log n) bits every time Increment is called. The least significant<br />

bit B[0] does flip every time, but B[1] only flips every other time, B[2] flips every 4th time, and<br />

in general, B[i] flips every 2^i-th time. If we start from an array full of zeros, a sequence of n<br />




Increments flips each bit B[i] exactly ⌊n/2^i⌋ times. Thus, the total number of bit-flips for the<br />

entire sequence is<br />

∑_{i=0}^{⌊lg n⌋} ⌊n/2^i⌋ < ∑_{i=0}^{∞} n/2^i = 2n.<br />

Thus, on average, each call to Increment flips only two bits, and so runs in constant time.<br />
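The aggregate bound can be confirmed by simply counting flips. This is a minimal sketch of Increment that returns its own cost:

```python
def increment(B):
    """Add one to the bit array B (least significant bit first).
    Returns the number of bits flipped, i.e., the actual cost."""
    i = 0
    while i < len(B) and B[i] == 1:
        B[i] = 0
        i += 1
    B[i] = 1
    return i + 1

n = 1000
B = [0] * 20                       # plenty of bits for counts up to 1000
total = sum(increment(B) for _ in range(n))
assert total < 2 * n               # aggregate bound: fewer than 2n flips
assert sum(b << i for i, b in enumerate(B)) == n
```

For n = 1000 the total is well under 2000, even though individual calls occasionally flip ⌊lg n⌋ + 1 bits.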

This ‘on average’ is quite different from the averaging we did in the previous lecture. There is<br />

no probability involved; we are averaging over a sequence of operations, not the possible running<br />

times of a single operation. This averaging idea is called amortization—the amortized cost of each<br />

Increment is O(1). Amortization is a clever trick used by accountants to average large one-time<br />

costs over long periods of time; the most common example is calculating uniform payments<br />

for a loan, even though the borrower is paying interest on less and less capital over time.<br />

There are several different methods for deriving amortized bounds for a sequence of operations.<br />

CLR calls the technique we just used the aggregate method, which is just a fancy way of saying<br />

sum up the total cost of the sequence and divide by the number of operations.<br />

The Aggregate Method. Find the worst case running time T (n) for a sequence of n<br />

operations. The amortized cost of each operation is T (n)/n.<br />

7.3 The Taxation (Accounting) Method<br />

The second method we can use to derive amortized bounds is called the accounting method in CLR,<br />

but a better name for it might be the taxation method. Suppose it costs us a dollar to toggle a bit,<br />

so we can measure the running time of our algorithm in dollars. Time is money!<br />

Instead of paying for each bit flip when it happens, the Increment Revenue Service charges a<br />

two-dollar increment tax whenever we want to set a bit from zero to one. One of those dollars is<br />

spent changing the bit from zero to one; the other is stored away as credit until we need to reset<br />

the same bit to zero. The key point here is that we always have enough credit saved up to pay<br />

for the next Increment. The amortized cost of an Increment is the total tax it incurs, which is<br />

exactly 2 dollars, since each Increment changes just one bit from 0 to 1.<br />

It is often useful to assign various parts of the tax income to specific pieces of the data structure.<br />

For example, for each Increment, we could store one of the two dollars on the single bit that is<br />

set from 0 to 1, so that bit can pay to reset itself back to zero later on.<br />

Taxation Method 1. Certain steps in the algorithm charge you taxes, so that the<br />

total money it spends is never more than the total taxes you pay. The amortized cost<br />

of an operation is the overall tax charged to you during that operation.<br />

Perhaps a more optimistic way of looking at the taxation method is to have the bits in the<br />

array pay us a tax for the privilege of being updated at the proper time. Regardless of whether we<br />

change the bit or not, we charge each bit B[i] a tax of 1/2^i dollars for each Increment. The total<br />

tax we collect is ∑_{i≥0} 2^−i = 2 dollars. Every time B[i] actually needs to be flipped, it has paid us<br />

a total of $1 since the last change, which is just enough for us to pay for the flip.<br />

Taxation Method 2. Charge taxes to certain items in the data structure at each<br />

operation, so that the total money you spend is never more than the total taxes you<br />

collect. The amortized cost of an operation is the overall tax you collect during that<br />

operation.<br />




In both of the taxation methods, our task as algorithm analysts is to come up with an appropriate<br />

‘tax schedule’. Different ‘schedules’ can result in different amortized time bounds. The tightest<br />

bounds are obtained from tax schedules that just barely stay in the black.<br />

7.4 The Potential Method<br />

The most powerful method (and the hardest to use) builds on a physics metaphor of ‘potential<br />

energy’. Instead of associating costs or taxes with particular operations or pieces of the data<br />

structure, we represent prepaid work as potential that can be spent on later operations. The<br />

potential is a function of the entire data structure.<br />

Let Di denote our data structure after i operations, and let Φi denote its potential. Let ci<br />

denote the actual cost of the ith operation (which changes Di−1 into Di). Then the amortized cost<br />

of the ith operation, denoted ai, is defined to be the actual cost plus the change in potential:<br />

ai = ci + Φi − Φi−1<br />

So the total amortized cost of n operations is the actual total cost plus the total change in potential:<br />

∑_{i=1}^{n} ai = ∑_{i=1}^{n} (ci + Φi − Φi−1) = ∑_{i=1}^{n} ci + Φn − Φ0.<br />

Our task is to define a potential function so that Φ0 = 0 and Φi ≥ 0 for all i. Once we do this, the<br />

total actual cost of any sequence of operations will be less than the total amortized cost:<br />

∑_{i=1}^{n} ci = ∑_{i=1}^{n} ai − Φn ≤ ∑_{i=1}^{n} ai.<br />

For our binary counter example, we can define the potential Φi after the ith Increment to<br />

be the number of bits with value 1. Initially, all bits are equal to zero, so Φ0 = 0, and clearly<br />

Φi > 0 for all i > 0, so this is a legal potential function. We can describe both the actual cost of<br />

an Increment and the change in potential in terms of the number of bits set to 1 and reset to 0.<br />

ci = #bits changed from 0 to 1 + #bits changed from 1 to 0<br />

Φi − Φi−1 = #bits changed from 0 to 1 − #bits changed from 1 to 0<br />

Thus, the amortized cost of the ith Increment is<br />

ai = ci + Φi − Φi−1 = 2 × #bits changed from 0 to 1<br />

Since Increment changes only one bit from 0 to 1, the amortized cost of Increment is 2.<br />

The Potential Method. Define a potential function for the data structure that is<br />

initially equal to zero and is always nonnegative. The amortized cost of an operation is<br />

its actual cost plus the change in potential.<br />

For this particular example, the potential is exactly equal to the total unspent taxes paid using<br />

the taxation method, so not too surprisingly, we have exactly the same amortized cost. In general,<br />

however, there may be no way of interpreting the change in potential as ‘taxes’.<br />

Different potential functions will lead to different amortized time bounds. The trick to using the<br />

potential method is to come up with the best possible potential function. A good potential function<br />

goes up a little during any cheap/fast operation, and goes down a lot during any expensive/slow<br />

operation. Unfortunately, there is no general technique for doing this other than playing around<br />

with the data structure and trying lots of different possibilities.<br />
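For the binary counter, the potential calculation above can be checked numerically. This is my own sketch: with Φ = number of 1-bits, every single Increment has amortized cost exactly 2, not just on average.

```python
def increment(B):
    """Add one to bit array B (LSB first); return the number of bits flipped."""
    i = 0
    while B[i] == 1:
        B[i] = 0
        i += 1
    B[i] = 1
    return i + 1

B = [0] * 16
phi = 0                          # Phi_0 = 0: all bits are zero
for _ in range(1000):
    cost = increment(B)          # actual cost c_i
    new_phi = sum(B)             # Phi_i = number of 1-bits
    amortized = cost + new_phi - phi
    assert amortized == 2        # a_i = c_i + Phi_i - Phi_{i-1} = 2, always
    phi = new_phi
```

An increment that clears k ones costs k + 1 and changes the potential by 1 − k, so the two terms cancel to exactly 2 every time.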




7.5 Incrementing and Decrementing<br />

Now suppose we wanted a binary counter that we could both increment and decrement efficiently.<br />

A standard binary counter won't work, even in an amortized sense, since alternating between 2^k<br />

and 2^k − 1 costs Θ(k) time per operation.<br />

A nice alternative is to represent a number as a pair of bit strings (P, N), where for any bit<br />

position i, at most one of the bits P [i] and N[i] is equal to 1. The actual value of the counter is<br />

P − N. Here are algorithms to increment and decrement our double binary counter.<br />

Increment(P, N):<br />

i ← 0<br />

while P [i] = 1<br />

P [i] ← 0<br />

i ← i + 1<br />

if N[i] = 1<br />

N[i] ← 0<br />

else<br />

P [i] ← 1<br />

Decrement(P, N):<br />

i ← 0<br />

while N[i] = 1<br />

N[i] ← 0<br />

i ← i + 1<br />

if P [i] = 1<br />

P [i] ← 0<br />

else<br />

N[i] ← 1<br />

Here’s an example of these algorithms in action. Notice that any number other than zero can<br />

be represented in multiple (in fact, infinitely many) ways.<br />

P = 10001    P = 10010    P = 10011    P = 10000    P = 10000    P = 10000    P = 10001<br />

N = 01100    N = 01100    N = 01100    N = 01000    N = 01001    N = 01010    N = 01010<br />

P−N = 5 ++→ P−N = 6 ++→ P−N = 7 ++→ P−N = 8 −−→ P−N = 7 −−→ P−N = 6 ++→ P−N = 7<br />

Incrementing and decrementing a double-binary counter.<br />
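The two algorithms translate directly into Python. This is a sketch of my own (the value helper is added for checking); the stored value is always P − N, and the invariant that at most one of P[i], N[i] is set is preserved by both operations.

```python
def increment(P, N):
    """Add one to the double counter (P, N), as in the text."""
    i = 0
    while P[i] == 1:
        P[i] = 0
        i += 1
    if N[i] == 1:
        N[i] = 0
    else:
        P[i] = 1

def decrement(P, N):
    """Subtract one from the double counter (P, N), symmetrically."""
    i = 0
    while N[i] == 1:
        N[i] = 0
        i += 1
    if P[i] == 1:
        P[i] = 0
    else:
        N[i] = 1

def value(P, N):
    """The integer represented: P - N (can be negative)."""
    return sum((p - q) << i for i, (p, q) in enumerate(zip(P, N)))

P, N = [0] * 16, [0] * 16
for _ in range(5):
    increment(P, N)
for _ in range(8):
    decrement(P, N)
assert value(P, N) == -3
assert all(p + q <= 1 for p, q in zip(P, N))   # at most one of P[i], N[i] set
```

Unlike a plain binary counter, this representation decrements straight past zero into negative values without any expensive borrow cascade.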

Now suppose we start from (0, 0) and apply a sequence of n Increments and Decrements.<br />

In the worst case, each operation takes Θ(log n) time, but what is the amortized cost? We can't use the<br />

aggregate method here, since we don’t know what the sequence of operations looks like.<br />

What about the taxation method? It’s not hard to prove (by induction, of course) that after<br />

either P[i] or N[i] is set to 1, there must be at least 2^i operations, either Increments or Decrements,<br />

before that bit is reset to 0. So if each bit P[i] and N[i] pays a tax of 2^−i at each operation,<br />

we will always have enough money to pay for the next operation. Thus, the amortized cost of each<br />

operation is at most ∑_{i≥0} 2 · 2^−i = 4.<br />

We can get even better bounds using the potential method. Define the potential Φi to be the<br />

number of 1-bits in both P and N after i operations. Just as before, we have<br />

ci = #bits changed from 0 to 1 + #bits changed from 1 to 0<br />

Φi − Φi−1 = #bits changed from 0 to 1 − #bits changed from 1 to 0<br />

=⇒ ai = 2 × #bits changed from 0 to 1<br />

Since each operation changes at most one bit to 1, the ith operation has amortized cost ai ≤ 2.<br />

Exercise: Modify the binary double-counter to support a new operation Sign, which determines<br />

whether the number being stored is positive, negative, or zero, in constant time. The amortized<br />

time to increment or decrement the counter should still be a constant. [Hint: If P has p significant<br />

bits, and N has n significant bits, then p − n always has the same sign as P − N. For example, if<br />

P = 17 = 10001₂ and N = 0, then p = 5 and n = 0. But how do you store p and n??]<br />




Exercise: Suppose instead of powers of two, we represent integers as the sum of Fibonacci numbers.<br />

In other words, instead of an array of bits, we keep an array of fits, where the ith least significant<br />

fit indicates whether the sum includes the ith Fibonacci number Fi. For example, the fitstring<br />

101110F represents the number F6 + F4 + F3 + F2 = 8 + 3 + 2 + 1 = 14. Describe algorithms to<br />

increment and decrement a single fitstring in constant amortized time. [Hint: Most numbers can<br />

be represented by more than one fitstring!]<br />

7.6 Aside: Gray Codes<br />

An attractive alternate solution to the increment/decrement problem was independently suggested<br />

by several students. Gray codes (named after Frank Gray, who discovered them in the 1950s) are<br />

methods for representing numbers as bit strings so that successive numbers differ by only one bit.<br />

For example, here is the four-bit binary reflected Gray code for the integers 0 through 15:<br />

0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000<br />

The general rule for incrementing a binary reflected Gray code is to invert the bit that would be<br />

set from 0 to 1 by a normal binary counter. In terms of bit-flips, this is the perfect solution; each<br />

increment or decrement by definition changes only one bit. Unfortunately, it appears that finding<br />

the single bit to flip still requires Θ(log n) time in the worst case, so the total cost of maintaining<br />

a Gray code is actually the same as that of maintaining a normal binary counter.<br />

Actually, this is only true of the naïve algorithm. The following algorithm, discovered by Gideon<br />

Ehrlich 1 in 1973, maintains a Gray code counter in constant worst-case time per increment! The<br />

algorithm uses a separate ‘focus’ array F [0 .. n] in addition to a Gray-code bit array G[0 .. n − 1].<br />

EhrlichGrayInit(n):<br />

for i ← 0 to n − 1<br />

G[i] ← 0<br />

for i ← 0 to n<br />

F [i] ← i<br />

EhrlichGrayIncrement(n):<br />

j ← F [0]<br />

F [0] ← 0<br />

if j = n<br />

G[n − 1] ← 1 − G[n − 1]<br />

else<br />

G[j] ← 1 − G[j]<br />

F [j] ← F [j + 1]<br />

F [j + 1] ← j + 1<br />

The EhrlichGrayIncrement algorithm obviously runs in O(1) time, even in the worst case.<br />

Here’s the algorithm in action with n = 4. The first line is the Gray bit-vector G, and the second<br />

line shows the focus vector F , both in reverse order:<br />

G : 0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000<br />

F : 3210, 3211, 3220, 3212, 3310, 3311, 3230, 3213, 4210, 4211, 4220, 4212, 3410, 3411, 3240, 3214<br />

Voodoo! I won’t explain in detail how Ehrlich’s algorithm works, except to point out the following<br />

invariant. Let B[i] denote the ith bit in the standard binary representation of the current number.<br />

If B[j] = 0 and B[j −1] = 1, then F [j] is the smallest integer k > j such that B[k] = 1;<br />

otherwise, F [j] = j. Got that?<br />
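Ehrlich's algorithm is short enough to transcribe and test directly. This is my own Python transcription of the pseudocode above; each increment does O(1) work and, as the demo suggests, the sequence it produces for n = 4 is exactly the binary reflected Gray code i XOR (i >> 1).

```python
def ehrlich_init(n):
    """G[0 .. n-1] all zero; focus array F[0 .. n] = [0, 1, ..., n]."""
    return [0] * n, list(range(n + 1))

def ehrlich_increment(G, F, n):
    """One O(1) worst-case step of Ehrlich's Gray-code counter."""
    j = F[0]
    F[0] = 0
    if j == n:
        G[n - 1] = 1 - G[n - 1]
    else:
        G[j] = 1 - G[j]
        F[j] = F[j + 1]
        F[j + 1] = j + 1

n = 4
G, F = ehrlich_init(n)
seen = []
for i in range(2 ** n):
    # read G with G[n-1] as the most significant bit, as in the demo
    seen.append(int(''.join(map(str, reversed(G))), 2))
    ehrlich_increment(G, F, n)

# The sequence matches the binary reflected Gray code.
assert seen == [i ^ (i >> 1) for i in range(16)]
```

Successive values of seen differ in exactly one bit, confirming that each increment flips a single bit of G while still running in constant time.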

But wait — this algorithm only handles increments; what if we also want to decrement? Sorry,<br />

I don’t have a clue. Extra credit, anyone?<br />

1 G. Ehrlich. Loopless algorithms for generating permutations, combinations, and other combinatorial configura-<br />

tions. J. Assoc. Comput. Mach. 20:500–513, 1973.<br />



<strong>CS</strong> <strong>373</strong> Lecture 8: Dynamic Binary Search Trees Fall 2002<br />

Everything was balanced before the computers went off line. Try and<br />

adjust something, and you unbalance something else. Try and adjust<br />

that, you unbalance two more and before you know what’s happened, the<br />

ship is out of control.<br />

— Blake, Blake’s 7, “Breakdown” (March 6, 1978)<br />

A good scapegoat is nearly as welcome as a solution to the problem.<br />

— Anonymous<br />

Let's play.<br />

— El Mariachi [Antonio Banderas], Desperado (1992)<br />

8 Dynamic Binary Search Trees (February 8)<br />

8.1 Definitions<br />

I’ll assume that everyone is already familiar with the standard terminology for binary search trees—<br />

node, search key, edge, root, internal node, leaf, right child, left child, parent, descendant, sibling,<br />

ancestor, subtree, preorder, postorder, inorder, etc.—as well as the standard algorithms for searching<br />

for a node, inserting a node, or deleting a node. Otherwise, see Chapter 12 of CLRS.<br />

For this lecture, we will consider only full binary trees—where every internal node has exactly<br />

two children—where only the internal nodes actually store search keys. In practice, we can represent<br />

the leaves with null pointers.<br />

Recall that the depth d(v) of a node v is its distance from the root, and its height h(v) is the<br />

distance to the farthest leaf in its subtree. The height (or depth) of the tree is just the height of<br />

the root. The size |v| of v is the number of nodes in its subtree. The size of the whole tree is just<br />

the total number of nodes, which I’ll usually denote by n.<br />

A tree with height h has at most 2^h leaves, so the minimum height of an n-leaf binary tree<br />

is ⌈lg n⌉. In the worst case, the time required for a search, insertion, or deletion is proportional to the height<br />

of the tree, so in general we would like to keep the height as close to lg n as possible. The best we<br />

can possibly do is to have a perfectly balanced tree, in which each subtree has as close to half the<br />

leaves as possible, and both subtrees are perfectly balanced. The height of a perfectly balanced tree<br />

is ⌈lg n⌉, so the worst-case search time is O(log n). However, even if we started with a perfectly<br />

balanced tree, a malicious sequence of insertions and/or deletions could make the tree arbitrarily<br />

unbalanced, driving the search time up to Θ(n).<br />

To avoid this problem, we need to periodically modify the tree to maintain ‘balance’. There<br />

are several methods for doing this, and depending on the method we use, the search tree is given<br />

a different name. Examples include AVL trees, red-black trees, height-balanced trees, weight-balanced<br />

trees, bounded-balance trees, path-balanced trees, B-trees, treaps, randomized binary<br />

search trees, skip lists,¹ and jumplists.² Some of these trees support searches, insertions, and<br />

deletions in O(log n) worst-case time, others in O(log n) amortized time, still others in O(log n)<br />

expected time.<br />

In this lecture, I’ll discuss two binary search tree data structures with good amortized performance.<br />

The first is the scapegoat tree, discovered by Arne Andersson in 1989 and independently by<br />

1 Yeah, yeah. Skip lists aren’t really binary search trees. Whatever you say, Mr. Picky.<br />

2 These are essentially randomized variants of the Phobian binary search trees you saw in the first midterm!<br />

[H. Brönnimann, F. Cazals, and M. Durand. Randomized jumplists: A jump-and-walk dictionary data structure.<br />

Manuscript, 2002. http://photon.poly.edu/ ∼ hbr/publi/jumplist.html.] So now you know who to blame.<br />


Igal Galperin and Ron Rivest in 1993. 3 The second is the splay tree, discovered by Danny Sleator<br />

and Bob Tarjan in 1985. 4<br />

8.2 Lazy Deletions: Global Rebuilding<br />

First let’s consider the simple case where we start with a perfectly-balanced tree, and we only want<br />

to perform searches and deletions. To get good search and delete times, we will use a technique<br />

called global rebuilding. When we get a delete request, we locate and mark the node to be deleted,<br />

but we don’t actually delete it. This requires a simple modification to our search algorithm—we<br />

still use marked nodes to guide searches, but if we search for a marked node, the search routine<br />

says it isn’t there. This keeps the tree more or less balanced, but now the search time is no longer<br />

a function of the amount of data currently stored in the tree. To remedy this, we also keep track<br />

of how many nodes have been marked, and then apply the following rule:<br />

Global Rebuilding Rule. As soon as half the nodes in the tree have been marked,<br />

rebuild a new perfectly balanced tree containing only the unmarked nodes. 5<br />

With this rule in place, a search takes O(log n) time in the worst case, where n is the number of<br />

unmarked nodes. Specifically, since the tree has at most n marked nodes, or 2n nodes altogether,<br />

we need to examine at most lg n + 1 keys. There are several methods for rebuilding the tree in<br />

O(n) time, where n is the size of the new tree. (Homework!) So a single deletion can cost Θ(n)<br />

time in the worst case, but only if we have to rebuild; most deletions take only O(log n) time.<br />

We spend O(n) time rebuilding, but only after Ω(n) deletions, so the amortized cost of rebuilding<br />

the tree is O(1) per deletion. (Here I’m using a simple version of the ‘taxation method’. For each<br />

deletion, we charge a $1 tax; after n deletions, we’ve collected $n, which is just enough to pay for<br />

rebalancing the tree containing the remaining n nodes.) Since we also have to find and mark the<br />

node being ‘deleted’, the total amortized time for a deletion is O(log n).<br />
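Here is a minimal Python sketch of searching and lazy deletion under the Global Rebuilding Rule. For simplicity it stores a key at every node rather than using the leaf-oriented trees described above; all names are mine.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.marked = False

def build(keys, lo=0, hi=None):
    """Build a perfectly balanced tree from a sorted list in O(n) time."""
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(keys[mid], build(keys, lo, mid), build(keys, mid + 1, hi))

class GlobalRebuildTree:
    def __init__(self, sorted_keys):
        self.root = build(sorted_keys)
        self.size = len(sorted_keys)   # all nodes, marked or not
        self.marked = 0                # how many nodes are marked

    def search(self, key):
        v = self.root
        while v is not None:
            if key == v.key:
                return not v.marked    # a marked node reports 'not here'
            v = v.left if key < v.key else v.right
        return False

    def delete(self, key):
        v = self.root
        while v is not None and v.key != key:
            v = v.left if key < v.key else v.right
        if v is None or v.marked:
            return
        v.marked = True
        self.marked += 1
        if 2 * self.marked >= self.size:   # Global Rebuilding Rule
            keys = []
            def collect(u):                # inorder walk: unmarked keys only
                if u is not None:
                    collect(u.left)
                    if not u.marked:
                        keys.append(u.key)
                    collect(u.right)
            collect(self.root)
            self.root = build(keys)
            self.size, self.marked = len(keys), 0
```

Deleting n/2 keys one by one costs O(log n) each until the single Θ(n) rebuild, matching the taxation argument above.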

8.3 Insertions: Partial Rebuilding<br />

Now suppose we only want to support searches and insertions. We can’t ‘not really insert’ new<br />

nodes into the tree, since that would make them unavailable to the search algorithm. 6 So instead,<br />

we’ll use another method called partial rebuilding. We will insert new nodes normally, but whenever<br />

a subtree becomes unbalanced enough, we rebuild it. The definition of ‘unbalanced enough’ depends<br />

on an arbitrary constant α > 1.<br />

Each node v will now also store its height h(v) and the size of its subtree |v|. We now modify<br />

our insertion algorithm with the following rule:<br />

3 A. Andersson. General balanced trees. J. <strong>Algorithms</strong> 30:1–28, 1999. I. Galperin and R. L. Rivest. Scapegoat<br />

trees. Proc. SODA 1993, pp. 165–174, 1993. The claim of independence is Andersson’s; the conference version of<br />

his paper appeared in 1989. The two papers actually describe very slightly different rebalancing algorithms. The<br />

algorithm I’m using here is closer to Andersson’s, but my analysis is closer to Galperin and Rivest’s.<br />

4 D. D. Sleator and R. Tarjan. Self-adjusting binary search trees. J. ACM 32:652–686, 1985.<br />

5 Alternately: When the number of unmarked nodes is one less than an exact power of two, rebuild the tree. This<br />

rule ensures that the tree is always exactly balanced.<br />

6 Actually, there is another dynamic data structure technique, first described by Jon Bentley and James Saxe in<br />

1980, that doesn’t really insert the new node! Instead, their algorithm puts the new node into a brand new data<br />

structure all by itself. Then as long as there are two trees of exactly the same size, those two trees are merged into a<br />

new tree. So this is exactly like a binary counter—instead of one n-node tree, you have a collection of 2^i-node trees of<br />

distinct sizes. The amortized cost of inserting a new element is O(log n). Unfortunately, we have to look through up<br />

to lg n trees to find anything, so the search time goes up to O(log² n). [J. L. Bentley and J. B. Saxe. Decomposable<br />

searching problems I: Static-to-dynamic transformation. J. <strong>Algorithms</strong> 1:301-358, 1980.] See also Homework 3.<br />


Partial Rebuilding Rule. After we insert a node, walk back up the tree updating<br />

the heights and sizes of the nodes on the search path. If we encounter a node v where<br />

h(v) > α · lg|v|, rebuild its subtree into a perfectly balanced tree (in O(|v|) time).<br />

If we always follow this rule, then after an insertion, the height of the tree is at most α · lg n.<br />

Thus, since α is a constant, the worst-case search time is O(log n). In the worst case, insertions<br />

require Θ(n) time—we might have to rebuild the entire tree. However, the amortized time for each<br />

insertion is again only O(log n). Not surprisingly, the proof is a little bit more complicated than<br />

for deletions.<br />

Define the imbalance I(v) of a node v to be one less than the absolute difference between the<br />

sizes of its two subtrees, or zero, whichever is larger:<br />

I(v) = max{ ||left(v)| − |right(v)|| − 1, 0 }<br />

A simple induction proof implies that I(v) = 0 for every node v in a perfectly balanced tree. So<br />

immediately after we rebuild the subtree of v, we have I(v) = 0. On the other hand, each insertion<br />

into the subtree of v increments either |left(v)| or |right(v)|, so I(v) changes by at most 1.<br />

The whole analysis boils down to the following lemma.<br />

Lemma 1. Just before we rebuild v’s subtree, I(v) = Ω(|v|).<br />

Before we prove this, let’s first look at what it implies. If I(v) = Ω(|v|), then Ω(|v|) keys have<br />

been inserted into v’s subtree since the last time it was rebuilt from scratch. On the other hand,<br />

rebuilding the subtree requires O(|v|) time. Thus, if we amortize the rebuilding cost across all the<br />

insertions since the last rebuilding, v is charged constant time for each insertion into its subtree.<br />

Since each new key is inserted into at most α · lg n = O(log n) subtrees, the total amortized cost of<br />

an insertion is O(log n).<br />

Proof: Since we’re about to rebuild the subtree at v, we must have h(v) > α · lg |v|. Without<br />

loss of generality, suppose that the node we just inserted went into v’s left subtree. Either we just<br />

rebuilt this subtree or we didn’t have to, so we also have h(left(v)) ≤ α · lg |left(v)|. Combining<br />

these two inequalities with the recursive definition of height, we get<br />

α · lg |v| < h(v) ≤ h(left(v)) + 1 ≤ α · lg |left(v)| + 1.<br />

After some algebra, this simplifies to |left(v)| > |v|/2^{1/α}. Combining this with the identity |v| =<br />

|left(v)| + |right(v)| + 1 and doing some more algebra gives us the inequality<br />

|right(v)| < (1 − 1/2^{1/α}) · |v| − 1.<br />

Finally, we combine these two inequalities using the recursive definition of imbalance.<br />

I(v) ≥ |left(v)| − |right(v)| − 1 > (2/2^{1/α} − 1) · |v|<br />

Since α is a constant bigger than 1, the factor in parentheses is a positive constant. <br />


8.4 Scapegoat Trees<br />

Finally, to handle both insertions and deletions efficiently, scapegoat trees use both of the previous<br />

techniques. We use partial rebuilding to re-balance the tree after insertions, and global rebuilding<br />

to re-balance the tree after deletions. Each search takes O(log n) time in the worst case, and the<br />

amortized time for any insertion or deletion is also O(log n). There are a few small technical details<br />

left (which I won’t describe), but no new ideas are required.<br />

Once we’ve done the analysis, we can actually simplify the data structure. It’s not hard to prove<br />

that at most one subtree (the scapegoat) is rebuilt during any insertion. Less obviously, we can even<br />

get the same amortized time bounds (except for a small constant factor) if we maintain only<br />

three integers in addition to the actual tree: the size of the entire tree, the height of the entire tree,<br />

and the number of marked nodes. Whenever an insertion causes the tree to become unbalanced, we<br />

can compute the sizes of all the subtrees on the search path, starting at the new leaf and stopping<br />

at the scapegoat, in time proportional to the size of the scapegoat subtree. Since we need that<br />

much time to re-balance the scapegoat subtree, this computation increases the running time by<br />

only a small constant factor! Thus, unlike almost every other kind of balanced tree, scapegoat<br />

trees require only O(1) extra space.<br />

8.5 Rotations, Double Rotations, and Splaying<br />

Another method for maintaining balance in binary search trees is by adjusting the shape of the<br />

tree locally, using an operation called a rotation. A rotation at a node x decreases its depth by one<br />

and increases its parent’s depth by one. Rotations can be performed in constant time, since they<br />

only involve simple pointer manipulation.<br />

A right rotation at x and a left rotation at y are inverses.<br />

For technical reasons, we will need to use rotations two at a time. There are two types of double<br />

rotations, which might be called roller-coaster and zig-zag. A roller-coaster at a node x consists<br />

of a rotation at x’s parent followed by a rotation at x, both in the same direction. A zig-zag at x<br />

consists of two rotations at x, in opposite directions. Each double rotation decreases the depth of<br />

x by two, leaves the depth of its parent unchanged, and increases the depth of its grandparent by<br />

either one or two, depending on the type of double rotation. Either type of double rotation can be<br />

performed in constant time.<br />

A right roller-coaster at x and a left roller-coaster at z.<br />


A zig-zag at x. The symmetric case is not shown.<br />

Finally, a splay operation moves an arbitrary node in the tree up to the root through a series<br />

of double rotations, possibly with one single rotation at the end. Splaying a node v requires time<br />

proportional to d(v). (Obviously, this means the depth before splaying, since after splaying v is the<br />

root and thus has depth zero!)<br />
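In code, a bottom-up splay with explicit parent pointers might look like the following sketch (all names are mine):

```python
class SplayNode:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate_up(x):
    """A single rotation moving x up one level (a right or left rotation
    at x's parent, depending on which child x is)."""
    p, g = x.parent, x.parent.parent
    if p.left is x:                 # right rotation at p
        p.left = x.right
        if p.left:
            p.left.parent = p
        x.right = p
    else:                           # left rotation at p
        p.right = x.left
        if p.right:
            p.right.parent = p
        x.left = p
    p.parent = x
    x.parent = g
    if g:
        if g.left is p:
            g.left = x
        else:
            g.right = x

def splay(x):
    """Move x to the root by double rotations, with at most one single
    rotation at the end. Returns x, the new root."""
    while x.parent:
        p = x.parent
        g = p.parent
        if g is None:
            rotate_up(x)                    # single rotation at the root
        elif (g.left is p) == (p.left is x):
            rotate_up(p)                    # roller-coaster: rotate parent,
            rotate_up(x)                    # then x, in the same direction
        else:
            rotate_up(x)                    # zig-zag: rotate x twice,
            rotate_up(x)                    # in opposite directions
    return x
```

Splaying the deepest node of a long chain takes time proportional to its depth, and every rotation preserves the inorder sequence of keys.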

Splaying a node. Irrelevant subtrees are omitted for clarity.<br />

8.6 Splay Trees<br />

A splay tree is a binary search tree that is kept more or less balanced by splaying. Intuitively, after<br />

we access any node, we move it to the root with a splay operation. In more detail:<br />

• Search: Find the node containing the key using the usual algorithm, or its predecessor or<br />

successor if the key is not present. Splay whichever node was found.<br />

• Insert: Insert a new node using the usual algorithm, then splay that node.<br />

• Delete: Find the node x to be deleted, splay it, and then delete it. This splits the tree into<br />

two subtrees, one with keys less than x, the other with keys bigger than x. Find the node w<br />

in the left subtree with the largest key (i.e., the inorder predecessor of x in the original tree),<br />

splay it, and finally join it to the right subtree.<br />
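All three operations can be driven by a single splay routine that returns the new root. The sketch below uses the compact top-down formulation of splaying (also due to Sleator and Tarjan), which performs the same kinds of double rotations in one pass down the tree; all names are mine.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def splay(root, key):
    """Top-down splay: bring the node with `key` (or the last node on its
    search path, if `key` is absent) to the root."""
    if root is None:
        return None
    header = Node(None)              # temporary; its key is never compared
    left_tail = right_tail = header
    t = root
    while True:
        if key < t.key:
            if t.left is None:
                break
            if key < t.left.key:     # roller-coaster case: rotate right at t
                y = t.left
                t.left = y.right
                y.right = t
                t = y
                if t.left is None:
                    break
            right_tail.left = t      # t and its right subtree go right
            right_tail = t
            t = t.left
        elif key > t.key:
            if t.right is None:
                break
            if key > t.right.key:    # roller-coaster case: rotate left at t
                y = t.right
                t.right = y.left
                y.left = t
                t = y
                if t.right is None:
                    break
            left_tail.right = t      # t and its left subtree go left
            left_tail = t
            t = t.right
        else:
            break
    left_tail.right = t.left         # reassemble the three pieces
    right_tail.left = t.right
    t.left = header.right
    t.right = header.left
    return t

def insert(root, key):
    root = splay(root, key)
    if root is None:
        return Node(key)
    if root.key == key:
        return root                  # key already present
    new = Node(key)
    if key < root.key:
        new.left, new.right, root.left = root.left, root, None
    else:
        new.right, new.left, root.right = root.right, root, None
    return new

def delete(root, key):
    root = splay(root, key)          # find the node and splay it
    if root is None or root.key != key:
        return root
    if root.left is None:
        return root.right
    pred = splay(root.left, key)     # splays the inorder predecessor
    pred.right = root.right          # join it to the right subtree
    return pred
```

Note how delete follows the recipe above exactly: splay the doomed node, split, splay the predecessor of the deleted key to the top of the left subtree, and join.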

Deleting a node in a splay tree.<br />

Each search, insertion, or deletion consists of a constant number of operations of the form walk<br />

down to a node, and then splay it up to the root. Since the walk down is clearly cheaper than the<br />

splay up, all we need to get good amortized bounds for splay trees is to derive good amortized<br />

bounds for a single splay.<br />


Believe it or not, the easiest way to do this uses the potential method. The rank of any node v<br />

is defined as r(v) = ⌊lg |v|⌋. In particular, if v is the root of the tree, then r(v) = ⌊lg n⌋. We define<br />

the potential of a splay tree to be the sum of the ranks of all its nodes:<br />

Φ = ∑_v r(v) = ∑_v ⌊lg |v|⌋<br />

The amortized analysis of splay trees boils down to the following lemma. Here, r(v) denotes<br />

the rank of v before a (single or double) rotation, and r′(v) denotes its rank afterwards.<br />

Lemma 2. The amortized cost of a single rotation at v is at most 1 + 3r′(v) − 3r(v), and the<br />

amortized cost of a double rotation at v is at most 3r′(v) − 3r(v).<br />

Proving this lemma requires an easy but boring case analysis of the different types of rotations,<br />

which takes up almost a page in Sleator and Tarjan’s original paper. (They call it the ‘Access<br />

Lemma’.) I won’t repeat it here.<br />

By adding up the amortized costs of all the rotations, we find that the total amortized cost of<br />

splaying a node v is at most 1 + 3r′(v) − 3r(v), where r′(v) is the rank of v after the splay. (The<br />

intermediate ranks cancel out in a nice telescoping sum.) But after the splay, v is the root, so<br />

r′(v) = ⌊lg n⌋, which means that the amortized cost of a splay is at most 3 lg n + 1 = O(log n).<br />

Thus, every insertion, deletion, or search in a splay tree takes O(log n) amortized time, which is<br />

optimal.<br />

Actually, splay trees are optimal in a much stronger sense. If p(v) denotes the probability of<br />

searching for a node v, then the amortized search time for v is O(log(1/p(v))). If every node is<br />

equally likely, we have p(v) = 1/n so O(log(1/p(v))) = O(log n), as before. Even if the nodes aren’t<br />

equally likely, though, the optimal static binary search tree for this set of weighted nodes has a<br />

search time of Θ(log(1/p(v))). Splay trees give us this optimal amortized weighted search time<br />

with no change to the data structure or the splaying algorithm.<br />



<strong>CS</strong> <strong>373</strong> Lecture 9: Disjoint Sets Fall 2002<br />

E pluribus unum (Out of many, one)<br />

— Official motto of the United States of America<br />

John: Who’s your daddy? C’mon, you know who your daddy is! Who’s<br />

your daddy? D’Argo, tell him who his daddy is!<br />

D’Argo: I’m your daddy.<br />

— Farscape, “Thanks for Sharing” (June 15, 2001)<br />

9 Data Structures for Disjoint Sets (October 10 and 15)<br />

In this lecture, we describe some methods for maintaining a collection of disjoint sets. Each set is<br />

represented as a pointer-based data structure, with one node per element. Each set has a ‘leader’<br />

element, which uniquely identifies the set. (Since the sets are always disjoint, the same object<br />

cannot be the leader of more than one set.) We want to support the following operations.<br />

• MakeSet(x): Create a new set {x} containing the single element x. The element x must<br />

not appear in any other set in our collection. The leader of the new set is obviously x.<br />

• Find(x): Find (the leader of) the set containing x.<br />

• Union(A, B): Replace two sets A and B in our collection with their union A ∪ B. For<br />

example, Union(A, MakeSet(x)) adds a new element x to an existing set A. The sets A<br />

and B are specified by arbitrary elements, so Union(x, y) has exactly the same behavior as<br />

Union(Find(x), Find(y)).<br />

Disjoint set data structures have lots of applications. For instance, Kruskal’s minimum spanning<br />

tree algorithm relies on such a data structure to maintain the components of the intermediate<br />

spanning forest. Another application might be maintaining the connected components of a graph<br />

as new vertices and edges are added. In both these applications, we can use a disjoint-set data<br />

structure, where we keep a set for each connected component, containing that component’s vertices.<br />

9.1 Reversed Trees<br />

One of the easiest ways to store sets is using trees. Each object points to another object, called<br />

its parent, except for the leader of each set, which points to itself and thus is the root of the tree.<br />

MakeSet is trivial. Find traverses the parent pointers up to the leader. Union just redirects the<br />

parent pointer of one leader to the other. Notice that unlike most tree data structures, objects do<br />

not have pointers down to their children.<br />

MakeSet(x):<br />

parent(x) ← x<br />

Find(x):<br />

while x ≠ parent(x)<br />

x ← parent(x)<br />

return x<br />

Union(x, y):<br />

x ← Find(x)<br />

y ← Find(y)<br />

parent(y) ← x<br />

Merging two sets stored as trees. Arrows point to parents. The shaded node has a new parent.<br />


MakeSet clearly takes Θ(1) time, and Union requires only O(1) time in addition to the two<br />

Finds. The running time of Find(x) is proportional to the depth of x in the tree. It is not hard to<br />

come up with a sequence of operations that results in a tree that is a long chain of nodes, so that<br />

Find takes Θ(n) time in the worst case.<br />

However, there is an easy change we can make to our Union algorithm, called union by depth,<br />

so that the trees always have logarithmic depth. Whenever we need to merge two trees, we always<br />

make the root of the shallower tree a child of the deeper one. This requires us to also maintain the<br />

depth of each tree, but this is quite easy.<br />

MakeSet(x):<br />

parent(x) ← x<br />

depth(x) ← 0<br />

Find(x):<br />

while x ≠ parent(x)<br />

x ← parent(x)<br />

return x<br />

Union(x, y)<br />

x ← Find(x)<br />

y ← Find(y)<br />

if depth(x) > depth(y)<br />

parent(y) ← x<br />

else<br />

parent(x) ← y<br />

if depth(x) = depth(y)<br />

depth(y) ← depth(y) + 1<br />

With this simple change, Find and Union both run in Θ(log n) time in the worst case.<br />
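The same structure translates directly into code; here is a dictionary-based Python sketch (all names are mine):

```python
class UnionByDepth:
    """Reversed trees with union by depth: the root of the shallower
    tree always becomes a child of the deeper root."""
    def __init__(self):
        self.parent = {}
        self.depth = {}

    def make_set(self, x):
        self.parent[x] = x
        self.depth[x] = 0

    def find(self, x):
        # walk the parent pointers up to the leader
        while x != self.parent[x]:
            x = self.parent[x]
        return x

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            return
        if self.depth[x] > self.depth[y]:
            self.parent[y] = x
        else:
            self.parent[x] = y
            if self.depth[x] == self.depth[y]:
                self.depth[y] += 1      # depth grows only on a tie
```

Because the depth of a leader increases only when two equally deep trees merge, a tree of depth d contains at least 2^d elements, which is why Find never follows more than lg n pointers.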

9.2 Shallow Threaded Trees<br />

Alternately, we could just have every object keep a pointer to the leader of its set. Thus, each<br />

set is represented by a shallow tree, where the leader is the root and all the other elements are its<br />

children. With this representation, MakeSet and Find are completely trivial. Both operations<br />

clearly run in constant time. Union is a little more difficult, but not much. Our algorithm sets all<br />

the leader pointers in one set to point to the leader of the other set. To do this, we need a method<br />

to visit every element in a set; we will ‘thread’ a linked list through each set, starting at the set’s<br />

leader. The two threads are merged in the Union algorithm in constant time.<br />

Merging two sets stored as threaded trees.<br />

Bold arrows point to leaders; lighter arrows form the threads. Shaded nodes have a new leader.<br />

MakeSet(x):<br />

leader(x) ← x<br />

next(x) ← Null<br />

Find(x):<br />

return leader(x)<br />


Union(x, y):<br />

x ← Find(x)<br />

y ← Find(y)<br />

z ← y<br />

leader(z) ← x<br />

while (next(z) ≠ Null)<br />

z ← next(z)<br />

leader(z) ← x<br />

next(z) ← next(x)<br />

next(x) ← y<br />



The worst-case running time of Union is a constant times the size of the larger set. Thus, if we<br />

merge a one-element set with another n-element set, the running time can be Θ(n). Generalizing<br />

this idea, it is quite easy to come up with a sequence of n MakeSet and n − 1 Union operations<br />

that requires Θ(n²) time to create the set {1, 2, . . . , n} from scratch.<br />

WorstCaseSequence(n):<br />

MakeSet(1)<br />

for i ← 2 to n<br />

MakeSet(i)<br />

Union(1, i)<br />

We are being stupid in two different ways here. One is the order of operations in WorstCase-<br />

Sequence. Obviously, it would be more efficient to merge the sets in the other order, or to use<br />

some sort of divide and conquer approach. Unfortunately, we can’t fix this; we don’t get to decide<br />

how our data structures are used! The other is that we always update the leader pointers in the<br />

larger set. To fix this, we add a comparison inside the Union algorithm to determine which set is<br />

smaller. This requires us to maintain the size of each set, but that’s easy.<br />

MakeWeightedSet(x):<br />

leader(x) ← x<br />

next(x) ← Null<br />

size(x) ← 1<br />

WeightedUnion(x, y)<br />

x ← Find(x)<br />

y ← Find(y)<br />

if size(x) > size(y)<br />

Union(x, y)<br />

size(x) ← size(x) + size(y)<br />

else<br />

Union(y, x)<br />

size(y) ← size(x) + size(y)<br />
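Here is a Python sketch of the threaded representation with weighted union; it folds the size comparison and the relabeling loop into one method (all names are mine):

```python
class ThreadedSets:
    """Shallow threaded trees with weighted (smaller-into-larger) union."""
    def __init__(self):
        self.leader = {}
        self.next = {}
        self.size = {}

    def make_set(self, x):
        self.leader[x] = x
        self.next[x] = None     # each thread is a Null-terminated list
        self.size[x] = 1

    def find(self, x):
        return self.leader[x]   # O(1): every element knows its leader

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            return
        if self.size[x] < self.size[y]:
            x, y = y, x         # relabel the smaller set, led by y
        z = y
        self.leader[z] = x
        while self.next[z] is not None:   # walk y's thread, relabeling
            z = self.next[z]
            self.leader[z] = x
        self.next[z] = self.next[x]       # splice y's thread into x's
        self.next[x] = y
        self.size[x] += self.size[y]
```

Each Union pays time proportional to the smaller set only, which is exactly what the O(m + n log n) aggregate analysis below charges for.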

The new WeightedUnion algorithm still takes Θ(n) time to merge two n-element sets. However,<br />

in an amortized sense, this algorithm is much more efficient. Intuitively, before we can merge<br />

two large sets, we have to perform a large number of MakeWeightedSet operations.<br />

Theorem 1. A sequence of m MakeWeightedSet operations and n WeightedUnion operations<br />

takes O(m + n log n) time in the worst case.<br />

Proof: Whenever the leader of an object x is changed by a WeightedUnion, the size of the set<br />

containing x increases by at least a factor of two. By induction, if the leader of x has changed<br />

k times, the set containing x has at least 2^k members. After the sequence ends, the largest set<br />

contains at most n members. (Why?) Thus, the leader of any object x has changed at most ⌊lg n⌋<br />

times.<br />

Since each WeightedUnion reduces the number of sets by one, there are m−n sets at the end<br />

of the sequence, and at most n objects are not in singleton sets. Since each of the non-singleton<br />

objects had O(log n) leader changes, the total amount of work done in updating the leader pointers<br />

is O(n log n). <br />

The aggregate method now implies that each WeightedUnion has amortized cost O(log n).<br />


9.3 Path Compression<br />

Using unthreaded trees, Find takes logarithmic time and everything else is constant; using threaded<br />

trees, Union takes logarithmic amortized time and everything else is constant. A third method<br />

allows us to get both of these operations to have almost constant running time.<br />

We start with the original unthreaded tree representation, where every object points to a parent.<br />

The key observation is that in any Find operation, once we determine the leader of an object x,<br />

we can speed up future Finds by redirecting x’s parent pointer directly to that leader. In fact, we<br />

can change the parent pointers of all the ancestors of x all the way up to the root; this is easiest<br />

if we use recursion for the initial traversal up the tree. This modification to Find is called path<br />

compression.<br />

Path compression during Find(c). Shaded nodes have a new parent.<br />

Find(x)<br />

if x ≠ parent(x)<br />

parent(x) ← Find(parent(x))<br />

return parent(x)<br />

If we use path compression, the ‘depth’ field we used earlier to keep the trees shallow is no<br />

longer correct, and correcting it would take way too long. But this information still ensures that<br />

Find runs in Θ(log n) time in the worst case, so we’ll just give it another name: rank.<br />

MakeSet(x):<br />

parent(x) ← x<br />

rank(x) ← 0<br />

Union(x, y)<br />

x ← Find(x)<br />

y ← Find(y)<br />

if rank(x) > rank(y)<br />

parent(y) ← x<br />

else<br />

parent(x) ← y<br />

if rank(x) = rank(y)<br />

rank(y) ← rank(y) + 1<br />
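Putting the compressing Find together with union by rank gives the final structure; a Python sketch (all names are mine):

```python
class DisjointSets:
    """Union by rank with path compression."""
    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}

    def find(self, x):
        if x != self.parent[x]:
            # path compression: point every ancestor directly at the leader
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            return
        if self.rank[x] > self.rank[y]:
            self.parent[y] = x
        else:
            self.parent[x] = y
            if self.rank[x] == self.rank[y]:
                self.rank[y] += 1
```

After any Find, every node on the find path points directly at the leader, so repeating the same Find costs only two pointer reads.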

Ranks have several useful properties that can be verified easily by examining the Union and<br />

Find algorithms. For example:<br />

• If an object x is not a set leader, then the rank of x is strictly less than the rank of its parent.<br />

• Whenever parent(x) changes, the new parent has larger rank than the old parent.<br />

• The size of any set is exponential in the rank of its leader: size(x) ≥ 2^rank(x). (This is easy to<br />

prove by induction, hint hint.)<br />

• In particular, since there are only n objects, the highest possible rank is ⌊lg n⌋.<br />


We can also derive a bound on the number of nodes with a given rank r. Only set leaders can<br />

change their rank. When the rank of a set leader x changes from r − 1 to r, mark all the nodes in<br />

that set. At least 2^r nodes are marked. The next time these nodes get a new leader y, the rank<br />

of y will be at least r + 1. Thus, any node is marked at most once. There are n nodes altogether,<br />

and any object with rank r marks 2^r of them. Thus, there can be at most n/2^r objects of rank r.<br />

Purely as an accounting tool, we will also partition the objects into several numbered blocks.<br />

Specifically, each object x is assigned to block number lg*(rank(x)). In other words, x is in block b<br />

if and only if<br />

2 ↑↑ (b − 1) < rank(x) ≤ 2 ↑↑ b,<br />

where 2 ↑↑ b is the tower function¹<br />

2 ↑↑ b = 2^2^···^2 (a tower of b twos), i.e., 2 ↑↑ 0 = 1 and 2 ↑↑ b = 2^(2↑↑(b−1)) for b > 0.<br />

Since there are at most n/2^r objects with any rank r, the total number of objects in block b is at<br />

most<br />

∑_{r=2↑↑(b−1)+1}^{2↑↑b} n/2^r < ∑_{r=2↑↑(b−1)+1}^{∞} n/2^r = n/2^(2↑↑(b−1)) = n/(2 ↑↑ b).<br />

Every object has a rank between 0 and ⌊lg n⌋, so there are lg* n blocks, numbered from 0 to<br />

lg*⌊lg n⌋ = lg* n − 1.<br />
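Both functions are easy to compute directly, which makes the block boundaries concrete; a small Python sanity check (the function names are mine):

```python
import math

def tower(b):
    """2 ↑↑ b: a tower of b twos (1, 2, 4, 16, 65536, ...)."""
    return 1 if b == 0 else 2 ** tower(b - 1)

def log_star(n):
    """lg* n: how many times lg must be applied to reach 1 or less."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

For example, lg* n ≤ 5 for every n up to 2^65536, which is why the lg* n factor in the analysis below is, for all practical purposes, a constant.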

Theorem 2. If we use both union-by-rank and path compression, the worst-case running time of<br />

a sequence of m operations, n of which are MakeSet operations, is O(m log* n).<br />

Proof: Since each MakeSet and Union operation takes constant time, it suffices to show that<br />

any sequence of m Find operations requires O(m log* n) time in the worst case.<br />

The cost of Find(x_0) is proportional to the number of nodes on the find path from x_0 up to its<br />

leader (before path compression). To count up the total cost of all Finds, we use an accounting<br />

method—each object x_0, x_1, x_2, . . . , x_l on the find path pays a $1 tax into one of several different<br />

bank accounts. After all the Find operations are done, the total amount of money in these accounts<br />

will tell us the total running time.<br />

• The leader x_l pays into the leader account.<br />

• The child of the leader, x_{l−1}, pays into the child account.<br />

• Any other object x_i in a different block from its parent x_{i+1} pays into the block account.<br />

• Any other object x_i in the same block as its parent x_{i+1} pays into the path account.<br />

During any Find operation, one dollar is paid into the leader account, at most one dollar is<br />

paid into the child account, and at most one dollar is paid into the block account for each of the<br />

lg* n blocks. Thus, when the sequence of m operations ends, those three accounts share a total of<br />

at most 2m + m lg* n dollars. The only remaining difficulty is the path account.<br />

1 The arrow notation a ↑↑ b was introduced by Don Knuth in 1976.<br />


Different nodes on the find path pay into different accounts: B=block, P=path, C=child, L=leader.<br />

Horizontal lines are boundaries between blocks. Only the nodes on the find path are shown.<br />

So consider an object xi in block b that pays into the path account. This object is not a set leader,<br />

so its rank can never change. The parent of xi is also not a set leader, so after path compression,<br />

xi acquires a new parent—namely xl—whose rank is strictly larger than that of its old parent xi+1. Since<br />

rank(parent(x)) is always increasing, the parent of xi must eventually lie in a different block than xi,<br />

after which xi will never pay into the path account. Thus, xi can pay into the path account at<br />

most once for every rank in block b, or less than 2 ↑↑ b times overall.<br />

Since block b contains less than n/(2 ↑↑ b) objects, these objects contribute less than n dollars<br />

to the path account. There are lg ∗ n blocks, so the path account receives less than n lg ∗ n dollars<br />

altogether.<br />

Thus, the total amount of money in all four accounts is less than 2m + m lg ∗ n + n lg ∗ n =<br />

O(m lg ∗ n), and this bounds the total running time of the m Find operations. <br />

The aggregate method now implies that each Find has amortized cost O(log ∗ n), which is<br />

significantly better than its worst-case cost Θ(log n).<br />
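The data structure we just analyzed is short enough to sketch in full. Here is a minimal Python version of union by rank with path compression; the class and method names are my own choices, not part of the notes.<br />

```python
class DisjointSets:
    """Union-find with union by rank and path compression."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def find(self, x):
        # Path compression: every node on the find path ends up
        # pointing directly at the leader as the recursion unwinds.
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            return
        # Union by rank: the leader of smaller rank becomes a child
        # of the leader of larger rank; equal ranks bump the winner.
        if self.rank[x] < self.rank[y]:
            x, y = y, x
        self.parent[y] = x
        if self.rank[x] == self.rank[y]:
            self.rank[x] += 1
```

The pointer rewriting inside `find` is exactly the path compression step whose cost the block and path accounts pay for.<br />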

9.4 Ackermann’s Function and Its Inverse<br />

But this amortized time bound can be improved even more! Just to state the correct time bound,<br />

I need to introduce a certain function defined by Wilhelm Ackermann in 1928. The function can<br />

be defined 2 by the following two-parameter recurrence.<br />

        ⎧ 2                  if n = 1<br />

Ai(n) = ⎨ 2n                 if i = 1 and n > 1<br />

        ⎩ Ai−1(Ai(n − 1))    otherwise<br />
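For the tiny arguments we care about, the recurrence can be evaluated directly. A direct Python transcription (safe only for very small i and n, since the values explode almost immediately):<br />

```python
def A(i, n):
    """The two-parameter recurrence above (Buck's variant of Ackermann)."""
    if n == 1:
        return 2           # base case: A_i(1) = 2 for all i
    if i == 1:
        return 2 * n       # A_1(n) = 2n
    return A(i - 1, A(i, n - 1))
```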

Clearly, each Ai(n) is a monotonically increasing function of n, and these functions grow faster and<br />

faster as the index i increases—A2(n) is the power function 2 ↑ n, A3(n) is the tower function 2 ↑↑ n,<br />

A4(n) is the wower function 2 ↑↑↑ n = 2 ↑↑ 2 ↑↑ · · · ↑↑ 2 (with n 2s; so named by John Conway), et cetera ad<br />

infinitum.<br />

2 Ackermann didn’t define his function this way—I’m actually describing a different function defined 35 years later<br />

by R. Creighton Buck—but the exact details of the definition are surprisingly irrelevant!<br />


i       Ai(n)      n = 1   n = 2   n = 3       n = 4             n = 5<br />

i = 1   2n           2       4       6           8                 10<br />

i = 2   2↑n          2       4       8          16                 32<br />

i = 3   2↑↑n         2       4      16       65536            2^65536<br />

i = 4   2↑↑↑n        2       4   65536    2↑↑65536      2↑↑(2↑↑65536)<br />

i = 5   2↑↑↑↑n       2       4   2↑↑65536   2↑↑↑(2↑↑65536)   2↑↑↑(2↑↑↑(2↑↑65536))<br />

Small(!!) values of Ackermann’s function.<br />

The functional inverse of Ackermann’s function is defined as follows:<br />

α(m, n) = min {i | Ai(⌊m/n⌋) > lg n}<br />


〈〈Yeah, right.〉〉<br />

For all practical values of n and m, we have α(m, n) ≤ 4; nevertheless, if we increase m and keep<br />

n fixed, α(m, n) is eventually bigger than any fixed constant.<br />

Bob Tarjan proved the following surprising theorem. The proof of the upper bound 3 is very<br />

similar to the proof of Theorem 2, except that it uses a more complicated ‘block’ structure. The<br />

proof of the matching lower bound 4 is, unfortunately, way beyond the scope of this class. 5<br />

Theorem 3. Using both union by rank and path compression, the worst-case running time of a<br />

sequence of m operations, n of which are MakeSets, is Θ(mα(m, n)). Thus, each operation has<br />

amortized cost Θ(α(m, n)). This time bound is optimal: any pointer-based data structure needs<br />

Ω(mα(m, n)) time to perform these operations.<br />

3<br />

R. E. Tarjan. Efficiency of a good but not linear set union algorithm. J. Assoc. Comput. Mach. 22:215–225, 1975.<br />

4<br />

R. E. Tarjan. A class of algorithms which require non-linear time to maintain disjoint sets. J. Comput. Syst.<br />

Sci. 19:110–127, 1979.<br />

5<br />

But if you like this sort of thing, google for “Davenport-Schinzel sequences”.<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture B: Fibonacci Heaps Fall 2002<br />

B Fibonacci Heaps<br />

B.1 Mergeable Heaps<br />

A mergeable heap is a data structure that stores a collection of keys 1 and supports the following<br />

operations.<br />

• Insert: Insert a new key into a heap. This operation can also be used to create a new heap<br />

containing just one key.<br />

• FindMin: Return the smallest key in a heap.<br />

• DeleteMin: Remove the smallest key from a heap.<br />

• Merge: Merge two heaps into one. The new heap contains all the keys that used to be in<br />

the old heaps, and the old heaps are (possibly) destroyed.<br />

If we never had to use DeleteMin, mergeable heaps would be completely trivial. Each “heap”<br />

just stores the single record (if any) with the smallest key. Inserts and Merges<br />

require only one comparison to decide which record to keep, so they take constant time. FindMin<br />

obviously takes constant time as well.<br />

If we need DeleteMin, but we don’t care how long it takes, we can still implement mergeable<br />

heaps so that Inserts, Merges, and FindMins take constant time. We store the records in a<br />

circular doubly-linked list, and keep a pointer to the minimum key. Now deleting the minimum<br />

key takes Θ(n) time, since we have to scan the linked list to find the new smallest key.<br />
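This lazy implementation is a few lines of Python. In the sketch below, a plain Python list stands in for the circular doubly-linked list, so `merge` is not a true constant-time pointer splice, but the operation costs are otherwise as described.<br />

```python
class MergeableHeap:
    """Trivial mergeable heap: O(1) Insert/FindMin, Θ(n) DeleteMin."""

    def __init__(self):
        self.keys = []
        self.min = None

    def insert(self, key):
        self.keys.append(key)
        if self.min is None or key < self.min:
            self.min = key

    def find_min(self):
        return self.min

    def merge(self, other):
        # With a circular doubly-linked list this would be an O(1) splice.
        self.keys.extend(other.keys)
        if other.min is not None and (self.min is None or other.min < self.min):
            self.min = other.min

    def delete_min(self):
        old = self.min
        self.keys.remove(old)                    # Θ(n) removal
        self.min = min(self.keys, default=None)  # Θ(n) rescan for the new minimum
        return old
```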

In this lecture, I’ll describe a data structure called a Fibonacci heap that supports Inserts,<br />

Merges, and FindMins in constant time, even in the worst case, and also handles DeleteMin in<br />

O(log n) amortized time. That means that any sequence of n Inserts, m Merges, f FindMins,<br />

and d DeleteMins takes O(n + m + f + d log n) time.<br />

B.2 Binomial Trees and Fibonacci Heaps<br />

A Fibonacci heap is a circular doubly linked list, with a pointer to the minimum key, but the<br />

elements of the list are not single keys. Instead, we collect keys together into structures called<br />

binomial heaps. Binomial heaps are trees 2 that satisfy the heap property — every node has a<br />

smaller key than its children — and have the following special structure.<br />

Binomial trees of order 0 through 5.<br />

1 In the earlier lecture on treaps, I called these keys priorities to distinguish them from search keys.<br />

2 CLR uses the name ‘binomial heap’ to describe a more complicated data structure consisting of a set of heap-ordered binomial trees, with at most one binomial tree of each order.<br />


A kth order binomial tree, which I’ll abbreviate Bk, is defined recursively. B0 is a single node.<br />

For all k > 0, Bk consists of two copies of Bk−1 that have been linked together, meaning that the<br />

root of one Bk−1 has become a new child of the other root.<br />

Binomial trees have several useful properties, which are easy to prove by induction (hint, hint).<br />

• The root of Bk has degree k.<br />

• The children of the root of Bk are the roots of B0, B1, . . . , Bk−1.<br />

• Bk has height k.<br />

• Bk has 2^k nodes.<br />

• Bk can be obtained from Bk−1 by adding a new child to every node.<br />

• Bk has (k choose d) nodes at depth d, for all 0 ≤ d ≤ k.<br />

• Bk has 2^(k−h−1) nodes with height h, for all 0 ≤ h < k, and one node (the root) with height k.<br />
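These properties are easy to check experimentally as well. The sketch below builds Bk exactly as defined, by linking two copies of Bk−1; the class and helper names are mine.<br />

```python
class BNode:
    def __init__(self):
        self.children = []

def binomial_tree(k):
    # B0 is a single node; for k > 0, the root of one copy of B(k-1)
    # becomes a new child of the other copy's root.
    if k == 0:
        return BNode()
    a = binomial_tree(k - 1)
    b = binomial_tree(k - 1)
    a.children.append(b)
    return a

def size(v):
    return 1 + sum(size(c) for c in v.children)

def height(v):
    return 1 + max((height(c) for c in v.children), default=-1)
```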

Although we normally don’t care in this class about the low-level details of data structures, we<br />

need to be specific about how Fibonacci heaps are actually implemented, so that we can be sure<br />

that certain operations can be performed quickly. Every node in a Fibonacci heap points to four<br />

other nodes: its parent, its ‘next’ sibling, its ‘previous’ sibling, and one of its children. The sibling<br />

pointers are used to join the roots together into a circular doubly-linked root list. In each binomial<br />

tree, the children of each node are also joined into a circular doubly-linked list using the sibling<br />

pointers.<br />


A high-level view and a detailed view of the same Fibonacci heap. Null pointers are omitted for clarity.<br />

With this representation, we can add or remove nodes from the root list, merge two root lists<br />

together, link one binomial tree to another, or merge a node’s list of children with the root list,<br />

in constant time, and we can visit every node in the root list in constant time per node. Having<br />

established that these primitive operations can be performed quickly, we never again need to think<br />

about the low-level representation details.<br />
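In code, a node might look like the sketch below; the `splice` helper shows why merging two circular sibling lists is a constant-time pointer operation. The names here are illustrative, not taken from the notes.<br />

```python
class FibNode:
    """One Fibonacci heap node: four pointers, plus key and mark bit."""

    def __init__(self, key):
        self.key = key
        self.parent = None
        self.child = None    # pointer to just ONE child; the rest are
                             # reachable through that child's sibling list
        self.next = self     # a lone node is its own circular list
        self.prev = self
        self.marked = False

def splice(a, b):
    """Merge the circular sibling lists containing a and b, in O(1).
    Assumes a and b start in different lists."""
    a_next, b_next = a.next, b.next
    a.next, b_next.prev = b_next, a
    b.next, a_next.prev = a_next, b
```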

B.3 Operations on Fibonacci Heaps<br />

The Insert, Merge, and FindMin algorithms for Fibonacci heaps are exactly like the corresponding<br />

algorithms for linked lists. Since we maintain a pointer to the minimum key, FindMin is trivial.<br />

To insert a new key, we add a single node (which we should think of as a B0) to the root list and (if<br />

necessary) update the pointer to the minimum key. To merge two Fibonacci heaps, we just merge<br />

the two root lists and keep the pointer to the smaller of the two minimum keys. Clearly, all three<br />

operations take O(1) time.<br />


Deleting the minimum key is a little more complicated. First, we remove the minimum key<br />

from the root list and splice its children into the root list. Except for updating the parent pointers,<br />

this takes O(1) time. Then we scan through the root list to find the new smallest key and update<br />

the parent pointers of the new roots. This scan could take Θ(n) time in the worst case. To bring<br />

down the amortized deletion time, we apply a Cleanup algorithm, which links pairs of equal-size<br />

binomial heaps until there is only one binomial heap of any particular size.<br />

Let me describe the Cleanup algorithm in more detail, so we can analyze its running time.<br />

The following algorithm maintains a global array B[0 .. ⌊lg n⌋], where B[i] is a pointer to some<br />

previously-visited binomial heap of order i, or Null if there is no such binomial heap. Notice that<br />

Cleanup simultaneously resets the parent pointers of all the new roots and updates the pointer to<br />

the minimum key. I’ve split off the part of the algorithm that merges binomial heaps of the same<br />

order into a separate subroutine MergeDupes.<br />

Cleanup:<br />

  newmin ← some node in the root list<br />

  for i ← 0 to ⌊lg n⌋<br />

    B[i] ← Null<br />

  for all nodes v in the root list<br />

    parent(v) ← Null (⋆)<br />

    if key(newmin) > key(v)<br />

      newmin ← v<br />

    MergeDupes(v)<br />

MergeDupes(v):<br />

  w ← B[deg(v)]<br />

  while w ≠ Null<br />

    B[deg(v)] ← Null<br />

    if key(v) > key(w)<br />

      swap v ⇆ w<br />

    remove w from the root list (⋆⋆)<br />

    link w to v<br />

    w ← B[deg(v)]<br />

  B[deg(v)] ← v<br />

MergeDupes(v), ensuring that no earlier root has the same degree as v.<br />

Notice that MergeDupes is careful to merge heaps so that the heap property is maintained—<br />

the heap whose root has the larger key becomes a new child of the heap whose root has the smaller<br />

key. This is handled by swapping v and w if their keys are in the wrong order.<br />
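The same linking logic can be exercised on a toy representation, where a root is just a pair (key, children) and deg(v) is the length of its child list. This is a simplification for illustration, not a full Fibonacci heap.<br />

```python
def link(a, b):
    """The root with the larger key becomes a child of the other root."""
    if a[0] > b[0]:
        a, b = b, a
    a[1].append(b)
    return a

def cleanup(roots):
    """Link roots of equal degree until every degree is distinct,
    mirroring the Cleanup/MergeDupes loop above."""
    B = {}                      # B[d] = previously-visited root of degree d
    for v in roots:
        while len(v[1]) in B:   # another root with the same degree?
            v = link(v, B.pop(len(v[1])))
        B[len(v[1])] = v
    return list(B.values())
```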

The running time of Cleanup is O(r ′ ), where r ′ is the length of the root list just before<br />

Cleanup is called. The easiest way to see this is to count the number of times the two starred lines<br />

can be executed: line (⋆) is executed once for every node v on the root list, and line (⋆⋆) is executed<br />

at most once for every node w on the root list. Since DeleteMin does only a constant amount<br />

of work before calling Cleanup, the running time of DeleteMin is O(r ′ ) = O(r + deg(min))<br />

where r is the number of roots before DeleteMin begins, and min is the node deleted.<br />

Although deg(min) is at most lg n, we can still have r = Θ(n) (for example, if nothing has been<br />

deleted yet), so the worst-case time for a DeleteMin is Θ(n). After a DeleteMin, the root list<br />

has length O(log n), since all the binomial heaps have unique orders and the largest has order at<br />

most ⌊lg n⌋.<br />

B.4 Amortized Analysis of DeleteMin<br />

To bound the amortized cost, observe that each insertion increments r. If we charge a constant<br />

‘cleanup tax’ for each insertion, and use the collected tax to pay for the Cleanup algorithm, the<br />


unpaid cost of a DeleteMin is only O(deg(min)) = O(log n).<br />

More formally, define the potential of the Fibonacci heap to be the number of roots. Recall<br />

that the amortized time of an operation can be defined as its actual running time plus the increase<br />

in potential, provided the potential is initially zero (it is) and we never have negative potential<br />

(we never do). Let r be the number of roots before a DeleteMin, and let r ′′ denote the number<br />

of roots afterwards. The actual cost of DeleteMin is r + deg(min), and the number of roots<br />

increases by r ′′ − r, so the amortized cost is r ′′ + deg(min). Since r ′′ = O(log n) and the degree of<br />

any node is O(log n), the amortized cost of DeleteMin is O(log n).<br />

Each Insert adds only one root, so its amortized cost is still constant. A Merge actually<br />

doesn’t change the number of roots, since the new Fibonacci heap has all the roots from its constituents<br />

and no others, so its amortized cost is also constant.<br />

B.5 Decreasing Keys<br />

In some applications of heaps, we also need the ability to delete an arbitrary node. The usual way<br />

to do this is to decrease the node’s key to −∞, and then use DeleteMin. Here I’ll describe how<br />

to decrease the key of a node in a Fibonacci heap; the algorithm will take O(log n) time in the<br />

worst case, but the amortized time will be only O(1).<br />

Our algorithm for decreasing the key at a node v follows two simple rules.<br />

1. Promote v up to the root list. (This moves the whole subtree rooted at v.)<br />

2. As soon as two children of any node w have been promoted, immediately promote w.<br />

In order to enforce the second rule, we now mark certain nodes in the Fibonacci heap. Specifically,<br />

a node is marked if exactly one of its children has been promoted. If some child of a marked node<br />

is promoted, we promote (and unmark) that node as well. Whenever we promote a marked node,<br />

we unmark it; this is the only way to unmark a node. (Specifically, splicing nodes into the root list<br />

during a DeleteMin is not considered a promotion.)<br />

Here’s a more formal description of the algorithm. The input is a pointer to a node v and the<br />

new value k for its key.<br />

DecreaseKey(v, k):<br />

key(v) ← k<br />

update the pointer to the smallest key<br />

Promote(v)<br />

Promote(v):<br />

unmark v<br />

if parent(v) ≠ Null<br />

remove v from parent(v)’s list of children<br />

insert v into the root list<br />

if parent(v) is marked<br />

Promote(parent(v))<br />

else<br />

mark parent(v)<br />

The Promote algorithm calls itself recursively, resulting in a ‘cascading promotion’. Each<br />

consecutive marked ancestor of v is promoted to the root list and unmarked, otherwise unchanged.<br />

The lowest unmarked ancestor is then marked, since one of its children has been promoted.<br />


Decreasing the keys of four nodes: first f, then d, then j, and finally h. Dark nodes are marked.<br />

DecreaseKey(h) causes nodes b and a to be recursively promoted.<br />

The time to decrease the key of a node v is O(1+#consecutive marked ancestors of v). Binomial<br />

heaps have logarithmic depth, so if we still had only full binomial heaps, the running time would<br />

be O(log n). Unfortunately, promoting nodes destroys the nice binomial tree structure; our trees<br />

no longer have logarithmic depth! In fact, DecreaseKey runs in Θ(n) time in the worst case.<br />

To compute the amortized cost of DecreaseKey, we’ll use the potential method, just as we<br />

did for DeleteMin. We need to find a potential function Φ that goes up a little whenever we do<br />

a little work, and goes down a lot whenever we do a lot of work. DecreaseKey unmarks several<br />

marked ancestors and possibly also marks one node. So the number of marked nodes might be an<br />

appropriate potential function here. Whenever we do a little bit of work, the number of marks goes<br />

up by at most one; whenever we do a lot of work, the number of marks goes down a lot.<br />

More precisely, let m and m ′ be the number of marked nodes before and after a DecreaseKey<br />

operation. The actual time (ignoring constant factors) is<br />

t = 1 + #consecutive marked ancestors of v,<br />

and if we set Φ = m, the increase in potential is<br />

m ′ − m ≤ 1 − #consecutive marked ancestors of v.<br />

Since t + ∆Φ ≤ 2, the amortized cost of DecreaseKey is O(1).<br />

B.6 Bounding the Degree<br />

But now we have a problem with our earlier analysis of DeleteMin. The amortized time for a<br />

DeleteMin is still O(r + deg(min)). To show that this equaled O(log n), we used the fact that the<br />

maximum degree of any node is O(log n), which implies that after a Cleanup the number of roots<br />

is O(log n). But now that we don’t have complete binomial heaps, this ‘fact’ is no longer obvious!<br />

So let’s prove it. For any node v, let |v| denote the number of nodes in the subtree of v,<br />

including v itself. Our proof uses the following lemma, which finally tells us why these things are<br />

called Fibonacci heaps.<br />

Lemma 1. For any node v in a Fibonacci heap, |v| ≥ F_{deg(v)+2}.<br />

Proof: Label the children of v in the chronological order in which they were linked to v. Consider<br />

the situation just before the ith oldest child wi was linked to v. At that time, v had at least i − 1<br />

children (possibly more). Since Cleanup only links trees with the same degree, we had deg(wi) =<br />


deg(v) ≥ i − 1. Since that time, at most one child of wi has been promoted away; otherwise, wi<br />

would have been promoted to the root list by now. So currently we have deg(wi) ≥ i − 2.<br />

We also quickly observe that deg(wi) ≥ 0. (Duh.)<br />

Let sd be the minimum possible size of a tree with degree d in any Fibonacci heap. Clearly<br />

s0 = 1; for notational convenience, let s−1 = 1 also. By our earlier argument, the ith oldest child<br />

of the root has degree at least max{0, i − 2}, and thus has size at least max{1, si−2} = si−2. Thus,<br />

we have the following recurrence:<br />

sd ≥ 1 + ∑_{i=1}^{d} s_{i−2}.<br />

If we assume inductively that si ≥ Fi+2 for all −1 ≤ i < d (with the easy base cases s−1 = F1 and<br />

s0 = F2), we have<br />

sd ≥ 1 + ∑_{i=1}^{d} Fi = Fd+2.<br />

(The last step was a practice problem in Homework 0.) By definition, |v| ≥ s_{deg(v)}. □<br />

You can easily show (using either induction or the annihilator method) that Fk+2 > φ^k, where<br />

φ = (1 + √5)/2 ≈ 1.618 is the golden ratio. Thus, Lemma 1 implies that<br />

deg(v) ≤ log_φ |v| = O(log |v|).<br />

Thus, since the size of any subtree in an n-node Fibonacci heap is obviously at most n, the degree<br />

of any node is O(log n), which is exactly what we wanted. Our earlier analysis is still good.<br />
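The bound is also easy to sanity-check numerically: Fk+2 really does dominate φ^k, so no node in an n-node heap can have degree much larger than log_φ n ≈ 1.44 lg n. (A quick check, with names of my own choosing.)<br />

```python
import math

def fib(k):
    """Iterative Fibonacci numbers with F1 = F2 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

PHI = (1 + math.sqrt(5)) / 2

def degree_bound(n):
    """Lemma 1 implies deg(v) <= log_phi(n) in an n-node Fibonacci heap."""
    return int(math.log(n, PHI))
```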

B.7 Analyzing Everything Together<br />

Unfortunately, our analyses of DeleteMin and DecreaseKey used two different potential functions.<br />

Unless we can find a single potential function that works for both operations, we can’t claim<br />

both amortized time bounds simultaneously. So we need to find a potential function Φ that goes<br />

up a little during a cheap DeleteMin or a cheap DecreaseKey, and goes down a lot during an<br />

expensive DeleteMin or an expensive DecreaseKey.<br />

Let’s look a little more carefully at the cost of each Fibonacci heap operation, and its effect<br />

on both the number of roots and the number of marked nodes, the things we used as our earlier<br />

potential functions. Let r and m be the numbers of roots and marks before each operation, and<br />

let r ′ and m ′ be the numbers of roots and marks after the operation.<br />

operation actual cost r ′ − r m ′ − m<br />

Insert 1 1 0<br />

Merge 1 0 0<br />

DeleteMin r + deg(min) r ′ − r 0<br />

DecreaseKey 1 + m − m ′ 1 + m − m ′ m ′ − m<br />

In particular, notice that promoting a node in DecreaseKey requires constant time and increases<br />

the number of roots by one, and that we promote (at most) one unmarked node.<br />

If we guess that the correct potential function is a linear combination of our old potential<br />

functions r and m and play around with various possibilities for the coefficients, we will eventually<br />

stumble across the correct answer:<br />

Φ = r + 2m<br />


To see that this potential function gives us good amortized bounds for every Fibonacci heap operation,<br />

let’s add two more columns to our table.<br />

operation actual cost r ′ − r m ′ − m Φ ′ − Φ amortized cost<br />

Insert 1 1 0 1 2<br />

Merge 1 0 0 0 1<br />

DeleteMin r + deg(min) r ′ − r 0 r ′ − r r ′ + deg(min)<br />

DecreaseKey 1 + m − m ′ 1 + m − m ′ m ′ − m 1 + m ′ − m 2<br />

Since Lemma 1 implies that r ′ + deg(min) = O(log n), we’re finally done! (Whew!)<br />

B.8 Fibonacci Trees<br />

To give you a little more intuition about how Fibonacci heaps behave, let’s look at a worst-case<br />

construction for Lemma 1. Suppose we want to remove as many nodes as possible from a binomial<br />

heap of order k, by promoting various nodes to the root list, but without causing any cascading<br />

promotions. The most damage we can do is to promote the largest subtree of every node. Call the<br />

result a Fibonacci tree of order k + 1, and denote it fk+1. As a base case, let f1 be the tree with<br />

one (unmarked) node, that is, f1 = B0. The reason for shifting the index should be obvious after<br />

a few seconds.<br />

Fibonacci trees of order 1 through 6. Light nodes have been promoted away; dark nodes are marked.<br />

Recall that the root of a binomial tree Bk has k children, which are roots of B0, B1, . . . , Bk−1.<br />

To convert Bk to fk+1, we promote the root of Bk−1, and recursively convert each of the other<br />

subtrees Bi to fi+1. The root of the resulting tree fk+1 has degree k − 1, and the children are the<br />

roots of smaller Fibonacci trees f1, f2, . . . , fk−1. We can also consider Bk as two copies of Bk−1<br />

linked together. It’s quite easy to show that an order-k Fibonacci tree consists of an order k − 2<br />

Fibonacci tree linked to an order k − 1 Fibonacci tree. (See the picture below.)<br />

Comparing the recursive structures of B6 and f7.<br />
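This recursive structure translates directly into code. The sketch below builds fk and confirms the Fibonacci node count; the class and function names are mine.<br />

```python
class FTNode:
    def __init__(self):
        self.children = []

def fib_tree(k):
    # f1 = f2 = a single node; fk is an order-(k-1) Fibonacci tree
    # with an order-(k-2) Fibonacci tree linked to its root.
    if k <= 2:
        return FTNode()
    root = fib_tree(k - 1)
    root.children.append(fib_tree(k - 2))
    return root

def size(v):
    return 1 + sum(size(c) for c in v.children)
```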

Since f1 and f2 both have exactly one node, the number of nodes in an order-k Fibonacci tree<br />

is exactly the kth Fibonacci number! (That’s why we shifted the index.) Like binomial trees,<br />

Fibonacci trees have lots of other nice properties that are easy to prove by induction (hint, hint):<br />

• The root of fk has degree k − 2.<br />

• fk can be obtained from fk−1 by adding a new unmarked child to every marked node and<br />

then marking all the old unmarked nodes.<br />


• fk has height ⌈k/2⌉ − 1.<br />

• fk has Fk−2 unmarked nodes, Fk−1 marked nodes, and thus Fk nodes altogether.<br />

• fk has (k−d−2 choose d−1) unmarked nodes, (k−d−2 choose d) marked nodes, and (k−d−1 choose d)<br />

total nodes at depth d, for all 0 ≤ d ≤ ⌊k/2⌋ − 1.<br />

• fk has Fk−2h−1 nodes with height h, for all 0 ≤ h ≤ ⌊k/2⌋ − 1, and one node (the root) with<br />

height ⌈k/2⌉ − 1.<br />



<strong>CS</strong> <strong>373</strong> Lecture 10: Basic Graph Properties Fall 2002<br />

Obie looked at the seein’ eye dog. Then at the twenty-seven 8 by 10 color glossy pictures<br />

with the circles and arrows and a paragraph on the back of each one. . . and then he looked<br />

at the seein’ eye dog. And then at the twenty-seven 8 by 10 color glossy pictures with the<br />

circles and arrows and a paragraph on the back of each one and began to cry.<br />

Because Obie came to the realization that it was a typical case of American blind justice,<br />

and there wasn’t nothin’ he could do about it, and the judge wasn’t gonna look at the<br />

twenty-seven 8 by 10 color glossy pictures with the circles and arrows and a paragraph on<br />

the back of each one explainin’ what each one was, to be used as evidence against us.<br />

And we was fined fifty dollars and had to pick up the garbage. In the snow.<br />

But that’s not what I’m here to tell you about.<br />

— Arlo Guthrie, “Alice’s Restaurant” (1966)<br />

10 Basic Graph Properties (October 17)<br />

10.1 Definitions<br />

A graph G is a pair of sets (V, E). V is a set of arbitrary objects which we call vertices 1 or nodes.<br />

E is a set of vertex pairs, which we call edges or occasionally arcs. In an undirected graph, the<br />

edges are unordered pairs, or just sets containing two vertices. In a directed graph, the edges are<br />

ordered pairs of vertices. We will only be concerned with simple graphs, where there is no edge<br />

from a vertex to itself and there is at most one edge from any vertex to any other.<br />

Following standard (but admittedly confusing) practice, I’ll also use V to denote the number of<br />

vertices in a graph, and E to denote the number of edges. Thus, in an undirected graph, we have<br />

0 ≤ E ≤ (V choose 2) = V (V − 1)/2, and in a directed graph, 0 ≤ E ≤ V (V − 1).<br />

We usually visualize graphs by looking at an embedding. An embedding of a graph maps each<br />

vertex to a point in the plane and each edge to a curve or straight line segment between the two<br />

vertices. A graph is planar if it has an embedding where no two edges cross. The same graph can<br />

have many different embeddings, so it is important not to confuse a particular embedding with the<br />

graph itself. In particular, planar graphs can have non-planar embeddings!<br />


A non-planar embedding of a planar graph with nine vertices, thirteen edges, and two connected components,<br />

and a planar embedding of the same graph.<br />

There are other ways of visualizing and representing graphs that are sometimes also useful. For<br />

example, the intersection graph of a collection of objects has a node for every object and an edge<br />

for every intersecting pair. Whether a particular graph can be represented as an intersection graph<br />

depends on what kind of object you want to use for the vertices. Different types of objects—line<br />

segments, rectangles, circles, etc.—define different classes of graphs. One particularly useful type<br />

of intersection graph is an interval graph, whose vertices are intervals on the real line, with an edge<br />

between any two intervals that overlap.<br />
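For instance, an interval graph can be computed with a brute-force pairwise overlap test. (A sketch; I treat the intervals as closed, so touching endpoints count as an overlap.)<br />

```python
def interval_graph(intervals):
    """Intersection graph of closed intervals on the real line:
    vertex i is intervals[i], with an edge between overlapping pairs."""
    edges = []
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            (a, b), (c, d) = intervals[i], intervals[j]
            if a <= d and c <= b:   # the intervals share at least one point
                edges.append((i, j))
    return edges
```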

1 The singular of ‘vertices’ is vertex. The singular of ‘matrices’ is matrix. Unless you’re speaking Italian, there<br />

is no such thing as a vertice, a matrice, an indice, an appendice, a helice, an apice, a vortice, a radice, a simplice, a<br />

directrice, a dominatrice, a Unice, a Kleenice, or Jimi Hendrice!<br />



The example graph is also the intersection graph of (a) a set of line segments, (b) a set of circles,<br />

or (c) a set of intervals on the real line (stacked for visibility).<br />

If (u, v) is an edge in an undirected graph, then u is a neighbor of v and vice versa. The degree<br />

of a node is the number of neighbors. In directed graphs, we have two kinds of neighbors. If u → v<br />

is a directed edge, then u is a predecessor of v and v is a successor of u. The in-degree of a node<br />

is the number of predecessors, which is the same as the number of edges going into the node. The<br />

out-degree is the number of successors, or the number of edges going out of the node.<br />

A graph G ′ = (V ′ , E ′ ) is a subgraph of G = (V, E) if V ′ ⊆ V and E ′ ⊆ E.<br />

A path is a sequence of edges, where each successive pair of edges shares a vertex, and all other<br />

edges are disjoint. A graph is connected if there is a path from any vertex to any other vertex.<br />

A disconnected graph consists of several connected components, which are maximal connected<br />

subgraphs. Two vertices are in the same connected component if and only if there is a path<br />

between them.<br />

A cycle is a path that starts and ends at the same vertex, and has at least one edge. A graph<br />

is acyclic if no subgraph is a cycle; acyclic graphs are also called forests. Trees are special graphs<br />

that can be defined in several different ways. You can easily prove by induction (hint, hint, hint)<br />

that the following definitions are equivalent.<br />

• A tree is a connected acyclic graph.<br />

• A tree is a connected component of a forest.<br />

• A tree is a connected graph with at most V − 1 edges.<br />

• A tree is a minimal connected graph; removing any edge makes the graph disconnected.<br />

• A tree is an acyclic graph with at least V − 1 edges.<br />

• A tree is a maximal acyclic graph; adding an edge between any two vertices creates a cycle.<br />
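These characterizations combine into a quick test: a graph is a tree if and only if it has exactly V − 1 edges and no edge closes a cycle. A sketch using the union-find structure from the disjoint-sets lecture:<br />

```python
def is_tree(n, edges):
    """A graph on vertices 0..n-1 is a tree iff it has exactly n - 1
    edges and none of them closes a cycle (checked with union-find)."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False    # this edge would create a cycle
        parent[ru] = rv
    return True
```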

A spanning tree of a graph G is a subgraph that is a tree and contains every vertex of G. Of course,<br />

a graph can only have a spanning tree if it’s connected. A spanning forest of G is a collection of<br />

spanning trees, one for each connected component of G.<br />

10.2 Explicit Representations of Graphs<br />

There are two common data structures used to explicitly represent graphs: adjacency matrices 2<br />

and adjacency lists.<br />

The adjacency matrix of a graph G is a V × V matrix of indicator variables. Each entry in the<br />

matrix indicates whether a particular edge is or is not in the graph:<br />

A[i, j] = [ (i, j) ∈ E ].<br />

2 See footnote 1.<br />


CS 373 Lecture 10: Basic Graph Properties Fall 2002<br />

For undirected graphs, the adjacency matrix is always symmetric: A[i, j] = A[j, i]. Since we don’t<br />

allow edges from a vertex to itself, the diagonal elements A[i, i] are all zeros.<br />

Given an adjacency matrix, we can decide in Θ(1) time whether two vertices are connected by<br />

an edge just by looking in the appropriate slot in the matrix. We can also list all the neighbors of a<br />

vertex in Θ(V ) time by scanning the corresponding row (or column). This is optimal in the worst<br />

case, since a vertex can have up to V − 1 neighbors; however, if a vertex has few neighbors, we may<br />

still have to examine every entry in the row to see them all. Similarly, adjacency matrices require<br />

Θ(V²) space, regardless of how many edges the graph actually has, so they are space-efficient only<br />

for very dense graphs.<br />

  a b c d e f g h i<br />
a 0 1 1 0 0 0 0 0 0<br />
b 1 0 1 1 1 0 0 0 0<br />
c 1 1 0 1 1 0 0 0 0<br />
d 0 1 1 0 1 1 0 0 0<br />
e 0 1 1 1 0 1 0 0 0<br />
f 0 0 0 1 1 0 0 0 0<br />
g 0 0 0 0 0 0 0 1 1<br />
h 0 0 0 0 0 0 1 0 1<br />
i 0 0 0 0 0 0 1 1 0<br />

Adjacency matrix and adjacency list representations for the example graph.<br />

For sparse graphs—graphs with relatively few edges—we’re better off using adjacency lists. An<br />

adjacency list is an array of linked lists, one list per vertex. Each linked list stores the neighbors<br />

of the corresponding vertex.<br />

For undirected graphs, each edge (u, v) is stored twice, once in u’s neighbor list and once in<br />

v’s neighbor list; for directed graphs, each edge is stored only once. Either way, the overall space<br />

required for an adjacency list is O(V +E). Listing the neighbors of a node v takes O(1+deg(v)) time;<br />

just scan the neighbor list. Similarly, we can determine whether (u, v) is an edge in O(1 + deg(u))<br />

time by scanning the neighbor list of u. For undirected graphs, we can speed up the search by<br />

simultaneously scanning the neighbor lists of both u and v, stopping either when we locate the edge<br />

or when we fall off the end of a list. This takes O(1 + min{deg(u), deg(v)}) time.<br />

The adjacency list structure should immediately remind you of hash tables with chaining. Just<br />

as with hash tables, we can make the adjacency list structure more efficient by using something besides<br />

a linked list to store the neighbors. For example, if we use a hash table with constant load factor,<br />

then we can detect edges in O(1) expected time, just as with an adjacency matrix. In practice, this<br />

will only be useful for vertices with large degree, since the constant overhead in both the space and<br />

search time is larger for hash tables than for simple linked lists.<br />

You might at this point ask why anyone would ever use an adjacency matrix. After all, if you<br />

use hash tables to store the neighbors of each vertex, you can do everything as fast or faster with an<br />

adjacency list as with an adjacency matrix, only using less space. The answer is that many graphs<br />

are only represented implicitly. For example, intersection graphs are usually represented implicitly<br />

by simply storing the list of objects. As long as we can test whether two objects overlap in constant<br />

time, we can apply any graph algorithm to an intersection graph by pretending that it is stored<br />

explicitly as an adjacency matrix. On the other hand, any data structure built from records with<br />

pointers between them can be seen as a directed graph. Algorithms for searching graphs can be<br />

applied to these data structures by pretending that the graph is represented explicitly using an<br />

adjacency list.<br />

To keep things simple, we’ll consider only undirected graphs for the rest of this lecture, although<br />

the algorithms I’ll describe also work for directed graphs.<br />


10.3 Traversing connected graphs<br />

Suppose we want to visit every node in a connected graph (represented either explicitly or implicitly).<br />

The simplest method to do this is an algorithm called depth-first search, which can be written<br />

either recursively or iteratively. It’s exactly the same algorithm either way; the only difference is<br />

that we can actually see the ‘recursion’ stack in the non-recursive version. Both versions are initially<br />

passed a source vertex s.<br />

    RecursiveDFS(v):
      if v is unmarked
        mark v
        for each edge (v, w)
          RecursiveDFS(w)

    IterativeDFS(s):
      Push(s)
      while stack not empty
        v ← Pop
        if v is unmarked
          mark v
          for each edge (v, w)
            Push(w)
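With the adjacency-list representation (`adj` maps each vertex to a list of neighbors), the two versions might look like this in Python; the function names mirror the pseudocode, and the triangle graph at the bottom is made-up test data:

```python
def recursive_dfs(adj, v, marked=None):
    """Mark every vertex reachable from v (RecursiveDFS from the notes)."""
    if marked is None:
        marked = set()
    if v not in marked:
        marked.add(v)
        for w in adj[v]:
            recursive_dfs(adj, w, marked)
    return marked

def iterative_dfs(adj, s):
    """The same traversal with an explicit stack (IterativeDFS from the notes)."""
    marked, stack = set(), [s]
    while stack:
        v = stack.pop()
        if v not in marked:
            marked.add(v)
            for w in adj[v]:
                stack.append(w)
    return marked

adj = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b'], 'd': []}
assert recursive_dfs(adj, 'a') == iterative_dfs(adj, 'a') == {'a', 'b', 'c'}
```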

Depth-first search is one (perhaps the most common) instance of a general family of graph<br />

traversal algorithms. The generic graph traversal algorithm stores a set of candidate edges in some<br />

data structure that I’ll call a ‘bag’. The only important properties of a ‘bag’ are that we can put<br />

stuff into it and then later take stuff back out. (In C++ terms, think of the ‘bag’ as a template for<br />

a real data structure.) Here’s the algorithm:<br />

    Traverse(s):
      put (∅, s) in bag
      while the bag is not empty
        take (p, v) from the bag      (⋆)
        if v is unmarked
          mark v
          parent(v) ← p
          for each edge (v, w)        (†)
            put (v, w) into the bag   (⋆⋆)

Notice that we’re keeping edges in the bag instead of vertices. This is because we want to<br />

remember, whenever we visit a vertex v for the first time, which previously-visited vertex p put v<br />

into the bag. The vertex p is called the parent of v.<br />
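A Python sketch of the generic algorithm, using `collections.deque` as the ‘bag’ so that one function covers both disciplines (the keyword argument and names here are mine): popping from the right gives a stack, popping from the left gives a queue.

```python
from collections import deque

def traverse(adj, s, bag='stack'):
    """Generic Traverse(s).  A stack bag gives depth-first search, a queue
    bag gives breadth-first search.  Returns the parent pointers, which
    form a spanning tree of s's connected component."""
    parent = {}
    bag_edges = deque([(None, s)])
    take = bag_edges.pop if bag == 'stack' else bag_edges.popleft
    while bag_edges:
        p, v = take()
        if v not in parent:
            parent[v] = p          # mark v and record who put it in the bag
            for w in adj[v]:
                bag_edges.append((v, w))
    return parent

adj = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b']}
tree = traverse(adj, 'a')
assert tree['a'] is None and len(tree) == 3   # one root, V - 1 parent edges
```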

Lemma 1. Traverse(s) marks every vertex in any connected graph exactly once, and the set of<br />

edges (v, parent(v)) with parent(v) ≠ ∅ form a spanning tree of the graph.<br />

Proof: First, it should be obvious that no node is marked more than once.<br />

Clearly, the algorithm marks s. Let v ≠ s be a vertex, and let s → · · · → u → v be the path<br />

from s to v with the minimum number of edges. Since the graph is connected, such a path always<br />

exists. (If s and v are neighbors, then u = s, and the path has just one edge.) If the algorithm<br />

marks u, then it must put (u, v) into the bag, so it must later take (u, v) out of the bag, at which<br />

point v must be marked (if it isn’t already). Thus, by induction on the shortest-path distance<br />

from s, the algorithm marks every vertex in the graph.<br />

Call an edge (v, parent(v)) with parent(v) ≠ ∅ a parent edge. For any node v, the path of<br />

parent edges v → parent(v) → parent(parent(v)) → · · · eventually leads back to s, so the set of<br />


parent edges form a connected graph. Clearly, both endpoints of every parent edge are marked,<br />

and the number of parent edges is exactly one less than the number of vertices. Thus, the parent<br />

edges form a spanning tree. <br />

The exact running time of the traversal algorithm depends on how the graph is represented and<br />

what data structure is used as the ‘bag’, but we can make a few general observations. Since each<br />

vertex is visited at most once, the for loop (†) is executed at most V times. Each edge is put into<br />

the bag exactly twice; once as (u, v) and once as (v, u), so line (⋆⋆) is executed at most 2E times.<br />

Finally, since we can’t take more things out of the bag than we put in, line (⋆) is executed at most<br />

2E + 1 times.<br />

10.4 Examples<br />

Let’s first assume that the graph is represented by an adjacency list, so that the overhead of the<br />

for loop (†) is only a constant per edge.<br />

• If we implement the ‘bag’ by using a stack, we have depth-first search. Each execution of<br />

(⋆) or (⋆⋆) takes constant time, so the overall running time is O(V + E). Since the graph<br />

is connected, V ≤ E + 1, so we can simplify the running time to O(E). The spanning tree<br />

formed by the parent edges is called a depth-first spanning tree. The exact shape of the tree<br />

depends on the order in which neighbor edges are pushed onto the stack, but in general,<br />

depth-first spanning trees are long and skinny.<br />

• If we use a queue instead of a stack, we have breadth-first search. Again, each execution of<br />

(⋆) or (⋆⋆) takes constant time, so the overall running time is still O(E). In this case, the<br />

breadth-first spanning tree formed by the parent edges contains shortest paths from the start<br />

vertex s to every other vertex in its connected component. The exact shape of the shortest<br />

path tree depends on the order in which neighbor edges are pushed onto the queue, but in general,<br />

in general, shortest path trees are short and bushy. We’ll see shortest paths again next week.<br />

[Figure: A depth-first spanning tree and a breadth-first spanning tree of one component of the example graph, with start vertex a.]<br />

• Suppose the edges of the graph are weighted. If we implement the ‘bag’ using a priority<br />

queue, always extracting the minimum-weight edge in line (⋆), then we have what might<br />

be called shortest-first search. In this case, each execution of (⋆) or (⋆⋆) takes O(log E) time,<br />

so the overall running time is O(V + E log E), which simplifies to O(E log E) if the graph is<br />

connected. For this algorithm, the set of parent edges form the minimum spanning tree of<br />

the connected component of s. We’ll see minimum spanning trees again in the next lecture.<br />
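Shortest-first search can be sketched in Python with the `heapq` module as the priority queue; here `adj[v]` holds (neighbor, weight) pairs, and the small weighted graph at the bottom is made-up test data (a square with one heavy diagonal, so the minimum spanning tree must avoid the weight-9 edge):

```python
import heapq

def shortest_first_search(adj, s):
    """'Bag' = priority queue keyed by edge weight.  The parent edges form
    a minimum spanning tree of s's component (this is Prim's algorithm)."""
    parent, heap = {}, [(0, None, s)]
    while heap:
        weight, p, v = heapq.heappop(heap)        # line (*): minimum weight
        if v not in parent:
            parent[v] = p
            for w, wt in adj[v]:
                heapq.heappush(heap, (wt, v, w))  # line (**)
    return parent

adj = {'a': [('b', 1), ('c', 9)], 'b': [('a', 1), ('c', 2), ('d', 7)],
       'c': [('a', 9), ('b', 2), ('d', 3)], 'd': [('b', 7), ('c', 3)]}
tree = shortest_first_search(adj, 'a')
assert tree == {'a': None, 'b': 'a', 'c': 'b', 'd': 'c'}
```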

If the graph is represented using an adjacency matrix, then finding all the neighbors of each<br />

vertex in line (†) takes O(V) time. Thus, depth- and breadth-first search take O(V²) time overall,<br />

and ‘shortest-first search’ takes O(V² + E log E) = O(V² log V) time overall.<br />


10.5 Searching disconnected graphs<br />

If the graph is disconnected, then Traverse(s) only visits the nodes in the connected component<br />

of the start vertex s. If we want to visit all the nodes in every component, we can use the following<br />

‘wrapper’ around our generic traversal algorithm. Since Traverse computes a spanning tree of<br />

one component, TraverseAll computes a spanning forest of the entire graph.<br />

    TraverseAll(s):
      for all vertices v
        if v is unmarked
          Traverse(v)
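In Python, with Traverse inlined as a simple depth-first search (any bag would do), the wrapper computes a spanning forest whose roots identify the connected components; the two-component graph below is made-up test data:

```python
def traverse_all(adj):
    """Run Traverse from every unmarked vertex.  The parent edges form a
    spanning forest, one tree (one None-parent root) per component."""
    parent = {}
    for s in adj:
        if s not in parent:
            stack = [(None, s)]
            while stack:
                p, v = stack.pop()
                if v not in parent:
                    parent[v] = p
                    stack.extend((v, w) for w in adj[v])
    return parent

adj = {'a': ['b'], 'b': ['a'], 'c': []}           # two components
forest = traverse_all(adj)
assert [v for v in forest if forest[v] is None] == ['a', 'c']  # one root each
```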

There is a rather unfortunate mistake on page 477 of CLR:<br />

Unlike breadth-first search, whose predecessor subgraph forms a tree, the predecessor<br />

subgraph produced by depth-first search may be composed of several trees, because the<br />

search may be repeated from multiple sources.<br />

This statement seems to imply that depth-first search is always called from within TraverseAll,<br />

and breadth-first search never is, but this is not true! The choice of whether to use a stack or a<br />

queue is completely independent of the choice of whether to use TraverseAll or not.<br />



CS 373 Lecture 11: Shortest Paths Fall 2002<br />

11 Shortest Paths (October 22)<br />

11.1 Introduction<br />

Given a weighted directed graph G = (V, E, w) with two special vertices, a source s and a target t,<br />

we want to find the shortest directed path from s to t. In other words, we want to find the path p<br />

starting at s and ending at t minimizing the function<br />

w(p) = ∑_{e ∈ p} w(e).<br />

For example, if I want to answer the question ‘What’s the fastest way to drive from my old<br />

apartment in Champaign, Illinois to my wife’s old apartment in Columbus, Ohio?’, we might<br />

use a graph whose vertices are cities, edges are roads, weights are driving times, s is Champaign,<br />

and t is Columbus. 1 The graph is directed since the driving times along the same road might be<br />

different in different directions. 2<br />

Perhaps counter to intuition, we will allow the weights on the edges to be negative. Negative<br />

edges make our lives complicated, since the presence of a negative cycle might mean that there<br />

is no shortest path. In general, a shortest path from s to t exists if and only if there is at least<br />

one path from s to t, but there is no path from s to t that touches a negative cycle. If there is a<br />

negative cycle between s and t, then we can always find a shorter path by going around the cycle<br />

one more time.<br />

[Figure: There is no shortest path from s to t.]<br />

Every algorithm known for solving this problem actually solves (large portions of) the following<br />

more general single source shortest path or SSSP problem: find the shortest path from the source<br />

vertex s to every other vertex in the graph. In fact, the problem is usually solved by finding a<br />

shortest path tree rooted at s that contains all the desired shortest paths.<br />

It’s not hard to see that if the shortest paths are unique, then they form a tree. To prove this,<br />

it’s enough to observe that sub-paths of shortest paths are also shortest paths. If there are multiple<br />

shortest paths to the same vertices, we can always choose one path to each vertex so that the union<br />

of the paths is a tree. If there are shortest paths to two vertices u and v that diverge, then meet,<br />

then diverge again, we can modify one of the paths so that the two paths only diverge once.<br />

[Figure: If s → a → b → c → d → v and s → a → x → y → d → u are both shortest paths, then s → a → b → c → d → u is also a shortest path.]<br />

1 West on Church, north on Prospect, east on I-74, south on I-465, east on Airport Expressway, north on I-65,<br />

east on I-70, north on Grandview, east on 5th, north on Olentangy River, east on Dodridge, north on High, west on<br />

Kelso, south on Neil. Depending on traffic. We both live in Urbana now.<br />

2 There is a speed trap on I-70 just inside the Ohio border, but only for eastbound traffic.<br />


I should emphasize here that shortest path trees and minimum spanning trees are usually very<br />

different. For one thing, there is only one minimum spanning tree, but in general, there is a different<br />

shortest path tree for every source vertex.<br />

[Figure: A minimum spanning tree and a shortest path tree (rooted at the topmost vertex) of the same graph.]<br />

All of the algorithms I’m describing in this lecture also work for undirected graphs, with some<br />

slight modifications. Most importantly, we must specifically prohibit alternating back and forth<br />

across the same undirected negative-weight edge. Our unmodified algorithms would interpret any<br />

such edge as a negative cycle of length 2.<br />

11.2 The Only SSSP Algorithm<br />

Just like graph traversal and minimum spanning trees, there are several different SSSP algorithms,<br />

but they are all special cases of a single generic algorithm. Each vertex v in the graph stores<br />

two values, which describe a tentative shortest path from s to v.<br />

• dist(v) is the length of the tentative shortest s ❀ v path.<br />

• pred(v) is the predecessor of v in the tentative shortest s ❀ v path.<br />

Notice that the predecessor pointers automatically define a tentative shortest path tree. We already<br />

know that dist(s) = 0 and pred(s) = Null. For every vertex v ≠ s, we initially set dist(v) = ∞<br />

and pred(v) = Null to indicate that we do not know of any path from s to v.<br />

We call an edge u → v tense if dist(u) + w(u → v) < dist(v). If u → v is tense, then the<br />

tentative shortest path s ❀ v is incorrect, since the path s ❀ u → v is shorter. Our generic<br />

algorithm repeatedly finds a tense edge in the graph and relaxes it:<br />

    Relax(u → v):
      dist(v) ← dist(u) + w(u → v)
      pred(v) ← u

If there are no tense edges, our algorithm is finished, and we have our desired shortest path tree.<br />

The correctness of the relaxation algorithm follows directly from three simple claims:<br />

1. If dist(v) ≠ ∞, then dist(v) is the total weight of the predecessor chain ending at v:<br />

s → · · · → pred(pred(v)) → pred(v) → v.<br />

This is easy to prove by induction on the number of relaxation steps. (Hint, hint.)<br />

2. If the algorithm halts, then dist(v) ≤ w(s ❀ v) for any path s ❀ v. This is easy to prove by<br />

induction on the number of edges in the path s ❀ v. (Hint, hint.)<br />


3. The algorithm halts if and only if there is no negative cycle reachable from s. The ‘only if’<br />

direction is easy—if there is a reachable negative cycle, then after the first edge in the cycle<br />

is relaxed, the cycle always has at least one tense edge. The ‘if’ direction follows from the<br />

fact that every relaxation step reduces either the number of vertices with dist(v) = ∞ by 1<br />

or reduces the sum of the finite shortest path lengths by some positive amount.<br />

I haven’t said anything about how we detect which edges can be relaxed, or what order we relax<br />

them in. In order to make this easier, we can refine the relaxation algorithm slightly, into something<br />

closely resembling the generic graph traversal algorithm. We maintain a ‘bag’ of vertices, initially<br />

containing just the source vertex s. Whenever we take a vertex u out of the bag, we scan all of its<br />

outgoing edges, looking for something to relax. Whenever we successfully relax an edge u → v,<br />

we put v in the bag.<br />

    InitSSSP(s):
      dist(s) ← 0
      pred(s) ← Null
      for all vertices v ≠ s
        dist(v) ← ∞
        pred(v) ← Null

    GenericSSSP(s):
      InitSSSP(s)
      put s in the bag
      while the bag is not empty
        take u from the bag
        for all edges u → v
          if u → v is tense
            Relax(u → v)
            put v in the bag
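A Python sketch of GenericSSSP with a FIFO queue as the ‘bag’ (so this instance is really Moore’s algorithm from Section 11.5); `adj[u]` holds (neighbor, weight) pairs, and the three-vertex graph at the bottom is made-up test data with a helpful negative edge:

```python
from collections import deque

INF = float('inf')

def generic_sssp(adj, s):
    """GenericSSSP with a queue bag.  Weights may be negative, but the graph
    must contain no negative cycle reachable from s."""
    dist = {v: INF for v in adj}
    pred = {v: None for v in adj}
    dist[s] = 0
    bag = deque([s])
    while bag:
        u = bag.popleft()
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:     # the edge u -> v is tense
                dist[v] = dist[u] + w     # Relax(u -> v)
                pred[v] = u
                bag.append(v)
    return dist, pred

adj = {'s': [('a', 4), ('b', 1)], 'a': [], 'b': [('a', -2)]}
dist, pred = generic_sssp(adj, 's')
assert dist['a'] == -1 and pred['a'] == 'b'   # the negative edge helps
```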

Just as with graph traversal, using different data structures for the ‘bag’ gives us different<br />

algorithms. There are three obvious choices to try: a stack, a queue, and a heap. Unfortunately, if<br />

we use a stack, we have to perform Θ(2^V) relaxation steps in the worst case! (This is a problem in<br />

the current homework.) The other two possibilities are much more efficient.<br />

11.3 Dijkstra’s Algorithm<br />

If we implement the bag as a heap, where the key of a vertex v is dist(v), we obtain an algorithm<br />

first published by Edsger Dijkstra in 1959.<br />

Dijkstra’s algorithm is particularly well-behaved if the graph has no negative-weight edges. In<br />

this case, it’s not hard to show (by induction, of course) that the vertices are scanned in increasing<br />

order of their shortest-path distance from s. It follows that each vertex is scanned at most once,<br />

and thus that each edge is relaxed at most once. Since the key of each vertex in the heap is its<br />

tentative distance from s, the algorithm performs a DecreaseKey operation every time an edge is<br />

relaxed. Thus, the algorithm performs at most E DecreaseKeys. Similarly, there are at most V<br />

Insert and ExtractMin operations. Thus, if we store the vertices in a Fibonacci heap, the total<br />

running time of Dijkstra’s algorithm is O(E + V log V ) .<br />

This analysis assumes that no edge has negative weight. Dijkstra’s algorithm (in the form I’m<br />

presenting here) is still correct if there are negative edges 3 , but the worst-case running time could<br />

be exponential. (I’ll leave the proof of this unfortunate fact as an extra credit problem.)<br />

3 The version of Dijkstra’s algorithm presented in CLRS gives incorrect results for graphs with negative edges.<br />
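A Python sketch of Dijkstra’s algorithm. Python’s `heapq` has no DecreaseKey operation, so this version pushes a fresh entry on every relaxation and lazily skips stale entries on extraction; that is a standard substitution, not what the notes describe, and it changes the bounds only by the heap type. The example graph is made-up test data with non-negative weights.

```python
import heapq

def dijkstra(adj, s):
    """Dijkstra: the 'bag' is a heap keyed by dist(v).  Assumes non-negative
    edge weights; adj[u] is a list of (v, weight) pairs."""
    dist = {v: float('inf') for v in adj}
    pred = {v: None for v in adj}
    dist[s] = 0
    heap, done = [(0, s)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry: skip it
        done.add(u)                       # u is scanned exactly once
        for v, w in adj[u]:
            if d + w < dist[v]:           # u -> v is tense: relax it
                dist[v] = d + w
                pred[v] = u
                heapq.heappush(heap, (dist[v], v))
    return dist, pred

adj = {'s': [('a', 10), ('b', 3)], 'a': [('c', 2)],
       'b': [('a', 4), ('c', 8)], 'c': []}
dist, pred = dijkstra(adj, 's')
assert dist == {'s': 0, 'a': 7, 'b': 3, 'c': 9}
```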


[Figure: Four phases of Dijkstra’s algorithm run on a graph with no negative edges. At each phase, the shaded vertices are in the heap, and the bold vertex has just been scanned. The bold edges describe the evolving shortest path tree.]<br />

11.4 The A∗ Heuristic<br />

A slight generalization of Dijkstra’s algorithm, commonly known as the A∗ algorithm, is frequently<br />

used to find a shortest path from a single source node s to a single target node t. A∗ uses a black-box<br />

function GuessDistance(v, t) that returns an estimate of the distance from v to t. The only<br />

difference between Dijkstra and A∗ is that the key of a vertex v is dist(v) + GuessDistance(v, t).<br />

The function GuessDistance is called admissible if GuessDistance(v, t) never overestimates<br />

the actual shortest path distance from v to t. If GuessDistance is admissible and the actual edge<br />

weights are all non-negative, the A∗ algorithm computes the actual shortest path from s to t at least<br />

as quickly as Dijkstra’s algorithm. The closer GuessDistance(v, t) is to the real distance from v<br />

to t, the faster the algorithm. However, in the worst case, the running time is still O(E + V log V).<br />

The heuristic is especially useful in situations where the actual graph is not known. For example,<br />

A∗ can be used to solve puzzles (15-puzzle, Freecell, Shanghai, Minesweeper, Sokoban, Atomix,<br />

Rush Hour, Rubik’s Cube, . . . ) and other path planning problems where the starting and goal<br />

configurations are given, but the graph of all possible configurations and their connections is not<br />

given explicitly.<br />
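A Python sketch of A∗, again using lazy deletion in place of DecreaseKey; `guess` stands in for the notes’ GuessDistance(·, t), and the tiny graph in the usage line is made-up test data (with `guess` ≡ 0, the search degenerates to Dijkstra):

```python
import heapq

def a_star(adj, s, t, guess):
    """A* search: identical to Dijkstra except that the heap key of v is
    dist(v) + guess(v).  If guess never overestimates the true distance to t
    (and all weights are non-negative), the answer is exact.
    adj[u] is a list of (v, weight) pairs."""
    dist = {s: 0}
    heap, done = [(guess(s), s)], set()
    while heap:
        _, u = heapq.heappop(heap)
        if u == t:
            return dist[t]
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u]:
            if dist[u] + w < dist.get(v, float('inf')):
                dist[v] = dist[u] + w
                heapq.heappush(heap, (dist[v] + guess(v), v))
    return float('inf')                   # no path from s to t

adj = {'s': [('a', 1), ('b', 5)], 'a': [('t', 1)], 'b': [('t', 1)], 't': []}
assert a_star(adj, 's', 't', lambda v: 0) == 2
```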

11.5 Moore’s Algorithm (‘Bellman-Ford’)<br />

If we replace the heap in Dijkstra’s algorithm with a queue, we get an algorithm that was first<br />

published by Moore in 1957, and then independently by Bellman in 1958. (Since Bellman used the<br />

idea of relaxing edges, which was first proposed by Ford in 1956, this algorithm is usually called<br />

‘Bellman-Ford’.) Moore’s algorithm is efficient even if there are negative edges, and it can be used<br />

to quickly detect the presence of negative cycles. If there are no negative edges, however, Dijkstra’s<br />


algorithm is faster. (In fact, in practice, Dijkstra’s algorithm is often faster even for graphs with<br />

negative edges.)<br />

The easiest way to analyze the algorithm is to break the execution into phases. Before we<br />

even begin, we insert a token into the queue. Whenever we take the token out of the queue, we<br />

begin a new phase by just reinserting the token into the queue. The 0th phase consists entirely<br />

of scanning the source vertex s. The algorithm ends when the queue contains only the token. A<br />

simple inductive argument (hint, hint) implies the following invariant:<br />

At the end of the ith phase, for every vertex v, dist(v) is less than or equal to the length<br />

of the shortest path s ❀ v consisting of i or fewer edges.<br />

Since a shortest path can only pass through each vertex once, either the algorithm halts before<br />

the V th phase, or the graph contains a negative cycle. In each phase, we scan each vertex at most<br />

once, so we relax each edge at most once, so the running time of a single phase is O(E). Thus, the<br />

overall running time of Moore’s algorithm is O(V E) .<br />

[Figure: Four phases of Moore’s algorithm run on a directed graph with negative edges. Nodes are taken from the queue in the order s ⋄ a b c ⋄ d f b ⋄ a e d ⋄ d a ⋄ ⋄, where ⋄ is the token. Shaded vertices are in the queue at the end of each phase. The bold edges describe the evolving shortest path tree.]<br />

Now that we understand how the phases of Moore’s algorithm behave, we can simplify the<br />

algorithm considerably. Instead of using a queue to perform a partial breadth-first search of the<br />

graph in each phase, let’s just scan through the adjacency list directly and try to relax every edge<br />

in the graph.<br />


    MooreSSSP(s):
      InitSSSP(s)
      repeat V times:
        for every edge u → v
          if u → v is tense
            Relax(u → v)
      for every edge u → v
        if u → v is tense
          return ‘Negative cycle!’

This is closer to how CLRS presents the ‘Bellman-Ford’ algorithm. The O(V E) running time<br />

of this version of the algorithm should be obvious, but it may not be clear that the algorithm is<br />

still correct. To prove correctness, we just have to show that our earlier invariant holds; as before,<br />

this can be proved by induction on i.<br />
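The simplified algorithm translates directly into Python; `adj[u]` holds (v, weight) pairs, and both example graphs are made-up test data (the second one contains a negative cycle):

```python
def moore_sssp(adj, s):
    """MooreSSSP ('Bellman-Ford'): V rounds of relaxing every tense edge,
    then one more scan to detect a reachable negative cycle.
    Returns (dist, pred), or None if a negative cycle is found."""
    INF = float('inf')
    dist = {v: INF for v in adj}
    pred = {v: None for v in adj}
    dist[s] = 0
    for _ in range(len(adj)):             # "repeat V times"
        for u in adj:
            for v, w in adj[u]:
                if dist[u] + w < dist[v]: # u -> v is tense
                    dist[v] = dist[u] + w
                    pred[v] = u
    for u in adj:                         # final scan: any tense edge left?
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:
                return None               # 'Negative cycle!'
    return dist, pred

adj = {'s': [('a', 4), ('b', 1)], 'a': [], 'b': [('a', -2)]}
dist, pred = moore_sssp(adj, 's')
assert dist['a'] == -1
```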

11.6 Greedy Shortest Paths?<br />

Here’s another algorithm that fits our generic framework, but which I’ve never seen analyzed.<br />

Repeatedly relax the tensest edge.<br />

Specifically, let’s define the ‘tenseness’ of an edge u → v as follows:<br />

tenseness(u → v) = max {0, dist(v) − dist(u) − w(u → v)}<br />

(This is defined even when dist(v) = ∞ or dist(u) = ∞, as long as we treat ∞ just like some<br />

indescribably large but finite number.) If an edge has zero tenseness, it’s not tense. If we relax an<br />

edge u → v, then dist(v) decreases by the edge’s tenseness.<br />

Intuitively, we can keep the edges of the graph in some sort of heap, where the key of each edge<br />

is its tenseness. Then we repeatedly pull out the tensest edge u → v and relax it. Then we need to<br />

recompute the tenseness of other edges adjacent to v. Edges leaving v possibly become more tense,<br />

and edges coming into v possibly become less tense. So we need a heap that efficiently supports<br />

the operations Insert, ExtractMax, IncreaseKey, and DecreaseKey.<br />

If there are no negative cycles, this algorithm eventually halts with a shortest path tree, but<br />

how quickly? Can the same edge be relaxed more than once, and if so, how many times? Is it<br />

faster if all the edge weights are positive? Hmm....<br />
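One way to experiment with these questions is the hedged sketch below. Since Python’s `heapq` supports neither IncreaseKey nor DecreaseKey, this version keeps stale heap entries and recomputes tenseness on extraction, so it relaxes tense edges in only approximately tensest-first order, a deviation from the scheme described above. It also counts relaxations, which is exactly the quantity the open questions ask about. The test graph is made-up data.

```python
import heapq

INF = float('inf')

def greedy_sssp(adj, s):
    """Approximate 'repeatedly relax the tensest edge', via a lazy max-heap
    (negated keys).  Assumes no negative cycles; adj[u] holds (v, w) pairs."""
    dist = {v: INF for v in adj}
    dist[s] = 0
    relaxations = 0

    def tenseness(u, v, w):
        # Treat infinity like a huge finite number, as the notes suggest.
        return max(0, dist[v] - dist[u] - w) if dist[u] < INF else 0

    heap = [(-tenseness(s, v, w), s, v, w) for v, w in adj[s]]
    heapq.heapify(heap)
    while heap:
        _, u, v, w = heapq.heappop(heap)
        if tenseness(u, v, w) == 0:        # stale entry: no longer tense
            continue
        dist[v] = dist[u] + w              # Relax(u -> v)
        relaxations += 1
        for x, wx in adj[v]:               # edges leaving v may now be tense
            heapq.heappush(heap, (-tenseness(v, x, wx), v, x, wx))
    return dist, relaxations

adj = {'s': [('a', 4), ('b', 1)], 'a': [], 'b': [('a', -2)]}
dist, count = greedy_sssp(adj, 's')
assert dist['a'] == -1
```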



CS 373 Lecture 12: All-Pair Shortest Paths Fall 2002<br />

12 All-Pair Shortest Paths (October 24)<br />

12.1 The Problem<br />

In the last lecture, we saw algorithms to find the shortest path from a source vertex s to a target<br />

vertex t in a directed graph. As it turns out, the best algorithms for this problem actually find the<br />

shortest path from s to every possible target (or from every possible source to t) by constructing<br />

a shortest path tree. The shortest path tree specifies two pieces of information for each node v in<br />

the graph:<br />

• dist(v) is the length of the shortest path (if any) from s to v.<br />

• pred(v) is the second-to-last vertex (if any) on the shortest path (if any) from s to v.<br />

In this lecture, we want to generalize the shortest path problem even further. In the all pairs<br />

shortest path problem, we want to find the shortest path from every possible source to every<br />

possible destination. Specifically, for every pair of vertices u and v, we need to compute the<br />

following information:<br />

• dist(u, v) is the length of the shortest path (if any) from u to v.<br />

• pred(u, v) is the second-to-last vertex (if any) on the shortest path (if any) from u to v.<br />

For example, for any vertex v, we have dist(v, v) = 0 and pred(v, v) = Null. If the shortest path<br />

from u to v is only one edge long, then dist(u, v) = w(u → v) and pred(u, v) = u. If there is<br />

no shortest path from u to v—either because there’s no path at all, or because there’s a negative<br />

cycle—then dist(u, v) = ∞ and pred(u, v) = Null.<br />

The output of our shortest path algorithms will be a pair of V × V arrays encoding all V²<br />

distances and predecessors. Many maps include a distance matrix—to find the distance from (say)<br />

Champaign to (say) Columbus, you would look in the row labeled ‘Champaign’ and the column<br />

labeled ‘Columbus’. In these notes, I’ll focus almost exclusively on computing the distance array.<br />

The predecessor array, from which you would compute the actual shortest paths, can be computed<br />

with only minor additions to the algorithms I’ll describe (hint, hint).<br />

12.2 Lots of Single Sources<br />

The most obvious solution to the all pairs shortest path problem is just to run a single-source<br />

shortest path algorithm V times, once for every possible source vertex! Specifically, to fill in the<br />

one-dimensional subarray dist[s][ ], we invoke either Dijkstra’s or Moore’s algorithm starting at the<br />

source vertex s.<br />

ObviousAPSP(V, E, w):<br />

for every vertex s<br />

dist[s][ ] ← SSSP(V, E, w, s)<br />

The running time of this algorithm depends on which single source algorithm we use. If we use<br />

Moore’s algorithm, the overall running time is Θ(V 2 E) = O(V 4 ). If all the edge weights are positive,<br />

we can use Dijkstra’s algorithm instead, which decreases the running time to Θ(V E + V 2 log V ) =<br />

O(V 3 ). For graphs with negative edge weights, Dijkstra’s algorithm can take exponential time, so<br />

we can’t get this improvement directly.<br />
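In Python, the obvious algorithm really is a two-liner once we have some single-source routine. Here is a sketch using Bellman-Ford (essentially Moore's algorithm) as the subroutine; the dictionary encoding of the edge weights is my own choice, not from the notes:<br />

```python
from math import inf

def bellman_ford(vertices, edges, w, s):
    """Single-source shortest paths; tolerates negative edge weights
    (assumes no negative cycles, so V-1 relaxation passes suffice)."""
    dist = {v: inf for v in vertices}
    dist[s] = 0
    for _ in range(len(vertices) - 1):
        for (u, v) in edges:
            if dist[u] + w[u, v] < dist[v]:
                dist[v] = dist[u] + w[u, v]
    return dist

def obvious_apsp(vertices, edges, w):
    """One single-source computation per source vertex: Theta(V^2 E) total."""
    return {s: bellman_ford(vertices, edges, w, s) for s in vertices}
```

With V calls at Θ(V E) each, this matches the Θ(V²E) bound above.<br />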



CS 373 Lecture 12: All-Pairs Shortest Paths Fall 2002<br />

12.3 Reweighting<br />

One idea that occurs to most people is increasing the weights of all the edges by the same amount<br />

so that all the weights become positive, and then applying Dijkstra’s algorithm. Unfortunately, this<br />

simple idea doesn’t work. Different paths change by different amounts, which means the shortest<br />

paths in the reweighted graph may not be the same as in the original graph.<br />

Increasing all the edge weights by 2 changes the shortest path from s to t.<br />

However, there is a more complicated method for reweighting the edges in a graph. Suppose<br />

each vertex v has some associated cost c(v), which might be positive, negative, or zero. We can<br />

define a new weight function w ′ as follows:<br />

w ′ (u → v) = c(u) + w(u → v) − c(v)<br />

To give some intuition, imagine that when we leave vertex u, we have to pay an exit tax of c(u),<br />

and when we enter v, we get c(v) as an entrance gift.<br />

Now it’s not too hard to show that the shortest paths with the new weight function w ′ are exactly<br />

the same as the shortest paths with the original weight function w. In fact, for any path u ❀ v<br />

from one vertex u to another vertex v, we have<br />

w ′ (u ❀ v) = c(u) + w(u ❀ v) − c(v).<br />

We pay c(u) in exit fees, plus the original weight of the path, minus the c(v) entrance gift. At<br />

every intermediate vertex x on the path, we get c(x) as an entrance gift, but then immediately pay<br />

it back as an exit tax!<br />
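In other words, the sum telescopes: every path from u to v shifts by exactly c(u) − c(v), so the ranking of u-to-v paths is untouched. A tiny Python check (the graph and the costs are made up for illustration):<br />

```python
def path_weight(path, w):
    """Total weight of a path given as a list of vertices."""
    return sum(w[path[i], path[i + 1]] for i in range(len(path) - 1))

def reweight(w, c):
    """w'(u -> v) = c(u) + w(u -> v) - c(v)."""
    return {(u, v): c[u] + wt - c[v] for (u, v), wt in w.items()}

# Toy example: two u-to-v paths; both shift by c[u] - c[v] = 4, so the
# comparison between them (and hence the shortest path) is unchanged.
w = {('u', 'x'): 3, ('x', 'v'): -1, ('u', 'v'): 4}
c = {'u': 5, 'x': -2, 'v': 1}
w2 = reweight(w, c)
shift = c['u'] - c['v']
assert path_weight(['u', 'x', 'v'], w2) == path_weight(['u', 'x', 'v'], w) + shift
assert path_weight(['u', 'v'], w2) == path_weight(['u', 'v'], w) + shift
```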

12.4 Johnson’s Algorithm<br />

Johnson’s all-pairs shortest path algorithm finds a cost c(v) for each vertex, so that when the graph<br />

is reweighted, every edge has non-negative weight.<br />

Suppose the graph has a vertex s that has a path to every other vertex. Johnson’s algorithm<br />

computes the shortest paths from s to every other vertex, using Moore’s algorithm (which doesn’t<br />

care if the edge weights are negative), and then sets<br />

so the new weight of every edge is<br />

c(v) = dist(s, v),<br />

w ′ (u → v) = dist(s, u) + w(u → v) − dist(s, v).<br />

Why are all these new weights non-negative? Because otherwise, Moore’s algorithm wouldn’t be<br />

finished! Recall that an edge u → v is tense if dist(s, u) + w(u → v) < dist(s, v), and that single-source<br />

shortest path algorithms eliminate all tense edges. The only exception is if the graph has a<br />

negative cycle, but then shortest paths aren’t defined, and Johnson’s algorithm simply aborts.<br />

But what if the graph doesn’t have a vertex s that can reach everything? Then no matter<br />

where we start Moore’s algorithm, some of those vertex costs will be infinite. Johnson’s algorithm<br />


avoids this problem by adding a new vertex s to the graph, with zero-weight edges going from s<br />

to every other vertex, but no edges going back into s. This addition doesn’t change the shortest<br />

paths between any other pair of vertices, because there are no paths into s.<br />

So here’s Johnson’s algorithm in all its glory.<br />

JohnsonAPSP(V, E, w) :<br />

create a new vertex s<br />

for every vertex v ∈ V<br />

w(s → v) ← 0; w(v → s) ← ∞<br />

dist[s][ ] ← Moore(V, E, w, s)<br />

abort if Moore found a negative cycle<br />

for every edge (u, v) ∈ E<br />

w ′ (u → v) ← dist[s][u] + w(u → v) − dist[s][v]<br />

for every vertex v ∈ V<br />

dist[v][ ] ← Dijkstra(V, E, w ′ , v)<br />

The algorithm spends Θ(V ) time adding the artificial start vertex s, Θ(V E) time running<br />

Moore, O(E) time reweighting the graph, and then Θ(V E + V 2 log V ) time running V passes of<br />

Dijkstra's algorithm. The overall running time is Θ(V E + V 2 log V ) .<br />
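Here is a Python sketch of the whole pipeline, with Bellman-Ford standing in for Moore's algorithm and a binary-heap Dijkstra instead of a Fibonacci-heap one (so this sketch runs in O(V E log V) rather than the bound above). The artificial vertex s is simulated by starting Bellman-Ford with every cost equal to zero:<br />

```python
import heapq
from math import inf

def johnson(vertices, edges, w):
    """All-pairs shortest paths via Johnson's reweighting (sketch).

    Returns dist[u][v]; raises ValueError on a negative cycle."""
    # Phase 1: Bellman-Ford from the implicit source s.  Starting every
    # cost at 0 is equivalent to zero-weight edges s -> v for all v.
    c = {v: 0 for v in vertices}
    for _ in range(len(vertices) - 1):
        for (u, v) in edges:
            if c[u] + w[u, v] < c[v]:
                c[v] = c[u] + w[u, v]
    for (u, v) in edges:
        if c[u] + w[u, v] < c[v]:      # a tense edge survived: negative cycle
            raise ValueError("negative cycle")
    # Phase 2: reweight so every edge is non-negative.
    w2 = {(u, v): c[u] + w[u, v] - c[v] for (u, v) in edges}
    adj = {v: [] for v in vertices}
    for (u, v) in edges:
        adj[u].append(v)
    # Phase 3: Dijkstra from every vertex, then undo the reweighting.
    dist = {}
    for s in vertices:
        d = {v: inf for v in vertices}
        d[s] = 0
        pq = [(0, s)]
        while pq:
            du, u = heapq.heappop(pq)
            if du > d[u]:
                continue               # stale heap entry
            for v in adj[u]:
                if du + w2[u, v] < d[v]:
                    d[v] = du + w2[u, v]
                    heapq.heappush(pq, (d[v], v))
        dist[s] = {v: d[v] - c[s] + c[v] if d[v] < inf else inf
                   for v in vertices}
    return dist
```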

12.5 Dynamic Programming<br />

There’s a completely different solution to the all-pairs shortest path problem that uses dynamic<br />

programming instead of a single-source algorithm. For dense graphs where E = Ω(V 2 ), the dynamic<br />

programming approach gives the same running time as Johnson’s algorithm, but with a much<br />

simpler algorithm. In particular, the new algorithm avoids Dijkstra’s algorithm, which gets its<br />

efficiency from Fibonacci heaps, which are rather easy to screw up in the implementation.<br />

To get a dynamic programming algorithm, we first need to come up with a recursive formulation<br />

of the problem. If we try to recursively define dist(u, v), we might get something like this:<br />

dist(u, v) = 0 if u = v<br />

dist(u, v) = min over all vertices x of [ dist(u, x) + w(x → v) ] otherwise<br />

In other words, to find the shortest path from u to v, try all possible predecessors x, compute the<br />

shortest path from u to x, and then add the last edge x → v. Unfortunately, this recurrence<br />

doesn’t work! In order to compute dist(u, v), we first have to compute dist(u, x) for every other<br />

vertex x, but to compute any dist(u, x), we first need to compute dist(u, v). We’re stuck in an<br />

infinite loop!<br />

To avoid this circular dependency, we need some additional parameter that decreases at each<br />

recursion, eventually reaching zero at the base case. One possibility is to include the number of<br />

edges in the shortest path as this third magic parameter. So let’s define dist(u, v, k) to be the<br />

length of the shortest path from u to v that uses at most k edges. Since we know that the shortest<br />

path between any two vertices has at most V − 1 edges, what we’re really trying to compute is<br />

dist(u, v, V − 1).<br />

After a little thought, we get the following recurrence.<br />

dist(u, v, k) = 0 if u = v<br />

dist(u, v, k) = ∞ if k = 0 and u ≠ v<br />

dist(u, v, k) = min over all vertices x of [ dist(u, x, k − 1) + w(x → v) ] otherwise<br />


Just like last time, the recurrence tries all possible predecessors of v in the shortest path, but now<br />

the recursion actually bottoms out when k = 0.<br />

Now it’s not difficult to turn this recurrence into a dynamic programming algorithm. Even<br />

before we write down the algorithm, though, we can tell that its running time will be Θ(V 4 )<br />

simply because the recurrence has four variables—u, v, k, and x—each of which can take on V different<br />

values. Except for the base cases, the algorithm itself is just four nested for loops. To make the<br />

algorithm a little shorter, let’s assume that w(v → v) = 0 for every vertex v.<br />

DynamicProgrammingAPSP(V, E, w):<br />

for all vertices u ∈ V<br />

for all vertices v ∈ V<br />

if u = v<br />

dist[u][v][0] ← 0<br />

else<br />

dist[u][v][0] ← ∞<br />

for k ← 1 to V − 1<br />

for all vertices u ∈ V<br />

for all vertices v ∈ V<br />

dist[u][v][k] ← ∞<br />

for all vertices x ∈ V<br />

if dist[u][v][k] > dist[u][x][k − 1] + w(x → v)<br />

dist[u][v][k] ← dist[u][x][k − 1] + w(x → v)<br />

The last four lines actually evaluate the recurrence.<br />

In fact, this algorithm is almost exactly the same as running Moore’s algorithm once for every<br />

source vertex. The only difference is the innermost loop, which in Moore’s algorithm would read<br />

“for all edges x → v”. This simple change improves the running time to Θ(V 2 E), assuming the<br />

graph is stored in an adjacency list.<br />
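Since dist[·][·][k] depends only on dist[·][·][k − 1], we only ever need two two-dimensional tables. Here is a Python sketch of DynamicProgrammingAPSP with that space-saving twist; as in the text, it assumes w[v][v] = 0, with w[x][v] = ∞ for missing edges, and no negative cycles:<br />

```python
from math import inf

def dp_apsp(vertices, w):
    """Theta(V^4) all-pairs shortest paths from the k-edge recurrence.

    w[u][v] is the edge weight (inf if absent), and w[v][v] = 0."""
    n = len(vertices)
    # Base case k = 0: only the empty path from each vertex to itself.
    dist = {u: {v: 0 if u == v else inf for v in vertices} for u in vertices}
    for _ in range(n - 1):   # k = 1 .. V-1: allow one more edge per round
        dist = {u: {v: min(dist[u][x] + w[x][v] for x in vertices)
                    for v in vertices} for u in vertices}
    return dist
```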

12.6 Divide and Conquer<br />

But we can make a more significant improvement. The recurrence we just used broke the shortest<br />

path into a slightly shorter path and a single edge, by considering all predecessors. Instead, let’s<br />

break it into two shorter paths at the middle vertex on the path. This idea gives us a different<br />

recurrence for dist(u, v, k). Once again, to simplify things, let’s assume w(v → v) = 0.<br />

<br />

dist(u, v, k) = w(u → v) if k = 1<br />

dist(u, v, k) = min over all vertices x of [ dist(u, x, k/2) + dist(x, v, k/2) ] otherwise<br />

This recurrence only works when k is a power of two, since otherwise we might try to find the shortest<br />

path with a fractional number of edges! But that’s not really a problem, since dist(u, v, 2 ⌈lg V ⌉ )<br />

gives us the overall shortest distance from u to v. Notice that we use the base case k = 1 instead<br />

of k = 0, since we can’t use half an edge.<br />

Once again, a dynamic programming solution is straightforward. Even before we write down<br />

the algorithm, we can tell the running time is Θ(V 3 log V ) —we consider V possible values of u,<br />

v, and x, but only ⌈lg V ⌉ possible values of k.<br />


FastDynamicProgrammingAPSP(V, E, w):<br />

for all vertices u ∈ V<br />

for all vertices v ∈ V<br />

dist[u][v][0] ← w(u → v)<br />

for i ← 1 to ⌈lg V ⌉ 〈〈k = 2 i 〉〉<br />

for all vertices u ∈ V<br />

for all vertices v ∈ V<br />

dist[u][v][i] ← ∞<br />

for all vertices x ∈ V<br />

if dist[u][v][i] > dist[u][x][i − 1] + dist[x][v][i − 1]<br />

dist[u][v][i] ← dist[u][x][i − 1] + dist[x][v][i − 1]<br />
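Here is a compact Python sketch of the same idea, where each pass of the inner loops is one 'squaring' of the current distance table (again assuming w[v][v] = 0 and no negative cycles):<br />

```python
from math import inf

def min_plus_square(d, vertices):
    """One 'funny' squaring: d'[u][v] = min over x of d[u][x] + d[x][v]."""
    return {u: {v: min(d[u][x] + d[x][v] for x in vertices)
                for v in vertices} for u in vertices}

def fast_apsp(vertices, w):
    """APSP by repeated squaring: Theta(V^3 log V) time."""
    d = {u: {v: w[u][v] for v in vertices} for u in vertices}   # k = 1
    k = 1
    while k < len(vertices) - 1:   # ceil(lg V) squarings suffice
        d = min_plus_square(d, vertices)
        k *= 2
    return d
```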

12.7 Aside: ‘Funny’ Matrix Multiplication<br />

There is a very close connection between computing shortest paths in a directed graph and computing<br />

powers of a square matrix. Compare the following algorithm for multiplying two n × n<br />

matrices A and B with the inner loop of our first dynamic programming algorithm. (I’ve changed<br />

the variable names in the second algorithm slightly to make the similarity clearer.)<br />

MatrixMultiply(A, B):<br />

for i ← 1 to n<br />

for j ← 1 to n<br />

C[i][j] ← 0<br />

for k ← 1 to n<br />

C[i][j] ← C[i][j] + A[i][k] · B[k][j]<br />

APSPInnerLoop:<br />

for all vertices u<br />

for all vertices v<br />

D ′ [u][v] ← ∞<br />

for all vertices x<br />

D ′ [u][v] ← min{D ′ [u][v], D[u][x] + w[x][v]}<br />

The only difference between these two algorithms is that we use addition instead of multiplication<br />

and minimization instead of addition. For this reason, the shortest path inner loop is often referred<br />

to as ‘funny’ matrix multiplication.<br />

DynamicProgrammingAPSP is the standard iterative algorithm for computing the (V − 1)th<br />

‘funny power’ of the weight matrix w. The first set of for loops sets up the ‘funny identity matrix’,<br />

with zeros on the main diagonal and infinity everywhere else. Then each iteration of the second<br />

main for loop computes the next ‘funny power’. FastDynamicProgrammingAPSP replaces this<br />

iterative method for computing powers with repeated squaring, exactly like we saw at the beginning<br />

of the semester. The fast algorithm is simplified slightly by the fact that unless there are negative<br />

cycles, every ‘funny power’ after the V th is the same.<br />

There are faster methods for multiplying matrices, similar to Karatsuba’s divide-and-conquer<br />

algorithm for multiplying integers. (See ‘Strassen’s algorithm’ in CLR.) Unfortunately, these algorithms<br />

use subtraction, and there’s no ‘funny’ equivalent of subtraction. (What’s the inverse<br />

operation for min?) So at least for general graphs, there seems to be no way to speed up the inner<br />

loop of our dynamic programming algorithms.<br />


Fortunately, this isn’t true. There is a beautiful randomized algorithm, due to Noga Alon,<br />

Zvi Galil, Oded Margalit*, and Moni Naor, 1 that computes all-pairs shortest paths in undirected<br />

graphs in O(M(V ) log 2 V ) expected time, where M(V ) is the time to multiply two V × V integer<br />

matrices. A simplified version of this algorithm for unweighted graphs, due to Raimund Seidel 2 ,<br />

appears in the current homework.<br />

12.8 Floyd and Warshall’s Algorithm<br />

Our fast dynamic programming algorithm is still a factor of O(log V ) slower than Johnson’s algorithm.<br />

A different formulation due to Floyd and Warshall removes this logarithmic factor. Their<br />

insight was to use a different third parameter in the recurrence.<br />

Number the vertices arbitrarily from 1 to V , and define dist(u, v, r) to be the length of the<br />

shortest path from u to v, where all the intermediate vertices (if any) are numbered r or less. If<br />

r = 0, we aren’t allowed to use any intermediate vertices, so the shortest legal path from u to v is<br />

just the edge (if any) from u to v. If r > 0, then either the shortest legal path from u to v goes<br />

through vertex r or it doesn’t. We get the following recurrence:<br />

<br />

dist(u, v, r) = w(u → v) if r = 0<br />

dist(u, v, r) = min { dist(u, v, r − 1), dist(u, r, r − 1) + dist(r, v, r − 1) } otherwise<br />

We need to compute the shortest path distance from u to v with no restrictions, which is just<br />

dist(u, v, V ).<br />

Once again, we should immediately see that a dynamic programming algorithm that implements<br />

this recurrence will run in Θ(V 3 ) time: three variables appear in the recurrence (u, v, and r),<br />

each of which can take on V possible values. Here’s one way to do it:<br />

FloydWarshall(V, E, w):<br />

for u ← 1 to V<br />

for v ← 1 to V<br />

dist[u][v][0] ← w(u → v)<br />

for r ← 1 to V<br />

for u ← 1 to V<br />

for v ← 1 to V<br />

if dist[u][v][r − 1] < dist[u][r][r − 1] + dist[r][v][r − 1]<br />

dist[u][v][r] ← dist[u][v][r − 1]<br />

else<br />

dist[u][v][r] ← dist[u][r][r − 1] + dist[r][v][r − 1]<br />
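Since phase r reads only row r and column r of the previous table, and those entries do not change during phase r (assuming no negative cycles), the r index can be dropped and the updates done in place in a single 2D array. Here is that standard space-saving version as a Python sketch:<br />

```python
from math import inf

def floyd_warshall(n, w):
    """Floyd-Warshall with the r index collapsed into one 2D table.

    w[u][v] is the edge weight (inf if absent); vertices are 0 .. n-1."""
    dist = [[w[u][v] for v in range(n)] for u in range(n)]   # r = 0 base case
    for r in range(n):           # allow vertex r as an intermediate vertex
        for u in range(n):
            for v in range(n):
                if dist[u][r] + dist[r][v] < dist[u][v]:
                    dist[u][v] = dist[u][r] + dist[r][v]
    return dist
```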

1 N. Alon, Z. Galil, O. Margalit*, and M. Naor. Witnesses for Boolean matrix multiplication and for shortest<br />

paths. Proc. 33rd FOCS, 417–426, 1992. See also N. Alon, Z. Galil, O. Margalit*. On the exponent of the all pairs<br />

shortest path problem. Journal of Computer and System Sciences 54(2):255–262, 1997.<br />

2 R. Seidel. On the all-pairs-shortest-path problem in unweighted undirected graphs. Journal of Computer and<br />

System Sciences, 51(3):400-403, 1995. This is one of the few algorithms papers where (in the conference version at<br />

least) the algorithm is completely described and analyzed in the abstract of the paper.<br />



CS 373 Lecture 13: Minimum Spanning Trees Fall 2002<br />

13 Minimum Spanning Trees (October 29)<br />

13.1 Introduction<br />

Suppose we are given a connected, undirected, weighted graph. This is a graph G = (V, E) together<br />

with a function w : E → ℝ that assigns a weight w(e) to each edge e. For this lecture, we’ll assume<br />

that the weights are real numbers. Our task is to find the minimum spanning tree of G, i.e., the<br />

spanning tree T minimizing the function<br />

w(T ) = Σ_{e ∈ T} w(e).<br />

To keep things simple, I’ll assume that all the edge weights are distinct: w(e) ≠ w(e ′ ) for any<br />

pair of edges e and e ′ . Distinct weights guarantee that the minimum spanning tree of the graph<br />

is unique. Without this condition, there may be several different minimum spanning trees. For<br />

example, if all the edges have weight 1, then every spanning tree is a minimum spanning tree with<br />

weight V − 1.<br />

A weighted graph and its minimum spanning tree.<br />

If we have an algorithm that assumes the edge weights are unique, we can still use it on graphs<br />

where multiple edges have the same weight, as long as we have a consistent method for breaking<br />

ties. One way to break ties consistently is to use the following algorithm in place of a simple<br />

comparison. ShorterEdge takes as input four integers i, j, k, l, and decides which of the two<br />

edges (i, j) and (k, l) has ‘smaller’ weight.<br />

ShorterEdge(i, j, k, l):<br />

if w(i, j) < w(k, l) return (i, j)<br />

if w(i, j) > w(k, l) return (k, l)<br />

if min(i, j) < min(k, l) return (i, j)<br />

if min(i, j) > min(k, l) return (k, l)<br />

if max(i, j) < max(k, l) return (i, j)<br />

〈〈if max(i, j) > max(k, l)〉〉 return (k, l)<br />

13.2 The Only MST Algorithm<br />

There are several different ways to compute minimum spanning trees, but really they are all instances<br />

of the following generic algorithm. The situation is similar to the previous lecture, where we<br />

saw that depth-first search and breadth-first search were both instances of a single generic traversal<br />

algorithm.<br />

The generic MST algorithm maintains an acyclic subgraph F of the input graph G, which we<br />

will call an intermediate spanning forest. F is a subgraph of the minimum spanning tree of G,<br />


and every component of F is a minimum spanning tree of its vertices. Initially, F consists of n<br />

one-node trees. The generic MST algorithm merges trees together by adding certain edges between<br />

them. When the algorithm halts, F consists of a single n-node tree, which must be the minimum<br />

spanning tree. Obviously, we have to be careful about which edges we add to the evolving forest,<br />

since not every edge is in the eventual minimum spanning tree.<br />

The intermediate spanning forest F induces two special types of edges. An edge is useless if it is<br />

not an edge of F , but both its endpoints are in the same component of F . For each component of F ,<br />

we associate a safe edge—the minimum-weight edge with exactly one endpoint in that component. 1<br />

Different components might or might not have different safe edges. Some edges are neither safe nor<br />

useless—we call these edges undecided.<br />

All minimum spanning tree algorithms are based on two simple observations.<br />

Lemma 1. The minimum spanning tree contains every safe edge and no useless edges.<br />

Proof: Let T be the minimum spanning tree. Suppose F has a ‘bad’ component whose safe edge<br />

e = (u, v) is not in T . Since T is connected, it contains a unique path from u to v, and at least<br />

one edge e ′ on this path has exactly one endpoint in the bad component. Removing e ′ from the<br />

minimum spanning tree and adding e gives us a new spanning tree. Since e is the bad component’s<br />

safe edge, we have w(e ′ ) > w(e), so the new spanning tree has smaller total weight than T .<br />

But this is impossible—T is the minimum spanning tree. So T must contain every safe edge.<br />

Adding any useless edge to F would introduce a cycle. <br />

Proving that every safe edge is in the minimum spanning tree. The ‘bad’ component of F is highlighted.<br />

So our generic minimum spanning tree algorithm repeatedly adds one or more safe edges to<br />

the evolving forest F . Whenever we add new edges to F , some undecided edges become safe, and<br />

others become useless. To specify a particular algorithm, we must decide which safe edges to add,<br />

and how to identify new safe and new useless edges, at each iteration of our generic template.<br />

13.3 Borůvka’s Algorithm<br />

The oldest and possibly simplest minimum spanning tree algorithm was discovered by Borůvka in<br />

1926, long before computers even existed, and practically before the invention of graph theory! 2 The<br />

algorithm was rediscovered by Choquet in 1938; again by Florek, Lukasiewicz, Perkal, Steinhaus,<br />

and Zubrzycki in 1951; and again by Sollin some time in the early 1960s.<br />

The Borůvka/Choquet/Florek/Lukasiewicz/Perkal/Steinhaus/Zubrzycki/Sollin algorithm can<br />

be summarized in one line:<br />

1 This is actually a special case of a more general definition: For any partition of F into two subforests, the<br />

minimum-weight edge with one endpoint in each subforest is light. A few minimum spanning tree algorithms require<br />

this more general definition, but we won’t talk about them here.<br />

2 The first book on graph theory, written by D. König, was published in 1936. Leonhard Euler published his famous<br />

theorem about the bridges of Königsberg (HW3, problem 2) in 1736. Königsberg was not named after that König.<br />


Borůvka: Add all the safe edges and recurse.<br />

Borůvka’s algorithm run on the example graph. Thick edges are in F .<br />

Arrows point along each component’s safe edge. Dashed edges are useless.<br />

At the beginning of each phase of Borůvka’s algorithm, each component elects an arbitrary<br />

‘leader’ node. The simplest way to hold these elections is a depth-first search of F ; the first node<br />

we visit in any component is that component’s leader. Once the leaders are elected, we find the<br />

safe edges for each component, essentially by brute force. Finally, we add these safe edges to F .<br />

Borůvka(V, E):<br />

F ← (V, ∅)<br />

while F has more than one component<br />

choose leaders using DFS<br />

FindSafeEdges(V, E)<br />

for each leader v<br />

add safe(v) to F<br />


FindSafeEdges(V, E):<br />

for each leader v<br />

safe(v) ← ∞<br />

for each edge (u, v) ∈ E<br />

u ′ ← leader(u)<br />

v ′ ← leader(v)<br />

if u ′ ≠ v ′<br />

if w(u, v) < w(safe(u ′ ))<br />

safe(u ′ ) ← (u, v)<br />

if w(u, v) < w(safe(v ′ ))<br />

safe(v ′ ) ← (u, v)<br />

Each call to FindSafeEdges takes O(E) time, since it examines every edge. Since the graph is<br />

connected, it has at most E + 1 vertices. Thus, each iteration of the while loop in Borůvka takes<br />

O(E) time, assuming the graph is represented by an adjacency list. Each iteration also reduces the<br />

number of components of F by at least a factor of two—the worst case occurs when the components<br />

coalesce in pairs. Since there are initially V components, the while loop iterates O(log V ) times.<br />

Thus, the overall running time of Borůvka’s algorithm is O(E log V ) .<br />

Despite its relatively obscure origin, early algorithms researchers were aware of Borůvka’s algorithm,<br />

but dismissed it as being “too complicated”! As a result, despite its simplicity and efficiency,<br />

Borůvka’s algorithm is rarely mentioned in algorithms and data structures textbooks.<br />
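To show how little machinery the algorithm needs, here is a complete Python sketch. Instead of electing leaders by depth-first search, it keeps a leader pointer per vertex and redirects one pointer whenever two components merge (a simplification of mine, not the DFS election described above):<br />

```python
def boruvka(n, edges):
    """Boruvka's MST algorithm (sketch): vertices 0..n-1, edges given as
    (weight, u, v) triples with distinct weights; the graph is connected."""
    leader = list(range(n))       # leader[v]: representative of v's component

    def find(v):                  # walk up the leader pointers
        while leader[v] != v:
            v = leader[v]
        return v

    mst, components = [], n
    while components > 1:
        safe = [None] * n         # safe[l]: cheapest edge leaving l's component
        for e in edges:
            wt, u, v = e
            lu, lv = find(u), find(v)
            if lu != lv:          # edge leaves a component: a safe-edge candidate
                if safe[lu] is None or wt < safe[lu][0]:
                    safe[lu] = e
                if safe[lv] is None or wt < safe[lv][0]:
                    safe[lv] = e
        for e in set(s for s in safe if s is not None):
            wt, u, v = e
            lu, lv = find(u), find(v)
            if lu != lv:          # adding a safe edge merges two components
                leader[lu] = lv
                mst.append(e)
                components -= 1
    return mst
```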

13.4 Jarník’s (‘Prim’s’) Algorithm<br />

The next oldest minimum spanning tree algorithm was discovered by Vojtěch Jarník in 1930,<br />

but it is usually called Prim’s algorithm. Prim independently rediscovered the algorithm in 1956<br />

and gave a much more detailed description than Jarník. The algorithm was rediscovered again in<br />

1958 by Dijkstra, but he already had an algorithm named after him. Such is fame in academia.<br />

In Jarník’s algorithm, the forest F contains only one nontrivial component T ; all the other<br />

components are isolated vertices. Initially, T consists of an arbitrary vertex of the graph. The<br />

algorithm repeats the following step until T spans the whole graph:<br />



Jarník: Find T ’s safe edge and add it to T .<br />


Jarník’s algorithm run on the example graph, starting with the bottom vertex.<br />

At each stage, thick edges are in T , an arrow points along T ’s safe edge, and dashed edges are useless.<br />

To implement Jarník’s algorithm, we keep all the edges adjacent to T in a heap. When we<br />

pull the minimum-weight edge off the heap, we first check whether both of its endpoints are in T .<br />

If not, we add the edge to T and then add the new neighboring edges to the heap. In other<br />

words, Jarník’s algorithm is just another instance of the generic graph traversal algorithm we saw<br />

last time, using a heap as the ‘bag’! If we implement the algorithm this way, its running time is<br />

O(E log E) = O(E log V ).<br />

However, we can speed up the implementation by observing that the graph traversal algorithm<br />

visits each vertex only once. Rather than keeping edges in the heap, we can keep a heap of vertices,<br />

where the key of each vertex v is the length of the minimum-weight edge between v and T (or ∞<br />

if there is no such edge). Each time we add a new edge to T , we may need to decrease the key of<br />

some neighboring vertices.<br />

To make the description easier, we break the algorithm into two parts. JarníkInit initializes<br />

the vertex heap. JarníkLoop is the main algorithm. The input consists of the vertices and edges<br />

of the graph, plus the start vertex s.<br />

JarníkInit(V, E, s):<br />

for each vertex v ∈ V \ {s}<br />

if (v, s) ∈ E<br />

edge(v) ← (v, s)<br />

key(v) ← w(v, s)<br />

else<br />

edge(v) ← Null<br />

key(v) ← ∞<br />

Insert(v)<br />


Jarník(V, E, s):<br />

JarníkInit(V, E, s)<br />

JarníkLoop(V, E, s)<br />

JarníkLoop(V, E, s):<br />

T ← ({s}, ∅)<br />

for i ← 1 to |V | − 1<br />

v ← ExtractMin<br />

add v and edge(v) to T<br />

for each edge (u, v) ∈ E<br />

if u /∈ T and key(u) > w(u, v)<br />

edge(u) ← (u, v)<br />

DecreaseKey(u, w(u, v))<br />


The running time of Jarník is dominated by the cost of the heap operations Insert, Extract-<br />

Min, and DecreaseKey. Insert and ExtractMin are each called O(V ) times, once for each<br />

vertex except s, and DecreaseKey is called O(E) times, at most twice for each edge. If we use<br />

a Fibonacci heap, the amortized costs of Insert and DecreaseKey are O(1), and the amortized<br />

cost of ExtractMin is O(log V ). Thus, the overall running time of Jarník is O(E + V log V ) .<br />

This is faster than Borůvka’s algorithm unless E = O(V ).<br />
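Here is a Python sketch of the simpler edge-heap version described first (O(E log E) time), using the standard library’s binary heap as the ‘bag’; the adjacency-list encoding is my own choice:<br />

```python
import heapq

def jarnik(adj, s):
    """Jarnik's (Prim's) algorithm with a binary heap of edges (sketch).

    adj[v] is a list of (weight, neighbor) pairs; returns the MST edges."""
    in_tree = {s}
    mst = []
    heap = [(wt, s, v) for (wt, v) in adj[s]]
    heapq.heapify(heap)
    while heap:
        wt, u, v = heapq.heappop(heap)   # lightest edge seen leaving the tree
        if v in in_tree:
            continue                     # both endpoints in T: edge is useless
        in_tree.add(v)                   # this is T's safe edge: add it
        mst.append((wt, u, v))
        for (w2, x) in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (w2, v, x))
    return mst
```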

13.5 Kruskal’s Algorithm<br />

The last minimum spanning tree algorithm I’ll discuss was discovered by Kruskal in 1956.<br />

Kruskal: Scan all edges in increasing weight order; if an edge is safe, add it to F .<br />

Since we examine the edges in order from lightest to heaviest, any edge we examine is safe if<br />

and only if its endpoints are in different components of the forest F . To prove this, suppose the<br />

edge e joins two components A and B but is not safe. Then there would be a lighter edge e ′ with<br />

exactly one endpoint in A. But this is impossible, because (inductively) any previously examined<br />

edge has both endpoints in the same component of F .<br />


Kruskal’s algorithm run on the example graph. Thick edges are in F . Dashed edges are useless.<br />

Just as in Borůvka’s algorithm, each component of F has a ‘leader’ node. An edge joins two<br />

components of F if and only if the two endpoints have different leaders. But unlike Borůvka’s<br />

algorithm, we do not recompute leaders from scratch every time we add an edge. Instead, when<br />

two components are joined, the two leaders duke it out in a nationally-televised no-holds-barred<br />

steel-cage grudge match. 3 One of the two emerges victorious as the leader of the new larger<br />

component. More formally, we will use our earlier algorithms for the Union-Find problem, where<br />

the vertices are the elements and the components of F are the sets. Here’s a more formal description<br />

of the algorithm:<br />

3 Live at the Assembly Hall! Only $49.95 on Pay-Per-View!<br />


Kruskal(V, E):<br />

sort E by weight<br />

F ← ∅<br />

for each vertex v ∈ V<br />

MakeSet(v)<br />

for i ← 1 to |E|<br />

(u, v) ← ith lightest edge in E<br />

if Find(u) ≠ Find(v)<br />

Union(u, v)<br />

add (u, v) to F<br />

return F<br />

In our case, the sets are components of F , and n = V . Kruskal’s algorithm performs O(E)<br />

Find operations, two for each edge in the graph, and O(V ) Union operations, one for each edge in<br />

the minimum spanning tree. Using union-by-rank and path compression allows us to perform each<br />

Union or Find in O(α(E, V )) time, where α is the not-quite-constant inverse Ackermann function.<br />

So ignoring the cost of sorting the edges, the running time of this algorithm is O(Eα(E, V )).<br />

We need O(E log E) = O(E log V ) additional time just to sort the edges. Since this is bigger<br />

than the time for the Union-Find data structure, the overall running time of Kruskal’s algorithm<br />

is O(E log V ) , exactly the same as Borůvka’s algorithm, or Jarník’s algorithm with a normal<br />

(non-Fibonacci) heap.<br />
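Here is the whole algorithm as a Python sketch, with a small Union-Find implementation (union by rank plus path compression) folded in:<br />

```python
def kruskal(n, edges):
    """Kruskal's algorithm (sketch): vertices 0..n-1, edges given as
    (weight, u, v) triples."""
    parent = list(range(n))
    rank = [0] * n

    def find(v):
        if parent[v] != v:
            parent[v] = find(parent[v])    # path compression
        return parent[v]

    def union(u, v):
        u, v = find(u), find(v)
        if rank[u] < rank[v]:
            u, v = v, u
        parent[v] = u                      # union by rank
        if rank[u] == rank[v]:
            rank[u] += 1

    mst = []
    for wt, u, v in sorted(edges):         # scan edges lightest first
        if find(u) != find(v):             # endpoints in different components
            union(u, v)
            mst.append((wt, u, v))
    return mst
```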

CS 373 Lecture 14: Fast Fourier Transforms Fall 2002<br />

14 Fast Fourier Transforms (November 7)<br />

Blech! Ack! Oop! THPPFFT!<br />

— Bill the Cat, “Bloom County” (1980)<br />

14.1 Polynomials<br />

In this lecture we’ll talk about algorithms for manipulating polynomials: functions of one variable<br />

built from additions, subtractions, and multiplications (but no divisions). The most common<br />

representation for a polynomial p(x) is as a sum of weighted powers of a variable x:<br />

p(x) = Σ_{j=0}^{n} a_j x^j .<br />

The numbers aj are called coefficients. The degree of the polynomial is the largest power of x; in<br />

the example above, the degree is n. Any polynomial of degree n can be specified by a sequence<br />

of n + 1 coefficients. Some of these coefficients may be zero, but not the nth coefficient, because<br />

otherwise the degree would be less than n.<br />

Here are three of the most common operations that are performed with polynomials:<br />

• Evaluate: Given a polynomial p and a number x, compute the number p(x).<br />

• Add: Given two polynomials p and q, compute a polynomial r = p + q, so that r(x) =<br />

p(x) + q(x) for all x. If p and q both have degree n, then their sum p + q also has degree n.<br />

• Multiply: Given two polynomials p and q, compute a polynomial r = p · q, so that r(x) =<br />

p(x) · q(x) for all x. If p and q both have degree n, then their product p · q has degree 2n.<br />

Suppose we represent a polynomial of degree n as an array of n + 1 coefficients P [0 .. n], where<br />

P [j] is the coefficient of the x j term. We learned simple algorithms for all three of these operations<br />

in high-school algebra:<br />

Evaluate(P [0 .. n], x):<br />

X ← 1 〈〈X = x^j〉〉<br />

y ← 0<br />

for j ← 0 to n<br />

y ← y + P [j] · X<br />

X ← X · x<br />

return y<br />

Add(P [0 .. n], Q[0 .. n]):<br />

for j ← 0 to n<br />

R[j] ← P [j] + Q[j]<br />

return R[0 .. n]<br />

Multiply(P [0 .. n], Q[0 .. m]):<br />

for j ← 0 to n + m<br />

R[j] ← 0<br />

for j ← 0 to n<br />

for k ← 0 to m<br />

R[j + k] ← R[j + k] + P [j] · Q[k]<br />

return R[0 .. n + m]<br />
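The three high-school algorithms transcribe directly into Python. This is a sketch under the same assumptions as the pseudocode (coefficient arrays indexed from the constant term up); note that the product coefficient R[j + k] collects a contribution from every pair (j, k), so Multiply must accumulate into it rather than overwrite it.<br />

```python
# Direct Python transcriptions of Evaluate, Add, and Multiply.

def evaluate(P, x):
    y, X = 0, 1                   # invariant: X == x**j
    for a in P:
        y += a * X
        X *= x
    return y

def add(P, Q):
    # assumes len(P) == len(Q); pad with zeros otherwise
    return [p + q for p, q in zip(P, Q)]

def multiply(P, Q):
    R = [0] * (len(P) + len(Q) - 1)
    for j, p in enumerate(P):
        for k, q in enumerate(Q):
            R[j + k] += p * q     # accumulate, don't assign
    return R
```

For example, multiplying 1 + x by itself gives the coefficient list [1, 2, 1], i.e. 1 + 2x + x².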

Evaluate uses O(n) arithmetic operations.¹ This is the best you can do in theory, but we can<br />

cut the number of multiplications in half using Horner’s rule:<br />

p(x) = a_0 + x(a_1 + x(a_2 + \cdots + x a_n)).<br />

1 I’m going to assume in this lecture that each arithmetic operation takes O(1) time. This may not be true<br />

in practice; in fact, one of the most powerful applications of FFTs is fast integer multiplication. One of the<br />

fastest integer multiplication algorithms, due to Schönhage and Strassen, multiplies two n-bit binary numbers in<br />

O(n log n log log n log log log n log log log log n · · · ) time. The algorithm uses an n-element Fast Fourier Transform,<br />

which requires several O(log n)-bit integer multiplications. These smaller multiplications are carried out recursively<br />

(of course!), which leads to the cascade of logs in the running time. Needless to say, this is a can of worms.<br />


Horner(P [0 .. n], x):<br />

y ← P [n]<br />

for i ← n − 1 downto 0<br />

y ← x · y + P [i]<br />

return y<br />
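Horner's rule is equally short in Python; this sketch processes the coefficients from the highest degree down, using one multiplication and one addition per coefficient.<br />

```python
def horner(P, x):
    """Evaluate sum(P[j] * x**j) with n multiplications instead of 2n."""
    y = 0
    for a in reversed(P):   # from the highest-degree coefficient down
        y = x * y + a
    return y
```

Evaluating p(x) = 3x³ − 4x² + 7x + 5 at x = 2 this way gives 27, matching the naive algorithm.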

The addition algorithm also runs in O(n) time, and this is clearly the best we can do.<br />

The multiplication algorithm, however, runs in O(n^2) time. In the very first lecture, we saw a<br />

divide and conquer algorithm (due to Karatsuba) for multiplying two n-bit integers in only O(n^{lg 3})<br />

steps; precisely the same algorithm can be applied here. Even cleverer divide-and-conquer strategies<br />

lead to multiplication algorithms whose running times are arbitrarily close to linear, O(n^{1+ε}) for<br />

your favorite value ε > 0, but with great cleverness comes great confusion. These algorithms are<br />

difficult to understand, even more difficult to implement correctly, and not worth the trouble in<br />

practice thanks to large constant factors.<br />

14.2 Alternate Representations: Roots and Samples<br />

Part of what makes multiplication so much harder than the other two operations is our input<br />

representation. Coefficient vectors are the most common representation for polynomials, but<br />

there are at least two other useful representations.<br />

The first exploits the fundamental theorem of algebra: Every polynomial p of degree n has n<br />

roots r1, r2, . . . rn such that p(rj) = 0 for all j. Some of these roots may be irrational; some of these<br />

roots may be complex; and some of these roots may be repeated. Despite these complications, we<br />

do get a unique representation of any polynomial of the form<br />

p(x) = s \prod_{j=1}^{n} (x - r_j)<br />

where the rj’s are the roots and s is a scale factor. Once again, to represent a polynomial of degree<br />

n, we need a list of n + 1 numbers: one scale factor and n roots.<br />

Given a polynomial in root representation, we can clearly evaluate it in O(n) time. Given two<br />

polynomials in root representation, we can easily multiply them in O(n) time by multiplying their<br />

scale factors and just concatenating the two root sequences; in fact, if we don’t care about keeping<br />

the old polynomials around, we can compute their product in O(1) time! Unfortunately, if we want<br />

to add two polynomials in root representation, we’re pretty much out of luck; there’s essentially<br />

no correlation between the roots of p, the roots of q, and the roots of p + q. We could convert<br />

the polynomials to the more familiar coefficient representation first—this takes O(n^2) time using<br />

the high-school algorithms—but there’s no easy way to convert the answer back. In fact, given a<br />

polynomial in coefficient form, it’s usually impossible to compute its roots exactly. 2<br />
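The root representation's trade-offs are easy to see in code. This is a minimal sketch (names are mine): a polynomial is a scale factor plus a list of roots, so multiplication is just concatenation, while evaluation is a linear scan.<br />

```python
# Root representation: (scale, roots) means s * (x - r_1) * ... * (x - r_n).

def mul_root_form(s1, roots1, s2, roots2):
    # Multiply scale factors and concatenate root lists; O(n) (or O(1)
    # if we don't mind the inputs sharing structure with the output).
    return s1 * s2, roots1 + roots2

def eval_root_form(s, roots, x):
    y = s
    for r in roots:          # O(n) evaluation
        y *= (x - r)
    return y
```

Addition has no such shortcut, which is exactly the problem the notes describe: the roots of p + q bear no simple relation to the roots of p and q.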

Our third representation for polynomials comes from the following consequence of the fundamental<br />

theorem of algebra. Given a list of n + 1 pairs {(x0, y0), (x1, y1), . . . , (xn, yn) }, there is<br />

exactly one polynomial p of degree n such that p(xj) = yj for all j. This is just a generalization<br />

of the fact that any two points determine a unique line, since a line is (the graph of) a polynomial<br />

of degree 1. We say that the polynomial p interpolates the points (xj, yj). As long as we agree<br />

on the sample locations xj in advance, we once again need exactly n + 1 numbers to represent a<br />

polynomial of degree n.<br />

2 This is where numerical analysis comes from.<br />


Adding or multiplying two polynomials in this sample representation is easy, as long as they use<br />

the same sample locations xj. To add the polynomials, just add their sample values. To multiply<br />

two polynomials, just multiply their sample values; however, if we’re multiplying two polynomials<br />

of degree n, we need to start with 2n + 1 sample values for each polynomial, since that’s how many<br />

we need to uniquely represent the product polynomial. Both algorithms run in O(n) time.<br />

Unfortunately, evaluating a polynomial in this representation is no longer trivial. The following<br />

formula, due to Lagrange, allows us to compute the value of any polynomial of degree n at any<br />

point, given a set of n + 1 samples.<br />

p(x) = \sum_{j=0}^{n} y_j \frac{\prod_{k \ne j} (x - x_k)}{\prod_{k \ne j} (x_j - x_k)}
     = \sum_{j=0}^{n} \left( \frac{y_j}{\prod_{k \ne j} (x_j - x_k)} \prod_{k \ne j} (x - x_k) \right)<br />

Hopefully it’s clear that this formula actually describes a polynomial, since each term in the rightmost<br />

sum is written as a scaled product of monomials. It’s also not hard to check that p(xj) = yj for<br />

all j. As I mentioned earlier, the fact that this is the only polynomial that interpolates the points<br />

{(xj, yj)} is an easy consequence of the fundamental theorem of algebra. We can easily transform<br />

this formula into an O(n^2)-time algorithm.<br />

We find ourselves in the following frustrating situation. We have three representations for polynomials<br />

and three basic operations. Each representation allows us to almost trivially perform a<br />

different pair of operations in linear time, but the third takes at least quadratic time, if it can be<br />

done at all!<br />

                 evaluate    add     multiply<br />
coefficients     O(n)        O(n)    O(n^2)<br />
roots + scale    O(n)        ∞       O(n)<br />
samples          O(n^2)      O(n)    O(n)<br />

14.3 Converting Between Representations?<br />

What we need are fast algorithms to convert quickly from one representation to another. That way,<br />

when we need to perform an operation that’s hard for our default representation, we can switch to<br />

a different representation that makes the operation easy, perform that operation, and then switch<br />

back. This strategy immediately rules out the root representation, since (as I mentioned earlier)<br />

finding roots of polynomials is impossible in general, at least if we’re interested in exact results.<br />

So how do we convert from coefficients to samples and back? Clearly, once we choose our sample<br />

positions xj, we can compute each sample value yj = p(xj) in O(n) time from the coefficients.<br />

So we can convert a polynomial of degree n from coefficients to samples in O(n^2) time. The<br />

Lagrange formula gives us an explicit conversion algorithm from the sample representation back<br />

to the more familiar coefficient representation. If we use the naïve algorithms for adding and<br />

multiplying polynomials (in coefficient form), this conversion takes O(n^3) time.<br />

This looks pretty bad, until we realize there’s a degree of freedom we haven’t exploited yet.<br />

Whenever we convert from coefficients to samples, we get to choose the sample points!<br />

Our slow algorithms may be slow only because we’re trying to be too general. Perhaps, if we choose<br />

a set of sample points with just the right kind of recursive structure, we can do the conversion more<br />

quickly. In fact, there is a set of sample points that’s perfect for the job.<br />

14.4 The Discrete Fourier Transform<br />

Given a polynomial of degree n−1, we’d like to find n sample points that are somehow as symmetric<br />

as possible. The most natural choice for those n points are the nth roots of unity; these are the<br />


roots of the polynomial x n − 1 = 0. These n roots are spaced exactly evenly around the unit circle<br />

in the complex plane.³ Every nth root of unity is a power of the primitive root<br />
\omega_n = e^{2\pi i/n} = \cos\frac{2\pi}{n} + i \sin\frac{2\pi}{n}.<br />
A typical nth root of unity has the form<br />
\omega_n^j = e^{(2\pi i/n)j} = \cos\Bigl(\frac{2\pi}{n} j\Bigr) + i \sin\Bigl(\frac{2\pi}{n} j\Bigr).<br />
These complex numbers have several useful properties for any integers n and k:<br />
• There are only n different nth roots of unity: \omega_n^k = \omega_n^{k \bmod n}.<br />
• If n is even, then \omega_n^{k+n/2} = -\omega_n^k; in particular, \omega_n^{n/2} = -\omega_n^0 = -1.<br />
• 1/\omega_n^k = \omega_n^{-k} = \overline{\omega_n^k} = (\overline{\omega_n})^k, where the bar represents complex conjugation: \overline{a + bi} = a - bi.<br />
• \omega_n = \omega_{kn}^k. Thus, every nth root of unity is also a (kn)th root of unity.<br />
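These properties are easy to verify numerically. A small Python sketch using the standard cmath module (the helper name is mine):<br />

```python
import cmath

def root_of_unity(n, k):
    """omega_n^k = e^(2*pi*i*k/n)."""
    return cmath.exp(2j * cmath.pi * k / n)

n, k = 8, 3
w = root_of_unity(n, k)
# omega_n^k is periodic in k with period n
assert abs(w - root_of_unity(n, k + n)) < 1e-12
# for even n, shifting k by n/2 negates the root
assert abs(root_of_unity(n, k + n // 2) + w) < 1e-12
# the reciprocal of omega_n^k is its complex conjugate (it lies on the unit circle)
assert abs(1 / w - w.conjugate()) < 1e-12
```

All three assertions pass (up to floating-point round-off), matching the algebraic identities above.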

If we sample a polynomial of degree n − 1 at the nth roots of unity, the resulting list of sample<br />

values is called the discrete Fourier transform of the polynomial (or more formally, of the coefficient<br />

vector). Thus, given an array P [0 .. n − 1] of coefficients, the discrete Fourier transform computes<br />

a new vector P ∗ [0 .. n − 1] where<br />

P^*[j] = p(\omega_n^j) = \sum_{k=0}^{n-1} P[k] \cdot \omega_n^{jk}.<br />

We can obviously compute P ∗ in O(n^2) time, but the structure of the nth roots of unity lets<br />

us do better. But before we describe that faster algorithm, let’s think about how we might invert<br />

this transformation.<br />
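The obvious O(n²) algorithm just evaluates the sum above for every j; here is a hedged Python sketch (function name is mine), useful later as a correctness check for the fast algorithm.<br />

```python
import cmath

def naive_dft(P):
    """O(n^2) discrete Fourier transform: sample the polynomial with
    coefficients P at all n of the nth roots of unity."""
    n = len(P)
    return [sum(P[k] * cmath.exp(2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]
```

For instance, the DFT of the polynomial x (coefficients [0, 1, 0, 0]) is just the list of 4th roots of unity 1, i, −1, −i.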

It’s not hard to see that the discrete Fourier transform—in fact, any conversion from a vector<br />

of coefficients to a vector of sample values—is a linear transformation. The DFT just multiplies<br />

the coefficient vector by a matrix V to obtain the sample vector. Each entry in V is an nth root of<br />

unity; specifically, v_{jk} = \omega_n^{jk} for all j, k.<br />

V = \begin{bmatrix}<br />
1 & 1 & 1 & 1 & \cdots & 1 \\<br />
1 & \omega_n & \omega_n^2 & \omega_n^3 & \cdots & \omega_n^{n-1} \\<br />
1 & \omega_n^2 & \omega_n^4 & \omega_n^6 & \cdots & \omega_n^{2(n-1)} \\<br />
1 & \omega_n^3 & \omega_n^6 & \omega_n^9 & \cdots & \omega_n^{3(n-1)} \\<br />
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\<br />
1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \omega_n^{3(n-1)} & \cdots & \omega_n^{(n-1)^2}<br />
\end{bmatrix}<br />

3 In this lecture, i always represents the square root of −1. Most computer scientists are used to thinking of i as<br />

an integer index into a sequence, an array, or a for-loop, but we obviously can’t do that here. The physicist’s habit<br />

of using j = √ −1 just delays the problem (how do physicists write quaternions?), and typographical tricks like I or<br />

i or Mathematica’s ⅈ are just stupid.<br />


To invert the discrete Fourier transform, we just have to multiply P ∗ by the inverse matrix V −1 .<br />

But this is almost the same as multiplying by V itself, because of the following amazing fact:<br />

V^{-1} = \overline{V} / n<br />
In other words, if W = V^{-1}, then w_{jk} = \overline{v_{jk}}/n = \overline{\omega_n^{jk}}/n = \omega_n^{-jk}/n. It’s not hard to prove this fact<br />
with a little linear algebra.<br />
Proof: We just have to show that M = V W is the identity matrix. We can compute a single entry<br />
in this matrix as follows:<br />
m_{jk} = \sum_{l=0}^{n-1} v_{jl} \cdot w_{lk} = \sum_{l=0}^{n-1} \omega_n^{jl} \cdot \omega_n^{-lk} / n = \frac{1}{n} \sum_{l=0}^{n-1} \omega_n^{(j-k)l}<br />
If j = k, then \omega_n^{j-k} = 1, so<br />
m_{jk} = \frac{1}{n} \sum_{l=0}^{n-1} 1 = \frac{n}{n} = 1,<br />
and if j \ne k, we have a geometric series:<br />
m_{jk} = \frac{1}{n} \sum_{l=0}^{n-1} (\omega_n^{j-k})^l = \frac{(\omega_n^{j-k})^n - 1}{n\,(\omega_n^{j-k} - 1)} = \frac{(\omega_n^{n})^{j-k} - 1}{n\,(\omega_n^{j-k} - 1)} = \frac{1^{j-k} - 1}{n\,(\omega_n^{j-k} - 1)} = 0.<br />
That’s it! □<br />

What this means for us computer scientists is that any algorithm for computing the discrete<br />

Fourier transform can be easily modified to compute the inverse transform as well.<br />

14.5 Divide and Conquer<br />

The structure of the matrix V also allows us to compute the discrete Fourier transform efficiently<br />

using a divide and conquer strategy. The basic structure of the algorithm is almost the same as<br />

MergeSort, and the O(n log n) running time will ultimately follow from the same recurrence. The<br />

Fast Fourier Transform algorithm, popularized by Cooley and Tukey in 1965,⁴ assumes that n is<br />

a power of two; if necessary, we can just pad the coefficient vector with zeros.<br />

To get an idea of how the divide-and-conquer strategy works, let’s look at the DFT matrices<br />

for n = 8. To simplify notation, let \omega = \omega_8 = \sqrt{2}/2 + i\sqrt{2}/2.<br />

\begin{bmatrix}<br />
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\<br />
1 & \omega & \omega^2 & \omega^3 & \omega^4 & \omega^5 & \omega^6 & \omega^7 \\<br />
1 & \omega^2 & \omega^4 & \omega^6 & \omega^8 & \omega^{10} & \omega^{12} & \omega^{14} \\<br />
1 & \omega^3 & \omega^6 & \omega^9 & \omega^{12} & \omega^{15} & \omega^{18} & \omega^{21} \\<br />
1 & \omega^4 & \omega^8 & \omega^{12} & \omega^{16} & \omega^{20} & \omega^{24} & \omega^{28} \\<br />
1 & \omega^5 & \omega^{10} & \omega^{15} & \omega^{20} & \omega^{25} & \omega^{30} & \omega^{35} \\<br />
1 & \omega^6 & \omega^{12} & \omega^{18} & \omega^{24} & \omega^{30} & \omega^{36} & \omega^{42} \\<br />
1 & \omega^7 & \omega^{14} & \omega^{21} & \omega^{28} & \omega^{35} & \omega^{42} & \omega^{49}<br />
\end{bmatrix}
=
\begin{bmatrix}<br />
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\<br />
1 & \omega & i & i\omega & -1 & -\omega & -i & -i\omega \\<br />
1 & i & -1 & -i & 1 & i & -1 & -i \\<br />
1 & i\omega & -i & \omega & -1 & -i\omega & i & -\omega \\<br />
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\<br />
1 & -\omega & i & -i\omega & -1 & \omega & -i & i\omega \\<br />
1 & -i & -1 & i & 1 & -i & -1 & i \\<br />
1 & -i\omega & -i & -\omega & -1 & i\omega & i & \omega<br />
\end{bmatrix}<br />

4 Actually, the FFT algorithm was previously published by Runge and König in 1924, and again by Yates in<br />

1932, and again by Stumpf in 1937, and again by Danielson and Lanczos in 1942. But it was first used by Gauss<br />

in the 1800s for calculating the paths of asteroids from a finite number of equally-spaced observations. By hand.<br />

Fourier always did it the hard way. Cooley and Tukey apparently developed their algorithm to help detect Soviet<br />

nuclear tests without actually visiting Soviet nuclear facilities, by interpolating off-shore seismic readings. Without<br />

their rediscovery of the FFT algorithm, the nuclear test ban treaty would never have been ratified, and we’d all be<br />

speaking Russian, or more likely, whatever language radioactive glass speaks.<br />


The entries in the even-numbered rows and the first four columns of the simplified matrix actually form the DFT matrix for n = 4!<br />

\begin{bmatrix}<br />
1 & 1 & 1 & 1 \\<br />
1 & \omega_4 & \omega_4^2 & \omega_4^3 \\<br />
1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\<br />
1 & \omega_4^3 & \omega_4^6 & \omega_4^9<br />
\end{bmatrix}
=
\begin{bmatrix}<br />
1 & 1 & 1 & 1 \\<br />
1 & i & -1 & -i \\<br />
1 & -1 & 1 & -1 \\<br />
1 & -i & -1 & i<br />
\end{bmatrix}<br />
The input to the FFT algorithm is an array P [0 .. n − 1] of coefficients of a polynomial p(x)<br />

with degree n − 1. We start by splitting p into two smaller polynomials u and v, each with degree<br />

n/2 − 1, by setting<br />

U[k] = P [2k] and V [k] = P [2k + 1].<br />

In other words, u has all the even-degree coefficients of p, and v has all the odd-degree coefficients.<br />

For example, if p(x) = 3x 3 − 4x 2 + 7x + 5, then u(x) = −4x + 5 and v(x) = 3x + 7. These three<br />

polynomials satisfy the equation<br />

p(x) = u(x^2) + x \cdot v(x^2).<br />

In particular, if x is an nth root of unity, we have<br />

P^*[k] = p(\omega_n^k) = u(\omega_n^{2k}) + \omega_n^k \cdot v(\omega_n^{2k}).<br />

Now we can exploit those roots of unity again. Since n is a power of two, n must be even, so we<br />

have \omega_n^{2k} = \omega_{n/2}^{k \bmod n/2}. In other words, the values of p at the nth roots of unity depend on<br />

the values of u and v at (n/2)th roots of unity. Those are just the entries of the DFTs of u and v!<br />

P^*[k] = U^*[k \bmod n/2] + \omega_n^k \cdot V^*[k \bmod n/2]<br />

So once we recursively compute U ∗ and V ∗ , we can compute P ∗ in linear time. The base case for<br />

the recurrence is n = 1. The overall running time satisfies the recurrence T (n) = Θ(n) + 2T (n/2),<br />

which as we all know solves to T (n) = Θ(n log n).<br />

Here’s the complete FFT algorithm, along with its inverse.<br />

FFT(P [0 .. n − 1]):<br />

if n = 1<br />

return P<br />

for j ← 0 to n/2 − 1<br />

U[j] ← P [2j]<br />

V [j] ← P [2j + 1]<br />

U ∗ ← FFT(U[0 .. n/2 − 1])<br />

V ∗ ← FFT(V [0 .. n/2 − 1])<br />

ωn ← cos(2π/n) + i sin(2π/n)<br />

ω ← 1<br />

for j ← 0 to n/2 − 1<br />

P ∗ [j] ← U ∗ [j] + ω · V ∗ [j]<br />

P ∗ [j + n/2] ← U ∗ [j] − ω · V ∗ [j]<br />

ω ← ω · ωn<br />

return P ∗ [0 .. n − 1]<br />


InverseFFT(P ∗ [0 .. n − 1]):<br />

if n = 1<br />

return P ∗<br />

for j ← 0 to n/2 − 1<br />

U ∗ [j] ← P ∗ [2j]<br />

V ∗ [j] ← P ∗ [2j + 1]<br />

U ← InverseFFT(U ∗ [0 .. n/2 − 1])<br />

V ← InverseFFT(V ∗ [0 .. n/2 − 1])<br />

ωn ← cos(2π/n) − i sin(2π/n)<br />

ω ← 1<br />

for j ← 0 to n/2 − 1<br />

P [j] ← (U[j] + ω · V [j])/2<br />

P [j + n/2] ← (U[j] − ω · V [j])/2<br />

ω ← ω · ωn<br />

return P [0 .. n − 1]
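The FFT and its inverse transcribe into a short recursive Python sketch (function names are mine). One common variation, used here, is to fold all of the inverse transform's scaling into a single division by n at the end rather than scaling at every level of the recursion; the two conventions give the same answer.<br />

```python
import cmath

def fft(P, sign=+1):
    """Recursive radix-2 FFT; len(P) must be a power of two.
    sign=+1 evaluates at the nth roots of unity (forward DFT);
    sign=-1 evaluates at their reciprocals (unscaled inverse DFT)."""
    n = len(P)
    if n == 1:
        return list(P)
    U = fft(P[0::2], sign)   # even-degree coefficients
    V = fft(P[1::2], sign)   # odd-degree coefficients
    out = [0] * n
    for j in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * j / n)   # omega_n^(+/- j)
        out[j] = U[j] + w * V[j]
        out[j + n // 2] = U[j] - w * V[j]
    return out

def inverse_fft(S):
    n = len(S)
    return [x / n for x in fft(S, sign=-1)]   # one division by n at the end
```

A quick sanity check: transforming a coefficient vector and then inverting recovers the original coefficients, up to floating-point round-off.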


<strong>CS</strong> <strong>373</strong> Lecture 14: Fast Fourier Transforms Fall 2002<br />

Given two polynomials p and q, each represented by an array of coefficients, we can multiply<br />

them in Θ(n log n) arithmetic operations as follows. First, pad the coefficient vectors and with zeros<br />

until the size is a power of two greater than or equal to the sum of the degrees. Then compute the<br />

DFTs of each coefficient vector, multiply the sample values one by one, and compute the inverse<br />

DFT of the resulting sample vector.<br />

14.6 Inside the FFT<br />

FFTMultiply(P [0 .. n − 1], Q[0 .. m − 1]):<br />

ℓ ← ⌈lg(n + m)⌉<br />

for j ← n to 2 ℓ − 1<br />

P [j] ← 0<br />

for j ← m to 2 ℓ − 1<br />

Q[j] ← 0<br />

P ∗ ← FFT(P )<br />
Q ∗ ← FFT(Q)<br />

for j ← 0 to 2 ℓ − 1<br />

R ∗ [j] ← P ∗ [j] · Q ∗ [j]<br />

return InverseFFT(R ∗ )<br />
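The whole multiplication pipeline, pad, transform, multiply pointwise, invert, fits in a short self-contained Python sketch (names are mine, and the final rounding assumes integer input coefficients):<br />

```python
import cmath

def fft(P, sign=+1):
    """Recursive radix-2 FFT; sign=-1 gives the unscaled inverse transform."""
    n = len(P)
    if n == 1:
        return list(P)
    U, V = fft(P[0::2], sign), fft(P[1::2], sign)
    out = [0] * n
    for j in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * j / n)
        out[j], out[j + n // 2] = U[j] + w * V[j], U[j] - w * V[j]
    return out

def fft_multiply(P, Q):
    """Multiply two integer coefficient vectors in O(n log n) time."""
    size = 1
    while size < len(P) + len(Q) - 1:   # room for the product's coefficients
        size *= 2
    Ps = fft(P + [0] * (size - len(P)))
    Qs = fft(Q + [0] * (size - len(Q)))
    Rs = [a * b for a, b in zip(Ps, Qs)]           # pointwise product
    R = [x / size for x in fft(Rs, sign=-1)]       # inverse DFT
    return [round(x.real) for x in R[:len(P) + len(Q) - 1]]
```

For example, multiplying 1 + x by itself gives [1, 2, 1], the same answer as the quadratic-time algorithm.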

FFTs are often implemented in hardware as circuits. To see the recursive structure of the circuit,<br />

let’s connect the top-level inputs and outputs to the inputs and outputs of the recursive calls. On<br />

the left we split the input P into two recursive inputs U and V . On the right, we combine the<br />

outputs U ∗ and V ∗ to obtain the final output P ∗ .<br />

[Figure: The recursive structure of the FFT algorithm. The input P passes through a bit-reversal permutation (indices 000–111 reordered to 000, 100, 010, 110, 001, 101, 011, 111), into two recursive FFT(n/2) boxes computing U ∗ and V ∗ , and finally through a butterfly network that combines them into P ∗ .]<br />

If we expand this recursive structure completely, we see that the circuit splits naturally into<br />

two parts. The left half computes the bit-reversal permutation of the input. To find the position<br />

of P [k] in this permutation, write k in binary, and then read the bits backwards. For example,<br />

in an 8-element bit-reversal permutation, P [3] = P [011₂] ends up in position 6 = 110₂. The right<br />

half of the FFT circuit is a butterfly network. Butterfly networks are often used to route between<br />

processors in massively-parallel computers, since they allow any processor to communicate with<br />

any other in only O(log n) steps.<br />
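The bit-reversal permutation just described is easy to compute directly. A small Python sketch (helper names are mine):<br />

```python
def bit_reversal_permutation(k):
    """Return the indices 0 .. 2^k - 1 in bit-reversed order.
    E.g. for k = 3, index 3 = 011 (binary) maps to 6 = 110 (binary)."""
    n = 1 << k
    def rev(j):
        out = 0
        for _ in range(k):
            out = (out << 1) | (j & 1)   # shift out the low bit of j ...
            j >>= 1                      # ... into the high end of out
        return out
    return [rev(j) for j in range(n)]
```

For k = 3 this reproduces the 8-element ordering 0, 4, 2, 6, 1, 5, 3, 7, matching the labels in the circuit.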



<strong>CS</strong> <strong>373</strong> Lecture 15: Number-Theoretic <strong>Algorithms</strong> Fall 2002<br />

15 Number Theoretic <strong>Algorithms</strong> (November 12 and 14)<br />

15.1 Greatest Common Divisors<br />

And it’s one, two, three,<br />

What are we fighting for?<br />

Don’t tell me, I don’t give a damn,<br />

Next stop is Vietnam; [or: This time we’ll kill Saddam]<br />

And it’s five, six, seven,<br />

Open up the pearly gates,<br />

Well there ain’t no time to wonder why,<br />

Whoopee! We’re all going to die.<br />

— Country Joe and the Fish<br />

“I-Feel-Like-I’m-Fixin’-to-Die Rag” (1967)<br />

There are 0 kinds of mathematicians:<br />

Those who can count modulo 2 and those who can’t.<br />

— anonymous<br />

Before we get to any actual algorithms, we need some definitions and preliminary results. Unless<br />

specifically indicated otherwise, all variables in this lecture are integers.<br />

We use the symbol Z (from the German word “Zahlen”, meaning ‘numbers’ or ‘to count’) to denote<br />

the set of integers. We say that one integer d divides another integer n, or that d is a divisor of n,<br />

if the quotient n/d is also an integer. Symbolically, we can write this definition as follows:<br />

d \mid n \iff \frac{n}{d} \in \mathbb{Z}<br />

In particular, zero is not a divisor of any integer—∞ is not an integer—but every other integer is<br />

a divisor of zero. If d and n are positive, then d | n immediately implies that d ≤ n.<br />

Any integer n can be written in the form n = qd + r for some integer q and non-negative integer r with 0 ≤ r ≤ |d| − 1.<br />

Moreover, the choices for the quotient q and remainder r are unique:<br />

q = \left\lfloor \frac{n}{d} \right\rfloor \qquad \text{and} \qquad r = n \bmod d = n - d \left\lfloor \frac{n}{d} \right\rfloor.<br />

Note that the remainder n mod d is always non-negative, even if n < 0 or d < 0 or both. 1<br />
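Python, unlike C or Java, already follows the floor-division convention described here, at least for positive divisors; a quick check (the caveat for negative divisors is noted in the comments):<br />

```python
# Python's divmod and % follow the floor-division convention, so n % d is
# non-negative whenever d > 0, even for negative n (C and Java differ here).
q, r = divmod(-7, 3)
assert (q, r) == (-3, 2)      # -7 = (-3)*3 + 2, with 0 <= 2 < 3
assert -7 % 3 == 2
# Caveat: for d < 0, Python's result takes the sign of d, which differs
# from the always-non-negative convention used in these notes.
```

This is exactly why the footnote's x = (x+n)%n workaround is unnecessary in Python but required in C.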

If d divides two integers m and n, we say that d is a common divisor of m and n. It’s trivial to<br />

prove (by definition crunching) that any common divisor of m and n also divides any integer linear<br />

combination of m and n:<br />

(d | m) and (d | n) =⇒ d | (am + bn)<br />

The greatest common divisor of m and n, written gcd(m, n), 2 is the largest integer that divides<br />

both m and n. Sometimes this is also called the greater common denominator. The greatest<br />

common divisor has another useful characterization as the smallest element of another set.<br />

Lemma 1. gcd(m, n) is the smallest positive integer of the form am + bn.<br />

1<br />

The sign rules for the C/C ++/Java % operator are just plain stupid. I can’t count the number of times I’ve had<br />

to write x = (x+n)%n; instead of x %= n;. Idiots.<br />

2<br />

Do not use the notation (m, n) for greatest common divisor. Ever.<br />


Proof: Let s be the smallest positive integer of the form am + bn. Any common divisor of m and<br />

n is also a divisor of s = am + bn. In particular, gcd(m, n) is a divisor of s, which implies that<br />

gcd(m, n) ≤ s .<br />

To prove the other inequality, let’s show that s | m by calculating m mod s:<br />
m \bmod s = m - s \left\lfloor \frac{m}{s} \right\rfloor = m - (am + bn) \left\lfloor \frac{m}{s} \right\rfloor = m \left(1 - a \left\lfloor \frac{m}{s} \right\rfloor\right) + n \left(-b \left\lfloor \frac{m}{s} \right\rfloor\right).<br />

We observe that m mod s is an integer linear combination of m and n. Since m mod s < s, and<br />

s is the smallest positive integer linear combination, m mod s cannot be positive. So it must be<br />

zero, which implies that s | m, as we claimed. By a symmetric argument, s | n. Thus, s is a<br />

common divisor of m and n. A common divisor can’t be greater than the greatest common divisor,<br />

so s ≤ gcd(m, n) .<br />

These two inequalities imply that s = gcd(m, n), completing the proof. <br />

15.2 Euclid’s GCD Algorithm<br />

The first part of this lecture is about computing the greatest common divisor of two integers.<br />

Our first algorithm for computing greatest common divisors follows immediately from two simple<br />

observations:<br />

gcd(m, n) = gcd(m, n − m) and gcd(n, 0) = n<br />

The algorithm uses the first observation as a way to reduce the input and recurse; the second<br />

observation provides the base case.<br />

SlowGCD(m, n):<br />

m ← |m|; n ← |n|<br />

if m < n<br />

swap m ↔ n<br />

while n > 0<br />

m ← m − n<br />

if m < n<br />

swap m ↔ n<br />

return m<br />

The first few lines just ensure that m ≥ n ≥ 0. Each iteration of the main loop decreases one of<br />

the numbers by at least 1, so the running time is O(m + n). This bound is tight in the worst case;<br />

consider the case n = 1. Unfortunately, this is terrible. The input consists of just log m + log n<br />

bits; as a function of the input size, this algorithm runs in exponential time.<br />

Let’s think for a moment about what the main loop computes between swaps. We start with<br />

two numbers m and n and repeatedly subtract n from m until we can’t any more. This is just a<br />

(slow) recipe for computing m mod n! That means we can speed up the algorithm by using mod<br />

instead of subtraction.<br />

EuclidGCD(m, n):<br />

m ← |m|; n ← |n|<br />

if m < n<br />

swap m ↔ n<br />

while n > 0<br />

m ← m mod n (⋆)<br />

swap m ↔ n<br />

return m<br />
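The same algorithm, written as a minimal Python sketch (the function name is mine):<br />

```python
def euclid_gcd(m, n):
    """Euclid's algorithm: gcd via repeated mod instead of repeated subtraction."""
    m, n = abs(m), abs(n)
    while n > 0:
        m, n = n, m % n   # each mod replaces a whole run of subtractions
    return m
```

There is no need to sort the arguments first: if m < n, the first iteration simply swaps them, since m mod n = m.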


This algorithm swaps m and n at every iteration, because m mod n is always less than n. This is<br />

usually called Euclid’s algorithm, because the main idea is included in Euclid’s Elements. 3<br />

The easiest way to analyze this algorithm is to work backward. First, let’s consider the number<br />

of iterations of the main loop, or equivalently, the number times line (⋆) is executed. To keep<br />

things simple, let’s assume that m > n > 0, so the first three lines are redundant, and the algorithm<br />

performs at least one iteration. Recall that the Fibonacci numbers(!) are defined as F0 = 0, F1 = 1,<br />

and Fk = Fk−1 + Fk−2 for all k > 1.<br />

Lemma 2. If the algorithm performs k iterations, then m ≥ Fk+2 and n ≥ Fk+1.<br />

Proof (by induction on k): If k = 1, we have the trivial bounds n ≥ 1 = F2 and m ≥ 2 = F3.<br />

Suppose k > 1. The first iteration of the loop replaces (m, n) with (n, m mod n). Since the<br />

algorithm performs k − 1 more iterations, the inductive hypothesis implies that n ≥ Fk+1 and<br />

m mod n ≥ Fk. We’ve assumed that m > n, so m ≥ m + n(1 − ⌊m/n⌋) = n + (m mod n). We<br />

conclude that m ≥ Fk+1 + Fk = Fk+2. □<br />

Theorem 1. EuclidGCD(m, n) performs O(log m) iterations.<br />

Proof: Let k be the number of iterations. Lemma 2 implies that m ≥ F_{k+2} ≥ φ^{k+2}/√5 − 1, where<br />
φ = (1 + √5)/2 (by the annihilator method). Thus, k ≤ log_φ(√5(m + 1)) − 2 = O(log m). □<br />
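Lemma 2 is tight: consecutive Fibonacci numbers force the maximum number of iterations, since F_{k+2} mod F_{k+1} = F_k at every step. A quick experiment (helper names are mine):<br />

```python
# Consecutive Fibonacci inputs are the worst case for Euclid's algorithm:
# gcd(F_{k+2}, F_{k+1}) takes exactly k iterations of the main loop.

def gcd_iterations(m, n):
    count = 0
    while n > 0:
        m, n = n, m % n
        count += 1
    return count

fib = [0, 1]
while len(fib) < 20:
    fib.append(fib[-1] + fib[-2])
```

For example, gcd(F_12, F_11) = gcd(144, 89) takes exactly 10 iterations, matching k = 10 in Lemma 2.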

What about the actual running time? Every number used by the algorithm has O(log m)<br />

bits. Computing the remainder of one b-bit integer by another using the grade-school long division<br />

algorithm requires O(b^2) time. So crudely, the running time is O(b^2 log m) = O(log^3 m). More<br />

careful analysis reduces the time bound to O(log^2 m). We can make the algorithm even faster by<br />

using a fast integer division algorithm (based on FFTs, for example).<br />

15.3 Modular Arithmetic and Algebraic Groups<br />

Modular arithmetic is familiar to anyone who’s ever wondered how many minutes are left in an<br />

exam that ends at 9:15 when the clock says 8:54.<br />

When we do arithmetic ‘modulo n’, what we’re really doing is a funny kind of arithmetic on<br />

the elements of following set:<br />

Zn = {0, 1, 2, . . . , n − 1}<br />

Modular addition and subtraction satisfies all the axioms that we expect implicitly:<br />

• Zn is closed under addition mod n: For any a, b ∈ Zn, their sum a + b mod n is also in Zn<br />

3 However, Euclid’s exposition was a little, erm, informal by current standards, primarily because the Greeks didn’t<br />

know about induction. He basically said “Try one iteration. If that doesn’t work, try three iterations.” In modern<br />

language, Euclid’s algorithm would be written as follows, assuming m ≥ n > 0.<br />

ActualEuclidGCD(m, n):<br />

if n | m<br />

return n<br />

else<br />

return n mod (m mod n)<br />

This algorithm is obviously incorrect; consider the input m = 3, n = 2. Nevertheless, mathematics and algorithms<br />

students have applied ‘Euclidean induction’ to a vast number of problems, only to scratch their heads in dismay when<br />

they don’t get any credit.<br />


• Addition is associative: (a + b mod n) + c mod n = a + (b + c mod n) mod n.<br />

• Zero is an additive identity element: 0 + a mod n = a + 0 mod n = a mod n.<br />

• Every element a ∈ Zn has an inverse b ∈ Zn such that a+b mod n = 0. Specifically, if a = 0,<br />

then b = 0; otherwise, b = n − a.<br />

Any set with a binary operator that satisfies the closure, associativity, identity, and inverse<br />

axioms is called a group. Since Zn is a group under an ‘addition’ operator, we call it an additive<br />

group. Moreover, since addition is commutative (a + b mod n = b + a mod n), we can call (Zn, + mod<br />
n) an abelian additive group.<br />

What about multiplication? Zn is closed under multiplication mod n, multiplication mod n<br />

is associative (and commutative), and 1 is a multiplicative identity, but some elements do not<br />

have multiplicative inverses. Formally, we say that Zn is a ring under addition and multiplication<br />

modulo n.<br />

If n is composite, then the following theorem shows that we can factor the ring Zn into two<br />

smaller rings. The Chinese Remainder Theorem is named for the mathematician Sun Zi (probably not<br />
the same person as Sun Tzu, the author of the Art of War), who proved a special case. (See the quotation for Lecture 1!)<br />

The Chinese Remainder Theorem. If p ⊥ q, then Z_{pq} ≅ Z_p × Z_q.<br />

Okay, okay, before we prove this, let’s define all the notation. The product Zp × Zq is the set of<br />

ordered pairs {(a, b) | a ∈ Zp, b ∈ Zq}, where addition, subtraction, and multiplication are defined<br />

as follows:<br />

(a, b) + (c, d) = (a + c mod p, b + d mod q)<br />

(a, b) − (c, d) = (a − c mod p, b − d mod q)<br />

(a, b) · (c, d) = (ac mod p, bd mod q)<br />

It’s not hard to check that Zp × Zq is a ring under these operations, where (0, 0) is the additive<br />

identity and (1, 1) is the multiplicative identity. The funky equal sign ≅ means that these two rings<br />

are isomorphic: there is a bijection between the two sets that is consistent with the arithmetic<br />

operations.<br />

As an example, the following table describes the bijection between Z15 and Z3 × Z5; the entry in row a ∈ Z3 and column b ∈ Z5 is the element n ∈ Z15 with n mod 3 = a and n mod 5 = b:<br />
        0    1    2    3    4<br />
  0     0    6   12    3    9<br />
  1    10    1    7   13    4<br />
  2     5   11    2    8   14<br />

For instance, we have 8 = (2, 3) and 13 = (1, 3), and<br />

(2, 3) + (1, 3) = (2 + 1 mod 3, 3 + 3 mod 5) = (0, 1) = 6 = 21 mod 15 = (8 + 13) mod 15.<br />

(2, 3) · (1, 3) = (2 · 1 mod 3, 3 · 3 mod 5) = (2, 4) = 14 = 104 mod 15 = (8 · 13) mod 15.<br />

Proof: The functions n ↦ (n mod p, n mod q) and (a, b) ↦ (aq(q^−1 mod p) + bp(p^−1 mod q)) mod pq are inverses of each other, and each map preserves the ring structure. □
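The isomorphism is easy to check by machine. Here is a small Python sanity check for p = 3 and q = 5; the helper names split and merge are ours, not the notes'.

```python
# Sanity check of the Chinese Remainder Theorem for p = 3, q = 5.
# (Requires Python 3.8+ for pow(x, -1, m), the modular inverse.)
p, q = 3, 5

def split(n):
    # the map Z_{pq} -> Z_p x Z_q from the proof
    return (n % p, n % q)

def merge(a, b):
    # the inverse map (a, b) -> aq(q^-1 mod p) + bp(p^-1 mod q)  (mod pq)
    return (a * q * pow(q, -1, p) + b * p * pow(p, -1, q)) % (p * q)

# The two maps are inverses of each other...
assert all(merge(*split(n)) == n for n in range(p * q))

# ...and split preserves addition and multiplication, coordinate-wise.
for x in range(p * q):
    for y in range(p * q):
        (ax, bx), (ay, by) = split(x), split(y)
        assert split((x + y) % (p * q)) == ((ax + ay) % p, (bx + by) % q)
        assert split((x * y) % (p * q)) == ((ax * ay) % p, (bx * by) % q)

print(merge(2, 3))  # prints 8, matching the table entry for (2, 3)
```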

We can extend the Chinese remainder theorem inductively as follows:<br />

4


<strong>CS</strong> <strong>373</strong> Lecture 15: Number-Theoretic <strong>Algorithms</strong> Fall 2002<br />

The Real Chinese Remainder Theorem. Suppose n = p1 p2 · · · pr, where pi ⊥ pj for all i ≠ j. Then Zn ≅ Zp1 × Zp2 × · · · × Zpr.

If we want to perform modular arithmetic where the modulus n is very large, we can improve the<br />

performance of our algorithms by breaking n into several relatively prime factors, and performing<br />

modular arithmetic separately modulo each factor.<br />

So we can do modular addition, subtraction, and multiplication; what about division? As I<br />

said earlier, not every element of Zn has a multiplicative inverse. The most obvious example is 0,<br />

but there can be others. For example, 3 has no multiplicative inverse in Z15; there is no integer x<br />

such that 3x mod 15 = 1. On the other hand, 0 is the only element of Z7 without a multiplicative<br />

inverse:<br />

1 · 1 ≡ 2 · 4 ≡ 3 · 5 ≡ 6 · 6 ≡ 1 (mod 7)<br />

These examples suggest (I hope) that x has a multiplicative inverse in Zn if and only if x and n are relatively prime. This is easy to prove as follows. If xy mod n = 1, then xy + kn = 1 for

some integer k. Thus, 1 is an integer linear combination of x and n, so Lemma 1 implies that<br />

gcd(x, n) = 1. On the other hand, if x ⊥ n, then ax + bn = 1 for some integers a and b, which<br />

implies that ax mod n = 1.<br />

Let’s define the set Z ∗ n to be the set of elements of Zn that have multiplicative inverses.

Z ∗ n = {a ∈ Zn | a ⊥ n}<br />

It is a tedious exercise to show that Z ∗ n is an abelian group under multiplication modulo n. As<br />

long as we stick to elements of this group, we can reasonably talk about ‘division mod n’.<br />

We denote the number of elements in Z ∗ n by φ(n); this is called Euler’s totient function. This<br />

function is remarkably badly-behaved, but there is a relatively simple formula for φ(n) (not surprisingly)<br />

involving prime numbers and division:<br />

φ(n) = n · ∏_{p|n} (p − 1)/p

I won’t prove this formula, but the following intuition is helpful. Start with Zn and throw out all n/2 multiples of 2, all n/3 multiples of 3, all n/5 multiples of 5, and so on. Whenever we

throw out multiples of p, we multiply the size of the set by (p − 1)/p. At the end of this process,<br />

we’re left with precisely the elements of Z ∗ n. This is not a proof! On the one hand, this argument<br />

throws out some numbers (like 6) more than once, so our estimate seems too low. On the other<br />

hand, there are actually ⌈n/p⌉ multiples of p in Zn, so our estimate seems too high. Surprisingly,<br />

these two errors exactly cancel each other out.<br />
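The formula is easy to test empirically. Here is a hedged Python sketch (the helper name phi is ours) that computes the product formula and checks it against the definition φ(n) = |Z∗n|:

```python
from math import gcd

def phi(n):
    # Euler's totient via the product formula: phi(n) = n * prod_{p|n} (p-1)/p
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            result = result // p * (p - 1)   # multiply by (p-1)/p, exactly
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:                                # one prime factor > sqrt(n) may remain
        result = result // m * (m - 1)
    return result

# Check against the definition: phi(n) = |Z*_n| = #{a in Z_n : gcd(a, n) = 1}.
for n in range(2, 300):
    assert phi(n) == sum(1 for a in range(n) if gcd(a, n) == 1)
```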

15.4 Toward Primality Testing<br />

In this last section, we discuss algorithms for detecting whether a number is prime. Large prime<br />

numbers are used primarily (but not exclusively) in cryptography algorithms.<br />

A positive integer is prime if it has exactly two positive divisors, and composite if it has more<br />

than two positive divisors. The integer 1 is neither prime nor composite. Equivalently, an integer<br />

n ≥ 2 is prime if n is relatively prime with every positive integer smaller than n. We can rephrase<br />

this definition yet again: n is prime if and only if φ(n) = n − 1.<br />

The obvious algorithm for testing whether a number is prime is trial division: simply try every<br />

possible nontrivial divisor between 2 and √ n.<br />

5


<strong>CS</strong> <strong>373</strong> Lecture 15: Number-Theoretic <strong>Algorithms</strong> Fall 2002<br />

TrialDivPrime(n) :<br />

for d ← 2 to ⌊ √ n⌋

if n mod d = 0<br />

return Composite<br />

return Prime<br />

Unfortunately, this algorithm is horribly slow. Even if we could do the remainder computation in<br />

constant time, the overall running time of this algorithm would be Ω( √ n), which is exponential in<br />

the number of input bits.<br />

This might seem completely hopeless, but fortunately most composite numbers are quite easy to<br />

detect as composite. Consider, for example, the related problem of deciding whether a given integer n is an exact power, that is, whether n = m^e for any integers m > 1 and e > 1. We can solve this problem in polynomial

time with the following straightforward algorithm. The subroutine Root(n, i) computes ⌊n^(1/i)⌋

essentially by binary search. (I’ll leave the analysis as a simple exercise.)<br />

ExactPower?(n):
  for i ← 2 to lg n
    if (Root(n, i))^i = n
      return True
  return False

Root(n, i):
  r ← 0
  for ℓ ← ⌈(lg n)/i⌉ down to 0
    if (r + 2^ℓ)^i ≤ n
      r ← r + 2^ℓ
  return r
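Here is a Python transcription of these two routines (our names integer_root and exact_power); note that the bit loop must run all the way down to 2^0, or odd roots like ⌊√9⌋ = 3 would be missed:

```python
def integer_root(n, i):
    # compute floor(n^(1/i)) by deciding one bit of the root at a time
    r = 0
    top = (n.bit_length() + i - 1) // i      # roughly ceil(lg n / i)
    for shift in range(top, -1, -1):
        if (r + (1 << shift)) ** i <= n:
            r += 1 << shift
    return r

def exact_power(n):
    # True iff n = m^e for some integers m > 1 and e > 1
    return any(integer_root(n, i) ** i == n
               for i in range(2, n.bit_length() + 1))
```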

To distinguish between arbitrary prime and composite numbers, we need to exploit some results<br />

about Z ∗ n from group theory and number theory. First, we define the order of an element x ∈ Z ∗ n<br />

as the smallest positive integer k such that x^k ≡ 1 (mod n). For example, in the group

Z ∗ 15 = {1, 2, 4, 7, 8, 11, 13, 14},<br />

the number 2 has order 4, and the number 11 has order 2. For any x ∈ Z ∗ n, we can partition the<br />

elements of Z ∗ n into equivalence classes, by declaring a ∼x b if a ≡ b · x^k (mod n) for some integer k. The size

of every equivalence class is exactly the order of x. Since the equivalence classes must be disjoint,<br />

we can conclude that the order of any element divides the size of the group. We can express this

observation more succinctly as follows:<br />

Euler’s Theorem. a^φ(n) ≡ 1 (mod n) for every a ∈ Z ∗ n. 4

The most interesting special case of this theorem is when n is prime.<br />

Fermat’s Little Theorem. If p is prime, then a^p ≡ a (mod p). 5

This theorem leads to the following efficient pseudo-primality test.<br />

4 This is not Euler’s only theorem; he had thousands. It’s not even his most famous theorem. His second most<br />

famous theorem is the formula v − e + f = 2 relating the vertices, edges, and faces of any planar map. His most famous theorem is the magic formula e^(πi) + 1 = 0. Naming something after a mathematician or physicist (as in ‘Euler

tour’ or ‘Gaussian elimination’ or ‘Avogadro’s number’) is considered a high compliment. Using a lower case letter<br />

(‘abelian group’) is even better; abbreviating (‘volt’, ‘amp’) is better still. The number e was named after Euler.<br />

5 This is not Fermat’s only theorem; he had hundreds, most of them stated without proof. Fermat’s Last Theorem<br />

wasn’t the last one he published, but the last one proved. Amazingly, despite his dislike of writing proofs, Fermat<br />

was almost always right. In that respect, he was very different from you and me.<br />


FermatPseudoPrime(n):
  choose an integer a between 1 and n − 1
  if a^n mod n ≠ a
    return Composite!
  else
    return Prime?

In practice, this algorithm is both fast and effective. The (empirical) probability that a random<br />

100-digit composite number will return Prime? is roughly 10^−30, even if we always choose a = 2.

Unfortunately, there are composite numbers that always pass this test, no matter which value<br />

of a we use. A Carmichael number is a composite integer n such that a^n ≡ a (mod n) for every

integer a. Thus, Fermat’s Little Theorem can be used to distinguish between two types of numbers:<br />

(primes and Carmichael numbers) and everything else. Carmichael numbers are extremely rare; in<br />

fact, it was proved only a decade ago that there are an infinite number of them.<br />
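For instance, 561 = 3 · 11 · 17 is the smallest Carmichael number; a quick Python check confirms that every base passes the Fermat test, even though 561 is obviously composite:

```python
n = 561                        # the smallest Carmichael number
assert n == 3 * 11 * 17        # certainly composite
# a^n ≡ a (mod n) for *every* integer a, so FermatPseudoPrime
# can never answer Composite! on this input, whatever base it picks
assert all(pow(a, n, n) == a for a in range(n))
```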

To deal with Carmichael numbers effectively, we need to look more closely at the structure of<br />

the group Z ∗ n. We say that Z ∗ n is cyclic if it contains an element of order φ(n); such an element is<br />

called a generator. Successive powers of any generator cycle through every element of the group in<br />

some order. For example, the group Z ∗ 9 = {1, 2, 4, 5, 7, 8} is cyclic, with two generators: 2 and 5,<br />

but Z∗ 15 is not cyclic. The following theorem completely characterizes which groups Z∗n are cyclic.<br />

The Cycle Theorem. Z ∗ n is cyclic if and only if n = 2, 4, p^e, or 2p^e for some odd prime p and positive integer e.

This theorem has two relatively simple corollaries.<br />

The Discrete Log Theorem. Suppose Z ∗ n is cyclic and g is a generator. Then g^x ≡ g^y (mod n) if and only if x ≡ y (mod φ(n)).

Proof: Suppose g^x ≡ g^y (mod n). By definition of ‘generator’, the sequence 〈1, g, g^2, . . .〉 has period φ(n). Thus, x ≡ y (mod φ(n)). On the other hand, if x ≡ y (mod φ(n)), then x = y + kφ(n) for some integer k, so g^x = g^(y+kφ(n)) = g^y · (g^φ(n))^k. Euler’s Theorem now implies that (g^φ(n))^k ≡ 1^k ≡ 1 (mod n), so g^x ≡ g^y (mod n). □

The √1 Theorem. Suppose n = p^e for some odd prime p and positive integer e. The only elements x ∈ Z ∗ n that satisfy the equation x^2 ≡ 1 (mod n) are x = 1 and x = n − 1.

Proof: Obviously 1^2 ≡ 1 (mod n) and (n − 1)^2 = n^2 − 2n + 1 ≡ 1 (mod n).

Suppose x^2 ≡ 1 (mod n) where n = p^e. By the Cycle Theorem, Z ∗ n is cyclic. Let g be a generator of Z ∗ n, and suppose x = g^k. Then we immediately have x^2 = g^2k ≡ g^0 = 1 (mod p^e). The Discrete Log Theorem implies that 2k ≡ 0 (mod φ(p^e)). Since p is an odd prime, we have φ(p^e) = (p − 1)p^(e−1), which is even. Thus, the equation 2k ≡ 0 (mod φ(p^e)) has just two solutions: k = 0 and k = φ(p^e)/2. By the Cycle Theorem, either x = 1 or x = g^(φ(n)/2). Because x = n − 1 is also a solution to the original equation, we must have g^(φ(n)/2) ≡ n − 1 (mod n). □

This theorem leads to a different pseudo-primality algorithm:<br />

Sqrt1PseudoPrime(n):
  choose a number a between 2 and n − 2
  if a^2 mod n = 1
    return Composite!
  else
    return Prime?


As with the previous pseudo-primality test, there are composite numbers that this algorithm<br />

cannot identify as composite: powers of primes, for instance. Fortunately, however, the set of<br />

composites that always pass the √ 1 test is disjoint from the set of numbers that always pass the<br />

Fermat test. In particular, Carmichael numbers never have the form p^e.

15.5 The Miller-Rabin Primality Test<br />

The following randomized algorithm, adapted by Michael Rabin from an earlier deterministic algorithm<br />

of Gary Miller, combines the Fermat test and the √1 test. The algorithm repeats the same

two tests s times, where s is some user-chosen parameter, each time with a random value of a.<br />

MillerRabin(n):
  write n − 1 = 2^t · u where u is odd
  for i ← 1 to s
    a ← Random(2, n − 2)
    if EuclidGCD(a, n) ≠ 1
      return Composite! 〈〈obviously!〉〉
    x0 ← a^u mod n
    for j ← 1 to t
      xj ← (xj−1)^2 mod n
      if xj = 1 and xj−1 ≠ 1 and xj−1 ≠ n − 1
        return Composite! 〈〈by the √1 Theorem〉〉
    if xt ≠ 1 〈〈xt = a^(n−1) mod n〉〉
      return Composite! 〈〈by Fermat’s Little Theorem〉〉
  return Prime?
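A direct Python transcription of the pseudocode, using math.gcd for EuclidGCD and Python's built-in modular exponentiation (the small-and-even special cases at the top are ours):

```python
import random
from math import gcd

def miller_rabin(n, s=20):
    # returns True for 'Prime?' and False for 'Composite!'
    if n < 5 or n % 2 == 0:            # tiny and even inputs, handled directly
        return n in (2, 3)
    t, u = 0, n - 1                    # write n - 1 = 2^t * u with u odd
    while u % 2 == 0:
        t, u = t + 1, u // 2
    for _ in range(s):
        a = random.randrange(2, n - 1)
        if gcd(a, n) != 1:
            return False               # Composite, obviously
        x = pow(a, u, n)
        for _ in range(t):
            y = pow(x, 2, n)
            if y == 1 and x != 1 and x != n - 1:
                return False           # Composite, by the sqrt(1) Theorem
            x = y
        if x != 1:                     # x = a^(n-1) mod n at this point
            return False               # Composite, by Fermat's Little Theorem
    return True                        # Prime?
```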

First let’s consider the running time; for simplicity, we assume that all integer arithmetic is<br />

done using the quadratic-time grade school algorithms. We can compute u and t in O(log n) time<br />

by scanning the bits in the binary representation of n. Euclid’s algorithm takes O(log^2 n) time. Computing a^u mod n requires O(log u) = O(log n) multiplications, each of which takes O(log^2 n) time. Squaring xj takes O(log^2 n) time. Overall, the running time for one iteration of the outer loop is O(log^3 n + t log^2 n) = O(log^3 n), since t ≤ lg n. Thus, the total running time of this algorithm is O(s log^3 n). If we set s = O(log n), this running time is polynomial in the size of the input.

Fine, so it’s fast, but is it correct? Like the earlier pseudoprime testing algorithms, a prime<br />

input will always cause MillerRabin to return Prime?. Composite numbers, however, may not<br />

always return Composite!; because we choose the number a at random, there is a small probability<br />

of error. 6 Fortunately, the error probability can be made ridiculously small—in practice, less than<br />

the probability that random quantum fluctuations will instantly transform your computer into a<br />

kitten—by setting s ≈ 1000.<br />

Theorem 2. If n is composite, MillerRabin(n) returns Composite! with probability at least 1 − 2^−s.

6 If instead, we try all possible values of a, we obtain an exact primality testing algorithm, but it runs in exponential<br />

time. Miller’s original deterministic algorithm examined every value of a in a carefully-chosen subset of Z ∗ n. If the<br />

Extended Riemann Hypothesis holds, this subset has logarithmic size, and Miller’s algorithm runs in polynomial<br />

time. The Riemann Hypothesis is a century-old open problem about the distribution of prime numbers. A solution<br />

would be at least as significant as proving Fermat’s Last Theorem or P=NP.<br />


Proof: First, suppose n is not a Carmichael number. Let F be the set of elements of Z ∗ n that pass<br />

the Fermat test:<br />

F = {a ∈ Z ∗ n | a^(n−1) ≡ 1 (mod n)}.

Since n is not a Carmichael number, F is a proper subset of Z ∗ n. Given any two elements a, b ∈ F ,<br />

their product a · b mod n in Z ∗ n is also an element of F :<br />

(a · b)^(n−1) ≡ a^(n−1) b^(n−1) ≡ 1 · 1 ≡ 1 (mod n)

We also easily observe that 1 is an element of F , and the multiplicative inverse (mod n) of any<br />

element of F is also in F . Thus, F is a proper subgroup of Z ∗ n , that is, a proper subset that is also<br />

a group under the same binary operation. A standard result in group theory states that if F is a<br />

subgroup of a finite group G, the number of elements of F divides the number of elements of G.<br />

(We used a special case of this result in our proof of Euler’s Theorem.) In our setting, this means<br />

that |F | divides φ(n). Since we already know that |F | < φ(n), we must have |F | ≤ φ(n)/2. Thus,<br />

at most half the elements of Z ∗ n pass the Fermat test.<br />

The case of Carmichael numbers is more complicated, but the main idea is the same: at most<br />

half the possible values of a pass the √1 test. See CLRS for further details. □



<strong>CS</strong> <strong>373</strong> Non-Lecture C: String Matching Fall 2002<br />

C String Matching<br />

C.1 Brute Force<br />

The basic object that we’re going to talk about for the next two lectures is a string, which is really<br />

just an array. The elements of the array come from a set Σ called the alphabet; the elements<br />

themselves are called characters. Common examples are ASCII text, where each character is a seven-bit integer 1 , strands of DNA, where the alphabet is the set of nucleotides {A, C, G, T}, or

proteins, where the alphabet is the set of 22 amino acids.<br />

The problem we want to solve is the following. Given two strings, a text T [1 .. n] and a pattern<br />

P [1 .. m], find the first substring of the text that is the same as the pattern. (It would be easy<br />

to extend our algorithms to find all matching substrings, but we will resist.) A substring is just<br />

a contiguous subarray. For any shift s, let Ts denote the substring T [s .. s + m − 1]. So more<br />

formally, we want to find the smallest shift s such that Ts = P , or report that there is no match.<br />

For example, if the text is the string ‘AMANAPLANACATACANALPANAMA’ 2 and the pattern is ‘CAN’, then<br />

the output should be 15. If the pattern is ‘SPAM’, then the answer should be ‘none’. In most cases<br />

the pattern is much smaller than the text; to make this concrete, I’ll assume that m < n/2.<br />

Here’s the ‘obvious’ brute force algorithm, but with one immediate improvement. The inner<br />

while loop compares the substring Ts with P . If the two strings are not equal, this loop stops at<br />

the first character mismatch.<br />

AlmostBruteForce(T [1 .. n], P [1 .. m]):
  for s ← 1 to n − m + 1
    equal ← true
    i ← 1
    while equal and i ≤ m
      if T [s + i − 1] ≠ P [i]
        equal ← false
      else
        i ← i + 1
    if equal
      return s
  return ‘none’
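The same algorithm in Python, returning 1-based shifts to match the pseudocode:

```python
def almost_brute_force(T, P):
    # compare the pattern against every shift, stopping at the first mismatch
    n, m = len(T), len(P)
    for s in range(n - m + 1):
        i = 0
        while i < m and T[s + i] == P[i]:
            i += 1
        if i == m:
            return s + 1          # 1-based shift, as in the pseudocode
    return None                   # 'none'
```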

In the worst case, the running time of this algorithm is O((n − m)m) = O(nm), and we can<br />

1 Yes, seven. Most computer systems use some sort of 8-bit character set, but there’s no universally accepted<br />

standard. Java supposedly uses the Unicode character set, which has variable-length characters and therefore doesn’t<br />

really fit into our framework. Just think, someday you’ll be able to write ‘ = ℵ[∞++]/;’ in your Java code! Joy!<br />

2 Dan Hoey (or rather, his computer program) found the following 540-word palindrome in 1984. We have better<br />

online dictionaries now, so I’m sure you could do better.<br />

A man, a plan, a caret, a ban, a myriad, a sum, a lac, a liar, a hoop, a pint, a catalpa, a gas, an oil, a bird, a yell, a vat, a caw,<br />

a pax, a wag, a tax, a nay, a ram, a cap, a yam, a gay, a tsar, a wall, a car, a luger, a ward, a bin, a woman, a vassal, a wolf, a<br />

tuna, a nit, a pall, a fret, a watt, a bay, a daub, a tan, a cab, a datum, a gall, a hat, a fag, a zap, a say, a jaw, a lay, a wet, a<br />

gallop, a tug, a trot, a trap, a tram, a torr, a caper, a top, a tonk, a toll, a ball, a fair, a sax, a minim, a tenor, a bass, a passer,<br />

a capital, a rut, an amen, a ted, a cabal, a tang, a sun, an ass, a maw, a sag, a jam, a dam, a sub, a salt, an axon, a sail, an ad,<br />

a wadi, a radian, a room, a rood, a rip, a tad, a pariah, a revel, a reel, a reed, a pool, a plug, a pin, a peek, a parabola, a dog, a<br />

pat, a cud, a nu, a fan, a pal, a rum, a nod, an eta, a lag, an eel, a batik, a mug, a mot, a nap, a maxim, a mood, a leek, a grub,<br />

a gob, a gel, a drab, a citadel, a total, a cedar, a tap, a gag, a rat, a manor, a bar, a gal, a cola, a pap, a yaw, a tab, a raj, a gab,<br />

a nag, a pagan, a bag, a jar, a bat, a way, a papa, a local, a gar, a baron, a mat, a rag, a gap, a tar, a decal, a tot, a led, a tic, a<br />

bard, a leg, a bog, a burg, a keel, a doom, a mix, a map, an atom, a gum, a kit, a baleen, a gala, a ten, a don, a mural, a pan, a<br />

faun, a ducat, a pagoda, a lob, a rap, a keep, a nip, a gulp, a loop, a deer, a leer, a lever, a hair, a pad, a tapir, a door, a moor,<br />

an aid, a raid, a wad, an alias, an ox, an atlas, a bus, a madam, a jag, a saw, a mass, an anus, a gnat, a lab, a cadet, an em, a<br />

natural, a tip, a caress, a pass, a baronet, a minimax, a sari, a fall, a ballot, a knot, a pot, a rep, a carrot, a mart, a part, a tort,<br />

a gut, a poll, a gateway, a law, a jay, a sap, a zag, a fat, a hall, a gamut, a dab, a can, a tabu, a day, a batt, a waterfall, a patina,<br />

a nut, a flow, a lass, a van, a mow, a nib, a draw, a regular, a call, a war, a stay, a gam, a yap, a cam, a ray, an ax, a tag, a wax,<br />

a paw, a cat, a valley, a drib, a lion, a saga, a plat, a catnip, a pooh, a rail, a calamus, a dairyman, a bater, a canal—Panama!<br />


actually achieve this running time by searching for the pattern AAA...AAAB with m − 1 A’s, in a<br />

text consisting of n A’s.<br />

In practice, though, breaking out of the inner loop at the first mismatch makes this algorithm<br />

quite practical. We can wave our hands at this by assuming that the text and pattern are both<br />

random. Then on average, we perform a constant number of comparisons at each position i, so<br />

the total expected number of comparisons is O(n). Of course, neither English nor DNA is really<br />

random, so this is only a heuristic argument.<br />

C.2 Strings as Numbers<br />

For the rest of the lecture, let’s assume that the alphabet consists of the numbers 0 through 9, so<br />

we can interpret any array of characters as either a string or a decimal number. In particular, let<br />

p be the numerical value of the pattern P , and for any shift s, let ts be the numerical value of Ts:<br />

p = Σ_{i=1}^{m} 10^{m−i} · P[i]          ts = Σ_{i=1}^{m} 10^{m−i} · T[s + i − 1]

For example, if T = 31415926535897932384626433832795028841971 and m = 4, then t17 = 2384.<br />

Clearly we can rephrase our problem as follows: Find the smallest s, if any, such that p = ts.<br />

We can compute p in O(m) arithmetic operations, without having to explicitly compute powers of<br />

ten, using Horner’s rule:<br />

p = P[m] + 10 · (P[m − 1] + 10 · (P[m − 2] + · · · + 10 · (P[2] + 10 · P[1]) · · · ))

We could also compute any ts in O(m) operations using Horner’s rule, but this leads to essentially<br />

the same brute-force algorithm as before. But once we know ts, we can actually compute ts+1 in<br />

constant time just by doing a little arithmetic — subtract off the most significant digit T[s] · 10^{m−1}, shift everything up by one digit, and add the new least significant digit T[s + m]:

ts+1 = 10 · (ts − 10^{m−1} · T[s]) + T[s + m]

To make this fast, we need to precompute the constant 10 m−1 . (And we know how to do that<br />

quickly. Right?) So it seems that we can solve the string matching problem in O(n) worst-case<br />

time using the following algorithm:<br />

NumberSearch(T [1 .. n], P [1 .. m]):
  σ ← 10^{m−1}
  p ← 0
  t1 ← 0
  for i ← 1 to m
    p ← 10 · p + P [i]
    t1 ← 10 · t1 + T [i]
  for s ← 1 to n − m + 1
    if p = ts
      return s
    ts+1 ← 10 · (ts − σ · T [s]) + T [s + m]
  return ‘none’
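Python's arbitrary-precision integers make the transcription short, and they also make the hidden cost explicit: p and t below are m-digit numbers, so each arithmetic operation on them is not constant time.

```python
def number_search(T, P):
    # T and P are strings of decimal digits; returns a 1-based shift or None
    n, m = len(T), len(P)
    sigma = 10 ** (m - 1)
    p = t = 0
    for i in range(m):                # Horner's rule
        p = 10 * p + int(P[i])
        t = 10 * t + int(T[i])
    for s in range(n - m + 1):
        if p == t:
            return s + 1
        if s + m < n:                 # rolling update: t_s -> t_{s+1}
            t = 10 * (t - sigma * int(T[s])) + int(T[s + m])
    return None
```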

Unfortunately, the most we can say is that the number of arithmetic operations is O(n). These<br />

operations act on numbers with up to m digits. Since we want to handle arbitrarily long patterns,<br />

we can’t assume that each operation takes only constant time!<br />


C.3 Karp-Rabin Fingerprinting<br />

To make this algorithm efficient, we will make one simple change, discovered by Richard Karp and<br />

Michael Rabin in 1981:<br />

Perform all arithmetic modulo some prime number q.<br />

We choose q so that the value 10q fits into a standard integer variable, so that we don’t need any<br />

fancy long-integer data types. The values (p mod q) and (ts mod q) are called the fingerprints of P<br />

and Ts, respectively. We can now compute (p mod q) and (t1 mod q) in O(m) time using Horner’s<br />

rule ‘mod q’:

p mod q = (P[m] + 10 · (P[m−1] + · · · + 10 · (P[2] + 10 · P[1] mod q) mod q · · · ) mod q) mod q

and similarly, given (ts mod q), we can compute (ts+1 mod q) in constant time.<br />

ts+1 mod q = (10 · ((ts − (10^{m−1} mod q) · T[s]) mod q) + T[s + m]) mod q

Again, we have to precompute the value (10^{m−1} mod q) to make this fast.

If (p mod q) ≠ (ts mod q), then certainly P ≠ Ts. However, if (p mod q) = (ts mod q), we can’t tell whether P = Ts or not. All we know for sure is that p and ts differ by some integer multiple of q. If P ≠ Ts in this case, we say there is a false match at shift s. To test for a false match, we simply

do a brute-force string comparison. (In the algorithm below, ˜p = p mod q and ˜ts = ts mod q.)<br />

KarpRabin(T [1 .. n], P [1 .. m]):
  choose a small prime q
  σ ← 10^{m−1} mod q
  ˜p ← 0
  ˜t1 ← 0
  for i ← 1 to m
    ˜p ← (10 · ˜p + P [i]) mod q
    ˜t1 ← (10 · ˜t1 + T [i]) mod q
  for s ← 1 to n − m + 1
    if ˜p = ˜ts
      if P = Ts 〈〈brute-force O(m)-time comparison〉〉
        return s
    ˜ts+1 ← (10 · ((˜ts − σ · T [s]) mod q) + T [s + m]) mod q
  return ‘none’
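A Python sketch with a fixed small prime q = 101 for concreteness (the next section chooses q at random, which is what makes the analysis work):

```python
def karp_rabin(T, P, q=101):
    # T and P are strings of decimal digits; returns a 1-based shift or None
    n, m = len(T), len(P)
    sigma = pow(10, m - 1, q)             # 10^(m-1) mod q, precomputed
    ph = th = 0
    for i in range(m):                    # Horner's rule, mod q
        ph = (10 * ph + int(P[i])) % q
        th = (10 * th + int(T[i])) % q
    for s in range(n - m + 1):
        if ph == th and P == T[s:s + m]:  # brute-force check rules out a false match
            return s + 1
        if s + m < n:                     # rolling fingerprint update, mod q
            th = (10 * (th - sigma * int(T[s])) + int(T[s + m])) % q
    return None
```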

The running time of this algorithm is O(n + F m), where F is the number of false matches.<br />

Intuitively, we expect the fingerprints ts to jump around between 0 and q − 1 more or less at<br />

random, so the ‘probability’ of a false match ‘ought’ to be 1/q. This intuition implies that F = n/q<br />

‘on average’, which gives us an ‘expected’ running time of O(n + nm/q). If we always choose<br />

q ≥ m, this simplifies to O(n). But of course all this intuitive talk of probabilities is just frantic<br />

meaningless handwaving, since we haven’t actually done anything random yet.<br />

C.4 Random Prime Number Facts<br />

The real power of the Karp-Rabin algorithm is that by choosing the modulus q randomly, we can<br />

actually formalize this intuition! The first line of KarpRabin should really read as follows:<br />


Let q be a random prime number less than nm^2 log(nm^2).

For any positive integer u, let π(u) denote the number of prime numbers less than u. There are<br />

π(nm^2 log(nm^2)) possible values for q, each with the same probability of being chosen.

Our analysis needs two results from number theory. I won’t even try to prove the first one, but<br />

the second one is quite easy.<br />

Lemma 1 (The Prime Number Theorem). π(u) = Θ(u/ log u).<br />

Lemma 2. Any integer x has at most ⌊lg x⌋ distinct prime divisors.<br />

Proof: If x has k distinct prime divisors, then x ≥ 2^k, since every prime number is bigger than 1. □

Let’s assume that there are no true matches, so p ≠ ts for all s. (That’s the worst case for the

algorithm anyway.) Let’s define a strange variable X as follows:<br />

X = ∏_{s=1}^{n−m+1} |p − ts| .

Notice that by our assumption, X can’t be zero.<br />

Now suppose we have a false match at shift s. Then p mod q = ts mod q, so p − ts is an integer

multiple of q, and this implies that X is also an integer multiple of q. In other words, if there is a<br />

false match, then q must be one of the prime divisors of X.

Since p < 10^m and ts < 10^m, we must have X < 10^{nm}. Thus, by the second lemma, X has

O(mn) prime divisors. Since we chose q randomly from a set of π(nm^2 log(nm^2)) = Ω(nm^2) prime

numbers, the probability that q divides X is at most<br />

O(nm) / Ω(nm^2) = O(1/m).

We have just proven the following amazing fact.<br />

The probability of getting a false match is O(1/m).<br />

Recall that the running time of KarpRabin is O(n + mF ), where F is the number of false<br />

matches. By using the really loose upper bound E[F ] ≤ Pr[F > 0] · n, we can conclude that the<br />

expected number of false matches is O(n/m). Thus, the expected running time of the KarpRabin<br />

algorithm is O(n).<br />

C.5 Random Prime Number?<br />

Actually choosing a random prime number is not particularly easy. The best method known is<br />

to repeatedly generate a random integer and test to see if it’s prime. In practice, it’s enough to<br />

choose a random probable prime. You can read about probable primes in the textbook Randomized<br />

<strong>Algorithms</strong> by Rajeev Motwani and Prabhakar Raghavan (Cambridge, 1995).<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture D: More String Matching Fall 2002<br />

D More String Matching<br />

D.1 Redundant Comparisons<br />

Let’s go back to the character-by-character method for string matching. Suppose we are looking for<br />

the pattern ‘ABRACADABRA’ in some longer text using the (almost) brute force algorithm described<br />

in the previous lecture. Suppose also that when s = 11, the substring comparison fails at the fifth<br />

position; the corresponding character in the text (just after the vertical line below) is not a C. At<br />

this point, our algorithm would increment s and start the substring comparison from scratch.<br />

HOCUSPOCUSABRABRACADABRA...
          ABRA/CADABRA
           ABRACADABRA

If we look carefully at the text and the pattern, however, we should notice right away that<br />

there’s no point in looking at s = 12. We already know that the next character is a B — after all,<br />

it matched P [2] during the previous comparison — so why bother even looking there? Likewise,<br />

we already know that the next shift s = 13 will also fail, so why bother looking there?

HOCUSPOCUSABRABRACADABRA...
          ABRA/CADABRA
           /ABRACADABRA
            /ABRACADABRA
             ABRACADABRA

Finally, when we get to s = 14, we can’t immediately rule out a match based on earlier comparisons. However, for precisely the same reason, we shouldn’t start the substring comparison over from scratch — we already know that T [14] = P [4] = A, which also matches P [1]. Instead, we should start the substring comparison at the second character of the pattern, since we don’t yet know whether or not it matches the corresponding text character.

If you play with this idea long enough, you’ll notice that the character comparisons should<br />

always advance through the text. Once we’ve found a match for a text character, we never<br />

need to do another comparison with that character again. In other words, we should be<br />

able to optimize the brute-force algorithm so that it always advances through the text.<br />

You’ll also eventually notice a good rule for finding the next ‘reasonable’ shift s. A prefix of a<br />

string is a substring that includes the first character; a suffix is a substring that includes the last<br />

character. A prefix or suffix is proper if it is not the entire string. Suppose we have just discovered<br />

that T [i] ≠ P [j]. The next reasonable shift is the smallest value of s such that T [s .. i−1],

which is a suffix of the previously-read text, is also a proper prefix of the pattern.<br />

In this lecture, we’ll describe a string matching algorithm, published by Donald Knuth, James<br />

Morris, and Vaughan Pratt in 1977, that implements both of these ideas.

D.2 Finite State Machines<br />

If we have a string matching algorithm that follows our first observation (that we always advance<br />

through the text), we can interpret it as feeding the text through a special type of finite-state<br />

machine. A finite state machine is a directed graph. Each node in the graph, or state, is labeled<br />

with a character from the pattern, except for two special nodes labeled $○ and !○. Each node has<br />

two outgoing edges, a success edge and a failure edge. The success edges define a path through the<br />


characters of the pattern in order, starting at $○ and ending at !○. Failure edges always point to<br />

earlier characters in the pattern.<br />

[Figure: a finite state machine for the string ‘ABRACADABRA’. Thick arrows are the success edges; thin arrows are the failure edges.]

We use the finite state machine to search for the pattern as follows. At all times, we have<br />

a current text character T [i] and a current node in the graph, which is usually labeled by some<br />

pattern character P [j]. We iterate the following rules:<br />

• If T [i] = P [j], or if the current label is $○, follow the success edge to the next node and<br />

increment i. (So there is no failure edge from the start node $○.)<br />

• If T [i] ≠ P [j], follow the failure edge back to an earlier node, but do not change i.<br />

For the moment, let’s simply assume that the failure edges are defined correctly—we’ll come<br />

back to this later. If we ever reach the node labeled !○, then we’ve found an instance of the pattern<br />

in the text, and if we run out of text characters (i > n) before we reach !○, then there is no match.<br />

The finite state machine is really just a (very!) convenient metaphor. In a real implementation,<br />

we would not construct the entire graph. Since the success edges always go through the pattern<br />

characters in order, we only have to remember where the failure edges go. We can encode this<br />

failure function in an array fail[1 .. m], so that for each j there is a failure edge from node j to<br />

node fail[j]. Following a failure edge back to an earlier state exactly corresponds, in our earlier<br />

formulation, to shifting the pattern forward. The failure function fail[j] tells us how far to shift<br />

after a character mismatch T [i] ≠ P [j].<br />

Here’s what the actual algorithm looks like:<br />

KnuthMorrisPratt(T [1 .. n], P [1 .. m]):<br />

j ← 1<br />

for i ← 1 to n<br />

while j > 0 and T [i] ≠ P [j]<br />

j ← fail[j]<br />

if j = m 〈〈Found it!〉〉<br />

return i − m + 1<br />

j ← j + 1<br />

return ‘none’<br />

Before we discuss computing the failure function, let’s analyze the running time of Knuth-<br />

MorrisPratt under the assumption that a correct failure function is already known. At each<br />

character comparison, either we increase i and j by one, or we decrease j and leave i alone.<br />


We can increment i at most n − 1 times before we run out of text, so there are at most n − 1<br />

successful comparisons. Similarly, there can be at most n − 1 failed comparisons, since the number<br />

of times we decrease j cannot exceed the number of times we increment j. In other words, we<br />

can amortize character mismatches against earlier character matches. Thus, the total number of<br />

character comparisons performed by KnuthMorrisPratt in the worst case is O(n).<br />
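As a concrete check, the pseudocode above transcribes almost line for line into Python. This sketch (not part of the original notes) keeps the notes’ 1-based indexing by padding both strings with a dummy 0th character, and takes the failure function as a 1-based list whose 0th entry is unused:<br />

```python
def knuth_morris_pratt(T, P, fail):
    """Return the 1-based start of the first occurrence of P in T, or None.
    fail is the 1-based failure function; fail[0] is unused padding."""
    n, m = len(T), len(P)
    T, P = ' ' + T, ' ' + P          # dummy 0th character: arrays are now 1-based
    j = 1
    for i in range(1, n + 1):
        while j > 0 and T[i] != P[j]:    # mismatch: shift the pattern
            j = fail[j]
        if j == m:                       # found it!
            return i - m + 1
        j = j + 1
    return None
```

With the failure function for ‘ABRACADABRA’ from the table in the next section (0 1 1 1 2 1 2 1 2 3 4), searching ‘XXABRACADABRAXX’ returns 3, the 1-based position of the match.<br />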

D.3 Computing the Failure Function<br />

We can now rephrase our second intuitive rule about how to choose a reasonable shift after a<br />

character mismatch T [i] ≠ P [j]:<br />

P [1 .. fail[j] − 1] is the longest proper prefix of P [1 .. j − 1] that is also a suffix of T [1 .. i − 1].<br />

Notice, however, that if we are comparing T [i] against P [j], then we must have already matched<br />

the first j − 1 characters of the pattern. In other words, we already know that P [1 .. j − 1] is a<br />

suffix of T [1 .. i − 1]. Thus, we can rephrase the prefix-suffix rule as follows:<br />

P [1 .. fail[j] − 1] is the longest proper prefix of P [1 .. j − 1] that is also a suffix of P [1 .. j − 1].<br />

This is the definition of the Knuth-Morris-Pratt failure function fail[j] for all j > 1. 1 By convention<br />

we set fail[1] = 0; this tells the KMP algorithm that if the first pattern character doesn’t match,<br />

it should just give up and try the next text character.<br />

P [i] A B R A C A D A B R A<br />

fail[i] 0 1 1 1 2 1 2 1 2 3 4<br />

Failure function for the string ‘ABRACADABRA’<br />

(Compare with the finite state machine on the previous page.)<br />

We could easily compute the failure function in O(m 3 ) time by checking, for each j, whether<br />

every prefix of P [1 .. j − 1] is also a suffix of P [1 .. j − 1], but this is not the fastest method. The<br />

following algorithm essentially uses the KMP search algorithm to look for the pattern inside itself!<br />

ComputeFailure(P [1 .. m]):<br />

j ← 0<br />

for i ← 1 to m<br />

fail[i] ← j (∗)<br />

while j > 0 and P [i] ≠ P [j]<br />

j ← fail[j]<br />

j ← j + 1<br />

Here’s an example of this algorithm in action. In each line, the current values of i and j are<br />

indicated by superscripts; $ represents the beginning of the string. (You should imagine pointing<br />

at P [j] with your left hand and pointing at P [i] with your right hand, and moving your fingers<br />

according to the algorithm’s directions.)<br />

1 CLR defines a similar prefix function, denoted π[j], as follows:<br />

P [1 .. π[j]] is the longest proper prefix of P [1 .. j] that is also a suffix of P [1 .. j].<br />

These two functions are not the same, but they are related by the simple equation π[j] = fail[j + 1] − 1. The<br />

off-by-one difference between the two functions adds a few extra +1s to CLR’s version of the algorithm.<br />


j ← 0, i ← 1          $^j A^i B  R  A  C  A  D  A  B  R  X . . .<br />
fail[i] ← j               0 . . .<br />
j ← j + 1, i ← i + 1  $  A^j B^i R  A  C  A  D  A  B  R  X . . .<br />
fail[i] ← j               0  1 . . .<br />
j ← fail[j]           $^j A  B^i R  A  C  A  D  A  B  R  X . . .<br />
j ← j + 1, i ← i + 1  $  A^j B  R^i A  C  A  D  A  B  R  X . . .<br />
fail[i] ← j               0  1  1 . . .<br />
j ← fail[j]           $^j A  B  R^i A  C  A  D  A  B  R  X . . .<br />
j ← j + 1, i ← i + 1  $  A^j B  R  A^i C  A  D  A  B  R  X . . .<br />
fail[i] ← j               0  1  1  1 . . .<br />
j ← j + 1, i ← i + 1  $  A  B^j R  A  C^i A  D  A  B  R  X . . .<br />
fail[i] ← j               0  1  1  1  2 . . .<br />
j ← fail[j]           $  A^j B  R  A  C^i A  D  A  B  R  X . . .<br />
j ← fail[j]           $^j A  B  R  A  C^i A  D  A  B  R  X . . .<br />
j ← j + 1, i ← i + 1  $  A^j B  R  A  C  A^i D  A  B  R  X . . .<br />
fail[i] ← j               0  1  1  1  2  1 . . .<br />
j ← j + 1, i ← i + 1  $  A  B^j R  A  C  A  D^i A  B  R  X . . .<br />
fail[i] ← j               0  1  1  1  2  1  2 . . .<br />
j ← fail[j]           $  A^j B  R  A  C  A  D^i A  B  R  X . . .<br />
j ← fail[j]           $^j A  B  R  A  C  A  D^i A  B  R  X . . .<br />
j ← j + 1, i ← i + 1  $  A^j B  R  A  C  A  D  A^i B  R  X . . .<br />
fail[i] ← j               0  1  1  1  2  1  2  1 . . .<br />
j ← j + 1, i ← i + 1  $  A  B^j R  A  C  A  D  A  B^i R  X . . .<br />
fail[i] ← j               0  1  1  1  2  1  2  1  2 . . .<br />
j ← j + 1, i ← i + 1  $  A  B  R^j A  C  A  D  A  B  R^i X . . .<br />
fail[i] ← j               0  1  1  1  2  1  2  1  2  3 . . .<br />
j ← j + 1, i ← i + 1  $  A  B  R  A^j C  A  D  A  B  R  X^i . . .<br />
fail[i] ← j               0  1  1  1  2  1  2  1  2  3  4 . . .<br />
j ← fail[j]           $  A^j B  R  A  C  A  D  A  B  R  X^i . . .<br />
j ← fail[j]           $^j A  B  R  A  C  A  D  A  B  R  X^i . . .<br />

ComputeFailure in action. Do this yourself by hand.<br />

Just as we did for KnuthMorrisPratt, we can analyze ComputeFailure by amortizing<br />

character mismatches against earlier character matches. Since there are at most m character<br />

matches, ComputeFailure runs in O(m) time.<br />
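ComputeFailure transcribes into Python the same way as the search algorithm, again using a padding character to keep the 1-based indexing of the notes (this sketch is mine, not part of the original notes), and can be checked against the failure-function table above:<br />

```python
def compute_failure(P):
    """KMP failure function for P, 1-based; fail[0] is unused padding."""
    m = len(P)
    P = ' ' + P                  # dummy 0th character: P is now 1-based
    fail = [None] * (m + 1)
    j = 0
    for i in range(1, m + 1):
        fail[i] = j              # line (*) in the pseudocode above
        while j > 0 and P[i] != P[j]:
            j = fail[j]
        j = j + 1
    return fail
```

For ‘ABRACADABRA’ this produces the values 0 1 1 1 2 1 2 1 2 3 4, exactly as in the table above.<br />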

Let’s prove (by induction, of course) that ComputeFailure correctly computes the failure<br />

function. The base case fail[1] = 0 is obvious. Assuming inductively that we correctly computed<br />

fail[1] through fail[i] in line (∗), we need to show that fail[i + 1] is also correct. Just after the ith<br />

iteration of line (∗), we have j = fail[i], so P [1 .. j − 1] is the longest proper prefix of P [1 .. i − 1]<br />

that is also a suffix.<br />

Let’s define the iterated failure functions fail c [j] inductively as follows: fail 0 [j] = j, and<br />

fail^c[j] = fail[fail^{c−1}[j]] = fail[fail[· · · fail[j] · · · ]]   (c applications of fail).<br />

In particular, if fail^{c−1}[j] = 0, then fail^c[j] is undefined. We can easily show by induction (see<br />
[CLR, p. 872]) that every string of the form P [1 .. fail^c[j] − 1] is both a proper prefix and a proper<br />
suffix of P [1 .. i−1], and in fact, these are the only examples. Thus, the longest proper prefix/suffix<br />
of P [1 .. i] must be the longest string of the form P [1 .. fail^c[j]] — i.e., the one with smallest c —<br />
such that P [fail^c[j]] = P [i]. This is exactly what the while loop in ComputeFailure computes;<br />
the (c + 1)th iteration compares P [fail^c[j]] = P [fail^{c+1}[i]] against P [i]. ComputeFailure is<br />

actually a dynamic programming implementation of the following recursive definition of fail[i]:<br />

fail[i] = 0 if i = 1, and otherwise<br />
fail[i] = max_{c ≥ 1} { fail^c[i − 1] + 1 | P [i − 1] = P [fail^c[i − 1]] },<br />
where, as in the finite state machine, the sentinel P [0] = $ matches every character.<br />


D.4 Optimizing the Failure Function<br />

We can speed up KnuthMorrisPratt slightly by making one small change to the failure function.<br />

Recall that after comparing T [i] against P [j] and finding a mismatch, the algorithm compares T [i]<br />

against P [fail[j]]. With the current definition, however, it is possible that P [j] and P [fail[j]] are<br />

actually the same character, in which case the next character comparison will automatically fail.<br />

So why do the comparison at all?<br />

We can optimize the failure function by ‘short-circuiting’ these redundant comparisons with<br />

some simple post-processing:<br />

OptimizeFailure(P [1 .. m], fail[1 .. m]):<br />

for i ← 2 to m<br />

if P [i] = P [fail[i]]<br />

fail[i] ← fail[fail[i]]<br />

We can also compute the optimized failure function directly by adding three new lines (in bold) to<br />

the ComputeFailure function.<br />

ComputeOptFailure(P [1 .. m]):<br />

j ← 0<br />

for i ← 1 to m<br />

if P [i] = P [j]<br />

fail[i] ← fail[j]<br />

else<br />

fail[i] ← j<br />

while j > 0 and P [i] ≠ P [j]<br />

j ← fail[j]<br />

j ← j + 1<br />

This optimization slows down the preprocessing slightly, but it may significantly decrease the<br />

number of comparisons at each text character. The worst-case running time is still O(n); however,<br />

the constant is about half as big as for the unoptimized version, so this could be a significant<br />

improvement in practice.<br />
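Here is ComputeOptFailure as a Python sketch (mine, not part of the original notes), using the same 1-based padding trick as before; the padding character is assumed not to occur in the pattern, so the comparison P[i] = P[0] always fails, exactly as the sentinel should:<br />

```python
def compute_opt_failure(P):
    """Optimized KMP failure function, 1-based; fail[0] is unused padding."""
    m = len(P)
    P = ' ' + P          # padding character: assumed to match no pattern character
    fail = [None] * (m + 1)
    j = 0
    for i in range(1, m + 1):
        if P[i] == P[j]:             # comparing against P[j] would be redundant,
            fail[i] = fail[j]        # so short-circuit past it
        else:
            fail[i] = j
        while j > 0 and P[i] != P[j]:
            j = fail[j]
        j = j + 1
    return fail
```

The output can be checked against the tables in this section: for ‘ABRACADABRA’ it gives 0 1 1 0 2 0 2 0 1 1 0, and for ‘AAAAAAAAAAAAB’ it gives twelve 0s followed by 12.<br />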

Optimized finite state machine for the string ‘ABRACADABRA’.<br />

P [i] A B R A C A D A B R A<br />

fail[i] 0 1 1 0 2 0 2 0 1 1 0<br />

Optimized failure function for ‘ABRACADABRA’ (compare with the unoptimized version above).<br />

Here are the unoptimized and optimized failure functions for a few more patterns:<br />


P [i] A N A N A B A N A N A N A<br />

unoptimized fail[i] 0 1 1 2 3 4 1 2 3 4 5 6 5<br />

optimized fail[i] 0 1 0 1 0 4 0 1 0 1 0 6 0<br />

Failure functions for ‘ANANABANANANA’.<br />

P [i] A B A B C A B A B C A B C<br />

unoptimized fail[i] 0 1 1 2 3 1 2 3 4 5 6 7 8<br />

optimized fail[i] 0 1 0 1 3 0 1 0 1 3 0 1 8<br />

Failure functions for ‘ABABCABABCABC’.<br />

P [i] A B B A B B A B A B B A B<br />

unoptimized fail[i] 0 1 1 1 2 3 4 5 6 2 3 4 5<br />

optimized fail[i] 0 1 1 0 1 1 0 1 6 1 1 0 1<br />

Failure functions for ‘ABBABBABABBAB’.<br />

P [i] A A A A A A A A A A A A B<br />

unoptimized fail[i] 0 1 2 3 4 5 6 7 8 9 10 11 12<br />

optimized fail[i] 0 0 0 0 0 0 0 0 0 0 0 0 12<br />

Failure functions for ‘AAAAAAAAAAAAB’.<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture E: Convex Hulls Fall 2002<br />

E Convex Hulls<br />

E.1 Definitions<br />

We are given a set P of n points in the plane. We want to compute something called the convex<br />

hull of P . Intuitively, the convex hull is what you get by driving a nail into the plane at each point<br />

and then wrapping a piece of string around the nails. More formally, the convex hull is the smallest<br />

convex polygon containing the points:<br />

• polygon: A region of the plane bounded by a cycle of line segments, called edges, joined<br />

end-to-end in a cycle. Points where two successive edges meet are called vertices.<br />

• convex: For any two points p, q inside the polygon, the line segment pq is completely inside<br />

the polygon.<br />

• smallest: Any convex proper subset of the convex hull excludes at least one point in P . This<br />

implies that every vertex of the convex hull is a point in P .<br />

We can also define the convex hull as the largest convex polygon whose vertices are all points in P ,<br />

or the unique convex polygon that contains P and whose vertices are all points in P . Notice that<br />

P might have interior points that are not vertices of the convex hull.<br />

A set of points and its convex hull.<br />

Convex hull vertices are black; interior points are white.<br />

Just to make things concrete, we will represent the points in P by their Cartesian coordinates,<br />

in two arrays X[1 .. n] and Y [1 .. n]. We will represent the convex hull as a circular linked list of<br />

vertices in counterclockwise order. If the ith point is a vertex of the convex hull, next[i] is the index of<br />

the next vertex counterclockwise and pred[i] is the index of the next vertex clockwise; otherwise,<br />

next[i] = pred[i] = 0. It doesn’t matter which vertex we choose as the ‘head’ of the list. The<br />

decision to list vertices counterclockwise instead of clockwise is arbitrary.<br />

To simplify the presentation of the convex hull algorithms, I will assume that the points are in<br />

general position, meaning (in this context) that no three points lie on a common line. This is just<br />

like assuming that no two elements are equal when we talk about sorting algorithms. If we wanted<br />

to really implement these algorithms, we would have to handle colinear triples correctly, or at least<br />

consistently. This is fairly easy, but definitely not trivial.<br />

E.2 Simple Cases<br />

Computing the convex hull of a single point is trivial; we just return that point. Computing the<br />

convex hull of two points is also trivial.<br />

For three points, we have two different possibilities — either the points are listed in the array<br />

in clockwise order or counterclockwise order. Suppose our three points are (a, b), (c, d), and (e, f),<br />

given in that order, and for the moment, let’s also suppose that the first point is furthest to the<br />


left, so a < c and a < e. Then the three points are in counterclockwise order if and only if the slope<br />
of the line through (a, b) and (c, d) is less than the slope of the line through (a, b) and (e, f):<br />

counterclockwise ⇐⇒ (d − b)/(c − a) < (f − b)/(e − a)<br />

Since both denominators are positive, we can rewrite this inequality as follows:<br />

counterclockwise ⇐⇒ (f − b)(c − a) > (d − b)(e − a)<br />

This final inequality is correct even if (a, b) is not the leftmost point. If the inequality is reversed,<br />

then the points are in clockwise order. If the three points are colinear (remember, we’re assuming<br />

that never happens), then the two expressions are equal.<br />

(a,b)<br />

(c,d)<br />

(e,f)<br />

Three points in counterclockwise order.<br />

Another way of thinking about this counterclockwise test is that we’re computing the crossproduct<br />

of the two vectors (c, d) − (a, b) and (e, f) − (a, b), which is defined as a 2 × 2 determinant:<br />

counterclockwise ⇐⇒ det | c − a   d − b |<br />
                        | e − a   f − b | > 0<br />

We can also write it as a 3 × 3 determinant:<br />

counterclockwise ⇐⇒ det | 1   a   b |<br />
                        | 1   c   d |<br />
                        | 1   e   f | > 0<br />

All three expressions are algebraically identical.<br />

This counterclockwise test plays exactly the same role in convex hull algorithms as comparisons<br />

play in sorting algorithms. Computing the convex hull of three points is analogous to sorting<br />

two numbers: either they’re in the correct order or in the opposite order.<br />
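Since every algorithm in this lecture is built from this one primitive, here is a Python sketch of the counterclockwise test (mine, not part of the original notes), with points represented as (x, y) pairs:<br />

```python
def ccw(p, q, r):
    """True iff the points p, q, r are in counterclockwise order.
    With p = (a, b), q = (c, d), r = (e, f), this evaluates the 2x2
    determinant (c - a)(f - b) - (d - b)(e - a) from the notes."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) > 0
```

For example, ccw((0, 0), (1, 0), (0, 1)) is True (a left turn) while ccw((0, 0), (0, 1), (1, 0)) is False; under our general-position assumption the determinant is never zero.<br />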

E.3 Jarvis’s Algorithm (Wrapping)<br />

Perhaps the simplest algorithm for computing convex hulls simply simulates the process of wrapping<br />

a piece of string around the points. This algorithm is usually called Jarvis’s march, but it is also<br />

referred to as the gift-wrapping algorithm.<br />

Jarvis’s march starts by computing the leftmost point ℓ, i.e., the point whose x-coordinate is<br />

smallest, since we know that the leftmost point must be a convex hull vertex. Finding ℓ clearly<br />

takes linear time.<br />


The execution of Jarvis’s March.<br />

Then the algorithm does a series of pivoting steps to find each successive convex hull vertex,<br />

starting with ℓ and continuing until we reach ℓ again. The vertex immediately following a point p<br />

is the point that appears to be furthest to the right to someone standing at p and looking at the<br />

other points. In other words, if q is the vertex following p, and r is any other input point, then<br />

the triple p, q, r is in counter-clockwise order. We can find each successive vertex in linear time by<br />

performing a series of O(n) counter-clockwise tests.<br />

JarvisMarch(X[1 .. n], Y [1 .. n]):<br />

ℓ ← 1<br />

for i ← 2 to n<br />

if X[i] < X[ℓ]<br />

ℓ ← i<br />

p ← ℓ<br />

repeat<br />

q ← p + 1 〈〈Make sure p ≠ q〉〉<br />

for i ← 2 to n<br />

if CCW(p, i, q)<br />

q ← i<br />

next[p] ← q; pred[q] ← p<br />

p ← q<br />

until p = ℓ<br />

Since the algorithm spends O(n) time for each convex hull vertex, the worst-case running time<br />

is O(n 2 ). However, this naïve analysis hides the fact that if the convex hull has very few vertices,<br />

Jarvis’s march is extremely fast. A better way to write the running time is O(nh), where h is the<br />

number of convex hull vertices. In the worst case, h = n, and we get our old O(n 2 ) time bound,<br />

but in the best case h = 3, and the algorithm only needs O(n) time. Computational geometers call<br />

this an output-sensitive algorithm; the smaller the output, the faster the algorithm.<br />
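A Python sketch of JarvisMarch (mine, not part of the original notes) makes the pivoting loop concrete; for simplicity it returns the hull as a counterclockwise list of points instead of filling the next/pred arrays, and it assumes general position:<br />

```python
def jarvis_march(points):
    """Convex hull of a list of (x, y) pairs in general position,
    returned in counterclockwise order starting from the leftmost point."""
    def ccw(p, q, r):
        return (q[0]-p[0]) * (r[1]-p[1]) - (q[1]-p[1]) * (r[0]-p[0]) > 0
    n = len(points)
    l = min(range(n), key=lambda i: points[i])   # index of the leftmost point
    hull, p = [], l
    while True:
        hull.append(points[p])
        q = (p + 1) % n                          # any point other than p
        for i in range(n):
            if i != p and ccw(points[p], points[i], points[q]):
                q = i                            # i is further right than q
        p = q
        if p == l:
            break
    return hull
```

Each pass of the inner loop is one round of O(n) counterclockwise tests, so the total time is O(nh), as in the analysis above.<br />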

E.4 Divide and Conquer (Splitting)<br />

The behavior of Jarvis’s march is very much like selection sort: repeatedly find the item that goes<br />

in the next slot. In fact, most convex hull algorithms resemble some sorting algorithm.<br />

For example, the following convex hull algorithm resembles quicksort. We start by choosing a<br />

pivot point p. Partition the input points into two sets L and R, containing the points to the left<br />


of p, including p itself, and the points to the right of p, by comparing x-coordinates. Recursively<br />

compute the convex hulls of L and R. Finally, merge the two convex hulls into the final output.<br />

The merge step requires a little explanation. We start by connecting the two hulls with a line<br />

segment between the rightmost point of the hull of L with the leftmost point of the hull of R.<br />

Call these points p and q, respectively. (Yes, it’s the same p.) Actually, let’s add two copies of<br />

the segment pq and call them bridges. Since p and q can ‘see’ each other, this creates a sort of<br />

dumbbell-shaped polygon, which is convex except possibly at the endpoints of the bridges.<br />

Merging the left and right subhulls.<br />

We now expand this dumbbell into the correct convex hull as follows. As long as there is a<br />

clockwise turn at either endpoint of either bridge, we remove that point from the circular sequence<br />

of vertices and connect its two neighbors. As soon as the turns at both endpoints of both bridges<br />

are counter-clockwise, we can stop. At that point, the bridges lie on the upper and lower common<br />

tangent lines of the two subhulls. These are the two lines that touch both subhulls, such that both<br />

subhulls lie below the upper common tangent line and above the lower common tangent line.<br />

Merging the two subhulls takes O(n) time in the worst case. Thus, the running time is given<br />

by the recurrence T (n) = O(n) + T (k) + T (n − k), just like quicksort, where k is the number of points<br />

in R. Just like quicksort, if we use a naïve deterministic algorithm to choose the pivot point p,<br />

the worst-case running time of this algorithm is O(n 2 ). If we choose the pivot point randomly, the<br />

expected running time is O(n log n).<br />

There are inputs where this algorithm is clearly wasteful (at least, clearly to us). If we’re really<br />

unlucky, we’ll spend a long time putting together the subhulls, only to throw almost everything<br />

away during the merge step. Thus, this divide-and-conquer algorithm is not output sensitive.<br />

A set of points that shouldn’t be divided and conquered.<br />

E.5 Graham’s Algorithm (Scanning)<br />

Our third convex hull algorithm, called Graham’s scan, first explicitly sorts the points in O(n log n) time<br />

and then applies a linear-time scanning algorithm to finish building the hull.<br />


We start Graham’s scan by finding the leftmost point ℓ, just as in Jarvis’s march. Then we<br />

sort the points in counterclockwise order around ℓ. We can do this in O(n log n) time with any<br />

comparison-based sorting algorithm (quicksort, mergesort, heapsort, etc.). To compare two points<br />

p and q, we check whether the triple ℓ, p, q is oriented clockwise or counterclockwise. Once the<br />

points are sorted, we connect them in counterclockwise order, starting and ending at ℓ. The<br />

result is a simple polygon with n vertices.<br />

A simple polygon formed in the sorting phase of Graham’s scan.<br />

To change this polygon into the convex hull, we apply the following ‘three-penny algorithm’.<br />

We have three pennies, which will sit on three consecutive vertices p, q, r of the polygon; initially,<br />

these are ℓ and the two vertices after ℓ. We now apply the following two rules over and over until<br />

a penny is moved forward onto ℓ:<br />

• If p, q, r are in counterclockwise order, move the back penny forward to the successor of r.<br />

• If p, q, r are in clockwise order, remove q from the polygon, add the edge pr, and move the<br />

middle penny backward to the predecessor of p.<br />

The ‘three-penny’ scanning phase of Graham’s scan.<br />

Whenever a penny moves forward, it moves onto a vertex that hasn’t seen a penny before<br />

(except the last time), so the first rule is applied n − 2 times. Whenever a penny moves backwards,<br />

a vertex is removed from the polygon, so the second rule is applied exactly n − h times, where h is<br />

as usual the number of convex hull vertices. Since each counterclockwise test takes constant time,<br />

the scanning phase takes O(n) time altogether.<br />
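Here is a Python sketch of Graham’s scan (mine, not part of the original notes). The sorting phase uses the counterclockwise comparator exactly as described; the scanning phase keeps the hull-so-far on a stack, whose pop operation performs the same vertex removals as the backward-moving penny in the three-penny formulation:<br />

```python
from functools import cmp_to_key

def graham_scan(points):
    """Convex hull by Graham's scan: sort the other points counterclockwise
    around the leftmost point l, then scan. General position assumed."""
    def ccw(p, q, r):
        return (q[0]-p[0]) * (r[1]-p[1]) - (q[1]-p[1]) * (r[0]-p[0]) > 0
    l = min(points)                       # leftmost point
    # p precedes q in the sorted order iff l, p, q is counterclockwise
    rest = sorted((p for p in points if p != l),
                  key=cmp_to_key(lambda p, q: -1 if ccw(l, p, q) else 1))
    hull = [l]
    for p in rest:
        # back up while the last two hull vertices and p make a clockwise turn
        while len(hull) >= 2 and not ccw(hull[-2], hull[-1], p):
            hull.pop()
        hull.append(p)
    return hull
```

Each point is pushed once and popped at most once, so the scan after sorting takes O(n) time, matching the analysis above.<br />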

E.6 Chan’s Algorithm (Shattering)<br />

The last algorithm I’ll describe is an output-sensitive algorithm that is never slower than either<br />

Jarvis’s march or Graham’s scan. The running time of this algorithm, which was discovered by<br />


Timothy Chan in 1993, is O(n log h). Chan’s algorithm is a combination of divide-and-conquer and<br />

gift-wrapping.<br />

First suppose a ‘little birdie’ tells us the value of h; we’ll worry about how to implement the<br />

little birdie in a moment. Chan’s algorithm starts by shattering the input points into n/h arbitrary 1<br />

subsets, each of size h, and computing the convex hull of each subset using (say) Graham’s scan.<br />

This much of the algorithm requires O((n/h) · h log h) = O(n log h) time.<br />

Shattering the points and computing subhulls in O(n log h) time.<br />

Once we have the n/h subhulls, we follow the general outline of Jarvis’s march, ‘wrapping a<br />

string around’ the n/h subhulls. Starting with p = ℓ, where ℓ is the leftmost input point, we<br />

successively find the convex hull vertex that follows p in counterclockwise order until we return<br />

back to ℓ again.<br />

The vertex that follows p is the point that appears to be furthest to the right to someone standing<br />

at p. This means that the successor of p must lie on a right tangent line between p and one of<br />

the subhulls—a line from p through a vertex of the subhull, such that the subhull lies completely<br />

on the right side of the line from p’s point of view. We can find the right tangent line between p<br />

and any subhull in O(log h) time using a variant of binary search. (This is a practice problem in<br />

the homework!) Since there are n/h subhulls, finding the successor of p takes O((n/h) log h) time<br />

altogether.<br />

Since there are h convex hull edges, and we find each edge in O((n/h) log h) time, the overall<br />

running time of the algorithm is O(n log h).<br />

Wrapping the subhulls in O(n log h) time.<br />

Unfortunately, this algorithm only takes O(n log h) time if a little birdie has told us the value<br />

of h in advance. So how do we implement the ‘little birdie’? Chan’s trick is to guess the correct<br />

value of h; let’s denote the guess by h ∗ . Then we shatter the points into n/h ∗ subsets of size h ∗ ,<br />

compute their subhulls, and then find the first h ∗ edges of the global hull. If h ≤ h ∗ , this algorithm<br />

computes the complete convex hull in O(n log h ∗ ) time. Otherwise, the hull doesn’t wrap all the<br />

way back around to ℓ, so we know our guess h ∗ is too small.<br />

Chan’s algorithm starts with the optimistic guess h ∗ = 3. If we finish an iteration of the<br />

algorithm and find that h ∗ is too small, we square h ∗ and try again. In the final iteration, h ∗ < h 2 ,<br />

so the last iteration takes O(n log h ∗ ) = O(n log h 2 ) = O(n log h) time. The total running time of<br />

Chan’s algorithm is given by the sum<br />

O(n log 3 + n log 3^2 + n log 3^4 + · · · + n log 3^(2^k)),<br />

1 In the figures, in order to keep things as clear as possible, I’ve chosen these subsets so that their convex hulls are<br />

disjoint. This is not true in general!<br />


for some integer k. We can rewrite this as a geometric series:<br />

O(n log 3 + 2n log 3 + 4n log 3 + · · · + 2^k n log 3).<br />

Since any geometric series adds up to a constant times its largest term, the total running time is a<br />

constant times the time taken by the last iteration, which is O(n log h). So Chan’s algorithm runs<br />

in O(n log h) time overall, even without the little birdie.<br />
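A Python sketch of the guessing scheme follows (mine, not part of the original notes; ccw and graham_scan are repeated so the sketch is self-contained). One deliberate simplification: each wrapping step below scans every subhull vertex by brute force instead of binary-searching each subhull for its tangent in O(log h) time, so this sketch demonstrates the shatter-and-square trick but does not achieve the true O(n log h) bound:<br />

```python
from functools import cmp_to_key

def ccw(p, q, r):
    return (q[0]-p[0]) * (r[1]-p[1]) - (q[1]-p[1]) * (r[0]-p[0]) > 0

def graham_scan(points):
    l = min(points)
    rest = sorted((p for p in points if p != l),
                  key=cmp_to_key(lambda p, q: -1 if ccw(l, p, q) else 1))
    hull = [l]
    for p in rest:
        while len(hull) >= 2 and not ccw(hull[-2], hull[-1], p):
            hull.pop()
        hull.append(p)
    return hull

def chan_hull(points):
    """Convex hull by Chan's guessing scheme: try h* = 3, 9, 81, ...,
    shattering the points into groups of size at most h* and wrapping
    around the subhulls for at most h* steps before giving up."""
    h_star = 3
    while True:
        groups = [points[i:i + h_star] for i in range(0, len(points), h_star)]
        subhulls = [graham_scan(g) if len(g) >= 3 else g for g in groups]
        candidates = [p for sh in subhulls for p in sh]
        l = min(candidates)              # leftmost point: on the global hull
        hull, p = [], l
        for _ in range(h_star):          # wrap for at most h* steps
            hull.append(p)
            # brute-force 'tangent': scan every subhull vertex
            # (the real algorithm binary-searches each subhull in O(log h))
            q = next(r for r in candidates if r != p)
            for r in candidates:
                if r != p and ccw(p, r, q):
                    q = r
            if q == l:
                return hull              # wrapped all the way around
            p = q
        h_star = h_star * h_star         # guess was too small: square it
```

The wrapping over subhull vertices is sound because every vertex of the global hull is also a vertex of its own group’s subhull.<br />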



<strong>CS</strong> <strong>373</strong> Non-Lecture F: Line Segment Intersection Fall 2002<br />

F Line Segment Intersection<br />

F.1 Introduction<br />

In this lecture, I’ll talk about detecting line segment intersections. A line segment is the convex<br />

hull of two points, called the endpoints (or vertices) of the segment. We are given a set of n line<br />

segments, each specified by the x- and y-coordinates of its endpoints, for a total of 4n real numbers,<br />

and we want to know whether any two segments intersect.<br />

To keep things simple, just as in the previous lecture, I’ll assume the segments are in general<br />

position.<br />

• No three endpoints lie on a common line.<br />

• No two endpoints have the same x-coordinate. In particular, no segment is vertical, no<br />

segment is just a point, and no two segments share an endpoint.<br />

This general position assumption lets us avoid several annoying degenerate cases. Of course, in<br />

any real implementation of the algorithm I’m about to describe, you’d have to handle these cases.<br />

Real-world data is full of degeneracies!<br />

F.2 Two segments<br />

Degenerate cases of intersecting segments that we’ll pretend never happen:<br />

Overlapping colinear segments, endpoints inside segments, and shared endpoints.<br />

The first case we have to consider is n = 2. (n ≤ 1 is obviously completely trivial!) How do we<br />

tell whether two line segments intersect? One possibility, suggested by a student in class, is to<br />

construct the convex hull of the segments. Two segments intersect if and only if the convex hull<br />

is a quadrilateral whose vertices alternate between the two segments. In the figure below, the first<br />

pair of segments has a triangular convex hull. The last pair’s convex hull is a quadrilateral, but its<br />

vertices don’t alternate.<br />

Some pairs of segments.<br />

Fortunately, we don’t need (or want!) to use a full-fledged convex hull algorithm just to test<br />

two segments; there’s a much simpler test.<br />

Two segments ab and cd intersect if and only if<br />

• the endpoints a and b are on opposite sides of the line ←→ cd, and<br />

• the endpoints c and d are on opposite sides of the line ←→ ab.<br />


To test whether two points are on opposite sides of a line through two other points, we use the same<br />

counterclockwise test that we used for building convex hulls. Specifically, a and b are on opposite<br />

sides of line ←→ cd if and only if exactly one of the two triples a, c, d and b, c, d is in counterclockwise<br />

order. So we have the following simple algorithm.<br />

Intersect(a, b, c, d):<br />
if CCW(a, c, d) = CCW(b, c, d)<br />
return False<br />
else if CCW(a, b, c) = CCW(a, b, d)<br />
return False<br />
else<br />
return True<br />

Or even simpler:<br />

Intersect(a, b, c, d):<br />
return CCW(a, c, d) ≠ CCW(b, c, d) ∧ CCW(a, b, c) ≠ CCW(a, b, d)<br />
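In Python, with points as (x, y) pairs, the whole test is two short functions (a sketch of mine, not part of the original notes; general position assumed, as above):<br />

```python
def ccw(p, q, r):
    """True iff p, q, r are in counterclockwise order."""
    return (q[0]-p[0]) * (r[1]-p[1]) - (q[1]-p[1]) * (r[0]-p[0]) > 0

def intersect(a, b, c, d):
    """True iff segments ab and cd intersect; general position assumed,
    so no collinear triples and no shared endpoints."""
    return ccw(a, c, d) != ccw(b, c, d) and ccw(a, b, c) != ccw(a, b, d)
```

For example, the crossing segments (0,0)–(2,2) and (0,2)–(2,0) test True, while the parallel segments (0,0)–(1,0) and (0,1)–(1,1) test False.<br />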

F.3 A Sweep Line Algorithm<br />

To detect whether there’s an intersection in a set of more than just two segments, we use something<br />

called a sweep line algorithm. First let’s give each segment a unique label. I’ll use letters, but<br />

in a real implementation, you’d probably use pointers/references to records storing the endpoint<br />

coordinates.<br />

Imagine sweeping a vertical line across the segments from left to right. At each position of<br />

the sweep line, look at the sequence of (labels of) segments that the line hits, sorted from top to<br />

bottom. The only times this sorted sequence can change is when the sweep line passes an endpoint<br />

or when the sweep line passes an intersection point. In the second case, the order changes because<br />

two adjacent labels swap places. 1 Our algorithm will simulate this sweep, looking for potential<br />

swaps between adjacent segments.<br />

The sweep line algorithm begins by sorting the 2n segment endpoints from left to right by<br />

comparing their x-coordinates, in O(n log n) time. The algorithm then moves the sweep line from<br />

left to right, stopping at each endpoint.<br />

We store the vertical label sequence in some sort of balanced binary tree that supports the<br />

following operations in O(log n) time. Note that the tree does not store any explicit search keys,<br />

only segment labels.<br />

• Insert a segment label.<br />

• Delete a segment label.<br />

• Find the neighbors of a segment label in the sorted sequence.<br />

O(log n) amortized time is good enough, so we could use a scapegoat tree or a splay tree. If we’re<br />

willing to settle for an expected time bound, we could use a treap or a skip list instead.<br />

1 Actually, if more than two segments intersect at the same point, there could be a larger reversal, but this won’t<br />

have any effect on our algorithm.<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture F: Line Segment Intersection Fall 2002<br />


The sweep line algorithm in action. The boxes show the label sequence stored in the binary tree.<br />

The intersection between F and D is detected in the last step.<br />

Whenever the sweep line hits a left endpoint, we insert the corresponding label into the tree in<br />

O(log n) time. In order to do this, we have to answer questions of the form ‘Does the new label X<br />

go above or below the old label Y?’ To answer this question, we test whether the new left endpoint<br />

of X is above segment Y, or equivalently, if the triple of endpoints left(Y), right(Y), left(X) is in<br />

counterclockwise order.<br />

Once the new label is inserted, we test whether the new segment intersects either of its two<br />

neighbors in the label sequence. For example, in the figure above, when the sweep line hits the left<br />

endpoint of F, we test whether F intersects either B or E. These tests require O(1) time.<br />

Whenever the sweep line hits a right endpoint, we delete the corresponding label from the tree<br />

in O(log n) time, and then check whether its two neighbors intersect in O(1) time. For example, in<br />

the figure, when the sweep line hits the right endpoint of C, we test whether B and D intersect.<br />

If at any time we discover a pair of segments that intersects, we stop the algorithm and report<br />

the intersection. For example, in the figure, when the sweep line reaches the right endpoint of B,<br />

we discover that F and D intersect, and we halt. Note that we may not discover the intersection<br />

until long after the two segments are inserted, and the intersection we discover may not be the<br />

one that the sweep line would hit first. It’s not hard to show by induction (hint, hint) that the<br />

algorithm is correct. Specifically, if the algorithm reaches the nth right endpoint without detecting<br />

an intersection, none of the segments intersect.<br />

For each segment endpoint, we spend O(log n) time updating the binary tree, plus O(1) time<br />

performing pairwise intersection tests—at most two at each left endpoint and at most one at each<br />

right endpoint. Thus, the entire sweep requires O(n log n) time in the worst case. Since we also<br />

spent O(n log n) time sorting the endpoints, the overall running time is O(n log n).<br />

Here’s a slightly more formal description of the algorithm. The input S[1 .. n] is an array of line<br />

segments. The sorting phase in the first line produces two auxiliary arrays:<br />

• label[i] is the label of the ith leftmost endpoint. I’ll use indices into the input array S as the<br />

labels, so the ith vertex is an endpoint of S[label[i]].<br />

• isleft[i] is True if the ith leftmost endpoint is a left endpoint and False if it’s a right endpoint.<br />

The functions Insert, Delete, Predecessor, and Successor modify or search through the<br />

sorted label sequence. Finally, Intersect tests whether two line segments intersect.<br />


AnyIntersections(S[1 .. n]):<br />
  sort the endpoints of S from left to right<br />
  create an empty label sequence<br />
  for i ← 1 to 2n<br />
    ℓ ← label[i]<br />
    if isleft[i]<br />
      Insert(ℓ)<br />
      if Intersect(S[ℓ], S[Successor(ℓ)])<br />
        return True<br />
      if Intersect(S[ℓ], S[Predecessor(ℓ)])<br />
        return True<br />
    else<br />
      if Intersect(S[Successor(ℓ)], S[Predecessor(ℓ)])<br />
        return True<br />
      Delete(label[i])<br />
  return False<br />
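The sweep can be sketched in Python as follows. To keep the sketch short, the label sequence is a plain Python list rather than a balanced binary tree, so each update costs O(n) instead of O(log n); the structure of the sweep is otherwise as described above. All names here are mine, and each segment's endpoints are assumed to have distinct x-coordinates.

```python
def ccw(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]) > 0

def crosses(s, t):
    (a, b), (c, d) = s, t
    return ccw(a, c, d) != ccw(b, c, d) and ccw(a, b, c) != ccw(a, b, d)

def any_intersections(segments):
    segs = [tuple(sorted(s)) for s in segments]      # left endpoint first
    events = []
    for i, (l, r) in enumerate(segs):
        events.append((l[0], True, i))               # left endpoint event
        events.append((r[0], False, i))              # right endpoint event
    events.sort()                                    # sweep left to right
    status = []                                      # labels, top to bottom
    for _, is_left, i in events:
        l, r = segs[i]
        if is_left:
            # 'Does the new label go above or below the old one?' via CCW
            pos = 0
            while pos < len(status) and not ccw(*segs[status[pos]], l):
                pos += 1
            status.insert(pos, i)
            if pos > 0 and crosses(segs[i], segs[status[pos - 1]]):
                return True
            if pos + 1 < len(status) and crosses(segs[i], segs[status[pos + 1]]):
                return True
        else:
            pos = status.index(i)
            if 0 < pos < len(status) - 1 and \
                    crosses(segs[status[pos - 1]], segs[status[pos + 1]]):
                return True
            status.pop(pos)
    return False
```

Swapping the list for a balanced search tree (or any structure with logarithmic insert, delete, and neighbor queries) recovers the O(n log n) bound.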

Note that the algorithm doesn’t try to avoid redundant pairwise tests. In the figure below,<br />

the top and bottom segments would be checked n − 1 times, once at the top left endpoint, and<br />

once at the right endpoint of every short segment. But since we’ve already spent O(n log n) time<br />

just sorting the inputs, O(n) redundant segment intersection tests make no difference in the overall<br />

running time.<br />

The same pair of segments might be tested n − 1 times.<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture G: Polygon Triangulation Fall 2002<br />

G Polygon Triangulation<br />

G.1 Introduction<br />

Recall from last time that a polygon is a region of the plane bounded by a cycle of straight edges<br />

joined end to end. Given a polygon, we want to decompose it into triangles by adding diagonals:<br />

new line segments between the vertices that don’t cross the boundary of the polygon. Because we<br />

want to keep the number of triangles small, we don’t allow the diagonals to cross. We call this<br />

decomposition a triangulation of the polygon. Most polygons can have more than one triangulation;<br />

we don’t care which one we compute.<br />

Two triangulations of the same polygon.<br />

Before we go any further, I encourage you to play around with some examples. Draw a few<br />

polygons (making sure that the edges are straight and don’t cross) and try to break them up into<br />

triangles.<br />

G.2 Existence and Complexity<br />

If you play around with a few examples, you quickly discover that every triangulation of an n-sided<br />

polygon has n − 2 triangles. You might even try to prove this observation by induction. The base case<br />

n = 3 is trivial: there is only one triangulation of a triangle, and it obviously has only one triangle!<br />

To prove the general case, let P be a polygon with n edges. Draw a diagonal between two vertices.<br />

This splits P into two smaller polygons. One of these polygons has k edges of P plus the diagonal,<br />

for some integer k between 2 and n − 2, for a total of k + 1 edges. So by the induction hypothesis,<br />

this polygon can be broken into k − 1 triangles. The other polygon has n − k + 1 edges, and so by<br />

the induction hypothesis, it can be broken into n − k − 1 triangles. Putting the two pieces back<br />

together, we have a total of (k − 1) + (n − k − 1) = n − 2 triangles.<br />

Breaking a polygon into two smaller polygons with a diagonal.<br />

This is a fine induction proof, which any of you could have discovered on your own (right?),<br />

except for one small problem. How do we know that every polygon has a diagonal? This seems<br />


patently obvious, but it’s surprisingly hard to prove, and in fact many incorrect proofs were actually<br />

published as late as 1975. The following proof is due to Meisters in 1975.<br />

Lemma 1. Every polygon with more than three vertices has a diagonal.<br />

Proof: Let P be a polygon with more than three vertices. Every vertex of P is either convex or<br />

concave, depending on whether it points into or out of P , respectively. Let q be a convex vertex,<br />

and let p and r be the vertices on either side of q. For example, let q be the leftmost vertex. (If<br />

there is more than one leftmost vertex, let q be the lowest one.) If pr is a diagonal, we’re done;<br />

in this case, we say that the triangle △pqr is an ear.<br />

If pr is not a diagonal, then △pqr must contain another vertex of the polygon. Out of all the<br />

vertices inside △pqr, let s be the vertex furthest away from the line ←→ pr. In other words, if we take<br />

a line parallel to ←→ pr through q, and translate it towards ←→ pr, then s is the first vertex that the<br />

line hits. Then the line segment qs is a diagonal. <br />

q<br />

p<br />

r<br />

The leftmost vertex q is the tip of an ear, so pr is a diagonal.<br />

The rightmost vertex q ′ is not, since △p ′ q ′ r ′ contains three other vertices. In this case, q ′ s ′ is a diagonal.<br />

G.3 A Slow Algorithm<br />

Meisters’s existence proof immediately gives us an algorithm to compute a diagonal in linear time.<br />

The input to our algorithm is just an array of vertices in counterclockwise order around the polygon.<br />

First, we can find the (lowest) leftmost vertex q in O(n) time by comparing the x-coordinates of<br />

the vertices (using y-coordinates to break ties). Next, we can determine in O(n) time whether<br />

the triangle △pqr contains any of the other n − 3 vertices. Specifically, we can check whether one<br />

point lies inside a triangle by performing three counterclockwise tests. Finally, if the triangle is not<br />

empty, we can find the vertex s in O(n) time by comparing the areas of the triangles △prs; we<br />

can compute this area using the counterclockwise determinant.<br />

Here’s the algorithm in excruciating detail. We need three support subroutines to compute the<br />

area of a triangle, to determine if three points are in counterclockwise order, and to determine if a<br />

point is inside a triangle.<br />

〈〈Return twice the signed area of △P[i]P[j]P[k]〉〉<br />
Area(i, j, k):<br />
  return (P[k].y − P[i].y)(P[j].x − P[i].x) − (P[k].x − P[i].x)(P[j].y − P[i].y)<br />

〈〈Are P[i], P[j], P[k] in counterclockwise order?〉〉<br />
CCW(i, j, k):<br />
  return Area(i, j, k) > 0<br />

〈〈Is P[i] inside △P[p]P[q]P[r]?〉〉<br />
Inside(i, p, q, r):<br />
  return CCW(i, p, q) and CCW(i, q, r) and CCW(i, r, p)<br />


FindDiagonal(P[1 .. n]):<br />
  q ← 1<br />
  for i ← 2 to n<br />
    if P[i].x < P[q].x<br />
      q ← i<br />
  p ← q − 1 mod n<br />
  r ← q + 1 mod n<br />
  ear ← True<br />
  s ← p<br />
  for i ← 1 to n<br />
    if i ≠ p and i ≠ q and i ≠ r and Inside(i, p, q, r)<br />
      ear ← False<br />
      if Area(i, r, p) > Area(s, r, p)<br />
        s ← i<br />
  if ear = True<br />
    return (p, r)<br />
  else<br />
    return (q, s)<br />
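The same algorithm can be sketched in Python (names are mine; P is a list of (x, y) vertex tuples in counterclockwise order):

```python
def area2(P, i, j, k):
    """Twice the signed area of triangle P[i] P[j] P[k]."""
    return ((P[k][1] - P[i][1]) * (P[j][0] - P[i][0])
            - (P[k][0] - P[i][0]) * (P[j][1] - P[i][1]))

def ccw(P, i, j, k):
    """Are P[i], P[j], P[k] in counterclockwise order?"""
    return area2(P, i, j, k) > 0

def inside(P, i, p, q, r):
    """Is P[i] strictly inside triangle P[p] P[q] P[r]?"""
    return ccw(P, i, p, q) and ccw(P, i, q, r) and ccw(P, i, r, p)

def find_diagonal(P):
    """Return a pair of vertex indices spanning a diagonal of P."""
    n = len(P)
    # the lowest leftmost vertex: smallest x, ties broken by smallest y
    q = min(range(n), key=lambda i: P[i])
    p, r = (q - 1) % n, (q + 1) % n
    ear, s = True, p
    for i in range(n):
        if i not in (p, q, r) and inside(P, i, p, q, r):
            ear = False
            if area2(P, i, r, p) > area2(P, s, r, p):
                s = i            # s is the vertex furthest from line pr
    return (p, r) if ear else (q, s)
```

On a square, the leftmost vertex is an ear tip, so the diagonal joins its two neighbors; on a dart-shaped polygon with a reflex vertex inside the candidate ear, the diagonal joins q to the blocking vertex instead.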

Once we have a diagonal, we can recursively triangulate the two pieces. The worst-case running<br />

time of this algorithm satisfies almost the same recurrence as quicksort:<br />

T(n) ≤ max_{2≤k≤n−2} (T(k + 1) + T(n − k + 1)) + O(n).<br />

Just like quicksort, the solution is T(n) = O(n²). So we can now triangulate any polygon in<br />

quadratic time.<br />

G.4 Faster Special Cases<br />

For certain special cases of polygons, we can do much better than O(n²) time. For example, we<br />

can easily triangulate any convex polygon by connecting any vertex to every other vertex. Since<br />

we’re given the counterclockwise order of the vertices as input, this takes only O(n) time.<br />

Triangulating a convex polygon is easy.<br />
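The fan triangulation is a one-liner in code. This sketch (the function name is mine) returns index triples for a convex n-gon whose vertices are given in counterclockwise order:

```python
def fan_triangulate(n):
    """Fan triangulation of a convex n-gon: connect vertex 0 to every
    other vertex, producing exactly n - 2 triangles in O(n) time."""
    return [(0, i, i + 1) for i in range(1, n - 1)]
```

Note the output size matches the count from the induction proof: n − 2 triangles.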

Another easy special case is monotone mountains. A polygon is monotone if any vertical line<br />

intersects the boundary in at most two points. A monotone polygon is a mountain if it contains<br />

an edge from the rightmost vertex to the leftmost vertex. Every monotone polygon consists of two<br />

chains of edges going left to right between the two extreme vertices; for mountains, one of these<br />

chains is a single edge.<br />

A monotone mountain and a monotone non-mountain.<br />


Triangulating a monotone mountain is extremely easy, since every convex vertex is the tip of<br />

an ear, except possibly for the vertices on the far left and far right. Thus, all we have to do is<br />

scan through the intermediate vertices, and when we find a convex vertex, cut off the ear. The<br />

simplest method for doing this is probably the three-penny algorithm used in the “Graham’s scan”<br />

convex hull algorithm—instead of filling in the outside of a polygon with triangles, we’re filling in<br />

the inside, but otherwise it’s the same process. This takes O(n) time.<br />

Triangulating a monotone mountain. (Some of the triangles are very thin.)<br />
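The three-penny scan can be sketched as follows. This is my own minimal version, assuming the non-base chain is given left to right and lies above the base edge joining its first and last vertices; whenever the last two stacked vertices and the new vertex make a right turn, the middle vertex is a convex ear tip, so we cut off the ear.

```python
def cross(a, b, c):
    """Cross product (b - a) x (c - a); negative means a right turn."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def triangulate_mountain(chain):
    """Triangulate a monotone mountain given its upper chain, left to
    right. Returns a list of vertex triples, one per triangle."""
    triangles, stack = [], list(chain[:2])
    for v in chain[2:]:
        while len(stack) >= 2 and cross(stack[-2], stack[-1], v) < 0:
            triangles.append((stack[-2], stack[-1], v))   # cut off the ear
            stack.pop()
        stack.append(v)
    return triangles
```

Each vertex is pushed once and popped at most once, so the scan runs in O(n) time and outputs exactly n − 2 triangles.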

We can also triangulate general monotone polygons in linear time, but the process is more<br />

complicated. A good way to visualize the algorithm is to think of the polygon as a complicated<br />

room. Two people named Tom and Bob are walking along the top and bottom walls, both starting<br />

at the left end and going to the right. At all times, they have a rubber band stretched between<br />

them that can never leave the room.<br />


A rubber band stretched between a vertex on the top and a vertex on the bottom of a monotone polygon.<br />

Now we loop through all the vertices of the polygon in order from left to right. Whenever<br />

we see a new bottom vertex, Bob moves onto it, and whenever we see a new top vertex, Tom<br />

moves onto it. After either person moves, we cut the polygon along the rubber band. (In fact, this<br />

will only cut the polygon along a single diagonal at any step.) When we’re done, the polygon is<br />

decomposed into triangles and boomerangs—nonconvex polygons consisting of two straight edges<br />

and a concave chain. A boomerang can only be triangulated in one way, by joining the apex to<br />

every vertex in the concave chain.<br />


Triangulating a monotone polygon by walking a rubber band from left to right.<br />

I don’t want to go into too many implementation details, but a few observations should convince<br />

you that this algorithm can be implemented to run in O(n) time. Notice that at all times, the<br />

rubber band forms a concave chain. The leftmost edge in the rubber band joins a top vertex to<br />

a bottom vertex. If the rubber band has any other vertices, either they are all on top or all on<br />

bottom. If all the other vertices are on top, there are just three ways the rubber band can change:<br />

1. The bottom vertex changes, the rubber band straightens out, and we get a new boomerang.<br />

2. The top vertex changes and the rubber band gets a new concave vertex.<br />

3. The top vertex changes, the rubber band loses some vertices, and we get a new boomerang.<br />

Deciding between the first case and the other two requires a simple comparison between x-coordinates.<br />

Deciding between the last two requires a counterclockwise test.<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture H: Lower Bounds Fall 2002<br />

H Lower Bounds<br />

H.1 What Are Lower Bounds?<br />

Number Six: What do you want?<br />

Number Two: Information!<br />

Number Six: Whose side are you on?<br />

Number Two: That would be telling. We want information!<br />

Number Six: You won’t get it!<br />

Number Two: By hook or by crook, we will!<br />

— Opening sequence of ‘The Prisoner’ (1967–68)<br />

So far in this class we’ve been developing algorithms and data structures for solving certain problems<br />

and analyzing their time and space complexity.<br />

Let TA(X) denote the running time of algorithm A given input X. Then the worst-case running time<br />

of A for inputs of size n is defined as follows:<br />

TA(n) = max_{|X|=n} TA(X).<br />

The worst-case complexity of a problem Π is the worst-case running time of the fastest algorithm<br />

for solving it:<br />

TΠ(n) = min_{A solves Π} TA(n) = min_{A solves Π} max_{|X|=n} TA(X).<br />

Now suppose we’ve shown that the worst-case running time of an algorithm A is O(f(n)). Then<br />

we immediately have an upper bound for the complexity of Π:<br />

TΠ(n) ≤ TA(n) = O(f(n)).<br />

The faster our algorithm, the better our upper bound. In other words, when we give a running<br />

time for an algorithm, what we’re really doing — and what most theoretical computer scientists<br />

devote their entire careers to doing 1 — is bragging about how easy some problem is.<br />

Starting with this lecture, we’ve turned the tables. Instead of bragging about how easy problems<br />

are, now we’re arguing that certain problems are hard by proving lower bounds on their complexity.<br />

This is a little harder, because it’s no longer enough to examine a single algorithm. To show that<br />

TΠ(n) = Ω(f(n)), we have to prove that every algorithm that solves Π has a worst-case running<br />

time Ω(f(n)), or equivalently, that no algorithm runs in o(f(n)) time.<br />

1 This sometimes leads to long sequences of results that sound like an obscure version of “Name that Tune”:<br />

Lennes: “I can triangulate that polygon in O(n 2 ) time.”<br />

Shamos: “I can triangulate that polygon in O(n log n) time.”<br />

Tarjan: “I can triangulate that polygon in O(n log log n) time.”<br />

Seidel: “I can triangulate that polygon in O(n log* n) time.”<br />

[Audience gasps.]<br />

Chazelle: “I can triangulate that polygon in O(n) time.”<br />

[Audience gasps and applauds.]<br />

“Triangulate that polygon!”<br />


H.2 Decision Trees<br />

Unfortunately, there is no formal definition of the phrase ‘all algorithms’! 2 So when we derive lower<br />

bounds, we first have to specify, formally, what an algorithm is and how to measure its running<br />

time. This specification is called a model of computation.<br />

One rather powerful model of computation is decision trees. A decision tree is (as the name<br />

suggests) a tree. Each internal node in the tree is labeled by a query, which is just a question about<br />

the input. The edges out of a node correspond to the various answers to the query. Each leaf of<br />

the tree is labeled with an output. To compute with a decision tree, start at the root and follow a<br />

path down to a leaf. At each internal node, the answer to the query tells you which node to visit<br />

next. When you reach a leaf, output its label.<br />

For example, the guessing game where one person thinks of an animal and the other person tries<br />

to figure it out with a series of yes/no questions can be modeled as a decision tree. Each internal<br />

node is labeled with a question and has two edges labeled ‘yes’ and ‘no’. Each leaf is labeled with<br />

an animal.<br />

A decision tree to choose one of six animals.<br />

Here’s another simple example, called the dictionary problem. Let A be a fixed array with<br />

n numbers. Suppose we want to determine, given a number x, the position of x in the array A, if<br />

any. One solution to the dictionary problem is to sort A (remembering every element’s original<br />

position) and then use binary search. The (implicit) binary search tree can be used almost directly<br />

as a decision tree. Each internal node in the search tree stores a key k; the corresponding node<br />

in the decision tree stores the question ‘Is x < k?’. Each leaf in the search tree stores some value<br />

A[i]; the corresponding node in the decision tree asks ‘Is x = A[i]?’ and has two leaf children, one<br />

labeled ‘i’ and the other ‘none’.<br />


We define the running time of a decision tree algorithm for a given input to be the number of<br />

queries in the path from the root to the leaf. For example, in the ‘Guess the animal’ tree above,<br />

T (frog) = 2. Thus, the worst-case running time of the algorithm is just the depth of the tree. This<br />

definition ignores other kinds of operations that the algorithm might perform that have nothing to<br />

do with the queries. (Even the most efficient binary search algorithm requires more than one machine<br />

instruction per comparison!) But the number of decisions is certainly a lower bound on the actual<br />

running time, which is good enough to prove a lower bound on the complexity of a problem.<br />

Both of the examples describe binary decision trees, where every query has only two answers.<br />

We may sometimes want to consider decision trees with higher degree. For example, we might use<br />

queries like ‘Is x greater than, equal to, or less than y?’ or ‘Are these three points in clockwise<br />

order, collinear, or in counterclockwise order?’ A k-ary decision tree is one where every query has<br />

(at most) k different answers. From now on, I will only consider k-ary decision trees where<br />

k is a constant.<br />

H.3 Information Theory<br />

Most lower bounds for decision trees are based on the following simple observation: the answers to<br />

the queries must give you enough information to specify any possible output. If a problem has N<br />

different outputs, then obviously any decision tree must have at least N leaves. (It’s possible for<br />

several leaves to specify the same output.) Thus, if every query has at most k possible answers,<br />

then the depth of the decision tree must be at least ⌈log_k N⌉ = Ω(log N).<br />

Let’s apply this to the dictionary problem for a set S of n numbers. Since there are n + 1<br />

possible outputs, any decision tree must have at least n + 1 leaves, and thus any decision tree must<br />

have depth at least ⌈log_k(n + 1)⌉ = Ω(log n). So the complexity of the dictionary problem, in<br />

the decision-tree model of computation, is Ω(log n). This matches the upper bound O(log n) that<br />

comes from a perfectly-balanced binary search tree. That means that the standard binary search<br />

algorithm, which runs in O(log n) time, is optimal—there is no faster algorithm in this model of<br />

computation.<br />

H.4 But wait a second. . .<br />

We can solve the membership problem in O(1) expected time using hashing. Isn’t this inconsistent<br />

with the Ω(log n) lower bound?<br />

No, it isn’t. The reason is that hashing involves a query with more than a constant number<br />

of outcomes, specifically ‘What is the hash value of x?’ In fact, if we don’t restrict the degree of<br />

the decision tree, we can get constant running time even without hashing, by using the obviously<br />

unreasonable query ‘For which index i (if any) is A[i] = x?’. No, I am not cheating — remember<br />

that the decision tree model allows us to ask any question about the input!<br />

This example illustrates a common theme in proving lower bounds: choosing the right model<br />

of computation is absolutely crucial. If you choose a model that is too powerful, the problem<br />

you’re studying may have a completely trivial algorithm. On the other hand, if you consider more<br />

restrictive models, the problem may not be solvable at all, in which case any lower bound will be<br />

meaningless! (In this class, we’ll just tell you the right model of computation to use.)<br />

H.5 Sorting<br />

Now let’s consider the sorting problem — Given an array of n numbers, arrange them in increasing<br />

order. Unfortunately, decision trees don’t have any way of describing moving data around, so we<br />

have to rephrase the question slightly:<br />


Given a sequence 〈x1, x2, . . . , xn〉 of n distinct numbers, find the permutation π such<br />

that x_π(1) < x_π(2) < · · · < x_π(n).<br />

Now a k-ary decision-tree lower bound is immediate. Since there are n! possible permutations π,<br />

any decision tree for sorting must have at least n! leaves, and so must have depth Ω(log(n!)). To<br />

simplify the lower bound, we apply Stirling’s approximation<br />

n! = √(2πn) (n/e)^n (1 + Θ(1/n)) > (n/e)^n.<br />

This gives us the lower bound<br />

⌈log_k(n!)⌉ > ⌈log_k (n/e)^n⌉ = ⌈n log_k n − n log_k e⌉ = Ω(n log n).<br />

This matches the O(n log n) upper bound that we get from mergesort, heapsort, or quicksort, so<br />

those algorithms are optimal. The decision-tree complexity of sorting is Θ(n log n).<br />
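A quick numeric sanity check confirms that log₂(n!) really does grow like n log₂ n. This snippet is my own; `lgamma` computes log(n!) without ever forming the astronomically large factorial.

```python
import math

def log2_factorial(n):
    """log2(n!), computed via the log-gamma function to avoid overflow."""
    return math.lgamma(n + 1) / math.log(2)
```

For n = 1000, the value sits between (n/2) log₂(n/2) and n log₂ n, matching the bounds n! ≥ (n/2)^(n/2) and n! ≤ nⁿ.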

Well. . . we’re not quite done. In order to say that those algorithms are optimal, we have to<br />

demonstrate that they fit into our model of computation. A few minutes’ thought will convince<br />

you that they can be described as a special type of decision tree called a comparison tree, where<br />

every query is of the form ‘Is xi bigger or smaller than xj?’ These algorithms treat any two input<br />

sequences exactly the same way as long as the same comparisons produce exactly the same results.<br />

This is a feature of any comparison tree. In other words, the actual input values don’t matter,<br />

only their order. Comparison trees describe almost all sorting algorithms: bubble sort, selection<br />

sort, insertion sort, shell sort, quicksort, heapsort, mergesort, and so forth — but not radix sort or<br />

bucket sort.<br />

H.6 Finding the Maximum and Adversaries<br />

Finally let’s consider the maximum problem: Given an array X of n numbers, find its largest entry.<br />

Unfortunately, there’s no hope of proving a lower bound in this formulation, since there are an<br />

infinite number of possible answers, so let’s rephrase it slightly.<br />

Given a sequence 〈x1, x2, . . . , xn〉 of n distinct numbers, find the index m such that xm<br />

is the largest element in the sequence.<br />

We can get an upper bound of n − 1 comparisons in several different ways. The easiest is<br />

probably to start at one end of the sequence and do a linear scan, maintaining a current maximum.<br />

Intuitively, this seems like the best we can do, but the information-theoretic lower bound is only<br />

⌈log_2 n⌉.<br />

To prove that n − 1 comparisons are actually necessary, we use something called an adversary<br />

argument. The idea is that an all-powerful malicious adversary pretends to choose an input for<br />

the algorithm. When the algorithm asks a question about the input, the adversary answers in<br />

whatever way will make the algorithm do the most work. If the algorithm does not ask enough<br />

queries before terminating, then there will be several different inputs, each consistent with the<br />

adversary’s answers, that should result in different outputs. In this case, whatever the algorithm<br />

outputs, the adversary can ‘reveal’ an input that is consistent with its answers, but contradicts the<br />

algorithm’s output, and then claim that that was the input that he was using all along.<br />

For the maximum problem, the adversary originally pretends that xi = i for all i, and answers<br />

all comparison queries appropriately. Whenever the adversary reveals that xi < xj, he marks xi<br />

as an item that the algorithm knows (or should know) is not the maximum element. At most<br />


one element xi is marked after each comparison. Note that xn is never marked. If the algorithm<br />

does less than n − 1 comparisons before it terminates, the adversary must have at least one other<br />

unmarked element xk ≠ xn. In this case, the adversary can change the value of xk from k to n + 1,<br />

making xk the largest element, without being inconsistent with any of the comparisons that the<br />

algorithm has performed. In other words, the algorithm cannot tell that the adversary has cheated.<br />

However, xn is the maximum element in the original input, and xk is the largest element in the<br />

modified input, so the algorithm cannot possibly give the correct answer for both cases. Thus, in<br />

order to be correct, any algorithm must perform at least n − 1 comparisons.<br />
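The adversary's bookkeeping can be simulated directly. The sketch below (class and function names are mine) pretends x[i] = i, marks the loser of each comparison as a known non-maximum, and counts comparisons; running the obvious linear scan against it shows the scan uses exactly n − 1 comparisons and leaves only xₙ unmarked.

```python
class MaxAdversary:
    """Adversary for the maximum problem: pretends x[i] = i and marks
    the smaller element of every comparison as a known non-maximum."""

    def __init__(self, n):
        self.n = n
        self.marked = set()        # elements known not to be the maximum
        self.comparisons = 0

    def is_less(self, i, j):
        """Answer the query 'is x[i] < x[j]?' for the pretended input."""
        self.comparisons += 1
        self.marked.add(min(i, j))  # the smaller one can't be the max
        return i < j

def find_max(adv, n):
    """The obvious linear scan, phrased as queries to the adversary."""
    m = 0
    for i in range(1, n):
        if adv.is_less(m, i):
            m = i
    return m
```

Any algorithm that stops with two or more unmarked elements is vulnerable: the adversary can raise an unmarked xₖ above everything else without contradicting any answer it gave.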

It is very important to notice that the adversary makes no assumptions about the order in<br />

which the algorithm does its comparisons. The adversary forces any algorithm (in this model of<br />

computation 3 ) to either perform n − 1 comparisons, or to give the wrong answer for at least one<br />

input sequence. Notice also that no algorithm can distinguish between a malicious adversary and<br />

an honest user who actually chooses an input in advance and answers all queries truthfully.<br />

In the next lecture, we’ll see several more complicated adversary arguments.<br />

3 Actually, the n − 1 lower bound for finding the maximum holds in a much more powerful model called algebraic decision<br />

trees, which are binary trees where every query is a comparison between two polynomial functions of the input values,<br />

such as ‘Is x₁² − 3x₂x₃ + x₄¹⁷ bigger or smaller than 5 + x₁x₃⁵x₅² − 2x₇⁴²?’<br />



<strong>CS</strong> <strong>373</strong> Non-Lecture I: Adversaries Fall 2002<br />

I Adversary Arguments<br />

I.1 Three-Card Monte<br />

It is possible that the operator could be hit by an asteroid and your $20 could fall off<br />

his cardboard box and land on the ground, and while you were picking it up, $5 could<br />

blow into your hand. You therefore could win $5 by a simple twist of fate.<br />

— Penn Jillette, explaining how to win at Three-Card Monte (1999)<br />

Until Times Square was turned into TimesSquareLand by Mayor Giuliani, you could often find<br />

dealers stealing tourists’ money using a game called ‘Three Card Monte’ or ‘Spot the Lady’. The<br />

dealer has three cards, say the Queen of Hearts and the two and three of clubs. The dealer shuffles<br />

the cards face down on a table (usually slowly enough that you can follow the Queen), and then<br />

asks the tourist to bet on which card is the Queen. In principle, the tourist’s odds of winning are<br />

at least one in three.<br />

In practice, however, the tourist never wins, because the dealer cheats. There are actually four<br />

cards; before he even starts shuffling the cards, the dealer palms the queen or sticks it up his sleeve.<br />

No matter what card the tourist bets on, the dealer turns over a black card. If the tourist gives<br />

up, the dealer slides the queen under one of the cards and turns it over, showing the tourist ‘where<br />

the queen was all along’. If the dealer is really good, the tourist won’t see the dealer changing the<br />

cards and will think maybe the queen was there all along and he just wasn’t smart enough to figure<br />

that out. 1 As long as the dealer doesn’t reveal all the black cards at once, the tourist has no way<br />

to prove that the dealer cheated!<br />

I.2 n-Card Monte<br />

Now let’s consider a similar game, but with an algorithm acting as the tourist and with bits instead<br />

of cards. Suppose we have an array of n bits and we want to determine if any of them is a 1.<br />

Obviously we can figure this out by just looking at every bit, but can we do better? Is there maybe<br />

some complicated tricky algorithm to answer the question “Any ones?” without looking at every<br />

bit? Well, of course not, but how do we prove it?<br />

The simplest proof technique is called an adversary argument. The idea is that an all-powerful<br />

malicious adversary (the dealer) pretends to choose an input for the algorithm (the tourist). When<br />

the algorithm wants to look at a bit (a card), the adversary sets that bit to whatever value will make<br />

the algorithm do the most work. If the algorithm does not look at enough bits before terminating,<br />

then there will be several different inputs, each consistent with the bits already seen, that should<br />

result in different outputs. Whatever the algorithm outputs, the adversary can ‘reveal’ an input<br />

that is consistent with all the examined bits but contradicts the algorithm’s output, and then claim that that<br />

was the input that he was using all along. Since the only information the algorithm has is the set<br />

of bits it examined, the algorithm cannot distinguish between a malicious adversary and an honest<br />

user who actually chooses an input in advance and answers all queries truthfully.<br />

For the n-card monte problem, the adversary originally pretends that the input array is all<br />

zeros—whenever the algorithm looks at a bit, it sees a 0. Now suppose the algorithm stops before<br />

looking at all n bits. If the algorithm says ‘No, there’s no 1,’ the adversary changes one of the<br />

unexamined bits to a 1 and shows the algorithm that it’s wrong. If the algorithm says ‘Yes, there’s<br />

a 1,’ the adversary reveals the array of zeros and again proves the algorithm wrong. Either way,<br />

the algorithm cannot tell that the adversary has cheated.<br />

1 He’s right about the second part!<br />
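This adversary is easy to act out in code. Here is a minimal Python sketch (the names fool, probe, and lazy are mine, not from the notes): the adversary answers 0 to every probe, and if the algorithm halts after probing fewer than n bits, the adversary produces an input that contradicts whatever the algorithm answered.<br />

```python
def fool(algorithm, n):
    """Adversary for 'any ones?': answer 0 to every probe. If the algorithm
    probes fewer than n bits, return an input that contradicts its answer."""
    probed = set()

    def probe(i):
        probed.add(i)
        return 0                      # the algorithm always sees a 0

    answer = algorithm(probe, n)      # algorithm returns True iff "some bit is 1"
    if len(probed) < n:
        if answer:                    # said 'yes': reveal the all-zeros input
            return [0] * n
        bad = [0] * n                 # said 'no': flip an unexamined bit to 1
        bad[next(j for j in range(n) if j not in probed)] = 1
        return bad
    return None                       # the algorithm looked at every bit

# A (necessarily wrong) algorithm that gives up after n - 1 probes:
def lazy(probe, n):
    return any(probe(i) for i in range(n - 1))

bad_input = fool(lazy, 8)
assert bad_input == [0] * 7 + [1]     # the adversary refutes lazy's 'no' answer
```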

1


CS 373 Non-Lecture I: Adversaries Fall 2002<br />

It is important to notice that the adversary makes absolutely no assumptions about the algorithm.<br />

The adversary strategy can’t depend on some predetermined order of examining bits, and<br />

it doesn’t care about anything the algorithm might or might not do when it’s not looking at bits.<br />

Any algorithm that doesn’t examine every bit falls victim to the adversary.<br />

I.3 Finding Patterns in Bit Strings<br />

Let’s make the problem a little more complicated. Suppose we’re given an array of n bits and we<br />

want to know if it contains the substring 01, a zero followed immediately by a one. Can we answer<br />

this question without looking at every bit?<br />

It turns out that if n is odd, we don’t have to look at all the bits. First we look at the bits in<br />

every even position: B[2], B[4], . . . , B[n − 1]. If we see B[i] = 0 and B[j] = 1 for any i < j, then<br />

we know the pattern 01 is in there somewhere—starting at the last 0 before B[j]—so we can stop<br />

without looking at any more bits. If we see only 1s followed by 0s, we don’t have to look at the<br />

bit between the last 1 and the first 0. If every even bit is a 0, we don’t have to look at B[1], and if<br />

every even bit is a 1, we don’t have to look at B[n]. In the worst case, our algorithm looks at only<br />

n − 1 of the n bits.<br />

But what if n is even? In that case, we can use the following adversary strategy to show that<br />

any algorithm does have to look at every bit. The adversary is going to produce an input of the<br />

form 11 . . . 100 . . . 0. The adversary maintains two indices ℓ and r and pretends that everything to<br />

the left of ℓ is a 1 and everything to the right of r is a 0. Initially ℓ = 0 and r = n + 1.<br />

111111✷✷✷✷✷✷0000<br />

↑ ↑<br />

ℓ r<br />

What the adversary is thinking; ✷ represents an unknown bit.<br />

The adversary maintains the invariant that r − ℓ − 1, the number of unexamined bits between ℓ<br />

and r, is even. When the algorithm looks at a bit between ℓ and r, the adversary chooses whichever<br />

value preserves this parity, and then moves either ℓ or r.<br />

Specifically, here’s what the adversary does right before the algorithm examines the bit B[i].<br />

(Note that I’m specifying the adversary strategy as an algorithm!)<br />

Hide01(i):<br />

if i ≤ ℓ<br />

B[i] ← 1<br />

else if i ≥ r<br />

B[i] ← 0<br />

else if i − ℓ is odd<br />

B[i] ← 0<br />

r ← i<br />

else<br />

B[i] ← 1<br />

ℓ ← i<br />

It’s fairly easy to prove that this strategy forces the algorithm to examine every bit. If the<br />

algorithm doesn’t look at every bit to the right of r, the adversary could replace some unexamined<br />

bit with a 1. Similarly, if the algorithm doesn’t look at every bit to the left of ℓ, the adversary<br />

could replace some unexamined bit with a zero. Finally, if there are any unexamined bits between<br />

ℓ and r, there must be at least two such bits (since r − ℓ − 1 is always even) and the adversary can put<br />

a 01 in the gap.<br />
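The Hide01 strategy can be simulated directly. Below is one consistent Python implementation (the name hide01_run is mine): the adversary answers so that the unexamined gap between ℓ and r always has even length, and whatever query order the algorithm uses, the fully revealed string has the form 11...100...0, so it never contains 01.<br />

```python
import random

def hide01_run(n, order):
    """Answer queries for B[1..n] (n even) so that the final string is
    1...10...0, keeping the unexamined gap between l and r even."""
    assert n % 2 == 0
    l, r = 0, n + 1
    B = {}
    for i in order:                      # i ranges over 1..n
        if i <= l:
            B[i] = 1
        elif i >= r:
            B[i] = 0
        elif (i - l) % 2 == 1:           # this choice keeps r - l - 1 even
            B[i], r = 0, i
        else:
            B[i], l = 1, i
        assert (r - l - 1) % 2 == 0      # invariant: gap length stays even
    return ''.join(str(B[i]) for i in range(1, n + 1))

random.seed(5)
for trial in range(50):
    n = random.choice([4, 6, 8, 10])
    order = random.sample(range(1, n + 1), n)   # a random query order
    assert '01' not in hide01_run(n, order)     # the revealed input has no 01
```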


In general, we say that a bit pattern is evasive if we have to look at every bit to decide if a<br />

string of n bits contains the pattern. So the pattern 1 is evasive for all n, and the pattern 01 is<br />

evasive if and only if n is even. It turns out that the only patterns that are evasive for all values<br />

of n are the one-bit patterns 0 and 1.<br />

I.4 Evasive Graph Properties<br />

Another class of problems for which adversary arguments give good lower bounds is graph problems<br />

where the graph is represented by an adjacency matrix, rather than an adjacency list. Recall that<br />

the adjacency matrix of an undirected n-vertex graph G = (V, E) is an n × n matrix A, where<br />

A[i, j] = 1 if (i, j) ∈ E and 0 otherwise. We are interested in deciding whether an undirected graph has or does not<br />

have a certain property. For example, is the input graph connected? Acyclic? Planar? Complete?<br />

A tree? We call a graph property evasive if we have to look at all n(n − 1)/2 entries in the adjacency<br />

matrix to decide whether a graph has that property.<br />

An obvious example of an evasive graph property is emptiness: Does the graph have any edges<br />

at all? We can show that emptiness is evasive using the following simple adversary strategy. The<br />

adversary maintains two graphs E and G. E is just the empty graph with n vertices. Initially G<br />

is the complete graph on n vertices. Whenever the algorithm asks about an edge, the adversary<br />

removes that edge from G (unless it’s already gone) and answers ‘no’. If the algorithm terminates<br />

without examining every edge, then G is not empty. Since both G and E are consistent with all<br />

the adversary’s answers, the algorithm must give the wrong answer for one of the two graphs.<br />
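The emptiness adversary is only a few lines long. In this Python sketch (names mine), the ‘maybe’ graph G starts complete and loses each queried edge; any edge the algorithm never asks about keeps G nonempty, while the empty graph remains equally consistent with every ‘no’ answer.<br />

```python
from itertools import combinations

def emptiness_adversary(n, queries):
    """Answer 'no edge' to every query, deleting queried edges from G.
    Returns the graph G still consistent with every answer given."""
    G = set(combinations(range(n), 2))   # 'maybe' graph: starts complete
    for e in queries:                    # every query is answered 'no'
        G.discard(tuple(sorted(e)))
    return G                             # edges the algorithm never asked about

n = 5
all_edges = list(combinations(range(n), 2))
G = emptiness_adversary(n, all_edges[:-1])   # the algorithm skips one edge
# Both the empty graph and G agree with every answer, but G is nonempty:
assert G == {all_edges[-1]}
```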

I.5 Connectedness Is Evasive<br />

Now let me give a more complicated example, connectedness. Once again, the adversary maintains<br />

two graphs, Y and M (‘yes’ and ‘maybe’). Y contains all the edges that the algorithm knows are<br />

definitely in the input graph. M contains all the edges that the algorithm thinks might be in the<br />

input graph, or in other words, all the edges of Y plus all the unexamined edges. Initially, Y is<br />

empty and M is complete.<br />

Here’s the strategy that the adversary follows when the algorithm asks whether the input graph<br />

contains the edge e. I’ll assume that whenever an algorithm examines an edge, it’s in M but not<br />

in Y ; in other words, algorithms never ask about the same edge more than once.<br />

HideConnectedness(e):<br />

if M \ {e} is connected<br />

remove e from M<br />

return 0<br />

else<br />

add e to Y<br />

return 1<br />

Notice that Y and M are both consistent with the adversary’s answers. The adversary strategy<br />

maintains a few other simple invariants.<br />

• Y is a subgraph of M.<br />

• M is connected.<br />

• If M has a cycle, none of its edges are in Y . If an edge e lies on a cycle of M, then<br />

deleting e leaves M connected, so the adversary removes e from M instead of adding it to Y .<br />


• Y is acyclic. Obvious from the previous invariant.<br />

• If Y ≠ M, then Y is disconnected. The only connected acyclic graph is a tree. Suppose<br />

Y is a tree and some edge e is in M but not Y . Then there is a cycle in M that contains e,<br />

all of whose other edges are in Y . This is impossible.<br />

Now, if an algorithm terminates before examining all n(n − 1)/2 edges, then there is an edge in M that<br />

is not in Y . Since the algorithm cannot distinguish between M and Y , even though M is connected<br />

and Y is not, the algorithm cannot possibly give the correct output for both graphs. Thus, in order<br />

to be correct, any algorithm must examine every edge—Connectedness is evasive!<br />
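Here is a small Python simulation of HideConnectedness (names mine; the helper connected is a plain graph-search connectivity check). The asserts check the invariants from the bullet list above: M stays connected, and Y stays disconnected until Y = M, which happens only after every edge has been examined.<br />

```python
from itertools import combinations
import random

def connected(nv, edges):
    """Depth-first connectivity check on the vertex set {0, ..., nv-1}."""
    adj = {v: [] for v in range(nv)}
    for u, w in edges:
        adj[u].append(w)
        adj[w].append(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == nv

def hide_connectedness(nv, queries):
    Y, M = set(), set(combinations(range(nv), 2))
    for e in queries:
        e = tuple(sorted(e))
        if connected(nv, M - {e}):
            M.discard(e)                          # answer 0: the edge is absent
        else:
            Y.add(e)                              # answer 1: the edge is present
        assert connected(nv, M)                   # invariant: M stays connected
        assert Y == M or not connected(nv, Y)     # Y disconnected until Y = M
    return Y, M

random.seed(1)
edges = list(combinations(range(6), 2))
random.shuffle(edges)
Y, M = hide_connectedness(6, edges)
assert Y == M and len(Y) == 5    # after all queries, Y = M is a spanning tree
```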

I.6 An Evasive Conjecture<br />

A graph property is nontrivial if there is at least one graph with the property and at least one<br />

graph without the property. (The only trivial properties are ‘Yes’ and ‘No’.) A graph property<br />

is monotone if it is closed under taking subgraphs — if G has the property, then any subgraph<br />

of G has the property. For example, emptiness, planarity, acyclicity, and non-connectedness are<br />

monotone. The properties of being a tree and of having a vertex of degree 3 are not monotone.<br />

Conjecture 1 (Aanderaa, Karp, and Rosenberg). Every nontrivial monotone property of<br />

n-vertex graphs is evasive.<br />

The Aanderaa-Karp-Rosenberg conjecture has been proven when n = p^e for some prime p and<br />

positive integer exponent e—the proof uses some interesting results from algebraic topology 2 —but<br />

it is still open for other values of n.<br />

Incidentally, this was really the point of the ‘scorpion’ practice homework problem. ‘Scorpionhood’<br />

is an example of a nontrivial graph property that is not evasive, since there is an algorithm to<br />

identify scorpions by examining only O(n) of their edges. However, this property is not monotone—<br />

a subgraph of a scorpion is not necessarily a scorpion.<br />

I.7 Finding the Minimum and Maximum<br />

Last time, we saw an adversary argument that finding the largest element of an unsorted set of n<br />

numbers requires at least n − 1 comparisons. Let’s consider the complexity of finding the largest<br />

and smallest elements. More formally:<br />

Given a sequence X = 〈x1, x2, . . . , xn〉 of n distinct numbers, find indices i and j such<br />

that xi = min X and xj = max X.<br />

How many comparisons do we need to solve this problem? An upper bound of 2n − 3 is easy:<br />

find the minimum in n − 1 comparisons, and then find the maximum of everything else in n − 2<br />

comparisons. Similarly, a lower bound of n − 1 is easy, since any algorithm that finds the min and<br />

the max certainly finds the max.<br />

We can improve both the upper and the lower bound to ⌈3n/2⌉ − 2. The upper bound is<br />

established by the following algorithm. Compare all ⌊n/2⌋ consecutive pairs of elements x2i−1 and<br />

x2i, and put the smaller element into a set S and the larger element into a set L. If n is odd, put<br />

2 Let ∆ be a contractible simplicial complex whose automorphism group Aut(∆) is vertex-transitive, and let Γ be<br />

a vertex-transitive subgroup of Aut(∆). Suppose there are normal subgroups Γ1 ✁ Γ2 ✁ Γ such that |Γ1| = p^α for<br />

some prime p and integer α, |Γ/Γ2| = q^β for some prime q and integer β, and Γ2/Γ1 is cyclic. Then ∆ is a simplex.<br />

No, this will not be on the final exam.<br />


xn into both L and S. Then find the smallest element of S and the largest element of L. The total<br />

number of comparisons is at most<br />

⌊n/2⌋ + (⌈n/2⌉ − 1) + (⌈n/2⌉ − 1) = ⌈3n/2⌉ − 2,<br />

where the first term counts the comparisons used to build S and L, and the last two count the<br />

comparisons used to compute min S and max L.<br />
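The ⌈3n/2⌉ − 2 upper bound is easy to check by instrumenting the algorithm. A Python sketch (the name min_max is mine; it counts its own comparisons):<br />

```python
import math
import random

def min_max(x):
    """Find (min, max) of x using at most ceil(3n/2) - 2 comparisons (n >= 2)."""
    comparisons = 0
    def less(a, b):
        nonlocal comparisons
        comparisons += 1
        return a < b

    S, L = [], []                        # candidate minima / candidate maxima
    for i in range(0, len(x) - 1, 2):    # compare consecutive pairs
        a, b = x[i], x[i + 1]
        if less(a, b):
            S.append(a); L.append(b)
        else:
            S.append(b); L.append(a)
    if len(x) % 2 == 1:                  # odd n: last element joins both sets
        S.append(x[-1]); L.append(x[-1])

    lo, hi = S[0], L[0]
    for v in S[1:]:                      # min of S
        if less(v, lo):
            lo = v
    for v in L[1:]:                      # max of L
        if less(hi, v):
            hi = v
    return lo, hi, comparisons

random.seed(7)
for n in range(2, 40):
    x = random.sample(range(1000), n)
    lo, hi, c = min_max(x)
    assert lo == min(x) and hi == max(x)
    assert c <= math.ceil(3 * n / 2) - 2
```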

For the lower bound, we use an adversary argument. The adversary marks each element +<br />

if it might be the maximum element, and − if it might be the minimum element. Initially, the<br />

adversary puts both marks + and − on every element. If the algorithm compares two double-marked<br />

elements, then the adversary declares one smaller, removes the + mark from the smaller<br />

element, and removes the − mark from the larger one. In every other case, the adversary can<br />

answer so that at most one mark needs to be removed. For example, if the algorithm compares a<br />

double-marked element against one labeled −, the adversary says the one labeled − is smaller and<br />

removes the − mark from the other. If the algorithm compares two +’s, the adversary must unmark<br />

one of the two.<br />

Initially, there are 2n marks. At the end, in order to be correct, exactly one item must be marked<br />

+ and exactly one other must be marked −, since the adversary can make any + the maximum<br />

and any − the minimum. Thus, the algorithm must force the adversary to remove 2n − 2 marks.<br />

At most ⌊n/2⌋ comparisons remove two marks; every other comparison removes at most one mark.<br />

Thus, the adversary strategy forces any algorithm to perform at least 2n − 2 − ⌊n/2⌋ = ⌈3n/2⌉ − 2<br />

comparisons.<br />

I.8 Finding the Median<br />

Finally, let’s consider the median problem: Given an unsorted array X of n numbers, find its n/2th<br />

largest entry. (I’ll assume that n is even to eliminate pesky floors and ceilings.) More formally:<br />

Given a sequence 〈x1, x2, . . . , xn〉 of n distinct numbers, find the index m such that xm<br />

is the n/2th largest element in the sequence.<br />

To prove a lower bound for this problem, we can use a combination of information theory and<br />

two adversary arguments. We use one adversary argument to prove the following simple lemma:<br />

Lemma 1. Any comparison tree that correctly finds the median element also identifies the elements<br />

smaller than the median and larger than the median.<br />

Proof: Suppose we reach a leaf of a decision tree that chooses the median element xm, and there<br />

is still some element xi that isn’t known to be larger or smaller than xm. In other words, we cannot<br />

decide based on the comparisons that we’ve already performed whether xi < xm or xi > xm. Then<br />

in particular no element is known to lie between xi and xm. This means that there must be an<br />

input that is consistent with the comparisons we’ve performed, in which xi and xm are adjacent<br />

in sorted order. But then we can swap xi and xm, without changing the result of any comparison,<br />

and obtain a different consistent input in which xi is the median, not xm. Our decision tree gives<br />

the wrong answer for this ‘swapped’ input. <br />

This lemma lets us rephrase the median-finding problem yet again.<br />

Given a sequence X = 〈x1, x2, . . . , xn〉 of n distinct numbers, find the indices of its<br />

n/2 − 1 largest elements L and its n/2th largest element xm.<br />


Now suppose a ‘little birdie’ tells us the set L of elements larger than the median. This information<br />

fixes the outcomes of certain comparisons — any item in L is bigger than any element not in L<br />

— and so we can ‘prune’ those comparisons from the comparison tree. The pruned tree finds the<br />

largest element of X \ L (the median of X), and thus must have depth at least n/2 − 1. In fact, an<br />

adversary argument similar to last lecture’s implies that every leaf in the pruned tree must have<br />

depth at least n/2 − 1, so the pruned tree has at least 2^(n/2−1) leaves.<br />

There are (n choose n/2 − 1) ≈ 2^n /√(n/2) possible choices for the set L. Every leaf in the original<br />

comparison tree is also a leaf in exactly one of the (n choose n/2 − 1) pruned trees, so the original<br />

comparison tree must have at least (n choose n/2 − 1) · 2^(n/2−1) ≈ 2^(3n/2) /√(n/2) leaves. Thus, any<br />

comparison tree that finds the median must have depth at least<br />

n/2 − 1 + lg (n choose n/2 − 1) = 3n/2 − O(log n).<br />

A more complicated adversary argument (also involving pruning the tree with little birdies) improves<br />

this lower bound to 2n − o(n).<br />

<br />

A similar argument implies that at least n − k + lg (n choose k − 1) comparisons are required to<br />

find the kth largest element in an n-element set.<br />
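Both bounds are easy to evaluate numerically. A small Python check (function names mine) confirms that n/2 − 1 + lg (n choose n/2 − 1) stays just below 3n/2, with only a logarithmic gap:<br />

```python
import math

def median_lower_bound(n):
    """n/2 - 1 + lg C(n, n/2 - 1): a lower bound on the depth of any
    comparison tree that finds the median of n elements (n even)."""
    return n // 2 - 1 + math.log2(math.comb(n, n // 2 - 1))

def select_lower_bound(n, k):
    """n - k + lg C(n, k - 1): comparisons needed to find the kth largest."""
    return n - k + math.log2(math.comb(n, k - 1))

for n in [100, 1000, 10000]:
    b = median_lower_bound(n)
    assert b < 1.5 * n                         # never exceeds 3n/2 ...
    assert b > 1.5 * n - 10 * math.log2(n)     # ... and is 3n/2 - O(log n)
```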



CS 373 Non-Lecture J: Reductions Fall 2002<br />

J Reductions<br />

J.1 Introduction<br />

A common technique for deriving algorithms is reduction—instead of solving a problem directly,<br />

we use an algorithm for some other related problem as a subroutine or black box.<br />

For example, when we talked about nuts and bolts, I argued that once the nuts and bolts are<br />

sorted, we can match each nut to its bolt in linear time. Thus, since we can sort nuts and bolts in<br />

O(n log n) expected time, then we can also match them in O(n log n) expected time:<br />

Tmatch(n) ≤ Tsort(n) + O(n) = O(n log n) + O(n) = O(n log n).<br />

Let’s consider (as we did in lecture 3) a decision tree model of computation, where every query<br />

is a comparison between a nut and a bolt—too big, too small, or just right? The output to the<br />

matching problem is a permutation π, where for all i, the ith nut matches the π(i)th bolt. Since<br />

there are n! permutations of n items, any nut/bolt comparison tree that matches n nuts and bolts<br />

has at least n! leaves, and thus has depth at least ⌈log₃(n!)⌉ = Ω(n log n).<br />

Now the same reduction from matching to sorting can be used to prove a lower bound for sorting<br />

nuts and bolts, just by reversing the inequality:<br />

Tsort(n) ≥ Tmatch(n) − O(n) = Ω(n log n) − O(n) = Ω(n log n).<br />

Thus, any nut-bolt comparison tree that sorts n nuts and bolts has depth Ω(n log n), and our<br />

randomized quicksort algorithm is optimal. 1<br />
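The information-theoretic bound ⌈log₃(n!)⌉ can be evaluated directly; a quick Python check (the function name is mine) confirms it grows like n log n by sandwiching it between two easy estimates:<br />

```python
import math

def matching_lower_bound(n):
    """ceil(log3 n!): minimum depth of a ternary nut/bolt comparison tree
    that can distinguish all n! possible matchings."""
    return math.ceil(math.log(math.factorial(n), 3))

# The bound grows like n log n:
for n in [10, 100, 1000]:
    lb = matching_lower_bound(n)
    assert lb <= n * math.log(n, 3)              # since n! <= n^n
    assert lb >= (n / 2) * math.log(n / 2, 3)    # since n! >= (n/2)^(n/2)
```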

J.2 Sorting to Convex Hulls<br />

Here’s a slightly less trivial example. Suppose we want to prove a lower bound for the problem of<br />

computing the convex hull of a set of n points in the plane. To do this, we demonstrate a reduction<br />

from sorting to convex hulls.<br />

To sort a list of n numbers {a, b, c, . . .}, we first transform it into a set of n points {(a, a²),<br />

(b, b²), (c, c²), . . .}. You can think of the original numbers as a set of points on a horizontal real<br />

number line, and the transformation as lifting those points up to the parabola y = x². Then we<br />

compute the convex hull of the parabola points. Finally, to get the final sorted list of numbers, we<br />

output the first coordinate of every convex vertex, starting from the leftmost vertex and going in<br />

counterclockwise order.<br />

Reducing sorting to computing a convex hull.<br />

1 We could have proved this lower bound directly. The output to the sorting problem is two permutations, so there<br />

are (n!)² possible outputs, and we get a lower bound of ⌈log₃((n!)²)⌉ = Ω(n log n).<br />


Transforming the numbers into points takes O(n) time. Since the convex hull is output as a<br />

circular doubly-linked list of vertices, reading off the final sorted list of numbers also takes O(n)<br />

time. Thus, given a black-box convex hull algorithm, we can sort in linear extra time. In this case,<br />

we say that there is a linear time reduction from sorting to convex hulls. We can visualize the<br />

reduction as follows:<br />

set of n numbers −−O(n)−−→ set of n points<br />

↓ ConvexHull<br />

sorted list of numbers ←−−O(n)−− convex polygon<br />

(I strongly encourage you to draw a picture like this whenever you use a reduction argument, at<br />

least until you get used to them.) The reduction gives us the following inequality relating the<br />

complexities of the two problems:<br />

Tsort(n) ≤ Tconvex hull(n) + O(n)<br />

Since we can compute convex hulls in O(n log n) time, our reduction implies that we can also sort<br />

in O(n log n) time. More importantly, by reversing the inequality, we get a lower bound on the<br />

complexity of computing convex hulls.<br />

Tconvex hull(n) ≥ Tsort(n) − O(n)<br />

Since any binary decision tree requires Ω(n log n) time to sort n numbers, it follows that any binary<br />

decision tree requires Ω(n log n) time to compute the convex hull of n points.<br />
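The reduction is easy to run end to end. In the Python sketch below (names mine), Andrew's monotone chain stands in for the black-box convex hull algorithm; it returns the hull vertices in counterclockwise order starting from the leftmost point, so reading off first coordinates recovers the sorted order (distinct inputs assumed, since on the parabola every lifted point is a hull vertex):<br />

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counterclockwise order,
    starting from the leftmost point. Stands in for the black box."""
    pts = sorted(points)
    def chain(seq):
        h = []
        for p in seq:
            while len(h) >= 2:
                cross = ((h[-1][0] - h[-2][0]) * (p[1] - h[-2][1])
                         - (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0]))
                if cross > 0:          # strict left turn: keep the vertex
                    break
                h.pop()
            h.append(p)
        return h
    lower, upper = chain(pts), chain(pts[::-1])
    return lower[:-1] + upper[:-1]

def sort_via_hull(nums):
    """Sort distinct numbers by lifting them to y = x^2 and reading the
    hull counterclockwise from the leftmost vertex."""
    hull = convex_hull([(a, a * a) for a in nums])
    # every lifted point is a hull vertex; the first n are the lower chain
    return [x for x, _ in hull[:len(nums)]]

assert sort_via_hull([3, 1, 4, 1.5, -2]) == [-2, 1, 1.5, 3, 4]
```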

J.3 Watch the Model!<br />

This result about the complexity of computing convex hulls is often misquoted as follows:<br />

Since we need Ω(n log n) comparisons to sort, we also need Ω(n log n) comparisons<br />

(between x-coordinates) to compute convex hulls.<br />

Although this statement is true, it’s completely trivial, since it’s impossible to compute convex<br />

hulls using any number of comparisons! In order to compute hulls, we must perform counterclockwise<br />

tests on triples of points.<br />

The convex hull algorithms we’ve seen — Graham’s scan, Jarvis’s march, divide-and-conquer,<br />

Chan’s shatter — can all be modeled as binary 2 decision trees, where every query is a counterclockwise<br />

test on three points. So our binary decision tree lower bound is meaningful, and several<br />

of those algorithms are optimal.<br />

This is a subtle but important point about deriving lower bounds using reduction arguments.<br />

In order for any lower bound to be meaningful, it must hold in a model in which the problem can be<br />

solved! Often the problem we are reducing from is much simpler than the problem we are reducing<br />

to, and thus can be solved in a more restrictive model of computation.<br />

2 or ternary, if we allow colinear triples of points<br />


J.4 Element Uniqueness (A Bad Example)<br />

The element uniqueness problem asks, given a list of n numbers x1, x2, . . . , xn, whether any two of<br />

them are equal. There is an obvious and simple algorithm to solve this problem: sort the numbers,<br />

and then scan for adjacent duplicates. Since we can sort in O(n log n) time, we can solve the<br />

element uniqueness problem in O(n log n) time.<br />

We also have an Ω(n log n) lower bound for sorting, but our reduction does not give us a lower<br />

bound for element uniqueness. The reduction goes the wrong way! Inscribe the following on the<br />

back of your hand 3 :<br />

To prove that problem A is harder than problem B, reduce B to A.<br />

There isn’t (as far as I know) a reduction from sorting to the element uniqueness problem.<br />

However, using other techniques (which I won’t talk about), it is possible to prove an Ω(n log n)<br />

lower bound for the element uniqueness problem. The lower bound applies to so-called algebraic<br />

decision trees. An algebraic decision tree is a ternary decision tree, where each query asks for<br />

the sign of a constant-degree polynomial in the variables x1, x2, . . . , xn. A comparison tree is an<br />

example of an algebraic decision tree, using polynomials of the form xi − xj. The reduction from<br />

element uniqueness to sorting implies that any algebraic decision tree requires Ω(n log n) time to<br />

sort n numbers. But since algebraic decision trees are ternary decision trees, we already knew that.<br />
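For the record, here is the correct-direction reduction from element uniqueness to sorting, in Python (the name all_distinct is mine): sort with a black box, then scan adjacent entries in linear extra time.<br />

```python
def all_distinct(nums):
    """Element uniqueness via a black-box sort: O(n) extra work after sorting."""
    s = sorted(nums)                  # the black-box sorting step
    return all(s[i] != s[i + 1] for i in range(len(s) - 1))

assert all_distinct([3, 1, 4, 1, 5]) is False   # the two 1s collide
assert all_distinct([3, 1, 4, 2, 5]) is True
```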

J.5 Closest Pair<br />

The simplest version of the closest pair problem asks, given a list of n numbers x1, x2, . . . , xn, to<br />

find the closest pair of elements, that is, the elements xi and xj that minimize |xi − xj|.<br />

There is an obvious reduction from element uniqueness to closest pair, based on the observation<br />

that the elements of the input list are distinct if and only if the distance between the closest pair<br />

is bigger than zero. This reduction implies that the closest pair problem requires Ω(n log n) time<br />

in the algebraic decision tree model.<br />

set of n numbers −−trivial−−→ set of n numbers<br />

↓ ClosestPair<br />

True or False ←−−O(1)−− closest pair xi, xj<br />

There are also higher-dimensional closest pair problems; for example, given a set of n points in<br />

the plane, find the two points that are closest together. Since the one-dimensional problem is a special<br />

case of the 2d problem — just put all n points on the x-axis — the Ω(n log n) lower bound applies<br />

to the higher-dimensional problems as well.<br />
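The reduction from element uniqueness to closest pair is equally mechanical. In this Python sketch (names mine), a brute-force routine stands in for the closest-pair black box:<br />

```python
def closest_pair_1d(nums):
    """Stand-in black box: the pair minimizing |xi - xj| (brute force)."""
    pairs = ((abs(a - b), a, b)
             for i, a in enumerate(nums) for b in nums[i + 1:])
    return min(pairs)

def distinct_via_closest_pair(nums):
    """Element uniqueness reduces to closest pair: the elements are
    distinct iff the closest pair is strictly more than zero apart."""
    return closest_pair_1d(nums)[0] > 0

assert distinct_via_closest_pair([7, 3, 9]) is True
assert distinct_via_closest_pair([7, 3, 7]) is False
```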

J.6 3SUM to Colinearity. . .<br />

Unfortunately, lower bounds are relatively few and far between. There are thousands of computational<br />

problems for which we cannot prove any good lower bounds. We can still learn something<br />

useful about the complexity of such a problem from reductions, namely, that it is harder than<br />

some other problem.<br />

Here’s an example. The problem 3sum asks, given a sequence of n numbers x1, . . . , xn, whether<br />

any three of them sum to zero. There is a fairly simple algorithm to solve this problem in O(n²)<br />

3 right under all those rules about logarithms, geometric series, and recurrences<br />


time (hint, hint). This is widely believed to be the fastest algorithm possible. There is an Ω(n²)<br />

lower bound for 3sum, but only in a fairly weak model of computation. 4<br />

Now consider a second problem: given a set of n points in the plane, do any three of them lie on<br />

a common non-horizontal line? Again, there is an O(n²)-time algorithm, and again, this is believed<br />

to be the best possible. The following reduction from 3SUM offers some support for this belief.<br />

Suppose we are given an array A of n numbers as input to 3sum. Replace each element a ∈ A with<br />

three points (a, 0), (−a/2, 1), and (a, 2). Thus, we replace the n numbers with 3n points on three<br />

horizontal lines y = 0, y = 1, and y = 2.<br />

If any three points in this set lie on a common non-horizontal line, they consist of one point on<br />

each of those three lines, say (a, 0), (−b/2, 1), and (c, 2). The slope of the common line is equal<br />

to both −b/2 − a and c + b/2; since these two expressions are equal, we must have a + b + c = 0.<br />

Similarly, if any three elements a, b, c ∈ A sum to zero, then the resulting points (a, 0), (−b/2, 1),<br />

and (c, 2) are colinear.<br />

So we have a valid reduction from 3sum to the colinear-points problem:<br />

set of n numbers −−O(n)−−→ set of 3n points<br />

↓ Colinear?<br />

True or False ←−−trivial−− True or False<br />

T3sum(n) ≤ Tcolinear(3n) + O(n) =⇒ Tcolinear(n) ≥ T3sum(n/3) − O(n).<br />

Thus, if we could detect colinear points in o(n²) time, we could also solve 3sum in o(n²) time,<br />

which seems unlikely. Conversely, if we could prove an Ω(n²) lower bound for 3sum in a sufficiently<br />

powerful model of computation, it would imply an Ω(n²) lower bound for the colinear points<br />

problem as well.<br />

The existing Ω(n²) lower bound for 3sum does not imply a lower bound for finding colinear<br />

points, because the model of computation is too weak. It is possible to prove an Ω(n²) lower bound<br />

directly using an adversary argument, but only in a fairly weak decision-tree model of computation.<br />

Note that in order to prove that the reduction is correct, we have to show that both yes answers<br />

and no answers are correct: the numbers sum to zero if and only if three points lie on a line.<br />

Even though the reduction itself only goes one way, from the ‘easier’ problem to the<br />

‘harder’ problem, the proof of correctness must go both ways.<br />
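Both directions of the reduction’s correctness can be checked by brute force on small inputs. In this Python sketch (names mine), the collinearity tester ignores horizontal triples, and note one subtlety the construction encodes: the same element of A may serve as two of a, b, c, so the brute-force 3sum here allows repetition.<br />

```python
from itertools import combinations, product

def has_3sum(A):
    """Brute force: do any three elements (reuse allowed) sum to zero?"""
    return any(a + b + c == 0 for a, b, c in product(A, repeat=3))

def collinear_via_reduction(A):
    """Map each a to (a, 0), (-a/2, 1), (a, 2); three of the resulting
    points lie on a common non-horizontal line iff some a + b + c = 0."""
    pts = set()
    for a in A:
        pts.update([(a, 0), (-a / 2, 1), (a, 2)])
    for p, q, r in combinations(pts, 3):
        if p[1] == q[1] == r[1]:
            continue                  # ignore horizontal lines
        # collinear iff the cross product of (q - p) and (r - p) vanishes
        if (q[0] - p[0]) * (r[1] - p[1]) == (r[0] - p[0]) * (q[1] - p[1]):
            return True
    return False

for A in ([1, 4, -5, 10], [1, 2, 3, 4], [-2, 1, 6], [8, -3, -5]):
    assert collinear_via_reduction(A) == has_3sum(A)
```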

Anka Gajentaan and Mark Overmars 5 defined a whole class of computational geometry problems<br />

that are harder than 3sum; they called these problems 3sum-hard. A sub-quadratic algorithm for<br />

any 3sum-hard problem would imply a subquadratic algorithm for 3sum. I’ll finish the lecture with<br />

two more examples of 3sum-hard problems.<br />

J.7 . . . to Segment Splitting . . .<br />

Consider the following segment splitting problem: Given a collection of line segments in the plane,<br />

is there a line that does not hit any segment and splits the segments into two non-empty subsets?<br />

To show that this problem is 3sum-hard, we start with the collection of points produced by our<br />

last reduction. Replace each point by a ‘hole’ between two horizontal line segments. To make sure<br />

4 The Ω(n²) lower bound holds in a decision tree model where every query asks for the sign of a linear combination<br />

of three of the input numbers. For example, ‘Is 5x1 + x42 − 17x5 positive, negative, or zero?’ See my paper ‘Lower<br />

bounds for linear satisfiability problems’ (http://www.uiuc.edu/~jeffe/pubs/linsat.html) for the gory(!) details.<br />

5 A. Gajentaan and M. Overmars, On a class of O(n²) problems in computational geometry, Comput. Geom.<br />

Theory Appl. 5:165–185, 1995. ftp://ftp.cs.ruu.nl/pub/RUU/CS/techreps/CS-1993/1993-15.ps.gz<br />


that the only way to split the segments is by passing through three colinear holes, we build two<br />

‘gadgets’, each consisting of five segments, to cap off the left and right ends as shown in the figure<br />

below.<br />

Top: 3n points, three on a non-horizontal line.<br />

Bottom: 3n + 13 segments separated by a line through three colinear holes.<br />

This reduction could be performed in linear time if we could make the holes infinitely small, but<br />

computers can’t really deal with infinitesimal numbers. On the other hand, if we make the holes<br />

too big, we might be able to thread a line through three holes that don’t quite line up. I won’t go<br />

into details, but it is possible to compute a working hole size in O(n log n) time by first computing<br />

the distance between the closest pair of points.<br />

Thus, we have a valid reduction from 3sum to segment splitting (by way of colinearity):<br />

set of n numbers −−O(n)−−→ set of 3n points −−O(n log n)−−→ set of 3n + 13 segments<br />

↓ Splittable?<br />

True or False ←−−trivial−− True or False ←−−trivial−− True or False<br />

T3sum(n) ≤ Tsplit(3n + 13) + O(n log n) =⇒ Tsplit(n) ≥ T3sum((n − 13)/3) − O(n log n).<br />

J.8 . . . to Motion Planning<br />

Finally, suppose we want to know whether a robot can move from one position and orientation to<br />

another. To make things simple, we’ll assume that the robot is just a line segment, and the<br />

environment in which the robot moves is also made up of non-intersecting line segments. Given<br />

an initial position and orientation and a final position and orientation, is there a sequence of<br />

translations and rotations that moves the robot from start to finish?<br />

To show that this motion planning problem is 3sum-hard, we do one more reduction, starting<br />

from the set of segments output by the previous reduction algorithm. Specifically, we use our earlier<br />

set of line segments as a ‘screen’ between two large rooms. The rooms are constructed so that the<br />

robot can enter or leave each room only by passing through the screen. We make the robot long<br />

enough that the robot can pass from one room to the other if and only if it can pass through three<br />

colinear holes in the screen. (If the robot isn’t long enough, it could get between the ‘layers’ of the<br />

screen.) See the figure below:<br />

<strong>CS</strong> <strong>373</strong> Non-Lecture J: Reductions Fall 2002<br />
The robot can move from one room to the other if and only if the screen between the rooms has three colinear holes.<br />

Once we have the screen segments, we only need linear time to compute how big the rooms<br />

should be, and then O(1) time to set up the 20 segments that make up the walls. So we have a<br />

fast reduction from 3sum to motion planning (by way of colinearity and segment splitting):<br />

set of n numbers −−O(n)−→ set of 3n points −−O(n log n)−→ set of 3n + 13 segments −−O(n)−→ set of 3n + 33 segments −→ Movable?<br />

True or False ←−trivial−− True or False ←−trivial−− True or False ←−trivial−− True or False<br />

T3sum(n) ≤ Tmotion(3n + 33) + O(n log n)   =⇒   Tmotion(n) ≥ T3sum((n − 33)/3) − O(n log n).<br />

Thus, a sufficiently powerful Ω(n²) lower bound for 3sum would imply an Ω(n²) lower bound<br />

for motion planning as well. The existing Ω(n²) lower bound for 3sum does not imply a lower<br />

bound for this problem — the model of computation in which the lower bound holds is too weak<br />

to even solve the motion planning problem. In fact, the best lower bound anyone can prove for this<br />

motion planning problem is Ω(n log n), using a (different) reduction from element uniqueness. But<br />

the reduction does give us evidence that motion planning ‘should’ require quadratic time.<br />



<strong>CS</strong> <strong>373</strong> Lecture 16: NP-Hard Problems Fall 2002<br />

Math class is tough!<br />

— Teen Talk Barbie (1992)<br />

That’s why I like it!<br />

— What she should have said next<br />

The Manhattan-based Barbie Liberation Organization claims to have performed corrective surgery<br />

on 300 Teen Talk Barbies and Talking Duke G.I. Joes—switching their sound chips, repackaging<br />

the toys, and returning them to store shelves. Consumers reported their amazement at hearing<br />

Barbie bellow, ‘Eat lead, Cobra!’ or ‘Vengeance is mine!,’ while Joe chirped, ‘Will we ever have<br />

enough clothes?’ and ‘Let’s plan our dream wedding!’<br />

— Mark Dery, “Hacking Barbie’s Voice Box: Vengeance is Mine!”, New Media (May 1994)<br />

16 NP-Hard Problems (December 3 and 5)<br />

16.1 ‘Efficient’ Problems<br />

A long time ago¹, theoretical computer scientists like Steve Cook and Dick Karp decided that a<br />

minimum requirement of any efficient algorithm is that it runs in polynomial time: O(nᶜ) for some<br />

constant c. People recognized early on that not all problems can be solved this quickly, but we<br />

had a hard time figuring out exactly which ones could and which ones couldn’t. So Cook, Karp,<br />

and others, defined the class of NP-hard problems, which most people believe cannot be solved in<br />

polynomial time, even though nobody can prove a super-polynomial lower bound.<br />

Circuit satisfiability is a good example of a problem that we don’t know how to solve in polynomial<br />

time. In this problem, the input is a boolean circuit: a collection of and, or, and not gates<br />

connected by wires. We will assume that there are no loops in the circuit (so no delay lines or<br />

flip-flops). The input to the circuit is a set of m boolean (true/false) values x1, . . . , xm. The output<br />

is a single boolean value. Given specific input values, we can calculate the output in polynomial<br />

(actually, linear) time using depth-first search and evaluating the output of each gate in constant<br />

time.<br />

The circuit satisfiability problem asks, given a circuit, whether there is an input that makes the<br />

circuit output True, or conversely, whether the circuit always outputs False. Nobody knows how<br />

to solve this problem faster than just trying all 2ᵐ possible inputs to the circuit, but this requires<br />

exponential time. On the other hand, nobody has ever proved that this is the best we can do;<br />

maybe there’s a clever algorithm that nobody has discovered yet!<br />

An and gate, an or gate, and a not gate.<br />

A boolean circuit. Inputs enter from the left, and the output leaves to the right.<br />

¹ . . . in a galaxy far far away . . .<br />


16.2 P, NP, and co-NP<br />

Let me define three classes of problems:<br />

• P is the set of yes/no problems 2 that can be solved in polynomial time. Intuitively, P is the<br />

set of problems that can be solved quickly.<br />

• NP is the set of yes/no problems with the following property: If the answer is yes, then there<br />

is a proof of this fact that can be checked in polynomial time. Intuitively, NP is the set of<br />

problems where we can verify a Yes answer quickly if we have the solution in front of us.<br />

For example, the circuit satisfiability problem is in NP. If the answer is yes, then any set of<br />

m input values that produces True output is a proof of this fact; we can check the proof by<br />

evaluating the circuit in polynomial time.<br />

• co-NP is the exact opposite of NP. If the answer to a problem in co-NP is no, then there is<br />

a proof of this fact that can be checked in polynomial time.<br />

If a problem is in P, then it is also in NP — to verify that the answer is yes in polynomial time,<br />

we can just throw away the proof and recompute the answer from scratch. Similarly, any problem<br />

in P is also in co-NP.<br />

The (or at least, a) central question in theoretical computer science is whether or not P=NP.<br />

Nobody knows. Intuitively, it should be obvious that P≠NP; the homeworks and exams in this<br />

class have (I hope) convinced you that problems can be incredibly hard to solve, even when the<br />

solutions are obvious once you see them. But nobody can prove it.<br />

Notice that the definition of NP (and co-NP) is not symmetric. Just because we can verify<br />

every yes answer quickly, we may not be able to check no answers quickly, and vice versa. For<br />

example, as far as we know, there is no short proof that a boolean circuit is not satisfiable. But<br />

again, we don’t have a proof; everyone believes that NP≠co-NP, but nobody really knows.<br />

What we think the world looks like.<br />

16.3 NP-hard, NP-easy, and NP-complete<br />

A problem Π is NP-hard if a polynomial-time algorithm for Π would imply a polynomial-time<br />

algorithm for every problem in NP. In other words:<br />

Π is NP-hard ⇐⇒ If Π can be solved in polynomial time, then P=NP<br />

Intuitively, this is like saying that if we could solve one particular NP-hard problem quickly, then<br />

we could quickly solve any problem whose solution is easy to understand, using the solution to that<br />

one special problem as a subroutine. NP-hard problems are at least as hard as any problem in NP.<br />

2 Technically, I should be talking about languages, which are just sets of bit strings. The language associated with<br />

a yes/no problem is the set of bit strings for which the answer is yes. For example, if the problem is ‘Is the input<br />

graph connected?’, then the corresponding language is the set of connected graphs, where each graph is represented<br />

as a bit string (for example, its adjacency matrix). P is the set of languages that can be recognized in polynomial<br />

time by a single-tape Turing machine. Take 375 if you want to know more.<br />


Saying that a problem is NP-hard is like saying ‘If I own a dog, then it can speak fluent English.’<br />

You probably don’t know whether or not I own a dog, but you’re probably pretty sure that I don’t<br />

own a talking dog. Nobody has a mathematical proof that dogs can’t speak English—the fact<br />

that no one has ever heard a dog speak English is evidence, as are the hundreds of examinations<br />

of dogs that lacked the proper mouth shape and brainpower, but mere evidence is not a proof.<br />

Nevertheless, no sane person would believe me if I said I owned a dog that spoke fluent English.<br />

So the statement ‘If I own a dog, then it can speak fluent English’ has a natural corollary: No one<br />

in their right mind should believe that I own a dog! Likewise, if a problem is NP-hard, no one in<br />

their right mind should believe it can be solved in polynomial time.<br />

The following theorem was proved by Steve Cook in 1971. I won’t even sketch the proof (since<br />

I’ve been deliberately vague about the definitions). If you want more details, take <strong>CS</strong> 375 next<br />

semester.<br />

Cook’s Theorem. Circuit satisfiability is NP-hard.<br />

Finally, a problem is NP-complete if it is both NP-hard and an element of NP (‘NP-easy’).<br />

NP-complete problems are the hardest problems in NP. If anyone finds a polynomial-time algorithm<br />

for even one NP-complete problem, then that would imply a polynomial-time algorithm for every<br />

NP-complete problem. Literally thousands of problems have been shown to be NP-complete, so a<br />

polynomial-time algorithm for one (i.e., all) of them seems incredibly unlikely.<br />

More of what we think the world looks like.<br />

16.4 Reductions (again) and SAT<br />

To prove that a problem is NP-hard, we use a reduction argument, exactly like we’re trying to prove<br />

a lower bound. So we can use a special case of that statement about reductions that you tattooed<br />

on the back of your hand last time.<br />

To prove that problem A is NP-hard, reduce a known NP-hard problem to A.<br />

For example, consider the formula satisfiability problem, usually just called SAT. The input to<br />

SAT is a boolean formula like<br />

(a ∨ b ∨ c ∨ d̄) ⇔ ((b ∧ c̄) ∨ (ā ⇒ d) ∨ (c = a ∧ b)),<br />

and the question is whether it is possible to assign boolean values to the variables a, b, c, . . . so that<br />

the formula evaluates to True.<br />

To show that SAT is NP-hard, we need to give a reduction from a known NP-hard problem.<br />

The only problem we know is NP-hard so far is circuit satisfiability, so let’s start there. Given a<br />

boolean circuit, we can transform it into a boolean formula by creating new output variables for<br />

each gate, and then just writing down the list of gates separated by and. For example, we could<br />

transform the example circuit into a formula as follows:<br />


(y1 = x1 ∧ x4) ∧ (y2 = x̄4) ∧ (y3 = x3 ∧ y2) ∧ (y4 = y1 ∨ x2) ∧<br />

(y5 = x̄2) ∧ (y6 = x̄5) ∧ (y7 = y3 ∨ y5) ∧ (y8 = y4 ∧ y7 ∧ y6) ∧ y8<br />

A boolean circuit with gate variables added, and an equivalent boolean formula.<br />

Now the original circuit is satisfiable if and only if the resulting formula is satisfiable. Given a<br />

satisfying input to the circuit, we can get a satisfying assignment for the formula by computing the<br />

output of every gate. Given a satisfying assignment for the formula, we can get a satisfying input<br />

to the circuit by just ignoring the gate variables yi.<br />

We can transform any boolean circuit into a formula in linear time using depth-first search, and<br />

the size of the resulting formula is only a constant factor larger than the size of the circuit. Thus,<br />

we have a polynomial-time reduction from circuit satisfiability to SAT:<br />

boolean circuit −−O(n)−→ boolean formula −→ SAT<br />

True or False ←−trivial−− True or False<br />

TCSAT(n) ≤ O(n) + TSAT(O(n))   =⇒   TSAT(n) ≥ TCSAT(Ω(n)) − O(n)<br />
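The gate-by-gate translation is only a few lines of code. Here is a sketch, assuming the same (name, operator, arguments) gate encoding used in the circuit-evaluation sketch earlier; the output is a formula string with one conjunct per gate, plus the assertion that the output gate is true.

```python
def circuit_to_formula(gates):
    """Translate a circuit into an equivalent formula in linear time:
    one conjunct per gate (new variable = gate expression), joined by
    ANDs, plus the output variable itself."""
    parts = []
    for name, op, args in gates:
        if op == 'NOT':
            parts.append('({} = ~{})'.format(name, args[0]))
        elif op == 'AND':
            parts.append('({} = {})'.format(name, ' & '.join(args)))
        else:  # 'OR'
            parts.append('({} = {})'.format(name, ' | '.join(args)))
    parts.append(gates[-1][0])  # the output gate variable must be True
    return ' & '.join(parts)
```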

The reduction implies that if we had a polynomial-time algorithm for SAT, then we’d have a<br />

polynomial-time algorithm for circuit satisfiability, which would imply that P=NP. So SAT is NP-hard.<br />

To prove that a boolean formula is satisfiable, we only have to specify an assignment to the<br />

variables that makes the formula true. We can check the proof in linear time just by reading<br />

the formula from left to right, evaluating as we go. So SAT is also in NP, and thus is actually<br />

NP-complete.<br />

16.5 3SAT<br />

A special case of SAT that is incredibly useful in proving NP-hardness results is 3SAT (or as [CLRS]<br />

insists on calling it, 3-CNF-SAT).<br />

A boolean formula is in conjunctive normal form (CNF) if it is a conjunction (and) of several<br />

clauses, each of which is the disjunction (or) of several literals, each of which is either a variable<br />

or its negation. For example:<br />

(a ∨ b ∨ c ∨ d) ∧ (b ∨ c̄ ∨ d̄) ∧ (ā ∨ c ∨ d) ∧ (a ∨ b̄)<br />

A 3CNF formula is a CNF formula with exactly three literals per clause; the previous example is<br />

not a 3CNF formula, since its first clause has four literals and its last clause has only two. 3SAT<br />


is just SAT restricted to 3CNF formulas — given a 3CNF formula, is there an assignment to the<br />

variables that makes the formula evaluate to true?<br />

We could prove that 3SAT is NP-hard by a reduction from the more general SAT problem,<br />

but it’s easier just to start over from scratch, with a boolean circuit. We perform the reduction in<br />

several stages.<br />

1. Make sure every and and or gate has only two inputs. If any gate has k > 2 inputs, replace<br />

it with a binary tree of k − 1 two-input gates.<br />

2. Write down the circuit as a formula, with one clause per gate. This is just the previous<br />

reduction.<br />

3. Change every gate clause into a CNF formula. There are only three types of clauses, one for<br />

each type of gate:<br />

a = b ∧ c ↦−→ (a ∨ b̄ ∨ c̄) ∧ (ā ∨ b) ∧ (ā ∨ c)<br />

a = b ∨ c ↦−→ (ā ∨ b ∨ c) ∧ (a ∨ b̄) ∧ (a ∨ c̄)<br />

a = b̄ ↦−→ (a ∨ b) ∧ (ā ∨ b̄)<br />

4. Make sure every clause has exactly three literals. Introduce new variables into each one- and<br />

two-literal clause, and expand it into two clauses as follows:<br />

a ↦−→ (a ∨ x ∨ y) ∧ (a ∨ x̄ ∨ y) ∧ (a ∨ x ∨ ȳ) ∧ (a ∨ x̄ ∨ ȳ)<br />

a ∨ b ↦−→ (a ∨ b ∨ x) ∧ (a ∨ b ∨ x̄)<br />
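Steps 3 and 4 can be sketched directly in code. Literals here are nonzero integers, +v for a variable and −v for its negation (the standard DIMACS convention for SAT solvers); the encoding is my choice, not the notes'.

```python
def gate_clauses(a, op, b, c=None):
    """CNF clauses for one gate clause (step 3): a = b op c."""
    if op == 'AND':
        return [(a, -b, -c), (-a, b), (-a, c)]
    if op == 'OR':
        return [(-a, b, c), (a, -b), (a, -c)]
    return [(a, b), (-a, -b)]  # op == 'NOT': a = not b

def pad_to_three(clauses, next_var):
    """Expand 1- and 2-literal clauses into equivalent 3-literal clauses
    (step 4), introducing fresh variables starting at next_var."""
    out = []
    for clause in clauses:
        if len(clause) == 3:
            out.append(tuple(clause))
        elif len(clause) == 2:
            a, b = clause
            x = next_var; next_var += 1
            out += [(a, b, x), (a, b, -x)]
        else:  # single literal
            (a,) = clause
            x, y = next_var, next_var + 1; next_var += 2
            out += [(a, x, y), (a, -x, y), (a, x, -y), (a, -x, -y)]
    return out, next_var
```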

For example, if we start with the same example circuit we used earlier, we obtain the following<br />

3CNF formula. Although this may look a lot more ugly and complicated than the original circuit<br />

at first glance, it’s actually only a constant factor larger. Even if the formula were larger than the<br />

circuit by a polynomial, like n³⁷³, we would have a valid reduction.<br />

(y1 ∨ x̄1 ∨ x̄4) ∧ (ȳ1 ∨ x1 ∨ z1) ∧ (ȳ1 ∨ x1 ∨ z̄1) ∧ (ȳ1 ∨ x4 ∨ z2) ∧ (ȳ1 ∨ x4 ∨ z̄2)<br />

∧ (y2 ∨ x4 ∨ z3) ∧ (y2 ∨ x4 ∨ z̄3) ∧ (ȳ2 ∨ x̄4 ∨ z4) ∧ (ȳ2 ∨ x̄4 ∨ z̄4)<br />

∧ (y3 ∨ x̄3 ∨ ȳ2) ∧ (ȳ3 ∨ x3 ∨ z5) ∧ (ȳ3 ∨ x3 ∨ z̄5) ∧ (ȳ3 ∨ y2 ∨ z6) ∧ (ȳ3 ∨ y2 ∨ z̄6)<br />

∧ (ȳ4 ∨ y1 ∨ x2) ∧ (y4 ∨ x̄2 ∨ z7) ∧ (y4 ∨ x̄2 ∨ z̄7) ∧ (y4 ∨ ȳ1 ∨ z8) ∧ (y4 ∨ ȳ1 ∨ z̄8)<br />

∧ (y5 ∨ x2 ∨ z9) ∧ (y5 ∨ x2 ∨ z̄9) ∧ (ȳ5 ∨ x̄2 ∨ z10) ∧ (ȳ5 ∨ x̄2 ∨ z̄10)<br />

∧ (y6 ∨ x5 ∨ z11) ∧ (y6 ∨ x5 ∨ z̄11) ∧ (ȳ6 ∨ x̄5 ∨ z12) ∧ (ȳ6 ∨ x̄5 ∨ z̄12)<br />

∧ (ȳ7 ∨ y3 ∨ y5) ∧ (y7 ∨ ȳ3 ∨ z13) ∧ (y7 ∨ ȳ3 ∨ z̄13) ∧ (y7 ∨ ȳ5 ∨ z14) ∧ (y7 ∨ ȳ5 ∨ z̄14)<br />

∧ (y8 ∨ ȳ4 ∨ ȳ7) ∧ (ȳ8 ∨ y4 ∨ z15) ∧ (ȳ8 ∨ y4 ∨ z̄15) ∧ (ȳ8 ∨ y7 ∨ z16) ∧ (ȳ8 ∨ y7 ∨ z̄16)<br />

∧ (y9 ∨ ȳ8 ∨ ȳ6) ∧ (ȳ9 ∨ y8 ∨ z17) ∧ (ȳ9 ∨ y8 ∨ z̄17) ∧ (ȳ9 ∨ y6 ∨ z18) ∧ (ȳ9 ∨ y6 ∨ z̄18)<br />

∧ (y9 ∨ z19 ∨ z20) ∧ (y9 ∨ z̄19 ∨ z20) ∧ (y9 ∨ z19 ∨ z̄20) ∧ (y9 ∨ z̄19 ∨ z̄20)<br />

At the end of this process, we’ve transformed the circuit into an equivalent 3CNF formula.<br />

The formula is satisfiable if and only if the original circuit is satisfiable. As with the more general<br />

SAT problem, the formula is only a constant factor larger than any reasonable description<br />

of the original circuit, and the reduction can be carried out in polynomial time. Thus, we have a<br />

polynomial-time reduction from circuit satisfiability to 3SAT:<br />


boolean circuit −−O(n)−→ 3CNF formula −→ 3SAT<br />

True or False ←−trivial−− True or False<br />

TCSAT(n) ≤ O(n) + T3SAT(O(n))   =⇒   T3SAT(n) ≥ TCSAT(Ω(n)) − O(n)<br />

So 3SAT is NP-hard.<br />

Finally, since 3SAT is a special case of SAT, it is also in NP, so 3SAT is NP-complete.<br />

16.6 Maximum Clique Size (from 3SAT)<br />

The last problem I’ll consider in this lecture is a graph problem. A clique is another name for a<br />

complete graph. The maximum clique size problem, or simply MaxClique, is to compute, given<br />

a graph, the number of nodes in its largest complete subgraph.<br />

A graph with maximum clique size 4.<br />

I’ll prove that MaxClique is NP-hard (but not NP-complete, since it isn’t a yes/no problem)<br />

using a reduction from 3SAT. I’ll describe a reduction algorithm that transforms a 3CNF formula<br />

into a graph that has a clique of a certain size if and only if the formula is satisfiable.<br />

The graph has one node for each instance of each literal in the formula. Two nodes are connected<br />

by an edge if (1) they correspond to literals in different clauses and (2) those literals do not<br />

contradict each other. In particular, all the nodes that come from the same literal (in different<br />

clauses) are joined by edges. For example, the formula (a ∨ b ∨ c) ∧ (b ∨ c̄ ∨ d̄) ∧ (ā ∨ c ∨ d) ∧ (a ∨ b̄ ∨ d̄)<br />

is transformed into the following graph. (Look for the edges that aren’t in the graph.)<br />


A graph derived from a 3CNF formula, and a clique of size 4.<br />

Now suppose the original formula had k clauses. Then I claim that the formula is satisfiable if<br />

and only if the graph has a clique of size k.<br />


1. k-clique =⇒ satisfying assignment: If the graph has a clique of k vertices, then each<br />

vertex must come from a different clause. To get the satisfying assignment, we declare that<br />

each literal in the clique is true. Since we only connect non-contradictory literals with edges,<br />

this declaration assigns a consistent value to several of the variables. There may be variables<br />

that have no literal in the clique; we can set these to any value we like.<br />

2. satisfying assignment =⇒ k-clique: If we have a satisfying assignment, then we can<br />

choose one literal in each clause that is true. Those literals form a clique in the graph.<br />

Thus, the reduction is correct. Since the reduction from 3CNF formula to graph can be done in<br />

polynomial time, MaxClique is NP-hard. Here’s a diagram of the reduction:<br />

3CNF formula with k clauses −−O(n)−→ graph with 3k nodes −→ Clique<br />

True or False ←−O(1)−− maximum clique size<br />

T3SAT(n) ≤ O(n) + TMaxClique(O(n))   =⇒   TMaxClique(n) ≥ T3SAT(Ω(n)) − O(n)<br />
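The reduction itself is only a few lines. In the sketch below, a 3CNF formula is a list of 3-tuples of nonzero integers (+v for a variable, −v for its negation); both the encoding and the brute-force clique checker are mine, added only to illustrate the correspondence.

```python
from itertools import combinations

def clique_graph(clauses):
    """Build the MaxClique graph from a 3CNF formula: one node per
    literal occurrence, with edges between every pair of literals that
    lie in different clauses and do not contradict each other."""
    nodes = [(i, lit) for i, clause in enumerate(clauses) for lit in clause]
    edges = {(u, v) for u, v in combinations(nodes, 2)
             if u[0] != v[0] and u[1] != -v[1]}
    return nodes, edges

def has_clique(nodes, edges, k):
    """Brute-force check for a k-clique (exponential time, of course --
    that's exactly why the reduction is interesting)."""
    return any(all((u, v) in edges or (v, u) in edges
                   for u, v in combinations(sub, 2))
               for sub in combinations(nodes, k))
```

With k clauses, the formula is satisfiable exactly when the graph has a k-clique; for instance, the unsatisfiable formula (a ∨ a ∨ a) ∧ (ā ∨ ā ∨ ā) yields a graph with no edges at all.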

16.7 Independent Set (from Clique)<br />

An independent set is a collection of vertices in a graph with no edges between them. The<br />

IndependentSet problem is to find the largest independent set in a given graph.<br />

There is an easy proof that IndependentSet is NP-hard, using a reduction from Clique.<br />

Any graph G has a complement Ḡ with the same vertices, but with exactly the opposite set of<br />

edges—(u, v) is an edge in Ḡ if and only if it is not an edge in G. A set of vertices forms a clique in<br />

G if and only if the same vertices are an independent set in Ḡ. Thus, we can compute the largest<br />

clique in a graph simply by computing the largest independent set in the complement of the graph.<br />
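In code, the whole reduction is one set difference. In this sketch, vertices are arbitrary hashable labels and edges are frozensets — an encoding of my choosing:

```python
from itertools import combinations

def complement(vertices, edges):
    """Return the edge set of the complement graph: exactly the vertex
    pairs that are *not* edges of the original graph."""
    all_pairs = {frozenset(p) for p in combinations(vertices, 2)}
    return all_pairs - {frozenset(e) for e in edges}
```

A clique in the original graph is an independent set in the complement, so a largest-independent-set subroutine applied to `complement(V, E)` solves MaxClique.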

graph G −−O(n)−→ complement graph Ḡ −→ IndependentSet<br />

largest clique ←−trivial−− largest independent set<br />

16.8 Vertex Cover (from Independent Set)<br />

A vertex cover of a graph is a set of vertices that touches every edge in the graph. The<br />

VertexCover problem is to find the smallest vertex cover in a given graph.<br />

Again, the proof of NP-hardness is simple, and relies on just one fact: If I is an independent<br />

set in a graph G = (V, E), then V \ I is a vertex cover. Thus, to find the largest independent set,<br />

we just need to find the vertices that aren’t in the smallest vertex cover of the same graph.<br />
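That one fact is easy to check mechanically. A sketch, with graphs encoded as a vertex set and an edge list (my convention):

```python
def is_independent_set(candidate, edges):
    """No edge may have both endpoints inside an independent set."""
    return not any(u in candidate and v in candidate for u, v in edges)

def is_vertex_cover(candidate, edges):
    """Every edge must have at least one endpoint in a vertex cover."""
    return all(u in candidate or v in candidate for u, v in edges)
```

For any independent set I, every edge has at most one endpoint in I, hence at least one endpoint in V \ I — so V \ I is a vertex cover, and the same argument runs in reverse.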

graph G = (V, E) −−trivial−→ graph G = (V, E) −→ VertexCover<br />

largest independent set V \ S ←−O(n)−− smallest vertex cover S<br />


16.9 Graph Coloring (from 3SAT)<br />

A c-coloring of a graph is a map C : V → {1, 2, . . . , c} that assigns one of c ‘colors’ to each vertex,<br />

so that every edge has two different colors at its endpoints. The graph coloring problem is to find<br />

the smallest possible number of colors in a legal coloring. To show that this problem is NP-hard,<br />

it’s enough to consider the special case 3Colorable: Given a graph, does it have a 3-coloring?<br />

To prove that 3Colorable is NP-hard, we use a reduction from 3sat. Given a 3CNF formula,<br />

we produce a graph as follows. The graph consists of a truth gadget, one variable gadget for each<br />

variable in the formula, and one clause gadget for each clause in the formula.<br />

The truth gadget is just a triangle with three vertices T , F , and X, which intuitively stand for<br />

True, False, and Other. Since these vertices are all connected, they must have different colors in<br />

any 3-coloring. For the sake of convenience, we will name those colors True, False, and Other.<br />

Thus, when we say that a node is colored True, all we mean is that it must be colored the same<br />

as the node T .<br />

The variable gadget for a variable a is also a triangle joining two new nodes labeled a and ā to<br />

node X in the truth gadget. Node a must be colored either True or False, and so node ā must<br />

be colored either False or True, respectively.<br />

Finally, each clause gadget joins three literal nodes to node T in the truth gadget using five new<br />

unlabeled nodes and ten edges; see the figure below. If all three literal nodes in the clause gadget<br />

are colored False, then the rightmost vertex in the gadget cannot have one of the three colors.<br />

Since the variable gadgets force each literal node to be colored either True or False, in any valid<br />

3-coloring, at least one of the three literal nodes is colored True. I need to emphasize here that<br />

the final graph contains only one node T , only one node F , only one node ā for each variable a,<br />

and so on.<br />


Gadgets for the reduction from 3SAT to 3-Colorability:<br />

The truth gadget, a variable gadget for a, and a clause gadget for (a ∨ b ∨ c̄).<br />

The proof of correctness is just brute force. If the graph is 3-colorable, then we can extract a<br />

satisfying assignment from any 3-coloring—at least one of the three literal nodes in every clause<br />

gadget is colored True. Conversely, if the formula is satisfiable, then we can color the graph<br />

according to any satisfying assignment.<br />

3CNF formula −−O(n)−→ graph −→ 3Colorable<br />

True or False ←−trivial−− True or False<br />

For example, the formula (a ∨ b ∨ c) ∧ (b ∨ c̄ ∨ d̄) ∧ (ā ∨ c ∨ d) ∧ (a ∨ b̄ ∨ d̄) that I used to illustrate<br />

the MaxClique reduction would be transformed into the following graph. The 3-coloring is one<br />

of several that correspond to the satisfying assignment a = c = True, b = d = False.<br />


A 3-colorable graph derived from a satisfiable 3CNF formula.<br />

We can easily verify that a graph has been correctly 3-colored in linear time: just compare the<br />

endpoints of every edge. Thus, 3Coloring is in NP, and therefore NP-complete. Moreover, since<br />

3Coloring is a special case of the more general graph coloring problem—What is the minimum<br />

number of colors?—the general problem is also NP-hard, but not NP-complete, because it’s not a<br />

yes/no problem.<br />
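The linear-time verification amounts to one comparison per edge. A sketch, with the claimed coloring given as a dictionary from vertices to colors (my encoding):

```python
def is_proper_coloring(coloring, edges):
    """Verify a claimed coloring in O(E) time: every edge must have
    differently colored endpoints."""
    return all(coloring[u] != coloring[v] for u, v in edges)
```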

16.10 Hamiltonian Cycle (from Vertex Cover)<br />

A Hamiltonian cycle in a graph is a cycle that visits every vertex exactly once. This is very<br />

different from an Eulerian cycle, which is actually a closed walk that traverses every edge exactly<br />

once. Eulerian cycles are easy to find and construct in linear time using a variant of depth-first<br />

search. Finding Hamiltonian cycles, on the other hand, is NP-hard.<br />

To prove this, we use a reduction from the vertex cover problem. Given a graph G and an<br />

integer k, we need to transform it into another graph G ′ , such that G ′ has a Hamiltonian cycle if<br />

and only if G has a vertex cover of size k. As usual, our transformation consists of putting together<br />

several gadgets.<br />


• For each edge (u, v) in G, we have an edge gadget in G ′ consisting of twelve vertices and<br />

fourteen edges, as shown below. The four corner vertices (u, v, 1), (u, v, 6), (v, u, 1), and<br />

(v, u, 6) each have an edge leaving the gadget. A Hamiltonian cycle can only pass through<br />

an edge gadget in one of three ways. Eventually, these will correspond to one or both of the<br />

vertices u and v being in the vertex cover.<br />

(u,v,1) (u,v,2) (u,v,3) (u,v,4) (u,v,5) (u,v,6)<br />

(v,u,1) (v,u,2) (v,u,3) (v,u,4) (v,u,5) (v,u,6)<br />

An edge gadget for (u, v) and the only possible Hamiltonian paths through it.<br />


• G ′ also contains k cover vertices, simply numbered 1 through k.<br />

• Finally, for each vertex u in G, we string together all the edge gadgets for edges (u, v)<br />

into a single vertex chain, and then connect the ends of the chain to all the cover vertices.<br />

Specifically, suppose u has d neighbors v1, v2, . . . , vd. Then G ′ has d − 1 edges between<br />

(u, vi, 6) and (u, vi+1, 1), plus k edges between the cover vertices and (u, v1, 1), and finally k<br />

edges between the cover vertices and (u, vd, 6).<br />


The vertex chain for v: all edge gadgets involving v are strung together and joined to the k cover vertices.<br />

It’s not hard to prove that if {v1, v2, . . . , vk} is a vertex cover of G, then G ′ has a Hamiltonian<br />

cycle—start at cover vertex 1, traverse the vertex chain for v1, then visit cover vertex 2, then<br />

traverse the vertex chain for v2, and so forth, eventually returning to cover vertex 1. Conversely,<br />

any Hamiltonian cycle in G ′ alternates between cover vertices and vertex chains, and the vertex<br />

chains correspond to the k vertices in a vertex cover of G. (This is a little harder to prove.) Thus,<br />

G has a vertex cover of size k if and only if G ′ has a Hamiltonian cycle.<br />


The original graph G with vertex cover {v, w}, and the transformed graph G ′ with a corresponding Hamiltonian cycle.<br />

Vertex chains are colored to match their corresponding vertices.<br />

The transformation from G to G ′ takes at most O(n²) time, so the Hamiltonian cycle problem is<br />

NP-hard. Moreover, since we can easily verify a Hamiltonian cycle in linear time, the Hamiltonian<br />

cycle problem is in NP, and therefore NP-complete.<br />
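The verification step is equally mechanical. A sketch, where the certificate is the list of vertices in cycle order (my encoding):

```python
def is_hamiltonian_cycle(cycle, vertices, edges):
    """Check that `cycle` visits every vertex exactly once and that
    consecutive vertices (wrapping around at the end) are adjacent."""
    edge_set = {frozenset(e) for e in edges}
    n = len(cycle)
    return (sorted(cycle) == sorted(vertices) and
            all(frozenset((cycle[i], cycle[(i + 1) % n])) in edge_set
                for i in range(n)))
```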


A closely related problem to Hamiltonian cycles is the famous traveling salesman problem—<br />

Given a weighted graph G, find the shortest cycle that visits every vertex. Finding the shortest<br />

cycle is obviously harder than determining if a cycle exists at all, so the traveling salesman problem<br />

is also NP-hard.<br />

16.11 Minesweeper (from Circuit SAT)<br />

In 1999, Richard Kaye proved that the solitaire game Minesweeper is NP-complete, using a reduction<br />

from the original circuit satisfiability problem. 3 The reduction involves setting up gadgets for<br />

every possible feature of a boolean circuit: wires, and gates, or gates, not gates, wire crossings, and<br />

so forth. For all the gory details, see http://www.mat.bham.ac.uk/R.W.Kaye/minesw/minesw.pdf!<br />

16.12 Other Useful NP-hard Problems<br />

Literally thousands of different problems have been proved to be NP-hard. I want to close this<br />

note by listing a few NP-hard problems that are useful in deriving reductions. I won’t describe the<br />

NP-hardness for these problems, but you can find them in Garey and Johnson’s Angry Black Book<br />

of NP-Completeness. 4<br />

• PlanarCircuitSAT: Given a boolean circuit that can be embedded in the plane so that<br />

no two wires cross, is there an input that makes the circuit output True? This can be<br />

proved NP-hard by reduction from the general circuit satisfiability problem, by replacing<br />

each crossing with a small series of gates. (This is an easy exercise.)<br />

• NotAllEqual3SAT: Given a 3CNF formula, is there an assignment of values to the variables<br />

so that every clause contains at least one true literal and at least one false literal? This<br />

can be proved NP-hard by reduction from the usual 3SAT.<br />

• Planar3SAT: Given a 3CNF boolean formula, consider a bipartite graph whose vertices are<br />

the clauses and variables, where an edge indicates that a variable (or its negation) appears in<br />

a clause. If this graph is planar, the 3CNF formula is also called planar. The Planar3SAT<br />

problem asks, given a planar 3CNF formula, whether it has a satisfying assignment. This can<br />

be proven NP-hard by reduction from PlanarSat.<br />

• PlanarNotAllEqual3SAT: You get the idea.<br />

• Exact3DimensionalMatching or X3M: Given a set S and a collection of three-element<br />

subsets of S, called triples, is there a subcollection of disjoint triples that exactly cover S?<br />

This can be proved NP-hard by a reduction from 3SAT.<br />

• Partition: Given a set S of n integers, are there subsets A and B such that A ∪ B = S,<br />

A ∩ B = ∅, and ∑_{a∈A} a = ∑_{b∈B} b? This can be proved NP-hard by a reduction from. . .<br />

3 Richard Kaye. Minesweeper is NP-complete. Mathematical Intelligencer 22(2):9–15, 2000.<br />

4 Michael Garey and David Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.<br />

W. H. Freeman and Co., 1979.<br />



<strong>CS</strong> <strong>373</strong> Lecture 16: NP-Hard Problems Fall 2002<br />

• SubsetSum: Given a set S of n integers and an integer T, is there a subset A ⊆ S such that<br />

∑_{a∈A} a = T?<br />

This is a generalization of Partition, where T = (∑ S)/2. SubsetSum can be proved NP-hard<br />

by a (nontrivial!) reduction from either 3SAT or X3M.<br />
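SubsetSum is NP-hard, but it does have a classic pseudo-polynomial dynamic program; this sketch (mine, assuming nonnegative integers) illustrates why "NP-hard" and "no useful algorithm" are not the same thing:<br />

```python
def subset_sum(S, T):
    """Pseudo-polynomial dynamic program for SubsetSum (nonnegative
    integers): reachable holds every subset sum that is at most T.
    O(n*T) time -- exponential in the number of *bits* of T, so this
    does not contradict NP-hardness."""
    reachable = {0}
    for x in S:
        reachable |= {t + x for t in reachable if t + x <= T}
    return T in reachable

print(subset_sum([3, 7, 1, 8], 11))  # True: 3 + 8
print(subset_sum([3, 7, 1, 8], 6))   # False
```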

• 3Partition: Given a set S with 3n elements, can it be partitioned into n disjoint subsets,<br />

each with 3 elements, such that every subset has the same sum? Note that this is very<br />

different from the Partition problem; I didn’t make up the names. This can be proved<br />

NP-hard by reduction from X3M. The similar problem of dividing a set of 2n elements into n<br />

equal-weight two-element sets can be solved in O(n log n) time.<br />

• SetCover: Given a collection of sets S = {S1, S2, . . . , Sm}, find the smallest subcollection of<br />

Si’s that contains all the elements of ⋃_i Si. This is a generalization of both VertexCover<br />

and X3M.<br />

• HittingSet: Given a collection of sets S = {S1, S2, . . . , Sm}, find the minimum number of<br />

elements of ⋃_i Si that hit every set in S. This is also a generalization of VertexCover.<br />

• LongestPath: Given a non-negatively weighted graph G and two vertices u and v, what is<br />

the longest simple path from u to v in the graph? A path is simple if it visits each vertex<br />

at most once. This is a generalization of the HamiltonianPath problem. Of course, the<br />

corresponding shortest path problem is in P.<br />
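Nothing stops us from solving LongestPath by exhaustive search; this sketch of mine tracks the visited set to keep paths simple, at exponential cost:<br />

```python
def longest_path(graph, u, v):
    """Exhaustive search for LongestPath: graph[x] maps each neighbor
    of x to a nonnegative edge weight. The visited set keeps every
    explored path simple; worst-case exponential time."""
    best = float('-inf')

    def dfs(x, visited, length):
        nonlocal best
        if x == v:              # a simple path ending at v
            best = max(best, length)
            return
        for y, w in graph[x].items():
            if y not in visited:
                dfs(y, visited | {y}, length + w)

    dfs(u, {u}, 0)
    return best

# Triangle: the longest simple path from a to c detours through b.
G = {'a': {'b': 2, 'c': 4}, 'b': {'a': 2, 'c': 3}, 'c': {'a': 4, 'b': 3}}
print(longest_path(G, 'a', 'c'))  # 5: a -> b -> c
```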

• SteinerTree: Given a weighted, undirected graph G with some of the vertices marked, what<br />

is the minimum-weight subtree of G that contains every marked vertex? If every vertex is<br />

marked, the minimum Steiner tree is just the minimum spanning tree; if exactly two vertices<br />

are marked, the minimum Steiner tree is just the shortest path between them. This can be<br />

proved NP-hard by reduction from HamiltonianPath.<br />

• Tetris: Given a Tetris board and a finite sequence of future pieces, can you survive? This<br />

was recently proved NP-hard by reduction from 3Partition.<br />

