EPIDEMIC ALGORITHMS FOR REPLICATED DATABASE MAINTENANCE 9The function i(s) is zero whenThis is an implicit equation <strong>for</strong> s, but the dominant term shows s decreasing exponentially withk. Thus increasing k is an effective way of insuring that almost everybody hears the rumor. Forexample, at k = 1 this <strong>for</strong>mula suggests that 20% will miss that rumor, while at k = 2 only 6%will nliss it.VariationsSo far we have seen only one complex epidemic, based on the rumor spreading technique. Ingeneral we would like to understand how to design a "good" epidemic, so it is worth pausing nowto review the criteria used to judge epidemics. We are principally concerned with:1. Residue. This is the value of s when i is zero, that is, the remaining susceptibles when theepidemic finishes. We would like the residue to be as small as possible, and, as the aboveanalysis shows, it is feasible to make the residue arbitrarily small.2. Traffic. Presently we are measuring traffic in database updates sent between sites, withoutregard <strong>for</strong> the topology of the network. It is convenient to use. an average m, the number ofmessages sent from a typical site:m=Total update trafficN umber of sitesIn section 3 we will refine this metric to incorporate traffic on individual links.3. Delay. There are two interesting times. The average delay is the difference between the timeof the initial injection of an update and the arrival of the update at a given site, averagedover all sites. We will refer to this as t ave . A similar quantity, tl ast , is the delay until thereception by the last site that will receive the update during the epidemic. Update messagesmay continue to appear in the network after tl ast , but they will never reach a susceptible site.We found it necessary to introduce two times because they often behave differently, and thedesigner is legitimately concerned about both times.Next, let us consider a few simple variations of rumor spreading. First we will describe the practicalaspects of the modifications, and later we will discuss residue, traffic, and delay.Blind vs. Feedback. The rumor example used feedback from the recipient; a sender loses interestonly if the recipient already knows the rumor. A blind variation loses interest with probability 11kregardless of the recipient. This obviates the need <strong>for</strong> the bit vector response from the recipient.Counter vs. Coin. Instead of losing interest with probability 11k we can use a counter, so thatwe lose interest only after k unnecessary contacts. The counters require keeping extra state <strong>for</strong>elements on the infective lists. Note that we can combine counter with blind, remaining infective<strong>for</strong> k cycles independent of any feedback.A surprising aspect of the above variations is that they share the same fundamental relationshipbetween traffic and residue:This is relatively easy to see by noticing that there are nm updates sent and the chance that a singlesite misses all these updates is (1 - I/n)nm. (Since m is not constant this relationship depends onthe moments around the mean of the distribution of m going to zero as n ~ 00, a fact that we haveobserved empirically, but have not proven.) Delay is the the only consideration that distinguishesXEROX PARC, <strong>CSL</strong>-<strong>89</strong>-1, JANUARY 19<strong>89</strong>
10 EPIDEMIC ALGORITHMS FOR REPLICATED DATABASE MAINTENANCEthe above possibilities: simulations indicate that counters and feedback improve the delay, withcounters playing a more significant role than feedback.Table 1. Per<strong>for</strong>mance of an epidemic on 1000 sites using feedback and counters.Counter Residue Traffic Convergencek s m t ave tlast1 0.18 1.7 11.0 16.82 0.037 3.3 12.1 16.93 0.011 4.5 12.5 17.44 0.0036 5.6 12.7 17.55 0.0012 6.7 12.8 17.7Table 2. Per<strong>for</strong>mance of an epidemic on 1000 sites using blind and coin.Coin Residue Traffic Convergencek s m tave tlast1 0.96 0.04 19 382 0.20 1.6 17 333 0.060 2.8 15 324 0.021 3.9 14.1 325 0.008 4.9 13.8 32Push vs. Pull. So far we have assumed that all the sites would randomly choose destinationsites and push infective updates to the destinations. The push vs. pull distinction made already<strong>for</strong> anti-entropy can be applied to rumor mongering as well. Pull has some advantages, but theprimary practical consideration is the rate of update injection into the distributed database. Ifthere are numerous independent updates a pull request is likely to find a source with a non-emptyrumor list, triggering useful in<strong>for</strong>mation flow. By contrast, if the database is quiescent, the pushalgorithm ceases to introduce traffic overhead, while the pull variation continues to inject fruitlessrequests <strong>for</strong> updates. Our own CIN application has a high enough update rate to warrant the useof pull.The chief advantage of pull is that it does significantly better than the s = e- m relationshipof push. Here the blind vs. feedback and counter vs. coin variations are important. Simulationsindicate that the counter and feedback variations improve residue, with counters being more importantthan feedback. We have a recurrence relation modeling the counter with feedback casethat exhibits s = e-e(m3) behavior.Table 3. Per<strong>for</strong>mance of a pull epidemic on 1000 sites using feedback and counters t.Counter Residue Traffic Convergencek s m tave tlast1 3.1 X 10-:l 2.7 9.97 17.62 5.8 x 10- 4 4.5 10.07 15.43 4.0 x 10- 6 6.1 10.08 14.0t If more than one recipient pulls from a site in a single cycle, then at the end of the cycle theeffects on the counter are as follows: if any recipient needed the update then the counter is reset;if all recipients did not need the update then one is added to the counter.XEROX PARC, <strong>CSL</strong>-<strong>89</strong>-1, JANUARY 19<strong>89</strong>