Chapter 3 – Special Discrete Random Variables

Section 3.4 Binomial random variable
An experiment that has only two possible outcomes is called a Bernoulli trial; a single coin toss is one example. For the sake of argument, we will call one of the possible outcomes "success" and the other one "failure". The probability of a success is p, and the probability of a failure is 1 − p. We are interested in studying a sequence of independent and identical Bernoulli trials, and looking at the total number of successes that occur.

Definition. A binomial random variable is the number of successes in n independent and identical Bernoulli trials.

Examples.
A fair coin is tossed 100 times and Y, the number of heads, is recorded. Then Y is a binomial random variable with n = 100 and p = 1/2.

Two evenly matched teams play a series of 6 games. The number of wins Y is a binomial random variable with n = 6 and p = 1/2.

An inspector looks at five computers, where the chance that each computer is defective is 1/6. The number Y of defective computers that he sees is a binomial random variable with n = 5 and p = 1/6.
If Y is a binomial random variable, then the possible outcomes for Y are obviously 0, 1, . . . , n. In other words, the number of observed successes could be any number between 0 and n. The sample space consists of all strings of length n made up of S's and F's; for example,

    SSFSFSSSF · · · SF    (n trials).

Now let us choose a value 0 ≤ y ≤ n, and look at a few typical sample points belonging to the event (Y = y):

    SSS · · · S FFF · · · F        (y S's, then n − y F's),
    SSS · · · S FFF · · · F S      (y − 1 S's, then n − y F's, then an S),
    SSS · · · S FFF · · · F SS     (y − 2 S's, then n − y F's, then SS).

Every sample point in the event (Y = y) is an arrangement of y S's and n − y F's, and therefore has probability p^y (1 − p)^(n−y). How many such sample points are there? The number of sample points in (Y = y) is the number of distinct arrangements of y S's and n − y F's, that is, the binomial coefficient C(n, y). Putting it together gives the formula for binomial probabilities.
Binomial probabilities.

If Y is a binomial random variable with parameters n and p, then

    P(Y = y) = C(n, y) p^y (1 − p)^(n−y),    y = 0, 1, . . . , n.
Example. Best-of-seven series

In Section 1.6 we figured out that the probability of a best-of-seven series between two evenly matched teams going the full seven games was 20/64. This can also be calculated using binomial probabilities. If you play six games against an equally skilled opponent, and Y is the number of wins, then Y has a binomial distribution with n = 6 and p = 1/2. The series goes seven games if Y = 3, and the chance of that happening is

    P(Y = 3) = C(6, 3) (1/2)^3 (1/2)^3 = 20/64 = .3125.

So best-of-seven series ought to go seven games about 30% of the time. But, in fact, if you look at the Stanley Cup final series for the fifty years 1946–1995, there were seven-game series only 8 times (1950, 1954, 1955, 1964, 1965, 1971, 1987, 1994). This seems to show that a lot of these match-ups were not even, which tends to make the series end sooner.

If you are twice as good as your opponent, what is the chance of a full seven games? This time p = 2/3, and so

    P(Y = 3) = C(6, 3) (2/3)^3 (1/3)^3 = .2195.

This agrees more closely with the actual results, although it is still a bit high.
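The binomial formula is easy to check by machine. The notes do not include code, but as an illustrative sketch (Python, standard library only), here is the probability formula applied to the two series calculations above:

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for a binomial random variable: C(n, y) p^y (1-p)^(n-y)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Evenly matched teams: chance the series is tied 3-3 after six games.
print(binom_pmf(3, 6, 1/2))   # 20/64 = 0.3125

# One team twice as good as the other (p = 2/3).
print(binom_pmf(3, 6, 2/3))   # about 0.2195
```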
Example. An even split

If I toss a fair coin ten times, what is the chance that I get exactly 5 heads and 5 tails? The answer is

    P(Y = 5) = C(10, 5) (1/2)^5 (1/2)^5 = .2461.

If I toss a fair coin 100 times, what is the chance of exactly fifty heads? This time the answer is

    P(Y = 50) = C(100, 50) (1/2)^50 (1/2)^50 = .0796.

You may be a bit surprised that this is such an uncommon event. If you flip a coin 100 times the odds are pretty good that you will get about an equal number of heads and tails, but getting exactly one half heads and one half tails gets harder and harder as the sample size increases. Just for fun, here is an approximate formula for the chance of getting exactly n heads in 2n coin tosses: P(an even split) ≈ (πn)^(−1/2).
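As a quick sketch (not part of the original notes), we can compare the exact even-split probability with the (πn)^(−1/2) approximation for a few values of n:

```python
from math import comb, pi, sqrt

def even_split(n):
    """Exact chance of exactly n heads in 2n fair coin tosses: C(2n, n)/4^n."""
    return comb(2 * n, n) / 4**n

# The approximation (pi*n)^(-1/2) improves as n grows.
for n in (5, 50, 500):
    print(n, even_split(n), 1 / sqrt(pi * n))
```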
Example. Testing for ESP

In order to test for ESP you draw a card from an ordinary deck and ask the subject what color it is. You repeat this 20 times and the subject is correct 15 times. How likely is it that this is due to chance?

If the subject is guessing, then Y, the number of correct readings, follows a binomial distribution with n = 20 and p = 1/2. We want to know the probability that someone can do this well (or better) by guessing. Thus

    P(Y ≥ 15) = P(Y = 15) + P(Y = 16) + · · · + P(Y = 20)
              = C(20, 15) (1/2)^15 (1/2)^5 + C(20, 16) (1/2)^16 (1/2)^4 + · · · + C(20, 20) (1/2)^20 (1/2)^0
              = 21700 (1/2)^20
              = 0.0207.

This is a pretty unlikely event but certainly not impossible. What conclusion can we draw?
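Tail probabilities like this one are just sums of binomial terms. As a sketch (not from the original notes), the ESP calculation in Python:

```python
from math import comb

def binom_tail(k, n, p):
    """P(Y >= k) for a binomial random variable with parameters n and p."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k, n + 1))

# Chance of 15 or more correct guesses out of 20 when p = 1/2.
print(binom_tail(15, 20, 0.5))   # 21700/2^20, about 0.0207
```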
Example. Quality control

In mass production manufacturing there is a certain percentage of acceptable loss due to defective units. To check the level of defectives, you take a sample from the day's production. If the number of defectives is small you continue, but if there are too many defectives you shut down the production line for repairs.

Suppose that 5% defectives is considered acceptable, but 10% defectives is unacceptable. Our strategy is to take a sample of n = 40 units and shut down production if we find 4 or more defectives. Our inspection strategy has two conflicting goals: it is supposed to shut down when p ≥ .10, but continue if p ≤ .05. There are two possible wrong decisions: to continue when p ≥ .10, and to shut down even though p ≤ .05.

How often will we unnecessarily shut down? Suppose that there are acceptably many defectives, and to take the worst case, say there are 5% defectives, so that p = .05. Let Y be the number of observed defective units in the sample. The probability of shutting down production is

    P(shut down) = P(Y ≥ 4)
                 = 1 − P(Y ≤ 3)
                 = 1 − P(Y = 0) − P(Y = 1) − P(Y = 2) − P(Y = 3)
                 = 1 − C(40, 0)(.05)^0(.95)^40 − C(40, 1)(.05)^1(.95)^39 − C(40, 2)(.05)^2(.95)^38 − C(40, 3)(.05)^3(.95)^37
                 = 1 − .1285 − .2705 − .2777 − .1851
                 = .1382.
On the other hand, how often will we fail to spot an unacceptably high level of defectives? Let us now suppose that there are unacceptably many defectives, and again to take the worst case, let's say there are 10% defectives, so that p = .10. The chance that the day's production passes inspection anyway is

    P(passes inspection) = P(Y ≤ 3)
                         = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)
                         = C(40, 0)(.10)^0(.90)^40 + C(40, 1)(.10)^1(.90)^39 + C(40, 2)(.10)^2(.90)^38 + C(40, 3)(.10)^3(.90)^37
                         = .0148 + .0657 + .1423 + .2003
                         = .4231.

We see that this scheme is fairly likely to make errors. If we wanted to be more certain about our decision, we would need to take a larger sample size.
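Both error rates of the inspection scheme come from the same binomial cumulative sum. A quick sketch (not part of the original notes):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(Y <= k) for a binomial random variable with parameters n and p."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(0, k + 1))

# Rule: shut down on 4 or more defectives in a sample of n = 40.
print(1 - binom_cdf(3, 40, 0.05))  # needless shutdown rate at p = .05, about .1382
print(binom_cdf(3, 40, 0.10))      # pass rate at p = .10, about .4231
```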
Example. Multiple choice exams

If a multiple choice exam has 30 questions, each with 5 responses, what is the probability of passing the exam by guessing? If you guess on every question, then Y, the number of correct answers, will be a binomial random variable with n = 30 and p = 1/5. To pass you need 15 or more correct answers, so P(pass the exam) = P(Y ≥ 15) = 0.000231.
Binomial moments.

If Y is a binomial random variable with parameters n and p, then

    E(Y) = np    and    VAR(Y) = np(1 − p).
Example. The accuracy of empirical probabilities

If we simulate n random events, where the chance of a success is p, then the number of observed successes Y has a binomial distribution with parameters n and p. The empirical probability is p̂ = Y/n. Now the binomial moments given above show that E(p̂) = (np)/n = p, and VAR(p̂) = (np(1 − p))/n^2 = p(1 − p)/n. By computing the two standard deviation interval, we get some idea about how close p̂ is to p. Since the quantity p(1 − p) is maximized when p = 1/2, we find that regardless of the value of p,

    2 STD(p̂) = 2 √(p(1 − p)/n) ≤ 1/√n.

In most of our examples, the empirical probabilities have been based on n = 1000 repetitions. Thus, our empirical probabilities are typically within ±.03 of the true probabilities. For example, suppose we simulate 1000 throws of five dice, and find that on 71 occasions we get a sum of 14. Then we are fairly certain that the true probability of getting 14 lies somewhere between .041 and .101.
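As a sketch of the dice experiment above (the seed is an arbitrary choice of mine, made only so the run is reproducible):

```python
import random
from math import sqrt

random.seed(2024)  # arbitrary seed, for a reproducible run

# Simulate 1000 throws of five dice and estimate P(sum = 14).
n = 1000
hits = sum(1 for _ in range(n)
           if sum(random.randint(1, 6) for _ in range(5)) == 14)
p_hat = hits / n
half_width = 2 * sqrt(p_hat * (1 - p_hat) / n)   # at most 1/sqrt(n)
print(p_hat, "+/-", half_width)
```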
Section 3.5 Geometric and negative binomial random variables

Like the binomial, the geometric and negative binomial random variables are based on a sequence of independent and identical Bernoulli trials. Instead of fixing the number of trials n and counting up how many successes there are, we fix the number of successes k and count up how many trials it takes to get them. The geometric random variable is the number of trials until the first success. Given an integer k ≥ 1, the negative binomial random variable is the number of trials until the kth success. You see that a geometric random variable is a negative binomial random variable with k = 1. On the other hand, note that a negative binomial random variable Y is the sum of k independent geometric random variables. That is, Y = X1 + X2 + · · · + Xk, where X1 is the number of trials until the first success, X2 is the number of trials after the first success until the second success, etc. All of these X's have geometric distributions with parameter p.

If Y is negative binomial, then a typical sample point belonging to (Y = y) looks like FFS · · · FSS, where the first y − 1 symbols in the string contain exactly k − 1 successes and y − k failures, and the yth symbol is an S. Since there are C(y − 1, k − 1) such strings, and they all have probability p^k (1 − p)^(y−k), we get the following formula.
Negative binomial probabilities.

If Y is a negative binomial random variable with parameters k and p, then

    P(Y = y) = C(y − 1, k − 1) p^k (1 − p)^(y−k),    y = k, k + 1, . . . .

It follows that the geometric distribution is given by p(y) = p(1 − p)^(y−1), y = 1, 2, . . . .
Example. The chance of a packet arrival at a distribution hub is 1/10 during each time interval. Let Y be the arrival time of the first packet; it has a geometric distribution with p = .10. The probability that the first packet arrives during the third time interval is

    P(Y = 3) = (1/10)^1 (9/10)^2 = .081.

The probability that the first packet arrives on or after the third time interval is

    P(Y ≥ 3) = 1 − P(Y = 1) − P(Y = 2) = 1 − .10 − (.90)(.10) = .81.

If X is the arrival time of the tenth packet, the chance that it arrives in the 99th time interval is

    P(X = 99) = C(98, 9) (1/10)^10 (9/10)^89 = 0.01332.
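Both formulas above translate directly into code. A sketch (not part of the original notes):

```python
from math import comb

def geom_pmf(y, p):
    """P(Y = y): first success occurs on trial y."""
    return p * (1 - p)**(y - 1)

def negbin_pmf(y, k, p):
    """P(Y = y): k-th success occurs on trial y."""
    return comb(y - 1, k - 1) * p**k * (1 - p)**(y - k)

print(geom_pmf(3, 0.1))         # first packet in interval 3: 0.081
print(negbin_pmf(99, 10, 0.1))  # tenth packet in interval 99: about 0.01332
```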
Example. The 500 goal club

With only 30 games remaining in the NHL season, veteran winger Flash LaRue is starting to get worried. With a career total of 488 goals, it is not at all certain that he will be able to score his 500th career goal before the end of the season. He will get a big bonus from his team if he manages this feat, but unfortunately Flash only scores at a rate of about once every three games. Is there any hope that he will get his 500th goal before the end of the season?

Let's try to calculate the moments of a negative binomial random variable. Start with the geometric case, and add up the following tail probabilities row by row:

    p + p(1 − p) + p(1 − p)^2 + p(1 − p)^3 + · · · = 1
        p(1 − p) + p(1 − p)^2 + p(1 − p)^3 + · · · = (1 − p)
                   p(1 − p)^2 + p(1 − p)^3 + · · · = (1 − p)^2
                                p(1 − p)^3 + · · · = (1 − p)^3
                                                     ...

Adding the left-hand sides column by column, and the right-hand sides as a geometric series, gives

    p + 2p(1 − p) + 3p(1 − p)^2 + 4p(1 − p)^3 + · · · = 1 + (1 − p) + (1 − p)^2 + · · · = 1/p.

This sum ought to convince you that the mean of a geometric random variable is 1/p, and the result for the negative binomial follows from the equation Y = X1 + X2 + · · · + Xk. Confirming the variance formula is left as an exercise.
Negative binomial moments.

If Y is a negative binomial random variable with parameters k and p, then

    E(Y) = k/p    and    VAR(Y) = k(1 − p)/p^2.
We note that, as you would expect, the rarer an event is, the longer you will have to wait for it. Taking the geometric case (k = 1), we see that we will wait on average µ = 2 trials to see the first "heads" in a coin tossing experiment, we will wait on average µ = 36 trials to see the first pair of sixes in tossing a pair of dice, and we will buy on average µ = 13,983,816 tickets before we win Lotto 6-49.

We also note that σ decreases from infinity to zero as p ranges from 0 to 1. This says that predicting the first occurrence of an event is difficult for rare events, and easy for common events.
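The notes leave Flash LaRue's question open, but the moments give a hint: he needs 12 more goals, and the expected wait is k/p = 36 games, more than the 30 remaining. As a rough sketch of his actual chance, here is one possible model (my assumption, not the notes'): treat each game as a Bernoulli trial in which Flash scores exactly one goal with probability 1/3, ignoring multi-goal games.

```python
from math import comb

def negbin_pmf(y, k, p):
    """P(Y = y): probability the k-th success occurs on trial y."""
    return comb(y - 1, k - 1) * p**k * (1 - p)**(y - k)

k, p, games = 12, 1/3, 30   # 12 goals needed; assumed one-goal-per-game model
print(k / p)                # expected wait k/p = 36 games, more than remain

# Chance the 12th goal arrives within the 30 remaining games.
chance = sum(negbin_pmf(y, k, p) for y in range(k, games + 1))
print(chance)
```

So under this simplified model Flash still has a real, if modest, shot at the bonus.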
Section 3.7 Hypergeometric random variable

The hypergeometric distribution is the number of successes that arise in sampling without replacement. We suppose that there is a population of size N, of which r are "successes" and the rest "failures", and a sample of size n is drawn.

The probability formula below is simply the ratio of the number of samples containing y successes and n − y failures to the total number of possible samples of size n. The weird looking conditions on y just ensure that you don't try to find the probability of some impossible event.
Hypergeometric probabilities.

If Y is a hypergeometric random variable with parameters n, r, and N, then

    P(Y = y) = C(r, y) C(N − r, n − y) / C(N, n),    y = max(0, n − (N − r)), . . . , min(n, r).
Example. A box contains 12 poker chips of which 7 are green and 5 are blue. Eight chips are selected at random without replacement from this box. Let X denote the number of green chips selected. The probability mass function is

    p(x) = C(7, x) C(5, 8 − x) / C(12, 8),    x = 3, 4, . . . , 7.

Note that the range of possible x values is restricted by the make-up of the population.
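As a sketch (not part of the original notes), the hypergeometric formula and the restricted support of the poker-chip example:

```python
from math import comb

def hyper_pmf(y, n, r, N):
    """P(Y = y): y successes in a sample of n drawn without replacement
    from a population of N items containing r successes."""
    return comb(r, y) * comb(N - r, n - y) / comb(N, n)

# Green chips: sample 8 from a box of 7 green and 5 blue.
n, r, N = 8, 7, 12
support = range(max(0, n - (N - r)), min(n, r) + 1)    # x = 3, ..., 7
print([round(hyper_pmf(x, n, r, N), 4) for x in support])
```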
Example. Lotto 6-49

In Lotto 6-49 you buy a ticket with six numbers chosen from the set {1, 2, . . . , 49}. The draw consists of a random sample drawn without replacement from the same set, and your prize depends on how many "successes" were drawn. Here a "success" is any number that was on your ticket. So Y, the number of matches, follows a hypergeometric distribution with r = 6, n = 6, and N = 49. The probabilities for the different numbers of matches are obtained using the formula

    P(Y = y) = C(6, y) C(43, 6 − y) / C(49, 6),    y = 0, . . . , 6.

To four decimal places, we have

    y      0      1      2      3      4      5      6
    p(y)   .4360  .4130  .1324  .0176  .0010  .0000  .0000
Hypergeometric moments.

If Y is a hypergeometric random variable with parameters n, r, and N, then

    E(Y) = n(r/N)    and    VAR(Y) = n (r/N) ((N − r)/N) ((N − n)/(N − 1)).

For example, the average number of green chips drawn in the first problem is µ = (8)(7)/12 = 4.66666. Also, the average number of matches on your Lotto 6-49 ticket is µ = (6)(6)/49 = .73469.
Example. Capture-tag-recapture

A scientific expedition has captured, tagged, and released eight sea turtles in a particular region. The expedition assumes that the population size in this region is 35, which means that 8 are tagged and 27 are not tagged. The expedition will now capture 10 turtles and note how many of them are tagged. If the assumption about the population size is correct, what is the probability that the new sample will have 3 or fewer tagged turtles in it?

    P(Y ≤ 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)
             = C(8, 0)C(27, 10)/C(35, 10) + C(8, 1)C(27, 9)/C(35, 10) + C(8, 2)C(27, 8)/C(35, 10) + C(8, 3)C(27, 7)/C(35, 10)
             = .04595 + .20424 + .33861 + .27089
             = .85969.

We would certainly expect to get three or fewer tagged turtles in the new sample. If the expedition found five tagged turtles, is that evidence that they have over-estimated the population size?
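To weigh the closing question, it helps to also compute the chance of five or more tagged turtles under the assumed population size of 35. A sketch (not part of the original notes):

```python
from math import comb

def hyper_pmf(y, n, r, N):
    """Hypergeometric P(Y = y)."""
    return comb(r, y) * comb(N - r, n - y) / comb(N, n)

n, r, N = 10, 8, 35    # sample 10 turtles; 8 of an assumed 35 are tagged
p_low = sum(hyper_pmf(y, n, r, N) for y in range(0, 4))
p_high = sum(hyper_pmf(y, n, r, N) for y in range(5, min(n, r) + 1))
print(p_low)    # P(Y <= 3), about .85969
print(p_high)   # P(Y >= 5): how surprising five tagged turtles would be
```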
Example. A political poll

The population of Alberta is around 2,545,000, and let's suppose that about 70% of these are eligible to vote in the next provincial election. Then the population of eligible voters has N = 1,781,500 people in it. Suppose that n = 100 people are randomly selected from the eligible voters (without replacement) and asked whether or not they support Ralph Klein. Also suppose, for the sake of argument, that exactly 60%, or 1,068,900, eligible voters do support Ralph Klein. How accurately will the poll reflect that?

Let Y stand for the number of Klein supporters included in the random sample. Then Y has a hypergeometric distribution with n = 100, r = 1,068,900, and N = 1,781,500. The mean and variance of Y are given by

    µ = 100 (1068900/1781500) = 60

and

    σ^2 = 100 (1068900/1781500) (712600/1781500) (1781400/1781499) = 23.998666.

A two standard deviation interval says that probably between 50 and 70 people in the poll will be Klein supporters.

Note that if the sampling were done with replacement, then Y would follow a binomial distribution with n = 100 and p = .6. In this case, we would have

    µ = 100(.6) = 60    and    σ^2 = 100(.6)(.4) = 24.

Since n is small relative to N, the ratio

    (N − n)/(N − 1) = 1781400/1781499 ≈ 1,

and the mean and variance of the hypergeometric distribution coincide with the mean and variance of the binomial distribution. The distributions of these two random variables are also essentially the same whenever n is small relative to N.
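As a sketch (not from the original notes), the two variances side by side show how little the finite population correction matters here:

```python
N, r, n = 1_781_500, 1_068_900, 100
p = r / N                                          # exactly 0.6

var_hyper = n * p * (1 - p) * (N - n) / (N - 1)    # with finite population correction
var_binom = n * p * (1 - p)                        # sampling with replacement
print(var_hyper, var_binom)                        # 23.9986... versus 24
```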
Section 3.8 Poisson random variable

This probability distribution is named after the French mathematician Poisson, according to whom. . .

    Life is good for only two things, discovering mathematics and
    teaching mathematics. – Siméon Poisson

The Poisson distribution first appeared in Recherches sur la probabilité des jugements en matière criminelle et en matière civile, an important work on probability published in 1837. The Poisson distribution describes the probability that a random event will occur in a time or space interval under the conditions that the probability of the event occurring is very small, but the number of trials is very large, so that the event actually occurs a few times.
To illustrate this idea, suppose you are interested in the number of arrivals to a queue in a one day period. You could divide the time interval up into little subintervals, so that for all practical purposes, only one arrival can occur per subinterval. Therefore, for each subinterval of time, we have

    P(no arrival) = 1 − p,    P(one arrival) = p,    P(more than one arrival) = 0.

The total number of arrivals X is the number of subintervals that contain an arrival. This has a binomial distribution, where n is the number of subintervals. The probability of seeing x arrivals during the day is

    P(X = x) = C(n, x) p^x (1 − p)^(n−x).
Now let's suppose that you keep on dividing the time interval into smaller and smaller subintervals, increasing n but decreasing p so that the product µ = np remains constant. What happens to P(X = x)? Writing p = µ/n,

    C(n, x) p^x (1 − p)^(n−x)
        = C(n, x) (µ/n)^x (1 − µ/n)^(n−x)
        = [n(n − 1) · · · (n − x + 1)/x!] (µ/n)^x (1 − µ/n)^n (1 − µ/n)^(−x)
        = (µ^x/x!) (1 − µ/n)^n [(n/n)((n − 1)/n) · · · ((n − x + 1)/n)] (1 − µ/n)^(−x).

Now you take the limit as n → ∞, and obtain

    (1 − µ/n)^n → e^(−µ)

and

    (n/n)((n − 1)/n) · · · ((n − x + 1)/n)(1 − µ/n)^(−x) → 1.

This leads to the following formula.

Poisson probabilities.

If X is a Poisson random variable with parameter µ, then

    P(X = x) = e^(−µ) µ^x / x!,    x = 0, 1, . . . .
The derivation of the Poisson distribution explains why it is sometimes called the law of rare events. Let's look at an example involving the rarest event I can think of.

Example. More Lotto 6-49

The odds of winning the jackpot in Lotto 6-49 are one in 13,983,816, or p = 7.1511 × 10^(−8). Suppose you play twice a week, every week, for 10,000 years. The total number of plays is then n = 2 × 52 × 10000 = 1,040,000. Setting µ = np = .07437 and using the Poisson formula, we see that the chance of hitting zero jackpots during this time is

    P(X = 0) = (e^(−.07437))(.07437)^0/0! = .928327.

After all that time, we still have only about a 7% chance of getting a Lotto 6-49 jackpot. The probability of getting exactly two jackpots during this time is

    P(X = 2) = (e^(−.07437))(.07437)^2/2! = .002567.
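The derivation says the Poisson formula approximates the exact binomial when n is huge and p tiny. A sketch (not from the original notes) that prints the two side by side for the lottery numbers:

```python
from math import comb, exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) = e^(-mu) mu^x / x!."""
    return exp(-mu) * mu**x / factorial(x)

n, p = 1_040_000, 1 / 13_983_816
mu = n * p                              # about .07437
for x in (0, 1, 2):
    # Poisson approximation next to the exact binomial probability.
    print(x, poisson_pmf(x, mu), comb(n, x) * p**x * (1 - p)**(n - x))
```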
Example. Hashing

Hashing is a tool for organizing files, where a hashing function transforms a key into an address, which is then the basis for searching for and storing records. Hashing has two important features:

1. With hashing, the addresses generated appear to be random: there is no immediate connection between the key and the location of the record.

2. With hashing, two different keys may be transformed into the same address, in which case we say that a collision has occurred.

Given that it is nearly impossible to achieve a uniform distribution of records among the available addresses in a file, it is important to be able to predict how records are likely to be distributed. Suppose that there are N addresses available, and that the hashing function assigns them in a completely random fashion. This means that for any fixed address, the probability that it is selected is 1/N. If r keys are hashed, we can use the Poisson approximation to the binomial to obtain the probability that exactly x records are assigned to a given address. This is

    p(x) = e^(−r/N) (r/N)^x / x!,    x = 0, 1, . . . .

For instance, if we are trying to fit r = 10000 records in N = 10000 addresses, the proportion of addresses that will remain empty is p(0) = 1^0 e^(−1)/0! = .3679. We would expect a total of about 3679 empty addresses. Since p(1) = 1^1 e^(−1)/1! = .3679, we would also expect a total of about 3679 addresses with 1 record assigned, and about 10000 − 2(3679) = 2642 addresses with more than 1 record assigned. Because we have a packing density r/N of 1, we must expect a large number of collisions. In order to reduce the number of collisions we should increase the number N of available addresses.

For more about hashing, the reader is referred to Chapter 11 of the book File Structures: A Conceptual Toolkit by Michael J. Folk and Bill Zoellick.
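We can check the Poisson prediction for hashing by simulation; here is a sketch (not from the original notes; the seed is an arbitrary choice for reproducibility) that models a random hash function by assigning each key a uniformly random address:

```python
import random
from math import exp

random.seed(7)   # arbitrary seed, for a reproducible run

r = N = 10_000                         # packing density r/N = 1
counts = [0] * N
for _ in range(r):                     # "hash" each key to a random address
    counts[random.randrange(N)] += 1

empty = sum(1 for c in counts if c == 0) / N
single = sum(1 for c in counts if c == 1) / N
print(empty, single, exp(-1))          # both fractions should be near e^-1 = .3679
```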
Poisson moments.

If X is a Poisson random variable, then

    E(X) = µ    and    VAR(X) = µ.

Example. Particle emissions
In 1910, Hans Geiger and Ernest Rutherford conducted a famous experiment in which they counted the number of α-particle emissions during 2608 time intervals of equal length. Their data is as follows.

    x           0    1    2    3    4    5    6    7    8    9   10  > 10
    intervals  57  203  383  525  532  408  273  139   45   27   10     6

A total of 10097 particles were observed, giving a rate of µ = 10097/2608 = 3.8715 particles per time period. If these particles were following a Poisson distribution, then the number of intervals with no particles should be about

    2608 × e^(−3.8715) (3.8715)^0 / 0! = 54.31,

the number of intervals with exactly one particle should be about

    2608 × e^(−3.8715) (3.8715)^1 / 1! = 210.27,

and so on. In fact, the frequencies that we would expect to observe are

    x        0      1      2      3      4      5      6      7      8      9     10   > 10
          54.31 210.27 407.06 525.31 508.44 393.69 254.03 140.50  67.99  29.25  11.32   5.83

By comparing these two tables, you can see that the Poisson distribution seems to describe this phenomenon quite well.
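The second table is just 2608 times the Poisson probabilities at the observed rate. A sketch (not part of the original notes) that reproduces it next to Geiger and Rutherford's counts:

```python
from math import exp, factorial

mu = 10097 / 2608                      # observed rate, about 3.8715

def expected(x):
    """Expected number of intervals with exactly x particles under a Poisson model."""
    return 2608 * exp(-mu) * mu**x / factorial(x)

observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10]
for x, obs in enumerate(observed):
    print(x, obs, round(expected(x), 2))   # observed versus expected frequency
```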