Introduction to Probability and Statistics - Universidad Carlos III de Madrid

Recommended reading

• C.M. Grinstead and J.M. Snell (1997). Introduction to Probability, AMS. Available from:
  http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/pdf.html
• Online Statistics: An Interactive Multimedia Course of Study is a good online course at:
  http://onlinestatbook.com/
• M.P. Wiper (2006). Notes on probability from an elementary course:
  http://halweb.uc3m.es/esp/Personal/personas/mwiper/docencia/Spanish/Doctorado_EEMC/probability_class.pdf


Probability

Chance is a part of our everyday lives. Every day we make judgements based on probability:

• There is a 90% chance Real Madrid will win tomorrow.
• There is a 1/6 chance that a die toss will be a 3.

Probability theory was developed from the study of games of chance by Fermat and Pascal and is the mathematical study of randomness. This theory deals with the possible outcomes of an event and was put onto a firm mathematical basis by Kolmogorov.


The Kolmogorov axioms

For a random experiment with sample space $\Omega$, a probability measure $P$ is a function such that

1. for any event $A \subseteq \Omega$, $P(A) \ge 0$,
2. $P(\Omega) = 1$,
3. $P\big(\bigcup_{j \in J} A_j\big) = \sum_{j \in J} P(A_j)$ whenever $\{A_j : j \in J\}$ is a countable collection of incompatible (mutually exclusive) events.




Laws of probability

The basic laws of probability can be derived directly from set theory and the Kolmogorov axioms. For example, for any two events $A$ and $B$, we have the addition law,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Proof
$A = (A \cap B) \cup (A \cap \bar{B})$, so $P(A) = P(A \cap B) + P(A \cap \bar{B})$, and similarly for $B$. Also, $A \cup B = (A \cap \bar{B}) \cup (B \cap \bar{A}) \cup (A \cap B)$, so

$$\begin{aligned}
P(A \cup B) &= P(A \cap \bar{B}) + P(B \cap \bar{A}) + P(A \cap B) \\
&= P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B) \\
&= P(A) + P(B) - P(A \cap B).
\end{aligned}$$


Interpretations of probability

The Kolmogorov axioms provide a mathematical basis for probability but don't provide a real-life interpretation. Various ways of interpreting probability in real-life situations have been proposed:

• The classical interpretation.
• Frequentist probability.
• Subjective probability.
• Other approaches: logical probability and propensities.


Classical probability

This derives from the ideas of Jakob Bernoulli (1713) contained in the principle of insufficient reason (or principle of indifference) developed by Laplace (1812), which can be used to provide a way of assigning epistemic or subjective probabilities.


The principle of insufficient reason

If we are ignorant of the ways an event can occur (and therefore have no reason to believe that one way will occur preferentially compared to another), the event will occur equally likely in any way.

Thus the probability of an event is the ratio between the number of favourable cases and the total number of possible cases.

This is a very limited definition and cannot be easily applied in infinite-dimensional or continuous sample spaces.


Subjective probability

A different approach uses the concept of one's own probability as a subjective measure of one's own uncertainty about the occurrence of an event. Thus, we may all have different probabilities for the same event because we all have different experience and knowledge. This approach is more general than the other methods, as we can now define probabilities for unrepeatable experiments. Subjective probability is studied in detail in Bayesian Statistics.


Other approaches

• Logical probability was developed by Keynes (1921) and Carnap (1950) as an extension of the classical concept of probability. The (conditional) probability of a proposition H given evidence E is interpreted as the (unique) degree to which E logically entails H.


• Under the theory of propensities developed by Popper (1957), probability is an innate disposition or propensity for things to happen. Long-run propensities seem to coincide with the frequentist definition of probability, although it is not clear what individual propensities are, or whether they obey the probability calculus.


Conditional probability and independence

The probability of an event B conditional on an event A is defined as

$$P(B|A) = \frac{P(A \cap B)}{P(A)}.$$

This can be interpreted as the probability of B given that A occurs.

Two events A and B are called independent if $P(A \cap B) = P(A)P(B)$, or equivalently if $P(B|A) = P(B)$ or $P(A|B) = P(A)$.


The multiplication law

A restatement of the conditional probability formula is the multiplication law

$$P(A \cap B) = P(B|A)P(A).$$

Example 1
What is the probability of getting two cups in two draws from a Spanish pack of cards?

Write $C_i$ for the event that draw $i$ is a cup, for $i = 1, 2$. Enumerating all the draws with two cups is not entirely trivial. However, the conditional probabilities are easy to calculate:

$$P(C_1 \cap C_2) = P(C_2|C_1)P(C_1) = \frac{9}{39} \times \frac{10}{40} = \frac{3}{52}.$$

The multiplication law can be extended to more than two events. For example,

$$P(A \cap B \cap C) = P(C|A, B)P(B|A)P(A).$$
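Not part of the original notes: a minimal Python sketch that checks Example 1 both exactly, via the multiplication law, and by simulation. It assumes the standard 40-card Spanish pack with 10 cups, which is what the fractions 10/40 and 9/39 above imply.

```python
from fractions import Fraction
import random

# Exact value via the multiplication law: 40-card Spanish pack, 10 cups.
p_exact = Fraction(10, 40) * Fraction(9, 39)
print(p_exact)            # 3/52

# Monte Carlo check: draw two cards without replacement many times.
deck = ["cup"] * 10 + ["other"] * 30
trials = 200_000
hits = sum(1 for _ in range(trials)
           if random.sample(deck, 2) == ["cup", "cup"])
print(hits / trials)      # should be close to 3/52 ≈ 0.0577
```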




The birthday problem

Example 2
What is the probability that among n students in a classroom, at least two will have the same birthday?

http://webspace.ship.edu/deensley/mathdl/stats/Birthday.html

The solution is not obvious but can be found using conditional probability. Let $b_i$ be the birthday of student $i$, for $i = 1, \ldots, n$. It is easiest to calculate the probability that all birthdays are distinct:

$$\begin{aligned}
P(b_1 \neq b_2 \neq \ldots \neq b_n) = {} & P(b_n \notin \{b_1, \ldots, b_{n-1}\} \mid b_1 \neq b_2 \neq \ldots \neq b_{n-1}) \\
& \times P(b_{n-1} \notin \{b_1, \ldots, b_{n-2}\} \mid b_1 \neq b_2 \neq \ldots \neq b_{n-2}) \times \cdots \\
& \times P(b_3 \notin \{b_1, b_2\} \mid b_1 \neq b_2)\, P(b_1 \neq b_2).
\end{aligned}$$


Now clearly,

$$P(b_1 \neq b_2) = \frac{364}{365}, \qquad P(b_3 \notin \{b_1, b_2\} \mid b_1 \neq b_2) = \frac{363}{365},$$

and similarly

$$P(b_i \notin \{b_1, \ldots, b_{i-1}\} \mid b_1 \neq b_2 \neq \ldots \neq b_{i-1}) = \frac{366 - i}{365}$$

for $i = 3, \ldots, n$.

Thus, the probability that at least two students have the same birthday is, for $n < 365$,

$$1 - \frac{364}{365} \times \cdots \times \frac{366 - n}{365} = 1 - \frac{365!}{365^n (365 - n)!}.$$

For $n = 23$, this probability is greater than 0.5, and for $n > 50$ it is virtually one.
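The following Python sketch (not in the original notes) evaluates the birthday formula directly, assuming 365 equally likely birthdays and ignoring leap years.

```python
def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday (365 equally likely days)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

for n in (10, 23, 30, 50):
    print(n, round(p_shared_birthday(n), 4))
# p_shared_birthday(23) ≈ 0.507 > 0.5, and for n around 50 it is already ≈ 0.97.
```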


The law of total probability

The simplest version of this rule is the following.

Theorem 1
For any two events A and B,

$$P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A}).$$

We can also extend the law to the case where $A_1, \ldots, A_n$ form a partition. In this case, we have

$$P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i).$$


Bayes theorem

Theorem 2
For any two events A and B,

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$$

Supposing that $A_1, \ldots, A_n$ form a partition, using the law of total probability we can write Bayes theorem as

$$P(A_j|B) = \frac{P(B|A_j)P(A_j)}{\sum_{i=1}^{n} P(B|A_i)P(A_i)} \quad \text{for } j = 1, \ldots, n.$$
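As an illustration only (the numbers below are hypothetical, not from the notes), a small Python helper that applies Bayes theorem over a partition, using the law of total probability for the denominator.

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(A_j | B) over a partition A_1..A_n, given priors P(A_j)
    and likelihoods P(B | A_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_b = sum(joint)                      # P(B), by the law of total probability
    return [j / p_b for j in joint]

# Hypothetical example: three boxes chosen with equal probability, containing
# 1, 2 and 3 red balls out of 4; B = "a red ball is drawn".
print(bayes_posterior([1/3, 1/3, 1/3], [1/4, 2/4, 3/4]))
# ≈ [0.167, 0.333, 0.5]
```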


The Monty Hall problem

Example 3
The following statement of the problem was given by Marilyn vos Savant in a column in Parade magazine in 1990.

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to switch your choice?"




Simulating the game

Have a look at the following web page:
http://www.stat.sc.edu/~west/javahtml/LetsMakeaDeal.html

Using Bayes theorem:
http://en.wikipedia.org/wiki/Monty_Hall_problem
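A rough Python simulation of the game, in the spirit of the applet linked above (my own sketch, not part of the notes):

```python
import random

def monty_hall(trials=100_000):
    """Estimate the win probabilities for sticking with the first door and for switching."""
    stick_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a door that is neither the contestant's choice nor the car.
        opened = random.choice([d for d in (0, 1, 2) if d != choice and d != car])
        switched = [d for d in (0, 1, 2) if d != choice and d != opened][0]
        stick_wins += (choice == car)
        switch_wins += (switched == car)
    return stick_wins / trials, switch_wins / trials

print(monty_hall())   # roughly (0.33, 0.67): switching doubles the chance of winning
```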


Random variables

A random variable generalizes the idea of probabilities for events. Formally, a random variable $X$ simply assigns a numerical value $x_i$ to each event $A_i$ in the sample space $\Omega$. For mathematicians, we can write $X$ in terms of a mapping, $X : \Omega \to \mathbb{R}$.

Random variables may be classified according to the values they take as
• discrete
• continuous
• mixed


Discrete variables

Discrete variables are those which take values in a discrete set, say $\{x_1, x_2, \ldots\}$. For such variables, we can define the cumulative distribution function,

$$F_X(x) = P(X \le x) = \sum_{i:\, x_i \le x} P(X = x_i),$$

where $P(X = x)$ is the probability function or mass function.

For a discrete variable, the mode is defined to be the point $\hat{x}$ with maximum probability, i.e. such that

$$P(X = x) < P(X = \hat{x}) \quad \text{for all } x \neq \hat{x}.$$


Moments

For any discrete variable $X$, we can define the mean of $X$ to be

$$\mu_X = E[X] = \sum_i x_i P(X = x_i).$$

Recalling the frequency definition of probability, we can interpret the mean as the limiting value of the sample mean from this distribution. Thus, this is a measure of location.

In general we can define the expectation of any function $g(X)$ as

$$E[g(X)] = \sum_i g(x_i) P(X = x_i).$$

In particular, the variance is defined as

$$\sigma^2 = V[X] = E\left[(X - \mu_X)^2\right],$$

and the standard deviation is simply $\sigma = \sqrt{\sigma^2}$. This is a measure of spread.


Chebyshev’s inequalityIt is interesting <strong>to</strong> analyze the probability of being close or far away from themean of a distribution. Chebyshev’s inequality provi<strong>de</strong>s loose bounds whichare valid for any distribution with finite mean <strong>and</strong> variance.Theorem 3For any r<strong>and</strong>om variable, X, with finite mean, µ, <strong>and</strong> variance, σ 2 , then forany k > 0,P (|X − µ| ≥ kσ) ≤ 1 k 2.Thus, we know that P (µ − √ 2σ ≤ X ≤ µ + √ 2σ) ≥ 0.5 for any variable X.<strong>Probability</strong> <strong>and</strong> <strong>Statistics</strong>


Proof

$$\begin{aligned}
P(|X - \mu| \ge k\sigma) &= P\left((X - \mu)^2 \ge k^2\sigma^2\right) \\
&= P\left(\left(\frac{X - \mu}{k\sigma}\right)^2 \ge 1\right) \\
&= E\left[I_{\left\{\left(\frac{X - \mu}{k\sigma}\right)^2 \ge 1\right\}}\right] \\
&\le E\left[\left(\frac{X - \mu}{k\sigma}\right)^2\right] = \frac{1}{k^2},
\end{aligned}$$

where $I$ is an indicator function.


Important discrete distributions

The binomial distribution

Let $X$ be the number of heads in $n$ independent tosses of a coin such that $P(\text{head}) = p$. Then $X$ has a binomial distribution with parameters $n$ and $p$, and we write $X \sim \mathcal{BI}(n, p)$. The mass function is

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n.$$

The mean and variance of $X$ are $np$ and $np(1 - p)$ respectively.


The geometric distribution

Suppose that $Y$ is defined to be the number of tails observed before the first head occurs for the same coin. Then $Y$ has a geometric distribution with parameter $p$, i.e. $Y \sim \mathcal{GE}(p)$, and

$$P(Y = y) = p(1 - p)^y \quad \text{for } y = 0, 1, 2, \ldots$$

The mean and variance of $Y$ are $\frac{1-p}{p}$ and $\frac{1-p}{p^2}$ respectively.


The negative binomial distribution

A generalization of the geometric distribution is the negative binomial distribution. If we define $Z$ to be the number of tails observed before the $r$'th head is observed, then $Z \sim \mathcal{NB}(r, p)$ and

$$P(Z = z) = \binom{r + z - 1}{z} p^r (1 - p)^z \quad \text{for } z = 0, 1, 2, \ldots$$

The mean and variance of $Z$ are $r\frac{1-p}{p}$ and $r\frac{1-p}{p^2}$ respectively.

The negative binomial distribution reduces to the geometric model for the case $r = 1$.


The hypergeometric distribution

Suppose that a pack of $N$ cards contains $R$ red cards and that we deal $n$ cards without replacement. Let $X$ be the number of red cards dealt. Then $X$ has a hypergeometric distribution with parameters $N, R, n$, i.e. $X \sim \mathcal{HG}(N, R, n)$, and

$$P(X = x) = \frac{\binom{R}{x}\binom{N - R}{n - x}}{\binom{N}{n}} \quad \text{for } x = 0, 1, \ldots, n.$$

Example 4
In the Primitiva lottery, a contestant chooses 6 numbers from 1 to 49 and 6 numbers are drawn without replacement. The contestant wins the grand prize if all numbers match. The probability of winning is thus

$$P(X = 6) = \frac{\binom{6}{6}\binom{43}{0}}{\binom{49}{6}} = \frac{6!\,43!}{49!} = \frac{1}{13983816}.$$
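A short Python check of Example 4 (my sketch, not part of the notes; it uses math.comb, available from Python 3.8):

```python
import math

def hypergeom_pmf(x, N, R, n):
    """P(X = x) for a hypergeometric HG(N, R, n) variable."""
    return math.comb(R, x) * math.comb(N - R, n - x) / math.comb(N, n)

# Example 4: probability of matching all 6 numbers in the Primitiva.
p = hypergeom_pmf(6, N=49, R=6, n=6)
print(p, 1 / p)     # ≈ 7.15e-08, i.e. 1 in 13 983 816
```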




What if N and R are large?

For large $N$ and $R$, the factorials in the hypergeometric probability expression are often hard to evaluate.

Example 5
Suppose that $N = 2000$, $R = 500$ and $n = 20$, and that we wish to find $P(X = 5)$. Then the calculation of $2000!$, for example, is very difficult.

Theorem 4
Let $X \sim \mathcal{HG}(N, R, n)$ and suppose that $R, N \to \infty$ with $R/N \to p$. Then

$$P(X = x) \to \binom{n}{x} p^x (1 - p)^{n-x} \quad \text{for } x = 0, 1, \ldots, n.$$


Proof

$$\begin{aligned}
P(X = x) &= \frac{\binom{R}{x}\binom{N - R}{n - x}}{\binom{N}{n}} = \binom{n}{x}\frac{R!\,(N - R)!\,(N - n)!}{(R - x)!\,(N - R - n + x)!\,N!} \\
&\to \binom{n}{x}\frac{R^x (N - R)^{n-x}}{N^n} \to \binom{n}{x} p^x (1 - p)^{n-x}.
\end{aligned}$$

In the example, $p = 500/2000 = 0.25$ and, using the binomial approximation,

$$P(X = 5) \approx \binom{20}{5} 0.25^5\, 0.75^{15} = 0.2023.$$

The exact answer, from Matlab, is 0.2024.
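A numerical check of Example 5 (my sketch, not in the notes). In Python the exact hypergeometric value is easy to obtain because math.comb works with exact integers; the difficulty with 2000! arises mainly when factorials are evaluated directly in floating point.

```python
import math

N, R, n, x = 2000, 500, 20, 5

# Exact hypergeometric probability.
exact = math.comb(R, x) * math.comb(N - R, n - x) / math.comb(N, n)

# Binomial approximation with p = R / N.
p = R / N
approx = math.comb(n, x) * p**x * (1 - p)**(n - x)

print(f"exact = {exact:.4f}, binomial approximation = {approx:.4f}")
# Should print values close to the 0.2024 and 0.2023 quoted above.
```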




The Poisson distribution

Assume that rare events occur on average at a rate $\lambda$ per hour. Then we can often assume that the number of rare events $X$ that occur in a time period of length $t$ has a Poisson distribution with parameter (mean and variance) $\lambda t$, i.e. $X \sim \mathcal{P}(\lambda t)$. Then

$$P(X = x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$

Formally, the conditions for a Poisson distribution are:

• The numbers of events occurring in non-overlapping intervals are independent for all intervals.
• The probability that a single event occurs in a sufficiently small interval of length h is $\lambda h + o(h)$.
• The probability of more than one event in such an interval is $o(h)$.


Continuous variables

Continuous variables are those which can take values in a continuum. For a continuous variable $X$, we can still define the distribution function, $F_X(x) = P(X \le x)$, but we cannot define a probability function $P(X = x)$. Instead, we have the density function

$$f_X(x) = \frac{dF_X(x)}{dx}.$$

Thus, the distribution function can be derived from the density as $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$. In a similar way, moments of continuous variables can be defined as integrals,

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx,$$

and the mode is defined to be the point of maximum density.

For a continuous variable, another measure of location is the median, $\tilde{x}$, defined so that $F_X(\tilde{x}) = 0.5$.


Important continuous variables

The uniform distribution

This is the simplest continuous distribution. A random variable $X$ is said to have a uniform distribution with parameters $a$ and $b$ if

$$f_X(x) = \frac{1}{b - a} \quad \text{for } a < x < b.$$

In this case, we write $X \sim \mathcal{U}(a, b)$ and the mean and variance of $X$ are $\frac{a+b}{2}$ and $\frac{(b-a)^2}{12}$ respectively.


The exponential distribution

Remember that the Poisson distribution models the number of rare events occurring at rate $\lambda$ in a given time period. In this scenario, consider the distribution of the time between two successive events. This is an exponential random variable, $Y \sim \mathcal{E}(\lambda)$, with density function

$$f_Y(y) = \lambda e^{-\lambda y} \quad \text{for } y > 0.$$

The mean and variance of $Y$ are $\frac{1}{\lambda}$ and $\frac{1}{\lambda^2}$ respectively.


The normal distribution

This is probably the most important continuous distribution. A random variable $X$ is said to follow a normal distribution with mean and variance parameters $\mu$ and $\sigma^2$ if

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \quad \text{for } -\infty < x < \infty.$$

In this case, we write $X \sim \mathcal{N}(\mu, \sigma^2)$.

• If $X$ is normally distributed, then $a + bX$ is normally distributed. In particular, $\frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$.
• $P(|X - \mu| \ge \sigma) = 0.3174$, $P(|X - \mu| \ge 2\sigma) = 0.0456$, $P(|X - \mu| \ge 3\sigma) = 0.0026$.
• Any sum of independent normally distributed variables is also normally distributed.


The central limit theorem

One of the main reasons for the importance of the normal distribution is that it can be shown to approximate many real-life situations due to the central limit theorem.

Theorem 5
Given a random sample $X_1, \ldots, X_n$ of size $n$ from some distribution, then under certain conditions the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ approximately follows a normal distribution for large $n$.

Proof See later.

For an illustration of the CLT, see
http://cnx.rice.edu/content/m11186/latest/
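A small simulation illustrating Theorem 5 (my sketch, not from the notes); the exponential distribution is an arbitrary choice of a skewed parent distribution.

```python
import random
from statistics import mean, stdev

# Distribution of the sample mean of n = 30 exponential(1) observations.
def sample_mean(n):
    return mean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(30) for _ in range(10_000)]
m, s = mean(means), stdev(means)
print(round(m, 3), round(s, 3))      # ≈ 1 and ≈ 1/sqrt(30) ≈ 0.18

# Normal-like behaviour: roughly 95% of the sample means lie within 2 standard deviations.
inside = sum(abs(x - m) <= 2 * s for x in means) / len(means)
print(round(inside, 3))              # ≈ 0.95
```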


Mixed variables

Occasionally it is possible to encounter variables which are partially discrete and partially continuous. For example, the time spent waiting for service by a customer arriving in a queue may be zero with positive probability (as the queue may be empty) and otherwise takes some positive value in $(0, \infty)$.


The probability generating function

For a discrete random variable $X$ taking values in some subset of the non-negative integers, the probability generating function $G_X(s)$ is defined as

$$G_X(s) = E\left[s^X\right] = \sum_{x=0}^{\infty} P(X = x)\, s^x.$$

This function has a number of useful properties:

• $G(0) = P(X = 0)$ and, more generally, $P(X = x) = \frac{1}{x!}\left.\frac{d^x G(s)}{ds^x}\right|_{s=0}$.
• $G(1) = 1$, $E[X] = \left.\frac{dG(s)}{ds}\right|_{s=1}$ and, more generally, the $k$'th factorial moment, $E[X(X-1)\cdots(X-k+1)]$, is

$$E\left[\frac{X!}{(X-k)!}\right] = \left.\frac{d^k G(s)}{ds^k}\right|_{s=1}.$$


• The variance of $X$ is

$$V[X] = G''(1) + G'(1) - G'(1)^2.$$

Example 7
Consider a negative binomial variable, $X \sim \mathcal{NB}(r, p)$, with

$$P(X = x) = \binom{r + x - 1}{x} p^r (1 - p)^x \quad \text{for } x = 0, 1, 2, \ldots$$

Then

$$G_X(s) = E[s^X] = \sum_{x=0}^{\infty} s^x \binom{r + x - 1}{x} p^r (1 - p)^x = p^r \sum_{x=0}^{\infty} \binom{r + x - 1}{x} \{(1 - p)s\}^x = \left(\frac{p}{1 - (1 - p)s}\right)^r.$$


$$G_X'(s) = \frac{r\, p^r (1 - p)}{(1 - (1 - p)s)^{r+1}}, \qquad G_X'(1) = r\,\frac{1 - p}{p} = E[X],$$

$$G_X''(s) = \frac{r(r + 1)\, p^r (1 - p)^2}{(1 - (1 - p)s)^{r+2}}, \qquad G_X''(1) = r(r + 1)\left(\frac{1 - p}{p}\right)^2 = E[X(X - 1)],$$

$$V[X] = r(r + 1)\left(\frac{1 - p}{p}\right)^2 + r\,\frac{1 - p}{p} - \left(r\,\frac{1 - p}{p}\right)^2 = r\,\frac{1 - p}{p^2}.$$
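The pgf results can be checked numerically. The following sketch (not in the notes; r = 4 and p = 0.3 are arbitrary choices) sums the NB(r, p) mass function to approximate the mean and variance and compares them with r(1−p)/p and r(1−p)/p².

```python
import math

def nb_pmf(x, r, p):
    """P(X = x) for NB(r, p): number of tails before the r'th head."""
    return math.comb(r + x - 1, x) * p**r * (1 - p)**x

r, p = 4, 0.3
xs = range(0, 500)                       # truncation; the tail beyond 500 is negligible here
mean = sum(x * nb_pmf(x, r, p) for x in xs)
second = sum(x * x * nb_pmf(x, r, p) for x in xs)
var = second - mean**2

print(round(mean, 4), r * (1 - p) / p)       # both ≈ 9.3333
print(round(var, 4), r * (1 - p) / p**2)     # both ≈ 31.1111
```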


The probability generating function of a sum of independent variables

Suppose that $X_1, \ldots, X_n$ are independent with generating functions $G_i(s)$ for $i = 1, \ldots, n$. Let $Y = \sum_{i=1}^{n} X_i$. Then

$$G_Y(s) = E\left[s^Y\right] = E\left[s^{\sum_{i=1}^{n} X_i}\right] = \prod_{i=1}^{n} E\left[s^{X_i}\right] \quad \text{(by independence)} \quad = \prod_{i=1}^{n} G_i(s).$$

Furthermore, if $X_1, \ldots, X_n$ are identically distributed, with common generating function $G_X(s)$, then

$$G_Y(s) = G_X(s)^n.$$


Example 8
Suppose that $X_1, \ldots, X_n$ are Bernoulli trials so that

$$P(X_i = 1) = p \quad \text{and} \quad P(X_i = 0) = 1 - p \quad \text{for } i = 1, \ldots, n.$$

Then the probability generating function of any $X_i$ is $G_X(s) = 1 - p + sp$. Now consider a binomial random variable, $Y = \sum_{i=1}^{n} X_i$. Then

$$G_Y(s) = (1 - p + sp)^n$$

is the binomial probability generating function.


Another useful property of pgfs

If $N$ is a discrete variable taking values on the non-negative integers with pgf $G_N(s)$, and if $X_1, X_2, \ldots$ is a sequence of independent and identically distributed variables with pgf $G_X(s)$, then for $Y = \sum_{i=1}^{N} X_i$ we have

$$G_Y(s) = E\left[s^{\sum_{i=1}^{N} X_i}\right] = E\left[E\left[s^{\sum_{i=1}^{N} X_i} \,\middle|\, N\right]\right] = E\left[G_X(s)^N\right] = G_N(G_X(s)).$$

This result is useful in the study of branching processes. See the course in Stochastic Processes.


The moment generating function

For any variable $X$, the moment generating function of $X$ is defined to be

$$M_X(s) = E\left[e^{sX}\right].$$

This generates the moments of $X$, since

$$M_X(s) = E\left[\sum_{i=0}^{\infty} \frac{(sX)^i}{i!}\right] \quad \text{and so} \quad \left.\frac{d^i M_X(s)}{ds^i}\right|_{s=0} = E\left[X^i\right].$$


Example 9
Suppose that $X \sim \mathcal{G}(\alpha, \beta)$. Then

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0,$$

$$M_X(s) = \int_0^{\infty} e^{sx}\, \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}\,dx = \int_0^{\infty} \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-(\beta - s)x}\,dx = \left(\frac{\beta}{\beta - s}\right)^\alpha,$$

$$\frac{dM}{ds} = \frac{\alpha\,\beta^\alpha}{(\beta - s)^{\alpha + 1}}, \qquad \left.\frac{dM}{ds}\right|_{s=0} = \frac{\alpha}{\beta} = E[X].$$


Example 10
Suppose that $X \sim \mathcal{N}(0, 1)$. Then

$$\begin{aligned}
M_X(s) &= \int_{-\infty}^{\infty} e^{sx}\, \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[x^2 - 2sx\right]\right) dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[x^2 - 2sx + s^2 - s^2\right]\right) dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[(x - s)^2 - s^2\right]\right) dx \\
&= e^{s^2/2}.
\end{aligned}$$


The moment generating function of a sum of independent variables

Suppose we have a sequence of independent variables $X_1, X_2, \ldots, X_n$ with mgfs $M_1(s), \ldots, M_n(s)$. Then, if $Y = \sum_{i=1}^{n} X_i$, it is easy to see that

$$M_Y(s) = \prod_{i=1}^{n} M_i(s),$$

and if the variables are identically distributed with common mgf $M_X(s)$, then $M_Y(s) = M_X(s)^n$.


Example 11
Suppose that $X_i \sim \mathcal{E}(\lambda)$ for $i = 1, \ldots, n$ are independent. Then

$$M_X(s) = \int_0^{\infty} e^{sx}\, \lambda e^{-\lambda x}\,dx = \lambda \int_0^{\infty} e^{-(\lambda - s)x}\,dx = \frac{\lambda}{\lambda - s}.$$

Therefore the mgf of $Y = \sum_{i=1}^{n} X_i$ is given by

$$M_Y(s) = \left(\frac{\lambda}{\lambda - s}\right)^n,$$

which we can recognize as the mgf of a gamma distribution, $Y \sim \mathcal{G}(n, \lambda)$.
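A simulation check of Example 11 (my sketch, not part of the notes): the sample mean and variance of Y should be close to the Gamma(n, λ) values n/λ and n/λ². This only checks moments, not the full distribution; n and λ are arbitrary.

```python
import random
from statistics import mean, variance

n, lam, reps = 5, 2.0, 50_000
ys = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]

print(round(mean(ys), 3), n / lam)          # ≈ 2.5
print(round(variance(ys), 3), n / lam**2)   # ≈ 1.25
```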


Proof of the central limit theorem

For any variable $Y$ with zero mean and unit variance and such that all moments exist, the moment generating function is

$$M_Y(s) = E\left[e^{sY}\right] = 1 + \frac{s^2}{2} + o(s^2).$$

Now assume that $X_1, \ldots, X_n$ are a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. Then we can define the standardized variables $Y_i = \frac{X_i - \mu}{\sigma}$, which have mean 0 and variance 1 for $i = 1, \ldots, n$, and then

$$Z_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\sum_{i=1}^{n} Y_i}{\sqrt{n}}.$$

Now, suppose that $M_Y(s)$ is the mgf of $Y_i$, for $i = 1, \ldots, n$. Then

$$M_{Z_n}(s) = M_Y\left(s/\sqrt{n}\right)^n$$




and therefore,

$$M_{Z_n}(s) = \left(1 + \frac{s^2}{2n} + o(s^2/n)\right)^n \to e^{s^2/2},$$

which is the mgf of a standard normally distributed random variable.

To make this result valid for variables that do not necessarily possess all their moments, we can use essentially the same arguments but with the characteristic function $C_X(s) = E[e^{isX}]$ instead of the moment generating function.


Multivariate distributions

It is straightforward to extend the concept of a random variable to the multivariate case. Full details are included in the course on Multivariate Analysis.

For two discrete variables $X$ and $Y$, we can define the joint probability function at $(X = x, Y = y)$ to be $P(X = x, Y = y)$, and in the continuous case we similarly define a joint density function $f_{X,Y}(x, y)$. These satisfy

$$\sum_x \sum_y P(X = x, Y = y) = 1, \qquad \sum_y P(X = x, Y = y) = P(X = x), \qquad \sum_x P(X = x, Y = y) = P(Y = y),$$

and similarly for the continuous case (with integrals in place of sums).


Conditional distributions

The conditional distribution of $Y$ given $X = x$ is defined to be

$$f_{Y|x}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

Two variables are said to be independent if $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ for all $x, y$, or equivalently if $f_{Y|x}(y|x) = f_Y(y)$ or $f_{X|y}(x|y) = f_X(x)$.

We can also define the conditional expectation of $Y|x$ to be

$$E[Y|x] = \int y\, f_{Y|x}(y|x)\,dy.$$


Covariance and correlation

It is useful to obtain a measure of the degree of relation between two variables. Such a measure is the correlation.

We can define the expectation of any function $g(X, Y)$ in a similar way to the univariate case,

$$E[g(X, Y)] = \int\!\!\int g(x, y)\, f_{X,Y}(x, y)\,dx\,dy.$$

In particular, the covariance is defined as

$$\sigma_{X,Y} = \mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y].$$

Obviously, the units of the covariance are the product of the units of $X$ and $Y$. A scale-free measure is the correlation,

$$\rho_{X,Y} = \mathrm{Corr}[X, Y] = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}.$$


Properties of the correlation are as follows:

• $-1 \le \rho_{X,Y} \le 1$.
• $\rho_{X,Y} = 0$ if $X$ and $Y$ are independent. (The reverse is not necessarily true!)
• $\rho_{X,Y} = 1$ if there is an exact, positive linear relation between $X$ and $Y$, so that $Y = a + bX$ where $b > 0$.
• $\rho_{X,Y} = -1$ if there is an exact, negative linear relation between $X$ and $Y$, so that $Y = a + bX$ where $b < 0$.


Conditional expectations and variances

Theorem 6
For two variables $X$ and $Y$,

$$E[Y] = E[E[Y|X]], \qquad V[Y] = E[V[Y|X]] + V[E[Y|X]].$$

Proof

$$\begin{aligned}
E[E[Y|X]] &= E\left[\int y\, f_{Y|X}(y|X)\,dy\right] \\
&= \int f_X(x) \int y\, f_{Y|X}(y|x)\,dy\,dx \\
&= \int\!\!\int y\, f_{X,Y}(x, y)\,dx\,dy \\
&= \int y\, f_Y(y)\,dy = E[Y]
\end{aligned}$$


Example 12
A random variable $X$ has a beta distribution, $X \sim \mathcal{B}(\alpha, \beta)$, if

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} \quad \text{for } 0 < x < 1.$$

The mean of $X$ is $E[X] = \frac{\alpha}{\alpha + \beta}$.

Suppose now that we toss a coin with probability $P(\text{heads}) = X$ a total of $n$ times and that we require the distribution of the number of heads, $Y$. This is the beta-binomial distribution, which is quite complicated:


$$\begin{aligned}
P(Y = y) &= \int_0^1 P(Y = y|X = x)\, f_X(x)\,dx \\
&= \int_0^1 \binom{n}{y} x^y (1 - x)^{n - y}\, \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}\,dx \\
&= \binom{n}{y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 x^{\alpha + y - 1} (1 - x)^{\beta + n - y - 1}\,dx \\
&= \binom{n}{y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha + y)\Gamma(\beta + n - y)}{\Gamma(\alpha + \beta + n)} \quad \text{for } y = 0, 1, \ldots, n.
\end{aligned}$$




We could try to calculate the mean of Y directly using the above probability function. However, this would be very complicated. There is a much easier way:

$$E[Y] = E[E[Y|X]] = E[nX] \quad \text{(because } Y|X \sim \mathcal{BI}(n, X)\text{)} \quad = n\,\frac{\alpha}{\alpha + \beta}.$$
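A simulation check of this conditional-expectation argument (my sketch, not in the notes; the values of α, β and n are arbitrary):

```python
import random
from statistics import mean

# Draw X ~ Beta(alpha, beta), then Y | X ~ Binomial(n, X), and compare the
# sample mean of Y with n * alpha / (alpha + beta).
alpha, beta, n, reps = 2.0, 5.0, 10, 100_000

def one_draw():
    x = random.betavariate(alpha, beta)
    return sum(random.random() < x for _ in range(n))   # binomial(n, x) by direct tossing

ys = [one_draw() for _ in range(reps)]
print(round(mean(ys), 3), n * alpha / (alpha + beta))   # both ≈ 2.857
```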


Statistics

Statistics is the science of data analysis. This is concerned with

• how to generate suitable samples of data,
• how to summarize samples of data to illustrate their important features,
• how to make inference about populations given sample data.


Sampling

In statistical problems we usually wish to study the characteristics of some population. However, it is usually impossible to measure the values of the variables of interest for all members of the population. This implies the use of a sample.

There are many possible ways of selecting a sample. Non-random approaches include:

• Convenience sampling
• Volunteer sampling
• Quota sampling

Such approaches can suffer from induced biases.


Random sampling

A better approach is random sampling. For a population of elements, say $e_1, \ldots, e_N$, a simple random sample of size $n$ selects every possible $n$-tuple of elements with equal probability. Unrepresentative samples can still be selected by this approach, but there is no a priori bias which makes this likely.

When the population is large or heterogeneous, other random sampling approaches may be preferred. For example:

• Systematic sampling
• Stratified sampling
• Cluster sampling
• Multi-stage sampling

Sampling theory is studied in more detail in Quantitative Methods.


Descriptive statistics

Given a data sample, it is important to develop methods to summarize the important features of the data, both visually and numerically. Different approaches should be used for different types of data.

• Categorical data:
  – Nominal data,
  – Ordinal data.
• Numerical data:
  – Discrete data,
  – Continuous data.


Categorical data

Categorical data are those that take values in different categories, e.g. blood types, favourite colours, etc. These data may be nominal, when the different categories have no inherent sense of order, or ordinal, when the categories are naturally ordered.

Example 13
The following table gives the frequencies of the different first moves in a chess game found on 20/02/1996 using the search engine of http://www.chessgames.com/.

Opening   Frequency   Relative frequency
e4         178130     0.4794
d4         125919     0.3389
Nf3         32206     0.0867
c4          28796     0.0776
Others       6480     0.0174
Total      371531     1.0000


This is an example of a sample of nominal data. The frequency table has been augmented with the relative frequencies or proportions in each class. We can see immediately that the most popular opening, or modal class, is e4, played in nearly half the games.

A nice way of visualizing the data is via a pie chart. This could be augmented with the frequencies or relative frequencies in each class.

(Figure: pie chart of the opening frequencies e4, d4, Nf3, c4 and Others.)


An alternative display is a bar chart, which can be constructed using frequencies or relative frequencies.

(Figure: bar chart of the opening frequencies, ordered e4, d4, Nf3, c4, Others.)

When the data are nominal, it is usual to order the bars from highest to lowest frequency. With ordinal data, it is more sensible to use the natural ordering of the classes.


A final approach, which is good to look at but not so easy to interpret, is the pictogram. The area of each image is proportional to the frequency.

(Figure: pictogram for the openings e4, d4, Nf3, c4 and Others.)


Measuring the relation between two categorical variables

Often we may record the values of two (or more) categorical variables. In such cases we are interested in whether or not there is any relation between the variables. To do this, we can construct a contingency table.

Example 14
The following data, given in Morrell (1999), come from a South African study of single-birth children. At birth in 1990 it was recorded whether or not the mothers received medical aid and later, in 1995, the researchers attempted to trace the children. Those children found were included in the five-year group for further study.

                  Children not traced   Five-Year Group
Had Medical Aid   195                   46
No Medical Aid    979                   370

CH Morrell (1999). Simpson's Paradox: An Example From a Longitudinal Study in South Africa. Journal of Statistics Education, 7.


Analysis of a contingency table

In order to analyze the contingency table it is useful to first calculate the marginal totals:

                  Children not traced   Five-Year Group   Total
Had Medical Aid   195                   46                241
No Medical Aid    979                   370               1349
Total             1174                  416               1590

and then to convert the original data into proportions:

                  Children not traced   Five-Year Group   Total
Had Medical Aid   .123                  .029              .152
No Medical Aid    .615                  .233              .848
Total             .738                  .262              1


Then it is also possible to calculate conditional frequencies. For example, the proportion of children not traced who had received medical aid is

195/1174 = .123/.738 = .166.

Finally, we may often wish to assess whether there exists any relation between the two variables. In order to do this, we can calculate how many data we would expect to see in each cell, given the marginal totals, if the two variables really were independent:

                  Children not traced   Five-Year Group   Total
Had Medical Aid   177.95                63.05             241
No Medical Aid    996.05                352.95            1349
Total             1174                  416               1590

Comparing these expected totals with the observed frequencies, we could set up a formal statistical (χ²) test for independence.
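A small sketch, not from the notes, that reproduces the expected counts under independence and computes the χ² statistic for the table of Example 14:

```python
observed = [[195, 46],
            [979, 370]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Expected count in cell (i, j) under independence: row total * column total / grand total.
expected = [[r * c / grand for c in col_totals] for r in row_totals]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

print(expected)          # ≈ [[177.95, 63.05], [996.05, 352.95]]
print(round(chi2, 2))    # the test statistic, to be compared with a chi-square(1) quantile
```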


Simpson’s paradoxSometimes we can observe apparently paradoxical results when a populationwhich contains heterogeneous groups is subdivi<strong>de</strong>d. The following example ofthe so called Simpson’s paradox comes from the same study.http://www.amstat.org/publications/jse/secure/v7n3/datasets.morrell.cfm<strong>Probability</strong> <strong>and</strong> <strong>Statistics</strong>


Numerical data

When data are naturally numerical, we can use both graphical and numerical approaches to summarize their important characteristics. For discrete data, we can use frequency tables and bar charts in a similar way to the categorical case.

Example 15
The table reports the number of previous convictions for 283 adult males arrested for felonies in the USA, taken from Holland et al (1981).

# Previous convictions   Frequency   Rel. freq.   Cum. freq.   Cum. rel. freq.
0                          0         0.0000         0          0.0000
1                         16         0.0565        16          0.0565
2                         27         0.0954        43          0.1519
3                         37         0.1307        80          0.2827
4                         46         0.1625       126          0.4452
5                         36         0.1272       162          0.5724
6                         40         0.1413       202          0.7138
7                         31         0.1095       233          0.8233
8                         27         0.0954       260          0.9187
9                         13         0.0459       273          0.9647
10                         8         0.0283       281          0.9929
11                         2         0.0071       283          1.0000
> 11                       0         0.0000       283          1.0000

TR Holland, M Levi & GE Beckett (1981). Associations Between Violent And Nonviolent Criminality: A Canonical Contingency-Table Analysis. Multivariate Behavioral Research, 16, 237–241.


Note that we have augmented the table with relative frequencies and cumulative frequencies. We can construct a bar chart of frequencies as earlier, but we could also use cumulative or relative frequencies.

(Figure: bar chart of frequency against number of previous convictions, from 0 to 12.)

Note that the bars are much thinner in this case. Also, we can see that the distribution is slightly positively skewed, or skewed to the right.


Continuous data and histograms

With continuous data, we should use histograms instead of bar charts. The main difficulty is in choosing the number of classes. We can see the effects of choosing different bar widths on the following web page:

http://www.shodor.org/interactivate/activities/histogram/

An empirical rule is to choose around √n classes, where n is the number of data. Similar rules are used by the main statistical packages.

It is also possible to illustrate the differences between two groups of individuals using histograms. Here, we should use the same classes for both groups.


Example 16
The table summarizes the hourly wage levels of 30 Spanish men and 25 Spanish women (with at least secondary education) who work more than 15 hours per week.

                 Men            Women
Interval         n_i    f_i     n_i    f_i
[300, 600)       1      .033    0      0
[600, 900)       1      .033    1      .04
[900, 1200)      2      .067    7      .28
[1200, 1500)     5      .167    10     .4
[1500, 1800)     10     .333    6      .24
[1800, 2100)     8      .267    1      .04
[2100, 2400)     3      .100    0      0
> 2400           0      0       0      0
Total            30     1       25     1

J Dolado and V Llorens (2004). Gender Wage Gaps by Education in Spain: Glass Floors vs. Glass Ceilings, CEPR DP, 4203.
http://www.eco.uc3m.es/temp/dollorens2.pdf


(Figure: back-to-back histogram of the wage distributions, men on the left and women on the right, with wage on the vertical axis from 300 to 2400 and relative frequency on the horizontal axes.)

The male average wage is a little higher, and the distribution of the male wages is more disperse and asymmetric.


Histograms with intervals of different widths

In this case, the histogram is constructed so that the area of each bar is proportional to the number of data.

Example 17
The following data are the results of a questionnaire given to marijuana users concerning their weekly consumption of marijuana.

g / week    Frequency
[0, 3)       94
[3, 11)     269
[11, 18)     70
[18, 25)     48
[25, 32)     31
[32, 39)     10
[39, 46)      5
[46, 74)      2
> 74          0

Landrigan et al (1983). Paraquat and marijuana: epidemiologic risk assessment. Amer. J. Public Health, 73, 784-788.


We augment the table with relative frequencies and bar widths and heights.

g / week    width   n_i   f_i    height
[0, 3)      3        94   .178   .0592
[3, 11)     8       269   .509   .0636
[11, 18)    7        70   .132   .0189
[18, 25)    7        48   .091   .0130
[25, 32)    7        31   .059   .0084
[32, 39)    7        10   .019   .0027
[39, 46)    7         5   .009   .0014
[46, 74)    28        2   .004   .0001
> 74        0         0   0      0
Total               529   1

We use the formula

height = relative frequency / interval width.
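A minimal sketch (not in the notes) recomputing the heights of Example 17 as relative frequency divided by class width, so that each bar's area equals the class's relative frequency:

```python
classes = [((0, 3), 94), ((3, 11), 269), ((11, 18), 70), ((18, 25), 48),
           ((25, 32), 31), ((32, 39), 10), ((39, 46), 5), ((46, 74), 2)]

total = sum(freq for _, freq in classes)            # 529
for (lo, hi), freq in classes:
    width = hi - lo
    rel = freq / total
    print(f"[{lo}, {hi}): width={width:2d}  f={rel:.3f}  height={rel / width:.4f}")
# Reproduces the height column of the table, e.g. [3, 11): f ≈ .509, height ≈ .0636.
```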


(Figure: histogram of weekly consumption, with g/week on the horizontal axis and relative frequency/width on the vertical axis.)

The distribution is very skewed to the right.


Other graphical methods

• The frequency polygon. A histogram is constructed and lines are used to join the bars at their centres. Usually the histogram is then removed. This simulates the probability density function.
• The cumulative frequency polygon. As above, but using a cumulative frequency histogram and joining at the end of each bar.
• The stem and leaf plot. This is like a histogram but retains the original numbers.


Sample moments

For a sample $x_1, \ldots, x_n$ of numerical data, the sample mean is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the sample variance is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. The sample standard deviation is $s = \sqrt{s^2}$.

The sample mean may be interpreted as an estimator of the population mean. It is easiest to see this if we consider grouped data, say $x_1, \ldots, x_k$, where $x_j$ is observed $n_j$ times in total and $\sum_{j=1}^{k} n_j = n$. Then the sample mean is

$$\bar{x} = \frac{1}{n}\sum_{j=1}^{k} n_j x_j = \sum_{j=1}^{k} f_j x_j,$$

where $f_j$ is the proportion of times that $x_j$ was observed.

When $n \to \infty$, then (using the frequency definition of probability) we know that $f_j \to P(X = x_j)$ and so $\bar{x} \to \mu_X$, the true population mean.

Sometimes the sample variance is defined with a denominator of $n$ instead of $n - 1$. However, in this case it is a biased estimator of $\sigma^2$.


Problems with outliers: the median and interquartile range

The mean and standard deviation are good estimators of the location and spread of the data if there are no outliers or if the sample is reasonably symmetric. Otherwise, it is better to use the sample median and interquartile range.

Assume that the sample data are ordered so that $x_1 \le x_2 \le \ldots \le x_n$. Then the sample median is defined to be

$$\tilde{x} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even.} \end{cases}$$

For example, for the sample 1, 2, 6, 7, 8, 9, 11 the median is 7, and for the sample 1, 2, 6, 7, 8, 9, 11, 12 the median is 7.5. We can think of the median as dividing the sample in two.


We can also define the quartiles in a similar way. The lower quartile is $Q_1 = x_{\frac{n+1}{4}}$ and the upper quartile is $Q_3 = x_{\frac{3(n+1)}{4}}$, where if the index is not a whole number the value is obtained by interpolation.

Thus, for the sample 1, 2, 6, 7, 8, 9, 11, we have $Q_1 = 2$ and $Q_3 = 9$. For the sample 1, 2, 6, 7, 8, 9, 11, 12, we have $\frac{n+1}{4} = 2.25$, so that

$$Q_1 = 2 + 0.25(6 - 2) = 3,$$

and $\frac{3(n+1)}{4} = 6.75$, so

$$Q_3 = 9 + 0.75(11 - 9) = 10.5.$$

A nice visual summary of a data sample using the median, quartiles and range of the data is the so-called box and whisker plot or boxplot.

http://en.wikipedia.org/wiki/Box_plot
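A sketch (not from the notes) implementing this interpolation rule; note that statistical packages use several slightly different quartile conventions, so other software may give slightly different values.

```python
def quartile(sorted_xs, q):
    """q-th quartile (q = 1, 2, 3) of an ordered sample, using position q*(n+1)/4
    with linear interpolation, as in the notes. q = 2 gives the median."""
    n = len(sorted_xs)
    pos = q * (n + 1) / 4          # 1-based position
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return sorted_xs[lo - 1]
    return sorted_xs[lo - 1] + frac * (sorted_xs[lo] - sorted_xs[lo - 1])

data = [1, 2, 6, 7, 8, 9, 11, 12]
print(quartile(data, 1), quartile(data, 2), quartile(data, 3))   # 3.0 7.5 10.5
```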


Correlation and regression

Often we are interested in modeling the extent of a linear relationship between two data samples.

Example 18
In a study on the treatment of diabetes, the researchers measured patients' weight losses, y, and their initial weights on diagnosis, x, to see if weight loss was influenced by initial weight.

X   225  235  173  223  200  199  129  242
Y    15   44   31   39    6   16   21   44
X   140  156  146  195  155  185  150  149
Y     5   12   -3   19   10   24   -3   10

In order to assess the relationship, it is useful to plot these data as a scatter plot.


We can see that there is a positive relationship between initial weight and weight loss.


The sample covariance

In such cases, we can measure the extent of the relationship using the sample correlation. Given a sample $(x_1, y_1), \ldots, (x_n, y_n)$, the sample covariance is defined to be

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

In our example, we have

$$\bar{x} = \frac{1}{16}(225 + 235 + \ldots + 149) = 181.375,$$

$$\bar{y} = \frac{1}{16}(15 + 44 + \ldots + 10) = 18.125,$$

$$s_{xy} = \frac{1}{16}\left\{(225 - 181.375)(15 - 18.125) + (235 - 181.375)(44 - 18.125) + \ldots + (149 - 181.375)(10 - 18.125)\right\} \approx 361.64.$$


The sample correlation

The sample correlation is

$$r_{xy} = \frac{s_{xy}}{s_x s_y},$$

where $s_x$ and $s_y$ are the two standard deviations. This has properties similar to the population correlation:

• $-1 \le r_{xy} \le 1$.
• $r_{xy} = 1$ if $y = a + bx$ and $r_{xy} = -1$ if $y = a - bx$ for some $b > 0$.
• If there is no relationship between the two variables, then the correlation is (approximately) zero.

In our example, we find that $s_x^2 \approx 1261.98$ and $s_y^2 \approx 211.23$, so that $s_x \approx 35.52$ and $s_y \approx 14.53$, which implies that

$$r_{xy} = \frac{361.64}{35.52 \times 14.53} \approx 0.70,$$

indicating a strong, positive relationship between the two variables.
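The Example 18 figures can be reproduced with a few lines of Python (my sketch, not part of the notes). Note that the worked example above divides by n = 16 rather than n − 1 = 15; the common divisor cancels in the correlation and in the regression slope, so it does not affect the conclusions.

```python
from math import sqrt

x = [225, 235, 173, 223, 200, 199, 129, 242, 140, 156, 146, 195, 155, 185, 150, 149]
y = [15, 44, 31, 39, 6, 16, 21, 44, 5, 12, -3, 19, 10, 24, -3, 10]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Divisor n, to match the numbers quoted in the example.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
sx2 = sum((xi - xbar) ** 2 for xi in x) / n
sy2 = sum((yi - ybar) ** 2 for yi in y) / n

r = sxy / sqrt(sx2 * sy2)
print(xbar, ybar)                                     # 181.375 18.125
print(round(sxy, 2), round(sx2, 2), round(sy2, 2))    # ≈ 361.64, 1261.98, 211.23
print(round(r, 2))                                    # ≈ 0.70
```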


Correlation only measures linear relationships!

High or low correlations can often be misleading.

In both cases, the variables have strong, non-linear relationships. Thus, whenever we are using correlation or building regression models, it is always important to plot the data first.




Spurious correlation

Correlation is often associated with causation. If X and Y are highly correlated, it is often assumed that X causes Y or that Y causes X.


Example 19
Springfield had just spent millions of dollars creating a highly sophisticated "Bear Patrol" in response to the sighting of a single bear the week before.

Homer: Not a bear in sight. The "Bear Patrol" is working like a charm.
Lisa: That's specious reasoning, Dad.
Homer: [uncomprehendingly] Thanks, honey.
Lisa: By your logic, I could claim that this rock keeps tigers away.
Homer: Hmm. How does it work?
Lisa: It doesn't work. (pause) It's just a stupid rock!
Homer: Uh-huh.
Lisa: But I don't see any tigers around, do you?
Homer: (pause) Lisa, I want to buy your rock.

"Much Apu About Nothing", The Simpsons, series 7.




Example 20
1988 US census data showed that the number of churches in a city was highly correlated with the number of violent crimes. Does this imply that having more churches means that there will be more crime, or that having more crime means that more churches are built?

Both variables are highly correlated with population. The correlation between them is spurious.


Regression

A model representing an approximately linear relation between x and y is

$$y = \alpha + \beta x + \epsilon,$$

where $\epsilon$ is a prediction error.

In this formulation, y is the dependent variable, whose value is modeled as depending on the value of x, the independent variable.

How should we fit such a model to the data sample?


Least squares

We wish to find the line which best fits the sample data $(x_1, y_1), \ldots, (x_n, y_n)$. In order to do this, we should choose the line, $y = a + bx$, which in some way minimizes the prediction errors or residuals,

$$e_i = y_i - (a + bx_i) \quad \text{for } i = 1, \ldots, n.$$


A minimum criterion would be that $\sum_{i=1}^{n} e_i = 0$. However, many lines satisfy this, for example $y = \bar{y}$. Thus, we need a stronger constraint.

The standard way of doing this is to choose to minimize the sum of squared errors, $E(a, b) = \sum_{i=1}^{n} e_i^2$.

Theorem 7
For a sample $(x_1, y_1), \ldots, (x_n, y_n)$, the line of the form $y = a + bx$ which minimizes the sum of squared errors, $E(a, b) = \sum_{i=1}^{n} (y_i - a - bx_i)^2$, is such that

$$b = \frac{s_{xy}}{s_x^2}, \qquad a = \bar{y} - b\bar{x}.$$


Proof Suppose that we fit the line $y = a + bx$. We want to minimize the value of $E(a, b)$. Recall that at the minimum,

$$\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0.$$

Now, $E = \sum_{i=1}^{n} (y_i - a - bx_i)^2$ and therefore

$$\frac{\partial E}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - bx_i),$$

and at the minimum

$$0 = -2\sum_{i=1}^{n} (y_i - a - bx_i) = -2\left(n\bar{y} - na - nb\bar{x}\right) \;\Rightarrow\; a = \bar{y} - b\bar{x}.$$


$$\frac{\partial E}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - a - bx_i),$$

and at the minimum

$$0 = -2\left(\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i (a + bx_i)\right),$$

so, substituting for $a$,

$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} x_i (\bar{y} - b\bar{x} + bx_i) = n\bar{x}\bar{y} + b\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right),$$

and hence

$$b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{n s_{xy}}{n s_x^2} = \frac{s_{xy}}{s_x^2}.$$


We will fit the regression line to the data of our example on the weights of diabetics. We have seen earlier that $\bar{x} = 181.375$, $\bar{y} = 18.125$, $s_{xy} = 361.64$, $s_x^2 = 1261.98$ and $s_y^2 = 211.23$.

Thus, if we wish to predict the values of y (reduction in weight) in terms of x (original weight), the least squares regression line is $y = a + bx$, where

$$b = \frac{361.64}{1261.98} \approx 0.287, \qquad a = 18.125 - 0.287 \times 181.375 \approx -33.85.$$

The following diagram shows the fitted regression line.
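A sketch (not in the notes) that refits the line from the raw data using b = s_xy/s_x² and a = ȳ − b x̄:

```python
x = [225, 235, 173, 223, 200, 199, 129, 242, 140, 156, 146, 195, 155, 185, 150, 149]
y = [15, 44, 31, 39, 6, 16, 21, 44, 5, 12, -3, 19, 10, 24, -3, 10]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
sx2 = sum((xi - xbar) ** 2 for xi in x) / n

b = sxy / sx2
a = ybar - b * xbar
print(round(b, 3), round(a, 2))       # ≈ 0.287 and ≈ -33.85

# Predicted weight loss for a 220 lb patient, as in the notes:
print(round(a + b * 220, 2))          # ≈ 29 lbs (the notes give 29.29 using the rounded coefficients)
```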


(Figure: scatter plot of the diabetes data with the fitted regression line.)


We can use this line to predict the weight loss of a diabetic given their initial weight. Thus, for a diabetic who weighed 220 pounds on diagnosis, we would predict a weight loss of around

$$\hat{y} = -33.85 + 0.287 \times 220 = 29.29 \text{ lbs}.$$

Note that we should be careful when making predictions outside the range of the data. For example, the linear predicted weight gain for a 100 lb patient would be around 5 lbs, but it is not clear that the linear relationship still holds at such low values.


Residual analysis

The residuals or prediction errors are the differences $e_i = y_i - (a + bx_i)$. It is useful to see whether the average prediction error is small or large. Thus, we can define the residual variance

$$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2,$$

and the residual standard deviation, $s_e = \sqrt{s_e^2}$.

In our example, we have $e_1 = 15 - (-33.85 + 0.287 \times 225)$, $e_2 = 44 - (-33.85 + 0.287 \times 235)$, etc., and after some calculation the residual variance can be shown to be $s_e^2 \approx 123$. Calculating the results this way is very slow; there is a faster method.


Theorem 8

$$\bar{e} = 0, \qquad s_e^2 = s_y^2\left(1 - r_{xy}^2\right).$$

Proof

$$\begin{aligned}
\bar{e} &= \frac{1}{n}\sum_{i=1}^{n} \left(y_i - (a + bx_i)\right) \\
&= \frac{1}{n}\sum_{i=1}^{n} \left(y_i - (\bar{y} - b\bar{x} + bx_i)\right) \quad \text{by definition of } a \\
&= \frac{1}{n}\left(\sum_{i=1}^{n} (y_i - \bar{y}) - b\sum_{i=1}^{n} (x_i - \bar{x})\right) \\
&= 0
\end{aligned}$$


$$\begin{aligned}
s_e^2 &= \frac{1}{n-1}\sum_{i=1}^{n} \left(y_i - (a + bx_i)\right)^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n} \left(y_i - (\bar{y} - b\bar{x} + bx_i)\right)^2 \quad \text{by definition of } a \\
&= \frac{1}{n-1}\sum_{i=1}^{n} \left((y_i - \bar{y}) - b(x_i - \bar{x})\right)^2 \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} (y_i - \bar{y})^2 - 2b\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) + b^2\sum_{i=1}^{n} (x_i - \bar{x})^2\right) \\
&= s_y^2 - 2b\, s_{xy} + b^2 s_x^2 = s_y^2 - 2\frac{s_{xy}}{s_x^2}\, s_{xy} + \left(\frac{s_{xy}}{s_x^2}\right)^2 s_x^2 \quad \text{by definition of } b \\
&= s_y^2 - \frac{s_{xy}^2}{s_x^2} = s_y^2\left(1 - \left(\frac{s_{xy}}{s_x s_y}\right)^2\right) = s_y^2\left(1 - r_{xy}^2\right)
\end{aligned}$$


Interpretation

This result shows that

$$\frac{s_e^2}{s_y^2} = 1 - r_{xy}^2.$$

Consider the problem of estimating y. If we only observe $y_1, \ldots, y_n$, then our best estimate is $\bar{y}$ and the variance of our data is $s_y^2$.

Given the x data, our best estimate is the regression line, and the residual variance is $s_e^2$.

Thus, the percentage reduction in variance due to fitting the regression line is

$$R^2 = r_{xy}^2 \times 100\%.$$

In our example, $r_{xy} \approx 0.7$, so $R^2 = 0.49 \times 100\% = 49\%$.


Graphing the residuals

As we have seen earlier, the correlation between two variables can be high even when there is a strong non-linear relation.

Whenever we fit a linear regression model, it is important to use residual plots in order to check the adequacy of the fit.

The regression line for the following five groups of data, from Bassett et al (1986), is the same, namely

$$y = 18.43 + 0.28x.$$

Bassett, E. et al (1986). Statistics: Problems and Solutions. London: Edward Arnold.


(Figure: scatter plots of the five data sets, each shown with the common fitted line y = 18.43 + 0.28x.)


• The first case is a standard regression.
• In the second case, we have a non-linear fit.
• In the third case, we can see the influence of an outlier.
• The fourth case is a regression but . . .
• In the final case, we see that one point is very influential.

Now we can observe the residuals.


In case 4 we see the residuals increasing as y increases.


Two regression lines

So far, we have used the least squares technique to fit the line $y = a + bx$, where $a = \bar{y} - b\bar{x}$ and $b = \frac{s_{xy}}{s_x^2}$.

We could also rewrite the linear equation in terms of x and try to fit $x = c + dy$. Then, via least squares, we have $c = \bar{x} - d\bar{y}$ and $d = \frac{s_{xy}}{s_y^2}$.

We might expect these to be the same line, but rewriting,

$$y = a + bx \;\Rightarrow\; x = -\frac{a}{b} + \frac{1}{b}y \neq c + dy.$$

It is important to notice that the least squares technique minimizes the prediction errors in one direction only.


The following example shows data on the extension of a cable, y, relative to the force applied, x, and the fit of both regression lines.


Regression and normality

Suppose that we have the statistical model

$$Y = \alpha + \beta x + \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then, if the data come from this model, it can be shown that the least squares fit coincides with the maximum likelihood approach to estimating the parameters.

You will study this in more detail in the course on Regression Models.
