10.07.2015 Views

Using R for Introductory Statistics : John Verzani

Using R for Introductory Statistics : John Verzani

Using R for Introductory Statistics : John Verzani

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Using</strong> R <strong>for</strong> introductory statistics 240> cols =c("blue","brown","green","orange","red","yellow","purple")> prob = c(1,1,1,1,2,2,2) # ratio of colors> bagfull.mms=sample(cols,30,replace=TRUE, prob=prob)> table(bagfull.mms)bagfull.mmsblue brown green orange purple red yellow2 3 1 3 6 10 5A <strong>for</strong>mula <strong>for</strong> the multinomial distribution is similar to that <strong>for</strong> the binomial distributionexcept that more factors are involved, as there are more categories to choose from. Thedistribution can be specified as follows:As an example, consider the voter survey. Suppose we expected the percentages to be35% Republican, 35% Democrat, and 30% undecided. What is the probability in a surveyof 100 that we see 35, 40, and 25 respectively? It isThis is found with> choose(100,30)*choose(70,40) * .35^35 * .35^40 *.30^25[1] 0.008794(We skip the last coefficient, as <strong>for</strong> any j.) This small value is the probability ofthe observed value, but it is not a p-value. A p-value also includes the probability ofseeing more extreme values than the observed one. We still need to specify what thatmeans.9.1.2 Pearson’s χ 2 statisticTrying to use the multinomial distribution directly to answer a problem about the p-valueis difficult, as the variables Y i are correlated. If one is large the others are more likely tobe small, so the meaning of “extreme” in calculating a pvalue is not immediately clear.As an alternative, the problem of testing whether a given set of probabilities could haveproduced the data is done as be<strong>for</strong>e: by comparing the observed value with the expectedvalue and then normalizing to get something with a known distribution.Each Y i is a random variable telling us how many of the n choices were in category i.If we focus on a single i, then Y i is seen to be Binomial(n, p i ). Again, the Y i are notindependent but correlated, as one large one implies that the others are more likelysmaller. However, we know that the expected number of Y i is np i . Based on this, a goodstatistic might be

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!