10.07.2015 Views

Using R for Introductory Statistics : John Verzani


Multivariate data

    2     42 stomach
    …
    63  3460  breast
    64   719  breast

The variable names in the output of stack() are always values, to store the data, and ind, to indicate which sample the data is from. When we use stack(), it is important that the variables in the data frame or list have names, so that ind can indicate which variable the data is from.

4.3.5 Problems

4.15 The data set MLBattend (UsingR) contains attendance data for major league baseball between the years 1969 and 2000. For each year, make boxplots of attendance. Can you pick out two seasons that were shortened by strikes? (There were three, but the third is hard to see.)

4.16 The data set MLBattend (UsingR) contains several variables concerning attendance at major league baseball games from 1969 to 2000. Compare the mean number of runs scored per team for each league before and after 1972 (when the designated hitter was introduced). Is there a difference? Hint: the function tapply() can be used, as in

> tapply(runs.scored,league,mean)
   AL    NL
713.3 675.4

However, do this for the data before and after 1972.

4.17 The data set npdb (UsingR) contains malpractice-award information for the years 2000 to 2003 in the United States. The variable ID contains an identification number unique to a doctor. The command table(table(ID)) shows that only 5% of doctors are involved in multiple awards. Perhaps these few are the cause of the large insurance payouts? How can we check graphically?

We'll make boxplots of the total award amount per doctor, broken down by the number of awards that doctor has against him, and investigate. First, though, we need to manipulate the data.

1. The command tmp=split(award, ID) will form the list tmp with each element corresponding to the awards for a given doctor. Explain what these commands do: sapply(tmp, sum) and sapply(tmp, length).

2. Make a data frame with the command

> df = data.frame(sum = sapply(tmp,sum), number = sapply(tmp,length))

With this, create side-by-side boxplots of the total amount by a doctor broken down by the number of awards.

What do you conclude about these 5% being the main cause of the damages?
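To see why names matter for stack(), here is a minimal sketch on a small named list. The list samples and its values are made up for illustration and are not from the book's data.

```r
# Hypothetical toy data: two named samples of different lengths
samples <- list(a = c(1, 2, 3), b = c(10, 20))

# stack() puts all the data in one column and records the source in another
stacked <- stack(samples)

names(stacked)       # the two column names: values and ind
levels(stacked$ind)  # the names of the list components
stacked
```

Each row of stacked pairs one data value with the name of the component it came from, which is only possible because the components of samples are named.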
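The hint in problem 4.16 can be extended to two grouping variables. The following is only a sketch on made-up numbers: the data frame toy, its values, and the choice to count 1972 itself as "before" are all assumptions, and the real analysis would use MLBattend.

```r
# Hypothetical toy data standing in for MLBattend (values are invented)
toy <- data.frame(
  year        = c(1970, 1970, 1971, 1971, 1980, 1980, 1981, 1981),
  league      = c("AL", "NL", "AL", "NL", "AL", "NL", "AL", "NL"),
  runs.scored = c(600, 640, 620, 660, 720, 670, 740, 690)
)

# Assumption: seasons up to and including 1972 are labeled "before"
period <- ifelse(toy$year <= 1972, "before", "after")

# tapply() accepts a list of factors and returns a table of group means
means <- tapply(toy$runs.scored, list(toy$league, period), mean)
means
```

Passing a list of two factors gives one mean per league and period, so the four numbers can be compared in a single table rather than with two separate calls on subsets.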
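The data manipulation described in problem 4.17 can be tried first on a small example. This is a sketch under stated assumptions: the data frame toy and its award amounts are invented, and only the column names ID and award follow the problem.

```r
# Hypothetical toy data: four doctors, seven awards (amounts are invented)
toy <- data.frame(
  ID    = c(1, 1, 2, 3, 3, 3, 4),
  award = c(100, 250, 75, 300, 50, 125, 40)
)

# One list element per doctor, holding that doctor's award amounts
tmp <- split(toy$award, toy$ID)

# Total amount and number of awards for each doctor
df <- data.frame(sum = sapply(tmp, sum), number = sapply(tmp, length))
df

# How many doctors have one award, two awards, and so on
table(table(toy$ID))

# Side-by-side boxplots of total amount, grouped by number of awards
boxplot(sum ~ number, data = df)
```

With only four doctors the boxplots are degenerate, but the same commands apply unchanged once award and ID come from the real data set.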
