10.07.2015 Views

Using R for Introductory Statistics : John Verzani



The data-frame notation allows us to take subsets of the data frames in a natural and efficient manner. To illustrate, let’s consider the data set babies (UsingR) again. We wish to see if any relationships appear between the gestation time (gestation), birth weight (wt), mother’s age (age), and family income (inc).

Again, we need to massage the data to work with R. Several of these variables have a special numeric code for data that is not available (NA). Looking at the documentation of babies (UsingR) (with ?babies), we see that gestation uses 999, age uses 99, and income is really a categorical variable that uses 98 for “not available.”

We can set these values to NA as follows:

## bad idea, doesn’t change babies, only copies
> attach(babies)
> gestation[gestation == 999] = NA
> age[age == 99] = NA
> inc[inc == 98] = NA
> pairs(babies[,c("gestation","wt","age","inc")])

But the graphic produced by pairs() won’t be correct, as we didn’t actually change the values in the data frame babies; rather we modified the local copies produced by attach().

A better way to make these changes is to find the indices that are not good for the variables gestation, age, and inc and use these for extraction as follows:

> rm(gestation); rm(age); rm(inc) # clear out copies
> detach(babies); attach(babies) # really clear out
> not.these = (gestation == 999) | (age == 99) | (inc == 98)
## A logical not and named extraction
> tmp = babies[!not.these,c("gestation","age","wt","inc")]
> pairs(tmp)
> detach(babies)

The pairs() function produces the scatterplot matrix (Figure 4.4) of the new data frame tmp. We had to remove the copies of the variables gestation, age, and inc that were created in the previous try at this. To be sure that we used the correct variables, we detached and reattached the data set.
Trends we might want to investigate are the relationship between gestation period and birth weight and the relationship of income and age.
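
The extraction step can also be tried without attach() at all. The following is a minimal sketch, not the book's code: the data frame toy is a hypothetical stand-in for babies (UsingR) with invented values, so that the example runs without the package installed. It builds the same logical index as above, and then shows an alternative in which the codes are replaced by NA in the data frame itself rather than in attached copies.

```r
# Hypothetical stand-in for babies (UsingR); the values are invented for illustration.
toy = data.frame(
  gestation = c(284, 999, 279, 282, 286),
  wt        = c(120, 113, 128, 108, 136),
  age       = c(27, 33, 99, 23, 25),
  inc       = c(1, 4, 2, 98, 4)
)

# Logical index of rows that carry any "not available" code
not.these = (toy$gestation == 999) | (toy$age == 99) | (toy$inc == 98)

# A logical not and named extraction, as in the text
tmp = toy[!not.these, c("gestation", "age", "wt", "inc")]

# Alternative: recode inside a copy of the data frame, so the change
# is stored in the data frame and not in variables created by attach()
cleaned = toy
cleaned$gestation[cleaned$gestation == 999] = NA
cleaned$age[cleaned$age == 99] = NA
cleaned$inc[cleaned$inc == 98] = NA

print(tmp)
print(cleaned)
```

Calling pairs(tmp) on the result draws the scatterplot matrix. Because the columns are referred to as toy$gestation and so on, there are no local copies to remove afterwards.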
