10.07.2015 Views

Using R for Introductory Statistics : John Verzani




Tails of a distribution and skew

The tails of a distribution are the very large and very small values of the distribution. They give the shape of the histogram on the far left and right, hence the name. Many inferences about a distribution are affected by its tails. A distribution is called a long-tailed distribution if the data set contains values far from the body of the data. This is made precise after the normal distribution is introduced as a reference. Long tails are also known as “fat tails.” Alternatively, a distribution is called a short-tailed distribution if there are no values far from the body.

A distribution is a skewed distribution if one tail is significantly fatter or longer than the other. A distribution with a longer left tail is termed skewed left; a distribution with a longer right tail is termed skewed right.

We’ve seen how very large or very small values in a data set can skew the mean. We will call a data point that doesn’t fit the pattern set by the majority of the data an outlier. Outliers may be the result of an underlying distribution with a long tail or a mixture of distributions, or they may indicate mistakes of some sort in the data.

■ Example 2.9: Asset distributions are long-tailed

The distributions of assets, like incomes, are typically skewed right. For example, the amount of equity a household has in vehicles (cars, boats, etc.) is contained in the VEHIC variable of the cfb (UsingR) data set. Figure 2.17 shows the long-tailed distribution. The summary() function shows a significant difference between the median and the mean, as expected in these situations.

> attach(cfb)                # it is a data frame
> summary(VEHIC)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0    3880   11000   15400   21300  188000
> hist(VEHIC, breaks="Scott", prob=TRUE)
> lines(density(VEHIC))
> detach(cfb)

Measures of center for symmetric data. When a data set is symmetric and not too long-tailed, the mean, trimmed mean, and median are approximately the same. In this case, the more familiar mean is usually used to measure center.

Measuring the center for long-tailed distributions. If a distribution has very long tails, the mean may be a poor indicator of the center, as values far from the mean may have a significant effect on it. In this case, a trimmed mean or median is preferred if the data is symmetric, and a median is preferred if the data is skewed.

For similar reasons, the IQR is preferred to the standard deviation when summarizing spread in a long-tailed distribution.
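The recommendations above can be checked with a small experiment. The following is only a sketch: it uses simulated right-skewed data from rlnorm() as a stand-in for VEHIC (an assumption, so that it runs without the UsingR package), and the seed, sample size, trimming fraction of 0.1, and the size of the added extreme value are all illustrative choices, not values from the text.

```r
# Sketch only: synthetic right-skewed data stands in for cfb$VEHIC
# (an assumption made so the example runs without UsingR installed).
set.seed(1)
x <- rlnorm(1000, meanlog = 9, sdlog = 1)   # long right tail

# Three measures of center and two measures of spread.
center <- c(mean    = mean(x),
            trimmed = mean(x, trim = 0.1),  # drops 10% of the data from each end
            median  = median(x))
spread <- c(sd = sd(x), IQR = IQR(x))

# Append one extreme value (hypothetical) and recompute.
y <- c(x, 100 * max(x))
center_out <- c(mean    = mean(y),
                trimmed = mean(y, trim = 0.1),
                median  = median(y))
spread_out <- c(sd = sd(y), IQR = IQR(y))

# Compare the summaries before and after adding the extreme value.
print(round(rbind(center, center_out)))
print(round(rbind(spread, spread_out)))
```

In a run like this, the mean and standard deviation should move noticeably once the extreme value is added, while the trimmed mean, median, and IQR barely change, which is the behavior the text relies on when it prefers the resistant measures for long-tailed data.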
