Chapter 16: Nonparametric Tests - WH Freeman

16-4 CHAPTER 16 robustnessoutlierstransforming dataother standarddistributionsnonparametricmethodsrank testscontinuousdistributionNonparametric TestsIntroductionThe most commonly used methods for inference about the means of quantitativeresponse variables assume that the variables in question have Normaldistributions in the population or populations from which we draw our data.In practice, of course, no distribution is exactly Normal. Fortunately, ourusual methods for inference about population means (the one-sample andtwo-sample t procedures and analysis of variance) are quite robust. That is,the results of inference are not very sensitive to moderate lack of Normality,especially when the samples are reasonably large. Some practical guidelinesfor taking advantage of the robustness of these methods appear in Chapter 7.What can we do if plots suggest that the data are clearly not Normal,especially when we have only a few observations? This is not a simplequestion. Here are the basic options:1. If there are extreme outliers in a small data set, any inference method maybe suspect. An outlier is an observation that may not come from the samepopulation as the others. To decide what to do, you must find the causeof the outlier. Equipment failure that produced a bad measurement, forexample, entitles you to remove the outlier and analyze the remainingdata. If the outlier appears to be “real data,” it is risky to draw anyconclusion from just a few observations.2. Sometimes we can transform our data so that their distribution is morenearly Normal. Transformations such as the logarithm that pull in thelong tail of right-skewed distributions are particularly helpful. Exercises2.16 (page 100), 7.96 (page 496), and 11.5 (page 640) illustrate theuse of the logarithm transformation.3. In some settings, other standard distributions replace the Normal distributionsas models for the overall pattern in the population. There areinference procedures for the parameters of these distributions that replacethe t procedures when we use specific non-Normal models.4. Finally, there are inference procedures that do not require any specificform for the distribution of the population. These are called nonparametricmethods. The sign test (page 450) is an example of a nonparametrictest.This chapter concerns one type of nonparametric procedure, tests thatcan replace the t tests and one-way analysis of variance when the Normalityconditions for those tests are not met. The most useful nonparametric testsare rank tests based on the rank (place in order) of each observation in theset of all the data.Figure 16.1 presents an outline of the standard tests (based on Normaldistributions) and the rank tests that compete with them. All of these testsrequire that the population or populations have continuous distributions.That is, each distribution must be described by a density curve that allowsobservations to take any value in some interval of outcomes. The Normalcurves are one shape of density curve. Rank tests allow curves of any shape.

16-6 CHAPTER 16 Nonparametric TestsTABLE 16.1Annual earnings of hourly workers at National BankBlack Black White Whitefemales males females males$16,015 $18,365 $25,249 $15,10017,516 17,755 19,029 22,34617,274 16,890 17,233 22,04916,555 17,147 26,606 26,97020,788 18,402 28,346 16,41119,312 20,972 31,176 19,26817,124 24,750 18,863 28,33618,405 16,576 15,904 19,00719,090 16,853 22,477 22,07812,641 21,565 19,102 19,97717,813 29,347 18,002 17,19418,206 19,028 21,596 30,38319,338 26,885 18,36415,953 24,780 18,24516,904 14,698 23,53119,30817,57624,49720,61217,757Figure 16.2 is a back-to-back stemplot that compares the earnings ofthe 15 black females and the 12 black males in our sample. The femaledistribution has a low outlier and the male distribution is strongly skewedto the right. If we suspected before looking at the data that men wouldearn more, we would test the null hypothesis of “similar earnings” againsta one-sided alternative. We prefer a test that is not based on Normality.The rank transformationWe first rank all 27 observations together. To do this, arrange them in orderfrom smallest to largest:12641 15953 16015 16555 16576 16853 16890 16904 1712417147 17274 17516 17755 17813 18206 18365 18402 1840519028 19090 19312 19338 20788 20972 21565 24750 29347The boldface entries in the list are the earnings of the 12 males. We see thatthe four lowest earnings are for women and the four highest are for men.The idea of rank tests is to look just at position in this ordered list. To do

16-10 CHAPTER 16Nonparametric TestsS-PlusExact Wilcoxon rank-sum testdata: bmale and bfemalerank-sum statistic W = 195, n = 12, m = 15, p-value = 0.0998alternative hypothesis: true mu is greater than 0FIGURE 16.3 Output from the S-Plus statistical software for the dataon black females and black males in Table 16.1. This program uses theexact distribution for W when the samples are small and there are notied observations.APPLY YOURKNOWLEDGEThe two-sample t test gives a somewhat more significant result than theWilcoxon test in Example 16.2 ( t 1. 8033, P 0.0465). We hesitate totrust the t test because one sample is strongly skewed and the sample sizesare not large.16.1 Consumption growth in Asia and Eastern Europe. Exercise 1.32 (page 35)gives the growth rates of private consumption in 12 Asian nations. Exercise1.35 (page 39) gives the rates for 13 Eastern European nations. Hereare the data for Asia,2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1 5.6 6.2and for Eastern Europe,6.0 5.2 1.0 3.5 3.2 1.7 1.5 1.0 4.9 1.4 7.0 3.7 12.1Arrange these 25 observations in order and assign the ranks. Does one groupappear to have generally higher ranks?16.2 Wilcoxon for Asia versus Europe. Continue your work in the previousexercise. What is the Wilcoxon rank sum W for Asia? What are its meanWand standard deviation W? Using software, is there good evidence thatgrowth rates of private consumption in Asia are systematically higher thanin Europe?The Normal approximationThe rank sum statistic W becomes approximately Normal as the two samplesizes increase. We can then form yet another z statistic by standardizing W:continuitycorrectionW z WUse standard Normal probability calculations to find P-values for this statistic.Because W takes only whole-number values, the continuity correctionimproves the accuracy of the approximation.WW n1( N1)/2nn( N1)/121 2

16-12 CHAPTER 16Nonparametric Tests(a) MinitabMann-Whitney Confidence Interval and TestbmalebfemaleN = 12N = 15Median = 18384Median = 17516Point estimate for ETA1-ETA2 is 110995.2 Percent CI for ETA1-ETA2 is (–384,3456)W = 195.0Test of ETA1 = ETA2 vs ETA1 > ETA2 is significant at 0.0980(b) SASWilcoxon Scores (Rank Sums) for Variable EARNINGSClassified by Variable GENDERGENDERNSum ofScoresExpectedUnder H0Std DevUnder H0MeanScoreFemaleMale1512183.0195.0210.0168.020.49390220.49390212.20016.250Wilcoxon Two-Sample TestStatistic (S)Normal ApproximationZOne-Sided Pr > ZTwo-Sided Pr > |Z|Exact TestOne-Sided Pr > = STwo-Sided Pr > = |S – Mean|195.00001.29310.09800.19600.09980.1995FIGURE 16.4 Output from the Minitab and SAS statistical software for the bankworkers data. (a) Minitab uses the Normal approximation for the distribution of W .(b) SAS gives both the exact and approximate values.16.4 It’s your choice. Exercise 16.2 asks for the rank sum W for Asia. What isthe rank sum for Eastern Europe? The ranks of 25 observations always addto 325. Do your two sums add to 325? Repeat the previous exercise usingthe other rank sum (Eastern Europe if you previously used Asia). Show thatyou obtain exactly the same P-value. That is, your choice between the twopossible W’s does not affect the results of the Wilcoxon test.What hypotheses does Wilcoxon test?Our null hypothesis is that earnings of women and men do not differsystematically. Our alternative hypothesis is that men’s earnings are higher.If we are willing to assume that earnings are Normally distributed, or if we

aH : H : a16.1 The Wilcoxon Rank Sum Test 16-13have reasonably large samples, we use the two-sample t test for means. Ourhypotheses then become H : median mediana0 M FMWhen the distributions may not be Normal, we might restate the hypothesesin terms of population medians rather than means:H : median median0 M FMThe Wilcoxon rank sum test provides a significance test for these hypotheses,but only if an additional assumption is met: both populations must havedistributions of the same shape. That is, the density curve for male earningsmust look exactly like that for female earnings, except that it may slideto a different location on the scale of earnings. The Minitab output inFigure 16.4(a) states the hypotheses in terms of population medians (whichit calls “ETA”) and also gives a confidence interval for the difference betweenthe two population medians.It is clear from the stemplot in Figure 16.2 that the two distributionsdo not have the same shape. The same-shape assumption is too strict to bereasonable in practice. Recall that our preferred version of the two-samplet test does not require that the two populations have the same standarddeviation—that is, it does not make a same-shape assumption. Fortunately,the Wilcoxon test also applies in a much more general and more usefulsetting. It tests hypotheses that we can state in words asH0: two distributions are the sameH : one has values that are systematically largerHere is a more exact statement of the “systematically larger” alternativehypothesis. Take X1to be the earnings of a randomly chosen male bankworker and X2to be the earnings of a randomly chosen female. Theseearnings are random variables. That is, every time we choose a male bankworker at random, his earnings are a value of the variable X1. The probabilitythat a man’s earnings exceed $20,000 is PX ( 1 20000). If male earnings are“systematically larger” than those of females, amounts higher than $20,000should be more likely for men. That is, we should havePX ( 20000) PX ( 20000)1 2The alternative hypothesis says that this inequality holds not just for $20,000but for any amount we care to specify. Choosing men always puts more2probability “to the right” of whatever earnings amount we are interested in.This exact statement of the hypotheses we are testing is a bit awkward.The hypotheses really are “nonparametric” because they do not involve anyspecific parameter such as the mean or median. If the two distributionsdo have the same shape, the general hypotheses reduce to comparingmedians. Many texts and software outputs state the hypotheses in terms ofFF

16-14 CHAPTER 16 Nonparametric Testsmedians, sometimes ignoring the same-shape requirement. We recommendthat you express the hypotheses in words rather than symbols. “Men earnsystematically more than women” is easy to understand and is a goodstatement of the effect that the Wilcoxon test looks for. In particular, youshould ignore Minitab’s confidence intervals in Figure 16.4 because the twodistributions do not have the same shape.average ranksTiesThe exact distribution for the Wilcoxon rank sum is obtained assuming thatall observations in both samples take different values. This allows us to rankthem all. In practice, however, we often find observations tied at the samevalue. What shall we do? The usual practice is to assign all tied values theaverage of the ranks they occupy. Here is an example with 6 observations:Observation 14714 15953 16015 16015 16576 16853Rank 1 2 3.5 3.5 5 6The tied observations occupy the third and fourth places in the ordered list,so they share rank 3.5.The exact distribution for the Wilcoxon rank sum W applies only todata without ties. Moreover, the standard deviation W must be adjusted ifties are present. The Normal approximation can be used after the standarddeviation is adjusted. Statistical software will detect ties, make the necessaryadjustment, and switch to the Normal approximation. In practice, softwareis required if you want to use rank tests when the data contain tied values.It is sometimes useful to use rank tests on data that have very many tiesbecause the scale of measurement has only a few values. Here is an example.CASE 16.2CONSUMER PERCEPTIONS OF FOOD SAFETYVendors of prepared food are very sensitive to the public’s perception of thesafety of the food they sell. Food sold at outdoor fairs and festivals may beless safe than food sold in restaurants because it is prepared in temporarylocations and often by volunteer help. What do people who attend fairsthink about the safety of the food served? One study asked this question ofpeople at a number of fairs in the Midwest:How often do you think people become sick because of food they consumeprepared at outdoor fairs and festivals?The possible responses were1 very rarely2 once in a while3 often4 more often than not5 always

16.1 The Wilcoxon Rank Sum Test 16-15In all, 303 people answered the question. Of these, 196 were women and107 were men. Is there good evidence that men and women differ in their3perceptions about food safety at fairs?We should first ask if the subjects in Case 16.2 are a random sample ofpeople who attend fairs, at least in the Midwest. The researcher visited 11different fairs. She stood near the entrance and stopped every 25th adult whopassed. Because no personal choice was involved in choosing the subjects,we can reasonably treat the data as coming from a random sample. (Asusual, there was some nonresponse, which could create bias.)Here are the data, presented as a two-way table of counts:Response1 2 3 4 5 TotalFemale 13 108 50 23 2 196Male 22 57 22 5 1 107Total 35 165 72 28 3 303Comparing row percents shows that the women in the sample are moreconcerned about food safety than the men:Response1 2 3 4 5 TotalFemale 6.6% 55.1% 25.5% 11.7% 1.0% 100%Male 20.6% 53.3% 20.6% 4.7% 1.0% 100%Is the difference between the genders statistically significant?We might apply the chi-square test (Chapter 9). It is highly significant2( X 16. 120, df 4, P 0.0029). Although the chi-square test answersour general question, it ignores the ordering of the responses and so does notuse all of the available information. We would really like to know whethermen or women are more concerned about the safety of the food served. Thisquestion depends on the ordering of responses from least concerned to mostconcerned. We can use the Wilcoxon test for the hypotheses:H0: men and women do not differ in their responsesH : one gender gives systematically larger responses than the otheraThe alternative hypothesis is two-sided. Because the responses can take onlyfive values, there are very many ties. All 35 people who chose “very rarely”are tied at 1, and all 165 who chose “once in a while” are tied at 2.

16-16 CHAPTER 16Nonparametric TestsSASWilcoxon Scores (Rank Sums) for Variable SFAIRClassified by Variable GENDERGENDERNSum ofScoresExpectedUnder H0Std DevUnder H0MeanScoreFemale 196 31996.5000 29792.0 661.161398 163.247449Male 107 14059.5000 16264.0 661.161398 131.397196Average Scores Were Used for TiesWilcoxon 2-Sample Test (Normal Approximation)(with Continuity Correction of 0.5)S = 14059.5 Z = –3.33353 Prob > |Z| = 0.0009FIGURE 16.5 Output from SAS for the food safety study of Case 16.2. The approximatetwo-sided P -value is 0.0009.EXAMPLE 16.5CASE 16.2Attitudes toward food sold at fairsFigure 16.5 gives computer output for the Wilcoxon test. The rank sum formen (using average ranks for ties) is W 14, 059.5. The standardized value isz 3. 33, with two-sided P-value P 0.0009. There is very strong evidence of adifference. Women are more concerned than men about the safety of food served atfairs.APPLY YOURKNOWLEDGEWith more than 100 observations in each group and no outliers, wemight use the two-sample t even though responses take only five values. Infact, the results for Example 16.5 are t 3. 3655 with P 0.0009. TheP-value for two-sample t is the same as that for the Wilcoxon test. Thereis, however, another reason to prefer the rank test in this example. Thet statistic treats the response values 1 through 5 as meaningful numbers.In particular, the possible responses are treated as though they are equallyspaced. The difference between “very rarely” and “once in a while” is thesame as the difference between “once in a while” and “often.” This maynot make sense. The rank test, on the other hand, uses only the order of theresponses, not their actual values. The responses are arranged in order fromleast to most concerned about safety, so the rank test makes sense. Somestatisticians avoid using t procedures when there is not a fully meaningfulscale of measurement.16.5 Highway mileage. Table 1.2 (page 13) gives the highway gas mileage for 322001 model year midsize cars. There are many ties among the observations.Arrange the mileages in order and assign ranks, assigning all tied values theaverage of the ranks they occupy.

16.1 The Wilcoxon Rank Sum Test 16-1716.6 Domestic versus foreign mileage. Of the cars listed in Table 1.2, 10 haveAmerican brand names and 22 have foreign brand names. (Some of theforeign models are built in North America or by foreign subsidiaries ofAmerican companies. We are concerned only with the brands.) Using yourranks from the previous exercise, what is the rank sum W for the Americanmodels? Using software, is there a significant difference between the mileagesof domestic and foreign brands?Limitations of nonparametric testsThe examples we have given illustrate the potential usefulness of nonparametrictests. Nonetheless, rank tests are of secondary importance relative toinference procedures based on the Normal distribution.Nonparametric inference is largely restricted to simple settings. Normalinference extends to methods for use with complex experimental designsand multiple regression, but nonparametric tests do not. We stress Normalinference in part because it leads on to more advanced statistics.Normal tests compare means and are accompanied by simple confidenceintervals for means or differences between means. When we use nonparametrictests to compare medians, we can also give confidence intervals,though they are awkward to calculate without software. However, theusefulness of nonparametric tests is clearest in settings when they do notsimply compare medians—see the discussion of “What hypotheses doesWilcoxon test?” In these settings, there is no measure of the size of theobserved effect that is closely related to the rank test of the statisticalsignificance of the effect.The robustness of Normal tests for means implies that we do not often encounterdata that require nonparametric procedures to obtain reasonablyaccurate P-values.There are more modern and more effective ways to escape the assumptionof Normality, such as bootstrap methods (see page 374).SECTION16.1 SUMMARY Nonparametric tests do not require any specific form for the distributionof the population from which our samples come. Rank tests are nonparametric tests based on the ranks of observations,their positions in the list ordered from smallest (rank 1) to largest. Tiedobservations receive the average of their ranks. The Wilcoxon rank sum test compares two distributions to assesswhether one has systematically larger values than the other. The Wilcoxontest is based on the Wilcoxon rank sum statistic W,which is the sum of theranks of one of the samples. The Wilcoxon test can replace the two-samplet test.

16-18 CHAPTER 16Nonparametric Tests P-valuesfor the Wilcoxon test are based on the sampling distributionof the rank sum statistic W when the null hypothesis (no difference indistributions) is true. You can find P-values from special tables, software,or a Normal approximation (with continuity correction).SECTION16.1 EXERCISESStatistical software is very helpful in doing these exercises. If you do nothave access to software, base your work on the Normal approximation withcontinuity correction.16.7 Wheat prices. Example 7.14 (page 470) reports the results of a small surveythat asked separate samples of 5 wheat producers in each of Septemberand July what price they received for wheat sold that month. Here arethe data:MonthPrice of wheat ($/bushel)September $3.5900 $3.6150 $3.5950 $3.5725 $3.5825July $2.9200 $2.9675 $2.9175 $2.9250 $2.9325The stemplot on page 471 shows a large difference between months. Wecannot assess Normality from such small samples. Carry out by hand thesteps in the Wilcoxon rank sum test for comparing prices in July andSeptember.(a) Arrange the 10 observations in order and assign ranks. There are noties.(b) Find the rank sum W for September. What are the mean and standarddeviation of W under the null hypothesis that prices in July andSeptember do not differ systematically?(c) Standardize W to obtain a z statistic. Do a Normal probability calculationwith the continuity correction to obtain a two-sided P-value.16.8 Does pesticide work? In a study of cereal leaf beetle damage on oats,researchers measured the number of beetle larvae per stem in small plotsof oats after randomly applying one of two treatments: no pesticide or4Malathion at the rate of 0.25 pound per acre. Here are the data:Control 2 4 3 4 2 3 3 5 3 2 6 3 4Treatment 0 1 1 2 1 2 1 1 2 1 1 1The researchers used a t procedure even though the data are far fromNormal.(a) Examine the data. In what ways do they depart from Normality?(b) Is there evidence that Malathion systematically reduces the number oflarvae per stem?

16.1 The Wilcoxon Rank Sum Test 16-1916.9 Polyester fabrics in landfills. How quickly do synthetic fabrics such aspolyester decay in landfills? A researcher buried polyester strips in the soilfor different lengths of time, then dug up the strips and measured the forcerequired to break them. Breaking strength is easy to measure and is a goodindicator of decay. Lower strength means the fabric has decayed. Part ofthe study involved burying 10 polyester strips in well-drained soil in thesummer. Five of the strips, chosen at random, were dug up after 2 weeks;the other 5 were dug up after 16 weeks. Here are the breaking strengths in5pounds:2 weeks 118 126 126 120 12916 weeks 124 98 110 140 110(a) Make a back-to-back stemplot. Does it appear reasonable to assumethat the two distributions have the same shape?(b) Is there evidence that breaking strengths are lower for strips buriedlonger?16.10 The influence of subliminal messages. Can “subliminal” messages that arebelow our threshold of awareness nonetheless influence us? Advertisers,among others, want to know. One study asked if subliminal messages helpstudents learn math. A group of students who had failed the mathematicspart of the City University of New York Skills Assessment Test agreed toparticipate in a study to find out. All received a daily subliminal message,flashed on a screen too rapidly to be consciously read. The treatment groupof 10 students was exposed to “Each day I am getting better in math.”The control group of 8 students was exposed to a neutral message, “Peopleare walking on the street.” All students participated in a summer programdesigned to raise their math skills, and all took the assessment test again atthe end of the program. Here are data on the subjects’ scores before and6after the program.Treatment Group Control GroupPretest Posttest Pretest Posttest18 24 18 2918 25 24 2921 33 20 2418 29 18 2618 33 24 3820 36 22 2723 34 15 2223 36 19 3121 3417 27(a) The study design was a randomized comparative experiment. Outlinethis design.

16-20 CHAPTER 16 Nonparametric Tests(b) Compare the gain in scores in the two groups, using a graph andnumerical descriptions. Does it appear that the treatment group’s scoresrose more than the scores for the control group?(c) Apply the Wilcoxon rank sum test to the posttest versus pretest differences.Note that there are some ties. What do you conclude?16.11 Logging tropical forests. “Conservationists have despaired over destructionof tropical rainforest by logging, clearing, and burning.” These words begin7a report on a statistical study of the effects of logging in Borneo. Here aredata on the number of tree species in 12 unlogged forest plots and 9 similarplots logged 8 years earlier:Unlogged 22 18 22 20 15 21 13 13 19 13 19 15Logged 17 4 18 14 18 15 15 10 12(a) Make a back-to-back stemplot of the data. Does there appear to be adifference in species counts for logged and unlogged plots?(b) Does logging significantly reduce the number of species in a plot after 8years? State hypotheses, do a Wilcoxon test, and state your conclusion.16.12 The effect of piano lessons. Exercise 7.70 (page 482) studies the effectof piano lessons on the spatial-temporal reasoning of preschool children.The data there concern 34 children who took piano lessons and a controlgroup of 44 children. The data take only small whole-number values. Usethe Wilcoxon rank sum test (there are many ties) to decide whether pianolessons improve spatial-temporal reasoning.16.13 Safety of restaurant food. Case 16.2 describes a study of theattitudes of people attending outdoor fairs about the safety of thefood served at such locations. You can find the full data set onlineor on the CD as the file eg16-05.dat. It contains the responses of 303 peopleto several questions. The variables in this data set are (in order)subject hfair sfair sfast srest genderThe variable “sfair” contains the responses described in the example concerningsafety of food served at outdoor fairs and festivals. The variable“srest” contains responses to the same question asked about food served inrestaurants. The variable “gender” contains 1 if the respondent is a woman,2ifheisaman. We saw that women are more concerned than men aboutthe safety of food served at fairs. Is this also true for restaurants?16.14 Food safety at fairs and in restaurants. The data file used inExample 16.5 and Exercise 16.13 contains 303 rows, one foreach of the 303 respondents. Each row contains the responses ofone person to several questions. We wonder if people are more concernedabout the safety of food served at fairs than they are about the safety offood served at restaurants. Explain carefully why we cannot answer thisquestion by applying the Wilcoxon rank sum test to the variables “sfair”and “srest.”16.15 Consumer attitudes: secondhand stores. To study customers’ attitudes towardsecondhand stores, researchers interviewed samples of shoppers attwo secondhand stores of the same chain in two cities. Here are data onCASE 16.2 CASE 16.2

16.2 The Wilcoxon Signed Rank Test 16-21the incomes of shoppers at the two stores, presented as a two-way table of8counts:Income code Income City 1 City 21 Under $10,000 70 622 $10,000 to $19,999 52 633 $20,000 to $24,999 69 504 $25,000 to $34,999 22 195 $35,000 or more 28 24(a) Is there a relationship between city and income? Use the chi-square testto answer this question.(b) The chi-square test ignores the ordering of the income categories. Thedata file ex16-15.dat contains data on the 459 shoppers in this study.The first variable is the city (City1 or City2) and the second is the incomecode as it appears in the table above (1 to 5). Is there good evidencethat shoppers in one city have systematically higher incomes than in theother?16.2 The Wilcoxon Signed Rank TestWe use the one-sample t procedures for inference about the mean of onepopulation or for inference about the mean difference in a matched pairssetting. We will now meet a rank test for matched pairs and single samples.The matched pairs setting is more important because good studies aregenerally comparative.EXAMPLE 16.6Vitamin loss in a food productFood products are often enriched with vitamins and other supplements. Doesthe level of a supplement decline over time, so that the user receives less than themanufacturer intended? Here are data on the vitamin C levels (milligrams per100 grams) in wheat soy blend, a flour-like product supplied by international aidprograms mainly for feeding children. The same 9 bags of blend were measured at9the factory and five months later in Haiti.Bag 1 2 3 4 5 6 7 8 9Factory 45 32 47 40 38 41 37 52 37Haiti 38 40 35 38 34 35 38 38 40Difference 7 8 12 2 4 6 1 14 3We suspect that vitamin C levels are generally higher at the factory than they arefive months later. We would like to test the hypothesesH0: vitamin C has the same distribution at both timesH : vitamin C is systematically higher at the factoryaBecause these are matched pairs data, we base our inference on the differences.

16-22 CHAPTER 16absolute valueNonparametric TestsPositive differences in Example 16.6 indicate that the vitamin C level of abagwashigheratthefactorythaninHaiti.Iffactoryvaluesaregenerallyhigher,the positive differences should be farther from zero in the positive directionthan the negative differences are in the negative direction. We thereforecompare the absolute values of the differences, that is, their magnitudeswithout a sign. Here they are, with boldface indicating the positive values:7 8 12 2 4 6 1 14 3Arrange the absolute values in increasing order and assign ranks, keepingtrack of which values were originally positive. Tied values receive the averageof their ranks. If there are zero differences, discard them before ranking. Inour example, there are no zeros and no ties.Absolute value 1 2 3 4 6 7 8 12 14Rank 1 2 3 4 5 6 7 8 9The test statistic is the sum of the ranks of the positive differences. This is theWilcoxon signed rank statistic. Its value here is W 34. (We could equallywell use the sum of the ranks of the negative differences, which is 11.)THE WILCOXON SIGNED RANK TEST FOR MATCHED PAIRSDraw an SRS from a population for a matched pairs study andtake the differences in responses within pairs. Remove all zerodifferences, so that n nonzero differences remain. Rank the absolutevalues of these differences. The sum W of the ranks for the positivedifferences is the Wilcoxon signed rank statistic.If the distribution of the responses is not affected by the differenttreatments within pairs, then W has meanand standard deviationWWnn ( 1)4nn ( 1)(2n1)24The Wilcoxon signed rank test rejects the hypothesis that there areno systematic differences within pairs when the rank sum W is farfrom its mean.EXAMPLE 16.7Vitamin loss rank testIn the vitamin loss study of Example 16.6, n 9. If the null hypothesis (nosystematic loss of vitamin C) is true, the mean of the signed rank statistic isWnn ( 1) (9)(10) 22.54 4Our observed value W 34 is somewhat larger than this mean. The one-sidedP-value is P( W 34).

16.2 The Wilcoxon Signed Rank Test 16-23(a) S-PlusExact Wilcoxon signed-rank testdata: factory and haitisigned-rank statistic V = 34, n = 9, p-value = 0.1016alternative hypothesis: true mu is greater than 0(b) MinitabWilcoxon Signed-Rank TestTest of median = 0.000000 versus median > 0.000000DiffN9N forTest9WilcoxonStatistic34.0P0.096EstimatedMedian4.000FIGURE 16.6 Output from (a) S-Plus and (b) Minitab for the vitaminloss study of Example 16.6. S-Plus reports the exact P-value P 0.1016.Minitab uses the Normal approximation with the continuity correction to givethe approximate P-value P 0.096.Figure 16.6 displays the output of two statistical programs. We see fromFigure 16.6(a) that the one-sided P-value for the Wilcoxon signed rank test withn 9 observations and W 34 is P 0.1016. This small sample does not giveconvincing evidence of vitamin loss.In fact, the Normal quantile plot in Figure 16.7 shows that the differences arereasonably Normal. We could use the paired-sample t to get a similar conclusion10Vitamin C loss50–5–3 –2 –1 0 1 2 3z-scoreFIGURE 16.7 Normal quantile plot for the nine matched pairsdifferences in Example 16.6.

16-24 CHAPTER 16Nonparametric Tests( t 1. 5595, df 8, P 0. 0787). The t test has a slightly lower P-value becauseit is somewhat more powerful than the rank test when the data are actuallyNormal.Although we emphasize the matched pairs setting, W can also beapplied to a single sample. It then tests the hypothesis that the populationmedian is zero. To test the hypothesis that the population median has aspecific value m, apply the test to the differences Xi m. For matched pairs,we are testing that the median of the differences is zero.The Normal approximationThe distribution of the signed rank statistic when the null hypothesis (nodifference) is true becomes approximately Normal as the sample size becomeslarge. We can then use Normal probability calculations (with the continuitycorrection) to obtain approximate P-values for W . Let’s see how thisworks in the vitamin loss example, even though n 9 is certainly not alarge sample.EXAMPLE 16.8Vitamin loss Normal approximationFor n 9 observations, we saw in Example 16.7 that W 22.5. The standarddeviation of W under the null hypothesis isWnn ( 1)(2n1)24(9)(10)(19)2471. 25 8.441The continuity correction calculates the P-value P( W 34) as P( W 33.5),treating the value W 34 as occupying the interval from 33.5 to 34.5. We findthe Normal approximation for the P-value by standardizing and using the standardNormal table:W 22. 5 33. 5 22.5PW ( 33. 5) P8. 441 8.441 PZ ( 1.303) 0.0968Despite the small sample size, the Normal approximation gives a result quiteclose to the exact value P 0.1016. The Minitab output in Figure 16.6(b) givesP 0.096 based on a Normal calculation rather than the table.TiesTies among the absolute differences are handled by assigning average ranks.A tie within a pair creates a difference of zero. Because these are neither

16.2 The Wilcoxon Signed Rank Test 16-25positive nor negative, we drop such pairs from our sample. As in the caseof the Wilcoxon rank sum, ties complicate finding a P-value. There isno longer a usable exact distribution for the signed rank statistic W ,and the standard deviation W must be adjusted for the ties beforewe can use the Normal approximation. Software will do this. Here isan example.EXAMPLE 16.9Two rounds of golf scoresHere are the golf scores of 12 members of a college women’s golf team in tworounds of tournament play. (A golf score is the number of strokes required tocomplete the course, so that low scores are better.)Player 1 2 3 4 5 6 7 8 9 10 11 12Round 2 94 85 89 89 81 76 107 89 87 91 88 80Round 1 89 90 87 95 86 81 102 105 83 88 91 79Difference 5 5 2 6 5 5 5 16 4 3 3 1Negative differences indicate better (lower) scores on the second round. We see that6ofthe 12 golfers improved their scores. We would like to test the hypothesesthat in a large population of collegiate woman golfersHH0a: scores have the same distribution in rounds 1 and 2: scores are systematically lower or higher in round 2A Normal quantile plot of the differences (Figure 16.8) shows some irregularityand a low outlier. We will use the Wilcoxon signed rank test.5Difference in golf score0–5–10–15–3 –2 –1 0 1 2 3z-scoreFIGURE 16.8 Normal quantile plot of the differences in scores for tworounds of a golf tournament, from Example 16.9.

16-26 CHAPTER 16Nonparametric TestsThe absolute values of the differences, with boldface indicating thosethat were negative, are5 5 2 6 5 5 5 16 4 3 3 1Arrange these in increasing order and assign ranks, keeping track of whichvalues were originally negative. Tied values receive the average of theirranks.Absolute value 1 2 3 3 4 5 5 5 5 5 6 16Rank 1 2 3.5 3.5 5 8 8 8 8 8 11 12The Wilcoxon signed rank statistic is the sum of the ranks of the negativedifferences. (We could equally well use the sum of the ranks of the positivedifferences.) Its value is W 50.5.EXAMPLE16.10Software results for golf scoresHere are the two-sided P-values for the Wilcoxon signed rank test for the golfscore data from several statistical programs:ProgramP-valueMinitab P 0.388SAS P 0.388S-Plus P 0.384SPSS P 0.363All lead to the same practical conclusion: these data give no evidence for asystematic change in scores between rounds. However, the P-values reported differa bit from program to program. The reason for the variations is that the programsuse slightly different versions of the approximate calculations needed when ties arepresent. The exact result depends on which of these variations the programmerchooses to use.For these data, the matched pairs t test gives t 0. 9314 with P 0.3716.Once again, t and W lead to the same conclusion.SECTION16.2 SUMMARY The Wilcoxon signed rank test applies to matched pairs studies. It teststhe null hypothesis that there is no systematic difference within pairsagainst alternatives that assert a systematic difference (either one-sided ortwo-sided). The test is based on the Wilcoxon signed rank statistic W , which is thesum of the ranks of the positive (or negative) differences when we rank theabsolute values of the differences. The matched pairs t test and the sign testare alternative tests in this setting.

16.2 The Wilcoxon Signed Rank Test 16-27 The Wilcoxon signed rank test can also be used to test hypotheses abouta population median by applying it to a single sample. P-valuesfor the signed rank test are based on the sampling distributionof W when the null hypothesis is true. You can find P-values from specialtables, software, or a Normal approximation (with continuity correction).SECTION16.2 EXERCISESStatistical software is very helpful in doing these exercises. If you do nothave access to software, base your work on the Normal approximation withcontinuity correction.16.16 Growing trees faster? The concentration of carbon dioxide (CO 2) in theatmosphere is increasing rapidly due to our use of fossil fuels. Becauseplants use CO2 to fuel photosynthesis, more CO2may cause trees andother plants to grow faster. This would affect forest product companiesthat operate tree farms. An elaborate apparatus allows researchers to pipeextra CO2to a 30-meter circle of forest. They set up three pairs of circlesin different parts of a pine forest in North Carolina. One of each pairreceived extra CO2for an entire growing season, and the other receivedambient air. The response variable is the average growth in base area fortrees in a circle, as a fraction of the starting area. Here are the data for one10growing season:Pair Control Treatment1 0.06528 0.081502 0.05232 0.063343 0.04329 0.05936(a) Summarize the data. Does it appear that growth was faster in the treatedplots?(b) The researchers used a paired-sample t test to see if the data give goodevidence of faster growth in the treated plots. State hypotheses, carryout the test, and state your conclusion.(c) The sample is so small that we cannot assess Normality. To be safe, wemight use the Wilcoxon signed rank test. Carry out this test and reportyour result.(d) The tests lead to very different conclusions. The primary reason is thelack of power of rank tests for very small samples. Explain to someonewho knows no statistics what this means.16.17 Raising your heart rate. Exercise that helps health and fitness should raiseour heart rate for some period of time. A firm that markets a “Step Upto Health” apparatus consisting of a step and handrails for users to holdmust tell buyers how to use their new device. The firm has subjects use thestep at several stepping rates and measures their heart rates before and afterstepping. Here are data for five subjects and two treatments: low rate (14steps per minute) and medium rate (21 steps per minute). For each subject,

16-28 CHAPTER 16 Nonparametric Testswe give the resting heart rate (beats per minutes) and the heart rate at the11end of the exercise.Low Rate Medium RateSubject Resting Final Resting Final1 60 75 63 842 90 99 69 933 87 93 81 964 78 87 75 905 84 84 90 108Does exercise at the low rate raise heart rate significantly? State hypothesesin terms of the median increase in heart rate and apply the Wilcoxon signedrank test. What do you conclude?16.18 Stepping faster. Do the data from the previous exercise give good reasonto think that stepping at the medium rate increases heart rates more thanstepping at the low rate?(a) State hypotheses in terms of comparing the median increases for the twotreatments. What is the proper rank test for these hypotheses?(b) Carry out your test and state a conclusion.16.19 Learning a language. Table 7.2 (page 443) presents the scores on a test ofunderstanding of spoken French for a group of executives before and afteran intensive French course. The improvements in scores between the pretestand the posttest for the 20 executives were2 0 6 6 3 3 2 3 6 6 6 6 3 0 1 1 0 2 3 3Use the Wilcoxon signed rank procedure (implemented by software) to reacha conclusion about the effect of language instruction. State hypotheses inwords and report the statistic W , its P-value, and your conclusion. (Notethat there are several zeros and many ties in the data.)16.20 Do the calculation. Show the assignment of ranks and the calculation ofthe signed rank statistic W for the data in Exercise 16.19. Remember thatzeros are dropped from the data before ranking, so that n is the number ofnonzero differences within pairs.16.21 Vitamin loss in a food product. The 9 bags of wheat soy blend consideredin Example 16.6 are a sample from the full data set of 27 bags given in Exercise7.39 (page 457). A test based on the larger sample is more powerfuland is therefore more likely to find significant loss of vitamin C if a lossis actually present in the population of all bags shipped to Haiti. Is thereevidence of systematic loss of vitamin C in shipping and storage?16.22 Consumer perceptions of food safety. Case 16.2 describes a studyof the attitudes of people attending outdoor fairs about the safetyof the food served at such locations. The full data set is availableonline or on the CD as the file eg16-05.dat. It contains the responses of 303people to several questions. The variables in this data set are (in order)subject hfair sfair sfast srest genderCASE 16.2

a16.2 The Wilcoxon Signed Rank Test 16-29The variable “sfair” contains responses to the safety question described inCase 16.2. The variable “srest” contains responses to the same questionasked about food served in restaurants. We suspect that restaurant food willappear safer than food served outdoors at a fair. Do the data give goodevidence for this suspicion? (Give descriptive measures, a test statistic andits P-value, and your conclusion.)16.23 Consumer perceptions of food safety. The food safety surveydata described in Case 16.2 and Exercise 16.22 also contain theresponses of the 303 subjects to the same question asked aboutfood served at fast-food restaurants. These responses are the values of thevariable “sfast.” Is there a systematic difference between the level of concernabout food safety at outdoor fairs and at fast-food restaurants?16.24 Home radon detectors. Exercise 7.38 (page 457) reports readings from 12home radon detectors exposed to 105 picocuries per liter of radon:91.9 97.8 111.4 122.3 105.4 95.0 103.8 99.6 96.6 119.3 104.8 101.7We wonder if the median reading differs significantly from the true value105.(a) Graph the data, and comment on skewness and outliers. A rank test isappropriate.(b) We would like to test hypotheses about the median reading from homeradon detectors:H0: median 105H : median 105To do this, apply the Wilcoxon signed rank statistic to the differencesbetween the observations and 105. (This is the one-sample version ofthe test.) What do you conclude?16.25 The Platinum Gasaver. Exercises 1.41 and 7.12 concern the PlatinumGasaver, a device its maker says “may increase gas mileage by 22%.”An advertisement reports the results of a matched pairs study with 15identical vehicles. The claimed percent changes in gas mileage with theGasaver are48.3 46.9 46.8 44.6 40.2 38.5 34.6 33.728.7 28.7 24.8 10.8 10.4 6.9 12.4Is there good evidence that the Gasaver improves median gas mileage by22% or more? (Apply the Wilcoxon signed rank test to the differencesbetween the sample percents and the claimed 22%.)16.26 Home radon detectors, continued. Some software (Minitab, for example)calculates a confidence interval for the population median as part of theWilcoxon signed rank test. Using software and the data in Exercise 16.24,give a 95% confidence interval for the median reading of home radondetectors when exposed to 105 picocuries per liter of radon.16.27 Median spending at a supermarket. Some software (Minitab, for example)calculates a confidence interval for the population median as part of theWilcoxon signed rank test. Exercise 1.11 (page 19) gives the amounts spentCASE 16.2

16-30 CHAPTER 16Nonparametric Testsby a sample of 50 shoppers at a supermarket. Using software, find a 95%confidence interval for the median amount spent by all shoppers at thismarket.16.3 The Kruskal-Wallis TestWe have now considered alternatives to the paired-sample and two-samplet tests for comparing the magnitude of responses to two treatments. Tocompare more than two treatments, we use one-way analysis of variance(ANOVA) if the distributions of the responses to each treatment are at leastroughly Normal and have similar spreads. What can we do when thesedistribution requirements are violated?EXAMPLE16.11CASE 16.1Earnings of hourly bank workersTable 16.1 (page 16-6) gives data on the annual earnings of samples of four groupsof hourly workers at National Bank. We have so far compared only the blackfemale and black male groups. The Wilcoxon rank sum test (see the SAS outputin Figure 16.4) has one-sided P-value 0.0998 and two-sided P-value 0.1995. Thatis, these data do not give strong evidence of a systematic difference between theearnings of black females and black males. Now we ask if there is evidence of anydifferences among the four groups. The summary statistics areGroup n Mean Std. dev.Black female 15 17667.13 1576.57Black male 12 19804.17 3855.50White female 20 21484.80 4543.09White male 15 21283.93 4469.59Normal quantile plots (Figure 16.9) show that the earnings in some groups are notNormally distributed. Moreover, the sample standard deviations do not satisfy ourrule of thumb that for safe use of ANOVA the largest should not exceed twice thesmallest. We may want to use a nonparametric test.Hypotheses and assumptionsThe ANOVA F test concerns the means of the several populations representedby our samples. For Example 16.11, the ANOVA hypotheses areH : H0 1 2 3 4a: not all four group means are equalFor example, 1is the mean earnings in the population of all black femaleNational Bank hourly workers. The data for ANOVA should consist of*This section assumes that you have read Chapter 14 on one-way analysis of variance.

16.3 The Kruskal-Wallis Test 16-31Black female earnings14,000 16,000 18,000 20,000Black male earnings20,000 25,000–3 –2 –1 0 1 2 3z-score–3 –2 –1 0 1 2 3z-scoreWhite female earnings15,000 20,000 25,000 30,000–3 –2 –1 0 1 2 3z-scoreWhite male earnings15,000 20,000 25,000 30,000–3 –2 –1 0 1 2 3z-scoreFIGURE 16.9 Normal quantile plots for the earnings of the four groups of bank workers inTable 16.1.four independent random samples from the four populations, all Normallydistributed with the same standard deviation.The Kruskal-Wallis test is a rank test that can replace the ANOVA Ftest. The assumption about data production (independent random samplesfrom each population) remains important, but we can relax the Normalityassumption. We assume only that the response has a continuous distributionin each population. The hypotheses tested in our example areH0: earnings have the same distribution in all groupsH : earnings are systematically higher in some groups than in othersaIf all of the population distributions have the same shape (Normal or not),these hypotheses take a simpler form. The null hypothesis is that all fourpopulations have the same median earnings. The alternative hypothesis isthat not all four median earnings are equal.The Kruskal-Wallis testRecall the analysis of variance idea: we write the total observed variationin the responses as the sum of two parts, one measuring variation amongthe groups (sum of squares for groups, SSG) and one measuring variationamong individual observations within the same group (sum of squares for

16-32 CHAPTER 16Nonparametric Testserror, SSE). The ANOVA F test rejects the null hypothesis that the meanresponses are equal in all groups if SSG is large relative to SSE.The idea of the Kruskal-Wallis rank test is to rank all the responses fromall groups together and then apply one-way ANOVA to the ranks ratherthan to the original observations. If there are N observations in all, the ranksare always the whole numbers from 1 to N. The total sum of squares forthe ranks is therefore a fixed number no matter what the data are. So we donot need to look at both SSG and SSE. Although it isn’t obvious withoutsome unpleasant algebra, the Kruskal-Wallis test statistic is essentially justSSG for the ranks. We give the formula, but you should rely on software todo the arithmetic. When SSG is large, that is evidence that the groups differ.THE KRUSKAL-WALLIS TESTDraw independent SRSs of sizes n1, n2, ..., nIfrom I populations.There are N observations in all. Rank all N observations and letRibe the sum of the ranks for the ith sample. The Kruskal-Wallisstatistic is12H NN ( 1) 3( N1)When the sample sizes niare large and all I populations have thesame continuous distribution, H has approximately the chi-squaredistribution with I 1 degrees of freedom.The Kruskal-Wallis test rejects the null hypothesis that allpopulations have the same distribution when H is large.2iRniWe now see that, like the Wilcoxon rank sum statistic, the Kruskal-Wallisstatistic is based on the sums of the ranks for the groups we are comparing.The more different these sums are, the stronger is the evidence that responsesare systematically larger in some groups than in others.The exact distribution of the Kruskal-Wallis statistic H under the nullhypothesis depends on all the sample sizes n1to nI, so tables are awkward.The calculation of the exact distribution is so time-consuming for all butthe smallest problems that even most statistical software uses the chisquareapproximation to obtain P-values. As usual, there is no usable exactdistribution when there are ties among the responses. We again assignaverage ranks to tied observations.EXAMPLE16.12CASE 16.1Rank test for bank workers’ earningsIn Example 16.11, there are I 4 populations. The sample sizes are n1 15,n2 12, n3 20, and n4 15. There are N 62 observations in all. Becausethe data are annual earnings to the exact dollar, there are no ties.Figure 16.10 displays the output from the SAS statistical software, whichgives the results H 9. 0047 and P 0.0292. We have reasonably good evidence

16.3 The Kruskal-Wallis Test 16-33SASWilcoxon Scores (Rank Sums) for Variable EARNINGSClassified by Variable TYPETYPENSum ofScoresExpectedUnder H0Std DevUnder H0MeanScoreBFBMWFWM15122015308.0350.0745.0550.0472.50378.00630.00472.5060.83789956.12486166.40783160.83789920.60000029.16666737.20000036.666667Kruskal-Wallis TestChi-SquareDFPr > Chi-Square9.004730.0292FIGURE 16.10 Output from SAS for the Kruskal-Wallis test applied to the data forExample 16.11. SAS uses the chi-square approximation to obtain a P -value. Othersoftware gives similar output.that the distributions of earnings for the four populations of bank workers arenot all the same. As with ANOVA, the Kruskal-Wallis test does not tell us whichgroups differ.Doing the work by hand is a bit tedious. We chose to display SAS outputbecause SAS also gives the rank sums Rifor the four groups, under the heading“Sum of Scores.” This allows us to verify that the Kruskal-Wallis statistic is12H NN ( 1)Rn2ii 3( N1)12 308 350 745 550 (62)(63) 15 12 20 1512 (64450 . 52) 1893906 9.00472 2 2 2Referring to the table of chi-square critical points (Table F) with df 3, we findthat the P -value lies in the interval 0. 025 P 0.05. (3)(63)As an option, SAS will calculate the exact P-value for the Kruskal-Wallistest. For 4 groups with 62 observations, however, the computation wouldrequire several hours of computing time. The ordinary ANOVA F test forthese data gives F 3. 305 with P 0.0264. The practical conclusion isthe same. Many people would use ANOVA because the F test is reasonably

16-34 CHAPTER 16Nonparametric Testsrobust against non-Normality and our rule of thumb for standard deviationsis quite conservative.SECTION16.3 SUMMARY The Kruskal-Wallis test compares several populations on the basis ofindependent random samples from each population. This is the one-wayanalysis of variance setting. The null hypothesis for the Kruskal-Wallis test is that the distribution ofthe response variable is the same in all the populations. The alternativehypothesis is that responses are systematically larger in some populationsthan in others. The Kruskal-Wallis statistic H can be viewed in two ways. It isessentially the result of applying one-way ANOVA to the ranks of theobservations. It is also a comparison of the sums of the ranks for theseveral samples. When the sample sizes are not too small and the null hypothesis istrue, H for comparing I populations has approximately the chi-squaredistribution with I 1 degrees of freedom. We use this approximatedistribution to obtain P-values.SECTION16.3 EXERCISESStatistical software is needed to do these exercises without unpleasant handcalculations. If you do not have access to software, find the Kruskal-Wallis statistic H by hand and use the chi-square table to get approximateP-values.16.28 Consumer response to advertising. Example 14.13 describes a study ofhow advertising influences consumer judgments of the quality of a product.Market researchers asked consumers to rate the quality (on a1to7scale)of a new line of refrigerated dinner entrees ´ based on reading a magazine ad.The study compared three advertisements. The first ad included informationthat would undermine the expected positive association between perceivedquality and advertising; the second contained information that would affirmthe association; and the third was a neutral control. The subjects’ responsesappear in Table 14.2.(a) The study was a completely randomized experiment. Outline the studydesign.(b) Because the response variable has a 1 to 7 scale, it cannot have aNormal distribution. The samples are large enough that we might useeither one-way ANOVA or the Kruskal-Wallis test. What are the nulland alternative hypotheses for each of these tests?(c) Using software, carry out the Kruskal-Wallis test. Report the value ofthe statistic H, its P-value, and your conclusion.16.29 Loss of vitamin C in bread. Exercise 14.44 presents the following data froma study of the loss of vitamin C in bread after baking:

16.3 The Kruskal-Wallis Test 16-35Condition Vitamin C (mg/100 g)Immediately after baking 47.62 49.79One day after baking 40.45 43.46Three days after baking 21.25 22.34Five days after baking 13.18 11.65Seven days after baking 8.51 8.13The loss of vitamin C over time is clear, but with only 2 loaves of breadfor each storage time we wonder if the differences among the groups aresignificant.(a) Use the Kruskal-Wallis test to assess significance, then write a briefsummary of what the data show.(b) Because there are only 2 observations per group, we suspect that thecommon chi-square approximation to the distribution of the Kruskal-Wallis statistic may not be accurate. The exact P-value (from the SASsoftware) is P 0. 0011. Compare this with your P-value from (a). Isthe difference large enough to affect your conclusion?16.30 Exercise and bone density. Exercise 14.54 discusses a study of the effect ofexercise on bone density in growing rats. Ten rats were assigned to eachof three treatments: a 60-centimeter “high jump,” a 30-centimeter “lowjump,” and a control group with no jumping. Here are the bone densities(in milligrams per cubic centimeter) after 8 weeks of 10 jumps per day:3Group Bone density (mg/cm )Control 611 621 614 593 593 653 600 554 603 569Low jump 635 605 638 594 599 632 631 588 607 596High jump 650 622 626 626 631 622 643 674 643 650(a) The study was a randomized comparative experiment. Outline the designof this experiment.(b) Make side-by-side stemplots for the three groups, with the stems linedup for easy comparison. The distributions are a bit irregular but notstrongly non-Normal. We would usually use analysis of variance toassess the significance of the difference in group means.(c) Do the Kruskal-Wallis test. Explain the distinction between the hypothesestested by Kruskal-Wallis and ANOVA.(d) Write a brief statement of your findings. Include a numerical comparisonof the groups as well as your test result.16.31 Attracting beetles. In Exercise 14.61 you used ANOVA to analyze the resultsof a study to see which of four colors best attracts cereal leaf beetles. Hereare the data:ColorInsects trappedLemon yellow 45 59 48 46 38 47White 21 12 14 17 13 17Green 37 32 15 25 39 41Blue 16 11 20 21 14 7

16-36 CHAPTER 16 Nonparametric TestsBecause the samples are small, we will apply a nonparametric test.(a) What hypotheses does ANOVA test? What hypotheses does Kruskal-Wallis test?(b) Find the median number of beetles trapped by boards of each color.Which colors appear more effective? Use the Kruskal-Wallis test tosee if there are significant differences among the colors. What do youconclude?16.32 Kruskal-Wallis details. Exercise 16.31 gives data on the counts of insectsattracted by boards of four different colors. Carry out the Kruskal-Wallistest by hand, following these steps.(a) What are I, the ni, and N in this example?(b) Arrange the counts in order and assign ranks. Be careful about ties. Findthe sum of the ranks Rifor each color.(c) Calculate the Kruskal-Wallis statistic H. How many degrees of freedomshould you use for the chi-square approximation to its distribution underthe null hypothesis? Use the chi-square table to give an approximateP-value.16.33 Decay of polyester fabric. Here are the breaking strengths (in pounds) of12strips of polyester fabric buried in the ground for several lengths of time:TimeBreaking strength2 weeks 118 126 126 120 1294 weeks 130 120 114 126 1288 weeks 122 136 128 146 14016 weeks 124 98 110 140 110Breaking strength is a good measure of the extent to which the fabric hasdecayed.(a) Find the standard deviations of the 4 samples. They do not meet ourrule of thumb for applying ANOVA. In addition, the sample buried for16 weeks contains an outlier. We will use the Kruskal-Wallis test.(b) Find the medians of the 4 samples. What are the hypotheses for theKruskal-Wallis test, expressed in terms of medians?(c) Carry out the test and report your conclusion.16.34 Food safety: fairs, fast food, restaurants. Case 16.2 describes a study ofthe attitudes of people attending outdoor fairs about the safety of the foodserved at such locations. The full data set is available online or on the CDas the file eg16-05.dat. It contains the responses of 303 people to severalquestions. The variables in this data set are (in order)subject hfair sfair sfast srest genderThe variable “sfair” contains responses to the safety question describedin Case 16.2. The variables “srest” and “sfast” contain responses to thesame question asked about food served in restaurants and in fast-foodchains. Explain carefully why we cannot use the Kruskal-Wallis test to see ifthere are systematic differences in perceptions of food safety in these threelocations.

Statistics in Summary 16-3716.35 Effects of logging in tropical forests. In Exercise 16.11 you compared thenumber of tree species in plots of land in a tropical rainforest that had neverbeen logged with similar plots nearby that had been logged 8 years earlier.The researchers also counted species in plots that had been logged just 113year earlier. Here are the counts of species:Plot typeSpecies countUnlogged 22 18 22 20 15 21 13 13 19 13 19 15Logged 1 year ago 11 11 14 7 18 15 15 12 13 2 15 8Logged 8 years ago 17 4 18 14 18 15 15 10 12(a) Use side-by-side stemplots to compare the distributions of number oftrees per plot for the three groups of plots. Are there features that mightprevent use of ANOVA? Also give the median number of trees per plotin the three groups.(b) Use the Kruskal-Wallis test to compare the distributions of tree counts.State hypotheses, the test statistic and its P-value, and your conclusions.16.36 Smoking and SES. In a study of heart disease in male federal employees,researchers classified 356 volunteer subjects according to their socioeconomicstatus (SES) and their smoking habits. There were three categories of SES:high, middle, and low. Individuals were asked whether they were currentsmokers, former smokers, or had never smoked. Here are the data, as a14two-way table of counts:SES Never (1) Former (2) Current (3)High 68 92 51Middle 9 21 22Low 22 28 43The data for all 356 subjects are stored in the file ex16-36.dat online and onthe CD. Smoking behavior is stored numerically as 1, 2, or 3 using the codesgiven in the column headings above.(a) Higher SES people in the United States smoke less as a group thanlower SES people. Do these data show a relationship of this kind? Givepercents that back your statements.(b) Apply the chi-square test to see if there is a significant relationshipbetween SES and smoking behavior.(c) The chi-square test ignores the ordering of the responses. Use theKruskal-Wallis test (with many ties) to test the hypothesis that some SESclasses smoke systematically more than others.STATISTICS INSUMMARYThe standard procedures for inference about the mean of a distributionand for comparing the means of several distributions are the t proceduresand analysis of variance. These procedures are based on the condition thatthe populations in question have Normal distributions. Fortunately, these

16-38 CHAPTER 16Nonparametric TestsNormal procedures are quite robust against lack of Normality, especiallywhen the samples are not very small. If the distributions are strongly non-Normal, however, we may prefer nonparametric procedures that are notbased on Normality and that do not identify the center of a distribution asits mean. The rank tests of this chapter are nonparametric replacements forthe one- and two-sample t tests and the analysis of variance F test. Criticalvalues and P-values for these tests come from special sampling distributionsthat are practical only when the samples are small and there are no ties.Otherwise, Normal approximations are used. In practice, use of rank testsrequires software in all but the simplest problems. Here are the skills youshould develop from studying this chapter.A. RANKS1. Assign ranks to a moderate number of observations. Use average ranksif there are ties among the observations.2. From the ranks, calculate the rank sums when the observations comefrom two or several samples.B. RANK TEST STATISTICS1. Determine which of the rank sum tests is appropriate in a specificproblem setting.2. Calculate the Wilcoxon rank sum W from ranks for two samples, theWilcoxon signed rank sum W for matched pairs, and the Kruskal-Wallis statistic H for two or more samples.3. State the hypotheses tested by each of these statistics in specificproblem settings.4. Determine when it is appropriate to state the hypotheses for W andH in terms of population medians.C. RANK TESTS1. Use software to carry out any of the rank tests. Combine the testwith data description and give a clear statement of findings in specificproblem settings.2. Use the Normal approximation with continuity correction to findapproximate P-values for W and W . Use a table of chi-squarecritical values to approximate the P-value of H.C 16 R EHAPTER EVIEW XERCISES16.37 Design of controls. Exercise 7.40 (page 458) contains data from a studentproject that investigated whether right-handed people can turn a knob fasterclockwise than they can counterclockwise. Describe what the data show,then state hypotheses and do a test that does not require Normality. Reportyour conclusions carefully.16.38 Retail sales. Exercise 7.13 (page 447) gives monthly sales of electric canopeners at a sample of 50 retail stores. That exercise reports 95% confidence

Chapter 16 Review Exercises 16-3916.39 The cost of Internet access. Exercise 1.23 (page 27) gives the amounts paidfor Internet access by a random sample of 50 consumers. It appears thatAmerica Online and other national service providers charged $20 for accessat the time of the survey.(a) Make a graph of the data: they are strongly non-Normal.(b) Carry out a rank test of the hypothesis that the median cost for allconsumers of Internet access is $20 against the two-sided alternative.What do you conclude?(c) The t procedures are quite robust when n 50. Carry out a t test of thehypothesis that the population mean cost is $20 against the two-sidedalternative. How do your results in (b) and (c) compare?16.40 Learning Spanish. The growth of the Hispanic population in the UnitedStates has made ability to speak Spanish a valuable skill for many businesspeople. Exercise 7.42 (page 459) gives the scores on a test of comprehensionof spoken Spanish before and after an intensive language course for 20executives. Is there good evidence that scores are systematically higher afterthe course? Use a test that is not based on Normality.16.41 Calories in hot dog brands. Table 16.2 presents data on the calorie andsodium content of selected brands of beef, meat, and poultry hot dogs. Wewill regard these brands as random samples from all brands available infood stores.(a) Make stemplots of the calorie contents side by side, using the samestems for easy comparison. Give the five-number summaries for thethree types of hot dog. What do the data suggest about the caloriecontent of different types of hot dog?(b) Are any of the three distributions clearly not Normal? Which ones,and why?(c) Carry out a nonparametric test. Report your conclusions carefully.16.42 House prices. Case Study 7.2 (page 501) reports data on the selling pricesof 9 four-bedroom houses and 28 three-bedroom houses in West Lafayette,Indiana. We wonder if there is a difference between the average prices ofthree- and four-bedroom houses in this community.(a) Make a Normal quantile plot of the prices of three-bedroom houses.What kind of deviation from Normality do you see?(b) The t tests are quite robust. State the hypotheses for the proper t test,carry out the test, and present your results including appropriate datasummaries.(c) Carry out a nonparametric test. Once more state the hypotheses testedand present your results for both the test and the data summaries thatshould go with it.16.43intervals for the mean sales in all stores obtained by the bootstrap method,which does not require Normality. Find a 95% confidence interval for themedian sales and compare your results with those in Exercise 7.13.Repeat the analysis of Exercise 16.41 for the sodium content of hot dogs.Exercise 14.56 reports data from a study of iron-deficiency anemia inEthiopia. The issue is whether Ethiopian food loses more iron when cooked

16-40 CHAPTER 16 Nonparametric TestsTABLE 16.2Calories and sodium in three types of hot dogsBeef hot dogs Meat hot dogs Poultry hot dogsCalories Sodium Calories Sodium Calories Sodium186 495 173 458 129 430181 477 191 506 132 375176 425 182 473 102 396149 322 190 545 106 383184 482 172 496 94 387190 587 147 360 102 542158 370 146 387 87 359139 322 139 386 99 357175 479 175 507 170 528148 375 136 393 113 513152 330 179 405 135 426111 300 153 372 142 513141 386 107 144 86 358153 401 195 511 143 581190 645 135 405 152 588157 440 140 428 146 522131 317 138 339 144 545149 319135 298132 253in some types of pots. Here are data on iron content (milligrams per 100grams of food) for three types of food cooked in each of three types of pots.Exercises 16.44 to 16.46 use these data.Iron contentType of pot Meat Legumes VegetablesAluminum 1.77 2.36 1.96 2.14 2.40 2.17 2.41 2.34 1.03 1.53 1.07 1.30Clay 2.27 1.28 2.48 2.68 2.41 2.43 2.57 2.48 1.55 0.79 1.68 1.82Iron 5.27 5.17 4.06 4.22 3.69 3.43 3.84 3.72 2.45 2.99 2.80 2.9216.44We want to know if the vegetable dish varies in iron content when cookedin aluminum, clay, and iron pots.(a) Check the requirements for one-way ANOVA. Which requirements area bit dubious in this setting?(b) Instead of ANOVA, do a nonparametric test. Summarize your conclusionsabout the effect of pot material on iron content, including bothdescriptive measures and your test result.

Chapter 16 Review Exercises 16-4116.4516.46There appears to be little difference between the iron content of food cookedin aluminum pots and food cooked in clay pots. Is there a significantdifference between the iron content of meat cooked in aluminum and clay?Is the difference between aluminum and clay significant for legumes? Usenonparametric tests.The data show that food cooked in iron pots has the highest iron content.They also suggest that the three types of food differ in iron content. Is theresignificant evidence that the three types of food differ in iron content whenall are cooked in iron pots?Education and income. The individuals.dat data set, described in the DataAppendix, contains information on the income and education level (amongother variables) of a random sample of more than 55,000 individuals. Wewill treat these individuals as a population for software-based Exercises16.47 to 16.50.16.47 Does finishing college raise income? Select an SRS of 200 individuals fromthis population. We are interested only in the sample individuals who begancollege but did not obtain a bachelor’s degree (education level 4) and thosewhose highest education is a bachelor’s degree (level 5).(a) How many people of each of these levels are in your sample?(b) Briefly compare the incomes of these two groups. Does it appear thatpeople with a bachelor’s degree earn systematically more than thosewithout a degree?(c) We expect income distributions to be clearly non-Normal. In what wayare your sample data non-Normal?(d) Does the bachelor’s degree group earn significantly more than thenondegree group? (Give a test statistic, its P-value, and your conclusion.)(e) Does your study show that finishing college actually causes higherincome? Explain your answer.16.48 Does the private sector pay better? Select an SRS of 100 individuals fromthe population. We are interested only in the sample individuals whosemain work experience is in either the private sector (code 5) or government(code 6). Carry out a full comparison of the distributions of income inthese two groups, including numerical and graphical descriptions and atest that asks whether private-sector workers earn systematically more thangovernment workers. State your conclusions clearly.16.49 Incomes of men and women. Select an SRS of 100 individuals from thepopulation. Make a careful comparison of the incomes of men and women,including graphical and numerical descriptions and a test of whether eithergender has a systematically higher income. Explain your conclusionclearly.16.50 Education and income. The data report six levels of education, coded as 1to 6 in order of increasing education. Select an SRS of 200 individuals fromthe population.(a) Compare the distributions of income for the 6 education levels usingboth a graph and numerical summaries. Does income appear to increaseas education increases?

16-42 CHAPTER 16 Nonparametric Tests(b) Is there good evidence that the distribution of income differs systematicallyamong the six education groups? Give a test statistic, its P-value,and your conclusion.16.51 Rank sums. We have noted several times that the sum of the ranks 1to N must be a fixed number that depends only on the total number ofobservations N.(a) To find this sum in terms of N, return to the formula for the mean ofthe Wilcoxon rank sum W in the box on page 16-8. There are two ranksums, for the two samples of sizes n1 and n2, where n1n2 N. Thesum of the two means must be the sum of all ranks 1 to N. Why? Usethis fact to find the sum.(b) On page 16-8, we claimed that “the sum of the ranks from 1 to 27 isalways equal to 378.” Use your result from (a) to verify this.(c) Figure 16.10 (page 16-33) gives the rank sums for four groups, of 15,12, 20, and 15 observations. There are 62 observations in all. Verifythat the four rank sums add to the sum of all ranks 1 to 62.Notes for Chapter 161. John M. Bizjak and Jeffrey L. Coles, “The effect of private antitrust litigation on thestock-market valuation of the firm,” American Economic Review, 85 (1995), pp.436–461.2. For purists, here is the precise definition: X is stochastically larger than X if1 2PX ( 1 a) PX ( 2 a)for all a, with strict inequality for at least one a. The Wilcoxon rank sum test iseffective against this alternative in the sense that the power of the test approaches 1(that is, the test becomes more certain to reject the null hypothesis) as the number ofobservations increases.3. Data from Huey Chern Boo, “Consumers’ perceptions and concerns about safety andhealthfulness of food served at fairs and festivals,” M.S. thesis, Purdue University,1997.4. Modified from M. C. Wilson and R. E. Shade, “Relative attractiveness of variousluminescent colors to the cereal leaf beetle and the meadow spittlebug,” Journal ofEconomic Entomology,60(1967), pp. 578–580.5. From Sapna Aneja, “Biodeterioration of textile fibers in soil,” M.S. thesis, PurdueUniversity, 1994.6. Data provided by Warren Page, New York City Technical College, from a studydone by John Hudesman.7. Data provided by Charles Cannon, Duke University. The study report is C. H.Cannon, D. R. Peart, and M. Leighton, “Tree species diversity in commerciallylogged Bornean rainforest,” Science, 281 (1998), pp. 1366–1367.8. From William D. Darley, “Store-choice behavior for pre-owned merchandise,”Journal of Business Research,27(1993), pp. 17–31.9. Data from “Results report on the vitamin C pilot program,” prepared by SUSTAIN(Sharing United States Technology to Aid in the Improvement of Nutrition) for theU.S. Agency for International Development.10. Data for 1998 provided by Jason Hamilton, University of Illinois. The study reportis Evan H. DeLucia et al., “Net primary production of a forest ecosystem withexperimental CO enhancement,” Science, 284 (1999), pp. 1177–1179.2

Notes for Chapter 16 16-4311. Simplified from the EESEE story “Stepping Up Your Heart Rate,” on the CD.12. See Note 5.13. See Note 7.14. Ray H. Rosenman et al., “A 4-year prospective study of the relationship of differenthabitual vocational physical activity to risk and incidence of ischemic heart disease involunteer male federal employees,” in P. Milvey (ed.), The Marathon: Physiological,Medical, Epidemiological and Psychological Studies,New York Academy of Sciences,301 (1977), pp. 627–641.

Chapter 16: Nonparametric Tests - WH Freeman

Create successful ePaper yourself

Delete template?

Save as template?