12.07.2015 Views

The Evolution of the Black-White Test Score Gap in Grades K-3: The ...

The Evolution of the Black-White Test Score Gap in Grades K-3: The ...

The Evolution of the Black-White Test Score Gap in Grades K-3: The ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>The</strong> <strong>Evolution</strong> <strong>of</strong> <strong>the</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Score</strong> <strong>Gap</strong> <strong>in</strong><strong>Grades</strong> K-3: <strong>The</strong> Fragility <strong>of</strong> Results ∗Timothy N. Bond † Kev<strong>in</strong> Lang ‡August 16, 2012AbstractAlthough both economists and psychometricians typically treat <strong>the</strong>m as <strong>in</strong>tervalscales, test scores are reported us<strong>in</strong>g ord<strong>in</strong>al scales. Us<strong>in</strong>g <strong>the</strong> Early ChildhoodLongitud<strong>in</strong>al Study and <strong>the</strong> Children <strong>of</strong> <strong>the</strong> National Longitud<strong>in</strong>al Survey<strong>of</strong> Youth, we exam<strong>in</strong>e how order-preserv<strong>in</strong>g scale transformations affect <strong>the</strong> evolution<strong>of</strong> <strong>the</strong> black-white read<strong>in</strong>g test score gap from k<strong>in</strong>dergarten entry throughthird grade.Plausible transformations reverse <strong>the</strong> growth <strong>of</strong> <strong>the</strong> gap <strong>in</strong> <strong>the</strong>CNLSY and greatly reduce it <strong>in</strong> <strong>the</strong> ECLS-K dur<strong>in</strong>g <strong>the</strong> early school years. Allgrowth from entry through first grade and a nontrivial proportion from first tothird grade probably reflects scal<strong>in</strong>g decisions. JEL codes: I24, J15∗We are grateful to Henry Braun, Sandy Jencks, Dan Koretz, Sunny Ladd, Ed Lazear,Michael Manove, Dick Murnane, Sean Reardon, Bruce Spencer, an anonymous referee, <strong>the</strong>editor and participants <strong>in</strong> sem<strong>in</strong>ars and workshops at Baruch College, Boston University,Brown, CERGE-EI, Sciences Po and Stanford for helpful comments and suggestions. <strong>The</strong>usual caveat applies.†Department <strong>of</strong> Economics, Krannert School <strong>of</strong> Management, Purdue University, 403 W.State Street, West Lafayette, IN 47907; email: tnbond@purdue.edu‡Department <strong>of</strong> Economics, Boston University, 270 Bay State Road, Boston, MA 02215;email:lang@bu.edu


to converse fluently <strong>in</strong> Lat<strong>in</strong>. In <strong>the</strong> latter case, as economists, we are <strong>in</strong>cl<strong>in</strong>ed to view <strong>the</strong>marg<strong>in</strong>al value <strong>of</strong> 1 as much greater than that <strong>of</strong> 2 which is <strong>in</strong> turn much greater than 3, but<strong>the</strong>re are surely o<strong>the</strong>r admissible scales.Suppose we have a sample <strong>of</strong> twenty blacks and twenty whites. In each case, two peoplereceive scores <strong>of</strong> a and two scores <strong>of</strong> d. Overall, however, blacks do worse than whites. Of<strong>the</strong> rema<strong>in</strong><strong>in</strong>g 16 blacks, 14 get a b and two get a c, while among whites <strong>the</strong> figures are twob and fourteen c.If we use <strong>the</strong> naive scale, 0, 1, 2, 3, <strong>the</strong>n <strong>the</strong> difference <strong>in</strong> <strong>the</strong> means is .6 or about .73standard deviations.<strong>The</strong> <strong>the</strong>oretical lower limit is reached by mak<strong>in</strong>g <strong>the</strong> gap between b and c arbitrarilysmall or, equivalently, send<strong>in</strong>g a and d to m<strong>in</strong>us and plus <strong>in</strong>f<strong>in</strong>ity. In this case, <strong>the</strong>re is nogap s<strong>in</strong>ce <strong>the</strong> total number <strong>of</strong> blacks and whites with a score <strong>of</strong> ei<strong>the</strong>r b or c is identical.At <strong>the</strong> o<strong>the</strong>r extreme, if we treat a and b and c and d as essentially equal, <strong>the</strong>n we get agap <strong>of</strong> 1.20 standard deviations. Without some external reference for determ<strong>in</strong><strong>in</strong>g <strong>the</strong> properscale, all we can say is that <strong>the</strong> test-score gap is somewhere between 0 and 1.20 standarddeviations. Of course, <strong>in</strong> some cases such an external standard may exist. Cunha, Heckmanand Schennach (2010) tie a variety <strong>of</strong> objective and subjective measures <strong>of</strong> cognitive andnoncognitive ability to later adult outcomes. More generally, test scores can be tied to o<strong>the</strong>rmeasures <strong>of</strong> performance.<strong>The</strong> situation becomes, if anyth<strong>in</strong>g, more diffi cult when we attempt to determ<strong>in</strong>e whe<strong>the</strong>r<strong>the</strong> gap <strong>in</strong>creases or decreases as children progress through school. We suppose aga<strong>in</strong> thatour test<strong>in</strong>g situation is ideal. A year later, we adm<strong>in</strong>ister a very similar (<strong>in</strong> jargon, verticallyl<strong>in</strong>ked) test so that a score on <strong>the</strong> second test is fully equivalent to that same score on <strong>the</strong>5


first test. Suppose fur<strong>the</strong>r that on adm<strong>in</strong>ister<strong>in</strong>g <strong>the</strong> second test, we observe that with<strong>in</strong>each race/performance level, half <strong>of</strong> <strong>the</strong> <strong>in</strong>dividuals advanced exactly one level so that <strong>the</strong>reis now one black and one white at each <strong>of</strong> levels a and e.If we follow <strong>the</strong> “natural” scale and assign consecutive <strong>in</strong>tegers to <strong>the</strong> levels so that ane corresponds to a scaled score <strong>of</strong> 4, <strong>the</strong>n <strong>the</strong> test-score gap rema<strong>in</strong>s at .6, but decl<strong>in</strong>es to.62 with<strong>in</strong>-grade standard deviations because <strong>the</strong> with<strong>in</strong>-grade variance has <strong>in</strong>creased. Incontrast, suppose that we believe that over <strong>the</strong> two tests, <strong>the</strong> score distribution should beapproximately normal. <strong>The</strong>n we would assign scaled scores <strong>of</strong> −1.89, −.75, .27, 1.27, 2.34.<strong>The</strong> scaled-score gap would be virtually unchanged at roughly .61 <strong>in</strong> both periods, butbecause <strong>the</strong> variance <strong>of</strong> <strong>the</strong> scaled scores would have decl<strong>in</strong>ed, <strong>the</strong> gap as a proportion <strong>of</strong> <strong>the</strong>standard deviation would grow from .61 to .72. This is similar to <strong>the</strong> situation documented<strong>in</strong> Murnane et al (2006) where <strong>the</strong> gap us<strong>in</strong>g <strong>the</strong> test scale rema<strong>in</strong>s constant but <strong>the</strong> gap asa proportion <strong>of</strong> <strong>the</strong> standard deviation grows.<strong>The</strong> scale that m<strong>in</strong>imizes <strong>the</strong> level <strong>of</strong> growth <strong>in</strong> this example is to set <strong>the</strong> scores for b,c, d to be approximately equal. Any scale that sets b and c equal creates a gap <strong>of</strong> 0 on<strong>the</strong> first test, while any scale that sets b, c, and d equal creates a gap <strong>of</strong> 0 on <strong>the</strong> secondtest. With both gaps equal to 0, <strong>the</strong>re is no growth <strong>in</strong> <strong>the</strong> gap between <strong>the</strong> two tests. <strong>The</strong>scale that maximizes <strong>the</strong> level <strong>of</strong> growth sets a, b, and c and d and e as approximately equal.Aga<strong>in</strong>, with b and c be<strong>in</strong>g approximately equal, we have no gap on <strong>the</strong> first test, howeverthis arrangement gives <strong>the</strong> upper bound on <strong>the</strong> second test <strong>of</strong> 1.39. All we can say, <strong>the</strong>refore,is that <strong>the</strong> growth <strong>of</strong> <strong>the</strong> racial test gap between <strong>the</strong> two tests is somewhere between 0 and1.39 standard deviations.6


3 Some Background on Educational MeasurementWhen we have presented early versions <strong>of</strong> this paper <strong>in</strong> sem<strong>in</strong>ars, we have sometimes beenasked how people <strong>in</strong> <strong>the</strong> educational measurement field have responded to it. <strong>The</strong> simpleanswer is “with much k<strong>in</strong>dness and patience for <strong>the</strong> gaps <strong>in</strong> our knowledge.”Several extremelyhelpful experts are mentioned <strong>in</strong> <strong>the</strong> acknowledgements. One person who told us that he haddiscussed <strong>the</strong> paper with a number <strong>of</strong> o<strong>the</strong>r people <strong>in</strong> <strong>the</strong> field reported that <strong>the</strong> reactionswere divided between “that is deeply disturb<strong>in</strong>g” and “that cannot be important.” A fewhave implied that we underappreciated <strong>the</strong> degree to which psychometricians are aware <strong>of</strong> <strong>the</strong>issues we raise, an error that we attempt to correct <strong>in</strong> this section. But none has suggestedthat test scores are, <strong>in</strong> fact, measured on <strong>in</strong>terval scales. A dist<strong>in</strong>ct m<strong>in</strong>ority <strong>of</strong> researchers<strong>in</strong> <strong>the</strong> field believe that certa<strong>in</strong> scales are <strong>in</strong>terval scales, but ei<strong>the</strong>r <strong>the</strong>y did not read ourearlier drafts, or <strong>the</strong>y did not bo<strong>the</strong>r to correspond with us. 3<strong>The</strong> concern that test scores are ord<strong>in</strong>al scales is long-stand<strong>in</strong>g. <strong>The</strong> earliest reference wehave been able to f<strong>in</strong>d is Stevens (1946). Among <strong>the</strong> giants <strong>of</strong> <strong>the</strong> test<strong>in</strong>g field, Thorndike(1966, p. 124) makes <strong>the</strong> po<strong>in</strong>t emphasized <strong>in</strong> this article: “... it is assumed that <strong>the</strong>numerals <strong>in</strong> which <strong>the</strong> variables are expressed represent equal <strong>in</strong>crements <strong>in</strong> some attribute.It is also recognized that this assumption is usually not well supported. But for ‘rough andready’studies <strong>of</strong> relationship, <strong>the</strong> violation <strong>of</strong> <strong>the</strong> assumption usually does not hurt much.However, when start<strong>in</strong>g to deal with someth<strong>in</strong>g as fragile as a change score, <strong>the</strong> violation <strong>of</strong>this basic assumption becomes a good deal more critical.”In this section, we provide a brief primer on item response <strong>the</strong>ory (IRT) scal<strong>in</strong>g so that<strong>the</strong> economists and o<strong>the</strong>rs can have a better understand<strong>in</strong>g <strong>of</strong> why some researchers might7


elieve that it generates <strong>in</strong>terval scales. We <strong>the</strong>n briefly discuss how <strong>the</strong> ord<strong>in</strong>ality <strong>of</strong> IRT(and o<strong>the</strong>r) scales has been addressed. We aga<strong>in</strong> thank our tutors for <strong>the</strong>ir help and absolve<strong>the</strong>m <strong>of</strong> responsibility for our errors. Readers who are familiar with <strong>the</strong> relevant literatureor who do not care about it can skip this section. We refer readers who want more detail toBaker (2001).IRT models are def<strong>in</strong>ed by <strong>the</strong>ir number <strong>of</strong> parameters. Across all models, questions areidentified by a diffi culty parameter b, while <strong>in</strong>dividuals vary by a parameter θ. <strong>The</strong> goal <strong>of</strong>IRT is to estimate θ for each test-taker.Start<strong>in</strong>g with <strong>the</strong> three-parameter logistic IRT model, <strong>the</strong> probability that a student iwill answer question q correctly is given by 4p iq = c q + (1 − c q )11 + e −aq(θ i−b q) . (1)c captures <strong>the</strong> guessability <strong>of</strong> <strong>the</strong> question. It is <strong>the</strong> probability that a student <strong>of</strong> extremelylow ability will get <strong>the</strong> answer right. When this probability is negligible, c = 0, and we arriveat <strong>the</strong> two-parameter IRT model:p iq =11 + e −aq(θ i−b q) . (2)<strong>The</strong>n a student with θ i = b q is one who will get a question <strong>of</strong> diffi culty b q right exactlyhalf <strong>the</strong> time. a measures how well <strong>the</strong> question dist<strong>in</strong>guishes among students <strong>of</strong> differentabilities. If a = ∞, it dist<strong>in</strong>guishes perfectly between those with θ > b and those with θ < b.It is universally recognized that <strong>the</strong> parameters are def<strong>in</strong>ed only up to a l<strong>in</strong>ear transfor-8


mation. Add<strong>in</strong>g <strong>the</strong> same constant to θ and b or multiply<strong>in</strong>g θ and b and divid<strong>in</strong>g a by <strong>the</strong>same constant does not change p. We can normalize <strong>the</strong> scale by, for example, sett<strong>in</strong>g <strong>the</strong>mean to be 100 and <strong>the</strong> standard deviation to be 15. Some authors claim that θ is uniquelyidentified given this normalization. <strong>The</strong> argument seems to be that given this normalizationand adequate data, <strong>the</strong> θ i are unique.However, as discussed by Lloyd (1975), one <strong>of</strong> <strong>the</strong> fa<strong>the</strong>rs <strong>of</strong> IRT, if θ ′ i = f (θ i ) where fis a strictly <strong>in</strong>creas<strong>in</strong>g monotonic function, <strong>the</strong>np iq =11 + e −aq(f −1 (θ ′ i )−bq) (3)also fits <strong>the</strong> data with <strong>the</strong> same likelihood. As noted, most psychometricians accept Lloyd’sargument.In <strong>the</strong> special case where all questions are equally <strong>in</strong>formative, a can be normalized to 1.This leads to <strong>the</strong> one parameter IRT model or Rasch scale,lnp iq1 − p iq= θ i − b q . (4)Thus regardless <strong>of</strong> <strong>the</strong> diffi culty <strong>of</strong> a question, a one unit <strong>in</strong>crease <strong>in</strong> θ <strong>in</strong>creases <strong>the</strong> log odds<strong>of</strong> gett<strong>in</strong>g <strong>the</strong> answer correct by 1. As a result, it is more common among <strong>in</strong>dividuals <strong>in</strong> <strong>the</strong>field to believe that Rasch scales can be <strong>in</strong>terpreted as <strong>in</strong>terval scales. 5Like many economists, noneconomists frequently ignore <strong>the</strong> complexities associated withwork<strong>in</strong>g with <strong>in</strong>terval scales. However, as early as 1983, Spencer po<strong>in</strong>ted out that <strong>the</strong> performance<strong>of</strong> two groups can only by strictly ranked if <strong>the</strong> cumulative distribution functions9


(cdf) <strong>of</strong> <strong>the</strong>ir scores do not cross (one is higher than <strong>the</strong> o<strong>the</strong>r <strong>in</strong> <strong>the</strong> sense <strong>of</strong> first-orderstochastic dom<strong>in</strong>ance) and suggested comparisons based on <strong>the</strong>se cdfs.One common solution uses measures based on percentile ranks ra<strong>the</strong>r than test scoress<strong>in</strong>ce percentiles are <strong>in</strong>variant to scale. However, percentiles are also a monotonic transformation<strong>of</strong> <strong>the</strong> scale, one <strong>in</strong> which <strong>the</strong> value placed between ranks is constant across <strong>the</strong>distribution. <strong>The</strong> most prom<strong>in</strong>ent percentile-based measure is <strong>the</strong> percentile-percentile (PP)curve. 6This method plots <strong>the</strong> percentile associated with a given score for one group (typically<strong>the</strong> lower performer) aga<strong>in</strong>st <strong>the</strong> percentile associated with that score for <strong>the</strong> o<strong>the</strong>r.If <strong>the</strong> PP curve does not cross <strong>the</strong> 45 degree l<strong>in</strong>e, <strong>the</strong> scores <strong>of</strong> one group are lower than<strong>the</strong> o<strong>the</strong>r <strong>in</strong> <strong>the</strong> sense <strong>of</strong> stochastic dom<strong>in</strong>ance. If one PP curve lies above ano<strong>the</strong>r, <strong>the</strong> gapappears to be smaller for <strong>the</strong> comparison captured by <strong>the</strong> higher curve. Likewise, if <strong>the</strong> PPcurves are identical, it would appear that <strong>the</strong> gaps are also identical. 7Unfortunately, as <strong>the</strong> follow<strong>in</strong>g example shows, <strong>the</strong> conclusion from analysis <strong>of</strong> shifts (orlack <strong>the</strong>re<strong>of</strong>) <strong>of</strong> <strong>the</strong> PP curve can be mislead<strong>in</strong>g. Consider a vertically l<strong>in</strong>ked test adm<strong>in</strong>isteredtwice to three black and three white children. <strong>The</strong> test has 5 possible scores which,to emphasize <strong>the</strong> absence <strong>of</strong> an <strong>in</strong>terval scale, we denote a, b, c, d and e. When <strong>the</strong> test isfirst adm<strong>in</strong>istered, <strong>the</strong> three white children <strong>in</strong>itially score a, c, and e while two black childrenscore a and one black child scores b. One year later <strong>the</strong> white children score b, d, and e, whiletwo black children score b and one black child scores c. It is easy to verify that <strong>the</strong> PP curveis unchanged. In our example, one white child improved from a to b and one from c to d.In contrast, two black children improved from a to b and one from b to c. <strong>The</strong> test gap isunchanged only if <strong>the</strong> value <strong>of</strong> improvement from c to d for one child equals <strong>the</strong> value <strong>of</strong>improvement from a to b for one child plus <strong>the</strong> value <strong>of</strong> improvement from b to c for a second10


child. This is nei<strong>the</strong>r obviously false nor obviously true and cannot be resolved without somereference to <strong>the</strong> “correct”underly<strong>in</strong>g scale.It is clear from this and <strong>the</strong> examples from <strong>the</strong> previous section that scal<strong>in</strong>g issues canbe <strong>of</strong> great importance <strong>in</strong> <strong>the</strong>ory. We now proceed to explore to what extent <strong>the</strong>y are <strong>in</strong>practice.4 DataWe use two data sets: <strong>the</strong> Children <strong>of</strong> <strong>the</strong> National Longitud<strong>in</strong>al Survey <strong>of</strong> Youth (CNLSY)and <strong>the</strong> Early Childhood Longitud<strong>in</strong>al Study K<strong>in</strong>dergarten Class <strong>of</strong> 1998-1999 (ECLS-K).<strong>The</strong> pr<strong>in</strong>cipal advantage <strong>of</strong> <strong>the</strong> CNLSY is that it features two separate tests.<strong>The</strong> first,<strong>the</strong> Peabody Picture Vocabulary <strong>Test</strong>, was adm<strong>in</strong>istered before school entry and is similarto those on which <strong>the</strong>re are early test score gaps.<strong>The</strong> second, <strong>the</strong> Peabody IndividualAchievement <strong>Test</strong>, served as part <strong>of</strong> <strong>the</strong> basis for <strong>the</strong> test adm<strong>in</strong>istered <strong>in</strong> <strong>the</strong> ECLS-K,<strong>the</strong> test on which <strong>the</strong> early gap is much more modest. This helps us exam<strong>in</strong>e directly <strong>the</strong>importance <strong>of</strong> test differences for <strong>the</strong> conflict<strong>in</strong>g f<strong>in</strong>d<strong>in</strong>gs <strong>in</strong> <strong>the</strong> literature.On <strong>the</strong> o<strong>the</strong>rhand, unlike <strong>the</strong> CNLSY, <strong>the</strong> ECLS-K sample is nationally representative, and all studentstake each <strong>of</strong> <strong>the</strong> tests adm<strong>in</strong>istered and <strong>in</strong> <strong>the</strong> same grade. Moreover, it is <strong>the</strong> data set usedby Fryer and Levitt.4.1 Children <strong>of</strong> <strong>the</strong> National Longitud<strong>in</strong>al Survey <strong>of</strong> Youth<strong>The</strong> CNLSY is a biennial survey <strong>of</strong> children <strong>of</strong> women <strong>in</strong> <strong>the</strong> National Longitud<strong>in</strong>al Survey<strong>of</strong> Youth 1979 cohort (NLSY79). <strong>The</strong> NLSY79 is a longitud<strong>in</strong>al survey that has followed11


a sample <strong>of</strong> 12,686 youths who were between <strong>the</strong> ages <strong>of</strong> 14 and 21 as <strong>of</strong> December 1978.<strong>The</strong> survey <strong>in</strong>cludes a nationally representative sample, as well as an oversample <strong>of</strong> blacks,Hispanics, military personnel, and poor whites, <strong>the</strong> latter two be<strong>in</strong>g dropped from <strong>the</strong> latersurveys.Beg<strong>in</strong>n<strong>in</strong>g <strong>in</strong> 1986, <strong>the</strong> children <strong>of</strong> women surveyed <strong>in</strong> <strong>the</strong> NLSY79 were surveyed andassessed biennially. <strong>The</strong> assessments <strong>in</strong>cluded a battery <strong>of</strong> tests <strong>of</strong> psychological, socioemotional,and cognitive ability, <strong>in</strong> addition to questions on <strong>the</strong> environment <strong>in</strong> which <strong>the</strong> childwas raised. Children exit <strong>the</strong> sample at age 15, and enter a separate sample <strong>of</strong> young adults.As <strong>of</strong> 2008, a total <strong>of</strong> 11,495 children born to 4,929 unique female respondents had beensurveyed.Due to <strong>the</strong> way <strong>the</strong> sample was created, it is not nationally representative. Children bornbefore 1982, when mo<strong>the</strong>rs <strong>in</strong> <strong>the</strong> NLSY79 were seventeen to twenty-five years old, will onlybe partially <strong>in</strong>cluded <strong>in</strong> <strong>the</strong> sample as <strong>the</strong>y were over four years old <strong>in</strong> <strong>the</strong> first survey year.Children born before 1972 will not be <strong>in</strong>cluded at all although <strong>the</strong>re should be very few suchchildren. Children who were adopted <strong>in</strong>to and out <strong>of</strong> <strong>the</strong> families <strong>of</strong> <strong>the</strong> mo<strong>the</strong>rs are notsampled. Children born after 1994 will only be partially observed <strong>in</strong> <strong>the</strong> sample and willthus be underrepresented.Our sample consists <strong>of</strong> children from age three or four through third grade or roughly agen<strong>in</strong>e and so underrepresents children <strong>of</strong> older mo<strong>the</strong>rs s<strong>in</strong>ce children born after 1998, when<strong>the</strong> mo<strong>the</strong>rs would have been thirty-four through forty-one, will not have reached thirdgrade. It also underrepresents children born before 1982, when <strong>the</strong> mo<strong>the</strong>rs were seventeento twenty-five, s<strong>in</strong>ce such children would be older than four <strong>in</strong> 1986.Our focus is on <strong>the</strong> Peabody Individual Achievement <strong>Test</strong> (PIAT) Read<strong>in</strong>g: Recognition12


and Comprehension subtests and on <strong>the</strong> Peabody Picture Vocabulary <strong>Test</strong> (PPVT). <strong>The</strong>Peabody Picture Vocabulary <strong>Test</strong> is a test <strong>of</strong> receptive vocabulary that is, accord<strong>in</strong>g to<strong>the</strong> CNLSY User’s Guide, designed to provide a quick estimate <strong>of</strong> scholastic aptitude. <strong>The</strong>User’s Guide reports that <strong>the</strong> PPVT was adm<strong>in</strong>istered when children were four or five andaga<strong>in</strong> when <strong>the</strong>y were ten or eleven. It appears to us that, <strong>in</strong> fact, <strong>the</strong> earlier adm<strong>in</strong>istrationoccurred between <strong>the</strong> ages <strong>of</strong> 36 and 60 months. In order to avoid measur<strong>in</strong>g differences <strong>in</strong>human capital that could be caused by differences <strong>in</strong> k<strong>in</strong>dergarten quality, when we exam<strong>in</strong>eyoung children we limit <strong>the</strong> analysis to children who took <strong>the</strong> exam at less than four years<strong>of</strong> age. Fur<strong>the</strong>r limit<strong>in</strong>g our sample to only black and white youths, we have a total <strong>of</strong> 1,655scores (1,072 white and 583 black) for tests taken between <strong>the</strong> ages <strong>of</strong> 3 and 4. 8<strong>The</strong>re areno repeat exam takers.<strong>The</strong> data show a racial test gap with<strong>in</strong> this sample. On average, black children perform.97 standard deviations worse on <strong>the</strong> PPVT than do white children, based on <strong>the</strong> <strong>of</strong>fi cialscale. <strong>The</strong> highest score is 77 and was atta<strong>in</strong>ed by one child, while two children scored 0,<strong>the</strong> lower bound.Although not tied to a particular curriculum, <strong>the</strong> PIAT is designed to measure <strong>the</strong> types<strong>of</strong> skills typically taught <strong>in</strong> school. It covers a suffi ciently wide range <strong>of</strong> material that <strong>the</strong>scores are not subject to boundary effects at <strong>the</strong> top although this is somewhat <strong>of</strong> a concernat <strong>the</strong> bottom. <strong>The</strong> PIAT was adm<strong>in</strong>istered at each survey to all children age 5-14. Because<strong>the</strong> survey is conducted <strong>in</strong> alternate years, we typically observe a child <strong>in</strong> k<strong>in</strong>dergarten andsecond grade or <strong>in</strong> first and third grade but not both.Table 1 shows descriptive statistics for our sample <strong>of</strong> PIAT test scores. Although childrentypically take <strong>the</strong> test <strong>in</strong> only two grades, sample size is fairly consistent across <strong>the</strong> grades13


that we analyze, both <strong>in</strong> terms <strong>of</strong> total observations and <strong>the</strong> proportion <strong>of</strong> test-takers whoare black. S<strong>in</strong>ce <strong>the</strong> test<strong>in</strong>g material rema<strong>in</strong>s <strong>the</strong> same, scores rise steadily over time, from anaverage score <strong>of</strong> 17 <strong>in</strong> k<strong>in</strong>dergarten to an average score <strong>of</strong> 39 by third grade. In each grade,<strong>the</strong> average score is higher among whites than among blacks, and this difference rises aschildren progress through school. <strong>The</strong> standard deviation <strong>of</strong> <strong>the</strong> test scores also rises. While<strong>the</strong>re is at least one child who scores a 0 <strong>in</strong> each grade, <strong>in</strong> <strong>the</strong> later grades <strong>the</strong>se childrenare severe outliers. In <strong>the</strong> third grade, for <strong>in</strong>stance, <strong>the</strong> three lowest scores are 0, 2, and 3,while <strong>the</strong> fourth lowest score is 15. <strong>The</strong> highest score rises as children reach higher grades;<strong>the</strong> highest score <strong>in</strong> <strong>the</strong> sample is 81, achieved by a white child <strong>in</strong> third grade.<strong>The</strong> gap between blacks’and whites’PIAT scores is quite modest <strong>in</strong> k<strong>in</strong>dergarten, butexpands over <strong>the</strong> first four years <strong>of</strong> school. <strong>Black</strong>s <strong>in</strong>itially perform .25 standard deviationsworse than whites do on <strong>the</strong> PIAT but by third grade have fallen .61 standard deviationsbeh<strong>in</strong>d. <strong>The</strong>se results are <strong>in</strong> l<strong>in</strong>e with those <strong>in</strong> Fryer and Levitt (2006) and our own from <strong>the</strong>ECLS-K which we describe <strong>in</strong> <strong>the</strong> next subsection. On <strong>the</strong> o<strong>the</strong>r hand, <strong>the</strong> pre-k<strong>in</strong>dergartengap on <strong>the</strong> PPVT is almost a full standard deviation, <strong>in</strong> l<strong>in</strong>e with <strong>the</strong> results <strong>in</strong> Jencks andPhillips (1998). <strong>The</strong>se two f<strong>in</strong>d<strong>in</strong>gs are suggestive <strong>of</strong> <strong>the</strong> result <strong>in</strong> Murnane et al (2006) that<strong>the</strong> difference between <strong>the</strong> results <strong>in</strong> Fryer and Levitt and <strong>the</strong> prior literature reflects <strong>the</strong>differences <strong>in</strong> <strong>the</strong> tests.4.2 Early Childhood Longitud<strong>in</strong>al Study<strong>The</strong> ECLS-K is a nationally representative longitud<strong>in</strong>al survey that follows children whoentered k<strong>in</strong>dergarten <strong>in</strong> <strong>the</strong> 1998-1999 school year. Information was collected <strong>in</strong> <strong>the</strong> fall and14


spr<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten, and <strong>the</strong> spr<strong>in</strong>gs <strong>of</strong> first, third, fifth, and eighth grades. 9<strong>The</strong> children were surveyed and assessed on a variety <strong>of</strong> different dimensions, such asschool experience, motor skill development, height, weight, and direct cognitive assessments<strong>of</strong> read<strong>in</strong>g and ma<strong>the</strong>matical skill. In each survey year, <strong>the</strong> student’s parents and teacher were<strong>in</strong>terviewed about <strong>the</strong> child’s background, home, and school environment. <strong>The</strong> tests weredesigned to measure <strong>the</strong> student’s ability <strong>in</strong> read<strong>in</strong>g, ma<strong>the</strong>matics and general knowledgeor science. 10<strong>The</strong> material covered on <strong>the</strong> test rema<strong>in</strong>ed <strong>the</strong> same through first grade, butwas modified <strong>in</strong> later years to reflect <strong>the</strong> grow<strong>in</strong>g knowledge that should be ga<strong>in</strong>ed <strong>in</strong> school.Children were first given a short “rout<strong>in</strong>g test”that directed <strong>the</strong>m to a more comprehensiveexam, <strong>the</strong> diffi culty <strong>of</strong> which depended on <strong>the</strong>ir answers to <strong>the</strong> rout<strong>in</strong>g test. Accord<strong>in</strong>g to <strong>the</strong>User’s Guide, overall scores are calculated us<strong>in</strong>g IRT and represent <strong>the</strong> estimated number <strong>of</strong>questions <strong>the</strong> test taker would have answered correctly had she taken <strong>the</strong> entire test, ra<strong>the</strong>rthan just <strong>the</strong> section to which she was routed. In pr<strong>in</strong>ciple, a 112 on <strong>the</strong> k<strong>in</strong>dergarten entrytest represents <strong>the</strong> same level <strong>of</strong> accomplishment as a 112 on <strong>the</strong> third grade test. For ouranalysis, we will focus only on <strong>the</strong> evolution <strong>of</strong> <strong>the</strong> test score gap through third grade but<strong>in</strong> some cases also draw on <strong>the</strong> fifth grade data to scale <strong>the</strong> earlier scores. <strong>The</strong>refore, we use<strong>the</strong> scores that were released with <strong>the</strong> 5th grade data file.We mimic Fryer and Levitt’s sample construction methods to make our results comparableto <strong>the</strong>irs. We focus on <strong>the</strong> read<strong>in</strong>g scores because <strong>the</strong>y show <strong>the</strong> most strik<strong>in</strong>g growth<strong>in</strong> <strong>the</strong> early years <strong>in</strong> <strong>the</strong> Fryer/Levitt study. We drop all students who are miss<strong>in</strong>g a validread<strong>in</strong>g score from k<strong>in</strong>dergarten through third grade, and drop all students who do not havea valid entry for race. We also use <strong>the</strong> sampl<strong>in</strong>g weights associated with grades k<strong>in</strong>dergartenthrough three for child assessment studies, and drop all children who do not have a valid set15


<strong>of</strong> <strong>the</strong>se weights. For much <strong>of</strong> <strong>the</strong> analysis we use only <strong>the</strong> test score and race data, but <strong>in</strong>one table we control for sociodemographic characteristics.Table 2 shows descriptive statistics for our ECLS-K sample. We have 11,414 observations<strong>of</strong> whom 62 percent are white and 17 percent are black. <strong>The</strong> basel<strong>in</strong>e scale shows a modest(.4 standard deviations) test-score gap at <strong>the</strong> beg<strong>in</strong>n<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten, ris<strong>in</strong>g steadily to agap <strong>of</strong> three-quarters <strong>of</strong> a standard deviation towards <strong>the</strong> end <strong>of</strong> third grade. <strong>The</strong> secondcolumn <strong>of</strong> Table 2 shows <strong>the</strong> correspond<strong>in</strong>g figures from Fryer and Levitt. Although oursample is somewhat larger with a higher proportion <strong>of</strong> whites and blacks than <strong>the</strong>irs, <strong>the</strong>test-score gap evolves <strong>in</strong> very similar ways <strong>in</strong> <strong>the</strong> two samples.It is important to recognize that <strong>the</strong>re is only a modest amount <strong>of</strong> overlap <strong>in</strong> <strong>the</strong> entry andthird grade scores <strong>of</strong> <strong>the</strong> ECLS-K. About 95 percent <strong>of</strong> students received scores on <strong>the</strong> entrytest that were below <strong>the</strong> lowest score on <strong>the</strong> third grade test. Still <strong>the</strong> rema<strong>in</strong><strong>in</strong>g 5 percentscored better than at least some third graders, and two students enter<strong>in</strong>g k<strong>in</strong>dergarten scoredabove <strong>the</strong> third grade mean us<strong>in</strong>g <strong>the</strong> orig<strong>in</strong>al test score scale.5 MethodsWe def<strong>in</strong>e <strong>the</strong> test score gap at a given grade or age as <strong>the</strong> difference between <strong>the</strong> mean testscores <strong>of</strong> whites and blacks divided by <strong>the</strong> standard deviation <strong>of</strong> test scores <strong>in</strong> that grade orat that age.We beg<strong>in</strong> by search<strong>in</strong>g for <strong>the</strong> monotonic transformations <strong>of</strong> <strong>the</strong> orig<strong>in</strong>al scale that max-16


imize and m<strong>in</strong>imize <strong>the</strong> growth <strong>of</strong> this gap. We impose <strong>the</strong> transformationT (t + 1) = T (t) + a 2 t+1 (5)where t is <strong>the</strong> orig<strong>in</strong>al scale, T is <strong>the</strong> transformed scale and a t+1 is a real number. S<strong>in</strong>ce <strong>the</strong>gap is unchanged by a l<strong>in</strong>ear transformation, we must normalize two <strong>of</strong> <strong>the</strong> parameters. Weset T (0) equal to 0 and T (t max ) equal to t max where t max is <strong>the</strong> highest score observed <strong>in</strong>that grade. 11 Def<strong>in</strong>e G g to be <strong>the</strong> test gap <strong>in</strong> grade g.G g =N −1w∑i∈whiteT (t ig ) − N −1b∑i∈blackT (t ig )√N −1∑ (T (t ig ) − N −1∑ ) (6)2T (t ig )where G g is <strong>the</strong> gap <strong>in</strong> grade g, N w , N b and N are <strong>the</strong> sizes <strong>of</strong> <strong>the</strong> white, black and total sample.We choose <strong>the</strong> rema<strong>in</strong><strong>in</strong>g values <strong>of</strong> a us<strong>in</strong>g Newton-Raphson to m<strong>in</strong>imize <strong>the</strong> objectivefunction given byD m<strong>in</strong> = m<strong>in</strong>a(G 3 − G e ) (7)where D is <strong>the</strong> difference between <strong>the</strong> test gap <strong>in</strong> grade 3 and <strong>the</strong> test gap <strong>in</strong> k<strong>in</strong>dergarten,and a refers to <strong>the</strong> vector <strong>of</strong> coeffi cients. We def<strong>in</strong>e D max similarly for <strong>the</strong> maximum. Inpractice, not all <strong>of</strong> <strong>the</strong> possible scores are observed each year <strong>in</strong> <strong>the</strong> data. We normalize a t+1to 0 if no member <strong>of</strong> <strong>the</strong> sample <strong>in</strong> that grade is observed to have an <strong>in</strong>itial test score <strong>of</strong>t + 1.This nonparametric approach is useful for f<strong>in</strong>d<strong>in</strong>g <strong>the</strong> bounds on <strong>the</strong> gap, but it producesscales that are typically step functions with one or two steps and likely implausible.17


Additionally it cannot be used when <strong>the</strong> test score is a cont<strong>in</strong>uous variable, as <strong>the</strong> ECLS-K assessmentapproximately is. <strong>The</strong>refore, we focus on transformations that are both monotonicand smooth. We look at <strong>the</strong> path <strong>of</strong> <strong>the</strong> gap that can be formed by vary<strong>in</strong>g parameters <strong>in</strong> asixth degree polynomialT (t) = β 0 + β 1 (t − c) + β 2 (t − c) 2 + β 3 (t − c) 3 + β 4 (t − c) 4 + β 5 (t − c) 5 + β 6 (t − c) 6 (8)where β 0 − β 6 and c are constants. This type <strong>of</strong> function is very flexible and can be usedto approximate a wide array <strong>of</strong> cont<strong>in</strong>uous functions. This transformation, however, is notguaranteed to be monotonic. Our algorithm checks for monotonicity and rejects attempts tochoose parameters that violate this condition. 12 Needless to say, not all monotonic functionswill be well approximated by even a monotonic six-degree polynomial. We <strong>the</strong>refore cannotrule out <strong>the</strong> possibility that some o<strong>the</strong>r transformation could generate results outside <strong>the</strong>range we present here.Aga<strong>in</strong>, G is unchanged by l<strong>in</strong>ear transformations. When show<strong>in</strong>g <strong>the</strong> density <strong>of</strong> <strong>the</strong> testscores, we normalize <strong>the</strong>ir standard deviation to equal 1 and choose β 0 so that <strong>the</strong>ir mean is0. Note that <strong>the</strong> test score distribution is not required to be symmetric so that <strong>the</strong> medianneed not be 0. However, it is easiest to show <strong>the</strong> transformations on a scale similar to <strong>the</strong>one used for <strong>the</strong> orig<strong>in</strong>al test scores. <strong>The</strong>refore when show<strong>in</strong>g <strong>the</strong> relation between <strong>the</strong> twoscales, we fix <strong>the</strong> highest and lowest scores to be equal across scales. 13If <strong>the</strong> test score distributions on entry and <strong>in</strong> third grade were disjo<strong>in</strong>t, <strong>the</strong>n (subject to am<strong>in</strong>or caveat about <strong>the</strong> ability <strong>of</strong> a six-degree polynomial to simultaneously approximate twodifferent distributions), we would f<strong>in</strong>d D max by m<strong>in</strong>imiz<strong>in</strong>g <strong>the</strong> test-score gap at entry and18


maximiz<strong>in</strong>g it <strong>in</strong> third grade. Conversely, to f<strong>in</strong>d D m<strong>in</strong> we would maximize G e and m<strong>in</strong>imizeG 3.In practice, because <strong>the</strong> two test score distributions overlap, we cannot do <strong>the</strong> maximizationsand m<strong>in</strong>imizations separately. 14Never<strong>the</strong>less, because <strong>the</strong>re is not much overlap, <strong>the</strong>process <strong>of</strong> select<strong>in</strong>g <strong>the</strong> transformations comes close to mimick<strong>in</strong>g this approach.As we will see, <strong>in</strong> both data sets, <strong>the</strong> implications <strong>of</strong> D m<strong>in</strong> and D max are very different.In <strong>the</strong> latter case, <strong>the</strong> black-white gap is trivial when children first enter school but growsto be substantial by <strong>the</strong> end <strong>of</strong> third grade. In contrast, <strong>in</strong> <strong>the</strong> former case, <strong>the</strong> black-whitegap <strong>in</strong> <strong>the</strong> ECLS-K is modest but not trivial when children enter school and changes littleover <strong>the</strong> next four years. In <strong>the</strong> CNLSY <strong>the</strong> gap under D m<strong>in</strong> actually shr<strong>in</strong>ks.<strong>The</strong>se bounds are not very helpful.<strong>The</strong>refore, to help us select among <strong>the</strong> possibletransformations, <strong>in</strong>clud<strong>in</strong>g less extreme ones, we choose <strong>the</strong> transformations that have <strong>the</strong>most predictive power for future test scores. For <strong>the</strong> CNLSY, we maximize <strong>the</strong> correlationbetween <strong>the</strong> PPVT at age 3 and <strong>the</strong> PIAT read<strong>in</strong>g test adm<strong>in</strong>istered dur<strong>in</strong>g k<strong>in</strong>dergarten.For <strong>the</strong> ECLS-K, we maximize <strong>the</strong> correlation between <strong>the</strong> entry and third grade tests.6 Results6.1 Maximiz<strong>in</strong>g and M<strong>in</strong>imiz<strong>in</strong>g <strong>the</strong> Growth <strong>of</strong> <strong>the</strong> <strong>Gap</strong><strong>The</strong> first column <strong>of</strong> Table 3 shows <strong>the</strong> evolution <strong>of</strong> <strong>the</strong> black-white test gap <strong>in</strong> <strong>the</strong> PIAT,us<strong>in</strong>g <strong>the</strong> orig<strong>in</strong>al scale provided with <strong>the</strong> exam. <strong>The</strong> gap shows a large <strong>in</strong>crease over <strong>the</strong>first four years <strong>of</strong> education, beg<strong>in</strong>n<strong>in</strong>g at a modest .25 standard deviations <strong>in</strong> k<strong>in</strong>dergarten19


and ris<strong>in</strong>g to .61 standard deviations by third grade.We can f<strong>in</strong>d <strong>the</strong> boundaries <strong>of</strong> <strong>the</strong> evolution <strong>of</strong> this gap by assign<strong>in</strong>g a new set <strong>of</strong> monotonically<strong>in</strong>creas<strong>in</strong>g test scores chosen to ei<strong>the</strong>r maximize or m<strong>in</strong>imize <strong>the</strong> difference between <strong>the</strong>third grade and k<strong>in</strong>dergarten test gap. Under <strong>the</strong> growth-m<strong>in</strong>imiz<strong>in</strong>g scale, <strong>the</strong> black-whitetest gap shr<strong>in</strong>ks by .18 standard deviations dur<strong>in</strong>g <strong>the</strong> first four years <strong>of</strong> education. Column(2) <strong>of</strong> Table 3 shows <strong>the</strong> evolution under this m<strong>in</strong>imiz<strong>in</strong>g transformation. <strong>The</strong> test gap <strong>in</strong>k<strong>in</strong>dergarten is similar to that <strong>of</strong> <strong>the</strong> basel<strong>in</strong>e at .24 standard deviations. In contrast with<strong>the</strong> basel<strong>in</strong>e, <strong>the</strong> gap immediately beg<strong>in</strong>s to decl<strong>in</strong>e to .18 standard deviations <strong>in</strong> first gradeand .08 standard deviations <strong>in</strong> second grade. <strong>The</strong> gap rema<strong>in</strong>s roughly constant <strong>in</strong> thirdgrade, end<strong>in</strong>g at .06 standard deviations.<strong>The</strong> evolution under <strong>the</strong> growth-maximiz<strong>in</strong>g scale is shown <strong>in</strong> <strong>the</strong> third column <strong>of</strong> Table3. This transformation reduces <strong>the</strong> gap at k<strong>in</strong>dergarten substantially to just .05 standarddeviations. After k<strong>in</strong>dergarten <strong>the</strong> evolution is similar to that <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e model, withblacks perform<strong>in</strong>g .63 standard deviations worse than whites <strong>in</strong> third grade, only slightlyworse than <strong>the</strong>y perform us<strong>in</strong>g <strong>the</strong> basel<strong>in</strong>e scale. With this transformation, <strong>the</strong> gap growsby .58 standard deviations over <strong>the</strong> first four years <strong>of</strong> school.<strong>The</strong> two extreme transformations produce test scales that differ noticeably from <strong>the</strong>basel<strong>in</strong>e scale. <strong>The</strong> transformed scales are essentially step functions, with scores that arealmost constant with<strong>in</strong> tiers separated by large jumps. Though this may not be <strong>in</strong>tuitivelyappeal<strong>in</strong>g, it is not unlike tests which have “pr<strong>of</strong>iciency”cut<strong>of</strong>fs. Suppose for <strong>in</strong>stance thatk<strong>in</strong>dergartners only differed <strong>in</strong> <strong>the</strong>ir possession a few mean<strong>in</strong>gful skills such as <strong>the</strong> ability torecognize letters, <strong>the</strong> ability to recognize words, and <strong>the</strong> ability to read for comprehension.<strong>The</strong>n this could be an appropriate scale to use at that grade. In fact, <strong>the</strong> PIAT read<strong>in</strong>g20


test is designed somewhat like this. Students must pass a read<strong>in</strong>g recognition test <strong>in</strong> orderto advance to questions on a read<strong>in</strong>g comprehension test.<strong>The</strong> modal score <strong>in</strong> both ourk<strong>in</strong>dergarten and first grade sample is 18, which is <strong>the</strong> highest score a student could achievewithout advanc<strong>in</strong>g to <strong>the</strong> read<strong>in</strong>g comprehension section.Turn<strong>in</strong>g our attention to <strong>the</strong> ECLS-K, because <strong>the</strong> IRT scor<strong>in</strong>g method produces anessentially cont<strong>in</strong>uous variable, we use a sixth-degree monotonic polynomial transformationon <strong>the</strong> entire scale.This means that we apply <strong>the</strong> same transformation to each grade.Table 4 shows how <strong>the</strong> achievement gap on <strong>the</strong> ECLS-K read<strong>in</strong>g assessment evolves from<strong>the</strong> beg<strong>in</strong>n<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten through <strong>the</strong> spr<strong>in</strong>g <strong>of</strong> third grade. <strong>The</strong> first column repeats<strong>the</strong> basel<strong>in</strong>e pattern from Table 2. <strong>The</strong> second column shows <strong>the</strong> choice <strong>of</strong> transformationthat m<strong>in</strong>imizes <strong>the</strong> estimated growth <strong>in</strong> <strong>the</strong> gap. At k<strong>in</strong>dergarten entry <strong>the</strong> gap is .46, onlyslightly higher than <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e. As discussed above, <strong>the</strong> scale that m<strong>in</strong>imizes <strong>the</strong> growth<strong>in</strong> <strong>the</strong> test score gap should come close to maximiz<strong>in</strong>g <strong>the</strong> entry gap. Thus it appears that<strong>the</strong> scale used <strong>in</strong> <strong>the</strong> ECLS-K comes close to maximiz<strong>in</strong>g that gap. <strong>The</strong> m<strong>in</strong>imum possiblegrowth <strong>in</strong> <strong>the</strong> gap is quite small. Us<strong>in</strong>g this scale, <strong>in</strong> third grade <strong>the</strong> gap is only .51 and thusnoticeably less than <strong>the</strong> gap <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e. And <strong>the</strong> growth between entry and third gradeis only .05. Note that, <strong>in</strong> pr<strong>in</strong>ciple, m<strong>in</strong>imiz<strong>in</strong>g growth between entry and third grade couldstill generate large sw<strong>in</strong>gs <strong>in</strong> <strong>the</strong> first grade gap. However, this does not occur. <strong>The</strong>re is nonoticeable change <strong>in</strong> <strong>the</strong> gap between any pair <strong>of</strong> tests when this scale is applied.Column (3) <strong>of</strong> Table 4 shows <strong>the</strong> results <strong>of</strong> choos<strong>in</strong>g <strong>the</strong> transformation that maximizes<strong>the</strong> growth <strong>of</strong> <strong>the</strong> gap between k<strong>in</strong>dergarten and third grade. <strong>The</strong> transformed gap at <strong>the</strong>beg<strong>in</strong>n<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten is now only .11 standard deviations, which is .29 less than <strong>in</strong> <strong>the</strong>basel<strong>in</strong>e. <strong>The</strong> transformed gap <strong>in</strong>creases by .10 standard deviations to .21 between <strong>the</strong> fall21


and spr<strong>in</strong>g k<strong>in</strong>dergarten tests and <strong>the</strong>n rises a fur<strong>the</strong>r .22 standard deviations by <strong>the</strong> spr<strong>in</strong>g<strong>of</strong> first grade so that <strong>the</strong> estimated gaps are similar to <strong>the</strong> basel<strong>in</strong>e for <strong>the</strong> first and thirdgrades. <strong>The</strong> end result is a growth <strong>of</strong> .64 standard deviations <strong>in</strong> <strong>the</strong> racial test gap <strong>in</strong> <strong>the</strong>first four years <strong>of</strong> education, almost twice that us<strong>in</strong>g <strong>the</strong> basel<strong>in</strong>e scale. Note that <strong>the</strong> gap at<strong>the</strong> end <strong>of</strong> third grade is almost unchanged from <strong>the</strong> basel<strong>in</strong>e, suggest<strong>in</strong>g that <strong>the</strong> basel<strong>in</strong>escale comes close to maximiz<strong>in</strong>g <strong>the</strong> black-white gap at this stage.Figure 1 shows <strong>the</strong> density function <strong>of</strong> test scores associated with each choice <strong>of</strong> scale for<strong>the</strong> ECLS-K at k<strong>in</strong>dergarten entry. Note that <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e case, it is skewed with a longright tail.In contrast, visually, <strong>the</strong> result<strong>in</strong>g test score distribution from <strong>the</strong> m<strong>in</strong>imiz<strong>in</strong>gtransformation more closely approximates a normal distribution. <strong>The</strong> density associated with<strong>the</strong> maximiz<strong>in</strong>g transformation is somewhat aes<strong>the</strong>tically displeas<strong>in</strong>g and possibly unattractiveon o<strong>the</strong>r grounds. Most <strong>of</strong> <strong>the</strong> weight <strong>of</strong> this distribution is <strong>in</strong> a narrow band aroundits mode, and <strong>the</strong>re are no scores substantially below this mode. Never<strong>the</strong>less, we do notf<strong>in</strong>d this representation <strong>of</strong> <strong>the</strong> scores altoge<strong>the</strong>r counter<strong>in</strong>tuitive. It is plausible that mostchildren do not have much <strong>in</strong> <strong>the</strong> way <strong>of</strong> read<strong>in</strong>g, math and general knowledge skills andthat <strong>the</strong> modest differences over much <strong>of</strong> <strong>the</strong> range are un<strong>in</strong>formative. On <strong>the</strong> o<strong>the</strong>r hand,<strong>the</strong>re are a small number, best represented by <strong>the</strong> two who are already operat<strong>in</strong>g solidly at<strong>the</strong> third grade level, who are truly dist<strong>in</strong>ct from <strong>the</strong> rest <strong>of</strong> <strong>the</strong> pack. Moreover, <strong>in</strong> somerespects <strong>the</strong> density <strong>of</strong> <strong>the</strong> growth-maximiz<strong>in</strong>g transformation is more aes<strong>the</strong>tically pleas<strong>in</strong>gthan <strong>the</strong> <strong>in</strong>come or wealth distribution <strong>in</strong> <strong>the</strong> United States. It is less skewed than ei<strong>the</strong>r.<strong>The</strong> 50-10 spread (measured <strong>in</strong> standard deviations) is plausibly larger than it is <strong>in</strong> <strong>the</strong>wealth distribution. 15How do <strong>the</strong>se transformations affect <strong>the</strong> test score distributions <strong>in</strong> third grade?As22


previously noted, <strong>the</strong> transformation that m<strong>in</strong>imizes <strong>the</strong> growth <strong>in</strong> <strong>the</strong> gap will be closeto <strong>the</strong> one that m<strong>in</strong>imizes <strong>the</strong> third grade gap while <strong>the</strong> choice <strong>of</strong> T (t) that maximizes<strong>the</strong> growth <strong>of</strong> <strong>the</strong> gap produces a third-grade gap very close to <strong>the</strong> one <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e.Figure 2 shows <strong>the</strong> density <strong>of</strong> <strong>the</strong> test score distribution for <strong>the</strong> basel<strong>in</strong>e scale and <strong>the</strong> twotransformations. As <strong>in</strong> <strong>the</strong> case <strong>of</strong> <strong>the</strong> k<strong>in</strong>dergarten scores, <strong>the</strong> key to m<strong>in</strong>imiz<strong>in</strong>g <strong>the</strong> thirdgrade gap, and thus growth, is compress<strong>in</strong>g <strong>the</strong> middle <strong>of</strong> <strong>the</strong> distribution so that moststudents appear quite similar and spread<strong>in</strong>g out <strong>the</strong> differences among very high and amongvery low scores. In contrast, <strong>the</strong> growth-maximiz<strong>in</strong>g transformation leaves <strong>the</strong> distribution<strong>of</strong> test scores look<strong>in</strong>g similar to that associated with <strong>the</strong> basel<strong>in</strong>e.As already discussed, we should not necessarily dismiss distributions that primarily dist<strong>in</strong>guish<strong>the</strong> very high and very low performers from everyone else. While <strong>the</strong> large spike at<strong>the</strong> mode when us<strong>in</strong>g <strong>the</strong> growth-m<strong>in</strong>imiz<strong>in</strong>g transformation <strong>in</strong>itially appears problematic,<strong>the</strong> implied distribution is not obviously more implausible than <strong>the</strong> U.S. earn<strong>in</strong>gs, <strong>in</strong>come andwealth distributions. However, it is perhaps more problematic that <strong>the</strong> growth-m<strong>in</strong>imiz<strong>in</strong>gtransformation requires this large spike to appear between school entry and third grade.<strong>The</strong> relation between <strong>the</strong> orig<strong>in</strong>al and transformed scales is shown <strong>in</strong> figure 3. We can seethat <strong>the</strong> growth <strong>in</strong> <strong>the</strong> test score gap is m<strong>in</strong>imized if we believe that differences <strong>in</strong> very lowscores (roughly 15 to 40) and very high scores (roughly those over 140) are very <strong>in</strong>formativebut those <strong>in</strong> between are relatively un<strong>in</strong>formative. <strong>The</strong> transformation that maximizes <strong>the</strong>growth <strong>of</strong> <strong>the</strong> test score gap does <strong>the</strong> opposite, at least at <strong>the</strong> bottom <strong>of</strong> <strong>the</strong> scale. It treatsmost differences among <strong>the</strong> very low scores as un<strong>in</strong>formative. This would be appropriate if webelieved that most children arrive <strong>in</strong> k<strong>in</strong>dergarten know<strong>in</strong>g very little <strong>of</strong> <strong>the</strong> material coveredby <strong>the</strong> ECLS and that throughout most <strong>of</strong> <strong>the</strong> distribution, differences <strong>in</strong> performance should23


e viewed as relatively unimportant and that only children with very high scores should beviewed as differ<strong>in</strong>g substantially from <strong>the</strong> mass <strong>of</strong> k<strong>in</strong>dergarten entrants.<strong>The</strong> results <strong>in</strong> this subsection br<strong>in</strong>g out <strong>the</strong> fragility <strong>of</strong> any conclusion about <strong>the</strong> extentto which <strong>the</strong> test score gap <strong>in</strong>creases between school entry and <strong>the</strong> end <strong>of</strong> third grade. <strong>The</strong>bounds permit conclusions rang<strong>in</strong>g from “<strong>the</strong>re is essentially no gap when students beg<strong>in</strong>school and a very sizeable gap by <strong>the</strong> end <strong>of</strong> third grade” through “<strong>the</strong>re are modest gapsat entry and at <strong>the</strong> end <strong>of</strong> <strong>the</strong> third grade and essentially no growth <strong>in</strong> <strong>the</strong> gap over thisperiod.”<strong>The</strong>re are even scales for <strong>the</strong> PIAT from which one could conclude “black childrenmoderately lag beh<strong>in</strong>d white children <strong>in</strong> achievement when <strong>the</strong>y enter school, but match<strong>the</strong> achievement <strong>of</strong> <strong>the</strong>ir white peers by third grade.” As is <strong>of</strong>ten <strong>the</strong> case with bound<strong>in</strong>gexercises, <strong>the</strong> range <strong>of</strong> possible results is too large to be helpful.It is thus evident that determ<strong>in</strong><strong>in</strong>g <strong>the</strong> right scale is important <strong>in</strong> determ<strong>in</strong><strong>in</strong>g how <strong>the</strong>gap between blacks and whites evolves. We could attempt to choose scales that produce“aes<strong>the</strong>tically appeal<strong>in</strong>g” distributions <strong>of</strong> test scores, but this is unsatisfactory.<strong>The</strong>re isno consensus on what <strong>the</strong> distribution <strong>of</strong> childhood ability should look like. Well acceptedchildhood tests, <strong>in</strong>clud<strong>in</strong>g <strong>the</strong> PIAT and <strong>the</strong> ECLS-K assessments, produce widely vary<strong>in</strong>gdistributions <strong>of</strong> achievement. And as discussed before, <strong>the</strong>re are reasons to th<strong>in</strong>k thatun<strong>in</strong>tuitive distributions <strong>of</strong> ability could be plausible, both for young children and adults.In <strong>the</strong> next subsection we consider a more formal approach to choos<strong>in</strong>g <strong>the</strong> appropriatetransformation.24


6.2 Select<strong>in</strong>g TransformationsWe would not expect k<strong>in</strong>dergarten or first grade performance to perfectly predict third gradeperformance. <strong>The</strong>re is randomness <strong>in</strong> performance on each test. Moreover, students makevary<strong>in</strong>g academic progress. Indeed <strong>the</strong> po<strong>in</strong>t <strong>of</strong> <strong>the</strong> current exercise is to ask whe<strong>the</strong>r blacksand whites progress academically at different rates dur<strong>in</strong>g <strong>the</strong> first four years <strong>of</strong> school.Never<strong>the</strong>less, tests measure related skills. Students who perform well on one test wouldgenerally be expected to perform well on <strong>the</strong> o<strong>the</strong>r tests. A reasonable criterion for select<strong>in</strong>ga transformation is to ask which transformation allows us to best predict future performanceus<strong>in</strong>g <strong>in</strong>formation from previous tests. 16We <strong>the</strong>refore choose transformations which maximize<strong>the</strong> correlation between test scores. If <strong>the</strong> tests measure a common underly<strong>in</strong>g latentvariable, this approach maximizes reliability. If not, it merely maximizes <strong>the</strong> ability <strong>of</strong> anearlier test to predict performance on a later test.For <strong>the</strong> CNLSY we have access to scores on an early childhood cognitive achievement test,<strong>the</strong> PPVT. We construct a sample <strong>of</strong> children who took both <strong>the</strong> PPVT before age 4 and<strong>the</strong> PIAT while <strong>in</strong> k<strong>in</strong>dergarten. This sample consists <strong>of</strong> 398 white and 253 black children.<strong>The</strong> racial test gaps <strong>in</strong> this subsample are very similar to those <strong>of</strong> <strong>the</strong> full sample. <strong>Black</strong>sperform .97 standard deviations worse on average on <strong>the</strong> PPVT than whites, and .2 standarddeviations worse than whites on <strong>the</strong> PIAT. <strong>The</strong> correlation between <strong>the</strong> untransformed testscores is .32.We use monotonic sixth degree polynomial transformations to f<strong>in</strong>d <strong>the</strong> set <strong>of</strong> scales whichmaximizes <strong>the</strong> correlation between <strong>in</strong>dividuals’PPVT and PIAT test scores. <strong>The</strong> result<strong>in</strong>gscales <strong>in</strong>crease <strong>the</strong> correlation between <strong>the</strong>se two tests only moderately, to .35 and do not25


noticeably alter <strong>the</strong> racial test gaps. <strong>The</strong> test gap on <strong>the</strong> PIAT falls to .24 and that on <strong>the</strong>PPVT <strong>in</strong>creases to .98.In Figure 4, we plot <strong>the</strong> correlation-maximiz<strong>in</strong>g transformations over <strong>the</strong> range <strong>of</strong> oursample, normaliz<strong>in</strong>g each scale to have <strong>the</strong> same range as <strong>the</strong> basel<strong>in</strong>e. <strong>The</strong> transformedPPVT is similar to <strong>the</strong> orig<strong>in</strong>al except that it compresses <strong>the</strong> highest scores. Interest<strong>in</strong>gly, <strong>the</strong>PIAT transformation magnifies differences among <strong>the</strong> highest test scores, while compress<strong>in</strong>g<strong>the</strong> scores somewhat below <strong>the</strong> highest. While this suggests that a high PIAT test score maybe a more important predictor <strong>of</strong> performance than a high PPVT test score, it is importantto remember that <strong>in</strong>ferences on <strong>the</strong> highest range are based on only a few observations.Column (4) <strong>of</strong> Table 3 shows <strong>the</strong> evolution <strong>of</strong> <strong>the</strong> PIAT test gap under this scale. Surpris<strong>in</strong>glygiven <strong>the</strong> modest transformation, <strong>the</strong> pattern differs substantially from <strong>the</strong> basel<strong>in</strong>e.<strong>The</strong> k<strong>in</strong>dergarten gaps are similar, but <strong>the</strong> gap drops to .19 <strong>in</strong> third grade for a decreasefrom k<strong>in</strong>dergarten through third grade <strong>of</strong> approximately .05 standard deviations. One caveatfor this result is that roughly 18 percent <strong>of</strong> <strong>the</strong> third grade sample scored above <strong>the</strong> highestk<strong>in</strong>dergarten score. This is much less <strong>of</strong> a problem for <strong>the</strong> first and second grade tests. Yet,<strong>the</strong> gap us<strong>in</strong>g <strong>the</strong> transformed scale is essentially constant from k<strong>in</strong>dergarten through secondgrade while it grows substantially from year to year us<strong>in</strong>g <strong>the</strong> orig<strong>in</strong>al scale.In <strong>the</strong> ECLS-K we do not have data on test scores outside <strong>of</strong> <strong>the</strong> cognitive assessments.We <strong>the</strong>refore maximize <strong>the</strong> correlation across <strong>the</strong> read<strong>in</strong>g assessments. First we exam<strong>in</strong>e<strong>the</strong> transformation that maximizes <strong>the</strong> correlation between <strong>the</strong> tests taken at <strong>the</strong> beg<strong>in</strong>n<strong>in</strong>g<strong>of</strong> k<strong>in</strong>dergarten and <strong>the</strong> spr<strong>in</strong>g <strong>of</strong> third grade. This new scale substantially <strong>in</strong>creases <strong>the</strong>correlation between <strong>the</strong> two scores. <strong>The</strong> correlation when us<strong>in</strong>g <strong>the</strong> transformation is .62(R 2 = .39) compared with only .54 (R 2 = .29) us<strong>in</strong>g <strong>the</strong> basel<strong>in</strong>e scores. This approach26


produces a k<strong>in</strong>dergarten gap that is very close to <strong>the</strong> potential maximum gap at k<strong>in</strong>dergartenentry and a third grade gap that is very close to <strong>the</strong> maximum gap at this po<strong>in</strong>t.As noted earlier, <strong>the</strong> scale that maximizes <strong>the</strong> third grade gap is similar to <strong>the</strong> basel<strong>in</strong>escale and <strong>the</strong> one that maximizes <strong>the</strong> entry gap is only moderately different from <strong>the</strong> basel<strong>in</strong>e.<strong>The</strong>refore, <strong>the</strong> overall pattern <strong>of</strong> <strong>the</strong> racial test gap us<strong>in</strong>g <strong>the</strong> correlation-maximiz<strong>in</strong>gtransformation does not differ dramatically from <strong>the</strong> basel<strong>in</strong>e. As shown <strong>in</strong> column (4) <strong>of</strong>Table 4, <strong>the</strong> total growth <strong>in</strong> <strong>the</strong> gap from <strong>the</strong> beg<strong>in</strong>n<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten through <strong>the</strong> end <strong>of</strong>third grade is .26 standard deviations, .09 smaller than <strong>the</strong> growth seen <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e. All<strong>of</strong> <strong>the</strong> growth <strong>of</strong> <strong>the</strong> gap occurs between <strong>the</strong> end <strong>of</strong> first and <strong>the</strong> end <strong>of</strong> third grade. Thisdiffers from <strong>the</strong> story told by <strong>the</strong> basel<strong>in</strong>e ECLS-K <strong>of</strong> a steady <strong>in</strong>crease <strong>in</strong> <strong>the</strong> gap throughout<strong>the</strong> first four years <strong>of</strong> school<strong>in</strong>g.We additionally choose <strong>the</strong> scale that maximizes <strong>the</strong> R 2 from a regression <strong>of</strong> <strong>the</strong> thirdgrade score on <strong>the</strong> scores <strong>the</strong> student received on <strong>the</strong> first grade and two k<strong>in</strong>dergarten tests,as well as from a regression <strong>of</strong> <strong>the</strong> fifth grade score on <strong>the</strong> third, first, and k<strong>in</strong>dergarten tests.Both <strong>the</strong>se approaches yield similar results to those <strong>of</strong> column 4.6.3 Controll<strong>in</strong>g for Socioeconomic FactorsOne <strong>of</strong> <strong>the</strong> surpris<strong>in</strong>g results <strong>in</strong> Fryer and Levitt is that when students first enter k<strong>in</strong>dergarten,<strong>the</strong> modest black-white test score gap can be accounted for fully by a small number<strong>of</strong> socioeconomic characteristics (child’s age, child’s birth weight, a socioeconomic statusmeasure, WIC participation, mo<strong>the</strong>r’s age at first birth, and number <strong>of</strong> children’s books <strong>in</strong><strong>the</strong> home). In this subsection we ask whe<strong>the</strong>r <strong>the</strong> same is true for <strong>the</strong> scales developed <strong>in</strong> <strong>the</strong>27


previous subsections. Of course, when <strong>the</strong>re is no gap at entry, <strong>the</strong>se characteristics cannotaccount for <strong>the</strong> gap, but it is possible that <strong>the</strong>y could reverse it.Table 5 shows <strong>the</strong> results <strong>of</strong> this exercise. Strik<strong>in</strong>gly <strong>the</strong> k<strong>in</strong>dergarten entry results arerobust to <strong>the</strong> choice <strong>of</strong> scale. Regardless <strong>of</strong> whe<strong>the</strong>r <strong>the</strong> scale shows an unadjusted gap <strong>of</strong>.11 or .47, after controll<strong>in</strong>g for this small number <strong>of</strong> factors, <strong>the</strong> rema<strong>in</strong><strong>in</strong>g gap is actuallyreversed and favors blacks by between .03 and .05 standard deviations.In contrast, <strong>the</strong>importance <strong>of</strong> <strong>the</strong> controls <strong>in</strong> third grade depends on <strong>the</strong> choice <strong>of</strong> scale. Three <strong>of</strong> <strong>the</strong> fourscales generate unadjusted test score gaps <strong>of</strong> approximately .75 standard deviations. Aftercontroll<strong>in</strong>g for <strong>the</strong> socioeconomic factors, <strong>the</strong> gap falls to about .3 standard deviations butstill <strong>in</strong>dicates a very substantial deterioration <strong>in</strong> <strong>the</strong> relative performance <strong>of</strong> black childrenover <strong>the</strong> first three years <strong>of</strong> school. In contrast, <strong>the</strong> transformation that m<strong>in</strong>imizes <strong>the</strong> growth<strong>of</strong> <strong>the</strong> unadjusted gap shows a noticeably more modest adjusted gap <strong>of</strong> .17. In this case twothirds<strong>of</strong> <strong>the</strong> unadjusted gap is accounted for by <strong>the</strong> measured characteristics, a somewhatlarger proportion than <strong>the</strong> little over half accounted for when <strong>the</strong> o<strong>the</strong>r scales are used.Thus <strong>the</strong> choice <strong>of</strong> scale has a significant impact on <strong>the</strong> magnitude <strong>of</strong> <strong>the</strong> <strong>in</strong>crease <strong>of</strong><strong>the</strong> adjusted gap as well as <strong>of</strong> <strong>the</strong> unadjusted gap. 17We note that we have not chosen <strong>the</strong>scales on <strong>the</strong> basis <strong>of</strong> <strong>the</strong> adjusted gaps. We have, however, done some experimentation thatsuggests that maximiz<strong>in</strong>g and m<strong>in</strong>imiz<strong>in</strong>g <strong>the</strong> adjusted gaps would not significantly alter <strong>the</strong>results.Ano<strong>the</strong>r surpris<strong>in</strong>g result <strong>in</strong> Fryer and Levitt is that <strong>the</strong> growth <strong>of</strong> <strong>the</strong> black-white testgap is virtually unaffected by whe<strong>the</strong>r or not socioeconomic controls are used. Both <strong>the</strong>controlled and uncontrolled test gaps grow by a similar magnitude between entry and <strong>the</strong>end <strong>of</strong> <strong>the</strong> third grade. We have already shown that this appears to be an artifact <strong>of</strong> <strong>the</strong>28


scale. Much <strong>of</strong> <strong>the</strong> growth <strong>in</strong> <strong>the</strong> gap under <strong>the</strong> maximiz<strong>in</strong>g transformation can be expla<strong>in</strong>edby socioeconomic controls. While <strong>the</strong> raw gap <strong>in</strong>creases by .64 standard deviations over <strong>the</strong>first four years <strong>of</strong> education under this transformation, <strong>the</strong> controlled gap <strong>in</strong>creases by only.35 standard deviations. Under <strong>the</strong> m<strong>in</strong>imiz<strong>in</strong>g transformation, <strong>the</strong> socioeconomic controlsactually have negative explanatory power.While <strong>the</strong> raw gap under this transformationgrows by only .05 standard deviations, <strong>the</strong> adjusted gap <strong>in</strong>creases by .2 standard deviations.We fur<strong>the</strong>r analyze <strong>the</strong> robustness <strong>of</strong> this result <strong>in</strong> Table 6. In <strong>the</strong> first two columns,we f<strong>in</strong>d <strong>the</strong> transformation that maximizes <strong>the</strong> percentage <strong>of</strong> <strong>the</strong> growth <strong>in</strong> <strong>the</strong> raw testgap that can be expla<strong>in</strong>ed by <strong>the</strong> socioeconomic controls. That is, we m<strong>in</strong>imize <strong>the</strong> ratio <strong>of</strong><strong>the</strong> magnitude <strong>of</strong> <strong>the</strong> growth <strong>in</strong> <strong>the</strong> controlled gap over <strong>the</strong> magnitude <strong>of</strong> <strong>the</strong> growth <strong>in</strong> <strong>the</strong>uncontrolled gap. In this transformation, <strong>the</strong> controls expla<strong>in</strong> only slightly more than <strong>in</strong> <strong>the</strong>growth-maximiz<strong>in</strong>g transformation. <strong>The</strong> raw test gap grows by .59 standard deviations fromk<strong>in</strong>dergarten through third grade, while <strong>the</strong> controlled gap grows by only .28. In columns 3and 4, we <strong>in</strong>stead maximize <strong>the</strong> difference between <strong>the</strong> growth <strong>of</strong> <strong>the</strong> raw test gap and <strong>the</strong>growth <strong>of</strong> <strong>the</strong> controlled gap. <strong>The</strong> pattern under this transformation looks similar to that <strong>of</strong><strong>the</strong> maximiz<strong>in</strong>g transformation as well.6.4 Scale SensitivityTo some extent, <strong>the</strong> choice <strong>of</strong> scale limits <strong>the</strong> potential magnitude <strong>of</strong> between-group differences.An example may clarify this po<strong>in</strong>t. Suppose a group <strong>of</strong> researchers is <strong>in</strong>terested <strong>in</strong>understand<strong>in</strong>g early racial differences <strong>in</strong> read<strong>in</strong>g. <strong>The</strong>y adm<strong>in</strong>ister a test to a group <strong>of</strong> 50black and 50 white children. <strong>The</strong> performance <strong>of</strong> <strong>the</strong> children can be strictly ranked. <strong>The</strong>y29


<strong>the</strong>n give <strong>the</strong> results to two psychometricians with <strong>in</strong>structions to scale <strong>the</strong> results. <strong>The</strong>first reports that <strong>in</strong> this group <strong>the</strong> black-white test score gap is almost exactly two standarddeviations. <strong>The</strong> second reports that it is about 1.1 standard deviations.Fur<strong>the</strong>r <strong>in</strong>vestigation reveals that both psychometricians believe that scales should reflectdevelopmental milestones and that differences <strong>in</strong> performance on a given side <strong>of</strong> <strong>the</strong>milestone are <strong>in</strong>significant. <strong>The</strong>y also agree that <strong>the</strong> milestone is passed when children shiftfrom “learn<strong>in</strong>g to read”to “read<strong>in</strong>g to learn,”but <strong>the</strong>y differ about what test performancecorresponds to this shift.<strong>The</strong> first psychometrician set <strong>the</strong> milestone at a po<strong>in</strong>t for which half <strong>of</strong> <strong>the</strong> orig<strong>in</strong>alsample was judged to be read<strong>in</strong>g to learn. In contrast, <strong>the</strong> second psychometrician believesthat only <strong>the</strong> 25 children with <strong>the</strong> highest scores merit this designation. Note that becauseeach psychometrician uses a scale with only two po<strong>in</strong>ts, <strong>the</strong>ir calculations are <strong>in</strong>variant to<strong>the</strong> issues we have addressed heret<strong>of</strong>ore.In our fictional example, we have allocated <strong>the</strong> fifty lowest scores to <strong>the</strong> “blacks” and<strong>the</strong> highest fifty scores to <strong>the</strong> “whites.”In both cases, <strong>the</strong> reported test gaps are <strong>the</strong> largestconsistent with <strong>the</strong> scales and distributions “chosen” by <strong>the</strong> psychometricians. Thus bothgaps are at <strong>the</strong>ir maxima, but when <strong>the</strong> scale sets equal numbers <strong>of</strong> 0s and 1s, <strong>the</strong> gap canbe bigger than it can be when one one-fourth <strong>of</strong> <strong>the</strong> students receive 1s.<strong>The</strong> lower half <strong>of</strong> <strong>the</strong> scores <strong>in</strong> <strong>the</strong> ECLS-K k<strong>in</strong>dergarten test are all clustered with<strong>in</strong> onestandard deviation <strong>of</strong> <strong>the</strong> median. It is possible that this characteristic <strong>of</strong> <strong>the</strong> test and itsscal<strong>in</strong>g affects <strong>the</strong> potential for a large test score gap <strong>in</strong> k<strong>in</strong>dergartenTo analyze <strong>the</strong> effect that such clump<strong>in</strong>g may have on <strong>the</strong> evolution <strong>of</strong> <strong>the</strong> racial test gap<strong>in</strong> <strong>the</strong> ECLS-K, we calculate what <strong>the</strong> test gap would be <strong>in</strong> each grade if blacks had all <strong>the</strong>30


lowest scores and whites all <strong>the</strong> highest. Denot<strong>in</strong>g w j as <strong>the</strong> weighted number <strong>of</strong> childrenwith score t j , where j is <strong>the</strong> rank (from low to high) <strong>of</strong> <strong>the</strong> score, and W B is <strong>the</strong> weightednumber <strong>of</strong> blacks <strong>in</strong> <strong>the</strong> sample, we create a set <strong>of</strong> weighted test scores B = {t j |j ∈ [1, m]}∑where m solves <strong>the</strong> problem m w i = W B . We, likewise, assign all <strong>the</strong> highest scores toi=1whites, based on <strong>the</strong>ir weighted proportion <strong>of</strong> <strong>the</strong> sample. 18<strong>The</strong> first column <strong>of</strong> Table 7 shows <strong>the</strong> weighted test gaps along with <strong>the</strong> hypo<strong>the</strong>ticalupper boundary for <strong>the</strong> gap on <strong>the</strong> ECLS-K assessments. While <strong>the</strong> observed black-whitetest gap <strong>in</strong>creases over time, so does <strong>the</strong> boundary for that gap. <strong>The</strong> <strong>the</strong>oretical maximumgap based on <strong>the</strong> distribution at <strong>the</strong> beg<strong>in</strong>n<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten is 1.5 standard deviations.This rises to 2.2 standard deviations by <strong>the</strong> end <strong>of</strong> third grade. <strong>The</strong> result is that <strong>the</strong> observedracial test gap, as measured as a percentage <strong>of</strong> <strong>the</strong> possible test gap, hardly changes overtime. At <strong>the</strong> beg<strong>in</strong>n<strong>in</strong>g <strong>of</strong> k<strong>in</strong>dergarten, <strong>the</strong> achievement gap is 27 percent <strong>of</strong> <strong>the</strong> maximumpossible achievement gap, given <strong>the</strong> scale, while <strong>the</strong> gap is 33 percent <strong>of</strong> <strong>the</strong> maximum gapat <strong>the</strong> end <strong>of</strong> third grade. This raises <strong>the</strong> concern that part <strong>of</strong> <strong>the</strong> large observed <strong>in</strong>crease <strong>in</strong><strong>the</strong> racial achievement gap <strong>in</strong> <strong>the</strong> ECLS-K may be attributable to changes <strong>in</strong> scale and testsensitivity, as opposed to changes <strong>in</strong> <strong>the</strong> real achievement gap.<strong>The</strong> rema<strong>in</strong><strong>in</strong>g columns <strong>of</strong> Table 7 look at <strong>the</strong> boundary <strong>of</strong> <strong>the</strong> test gap for our previouslydiscussed transformations. Columns (2) and (3) show <strong>the</strong> maximum test gap for <strong>the</strong>transformations that simply try to m<strong>in</strong>imize or maximize <strong>the</strong> growth <strong>of</strong> <strong>the</strong> gap <strong>in</strong> <strong>the</strong> firstfour years <strong>of</strong> education. Interest<strong>in</strong>gly, <strong>the</strong>se transformations have <strong>the</strong> opposite effect on <strong>the</strong>growth <strong>of</strong> <strong>the</strong> gap relative to <strong>the</strong> maximum test gap. <strong>The</strong> m<strong>in</strong>imiz<strong>in</strong>g transformation yieldsa test gap 24 percent <strong>the</strong> size <strong>of</strong> <strong>the</strong> maximum gap at k<strong>in</strong>dergarten entry, but one that is53 percent at <strong>the</strong> end <strong>of</strong> third grade despite virtually no growth <strong>in</strong> <strong>the</strong> size <strong>of</strong> <strong>the</strong> test gap31


<strong>in</strong> terms <strong>of</strong> standard deviations over this period. Likewise, <strong>in</strong> <strong>the</strong> maximiz<strong>in</strong>g transformation<strong>the</strong> gap as a percentage <strong>of</strong> <strong>the</strong> maximum gap shr<strong>in</strong>ks from 46 percent at <strong>the</strong> start <strong>of</strong>k<strong>in</strong>dergarten to 35 percent at <strong>the</strong> end <strong>of</strong> third grade despite a nearly 700 percent <strong>in</strong>crease<strong>in</strong> <strong>the</strong> size <strong>of</strong> <strong>the</strong> gap <strong>in</strong> terms <strong>of</strong> standard deviations over that same period. Our transformationsappear to act ma<strong>in</strong>ly by chang<strong>in</strong>g <strong>the</strong> potential sensitivity <strong>of</strong> <strong>the</strong> scale to <strong>the</strong> racialtest gap. <strong>The</strong> test gap at third grade can be no larger than .95 standard deviations under<strong>the</strong> m<strong>in</strong>imiz<strong>in</strong>g transformation, compared to 2.23 standard deviations <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e. <strong>The</strong>maximiz<strong>in</strong>g transformations can have a gap no larger than .24 at k<strong>in</strong>dergarten, which is notonly lower than <strong>the</strong> maximum <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e <strong>of</strong> 1.5, but lower also than <strong>the</strong> actual observedtest gap <strong>in</strong> <strong>the</strong> basel<strong>in</strong>e <strong>of</strong> .4 standard deviations. Column (4) looks at <strong>the</strong> boundaries for <strong>the</strong>test gap under <strong>the</strong> transformation that maximizes <strong>the</strong> correlation across tests. Strik<strong>in</strong>gly,<strong>the</strong> maximum gap is almost identical at k<strong>in</strong>dergarten entry and third grade. <strong>The</strong> <strong>in</strong>crease<strong>in</strong> <strong>the</strong> estimated gap as a proportion <strong>of</strong> <strong>the</strong> maximum gap <strong>the</strong>refore reflects changes <strong>in</strong> <strong>the</strong>former ra<strong>the</strong>r than <strong>the</strong> latter. Recall, however, that <strong>the</strong> <strong>in</strong>crease <strong>in</strong> <strong>the</strong> estimated gap withthis scale is smaller than with <strong>the</strong> base scale.Ra<strong>the</strong>r than look at <strong>the</strong> boundaries <strong>of</strong> <strong>the</strong> test score gap, an alternative approach is tolook at what <strong>the</strong> gap would look like if <strong>the</strong> test scale rema<strong>in</strong>ed constant. Denote F g (r) as<strong>the</strong> function that maps a child’s performance rank to a test score. Panel A <strong>of</strong> Table 8 shows<strong>the</strong> evolution <strong>of</strong> <strong>the</strong> test gap if F g (r) did not vary with g. That is, we choose an <strong>in</strong>itial gradeand <strong>the</strong>n take a child’s rank on each grade’s exam and reassign to him or her <strong>the</strong> score givento <strong>the</strong> child who was at that rank on <strong>the</strong> <strong>in</strong>itially chosen exam. 19 When we impose ei<strong>the</strong>r <strong>the</strong>fall k<strong>in</strong>dergarten or <strong>the</strong> spr<strong>in</strong>g third grade mapp<strong>in</strong>g, we see virtually no growth <strong>in</strong> <strong>the</strong> testscore gap until <strong>the</strong> third grade test, but substantial growth <strong>in</strong> <strong>the</strong> test gap at third grade.32


Panel B <strong>in</strong>stead supposes that we fix r while vary<strong>in</strong>g F g (i.e. chang<strong>in</strong>g <strong>the</strong> scales acrossgrades as <strong>the</strong>y do de facto <strong>in</strong> <strong>the</strong> ECLS-K.) Even if <strong>the</strong> rank order<strong>in</strong>g <strong>of</strong> students did notchange, we would still see growth <strong>in</strong> <strong>the</strong> test gap us<strong>in</strong>g <strong>the</strong> basel<strong>in</strong>e scales <strong>in</strong> <strong>the</strong> ECLS-K.If <strong>the</strong> rank order <strong>of</strong> children rema<strong>in</strong>ed what it was <strong>in</strong> k<strong>in</strong>dergarten throughout <strong>the</strong> first fouryears, we would observe a .09 standard deviation <strong>in</strong>crease <strong>in</strong> <strong>the</strong> test gap from entry to thirdgrade due simply to changes <strong>in</strong> <strong>the</strong> spac<strong>in</strong>g between ranks over time. Likewise, us<strong>in</strong>g <strong>the</strong>third grade rank order we would observe a .13 standard deviation <strong>in</strong>crease <strong>in</strong> <strong>the</strong> test gap.Most <strong>of</strong> <strong>the</strong> <strong>in</strong>crease <strong>in</strong> this gap occurs between <strong>the</strong> spr<strong>in</strong>g first grade and spr<strong>in</strong>g third gradetests. Us<strong>in</strong>g <strong>the</strong> entry rank<strong>in</strong>gs, <strong>the</strong> <strong>in</strong>crease dur<strong>in</strong>g this time span is .05 standard deviations,and us<strong>in</strong>g <strong>the</strong> third grade rank<strong>in</strong>gs it is .07.Table 8 strongly suggests that <strong>the</strong> growth <strong>in</strong> <strong>the</strong> test gap from k<strong>in</strong>dergarten through firstgrade reflects scales and not achievement. Moreover, a significant portion <strong>of</strong> <strong>the</strong> growth fromfirst to third grade also reflects scal<strong>in</strong>g decisions. Taken toge<strong>the</strong>r, tables 7 and 8 suggestthat when we use <strong>the</strong> base scale, someth<strong>in</strong>g on <strong>the</strong> order <strong>of</strong> 8 to 13 percentage po<strong>in</strong>ts <strong>of</strong> <strong>the</strong>growth <strong>in</strong> <strong>the</strong> gap between entry and third grade reflects scale sensitivity.7 Summary and ConclusionOur f<strong>in</strong>d<strong>in</strong>gs suggest that we should exercise great caution when us<strong>in</strong>g test scores to determ<strong>in</strong>ewhen a black-white test score gap first emerges and whe<strong>the</strong>r it widens <strong>in</strong> <strong>the</strong> early schoolyears. By choos<strong>in</strong>g <strong>the</strong> scale appropriately, we can make <strong>the</strong> <strong>in</strong>itial gap <strong>in</strong> <strong>the</strong> ECLS-K, atk<strong>in</strong>dergarten entry, <strong>in</strong> read<strong>in</strong>g anywhere from a trivial one-n<strong>in</strong>th <strong>of</strong> a standard deviation toalmost half a standard deviation. Similarly, <strong>the</strong> third grade gap varies between half and33


three-quarters <strong>of</strong> a standard deviation.Equally significantly, whe<strong>the</strong>r <strong>the</strong> gap widens afterschool entry depends on our choice <strong>of</strong> scale. Some scales show a decrease <strong>in</strong> <strong>the</strong> test gap <strong>in</strong><strong>the</strong> CNLSY.Our message is by no means limited to <strong>the</strong> analysis <strong>of</strong> <strong>the</strong> black-white test score gap.It must be considered whenever test scores or any ord<strong>in</strong>al measures are used as dependentvariables. <strong>The</strong>re is <strong>in</strong>creas<strong>in</strong>g pressure <strong>in</strong> <strong>the</strong> United States and elsewhere to use “valueaddedmeasures”to determ<strong>in</strong>e teacher compensation and retention. Value-added is typicallymeasured by regress<strong>in</strong>g student scores on <strong>the</strong>ir past scores and o<strong>the</strong>r characteristics. Teachervalue-added is calculated ei<strong>the</strong>r by <strong>in</strong>clud<strong>in</strong>g a teacher fixed-effect <strong>in</strong> <strong>the</strong> regression or as <strong>the</strong>mean residual for that teacher. Lang (2010) presents a simple example <strong>in</strong> which <strong>the</strong> rank<strong>in</strong>g<strong>of</strong> teachers is highly sensitive to <strong>the</strong> choice <strong>of</strong> scale. At <strong>the</strong> very least, before l<strong>in</strong>k<strong>in</strong>g importantdecisions mechanically to value-added measures, we should be sure that <strong>the</strong> measures arerobust to arbitrary scal<strong>in</strong>g decisions.Cascio and Staiger (2012) partially address this issue <strong>in</strong> <strong>the</strong> context <strong>of</strong> <strong>the</strong> common f<strong>in</strong>d<strong>in</strong>gthat <strong>in</strong>terventions fade-out relatively quickly. <strong>The</strong>ir po<strong>in</strong>t can be summarized as follows.Suppose that we had <strong>the</strong> good fortune <strong>of</strong> hav<strong>in</strong>g <strong>the</strong> “true” <strong>in</strong>terval scale <strong>of</strong> achievement(def<strong>in</strong>ed up to a l<strong>in</strong>ear transformation). It is common practice among economists to renormscales each year to have mean zero and variance 1. For any given year, this is unproblematic;it is simply a l<strong>in</strong>ear transformation. However, if <strong>the</strong> variance <strong>of</strong> <strong>the</strong> true scale growsas students progress through school, <strong>the</strong>n we are us<strong>in</strong>g different l<strong>in</strong>ear transformations eachyear, and <strong>the</strong> scales are no longer strictly comparable. If variancegrows quickly as studentsprogress through school, whatever students learned earlier will seem<strong>in</strong>gly become lessimportant because <strong>the</strong> true scale is divided by a higher number later <strong>in</strong> school. <strong>The</strong>refore,34


fade-out can be a mechanical result <strong>of</strong> <strong>the</strong> convention <strong>of</strong> normaliz<strong>in</strong>g test scores. <strong>The</strong>y f<strong>in</strong>devidence <strong>of</strong> such an effect but conclude that it is only <strong>of</strong> modest importance.However, Cascio and Staiger do not address <strong>the</strong> ord<strong>in</strong>ality issue that is <strong>the</strong> focus <strong>of</strong> thispaper. <strong>The</strong>ir approach depends on <strong>the</strong> ability to calculate correlations among test scores,which is possible only if <strong>the</strong>y are measured on an <strong>in</strong>terval scale. We <strong>in</strong>terpret <strong>the</strong>ir resultas say<strong>in</strong>g that <strong>the</strong>re is a monotonic transformation <strong>of</strong> <strong>the</strong> latent achievement scale suchthat renormaliz<strong>in</strong>g <strong>the</strong> transformed scale <strong>in</strong> each grade does not mechanically create muchfade-out. It is by no means clear to us that <strong>the</strong>ir result would hold for arbitrary monotonictransformations. Indeed, we believe <strong>the</strong> opposite to be likely.As discussed above, we are by no means <strong>the</strong> first to recognize <strong>the</strong> diffi culties <strong>of</strong> work<strong>in</strong>gwith ord<strong>in</strong>al test score scales. Cunha and Heckman (2008) and Cunha, Heckman andSchennach (2010) tie test scores to adult outcomes <strong>in</strong> order to create <strong>in</strong>terval scales. Thishas <strong>the</strong> enormous advantage <strong>of</strong> giv<strong>in</strong>g test scores a valid external reference. However, it isnot a panacea. <strong>The</strong> test score scale will differ if it is tied to wages or to log wages, and<strong>the</strong> choice between <strong>the</strong> two is essentially a welfare judgment.We have no idea whe<strong>the</strong>rplausible variation <strong>in</strong> <strong>the</strong> choice <strong>of</strong> anchor would affect <strong>the</strong> outcome <strong>of</strong> <strong>the</strong>ir research. Weexpect that scales that put much more weight on only differences among very low wages oronly differences among very high wages would produce different results but do not f<strong>in</strong>d suchscales particularly plausible.Nor is our message limited to <strong>the</strong> education and child development literature. Similarproblems arise <strong>in</strong> <strong>the</strong> happ<strong>in</strong>ess literature. A common, perhaps <strong>the</strong> central, f<strong>in</strong>d<strong>in</strong>g <strong>in</strong> thisliterature is that <strong>the</strong> relation between happ<strong>in</strong>ess and <strong>in</strong>come is clearly positive with<strong>in</strong> countriesbut, at best, weak across countries. It is way beyond <strong>the</strong> scope <strong>of</strong> this paper to provide35


a critique <strong>of</strong> this literature. We note only that happ<strong>in</strong>ess is typically measured on an ord<strong>in</strong>alscale <strong>of</strong> three to five po<strong>in</strong>ts. <strong>The</strong>re are rarely more po<strong>in</strong>ts than <strong>in</strong> <strong>the</strong> Cantril self-anchor<strong>in</strong>gscale (Cantril, 1965) which asks, “Please imag<strong>in</strong>e a ladder with steps numbered from zero at<strong>the</strong> bottom to 10 at <strong>the</strong> top. <strong>The</strong> top <strong>of</strong> <strong>the</strong> ladder represents <strong>the</strong> best possible life for youand <strong>the</strong> bottom <strong>of</strong> <strong>the</strong> ladder represents <strong>the</strong> worst possible life for you. On which step <strong>of</strong> <strong>the</strong>ladder would you say you personally feel you stand at this time?” Researchers <strong>in</strong> this fieldrout<strong>in</strong>ely average <strong>the</strong> answers with<strong>in</strong> a group to f<strong>in</strong>d <strong>the</strong> mean level <strong>of</strong> happ<strong>in</strong>ess or life satisfaction.20Without analyz<strong>in</strong>g <strong>the</strong> underly<strong>in</strong>g data, it is diffi cult to know how problematicthis is.In this paper, we have shown that scal<strong>in</strong>g matters but differently for different tests.In <strong>the</strong> ECLS-K, <strong>the</strong> gap at k<strong>in</strong>dergarten entry is somewhat but not dramatically largerthan suggested by <strong>the</strong> untransformed scale and <strong>the</strong> growth <strong>in</strong> <strong>the</strong> gap through third gradeis correspond<strong>in</strong>gly smaller.Almost all <strong>of</strong> this growth occurs between spr<strong>in</strong>g <strong>of</strong> <strong>the</strong> firstand third grades. But even this result is suspect because <strong>the</strong> third grade test is capable <strong>of</strong>generat<strong>in</strong>g a larger gap than are <strong>the</strong> earlier tests. Given this concern, we do not wish to placeexcessive emphasis on <strong>the</strong> f<strong>in</strong>d<strong>in</strong>g that <strong>the</strong> gap widens between first and third grades. In <strong>the</strong>CNLSY, <strong>the</strong> gap <strong>in</strong> k<strong>in</strong>dergarten is virtually identical <strong>in</strong> <strong>the</strong> untransformed and our preferredtransformed scales. However while <strong>the</strong> untransformed scale shows a dramatic <strong>in</strong>crease <strong>in</strong> <strong>the</strong>test gap over <strong>the</strong> first four years, our preferred scale actually implies a decl<strong>in</strong>e <strong>in</strong> magnitudeby third grade.We note that it has become someth<strong>in</strong>g <strong>of</strong> a mantra <strong>in</strong> education circles that third gradeis when students beg<strong>in</strong> <strong>the</strong> transition from “learn<strong>in</strong>g to read” to “read<strong>in</strong>g to learn.” If <strong>the</strong>tim<strong>in</strong>g implied by our ECLS-K rescal<strong>in</strong>g is correct, this suggests one avenue to pursue <strong>in</strong>36


fur<strong>the</strong>r<strong>in</strong>g our understand<strong>in</strong>g <strong>of</strong> <strong>the</strong> gap. If <strong>the</strong> pattern <strong>in</strong> our CNLSY rescal<strong>in</strong>g is correct,pre-enrollment <strong>in</strong>terventions may be more important <strong>in</strong> reduc<strong>in</strong>g <strong>the</strong> test gap than thosethat are post-enrollment.More broadly, our f<strong>in</strong>d<strong>in</strong>gs suggest that economists and o<strong>the</strong>r researchers should be muchmore circumspect <strong>in</strong> <strong>the</strong>ir use <strong>of</strong> test scores and o<strong>the</strong>r ord<strong>in</strong>al scales as dependent variables,particularly when compar<strong>in</strong>g changes across groups. While many f<strong>in</strong>d<strong>in</strong>gs will be robust toscale changes, many will not be.37


ReferencesBaker, Frank, <strong>The</strong> Basics <strong>of</strong> Item Response <strong>The</strong>ory (College Park: ERIC Clear<strong>in</strong>ghouseon Assessment and Evaluation, University <strong>of</strong> Maryland, 2001).Braun, Henry I., “A New Approach to Avoid<strong>in</strong>g Problems <strong>of</strong> Scale <strong>in</strong> Interpret<strong>in</strong>g Trends<strong>in</strong> Mental Measurement Data,”Journal <strong>of</strong> Educational Measurement 25 (1988), 171-191.Cunha, Flavio and James J. Heckman, “Formulat<strong>in</strong>g, Identify<strong>in</strong>g, and Estimat<strong>in</strong>g <strong>the</strong>Technology <strong>of</strong> Cognitive and Noncognitive Skill Formation,” Journal <strong>of</strong> Human Resources43 (2008), 738-782.Cunha, Flavio, James J. Heckman, and Susanne M. Schennach, “Estimat<strong>in</strong>g <strong>the</strong> Technology<strong>of</strong> Cognitive and Noncognitive Skill Formation,”Econometrica 78 (2010), 883-931.Cascio, Elizabeth U. and Douglas O. Staiger, “Knowledge, <strong>Test</strong>s, and Fadeout <strong>in</strong> EducationalIntervention,”NBER work<strong>in</strong>g paper no. 18038 (2012).Clotfelter, Charles T , Helen F Ladd and Jacob L Vigdor, “<strong>The</strong> Academic Achievement<strong>Gap</strong> <strong>in</strong> <strong>Grades</strong> 3 to 8,”Review <strong>of</strong> Economics and Statistics 91 (2009), 398-419.Dickens, William T. and James R. Flynn, “Heritability Estimates versus Large EnvironmentalEffects: <strong>The</strong> IQ Paradox Resolved,”Psychological Review 108 (2001), 346-369.Duncan, Greg J. and Ka<strong>the</strong>r<strong>in</strong>e A. Magnuson, “Can Family Socioeconomic ResourcesAccount for Racial and Ethnic <strong>Test</strong> <strong>Score</strong> <strong>Gap</strong>s?”<strong>The</strong> Future <strong>of</strong> Children 15 (2005), 35-54.Dunn, Elizabeth and Norton, Michael, “Don’t Indulge. Be Happy,” New York TimesSunday Review, July 8, 2012.Fryer, Roland G., Jr. “<strong>The</strong> Importance <strong>of</strong> Segregation, Discrim<strong>in</strong>ation, Peer Dynamics,and Identity <strong>in</strong> Expla<strong>in</strong><strong>in</strong>g Trends <strong>in</strong> <strong>the</strong> Racial Achievement <strong>Gap</strong>,” NBER work<strong>in</strong>g paper38


no. 16257 (2010).Fryer, Roland G., Jr. and Steven D. Levitt, “Understand<strong>in</strong>g <strong>the</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Score</strong><strong>Gap</strong> <strong>in</strong> <strong>the</strong> First Two Years <strong>of</strong> School,”Review <strong>of</strong> Economics and Statistics 86 (2004), 447-64.Fryer, Roland G., Jr. and Steven D. Levitt, “<strong>The</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Score</strong> <strong>Gap</strong> ThroughThird Grade,”American Law and Economics Review 8 (2006), 249-81.Cantril, Hadley, <strong>The</strong> Pattern <strong>of</strong> Human Concerns (New Brunswick: Rutgers UniversityPress, 1965).Hanushek, Eric A. and Steven G. Rivk<strong>in</strong>, S. G. “School Quality and <strong>the</strong> <strong>Black</strong>-<strong>White</strong>Achievement <strong>Gap</strong>,”NBER work<strong>in</strong>g paper no. 12651 (2006).Ho, Andrew D., “A Nonparametric Framework for Compar<strong>in</strong>g Trends and <strong>Gap</strong>s Across<strong>Test</strong>s,”Journal <strong>of</strong> Educational and Behavioral Statistics 34 (2009), 201-228.Ho, Andrew D. and Edward H. Haertel., “Metric-Free Measures <strong>of</strong> <strong>Test</strong> <strong>Score</strong> Trends and<strong>Gap</strong>s with Policy-Relevant Examples,”CSE Report no. 665 (2006).Holland, Paul W., “Two Measures <strong>of</strong> Change <strong>in</strong> <strong>the</strong> <strong>Gap</strong>s Between <strong>the</strong> CDFs <strong>of</strong> <strong>Test</strong>-<strong>Score</strong>Distributions,”Journal <strong>of</strong> Education and Behavioral Statistics 27 (2002), 3-17.Jencks, Christopher and Meredith Phillips, “<strong>The</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Score</strong> <strong>Gap</strong>: An Introduction,”<strong>in</strong>Christopher Jencks and Meredith Phillips (Eds.), <strong>The</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Score</strong><strong>Gap</strong> (Wash<strong>in</strong>gton, DC: Brook<strong>in</strong>gs Institution Press, 1998).Jensen, Arthur R., “How Much Can We Boost IQ and Scholastic Achievement?”HarvardEducational Review 39 (1969) 1-123.Kennickell, Arthur B., “Ponds and streams: Wealth and Income <strong>in</strong> <strong>the</strong> U.S., 1989 to2007,”Federal Reserve Board Discussion Paper 2009-13 (2009).Lang, Kev<strong>in</strong>, “Measurement Matters: Perspectives on Education Policy from an Econo-39


mist and School Board Member,”Journal <strong>of</strong> Economic Perspectives 24 (2010), 167-181.Kimeldorf, George and Allen R. Samos<strong>in</strong>, “Monotone Dependence,”<strong>The</strong> Annals <strong>of</strong> Statistics6 (1978), 895-903.Lord, Frederic M., “<strong>The</strong> ‘Ability’Scale <strong>in</strong> Item Characteristic Curve <strong>The</strong>ory,”Psychometrika40 (1975), 205-217.Murnane, Richard J., John B. Willett, Kristen L. Bub and Kathleen McCartney, “Understand<strong>in</strong>gTrends <strong>in</strong> <strong>the</strong> <strong>Black</strong>-<strong>White</strong> Achievement <strong>Gap</strong>s dur<strong>in</strong>g <strong>the</strong> First Years <strong>of</strong> School,”Brook<strong>in</strong>gs-Wharton Papers on Urban Affairs (2006), 97-135.Reise, Steven P. and Waller, Niels G., “Item Response <strong>The</strong>ory and Cl<strong>in</strong>ical Measurement,”AnnualReview <strong>of</strong> Cl<strong>in</strong>ical Psychology 5 (2009), 27-48.Rouse, Cecilia E., Jeanne Brooks-Gunn and Sara McLanahan, “Introduc<strong>in</strong>g <strong>the</strong> Issue,School Read<strong>in</strong>ess: Clos<strong>in</strong>g Racial and Ethnic <strong>Gap</strong>s,”<strong>The</strong> Future <strong>of</strong> Children 15 (2005), 5-14.Reardon, Sean F., “Thirteen Ways <strong>of</strong> Look<strong>in</strong>g at <strong>the</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Score</strong> <strong>Gap</strong>,”Stanford University Institute for Research on Education Policy and Practice work<strong>in</strong>g paperno. 2008-08 (2008).Spencer, Bruce D. “On Interpret<strong>in</strong>g <strong>Test</strong> <strong>Score</strong>s as Social Indicators: Statistical Considerations,”Journal<strong>of</strong> Educational Measurement 20 (1983), 317-333.Stevens, S.S. “On <strong>the</strong> <strong>The</strong>ory <strong>of</strong> Scales <strong>of</strong> Measurement,”Science 10 (1946), 677-680.Wilk, M. B. and R. Gnanadesikan, “Probability Plott<strong>in</strong>g Methods for <strong>the</strong> Analysis <strong>of</strong>Data,”Biometrika 55 (1968), 1-17.40


Notes1 Fryer (2010) f<strong>in</strong>ds that, depend<strong>in</strong>g on <strong>the</strong> measure used, <strong>the</strong> racial test gap <strong>in</strong> <strong>the</strong>ECLS-K ei<strong>the</strong>r cont<strong>in</strong>ues to expand through eighth grade or rema<strong>in</strong>s fairly constant fromthird through eighth grade.Hanushek and Rivk<strong>in</strong> (2006) f<strong>in</strong>d a widen<strong>in</strong>g gap <strong>in</strong> Texaswhile Clotfelter, Ladd and Vigdor (2009) f<strong>in</strong>d <strong>in</strong> North Carol<strong>in</strong>a that gaps widen amonghigh-perform<strong>in</strong>g and narrow among low-perform<strong>in</strong>g students. Both studies, however, lookat somewhat later grades than those used here and <strong>in</strong> Fryer/Levitt.2 See <strong>the</strong> summary <strong>in</strong> Rouse, Brooks-Gunn and McLanahan (2005) and <strong>the</strong> analysis <strong>in</strong>Duncan and Magnuson (2005).3 See <strong>the</strong> discussion <strong>of</strong> <strong>the</strong> two views <strong>in</strong> Reise and Waller (2009).4 <strong>The</strong>re is a similar model based on <strong>the</strong> normal distribution which, to conserve space, wewill not discuss.5 See, for example, Reardon (2008) who concludes that <strong>the</strong> test he uses does not satisfythis condition.6 <strong>The</strong> earliest reference appears to be Wilk and Gnanadesikan (1968). For examples <strong>of</strong>test gap measures based on <strong>the</strong> PP curve and extensions, see Braun (1988), Holland (2002),Ho and Haertel (2006) and Reardon (2008).7 See, for example, Ho (2009).8 We dropped one observation that reported be<strong>in</strong>g <strong>in</strong> <strong>the</strong> third grade at age 3, who had aPPVT score well above all <strong>of</strong> <strong>the</strong> o<strong>the</strong>r 3-year-old scores.9 An additional subsample <strong>in</strong>cludes a set <strong>of</strong> children who were <strong>in</strong>itially <strong>in</strong>terviewed <strong>in</strong> <strong>the</strong>fall <strong>of</strong> <strong>the</strong>ir first grade. <strong>The</strong>se children are excluded from both our and Fryer and Levitt’s41


analysis, s<strong>in</strong>ce <strong>the</strong>y do not have k<strong>in</strong>dergarten test scores.10 Beg<strong>in</strong>n<strong>in</strong>g <strong>in</strong> <strong>the</strong> third grade, <strong>the</strong> general knowledge test was replaced by a science test.11 In practice, our program sometimes converged faster (or only) when we normalized <strong>the</strong>two lowest scores, and <strong>the</strong>n transformed <strong>the</strong> data afterwards to range from 0 to t max .12 This <strong>in</strong>troduces a discont<strong>in</strong>uity <strong>in</strong>to our objective function which creates many localm<strong>in</strong>ima (maximuma).We <strong>the</strong>refore searched from several different start<strong>in</strong>g locations foreach transformation and report that which produced <strong>the</strong> smallest (largest) gap.13 In practice, it was easier to do <strong>the</strong> estimation by sett<strong>in</strong>g <strong>the</strong> constant term to 0 andconstra<strong>in</strong><strong>in</strong>g <strong>the</strong> l<strong>in</strong>ear term and only subsequently transform<strong>in</strong>g <strong>the</strong> estimated coeffi cients.14 It is not entirely obvious that we should treat <strong>the</strong> difference between gett<strong>in</strong>g exactly <strong>the</strong>first six and <strong>the</strong> first five questions right as identical regardless <strong>of</strong> when <strong>the</strong> student took <strong>the</strong>test, but we impose this assumption.15 This is based on our imputation from Kennickell’s (2009) calculations based on <strong>the</strong>1989-2007 Survey <strong>of</strong> Consumer F<strong>in</strong>ances.16 Kimeldorf and Sampson (1978) refer to this as <strong>the</strong> monotone correlation.17 Given <strong>the</strong> arbitrar<strong>in</strong>ess <strong>of</strong> scale, <strong>the</strong>re is no reason to believe that <strong>the</strong> socioeconomicfactors should enter <strong>the</strong> equation l<strong>in</strong>early.We choose this specification follow<strong>in</strong>g that <strong>of</strong>Fryer and Levitt (2004,2006). <strong>The</strong> conclusions are robust to us<strong>in</strong>g a full set <strong>of</strong> <strong>in</strong>teractedcontrols enter<strong>in</strong>g as a cubic polynomial.18 <strong>The</strong> rema<strong>in</strong><strong>in</strong>g middle scores are implicitly assigned to Hispanics, Asians, and o<strong>the</strong>rs,though we do not look at <strong>the</strong>ir hypo<strong>the</strong>tical test gaps <strong>in</strong> this situation.19 S<strong>in</strong>ce we are us<strong>in</strong>g weights, we cannot map rank <strong>in</strong> one grade directly to rank <strong>in</strong> <strong>the</strong>o<strong>the</strong>r grade. Instead we view rank as a cont<strong>in</strong>uum and look at masses at each score. This42


esults <strong>in</strong> some children receiv<strong>in</strong>g a weighted average <strong>of</strong> two consecutive scores. <strong>The</strong> resultsare sensitive to <strong>the</strong> way <strong>in</strong> which ties at scores are broken, but s<strong>in</strong>ce <strong>the</strong>re are very few tiesgiven <strong>the</strong> quasi-cont<strong>in</strong>uous nature <strong>of</strong> <strong>the</strong> scor<strong>in</strong>g system, this sensitivity is only beyond <strong>the</strong>fourth decimal po<strong>in</strong>t.20 Two noted happ<strong>in</strong>ess researchers went fur<strong>the</strong>r and treated <strong>the</strong> scale as a ratio scale: “..our data showed that people who earned $55,000 were just 9 percent more satisfied thanthose mak<strong>in</strong>g $25,000.”(Dunn and Norton, 2012)43


Figure 1: ECLS-K K<strong>in</strong>dergarten DensitiesDensity0 .5 1 1.5 2 2.5­2 0 2 4 6 8Standard Deviations from MeanBasel<strong>in</strong>eGrowth Maximiz<strong>in</strong>gGrowth M<strong>in</strong>imiz<strong>in</strong>gOutly<strong>in</strong>g values beyond seven standard deviations above <strong>the</strong> mean are not displayed44


Figure 2: ECLS-K Third Grade DensitiesDensity0 .5 1 1.5 2­5 0 5 10Standard Deviations from MeanBasel<strong>in</strong>eGrowth Maximiz<strong>in</strong>gGrowth M<strong>in</strong>imiz<strong>in</strong>gOutly<strong>in</strong>g values beyond seven standard deviations above <strong>the</strong> mean are not displayed45


Figure 3: ECLS-K Transformation Functions46


Figure 4: Correlation Maximiz<strong>in</strong>g Transformations for CNLSY47


Table 1: CNLSY Descriptive StatisticsTotal <strong>Black</strong> <strong>White</strong>K<strong>in</strong>dergartenMean 17.33 16.50 17.81(5.18) (4.79) (5.34)M<strong>in</strong> 0 0 0Max 56 46 56N 2771 1015 1756First GradeMean 24.96 22.89 26.23(7.94) (6.41) (8.51)M<strong>in</strong> 0 0 0Max 64 50 64N 2765 1055 1710Second GradeMean 32.89 29.48 35.15(9.72) (8.57) (9.79)M<strong>in</strong> 0 8 0Max 68 59 68N 2825 1127 1698Third GradeMean 38.63 35.04 40.98(9.71) (9.33) (9.23)M<strong>in</strong> 0 0 2Max 81 68 81N 2832 1117 1715Source: Children <strong>of</strong> <strong>the</strong> National Longitud<strong>in</strong>alSurvey <strong>of</strong> Youth. Standard deviations <strong>in</strong>paren<strong>the</strong>sis.48


Table 2: ECLS-K Descriptive StatisticsBond and Lang Fryer and LevittRace<strong>White</strong> 0.62 0.55(0.49) (0.50)<strong>Black</strong> 0.17 0.15(0.37) (0.36)Hispanic 0.14 0.18(0.35) (0.38)Asian 0.02 0.07(0.15) (0.25)Female 0.49 0.49(0.50) (0.50)<strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Gap</strong>K<strong>in</strong>dergarten Fall 0.40 0.40(0.03) (0.03)K<strong>in</strong>dergarten Spr<strong>in</strong>g 0.44 0.45(0.03) (0.03)First Grade Spr<strong>in</strong>g 0.49 0.52(0.03) (0.03)Third Grade Spr<strong>in</strong>g 0.75 0.77(0.04) (0.03)Sociodemographic ControlsAge (<strong>in</strong> months) fall K<strong>in</strong>dergarten 68.5 67.0SES composite measure 0.022 0.005Number <strong>of</strong> children’s books <strong>in</strong> home 76.8 61.4Mo<strong>the</strong>r’s age at first birth 23.6 23.6Child’s birth weight (<strong>in</strong> ounces) 118.1 87.5WIC participant 0.42 0.38Observations 11414 10540Source: Early Childhood Longitud<strong>in</strong>al Study K<strong>in</strong>dergarten Class <strong>of</strong> 1998-1999. Standarddeviations are <strong>in</strong> paran<strong>the</strong>sis for variables. <strong>Test</strong> gaps measured <strong>in</strong> standard deviations andstandard errors are <strong>in</strong> paren<strong>the</strong>sis.49


Table 3: <strong>Evolution</strong> <strong>of</strong> <strong>the</strong> black-white test gap under various transformations <strong>of</strong> <strong>the</strong> PIATBasel<strong>in</strong>e M<strong>in</strong>imum Maximum Corr Max(1) (2) (3) (4)K<strong>in</strong>dergarten 0.25*** 0.24*** 0.05 0.24***(0.04) (0.04) (0.04) (0.04)First Grade 0.42*** 0.18*** 0.29*** 0.29***(0.04) (0.04) (0.04) (0.03)Second Grade 0.58*** 0.08*** 0.52*** 0.26***(0.04) (0.04) (0.04) (0.03)Third Grade 0.61*** 0.06 0.63*** 0.19***(0.04) (0.04) (0.04) (0.03)<strong>Gap</strong>s are average white score m<strong>in</strong>us average black score on <strong>the</strong> PIAT-RC.Column 4 represents <strong>the</strong> transformation that maximizes <strong>the</strong> correlationbetween <strong>the</strong> PIAT-RC at k<strong>in</strong>dergarten and <strong>the</strong> PPVT at age 3. Standarderrors are <strong>in</strong> paran<strong>the</strong>sis. *p


Table 4: <strong>Evolution</strong> <strong>of</strong> <strong>the</strong> black-white test gap under various transformations <strong>of</strong> <strong>the</strong> ECLS-KBasel<strong>in</strong>e M<strong>in</strong>imum Maximium Corr Max(1) (2) (3) (4)K<strong>in</strong>dergarten - Fall 0.40*** 0.46*** 0.11*** 0.47***(0.03) (0.03) (0.02) (0.04)K<strong>in</strong>dergarten - Spr<strong>in</strong>g 0.44*** 0.50*** 0.21*** 0.52***(0.03) (0.04) (0.02) (0.04)First Grade - Spr<strong>in</strong>g 0.49*** 0.49*** 0.43*** 0.49***(0.03) (0.04) (0.03) (0.04)Third Grade - Spr<strong>in</strong>g 0.75*** 0.51*** 0.75*** 0.73***(0.04) (0.02) (0.03) (0.03)<strong>Gap</strong>s are average white score m<strong>in</strong>us average black score on <strong>the</strong> ECLS-K read<strong>in</strong>gassessment. Standard errors are <strong>in</strong> paren<strong>the</strong>sis. *p


Table 5: <strong>Evolution</strong> <strong>of</strong> <strong>the</strong> unexpla<strong>in</strong>ed black-white test gap under various transformationsTransformation Basel<strong>in</strong>e M<strong>in</strong>imum Maximum Corr Max(1) (2) (3) (4)K<strong>in</strong>dergarten - Fall -0.05 -0.03 -0.04* -0.03(0.03) (0.04) (0.02) (0.04)K<strong>in</strong>dergarten - Spr<strong>in</strong>g 0.04 0.08** -0.01 0.10**(0.03) (0.04) (0.02) (0.04)First Grade - Spr<strong>in</strong>g 0.10*** 0.10** 0.08** 0.10**(0.04) (0.04) (0.03) (0.04)Third Grade - Spr<strong>in</strong>g 0.31*** 0.17*** 0.31*** 0.30***(0.04) (0.03) (0.04) (0.04)<strong>Gap</strong>s are <strong>the</strong> coeffi cient on a white <strong>in</strong>dicator variable with black as <strong>the</strong> excludedvariable. Each regression controls for SES, number <strong>of</strong> books <strong>in</strong> <strong>the</strong> home, gender,birth weight, <strong>in</strong>dicators for whe<strong>the</strong>r <strong>the</strong> mo<strong>the</strong>r was a teenager or over 30 at birth,and WIC recipiency. Standard errors <strong>in</strong> paren<strong>the</strong>sis. *p


Table 6: Scales which maximize <strong>the</strong> explanatary power <strong>of</strong> controlsPercent Difference Raw DifferenceNo Controls Controls No Controls Controls(1) (2) (3) (4)K<strong>in</strong>dergarten-Fall 0.07*** -0.03 0.07*** -0.04(0.02) (0.02) (0.02) (0.02)K<strong>in</strong>dergarten-Spr<strong>in</strong>g 0.14*** -0.00 0.15*** -0.00(0.02) (0.01) (0.02) (0.02)First Grade - Spr<strong>in</strong>g 0.32*** 0.05** 0.34*** 0.06**(0.02) (0.03) (0.02) (0.03)Third Grade - Spr<strong>in</strong>g 0.66*** 0.25*** 0.67*** 0.26***(0.03) (0.03) (0.03) (0.03)<strong>Gap</strong>s are <strong>the</strong> coeffi cient on a white <strong>in</strong>dicator variable with black as <strong>the</strong> excluded variable.Standard errors are <strong>in</strong> paran<strong>the</strong>sis. *p


Table 7: <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Gap</strong> as a Percentage <strong>of</strong> Boundary Under Various TransformationsBasel<strong>in</strong>e M<strong>in</strong>imiz<strong>in</strong>g Maximiz<strong>in</strong>g Corr Max(1) (2) (3) (4)Fall-K <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Gap</strong> 0.40 0.46 0.11 0.47Fall-K Maximum <strong>Test</strong> <strong>Gap</strong> 1.49 1.92 0.24 1.97Fall-K % <strong>of</strong> Maximum <strong>Gap</strong> 27.0% 24.1% 46.3% 23.9%Spr<strong>in</strong>g-3 <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Gap</strong> 0.75 0.51 0.75 0.73Spr<strong>in</strong>g-3 Maximum <strong>Gap</strong> 2.23 0.95 2.16 1.96Spr<strong>in</strong>g-3 % <strong>of</strong> Maximum <strong>Gap</strong> 33.4% 53.5% 34.7% 37.4%<strong>Gap</strong>s are average white score m<strong>in</strong>us average black score on <strong>the</strong> ECLS-K read<strong>in</strong>g assessment.54


Table 8: <strong>Evolution</strong> <strong>of</strong> <strong>Black</strong>-<strong>White</strong> <strong>Test</strong> <strong>Gap</strong> Under Fixed DistributionFall-K Spr<strong>in</strong>g-K Spr<strong>in</strong>g-1 Spr<strong>in</strong>g-3(1) (2) (3) (4)Panel A: Fixed Scale, Varied RankFall-K Distribution 0.40 0.42 0.44 0.62Spr<strong>in</strong>g-3 Distribution 0.49 0.53 0.51 0.75Panel B: Varied Scale, Fixed RankFall-K Rank 0.40 0.42 0.47 0.49Spr<strong>in</strong>g-3 Rank 0.62 0.65 0.72 0.75<strong>Gap</strong>s are average white score m<strong>in</strong>us average black score on <strong>the</strong> ECLS-Kread<strong>in</strong>g assessment.55

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!