Parametric tests assume that the parameters of abilities are known. They assume the following (Morrison 1993):

- There is a normal curve of distribution of scores in the population: the bell-shaped symmetry of the Gaussian curve seen, for example, in standardized scores of IQ, in the measurement of people's height, or in the distribution of achievement on reading tests in the population as a whole.
- There are continuous and equal intervals between the test scores and, with tests that have a true zero (see Chapter 24), the opportunity for a score of, say, 80 per cent to be double that of 40 per cent; this differs from the ordinal scaling of rating scales discussed earlier in connection with questionnaire design, where equal intervals between each score could not be assumed.

Parametric tests will usually be published tests which are commercially available and which have been piloted and standardized on a large and representative sample of the whole population. They usually arrive complete with the back-up data on sampling, reliability and validity statistics which have been computed in the devising of the tests. Working with these tests enables the researcher to use statistics applicable to interval and ratio levels of data.

Non-parametric tests make few or no assumptions about the distribution of the population (the parameters of the scores) or the characteristics of that population. The tests do not assume a regular bell-shaped curve of distribution in the wider population; indeed, the wider population is perhaps irrelevant, as these tests are designed for a given specific population – a class in school, a chemistry group, a primary school year group. Because they make no assumptions about the wider population, the researcher must work with non-parametric statistics appropriate to nominal and ordinal levels of data.
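The contrast between the two families of statistics can be sketched in code. The scores below, and the use of IQ-style standardization (mean 100, standard deviation 15), are illustrative assumptions rather than data from the text; in practice a statistics library would normally be used:

```python
from math import erf, sqrt

# Parametric view: a normal curve is assumed, so any raw score on a
# standardized test (here, IQ-style: mean 100, SD 15) converts to a
# z-score and a percentile in the wider population.
def percentile_under_normal(score, mean=100.0, sd=15.0):
    z = (score - mean) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF

# Non-parametric view: no distribution is assumed. The Mann-Whitney U
# statistic compares two small groups using only the rank order of
# their scores, so it suits a single class or year group.
def mann_whitney_u(a, b):
    # U counts, over all pairs, how often a score in `a` exceeds one
    # in `b`; ties contribute one half.
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

p = percentile_under_normal(130)  # roughly 0.977, i.e. the top 2-3 per cent
u = mann_whitney_u([12, 15, 11, 18], [9, 14, 10, 13])  # U = 12 of a possible 16
```

The parametric calculation is only meaningful if the normality assumption holds for the population; the rank-based statistic makes no such claim, which is why it suits small, specific groups.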
Parametric tests, with a true zero and marks awarded, are the stock-in-trade of classroom teachers – the spelling test, the mathematics test, the end-of-year examination, the mock examination. The attraction of non-parametric statistics is their utility for small samples, because they do not make any assumptions about how normal, even and regular the distributions of scores will be. Furthermore, the computation of statistics for non-parametric tests is less complicated than that for parametric tests. Non-parametric tests have the advantage of being tailored to particular institutional, departmental and individual circumstances. They offer teachers a valuable opportunity for quick, relevant and focused feedback on student performance.

Parametric tests are more powerful than non-parametric tests because they not only derive from standardized scores but also enable the researcher to compare sub-populations with a whole population (e.g. to compare the results of one school or local education authority with the whole country, for instance in comparing students' performance in norm-referenced or criterion-referenced tests against a national average score in that same test). They enable the researcher to use powerful statistics in data processing (see Chapters 24–26), and to make inferences about the results. Because non-parametric tests make no assumptions about the wider population, a different set of statistics is available to the researcher (see Chapter 24). These can be used in very specific situations – one class of students, one year group, one style of teaching, one curriculum area – and hence are valuable to teachers.

Norm-referenced, criterion-referenced and domain-referenced tests

A norm-referenced test compares students' achievements relative to other students' achievements, for example a national test of mathematical performance or a test of intelligence which has been standardized on a large and representative sample of students between the ages of 6 and 16.
A criterion-referenced test does not compare student with student but, rather, requires the student to fulfil a given set of criteria, a predefined and absolute standard or outcome (Cunningham 1998). For example, a driving test is usually criterion-referenced, since to pass it requires the ability to meet certain test items – reversing round a corner, undertaking an emergency stop, avoiding a crash, etc. – regardless of how many others have or have not passed the driving test. Similarly, many tests of playing a musical instrument require specified performances, such as the ability to play a particular scale or arpeggio, or the ability to play a Bach fugue without hesitation or technical error. If the student meets the criteria, then he or she passes the examination.

A criterion-referenced test provides the researcher with information about exactly what a student has learned, what he or she can do, whereas a norm-referenced test can only provide the researcher with information on how well one student has achieved in comparison with another, enabling rank orderings of performance and achievement to be constructed. Hence a major feature of the norm-referenced test is its ability to discriminate between students and their achievements – a well-constructed norm-referenced test enables differences in achievement to be measured acutely, i.e. to provide variability or a great range of scores. For a criterion-referenced test this is less of a problem: the intention here is to indicate whether students have achieved a set of given criteria, regardless of how many others might or might not have achieved them; hence variability or range is less important here.

More recently, an outgrowth of criterion-referenced testing has seen the rise of domain-referenced tests (Gipps 1994: 81). Here considerable significance is accorded to the careful and detailed specification of the content or the domain which will be assessed. The domain is the particular field or area of the subject that is being tested, for example light in science, two-part counterpoint in music, or parts of speech in English language. The domain is set out very clearly and very fully, such that the full depth and breadth of the content are established.
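The inference from a sampled test to the whole domain is simple proportional arithmetic; a minimal sketch, using the text's 1,000-item domain and 50-item test figures (the function name is illustrative):

```python
def estimate_domain_mastery(score, test_items, domain_items):
    """Infer the proportion of a domain mastered from a sampled test.

    Assumes the test items are a representative sample of the domain,
    so the proportion correct on the test estimates the proportion of
    the whole domain the student has grasped.
    """
    proportion = score / test_items
    return proportion, round(proportion * domain_items)

# 30 marks out of a possible 50, drawn from a 1,000-item domain.
prop, items_grasped = estimate_domain_mastery(30, 50, 1000)
# prop == 0.6 (60 per cent), items_grasped == 600
```

The arithmetic is trivial; the validity of the inference rests entirely on how representatively the 50 items sample the 1,000-item domain.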
Test items are then selected from this very full field, with careful attention to sampling procedures so that the representativeness of the wider field is ensured in the test items. The student's achievements on that test are computed to yield a proportion of the maximum score possible, and this, in turn, is used as an index of the proportion of the overall domain that she has grasped. So, for example, if a domain has 1,000 items and the test has 50 items, and the student scores 30 marks from the possible 50, then it is inferred that she has grasped 60 per cent ((30 ÷ 50) × 100) of the domain of 1,000 items. Here inferences are being made from a limited number of items to the student's achievements in the whole domain; this requires careful and representative sampling procedures for test items.

Commercially produced tests and researcher-produced tests

There is a battery of tests in the public domain which cover a vast range of topics and which can be used for evaluative purposes (references were indicated earlier). Most schools will have used published tests at one time or another. There are several attractions to using published tests:

- They are objective.
- They have been piloted and refined.
- They have been standardized across a named population (e.g. a region of the country, the whole country, a particular age group or various age groups) so that they represent a wide population.
- They declare how reliable and valid they are (mentioned in the statistical details which are usually contained in the manual of instructions for administering the test).
- They tend to be parametric tests, hence enabling sophisticated statistics to be calculated.
- They come complete with instructions for administration.
- They are often straightforward and quick to administer and to mark.
- Guides to the interpretation of the data are usually included in the manual.
- Researchers are spared the task of having to devise, pilot and refine their own test.
On the other hand, Howitt and Cramer (2005) suggest that commercially produced tests are expensive to purchase and to administer; they are often targeted to special, rather than to general populations (e.g. in psychological testing), and
