parameters of abilities are known. They assume the following (Morrison 1993):

- There is a normal curve of distribution of scores in the population: the bell-shaped symmetry of the Gaussian curve of distribution seen, for example, in standardized scores of IQ, in the measurement of people's height, or in the distribution of achievement on reading tests in the population as a whole.
- There are continuous and equal intervals between the test scores and, with tests that have a true zero (see Chapter 24), the opportunity for a score of, say, 80 per cent to be double that of 40 per cent; this differs from the ordinal scaling of rating scales discussed earlier in connection with questionnaire design, where equal intervals between each score could not be assumed.

Parametric tests will usually be published tests which are commercially available and which have been piloted and standardized on a large and representative sample of the whole population. They usually arrive complete with the backup data on sampling, reliability and validity statistics which have been computed in the devising of the tests. Working with these tests enables the researcher to use statistics applicable to interval and ratio levels of data.

Non-parametric tests make few or no assumptions about the distribution of the population (the parameters of the scores) or the characteristics of that population. The tests do not assume a regular bell-shaped curve of distribution in the wider population; indeed the wider population is perhaps irrelevant, as these tests are designed for a given specific population – a class in school, a chemistry group, a primary school year group. Because they make no assumptions about the wider population, the researcher must work with non-parametric statistics appropriate to nominal and ordinal levels of data.
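The data-level distinction behind the two families can be sketched in a few lines of Python. The class marks below are invented purely for illustration: the same set of scores is first treated at the interval level, as parametric statistics assume (means, standard deviations, z-scores), and then at the ordinal level, as non-parametric statistics require (ranks and the median).

```python
# Sketch only: hypothetical marks for one class, used to contrast the
# assumptions of parametric and non-parametric treatment of scores.
from statistics import mean, median, stdev

scores = [35, 42, 48, 50, 55, 61, 70]  # one class's marks out of 100

# Parametric treatment: assumes interval/ratio data drawn from a roughly
# normal distribution, so means, standard deviations and standardized
# z-scores are meaningful quantities.
m, s = mean(scores), stdev(scores)
z_scores = [round((x - m) / s, 2) for x in scores]

# Non-parametric treatment: assumes only ordinal data for this specific
# group, so we fall back on ranks and the median.
ranks = {x: r for r, x in enumerate(sorted(scores), start=1)}

print(f"mean={m:.1f}, sd={s:.1f}, z-scores={z_scores}")
print(f"median={median(scores)}, ranks={[ranks[x] for x in scores]}")
```

The point of the contrast is that the z-scores only mean anything if the equal-interval and normality assumptions hold for the wider population, whereas the ranks require nothing beyond an ordering within this one group.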
Parametric tests, with a true zero and marks awarded, are the stock-in-trade of classroom teachers – the spelling test, the mathematics test, the end-of-year examination, the mock examination. The attraction of non-parametric statistics is their utility for small samples, because they do not make any assumptions about how normal, even and regular the distributions of scores will be. Furthermore, computation of statistics for non-parametric tests is less complicated than that for parametric tests. Non-parametric tests have the advantage of being tailored to particular institutional, departmental and individual circumstances. They offer teachers a valuable opportunity for quick, relevant and focused feedback on student performance.

Parametric tests are more powerful than non-parametric tests because they not only derive from standardized scores but also enable the researcher to compare sub-populations with a whole population (e.g. to compare the results of one school or local education authority with the whole country, for instance in comparing students' performance in norm-referenced or criterion-referenced tests against a national average score in that same test). They enable the researcher to use powerful statistics in data processing (see Chapters 24–26), and to make inferences about the results. Because non-parametric tests make no assumptions about the wider population, a different set of statistics is available to the researcher (see Chapter 24). These can be used in very specific situations – one class of students, one year group, one style of teaching, one curriculum area – and hence are valuable to teachers.

Norm-referenced, criterion-referenced and domain-referenced tests

A norm-referenced test compares students' achievements relative to other students' achievements, for example a national test of mathematical performance or a test of intelligence which has been standardized on a large and representative sample of students between the ages of 6 and 16.
A criterion-referenced test does not compare student with student but, rather, requires the student to fulfil a given set of criteria, a predefined and absolute standard or outcome (Cunningham 1998). For example, a driving test is usually criterion-referenced, since to pass it requires the ability to meet certain test items – reversing round a corner, undertaking an emergency stop, avoiding a crash, etc. – regardless of how many others have or have not passed the driving test. Similarly, many tests of playing a musical instrument require specified performances, such as the ability to play a particular scale or arpeggio, or the ability to play a Bach fugue without hesitation or technical error. If the student meets the criteria, then he or she passes the examination.

A criterion-referenced test provides the researcher with information about exactly what a student has learned, what he or she can do, whereas a norm-referenced test can only provide the researcher with information on how well one student has achieved in comparison with another, enabling rank orderings of performance and achievement to be constructed. Hence a major feature of the norm-referenced test is its ability to discriminate between students and their achievements – a well-constructed norm-referenced test enables differences in achievement to be measured acutely, i.e. to provide variability or a great range of scores. For a criterion-referenced test this is less of a problem: the intention here is to indicate whether students have achieved a set of given criteria, regardless of how many others might or might not have achieved them; hence variability or range is less important here.

More recently, an outgrowth of criterion-referenced testing has seen the rise of domain-referenced tests (Gipps 1994: 81). Here considerable significance is accorded to the careful and detailed specification of the content or the domain which will be assessed. The domain is the particular field or area of the subject that is being tested, for example, light in science, two-part counterpoint in music, parts of speech in English language. The domain is set out very clearly and very fully, such that the full depth and breadth of the content are established.
Test items are then selected from this very full field, with careful attention to sampling procedures so that representativeness of the wider field is ensured in the test items. The student's achievements on that test are computed to yield a proportion of the maximum score possible, and this, in turn, is used as an index of the proportion of the overall domain that she has grasped. So, for example, if a domain has 1,000 items and the test has 50 items, and the student scores 30 marks from the possible 50, then it is inferred that she has grasped 60 per cent ((30 ÷ 50) × 100) of the domain of 1,000 items. Here inferences are being made from a limited number of items to the student's achievements in the whole domain; this requires careful and representative sampling procedures for test items.

Commercially produced tests and researcher-produced tests

There is a battery of tests in the public domain which cover a vast range of topics and that can be used for evaluative purposes (references were indicated earlier). Most schools will have used published tests at one time or another. There are several attractions to using published tests:

- They are objective.
- They have been piloted and refined.
- They have been standardized across a named population (e.g. a region of the country, the whole country, a particular age group or various age groups) so that they represent a wide population.
- They declare how reliable and valid they are (mentioned in the statistical details which are usually contained in the manual of instructions for administering the test).
- They tend to be parametric tests, hence enabling sophisticated statistics to be calculated.
- They come complete with instructions for administration.
- They are often straightforward and quick to administer and to mark.
- Guides to the interpretation of the data are usually included in the manual.
- Researchers are spared the task of having to devise, pilot and refine their own test.
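Returning to the domain-referenced inference worked through earlier (30 marks from 50 sampled items standing for 60 per cent of a 1,000-item domain), the arithmetic can be set out as a short sketch; the function name is invented here, and the figures are the worked example's own hypothetical ones.

```python
# Domain-referenced inference: estimate the share of a large domain a
# student has grasped from a representative sample of test items.
# Figures follow the worked example in the text (hypothetical 1,000-item
# domain, 50-item test, 30 marks scored).
def domain_grasp(marks: int, test_items: int, domain_size: int) -> tuple[float, float]:
    """Return (percentage of domain grasped, estimated items grasped)."""
    proportion = marks / test_items          # e.g. 30 / 50 = 0.6
    return proportion * 100, proportion * domain_size

pct, items = domain_grasp(marks=30, test_items=50, domain_size=1000)
print(f"{pct:.0f} per cent of the domain, about {items:.0f} of 1,000 items")
```

The estimate is only as good as the sampling of the 50 items: if they do not represent the full depth and breadth of the domain, the inference from test score to domain mastery breaks down.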
On the other hand, Howitt and Cramer (2005) suggest that commercially produced tests are expensive to purchase and to administer; they are often targeted to special, rather than to general populations (e.g. in psychological testing), and