
CONSTRUCTING A TEST

The index of discriminability is calculated as:

(A - B) / ½N

where A = the number of correct scores from the high-scoring group, B = the number of correct scores from the low-scoring group, and N = the total number of students in the two groups. Suppose all 10 students from the high-scoring group answered the item correctly and 2 students from the low-scoring group answered the item correctly. The formula would work out thus:

(10 - 2) / ½(10 + 10) = 8/10 = 0.80 (index of discriminability)

The maximum index of discriminability is 1.00. Any item whose index of discriminability is less than 0.67, i.e. is too undiscriminating, should be reviewed first to find out whether this is due to ambiguity or possible clues in the wording. If this is not the case, then whether the researcher uses an item with an index lower than 0.67 is a matter of judgement. It would appear, then, that the item in the example would be appropriate to use in a test. For a further discussion of item discriminability see Linn (1993) and Aiken (2003).

One can use the discriminability index to examine the effectiveness of distractors. This is based on the premise that an effective distractor should attract more students from a low-scoring group than from a high-scoring group. Consider the following example, where low- and high-scoring groups are identified:

                      A    B    C
Top 10 students      10    0    2
Bottom 10 students    8    0   10

In example A, the item discriminates positively in that it attracts more responses (10) from the top 10 students than from the bottom 10 (8), and hence is a poor distractor; here, also, the discriminability index is only 0.20, so it is a poor discriminator as well as a poor distractor. Example B is an ineffective distractor because it attracted nobody from either group. Example C is an effective distractor because it attracts far more students from the bottom 10 (10) than from the top 10 (2). However, in this case any ambiguities must be ruled out before the discriminating power can be improved.

Distractors are the stuff of multiple-choice items, where incorrect alternatives are offered and students have to select the correct alternatives. Here a simple frequency count of the number of times a particular alternative is selected will provide information on the effectiveness of the distractor: if it is selected many times then it is working effectively; if it is seldom or never selected then it is not working effectively and it should be replaced.

If we wish to calculate the item difficulty of a test, we can use the following formula:

(A / N) × 100

where A = the number of students who answered the item correctly and N = the total number of students who attempted the item. Hence if 12 students out of a class of 20 answered the item correctly, the formula would work out thus:

(12 / 20) × 100 = 60 per cent

The maximum index of difficulty is 100 per cent. Items falling below 33 per cent and above 67 per cent are likely to be too difficult and too easy respectively. It would appear, then, that this item would be appropriate to use in a test. Here, again, whether the researcher uses an item with an index of difficulty below or above the cut-off points is a matter of judgement. In a norm-referenced test the item difficulty should be around 50 per cent (Frisbie 1981). For further discussion of item difficulty see Linn (1993) and Hanna (1993).
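As a minimal illustration of the three calculations just described (the index of discriminability, the item difficulty index and the distractor frequency count), the sketch below works through the examples in the text. The function names and the sample responses are invented for this illustration; they are not taken from the sources cited above.

```python
from collections import Counter

def discriminability(a_high: int, b_low: int, n_total: int) -> float:
    """Index of discriminability, (A - B) / (N/2): A and B are the numbers
    of correct answers from the high- and low-scoring groups, N the total
    number of students in the two groups."""
    return (a_high - b_low) / (n_total / 2)

def difficulty(n_correct: int, n_attempted: int) -> float:
    """Item difficulty as a percentage: (A / N) x 100."""
    return 100 * n_correct / n_attempted

# Worked examples from the text
print(discriminability(10, 2, 20))  # 0.8  (maximum 1.00; below 0.67 warrants review)
print(difficulty(12, 20))           # 60.0 per cent (within the 33-67 per cent band)

# Distractor analysis: count how often each alternative to a
# multiple-choice item is chosen; a distractor that is seldom or
# never chosen is not working and should be replaced.
answers = list("ABACCABACA")  # hypothetical responses to one item
print(Counter(answers))       # Counter({'A': 5, 'C': 3, 'B': 2})
```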

Given that the researcher can know the degree of item discriminability and difficulty only once the test has been undertaken, there is an unavoidable need to pilot home-grown tests. Items with limited discriminability and unsuitable degrees of difficulty must be weeded out and replaced; those items with the greatest discriminability and the most appropriate degrees of difficulty can be retained. This can be undertaken only once data from a pilot have been analysed; a sketch of such a screening pass follows the list of questions below.

Item discriminability and item difficulty take on differential significance in norm-referenced and criterion-referenced tests. In a norm-referenced test we wish to compare students with each other, hence item discriminability is very important. In a criterion-referenced test, on the other hand, it is not important per se to be able to compare or discriminate between students' performances. For example, it may be the case that we wish to discover whether a group of students has learnt a particular body of knowledge (that is the objective), rather than, say, finding out how many have learned it better than others. Hence it may be that a criterion-referenced test has very low discriminability if all the students achieve very well or achieve very poorly, but the discriminability is less important than the fact that the students have or have not learnt the material. A norm-referenced test would regard such a poorly discriminating item as unsuitable for inclusion, whereas a criterion-referenced test would regard such an item as providing useful information (on success or failure).

With regard to item difficulty, in a criterion-referenced test the level of difficulty is that which is appropriate to the task or objective. Hence if an objective is easily achieved then the test item should be easily achieved; if the objective is difficult then the test item should be correspondingly difficult. This means that, unlike a norm-referenced test, where an item might be reworked in order to increase its discriminability index, this is less of an issue in criterion-referencing. Of course, this is not to deny the value of undertaking an item difficulty analysis; rather it is to question the centrality of such a concern. Gronlund and Linn (1990: 265) suggest that where instruction has been effective the item difficulty index of a criterion-referenced test will be high.

In addressing the item discriminability, item difficulty and distractor effect of particular test items, it is advisable, of course, to pilot these tests and to be cautious about placing too great a store on indices of difficulty and discriminability that are computed from small samples.

In constructing a test with item analysis, item discriminability, item difficulty and distractor effects in mind, it is important also to consider the actual requirements of the test (Nuttall 1987; Cresswell and Houston 1991):

- Are all the items in the test equally difficult?
- If not, what makes some items more difficult than the rest?
- Which items are easy, moderately hard, hard or very hard?
- What kinds of task is each item addressing: is it a practice item (repeating known knowledge), an application item (applying known knowledge), or a synthesis item (bringing together and integrating diverse areas of knowledge)?
- Are the items sufficiently within the experience of the students?
- How motivated will students be by the contents of each item (i.e. how relevant will they perceive the item to be, and how interesting is it)?
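To make the weeding-out of piloted items concrete, the short sketch below applies the rule-of-thumb thresholds given earlier: review any item whose discriminability index falls below 0.67 or whose difficulty index lies outside 33-67 per cent. The item records and the screening logic are invented here for illustration; they are not a procedure prescribed by the sources cited above.

```python
# A hypothetical screening pass over pilot data, using the
# rule-of-thumb thresholds from the text. The records are invented.
pilot_items = [
    # (item id, discriminability index, difficulty in per cent)
    ("Q1", 0.80, 60.0),   # discriminates well, moderate difficulty
    ("Q2", 0.20, 90.0),   # undiscriminating and too easy
    ("Q3", 0.70, 25.0),   # too difficult
]

for item_id, disc, diff in pilot_items:
    problems = []
    if disc < 0.67:
        problems.append("undiscriminating: check the wording for ambiguity or clues")
    if diff < 33:
        problems.append("too difficult")
    elif diff > 67:
        problems.append("too easy")
    print(item_id, "retain" if not problems else "review: " + "; ".join(problems))
```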
The contents of the test will also need to take account of the notion of fitness for purpose, for example in the types of test items. Here the researcher will need to consider whether ability, understanding and achievement will be best demonstrated in, for example (Lewis 1974; Cohen et al. 2004: ch. 16):

- an open essay
- a factual and heavily directed essay
- short answer questions
- divergent thinking items
- completion items
- multiple-choice items (with one correct answer or more than one correct answer)
- matching pairs of items or statements
- inserting missing words
- incomplete sentences or incomplete, unlabelled diagrams
