
To aggregate the scores of a criterion-referenced test (e.g. to assign grades) is to lose the very purpose of the criterion-referencing (Gipps 1994: 85). For example, if a student is awarded a grade E for spelling in English, and a grade A for imaginative writing, this could be aggregated into a C grade as an overall grade of the student’s English language competence, but what does this C grade mean? It is meaningless: it has no frame of reference or clear criteria; it loses the useful specificity of the A and E grades; it is a compromise that actually tells us nothing. Further, aggregating such grades assumes equal levels of difficulty of all items.

Of course, raw scores are still open to interpretation – which is a matter of judgement rather than exactitude or precision (Wiliam 1996). For example, if a test is designed to assess ‘mastery’ of a subject, then the researcher is faced with the issue of deciding what constitutes ‘mastery’ – is it an absolute (i.e. a very high score), or are there gradations, and, if the latter, where do these gradations fall? For published tests the scoring is standardized and already made clear, as are the conversions of scores into, for example, percentiles and grades.

Underpinning the discussion of scoring is the need to make it unequivocally clear exactly what the marking criteria are – what will and will not score points. This requires a clarification of whether there is a ‘checklist’ of features that must be present in a student’s answer. Clearly criterion-referenced tests will have to declare their lowest boundary – a cut-off point – below which the student is deemed to have failed to meet the criteria. A compromise can be seen in those criterion-referenced tests that award different grades for different levels of performance of the same task, necessitating the clarification of different cut-off points in the examination. A common example of this can be seen in the GCSE examinations for secondary school pupils in the United Kingdom, where students can achieve a grade between A and F for a criterion-related examination.

The determination of cut-off points has been addressed by Nedelsky (1954), Angoff (1971), Ebel (1979) and Linn (1993). Angoff (1971) suggests a method for dichotomously scored items. Here judges are asked to identify the proportion of minimally acceptable persons who would answer each item correctly; the sum of these proportions is then taken to represent the minimally acceptable score. An elaborated version of this principle comes from Ebel (1979). Here a difficulty-by-relevance matrix is constructed for all the items. Difficulty might be assigned three levels (e.g. easy, medium and hard) and relevance might be assigned three levels (e.g. highly relevant, moderately relevant, barely relevant). When each and every test item has been assigned to the cells of the matrix, the judges estimate the proportion of items in each cell that minimally acceptable persons would answer correctly, the standard for each judge being the weighted average of the proportions in each cell (the weights being determined by the number of items in each cell). In this method judges have to consider two factors – relevance and difficulty (unlike Angoff (1971), where only difficulty featured). What characterizes these approaches is the trust that they place in experts in making judgements about levels (e.g. of difficulty, relevance, or proportions of successful achievement); that is, they are based on fallible human subjectivity.
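To make the arithmetic of these two standard-setting procedures concrete, here is a minimal sketch in Python. The function names, the judges’ estimates and the matrix counts are all invented for illustration; they are not taken from Angoff or Ebel.

```python
# Minimal sketch of the Angoff (1971) and Ebel (1979) standard-setting
# procedures; all estimates below are hypothetical.

def angoff_cutoff(judge_proportions):
    """Each judge estimates, per item, the proportion of minimally
    acceptable candidates who would answer correctly; a judge's cut-off
    is the sum of those proportions, and the test's cut-off is the
    average across judges."""
    per_judge = [sum(props) for props in judge_proportions]
    return sum(per_judge) / len(per_judge)

def ebel_cutoff(cells):
    """Items sit in a difficulty-by-relevance matrix; for each cell a
    judge estimates the proportion of items a minimally acceptable
    candidate would answer correctly. The standard is the average of
    those proportions weighted by the number of items in each cell."""
    total_items = sum(n for n, _ in cells.values())
    expected_correct = sum(n * p for n, p in cells.values())
    return expected_correct / total_items

# Two judges' per-item estimates for a five-item test (hypothetical).
judges = [[0.9, 0.7, 0.5, 0.8, 0.6],
          [0.8, 0.6, 0.6, 0.9, 0.5]]
print(angoff_cutoff(judges))   # 3.45 items out of 5

# (difficulty, relevance) -> (number of items, estimated proportion correct)
matrix = {("easy", "highly relevant"): (10, 0.9),
          ("medium", "moderately relevant"): (8, 0.6),
          ("hard", "barely relevant"): (2, 0.3)}
print(ebel_cutoff(matrix))     # 0.72, i.e. 14.4 of 20 items
```

The sketch makes the contrast visible: Angoff’s standard is built item by item, whereas Ebel’s weights a single per-cell judgement by the number of items in each cell.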
Ebel (1979) argues that one principle in the assignation of grades is that they should represent equal intervals on the score scale. Reference is made to median scores and standard deviations: median scores because it is meaningless to assume an absolute zero on scoring, and standard deviations as the unit of convenient size for the inclusion of scores in each grade (see also Cohen and Holliday 1996). One procedure is thus:

- Calculate the median and standard deviation of the scores.
- Determine the lower score limits of the mark intervals, using the median as the reference point and the standard deviation as the unit of size for each grade.

However, the issue of cut-off scores is complicated by the fact that they may vary according to the different purposes and uses of scores (e.g. for diagnosis, for certification, for selection, for programme evaluation), as these purposes will affect the number of cut-off points and grades, and the precision of detail required. For a full analysis of determining cut-off grades see Linn (1993).
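As an illustration of this procedure, the following minimal Python sketch computes lower score limits for a set of grades. The choice of five grades, the one-standard-deviation band width and the centring of the middle grade on the median are assumptions made for the example rather than values prescribed by Ebel.

```python
import statistics

def grade_boundaries(scores, grades=("A", "B", "C", "D", "E")):
    """Lay out mark intervals of equal width (here one standard
    deviation) around the median and return each grade's lower score
    limit. The five grades, the one-SD band width and the centring of
    the middle band on the median are illustrative assumptions."""
    med = statistics.median(scores)
    sd = statistics.stdev(scores)
    # The top grade's lower limit sits (n/2 - 1) SDs above the median,
    # which centres the middle band on the median.
    top_lower = med + (len(grades) / 2 - 1) * sd
    return {g: top_lower - i * sd for i, g in enumerate(grades)}

scores = [34, 41, 45, 48, 50, 52, 55, 58, 61, 67]  # hypothetical raw scores
for grade, lower in grade_boundaries(scores).items():
    print(f"{grade}: scores >= {lower:.1f}")
```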

The issue of scoring takes in a range of factors, for example: grade norms, age norms, percentile norms and standard score norms (e.g. z-scores, T-scores, stanine scores, percentiles). These are beyond the scope of this book to discuss, but readers are referred to Cronbach (1970), Gronlund and Linn (1990), Cohen and Holliday (1996) and Hopkins et al. (1996); a brief illustrative sketch of the standard score conversions appears at the end of this section.

Devising a pretest and post-test

The construction and administration of tests is an essential part of the experimental model of research, where a pretest and a post-test have to be devised for the control and experimental groups. The pretest and post-test must adhere to several guidelines:

- The pretest may have questions which differ in form or wording from those of the post-test, though the two tests must test the same content, i.e. they will be alternate forms of a test for the same groups.
- The pretest must be the same for the control and experimental groups.
- The post-test must be the same for both groups.
- Care must be taken in the construction of a post-test to avoid making the test easier to complete by one group than another.
- The level of difficulty must be the same in both tests.

Test data feature centrally in the experimental model of research; additionally they may feature as part of questionnaire, interview and documentary material.

Reliability and validity of tests

Chapter 6 covers issues of reliability and validity. Suffice it here to say that reliability concerns the degree of confidence that can be placed in the results and the data, which is often a matter of statistical calculation and subsequent test redesigning. Validity, on the other hand, concerns the extent to which the test tests what it is supposed to test. This devolves on content, construct, face, criterion-related and concurrent validity.

Ethical issues in preparing for tests

A major source of unreliability of test data derives from the extent and ways in which students have been prepared for the test. These can be located on a continuum from direct and specific preparation, through indirect and general preparation, to no preparation at all. With the growing demand for test data (e.g. for selection, for certification, for grading, for employment, for tracking, for entry to higher education, for accountability, for judging schools and teachers) there is a perhaps understandable pressure to prepare students for tests. This is the ‘high-stakes’ aspect of testing (Harlen 1994), where much hinges on the test results. At one level this can be seen in the backwash effect of examinations on curricula and syllabuses; at another level it can lead to the direct preparation of students for specific examinations. Preparation can take many forms (Mehrens and Kaminski 1989; Gipps 1994):

- ensuring coverage, among other programme contents and objectives, of the objectives and programme that will be tested
- restricting the coverage of the programme content and objectives to only those that will be tested
- preparing students with ‘exam technique’
- practising with past or similar papers
- directly matching the teaching to specific test items, where each piece of teaching and contents is the same as each test item
- practising on an exactly parallel form of the test
- telling students in advance what will appear on the test
- practising on and preparing the identical test itself (e.g. giving out test papers in advance) without teacher input
- practising on and preparing the identical test itself (e.g. giving out the test papers in advance), with the teacher working through the items, maybe providing sample answers.
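Finally, as noted under scoring above, here is a brief illustrative sketch of the standard score conversions (z-scores, T-scores and stanines), using the conventional formulas (z = (x − mean)/SD; T = 50 + 10z; stanines with mean 5 and SD 2, rounded and clamped to 1–9). The raw scores are hypothetical and the snippet is an illustration only.

```python
import statistics

def standard_scores(raw_scores):
    """Convert raw scores to standard score norms: z locates a score in
    SD units from the mean; T rescales z to mean 50, SD 10; a stanine
    compresses z onto a 1-9 scale (mean 5, SD 2, rounded and clamped)."""
    mean = statistics.mean(raw_scores)
    sd = statistics.stdev(raw_scores)
    results = []
    for x in raw_scores:
        z = (x - mean) / sd
        t = 50 + 10 * z
        stanine = min(9, max(1, round(2 * z + 5)))
        results.append((x, round(z, 2), round(t, 1), stanine))
    return results

for raw, z, t, s in standard_scores([34, 48, 51, 55, 67]):  # hypothetical
    print(f"raw={raw}  z={z}  T={t}  stanine={s}")
```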

