BMJ Statistical Notes Series
List by JM Bland: http://www-users.york.ac.uk/~mb55/pubs/pbstnote.htm
1 Bland JM, Altman DG. Correlation, regression and repeated data. BMJ 1994;308:896
2 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499
3 Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552
4 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994;309:102
5 Altman DG, Bland JM. Diagnostic tests 3: receiver operating characteristic plots. BMJ 1994;309:188
6 Bland JM, Altman DG. One- and two-sided tests of significance. BMJ 1994;309:248
7 Bland JM, Altman DG. Some examples of regression towards the mean. BMJ 1994;309:780
8 Altman DG, Bland JM. Quartiles, quintiles, centiles, and other quantiles. BMJ 1994;309:996
9 Bland JM, Altman DG. Matching. BMJ 1994;309:1128
10 Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ 1995;310:170
11 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298
12 Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 1, correlation within subjects. BMJ 1995;310:446
13 Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 2, correlation between subjects. BMJ 1995;310:633
14 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995;311:485
15 Altman DG, Bland JM. Presentation of numerical data. BMJ 1996;312:572
16 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700
17 Bland JM, Altman DG. Transforming data. BMJ 1996;312:770
18 Bland JM, Altman DG. Transformations, means and confidence intervals. BMJ 1996;312:1079
19 Bland JM, Altman DG. The use of transformations when comparing two means. BMJ 1996;312:1153
20 Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472-1473
21 Bland JM, Altman DG. Measurement error. BMJ 1996;312:1654
22 Bland JM, Altman DG. Measurement error and correlation coefficients. BMJ 1996;313:41-42 (includes corrected table)
23 Bland JM, Altman DG. Measurement error proportional to the mean. BMJ 1996;313:106
24 Altman DG, Matthews JNS. Interaction 1: Heterogeneity of effects. BMJ 1996;313:486
25 Matthews JNS, Altman DG. Interaction 2: compare effect sizes not P values. BMJ 1996;313:808
26 Matthews JNS, Altman DG. Interaction 3: How to examine heterogeneity. BMJ 1996;313:862
27 Altman DG, Bland JM. Detecting skewness from summary information. BMJ 1996;313:1200
28 Bland JM, Altman DG. Cronbach's alpha. BMJ 1997;314:572
29 Altman DG, Bland JM. Units of analysis. BMJ 1997;314:1874
30 Bland JM, Kerry SM. Trials randomised in clusters. BMJ 1997;315:600
31 Kerry SM, Bland JM. Analysis of a trial randomised in clusters. BMJ 1998;316:54
32 Bland JM, Kerry SM. Weighted comparison of means. BMJ 1998;316:129
33 Kerry SM, Bland JM. Sample size in cluster randomisation. BMJ 1998;316:549
34 Kerry SM, Bland JM. The intra-cluster correlation coefficient in cluster randomisation. BMJ 1998;316:1455
35 Altman DG, Bland JM. Generalisation and extrapolation. BMJ 1998;317:409-410
36 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-469
37 Bland JM, Altman DG. Bayesians and frequentists. BMJ 1998;317:1151
38 Bland JM, Altman DG. Survival probabilities (the Kaplan-Meier method). BMJ 1998;317:1572
39 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209
40 Altman DG, Bland JM. Variables and parameters. BMJ 1999;318:1667
41 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-704
42 Bland JM, Altman DG. The odds ratio. BMJ 2000;320:1468
43 Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504
44 Altman DG, Schulz KF. Concealing treatment allocation in randomised trials. BMJ 2001;323:446-447
45 Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow up measurements. BMJ 2001;323:1123-1124
46 Bland JM, Altman DG. Validating scales and indexes. BMJ 2002;324:606-607
47 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219
48 Bland JM, Altman DG. The logrank test. BMJ 2004;328:1073
49 Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ 2004;329:168-169
50 Altman DG, Bland JM. Treatment allocation by minimisation. BMJ 2005;330:843
51 Altman DG, Bland JM. Standard deviations and standard errors. BMJ 2005;331:903
52 Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ 2006;332:1080
53 Altman DG, Bland JM. Missing data. BMJ 2007;334:424
54 Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;339:170
55 Bland JM, Altman DG. Analysis of continuous data from small samples. BMJ 2009;339:961-962
56 Bland JM, Altman DG. Comparisons within randomised groups can be very misleading. BMJ 2011;342:d561
57 Bland JM, Altman DG. Correlation in restricted ranges of data. BMJ 2011;343:577
58 Altman DG, Bland JM. How to obtain the confidence interval from a P value. BMJ 2011;343:681
59 Altman DG, Bland JM. How to obtain the P value from a confidence interval. BMJ 2011;343:682
60 Altman DG, Bland JM. Brackets (parentheses) in formulas. BMJ 2011;343:578
Downloaded from bmj.com on 28 February 2008
Statistics Notes

Measurement error

J Martin Bland, Douglas G Altman

This is the 21st in a series of occasional notes on medical statistics.

Several measurements of the same quantity on the same subject will not in general be the same. This may be because of natural variation in the subject, variation in the measurement process, or both. For example, table 1 shows four measurements of lung function in each of 20 schoolchildren (taken from a larger study1). The first child shows typical variation, having peak expiratory flow rates of 190, 220, 200, and 200 l/min.

Let us suppose that the child has a "true" average value over all possible measurements, which is what we really want to know when we make a measurement. Repeated measurements on the same subject will vary around the true value because of measurement error. The standard deviation of repeated measurements on the same subject enables us to measure the size of the measurement error. We shall assume that this standard deviation is the same for all subjects, as otherwise there would be no point in estimating it. The main exception is when the measurement error depends on the size of the measurement, usually with measurements becoming more variable as the magnitude of the measurement increases. We deal with this case in a subsequent statistics note. The common standard deviation of repeated measurements is known as the within-subject standard deviation, which we shall denote by sw.

Table 1: Repeated peak expiratory flow rate (PEFR, l/min) measurements for 20 schoolchildren

Child No  1st  2nd  3rd  4th  Mean    SD
 1        190  220  200  200  202.50  12.58
 2        220  200  240  230  222.50  17.08
 3        260  260  240  280  260.00  16.33
 4        210  300  280  265  263.75  38.60
 5        270  265  280  270  271.25   6.29
 6        280  280  270  275  276.25   4.79
 7        260  280  280  300  280.00  16.33
 8        275  275  275  305  282.50  15.00
 9        280  290  300  290  290.00   8.16
10        320  290  300  290  300.00  14.14
11        300  300  310  300  302.50   5.00
12        270  250  330  370  305.00  55.08
13        320  330  330  330  327.50   5.00
14        335  320  335  375  341.25  23.58
15        350  320  340  365  343.75  18.87
16        360  320  350  345  343.75  17.02
17        330  340  380  390  360.00  29.44
18        335  385  360  370  362.50  21.02
19        400  420  425  420  416.25  11.09
20        430  460  480  470  460.00  21.60

To estimate the within-subject standard deviation, we need several subjects with at least two measurements for each. In addition to the data, table 1 also shows the mean and standard deviation of the four readings for each child. To get the common within-subject standard deviation we actually average the variances, the squares of the standard deviations. The mean within-subject variance is 460.52, so the estimated within-subject standard deviation is sw = √460.5 = 21.5 l/min. The calculation is easier using a program that performs one way analysis of variance2 (table 2). The value called the residual mean square is the within-subject variance. The analysis of variance method is the better approach in practice, as it deals automatically with the case of subjects having different numbers of observations. We should check the assumption that the standard deviation is unrelated to the magnitude of the measurement. This can be done graphically, by plotting the individual subjects' standard deviations against their means (see fig 1). Any important relation should be fairly obvious, but we can check analytically by calculating a rank correlation coefficient. For the figure there does not appear to be a relation (Kendall's τ = 0.16, P = 0.3).

[Fig 1: Individual children's standard deviations plotted against their means (subject mean, l/min).]

The measurement error can be quoted as sw. The difference between a subject's measurement and the true value would be expected to be less than 1.96 sw for 95% of observations. Another useful way of presenting measurement error is sometimes called the repeatability, which is √2 × 1.96 sw, or 2.77 sw. The difference between two measurements for the same subject is expected to be less than 2.77 sw for 95% of pairs of observations. For the data in table 1 the repeatability is 2.77 × 21.5 = 60 l/min. The large variability in peak expiratory flow rate is well known, so individual readings of peak expiratory flow are seldom used. The variable used for analysis in the study from which table 1 was taken was the mean of the last three readings.1

A common design is to take only two measurements per subject. In this case the method can be simplified, because the variance of two observations is half the square of their difference. So if the difference between the two observations for subject i is di, the within-subject standard deviation sw is given by sw = √(Σdi²/2n), where n is the number of subjects. We can check for a relation between standard deviation and mean by plotting for each subject the absolute value of the difference (that is, ignoring any sign) against the mean. Other ways of describing the repeatability of measurements will be considered in subsequent statistics notes.

Table 2: One way analysis of variance for the data of table 1

Source of variation  Degrees of freedom  Sum of squares  Mean square  Variance ratio (F)
Children             19                  285318.44       15016.78     32.6
Residual             60                  27631.25        460.52

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics
ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF
Douglas G Altman, head

1 Bland JM, Holland WW, Elliott A. The development of respiratory symptoms in a cohort of Kent schoolchildren. Bull Physio-path Resp 1974;10:699-716.
2 Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472.
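The within-subject standard deviation and repeatability described in this note can be sketched in Python. This is our illustration, not the authors' code; the readings are transcribed from table 1, with the fourth reading for child 15 taken as 365 l/min, the value consistent with the printed mean (343.75) and SD (18.87).

```python
# Within-subject standard deviation for the PEFR data of table 1.
from statistics import variance
from math import sqrt

# Four PEFR readings (l/min) for each of the 20 children in table 1.
pefr = [
    [190, 220, 200, 200], [220, 200, 240, 230], [260, 260, 240, 280],
    [210, 300, 280, 265], [270, 265, 280, 270], [280, 280, 270, 275],
    [260, 280, 280, 300], [275, 275, 275, 305], [280, 290, 300, 290],
    [320, 290, 300, 290], [300, 300, 310, 300], [270, 250, 330, 370],
    [320, 330, 330, 330], [335, 320, 335, 375], [350, 320, 340, 365],
    [360, 320, 350, 345], [330, 340, 380, 390], [335, 385, 360, 370],
    [400, 420, 425, 420], [430, 460, 480, 470],
]

# Average the within-child variances (the squares of the SDs in table 1);
# this equals the residual mean square of a one way analysis of variance
# when every child has the same number of readings.
mean_within_var = sum(variance(child) for child in pefr) / len(pefr)
sw = sqrt(mean_within_var)        # within-subject standard deviation
repeatability = 2.77 * sw         # 95% limit for a within-subject difference

print(round(mean_within_var, 2))  # 460.52
print(round(sw, 1))               # 21.5
print(round(repeatability, 1), "l/min")
```

Rounding sw to 21.5 before multiplying, as the note does, gives the quoted repeatability of 60 l/min.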
Statistics Notes

Measurement error and correlation coefficients

J Martin Bland, Douglas G Altman

This is the 22nd in a series of occasional notes on medical statistics.

Measurement error is the variation between measurements of the same quantity on the same individual.1 To quantify measurement error we need repeated measurements on several subjects. We have discussed the within-subject standard deviation as an index of measurement error,1 which we like as it has a simple clinical interpretation. Here we consider the use of correlation coefficients to quantify measurement error.

A common design for the investigation of measurement error is to take pairs of measurements on a group of subjects, as in table 1. When we have pairs of observations it is natural to plot one measurement against the other. The resulting scatter diagram (see figure 1) may tempt us to calculate a correlation coefficient between the first and second measurement. There are difficulties in interpreting this correlation coefficient. In general, the correlation between repeated measurements will depend on the variability between subjects. Samples containing subjects who differ greatly will produce larger correlation coefficients than will samples containing similar subjects. For example, suppose we split this group, in whom we have measured forced expiratory volume in one second (FEV1), into two subsamples, the first 10 subjects and the second 10 subjects. As table 1 is ordered by the first FEV1 measurement, both subsamples vary less than does the whole sample. The correlation for the first subsample is r = 0.63 and for the second it is r = 0.31, both less than r = 0.77 for the full sample. The correlation coefficient thus depends on the way the sample is chosen, and it has meaning only for the population from which the study subjects can be regarded as a random sample. If we select subjects to give a wide range of the measurement, the natural approach when investigating measurement error, this will inflate the correlation coefficient.

The correlation coefficient between repeated measurements is often called the reliability of the measurement method. It is widely used in the validation of psychological measures such as scales of anxiety and depression, where it is known as the test-retest reliability. In such studies it is quoted for different populations (university students, psychiatric outpatients, etc) because the correlation coefficient differs between them as a result of differing ranges of the quantity being measured. The user has to select the correlation from the study population most like the user's own.

Another problem with the use of the correlation coefficient between the first and second measurements is ...

Table 1: Pairs of measurements of FEV1 (litres) a few weeks apart from 20 Scottish schoolchildren, taken from a larger study (D Strachan, personal communication). [The table as originally printed was incorrect; the correct data appear in the correction reproduced below.]

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE
J Martin Bland, professor of medical statistics
ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF
Douglas G Altman, head
Correspondence to: Professor Bland
BMJ 1996;313:41-2

BMJ VOLUME 313, 6 JULY 1996
Table 2: One way analysis of variance for the data in table 1

Source of variation  Degrees of freedom  Sum of squares  Mean square  Variance ratio (F)
Children             19                  1.52981         0.08052      7.4

In practice, there will usually be little difference between r and the intraclass correlation coefficient rI for true repeated measurements. If, however, there is a systematic change from the first measurement to the second, as might be caused by a learning effect, rI will be much less than r. If there was ...
Correct data for Statistics Note<br />
There was an error in Statistics Note 22, Measurement error and correlation coefficients, <strong>Bland</strong> <strong>JM</strong> and<br />
Altman DG, 1996, British Medical Journal 313, 41-2.<br />
The wrong table of data was printed. The correct data are:<br />
Sub. 1st 2nd Sub. 1st 2nd<br />
1 1.20 1.24 11 1.62 1.68<br />
2 1.21 1.19 12 1.64 1.61<br />
3 1.42 1.83 13 1.65 2.05<br />
4 1.43 1.38 14 1.74 1.80<br />
5 1.49 1.60 15 1.78 1.76<br />
6 1.58 1.36 16 1.80 1.76<br />
7 1.58 1.65 17 1.80 1.82<br />
8 1.59 1.60 18 1.85 1.73<br />
9 1.60 1.58 19 1.88 1.82<br />
10 1.61 1.61 20 1.92 2.00
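The restriction-of-range effect described in the note can be checked directly against the corrected data above. A minimal Python sketch (ours, not the authors' code): splitting the sample, which is ordered by the first measurement, into its two halves reproduces the correlations quoted in the note.

```python
# Effect of restricted range on the correlation between paired measurements,
# using the corrected FEV1 data (litres) from the table above.
from math import sqrt

first  = [1.20, 1.21, 1.42, 1.43, 1.49, 1.58, 1.58, 1.59, 1.60, 1.61,
          1.62, 1.64, 1.65, 1.74, 1.78, 1.80, 1.80, 1.85, 1.88, 1.92]
second = [1.24, 1.19, 1.83, 1.38, 1.60, 1.36, 1.65, 1.60, 1.58, 1.61,
          1.68, 1.61, 2.05, 1.80, 1.76, 1.76, 1.82, 1.73, 1.82, 2.00]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Full sample, then the two subsamples formed by splitting the table
# (ordered by the first measurement) into halves of 10 subjects each.
r_full = pearson_r(first, second)            # about 0.77
r_low  = pearson_r(first[:10], second[:10])  # about 0.63
r_high = pearson_r(first[10:], second[10:])  # about 0.31
print(round(r_full, 2), round(r_low, 2), round(r_high, 2))
```

Each subsample spans a narrower range of FEV1 than the whole sample, so each gives a smaller correlation, which is the point the note makes about reliability coefficients.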
Statistics notes

Bayesians and frequentists

J Martin Bland, Douglas G Altman

There are two competing philosophies of statistical analysis: the Bayesian and the frequentist. The frequentists are much the larger group, and almost all the statistical analyses which appear in the BMJ are frequentist. The Bayesians are much fewer and until recently could only snipe at the frequentists from the high ground of university departments of mathematical statistics. Now the increasing power of computers is bringing Bayesian methods to the fore.

Bayesian methods are based on the idea that unknown quantities, such as population means and proportions, have probability distributions. The probability distribution for a population proportion expresses our prior knowledge or belief about it, before we add the knowledge which comes from our data. For example, suppose we want to estimate the prevalence of diabetes in a health district. We could use the knowledge that the percentage of diabetics in the United Kingdom as a whole is about 2%, so we expect the prevalence in our health district to be fairly similar. It is unlikely to be 10%, for example. We might have information based on other datasets that such rates vary between 1% and 3%, or we might guess that the prevalence is somewhere between these values. We can construct a prior distribution which summarises our beliefs about the prevalence in the absence of specific data. We can do this with a distribution having mean 2% and standard deviation 0.5%, so that two standard deviations on either side of the mean are 1% and 3%. (The precise mathematical form of the prior distribution depends on the particular problem.)

Suppose we now collect some data by a sample survey of the district population. We can use the data to modify the prior probability distribution to tell us what we now think the distribution of the population percentage is; this is the posterior distribution. For example, if we did a survey of 1000 subjects and found 15 (1.5%) to be diabetic, the posterior distribution would have mean 1.7% and standard deviation 0.3%. We can calculate a set of values, a 95% credible interval (1.2% to 2.4% for the example), such that there is a probability of 0.95 that the percentage of diabetics is within this set. The frequentist analysis, which ignores the prior information, would give an estimate 1.5% with standard error 0.4% and 95% confidence interval 0.8% to 2.5%. This is similar to the results of the Bayesian method, as is usually the case, but the Bayesian method gives an estimate nearer the prior mean and a narrower interval.
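The prevalence example can be reproduced by approximating both the prior and the data's likelihood with normal distributions and combining them by precision weighting. This is a sketch under that assumption, not the authors' calculation; the note's exact credible interval may rest on a different parametric form for the prior.

```python
# Normal-approximation Bayesian update for the diabetes prevalence example:
# prior mean 2%, prior SD 0.5%; survey of 1000 subjects with 15 diabetic.
from math import sqrt

prior_mean, prior_sd = 2.0, 0.5  # prior, in percentage points
n, cases = 1000, 15
p_hat = 100 * cases / n                             # sample estimate: 1.5%
se = 100 * sqrt((cases / n) * (1 - cases / n) / n)  # standard error, about 0.4%

# Posterior mean is a precision-weighted average of prior mean and data
# estimate; posterior precision is the sum of the two precisions.
w_prior, w_data = 1 / prior_sd**2, 1 / se**2
post_mean = (w_prior * prior_mean + w_data * p_hat) / (w_prior + w_data)
post_sd = sqrt(1 / (w_prior + w_data))

print(round(post_mean, 1))  # 1.7, pulled toward the prior mean of 2
print(round(post_sd, 1))    # 0.3, narrower than the frequentist SE of 0.4
```

The posterior mean lies between the prior mean and the sample estimate, and the posterior standard deviation is smaller than either source alone would give, which is exactly the behaviour the text describes.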
Frequentist methods regard the population value<br />
as a fixed, unvarying (but unknown) quantity, without a<br />
probability distribution. Frequentists then calculate<br />
confidence intervals for this quantity, or significance<br />
tests of hypotheses concerning it. Bayesians reasonably<br />
object that this does not allow us to use our wider<br />
knowledge of the problem. Also, it does not provide<br />
what researchers seem to want, which is to be able to<br />
say that there is a probability of 95% that the<br />
<strong>BMJ</strong> VOLUME 317 24 OCTOBER 1998 <strong>www</strong>.bmj.com<br />
Downloaded from<br />
bmj.com on 28 February 2008<br />
population value lies within the 95% confidence interval,<br />
or that the probability that the null hypothesis is<br />
true is less than 5%. It is argued that researchers want<br />
this, which is why they persistently misinterpret<br />
confidence intervals and significance tests in this way.<br />
A major difficulty, of course, is deciding on the<br />
prior distribution. This is going to influence the<br />
conclusions of the study, yet it may be a subjective synthesis<br />
of the available information, so the same data<br />
analysed <strong>by</strong> different investigators could lead to different<br />
conclusions. Another difficulty is that Bayesian<br />
methods may lead to intractable computational<br />
problems. (All widely available statistical packages use<br />
frequentist methods.)<br />
Most statisticians have become Bayesians or<br />
frequentists as a result of their choice of university.<br />
They did not know that Bayesians and frequentists<br />
existed until it was too late and the choice had been<br />
made. There have been subsequent conversions. Some<br />
who were taught the Bayesian way discovered that<br />
when they had huge quantities of medical data to analyse<br />
the frequentist approach was much quicker and<br />
more practical, although they may remain Bayesian at<br />
heart. Some frequentists have had Damascus road conversions<br />
to the Bayesian view. Many practising<br />
statisticians, however, are fairly ignorant of the<br />
methods used <strong>by</strong> the rival camp and too busy to have<br />
time to find out.<br />
The advent of very powerful computers has given a<br />
new impetus to the Bayesians. Computer intensive<br />
methods of analysis are being developed, which allow<br />
new approaches to very difficult statistical problems,<br />
such as the location of geographical clusters of cases of<br />
a disease. This new practicability of the Bayesian<br />
approach is leading to a change in the statistical<br />
paradigm—and a rapprochement between Bayesians<br />
and frequentists. 1 2 Frequentists are becoming curious<br />
about the Bayesian approach and more willing to use<br />
Bayesian methods when they provide solutions to difficult<br />
problems. In the future we expect to see more<br />
Bayesian analyses reported in the <strong>BMJ</strong>. When this happens<br />
we may try to use Statistics notes to explain them,<br />
though we may have to recruit a Bayesian to do it.<br />
We thank David Spiegelhalter for comments on a draft.<br />
1 Breslow N. Biostatistics and Bayes (with discussion). Statist Sci 1990;5:<br />
269-98.<br />
2 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to<br />
randomized trials (with discussion). J R Statist Soc A 1994;157:357-416.<br />
Correction<br />
North of England evidence based guidelines development project:<br />
guideline for the primary care management of dementia<br />
An editorial error occurred in this article <strong>by</strong> Martin Eccles<br />
and colleagues (19 September, pp 802-8). In the list of<br />
authors the name of Moira Livingston [not Livingstone] was<br />
misspelt.<br />
Education and debate<br />
Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE: J Martin <strong>Bland</strong>, professor of medical statistics. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, head. Correspondence to: Professor <strong>Bland</strong>. <strong>BMJ</strong> 1998;317:1151<br />
Clinical review<br />
Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE: J Martin <strong>Bland</strong>, professor of medical statistics. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, head. Correspondence to: Professor <strong>Bland</strong>. <strong>BMJ</strong> 1998;317:1572<br />
Time (months) to conception or censoring in 38 sub-fertile women after laparoscopy and hydrotubation 2<br />
Conceived: 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 6, 6, 9, 9, 9, 10, 13, 16<br />
Did not conceive (censored): 2, 3, 4, 7, 7, 8, 8, 9, 9, 9, 11, 24, 24<br />
Statistics <strong>Notes</strong><br />
Survival probabilities (the Kaplan-Meier method)<br />
J Martin <strong>Bland</strong>, Douglas G Altman<br />
As we have observed, 1 analysis of survival data requires<br />
special techniques because some observations are<br />
censored as the event of interest has not occurred for all<br />
patients. For example, when patients are recruited over<br />
two years one recruited at the end of the study may be<br />
alive at one year follow up, whereas one recruited at the<br />
start may have died after two years. The patient who died<br />
has a longer observed survival than the one who still<br />
survives and whose ultimate survival time is unknown.<br />
The table shows data from a study of conception in<br />
subfertile women. 2 The event is conception, and<br />
women “survived” until they conceived. One woman<br />
conceived after 16 months (menstrual cycles), whereas<br />
several were followed for shorter time periods during<br />
which they did not conceive; their time to conceive was<br />
thus censored.<br />
We wish to estimate the proportion surviving (not<br />
having conceived) <strong>by</strong> any given time, which is also the<br />
estimated probability of survival to that time for a<br />
member of the population from which the sample is<br />
drawn. Because of the censoring we use the<br />
Kaplan-Meier method. For each time interval we<br />
estimate the probability that those who have survived<br />
to the beginning will survive to the end. This is a conditional<br />
probability (the probability of being a survivor at<br />
the end of the interval on condition that the subject<br />
was a survivor at the beginning of the interval). Survival<br />
to any time point is calculated as the product of the<br />
conditional probabilities of surviving each time<br />
interval. These data are unusual in representing<br />
months (menstrual cycles); usually the conditional<br />
probabilities relate to days. The calculations are simplified<br />
<strong>by</strong> ignoring times at which there were no recorded<br />
survival times (whether events or censored times).<br />
In the example, the probability of surviving for two<br />
months is the probability of surviving the first month<br />
times the probability of surviving the second month<br />
given that the first month was survived. Of 38 women,<br />
32 survived the first month, a probability of 32/38 = 0.842. Of the 32 women<br />
at the start of the second month (“at risk” of<br />
conception), 27 had not conceived <strong>by</strong> the end of the<br />
month. The conditional probability of surviving the<br />
second month is thus 27/32 = 0.844, and the overall<br />
probability of surviving (not conceiving) after two<br />
months is 0.842 × 0.844 = 0.711. We continue in this<br />
way to the end of the table, or until we reach the last<br />
event. Observations censored at a given time affect the<br />
number still at risk at the start of the next month. The<br />
estimated probability changes only in months when<br />
there is a conception. In practice, a computer is used to<br />
do these calculations. Standard errors and confidence<br />
intervals for the estimated survival probabilities can be<br />
found <strong>by</strong> Greenwood’s method. 3 Survival probabilities<br />
are usually presented as a survival curve (figure). The<br />
“curve” is a step function, with sudden changes in the<br />
estimated probability corresponding to times at which<br />
an event was observed. The times of the censored data<br />
are indicated <strong>by</strong> short vertical lines.<br />
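The product-limit calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original note; a real analysis would use statistical software.

```python
# Kaplan-Meier (product-limit) estimate for the conception data in the
# table above. A sketch only; real analyses would use statistical software.
from collections import Counter

conceived = [1]*6 + [2]*5 + [3]*3 + [4]*3 + [6]*2 + [9]*3 + [10, 13, 16]
censored = [2, 3, 4, 7, 7, 8, 8, 9, 9, 9, 11, 24, 24]  # still "surviving"

events = Counter(conceived)   # conceptions observed in each month
losses = Counter(censored)    # censorings in each month

def km_curve(events, losses, n_at_risk):
    """Return {month: estimated probability of not conceiving by that month}."""
    surv, curve = 1.0, {}
    for t in sorted(set(events) | set(losses)):
        d = events.get(t, 0)
        surv *= (n_at_risk - d) / n_at_risk  # conditional probability of surviving month t
        if d:                                # the estimate changes only at event times
            curve[t] = surv
        n_at_risk -= d + losses.get(t, 0)    # censored women leave the risk set
    return curve

curve = km_curve(events, losses, n_at_risk=38)
print(round(curve[1], 3), round(curve[2], 3))  # 0.842 0.711, as in the text
```

The curve changes only in months with a conception, and censored observations reduce the number at risk from the next interval on, exactly as described above.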
[Figure: Survival curve showing probability of not conceiving among 38 subfertile women after laparoscopy and hydrotubation. 2 Y axis: survival probability, 0 to 1.0; x axis: time (months), 0 to 24]<br />
There are three assumptions in the above. Firstly,<br />
we assume that at any time patients who are censored<br />
have the same survival prospects as those who<br />
continue to be followed. This assumption is not easily<br />
testable. Censoring may be for various reasons. In the<br />
conception study some women had received hormone<br />
treatment to promote ovulation, and others had<br />
stopped trying to conceive. Thus they were no longer<br />
part of the population we wanted to study, and their<br />
survival times were censored. In most studies some<br />
subjects drop out for reasons unrelated to the<br />
condition under study (for example, emigration). If,<br />
however, for some patients in this study censoring was<br />
related to failure to conceive this would have biased the<br />
estimated survival probabilities downwards.<br />
Secondly, we assume that the survival probabilities<br />
are the same for subjects recruited early and late in the<br />
study. In a long term observational study of patients<br />
with cancer, for example, the case mix may change over<br />
the period of recruitment, or there may be an innovation<br />
in ancillary treatment. This assumption may be<br />
tested, provided we have enough data to estimate<br />
survival curves for different subsets of the data.<br />
Thirdly, we assume that the event happens at the<br />
time specified. This is not a problem for the conception<br />
data, but could be, for example, if the event were recurrence<br />
of a tumour which would be detected at a regular<br />
examination. All we would know is that the event<br />
happened between two examinations. This imprecision<br />
would bias the survival probabilities upwards. When<br />
the observations are at regular intervals this can be<br />
allowed for quite easily, using the actuarial method. 3<br />
Formal methods are needed for testing hypotheses<br />
about survival in two or more groups. We shall describe<br />
the logrank test for comparing curves and the more<br />
complex Cox regression model in future <strong>Notes</strong>.<br />
1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Time to event (survival) data. <strong>BMJ</strong><br />
1998;317:468-9.<br />
2 Luthra P, <strong>Bland</strong> <strong>JM</strong>, Stanton SL. Incidence of pregnancy after<br />
laparoscopy and hydrotubation. <strong>BMJ</strong> 1982;284:1013-4.<br />
3 Parmar MKB, Machin D. Survival analysis: a practical approach. Chichester:<br />
Wiley, 1995: 37, 47-9.<br />
1572 <strong>BMJ</strong> VOLUME 317 5 DECEMBER 1998 <strong>www</strong>.bmj.com
Statistics notes<br />
Treatment allocation in controlled trials: why randomise?<br />
Douglas G Altman, J Martin <strong>Bland</strong><br />
Since 1991 the <strong>BMJ</strong> has had a policy of not publishing<br />
trials that have not been properly randomised, except<br />
in rare cases where this can be justified. 1 Why?<br />
The simplest approach to evaluating a new<br />
treatment is to compare a single group of patients<br />
given the new treatment with a group previously<br />
treated with an alternative treatment. Usually such<br />
studies compare two consecutive series of patients in<br />
the same hospital(s). This approach is seriously flawed.<br />
Problems will arise from the mixture of retrospective<br />
and prospective studies, and we can never satisfactorily<br />
eliminate possible biases due to other factors (apart<br />
from treatment) that may have changed over time.<br />
Sacks et al compared trials of the same treatments in<br />
which randomised or historical controls were used and<br />
found a consistent tendency for historically controlled<br />
trials to yield more optimistic results than randomised<br />
trials. 2 The use of historical controls can be justified<br />
only in tightly controlled situations of relatively rare<br />
conditions, such as in evaluating treatments for<br />
advanced cancer.<br />
The need for contemporary controls is clear, but<br />
there are difficulties. If the clinician chooses which<br />
treatment to give each patient there will probably be<br />
differences in the clinical and demographic characteristics<br />
of the patients receiving the different treatments.<br />
Much the same will happen if patients choose their<br />
own treatment or if those who agree to have a<br />
treatment are compared with refusers. Similar problems<br />
arise when the different treatment groups are at<br />
different hospitals or under different consultants. Such<br />
systematic differences, termed bias, will lead to an overestimate<br />
or underestimate of the difference between<br />
treatments. Bias can be avoided <strong>by</strong> using random allocation.<br />
A well known example of the confusion engendered<br />
<strong>by</strong> a non-randomised study was the study of the<br />
possible benefit of vitamin supplementation at the time<br />
of conception in women at high risk of having a ba<strong>by</strong><br />
with a neural tube defect. 3 The investigators found that<br />
the vitamin group subsequently had fewer babies with<br />
neural tube defects than the placebo control group.<br />
The control group included women ineligible for the<br />
trial as well as women who refused to participate. As a<br />
consequence the findings were not widely accepted,<br />
and the Medical Research Council later funded a large<br />
randomised trial to answer the question in a way that<br />
would be widely accepted. 4<br />
The main reason for using randomisation to<br />
allocate treatments to patients in a controlled trial is to<br />
prevent biases of the types described above. We want to<br />
compare the outcomes of treatments given to groups<br />
of patients which do not differ in any systematic way.<br />
Another reason for randomising is that statistical<br />
theory is based on the idea of random sampling. In a<br />
study with random allocation the differences between<br />
treatment groups behave like the differences between<br />
random samples from a single population. We know<br />
how random samples are expected to behave and so<br />
can compare the observations with what we would<br />
expect if the treatments were equally effective.<br />
<strong>BMJ</strong> VOLUME 318 1 MAY 1999 <strong>www</strong>.bmj.com<br />
The term random does not mean the same as haphazard<br />
but has a precise technical meaning. By random<br />
allocation we mean that each patient has a known<br />
chance, usually an equal chance, of being given each<br />
treatment, but the treatment to be given cannot be predicted.<br />
If there are two treatments the simplest method<br />
of random allocation gives each patient an equal<br />
chance of getting either treatment; it is equivalent to<br />
tossing a coin. In practice most people use either a<br />
table of random numbers or a random number<br />
generator on a computer. This is simple randomisation.<br />
Possible modifications include block randomisation,<br />
to ensure closely similar numbers of patients in<br />
each group, and stratified randomisation, to keep the<br />
groups balanced for certain prognostic patient characteristics.<br />
We discuss these extensions in a subsequent<br />
Statistics note.<br />
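Simple randomisation as described above can be sketched in a few lines of Python (an illustration added here, not from the original article):

```python
# Simple randomisation: each patient independently has an equal chance of
# A or B -- the computer equivalent of tossing a coin. Illustrative sketch.
import random

rng = random.Random(2024)  # seeded only to make the sketch reproducible

def simple_randomise(n_patients):
    """Allocate each patient to A or B independently, with equal probability."""
    return [rng.choice("AB") for _ in range(n_patients)]

allocation = simple_randomise(10)
print("".join(allocation))
# Group sizes will usually be unequal at any point in the sequence; block
# randomisation is one way to keep them close.
print(allocation.count("A"), allocation.count("B"))
```

Because each allocation is independent and unpredictable, the openness problems described above for date-of-birth or alternation schemes do not arise.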
Fifty years after the publication of the first<br />
randomised trial 5 the technical meaning of the term<br />
randomisation continues to elude some investigators.<br />
Journals continue to publish “randomised” trials which<br />
are no such thing. One common approach is to<br />
allocate treatments according to the patient’s date of<br />
birth or date of enrolment in the trial (such as giving<br />
one treatment to those with even dates and the other to<br />
those with odd dates), <strong>by</strong> the terminal digit of the hospital<br />
number, or simply alternately into the different<br />
treatment groups. While all of these approaches are in<br />
principle unbiased—being unrelated to patient<br />
characteristics—problems arise from the openness of<br />
the allocation system. 1 Because the treatment is known<br />
when a patient is considered for entry into the trial this<br />
knowledge may influence the decision to recruit that<br />
patient and so produce treatment groups which are<br />
not comparable.<br />
Of course, situations exist where randomisation is<br />
simply not possible. 6 The goal here should be to retain<br />
all the methodological features of a well conducted<br />
randomised trial 7 other than the randomisation.<br />
1 Altman DG. Randomisation. <strong>BMJ</strong> 1991;302:1481-2.<br />
2 Sacks H, Chalmers TC, Smith H. Randomized versus historical controls<br />
for clinical trials. Am J Med 1982;72:233-40.<br />
3 Smithells RW, Sheppard S, Schorah CJ, Seller MJ, Nevin NC, Harris R, et<br />
al. Possible prevention of neural-tube defects <strong>by</strong> periconceptional vitamin<br />
supplementation. Lancet 1980;i:339-40.<br />
4 MRC Vitamin Study Research Group. Prevention of neural tube defects:<br />
results of the Medical Research Council vitamin study. Lancet<br />
1991;338:131-7.<br />
5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis.<br />
<strong>BMJ</strong> 1948;2:769-82.<br />
6 Black N. Why we need observational studies to evaluate the effectiveness<br />
of health care. <strong>BMJ</strong> 1996;312:1215-8.<br />
7 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving<br />
the quality of reporting of randomized controlled trials: the<br />
CONSORT Statement. JAMA 1996;276:637-9.<br />
Education and debate<br />
ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE: J Martin <strong>Bland</strong>, professor of medical statistics. Correspondence to: Professor Altman. <strong>BMJ</strong> 1999;318:1209<br />
Statistics notes<br />
Variables and parameters<br />
Douglas G Altman, J Martin <strong>Bland</strong><br />
Like all specialist areas, statistics has developed its own<br />
language. As we have noted before, 1 much confusion<br />
may arise when a word in common use is also given a<br />
technical meaning. Statistics abounds in such terms,<br />
including normal, random, variance, significant, etc.<br />
Two commonly confused terms are variable and<br />
parameter; here we explain and contrast them.<br />
Information recorded about a sample of individuals<br />
(often patients) comprises measurements such as<br />
blood pressure, age, or weight and attributes such as<br />
blood group, stage of disease, and diabetes. Values of<br />
these will vary among the subjects; in this context<br />
blood pressure, weight, blood group and so on are<br />
variables. Variables are quantities which vary from<br />
individual to individual.<br />
By contrast, parameters do not relate to actual<br />
measurements or attributes but to quantities defining a<br />
theoretical model. The figure shows the distribution of<br />
measurements of serum albumin in 481 white men<br />
aged over 20 with mean 46.14 and standard deviation<br />
3.08 g/l. For the empirical data the mean and SD are<br />
called sample estimates. They are properties of the collection<br />
of individuals. Also shown is the normal 1 distribution<br />
which fits the data most closely. It too has mean<br />
46.14 and SD 3.08 g/l. For the theoretical distribution<br />
the mean and SD are called parameters. There is not<br />
one normal distribution but many, called a family of<br />
distributions. Each member of the family is defined <strong>by</strong><br />
its mean and SD, the parameters 1 which specify the<br />
particular theoretical normal distribution with which<br />
we are dealing. In this case, they give the best estimate<br />
of the population distribution of serum albumin if we<br />
can assume that in the population serum albumin has<br />
a normal distribution.<br />
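The distinction between sample estimates and the parameters of the fitted distribution can be sketched in Python. The data below are simulated (the original albumin measurements are not reproduced here), taking the article's 46.14 and 3.08 g/l as the true parameters.

```python
# Sample estimates versus parameters: the mean and SD computed from data
# are estimates; the normal distribution fitted with those values is the
# member of the normal family whose parameters they are. Data are simulated,
# taking the article's 46.14 and 3.08 g/l as the true parameters.
import random
import statistics

rng = random.Random(481)
albumin = [rng.gauss(46.14, 3.08) for _ in range(481)]  # 481 simulated men

xbar = statistics.mean(albumin)   # sample estimate of the mean
s = statistics.stdev(albumin)     # sample estimate of the SD

# N(xbar, s): the particular theoretical normal distribution specified by
# these two parameters.
fitted = statistics.NormalDist(mu=xbar, sigma=s)
print(round(xbar, 2), round(s, 2))
print(round(fitted.cdf(50) - fitted.cdf(40), 3))  # fitted P(40 < albumin < 50)
```

The sample estimates describe the collection of individuals; once plugged into `NormalDist` they become the parameters defining one member of the family of normal distributions.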
Most statistical methods, such as t tests, are called<br />
parametric because they estimate parameters of some<br />
underlying theoretical distribution. Non-parametric<br />
methods, such as the Mann-Whitney U test and the log<br />
rank test for survival data, do not assume any particular<br />
family for the distribution of the data and so do not<br />
estimate any parameters for such a distribution.<br />
Another use of the word parameter relates to its<br />
original mathematical meaning as the value(s) defining<br />
one of a family of curves. If we fit a regression model,<br />
such as that describing the relation between lung function<br />
and height, the slope and intercept of this line<br />
(more generally known as regression coefficients) are<br />
the parameters defining the model. They have no<br />
meaning for individuals, although they can be used to<br />
predict an individual’s lung function from their height.<br />
[Figure: Measurements of serum albumin in 481 white men aged over 20 (data from Dr W G Miller); histogram with the fitted normal distribution. Y axis: frequency, 0 to 80; x axis: albumin (g/l), 35 to 55]<br />
<strong>BMJ</strong> VOLUME 318 19 JUNE 1999 <strong>www</strong>.bmj.com<br />
In some contexts parameters are values that can be<br />
altered to see what happens to the performance of<br />
some system. For example, the performance of a<br />
screening programme (such as positive predictive<br />
value or cost effectiveness) will depend on aspects such<br />
as the sensitivity and specificity of the screening test. If<br />
we look to see how the performance would change if,<br />
say, sensitivity and specificity were improved, then we<br />
are treating these as parameters rather than using the<br />
values observed in a real set of data.<br />
Parameter is a technical term which has only<br />
recently found its way into general use, unfortunately<br />
without keeping its correct meaning. It is common in<br />
medical journals to find variables incorrectly called<br />
parameters (but not in the <strong>BMJ</strong> we hope 2 ). Another<br />
common misuse of parameter is as a limit or boundary,<br />
as in “within certain parameters.” This misuse seems to<br />
have arisen from confusion between parameter and<br />
perimeter.<br />
Misuse of medical terms is rightly deprecated. Like<br />
other language errors it leads to confusion and the loss<br />
of valuable distinction. Misuse of non-medical terms<br />
should be viewed likewise.<br />
1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. The normal distribution. <strong>BMJ</strong> 1995;310:298.<br />
2 Endpiece: What’s a parameter? <strong>BMJ</strong> 1998;316:1877.<br />
General practice<br />
ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE: J Martin <strong>Bland</strong>, professor of medical statistics. Correspondence to: Professor Altman. <strong>BMJ</strong> 1999;318:1667<br />
Much work still needs to be done to achieve this. To be<br />
useful in health policy at this level, all the targets need<br />
to be elaborated further and clear, practical statements<br />
must be made on their operation—especially the four<br />
targets on health policy and sustainable health systems.<br />
The WHO should stimulate the discussion of these<br />
important targets, but it should also be careful about<br />
being too prescriptive about health systems since this<br />
could be counterproductive.<br />
In addition, more attention should be given to the<br />
usefulness of the targets in member states. One way of<br />
doing this is to rank the countries <strong>by</strong> target and to<br />
divide them into three groups. A specific level could be<br />
set for each group. For example, for target 2, three such<br />
groups could be distinguished as follows:<br />
• Countries that have already achieved this target<br />
• Countries for which the global target is achievable and challenging<br />
• Countries that find the global target hard to achieve and therefore “demotivating.”<br />
The first group needs stricter target levels, and the<br />
third group less stringent ones. If a breakdown of this<br />
kind is made for each target, some countries may be<br />
classified in different groups for different targets. In this<br />
way, the targets will provide an insight into the health<br />
status of the population and could be useful for policy<br />
makers in member states in encouraging action and<br />
allocating their resources.<br />
We thank Dr J Visschedijk and Professor L J Gunning-Schepers<br />
and other referees of this article for their helpful comments.<br />
Funding: This study was commissioned <strong>by</strong> Policy Action<br />
Coordination at the WHO and supported <strong>by</strong> an unrestricted<br />
educational grant from Merck & Co Inc, New Jersey, USA.<br />
Competing interests: None declared.<br />
Statistics notes<br />
How to randomise<br />
Douglas G Altman, J Martin <strong>Bland</strong><br />
We have explained why random allocation of<br />
treatments is a required feature of controlled trials. 1<br />
Here we consider how to generate a random allocation<br />
sequence.<br />
Almost always patients enter a trial in sequence<br />
over a prolonged period. In the simplest procedure,<br />
simple randomisation, we determine each patient’s<br />
treatment at random independently with no constraints.<br />
With equal allocation to two treatment groups<br />
this is equivalent to tossing a coin, although in practice<br />
coins are rarely used. Instead we use computer generated<br />
random numbers. Suitable tables can be found in<br />
most statistics textbooks. The table shows an example 2 :<br />
the numbers can be considered as either random digits<br />
from 0 to 9 or random integers from 0 to 99.<br />
For equal allocation to two treatments we could<br />
take odd and even numbers to indicate treatments A<br />
and B respectively. We must then choose an arbitrary<br />
place to start and also the direction in which to read<br />
the table. The first 10 two digit numbers from a starting<br />
place in column 2 are 85 80 62 36 96 56 17 17 23 87,<br />
which translate into the sequence ABBBBBAAAA<br />
for the first 10 patients. We could instead have taken<br />
each digit on its own, or numbers 00 to 49 for A and 50<br />
to 99 for B. There are countless possible strategies; it<br />
makes no difference which is used.<br />
<strong>BMJ</strong> VOLUME 319 11 SEPTEMBER 1999 <strong>www</strong>.bmj.com<br />
We can easily generalise the approach. With three<br />
groups we could use 01 to 33 for A, 34 to 66 for B, and<br />
67 to 99 for C (00 is ignored). We could allocate treatments<br />
A and B in proportions 2 to 1 <strong>by</strong> using 01 to 66<br />
for A and 67 to 99 for B.<br />
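The digit-mapping strategies above can be sketched in a few lines of Python (the helper name and structure are ours, not from the original note): each two digit draw is mapped to a treatment, and draws the scheme does not cover are discarded and redrawn.

```python
import random

def simple_randomisation(n, rule):
    """Allocate n patients by drawing two-digit numbers (00-99) and
    mapping each to a treatment; draws the rule rejects (returns None,
    e.g. 00 in a 01-99 scheme) are discarded and redrawn."""
    allocation = []
    while len(allocation) < n:
        group = rule(random.randint(0, 99))
        if group is not None:
            allocation.append(group)
    return allocation

# 00-49 -> A, 50-99 -> B (equal allocation)
equal = lambda x: "A" if x <= 49 else "B"

# 01-33 -> A, 34-66 -> B, 67-99 -> C (00 ignored)
three_groups = lambda x: None if x == 0 else ("A" if x <= 33 else "B" if x <= 66 else "C")

# 01-66 -> A, 67-99 -> B (allocation in proportions 2 to 1)
two_to_one = lambda x: None if x == 0 else ("A" if x <= 66 else "B")

print("".join(simple_randomisation(10, equal)))
```

As the text says, the particular mapping makes no difference; any scheme that gives each treatment the intended probability on every draw is valid.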
At any point in the sequence the numbers of<br />
patients allocated to each treatment will probably<br />
differ, as in the above example. But sometimes we want<br />
to keep the numbers in each group very close at all<br />
times.<br />
[Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF. J Martin Bland, professor of medical statistics, Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE. Correspondence to: Professor Altman. BMJ 1999;319:703-4]<br />
Excerpt from a table of random digits. 2 The numbers used in the<br />
example run down column 2, starting at the third row:<br />
89 11 77 99 94<br />
35 83 73 68 20<br />
84 85 95 45 52<br />
56 80 93 52 82<br />
97 62 98 71 39<br />
79 36 13 72 99<br />
34 96 98 54 89<br />
69 56 88 97 43<br />
09 17 78 78 02<br />
83 17 39 84 16<br />
24 23 36 44 14<br />
39 87 30 20 41<br />
75 18 53 77 83<br />
33 93 39 24 81<br />
22 52 01 86 71<br />
Block randomisation (also called restricted<br />
randomisation) is used for this purpose. For example, if<br />
we consider subjects in blocks of four at a time there<br />
are only six ways in which two get A and two get B:<br />
1: AABB; 2: ABAB; 3: ABBA; 4: BBAA; 5: BABA; 6: BAAB.<br />
We choose blocks at random to create the<br />
allocation sequence. Using the single digits of the previous<br />
random sequence and omitting numbers outside<br />
the range 1 to 6 we get 5 6 2 3 6 6 5 6 1 1. From these<br />
we can construct the block allocation sequence<br />
BABA / BAAB / ABAB / ABBA / BAAB, and so on.<br />
The numbers in the two groups at any time can never<br />
differ <strong>by</strong> more than half the block length. Block size is<br />
normally a multiple of the number of treatments.<br />
Large blocks are best avoided as they control balance<br />
less well. It is possible to vary the block length, again at<br />
random, perhaps using a mixture of blocks of size 2, 4,<br />
or 6.<br />
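The procedure can be sketched directly from the digit-to-block numbering above (the function name is our own):

```python
import random

# The six arrangements of two As and two Bs, numbered as in the text
BLOCKS = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BBAA", 5: "BABA", 6: "BAAB"}

def block_randomisation(n_patients):
    """Choose blocks of four at random (single digits, ignoring 0 and
    7-9) until at least n_patients allocations have been generated."""
    sequence = ""
    while len(sequence) < n_patients:
        digit = random.randint(0, 9)
        if digit in BLOCKS:
            sequence += BLOCKS[digit]
    return sequence[:n_patients]

# The digits 5 6 2 3 6 from the example reproduce BABA BAAB ABAB ABBA BAAB
print("".join(BLOCKS[d] for d in [5, 6, 2, 3, 6]))
```

At every point in such a sequence the two group sizes differ by at most two, half the block length, which is the property block randomisation is designed to guarantee.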
While simple randomisation removes bias from the<br />
allocation procedure, it does not guarantee, for example,<br />
that the individuals in each group have a similar<br />
age distribution. In small studies especially some<br />
chance imbalance will probably occur, which might<br />
complicate the interpretation of results. We can use<br />
stratified randomisation to achieve approximate<br />
balance of important characteristics without sacrificing<br />
the advantages of randomisation. The method is to<br />
produce a separate block randomisation list for each<br />
subgroup (stratum). For example, in a study to<br />
compare two alternative treatments for breast cancer it<br />
might be important to stratify <strong>by</strong> menopausal status.<br />
Separate lists of random numbers should then be constructed<br />
for premenopausal and postmenopausal<br />
women. It is essential that stratified treatment<br />
allocation is based on block randomisation within each<br />
stratum rather than simple randomisation; otherwise<br />
there will be no control of balance of treatments within<br />
strata, so the object of stratification will be defeated.<br />
Stratified randomisation can be extended to two or<br />
more stratifying variables. For example, we might want<br />
to extend the stratification in the breast cancer trial to<br />
tumour size and number of positive nodes. A separate<br />
randomisation list is needed for each combination of<br />
categories. If we had two tumour size groups (say<br />
≤4 cm and >4 cm) and three groups for node involvement (0,<br />
1-4, >4) as well as menopausal status, then we have<br />
2 × 3 × 2 = 12 strata, which may exceed the limit of what<br />
is practical. Also with multiple strata some of the combinations<br />
of categories may be rare, so the intended<br />
treatment balance is not achieved.<br />
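One way to sketch stratified randomisation for this example (the category labels, block size, and list length are our illustrative choices):

```python
import itertools
import random

def block_list(n, block="AABB"):
    """Block-randomised list for one stratum: shuffled blocks of four."""
    seq = []
    while len(seq) < n:
        b = list(block)
        random.shuffle(b)
        seq.extend(b)
    return seq[:n]

# 2 x 3 x 2 = 12 strata, each with its own independent list
strata = itertools.product(["pre", "post"],       # menopausal status
                           ["<=4 cm", ">4 cm"],   # tumour size
                           ["0", "1-4", ">4"])    # positive nodes
lists = {s: block_list(20) for s in strata}
print(len(lists))  # 12
```

Because each stratum's list is built from blocks, treatments stay balanced within every stratum, which is the point made in the text.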
In a multicentre study the patients within each centre<br />
will need to be randomised separately unless there<br />
is a central coordinated randomising service. Thus<br />
“centre” is a stratifying variable, and there may be other<br />
stratifying variables as well.<br />
In small studies it is not practical to stratify on more<br />
than one or perhaps two variables, as the number of<br />
strata can quickly approach the number of subjects.<br />
When it is really important to achieve close similarity<br />
between treatment groups for several variables<br />
minimisation can be used—we discuss this method in a<br />
separate Statistics note. 3<br />
We have described the generation of a random<br />
sequence in some detail so that the principles are clear.<br />
In practice, for many trials the process will be done <strong>by</strong><br />
computer. Suitable software is available at <strong>http</strong>://<br />
<strong>www</strong>.sghms.ac.uk/phs/staff/jmb/jmb.htm.<br />
We shall also consider in a subsequent note the<br />
practicalities of using a random sequence to allocate<br />
treatments to patients.<br />
1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Treatment allocation in controlled trials: why randomise?<br />
<strong>BMJ</strong> 1999;318:1209.<br />
2 Altman DG. Practical statistics for medical research. London: Chapman and<br />
Hall, 1990: 540-4.<br />
3 Treasure T, MacRae KD. Minimisation: the platinum standard for trials?<br />
<strong>BMJ</strong> 1998;317:362-3.<br />
Statistics Notes<br />
The odds ratio<br />
J Martin Bland, Douglas G Altman<br />
[J Martin Bland, professor of medical statistics, Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE. Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF. Correspondence to: Professor Bland. BMJ 2000;320:1468]<br />
In recent years odds ratios have become widely used in<br />
medical reports—almost certainly some will appear in<br />
today’s <strong>BMJ</strong>. There are three reasons for this. Firstly,<br />
they provide an estimate (with confidence interval) for<br />
the relationship between two binary (“yes or no”) variables.<br />
Secondly, they enable us to examine the effects of<br />
other variables on that relationship, using logistic<br />
regression. Thirdly, they have a special and very<br />
convenient interpretation in case-control studies (dealt<br />
with in a future note).<br />
The odds are a way of representing probability,<br />
especially familiar for betting. For example, the odds<br />
that a single throw of a die will produce a six are 1 to<br />
5, or 1/5. The odds is the ratio of the probability that<br />
the event of interest occurs to the probability that it<br />
does not. This is often estimated <strong>by</strong> the ratio of the<br />
number of times that the event of interest occurs to<br />
the number of times that it does not. The table shows<br />
data from a cross sectional study showing the<br />
prevalence of hay fever and eczema in 11 year old<br />
children. 1 The probability that a child with eczema will<br />
also have hay fever is estimated <strong>by</strong> the proportion<br />
141/561 (25.1%). The odds is estimated <strong>by</strong> 141/420.<br />
Similarly, for children without eczema the probability<br />
of having hay fever is estimated <strong>by</strong> 928/14 453 (6.4%)<br />
and the odds is 928/13 525. We can compare the<br />
groups in several ways: <strong>by</strong> the difference between the<br />
proportions, 141/561 − 928/14 453 = 0.187 (or 18.7<br />
percentage points); the ratio of the proportions, (141/<br />
561)/(928/14 453) = 3.91 (also called the relative<br />
risk); or the odds ratio, (141/420)/(928/<br />
13 525) = 4.89.<br />
Association between hay fever and eczema in 11 year old children 1<br />
              Hay fever<br />
Eczema      Yes      No     Total<br />
Yes         141     420       561<br />
No          928  13 525    14 453<br />
Total      1069  13 945    15 522<br />
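The three comparisons can be checked directly from the table (a quick sketch; the variable names are ours):

```python
# Cell counts from the table: a, b = eczema with/without hay fever;
# c, d = no eczema with/without hay fever.
a, b, c, d = 141, 420, 928, 13_525

risk_difference = a / (a + b) - c / (c + d)    # difference in proportions
relative_risk = (a / (a + b)) / (c / (c + d))  # ratio of proportions
odds_ratio = (a / b) / (c / d)

print(round(risk_difference, 3), round(relative_risk, 2), round(odds_ratio, 2))
# 0.187 3.91 4.89
```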
Now, suppose we look at the table the other way<br />
round, and ask what is the probability that a child with<br />
hay fever will also have eczema? The proportion is<br />
141/1069 (13.2%) and the odds is 141/928. For a<br />
child without hay fever, the proportion with eczema is<br />
420/13 945 (3.0%) and the odds is 420/13 525. Comparing<br />
the proportions this way, the difference is 141/<br />
1069 − 420/13 945 = 0.102 (or 10.2 percentage<br />
points); the ratio (relative risk) is (141/1069)/(420/<br />
13 945) = 4.38; and the odds ratio is (141/928)/(420/<br />
13 525) = 4.89. The odds ratio is the same whichever<br />
way round we look at the table, but the difference and<br />
ratio of proportions are not. It is easy to see why this is.<br />
The two odds ratios are (141/420)/(928/13 525) and<br />
(141/928)/(420/13 525), which can both be rearranged to give<br />
(141 × 13 525)/(420 × 928) = 4.89.<br />
If we switch the order of the categories in the rows and<br />
the columns, we get the same odds ratio. If we switch<br />
the order for the rows only or for the columns only, we<br />
get the reciprocal of the odds ratio, 1/4.89 = 0.204.<br />
These properties make the odds ratio a useful<br />
indicator of the strength of the relationship.<br />
The sample odds ratio is limited at the lower end,<br />
since it cannot be negative, but not at the upper end,<br />
and so has a skew distribution. The log odds ratio, 2<br />
however, can take any value and has an approximately<br />
Normal distribution. It also has the useful property that<br />
if we reverse the order of the categories for one of the<br />
variables, we simply reverse the sign of the log odds<br />
ratio: log(4.89) = 1.59, log(0.204) = − 1.59.<br />
We can calculate a standard error for the log odds<br />
ratio and hence a confidence interval. The standard<br />
error of the log odds ratio is estimated simply <strong>by</strong> the<br />
square root of the sum of the reciprocals of the four<br />
frequencies. For the example, the standard error is<br />
√(1/141 + 1/420 + 1/928 + 1/13 525) = 0.103.<br />
A 95% confidence interval for the log odds ratio is<br />
obtained as 1.96 standard errors on either side of the<br />
estimate. For the example, the log odds ratio is<br />
logₑ(4.89) = 1.588 and the confidence interval is<br />
1.588 ± 1.96 × 0.103, which gives 1.386 to 1.790. We can<br />
antilog these limits to give a 95% confidence interval<br />
for the odds ratio itself, 2 as exp(1.386) = 4.00 to<br />
exp(1.790) = 5.99. The observed odds ratio, 4.89, is not<br />
in the centre of the confidence interval because of the<br />
asymmetrical nature of the odds ratio scale. For this<br />
reason, in graphs odds ratios are often plotted using a<br />
logarithmic scale. The odds ratio is 1 when there is no<br />
relationship. We can test the null hypothesis that the<br />
odds ratio is 1 by the usual χ² test for a two by two<br />
table.<br />
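The confidence interval calculation above can be reproduced in a few lines (a sketch of the standard large-sample method described in the text):

```python
from math import exp, log, sqrt

a, b, c, d = 141, 420, 928, 13_525  # cell counts from the table

log_or = log((a / b) / (c / d))
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
lower, upper = log_or - 1.96 * se, log_or + 1.96 * se

print(round(log_or, 3), round(se, 3))              # 1.588 0.103
print(round(exp(lower), 2), round(exp(upper), 2))  # 4.0 5.99
```

Note that the antilogged limits are asymmetric about the observed odds ratio of 4.89, as the text explains.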
Despite their usefulness, odds ratios can cause difficulties<br />
in interpretation. 3 We shall review this debate<br />
and also discuss odds ratios in logistic regression and<br />
case-control studies in future Statistics <strong>Notes</strong>.<br />
We thank Barbara Butland for providing the data.<br />
1 Strachan DP, Butland BK, Anderson HR. Incidence and prognosis of<br />
asthma and wheezing illness from early childhood to age 33 in a national<br />
British cohort. <strong>BMJ</strong>. 1996;312:1195-9.<br />
2 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Transforming data. <strong>BMJ</strong> 1996;312:770.<br />
3 Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evidence-Based<br />
Med 1996;1:164-6.<br />
Statistics Notes<br />
Blinding in clinical trials and other studies<br />
Simon J Day, Douglas G Altman<br />
[Simon J Day, manager, clinical biometrics, Leo Pharmaceuticals, Princes Risborough, Buckinghamshire HP27 9RR. Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Institute of Health Sciences, Oxford OX3 7LF. Correspondence to: S J Day. BMJ 2000;321:504]<br />
Human behaviour is influenced <strong>by</strong> what we know or<br />
believe. In research there is a particular risk of expectation<br />
influencing findings, most obviously when there is<br />
some subjectivity in assessment, leading to biased<br />
results. Blinding (sometimes called masking) is used to<br />
try to eliminate such bias.<br />
It is a tenet of randomised controlled trials that the<br />
treatment allocation for each patient is not revealed<br />
until the patient has irrevocably been entered into the<br />
trial, to avoid selection bias. This sort of blinding, better<br />
referred to as allocation concealment, will be discussed<br />
in a future statistics note. In controlled trials the term<br />
blinding, and in particular “double blind,” usually<br />
refers to keeping study participants, those involved<br />
with their management, and those collecting and analysing<br />
clinical data unaware of the assigned treatment,<br />
so that they should not be influenced <strong>by</strong> that<br />
knowledge.<br />
The relevance of blinding will vary according to<br />
circumstances. Blinding patients to the treatment they<br />
have received in a controlled trial is particularly important<br />
when the response criteria are subjective, such as<br />
alleviation of pain, but less important for objective criteria,<br />
such as death. Similarly, medical staff caring for<br />
patients in a randomised trial should be blinded to<br />
treatment allocation to minimise possible bias in<br />
patient management and in assessing disease status.<br />
For example, the decision to withdraw a patient from a<br />
study or to adjust the dose of medication could easily<br />
be influenced <strong>by</strong> knowledge of which treatment group<br />
the patient has been assigned to.<br />
In a double blind trial neither the patient nor the<br />
caregivers are aware of the treatment assignment.<br />
Blinding means more than just keeping the name of<br />
the treatment hidden. Patients may well see the<br />
treatment being given to patients in the other<br />
treatment group(s), and the appearance of the drug<br />
used in the study could give a clue to its identity. Differences<br />
in taste, smell, or mode of delivery may also<br />
influence efficacy, so these aspects should be identical<br />
for each treatment group. Even colour of medication<br />
has been shown to influence efficacy. 1<br />
In studies comparing two active compounds, blinding<br />
is possible using the “double dummy” method. For<br />
example, if we want to compare two medicines, one<br />
presented as green tablets and one as pink capsules, we<br />
could also supply green placebo tablets and pink<br />
placebo capsules so that both groups of patients would<br />
take one green tablet and one pink capsule.<br />
Blinding is certainly not always easy or possible.<br />
Single blind trials (where either only the investigator or<br />
only the patient is blind to the allocation) are<br />
sometimes unavoidable, as are open (non-blind) trials.<br />
In trials of different styles of patient management,<br />
surgical procedures, or alternative therapies, full blinding<br />
is often impossible.<br />
In a double blind trial it is implicit that the<br />
assessment of patient outcome is done in ignorance of<br />
the treatment received. Such blind assessment of<br />
outcome can often also be achieved in trials which are<br />
open (non-blinded). For example, lesions can be<br />
photographed before and after treatment and assessed<br />
<strong>by</strong> someone not involved in running the trial. Indeed,<br />
blind assessment of outcome may be more important<br />
than blinding the administration of the treatment,<br />
especially when the outcome measure involves subjectivity.<br />
Despite the best intentions, some treatments have<br />
unintended effects that are so specific that their occurrence<br />
will inevitably identify the treatment received to<br />
both the patient and the medical staff. Blind<br />
assessment of outcome is especially useful when this is<br />
a risk.<br />
In epidemiological studies it is preferable that the<br />
identification of “cases” as opposed to “controls” be<br />
kept secret while researchers are determining each<br />
subject’s exposure to potential risk factors. In many<br />
such studies blinding is impossible because exposure<br />
can be discovered only <strong>by</strong> interviewing the study<br />
participants, who obviously know whether or not they<br />
are a case. The risk of differential recall of important<br />
disease related events between cases and controls must<br />
then be recognised and if possible investigated. 2 As a<br />
minimum the sensitivity of the results to differential<br />
recall should be considered. Blinded assessment of<br />
patient outcome may also be valuable in other<br />
epidemiological studies, such as cohort studies.<br />
Blinding is important in other types of research<br />
too. For example, in studies to evaluate the performance<br />
of a diagnostic test those performing the test must<br />
be unaware of the true diagnosis. In studies to evaluate<br />
the reproducibility of a measurement technique the<br />
observers must be unaware of their previous measurement(s)<br />
on the same individual.<br />
We have emphasised the risks of bias if adequate<br />
blinding is not used. This may seem to be challenging<br />
the integrity of researchers and patients, but bias associated<br />
with knowing the treatment is often subconscious.<br />
On average, randomised trials that have not<br />
used appropriate levels of blinding show larger<br />
treatment effects than blinded studies. 3 Similarly, diagnostic<br />
test performance is overestimated when the reference<br />
test is interpreted with knowledge of the test<br />
result. 4 Blinding makes it difficult to bias results<br />
intentionally or unintentionally and so helps ensure<br />
the credibility of study conclusions.<br />
1 De Craen A<strong>JM</strong>, Roos PJ, de Vries AL, Kleijnen J. Effect of colour of drugs:<br />
systematic review of perceived effect of drugs and their effectiveness. <strong>BMJ</strong><br />
1996;313:1624-6.<br />
2 Barry D. Differential recall bias and spurious associations in case/control<br />
studies. Stat Med 1996;15:2603-16.<br />
3 Schulz KF, Chalmers I, Hayes R, Altman DG. Empirical evidence of bias:<br />
dimensions of methodological quality associated with estimates of treatment<br />
effects in controlled trials. JAMA 1995;273:408-12.<br />
4 Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der<br />
Meulen JH, et al. Empirical evidence of design-related bias in studies of<br />
diagnostic tests. JAMA 1999;282:1061-6.<br />
[Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF. Kenneth F Schulz, vice president, Quantitative Sciences, Family Health International, PO Box 13950, Research Triangle Park, NC 27709, USA. Correspondence to: D G Altman. BMJ 2001;323:446-7]<br />
Statistics <strong>Notes</strong><br />
Concealing treatment allocation in randomised trials<br />
Douglas G Altman, Kenneth F Schulz<br />
We have previously explained why random allocation<br />
of treatments is a required design feature of controlled<br />
trials 1 and explained how to generate a random allocation<br />
sequence. 2 Here we consider the importance of<br />
concealing the treatment allocation until the patient is<br />
entered into the trial.<br />
Regardless of how the allocation sequence has<br />
been generated—such as <strong>by</strong> simple or stratified<br />
randomisation 2 —there will be a prespecified sequence<br />
of treatment allocations. In principle, therefore, it is<br />
possible to know what treatment the next patient will<br />
get at the time when a decision is taken to consider the<br />
patient for entry into the trial.<br />
The strength of the randomised trial is based on<br />
aspects of design which eliminate various types of bias.<br />
Randomisation of patients to treatment groups<br />
eliminates bias <strong>by</strong> making the characteristics of the<br />
patients in two (or more) groups the same on average,<br />
and stratification with blocking may help to reduce<br />
chance imbalance in a particular trial. 2 All this good<br />
work can be undone if a poor procedure is adopted to<br />
implement the allocation sequence. In any trial one or<br />
more people must determine whether each patient is<br />
eligible for the trial, decide whether to invite the<br />
patient to participate, explain the aims of the trial and<br />
the details of the treatments, and, if the patient agrees<br />
to participate, determine what treatment he or she will<br />
receive.<br />
Suppose it is clear which treatment a patient will<br />
receive if he or she enters the trial (perhaps because<br />
there is a typed list showing the allocation sequence).<br />
Each of the above steps may then be compromised<br />
because of conscious or subconscious bias. Even when<br />
the sequence is not easily available, there is strong<br />
anecdotal evidence of frequent attempts to discover<br />
the sequence through a combination of a misplaced<br />
belief that this will be beneficial to patients and lack of<br />
understanding of the rationale of randomisation. 3<br />
How can the allocation sequence be concealed?<br />
Firstly, the person who generates the allocation<br />
sequence should not be the person who determines<br />
eligibility and entry of patients. Secondly, if possible the<br />
mechanism for treatment allocation should use people<br />
not involved in the trial. A common procedure,<br />
especially in larger trials, is to use a central telephone<br />
randomisation system. Here patient details are<br />
supplied, eligibility confirmed, and the patient entered<br />
into the trial before the treatment allocation is divulged<br />
(and it may still be blinded 4 ). Another excellent allocation<br />
concealment mechanism, common in drug trials,<br />
is to get the allocation done <strong>by</strong> a pharmacy. The interventions<br />
are sealed in serially numbered containers<br />
(usually bottles) of equal appearance and weight<br />
according to the allocation sequence.<br />
If external help is not available the only other<br />
system that provides a plausible defence against allocation<br />
bias is to enclose assignments in serially<br />
numbered, opaque, sealed envelopes. Apart from<br />
neglecting to mention opacity, this is the method used<br />
in the famous 1948 streptomycin trial (see box). This<br />
Description of treatment allocation in the MRC<br />
streptomycin trial 5<br />
“Determination of whether a patient would be treated<br />
<strong>by</strong> streptomycin and bed-rest (S case) or <strong>by</strong> bed-rest<br />
alone (C case) was made <strong>by</strong> reference to a statistical<br />
series based on random sampling numbers drawn up<br />
for each sex at each centre <strong>by</strong> Professor Bradford Hill;<br />
the details of the series were unknown to any of the<br />
investigators or to the co-ordinator and were<br />
contained in a set of sealed envelopes, each bearing on<br />
the outside only the name of the hospital and a<br />
number. After acceptance of a patient <strong>by</strong> the panel,<br />
and before admission to the streptomycin centre, the<br />
appropriate numbered envelope was opened at the<br />
central office; the card inside told if the patient was to<br />
be an S or a C case, and this information was then<br />
given to the medical officer of the centre.”<br />
method is not immune to corruption, 3 particularly if<br />
poorly executed. However, with care, it can be a good<br />
mechanism for concealing allocation. We recommend<br />
that investigators ensure that the envelopes are opened<br />
sequentially, and only after the participant’s name and<br />
other details are written on the appropriate envelope. 3<br />
If possible, that information should also be transferred<br />
to the assigned allocation <strong>by</strong> using pressure sensitive<br />
paper or carbon paper inside the envelope. If an investigator<br />
cannot use numbered containers, envelopes<br />
represent the best available allocation concealment<br />
mechanism without involving outside parties, and may<br />
sometimes be the only feasible option. We suspect,<br />
however, that in years to come we will see greater use of<br />
external “third party” randomisation.<br />
The public health benefits of mobile phones<br />
The bread and butter of public health on call is<br />
identifying contacts in the case of suspected<br />
meningococcal disease. On the whole this is<br />
straightforward but can occasionally cause difficulties.<br />
Most areas that I have worked in include several<br />
universities, and during October it is common to<br />
experience the problem of contact tracing in the<br />
student population.<br />
There are two main problems. The first is how to<br />
define household contacts when the index patient lives<br />
in a hall of residence containing several hundred<br />
students. Finding the appropriate university protocol<br />
and not being too concerned about the different<br />
approaches adopted <strong>by</strong> neighbouring universities can<br />
reduce the number of sleepless nights. The second<br />
problem is harder. “Close kissing contacts” among 18<br />
year olds who have been set free from parental control<br />
for the first time is a minefield. My experience suggests<br />
that it is best to assume there will be lots and that names<br />
and contact details will not necessarily have been<br />
obtained. By the end of a weekend on call, you will feel<br />
like a cross between a detective and an “agony aunt.”<br />
One year I volunteered to cover Christmas weekend<br />
in the belief that at least the students would be gone by<br />
then. I could not have been more mistaken. To add a<br />
further difficulty, the index patient presented to<br />
hospital on the night of the last day of term, and all<br />
contacts had already set off to the far reaches of the<br />
country. I could not believe my luck when the friend<br />
BMJ VOLUME 323 25 AUGUST 2001 bmj.com<br />
The desirability of concealing the allocation was<br />
recognised in the streptomycin trial 5 (see box). Yet the<br />
importance of this key element of a randomised trial<br />
has not been widely recognised. Empirical evidence of<br />
the bias associated with failure to conceal the<br />
allocation 6 7 and explicit requirement to discuss this<br />
issue in the CONSORT statement 8 seem to be leading<br />
to wider recognition that allocation concealment is an<br />
essential aspect of a randomised trial.<br />
Allocation concealment is completely different<br />
from (double) blinding. 4 It is possible to conceal the<br />
randomisation in every randomised trial. Also,<br />
allocation concealment seeks to eliminate selection<br />
bias (who gets into the trial and the treatment they are<br />
assigned). By contrast, blinding relates to what happens<br />
after randomisation, is not possible in all trials, and<br />
seeks to reduce ascertainment bias (assessment of<br />
outcome).<br />
1 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209.<br />
2 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.<br />
3 Schulz KF. Subverting randomization in controlled trials. JAMA 1995;274:1456-8.<br />
4 Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504.<br />
5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. BMJ 1948;2:769-82.<br />
6 Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998;352:609-13.<br />
7 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.<br />
8 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637-9.<br />
accompanying the patient produced both their mobile<br />
phones and confidently reassured me that between the<br />
two of them they would have the mobile numbers of<br />
all 15 “household” contacts. She was right, and in just<br />
over two hours all of them had been contacted.<br />
There has been much coverage in the medical and<br />
popular press about the potential health hazards of<br />
mobile phones, and if these fears are realised the 100%<br />
ownership among this small sample of students is<br />
worrying. However, in terms of contact tracing for<br />
suspected meningococcal disease, mobile phones have<br />
potential health benefits not just for their owners but<br />
also for the mental health of public health doctors. Of<br />
course, this may not solve the “close kissing contact”<br />
problem.<br />
Debbie Lawlor senior lecturer in epidemiology and public<br />
health, University of Bristol<br />
Education and debate<br />
Statistics Notes<br />
Analysing controlled trials with baseline and follow up<br />
measurements<br />
Andrew J Vickers, Douglas G Altman<br />
In many randomised trials researchers measure a continuous<br />
variable at baseline and again as an outcome<br />
assessed at follow up. Baseline measurements are common<br />
in trials of chronic conditions where researchers<br />
want to see whether a treatment can reduce<br />
pre-existing levels of pain, anxiety, hypertension, and<br />
the like.<br />
<strong>Statistical</strong> comparisons in such trials can be made<br />
in several ways. Comparison of follow up (post-treatment)<br />
scores will give a result such as “at the end<br />
of the trial, mean pain scores were 15 mm (95% confidence<br />
interval 10 to 20 mm) lower in the treatment<br />
group.” Alternatively a change score can be calculated<br />
by subtracting the follow up score from the baseline<br />
score, leading to a statement such as “pain reductions<br />
were 20 mm (16 to 24 mm) greater on treatment than<br />
control.” If the average baseline scores are the same in<br />
each group the estimated treatment effect will be the<br />
same using these two simple approaches. If the<br />
treatment is effective the statistical significance of the<br />
treatment effect by the two methods will depend on the<br />
correlation between baseline and follow up scores. If<br />
the correlation is low using the change score will add<br />
variation and the follow up score is more likely to show<br />
a significant result. Conversely, if the correlation is high<br />
using only the follow up score will lose information<br />
and the change score is more likely to be significant. It<br />
is incorrect, however, to choose whichever analysis<br />
gives a more significant finding. The method of analysis<br />
should be specified in the trial protocol.<br />
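This dependence on the correlation can be made concrete through the variance of a change score. A minimal sketch in Python (the standard deviations and correlations are illustrative values, not trial data):<br />

```python
def change_score_variance(sd_baseline, sd_follow, rho):
    """Variance of (follow up - baseline) when the two measurements
    have correlation rho: var = sd_F**2 + sd_B**2 - 2*rho*sd_B*sd_F."""
    return sd_baseline ** 2 + sd_follow ** 2 - 2 * rho * sd_baseline * sd_follow

sd = 15.0  # assume the same SD at baseline and follow up
var_follow_only = sd ** 2                             # 225: follow up score alone
var_change_low = change_score_variance(sd, sd, 0.2)   # ~360: change score noisier
var_change_high = change_score_variance(sd, sd, 0.8)  # ~90: change score less noisy
```

With equal standard deviations the change score has variance 2(1 − r) times the follow up variance, so analysing change beats analysing the follow up score alone exactly when r exceeds 0.5.<br />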
Some use change scores to take account of chance<br />
imbalances at baseline between the treatment groups.<br />
However, analysing change does not control for<br />
baseline imbalance because of regression to the<br />
mean 1 2: baseline values are negatively correlated with<br />
change because patients with low scores at baseline<br />
generally improve more than those with high scores. A<br />
better approach is to use analysis of covariance<br />
(ANCOVA), which, despite its name, is a regression<br />
method. 3 In effect two parallel straight lines (linear<br />
regression) are obtained relating outcome score to<br />
baseline score in each group. They can be summarised<br />
as a single regression equation:<br />
follow up score =<br />
constant + a×baseline score + b×group<br />
where a and b are estimated coefficients and group<br />
is a binary variable coded 1 for treatment and 0 for<br />
control. The coefficient b is the effect of interest—the<br />
estimated difference between the two treatment<br />
groups. In effect an analysis of covariance adjusts each<br />
patient’s follow up score for his or her baseline score,<br />
but has the advantage of being unaffected by baseline<br />
differences. If, <strong>by</strong> chance, baseline scores are worse in<br />
the treatment group, the treatment effect will be<br />
underestimated <strong>by</strong> a follow up score analysis and overestimated<br />
by looking at change scores (because of<br />
regression to the mean). By contrast, analysis of covariance<br />
gives the same answer whether or not there is<br />
baseline imbalance.<br />
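The regression can be fitted by ordinary least squares with two predictors. A minimal sketch (the data below are synthetic and exactly linear, so the known coefficients are recovered; this is not the trial data, and a full analysis would also compute standard errors):<br />

```python
def ancova_fit(baseline, follow, group):
    """Fit follow = c + a*baseline + b*group by ordinary least squares,
    solving the 3x3 normal equations X'X beta = X'y by Gauss-Jordan
    elimination with partial pivoting. Returns (c, a, b)."""
    X = [[1.0, x, g] for x, g in zip(baseline, group)]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * y for row, y in zip(X, follow)) for i in range(3)]
    A = [r[:] + [v] for r, v in zip(XtX, Xty)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return tuple(A[i][3] / A[i][i] for i in range(3))

# Synthetic check: data generated from follow = 10 + 0.5*baseline + 5*group
baseline = [40.0, 50.0, 60.0, 45.0, 55.0, 65.0]
group = [0, 0, 0, 1, 1, 1]
follow = [10 + 0.5 * x + 5 * g for x, g in zip(baseline, group)]
c, a, b = ancova_fit(baseline, follow, group)
```

The fitted b is the adjusted treatment effect, the quantity of interest in the analysis of covariance described above.<br />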
As an illustration, Kleinhenz et al randomised 52<br />
patients with shoulder pain to either true or sham acupuncture. 4<br />
Patients were assessed before and after<br />
treatment using a 100 point rating scale of pain and<br />
function, with lower scores indicating poorer outcome.<br />
There was an imbalance between groups at baseline,<br />
with better scores in the acupuncture group (see table).<br />
Analysis of post-treatment scores is therefore biased.<br />
The authors analysed change scores, but as baseline<br />
and change scores are negatively correlated (about<br />
r= −0.25 within groups) this analysis underestimates<br />
the effect of acupuncture. From analysis of covariance<br />
we get:<br />
follow up score =<br />
24 + 0.71×baseline score + 12.7×group<br />
(see figure). The coefficient for group (b) has a useful<br />
interpretation: it is the difference between the mean<br />
change scores of each group. In the above example it<br />
can be interpreted as “pain and function score<br />
improved <strong>by</strong> an estimated 12.7 points more on average<br />
in the treatment group than in the control group.” A<br />
95% confidence interval and P value can also be calculated<br />
for b (see table). 5 The regression equation<br />
provides a means of prediction: a patient with a<br />
baseline score of 50, for example, would be predicted<br />
to have a follow up score of 72.2 on treatment and 59.5<br />
on control.<br />
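The prediction step is plain arithmetic from the fitted equation (the function name is mine):<br />

```python
def predicted_follow_up(baseline, group, constant=24.0, a=0.71, b=12.7):
    """Predicted follow up score from the fitted ANCOVA equation
    follow up = 24 + 0.71 x baseline + 12.7 x group."""
    return constant + a * baseline + b * group

on_treatment = predicted_follow_up(50, 1)  # 24 + 35.5 + 12.7 = 72.2
on_control = predicted_follow_up(50, 0)    # 24 + 35.5 = 59.5
```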
[Author details: Andrew J Vickers, assistant attending research methodologist, Integrative Medicine Service, Biostatistics Service, Memorial Sloan-Kettering Cancer Center, New York, New York 10021, USA; Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF. Correspondence to: Dr Vickers, vickersa@mskcc.org. BMJ 2001;323:1123-4 (10 November 2001)]<br />
Results of trial of acupuncture for shoulder pain 4 (pain scores, mean (SD))<br />
Analysis | Placebo group (n=27) | Acupuncture group (n=25) | Difference between means (95% CI) | P value<br />
Baseline | 53.9 (14) | 60.4 (12.3) | 6.5 | -<br />
Follow up | 62.3 (17.9) | 79.6 (17.1) | 17.3 (7.5 to 27.1) | 0.0008<br />
Change score* | 8.4 (14.6) | 19.2 (16.1) | 10.8 (2.3 to 19.4) | 0.014<br />
ANCOVA | - | - | 12.7 (4.1 to 21.3) | 0.005<br />
*Analysis reported by authors.<br />
[Figure: Pretreatment and post-treatment scores in each group, with fitted lines. Squares show mean values for the two groups. The estimated difference between the groups from analysis of covariance is the vertical distance between the two lines]<br />
An additional advantage of analysis of covariance is that it generally has greater statistical power to detect a treatment effect than the other methods. 6 For example, a trial with a correlation between baseline and follow up scores of 0.6 that required 85 patients for analysis of follow up scores would require 68 for a change score analysis but only 54 for analysis of covariance.<br />
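These figures follow from the variance ratios of the three analyses. A rough check of the arithmetic (assuming equal variances at baseline and follow up, and using the standard ratios 2(1 − r) for change scores and 1 − r² for analysis of covariance; the function name and rounding are mine):<br />

```python
def relative_sample_sizes(n_follow_up, rho):
    """Approximate sample sizes needed by a change-score analysis and
    by ANCOVA, relative to an analysis of follow up scores alone.

    Change scores need a fraction 2*(1 - rho) of the follow-up-score
    sample; ANCOVA needs a fraction (1 - rho**2)."""
    n_change = round(n_follow_up * 2 * (1 - rho))
    n_ancova = round(n_follow_up * (1 - rho ** 2))
    return n_change, n_ancova

n_change, n_ancova = relative_sample_sizes(85, 0.6)  # (68, 54), as in the text
```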
The efficiency gains of analysis of covariance compared<br />
with a change score are low when there is a high<br />
correlation (say r > 0.8) between baseline and follow up<br />
measurements. This will often be the case, particularly<br />
in stable chronic conditions such as obesity. In these<br />
situations, analysis of change scores can be a reasonable alternative, particularly if restricted randomisation is used to ensure baseline comparability between groups. 7 Analysis of covariance is the preferred general approach, however.<br />
As with all analyses of continuous data, the use of analysis of covariance depends on some assumptions that need to be tested. In particular, data transformation, such as taking logarithms, may be indicated. 8 Lastly, analysis of covariance is a type of multiple regression and can be seen as a special type of adjusted analysis. The analysis can thus be expanded to include additional prognostic variables (not necessarily continuous), such as age and diagnostic group.<br />
We thank Dr J Kleinhenz for supplying the raw data from his study.<br />
1 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499.<br />
2 Bland JM, Altman DG. Some examples of regression towards the mean. BMJ 1994;309:780.<br />
3 Senn S. Baseline comparisons in randomized clinical trials. Stat Med 1991;10:1157-9.<br />
4 Kleinhenz J, Streitberger K, Windeler J, Gussbacher A, Mavridis G, Martin E. Randomised clinical trial comparing the effects of acupuncture and a newly designed placebo needle in rotator cuff tendonitis. Pain 1999;83:235-41.<br />
5 Altman DG, Gardner MJ. Regression and correlation. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books, 2000:73-92.<br />
6 Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol 2001;1:16.<br />
7 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.<br />
8 Bland JM, Altman DG. The use of transformation when comparing two means. BMJ 1996;312:1153.<br />
A memorable patient<br />
Informed consent<br />
I first met Ivy three years ago when she came for her 29th oesophageal dilatation. She was an 86 year old spinster, deaf without speech from childhood, and the only sign language she knew was thumbs up, which she would use for saying good morning or for showing happiness. She had no next of kin and had lived in a residential home for the past 50 years. She developed a benign oesophageal stricture in 1992 and came to the endoscopy unit for repeated dilatations. The carers in the residential home used to say that she enjoyed her “days out” at the endoscopy unit.<br />
We would explain the procedure to her in sign language. She would use the thumbs up sign and make a cross on the dotted line on the consent form. She would enter the endoscopy room smiling, put her left arm out to be cannulated, turn to her left side for endoscopy, and when fully awake would show her thumbs up again. Every time after her dilatation the nursing staff would question why an expandable oesophageal stent was not being considered. We would conclude that the indications for an expandable stent in benign strictures are not well established.<br />
Her need for dilatation was becoming more frequent, and so on her 46th dilatation we decided to refer her to our regional centre for the insertion of a stent. She had an expandable stent inserted, and in his report the endoscopist mentioned the risk of the stent migrating down into the stomach beyond the stricture. Six weeks later she developed a bolus obstruction. At endoscopy it was noted that the stent had indeed migrated down. She consented to another stent. Four weeks later she had another bolus obstruction that could not be completely removed at the first attempt, and she was brought back the following day for removal of the bolus by endoscopy.<br />
She came to the endoscopy room but did not have her familiar smile. She looked around for a minute, got off her trolley, and walked out. Everyone in the endoscopy room understood that she was trying to say, “I’ve had enough.”<br />
She did not come back for a repeat endoscopy, and she stayed nil by mouth on intravenous fluids. Two weeks later she died of an aspiration pneumonia. We think she understood all the procedures she had agreed to. We also think it was informed consent. I hope we were right. She gave us a very clear message without saying a word on her last visit to the endoscopy room.<br />
Do we really understand what aphasic patients are trying to tell us when we get informed consent for invasive procedures? We should try to read the non-verbal messages very carefully.<br />
I Tiwari associate specialist in gastroenterology, Broomfield Hospital, Chelmsford<br />
Papers p569<br />
[Author details: J Martin Bland, professor of medical statistics, Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE; Douglas G Altman, professor of statistics in medicine, Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF. Correspondence to: Professor Bland, mbland@sghms.ac.uk. BMJ 2002;324:606-7]<br />
The quality and reliability of health information on the internet remains of paramount concern in Europe, as elsewhere. Self regulatory codes of ethics for health websites abound, yet the quality and practices of many are highly questionable.<br />
Little progress seems to have been made, moreover, in assuring consumers that the information they share with health websites will not be misused. Several US studies have already concluded that websites’ privacy practices do not match their proclaimed policies. 5 In an attempt to counter this erosion of trust in Europe, the European Commission’s guidelines for quality criteria for health related websites have recognised that there is no shortage of legislation in the field of privacy and security. 6 They have drawn specific attention to a new recommendation regarding online data collection adopted in May 2001 that explains how European directives on issues such as data protection should be applied to the most common processing tasks carried out via the internet. 7<br />
The challenge facing Europe’s health professionals and policymakers is to carefully craft the development of new approaches to the supervision of medical and pharmaceutical practice. Their ultimate goal is to raise consumers’ confidence in online healthcare. They must ensure that the mechanisms are put in place whereby health professionals themselves can benefit from using the internet, while still ensuring the highest standards of medical practice.<br />
Avienda was formerly known as the Centre for Law Ethics and Risk in Telemedicine.<br />
Competing interests: None declared.<br />
1 http://news.bbc.co.uk/hi/english/uk/england/newsid_1752000/1752670.stm (accessed 5 Feb 2002).<br />
2 Case C-322/01: Reference for a preliminary ruling by the Landgericht Frankfurt am Main by order of that court of 10 August 2001 in the case of Deutscher Apothekerverband e.V. against DocMorris NV and Jacques Waterval. Official Journal of the European Communities No C 2001 December 8:348/10.<br />
3 Council Directive 1992/28/EEC of 31 March 1992 on the advertising of medicinal products for human use. (Articles 1(3) and 3(1).) Official Journal of the European Communities No L 1995 11 February:32/26.<br />
4 Directive 2000/31/EC on mutual recognition of primary medical and specialist medical qualifications and minimum standards of training. Official Journal of the European Communities No L 2001 July 31:206/1-51.<br />
5 Schwartz J. Medical websites faulted on privacy. Washington Post 2000 February 1.<br />
6 http://europa.eu.int/information_society/eeurope/ehealth/quality/draft_guidelines/index_en.htm (accessed 5 Feb 2002).<br />
7 European Commission. Recommendation 2/2001 on certain minimum requirements for collecting personal data on-line in the European Union. Adopted on 17 May 2001. http://europa.eu.int/comm/internal_market/en/dataprot/wpdocs/wp43en.htm (accessed 25 Jan 2002).<br />
Statistics Notes<br />
Validating scales and indexes<br />
J Martin Bland, Douglas G Altman<br />
An index of quality is a measurement like any other, whether it is assessing a website, as in today’s BMJ, 1 a clinical trial used in a meta-analysis, 2 or the quality of life experienced by a patient. 3 As with all measurements, we have to decide whether it measures what we want it to measure, and how well.<br />
The simplest measurements, such as length and distance, can be validated by an objective criterion. The earliest criteria must have been biological: the length of a pace, a foot, a thumb. The obvious problem, that the criterion varies from person to person, was eventually solved by establishing a fundamental unit and defining all others in terms of it. Other measurements can then be defined in terms of a fundamental unit. To define a unit of weight we find a handy substance which appears the same everywhere, such as water. The unit of weight is then the weight of a volume of water specified in the basic unit of length, such as 100 cubic centimetres. Such measurements have criterion validity, meaning that we can take some known quantity and compare our measurement with it.<br />
For some measurements no such standard is possible. Cardiac stroke volume, for example, can be measured only indirectly. Direct measurement, by collecting all the blood pumped out of the heart over a series of beats, would involve rather drastic interference with the system. Our criterion becomes agreement with another indirect measurement. Indeed, we sometimes have to use as a standard a method which we know produces inaccurate measurements.<br />
Some quantities are even more difficult to measure<br />
and evaluate. Cardiac stroke volume does at least have<br />
an objective reality; a physical quantity of blood is<br />
pumped out of the heart when it beats. Anxiety and<br />
depression do not have a physical reality but are useful<br />
artificial constructs. They are measured <strong>by</strong> questionnaire<br />
scales, where answers to a series of questions<br />
related to the concept we want to measure are<br />
combined to give a numerical score. Website quality is<br />
similar. We are measuring a quantity which is not precisely<br />
defined, and there is no instrument with which<br />
we can compare any measure we might devise. How<br />
are we to assess the validity of such a scale?<br />
The relevant theory was developed in the social sciences<br />
in the context of questionnaire scales. 4 First we<br />
might ask whether the scale looks right, whether it asks<br />
about the sorts of thing which we think of as being<br />
related to anxiety or website quality. If it appears to be<br />
correct, we call this face validity. Next we might ask<br />
whether it covers all the aspects which we want to<br />
measure. A phobia scale which asked about fear of<br />
dogs, spiders, snakes, and cats but ignored height, confined<br />
spaces, and crowds would not do this. We call<br />
appropriate coverage of the subject matter content<br />
validity.<br />
Our scale may look right and cover the right things,<br />
but what other evidence can we bring to the question<br />
of validity? One question we can ask is whether our<br />
score has the relationships with other variables that we<br />
would expect. For example, does an anxiety measure<br />
distinguish between psychiatric patients and medical<br />
patients? Do we get different anxiety scores from<br />
students before and after an examination? Does a<br />
measure of depression predict suicide attempts? We<br />
call the property of having appropriate relationships<br />
with other variables construct validity.<br />
We can also ask whether the items which together<br />
compose the scale are related to one another: does the<br />
scale have internal consistency? If not, do the items really<br />
measure the same thing? On the other hand, if the<br />
items are too similar, some may be redundant. Highly<br />
correlated items in a scale may make the scale overlong<br />
and may lead to some aspects being overemphasised,<br />
impairing the content validity. A handy summary<br />
measure for this feature is Cronbach’s alpha. 5<br />
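Cronbach’s alpha compares the variability of the individual items with the variability of the total score; it rises as the items covary more strongly. A minimal sketch (the function name and toy item scores are illustrative):<br />

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a multi-item scale.

    `items` is a list of lists, one per item, each holding that item's
    score for every respondent. Alpha = k/(k-1) * (1 - sum of item
    variances / variance of total scores)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Two identical items: perfect internal consistency, alpha = 1.
alpha_perfect = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
# Two loosely related items: alpha = 0.75 for this toy data.
alpha_partial = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

As the text notes, a very high alpha is not automatically good: near-duplicate items inflate it while narrowing content validity.<br />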
A scale must also be repeatable and be sufficiently<br />
objective to give similar results for different observers.<br />
If a measurement is repeatable, in that someone who<br />
has a high score on one occasion tends to have a high<br />
score on another, it must be measuring something.<br />
With physical measurements, it is often possible for the<br />
same observer (or different observers) to make<br />
repeated measurements in quick succession. When<br />
there is a subjective element in the measurement the<br />
observer can be blinded from their first measurement,<br />
and different observers can make simultaneous measurements. In assessing the reliability of a website quality scale, it is easy to get several observers to apply the scale independently. With websites, repeat assessments need to be close in time because their content changes frequently (as does bmj.com). With questionnaires, either self administered or recorded by an observer, repeat measurements need to be far enough apart in time for the earlier responses to be forgotten, yet not so far apart that the underlying quantity being measured might have changed. Such data enable us to evaluate test-retest reliability. If two measures have comparable face, content, and construct validity the more repeatable one may be preferred for the study of a given population.<br />
1 Gagliardi A, Jadad AR. Examination of instruments used to rate quality of health information on the internet: chronicle of a voyage with an unclear destination. BMJ 2002;324:569-73.<br />
2 Jüni P, Altman DG, Egger M. Assessing the quality of controlled clinical trials. BMJ 2001;323:42-6.<br />
3 Muldoon MF, Barger SD, Flory JD, Manuck B. What are quality of life measurements measuring? BMJ 1998;316:542-5.<br />
4 Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd ed. Oxford: Oxford University Press, 1996.<br />
5 Bland JM, Altman DG. Cronbach’s alpha. BMJ 1997;314:572.<br />
Honour a physician with the honour due unto him<br />
A few years ago my general practitioner told me that anyone aged over 40 with upper abdominal discomfort needed investigating. At the local teaching hospital, a pleasant young doctor did a gastroscopy, which showed a mass in my stomach wall. I was sent for a barium meal. A consultant radiologist took the x ray films, instructing me briskly to turn this way and that but not otherwise paying me any attention. He told me to wait a few minutes while he checked the films to see if all the views were satisfactory. I sat alone in the room for about five minutes.<br />
From the moment the consultant re-entered I could see that he was slightly agitated. “I’m terribly sorry,” he called out as he came through the door at the far end. And then again, “I’m terribly sorry.” Perhaps these words of regret, coupled with the concern on his face, might not have had the effect they did had I not been a man with an abdominal mass on his mind. At this moment of truth and reckoning, certain visions swam before my eyes.<br />
Three strides later, he was in front of me and looking me full in the face: “I’m terribly sorry, I hadn’t realised you were a doctor.” In his hand was the request form, and I could see that my general practitioner had written “ex-SR here” in one corner. He must have spotted this when checking the form as he looked at the preliminary plates. Though no further x rays were needed, he proceeded a little breathlessly to deliver three or four minutes of almost a caricature of caring, empathic interest in a patient. What branch of medicine was I in, and where did I work? Good heavens, that must be tough. Is that an Australian accent I hear? A St Mary’s old boy, ah yes. What did I think about. . .?<br />
I don’t mean to imply that this was insincere, merely splendidly different from his earlier matter of factness and economy of word. I had thought nothing of this at the time: in such a bread and butter procedure I had<br />
no more reason to expect the doctor to engage with<br />
me as a person than I would the phlebotomist taking a<br />
routine blood sample. Clearly, this consultant saw<br />
things similarly as a rule, but when the patient was a<br />
doctor the aesthetics of the encounter changed. He<br />
had apologised three times for what he felt was a lapse<br />
on his part, arising from his failure to notice what was<br />
written in the corner of the request form. Perhaps he<br />
thought I knew that my general practitioner had<br />
written this and that I expected this of a medical<br />
referral, and thus expected to be recognised by him<br />
not just as a patient but also as a colleague. He seemed<br />
to see this as my due. (As it happens, I didn’t.)<br />
I had forgotten this incident, but it was brought back<br />
to me by the aftermath of the Bristol cardiac surgery<br />
debacle, and by the publicity surrounding other recent<br />
medical scandals. These have all put a spotlight on<br />
relations between doctors, who seem to offer each<br />
other acknowledgement and empathy, as my<br />
consultant had sought belatedly to do to me. The<br />
general public may be coming to suspect that this<br />
collegiate solidarity is somehow not in their interests,<br />
associating it with mutual protectiveness and thus with<br />
cover-ups of medical malpractice. It is too soon to say<br />
how the profession will react, but my consultant was an<br />
older man and my guess is that, with younger<br />
generations of doctors, we will see the waning of a<br />
tradition whose roots lie with Hippocrates. For it was<br />
his oath that bound doctors to look well on each other<br />
(and not charge each other for their services).<br />
It’s another story, but I found out later that the mass<br />
was the gastroscopy instrument itself distorting the<br />
stomach wall, misdiagnosed by an inexperienced<br />
registrar. No special treatment there, anyway.<br />
Derek Summerfield consultant psychiatrist, CASCAID,<br />
South London and Maudsley NHS Trust, London<br />
Education and debate<br />
Statistics Notes<br />
Interaction revisited: the difference between two estimates<br />
Douglas G Altman, J Martin Bland<br />
We often want to compare two estimates of the same<br />
quantity derived from separate analyses. Thus we might<br />
want to compare the treatment effect in subgroups in a<br />
randomised trial, such as two age groups. The term for<br />
such a comparison is a test of interaction. In earlier Statistics<br />
Notes we discussed interaction in terms of heterogeneity<br />
of treatment effect.1–3 Here we revisit interaction<br />
and consider the concept more generally.<br />
The comparison of two estimated quantities, such as<br />
means or proportions, each with its standard error, is a<br />
general method that can be applied widely. The two estimates<br />
should be independent, not obtained from the<br />
same individuals—examples are the results from<br />
subgroups in a randomised trial or from two independent<br />
studies. The samples should be large. If the estimates<br />
are E1 and E2 with standard errors SE(E1) and SE(E2),<br />
then the difference d = E1 − E2 has standard error<br />
SE(d) = √[SE(E1)² + SE(E2)²] (that is, the square root of the<br />
sum of the squares of the separate standard errors). This<br />
formula is an example of a well known relation: that the<br />
variance of the difference between two estimates is the<br />
sum of the separate variances (here the variance is the<br />
square of the standard error). Then the ratio z = d/SE(d)<br />
gives a test of the null hypothesis that in the population<br />
the difference d is zero, by comparing the value of z to<br />
the standard normal distribution. The 95% confidence<br />
interval for the difference is d − 1.96SE(d) to d + 1.96SE(d).<br />
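The recipe just described is easy to script. A minimal sketch in Python (ours, not from the Note; the function name and the example numbers are invented for illustration):<br />

```python
import math

def compare_estimates(e1, se1, e2, se2):
    """Compare two independent estimates, each with its standard error.

    Returns the difference, its standard error, the z statistic,
    and the 95% confidence interval for the difference.
    """
    d = e1 - e2
    # Variance of a difference = sum of the separate variances.
    se_d = math.sqrt(se1 ** 2 + se2 ** 2)
    z = d / se_d
    lo, hi = d - 1.96 * se_d, d + 1.96 * se_d
    return d, se_d, z, lo, hi

# Hypothetical example: two means of 10 and 7, each with standard error 1.
d, se_d, z, lo, hi = compare_estimates(10, 1, 7, 1)
print(round(z, 2), round(lo, 2), round(hi, 2))
```

With these invented numbers the difference is 3 with standard error √2, giving z = 2.12 and a 95% confidence interval of 0.23 to 5.77.<br />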
We illustrated this for means and proportions,3<br />
although we did not show how to get the standard<br />
error of the difference. Here we consider comparing<br />
relative risks or odds ratios. These measures are always<br />
analysed on the log scale because the distributions of<br />
the log ratios tend to be closer to normal than those of<br />
the ratios themselves.<br />
In a meta-analysis of non-vertebral fractures in randomised<br />
trials of hormone replacement therapy the<br />
estimated relative risk from 22 trials was 0.73 (P=0.02) in<br />
favour of hormone replacement therapy.4 From 14 trials<br />
of women aged on average <60 years the relative risk<br />
was 0.67 (95% confidence interval 0.46 to 0.98; P=0.03).<br />
From eight trials of women aged ≥60 the relative risk<br />
was 0.88 (0.71 to 1.08; P=0.22). In other words, in<br />
younger women the estimated treatment benefit was a<br />
33% reduction in risk of fracture, which was statistically<br />
significant, compared with a 12% reduction in older<br />
women, which was not significant. But are the relative<br />
risks from the subgroups significantly different from<br />
each other? We show how to answer this question using<br />
just the summary data quoted.<br />
Because the calculations were made on the log scale,<br />
comparing the two estimates is complex (see table). We<br />
need to obtain the logs of the relative risks and their<br />
confidence intervals (rows 2 and 4).5 As 95% confidence<br />
intervals are obtained as 1.96 standard errors either side<br />
of the estimate, the SE of each log relative risk is<br />
obtained by dividing the width of its confidence interval<br />
by 2×1.96 (row 6). The estimated difference in log<br />
relative risks is d = E1 − E2 = −0.2726 and its standard error<br />
BMJ VOLUME 326 25 JANUARY 2003 bmj.com<br />
0.2206 (row 8). From these two values we can test the<br />
interaction and estimate the ratio of the relative risks<br />
(with confidence interval). The test of interaction is the<br />
ratio of d to its standard error: z=−0.2726/<br />
0.2206=−1.24, which gives P=0.2 when we refer it to a<br />
table of the normal distribution. The estimated<br />
interaction effect is exp(−0.2726)=0.76. (This value can<br />
also be obtained directly as 0.67/0.88=0.76.) The<br />
confidence interval for this effect is −0.7050 to 0.1598<br />
on the log scale (row 9). Transforming back to the relative<br />
risk scale, we get 0.49 to 1.17 (row 12). There is thus<br />
no good evidence to support a different treatment effect<br />
in younger and older women.<br />
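The whole calculation in the table can be reproduced from the published summary data alone. A short Python sketch (ours, not part of the original Note; standard library only):<br />

```python
import math

def log_rr_and_se(rr, ci_low, ci_high):
    """Recover the log relative risk and its standard error from a
    relative risk and its 95% confidence interval (rows 2-6)."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    return math.log(rr), se

e1, se1 = log_rr_and_se(0.67, 0.46, 0.98)  # younger women
e2, se2 = log_rr_and_se(0.88, 0.71, 1.08)  # older women

d = e1 - e2                           # row 7: difference in log relative risks
se_d = math.sqrt(se1**2 + se2**2)     # row 8
z = d / se_d                          # row 10: test of interaction
p = math.erfc(abs(z) / math.sqrt(2))  # two sided P from the normal distribution
rrr = math.exp(d)                     # row 11: ratio of relative risks
ci = (math.exp(d - 1.96 * se_d),      # row 12: 95% CI for the ratio
      math.exp(d + 1.96 * se_d))

print(round(z, 2), round(rrr, 2), round(ci[0], 2), round(ci[1], 2))
```

This reproduces the table: z = −1.24 (two sided P about 0.22, quoted as 0.2), with a ratio of relative risks of 0.76 (95% confidence interval 0.49 to 1.17).<br />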
The same approach is used for comparing odds<br />
ratios. Comparing means or regression coefficients is<br />
simpler as there is no log transformation. The two estimates<br />
must be independent: the method should not be<br />
used to compare a subset with the whole group, or two<br />
estimates from the same patients.<br />
There is limited power to detect interactions, even in<br />
a meta-analysis combining the results from several studies.<br />
As this example illustrates, even when the two<br />
estimates and P values seem very different the test of<br />
interaction may not be significant. It is not sufficient for<br />
the relative risk to be significant in one subgroup and<br />
not in another. Conversely, it is not correct to assume<br />
that when two confidence intervals overlap the two estimates<br />
are not significantly different.6 Statistical analysis<br />
should be targeted on the question in hand, and not<br />
based on comparing P values from separate analyses. 2<br />
1 Altman DG, Matthews JNS. Interaction 1: Heterogeneity of effects. BMJ 1996;313:486.<br />
2 Matthews JNS, Altman DG. Interaction 2: Compare effect sizes not P values. BMJ 1996;313:808.<br />
3 Matthews JNS, Altman DG. Interaction 3: How to examine heterogeneity. BMJ 1996;313:862.<br />
4 Torgerson DJ, Bell-Syer SEM. Hormone replacement therapy and prevention of nonvertebral fractures. A meta-analysis of randomized trials. JAMA 2001;285:2891-7.<br />
5 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700.<br />
6 Bland M, Peacock J. Interpreting statistics with confidence. Obstetrician and Gynaecologist (in press).<br />
Calculations for comparing two estimated relative risks<br />
Education and debate<br />
Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF<br />
Douglas G Altman professor of statistics in medicine<br />
Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE<br />
J Martin Bland professor of medical statistics<br />
Correspondence to: D G Altman doug.altman@cancer.org.uk<br />
BMJ 2003;326:219<br />
Group 1 | Group 2<br />
1 RR | 0.67 | 0.88<br />
2 *Log RR | −0.4005 (E1) | −0.1278 (E2)<br />
3 95% CI for RR | 0.46 to 0.98 | 0.71 to 1.08<br />
4 *95% CI for log RR | −0.7765 to −0.0202 | −0.3425 to 0.0770<br />
5 Width of CI | 0.7563 | 0.4195<br />
6 SE [=width/(2×1.96)] | 0.1929 | 0.1070<br />
Difference between log relative risks<br />
7 d [=E1−E2] | −0.4005 − (−0.1278) = −0.2726<br />
8 SE(d) | √(0.1929² + 0.1070²) = 0.2206<br />
9 CI(d) | −0.2726 ± 1.96×0.2206, or −0.7050 to 0.1598<br />
10 Test of interaction | z = −0.2726/0.2206 = −1.24 (P=0.2)<br />
Ratio of relative risks (RRR)<br />
11 RRR = exp(d) | exp(−0.2726) = 0.76<br />
12 CI(RRR) | exp(−0.7050) to exp(0.1598), or 0.49 to 1.17<br />
*Values obtained by taking natural logarithms of values on preceding row.<br />
Statistics Notes<br />
The logrank test<br />
J Martin Bland, Douglas G Altman<br />
We often wish to compare the survival experience of<br />
two (or more) groups of individuals. For example, the<br />
table shows survival times of 51 adult patients with<br />
recurrent malignant gliomas1 tabulated by type of<br />
tumour and indicating whether the patient had died or<br />
was still alive at analysis—that is, their survival time was<br />
censored.2 As the figure shows, the survival curves differ,<br />
but is this sufficient to conclude that in the population<br />
patients with anaplastic astrocytoma have better<br />
survival than patients with glioblastoma?<br />
We could compute survival curves 3 for each group<br />
and compare the proportions surviving at any specific<br />
time. The weakness of this approach is that it does not<br />
provide a comparison of the total survival experience of<br />
the two groups, but rather gives a comparison at some<br />
arbitrary time point(s). In the figure the difference in<br />
survival is greater at some times than others and eventually<br />
becomes zero. We describe here the logrank test, the<br />
most popular method of comparing the survival of<br />
groups, which takes the whole follow up period into<br />
account. It has the considerable advantage that it does<br />
not require us to know anything about the shape of the<br />
survival curve or the distribution of survival times.<br />
The logrank test is used to test the null hypothesis<br />
that there is no difference between the populations in<br />
the probability of an event (here a death) at any time<br />
point. The analysis is based on the times of events<br />
(here deaths). For each such time we calculate the<br />
observed number of deaths in each group and the<br />
number expected if there were in reality no difference<br />
between the groups. The first death was in week 6,<br />
when one patient in group 1 died. At the start of this<br />
week, there were 51 subjects alive in total, so the risk of<br />
death in this week was 1/51. There were 20 patients in<br />
group 1, so, if the null hypothesis were true, the<br />
expected number of deaths in group 1 is 20 × 1/51 =<br />
0.39. Likewise, in group 2 the expected number of<br />
deaths is 31 × 1/51 = 0.61. The second event<br />
occurred in week 10, when there were two deaths.<br />
There were now 19 and 31 patients at risk (alive) in the<br />
two groups, one having died in week 6, so the<br />
probability of death in week 10 was 2/50. The<br />
expected numbers of deaths were 19 × 2/50 = 0.76<br />
and 31 × 2/50 = 1.24 respectively.<br />
The same calculations are performed each time an<br />
event occurs. If a survival time is censored, that<br />
individual is considered to be at risk of dying in the<br />
week of the censoring but not in subsequent weeks.<br />
This way of handling censored observations is the<br />
same as for the Kaplan-Meier survival curve. 3<br />
From the calculations for each time of death, the<br />
total numbers of expected deaths were 22.48 in group<br />
1 and 19.52 in group 2, and the observed numbers of<br />
deaths were 14 and 28. We can now use a χ² test of the<br />
null hypothesis. The test statistic is the sum of<br />
(O − E)²/E for each group, where O and E are the totals of<br />
the observed and expected events. Here (14 − 22.48)²/22.48<br />
+ (28 − 19.52)²/19.52 = 6.88. The degrees of<br />
freedom are the number of groups minus one, i.e.<br />
[Figure: Survival curves for women with glioma by diagnosis, showing survival proportion (0 to 1.0) against time (0 to 208 weeks) for the astrocytoma and glioblastoma groups]<br />
2 − 1 = 1. From a table of the χ² distribution we get<br />
P < 0.01, so that the difference between the groups is<br />
statistically significant. There is a different method of<br />
calculating the test statistic, 4 but we prefer this<br />
approach as it extends easily to several groups. It is<br />
also possible to test for a trend in survival across<br />
ordered groups. 4 Although we have shown how the<br />
calculation is made, we strongly recommend the use of<br />
statistical software.<br />
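For readers who want to check the hand calculation against software, the whole procedure can be sketched in a few lines of Python (our code, not from the article), using the glioma data from the table:<br />

```python
import math

# Weeks to death or censoring from the table; True = died, False = censored (*).
astro = [(6, True), (13, True), (21, True), (30, True), (31, False),
         (37, True), (38, True), (47, False), (49, True), (50, True),
         (63, True), (79, True), (80, False), (82, False), (82, False),
         (86, True), (98, True), (149, False), (202, True), (219, True)]
glio = [(10, True), (10, True), (12, True), (13, True), (14, True),
        (15, True), (16, True), (17, True), (18, True), (20, True),
        (24, True), (24, True), (25, True), (28, True), (30, True),
        (33, True), (34, False), (35, True), (37, True), (40, True),
        (40, True), (40, False), (46, True), (48, True), (70, False),
        (76, True), (81, True), (82, True), (91, True), (112, True),
        (181, True)]

def logrank(group1, group2):
    """Logrank test for two groups of (time, died) pairs.

    At each distinct death time, the expected deaths in a group are
    (number at risk in the group) * (total deaths at that time) /
    (total at risk).  A subject censored at time t is still counted
    as at risk at t, as described in the text.
    """
    groups = (group1, group2)
    death_times = sorted({t for g in groups for t, died in g if died})
    observed = [sum(died for _, died in g) for g in groups]
    expected = [0.0, 0.0]
    for t in death_times:
        at_risk = [sum(1 for time, _ in g if time >= t) for g in groups]
        deaths = sum(1 for g in groups for time, died in g if time == t and died)
        for i in (0, 1):
            expected[i] += at_risk[i] * deaths / sum(at_risk)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, chi2

obs, exp, chi2 = logrank(astro, glio)
p = math.erfc(math.sqrt(chi2 / 2))  # P value for chi squared with 1 df
print(obs, [round(e, 2) for e in exp], round(chi2, 2))
```

Run on the data above, this reproduces the totals in the text: observed deaths 14 and 28, expected deaths 22.48 and 19.52, χ² = 6.88, P < 0.01.<br />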
The logrank test is based on the same assumptions<br />
as the Kaplan-Meier survival curve3—namely, that censoring<br />
is unrelated to prognosis, the survival probabilities<br />
are the same for subjects recruited early and late in<br />
the study, and the events happened at the times specified.<br />
Deviations from these assumptions matter most if<br />
they are satisfied differently in the groups being<br />
compared, for example if censoring is more likely in<br />
one group than another.<br />
The logrank test is most likely to detect a difference<br />
between groups when the risk of an event is<br />
consistently greater for one group than another. It is<br />
unlikely to detect a difference when survival curves<br />
cross, as can happen when comparing a medical with a<br />
surgical intervention. When analysing survival data, the<br />
survival curves should always be plotted.<br />
Because the logrank test is purely a test of<br />
significance it cannot provide an estimate of the size of<br />
the difference between the groups or a confidence<br />
interval. For these we must make some assumptions<br />
about the data. Common methods use the hazard ratio,<br />
including the Cox proportional hazards model, which<br />
we shall describe in a future Statistics Note.<br />
Competing interests: None declared.<br />
1 Rostomily RC, Spence AM, Duong D, McCormick K, Bland M, Berger MS. Multimodality management of recurrent adult malignant gliomas: results of a phase II multiagent chemotherapy study and analysis of cytoreductive surgery. Neurosurgery 1994;35:378-8.<br />
2 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-9.<br />
3 Bland JM, Altman DG. Survival probabilities. The Kaplan-Meier method. BMJ 1998;317:1572.<br />
4 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991:371-5.<br />
Department of Health Sciences, University of York, York YO10 5DD<br />
J Martin Bland professor of health statistics<br />
Cancer Research UK/NHS Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF<br />
Douglas G Altman professor of statistics in medicine<br />
Correspondence to: Professor Bland<br />
BMJ 2004;328:1073<br />
Weeks to death or censoring in 51 adults with recurrent gliomas1 (A=astrocytoma, G=glioblastoma)<br />
A: 6, 13, 21, 30, 31*, 37, 38, 47*, 49, 50, 63, 79, 80*, 82*, 82*, 86, 98, 149*, 202, 219<br />
G: 10, 10, 12, 13, 14, 15, 16, 17, 18, 20, 24, 24, 25, 28, 30, 33, 34*, 35, 37, 40, 40, 40*, 46, 48, 70*, 76, 81, 82, 91, 112, 181<br />
*Censored survival time (still alive at follow up).<br />
Papers<br />
What is already known on this topic<br />
Epidural analgesia during labour is effective but<br />
has been associated with increased rates of<br />
instrumental delivery<br />
Studies have included women of mixed parity and<br />
high concentrations of epidural anaesthetic<br />
What this study adds<br />
Epidural infusions with low concentrations of local<br />
anaesthetic are unlikely to increase the risk of<br />
caesarean section in nulliparous women<br />
Although epidural analgesia is associated with an<br />
increased risk of instrumental vaginal delivery,<br />
operator bias cannot be excluded<br />
Epidural analgesia is associated with a longer<br />
second stage labour and increased oxytocin<br />
requirement, but the importance of these is<br />
unclear as maternal analgesia and neonatal<br />
outcome may be better with epidural analgesia<br />
Differences in protocols for management of labour<br />
could have contributed to the differences in rates of<br />
instrumental vaginal delivery.<br />
Another limitation was the large number of women<br />
who changed from parenteral opioids to epidural<br />
analgesia. Our intention to treat approach would likely<br />
render any estimation of the effects of epidural analgesia<br />
more conservative, but it was necessary to prevent<br />
selection bias. Comparing a policy of offering epidural<br />
analgesia with one of offering parenteral opioids<br />
reflects real life. As the definitions of stages of labour<br />
varied between trials, we were unable to determine if<br />
epidural analgesia prolonged the first stage, and the<br />
actual duration can only be estimated. Unlike previous<br />
reviews, we focused on nulliparous women because the<br />
indications for, and risks of, caesarean section differ<br />
with parity.<br />
We limited our analysis to trials that used infusions<br />
of bupivacaine with concentrations of 0.125% or less,<br />
to reflect current practice. In a randomised controlled<br />
trial, low concentrations have been shown to reduce<br />
the rate of instrumental delivery. 7<br />
Epidural analgesia may increase the risk of<br />
instrumental delivery <strong>by</strong> several mechanisms. Reduction<br />
of serum oxytocin levels can result in a weakening<br />
of uterine activity. 9 10 This may be due in part to intravenous<br />
fluid infusions being given before epidural<br />
analgesia, reducing oxytocin secretion. 11 The increased<br />
use of oxytocin after starting epidural analgesia may<br />
indicate attempts at speeding up labour. Maternal<br />
efforts at expulsion can also be impaired, causing fetal<br />
malposition during descent. 12 Previously, the association<br />
of neonatal morbidity and mortality with longer<br />
labour (second stage longer than two hours) had justified<br />
expediting delivery, leading to increased rates of<br />
instrumental delivery. 13<br />
Delaying pushing until the fetus’s head is visible or<br />
until one hour after reaching full cervical dilation may<br />
reduce the incidence of instrumental delivery and its<br />
attendant morbidity. 13 Although women receiving<br />
epidural analgesia had a longer second stage labour,<br />
this was not associated with poorer neonatal outcome<br />
in our analysis.<br />
We thank Christopher R Palmer (University of Cambridge) for<br />
guiding us and Shiv Sharma (University of Texas Southwestern<br />
Medical Centre) for helping with the data.<br />
Contributors: See bmj.com<br />
Funding: None.<br />
Competing interests: None declared.<br />
Ethical approval: Not required.<br />
1 Hawkins JL, Koonin LM, Palmer SK, Gibbs CP. Anesthesia-related deaths during obstetric delivery in the United States, 1979-1990. Anesthesiology 1997;86:277-84.<br />
2 Chestnut DH. Anesthesia and maternal mortality. Anesthesiology 1997;86:273-6.<br />
3 Roberts CL, Algert CS, Douglas I, Tracy SK, Peat B. Trends in labour and birth interventions among low-risk women in New South Wales. Aust NZ J Obstet Gynaecol 2002;42:176-81.<br />
4 Halpern SH, Leighton BL, Ohlsson A, Barrett JF, Rice A. Effect of epidural vs parenteral opioid analgesia on the progress of labor: a meta-analysis. JAMA 1998;280:2105-10.<br />
5 Zhang J, Klebanoff MA, DerSimonian R. Epidural analgesia in association with duration of labor and mode of delivery: a quantitative review. Am J Obstet Gynecol 1999;180:970-7.<br />
6 Howell CJ. Epidural versus non-epidural analgesia for pain relief in labour. Cochrane Database Syst Rev 2000;CD000331.<br />
7 Comparative Obstetric Mobile Epidural Trial (COMET) Study Group UK. Effect of low-dose mobile versus traditional epidural techniques on mode of delivery: a randomised controlled trial. Lancet 2001;358:19-23.<br />
8 Scottish Intercollegiate Guidelines Network. Methodology checklist 2: randomised controlled trials. www.sign.ac.uk/guidelines/fulltext/50/checklist2.html (accessed 7 Mar 2004).<br />
9 Goodfellow CF, Hull MG, Swaab DF, Dogterom J, Buijs RM. Oxytocin deficiency at delivery with epidural analgesia. Br J Obstet Gynaecol 1983;90:214-9.<br />
10 Bates RG, Helm CW, Duncan A, Edmonds DK. Uterine activity in the second stage of labour and the effect of epidural analgesia. Br J Obstet Gynaecol 1985;92:1246-50.<br />
11 Cheek TG, Samuels P, Miller F, Tobin M, Gutsche BB. Normal saline i.v. fluid load decreases uterine activity in active labour. Br J Anaesth 1996;77:632-5.<br />
12 Newton ER, Schroeder BC, Knape KG, Bennett BL. Epidural analgesia and uterine function. Obstet Gynecol 1995;85:749-55.<br />
13 Miller AC. The effects of epidural analgesia on uterine activity and labor. Int J Obstet Anesth 1997;6:2-18.<br />
(Accepted 18 March 2004)<br />
doi 10.1136/bmj.38097.590810.7C<br />
Corrections and clarifications<br />
The logrank test<br />
An error in this Statistics Notes article by J Martin<br />
Bland and Douglas G Altman persisted to<br />
publication (1 May, p 1073). The third sentence—<br />
relating to the values in a table of survival<br />
times—should have ended: “but is this sufficient to<br />
conclude that in the population, patients with<br />
anaplastic astrocytoma have better [not ‘worse’]<br />
survival than patients with glioblastoma?”<br />
Minerva<br />
In the seventh item in the 17 April issue (p 964),<br />
Minerva admits to not being religious. Maybe this is<br />
why she reported that in the Bible Daniel has 21<br />
chapters. It doesn’t; it has only 12.<br />
Exploring perspectives on death<br />
In this News article <strong>by</strong> Caroline White (17 April,<br />
p 911), we referred to average life expectancy as<br />
running to 394 000 minutes—a rather brief life<br />
span, some might think, given that there are only<br />
525 600 minutes in one non-leap year. Assuming<br />
that the average life expectancy referred to in the<br />
article was intended to relate to UK males (that is,<br />
75 years), some straightforward maths reveals that<br />
75 years (even discounting the extra length of leap<br />
years) in fact contain 39 420 000 minutes. Quite a<br />
lot of minutes were lost somewhere in our<br />
publication process.<br />
Screening and Test Evaluation Program, School of Public Health, University of Sydney, NSW 2006, Australia<br />
Jonathan J Deeks senior research biostatistician<br />
Cancer Research UK/NHS Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF<br />
Douglas G Altman professor of statistics in medicine<br />
Correspondence to: Mr Deeks Jon.Deeks@cancer.org.uk<br />
BMJ 2004;329:168–9<br />
Statistics Notes<br />
Diagnostic tests 4: likelihood ratios<br />
Jonathan J Deeks, Douglas G Altman<br />
The properties of a diagnostic or screening test are often<br />
described using sensitivity and specificity or predictive<br />
values, as described in previous Notes.1 2 Likelihood<br />
ratios are alternative statistics for summarising diagnostic<br />
accuracy, which have several particularly powerful<br />
properties that make them more useful clinically than<br />
other statistics. 3<br />
Each test result has its own likelihood ratio, which<br />
summarises how many times more (or less) likely<br />
patients with the disease are to have that particular<br />
result than patients without the disease. More formally,<br />
it is the ratio of the probability of the specific test result<br />
in people who do have the disease to the probability in<br />
people who do not.<br />
A likelihood ratio greater than 1 indicates that the<br />
test result is associated with the presence of the disease,<br />
whereas a likelihood ratio less than 1 indicates that the<br />
test result is associated with the absence of disease. The<br />
further likelihood ratios are from 1 the stronger the<br />
evidence for the presence or absence of disease. Likelihood<br />
ratios above 10 and below 0.1 are considered to<br />
provide strong evidence to rule in or rule out<br />
diagnoses respectively in most circumstances. 4 When<br />
tests report results as being either positive or negative<br />
the two likelihood ratios are called the positive<br />
likelihood ratio and the negative likelihood ratio.<br />
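For such a binary test the two ratios follow directly from sensitivity and specificity, by the standard relations LR+ = sensitivity/(1 − specificity) and LR− = (1 − sensitivity)/specificity (implied rather than spelled out in this Note). A minimal Python sketch with invented example values:<br />

```python
def binary_likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios for a yes/no test.

    LR+: probability of a positive result in the diseased divided
         by that in the non-diseased, i.e. sensitivity/(1 - specificity).
    LR-: probability of a negative result in the diseased divided
         by that in the non-diseased, i.e. (1 - sensitivity)/specificity.
    """
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Hypothetical test with sensitivity 0.90 and specificity 0.80.
lr_pos, lr_neg = binary_likelihood_ratios(0.90, 0.80)
print(round(lr_pos, 2), round(lr_neg, 3))
```

For this invented test LR+ is 4.5 and LR− is 0.125, so a positive result gives moderate evidence for disease and a negative result moderate evidence against it.<br />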
The table shows the results of a study of the value of<br />
a history of smoking in diagnosing obstructive airway<br />
disease. 5 Smoking history was categorised into four<br />
groups according to pack years smoked (packs per day<br />
× years smoked). The likelihood ratio for each category<br />
is calculated by dividing the percentage of patients with<br />
obstructive airway disease in that category by the<br />
percentage without the disease in that category. For<br />
example, among patients with the disease 28.4% had 40+<br />
smoking pack years compared with just 1.4% of<br />
patients without the disease. The likelihood ratio is<br />
thus 28.4/1.4 = 20.3 (20.4 when calculated from the unrounded proportions, as in the table).<br />
40 pack years is strongly predictive of a diagnosis of<br />
obstructive airway disease as the likelihood ratio is substantially<br />
higher than 10. Although never smoking or<br />
smoking less than 20 pack years both point to not having<br />
obstructive airway disease, their likelihood ratios<br />
are not small enough to rule out the disease with<br />
confidence.<br />
Likelihood ratios are ratios of probabilities, and can be treated in the same way as risk ratios for the purposes of calculating confidence intervals.6<br />
Smoking habit (pack years) | Obstructive airway disease Yes (n (%)) | No (n (%)) | Likelihood ratio | 95% CI<br />
≥40 | 42 (28.4) | 2 (1.4) | (42/148)/(2/144)=20.4 | 5.04 to 82.8<br />
20-40 | 25 (16.9) | 24 (16.7) | (25/148)/(24/144)=1.01 | 0.61 to 1.69<br />
0-20 | 29 (19.6) | 51 (35.4) | (29/148)/(51/144)=0.55 | 0.37 to 0.82<br />
Never smoked or smoked for<br />
test results. Forcing dichotomisation on multicategory<br />
test results may discard useful diagnostic information.<br />
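The likelihood ratios and confidence intervals in the table can be reproduced from the raw counts. A Python sketch (ours; the confidence interval treats the likelihood ratio as a ratio of two proportions, in line with the table footnote citing reference 6):<br />

```python
import math

def likelihood_ratio_ci(a, n_disease, b, n_no_disease):
    """Likelihood ratio for one test-result category, with 95% CI.

    a, b: patients with this result among those with / without disease;
    n_disease, n_no_disease: the two column totals.
    The SE of the log ratio uses the standard risk-ratio formula.
    """
    lr = (a / n_disease) / (b / n_no_disease)
    se_log = math.sqrt(1/a - 1/n_disease + 1/b - 1/n_no_disease)
    lo = math.exp(math.log(lr) - 1.96 * se_log)
    hi = math.exp(math.log(lr) + 1.96 * se_log)
    return lr, lo, hi

# >=40 pack years: 42/148 with disease v 2/144 without.
lr, lo, hi = likelihood_ratio_ci(42, 148, 2, 144)
print(round(lr, 1), round(lo, 2), round(hi, 1))
```

This reproduces the top row of the table: a likelihood ratio of 20.4 with 95% confidence interval 5.04 to 82.8.<br />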
Likelihood ratios can be used to help adapt the<br />
results of a study to your patients. To do this they make<br />
use of a mathematical relationship known as Bayes<br />
theorem that describes how a diagnostic finding<br />
changes our knowledge of the probability of abnormality.3<br />
The post-test odds that the patient has the<br />
disease are estimated by multiplying the pretest odds<br />
by the likelihood ratio. The use of odds rather than<br />
risks makes the calculation slightly complex (box) but a<br />
nomogram can be used to avoid having to make<br />
conversions between odds and probabilities (figure). 7<br />
Both the figure and the box illustrate how a prior<br />
probability of obstructive airway disease of 0.1 (based,<br />
say, on presenting features) is updated to a probability<br />
of 0.7 with the knowledge that the patient had smoked<br />
for more than 40 pack years.<br />
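The pretest-to-post-test calculation described here is short enough to script directly. A minimal Python sketch (ours, not from the Note), using the pretest probability of 0.1 and the likelihood ratio of 20.4 from the table:<br />

```python
def post_test_probability(pretest_prob, likelihood_ratio):
    """Update a pretest probability using Bayes theorem via odds.

    Post-test odds = pretest odds x likelihood ratio;
    the odds are then converted back to a probability.
    """
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Pretest probability 0.1; likelihood ratio 20.4 (>=40 pack years).
p = post_test_probability(0.1, 20.4)
print(round(p, 2))
```

This gives a post-test probability of about 0.69, matching the value of roughly 0.7 read off the nomogram in the figure.<br />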
In clinical practice it is essential to know how a<br />
particular test result predicts the risk of abnormality.<br />
Sensitivities and specificities 1 do not do this: they<br />
describe how abnormality (or normality) predicts<br />
particular test results. Predictive values 2 do give<br />
probabilities of abnormality for particular test results,<br />
but depend on the prevalence of abnormality in the<br />
A memorable patient<br />
Living history<br />
I was finally settling down at my desk when the pager<br />
bleeped: it was the outpatients’ department. An extra<br />
patient had been added to the afternoon list—would I<br />
see him?<br />
The patient was a slightly built man in his 60s. He<br />
had brought recent documentation from another<br />
hospital. I asked about his presenting complaint.<br />
“Well, I’ll try, but I wasn’t aware of everything that<br />
happened. That’s why I’ve brought my wife—she was<br />
with me at the time.”<br />
This was turning out to be one of those perfect<br />
neurological consultations: documents from another<br />
hospital, a witness account, an articulate patient. The<br />
only question would be whether it was seizure,<br />
syncope, or transient ischaemic attack. As we went<br />
through his medical history, I studied his records and<br />
for the first time noticed the phrase “Tetralogy of<br />
Fallot.”<br />
“Yes, my lifelong diagnosis,” he smiled. “I was<br />
operated on.”<br />
I saw the faint dusky blue colour of his lips.<br />
“Blalock-Taussig shunt?” I asked, as dim memories<br />
from medical school somehow came back into focus.<br />
“Yes, twice. By Blalock.” He paused. “In those days<br />
the operation was only done at Johns Hopkins—all the<br />
patients went there. I remember my second operation<br />
well. I was 12 at the time.”<br />
“And the doctors?” I asked.<br />
“Yes, especially Dr Taussig. She would come around<br />
with her entourage every so often. Deaf as a post, she<br />
was.”<br />
“What? Taussig deaf?”<br />
“Yes. She had an amplifier attached to her<br />
stethoscope to examine patients.”<br />
I asked to examine him, only too aware that it was<br />
more for my benefit than his.<br />
A half smile suggested that he had read my<br />
thoughts: “Of course.”<br />
study sample and can rarely be generalised beyond the<br />
study (except when the study is based on a suitable<br />
random sample, as is sometimes the case for<br />
population screening studies). Likelihood ratios provide<br />
a solution, as they can be used to calculate the<br />
probability of abnormality while allowing for the<br />
varying prior probability of abnormality in different<br />
contexts.<br />
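The calculation behind this use of likelihood ratios is the odds form of Bayes' theorem: post-test odds = pre-test odds × likelihood ratio. A minimal Python sketch (not part of the original note; the 25% pre-test probability and likelihood ratio of 4 are invented for illustration):

```python
def post_test_probability(pre_test_p: float, lr: float) -> float:
    """Bayes' theorem in odds form: post-test odds = pre-test odds x LR."""
    pre_odds = pre_test_p / (1 - pre_test_p)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Illustrative numbers: a 25% pre-test probability and a likelihood ratio of 4
print(round(post_test_probability(0.25, 4.0), 3))  # 0.571
```

The same test result therefore yields a different post-test probability in a screening clinic than in a referral clinic, because the pre-test probability differs.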
1 Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ<br />
1994;308:1552.<br />
2 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ<br />
1994;309:102.<br />
3 Sackett DL, Straus S, Richardson WS, Rosenberg W, Haynes RB. Evidence-based<br />
medicine. How to practise and teach EBM. 2nd ed. Edinburgh: Churchill<br />
Livingstone, 2000:67-93.<br />
4 Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D,<br />
eds. Users’ guides to the medical literature. Chicago: AMA Press,<br />
2002:121-40.<br />
5 Straus SE, McAlister FA, Sackett DL, Deeks JJ. The accuracy of patient<br />
history, wheezing, and laryngeal measurements in diagnosing obstructive<br />
airway disease. CARE-COAD1 Group. Clinical assessment of the<br />
reliability of the examination-chronic obstructive airways disease. JAMA<br />
2000;283:1853-7.<br />
6 Altman DG. Diagnostic tests. In: Altman DG, Machin D, Bryant TN,<br />
Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books,<br />
2000:105-19.<br />
7 Fagan TJ. Letter: Nomogram for Bayes theorem. N Engl J Med<br />
1975;293:257.<br />
To even my unpractised technique, his<br />
cardiovascular signs were a museum piece: absent left<br />
subclavian pulse, big arciform scars on the anterior<br />
chest created by the surgeon who had saved his life,<br />
central cyanosis, right-sided systolic murmurs, loud<br />
pulmonary valve closure sound (iatrogenic pulmonary<br />
hypertension, I reasoned), pulsatile liver—all these and<br />
undoubtedly more noted by the physician who first<br />
understood his condition.<br />
That night, I read about his doctors. Helen Taussig<br />
indeed had had substantial hearing impairment, a<br />
disability that would have meant the end of a career in<br />
cardiology for a less able clinician. I also learnt of the<br />
greater challenges of sexual prejudice that she fought<br />
and overcame all her life. I learnt about Alfred Blalock,<br />
the young doctor denied a residency at Johns Hopkins<br />
only to be invited back in his later years to head its<br />
surgical unit.<br />
The experiences of a life in medicine are sometimes<br />
overwhelming. For weeks, I reflected on the<br />
perspectives opened to me by this unassuming patient.<br />
The curious irony of a man with a life threatening<br />
condition who had outlived his saviours; the<br />
extraordinary vision of his all-too-human doctors; the<br />
opportunity to witness history played out in the course<br />
of a half-hour consultation. And memories were<br />
jogged, too: the words of my former professor of<br />
medicine, who showed us cases of “Fallot’s” and first<br />
told us about Taussig, the woman cardiologist; the<br />
portrait of Blalock adorning a surgical lecture theatre<br />
in medical college.<br />
(And my patient’s neurological examination and<br />
investigations? Non-contributory. I still don’t know<br />
whether it was seizure, syncope, or transient ischaemic<br />
attack.)<br />
Giridhar P Kalamangalam clinical fellow in epilepsy,<br />
department of neurology, Cleveland Clinic Foundation,<br />
Cleveland OH, USA<br />
Education and debate<br />
169
Statistics Notes<br />
Treatment allocation by minimisation<br />
Douglas G Altman, J Martin Bland<br />
In almost all controlled trials treatments are allocated<br />
by randomisation. Blocking and stratification can be<br />
used to ensure balance between groups in size and<br />
patient characteristics. 1 But stratified randomisation<br />
using several variables is not effective in small trials.<br />
The only widely acceptable alternative approach is<br />
minimisation, 2 3 a method of ensuring excellent<br />
balance between groups for several prognostic factors,<br />
even in small samples. With minimisation the<br />
treatment allocated to the next participant enrolled in<br />
the trial depends (wholly or partly) on the characteristics<br />
of those participants already enrolled. The aim is<br />
that each allocation should minimise the imbalance<br />
across multiple factors.<br />
Table 1 shows some baseline characteristics in a<br />
controlled trial comparing two types of counselling in<br />
relation to dietary intake. 4 Minimisation was used for<br />
the four variables shown, and the two groups were<br />
clearly very similar in all of these variables. Such good<br />
balance for important prognostic variables helps the<br />
credibility of the comparisons. How is it achieved?<br />
Minimisation is based on a different principle from<br />
randomisation. The first participant is allocated a treatment<br />
at random. For each subsequent participant we<br />
determine which treatment would lead to better<br />
balance between the groups in the variables of interest.<br />
The dietary behaviour trial used minimisation<br />
based on the four variables in table 1. Suppose that<br />
after 40 patients had entered this trial the numbers in<br />
each subgroup in each treatment group were as shown<br />
in table 2. (Note that two or more categories need to be<br />
constructed for continuous variables.)<br />
The next enrolled participant is a black woman<br />
aged 52, who is a non-smoker. If we were to allocate her<br />
to behavioural counselling, the imbalance would be<br />
increased in sex distribution (12+1 v 11 women), in age<br />
(7+1 v 5 aged > 50), and in smoking (14+1 v 12<br />
non-smoking) and decreased in ethnicity (4+1 v 5<br />
black). We formalise this by summing over the four<br />
variables the numbers of participants with the same<br />
characteristics as this new recruit already in the trial:<br />
Behavioural: 12 (sex) + 7 (age) + 4 (ethnicity) + 14 (smoking) = 37<br />
Nutrition: 11 + 5 + 5 + 12 = 33<br />
Imbalance is minimised <strong>by</strong> allocating this person to<br />
the group with the smaller total (or at random if the<br />
totals are the same). Allocation to behavioural counsel-<br />
Table 1 Baseline characteristics in two groups 4<br />
                          Behavioural counselling   Nutrition counselling<br />
Women                     82                        84<br />
Mean (SD) age (years)     43.3 (13.8)               43.2 (14.0)<br />
Ethnicity:<br />
  White                   94                        96<br />
  Black                   37                        32<br />
  Asian                   3                         5<br />
Current smokers           47                        44<br />
BMJ VOLUME 330 9 APRIL 2005 bmj.com<br />
Table 2 Hypothetical distribution of baseline characteristics after<br />
40 patients had been enrolled in the trial<br />
                  Behavioural counselling (n=20)   Nutrition counselling (n=20)<br />
Women             12                               11<br />
Age >50           7                                5<br />
Ethnicity:<br />
  White           15                               15<br />
  Black           4                                5<br />
  Asian           1                                0<br />
Current smokers   6                                8<br />
ling would increase the imbalance; allocation to nutrition<br />
would decrease it.<br />
At this point there are two options. The chosen<br />
treatment could simply be taken as the one with the<br />
lower score; or we could introduce a random element.<br />
We use weighted randomisation so that there is a high<br />
chance (eg 80%) of each participant getting the<br />
treatment that minimises the imbalance. The use of a<br />
random element will slightly worsen the overall imbalance<br />
between the groups, but balance will be much<br />
better for the chosen variables than with simple<br />
randomisation. A random element also makes the allocation<br />
more unpredictable, although minimisation is a<br />
secure allocation system when used by an independent<br />
person.<br />
After the treatment is determined for the current<br />
participant the numbers in each group are updated<br />
and the process repeated for each subsequent<br />
participant. If at any time the totals for the two groups<br />
are the same, then the choice should be made using<br />
simple randomisation. The method extends to trials of<br />
more than two treatments.<br />
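The procedure just described can be sketched in a few lines of Python (a toy illustration, not the minim program; the group names and counts reproduce the worked example from Table 2, and the 80% weighting is the one suggested in the text):

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Participants already in each group who share each characteristic of the
# new recruit; the counts reproduce Table 2 (40 patients enrolled).
counts = {
    "behavioural": {"woman": 12, "age>50": 7, "black": 4, "non-smoker": 14},
    "nutrition":   {"woman": 11, "age>50": 5, "black": 5, "non-smoker": 12},
}
recruit = ["woman", "age>50", "black", "non-smoker"]

def imbalance_totals(counts, recruit):
    """Sum, per group, the members sharing the new recruit's characteristics."""
    return {g: sum(c[k] for k in recruit) for g, c in counts.items()}

def allocate(counts, recruit, p_favoured=0.8):
    """Minimisation with a random element favouring the lower-total group."""
    t = imbalance_totals(counts, recruit)
    low, high = sorted(t, key=t.get)
    if t[low] == t[high]:
        return random.choice([low, high])   # tie: simple randomisation
    return low if random.random() < p_favoured else high

totals = imbalance_totals(counts, recruit)  # {'behavioural': 37, 'nutrition': 33}
group = allocate(counts, recruit)           # favours "nutrition" (smaller total)
for k in recruit:                           # update the running counts, then
    counts[group][k] += 1                   # repeat for the next participant
```

Setting `p_favoured=1` gives deterministic minimisation; values below 1 trade a little balance for unpredictability, as discussed above.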
Minimisation is a valid alternative to ordinary randomisation,<br />
2 3 5 and has the advantage, especially in<br />
small trials, that there will be only minor differences<br />
between groups in those variables used in the<br />
allocation process. Such balance is especially desirable<br />
where there are strong prognostic factors and modest<br />
treatment effects, as in oncology. Minimisation is<br />
best performed with the aid of software—for example,<br />
minim, a free program. 6 Its use makes trialists think<br />
about prognostic factors at the outset and helps ensure<br />
adherence to the protocol as a trial progresses. 7<br />
1 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-4.<br />
2 Treasure T, MacRae KD. Minimisation: the platinum standard for trials.<br />
BMJ 1998;317:362-3.<br />
3 Scott NW, McPherson GC, Ramsay CR, Campbell MK. The method of<br />
minimization for allocation to clinical trials: a review. Control Clin Trials<br />
2002;23:662-74.<br />
4 Steptoe A, Perkins-Porras L, McKay C, Rink E, Hilton S, Cappuccio FP.<br />
Behavioural counselling to increase consumption of fruit and vegetables<br />
in low income adults: randomised trial. BMJ 2003;326:855-8.<br />
5 Buyse M, McEntegart D. Achieving balance in clinical trials: an<br />
unbalanced view from the European regulators. Applied Clin Trials<br />
2004;13:36-40.<br />
6 Evans S, Royston P, Day S. Minim: allocation by minimisation in clinical<br />
trials. http://www-users.york.ac.uk/~mb55/guide/minim.htm (accessed<br />
24 October 2004).<br />
7 Day S. Commentary: Treatment allocation by the method of<br />
minimisation. <strong>BMJ</strong> 1999;319:947-8.<br />
Cancer Research<br />
UK Medical<br />
Statistics Group,<br />
Centre for Statistics<br />
in Medicine, Oxford<br />
OX3 7LF<br />
Douglas G Altman<br />
professor of statistics<br />
in medicine<br />
Department of<br />
Health Sciences,<br />
University of York,<br />
York YO10 5DD<br />
J Martin Bland<br />
professor of health<br />
statistics<br />
Correspondence to:<br />
D Altman<br />
doug.altman@<br />
cancer.org.uk<br />
BMJ 2005;330:843<br />
Statistics Notes<br />
Standard deviations and standard errors<br />
Douglas G Altman, J Martin Bland<br />
The terms “standard error” and “standard deviation”<br />
are often confused. 1 The contrast between these two<br />
terms reflects the important distinction between data<br />
description and inference, one that all researchers<br />
should appreciate.<br />
The standard deviation (often SD) is a measure of<br />
variability. When we calculate the standard deviation of a<br />
sample, we are using it as an estimate of the variability of<br />
the population from which the sample was drawn. For<br />
data with a normal distribution, 2 about 95% of individuals<br />
will have values within 2 standard deviations of the<br />
mean, the other 5% being equally scattered above and<br />
below these limits. Contrary to popular misconception,<br />
the standard deviation is a valid measure of variability<br />
regardless of the distribution. About 95% of observations<br />
of any distribution usually fall within the 2 standard<br />
deviation limits, though those outside may all be at one<br />
end. We may choose a different summary statistic, however,<br />
when data have a skewed distribution. 3<br />
When we calculate the sample mean we are usually<br />
interested not in the mean of this particular sample, but<br />
in the mean for individuals of this type—in statistical<br />
terms, of the population from which the sample comes.<br />
We usually collect data in order to generalise from them<br />
and so use the sample mean as an estimate of the mean<br />
for the whole population. Now the sample mean will<br />
vary from sample to sample; the way this variation<br />
occurs is described by the “sampling distribution” of the<br />
mean. We can estimate how much sample means will<br />
vary from the standard deviation of this sampling distribution,<br />
which we call the standard error (SE) of the estimate<br />
of the mean. As the standard error is a type of<br />
standard deviation, confusion is understandable.<br />
Another way of considering the standard error is as a<br />
measure of the precision of the sample mean.<br />
The standard error of the sample mean depends<br />
on both the standard deviation and the sample size, by<br />
the simple relation SE = SD/√(sample size). The standard<br />
error falls as the sample size increases, as the extent<br />
of chance variation is reduced—this idea underlies the<br />
sample size calculation for a controlled trial, for<br />
BMJ VOLUME 331 15 OCTOBER 2005 bmj.com<br />
example. By contrast the standard deviation will not<br />
tend to change as we increase the size of our sample.<br />
So, if we want to say how widely scattered some<br />
measurements are, we use the standard deviation. If we<br />
want to indicate the uncertainty around the estimate of<br />
the mean measurement, we quote the standard error of<br />
the mean. The standard error is most useful as a means<br />
of calculating a confidence interval. For a large sample,<br />
a 95% confidence interval is obtained as the values<br />
1.96×SE either side of the mean. We will discuss confidence<br />
intervals in more detail in a subsequent Statistics<br />
Note. The standard error is also used to calculate P values<br />
in many circumstances.<br />
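The distinction can be seen directly in a few lines of Python (a sketch; the sample of 100 weights is simulated, and the 69.4 kg mean and 9.3 kg SD are illustrative values echoing the example given later in this note):

```python
import math
import random
import statistics

random.seed(0)
# A simulated sample of 100 adults' weights in kg (illustrative values).
sample = [random.gauss(69.4, 9.3) for _ in range(100)]

mean = statistics.mean(sample)
sd = statistics.stdev(sample)               # scatter of individual weights
se = sd / math.sqrt(len(sample))            # precision of the sample mean
ci = (mean - 1.96 * se, mean + 1.96 * se)   # large-sample 95% CI for the mean

print(f"mean {mean:.1f} kg, SD {sd:.1f} kg, SE {se:.2f} kg")
print(f"95% CI {ci[0]:.1f} to {ci[1]:.1f} kg")
```

With n = 100 the standard error is a tenth of the standard deviation; quadrupling the sample size would halve it, while the standard deviation itself would stay roughly the same.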
The principle of a sampling distribution applies to<br />
other quantities that we may estimate from a sample,<br />
such as a proportion or regression coefficient, and to<br />
contrasts between two samples, such as a risk ratio or<br />
the difference between two means or proportions. All<br />
such quantities have uncertainty due to sampling variation,<br />
and for all such estimates a standard error can be<br />
calculated to indicate the degree of uncertainty.<br />
In many publications a ± sign is used to join the<br />
standard deviation (SD) or standard error (SE) to an<br />
observed mean—for example, 69.4±9.3 kg. That<br />
notation gives no indication whether the second figure<br />
is the standard deviation or the standard error (or<br />
indeed something else). A review of 88 articles<br />
published in 2002 found that 12 (14%) failed to<br />
identify which measure of dispersion was reported<br />
(and three failed to report any measure of variability). 4<br />
The policy of the BMJ and many other journals is to<br />
remove ± signs and request authors to indicate clearly<br />
whether the standard deviation or standard error is<br />
being quoted. All journals should follow this practice.<br />
Competing interests: None declared.<br />
1 Nagele P. Misuse of standard error of the mean (SEM) when reporting<br />
variability of a sample. A critical evaluation of four anaesthesia journals.<br />
Br J Anaesth 2003;90:514-6.<br />
2 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298.<br />
3 Altman DG, Bland JM. Quartiles, quintiles, centiles, and other quantiles.<br />
BMJ 1994;309:996.<br />
4 Olsen CH. Review of the use of statistics in Infection and Immunity. Infect<br />
Immun 2003;71:6689-92.<br />
Cancer Research<br />
UK/NHS Centre<br />
for Statistics in<br />
Medicine, Wolfson<br />
College, Oxford<br />
OX2 6UD<br />
Douglas G Altman<br />
professor of statistics<br />
in medicine<br />
Department of<br />
Health Sciences,<br />
University of York,<br />
York YO10 5DD<br />
J Martin Bland<br />
professor of health<br />
statistics<br />
Correspondence to:<br />
Prof Altman<br />
doug.altman@<br />
cancer.org.uk<br />
BMJ 2005;331:903<br />
Practice<br />
Cancer Research<br />
UK/NHS Centre<br />
for Statistics in<br />
Medicine, Wolfson<br />
College, Oxford<br />
OX2 6UD<br />
Douglas G Altman<br />
professor of statistics<br />
in medicine<br />
MRC Clinical Trials<br />
Unit, London<br />
NW1 2DA<br />
Patrick Royston<br />
professor of statistics<br />
Correspondence to:<br />
Professor Altman<br />
doug.altman@<br />
cancer.org.uk<br />
BMJ 2006;332:1080<br />
Statistics Notes<br />
The cost of dichotomising continuous variables<br />
Douglas G Altman, Patrick Royston<br />
Measurements of continuous variables are made in all<br />
branches of medicine, aiding in the diagnosis and<br />
treatment of patients. In clinical practice it is helpful to<br />
label individuals as having or not having an attribute,<br />
such as being “hypertensive” or “obese” or having<br />
“high cholesterol,” depending on the value of a<br />
continuous variable.<br />
Categorisation of continuous variables is also common<br />
in clinical research, but here such simplicity is<br />
gained at some cost. Though grouping may help data<br />
presentation, notably in tables, categorisation is unnecessary<br />
for statistical analysis and it has some serious<br />
drawbacks. Here we consider the impact of converting<br />
continuous data to two groups (dichotomising), as this<br />
is the most common approach in clinical research. 1<br />
What are the perceived advantages of forcing all<br />
individuals into two groups? A common argument is<br />
that it greatly simplifies the statistical analysis and leads<br />
to easy interpretation and presentation of results. A<br />
binary split—for example, at the median—leads to a<br />
comparison of groups of individuals with high or low<br />
values of the measurement, leading in the simplest case<br />
to a t test or χ2 test and an estimate of the difference<br />
between the groups (with its confidence interval). There<br />
is, however, no good reason in general to suppose that<br />
there is an underlying dichotomy, and if one exists there<br />
is no reason why it should be at the median. 2<br />
Dichotomising leads to several problems. Firstly,<br />
much information is lost, so the statistical power to<br />
detect a relation between the variable and patient outcome<br />
is reduced. Indeed, dichotomising a variable at<br />
the median reduces power <strong>by</strong> the same amount as<br />
would discarding a third of the data. 2 3 Deliberately discarding<br />
data is surely inadvisable when research<br />
studies already tend to be too small. Dichotomisation<br />
may also increase the risk of a positive result being a<br />
false positive. 4 Secondly, one may seriously underestimate<br />
the extent of variation in outcome between<br />
groups, such as the risk of some event, and<br />
considerable variability may be subsumed within each<br />
group. Individuals close to but on opposite sides of the<br />
cutpoint are characterised as being very different<br />
rather than very similar. Thirdly, using two groups<br />
conceals any non-linearity in the relation between the<br />
variable and outcome. Presumably, many who dichotomise<br />
are unaware of the implications.<br />
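The information loss is easy to demonstrate in a small simulation (illustrative Python, not from the paper): for a normally distributed measurement, the correlation between the variable and its median-split version is about √(2/π) ≈ 0.80, so the dichotomised version retains only about 2/π ≈ 64% of the variance, consistent with the claim about discarding a third of the data.

```python
import math
import random

random.seed(2)
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]    # a continuous measurement
d = [1.0 if v > 0 else 0.0 for v in x]        # split at the true median

def corr(a, b):
    """Pearson correlation from first principles (no external libraries)."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / m
    sa = math.sqrt(sum((u - ma) ** 2 for u in a) / m)
    sb = math.sqrt(sum((v - mb) ** 2 for v in b) / m)
    return cov / (sa * sb)

r = corr(x, d)
print(round(r, 2), round(r * r, 2))  # about 0.80 and 0.64
```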
If dichotomisation is used where should the<br />
cutpoint be? For a few variables there are recognised<br />
cutpoints, such as >25 kg/m² to define “overweight”<br />
based on body mass index. For some variables, such as<br />
age, it is usual to take a round number, usually a multiple<br />
of five or 10. The cutpoint used in previous studies<br />
may be adopted. In the absence of a prior cutpoint<br />
the most common approach is to take the sample<br />
median. However, using the sample median implies<br />
that various cutpoints will be used in different studies<br />
so that their results cannot easily be compared,<br />
seriously hampering meta-analysis of observational<br />
studies. 5 Nevertheless, all these approaches are<br />
preferable to performing several analyses and<br />
choosing that which gives the most convincing result.<br />
Use of this so called “optimal” cutpoint (usually that<br />
giving the minimum P value) runs a high risk of a spuriously<br />
significant result; the difference in the outcome<br />
variable between the groups will be overestimated,<br />
perhaps considerably; and the confidence interval will<br />
be too narrow. 6 7 This strategy should never be used.<br />
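A simulation makes the danger concrete (illustrative Python, not from the paper; it assumes large-sample z tests for simplicity). When the marker is completely unrelated to the outcome, scanning every cutpoint and keeping the smallest P value rejects far more often than the nominal 5%, whereas a prespecified median split does not:

```python
import math
import random

random.seed(3)

def z_p(a, b):
    """Two-sided p value from a large-sample z test for a difference in means."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

n, sims = 100, 200
hits_optimal = hits_median = 0
for _ in range(sims):
    # Null data: the marker x is completely unrelated to the outcome y.
    pairs = sorted((random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n))
    y = [v for _, v in pairs]                  # outcomes ordered by the marker
    min_p = min(z_p(y[:i], y[i:]) for i in range(20, n - 20))
    if min_p < 0.05:                           # "optimal" cutpoint search
        hits_optimal += 1
    if z_p(y[:n // 2], y[n // 2:]) < 0.05:     # prespecified median split
        hits_median += 1

print(hits_median / sims)    # close to the nominal 0.05
print(hits_optimal / sims)   # several times larger: spurious significance
```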
When regression is being used to adjust for the effect<br />
of a confounding variable, dichotomisation will run the<br />
risk that a substantial part of the confounding remains. 4 7<br />
Dichotomisation is not much used in epidemiological<br />
studies, where the use of several categories is preferred.<br />
Using multiple categories (to create an “ordinal”<br />
variable) is generally preferable to dichotomising. With<br />
four or five groups the loss of information can be quite<br />
small, but there are complexities in analysis.<br />
Instead of categorising continuous variables, we prefer<br />
to keep them continuous. We could then use<br />
linear regression rather than a two sample t test, for<br />
example. If we were concerned that a linear regression<br />
would not truly represent the relation between the<br />
outcome and predictor variable, we could explore<br />
whether some transformation (such as a log transformation)<br />
would be helpful. 7 8 As an example, in a regression<br />
analysis to develop a prognostic model for patients with<br />
primary biliary cirrhosis, a carefully developed model<br />
with bilirubin as a continuous explanatory variable<br />
explained 31% more of the variability in the data than<br />
when bilirubin distribution was split at the median. 7<br />
Competing interests: None declared.<br />
1 Del Priore G, Zandieh P, Lee MJ. Treatment of continuous data as<br />
categoric variables in obstetrics and gynecology. Obstet Gynecol<br />
1997;89:351-4.<br />
2 MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of<br />
dichotomization of quantitative variables. Psychol Meth 2002;7:19-40.<br />
3 Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.<br />
4 Austin PC, Brunner LJ. Inflation of the type I error rate when a continuous<br />
confounding variable is categorized in logistic regression analyses.<br />
Stat Med 2004;23:1159-78.<br />
5 Buettner P, Garbe C, Guggenmoos-Holzmann I. Problems in defining<br />
cutoff points of continuous prognostic factors: example of tumor<br />
thickness in primary cutaneous melanoma. J Clin Epidemiol<br />
1997;50:1201-10.<br />
6 Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using<br />
“optimal” cutpoints in the evaluation of prognostic factors. J Natl Cancer<br />
Inst 1994;86:829-35.<br />
7 Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous<br />
predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.<br />
8 Royston P, Sauerbrei W. Building multivariable regression models with<br />
continuous covariates in clinical epidemiology–with an emphasis on<br />
fractional polynomials. Methods Inf Med 2005;44:561-71.<br />
Endpiece<br />
Reading and reflecting<br />
Reading without reflecting is like eating without<br />
digesting.<br />
Edmund Burke<br />
Kamal Samanta, retired general practitioner,<br />
Denby Dale, West Yorkshire<br />
1080 BMJ VOLUME 332 6 MAY 2006 bmj.com<br />
PRACTICE<br />
1 Cancer Research UK/NHS Centre<br />
for Statistics in Medicine, Oxford<br />
OX2 6UD<br />
2 Department of Health Sciences,<br />
University of York, York YO10 5DD<br />
Correspondence to: Professor<br />
Altman<br />
doug.altman@cancer.org.uk<br />
BMJ 2007;334:424<br />
doi:10.1136/bmj.38977.682025.2C<br />
STATISTICS NOTES<br />
Missing data<br />
Douglas G Altman, 1 J Martin Bland 2<br />
Almost all studies have some missing observations.<br />
Yet textbooks and software commonly assume that<br />
data are complete, and the topic of how to handle<br />
missing data is not often discussed outside statistics<br />
journals.<br />
There are many types of missing data and different<br />
reasons for data being missing. Both issues affect the<br />
analysis. Some examples are:<br />
(1) In a postal questionnaire survey not all the<br />
selected individuals respond;<br />
(2) In a randomised trial some patients are lost to<br />
follow-up before the end of the study;<br />
(3) In a multicentre study some centres do not<br />
measure a particular variable;<br />
(4) In a study in which patients are assessed<br />
frequently some data are missing at some time<br />
points for unknown reasons;<br />
(5) Occasional data values for a variable are missing<br />
because some equipment failed;<br />
(6) Some laboratory samples are lost in transit or<br />
technically unsatisfactory;<br />
(7) In a magnetic resonance imaging study some very<br />
obese patients are excluded as they are too large<br />
for the machine;<br />
(8) In a study assessing quality of life some patients<br />
die during the follow-up period.<br />
The prime concern is always whether the available<br />
data would be biased. If the fact that an observation<br />
is missing is unrelated both to the unobserved value<br />
(and hence to patient outcome) and the data that are<br />
available this is called “missing completely at random.”<br />
For cases 5 and 6 above that would be a safe<br />
assumption. Sometimes data are missing in a predictable<br />
way that does not depend on the missing value<br />
itself but which can be predicted from other data—as<br />
in case 3. Confusingly, this is known as “missing at<br />
random.” In the common cases 1 and 2, however, the<br />
missing data probably depend on unobserved values,<br />
called “missing not at random,” and hence their lack<br />
may lead to bias.<br />
In general, it is important to be able to examine<br />
whether missing data may have introduced bias. For<br />
example, if we know nothing at all about the nonresponders<br />
to a survey then we can do little to explore<br />
possible bias. Thus a high response rate is necessary for<br />
reliable answers. 1 Sometimes, though, some information<br />
is available. For example, if the survey sample is<br />
chosen from a register that includes age and sex, then<br />
the responders and non-responders can be compared<br />
on these variables. At the very least this gives some<br />
pointers to the representativeness of the sample. Nonresponders<br />
often (but not always) have a worse medical<br />
prognosis than those who respond.<br />
A few missing observations are a minor nuisance,<br />
but a large amount of missing data is a major threat<br />
to a study’s integrity. Non-response is a particular<br />
problem in pair-matched studies, such as some case-control<br />
studies, as it is unclear how to analyse data<br />
from the unmatched individuals. Loss of patients<br />
also reduces the power of the trial. Where losses are<br />
expected it is wise to increase the target sample size<br />
to allow for losses. This cannot eliminate the potential<br />
bias, however.<br />
Missing data are much more common in retrospective<br />
studies, in which routinely collected data are<br />
subsequently used for a different purpose. 2 When information<br />
is sought from patients’ medical notes, the notes<br />
often do not say whether or not a patient was a smoker<br />
or had a particular procedure carried out. It is tempting<br />
to assume that the answer is no when there is no<br />
indication that the answer is yes, but this is generally<br />
unwise.<br />
No really satisfactory solution exists for missing data,<br />
which is why it is important to try to maximise data<br />
collection. The main ways of handling missing data in<br />
analysis are: (a) omitting variables which have many<br />
missing values; (b) omitting individuals who do not<br />
have complete data; and (c) estimating (imputing) what<br />
the missing values were.<br />
Omitting everyone without complete data is known<br />
as complete case (or available case) analysis and is<br />
probably the most common approach. When only a<br />
very few observations are missing little harm will be<br />
done, but when many are missing omitting all patients<br />
without full data might result in a large proportion of<br />
the data being discarded, with a major loss of statistical<br />
power. The results may be biased unless the data<br />
are missing completely at random. In general it is
advisable not to include in an analysis any variable<br />
that is not available for a large proportion of the sample.<br />
The main alternative approach to case deletion is imputation, whereby missing values are replaced by some plausible value predicted from that individual's available data. Imputation has been the topic of much recent methodological work; we will consider some of the simpler methods in a separate Statistics Note.
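The three approaches listed above can be sketched with pandas (our own illustration, not part of the original note; the data frame and variable names are invented, and the mean imputation shown is the crudest form of (c); modern practice favours multiple imputation):

```python
import numpy as np
import pandas as pd

# Invented example data: routine records with some values missing
df = pd.DataFrame({
    "age":    [34, 51, 29, 62, 45, 38],
    "smoker": [0, 1, np.nan, 0, np.nan, 1],     # not recorded in the notes
    "sbp":    [118, 135, np.nan, 142, 128, 131],
})

# (a) Omit a variable that has many missing values
df_drop_var = df.drop(columns=["smoker"])

# (b) Complete case analysis: omit individuals without complete data
df_complete = df.dropna()

# (c) Single imputation: replace each missing value by the column mean
df_imputed = df.fillna(df.mean())

print(len(df), len(df_complete))  # prints: 6 4
```

Even in this tiny example, option (b) silently discards a third of the individuals, which is exactly the loss of power and potential bias the text warns about.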
Competing interests: None declared.<br />
1 Evans SJW. Good surveys guide. BMJ 1991;302:302-3.
2 Burton A, Altman DG. Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer 2004;91:4-8.
This is part of a series of occasional articles on statistics and handling<br />
data in research.<br />
424 BMJ | 24 February 2007 | Volume 334
1 Centre for Statistics in Medicine,<br />
University of Oxford, Wolfson<br />
College Annexe, Oxford OX2 6UD<br />
2 Department of Health Sciences,<br />
University of York, York YO10 5DD<br />
Correspondence to:<br />
Professor Altman<br />
doug.altman@csm.ox.ac.uk<br />
Cite this as: BMJ 2009;338:a3167
doi: 10.1136/bmj.a3167<br />
STATISTICS NOTES
Parametric v non-parametric methods for data analysis<br />
Douglas G Altman, 1 J Martin Bland 2
Continuous data arise in most areas of medicine.<br />
Familiar clinical examples include blood pressure,<br />
ejection fraction, forced expiratory volume in 1 second (FEV1), serum cholesterol, and anthropometric measurements.
Methods for analysing continuous data fall<br />
into two classes, distinguished by whether or not they make assumptions about the distribution of the data. Theoretical distributions are described by quantities
called parameters, notably the mean and standard<br />
deviation. 1 Methods that use distributional assumptions<br />
are called parametric methods, because we estimate<br />
the parameters of the distribution assumed for the data.<br />
Frequently used parametric methods include t tests and<br />
analysis of variance for comparing groups, and least<br />
squares regression and correlation for studying the relation<br />
between variables. All of the common parametric<br />
methods (“t methods”) assume that in some way the data<br />
follow a normal distribution and also that the spread of<br />
the data (variance) is uniform either between groups or<br />
across the range being studied. For example, the two<br />
sample t test assumes that the two samples of observations<br />
come from populations that have normal distributions<br />
with the same standard deviation. The importance<br />
of the assumptions for t methods diminishes as sample<br />
size increases.<br />
Alternative methods, such as the sign test, Mann-<br />
Whitney test, and rank correlation, do not require the<br />
data to follow a particular distribution. They work<br />
by using the rank order of observations rather than
the measurements themselves. Methods which do not<br />
require us to make distributional assumptions about<br />
the data, such as the rank methods, are called non-parametric<br />
methods. The term non-parametric applies to<br />
the statistical method used to analyse data, and is not<br />
a property of the data. 1 As tests of significance, rank<br />
methods have almost as much power as t methods to<br />
detect a real difference when samples are large, even<br />
for data which meet the distributional requirements.<br />
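As a small illustration of the parallel between the two classes of method (a sketch of our own; the measurements are invented and scipy is assumed, neither being part of the original note):

```python
from scipy import stats

# Invented measurements for two groups
treated = [4.2, 5.1, 5.6, 6.0, 6.3, 7.1, 7.4, 8.0]
control = [3.1, 3.9, 4.4, 4.8, 5.2, 5.5, 6.1, 6.6]

# Parametric: two sample t test (assumes normal distributions
# with the same standard deviation in the two populations)
t_res = stats.ttest_ind(treated, control)

# Non-parametric: Mann-Whitney test (uses only the rank order)
u_res = stats.mannwhitneyu(treated, control, alternative="two-sided")

print(t_res.pvalue, u_res.pvalue)
```

For well behaved data such as these, the two P values are usually close, reflecting the small loss of power from using ranks alone.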
Non-parametric methods are most often used to<br />
analyse data which do not meet the distributional<br />
requirements of parametric methods. In particular,<br />
skewed data are frequently analysed by non-parametric methods, although data transformation can often make the data suitable for parametric analyses. 2
Data that are scores rather than measurements may have many possible values, such as quality of life scales or data from visual analogue scales, while others have only a few possible values, such as Apgar scores or stage of disease. Scores with many values are often analysed using parametric methods, whereas those with few values tend to be analysed using rank methods, but there is no clear boundary between these cases.
To compensate for the advantage of being free of assumptions about the distribution of the data, rank methods have the disadvantage that they are mainly suited to hypothesis testing, and no useful estimate is obtained, such as the average difference between two groups. Estimates and confidence intervals are easy to find with t methods. Non-parametric estimates and confidence intervals can be calculated, but they depend on extra assumptions which are almost as strong as those for t methods. 3 Rank methods have the added disadvantage of not generalising to more complex situations, most obviously when we wish to use regression methods to adjust for several other factors.
Rank methods can generate strong views, with some people preferring them for all analyses and others believing that they have no place in statistics. We believe that rank methods are sometimes useful, but parametric methods are generally preferable as they provide estimates and confidence intervals and generalise to more complex analyses.
The choice of approach may also be related to sample size, as the distributional assumptions are more important for small samples. We consider the analysis of small data sets in a subsequent Statistics Note.
Competing interests: None declared.
Provenance and peer review: Commissioned, not externally peer reviewed.
1 Altman DG, Bland JM. Variables and parameters. BMJ 1999;318:1667.
2 Bland JM, Altman DG. Transforming data. BMJ 1996;312:770.
3 Campbell MJ, Gardner MJ. Medians and their differences. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books, 2000:36-44.

First day jitters
On my first day of clinical rotations, I was confident that I wouldn't be another clueless medical student. I had worked diligently in my basic science courses to learn the pathophysiology of clinically relevant diseases and rehearsed for hours on end how to perform an impressive history and physical examination. Judging from the case-based questions I was doing, it seemed that my first day would be a walk in the park.
Then I saw my first patient, and it was as though all the mental cue cards that I had meticulously arranged and rehearsed fell from my hands to the floor in utter disarray. I wouldn't want to share with you all the embarrassment, but, suffice it to say, a quick-thinking resident had to add the words "… of gainful employment" to the end of a question that I had unthinkingly asked a 55 year old man: "So when would you say was your last period?"
Bharat Kumar student, Saba University School of Medicine, Rye Brook, NY, USA bharatk85@gmail.com
Cite this as: BMJ 2009;339:b2380
170 BMJ | 18 July 2009 | Volume 339
1 Department of Health Sciences,<br />
University of York, York YO10 5DD<br />
2 Centre for Statistics in Medicine,<br />
University of Oxford, Wolfson<br />
College Annexe, Oxford OX2 6UD<br />
Correspondence to: Professor Bland mb55@york.ac.uk
Cite this as: BMJ 2009;338:a3166
doi: 10.1136/bmj.a3166<br />
STATISTICS NOTES
Analysis of continuous data from small samples
J Martin Bland, 1 Douglas G Altman 2
Parametric methods, including t tests, correlation,
and regression, require the assumption that the data<br />
follow a normal distribution and that variances<br />
are uniform between groups or across ranges. 1 In<br />
small samples these assumptions are particularly<br />
important, so this setting seems ideal for rank (non-parametric)
methods, which make no assumptions<br />
about the distribution of the data; they use the rank<br />
order of observations rather than the measurements<br />
themselves. 1 Unfortunately, rank methods are least<br />
effective in small samples. Indeed, for very small<br />
samples, they cannot yield a significant result<br />
whatever the data. For example, when using the<br />
Mann-Whitney test for comparing two samples of fewer than four observations, a statistically significant
difference is impossible: any data give P>0.05.<br />
Similarly, the Wilcoxon paired test, the sign test, and Spearman's and Kendall's rank correlation coefficients cannot produce P<0.05 for such small samples.
paired sample test, but this method assumes that the<br />
differences have a symmetrical distribution, which<br />
they do not. The sign test is preferred here; it tests<br />
the null hypothesis that non-zero differences are<br />
equally likely to be positive or negative, using the<br />
binomial distribution. We have 1 negative and 11<br />
positive differences, which gives P=0.006. Hence<br />
the original authors failed to detect a difference<br />
because they used an inappropriate analysis.<br />
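The quoted P value can be reproduced from the binomial distribution (a sketch of our own using scipy, which the note itself does not use):

```python
from scipy.stats import binomtest

# Sign test: of 12 non-zero paired differences, 1 is negative and 11
# are positive; under the null hypothesis the number of positive
# differences follows a Binomial(12, 0.5) distribution.
result = binomtest(k=1, n=12, p=0.5, alternative="two-sided")
print(round(result.pvalue, 3))  # 0.006
```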
We have often come across the idea that we<br />
should not use t distribution methods for small samples<br />
but should instead use rank based methods.<br />
The statement is sometimes that we should not use t<br />
methods at all for samples of fewer than six observations.<br />
4 But, as we noted, rank based methods cannot<br />
produce anything useful for such small samples.<br />
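The point is easy to verify (our own illustration; scipy assumed). Even two samples of three observations that do not overlap at all cannot reach significance:

```python
from scipy.stats import mannwhitneyu

x = [1.0, 2.0, 3.0]      # invented values, all below those in y
y = [10.0, 11.0, 12.0]

# With n=3 per group there are only C(6,3) = 20 equally likely ways to
# rank the observations, so the smallest attainable two sided P value
# is 2/20 = 0.1, whatever the data.
u, p = mannwhitneyu(x, y, alternative="two-sided", method="exact")
print(p)  # 0.1
```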
Draw on general experience
The aversion to parametric methods for small samples may arise from the inability to assess the distribution shape when there are so few observations. How can we tell whether data follow a normal distribution if we have only a few observations? The answer is that we have not only the data to be analysed, but usually also experience of other sets of measurements of the same thing. In addition, general experience tells us that body size measurements are usually approximately normal, as are the logarithms of many blood concentrations and the square roots of counts.
We thank Jonathan Cowley for the data in table 1.
Competing interests: None declared.
Provenance and peer review: Commissioned, not externally peer reviewed.
1 Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;338:a3167.
2 Baker CJ, Kasper DL, Edwards MS, Schiffman G. Influence of preimmunization antibody levels on the specificity of the immune response to related polysaccharide antigens. N Engl J Med 1980;303:173-8.
3 Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991:224-5.
4 Siegel S. Nonparametric statistics for the behavioral sciences. 1st ed. Tokyo: McGraw-Hill Kogakusha, 1956:32.

Research methods and reporting
Research methods and reporting is for "how to" articles—those that discuss the nuts and bolts of doing and writing up research, are actionable and readable, and will warrant appraisal by the BMJ's research team and statisticians. These articles should be aimed at a general medical audience that includes doctors of all disciplines and other health professionals working in and outside the UK. You should not assume that readers will know about organisations or practices that are specific to a single discipline.
We welcome articles on all kinds of medical and health services research methods that will be relevant and useful to BMJ readers, whether that research is quantitative or qualitative, clinical or not. This includes articles that propose and explain practical and theoretical developments in research methodology and those that aim to improve the clarity and transparency of reports about research studies, protocols, and results.
This section is for the "how?" of research, while the "what, why, when, and who cares?" will usually belong elsewhere. Studies evaluating ways to conduct and report research should go to the BMJ's Research section; articles debating research concepts, frameworks, and translation into practice and policy should go to Analysis, Editorials, or Features; and those expressing personal opinions should go to Personal View.
Articles for Research methods and reporting should include:
• Up to 2000 words set out under informative subheadings. For some submissions, this might be published in full on bmj.com with a shorter version in the print BMJ
• A separate introduction ("standfirst") of 100-150 words spelling out what the article is about and emphasising its importance
• Explicit evidence to support key statements and a brief explanation of the strength of the evidence (published trials, systematic reviews, observational studies, expert opinion, etc)
• No more than 20 references, in Vancouver style, presenting the evidence on which the key statements in the paper are made
• Up to three tables, boxes, or illustrations (clinical photographs, imaging, line drawings, figures—we welcome colour) to enhance the text and add to or substantiate key points made in the body of the article
• A summary box with up to four short single sentences, in the form of bullet points, highlighting the article's main points
• A box of linked information such as websites for those who want to pursue the subject in more depth (this is optional)
• Web extras: we may be able to publish on bmj.com some additional boxes, figures, and references (in a separate reference list numbered w1, w2, w3, etc, and marked as such in the main text of the article)
• Suggestions for linked podcasts or video clips, as appropriate
• A statement of sources and selection criteria. As well as the standard statements of funding, competing interests, and contributorship, please provide at the end of the paper a 100-150 word paragraph (excluded from the word count) explaining the paper's provenance. This should include the relevant experience or expertise of the author(s) and the sources of information used to prepare the paper. It should also give details of each author's role in producing the article and name one as guarantor.
962 BMJ | 24 October 2009 | Volume 339
<strong>BMJ</strong> 2011;342:d561 doi: 10.1136/bmj.d561 Page 1 of 3<br />
RESEARCH METHODS & REPORTING
STATISTICS NOTES<br />
Comparisons within randomised groups can be very<br />
misleading<br />
J Martin Bland professor of health statistics, 1 Douglas G Altman professor of statistics in medicine 2
1 Department of Health Sciences, University of York, York YO10 5DD; 2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD<br />
When we randomise trial participants into two or more<br />
intervention groups, we do this to remove bias; the groups will,<br />
on average, be comparable in every respect except the treatment<br />
which they receive. Provided the trial is well conducted, without<br />
other sources of bias, any difference in the outcome of the<br />
groups can then reasonably be attributed to the different<br />
interventions received. In a previous note we discussed the<br />
analysis of those trials in which the primary outcome measure<br />
is also measured at baseline. We discussed several valid<br />
analyses, observing that “analysis of covariance” (a regression<br />
method) is the method of choice. 1<br />
Rather than comparing the randomised groups directly, however,<br />
researchers sometimes look at the change in the measurement<br />
between baseline and the end of the trial; they test whether there<br />
was a significant change from baseline, separately in each<br />
randomised group. They may then report that this difference is<br />
significant in one group but not in the other, and conclude that<br />
this is evidence that the groups, and hence the treatments, are<br />
different. One such example was a recent trial in which<br />
participants were randomised to receive either an “anti-ageing”<br />
cream or the vehicle as a placebo. 2 A wrinkle score was recorded<br />
at baseline and after six months. The authors gave the results<br />
of significance tests comparing the score with baseline for each<br />
group separately, reporting the active treatment group to have<br />
a significant difference (P=0.013) and the vehicle group not<br />
(P=0.11). Their interpretation was that the cosmetic cream<br />
resulted in significant clinical improvement in facial wrinkles.<br />
But we cannot validly draw this conclusion, because the lack<br />
of a significant difference in the vehicle group does not provide<br />
good evidence that the anti-ageing product is superior. 3<br />
The essential feature of a randomised trial is the comparison<br />
between groups. Within group analyses do not address a<br />
meaningful question: the question is not whether there is a<br />
change from baseline, but whether any change is greater in one<br />
group than the other. It is not possible to draw valid inferences<br />
<strong>by</strong> comparing P values. In particular, there is an inflated risk of<br />
a false positive result, which we shall illustrate with a simulation.<br />
Correspondence to: Professor M <strong>Bland</strong> martin.bland@<strong>york</strong>.ac.uk<br />
The table shows simulated data for a randomised trial with two<br />
groups of 30 participants. Data were drawn from the same<br />
population, so there is no systematic difference between the two<br />
groups. The true baseline measurements had a mean of 10.0<br />
with standard deviation (SD) 2.0, and the outcome measurement<br />
was equal to the baseline plus an increase of 0.5 and a random<br />
element with SD 1.0. The difference between mean outcomes is −0.22 (95% confidence interval −0.75 to 0.34; P=0.5), adjusting for the baseline by analysis of covariance. 1 The difference is not statistically significant, which is not surprising because we know that the null hypothesis of no difference in the population is true. If we compare baseline with outcome for each group using a paired t test, however, for group A the difference is statistically significant (P=0.03), while for group B it is not (P=0.2). These results are quite similar to those of the anti-ageing
cream trial. 2<br />
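A simulation along the lines described can be sketched as follows (our own code, not the authors'; numpy and scipy are assumed, and the paired t test stands in for the within group analyses):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
runs = 1000
misleading = 0

for _ in range(runs):
    # Two arms drawn from the same population: the null hypothesis is
    # true for the between group comparison, but both arms truly
    # improve by 0.5 from baseline.
    base_a = rng.normal(10.0, 2.0, 30)
    base_b = rng.normal(10.0, 2.0, 30)
    out_a = base_a + 0.5 + rng.normal(0.0, 1.0, 30)
    out_b = base_b + 0.5 + rng.normal(0.0, 1.0, 30)

    p_a = ttest_rel(out_a, base_a).pvalue   # within group test, arm A
    p_b = ttest_rel(out_b, base_b).pvalue   # within group test, arm B
    if (p_a < 0.05) != (p_b < 0.05):        # one significant, one not
        misleading += 1

print(misleading / runs)
```

In runs like these, the misleading pattern of a significant change in one group but not the other appears in a large fraction of trials, far above the nominal 5%, even though the two arms never differ.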
We would not wish to draw any conclusions from one simulation. In 1000 runs, the difference between groups had P<0.05
difference were 0.37 rather than 0.5, we would expect 50% of<br />
samples to have just one significant difference.<br />
The anti-ageing trial is not the only one where we have seen<br />
this misleading approach applied to randomised trial data. 3 We<br />
even found it once in the BMJ! 5
Contributors: JMB and DGA jointly wrote and agreed the text; JMB did the statistical analysis.
Competing interests: All authors have completed the Unified Competing<br />
Interest form at www.icmje.org/coi_disclosure.pdf (available on request
from the corresponding author) and declare: no support from any<br />
organisation for the submitted work; no financial relationships with any<br />
organisations that might have an interest in the submitted work in the<br />
previous 3 years; no other relationships or activities that could appear<br />
to have influenced the submitted work.<br />
1 Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow-up measurements. BMJ 2001;323:1123-4.
2 Watson REB, Ogden S, Cotterell LF, Bowden JJ, Bastrilles JY, Long SP, et al. A cosmetic 'anti-ageing' product improves photoaged skin: a double-blind, randomized controlled trial. Br J Dermatol 2009;161:419-26.
3 Bland JM. Evidence for an 'anti-ageing' product may not be so clear as it appears. Br J Dermatol 2009;161:1207-8.
4 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219.
5 Bland JM, Altman DG. Informed consent. BMJ 1993;306:928.
Cite this as: BMJ 2011;342:d561
Table 1 | Simulated data from a randomised trial comparing two groups of 30 patients, with a true change from baseline but no difference between groups (sorted by baseline values within each group)

                    Group A                       Group B
Patient   Baseline  6 months  Change    Baseline  6 months  Change
1           6.4       7.1       0.7       6.8       7.9       1.1
2           6.6       5.6      -1.0       7.2       7.5       0.3
3           7.3       8.3       1.0       7.2       6.9      -0.3
4           7.7       9.1       1.4       7.4       6.9      -0.5
5           7.7       9.5       1.8       7.5       8.3       0.8
6           7.9       9.6       1.7       7.5       9.4       1.9
7           8.0       8.5       0.5       8.3       9.0       0.7
8           8.0       8.5       0.5       8.4       8.8       0.4
9           8.1       9.1       1.0       8.7       8.0      -0.7
10          9.2       9.6       0.4       9.0       7.2      -1.8
11          9.3       8.7      -0.6       9.2       7.1      -2.1
12          9.6      10.7       1.1       9.6      10.6       1.0
13          9.7       9.0      -0.7       9.9      11.0       1.1
14          9.8       9.0      -0.8      10.1      11.5       1.4
15          9.8       8.0      -1.8      10.2      10.4       0.2
16         10.2      11.1       0.9      10.3      11.0       0.7
17         10.3      11.5       1.2      10.4       9.9      -0.5
18         10.6       9.1      -1.5      10.5      11.3       0.8
19         10.6      12.0       1.4      10.7       9.9      -0.8
20         10.7      13.2       2.5      10.8      10.7      -0.1
21         10.9       9.7      -1.2      10.8      11.8       1.0
22         11.1      12.2       1.1      11.1      10.0      -1.1
23         11.2      10.8      -0.4      11.1      13.2       2.1
24         11.8      11.9       0.1      11.4      11.8       0.4
25         12.3      12.2      -0.1      11.6      12.1       0.5
26         12.4      12.6       0.2      11.7      11.5      -0.2
27         13.1      15.0       1.9      12.0      12.7       0.7
28         13.2      13.8       0.6      12.3      13.7       1.4
29         13.3      14.1       0.8      13.7      12.6      -1.1
30         13.7      14.2       0.5      13.9      13.7      -0.2
Mean       10.02     10.46      0.44      9.98     10.21      0.24
SD          2.06      2.29      1.06      1.90      2.09      1.02
1 Department of Health Sciences,<br />
University of York, York YO10 5DD<br />
2 Centre for Statistics in Medicine,<br />
University of Oxford, Oxford<br />
OX2 6UD<br />
Correspondence to: Professor M Bland martin.bland@york.ac.uk
Cite this as: <strong>BMJ</strong> 2011;342:d556<br />
doi: 10.1136/bmj.d556<br />
Competing interests: All authors<br />
have completed the Unified<br />
Competing Interest form at www.icmje.org/coi_disclosure.pdf
(available on request from the<br />
corresponding author) and declare:<br />
no support from any organisation<br />
for the submitted work; no<br />
financial relationships with any<br />
organisations that might have an<br />
interest in the submitted work in<br />
the previous three years; no other<br />
relationships or activities that could<br />
appear to have influenced the<br />
submitted work.<br />
Correlation between<br />
abdominal circumference<br />
and body mass index (BMI)<br />
in 1450 adult patients with<br />
diabetes<br />
BMI group      r
>35            0.86
All patients   0.85
STATISTICS NOTES<br />
Correlation in restricted ranges of data<br />
J Martin Bland, 1 Douglas G Altman 2
In a study of 1450 adult patients with diabetes there was a strong correlation between abdominal circumference and body mass index (BMI) (r = 0.85). 1 The authors went on to report that the correlation differed between BMI categories, as shown in the table.
The authors' interpretation of these data was that in patients with low or high BMI values (BMI <25 or >35 kg/m²) the correlation was strong, but in those with BMI values between 25 and 35 kg/m² the correlation was weak or absent. They concluded that measuring abdominal circumference is of particular importance in subjects in the most frequent BMI category (25 to 35 kg/m²).
When we restrict the range of one of the variables, a<br />
correlation coefficient will be reduced. For example, fig 1<br />
shows some BMI and abdominal circumference measurements<br />
from a different population. Although these people<br />
are from a rather thinner population, the correlation coefficient<br />
is very similar, r = 0.82 (P<0.001).
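The effect of restricting the range is easy to demonstrate by simulation (our own sketch with invented data, not from the original note; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Bivariate normal pair with true correlation about 0.85
x = rng.normal(0.0, 1.0, n)
y = 0.85 * x + np.sqrt(1 - 0.85**2) * rng.normal(0.0, 1.0, n)

r_all = np.corrcoef(x, y)[0, 1]

# Restrict attention to the middle of the range of x
mid = (x > -0.5) & (x < 0.5)
r_mid = np.corrcoef(x[mid], y[mid])[0, 1]

print(round(r_all, 2), round(r_mid, 2))  # restricted r is clearly smaller
```

The restricted correlation is much smaller even though the relation between the two variables is identical throughout the range: only the spread of x has changed.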
1 Centre for Statistics in Medicine,<br />
University of Oxford, Oxford<br />
OX2 6UD<br />
2 Department of Health Sciences,<br />
University of York, Heslington, York<br />
YO10 5DD<br />
Correspondence to: D G Altman<br />
doug.altman@csm.ox.ac.uk<br />
Cite this as: <strong>BMJ</strong> 2011;343:d2090<br />
doi: 10.1136/bmj.d2090<br />
RESEARCH METHODS<br />
& REPORTING<br />
STATISTICS NOTES<br />
How to obtain the confidence interval from a P value<br />
Douglas G Altman, 1 J Martin Bland 2
Confidence intervals (CIs) are widely used in reporting statistical analyses of research data, and are usually considered to be more informative than P values from significance tests. 1 2
Some published articles, however, report estimated effects and<br />
P values, but do not give CIs (a practice BMJ now strongly discourages).
Here we show how to obtain the confidence interval<br />
when only the observed effect and the P value were reported.<br />
The method is outlined in the box below in which we have<br />
distinguished two cases.<br />
(a) Calculating the confidence interval for a difference<br />
We consider first the analysis comparing two proportions or<br />
two means, such as in a randomised trial with a binary outcome<br />
or a measurement such as blood pressure.<br />
For example, the abstract of a report of a randomised trial<br />
included the statement that “more patients in the zinc group<br />
than in the control group recovered by two days (49% v 32%,
P=0.032).” 5 The difference in proportions was Est = 17 percentage<br />
points, but what is the 95% confidence interval (CI)?<br />
Following the steps in the box we calculate the CI as follows:
z = –0.862+ √[0.743 – 2.404×log(0.032)] = 2.141;<br />
SE = 17/2.141 = 7.940, so that 1.96×SE = 15.56<br />
percentage points;<br />
95% CI is 17.0 – 15.56 to 17.0 + 15.56, or 1.4 to 32.6<br />
percentage points.<br />
(b) Calculating the confidence interval for a ratio (log<br />
transformation needed)<br />
The calculation is trickier for ratio measures, such as risk ratio,<br />
odds ratio, and hazard ratio. We need to log transform the estimate<br />
and then reverse the procedure, as described in a previous<br />
Statistics Note. 6<br />
For example, the abstract of a report of a cohort study<br />
includes the statement that “In those with a [diastolic blood<br />
pressure] reading of 95-99 mm Hg the relative risk was 0.30<br />
Steps to obtain the CI for an estimate of effect from the P value and the estimate (Est)<br />
(a) CI for a difference<br />
• Calculate the test statistic for a normal distribution test, z, from P: 3 z = −0.862 + √[0.743 − 2.404×log(P)]
• Calculate the standard error: SE = Est/z (ignoring minus signs)
• Calculate the 95% CI: Est − 1.96×SE to Est + 1.96×SE.
(b) CI for a ratio<br />
• For a ratio measure, such as a risk ratio, the above formulas should be used with the<br />
estimate Est on the log scale (eg, the log risk ratio). Step 3 gives a CI on the log scale; to<br />
derive the CI on the natural scale we need to exponentiate (antilog) Est and its CI. 4<br />
All P values are two sided.<br />
All logarithms are natural (ie, to base e). 4<br />
For a 90% CI, we replace 1.96 by 1.65; for a 99% CI we use 2.57.
(P=0.034).” 7 What is the confidence interval around 0.30?<br />
Following the steps in the box we calculate the CI:<br />
z = –0.862+ √[0.743 – 2.404×log(0.034)] = 2.117;<br />
Est = log (0.30) = −1.204;<br />
SE = −1.204/2.117 = −0.569 but we ignore the minus<br />
sign, so SE = 0.569, and 1.96×SE = 1.115;<br />
95% CI on log scale = −1.204 − 1.115 to −1.204 + 1.115 =<br />
−2.319 to −0.089;<br />
95% CI on natural scale = exp (−2.319) = 0.10 to exp<br />
(−0.089) = 0.91.<br />
Hence the relative risk is estimated to be 0.30 with 95% CI<br />
0.10 to 0.91.<br />
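The steps in the box translate directly into code (a sketch; the function name and interface are ours, not the authors'):

```python
import math

def ci_from_p(est, p, ratio=False):
    """95% CI from an estimate and its two sided P value, following
    the box above (Lin's approximation for the normal deviate)."""
    z = -0.862 + math.sqrt(0.743 - 2.404 * math.log(p))
    x = math.log(est) if ratio else est   # ratios: work on the log scale
    se = abs(x) / z                       # SE = Est/z, ignoring signs
    lo, hi = x - 1.96 * se, x + 1.96 * se
    if ratio:
        lo, hi = math.exp(lo), math.exp(hi)  # back to the natural scale
    return lo, hi

print(ci_from_p(17, 0.032))                # about (1.4, 32.6), as in (a)
print(ci_from_p(0.30, 0.034, ratio=True))  # about (0.10, 0.91), as in (b)
```

Both worked examples are reproduced: the difference gives roughly 1.4 to 32.6 percentage points, and the risk ratio roughly 0.10 to 0.91.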
Limitations of the method<br />
The methods described can be applied in a wide range of<br />
settings, including the results from meta-analysis and regression<br />
analyses. The main context where they are not correct is<br />
in small samples where the outcome is continuous and the<br />
analysis has been done <strong>by</strong> a t test or analysis of variance,<br />
or the outcome is dichotomous and an exact method has<br />
been used for the confidence interval. However, even here<br />
the methods will be approximately correct in larger studies<br />
with, say, 60 patients or more.<br />
P values presented as inequalities<br />
Sometimes P values are very small and so are presented only as an inequality, such as P<0.0001. Applying the method with P=0.0001 gives a confidence interval that is wider than the true interval, so the result is conservative. When a result is presented as P>0.05, or simply as “not significant,” things are more difficult. If we apply the method described here using P=0.05, the confidence interval will be too narrow. We must remember that the estimate is even poorer than the calculated confidence interval would suggest.
1 Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746-50.
2 Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869.
3 Lin J-T. Approximating the normal tail probability and its inverse for use on a pocket calculator. Appl Stat 1989;38:69-70.
4 Bland JM, Altman DG. Statistics Notes. Logarithms. BMJ 1996;312:700.
5 Roy SK, Hossain MJ, Khatun W, Chakraborty B, Chowdhury S, Begum A, et al. Zinc supplementation in children with cholera in Bangladesh: randomised controlled trial. BMJ 2008;336:266-8.
6 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219.
7 Lindblad U, Råstam L, Rydén L, Ranstam J, Isacsson S-O, Berglund G. Control of blood pressure and risk of first acute myocardial infarction: Skaraborg hypertension project. BMJ 1994;308:681.
BMJ | 1 OCTOBER 2011 | VOLUME 343 681
RESEARCH METHODS & REPORTING<br />
1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD
2 Department of Health Sciences, University of York, Heslington, York YO10 5DD
Correspondence to: D G Altman doug.altman@csm.ox.ac.uk
Cite this as: BMJ 2011;343:d2304
doi: 10.1136/bmj.d2304
How to obtain the P value from a confidence interval<br />
Douglas G Altman, 1 J Martin Bland 2
We have shown in a previous Statistics Note 1 how we can<br />
calculate a confidence interval (CI) from a P value. Some<br />
published articles report confidence intervals, but do not<br />
give corresponding P values. Here we show how a confidence<br />
interval can be used to calculate a P value, should<br />
this be required. This might also be useful when the P<br />
value is given only imprecisely (eg, as P<0.05).
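The reverse calculation can be sketched as follows, under the assumption of a symmetric 95% interval; `p_from_ci` is a hypothetical name, and the tail-area approximation P ≈ exp(−0.717z − 0.416z²) is the counterpart of the z-from-P formula used in the companion note.

```python
import math

def p_from_ci(lo, hi, ratio=False):
    """Sketch: approximate two sided P value from a 95% CI.

    Assumes a symmetric 95% interval; for a ratio measure the limits
    are first moved to the log scale. The final line is an approximation
    to the normal tail probability, the inverse of Lin's formula.
    """
    if ratio:
        lo, hi = math.log(lo), math.log(hi)
    est = (lo + hi) / 2           # the estimate is the midpoint of the CI
    se = (hi - lo) / (2 * 1.96)   # a 95% CI spans 2 x 1.96 standard errors
    z = abs(est) / se             # test statistic
    return math.exp(-0.717 * z - 0.416 * z**2)

# Reversing the earlier example: 95% CI 0.10 to 0.91 for a relative risk
p = p_from_ci(0.10, 0.91, ratio=True)
print(round(p, 3))  # close to the original P=0.034 (rounding in the limits shifts it slightly)
```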
1 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD
2 Department of Health Sciences, University of York, York YO10 5DD
Correspondence to: DG Altman doug.altman@csm.ox.ac.uk
Cite this as: BMJ 2011;343:d570
doi: 10.1136/bmj.d570
Brackets (parentheses) in formulas<br />
Douglas G Altman, 1 J Martin Bland 2
Each year, new health sciences postgraduate students in<br />
York are given a simple maths test. Each year the majority<br />
of them fail to calculate 20 – 3 × 5 correctly. According<br />
to the conventional rules of arithmetic, division and<br />
multiplication are done before addition and subtraction,<br />
so 20 − 3 × 5 = 20 − 15 = 5. Many students work from<br />
left to right and calculate 20 − 3 × 5 as 17 × 5 = 85. If<br />
that was what was actually meant, we would need to use<br />
brackets: (20 − 3) × 5 = 17 × 5 = 85. Brackets tell us that<br />
the enclosed part must be evaluated first. That convention<br />
is part of various mnemonic acronyms that indicate<br />
the order of operations, such as BODMAS (Brackets, Of<br />
(that is, power of), Divide, Multiply, Add, Subtract) and<br />
PEMDAS (Parentheses, Exponentiation, Multiplication,<br />
Division, Addition, Subtraction). 1<br />
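Programming languages follow the same precedence convention, so the two readings of the test question can be checked directly:

```python
# Multiplication binds more tightly than subtraction, so the bare
# expression follows the BODMAS/PEMDAS reading...
print(20 - 3 * 5)    # 5: evaluated as 20 - (3 * 5)
# ...and brackets are needed to force left-to-right evaluation
print((20 - 3) * 5)  # 85: the subtraction is done first
```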
Schoolchildren learn the basic rules about how to construct<br />
and interpret mathematical formulas. 1 The conventions<br />
exist to ensure that there is absolutely no ambiguity,<br />
as mathematics (unlike prose) has no redundancy, so any<br />
mistake may have serious consequences. Our experience<br />
is that mistakes are quite common when formulas are presented<br />
in medical journal articles. A particular concern is<br />
that brackets are often omitted or misused. The following<br />
examples are typical, and we mean nothing personal by
choosing them.<br />
Example 1<br />
In a discussion of methods for analysing diagnostic test<br />
accuracy, Collinson 2 wrote:<br />
Sensitivity = TP/TP + FN<br />
where TP = true positive and FN = false negative. The formula<br />
should, of course, be:<br />
Sensitivity = TP/(TP + FN).<br />
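The difference is easy to demonstrate in code; the counts below are invented for illustration, not taken from Collinson's paper.

```python
TP, FN = 90, 10  # hypothetical counts of true positives and false negatives

# As printed: division happens before addition, so this is (TP/TP) + FN
wrong = TP / TP + FN
# As intended: brackets make the denominator the total number with disease
sensitivity = TP / (TP + FN)

print(wrong)        # 11.0, a nonsense "sensitivity"
print(sensitivity)  # 0.9
```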
Example 2<br />
For a non-statistical example, Leyland 3 wrote that the total optical power of the cornea is:
P = P1 + P2 − t/n2(P1P2)
where P1 = n2 − n1/r1 and P2 = n3 − n2/r2. Here n1, n2, and n3 are refractive indices; r1, r2, and t are distances in metres; and P, P1, and P2 are powers in dioptres. But he should have written P1 = (n2 − n1)/r1 and P2 = (n3 − n2)/r2. P1 = n2 − n1/r1 is clearly wrong dimensionally, as P1 is in dioptres (1/metre), n2 and n1 are ratios and so pure numbers, and r1 is in metres. Also, it is not clear whether t/n2(P1P2) means (t/n2)P1P2, which it does, or t/(n2P1P2).
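With the brackets restored the formula can be checked numerically; the values below are illustrative schematic-eye figures assumed here for the check, not taken from Leyland's paper.

```python
# Illustrative schematic-eye values (assumed, for a dimensional sanity check)
n1, n2, n3 = 1.000, 1.376, 1.336  # refractive indices: air, cornea, aqueous
r1, r2 = 0.0077, 0.0068           # surface radii of curvature in metres
t = 0.0005                        # corneal thickness in metres

P1 = (n2 - n1) / r1               # anterior surface power, dioptres
P2 = (n3 - n2) / r2               # posterior surface power, dioptres
P = P1 + P2 - (t / n2) * P1 * P2  # total corneal power, dioptres

print(round(P, 1))  # roughly 43 dioptres, a plausible corneal power
```

Written without the brackets, P1 = n2 − n1/r1 would come out at well over 100 "dioptres", which signals the dimensional error immediately.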
Do such errors matter? Certainly. In our experience the<br />
calculations are usually correct in the paper, but anyone<br />
using the published formula would go wrong. Sometimes,<br />
however, the incorrect formula was used, as in the<br />
following case.<br />
Example 3<br />
In their otherwise exemplary evaluation of the chronic<br />
ankle instability scale, Eechaute et al 4 made a mistake in<br />
their formula for the minimal detectable change (MDC)<br />
or repeatability coefficient, 5 writing: MDC = 2.04 × √(2 ×<br />
SEM). Here SEM is the standard error of a measurement<br />
or within subject standard deviation. 5 This formula uses<br />
2.04 where 2 or 1.96 is more usual, 6 but, much more seriously, their brackets place the SEM inside the square root, where it does not belong. This might be dismissed as a simple typographical error, but the authors actually used this incorrect formula. Their value of SEM was 2.7 points, so they calculated the minimal detectable change as 2.04 × √(2 × 2.7) = 4.7. They should have calculated 2.04 × √2 × 2.7 = 7.8. Their erroneous formula makes the scale appear considerably more reliable than it actually is. 6 The formula is also wrong in terms of dimensions, because the minimal detectable change should be in the same units as the measurement, not in square root units.
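The numerical effect of the misplaced bracket can be reproduced directly:

```python
import math

SEM = 2.7  # standard error of measurement reported by Eechaute et al, in points

# As published: the SEM is wrongly inside the square root
wrong_mdc = 2.04 * math.sqrt(2 * SEM)
# As intended: only the 2 belongs under the root
right_mdc = 2.04 * math.sqrt(2) * SEM

print(round(wrong_mdc, 1), round(right_mdc, 1))  # 4.7 7.8
```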
Some mistakes in formulas may be present in a submitted<br />
manuscript, but others might be introduced in the<br />
publication process. For example, problems sometimes<br />
arise when a displayed formula is converted to an “in-text” formula as part of the editing, and the implications are not realised or not noticed by either editing staff or authors. Often it is necessary to insert brackets when reformatting a formula. For example, the simple displayed fraction with p as its numerator and 1 − p as its denominator should be changed to p/(1 − p) if moved to the text.
Formulas in published articles may be used by others,
so mistakes may lead to substantive errors in research. It<br />
is essential that authors and editors check all formulas<br />
carefully.<br />
Acknowledgements: We are very grateful to Phil McShane for pointing out a<br />
mistake in an earlier version of this statistics note.<br />
Contributors: DGA and JMB jointly wrote and agreed the text.
Competing interests: All authors have completed the Unified Competing<br />
Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an
interest in the submitted work in the previous three years; no other relationships<br />
or activities that could appear to have influenced the submitted work.<br />
1 Wikipedia. Order of operations. [cited 2010 Nov 23]. http://en.wikipedia.org/wiki/Order_of_operations.
2 Collinson P. Of bombers, radiologists, and cardiologists: time to ROC.<br />
Heart 1998;80:215-7.<br />
3 Leyland M. Validation of Orbscan II posterior corneal curvature<br />
measurement for intraocular lens power calculation. Eye<br />
2004;18:357-60.<br />
4 Eechaute C, Vaes P, Duquet W. The chronic ankle instability scale:<br />
clinimetric properties of a multidimensional, patient-assessed<br />
instrument. Phys Ther Sport 2008;9:57-66.<br />
5 Bland JM, Altman DG. Statistics notes. Measurement error. BMJ 1996;312:744.
6 <strong>Bland</strong> <strong>JM</strong>. Minimal detectable change. Phys Ther Sport 2009;10:39.<br />
578 BMJ | 17 SEPTEMBER 2011 | VOLUME 343