Tips for Learners of Evidence-Based Medicine
CMAJ 2005: Tips for Learners of Evidence-Based Medicine: A 5-Part Series

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 1. relative risk reduction, absolute risk reduction and number needed to treat. Can Med Assoc J 2004;171:353–358.
Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 2. measures of precision (confidence intervals). Can Med Assoc J 2004;171:611–615.
McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G. Tips for learners of evidence-based medicine: 3. measures of observer variability (kappa statistic). Can Med Assoc J 2004;171:1369–1373.
Hatala R, Keitz S, Wyer P, Guyatt G. Tips for learners of evidence-based medicine: 4. assessing heterogeneity of primary studies in systematic reviews and whether to combine their results. Can Med Assoc J 2005;172:661–665.
Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G. Tips for learners of evidence-based medicine: 5. the effect of spectrum of disease on the performance of diagnostic tests. Can Med Assoc J 2005;172:385–390.

DOI:10.1503/cmaj.1021197
Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat
Physicians, patients and policy-makers are influenced not only by the results of studies but also by how authors present the results [1–4]. Depending on which measures of effect authors choose, the impact of an intervention may appear very large or quite small, even though the underlying data are the same. In this article we present 3 measures of effect (relative risk reduction, absolute risk reduction and number needed to treat) in a fashion designed to help clinicians understand and use them. We have organized the article as a series of “tips” or exercises. This means that you, the reader, will have to do some work in the course of reading this article (we are assuming that most readers are practitioners, as opposed to researchers and educators).

The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians [5,6]. A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/4/353/DC1.

Review (Synthèse)
Alexandra Barratt, Peter C. Wyer, Rose Hatala, Thomas McGinn, Antonio L. Dans, Sheri Keitz, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group
CMAJ • AUG. 17, 2004; 171 (4) 353
© 2004 Canadian Medical Association or its licensors
See related article page 347

Clinician learners’ objectives

Understanding risk and risk reduction
• Learn how to determine control and treatment event rates in published studies.
• Learn how to determine relative and absolute risk reductions from published studies.
• Understand how relative and absolute risk reductions usually apply to different populations.

Balancing benefits and adverse effects in individual patients
• Learn how to use a known relative risk reduction to estimate the risk of an event for a patient undergoing treatment, given an estimate of that patient’s risk of the event without treatment.
• Learn how to use absolute risk reductions to assess whether the benefits of therapy outweigh its harms.

Calculating and using number needed to treat
• Develop an understanding of the concept of number needed to treat (NNT) and how it is calculated.
• Learn how to interpret the NNT and develop an understanding of how the “threshold NNT” varies depending on the patient’s values and preferences, the severity of possible outcomes and the adverse effects (harms) of therapy.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

Tip 1: Understanding risk and risk reduction

You can calculate relative and absolute risk reductions using simple mathematical formulas (see Appendix 1). However, you might find it easier to understand the concepts through visual presentation. Fig. 1A presents data from a hypothetical trial of a new drug for acute myocardial infarction, showing the 30-day mortality rate in a group of patients at high risk for the adverse event (e.g., elderly patients with congestive heart failure and anterior wall infarction). On the basis of information in Fig. 1A, how would you describe the effect of the new drug? (Hint: Consider the event rates in people not taking the new drug and those who are taking it.)

We can describe the difference in mortality (event) rates in both relative and absolute terms. In this case, these high-risk patients had a relative risk reduction of 25% and an absolute risk reduction of 10%.

Now, let’s consider Fig. 1B, which shows the results of a second hypothetical trial of the same new drug, but in a patient population with a lower risk for the outcome (e.g., younger patients with uncomplicated inferior wall myocardial infarction). Looking at Fig. 1B, how would you describe the effect of the new drug?

The relative risk reduction with the new drug remains at 25%, but the event rate is lower in both groups, and hence the absolute risk reduction is only 2.5%.

Although the relative risk reduction might be similar across different risk groups (a safe assumption in many if not most cases [7,8]), the absolute gains, represented by absolute risk reductions, are not. In sum, the absolute risk reduction becomes smaller when event rates are low, whereas the relative risk reduction, or “efficacy” of the treatment, often remains constant.

These phenomena may be factors in the design of drug trials. For example, a drug may be tested in severely affected people in whom the absolute risk reduction is likely to be impressive, but is subsequently marketed for use by less severely affected patients, in whom the absolute risk reduction will be substantially less.

The bottom line
Relative risk reduction is often more impressive than absolute risk reduction. Furthermore, the lower the event rate in the control group, the larger the difference between relative risk reduction and absolute risk reduction.

Risk and risk reduction: definitions
Event rate: the number of people experiencing an event as a proportion of the number of people in the population
Relative risk reduction: the difference in event rates between 2 groups, expressed as a proportion of the event rate in the untreated group; usually constant across populations with different risks [7,8]
Absolute risk reduction: the arithmetic difference between 2 event rates; varies with the underlying risk of an event in the individual patient

[Fig. 1, panels A and B: bar charts of risk for outcome of interest, % (axis from 0 to 40), for placebo v. treatment. Panel A: trial 1 (high-risk patients). Panel B: trial 1 (high-risk patients) and trial 2 (low-risk patients).]
Among high-risk patients in trial 1, the event rate in the control group (placebo) is 40 per 100 patients, and the event rate in the treatment group is 30 per 100 patients.
Absolute risk reduction (also called the risk difference) is the simple difference in the event rates (40% – 30% = 10%).
Relative risk reduction is the difference between the event rates in relative terms. Here, the event rate in the treatment group is 25% less than the event rate in the control group (i.e., the 10% absolute difference expressed as a proportion of the control rate is 10/40 or 25% less).
Among low-risk patients in trial 2, the event rate in the control group (placebo) is only 10%. If the treatment is just as effective in these low-risk patients, what event rate can we expect in the treatment group?
The event rate in the treated group would be 25% less than in the control group or 7.5%. Therefore, the absolute risk reduction for the low-risk patients (second pair of columns) is only 2.5%, even though the relative risk reduction is the same as for the high-risk patients (first pair of columns).
Fig. 1: Results of hypothetical placebo-controlled trials of a new drug for acute myocardial infarction. The bars represent the 30-day mortality rate in different groups of patients with acute myocardial infarction and heart failure. A: Trial involving patients at high risk for the adverse outcome. B: Trials involving a group of patients at high risk for the adverse outcome and another group of patients at low risk for the adverse outcome.
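
The arithmetic behind Fig. 1 can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical (they come from neither the article nor any library), and event rates are assumed to be entered as proportions (0.40 rather than 40%).

```python
def absolute_risk_reduction(control_rate, treatment_rate):
    """Arithmetic difference between the two event rates (as proportions)."""
    return control_rate - treatment_rate


def relative_risk_reduction(control_rate, treatment_rate):
    """Difference in event rates as a proportion of the control event rate."""
    return (control_rate - treatment_rate) / control_rate


def expected_treatment_rate(control_rate, rrr):
    """Event rate expected under treatment if the RRR carries over unchanged."""
    return control_rate * (1 - rrr)


# Trial 1 (high-risk patients): 40% of controls and 30% of treated patients die.
arr_high = absolute_risk_reduction(0.40, 0.30)
rrr_high = relative_risk_reduction(0.40, 0.30)

# Trial 2 (low-risk patients): control event rate 10%, same RRR assumed.
treated_low = expected_treatment_rate(0.10, rrr_high)
arr_low = absolute_risk_reduction(0.10, treated_low)

print(f"Trial 1: ARR = {arr_high:.1%}, RRR = {rrr_high:.0%}")
print(f"Trial 2: treated event rate = {treated_low:.1%}, ARR = {arr_low:.1%}")
```

Holding the relative risk reduction fixed and lowering the control event rate is what shrinks the absolute risk reduction from 10% to 2.5%.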

Tip 2: Balancing benefits and adverse effects in individual patients

In prescribing medications or other treatments, physicians consider both the potential benefits and the potential harms. We have just demonstrated that the benefits of treatment (presented as absolute risk reductions) will generally be greater in patients at higher risk of adverse outcomes than in patients at lower risk of adverse outcomes. You must now incorporate the possibility of harm into your decision-making.

First, you need to quantify the potential benefits. Assume you are managing 2 patients for high blood pressure and are considering the use of a new antihypertensive drug, drug X, for which the relative risk reduction for stroke over 3 years is 33%, according to published randomized controlled trials.

Pat is a 69-year-old woman whose blood pressure during a routine examination is 170/100 mm Hg; her blood pressure remains unchanged when you see her again 3 weeks later. She is otherwise well and has no history of cardiovascular or cerebrovascular disease. You assess her risk of stroke at about 1% (or 1 per 100) per year [9].

Dorothy is also 69 years of age, and her blood pressure is the same as Pat’s, 170/100 mm Hg; however, because she had a stroke recently, you assess her risk of subsequent stroke as higher than Pat’s, perhaps 10% per year [10].

One way of determining the potential benefit of a new treatment is to complete a benefit table such as Table 1A. To do this, insert your estimated 3-year event rates for Pat and Dorothy, and then apply the relative risk reduction (33%) expected if they take drug X. It is clear from Table 1A that the absolute risk reduction for patients at higher risk (such as Dorothy) is much greater than for those at lower risk (such as Pat).

Table 1A: Benefit table*
All values are 3-yr event rates for stroke, %.
Patient group | No treatment | With treatment (drug X) | Absolute risk reduction (no treatment – treatment)
At lower risk (e.g., Pat) | 3 | 2 | 1
At higher risk (e.g., Dorothy) | 30 | 20 | 10
*Based on data from a randomized controlled trial of drug X, which reported a 33% relative risk reduction for the outcome (stroke) over 3 years.

Now, you need to factor the potential harms (adverse effects associated with using the drug) into the clinical decision. In the clinical trials of drug X, the risk of severe gastric bleeding increased 3-fold over 3 years in patients who received the drug (relative risk of 3). A population-based study has reported the risk of severe gastric bleeding for women in your patients’ age group at about 0.1% per year (regardless of their risk of stroke). These data can now be added to the table to allow a more balanced assessment of the benefits and harms that could arise from treatment (Table 1B).

Considering the results of this process, would you give drug X to Pat, to Dorothy or to both?

In making your decisions, remember that there is not necessarily one “right answer” here. Your analysis might go something like this:

Pat will experience a small benefit (absolute risk reduction over 3 years of about 1%), but this will be considerably offset by the increased risk of gastric bleeding (absolute risk increase over 3 years of 0.6%). The potential benefit for Dorothy (absolute risk reduction over 3 years of about 10%) is much greater than the increased risk of harm (absolute risk increase over 3 years of 0.6%). Therefore, the benefit of treatment is likely to be greater for Dorothy (who is at higher risk of stroke) than for Pat (who is at lower risk).

Assessment of the balance between benefits and harms depends on the value that patients place on reducing their risk of stroke in relation to the increased risk of gastric bleeding. Many patients might be much more concerned about the former than the latter.

Table 1B: Benefit and harm table*
All values are 3-yr event rates, %.
Patient group | Stroke: no treatment | Stroke: with treatment (drug X) | Absolute risk reduction (no treatment – treatment) | Severe gastric bleeding: no treatment | Severe gastric bleeding: with treatment (drug X) | Absolute risk increase (treatment – no treatment)
At lower risk (e.g., Pat) | 3 | 2 | 1 | 0.3 | 0.9 | 0.6
At higher risk (e.g., Dorothy) | 30 | 20 | 10 | 0.3 | 0.9 | 0.6
*Based on data from randomized controlled trials of drug X reporting a 33% relative risk reduction for the outcome (stroke) over 3 years and a 3-fold increase for the adverse effect (severe gastric bleeding) over the same period.
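
The rows of the benefit and harm table for Pat and Dorothy can be reproduced with a short Python sketch. It rests on two assumptions worth stating plainly: the 3-year event rate is taken as the annual rate multiplied by 3 (the approximation the table itself appears to use, ignoring compounding), and the function name `benefit_harm_row` and its parameters are hypothetical, chosen only for this illustration. All values are percentages.

```python
def benefit_harm_row(annual_stroke_risk_pct, annual_bleed_risk_pct,
                     years=3, rrr_stroke=0.33, rr_bleed=3.0):
    """Build one row of a benefit and harm table (all values in %)."""
    # Untreated risks: annual risk scaled linearly to the time horizon.
    stroke_untreated = annual_stroke_risk_pct * years
    bleed_untreated = annual_bleed_risk_pct * years
    # Treated risks: apply the relative risk reduction (benefit)
    # and the relative risk (harm) reported for drug X.
    stroke_treated = stroke_untreated * (1 - rrr_stroke)
    bleed_treated = bleed_untreated * rr_bleed
    return {
        "stroke_untreated": stroke_untreated,
        "stroke_treated": stroke_treated,
        "arr": stroke_untreated - stroke_treated,
        "bleed_untreated": bleed_untreated,
        "bleed_treated": bleed_treated,
        "ari": bleed_treated - bleed_untreated,
    }


pat = benefit_harm_row(annual_stroke_risk_pct=1.0, annual_bleed_risk_pct=0.1)
dorothy = benefit_harm_row(annual_stroke_risk_pct=10.0, annual_bleed_risk_pct=0.1)

for name, row in (("Pat", pat), ("Dorothy", dorothy)):
    print(name, {key: round(value, 2) for key, value in row.items()})
```

With a relative risk reduction of exactly 0.33 the computed stroke values differ slightly from the whole numbers in the table (for example, 2.01 rather than 2), which is consistent with the table being rounded.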

The bottom line
When available, trial data regarding relative risk reductions (or increases), combined with estimates of baseline (untreated) risk in individual patients, provide the basis for clinicians to balance the benefits and harms of therapy for their patients.

Number needed to treat: definitions
Number needed to treat: the number of patients who would have to receive the treatment for 1 of them to benefit; calculated as 100 divided by the absolute risk reduction expressed as a percentage (or 1 divided by the absolute risk reduction expressed as a proportion; see Appendix 1)
Number needed to harm: the number of patients who would have to receive the treatment for 1 of them to experience an adverse effect; calculated as 100 divided by the absolute risk increase expressed as a percentage (or 1 divided by the absolute risk increase expressed as a proportion)

Tip 3: Calculating and using number needed to treat

Some physicians use another measure of risk and benefit, the number needed to treat (NNT), in considering the consequences of treating or not treating. The NNT is the number of patients to whom a clinician would need to administer a particular treatment to prevent 1 patient from having an adverse outcome over a predefined period of time. (It also reflects the likelihood that a particular patient to whom treatment is administered will benefit from it.) If, for example, the NNT for a treatment is 10, the practitioner would have to give the treatment to 10 patients to prevent 1 patient from having the adverse outcome over the defined period, and each patient who received the treatment would have a 1 in 10 chance of being a beneficiary.

If the absolute risk reduction is large, you need to treat only a small number of patients to observe a benefit in at least some of them. Conversely, if the absolute risk reduction is small, you must treat many people to observe a benefit in just a few.

An analogous calculation to the one used to determine the NNT can be used to determine the number of patients who would have to be treated for 1 patient to experience an adverse event. This is the number needed to harm (NNH), which is the inverse of the absolute risk increase.

How comfortable are you with estimating the NNT for a given treatment? For example, consider the following questions: How many 60-year-old patients with hypertension would you have to treat with diuretics for a period of 5 years to prevent 1 death? How many people with myocardial infarction would you have to treat with β-blockers for 2 years to prevent 1 death? How many people with acute myocardial infarction would you have to treat with streptokinase to prevent 1 person from dying in the next 5 weeks? Compare your answers with estimates derived from published studies (Table 2). How accurate were your estimates? Are you surprised by the size of the NNT values?

Table 2: Benefit table for patients with cardiovascular problems
Clinical question | Event rate, control group, % | Event rate, treatment group, % | ARR, % | NNT
What is the reduction in risk of stroke within 5 years among 60-year-old patients with hypertension who are treated with diuretics? [11] | 2.9 | 1.9 | 1.00 | 100
What is the reduction in risk of death within 2 years after MI among 60-year-old patients treated with β-blockers? [12] | 9.8 | 7.3 | 2.50 | 40
What is the reduction in risk of death within 5 weeks after acute MI among 60-year-old patients treated with streptokinase? [13] | 12.0 | 9.2 | 2.80 | 36
Note: MI = myocardial infarction, ARR = absolute risk reduction, NNT = number needed to treat.

Physicians often experience problems in this type of exercise, usually because they are unfamiliar with the calculation of NNT. Here is one way to think about it. If a disease has a mortality rate of 100% without treatment and therapy reduces that mortality rate to 50%, how many people would you need to treat to prevent 1 death? From the numbers given, you can probably figure out that treating 100 patients with the otherwise fatal disease results in 50 survivors. This is equivalent to 1 out of every 2 treated. Since all were destined to die, the NNT to prevent 1 death is 2. The formula reflected in this calculation is as follows: the NNT to prevent 1 adverse outcome equals the inverse of the absolute risk reduction. Table 3 illustrates this concept further. Note that, if the absolute risk reduction is presented as a percentage, the NNT is 100/absolute risk reduction; if the absolute risk reduction is expressed as a proportion, the NNT is 1/absolute risk reduction. Both methods give the same answer, so use whichever you find easier.
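
The two equivalent forms of the NNT calculation, and the analogous NNH calculation, can be sketched in Python as follows. The function names are hypothetical and used only for illustration. Rounding to the nearest whole patient is an assumption made here because it reproduces the NNT values reported in Table 2.

```python
def nnt_from_percentage(arr_percent):
    """NNT when the absolute risk reduction is given as a percentage."""
    return 100 / arr_percent


def nnt_from_proportion(arr_proportion):
    """NNT when the absolute risk reduction is given as a proportion."""
    return 1 / arr_proportion


# Otherwise fatal disease: mortality falls from 100% to 50%.
nnt_fatal = nnt_from_proportion(1.00 - 0.50)

# (label, control event rate %, treatment event rate %) from Table 2.
table_2 = [
    ("diuretics, 5 yr", 2.9, 1.9),
    ("beta-blockers, 2 yr", 9.8, 7.3),
    ("streptokinase, 5 wk", 12.0, 9.2),
]
nnts = [round(nnt_from_percentage(control - treated))
        for _, control, treated in table_2]

# NNH uses the same calculation on the absolute risk increase
# (0.6% for severe gastric bleeding with drug X in Tip 2).
nnh_bleeding = round(nnt_from_percentage(0.6))

print(nnt_fatal, nnts, nnh_bleeding)
```

Passing a percentage to the proportion form (or the reverse) is off by a factor of 100, which is the unit check the text recommends making before calculating.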

It can be challenging for clinicians to estimate the baseline risks for specific populations. For example, some physicians may have little idea of the risk of stroke over 5 years among patients with hypertension. Physicians may also overestimate the effect of treatment, which leads them to ascribe larger absolute risk reductions and smaller NNT values than are actually the case [14].

Now that you know how to determine the NNT from the absolute risk reduction, you must also consider whether the NNT is reasonable. In other words, what is the maximum NNT that you and your patients will accept as justifying the benefits and harms of therapy? This is referred to as the threshold NNT [15]. If the calculated NNT is above the threshold, the benefits are not large enough (or the risk of harm is too great) to warrant initiating the therapy.

Determinants of the threshold NNT include the patient’s own values and preferences, the severity of the outcome that would be prevented, and the costs and side effects of the intervention. Thus, the threshold NNT will almost certainly be different for different patients, and there is no simple answer to the question of when an NNT is sufficiently low to justify initiating treatment.
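
As a rough illustration of comparing a calculated NNT with a threshold NNT, the sketch below derives an individual NNT from a patient's baseline risk and the relative risk reduction, using Pat and Dorothy from Tip 2. The threshold of 50 is a purely hypothetical number chosen for the example. In practice it would come from the patient's values, the severity of the outcome, and the costs and side effects of therapy. The function names are likewise assumptions, not an established method.

```python
def individual_nnt(baseline_risk, rrr):
    """NNT for one patient: 1 / (baseline risk x RRR), both as proportions."""
    return 1 / (baseline_risk * rrr)


def treatment_warranted(nnt, threshold_nnt):
    """True when the calculated NNT does not exceed the agreed threshold."""
    return nnt <= threshold_nnt


RRR_STROKE = 0.33     # relative risk reduction for stroke with drug X (Tip 2)
THRESHOLD_NNT = 50    # hypothetical threshold, for illustration only

pat_nnt = individual_nnt(0.03, RRR_STROKE)       # 3-yr baseline risk of 3%
dorothy_nnt = individual_nnt(0.30, RRR_STROKE)   # 3-yr baseline risk of 30%

for name, nnt in (("Pat", pat_nnt), ("Dorothy", dorothy_nnt)):
    print(name, round(nnt), treatment_warranted(nnt, THRESHOLD_NNT))
```

A different threshold could reverse either decision, which is the point of the paragraph above: the threshold, not the NNT alone, carries the patient-specific judgment.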

The bottom line
NNT is a concise, clinically useful presentation of the effect of an intervention. You can easily calculate it from the absolute risk reduction (just remember to check whether the absolute risk reduction is presented as a percentage or a proportion and use a numerator of 100 or 1 accordingly). Be careful not to overestimate the effect of treatments (i.e., use a value of absolute risk reduction that is too high) and thus underestimate the NNT.

Table 3: Calculation of NNT from absolute risk reduction*
Form of absolute risk reduction | Calculation of NNT | Example
Percentage (e.g., 2.8%) | 100/ARR | 100/2.8 = 36
Proportion (e.g., 0.028) | 1/ARR | 1/0.028 = 36
*Using absolute risk reduction in last row of Table 2 [13].

Conclusions
Clinicians seeking to apply clinical evidence to the care of individual patients need to understand and be able to calculate relative risk reduction, absolute risk reduction and NNT from data presented in clinical trials and systematic reviews. We have described and defined these concepts and presented tabular tools and equations to help clinicians overcome common pitfalls in acquiring these skills.

This article has been peer reviewed.

From the School of Public Health, University of Sydney, Sydney, Australia (Barratt); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Department of Medicine, University of British Columbia, Vancouver, BC (Hatala); Mount Sinai Medical Center, New York, NY (McGinn); the Department of Internal Medicine, University of the Philippines College of Medicine, Manila, The Philippines (Dans); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Department of Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Alexandra Barratt contributed tip 2, drafted the manuscript, coordinated input from coauthors and reviewers and from field-testing and revised all drafts. Peter Wyer edited drafts and provided guidance in developing the final format. Rose Hatala contributed tip 1, coordinated the internal review process and provided comments throughout development of the manuscript. Thomas McGinn contributed tip 3 and provided comments throughout development of the manuscript. Antonio Dans reviewed all drafts and provided comments throughout development of the manuscript. Sheri Keitz conducted field-testing of the tips and contributed material from the field-testing to the manuscript. Virginia Moyer reviewed and contributed to the final version of the manuscript. Gordon Guyatt helped to write the manuscript (as an editor and coauthor).

References
1. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing<br />
effect <strong>of</strong> relative and absolute risk. J Gen Intern Med 1993;8:543-8.<br />
2. Forrow L, Taylor WC, Arnold RM. Absolutely relative: How research results<br />
are summarized can affect treatment decisions. Am J Med 1992;92:121-4.<br />
3. Naylor CD, Chen E, Strauss B. Measured enthusiasm: Does the method <strong>of</strong><br />
reporting trial results alter perceptions <strong>of</strong> therapeutic effectiveness? Ann Intern<br />
Med 1992;117:916-21.<br />
4. Fahey T, Griffiths S, Peters TJ. <strong>Evidence</strong> based purchasing: understanding<br />
results <strong>of</strong> clinical trials and systematic reviews. BMJ 1995;311:1056-60.<br />
5. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures<br />
<strong>of</strong> association. In: Guyatt G, Rennie D, editors. The users’ guides to the<br />
medical literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA<br />
Publications; 2002. p. 351-68.<br />
6. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />
<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series.<br />
CMAJ 2004;171(4):347-8.<br />
7. Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study <strong>of</strong> the<br />
effect <strong>of</strong> the control rate as a predictor <strong>of</strong> treatment efficacy in meta-analysis<br />
<strong>of</strong> clinical trials. Stat Med 1998;17:1923-42.<br />
8. Furukawa TA, Guyatt GH, Griffith LE. Can we individualise the number<br />
needed to treat? An empirical study <strong>of</strong> summary effect measures in metaanalyses.<br />
Int J Epidemiol 2002;31:72-6.<br />
9. SHEP Cooperative Research Group. Prevention <strong>of</strong> stroke by anti-hypertensive<br />
drug treatment in older persons with isolated systolic hypertension. Final<br />
results <strong>of</strong> the Systolic Hypertension in the Elderly Program (SHEP). JAMA<br />
1991;265:3255-64.<br />
10. SALT Collaborative Group. Swedish Aspirin Low-dose Trial (SALT) <strong>of</strong><br />
75mg aspirin as secondary prophylaxis after cerebrovascular events. Lancet<br />
1991;338:1345-9.<br />
11. Psaty BM, Smith NL, Siscovick DS, Koepsell TD, Weiss NS, Heckbert<br />
SR. Health outcomes associated with antihypertensive therapies used as<br />
first-line agents. A systematic review and meta-analysis. JAMA 1997;277:<br />
739-45.<br />
12. β-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol<br />
in patients with acute myocardial infarction. I. Mortality results.<br />
JAMA 1982;247:1707-14.<br />
13. ISIS-2 Collaborative Group. Randomised trial <strong>of</strong> intravenous streptokinase,<br />
oral aspirin, both or neither among 17 187 cases <strong>of</strong> suspected acute myocardial<br />
infarction: ISIS-2. Lancet 1988;2:349-60.<br />
14. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number<br />
needed to treat: a clinically useful nomogram in its proper context. BMJ 1996;<br />
312:426-9.<br />
15. Sinclair JC, Cook RJ, Guyatt GH, Pauker SG, Cook DJ. When should an effective<br />
treatment be used? Derivation <strong>of</strong> the threshold number needed to treat<br />
and the minimum event rate <strong>for</strong> treatment. J Clin Epidemiol 2001;54:253-62.<br />
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />
Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net<br />
Members <strong>of</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong><br />
Working Group: Peter C. Wyer (project director), Columbia<br />
University College <strong>of</strong> Physicians and Surgeons, New York, NY;<br />
Deborah Cook, Gordon Guyatt (general editor), Ted Haines,<br />
Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose<br />
Hatala (internal review coordinator), Department <strong>of</strong> <strong>Medicine</strong>,<br />
University <strong>of</strong> British Columbia, Vancouver, BC; Robert Hayward<br />
(editor, online version), Bruce Fisher, University <strong>of</strong> Alberta,<br />
Edmonton, Alta.; Sheri Keitz (field-test coordinator), Durham<br />
Veterans Affairs Medical Center and Duke University, Durham,<br />
NC; Alexandra Barratt, University <strong>of</strong> Sydney, Sydney, Australia;<br />
Pamela Charney, Albert Einstein College <strong>of</strong> <strong>Medicine</strong>, Bronx, NY;<br />
Antonio L. Dans, University <strong>of</strong> the Philippines College <strong>of</strong><br />
<strong>Medicine</strong>, Manila, The Philippines; Barnet Eskin, Morristown<br />
Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory<br />
University, Atlanta, Ga.; Hui Lee, <strong>for</strong>merly Group Health Centre,<br />
Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas<br />
McGinn, Mount Sinai Medical Center, New York, NY; Victor M.<br />
Montori, Department <strong>of</strong> <strong>Medicine</strong>, Mayo Clinic College <strong>of</strong><br />
<strong>Medicine</strong>, Rochester, Minn.; Virginia Moyer, University <strong>of</strong> Texas,<br />
Houston, Tex.; Thomas B. Newman, University <strong>of</strong> Cali<strong>for</strong>nia, San<br />
Francisco, Calif.; Jim Nishikawa, University <strong>of</strong> Ottawa, Ottawa,<br />
Ont.; W. Scott Richardson, Wright State University, Dayton,<br />
Ohio; Mark C. Wilson, University <strong>of</strong> Iowa, Iowa City, Iowa<br />
Appendix 1: Formulas for commonly used measures of therapeutic effect<br />
Measure of effect | Formula<br />
Relative risk | (Event rate in intervention group) ÷ (event rate in control group)<br />
Relative risk reduction | 1 – relative risk, or (absolute risk reduction) ÷ (event rate in control group)<br />
Absolute risk reduction | (Event rate in control group) – (event rate in intervention group)<br />
Number needed to treat | 1 ÷ (absolute risk reduction)<br />
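The formulas in Appendix 1 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original appendix: the function name and the example counts are hypothetical, and absolute risk reduction is taken as the control event rate minus the intervention event rate, so that a beneficial treatment gives a positive value consistent with relative risk reduction = 1 – relative risk.

```python
def effect_measures(control_events, control_n, treated_events, treated_n):
    """Return relative risk, relative risk reduction, absolute risk
    reduction and number needed to treat from raw event counts.

    Assumes the control event rate is greater than zero.
    """
    cer = control_events / control_n   # control event rate
    eer = treated_events / treated_n   # intervention event rate
    rr = eer / cer                     # relative risk
    rrr = 1 - rr                       # relative risk reduction
    arr = cer - eer                    # absolute risk reduction
    # NNT is undefined when the two event rates are identical.
    nnt = 1 / arr if arr != 0 else float("inf")
    return {"rr": rr, "rrr": rrr, "arr": arr, "nnt": nnt}


# Hypothetical counts: 2 of 4 control patients and 1 of 4 treated
# patients have the event.
print(effect_measures(2, 4, 1, 4))
# {'rr': 0.5, 'rrr': 0.5, 'arr': 0.25, 'nnt': 4.0}
```

With these counts the relative risk reduction is 50%, the absolute risk reduction is 25 percentage points, and 4 patients would need to be treated to prevent 1 event.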
DOI:10.1503/cmaj.1031667<br />
<strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />
2. Measures <strong>of</strong> precision (confidence intervals)<br />
In the first article in this series, 1 we presented an approach<br />
to understanding how to estimate a treatment’s<br />
effectiveness that covered relative risk reduction, absolute<br />
risk reduction and number needed to treat. But how<br />
precise are these estimates <strong>of</strong> treatment effect?<br />
In reading the results <strong>of</strong> clinical trials, clinicians <strong>of</strong>ten<br />
come across 2 related but different statistical measures <strong>of</strong> an<br />
estimate’s precision: p values and confidence intervals. The p<br />
value describes how <strong>of</strong>ten apparent differences in treatment<br />
effect that are as large as or larger than those observed in a<br />
particular trial will occur in a long run <strong>of</strong> identical trials if in<br />
fact no true effect exists. If the observed differences are sufficiently<br />
unlikely to occur by chance alone, investigators reject<br />
the hypothesis that there is no effect. For example, consider<br />
a randomized trial comparing diuretics with placebo<br />
that finds a 25% relative risk reduction <strong>for</strong> stroke with a p<br />
value <strong>of</strong> 0.04. This p value means that, if diuretics were in<br />
fact no different in effectiveness than placebo, we would expect,<br />
by the play <strong>of</strong> chance alone, to observe a reduction —<br />
or increase — in relative risk <strong>of</strong> 25% or more in 4 out <strong>of</strong><br />
100 identical trials.<br />
Although they are useful <strong>for</strong> investigators planning how<br />
large a study needs to be to demonstrate a particular magnitude<br />
<strong>of</strong> effect, p values fail to provide clinicians and patients<br />
with the in<strong>for</strong>mation they most need, i.e., the range<br />
<strong>of</strong> values within which the true effect is likely to reside.<br />
However, confidence intervals provide exactly that in<strong>for</strong>mation<br />
in a <strong>for</strong>m that pertains directly to the process <strong>of</strong> deciding<br />
whether to administer a therapy to patients. If the<br />
range <strong>of</strong> possible true effects encompassed by the confidence<br />
interval is overly wide, the clinician may choose to<br />
administer the therapy only selectively or not at all.<br />
Confidence intervals are there<strong>for</strong>e the topic <strong>of</strong> this article.<br />
For a nontechnical explanation <strong>of</strong> p values and their<br />
limitations, we refer interested readers to the Users’ Guides<br />
to the Medical Literature. 2<br />
As with the first article in this series, 1 we present the in<strong>for</strong>mation<br />
as a series <strong>of</strong> “tips” or exercises. This means that<br />
you, the reader, will have to do some work in the course <strong>of</strong><br />
reading the article. The tips we present here have been<br />
adapted from approaches developed by educators experienced<br />
in teaching evidence-based medicine skills to clinicians.<br />
2-4 A related article, intended for people who teach<br />
these concepts to clinicians, is available online at<br />
www.cmaj.ca/cgi/content/full/171/6/611/DC1.<br />
Review<br />
Victor M. Montori, Jennifer Kleinbart, Thomas B. Newman, Sheri Keitz, Peter C. Wyer,<br />
Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group<br />
Clinician learners’ objectives<br />
Making confidence intervals intuitive<br />
• Understand the dynamic relation between confidence<br />
intervals and sample size.<br />
Interpreting confidence intervals<br />
• Understand how the confidence intervals around estimates<br />
<strong>of</strong> treatment effect can affect therapeutic decisions.<br />
Estimating confidence intervals <strong>for</strong> extreme<br />
proportions<br />
• Learn a shortcut <strong>for</strong> estimating the upper limit <strong>of</strong> the<br />
95% confidence intervals <strong>for</strong> proportions with very<br />
small numerators and <strong>for</strong> proportions with numerators<br />
very close to the corresponding denominators.<br />
Teachers of evidence-based medicine:<br />
See the “Tips for teachers” version of this article online<br />
at www.cmaj.ca/cgi/content/full/171/6/611/DC1. It<br />
contains the exercises found in this article in fill-in-the-blank<br />
format, commentaries from the authors on the<br />
challenges they encounter when teaching these concepts<br />
to clinician learners and links to useful online resources.<br />
CMAJ • SEPT. 14, 2004; 171 (6) 611<br />
© 2004 Canadian Medical Association or its licensors<br />
Tip 1: Making confidence intervals intuitive<br />
Imagine a hypothetical series of 5 trials (of equal duration<br />
but different sample sizes) in which investigators have<br />
experimented with treatments for patients who have a particular<br />
condition (elevated low-density lipoprotein cholesterol)<br />
to determine whether a drug (a novel cholesterol-lowering<br />
agent) would work better than a placebo to<br />
prevent strokes (Table 1A). The smallest trial enrolled only<br />
8 patients, and the largest enrolled 2000 patients, and half<br />
<strong>of</strong> the patients in each trial underwent the experimental<br />
treatment. Now imagine that all <strong>of</strong> the trials showed a relative<br />
risk reduction <strong>for</strong> the treatment group <strong>of</strong> 50% (meaning<br />
that patients in the drug treatment group were only half<br />
as likely as those in the placebo group to have a stroke). In<br />
each individual trial, how confident can we be that the true<br />
value <strong>of</strong> the relative risk reduction is important <strong>for</strong> patients<br />
(i.e., “patient-important”)? 5 If you were to look at the studies<br />
individually, which ones would lead you to recommend<br />
the treatment unequivocally to your patients?<br />
Most clinicians might intuitively guess that we could be<br />
more confident in the results <strong>of</strong> the larger trials. Why is this?<br />
In the absence <strong>of</strong> bias or systematic error, the results <strong>of</strong> a trial<br />
can be interpreted as an estimate <strong>of</strong> the true magnitude <strong>of</strong> effect<br />
that would occur if all possible eligible patients had been<br />
included. When only a few <strong>of</strong> these patients are included, the<br />
play <strong>of</strong> chance alone may lead to a result that is quite different<br />
from the true value. Confidence intervals are a numeric<br />
measure <strong>of</strong> the range within which such variation is likely to<br />
occur. The 95% confidence intervals that we <strong>of</strong>ten see in<br />
biomedical publications represent the range within which we<br />
are likely to find the underlying true treatment effect.<br />
To gain a better appreciation of confidence intervals, go<br />
back to Table 1A (don’t look yet at Table 1B!) and take a<br />
guess at what you think the confidence intervals might be<br />
for the 5 trials presented. In a moment you’ll see how your<br />
estimates compare to 95% confidence intervals calculated<br />
using a formula, but for now, try figuring out intervals that<br />
you intuitively feel to be appropriate.<br />
Table 1A: Relative risk and relative risk reduction observed in 5 successively larger hypothetical trials<br />
Control event rate | Treatment event rate | Relative risk, % | Relative risk reduction, %*<br />
2/4 | 1/4 | 50 | 50<br />
10/20 | 5/20 | 50 | 50<br />
20/40 | 10/40 | 50 | 50<br />
50/100 | 25/100 | 50 | 50<br />
500/1000 | 250/1000 | 50 | 50<br />
*Calculated as the absolute difference between the control and treatment event rates<br />
(expressed as a fraction or a percentage), divided by the control event rate. In the first row<br />
in this table, relative risk reduction = (2/4 – 1/4) ÷ 2/4 = 1/2 or 50%. If the control event<br />
rate were 3/4 and the treatment event rate 1/4, the relative risk reduction would be<br />
(3/4 – 1/4) ÷ 3/4 = 2/3. Using percentages for the same example, if the control event rate<br />
were 75% and the treatment event rate were 25%, the relative risk reduction would be<br />
(75% – 25%) ÷ 75% = 67%.<br />
Now, consider the first trial, in which 2 out of 4 patients<br />
who receive the control intervention and 1 out of 4 patients<br />
who receive the experimental treatment suffer a stroke.<br />
The risk in the treatment group is half that in the control<br />
group, which gives us a relative risk of 50% and a relative<br />
risk reduction of 50% (see Table 1A). 1,6<br />
Given the substantial relative risk reduction, would you<br />
be ready to recommend this treatment to a patient? Before<br />
you answer this question, consider whether it is plausible,<br />
with so few patients in the study, that the investigators might<br />
just have gotten lucky and the true treatment effect is really a<br />
50% increase in relative risk. In other words, is it plausible<br />
that the true event rate in the group that received treatment<br />
was 3 out of 4 instead of 1 out of 4? If you accept that this<br />
large, harmful effect might represent the underlying truth,<br />
would you also accept that a relative risk reduction of 90%,<br />
i.e., a very large benefit of treatment, is consistent with the<br />
experimental data in these few patients? To the extent that<br />
these suggestions are plausible, we can intuitively create a<br />
range of plausible truth of “–50% to 90%” surrounding the<br />
relative risk reduction of 50% that was actually observed.<br />
Now, do this for each of the other 4 trials. In the trial with<br />
20 patients in each group, 10 of those in the control group<br />
suffered a stroke, as did 5 of those in the treatment group.<br />
Both the relative risk and the relative risk reduction are again<br />
50%. Do you still consider it plausible that the true event rate<br />
in the treatment group is 15 out of 20 rather than 5 out of 20<br />
(the same proportions as we considered in the smaller trial)?<br />
If not, what about 12 out of 20? The latter would represent a<br />
20% increase in risk over the control rate (12/20 v. 10/20). A<br />
true relative risk reduction of 90% may still be plausible,<br />
given the observed results and the numbers of patients involved.<br />
In short, given this larger number of patients and the<br />
lower chance of a “bad sample,” the “range of plausible truth”<br />
around the observed relative risk reduction of 50% might be<br />
narrower, perhaps from a relative risk increase of 20% (represented<br />
as –20%) to a relative risk reduction of 90%.<br />
You can develop similar intuitively derived confidence<br />
intervals for the larger trials. We’ve done this in Table 1B,<br />
which also shows the 95% confidence intervals that we calculated<br />
using a statistical program called StatsDirect (available<br />
commercially through www.statsdirect.com). You can<br />
see that in some instances we intuitively overestimated or<br />
underestimated the intervals relative to those we derived<br />
using the statistical formulas.<br />
Table 1B: Confidence intervals (CIs) around the relative risk reduction in 5 successively larger hypothetical trials<br />
Control event rate | Treatment event rate | Relative risk, % | Relative risk reduction, % | Intuitive CI around relative risk reduction, %* | Calculated 95% CI around relative risk reduction, %*†<br />
2/4 | 1/4 | 50 | 50 | –50 to 90 | –174 to 92<br />
10/20 | 5/20 | 50 | 50 | –20 to 90 | –14 to 79.5<br />
20/40 | 10/40 | 50 | 50 | 0 to 90 | 9.5 to 73.4<br />
50/100 | 25/100 | 50 | 50 | 20 to 80 | 26.8 to 66.4<br />
500/1000 | 250/1000 | 50 | 50 | 40 to 60 | 43.5 to 55.9<br />
*Negative values represent an increase in risk relative to control. See text for further explanation.<br />
†Calculated by statistical software.<br />
The bottom line<br />
Confidence intervals in<strong>for</strong>m clinicians about the range<br />
within which the true treatment effect might plausibly lie,<br />
given the trial data. Greater precision (narrower confidence<br />
intervals) results from larger sample sizes and the consequently<br />
larger number of events. Statisticians (and statistical software)<br />
can calculate 95% confidence intervals around any<br />
estimate of treatment effect.<br />
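To see how sample size drives the width of the interval, the sketch below computes an approximate 95% confidence interval for the relative risk reduction in each of the 5 hypothetical trials of Table 1A. It is an illustration under stated assumptions: it uses the common large-sample approximation on the log relative risk, which is not necessarily the method StatsDirect applied, so the results will differ somewhat from the calculated intervals in Table 1B, most visibly for the smallest trials. The function name is hypothetical.

```python
import math


def rrr_confidence_interval(control_events, control_n,
                            treated_events, treated_n, z=1.96):
    """Approximate 95% CI for the relative risk reduction, in percent.

    Assumption: normal approximation on the log relative risk, which is
    unreliable for very small trials and undefined for zero events.
    """
    rr = (treated_events / treated_n) / (control_events / control_n)
    se_log_rr = math.sqrt(1 / treated_events - 1 / treated_n
                          + 1 / control_events - 1 / control_n)
    rr_low = math.exp(math.log(rr) - z * se_log_rr)
    rr_high = math.exp(math.log(rr) + z * se_log_rr)
    # RRR = 1 - RR, so the upper RR bound gives the lower RRR bound.
    return 100 * (1 - rr_high), 100 * (1 - rr_low)


# (control events, control n, treated events, treated n) from Table 1A
trials = [(2, 4, 1, 4), (10, 20, 5, 20), (20, 40, 10, 40),
          (50, 100, 25, 100), (500, 1000, 250, 1000)]

for c_e, c_n, t_e, t_n in trials:
    low, high = rrr_confidence_interval(c_e, c_n, t_e, t_n)
    print(f"{c_e}/{c_n} v. {t_e}/{t_n}: RRR 50%, "
          f"approximate 95% CI {low:.1f}% to {high:.1f}%")
```

Every trial has the same point estimate, yet the interval shrinks steadily as the number of patients and events grows, which is the pattern the intuitive exercise is meant to convey.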
Tip 2: Interpreting confidence intervals<br />
You should now have an understanding<br />
<strong>of</strong> the relation between the<br />
width <strong>of</strong> the confidence interval<br />
around a measure <strong>of</strong> outcome in a<br />
clinical trial and the number <strong>of</strong> participants<br />
and events in that study.<br />
You are ready to consider whether a<br />
study is sufficiently large, and the resulting<br />
confidence intervals sufficiently<br />
narrow, to reach a definitive<br />
conclusion about recommending the<br />
therapy, after taking into account<br />
your patient’s values, preferences and<br />
circumstances.<br />
The concept <strong>of</strong> a minimally important<br />
treatment effect proves useful<br />
in considering the issue <strong>of</strong> when a<br />
study is large enough and has there<strong>for</strong>e<br />
generated confidence intervals<br />
that are narrow enough to recommend<br />
<strong>for</strong> or against the therapy. This<br />
concept requires the clinician to<br />
think about the smallest amount <strong>of</strong><br />
benefit that would justify therapy.<br />
Consider a set <strong>of</strong> hypothetical trials.<br />
Fig. 1A displays the results <strong>of</strong> trial<br />
1. The uppermost point <strong>of</strong> the bell<br />
curve is the observed treatment effect<br />
(the point estimate), and the tails <strong>of</strong><br />
the bell curve represent the boundaries<br />
<strong>of</strong> the 95% confidence interval.<br />
For the medical condition being investigated,<br />
assume that a 1% absolute<br />
risk reduction is the smallest benefit<br />
that patients would consider to outweigh<br />
the downsides <strong>of</strong> therapy.<br />
Given the information in Fig. 1A,<br />
would you recommend this treatment to your patients if<br />
the point estimate represented the truth? What if the upper<br />
boundary of the confidence interval represented the truth?<br />
Or the lower boundary?<br />
For all 3 of these questions, the answer is yes, provided<br />
that 1% is in fact the smallest patient-important difference.<br />
Thus, the trial is definitive and allows a strong inference<br />
about the treatment decision.<br />
In the case of trial 2 (see Fig. 1B), would your patients<br />
choose to undergo the treatment if either the point estimate<br />
or the upper boundary of the confidence interval represented<br />
the true effect? What about the lower boundary? The answer<br />
regarding the lower boundary is no, because the effect<br />
is less than the smallest difference that patients would consider<br />
large enough for them to undergo the treatment. Although<br />
trial 2 shows a “positive” result (i.e., the confidence<br />
interval does not encompass zero), the sample size was inadequate<br />
and the result remains compatible with risk reductions<br />
below the minimal patient-important difference.<br />
[Figure 1, panels A to C. Horizontal axis: % absolute risk reduction, from –5 to 5; values below 0 are labelled “Treatment harms” and values above 0 “Treatment helps”. Panel A shows trial 1, panel B trial 2, and panel C trials 3 and 4.]<br />
Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation,<br />
an absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients<br />
would consider important enough to warrant undergoing treatment. In each<br />
case, the uppermost point of the bell curve is the observed treatment effect (the point<br />
estimate), and the tails of the bell curve represent the boundaries of the 95% confidence<br />
interval. See text for further explanation.<br />
When a study result is positive, you can determine<br />
whether the sample size was adequate by checking the lower<br />
boundary <strong>of</strong> the confidence interval, the smallest plausible<br />
treatment effect compatible with the results. If this value is<br />
greater than the smallest difference your patients would<br />
consider important, the sample size is adequate and the trial<br />
result definitive. However, if the lower boundary falls below<br />
the smallest patient-important difference, leaving patients<br />
uncertain as to whether taking the treatment is in their best<br />
interest, the trial is not definitive. The sample size is inadequate,<br />
and further trials are required.<br />
What happens when the confidence interval <strong>for</strong> the effect<br />
<strong>of</strong> a therapy includes zero (where zero means “no effect”<br />
and hence a negative result)?<br />
For studies with negative results — those that do not exclude<br />
a true treatment effect <strong>of</strong> zero — you must focus on<br />
the other end <strong>of</strong> the confidence interval, that representing<br />
the largest plausible treatment effect consistent with the<br />
trial data. You must consider whether the upper boundary<br />
<strong>of</strong> the confidence interval falls below the smallest difference<br />
that patients might consider important. If so, the sample<br />
size is adequate, and the trial is definitively negative (see<br />
trial 3 in Fig. 1C). Conversely, if the upper boundary exceeds<br />
the smallest patient-important difference, then the<br />
trial is not definitively negative, and more trials with larger<br />
sample sizes are needed (see trial 4 in Fig. 1C).<br />
The bottom line<br />
To determine whether a trial with a positive result is sufficiently<br />
large, clinicians should focus on the lower boundary <strong>of</strong><br />
the confidence interval and determine if it is greater than the<br />
smallest treatment benefit that patients would consider important<br />
enough to warrant taking the treatment. For studies<br />
with a negative result, clinicians should examine the upper<br />
boundary <strong>of</strong> the confidence interval to determine if this value<br />
is lower than the smallest treatment benefit that patients<br />
would consider important enough to warrant taking the treatment.<br />
In either case, if the confidence interval overlaps the<br />
smallest treatment benefit that is important to patients, then<br />
the study is not definitive and a larger study is needed.<br />
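The decision rule summarized above can be written as a short function. This is a sketch, not a validated tool: the function name, the labels it returns and the example intervals are hypothetical (the numeric limits of the trials in Fig. 1 are not given in the text), and the handling of intervals lying entirely below zero is an assumption, since the article does not discuss that case.

```python
def interpret_trial(ci_lower, ci_upper, smallest_important_benefit):
    """Classify a trial from the 95% CI of its absolute risk reduction.

    All three arguments are in the same units (e.g., percentage points).
    """
    if ci_lower > 0:
        # Positive result: the interval excludes "no effect".
        if ci_lower > smallest_important_benefit:
            return "definitive positive"
        return "positive but not definitive"
    # Negative result: the interval includes zero. Assumption: intervals
    # lying entirely below zero are handled by the same branch.
    if ci_upper < smallest_important_benefit:
        return "definitive negative"
    return "negative but not definitive"


# Hypothetical intervals, with 1% as the smallest patient-important benefit.
print(interpret_trial(2.0, 4.0, 1.0))    # definitive positive
print(interpret_trial(0.5, 3.0, 1.0))    # positive but not definitive
print(interpret_trial(-1.0, 0.5, 1.0))   # definitive negative
print(interpret_trial(-1.0, 3.0, 1.0))   # negative but not definitive
```

In both "not definitive" cases the interval straddles the smallest patient-important benefit, which is the signal that a larger study is needed.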
Table 2: The 3/n rule to estimate the upper limit of the 95% confidence interval (CI) for proportions with 0 in the numerator<br />
n | Observed proportion | 3/n | Upper limit of 95% CI<br />
20 | 0/20 | 3/20 | 0.15 or 15%<br />
100 | 0/100 | 3/100 | 0.03 or 3%<br />
300 | 0/300 | 3/300 | 0.01 or 1%<br />
1000 | 0/1000 | 3/1000 | 0.003 or 0.3%<br />
Tip 3: Estimating confidence intervals <strong>for</strong><br />
extreme proportions<br />
When reviewing journal articles, readers <strong>of</strong>ten encounter<br />
proportions with small numerators or with numerators very<br />
close in size to the denominators. Both situations raise the<br />
same issue. For example, an article might assert that a treatment<br />
is safe because no serious complications occurred in the<br />
20 patients who received it; another might claim near-perfect<br />
sensitivity <strong>for</strong> a test that correctly identified 29 out <strong>of</strong> 30<br />
cases <strong>of</strong> a disease. However, in many cases such articles do<br />
not present confidence intervals <strong>for</strong> these proportions.<br />
The first step <strong>of</strong> this tip is to learn the “rule <strong>of</strong> 3” <strong>for</strong><br />
zero numerators, 7 and the next step is to learn an extension<br />
(which might be called the “rule <strong>of</strong> 5, 7, 9 and 10”) <strong>for</strong> numerators<br />
<strong>of</strong> 1, 2, 3 and 4. 8<br />
Consider the following example. Twenty people undergo<br />
surgery, and none suffer serious complications. Does<br />
this result allow us to be confident that the true complication<br />
rate is very low, say less than 5% (1 out <strong>of</strong> 20)? What<br />
about 10% (2 out <strong>of</strong> 20)?<br />
You will probably appreciate that if the true complication<br />
rate were 5% (1 in 20), it wouldn’t be that unusual to<br />
observe no complications in a sample of 20, but for increasingly<br />
higher true rates, the chance of observing no complications<br />
in a sample of 20 gets increasingly smaller.<br />
What we are after is the upper limit <strong>of</strong> a 95% confidence<br />
interval <strong>for</strong> the proportion 0/20. The following is a<br />
simple rule <strong>for</strong> calculating this upper limit: if an event occurs<br />
0 times in n subjects, the upper boundary <strong>of</strong> the 95%<br />
confidence interval <strong>for</strong> the event rate is about 3/n (Table 2).<br />
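One way to see where the 3/n rule comes from (a standard argument, not necessarily the one the authors had in mind) is to assume that patients are independent, so that the chance of observing no events in n subjects is (1 – p)^n when the true event rate is p. Setting that chance to 5% and solving gives p = 1 – 0.05^(1/n), which is close to 3/n because –ln(0.05) is about 3.0. The sketch below compares the two; the function name is hypothetical.

```python
def exact_upper_limit_zero_events(n, confidence=0.95):
    """Event rate at which observing 0 events in n independent subjects
    has probability exactly 1 - confidence (a one-sided limit)."""
    return 1 - (1 - confidence) ** (1 / n)


for n in (20, 100, 300, 1000):
    shortcut = 3 / n
    exact = exact_upper_limit_zero_events(n)
    print(f"n = {n}: 3/n = {shortcut:.4f}, exact = {exact:.4f}")
```

The shortcut slightly overstates the exact limit, and the two converge as n grows, so the rule errs on the cautious side for small samples.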
You can use the same <strong>for</strong>mula when the observed proportion<br />
is 100%, by translating 100% into its complement.<br />
For example, imagine that the authors <strong>of</strong> a study on a diagnostic<br />
test report 100% sensitivity when the test is per<strong>for</strong>med<br />
<strong>for</strong> 20 patients who have the disease. That means<br />
that the test identified all 20 with the disease as positive and<br />
identified none as falsely negative. You would like to know<br />
how low the sensitivity <strong>of</strong> the test could be, given that it<br />
was 100% for a sample of 20 patients. Using the 3/n rule<br />
for the proportion of false negatives (0 out of 20), we find<br />
that the proportion of false negatives could be as high as<br />
15% (3 out of 20). Subtract this result from 100% to obtain<br />
the lower limit of the 95% confidence interval for the sensitivity<br />
(in this example, 85%).<br />
Table 3: Method for obtaining an approximation of the upper limit of the 95% CI*<br />
Observed numerator | Numerator for calculating approximate upper limit of 95% CI<br />
0 | 3<br />
1 | 5<br />
2 | 7<br />
3 | 9<br />
4 | 10<br />
*For any observed numerator listed in the left hand column, divide the<br />
corresponding numerator in the right hand column by the number of study<br />
subjects to get the approximate upper limit of the 95% CI. For example, if the<br />
sample size is 15 and the observed numerator is 3, the upper limit of the 95%<br />
confidence interval is approximately 9 ÷ 15 = 0.6 or 60%.<br />
What if the numerator is not zero but is still very small?<br />
There is a shortcut rule <strong>for</strong> small numerators other than<br />
zero (i.e., 1, 2, 3 or 4) (Table 3).<br />
For example, out <strong>of</strong> 20 people receiving surgery imagine<br />
that 1 person suffers a serious complication, yielding an observed<br />
proportion <strong>of</strong> 1/20 or 5%. Using the corresponding<br />
value from Table 3 (i.e., 5) and the sample size, we find that<br />
the upper limit <strong>of</strong> the 95% confidence interval will be<br />
about 5/20 or 25%. If 2 <strong>of</strong> the 20 (10%) had suffered complications,<br />
the upper limit would be about 7/20, or 35%.<br />
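The shortcut in Table 3, together with the complement trick for proportions near 100%, can be sketched as a simple lookup. The function names are hypothetical, and capping the result at 1 is an added safeguard for very small samples rather than part of the published rule.

```python
# Numerators from Table 3: observed numerator -> numerator used for the
# approximate upper limit of the 95% CI.
RULE_NUMERATORS = {0: 3, 1: 5, 2: 7, 3: 9, 4: 10}


def approx_upper_limit(observed_events, n):
    """Approximate upper limit of the 95% CI for observed_events / n."""
    if observed_events not in RULE_NUMERATORS:
        raise ValueError("shortcut applies only to numerators 0 to 4")
    # Assumption: cap at 1, since a proportion cannot exceed 100%.
    return min(1.0, RULE_NUMERATORS[observed_events] / n)


def approx_lower_limit_near_100(observed_successes, n):
    """Approximate lower limit for a proportion close to 100%, obtained
    by applying the shortcut to its complement (the failures)."""
    return 1 - approx_upper_limit(n - observed_successes, n)


print(approx_upper_limit(1, 20))            # 1/20 observed: about 0.25
print(approx_upper_limit(2, 20))            # 2/20 observed: about 0.35
print(approx_lower_limit_near_100(20, 20))  # 100% sensitivity: about 0.85
print(approx_lower_limit_near_100(29, 30))  # 29/30 sensitivity: about 0.83
```

The last line applies the rule to the test that identified 29 of 30 cases: with 1 false negative, the false-negative proportion could be as high as about 5/30, so the sensitivity could be as low as roughly 83%.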
The bottom line<br />
Although statisticians (and statistical s<strong>of</strong>tware) can calculate<br />
95% confidence intervals, clinicians can readily estimate<br />
the upper boundary <strong>of</strong> confidence intervals <strong>for</strong> proportions<br />
with very small numerators. These estimates highlight the<br />
greater precision attained with larger sample sizes and help<br />
to calibrate intuitively derived confidence intervals.<br />
Conclusions<br />
Clinicians need to understand and interpret confidence<br />
intervals to properly use research results in making decisions.<br />
They can use thresholds, based on differences that<br />
patients are likely to consider important, to interpret confidence<br />
intervals and to judge whether the results are definitive<br />
or whether a larger study (with more patients and<br />
events) is necessary. For proportions with extremely small<br />
numerators, a simple rule is available <strong>for</strong> estimating the upper<br />
limit <strong>of</strong> the confidence interval.<br />
This article has been peer reviewed.<br />
From the Department <strong>of</strong> <strong>Medicine</strong>, Mayo Clinic College <strong>of</strong> <strong>Medicine</strong>, Rochester,<br />
Minn. (Montori); the Hospital <strong>Medicine</strong> Unit, Division <strong>of</strong> General <strong>Medicine</strong>,<br />
Emory University, Atlanta, Ga. (Kleinbart); the Departments <strong>of</strong> Epidemiology and<br />
Biostatistics and <strong>of</strong> Pediatrics, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco,<br />
Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University<br />
Medical Center, Durham, NC (Keitz); the Columbia University College <strong>of</strong><br />
Physicians and Surgeons, New York, NY (Wyer); the Department <strong>of</strong> Pediatrics,<br />
University <strong>of</strong> Texas, Houston, Tex. (Moyer); and the Departments <strong>of</strong> <strong>Medicine</strong><br />
and <strong>of</strong> Clinical Epidemiology and Biostatistics, McMaster University, Hamilton,<br />
Ont. (Guyatt)<br />
Competing interests: None declared.<br />
Contributors: Victor Montori, as principal author, decided on the structure and<br />
flow <strong>of</strong> the article, and oversaw and contributed to the writing <strong>of</strong> the manuscript.<br />
Jennifer Kleinbart reviewed the manuscript at all phases <strong>of</strong> development and contributed<br />
to the writing <strong>of</strong> tip 1. Thomas Newman developed the original idea <strong>for</strong><br />
tip 3 and reviewed the manuscript at all phases <strong>of</strong> development. Sheri Keitz used<br />
all <strong>of</strong> the tips as part <strong>of</strong> a live teaching exercise and submitted comments, suggestions<br />
and the possible variations that are described in the article. Peter Wyer reviewed<br />
and revised the final draft <strong>of</strong> the manuscript to achieve uni<strong>for</strong>m adherence<br />
with <strong>for</strong>mat specifications. Virginia Moyer reviewed and revised the final draft <strong>of</strong><br />
the manuscript to improve clarity and style. Gordon Guyatt developed the original<br />
ideas <strong>for</strong> tips 1 and 2, reviewed the manuscript at all phases <strong>of</strong> development, contributed<br />
to the writing as coauthor, and reviewed and revised the final draft <strong>of</strong> the<br />
manuscript to achieve accuracy and consistency <strong>of</strong> content as general editor.<br />
References<br />
1. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. <strong>Tips</strong> <strong>for</strong><br />
learners <strong>of</strong> evidence-based medicine: 1. Relative risk reduction, absolute risk<br />
reduction and number needed to treat. CMAJ 2004;171(4):353-8.<br />
2. Guyatt G, Jaeschke R, Cook D, Walter S. Therapy and understanding the results:<br />
hypothesis testing. In: Guyatt G, Rennie D, editors. Users’ guides to the<br />
medical literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA<br />
Press; 2002. p. 329-38.<br />
3. Guyatt G, Walter S, Cook D, Jaeschke R. Therapy and understanding the results:<br />
confidence intervals. In: Guyatt G, Rennie D, editors. Users’ guides to the<br />
medical literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA<br />
Press; 2002. p. 339-49.<br />
4. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />
<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series<br />
[editorial]. CMAJ 2004;171(4):347-8.<br />
5. Guyatt G, Montori V, Devereaux PJ, Schunemann H, Bhandari M. Patients at the<br />
center: in our practice, and in our use <strong>of</strong> language. ACP J Club 2004;140:A11-2.<br />
6. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures<br />
<strong>of</strong> association. In: Guyatt G, Rennie D, editors. Users’ guides to the medical<br />
literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA Press;<br />
2002. p. 351-68.<br />
7. Hanley J, Lippman-Hand A. If nothing goes wrong, is everything all right?<br />
Interpreting zero numerators. JAMA 1983;249:1743-5.<br />
8. Newman TB. If almost nothing goes wrong, is almost everything all right?<br />
[letter]. JAMA 1995;274:1013.<br />
Members <strong>of</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working<br />
Group: Peter C. Wyer (project director), College <strong>of</strong> Physicians and<br />
Surgeons, Columbia University, New York, NY; Deborah Cook,<br />
Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke,<br />
McMaster University, Hamilton, Ont.; Rose Hatala (internal<br />
review coordinator), University <strong>of</strong> British Columbia, Vancouver,<br />
BC; Robert Hayward (editor, online version), Bruce Fisher,<br />
University <strong>of</strong> Alberta, Edmonton, Alta.; Sheri Keitz (field test<br />
coordinator), Durham Veterans Affairs Medical Center and Duke<br />
University Medical Center, Durham, NC; Alexandra Barratt,<br />
University <strong>of</strong> Sydney, Sydney, Australia; Pamela Charney, Albert<br />
Einstein College <strong>of</strong> <strong>Medicine</strong>, Bronx, NY; Antonio L. Dans,<br />
University <strong>of</strong> the Philippines College <strong>of</strong> <strong>Medicine</strong>, Manila, The<br />
Philippines; Barnet Eskin, Morristown Memorial Hospital,<br />
Morristown, NJ; Jennifer Kleinbart, Emory University School <strong>of</strong><br />
<strong>Medicine</strong>, Atlanta, Ga.; Hui Lee, <strong>for</strong>merly Group Health Centre,<br />
Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas<br />
McGinn, Mount Sinai Medical Center, New York, NY; Victor M.<br />
Montori, Mayo Clinic College <strong>of</strong> <strong>Medicine</strong>, Rochester, Minn.;<br />
Virginia Moyer, University <strong>of</strong> Texas, Houston, Tex.; Thomas B.<br />
Newman, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco,<br />
Calif.; Jim Nishikawa, University <strong>of</strong> Ottawa, Ottawa, Ont.;<br />
Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain;<br />
W. Scott Richardson, Wright State University, Dayton, Ohio; Mark<br />
C. Wilson, University <strong>of</strong> Iowa, Iowa City, Iowa<br />
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />
Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net<br />
Articles to date in this series<br />
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S,<br />
et al. Tips for learners of evidence-based medicine: 1.<br />
Relative risk reduction, absolute risk reduction and<br />
number needed to treat. CMAJ 2004;171(4):353-8.<br />
CMAJ SEPT. 14, 2004; 171 (6) 615
Correspondance<br />
ical journals [editorial]. CMAJ 1984;130:1412.<br />
11. Bero LA, Galbraith A, Rennie D. The publication<br />
<strong>of</strong> sponsored symposiums in medical journals.<br />
N Engl J Med 1992;327:1135-40.<br />
Competing interests: None declared.<br />
DOI:10.1503/cmaj.1041329<br />
Online access to a<br />
<strong>for</strong>-pr<strong>of</strong>it CMAJ<br />
Wayne Kondro, quoting CMA Secretary-General<br />
Bill Tholl, reports<br />
that “Physicians will continue to receive<br />
their free subscription to CMAJ as a benefit<br />
<strong>of</strong> association membership ‘<strong>for</strong> the<br />
<strong>for</strong>eseeable future’” after CMA Publications<br />
is sold to CMA Holdings in January<br />
2004. 1 That’s all to the good — but what<br />
then <strong>of</strong> CMAJ’s worldwide readers? Will<br />
access to CMAJ remain free <strong>for</strong> all online<br />
users, despite the shift to <strong>for</strong>-pr<strong>of</strong>it status?<br />
I found it strange that this issue was not<br />
addressed in Kondro’s news article.<br />
Adam L. Scheffler<br />
Independent researcher<br />
Chicago, Ill.<br />
Reference<br />
1. Kondro W. CMAJ enters <strong>for</strong>-pr<strong>of</strong>it market.<br />
CMAJ 2004;171(11):1334.<br />
DOI:10.1503/cmaj.1041759<br />
[Editor’s note]<br />
CMAJ’s editors have addressed the<br />
topic <strong>of</strong> open access in this issue’s<br />
Editorial (see page 149).<br />
DOI:10.1503/cmaj.1041760<br />
Correction<br />
In part 2 <strong>of</strong> the series “<strong>Tips</strong> <strong>for</strong> learners<br />
<strong>of</strong> evidence-based medicine” 1 the<br />
in<strong>for</strong>mation in Fig. 1 did not fully correspond<br />
with the in<strong>for</strong>mation provided in<br />
the text. Specifically, the data for hypothetical<br />
trial 2 in Fig. 1B should have<br />
been centred at 5% absolute risk reduction,<br />
as described in the text; instead, the<br />
figure showed trial 2 as being centred at<br />
about 6.5% absolute risk reduction. The<br />
corrected figure is presented here.<br />
162 JAMC • 18 JANV. 2005; 172 (2)<br />
[Fig. 1 (corrected): panels A, B and C show bell curves labelled Trial 1 to Trial 4 on a horizontal axis of % absolute risk reduction running from -5 to 5, with regions labelled “Treatment harms” and “Treatment helps”.]<br />
Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation,<br />
an absolute risk reduction of 1% (double vertical rule) is the smallest benefit<br />
that patients would consider important enough to warrant undergoing treatment. In<br />
each case, the uppermost point of the bell curve is the observed treatment effect<br />
(the point estimate), and the tails of the bell curve represent the boundaries of the<br />
95% confidence interval. See the text 1 for further explanation.<br />
Reference<br />
1. Montori VM, Kleinbart J, Newman TB, Keitz S,<br />
Wyer PC, Moyer V, et al. Tips for learners of<br />
evidence-based medicine: 2. Measures of precision<br />
(confidence intervals). CMAJ 2004;171(6):611-5.<br />
DOI:10.1503/cmaj.1041761<br />
DOI:10.1503/cmaj.1031981<br />
<strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />
3. Measures <strong>of</strong> observer variability (kappa statistic)<br />
Thomas McGinn, Peter C. Wyer, Thomas B. Newman, Sheri Keitz, Rosanne Leipzig,<br />
Gordon Guyatt, <strong>for</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group<br />
Imagine that you’re a busy family physician and that<br />
you’ve found a rare free moment to scan the recent literature.<br />
Reviewing your preferred digest <strong>of</strong> abstracts,<br />
you notice a study comparing emergency physicians’ interpretation<br />
<strong>of</strong> chest radiographs with radiologists’ interpretations.<br />
1 The article catches your eye because you have frequently<br />
found that your own reading <strong>of</strong> a radiograph differs<br />
from both the <strong>of</strong>ficial radiologist reading and an un<strong>of</strong>ficial<br />
reading by a different radiologist, and you’ve wondered<br />
about the extent <strong>of</strong> this disagreement and its implications.<br />
Looking at the abstract, you find that the authors have reported<br />
the extent <strong>of</strong> agreement using the κ statistic. You recall<br />
that κ stands <strong>for</strong> “kappa” and that you have encountered this<br />
measure <strong>of</strong> agreement be<strong>for</strong>e, but your grasp <strong>of</strong> its meaning<br />
remains tentative. You there<strong>for</strong>e choose to take a quick glance<br />
at the authors’ conclusions as reported in the abstract and to<br />
defer downloading and reviewing the full text <strong>of</strong> the article.<br />
Practitioners, such as the family physician just described,<br />
may benefit from understanding measures <strong>of</strong> observer variability.<br />
For many studies in the medical literature, clinician<br />
readers will be interested in the extent <strong>of</strong> agreement among<br />
multiple observers. For example, do the investigators in a<br />
clinical study agree on the presence or absence <strong>of</strong> physical,<br />
radiographic or laboratory findings? Do investigators involved<br />
in a systematic overview agree on the validity <strong>of</strong> an<br />
article, or on whether the article should be included in the<br />
analysis? In perusing these types <strong>of</strong> studies, where investigators<br />
are interested in quantifying agreement, clinicians<br />
will <strong>of</strong>ten come across the kappa statistic.<br />
In this article we present tips aimed at helping clinical<br />
learners to use the concepts <strong>of</strong> kappa when applying diagnostic<br />
tests in practice. The tips presented here have been<br />
adapted from approaches developed by educators experienced<br />
in teaching evidence-based medicine skills to clinicians.<br />
2 A related article, intended <strong>for</strong> people who teach<br />
these concepts to clinicians, is available online at www.<br />
cmaj.ca/cgi/content/full/171/11/1369/DC1.<br />
Clinician learners’ objectives<br />
Defining the importance <strong>of</strong> kappa<br />
• Understand the difference between measuring agreement<br />
and measuring agreement beyond chance.<br />
• Understand the implications <strong>of</strong> different values <strong>of</strong> kappa.<br />
Calculating kappa<br />
• Understand the basics <strong>of</strong> how the kappa score is<br />
calculated.<br />
• Understand the importance <strong>of</strong> “chance agreement” in<br />
estimating kappa.<br />
Calculating chance agreement<br />
• Understand how to calculate the kappa score given different<br />
distributions <strong>of</strong> positive and negative results.<br />
• Understand that the more extreme the distributions <strong>of</strong><br />
positive and negative results, the greater the agreement<br />
that will occur by chance alone.<br />
• Understand how to calculate chance agreement, agreement<br />
beyond chance and kappa <strong>for</strong> any set <strong>of</strong> assessments<br />
by 2 observers.<br />
Tip 1: Defining the importance <strong>of</strong> kappa<br />
A common stumbling block <strong>for</strong> clinicians is the basic<br />
concept <strong>of</strong> agreement beyond chance and, in turn, the importance<br />
<strong>of</strong> correcting <strong>for</strong> chance agreement. People making<br />
a decision on the basis <strong>of</strong> presence or absence <strong>of</strong> an element<br />
<strong>of</strong> the physical examination, such as Murphy’s sign,<br />
will sometimes agree simply by chance. The kappa statistic<br />
corrects <strong>for</strong> this chance agreement and tells us how much<br />
<strong>of</strong> the possible agreement over and above chance the reviewers<br />
have achieved.<br />
Teachers of evidence-based medicine:<br />
See the “Tips for teachers” version of this article online<br />
at www.cmaj.ca/cgi/content/full/171/11/1369/DC1. It<br />
contains the exercises found in this article in fill-in-the-blank<br />
format, commentaries from the authors on the<br />
challenges they encounter when teaching these concepts<br />
to clinician learners and links to useful online resources.<br />
CMAJ • NOV. 23, 2004; 171 (11) 1369<br />
© 2004 Canadian Medical Association or its licensors<br />
A simple example should help to clarify the importance<br />
of correcting for chance agreement. Two radiologists independently<br />
read the same 100 mammograms. Reader 1 is<br />
having a bad day and reads all the films as negative without<br />
looking at them in great detail. Reader 2 reads the<br />
films more carefully and identifies 4 of the 100 mammograms<br />
as positive (suspicious for malignancy). How would<br />
you characterize the level of agreement between these 2<br />
radiologists?<br />
The percent agreement between them is 96%, even<br />
though one <strong>of</strong> the readers has, on cursory review, decided<br />
to call all <strong>of</strong> the results negative. Hence, measuring the<br />
simple percent agreement overestimates the degree <strong>of</strong> clinically<br />
important agreement in a fashion that is misleading.<br />
The role <strong>of</strong> kappa is to indicate how much the 2 observers<br />
agree beyond the level <strong>of</strong> agreement that could be expected<br />
by chance. Table 1 presents a rating system that is commonly<br />
used as a guideline <strong>for</strong> evaluating kappa scores.<br />
Purely to illustrate the range <strong>of</strong> kappa scores that readers<br />
can expect to encounter, Table 2 gives some examples <strong>of</strong><br />
commonly reported assessments and the kappa scores that<br />
resulted when investigators studied their reproducibility.<br />
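The mammogram example can be worked through numerically. The sketch below is illustrative only: it assumes an arbitrary choice of which 4 films reader 2 calls positive, and it estimates chance agreement from each reader's overall rate of positive calls, the approach the article develops under Tip 3.<br />

```python
# Readings for the 100 mammograms: True = positive, False = negative.
# Reader 1 calls every film negative; reader 2 calls 4 films positive.
# Which 4 films reader 2 flags is arbitrary here (an assumption for illustration).
reader_1 = [False] * 100
reader_2 = [True] * 4 + [False] * 96

n = len(reader_1)

# Simple percent agreement: fraction of films given the same reading.
agreements = sum(a == b for a, b in zip(reader_1, reader_2))
observed = agreements / n

# Agreement expected by chance, from each reader's overall count of positive calls.
positives_1 = sum(reader_1)
positives_2 = sum(reader_2)
chance = (positives_1 * positives_2 + (n - positives_1) * (n - positives_2)) / (n * n)

agreement_beyond_chance = observed - chance

print(f"percent agreement:       {observed:.0%}")
print(f"expected by chance:      {chance:.0%}")
print(f"agreement beyond chance: {agreement_beyond_chance:.0%}")
```

With these inputs the observed agreement and the agreement expected by chance coincide, so none of the 96% agreement lies beyond chance.<br />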
The bottom line<br />
If clinicians neglect the possibility <strong>of</strong> chance agreement,<br />
they will come to misleading conclusions about the reproducibility<br />
<strong>of</strong> clinical tests. The kappa statistic allows us to<br />
measure agreement above and beyond that expected by<br />
chance alone. Examples <strong>of</strong> kappa scores <strong>for</strong> frequently ordered<br />
tests sometimes show surprisingly poor levels <strong>of</strong><br />
agreement beyond chance.<br />
Table 1: Qualitative classification of kappa values as degree of agreement beyond chance 3<br />
Kappa value | Degree of agreement beyond chance<br />
0 | None<br />
0–0.2 | Slight<br />
0.2–0.4 | Fair<br />
0.4–0.6 | Moderate<br />
0.6–0.8 | Substantial<br />
0.8–1.0 | Almost perfect<br />
Table 2: Representative kappa values for common tests and clinical assessments<br />
Assessment | Kappa value<br />
Interpretation of T wave changes on an exercise stress test 4 | 0.25<br />
Presence of jugular venous distension 5 | 0.56<br />
Detection of alcohol dependence using CAGE questionnaire 6 | 0.75<br />
Presence of goitre 7 | 0.82–0.95<br />
Bone marrow interpretation by hematologist 8 | 0.84<br />
Straight leg raising test 9 | 0.82<br />
Diagnosis of pulmonary embolus by helical CT 10 | 0.82<br />
Diagnosis of lower extremity arterial disease by arteriography 11 | 0.39–0.64<br />
Tip 2: Calculating kappa<br />
What is the maximum potential <strong>for</strong> agreement between<br />
2 observers doing a clinical assessment, such as<br />
presence or absence <strong>of</strong> Murphy’s sign in patients with<br />
abdominal pain? In Fig. 1, the upper horizontal bar represents<br />
100% agreement between 2 observers. For the hypothetical<br />
situation represented in the figure, the estimated<br />
chance agreement between the 2 observers is 50%.<br />
This would occur if, <strong>for</strong> example, each <strong>of</strong> the 2 observers<br />
randomly called half <strong>of</strong> the assessments positive. Given<br />
this in<strong>for</strong>mation, what is the possible agreement beyond<br />
chance?<br />
The vertical line in Fig. 1 intersects the horizontal bars<br />
at the 50% point that we identified as the expected agreement<br />
by chance. All agreement to the right <strong>of</strong> this line corresponds<br />
to agreement beyond chance. Hence the maximum<br />
agreement beyond chance is 50% (100% – 50%).<br />
The other number you need to calculate the kappa score<br />
is the degree <strong>of</strong> agreement beyond chance. The observed<br />
agreement, as shown by the lower horizontal bar in Fig. 1,<br />
is 75%, so the degree <strong>of</strong> agreement beyond chance is 25%<br />
(75% – 50%).<br />
Kappa is calculated as the observed agreement beyond<br />
chance (25%) divided by the maximum agreement beyond<br />
chance (50%); here, kappa is 0.50.<br />
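The same calculation can be written as a single expression. The symbols p_o (observed agreement) and p_e (agreement expected by chance) are notation introduced here for convenience and do not appear in the article.<br />

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
       = \frac{0.75 - 0.50}{1 - 0.50}
       = \frac{0.25}{0.50}
       = 0.50
```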
[Fig. 1 diagram: agreement expected by chance, 50%; possible agreement above chance, 50%; observed agreement, 75%; observed agreement above chance, 25%; kappa = 25/50 = 0.5 (moderate agreement).]<br />
Fig. 1: Two observers independently assess the presence or<br />
absence <strong>of</strong> a finding or outcome. Each observer determines<br />
that the finding is present in exactly 50% <strong>of</strong> the subjects. Their<br />
assessments agree in 75% <strong>of</strong> the cases. The yellow horizontal<br />
bar represents potential agreement (100%), and the turquoise<br />
bar represents actual agreement. The portion <strong>of</strong> each coloured<br />
bar that lies to the left <strong>of</strong> the dotted vertical line represents the<br />
agreement expected by chance (50%). The observed agreement<br />
above chance is half <strong>of</strong> the possible agreement above<br />
chance. The ratio <strong>of</strong> these 2 numbers is the kappa score.
The bottom line<br />
Kappa allows us to measure agreement above and beyond<br />
that expected by chance alone. We calculate kappa by<br />
estimating the chance agreement and then comparing the<br />
observed agreement beyond chance with the maximum<br />
possible agreement beyond chance.<br />
Tip 3: Calculating chance agreement<br />
A conceptual understanding <strong>of</strong> kappa may still leave the<br />
actual calculations a mystery. The following example is intended<br />
<strong>for</strong> those who desire a more complete understanding<br />
<strong>of</strong> the kappa statistic.<br />
Let us assume that 2 hopeless clinicians are assessing the<br />
presence <strong>of</strong> Murphy’s sign in a group <strong>of</strong> patients. They<br />
have no idea what they are doing, and their evaluations are<br />
no better than blind guesses. Let us say they are each<br />
guessing the presence and absence <strong>of</strong> Murphy’s sign in a<br />
50:50 ratio: half the time they guess that Murphy’s sign is<br />
present, and the other half that it is absent. If you were<br />
completing a 2 × 2 table, with these 2 clinicians evaluating<br />
the same 100 patients, how would the cells, on average, get<br />
filled in?<br />
Fig. 2 represents the completed 2 × 2 table. Guessing at<br />
random, the 2 hopeless clinicians have agreed on the assessments<br />
<strong>of</strong> 50% <strong>of</strong> the patients. How did we arrive at the<br />
numbers shown in the table? According to the laws <strong>of</strong><br />
chance, each clinician guesses that half <strong>of</strong> the 50 patients<br />
assessed as positive by the other clinician (i.e., 25 patients)<br />
have Murphy’s sign.<br />
How would this exercise work if the same 2 hopeless<br />
clinicians were to randomly guess that 60% <strong>of</strong> the patients<br />
had a positive result <strong>for</strong> Murphy’s sign? Fig. 3 provides the<br />
answer in this situation. The clinicians would agree for 52<br />
of the 100 patients (or 52% of the time) and would disagree<br />
for 48 of the patients. In a similar way, using 2 × 2 tables<br />
for higher and higher positive proportions (i.e., how often<br />
the observer makes the diagnosis), you can figure out how<br />
often the observers will, on average, agree by chance alone<br />
(as delineated in Table 3).<br />
Clinician 2 | Clinician 1: sign present | Clinician 1: sign absent | Total<br />
Sign present | 25 | 25 | 50<br />
Sign absent | 25 | 25 | 50<br />
Total | 50 | 50 |<br />
Fig. 2: Agreement table for 2 hopeless clinicians who randomly<br />
guess whether Murphy’s sign is present or absent in 100 patients<br />
with abdominal pain. Each clinician determines that half<br />
of the patients have a positive result. The numbers in each box<br />
reflect the number of patients in each agreement category.<br />
At this point, we have demonstrated 2 things. First, even<br />
if the reviewers have no idea what they are doing, there will<br />
be substantial agreement by chance alone. Second, the<br />
magnitude of the agreement by chance increases as the<br />
proportion of positive (or negative) assessments moves further from 50%.<br />
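The chance-agreement percentages in Table 3 can be reproduced with a short calculation. The sketch below assumes, as in the article's examples, that both observers call the same percentage of patients positive and guess independently; the function name `chance_agreement` is illustrative only.<br />

```python
def chance_agreement(percent_positive):
    """Expected agreement (%) when 2 observers each call the same
    percentage of patients positive, independently and at random."""
    both_positive = percent_positive * percent_positive
    both_negative = (100 - percent_positive) * (100 - percent_positive)
    return (both_positive + both_negative) / 100


# Same positive-call rates as the rows of Table 3.
for percent_positive in (50, 60, 70, 80, 90):
    print(percent_positive, chance_agreement(percent_positive))
```

Because the expression is symmetric, a 10% positive rate gives the same chance agreement as a 90% rate, which is why the text refers to positive (or negative) assessments.<br />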
But how can we calculate kappa when the clinicians<br />
whose assessments are being compared are no longer<br />
“hopeless,” in other words, when their assessments reflect a<br />
level <strong>of</strong> expertise that one might actually encounter in practice?<br />
It’s not very hard.<br />
Let’s take a simple example, returning to the premise<br />
that each <strong>of</strong> the 2 clinicians assesses Murphy’s sign as being<br />
present in 50% <strong>of</strong> the patients. Here, we assume that<br />
the 2 clinicians now have some knowledge <strong>of</strong> Murphy’s<br />
sign and their assessments are no longer random. Each<br />
decides that 50% <strong>of</strong> the patients have Murphy’s sign and<br />
50% do not, but they still don’t agree on every patient.<br />
Rather, <strong>for</strong> 40 patients they agree that Murphy’s sign is<br />
present, and <strong>for</strong> 40 patients they agree that Murphy’s sign<br />
is absent. Thus, they agree on the diagnosis <strong>for</strong> 80% <strong>of</strong><br />
the patients, and they disagree <strong>for</strong> 20% <strong>of</strong> the patients<br />
(see Fig. 4A). How do we calculate the kappa score in this<br />
situation?<br />
Recall that if each clinician found that 50% <strong>of</strong> the patients<br />
had Murphy’s sign but their decision about the presence <strong>of</strong><br />
the sign in each patient was random, the clinicians would be<br />
in agreement 50% <strong>of</strong> the time, each cell <strong>of</strong> the 2 × 2 table<br />
would have 25 patients (as shown in Fig. 2), chance agreement<br />
would be 50%, and maximum agreement beyond<br />
chance would also be 50%.<br />
Clinician 2 | Clinician 1: sign present | Clinician 1: sign absent | Total<br />
Sign present | 36 | 24 | 60<br />
Sign absent | 24 | 16 | 40<br />
Total | 60 | 40 |<br />
Fig. 3: As in Fig. 2, the 2 clinicians again guess at random<br />
whether Murphy’s sign is present or absent. However, each<br />
clinician now guesses that the sign is present in 60 of the 100<br />
patients. Under these circumstances, of the 60 patients for<br />
whom clinician 1 guesses that the sign is present, clinician 2<br />
guesses that it is present in 60%; 60% of 60 is 36 patients. Of<br />
the 60 patients for whom clinician 1 guesses that the sign is<br />
present, clinician 2 guesses that it is absent in 40%; 40% of 60<br />
is 24 patients. Of the 40 patients for whom clinician 1 guesses<br />
that the sign is absent, clinician 2 guesses that it is present in<br />
60%; 60% of 40 is 24 patients. Of the 40 patients for whom<br />
clinician 1 guesses that the sign is absent, clinician 2 guesses<br />
that it is absent in 40%; 40% of 40 is 16 patients.<br />
The no-longer-hopeless clinicians’ agreement on 80%<br />
<strong>of</strong> the patients is there<strong>for</strong>e 30% above chance. Kappa is a<br />
comparison <strong>of</strong> the observed agreement above chance with<br />
the maximum agreement above chance: 30%/50% = 60%<br />
<strong>of</strong> the possible agreement above chance, which gives these<br />
clinicians a kappa <strong>of</strong> 0.6, as shown in Fig. 4B.<br />
A:<br />
Clinician 2 | Clinician 1: sign present | Clinician 1: sign absent<br />
Sign present | 40 | 10<br />
Sign absent | 10 | 40<br />
B:<br />
Clinician 2 | Clinician 1: sign present | Clinician 1: sign absent | Total<br />
Sign present | 40 (25) | 10 (25) | 50<br />
Sign absent | 10 (25) | 40 (25) | 50<br />
Total | 50 | 50 |<br />
κ = (observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)<br />
= (80% – 50%) ÷ (100% – 50%)<br />
= 30% ÷ 50%<br />
= 0.6<br />
Fig. 4: Two clinicians who have been trained to assess Murphy’s<br />
sign in patients with abdominal pain do an actual assessment<br />
on 100 patients. A: A 2 × 2 table reflecting actual agreement<br />
between the 2 clinicians. B: A 2 × 2 table illustrating the<br />
correct approach to determining the kappa score. The numbers<br />
in parentheses correspond to the results that would be expected<br />
were each clinician randomly guessing that half of the<br />
patients had a positive result (as in Fig. 2).<br />
Table 3: Chance agreement when 2 observers randomly assign positive and negative results, for successively higher rates of a positive call<br />
Proportion positive (%) | Agreement by chance (%)<br />
50 | 50<br />
60 | 52<br />
70 | 58<br />
80 | 68<br />
90 | 82<br />
Formula for calculating kappa<br />
(Observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)<br />
Another way of expressing this formula:<br />
(Observed agreement beyond chance) ÷ (maximum possible agreement beyond chance)<br />
Hence, to calculate kappa when only 2 alternatives are<br />
possible (e.g., presence or absence <strong>of</strong> a finding), you need<br />
just 2 numbers: the percentage <strong>of</strong> patients that the 2 assessors<br />
agreed on and the expected agreement by chance.<br />
Both can be determined by constructing a 2 × 2 table exactly<br />
as illustrated above.<br />
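The whole procedure can be sketched as one function that takes the 4 cells of a 2 × 2 agreement table. This is a minimal illustration, not a validated statistical routine; the function name `kappa_from_2x2` and its parameter names are hypothetical, and exact fractions are used only to avoid rounding noise in the printed result.<br />

```python
from fractions import Fraction


def kappa_from_2x2(both_present, only_1_present, only_2_present, both_absent):
    """Kappa for 2 observers rating the same patients as present or absent.

    both_present / both_absent: patients on whom the observers agree.
    only_1_present / only_2_present: patients only one observer calls positive.
    Undefined (division by zero) if chance agreement is 100%.
    """
    total = both_present + only_1_present + only_2_present + both_absent

    # Observed agreement: the 2 cells where the observers agree.
    observed = Fraction(both_present + both_absent, total)

    # Agreement expected by chance, from each observer's rate of positive calls.
    positive_1 = Fraction(both_present + only_1_present, total)
    positive_2 = Fraction(both_present + only_2_present, total)
    chance = positive_1 * positive_2 + (1 - positive_1) * (1 - positive_2)

    # Observed agreement beyond chance over maximum agreement beyond chance.
    return (observed - chance) / (1 - chance)


# Cell counts taken from Fig. 4A (trained clinicians) and Fig. 3 (random guessing).
print(float(kappa_from_2x2(40, 10, 10, 40)))
print(float(kappa_from_2x2(36, 24, 24, 16)))
```

Applied to the counts in Fig. 4A this reproduces the kappa of 0.6 derived above, and applied to the random-guess table in Fig. 3 it gives a kappa of 0, as expected when all agreement is due to chance.<br />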
The bottom line<br />
Chance agreement is not always 50%; rather, it varies<br />
from one clinical situation to another. When the prevalence<br />
<strong>of</strong> a disease or outcome is low, 2 observers will guess<br />
that most patients are normal and the symptom <strong>of</strong> the disease<br />
is absent. This situation will lead to a high percentage<br />
<strong>of</strong> agreement simply by chance. When the prevalence is<br />
high, there will also be high apparent agreement, with most<br />
patients judged to exhibit the symptom. Kappa measures<br />
the agreement after correcting <strong>for</strong> this variable degree <strong>of</strong><br />
chance agreement.<br />
Conclusions<br />
Armed with this understanding of kappa as a measure of agreement between different observers, you are able to return to the study of agreement in chest radiography interpretations between emergency physicians and radiologists¹ in a more informed fashion. You learn from the abstract that the kappa score for overall agreement between the 2 classes of practitioners was 0.40, with a 95% confidence interval ranging from 0.35 to 0.46. This means that the agreement between emergency physicians and radiologists represented 40% of the potentially achievable agreement beyond chance. You understand that this kappa score would be conventionally considered to represent fair to moderate agreement but is inferior to many of the kappa values listed in Table 2. You are now much more confident about going to the full text of the article to review the methods and assess the clinical applicability of the results to your own patients.<br />
The ability to understand measures of variability in data presented in clinical trials and systematic reviews is an important skill for clinicians. We have presented a series of tips developed and used by experienced teachers of evidence-based medicine for the purpose of facilitating such understanding.<br />
This article has been peer reviewed.<br />
From the Department of Medicine, Division of General Internal Medicine (McGinn), and the Department of Geriatrics (Leipzig), Mount Sinai Medical Center, New York, NY; the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Departments of Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)<br />
Competing interests: None declared.<br />
Contributors: Thomas McGinn developed the original idea for tips 1 and 2 and, as principal author, oversaw and contributed to the writing of the manuscript. Thomas Newman and Rosanne Leipzig reviewed the manuscript at all phases of development and contributed to the writing as coauthors. Sheri Keitz used all of the tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are described in the article. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence with format specifications. Gordon Guyatt developed the original idea for tip 3, reviewed the manuscript at all phases of development, contributed to the writing as a coauthor, and, as general editor, reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content.<br />
References<br />
1. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs in the emergency department: Is the radiologist really necessary? Postgrad Med J 2003;79:214-7.<br />
2. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series [editorial]. CMAJ 2004;171(4):347-8.<br />
3. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol 1987;126:161-9.<br />
4. Blackburn H. The exercise electrocardiogram: differences in interpretation. Report of a technical group on exercise electrocardiography. Am J Cardiol 1968;21:871-80.<br />
5. Cook DJ. Clinical assessment of central venous pressure in the critically ill. Am J Med Sci 1990;299:175-8.<br />
6. Aertgeerts B, Buntinx F, Fevery J, Ansoms S. Is there a difference between CAGE interviews and written CAGE questionnaires? Alcohol Clin Exp Res 2000;24:733-6.<br />
7. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement in two general practices in Great Britain. BMJ 1963;1:29-34.<br />
8. Guyatt GH, Patterson C, Ali M, Singer J, Levine M, Turpie I, et al. Diagnosis of iron-deficiency anemia in the elderly. Am J Med 1990;88:205-9.<br />
9. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award in clinical sciences. Reproducibility of physical signs in low-back pain. Spine 1989;14:908-18.<br />
10. Perrier A, Howarth N, Didier D, Loubeyre P, Unger PF, de Moerloose P, et al. Performance of helical computed tomography in unselected outpatients with suspected pulmonary embolism. Ann Intern Med 2001;135:88-97.<br />
11. Koelemay MJ, Legemate DA, Reekers JA, Koedam NA, Balm R, Jacobs MJ. Interobserver variation in interpretation of arteriography and management of severe lower leg arterial disease. Eur J Vasc Endovasc Surg 2001;21:417-22.<br />
Articles to date in this series<br />
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />
Pelham NY 10803, USA; fax 914 738-9368; pwyer@att.net<br />
Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa<br />
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.<br />
Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.<br />
CMAJ NOV. 23, 2004; 171 (11) 1373
DOI:10.1503/cmaj.1031920<br />
Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results<br />
Rose Hatala, Sheri Keitz, Peter Wyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group<br />
Clinicians wishing to quickly answer a clinical question may seek a systematic review, rather than searching for primary articles. Such a review is also called a meta-analysis when the investigators have used statistical techniques to combine results across studies. Databases useful for this purpose include the Cochrane Library (www.thecochranelibrary.com) and the ACP Journal Club (www.acpjc.org; use the search term “review”), both of which are available through personal or institutional subscription. Clinicians can use systematic reviews to guide clinical practice if they are able to understand and interpret the results.<br />
Systematic reviews differ from traditional reviews in that they are usually confined to a single focused question, which serves as the basis for systematic searching, selection and critical evaluation of the relevant research.¹ Authors of systematic reviews use explicit methods to minimize bias and consider using statistical techniques to combine the results of individual studies. When appropriate, such pooling allows a more precise estimate of the magnitude of benefit or harm of a therapy. It may also increase the applicability of the result to a broader range of patient populations.<br />
Clinicians encountering a meta-analysis frequently find the pooling process mysterious. Specifically, they wonder how authors decide whether the ranges of patients, interventions and outcomes are too broad to sensibly pool the results of the primary studies.<br />
In this article we present an approach to evaluating potentially important differences in the results of individual studies being considered for a meta-analysis. These differences are frequently referred to as heterogeneity.¹ Our discussion focuses on the qualitative, rather than the statistical, assessment of heterogeneity (see Box 1).<br />
Two concepts are commonly implied in the assessment of heterogeneity. The first is an assessment for heterogeneity within 4 key elements of the design of the original studies: the patients, interventions, outcomes and methods. This assessment bears on the question of whether pooling the results is at all sensible. The second concept relates to assessing heterogeneity among the results of the original studies. Even if the study designs are similar, the researchers must decide whether it is useful to combine the primary studies’ results. Our discussion assumes a basic familiarity with how investigators present the magnitude²,³ and precision⁴ of treatment effects in individual randomized trials.<br />
The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians.¹,⁵,⁶ A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/172/5/661/DC1.<br />
CMAJ • MAR. 1, 2005; 172 (5) 661<br />
© 2005 CMA Media Inc. or its licensors<br />
Review<br />
Clinician learners’ objectives<br />
Qualitative assessment of the design of primary studies<br />
• Understand the concepts of heterogeneity of study design among the individual studies included in a systematic review.<br />
Qualitative assessment of the results of primary studies<br />
• Understand how to qualitatively determine the appropriateness of pooling estimates of effect from the individual studies by assessing (1) the degree of overlap of the confidence intervals around these point estimates of effect and (2) the disparity between the point estimates themselves.<br />
• Understand how to estimate the “true” value of the estimate of effect from a graphic display of the results of individual studies.<br />
Teachers of evidence-based medicine:<br />
See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/172/5/661/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.<br />
Box 1: Statistical assessments of heterogeneity<br />
Meta-analysts typically use 2 statistical approaches to evaluate the extent of variability in results between studies: Cochran’s Q test and the I² statistic.<br />
Cochran’s Q test<br />
• Cochran’s Q test is the traditional test for heterogeneity. It begins with the null hypothesis that all of the apparent variability is due to chance. That is, the true underlying magnitude of effect (whether measured with a relative risk, an odds ratio or a risk difference) is the same across studies.<br />
• The test then generates a probability, based on a χ² distribution, that differences in results between studies as extreme as or more extreme than those observed could occur simply by chance.<br />
• If the p value is low (say, less than 0.1) investigators should look hard for possible explanations of variability in results between studies (including differences in patients, interventions, measurement of outcomes and study design).<br />
• As the p value gets very low (less than 0.01) we may be increasingly uncomfortable about using single best estimates of treatment effects.<br />
• The traditional test for heterogeneity is limited, in that it may be underpowered (when studies have included few patients it may be difficult to reject the null hypothesis even if it is false) or overpowered (when sample sizes are very large, small and unimportant differences in magnitude of effect may nevertheless generate low p values).<br />
I² statistic<br />
• The I² statistic, the second approach to measuring heterogeneity, attempts to deal with potential underpowering or overpowering. I² provides an estimate of the percentage of variability in results across studies that is likely due to true differences in treatment effect, as opposed to chance.<br />
• When I² is 0%, chance provides a satisfactory explanation for the variability we have observed, and we are more likely to be comfortable with a single pooled estimate of treatment effect.<br />
• As I² increases, we get increasingly uncomfortable with a single pooled estimate, and the need to look for explanations of variability other than chance becomes more compelling.<br />
• For example, one rule of thumb characterizes I² of less than 0.25 (25%) as low heterogeneity, 0.25 to 0.5 (25% to 50%) as moderate heterogeneity and over 0.5 (50%) as high heterogeneity.<br />
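Box 1 describes the 2 statistics without giving formulas. The sketch below uses the commonly cited inverse-variance definitions, which are an assumption added here rather than something stated in the article: Q is the weighted sum of squared deviations of each study's effect from the pooled effect, and I² = (Q – degrees of freedom) ÷ Q, set to 0 when negative. The effect estimates and standard errors are hypothetical, and the p value is left out because it needs a χ² distribution routine.

```python
def cochran_q_and_i2(effects, standard_errors):
    """Return (Q, degrees of freedom, I-squared as a proportion).

    effects: one effect estimate per study (e.g., log relative risks).
    standard_errors: the standard error of each estimate.
    Assumes the usual inverse-variance weighting; illustration only.
    """
    if len(effects) != len(standard_errors) or len(effects) < 2:
        raise ValueError("need matching lists for at least 2 studies")

    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Q: weighted squared deviations of each study from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # I-squared: share of variability beyond what chance would explain.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Hypothetical data: 4 consistent studies and 4 disparate studies.
ses = [0.10, 0.12, 0.15, 0.11]
q_consistent, df_consistent, i2_consistent = cochran_q_and_i2(
    [-0.30, -0.25, -0.35, -0.28], ses)
q_disparate, df_disparate, i2_disparate = cochran_q_and_i2(
    [-0.80, -0.10, 0.40, -0.50], ses)
print(f"consistent: Q = {q_consistent:.2f}, I2 = {i2_consistent:.0%}")
print(f"disparate:  Q = {q_disparate:.2f}, I2 = {i2_disparate:.0%}")
```

In the consistent set Q falls below its degrees of freedom, so I² is 0 and chance is a satisfactory explanation; in the disparate set I² is well above the 0.5 rule-of-thumb threshold for high heterogeneity.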
Tip 1: Qualitative assessment of the design of primary studies<br />
Consider the following 3 hypothetical systematic reviews. For which of these systematic reviews does it make sense to combine the primary studies?<br />
• A systematic review of all therapies for all types of cancer, intended to generate a single estimate of the impact of these therapies on mortality.<br />
• A systematic review that examines the effect of different antibiotics, such as tetracyclines, penicillins and chloramphenicol, on improvement in peak expiratory flow rates and days of illness in patients with acute exacerbation of obstructive lung disease, including chronic bronchitis and emphysema.⁷<br />
• A systematic review of the effectiveness of tissue plasminogen activator (tPA) compared with no treatment or placebo in reducing mortality among patients with acute myocardial infarction.⁸<br />
Most clinicians would instinctively reject the first of these proposed reviews as overly broad but would be comfortable with the idea of combining the results of trials relevant to the third question. What about the second review? What aspects of the primary studies must be similar to justify combining their results in this systematic review?<br />
Table 1 lists features that would be relevant to the question considered in the second review and categorizes them according to the 4 key elements of study design: the patients, interventions, outcomes and methods of the primary studies. Combining results is appropriate when the biology is such that across the range of patients, interventions, outcomes and study methods, one can anticipate more or less the same magnitude of treatment effect.<br />
In other words, the judgement as to whether the primary studies are similar enough to be combined in a systematic review is based on whether the underlying pathophysiology would predict a similar treatment effect across the range of patients, interventions, outcomes and study methods of the primary studies. If you think back to the first systematic review (all therapies for all cancers), you probably recognize that there is significant variability in the pathophysiology of different cancers (“patients” in Table 1) and in the mechanisms of action of different cancer therapies (“interventions” in Table 1).<br />
Table 1: Relevant features of study design to be considered when deciding whether to pool studies in a systematic review (for a review examining the effect of antibiotics in patients with obstructive lung disease)<br />
Patients: patient age; patient sex; type of lung disease (e.g., emphysema, chronic bronchitis)<br />
Interventions: same antibiotic in all studies; same class of antibiotic in all studies; comparison of antibiotic with placebo; comparison of one antibiotic with another<br />
Outcomes: death; peak expiratory flow; forced expiratory volume in the first second<br />
Study methods: all randomized trials; only blinded randomized trials; cohort studies<br />
If you were inclined to reject pooling the results of the studies to be considered in the second systematic review, you might have reasoned that we would expect substantially different effects with different antibiotics, different infecting agents or different underlying lung pathology. If you were inclined to accept pooling of results in this review, you might argue that the antibiotics used in the different studies are all effective against the most common organisms underlying pulmonary exacerbations. You might also assert that the biology of an acute exacerbation of an obstructive lung disease (e.g., inflammation) is similar, despite variability in the underlying pathology. In other words, we would expect more or less the same effect across agents and across patients.<br />
Finally, you probably accepted the validity of pooling results for the third systematic review (tPA for myocardial infarction) because you consider that the mechanism of myocardial infarction is relatively constant across a broad range of patients.<br />
The bottom line<br />
• Similarity in the aspects of primary study design outlined in Table 1 (patients, interventions, outcomes, study methods) guides the decision as to whether it makes sense to combine the results of primary studies in a systematic review.<br />
• The range of characteristics of the primary studies across which it is sensible to combine results is a matter of judgment based on the researcher’s understanding of the underlying biology of the disease.<br />
Tip 2: Qualitative assessment of the results of primary studies<br />
You should now understand that combining the results of different studies is sensible only when we expect more or less the same magnitude of treatment effects across the range of patients, interventions and outcomes that the investigators have included in their systematic review. However, even when we are confident of the similarity in design among the individual studies, we may still wonder whether the results of the studies should be pooled. The following graphic demonstration shows how to qualitatively assess the results of the primary studies to decide if meta-analysis (i.e., statistical pooling) is appropriate. You can find discussions of quantitative, or statistical, approaches to the assessment of heterogeneity elsewhere (see Box 1 or Higgins and associates⁹).<br />
Consider the results of the studies in 2 hypothetical systematic reviews (Fig. 1A and Fig. 1B). The central vertical line, labelled “no difference,” represents a treatment effect of 0. This would be equivalent to a risk ratio or relative risk of 1 or an absolute or relative risk reduction of 0.² Values to the left of the “no difference” line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies represented in the figures, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect. For which systematic review does it make sense to combine results? Decide on the answer to this question before you read on.<br />
You have probably concluded that pooling is appropriate for the studies represented in Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference” line, whereas those for the studies in Fig. 1B lie on the same side of the “no difference” line?<br />
Fig. 1 (panels A and B; the horizontal axis runs from “Favours new treatment” through “No difference” to “Favours control”): Results of the studies in 2 hypothetical systematic reviews. The central vertical line represents a treatment effect of 0. Values to the left of this line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies in each figure, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect.<br />
Fig. 2 (same horizontal axis): Point estimates and confidence intervals for 4 studies. Two of the point estimates favour the new treatment, and the other 2 point estimates favour the control. Investigators doing a systematic review with these 4 studies would be satisfied that it is appropriate to pool the results.<br />
Fig. 3 (same horizontal axis; the bottom row is labelled “Pooled estimate of underlying effect”): Results of the hypothetical systematic review presented in Fig. 1B. The pooled estimate at the bottom of the chart (large diamond) provides the best guess as to the underlying treatment effect. It is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials.<br />
Before you answer this question, consider the studies represented in Fig. 2. Here, the point estimates of 2 studies are on the “favours new treatment” side of the “no difference” line, and the point estimates of 2 other studies are on the “favours control” side. However, all 4 point estimates are very close to the “no difference” line, and, in this case, investigators doing a systematic review will be satisfied that it is appropriate to pool the results. Therefore, it is not the position of the point estimates relative to the “no difference” line that determines the appropriateness of pooling.<br />
There are 2 criteria for not combining the results of studies in a meta-analysis: highly disparate point estimates and confidence intervals with little overlap, both of which are exemplified by Fig. 1A. When pooling is appropriate on the basis of these criteria, where is the best estimate of the underlying magnitude of effect likely to be? Look again at Fig. 1B and make a guess. Now look at Fig. 3.<br />
The pooled estimate at the bottom of Fig. 3 is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials. It provides our best guess as to the underlying treatment effect. Of course, we cannot actually know the “truth” and must be content with potentially misleading estimates. The intent of a meta-analysis is to include enough studies to narrow the confidence interval around the resulting pooled estimate sufficiently to provide estimates of benefit for our patients in which we can be confident. Thus, our best estimate of the truth will lie in the area of overlap among the confidence intervals around the point estimates of treatment effect presented in the primary studies.<br />
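The 2 visual checks, and the midpoint heuristic for locating the pooled estimate, can be mimicked with a short sketch. Everything here is illustrative: the function name and the study values are hypothetical, and requiring a region shared by all confidence intervals is a stricter simplification of the “degree of overlap” a reader judges by eye. Meta-analyses compute the pooled estimate as a weighted average of the study results, so the midpoint below is only a rough visual guide.

```python
def assess_pooling(studies):
    """Summarize disparity and overlap for a list of study results.

    studies: list of (point_estimate, ci_lower, ci_upper) tuples.
    Returns (spread of point estimates, shared overlap or None,
    midpoint of the shared overlap or None).
    """
    estimates = [point for point, _, _ in studies]
    spread = max(estimates) - min(estimates)

    # The region common to every confidence interval, if there is one.
    overlap_low = max(lower for _, lower, _ in studies)
    overlap_high = min(upper for _, _, upper in studies)
    if overlap_low > overlap_high:
        return spread, None, None

    midpoint = (overlap_low + overlap_high) / 2
    return spread, (overlap_low, overlap_high), midpoint


# Hypothetical values loosely resembling Fig. 1B (similar results) ...
similar = [(-0.30, -0.55, -0.05), (-0.25, -0.45, -0.05),
           (-0.35, -0.70, 0.00), (-0.28, -0.50, -0.06)]
# ... and Fig. 1A (disparate estimates, little overlap).
disparate = [(-0.80, -1.00, -0.60), (-0.10, -0.30, 0.10),
             (0.40, 0.20, 0.60), (-0.50, -0.70, -0.30)]

spread_b, overlap_b, midpoint_b = assess_pooling(similar)
spread_a, overlap_a, midpoint_a = assess_pooling(disparate)
print("similar studies:", spread_b, overlap_b, midpoint_b)
print("disparate studies:", spread_a, overlap_a, midpoint_a)
```

For the similar studies the intervals share a common region and the point estimates differ little, so pooling looks reasonable; for the disparate studies no region is shared by all 4 intervals and the point estimates are far apart, which matches the 2 criteria for not pooling.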
What is the clinician to do when presented with results such as those in Fig. 1A? If the investigators have done a good job of planning and executing the meta-analysis, they will provide some assistance.⁶ Before examining the study results in detail, they will have generated a priori hypotheses to explain the heterogeneity in magnitude of effect across studies that they are liable to encounter. These hypotheses will include differences in patients (effects may be larger in sicker patients), in interventions (larger doses may result in larger effects), in outcomes (longer follow-up may diminish the magnitude of effect) and in study design (methodologically weaker studies may generate larger effects).<br />
The investigators will then have examined the extent to which these hypotheses can explain the differences in magnitude of effect across studies. These subgroup analyses may be misleading, but if they meet 7 criteria suggested elsewhere¹⁰ (see Box 2), they may provide credible and satisfying explanations for the variability in results.<br />
The bottom line<br />
• Readers can decide for themselves whether there is clinically important heterogeneity among the results of primary studies through a qualitative assessment of the graphic results. This assessment is based on the amount of disparity among the individual point estimates and the degree of overlap among the confidence intervals.<br />
Box 2: Questions to ask when evaluating a subgroup analysis in a meta-analysis¹⁰<br />
• Was the subgroup comparison based on a within-study, rather than a between-study, comparison?<br />
• Is the magnitude of the difference in effect between subgroups large?<br />
• Is the effect consistent across studies?<br />
• Is the difference in effect statistically significant?<br />
• Was the subgroup analysis planned in advance by the trialists?<br />
• Were many subgroup analyses performed and selectively reported?<br />
• Is the difference in effect in the subgroup supported by a biological hypothesis?<br />
Conclusions<br />
Understanding the concept <strong>of</strong> heterogeneity in a systematic<br />
review or meta-analysis is central to a full appreciation<br />
<strong>of</strong> the implications <strong>of</strong> such reviews <strong>for</strong> clinical practice.<br />
We have presented 2 tips aimed at helping clinical readers<br />
overcome commonly encountered difficulties in understanding<br />
this concept.<br />
This article has been peer reviewed.<br />
From the Department <strong>of</strong> <strong>Medicine</strong>, University <strong>of</strong> British Columbia, Vancouver, BC<br />
(Hatala); Durham Veterans Affairs Medical Center and Duke University Medical<br />
Center, Durham, NC (Keitz); the Columbia University College <strong>of</strong> Physicians and<br />
Surgeons, New York, NY (Wyer); and the Departments <strong>of</strong> <strong>Medicine</strong> and <strong>of</strong> Clinical<br />
Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)<br />
Competing interests: None declared.<br />
Contributors: Rose Hatala modified the original ideas <strong>for</strong> tips 1 and 2, drafted the<br />
manuscript, coordinated input from reviewers and field-testing, and revised all drafts.<br />
Sheri Keitz used all <strong>of</strong> the tips as part <strong>of</strong> a live teaching exercise and submitted comments,<br />
suggestions and the possible variations that are described in the article. Peter<br />
Wyer reviewed and revised the final draft <strong>of</strong> the manuscript to achieve uni<strong>for</strong>m adherence<br />
with <strong>for</strong>mat specifications. Gordon Guyatt developed the original ideas <strong>for</strong><br />
tips 1 and 2, reviewed the manuscript at all phases <strong>of</strong> development, contributed to<br />
the writing as a coauthor, and, as general editor, reviewed and revised the final draft<br />
<strong>of</strong> the manuscript to achieve accuracy and consistency <strong>of</strong> content.<br />
References<br />
1. Oxman A, Guyatt G, Cook D, Montori V. Summarizing the evidence. In:<br />
Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual <strong>for</strong><br />
evidence-based clinical practice. Chicago: AMA Press; 2002. p. 155-73.<br />
2. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al, <strong>for</strong> the<br />
<strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group. <strong>Tips</strong> <strong>for</strong> learners<br />
<strong>of</strong> evidence-based medicine: 1. Relative risk reduction, absolute risk reduction<br />
and number needed to treat. CMAJ 2004;171(4):353-8.<br />
3. Guyatt G, Cook D, Devereaux PJ, Meade M, Straus S. Therapy. In: Guyatt<br />
G, Rennie D, editors. Users’ guides to the medical literature: a manual <strong>for</strong> evidence-based<br />
clinical practice. Chicago: AMA Press; 2002. p. 55-79.<br />
4. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al,<br />
<strong>for</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group. <strong>Tips</strong> <strong>for</strong><br />
learners <strong>of</strong> evidence-based medicine: 2. Measures <strong>of</strong> precision (confidence intervals).<br />
CMAJ 2004;171(6):611-5.<br />
5. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />
<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series.<br />
CMAJ 2004;171(4):347-8.<br />
6. Montori V, Hatala R, Guyatt G. Summarizing the evidence: evaluating differences<br />
in study results. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature:<br />
a manual <strong>for</strong> evidence-based clinical practice. Chicago: AMA Press; 2002. p. 547-52.<br />
7. Saint S, Bent S, Vittingh<strong>of</strong>f E, Grady D. Antibiotics in chronic obstructive<br />
pulmonary disease exacerbations. JAMA 1995;273:957-60.<br />
8. Held PH, Teo KK, Yusuf S. Effects <strong>of</strong> tissue-type plasminogen activator and<br />
anisoylated plasminogen streptokinase activator complex on mortality in acute<br />
myocardial infarction. Circulation 1990;82:1668-74.<br />
9. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency<br />
in meta-analyses. BMJ 2003;327:557-60.<br />
10. Oxman A, Guyatt G. When to believe a subgroup analysis. In: Guyatt G,<br />
Rennie D, editors. Users’ guides to the medical literature: a manual for evidence-based<br />
clinical practice. Chicago: AMA Press; 2002. p. 553-65.<br />
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />
Pelham NY 10804; fax 914 738-9368; pwyer@att.net<br />
Members <strong>of</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working<br />
Group: Peter C. Wyer (project director), College <strong>of</strong> Physicians and<br />
Surgeons, Columbia University, New York, NY; Deborah Cook,<br />
Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke,<br />
McMaster University, Hamilton, Ont.; Rose Hatala (internal<br />
review coordinator), University <strong>of</strong> British Columbia, Vancouver,<br />
BC; Robert Hayward (editor, online version), Bruce Fisher,<br />
University <strong>of</strong> Alberta, Edmonton, Alta.; Sheri Keitz (field test<br />
coordinator), Durham Veterans Affairs Medical Center and Duke<br />
University Medical Center, Durham, NC; Alexandra Barratt,<br />
University <strong>of</strong> Sydney, Sydney, Australia; Pamela Charney, Albert<br />
Einstein College <strong>of</strong> <strong>Medicine</strong>, Bronx, NY; Antonio L. Dans,<br />
University <strong>of</strong> the Philippines College <strong>of</strong> <strong>Medicine</strong>, Manila, The<br />
Philippines; Barnet Eskin, Morristown Memorial Hospital,<br />
Morristown, NJ; Jennifer Kleinbart, Emory University School <strong>of</strong><br />
<strong>Medicine</strong>, Atlanta, Ga.; Hui Lee, <strong>for</strong>merly Group Health Centre,<br />
Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas<br />
McGinn, Mount Sinai Medical Center, New York, NY; Victor M.<br />
Montori, Mayo Clinic College <strong>of</strong> <strong>Medicine</strong>, Rochester, Minn.;<br />
Virginia Moyer, University <strong>of</strong> Texas, Houston, Tex.; Thomas B.<br />
Newman, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco,<br />
Calif.; Jim Nishikawa, University <strong>of</strong> Ottawa, Ottawa, Ont.;<br />
Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain;<br />
W. Scott Richardson, Wright State University, Dayton, Ohio; Mark<br />
C. Wilson, University <strong>of</strong> Iowa, Iowa City, Iowa<br />
Articles to date in this series<br />
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S,<br />
et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine: 1.<br />
Relative risk reduction, absolute risk reduction and<br />
number needed to treat. CMAJ 2004;171(4):353-8.<br />
Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC,<br />
Moyer V, et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />
2. Measures <strong>of</strong> precision (confidence intervals).<br />
CMAJ 2004;171(6):611-5.<br />
McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt<br />
G, et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />
3. Measures <strong>of</strong> observer variability (kappa statistic).<br />
CMAJ 2004;171(11):1369-73.<br />
CMAJ MAR. 1, 2005; 172 (5) 665
DOI:10.1503/cmaj.1031666<br />
<strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />
5. The effect <strong>of</strong> spectrum <strong>of</strong> disease on the<br />
per<strong>for</strong>mance <strong>of</strong> diagnostic tests<br />
Victor M. Montori, Peter Wyer, Thomas B. Newman, Sheri Keitz, Gordon Guyatt,<br />
<strong>for</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group<br />
For clinicians to use a diagnostic test in clinical practice,<br />
they need to know how well the test distinguishes<br />
between those who have the suspected disease<br />
or condition and those who do not. If investigators<br />
choose clinically inappropriate populations <strong>for</strong> their study<br />
<strong>of</strong> a diagnostic test and thereby introduce what is sometimes<br />
called spectrum bias, the results may seriously mislead<br />
clinicians.<br />
In this article we present a series <strong>of</strong> examples that illustrate<br />
why clinicians need to pay close attention to the populations<br />
enrolled in studies <strong>of</strong> diagnostic test per<strong>for</strong>mance<br />
be<strong>for</strong>e they apply the results <strong>of</strong> those studies to their own<br />
patients. After working through these examples, you should<br />
understand which characteristics <strong>of</strong> a study population are<br />
likely to result in misleading interpretations <strong>of</strong> test results<br />
and which are not.<br />
The tips in this article are adapted from approaches developed<br />
by educators with experience in teaching evidence-based<br />
medicine principles to clinicians. 1,2 A related article,<br />
intended for people who teach these concepts to clinicians,<br />
is available online at www.cmaj.ca/cgi/content/full/173/4/385/DC1.<br />
Clinician learners’ objectives<br />
“Ideal” spectrum <strong>of</strong> disease<br />
• Understand the importance <strong>of</strong> spectrum <strong>of</strong> disease in<br />
the evaluation <strong>of</strong> diagnostic test characteristics.<br />
Prevalence, spectrum and test characteristics<br />
• Understand the lack <strong>of</strong> impact <strong>of</strong> disease prevalence on<br />
sensitivity, specificity and likelihood ratios.<br />
• Understand the impact <strong>of</strong> disease prevalence or likelihood<br />
on the probability <strong>of</strong> the target condition (posttest<br />
probability) after test results are available.<br />
Tip 1: “Ideal” spectrum <strong>of</strong> disease<br />
Let’s consider a clinical example that illustrates the concept<br />
<strong>of</strong> “disease spectrum” in relation to diagnostic tests.<br />
CMAJ • AUG. 16, 2005; 173 (4) 385<br />
© 2005 CMA Media Inc. or its licensors<br />
Review<br />
Synthèse<br />
Brain natriuretic peptide (BNP) is a hormone secreted by<br />
the ventricles in the heart in response to expansion. Plasma<br />
levels <strong>of</strong> BNP increase when acute or chronic congestive<br />
heart failure is present. Consequently, investigators have<br />
suggested using BNP levels to distinguish congestive heart<br />
failure from other causes <strong>of</strong> acute dyspnea among patients<br />
presenting to emergency departments. 3<br />
One highly publicized study reported promising results<br />
using a BNP cut<strong>of</strong>f point <strong>of</strong> 100 pg/mL. 4,5 This cut<strong>of</strong>f point<br />
means that patients with BNP levels greater than<br />
100 pg/mL are considered to have a “positive” test result<br />
<strong>for</strong> congestive heart failure and those with levels below this<br />
threshold are considered to have a “negative” test result.<br />
The investigators compared the number <strong>of</strong> diagnoses <strong>of</strong><br />
congestive heart failure using BNP levels with those using a<br />
criterion standard (or “gold standard”) defined by established<br />
clinical and imaging criteria. Commentaries have<br />
challenged the investigators’ estimates <strong>of</strong> the sensitivity and<br />
specificity <strong>of</strong> the BNP test at the proposed cut<strong>of</strong>f point on<br />
the basis that clinicians were already confident with respect<br />
to the likelihood <strong>of</strong> congestive heart failure in most <strong>of</strong> the<br />
patients in the study. 6,7<br />
Ideally, the ability <strong>of</strong> a test to correctly identify patients<br />
with and without a particular disease would not vary between<br />
patients. However, if you are a clinician, you already<br />
intuitively understand that a test may per<strong>for</strong>m better when<br />
it is used to evaluate patients with more severe disease than<br />
it would with patients whose disease is less advanced and<br />
less obvious. You also appreciate that diagnostic tests are<br />
not needed when the disease is either clinically obvious or<br />
sufficiently unlikely that you need not seriously consider it.<br />
Teachers <strong>of</strong> evidence-based medicine:<br />
See the “<strong>Tips</strong> <strong>for</strong> teachers” version <strong>of</strong> this article online<br />
at www.cmaj.ca/cgi/content/full/173/4/385/DC1. It<br />
contains the exercises found in this article in fill-in-the-blank<br />
format, commentaries from the authors on the<br />
challenges they encounter when teaching these<br />
concepts to clinician learners and links to useful online<br />
resources.<br />
Montori et al<br />
A study <strong>of</strong> the per<strong>for</strong>mance <strong>of</strong> a diagnostic test involves<br />
per<strong>for</strong>ming that test on patients with and without the disease<br />
or condition <strong>of</strong> interest together with a second test or<br />
investigation that we will call the “criterion standard.” We<br />
accept the results <strong>of</strong> the second test as the criterion by<br />
which the results <strong>of</strong> the test under investigation are assessed.<br />
In designing such a study, investigators sometimes<br />
choose both patients in whom the disease is unequivocally<br />
advanced and patients who are unequivocally free <strong>of</strong> disease,<br />
such as healthy, asymptomatic volunteers. This approach<br />
ensures the validity <strong>of</strong> the criterion standard and<br />
may be appropriate in the early stages <strong>of</strong> developing a test.<br />
However, any study done with a population that lacks diagnostic<br />
uncertainty may produce a biased estimate <strong>of</strong> a test’s<br />
per<strong>for</strong>mance relative to that produced by<br />
a study restricted to patients <strong>for</strong> whom<br />
the test would be clinically indicated.<br />
Returning to the use <strong>of</strong> BNP levels to<br />
test <strong>for</strong> congestive heart failure among patients<br />
with acute dyspnea, consider Fig. 1.<br />
The horizontal axis represents increasing<br />
values <strong>of</strong> BNP. The 2 bell curves constitute<br />
hypothetical probability density plots<br />
<strong>of</strong> the distribution <strong>of</strong> BNP values among<br />
patients with and without congestive<br />
heart failure. 8 The height at any point in<br />
either curve reflects the proportion <strong>of</strong><br />
emergency patients in the particular subgroup<br />
with the corresponding BNP value.<br />
Aside from the choice <strong>of</strong> cut<strong>of</strong>f value, this<br />
figure does not reflect the results <strong>of</strong> any<br />
actual study.<br />
The bell curve on the left in Fig. 1 represents<br />
the hypothetical distribution <strong>of</strong><br />
BNP values in a group <strong>of</strong> young patients<br />
with known asthma and no risk factors <strong>for</strong><br />
congestive heart failure. They will tend to<br />
have low levels <strong>of</strong> circulating BNP. The<br />
bell curve on the right represents the distribution<br />
<strong>of</strong> BNP values among older patients<br />
with unequivocal and severe congestive<br />
heart failure. Such patients will<br />
have test results clustered on the high end<br />
<strong>of</strong> the scale.<br />
If Fig. 1 accurately represented the<br />
per<strong>for</strong>mance <strong>of</strong> the BNP test in distinguishing<br />
between all patients with and<br />
without congestive heart failure as the<br />
cause <strong>of</strong> their symptoms, the test would<br />
be very useful. The 2 curves demonstrate<br />
very little overlap. For BNP values below<br />
90 pg/mL (point A), no patients have<br />
congestive heart failure, and <strong>for</strong> BNP values<br />
above 110 pg/mL (point B), all patients<br />
have congestive heart failure. This<br />
means, assuming that Fig. 1 reflects reality, that you can be<br />
completely certain about the diagnosis <strong>for</strong> all people with<br />
BNP values below 90 pg/mL or above 110 pg/mL. Only<br />
<strong>for</strong> patients whose BNP values are between 90 and<br />
110 pg/mL is there residual uncertainty about their likelihood<br />
<strong>of</strong> congestive heart failure.<br />
However, be<strong>for</strong>e you embrace a test on the basis <strong>of</strong> its<br />
per<strong>for</strong>mance among patients in whom the presence or absence<br />
<strong>of</strong> disease is unequivocal, you need to consider the<br />
likely distribution <strong>of</strong> test results in a population <strong>of</strong> patients<br />
<strong>for</strong> whom you would be less certain.<br />
In Fig. 2, imagine that the entire study population is<br />
made up of middle-aged patients, all of whom have chronic<br />
congestive heart failure and recurrent asthma. The distributions<br />
of BNP values in the subgroups with and without<br />
acute congestive heart failure are both much closer to the<br />
middle of the range. The extent of the overlap of the curves<br />
between points A and B is much greater, which means that<br />
there is residual uncertainty about the disease status of a<br />
large proportion of the patients even after the BNP test has<br />
been performed.<br />
Fig. 1: Hypothetical probability density distributions of measured plasma brain<br />
natriuretic peptide (BNP) levels in 2 subgroups of a study population. The cutoff<br />
point for a diagnosis of congestive heart failure (CHF) is 100 pg/mL. Patients with a<br />
negative test result for CHF (left-hand curve) are younger, with known asthma and<br />
no risk factors for CHF. The patients with confirmed CHF are older, and the disease<br />
is clinically severe and unequivocal. Clinicians in the emergency department have<br />
little uncertainty regarding the cause of dyspnea in any of these patients.<br />
[Fig. 1 axes: BNP level, pg/mL (0 to 200), against proportion of patients; curves labelled “Patients without acute CHF” and “Patients with acute CHF”; points A and B marked.]<br />
Fig. 2: These hypothetical probability density distributions reflect a study population<br />
of middle-aged patients who all have recurrent asthma and chronic CHF.<br />
The patients whose dyspnea is caused by asthma exacerbations look clinically<br />
similar to those whose symptoms are caused by acute CHF.<br />
[Fig. 2 axes: BNP level, pg/mL (0 to 200), against proportion of patients; curves labelled “Patients without acute CHF” and “Patients with acute CHF”; points A and B marked.]<br />
It may be helpful to note that the sensitivity <strong>of</strong> the BNP<br />
test at a cut<strong>of</strong>f value <strong>of</strong> 100 pg/mL (the proportion <strong>of</strong> patients<br />
with acute congestive heart failure whose BNP level<br />
is greater than 100 pg/mL) is defined in Fig. 1 and Fig. 2 as<br />
the percentage <strong>of</strong> the total area <strong>of</strong> the right-hand curve that<br />
lies to the right <strong>of</strong> the cut<strong>of</strong>f value. Notice that this percentage<br />
is markedly lower in Fig. 2 than in Fig. 1. The<br />
same is true <strong>of</strong> specificity, which is the proportion <strong>of</strong> patients<br />
without acute congestive heart failure whose BNP<br />
level is less than 100 pg/mL. This is defined in the figures<br />
as the proportion <strong>of</strong> the left-hand curve that lies to the left<br />
<strong>of</strong> the cut<strong>of</strong>f point. Again this percentage is appreciably<br />
lower in Fig. 2 compared with Fig. 1.<br />
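The area-under-the-curve definitions above can be sketched numerically. The code below assumes that both distributions are normal with a common standard deviation; the means and the standard deviation are invented values chosen only to mimic the wide separation of Fig. 1 and the narrow separation of Fig. 2, and are not data from any BNP study.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def sens_spec(cutoff, mean_no_chf, mean_chf, sd):
    """Sensitivity: share of the CHF curve lying to the right of the cutoff.
    Specificity: share of the non-CHF curve lying to the left of the cutoff."""
    sensitivity = 1.0 - normal_cdf(cutoff, mean_chf, sd)
    specificity = normal_cdf(cutoff, mean_no_chf, sd)
    return sensitivity, specificity


CUTOFF = 100.0  # pg/mL, the cutoff point discussed in the article
SD = 15.0       # assumed spread, hypothetical

# Assumed means: far apart (Fig. 1-like) and close together (Fig. 2-like).
wide = sens_spec(CUTOFF, mean_no_chf=50.0, mean_chf=150.0, sd=SD)
narrow = sens_spec(CUTOFF, mean_no_chf=85.0, mean_chf=115.0, sd=SD)

print(f"Fig. 1-like: sensitivity {wide[0]:.3f}, specificity {wide[1]:.3f}")
print(f"Fig. 2-like: sensitivity {narrow[0]:.3f}, specificity {narrow[1]:.3f}")
```

Under these assumptions, moving the 2 curves toward each other lowers both sensitivity and specificity at the same cutoff (to roughly 0.84 each in the Fig. 2-like case), which is the pattern the figures are meant to convey.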
These theoretical concerns play out (albeit with a lesser<br />
magnitude of impact than depicted in Fig. 1 and Fig. 2) in<br />
studies of the BNP test as a diagnostic tool. In the BNP<br />
study to which we have referred, the sensitivity and specificity<br />
of the test using the 100 pg/mL cutoff were 90%<br />
and 76% respectively when all patients were included. 4<br />
Only about 25% of the study population were judged by<br />
the treating physicians to be in the intermediate range of<br />
probability of acute congestive heart failure. 5 When only<br />
patients in this subgroup were considered in a number of<br />
studies, the sensitivity and specificity of the BNP test at a<br />
cutoff point of 100 pg/mL were only 88% and 55% respectively. 7<br />
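To see what these figures imply, the sketch below converts the reported sensitivities and specificities into likelihood ratios using the standard definitions (positive LR = sensitivity / (1 − specificity); negative LR = (1 − sensitivity) / specificity). The function name is ours, and the calculation is an illustration rather than an analysis reported by the investigators.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a dichotomous test."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


all_patients = likelihood_ratios(0.90, 0.76)   # whole study population
intermediate = likelihood_ratios(0.88, 0.55)   # intermediate-probability subgroup

print(f"All patients: LR+ {all_patients[0]:.2f}, LR- {all_patients[1]:.2f}")
print(f"Intermediate: LR+ {intermediate[0]:.2f}, LR- {intermediate[1]:.2f}")
# All patients: LR+ 3.75, LR- 0.13
# Intermediate: LR+ 1.96, LR- 0.22
```

Both ratios move toward 1 in the subgroup with genuine diagnostic uncertainty, so a positive or negative BNP result shifts the probability of congestive heart failure less in exactly the patients for whom the test is most needed.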
The range <strong>of</strong> disease states found among the patients<br />
in the population upon which a test is to be used is commonly<br />
referred to as “disease spectrum.” In making your<br />
final assessment on the value <strong>of</strong> a test,<br />
consider the spectrum <strong>of</strong> the disease or<br />
condition in which you are interested.<br />
You don’t need to differentiate healthy<br />
patients from patients with severe disease.<br />
Rather, you must differentiate<br />
those who have the disease from those<br />
who do not among all those who appear<br />
as if they might have it. The “right”<br />
population <strong>for</strong> a diagnostic test study includes<br />
(1) those in whom we are uncertain<br />
<strong>of</strong> the diagnosis; (2) those in whom<br />
we will use the test in clinical practice to<br />
resolve our uncertainty; and (3) patients<br />
with the disease who have a wide spectrum<br />
<strong>of</strong> severity and patients without the<br />
disease who have symptoms commonly<br />
associated with it.<br />
Readers familiar with the concept and<br />
interpretation of likelihood ratios for diagnostic<br />
test results 1 may find it useful to<br />
note that the likelihood ratio for any given test value is represented<br />
by the ratio of the heights of the 2 curves at that point<br />
on the horizontal axis (Fig. 3). The point on the horizontal<br />
axis below the intersection of the 2 curves is the test result<br />
with a likelihood ratio of 1. Fig. 3 also identifies test<br />
values corresponding to likelihood ratios of 0.25 and 4.<br />
Comparing Fig. 1 and Fig. 2 once more, you will notice<br />
that the relative heights of the 2 curves, and hence the likelihood<br />
ratios, corresponding to a given BNP level will<br />
change as the curves move closer together and the area of<br />
overlap increases.<br />
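A brief sketch may help here. It treats each curve as a normal density and takes the likelihood ratio at a given BNP value as the height of the curve for patients with the disease divided by the height of the curve for patients without it. The means and standard deviation are assumptions made for illustration only.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a normal probability density curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(value, mean_no_disease, mean_disease, sd):
    """Height of the with-disease curve divided by the height of the without-disease curve."""
    return normal_pdf(value, mean_disease, sd) / normal_pdf(value, mean_no_disease, sd)


SD = 15.0  # assumed spread, hypothetical

# Where two equal-width curves cross (here at 100 pg/mL), the ratio is 1.
lr_at_crossing = likelihood_ratio(100.0, 85.0, 115.0, SD)

# The same BNP value of 110 pg/mL under widely and narrowly separated curves.
lr_wide = likelihood_ratio(110.0, 50.0, 150.0, SD)
lr_narrow = likelihood_ratio(110.0, 85.0, 115.0, SD)

print(f"LR at crossing point: {lr_at_crossing:.2f}")
print(f"LR at 110 pg/mL, curves far apart: {lr_wide:.1f}")
print(f"LR at 110 pg/mL, curves close together: {lr_narrow:.1f}")
```

With these assumed curves, the likelihood ratio attached to the same BNP value is far closer to 1 when the distributions overlap heavily, so an identical test result carries much less diagnostic information.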
The bottom line<br />
<strong>Tips</strong> <strong>for</strong> EBM learners: spectrum <strong>of</strong> disease<br />
• Test per<strong>for</strong>mance will vary with the spectrum <strong>of</strong> disease<br />
within a study population. 9<br />
• The sensitivity and specificity <strong>of</strong> a test, when it is used<br />
to differentiate patients who obviously do not have the<br />
disease from patients who obviously do, likely overestimate<br />
its per<strong>for</strong>mance when the test is applied in a clinical<br />
context characterized by diagnostic uncertainty.<br />
Definitions<br />
Disease spectrum: The range <strong>of</strong> the disease states found<br />
among patients who make up the population upon<br />
which a test is to be used.<br />
Per<strong>for</strong>mance <strong>of</strong> diagnostic tests: Measures derived from<br />
the percentage <strong>of</strong> patients with and without disease<br />
identified by a particular test result, with disease<br />
positivity defined through the application <strong>of</strong> an<br />
acceptable criterion standard to each patient in a study.<br />
Sensitivity and specificity are examples <strong>of</strong> such measures.<br />
Fig. 3: Likelihood ratios (LRs) and spectrum of disease. The likelihood ratio of a<br />
test result represented by a point on the horizontal line is the height of the right-hand<br />
bell curve (patients with the disease of interest) divided by the height of the<br />
left-hand bell curve (patients without the disease of interest) at that point.<br />
[Fig. 3 axes: increasing test value against proportion of patients; curves labelled “Patients without acute CHF” and “Patients with acute CHF”; test results with LR = 0.25, LR = 1 and LR = 4 marked.]<br />
Tip 2: Prevalence, spectrum and test<br />
characteristics<br />
You may have learned the rule <strong>of</strong> thumb that post-test<br />
probabilities (which are closely related to predictive values)<br />
vary with disease prevalence, but sensitivities, specificities<br />
and likelihood ratios do not. Is this true? The answer is<br />
“yes,” provided that disease spectrum remains the same in<br />
high- and low-prevalence populations. In the discussion<br />
that follows, <strong>for</strong> purposes <strong>of</strong> simplicity, we use the term<br />
“prevalence” to denote the likelihood that any patient randomly<br />
selected from the study population has the disease or<br />
condition as defined by the criterion standard. This is not<br />
the same thing as the probability <strong>of</strong> disease in any individual<br />
patient.<br />
Referring once again to Fig. 1, let’s consider 3 cases. In<br />
the first, we’ll assume that there were 1000 patients in each<br />
subgroup: 1000 in whom congestive heart failure was unequivocally<br />
the cause <strong>of</strong> their dyspnea and 1000 in whom<br />
asthma was almost certainly the cause. The prevalence <strong>of</strong><br />
congestive heart failure is 50%. Each bell curve corresponds<br />
to the distribution of BNP values within the respective<br />
subgroup. Now consider a second case, where there are<br />
2000 older patients with severe congestive heart failure and<br />
1000 younger patients with recurrent asthma and no risk<br />
factors for congestive heart failure. The prevalence of congestive<br />
heart failure is 67%. Finally, consider a third case,<br />
where 2000 patients with asthma and 1000 patients with severe<br />
congestive heart failure are studied. The prevalence of<br />
congestive heart failure is 33%.<br />
Fig. 4: Changes in disease prevalence have no effect on diagnostic test characteristics.<br />
Columns in each panel: Pregnant; Not pregnant; Total<br />
Panel A<br />
Positive test result: A = 95; B = 1; 96<br />
Negative test result: C = 5; D = 99; 104<br />
Total: 100; 100; 200<br />
Panel B<br />
Positive test result: A × 4 = 380; B = 1; 381<br />
Negative test result: C × 4 = 20; D = 99; 119<br />
Total: 400; 100; 500<br />
Panel C<br />
Positive test result: A = 95; B × 4 = 4; 99<br />
Negative test result: C = 5; D × 4 = 396; 401<br />
Total: 100; 400; 500<br />
In each case the height <strong>of</strong> either curve corresponding to<br />
any particular BNP level still corresponds to the proportion<br />
<strong>of</strong> patients with that test value in that group. Changes<br />
in the total number <strong>of</strong> patients will not alter these proportions,<br />
and the per<strong>for</strong>mance <strong>of</strong> the test, as measured by sensitivity,<br />
specificity or likelihood ratios, will be unaffected.<br />
The per<strong>for</strong>mance <strong>of</strong> the BNP test in identifying patients<br />
with and without acute congestive heart failure remains<br />
the same. Hence, when the spectrum remains the same, the<br />
prevalence <strong>of</strong> congestive heart failure within the study population<br />
is irrelevant to the estimation <strong>of</strong> test characteristics.<br />
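The reasoning can be written out as a short derivation. The symbols are our own notation: a and c are the numbers of patients with congestive heart failure who test positive and negative, b and d are the numbers of patients without it who test positive and negative, and k is an assumed factor by which the diseased group is enlarged while its spectrum stays the same.

```latex
\begin{align*}
\text{Sensitivity} &= \frac{a}{a + c}
  \quad\longrightarrow\quad \frac{ka}{ka + kc} = \frac{a}{a + c} \\[4pt]
\text{Specificity} &= \frac{d}{b + d}
  \quad\longrightarrow\quad \frac{d}{b + d} \\[4pt]
\text{Prevalence} &= \frac{a + c}{a + b + c + d}
  \quad\longrightarrow\quad \frac{k(a + c)}{k(a + c) + b + d}
  \neq \frac{a + c}{a + b + c + d} \quad (k \neq 1)
\end{align*}
```

The factor k cancels in the sensitivity and never enters the specificity, whereas it does not cancel in the prevalence. The same argument applies, with the roles reversed, when the group without the disease is enlarged.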
Let’s take a different clinical example. The ICON urine<br />
test for pregnancy (Beckman Coulter, Inc., Fullerton,<br />
Calif.) has a very high sensitivity and specificity when performed<br />
later than 2 weeks postconception. 10<br />
It is a qualitative, and inherently dichotomized, test:<br />
both clinicians and patients recognize that it is not possible<br />
to be “a little bit pregnant.” In short, although estimates of<br />
performance values for the ICON test vary in the literature,<br />
11,12 the performance of the test in detecting pregnancy<br />
is likely to be uniform if the percentage of subjects who are<br />
less than 2 weeks postconception does not vary.<br />
Fig. 4A: Women attending a screening clinic in a geographic area<br />
characterized by moderate population growth are tested for<br />
pregnancy. 50% of the women are pregnant. Hence, the<br />
prevalence of pregnancy is 50% in this setting. The ICON<br />
test has a sensitivity of 95% and a specificity of 99%. By<br />
definition, 95% of the 100 pregnant women (95% sensitivity)<br />
will have a positive test result, and 99% of the 100<br />
nonpregnant women (99% specificity) will have a negative<br />
test result. The sensitivity is influenced by the proportion of<br />
women who present less than 2 weeks after conception.<br />
Fig. 4B: The same test is performed in a similar clinic located in a<br />
geographic area characterized by high population growth.<br />
Four times as many women are pregnant as women who are<br />
not. The prevalence of pregnancy has increased to 80%. The<br />
percentage of pregnant women who have positive test results<br />
remains the same (380/400), and the sensitivity of the test<br />
remains 95% in this population. The percentage of<br />
nonpregnant women who have a negative test result is also<br />
unchanged at 99%.<br />
Fig. 4C: The same pregnancy test is now used in a clinic servicing a<br />
population characterized by low population growth. Only<br />
one-fifth of women are pregnant. The sensitivity remains the<br />
same despite a decrease in the proportion of pregnant women<br />
from 50% to 20%. The specificity (the proportion of<br />
nonpregnant women with a negative test result) remains the<br />
same despite an increase in the prevalence of nonpregnant<br />
women to 80%. Once again, the prevalence of pregnancy in<br />
the population is irrelevant to the estimation of test<br />
characteristics.<br />
For the purpose of our demonstration, let’s assume that
ICON test results are positive in 95% of women who are
pregnant and negative in 99% of women who are not. Fig.
4 shows the sensitivity and specificity of the test when it is
administered in 3 different geographic locations with high,
moderate and low population growth and where the proportion
of women presenting within 2 weeks of conception
is constant. Again, for simplicity, we are considering only
the prevalence of pregnancy in the population being studied
(in other words, the percentage of women tested who
are pregnant). A practitioner might estimate the probability
of pregnancy in an individual patient to be higher or lower
than this on the basis of clinical features such as use of birth
control methods, history of recent sexual activity and past
history of gynecologic disease. As Fig. 4 shows, the prevalence
of pregnancy in the population has no effect on the
estimation of test characteristics.
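The Fig. 4 arithmetic can be checked with a short Python sketch. It is an illustration only, not part of the original article: the function names are hypothetical, and the clinic sizes (100 + 100, 400 + 100 and 100 + 400 women) are inferred from the counts quoted for Fig. 4. The sketch applies the assumed 95% sensitivity and 99% specificity to each clinic, then re-estimates the test characteristics from the resulting 2 × 2 table.

```python
from fractions import Fraction

# Test characteristics assumed in the text for the ICON test.
SENSITIVITY = Fraction(95, 100)
SPECIFICITY = Fraction(99, 100)


def expected_table(n_pregnant, n_not_pregnant):
    """Expected 2 x 2 counts when the test is applied to a clinic population."""
    true_pos = SENSITIVITY * n_pregnant
    true_neg = SPECIFICITY * n_not_pregnant
    return {
        "true_pos": true_pos,
        "false_neg": n_pregnant - true_pos,
        "true_neg": true_neg,
        "false_pos": n_not_pregnant - true_neg,
    }


def estimate(n_pregnant, n_not_pregnant):
    """Re-estimate prevalence, sensitivity and specificity from the counts."""
    t = expected_table(n_pregnant, n_not_pregnant)
    diseased = t["true_pos"] + t["false_neg"]
    healthy = t["true_neg"] + t["false_pos"]
    return {
        "prevalence": diseased / (diseased + healthy),
        "sensitivity": t["true_pos"] / diseased,
        "specificity": t["true_neg"] / healthy,
    }


# (pregnant, not pregnant) per clinic; sizes inferred from the Fig. 4 counts.
CLINICS = {
    "moderate growth": (100, 100),
    "high growth": (400, 100),
    "low growth": (100, 400),
}

for name, (pregnant, not_pregnant) in CLINICS.items():
    e = estimate(pregnant, not_pregnant)
    print(
        f"{name}: prevalence {float(e['prevalence']):.0%}, "
        f"sensitivity {float(e['sensitivity']):.0%}, "
        f"specificity {float(e['specificity']):.0%}"
    )
```

Prevalence moves from 20% to 80% across the three clinics, while the re-estimated sensitivity and specificity stay at 95% and 99%. Sensitivity is computed only within the pregnant column of the table and specificity only within the nonpregnant column, so the relative size of the two columns never enters either calculation.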
There are many examples of conditions that may present
with equal severity in people with different demographic
characteristics (age, sex, ethnicity) but that are
much more prevalent in one group than in another. Mild
osteoarthritis of the knee is rare among young patients but
common among older patients. Asymptomatic thyroid abnormalities
are rare among men but common among
women. In both examples, diagnostic tests will have the
same sensitivity, specificity and likelihood ratios in young
and old patients and in men and women respectively.
However, higher prevalence will result in a higher proportion
of those with a positive test result who do in fact
have the disease for which they are being tested. Referring
to Fig. 4, in the population with a lower prevalence of
pregnancy, 95 of 99 women (96%) with positive test results
are pregnant (Fig. 4C) compared with 380 of 381 women
(99.7%) in the population with a higher prevalence (Fig.
4B). The likelihood of the condition or disease among patients
who have a positive test result is sometimes referred
to as the predictive value of a test. The predictive value corresponds
with the post-test probability of the disease when
the test result is positive. Unlike sensitivity, specificity or
likelihood ratios, predictive values are strongly influenced
by changes in prevalence in the population being tested.
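The dependence of the predictive value on prevalence can be reproduced with a minimal Python sketch. It is illustrative only: the function names are hypothetical, and it assumes the 95% sensitivity and 99% specificity used in the demonstration. It computes the predictive value of a positive result directly and also through the likelihood ratio, to show that the two routes agree.

```python
import math


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Proportion of positive results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert to odds, multiply by the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


SENSITIVITY, SPECIFICITY = 0.95, 0.99
# Positive likelihood ratio: depends only on sensitivity and specificity.
LR_POSITIVE = SENSITIVITY / (1 - SPECIFICITY)

for prevalence in (0.20, 0.50, 0.80):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    # The predictive value equals the post-test probability after a positive result.
    assert math.isclose(ppv, post_test_probability(prevalence, LR_POSITIVE))
    print(f"prevalence {prevalence:.0%}: positive predictive value {ppv:.1%}")
```

Under these assumptions the loop reports about 96%, 99% and 99.7% for prevalences of 20%, 50% and 80%. The first and last values match the 95 of 99 and 380 of 381 quoted from Fig. 4, and the likelihood ratio stays the same throughout.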
Although differences in prevalence alone should not affect
the sensitivity or specificity of a test, in many clinical
settings disease prevalence and severity may be related. For
instance, rheumatoid arthritis seen in a family physician’s
office will be relatively uncommon, and most patients will
have a relatively mild case. In contrast, rheumatoid arthritis
will be common in a rheumatologist’s office, and patients
will tend to have relatively severe disease. Tests to diagnose
rheumatoid arthritis in the rheumatologist’s waiting area
(e.g., hand inspection for joint deformity) are likely to be
relatively more sensitive not because of the increased
prevalence but because of the spectrum of disease present
(e.g., degree and extent of joint deformity) in this setting.
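One way to see why spectrum, rather than prevalence, drives this difference is to treat overall sensitivity as a weighted average of the sensitivities in mild and severe disease. The Python sketch below is illustrative only. Every number in it is hypothetical (the article gives no figures for hand inspection), and the function and variable names are invented for the example.

```python
def pooled_sensitivity(case_mix):
    """Sensitivity across diseased patients, weighted by severity mix.

    case_mix maps a severity label to (share of diseased patients,
    sensitivity of the test within that severity stratum).
    """
    total_share = sum(share for share, _ in case_mix.values())
    detected = sum(share * sens for share, sens in case_mix.values())
    return detected / total_share


# Hypothetical stratum-specific sensitivities of hand inspection.
SENS_MILD, SENS_SEVERE = 0.30, 0.90

# Hypothetical severity mix among patients who have rheumatoid arthritis.
family_practice = {"mild": (0.80, SENS_MILD), "severe": (0.20, SENS_SEVERE)}
rheumatology = {"mild": (0.20, SENS_MILD), "severe": (0.80, SENS_SEVERE)}

print(f"family practice: {pooled_sensitivity(family_practice):.0%}")
print(f"rheumatology clinic: {pooled_sensitivity(rheumatology):.0%}")
```

With these made-up inputs the pooled sensitivity is 42% in the family practice and 78% in the rheumatology clinic. The calculation uses only the severity mix among patients who have the disease. The proportion of all patients who have rheumatoid arthritis does not appear anywhere in it.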
The bottom line
• Disease prevalence has no direct effect on test characteristics
(e.g., likelihood ratios, sensitivity and specificity).
• Spectrum of disease and disease prevalence have different
effects on diagnostic test characteristics.
Conclusions
Clinicians need to understand how and when the choice
of patients for a diagnostic test study may affect the performance
of the test. Both disease spectrum in patients with
the condition of interest and the spectrum of competing
conditions in patients without the condition of interest can
affect the test’s apparent diagnostic power. Despite the potentially
powerful impact of disease spectrum and competing
conditions, changes in prevalence that do not reflect
changes in spectrum will not alter test performance.
This article has been peer reviewed.
From the Knowledge and Encounter Research Unit, Department of Medicine,
Mayo Clinic College of Medicine, Rochester, Minn. (Montori); the Departments
of Epidemiology and Biostatistics and of Pediatrics, University of California, San
Francisco (Newman); Durham Veterans Affairs Medical Center and Duke University
Medical Center, Durham, NC (Keitz); the Columbia University College of
Physicians and Surgeons, New York, NY (Wyer); and the Departments of Medicine
and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton,
Ont. (Guyatt)
Competing interests: None declared.
Contributors: Victor Montori, as principal author, oversaw and contributed to the
writing of the manuscript. Thomas Newman reviewed the manuscript at all phases
of development and contributed to the writing as coauthor of tip 2. Sheri Keitz
used all tips as part of a live teaching exercise and submitted comments, suggestions
and the possible variations that are reported in the manuscript. Peter Wyer
reviewed and revised the final draft of the manuscript to achieve uniform adherence
with format specifications. Gordon Guyatt developed the original idea for tips
1 and 2, reviewed the manuscript at all phases of development, contributed to the
writing as coauthor, and reviewed and revised the final draft of the manuscript to
achieve accuracy and consistency of content as general editor.
References
1. Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D, editors.
Users’ guides to the medical literature: a manual for evidence-based clinical
practice. Chicago: AMA Press; 2002. p. 121-40.
2. Wyer P, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for
learning and teaching evidence-based medicine: introduction to the series.
CMAJ 2004;171(4):347-8.
3. Dao Q, Krishnaswamy P, Kazanegra R, Harrison A, Amirnovin R, Lenert L,
et al. Utility of B-type natriuretic peptide in the diagnosis of congestive heart
failure in an urgent-care setting. J Am Coll Cardiol 2001;37:379-85.
4. Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et
al.; Breathing Not Properly Multinational Study Investigators. Rapid measurement
of B-type natriuretic peptide in the emergency diagnosis of heart
failure. N Engl J Med 2002;347:161-7.
5. McCullough PA, Nowak RM, McCord J, Hollander JE, Herrmann HC, Steg
PG, et al. B-type natriuretic peptide and clinical judgment in emergency diagnosis
of heart failure: analysis from Breathing Not Properly (BNP) Multinational
Study. Circulation 2002;106:416-22.
CMAJ AUG. 16, 2005; 173 (4) 389
6. Hohl CM, Mitelman BY, Wyer P, Lang E. Should emergency physicians use
B-type natriuretic peptide testing in patients with unexplained dyspnea? Can J
Emerg Med 2003;5:162-5.
7. Schwam E. B-type natriuretic peptide for diagnosis of heart failure in emergency
department patients: a critical appraisal. Acad Emerg Med 2004;11:686-91.
8. Tandberg D, Deely JJ, O’Malley AJ. Generalized likelihood ratios for quantitative
diagnostic test scores. Am J Emerg Med 1997;15:694-9.
9. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen
JH, et al. Empirical evidence of design-related bias in studies of diagnostic
tests. JAMA 1999;282:1061-6.
10. Product insert. Available: www.beckman.com/literature/ClinDiag/08109.D.pdf
(accessed 13 Jul 2005).
11. Lauszus FF. Clinical trial of 2 highly sensitive pregnancy tests: Tandem
ICON HCG-urine and OPCO On-step Pacific Biotech. Ugeskr Laeger 1992;
154:2069-70.
12. Mishalani SH, Seliktar J, Braunstein GD. Four rapid serum–urine combination
assays of choriogonadotropin (hCG) compared and assessed for their utility
in quantitative determinations of hCG. Clin Chem 1994;40:1944-99.
Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,
Pelham NY 10804; fax 212 305-6792; pwyer@att.net
Members of the Evidence-Based Medicine Teaching Tips Working
Group: Peter C. Wyer (project director), College of Physicians and
Surgeons, Columbia University, New York, NY; Deborah Cook,
Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke,
McMaster University, Hamilton, Ont.; Rose Hatala (internal
review coordinator), University of British Columbia, Vancouver,
BC; Robert Hayward (editor, online version), Bruce Fisher,
University of Alberta, Edmonton, Alta.; Sheri Keitz (field test
coordinator), Durham Veterans Affairs Medical Center and Duke
University Medical Center, Durham, NC; Alexandra Barratt,
University of Sydney, Sydney, Australia; Pamela Charney, Albert
Einstein College of Medicine, Bronx, NY; Antonio L. Dans,
University of the Philippines College of Medicine, Manila, The
Philippines; Barnet Eskin, Morristown Memorial Hospital,
Morristown, NJ; Jennifer Kleinbart, Emory University School of
Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre,
Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas
McGinn, Mount Sinai Medical Center, New York, NY; Victor M.
Montori, Mayo Clinic College of Medicine, Rochester, Minn.;
Virginia Moyer, University of Texas, Houston, Tex.; Thomas B.
Newman, University of California, San Francisco, San Francisco,
Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.;
Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain;
W. Scott Richardson, Wright State University, Dayton, Ohio; Mark
C. Wilson, University of Iowa, Iowa City, Iowa
Articles to date in this series
Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz
S, et al. Tips for learners of evidence-based medicine:
1. Relative risk reduction, absolute risk reduction and
number needed to treat. CMAJ 2004;171(4):353-8.
Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC,
Moyer V, et al. Tips for learners of evidence-based
medicine: 2. Measures of precision (confidence intervals).
CMAJ 2004;171(6):611-5.
McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R,
Guyatt G, et al. Tips for learners of evidence-based
medicine: 3. Measures of observer variability (kappa
statistic). CMAJ 2004;171(11):1369-73.
Hatala R, Keitz S, Wyer P, Guyatt G; for the Evidence-Based
Medicine Teaching Tips Working Group. Tips
for learners of evidence-based medicine: 4. Assessing
heterogeneity of primary studies in systematic reviews
and whether to combine their results. CMAJ 2005;
172(5):661-5.