Tips for Learners of Evidence-Based Medicine

CMAJ 2005: Tips for Learners of Evidence-Based Medicine: A 5-Part Series 

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. Can Med Assoc J 2004;171:353–358.

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). Can Med Assoc J 2004;171:611–615.

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G. Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa statistic). Can Med Assoc J 2004;171:1369–1373.

Hatala R, Keitz S, Wyer P, Guyatt G. Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results. Can Med Assoc J 2005;172:661–665.

Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G. Tips for learners of evidence-based medicine: 5. The effect of spectrum of disease on the performance of diagnostic tests. Can Med Assoc J 2005;173:385–390.


DOI:10.1503/cmaj.1021197 

Tips for learners of evidence-based medicine: 

1. Relative risk reduction, absolute risk reduction 

and number needed to treat 

Physicians, patients and policy-makers are influenced 

not only by the results of studies but also by how authors 

present the results. 1–4 Depending on which 

measures of effect authors choose, the impact of an intervention 

may appear very large or quite small, even though 

the underlying data are the same. In this article we present 

3 measures of effect — relative risk reduction, absolute risk 

reduction and number needed to treat — in a fashion designed 

to help clinicians understand and use them. We 

have organized the article as a series of “tips” or exercises. 

This means that you, the reader, will have to do some work 

in the course of reading this article (we are assuming that 

most readers are practitioners, as opposed to researchers 

and educators). 

The tips in this article are adapted from approaches developed 

by educators with experience in teaching evidence-based

medicine skills to clinicians. 5,6 A related article, intended 

for people who teach these concepts to clinicians, is available 

online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. 

Clinician learners’ objectives 

Understanding risk and risk reduction 

• Learn how to determine control and treatment event 

rates in published studies. 

• Learn how to determine relative and absolute risk reductions 

from published studies. 

• Understand how relative and absolute risk reductions 

usually apply to different populations. 

Balancing benefits and adverse effects in individual 

patients 

• Learn how to use a known relative risk reduction to estimate the risk of an event for a patient undergoing treatment, given an estimate of that patient’s risk of the event without treatment.

• Learn how to use absolute risk reductions to assess whether the benefits of therapy outweigh its harms.

Review

Synthèse

Alexandra Barratt, Peter C. Wyer, Rose Hatala, Thomas McGinn, Antonio L. Dans, Sheri Keitz, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

CMAJ • AUG. 17, 2004; 171 (4) 353. © 2004 Canadian Medical Association or its licensors. See related article page 347.

Calculating and using number needed to treat 

• Develop an understanding of the concept of number 

needed to treat (NNT) and how it is calculated. 

• Learn how to interpret the NNT and develop an understanding 

of how the “threshold NNT” varies depending 

on the patient’s values and preferences, the 

severity of possible outcomes and the adverse effects 

(harms) of therapy. 

Tip 1: Understanding risk and risk reduction 

You can calculate relative and absolute risk reductions using 

simple mathematical formulas (see Appendix 1). However, 

you might find it easier to understand the concepts 

through visual presentation. Fig. 1A presents data from a hypothetical 

trial of a new drug for acute myocardial infarction, 

showing the 30-day mortality rate in a group of patients at 

high risk for the adverse event (e.g., elderly patients with 

congestive heart failure and anterior wall infarction). On the basis of information in Fig. 1A, how would you describe the effect of the new drug? (Hint: Consider the event rates in people not taking the new drug and those who are taking it.)

Teachers of evidence-based medicine:

See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

We can describe the difference in mortality (event) 

rates in both relative and absolute 

terms. In this case, 

these high-risk patients had a 

relative risk reduction of 25% 

and an absolute risk reduction 

of 10%. 

Now, let’s consider Fig. 1B, 

which shows the results of a 

second hypothetical trial of the 

same new drug, but in a patient 

population with a lower risk for 

the outcome (e.g., younger patients 

with uncomplicated inferior 

wall myocardial infarction). 

Looking at Fig. 1B, how 

would you describe the effect 

of the new drug? 

The relative risk reduction 

with the new drug remains at 

25%, but the event rate is lower 

in both groups, and hence 

the absolute risk reduction is only 2.5%. 

Although the relative risk reduction might be similar across different risk groups (a safe assumption in many if not most cases 7,8 ), the absolute gains, represented by absolute risk reductions, are not. In sum, the absolute risk reduction becomes smaller when event rates are low, whereas the relative risk reduction, or “efficacy” of the treatment, often remains constant.

Risk and risk reduction: definitions

Event rate: the number of people experiencing an event as a proportion of the number of people in the population

Relative risk reduction: the difference in event rates between 2 groups, expressed as a proportion of the event rate in the untreated group; usually constant across populations with different risks 7,8

Absolute risk reduction: the arithmetic difference between 2 event rates; varies with the underlying risk of an event in the individual patient

[Fig. 1, panels A and B: bar charts. Vertical axis: risk for outcome of interest, % (0 to 40). Bars: placebo and treatment. A: Trial 1, high-risk patients. B: Trial 1, high-risk patients, and Trial 2, low-risk patients.]

These phenomena may be 

factors in the design of drug 

trials. For example, a drug 

may be tested in severely affected 

people in whom the 

absolute risk reduction is likely 

to be impressive, but is 

subsequently marketed for 

use by less severely affected 

patients, in whom the absolute 

risk reduction will be 

substantially less. 

The bottom line 

Relative risk reduction is 

often more impressive than 

absolute risk reduction. Furthermore, 

the lower the event rate in the control group, 

the larger the difference between relative risk reduction 

and absolute risk reduction. 

Among high-risk patients in trial 1, the event rate in the control group (placebo) is 40 per 

100 patients, and the event rate in the treatment group is 30 per 100 patients. 

Absolute risk reduction (also called the risk difference) is the simple difference in the event 

rates (40% – 30% = 10%). 

Relative risk reduction is the difference between the event rates in relative terms. Here, the 

event rate in the treatment group is 25% less than the event rate in the control group (i.e., the 

10% absolute difference expressed as a proportion of the control rate is 10/40 or 

25% less). 

Among low-risk patients in trial 2, the event rate in the control group (placebo) is only 10%. 

If the treatment is just as effective in these low-risk patients, what event rate can we expect 

in the treatment group? 


The event rate in the treated group would be 25% less than in the control group or 7.5%. 

Therefore, the absolute risk reduction for the low-risk patients (second pair of columns) is only 

2.5%, even though the relative risk reduction is the same as for the high-risk patients 

(first pair of columns). 

Fig. 1: Results of hypothetical placebo-controlled trials of a new drug for acute myocardial infarction. The bars represent the 30-day 

mortality rate in different groups of patients with acute myocardial infarction and heart failure. A: Trial involving patients at 

high risk for the adverse outcome. B: Trials involving a group of patients at high risk for the adverse outcome and another group of 

patients at low risk for the adverse outcome.
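
The arithmetic behind Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration and not part of the original article: the function name `risk_reductions` is an assumption, and the event rates are the hypothetical values from the 2 trials described above.

```python
def risk_reductions(control_rate, treatment_rate):
    """Return (absolute risk reduction, relative risk reduction) as proportions.

    Both arguments are event rates expressed as proportions (e.g., 0.40 for 40%).
    """
    arr = control_rate - treatment_rate  # absolute risk reduction
    rrr = arr / control_rate             # ARR as a proportion of the control rate
    return arr, rrr


# Hypothetical trials from Fig. 1: (control event rate, treatment event rate)
trials = {
    "Trial 1 (high risk)": (0.40, 0.30),
    "Trial 2 (low risk)": (0.10, 0.075),
}

for name, (control, treatment) in trials.items():
    arr, rrr = risk_reductions(control, treatment)
    print(f"{name}: ARR = {arr * 100:.1f}%, RRR = {rrr * 100:.0f}%")
```

Both trials yield the same relative risk reduction, whereas the absolute risk reduction shrinks with the control event rate, which is the point of Tip 1.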

Tip 2: Balancing benefits and adverse effects 

in individual patients 

In prescribing medications or other treatments, physicians 

consider both the potential benefits and the potential 

harms. We have just demonstrated that the benefits of 

treatment (presented as absolute risk reductions) will generally 

be greater in patients at higher risk of adverse outcomes 

than in patients at lower risk of adverse outcomes. 

You must now incorporate the possibility of harm into 

your decision-making. 

First, you need to quantify the potential benefits. Assume 

you are managing 2 patients for high blood pressure 

and are considering the use of a new antihypertensive drug, 

drug X, for which the relative risk reduction for stroke over 

3 years is 33%, according to published randomized controlled 

trials. 

Pat is a 69-year-old woman whose blood pressure during 

a routine examination is 170/100 mm Hg; her blood 

pressure remains unchanged when you see her again 3 

weeks later. She is otherwise well and has no history of cardiovascular 

or cerebrovascular disease. You assess her risk 

of stroke at about 1% (or 1 per 100) per year. 9 

Dorothy is also 69 years of age, and her blood pressure 

is the same as Pat’s, 170/100 mm Hg; however, because she 

had a stroke recently, you assess her risk of subsequent 

stroke as higher than Pat’s, perhaps 10% per year. 10 

One way of determining the potential benefit of a new 

treatment is to complete a benefit table such as Table 1A. 

To do this, insert your estimated 3-year event rates for Pat 

and Dorothy, and then apply the relative risk reduction 

(33%) expected if they take drug X. It is clear from Table 1A that the absolute risk reduction for patients at higher risk (such as Dorothy) is much greater than for those at lower risk (such as Pat).

Table 1A: Benefit table*

                                3-yr event rate for stroke, %
Patient group                   No treatment    With drug X     Absolute risk reduction, % (no treatment – treatment)
At lower risk (e.g., Pat)       3               2               1
At higher risk (e.g., Dorothy)  30              20              10

*Based on data from a randomized controlled trial of drug X, which reported a 33% relative risk reduction for the outcome (stroke) over 3 years.

Now, you need to factor the potential harms (adverse effects associated with using the drug) into the clinical decision. In the clinical trials of drug X, the risk of severe gastric bleeding increased 3-fold over 3 years in patients who received the drug (relative risk of 3). A population-based study has reported the risk of severe gastric bleeding for women in your patients’ age group at about 0.1% per year (regardless of their risk of stroke). These data can now be added to the table to allow a more balanced assessment of the benefits and harms that could arise from treatment (Table 1B).

Table 1B: Benefit and harm table*

                                3-yr event rate for stroke, %           3-yr event rate for severe gastric bleeding, %
Patient group                   No treatment    With drug X     ARR     No treatment    With drug X     ARI
At lower risk (e.g., Pat)       3               2               1       0.3             0.9             0.6
At higher risk (e.g., Dorothy)  30              20              10      0.3             0.9             0.6

Note: ARR = absolute risk reduction (no treatment – treatment), ARI = absolute risk increase (treatment – no treatment).

*Based on data from randomized controlled trials of drug X reporting a 33% relative risk reduction for the outcome (stroke) over 3 years and a 3-fold increase for the adverse effect (severe gastric bleeding) over the same period.

Considering the results of this process, would you give drug X to Pat, to Dorothy or to both?

In making your decisions, remember that there is not necessarily one “right answer” here. Your analysis might go something like this:

Pat will experience a small benefit (absolute risk reduction over 3 years of about 1%), but this will be considerably offset by the increased risk of gastric bleeding (absolute risk increase over 3 years of 0.6%). The potential benefit for Dorothy (absolute risk reduction over 3 years of about 10%) is much greater than the increased risk of harm (absolute risk increase over 3 years of 0.6%). Therefore, the benefit of treatment is likely to be greater for Dorothy (who is at higher risk of stroke) than for Pat (who is at lower risk).

Assessment of the balance between benefits and harms depends on the value that patients place on reducing their risk of stroke in relation to the increased risk of gastric bleeding. Many patients might be much more concerned about the former than the latter.



Number needed to treat: definitions 

Number needed to treat: the number of patients who 

would have to receive the treatment for 1 of them to 

benefit; calculated as 100 divided by the absolute risk 

reduction expressed as a percentage (or 1 divided by the 

absolute risk reduction expressed as a proportion; see 

Appendix 1) 

Number needed to harm: the number of patients who 

would have to receive the treatment for 1 of them to 

experience an adverse effect; calculated as 100 divided 

by the absolute risk increase expressed as a percentage 

(or 1 divided by the absolute risk increase expressed as a 

proportion) 


The bottom line

When available, trial data regarding relative risk reductions 

(or increases), combined with estimates of baseline 

(untreated) risk in individual patients, provide the basis for 

clinicians to balance the benefits and harms of therapy for 

their patients. 
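
The benefit and harm calculation from Tip 2 can be reproduced with a short Python sketch. It is illustrative only: the function name `benefit_harm` and its parameters are assumptions, the 3-year risk is taken as 3 times the annual risk (as in the article's tables), and the relative risk reduction of about 33% is entered as one-third so that the results match the rounded table values.

```python
def benefit_harm(annual_stroke_risk, annual_bleed_risk, years=3,
                 stroke_rrr=1 / 3, bleed_relative_risk=3.0):
    """Build one row of a benefit and harm table; all rates are percentages."""
    stroke_untreated = annual_stroke_risk * years
    stroke_treated = stroke_untreated * (1 - stroke_rrr)
    bleed_untreated = annual_bleed_risk * years
    bleed_treated = bleed_untreated * bleed_relative_risk
    return {
        "stroke_arr": stroke_untreated - stroke_treated,  # benefit
        "bleed_ari": bleed_treated - bleed_untreated,     # harm
    }


# Annual stroke risk (%) assumed in the text for each patient
patients = {"Pat": 1.0, "Dorothy": 10.0}

for name, risk in patients.items():
    row = benefit_harm(risk, annual_bleed_risk=0.1)
    print(f"{name}: stroke ARR = {row['stroke_arr']:.1f}%, "
          f"bleeding ARI = {row['bleed_ari']:.1f}%")
```

The absolute risk increase for bleeding is the same for both patients, so the balance between benefit and harm depends entirely on each patient's baseline risk of stroke.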

Tip 3: Calculating and using number needed 

to treat 

Some physicians use another measure of risk and benefit, 

the number needed to treat (NNT), in considering the 

consequences of treating or not treating. The NNT is the 

number of patients to whom a clinician would need to administer 

a particular treatment to prevent 1 patient from 

having an adverse outcome over a predefined period of 

time. (It also reflects the likelihood that a particular patient 

to whom treatment is administered will benefit from it.) If, 

for example, the NNT for a treatment is 10, the practitioner 

would have to give the treatment to 10 patients to 

prevent 1 patient from having the adverse outcome over the defined period, and each patient who received the treatment would have a 1 in 10 chance of being a beneficiary.

If the absolute risk reduction is large, you need to treat only a small number of patients to observe a benefit in at least some of them. Conversely, if the absolute risk reduction is small, you must treat many people to observe a benefit in just a few.

An analogous calculation to the one used to determine the NNT can be used to determine the number of patients who would have to be treated for 1 patient to experience an adverse event. This is the number needed to harm (NNH), which is the inverse of the absolute risk increase.

How comfortable are you with estimating the NNT for a given treatment? For example, consider the following questions: How many 60-year-old patients with hypertension would you have to treat with diuretics for a period of 5 years to prevent 1 death? How many people with myocardial infarction would you have to treat with β-blockers for 2 years to prevent 1 death? How many people with acute myocardial infarction would you have to treat with streptokinase to prevent 1 person from dying in the next 5 weeks? Compare your answers with estimates derived from published studies (Table 2). How accurate were your estimates? Are you surprised by the size of the NNT values?

Table 2: Benefit table for patients with cardiovascular problems

                                                    Event rate, %
Clinical question                                   Control group   Treatment group   ARR, %    NNT

What is the reduction in risk of stroke within 5    2.9             1.9               1.00      100
years among 60-year-old patients with
hypertension who are treated with diuretics? 11

What is the reduction in risk of death within 2     9.8             7.3               2.50      40
years after MI among 60-year-old patients treated
with β-blockers? 12

What is the reduction in risk of death within 5     12.0            9.2               2.80      36
weeks after acute MI among 60-year-old patients
treated with streptokinase? 13

Note: MI = myocardial infarction, ARR = absolute risk reduction, NNT = number needed to treat.

Physicians often experience problems in this type of exercise, usually because they are unfamiliar with the calculation of NNT. Here is one way to think about it. If a disease has a mortality rate of 100% without treatment and therapy reduces that mortality rate to 50%, how many people would you need to treat to prevent 1 death? From the numbers given, you can probably figure out that treating 100 patients with the otherwise fatal disease results in 50 survivors. This is equivalent to 1 out of every 2 treated. Since all were destined to die, the NNT to prevent 1 death is 2. The formula reflected in this calculation is as follows: the NNT to prevent 1 adverse outcome equals the inverse of the absolute risk reduction. Table 3 illustrates this concept further. Note that, if the absolute risk reduction is presented as a percentage, the NNT is 100/absolute risk reduction; if the absolute risk reduction is expressed as a proportion, the NNT is 1/absolute risk reduction. Both methods give the same answer, so use whichever you find easier.

Table 3: Calculation of NNT from absolute risk reduction*

Form of absolute risk reduction   Calculation of NNT    Example
Percentage (e.g., 2.8%)           100/ARR               100/2.8 = 36
Proportion (e.g., 0.028)          1/ARR                 1/0.028 = 36

*Using absolute risk reduction in last row of Table 2. 13

It can be challenging for clinicians to estimate the baseline risks for specific populations. For example, some physicians may have little idea of the risk of stroke over 5 years among patients with hypertension. Physicians may also overestimate the effect of treatment, which leads them to ascribe larger absolute risk reductions and smaller NNT values than are actually the case. 14

Now that you know how to determine the NNT from the absolute risk reduction, you must also consider whether the NNT is reasonable. In other words, what is the maximum NNT that you and your patients will accept as justifying the benefits and harms of therapy? This is referred to as the threshold NNT. 15 If the calculated NNT is above the threshold, the benefits are not large enough (or the risk of harm is too great) to warrant initiating the therapy.

Determinants of the threshold NNT include the patient’s own values and preferences, the severity of the outcome that would be prevented, and the costs and side effects of the intervention. Thus, the threshold NNT will almost certainly be different for different patients, and there is no simple answer to the question of when an NNT is sufficiently low to justify initiating treatment.

The bottom line

NNT is a concise, clinically useful presentation of the effect of an intervention. You can easily calculate it from the absolute risk reduction (just remember to check whether the absolute risk reduction is presented as a percentage or a proportion and use a numerator of 100 or 1 accordingly). Be careful not to overestimate the effect of treatments (i.e., use a value of absolute risk reduction that is too high) and thus underestimate the NNT.

Conclusions

Clinicians seeking to apply clinical evidence to the care 

of individual patients need to understand and be able to 

calculate relative risk reduction, absolute risk reduction 

and NNT from data presented in clinical trials and systematic 

reviews. We have described and defined these 

concepts and presented tabular tools and equations to 

help clinicians overcome common pitfalls in acquiring 

these skills. 
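
The NNT calculation described in Tip 3 can be checked against the absolute risk reductions reported in Table 2. The following Python sketch is illustrative only: the function names are assumptions, the threshold value used at the end is hypothetical, and rounding up to the next whole patient is a common convention that the article does not state explicitly.

```python
import math


def nnt_from_percentage(arr_percent):
    """NNT when the absolute risk reduction is expressed as a percentage."""
    # Round first so floating-point noise cannot push a whole number upward
    return math.ceil(round(100 / arr_percent, 6))


def nnt_from_proportion(arr_proportion):
    """NNT when the absolute risk reduction is expressed as a proportion."""
    return math.ceil(round(1 / arr_proportion, 6))


def within_threshold(nnt, threshold_nnt):
    """True if the calculated NNT does not exceed the acceptable threshold NNT."""
    return nnt <= threshold_nnt


# Absolute risk reductions (%) taken from Table 2
table_2 = {
    "diuretics, 5 years": 1.00,
    "beta-blockers, 2 years": 2.50,
    "streptokinase, 5 weeks": 2.80,
}

for treatment, arr in table_2.items():
    nnt = nnt_from_percentage(arr)
    # A threshold NNT of 50 is a hypothetical value chosen for illustration
    print(f"{treatment}: NNT = {nnt}, within threshold of 50: "
          f"{within_threshold(nnt, 50)}")
```

In practice the threshold is not a fixed number: as the text explains, it depends on the patient's values, the severity of the outcome and the harms and costs of therapy.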

This article has been peer reviewed. 

From the School of Public Health, University of Sydney, Sydney, Australia (Barratt); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Department of Medicine, University of British Columbia, Vancouver, BC (Hatala); Mount Sinai Medical Center, New York, NY (McGinn); the Department of Internal Medicine, University of the Philippines College of Medicine, Manila, The Philippines (Dans); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Department of Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Alexandra Barratt contributed tip 2, drafted the manuscript, coordinated input from coauthors and reviewers and from field-testing and revised all drafts. Peter Wyer edited drafts and provided guidance in developing the final format. Rose Hatala contributed tip 1, coordinated the internal review process and provided comments throughout development of the manuscript. Thomas McGinn contributed tip 3 and provided comments throughout development of the manuscript. Antonio Dans reviewed all drafts and provided comments throughout development of the manuscript. Sheri Keitz conducted field-testing of the tips and contributed material from the field-testing to the manuscript. Virginia Moyer reviewed and contributed to the final version of the manuscript. Gordon Guyatt helped to write the manuscript (as an editor and coauthor).

References

1. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing 

effect of relative and absolute risk. J Gen Intern Med 1993;8:543-8. 

2. Forrow L, Taylor WC, Arnold RM. Absolutely relative: How research results 

are summarized can affect treatment decisions. Am J Med 1992;92:121-4. 

3. Naylor CD, Chen E, Strauss B. Measured enthusiasm: Does the method of 

reporting trial results alter perceptions of therapeutic effectiveness? Ann Intern 

Med 1992;117:916-21. 

4. Fahey T, Griffiths S, Peters TJ. Evidence based purchasing: understanding 

results of clinical trials and systematic reviews. BMJ 1995;311:1056-60. 

5. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures 

of association. In: Guyatt G, Rennie D, editors. The users’ guides to the 

medical literature: a manual of evidence-based clinical practice. Chicago: AMA 

Publications; 2002. p. 351-68. 

6. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips 

for learning and teaching evidence-based medicine: introduction to the series. 

CMAJ 2004;171(4):347-8. 

7. Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study of the 

effect of the control rate as a predictor of treatment efficacy in meta-analysis 

of clinical trials. Stat Med 1998;17:1923-42. 

8. Furukawa TA, Guyatt GH, Griffith LE. Can we individualise the number 

needed to treat? An empirical study of summary effect measures in meta-analyses. 

Int J Epidemiol 2002;31:72-6. 

9. SHEP Cooperative Research Group. Prevention of stroke by anti-hypertensive 

drug treatment in older persons with isolated systolic hypertension. Final 

results of the Systolic Hypertension in the Elderly Program (SHEP). JAMA 

1991;265:3255-64. 

10. SALT Collaborative Group. Swedish Aspirin Low-dose Trial (SALT) of 

75mg aspirin as secondary prophylaxis after cerebrovascular events. Lancet 

1991;338:1345-9. 

11. Psaty BM, Smith NL, Siscovick DS, Koepsell TD, Weiss NS, Heckbert 

SR. Health outcomes associated with antihypertensive therapies used as 

first-line agents. A systematic review and meta-analysis. JAMA 1997;277: 

739-45. 

12. β-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol 

in patients with acute myocardial infarction. I. Mortality results. 

JAMA 1982;247:1707-14. 

13. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, 

oral aspirin, both or neither among 17 187 cases of suspected acute myocardial 

infarction: ISIS-2. Lancet 1988;2:349-60. 

14. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number 

needed to treat: a clinically useful nomogram in its proper context. BMJ 1996; 

312:426-9. 

15. Sinclair JC, Cook RJ, Guyatt GH, Pauker SG, Cook DJ. When should an effective 

treatment be used? Derivation of the threshold number needed to treat 

and the minimum event rate for treatment. J Clin Epidemiol 2001;54:253-62. 

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., 

Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net 



Members of the Evidence-Based Medicine Teaching Tips 

Working Group: Peter C. Wyer (project director), Columbia 

University College of Physicians and Surgeons, New York, NY; 

Deborah Cook, Gordon Guyatt (general editor), Ted Haines, 

Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose 

Hatala (internal review coordinator), Department of Medicine, 

University of British Columbia, Vancouver, BC; Robert Hayward 

(editor, online version), Bruce Fisher, University of Alberta, 

Edmonton, Alta.; Sheri Keitz (field-test coordinator), Durham 

Veterans Affairs Medical Center and Duke University, Durham, 

NC; Alexandra Barratt, University of Sydney, Sydney, Australia; 

Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; 

Antonio L. Dans, University of the Philippines College of 

Medicine, Manila, The Philippines; Barnet Eskin, Morristown 

Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory 

University, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, 

Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas 

McGinn, Mount Sinai Medical Center, New York, NY; Victor M. 

Montori, Department of Medicine, Mayo Clinic College of 

Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, 

Houston, Tex.; Thomas B. Newman, University of California, San 

Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, 

Ont.; W. Scott Richardson, Wright State University, Dayton, 

Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa 

Appendix 1: Formulas for commonly used measures of 

therapeutic effect 

Measure of effect         Formula
Relative risk             (Event rate in intervention group) ÷ (event rate in control group)
Relative risk reduction   1 – relative risk, or (absolute risk reduction) ÷ (event rate in control group)
Absolute risk reduction   (Event rate in control group) – (event rate in intervention group)
Number needed to treat    1 ÷ (absolute risk reduction)



DOI:10.1503/cmaj.1031667 


2. Measures of precision (confidence intervals) 

In the first article in this series, 1 we presented an approach 

to understanding how to estimate a treatment’s 

effectiveness that covered relative risk reduction, absolute 

risk reduction and number needed to treat. But how 

precise are these estimates of treatment effect? 

In reading the results of clinical trials, clinicians often 

come across 2 related but different statistical measures of an 

estimate’s precision: p values and confidence intervals. The p 

value describes how often apparent differences in treatment 

effect that are as large as or larger than those observed in a 

particular trial will occur in a long run of identical trials if in 

fact no true effect exists. If the observed differences are sufficiently 

unlikely to occur by chance alone, investigators reject 

the hypothesis that there is no effect. For example, consider 

a randomized trial comparing diuretics with placebo 

that finds a 25% relative risk reduction for stroke with a p 

value of 0.04. This p value means that, if diuretics were in 

fact no different in effectiveness than placebo, we would expect, 

by the play of chance alone, to observe a reduction — 

or increase — in relative risk of 25% or more in 4 out of 

100 identical trials. 
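
To make this definition concrete, the sketch below computes a two-sided p value for a difference between 2 event rates using a normal approximation (a two-proportion z test). It is an illustration only: the article does not name a specific test, the function name is an assumption, and the patient counts are hypothetical numbers chosen to give a 25% relative risk reduction, so the resulting p value is not the 0.04 of the example above.

```python
import math


def two_sided_p_value(events_control, n_control, events_treated, n_treated):
    """Two-sided p value for a difference in event rates (normal approximation)."""
    rate_control = events_control / n_control
    rate_treated = events_treated / n_treated
    # Pooled event rate under the hypothesis that no true effect exists
    pooled = (events_control + events_treated) / (n_control + n_treated)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treated))
    z = (rate_control - rate_treated) / std_error
    # Probability of a difference at least this large in either direction
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical trial: 200/1000 strokes with placebo, 150/1000 with diuretics
p = two_sided_p_value(200, 1000, 150, 1000)
print(f"p = {p:.4f}")
```

The same relative risk reduction observed in a much smaller trial would give a larger p value, which anticipates the link between sample size and precision discussed in the tips that follow.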

Although they are useful for investigators planning how 

large a study needs to be to demonstrate a particular magnitude 

of effect, p values fail to provide clinicians and patients 

with the information they most need, i.e., the range 

of values within which the true effect is likely to reside. 

However, confidence intervals provide exactly that information 

in a form that pertains directly to the process of deciding 

whether to administer a therapy to patients. If the 

range of possible true effects encompassed by the confidence 

interval is overly wide, the clinician may choose to 

administer the therapy only selectively or not at all. 

Confidence intervals are therefore the topic of this article. 

For a nontechnical explanation of p values and their 

limitations, we refer interested readers to the Users’ Guides 

to the Medical Literature. 2 

As with the first article in this series, 1 we present the information 

as a series of “tips” or exercises. This means that 

you, the reader, will have to do some work in the course of 

reading the article. The tips we present here have been 

adapted from approaches developed by educators experienced 

in teaching evidence-based medicine skills to clinicians. 

2-4 A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/6/611/DC1.

Review

Synthèse

Victor M. Montori, Jennifer Kleinbart, Thomas B. Newman, Sheri Keitz, Peter C. Wyer, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group


Making confidence intervals intuitive 

• Understand the dynamic relation between confidence 

intervals and sample size. 

Interpreting confidence intervals 

• Understand how the confidence intervals around estimates 

of treatment effect can affect therapeutic decisions. 

Estimating confidence intervals for extreme proportions

• Learn a shortcut for estimating the upper limit of the 

95% confidence intervals for proportions with very 

small numerators and for proportions with numerators 

very close to the corresponding denominators. 

Tip 1: Making confidence intervals intuitive 

Imagine a hypothetical series of 5 trials (of equal duration 

but different sample sizes) in which investigators have 

experimented with treatments for patients who have a particular 

condition (elevated low-density lipoprotein cholesterol) 

to determine whether a drug (a novel cholesterol-lowering

agent) would work better than a placebo to 

prevent strokes (Table 1A). The smallest trial enrolled only 








CMAJ • SEPT. 14, 2004; 171 (6) 611 



Montori et al 

8 patients, and the largest enrolled 2000 patients, and half 

of the patients in each trial underwent the experimental 

treatment. Now imagine that all of the trials showed a relative 

risk reduction for the treatment group of 50% (meaning 

that patients in the drug treatment group were only half 

as likely as those in the placebo group to have a stroke). In 

each individual trial, how confident can we be that the true 

value of the relative risk reduction is important for patients 

(i.e., “patient-important”)? 5 If you were to look at the studies 

individually, which ones would lead you to recommend 

the treatment unequivocally to your patients? 

Most clinicians might intuitively guess that we could be 

more confident in the results of the larger trials. Why is this? 

In the absence of bias or systematic error, the results of a trial 

can be interpreted as an estimate of the true magnitude of effect 

that would occur if all possible eligible patients had been 

included. When only a few of these patients are included, the 

play of chance alone may lead to a result that is quite different 

from the true value. Confidence intervals are a numeric 

measure of the range within which such variation is likely to 

occur. The 95% confidence intervals that we often see in 

biomedical publications represent the range within which we 

are likely to find the underlying true treatment effect. 
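The play of chance described above can be made concrete with a small simulation. The sketch below is illustrative only: it assumes true stroke risks of 50% with placebo and 25% with the drug (the rates observed in Table 1A), uses a fixed random seed, and summarizes each simulated trial by its absolute risk difference (chosen here because a relative measure is undefined when a tiny control group happens to have no events). The function name and parameters are hypothetical.

```python
import random
import statistics

def simulate_risk_differences(n_per_arm, control_risk=0.50, treatment_risk=0.25,
                              n_trials=200, seed=1):
    """Simulate repeated trials; return each trial's observed absolute risk difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(n_trials):
        control_events = sum(rng.random() < control_risk for _ in range(n_per_arm))
        treatment_events = sum(rng.random() < treatment_risk for _ in range(n_per_arm))
        results.append(control_events / n_per_arm - treatment_events / n_per_arm)
    return results

# Assumed true difference is 0.25 in both settings; only the sample size changes.
small = simulate_risk_differences(n_per_arm=4)
large = simulate_risk_differences(n_per_arm=1000)

spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"4 per arm:    results range from {min(small):.2f} to {max(small):.2f}")
print(f"1000 per arm: results range from {min(large):.2f} to {max(large):.2f}")
print(f"Standard deviation of results: {spread_small:.3f} v. {spread_large:.3f}")
```

With only 4 patients per group, individual simulated trials can land far from the assumed true difference, and some may even favour placebo; with 1000 per group the results cluster tightly around it.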

To gain a better appreciation of confidence intervals, go 

back to Table 1A (don’t look yet at Table 1B!) and take a 

guess at what you think the confidence intervals might be 

for the 5 trials presented. In a moment you’ll see how your estimates compare to 95% confidence intervals calculated using a formula, but for now, try figuring out intervals that you intuitively feel to be appropriate.

Table 1A: Relative risk and relative risk reduction observed in 5 successively larger hypothetical trials

Control event rate   Treatment event rate   Relative risk, %   Relative risk reduction, %*
2/4                  1/4                    50                 50
10/20                5/20                   50                 50
20/40                10/40                  50                 50
50/100               25/100                 50                 50
500/1000             250/1000               50                 50

*Calculated as the absolute difference between the control and treatment event rates (expressed as a fraction or a percentage), divided by the control event rate. In the first row in this table, relative risk reduction = (2/4 – 1/4) ÷ 2/4 = 1/2 or 50%. If the control event rate were 3/4 and the treatment event rate 1/4, the relative risk reduction would be (3/4 – 1/4) ÷ 3/4 = 2/3. Using percentages for the same example, if the control event rate were 75% and the treatment event rate were 25%, the relative risk reduction would be (75% – 25%) ÷ 75% = 67%.

Now, consider the first trial, in which 2 out of 4 patients who receive the control intervention and 1 out of 4 patients who receive the experimental treatment suffer a stroke. The risk in the treatment group is half that in the control group, which gives us a relative risk of 50% and a relative risk reduction of 50% (see Table 1A). 1,6

Given the substantial relative risk reduction, would you be ready to recommend this treatment to a patient? Before you answer this question, consider whether it is plausible, with so few patients in the study, that the investigators might just have gotten lucky and the true treatment effect is really a 50% increase in relative risk. In other words, is it plausible that the true event rate in the group that received treatment was 3 out of 4 instead of 1 out of 4? If you accept that this large, harmful effect might represent the underlying truth, would you also accept that a relative risk reduction of 90%, i.e., a very large benefit of treatment, is consistent with the experimental data in these few patients? To the extent that these suggestions are plausible, we can intuitively create a range of plausible truth of “-50% to 90%” surrounding the relative risk reduction of 50% that was actually observed.

Now, do this for each of the other 4 trials. In the trial with 20 patients in each group, 10 of those in the control group suffered a stroke, as did 5 of those in the treatment group. Both the relative risk and the relative risk reduction are again 50%. Do you still consider it plausible that the true event rate in the treatment group is 15 out of 20 rather than 5 out of 20 (the same proportions as we considered in the smaller trial)? If not, what about 12 out of 20? The latter would represent a 20% increase in risk over the control rate (12/20 v. 10/20). A true relative risk reduction of 90% may still be plausible, given the observed results and the numbers of patients involved. In short, given this larger number of patients and the lower chance of a “bad sample,” the “range of plausible truth” around the observed relative risk reduction of 50% might be narrower, perhaps from a relative risk increase of 20% (represented as –20%) to a relative risk reduction of 90%.

You can develop similar intuitively derived confidence intervals for the larger trials. We’ve done this in Table 1B, which also shows the 95% confidence intervals that we calculated using a statistical program called StatsDirect (available commercially through www.statsdirect.com). You can see that in some instances we intuitively overestimated or underestimated the intervals relative to those we derived using the statistical formulas.

Table 1B: Confidence intervals (CIs) around the relative risk reduction in 5 successively larger hypothetical trials

                                                                                            CI around relative risk reduction, %
Control event rate   Treatment event rate   Relative risk, %   Relative risk reduction, %   Intuitive CI*   Calculated 95% CI*†
2/4                  1/4                    50                 50                           –50 to 90       –174 to 92
10/20                5/20                   50                 50                           –20 to 90       –14 to 79.5
20/40                10/40                  50                 50                           0 to 90         9.5 to 73.4
50/100               25/100                 50                 50                           20 to 80        26.8 to 66.4
500/1000             250/1000               50                 50                           40 to 60        43.5 to 55.9

*Negative values represent an increase in risk relative to control. See text for further explanation.
†Calculated by statistical software.
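For readers who want to see roughly how such intervals arise, here is a minimal sketch. It assumes a common large-sample approximation based on the logarithm of the relative risk; the text does not say which method StatsDirect applied, so the function below (a hypothetical name) should not be expected to reproduce the calculated column of Table 1B exactly, particularly for the smallest trials.

```python
import math

def rrr_with_ci(control_events, control_n, treatment_events, treatment_n, z=1.96):
    """Return (RRR, lower limit, upper limit) as fractions.

    Assumption: log relative risk approximation, which needs at least
    one event in each group.
    """
    control_risk = control_events / control_n
    treatment_risk = treatment_events / treatment_n
    relative_risk = treatment_risk / control_risk
    # Approximate standard error of ln(relative risk).
    se_log_rr = math.sqrt(1 / treatment_events - 1 / treatment_n
                          + 1 / control_events - 1 / control_n)
    rr_low = math.exp(math.log(relative_risk) - z * se_log_rr)
    rr_high = math.exp(math.log(relative_risk) + z * se_log_rr)
    # RRR = 1 - RR, so the upper RR limit gives the lower RRR limit.
    return 1 - relative_risk, 1 - rr_high, 1 - rr_low

# The 5 hypothetical trials: (control events, control n, treatment events, treatment n).
trials = [(2, 4, 1, 4), (10, 20, 5, 20), (20, 40, 10, 40),
          (50, 100, 25, 100), (500, 1000, 250, 1000)]

for c_events, c_n, t_events, t_n in trials:
    rrr, low, high = rrr_with_ci(c_events, c_n, t_events, t_n)
    print(f"{c_events}/{c_n} v. {t_events}/{t_n}: "
          f"RRR {rrr:.0%}, approximate 95% CI {low:.1%} to {high:.1%}")
```

The point estimate is 50% in every row; only the width of the interval changes as the number of patients and events grows.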


Confidence intervals inform clinicians about the range 

within which the true treatment effect might plausibly lie, 

given the trial data. Greater precision (narrower confidence 

intervals) results from larger sample sizes and the consequently

larger number of events. Statisticians (and statistical software)

can calculate 95% confidence intervals around any 

estimate of treatment effect. 

Tip 2: Interpreting confidence intervals

You should now have an understanding 

of the relation between the 

width of the confidence interval 

around a measure of outcome in a 

clinical trial and the number of participants 

and events in that study. 

You are ready to consider whether a 

study is sufficiently large, and the resulting 

confidence intervals sufficiently 

narrow, to reach a definitive 

conclusion about recommending the 

therapy, after taking into account 

your patient’s values, preferences and 

circumstances. 

The concept of a minimally important 

treatment effect proves useful 

in considering the issue of when a 

study is large enough and has therefore 

generated confidence intervals 

that are narrow enough to recommend 

for or against the therapy. This 

concept requires the clinician to 

think about the smallest amount of 

benefit that would justify therapy. 

Consider a set of hypothetical trials. 

Fig. 1A displays the results of trial 

1. The uppermost point of the bell 

curve is the observed treatment effect 

(the point estimate), and the tails of 

the bell curve represent the boundaries 

of the 95% confidence interval. 

For the medical condition being investigated, 

assume that a 1% absolute 

risk reduction is the smallest benefit 

that patients would consider to outweigh 

the downsides of therapy. 

Given the information in Fig. 1A, 


Tips for EBM learners: confidence intervals 

would you recommend this treatment to your patients if 

the point estimate represented the truth? What if the upper 

boundary of the confidence interval represented the truth? 

Or the lower boundary? 

For all 3 of these questions, the answer is yes, provided 

that 1% is in fact the smallest patient-important difference. 

Thus, the trial is definitive and allows a strong inference 

about the treatment decision. 

In the case of trial 2 (see Fig. 1B), would your patients 

choose to undergo the treatment if either the point estimate 

or the upper boundary of the confidence interval represented 

the true effect? What about the lower boundary? The answer 

regarding the lower boundary is no, because the effect 

is less than the smallest difference that patients would consider 

large enough for them to undergo the treatment.

[Fig. 1, panels A to C: bell curves for trials 1 to 4 plotted against % absolute risk reduction, on an axis running from –5 (treatment harms) to 5 (treatment helps).]

Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation, an absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients would consider important enough to warrant undergoing treatment. In each case, the uppermost point of the bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. See text for further explanation.

Although trial 2 shows a “positive” result (i.e., the confidence

interval does not encompass zero), the sample size was inadequate 

and the result remains compatible with risk reductions 

below the minimal patient-important difference. 

When a study result is positive, you can determine 

whether the sample size was adequate by checking the lower 

boundary of the confidence interval, the smallest plausible 

treatment effect compatible with the results. If this value is 

greater than the smallest difference your patients would 

consider important, the sample size is adequate and the trial 

result definitive. However, if the lower boundary falls below 

the smallest patient-important difference, leaving patients 

uncertain as to whether taking the treatment is in their best 

interest, the trial is not definitive. The sample size is inadequate, 

and further trials are required. 

What happens when the confidence interval for the effect 

of a therapy includes zero (where zero means “no effect” 

and hence a negative result)? 

For studies with negative results — those that do not exclude 

a true treatment effect of zero — you must focus on 

the other end of the confidence interval, that representing 

the largest plausible treatment effect consistent with the 

trial data. You must consider whether the upper boundary 

of the confidence interval falls below the smallest difference 

that patients might consider important. If so, the sample 

size is adequate, and the trial is definitively negative (see 

trial 3 in Fig. 1C). Conversely, if the upper boundary exceeds 

the smallest patient-important difference, then the 

trial is not definitively negative, and more trials with larger 

sample sizes are needed (see trial 4 in Fig. 1C). 


To determine whether a trial with a positive result is sufficiently 

large, clinicians should focus on the lower boundary of 

the confidence interval and determine if it is greater than the 

smallest treatment benefit that patients would consider important 

enough to warrant taking the treatment. For studies 

with a negative result, clinicians should examine the upper 

boundary of the confidence interval to determine if this value 

is lower than the smallest treatment benefit that patients 

would consider important enough to warrant taking the treatment. 

In either case, if the confidence interval overlaps the 

smallest treatment benefit that is important to patients, then 

the study is not definitive and a larger study is needed. 
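The decision rule in this tip can be written out as a short sketch. The function name and the example intervals are hypothetical (they are not read from Fig. 1); all values are absolute risk reductions in percentage points, and the case in which the whole interval lies below zero is flagged separately because the text does not discuss it.

```python
def interpret_trial(ci_lower, ci_upper, smallest_important_benefit):
    """Classify a trial from the 95% CI of its absolute risk reduction."""
    if ci_lower > 0:
        # Positive result: the interval excludes zero, so check the lower boundary.
        if ci_lower > smallest_important_benefit:
            return "definitive positive"
        return "positive but not definitive"
    if ci_upper < 0:
        # Whole interval indicates harm; not covered by the rule in the text.
        return "harm (outside the rule described in the text)"
    # Negative result: the interval includes zero, so check the upper boundary.
    if ci_upper < smallest_important_benefit:
        return "definitive negative"
    return "negative but not definitive"

# Hypothetical intervals, judged against a 1% smallest patient-important benefit.
examples = {
    "narrow, clearly beneficial": (2.0, 6.0),
    "beneficial, lower limit below threshold": (0.5, 9.0),
    "includes zero, upper limit below threshold": (-1.0, 0.5),
    "includes zero, upper limit above threshold": (-2.0, 4.0),
}
for label, (lower, upper) in examples.items():
    print(f"{label}: {interpret_trial(lower, upper, 1.0)}")
```

In every "not definitive" case the interval overlaps the threshold, which is the situation in which the text calls for a larger study.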

Table 2: The 3/n rule to estimate the upper limit of the 95% confidence interval (CI) for proportions with 0 in the numerator

n      Observed proportion   3/n      Upper limit of 95% CI
20     0/20                  3/20     0.15 or 15%
100    0/100                 3/100    0.03 or 3%
300    0/300                 3/300    0.01 or 1%
1000   0/1000                3/1000   0.003 or 0.3%


Tip 3: Estimating confidence intervals for extreme proportions

When reviewing journal articles, readers often encounter 

proportions with small numerators or with numerators very 

close in size to the denominators. Both situations raise the 

same issue. For example, an article might assert that a treatment 

is safe because no serious complications occurred in the 

20 patients who received it; another might claim near-perfect 

sensitivity for a test that correctly identified 29 out of 30 

cases of a disease. However, in many cases such articles do 

not present confidence intervals for these proportions. 

The first step of this tip is to learn the “rule of 3” for 

zero numerators, 7 and the next step is to learn an extension 

(which might be called the “rule of 5, 7, 9 and 10”) for numerators 

of 1, 2, 3 and 4. 8 

Consider the following example. Twenty people undergo 

surgery, and none suffer serious complications. Does 

this result allow us to be confident that the true complication 

rate is very low, say less than 5% (1 out of 20)? What 

about 10% (2 out of 20)? 

You will probably appreciate that if the true complication 

rate were 5% (1 in 20), it wouldn’t be that unusual to 

observe no complications in a sample of 20, but for increasingly 

higher true rates, the chance of observing no complications

in a sample of 20 gets increasingly smaller.

What we are after is the upper limit of a 95% confidence 

interval for the proportion 0/20. The following is a 

simple rule for calculating this upper limit: if an event occurs 

0 times in n subjects, the upper boundary of the 95% 

confidence interval for the event rate is about 3/n (Table 2). 
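To see why 3/n works, it may help to compare it with a direct calculation. The sketch below assumes one common way of deriving the rule: find the true event rate at which observing 0 events in n subjects would happen only 5% of the time. The function names are hypothetical, and the exact method used in the cited source is not reproduced here.

```python
def rule_of_three_upper(n):
    """Shortcut from the text: approximate upper 95% limit when 0 events occur in n subjects."""
    return 3 / n

def direct_upper_zero_events(n, confidence=0.95):
    """Solve (1 - p)**n = 1 - confidence for p (an assumed derivation of the rule)."""
    return 1 - (1 - confidence) ** (1 / n)

def prob_zero_events(true_rate, n):
    """Chance of observing no events in n subjects at a given true event rate."""
    return (1 - true_rate) ** n

for n in (20, 100, 300, 1000):
    print(f"n = {n}: 3/n = {rule_of_three_upper(n):.4f}, "
          f"direct calculation = {direct_upper_zero_events(n):.4f}")

# The surgery example: how surprising are 0 complications in 20 patients?
for rate in (0.05, 0.10, 0.15):
    print(f"True rate {rate:.0%}: chance of 0 complications in 20 = "
          f"{prob_zero_events(rate, 20):.3f}")
```

The shortcut slightly overstates the directly calculated limit for small n, and the two converge as n grows.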

You can use the same formula when the observed proportion 

is 100%, by translating 100% into its complement. 

For example, imagine that the authors of a study on a diagnostic 

test report 100% sensitivity when the test is performed 

for 20 patients who have the disease. That means 

that the test identified all 20 with the disease as positive and 

identified none as falsely negative. You would like to know 

how low the sensitivity of the test could be, given that it 

was 100% for a sample of 20 patients. Using the 3/n rule for the proportion of false negatives (0 out of 20), we find that the proportion of false negatives could be as high as 15% (3 out of 20). Subtract this result from 100% to obtain the lower limit of the 95% confidence interval for the sensitivity (in this example, 85%).

Table 3: Method for obtaining an approximation of the upper limit of the 95% CI*

Observed numerator   Numerator for calculating approximate upper limit of 95% CI
0                    3
1                    5
2                    7
3                    9
4                    10

*For any observed numerator listed in the left hand column, divide the corresponding numerator in the right hand column by the number of study subjects to get the approximate upper limit of the 95% CI. For example, if the sample size is 15 and the observed numerator is 3, the upper limit of the 95% confidence interval is approximately 9 ÷ 15 = 0.6 or 60%.

What if the numerator is not zero but is still very small? 

There is a shortcut rule for small numerators other than 

zero (i.e., 1, 2, 3 or 4) (Table 3). 

For example, out of 20 people receiving surgery imagine 

that 1 person suffers a serious complication, yielding an observed 

proportion of 1/20 or 5%. Using the corresponding 

value from Table 3 (i.e., 5) and the sample size, we find that 

the upper limit of the 95% confidence interval will be 

about 5/20 or 25%. If 2 of the 20 (10%) had suffered complications, 

the upper limit would be about 7/20, or 35%. 
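The shortcut in Table 3, together with the complement trick for proportions close to 100%, can be sketched as a small lookup. The function names are hypothetical, and capping the result at 100% is an added safeguard for very small samples rather than part of the published rule.

```python
# Numerators from Table 3: observed numerator -> numerator for the approximate upper limit.
SHORTCUT_NUMERATORS = {0: 3, 1: 5, 2: 7, 3: 9, 4: 10}

def approx_upper_limit(events, n):
    """Approximate upper limit of the 95% CI for events/n (observed numerators 0 to 4 only)."""
    if events not in SHORTCUT_NUMERATORS:
        raise ValueError("the shortcut covers observed numerators 0 to 4 only")
    # Cap at 1.0 (an added safeguard, not part of the rule in the text).
    return min(1.0, SHORTCUT_NUMERATORS[events] / n)

def approx_lower_limit_near_100(successes, n):
    """Approximate lower 95% limit for a proportion near 100%, via its complement."""
    return 1 - approx_upper_limit(n - successes, n)

print(f"0/20 complications: upper limit about {approx_upper_limit(0, 20):.0%}")
print(f"1/20 complications: upper limit about {approx_upper_limit(1, 20):.0%}")
print(f"2/20 complications: upper limit about {approx_upper_limit(2, 20):.0%}")
print(f"20/20 sensitivity:  lower limit about {approx_lower_limit_near_100(20, 20):.0%}")
print(f"29/30 sensitivity:  lower limit about {approx_lower_limit_near_100(29, 30):.0%}")
```

The last line applies the same reasoning to the 29 out of 30 sensitivity example mentioned at the start of this tip: 1 missed case in 30 is treated as an observed numerator of 1.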


Although statisticians (and statistical software) can calculate 

95% confidence intervals, clinicians can readily estimate 

the upper boundary of confidence intervals for proportions 

with very small numerators. These estimates highlight the 

greater precision attained with larger sample sizes and help 

to calibrate intuitively derived confidence intervals. 

Conclusions 

Clinicians need to understand and interpret confidence 

intervals to properly use research results in making decisions. 

They can use thresholds, based on differences that 

patients are likely to consider important, to interpret confidence 

intervals and to judge whether the results are definitive 

or whether a larger study (with more patients and 

events) is necessary. For proportions with extremely small 

numerators, a simple rule is available for estimating the upper 

limit of the confidence interval. 


From the Department of Medicine, Mayo Clinic College of Medicine, Rochester, 

Minn. (Montori); the Hospital Medicine Unit, Division of General Medicine, 

Emory University, Atlanta, Ga. (Kleinbart); the Departments of Epidemiology and 

Biostatistics and of Pediatrics, University of California, San Francisco, San Francisco, 

Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University 

Medical Center, Durham, NC (Keitz); the Columbia University College of 

Physicians and Surgeons, New York, NY (Wyer); the Department of Pediatrics, 

University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine 


Ont. (Guyatt) 


Contributors: Victor Montori, as principal author, decided on the structure and 

flow of the article, and oversaw and contributed to the writing of the manuscript. 

Jennifer Kleinbart reviewed the manuscript at all phases of development and contributed 

to the writing of tip 1. Thomas Newman developed the original idea for 

tip 3 and reviewed the manuscript at all phases of development. Sheri Keitz used 

all of the tips as part of a live teaching exercise and submitted comments, suggestions 

and the possible variations that are described in the article. Peter Wyer reviewed 

and revised the final draft of the manuscript to achieve uniform adherence 

with format specifications. Virginia Moyer reviewed and revised the final draft of 

the manuscript to improve clarity and style. Gordon Guyatt developed the original 

ideas for tips 1 and 2, reviewed the manuscript at all phases of development, contributed 

to the writing as coauthor, and reviewed and revised the final draft of the 

manuscript to achieve accuracy and consistency of content as general editor. 

References 


1. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for 

learners of evidence-based medicine: 1. Relative risk reduction, absolute risk 

reduction and number needed to treat. CMAJ 2004;171(4):353-8. 

2. Guyatt G, Jaeschke R, Cook D, Walter S. Therapy and understanding the results: hypothesis testing. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 329-38.

3. Guyatt G, Walter S, Cook D, Jaeschke R. Therapy and understanding the results: confidence intervals. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 339-49.


for learning and teaching evidence-based medicine: introduction to the series 

[editorial]. CMAJ 2004;171(4):347-8. 

5. Guyatt G, Montori V, Devereaux PJ, Schunemann H, Bhandari M. Patients at the 

center: in our practice, and in our use of language. ACP J Club 2004;140:A11-2. 

6. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures 

of association. In: Guyatt G, Rennie D, editors. Users’ guides to the medical 

literature: a manual of evidence-based clinical practice. Chicago: AMA Press; 

2002. p. 351-68. 

7. Hanley J, Lippman-Hand A. If nothing goes wrong, is everything all right? 

Interpreting zero numerators. JAMA 1983;249:1743-5. 

8. Newman TB. If almost nothing goes wrong, is almost everything all right? 

[letter]. JAMA 1995;274:1013. 

Members of the Evidence-Based Medicine Teaching Tips Working 

Group: Peter C. Wyer (project director), College of Physicians and 

Surgeons, Columbia University, New York, NY; Deborah Cook, 

Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, 

McMaster University, Hamilton, Ont.; Rose Hatala (internal 

review coordinator), University of British Columbia, Vancouver, 

BC; Robert Hayward (editor, online version), Bruce Fisher, 

University of Alberta, Edmonton, Alta.; Sheri Keitz (field test 

coordinator), Durham Veterans Affairs Medical Center and Duke 

University Medical Center, Durham, NC; Alexandra Barratt, 

University of Sydney, Sydney, Australia; Pamela Charney, Albert 

Einstein College of Medicine, Bronx, NY; Antonio L. Dans, 

University of the Philippines College of Medicine, Manila, The 

Philippines; Barnet Eskin, Morristown Memorial Hospital, 

Morristown, NJ; Jennifer Kleinbart, Emory University School of 

Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, 



Montori, Mayo Clinic College of Medicine, Rochester, Minn.; 

Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. 

Newman, University of California, San Francisco, San Francisco, 

Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; 

Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; 

W. Scott Richardson, Wright State University, Dayton, Ohio; Mark 

C. Wilson, University of Iowa, Iowa City, Iowa 

Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net

Articles to date in this series

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.


Correspondance 



DOI:10.1503/cmaj.1041329 

Online access to a for-profit CMAJ

Wayne Kondro, quoting CMA Secretary-General 

Bill Tholl, reports 

that “Physicians will continue to receive 

their free subscription to CMAJ as a benefit 

of association membership ‘for the 

foreseeable future’” after CMA Publications 

is sold to CMA Holdings in January 

2004. 1 That’s all to the good — but what 

then of CMAJ’s worldwide readers? Will 

access to CMAJ remain free for all online 

users, despite the shift to for-profit status? 

I found it strange that this issue was not 

addressed in Kondro’s news article. 

Adam L. Scheffler 

Independent researcher 

Chicago, Ill. 

Reference 

1. Kondro W. CMAJ enters for-profit market. 

CMAJ 2004;171(11):1334. 

DOI:10.1503/cmaj.1041759 

[Editor’s note] 

CMAJ’s editors have addressed the 

topic of open access in this issue’s 

Editorial (see page 149). 

DOI:10.1503/cmaj.1041760 

Correction

In part 2 of the series “Tips for learners of evidence-based medicine” 1 the information in Fig. 1 did not fully correspond with the information provided in the text. Specifically, the data for hypothetical trial 2 in Fig. 1B should have been centred at 5% absolute risk reduction, as described in the text; instead, the figure showed trial 2 as being centred at about 6.5% absolute risk reduction. The corrected figure is presented here.

162 JAMC • 18 JANV. 2005; 172 (2)


Reference 

1. Montori VM, Kleinbart J, Newman TB, Keitz S, 

Wyer PC, Moyer V, et al. Tips for learners of 

evidence-based medicine: 2. Measures of precision 

(confidence intervals). CMAJ 2004;171(6): 

611-5. 

DOI:10.1503/cmaj.1041761 

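[Corrected Fig. 1, panels A to C: bell curves for trials 1 to 4 plotted against % absolute risk reduction, on an axis running from –5 (treatment harms) to 5 (treatment helps).]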

Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation, 

an absolute risk reduction of 1% (double vertical rule) is the smallest benefit 

that patients would consider important enough to warrant undergoing treatment. In 

each case, the uppermost point of the bell curve is the observed treatment effect 

(the point estimate), and the tails of the bell curve represent the boundaries of the 

95% confidence interval. See the text 1 for further explanation.

DOI:10.1503/cmaj.1031981 


3. Measures of observer variability (kappa statistic) 

Thomas McGinn, Peter C. Wyer, Thomas B. Newman, Sheri Keitz, Rosanne Leipzig, 

Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group 

Imagine that you’re a busy family physician and that 

you’ve found a rare free moment to scan the recent literature. 

Reviewing your preferred digest of abstracts, 

you notice a study comparing emergency physicians’ interpretation 

of chest radiographs with radiologists’ interpretations. 

1 The article catches your eye because you have frequently 

found that your own reading of a radiograph differs 

from both the official radiologist reading and an unofficial 

reading by a different radiologist, and you’ve wondered 

about the extent of this disagreement and its implications. 

Looking at the abstract, you find that the authors have reported 

the extent of agreement using the κ statistic. You recall 

that κ stands for “kappa” and that you have encountered this 

measure of agreement before, but your grasp of its meaning 

remains tentative. You therefore choose to take a quick glance 

at the authors’ conclusions as reported in the abstract and to 

defer downloading and reviewing the full text of the article. 

Practitioners, such as the family physician just described, 

may benefit from understanding measures of observer variability. 

For many studies in the medical literature, clinician 

readers will be interested in the extent of agreement among 

multiple observers. For example, do the investigators in a 

clinical study agree on the presence or absence of physical, 

radiographic or laboratory findings? Do investigators involved 

in a systematic overview agree on the validity of an 

article, or on whether the article should be included in the 

analysis? In perusing these types of studies, where investigators 

are interested in quantifying agreement, clinicians 

will often come across the kappa statistic. 

In this article we present tips aimed at helping clinical 

learners to use the concepts of kappa when applying diagnostic 

tests in practice. The tips presented here have been 

adapted from approaches developed by educators experienced 

in teaching evidence-based medicine skills to clinicians. 

2 A related article, intended for people who teach 

these concepts to clinicians, is available online at

www.cmaj.ca/cgi/content/full/171/11/1369/DC1.


Defining the importance of kappa 

• Understand the difference between measuring agreement 

and measuring agreement beyond chance. 

• Understand the implications of different values of kappa. 

Calculating kappa 


• Understand the basics of how the kappa score is 

calculated. 

• Understand the importance of “chance agreement” in 

estimating kappa. 

Calculating chance agreement 

• Understand how to calculate the kappa score given different 

distributions of positive and negative results. 

• Understand that the more extreme the distributions of 

positive and negative results, the greater the agreement 

that will occur by chance alone. 

• Understand how to calculate chance agreement, agreement 

beyond chance and kappa for any set of assessments 

by 2 observers. 

Tip 1: Defining the importance of kappa 

A common stumbling block for clinicians is the basic 

concept of agreement beyond chance and, in turn, the importance 

of correcting for chance agreement. People making 

a decision on the basis of presence or absence of an element 

of the physical examination, such as Murphy’s sign, 

will sometimes agree simply by chance. The kappa statistic 

corrects for this chance agreement and tells us how much 

of the possible agreement over and above chance the reviewers 

have achieved. 

A simple example should help to clarify the importance 

of correcting for chance agreement. Two radiologists independently 

read the same 100 mammograms. Reader 1 is 

having a bad day and reads all the films as negative without 

looking at them in great detail. Reader 2 reads the 








CMAJ • NOV. 23, 2004; 171 (11) 1369 



McGinn et al 

films more carefully and identifies 4 of the 100 mammograms 

as positive (suspicious for malignancy). How would 

you characterize the level of agreement between these 2 

radiologists? 

The percent agreement between them is 96%, even 

though one of the readers has, on cursory review, decided 

to call all of the results negative. Hence, measuring the 

simple percent agreement overestimates the degree of clinically 

important agreement in a fashion that is misleading. 

The role of kappa is to indicate how much the 2 observers 

agree beyond the level of agreement that could be expected 

by chance. Table 1 presents a rating system that is commonly 

used as a guideline for evaluating kappa scores. 
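The mammogram example can be checked numerically. The sketch below assumes the usual way of estimating chance agreement for 2 observers, from the proportion of films each reader calls positive and negative; the function and parameter names are hypothetical.

```python
def agreement_and_kappa(both_pos, only_reader1_pos, only_reader2_pos, both_neg):
    """Return (observed agreement, chance agreement, kappa) for 2 readers' paired calls."""
    total = both_pos + only_reader1_pos + only_reader2_pos + both_neg
    observed = (both_pos + both_neg) / total
    # Proportion of cases each reader calls positive.
    reader1_pos = (both_pos + only_reader1_pos) / total
    reader2_pos = (both_pos + only_reader2_pos) / total
    # Assumed chance agreement: both positive by chance plus both negative by chance.
    chance = reader1_pos * reader2_pos + (1 - reader1_pos) * (1 - reader2_pos)
    if chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa

# Reader 1 calls all 100 films negative; reader 2 calls 4 of them positive.
observed, chance, kappa = agreement_and_kappa(
    both_pos=0, only_reader1_pos=0, only_reader2_pos=4, both_neg=96)
print(f"Percent agreement: {observed:.0%}")
print(f"Agreement expected by chance: {chance:.0%}")
print(f"Kappa: {abs(kappa):.2f}")
```

Under this assumption, the agreement expected by chance is as high as the agreement observed, so none of the 96% agreement counts as agreement beyond chance.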

Purely to illustrate the range of kappa scores that readers 

can expect to encounter, Table 2 gives some examples of 

commonly reported assessments and the kappa scores that 

resulted when investigators studied their reproducibility. 


If clinicians neglect the possibility of chance agreement, 

they will come to misleading conclusions about the reproducibility 

of clinical tests. The kappa statistic allows us to 

measure agreement above and beyond that expected by 

chance alone. Examples of kappa scores for frequently ordered 

tests sometimes show surprisingly poor levels of 

agreement beyond chance. 

Table 1: Qualitative classification of kappa values as degree of agreement beyond chance 3

Kappa value   Degree of agreement beyond chance
0             None
0–0.2         Slight
0.2–0.4       Fair
0.4–0.6       Moderate
0.6–0.8       Substantial
0.8–1.0       Almost perfect

Table 2: Representative kappa values for common tests and clinical assessments

Assessment                                                           Kappa value
Interpretation of T wave changes on an exercise stress test 4        0.25
Presence of jugular venous distension 5                              0.56
Detection of alcohol dependence using CAGE questionnaire 6           0.75
Presence of goitre 7                                                 0.82–0.95
Bone marrow interpretation by hematologist 8                         0.84
Straight leg raising test 9                                          0.82
Diagnosis of pulmonary embolus by helical CT 10                      0.82
Diagnosis of lower extremity arterial disease by arteriography 11    0.39–0.64


Tip 2: Calculating kappa 

What is the maximum potential for agreement between 

2 observers doing a clinical assessment, such as 

presence or absence of Murphy’s sign in patients with 

abdominal pain? In Fig. 1, the upper horizontal bar represents 

100% agreement between 2 observers. For the hypothetical 

situation represented in the figure, the estimated 

chance agreement between the 2 observers is 50%. 

This would occur if, for example, each of the 2 observers 

randomly called half of the assessments positive. Given 

this information, what is the possible agreement beyond 

chance? 

The vertical line in Fig. 1 intersects the horizontal bars 

at the 50% point that we identified as the expected agreement 

by chance. All agreement to the right of this line corresponds 

to agreement beyond chance. Hence the maximum 

agreement beyond chance is 50% (100% – 50%). 

The other number you need to calculate the kappa score 

is the degree of agreement beyond chance. The observed 

agreement, as shown by the lower horizontal bar in Fig. 1, 

is 75%, so the degree of agreement beyond chance is 25% 

(75% – 50%). 

Kappa is calculated as the observed agreement beyond 

chance (25%) divided by the maximum agreement beyond 

chance (50%); here, kappa is 0.50. 

[Fig. 1 labels: agreement expected by chance, 50%; possible agreement above chance, 50%; observed agreement, 75%; observed agreement above chance, 25%; kappa = 25/50 = 0.5 (moderate agreement)]


Fig. 1: Two observers independently assess the presence or 

absence of a finding or outcome. Each observer determines 

that the finding is present in exactly 50% of the subjects. Their 

assessments agree in 75% of the cases. The yellow horizontal 

bar represents potential agreement (100%), and the turquoise 

bar represents actual agreement. The portion of each coloured 

bar that lies to the left of the dotted vertical line represents the 

agreement expected by chance (50%). The observed agreement 

above chance is half of the possible agreement above 

chance. The ratio of these 2 numbers is the kappa score.


Kappa allows us to measure agreement above and beyond 

that expected by chance alone. We calculate kappa by 

estimating the chance agreement and then comparing the 

observed agreement beyond chance with the maximum 

possible agreement beyond chance. 
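The arithmetic of this tip can be written out as a short Python sketch. It is an illustration only: the function name `kappa_from_agreement` is hypothetical, and agreement is expressed as a proportion between 0 and 1 rather than as a percentage.

```python
def kappa_from_agreement(observed: float, chance: float) -> float:
    """Kappa = observed agreement beyond chance / maximum agreement beyond chance.

    Both arguments are proportions (0.75 means 75% agreement).
    """
    if not 0 <= observed <= 1:
        raise ValueError("observed agreement must be between 0 and 1")
    if not 0 <= chance < 1:
        raise ValueError("chance agreement must be at least 0 and below 1")
    observed_beyond_chance = observed - chance   # 75% - 50% in Fig. 1
    maximum_beyond_chance = 1 - chance           # 100% - 50% in Fig. 1
    return observed_beyond_chance / maximum_beyond_chance
```

With the values from Fig. 1, `kappa_from_agreement(0.75, 0.50)` gives 0.5.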

Tip 3: Calculating chance agreement 

A conceptual understanding of kappa may still leave the 

actual calculations a mystery. The following example is intended 

for those who desire a more complete understanding 

of the kappa statistic. 

Let us assume that 2 hopeless clinicians are assessing the 

presence of Murphy’s sign in a group of patients. They 

have no idea what they are doing, and their evaluations are 

no better than blind guesses. Let us say they are each 

guessing the presence and absence of Murphy’s sign in a 

50:50 ratio: half the time they guess that Murphy’s sign is 

present, and the other half that it is absent. If you were 

completing a 2 × 2 table, with these 2 clinicians evaluating 

the same 100 patients, how would the cells, on average, get 

filled in? 

Fig. 2 represents the completed 2 × 2 table. Guessing at 

random, the 2 hopeless clinicians have agreed on the assessments 

of 50% of the patients. How did we arrive at the 

numbers shown in the table? According to the laws of 

chance, each clinician guesses that half of the 50 patients 

assessed as positive by the other clinician (i.e., 25 patients) 

have Murphy’s sign. 

                           Clinician 2
                    Sign present   Sign absent   Total
Clinician 1
  Sign present           25             25         50
  Sign absent            25             25         50
  Total                  50             50

Fig. 2: Agreement table for 2 hopeless clinicians who randomly guess whether Murphy’s sign is present or absent in 100 patients with abdominal pain. Each clinician determines that half of the patients have a positive result. The numbers in each box reflect the number of patients in each agreement category.

How would this exercise work if the same 2 hopeless clinicians were to randomly guess that 60% of the patients had a positive result for Murphy’s sign? Fig. 3 provides the answer in this situation. The clinicians would agree for 52 of the 100 patients (or 52% of the time) and would disagree for 48 of the patients. In a similar way, using 2 × 2 tables for higher and higher positive proportions (i.e., how often the observer makes the diagnosis), you can figure out how often the observers will, on average, agree by chance alone (as delineated in Table 3).

At this point, we have demonstrated 2 things. First, even 

if the reviewers have no idea what they are doing, there will 

be substantial agreement by chance alone. Second, the 

magnitude of the agreement by chance increases as the 

proportion of positive (or negative) assessments increases. 
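The chance agreement figures discussed here follow from a simple rule: two observers who guess independently agree when both call a patient positive or both call the patient negative. The Python sketch below illustrates that rule. The function name `chance_agreement` and the dictionary `TABLE_3` are hypothetical labels, and allowing the 2 observers to have different positive rates is an extension assumed here, not something worked through in the text.

```python
def chance_agreement(p_first, p_second=None):
    """Expected agreement (as a proportion) between 2 observers guessing at random.

    p_first and p_second are the proportions of patients each observer
    calls positive; if p_second is omitted, both observers use p_first.
    """
    if p_second is None:
        p_second = p_first
    for p in (p_first, p_second):
        if not 0 <= p <= 1:
            raise ValueError("proportions must be between 0 and 1")
    both_positive = p_first * p_second
    both_negative = (1 - p_first) * (1 - p_second)
    return both_positive + both_negative


# Chance agreement (%) for successively higher rates of a positive call.
TABLE_3 = {
    percent: round(100 * chance_agreement(percent / 100))
    for percent in (50, 60, 70, 80, 90)
}
```

Under these assumptions `TABLE_3` holds 50, 52, 58, 68 and 82 for positive rates of 50%, 60%, 70%, 80% and 90%, which shows how chance agreement climbs as the calls become more lopsided.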

But how can we calculate kappa when the clinicians 

whose assessments are being compared are no longer 

“hopeless,” in other words, when their assessments reflect a 

level of expertise that one might actually encounter in practice? 

It’s not very hard. 

Let’s take a simple example, returning to the premise 

that each of the 2 clinicians assesses Murphy’s sign as being 

present in 50% of the patients. Here, we assume that 

the 2 clinicians now have some knowledge of Murphy’s 

sign and their assessments are no longer random. Each 

decides that 50% of the patients have Murphy’s sign and 

50% do not, but they still don’t agree on every patient. 

Rather, for 40 patients they agree that Murphy’s sign is 

present, and for 40 patients they agree that Murphy’s sign 

is absent. Thus, they agree on the diagnosis for 80% of 

the patients, and they disagree for 20% of the patients 

(see Fig. 4A). How do we calculate the kappa score in this 

situation? 

                           Clinician 2
                    Sign present   Sign absent   Total
Clinician 1
  Sign present           36             24         60
  Sign absent            24             16         40
  Total                  60             40

Fig. 3: As in Fig. 2, the 2 clinicians again guess at random whether Murphy’s sign is present or absent. However, each clinician now guesses that the sign is present in 60 of the 100 patients. Under these circumstances, of the 60 patients for whom clinician 1 guesses that the sign is present, clinician 2 guesses that it is present in 60%; 60% of 60 is 36 patients. Of the 60 patients for whom clinician 1 guesses that the sign is present, clinician 2 guesses that it is absent in 40%; 40% of 60 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is present in 60%; 60% of 40 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is absent in 40%; 40% of 40 is 16 patients.

Recall that if each clinician found that 50% of the patients had Murphy’s sign but their decision about the presence of the sign in each patient was random, the clinicians would be in agreement 50% of the time, each cell of the 2 × 2 table would have 25 patients (as shown in Fig. 2), chance agreement would be 50%, and maximum agreement beyond chance would also be 50%.
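The cell counts expected under random guessing (Fig. 2 and Fig. 3) can be reproduced from the row and column totals alone. The Python sketch below is a hedged illustration; the function name `expected_counts` and the dictionary keys are hypothetical, and the calculation assumes that the 2 clinicians' guesses are independent of each other.

```python
def expected_counts(n, first_positive, second_positive):
    """Expected 2 x 2 cell counts when 2 observers guess independently.

    n is the number of patients; first_positive and second_positive are
    the numbers of patients each observer calls positive.
    """
    if not (0 <= first_positive <= n and 0 <= second_positive <= n) or n <= 0:
        raise ValueError("totals must be between 0 and n, with n above 0")
    first_negative = n - first_positive
    second_negative = n - second_positive
    # Each cell is (row total x column total) / number of patients.
    return {
        "both_present": first_positive * second_positive / n,
        "first_only": first_positive * second_negative / n,
        "second_only": first_negative * second_positive / n,
        "both_absent": first_negative * second_negative / n,
    }
```

Calling `expected_counts(100, 50, 50)` gives 25 patients in every cell, as in Fig. 2, and `expected_counts(100, 60, 60)` gives 36, 24, 24 and 16, as in Fig. 3.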

The no-longer-hopeless clinicians’ agreement on 80% 

of the patients is therefore 30% above chance. Kappa is a 

comparison of the observed agreement above chance with 

the maximum agreement above chance: 30%/50% = 60% 

of the possible agreement above chance, which gives these 

clinicians a kappa of 0.6, as shown in Fig. 4B. 

Table 3: Chance agreement when 2 observers randomly assign positive and negative results, for successively higher rates of a positive call

Proportion positive (%)    Agreement by chance (%)
50                         50
60                         52
70                         58
80                         68
90                         82

A                          Clinician 2
                    Sign present   Sign absent
Clinician 1
  Sign present           40             10
  Sign absent            10             40

B                          Clinician 2
                    Sign present   Sign absent   Total
Clinician 1
  Sign present         40 (25)        10 (25)      50
  Sign absent          10 (25)        40 (25)      50
  Total                  50             50

κ = (observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)
  = (80% – 50%) ÷ (100% – 50%)
  = 30% ÷ 50%
  = 0.6

Fig. 4: Two clinicians who have been trained to assess Murphy’s sign in patients with abdominal pain do an actual assessment on 100 patients. A: A 2 × 2 table reflecting actual agreement between the 2 clinicians. B: A 2 × 2 table illustrating the correct approach to determining the kappa score. The numbers in parentheses correspond to the results that would be expected were each clinician randomly guessing that half of the patients had a positive result (as in Fig. 2).

Formula for calculating kappa 

(Observed agreement – agreement expected by chance) ÷ 

(100% – agreement expected by chance) 

Another way of expressing this formula: 

(Observed agreement beyond chance) ÷ (maximum 

possible agreement beyond chance) 

Hence, to calculate kappa when only 2 alternatives are 

possible (e.g., presence or absence of a finding), you need 

just 2 numbers: the percentage of patients that the 2 assessors 

agreed on and the expected agreement by chance. 

Both can be determined by constructing a 2 × 2 table exactly 

as illustrated above. 
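Both numbers can be read off a 2 × 2 table, as the following Python sketch illustrates. The function name `kappa_from_table` and its argument names are hypothetical. Chance agreement is estimated from each observer's own proportion of positive calls, which is the approach shown in Fig. 2 to Fig. 4.

```python
def kappa_from_table(both_present, first_only, second_only, both_absent):
    """Kappa for 2 observers and 2 alternatives, from the 4 cell counts.

    both_present / both_absent: patients on whom the observers agree.
    first_only / second_only: patients called positive by one observer only.
    """
    n = both_present + first_only + second_only + both_absent
    if n == 0:
        raise ValueError("the table is empty")
    observed = (both_present + both_absent) / n
    first_positive = (both_present + first_only) / n
    second_positive = (both_present + second_only) / n
    # Agreement expected if the 2 observers' calls were unrelated.
    chance = (first_positive * second_positive
              + (1 - first_positive) * (1 - second_positive))
    if chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - chance) / (1 - chance)
```

For the table in Fig. 4A, `kappa_from_table(40, 10, 10, 40)` returns approximately 0.6, and for the random guesses in Fig. 2, `kappa_from_table(25, 25, 25, 25)` returns 0.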


Chance agreement is not always 50%; rather, it varies 

from one clinical situation to another. When the prevalence 

of a disease or outcome is low, 2 observers will guess 

that most patients are normal and the symptom of the disease 

is absent. This situation will lead to a high percentage 

of agreement simply by chance. When the prevalence is 

high, there will also be high apparent agreement, with most 

patients judged to exhibit the symptom. Kappa measures 

the agreement after correcting for this variable degree of 

chance agreement. 

Conclusions 


Armed with this understanding of kappa as a measure of 

agreement between different observers, you are able to return 

to the study of agreement in chest radiography interpretations 

between emergency physicians and radiologists 1 

in a more informed fashion. You learn from the abstract 

that the kappa score for overall agreement between the 2 

classes of practitioners was 0.40, with a 95% confidence 

interval ranging from 0.35 to 0.46. This means that the 

agreement between emergency physicians and radiologists 

represented 40% of the potentially achievable agreement 

beyond chance. You understand that this kappa score 

would be conventionally considered to represent fair to 

moderate agreement but is inferior to many of the kappa 

values listed in Table 2. You are now much more confident 

about going to the full text of the article to review the 

methods and assess the clinical applicability of the results to 

your own patients. 

The ability to understand measures of variability in data 

presented in clinical trials and systematic reviews is an important 

skill for clinicians. We have presented a series of 

tips developed and used by experienced teachers of evidence-based 

medicine for the purpose of facilitating such 

understanding.


From the Department of Medicine, Division of General Internal Medicine 

(McGinn), and the Department of Geriatrics (Leipzig), Mount Sinai Medical Center, 

New York, NY; the Columbia University College of Physicians and Surgeons, 

New York, NY (Wyer); the Departments of Epidemiology and Biostatistics and of 

Pediatrics, University of California, San Francisco, San Francisco, Calif. (Newman); 

Durham Veterans Affairs Medical Center and Duke University Medical 

Center, Durham, NC (Keitz); and the Departments of Medicine and of Clinical 

Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt) 


Contributors: Thomas McGinn developed the original idea for tips 1 and 2 and, as 

principal author, oversaw and contributed to the writing of the manuscript. 

Thomas Newman and Rosanne Leipzig reviewed the manuscript at all phases of

development and contributed to the writing as coauthors. Sheri Keitz used all of 

the tips as part of a live teaching exercise and submitted comments, suggestions 

and the possible variations that are described in the article. Peter Wyer reviewed 

and revised the final draft of the manuscript to achieve uniform adherence with 

format specifications. Gordon Guyatt developed the original idea for tip 3, reviewed 

the manuscript at all phases of development, contributed to the writing as a 

coauthor, and, as general editor, reviewed and revised the final draft of the manuscript 

to achieve accuracy and consistency of content. 

References 

1. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs 

in the emergency department: Is the radiologist really necessary? Postgrad 

Med J 2003;79:214-7. 


for learning and teaching evidence-based medicine: introduction to the series 

[editorial]. CMAJ 2004;171(4):347-8. 

3. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. 

Am J Epidemiol 1987;126:161-9. 

4. Blackburn H. The exercise electrocardiogram: differences in interpretation. 

Report of a technical group on exercise electrocardiography. Am J Cardiol 

1968;21:871-80. 

5. Cook DJ. Clinical assessment of central venous pressure in the critically ill. 

Am J Med Sci 1990;299:175-8. 

6. Aertgeerts B, Buntinx F, Fevery J, Ansoms S. Is there a difference between 

CAGE interviews and written CAGE questionnaires? Alcohol Clin Exp Res 

2000;24:733-6. 

7. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement 

in two general practices in Great Britain. BMJ 1963;1:29-34. 

8. Guyatt GH, Patterson C, Ali M, Singer J, Levine M, Turpie I, et al. Diagnosis 

of iron-deficiency anemia in the elderly. Am J Med 1990;88:205-9. 

9. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award 

in clinical sciences. Reproducibility of physical signs in low-back pain. Spine 

1989;14:908-18. 

10. Perrier A, Howarth N, Didier D, Loubeyre P, Unger PF, de Moerloose P, et 

al. Performance of helical computed tomography in unselected outpatients 

with suspected pulmonary embolism. Ann Intern Med 2001;135:88-97. 

11. Koelemay MJ, Legemate DA, Reekers JA, Koedam NA, Balm R, Jacobs MJ. 

Interobserver variation in interpretation of arteriography and management of 

severe lower leg arterial disease. Eur J Vasc Endovasc Surg 2001;21:417-22. 


Tips for EBM learners: kappa statistic 


Pelham NY 10803, USA; fax 914 738-9368; pwyer@att.net 


Members of the Evidence-Based Medicine Teaching Tips 

Working Group: Peter C. Wyer (project director), College of 

Physicians and Surgeons, Columbia University, New York, NY; 

Deborah Cook, Gordon Guyatt (general editor), Ted Haines, 

Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose 

Hatala (internal review coordinator), University of British 

Columbia, Vancouver, BC; Robert Hayward (editor, online 

version), Bruce Fisher, University of Alberta, Edmonton, Alta.; 

Sheri Keitz (field test coordinator), Durham Veterans Affairs 

Medical Center and Duke University Medical Center, Durham, 

NC; Alexandra Barratt, University of Sydney, Sydney, Australia; 

Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; 

Antonio L. Dans, University of the Philippines College of 

Medicine, Manila, The Philippines; Barnet Eskin, Morristown 

Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory 

University School of Medicine, Atlanta, Ga.; Hui Lee, formerly 

Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne 

Leipzig, Thomas McGinn, Mount Sinai Medical Center, New 

York, NY; Victor M. Montori, Mayo Clinic College of Medicine, 

Rochester, Minn.; Virginia Moyer, University of Texas, Houston, 

Tex.; Thomas B. Newman, University of California, San 

Francisco, San Francisco, Calif.; Jim Nishikawa, University of 

Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf 

University, Manama, Bahrain; W. Scott Richardson, Wright State 

University, Dayton, Ohio; Mark C. Wilson, University of Iowa, 

Iowa City, Iowa 

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz 

S, et al. Tips for learners of evidence-based medicine: 

1. Relative risk reduction, absolute risk reduction and 


Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, 

Moyer V, et al. Tips for learners of evidence-based 

medicine: 2. Measures of precision (confidence intervals). 

CMAJ 2004;171(6):611-5. 

CMAJ NOV. 23, 2004; 171 (11) 1373

DOI:10.1503/cmaj.1031920 


4. Assessing heterogeneity of primary studies 

in systematic reviews and whether to combine 

their results 

Rose Hatala, Sheri Keitz, Peter Wyer, Gordon Guyatt, for the Evidence-Based Medicine 

Teaching Tips Working Group 

Clinicians wishing to quickly answer a clinical question 

may seek a systematic review, rather than searching 

for primary articles. Such a review is also called a 

meta-analysis when the investigators have used statistical 

techniques to combine results across studies. Databases useful 

for this purpose include the Cochrane Library (www.thecochranelibrary.com)

and the ACP Journal Club (www.acpjc.org; use the search term “review”), both of which are

available through personal or institutional subscription. 

Clinicians can use systematic reviews to guide clinical practice 

if they are able to understand and interpret the results. 

Systematic reviews differ from traditional reviews in that 

they are usually confined to a single focused question, 

which serves as the basis for systematic searching, selection 

and critical evaluation of the relevant research. 1 Authors of 

systematic reviews use explicit methods to minimize bias 

and consider using statistical techniques to combine the results 

of individual studies. When appropriate, such pooling 

allows a more precise estimate of the magnitude of benefit 

or harm of a therapy. It may also increase the applicability 

of the result to a broader range of patient populations. 

Clinicians encountering a meta-analysis frequently find 

the pooling process mysterious. Specifically, they wonder 

how authors decide whether the ranges of patients, interventions 

and outcomes are too broad to sensibly pool the 

results of the primary studies. 

In this article we present an approach to evaluating potentially 

important differences in the results of individual 

studies being considered for a meta-analysis. These differences 

are frequently referred to as heterogeneity. 1 Our discussion 

focuses on the qualitative, rather than the statistical, 

assessment of heterogeneity (see Box 1). 

Two concepts are commonly implied in the assessment 

of heterogeneity. The first is an assessment for heterogeneity 

within 4 key elements of the design of the original studies: 

the patients, interventions, outcomes and methods. This 

assessment bears on the question of whether pooling the results 

is at all sensible. The second concept relates to assessing 

heterogeneity among the results of the original studies. 

Even if the study designs are similar, the researchers must decide whether it is useful to combine the primary studies’ results. Our discussion assumes a basic familiarity with how investigators present the magnitude 2,3 and precision 4 of treatment effects in individual randomized trials.

CMAJ • MAR. 1, 2005; 172 (5) 661

© 2005 CMA Media Inc. or its licensors

Review

Synthèse



medicine skills to clinicians. 1,5,6 A related article, intended 

for people who teach these concepts to clinicians, is 

available online at www.cmaj.ca/cgi/content/full/172/5/661/DC1.


Qualitative assessment of the design of primary 

studies 

• Understand the concepts of heterogeneity of study design 

among the individual studies included in a systematic 

review. 

Qualitative assessment of the results of primary 

studies 

• Understand how to qualitatively determine the appropriateness 

of pooling estimates of effect from the individual 

studies by assessing (1) the degree of overlap of 

the confidence intervals around these point estimates of 

effect and (2) the disparity between the point estimates 

themselves. 

• Understand how to estimate the “true” value of the estimate 

of effect from a graphic display of the results of 

individual studies. 








to clinician learners and links to useful online resources.

Hatala et al 

Box 1: Statistical assessments of heterogeneity

Meta-analysts typically use 2 statistical approaches to evaluate the extent of variability in results between studies: Cochran’s Q test and the I² statistic.

Cochran’s Q test

• Cochran’s Q test is the traditional test for heterogeneity. It begins with the null hypothesis that all of the apparent variability is due to chance. That is, the true underlying magnitude of effect (whether measured with a relative risk, an odds ratio or a risk difference) is the same across studies.

• The test then generates a probability, based on a χ² distribution, that differences in results between studies as extreme as or more extreme than those observed could occur simply by chance.

• If the p value is low (say, less than 0.1) investigators should look hard for possible explanations of variability in results between studies (including differences in patients, interventions, measurement of outcomes and study design).

• As the p value gets very low (less than 0.01) we may be increasingly uncomfortable about using single best estimates of treatment effects.

• The traditional test for heterogeneity is limited, in that it may be underpowered (when studies have included few patients it may be difficult to reject the null hypothesis even if it is false) or overpowered (when sample sizes are very large, small and unimportant differences in magnitude of effect may nevertheless generate low p values).

I² statistic

• The I² statistic, the second approach to measuring heterogeneity, attempts to deal with potential underpowering or overpowering. I² provides an estimate of the percentage of variability in results across studies that is likely due to true differences in treatment effect, as opposed to chance.

• When I² is 0%, chance provides a satisfactory explanation for the variability we have observed, and we are more likely to be comfortable with a single pooled estimate of treatment effect.

• As I² increases, we get increasingly uncomfortable with a single pooled estimate, and the need to look for explanations of variability other than chance becomes more compelling.

• For example, one rule of thumb characterizes I² of less than 0.25 as low heterogeneity, 0.25 to 0.5 as moderate heterogeneity and over 0.5 as high heterogeneity.
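For readers who want to see how the 2 statistics in Box 1 relate to each other, the following Python sketch computes both from a set of study results. It rests on assumptions that Box 1 does not spell out: each study supplies an effect estimate on an additive scale (e.g., a risk difference or a log relative risk) with its standard error, studies are weighted by inverse variance, and I² is taken as (Q – degrees of freedom) ÷ Q, with negative values set to 0. The function name `cochran_q_and_i2` is hypothetical, and any input data are invented for illustration.

```python
def cochran_q_and_i2(effects, std_errors):
    """Return (Q, I2) for a list of study effects and their standard errors.

    Assumes inverse-variance weights and I2 = max(0, (Q - df) / Q),
    reported as a proportion between 0 and 1.
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need matching lists with at least 2 studies")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared distances of each study from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    degrees_of_freedom = len(effects) - 1
    i2 = max(0.0, (q - degrees_of_freedom) / q) if q > 0 else 0.0
    return q, i2
```

The p value described in Box 1 would come from comparing Q against a χ² distribution with (number of studies – 1) degrees of freedom; that step needs a statistics library and is left out of this sketch.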

662 JAMC 1 er MARS 2005; 172 (5) 

Tip 1: Qualitative assessment of the design of 

primary studies 

Consider the following 3 hypothetical systematic reviews. 

For which of these systematic reviews does it make 

sense to combine the primary studies? 

• A systematic review of all therapies for all types of cancer, 

intended to generate a single estimate of the impact 

of these therapies on mortality. 

• A systematic review that examines the effect of different 

antibiotics, such as tetracyclines, penicillins and chloramphenicol, 

on improvement in peak expiratory flow 

rates and days of illness in patients with acute exacerbation 

of obstructive lung disease, including chronic 

bronchitis and emphysema. 7 

• A systematic review of the effectiveness of tissue plasminogen 

activator (tPA) compared with no treatment 

or placebo in reducing mortality among patients with 

acute myocardial infarction. 8 

Most clinicians would instinctively reject the first of 

these proposed reviews as overly broad but would be comfortable 

with the idea of combining the results of trials relevant 

to the third question. What about the second review? 

What aspects of the primary studies must be similar to justify 

combining their results in this systematic review? 

Table 1 lists features that would be relevant to the 

question considered in the second review and categorizes 

them according to the 4 key elements of study design: the 

patients, interventions, outcomes and methods of the primary 

studies. Combining results is appropriate when the 

biology is such that across the range of patients, interventions, 

outcomes and study methods, one can anticipate 

more or less the same magnitude of treatment effect. 

Table 1: Relevant features of study design to be considered when deciding whether to pool studies in a systematic review (for a review examining the effect of antibiotics in patients with obstructive lung disease)

Patients
• Patient age
• Patient sex
• Type of lung disease (e.g., emphysema, chronic bronchitis)

Interventions
• Same antibiotic in all studies
• Same class of antibiotic in all studies
• Comparison of antibiotic with placebo
• Comparison of one antibiotic with another

Outcomes
• Death
• Peak expiratory flow
• Forced expiratory volume in the first second

Study methods
• All randomized trials
• Only blinded randomized trials
• Cohort studies

In other words, the judgment as to whether the primary studies are similar enough to be combined in a systematic review is based on whether the underlying pathophysiology would predict a similar treatment effect across the range of patients, interventions, outcomes and study methods of the primary studies. If you think back to the first systematic review (all therapies for all cancers), you probably recognize that there is significant variability in the pathophysiology of different cancers (“patients” in Table 1) and in the mechanisms of action of different cancer therapies (“interventions” in Table 1).

If you were inclined to reject pooling the results of the 

studies to be considered in the second systematic review, you 

might have reasoned that we would expect substantially different 

effects with different antibiotics, different infecting 

agents or different underlying lung pathology. If you were 

inclined to accept pooling of results in this review, you might 

argue that the antibiotics used in the different studies are all 

effective against the most common organisms underlying 

pulmonary exacerbations. You might also assert that the biology 

of an acute exacerbation of an obstructive lung disease 

(e.g., inflammation) is similar, despite variability in the underlying 

pathology. In other words, we would expect more 

or less the same effect across agents and across patients. 

Finally, you probably accepted the validity of pooling results 

for the third systematic review — tPA for myocardial 

infarction — because you consider that the mechanism of 

myocardial infarction is relatively constant across a broad 

range of patients. 


• Similarity in the aspects of primary study design outlined 

in Table 1 (patients, interventions, outcomes, 

study methods) guides the decision as to whether it 

makes sense to combine the results of primary studies 

in a systematic review. 

• The range of characteristics of the primary studies 

across which it is sensible to combine results is a matter 

of judgment based on the researcher’s understanding of 

the underlying biology of the disease. 

Tip 2: Qualitative assessment of the results of 

primary studies 

You should now understand that combining the results of 

different studies is sensible only when we expect more or less 

the same magnitude of treatment effects across the range of 

patients, interventions and outcomes that the investigators 

have included in their systematic review. However, even 

when we are confident of the similarity in design among the 

individual studies, we may still wonder whether the results of 

the studies should be pooled. The following graphic demonstration 

shows how to qualitatively assess the results of the 

primary studies to decide if meta-analysis (i.e., statistical 

pooling) is appropriate. You can find discussions of quantitative, 

or statistical, approaches to the assessment of heterogeneity 

elsewhere (see Box 1 or Higgins and associates 9 ). 

Consider the results of the studies in 2 hypothetical systematic 

reviews (Fig. 1A and Fig. 1B). The central vertical 

line, labelled “no difference,” represents a treatment effect of 

0. This would be equivalent to a risk ratio or relative risk of 1 

or an absolute or relative risk reduction of 0. 2 Values to the 

left of the “no difference” line indicate that the treatment is 

superior to the control, whereas those to the right of the line 

indicate that the control is superior to the treatment. For 

each of the 4 studies represented in the figures, the dot represents 

the point estimate of the treatment effect (the value 

observed in the study), and the horizontal line represents the 

confidence interval around that observed effect. For which 

systematic review does it make sense to combine results? Decide 

on the answer to this question before you read on. 

[Fig. 1, panels A and B: plots of 4 studies each; the horizontal axis carries the labels “Favours new treatment”, “No difference” and “Favours control”]

Fig. 1: Results of the studies in 2 hypothetical systematic reviews. The central vertical line represents a treatment effect of 0. Values to the left of this line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies in each figure, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect.

[Fig. 2: plot of 4 studies; axis labels “Favours new treatment” and “No difference”]

Fig. 2: Point estimates and confidence intervals for 4 studies. Two of the point estimates favour the new treatment, and the other 2 point estimates favour the control. Investigators doing a systematic review with these 4 studies would be satisfied that it is appropriate to pool the results.

[Fig. 3: plot of 4 studies with a diamond labelled “Pooled estimate of underlying effect”; axis labels “Favours new treatment” and “No difference”]

Fig. 3: Results of the hypothetical systematic review presented in Fig. 1B. The pooled estimate at the bottom of the chart (large diamond) provides the best guess as to the underlying treatment effect. It is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials.

You have probably concluded that pooling is appropriate for the studies represented in Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference” line, whereas those for the studies in Fig. 1B lie on the same side of the “no difference” line?

Before you answer this question, consider the studies 

represented in Fig. 2. Here, the point estimates of 2 studies 

are on the “favours new treatment” side of the “no difference” 

line, and the point estimates of 2 other studies are on 

the “favours control” side. However, all 4 point estimates 

are very close to the “no difference” line, and, in this case, 

investigators doing a systematic review will be satisfied that 

it is appropriate to pool the results. Therefore, it is not the 

position of the point estimates relative to the “no difference” 

line that determines the appropriateness of pooling. 

There are 2 criteria for not combining the results of 

studies in a meta-analysis: highly disparate point estimates 

and confidence intervals with little overlap, both of which 

are exemplified by Fig. 1A. When pooling is appropriate on 

the basis of these criteria, where is the best estimate of the 

underlying magnitude of effect likely to be? Look again at 

Fig. 1B and make a guess. Now look at Fig. 3. 

The pooled estimate at the bottom of Fig. 3 is centred on 

the midpoint of the area of overlap of the confidence intervals 

around the estimates of the individual trials. It provides our 

best guess as to the underlying treatment effect. Of course, we 

cannot actually know the “truth” and must be content with 

potentially misleading estimates. The intent of a meta-analysis 

is to include enough studies to narrow the confidence interval 

around the resulting pooled estimate sufficiently to provide estimates 

of benefit for our patients in which we can be confident. 

Thus, our best estimate of the truth will lie in the area of 

overlap among the confidence intervals around the point estimates 

of treatment effect presented in the primary studies. 
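The 2 criteria, and the idea that the pooled estimate sits within the region where the confidence intervals overlap, can be sketched numerically. What follows is a minimal illustration rather than the method of any review cited here: the 4 study results are invented, the function names are hypothetical, and the pooled value is a fixed-effect inverse-variance weighted mean, a standard technique that we substitute for the visual “midpoint of overlap” described above.

```python
from math import sqrt

# Hypothetical studies: (point estimate, lower 95% limit, upper 95% limit),
# expressed as risk differences. These numbers are invented for illustration.
studies = [
    (0.10, 0.02, 0.18),
    (0.14, 0.04, 0.24),
    (0.08, -0.02, 0.18),
    (0.12, 0.06, 0.18),
]

def common_overlap(results):
    """Return the interval shared by every confidence interval, or None."""
    lower = max(lo for _, lo, _ in results)
    upper = min(hi for _, _, hi in results)
    return (lower, upper) if lower <= upper else None

def pooled_fixed_effect(results, z=1.96):
    """Inverse-variance weighted mean, assuming symmetric 95% intervals."""
    weights = []
    for _, lo, hi in results:
        se = (hi - lo) / (2 * z)  # standard error recovered from the CI width
        weights.append(1 / se ** 2)
    total = sum(weights)
    estimate = sum(w * est for w, (est, _, _) in zip(weights, results)) / total
    half_width = z * sqrt(1 / total)
    return estimate, estimate - half_width, estimate + half_width

overlap = common_overlap(studies)
pooled, pooled_lo, pooled_hi = pooled_fixed_effect(studies)
print("Region shared by all intervals:", overlap)
print(f"Pooled estimate: {pooled:.3f} ({pooled_lo:.3f} to {pooled_hi:.3f})")
```

With these invented numbers the pooled estimate falls inside the region shared by all 4 intervals, and its confidence interval is narrower than that of any single study. When `common_overlap` returns `None`, the intervals have no shared region, which corresponds to the situation in Fig. 1A where pooling is questionable.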

What is the clinician to do when presented with results 

such as those in Fig. 1A? If the investigators have done a 

good job of planning and executing the meta-analysis, they 

will provide some assistance. 6 Before examining the study 

results in detail, they will have generated a priori hypotheses 

to explain the heterogeneity in magnitude of effect across 

studies that they are liable to encounter. These hypotheses 

will include differences in patients (effects may be larger in 

sicker patients), in interventions (larger doses may result in 

larger effects), in outcomes (longer follow-up may diminish 

the magnitude of effect) and in study design (methodologically 

weaker studies may generate larger effects). 

The investigators will then have examined the extent to 

which these hypotheses can explain the differences in magnitude 

of effect across studies. These subgroup analyses 

may be misleading, but if they meet 7 criteria suggested 

elsewhere 10 (see Box 2), they may provide credible and satisfying 

explanations for the variability in results. 



• Readers can decide for themselves whether there is clinically important heterogeneity among the results of primary studies through a qualitative assessment of the graphic results. This assessment is based on the amount of disparity among the individual point estimates and the degree of overlap among the confidence intervals.

Box 2: Questions to ask when evaluating a subgroup analysis in a meta-analysis 10

• Was the subgroup comparison based on a within-study, rather than a between-study, comparison?

• Is the magnitude of the difference in effect between subgroups large?

• Is the effect consistent across studies?

• Is the difference in effect statistically significant?

• Was the subgroup analysis planned in advance by the trialists?

• Were many subgroup analyses performed and selectively reported?

• Is the difference in effect in the subgroup supported by a biological hypothesis?

Conclusions 

Understanding the concept of heterogeneity in a systematic 

review or meta-analysis is central to a full appreciation 

of the implications of such reviews for clinical practice. 

We have presented 2 tips aimed at helping clinical readers 

overcome commonly encountered difficulties in understanding 

this concept. 


From the Department of Medicine, University of British Columbia, Vancouver, BC 

(Hatala); Durham Veterans Affairs Medical Center and Duke University Medical 

Center, Durham, NC (Keitz); the Columbia University College of Physicians and 

Surgeons, New York, NY (Wyer); and the Departments of Medicine and of Clinical 

Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt) 


Contributors: Rose Hatala modified the original ideas for tips 1 and 2, drafted the 

manuscript, coordinated input from reviewers and field-testing, and revised all drafts. 

Sheri Keitz used all of the tips as part of a live teaching exercise and submitted comments, 

suggestions and the possible variations that are described in the article. Peter 

Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence 

with format specifications. Gordon Guyatt developed the original ideas for 

tips 1 and 2, reviewed the manuscript at all phases of development, contributed to 

the writing as a coauthor, and, as general editor, reviewed and revised the final draft 

of the manuscript to achieve accuracy and consistency of content. 

References 

1. Oxman A, Guyatt G, Cook D, Montori V. Summarizing the evidence. In: 

Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual for 

evidence-based clinical practice. Chicago: AMA Press; 2002. p. 155-73. 

2. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al, for the 

Evidence-Based Medicine Teaching Tips Working Group. Tips for learners 

of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction 

and number needed to treat. CMAJ 2004;171(4):353-8. 

3. Guyatt G, Cook D, Devereaux PJ, Meade M, Straus S. Therapy. In: Guyatt 

G, Rennie D, editors. Users’ guides to the medical literature: a manual for evidence-based 

clinical practice. Chicago: AMA Press; 2002. p. 55-79. 

4. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al, 

for the Evidence-Based Medicine Teaching Tips Working Group. Tips for 

learners of evidence-based medicine: 2. Measures of precision (confidence intervals). 

CMAJ 2004;171(6):611-5. 

5. Wyer P, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series. CMAJ 2004;171(4):347-8.

6. Montori V, Hatala R, Guyatt G. Summarizing the evidence: evaluating differences 

in study results. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: 

a manual for evidence-based clinical practice. Chicago: AMA Press; 2002. p. 547-52. 

7. Saint S, Bent S, Vittinghoff E, Grady D. Antibiotics in chronic obstructive 

pulmonary disease exacerbations. JAMA 1995;273:957-60. 

8. Held PH, Teo KK, Yusuf S. Effects of tissue-type plasminogen activator and 

anisoylated plasminogen streptokinase activator complex on mortality in acute 

myocardial infarction. Circulation 1990;82:1668-74. 

9. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency 

in meta-analyses. BMJ 2003;327:557-60. 

10. Oxman A, Guyatt G. When to believe a subgroup analysis. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual for evidence-based clinical practice. Chicago: AMA Press; 2002. p. 553-65.


Pelham NY 10804; fax 914 738-9368; pwyer@att.net

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G, et al. Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa statistic). CMAJ 2004;171(11):1369-73.


DOI:10.1503/cmaj.1031666 


5. The effect of spectrum of disease on the 

performance of diagnostic tests 

Victor M. Montori, Peter Wyer, Thomas B. Newman, Sheri Keitz, Gordon Guyatt, 

for the Evidence-Based Medicine Teaching Tips Working Group 

For clinicians to use a diagnostic test in clinical practice, 

they need to know how well the test distinguishes 

between those who have the suspected disease 

or condition and those who do not. If investigators 

choose clinically inappropriate populations for their study 

of a diagnostic test and thereby introduce what is sometimes 

called spectrum bias, the results may seriously mislead 

clinicians. 

In this article we present a series of examples that illustrate 

why clinicians need to pay close attention to the populations 

enrolled in studies of diagnostic test performance 

before they apply the results of those studies to their own 

patients. After working through these examples, you should 

understand which characteristics of a study population are 

likely to result in misleading interpretations of test results 

and which are not. 



medicine principles to clinicians. 1,2 A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/173/4/385/DC1.


“Ideal” spectrum of disease 

• Understand the importance of spectrum of disease in 

the evaluation of diagnostic test characteristics. 

Prevalence, spectrum and test characteristics 

• Understand the lack of impact of disease prevalence on 

sensitivity, specificity and likelihood ratios. 

• Understand the impact of disease prevalence or likelihood 

on the probability of the target condition (posttest 

probability) after test results are available. 

Tip 1: “Ideal” spectrum of disease 

Let’s consider a clinical example that illustrates the concept 

of “disease spectrum” in relation to diagnostic tests. 

CMAJ • AUG. 16, 2005; 173 (4) 385 

© 2005 CMA Media Inc. or its licensors 


Brain natriuretic peptide (BNP) is a hormone secreted by 

the ventricles in the heart in response to expansion. Plasma 

levels of BNP increase when acute or chronic congestive 

heart failure is present. Consequently, investigators have 

suggested using BNP levels to distinguish congestive heart 

failure from other causes of acute dyspnea among patients 

presenting to emergency departments. 3 

One highly publicized study reported promising results 

using a BNP cutoff point of 100 pg/mL. 4,5 This cutoff point 

means that patients with BNP levels greater than 

100 pg/mL are considered to have a “positive” test result 

for congestive heart failure and those with levels below this 

threshold are considered to have a “negative” test result. 

The investigators compared the number of diagnoses of 

congestive heart failure using BNP levels with those using a 

criterion standard (or “gold standard”) defined by established 

clinical and imaging criteria. Commentaries have 

challenged the investigators’ estimates of the sensitivity and 

specificity of the BNP test at the proposed cutoff point on 

the basis that clinicians were already confident with respect 

to the likelihood of congestive heart failure in most of the 

patients in the study. 6,7 

Ideally, the ability of a test to correctly identify patients 

with and without a particular disease would not vary between 

patients. However, if you are a clinician, you already 

intuitively understand that a test may perform better when 

it is used to evaluate patients with more severe disease than 

it would with patients whose disease is less advanced and 

less obvious. You also appreciate that diagnostic tests are 

not needed when the disease is either clinically obvious or 

sufficiently unlikely that you need not seriously consider it. 






challenges they encounter when teaching these 

concepts to clinician learners and links to useful online 

resources. 



A study of the performance of a diagnostic test involves 

performing that test on patients with and without the disease 

or condition of interest together with a second test or 

investigation that we will call the “criterion standard.” We 

accept the results of the second test as the criterion by 

which the results of the test under investigation are assessed. 

In designing such a study, investigators sometimes 

choose both patients in whom the disease is unequivocally 

advanced and patients who are unequivocally free of disease, 

such as healthy, asymptomatic volunteers. This approach 

ensures the validity of the criterion standard and 

may be appropriate in the early stages of developing a test. 

However, any study done with a population that lacks diagnostic 

uncertainty may produce a biased estimate of a test’s 

performance relative to that produced by 

a study restricted to patients for whom 

the test would be clinically indicated. 

Returning to the use of BNP levels to 

test for congestive heart failure among patients 

with acute dyspnea, consider Fig. 1. 

The horizontal axis represents increasing 

values of BNP. The 2 bell curves constitute 

hypothetical probability density plots 

of the distribution of BNP values among 

patients with and without congestive 

heart failure. 8 The height at any point in 

either curve reflects the proportion of 

emergency patients in the particular subgroup 

with the corresponding BNP value. 

Aside from the choice of cutoff value, this 

figure does not reflect the results of any 

actual study. 

The bell curve on the left in Fig. 1 represents

the hypothetical distribution of 

BNP values in a group of young patients 

with known asthma and no risk factors for 

congestive heart failure. They will tend to 

have low levels of circulating BNP. The 

bell curve on the right represents the distribution 

of BNP values among older patients 

with unequivocal and severe congestive 

heart failure. Such patients will 

have test results clustered on the high end 

of the scale. 

If Fig. 1 accurately represented the

performance of the BNP test in distinguishing 

between all patients with and 

without congestive heart failure as the 

cause of their symptoms, the test would 

be very useful. The 2 curves demonstrate 

very little overlap. For BNP values below 

90 pg/mL (point A), no patients have 

congestive heart failure, and for BNP values 

above 110 pg/mL (point B), all patients 

have congestive heart failure. This means, assuming that Fig. 1 reflects reality, that you can be

completely certain about the diagnosis for all people with 

BNP values below 90 pg/mL or above 110 pg/mL. Only 

for patients whose BNP values are between 90 and 

110 pg/mL is there residual uncertainty about their likelihood 

of congestive heart failure. 

However, before you embrace a test on the basis of its 

performance among patients in whom the presence or absence 

of disease is unequivocal, you need to consider the 

likely distribution of test results in a population of patients 

for whom you would be less certain. 

In Fig. 2, imagine that the entire study population is made up of middle-aged patients, all of whom have chronic congestive heart failure and recurrent asthma. The distributions of BNP values in the subgroups with and without acute congestive heart failure are both much closer to the middle of the range. The extent of the overlap of the curves between points A and B is much greater, which means that there is residual uncertainty about the disease status of a large proportion of the patients even after the BNP test has been performed.

Fig. 1: Hypothetical probability density distributions of measured plasma brain natriuretic peptide (BNP) levels in 2 subgroups of a study population. The cutoff point for a diagnosis of congestive heart failure (CHF) is 100 pg/mL. Patients with a negative test result for CHF (left-hand curve) are younger, with known asthma and no risk factors for CHF. The patients with confirmed CHF are older, and the disease is clinically severe and unequivocal. Clinicians in the emergency department have little uncertainty regarding the cause of dyspnea in any of these patients. [Figure not reproduced; horizontal axis: BNP level, pg/mL, 0 to 200; vertical axis: proportion of patients; curves: “Patients without acute CHF” and “Patients with acute CHF”; points A and B marked.]

Fig. 2: These hypothetical probability density distributions reflect a study population of middle-aged patients who all have recurrent asthma and chronic CHF. The patients whose dyspnea is caused by asthma exacerbations look clinically similar to those whose symptoms are caused by acute CHF. [Figure not reproduced; horizontal axis: BNP level, pg/mL, 0 to 200; curves: “Patients without acute CHF” and “Patients with acute CHF”; points A and B marked.]

It may be helpful to note that the sensitivity of the BNP 

test at a cutoff value of 100 pg/mL (the proportion of patients 

with acute congestive heart failure whose BNP level 

is greater than 100 pg/mL) is defined in Fig. 1 and Fig. 2 as 

the percentage of the total area of the right-hand curve that 

lies to the right of the cutoff value. Notice that this percentage 

is markedly lower in Fig. 2 than in Fig. 1. The 

same is true of specificity, which is the proportion of patients 

without acute congestive heart failure whose BNP 

level is less than 100 pg/mL. This is defined in the figures 

as the proportion of the left-hand curve that lies to the left 

of the cutoff point. Again this percentage is appreciably 

lower in Fig. 2 compared with Fig. 1. 
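The area-under-the-curve definitions above can be made concrete with a short calculation. This is only a sketch under stated assumptions: we model each subgroup as a normal distribution, and the means and standard deviations are hypothetical values chosen to mimic the shapes of Fig. 1 and Fig. 2, not data from any study.

```python
from statistics import NormalDist

CUTOFF = 100  # pg/mL, the cutoff point discussed in the text

# Hypothetical (without CHF, with CHF) curves. The means and SDs are
# assumptions chosen only to resemble Fig. 1 (well separated curves)
# and Fig. 2 (heavily overlapping curves).
scenarios = {
    "Fig. 1 (unequivocal cases)": (NormalDist(60, 15), NormalDist(140, 15)),
    "Fig. 2 (diagnostic uncertainty)": (NormalDist(90, 15), NormalDist(110, 15)),
}

def sensitivity(with_chf, cutoff=CUTOFF):
    """Share of the right-hand curve lying above the cutoff."""
    return 1 - with_chf.cdf(cutoff)

def specificity(without_chf, cutoff=CUTOFF):
    """Share of the left-hand curve lying below the cutoff."""
    return without_chf.cdf(cutoff)

for label, (without_chf, with_chf) in scenarios.items():
    print(f"{label}: sensitivity {sensitivity(with_chf):.1%}, "
          f"specificity {specificity(without_chf):.1%}")
```

With these assumed curves, both sensitivity and specificity come out lower for the overlapping pair than for the well separated pair, which is the pattern described in the text.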

These theoretical concerns play out (albeit with a lesser magnitude of impact than depicted in Fig. 1 and Fig. 2) in studies of the BNP test as a diagnostic tool. In the BNP study to which we have referred, the sensitivity and specificity of the test using the 100 pg/mL cutoff point were 90% and 76% respectively when all patients were included. 4 Only about 25% of the study population were judged by the treating physicians to be in the intermediate range of probability of acute congestive heart failure. 5 When only patients in this subgroup were considered in a number of studies, the sensitivity and specificity of the BNP test at a cutoff point of 100 pg/mL were only 88% and 55% respectively. 7
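To see what this change means for interpreting a test result, the reported sensitivities and specificities can be converted to likelihood ratios for a dichotomized test. The sketch below uses the standard formulas (sensitivity divided by 1 minus specificity for a positive result, and 1 minus sensitivity divided by specificity for a negative result); the function name and group labels are our own.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR for a positive result, LR for a negative result)."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

# Sensitivity and specificity as quoted in the text for a 100 pg/mL cutoff.
reported = {
    "all patients": (0.90, 0.76),
    "intermediate-probability subgroup": (0.88, 0.55),
}

for group, (sens, spec) in reported.items():
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"{group}: LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}")
```

On these figures the likelihood ratio for a positive result falls from 3.75 to about 1.96, and the likelihood ratio for a negative result rises from about 0.13 to about 0.22. Both move toward 1, so the test shifts the probability of congestive heart failure less in the patients for whom it is most needed.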

The range of disease states found among the patients 

in the population upon which a test is to be used is commonly 

referred to as “disease spectrum.” In making your 

final assessment on the value of a test, 

consider the spectrum of the disease or 

condition in which you are interested. 

You don’t need to differentiate healthy 

patients from patients with severe disease. 

Rather, you must differentiate 

those who have the disease from those 

who do not among all those who appear 

as if they might have it. The “right” 

population for a diagnostic test study includes 

(1) those in whom we are uncertain 

of the diagnosis; (2) those in whom 

we will use the test in clinical practice to 

resolve our uncertainty; and (3) patients 

with the disease who have a wide spectrum 

of severity and patients without the 

disease who have symptoms commonly 

associated with it. 

Readers familiar with the concept and interpretation of likelihood ratios for diagnostic test results 1 may find it useful to note that the likelihood ratio for any given test value is represented by the ratio of the heights of the 2 curves at that point on the horizontal axis: the height of the curve for patients with the disease divided by the height of the curve for patients without it (Fig. 3). The point on the horizontal axis below the intersection of the 2 curves is the test result with a likelihood ratio of 1. Fig. 3 also identifies test values corresponding to likelihood ratios of 0.25 and 4. Comparing Fig. 1 and Fig. 2 once more, you will notice that the relative heights of the 2 curves, and hence the likelihood ratios, corresponding to a given BNP level will change as the curves move closer together and the area of overlap increases.
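The relation between curve heights and likelihood ratios can be checked with a few lines of code. As before, this is a sketch resting on assumptions: the 2 subgroups are modelled as normal distributions with hypothetical means and a shared standard deviation, and `likelihood_ratio` is our own helper name.

```python
from statistics import NormalDist

def likelihood_ratio(value, with_disease, without_disease):
    """Height of the diseased curve divided by the height of the non-diseased curve."""
    return with_disease.pdf(value) / without_disease.pdf(value)

# Hypothetical overlapping curves; the means and SD are assumptions.
without_chf = NormalDist(90, 15)
with_chf = NormalDist(110, 15)

for bnp in (80, 100, 120):
    lr = likelihood_ratio(bnp, with_chf, without_chf)
    print(f"BNP {bnp} pg/mL: LR = {lr:.2f}")

# The same BNP value carries a different LR when the curves sit further apart.
close_together = likelihood_ratio(120, with_chf, without_chf)
far_apart = likelihood_ratio(120, NormalDist(140, 15), NormalDist(60, 15))
print(f"LR at 120 pg/mL: {close_together:.2f} (overlapping) vs {far_apart:.0f} (separated)")
```

In this sketch the curves cross at 100 pg/mL, so the likelihood ratio there is 1. Values to the right of the crossing give ratios above 1 and values to the left give ratios below 1. Moving the curves apart, as in Fig. 1, makes the ratio attached to the same BNP value far more extreme.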



• Test performance will vary with the spectrum of disease 

within a study population. 9 

• The sensitivity and specificity of a test, when it is used 

to differentiate patients who obviously do not have the 

disease from patients who obviously do, likely overestimate 

its performance when the test is applied in a clinical 

context characterized by diagnostic uncertainty. 


Definitions

Disease spectrum: The range of the disease states found among patients who make up the population upon which a test is to be used.

Performance of diagnostic tests: Measures derived from the percentage of patients with and without disease identified by a particular test result, with disease positivity defined through the application of an acceptable criterion standard to each patient in a study. Sensitivity and specificity are examples of such measures.

Fig. 3: Likelihood ratios (LRs) and spectrum of disease. The likelihood ratio of a test result represented by a point on the horizontal line is the height of the right-hand bell curve (patients with the disease of interest) divided by the height of the left-hand bell curve (patients without the disease of interest) at that point. [Figure not reproduced; horizontal axis: increasing test value; curves: patients without acute CHF and patients with acute CHF; marked test results: LR = 0.25, LR = 1 and LR = 4.]

Tip 2: Prevalence, spectrum and test 

characteristics 

You may have learned the rule of thumb that post-test 

probabilities (which are closely related to predictive values) 

vary with disease prevalence, but sensitivities, specificities 

and likelihood ratios do not. Is this true? The answer is 

“yes,” provided that disease spectrum remains the same in 

high- and low-prevalence populations. In the discussion 

that follows, for purposes of simplicity, we use the term 

“prevalence” to denote the likelihood that any patient randomly 

selected from the study population has the disease or 

condition as defined by the criterion standard. This is not 

the same thing as the probability of disease in any individual 

patient. 

Referring once again to Fig. 1, let’s consider 3 cases. In the first, we’ll assume that there were 1000 patients in each subgroup: 1000 in whom congestive heart failure was unequivocally the cause of their dyspnea and 1000 in whom asthma was almost certainly the cause. The prevalence of congestive heart failure is 50%. Each bell curve corresponds to the distribution of BNP values within the respective subgroup. Now consider a second case, where there are 2000 older patients with severe congestive heart failure and 1000 younger patients with recurrent asthma and no risk factors for congestive heart failure. The prevalence of congestive heart failure is 67%. Finally, consider a third case, where 2000 patients with asthma and 1000 patients with severe congestive heart failure are studied. The prevalence of congestive heart failure is 33%.

In each case the height of either curve corresponding to any particular BNP level still corresponds to the proportion of patients with that test value in that group. Changes in the total number of patients will not alter these proportions, and the performance of the test, as measured by sensitivity, specificity or likelihood ratios, will be unaffected. The performance of the BNP test in identifying patients with and without acute congestive heart failure remained the same. Hence, when the spectrum remains the same, the prevalence of congestive heart failure within the study population is irrelevant to the estimation of test characteristics.

Let’s take a different clinical example. The ICON urine test for pregnancy (Beckman Coulter, Inc., Fullerton, Calif.) has a very high sensitivity and specificity when performed later than 2 weeks postconception. 10 It is a qualitative, and inherently dichotomized, test: both clinicians and patients recognize that it is not possible to be “a little bit pregnant.” In short, although estimates of performance values for the ICON test vary in the literature, 11,12 the performance of the test in detecting pregnancy is likely to be uniform if the percentage of subjects who are less than 2 weeks postconception does not vary.

Fig. 4: Changes in disease prevalence have no effect on diagnostic test characteristics.

A                       Pregnant       Not pregnant    Total
Positive test result    A = 95         B = 1            96
Negative test result    C = 5          D = 99          104
Total                   100            100             200

Women attending a screening clinic in a geographic area characterized by moderate population growth are tested for pregnancy. 50% of the women are pregnant. Hence, the prevalence of pregnancy is 50% in this setting. The ICON test has a sensitivity of 95% and a specificity of 99%. By definition, 95% of the 100 pregnant women (95% sensitivity) will have a positive test result, and 99% of the 100 nonpregnant women (99% specificity) will have a negative test result. The sensitivity is influenced by the proportion of women who present less than 2 weeks after conception.

B                       Pregnant       Not pregnant    Total
Positive test result    A × 4 = 380    B = 1           381
Negative test result    C × 4 = 20     D = 99          119
Total                   400            100             500

The same test is performed in a similar clinic located in a geographic area characterized by high population growth. Four times as many women are pregnant as women who are not. The prevalence of pregnancy has increased to 80%. The percentage of pregnant women who have positive test results remains the same (380/400), and the sensitivity of the test remains 95% in this population. The percentage of nonpregnant women who have a negative test result is also unchanged at 99%.

C                       Pregnant       Not pregnant    Total
Positive test result    A = 95         B × 4 = 4        99
Negative test result    C = 5          D × 4 = 396     401
Total                   100            400             500

The same pregnancy test is now used in a clinic servicing a population characterized by low population growth. Only one-fifth of women are pregnant. The sensitivity remains the same despite a decrease in the proportion of pregnant women from 50% to 20%. The specificity (the proportion of nonpregnant women with a negative test result) remains the same despite an increase in the prevalence of nonpregnant women to 80%. Once again, the prevalence of pregnancy in the population is irrelevant to the estimation of test characteristics.

For the purpose of our demonstration, let’s assume that 

ICON test results are positive in 95% of women who are 

pregnant and negative in 99% of women who are not. Fig. 

4 shows the sensitivity and specificity of the test when it is 

administered in 3 different geographic locations with high, 

moderate and low population growth and where the proportion 

of women presenting within 2 weeks of conception 

is constant. Again, for simplicity, we are considering only 

the prevalence of pregnancy in the population being studied 

— in other words, the percentage of women tested who 

are pregnant. A practitioner might estimate the probability 

of pregnancy in an individual patient to be higher or lower 

than this on the basis of clinical features such as use of birth 

control methods, history of recent sexual activity and past 

history of gynecologic disease. As Fig. 4 shows, the prevalence 

of pregnancy in the population has no effect on the 

estimation of test characteristics. 
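Readers who wish to verify this for themselves can recompute the test characteristics from the cell counts in Fig. 4. The sketch below takes the counts from the 3 panels of the figure; the dictionary layout and function names are our own.

```python
# Cell counts read from Fig. 4 (A = true positives, B = false positives,
# C = false negatives, D = true negatives).
panels = {
    "Fig. 4A": {"A": 95, "B": 1, "C": 5, "D": 99},
    "Fig. 4B": {"A": 380, "B": 1, "C": 20, "D": 99},
    "Fig. 4C": {"A": 95, "B": 4, "C": 5, "D": 396},
}

def prevalence(table):
    """Pregnant women as a share of all women tested."""
    return (table["A"] + table["C"]) / sum(table.values())

def sensitivity(table):
    """Share of pregnant women with a positive test result."""
    return table["A"] / (table["A"] + table["C"])

def specificity(table):
    """Share of nonpregnant women with a negative test result."""
    return table["D"] / (table["B"] + table["D"])

for name, table in panels.items():
    print(f"{name}: prevalence {prevalence(table):.0%}, "
          f"sensitivity {sensitivity(table):.0%}, "
          f"specificity {specificity(table):.0%}")
```

Prevalence moves from 50% to 80% to 20% across the 3 panels, while sensitivity stays at 95% and specificity at 99%.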

There are many examples of conditions that may present 

with equal severity in people with different demographic 

characteristics (age, sex, ethnicity) but that are 

much more prevalent in one group than in another. Mild 

osteoarthritis of the knee is rare among young patients but 

common among older patients. Asymptomatic thyroid abnormalities 

are rare among men but common among 

women. In both examples, diagnostic tests will have the 

same sensitivity, specificity and likelihood ratios in young 

and old patients and in men and women respectively. 

However, higher prevalence will result in a higher proportion 

of those with a positive test result who do in fact 

have the disease for which they are being tested. Referring 

to Fig. 4, in the population with a lower prevalence of 

pregnancy, 95 of 99 women (96%) with positive test results 

are pregnant (Fig. 4C) compared with 380 of 381 women 

(99.7%) in the population with a higher prevalence (Fig. 

4B). The likelihood of the condition or disease among patients 

who have a positive test result is sometimes referred 

to as the predictive value of a test. The predictive value corresponds 

with the post-test probability of the disease when 

the test result is positive. Unlike sensitivity, specificity or 

likelihood ratios, predictive values are strongly influenced 

by changes in prevalence in the population being tested. 
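The dependence of predictive values on prevalence can be reproduced in 2 equivalent ways: directly from sensitivity, specificity and prevalence, or by converting the pretest probability to odds and multiplying by the likelihood ratio for a positive result. The following sketch uses the sensitivity (95%) and specificity (99%) assumed for the ICON test in this article; the function names are hypothetical.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of the condition given a positive test result."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert to odds, apply the likelihood ratio, convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

SENSITIVITY, SPECIFICITY = 0.95, 0.99  # values assumed for the ICON test in the text
lr_positive = SENSITIVITY / (1 - SPECIFICITY)

for prevalence in (0.20, 0.50, 0.80):
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    via_lr = post_test_probability(prevalence, lr_positive)
    print(f"prevalence {prevalence:.0%}: predictive value {ppv:.1%} "
          f"(via likelihood ratio: {via_lr:.1%})")
```

The 2 routes agree. At a prevalence of 20% the predictive value of a positive result is 95/99, or about 96%, and at 80% it is 380/381, or about 99.7%, matching Fig. 4C and Fig. 4B. The likelihood ratio itself is the same in every row.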

Although differences in prevalence alone should not affect 

the sensitivity or specificity of a test, in many clinical 

settings disease prevalence and severity may be related. For 

instance, rheumatoid arthritis seen in a family physician’s 

office will be relatively uncommon, and most patients will 

have a relatively mild case. In contrast, rheumatoid arthritis 

will be common in a rheumatologist’s office, and patients 

will tend to have relatively severe disease. Tests to diagnose 

rheumatoid arthritis in the rheumatologist’s waiting area 

(e.g., hand inspection for joint deformity) are likely to be 

relatively more sensitive not because of the increased 

prevalence but because of the spectrum of disease present 

(e.g., degree and extent of joint deformity) in this setting. 


• Disease prevalence has no direct effect on test characteristics 

(e.g., likelihood ratios, sensitivity, and specificity). 

• Spectrum of disease and disease prevalence have different 

effects on diagnostic test characteristics. 

Conclusions 

Clinicians need to understand how and when the choice 

of patients for a diagnostic test study may affect the performance 

of the test. Both disease spectrum in patients with 

the condition of interest and the spectrum of competing 

conditions in patients without the condition of interest can 

affect the test’s apparent diagnostic power. Despite the potentially 

powerful impact of disease spectrum and competing 

conditions, changes in prevalence that do not reflect 

changes in spectrum will not alter test performance. 


From the Knowledge and Encounter Research Unit, Department of Medicine, Mayo Clinic College of Medicine, Rochester, Minn. (Montori); the Departments of Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Contributors: Victor Montori, as principal author, oversaw and contributed to the writing of the manuscript. Thomas Newman reviewed the manuscript at all phases of development and contributed to the writing as coauthor of tip 2. Sheri Keitz used all tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are reported in the manuscript. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence with format specifications. Gordon Guyatt developed the original idea for tips 1 and 2, reviewed the manuscript at all phases of development, contributed to the writing as coauthor, and reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content as general editor.

References

1. Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual for evidence-based clinical practice. Chicago: AMA Press; 2002. p. 121-40.

2. Wyer P, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for 

learning and teaching evidence-based medicine: introduction to the series. 

CMAJ 2004;171(4):347-8. 

3. Dao Q, Krishnaswamy P, Kazanegra R, Harrison A, Amirnovin R, Lenert L, 

et al. Utility of B-type natriuretic peptide in the diagnosis of congestive heart 

failure in an urgent-care setting. J Am Coll Cardiol 2001;37:379-85. 

4. Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et 

al.; Breathing Not Properly Multinational Study Investigators. Rapid measurement 

of B-type natriuretic peptide in the emergency diagnosis of heart 

failure. N Engl J Med 2002;347:161-7. 

5. McCullough PA, Nowak RM, McCord J, Hollander JE, Herrmann HC, Steg 

PG, et al. B-type natriuretic peptide and clinical judgment in emergency diagnosis 

of heart failure: analysis from Breathing Not Properly (BNP) Multinational 

Study. Circulation 2002;106:416-22. 

CMAJ AUG. 16, 2005; 173 (4) 389


6. Hohl CM, Mitelman BY, Wyer P, Lang E. Should emergency physicians use 

B-type natriuretic peptide testing in patients with unexplained dyspnea? Can J 

Emerg Med 2003;5:162-5. 

7. Schwam E. B-type natriuretic peptide for diagnosis of heart failure in emergency 

department patients: a critical appraisal. Acad Emerg Med 2004;11:686-91. 

8. Tandberg D, Deely JJ, O’Malley AJ. Generalized likelihood ratios for quantitative 

diagnostic test scores. Am J Emerg Med 1997;15:694-9. 

9. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen 

JH, et al. Empirical evidence of design-related bias in studies of diagnostic 

tests. JAMA 1999;282:1061-6. 

10. Product insert. Available: www.beckman.com/literature/ClinDiag/08109.D.pdf (accessed 13 Jul 2005).

11. Lauszus FF. Clinical trial of 2 highly sensitive pregnancy tests — Tandem 

ICON HCG-urine and OPCO On-step Pacific Biotech. Ugeskr Laeger 1992; 

154:2069-70. 

12. Mishalani SH, Seliktar J, Braunstein GD. Four rapid serum–urine combination 

assays of choriogonadotropin (hCG) compared and assessed for their utility 

in quantitative determinations of hCG. Clin Chem 1994;40:1944-99. 


Pelham NY 10804; fax 212 305-6792; pwyer@att.net 

















Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz 

S, et al. Tips for learners of evidence-based medicine: 

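1. Relative risk reduction, absolute risk reduction and 

number needed to treat. CMAJ 2004;171(4):353-8. 

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, 

Moyer V, et al. Tips for learners of evidence-based 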

medicine: 2. Measures of precision (confidence intervals). 

CMAJ 2004;171(6):611-5. 

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, 

Guyatt G, et al. Tips for learners of evidence-based 

medicine: 3. Measures of observer variability (kappa 

statistic). CMAJ 2004;171(11):1369-73. 

Hatala R, Keitz S, Wyer P, Guyatt G; for the Evidence- 

Based Medicine Teaching Tips Working Group. Tips 

for learners of evidence-based medicine: 4. Assessing 

heterogeneity of primary studies in systematic reviews 

and whether to combine their results. CMAJ 2005; 

172(5):661-5. 

