Nature Biotechnology

volume 28 number 8 august 2010 

editorials 

761 Wrong numbers? 

761 MAQC-II: analyze that! 

© 2010 Nature America, Inc. All rights reserved. 

A computer-generated representation 

of HIV on the surface of a 

T lymphocyte. Holt et al. block the 

entry of HIV into blood cells by using 

zinc finger nucleases to knock out 

CCR5 in hematopoietic stem cells 

(p 839). Credit: ANIMATE4.com/ 

SciencePhotoLibrary 

Jackson Lab’s legal woes, p 768 

news 

763 Industry makes strides in melanoma 

765 Firms combine experimental cancer drugs to speed development 

767 FDA transparency rules could hit small companies hardest 

767 Supremes rule on Bilski 

768 Lawsuits rock Jackson 

769 Food firms test fry Pioneer’s trans fat–free soybean oil 

769 Anti-CD20 patent battle ends 

769 EU states free to ban GM crops 

770 GM alfalfa—who wins? 

770 Biofuel ‘Made in China’ 

771 data page: 2Q10—spreading the wealth 

772 News feature: Drugmakers dance with autism 

Bioentrepreneur 

Building a business 

775 At ground level 

Julian Bertschinger 

opinion and comment 

CORRESPONDENCE 

778 Waking up and smelling the coffee 

779 Genetic stability in two commercialized transgenic lines (MON810) 

780 Distances needed to limit cross-fertilization between GM and conventional 

maize in Europe 

Nature Biotechnology (ISSN 1087-0156) is published monthly by Nature Publishing Group, a trading name of Nature America Inc. located at 75 Varick Street, 

Fl 9, New York, NY 10013-1917. Periodicals postage paid at New York, NY and additional mailing post offices. Editorial Office: 75 Varick Street, Fl 9, New York, 

NY 10013-1917. Tel: (212) 726 9335, Fax: (212) 696 9753. Annual subscription rates: USA/Canada: US$250 (personal), US$3,520 (institution), US$4,050 

(corporate institution). Canada add 5% GST #104911595RT001; Euro-zone: €202 (personal), €2,795 (institution), €3,488 (corporate institution); Rest of world 

(excluding China, Japan, Korea): £130 (personal), £1,806 (institution), £2,250 (corporate institution); Japan: Contact NPG Nature Asia-Pacific, Chiyoda Building, 

2-37 Ichigayatamachi, Shinjuku-ku, Tokyo 162-0843. Tel: 81 (03) 3267 8751, Fax: 81 (03) 3267 8746. POSTMASTER: Send address changes to Nature 

Biotechnology, Subscriptions Department, 342 Broadway, PMB 301, New York, NY 10013-3910. Authorization to photocopy material for internal or personal 

use, or internal or personal use of specific clients, is granted by Nature Publishing Group to libraries and others registered with the Copyright Clearance Center 

(CCC) Transactional Reporting Service, provided the relevant copyright fee is paid direct to CCC, 222 Rosewood Drive, Danvers, MA 01923, USA. Identification 

code for Nature Biotechnology: 1087-0156/04. Back issues: US$45, Canada add 7% for GST. CPC PUB AGREEMENT #40032744. Printed by Publishers 

Press, Inc., Lebanon Junction, KY, USA. Copyright © 2010 Nature America, Inc. All rights reserved. Printed in USA. 



COMMENTARY 

783 case study: India’s billion dollar biotech 

Justin Chakma, Hassan Masum, Kumar Perampaladas, Jennifer Heys & 

Peter A Singer 

784 DNA patents and diagnostics: not a pretty picture 

Julia Carbone, E Richard Gold, Bhaven Sampat, Subhashini Chandrasekharan, 

Lori Knowles, Misha Angrist & Robert Cook-Deegan 

Rapid bacterial engineering, p 812 

feature 

793 Public biotech 2009—the numbers 

Brady Huggett, John Hodgson & Riku Lähteenmäki 


[Figure: states 1–26 versus chromatin marks (histone acetylations and methylations, PolII, CTCF, H2AZ).]

Epigenetic marks define chromatin states, p 817

patents 

801 Bilski v. Kappos: the US Supreme Court broadens patent subject-matter eligibility 

William J Simmons 

806 Recent patent applications in proteomics 

NEWS AND VIEWS 

807 Can HIV be cured with stem cell therapy? 

Steven G Deeks & Joseph M McCune see also p 839 

810 Microarrays in the clinic 

Guy W Tillinghast see also p 827 

812 Shaking up genome engineering 

Kim A Tipton & John Dueber see also p 856 

813 The expanding family of dendritic cell subsets 

Hideki Ueno, A Karolina Palucka & Jacques Banchereau 

816 Research highlights 

computational biology 

analysis 

817 Discovery and characterization of chromatin states for systematic annotation of 

the human genome 

Jason Ernst & Manolis Kellis 


Evaluating microarray classifiers, 

p 827 

research 

ARTICLES 

827 The MicroArray Quality Control (MAQC)-II study of common practices for the 

development and validation of microarray-based predictive models 

MAQC Consortium see also p 810 

839 Human hematopoietic stem/progenitor cells modified by zinc-finger nucleases 

targeted to CCR5 control HIV-1 in vivo 

N Holt, J Wang, K Kim, G Friedman, X Wang, V Taupin, G M Crooks, D B Kohn, 

P D Gregory, M C Holmes & P M Cannon see also p 807 



848 Cell type of origin influences the molecular and functional properties of mouse 

induced pluripotent stem cells 

J M Polo, S Liu, M E Figueroa, W Kulalert, S Eminli, K Yong Tan, E Apostolou, 

M Stadtfeld, Y Li, T Shioda, S Natesan, A J Wagers, A Melnick, T Evans & 

K Hochedlinger 

856 Rapid profiling of a microbial genome using mixtures of barcoded oligonucleotides 

J R Warner, P J Reeder, A Karimpour-Fard, L B A Woodruff & R T Gill 

see also p 812 

letters 

Epigenetics of iPS cells, p 848 

863 Implications of the presence of N-glycolylneuraminic acid in recombinant 

therapeutic glycoproteins 

D Ghaderi, R E Taylor, V Padler-Karavani, S Diaz & A Varki 

868 Global analysis of lysine ubiquitination by ubiquitin remnant immunoaffinity 

profiling 

G Xu, J S Paige & S R Jaffrey 


careers and recruitment 

875 Second quarter biotech job picture 

Michael Francisco 

876 people 



in this issue 


MAQC-II: evaluating microarray 

classifiers 

Building on its original work 

assessing the technical performance 

of DNA microarray technology (http://www.nature.com/nbt/focus/maqc/index.html), the Microarray Quality

Control (MAQC) consortium, a 

partnership of research groups from 

the US Food and Drug Administration 

(FDA), academia, industry and other government agencies, has 

set out to investigate the capabilities and limitations of microarray 

data analysis with respect to disease diagnosis or choice of 

therapies. Although numerous methods for analyzing microarray 

data have been developed, there remains a lack of consensus 

regarding best practices in terms of their use in identifying gene 

signatures that are representative of a pathological condition. 

Such practices are becoming increasingly important, especially 

as the FDA receives many proposals to use microarrays to support 

medical product development and testing. In the present paper, 36 

data analysis teams applied a variety of analytic methods to build 

classifiers to predict the toxicity of chemicals in rodent models and 

to predict clinical outcomes in human patients with breast cancer, 

multiple myeloma or neuroblastoma. The experience gained during 

this large project may be useful for developing classifiers for data 

from other high-throughput assays. This is important in light of 

the study’s finding that microarrays perform poorly at making 

certain clinical predictions, suggesting that technologies that 

assay additional aspects of human physiology may be needed to 

formulate better clinical treatment plans. [Articles, p. 827;
News and Views, p. 810]

CM 

Engineered stem cells control HIV 

Cannon and colleagues present an anti-HIV strategy in which human 

hematopoietic stem/progenitor cells are modified with zinc-finger 

nucleases to knock out C-C chemokine receptor 5 (CCR5), the principal 

co-receptor for HIV. CCR5 has been a target of exceptional interest 

ever since the 1996 discovery that a homozygous 32-bp deletion in 

the gene confers resistance to HIV infection without any apparent 

ill effects on health. Most previous work has used small molecules, 

ribozymes or siRNA to inhibit CCR5 protein or mRNA. In contrast, 

Cannon and colleagues nucleofect plasmids expressing two zinc-finger
nucleases into human CD34+ stem/progenitor cells to permanently
knock out the CCR5 gene. The modified cells are transplanted
into irradiated, immunodeficient mice and allowed to engraft for
8–12 weeks before the mice are challenged with CCR5-tropic HIV.
Although human T cell counts initially decline, by week 8 they have
recovered to their original levels. By weeks 10 and 12, HIV RNA in
the intestine is undetectable. Because hematopoietic stem cells can
reconstitute the entire hematopoietic system, the authors propose that
modified CD34+ cells could provide long-term HIV resistance in all
the lymphoid and myeloid cell types that the virus infects. In support
of this hypothesis, a transplant of allogeneic CCR5Δ32 hematopoietic
stem cells in an HIV+ individual with acute myeloid leukemia may
have cured the HIV infection (N. Engl. J. Med. 360, 724–725, 2009).
[Articles, p. 839; News and Views, p. 807]

KA

Written by Kathy Aschheim, Markus Elsner, Michael Francisco,
Peter Hare, Craig Mak & Lisa Melton

Epigenetic marks stand together 

[Figure: tracks for State 2, State 3, State 5, State 37 and State 38, with Coding Exon, Spliced ESTs and Mammalian Conservation annotations.]

With over 100 known 

histone modifications 

that can occur in 

thousands of possible 

combinations, it is challenging 

to identify specific combinations that have distinct biological 

functions. Ernst and Kellis describe an algorithm that deduces 

chromatin states (reoccurring, spatially coherent combinations of 

epigenetic marks) from experimental data on the distribution of different 

modifications. Using a multivariate Hidden Markov Model to 

analyze data on the position of 41 different marks in human T cells, 

they define 51 distinct chromatin states. The authors correlate these 

states with prior genome annotation and find that individual states 

are associated with specific functional regions such as gene promoters, 

transcriptionally active genes, large-scale repressed regions or 

intergenic active regions. The identification of chromatin states will 

facilitate genome annotation, the discovery of functional elements, 

and mechanistic studies of gene regulation by epigenetic marks. 

[Analysis, p. 817] 

ME 
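As a rough illustration of the idea summarized above, the sketch below decodes a toy multivariate hidden Markov model in which each genomic bin carries a binary vector of marks and each state emits every mark independently. It is not Ernst and Kellis's implementation: the two states, three marks and all probabilities are invented for illustration, and the study itself used 41 marks and 51 states.

```python
import math

# Hypothetical toy model: 2 states and 3 marks, all numbers invented.
STATES = ["promoter_like", "background"]
MARKS = ["H3K4me3", "H3K27ac", "H3K36me3"]

# P(mark present | state), one entry per mark in MARKS.
EMIT = {
    "promoter_like": [0.9, 0.8, 0.1],
    "background": [0.05, 0.05, 0.1],
}
# P(next state | current state); sticky transitions favor spatial coherence.
TRANS = {
    "promoter_like": {"promoter_like": 0.8, "background": 0.2},
    "background": {"promoter_like": 0.1, "background": 0.9},
}
START = {"promoter_like": 0.5, "background": 0.5}


def log_emission(state, observed):
    """Log-probability of a binary mark vector under independent Bernoullis."""
    total = 0.0
    for p, present in zip(EMIT[state], observed):
        total += math.log(p if present else 1.0 - p)
    return total


def viterbi(bins):
    """Return the most probable state path for a list of binary mark vectors."""
    scores = {s: math.log(START[s]) + log_emission(s, bins[0]) for s in STATES}
    back = []
    for observed in bins[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this bin.
            prev = max(STATES, key=lambda r: scores[r] + math.log(TRANS[r][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + log_emission(s, observed)
            )
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Five consecutive bins; each tuple is (H3K4me3, H3K27ac, H3K36me3) presence.
bins = [(0, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 0, 1)]
path = viterbi(bins)
print(path)
```

In this toy run the two bins carrying H3K4me3 are assigned to the promoter-like state and the rest to background. A real analysis would also have to learn the emission and transition parameters from data rather than fix them by hand.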

Faster trait-to-gene mapping 

[Figure: chr1 region 242959000–242961500 annotated with a low-expression promoter state and a new exon prediction.]

Gill and colleagues describe an approach for creating rationally modified 

collections of Escherichia coli in which every strain contains the 

same defined mutation but in a different gene. Such collections are 

valuable tools for mapping the genetic basis of traits, but until now 

have been labor intensive to construct. The method creates thousands 

of modified strains in parallel by transforming bacteria with 

pools of oligonucleotides that each recombine with a single gene 

to introduce a mutation. Barcode sequence tags uniquely identify 

each oligo and thus each strain. The collection of strains is grown 

in a condition of interest that selects for genetic modifications that 

confer fitness advantages. Fitter strains are recovered and identified 

by sequencing or by microarray detection of their barcodes. To demonstrate 

the method, Gill and colleagues created collections of E. coli 

with strains in which single genes were either up- or downregulated. 

Growing these strains in cellulosic hydrolysates—a toxic intermediate 

of biofuel processing—or in the presence of valine, d-fucose or 

methylglyoxal revealed unexpected genes that influenced growth in

these industrially relevant conditions. The identified genes could 

form the basis for subsequent combinatorial genetic engineering. 

[Articles, p. 856; News and Views, p. 812] 

CM 
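The barcode readout described above can be pictured as a simple counting exercise: tally each barcode before and after selection and rank genes by the change in relative abundance. The sketch below is a hypothetical illustration only, not the authors' pipeline. The barcode sequences, gene names, pseudocount and the assumption that each read is exactly one barcode are all invented.

```python
import math
from collections import Counter

# Hypothetical barcode-to-gene lookup; a real one would come from the oligo design.
BARCODE_TO_GENE = {"ACGT": "geneA", "TTAG": "geneB", "GGCA": "geneC"}


def barcode_frequencies(reads, pseudocount=1):
    """Count known barcodes and return per-gene relative frequencies.

    Reads that match no known barcode are ignored. The pseudocount keeps
    genes with zero reads from producing a division by zero or log of zero.
    """
    counts = Counter(BARCODE_TO_GENE[r] for r in reads if r in BARCODE_TO_GENE)
    genes = sorted(set(BARCODE_TO_GENE.values()))
    total = sum(counts.values()) + pseudocount * len(genes)
    return {g: (counts[g] + pseudocount) / total for g in genes}


def log2_enrichment(before_reads, after_reads):
    """Log2 ratio of each gene's frequency after selection to before."""
    before = barcode_frequencies(before_reads)
    after = barcode_frequencies(after_reads)
    return {g: math.log2(after[g] / before[g]) for g in before}


# Toy data: equal representation before selection, skewed afterwards.
before = ["ACGT"] * 10 + ["TTAG"] * 10 + ["GGCA"] * 10
after = ["ACGT"] * 25 + ["TTAG"] * 4 + ["GGCA"] * 1 + ["NNNN"] * 3
scores = log2_enrichment(before, after)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here the strain tagged for `geneA` rises in frequency under selection and so ranks first. Positive scores mark modifications that appear to confer a fitness advantage in the tested condition.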



Ubiquitination sites in the crosshairs 

Immunoaffinity-based approaches have been 

key to enabling proteome-wide analysis of 

post-translational modifications such as phosphorylation. 

However, attempts to selectively 

purify ubiquitinated peptides on a large scale 

have been frustrated by the difficulty of isolating and identifying peptides 

tagged with the 76-amino-acid ubiquitin protein. Jaffrey and colleagues 

simplify such analyses by generating a monoclonal antibody that selectively 

recognizes sites of protein ubiquitination. When protein lysates are digested 

with trypsin, ubiquitin adducts are trimmed to a diglycine stub. The ability 

of the antibody to recognize these ubiquitin remnants conjugated to the 

side chains of ubiquitinated lysines in a range of sequence contexts enables 

the authors to enrich for peptides carrying sites of ubiquitination and then 

identify them using tandem mass spectrometry. Working with cells expressing 

hexahistidine-tagged ubiquitin, the authors use this strategy to extend 

the catalog of mammalian ubiquitinated proteins and further illustrate the 

strength of the approach by demonstrating differential regulation of ubiquitination 

at distinct sites within the same protein. [Letters, p. 868] PH 

Patent Roundup

The US Food and Drug Administration is proposing new
transparency rules to increase the information it discloses
about product applications. The rules could compromise trade
secret protection and put small companies at a competitive
disadvantage. [News Analysis, p. 767] LM

The US Supreme Court’s long-awaited decision on Bilski v.
Kappos rules against limiting patent eligibility to inventions that
are tied to a machine or transform an article. But the ruling leaves
several questions unanswered, especially with regard to the
eligibility of patents for diagnostic methods. [News in brief, p. 767] LM

The not-for-profit Jackson Laboratory has been caught up in
patent disputes, for the first time in its 80-year history. If the
expense of such litigation escalates, the lab may have to cover its
costs by charging researchers higher prices for access to mouse
strains in its repository. [News in brief, p. 768] LM

A four-year dispute over a European patent for an anti-CD20
monoclonal antibody to treat rheumatoid arthritis has ended in
favor of Trubion, based in Seattle, and against Genentech and
Biogen Idec. The decision frees up the patent space for anyone
contemplating a CD20 program, according to Trubion. [News in
brief, p. 769] LM

Both sides are claiming victory following the US Supreme Court’s
verdict in Monsanto v. Geertson Seed Farms over future sales of
Roundup Ready alfalfa seeds. Monsanto (St. Louis, MO) cheered
the court’s decision to reverse a previous injunction banning the
transgenic alfalfa, but the seeds’ commercialization is still subject
to an environmental impact statement by the US Department of
Agriculture. [News in brief, p. 770] LM

The US Supreme Court recently broadened the definition of
patent-eligible subject matter. In this issue, Simmons parses Bilski
v. Kappos and what the far-reaching decision means for biotech
and pharmaceutical patent seekers. [Patent Article, p. 801] MF

Recent patent applications in proteomics. [New Patents, p. 806] MF

Neu5Gc content and biologics

Much effort has been devoted to reducing the immunogenicity of protein
biologics caused by peptide epitopes. However, far less attention has been
paid to the possibility of untoward effects caused by immune reactions to

glycans on glycoprotein therapeutics. Varki and colleagues present evidence 

suggesting that it may be necessary to revisit whether the presence 

of the sialic acid N-glycolylneuraminic acid (Neu5Gc) on certain glycoprotein 

drugs may influence their immunogenicity and half-lives in vivo. 

Unlike other mammals studied to date, humans lack the ability to make 

Neu5Gc. Nonetheless, recent studies have revealed that most of us have 

variable—and sometimes relatively high—levels of circulating antibodies 

against Neu5Gc. The authors demonstrate the presence of Neu5Gc 

on only one of two clinically approved monoclonal antibodies directed 

against the same target. In vitro, antibodies or antisera against Neu5Gc 

from healthy humans generate immune complexes only in the presence 

of the Neu5Gc-containing drug. Moreover, antibodies to Neu5Gc in mice 

with a human-like defect in Neu5Gc synthesis promote the clearance of 

only the Neu5Gc-containing drug. Injection of this drug also promotes 

the production of preexisting antibodies against Neu5Gc. If further studies 

support the possibility that antibodies against Neu5Gc might influence 

the immunogenicity and efficacy of therapeutic glycoproteins in 

humans, production using cultured human cells may not resolve the issue, 

as Neu5Gc could still be incorporated from animal-derived products in 

culture media. Varki and colleagues show that a better solution would be 

to displace Neu5Gc from being incorporated into recombinant proteins 

by inclusion of an excess of the human sialic acid N-acetylneuraminic acid 

in culture media. [Letters, p. 863] 

PH 

Epigenetic memory in iPS cells 

All induced pluripotent stem (iPS) 

cells from different tissues are not 

created equal. That is the conclusion 

of a study comparing mouse 

iPS cells derived from four tissues— 

tail-tip fibroblasts, splenic B cells, 

bone marrow–derived granulocytes and skeletal muscle precursors. 

Hochedlinger and colleagues use a ‘secondary’ system for reprogramming 

(Nat. Biotechnol. 26, 916–924, 2008) so that all iPS cells have identical 

integrations of the four transgenes, eliminating this confounding variable. 

They find that early-passage iPS cells retain an epigenetic memory of their 

cell type of origin and that this memory alters the cells’ gene expression 

and differentiation potential. Notably, these epigenetic, transcriptional and 

functional differences can be attenuated by extended passaging. Several 

lines of evidence suggest that this erasure of epigenetic memory occurs 

not through the selection of rare, fully reprogrammed cells but through

gradual epigenetic changes in the majority of cells. Epigenetic memory 

in iPS cells can be considered desirable or not depending on one’s experimental 

goals. In studies aimed at producing a specific cell type, it could be 

beneficial—suggesting, for example, that a project to generate blood cells 

should begin by reprogramming blood cells rather than an unrelated cell 

type. [Articles, p. 848] 

KA 

Next month in Nature Biotechnology

• Castor bean genome 

• Benchmarking dynamic mass redistribution 

• Measuring protein-DNA interactions at equilibrium 

• Metabolic modeling made easier 



www.nature.com/naturebiotechnology 

EDITORIAL OFFICE 

biotech@us.nature.com 

75 Varick Street, Fl 9, New York, NY 10013-1917 

Tel: (212) 726 9200, Fax: (212) 696 9635 

Chief Editor: Andrew Marshall 

Senior Editors: Laura DeFrancesco (News & Features), Kathy Aschheim (Research), 

Peter Hare (Research), Michael Francisco (Resources and Special Projects) 

Business Editor: Brady Huggett 

Associate Business Editor: Victor Bethencourt 

News Editor: Lisa Melton 

Associate Editors: Markus Elsner (Research), Craig Mak (Research) 

Editor-at-Large: John Hodgson 

Contributing Editors: Mark Ratner, Chris Scott 

Contributing Writer: Jeffrey L. Fox 

Senior Copy Editor: Teresa Moogan 

Managing Production Editor: Ingrid McNamara 

Senior Production Editor: Brandy Cafarella 

Production Editor: Amanda Crawford 

Senior Illustrator: Katie Vicari 

Illustrator/Cover Design: Kimberly Caesar 

Senior Editorial Assistant: Ania Levinson 

MANAGEMENT OFFICES 

NPG New York 

75 Varick Street, Fl 9, New York, NY 10013-1917 

Tel: (212) 726 9200, Fax: (212) 696 9006 

Publisher: Melanie Brazil 

Executive Editor: Linda Miller 

Chief Technology Officer: Howard Ratner 

Head of Nature Research & Reviews Marketing: Sara Girard 

Circulation Manager: Stacey Nelson 

Production Coordinator: Diane Temprano 

Head of Web Services: Anthony Barrera 

Senior Web Production Editor: Laura Goggin 

NPG London 

The Macmillan Building, 4 Crinan Street, London N1 9XW 

Tel: 44 207 833 4000, Fax: 44 207 843 4996 

Managing Director: Steven Inchcoombe 

Publishing Director: Peter Collins 

Editor-in-Chief, Nature Publications: Philip Campbell 

Marketing Director: Della Sar 

Director of Web Publishing: Timo Hannay 

NPG Nature Asia-Pacific 

Chiyoda Building, 2-37 Ichigayatamachi, Shinjuku-ku, Tokyo 162-0843 

Tel: 81 3 3267 8751, Fax: 81 3 3267 8746 

Publishing Director — Asia-Pacific: David Swinbanks 

Associate Director: Antoine E. Bocquet 

Manager: Koichi Nakamura 

Operations Director: Hiroshi Minemura 

Marketing Manager: Masahiro Yamashita 

Asia-Pacific Sales Director: Kate Yoneyama 

Asia-Pacific Sales Manager: Ken Mikami 

DISPLAY ADVERTISING 

display@us.nature.com (US/Canada) 

display@nature.com (Europe) 

nature@natureasia.com (Asia) 

Global Head of Advertising and Sponsorship: Dean Sanderson, Tel: (212) 726 9350, 

Fax: (212) 696 9482 

Global Head of Display Advertising and Sponsorship: Andrew Douglas, Tel: 44 207 843 4975, 

Fax: 44 207 843 4996 

Asia-Pacific Sales Director: Kate Yoneyama, Tel: 81 3 3267 8765, Fax: 81 3 3267 8746 

Display Account Managers: 

New England: Sheila Reardon, Tel: (617) 399 4098, Fax: (617) 426 3717 

New York/Mid-Atlantic/Southeast: Jim Breault, Tel: (212) 726 9334, Fax: (212) 696 9481 

Midwest: Mike Rossi, Tel: (212) 726 9255, Fax: (212) 696 9481 

West Coast: George Lui, Tel: (415) 781 3804, Fax: (415) 781 3805 

Germany/Switzerland/Austria: Sabine Hugi-Fürst, Tel: 41 52761 3386, Fax: 41 52761 3419 

UK/Ireland/Scandinavia/Spain/Portugal: Evelina Rubio-Hakansson, Tel: 44 207 014 4079, 

Fax: 44 207 843 4749 

UK/Germany/Switzerland/Austria: Nancy Luksch, Tel: 44 207 843 4968, Fax: 44 207 843 4749 

France/Belgium/The Netherlands/Luxembourg/Italy/Israel/Other Europe: Nicola Wright, 

Tel: 44 207 843 4959, Fax: 44 207 843 4749 

Asia-Pacific Sales Manager: Ken Mikami, Tel: 81 3 3267 8765, Fax: 81 3 3267 8746 

Greater China/Singapore: Gloria To, Tel: 852 2811 7191, Fax: 852 2811 0743 

NATUREJOBS 

naturejobs@us.nature.com (US/Canada) 

naturejobs@nature.com (Europe) 

nature@natureasia.com (Asia) 

US Sales Manager: Ken Finnegan, Tel: (212) 726 9248, Fax: (212) 696 9482 

European Sales Manager: Dan Churchward, Tel: 44 207 843 4966, Fax: 44 207 843 4596 

Asia-Pacific Sales & Business Development Manager: Yuki Fujiwara, Tel: 81 3 3267 8765, 

Fax: 81 3 3267 8752 

SPONSORSHIP 

g.preston@nature.com 

Global Head of Sponsorship: Gerard Preston, Tel: 44 207 843 4965, Fax: 44 207 843 4749 

Business Development Executive: David Bagshaw, Tel: (212) 726 9215, Fax: (212) 696 9591 

Business Development Executive: Graham Combe, Tel: 44 207 843 4914, Fax: 44 207 843 4749 

Business Development Executive: Reya Silao, Tel: 44 207 843 4977, Fax: 44 207 843 4996 

SITE LICENSE BUSINESS UNIT 

Americas: Tel: (888) 331 6288 

institutions@us.nature.com 

Asia/Pacific: Tel: 81 3 3267 8751 

institutions@natureasia.com 

Australia/New Zealand: Tel: 61 3 9825 1160 

nature@macmillan.com.au 

India: Tel: 91 124 2881054/55 

npgindia@nature.com 

ROW: Tel: 44 207 843 4759 

institutions@nature.com 

CUSTOMER SERVICE 

www.nature.com/help 

Senior Global Customer Service Manager: Gerald Coppin 

For all print and online assistance, please visit www.nature.com/help 

Purchase subscriptions: 

Americas: Nature Biotechnology, Subscription Dept., 342 Broadway, PMB 301, New York, NY 10013-3910, USA. Tel: (866) 363 7860, Fax: (212) 334 0879

Europe/ROW: Nature Biotechnology, Subscription Dept., Macmillan Magazines Ltd., Brunel Road, 

Houndmills, Basingstoke RG21 6XS, United Kingdom. Tel: 44 1256 329 242, Fax: 44 1256 812 358 

Asia-Pacific: Nature Biotechnology, NPG Nature Asia-Pacific, Chiyoda Building, 

2-37 Ichigayatamachi, Shinjuku-ku, Tokyo 162-0843. Tel: 81 3 3267 8751, Fax: 81 3 3267 8746 

India: Nature Biotechnology, NPG India, 3A, 4th Floor, DLF Corporate Park, Gurgaon 122002, India. 

Tel: 91 124 2881054/55, Tel/Fax: 91 124 2881052 

REPRINTS 

reprints@us.nature.com 

Nature Biotechnology, Reprint Department, Nature Publishing Group, 75 Varick Street, Fl 9, 

New York, NY 10013-1917, USA. 

For commercial reprint orders of 600 or more, please contact: 

UK Reprints: Tel: 44 1256 302 923, Fax: 44 1256 321 531 

US Reprints: Tel: (617) 494 4900, Fax: (617) 494 4960

Editorial 


Wrong numbers? 

With biotech infiltrating multiple industries and fewer 

life science ventures listing on stock exchanges, what 

do we really learn from surveying the set of public 

biotech companies? 

Each year, Nature Biotechnology trawls through the accounts of publicly 

quoted biotech companies and pulls out some numbers that characterize 

this part of the commercial life science landscape. Perhaps the most 

surprising statistic this year was that most of the companies that appeared 

in last year’s survey are still there. The current straitened circumstances 

took their toll, of course, but total revenues were up 10%, R&D was only 

down 4% and the group collectively was profitable for another year. But 

what, if anything, does the survey tell us about the general health of the 

innovative life science sector? 

Back in the 1990s, the answer seemed clear. Thanks to much freer flows 

of capital then, the annual audit measured the progress of a specialized, 

self-reliant and relatively independent industrial endeavor. It assessed the 

rapid churn of companies listing newly on exchanges. Companies could 

float much earlier; some were even able to go public without products in 

human trials. Buoyant stock markets took valuations to ecstatic heights 

and poured money into the sector. Product for product and dollar for dollar, 

biotech companies were valued much more highly than ‘traditional’ 

pharma companies. 

That differential was unsustainable. As Amgen and Genentech and 

Biogen Idec and others climbed up the pharmaceutical league standings, 

reality dawned. Innovators metamorphosed into drugmakers. And as the 

pharma sponge absorbed more biotech, the boundaries between the two 

spheres faded. 

The consequence of this merging is that many, if not most, of the
biological products and biological techniques now reside outside the

group of independent public companies that we survey. Pharma spends 

$65 billion a year on R&D, 25–40% of it either devoted to biological 

products or using the techniques of biotech. Thus, pharma outspends 

‘biotech,’ even on biotech R&D. Furthermore, biotech processes extend 

far beyond the pharmaceutical segment: political imperatives and 

technological capability have expanded industrial biotech for biofuels 

production, waste management and green chemistry. Geographically, 

biotech is no longer a Western province: China, India, South Korea and 

elsewhere are prominent actors in follow-on biologic drugs, diagnostics 

and clinical testing. 

Our public company survey reflects none of these changes: pharma 

companies, biogenerics firms, diagnostic and device providers all fall outside 

the definitions of our survey. In Asia, successful biotech companies 

(see p. 783) have only restricted access to mature public capital markets. 

Overall, the survey is now less a gauge for innovative life science and more 

a pointer to the shape of the Western healthcare market. To measure life 

sciences’ impact more broadly, other indicators are needed. 

To quantify innovation, we need to look, too, at activities within small 

private companies and, increasingly, at the early translational work in the 

public sector. These data are far more difficult to gather than

data from publicly quoted firms. Accordingly, policymakers, governments 

and industry associations need to devote much more effort and resources 

to collecting them. 

MAQC-II: analyze that! 

The MAQC consortium’s latest study suggests that human 

error in handling DNA microarray data analysis software 

could delay the technology’s wider adoption in the clinic. 

Following up on its publications in Nature Biotechnology four years ago 

(http://www.nature.com/nbt/focus/maqc/index.html), the Microarray 

Quality Control (MAQC) consortium publishes the results of its second 

phase of assessment (MAQC-II) on p. 827, in conjunction with ten accompanying 

papers in The Pharmacogenomics Journal (http://www.nature.com/tpj/journal/v10/n4/index.html).
The new work assesses the capabilities

and limitations of microarray data analysis methods—so-called 

genomic classifiers—in identifying gene signatures representative of a 

specific pathological condition. 

All in all, >30,000 genomic classifier models were built by combining 

one of 17 different data preprocessing and normalization methods, 

with one of 9 methods for filtering out problematic data, with one of >33 

techniques for picking ‘signature’ genes, with one of >24 algorithms for 

discerning patterns from those genes, and with one of 6 methods for testing 

the robustness of the results. Thirty-six research teams sought gene 

signatures within 6 massive microarray datasets derived from toxicological 

studies of chemicals on rodents and expression profiles of human cancer 

patients that predict 13 ‘endpoints’ potentially relevant to preclinical or 

clinical applications. 
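To put the figures in the preceding paragraph in proportion, the short calculation below multiplies the quoted number of options at each analysis step. It assumes, purely for illustration, that every option at one step could be paired with every option at the others. The step names are labels chosen here, and because the editorial gives ">33" and ">24" the product is only a lower bound.

```python
from math import prod

# Option counts per analysis step as quoted in the editorial.
# ">33" and ">24" are lower bounds, so the product is a lower bound too.
STEP_OPTIONS = {
    "preprocessing_normalization": 17,
    "data_filtering": 9,
    "signature_gene_selection": 33,
    "pattern_algorithm": 24,
    "robustness_testing": 6,
}

full_grid = prod(STEP_OPTIONS.values())
models_built = 30_000  # the editorial reports ">30,000"
fraction_explored = models_built / full_grid

print(f"full grid >= {full_grid:,} combinations")
print(f"models built ~ {fraction_explored:.1%} of that grid")
```

Under that assumption the full grid holds at least 727,056 combinations, so the >30,000 models actually built correspond to roughly 4% of it. The teams sampled the space of possible pipelines rather than exhausting it.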

As discussed on p. 810, one key finding of MAQC-II is that the classifier 

models are remarkably similar in predicting outcome, irrespective of the 

approach used. On the other hand, the overall success of the classifiers in 

predicting endpoints depends on the endpoints themselves. For example, 

predictions were in general much worse for breast cancer and multiple 

myeloma, which have highly heterogeneous genetic backgrounds, than for

liver toxicology or neuroblastoma. 

Perhaps most striking of all, some data analysis teams were consistently 

better at predictions than others. This may relate to simple errors 

associated with manipulating such large datasets. But insufficient tuning 

of the parameters used in a classifier model is also a likely contributor. 

In this sense, MAQC-II was as much an exercise in sociology as in 

technology. The human element in classifier implementation is key. 

Thus a key take-home message is that classifier protocols need to be 

more tightly described and more tightly executed. In this respect, regulatory 

agencies and scientific journals can promote good practice. A clear 

need exists for greater meticulousness both in documenting the parameters 

of a particular classifier model used and in detailing the procedures 

for normalization, batch effect correction, quality control and reduction 

of quality control flaws. Greater attention to detail will not only enhance 

reproducibility of research—it will also facilitate the progression of this 

technology toward the clinic. 

nature biotechnology volume 28 number 8 august 2010 761

news 

Industry makes strides in melanoma 


After decades of continuous failures, the treatment 

of metastatic melanoma is finally advancing. 

This year’s American Society for Clinical 

Oncology (ASCO) annual meeting heralded 

a breakthrough antibody therapy for the disease. 

Top-line, phase 3 results for Bristol-Myers 

Squibb’s fully human monoclonal antibody

(mAb) ipilimumab showed a survival benefit 

in patients with advanced cancer—the first 

ever phase 3 trial to do so. These results contrast 

with a litany of letdowns from cancer vaccines, 

cytokine therapies, adoptive T-cell therapies as 

well as several targeted therapies that all have 

failed to improve on standard chemotherapy, 

which itself achieves a meager 15% response rate 

with negligible survival benefit. “Those of us in 

the melanoma business have felt like we’ve been 

in a long, dark tunnel,” said oncologist Vernon 

Sondak, of the H. Lee Moffitt Cancer Center in 

Tampa, Florida, at the ASCO meeting. 

The ipilimumab data, released by New 

York–based Bristol-Myers Squibb in June, have 

changed all that. The 676 individuals included in 

the study had unresectable, metastatic melanoma 

and had previously undergone chemotherapy for 

the disease. Those receiving ipilimumab, with 

or without the synthetic peptide vaccine glycoprotein 

100 (gp100), had a median survival 

of about 10 months, against 6.4 months for the 

vaccine alone. Ipilimumab, which targets cytotoxic 

T-lymphocyte antigen 4 (CTLA4), nearly 

doubled the rates of survival at 12 months (46% 

versus 25%) and 24 months (24% versus 14%) 

after treatment compared with the peptide. 

“This is really a benchmark for the field,” says 

John Kirkwood, a melanoma researcher at the 

University of Pittsburgh. “We finally have a randomized 

controlled trial that is positive.” 
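The "nearly doubled" claim can be checked with simple ratios of the figures quoted above. This is a rough arithmetic sketch, not a statistical analysis: the ratios are not hazard ratios, "about 10 months" is entered as exactly 10, and the dictionary keys are labels chosen here for illustration.

```python
# Figures quoted in the article: (ipilimumab arms, gp100 vaccine alone).
# "about 10 months" is entered as 10 (an assumption of this sketch).
QUOTED_FIGURES = {
    "12-month survival (%)": (46, 25),
    "24-month survival (%)": (24, 14),
    "median survival (months)": (10, 6.4),
}


def ratio(treated, control):
    """Plain ratio of the treated figure to the control figure."""
    return treated / control


ratios = {name: round(ratio(*pair), 2) for name, pair in QUOTED_FIGURES.items()}

for name, value in ratios.items():
    print(f"{name}: {value}x the vaccine-alone figure")
```

On these numbers the 12-month survival rate comes closest to doubling, while the gains at 24 months and in median survival are smaller.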

Finalized phase 1 results of a BRAF inhibitor, 

developed by the Berkeley, California–based 

Plexxikon, are at least as dramatic. The small 

molecule PLX4032 (also RG7204), which 

Plexxikon is co-developing with Roche of Basel, 

specifically inhibits the V600E mutant BRAF, a 

constitutively active kinase present in more than 

half of metastatic melanomas. The drug produced 

an 81% response rate among 32 patients 

receiving the therapeutic dose. “The early effects 

are [as] profound, reliable and gratifying as 

one could ever want out of a cancer therapy,” 

says trial principal investigator Keith Flaherty 

of Massachusetts General Hospital in Boston. 

PLX4032 is now in phase 3. 

Although both compounds will almost certainly 

become approved drugs, they have limitations. 

Ipilimumab extends median survival but, 

strangely, has only an 11% overall response rate. 

And almost all patients on PLX4032 relapse, 

most within a year. Nevertheless, the two drugs 

have revitalized melanoma research. By using 

ipilimumab and PLX4032 in combination with 

a variety of standard and investigational agents 

—or with each other—researchers hope to push 

long-term survival of metastatic melanoma 

patients up from the roughly 10% combined 

cure rate now achievable with ipilimumab 

monotherapy and interleukin-2 (IL-2) monotherapy. 

“We’re going to move the cure rate of 

melanoma progressively up,” predicts melanoma 

researcher Mario Sznol, of Yale University in 

New Haven, “to what could be a very respectable 

30, 35, 40% of patients, over the course of 

the next several years.” 

Figure 1 Ipilimumab stimulates antitumor immunity by blocking CTLA4, a natural brake on T cells, and 

allowing their unimpeded ‘costimulation’. Ipilimumab is the first agent to extend survival in metastatic 

melanoma patients in phase 3. 

Anti-CTLA4 therapy has succeeded where 

other immunotherapies failed because, instead 

of trying to indirectly stimulate T cells by presenting 

tumor antigen to overcome immune 

tolerance, it activates T cells directly, by disabling 

a brake on T-cell activity. Normally, 

when a T cell is activated after CD28 binding 

of the B7 receptor on antigen-presenting cells, 

CTLA4 acts as a brake, trafficking from the 

T-lymphocyte cytosol to the surface to bind 

the B7 molecule with high affinity. Thus CTLA4

turns the T cell off. When the ipilimumab 

mAb is present it blocks CTLA4, keeping the 

T lymphocyte activated. The mAb also promotes 

unfettered binding of the T-cell CD28 

receptor to the antigen-presenting cell receptor 

B7, together with antigen presentation to the 

T-cell receptor (Fig. 1). Such ‘co-stimulation’ 

is necessary for T-cell activation, and antitumor 

immunity. Unfortunately, ipilimumab also triggers 

autoimmune side effects, some severe. A 

few patients have died from colitis-related bowel 

perforations, for example. But Kirkwood points 

out, “[for] the vast majority of patients, we can manage the side effects fairly easily, once you know how to look for them.”

Table 1 Selected phase 3 trials in metastatic melanoma

Company (location) | Product | Description
Bristol-Myers Squibb | Ipilimumab (MDX-010) | Fully human antibody targeting the CTLA-4 receptor on T cells
Plexxikon/Roche | PLX4032 | Small-molecule inhibitor of V600E mutant BRAF kinase
Abraxis Bioscience (Los Angeles) | Abraxane (nab-paclitaxel, ABI-007) | Nanoparticle albumin-bound paclitaxel (Taxol)
Eli Lilly (Indianapolis) | Tasisulam (LY573636) | Acyl sulfonamide, generates reactive oxygen species and induces apoptosis
Biovex (Woburn, Massachusetts) | OncoVEX | Oncolytic herpes simplex virus type-1 encoding granulocyte macrophage colony stimulating factor; selectively replicates in tumor cells, recruits dendritic cells
Novartis (Basel) | Tasigna (nilotinib, AMNN-107) | Small molecule oral c-kit kinase inhibitor for c-kit mutant melanoma
GlaxoSmithKline | Astuprotimut-r (MAGE-A3 ASCI) | Protein subunit vaccine based on melanoma-associated antigen A3 (MAGE-A3), specific for tumor cells
Vical (San Diego) | Allovectin-7 | DNA plasmid/lipid complex containing human leukocyte antigen B7 and beta-2 microglobulin DNA sequences that together form major histocompatibility class I; improves antigen presentation

Source: BioMedTracker & Nature Biotechnology

The one controversy in the phase 3 trial was the choice of the gp100 peptide vaccine, developed by the Bethesda, Maryland–based National Cancer Institute, as the active control arm for the study. The combination of this HLA-A0201–restricted peptide vaccine with high-dose IL-2 resulted in higher response rates and improved progression-free survival in an earlier randomized trial, hence the choice of gp100 for the control arm. Some researchers speculate that the vaccine may have hurt patients, thus giving ipilimumab an artificial statistical boost. (Certain vaccines have reduced survival in melanoma trials.) Kirkwood disagrees, because gp100 did not appear to cause harm in its other trials. “The issues regarding the control are, in my book, non-issues,” he says.

The question remains, why did ipilimumab succeed whereas tremelimumab, a similar anti-CTLA4 antibody from Pfizer, failed? It is possible that tremelimumab didn’t really fail. “[Pfizer] analyzed the trial early,” says Sznol. “You need to wait for the events to develop.” Sznol points out that some patients treated with anti-CTLA4 mAbs experience progression of their cancers initially, followed by regression, and that other patients have most of their lesions disappear while a few continue growing. All are classified as nonresponders, but some may live for a long time. It’s also possible, Sznol says, that the company used the wrong drug dose and schedule. Kirkwood agrees that Pfizer was probably too quick to analyze the data.

Pfizer defended the tremelimumab phase 3 trial dose and schedule in an e-mail, noting that phase 2 results (using the same dose and schedule as in phase 3) were very similar to ipilimumab’s despite the different dose regimens. Long-term phase 3 follow up did show a

survival advantage for the tremelimumab arm, 

but not enough to justify US Food and Drug 

Administration registration. Many patients in 

the tremelimumab trial control arm went on 

to receive ipilimumab in a compassionate use 

program, which could have decreased tremelimumab’s 

apparent effect. So circumstances, not 

biology, may have defeated tremelimumab. 

Any lingering ipilimumab doubts may 

disappear with a second completed phase 3 

trial, comparing ipilimumab plus dacarbazine 

chemotherapy to dacarbazine alone. Patient 

accrual ended more than two years ago, and 

results have not yet been reported. The delay 

suggests to many a successful trial, but no one 

knows for sure. 

No efficacy doubts exist for PLX4032. All 

agree the drug works, and works quickly, in 

the vast majority of patients with mutant BRAF 

tumors. Because PLX4032 targets the mutant form of the protein encoded by the BRAF oncogene, very high doses can be given without adverse effects on normal cells. Data

from several groups show, in fact, that PLX4032 

paradoxically activates BRAF signaling in normal 

cells. This pathway activation enhances the 

therapeutic window, but also probably leads to 

the appearance of skin lesions known as keratoacanthomas 

in many patients. They are benign, 

but raise the theoretical possibility that long-term

treatment could promote the growth of 

other cancers. 

But the main downside of PLX4032 is 

relapses. Median duration of response in 

phase 1 was about nine months. By historical 

standards, this is excellent, and a few patients 

have had complete responses lasting two years 

or more (they remain on the drug). But the 

relapses indicate a still-unknown form of drug 

resistance. Some residual BRAF signaling in 

tumor cells persists, despite treatment, and 

there are new data that the mitogen-activated 

protein (MAP) kinase signaling pathway is 

reactivated downstream of BRAF. In either 

case, combining a BRAF inhibitor with an 

inhibitor of MAP kinase kinase (MEK), which 

is immediately downstream of BRAF, could 

overcome resistance and prolong survival. Such 

a trial is now underway with GSK2118436—a 

small-molecule inhibitor of the V600E mutant 

BRAF—and MEK inhibitor GSK1120212, both 

from GlaxoSmithKline in London and soon to 

be in phase 2/3 studies. 

Meanwhile, PLX4032 is moving forward 

quickly. An already completed phase 2 trial 

will “we all believe … likely be enough for FDA 

approval next year,” says Flaherty. Phase 3 will 

definitively show whether PLX4032 changes the 

natural history of the disease and extends survival. 

The list of agents in phase 3 trials is growing 

(Table 1), although none of them displayed 

the efficacy of ipilimumab and PLX4032 in 

phase 2. One comparable compound, however, 

is Bristol-Myers Squibb’s fully human anti-PD-1

mAb, MDX-1106. PD-1, or programmed cell 

death-1, is a T-cell molecule that, like CTLA4, 

downregulates T-cell activity. It appears to be at 

least as powerful as CTLA4, and may function at 

the later stages of the immune response to shut 

down T cells. 

In phase 1, MDX-1106 treatment led to 15 

confirmed responses among 46 metastatic 

melanoma patients. As of June, none of the 

responders had relapsed, with more than a 

year passing in several cases. “This is one of 

the most promising starts I’ve seen for any 

drug,” said Sznol, the trial’s principal investigator. 

“It’s the kind of thing where we can’t 

sleep because we want to offer this to our next 

patient.” Autoimmune side effects occur, but 

fewer than with ipilimumab. A combination 

trial with ipilimumab has begun (see p. 765). 

The most anticipated combination is ipilimumab 

and PLX4032. This would bring 

together the quick responses of PLX4032 with 

ipilimumab’s ability to deliver cures. “The two 

are made for one another,” says Kirkwood. 

Tumor cells killed by PLX4032 should release 

antigen, enhancing ipilimumab’s ability to activate 

antitumor T cells. Flaherty says that the two 

sponsoring companies have agreed to collaborate 

on a large randomized combination trial, 

which should begin next year. 

Individually, ipilimumab and PLX4032 have 

ended the futility and nihilism that have long 

dominated melanoma treatment. It will take 

time to sort out the best combinations and the 

best way to apply them. “But at least the cupboard 

is not bare any more,” said Sondak. 

Ken Garber Ann Arbor, Michigan 

Firms combine experimental cancer drugs 

to speed development 


Tackling breast cancer. Drug 

developers are starting to combine 

novel, unapproved agents in search of 

synergistic activity. 

The next generation of cancer treatments 

could be approved in pairs, at least judging 

by a growing trend among drug makers to 

combine drugs early in development and the 

US Food and Drug Administration’s (FDA) 

willingness to regulate 

them. On 2 June, the 

FDA opened its public 

consultation into the 

formulation of guidance 

for combinations of 

investigational therapies. 

In the same week, Merck, 

of Whitehouse Station, 

New Jersey, reported at 

the annual American 

Society of Clinical 

Oncology meeting in 

Chicago that a combination 

of ridaforolimus, an 

oral inhibitor of mammalian 

target of rapamycin 

(mTOR) developed 

with Ariad of Cambridge, 

Massachusetts, and dalotuzumab, 

an antibody 

targeting the insulin-like growth factor 1 receptor (IGFR1), led to responses in a cluster of patients with highly proliferative, estrogen-receptor-positive breast cancers in a phase 1b trial.

Collaborations between different sponsors to 

combine drugs very early in development are 

unusual and pose new issues for regulators 

compared with oversight of combinations of 

agents already on the market. 

The FDA initiative is not limited to cancer—it 

also covers infection, seizure disorders 

and cardiovascular disease. But cancer 

drug makers, in particular, are grappling with 

some thorny questions as they attempt to 

translate their rapidly expanding knowledge 

of tumor biology into therapies that offer significant 

improvements on what is now available. 

Foremost among their concerns is how 

to accelerate clinical development to deliver 

solid efficacy data without compromising 

patient safety. “We’ve talked to the FDA about 

specific combinations and have received guidance 

on an ad hoc basis,” says Pearl Huang, vice 

president and oncology franchise integrator at 

Merck. “For us, the burning issue is if we demonstrate 

great activity for the combination, are 

we obligated to demonstrate lack of activity for 

the single agent alone?” 

Some claim combinations of investigational 

drugs could accelerate clinical development. 

Merck’s ridaforolimus-dalotuzumab program, 

which is due to enter phase 2 trials later this 

year, is a key initiative and is being closely 

scrutinized. It exemplifies a science-based 

approach to combining investigational drugs 

that may offer limited 

potential as single agents, 

but which may offer synergistic 

effects when administered 

together, as well as 

reducing the risk of drug 

resistance. Trials of several 

other combinations of 

new types of agents are also 

underway (Table 1). 

Although combination 

therapy in cancer—and 

other indications—is not 

a new theme, it has developed 

historically through 

trial and error. “Our knowledge 

of biological pathways 

and networks is so superficial 

it really is hard to come 

up with a strong rationale,” 

says Alan Ashworth, professor 

of molecular biology 

at the Institute of Cancer Research in London. 

The ridaforolimus-dalotuzumab combination 

emerged from an unbiased screen of a colon 

cancer cell line in which individual genes were 

systematically switched off using short hairpin 

RNAs, while each of the two drugs was

tested in turn in a cell proliferation assay. This 

kind of synthetically lethal screen can unveil 

dependencies between related pathways and 

overcome compensatory mechanisms that 

cancer cells switch to when only one target is 

hit. “Those types of approaches couldn’t be 

done before,” says Eric Rubin, vice president 

of clinical oncology at Merck. The upcoming 

phase 2 trial will recruit around 200 breast 

cancer patients, who will be assigned to one 

of four treatment arms, comprising either 

ridaforolimus as monotherapy, dalotuzumab 

as monotherapy, the two drugs in combination 

or exemestane, the active comparator. 

The key question is whether that kind of design 

would need to be replicated in a large-scale registration 

trial of a new combination comprising 

two investigational compounds. “What we have 

proposed—and others have as well—is to do this 

in a more limited setting,” Rubin says. Balancing 

regulators’ requirements for statistical power 

with patients’ needs for effective therapy is not 

a straightforward task, particularly if some trial 

participants are to receive single agents that are unlikely to confer any benefit, while at the same time, the duration of combination trials is significantly extended. Ashworth says that more innovative trial designs and early use of biomarkers can help, but only if there is already a solid case for moving a particular therapy into the clinic in the first place. “You need a very strong biological basis for your combination treatment,” he says. “If you need 4,000 patients to prove your hypothesis, I’m sorry mate, you’ve got the wrong hypothesis.”

Table 1 Selected targeted experimental combination cancer therapies in development

Company | Combination | Mechanism | Indication | Status
AstraZeneca (AZ) | Cediranib maleate (AZD2171) + olaparib | Vascular endothelial growth factor (VEGF) receptor inhibitor + poly(ADP-ribose) polymerase inhibitor | Recurrent ovarian cancer | Phase 1/2
AZ & Merck (Darmstadt, Germany) | Cediranib maleate + cilengitide | VEGF receptor inhibitor + integrin inhibitor | Recurrent glioblastoma | Phase 1b
GlaxoSmithKline (London) | GSK1120212 + GSK2141795 | MEK inhibitor + Akt kinase inhibitor | Solid tumors | Phase 1b
Novartis & GlaxoSmithKline | BKM120 + GSK1120212 | Phosphoinositide-3-OH kinase inhibitor + MEK inhibitor | Solid tumors | Phase 1b
AZ & Roche (Basel) | Cediranib maleate + RO4929097 | VEGF receptor inhibitor + γ-secretase inhibitor | Solid tumors | Phase 1
Bristol-Myers Squibb (New York) & Ono Pharma (London) | Ipilimumab + MDX-1106 | Cytotoxic T-lymphocyte antigen 4 (CTLA-4) inhibitor + programmed death-1 receptor (PD-1) inhibitor | Melanoma | Phase 1
Merck & Ariad | Dalotuzumab + ridaforolimus | Insulin-like growth factor receptor 1 (IGFR1) inhibitor + mTOR inhibitor | Neoplasms | Phase 1
Merck & AZ | MK-2206 + selumetinib | Akt inhibitor + MEK1/2 inhibitor | Solid tumors | Phase 1
Pfizer (New York) | Figitumumab + PF-00299804 | IGFR1 inhibitor + HER tyrosine kinase inhibitor | Solid tumors | Phase 1
Pfizer | Crizotinib + PF-00299804 | Met tyrosine kinase inhibitor + HER tyrosine kinase inhibitor | Non-small cell lung carcinoma | Phase 1
Roche | GDC-0449 + RO4929097 | Hedgehog antagonist + γ-secretase inhibitor | Breast cancer; sarcoma | Phase 1; Phase 1/2

Source: http://www.ClinicalTrials.gov

There is some precedent for rapid approval of investigational therapies based on a strong phase 2 efficacy signal, particularly when it is backed by a solid understanding of the underlying biological mechanism. For example, Novartis, of Basel, gained FDA approval for Gleevec (imatinib mesylate) in chronic myeloid leukemia on the basis of a phase 1b dose-escalating trial (N. Engl. J. Med. 344, 1031–1037, 2001). “If in a phase 2 trial, you’ve figured out the right dose and the correct schedule for a combination, and you get a dramatic change in efficacy, for example in a directed patient population, a path for that combination could be very straightforward,” says Bill Sellers, global head of oncology research at the Novartis Institutes for Biomedical Research, in Cambridge, Massachusetts. Head-to-head studies against the existing standard of care would also smooth the path toward approval, and combination therapies, he says, should aim for curative levels of efficacy rather than small, incremental improvements. “A major change in the rate of complete response or partial response to a therapeutic says you’ve killed a lot of the cancer.”

Many of the combinations being tested target different kinase enzymes. Merck’s Huang says the combination of their investigational anti-cancer agent MK-2206, which inhibits Akt (a component of the phosphatidylinositol-3 kinase pathway), with London-based AstraZeneca’s selumetinib (AZD6244), an inhibitor of the enzyme MEK, was chosen because each target is part of a canonical signal transduction pathway, downstream from a receptor tyrosine kinase. “They’re in parallel, but they also cross-talk,” she says. “They are not the cancer’s mutational drivers, they’re more the downstream effectors.”

Even so, insights into tumor biology do not always yield significant clinical benefits. “In oncology, what we think works and what [actually] works are two different things, and that’s why we need to do big studies,” says Justin Stebbing, a physician scientist based at Imperial College London. “The initial promise of biomarkers doesn’t hold up to scrutiny, ultimately.”

Matthew Ellis, professor of medicine at Washington University in St. Louis, Missouri, who recently published genomic analyses of cancer and normal tissues taken from an individual with breast cancer (Nature 464, 999–1005, 2010), has a different take: “My guess is we can solve the companion diagnostic problem by making full-genome sequencing of cancer the primary screen.” “We’re beginning to understand cancer genomes at a much more fundamental level than we ever have before,” he adds. “What we’re seeing, I think, is a great deal of complexity, much more complexity than was ever appreciated before.” This complexity is accompanied by an appreciable degree of heterogeneity: no two cancers appear the same. “We’re [starting] to classify them and put them into different buckets,” says James Zwiebel, chief of the investigational drug branch at the National

Cancer Institute, in Rockville, Maryland. 

“That’s really only scratching the surface. 

When you get down to it, every patient is 

going to have some unique characteristics.” 

That could make life more difficult for drug 

developers, he notes. 

This genome-level view of cancer, rather 

than the classic assumption of cancer as a 

disease affecting a particular organ, is turning 

our understanding of cancer on its head. 

Breast cancer perfectly illustrates the point. 

“When you do the genetics, what you see is 

a constellation of rare diseases,” Ellis says. 

In contrast, gastrointestinal stromal tumors, 

for example, seem to have a more uniform 

genetic profile. “You’ve got rare diseases 

defined by a common mutation, and we’re 

making progress,” he says. “We haven’t 

worked out how to handle the reverse situation, 

a common disease defined by multiple 

rare mutations.” 

Although the cost of individual genome 

sequencing is falling, Sellers says that full 

cancer genome sequencing may not be necessary 

to identify the dominant mutations that 

drive a particular cancer: partial approaches, 

based on techniques such as hybrid capture, 

targeted resequencing and high-throughput 

genotyping, may be sufficient. But even with 

the correct genomic information at hand, 

clinical progress will remain difficult, as 

combining two investigational agents correctly 

is not a straightforward task. “This is 

probably the biggest challenge: finding the 

effective and tolerated dose and, importantly, 

the schedule,” Sellers says. “I think this is 

probably a bigger challenge than the FDA 

regulatory challenge.” 

Cormac Sheridan Dublin 



FDA transparency rules could hit small 

companies hardest 

The US Food and Drug Administration (FDA) 

is considering changing how much information 

it discloses about product applications—news 

that biotechs have greeted with a mixture of 

trepidation and hope. The agency is proposing 

to make publicly available ‘complete response’ 

and ‘refuse-to-file’ letters for drugs and ‘not 

approvable’ letters for devices. From opinions 

gathered in advance of the final decision, it 

seems the smallest biotechs stand to lose the 

most. 

The proposed changes are wide-reaching 

and include some things most experts agree are 

good. On the upside, they say, this is an opportunity 

to make more information about what FDA 

does available to the public and ensure that data 

sources are more user-friendly. The downside, 

however, is the proposal to disclose information 

early in the approval process, including 

Investigational New Drug (IND) applications, 

holds and IND withdrawals. Few can see how 

revealing more information at the product 

application stage can be reconciled with trade 

secrets protection. 

The Biotechnology Industry Organization 

(BIO) wants more details about how these 

proposed regulations would be implemented. 

“They [FDA] define trade secrets [in the 

document], but oddly there is no definition 

of what constitutes competitive information,” 

explains Andrew Emmett, director for 

science and regulatory affairs at BIO, based 

in Washington, DC. The organization also 

wants clarification around who will decide 

what remains secret. Under current Freedom 

of Information Act regulations, Emmett 

says, companies have five days to determine 

whether documents that are going to be 

made public contain trade secrets that should 

be redacted. “We need to know exactly what 

the role of the sponsor will be in deciding 

what information is going to be shared,” he 

says. Otherwise, companies could be put at 

competitive disadvantage or become victims 

of wild speculation. 

The confidentiality issue is particularly critical 

for small biotechs. “When a small public 

company has a clinical trial pending, hedge 

fund managers do everything they can to get a 

sense of what the outcome might be,” says Alan 

Mendelson, senior partner at Los Angeles– 

headquartered law firm Latham and Watkins. 

If every pause in the clinical trial process gets 

announced to the public, it could lead to stock 

trading based on misleading or inadequate 

information. “It’s bad enough today,” he says.

“But at least now people are commenting on 

definitive data, not just a signal that might prove 

to be nothing.” 

Wayne Kubick, a vice president in safety at 

Waltham, Massachusetts–based PhaseForward, 

says companies with “limited products” are also 

going to be at greatest risk of competitive disadvantage. 

Competitors will be able to use some 

types of information better than others. Says 

Gregory Conko, senior fellow at the Competitive 

Enterprise Institute in Washington, DC, “It’s 

less important with complete response or rejection 

letters, but with a new drug application, a 

hold, or a withdrawal, that is where tipping off 

competitors is a much bigger concern.” Smaller 

companies are already at a disadvantage in the 

review process. In comments it filed in April, 

BIO pointed out that a recent study from the 

consulting firm Booz Allen Hamilton found that small

firms had only a 48% first-cycle approval rate 

for products in the priority review category, 

compared with a 78% rate for larger companies. 

In a survey of 168 of its members (http://www.bio.org/letters/20100412b.pdf), BIO also found

that “early, frequent and explicit communication 

with the FDA” was felt to be the most helpful 

means for first-time filers to improve their success 

rates. 

The transparency initiative could help shore 

up this communication weakness. “A variety 

of leaders have been pushing for more open 

and straightforward dialog with the agency for 

years,” says J. Donald deBethizy, president and 

CEO of Winston-Salem, North Carolina–based 

Targacept. “This initiative could provide a means 

for that.” Greater transparency could also put 

pressure on FDA to provide rationales for rejections, 

which critics charge are sometimes based 

on “petty” issues, according to Conko. 

Overall, such changes may not necessarily 

translate to better decision making, Conko 

warns. “FDA’s political incentives are still poorly 

aligned. Even when their rationale is weak, they 

still don’t have to pay a price for it,” he says. 

On the other hand, transparency is not necessarily 

a bad thing. “The world is very different 

already in 2010,” says Kubick. “We have

clinicaltrials.gov and a lot of other information 

already available.” But it means companies will 

face more instances where study data is used out 

of context. “You have to protect yourself against 

people who data mine and then hold up a little 

data nugget as the truth,” deBethizy says. 

Many are watching closely as the next phase 

of the initiative rolls out. “This is by no means a 

done deal,” says Kubick. “Some [of the proposed] 

things are going to happen, but not everything 

will.” Others are very skeptical, like Jack McLane, 

in brief 

Supremes rule on Bilski 

The US Supreme 

Court has ruled 

on a long-awaited 

and controversial 

patent litigation 

case, a decision 

greeted with relief 

by the biotech 

industry but 

vague enough that 

both sides can 

claim victory. The 

Bilski v. Kappos 

case was closely 

watched by the biotech community after 

the US Court of Appeals for the Federal 

Circuit ruled in 2008 that only methods 

tied to a machine or that transform an article into a

different state are patentable, a standard 

which appeared to exclude crucial aspects of 

medical diagnostics. Commentators feared a 

restrictive ruling could have severely limited 

the ability to obtain patents on methods 

that use genes, proteins and metabolites 

to diagnose disease. Instead, the Supreme 

Court struck down patent claims on narrow 

grounds. “The Court was clearly conscious 

of the potential negative and unforeseeable 

consequences of a broad and sweeping 

decision,” stated Washington, DC–based 

Biotechnology Industry Organization 

president and CEO Jim Greenwood. The court 

ruled on two issues. First, it rejected the rule that patents are limited to inventions that are “tied to a particular machine” or that transform “a particular article into a different state or thing.” Second, the court held that

the word “process” as used in the US Patent 

Act should be read broadly to include modern 

day inventions. The ruling does not address 

the eligibility of patents for diagnostic 

methods, however, which leaves a number 

of questions unanswered with regard to a 

string of pending cases, including the closely 

watched dispute against Myriad Genetics 

and its breast cancer gene patents. Dan 

Ravicher of the Public Patent Foundation, a 

co-plaintiff with the American Civil Liberties 

Union in the suit against Myriad Genetics, 

believes “the opinion reinforces the line of 

case law that Judge Sweet relied upon in his 

decision striking down gene patents [in the 

Myriad case]. It rejects the argument that 

‘anything’ is patentable.” Justices Stevens, 

Breyer, Ginsburg and Sotomayor would have 

struck down not only the specific Bilski 

business method claims, but all business 

method patents on historical grounds that 

this class of patents was never contemplated 

by the framers of the US Constitution. 

The same argument would be difficult to 

support in biotech-specific cases as there 

is ample evidence that Thomas Jefferson, 

who reformed the Patent Act of 1793, 

considered medicine a “useful art” as was 

originally stated, wording later changed to 

“process.” Kenneth Chahine & Javier Mixco 

Lee Pettet/istockphoto 




in brief 

Lawsuits rock Jackson 

Lee Pettet/istockphoto 

Litigation over models 

may inflate prices. 

The Jackson 

Laboratory has 

unwittingly found 

itself ensnared in 

patent disputes. 

In June, the 

nonprofit laboratory 

mouse developer 

located in Bar 

Harbor, Maine, was 

cleared of a patent 

infringement 

allegation—the 

first in the 

laboratory’s 80-year history—and now 

faces a second allegation by another party. 

Jackson’s mission of making its repository 

of more than 5,000 mouse strains available 

to researchers at affordable prices could 

be challenged if it is forced to continue 

defending itself in expensive lawsuits, says 

David Einhorn, the laboratory’s in-house 

attorney. In Jackson’s first scuffle, the 

Central Institute for Experimental Animals 

(CIEA), a Kawasaki, Japan–based nonprofit, 

in 2008 sued Jackson for distributing a 

mouse model particularly useful for grafting 

human tissue. Both groups in the 1990s 

separately developed these immunodeficient 

mice by starting with a strain of nonobese 

diabetic mouse (NOD), crossing those 

with mice carrying the scid mutation for 

immunodeficiency, and crossing them again 

with mice whose gene for a key immune 

signaling molecule, interleukin-2 receptor γ, 

was knocked out. Jackson has distributed the 

mouse to more than 1,000 research groups 

worldwide, says Einhorn. But the laboratory 

didn’t patent its mouse, whereas CIEA did. 

On June 1, a US District Court judge ruled 

that the Jackson Laboratory had not infringed 

CIEA’s patent. What ultimately swayed the 

judge to side with Jackson was that the 

CIEA, in its patent application, described the 

mouse but didn’t claim it. In his decision the 

judge cited the Guidelines for Nomenclature 

of Mouse and Rat Strains, which state mice 

inbred for more than 20 generations can be 

considered a different strain, and Jackson’s 

mouse line had been separately inbred 

many times. Michael Rader, attorney with 

Wolf, Greenfield & Sacks in Boston, who 

represented Jackson, says this was likely 

the first time nomenclature rules have been 

used to help decide a lawsuit. Now Jackson 

faces another lawsuit involving transgenic 

mice with mutations useful in Alzheimer’s 

disease research. The Alzheimer’s Institute 

of America in February sued Jackson and six 

biotech and pharma companies for patent 

infringement. Despite the high costs of the 

two lawsuits, Einhorn says Jackson won’t 

alter its mission of making laboratory mice 

accessible. But he notes that if the suing 

trend continues, “the most obvious way 

to recoup the costs is to charge more for 

mice.” He adds: “That falls on the backs of 

scientists who do the research.” Emily Waltz 

The FDA’s Transparency Task Force is proposing to increase access to the agency’s decision letters 

about products or drugs. Such a move would challenge small biotech. 

vice president of clinical and regulatory affairs 

at Hudson, Massachusetts–based Clinquest. 

McLane points out that releasing more data 

earlier will also stretch the agency’s resources 

because there will be pressure to analyze many 

more signals quickly and thoroughly. “It’s a tremendous 

overreach,” he says. “A lot of people 

do not think this will go through.” McLane says 

he’d rather see the agency bring its transparency 

rules in line with the Sarbanes-Oxley Act 

of 2002, which set new standards for US boards, 

management and accounting firms. “A lot of 

what the FDA is asking for here is competitive 

information,” he says. 

The agency was accepting comments 

through July 20. In the autumn, the task 

force will consider the public comments as 

well as the “priority, operational feasibility, 

and resource requirements” of each proposal, 

according to Afia K. Asamoah, director of the 

FDA’s transparency initiative. BIO submitted 

one set of comments in April, and Emmett 

says the group will submit more before the 

deadline. Even if the agency decided to go 

through with all the proposals, though, some 

of the changes could not be implemented 

without new legislation. 

Malorye Allison, Acton, Massachusetts 

in their words 

“They have grown so 

fast and so suddenly 

that people are still 

skeptical. But we should 

get used to it.” Rasmus 

Nielsen, a geneticist 

at the University of 

California at Berkeley, 

who collaborates with 

Chinese colleagues, on 

China’s sudden boom in 

sequencing output. (The Washington Post, 

28 June 2010) 

“Until the capacity issues can be addressed, this 

will not be an effective agent.” Chris Logothetis, 

head of prostate cancer research at the University 

of Texas MD Anderson Cancer Center in Houston, 

on the year-long wait patients currently face for 

Dendreon’s prostate cancer vaccine Provenge. 

(Pharmalot, 28 June 2010) 

“Everyone can claim victory, except of course 

Mr. Bilski himself.” Dan Ravicher, of the Public 

Patent Foundation, the organization leading the 

attack on Myriad, on the Supreme Court’s decision 

in Bilski v. Kappos. (GoozNews, 28 June 2010) 

“Now that the full integration has taken place, 

it’s the Genentech guys who are being promoted 

and getting the key positions.” Allianz Global 

Investors’ Joerg de Vries-Hippen on how 

Genentech is the strongest in the marriage with 

Roche. (Bloomberg Businessweek, 1 July 2010) 

JASON REED/Reuters/Corbis 



Food firms test fry Pioneer’s trans fat–free 

soybean oil 

The US Department of Agriculture (USDA) 

has approved for environmental release one of 

the first biotech crops aimed at the food industry. 

The new crop, a genetically modified soybean 

with an altered fatty acid profile, yields oil 

that is more stable at high frying temperatures 

and has a longer shelf life than commodity soybean 

oil. It was developed by Pioneer Hi-Bred 

in Johnston, Iowa, a Dupont company. The 

company received marketing approval for the 

biotech soybean in June and aims to commercialize 

it by 2012. St. Louis–based Monsanto 

is following close behind, with two soybean 

products with modified oil profiles in its pipeline. 

The new soybean 

traits may help the 

biotech industry 

deliver on a two-decade-long 

promise: 

to develop crops with 

improved nutritional 

value. Until now, 

most commercialized 

biotech crops have 

been engineered with 

such traits as pest 

resistance and herbicide 

tolerance—traits 

that mostly benefit 

farmers rather than the food industry or consumers. 

“Heat stability and longer shelf life: 

these are the things that can light up the food 

industry, not reduced pesticides,” says Tom 

Hoban, a professor of food science at North 

Carolina State University in Raleigh. 

Pioneer is marketing its new soybean oil as 

an alternative to partially hydrogenated vegetable 

oils. For decades, food producers have 

relied on partially hydrogenated soybean oil 

because it retains its flavor at high cooking 

temperatures and for extended periods on the 

grocery store shelf. But the process of partial 

hydrogenation produces trans fatty acids, or 

trans fats, which are known to increase ‘bad’ 

low-density lipoprotein (LDL) cholesterol and 

increase risk of coronary heart disease. 

In 2006, the US Food and Drug 

Administration began requiring food manufacturers 

to label food with trans fats, and 

measures to alert the public to the health risks 

of trans fats ensued. Food producers turned 

to alternatives, such as palm oil and certain 

kinds of canola oil, that have more stable frying 

and shelf life characteristics than those 

of unhydrogenated soybean oil. As a result, 

soybean oil’s share of the edible fats and oils 

market has gone from 76% in 2005 to 64% 

today, according to the US Census Bureau. 

“We hope to recapture that space [for soybeans],” 

says Pioneer’s Russ Sanders, director 

of enhanced oils. 

The success of Pioneer’s recently approved soybean, 

which has been engineered to cut down on 

trans fats, will depend on how well it is received 

by the food industry. 

Pioneer’s new soybean oil has an oleic fatty 

acid content of >75%, a property that gives it 

frying and shelf stability comparable to that 

of palm, high oleic acid canola and hydrogenated 

soybean oils. It also contains 20% less 

saturated fat than commodity soybean oil. 

Pioneer dubbed the crop “Plenish high-oleic 

soybeans.” Overproduction of oleic acid and 

decreased levels of linoleic and linolenic acids 

in Plenish arise from 

transgenic expression 

of a fragment of the 

soybean microsomal 

omega-6 desaturase 

gene (FAD2-1) 

under the control 

of soybean Kunitz 

trypsin inhibitor 

gene promoter, which 

silences endogenous 

omega-6 desaturase. 

The transgenic 

soybean also carries 

the S-adenosyl-L-methionine 

synthetase 

as a marker to enable initial selection 

in the laboratory by acetolactate synthase 

(ALS)-inhibiting herbicide. 

John Lee/iStockphoto 

The success of the Plenish soybean will 

depend on how well it is received by the food 

industry. Pioneer has already set up testing 

agreements with a dozen undisclosed food 

companies, says Sanders. The companies will 

run consumer taste tests, frying tests and shelf 

life tests—just about anything a food company 

would normally do with a new ingredient. 

Food companies can already choose from an 

array of oils with modified fatty acid contents 

developed with conventional breeding. “The 

hard reality will be how producers of liquid 

vegetable oils compete,” says Terry Etherton, 

professor of animal nutrition at Penn State in 

University Park, Pennsylvania. 

Food industry representatives say they welcome 

the new oil option, but see it as a “trial 

situation,” says Jeffrey Barach, vice president 

of science policy at Grocery Manufacturers 

Association in Washington, DC. “Each company 

has to try it out and do some experimental 

work,” he says. 

Although Pioneer received the full go-ahead 

from regulators, the company doesn’t plan to 

in brief 


Anti-CD20 patent battle ends 

On June 1, a four-year dispute over a European 

patent for anti-CD20 drugs to treat rheumatoid 

arthritis came to an end, with Seattle-based 

Trubion winning the dispute. This result frees up 

the space for anyone with a CD20 program, says 

Jeff Pepe, associate general counsel at Trubion. 

Multiple oppositions had been filed against the 

patent (European Patent 1176981) held jointly 

by Genentech of S. San Francisco, California, 

and Biogen Idec of Cambridge, Massachusetts. 

Trubion was joined by MedImmune, GenMab, 

Centocor, the Glaxo Group and Merck Serono, all 

pursuing anti-CD20 programs at one time. In 

2008, the Opposition Division of the European 

Patent Office ruled that, as filed, the patent did 

not meet the necessary requirements, favoring 

Trubion. Genentech and Biogen appealed in 

2009. Finally, at an oral hearing this June, 

the original ruling was upheld, and no further 

appeals will be allowed. Ironically, around the 

time of the hearing, New York–based Pfizer, 

which acquired Trubion’s CD20 programs when 

it bought Wyeth in 2009, announced it would 

drop Trubion’s lead anti-CD20 compound (TRU-015) 

though retaining the biotech’s second-generation 

anti-CD20 monoclonal antibody also 

in rheumatoid arthritis. For Genentech/Roche 

“the decision does not impact our expectations 

with respect to protection against Rituxan 

[rituximab, anti-CD20 chimeric monoclonal 

antibody],” says company spokesperson 

Rubin Snyder. 

Laura DeFrancesco 

EU states free to ban GM crops 

In July, the European Commission (EC) 

officially proposed to give member states 

the freedom to veto cultivation of genetically 

modified (GM) crops without having to 

back their decision with scientific evidence 

on new risks. The reform’s goal is to hand 

back responsibility to individual states and 

speed up pending authorizations. Anti-GM 

countries can now choose to opt out whereas 

biotech-friendly countries can cultivate new 

GM varieties. However, there is no guarantee 

it will work. “We are not against freedom 

for member states, the problem is how 

the principle is articulated,” says Carel du 

Marchie Sarvaas, director for agricultural 

biotech at EuropaBio. The proposal stands on 

two legs: an amendment to directive 2001/18 

that must gain the approval of the council 

of ministers and the European Parliament, 

and an EC recommendation on coexistence, 

already effective. The first legalizes national 

or local bans on growing, the second one 

achieves the same result by conceding that 

countries wanting to keep ‘contamination’ 

levels well below the labeling threshold can 

enforce wide isolation distances between 

GM and conventional or organic fields. “It’s 

a Pandora’s box. We are concerned it will 

create legal uncertainty and unpredictability 

for farmers and operators,” says du Marchie 

Sarvaas. The reform doesn’t target imports of 

GM material for food or feed, whose approvals 

are also stalled. 

Anna Meldolesi 




in brief 

GM alfalfa—who wins? 

Both sides are claiming victory following the 

Supreme Court’s verdict issued June 21 in 

Monsanto v. Geertson Seed Farms over the 

future sale of Roundup Ready (RR) alfalfa 

seeds. The Supreme Court overturned a lower 

court injunction issued in 2007 banning the 

biotech seeds nationwide (Nat. Biotechnol. 28, 

184, 2010). Monsanto’s business lead for the 

crop, Steve Welker, says the St. Louis–based 

company has plenty of RR alfalfa seeds 

“ready to deliver,” although their release is 

subject to a pending environmental impact 

statement (EIS) by the US Department of 

Agriculture (USDA). “Our goal is to have 

everything in place for growers to plant in fall 

2010,” Welker adds. Not so fast, says lawsuit 

opponent Andrew Kimbrell of the Center for 

Food Safety in Washington. He points out 

that the Supreme Court “just took away the 

injunction, and USDA still has to comply with 

NEPA [the National Environmental Policy 

Act] and complete an EIS” before the crop 

can be deregulated. Although USDA appears 

poised to complete its EIS and fully deregulate 

RR alfalfa, the Center for Food Safety could 

renew its challenge of USDA’s decision. 

This lingering uncertainty has agitated many 

members of Congress. Seven senators and 

49 representatives have asked agriculture 

secretary Tom Vilsack to retain regulated status 

for RR alfalfa, whereas two other senators have 

urged Vilsack to “mount vigorous defenses 

against lawsuits that seek to upend science-based 

regulatory decisions.” Jeffrey L Fox 

Biofuel ‘Made in China’ 

Collaboration between the Danish enzyme 

producer Novozymes of Bagsværd, Beijing-based 

China Petroleum and Chemical and 

Cofco, the state-run agriculture company, will 

produce three million gallons of ethanol a 

year for local consumption, using corn stalks 

and leaves from northeastern China’s corn 

belt. The demonstration plant will test novel 

technologies, including Novozymes’ new 

Cellic CTec2 enzymes, with a view to launching a 

commercial facility by 2013. Cofco has been 

running a small pilot plant in Heilongjiang 

province for four years, but as a precondition 

for commercialization “we need more capacity 

to optimize our design and operation,” says 

Guo Shunjie, general manager of Cofco’s 

bio-energy and biochemical department. One 

remaining hurdle is the inability to break down 

five-carbon sugars abundant in lignocellulose, 

which make up 20–40% of the plant biomass. 

The new process could cut costs considerably, 

as it requires half the dose of enzymes needed 

by other treatments to break down plant waste. 

The partners’ goal is to produce cellulosic 

ethanol at $2.25 a gallon, a price further 

pushed down by government tax credits to be 

competitive with corn-based ethanol, currently 

at $1.50–1.60 a gallon. “Since the trend to 

lower carbon emissions is here to stay, it 

won’t be long before we break even,” 

says Shunjie. 

Daniel Grushkin 

Table 1 USDA-approved soybeans modified for improved trans fat content 

Product | Company | Description 
DP-305423 | Pioneer Hi-Bred International | High oleic acid soybean produced by inserting extra copies of a portion of the gene encoding omega-6 desaturase, gm-fad2-1, resulting in silencing of the endogenous omega-6 desaturase gene (FAD2-1). 
DD-026005-3 | DuPont | High oleic acid soybean produced by inserting a second copy of a portion of the gene encoding omega-6 desaturase, gm-fad2-1, resulting in silencing of the endogenous omega-6 desaturase gene (FAD2-1). 
OT96-15 | Agriculture & Agri-Food Canada | Low linolenic acid soybean produced through traditional crossbreeding to incorporate the trait from a naturally occurring fan1 gene mutant that was selected for low linolenic acid. 

Source: AGBIOS 

commercialize Plenish soybeans until the first 

quarter of 2012, after food players have had 

time to determine what food applications, if 

any, they want to pursue with Plenish soybeans. 

“We’re being fairly conservative in our 

commercialization schedule,” Sanders says. 

The time to market also depends on 

Pioneer’s ability to secure regulatory approval 

in key global markets, such as Europe, Japan, 

China, Taiwan and South Korea, Sanders says. 

The soybean is already approved in Canada 

and Mexico. 

Global regulatory hurdles hampered 

Dupont’s earlier development of a different 

high oleic acid soybean (Table 1). In 1997, the 

USDA approved, or deregulated, DD-026005-3, 

a Dupont soybean with an oleic acid content 

of 85%. This variety was modified with 

an extra copy of soybean Δ12-fatty acid dehydrogenase 

under the control of the soybean 

β-conglycinin promoter, which triggered 

silencing of the transgene and its counterpart 

endogenous gene. But the product fizzled 

after the company encountered global regulatory 

complexities associated with the crop’s 

marker technology, says Sanders. Markers 

are used by crop developers to test whether 

genetic material is successfully transferred 

to the host crop. In this case, DD-026005-3 

contained the Escherichia coli uidA gene, 

encoding β-glucuronidase as a colorimetric 

marker, and the bla gene, encoding the 

enzyme β-lactamase as a selective marker 

that confers resistance to β-lactam antibiotics 

(such as penicillin and ampicillin). 

Pioneer’s new high oleic soybean targets the 

same oleic acid pathway as the 1997 version, 

but it is hoped that use of a different marker 

gene, one imparting tolerance to an ALS-inhibitor 

herbicide, will smooth the regulatory 

path. (The plant will not be tolerant to 

ALS-inhibitor herbicides at the levels used in 

the field.) Sanders says he is “optimistic” about 

the 2012 regulatory goals. 

On Pioneer’s regulatory heels are two 

Monsanto soybean products with modified 

oil profiles, one with omega-3 fatty acids for 

nutrition and the other with enhanced texture 

and functionality, called high stearic 

acid soybeans. Monsanto has submitted to 

the USDA petitions for deregulation of both 

products. Still in the discovery phase, Dow 

AgroSciences in Indianapolis, Indiana is 

developing omega-9 canola and sunflower 

oils. With one nutritionally altered crop 

approved and a handful in the pipeline, 

the public may finally get what it has been 

promised for two decades. But whether 

high oleic acid soybeans directly benefit 

consumers enough to boost public opinion 

of biotech crops is doubtful, say agriculture 

experts. “Companies already have methods 

of removing trans fats” from food, says Jane 

Rissler, a senior scientist with the Union of 

Concerned Scientists in Washington, DC. 

Pioneer is “offering an alternative to those 

existing methods” without much added benefit 

to consumers, she says. Alan McHughen, 

a plant biotechnologist at the University of 

California, Riverside, notes that: “Those 

who already despise [genetic modification] 

will continue to do so, those who accept GM 

will continue to do so, and most others won’t 

even notice it, as it’s not a high-profile whole 

food with immediate consumer-recognized 

benefit.” 

In the US, food companies aren’t required 

to label food derived from genetically engineered 

crops, and generally don’t voluntarily 

do so. 

An April 2010 survey of 750 US consumers 

asked this question: “All other things 

being equal, how likely would you be to 

buy a food product made with oils that had 

been modified by biotechnology to avoid 

trans fats?” Seventy-four percent said they 

were either very likely or somewhat likely to 

buy this kind of biotech food. However, in a 

separate question, only 32% of those respondents 

said they had a favorable impression of 

biotech food. The survey was conducted by 

the International Food Information Council 

Federation in Washington, DC. 

Emily Waltz, Nashville, Tennessee 


data page 

2Q10—spreading the wealth 

Walter Yang 


Although biotech stocks, along with the general markets, performed 

poorly last quarter, more companies were able to access capital, more than 

in each of the previous four quarters. Excluding US partnership monies, 

219 companies pulled in $8.1 billion (compared with 157 firms raising $5.3 

billion in 2Q09), 39% of which originated from debt deals by Genzyme 

(Cambridge, MA) and Teva Pharmaceuticals (Petah Tikva, Israel). Venture 

funding was up 36% from 2Q09; ten companies launched initial public 

offerings (IPOs), raising $342.9 million. 

Stock market performance 

The BioCentury 100 and the NASDAQ Biotechnology were down 11% and 

15%, respectively, similar to other major indices. 

[Line chart: Index (y-axis, 500 to 1,700) by Month (x-axis, 12/2008 to 6/2010) for the BioCentury 100, Dow Jones, S&P 500, NASDAQ, NASDAQ Biotech and Swiss Market.] 

Global biotech industry financing 

Excluding partnership monies, 2Q10 funding was $8.1 billion, up 53% 

on 2Q09, largely through debt deals, which shot up 97%. 

Amount raised ($ billions) 

Quarter | Partnership | Debt and other financing | Venture | Follow-on | PIPE | IPO 
2Q10 | 8.5 | 5.0 | 1.7 | 0.6 | 0.4 | 0.3 
1Q10 | 6.1 | 2.1 | 1.3 | 1.3 | 0.5 | 0.4 
4Q09 | 14.8 | 3.1 | 1.6 | 2.3 | 0.7 | 0.3 
3Q09 | 9.4 | 2.3 | 1.2 | 2.4 | 0.6 | 0.7 
2Q09 | 8.0 | 2.6 | 1.2 | 0.8 | 0.7 | 0.0 

Partnership figures are for deals involving a US company. Source: BCIQ: BioCentury Online Intelligence, 

Burrill & Co. 

Global biotech initial public offerings 

Ten companies raised $342.9 million through IPOs last quarter versus 

none in 2Q09. 

Amount raised ($ millions) by financial quarter 

Region | 2Q09 | 3Q09 | 4Q09 | 1Q10 | 2Q10 
Asia-Pacific | 0 | 15 | 50 | 31 | 50 
Europe | 0 | 7 | 151 | 0 | 85 
Americas | 0 | 635 | 70 | 364 | 208 

Number of IPOs 

Region | 2Q09 | 3Q09 | 4Q09 | 1Q10 | 2Q10 
Americas | 0 | 2 | 2 | 4 | 4 
Europe | 0 | 1 | 2 | 0 | 5 
Asia-Pacific | 0 | 1 | 2 | 2 | 1 

Table indicates number of IPOs. Source: BCIQ: BioCentury Online Intelligence 

Global biotech venture capital investment 

Venture money raised was up 36% to $1.7 billion from $1.2 billion in 

2Q09. 

Amount raised ($ millions) by financial quarter 

Region | 2Q09 | 3Q09 | 4Q09 | 1Q10 | 2Q10 
Asia | $9 | $6 | $9 | $24 | $0 
Europe | $180 | $104 | $479 | $331 | $458 
Americas | $1,035 | $1,064 | $1,065 | $939 | $1,210 

Number of venture capital investments 

Region | 2Q09 | 3Q09 | 4Q09 | 1Q10 | 2Q10 
Americas | 43 | 49 | 60 | 60 | 76 
Europe | 14 | 14 | 32 | 30 | 28 
Asia-Pacific | 1 | 1 | 1 | 1 | 1 

Table indicates number of venture capital investments and includes rounds where the amount raised was 

not disclosed. Source: BCIQ: BioCentury Online Intelligence 

Notable Q2 deals 

Venture capital 

Company (lead investors) | Amount raised ($ millions) | Round number | Date closed 
AiCuris (Santo Holding) | 74.9 | 2 | 14-Apr 
Achaogen (Frazier Healthcare) | 56.0 | 3 | 7-Apr 
Pacific Biosciences (Gen-Probe) | 50.0 | 6 | 17-Jun 
OptiNose (Avista Capital) | 48.5 | NA | 8-Jun 
Agile Therapeutics (Investor Growth Capital, Care Capital) | 45.0 | 2 | 14-Jun 
Tetraphase (Excel Venture) | 45.0 | 3 | 1-Jun 
Anaphore 3 (5AM Ventures, Versant, Apposite Capital) | 38.0 | 1 | 14-May 

Mergers and acquisitions 

Target | Acquirer | Value ($ million) | Date announced 
OSI Pharma | Astellas | 4,000 | 17-May 
Valeant | Biovail | 3,200 | 21-Jun 
Abraxis | Celgene | 2,900 | 30-Jun 
Wuxi PharmTech | Charles River | 1,500 | 26-Apr 

IPOs 

Company (lead underwriters) | Amount raised ($ millions) | Change in stock price since offer | Date completed 
Codexis | 78.0 | –33% | 22-Apr 
Alimera | 72.1 | –32% | 22-Apr 
Lansen Pharma | 50.2 | 3% | 30-Apr 
Tengion | 30.0 | –26% | 9-Apr 
GenMark | 27.6 | –26% | 28-May 
Aposense | 24.8 | –11% | 7-Jun 

Licensing/collaboration 

Researcher | Investor | Value ($ millions) | Deal description 
TransTech | Forest | $1,100 | Exclusive, worldwide rights, excluding the Middle East and North Africa, to develop and commercialize small-molecule glucokinase activators 
Regulus | Sanofi-aventis | >$750 | Discover, develop and commercialize microRNA therapeutics for up to four targets 
Diamyd | Johnson & Johnson | $625 | Exclusive rights to Diamyd diabetes vaccine outside Nordic countries 
Neurocrine | Abbott | $595 | Exclusive, worldwide rights to develop and commercialize endometriosis compound elagolix 
OncoMed | Bayer | >$500 | Discover and develop antibodies, proteins and small molecules targeting the Wnt signaling pathway to treat cancer 

Source: BCIQ: BioCentury Online Intelligence 

Walter Yang is Research Director at BioCentury 
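The headline percentages on the data page can be checked against the quoted dollar figures. The short Python sketch below is illustrative only: the helper name `pct_change` and the variable names are assumptions made here, and the inputs are the rounded values printed on the page ($5.3 billion and $8.1 billion in funding excluding partnership monies, and the three venture and IPO amounts shown per quarter in the charts), so the results can differ slightly from figures computed from unrounded data.

```python
def pct_change(old, new):
    """Percentage change from an earlier value to a later one."""
    return (new - old) / old * 100.0

# Funding excluding partnership monies, in $ billions (as quoted in the text).
funding_2q09, funding_2q10 = 5.3, 8.1

# Venture capital raised, in $ millions: the three regional amounts shown
# for each quarter in the venture chart (summed here, so the regional
# labels do not matter for this check).
venture_2q09 = [9, 180, 1035]
venture_2q10 = [0, 458, 1210]

# IPO proceeds for 2Q10, in $ millions: the three regional chart amounts.
ipo_2q10 = [50, 85, 208]

funding_growth = pct_change(funding_2q09, funding_2q10)
venture_growth = pct_change(sum(venture_2q09), sum(venture_2q10))

print(f"Funding growth, 2Q09 to 2Q10: {funding_growth:.1f}%")
print(f"Venture growth, 2Q09 to 2Q10: {venture_growth:.1f}%")
print(f"2Q10 IPO proceeds from chart values: ${sum(ipo_2q10)} million")
```

Rounded to whole numbers, these reproduce the 53% and 36% increases quoted in the captions, and the chart amounts for IPOs add up to roughly the $342.9 million reported.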



NEWS feature 

Drugmakers dance with autism 

With monogenetic neurodevelopmental disorders similar to autism 

serving as starting points for several drug discovery programs, 

smaller biotechs are now joining big pharma in pursuing therapies 

to tackle this perplexing condition. Sarah Webb reports. 

In June, the Autism Research Project published 

the largest genetic study of autism so 

far, identifying 226 gene mutations that are 

found in people with the syndrome¹. Children 

with autism are 20% more likely to carry one 

of these rare mutations, though they are not 

inheriting them; they are present in less than 

6% of the parents of autistic children. This 

study adds to the growing list of genes that 

could serve as starting points for research on 

autism therapies. 

Whereas the pharmaceutical industry 

increasingly has been shying away from 

psychiatric disorders, such as schizophrenia 

and depression, interest in autism has 

intensified. Together with an increasing 

number of autism cases diagnosed each 

year, there is a dearth of effective treatments. 

As a result, “autism seems to be a relatively 

hot area,” says Manuel Lopez-Figueroa 

of Bay City Capital, a venture capital firm 

in San Francisco, and scientific liaison for 

the Pritzker Neuropsychiatric Disorders 

Research Consortium. Not only is the pharmaceutical 

sector ploughing R&D resources 

into the condition, but several smaller companies 

are pioneering therapies, one of which 

is an enzyme replacement therapy already in 

phase 3 human testing (Table 1 and Box 1). 

What’s more, progress in drug discovery programs 

aiming to target proteins associated 

with Mendelian neurodevelopmental disorders 

may pave the way for expansion into 

broader spectrum autism conditions. 

Repurposed drugs 

Current estimates indicate that 1 in 110 children 

in the United States have an autism 

spectrum disorder defined by three core 

symptoms: deficits in social interactions, 

problems with communication and repetitive 

behaviors. Although twin and family studies 

have established a strong genetic basis for 

autism, no clear genetic cause has emerged. 

In addition to complex genetics, the disorder 

is phenotypically diverse: individuals with 

an autism spectrum diagnosis may be intelligent 

and high functioning (e.g., those with 

Asperger’s syndrome) or have severe mental 

deficits. The large variation in phenotypes and 

high concordance in monozygotic twins suggest 

many genetic and environmental biasing 

factors are involved. 

Trouble at the synapse. The genetics of autism is 

pointing toward malfunctioning at the synapse. 

A diagnosis of autism brings along a slew of 

unmet medical needs, including anxiety, sleep 

disturbances, and metabolic and gastrointestinal 

issues. Initial moves by industry into 

autism therapeutics have involved applying 

existing drugs to alleviate some of these symptoms, 

says Sophia Colamarino, vice president 

for research at Autism Speaks, a patient advocacy 

group based in New York. “In the short 

term, that’s where many of the pharmaceutical 

companies will be able to have an immediate 

impact,” she says. Two atypical antipsychotics 

have been approved by the US Food and Drug 

Administration (FDA) for treating irritability 

in autistic children. Johnson & Johnson’s 

Risperdal (risperidone) was approved in 

late 2006, followed by Abilify (aripiprazole) 

from Bristol-Myers Squibb in New York, and 

Otsuka in Princeton, New Jersey, in 2009. 

Selective serotonin reuptake inhibitors such 

as low-dose Prozac (fluoxetine) are approved 

for use in adults and children for obsessive 

compulsive disorder and have been tested in 

children with autism. Anticonvulsives such 

as valproate (Stavzor, Depakene, Depacon) 

may serve the same sort of purpose for some 

patients, says Eric Hollander, director of the 

Compulsive, Impulsive and Autism Spectrum 

Disorders Program at Albert Einstein College 

of Medicine and Montefiore Medical Center 

in New York. 

Mike Agliolo/Corbis 

Treating these related symptoms gives 

patients and their caregivers an improved 

quality of life, making it more likely that 

an individual with autism can live at home 

rather than in a care facility, Hollander adds. 

Improving those related symptoms can also 

make patients more responsive to behavioral 

therapies, says Robert Ring, who is heading 

up Pfizer’s autism research unit in Groton, 

Connecticut. 

At least one repurposed drug is targeting the 

imbalance between excitatory and inhibitory 

signaling suspected to be part of the basis of 

autism. New York-based Forest Laboratories is 

testing Namenda (memantine), an Alzheimer’s 

drug and N-methyl-d-aspartate (NMDA) receptor modulator, in a phase 2 trial

in autism patients. 

Abnormal synaptic connectivity 

Because this spectrum of disorders has a 

clear genetic basis but no clear genetic cause, 

researchers are chewing on the question of how 

so many different mutations could lead to a 

similar phenotype, says Luca Santarelli, head 

of Roche’s central nervous system exploratory 

development in Basel. 

Genetic studies are important, but they don’t 

tell a complete story. “Identifying genes and 

coming up with gene candidates is really just 

a first step in gaining confidence in a potential 

genetic target that could be druggable,” says 

John Spiro, a research director at the Simons 

Foundation Autism Research Initiative in New 

York City. “There are not many genes that you 

can be really, really confident are accounting 

for any significant portion of autism.” Though 

researchers remain hopeful that the genes might 

converge into a single meaningful pathway, he 

adds, “for the most part in autism, it’s not clear 

yet that’s going to be the case.” 

Nonetheless, some patterns are emerging 

that may help researchers devise new therapeutic 

strategies. A genome-wide survey of a group 

of autistic and mentally retarded individuals 

revealed a set of mutations (point mutations 

and copy number variants) in a gene, SHANK2, 

that controls synaptic structure, defects in 

which could lead to problems in neuronal 

communication (ref. 2).

Box 1 Enzyme replacement for autism?

Unlike other emerging treatment strategies for autism that target genes or neurochemical pathways, Curemark, of Rye, New York, is working on an enzyme replacement therapy comprising a mixture of several digestive enzymes (Table 1). In clinical work with children who showed symptoms of autism, Curemark’s founder and CEO, Joan Fallon, noticed that several of these patients restricted their diets by their own choice, preferring carbohydrate-laden foods such as crackers and pasta. Searching for an explanation, she found that these patients had low fecal levels of the protease chymotrypsin (fecal chymotrypsin levels have also served as a diagnostic indicator of cystic fibrosis). Children with autism without a known genetic cause often had these low enzyme levels, Fallon says.

After administering high-protease enzymes, the physicians observed behavioral changes in the children. Fallon filed patents in 1999 and formed a biotech company in 2005. The company’s protease-based treatment, CM-AT, is currently being tested in a phase 3 study with 170 children ages 3–8 in 12 locations around the United States.

Mutations in another family of genes involved with synapse formation, the neuroligins, which code for adhesion molecules that cluster on the receiving side of the synapse, may account for up to 6%

of autism cases, according to Nils Brose, 

director of the Department of Molecular 

Neurobiology at the Max Planck Institute 

of Experimental Medicine, in Göttingen, 

Germany. Neuroligins 3 and 4 localize to 

glutamatergic synapses, and loss-of-function 

mutations in these genes segregate in 

certain pedigrees with mental retardation, 

autism and Asperger’s syndrome. These 

molecules are likely operating as the organizational 

point for information coming into 

the postsynaptic space, recruiting signaling 

receptors. In mouse knockouts of two of 

these neuroligins, Brose says, “the synapses 

are intrinsically operational, but they lack 

normal receptors and as a consequence don’t 

function properly.” 

But just noting a connection between these 

genes and synaptic structures isn’t enough for 

developing drug candidates, Spiro adds. “You 

don’t know. Is it too much? Is it too little? Are 

[the structures] in the wrong place during 

development? There are just a million questions 

that need to be ironed out before you can 

think about a pharmaceutical intervention.” 

Santarelli’s group at Roche is trying to get at 

some of these questions, in collaboration with 

Peter Scheiffele, a professor of cell and developmental 

neurobiology at the University of 

Basel and a leader in the neuroligin research 

area. “We’d like to understand the common 

downstream effects of different genetic alterations 

that lead to autisms and whether there 

are common mechanisms that could lead to 

treatments,” Santarelli says. 

Clues from rare single-gene disorders 

The increasing understanding of some of the 

molecular mechanisms of autism is providing 

one avenue forward. The second breakthrough, 

according to Colamarino, is coming through 

animal studies of single-gene disorders such 

as fragile X (ref. 3) and Rett’s syndromes (ref. 4), which are

found in a disproportionate number of individuals 

who meet the criteria for autism spectrum 

disorders. Since 2007, a handful of studies of 

animal models with inducible mutations have 

shown that animals can develop to adulthood 

with these disorders, and then recover after 

proper gene function is switched back on. 

That ability to reverse the symptoms in animals 

with advanced disease has been a major 

breakthrough, says Spiro. With clear genetic 

causes coupled with the opportunity to build 

animal models of these disorders, “it may be 

very reasonable to say that the pathway to drug 

discovery in autism may be paved by a careful 

focus on these rarer syndromes,” Ring says. 

Fragile X syndrome provides a case study 

in this approach that weds treatment strategies 

for a rare disorder with the possibility 

of understanding the underpinnings of 

autism. This genetic disorder, which affects 1 

in 4,000 males and 1 in 6,000 females (http://www.fraxa.org/), leads to learning disabilities

and even mental retardation, anxiety and seizures. 

Up to 20% of individuals with fragile X 

also meet the criteria for an autism diagnosis. 

As a result of a single gene mutation, these 

individuals do not make the fragile X mental 

retardation protein (FMRP). Mark Bear of 

the Massachusetts Institute of Technology in 

Cambridge and his colleagues found that the 

lack of FMRP leads to dysregulation of signaling 

through the metabotropic glutamate 

receptors (mGluR). The mGluR5 receptor is 

highly expressed in regions of the brain critical 

for learning and memory. 

FMRP serves as a brake on this signaling 

pathway, says Randall Carpenter, CEO 

and president of Seaside Therapeutics, a 

Cambridge, Massachusetts, biotech company 

co-founded by Bear. “When it’s not 

there then there’s overactivation of the signaling 

pathway. The brain can’t discriminate 

between important information and noise 

and it doesn’t develop normally.” In mice 

with the fragile X mutation, Bear and his colleagues 

found that knocking down expression 

of mGluR5 to 50% rescued the learning 

deficits, stopped seizures and increased other 

measures of plasticity in the brain. 

Confident that they’re targeting the appropriate 

pathways, Seaside Therapeutics has licensed 

a series of small-molecule compounds from 

Merck to target glutamate signaling in general 

and mGluR5 signaling specifically, Carpenter 

says. They recently completed a phase 2 clinical 

trial of a general γ-aminobutyric acid (GABA) 

B agonist, STX209, in fragile X patients, and 

will soon complete a phase 2 trial of the same 

compound in individuals with autism spectrum 

disorders. A specific antagonist of the mGluR5 receptor is currently in repeat-dose phase 1 trials, and Seaside expects to start phase 2 trials with fragile X patients by early 2011.

Mutations in glutamate receptor genes 

GRIN2A and GRIK2 and multiple GABA 

receptor genes have been associated with 

autism. Two pharma companies also see 

promise in the mGluR5 receptor strategy 

for treating fragile X patients. Novartis in 

Basel recently completed a phase 2 clinical 

trial of their compound AFQ 056 at sites 

in Europe and is planning their next study, 

which is scheduled to open later in 2010, says 

spokesman Jeffrey Lockwood in an e-mail. 

Roche’s small-molecule mGluR5 antagonist 

is being tested in phase 2 clinical trials 

in five locations in the United States, says 

Santarelli. Their results are “encouraging so 

far,” he says. This growing understanding 

of these specific, related genetic disorders, 

Santarelli adds, provides a pathway to think 

about possible extrapolations to the more 

sporadic types of autism. 

Peptide hormone targets 

The peptide oxytocin and its related receptors 

are emerging as a pathway that could prove 

useful for treating a variety of neuropsychiatric 

disorders including autism. Animal studies 

have pointed to the importance of oxytocin in 

social behavior; in voles, for example, oxytocin and its counterpoint hormone vasopressin appear to have a role in pair bonding.

Karen Parker and her colleagues at Stanford 

University in California observed seasonal 

differences in the way females and males who 

are raising young interacted. In the laboratory, they traced these differences, caused by purely environmental cues, to the locations of oxytocin receptors in the animals’ brains. Changes based on environmental cues have led researchers to consider oxytocin therapies for treating social dysfunction in humans.

Such tests are already being done in humans. 

Hollander has given intravenous oxytocin 


to higher functioning patients with autism 

and Asperger’s syndrome and has observed 

improved social cognition. Patients were better 

able to lay down social memories or recognize 

emotions in spoken language, he says. 

Such treatments also decreased the severity 

of repetitive behaviors and self-stimulatory 

behaviors such as hand clapping, rocking and 

head banging. 

Patients treated with intranasal oxytocin 

showed similar improvements. Earlier this 

year, researchers at the Center for Cognitive 

Neuroscience in Bron, France, found that adults 

diagnosed as high functioning on the autism 

spectrum who received doses of intranasal 

oxytocin were better able to recognize cooperative 

play than adults with a similar diagnosis 

who had not received oxytocin. Those who had 

received oxytocin also spent more time looking 

at the face of their virtual playmates (ref. 5).

But teasing out the importance of oxytocin 

isn’t easy. The French study shows variation in 

individual responses to oxytocin. “We don’t have 

good biomarkers of oxytocin levels,” Parker says. 

Funded by a grant from the Simons Foundation, 

she and her colleagues are trying to measure 

plasma oxytocin levels, various mutations and 

social phenotypes among individuals with 

autism and their siblings and compare them 

with controls matched for age and gender. 

Oxytocin and the related response pathways 

represent “one of the most exciting biologies in 

the autism space today,” says Pfizer’s Ring and 

could have implications for other psychiatric 

areas as well. In research Ring carried out at 

Wyeth, he developed the first nonpeptide oxytocin receptor agonist (ref. 6). “The oxytocin receptor is a priority target for the field, but a very challenging target to develop traditional small-molecule chemistry for.”

Cellceutix, a biotech company in Beverly, 

Massachusetts, is also testing a preclinical 

compound for autism, KM-391, in a rodent 

model of autism developed by researchers at 

the Kennedy Krieger Institute in Baltimore. 

The autism-like symptoms are induced by 

injecting the chemical 5,7-dihydroxytryptamine 

(5,7-DHT) into the forebrain of newborn 

rat pups, leading to neonatal serotonin depletion, 

reduced brain plasticity and abnormal 

behaviors. In an initial study, KM-391 given over 90 days restored normal behaviors and near-normal serotonin levels, and increased brain plasticity, relative to a nontreatment group and a group given Prozac. Another study

measuring serotonin levels in three regions of 

the rat brain has confirmed the restoration of 

normal serotonin levels. 

Another small study added an oxytocin antagonist to the mix. The antagonist alone intensified the autism-related behaviors, such as repetitive behaviors and sensitivity to touch, but when given with KM-391, the frequency and intensity of these behaviors were reduced.

Table 1 Selected companies with autism targets in clinical development

Company | Target | Drug candidate | Stage of development
Curemark | Protease deficiency | CM-AT (a mixture of amylase, protease, chymotrypsin, trypsin, papain and papaya in a 4–10:1 ratio with lipase, derived from animal, plant, microbial or synthetic sources) | Phase 3
Novartis | mGluR5 | AFQ 056 (small molecule) | Phase 2
Roche | mGluR5 | RO4917523 (small molecule) | Phase 2
Seaside Therapeutics | GABA B | STX209 (R-isomer of baclofen) | Phase 2
Seaside Therapeutics | mGluR5 | STX107 (2-methyl-1,3-thiazol-4-yl) ethynylpyridine) | Phase 1
Forest Laboratories | NMDA receptor modulator | Namenda (memantine) | Phase 2

Measuring outcomes

Fueled by academic research and increased funding from the US National Institutes of Health and from nonprofit and advocacy organizations, the field is moving forward. But even as some drug candidates are moving into the clinic, a number of challenges remain for the field as a whole. Above all is the problem of the heterogeneity of the disorder, according to Colamarino. “We’re calling it one thing when it’s really probably more than one.” That heterogeneity can pose a challenge in choosing appropriate study subjects. The field is also struggling with finding appropriate outcome measures, particularly those that can be measured within the time frame of a clinical study. Without sensitive measures of changes in the core symptoms, researchers need to identify what the focus should be within a particular trial. In many cases researchers have depended on parental reporting of behavioral changes, Colamarino says, leading to a large placebo effect. Although no biomarkers have been established for autism, some sort of biological measure of change in connection with autism’s core symptoms would be particularly attractive. Some clinical trials have failed because of methodological issues, she adds. “That’s why we need to address this sooner rather than later.”

To bring researchers together to discuss these challenges, Autism Speaks and Pfizer are co-sponsoring a translational research meeting to improve clinical study methodology and design, tentatively scheduled for later this year. “There’s no better investment for us externally than to bring together all the key experts in this area and have a discussion with FDA present and try to iron out a framework to address this challenge together,” Ring says. The development of the Diagnostic and Statistical Manual of Mental Disorders (DSM-V), the bible for neurological diseases, scheduled for release in May 2013, could complicate the development of trial endpoints, Bay City’s Lopez-Figueroa adds, depending on how autism disorders and symptoms are classified.

A second meeting in early 2011 will look at 

clinical targets—both their identification and 

validation—in an attempt to reach a consensus 

on where therapeutics can bring the most 

initial benefit to patients. This is something 

the field is still struggling with, Ring says. “If 

we had one shot today to demonstrate that 

this would work, what would be the clinical 

target that we should take on?” 

Pfizer and Roche are also developing an 

autism proposal for the Innovative Medicines 

Initiative, which coordinates European 

Union–based public-private partnerships in 

drug discovery and development. The idea 

is for companies to join forces to work on 

research that is not generating intellectual 

property, Santarelli says, such as the development 

of animal models, understanding disease 

mechanisms and physiology, finding biomarkers 

and developing clinical methodology. 

Unquestionably, developing therapeutics 

for a developmental neuropsychiatric disorder 

with such an early onset presents several 

challenges. But Autism Speaks’ Colamarino is 

encouraged by the growth in the field. “Three 

to five years ago, we wouldn’t have been talking 

about clinical trials, certainly with respect to 

novel drug discovery,” she says. Pfizer’s Ring 

expects industry involvement to continue to 

grow: “It’s just too large an unmet medical 

need for companies not to see the opportunity 

to enter into this research space.” 

Sarah Webb, Brooklyn, NY 

1. Pinto, D. et al. Nature 466, 368–372 (2010). 

2. Berkel, S. et al. Nat. Genet. 42, 489–491 (2010). 

3. Guy, J. et al. Science 315, 1143–1147 (2007). 

4. Dölen, G. et al. Neuron 56, 955–962 (2007). 

5. Andari, E. et al. Proc. Natl. Acad. Sci. USA 107, 4389–4394 (2010).

6. Ring, R.H. et al. Neuropharmacology 58, 69–77 (2010).


Building a business

At ground level 

Julian Bertschinger 

The hardest—and perhaps loneliest—period of being an entrepreneur might be just after your company is founded. 


I cofounded Covagen when I was 30 years old. Although my PhD and postdoc work

had taught me to think in a focused manner 

and be product oriented, I was as green as 

they come concerning the nuts and bolts of 

launching a company. Picking it up as you go 

might not be the optimal way to learn, but 

I’m living proof that it can be done with the 

right team. Here’s how we did it. 

Two men and a plan 

The most important motivating factor, for me, 

was my education. I did my thesis in Dario 

Neri’s lab at the Institute of Pharmaceutical 

Sciences at ETH Zurich. The research group 

there had just isolated an antibody fragment 

that binds to a tumor-associated marker, 

and proof-of-concept data showed that the 

fragment selectively targeted solid tumors 

in mice. Neri went on to cofound Philogen, 

based in Siena, Italy, and develop the antibody 

in collaboration with Bayer Schering 

in Berlin. Today, several derivatives of this 

antibody are in phase 2 trials. 

Seeing this process firsthand showed me 

(and Dragan Grabulovski, my cofounder at 

Covagen, which is based in Zurich) that it was 

possible to move from the lab to the commercial 

side. This had our group thinking about 

products right away, which I believe is crucial 

when contemplating a biotech company. But 

the truth is that Covagen never would have 

been founded without the Venture business 

plan competition, organized every two years 

by McKinsey, in Zurich, and ETH Zurich. 

One of the winners of this competition was 

Glycart Biotechnology, also in Zurich, which 

took the prize in 1998 and eventually was 

acquired by Roche, in Basel, Switzerland, for 

CHF235 million (US$180 million) in 2005. 

Julian Bertschinger is CEO at Covagen, Zurich, Switzerland.
e-mail: julian.bertschinger@covagen.com

Grabulovski and I decided to take part in the Venture 2006 competition for two reasons:

we were eager to learn how to write a business 

plan (we’d never written one) and we thought 

it would be interesting precisely because it was 

so different from the reports and scholarly 

articles we were used to writing. 

The competition is divided into two 

phases. During the first, entrants submit a 

business idea outlined on a few pages, and 

the best ten ideas are awarded a prize. In the 

second, all participants receive free coaching 

from industry experts and venture capitalists, 

who then give advice to participants 

writing their first business plan. The ten best 

business plans are chosen by a jury and all 

receive the same prize amount of CHF2,500 

(US$2,057). 

We submitted our business idea, but I didn’t actually expect us to be one of the winners; I was busy applying for postdoc positions abroad. Nevertheless, our idea was chosen out of about 100 applications to be awarded a CHF2,500 prize. This surprised me, not because we doubted our entry, which was based on the Fynomer technology (Fig. 1; Box 1; and D. Grabulovski et al. J. Biol. Chem. 282, 3196–3204 (2007)), but because we felt that it was too early to found a company on the available results: we had no in vivo data.

Looking back, the biggest effect of participating in Venture 2006 was that it let us begin to establish a business network; previously, we’d known only people within academia. At workshops during the second phase of the competition, we met Rudolf Gygax, a managing director of Novartis Venture Fund, who would be a key contact for us later on. He and Neri helped us to draft our first business plan.

The prize money was certainly useful, but the large amount of positive feedback we received was even more important. That boosted our confidence, and after winning, I thought for the first time that we really could found our own company.

Box 1 The technology behind Covagen

Covagen is built on Fynomer technology (Fig. 1), developed at the Institute of Pharmaceutical Sciences at ETH Zurich. Fynomers are a class of binding proteins derived from the Src homology 3 (SH3) domain of the human Fyn kinase (D. Grabulovski et al. J. Biol. Chem. 282, 3196–3204 (2007)). The Fyn SH3 domain structure is made up of two anti-parallel β-sheets and two loops, n-src and RT, which are known to be involved in interactions with other ligand proteins.

Fynomers can be produced in bacteria at high yields and are approximately 20 times smaller than antibodies. Additionally, they have the advantage of being easily assembled in a modular manner to yield bispecific and/or multivalent proteins, which might allow new treatment modalities that are challenging or impossible to explore with traditional antibody formats.

Figure 1 Fyn Src homology 3 (SH3) domain structure. The RT-Src loop is shown in red, and the n-Src loop is shown in green. (Protein Data Bank entry 1M27)




Box 2 Securing our funding 

I was able to found Covagen with an initial investment (in several tranches) from the 

Novartis Venture Fund. The first tranche came after we signed the investment documents, and the following tranches hinged on attaining research milestones.

It was crucial that Novartis Venture Fund was prepared to invest in us at a very early 

stage. Corporate venture funds are beneficial in this way: they are usually more likely to do 

early-stage investments than most private venture capitalists because corporate funds can 

afford longer times to exit. If you’ve hit upon an interesting idea in academia, you might 

look to corporate venture funds first. 

In 2009, Covagen was able to attract three other investors: the corporate venture 

fund MP Healthcare Venture Management, of Boston; Ventech, of Paris; and Edmond de 

Rothschild Investment Partners, also of Paris. We also have received some funds via our 

research collaboration with Roche, which was secured in June 2009. 

To move our interleukin-17A inhibitor into preclinical and clinical development, we 

are planning to raise additional money this year, so we are seeking one or two venture 

capitalists to join our existing investors. 

Founding Covagen 

We stayed in contact with Gygax, and he 

invited us to present our project at the 

Novartis Venture Fund headquarters in Basel. 

The fund was interested in investing, and we 

sat down to negotiate our first term sheet. I 

had absolutely no idea what the difference 

was between a binding contract and term 

sheet, and this was my initiation. I learned 

what Series A shares are, how to calculate 

pre-money and post-money valuations, what 

drag-along and tag-along clauses are, why a 

high liquidation preference for investors is 

bad for holders of common shares and how 

anti-dilution protection for investors can 

hurt founders in a down round. I was moving 

into a whole new world. 
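
The arithmetic behind a few of those terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every figure is hypothetical and has nothing to do with Covagen's actual term sheet, the function names are invented for the example, and the liquidation preference is assumed to be non-participating (investors take either their preference or their pro-rata share, whichever is larger).

```python
from typing import Tuple


def post_money_valuation(pre_money: float, investment: float) -> float:
    # Post-money valuation is the pre-money valuation plus the new cash invested.
    return pre_money + investment


def investor_stake(pre_money: float, investment: float) -> float:
    # Fraction of the company the new investors own after the round.
    return investment / post_money_valuation(pre_money, investment)


def exit_split(exit_value: float, investment: float, stake: float,
               preference_multiple: float = 1.0) -> Tuple[float, float]:
    """Split exit proceeds between investors and common shareholders.

    Assumption: a non-participating liquidation preference. Investors take
    the larger of (a) their preference, capped at the exit value, or
    (b) their pro-rata share as if converted to common shares.
    """
    preference = min(exit_value, preference_multiple * investment)
    pro_rata = stake * exit_value
    to_investors = max(preference, pro_rata)
    return to_investors, exit_value - to_investors


if __name__ == "__main__":
    # Hypothetical round: 1 million invested at a 3 million pre-money valuation.
    pre, cash = 3_000_000.0, 1_000_000.0
    stake = investor_stake(pre, cash)
    print("post-money:", post_money_valuation(pre, cash), "investor stake:", stake)
    # Compare a 1x and a 2x preference at a modest 2 million exit.
    for multiple in (1.0, 2.0):
        print(multiple, exit_split(2_000_000.0, cash, stake, multiple))
```

In this toy case a modest exit under a 2x preference leaves nothing for the common shareholders, which is the sense in which a high liquidation preference works against founders and other holders of common shares.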

It is very important to understand every 

word in term sheets and agreements. You 

should always know what you are signing. 

To do this, first make sure you find a lawyer 

who intimately knows relationships between 

venture capitalists and biotech startup companies, 

and then be persistent enough to ask your 

lawyer about every single expression or phrase 

you do not understand. (You can familiarize 

yourself somewhat with the terminology by 

using the internet, in particular http://www.investopedia.com/terms/v/venturecapital.asp,

but also ask your lawyer directly.) 

When we finally signed the term sheet, 

we found it just meant more paperwork. We 

still needed to establish a licensing agreement 

with ETH Zurich and negotiate the 

investment and shareholders’ agreements. I

admit that when I first read the investment 

document drafts, I thought the beginning 

definitions weren’t very relevant. But after 

further reading and questioning our lawyer, 

I quickly realized that those definitions are 

actually one of the most important things in 

a contract. 

Once all the details were ironed out 

(Box 2), we founded Covagen in December 

2006 and signed the investment agreements 

with Novartis Venture Fund. The real work 

was about to start. 

The lonely lab 

Grabulovski still had to finish his PhD thesis. 

This made me Covagen’s only employee 

from December 2006 until May 2007, and 

Covagen was a startup in every sense of 

the word. My first task was to open a bank 

account so Novartis Venture Fund could 

transfer in its investment. 

When that was done, I set up Covagen’s 

homepage (be sure to check for domain 

name availability before you decide on a 

company name). A friend of a friend runs a 

company offering website design and e-mail 

hosting services, and he helped me create 

Covagen’s website. Here’s a tip: make sure 

that you can administer the website yourself 

so you will not have to pay a web designer for 

every small change or update. In addition, I 

opened a Covagen e-mail account, and here, 

too, I made sure I could independently set up 

additional e-mail accounts. 

But there remained a very big need—work 

space. We had no laboratory. Unfortunately, 

ETH Zurich does not offer incubator space 

for spin-outs. Startup companies usually try 

to find space within the department they 

originated from, but in our case there was no 

room available. After asking around within 

ETH Zurich, Grabulovski learned of an empty 

laboratory not attached to any department, 

and we were able to make an arrangement to 

allow us to rent this space. In addition, our former 

institute enabled us to access some rather 

expensive instruments for an affordable fee. 

The laboratory was empty, except for 

benches and desks, and somewhat dusty. On 

my second day, I brought rags from home 

and started cleaning. This wasn’t really what 

I envisioned a biotech CEO doing, but the 

truth is, I was excited—I was starting a company 

from the very bottom! There was no 

network connection for my computer, no 

printer, no phone, no fax. However, after 

making a few calls with my mobile phone, 

the university’s staff set up all the necessary 

connections within a few days. This is 

a benefit of staying within academia: when 

starting your company, all issues related to 

infrastructure need only minimal time and 

management resources. 

After all that work, I thoroughly appreciated 

making the first company phone call and 

sending the first message from my Covagen 

e-mail account! 

With communications behind me, I was 

left with the science. It’s only when you start 

from scratch that you realize how many different 

instruments and tools, disposable 

plastic tubes, glassware, kits, antibodies and 

chemicals are needed for research, and I had 

none of it. I also realized how comfortable 

my life in the academic lab had been, where 

many instruments were available and I didn’t 

have to think about budgeting. That was not 

the case at Covagen, where I became very 

cost sensitive. Comparison shopping takes 

time, and it was four months before the last 

instruments and reagents arrived. This neatly 

coincided with Grabulovski earning his PhD 

in May 2007, and he joined Covagen as CSO. 

I finally had company. 

Building a biotech 

Established as Covagen, we now had several 

target proteins in mind to validate the 

technology, but we did not have a clear plan for which targets to focus on in developing our first Fynomer-based clinical candidate. Choosing a good

first target was the most important decision 

we needed to make because once we made 

the call, we’d invest most of our resources in 

that direction. We investigated many different 

targets to find one that was economically 

promising and in an area in which Covagen 

had freedom to operate. We decided to go for 

inhibition of the cytokine interleukin-17A, 

which is an attractive emerging target for 

diseases such as rheumatoid arthritis, psoriasis 

and uveitis. 

In early summer 2007, we hired another 

person to help speed up our research. We 

had spent less money than we expected in the 

first half of 2007, so we had sufficient financial 

resources to hire. We felt that our first 

employee should be someone we already knew 

and someone we could trust to be dependable 




and competent. As several investors had 

warned us, not getting along with co-workers 

is a big reason why many small companies fail. 

Personal frictions tend to increase even more 

if a company hits hard times. 

We asked Simon Brack, an antibody engineering 

specialist we knew from our time in Neri’s 

group, to join Covagen. Brack was returning to 

Switzerland from Oxford, where he’d worked as a 

postdoc. In October 2007, he became Covagen’s 

third employee and was a great hire. 

Even in a company as small as Covagen 

was then, there were a million administrative 

things to do, and they occupied a large amount 

of my time—I was finding it hard to do the 

necessary work on the bench to develop our 

technology, not to mention that creating documents 

and presentations for potential investors 

takes a lot of time. So at the very least, it 

felt good to know that if I had to leave the lab, 

I had four hands working while I was gone. 

Now, we are up to seven employees. 

Advancing our technology is the most 

important task we have at Covagen, just as 

it was when we started. For this reason, all 

employees at Covagen are PhD scientists. We 

are a young and enthusiastic team; none of 

us is older than 33. This can be a problem 

at times: when talking to investors, I realize 

that we sometimes lack credibility. Quite 

often, investors do not believe our claims, 

and mainly that’s because they do not believe 

I have enough experience. In some ways, 

they are right—I am a scientist still learning 

the business side of things. But we have 

been taught a lot about the varying aspects 

of drug development through working with 

Neri, and I believe a young group like us can 

learn fast if given the right advice. 

Currently, we’re getting that advice from 

Ray Hill, who was executive director for 

licensing in Europe at Merck & Co. and now 

is a visiting professor in neuroscience and 

mental health at Imperial College London. 

Hill sits on our board of directors. We’ve also 

established an excellent scientific advisory 

board, which will be of great help and value 

when bringing our first drug candidate to 

preclinical development and broadening our 

research activities. 

Conclusions 

Even as our company grows, things continue 

to change quickly and will for the foreseeable 

future. The larger we get, the more important 

(and time consuming) communicating 

with employees, investors, our board of 

directors and our scientific advisory board 

becomes. My tasks are always shifting as we 

adapt, improve and complement our skills. 

But this fluid environment is partially what 

makes startup companies attractive workplaces. 

Now, our company doesn’t feel so young 

anymore. This year, we plan to bring our 

first drug candidate to good manufacturing 

practice production and preclinical development. 

That, of course, will require additional 

money, and we plan to close a financing 

round this year. Raising a sizable round is 

another challenge for me, and it means I’m 

no longer on the bench. My job is raising 

money now. In that regard, I’ve graduated to 

the role of a typical biotech CEO. 

To discuss the contents of this article, join the Bioentrepreneur forum on Nature Network: 

http://network.nature.com/groups/bioentrepreneur/forum/topics 


correspondence 

Waking up and smelling the coffee 


To the Editor: 

As I pointed out recently on the Patent 

Docs weblog (http://www.patentdocs.org/), the editorial ‘Sitting up and taking

notice’ in the May issue 1 , announcing 

Judge Sweet’s 29 March decision in favor of 

the plaintiffs in Association for Molecular 

Pathology v. US Patent and Trademark 

Office, contains several misstatements and 

promotes the wrong-headed idea that gene 

patenting is a problem. 

In describing the case, you begin by 

making factual errors. Judge Sweet’s 

decision (summary judgment) does not 

indicate that “the judge felt that Myriad 

had no case to argue.” Rather, summary 

judgment is used when there are no 

disputed issues of material fact, and the 

case is decided as a matter of law. I would 

argue that the prudence of Judge Sweet’s 

judgment is questionable because he chose 

to make law by deciding that DNA is not 

patent eligible for being “the physical 

embodiment of genetic information.” 

You then state that “[t]he plaintiffs…won 

on virtually every count.” In fact, the court 

refused to consider the US Constitutional 

issues raised in the complaint, which 

formed the basis for the breast cancer 

victims to have standing in the lawsuit. 

This is not trivial because the court used 

these constitutional issues not only to 

deny defendants’ motions to dismiss, but 

also, politically, to provide the political 

frisson so attractive to the American Civil 

Liberties Union (New York) and the Public 

Patent Foundation (New York). 

The editorial goes on to mischaracterize 

the effects of BRCA patents on research, 

stating that “Myriad’s influence has been 

particularly pernicious. Its lawyers have 

issued cease-and-desist letters to genetics 

laboratories in universities, hospitals and 

clinics that offered diagnostic services 

based on the BRCA1 and BRCA2 genes.” 

Why is enforcing your patent rights 

pernicious? Use of these patented tests by 

these institutions constitutes infringement. 

It doesn’t matter whether the infringer 

is a university, hospital or clinic, they 

are still liable for infringement owing to 

their for-profit, commercial activities. 

There is no evidence that Myriad Genetics 

(Salt Lake City, UT, USA) or any other 

gene patent holder has inhibited basic 

biological research by threatening patent 

infringement litigation; indeed, there are 

several thousand basic research papers in 

scientific journals that have been published 

since the BRCA gene patents were granted. 

The piece also attempts 

to achieve ‘truth by 

association’ in citing 

several groups having 

“concerns” about gene 

patents that filed amicus 

briefs, including the 

International Center for 

Technology Assessment, 

Greenpeace, the 

Indigenous Peoples’ 

Council on Biocolonialism 

and the Council for 

Responsible Genetics. 

Their contribution would 

be more worthwhile if 

it did not include incorrect statements 

regarding gene patenting’s consequences, 

including “the privatization of genetic 

heritage, the creation of private rights of 

unknown scope and consequences and the 

violation of patients’ rights.” 

The editorial was correct in noting 

that “[t]he alignment of physicians’ 

and patients’ groups with what are, in 

effect, antibiotech lobbyists is a worrying 

development,” albeit ignoring the fact that 

not only the biotech sector, but also the 

public should be worried if these groups get 

their way. 

The editorial did supply potentially 

informative data, that Myriad reported 

“$326 million in revenue from diagnostic 

testing against $43 million in costs.” 

Assuming that these numbers are correct, 

and reflect only BRCA testing, this 

could be a measure of the profitability of 

BRCA testing results (perhaps providing 

motivation for the “universities, hospitals 

and clinics” to be so keen on getting into 

the business, infringing or no). But even 

here, the figures are completely out of 

context. No indication is provided whether 

these profits are out of the ordinary for a 

diagnostics company, traditional or genetic, 

or whether the ‘costs’ include ancillary 

costs like genetic counseling or physician 

education (both critical in genetic 

diagnostics due to the consequences for a 

patient of receiving a genetic diagnosis). 

If Myriad’s profits are 

significantly higher than 

those at other diagnostic 

companies, that fact would 

be relevant. The absence of 

any comparisons suggests 

that the absolute numbers 

were used because they 

better supported the 

editorial’s views. 

Finally, the editorial 

departs from reality when 

it decries the patent system 

for rewarding “only the last 

inventive step—the small 

breakthrough that enables 

a concept to be realized.” Such a statement 

indicates just how little the writers 

understand the ‘balance of rights’ that the 

patent bargain actually strikes. The patent 

system rewards inventors who disclose 

how to make and use an invention that 

is new, useful and nonobvious. Whether 

the improvement is groundbreaking or 

incremental, satisfaction of the statutory 

requirements governs patentability. Thus, 

if technology becomes obsolescent, new 

technology takes its place—because patents 

expire, as indeed Myriad’s patents will 

begin to expire in 2014. The consistent 

lack of understanding of innovation and 

the patent process is illustrated by the 

suggestion that rights to specific genes in 

multigene tests be assigned based on “the 

importance of any specific gene sequence 

to the utility of the test.” This is something 

the marketplace can be counted on to do 

without the government’s help. 

The last sentence of the piece 

even acknowledges the editorial idea 




is “implausible within the current 

petrified patent system and commercial 

infrastructure,” and then adds that this 

“doesn’t have to stop the dream” or “stop 

the discussion.” I would counter that the 

dream of better diagnostics and therapies 

is being, and has been, realized by 30 years 

of biotech and protection thereof by an 

invigorated patent system in the United 

States (and elsewhere). Changing that now, 

particularly if based on the wooly-headed 

arguments (really, sentiments) in the 

editorial, is the fastest and surest way that 

those hopes and dreams will be dashed. 

COMPETING FINANCIAL INTERESTS 

The author declares no competing financial 

interests. 

Kevin E Noonan 

McDonnell Boehnen Hulbert & Berghoff LLP, 

Chicago, Illinois, USA. 

e-mail: noonan@mbhb.com 

1. Anonymous. Nat. Biotechnol. 28, 381 (2010). 

Nature Biotechnology replies: 

We were not making the case that gene 

patenting itself was a problem, although it 

is clear that some DNA patents with overly 

broad claims are cause for concern. We 

disagree with the contention that “there 

is no evidence that Myriad Genetics…or 

any other gene patent holder has inhibited 

basic biological research by threatening 

patent infringement litigation.” There are 

cases where exclusive licensing practices 

(a particular problem for methods patents) 

or aggressive license enforcement has 

stymied research, as is detailed elsewhere 

in this issue 1 . The problems also reach 

beyond basic research: a survey of 132 

clinical laboratory heads in the United 

States found that 53% had “decided not 

to develop or perform a test/service for 

clinical or research purposes because of a 

patent” 2 . Indeed, one of the plaintiffs in 

the Association for Molecular Pathology 

v. US Patent and Trademark Office case 

is a patient who would like to have their 

BRCA1 test from Myriad independently 

verified by another laboratory, but cannot 

because of Myriad’s aggressive stance that 

prevents other laboratories performing the 

test. It might be good business for Myriad, 

but is it reasonable to enforce intellectual 

property in such a manner that it is so 

difficult for a patient to confirm a DNA 

test in an independent laboratory? 

The claim that new technology takes the 

place of ‘obsolescent’ technology because 

“patents expire” is also moot in relation to 

DNA patents. A point we were trying to 

make in the editorial is that the fields of 

molecular diagnostics and sequencing are 

moving so quickly that they are becoming 

obsolete along much shorter timelines 

than patent terms of 20 years. Although it was not trivial to sequence a human gene 20 years ago, it is certainly becoming routine today.

1. Carbone, J. et al. Nat. Biotechnol. 28, 784–791 (2010).

2. Cho, M.K. et al. J. Mol. Diagnostics 5, 3–6 (2003).

Genetic stability in two commercialized transgenic lines (MON810)

A letter of correspondence by Dany Morisset and his colleagues 1 in the August 2009 issue cites two recent publications 2,3 in which “two commercial seed varieties of the MON810 maize genetically modified event (ARISTIS BT and CGS4540) present genetic variation thus hampering the detection by several methods for MON810 (Monsanto, St. Louis).” As representatives of Monsanto Europe (Brussels), Syngenta Crop Protection (Basel) and Limagrain Services Holding (Chappes, France), we would like to correct the scientific record concerning the claimed “variation” of the transgenic insertion in these transgenic hybrids.

Upon request for further information, Margarita Aguilera and her colleagues at the European Commission, Directorate General Joint Research Center (JRC) in Ispra, Italy, informed us that the seeds tested were among 26 MON810 varieties provided by the Spanish Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA; Madrid). The Spanish agency did not provide the JRC with details of the respective batch numbers for each variety.

Our investigation has revealed that the two deviating results were not in fact related to variation of the transgenic insertion, as reported by Aguilera et al. 2,3 . Instead, our conclusions are that the two varieties (reported as entry 2 and entry 5) were not MON810 maize hybrids at all.

Variety CGS4540 (entry 5) is a Bt176 maize hybrid and we do not understand why the seed was provided by INIA as MON810. Entry 2, which was designated as Aristis Bt, is most likely Aristis, the conventional

counterpart of Aristis Bt (MON810). When 

we requested INIA to send a sample of 

Aristis Bt to its official Spanish laboratory 

CSIC (Consejo Superior de Investigaciones 

Científicas) for testing, the 

results were positive for 

MON810, as expected. 

Aguilera and her 

colleagues were not able 

to provide a correct chain 

of custody for the samples 

used in their analyses, 

which would have allowed 

resolution of the origin of 

these deviating results. 

The seed industry has 

invested significantly to 

provide quality products 

to the market place, which 

includes selling compliant 

and stable products. Traits are tested for 

presence and stability for many generations 

before release to the market place. We 

are therefore convinced that there is no 

scientific evidence of instability in MON810 

hybrids. 


The authors declare competing financial interests: 

details accompany the full-text HTML version of the 

paper at http://www.nature.com/naturebiotechnology/. 

Sofia Ben Tahar 1 , Isabelle Salva 2 & 

Ivo O Brants 3 

1 Limagrain Services Holding, Quality Assurance, 

Chappes, France. 2 Syngenta Crop Protection AG, 

Regulatory Affairs, Basel, Switzerland. 3 Monsanto 

Europe SA, Scientific Affairs, Brussels, Belgium. 

e-mail: ivo.o.brants@monsanto.com 

1. Morisset, D. et al. Nat. Biotechnol. 27, 700–701 

(2009). 

2. Aguilera, M. et al. Food Anal. Methods 1, 252–258 

(2008). 

3. Aguilera, M. et al. Food Anal. Methods 2, 73–79 

(2009). 



Distances needed to limit cross-fertilization 

between GM and conventional maize in Europe 



To avoid the economic consequences of 

admixtures of genetically modified (GM) 

and non-GM harvests, and to ensure that 

agricultural production complies with 

mandatory labeling provisions, the European 

Union (EU; Brussels) member states have 

adopted co-existence measures directed to 

farmers cultivating GM varieties. For GM 

maize cultivation, regulators have established 

mandatory isolation distances, which 

differ between countries and in some cases 

have been regarded as disproportionate 1,2 . 

Taking advantage of numerous field studies 

conducted by EU researchers in recent years, 

we report here a statistical analysis of cross-fertilization

data in maize, showing that

separating fields 40 m is sufficient to keep 

GM adventitious presence below the legal 

labeling threshold in the EU set at 0.9%. 

Currently, insect-resistant maize 

(engineered to express Bacillus thuringiensis 

toxin; Bt) and Amflora potato (engineered 

with antisense against granule-bound starch 

synthase), which was recently approved 3 , 

are the only two GM crops authorized for 

commercial cultivation in the EU. Bt maize 

was approved in 1998 and currently covers 

1.2% of the total maize area in the EU 

(Supplementary Notes 1 and 2). 

Given the legal standards for labeling and/ 

or purity, the cultivation of GM maize in the 

EU is associated with mandatory technical 

coexistence measures designed to reduce 

the adventitious presence of GM maize 

in neighboring non-GM maize harvests. 

Such measures, to be applied by GM maize 

growers, should be stringent enough to 

keep adventitious presence below 0.9% so 

that conventional maize can comply with 

labeling provisions and avoid any potential 

price premium losses associated with GM 

admixtures 4,5 . 

Cross-fertilization between neighboring 

maize fields is the most important ‘biological’ 

source of admixture between GM and 

conventional maize 4,5 . Factors influencing 

cross-fertilization rates in maize cultivation 

are well studied and include, among others, 

the distance between fields, flowering 

synchrony, weather conditions, the relative 

positions of donor and receptor fields (with 

respect to dominant winds in the area) 

and the size and shape of fields 4 . Because 

of the difficulty to control some of these 

parameters, regulatory bodies from most 

EU countries have decided to establish 

mandatory separation distances between GM 

and non-GM maize fields as the preferred 

single measure to limit cross-fertilization 6 . 

An overview of mandatory separation 

distances adopted by EU member states 

(Supplementary Table 1) shows a remarkable 

range of variation, 25–600 m, between the 

different countries. Although climatic and 

landscape parameters in maize cultivation 

(that affect cross-fertilization rates) are 

variable in the EU, often there is little science-based

evidence that the distances adopted

are proportional to achieve the desired purity 

standards. 

To test the proportionality of the 

separation distances established by EU 

member states, we performed a statistical

analysis of data obtained from a number of 

recent studies on maize cross-fertilization 

performed in different European countries. 

Although the various studies recorded 

different variables, we analyzed only data 

on cross-fertilization rates (measured as 

percentage of seeds in the sample) in the 

receptor field as a function of distance 

from the edge of the pollen source. The aim 

of the analysis was to estimate distances 

necessary to keep cross-fertilization below 

different arbitrary tolerance thresholds and 

with different confidence levels. The results 

should inform debate on whether current 

distances between GM and non-GM maize 

fields stipulated by member states to meet 

legal EU labeling thresholds are supported by 

scientific data. 


We first compiled a database of cross-fertilization

rates and distance by collating

different publications and unpublished 

studies on maize cross-fertilization, to obtain 

a total of 1,174 observations covering four 

European countries (Germany, Italy, Spain and 

Switzerland). Details on the sources of data 

used are given in Supplementary Table 2. 

The database covered studies with a variety 

of experimental designs (mostly receptor and 

donor fields side by side, but also donor and 

receptor fields dispersed in actual agricultural 

landscapes) and that had been performed 

in different growing seasons (2001–2006). 

Data originate from experimental designs 

representing worst-case scenarios (receptor 

fields situated downwind from donor fields 

and coincidence of flowering between donor 

and receptor fields) in Europe. 

Across the database, cross-fertilization rates

decline as the distance from the pollen source

increases (Fig. 1). This reciprocal

relationship between cross-fertilization rates 

and distance was pointed out previously 

by several other authors 4,5,7–9 . For further 

analyses, cross-fertilization rates were 

analyzed for 10 m distance intervals 

(Supplementary Table 3). Because of the lack 

of sufficient observations from 50 m upwards, 

the size of intervals was increased to 20 m. 

Supplementary Table 3 shows that data on maize cross-fertilization are mostly available for short distances, close to the donor (84.1% of the data set, or 985 observations, are taken between 0 m and 20 m). In contrast, only 4.2% of the measurements are available from distances above 50 m from the donor field. The mean cross-fertilization rate and the standard deviation for each distance interval were calculated using all data points in the interval, and the highest and the lowest values for cross-fertilization rate registered (Supplementary Table 3).

Figure 1 Cross-fertilization rates for Bt maize. The figure shows a meta-analysis of maize cross-fertilization data. Cross-fertilization rates are represented in relation to the distance from the pollen donor. The upper chart is a magnification of the original chart with a limited scale of the respective axis. Axes: out-crossing (% seeds) against distance (m).

The mean and variance of each distance interval were used to calculate the parameters that characterize different probability distributions at those intervals. Once the distribution was obtained, the probability of avoiding maize cross-fertilization at different threshold levels was calculated for each distance interval.

To ensure robustness of the results obtained, different probability distributions were used following parametric and nonparametric approaches. Both approaches produced similar results. In the parametric approach, the probability distribution used to represent the cross-fertilization level for a given distance interval was the gamma distribution. The parameters of the gamma distribution were determined by the mean and the variance of the data in each interval. The probability distribution of cross-fertilization being above a certain threshold level was obtained by conducting bootstrap sampling per interval 1,000 times. Bootstrap sampling yields a range of values for the parameters of the gamma distribution and therefore we were able to calculate the probability of being above a number of stated cross-fertilization thresholds (e.g., 0.9%; see ‘Gamma parameterization’ in Supplementary Note 3). We also estimated a beta distribution to analyze the data (Supplementary Note 4).

The nonparametric approach, where no distributional parameters are assigned, was based on a bootstrap simulation that consisted of drawing the observed data on cross-fertilization 1,000 times with replacement per interval. Therefore, we obtained 1,000 subsamples per interval. From each of these subsamples, the probability distribution of being above any cross-fertilization threshold can be calculated and mean and confidence intervals for the probability of being above a cross-fertilization threshold can be obtained.

Table 1 shows the mean probability of keeping cross-fertilization between maize fields below different arbitrary threshold levels (1.5%, 0.9%, 0.5% and 0.3%) for each separation distance interval, using the gamma distribution. A 95% confidence interval of the mean probability of keeping cross-fertilization below a certain threshold is calculated (see low and high bounds for each distance interval).

Table 1 Probability of keeping cross-fertilization below a certain threshold level (%) using a gamma distribution

Columns give the cross-fertilization threshold (% of seeds) 1; each cell gives the mean (low–high bounds).

Distance (m)   1.5%                  0.9%                  0.5%                  0.3%
(0–10]         49.44 (46.10–52.92)   41.16 (37.80–44.62)   33.11 (29.76–36.66)   27.30 (24.06–30.64)
(10–20]        91.19 (88.58–93.70)   70.89 (67.56–74.38)   41.41 (37.78–45.06)   21.80 (18.20–25.68)
(20–30]        99.86 (99.54–100)     95.62 (92.12–98.44)   66.94 (58.30–75.14)   31.19 (21.52–41.00)
(30–40]        99.99 (99.96–100)     99.61 (98.76–100)     94.14 (87.70–99.44)   77.26 (63.70–91.08)
(40–50]        99.88 (99.56–100)     98.56 (96.10–100)     92.07 (84.12–99.80)   79.38 (66.48–95.34)
(50–70]        99.88 (99.28–100)     99.11 (96.26–100)     95.89 (87.30–99.90)   88.05 (74.54–96.86)
(70–90]        99.98 (99.90–100)     99.58 (98.68–100)     96.08 (91.56–99.94)   86.81 (77.48–97.66)
>90            100 (100–100)         99.96 (99.86–100)     98.58 (97.30–99.54)   90.76 (86.22–94.76)

1 Numbers in italics indicate a scenario where separation distance is sufficient to reduce admixture in maize cultivation below different threshold levels (1.5%, 0.9%, 0.5% and 0.3%). Square brackets denote that the upper limit is included in the interval.

The results provided in Table 1 are relevant for policy decision-making. For example, implementing a 30 m separation distance would result in a probability higher than 95% (95.62%, see mean probability values in bold in Table 1) to keep cross-fertilization values below the 0.9% EU labeling threshold. The probability increases to 99% if a 40 m distance is implemented. However, it is known that cross-fertilization is not the only source of GM adventitious presence in maize harvests. Traces of GM seeds in conventional seeds and machinery are considered to be additional contributors to final adventitious presence 4,10 . Greater distances to the pollen source would be required if lower threshold levels for cross-fertilization were to be considered that aim to take into account additional sources of adventitious presence. For example, a distance of 40 m is needed to keep cross-fertilization below 0.5% with a probability higher than 90% (94.1%).
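The parametric calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (their details are in Supplementary Note 3): the gamma parameters are taken by the method of moments from the sample mean and sample variance, the bounds are a simple percentile interval over 1,000 bootstrap resamples, and the rates in `rates` are hypothetical values invented for the example. All function names are illustrative.

```python
import math
import random
from statistics import fmean, variance


def gamma_params(rates):
    """Method-of-moments gamma parameters (shape, scale) from mean and variance."""
    m, v = fmean(rates), variance(rates)  # sample variance (an assumption)
    return m * m / v, v / m


def gamma_cdf(x, shape, scale):
    """P(X <= x) for a gamma variable, using the power series of the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    # Far in the upper tail the CDF equals 1 for the purposes of this sketch;
    # returning early also keeps the series below from overflowing.
    if z > shape + 10 * math.sqrt(shape) + 30:
        return 1.0
    term = total = 1.0 / shape
    n = 1
    while abs(term) > 1e-16 * total and n < 100_000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def prob_below(rates, threshold):
    """Gamma-based probability that the rate stays at or below the threshold."""
    if variance(rates) == 0:  # degenerate resample: all values identical
        return 1.0 if fmean(rates) <= threshold else 0.0
    shape, scale = gamma_params(rates)
    return gamma_cdf(threshold, shape, scale)


def bootstrap_prob_below(rates, threshold, n_boot=1000, seed=0):
    """Mean and 95% percentile bounds of prob_below over bootstrap resamples."""
    rng = random.Random(seed)
    probs = sorted(
        prob_below(rng.choices(rates, k=len(rates)), threshold)
        for _ in range(n_boot)
    )
    return fmean(probs), probs[int(0.025 * n_boot)], probs[int(0.975 * n_boot) - 1]


# Hypothetical cross-fertilization rates (% of seeds) for one distance interval.
rates = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.70, 1.10]
mean_p, low_p, high_p = bootstrap_prob_below(rates, threshold=0.9)
print(f"P(rate <= 0.9%) = {mean_p:.3f} ({low_p:.3f} to {high_p:.3f})")
```

Running the same function once per distance interval and per threshold would give one cell of a table laid out like Table 1. A statistics library with a tested gamma distribution would normally replace the hand-written series.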

An analysis of the data in Table 1 also 

allows the effects of a hypothetical increase 

in the EU mandatory labeling threshold on 

segregation practices in maize cultivation to 

be estimated (countries such as Japan allow 

as much as 5% tolerance). For example, a 

20 m separation distance would be sufficient 

to achieve a desired threshold level of 1.5% 

(with a probability of 91.19%). When using 

a nonparametric approach (bootstrapping 

simulation) results were quite similar to 

those obtained for the gamma distributions 

(Supplementary Table 4). 
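For comparison, the nonparametric bootstrap can be sketched as follows. Again this is only an illustration under assumptions: each resample is taken to be the same size as the observed data, a rate exactly at the threshold is counted as compliant, and the bounds are a plain percentile interval. The function names and the example rates are hypothetical.

```python
import random
from statistics import fmean


def empirical_prob_below(rates, threshold):
    """Fraction of observations at or below the threshold."""
    return sum(r <= threshold for r in rates) / len(rates)


def bootstrap_empirical(rates, threshold, n_boot=1000, seed=0):
    """Mean and 95% percentile bounds of the empirical probability
    over resamples drawn with replacement."""
    rng = random.Random(seed)
    probs = sorted(
        empirical_prob_below(rng.choices(rates, k=len(rates)), threshold)
        for _ in range(n_boot)
    )
    return fmean(probs), probs[int(0.025 * n_boot)], probs[int(0.975 * n_boot) - 1]


# Hypothetical rates (% of seeds) for one distance interval.
example_rates = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.70, 1.10]
mean_p, low_p, high_p = bootstrap_empirical(example_rates, threshold=0.9)
print(f"P(rate <= 0.9%) = {mean_p:.3f} ({low_p:.3f} to {high_p:.3f})")
```

Because no distribution is fitted, the estimate can only take values that are multiples of 1/n for an interval with n observations, which is one reason sparse intervals at long distances give wide bounds.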

The results presented here (Table 1) 

clearly show that some of the current 

mandatory separation distances proposed 

by several EU countries for maize 

segregation (Supplementary Table 1) are 

disproportionate. They are set too high relative to the

objective of keeping cross-fertilization below

the legal threshold level in real agricultural 

landscapes. Our results are robust because 

the experimental data set considered 

represents several climatic conditions, 

field sizes and locations in Europe. A 

previous study by Sanvido et al. 5 looking at 

separation distances in Switzerland came 

to similar conclusions. Also, the levels of 




cross-fertilization recorded in our database 

correspond to individual data points in 

receptor fields at several distances. Because 

most of the field points sampled were located 

at short distances from the donor field, cross-fertilization

rates at these distances were

likely to be higher than cross-fertilization 

rates computed for an entire field harvested. 

In an agricultural context, harvest always 

represents a mixture of different harvested 

areas. The actual GM content in the harvest 

is thereby often substantially reduced 

because zones with higher cross-fertilization 

rates at the field margin are mixed with 

zones with lower GM content further within 

the receptor field. Studies performed in real 

agricultural landscapes with commercial 

cultivation of GM and non-GM maize point 

to distances over 20 m as being sufficient to 

keep cross-fertilization below a threshold

level of 0.9% 11,12 . 
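The dilution argument above can be made concrete with a small worked example. The sketch assumes a rectangular receptor field divided into strips parallel to the donor edge, uniform yield per unit area, and entirely hypothetical per-strip rates; none of these numbers come from the database.

```python
def harvest_gm_content(zones):
    """Area-weighted mean rate (%) for strips given as (depth_m, rate_percent).

    Strips are assumed to share the same width, so depth stands in for area.
    """
    total_depth = sum(depth for depth, _ in zones)
    return sum(depth * rate for depth, rate in zones) / total_depth


# Hypothetical 100 m deep receptor field: high rates only at the margin.
zones = [(10, 2.0), (10, 0.8), (30, 0.2), (50, 0.05)]
print(f"Whole-field GM content: {harvest_gm_content(zones):.3f}%")
```

In this made-up case the first strip alone exceeds 0.9%, while the pooled harvest works out to 36.5/100 = 0.365%, which is the effect the paragraph describes.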

In practice, large mandatory distances 

restrict farmers’ freedom of choice to grow 

GM maize in certain agricultural landscapes 

(especially in those with substantial presence 

of maize cultivation in small and scattered 

fields). This imposes important opportunity 

costs on farmers, reducing the potential net 

gains in farmers’ gross margins derived from 

Bt maize cultivation 13 . 

In conclusion, we have shown that a 

separation distance of 40 m is sufficient to 

reduce admixture in maize cultivation below 

the legal threshold of 0.9%. However, this 

is not an endorsement of using separation 

distances as the single tool to regulate coexistence 

in maize production. Numerous 

recent studies have pointed to the need for 

flexibility in co-existence measures 4,14,15 . 

Pollen barriers consisting of non-GM 

maize, for example, have proven to reduce 

cross-fertilization rates more effectively 

than an isolation of the same distance with 

open ground or low-growing crops. With 

a maize barrier of 10–20 m, the remaining 

maize harvest in the field rarely exceeds the 

threshold of 0.9% GM material 11 . Buffer 

zones, discard zones and other measures 

could therefore be combined with, or substituted for,

large, fixed-separation distances in search of 

a system that increases the real options for 

farmers to cultivate their crop of choice 1 . 

Note: Supplementary information is available on the 

Nature Biotechnology website. 

Disclaimer 

The views expressed are purely those of the authors 

and may not in any circumstances be regarded 

as stating an official position of the European 

Commission. 

ACKNOWLEDGMENTS 

The authors thank M. Czarnak-Klos for help in 

the interpretation of the data sets of maize cross-fertilization

trials that constitute the database of this

analysis and J. Delincé for his useful comments on 

statistical simulation. The authors wish to express 

thanks to G. Squire, as coordinator of the gene flow 

and ecological field studies of the SIGMEA project, 

for providing SIGMEA data sets on maize cross-fertilization

trials. Within the SIGMEA partners, many

thanks are extended to R. Wilhelm for providing data 

under German agricultural conditions, A. Vogler for 

Swiss data and J. Messeguer for data from Spain. 


The authors declare no competing financial interests. 

Laura Riesgo 1 , Francisco J Areal 1 , 

Olivier Sanvido 2 & Emilio Rodríguez-Cerezo 1 

1 European Commission, Joint Research Centre 

(JRC), Institute for Prospective Technological 

Studies (IPTS), Edificio Expo, Avda. Inca 

Garcilaso, Seville, Spain. 2 Agroscope Reckenholz 

Tänikon Research Station ART., Zurich, 

Switzerland. 

e-mail: laura.riesgo@ec.europa.eu 

1. Devos, Y., Demont, M. & Sanvido, O. Nat. Biotechnol. 

26, 1223–1225 (2008). 

2. Moschini, G. Eur. Rev. Agric. Econ. 35, 331–355 

(2008). 

3. Ryffel, G.U. Nat. Biotechnol. 28, 318 (2010). 

4. Devos, Y. et al. Agron. Sustain. Dev. 29, 11–30 

(2009). 

5. Sanvido, O. et al. Transgenic Res. 17, 317–335 

(2008). 

6. European Commission. Commission Staff Working 

Document: Report from the Commission to the Council 

and the European Parliament on the Coexistence of 

Genetically Modified Crops with Conventional and 

Organic Farming. Implementation of National Measures 

on the Coexistence of GM crops with Conventional 

and Organic Farming. (Commission of the European 

Communities, Brussels, 2009). 

7. Pla, M. et al. Transgenic Res. 15, 219–228 (2006). 

8. Goggi, A.S. et al. Field Crops Res. 99, 147–157 

(2006). 

9. Vogler, A., Eisenbeiss, H., Aulinger-Leipner, I. & 

Stamp, P. Eur. J. Agron. 31, 99–102 (2009). 

10. Demeke, T., Perry, D.J. & Scowcroft, W.R. Can. J. Plant 

Sci. 86, 1–23 (2006). 

11. Messeguer, J. et al. Plant Biotechnol. J. 4, 633–645 

(2006). 

12. Gustafson, D.I. et al. Crop Sci. 46, 2133–2140 

(2006). 

13. Gómez-Barbero, M., Berbel, J. & Rodríguez-Cerezo, E. 

Nat. Biotechnol. 26, 384–386 (2008). 

14. Demont, M. & Devos, Y. Trends Biotechnol. 26, 353– 

358 (2008). 

15. Messéan, A. et al. Oleagineux 16, 37–51 (2009). 


case study 

commentary 

India’s billion dollar biotech 

Justin Chakma, Hassan Masum, Kumar Perampaladas, Jennifer Heys & Peter A Singer 

By focusing on an unmet medical need, providing a cost-efficient solution and reinvesting the resulting revenues into 

R&D and state-of-the-art manufacturing, Shantha Biotechnics was able to build one of India’s first biotech successes. 


Shantha Biotechnics, an Indian biotech firm started by K. I. Varaprasad 

Reddy with $1.2 million of angel funds, was acquired last year by 

Sanofi-Aventis of Paris for €571 million. Since developing a copy of the 

hepatitis B surface antigen subunit vaccine—one of the first recombinant 

products to be ‘home grown’ in India—Shantha has been on a tear, 

bringing 11 products to market. Much of the company’s success can be 

attributed to the vision of its management, which brought its first product 

to market in only four years, reinvested revenues into internal R&D and 

built a state-of-the-art manufacturing capability. This not only enhanced

the company’s ability to address local health needs, but also built its global 

reputation—all of which has subsequently proved good business. 1 

After attending a conference in 1992, Varaprasad, an electrical engineer 

by training, recognized the urgent need for an inexpensive Indian 

hepatitis B vaccine; over 100,000 Indians die every year from the viral 

infection, with 4% of the population carriers. Prices were as high as $23 

a dose with primary suppliers being Merck and SmithKlineBeecham 

(now part of GlaxoSmithKline). With most Indian families living on 

$1 a day, with multiple children and three doses required per child, 

vaccination was simply unaffordable. Varaprasad saw the possibility of 

a local venture that could supply an affordable version. 

After recruiting local talent and two expatriate scientists in 1993 (see 

Supplementary Tables), the company took only four years to develop 

and register Shanvac-B, a version of the vaccine produced in Pichia pastoris. 

Shanvac-B was launched at $1 a dose and was an immediate success. 

Indian consumption of hepatitis B vaccine rose from a few hundred 

thousand doses in the early 1990s to tens of millions today with prices 

dropping as low as $0.25. 

Rapid uptake of the vaccine was partly helped by a confidential partnership 

with a large pharmaceutical multinational, which provided 

manufacturing/regulatory acumen and also resold the vaccine. Shantha 

followed Shanvac-B with Shanferon (interferon alpha 2b), which it also 

produced in P. pastoris. The company’s development of a purification 

process compliant with International Conference on Harmonization 

regulations led it to become the first Indian company to have a hepatitis 

B vaccine prequalified by the World Health Organization (WHO; 

Geneva). The initial investment in quality control helped accelerate 

approval for its other products. 

The company’s growing reputation for manufacturing excellence and 

regulatory expertise in recombinant vaccines also helped to secure business 

from entities in other developing countries, such as the International 

Vaccine Institute (IVI; South Korea) for low-cost oral cholera vaccine, 

and the Pediatric Dengue Vaccine Initiative (South Korea). 

This success led to international attention in 2006 when Mérieux 

Alliance (Paris, France) acquired a 60% stake in Shantha after its Omani 

investors sought an exit. The acquisition further bolstered Shantha’s 

reputation internationally as well as opening new markets. In 2009, the 

firm was awarded a $340 million United Nations International Children’s

Emergency Fund (UNICEF) contract for pentavalent vaccines from

2010–2012. Soon after, rumors emerged that multinationals were interested

in bidding on Shantha, ultimately culminating in the takeover by

Sanofi-Aventis the same year.

The authors are at the McLaughlin-Rotman Centre for Global Health,

University Health Network and University of Toronto, Toronto,

ON, Canada.

e-mail: peter.singer@mrcglobal.org

The case of Shantha shows that developing-world biotech innovators can

maintain a balance between local health impact and financial returns by 

keeping four principles in mind. First, identify therapeutic areas where 

cost efficiencies can be achieved locally and combine this with strong 

leadership skills. Varaprasad leveraged India’s homegrown scientists, 

lower labor costs, process innovation and a low-margins business strategy 

to exploit this opportunity. 

Second, seek investments/partnerships from non-traditional and 

international sources. Shantha embraced collaborations with research 

institutes such as the US National Institutes of Health (Bethesda, MD), 

and with competing multinationals for regulatory guidance. 

Third, focus on innovation and reinvestment. By plowing back significant 

profits toward R&D, Shantha has recently released new products 

every year or two. This initial focus on process and quality innovation 

may have delayed Shanvac-B’s launch, but it allowed Shantha to become 

the first WHO-prequalified Indian firm for hepatitis B vaccine, and 

opened the door to large international contracts, including contract 

research. However, experience with Shanferon suggested that India’s

regulatory environment posed challenges for conducting complex clinical

trials. Other innovators in developing countries should not insist upon 

home-grown manufacturing or clinical trials if it entails compromise on 

quality for the sake of patriotism. 

Finally, Shantha shows that integrated business models are viable in

developing countries. Pre-acquisition, Shantha would not invest in 

any products for which it did not have internal capacity to execute 

on a significant part of the project. This contrasts with the developed 

world, where it is becoming increasingly popular to develop a 

‘virtual’ business model, whereby clinical trials and even early-stage

work are outsourced to contract research organizations. Shantha shows

that the virtual model may not make sense for an innovative biotech in

a developing country because the risks of low quality and delays 

in outsourcing are too great. By maintaining internal development 

capabilities, Shantha and other developing country firms can also 

capitalize on earnings generated by contract research work for other 

companies. 

By combining cost-efficiency with focused R&D, biotech firms like 

Shantha are creating a new source of innovation for global health. 

Funding 

This work was funded by a grant from the Bill & Melinda Gates 

Foundation through the Grand Challenges in Global Health Initiative. 

Note: Supplementary information is available on the Nature Biotechnology website. 

Competing Financial Interests 

The authors declare competing financial interests: details accompany the full-text 

HTML version of the paper at http://www.nature.com/naturebiotechnology/. 

1. Prahalad, CK. The Fortune at the Bottom of the Pyramid: Eradicating Poverty through 

Profits. (Wharton School Publishing, Philadelphia; 2004). 


commentary 

DNA patents and diagnostics: not a pretty picture

Julia Carbone, E Richard Gold, Bhaven Sampat, Subhashini Chandrasekharan, Lori Knowles, Misha Angrist & 

Robert Cook-Deegan 


Restrictive licensing practices on DNA patents are stymieing clinical access and research on genetic diagnostic testing. 

Diagnostic companies, university tech transfer offices and their respective associations need to pay more attention. 

Three decades after the US Supreme Court first

held that an artificially created bacterium 

had the potential to be patented in the United 

States 1 , biotech patents continue to generate 

controversy—particularly human gene patents 

used in diagnostic testing. The persistence of the 

debate can be attributed to particular business 

models for genetic testing and university licensing 

that, despite public pronouncements to the 

contrary, have failed to acknowledge and appropriately 

address the real social and economic 

concerns raised by clinical geneticists, health 

care professionals, patient groups, politicians 

and academics. Their failure has led both policymakers 

and the courts to express increasing 

concern about broad patent rights over human 

genes that affect diagnostic testing. 

Julia Carbone is at Duke University’s School of

Law, Durham, North Carolina, USA; E. Richard

Gold is at McGill University’s Faculty of Law

and Faculty of Medicine, Montreal, Québec,

Canada; Bhaven Sampat is at Columbia

University’s Department of Health Policy and

Management, New York, NY, USA; Subhashini

Chandrasekharan is at Duke University’s Center

for Genome Ethics, Law & Policy, Institute for

Genome Sciences and Policy, Durham, NC, USA;

Lori Knowles is at the University of Alberta’s

Health Law Institute, Edmonton, Alberta,

Canada; Misha Angrist is at Duke University’s

Institute for Genome Sciences & Policy, Durham,

NC, USA; and Robert Cook-Deegan is at Duke

University’s Center for Genome Ethics, Law &

Policy, Institute for Genome Sciences and Policy,

Durham, NC, USA.

e-mail: Robert Cook-Deegan: bob.cd@duke.edu

Myriad Genetics has been the poster child for controversial DNA patent licensing.

The most recent flare-up in the ongoing

DNA patent and genetic testing debate is

the decision of the US District Court for the

Southern District of New York in Association 

for Molecular Pathology et al. v. United States 

Patent and Trademark Office et al. 2 . On 29 

March, US Federal District Court Judge 

Robert Sweet ruled that isolated DNA is not 

patentable in the United States, and also that 

Myriad Genetics’ (Salt Lake City, UT, USA) 

method claims relevant to testing for BRCA1 

and BRCA2 genes are invalid. Essentially, 

the District Court held that neither isolated 

DNA nor cDNA is sufficiently different from 

DNA as it occurs within host cells to be considered 

an invention. As for the diagnostic 

tests, the court held that they simply involved 

drawing a mental correlation between facts, 

something that does not fall within the scope 

of what is patentable. 

A week earlier, the US Court of Appeals 

for the Federal Circuit held in Ariad 

Pharmaceuticals, Inc. et al. v. Eli Lilly and 

Company 3 that a researcher must do more than 

identify that a class of compounds has a certain 

effect: he or she must actually describe what 

those compounds are. This effectively eliminated 

the award of patents over basic research, 

requiring, instead, that the inventor “actually 

perform the difficult work of ‘invention’—that 

is, conceive of the complete and final invention 

with all its claimed limitations—and disclose 

the fruits of that effort to the public.” 

One month before that, on 10 February, the 

Secretary’s Advisory Committee on Genetics, 

Health and Society (SACGHS; Bethesda, 

MD, USA) at the US Department of Health 

and Human Services 4 , after a careful study 

of current knowledge on the effects of patenting 

genes on research and accessibility to 

genetic tests, found that there is no convincing 

evidence that patents either facilitate or 

accelerate the development and accessibility 

of such tests. What’s more, the committee 




found that there was some, albeit limited, 

evidence that patents had a negative effect 

on clinical research and on the accessibility 

of genetic tests to patients. In addition, 

most gene patents relevant to diagnostics are 

held by universities on the basis of research 

funded by public money. In this context, the 

committee recommended that universities 

be more cautious in patenting and licensing 

human genes, that there be more transparency 

and accountability for university licensing 

practices and that an existing exception 

protecting medical practitioners from patent 

infringement when they undertake surgery or 

treat a patient’s body be extended to include 

the provision of genetic diagnostic testing. 

What all three developments have in common 

is that they reflect growing disenchantment 

with the patenting and licensing practices 

of universities and industry. These concerns 

have existed for over a decade without resolution 

5,6 . The maturity of microarray technology 

that allows for multi-allele genotyping and 

now the prospect of full-genome sequencing 

deepen these concerns 7 . A legacy of exclusively 

licensed gene patents casts a shadow of patent 

infringement liability over the future of multi-allele

testing and full-genome analysis.

In an attempt to better understand why concerns 

about DNA patenting persist and what role 

universities play as patentees and often exclusive 

licensors, this article outlines university technology 

transfer practices and business models that 

have given rise to the concerns. We next review past efforts

to address them. We then lay out

the obstacles to addressing these concerns going 

forward, including a lack of recognition that 

diagnostics is a highly unusual market—and 

that the problem is not so much a legal question,

or necessarily about what gets patented, as about

how patents are licensed and enforced by both

universities and industry. The ability to change

these restrictive licensing practices will, in turn,

depend on several factors: first, a sharper definition 

of what constitutes research that needs 

to be protected in licensing provisions; second, 

more coherent university policies that promote 

broad dissemination, along with incentives for 

industry compliance with best practices; third, 

greater recognition of problems and the proposal 

of constructive solutions by key players; 

fourth, transparent reporting of DNA patents 

and diagnostic testing license agreements; and 

fifth, secure funding for technology transfer 

offices. Although legislative change may ultimately 

be necessary to facilitate these changes 

in practice, many problems can be addressed 

without statutory change. 

A legacy of short-sighted tech transfer and business practices

Currently, universities frequently file patents 

on early-stage inventions 9 , and license patents 

exclusively half the time 10–13 . A study by 

Mowery et al. 10 notes the following: “A relatively 

high fraction of all inventions that are 

licensed—as high as 90% for UC [University 

of California] licenses and no less than 58.8% 

for Stanford licenses of ‘all technologies’ during 

this period—is licensed on a relatively exclusive 

basis, and these shares are similar for biomedical 

inventions.” Many of those licenses will endure 

for many years, including licenses on university 

patents relevant to DNA diagnostics. 

Universities and academic medical centers 

that provide diagnostic testing services face 

private genetic testing companies that enforce 

patents against university genetic testing services 

and national reference laboratories 5 —in 

contrast to the situation for therapeutics, where 

universities are often the plaintiffs. The story 

often begins with publicly funded academic 

or nonprofit research that is either patented 

and licensed exclusively to a private company 

or forms the basis for a spin-off company that 

attracts further investment and develops an 

invention that is patented. Whether exclusive 

licensees or spin-offs, these companies then 

develop genetic testing services based on a 

business model that relies not only on patenting 

sequences and mutations—not objectionable 

in itself—but also on preventing other 

institutions, including universities, from offering

those genetic tests. 

The cases of Myriad’s patents over BRCA1,

BRCA2 and methods for diagnostic testing 

14 , as well as Athena Diagnostics’ exclusive 

licenses for clinical testing from Duke 

University (Durham, NC, USA) over three 

method patents related to diagnostic testing 

for Alzheimer’s disease 15,16 , exemplify these 

practices and business models. 

Furthermore, testing for other neurological and metabolic

conditions, as well as other entities’ screening

for Canavan disease, hemochromatosis and

other single-gene conditions, have also generated

fierce debate. In the case of Canavan testing,

litigation resulted from licensing restrictions 

that inhibited freedom of action among those 

seeking to get genetic tests. 

In the case of Myriad, initial research took 

place at the University of Utah—with public 

funding from the US National Institutes 

of Health (NIH; Bethesda, MD, USA). 

The researchers then spun off Myriad, 

which attracted investment from Eli Lilly 

(Indianapolis, IN, USA) and succeeded in patenting 

BRCA1 and a diagnostic test for breast 

cancer (patents that were ultimately jointly 

assigned to the University of Utah, Myriad and 

the NIH). Rather than licensing out the test to 

clinical geneticists and laboratories around the 

world, Myriad required initial testing in each 

family to be performed at its laboratories in Salt 

Lake City. In the United States, the company 

sent out cease-and-desist letters to laboratories—both 

academic and commercial—already 

performing tests when the patent was issued. 

Threatened patent enforcement resulted in 

a backlash around the world from public laboratories, 

clinicians, molecular geneticists and 

some patient groups—against both the patenting 

of human genes and what they viewed 

as Myriad’s strong-arm tactics. These groups 

feared that by closing down public laboratories, 

Myriad would thwart research identifying 

weaknesses in Myriad’s test or distinguishing 

the effects of different mutations in the genes 

on disease severity or progression, and prevent 

the integration of breast and ovarian cancer 

genetic tests into genetic health services. 

Although some of these fears were clearly 

exaggerated, Myriad’s aggressive initial patent 

enforcement affected practice in the clinical 

genetics community and stirred long-standing 

resentment. Furthermore, in countries with 

public health care systems, health administrators 

objected to Myriad’s business model 

because it removed their ability to deploy 

genetic tests to their citizens in the manner 

that they viewed as most efficient 14 . 

Myriad always permitted what it considered 

to be basic research on BRCA1 and BRCA2, and 

also engaged in research collaborations. In fact, 

until 2004—after which Myriad ceased to do so 

for unknown reasons—the company contributed 

data to public databases. To illustrate Myriad’s 

openness to others performing basic research 

using BRCA1 and BRCA2, the company’s president, 

Greg Critchfield, has identified 7,000 

papers published by independent authors that

mention BRCA1 or BRCA2 (http://docs.justia.com/cases/federal/district-courts/new-york/nysdce/1:2009cv04515/345544/158/0.pdf).

This indicates that, with the exception of clinical 

testing at the University of Pennsylvania 

in 1998, Myriad did not pursue those who 

conducted research. Myriad also defined the 

University of Pennsylvania’s testing as ‘commercial’, 

as later defined under the terms of a 1999 

Memorandum of Understanding with the US 

National Cancer Institute (NCI; Bethesda, MD,

USA). Myriad has been successful in arranging 

for payment agreements with insurers and 

other payers. However, as a result of Myriad’s 

enforcement actions coupled with broad patent 

claims, its fairly narrow conception of what 

constituted acceptable research and its failure 

to clearly state that it would not pursue those 

conducting such research, university and private 

laboratories ceased to offer the test publicly 




in the United States. Outside the United States, 

resistance to Myriad’s model—particularly 

from health care administrators and government 

departments—caused the company to 

lose most of its market. Furthermore, Myriad’s 

relationship with scientists and policymakers 

around the world was seriously damaged 14 . 

Although the biotech industry tried to portray 

Myriad as an outlier, a series of detailed 

case studies conducted by some of us (J.C., 

S.C., M.A. and R.C.-D.) and others 15,18–24 at 

Duke University’s Center for Genome Ethics,

Law & Policy reveal that, in fact, Myriad’s

business model is not unique. As these studies 

show, diagnostic companies such as Athena 

Diagnostics (Worcester, MA) and PGxHealth 

(New Haven, CT) have adopted similar or even 

more aggressive business models and have 

shut out university laboratories from offering 

genetic testing for diseases such as long-QT 

syndrome and Alzheimer’s disease. In the case 

of Alzheimer’s disease, genes and method patents 

for diagnostic testing were initially patented 

by Duke University (and other academic 

institutions) and licensed exclusively to Athena 

Diagnostics. Athena Diagnostics then used its 

patents aggressively to prevent others from carrying 

out the test. 

These case studies strongly suggest both that 

universities are often not managing research 

and patents in a way that promotes dissemination 

and that companies deploy their patents or 

exclusive licenses to remove genetic testing laboratories

at academic health centers and low-margin

national reference laboratories from

the market. This is demonstrably a viable business 

model, or at least it has proven to be until 

recently—but is it good national policy, and 

does it add value to the national health system? 

As clinicians and laboratory directors react to 

cease-and-desist letters by withdrawing from 

those activities, clinical research and genetic 

testing are impeded. GeneDx (Gaithersburg, 

MD) and university laboratories ceased testing 

for the life-threatening long-QT syndrome 

after patent enforcement in 2002, for example, 

but no commercial test entered the market 

until 2004 (ref. 9); neither the University of 

Utah (which held the patents) nor the NIH 

(which could have been petitioned to march 

in, given that ‘health and safety’ needs were 

not being met) took action. Certain tests may 

not be offered if the patent holder or exclusive 

licensee does not provide them; second-opinion 

and verification testing may be unavailable; 

and tests are costly to public and private payers, 

sometimes prohibitively so for those lacking 

insurance 25,26 . Although negative effects on 

price and access to genetic testing are not uniform, 

consistent or pervasive, one cannot read 

the case studies as a whole without realizing 

there are real problems—and also that there are 

relatively easy solutions modeled on nonexclusive 

licensing, as used for Huntington’s disease 

and cystic fibrosis testing. Gene patents over 

diagnostics are not just like all other patents, 

and the diagnostic market is not just like markets 

for therapeutics and instruments. Holders 

of gene patents need to take care in licensing 

them for diagnostic use. 

Hurdles to resolution of concerns 

The past decade saw a plethora of policy 

reports about DNA patents, such as those from 

the Nuffield Council on Bioethics 17 , the US 

National Academy of Sciences 27 , the Ontario 

Ministry of Health 28 and the Australian Law 

Reform Commission 29 . Academic articles 

examined the concerns, the extent to which 

concerns were founded and the roles of 

industry, universities and legislative reform 

in addressing these concerns 5,6,26,30–38 . Some 

countries also made statutory changes to their 

patent and health laws. France expanded compulsory 

licensing laws 39 , and Belgium did the 

same, also carving out a diagnostic-use exemption 

from patent-infringement liability 40 . The 


US Patent and Trademark Office (USPTO; 

Washington, DC) developed guidelines on 

‘utility’ and ‘written description’ specifically 

for examining gene patent applications 41 . 

Recognizing that many of the concerns 

could be addressed through better licensing 

practices, many institutions also developed 

licensing guidelines, some aimed at universities 

and others at industry. These include 

the NIH’s Best Practices for the Licensing of 

Genomic Inventions 42 , the Organisation for 

Economic Cooperation and Development’s 

(OECD; Paris) Guidelines for Licensing of 

Genetic Inventions 43 and In the Public Interest: 

Nine Points to Consider in Licensing University 

Technology 44 , a document crafted by 12 institutions 

and subsequently endorsed by the Board 

of Trustees of the Association of University 

Technology Managers (AUTM; Deerfield, IL, 

USA). Since then, ~50 other institutions and 

organizations have also endorsed the guidelines. 

In November 2009, as part of AUTM’s 

Global Health Initiative to promote licensing 

practices that facilitate access to essential 

medicines in developing countries, AUTM 

also endorsed a document entitled University 

Principles on Global Access to Medicines 45 . 

Most recently, the SACGHS recommended

the implementation of an exception to patent-infringement

liability for research use and

diagnostic testing 4 . All of these reports and recommendations 

focus on broad dissemination 

through nonexclusive licensing of gene-based 

inventions, particularly for publicly funded 

research. They reserve exclusive licensing 

for situations in which it is needed to induce 

investment in private-sector development to 

bring a product or service to fruition—which, 

as will later be discussed, is rarely the case for 

genetic diagnostics. 

Despite the plethora of policy reports, 

academic articles, guidelines and legislative 

changes, concerns about DNA patents persist. 

We must therefore turn our attention to factors 

that impede changing the system. 

A question of law or of practice. The first 

response to concerns is often a call to change 

patent law 39,46,47 . As recent research indicates, 

however, the central problem does not lie with 

patents over human genes themselves so long as 

the law incorporates the appropriate checks and 

balances. The recent suit challenging Myriad’s 

patents on BRCA genes notwithstanding 2 , the 

following discussion indicates that there is little 

evidence on which to conclude that limiting 

the ability to patent genes is the only way to 

solve the problems in the system. 

A recent study by Huys et al. 48 from Belgium 

suggests that relatively few claims in gene patents 

block competing laboratories from providing 

genetic tests. The study examined 145 active patent

documents (267 independent claims) related to

genetic diagnostic testing of 22 inherited diseases

(including method claims, gene claims,

oligo claims and kit claims) that the European

Patent Office (Munich, Germany) and the

USPTO had issued. It concluded that clinicians

could easily get around 36% of claims and 

could, with work, circumvent another 49% of 

claims. Only 15% of claims would be difficult 

or impossible to circumvent. Of the gene claims 

studied, only 3% were found to be blocking. 

However, as discussed below, blocking claims 

were more prevalent among method claims. 

In addition to evidence that gene patents 

covering diagnostics do not necessarily impede 

research, there is very little evidence of patent 

litigation in the field. A recent study 8 on 

trends in human gene patent litigation notes 

that there is rarely any litigation over diagnostic 

tests arising from gene patents. This study 

identified only 31 examples of litigation over 

human genes in the United States from 1987 to 

2008. Although the low frequency of litigation 




could hypothetically support the conclusion 

that patents successfully exclude others (that 

is, threatened patent enforcement stops potentially 

infringing activities), an examination of 

patent claims suggests that most patents over 

human genes and related diagnostic tests find 

themselves in a relatively weak legal position. 

This weak legal position is further reinforced 

by the dissent in Laboratory Corp. of America 

Holdings v. Metabolite Laboratories, Inc. 49 , 

which concluded that a natural correlation 

between two substances in the body was an 

unpatentable product of nature (the majority 

decided not to address the issue); by the United 

States District Court decision in Association for 

Molecular Pathology et al. v. the United States 

Patent and Trademark Office et al.; and by the 

general trajectory of recent decisions on assessing 

damages, the lack of automatic injunctive 

relief (eBay Inc v. MercExchange, L.L.C. 50 ), as 

well as by the increasing ambit for finding an 

invention to be obvious under patent law. The 

recent US Supreme Court decision In re Bilski 51 

only exacerbates the uncertainty over method

claims on DNA diagnostics. In fact, an eventual 

appeal from the District Court decision 

in Association for Molecular Pathology et al. v. 

the United States Patent and Trademark Office 

et al. may be required to determine whether 

these types of claims are valid.

Adding to the trend in legal thinking is the 

Federal Circuit’s decision in Ariad, relating to 

claims based on DNA patents, where the court 

writes: “Much university research relates to 

basic research, including research into scientific 

principles and mechanisms of action…, 

and universities may not have the resources 

or inclination to work out the practical implications 

of all such research [i.e., finding and 

identifying compounds able to affect the mechanism 

discovered]. That is no failure of the law’s 

interpretation, but its intention. Patents are not 

awarded for academic theories, no matter how 

groundbreaking or necessary to the later patentable 

inventions of others.” 

“That research hypotheses do not qualify for

patent protection possibly results in some loss 

of incentive, although Ariad presents no evidence 

of any discernable impact on the pace of 

innovation or the number of patents obtained 

by universities. But claims to research plans 

also impose costs on downstream research, 

discouraging later invention.” Taken together, 

these studies and cases indicate that gene patents 

per se have closed off far less of the research 

landscape than is often supposed, and where 

expansive claims have been granted, many are 

vulnerable to challenge. 

Method claims in patents related to diagnostic 

testing, however, bear special mention. 

Although many pharmaceutical patents claim 

products as chemical entities, universities and 

biotech firms also tend to patent ways of using 

knowledge, including method patents that 

affect genetic tests. In fact, Huys et al. 48 conclude 

that 30% of method claims relating to 

genetic testing are difficult, if not impossible, 

to circumvent. Such claims tend to be broad, 

often to the point of vagueness, and many cover 

all conceivable ways to conduct genetic tests on 

a gene or for a clinical condition. In the 15 of 

22 conditions that Huys et al. 48 found had at 

least one blocking claim, most such claims were 

to methods. In the diagnostic realm, blocking 

patents thus appear to be common, present in 

68% of the clinical conditions studied. Changes 

in jurisprudence could reduce the number of 

truly blocking patents in genetic diagnostics. 

Recent and pending court decisions suggest 

that some fraction of broad claims in US 

patents on DNA sequences and methods pertinent 

to genetic diagnostics would be judged 

invalid if challenged. Although dealing with 

a patent claim in the information technology 

field, the recent US Court of Appeals for the 

Federal Circuit decision in In re Bilski narrowed 

criteria for patents on methods to inventions 

that entail a transformative step or involvement 

of a particular machine. Depending

on how the Federal Circuit deals with the US

Supreme Court’s decision in Bilski (perhaps in an appeal

in the Myriad case), it could signal that broad

method claims in DNA diagnostics might be 

held invalid because the link between a mutation 

and a probability of contracting a disease 

may be considered unpatentable. As it stands, 

many broad method claims pertinent to DNA 

diagnostics suffer under a cloud of uncertainty 

and may turn out to be invalid, thus dramatically 

increasing freedom to operate without fear of 

patent-infringement liability. Other recent US 

court decisions have moved in the same direction, 

increasing the stringency of criteria for 

nonobviousness 52,53 and written description 3 . 

Taken as a group, these decisions suggest 

that some of the potential obstacles to innovation 

that patents cause in diagnostics may not 

be as high, nor the amount of intellectual territory 

enclosed and enforced as expansive, as 

some had feared. A clear research exemption, 

a simplified method for challenging patents 

(for example, opposition proceedings or inter 

partes re-examination requests) and improved 

examination procedures to avoid overly broad 

patent claims could help quell concerns over 

blocked research and overly broad patents 54 . 

Overall, the problem does not lie wholly in 

patent law but rather concerns how decisions 

are made about what is patented (methods 

versus products) and how patents are managed 

and used. With one or a few successful 

challenges to broad patents enforced for 

diagnostic purposes, the business models of 

enforcing monopolies on genetic testing for 

specific conditions would probably give way 

to more cross-licensing, more competition and 

faster innovation in testing methods. 

A need for changes in patent licensing practices 

at universities. As patent law evolves, 

it is increasingly apparent that the exclusive 

licensing strategies of universities and the 

business models of a few companies doing 

DNA diagnostics are as much, or even more, 

of an impediment to DNA diagnostics as any 

problems with the law. Meanwhile, no evidence 

suggests that exclusive licensing is as 

important in the field of diagnostic testing as 

in therapeutics in creating products that would 

not otherwise exist. The exclusive licenses over 

erythropoietin, growth hormone, interferon 

and other therapeutic proteins are of commercial 

significance, as illustrated by the fact 

that eleven legal cases that presume the validity 

of gene patents have been decided by the 

US Court of Appeals for the Federal Circuit 8 . 

The same cannot be said for diagnostic testing: 

no exclusive license in this field has been 

deemed to be of such importance for anyone 

to take to court. In fact, most cases involving 

diagnostic testing are settled after initial notification 

letters or cease and desist letters are sent 

out. A handful have led to litigation, but settled 

early. The Federal District Court’s ruling of 29 

March in Association for Molecular Pathology 

et al. v. United States Patent and Trademark

Office is the first diagnostic case to go before a 

judge for a decision. Furthermore, barriers to 

entering the market with a new genetic test, at 

least for the first-generation genetic tests that 

search for mutations in one or a few genes, are 

far lower than for therapeutics. This is because 

for universities and national reference laboratories 

that already offer other genetic tests, the 

cost of ‘setting up’ a new genetic test based on 

data in scientific publications is comparable to 

the cost of patenting the underlying inventions 

since they are already laboratories approved by 

US regulators. 

Supporting this proposition is the fact that 

exclusive licensing does not appear to have 

been necessary to get a test to market in any of 

the cases 15,18–24 studied for SACGHS. In the 

study of 10 clinical conditions considered by 

SACGHS, three cases did not involve patent 

rights (i.e., there were no patents or patents 

were not licensed or enforced) or patents were 

nonexclusively licensed to multiple providers. 

These were cystic fibrosis, hereditary colorectal 

cancer and Tay-Sachs disease. Such 

patenting and licensing practices comply with 

current guidelines. In six cases, however, exclusive 

licensing led to patent enforcement that

reduced availability of genetic tests already

being offered: HFE (hemochromatosis), APOE,

Alzheimer’s disease and genes associated with

Canavan disease, long-QT syndrome, hearing

loss and spinocerebellar ataxias. Because tests

were already available, exclusive licensing in

these cases deviates from the norms that technology

licensing offices generally claim to

be following. In some cases, but not all, this

led, at least transiently, to genetic testing by

a single provider, and that exclusive license

holder then eliminated other testing services

that had beaten it to market. In all cases except

hemochromatosis, exclusive licenses from universities

were involved. Although the exclusive

licensee may ultimately have developed a better

test, in no case was the exclusive licensee the

first to market. The tenth clinical condition

studied by the SACGHS, hearing impairment,

is subject to a hybrid of exclusive and nonexclusive

licensing, and entails many genes and

different means of testing. This case does have

some examples of controversial patent enforcement

action, but tests are generally widely

available from several vendors.

Patent incentives may induce investment

in genetic diagnostics, but in none of the case

studies did this lead to new availability of a test

that was not already available, at least in part.

This is in stark contrast with the role of patents

in therapeutics and scientific-instrument

development, where the benefits attributable

to private R&D and new products are much

clearer. The SACGHS case studies thus reinforce

the benefits of licensing nonexclusively

for genetic diagnostics, unless an unusual

situation arises in which exclusivity is needed

to get a product to market for the first time.

The cases also highlight deviations from the

NIH Best Practices 43 , OECD Guidelines 43 and

the AUTM-endorsed Nine Points 44 . Exclusive

licensing practices consistently reduce availability,

at least as measured by the number of

available laboratories offering a test, and thus

reduce competition in genetic diagnostics, but

with little evidence of a public benefit from services

not otherwise available.

Instead of recognizing this reality, some

universities continue to seek broad patents

regardless of subject matter and then

license exclusively, enabling business models

that impede competition in genetic testing.

Although the real risk of being successfully

sued for patent infringement in DNA diagnostics

may be low, a 2003 survey 33 and recent

case studies 14,15,18–24 indicate that laboratory

directors change their testing practices and

clinicians avoid research areas in reaction to

cease-and-desist letters. Diagnostics are generally

low-margin sources of revenue, and

when faced with a threat of patent enforcement,

most laboratories simply stop offering

a genetic test, or at least no longer advertise

a test’s availability publicly (in all the case

studies, we learned of ‘research’ testing as an

‘escape valve’ for patients who could not get

or could not afford commercial genetic tests).

Although part of the problem is that licenses

executed over the past decade do not embody

the principles of the NIH, OECD or AUTM

guidelines and yet remain in force, the reality

is that only a minority of universities have

endorsed the consensus Nine Points 44 —with

no repercussions for those who do not or those

who sign and then violate the norms. Shortsighted

licensing practices persist.

Potential solutions

Changes that could remedy problems with

the current strategy of the licensing system

include the following: first, a clear definition

of research that should be exempt from patent-infringement

liability; second, universities’

leadership in promoting the alignment of tech

transfer licensing practices with the universities’

broader goal of dissemination; third, coupling

of the latter with incentives to promote

industry compliance and leadership by AUTM

and the Biotechnology Industry Organization

(BIO; Washington, DC) in recognizing problems

and proposing constructive solutions;

fourth, adequate funding for tech transfer

offices to learn about and implement changing

practices; and finally, greater transparency in

reporting patent holdings and licensing agreement

terms. A more detailed discussion of each

of these follows.

Defining what qualifies as research. Although

most industries tolerate a broad range of

research activities and most researchers

ignore patents when deciding whether to do

research 55 , such blithe ignorance is not an obvious

option in human genetic diagnostics, where

threatened enforcement is common, laboratory

directors and clinicians tend to respond to

threatened enforcement by ceasing the activities

under threat and workarounds in the case

of method patents are not always available 48 .

Norms over what research is to be tolerated

are unsettled, despite the existence of research

exceptions 56 in many national laws (including

an exemption in the United States for research

into products that may eventually lead to the

filing of an application with the US Food and

Drug Administration (Rockville, MD) 57 ).

One prominent example of disputed norms

is the controversy between Myriad and the

University of Pennsylvania Genetic Diagnostic

Laboratory (GDL; Philadelphia, PA). Although

Myriad states that it is generally supportive of

research, it nevertheless sent GDL a cease-and-desist

letter because it did not consider GDL’s

activities to be research. To Myriad, GDL’s 

provision of testing services to researchers was 

commercial, not a research service 14 . GDL took 

the position, however, that its activities, which 

supported others’ research, fell within the norm 

of tolerated research use, and much of the contested 

testing was part of clinical trials funded 

by the NCI, which is clearly clinical research. 

Much debate ensued, leaving many researchers 

with the (wrong) impression that Myriad 

would not tolerate any form of research. 

In an attempt to establish a clear norm 

over the question of which activities should 

be considered ‘research’, Myriad entered into 

a Memorandum of Understanding with the 

NCI to provide at-cost or below-cost testing 

to the NCI and any researcher working under 

an NCI-funded project. Myriad also similarly 

offered to provide NIH researchers with at-cost 

testing, given that the NIH was a co-owner 

of some of the relevant patents. Importantly, 

the agreement with the NCI defined the type 

of research Myriad would tolerate as being 

“part of the grant supported research of an 

Investigator, and not in performance of a technical 

service for the grant supported research 

of another (as a core facility, for example).” 

Furthermore, testing services had to be paid 

for out of grant funds and not by a patient or 

by insurance. Under this definition, GDL was 

not conducting research. This agreement was 

acceptable to both parties (Myriad and the 

NCI), and given the ‘at-cost’ provisions and the 

known efficiency of Myriad in testing, perhaps 

it is a salutary precedent. It is worth noting, 

however, that the NCI did not seek to delegate 

its government use rights under the Bayh-Dole 

Act 35 U.S.C. § 200-212 (“Bayh-Dole Act”) or 

Stevenson-Wydler Act 15 U.S.C. 3701 (which 

pertain because Myriad’s patents include inventors 

covered by both laws). 

The restricted nature of the Myriad-NCI 

Memorandum of Understanding limits its value 

as a precedent. It covered only the provision 

of services by Myriad; it did not address the 

general question of which research practices a 

patent holder should tolerate in the diagnostics 

field. Some of the conflict surrounding patents 

and genetics laboratories could be avoided by 

adopting a clearer definition of ‘research’ for 

the purposes of incorporating licensing terms 

that lower the threat of patent-infringement 

liability. The scope of government use rights 

under the Bayh-Dole and Stevenson-Wydler

Acts is another legal gray zone. In any case, the 

definition of research should not be left to the 

individual negotiation between one company 

and one NIH institute. The NIH could take on 

a key role in developing this norm by convening 

a meeting of interested parties to develop 


the principles by which individual actors can 

determine how to apply the norm. 

University leadership. Implementation of 

licensing guidelines and best practices is 

difficult when interests and goals are not 

aligned. Participants at a workshop held at 

Duke University in April 2009 addressed 

the role of universities in DNA patents and 

diagnostic testing and noted that those at the 

front line of implementing these guidelines, 

tech transfer offices, face many hurdles to 

implementation. Many university administrators 

view patents as a means to secure revenues 

(to subsequently reinvest in research) 

and believe that exclusive licenses generate 

the most revenues. Although the evidence 58 

is quite clear that most tech transfer offices 

either break even or lose money and that 

many of the most lucrative university patents 

have entailed nonexclusive licensing, this view 

persists. Compounding this problem, universities 

expect tech transfer offices to generate 

sufficient revenues to be sustainable. Despite 

usually being unrealistic, such expectations 

can lead these offices toward licensing strategies 

that promote short-term income over 

dissemination and broad availability. 

If there is to be a change of behavior, it 

must come from two sources: first, university 

administrators must align tech transfer strategy 

with the university mission of broad knowledge 

dissemination; and second, universities 

should provide more push-back when threatened 

patent enforcement gets in the way of 

research and impedes the university’s central 

mission. Regarding the first point, university 

presidents and senior management must take 

seriously the university mission to disseminate 

knowledge and technology. They must consider 

technology transfer as one component 

of their strategy to enable the wider world to 

access, enjoy and use university-generated 

knowledge. To achieve change, they need to 

change the way they fund tech transfer offices 

so that the latter have the freedom to explore 

alternatives to the way they currently license 

out technology. They also need to develop clear 

goals for dissemination and ensure that they 

impose measures of success for their technology 

licensing offices that correspond to those 

goals. Expecting technology licensing officers 

to forgo exclusive licenses when companies 

seek them is unrealistic unless the officers 

are rewarded for decisions that acknowledge 

the broad social benefit of avoiding patent 

thickets in genetic diagnostics. Recognition 

must also be given to the fact that these offices 

do not negotiate licenses in a vacuum: they 

negotiate largely with industry partners. If 

diagnostic companies are unwilling to accept 

nonexclusive licenses, broad research exemptions 

or other terms that universities propose 

to support research, tech transfer offices have 

little room to maneuver. Currently, there is no 

incentive—whether external or through the 

threatened use of government march-in rights 

under the Bayh-Dole Act—to curb industry 

behavior even when it is problematic. Tech 

transfer departments with limited funding, 

limited staff and unreasonable expectations 

to be sustainable cannot be expected to resist 

intransigence by licensees. 

Universities need also to take a lead in 

encouraging their researchers, clinicians and 

laboratory directors to push back when threatened 

with patent enforcement. University 

administrators need to educate themselves 

and their staff about the freedom to operate for 

purposes of research and improving diagnostic 

testing—that is, the scope of activities allowed 

that do not infringe on a valid patent. University 

administrators, researchers, clinicians and 

laboratory directors can act together by sharing 

cease and desist letters or other patent 

enforcement actions to determine whether the 

activities are, in fact, infringing. They can share 

expertise about the validity of patent claims that 

threaten research or clinical testing. Although 

individual laboratories may lack the resources 

to conduct these analyses, other institutions 

may have the requisite resources (for example, 

the American Society of Human Genetics, the 

American College of Medical Genetics, the 

College of American Pathologists and academic 

units such as the science policy research units at 

the University of Sussex in Brighton, UK, and 

the University of Leuven, Belgium). 

Leadership from AUTM and BIO. The 

development of a ‘gene patent supermarket’ 

by Denver firm MPEG-LA is a promising 

step toward enabling nonexclusive licensing, 

increasing simplicity and consistency in licensing 

terms, and reducing transaction costs 59 . 

Unfortunately, instead of proposing such 

constructive solutions, BIO and AUTM have 

chosen not to acknowledge the real problems 

that exist in the unusual market for genetic 

diagnostics and have been quick and vociferous 

in their opposition to the recommendations 

of the SACGHS 60,61 . It is impossible to 

judge the full extent of the problems, but it is 

certainly poor policy to deny that they exist at 

all. Moreover, BIO and AUTM have expended 

time and resources opposing SACGHS recommendations 

while failing to enforce the 

established norms laid out by the NIH and the 

OECD, as well as the AUTM-endorsed Nine 

Points, among their respective constituencies. 

Companies and universities that violate those 

norms have faced no action, or even recognition 

that they have deviated. Indeed, there has 

been no public statement from either BIO or 

AUTM that members have been responsible for 

some of the problems uncovered in licensing 

practices for genetic diagnostics. It is reasonable 

to disagree with the SACGHS recommendations, 

but it is not reasonable to read the 

SACGHS report and the case studies prepared 

for it and conclude that the system is working 

well across the board. BIO and AUTM should 

recognize the very real problems that have been 

uncovered, exhort compliance with established 

norms and—even more importantly if such 

norms are to be meaningful—criticize deviations 

from them, rather than following the 

politically expedient tactic of focusing their 

fire on SACGHS recommendations intended 

to prevent these problems. 

The two most controversial SACGHS recommendations 

are, first, a proposed exemption 

from infringement liability for research use, 

and second, a similar exemption for diagnostic 

use. As previously noted, university licensing 

offices opposing a research exemption puts 

them at odds with their own stated principles, 

as licensing to ensure freedom to do research 

appears in every document proposing norms 

for licensing. Opposition to a diagnostic-use 

exemption is more understandable because 

it may be that there are unusual situations in 

which exclusivity is needed to get a product or 

service to market, and such situations simply 

have not been captured in the cases studied to 

date. Nevertheless, it is quite clear that in many 

if not most cases of genetic diagnostics, the 

main use of exclusive licenses from universities 

has been to reduce competition and reduce 

the number of laboratories offering tests, without 

apparent benefits of introducing tests that 

were not already available. Rather, tests would 

demonstrably have been available even without 

the participation of the companies involved. 

The SACGHS may have judged that tech 

transfer offices are failing to respect existing 

norms, and in the absence of any credible compliance 

measures, the simplest legal solution 

is to address the problem through exemption 

from infringement liability. If AUTM and 

BIO want to preserve the option of exclusive 

licensing when needed to get genetic tests to 

market, then compliance with guidelines needs 

to be credible. Criticizing deviations when 


they come to light, with the long-term goal 

of increasing compliance with stated norms, 

would go a long way toward reducing the need 

for a diagnostic-use exemption. Moreover, 

enforcing nonexclusive licensing norms can 

preserve revenue streams, as seen in the cystic 

fibrosis and Huntington’s models, whereas 

a diagnostic-use exemption would eliminate 

those revenues because the patents would be 

unenforceable for diagnostic uses. 

One could object that it is neither the function 

nor the responsibility of either BIO or 

AUTM to criticize their members. BIO is an 

industry lobby group that sees itself as “the 

champion of biotechnology and the advocate 

for its member organizations,” whereas AUTM 

is an association of individuals working in tech 

transfer that seeks “to support and advance academic 

technology transfer globally.” Developing 

and enforcing patenting and licensing policies 

fall within neither mandate. This argument is, 

however, disingenuous, given that both AUTM 

and BIO claim to be working to ensure that 

tech transfer serves the public good. It is just as 

important to reduce practices that fall short as 

to promote practices that achieve the goals of 

their respective constituencies. Both organizations 

have endorsed the Nine Points guidelines 

and actively promote technology transfer “in 

a manner that is beneficial to the public interest” 

(http://bio.org/ip/techtransfer/) while 

“improving quality of life, building social and 

economic well-being, and enhancing research 

programs” (http://betterworldproject.org/ 

tech_transfer.cfm). Having voluntarily taken 

these positions, both organizations should be 

held accountable for them. 

Increasing transparency to permit ‘system 

learning’. To promote change, university-industry

relationships need to be more transparent; 

indeed, the current opaqueness over 

existing university-industry interactions is 

a major hurdle to improving the intellectual 

property system for DNA diagnostics 11 . For 

example, license agreements between universities 

and start-up and private companies are 

unavailable, even in general terms. The only 

exceptions are universities or companies that 

voluntarily make such information public. 

Participants at the workshop on the role of 

universities in DNA patents and diagnostic testing 

held at Duke in April 2009 noted that most 

licensing information is not publicly available, 

even for inventions arising from public funding. 

In some cases, but only some, it is possible 

to reconstruct licensing terms from company 

annual reports or from press announcements. 

There is often no way for researchers and institutions 

to know what practices a license covers, 

whether there remains scope for others to 

practice an invention, which regions it covers 

and whether it applies to any specific fields of 

use or contains special restrictions. The lack of 

information makes it difficult to substantiate 

claims that licensing practices are changing or 

comply with best practices. As a study 11 on university 

licensing practices notes, simply stating 

whether a license is exclusive or nonexclusive 

misses important nuances. Not only would 

more transparency help researchers better 

understand the scope and ownership of intellectual 

property rights, it would also allow policymakers, 

academics and tech transfer offices 

to determine in what cases exclusive licensing 

is justified, as opposed to enforcing a blanket 

norm of nonexclusive licensing. 

Although under provisions 62 of the Bayh- 

Dole Act, all recipients of federal grants must 

report on activities involving the disposition of 

certain intellectual property rights that result 

from federally funded research, the information 

is incomplete and cannot be obtained 

because of strictures on access to the data. A 

clause of the legislation was intended to protect 

proprietary data from public access through the 

Freedom of Information Act (35 U.S.C. § 202(c)(5)).

The way the implementing regulations

were written, however, went well beyond this, 

and gave licensees veto power over nongovernment 

disclosure of information. Tech transfer 

offices file reports with the interagency Edison 

(iEdison) database when they license inventions 

supported by most government funders. 

The reporting requirements do not require the 

disclosure of the licensing terms, and what is 

reported to iEdison is not publicly available. 

Indeed, access to iEdison is highly restricted; 

the database is unavailable for study or use outside 

government, and even government officials 

wanting to study technology transfer have 

been denied access unless they get permission 

from all licensees, a nearly impossible hurdle 

to overcome. 

Making licensing terms of publicly funded 

inventions more transparent would require 

a rewrite of the implementing regulations to 

change interpretation of the Bayh-Dole Act’s 

confidentiality clause. The confidentiality provision 

in the Bayh-Dole Act was intended to 

protect agencies from being forced to disclose 

proprietary data, but its implementing regulation 

is so broad that, in effect, it restricts the 

government’s ability to use data without permission 

of the relevant licensee. Current nondisclosure 

practices lead to data being unavailable for 

research aimed at improving knowledge about 

patenting and licensing practices. Many studies 

could be undertaken on aggregated reported 

data, and there are many precedents for using 

census data, health statistics and other very 

private information in government databases. 

The original rationale for the Bayh-Dole Act 

was that government-owned inventions were 

languishing for want of effective patent incentives 

to grantees and contractors; the current 

problem is that data on patenting and licensing 

practices are languishing in a government database 

that is not mined for valuable insights. 

On the industry side, there is a somewhat 

higher standard for disclosure by public companies 

to protect shareholders. As of 2003, the 

Securities and Exchange Commission (SEC) 

requires disclosure of material agreements, 

including license agreements, as part of SEC 

filings. Section 401(a) of the Sarbanes-Oxley 

Act of 2002 (Public Company Accounting 

Reform and Investor Protection Act of 2002, 

Pub. L. No. 107-204, 116 Stat. 745) requires the 

SEC to adopt rules to require each annual and 

quarterly financial report filed with the commission 

to disclose “all material off-balance 

sheet transactions, arrangements, obligations 

(including contingent obligations), and other 

relationships of the issuer with unconsolidated 

entities or other persons, that may have a material 

current or future effect on financial condition, 

changes in financial condition, results 

of operations, liquidity, capital expenditures, 

capital resources, or significant components 

of revenues or expenses.” In many cases, however, 

these disclosures are of little assistance in 

understanding the licensing landscape. The 

reporting pertains only when a license underpins 

a genetic test that is a large enough portion 

of a publicly traded company’s business that it 

needs to be disclosed to investors. Even then, 

which patents have been licensed under what 

terms may be disclosed vaguely. Many biotech 

start-up companies are not publicly traded and 

are not subject to SEC disclosure requirements. 

By the time a biotech company goes public, its 

prospectus may contain some, but only limited, 

information about licensing agreements. In the 

usual case of a public company acquiring technology 

by buying another company, disclosure 

of the original license may not be required. 

Universities argue that if they are forced to 

disclose the terms of prior licensing agreements, 

it will undermine their negotiating position 

with new potential licensees. If, however, 

public companies must disclose the contents of 

their license agreements to protect the interests 

of those funding them (namely, shareholders) as 


a matter of public policy, then it is not clear why 

a university should not be required to disclose 

the contents of its license agreements to protect 

those who fund it (namely, the public). The 

question of human resources needed to ensure 

transparency is very real and needs to be taken 

into account, but the principle of public disclosure 

should be entrenched within public institutions, 

particularly when the licensed inventions 

arise from publicly funded research and when 

data are being collected and reported already. 

Government and nonprofit research dollars 

should come with public accountability. 

Secure funding of tech transfer offices. As 

noted above, some tech transfer offices are 

expected to be self-sustaining and suffer from 

a serious lack of resources. This situation has 

several consequences. First, the agreements 

that these offices pursue will not necessarily 

aim to promote dissemination but instead will 

focus first on securing revenues. Second, tech 

transfer offices lack resources to train managers 

on implementing guidelines and the particular 

challenges that different technologies raise. The 

DNA diagnostic market is complex and rapidly 

evolving. For example, technology licensing 

officers need to know that the development of 

genetic testing after the discovery of the gene 

requires far less investment than the development 

of therapeutics, suggesting that exclusive 

licenses are usually not as necessary 11 . Without 

a more nuanced and informed understanding 

of how optimal patenting, dissemination and 

licensing decisions vary across different types 

of technologies and uses, these offices cannot 

fulfill their mandate: transferring technology. 

Conclusions 

To address the ongoing failure to achieve the 

goals of the multiple guidelines, policies and 

even legislation aimed at ensuring continued 

research on and access to clinical genetic tests, 

practices within universities and their industry 

partners must conform to existing guidelines. 

Although some changes to patent law—such as 

clearer research exemptions and an opposition 

proceeding—could be of use, fundamentally the 

problem is one of strategy about what to patent 

(products versus methods), how broadly to 

make claims to early-stage gene-based inventions 

and how to deploy those patents (broadly 

versus exclusively). Patents will be properly 

deployed only when university constituencies 

unite in promoting broad dissemination, when 

technology transfer offices are given the necessary 

financial support and incentives and when 

universities and industry have transparent and 

publicly accountable practices for licensing of 

DNA diagnostic technologies. Industry groups 

such as BIO and university technology transfer 

organizations such as AUTM have a crucial and 

constructive role to play in resolving this predicament. 

Progress toward addressing the problems 

in genetic diagnostics can begin with less caustic 

and unhelpful rhetoric and more focus on 

engagement with their constituencies on seriously 

implementing guidelines, as well as with 

federal advisory bodies such as the SACGHS. 

By acknowledging and engaging with the distinctive 

problems that patenting and licensing 

practices raise for DNA diagnostics, both the 

universities licensing out technology and the 

companies licensing it in can bring about real 

improvement without the need for legislation. 



1. Diamond v. Chakrabarty, 447 U.S. 303 (1980). 

2. Association for Molecular Pathology et al. v. United 

States Patent and Trademark Office et al. (USDC SDNY 

09 Civ. 4515, 2010). 

3. Ariad Pharmaceuticals, Inc. v. Eli Lilly and Co., 560

F3d 1366 (Fed Cir 2009). 

4. Secretary’s Advisory Committee on Genetics, Health,

and Society, National Institutes of Health. Report 

on Gene Patents and Licensing Practices and Their 

Impact on Patient Access to Genetic Tests (SACGHS, 

Washington, DC, 2010). 

5. Merz, J.F. Clin. Chem. 45, 324–330 (1999). 

6. Heller, M.A. & Eisenberg, R.A. Science 280, 698–701 

(1998). 

7. Chandrasekharan, S. & Cook-Deegan, R. Genome Med. 

1, 92 (2009). 

8. Holman, C.M. Science 322, 198–199 (2008). 

9. Nelson, R. J. Technol. Transf. 26, 13–19 (2001). 

10. Mowery, D.C. et al. Res. Policy 30, 99–119 (2001). 

11. Pressman, L. et al. Nat. Biotechnol. 24, 31–39 

(2006). 

12. Schissel, A., Merz, J.F. & Cho, M.K. Nature 402, 118 

(1999). 

13. Henry, M.R., Cho, M.K., Weaver, M.A. & Merz, J.F. 

Science 297, 1279 (2002). 

14. Gold, E.R. & Carbone, J. Genet. Med. 12 Suppl, S39– 

S70 (2010). 

15. Skeehan, K., Heaney, C. & Cook-Deegan, R. Genet. 

Med. 12 Suppl, S71–S82 (2010). 

16. Merz, J.F. in The Penn Center Guide to Bioethics (eds. 

Ravitsky, F., Feister, A. & Caplan, A.L.) 383–385 

(Springer, New York, 2009). 

17. Nuffield Council on Bioethics. The Ethics of Patenting 

DNA (Nuffield Council on Bioethics, London, 2002). 

18. Cook-Deegan, R. et al. Genet. Med. 12 Suppl, S15– 

S38 (2010). 

19. Angrist, M., Chandrasekharan, S., Heaney, C. & 

Cook-Deegan, R. Genet. Med. 12 Suppl, S111–S154 

(2010). 

20. Chandrasekharan, S. & Fiffer, M. Genet. Med. 12 

Suppl, S171–S193 (2010). 

21. Chandrasekharan, S., Heaney, C., James, T., Conover, 

C. & Cook-Deegan, R. Genet. Med. 12 Suppl, 

S194–S211 (2010). 

22. Chandrasekharan, S., Pitlick, E., Heaney, C. & Cook- 

Deegan, R. Genet. Med. 12 Suppl, S155–S170 

(2010). 

23. Colaianni, A., Chandrasekharan, S. & Cook-Deegan, R. 

Genet. Med. 12 Suppl, S5–S14 (2010). 

24. Powell, A., Chandrasekharan, S. & Cook-Deegan, R. 

Genet. Med. 12 Suppl, S83–S110 (2010). 

25. Cook-Deegan, R., Chandrasekharan, S. & Angrist, M. 

Nature 458, 405–406 (2009). 

26. Caulfield, T., Cook-Deegan, R.M., Kieff, F.S. & Walsh, 

J.P. Nat. Biotechnol. 24, 1091–1094 (2006). 

27. National Research Council. Reaping the Benefits 

of Genomic and Proteomic Research: Intellectual 

Property Rights, Innovation and Public Health 

(National Research Council, Washington, DC, 

2006). 

28. Ontario Report to the Provinces and Territories. 

Genetics, Testing and Gene Patenting: Charting 

New Territory in Healthcare (Government of Ontario, 

Toronto, Ontario, Canada, 2002). 

29. Australian Law Reform Commission. Essentially 

Yours: The Protection of Human Genetic Information 

in Australia (ALRC 96) (ALRC, Sydney, New South 

Wales, Australia, 2003). 

30. Gold, E.R., Bubela, T., Miller, F.A., Nicol, D. & Piper, 

T. Nat. Biotechnol. 25, 388–389 (2007). 

31. Gold, E.R. Nat. Biotechnol. 18, 1319–1320 (2000). 

32. Nicol, D. & Nielsen, J. Patents and Medical 

Biotechnology: An Empirical Analysis of Issues 

Facing the Australian Industry (Occasional Paper no. 

6) (Centre for Law & Genetics, Sandy Bay, Tasmania, 

Australia, 2003). 

33. Cho, M.K., Illangasekare, S., Weaver, M.A., Leonard, 

D.G.B. & Merz, J.F. J. Mol. Diagn. 5, 3–8 (2003). 

34. Rai, A. Northwest. Univ. Law Rev. 94, 77–152 

(1999). 

35. Merz, J.F., Kriss, A.G., Leonard, D.G. & Cho, M.K. 


36. Merz, J.F., Cho, M.K., Robertson, M.J. & Leonard, D.G. 

Mol. Diagn. 2, 299–304 (1997). 

37. Merz, J.F. & Cho, M.K. Camb. Q. Healthc. Ethics 7, 

425–428 (1998). 

38. Andrews, L.B. Nat. Rev. Genet. 3, 803–808 (2002). 

39. LOI no 613–16 as amended in 2004. 

40. Overwalle, G.V. Int. Rev. Intellect. Property Competition 

Law 889, 908–918 (2006). 

41. Fed. Reg. 66, 1092–1099 (2001). 

42. Fed. Reg. 70, 18413–18415 (2005). 

43. Organisation for Economic Co-operation and 

Development. Guidelines for the Licensing of Genetic 

Inventions (OECD, Paris, 2006). 

44. In the Public Interest: Nine Points to Consider in 

Licensing University Technology (AUTM, Deerfield, 

Illinois, USA, 2007). 

45. Association of University Technology Managers. 

University Principles on Global Access to Medicines 

(AUTM, Deerfield, Illinois, USA, 2009). 

46. Rimmer, M. Eur. Intellectual Prop. Rev. 25, 20–33 

(2003). 

47. American Medical Association. Report 9 of the Council 

on Scientific Affairs (AMA, Chicago, 2000). 

48. Huys, I., Berthels, N., Matthijs, G. & Van Overwalle, G. 


49. Laboratory Corporation of America Holdings, dba 

LabCorp v. Metabolite Laboratories, Inc. et al., 548 

U.S. 124 (2006). 

50. eBay Inc. v. MercExchange, LLC, 547 U.S. 388 

(2006). 

51. Bilski v. Kappos, 561 U.S. ____ (2010) (No. 08–964), 

affirming 545 F.3d 943 (Fed. Cir. 2008). 

52. In re Kubin (Fed Cir. 2009). 

53. KSR International Co. v. Teleflex, Inc., 550 U.S. 398 

(2007). 

54. Van Overwalle, G., van Zimmeren, E., Verbeure, B. & 

Matthijs, G. Nat. Rev. Genet. 7, 143–148 (2006). 

55. Walsh, J.P., Ashish, A. & Cohen, W. in Effects Of 

Research Tool Patents And Licensing On Biomedical 

Innovation (eds. Cohen, W. & Merrill, S.) 285–336 

(National Academies Press, Washington, DC, 2003). 

56. Gold, E.R. et al. The Research or Experimental 

Use Exception: A Comparative Analysis (Centre for 

Intellectual Property Policy/Health Law Institute, 

Montreal, Quebec, Canada, 2005). 

57. Merck KGaA v. Integra Lifesciences I, Ltd., 545 U.S. 

193 (2005). 

58. Siegel, D.S. & Wright, M. Oxford Rev. Econ. Policy 23, 

529–540 (2007). 

59. http://www.mpegla.com/Lists/MPEG%20LA%20 

News%20List/Attachments/230/n-10–04–08.pdf, 

Last Accessed May 4, 2010. 

60. (5 February 2010). 

61. http://bio.org/ip/genepat/documents/SACGHSsignonletter2–4-2010final_000.pdf 

62. Bayh-Dole Act, 37 C.F.R. Part 401. 


FEATURE 

Public biotech 2009—the numbers 

Brady Huggett, John Hodgson & Riku Lähteenmäki 

The public biotech sector sustained more losses in 2009, but the year ended on a positive note, and the industry has 

regained its footing. 


That whooshing sound at the end of 2009 

was the biotech sector letting out its collective 

breath. The year began as a hard slog, 

so when it came to a close on an upward 

swing, the industry rightfully felt a measure 

of relief. That’s not to say there weren’t casualties: 

a distressingly large number of companies 

departed the scene last year. But it was not as 

bad as some pundits had estimated, and the 

industry proved itself to be strong and creative. 

It was helped by a recovering economy in the 

second half of the year. Overall, counting the 

vast financial potential of collaborations, the 

industry recorded one of its best years for 

fundraising. That has left the sector brightly 

looking ahead again—a far cry from how 

things appeared at the end of 2008. 

Economic woes 

The 2009 data from Nature Biotechnology’s 

annual survey of public biotech firms, which 

now number 461 (owing to a change in 

our data-gathering process; see Box 1 and 

Supplementary Table 1), show little trace of 

how terribly the year began or how tightly 

the public markets had been hammered shut 

at the end of 2008. The reality is that 2009 

started bleakly for biotech, and it continued 

that way for most of the first quarter. 

Of course, not just biotech suffered—the 

recession affected all countries and sectors. 

Along with the other indices, shares on the 

Nasdaq Biotechnology Index bottomed out 

on 9 March, resting at 59.05, a low it had not 

seen since May 2003. The global economy continued to shed jobs last year: the US Central Intelligence Agency estimates unemployment numbers increased around the world, sometimes drastically. Ireland's unemployment nearly doubled to 12%, whereas the US went from 5.8% in 2008 to 9.3%. 

Data retrieval for this article was by Ernst & Young (Boston) with additional reporting by Riku Lähteenmäki. Brady Huggett is business editor at Nature Biotechnology, John Hodgson is editor-at-large at Nature Biotechnology, and Riku Lähteenmäki is a freelance writer in Turku, Finland. 

Box 1 The numbers 

Nature Biotechnology has published an annual report on public biotech companies since 1996. As the industry has grown and changed, so have our definition of what constitutes a biotech company and our methods for gathering the information that serves as the backbone to this piece. We generally include companies built upon applying biological organisms, systems or processes, or the provision of specialist services to facilitate the understanding thereof. We exclude pharmaceutical companies, medical-device firms and contract research organizations to better focus on the unique attributes and situations that make up the biotech sector. 

This year’s data was provided by Ernst & Young, which has broadened the report’s reach into international exchanges and increased our total number of companies. Additional reporting was done via individual financial reports. The top-ten lists and other aggregate lists are sourced appropriately, with most data supplied by BioCentury. As investors do not stratify the biotech sector as stringently as Nature Biotechnology, we used money figures from across the biotech and biopharmaceutical arena to best highlight trends. In some cases, full-year data were not available and fourth-quarter numbers were extrapolated; this is noted in the company-by-company data table (Supplementary Table 1). Companies delisted in 2009 from major exchanges were excluded. 

So although biotech wasn’t alone in the 

dark, as an industry made up mainly of small 

companies devoid of revenue—and thus 

more dependent on raising public funds— 

the sector was hit particularly hard. The fear, 

expressed by pundits, the Biotechnology 

Industry Organization (Washington, DC) 

and even biotech executives themselves, was 

that the industry would lose up to 25% of its 

companies to bankruptcy. 

But the Nasdaq Biotech Index steadily 

recovered from that March low and closed 

2009 at 81.83. Overall funding for the sector 

jumped in the second half, and although the 

National Bureau of Economic Research has 

yet to officially declare the end of the recession 

in the United States, consensus pegs it 

around the second quarter of 2009. 

Catastrophic shrinkage in the sector has 

not happened. There were losses (Table 1), 

but they were not as far-reaching as feared. 

And among all this detritus, a surprise: the 

biotech sector was again profitable in 2009. 

The money trail 

Financing levels for biotech are a useful 

gauge of the sector’s overall health, because 

without repeated investment, the industry 

shrivels. In this regard, 2009 turned out better 

than expected. The third quarter saw 

the first month of positive growth in the 

US economy since the recession started in 

December 2007, and as the economy recovered, 

money again began moving. By year’s 

end, overall biotech financing was up 84% 

from the depressed figures seen in 2008. 
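The 84% figure can be checked against the category values printed in Figure 1. The short Python sketch below is illustrative only: it assumes the six data labels for each year are listed in the same order as the figure legend (IPO, follow-on, PIPEs, venture capital, debt and other, partnerships), and the variable and function names are ours, not the survey's.

```python
# Figure 1 data labels in $ billions. Assumption: the values for each year
# appear in legend order (IPO, follow-on, PIPEs, venture capital,
# debt and other, partnerships).
FINANCING = {
    2008: {"IPO": 0.134, "Follow-on": 1.867, "PIPEs": 3.143,
           "Venture capital": 5.177, "Debt and other": 3.232,
           "Partnerships": 20.023},
    2009: {"IPO": 0.928, "Follow-on": 6.041, "PIPEs": 2.277,
           "Venture capital": 5.198, "Debt and other": 10.335,
           "Partnerships": 36.923},
}


def total(year):
    """Sum every financing category for one year, in $ billions."""
    return sum(FINANCING[year].values())


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


total_2008 = total(2008)
total_2009 = total(2009)
growth = pct_change(total_2008, total_2009)

print(f"2008 total: ${total_2008:.1f} billion")
print(f"2009 total: ${total_2009:.1f} billion")
print(f"Change: {growth:.0f}%")
```

Under that reading of the figure, the totals come to roughly $33.6 billion and $61.7 billion, which is consistent with the rounded $33 billion and $62 billion quoted in the Figure 1 caption.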

In 2008, as first the United States and then 

the world slid into recession, overall funding 

was at its lowest since at least 2002 (Fig. 1), 

with debt financings, private investments in 

a public entity (PIPEs), follow-on offerings 


and initial public offerings (IPOs) all declining substantially from previous years. Only venture capital remained aloft, although venture capitalists were more inclined to put money into companies previously invested in, rather than new ventures. 

This pattern reversed last year. Debt financings, venture capital and money raised in follow-ons and IPOs all increased, almost achieving the level seen in 2007, before the markets tanked. Only one category went backward, PIPEs, which was to be expected, as once the general markets (and individual stock prices) improved, the need for private investment faded. 

Table 1 Casualties in 2009 

Company | Reason for status change 
Alpha Innotech | Acquired by Cell Biosciences 
Altus Pharmaceuticals | Bankruptcy 
Arthrokinetics | Delisted 
Autoimmune | Inactive 
Avalon Pharmaceuticals | Acquired by Clinical Data 
Avigen | Acquired by Medicinova 
Biopure Corporation | Bankruptcy 
BioXell | Acquired by Cosmo 
CelSis | Acquired by JM Hambro 
Cellegy | Merged with Adamis Pharmaceuticals 
Cell Genesys | Acquired by BioSante 
Cobra | Merged with Recipharm 
Curagen | Acquired by CellDex 
Curalogic | Bankruptcy 
CV Therapeutics | Acquired by Gilead 
EPIX Pharmaceuticals | Liquidated 
Evolutec | Transformed into investment company 
Genaera | Dissolved 
Genentech | Acquired by Roche 
Hemacare | Inactive 
Hemagen Diagnostics | Inactive 
IDM Pharma | Acquired by Takeda 
Introgen | Bankruptcy 
Isologen | Bankruptcy 
Intercytex | Delisted 
Liponex | Merged with ImaSight 
Medarex | Acquired by BMS 
Metabasis Therapeutics | Acquired by Ligand Pharmaceuticals 
Monogram | Acquired by LabCorp 
Napo Pharma | Inactive 
Nastech | Changed name to MDRNA 
Neos | Inactive 
Neurogen | Inactive 
Northfield Laboratories | Inactive 
Nucryst | Inactive 
Nuvelo | Merged with Arca 
Nventa Biopharmaceuticals | Inactive 
Phynova | Delisted 
Replidyne | Merged with Cardiovascular 
Targanta | Acquired by The Medicines Company 
ViRexx Medical | Acquired by Paladin 
XLT Biopharmaceuticals | Delisted 

The largest follow-on offering of the year 

($640 million) was conducted by Qiagen 

(Venlo, The Netherlands), a profitable provider 

of sample and assay technologies 

(Table 2). It had the best year of its existence 

in 2009, with overall revenues above $1 billion, 

and is the type of stable company that 

can easily reach into the secondary-offering 

market. The sexier story is Human Genome 

Sciences (HGS, Rockville, Maryland, USA), 

which raised about $850 million in two follow-on 

offerings. As its stock price rocketed 

after positive pivotal trial results for the lupus 

drug Benlysta (belimumab), it tapped the 

public markets in late July for more than $373 

million and again in December for about $477 

million. The company’s stock, which opened 

the year at $2.12, ended it at $30.58. 

This is a similar story to Dendreon’s 

(Seattle), which in April reported positive 

phase 3 results for its prostate cancer vaccine 

Provenge (sipuleucel-T), sending its 

stock up more than 100% on the day the 

results were announced. This set the stage for 

a $230-million public offering in May, followed 

by a $427-million offering in December. Provenge has 

now been approved, the company has priced 

the drug aggressively, and Dendreon’s stock, 

at the time of publication, sat just above $34; 

it began 2009 at $4.59. 

Whereas many of biotech’s established 

companies completed debt deals last year, 

returning that funding category to levels 

seen before a well-below-average 2008, it 

was hardly a year worth mentioning for IPOs 

(Table 3). Just ten occurred in 2009, none 

before August, and none could be considered 

a typical biotech IPO, either in the type of 

company or the amount of money raised. 

For instance, the JSC Human Stem Cell 

Institute (with sites in Russia, Germany and the 

Ukraine) raised a mere $4.8 million. The institute 

doesn’t look much like the usual biotech enterprise 

preparing to go public: it has a research 

laboratory and a center for storage of cellular 

materials, and it publishes the journal Cellular 

Transplantation and Tissue Engineering. 

What’s more, an IPO is no longer the cash 

windfall and viable exit for investors it once 

was. Consider D-Pharm (Rehovot, Israel), 

which raised about $7.4 million on the Tel 

Aviv Stock Exchange to fund clinical testing 

of its small-molecule stroke drug, DP-b99, a 

membrane-active derivative of the calcium 

chelator 1,2-bis-(2-aminophenoxy)ethane- 

N,N,N′,N′-tetraacetic acid (BAPTA). 

Alongside the IPO, the company also completed 

a rights offering (which gives existing 

shareholders the right to buy shares during a 

defined period, usually at a discount), raising 

NIS 57 million ($14.8 million). The existing 

investors didn’t exit—they instead had the 

choice to increase their stake. 

In truth, the average amount raised per 

IPO is hardly enough to alleviate financial 

concerns for long. In 2008, our survey 

showed IPOs raised on average $22.3 million. 

In the previous two years, it was considerably 

more, $58 million in 2007 and $41 million in 

2006. Figure 2 shows an IPO in 2009 raised, 


on average, $92.8 million. On the surface that seems a marked increase, but further inspection shows that the figure is distorted by the unique case of Talecris Biotherapeutics (Research Triangle Park, NC, USA). The company develops nonrecombinant protein therapeutics from plasma and is profitable. It was pegged as an acquisition target by rival CSL (Victoria, Australia) in 2008 for $3.1 billion, but the US government challenged the purchase as anticompetitive, and the deal fell apart. Talecris instead conducted an IPO in 2009 for a whopping $550 million. Toss aside Talecris, and the figure falls more in line with recent years: $42 million. Talecris is again in line for an acquisition, by Grifols (Barcelona, Spain) for $3.4 billion. 

Overall, the public markets in Europe remain relatively parsimonious. They provided only 15% of all European financing, whereas US public markets provided 33% of the total US fundraising (Table 4). The main shortfall, as in previous years, was in follow-on offerings. Where follow-on financings occurred in Europe, they raised amounts comparable to those raised by US firms: $112 million on average, compared with $107 million for US companies. But in 2009, 48 US biotech companies got follow-on offerings away, compared with only seven in Europe. For European public companies, secondary offerings are still the exception, leaving them open to acquisition bids and investors open to disillusionment. 

Two European firms dominated debt financing this year (Table 5), with giant UCB (formed around Celltech, Brussels) taking in more than $2.6 billion in a series of three notes. Elan (Dublin) also raised $625 million in a bond issue. These two massive chunks of debt financing distort the European fundraising picture, giving it an undue rosy glow. The $3.2 billion raised represents over three-quarters of the ‘Other’ categories of finance in Europe and nearly half of all finance in Europe during 2009 (Table 4). Without this money, the amount raised in Europe during 2009 would have been only 15% of the global total finance in this survey, rather than 26%. 

Those IPOs had a small role in the sizable increase in overall funding from 2008, but the biggest factor was headline-grabbing partnering deals: $36.9 billion in 2009, up from $20 billion the previous year. This heightened partnering activity was propelled both by pharma’s need to bolster fading pipelines and biotech’s need for help of any kind during the recession. 

But here again, that high figure is misleading, because a large portion of it represents milestone payments that may never be paid. The leading deal among our companies (Table 6) was formed between Nektar and AstraZeneca for two programs that use Nektar’s advanced polymer conjugate technology platform: the program NKTR-118, which had completed phase 2 for opioid-induced constipation, and NKTR-119, an early-stage program intended to deliver products for pain without a constipation side effect. Nektar did receive an up-front payment of $125 million in the deal, but it’s the potential milestones that give the partnership its $1.5 billion high-end value. 

Figure 1 Global biotech industry financing. Biotech funding was up 84% to $62 billion in 2009 from $33 billion in 2008. Partnership figures from Burrill & Co. are for deals involving a US company. BioCentury makes updates to its financing data on an ongoing basis. Sources: BCIQ: BioCentury Online Intelligence; Burrill & Co. 

Financing raised ($ billions) 
Year | IPO | Follow-on | PIPEs | Venture capital | Debt and other | Partnerships 
2003 | 0.544 | 3.883 | 2.231 | 4.018 | 9.075 | 8.933 
2004 | 2.556 | 3.335 | 2.93 | 5.318 | 8.833 | 10.933 
2005 | 1.859 | 4.838 | 2.661 | 5.398 | 6.112 | 17.268 
2006 | 2.03 | 5.578 | 4.695 | 5.682 | 11.853 | 19.796 
2007 | 2.95 | 4.377 | 4.748 | 6.809 | 11.68 | 22.365 
2008 | 0.134 | 1.867 | 3.143 | 5.177 | 3.232 | 20.023 
2009 | 0.928 | 6.041 | 2.277 | 5.198 | 10.335 | 36.923 
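The effect of the UCB and Elan debt on Europe's share of financing can be reproduced from the Table 4 totals. The Python sketch below is a rough check, not the survey's own method: it assumes the adjusted share is measured against a combined total that also leaves out the $3.2 billion, because that reading reproduces the 15% in the table. The names are illustrative.

```python
# Totals from Table 4, in $ millions.
US_TOTAL = 17_564
EU_TOTAL = 6_310
UCB_ELAN_DEBT = 3_200  # combined UCB and Elan debt financing


def share(part, whole):
    """Return part as a percentage of whole."""
    return part / whole * 100


# Europe's share of all US + EU financing as reported.
eu_share = share(EU_TOTAL, US_TOTAL + EU_TOTAL)

# Assumption: the debt is removed from both the European figure and the
# combined total before the share is recomputed.
eu_adjusted = EU_TOTAL - UCB_ELAN_DEBT
eu_share_adjusted = share(eu_adjusted, US_TOTAL + eu_adjusted)

# Alternative reading: keep the original combined total as the denominator.
eu_share_fixed_total = share(eu_adjusted, US_TOTAL + EU_TOTAL)

print(f"EU share, as reported: {eu_share:.1f}%")
print(f"EU share without UCB and Elan: {eu_share_adjusted:.1f}%")
print(f"Same, against the unadjusted total: {eu_share_fixed_total:.1f}%")
```

The first two results round to the 26% and 15% quoted in the text. Holding the denominator at the original combined total would instead give about 13%, so the choice of denominator matters when comparing the two shares.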

That was one of six deals in 2009 that had 

a potential payout of more than $1 billion, 

making the average potential of our top-ten 

group worth more than a billion dollars. 

But the average amount of funds received 

up front (including equity investments or 

money for milestones hit at the time of deal 

signing) was much lower, at about $109 million, 

meaning nearly 90% of the value in these 

deals remained unrealized at year’s end. 

When considering all partnerships between pharma and biotech (public and private), using data from Elsevier’s Strategic Transactions database, we found the average total amount paid up front in 2009 was about $58.9 million. That’s the highest average over the past 10 years (only 2006 came close, at $55.7 million), and a long way from the up-front money paid out in 2000, which was just $12.4 million. Still, it also drives home the reality that a deal with a potential value of $1 billion is just that: potential. 

2009 also provided an interesting wrinkle for equity investments around partnerships. Over the past 10 years, the average equity bought as part of a deal in each year was well below $10 million, with the exception of 2001, when it leaped to $32.3 million. Last year, it leaped again, to $20.6 million. In both 2001 and 2009, the public markets had come down from peaks, and thus selling equity as part of partnering deals rose in favor. 

Table 2 Top ten follow-on offerings of 2009 

Company name | Date completed | Amount raised ($ millions) | Underwriters 
Qiagen | 9/24 | 640.4 | Deutsche Bank, Goldman Sachs, J.P. Morgan, Barclays Capital, Commerzbank, DZ Bank 
Vertex Pharmaceuticals | 12/2 | 500.5 | Goldman Sachs, Merrill Lynch, J.P. Morgan, Morgan Stanley 
Human Genome Sciences | 12/2 | 476.8 | Goldman Sachs, Citigroup, J.P. Morgan, Morgan Stanley, UBS 
Dendreon | 12/10 | 426.9 | J.P. Morgan, Deutsche Bank, Citigroup, Morgan Stanley, Lazard, Leerink 
Human Genome Sciences | 7/28 | 373.8 | Goldman Sachs, Citigroup 
Vertex Pharmaceuticals | 2/18 | 320 | Merrill Lynch, Cowen 
Cephalon | 5/21 | 300 | Deutsche Bank, J.P. Morgan, Barclays Capital Inc., Credit Suisse, Morgan Stanley 
Dendreon | 5/13 | 229.9 | Deutsche Bank 
Incyte | 9/25 | 139.7 | Goldman Sachs, Morgan Stanley, J.P. Morgan 
Seattle Genetics | 8/11 | 135.9 | J.P. Morgan, Goldman Sachs, Needham, Oppenheimer, RBC Capital Markets 

Data are matched to the definition of biotech in Box 1. Source: BCIQ: BioCentury Online Intelligence 

Table 3 Initial public offerings of 2009 

Company name | Location | Date completed | Amount raised ($ millions) | Underwriters 
CanBas | Shizuoka, Japan | 9/17 | 14.8 | Mitsubishi UFJ Securities International plc, Mizuho, Ichiyoshi, JPMorgan, Mizuho Investors, Takagi 
China Nuokang Bio-Pharmaceutical | Beijing | 12/9 | 40.7 | Jefferies, Oppenheimer 
Cumberland Pharmaceuticals | Nashville, TN, USA | 8/10 | 85 | UBS, Jefferies, Wells Fargo, Morgan Joseph and Co. 
D. Western Therapeutics Institute | Aichi, Japan | 10/13 | 9.7 | Nomura, Mitsubishi UFJ Securities International plc, Takagi, SBI Securities Co. Ltd., Tokai Tokyo, Mizuho 
D-Pharm | Rehovot, Israel | 8/17 | 7.3 | Clal Finance, Rosario, Meitav 
Human Stem Cell Institute | Moscow | 12/10 | 4.8 | CJSC Alor Invest 
Movetis N.V. | Turnhout, Belgium | 12/3 | 146 | Credit Suisse, KBC, Piper Jaffray 
Omeros Corp. | Seattle | 10/7 | 68.2 | Deutsche Bank, Wedbush, Canaccord, Needham, Chicago Investment Group, National Securities 
Talecris Biotherapeutics | Research Triangle Park, NC, USA | 9/30 | 549.9 | Morgan Stanley, Goldman Sachs, JPMorgan, Citigroup, Wells Fargo, Barclays Capital 
T-Ray Science Inc. | Vancouver | 12/9 | 1.4 | Research Capital Corp. 
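The distortion that the Talecris offering introduces into the 2009 IPO average can be checked directly against the amounts in Table 3. The Python sketch below is illustrative: the dictionary simply transcribes the table's "Amount raised" column, and the median is our own addition, shown only as one common way to summarize a skewed set of values. It is not a statistic reported in the survey.

```python
from statistics import mean, median

# Amount raised by each 2009 IPO, in $ millions, transcribed from Table 3.
IPO_AMOUNTS = {
    "CanBas": 14.8,
    "China Nuokang Bio-Pharmaceutical": 40.7,
    "Cumberland Pharmaceuticals": 85.0,
    "D. Western Therapeutics Institute": 9.7,
    "D-Pharm": 7.3,
    "Human Stem Cell Institute": 4.8,
    "Movetis N.V.": 146.0,
    "Omeros Corp.": 68.2,
    "Talecris Biotherapeutics": 549.9,
    "T-Ray Science Inc.": 1.4,
}

OUTLIER = "Talecris Biotherapeutics"

# Average across all ten offerings.
avg_all = mean(IPO_AMOUNTS.values())

# Average with the single very large offering set aside.
without_outlier = [v for name, v in IPO_AMOUNTS.items() if name != OUTLIER]
avg_without_outlier = mean(without_outlier)

# Median across all ten offerings (our addition, not from the survey).
median_all = median(IPO_AMOUNTS.values())

print(f"Average, all IPOs: ${avg_all:.1f} million")
print(f"Average, excluding Talecris: ${avg_without_outlier:.1f} million")
print(f"Median, all IPOs: ${median_all:.2f} million")
```

The two averages round to the $92.8 million and $42 million quoted in the text. The median is lower still, which fits the point that most of the year's offerings were small.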


Figure 2 Global biotech initial public offerings. IPOs in 2009 seemingly made a recovery in amount raised, if not number of offerings. But the data is skewed by one large offering. 

Year | Number of IPOs | Average amount raised ($ millions) 
2002 | 10 | 28 
2003 | 14 | 39 
2004 | 53 | 48 
2005 | 45 | 41 
2006 | 49 | 41 
2007 | 51 | 58 
2008 | 6 | 22 
2009 | 10 | 92.8 

Buyouts and climbing sales 

Mergers and acquisitions fell in 2009, both in total number and in the values assigned to the companies acquired (Table 7). Leading our list is Roche’s buyout of Genentech, but that deal was actually announced in 2008. Although it closed in the spring of last year, the acquisition is old news. 

But also high on the list is the purchase of Medarex by Bristol-Myers Squibb (BMS, New York), an acquisition that gained a validation of sorts in 2010. The purchase gave BMS access to Medarex’s antibody-drug conjugate technology and UltiMAb human antibody development system, but the main draw was ipilimumab. BMS was already partnered with Medarex on ipilimumab in phase 3 for metastatic melanoma, in phase 2 for lung cancer and in phase 3 for adjuvant melanoma and 

hormone-refractory prostate cancer, so it had 

seen the product up close. Perhaps that’s the 

reason it offered a greater than 90% premium 

to the trading price of Medarex shares; the deal 

went through at $16 apiece, or $2.4 billion. 

Ipilimumab, a monoclonal antibody 

designed to block the inhibitory signal of 

cytotoxic T lymphocyte-associated antigen-4 

(CTLA-4), had failed in a phase 3 trial 

in 2007, and there was uncertainty around 

the new pivotal program for melanoma. 

But BMS announced in June 2010 at the 

American Society of Clinical Oncology’s 

annual meeting in Chicago that ipilimumab 

met the primary endpoint of survival in 

advanced melanoma in a phase 3 double-blind 

randomized trial, and BMS said it 

expects to submit for regulatory approval 

of ipilimumab this year. Should the drug 

win approval, the $2.4 billion price tag for 

Medarex will seem a steal. 

Also of interest last year was Gilead’s (Foster 

City, CA, USA) buyout of CV Therapeutics, 

giving a company typically known for its 

HIV franchise a presence in the cardiovascular 

space. The move brought aboard 

Ranexa (ranolazine extended-release tablets), 

approved for chronic angina, and Lexiscan 

(regadenoson) injection for use as a pharmacologic 

stress agent in radionuclide myocardial 

perfusion imaging. Gilead remains a leader in 

HIV drugs—its highest-selling product was 

Truvada at about $2.5 billion last year, and 

90% of Gilead’s product sales came from its 

antiviral franchise—but through this acquisition 

it is seeking growth in other areas. 

Table 4 Comparison of US and EU financing in 2009 

Category | Amount raised in US ($ millions) | Number of US deals | Amount raised in EU ($ millions) | Number of EU deals | UCB and Elan ($ millions) | EU financing minus UCB and Elan ($ millions) | EU financing minus UCB and Elan (% of US + EU total) | EU as a percentage of US + EU total | EU as a percentage of EU total | US as a percentage of US total 
Venture capital | 3,939 | 197 | 1,114 | 87 | – | 1,114 | 22% | 22% | 18% | 22% 
IPO | 703 | 3 | 158 | 3 | – | 158 | 18% | 18% | 3% | 4% 
Follow-on offering | 5,166 | 48 | 785 | 7 | – | 785 | 13% | 13% | 12% | 29% 
Other | 7,756 | 236 | 4,253 | 108 | 3,200 | 1,053 | 12% | 35% | 67% | 44% 
Total | 17,564 | 484 | 6,310 | 205 | 3,200 | 3,110 | 15% | 26% | 100% | 100% 

Big sellers like Truvada are the beacons in the biotech fog, promising a move into the black after years spent dumping money into R&D and the clinic. Achieving that level of revenue usually 

follows this path: drug approval, then a marketing 

push and physician acceptance, followed 

by subsequent approvals in other indications 

to further increase sales. Most of the biologics 

in our list of the top ten drugs (Table 8) went 

that route. Enbrel (etanercept), from Amgen 

(Thousand Oaks, CA, USA), exemplifies this 

tactic. Originally approved in 1998 for rheumatoid 

arthritis, the drug has since been approved in 

four other indications (ankylosing spondylitis, 

psoriasis, psoriatic arthritis and juvenile 

rheumatoid arthritis), and its worldwide revenue 

has jumped from $2.6 billion in 2005 to 

an estimated $6.4 billion in 2009, according to 

BioMedTracker. The drug, which inhibits the 

tumor necrosis factor (TNF) pathway, is the 

top-selling biologic in the world. 

In fact, three of the top five revenue-producing 

drugs target TNF: Remicade (infliximab, 

Johnson & Johnson, New Brunswick, 

NJ, USA) and Humira (adalimumab, Abbott, 

Abbott Park, IL, USA), are the other two, 

selling $5.9 billion and $5.5 billion worldwide, 

respectively. Those numbers, like the 

revenues for all the drugs in this table, are an 

improvement over the previous year. 

Given the lack of generic competition for 

biologics, it’s almost an anomaly when a 

drug does not increase sales year on year; it 

suggests something must have gone wrong. 

That’s been the case with Amgen’s Aranesp. 

Peaking at $4.1 billion in worldwide sales in 

2006, the drug has lost ground yearly since 

then, and in 2009 declined 15% to about 

$2.7 billion, falling off our list of the top ten 

biotech drugs. Amgen attributes the decline 

to the negative impact, mostly in supportive 

cancer care, of a “product label change” 

that came in August 2008. In fact, Aranesp 

serves as an example of the downside of 

product growth: the drug was being used off-label 

in various indications until reports of 

adverse effects caused the US Food and Drug 

Administration (FDA) to tighten its label. 

The decline of Aranesp revenue meant 

Amgen reported lower overall revenues for 

2009, although the company’s adjusted net 

income for the year was more than $5 billion, 

compared with $4.9 billion in 2008, a 

3% increase. 

Affymetrix (Santa Clara, CA, USA) also 

saw its revenue decrease in 2009, though the 

reason has more to do with accounting: the 

figures had been buoyed in 2008 by a one-time 

intellectual property payment of $90 

million. So while a comparison year-by-year 

shows the company lost 20% of revenue in 

2009, in truth the business ground along 

smoothly. It had product revenue of $279.2 

million and service revenue of $39.6 million 

last year, both up from the previous year 

(2008 product revenue was $270.4 million 

and service revenue was $32.1 million). 

Like Amgen and Affymetrix, other established 

firms fared well. Gilead experienced the 

largest increase in revenues, posting product 

sales that increased 27% over 2008 to nearly 

$6.5 billion, driven mostly by its HIV franchise 

of Truvada (emtricitabine and tenofovir disoproxil 

fumarate) and Atripla (efavirenz 600 

mg, emtricitabine 200 mg, tenofovir disoproxil 

fumarate 300 mg). Truvada sales increased 

18% to about $2.5 billion, and Atripla brought 

in $2.4 billion, up 51% over 2008. 

HGS also reported impressive revenues of $275.7 million for 2009, compared with revenues of only $48.4 million the previous year. The company logged its first product sales: $180.2 million for delivering raxibacumab (a human monoclonal antibody drug for treatment of inhalation anthrax) to the US Strategic National Stockpile under a government contract. That helped HGS earn a net income of $5.7 million for the year, compared with a net loss of $268.9 million in 2008. The company also reported positive results for Benlysta (belimumab) phase 3 trials announced in July and November 2009. The good news drove up HGS’s stock price considerably, and as we noted earlier, it raised public funds twice during the year. 

End of the line 

Whereas 2008 saw 34 companies depart from the public biotech landscape (11 because of delisting or bankruptcy), those numbers increased in 2009. The total number of companies departing for any reason (buyout or merger included) climbed to 44, and the number removed owing to financial difficulty also went up, reaching 20. But a 9.5% drop in the number of companies is fewer casualties than was feared. Of those that teetered but survived, some were helped partially by the markets opening back up in the spring; by the ability to conduct debt deals, which returned to a more normal level after suffering through the battered credit markets in 2008; and by partners supplying up-front money and other funding. 

Table 5 Top ten debt financings of 2009 

Company name | Financing type | Date completed | Amount raised ($ millions) 
Amgen | Sr notes (other) | 1/14 | 2,000 
UCB Group | Bond (other) | 10/27 | 1,128 
UCB Group | Bond (other) | 12/3 | 751.9 
UCB Group | Sr convert notes (other) | 9/30 | 730.3 
Elan | Sr notes (other) | 9/29 | 625 
Cephalon | Sr subord convert notes (other) | 5/22 | 500 
Gilead Sciences | Debt (other) | 4/20 | 400 
Incyte | Convert notes (other) | 9/25 | 400 
Bio-Rad Laboratories | Sr notes (other) | 5/19 | 300 
PDL BioPharma | Sr notes (other) | 10/28 | 300 

Table 6 Top ten research partnership and licensing deals of 2009 

Researcher | Investor 

Also, considering that Genentech (and 

its $3.4 billion of net income in 2008) is no 

longer in our survey (now part of Roche), it 

seemed unlikely the sector would be able to 

repeat its performance from 2008, when it 

posted a net profit of $3.8 billion. But it did, 

drawing a collective net income in 2009 of 

$8 billion—with the heavy lifting, unsurprisingly, 

done by the large-cap firms (Fig. 3). 

Three main drivers contributed to the 

unexpected profitability in 2009. The first 

is an accounting change by the US Federal 

Accounting Standards Board, issued in late 

2007 but applicable in fiscal year 2009. Called 

SFAS 141R, the new guidance allows the costs 

associated with mergers and acquisitions to 

be expensed over time, rather than all at once 

Date 

announced 

Deal value 

($ millions) Details 

Nektar AstraZeneca 9/21 1,505 Worldwide rights to NKTR-118 for opioid-induced constipation and NKTR-119 

for pain 

Incyte Novartis 11/25 1,310 Ex-US rights to oral INCB18424, which is in phase 3 for myelofibrosis, and worldwide 

rights to preclinical cancer compound INCB28060 

Targacept AstraZeneca 12/3 1,240 Worldwide rights to develop and commercialize major depressive disorder compound 

TC-5214 

Exelixis Sanofi-aventis 5/29 >1,161 Exclusive, worldwide rights to XL147 and XL765, oral phosphoinositide 3-kinase 

inhibitors in phase 1b/2 and phase 2 to treat cancer 

ZymoGenetics Bristol-Myers Squibb 1/12 1,105 Codevelop and commercialize phase 1 HCV compound PEG-Interferon 

lambda (IL-29) 

Amylin Takeda 11/1 1,075 Codevelop and commercialize therapeutics for obesity and related indications 

Santaris Pharma Wyeth 1/12 847 Worldwide rights to ALD518 for all indications except cancer 

Algeta Bayer 9/3 800 Codevelop Alpharadin for bone metastases 

Medivation Astellas Pharma 10/27 765 Codevelop MDV3100 for the treatment of prostate cancer 

Cytokinetics Amgen 5/26 650 Exclusive world-wide (except Japan) license for cardiac contractility program 

Acorda Bayer 7/1 510 Exclusive collaboration and license agreement to develop Fampridine-SR for 

multiple sclerosis 


as part of the purchase price. It’s a small factor, 

and biotech-biotech mergers are less common 

and of lesser value than those between 

biotech and pharma, but still noteworthy. 

The second is that some companies simply 

had good years, and their revenue growth 

helped make up for the loss of Genentech. 

We’ve seen this with companies such as 

Gilead, which pushed its revenue up 31% and 

net income up 33% from 2008, and Biogen 

Idec (Weston, MA, USA), which posted a 

net income of $970 million, up 24% over the 

previous year. 

But the major reason for the collective 

profit is the same one that kept the number 

of bankruptcies lower than feared: a cutback 

on expenses. When the money isn’t 

there, spending has to decrease, and biotech 

tightened its belt in 2009. Companies spent 

less in two notable ways. First, they carried 

smaller payrolls than previously. In 2008, the 

companies surveyed had an average of 489 

employees per company. In 2009, although 

our pool of biotech firms surveyed grew to 

461 and with it the total number of employees 

increased, the average number of employees 

per company actually dropped to 442. 

Second, the biotech sector collectively 

reduced its R&D spending. In 2008, even as 

it faced financial turmoil, biotech increased 

its spending on R&D, as it had for years, from 

$22.8 billion in 2007 to $25.5 billion. This 

pattern came to a halt last year, when the 

sector’s overall R&D spending fell to $22.3 

billion, with the greatest decrease seen in the 

microcaps, which went from $5.4 billion in 

2008 to $4.0 billion (a fall of nearly 30%) in 

2009. (Large caps reduced their R&D spending 

by just under 10%.) This considerable 

drop helped keep biotech profitable, but it is 

likely that it penalized the sector’s ability to 

carry out innovative science. 
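The size of these swings is easy to verify. The following sketch computes the percentage changes from the rounded dollar figures quoted above; because the inputs are rounded to one decimal place, the results are approximate. In particular, the rounded microcap figures give a fall closer to 26% than 30%, so the "nearly 30%" in the text presumably reflects unrounded data. The dictionary keys are illustrative labels, not terms from the survey.

```python
def pct_change(old: float, new: float) -> float:
    """Return the percentage change from old to new."""
    return (new - old) / old * 100.0

# R&D spending quoted in the text, in $ billions: (earlier year, later year).
rd_spending = {
    "sector, 2007 to 2008": (22.8, 25.5),
    "sector, 2008 to 2009": (25.5, 22.3),
    "microcaps, 2008 to 2009": (5.4, 4.0),
}

# Percentage change for each pair of rounded figures.
changes = {label: pct_change(old, new) for label, (old, new) in rd_spending.items()}

for label, change in changes.items():
    print(f"{label}: {change:+.1f}%")
```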

Table 7 Top ten announced mergers and acquisitions of 2009

Target | Acquirer | Month completed | Deal value ($ millions)
Genentech | Roche | March | 46,800
Medarex | Bristol-Myers Squibb | September | 2,400
CV Therapeutics | Gilead Sciences | April | 1,400
ESBATech | Alcon | September | 589
BiPar Sciences | Sanofi-aventis | April | 500
Noven | Hisamitsu Pharmaceuticals | August | 428
ViroChem | Vertex | March | 413
Peplin | Leo Pharma | November | 288
Dow Pharmaceutical Sciences | Valeant Pharmaceuticals | January | 285
Arana Therapeutics | Cephalon | August | 276


The horizon 

Compared with other business sectors, biotech 

will continue to face the challenges of 

long timelines for product development. 

The heavy costs of R&D have shaped this 

industry since its inception, and that’s not 

about to change. But precisely because biotech 

remains centered on the provision of 

medical products, it has had the advantage of 

being considered ‘recession proof’: people

need drugs no matter how the economy is 

performing. The bottom lines of biotech’s 

big producers—Amgen, Gilead, Biogen—in 

2008 and 2009 reflect this. 

Table 8 Top ten biologic drugs in terms of sales in 2009

Name | Lead company | Approved indication(s) | 2009 revenue ($ million)
Enbrel | Amgen | Rheumatoid arthritis (RA), ankylosing spondylitis, psoriasis, psoriatic arthritis (PA), juvenile rheumatoid arthritis | ~6,400
Remicade | Johnson & Johnson | Psoriasis, ulcerative colitis (UC), ankylosing spondylitis, Crohn’s disease, PA, RA | 5,892
Avastin | Roche | Colorectal cancer, breast cancer, brain cancer, renal cell cancer, non–small cell lung cancer | 5,747
Rituxan | Biogen IDEC | Non-Hodgkin’s lymphoma, RA, chronic lymphocytic leukemia | 5,617
Humira | Abbott Laboratories | RA, ankylosing spondylitis, juvenile rheumatoid arthritis, Crohn’s disease, PA, psoriasis | 5,488
Herceptin | Roche | Breast cancer | 4,833
Lantus | Sanofi-aventis | Diabetes mellitus type II, diabetes mellitus type I | 4,295
Gleevec | Novartis | Chronic myelogenous leukemia, hypereosinophilic syndrome, dermatofibrosarcoma protuberans, myeloproliferative disorders, gastrointestinal stromal tumor, acute lymphocytic leukemia, myelodysplastic syndrome, mastocytosis | 3,944
Neulasta | Amgen | Neutropenia, leucopenia | 3,355
Prevnar | Pfizer | Prevention of otitis media, Streptococcus pneumoniae pneumonia | ~3,100

Source: BioMedTracker

Yet the sector’s ability to fund itself ebbs and flows with the global economy, and this is especially true for the smaller-cap firms.

These companies require investors, they 

require the support of the public markets, 

and they require lending, and when the 

world’s money locks up the way it did over 

2008 and the beginning of 2009, they suffer. 

At times like these, some will break, and R&D 

expertise and know-how will be dispersed— 

or worse, will be gone for good. 

But what biotech showed us in 2008 and 

2009 is its ability to hibernate until money 

flows again. The industry has long had to 

make do with less—a valuable trait when the 

tap runs dry. It forces the sector’s executives to 

look constantly for new ways to trim expenses 

and to partner. This can be seen through collaborations 

by Symphony Capital (New York), 

which invests in clinical programs rather than 

a company itself, or the low-infrastructure 

model espoused by groups such as Talaris 

Advisors (Hopkinton, MA, USA), or the use of 

contract research organizations to outsource 

portions of drug development. 

The economic upswing seen in the second 

half of 2009 has continued. Overall funding 

in the first six months of 2010 is on pace 

to easily surpass 2009 for both private and 

public biotechs. The FDA approved 16 biologics 

last year, an increase over both 2008 

(11 biologic approvals) and 2007 (9 biologic 

approvals). The Nasdaq biotech index has 

held ground for the first six months of 2010. 

J. Craig Venter and colleagues caught the 

world’s attention by creating a bacterium with 

an artificial genome. Biotech made its way 

to the Supreme Court, winning a decision 

favorable to Monsanto (St. Louis, MO, USA) 

and others developing genetically modified 

seeds. And so far this year, there have been 

approvals of Amgen’s Prolia (denosumab) for 

post-menopausal osteoporosis and Dendreon’s Provenge

(sipuleucel-T) for prostate cancer, both of 

which are expected to be huge sellers. 

Biotech, with its small firms and entrepreneurial 

spirit, has long thought of itself as the 

underdog, made up of fast, nimble companies 

built to innovate, overachieve, withstand hardship 

and adapt. This attitude has always been 

part of the industry’s culture, and these days 

it’s also a carefully cultivated personality used 

to distance biotech from the more troubled pharmaceutical industry. In short, it has often seemed like biotech was built to deal with adversity. After surviving the past two years, it now knows it can.

The authors would like to acknowledge the insight of G. Giovannetti and G. Jaggi in crafting this article.

Note: Supplementary information is available on the Nature Biotechnology website.

[Figure 3 chart. Panel a: revenue, R and D, and net profit/loss (amount, $ billions) for large-cap, mid-cap, small-cap and micro-cap companies. Panel b: number of companies and number of employees for the same four groups.]

Figure 3 Public biotech company revenue, R&D spending, profits and number of employees by 

market cap. Large cap, ≥$5 billion; mid-cap, $1 billion to

patents 

Bilski v. Kappos: the US Supreme Court broadens 

patent subject-matter eligibility 

William J Simmons 

The court narrowly ruled that business methods may be patent eligible, while striking down the primacy of its main test. 


With over 60 biopharmaceutical products 

applied for or expected to be filed at the 

US Food and Drug Administration this year, 

joining over 335 currently approved biopharmaceuticals, 

determining what can or cannot 

be patented is a threshold question in protecting

inventions in the biotech and pharmaceutical

industries 1. Up until 2008, the answer to

this important question was relatively clear. 

However, a 2008 decision that set a new single 

standard for patent eligibility made addressing 

this inquiry fundamentally uncertain. 

In a landmark decision issued 28 June, 

the US Supreme Court issued its holding 

regarding patent-eligible subject matter in 

Bilski v. Kappos. The court unanimously 

agreed that Bilski’s claims recited no more 

than “abstract ideas” and were therefore 

not patentable under US law. Importantly, a 

majority of the court held that the language 

of the relevant law (35 USC §§100–101) 

broadly encompassed vast forms of subject 

matter as patent eligible. The court unanimously rejected the ‘machine-or-transformation’ test 1 as the sole test for determining whether a process is directed to patentable subject matter. That test, implemented by the US Court of Appeals for the Federal Circuit in 2008, had been criticized as “unnecessary.” The court held instead that it is one test among many that can be used to determine patent eligibility. Justice Kennedy delivered the court’s opinion, with Chief Justice Roberts and Justices Thomas and Alito joining in full and

Justice Scalia joining in part. Justice Stevens 

filed a concurring opinion in which Justices 

Ginsburg, Breyer and Sotomayor joined. 

Justice Breyer filed a concurring opinion in 

which Justice Scalia joined in part. 

William J. Simmons is at Sughrue Mion, PLLC, 

Washington, DC, USA. 

e-mail: wsimmons@sughrue.com 

The facts of Bilski v. Kappos did not involve 

biotech or pharmaceutical subject matter but 

rather a process for hedging risk in commodity 

markets (that is, a method of instructing buyers and sellers of commodities in the energy market how to protect against the risk of price fluctuations) 2. In one instance, the application recited a series of steps for hedging risk; in another, it expressed risk hedging as a mathematical formula. The US Patent and

Trademark Office (USPTO) denied Bilski a 

patent because, according to the USPTO, the 

patent application was directed to business 

methods that were patent-ineligible subject 

matter. The USPTO reasoned that the invention 

was too abstract, that it merely manipulated an 

idea and that it failed to practically apply concepts 

enough to render them patentable. The 

administrative appeal board affirmed, concluding 

that the application involved only mental 

steps and did not result in the transformation 

of physical matter. 

On appeal, the Federal Circuit, sitting 

en banc, did not rely on any of the several tests 

used by prior courts, including the Supreme 

Court, but instead created and applied a new 

legal standard for patentability: processes are 

patentable only if they are tied to a particular 

machine or apparatus, or transform a particular 

article into a different state or thing— 

namely, the machine-or-transformation test 3 . 

The Federal Circuit reasoned that because 

Bilski’s claims did not satisfy the new governing 

test, which the court made clear should be 

broadly applied to all areas of technology, the

USPTO’s decision was correct and Bilski was 

not entitled to a patent. 

Judge Rader, now the Chief Judge of the 

Federal Circuit, in dissent, indicated that the 

language of 35 USC §101 “contains no hint 

of an exclusion for certain types of methods” 

and stated that “ironically the Patent Act itself 

specifically defines ‘process’ without any of 

these judicial innovations.” Rader argued that 

the only limits on eligibility are inventions 

that embrace natural laws, natural phenomena 

and abstract ideas. He wrote, “this court today 

invents several circuitous and unnecessary 

tests.” Even so, Rader suggested that the hedging 

claim on appeal was abstract, and he stated, 

“Bilski’s method for hedging risk in commodities 

trading is either a vague economic concept 

or obvious on its face.” Rader pointed out that 

US patent law was designed to encourage ingenuity 

and that the law is focused not on particular 

subject categories but on the patentability of 

the specific claimed invention. He maintained 

that the law distinguishes eligibility from conditions 

of patentability and generously provides 

for patent eligibility. His dissent was clear: the 

court should not create any categorical exclusion. 

Rader also pointed out that in Diehr 4 , the 

Supreme Court indicated that only natural laws, 

natural phenomena and abstract ideas are patent 

ineligible. He clarified, however, that if an 

abstract idea is applied to a practical use, it may 

be patent eligible. Notably, Rader commented 

that the earlier Supreme Court opinion of 

three dissenting justices in Lab. Corp. 5 misapprehended 

the distinction between a natural 

phenomenon and a patentable process, and in 

so doing, this opinion did not ask the fundamental 

question of whether the subject matter 

at issue is deserving of patent protection. Rader 

was clear that courts should not avoid this fundamental 

inquiry nor categorically preclude any 

form of invention. 

In response to the Federal Circuit’s decision, 

Bilski petitioned for and obtained Supreme 

Court review. The Bilski decision garnered the 

attention of many, prompting an unprecedented 

number of amicus briefs, unsolicited submissions expressing the views of nonparties. Among the 66 briefs, 13 were submitted by or on behalf of life science organizations, including biotech and pharmaceutical interests (Fig. 1 and Table 1). Interestingly, opinions within the industry differed on the desired outcome of the case, and some briefs supported affirmance of the decision. However, among the 13 briefs submitted, only one appeared to support the machine-or-transformation test (with the caveat that the test be applied correctly; Figs. 1 and 2, and Table 1).

During the oral arguments heard at the 

Supreme Court in November 2009, several 

justices expressed their concerns that in the 

absence of unambiguous limitations regarding 

patent eligibility, the public could be harmed 

by the grant of patents to inventions directed 

to unworthy subject matter or commercially 

useful subject matter that might stifle business 

or innovation if granted a monopoly. The chief 

justice and several other justices appeared dissatisfied 

with the Federal Circuit’s machine-or-transformation

test as the sole test for patent

eligibility but seemed concerned to avoid

expanding the scope of patent-eligible subject 

matter beyond that limited by the court’s precedent 

5 . In defense of its decision, the USPTO 

argued that the Bilski process did not comply 

with the machine-or-transformation test, that 

the claimed process was a method of conducting 

business that was per se unpatentable and 

that the claimed process was no more than 

an abstract idea and therefore unworthy of a 

patent. The USPTO was clear about the devastating 

effects of banning entire categories of 

inventions from patenting and further asserted, 

“to say that business methods are categorically 

ineligible for patent protection would eliminate 

new machines, including programmed computers, 

that are useful because of their contributions 

to the operation of business.” 

[Figure 1 bar chart. Categories: Bilski, Affirmance, Neither party.]

Figure 1 Number of amicus briefs from the biotech and pharma sector vis-à-vis Bilski v. Kappos. The chart compares the numbers of briefs arguing for a decision in favor of Bilski, for affirmance of the court’s decision against Bilski or for neither party.

The Supreme Court’s judgment was supported by all justices, but the Court divided 5–4 in holding that, under some undefined circumstances, at least some business methods may be patented. The Court did not clarify under which circumstances one could distinguish a patent-eligible business method from an unpatentable

“abstract idea,” leaving this issue for the Federal 

Circuit to decide. 

In reaching its decision, the court looked to 

the language of the law that describes four categories 

of patentable subject matter: processes, 

manufactures, machines and compositions 

of matter. A problem, however, arises in that 

the law sets forth a circular definition of ‘process’, 

making it difficult, at times, to determine 

whether a process meets the requirements of the 

statute. According to the court, the machine-or-transformation test, when applied as the sole test for determining a statutory process, violates proper statutory interpretation because “[t]he term ‘process’ means process, art or method, and includes a new use of a known process, machine, manufacture, composition of matter, or material,” and the ordinary definition of process does not require that it be tied to a machine or transform an article 5. Joined by three

other justices, Justice Kennedy explained that 

“[s]ection 101 is a dynamic provision designed 

to encompass new and unforeseen inventions” 

and that as new technologies evolve, the statute 

allows for the development and application of 

additional tests to assist in determining which 

processes are patent eligible. 

Regarding the contention that business methods 

are per se unpatentable, the court rejected 

this argument. The court reasoned

that, in view of specific on-point legislation— 

namely, 35 USC §273—which creates a defense 

to alleged infringement of a business method 

claim, the legislature intended that claims 

directed to business methods can be patentable 

subject matter. Justice Kennedy reiterated that 

abstract ideas (which he did not define) are not 

patentable and that the court’s decisions regarding 

the unpatentability of abstract ideas were 

useful in determining which business methods 

may be protected under the patent law. The 

court held that Bilski’s claims were unpatentable 

because they were directed to “abstract ideas.” 

According to the court, Bilski sought a patent 

on “the use of the abstract idea of hedging risk 

in the energy market,” which was too abstract 

to be patent eligible. Even though the court 

rejected application of an exclusive machine-or-transformation

test, the court was careful to 

point out that inventions should be considered 

as a whole, not analyzed by dissecting the claims 

into old and new elements. Although it rejected 

the machine-or-transformation sole standard, 

the court provided the Federal Circuit with 

great flexibility in developing and applying 

“other limiting criteria” useful for determining 

patent eligibility. This guidance by the court is 

important and when properly implemented 

will fundamentally impact method patenting 

in every art.

The court’s reasoning was grounded in precedent,

such as that articulated in Benson 7 , Flook 8 

and Diehr 5 . The court held that the claims at 

issue were unpatentable because allowing Bilski 

“to patent risk hedging would pre-empt use of 

this approach in all fields, and would effectively 

grant a monopoly over an abstract idea.” The 

court did not formulate a new test but instead 

held “precedents establish that the machine-or-transformation

test is a useful and important 

clue, an investigative tool” and nothing more. It 

is therefore clear that the machine-or-transformation 

test is a nonexclusive option for lower 

courts, in addition to the tests set forth in the 

court’s earlier decisions. 

The guidance set forth in Benson, Flook and 

Diehr should therefore be carefully considered 

and revisited. Briefly, in Benson, the patent 

sought related to an algorithm that converts 

numbers from binary-coded decimal form 

into pure binary form, which arguably could 

be applied to specific computer applications. 

The Supreme Court held that the recited 

algorithms were not patentable because they 

were drawn to abstract ideas, were not tied 

to a particular machine or apparatus and did 

not change articles or materials to a “different 

state or thing.” The court found it important to 

determine whether, assuming the algorithm to 

be patentable, patenting of the invention would 

pre-empt use of the mathematical formula. In 

Flook, the Supreme Court held that a method 

for updating alarm limits in catalytic conversion 

processes, which recited a mathematical algorithm for computing an updated alarm limit from measured present values of variables, was not patent eligible. The court held that the identification of a limited category of useful post-solution applications of a formula does not make an otherwise unpatentable formula patentable because a process itself, not merely the mathematical algorithm, must be new and useful in order to meet the requirements for patentability. In Diehr, the court addressed a process for molding uncured synthetic rubber into a cured product. The claims were directed to a method that constantly measured the actual temperature inside a mold. The court held that the process constituted patentable subject matter under 35 USC §101 because the transformation and reduction of an article “to a different state or thing” is one clue to the patentability of a process claim that does not include a specific machine. In this instance, the court determined that the invention manifested the transformation of an article, uncured synthetic rubber, into a different state or thing. Although the invention used a well-known mathematical equation, the court remarked that the applicants did not seek to pre-empt the use of the equation.

Table 1 Amicus summary (selected) in Bilski v. Kappos

Amicus | Industry or group represented | Summary
Novartis Corp. 15 | Health care solutions; pharmaceutical | Machine-or-transformation test unduly narrows the scope of diagnostic process claims. If upheld, the court should clarify that the test is not the dispositive standard.
Caris Diagnostics, Inc. 16 | Personalized medicine; tailoring therapeutics for individual patients using biomarkers | Machine-or-transformation test is not the exclusive test for patent eligibility of processes. Many diagnostic tests do not involve a machine or transformation.
Georgia Biomedical Partnership, Inc. 17 | Life sciences | Machine-or-transformation test is too rigid. Precedent is flexible and permissive.
University of South Florida 18 | University; research facility | Only presents arguments for the first question presented. Machine-or-transformation test excludes from patent eligibility certain processes that Congress intended to be patent eligible.
Ananda Chakrabarty 19 | University medical research | Machine-or-transformation test finds no support in the statute and is bad policy.
Prometheus Laboratories 20 | Manufacturer of pharmaceutical, medical treatment and diagnostic processes | Court’s interpretation of section 101 may have significant ramifications beyond business methods and may adversely affect the field of medical diagnostic and treatment processes.
Monogram Biosciences, Inc. et al. 21 | Emerging field of personalized medicine, using molecular diagnostic tests to correlate genetic and molecular biomarkers with clinically useful disease characteristics | Federal Circuit erred in holding that a process must be tied to a particular machine or transformation. This should not be the sole test. Nonphysical processes should not be excluded.
Medtronic, Inc. 22 | R&D of medical technology | Machine-or-transformation test would adversely affect medical technology innovation. Such a test would render significant medical advances patent ineligible.
Pharmaceutical Research and Manufacturers of America 23 | Pharmaceutical and biotechnology industry | Court should not adopt a new test for the boundaries of section 101. Medical processes have long been protected.
Biotechnology Industry Organization et al. 11 | Biotech and medical technology industries | Bilski test is not appropriate for determining patent eligibility of biotechnology and medical technology under section 101.
Knowledge Ecology International 24 | Advocate of new incentive and financing models for biomedical information | It is not necessary to fashion an overly broad definition of patentable subject matter merely to save medical innovations from an imagined and speculative danger.
Adamas Pharmaceuticals et al. 25 | Biomarkers and pharmaceuticals | Problematic business method patents should be eliminated. Machine-or-transformation test violates NAFTA and the 1994 TRIPS Agreement. This test directly over-rules Congress’s choice (35 USC section 287(c)) to maintain broad subject-matter coverage for health care–related technology.
American Medical Association et al. 26 | Medical profession; physicians and geneticists | Bilski’s claims are not directed to technology. Machine-or-transformation test must remain secondary and cannot supplant this court’s requirement that claims address a technology or the court’s pre-emption standard. Machine-or-transformation test must be allowed to vary with each particular case.

Impact on life science technologies 

In Bilski, four Supreme Court justices unequivocally 

indicated that nascent technologies, such as 

biotech and pharmaceutical processes, are patent 

eligible. This plurality expressed appreciation for 

technological progress and acknowledged that 

“unforeseen innovations such as computer programs” 

are patent eligible 9 . Justice Kennedy reasoned 

that the machine-or-transformation test 

may be an appropriate test for evaluating the patent 

eligibility of processes of the Industrial Age 

but should not be the sole test for newer types 

of inventions, such as medical diagnostic techniques. 

Interestingly, Justice Kennedy was careful 

to point out that he was “not commenting 

on the patentability of any particular invention, 

let alone holding that any of the above-mentioned 

technologies from the Information Age 

should or should not receive patent protection.” 

Regarding limiting interference with the development 

of nascent technologies, such as biotechnology 

and biopharmaceuticals, the court 

indicated that some types of inventions “raise 

special problems in terms of vagueness and suspect 

validity” and could “put a chill on creative 

endeavor and dynamic change.” 

In dramatic contrast, however, Justice 

Stevens’ concurrence (joined by Justices 

Breyer, Ginsburg and Sotomayor), in a separate 

47-page opinion, “strongly disagree[d] 

with the court’s disposition of this case.” 

Justice Stevens expressed great concern that 

the court “never provides a satisfying account 

of what constitutes an unpatentable abstract 

idea” and indicated that business methods

are per se unpatentable even though

Bilski’s claims and application materials presented 

concrete parameters that may have 

amounted to more than an abstract idea or 

generalized concept. Justice Stevens cited 

English and early American patent jurisprudence 

and legislation as supportive of the 

opinion, concluding that the scope of patent-eligible

subject matter is “broad” but not limitless 

because, according to history, neither the 

patent statute nor patent law was intended to 

include business methods. Interestingly, biotech 

or pharmaceutical processes were not 

differentiated from business methods in the 

opinion, and it remains unclear to what extent 

such inventions could be distinguished, sufficient 

to survive Stevens’ per se ban. 




The justices joining Stevens’ concurrence agreed with the majority

that the machine-or-transformation test was 

not the exclusive test for method claim patentability, 

but they went further, indicating that 

business methods are categorically excluded 

from patentable subject matter. Justice Stevens 

indicated that the court should have held “that 

Petitioners’ claim is not a ‘process’ within the 

meaning of Section 101 because methods of 

doing business are not, in themselves, covered 

by the statute.” Regarding the majority opinion’s 

holding that the patentability of business 

methods was clear from a reading of the statute,

Stevens asserted that Congress did not explicitly 

state that it was amending and expanding 

the patent statute to include business methods; 

thus, he wrote, it was improper for the court to 

make such a presumption. Justice Stevens did 

not indicate how business method patents are 

categorically distinct from other forms of patent 

protection (for example, life science processes 

or therapeutic processes) but rather expressed 

“serious doubts” about whether business 

method patents are needed to encourage business 

innovation. It is unclear to what extent a 

safe harbor defense for those accused of infringing

a business method claim applies to biotech

or pharmaceutical businesses. The concurrence

therefore encompasses life science methods and 

Stevens’ logic applies equally well to biotech and 

pharmaceutical method patents vis-à-vis therapeutic 

innovations, making it critical for the 

industry to consider how each of its process

inventions encourages medical innovation.

Justice Breyer filed a separate concurring 

opinion, joined by Justice Scalia, indicating 

that agreement was reached by all of the justices on at least four points: (i) the statute is broad but has some narrow limits; (ii) the machine-or-transformation test is a useful test; (iii) the machine-or-transformation test is not to be misunderstood as the governing test; and (iv) by no means is everything that produces a “useful, concrete and tangible result” a patent-eligible process.

Figure 2 Amicus briefs from the biotech and pharma sector supporting or opposing the Bilski machine-or-transformation (MoT) test. Chart values: 12 and 1; bar labels: against MoT test, support MoT test.

Regarding the breadth of patent-eligible subject 

matter, Justice Breyer considered the issue 

at oral argument wherein he indicated, “…every 

successful businessman typically has something. 

His firm wouldn’t be successful if he didn’t have 

anything others didn’t have…—and it’s new, too, 

and it’s useful, made him a fortune—anything 

that helps any businessman succeed is patentable 

because we reduce it to a number of steps, 

explain it in general terms, file our application, 

granted…” to which the attorney answered yes, 

what was described by Justice Breyer is potentially

patentable. The Justice was also concerned that 

by simply assigning a set of instructions to a 

computer, and including the computer in the 

patent, an otherwise unpatentable process 

would be rendered patentable, asking, “how 

you are going to later, down the road, deal with 

the situation of all you do is get somebody who 

knows computers, and you turn every business 

patent into a setting of switches on the machine 

because there are no businesses that don’t use 

those machines.” This concern was directly 

addressed by Judge Rader, in the Bilski dissent 

at the Federal Circuit, wherein he focused the 

court not on patent ineligibility but rather on 

the fundamental inquiry of determining if an 

invention was worthy of patent protection (e.g., 

if an invention is novel and not obvious). 

Important pending life sciences cases 

Following the Federal Circuit’s decision in 

Bilski, several cases were decided based solely 

on the machine-or-transformation test. Parties

whose patent claims were held to be invalid 

under Bilski will take advantage of the change 

in law and seek reversal of these decisions. 

One such case is Association for Molecular 

Pathology v. USPTO (hereinafter, AMP), 

wherein the patent claims at issue are related to 

isolated DNA containing all or portions of the 

BRCA1 and BRCA2 gene sequences and methods

for comparing or analyzing BRCA1 and 

BRCA2 gene sequences to identify the presence 

of mutations correlating with a predisposition 

to breast or ovarian cancer 10 . In a decision that 

radically changed the law, the district court held that

the step of isolating or purifying DNA does not 

sufficiently change the genetic sequence found 

in nature to make a claim to the gene per se 

patent eligible and that comparisons of DNA 

sequences are abstract mental processes, and 

thus not patent eligible. The court discussed 

abstract ideas, referring to the Federal Circuit’s 

opinion in Bilski, and applied the machine-or-transformation

test to invalidate the process 

claims. In deciding AMP, the court discussed 

and distinguished another critical case, 

Prometheus Laboratories v. Mayo 11 . 

Prometheus Laboratories owns patents 

covering a method to optimize dosage of two 

drugs useful for autoimmune diseases, which 

involves administering a drug at a certain dosage,

detecting the concentration of certain metabolites 

and then comparing the value to a preset 

threshold value and subsequently increasing 

or decreasing the drug dosage accordingly. 

The Federal Circuit considered that this diagnostic 

process based on a correlation between 

drug metabolite levels and drug efficacy and

toxicity was patent eligible because, consistent 

with In re Bilski, a claimed process is patent-eligible

if the claimed process is transformative 

(e.g., citing the administering step and various 

chemical and physical changes of the drug’s 

metabolites that enable their concentrations 

to be determined). The court reasoned that 

determining the levels of drug metabolites was 

per se transformative because drug metabolite 

levels cannot be determined by mere inspection. 

And because these transformations were 

central to the invention, according to the court, 

the process was found to be patent eligible and 

the patent was held valid. The court provided 

no guidance as to when the interaction of a 

drug metabolite with the human body is a 

natural phenomenon. 
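The dosing method described above can be read as a simple feedback rule: administer the drug, measure a metabolite, compare the value against a preset threshold and adjust the dose. The following is a minimal sketch of that logic only, not the claim language or any clinical guidance. The function name, the use of a two-sided threshold window and the numeric values in the usage lines are hypothetical and chosen purely for illustration.

```python
def recommend_adjustment(metabolite_level: float,
                         lower_threshold: float,
                         upper_threshold: float) -> str:
    """Return 'increase', 'decrease' or 'maintain' for a measured level.

    Assumption (not from the article): the preset threshold is modeled
    as a window with a lower and an upper bound.
    """
    if lower_threshold > upper_threshold:
        raise ValueError("lower_threshold must not exceed upper_threshold")
    if metabolite_level < lower_threshold:
        # Level below the window: the rule indicates a higher dose.
        return "increase"
    if metabolite_level > upper_threshold:
        # Level above the window: the rule indicates a lower dose.
        return "decrease"
    return "maintain"


if __name__ == "__main__":
    # Hypothetical measurements and thresholds, in arbitrary units.
    for level in (150.0, 300.0, 450.0):
        print(level, recommend_adjustment(level, 200.0, 400.0))
```

Written this way, the comparison step is plain arithmetic. That is why the eligibility question in the case turned on whether the administering and measuring steps are transformative, rather than on the comparison itself.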

In Bilski, Justice Kennedy discusses the technological 

aspects of the Industrial Age and the 

Information Age, suggesting that the differences 

between the two periods provide insight




into how inventions are reduced to “physical 

or tangible form 12 .” Justice Kennedy seemed 

to be concerned that adoption of a single test 

—the machine-or-transformation test—could 

retard innovation by “creating uncertainty as 

to the patentability of … advanced diagnostic 

medicine techniques…”. This issue is addressed 

again later, where Justice Kennedy refers to “the 

tension, ever present in patent law, between 

stimulating innovation by protecting investors 

and impeding progress by granting patents 

when not justified by the statutory design 13 .” 

This tension is most evident in the field of biotechnology 

and biopharmaceuticals. 

In deciding AMP, the district court looked to 

Prometheus Labs 10 in determining what constitutes 

a ‘transformation’ in the biotechnological 

arts; for example, if an alleged transformation 

is mere preparatory “data gathering,” it falls outside 

the “central” focus of the recited method. 

Myriad’s patents were directed to methods of 

“analyzing” or “comparing” isolated or purified 

DNA, not host DNA. Although the district 

court recognized the great difficulty in isolating 

the subject DNAs, the court characterized 

this technical accomplishment as a mere “data-gathering

step,” thus invalidating the claimed 

methods as being directed to patent-ineligible 

subject matter. The district court’s new patent 

eligibility test is that to be patent eligible, 

isolated material must be “markedly different” 

from its naturally occurring counterpart. The 

court referred to the Supreme Court landmark 

decision in Diamond v. Chakrabarty 14 as precedent 

but did not define a “markedly different” 

invention. However, the district court went further 

and applied a “fundamental qualities” test 

to invalidate Myriad’s isolated DNA composition 

claims, indicating that a naturally occurring 

DNA’s “fundamental quality” is to contain “the 

physical embodiment of biological information,” 

which is the same “fundamental quality” 

as isolated DNA. The court appeared to reason 

that because both forms of DNA shared this 

quality, the isolated DNA was not sufficiently 

different from the naturally occurring DNA to 

render it patent eligible—a sweeping conclusion 

that draws into question the validity of thousands 

of patents susceptible to the application 

of similar logic. 

It is also important to remember the questions 

raised by the court in Lab Corp. v. Metabolite 6

in attempting to differentiate patent eligible 

subject matter from ineligible biotech inventions. 

In this case, the Supreme Court declined 

to explicitly consider the issue of the patent

eligibility of claims to a method for detecting the

deficiency of cobalamin or folate by measuring 

the level of homocysteine in body fluids. The 

Federal Circuit held that the claims were valid 

but did not address the issue of patent eligibility 

under 35 USC §101. The Supreme Court then 

declined to review the decision, with Justices 

Stevens, Breyer and Souter dissenting. The 

dissenting opinion maintained that the claims 

were invalid because they recited only natural 

phenomena, which are not patent eligible. The 

dissent was compelled by public policy considerations 

and indicated that if the correlations 

between metabolite levels and disease were 

patent eligible, physicians may not be able to 

exercise their best judgment or might waste 

time, and the cost of healthcare would increase,

a result that would outweigh the value of protecting

the invention at issue. 

Conclusion 

In Bilski, the Supreme Court expanded the 

forms of biotech and pharmaceutical inventions 

that are patent eligible in the US, holding 

that the machine-or-transformation test is 

not the sole test for patent eligibility in the US 

and the types of patent-eligible subject matter

are vast. But the Court narrowly avoided a 

catastrophe for the biotech and pharmaceutical 

industry. A majority of the court declined 

to adopt the view that “new technologies may 

call for new inquiries” directed to patent eligibility, 

which would adapt patent law to inventions 

of the Information Age. While the court 

unanimously held that Bilski’s process claims 

were not patent eligible, it indicated that the 

machine-or-transformation test may be useful 

for determining whether a method claim 

meets the threshold requirements of eligibility. 

Thus, universities and companies should consider 

providing sufficient evidence to satisfy the 

machine-or-transformation test when seeking 

to obtain patents. 

Although a 5–4 majority held that business 

methods are not categorically unpatentable, the 

court was a single vote away from denying business 

methods patent protection. This is chilling 

in view of the implications of such a ruling for 

other areas of technology, such as biotech and 

pharmaceutical method patenting. The court 

refrained from articulating a generic test that 

would distinguish a patentable method from 

an abstract idea. It remains to be seen how the 

USPTO, district courts and the Federal Circuit 

will proceed to define a new standard of patent 

eligibility designed to accommodate future 

innovations such as those emerging in the life 

sciences. The courts must provide guidance 

to the biotech industry as to what is patentable. 

What is clear, however, is that based on 

the court’s determination that Bilski’s claims 

were unpatentable because they were directed 

to abstract ideas, it is essential for the pharmaceutical 

and biotech industry to pursue and 

obtain method claims of varying scope and 

pre-emptively evaluate any available evidence 

to address future attacks on their intellectual 

property based on Bilski, at least until a medically 

important “abstract idea,” which could 

include an otherwise patentable invention 

under US law, is distinguished by the courts or 

the legislature. 

Acknowledgments

The views expressed are solely the author’s and do not 

represent those of Sughrue Mion and its clients, and are 

subject to changes in the art and law. The author thanks 

An Kang Li and Stuart Levy for their contributions.


The author declares no competing financial interests. 

1. Langer, E.S. Realistic expectations likely to prevail in 

2010. Gen. Eng. News (March 1, 2010). 

2. In re Bilski, 545 F.3d 943 (Fed. Cir. 2008) (en banc). 

3. Simmons, W.J. Nat. Biotechnol. 27, 245–248 

(2009). 

4. In re Bilski, 545 F.3d at 954. 

5. Diamond v. Diehr, 450 US 175, 185 (1981). 

6. Lab Corp. v. Metabolite (Fed. Cir. 2004). 

7. Gottschalk v. Benson, 409 US 63 (1972). 

8. Parker v. Flook, 437 US 584 (1978). 

9. Bilski v. Kappos, 561 US (2010) at 8. 

10. Association for Molecular Pathology et al. v. USPTO 

et al. 1:09 cv-04515 (SDNY). 

11. Prometheus Laboratories v. Mayo Collaborative Servs., 

581 F.3d 1336 (Fed. Cir. 2009). 

12. 561 US (2010) - Kennedy, p. 9. 

13. 561 US (2010) - Kennedy, pp. 12–13. 

14. Diamond v. Chakrabarty 447 US 303 (1980). 



patents 


Recent patent applications in proteomics 

Patent number: US 20100155243
Description: A method for separating a sample, involving introducing the sample into a microchannel formed in a module and separating the sample into sub-samples according to isoelectric point and into protein components based on electrophoresis; useful in, e.g., proteomics.
Assignee: Baraniuk JN, Schneider TW
Inventors: Baraniuk JN, Schneider TW
Priority application date: 2/26/2003
Publication date: 6/24/2010

Patent number: US 20100129842
Description: Proteomic analysis of polypeptides for biomarker analysis, involving reacting two polypeptide samples, each having reactive analytes, with different labeling reagents of a set of labeling reagents, mixing, digesting with enzyme and performing mass analysis.
Assignee: Life Technologies (Carlsbad, CA, USA)
Inventors: Coull JM, Pappin DJC, Purkayastha S
Priority application date: 1/5/2004
Publication date: 5/27/2010

Patent number: WO 2010052510
Description: A method of diagnosing S-adenyl-l-homocysteine hydrolase deficiency involving determining qualitative-quantitative blood plasma proteomic profile and diagnosing S-adenyl-l-homocysteine hydrolase deficiency based on data obtained by the subject method.
Assignee: Rudjer Boskovic Institute (Zagreb, Croatia)
Inventors: Cindric M, Hock K, Kraljevic Pavelic S, Sedic M
Priority application date: 11/5/2008
Publication date: 5/14/2010

Patent number: CN 101696238
Description: The total protein extract of a plant, and a method for its preparation, comprising phenol and the reducing agents mercaptoethanol or dithiothreitol; used for proteomics research on plant tissue samples.
Assignee: Guangdong Academy of Agricultural Sciences Crop Research Institute (Guangdong, China)
Inventors: Chen X, Liang X, Zhang E
Priority application date: 10/27/2009
Publication date: 4/21/2010

Patent number: JP 2010078455
Description: A method for isolating a peptide, e.g., disease marker protein, from blood, involving performing multidimensional column chromatography using an amphoteric ion column to isolate the peptide and performing protein mass spectrometry.
Assignee: Japan Science and Technology Agency (Saitama, Japan)
Inventors: Asajima M, Fukuda H, Into A, Kurisaki A
Priority application date: 9/26/2008
Publication date: 4/8/2010

Patent number: WO 2010035129
Description: An apparatus for separating constituents of a complex protein mixture for proteomic analysis, comprising the separation of elements having chemical-physical features such that they can capture proteins belonging to the determined homogeneous group by adsorption.
Assignee: National Research Council (Rome)
Inventors: Boccardi C, Citti L, Mercatanti A, Parodi O, Rocchiccioli S
Priority application date: 9/29/2008
Publication date: 4/1/2010

Patent number: WO 2010026742
Description: A liquid chromatograph for proteomic analysis that injects a sample solution into an injection valve through an injection port that is arranged in the flow path of the injection valve.
Assignee: GL Sciences (Tokyo)
Inventors: Uzu H, Zhou X
Priority application date: 9/2/2008
Publication date: 3/11/2010

Patent number: WO 2010011860
Description: A method for determining if a subject of interest has pre-diabetes or diabetes or is at risk for developing pre-diabetes or diabetes, or for monitoring the efficacy of a therapy, comprising comparing a proteomic profile of a test sample with a reference sample.
Assignee: Diabetomics (Beaverton, OR, USA)
Inventors: Nagalla SR, Paturi VR, Roberts CT
Priority application date: 7/23/2008
Publication date: 1/28/2010

Patent number: WO 2010010108
Description: A new cell with no or low endogenous dihydrofolate reductase (DHFR) levels comprising at least two heterologous vector constructs; useful as a model cell for production cell proteomics and for manufacturing proteins.
Assignee: Boehringer Ingelheim Pharma (Ingelheim, Germany)
Inventors: Becker E, Florin L, Kaufmann H, Studts JM
Priority application date: 7/23/2008
Publication date: 1/28/2010

Patent number: US 7653493
Description: A system for automatic mass spectroscopy analysis of a group of proteomic samples, e.g., peptides, comprising a unit for detecting ions, ion data processing units to receive the ion data and a material characterization processor.
Assignee: Stanford University (Palo Alto, CA, USA)
Inventors: Brown M, Chungfat N, Dutta S, Mathewson S, Wang EW
Priority application date: 2/24/2006
Publication date: 1/26/2010

Patent number: JP 2010014689
Description: A method for the determination of melanoma, involving detecting or quantifying a melanoma marker gene, e.g., serum amyloid A2 gene, or melanoma marker protein, e.g., serum amyloid A2 protein, in a biological sample extracted from a human.
Assignee: Shizuoka Ken
Inventors: Akiyama Y, Takigawa M
Priority application date: 6/6/2008
Publication date: 1/21/2010

Source: Thomson Scientific Search Service. The status of each application is slightly different from country to country. For further details, contact Thomson Scientific, 1800 

Diagonal Road, Suite 250, Alexandria, Virginia 22314, USA. Tel: 1 (800) 337-9368 (http://www.thomson.com/scientific). 


news and views 

Can HIV be cured with stem cell therapy? 

Steven G Deeks & Joseph M McCune 

Transplantation of human hematopoietic stem cells engineered to lack the viral coreceptor CCR5 confers resistance 

to HIV infection in mice. 


Antiretroviral therapy has transformed the 

treatment of HIV infection, but, despite its profound 

successes, it will not halt the relentless 

advance of the epidemic. Against this sobering 

reality, several promising, recent developments 

in the basic-science arena have led 

HIV researchers to envision new therapeutic

approaches that would completely eradicate 

the virus, effectively ‘curing’ HIV disease. In 

an exciting and impressive display of data 

published in this issue, Holt et al. 1 provide a 

scientific bellwether for the practical implementation 

of one such strategy. They show 

that CCR5, a human gene often required for 

HIV to enter target cells, can be effectively and 

permanently disrupted in long-lived, multilineage, 

human hematopoietic stem cells (HSCs). 

When introduced into mice, these cells generate 

an apparently intact human immune 

system that is resistant to subsequent infection 

with HIV (Fig. 1a). This result raises the 

intriguing possibility that HIV-infected individuals 

might be cured with a one-time infusion 

of autologous, gene-modified HSCs. 

The introduction of combination antiretroviral 

regimens against HIV in the mid-1990s 

was undoubtedly one of the great triumphs 

of modern medicine. Almost overnight, those 

who could receive and adhere to the therapies 

gained a new lease on life. But the passage 

of time has revealed the limitations of 

these regimens. Because HIV DNA persists 

as an integrated genome in long-lived cellular 

reservoirs, current antiretroviral drugs are 

unlikely to prove curative 2 . In addition, the 

therapies require life-long adherence, which 

many find challenging, and are often associated 

with some short-term and long-term toxicity. Moreover, although they suppress HIV replication in a potent, durable manner, they do not restore health; for reasons that remain unknown, treated HIV disease is attended by chronic inflammation, persistent T-cell dysfunction and a shortened life expectancy 3,4 . Finally, and perhaps most importantly, antiretroviral therapies and their management are expensive and hard to deliver on a worldwide basis. It is now apparent that the number of HIV-infected people will continue to eclipse the number that can be successfully treated. To stop the epidemic and to provide care for all, a fundamentally different approach is needed.

Steven G. Deeks and Joseph M. McCune are in the Department of Medicine, University of California, San Francisco, California, USA.

e-mail: sdeeks@php.ucsf.edu and mike.mccune@ucsf.edu

Gene therapy with HSCs 

The concept of HSC-based gene therapy for 

HIV disease emerged in the epidemic’s first 

decade, when effective antiretroviral regimens 

were nonexistent. Multiple advances in delivering 

and expressing transgenes in eukaryotic 

cells suggested that therapeutic applications 

were within reach 5 . Baltimore coined the term 

“intracellular immunization” to describe the 

introduction of HIV resistance genes into 

HSCs to allow long-term repopulation of the 

host with progeny cells that would be impervious 

to HIV 6 . By the late 1980s, startup biotech 

companies were isolating and preparing 

human HSCs for transplantation, devising 

techniques and vectors to genetically modify 

the cells, and conducting preclinical testing 7 . 

During the same period, studies of HIV 

pathogenesis were generating data that begged 

for a therapeutic approach that went beyond 

antiretroviral drugs. On the one hand, it 

became clear that CD4+ T-cell depletion, the

hallmark of HIV disease, was caused not simply 

by destruction of late-stage CD4+ T effector

cells but also by the host’s inability to maintain 

progenitor cells, including HSCs, intrathymic 

T progenitor cells and central memory T cells 

in the periphery 8 (Fig. 1b). On the other hand, 

it was recognized that HIV can persist within 

multiple lineages of long-lived cells, including 

T cells and cells of the myeloid lineage (some of 

which appear to be progenitor cells) 2,9 . Taken 

together, these observations underscored the 

need to confer HIV resistance to both progenitor 

cells and their progeny. 

Early attempts to engineer HIV resistance 

into hematopoietic progenitor cells encountered 

insurmountable hurdles: the scientific and 

practical constraints of HSC-based therapies 

were substantial; protocols for genetic modification 

of HSCs were inefficient and cytotoxic; 

the preclinical animal models were inadequate; 

and the choice of anti-HIV genes was driven 

more by convenience (and/or patent considerations) 

than by data 7 . Moreover, it proved 

difficult to devise a business model that could 

support the introduction of such a dramatically 

different, untested and potentially toxic form of 

therapy into the clinic. More recently, however, 

two important developments have prompted 

a reevaluation of HSC-based therapy for HIV: 

a critical target—the cell-surface receptor 

CCR5—was identified, and an HIV-infected 

individual was reported to be virus free in 

the absence of antiretroviral medications 20 

months after receiving a transplant of CCR5- 

defective allogeneic HSCs 10 . 

Targeting the Achilles’ heel of HIV 

To enter cells, HIV must bind to either CCR5 or 

CXCR4, chemokine receptors present on many 

immune cells 11 . The vast majority of transmitted 

viruses use CCR5 (R5 variants). As the disease 

progresses, HIV evolves and often, but not 

always, expands its co-receptor preference to 

include CXCR4 (X4 variants). A small fraction 

of people carry a 32-base-pair deletion in the 

CCR5 gene, leading to a truncated gene product, 

CCR5 ∆32. Those who are heterozygous 

for CCR5 ∆32 have delayed disease progression 

after they acquire HIV, whereas homozygotes 

rarely acquire HIV 11 . Although lack of 


CCR5 may be associated with increased risk 

of developing serious sequelae of some uncommon

infections 12 , it does not seem to affect life 

expectancy and may even be associated with a 

reduced risk of certain inflammatory diseases. 

Once the role of CCR5 became clear in the late 

1990s, the pharmaceutical industry devoted 

tremendous resources to the development 

of small-molecule inhibitors, one of which, 

maraviroc (Selzentry), is highly effective, well tolerated

and now FDA approved. 

This important set of observations inspired 

several groups to pursue CCR5-targeted gene 

therapy 13,14 . One highly innovative approach 

relied on engineered zinc-finger nucleases 

specific for the CCR5 gene 15 . Such ‘molecular 

scissors’ can be delivered to cells ex vivo using 

methods such as integrase-defective lentiviral 

vectors, adenoviral vectors and plasmid DNA 

nucleofection. After specific binding of a pair 

of zinc-finger nucleases to the CCR5 gene, a 

double-stranded DNA break is introduced 

and then repaired by pathways that include 

error-prone nonhomologous end-joining. This can create a permanent gene disruption that is passed to daughter cells in the absence of persistent transgene expression. The end result is the functional disruption of the CCR5 gene. Previous work using this approach demonstrated its feasibility in human peripheral blood CD4+ T cells 15 , and unpublished data from a phase 1 trial suggest that autologous CD4+ T cells modified in this way can be reinfused safely into HIV-infected individuals (P. Tebas, University of Pennsylvania, personal communication). Although of great interest, this strategy does not disrupt CCR5 in HSCs and thus would not enable the long-term generation of both T and myeloid-lineage cells resistant to HIV infection. Evidence supporting such a leap came from another quarter.

Figure 1 Reconstitution of an HIV-resistant lymphoid and myeloid system in an experimental model. (a) Holt et al. 1 isolated human hematopoietic stem/progenitor cells (HSPCs) and used zinc-finger nucleases (ZFNs) to disrupt the CCR5 gene, which is often required for the entry of HIV into target cells. Mice that were successfully engrafted with CCR5-disrupted HSPCs tolerated infection with HIV, whereas those engrafted with unmodified HSPCs exhibited loss of CD4+ T cells and high-level viremia. (b) Long-lived, multilineage hematopoietic stem cells (HSCs) give rise to common lymphocyte progenitors (CLPs) and progenitors of the myelo-erythroid (M/E) lineages. CLPs move through the thymus and differentiate through a series of stages, from CD3–CD4+CD8– intrathymic T progenitor (ITTP) cells to CD3+/–CD4+CD8+ double positive (DP) thymocytes to CD3+ thymocytes that are single positive for CD4 (SP4) or CD8 (SP8) to circulating naïve (N), effector (E), and memory (M) CD4+ or CD8+ T cells. All of the cell stages colored in red can be directly or indirectly disabled by HIV infection.

The Berlin patient: an instructive N of 1 

For all of those engaged in the care and treatment 

of patients with HIV disease, the world 

changed in 2009 with the remarkable story of 

a stably treated, HIV-infected individual—the 

‘Berlin patient’—who developed acute myeloid 

leukemia and was transplanted with HSCs from 

a human leukocyte antigen–matched, homozygous 

CCR5 ∆32 donor 10 . Combination antiretroviral 

therapy was discontinued the day before 

the transplant. Twenty months later, HIV could 

not be detected in any of the patient’s tissues 

examined, even when very sensitive techniques 

were used. Given disappointing treatment outcomes 

in the past, the HIV research community 

is hesitant to use the word ‘cure’, but this 

single case could very well be the first example 

to fit the bill. 

It is important to emphasize that this road 

to a cure was arduous and will not be available 

to the vast majority of patients. The Berlin 

patient underwent fully ablative conditioning

with a potentially lethal regimen that included 

fludarabine (Fludara), Ara-C, amsacrine 

(Amerkin, Amsidyl, Amsidine), cyclosporin, 

mycophenolate mofetil (CellCept), antithymocyte 

globulin and 4 Gy of total body irradiation. 

Graft-versus-host disease developed 

during the post-transplant period. Owing to 

recurrent acute myeloid leukemia, a second 

stem cell transplantation using cells from the 

same donor was performed one year after the 

first transplant, which again required exposure 

to myeloablative therapy, including irradiation. 

No one believes that this approach 

will soon be used beyond the highly unusual 

indications for which allogeneic transplantations 

are normally performed. However, the 

example of the Berlin patient does provide a 

strong rationale for the development of CCR5- 

targeted stem cell therapy. 

This case also provides fascinating insights 

into HIV pathogenesis, some of which may be 

relevant to future attempts at HIV eradication. 

For example, it is not entirely clear why HIV 

did not rebound after combination antiretroviral 

therapy was discontinued. According to 

genotypic assays, the patient likely harbored 

a minority (2.9%) of X4 variant viruses. Also, 

host-derived CCR5-expressing myeloid cells, 

which are permissive for HIV infection, persisted 

for at least five months after the transplant. 

Given this volatile combination of 

residual CXCR4-tropic virus and long-lived 

CCR5-expressing targets, HIV replication and 

spread should have continued even as the rest 

of the hematopoietic system was being replaced 

by homozygous CCR5 ∆32 donor cells. 

There are at least two possible explanations 

for this surprising result. First, the low-level 

X4 variant may have been a poorly fit dual-tropic

virus that was dependent on CCR5 for 

replication, whereas the number of residual 

CCR5-expressing myeloid cells was too low to 

support systemic replication of the CCR5-tropic 




variants. Second, it is possible that the myeloablative 

preparative regimen itself contributed to 

the cure by destroying latently infected T and 

myeloid cells and by reducing the numbers of 

susceptible activated CD4+ T cells (HIV more

readily infects activated rather than resting 

target cells). It is also possible that the ongoing 

graft-versus-host disease may have acted to 

clear residual susceptible target cells. Detailed 

exploration of these and other mechanisms will 

surely provide profound insights into almost 

any possible intervention aimed at HIV eradication 

in the future, and should be pursued. 

Disruption of CCR5 in autologous HSCs 

The strategy of Holt et al. 1 is related to the 

treatment received by the Berlin patient but 

is potentially relevant to a larger number of 

patients (Fig. 1a). The authors obtained human 

CD34+ hematopoietic stem/progenitor cells

(HSPCs) (a population enriched in HSCs) 

from umbilical cord blood and stimulated them 

to divide with Flt-3 and thrombopoietin. The 

cells were nucleofected with plasmids expressing 

CCR5-specific zinc-finger nucleases. 

A mean of 17% of the cells were successfully 

modified, 5–7% of which were estimated to 

be homozygous CCR5–. Modified or unmodified

CD34+ cells were then transplanted

into nonobese diabetic/severe combined

immunodeficient/interleukin 2rγ null

(NOD/SCID/IL2rγ null or NOG) mice, a model known

to support multilineage human hematopoiesis. 

As expected, mice engrafted with unmodified 

stem cells and subsequently challenged with 

CCR5-tropic HIV (Bal) showed high levels 

of viremia and loss of peripheral and tissue-based

human T cells. Remarkably, in animals

repopulated with CCR5-disrupted HSPCs, the

virus levels were lower and CD4+ T cells were

not depleted, either in the peripheral blood

or in the hematolymphoid tissues (e.g., bone

marrow, thymus, spleen and small intestine).

The preservation of human CD4+ T cells in

the experimental group was due to selection 

for multiple independent clones of successfully 

gene-modified cells. The frequency of 

cells containing evidence of CCR5 disruption 

increased to >80% in the peripheral blood 

and to >40% in multiple tissues by week 12 

of infection. 

These experiments raise a number of technical 

issues and derivative questions. For 

instance, do genetically modified HSCs confer 

benefit to a mouse that is already infected 

(the situation most closely approximating the 

therapeutic need in humans)? Does a

CCR5-disrupted hematopoietic compartment confer

protection against infection by X4 viral variants? 

Is the CCR5-disrupted immune system 

normal? Are there long-term toxicities that will 

become evident later? Is off-target cleavage by 

the zinc-finger nucleases a significant concern 

(e.g., the CCR2 gene may also be targeted by 

this nuclease 15 )? These and other issues can be 

resolved with further work. In the meantime, 

the data of Holt et al. 1 show convincingly that 

a relatively small number of gene-modified 

HSCs can be rapidly selected to ultimately 

confer resistance to HIV in vivo. 

Next steps 

If stem cell–based gene therapy for HIV is to 

become a reality in the clinic, a number of 

nontrivial theoretical and practical concerns 

must be addressed. First, in the current era, 

when clinicians are increasingly concerned 

about the ‘toxicity’ of ongoing viral replication, 

will patients and their healthcare providers be 

willing to allow HIV to replicate at high levels 

in the absence of antiretroviral therapy so that 

CCR5-deficient cells can be selected? There is 

now a growing consensus that HIV replication 

causes significant and perhaps irreversible 

harm to many organs, including those of the 

cardiovascular, renal, hepatic and neurologic 

systems 4 , so this approach must be assumed 

to carry some risk. 

Second, will a partially effective antiviral 

intervention (which is what the gene-modified

cells represent) select for the outgrowth of a 

resistant virus population, such as X4 variants? 

The history of HIV therapeutics is absolutely 

clear on this issue: if HIV is allowed to 

replicate in the presence of a selective pressure, 

it will find a way to survive. This concern 

is even more pressing as it is widely believed 

that X4 variants are more virulent than R5 

variants. Although X4 variants are only infrequently 

selected in patients treated with small-molecule

CCR5 inhibitors (e.g., maraviroc), 

this is only true when a fully suppressive regimen 

is used from day one. It is not likely that 

transplantation of gene-modified HSCs will 

be fully suppressive at first, particularly if partially 

myeloablative therapy is used. 

Third, will ablative therapy be needed to 

allow stem cell engraftment and, if so, will 

short- and long-term toxicity preclude its use 

in those most likely to be offered this intervention 

first? Those most in need of aggressive 

interventions typically have dual-tropic 

virus and are therefore unlikely to respond 

to any approach based on disruption of 

CCR5 (ref. 16). And with advanced disease, 

they have a paucity of HSCs and damaged 

hematopoietic microenvironments (such as 

bone marrow, thymus and lymph node) that 

would normally support the maturation of 

modified HSCs. 

Finally, the mechanism whereby HIV causes 

CD4+ T-cell depletion remains unclear 8.

Although HIV can clearly kill cells directly, 

many if not most cells in an HIV-infected individual 

die as a consequence of indirect viral 

effects. Generalized activation of the immune 

system, for example, is harmful to the function 

of T and myeloid cells and to the regeneration 

of multiple lineages. These indirect effects may 

persist even as the virus is driven to extinction 

by the gradual emergence of an HIV-resistant 

T-cell population. 

Reaching for blue sky 

Although the above concerns are daunting, 

the epidemic is not going to disappear, 

the science of stem cells is becoming more 

tractable, sociopolitical forces are forging 

new perspectives in healthcare, and now is 

not the time to stop. From our perspective, 

HSC-based gene therapy for HIV disease may 

make a significant impact on the worldwide 

epidemic if two goals can be met. First, it is 

essential to find a way to deliver these therapies 

to all in need, in a manner that is safe, 

affordable and generally available around 

the world. Many clever approaches to do this 

have been proposed in the past, and more 

will surely emerge. Second, the preclinical 

and clinical development of these strategies 

requires a sustainable financial model. Such 

a model may involve reprioritization of governmental 

efforts, creative plans to incentivize 

existing pharmaceutical and healthcare delivery 

systems, and global assistance programs 

motivated by a common desire for a world 

free of HIV. This may seem like a formidable 

exercise, but it is worth noting that if one-shot,

modified HSC-based gene therapy can 

be made efficacious and accessible in the context 

of HIV disease, similar approaches will 

likely be applicable to a host of other chronic 

diseases, infectious and otherwise. If so, the 

treatment paradigms of the future will look 

vastly different from today’s. In the same way 

that problems associated with the reliance on 

fossil fuels have stimulated the development of 

alternative strategies of energy delivery, so too 

may the ongoing crisis in the HIV epidemic 

spark novel approaches to the provision of 

healthcare in the future. 

Conclusion 

The progress in HIV therapeutics over the past 

15 years has been tremendous. The life expectancy 

of most people who present with HIV 

disease today in resource-rich regions is on 

the order of decades. Yet antiretroviral drugs 

have intrinsic limitations that are unlikely to 

be surmounted. What is needed, therefore, is 

a ‘game changer’, such as a cure for HIV infection 

or an effective vaccine. Could a one-shot 

manipulation of HSCs be the answer? We will 




not know unless we continue to move these 

new technologies into the clinic. Even if

CCR5-targeted gene therapy is not the ultimate solution,

human studies are certain to be highly 

informative with regard to HIV pathogenesis 

and human immunology. 


The authors wish to acknowledge amfAR, Project 

Inform, TAG and the AIDS Policy Project for 

supporting and stimulating cross-disciplinary 

discussion on the issues outlined in this commentary. 

The authors’ work that contributed to this review 

was supported by the National Institute of Allergy 

and Infectious Diseases (RO1 AI087145 and 

K24AI069994 to S.G.D. and R37 AI40312 and DPI 

OD00329 to J.M.M.), the University of California, 

San Francisco (UCSF) Center for AIDS Research 

(P30 MH59037), the UCSF Clinical and Translational 

Science Institute (UL1 RR024131), the Harvey V. 

Berneking Living Trust and amfAR. J.M.M. is a 

recipient of the National Institutes of Health (NIH) 

Director’s Pioneer Award Program, part of the NIH 

Roadmap for Medical Research. 

The authors declare competing financial

interests: details accompany the full-text HTML

version of the paper at http://www.nature.com/naturebiotechnology/.

1. Holt, N. et al. Nat. Biotechnol. 28, 839–847 (2010).

2. Siliciano, J.D. et al. Nat. Med. 9, 727–728 (2003).

3. Kuller, L.H. et al. PLoS Med. 5, e203 (2008).

4. Phillips, A.N., Neaton, J. & Lundgren, J.D. AIDS 22,

2409–2418 (2008).

5. Friedman, A.D., Triezenberg, S.J. & McKnight, S.L.

6. Baltimore, D. Nature 335, 395–396 (1988).

7. Rossi, J.J., June, C.H. & Kohn, D.B. Nat. Biotechnol.

25, 1444–1454 (2007).

8. McCune, J.M. Nature 410, 974–979 (2001).

9. McCune, J.M. Cell 82, 183–188 (1995).

10. Hutter, G. et al. N. Engl. J. Med. 360, 692–698

(2009).

11. Moore, J.P., Kitchen, S.G., Pugach, P. & Zack, J.A.

AIDS Res. Hum. Retroviruses 20, 111–126 (2004).

12. Glass, W.G. et al. J. Exp. Med. 203, 35–40 (2006).

13. DiGiusto, D.L. et al. Sci. Transl. Med. 2, 36ra43

(2010).

14. Shimizu, S. et al. Blood 115, 1534–1544 (2010).

15. Perez, E.E. et al. Nat. Biotechnol. 26, 808–816

(2008).

16. Hunt, P.W. et al. J. Infect. Dis. 194, 926–930 (2006).

Microarrays in the clinic

Guy W Tillinghast

The MicroArray Quality Control (MAQC) consortium has evaluated methods

for making clinically useful predictions from large-scale gene expression data.

Guy Tillinghast is at the Riverside Cancer Care

Center, Newport News, Virginia, USA.

e-mail: guy.tillinghast@rivhs.com

Clinical application of gene expression microarrays

1 and other ’omics technologies is widely

expected to usher in a new era of personalized

medicine. But although DNA microarrays are

beginning to be used in patient care 2,3 , progress

has been slow, in part because of analytic

challenges and concerns about accuracy and

reproducibility. In this issue, the MAQC consortium

presents the results of a large study,

MAQC-II 4 , to evaluate methods for building

genomic classifiers—software programs that

convert microarray profiles of an individual

sample into a prediction, such as membership

in a clinical class. The results show that

microarray algorithms can be reliable enough

to justify clinical application, at least within

certain contexts. More broadly, the findings

of MAQC-II on microarray classifiers may

be useful for analyzing data from other high-throughput

assays.

Existing clinical predictors have well-known

limitations, especially with respect to complex

diseases such as cancer. Given two individuals

who present identical clinical parameters, one

may respond to a therapy whereas the other may

not. In principle, genome-wide data should be 

able to discriminate between them. The most 

common goals of a clinical test are to make a 

diagnosis or to determine an appropriate therapy. 

In light of statistical considerations, these 

goals depend on the prevalence of a disease, 

suggesting that clinical DNA microarray tests 

will augment, and not supplant, other clinical 

information. Thus, a possible strategy would 

be to first use traditional clinical predictors to 

broadly identify patients who might benefit 

from a treatment, and to then use an expensive 

assay, such as a microarray, to eliminate 

those for whom the treatment is unlikely to 

be effective. 
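The dependence on prevalence can be made concrete with Bayes' rule. The sketch below is illustrative only: the sensitivity, specificity and prevalence figures are assumptions chosen for the arithmetic, not values reported by MAQC-II, and the function name is hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test result is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical assay: 90% sensitive and 90% specific (assumed values).
for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.2f} -> positive predictive value {ppv:.2f}")
```

With these assumed numbers, the same assay is right about a positive call roughly 8% of the time at 1% prevalence, 50% of the time at 10% prevalence and 90% of the time at 50% prevalence. This is one way to read the two-stage strategy described above: pre-selecting patients with conventional predictors raises the prevalence in the group that goes on to the expensive assay.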

Despite this promise, DNA microarrays have 

not been rapidly adopted in clinical practice. 

One reason is the noise that results from analyzing 

thousands of genes, which can lead to 

false predictions. Consequently, microarrays 

have been criticized because studies of the 

same clinical groups using different microarray 

measurements or analytic methods have 

often yielded dissimilar lists of differentially 

expressed genes. A second concern is the inherent 

error in the technology. Error stems from 

high background at the bottom of the dynamic 

range, saturation at the top of the dynamic 

range, and nonlinearity, at least with measurements 

of some transcripts. 

Many statistical methods have been developed 

to address these challenges, including 

approaches for grouping samples and genes, 

data normalization schemes to allow meaningful 

comparisons across samples, multiple testing 

procedures to select differentially expressed 

genes and ‘cross-validation’ methods for using 

samples to train prediction algorithms while 

reducing bias. These methods are applied 

sequentially to transform massive data sets of 

raw microarray gene expression profiles into 

clinically useful classifiers (Fig. 1a). As the 

optimal combination of methods is difficult 

to determine, MAQC-II sought to evaluate 

approaches to building classifiers. 
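As a rough illustration of how these steps chain together, the sketch below runs feature selection and training inside each cross-validation fold, so that the held-out samples never influence which genes are chosen. Everything here is an assumption made for illustration: the data are synthetic, and the mean-difference gene ranking and nearest-centroid classifier are simple stand-ins, not the methods evaluated in MAQC-II.

```python
import random
from statistics import mean


def select_features(X, y, k):
    """Rank genes by absolute difference in class means; keep the top k indices."""
    def gap(g):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def train_centroids(X, y, genes):
    """Nearest-centroid model: one mean profile per class over the chosen genes."""
    return {
        label: [mean(row[g] for row, l in zip(X, y) if l == label) for g in genes]
        for label in (0, 1)
    }


def predict(centroids, genes, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((row[g] - c) ** 2 for g, c in zip(genes, centroids[label]))
    return min((0, 1), key=dist)


def cross_validate(X, y, n_folds=5, k=10):
    """Fold-wise accuracy, with feature selection repeated inside every fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        X_train = [r for i, r in enumerate(X) if i not in test_idx]
        y_train = [l for i, l in enumerate(y) if i not in test_idx]
        genes = select_features(X_train, y_train, k)  # training samples only
        model = train_centroids(X_train, y_train, genes)
        correct += sum(predict(model, genes, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Synthetic cohort (assumed): 40 samples, 200 genes, 5 of which differ by class.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(2.0 if (label == 1 and g < 5) else 0.0, 1.0) for g in range(200)]
     for label in y]
accuracy = cross_validate(X, y)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Selecting genes on the full data set before splitting it would let information from the test samples leak into the model, which is one source of the bias that cross-validation is meant to reduce.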

Clinical use of microarrays is particularly 

challenging owing to the variability of 

the arrays themselves and to the variability 

between patients and between laboratories 

performing the analyses. These effects fall 

under the rubric of ‘batch effects’ and cause 

false positives. Moreover, before MAQC-II, it 

had not been clear whether classifiers trained 

on an initial data set would be able to make 

accurate predictions based on completely 

independent samples collected at a later date. 

The five-step process for building a classifier 

in MAQC-II involved designing the experiment, 

collecting microarray data, creating a predictive 

model, validating the model internally with 

the training samples and validating the model 

externally with new samples obtained independently 

from the training data. MAQC-II 

enlisted 36 teams of data analysts within government 

agencies, academia and industry. The 

teams were given six microarray data sets and 

charged with predicting 13 ‘endpoints’ potentially 

relevant to clinical or preclinical applications. 

The data sets included toxicological 

studies of chemicals on rodents and expression 

profiles of human cancer patients. In total, the 

teams built >30,000 classifiers using hundreds 

of combinations of analytic methods. A team of 

referees comprising biostatisticians and experienced 

data analysts chose one ‘candidate’ model 

that was expected to have the best performance 

for each endpoint from among models nominated 

by each of the 36 teams. 

Next, the consortium analyzed how well 

the models classified samples. Performance 

was measured using several metrics, but the 

one most familiar to clinicians is the receiver 

operating characteristic area under the curve 

(AUC), a metric that varies between 0 and 

1, where 0.5 indicates performance no better 

than chance and 1 means that all samples 

are correctly classified and none misclassified. 

For most of the endpoints, the candidate 






[Figure 1 graphic omitted. Panel a, diagram labels: tissue sample, microarray, classifier, prediction, treatment plan; steps of the process evaluated in MAQC-II: remove batch effects, normalize, select features, train algorithm, internal validation. Panel b, curves of true-positive rate (sensitivity) against false-positive rate (1 – specificity), labeled AUC = 0.991, AUC = 0.956, AUC = 0.787 and AUC = 0.615.]


Figure 1 Using microarrays to make clinical predictions. (a) Current clinical decision-making processes can be refined by gene expression–based 

predictions generated by microarray classifiers (top). MAQC-II evaluated methods for constructing classifiers (bottom). Constructing a classifier from 

raw microarray data requires processing the data using a sequence of analytic steps (colored boxes). Many different approaches have been developed to 

solve each step (represented as dots above each box). In MAQC-II, >30,000 classifiers were constructed to test different combinations of analytic steps 

to predict 13 clinical and preclinical ‘endpoints’. (b) Curves showing the range of performance of classifiers developed for different data sets as part of 

MAQC-II. Performance is quantified using AUC. Data sets are characterized by the ratio of positive to negative samples in the cohort (P/N). Classifiers 

performed well for some endpoints, such as the sex of patients. The ~400 genes exclusively present on the Y chromosome made this an easy-to-predict 

positive control (red, training set P/N 1.44). The most difficult-to-predict endpoint was the overall survival of multiple myeloma patients, which has 

traditionally been difficult for other tests as well (orange, training set P/N 0.34). Classifiers for liver toxicity in rats (blue, training set P/N 0.58) and 

pathological complete remission in breast cancer (green, training set P/N 0.34) showed intermediate performance. 

microarray-based classifiers performed far 

better than chance on the independent validation 

data set, with AUC values ranging from 0.62 to 0.99.

Moreover, the performance of the referee-selected

candidate models was better than that 

of nominated models, suggesting that expert 

advice can enhance the modeling outcome. 
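For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions for illustration and are not MAQC-II data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
print(roc_auc([0.9, 0.8, 0.2, 0.1], labels))  # every positive outranks every negative
print(roc_auc([0.9, 0.4, 0.6, 0.1], labels))  # one of four pairs is misordered
print(roc_auc([0.5, 0.5, 0.5, 0.5], labels))  # scores carry no information
```

The three toy cases give 1.0, 0.75 and 0.5, matching the interpretation in the text: 1 for a perfect ranking and 0.5 for performance no better than chance.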

Notably, classifier performance was found 

to depend heavily on the endpoint being predicted 

(Fig. 1b). However, it is evident from 

inspecting the data that there is a linear correlation 

between the AUC performance and 

the ratio of positive to negative samples in the 

cohort (‘training set P/N’). The composition of 

the training set is known to affect classification 

performance, and extreme imbalance, such as 

with the breast cancer and multiple myeloma 

endpoints (Fig. 1b, orange and green), may have 

adversely affected performance. Alternatively, 

the genetics of neuroblastoma and certainly 

the rodent data sets may be less variable and 

hence more tractable to modeling (Fig. 1b, 

blue). Moreover, genetic variation typically 

accumulates over time, making the genomes 

of the patients with breast cancer and multiple 

myeloma more variable than those with neuroblastoma 

and therefore less consistent with the 

reference genome from which the microarray 

platforms were constructed. These substantial 

differences in endpoints may have affected the 

validation AUC results. 

Several findings from MAQC-II may help 

bring the technology closer to clinical use. 

Microarray experiments should be designed 

to minimize batch effects, such as those introduced 

by different laboratories or material 

lots. There should be a plan for detecting such 

effects (e.g., by testing for unexpected genes 

that are expressed in different experimental 

conditions), and the same statistical test used 

to detect differentially expressed genes should 

be applied to all samples 5 . A gene that is differentially 

expressed in a pattern that matches 

the grouping of samples into batches should 

be examined closely and probably not used 

in a classifier. 
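One simple way to act on this advice is to test each gene for differential expression between batches, just as one would between clinical groups, and to set aside genes that track the batch labels. The sketch below assumes exactly two batches and uses Welch's t-statistic with an arbitrary cutoff; the gene names, values and threshold are hypothetical, and this is not presented as the procedure used in MAQC-II.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of measurements."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:  # identical values within both groups
        return 0.0 if diff == 0.0 else float("inf")
    return diff / se


def flag_batch_genes(expression, batches, threshold=4.0):
    """Return genes whose expression pattern matches batch membership.

    expression: dict mapping gene name -> list of values, one per sample
    batches:    list of batch labels, one per sample (two distinct labels assumed)
    threshold:  cutoff on |t|; an arbitrary illustrative choice
    """
    first, second = sorted(set(batches))
    flagged = []
    for gene, values in expression.items():
        in_first = [v for v, b in zip(values, batches) if b == first]
        in_second = [v for v, b in zip(values, batches) if b == second]
        if abs(welch_t(in_first, in_second)) > threshold:
            flagged.append(gene)
    return flagged


# Toy data (assumed): geneA shifts with the laboratory, geneB does not.
batches = ["lab1"] * 4 + ["lab2"] * 4
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.2, 7.0, 7.1, 6.9, 7.2],
    "geneB": [5.0, 6.1, 4.8, 5.9, 5.2, 6.0, 4.9, 5.8],
}
print(flag_batch_genes(expression, batches))
```

In this toy example only geneA is flagged. A real analysis would also need to correct for testing thousands of genes and to handle designs in which batch and clinical class are confounded.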

Related to batch effects, quality control metrics

should be used to distinguish variation in

gene expression caused by laboratory artifact

from variation caused by clinical phenotype. Quality control

metrics are formulated to assess specific 

aspects of laboratory processing, such as RNA 

degradation or faulty equipment. These metrics 

can be used to adjust gene expression measurements 

or to identify problem microarrays. In 

the MAQC-II project, rather than adjusting 

measurements to account for laboratory noise, 

data analysts did not use samples that appeared 

to have quality control problems. 

Several factors were found to influence 

classifier performance more than the type of 

algorithm used. One of these is the inherent 

difficulty of the biological phenomena being 

predicted. Another is the method for tuning 

the algorithm. Inexperience in tuning can be a 

major source of bias in the final classifier, especially 

if the predictive algorithm is not tuned 

for the population of interest. For example, in 

a population with low prevalence of a disease, 

it may be more desirable to have a test that 

makes few false predictions. 
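Tuning for the population of interest often comes down to choosing the score cutoff above which a sample is called positive. The sketch below picks the lowest cutoff that meets a required specificity on training data; the scores, labels and function name are hypothetical, and the rule itself is only one of many possible tuning criteria.

```python
def threshold_for_specificity(scores, labels, min_specificity):
    """Lowest cutoff (call positive if score >= cutoff) meeting the specificity target."""
    neg = [s for s, l in zip(scores, labels) if l == 0]
    for cutoff in sorted(set(scores)):
        true_neg = sum(s < cutoff for s in neg)
        if true_neg / len(neg) >= min_specificity:
            return cutoff
    return float("inf")  # no observed score qualifies: call nothing positive


# Toy training scores (assumed); label 1 marks samples with the condition.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(threshold_for_specificity(scores, labels, 0.75))
print(threshold_for_specificity(scores, labels, 1.0))
```

Here a 75% specificity target is met at a cutoff of 0.4, which still detects all four positives, whereas demanding no false positives pushes the cutoff to 0.7 and misses one of them. The stricter setting is the kind of trade-off that may be preferable when the condition is rare.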

The results of MAQC-II highlight two priorities 

for future work. First, the field needs 

rigorous standards for reporting the steps 

used to develop a classifier, its parameters 

of use and the appropriate quality metrics. 

Examples in the literature 2 may provide useful 

starting points. A classifier submitted for 

publication or for regulatory approval should 

specify how to use it to classify new samples— 

for example, the normalization and batch 

effect correction procedures to perform, the 

essential quality control checks and how to 

handle quality control flaws. The final report 

of a prediction algorithm should provide the 

variability (for example, the standard error) of the performance

measure as well as an estimation of 

the bias. A prediction report based on analysis 

of an individual patient sample should be 

accompanied by a report of quality metrics 

and their normal values and a report of batch 

effect measures that could provide a clinician 

with a sense of whether a microarray is within 

the range of the samples for which the test 

was developed 5 . 

Second, methods are needed to combine 

microarray predictions with existing clinical 

decision-making tools, such as nomograms

(graphical charts for performing calculations).

In constructing a nomogram, it will be necessary 

to determine how to balance the data from 

a microarray classifier with traditional clinical 

predictors. In addition, approaches should be 

developed to handle variability. For instance, 

the microarray chips used in MAQC-II have 

already been replaced by newer versions. 

A key observation of MAQC-II—namely, 

that some endpoints seem inherently more 

predictable than others, regardless of the 

analytic methods used—suggests that gene 

expression microarrays may not capture a 




sufficiently rich snapshot of disease physiology. 

In such cases, complementary technologies, 

which measure mRNA expression, 

protein levels, genetic mutation, copy number 

variation, gene silencing or regulatory RNA 

expression, could be considered. Alternatively, 

the best technology may vary by tumor type. 

High-throughput sequencing, in particular, 

offers advantages over microarrays in that 

coverage of the genome is less biased and the 

dynamic range is larger 6 . With luck, the results 

of MAQC-II will be useful for shepherding

other high-throughput technologies toward

the clinic as well.

1. DeRisi, J.L., Iyer, V.R. & Brown, P.P. Science 278,

680–686 (1997).

2. Dumur, C.I. et al. J. Mol. Diagn. 10, 67–77 (2008).

3. Buyse, M. et al. J. Natl. Cancer Inst. 98, 1183–1192

(2006).

4. The MicroArray Quality Control (MAQC) consortium.

5. Luo, J. et al. Pharmacogenomics J. 10, 278–291

(2010).

6. Schuster, S.C. Nat. Methods 5, 16–18 (2008).

Shaking up genome engineering

KA Tipton & John Dueber

A new method generates genome-scale modified bacteria with

unprecedented ease.

KA Tipton and John Dueber are at the

University of California Berkeley, Berkeley,

California, USA.

e-mail: jdueber@berkeley.edu

Systematic approaches to mutate and characterize

the function of every gene in a microbe have

been hampered by the need to manually create

thousands of separate strains through tedious

genetic manipulation. In this issue, Warner

et al. 1 describe an approach to create and characterize

rationally modified versions of almost

every gene in Escherichia coli. Using this strategy,

the authors quickly zero in on genes that

influence industrially relevant traits, such as

tolerance to toxins in a biofuel feedstock. The

method enables single genome modifications

to be probed rapidly and comprehensively and

correlated to a phenotype, yielding information

that lays a foundation for gene mapping and for

engineering strains with desired phenotypes.

Until now, systematic phenotyping of mutants

in yeasts 2,3 and E. coli 4 has been accomplished

by Herculean manual efforts to create thousands

of mutant strains, each with a different single-gene

knockout. Although the resulting strain

collections have proven valuable, it remains

a challenge to create, on a genome scale, new

collections of mutants for targeted applications

or to control gene expression levels using

a strong promoter, an inducible promoter or a

low-efficiency ribosome binding site.

In contrast, the method of Warner et al. 1 —

trackable multiplex recombineering (TRMR),

pronounced ‘tremor’ (Fig. 1)—offers a fast

and cheap approach for creating collections of

mutants. Impressively, the authors were able to

construct libraries containing up- and downregulated

versions of 96% of the genes in the

E. coli genome in one week at a materials cost

of ~$1 per targeted gene.

The first step in TRMR is to obtain thousands 

of 189-base-pair oligonucleotides that 

target and uniquely identify every E. coli gene. 

Each of these oligos consists of a barcode tag 

unique to a gene and regions of homology that 

flank the targeted gene in the genome. Warner 

et al. 1 purchased the oligos, which were made 

on a programmable microarray. Next, using 

a clever cloning strategy, they appended the 

oligos to DNA elements that modulate gene 

expression. Attaching the targeting oligos to 

the strong PLtetO-1 promoter created a DNA

cassette that was expected to upregulate the 

targeted gene after incorporation into the 

genome. Conversely, attaching the targeting 

oligo to a weak ribosome binding site produced 

a DNA cassette that downregulated the 

targeted gene. An antibiotic resistance gene 

allowed selection for the genetic modifications. 

As a result of the DNA synthesis and 

manipulation steps, Warner et al. 1 created 

two libraries of linear DNA fragments, each 

with 4,077 DNA cassettes pooled together in 

a single tube. 

These libraries of DNA oligonucleotides were 

used to modify the E. coli genome by means of 

recombineering, a homologous recombination–based

method in E. coli expressing λ phage

recombination factors (λgam, bet and exo) 5 . 

Growth on antibiotic medium selects for successful 

recombinants, and the sites of recombination 

are determined by homology of the 

targeting oligos to genomic regions flanking 

each gene. 

Figure 1 TRMR enables genome-scale selection of rational modifications to the expression of single

genes. A multiplex library of oligonucleotides is synthesized to encode a unique barcode tag and regions

of homology flanking individual target genes in the E. coli genome (left). A series of cloning steps

generates linear DNA fragments that contain sequences necessary for up- or downregulating the

expression of each target gene. E. coli are transformed with this library of linear fragments to create a

collection of genetically modified strains (middle, green cells containing a modified genetic network).

The modifications alter the functional linkages between genes. (Lines in the networks represent

linkages, with thickness being the strength of the link. Circles represent genes, with translucency

and a dashed outline representing attenuated expression). The E. coli strain collection is grown

on medium containing an environmental challenge of interest (right). The identities and relative

abundances of individual survivors are determined by sequencing colonies using universal primer

sequences. Alternatively, survivors are determined in bulk by microarray analysis of the barcode tags.

Importantly, the basic TRMR strategy is amenable to rapid iteration such that the most promising gene

modifications are used to seed subsequent cycles of mutation and selection (dotted arrow).

The resulting collections of modified E. coli

strains were then challenged by growth in

environmental conditions of interest. Warner

et al. 1 measured the relative fitness of each




modified strain by isolating genomic DNA, 

amplifying the barcode tags using PCR and 

hybridizing the amplified DNA to a microarray 

that contains probes complementary to 

each tag. A signal on the microarray identifies 

strains that grew. To demonstrate the 

approach, the authors selected for growth in 

media containing salicin, D-fucose, valine or

methylglyoxal. These compounds inhibit cell

growth by different mechanisms. Salicin is a

carbon source that normally cannot be metabolized.

D-fucose is an analogue of arabinose

that inhibits the ability of E. coli to metabolize 

this sugar. Valine acts as a feedback inhibitor 

of growth-limiting leucine and isoleucine biosynthesis. 

Methylglyoxal presents an oxidative 

stress if present in elevated concentrations. 

These conditions demonstrated the effectiveness 

of TRMR in identifying gene-trait relationships 

and in identifying genes that were 

not expected to be involved in resistance to the 

given cellular stress, thus supporting the power 

of a genome-scale, unbiased approach. 
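
The barcode readout described above lends itself to a simple enrichment score. The following Python sketch is illustrative only: this commentary does not state how Warner et al. scored fitness, so the log2 ratio of each tag's share of the total signal before and after selection, the pseudocount, the function name relative_fitness and the tag names and intensities are all assumptions invented for the example.

```python
import math


def relative_fitness(before, after, pseudocount=1.0):
    """Score each barcode tag by the log2 change in its share of total signal.

    `before` and `after` map a barcode tag to a microarray signal intensity
    measured before and after selection. The scoring rule is an assumption
    made for illustration, not the published TRMR analysis.
    """
    n_tags = len(before)
    # The pseudocount keeps tags that vanish after selection from giving log2(0).
    total_before = sum(before.values()) + pseudocount * n_tags
    total_after = sum(after.get(tag, 0.0) for tag in before) + pseudocount * n_tags

    scores = {}
    for tag, signal in before.items():
        share_before = (signal + pseudocount) / total_before
        share_after = (after.get(tag, 0.0) + pseudocount) / total_after
        scores[tag] = math.log2(share_after / share_before)
    return scores


# Hypothetical intensities for three made-up barcode tags.
before = {"tag_A_up": 100.0, "tag_B_up": 100.0, "tag_C_down": 100.0}
after = {"tag_A_up": 400.0, "tag_B_up": 100.0, "tag_C_down": 25.0}

scores = relative_fitness(before, after)
# Tags sorted from most enriched to most depleted under selection.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A positive score marks a strain whose share of the population grew under the selective condition, and a negative score marks one that was outcompeted. Because shares are relative, a tag with unchanged raw signal can still score below zero when another strain takes over the culture.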

In a particularly challenging and exciting 

application of TRMR, Warner et al. 1 grew their 

libraries of strains in lignocellulosic hydrolysate 

derived from corn stover. Hydrolysates 

represent a complex potpourri of molecules 

toxic to E. coli. It has been difficult to predict 

a priori which genes would best confer resistance 

to growth inhibitors in the hydrolysates 6 . 

This problem is thus well suited to test the 

authors’ methods. Among the modified genes 

that conferred improved growth were genes 

with expected functions as well as several 

with seemingly disparate cellular functions, 

including primary metabolism, RNA metabolism, 

sugar transporters, secondary metabolism, 

vitamin processes and antioxidant activities. In 

one notable result, the authors identified the 

antioxidant ahpC, a gene not previously linked 

to growth on hydrolysates, which, when upregulated, 

considerably improved both growth rate 

and final biomass levels. 

TRMR has many potential uses. Warner 

et al. 1 note that it could easily be applied iteratively, 

with strains selected after one round of 

TRMR used as the starting strains for a second 

round, thereby accumulating beneficial 

genome alterations (Fig. 1, dotted arrow). 

Such iterative processing can take advantage 

of the same pool of oligos already synthesized. 

Parallel microarray analysis of the barcode 

tags present in the selected survivors should 

produce additional layers of information about 

genetic contributors to fitness. For instance, 

the ability to track combinations of alterations 

in a stepwise fashion as they accumulate has 

the potential to provide snapshots of genetic 

interaction data that, if taken at a high enough 

frequency, may uncover network connections 

in conditions particularly relevant to industrial 

and biotechnological settings. 

TRMR is also valuable because it identifies 

genes and network connections that 

could form the basis for further strain optimization. 

For instance, a particularly powerful 

combination of technologies would 

be to first use TRMR to identify relevant 

genes and then apply the recently developed 

multiplex automated genome engineering 

(MAGE) method 7 , which finely tunes the 

expression levels of a limited number of 

genes. In microbial engineering applications, 

such as the creation of a strain of E. coli that 

can metabolize lignocellulose sugars, TRMR 

should complement existing technologies, 

including directed evolution, genome-scale 

metabolic modeling and synthetic biology 

approaches for redox balancing, flux improvement 

and limiting the production of undesirable 

and toxic metabolic products. 

In addition to TRMR, other approaches

based on genome-wide modifications are

increasingly providing scientists with the ability

to generate large, information-rich data sets

from which new genetic information may be

extracted 2–4,8,9 . TRMR heralds an approach to

genetic analyses in which phenotypes are rapidly

mapped to genetic modifications across the

genome, simultaneously producing improved

strains for immediate practical use as well as

data sets enabling future rational creation of

sophisticated strains.



1. Warner, J. et al. Nat. Biotechnol. 28, 856–862

(2010).

2. Giaever, G. et al. Nature 418, 387–391 (2002).

3. Kim, D.U. et al. Nat. Biotechnol. 28, 617–623

(2010).

4. Baba, T. et al. Mol. Syst. Biol. 2, 2006.0008 (2006).

5. Datta, S., Costantino, N. & Court, D.L. Gene 379,

109–115 (2006).

6. Mohagheghi, A. & Schell, D.J. Biotechnol. Bioeng. 105,

992–996 (2010).

7. Wang, H.H. et al. Nature 460, 894–898 (2009).

8. Tong, A.H. et al. Science 294, 2364–2368 (2001).

9. Mnaimneh, S. et al. Cell 118, 31–44 (2004).

The expanding family of dendritic

cell subsets


The recent identification of human CD141 + dendritic cells as a counterpart

of mouse CD8 + dendritic cells may be useful in developing vaccines and

immunotherapies.

Hideki Ueno, A. Karolina Palucka and Jacques

Banchereau are at the Baylor Institute for

Immunology Research and INSERM U899,

Dallas, Texas, USA; A. Karolina Palucka is at

the Sammons Cancer Center, Baylor University

Medical Center, Dallas, Texas, USA; and

A. Karolina Palucka and Jacques Banchereau

are in the Department of Gene and Cell

Medicine and Department of Medicine,

Immunology Institute, Mount Sinai School of

Medicine, New York, New York, USA.

e-mail: jacquesb@baylorhealth.edu

Dendritic cells (DCs) are central players in the

control of immunity and tolerance, and investigation

of their properties is expected to illuminate

many diseases of the immune system

and lead to innovative therapies. Four recent

reports 1–4 in The Journal of Experimental

Medicine mark new progress in our understanding

of the biology of a particular human

DC subset identified by co-expression of

CD141 (thrombomodulin, BDCA-3) and the

C-type lectin CLEC9A (DNGR-1). Collectively, 

the papers show that CD141 + DCs are the 

human counterpart of mouse CD8 + DCs. As 

mouse CD8 + DCs are important for the induction 

of cytotoxic T-lymphocyte responses 

through their exceptional capacity to present 

exogenous antigens in an HLA class I pathway 

(so-called cross-presentation) 5 , this discovery 

could have significant clinical impact if human 

CD141 + DCs have a similar role. 

DCs were discovered in 1973 by Ralph 

Steinman as a novel cell type in the mouse 

spleen and are now recognized as a group of 

related cell populations that efficiently present 

antigens. Both mice and humans have two 

major types of DC: myeloid DCs (mDCs, also 

called conventional or classical DCs), and 

plasmacytoid DCs (pDCs). pDCs are considered 

the front line in anti-viral immunity as 

they rapidly produce abundant type I interferon 

in response to viral infection. In their 

resting state, pDCs may be important in tolerance, 

including oral tolerance 6,7 . pDCs are 




themselves composed of at least two subsets 

with different functional properties 8 . 

Similarly, mDCs comprise different subsets 

with unique localization, phenotype and functions 

(Fig. 1). In human skin, the epidermis 

hosts Langerhans cells, whereas the dermis 

contains CD1a + DCs and CD14 + DCs. The 

latter DC subset is involved in the generation of 

humoral immunity, partly through secretion of 

interleukin (IL)-12, which stimulates the differentiation 

of activated B cells into plasma cells 

and also promotes the differentiation of naive 

CD4 + T cells into T follicular helper cells 9,10 , 

a CD4 + T-cell subset that promotes antibody 

responses. In contrast, Langerhans cells efficiently 

prime antigen-specific CD8 + T cells, 

possibly by means of IL-15 (ref. 9). The functions 

of the predominant CD1a + dermal DCs 

are as yet unknown. 

Figure 1 Contribution of human myeloid DC subsets to the regulation of adaptive immunity. The

humoral and cellular arms of adaptive immunity are regulated by different human mDC subsets.

Humoral immunity is preferentially regulated by CD14 + dermal DCs by means of IL-12, which acts

directly on B cells and promotes the development of T follicular helper cells (Tfh). Cellular immunity

is preferentially regulated by Langerhans cells, possibly through IL-15 and a dedicated subset of CD4 +

T cells specialized to help CD8 + T cells (CTL Th cells). Given their capacity to cross-present antigens

to CD8 + T cells, CD141 + DCs are likely to be involved in the development of cytotoxic T-lymphocyte

responses. CD141 + DCs might also be involved in the development of humoral responses through

IL-12 secretion. This hypothesis is supported by mouse in vivo antigen-targeting studies showing that

CD8 + DCs, the mouse counterpart of human CD141 + DCs, can induce both cytotoxic T-lymphocyte and

humoral responses 12,13 , although the mechanisms may be different. It will be important to determine

whether and how CD141 + DCs are related to Langerhans cells and to dermal DCs, and how these DC

subsets shape adaptive immunity.

Human DCs expressing CD141 were originally

found in blood as a subset of mDCs distinct

from CD1c + mDCs 11 . The new reports 1–4

argue that CD141 + DCs are the human counterpart

of mouse CD8 + DCs on the basis of

results from several different experimental

approaches, including detailed functional and

phenotypic analysis 1,3 , as well as the discovery

of a chemokine receptor expressed on both

cell types 2,4 .

First, like mouse CD8 + DCs, human CD141 + 

DCs are present in secondary lymphoid organs 

such as tonsils and spleen 1,3 . Further studies 

are needed to determine whether they are also 

present in tissues. 

Second, although human CD141 + DCs do 

not express CD8, they share with mouse CD8 + 

DCs expression of other surface molecules, 

including CLEC9A 1,3,12,13 and the adhesion 

molecule, NECL2 (refs. 3,14). NECL2 binds to 

class I–restricted T cell–associated molecule, 

a cell-surface protein primarily expressed by 

natural killer cells, natural killer T cells and 

activated CD8 + T cells 14 . 

Third, human CD141 + DCs uniquely express 

the chemokine receptor XCR1 (refs. 2,4), in 

line with the unique expression of XCR1 by 

mouse CD8 + DCs shown previously. XCR1 

expressed in both human and mouse DCs is 

functional, as the cells migrate in response to 

the ligand XCL1 (refs. 2,4), a secreted protein 

known to be produced by natural killer cells 

and activated CD8 + T cells. These observations 

suggest a potential for interactions 

between human CD141 + DCs/mouse CD8 + 

DCs and natural killer cells or CD8 + T cells, 

which might be a mechanism involved in the 

efficient induction of cytotoxic T lymphocyte 

responses. For example, interferon (IFN)-γ 

released by natural killer cells and/or CD8 + 

T cells might stimulate CD141 + DCs/CD8 + 

DCs to secrete more IL-12 (refs. 2,4). 

Fourth, all of the new studies 1–4 demonstrate 

that human CD141 + DCs are highly efficient in 

inducing CD8 + T-cell responses through their 

capacity to cross-present exogenous antigens. 

This evidence suggests that human CD141 + 

DCs participate in the development of cytotoxic 

lymphocyte responses in vivo. 

Fifth, human CD141 + DCs and mouse CD8 + 

DCs express the transcription factors Batf3 

and IRF-8 (refs. 1,3), both of which are strictly 

required for the development of mouse CD8 + 

DCs 5 . In contrast, CD141 + DCs do not express 

IRF4 (refs. 1,3), a transcription factor required 

for the development of other mouse spleen 

CD4 + DCs 5 . Thus, CD141 + DCs and mouse 

CD8 + DCs might share a common developmental 

pathway. 

Finally, two of the studies 1,3 show similarities 

between human CD141 + DCs and mouse 

CD8 + DCs in the expression of Toll-like 

receptors (TLRs). TLRs belong to the family 

of pattern recognition receptors through 

which DCs sense microbes and dying cells. 

Engagement of these receptors by pathogen- 

and danger-associated molecular patterns 

expressed by microbes and dying cells 

triggers DC maturation, a complex series of 

events that includes expression of new surface 

molecules, secretion of cytokines and a 

reduction in antigen capture. Different DC 

subsets express different sets of pattern recognition 

receptors, particularly in humans, 

which provides flexibility in responding to 

different microbes. 

Similar to mouse CD8 + DCs, human CD141 + 

DCs are found to express TLR3 and TLR8, 

and stimulation with their respective ligands 

(poly I:C and poly U) induces their maturation 

and cytokine secretion. In contrast 

to the relatively limited TLR expression by 

CD141 + DCs, it is known that CD1c + DCs, 

another blood mDC subset, express a wide 

array, including TLR4, 5 and 7. Whether 

human CD141 + DCs express other pattern 

recognition receptors, such as NOD-like 

receptors and RIG-I-like receptors, has yet 

to be determined. 

The identification of the human counterpart 

of mouse CD8 + DCs opens the possibility 

of translating to humans knowledge 




generated in the mouse. There are still many 

infectious diseases for which no efficient vaccines 

are available, including AIDS, malaria, 

hepatitis C infection and tuberculosis. Most 

of these would benefit from the induction of 

potent cytotoxic T lymphocytes to eliminate 

the infected cells. Similarly, strong cytotoxic 

T-lymphocyte responses would be beneficial 

in the context of cancer immunotherapy. 

Thus, it may be possible to exploit CD141 + 

DCs in the ‘DC-targeting’ vaccination strategy, 

in which vaccines are generated from 

recombinant anti-DC antibodies fused to 

selected antigens 15 . Studies in mice have 

shown that targeting antigen to DCs in this 

manner in vivo results in potent antigen-specific

CD4 + and CD8 + T-cell immunity 15 , 

provided adjuvants are co-administered to 

activate the targeted DCs. Indeed, antibodies 

to CLEC9A allowed targeting of antigen to

mouse CD8 + DCs in vivo, inducing potent 

cytotoxic T-lymphocyte responses when 

combined with anti-CD40 administration 12 

and potent antibody responses even without 

co-administration of adjuvants 13 . 

It should be emphasized, however, that 

translating mouse immunological data to 

the clinic is fraught with uncertainty, as 65 

million years of independent evolution have 

produced many nuances that distinguish the 

human and mouse immune systems 16 . As one 

example, other human DCs, such as CD1c + 

DCs 1,3 and epidermal Langerhans cells 9 , can 

also cross-present antigens. Thus, it remains 

to be determined whether and how human 

CD141 + mDCs are related to other mDC

subsets and how all the mDC subsets cooperate 

in shaping adaptive immunity. 



1. Jongbloed, S.L. et al. J. Exp. Med. 207, 1247–1260 (2010).

2. Bachem, A. et al. J. Exp. Med. 207, 1273–1281 (2010).

3. Poulin, L.F. et al. J. Exp. Med. 207, 1261–1271 (2010).

4. Crozat, K. et al. J. Exp. Med. 207, 1283–1292 (2010).

5. Shortman, K. & Heath, W.R. Immunol. Rev. 234, 18–31 (2010).

6. Goubier, A. et al. Immunity 29, 464–475 (2008).

7. Liu, Y.J. Annu. Rev. Immunol. 23, 275–306 (2005).

8. Matsui, T. et al. J. Immunol. 182, 6815–6823 (2009).

9. Klechevsky, E. et al. Immunity 29, 497–510 (2008).

10. Schmitt, N. et al. Immunity 31, 158–169 (2009).

11. Dzionek, A. et al. J. Immunol. 165, 6037–6046 (2000).

12. Sancho, D. et al. J. Clin. Invest. 118, 2098–2110 (2008).

13. Caminschi, I. et al. Blood 112, 3264–3273 (2008).

14. Galibert, L. et al. J. Biol. Chem. 280, 21955–21964 (2005).

15. Bonifaz, L.C. et al. J. Exp. Med. 199, 815–824 (2004).

16. Mestas, J. & Hughes, C.C. J. Immunol. 172, 2731–2738 (2004).


Research highlights


Lung on a chip 

Efforts to mimic the 

alveolar-capillary 

interface—the 

fundamental functional 

unit of the lung—in 

cell culture have been 

frustrated primarily 

by the challenge 

of replicating the 

structural and functional 

properties of the system 

while simulating the 

mechanical changes 

associated with normal 

breathing. Huh et al. 

recreate the behavior 

of lung tissue in a 

microfluidic device 

by lining a thin (10 µm), porous and flexible membrane 

with human alveolar epithelial cells on one side and human 

pulmonary microvascular endothelial cells on the other. 

Application and release of a vacuum to two flanking chambers 

causes the membrane with its adherent tissue layers to stretch 

and then relax to its original size, thus recreating the dynamic 

mechanical distortion of the alveolar-capillary interface caused 

by breathing. The device reproduces organ-level responses to 

bacterial infection and inflammatory cytokines, and its use 

suggests that mechanical strain can promote nanoparticle-induced

toxicity. These findings underscore the potential of 

the chip for evaluating the safety and efficacy of new drugs for 

lung disorders, or the effects of environmental toxins. 

(Science 328, 1662–1668, 2010) 

PH 

miRNAs, Dicer and metastasis 

MicroRNAs (miRNAs) play a key role in the pathogenesis of cancer. 

Although the overexpression of individual miRNAs is important 

in numerous tumors, a global downregulation of miRNA levels is a 

hallmark of cancer. Martello et al. now show that members of the 

miR-103/107 family suppress the expression of Dicer, the enzyme 

responsible for the maturation of pre-miRNAs into miRNAs. Levels of 

miR-103/107 are inversely proportional to Dicer abundance in cancer 

cell lines and high miR-103/107 expression correlates with metastasis 

and poor prognosis in breast cancer. In mouse models of breast cancer, 

nonmetastatic cell lines can be converted to an invasive phenotype by 

miR-103/107 expression. Therapeutic targeting of the miRNAs with 

a specific antisense molecule reduces the number of lung metastases, 

making these miRNAs promising targets for antimetastatic drugs, 

although no effect on the growth of the primary tumor was observed. 

The miR-103/107 molecules promote an epithelial-to-mesenchymal 

transition, a developmental program associated with increased mobility 

and loss of cell adhesion that is frequently observed in metastatic 

cancer. (Cell 141, 1195–1207, 2010) 

ME 

Written by Kathy Aschheim, Laura DeFrancesco, Markus Elsner, 

Peter Hare & Craig Mak 

Fungal histone acetylation inhibitors 

Targeting fungal histone acetylation may provide a new source of drugs 

against Candida albicans infections, a particular problem for immunocompromised 

individuals, research by Wurtele et al. suggests. The authors 

set out to determine whether a fungal histone acetyltransferase enzyme 

(RTT109) not found in humans would make a good drug target. The 

particular modification that the enzyme makes—acetylation of lysine 56 

on histone 3 (H3 Lys56)—is found on close to 30% of C. albicans histones, 

whereas only 1% of human histones bear the mark. Knocking out both 

copies of RTT109 creates strains with greater sensitivity to certain antifungal 

agents; repressing the activity of the HST3 deacetylase enzyme led 

to fungal cell death. The effects were also mirrored by nicotinamide, an 

inhibitor of NAD-dependent deacetylases. A/J mice, a model particularly 

sensitive to C. albicans infection, which were injected with an HST3- 

repressed strain of the fungus or an RTT109-deleted strain failed to show 

signs of infection. Once again, nicotinamide treatment mirrored the effects 

of HST3 repression, but only in strains with wild-type RTT109, suggesting 

that nicotinamide, which acts as an anti-inflammatory, exerts its effects 

on infection through its interaction with the histone deacetylase pathway. 

Finally, the researchers showed that whereas some fungal pathogens are 

sensitive in various degrees to nicotinamide, all tested clinical isolates of 

C. albicans, the fungus with the greatest impact on human health, were 

sensitive. (Nat. Med. 16, 774–780, 2010) 

LD 

iPS cells from blood 

As researchers contemplate clinical applications of induced pluripotent stem 

(iPS) cells, one practical consideration is the accessibility of the donor cells 

used for reprogramming. So far, most human iPS cells have been derived 

from fibroblasts collected through skin biopsies, a procedure that requires an 

incision and stitches. Following three 2009 papers on the reprogramming of 

human hematopoietic stem/progenitor cells from cord blood or from adults 

after mobilization by granulocyte colony stimulating factor, three new studies 

describe iPS cells from unmobilized adult blood cells. All three groups rely 

on the standard ‘Yamanaka’ reprogramming factors (OCT4, SOX2, KLF4, 

C-MYC), but Loh et al. and Staerk et al. deliver these with retroviruses, 

whereas Seki et al. use the nonintegrating Sendai virus. The latter method 

appears more efficient, allowing iPSCs to be generated from samples as small 

as 1 ml. Like keratinocytes from plucked hair (Nat. Biotechnol. 26, 1276–1284, 

2008), peripheral blood cells may provide a convenient source of iPS cells in 

a clinical context. (Cell Stem Cell 7, 15–19; 20–24; 11–14, 2010) KA 

Antibody therapy for thrombosis 

Small-molecule therapeutics, such as aspirin and clopidogrel (Plavix), 

reduce the risk for heart attack and stroke by inhibiting platelets but at 

the cost of increased risk for excessive bleeding. Tucker et al. demonstrate 

an alternative strategy in baboons based on reducing platelet counts using 

neutralizing antibodies. This strategy was tested using a vascular graft 

model that mimics a damaged blood vessel at risk for thrombosis. Animals 

with fewer circulating platelets showed less potential for thrombosis in 

the graft model. Notably, the blood of these animals did not take longer 

to clot after cutting the animals’ forearm, whereas aspirin treatment led to 

a statistically significant increase in bleeding time. Tucker et al. reduced 

platelet counts by treating animals with serum containing polyclonal neutralizing 

antibodies raised in baboons against thrombopoietin, a hormone 

essential for platelet production. Drugs that can be safely used to inhibit 

platelet production will be required before this strategy can be tested in 

humans. (Sci. Transl. Med. 2, 37ra45, 2010) 

CM 


Analysis

Discovery and characterization of chromatin states for 

systematic annotation of the human genome 

Jason Ernst 1,2 & Manolis Kellis 1,2 


A plethora of epigenetic modifications have been described 

in the human genome and shown to play diverse roles in gene 

regulation, cellular differentiation and the onset of disease. 

Although individual modifications have been linked to the 

activity levels of various genetic functional elements, their 

combinatorial patterns are still unresolved and their potential 

for systematic de novo genome annotation remains untapped. 

Here, we use a multivariate Hidden Markov Model to reveal 

‘chromatin states’ in human T cells, based on recurrent and 

spatially coherent combinations of chromatin marks. We define 

51 distinct chromatin states, including promoter-associated, 

transcription-associated, active intergenic, large-scale repressed 

and repeat-associated states. Each chromatin state shows 

specific enrichments in functional annotations, sequence 

motifs and specific experimentally observed characteristics, 

suggesting distinct biological roles. This approach provides a 

complementary functional annotation of the human genome 

that reveals the genome-wide locations of diverse classes of 

epigenetic function. 

The primary DNA sequence of the human genome encodes the 

genetic information of each cell, but numerous epigenetic modifications 

can modulate the interpretation of the primary sequence. 

These modifications contribute to the diversity of phenotypes found 

across different human cell types, play key roles in the establishment 

and maintenance of cellular identity during development and have 

been associated with DNA repair, replication and human disease. 

Post-translational modifications in the tails of histone proteins that 

package DNA into chromatin constitute perhaps the most versatile 

type of such epigenetic information. More than a dozen positions of 

multiple histone proteins can undergo a number of modifications, 

such as acetylation and mono-, di- or tri-methylation 1,2 . 

More than 100 distinct histone modifications have been described, 

leading to the ‘histone code hypothesis’ that specific combinations of 

chromatin modifications would encode distinct biological functions 3 . 

Others, however, have instead proposed that individual epigenetic 

marks act in additive ways and the multitude of modifications simply 

contributes to stability and robustness 4 . The specific combinations of

epigenetic modifications that are biologically meaningful, and their

corresponding functional roles, are still largely unknown.

1 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge,

Massachusetts, USA. 2 Broad Institute of MIT and Harvard, Cambridge,

Massachusetts, USA. Correspondence should be addressed to M.K.

(manoli@mit.edu).

Published online 25 July 2010; doi:10.1038/nbt.1662

To directly address these questions, we introduce an approach for 

the de novo discovery of ‘chromatin states’ (Fig. 1, Supplementary 

Table 1 and Supplementary Fig. 1), or biologically meaningful and 

spatially coherent combinations of chromatin marks, by performing 

a systematic genome-wide analysis based on a multivariate Hidden 

Markov Model (HMM). Multivariate HMMs are graphical probabilistic 

models that model multiple ‘observed’ inputs as generated by 

unobserved ‘hidden’ states, using transitions between hidden states 

to model spatial relationships (Online Methods). 

Our model captures two types of chromatin information. The frequencies

with which different chromatin marks are found together

are captured by a vector of ‘emission’ probabilities

associated with each chromatin state (Fig. 2 and Supplementary

Figs. 2 and 3), and the frequencies with which different chromatin

states occur in particular spatial relationships to one another along the genome

are encoded in a ‘transition’ probability vector associated with each

state. These spatial relationships capture both the spreading of certain 

chromatin domains across the genome, as well as the functional ordering 

of different states such as from intergenic regions to promoter regions 

and transcribed regions (Supplementary Notes and Supplementary 

Figs. 4–6). Biologically, the genomic locations associated with a

given chromatin state may correspond to specific types of functional 

elements, such as transcription start sites (TSS), enhancers, active genes, 

repressed genes, exons or heterochromatin, which can be inferred 

solely from the corresponding combinations of chromatin marks in 

their spatial context, even though no information about these annotations 

is given to the model as input. 
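
To illustrate how emission and transition probabilities jointly determine a state annotation, the following is a minimal sketch of decoding the most probable state path with the standard Viterbi algorithm. It is an illustration under stated assumptions, not the authors' implementation: the two-state toy model, all probabilities and the function name `viterbi` are hypothetical, and the paper itself reports maximum-probability assignments and posterior probabilities whose exact computation is described in its Online Methods.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return the most probable hidden-state path of an HMM.

    log_init[k]     : log P(first interval is in state k)
    log_trans[j][k] : log P(next state is k | current state is j)
    log_emit[t][k]  : log P(observed marks at interval t | state k)
    """
    n_states = len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(n_states)]
    back = []  # back[t-1][k] = best predecessor of state k at interval t
    for t in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: prev[j] + log_trans[j][k])
            pointers.append(best_j)
            score.append(prev[best_j] + log_trans[best_j][k] + log_emit[t][k])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state toy model (all numbers are made up).
log = math.log
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.9), log(0.1)],   # states tend to persist along the genome
             [log(0.1), log(0.9)]]
log_emit = [[log(0.8), log(0.2)],    # interval 0 looks like state 0
            [log(0.4), log(0.6)],    # interval 1 is noisy, weakly favors state 1
            [log(0.8), log(0.2)],
            [log(0.2), log(0.8)],
            [log(0.2), log(0.8)]]
path = viterbi(log_init, log_trans, log_emit)
print(path)  # [0, 0, 0, 1, 1]
```

In this toy run the weakly ambiguous second interval is still assigned to state 0, because the transition probabilities favor spatially coherent runs of the same state.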

We applied our model to the largest data set of chromatin mark 

information available, consisting of the genome-wide occupancy data 

for a set of 38 different histone methylation and acetylation marks and 

for the histone variant H2AZ, RNA polymerase II (PolII) and CTCF in 

human CD4 T-cells. The maps were previously obtained using chromatin 

immunoprecipitation followed by next generation sequencing 

(ChIP-seq) (Online Methods) 5,6 . To understand the biological importance 

of the resulting chromatin states, we undertook a large-scale, 

systematic data-mining effort, bringing to bear dozens of genome-wide

data sets including gene annotations, expression information, 

evolutionary conservation, regulatory motif instances, compositional 

biases, genome-wide association data, transcription-factor binding, 

DNaseI hypersensitivity and nuclear lamina maps. 

This work provides an unbiased and systematic chromatin-driven 

annotation for every region of the genome at a 200 base pair resolution, 

refining previously described epigenetic states and introducing 



additional ones. Regardless of whether these chromatin states are 

causal in directing regulatory processes, or simply reinforcing independent 

regulatory decisions, these annotations should provide a 

resource for interpreting biological and medical data sets, such as 

genome-wide association studies for diverse phenotypes and could 

potentially help to identify new classes of functional elements. 

RESULTS 

Chromatin states model and comparison to previous work 

Previous analyses have largely focused on characterizing the marks 

predictive of specific classes of genomic elements defined a priori such 

as transcribed regions, promoters or putative enhancers, and using 

the characterization to identify new instances of these classes 5–12 . 

[Figure 1 graphic. Genome browser view of chromosome 7, 116,260–116,360 kb (scale bar, 50 kb), around CAPZA2. Chromatin state tracks: states 3, 5, 7, 8, 10, 11, 13, 15, 16, 17, 18, 19, 24, 25, 26, 36, 37, 38, 39, 43, 44 and 51, grouped as promoter states, transcribed states, active intergenic, repressed and repetitive. Chromatin mark tracks: H3K14ac, H3K23ac, H4K12ac, H2AK9ac, H4K16ac, H2AK5ac, H4K91ac, H3K4ac, H2BK20ac, H3K18ac, H2BK120ac, H3K27ac, H2BK5ac, H2BK12ac, H3K36ac, H4K5ac, H4K8ac, H3K9ac, PolII, CTCF, H2AZ, H3K4me3, H3K4me2, H3K4me1, H3K9me1, H3K79me3, H3K79me2, H3K79me1, H3K27me1, H2BK5me1, H4K20me1, H3K36me3, H3K36me1, H3R2me1, H3R2me2, H3K27me2, H3K27me3, H4R3me2, H3K9me2, H3K9me3 and H4K20me3.]

Figure 1 Example of chromatin state annotation. Input chromatin mark information and resulting chromatin state annotation for a 120-kb region of 

human chromosome 7 surrounding the CAPZA2 gene. For each 200-bp interval, the input ChIP-Seq sequence tag count (black bars) is processed into a 

binary presence and/or absence call for each of 18 acetylation marks (light blue), 20 methylation marks (pink) and CTCF/Pol2/H2AZ (brown). The precise 

combination of these marks in each interval in their spatial context is used to infer the most probable chromatin state assignment (colored boxes). Although 

chromatin states were learned independently of any prior genome annotation, they correlate strongly with upstream and downstream promoters (red), 

5′-proximal and distal transcribed regions (purple), active intergenic regions (yellow), repressed (gray) and repetitive (blue) regions (state descriptions 

shown in Supplementary Table 1). This example illustrates that even when the signal coming from chromatin marks is noisy, the resulting chromatin state 

annotation is very robust, directly interpretable and shows a strong correspondence with the gene annotation. Several spatially coherent transitions are seen 

from large-scale repressed to active intergenic regions near active genes, from upstream to downstream promoter states surrounding the TSS and from 

5′-proximal to distal transcribed regions along the body of the gene. The frequent transitions to state 16 correlate with annotated Alu elements (57% 

overlap versus 4% and 25% for states 13 and 15, respectively). Transitions to state 13 are likely due to enhancer elements in the first intron of CAPZA2, 

a region where regulatory elements are commonly found and correlate with several enhancer marks. The maximum-probability state assignments are shown 

here, and the full posterior probability for each state in this region is shown in Supplementary Figure 1. 
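
The binary presence or absence call described in the caption can be sketched as follows. This is a hedged illustration only: the Poisson background model, the `background_rate` parameter, the 1e-4 threshold and the function names are assumptions introduced here for the example, and the actual calling procedure is the one given in the paper's Online Methods.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via 1 minus the lower cumulative sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def binarize(counts, background_rate, p_threshold=1e-4):
    """Call a mark present (1) in an interval when its tag count is
    unlikely under the assumed Poisson background, else absent (0)."""
    return [1 if poisson_tail(c, background_rate) < p_threshold else 0
            for c in counts]

# Hypothetical tag counts for one mark in ten consecutive 200-bp intervals.
counts = [0, 1, 0, 2, 9, 1, 0, 12, 3, 1]
calls = binarize(counts, background_rate=1.0)
print(calls)  # [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
```

Only the two intervals with strongly elevated counts are called present; modest counts such as 2 or 3 remain below the assumed significance cutoff.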


An unsupervised (without using prior knowledge) local chromatin 

pattern discovery method 13 first demonstrated that many of the 

patterns previously associated with promoters and enhancers could 

be discovered de novo, but did not discover patterns associated with 

broader domains and left the vast majority of the genome unannotated 

(Supplementary Fig. 7). 

Unsupervised HMM approaches that modeled chromatin mark 

signal intensity levels using multivariate normals or nonparametric 

histograms 14–18 have been previously used, but in contrast we use 

a binarization approach that explicitly models the presence/absence 

frequency of each mark. Specifically, we make a local call of whether a 

mark was present in each 200-bp interval, and use a Bernoulli random 

variable to model the probability of detection of each mark in isolation, 

and a product of independent probabilities to model the probability 

of each combination of marks (Online Methods). Our approach 

has the advantage that the model parameters are directly interpretable 

as the frequencies of each mark and each mark combination, in 

contrast to previous approaches for which the biological significance 

of the parameters corresponding to varying signal intensity levels for 

each mark is often unclear. Moreover, the binarization also makes our 


[Figure 2 graphic. (a) Heat map of chromatin mark frequency (color scale 0.01, 0.08, 1) with one row per state and one column per input (18 acetylation marks, PolII, CTCF, H2AZ and 20 methylation marks); states are grouped as promoter, transcribed, active intergenic, repressive and repetitive. (b) Columns: Percent of genome; % ±2 kb TSS; Percent of TSS; xF TSS exact; % RefSeq gene; Expression level; xF ZNF gene; xF 5′ UTR; xF All exons; xF Spliced exons; xF 3′ UTR; xF TES; xF Conserved; xF DNaseI; xF TF binding; xF CpG island; % GC; % Lamina; % Repeat. The final row is 'Genome total/average'. (c) State descriptions, listed below in row order; state numbers are inferred from that order.]

1. Promoter upstream high expr; potential enh looping
2. Promoter upstream med expr; potential enh looping
3. Promoter upstream low expr; potential enh looping
4. Repressed promoter
5. TSS low-med expr; most GC rich
6. TSS med expr
7. TSS high expr
8. Transcribed promoter; highest expr, TSS for active genes
9. Transcribed promoter; highest expr, downstream
10. Transcribed promoter; high expr, near TSS
11. Transcribed promoter; high expr, downstream
12. Transcribed 5′ proximal, higher expr, open chr, TF binding
13. Transcribed 5′ proximal, higher expr, open chr
14. Transcribed 5′ proximal, high expr, open chr
15. Transcribed 5′ proximal, high expr
16. Transcribed 5′ proximal, med expr; Alu repeats
17. Transcribed less 5′ proximal, med expr, open chr
18. Transcribed less 5′ proximal, med expr
19. Transcribed less 5′ proximal, lower expr; Alu repeats
20. Candidate strong enhancer in transcribed regions
21. Spliced exons/GC rich; open chr, TF binding
22. Spliced exons/GC rich
23. Spliced exons/GC rich; Alu repeats
24. Transcribed 5′ distal; exons
25. Transcribed further 5′ distal; exons
26. Transcribed 5′ distal; Alu repeats
27. End of transcription; exons; high expr
28. ZNF genes; KAP-1 repressed state
29. Cand strong distal enh; higher open chr; higher target expr
30. Cand strong distal enh; high open chr; higher target expr
31. Intergenic H2AZ with open chr/TF binding. Cand. distal enh
32. Candidate weak distal enhancer
33. Candidate distal enhancer
34. Proximal to active enhancers; Alu repeats
35. Active intergenic regions not enhancer specific
36. Active intergenic further from enhancers; Alu repeats
37. Non-repressive intergenic domains; Alu repeats
38. H2AZ specific state
39. CTCF island; candidate insulator
40. Unmappable
41. Heterochr; nuclear lamina; most AT rich
42. Heterochr; nuclear lamina; ERVL repeats
43. Heterochr; lower gene depletion
44. Heterochr; ERVL repeats: lower gene/exon depletion
45. Specific repression
46. Simple repeats (CA)n, (TG)n
47. L1/LTR repeats
48. Satellite repeat
49. Satellite repeat; moderate mapping bias
50. Satellite repeat; high mapping bias
51. Satellite repeat/rRNA; extreme mapping bias

Figure 2 Chromatin state definition and functional interpretation. (a) Chromatin mark combinations associated with each state. Each row shows the specific 

combination of marks associated with each chromatin state and the frequencies between 0 and 1 with which they occur (color scale). These correspond to 

the emission probability parameters of the Hidden Markov Model (HMM) learned across the genome during model training (values shown in Supplementary 

Fig. 2). Marks and states colored as in Figure 1. (b) Genomic and functional enrichments of chromatin states. %, percentage; xF, fold enrichment. In order, 

columns are: percentage of the genome assigned to the state; percentage of state that overlaps a 200-bp interval within 2 kb of an annotated RefSeq TSS; 

percentage of RefSeq TSS found in the state; fold enrichment for TSS; percentage of state overlapping a RefSeq transcribed region; average expression level 

of genomic intervals overlapping the state; fold enrichment for zinc-finger–named genes; fold enrichment for RefSeq 5′ Untranslated Region (5′-UTR) exons

and introns; fold enrichment for RefSeq exons; fold enrichment for spliced exons (2nd exon or later); fold enrichment for RefSeq 3′ Untranslated Region 

(3′-UTR) exons and introns; fold enrichment for RefSeq transcription end sites (TES); fold enrichment for PhastCons conserved elements; fold enrichment 

for DNaseI hypersensitive sites; median fold enrichment for transcription factor binding sites over a set of experiments (expanded in Supplementary 

Fig. 23); fold-enrichment for CpG islands; percentage of GC nucleotides; percent overlapping experimental nuclear lamina data; percent overlapping a 

RepeatMasker element (expanded in Supplementary Fig. 31). All enrichments are based on the posterior probability assignments. Genome total indicates 

the total percentage of 200-bp intervals intersecting the feature or the genome average for expression and percent GC. (c) Brief description of biological state 

function and interpretation (chr, chromatin; enh, enhancer, full descriptions in Supplementary Table 1). 




[Figure 3 graphic. (a) Table of fold enrichment (with corrected P-value in parentheses) for the gene GO categories cell cycle phase, embryonic development, chromatin, response to DNA damage, RNA processing and T-cell activation, by chromatin state at the TSS of the corresponding gene (states 3–8). (b) Fold enrichment versus distance from transcription start site for promoter states, in three panels: dual peaking (states 1–3), TSS centered (states 4–7) and downstream (states 8–11). (c) Positional plots for transcribed states 12–28: number of genes versus distance from transcription start site, and enrichment versus distance from spliced exon start and distance from transcription end site.]

Figure 3 Promoter and transcribed chromatin states show distinct functional and positional enrichments. (a) Distinct Gene Ontology (GO) functional 

enrichments (fold and corrected P-values) found for genes associated with different promoter states at their TSS. For additional states and GO terms, see 

Supplementary Figure 29. (b) Distinct positional biases of promoter states with respect to nearest RefSeq TSS distinguish states peaking upstream, only 

downstream and centered at the TSS. (c) Positional biases of transcribed states with respect to TSS, nearest spliced exon start and transcription end 

sites (TES). These distinguish 5′-proximal states (12–23, left panel), 5′-distal states (24–28), states strongly enriched for spliced exons (middle panel, 

see also Supplementary Fig. 24 for plot for states 24–28) and TES-associated states (with state 27 being particularly precisely positioned, right panel). 

model less prone to forming states overfitting potentially insignificant 

variations in signal intensity levels. In contrast to models that use a 

multivariate normal distribution, our method avoids this strong parametric 

assumption, which is generally violated by the often relatively 

small discrete counts found in ChIP-seq experiments, enabling more 

robust models to be inferred. In comparison to the models previously 

inferred based on a nonparametric histogram strategy 18 , our binarization 

approach uses an order of magnitude fewer parameters per state, 

further increasing model robustness and interpretability. 
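
The emission model described above, with one Bernoulli variable per mark and a product of independent probabilities per state, can be written out as a short sketch. The mark names are taken from the figures, but the two states and every frequency value below are hypothetical numbers chosen for illustration, and `log_emission` is a name introduced here, not part of the authors' software.

```python
import math

def log_emission(observed, freqs):
    """Log probability of a binary mark vector given one state's
    per-mark frequencies, assuming marks are independent given the state."""
    total = 0.0
    for mark, p in freqs.items():
        total += math.log(p if observed[mark] else 1.0 - p)
    return total

# Hypothetical emission frequencies for two made-up states.
promoter_like = {"H3K4me3": 0.95, "H3K36me3": 0.05, "H3K27me3": 0.02}
transcribed_like = {"H3K4me3": 0.03, "H3K36me3": 0.90, "H3K27me3": 0.02}

# One 200-bp interval in which only H3K4me3 was called present.
observed = {"H3K4me3": 1, "H3K36me3": 0, "H3K27me3": 0}

lp_promoter = log_emission(observed, promoter_like)        # log(0.95 * 0.95 * 0.98)
lp_transcribed = log_emission(observed, transcribed_like)  # log(0.03 * 0.10 * 0.98)
print(lp_promoter > lp_transcribed)  # True
```

Each state then needs only one frequency per mark, which is what makes the parameters directly readable as mark frequencies.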

We developed a procedure for learning sets of chromatin states 

across a range of model complexities. For a given number of states and 

from a set of initial parameters, standard expectation maximization 

based procedures enable simultaneous local optimization of the state 

definitions (emission and transition probabilities) and the corresponding 

genome annotation consistent with the observed data. However,

the model inferred and its quality can depend on the initial set of

parameters, which can confound comparisons between models with different

numbers of states learned from independent initializations. We therefore

used a two-stage process that first selected a 79-state model, which

had the highest complexity-penalized likelihood score across a large

compendium of randomly initialized models of varying complexity.

We then pruned and optimized this model down to smaller numbers 

of states, leading to a model with 51 states that were relatively 

consistently recovered across the compendium of models, and that 

sufficiently captured all states found in larger models for which we 

could give a distinct biological interpretation (see Online Methods). 

This enabled us to maintain a relatively small number of states while 

capturing most of the unique biology uncovered across our compendium 

of randomly initialized models. In other words, this

procedure enabled us to maximize biological interpretability, while 

minimizing model complexity. We further ensured that general 

properties of the resulting model validated our approach, including 

robustness to varying thresholds and different background models, 

and independence of marks given a chromatin state (Supplementary 

Notes, Supplementary Figs. 8–21 and Supplementary Table 2). 
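
As a rough illustration of a complexity-penalized likelihood score, the sketch below counts the free parameters of a Bernoulli-emission HMM and applies a BIC-style penalty. The use of BIC specifically, the parameter-counting convention and the function names are assumptions made for this example; the text above states only that a complexity-penalized likelihood was used. The input count of 41 follows from the 38 histone marks plus H2AZ, PolII and CTCF.

```python
import math

def n_free_parameters(n_states, n_marks):
    """Free parameters of an HMM with independent Bernoulli emissions."""
    emissions = n_states * n_marks          # one frequency per state and mark
    transitions = n_states * (n_states - 1) # each row of the matrix sums to 1
    initial = n_states - 1                  # initial distribution sums to 1
    return emissions + transitions + initial

def bic_score(log_likelihood, n_states, n_marks, n_intervals):
    """BIC-style score (assumed form): lower values are preferred."""
    k = n_free_parameters(n_states, n_marks)
    return -2.0 * log_likelihood + k * math.log(n_intervals)

# Parameter counts for the two model sizes mentioned in the text.
params_51 = n_free_parameters(51, 41)
params_79 = n_free_parameters(79, 41)
print(params_51, params_79)  # 4691 9479
```

Under this counting convention the 79-state model has roughly twice as many free parameters as the 51-state model, so it must achieve a correspondingly higher likelihood to score better.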

We next describe the likely biological functions of the 51 discovered 

chromatin states, divided into five large groups. 

Promoter-associated states 

The first group of states, states 1–11, all had high enrichment for 

promoter regions: 40–89% of each state was within 2 kb of a RefSeq 

TSS, compared with 2.7% genome-wide (P < 10^−200 for all states). 




Figure 4 SNP and GWAS enrichments for 

chromatin states. (a) Several chromatin states 

show enrichments for disease association 

data sets. For each state is shown: genome 

percentage; fold enrichment for SNPs from the 

HapMap CEU population; fold enrichment from a 

collection of 1,640 GWAS SNPs associated with 

a variety of diseases and traits from numerous 

studies 25 ; fold enrichment of GWAS SNPs 

relative to the HapMap CEU SNP enrichment; 

significance of GWAS SNPs relative to the 

underlying SNP frequency (when the corrected 

P-value < 0.01). (b) Example of intergenic 

SNP in GWAS-enriched state 33, found 40 kb 

downstream of the IKZF2 gene and associated 

with plasma eosinophil count levels 26 . SNP 

significance as reported 26 is shown for each 

SNP in the region (blue circles) and associated 

chromatin state annotation (similar to Fig. 1). 

Red circle denotes top SNP and its overlap with 

state 33. In addition to top SNPs, secondary 

SNPs were also frequently found at or near 

GWAS-enriched states in several cases. 

These states accounted for 59% of all RefSeq TSS although they 

covered only 1.3% of the genome. These states all had a high frequency of 

H3K4me3 in common, as well as significant enrichments for DNaseI 

hypersensitive sites, CpG islands, evolutionarily conserved motifs and 

bound transcription factors (Fig. 2). They differed, however, in the

presence and levels of other associated marks, primarily H3K79me2/3,

H4K20me1, H3K4me1/2 and H3K9me1, and of numerous acetylations,

leading to varying strength of the aforementioned functional

enrichments, and varying expression levels of the downstream genes 

(Supplementary Figs. 22 and 23). 
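
The percentages quoted above can be turned into fold enrichments with simple arithmetic. The sketch below assumes the usual definition, the fraction observed divided by the fraction expected from genome coverage; the function name is introduced here, and the resulting numbers are derived from the figures in the text rather than reported by the authors.

```python
def fold_enrichment(observed_fraction, expected_fraction):
    """Ratio of an observed overlap fraction to its genome-wide expectation."""
    return observed_fraction / expected_fraction

# Promoter states hold 59% of RefSeq TSS while covering 1.3% of the genome.
tss_fold = fold_enrichment(0.59, 0.013)

# 40-89% of each promoter state lies within 2 kb of a TSS, versus 2.7% genome-wide.
low_fold = fold_enrichment(0.40, 0.027)
high_fold = fold_enrichment(0.89, 0.027)

print(round(tss_fold, 1), round(low_fold, 1), round(high_fold, 1))  # 45.4 14.8 33.0
```

Taken together, the promoter group is therefore enriched about 45-fold for annotated TSS relative to its share of the genome.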

Promoter states differed in the enrichment of Gene Ontology (GO) 

terms of associated genes including cell cycle, embryonic development, 

RNA processing and T-cell activation (Fig. 3a). For instance, the term 

‘embryonic development’ is specifically enriched in state 4, whereas 

the term ‘T-cell activation’ is specifically enriched in state 8. Promoter 

states also differed in their preferentially enriched positions with respect 

to the TSS of associated genes (Fig. 3b). States 4–7 were most concentrated 

over the TSS (showing upwards of 100-fold enrichment), states 

8–11 peaked between 400 bp and 1,200 bp downstream of the TSS and 

corresponded to transcribed promoter regions of expressed genes and 

states 1–3 peaked both upstream and downstream of the TSS. 

Transcription-associated states 

The second large group of chromatin states consisted of 17 

transcription-associated states. These are 70–95% contained within 

RefSeq-annotated transcribed regions compared to 36% for the rest 

of the genome (Fig. 2b, P < 10^−200 for all states). This group was not

predominantly associated with a single mark, but instead defined by 

combinations of seven marks, H3K79me3, H3K79me2, H3K79me1, 

H3K27me1, H2BK5me1, H4K20me1 and H3K36me3 (Fig. 2a). 

Inspection of the transition frequencies between these states revealed 

subgroups of states that are associated with 5′-proximal or 5′-distal 

locations and with different expression levels (Fig. 2c, Supplementary 

Notes, Supplementary Table 1 and Supplementary Fig. 4). 

We observed several states strongly enriched for spliced exons (states 

21–25 and 27–28 with 5.7- to 9.7-fold enrichments) (Figs. 2b and 3c and 

Supplementary Fig. 24). Spliced exons were previously reported to be 

enriched in several individual marks 19–21 . In contrast to these previous 

studies, the combinatorial approach we have taken here shows that 

[Figure 4, panels a and b. (a) Table with columns: State; Percent genome; HapMap CEU SNP; GWAS; HapMap CEU SNP and GWAS P value (P values listed: 4.6E-04, 3.2E-03, 5.2E-05, 5.8E-04, 3.6E-06). (b) Tracks for promoter states, transcribed states, active intergenic states (including state 33), repressed states, repetitive states, human mRNAs, spliced ESTs and mammalian conservation; y axis, –log P (0–6).]

individual marks in spliced exonic states are also frequently detected in 

several other states that show only a modest 1.3- to 1.6-fold enrichment 

for spliced exons (e.g., states 12, 13, 14 and 17). This suggests that the 

chromatin signature of spliced exons is not solely defined by the presence 

of the previously reported H3K36me3, H2BK5me1, H4K20me1 

and H3K79me1 marks, but also by their specific combinations and the absence

of H3K4me2, H3K9me1 and H3K79me2/3. 

State 27 showed a 12.5-fold enrichment for transcription end sites 

(TES) with its enrichment peaking directly over these locations (Fig. 3c). 

It was characterized both by the presence of H3K36me3, PolII and 

H4K20me1 and the absence of H3K4me1, H3K4me2 and H3K4me3, 

distinguishing it from other transcribed states with higher PolII or 

H3K36me3 frequencies. This suggests a distinct signature for 3′ ends of 

genes for which, to our knowledge, no specific chromatin signature had 

been described before. This was further validated by a 3.4-fold signal 

enrichment for the elongating form of PolII surveyed in an independent 

study 22 (Supplementary Fig. 25), even though our input data did not 

distinguish between the elongating and non-elongating form. 

State 28 showed a 112-fold enrichment in zinc-finger genes, which 

comprise 58% of the state. This state was characterized by a high frequency

of H3K9me3, H4K20me3 and H3K36me3 and a relatively low

frequency of other marks. This specific combination has been independently 

reported as marking regions of KAP1 binding, a zinc-finger– 

specific co-repressor, which also shows a specific 44-fold enrichment 

for state 28 (refs. 23,24). Although the association of H3K9me3 and 

H4K20me3 with zinc-finger genes has been previously reported 5 , the 

de novo discovery of this highly specific signature of zinc-finger genes 

illustrates the utility of the methodology and also reveals the additional 

presence of H3K36me3 and lower frequency of other marks as 

complementing the signature of zinc-finger genes. 

Active intergenic states 

The third broad class of chromatin states consisted of 11 active 

intergenic states (states 29–39), including several classes of candidate 

enhancer regions, insulator regions and other regions proximal 

to expressed genes (Supplementary Notes). These states were 

associated with higher frequencies for H3K4me1, H2AZ, numerous 

acetylation marks and/or CTCF and with lower frequencies 

for other methylation marks (Fig. 2a and Supplementary Figs. 2 

[Figure 4b, continued: locus plot marking SNP rs12619285 and the IKZF2 gene; x axis, position (Mb), 213.4–213.8.]

[Figure 5, panels a and b: ROC curves (x axis, false-positive rate, 0–0.03; y axis, true-positive rate). (a) RefSeq gene transcription start sites: chromatin states ordered (CD4T cells only), individual marks (CD4T; H3K4me3, H3K9ac and Pol2 labeled), CAGE tags (all cell types) and H3K4me3 at varying cutoffs. (b) RefSeq gene transcripts: chromatin states ordered (CD4T cells only), individual marks (CD4T cells; H4K20me1, H3K79me3, H3K36me3, H2BK5me1, H3K79me1 and H3K79me2 labeled) and expressed sequence tags (all cell types).]

and 3). They occurred primarily away from promoter regions 

(85–97% outside 2 kb of a TSS) and outside of transcribed genes 

(48–64% outside of RefSeq annotations, Fig. 2b). When they overlapped 

gene annotations, it was mainly in regions that were repressed 

or not highly expressed (see expression column in Fig. 2b). 

States 29–33 were notable as they corresponded to smaller fractions 

of the genome specifically associated with greater DNaseI 

hypersensitivity, transcription factor binding and regulatory motif 

instances and are likely to represent enhancer regions (Fig. 2 and 

Supplementary Fig. 23). Although these candidate enhancer states all 

shared higher H3K4me1 frequencies, they showed differences in the 

expression levels of downstream genes associated with subtle differences 

in their specific mark combinations (Supplementary Fig. 22). 

For instance, genes downstream of state 30 had a consistently higher 

average expression level than genes downstream of state 31 (P < 0.001 

at 10 kb, two-sided t-test). The two states differed in the frequency of 

several acetylation marks (state 30 relative to 31 showed higher frequency 

for H2BK120ac, H3K27ac and H2BK5ac and lower frequency 

for H4K5ac, H4K8ac) and also in the level of H2AZ (higher in state 

31 than 30), suggesting that these marks may be playing a more 

complex role than previously thought in enhancer regions. 

Several active intergenic states showed significant enrichments 

for genome-wide association study (GWAS) hits (e.g., 3.3-fold for 

candidate enhancer state 33, Fig. 4a), based on a curated database 

of top-scoring single-nucleotide polymorphisms (SNPs) in a range 

of diseases and traits 25 . These states thus provide a likely common 

functional role and means of refining many intergenic SNPs even 

in the absence of other annotations. For example (Fig. 4b), a SNP 

reported to be strongly associated with plasma eosinophil count levels 

in inflammatory diseases (rs12619285) 26 and located 40 kb downstream 

of IKZF2 in an intergenic region devoid of annotations is in 

a section of the genome in the chromatin state 33, which is enriched 

[Figure 5c: table with columns State; CAGE (%); CAGE (%) | not RefSeq TSS; mRNA (%); mRNA (%) | not RefSeq. Overall row: 2, 2, 46, 16.]

Figure 5 Discovery power of chromatin states for genome annotation. 

(a) Comparison of the power to discover TSS for individual chromatin 

marks (red), chromatin states (blue) ordered by their TSS enrichment 

and a directed experimental approach based on CAGE sequence tag data 

read counts from all available cell types 36 (gold), whereas the chromatin 

states and marks use only data from CD4 T-cells. Both chromatin states 

and CAGE tags are compared using a receiver operating characteristic 

(ROC) curve that shows the false-positive (x axis) and true-positive 

(y axis) rates at varying prediction thresholds or increasing numbers 

of states in the task of predicting if a 200-bp interval intersects a 

RefSeq TSS. Thin red curve compares performance of H3K4me3 mark 

at varying intensity thresholds. (b) Comparison of the power to detect 

RefSeq transcribed regions for chromatin states and marks as in a, and 

directed experimental information coming from EST data (gold) based 

on sequence counts from all available cell types 37,38 . (c) Independent 

experimental information provides support that a significant fraction of 

false positives in a and b are genuine unannotated TSS and transcribed 

regions currently missing from RefSeq. Percentage of each state 

supported by a CAGE tag (column 1), and the same percentage for 

locations at least 2 kb away from a RefSeq TSS (column 2), suggests that 

many promoter-associated state assignments outside RefSeq promoters 

are supported by CAGE tag evidence. Similarly, percentage of each state 

overlapping a GenBank mRNA (column 3), and the same percentage 

specifically outside RefSeq genes (column 4), suggest that transcription-associated

state assignments outside RefSeq genes are supported

by mRNA evidence. Similar support is found by GenBank ESTs and 

evolutionarily conserved, predicted new exons (Supplementary Fig. 33). 

for GWAS hits. In contrast, the surrounding region of the genome 

is assigned to other active or repressed intergenic states with no 

significant GWAS association. 

Large-scale repressed states 

The next group of states (40–45) marked large-scale repressed and 

heterochromatic regions, representing 64% of the genome. The two 

most frequently detected modifications in total for all the states in this 

group were H3K27me3 and H3K9me3. State 40, covering 13% of the 

genome, was essentially devoid of any detected modifications, states 

41–42 (25% of the genome) had a higher frequency for H3K9me3 than 

H3K27me3, whereas states 43–45 (26% of the genome) had a higher 

frequency for H3K27me3. States 41–42 as compared to states 43–45 

showed a stronger depletion for genes, promoters and conserved elements 

and stronger association with nuclear lamina regions 27 and the 

darkest-staining chromosomal bands 28 . They also had a higher frequency

of A/T nucleotides (Fig. 2b and Supplementary Figs. 26–28). 

State 45 likely corresponds to targeted gene repression. It showed 

the highest frequency for H3K27me3 and was unique among repressed 

states to show enrichment for TSS. The corresponding genes were 

enriched for development-related GO categories (Supplementary 

Fig. 29), similar to the repressed promoter state 4 marked by 

H3K4me3. However, in contrast to state 4, state 45 showed almost no 

change in acetylation levels in response to histone deacetylase inhibitor 

(HDACi) treatment (Supplementary Fig. 30), suggesting that state 4 

is poised for activation whereas state 45 is stably repressed 29 . 

Repetitive states 

The final group of six states (46–51) showed strong and distinct 

enrichments for specific repetitive elements (Supplementary Fig. 31). 

State 46 had a strong enrichment of simple repeats, specifically 

(CA)n, (TG)n or (CATG)n (44-, 45- and 302-fold, respectively), possibly

due to sequence biases in ChIP-based experiments 30 . State 47 

was characterized specifically by H3K9me3 and enriched for L1 and 

LTR repeats. States 48–51 all had higher frequencies of H4K20me3

and H3K9me3 and were heavily enriched for satellite repeat elements. 



States 49–51 showed seemingly high frequencies for numerous 

modifications, but also strong enrichments in sequence reads from 

a nonspecific antibody (IgG) control 31 (Supplementary Fig. 20), 

suggesting these enrichments are due to a lack of coverage for the 

additional copies of these repeat elements in the reference genome 

assembly 32 , thus illustrating the ability of our model to capture such 

potential artifacts by considering all marks jointly. 

Predictive power for genome annotation 

We next set out to study the predictive power of chromatin states for 

the discovery of functional elements. We focused on two classes of 

elements that benefit from ample experimental information independent 

of chromatin marks, TSS and transcribed regions. We found 

that chromatin states consistently outperformed predictions based on 

individual marks (Fig. 5a,b), emphasizing the importance of using 

[Figure 6, panels a–c. (a) Table of per-state recovery; columns in order, left to right: State, None, H3K4me2, H3K18ac, H3K4me3, H3K79me3, H2BK5me1, H3K36me3, H2BK120ac, H3K9me3, H3K4me1, H4K20me1, H2AZ, CTCF, H2BK5ac, H4K91ac, H3K27me3, H4K20me3, H3K9me1, H4K5ac, H3K79me2, H2BK20ac, H3K27me1, H3K27ac, H3K79me1, H3K27me2, PolII, H3K4ac, H3R2me1, H2AK5ac, H4K8ac, H3K36ac, H3R2me2, H3K9me2, H2BK12ac, H3K9ac, H3K36me1, H4K16ac, H4R3me2, H3K23ac, H4K12ac, H2AK9ac, H3K14ac. (b) Table with columns: State; First 10 greedy; Ref. 38. (c) Line plot: y axis, squared error (0–50); x axis, number of marks (0–40).]

Figure 6 Recovery of chromatin states with subsets of marks. (a) The figure shows the ordering of marks (top, from left to right) based on a greedy 

forward selection algorithm to optimize a squared error penalty on state misassignments (Online Methods). Conditioned on all the marks to the left 

having already been profiled, the mark listed is the optimal selection for one additional mark to be profiled based on the target optimization function. 

Below each mark is the percentage of a state with identical assignments using the subset of marks. (b) Comparison of the percentage of each state 

recovered between the first ten marks based on the greedy method and the ten marks previously used 33 (Supplementary Fig. 39). The two columns after 

the state IDs are the proportion of the states recovered using the greedy algorithm and the set previously used 33 . (c) The figure shows a progressive 

decrease in squared error for state misassignment as a function of the number of marks selected based on the greedy algorithm. 




mark combinations and spatial genomic information (Supplementary 

Notes and Supplementary Fig. 32 for a comparison to k-means clustering 

and a supervised classifier). The prediction performance of 

chromatin states based on just CD4 T-cells was similar to that of cap 

analysis of gene expression (CAGE) tags and expressed sequence tags 

(ESTs) data, even though these were obtained across many diverse cell 

types. This was possible because active and inactive states together 

capture the information about genetic elements across cell type 

boundaries (Fig. 5 and Supplementary Figs. 33–35). Moreover, based 

on our 51-state model, we could predict TSS and transcribed regions 

when applied to occupancy data obtained for a subset of ten chromatin 

marks in CD36 erythrocyte precursors and CD133 hematopoietic 

stem cells 33 (Supplementary Fig. 36). 

We also found that chromatin states revealed candidate promoter 

and transcribed regions not in RefSeq, but further supported by independent 

experimental evidence. Candidate promoters overlapped with 

CAGE tags (Fig. 5c) and intergenic PolII (Supplementary Fig. 37), and 

candidate transcribed regions overlapped GenBank mRNAs (Fig. 5c) 

and EST data (Supplementary Fig. 33). A number of promoter and 

transcribed states outside known genes were also strongly enriched 

for not previously described protein-coding exons predicted using 

evolutionary comparisons of 29 mammals (Lin and M.K., unpublished 

data) (Supplementary Fig. 33). We note that some candidate promoters 

may represent distal enhancers, sharing promoter-associated marks 

potentially due to looping of enhancer to promoter regions 7 . 

Recovery of chromatin states using subsets of marks 

As the large majority of chromatin states were defined by multiple marks, 

we next sought to specifically study the contribution of each mark in 

defining chromatin states. First, we found several notable examples of 

both additive relationships, such as acetylation marks in promoter regions, 

and combinatorial relationships, such as methylation marks associated 

with repressive and repetitive elements (Supplementary Notes and 

Supplementary Fig. 38). We also evaluated varying subsets of chromatin 

marks in their ability to distinguish between chromatin states 

(Supplementary Notes and Supplementary Figs. 39–41). More generally, 

we sought to provide guidelines for selecting subsets of chromatin marks 

to survey in new cell types that would be maximally informative. 

As a proof of principle, we evaluated the recovery power of increasing 

numbers of marks in a greedy way, that is, selecting the best mark given 

all previous selected marks, weighing each state equally and penalizing 

mismatches uniformly (see Online Methods), which provided an 

initial unbiased recommendation of marks to survey for a new cell type 

(Fig. 6). We find that increasing subsets of marks rapidly converge to a 

fairly accurate annotation of chromatin states (Fig. 6c), providing cost-efficient

recommendations for new cell types. In addition to an overall

error score, this analysis provides information on the proportion of each 

state accurately recovered, and specific pairwise state misassignments. 

Such information could be incorporated in a modified scoring function 

to provide chromatin mark recommendations targeted to the 

subset of chromatin states that are of particular biological interest, or 

the particular state distinctions that are most important to each study. 
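The greedy procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `greedy_mark_order`, `toy_loss`, `TOY_COVERAGE` and `ALL_STATES` are hypothetical names, and the toy loss (number of states left unexplained) stands in for the squared-error penalty on state misassignments defined in Online Methods. The mark names are real, but the coverage sets are invented for the example.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def greedy_mark_order(
    marks: Sequence[str],
    loss: Callable[[FrozenSet[str]], float],
) -> List[Tuple[str, float]]:
    """Order marks by greedy forward selection.

    At each step, add the single mark that minimizes `loss` given the
    marks already selected. Returns (mark, loss after adding it) pairs.
    """
    selected: List[str] = []
    remaining = list(marks)
    order: List[Tuple[str, float]] = []
    while remaining:
        # Evaluate every one-mark extension of the current subset.
        scored = [(loss(frozenset(selected + [m])), m) for m in remaining]
        best_loss, best_mark = min(scored)  # ties broken by mark name
        selected.append(best_mark)
        remaining.remove(best_mark)
        order.append((best_mark, best_loss))
    return order


# Hypothetical toy data: each mark "explains" a set of states.
TOY_COVERAGE = {
    "H3K4me3": {1, 2, 3},
    "H3K36me3": {4, 5},
    "H3K27me3": {6},
    "H3K4me1": {3, 4},
}
ALL_STATES = {1, 2, 3, 4, 5, 6}


def toy_loss(subset: FrozenSet[str]) -> float:
    """Number of states not explained by any mark in the subset."""
    covered = set()
    for mark in subset:
        covered |= TOY_COVERAGE[mark]
    return float(len(ALL_STATES - covered))


toy_order = greedy_mark_order(list(TOY_COVERAGE), toy_loss)
```

Because the loss is passed in as a function, the same loop could be reused with a scoring function weighted toward the states or state distinctions of interest, as suggested above.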

DISCUSSION 

The discovery and systematic characterization of chromatin states presented 

here reveals a diverse epigenomic landscape with 51 functionally 

distinct chromatin states. Although the exact number of chromatin states 

can vary based on the number of chromatin marks surveyed and the 

desired resolution at which state differences are studied, our results suggest 

that the genome annotation resulting from these states can extend the 

interpretable part of the human genome, especially outside protein-coding 

genes. The definition of the states themselves revealed numerous insights 

into the combinatorial and additive roles of chromatin marks, sometimes 

hinting at combinations of chromatin marks that were not previously 

described, and the genome-wide annotation of these states exposed many 

previously unannotated candidate functional elements. 

We expect the usefulness of the methods presented here will 

increase as additional genome-wide epigenetic data sets become 

available, and as additional cell types are surveyed systematically. 

Chromatin states can be inferred with virtually any type of epigenetic 

and related information, including histone variants, DNA methylation, 

DNaseI hypersensitivity and binding of chromatin-associated 

and sequence-specific transcription factors. Although we focused on 

a single human cell type, the methods are generally applicable to any 

species and any number of cell types and even whole embryos, albeit 

in mixed cell populations mutually exclusive marks found in different 

subsets of cells could potentially be interpreted as co-occurring. 

Specifically for understanding epigenomic dynamics, chromatin 

states can play a central role going forward, as they provide a uniform 

language for interpreting and comparing diverse epigenetic data 

sets, for selecting and prioritizing chromatin marks for additional 

cell types and for summarizing complex relationships of dozens of 

marks in directly interpretable chromatin states. As several large-scale

data production efforts are currently underway to map the

epigenomes of many more cell types, exemplified by the ENCODE 34 , 

modENCODE 35 and Epigenome Roadmap projects

(http://www.roadmapepigenomics.org/), chromatin states will likely play a key

role in the understanding of the human epigenome and its role in 

development, health and disease. 

Methods 

Methods and any associated references are available in the online version 

of the paper at http://www.nature.com/naturebiotechnology/. 


Acknowledgments 

We thank P. Kheradpour for regulatory motif instances and M.F. Lin for predicted 

new exons. We thank M. Garber, A. Siepel, K. Lindblad-Toh, and E. Lander for use of 

comparative information on 29 mammals. We thank B. Bernstein, N. Shoresh, C. Epstein 

and T. Mikkelsen for helpful discussions. We thank L. Goff, C. Bristow, R. Sealfon and 

all members of the MIT CompBio Group for comments, feedback and support. This 

material is based upon work supported by the National Science Foundation under award 

no. 0905968 and funding from the US National Human Genome Research Institute 

(NHGRI) under awards U54-HG004570 and RC1-HG005334. 

AUTHOR CONTRIBUTIONS 

J.E. and M.K. developed the method, analyzed results and wrote the paper. 



Published online at http://www.nature.com/naturebiotechnology/. 

Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Bernstein, B.E., Meissner, A. & Lander, E.S. The mammalian epigenome. Cell 128, 

669–681 (2007). 

2. Kouzarides, T. Chromatin modifications and their function. Cell 128, 693–705 

(2007). 

3. Strahl, B.D. & Allis, C.D. The language of covalent histone modifications. Nature 

403, 41–45 (2000). 

4. Schreiber, S.L. & Bernstein, B.E. Signaling network model of chromatin. Cell 111, 

771–778 (2002). 

5. Barski, A. et al. High-resolution profiling of histone methylations in the human 

genome. Cell 129, 823–837 (2007). 

6. Wang, Z. et al. Combinatorial patterns of histone acetylations and methylations in 

the human genome. Nat. Genet. 40, 897–903 (2008). 




7. Heintzman, N.D. et al. Distinct and predictive chromatin signatures of transcriptional 

promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007). 

8. Heintzman, N.D. et al. Histone modifications at human enhancers reflect global 

cell-type-specific gene expression. Nature 459, 108–112 (2009). 

9. Guttman, M. et al. Chromatin signature reveals over a thousand highly conserved 

large non-coding RNAs in mammals. Nature 458, 223–227 (2009). 

10. Hon, G., Wang, W. & Ren, B. Discovery and annotation of functional chromatin 

signatures in the human genome. PLoS Comput. Biol. 5, e1000566 (2009). 

11. Wang, X., Xuan, Z., Zhao, X., Li, Y. & Zhang, M.Q. High-resolution human core-promoter

prediction with CoreBoost_HM. Genome Res. 19, 266–275 (2009).

12. Won, K.J., Chepelev, I., Ren, B. & Wang, W. Prediction of regulatory elements in 

mammalian genomes using chromatin signatures. BMC Bioinformatics 9, 547 (2008). 

13. Hon, G., Ren, B. & Wang, W. ChromaSig: a probabilistic approach to finding common 

chromatin signatures in the human genome. PLoS Comput. Biol. 4, e1000201

(2008). 

14. Day, N., Hemmaplardh, A., Thurman, R.E., Stamatoyannopoulos, J.A. & Noble, W.S. 

Unsupervised segmentation of continuous genomic data. Bioinformatics 23, 

1424–1426 (2007). 

15. Jia, L. et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS 

Genet. 5, e1000597 (2009). 

16. Thurman, R.E., Day, N., Noble, W.S. & Stamatoyannopoulos, J.A. Identification of 

higher-order functional domains in the human ENCODE regions. Genome Res. 17, 

917 (2007). 

17. Schuettengruber, B. et al. Functional anatomy of polycomb and trithorax chromatin 

landscapes in Drosophila embryos. PLoS Biol. 7, e13 (2009). 

18. Jaschek, R. & Tanay, A. Spatial clustering of multivariate genomic and epigenomic 

information. in Proceedings of the 13th Annual International Conference on Research 

in Computational Molecular Biology (ed. Batzoglou, S.) 170–183 (Springer, 2009). 

19. Schwartz, S., Meshorer, E. & Ast, G. Chromatin organization marks exon-intron 

structure. Nat. Struct. Mol. Biol. 16, 990–995 (2009). 

20. Kolasinska-Zwierz, P. et al. Differential chromatin marking of introns and expressed 

exons by H3K36me3. Nat. Genet. 41, 376–381 (2009). 

21. Andersson, R., Enroth, S., Rada-Iglesias, A., Wadelius, C. & Komorowski, J. 

Nucleosomes are well positioned in exons and carry characteristic histone 

modifications. Genome Res. 19, 1732–1741 (2009). 

22. Schones, D.E. et al. Dynamic regulation of nucleosome positioning in the human 

genome. Cell 132, 878–898 (2008).

23. Sripathy, S.P., Stevens, J. & Schultz, D.C. The KAP1 corepressor functions to 

coordinate the assembly of de novo HP1-demarcated microenvironments of 

heterochromatin required for KRAB zinc finger protein-mediated transcriptional 

repression. Mol. Cell. Biol. 26, 8623–8638 (2006). 

24. O’Geen, H. et al. Genome-wide analysis of KAP1 binding suggests autoregulation 

of KRAB-ZNFs. PLoS Genet. 3, e89 (2007). 

25. Hindorff, L.A., Junkins, H.A., Mehta, J.P. & Manolio, T.A. A catalog of published 

genome-wide association studies. accessed 

July 22, 2009. 

26. Gudbjartsson, D.F. et al. Sequence variants affecting eosinophil numbers associate 

with asthma and myocardial infarction. Nat. Genet. 41, 342–347 (2009). 

27. Guelen, L. et al. Domain organization of human chromosomes revealed by mapping 

of nuclear lamina interactions. Nature 453, 948–951 (2008). 

28. Furey, T.S. & Haussler, D. Integration of the cytogenetic map with the draft human 

genome sequence. Hum. Mol. Genet. 12, 1037–1044 (2003). 

29. Wang, Z. et al. Genome-wide mapping of HATs and HDACs reveals distinct functions 

in active and inactive genes. Cell 138, 1019–1031 (2009). 

30. Johnson, D.S. et al. Systematic evaluation of variability in ChIP-chip experiments 

using predefined DNA targets. Genome Res. 18, 393–403 (2008). 

31. Zang, C. et al. A clustering approach for identification of enriched domains from 

histone modification ChIP-Seq data. Bioinformatics 25, 1952–1958 (2009). 

32. Zhang, Y., Shin, H., Song, J.S., Lei, Y. & Liu, X.S. Identifying positioned nucleosomes 

with epigenetic marks in human from ChIP-Seq. BMC Genomics 9, 537 (2008). 

33. Cui, K. et al. Chromatin signatures in multipotent human hematopoietic stem cells 

indicate the fate of bivalent genes during differentiation. Cell Stem Cell 4, 80–93 

(2009). 

34. ENCODE Project Consortium. Identification and analysis of functional elements in 

1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 

(2007). 

35. Celniker, S.E. et al. Unlocking the secrets of the genome. Nature 459, 927–930 

(2009). 

36. Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and 

evolution. Nat. Genet. 38, 626–635 (2006). 

37. Karolchik, D. et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids 

Res. 36, D773–D779 (2008). 

38. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. & Wheeler, D.L. GenBank: 

update. Nucleic Acids Res. 32, D23–D26 (2004). 



ONLINE METHODS 

Input data for modeling. The initial unprocessed data were bed files containing 

the genomic coordinates and strand orientation of mapped sequence reads 

from ChIP-seq experiments 5,6 . There was a separate bed file for each of the 18 

acetylations, 20 methylations, H2AZ, CTCF and PolII in CD4 T cells. We used 

the updated version of the H3K79me1/2/3 data, as reported 6 , which differs 

from the version first reported 5 . 

To apply the model we first divided the genome into 200-base-pair nonoverlapping 

intervals within which we independently made a call as to whether 

each of the 41 marks was detected as being present or not based on the count 

of tags mapping to the interval. Each tag was uniquely assigned to one interval 

based on the location of the 5′ end of the tag after applying a shift of 100 bases 

in the 5′ to 3′ direction of the tag. The threshold, t, for each mark was based 

on the total number of mapped reads for the mark (Supplementary Table 2), 

and was set to be the smallest integer t such that P(X>t)
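The interval assignment described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes BED-style coordinates (0-based start, end-exclusive), so the 5′ end of a minus-strand tag is taken as `end - 1`, and the names `shifted_position` and `count_tags` are hypothetical. The thresholding step is not shown because the criterion for t is cut off in this text.

```python
from collections import Counter
from typing import Iterable, Tuple

BIN_SIZE = 200   # interval width in base pairs, as in the text
TAG_SHIFT = 100  # shift applied to the 5' end, in the 5'->3' direction

# A read as (chromosome, start, end, strand); BED-style coordinates are
# an assumption of this sketch (0-based start, end-exclusive).
Read = Tuple[str, int, int, str]


def shifted_position(start: int, end: int, strand: str) -> int:
    """Position of the read's 5' end after shifting toward its 3' end."""
    if strand == "+":
        return start + TAG_SHIFT      # 5' end is the leftmost base
    if strand == "-":
        return (end - 1) - TAG_SHIFT  # 5' end is the rightmost base
    raise ValueError(f"unknown strand: {strand!r}")


def count_tags(reads: Iterable[Read]) -> Counter:
    """Count reads per (chromosome, interval index), one interval per read."""
    counts: Counter = Counter()
    for chrom, start, end, strand in reads:
        pos = shifted_position(start, end, strand)
        if pos < 0:
            continue  # shifted off the start of the chromosome
        counts[(chrom, pos // BIN_SIZE)] += 1
    return counts


example_reads = [
    ("chr1", 150, 186, "+"),    # 5' end 150, shifted to 250: interval 1
    ("chr1", 380, 416, "-"),    # 5' end 415, shifted to 315: interval 1
    ("chr1", 1000, 1036, "+"),  # 5' end 1000, shifted to 1100: interval 5
]
tag_counts = count_tags(example_reads)
```

A presence call for each mark would then compare each interval's count with that mark's threshold t.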


The sequence data for computed nucleotide frequencies, CpG islands, repeats 42 

and conservation data were also obtained from the UCSC genome browser. 

The conservation data were based on PhastCon conserved elements using the 

44-way vertebrate alignment 43,44 (Lindblad-Toh, K. et al., Broad Institute, 

unpublished data). Transcription factor binding enrichments were computed

for 18 experiments from numerous publications (Supplementary Fig. 23); the

median enrichment over all these experiments is reported in Figure 2b. The

DNaseI hypersensitivity data were as described 45 and were obtained from the UCSC genome

browser. The nuclear lamina data of human fibroblasts were obtained from

ref. 27. The zinc-finger genes were defined as those that had ‘ZNF’ at the beginning 

of the gene symbol in the RefSeq gene table. For published coordinates 

that were in hg17 we converted them to hg18 using the liftover tool from the 

UCSC genome browser 46 . 

Expression, motif and gene ontology analyses. We obtained the processed 

CD4 T expression data from ref. 47 for both replicates. We averaged the

two replicates and then performed a natural log

transform of the average values. We then standardized all values by subtracting

the mean log-transformed value and then dividing by the s.d. of the log-transformed

values. The genome coordinates of each probe set were obtained from the UCSC 

genome browser. Each 200 bp interval that overlapped a probe set obtained the 

transformed expression score. If multiple probe sets overlapped the same 200 bp

interval, the average of the expression values associated with them was taken.
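The expression transform above amounts to averaging, log-transforming and z-scoring, followed by a per-interval mean. The sketch below illustrates those steps under stated assumptions: the function names and the toy probe identifiers are hypothetical, and the sample standard deviation is used because the text does not say which form of the s.d. was applied.

```python
import math
import statistics
from typing import Dict, List


def standardize_expression(
    rep1: Dict[str, float], rep2: Dict[str, float]
) -> Dict[str, float]:
    """Average two replicates, take the natural log, then z-score."""
    probes = sorted(set(rep1) & set(rep2))
    logged = {p: math.log((rep1[p] + rep2[p]) / 2.0) for p in probes}
    values = list(logged.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample s.d.; an assumption of this sketch
    return {p: (v - mean) / sd for p, v in logged.items()}


def interval_scores(
    probe_intervals: Dict[str, List[int]], z: Dict[str, float]
) -> Dict[int, float]:
    """Give each interval the mean score of the probe sets overlapping it."""
    per_interval: Dict[int, List[float]] = {}
    for probe, intervals in probe_intervals.items():
        for idx in intervals:
            per_interval.setdefault(idx, []).append(z[probe])
    return {idx: statistics.mean(vals) for idx, vals in per_interval.items()}


# Hypothetical toy values for three probe sets in two replicates.
toy_rep1 = {"p1": 10.0, "p2": 100.0, "p3": 1000.0}
toy_rep2 = {"p1": 10.0, "p2": 100.0, "p3": 1000.0}
toy_z = standardize_expression(toy_rep1, toy_rep2)

# p1 overlaps intervals 0 and 1; p3 overlaps interval 1 only.
toy_scores = interval_scores({"p1": [0, 1], "p3": [1]}, toy_z)
```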

We generated transcription factor motif enrichments as described 48 , 

extended for position-weight matrices (PWMs) (Kheradpour, P., MIT, and 

M.K., unpublished data) based on the hard state assignments. 

Gene ontology enrichments were based on the hard state assignment of the 

interval containing the RefSeq annotated TSS of the gene. Enrichments were 

computed using the STEM software (v.1.3.4) and the Bonferroni corrected 

P-values are reported 49 . 

SNP and GWAS analysis. The HapMap CEU 50 data were downloaded from 

the UCSC genome browser. Significant GWAS hits were taken from ref. 25. 

SNPs listed as occurring multiple times were only counted once, and for the 

SNP set listed as a 17-marker haplotype only the first SNP was used giving 

1,640 SNPs. In computing enrichment for HapMap and GWAS SNPs, if two 

SNPs mapped to the same interval, we counted each SNP separately. To determine

if the number of GWAS SNPs in a chromatin state was more significant 

than would be expected based on the general SNP frequency in the state 

we used a binomial distribution where n = 1,640 and p is the proportion of 

HapMap CEU SNPs assigned to the state. We applied a Bonferroni correction 

for testing multiple states and reported only those states that were significantly

enriched with a corrected P < 0.01.
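The binomial test above can be illustrated with a short, dependency-free sketch. Several details are assumptions rather than facts from the text: the upper tail is taken as P(X ≥ k), the number of tests for the Bonferroni correction is taken to be the 51 states, and the example state (2% of HapMap CEU SNPs, 60 GWAS SNPs) is invented. The function names are hypothetical.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)


N_GWAS = 1640   # number of GWAS SNPs, from the text
N_STATES = 51   # assumed number of tests (one per chromatin state)

# Hypothetical state holding 2% of HapMap CEU SNPs and 60 GWAS SNPs.
raw_p = binomial_upper_tail(60, N_GWAS, 0.02)
corrected_p = bonferroni(raw_p, N_STATES)
reported = corrected_p < 0.01
```

Under the null model the hypothetical state would be expected to hold about 33 of the 1,640 GWAS SNPs (1,640 × 0.02), so 60 observed SNPs corresponds to roughly a 1.8-fold enrichment.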

RefSeq TSS and gene transcripts discovery. The ROC curve for the CAGE data 

was based on the number of CAGE tags mapping to a 200 bp interval retrieved from 

the Fantom database and converted from hg17 to hg18 using the UCSC genome 

browser liftover tool 36 . The overlap with EST was based on those EST listed in 

the UCSC genome browser all_est table as of November 29, 2009 (refs. 37,38).

The overlap with GenBank mRNA is based on the overlap with the UCSC genome 

browser mRNA listed in the table as of October 31, 2009 (refs. 37,38). The novel 

exon predictions are from Lin, M.F., MIT, and M.K. (unpublished data).
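The state-based ROC curves of Figure 5 add chromatin states one at a time in order of enrichment. A minimal sketch of that construction is given below, assuming per-state interval counts are already available. The function name `state_roc` and the toy counts are hypothetical, and ranking by the positive fraction within each state is used here as an equivalent of ranking by fold enrichment.

```python
from typing import Dict, List, Tuple


def state_roc(
    state_totals: Dict[int, int],
    state_positives: Dict[int, int],
) -> List[Tuple[int, float, float]]:
    """ROC points from adding states in order of decreasing enrichment.

    state_totals[s]    = number of intervals assigned to state s
    state_positives[s] = number of those overlapping the annotation
    Returns (state, false-positive rate, true-positive rate) per step.
    """
    total_pos = sum(state_positives.values())
    total_neg = sum(state_totals.values()) - total_pos
    # The positive fraction within a state is proportional to its
    # fold enrichment, so both give the same ordering.
    ordered = sorted(
        state_totals,
        key=lambda s: state_positives[s] / state_totals[s],
        reverse=True,
    )
    points: List[Tuple[int, float, float]] = []
    tp = fp = 0
    for s in ordered:
        tp += state_positives[s]
        fp += state_totals[s] - state_positives[s]
        points.append((s, fp / total_neg, tp / total_pos))
    return points


# Hypothetical counts for three states (not values from the paper).
toy_totals = {5: 100, 8: 200, 40: 9700}
toy_positives = {5: 50, 8: 30, 40: 20}
toy_roc = state_roc(toy_totals, toy_positives)
```

Each returned point corresponds to predicting as positive every interval in the states added so far, which is why the curve is a sequence of discrete steps rather than a continuous threshold sweep.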

Mark subset evaluation and selection. When evaluating the coverage of 

a specified subset of marks, first a posterior distribution over the states at 

each interval is computed using the model learned on the full set of marks, 

except that the marks not in the subset are omitted when computing emission 

probabilities. For an interval t we define here $s_{t,k}$ and $f_{t,k}$ to be the posterior

assignment to state k at interval t based on the subset and full set of marks,

respectively. The proportion of state k recovered with a subset of marks is 

defined as: 

$$c_k = \frac{\sum_t \min(f_{t,k},\, s_{t,k})}{\sum_t f_{t,k}}$$

where the sum is over all intervals t in the genome. The ordering of marks presented 

without any prior biological knowledge was based on a greedy forward selection 

algorithm designed to select marks that would minimize this function: 

$$\sum_k (1 - c_k)^2$$

where the sum is over all states. At each step the algorithm would then choose 

the one additional mark, conditioned on all the other previously selected 

marks that would cause this function to be minimized. We note that this 

target function considers all nonidentical state assignments to have equal loss. 

An extension of this approach would be to apply target functions that weigh 

different misassignments differently. The proportion of state k with the full

set of marks that is misassigned to state i using a subset of marks, $m_{k,i}$, as is

presented in Supplementary Figures 39 and 40, is defined as:

$$m_{k,i} = \frac{\sum_t \left( \max(f_{t,k} - s_{t,k},\, 0) \cdot \frac{\max(s_{t,i} - f_{t,i},\, 0)}{\sum_j \max(s_{t,j} - f_{t,j},\, 0)} \right)}{\sum_t f_{t,k}}$$

The first term in the sum in the numerator represents for an interval t the 

amount of posterior probability assigned to state k using the full set of marks 

not assigned using the subset of marks. The second term represents the portion 

of this posterior probability that will be credited to state i. The portion 

credited to state i is the proportion of the surplus posterior state i received 

with the subset of marks in the interval relative to the total surplus posterior 

all states received in the interval. 
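The quantities defined in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: posteriors are held as plain lists indexed [interval][state], and `posterior_fn` is a hypothetical callable that returns the posteriors obtained when only the given marks are used for the emission probabilities.

```python
def recovery(full, sub):
    """Proportion of each state k recovered with a subset of marks (c_k)."""
    num_states = len(full[0])
    num = [0.0] * num_states
    den = [0.0] * num_states
    for f_t, s_t in zip(full, sub):
        for k in range(num_states):
            num[k] += min(f_t[k], s_t[k])
            den[k] += f_t[k]
    return [n / d if d > 0 else 0.0 for n, d in zip(num, den)]


def loss(c):
    """Target function: sum over states of (1 - c_k)^2."""
    return sum((1.0 - c_k) ** 2 for c_k in c)


def misassignment(full, sub):
    """Matrix m[k][i]: proportion of state k (full set) credited to state i (subset)."""
    num_states = len(full[0])
    m = [[0.0] * num_states for _ in range(num_states)]
    den = [0.0] * num_states
    for f_t, s_t in zip(full, sub):
        for k in range(num_states):
            den[k] += f_t[k]
        surplus = [max(s - f, 0.0) for f, s in zip(f_t, s_t)]
        total_surplus = sum(surplus)
        if total_surplus == 0:
            continue  # nothing was reassigned in this interval
        for k in range(num_states):
            deficit = max(f_t[k] - s_t[k], 0.0)
            for i in range(num_states):
                m[k][i] += deficit * surplus[i] / total_surplus
    return [[v / den[k] if den[k] > 0 else 0.0 for v in m[k]]
            for k in range(num_states)]


def greedy_order(marks, full, posterior_fn):
    """Greedy forward selection: add, at each step, the mark minimizing the loss."""
    selected, remaining = [], list(marks)
    while remaining:
        best = min(remaining,
                   key=lambda mark: loss(recovery(full, posterior_fn(selected + [mark]))))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the posteriors in each interval sum to one under both the full set and the subset, the total deficit equals the total surplus, so for each state k the recovered proportion and the misassigned proportions add up to one.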

39. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis 

(Cambridge Univ. Press, 1998). 

40. Neal, R.M. & Hinton, G.E. A view of the EM algorithm that justifies incremental, 

sparse, and other variants. Learn. Graph. Models 89, 355–368 (1998). 

41. Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI reference sequences (RefSeq): a 

curated non-redundant sequence database of genomes, transcripts and proteins. 

Nucleic Acids Res. 35, D61–D65 (2007). 

42. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-3.0 1996-2010 . 

43. Miller, W. et al. 28-way vertebrate alignment and conservation track in the UCSC 

Genome Browser. Genome Res. 17, 1797–1808 (2007). 

44. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and 

yeast genomes. Genome Res. 15, 1034–1050 (2005). 

45. Boyle, A.P. et al. High-resolution mapping and characterization of open chromatin 

across the genome. Cell 132, 311–322 (2008). 

46. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 

(2002). 

47. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. 

Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004). 

48. Kheradpour, P., Stark, A., Roy, S. & Kellis, M. Reliable prediction of regulator 

targets using 12 Drosophila genomes. Genome Res. 17, 1919–1931 (2007). 

49. Ernst, J. & Bar-Joseph, Z. STEM: a tool for the analysis of short time series gene 

expression data. BMC Bioinformatics 7, 191 (2006). 

50. International HapMap Consortium. A second generation human haplotype map of 

over 3.1 million SNPs. Nature 449, 851–861 (2007). 

doi:10.1038/nbt.1662 

nature biotechnology

Articles

The MicroArray Quality Control (MAQC)-II study of 

common practices for the development and validation 

of microarray-based predictive models 

MAQC Consortium * 


Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of 

these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets 

to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in 

rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many 

combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of 

the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model 

performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar 

performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees 

and independent investigators that evaluate methods for global gene expression analysis. 

As part of the United States Food and Drug Administration’s (FDA’s) 

Critical Path Initiative to medical product development (http://www.fda.gov/oc/initiatives/criticalpath/), the MAQC consortium began in

February 2005 with the goal of addressing various microarray reliability 

concerns raised in publications 1–9 pertaining to reproducibility 

of gene signatures. The first phase of this project (MAQC-I) extensively 

evaluated the technical performance of microarray platforms 

in identifying all differentially expressed genes that would potentially 

constitute biomarkers. The MAQC-I found high intra-platform reproducibility 

across test sites, as well as inter-platform concordance of 

differentially expressed gene lists 10–15 and confirmed that microarray 

technology is able to reliably identify differentially expressed genes 

between sample classes or populations 16,17 . Importantly, the MAQC-I 

helped produce companion guidance regarding genomic data submission 

to the FDA (http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm079855.pdf).

Although the MAQC-I focused on the technical aspects of gene 

expression measurements, robust technology platforms alone are 

not sufficient to fully realize the promise of this technology. An 

additional requirement is the development of accurate and reproducible 

multivariate gene expression–based prediction models, also 

referred to as classifiers. Such models take gene expression data from 

a patient as input and as output produce a prediction of a clinically 

relevant outcome for that patient. Therefore, the second phase of the 

project (MAQC-II) has focused on these predictive models 18 , studying 

both how they are developed and how they are evaluated. For 

any given microarray data set, many computational approaches can 

be followed to develop predictive models and to estimate the future 

performance of these models. Understanding the strengths and limitations 

of these various approaches is critical to the formulation 

of guidelines for safe and effective use of preclinical and clinical 

genomic data. Although previous studies have compared and benchmarked 

individual steps in the model development process 19 , no 

prior published work has, to our knowledge, extensively evaluated 

current community practices on the development and validation of 

microarray-based predictive models. 

Microarray-based gene expression data and prediction models are 

increasingly being submitted by the regulated industry to the FDA 

to support medical product development and testing applications 20 . 

For example, gene expression microarray–based assays that have 

been approved by the FDA as diagnostic tests include the Agendia 

MammaPrint microarray to assess prognosis of distant metastasis in 

breast cancer patients 21,22 and the Pathwork Tissue of Origin Test 

to assess the degree of similarity of the RNA expression pattern in 

a patient’s tumor to that in a database of tumor samples for which 

the origin of the tumor is known 23 . Gene expression data have 

also been the basis for the development of PCR-based diagnostic 

assays, including the xDx Allomap test for detection of rejection of 

heart transplants 24 . 

The possible uses of gene expression data are vast and include diagnosis, 

early detection (screening), monitoring of disease progression, 

risk assessment, prognosis, complex medical product characterization 

and prediction of response to treatment (with regard to safety or 

efficacy) with a drug or device labeling intent. The ability to generate 

models in a reproducible fashion is an important consideration in 

predictive model development. 

A lack of consistency in generating classifiers from publicly available 

data is problematic and may be due to any number of factors 

including insufficient annotation, incomplete clinical identifiers, 

coding errors and/or inappropriate use of methodology 25,26 . There 

* A full list of authors and affiliations appears at the end of the paper. Correspondence should be addressed to L.S. (leming.shi@fda.hhs.gov or leming.shi@gmail.com). 

Received 2 March; accepted 30 June; published online 30 July 2010; doi:10.1038/nbt.1665 




are also examples in the literature of classifiers whose performance 

cannot be reproduced on independent data sets because of poor study 

design 27 , poor data quality and/or insufficient cross-validation of all 

model development steps 28,29 . Each of these factors may contribute 

to a certain level of skepticism about claims of performance levels 

achieved by microarray-based classifiers. 

Previous evaluations of the reproducibility of microarray-based 

classifiers, with only very few exceptions 30,31 , have been limited 

to simulation studies or reanalysis of previously published results. 

Frequently, published benchmarking studies have split data sets at 

random, and used one part for training and the other for validation. 

This design assumes that the training and validation sets are produced 

by unbiased sampling of a large, homogeneous population of samples. 

However, specimens in clinical studies are usually accrued over years 

and there may be a shift in the participating patient population and 

also in the methods used to assign disease status owing to changing 

practice standards. There may also be batch effects owing to time 

variations in tissue analysis or due to distinct methods of sample 

collection and handling at different medical centers. As a result,

when samples are derived from sequentially accrued patient populations

(as was done in MAQC-II to mimic clinical reality), with the first cohort

used for developing predictive models and subsequent patients

included in validation, the training and validation samples may differ

in many ways that could influence the prediction performance.

The MAQC-II project was designed to evaluate these sources of 

bias in study design by constructing training and validation sets at 

different times, swapping the test and training sets and also using 

data from diverse preclinical and clinical scenarios. The goals of 

MAQC-II were to survey approaches in genomic model development 

in an attempt to understand sources of variability in prediction 

performance and to assess the influences of endpoint signal strength 

in data. By providing the same data sets to many organizations for 

analysis, but not restricting their data analysis protocols, the project 

has made it possible to evaluate to what extent, if any, results depend 

on the team that performs the analysis. This contrasts with previous 

benchmarking studies that have typically been conducted by single 

laboratories. Enrolling a large number of organizations has also made 

it feasible to test many more approaches than would be practical for 

any single team. MAQC-II also strives to develop good modeling 

practice guidelines, drawing on a large international collaboration of 

experts and the lessons learned in the perhaps unprecedented effort 

of developing and evaluating >30,000 genomic classifiers to predict 

a variety of endpoints from diverse data sets. 

MAQC-II is a collaborative research project that includes 

participants from the FDA, other government agencies, industry 

and academia. This paper describes the MAQC-II structure and 

experimental design and summarizes the main findings and key 

results of the consortium, whose members have learned a great deal 

during the process. The resulting guidelines are general and should 

not be construed as specific recommendations by the FDA for 

regulatory submissions. 

RESULTS 

Generating a unique compendium of >30,000 prediction models 

The MAQC-II consortium was conceived with the primary 

goal of examining model development practices for generating 

binary classifiers in two types of data sets, preclinical and clinical 

(Supplementary Tables 1 and 2). To accomplish this, the project 

leader distributed six data sets containing 13 preclinical and clinical 

endpoints coded A through M (Table 1) to 36 voluntary participating 

data analysis teams representing academia, industry 

and government institutions (Supplementary Table 3). Endpoints 

were coded so as to hide the identities of two negative-control endpoints 

(endpoints I and M, for which class labels were randomly 

assigned and are not predictable by the microarray data) and two 

positive-control endpoints (endpoints H and L, representing the 

sex of patients, which is highly predictable by the microarray data). 

Endpoints A, B and C tested teams’ ability to predict the toxicity 

of chemical agents in rodent lung and liver models. The remaining 

endpoints were predicted from microarray data sets from human 

patients diagnosed with breast cancer (D and E), multiple myeloma 

(F and G) or neuroblastoma (J and K). For the multiple myeloma 

and neuroblastoma data sets, the endpoints represented event-free

survival (abbreviated EFS), meaning a lack of malignancy or disease 

recurrence, and overall survival (abbreviated OS) after 730 days 

(for multiple myeloma) or 900 days (for neuroblastoma) post treatment 

or diagnosis. For breast cancer, the endpoints represented 

estrogen receptor status, a common diagnostic marker of this 

cancer type (abbreviated ‘erpos’), and the success of treatment 

involving chemotherapy followed by surgical resection of a tumor 

(abbreviated ‘pCR’). The biological meaning of the control endpoints 

was known only to the project leader and not revealed to 

the project participants until all model development and external 

validation processes had been completed. 

To evaluate the reproducibility of the models developed by a data 

analysis team for a given data set, we asked teams to submit models 

from two stages of analyses. In the first stage (hereafter referred to as 

the ‘original’ experiment), each team built prediction models for up to 

13 different coded endpoints using six training data sets. Models were 

‘frozen’ against further modification, submitted to the consortium 

and then tested on a blinded validation data set that was not available 

to the analysis teams during training. In the second stage (referred 

to as the ‘swap’ experiment), teams repeated the model building and 

validation process by training models on the original validation set 

and validating them using the original training set. 

To simulate the potential decision-making process for evaluating a 

microarray-based classifier, we established a process for each group 

to receive training data with coded endpoints, propose a data analysis 

protocol (DAP) based on exploratory analysis, receive feedback on 

the protocol and then perform the analysis and validation (Fig. 1). 

Analysis protocols were reviewed internally by other MAQC-II participants 

(at least two reviewers per protocol) and by members of the 

MAQC-II Regulatory Biostatistics Working Group (RBWG), a team 

from the FDA and industry comprising biostatisticians and others 

with extensive model building expertise. Teams were encouraged to 

revise their protocols to incorporate feedback from reviewers, but 

each team was eventually considered responsible for its own analysis 

protocol and incorporating reviewers’ feedback was not mandatory 

(see Online Methods for more details). 

We assembled two large tables from the original and swap experiments 

(Supplementary Tables 1 and 2, respectively) containing 

summary information about the algorithms and analytic steps, or 

‘modeling factors’, used to construct each model and the ‘internal’ 

and ‘external’ performance of each model. Internal performance 

measures the ability of the model to classify the training samples, 

based on cross-validation exercises. External performance measures 

the ability of the model to classify the blinded independent validation 

data. We considered several performance metrics, including Matthews 

Correlation Coefficient (MCC), accuracy, sensitivity, specificity, 

area under the receiver operating characteristic curve (AUC) and 

root mean squared error (r.m.s.e.). These two tables contain data on 

>30,000 models. Here we report performance based on MCC because 



it is informative when the distribution of the two classes in a data set 

is highly skewed and because it is simple to calculate and was available 

for all models. MCC values range from +1 to −1, with +1 indicating 

perfect prediction (that is, all samples classified correctly and none 

incorrectly), 0 indicating random prediction and −1 indicating perfect

inverse prediction. 
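For reference, MCC can be computed from the four cells of a binary confusion matrix. The sketch below is illustrative only; the function names are hypothetical, and returning 0 when a marginal count is zero is a common convention assumed here rather than something specified in the text.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 class labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, +1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom
```

A classifier that labels every sample as the majority class can reach high accuracy on a skewed data set but scores an MCC of 0 under this convention, which illustrates why MCC is informative when class sizes differ greatly.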

The 36 analysis teams applied many different options under each 

modeling factor for developing models (Supplementary Table 4) 

including 17 summary and normalization methods, nine batch-effect 

removal methods, 33 feature selection methods (between 1 and >1,000 

features), 24 classification algorithms and six internal validation 

methods. Such diversity suggests the community’s common practices are 

Table 1 Microarray data sets used for model development and validation in the MAQC-II project

| Data set code | Endpoint code | Endpoint description | Microarray platform | Training set a: number of samples | Training: positives (P) | Training: negatives (N) | Training: P/N ratio | Validation set a: number of samples | Validation: positives (P) | Validation: negatives (N) | Validation: P/N ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Hamner | A | Lung tumorigen vs. non-tumorigen (mouse) | Affymetrix Mouse 430 2.0 | 70 | 26 | 44 | 0.59 | 88 | 28 | 60 | 0.47 |
| Iconix | B | Non-genotoxic liver carcinogens vs. non-carcinogens (rat) | Amersham Uniset Rat 1 Bioarray | 216 | 73 | 143 | 0.51 | 201 | 57 | 144 | 0.40 |
| NIEHS | C | Liver toxicants vs. non-toxicants based on overall necrosis score (rat) | Affymetrix Rat 230 2.0 | 214 | 79 | 135 | 0.58 | 204 | 78 | 126 | 0.62 |
| Breast cancer (BR) | D | Pre-operative treatment response (pCR, pathologic complete response) | Affymetrix Human U133A | 130 | 33 | 97 | 0.34 | 100 | 15 | 85 | 0.18 |
| Breast cancer (BR) | E | Estrogen receptor status (erpos) | Affymetrix Human U133A | 130 | 80 | 50 | 1.6 | 100 | 61 | 39 | 1.56 |
| Multiple myeloma (MM) | F | Overall survival milestone outcome (OS, 730-d cutoff) | Affymetrix Human U133Plus 2.0 | 340 | 51 | 289 | 0.18 | 214 | 27 | 187 | 0.14 |
| Multiple myeloma (MM) | G | Event-free survival (EFS, 730-d cutoff) | Affymetrix Human U133Plus 2.0 | 340 | 84 | 256 | 0.33 | 214 | 34 | 180 | 0.19 |
| Multiple myeloma (MM) | H | Clinical parameter S1 (CPS1). The actual class label is the sex of the patient. Used as a "positive" control endpoint | Affymetrix Human U133Plus 2.0 | 340 | 194 | 146 | 1.33 | 214 | 140 | 74 | 1.89 |
| Multiple myeloma (MM) | I | Clinical parameter R1 (CPR1). The actual class label is randomly assigned. Used as a "negative" control endpoint | Affymetrix Human U133Plus 2.0 | 340 | 200 | 140 | 1.43 | 214 | 122 | 92 | 1.33 |
| Neuroblastoma (NB) | J | Overall survival (OS, 900-d cutoff) | Different versions of Agilent human microarrays | 238 | 22 | 216 | 0.10 | 177 | 39 | 138 | 0.28 |
| Neuroblastoma (NB) | K | Event-free survival (EFS, 900-d cutoff) | Different versions of Agilent human microarrays | 239 | 49 | 190 | 0.26 | 193 | 83 | 110 | 0.75 |
| Neuroblastoma (NB) | L | Newly established parameter S (NEP_S). The actual class label is the sex of the patient. Used as a "positive" control endpoint | Different versions of Agilent human microarrays | 246 | 145 | 101 | 1.44 | 231 | 133 | 98 | 1.36 |
| Neuroblastoma (NB) | M | Newly established parameter R (NEP_R). The actual class label is randomly assigned. Used as a "negative" control endpoint | Different versions of Agilent human microarrays | 246 | 145 | 101 | 1.44 | 253 | 143 | 110 | 1.30 |

Comments and references, by data set:

Hamner: The training set was first published in 2007 (ref. 50) and the validation set was generated for MAQC-II.

Iconix: The data set was first published in 2007 (ref. 51). Raw microarray intensity data, instead of ratio data, were provided for MAQC-II data analysis.

NIEHS: Exploratory visualization of the data set was reported in 2008 (ref. 53). However, the phenotype classification problem was formulated specifically for MAQC-II. A large amount of additional microarray and phenotype data were provided to MAQC-II for cross-platform and cross-tissue comparisons.

Breast cancer (BR): The training set was first published in 2006 (ref. 56) and the validation set was specifically generated for MAQC-II. In addition, two distinct endpoints (D and E) were analyzed in MAQC-II.

Multiple myeloma (MM): The data set was first published in 2006 (ref. 57) and 2007 (ref. 58). However, patient survival data were updated and the raw microarray data (CEL files) were provided specifically for MAQC-II data analysis. In addition, endpoints H and I were designed and analyzed specifically in MAQC-II.

Neuroblastoma (NB): The training data set was first published in 2006 (ref. 63). The validation set (two-color Agilent platform) was generated specifically for MAQC-II. In addition, one-color Agilent platform data were also generated for most samples used in the training and validation sets specifically for MAQC-II to compare the prediction performance of two-color versus one-color platforms. Patient survival data were also updated. In addition, endpoints L and M were designed and analyzed specifically in MAQC-II.

The first three data sets (Hamner, Iconix and NIEHS) are from preclinical toxicogenomics studies, whereas the other three data sets are from clinical studies. Endpoints H and L are positive controls (sex of patient) and endpoints I and M are negative controls (randomly assigned class labels). The nature of H, I, L and M was unknown to MAQC-II participants except for the project leader until all calculations were completed.

a Numbers shown are the actual number of samples used for model development or validation.




Figure 1 Experimental design and timeline 

of the MAQC-II project. Numbers (1–11) 

order the steps of analysis. Step 11 indicates 

when the original training and validation 

data sets were swapped to repeat steps 4–10. 

See main text for description of each step. 

Every effort was made to ensure the complete 

independence of the validation data sets from 

the training sets. Each model is characterized 

by several modeling factors and seven internal 

and external validation performance metrics 

(Supplementary Tables 1 and 2). The modeling 

factors include: (i) organization code; (ii) data 

set code; (iii) endpoint code; (iv) summary and 

normalization; (v) feature selection method; 

(vi) number of features used; (vii) classification 

algorithm; (viii) batch-effect removal method; 

(ix) type of internal validation; and (x) number 

of iterations of internal validation. The seven 

performance metrics for internal validation and 

external validation are: (i) MCC; (ii) accuracy; 

(iii) sensitivity; (iv) specificity; (v) AUC; 

(vi) mean of sensitivity and specificity; and 

(vii) r.m.s.e. s.d. of metrics are also provided for 

internal validation results. 

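[Figure 1 timeline, 9/1/2007 to 2/1/2009. Step labels: 1, exploratory data analysis (36 DATs), 9/07 – 10/07; 2, data analysis protocol (DAP); 3, review and approval of DAP by RBWG; 4, six training data sets (13 endpoints); 5, classifiers are frozen (mark one for validation); 6, MAQC-II's candidate models; 7, validation (blind test) data sets distribution; 8, prediction results; 9-10, meta-data distribution; 10, table of model information; 11, swap prediction results; 12, meta-data analysis and visualization.]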

well represented. For each of the models nominated by a team as being 

the best model for a particular endpoint, we compiled the list of features 

used for both the original and swap experiments (see the MAQC Web 

site at http://edkb.fda.gov/MAQC/). These comprehensive tables represent 

a unique resource. The results that follow describe data mining 

efforts to determine the potential and limitations of current practices for 

developing and validating gene expression–based prediction models. 

Performance depends on endpoint and can be estimated 

during training 

Unlike many previous efforts, the study design of MAQC-II provided 

the opportunity to assess the performance of many different modeling 


approaches on a clinically realistic blinded external validation data set. 

This is especially important in light of the intended clinical or preclinical 

uses of classifiers that are constructed using initial data sets and 

validated for regulatory approval and then are expected to accurately 

predict samples collected under diverse conditions perhaps months or 

years later. To assess the reliability of performance estimates derived 

during model training, we compared the performance on the internal 

training data set with performance on the external validation data set 

for each of the 18,060 models in the original experiment (Fig. 2a).

Models without complete metadata were not included in the analysis. 

We selected 13 ‘candidate models’, representing the best model for 

each endpoint, before external validation was performed. We required 

that each analysis team nominate one model 


for each endpoint they analyzed and we then 

selected one candidate from these nominations 

for each endpoint. We observed a 

higher correlation between internal and 

external performance estimates in terms 

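[Figure 2 plot labels. Panels a and b: external validation (MCC) plotted against internal validation (MCC); a, r = 0.840, N = 18,060; b, r = 0.951, N = 13. Panel c: MCC by endpoint, with endpoints along the x axis in the order L, C, H, E, K, J, B, D, G, F, A, I, M and the numbers 1796, 970, 866, 1143, 1079, 2263, 1192, 2905, 877, 863, 1569, 807, 1730 printed beneath them (total 18,060). An additional plot label reads r = 0.8495, N = 17,092.]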

Figure 2 Model performance on internal 

validation compared with external validation. 

(a) Performance of 18,060 models that were 

validated with blinded validation data. 

(b) Performance of 13 candidate models. 

r, Pearson correlation coefficient; N, number 

of models. Candidate models with binary and 

continuous prediction values are marked as 

circles and squares, respectively, and the 

standard error estimate was obtained using 

500-times resampling with bagging of the 

prediction results from each model. (c) Distribution 

of MCC values of all models for each endpoint in 

internal (left, yellow) and external (right, green) 

validation performance. Endpoints H and L (sex of 

the patients) are included as positive controls and 

endpoints I and M (randomly assigned sample 

class labels) as negative controls. Boxes indicate 

the 25% and 75% percentiles, and whiskers 

indicate the 5% and 95% percentiles. 




Figure 3 Performance, measured using MCC, 

of the best models nominated by the 17 data 

analysis teams (DATs) that analyzed all 13 

endpoints in the original training-validation 

experiment. The median MCC value for 

an endpoint, representative of the level of 

predictability of the endpoint, was calculated

based on values from the 17 data analysis 

teams. The mean MCC value for a data analysis 

team, representative of the team’s proficiency 

in developing predictive models, was calculated 

based on values from the 11 non-random 

endpoints (excluding negative controls I and M). 

Red boxes highlight candidate models. Lack 

of a red box in an endpoint indicates that the 

candidate model was developed by a data analysis 

team that did not analyze all 13 endpoints. 

[Figure 3 heat map layout: rows, top to bottom, DAT24, DAT13, DAT25, DAT11, DAT12, DAT32, DAT10, DAT20, DAT4, DAT18, DAT36, DAT29, DAT35, DAT7, DAT19, DAT33, DAT3, Median, Candidate; columns, left to right, Mean*, L, H, C, E, K, J, B, D, A, G, F, I, M.]

of MCC for the selected candidate models

(r = 0.951, n = 13, Fig. 2b) than for the overall

set of models (r = 0.840, n = 18,060, Fig. 2a),

suggesting that extensive peer review of 

analysis protocols was able to avoid selecting 

models that could result in less reliable 

predictions in external validation. Yet, even 

for the hand-selected candidate models, there is noticeable bias in the 

performance estimated from internal validation. That is, the internal 

validation performance is higher than the external validation performance 

for most endpoints (Fig. 2b). However, for some endpoints 

and for some model building methods or teams, internal and external 

performance correlations were more modest as described in the following 

sections. 
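The correlation coefficients quoted for Figure 2 are Pearson correlations between paired internal and external MCC values, one pair per model. A minimal sketch of that calculation follows; the function name and the list-based inputs are assumptions for illustration, and no values from the study are reproduced here.

```python
import math


def pearson_r(internal, external):
    """Pearson correlation between paired internal and external MCC values."""
    n = len(internal)
    mean_x = sum(internal) / n
    mean_y = sum(external) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(internal, external))
    sxx = sum((x - mean_x) ** 2 for x in internal)
    syy = sum((y - mean_y) ** 2 for y in external)
    return sxy / math.sqrt(sxx * syy)
```

A high r indicates that models ranked well by cross-validation also tend to rank well on the blinded validation set. It does not by itself rule out a systematic optimistic bias, because adding a constant to every internal estimate leaves r unchanged.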

To evaluate whether some endpoints might be more predictable 

than others and to calibrate performance against the positive- and 

negative-control endpoints, we assessed all models generated for each 

endpoint (Fig. 2c). We observed a clear dependence of prediction 

performance on endpoint. For example, endpoints C (liver necrosis 

score of rats treated with hepatotoxicants), E (estrogen receptor status 

of breast cancer patients), and H and L (sex of the multiple myeloma 

and neuroblastoma patients, respectively) were the easiest to predict 

(mean MCC > 0.7). Toxicological endpoints A and B and disease 

progression endpoints D, F, G, J and K were more difficult to predict 

(mean MCC ~0.1–0.4). Negative-control endpoints I and M were 

totally unpredictable (mean MCC ~0), as expected. For 11 endpoints 

(excluding the negative controls), a large proportion of the submitted 

models predicted the endpoint significantly better than chance (MCC 

> 0) and for a given endpoint many models performed similarly well 

on both internal and external validation (see the distribution of MCC 

in Fig. 2c). On the other hand, not all the submitted models performed 

equally well for any given endpoint. Some models performed 

no better than chance, even for some of the easy-to-predict endpoints, 

suggesting that additional factors were responsible for differences in 

model performance. 

Data analysis teams show different proficiency 

Next, we summarized the external validation performance of the 

models nominated by the 17 teams that analyzed all 13 endpoints 

(Fig. 3). Nominated models represent a team’s best assessment of its 

model-building effort. The mean external validation MCC per team 

over 11 endpoints, excluding negative controls I and M, varied from 

0.532 for data analysis team (DAT)24 to 0.263 for DAT3, indicating 

appreciable differences in performance of the models developed by different 

teams for the same data. Similar trends were observed when AUC
was used as the performance metric (Supplementary Table 5) or when
the original training and validation sets were swapped (Supplementary
Tables 6 and 7). Table 2 summarizes the modeling approaches that
were used by two or more MAQC-II data analysis teams.

[Figure 3. External validation MCC of the nominated models, shown by data analysis team code and by endpoint (NB pos, MM pos, Rat liver necr., BR erpos, NB EFS, NB OS, Rat liver tumor, BR pCR, Mouse lung tumor, MM EFS, MM OS, MM neg, NB neg).]

Many factors may have played a role in the difference of external validation 

performance between teams. For instance, teams used different 

modeling factors, criteria for selecting the nominated models, and software 

packages and code. Moreover, some teams may have been more 

proficient at microarray data modeling and better at guarding against 

clerical errors. We noticed substantial variations in performance among 

the many K-nearest neighbor algorithm (KNN)-based models developed 

by four analysis teams (Supplementary Fig. 1). Follow-up investigations 

identified a few possible causes leading to the discrepancies in 

performance 32 . For example, DAT20 fixed the parameter ‘number of 

neighbors’ K = 3 in its data analysis protocol for all endpoints, whereas 

DAT18 varied K from 3 to 15 with a step size of 2. This investigation 

also revealed that even a detailed but standardized description of model 

building requested from all groups failed to capture many important 

tuning variables in the process. The subtle modeling differences not 

captured may have contributed to the differing performance levels 

achieved by the data analysis teams. The differences in performance 

for the models developed by various data analysis teams can also be 

observed from the changing patterns of internal and external validation 

performance across the 13 endpoints (Fig. 3, Supplementary 

Tables 5–7 and Supplementary Figs. 2–4). Our observations highlight 

the importance of good modeling practice in developing and validating 

microarray-based predictive models including reporting of computational 

details for results to be replicated 26 . In light of the MAQC-II 

experience, recording structured information about the steps and 

parameters of an analysis process seems highly desirable to facilitate 

peer review and reanalysis of results. 
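The KNN example above can be made concrete with a short sketch. It enumerates the two K settings described in the text and stores one of them in a structured, machine-readable record of the kind this paragraph argues for. The record layout and its field names are hypothetical and chosen for illustration only; they are not a MAQC-II schema, and the `distance_metric` entry merely stands for a tuning variable that a free-text description might omit.

```python
import json

# K settings as described in the text.
dat20_k_values = [3]                      # DAT20: K fixed at 3 for all endpoints
dat18_k_values = list(range(3, 16, 2))    # DAT18: K from 3 to 15, step size 2

# Hypothetical structured record of one analysis protocol (illustrative fields).
protocol_record = {
    "team": "DAT18",
    "classifier": "KNN",
    "k_candidates": dat18_k_values,
    "distance_metric": None,  # not stated in the text; left unfilled on purpose
}

# Serialize the record so it can be archived alongside the results.
record_json = json.dumps(protocol_record, indent=2, sort_keys=True)
```

Keeping unfilled fields explicit, rather than dropping them, makes it visible which decisions were never documented.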

Swap and original analyses lead to consistent results 

To evaluate the reproducibility of the models generated by each team, 

we correlated the performance of each team’s models on the original 

training data set to performance on the validation data set and 

repeated this calculation for the swap experiment (Fig. 4). The correlation 

varied from 0.698–0.966 on the original experiment and from
0.443–0.954 on the swap experiment. For all but three teams (DAT3,
DAT10 and DAT11) the original and swap correlations were within
±0.2, and all but three others (DAT4, DAT13 and DAT36) were within
±0.1, suggesting that the model building process was relatively robust,
at least with respect to generating models with similar performance.

Table 2 Modeling factor options frequently adopted by MAQC-II data analysis teams

Modeling factor              Option       Number of teams   Number of endpoints   Number of models
Summary and normalization    Loess                     12                     3              2,563
                             RMA                        3                     7                 46
                             MAS5                      11                     7              4,947
Batch-effect removal         None                      10                    11              2,281
                             Mean shift                 3                    11              7,279
Feature selection            SAM                        4                    11              3,771
                             FC+P                       8                    11              4,711
                             T-Test                     5                    11                400
                             RFE                        2                    11                647
Number of features           0~9                       10                    11                393
                             10~99                     13                    11              4,445
                             ≥1,000                     3                    11                474
                             100~999                   10                    11              4,298
Classification algorithm     DA                         4                    11                103
                             Tree                       5                    11                358
                             NB                         4                    11                924
                             KNN                        8                    11              6,904
                             SVM                        9                    11                986

Analytic options used by two or more of the 14 teams that submitted models for all endpoints in both
the original and swap experiments. RMA, robust multichip analysis; SAM, significance analysis of
microarrays; FC, fold change; RFE, recursive feature elimination; DA, discriminant analysis; Tree,
decision tree; NB, naive Bayes; KNN, K-nearest neighbors; SVM, support vector machine.
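The per-team comparison in this section rests on a Pearson correlation between internal and external validation MCC, computed once for the original analysis and once for the swap analysis. The sketch below shows that calculation under stated assumptions: the MCC values are made up for a single hypothetical team and are not MAQC-II results, and the helper name `pearson` is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up MCC values for one hypothetical team, one entry per endpoint.
internal_original = [0.90, 0.75, 0.40, 0.20, 0.05]
external_original = [0.85, 0.70, 0.30, 0.25, 0.00]
internal_swap = [0.88, 0.72, 0.35, 0.22, 0.02]
external_swap = [0.80, 0.74, 0.28, 0.15, 0.06]

r_original = pearson(internal_original, external_original)
r_swap = pearson(internal_swap, external_swap)

# The text treats a team as consistent when the two correlations agree within ±0.2.
consistent = abs(r_original - r_swap) <= 0.2
```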

For some data analysis teams the internal validation performance 

drastically overestimated the performance of the same model in predicting 

the validation data. Examination of some of those models 

revealed several reasons, including bias in the feature selection and 

cross-validation process 28 , findings consistent with what was observed 

from a recent literature survey 33 . 

Previously, reanalysis of a widely cited single study 34 found that 

the results in the original publication were very fragile—that is, not 

reproducible if the training and validation sets were swapped 35 . Our 

observations, except for DAT3, DAT11 and DAT36 with correlation 

65% 

of the variability in the external validation performance. All other 

factors explain 1%. 

Figure 4 Correlation between internal and external validation is
dependent on data analysis team. Pearson correlation coefficients
between internal and external validation performance in terms of MCC are
displayed for the 14 teams that submitted models for all 13 endpoints
in both the original (x axis) and swap (y axis) analyses. The unusually low
correlation in the swap analysis for DAT3, DAT11 and DAT36 is a result
of their failure to accurately predict the positive endpoint H, likely due to
operator errors (Supplementary Table 6).

[Figure 5a terms, plotted against percentage of variation: Endpoint; Summary normalization; Classification algorithm; Number of features; Feature selection; Validation iterations; Organization; Batch effect removal; Organization*classification algorithm; Summary normalization*endpoint; Classification algorithm*endpoint; Number of features*endpoint; Feature selection*endpoint; Validation iterations*endpoint; Organization*endpoint; Batch effect removal*endpoint; Organization*classification algorithm*endpoint; Residual.]

The BLUPs reveal the effect of each level of a factor on the corresponding
MCC value. The BLUPs of the main endpoint effect show
that rat liver necrosis, breast cancer estrogen receptor status and the
sex of the patient (endpoints C, E, H and L) are relatively easy to
predict, with an advantage of ~0.2–0.4 contributed to the corresponding
MCC values. The rest of the endpoints are relatively hard to
predict, with a disadvantage of about −0.1 to −0.2 contributed to
the corresponding MCC values. The main factors of normalization,
classification algorithm, the number of selected features and
the feature selection method have an impact of −0.1 to 0.1 on the
corresponding MCC values. Loess normalization was applied to the
endpoints (J, K and L) for the neuroblastoma data set with the two-color
Agilent platform and contributes an advantage of 0.1 to MCC values. Among
the Microarray Analysis Suite version 5 (MAS5), Robust Multichip
Analysis (RMA) and dChip normalization methods that were
applied to all endpoints (A, C, D, E, F, G and H) for Affymetrix data,
the dChip method has a lower BLUP than the others. Because
normalization methods are partially confounded with endpoints, it
may not be suitable to compare methods between different confounded
groups. Among classification methods, discriminant analysis has the
largest positive impact of 0.056 on the MCC values. Regarding the
number of selected features, a larger bin number has a better impact on
the average across endpoints. The bin number is assigned by applying
the ceiling function to the log base 10 of the number of selected features.
All the feature selection methods have a slight impact of −0.025 to 0.025
on MCC values except for recursive feature elimination (RFE), which
has an impact of −0.006. In the plots of the four selected interactions,
the estimated BLUPs vary across endpoints. The large variation across
endpoints implies that the impact of the corresponding modeling factor on
different endpoints can be very different. Among the four interaction
plots (see Supplementary Fig. 6 for a clear labeling of each interaction
term), the corresponding BLUPs of the three-way interaction
of organization, classification algorithm and endpoint show the highest
variation. This may be due to different tuning parameters applied
to individual algorithms for different organizations, as was the case
for KNN 32.
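The bin-number rule stated in this paragraph (the ceiling of the base-10 logarithm of the number of selected features) can be sketched as follows. The function name `feature_bin` is hypothetical, and integer arithmetic is used instead of a floating-point logarithm so that exact powers of ten are never rounded into the wrong bin.

```python
def feature_bin(n_features):
    """Return ceil(log10(n_features)) for a positive integer, using integers only."""
    if n_features < 1:
        raise ValueError("the number of selected features must be at least 1")
    bin_number, power = 0, 1
    # Count how many factors of ten are needed to reach or exceed n_features.
    while power < n_features:
        power *= 10
        bin_number += 1
    return bin_number

# Example sizes and their bins: a model with 11 to 100 features falls in bin 2.
bins = {n: feature_bin(n) for n in (1, 9, 10, 11, 100, 101, 1000, 1001)}
```

Under this rule, a model with exactly 10 features falls in bin 1 and a model with 11 features in bin 2.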

We also analyzed the relative importance of modeling factors on 

external-validation prediction performance using a decision tree 

model 38 . The analysis results revealed observations (Supplementary 

Fig. 7) largely consistent with those above. First, the endpoint code 

was the most influential modeling factor. Second, feature selection 

method, normalization and summarization method, classification 

method and organization code also contributed to prediction performance, 

but their contribution was relatively small. 

Feature list stability is correlated with endpoint predictability 

Prediction performance is the most important criterion for evaluating 

the performance of a predictive model and its modeling process. 

However, the robustness and mechanistic relevance of the model and
the corresponding gene signature are also important (Supplementary

Fig. 8). That is, given comparable prediction performance between 

two modeling processes, the one yielding a more robust and reproducible 

gene signature across similar data sets (e.g., by swapping the 

training and validation sets), which is therefore less susceptible to 

sporadic fluctuations in the data, or the one that provides new insights 

to the underlying biology is preferable. Reproducibility or stability of 

feature sets is best studied by running the same model selection protocol 

on two distinct collections of samples, a scenario only possible, in 

this case, after the blind validation data were distributed to the data 

analysis teams that were asked to perform their analysis after swapping 

their original training and test sets. Supplementary Figures 9 and 10 

show that, although the feature space is extremely large for microarray 

data, different teams and protocols were able to consistently select the 

best-performing features. Analysis of the lists of features indicated that,
for endpoints that were relatively easy to predict, the various data analysis teams
arrived at models that shared more features, and the overlap
between the lists from the original and swap analyses was greater than
for more difficult endpoints (Supplementary Figs. 9–11). Therefore,
the level of stability of feature lists can be associated with the level of difficulty

of the prediction problem (Supplementary Fig. 11), although 

multiple models with different feature lists and comparable performance 

can be found from the same data set 39 . Functional analysis of the 

most frequently selected genes by all data analysis protocols shows
that many of these genes represent biological processes that are highly
relevant to the clinical outcome that is being predicted 36 . The sex-based
endpoints have the best overlap, whereas more difficult survival
endpoints (in which disease processes are confounded by many other
factors) have only marginally better overlap with biological processes
relevant to the disease than that expected by random chance.

Figure 5 Effect of modeling factors on estimates of model performance. (a) Random-effect models of external validation performance (MCC) were
developed to estimate a distinct variance component for each modeling factor and several selected interactions. The estimated variance components
were then divided by their total in order to compare the proportion of variability explained by each modeling factor. The endpoint code contributes the
most to the variability in external validation performance. (b) The BLUP plots of the corresponding factors having proportion of variation larger than 1%
in a. Endpoint abbreviations (Tox., preclinical toxicity; BR, breast cancer; MM, multiple myeloma; NB, neuroblastoma). Endpoints H and L are the sex
of the patient. Summary normalization abbreviations (GA, genetic algorithm; RMA, robust multichip analysis). Classification algorithm abbreviations
(ANN, artificial neural network; DA, discriminant analysis; Forest, random forest; GLM, generalized linear model; KNN, K-nearest neighbors; Logistic,
logistic regression; ML, maximum likelihood; NB, Naïve Bayes; NC, nearest centroid; PLS, partial least squares; RFE, recursive feature elimination;
SMO, sequential minimal optimization; SVM, support vector machine; Tree, decision tree). Feature selection method abbreviations (Bscatter, between-class
scatter; FC, fold change; KS, Kolmogorov-Smirnov algorithm; SAM, significance analysis of microarrays).
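The overlap between feature lists from the original and swap analyses, discussed in this section, can be quantified in several ways. The text does not specify which measure was used, so the sketch below assumes the Jaccard index (size of the intersection divided by size of the union) purely for illustration. The gene identifiers are placeholders, not genes selected in MAQC-II.

```python
def jaccard_overlap(list_a, list_b):
    """Jaccard index of two feature lists: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty lists
    return len(set_a & set_b) / len(union)

# Placeholder feature lists from a hypothetical original and swap analysis.
original_features = ["gene_1", "gene_2", "gene_3"]
swap_features = ["gene_2", "gene_3", "gene_4"]

overlap = jaccard_overlap(original_features, swap_features)  # 2 shared / 4 total
```

A value closer to 1 would indicate a more stable feature list across the two analyses.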

Summary of MAQC-II observations and recommendations 

The MAQC-II data analysis teams comprised a diverse group, some 

of whom were experienced microarray analysts whereas others were 

graduate students with little experience. In aggregate, the group’s 

composition likely mimicked the broad scientific community engaged 

in building and publishing models derived from microarray data. The 

more than 30,000 models developed by 36 data analysis teams for 

13 endpoints from six diverse clinical and preclinical data sets are a 

rich source from which to highlight several important observations. 

First, model prediction performance was largely endpoint (biology) 

dependent (Figs. 2c and 3). The incorporation of multiple data 

sets and endpoints (including positive and negative controls) in the 

MAQC-II study design made this observation possible. Some endpoints 

are highly predictive based on the nature of the data, which 

makes it possible to build good models, provided that sound modeling 

procedures are used. Other endpoints are inherently difficult to predict 

regardless of the model development protocol. 

Second, there are clear differences in proficiency between data 

analysis teams (organizations) and such differences are correlated 

with the level of experience of the team. For example, the top-performing
teams shown in Figure 3 were mainly industrial participants

with many years of experience in microarray data analysis, whereas 

bottom-performing teams were mainly less-experienced graduate 

students or researchers. Based on results from the positive and negative 

endpoints, we noticed that simple errors were sometimes made, 

suggesting rushed efforts due to lack of time or unnoticed implementation 

flaws. This observation strongly suggests that mechanisms are 

needed to ensure the reliability of results presented to the regulatory 

agencies, journal editors and the research community. By examining 

the practices of teams whose models did not perform well, future 

studies might be able to identify pitfalls to be avoided. Likewise, 

practices adopted by top-performing teams can provide the basis for 

developing good modeling practices. 

Third, the internal validation performance from well-implemented, 

unbiased cross-validation shows a high degree of concordance with the 

external validation performance in a strict blinding process (Fig. 2). 

This observation was not possible from previously published studies 

owing to the small number of available endpoints tested in them. 

Fourth, many models with similar performance can be developed 

from a given data set (Fig. 2). Similar prediction performance is 

attainable when using different modeling algorithms and parameters, 

and simple data analysis methods often perform as well as more 

complicated approaches 32,40 . Although it is not essential to include 

the same features in these models to achieve comparable prediction 

performance, endpoints that were easier to predict generally yielded 

models with more common features, when analyzed by different 

teams (Supplementary Fig. 11). 

Finally, applying good modeling practices appeared to be more 

important than the actual choice of a particular algorithm over the 

others within the same step in the modeling process. This can be seen 

in the diverse choices of the modeling factors used by teams that produced 

models that performed well in the blinded validation (Table 2) 

where modeling factors did not universally contribute to variations in 

model performance among good performing teams (Fig. 5). 

Summarized below are the model building steps recommended to 

the MAQC-II data analysis teams. These may be applicable to model 

building practitioners in the general scientific community. 

Step one (design). There is no exclusive set of steps and procedures, 

in the form of a checklist, to be followed by any practitioner for all 

problems. However, normal good practice on the study design and 

the ratio of sample size to classifier complexity should be followed. 

The frequently used options for normalization, feature selection and 

classification are good starting points (Table 2). 

Step two (pilot study or internal validation). This can be accomplished 

by bootstrap or cross-validation such as the ten repeats of a 

fivefold cross-validation procedure adopted by most MAQC-II teams. 

The samples from the pilot study are not replaced for the pivotal 

study; rather they are augmented to achieve ‘appropriate’ target size. 
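The ten repeats of fivefold cross-validation mentioned in step two can be sketched as a generator of training and held-out index sets. This is a minimal illustration under stated assumptions: the function name, the fixed random seed and the unstratified shuffling are choices made for the example, and the teams' actual splitting procedures are not described here.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
            yield repeat, held_out, train_idx, test_idx

# 10 repeats x 5 folds gives 50 train/test splits of a 23-sample data set.
splits = list(repeated_kfold(n_samples=23))
```

Within each repeat, every sample is held out exactly once, so each repeat gives a full set of out-of-fold predictions.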

Step three (pivotal study or external validation). Many investigators 

assume that the most conservative approach to a pivotal study is to 

simply obtain a test set completely independent of the training set(s). 

However, it is good to keep in mind the exchange 34,35 regarding the 

fragility of results when the training and validation sets are swapped. 

Results from further resampling (including simple swapping as in 

MAQC-II) across the training and validation sets can provide important 

information about the reliability of the models and the modeling 

procedures, but the complete separation of the training and validation 

sets should be maintained 41 . 

Finally, a perennial issue concerns reuse of the independent validation 

set after modifications to an originally designed and validated 

data analysis algorithm or protocol. Such a process turns the validation 

set into part of the design or training set 42 . Ground rules must 

be developed for avoiding this approach and penalizing it when it 

occurs; and practitioners should guard against using it before such 

ground rules are well established. 

DISCUSSION 

MAQC-II conducted a broad observational study of the current community 

landscape of gene-expression profile–based predictive model 

development. Microarray gene expression profiling is among the most 

commonly used analytical tools in biomedical research. Analysis of 

the high-dimensional data generated by these experiments involves 

multiple steps and several critical decision points that can profoundly 

influence the soundness of the results 43 . An important requirement 

of a sound internal validation is that it must include feature selection 

and parameter optimization within each iteration to avoid overly optimistic 

estimations of prediction performance 28,29,44 . To what extent 

this information has been disseminated and followed by the scientific 

community in current microarray analysis remains unknown 33 . 
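The requirement that feature selection be repeated within each cross-validation iteration can be illustrated with a small sketch. It assumes a toy data set, a simple difference-of-class-means filter and a nearest-centroid classifier, none of which is taken from a specific MAQC-II protocol; all function names are hypothetical. The essential point is that `select_features` only ever sees the training portion of the current fold.

```python
import random

def select_features(rows, labels, n_keep):
    """Rank features by absolute difference of class means on the given rows."""
    def class_mean(j, cls):
        vals = [r[j] for r, y in zip(rows, labels) if y == cls]
        return sum(vals) / len(vals)
    scores = [abs(class_mean(j, 1) - class_mean(j, 0)) for j in range(len(rows[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_keep]

def predict(train_rows, train_labels, test_row, features):
    """Nearest-centroid prediction restricted to the selected features."""
    best_label, best_dist = None, float("inf")
    for cls in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == cls]
        centroid = [sum(r[j] for r in members) / len(members) for j in features]
        dist = sum((test_row[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = cls, dist
    return best_label

def cross_validated_accuracy(rows, labels, n_folds=5, n_keep=5, seed=0):
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_rows = [rows[i] for i in train]
        train_labels = [labels[i] for i in train]
        # Feature selection uses the training portion of this fold only.
        features = select_features(train_rows, train_labels, n_keep)
        correct += sum(predict(train_rows, train_labels, rows[i], features) == labels[i]
                       for i in held_out)
    return correct / len(rows)

# Toy data: feature 0 separates the classes; the other 19 features are noise.
rng = random.Random(1)
labels = [i % 2 for i in range(40)]
rows = [[10.0 * y] + [rng.uniform(-1.0, 1.0) for _ in range(19)] for y in labels]
accuracy = cross_validated_accuracy(rows, labels)
```

Selecting features once on the full data set before splitting would let the held-out samples influence the feature list, which is the source of the optimistic bias described above.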

Concerns have been raised that results published by one group of 

investigators often cannot be confirmed by others even if the same 

data set is used 26 . An inability to confirm results may stem from any 

of several reasons: (i) insufficient information is provided about the 

methodology that describes which analysis has actually been done; 

(ii) data preprocessing (normalization, gene filtering and feature 

selection) is too complicated and insufficiently documented to be 

reproduced; or (iii) incorrect or biased complex analytical methods 26 

are performed. A distinct but related concern is that genomic data may 

yield prediction models that, even if reproducible on the discovery 

data set, cannot be extrapolated well in independent validation. The 

MAQC-II project provided a unique opportunity to address some of 

these concerns. 

Notably, we did not place restrictions on the model building methods 

used by the data analysis teams. Accordingly, they adopted numerous 

different modeling approaches (Table 2 and Supplementary Table 4). 




For example, feature selection methods varied widely, from statistical 

significance tests, to machine learning algorithms, to those more 

reliant on differences in expression amplitude, to those employing 

knowledge of putative biological mechanisms associated with the 

endpoint. Prediction algorithms also varied widely. To make internal 

validation performance results comparable across teams for different 

models, we recommended that a model’s internal performance be
estimated using ten repeats of fivefold cross-validation, but this
recommendation was not strictly followed by all teams, which also
allowed us to survey internal validation approaches. The diversity of

analysis protocols used by the teams is likely to closely resemble that 

of current research going forward, and in this context mimics reality. 

In terms of the space of modeling factors explored, MAQC-II is a survey 

of current practices rather than a randomized, controlled experiment; 

therefore, care should be taken in interpreting the results. For 

example, some teams did not analyze all endpoints, causing missing 

data (models) that may be confounded with other modeling factors. 

Overall, the procedure followed to nominate MAQC-II candidate 

models was quite effective in selecting models that performed reasonably 

well during validation using independent data sets, although 

generally the selected models did not do as well in validation as in 

training. The drop in performance associated with the validation 

highlights the importance of not relying solely on internal validation 

performance, and points to the need to subject every classifier to at 

least one external validation. The selection of the 13 candidate models 

from many nominated models was achieved through a peer-review 

collaborative effort of many experts and could be described as slow, 

tedious and sometimes subjective (e.g., a data analysis team could 

only contribute one of the 13 candidate models). Even though they 

were still subject to over-optimism, the internal and external performance 

estimates of the candidate models were more concordant than 

those of the overall set of models. Thus the review was productive in 

identifying characteristics of reliable models. 

An important lesson learned through MAQC-II is that it is almost 

impossible to retrospectively retrieve and document decisions that 

were made at every step during the feature selection and model development 

stage. This lack of complete description of the model building 

process is likely to be a common reason for the inability of different 

data analysis teams to fully reproduce each other’s results 32 . Therefore, 

although meticulously documenting the classifier building procedure 

can be cumbersome, we recommend that all genomic publications 

include supplementary materials describing the model building and 

evaluation process in an electronic format. MAQC-II is making available 

six data sets with 13 endpoints that can be used in the future as a 

benchmark to verify that software used to implement new approaches 

performs as expected. Subjecting new software to benchmarks against 

these data sets could reassure potential users that the software is 

mature enough to be used for the development of predictive models 

in new data sets. It would seem advantageous to develop alternative 

ways to help determine whether specific implementations of modeling 

approaches and performance evaluation procedures are sound, and to 

identify procedures to capture this information in public databases. 

The findings of the MAQC-II project suggest that when the same 

data sets are provided to a large number of data analysis teams, many 

groups can generate similar results even when different model building 

approaches are followed. This is concordant with studies 29,33 that 

found that given good quality data and an adequate number of informative 

features, most classification methods, if properly used, will yield 

similar predictive performance. This also confirms reports 6,7,39 on 

small data sets by individual groups that have suggested that several 

different feature selection methods and prediction algorithms can 

yield many models that are distinct, but have statistically similar 

performance. Taken together, these results provide perspective on 

the large number of publications in the bioinformatics literature that 

have examined the various steps of the multivariate prediction model 

building process and identified elements that are critical for achieving 

reliable results. 

An important and previously underappreciated observation from 

MAQC-II is that different clinical endpoints represent very different 

levels of classification difficulty. For some endpoints the currently 

available data are sufficient to generate robust models, whereas for 

other endpoints currently available data do not seem to be sufficient 

to yield highly predictive models. An analysis done as part of the 

MAQC-II project and that focused on the breast cancer data demonstrates 

these points in more detail 40 . It is also important to point out 

that for some clinically meaningful endpoints studied in the MAQC-II 

project, gene expression data did not seem to significantly outperform 

models based on clinical covariates alone, highlighting the challenges 

in predicting the outcome of patients in a heterogeneous population 

and the potential need to combine gene expression data with 

clinical covariates (unpublished data). 

The accuracy of the clinical sample annotation information may
also play a role in the difficulty of obtaining accurate prediction results
on validation samples. For example, some samples were misclassified
by almost all models (Supplementary Fig. 12). This is true even for some
samples within the positive control endpoints H and L, as shown
in Supplementary Table 8. Clinical information for the neuroblastoma
patients for whom the positive control endpoint L was uniformly
misclassified was rechecked, and the sex of three out of eight cases
(NB412, NB504 and NB522) was found to be incorrectly annotated.

The companion MAQC-II papers published elsewhere give more 

in-depth analyses of specific issues such as the clinical benefits of 

genomic classifiers (unpublished data), the impact of different 

modeling factors on prediction performance 45 , the objective assessment 

of microarray cross-platform prediction 46 , cross-tissue prediction 

47 , one-color versus two-color prediction comparison 48 , 

functional analysis of gene signatures 36 and recommendation of a 

simple yet robust data analysis protocol based on the KNN 32 . For 

example, we systematically compared the classification performance 

resulting from one- and two-color gene-expression profiles of 

478 neuroblastoma samples and found that analyses based on either 

platform yielded similar classification performance 48 . This newly generated 

one-color data set has been used to evaluate the applicability of 

the KNN-based simple data analysis protocol to future data sets 32 . In 

addition, the MAQC-II Genome-Wide Association Working Group 

assessed the variabilities in genotype calling due to experimental or 

algorithmic factors 49 . 

In summary, MAQC-II has demonstrated that current methods 

commonly used to develop and assess multivariate gene-expression-based
predictors of clinical outcome were used appropriately by

most of the analysis teams in this consortium. However, differences 

in proficiency emerged and this underscores the importance 

of proper implementation of otherwise robust analytical methods. 

Observations based on analysis of the MAQC-II data sets may be 

applicable to other diseases. The MAQC-II data sets are publicly 

available and are expected to be used by the scientific community 

as benchmarks to ensure proper modeling practices. The experience 

with the MAQC-II clinical data sets also reinforces the notion that 

clinical classification problems represent several different degrees 

of prediction difficulty that are likely to be associated with whether 

mRNA abundances measured in a specific data set are informative for 

the specific prediction problem. We anticipate that including other 




types of biological data at the DNA, microRNA, protein or metabolite 

levels will enhance our capability to more accurately predict 

the clinically relevant endpoints. The good modeling practice guidelines 

established by MAQC-II and lessons learned from this unprecedented 

collaboration provide a solid foundation from which other 

high-dimensional biological data could be more reliably used for the 

purpose of predictive and personalized medicine. 

Methods 

Methods and any associated references are available in the online 

version of the paper at http://www.nature.com/naturebiotechnology/. 

Accession codes. All MAQC-II data sets are available through 

GEO (series accession number: GSE16716), the MAQC Web site 

(http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc/), 

ArrayTrack (http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/) 

or CEBS (http://cebs.niehs.nih.gov/) accession 

number: 009-00002-0010-000-3. 



The MAQC-II project was funded in part by the FDA’s Office of Critical Path 

Programs (to L.S.). Participants from the National Institutes of Health (NIH) were 

supported by the Intramural Research Program of NIH, Bethesda, Maryland or 

the Intramural Research Program of the NIH, National Institute of Environmental 

Health Sciences (NIEHS), Research Triangle Park, North Carolina. J.F. was 

supported by the Division of Intramural Research of the NIEHS under contract 

HHSN273200700046U. Participants from the Johns Hopkins University were 

supported by grants from the NIH (1R01GM083084-01 and 1R01RR021967-01A2 

to R.A.I. and T32GM074906 to M.M.). Participants from the Weill Medical College 

of Cornell University were partially supported by the Biomedical Informatics 

Core of the Institutional Clinical and Translational Science Award RFA-RM-07-002.

F.C. acknowledges resources from The HRH Prince Alwaleed Bin Talal Bin

Abdulaziz Alsaud Institute for Computational Biomedicine and from the David A. 

Cofrin Center for Biomedical Information at Weill Cornell. The data set from The 

Hamner Institutes for Health Sciences was supported by a grant from the American 

Chemistry Council’s Long Range Research Initiative. The breast cancer data set 

was generated with support of grants from NIH (R-01 to L.P.), The Breast Cancer 

Research Foundation (to L.P. and W.F.S.) and the Faculty Incentive Funds of the 

University of Texas MD Anderson Cancer Center (to W.F.S.). The data set from 

the University of Arkansas for Medical Sciences was supported by National Cancer 

Institute (NCI) PO1 grant CA55819-01A1, NCI R33 Grant CA97513-01, Donna D. 

and Donald M. Lambert Lebow Fund to Cure Myeloma and Nancy and Steven 

Grand Foundation. We are grateful to the individuals whose gene expression data 

were used in this study. All MAQC-II participants freely donated their time and 

reagents for the completion and analyses of the MAQC-II project. The MAQC-II 

consortium also thanks R. O’Neill for his encouragement and coordination among 

FDA Centers on the formation of the RBWG. The MAQC-II consortium gratefully 

dedicates this work to the memory of R.F. Wagner, who enthusiastically worked on the

MAQC-II project and inspired many of us until he unexpectedly passed away in 

June 2008. 

DISCLAIMER 

This work includes contributions from, and was reviewed by, individuals at the 

FDA, the Environmental Protection Agency (EPA) and the NIH. This work has 

been approved for publication by these agencies, but it does not necessarily reflect 

official agency policy. Certain commercial materials and equipment are identified 

in order to adequately specify experimental procedures. In no case does such 

identification imply recommendation or endorsement by the FDA, the EPA or the 

NIH, nor does it imply that the items identified are necessarily the best available 

for the purpose. 







1. Marshall, E. Getting the noise out of gene arrays. Science 306, 630–631 (2004). 

2. Frantz, S. An array of problems. Nat. Rev. Drug Discov. 4, 362–363 (2005). 

3. Michiels, S., Koscielny, S. & Hill, C. Prediction of cancer outcome with microarrays: 

a multiple random validation strategy. Lancet 365, 488–492 (2005). 

4. Ntzani, E.E. & Ioannidis, J.P. Predictive ability of DNA microarrays for cancer 

outcomes and correlates: an empirical assessment. Lancet 362, 1439–1444 

(2003). 

5. Ioannidis, J.P. Microarrays and molecular research: noise discovery? Lancet 365, 

454–455 (2005). 

6. Ein-Dor, L., Kela, I., Getz, G., Givol, D. & Domany, E. Outcome signature genes in 

breast cancer: is there a unique set? Bioinformatics 21, 171–178 (2005). 

7. Ein-Dor, L., Zuk, O. & Domany, E. Thousands of samples are needed to generate 

a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA 

103, 5923–5928 (2006). 

8. Shi, L. et al. QA/QC: challenges and pitfalls facing the microarray community and 

regulatory agencies. Expert Rev. Mol. Diagn. 4, 761–777 (2004). 

9. Shi, L. et al. Cross-platform comparability of microarray technology: intra-platform 

consistency and appropriate data analysis procedures are essential. BMC 

Bioinformatics 6 Suppl 2, S12 (2005). 

10. Shi, L. et al. The MicroArray Quality Control (MAQC) project shows inter- and 

intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 

24, 1151–1161 (2006). 

11. Guo, L. et al. Rat toxicogenomic study reveals analytical consistency across 

microarray platforms. Nat. Biotechnol. 24, 1162–1169 (2006). 

12. Canales, R.D. et al. Evaluation of DNA microarray results with quantitative gene 

expression platforms. Nat. Biotechnol. 24, 1115–1122 (2006). 

13. Patterson, T.A. et al. Performance comparison of one-color and two-color platforms 

within the MicroArray Quality Control (MAQC) project. Nat. Biotechnol. 24, 

1140–1150 (2006). 

14. Shippy, R. et al. Using RNA sample titrations to assess microarray platform 

performance and normalization techniques. Nat. Biotechnol. 24, 1123–1131 

(2006). 

15. Tong, W. et al. Evaluation of external RNA controls for the assessment of microarray 

performance. Nat. Biotechnol. 24, 1132–1139 (2006). 

16. Irizarry, R.A. et al. Multiple-laboratory comparison of microarray platforms. Nat. 

Methods 2, 345–350 (2005). 

17. Strauss, E. Arrays of hope. Cell 127, 657–659 (2006). 

18. Shi, L., Perkins, R.G., Fang, H. & Tong, W. Reproducible and reliable microarray 

results through quality control: good laboratory proficiency and appropriate data 

analysis practices are essential. Curr. Opin. Biotechnol. 19, 10–18 (2008). 

19. Dudoit, S., Fridlyand, J. & Speed, T.P. Comparison of discrimination methods for 

the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97, 

77–87 (2002). 

20. Goodsaid, F.M. et al. Voluntary exploratory data submissions to the US FDA and 

the EMA: experience and impact. Nat. Rev. Drug Discov. 9, 435–445 (2010). 

21. van ‘t Veer, L.J. et al. Gene expression profiling predicts clinical outcome of breast 

cancer. Nature 415, 530–536 (2002). 

22. Buyse, M. et al. Validation and clinical utility of a 70-gene prognostic signature for 

women with node-negative breast cancer. J. Natl. Cancer Inst. 98, 1183–1192 

(2006). 

23. Dumur, C.I. et al. Interlaboratory performance of a microarray-based gene expression 

test to determine tissue of origin in poorly differentiated and undifferentiated 

cancers. J. Mol. Diagn. 10, 67–77 (2008). 

24. Deng, M.C. et al. Noninvasive discrimination of rejection in cardiac allograft recipients 

using gene expression profiling. Am. J. Transplant. 6, 150–160 (2006). 

25. Coombes, K.R., Wang, J. & Baggerly, K.A. Microarrays: retracing steps. Nat. Med. 

13, 1276–1277, author reply 1277–1278 (2007). 

26. Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression 

analyses. Nat. Genet. 41, 149–155 (2009). 

27. Baggerly, K.A., Edmonson, S.R., Morris, J.S. & Coombes, K.R. High-resolution serum 

proteomic patterns for ovarian cancer detection. Endocr. Relat. Cancer 11, 

583–584, author reply 585–587 (2004). 

28. Ambroise, C. & McLachlan, G.J. Selection bias in gene extraction on the basis of 

microarray gene-expression data. Proc. Natl. Acad. Sci. USA 99, 6562–6566 

(2002). 

29. Simon, R. Using DNA microarrays for diagnostic and prognostic prediction. Expert 

Rev. Mol. Diagn. 3, 587–595 (2003). 

30. Dobbin, K.K. et al. Interlaboratory comparability study of cancer gene expression 

analysis using oligonucleotide microarrays. Clin. Cancer Res. 11, 565–572 

(2005). 

31. Shedden, K. et al. Gene expression-based survival prediction in lung adenocarcinoma: 

a multi-site, blinded validation study. Nat. Med. 14, 822–827 (2008). 

32. Parry, R.M. et al. K-nearest neighbors (KNN) models for microarray gene-expression 

analysis and reliable clinical outcome prediction. Pharmacogenomics J. 10, 292–309 

(2010). 

33. Dupuy, A. & Simon, R.M. Critical review of published microarray studies for cancer 

outcome and guidelines on statistical analysis and reporting. J. Natl. Cancer Inst. 

99, 147–157 (2007). 

34. Dave, S.S. et al. Prediction of survival in follicular lymphoma based on molecular 

features of tumor-infiltrating immune cells. N. Engl. J. Med. 351, 2159–2169 

(2004). 

35. Tibshirani, R. Immune signatures in follicular lymphoma. N. Engl. J. Med. 352, 

1496–1497, author reply 1496–1497 (2005). 



36. Shi, W. et al. Functional analysis of multiple genomic signatures demonstrates that 

classification algorithms choose phenotype-related genes. Pharmacogenomics J. 10, 

310–323 (2010). 

37. Robinson, G.K. That BLUP is a good thing: the estimation of random effects. 

Stat. Sci. 6, 15–32 (1991). 

38. Hothorn, T., Hornik, K. & Zeileis, A. Unbiased recursive partitioning: a conditional 

inference framework. J. Comput. Graph. Statist. 15, 651–674 (2006). 

39. Boutros, P.C. et al. Prognostic gene signatures for non-small-cell lung cancer. Proc. 

Natl. Acad. Sci. USA 106, 2824–2828 (2009). 

40. Popovici, V. et al. Effect of training sample size and classification difficulty on the 

accuracy of genomic predictors. Breast Cancer Res. 12, R5 (2010). 

41. Yousef, W.A., Wagner, R.F. & Loew, M.H. Assessing classifiers from two independent 

data sets using ROC analysis: a nonparametric approach. IEEE Trans. Pattern Anal. 

Mach. Intell. 28, 1809–1817 (2006). 

42. Gur, D., Wagner, R.F. & Chan, H.P. On the repeated use of databases for testing 

incremental improvement of computer-aided detection schemes. Acad. Radiol. 11, 

103–105 (2004). 

43. Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from 

disarray to consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006). 

44. Wood, I.A., Visscher, P.M. & Mengersen, K.L. Classification based upon gene expression 

data: bias and precision of error rates. Bioinformatics 23, 1363–1370 (2007). 

45. Luo, J. et al. A comparison of batch effect removal methods for enhancement of 

prediction performance using MAQC-II microarray gene expression data. 

Pharmacogenomics J. 10, 278–291 (2010). 

46. Fan, X. et al. Consistency of predictive signature genes and classifiers generated using 

different microarray platforms. Pharmacogenomics J. 10, 247–257 (2010). 

47. Huang, J. et al. Genomic indicators in the blood predict drug-induced liver injury. 


48. Oberthuer, A. et al. Comparison of performance of one-color and two-color gene-expression

analyses in predicting clinical endpoints of neuroblastoma patients. 


49. Hong, H. et al. Assessing sources of inconsistencies in genotypes and their effects 

on genome-wide association studies with HapMap samples. Pharmacogenomics J. 

10, 364–374 (2010). 


Leming Shi 1 , Gregory Campbell 2 , Wendell D Jones 3 , Fabien Campagne 4 , Zhining Wen 1 , Stephen J Walker 5 , 

Zhenqiang Su 6 , Tzu-Ming Chu 7 , Federico M Goodsaid 8 , Lajos Pusztai 9 , John D Shaughnessy Jr 10 , 

André Oberthuer 11 , Russell S Thomas 12 , Richard S Paules 13 , Mark Fielden 14 , Bart Barlogie 10 , Weijie Chen 2 , 

Pan Du 15 , Matthias Fischer 11 , Cesare Furlanello 16 , Brandon D Gallas 2 , Xijin Ge 17 , Dalila B Megherbi 18 , 

W Fraser Symmans 19 , May D Wang 20 , John Zhang 21 , Hans Bitter 22 , Benedikt Brors 23 , Pierre R Bushel 13 , 

Max Bylesjo 24 , Minjun Chen 1 , Jie Cheng 25 , Jing Cheng 26 , Jeff Chou 13 , Timothy S Davison 27 , Mauro Delorenzi 28 , 

Youping Deng 29 , Viswanath Devanarayan 30 , David J Dix 31 , Joaquin Dopazo 32 , Kevin C Dorff 33 , Fathi Elloumi 31 , 

Jianqing Fan 34 , Shicai Fan 35 , Xiaohui Fan 36 , Hong Fang 6 , Nina Gonzaludo 37 , Kenneth R Hess 38 , 

Huixiao Hong 1 , Jun Huan 39 , Rafael A Irizarry 40 , Richard Judson 31 , Dilafruz Juraeva 23 , Samir Lababidi 41 , 

Christophe G Lambert 42 , Li Li 7 , Yanen Li 43 , Zhen Li 31 , Simon M Lin 15 , Guozhen Liu 44 , Edward K Lobenhofer 45 , 

Jun Luo 21 , Wen Luo 46 , Matthew N McCall 40 , Yuri Nikolsky 47 , Gene A Pennello 2 , Roger G Perkins 1 , Reena Philip 2 , 

Vlad Popovici 28 , Nathan D Price 48 , Feng Qian 6 , Andreas Scherer 49 , Tieliu Shi 50 , Weiwei Shi 47 , Jaeyun Sung 48 , 

Danielle Thierry-Mieg 51 , Jean Thierry-Mieg 51 , Venkata Thodima 52 , Johan Trygg 24 , Lakshmi Vishnuvajjala 2 , 

Sue Jane Wang 8 , Jianping Wu 53 , Yichao Wu 54 , Qian Xie 55 , Waleed A Yousef 56 , Liang Zhang 53 , Xuegong Zhang 35 , 

Sheng Zhong 57 , Yiming Zhou 10 , Sheng Zhu 53 , Dhivya Arasappan 6 , Wenjun Bao 7 , Anne Bergstrom Lucas 58 , 

Frank Berthold 11 , Richard J Brennan 47 , Andreas Buness 59 , Jennifer G Catalano 41 , Chang Chang 50 , 

Rong Chen 60 , Yiyu Cheng 36 , Jian Cui 50 , Wendy Czika 7 , Francesca Demichelis 61 , Xutao Deng 62 , 

Damir Dosymbekov 63 , Roland Eils 23 , Yang Feng 34 , Jennifer Fostel 13 , Stephanie Fulmer-Smentek 58 , 

James C Fuscoe 1 , Laurent Gatto 64 , Weigong Ge 1 , Darlene R Goldstein 65 , Li Guo 66 , Donald N Halbert 67 , 

Jing Han 41 , Stephen C Harris 1 , Christos Hatzis 68 , Damir Herman 69 , Jianping Huang 36 , Roderick V Jensen 70 , 

Rui Jiang 35 , Charles D Johnson 71 , Giuseppe Jurman 16 , Yvonne Kahlert 11 , Sadik A Khuder 72 , Matthias Kohl 73 , 

Jianying Li 74 , Li Li 75 , Menglong Li 76 , Quan-Zhen Li 77 , Shao Li 36 , Zhiguang Li 1 , Jie Liu 1 , Ying Liu 35 , Zhichao Liu 1 , 

Lu Meng 35 , Manuel Madera 18 , Francisco Martinez-Murillo 2 , Ignacio Medina 78 , Joseph Meehan 6 , Kelci Miclaus 7 , 

Richard A Moffitt 20 , David Montaner 78 , Piali Mukherjee 33 , George J Mulligan 79 , Padraic Neville 7 , 

Tatiana Nikolskaya 47 , Baitang Ning 1 , Grier P Page 80 , Joel Parker 3 , R Mitchell Parry 20 , Xuejun Peng 81 , 

Ron L Peterson 82 , John H Phan 20 , Brian Quanz 39 , Yi Ren 83 , Samantha Riccadonna 16 , Alan H Roter 84 , 

Frank W Samuelson 2 , Martin M Schumacher 85 , Joseph D Shambaugh 86 , Qiang Shi 1 , Richard Shippy 87 , 

Shengzhu Si 88 , Aaron Smalter 39 , Christos Sotiriou 89 , Mat Soukup 8 , Frank Staedtler 85 , Guido Steiner 90 , 

Todd H Stokes 20 , Qinglan Sun 53 , Pei-Yi Tan 7 , Rong Tang 2 , Zivana Tezak 2 , Brett Thorn 1 , Marina Tsyganova 63 , 

Yaron Turpaz 91 , Silvia C Vega 92 , Roberto Visintainer 16 , Juergen von Frese 93 , Charles Wang 62 , Eric Wang 21 , 

Junwei Wang 50 , Wei Wang 94 , Frank Westermann 23 , James C Willey 95 , Matthew Woods 21 , Shujian Wu 96 , 

Nianqing Xiao 97 , Joshua Xu 6 , Lei Xu 1 , Lun Yang 1 , Xiao Zeng 44 , Jialu Zhang 8 , Li Zhang 8 , Min Zhang 1 , 

Chen Zhao 50 , Raj K Puri 41 , Uwe Scherf 2 , Weida Tong 1 & Russell D Wolfinger 7 

1 National Center for Toxicological Research, US Food and Drug Administration, Jefferson, Arkansas, USA. 2 Center for Devices and Radiological Health, US Food and 

Drug Administration, Silver Spring, Maryland, USA. 3 Expression Analysis Inc., Durham, North Carolina, USA. 4 Department of Physiology and Biophysics and HRH 

Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York, USA. 

5 Wake Forest Institute for Regenerative Medicine, Wake Forest University, Winston-Salem, North Carolina, USA. 6 Z-Tech, an ICF International Company at NCTR/FDA, 

Jefferson, Arkansas, USA. 7 SAS Institute Inc., Cary, North Carolina, USA. 8 Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, 

Maryland, USA. 9 Breast Medical Oncology Department, University of Texas (UT) M.D. Anderson Cancer Center, Houston, Texas, USA. 10 Myeloma Institute for Research 

and Therapy, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA. 11 Department of Pediatric Oncology and Hematology and Center for Molecular 

Medicine (CMMC), University of Cologne, Cologne, Germany. 12 The Hamner Institutes for Health Sciences, Research Triangle Park, North Carolina, USA. 13 National 

Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, USA. 14 Roche Palo Alto LLC, South San Francisco, 

California, USA. 15 Biomedical Informatics Center, Northwestern University, Chicago, Illinois, USA. 16 Fondazione Bruno Kessler, Povo-Trento, Italy. 17 Department of 

Mathematics & Statistics, South Dakota State University, Brookings, South Dakota, USA. 18 CMINDS Research Center, Department of Electrical and Computer 

Engineering, University of Massachusetts Lowell, Lowell, Massachusetts, USA. 19 Department of Pathology, UT M.D. Anderson Cancer Center, Houston, Texas, USA. 

20 Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia, USA. 21 Systems Analytics Inc., Waltham, 

Massachusetts, USA. 22 Hoffmann-LaRoche, Nutley, New Jersey, USA. 23 Department of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), 

Heidelberg, Germany. 24 Computational Life Science Cluster (CLiC), Chemical Biology Center (KBC), Umeå University, Umeå, Sweden. 25 GlaxoSmithKline, Collegeville, 

Pennsylvania, USA. 26 Medical Systems Biology Research Center, School of Medicine, Tsinghua University, Beijing, China. 27 Almac Diagnostics Ltd., Craigavon, UK. 

28 Swiss Institute of Bioinformatics, Lausanne, Switzerland. 29 Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, Mississippi, USA. 

30 Global Pharmaceutical R&D, Abbott Laboratories, Souderton, Pennsylvania, USA. 31 National Center for Computational Toxicology, US Environmental Protection 

Agency, Research Triangle Park, North Carolina, USA. 32 Department of Bioinformatics and Genomics, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain. 

33 HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York, 

USA. 34 Department of Operation Research and Financial Engineering, Princeton University, Princeton, New Jersey, USA. 35 MOE Key Laboratory of Bioinformatics 

and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University, Beijing, China. 36 Institute of Pharmaceutical Informatics, College of 

Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, China. 37 Roche Palo Alto LLC, Palo Alto, California, USA. 38 Department of Biostatistics, 

UT M.D. Anderson Cancer Center, Houston, Texas, USA. 39 Department of Electrical Engineering & Computer Science, University of Kansas, Lawrence, Kansas, USA. 

40 Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA. 41 Center for Biologics Evaluation and Research, US Food and Drug 

Administration, Bethesda, Maryland, USA. 42 Golden Helix Inc., Bozeman, Montana, USA. 43 Department of Computer Science, University of Illinois at Urbana- 

Champaign, Urbana, Illinois, USA. 44 SABiosciences Corp., a Qiagen Company, Frederick, Maryland, USA. 45 Cogenics, a Division of Clinical Data Inc., Morrisville, 

North Carolina, USA. 46 Ligand Pharmaceuticals Inc., La Jolla, California, USA. 47 GeneGo Inc., Encinitas, California, USA. 48 Department of Chemical and 

Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. 49 Spheromics, Kontiolahti, Finland. 50 The Center for Bioinformatics and 

The Institute of Biomedical Sciences, School of Life Science, East China Normal University, Shanghai, China. 51 National Center for Biotechnology Information, 

National Institutes of Health, Bethesda, Maryland, USA. 52 Rockefeller Research Laboratories, Memorial Sloan-Kettering Cancer Center, New York, New York, USA. 

53 CapitalBio Corporation, Beijing, China. 54 Department of Statistics, North Carolina State University, Raleigh, North Carolina, USA. 55 SRA International (EMMES), 

Rockville, Maryland, USA. 56 Helwan University, Helwan, Egypt. 57 Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. 

58 Agilent Technologies Inc., Santa Clara, California, USA. 59 F. Hoffmann-La Roche Ltd., Basel, Switzerland. 60 Stanford Center for Biomedical Informatics Research, 

Stanford University, Stanford, California, USA. 61 Department of Pathology and Laboratory Medicine and HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute 

for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York, USA. 62 Cedars-Sinai Medical Center, UCLA David Geffen School of 

Medicine, Los Angeles, California, USA. 63 Vavilov Institute for General Genetics, Russian Academy of Sciences, Moscow, Russia. 64 DNAVision SA, Gosselies, Belgium. 

65 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. 66 State Key Laboratory of Multi-phase Complex Systems, Institute of Process 

Engineering, Chinese Academy of Sciences, Beijing, China. 67 Abbott Laboratories, Abbott Park, Illinois, USA. 68 Nuvera Biosciences Inc., Woburn, Massachusetts, 

USA. 69 Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA. 70 Virginia Tech, Blacksburg, Virginia, USA.

71 BioMath Solutions, LLC, Austin, Texas, USA. 72 Bioinformatic Program, University of Toledo, Toledo, Ohio, USA. 73 Department of Mathematics, University of 

Bayreuth, Bayreuth, Germany. 74 Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, USA. 75 Pediatric Department, 

Stanford University, Stanford, California, USA. 76 College of Chemistry, Sichuan University, Chengdu, Sichuan, China. 77 University of Texas Southwestern Medical 

Center (UTSW), Dallas, Texas, USA. 78 Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain. 79 Millennium Pharmaceuticals Inc., Cambridge, 

Massachusetts, USA. 80 RTI International, Atlanta, Georgia, USA. 81 Takeda Global R & D Center, Inc., Deerfield, Illinois, USA. 82 Novartis Institutes of Biomedical 

Research, Cambridge, Massachusetts, USA. 83 W.M. Keck Center for Collaborative Neuroscience, Rutgers, The State University of New Jersey, Piscataway, New Jersey, 

USA. 84 Entelos Inc., Foster City, California, USA. 85 Biomarker Development, Novartis Institutes of BioMedical Research, Novartis Pharma AG, Basel, Switzerland. 

86 Genedata Inc., Lexington, Massachusetts, USA. 87 Affymetrix Inc., Santa Clara, California, USA. 88 Department of Chemistry and Chemical Engineering, Hefei 

Teachers College, Hefei, Anhui, China. 89 Institut Jules Bordet, Brussels, Belgium. 90 Biostatistics, F. Hoffmann-La Roche Ltd., Basel, Switzerland. 91 Lilly Singapore 

Centre for Drug Discovery, Immunos, Singapore. 92 Microsoft Corporation, US Health Solutions Group, Redmond, Washington, USA. 93 Data Analysis Solutions DA-SOL 

GmbH, Greifenberg, Germany. 94 Cornell University, Ithaca, New York, USA. 95 Division of Pulmonary and Critical Care Medicine, Department of Medicine, University of 

Toledo Health Sciences Campus, Toledo, Ohio, USA. 96 Bristol-Myers Squibb, Pennington, New Jersey, USA. 97 OpGen Inc., Gaithersburg, Maryland, USA. 




MAQC-II participants. MAQC-II participants can be grouped into several 

categories. Data providers are the participants who provided data sets to the 

consortium. The MAQC-II Regulatory Biostatistics Working Group, whose 

members included a number of biostatisticians, provided guidance and standard 

operating procedures for model development and performance estimation. One 

or more data analysis teams were formed at each organization. Each data analysis 

team actively analyzed the data sets and produced prediction models. Other participants 

also contributed to discussion and execution of the project. The 36 data 

analysis teams listed in Supplementary Table 3 developed data analysis protocols 

and predictive models for one or more of the 13 endpoints. The teams included 

more than 100 scientists and engineers with diverse backgrounds in machine 

learning, statistics, biology, medicine and chemistry, among others. They volunteered 

tremendous time and effort to conduct the data analysis tasks. 

Six data sets including 13 prediction endpoints. To increase the chance 

that MAQC-II would reach generalized conclusions, consortium members 

strongly believed that they needed to study several data sets, each of high 

quality and sufficient size, which would collectively represent a diverse set of 

prediction tasks. Accordingly, significant early effort went toward the selection 

of appropriate data sets. Over ten nominated data sets were reviewed 

for quality of sample collection and processing consistency, and quality of 

microarray and clinical data. Six data sets with 13 endpoints were ultimately 

selected among those nominated during a face-to-face project meeting with 

extensive deliberations among many participants (Table 1). Importantly, three 

preclinical (toxicogenomics) and three clinical data sets were selected to test 

whether baseline practice conclusions could be generalized across these rather 

disparate experimental types. An important criterion for data set selection 

was the anticipated support of MAQC-II by the data provider and the commitment 

to continue experimentation to provide a large external validation 

test set of comparable size to the training set. The three toxicogenomics data 

sets would allow the development of predictive models that predict toxicity 

of compounds in animal models, a prediction task of interest to the pharmaceutical 

industry, which could use such models to speed up the evaluation of 

toxicity for new drug candidates. The three clinical data sets were for endpoints 

associated with three diseases, breast cancer (BR), multiple myeloma (MM) 

and neuroblastoma (NB). Each clinical data set had more than one endpoint, 

and together incorporated several types of clinical applications, including 

treatment outcome and disease prognosis. The MAQC-II predictive modeling 

was limited to binary classification problems; therefore, continuous endpoint 

values such as overall survival (OS) and event-free survival (EFS) times were 

dichotomized using a ‘milestone’ cutoff applied to the censored data. Prediction endpoints

were chosen to span a wide range of prediction difficulty. Two endpoints, 

H (CPS1) and L (NEP_S), representing the sex of the patients, were used as 

positive control endpoints, as they are easily predictable by microarrays. Two 

other endpoints, I (CPR1) and M (NEP_R), representing randomly assigned 

class labels, were designed to serve as negative control endpoints, as they 

are not supposed to be predictable. Data analysis teams were not aware of 

the characteristics of endpoints H, I, L and M until their swap prediction 

results had been submitted. If a data analysis protocol did not yield models to 

accurately predict endpoints H and L, or if a data analysis protocol claimed to

be able to yield models to accurately predict endpoints I and M, something 

must have gone wrong. 
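
The logic of the control endpoints can be illustrated with a short sketch. It is not part of the MAQC-II analysis code: the function names, the choice of the Matthews correlation coefficient (MCC) as the score, and the thresholds 0.8 and 0.2 are assumptions made only for this illustration.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def random_labels(n, seed=0):
    """Negative-control endpoint: class labels assigned at random."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]


def check_controls(pos_mcc, neg_mcc, pos_min=0.8, neg_max=0.2):
    """Return warnings for implausible control results.

    pos_min and neg_max are illustrative thresholds, not MAQC-II values.
    """
    problems = []
    if pos_mcc < pos_min:
        problems.append("positive control (e.g., sex) poorly predicted")
    if neg_mcc > neg_max:
        problems.append("negative control (random labels) appears predictable")
    return problems
```

An empty list from `check_controls` means that the protocol behaved as expected on both kinds of control; any warning suggests that the protocol, rather than the biology, should be examined first.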

The Hamner data set (endpoint A) was provided by The Hamner Institutes 

for Health Sciences. The study objective was to apply microarray gene expression 

data from the lung of female B6C3F1 mice exposed to a 13-week treatment 

of chemicals to predict increased lung tumor incidence in the 2-year 

rodent cancer bioassays of the National Toxicology Program 50 . If successful, 

the results may form the basis of a more efficient and economical approach 

for evaluating the carcinogenic activity of chemicals. Microarray analysis was 

performed using Affymetrix Mouse Genome 430 2.0 arrays on three to four 

mice per treatment group, and a total of 70 mice were analyzed and used as 

MAQC-II’s training set. Additional data from another set of 88 mice were 

collected later and provided as MAQC-II’s external validation set. 

The Iconix data set (endpoint B) was provided by Iconix Biosciences. 

The study objective was to assess, upon short-term exposure, hepatic tumor 

induction by nongenotoxic chemicals 51 , as there are currently no accurate and 

well-validated short-term tests to identify nongenotoxic hepatic tumorigens, 

thus necessitating an expensive 2-year rodent bioassay before a risk assessment 

can begin. The training set consists of hepatic gene expression data from 216 

male Sprague-Dawley rats treated for 5 d with one of 76 structurally and mechanistically 

diverse nongenotoxic hepatocarcinogens and nonhepatocarcinogens. 

The validation set consists of 201 male Sprague-Dawley rats treated for 5 d with 

one of 68 structurally and mechanistically diverse nongenotoxic hepatocarcinogens 

and nonhepatocarcinogens. Gene expression data were generated using the 

Amersham Codelink Uniset Rat 1 Bioarray (GE HealthCare) 52 . The separation 

of the training set and validation set was based on the time when the microarray 

data were collected; that is, microarrays processed earlier in the study 

were used as training and those processed later were used as validation. 
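
A chronological split of this kind can be sketched as follows. The record fields `id` and `processed` and the function name are hypothetical; the sketch shows only the general idea of assigning earlier-processed samples to training and later ones to validation, not the data provider's actual procedure.

```python
from datetime import date


def chronological_split(samples, n_train):
    """Split samples by processing date.

    The n_train earliest samples form the training set and the
    remaining, later samples form the validation set.
    """
    ordered = sorted(samples, key=lambda s: s["processed"])
    return ordered[:n_train], ordered[n_train:]


# Toy records with made-up identifiers and dates.
samples = [
    {"id": "a", "processed": date(2005, 3, 1)},
    {"id": "b", "processed": date(2004, 1, 15)},
    {"id": "c", "processed": date(2006, 7, 9)},
]
training, validation = chronological_split(samples, n_train=2)
```

Splitting by time rather than at random means that any batch effects accumulated between the early and late phases of a study are reflected in the validation performance.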

The NIEHS data set (endpoint C) was provided by the National Institute 

of Environmental Health Sciences (NIEHS) of the US National Institutes 

of Health. The study objective was to use microarray gene expression data 

acquired from the liver of rats exposed to hepatotoxicants to build classifiers 

for prediction of liver necrosis. The gene expression ‘compendium’ data set 

was collected from 418 rats exposed to one of eight compounds (1,2-dichlorobenzene, 

1,4-dichlorobenzene, bromobenzene, monocrotaline, N-nitrosomorpholine, 

thioacetamide, galactosamine and diquat dibromide). All eight 

compounds were studied using standardized procedures, that is, a common 

array platform (Affymetrix Rat 230 2.0 microarray), experimental procedures 

and data retrieving and analysis processes. For details of the experimental 

design see ref. 53. Briefly, for each compound, four to six male, 12-week-old 

F344 rats were exposed to a low dose, mid dose(s) and a high dose of the toxicant 

and sacrificed 6, 24 and 48 h later. At necropsy, liver was harvested for 

RNA extraction, histopathology and clinical chemistry assessments. 

Animal use in the studies was approved by the respective Institutional 

Animal Use and Care Committees of the data providers and was conducted 

in accordance with the National Institutes of Health (NIH) guidelines 

for the care and use of laboratory animals. Animals were housed in fully 

accredited American Association for Accreditation of Laboratory Animal 

Care facilities. 

The human breast cancer (BR) data set (endpoints D and E) was contributed 

by the University of Texas M.D. Anderson Cancer Center. Gene expression data 

from 230 stage I–III breast cancers were generated from fine needle aspiration 

specimens of newly diagnosed breast cancers before any therapy. The biopsy 

specimens were collected sequentially during a prospective pharmacogenomic 

marker discovery study between 2000 and 2008. These specimens represent 

70–90% pure neoplastic cells with minimal stromal contamination 54 . Patients 

received 6 months of preoperative (neoadjuvant) chemotherapy including 

paclitaxel (Taxol), 5-fluorouracil, cyclophosphamide and doxorubicin 

(Adriamycin) followed by surgical resection of the cancer. Response to preoperative 

chemotherapy was categorized as a pathological complete response 

(pCR = no residual invasive cancer in the breast or lymph nodes) or residual 

invasive cancer (RD), and used as endpoint D for prediction. Endpoint E is the 

clinical estrogen-receptor status as established by immunohistochemistry 55 . 

RNA extraction and gene expression profiling were performed in multiple 

batches over time using Affymetrix U133A microarrays. Genomic analysis of 

a subset of this sequentially accrued patient population was reported previously 

56 . For each endpoint, the first 130 cases were used as a training set and 

the next 100 cases were used as an independent validation set. 

The multiple myeloma (MM) data set (endpoints F, G, H and I) was contributed 

by the Myeloma Institute for Research and Therapy at the University 

of Arkansas for Medical Sciences. Gene expression profiling of highly purified 

bone marrow plasma cells was performed in newly diagnosed patients with 

MM 57–59 . The training set consisted of 340 cases enrolled in total therapy 2 

(TT2) and the validation set comprised 214 patients enrolled in total therapy 3 

(TT3) 59 . Plasma cells were enriched by anti-CD138 immunomagnetic bead 

selection of mononuclear cell fractions of bone marrow aspirates in a central 

laboratory. All samples applied to the microarray contained >85% plasma 

cells as determined by two-color flow cytometry (CD38 + and CD45 − /dim) 

performed after selection. Dichotomized overall survival (OS) and event-free 

survival (EFS) were determined based on a 2-year milestone cutoff. A gene 

expression model of high-risk multiple myeloma was developed and validated 

by the data provider 58 and later validated in three additional independent 

data sets 60–62 . 

doi:10.1038/nbt.1665 



The neuroblastoma (NB) data set (endpoints J, K, L and M) was contributed 

by the Children’s Hospital of the University of Cologne, Germany. Tumor 

samples were checked by a pathologist before RNA isolation; only samples 

with ≥60% tumor content were used and total RNA was isolated from ~50 mg 

of snap-frozen neuroblastoma tissue obtained before chemotherapeutic 

treatment. First, 502 preexisting 11K Agilent dye-flipped, dual-color replicate 

profiles for 251 patients were provided 63 . Of these, profiles of 246 neuroblastoma 

samples passed an independent MAQC-II quality assessment by majority 

decision and formed the MAQC-II training data set. Subsequently, 514 dye-flipped 

dual-color 11K replicate profiles for 256 independent neuroblastoma 

tumor samples were generated and profiles for 253 samples were selected to 

form the MAQC-II validation set. Of note, for one patient of the validation 

set, two different tumor samples were analyzed using both versions of the 

2 × 11K microarray (see below). All dual-color gene-expression profiles of the MAQC-II 

training set were generated using a customized 2 × 11K neuroblastoma-related 

microarray 63 . Furthermore, 20 patients of the MAQC-II validation set were 

also profiled using this microarray. Dual-color profiles of the remaining 

patients of the MAQC-II validation set were generated using a slightly revised 

version of the 2 × 11K microarray. This version V2.0 of the array comprised 

200 novel oligonucleotide probes whereas 100 oligonucleotide probes of the 

original design were removed due to consistent low expression values (near 

background) observed in the training set profiles. These minor modifications 

of the microarray design resulted in a total of 9,986 probes present on both 

versions of the 2 × 11K microarray. The experimental protocol did not differ 

between the two sets, and gene-expression profiling was performed as described 63 . 

Furthermore, single-color gene-expression profiles were generated for 478/499 

neuroblastoma samples of the MAQC-II dual-color training and validation sets 

(training set 244/246; validation set 234/253). For the remaining 21 samples 

no single-color data were available, due to either shortage of tumor material 

of these patients (n = 15), poor experimental quality of the generated single-color 

profiles (n = 5), or correlation of one single-color profile to two different 

dual-color profiles for the one patient profiled with both versions of the 2 × 

11K microarrays (n = 1). Single-color gene-expression profiles were generated 

using customized 4 × 44K oligonucleotide microarrays produced by Agilent 

Technologies. These 4 × 44K microarrays included all probes represented by 

Agilent’s Whole Human Genome Oligo Microarray and all probes of the version 

V2.0 of the 2 × 11K customized microarray that were not present in the 

former probe set. Labeling and hybridization were performed following the 

manufacturer’s protocol as described 48 . 

Sample annotation information along with clinical co-variates of the patient 

cohorts is available at the MAQC web site (http://edkb.fda.gov/MAQC/). The 

institutional review boards of the respective providers of the clinical microarray 

data sets had approved the research studies, and all subjects had provided 

written informed consent to both treatment protocols and sample procurement, 

in accordance with the Declaration of Helsinki. 

MAQC-II effort and data analysis procedure. This section provides details 

about some of the analysis steps presented in Figure 1. In a first round of 

analysis (steps 2–4), each data analysis team analyzed 

MAQC-II data sets to generate predictive models and associated performance 

estimates. After this first round of analysis, most participants attended 

a consortium meeting where approaches were presented and discussed. The 

meeting helped members decide on a common performance evaluation protocol, 

which most data analysis teams agreed to follow to render performance 

statistics comparable across the consortium. It should be noted that some data 

analysis teams decided not to follow the recommendations for performance 

evaluation protocol and used instead an approach of their choosing, resulting 

in various internal validation approaches in the final results. Data analysis 

teams were given 2 months to implement the revised analysis protocol (the 

group recommended using fivefold stratified cross-validation with ten repeats 

across all endpoints for the internal validation strategy) and submit their final 

models. The amount of metadata to collect for characterizing the modeling 

approach used to derive each model was also discussed at the meeting. 
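
The recommended internal validation scheme (fivefold stratified cross-validation with ten repeats) can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the toy labels and the random seed are hypothetical and are not taken from the consortium's standard operating procedure. Each repeat reshuffles the samples within each class and deals them round-robin into five folds, so every fold keeps roughly the class proportions of the full training set.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Assign sample indices to n_folds folds, balancing each class across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def repeated_stratified_cv(labels, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for every split."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        folds = stratified_folds(labels, n_folds, rng)
        for fold_number, test in enumerate(folds):
            train = [i for k, fold in enumerate(folds) if k != fold_number for i in fold]
            yield repeat, fold_number, sorted(train), sorted(test)


# Hypothetical toy endpoint: 30 positive and 70 negative training samples.
toy_labels = [1] * 30 + [0] * 70
splits = list(repeated_stratified_cv(toy_labels))
print(len(splits))  # 5 folds x 10 repeats = 50 train/test splits
```

In such a scheme a model would be refit on each training split and scored on the held-out fold, and the mean and s.d. of each metric over the 50 splits would give the internal validation estimate.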

For each endpoint, each team was also required to select one of its 

submitted models as its nominated model. No specific guideline was given 

and groups could select nominated models according to any objective or 

subjective criteria. Because the consortium lacked an agreed-upon reference 

performance measure (Supplementary Fig. 13), it was not clear how the 

nominated models would be evaluated, and data analysis teams ranked models 

by different measures or combinations of measures. Data analysis teams were 

encouraged to report a common set of performance measures for each model 

so that models could be reranked consistently a posteriori. Models trained 

with the training set were frozen (step 6). MAQC-II selected for each endpoint 

one model from the up to 36 nominations as the MAQC-II candidate 

for validation (step 6). 

External validation sets lacking class labels for all endpoints were distributed 

to the data analysis teams. Each data analysis team used its previously 

frozen models to make class predictions on the validation data set (step 7). 

The sample-by-sample prediction results were submitted to MAQC-II by 

each data analysis team (step 8). Results were used to calculate the external 

validation performance metrics for each model. Calculations were carried 

out by three independent groups not involved in developing models, which 

were provided with validation class labels. Data analysis teams that still had 

no access to the validation class labels were given an opportunity to correct 

apparent clerical mistakes in prediction submissions (e.g., inversion of class 

labels). Class labels were then distributed to enable data analysis teams to 

check prediction performance metrics and perform in-depth analysis of results. 

A table of performance metrics was assembled from information collected in 

steps 5 and 8 (step 10, Supplementary Table 1). 

To check the consistency of modeling approaches, the original validation and 

training sets were swapped and steps 4–10 were repeated (step 11). Briefly, each 

team used the validation class labels and the validation data sets as a training 

set. Prediction models and evaluation performance were collected by internal 

and external validation (considering the original training set as a validation 

set). Data analysis teams were asked to apply the same data analysis protocols 

that they used for the original ‘Blind’ Training → Validation analysis. Swap 

analysis results are provided in Supplementary Table 2. It should be noted 

that during the swap experiment, the data analysis teams inevitably already 

had access to the class label information for samples in the swap validation set, 

that is, the original training set. 

Model summary information tables. To enable a systematic comparison of 

models for each endpoint, a table of information was constructed containing 

a row for each model from each data analysis team, with columns containing 

three categories of information: (i) modeling factors that describe the model 

development process; (ii) performance metrics from internal validation; and 

(iii) performance metrics from external validation (Fig. 1; step 10). 

Each data analysis team was requested to report several modeling factors for 

each model they generated. These modeling factors are organization code, data 

set code, endpoint code, summary or normalization method, feature selection 

method, number of features used in final model, classification algorithm, 

internal validation protocol, validation iterations (number of repeats of cross-validation 

or bootstrap sampling) and batch-effect-removal method. A set of 

valid entries for each modeling factor was distributed to all data analysis teams 

in advance of model submission, to help consolidate a common vocabulary 

that would support analysis of the completed information table. It should be 

noted that since modeling factors are self-reported, two models that share a 

given modeling factor may still differ in their implementation of the modeling 

approach described by the modeling factor. 

The seven performance metrics for internal validation and external validation 

are MCC (Matthews Correlation Coefficient), accuracy, sensitivity, specificity, 

AUC (area under the receiver operating characteristic curve), binary 

AUC (that is, mean of sensitivity and specificity) and r.m.s.e. For internal 

validation, s.d. for each performance metric is also included in the table. 

Missing entries indicate that the data analysis team has not submitted the 

requested information. 
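
As an illustration of how the metrics that depend only on class predictions relate to one another, the following minimal sketch computes accuracy, sensitivity, specificity, binary AUC (the mean of sensitivity and specificity, as defined above) and MCC from a confusion matrix. The function name and the toy labels are hypothetical. AUC and r.m.s.e. are omitted because they require continuous prediction scores, and returning an MCC of 0 when a marginal count is zero is a convention assumed here, not one stated in the text.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute prediction-based metrics for labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(pairs)
    # Matthews correlation coefficient; set to 0 when any marginal total is zero.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "binary_auc": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }


# Hypothetical toy predictions: 4 positives and 6 negatives.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Unlike accuracy, MCC and binary AUC are not inflated by predicting only the majority class, which matters for endpoints with unbalanced class sizes.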

In addition, the lists of features used in the data analysis team’s nominated 

models are recorded as part of the model submission for functional analysis 

and reproducibility assessment of the feature lists (see the MAQC Web site at 

http://edkb.fda.gov/MAQC/). 

Selection of nominated models by each data analysis team and selection 

of MAQC-II candidate and backup models by RBWG and the steering 

committee. In addition to providing results to generate the model information 




table, each team nominated a single model for each endpoint as its preferred 

model for validation, resulting in a total of 323 nominated models, 318 of 

which were applied to the prediction of the validation sets. These nominated 

models were peer reviewed, debated and ranked for each endpoint by the 

RBWG before validation set predictions. The rankings were given to the 

MAQC-II steering committee, and those members not directly involved in 

developing models selected a single model for each endpoint, forming the 13 

MAQC-II candidate models. If there was sufficient evidence through documentation 

to establish that the data analysis team had followed the guidelines 

of good classifier principles for model development outlined in the standard 

operating procedure (Supplementary Data), then their nominated models 

were considered as potential candidate models. The nomination and selection 

of candidate models occurred before the validation data were released. 

Selection of one candidate model for each endpoint across MAQC-II was 

performed to reduce multiple selection concerns. This selection process turned 

out to be highly interesting and time consuming, but worthwhile, as participants had 

different viewpoints and criteria in ranking the data analysis protocols and 

selecting the candidate model for an endpoint. One additional criterion was 

to select the 13 candidate models in such a way that no more than one of them 

would come from any single data analysis team, to ensure that a variety 

of approaches to model development were considered. For each endpoint, a 

backup model was also selected under the same selection process and criteria 

as for the candidate models. The 13 candidate models selected by MAQC-II 

indeed performed well in the validation prediction (Figs. 2c and 3). 

50. Thomas, R.S., Pluta, L., Yang, L. & Halsey, T.A. Application of genomic biomarkers 

to predict increased lung tumor incidence in 2-year rodent cancer bioassays. Toxicol. 

Sci. 97, 55–64 (2007). 

51. Fielden, M.R., Brennan, R. & Gollub, J. A gene expression biomarker provides early 

prediction and mechanistic assessment of hepatic tumor induction by nongenotoxic 

chemicals. Toxicol. Sci. 99, 90–100 (2007). 

52. Ganter, B. et al. Development of a large-scale chemogenomics database to improve 

drug candidate selection and to understand mechanisms of chemical toxicity and 

action. J. Biotechnol. 119, 219–244 (2005). 

53. Lobenhofer, E.K. et al. Gene expression response in target organ and whole blood 

varies as a function of target organ injury phenotype. Genome Biol. 9, R100 

(2008). 

54. Symmans, W.F. et al. Total RNA yield and microarray gene expression profiles from 

fine-needle aspiration biopsy and core-needle biopsy samples of breast carcinoma. 

Cancer 97, 2960–2971 (2003). 

55. Gong, Y. et al. Determination of oestrogen-receptor status and ERBB2 status of 

breast carcinoma: a gene-expression profiling study. Lancet Oncol. 8, 203–211 

(2007). 

56. Hess, K.R. et al. Pharmacogenomic predictor of sensitivity to preoperative 

chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide 

in breast cancer. J. Clin. Oncol. 24, 4236–4244 (2006). 

57. Zhan, F. et al. The molecular classification of multiple myeloma. Blood 108, 

2020–2028 (2006). 

58. Shaughnessy, J.D. Jr. et al. A validated gene expression model of high-risk multiple 

myeloma is defined by deregulated expression of genes mapping to chromosome 1. 

Blood 109, 2276–2284 (2007). 

59. Barlogie, B. et al. Thalidomide and hematopoietic-cell transplantation for multiple 

myeloma. N. Engl. J. Med. 354, 1021–1030 (2006). 

60. Zhan, F., Barlogie, B., Mulligan, G., Shaughnessy, J.D. Jr. & Bryant, B. High-risk 

myeloma: a gene expression based risk-stratification model for newly diagnosed 

multiple myeloma treated with high-dose therapy is predictive of outcome in 

relapsed disease treated with single-agent bortezomib or high-dose dexamethasone. 

Blood 111, 968–969 (2008). 

61. Chng, W.J., Kuehl, W.M., Bergsagel, P.L. & Fonseca, R. Translocation t(4;14) retains 

prognostic significance even in the setting of high-risk molecular signature. 

Leukemia 22, 459–461 (2008). 

62. Decaux, O. et al. Prediction of survival in multiple myeloma based on gene 

expression profiles reveals cell cycle and chromosomal instability signatures in 

high-risk patients and hyperdiploid signatures in low-risk patients: a study of the 

Intergroupe Francophone du Myelome. J. Clin. Oncol. 26, 4798–4805 (2008). 

63. Oberthuer, A. et al. Customized oligonucleotide microarray gene expression-based 

classification of neuroblastoma patients outperforms current clinical risk 

stratification. J. Clin. Oncol. 24, 5070–5078 (2006). 



articles 

Human hematopoietic stem/progenitor cells modified 

by zinc-finger nucleases targeted to CCR5 control 

HIV-1 in vivo 

Nathalia Holt 1 , Jianbin Wang 2 , Kenneth Kim 2 , Geoffrey Friedman 2 , Xingchao Wang 3 , Vanessa Taupin 3 , 

Gay M Crooks 4 , Donald B Kohn 4 , Philip D Gregory 2 , Michael C Holmes 2 & Paula M Cannon 1 


CCR5 is the major HIV-1 co-receptor, and individuals homozygous for a 32-bp deletion in CCR5 are resistant to infection by 

CCR5-tropic HIV-1. Using engineered zinc-finger nucleases (ZFNs), we disrupted CCR5 in human CD34 + hematopoietic stem/ 

progenitor cells (HSPCs) at a mean frequency of 17% of the total alleles in a population. This procedure produces both mono- and 

bi-allelically disrupted cells. ZFN-treated HSPCs retained the ability to engraft NOD/SCID/IL2rγ null mice and gave rise to polyclonal 

multi-lineage progeny in which CCR5 was permanently disrupted. Control mice receiving untreated HSPCs and challenged with 

CCR5-tropic HIV-1 showed profound CD4 + T-cell loss. In contrast, mice transplanted with ZFN-modified HSPCs underwent 

rapid selection for CCR5 −/− cells, had significantly lower HIV-1 levels and preserved human cells throughout their tissues. The 

demonstration that a minority of CCR5 −/− HSPCs can populate an infected animal with HIV-1-resistant, CCR5 −/− progeny supports 

the use of ZFN-modified autologous hematopoietic stem cells as a clinical approach to treating HIV-1. 

The entry of HIV-1 into target cells involves sequential binding of 

the viral gp120 Env protein to the CD4 receptor and a chemokine 

co-receptor 1 . CCR5 is the major co-receptor used by HIV-1 and is 

expressed on key T-cell subsets that are depleted during HIV-1 infection, 

including memory T cells 2 . A genetic 32-bp deletion in CCR5 

(CCR5Δ32) is relatively common in Western European populations 

and confers resistance to HIV-1 infection and AIDS in homozygotes 

3,4 . The absence of any other significant phenotype associated 

with a lack of CCR5 (refs. 5–7) has spurred the development of 

therapies aimed at blocking the virus–CCR5 interaction, and CCR5 

antagonists have proved to be an effective salvage therapy in patients 

with drug-resistant strains of HIV-1 (ref. 8). 

Recently, the ability of CCR5 −/− mobilized CD34 + peripheral blood 

cells to generate HIV-resistant progeny that suppress HIV-1 replication in 

vivo was demonstrated in an HIV-infected patient undergoing transplantation 

from a homozygous CCR5Δ32 donor during treatment for acute 

myeloid leukemia 9 . The donor cells conferred long-term control of HIV-1 

replication and restored the patient’s CD4 + T-cell levels in the absence of 

antiretroviral drug therapy. These clinical data support the potential of 

gene or stem cell therapies based on the elimination of CCR5. However, 

the risks associated with allogeneic transplantation and the impracticality 

of obtaining sufficient numbers of matched CCR5Δ32 donors 10 

mean that broader application of this approach will require methods for 

generating autologous CCR5 −/− cells. Various gene therapy approaches 

to block CCR5 expression are being evaluated, including CCR5-specific 

ribozymes 11,12 , siRNAs 13 and intrabodies 14 . The targeted cell populations 

include both mature T cells and CD34 + HSPCs. Loss of CCR5 in HSPCs 

appears to have no adverse effects on hematopoiesis 12,13,15 . 

An alternative approach is the use of engineered ZFNs to permanently 

disrupt the CCR5 open reading frame. ZFNs comprise a series 

of linked zinc fingers engineered to bind specific DNA sequences 

and fused to an endonuclease domain 16 . Concerted binding of two 

juxtaposed ZFNs on DNA, followed by dimerization of the endonuclease 

domains, generates a double-stranded break at the DNA 

target. Such double-stranded breaks are rapidly repaired by cellular 

repair pathways, notably the mutagenic nonhomologous end-joining 

pathway, which leads to frequent disruption of the gene due to the 

addition or deletion of nucleotides at the break site 17,18 . A significant 

advantage of this approach is that permanent gene disruption can 

result from only transient ZFN expression. 

CD4 + T cells modified by CCR5-targeted ZFNs 19 are currently being 

evaluated in a clinical trial. However, disruption of CCR5 in HSPCs 

is likely to provide a more durable anti-viral effect and to give rise to 

CCR5 −/− cells in both the lymphoid and myeloid compartments that 

HIV-1 infects. To evaluate this approach, we optimized the delivery of 

CCR5-specific ZFNs to human CD34 + HSPCs and transplanted the 

modified cells into nonobese diabetic/severe combined immunodeficient/interleukin 

2rγ null (NOD/SCID/IL2rγ null ; NSG) mice, which support 

both human hematopoiesis 20 and HIV-1 infection 13 . Infection of 

the mice with a CCR5-tropic strain of HIV-1 led to rapid selection for 

CCR5 – human cells, a significant reduction in viral load and protection 

of human T-cell populations in the key tissues that HIV-1 infects. These 

1 Keck School of Medicine of the University of Southern California, Los Angeles, California, USA. 2 Sangamo BioSciences, Inc., Richmond, California, USA. 3 Childrens 

Hospital Los Angeles, Los Angeles, California, USA. 4 David Geffen School of Medicine at the University of California Los Angeles, Los Angeles, California, USA. 

Correspondence should be addressed to P.M.C. (pcannon@usc.edu). 

Received 20 October 2009; accepted 24 June 2010; published online 2 July 2010; corrected online 22 July 2010; doi:10.1038/nbt.1663 




Figure 1 ZFN-mediated disruption of CCR5 in 

CD34 + HSPCs. (a) Representative gel showing 

extent of CCR5 disruption in CD34 + HSPCs 

24 h after nucleofection with ZFN-expressing 

plasmids (ZFN) or mock nucleofected (mock). 

Neg. is untreated CD34 + HSPCs. CCR5 

disruption was measured by PCR amplification 

across the ZFN target site, followed by Cel 

1 nuclease digestion and quantification of 

products by PAGE. (b) Graph showing mean 

± s.d. percentage of human CD45 + cells in 

peripheral blood of mice at 8 weeks after 

transplantation with either untreated, mock 

nucleofected or ZFN nucleofected CD34 + 

HSPCs (n = 5 each group). (c) FACS profiles 

of human cells from various organs of one 

representative mouse into which ZFN-treated 

CD34 + HSPCs were transplanted. Cells were 

gated on FSC/SSC (forward scatter/side 

scatter) to remove debris. Staining for human 

CD45, a pan leukocyte marker, was used to 

reveal the level of engraftment with human 

cells in each organ. CD45 + -gated populations 

were further analyzed for subsets, as indicated: 

CD19 (B cells) in bone marrow, CD14 

(monocytes/macrophages) in lung, CD4 and 

CD8 (T cells) in thymus and spleen and CD3 (T 

cells) in the small intestine (lamina propria). 

The CD45 + population from the small intestine 

was further analyzed for CD4 and CCR5 

expression. Peripheral blood cells from CD45 + 

and lymphoid gates were analyzed for CD4 

and CD8 expression. The percentage of cells 

in each indicated area is shown. No staining 

was observed with isotype-matched control 

antibodies (Supplementary Fig. 1) or in animals 

receiving no human graft (data not shown). 


findings suggest that ZFN engineering of autologous HSPCs may enable 

long-term control of HIV-1 in infected individuals. 

[Figure 1a gel labels: CD34 + cells; lanes Neg., Mock and ZFN; values shown 0%, 0% and 16%, respectively.] 

RESULTS 

Efficient disruption of CCR5 in human CD34 + HSPCs 

Gene delivery methods suitable for expressing ZFNs include plasmid 

DNA nucleofection 16 , integrase-defective lentiviral vectors 21 and 

adenoviral vectors 19 . Although nonviral methods are attractive, 

nucleofection can be associated with relatively high toxicity for 

human CD34 + HSPCs and loss of engraftment potential 22 , although, 

more recently, less toxic outcomes have been described 23–25 . We 

evaluated different parameters to identify nucleofection conditions 

that allowed efficient disruption of CCR5 while limiting toxicity. The 

extent of CCR5 disruption was quantified using PCR amplification 

across the CCR5 locus, denaturation and reannealing of products, 

and digestion with the Cel 1 nuclease, which preferentially cleaves 

DNA at distorted duplexes caused by mismatches. The Cel 1 nuclease 

assay detects a linear range of CCR5 disruption between 0.69% and 

44% of the total alleles in a population, with an upper limit of sensitivity 

of 70–80% disruption (ref. 19 and data not shown). We used 

this assay to monitor CCR5 disruption as only a minority of human 

CD34 + cells expresses CCR5 (ref. 26), making it difficult to measure 

CCR5 expression by flow cytometry. 

Using CD34 + HSPCs harvested from umbilical cord blood and optimized 

nucleofection conditions, we achieved mean disruption rates of 


17% ± 10 (n = 21) of the total CCR5 alleles in the population (Fig. 1a). 

Similar results were also achieved using CD34 + HSPCs isolated from 

human fetal liver (data not shown). Previous studies in human cell 

lines 16 and primary human T cells 19 have shown that the percentage 

of bi-allelically modified cells in a ZFN-treated population is 30–40% 

of the total number of disrupted alleles detected by the Cel 1 assay. We 

therefore estimated that 5–7% of ZFN-treated cells would be CCR5 −/− , 

although this was not directly measured. 
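
The arithmetic behind this estimate can be made explicit with a short sketch. It simply multiplies the mean allele disruption rate reported above (17%) by the 30–40% bi-allelic share taken from the cited earlier studies; the variable names are illustrative, and the result remains an estimate rather than a measurement.

```python
# Mean fraction of CCR5 alleles disrupted, as measured by the Cel 1 assay.
mean_disrupted_allele_fraction = 0.17

# Share of bi-allelically modified cells relative to the disrupted-allele
# fraction, taken from the earlier cell line and T-cell studies (30-40%).
biallelic_share_range = (0.30, 0.40)

# Estimated fraction of ZFN-treated cells that are CCR5 -/- (lower, upper).
biallelic_estimates = [
    mean_disrupted_allele_fraction * share for share in biallelic_share_range
]

for share, estimate in zip(biallelic_share_range, biallelic_estimates):
    print(f"share {share:.0%}: about {estimate:.1%} of cells CCR5 -/-")
```

The two bounds, about 5.1% and 6.8%, round to the 5–7% range quoted in the text.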

We evaluated toxicity by measuring induction of apoptosis. Although 

nucleofection increased toxicity to human CD34 + cells threefold compared 

to untreated cells, inclusion of the ZFN plasmids had no additional 

effect compared to mock nucleofected controls (data not shown). 

Overall, we consider that any adverse effects of nucleofection on cell 

viability may be offset by the high levels of CCR5 disruption achieved 

as well as the speed and simplicity of the procedure compared to viral 

vector systems 19,21 . 

ZFN-modified CD34 + HSPCs are capable of multi-lineage 

engraftment in NSG mice 

NSG mice can be engrafted with human CD34 + HSPCs 20 and thereby 

provide a rigorous readout of the hematopoietic potential of genetically 

modified HSPCs. We evaluated the effects of nucleofection and/ 

or CCR5 disruption by transplanting both untreated and ZFN-treated 

human CD34 + HSPCs into 1-d-old mice that had received low-dose 

(150 cGy) radiation. Engraftment of human cells was efficient and rapid, 



typically resulting in 40% human CD45 + leukocytes 

in the peripheral blood at 8 weeks after 

transplantation. The animals showed none of the obvious 

toxicity or ill health reported for higher 

radiation doses 27 . ZFN-treated cells engrafted 

NSG mice as efficiently as untreated control 

cells (Fig. 1b), with no statistically significant 

difference between the two groups (Student’s 

t-test, P = 0.26). 

Eight to 12 weeks after transplantation, we 

analyzed engraftment of various mouse tissues with human CD45 + 

leukocytes and with cells from specific hematopoietic lineages (Fig. 

1c). Human cells were detected using human-specific antibodies, and 

specificity was confirmed using both unengrafted animals and isotype-matched 

antibody controls (Supplementary Fig. 1). High levels of 

human cells were found in both the peripheral blood and tissues, ranging 

from 5–15% in the intestine to >50% in blood, spleen and bone marrow 

and >90% in the thymus (Supplementary Table 1). CD4 + and CD8 + 

T cells were present in multiple organs, including the thymus, spleen, 

and both the intraepithelial and lamina propria regions of the small and 

large intestines; B-cell progenitors were present in the bone marrow; and 

CD14 + macrophages and/or monocytes were detected in the lung. Of 

particular interest was the large population of human CD4 + CCR5 + cells 

in the intestines, as these cells are targeted by both HIV-1 in humans 28–31 

and SIV in primates 32–34 . Overall, the profile of human cells in mice 

receiving ZFN-treated CD34 + HSPCs was indistinguishable from that 

of mice transplanted with unmodified cells, both with respect to the 

percentage of human cells in each tissue and the frequencies of different 

subsets (Supplementary Table 1), suggesting that ZFN-modified CD34 + 

HSPCs are functionally normal. 

ZFN-treated CD34 + HSPCs produce CCR5-disrupted progeny 

after secondary transplantation 

To evaluate whether ZFN treatment of the bulk CD34 + population 

modified true SCID-repopulating stem cells, we harvested 

bone marrow from an animal 18 weeks after engraftment 

with ZFN-treated CD34 + HSPCs, in which the extent of CCR5 

disruption in the bone marrow was 11% (Table 1). This marrow 

was transplanted into three 8-week-old recipients. 

At the same time, bone marrow from a control animal engrafted with 

untreated CD34 + HSPCs was transplanted into three additional animals. 

Table 1 Secondary transplantation of ZFN-treated HSPCs 

Donor animals a    CD45 b blood (%)    Cel 1 c BM (%)    Secondary recipients    CD45 b blood (%)    Cel 1 c blood (%) 
ZFN (1)            41                  11                ZFN (3)                 34 ± 5              16 ± 4 
Neg. (1)           47                  0                 Neg. (3)                37 ± 7              0 ± 0 

a Bone marrow (BM) was harvested from donor mice engrafted with ZFN-treated HSPCs (ZFN) or untreated HSPCs (Neg.) and 

transplanted into three secondary recipients for each BM. b Levels of human CD45 + cells were measured in blood of both donor 

and recipient mice at 8 weeks post-transplantation. c CCR5 disruption rates, measured by Cel 1 analysis of donor BM at time of 

harvest and in blood of recipient mice at 10 weeks post-transplantation. 

Analysis of the peripheral blood of the secondary recipients 8 

weeks later revealed that all six animals had engrafted and that there 

was no significant difference in the percentage of human CD45 + leukocytes 

between the ZFN-treated and control groups. Furthermore, 

human cells in the blood of the ZFN cohort had levels of CCR5 disruption 

(12–20%) that slightly exceeded the level in the original donor marrow 

(11%) (Table 1). These data demonstrate that ZFN activity 

can lead to permanent disruption of CCR5 in SCID-repopulating 

stem cells and that such modified cells retain their engraftment and 

differentiation potential. 

Protection of CD4 + T cells in peripheral blood of NSG mice after 

HIV-1 infection 

At 8–12 weeks after transplantation, engrafted animals that had received

either unmodified or ZFN-treated CD34 + HSPCs were challenged with 

the CCR5-tropic virus HIV-1 BAL . This strain of HIV-1 causes a robust 

infection and significant CD4 + T-cell depletion in humanized mouse 

models 35,36 , mimicking the human infection, in which depletion of 

CD4 + CCR5 + lymphocytes results from a combination of direct infection, 

systemic immune activation 36 and the upregulation of CCR5 on thymic 

precursors 37,38 . After infection, blood samples were collected from the 

mice every 2 weeks and analyzed for HIV-1 RNA levels, T-cell subsets and 

the extent of CCR5 disruption. At 8–12 weeks after infection, animals were 

euthanized and multiple tissues analyzed (Supplementary Fig. 2). 

Changes in the ratio of CD4 + to CD8 + T cells in the peripheral blood 

are characteristic of progressive infection in individuals with AIDS 39,40 . 

We therefore examined the CD4/CD8 ratio in blood samples from individual 

mice both before and after infection and found that the mean 

ratio before infection was similar for both the untreated and ZFN-treated 

[Figure 2 panels: (a) FACS plots of CD4 versus CD8 staining (CD45, lymphoid gate) in blood for Uninf. (3), Neg. (3) and ZFN (9) cohorts; (b) CD4 + /CD8 + ratio in blood, Neg. versus ZFN, pre-infection (P = 0.8892) and post-infection (P = 0.0001).]

Figure 2 Protection of human CD4 + T cells in peripheral blood of HIV-infected mice previously engrafted with ZFN-modified CD34 + HSPCs. (a) FACS plots 

showing human CD4 + and CD8 + T cells in peripheral blood of representative animals from each of three cohorts: uninfected mice previously engrafted with 

either untreated or ZFN-treated CD34 + HSPCs (Uninf.), and HIV-1 infected animals previously engrafted with either untreated (Neg.) or ZFN-treated (ZFN) 

CD34 + HSPCs, at 4 weeks post-infection. The total number of animals analyzed in each cohort is indicated. Cells were gated on FSC/SSC to remove debris, 

on human CD45, and a lymphoid gate applied. Percentage of cells in indicated compartments is shown. (b) Ratio of human CD4 + to CD8 + lymphocytes in 

peripheral blood of individual mice into which untreated (Neg.) or ZFN-modified CD34 + HSPCs were transplanted, measured pre-infection and at 6–8 weeks 

post-infection. Statistical analysis comparing Neg. and ZFN cohorts at each time point is shown. 



groups. After HIV-1 challenge, the ratios became highly skewed in the 

control group owing to the pronounced loss of CD4 + cells, whereas the 

ZFN-treated animals maintained normal ratios (Fig. 2a,b). 

Protection of human cells in mouse tissues after HIV-1 infection 

We next analyzed the human cells present in various mouse tissues 12 

weeks after infection with HIV-1 BAL . NSG mice into which unmodified 

cells were transplanted displayed a characteristic loss of certain 

human cell populations, whereas the ZFN-treated cohort retained 

normal human cell profiles throughout their tissues despite HIV-1 

challenge (Fig. 3a). In the intestines and spleen, which are the organs 

harboring the highest percentage of human CD4 + CCR5 + cells in 

this model (Supplementary Fig. 3), we observed specific depletion 

of CD4 + T cells from the spleen and the complete loss of all human 

lymphocytes from the intestines of untreated animals, whereas these 

populations were fully preserved in the ZFN-treated cohort (Fig. 3b). 

In the bone marrow, which is not a major target organ of HIV-1 infection, 

levels of human CD45 + cells were similar in all three groups. 

Notably, HIV-1 BAL infection resulted in the loss of virtually all 

human cells from the thymus of mice receiving untreated CD34 + 

HSPCs by 12 weeks after infection (Fig. 3a). Depletion of thymocytes 

has been proposed to occur as a consequence of the upregulation 

of CCR5 on these cells during HIV-1 infection 37,38 , and likely 

contributed both to the observed depletion in the thymus and to the 

reduction in the numbers of mature CD4 + and CD8 + T cells observed 

in other tissues. 


[Figure 3 panels: (a) FACS plots of human cells in bone marrow, spleen and thymus for Uninf. (3), Neg. (3) and ZFN (9) cohorts (axes include CD45, CD4, CD8, CD3 and SSC); (b) immunohistochemistry of small intestine (anti-CD3) and spleen (anti-CD4) for No graft (2), Neg. (2), ZFN (2), Neg. (3) and ZFN (9) animals.]

Figure 3 Effects of HIV-1 infection on human cells in HSPC-engrafted NSG mice. (a) FACS 

analysis of human cells in tissues of representative NSG mice from three cohorts: uninfected 

mice previously engrafted with either untreated or ZFN-treated CD34 + HSPCs (Uninf.), and 

HIV-1 infected animals previously engrafted with either untreated (Neg.) or ZFN-treated (ZFN) 

CD34 + HSPCs. Mice were necropsied at 12 weeks post-infection or at the equivalent time point 

for uninfected animals. The total number of animals analyzed in each cohort is indicated. FACS 

analysis was performed as described in Figure 1. Small intestine sample is lamina propria, and 

similar results were obtained when samples from the large intestine were analyzed. Percentage 

of cells in indicated compartments is shown. (b) Immunohistochemical analysis of human CD3 

expression in small intestine, and CD4 expression in spleen of representative NSG mice, into 

which untreated (Neg.) or ZFN-treated (ZFN) CD34 + HSPCs were transplanted, with and without 

HIV-1 infection. Animals were necropsied at 12 weeks after infection or at the same time point 

for uninfected animals. Control animals receiving no human CD34 + HSPCs (no graft) were also 

analyzed. The number of animals analyzed in each cohort is shown. Scale bars, 50 µm.

HIV-1 infection rapidly selects for 

CCR5 – T cells 

We examined whether the survival of T cells in 

the mice receiving ZFN-treated CD34 + HSPCs 

was the result of selection for ZFN-modified 

progeny. We measured the percentage of disrupted 

CCR5 alleles in the blood of mice at 

sequential time points after HIV-1 challenge, 

using both the Cel 1 assay and a specific PCR 

amplification that detects a common 5-bp 

duplication at the ZFN target site that typically 

accounts for 10–30% of total modifications 19 . 

Both assays revealed a rapid increase in the frequency 

of ZFN-disrupted alleles, reaching the 

upper limit of the Cel 1 assay by 4 weeks after 

infection (Fig. 4a). 
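The duplication-specific PCR reads out only one class of mutation, so on its own it understates total disruption. As a rough illustration (not a calculation reported by the authors), the quoted 10–30% share can be used to bracket the total disruption implied by a given duplication frequency. The function name and the 6% input are hypothetical and chosen only for the example.

```python
def total_disruption_range(dup_freq_pct, share_low=0.10, share_high=0.30):
    """Bracket total CCR5 disruption (%) implied by a 5-bp duplication frequency.

    Assumes the duplication makes up between share_low and share_high of all
    modified alleles (the 10-30% range quoted in the text). Results are capped
    at 100%, since an allele frequency cannot exceed that.
    """
    if not 0 <= dup_freq_pct <= 100:
        raise ValueError("dup_freq_pct must be a percentage between 0 and 100")
    lower = min(dup_freq_pct / share_high, 100.0)  # duplication is a large share
    upper = min(dup_freq_pct / share_low, 100.0)   # duplication is a small share
    return lower, upper


# Hypothetical input: 6% of alleles carry the duplication.
low, high = total_disruption_range(6.0)
print(f"Implied total disruption: {low:.0f}-{high:.0f}%")
```

The width of the resulting interval shows why the duplication assay is better suited to tracking relative change over time than to estimating absolute disruption.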

We also examined levels of CCR5 disruption 

in multiple tissues from ZFN-treated animals, 

either uninfected or 12 weeks after HIV-1 BAL 

challenge, and observed a sharp increase 

in CCR5 disruption after HIV-1 infection 

(Fig. 4b). FACS analysis of the spleen and intestine 

revealed that, in contrast to uninfected animals, 

in which ~25% of CD4 + cells were also 

CCR5 + , very little or no CCR5 expression was 

detected in the CD4 + T cells that persisted in 

the ZFN-treated animals (Fig. 4c,d). Together, 

these data suggest that the protection of CD4 + 

lymphocytes in ZFN-treated mice was a consequence 

of selection for CCR5 – , HIV-1-resistant 

cells derived from ZFN-edited cells. 

Heterogeneity of CCR5 modifications 

suggests polyclonal origins 

ZFN-induced double-stranded breaks 

repaired by nonhomologous end-joining 

result in highly heterogeneous changes at 

the targeted locus 19 . We used this property 

to investigate whether the CCR5 – cells that 

developed in mice that received ZFN-treated 

CD34 + HSPCs were polyclonal in origin. 

Sequencing of 60 individual CCR5 alleles 

amplified from the large intestine of an

HIV-1-infected mouse into which ZFN-treated

CD34 + HSPCs were previously transplanted 

revealed that 59 alleles harbored mutations 

at the ZFN target site (Fig. 5). As previously 




reported for this ZFN pair 19 , a high proportion (13 out of 59) of the 

mutated loci contained a characteristic 5-bp duplication, with the 

remaining 46 clones bearing 36 unique sequences. In contrast, all 

alleles sequenced from a mouse receiving untreated CD34 + HSPCs 

contained the wild-type sequence (data not shown). The high degree 

of sequence diversity observed strongly suggests that multiple stem 

or progenitor cells were modified by the ZFNs. These findings also 

predict that the overwhelming majority of cells selected by HIV-1 BAL 

infection would be CCR5 −/− , which is in agreement with the data 

from flow cytometry analysis (Fig. 4c). 
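The allele counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the wild-type string is copied from Figure 5, and the example row is built from it to mimic how a 5-bp deletion appears in that alignment (dashes mark deleted bases).

```python
WILD_TYPE = "gttttgtgggcaacatgctggtcatcctcatcctgataaactgcaaaaggctgaagagcatgactgaca"


def deletion_row(start, length):
    """Return the wild-type sequence with `length` bases from 0-based `start` shown as dashes."""
    return WILD_TYPE[:start] + "-" * length + WILD_TYPE[start + length:]


def deleted_bases(aligned_row):
    """Count deleted bases in an aligned row (each dash is one deleted base)."""
    return aligned_row.count("-")


# Counts reported in the text for the sequenced alleles.
TOTAL_ALLELES = 60
MUTATED = 59
DUPLICATION_CLONES = 13
OTHER_CLONES = MUTATED - DUPLICATION_CLONES   # clones without the duplication
OTHER_UNIQUE = 36                             # unique sequences among those clones

dup_share = DUPLICATION_CLONES / MUTATED
distinct_mutant_sequences = OTHER_UNIQUE + 1  # plus the duplication itself

example = deletion_row(32, 5)
print(example, -deleted_bases(example))
print(f"Mutated alleles: {MUTATED}/{TOTAL_ALLELES}")
print(f"5-bp duplication share: {100 * dup_share:.0f}% of mutated alleles")
print(f"Distinct mutant sequences: {distinct_mutant_sequences}")
```

With the reported counts, the duplication accounts for about 22% of mutated alleles, inside the 10–30% range cited for this ZFN pair, and the 59 mutated alleles comprise 37 distinct sequences.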

Presence of ZFN-modified cells controls HIV-1 replication in vivo 

Quantitative PCR analysis of HIV-1 RNA levels in the peripheral 

blood of animals revealed that peak viremia occurred at 6 weeks after 

infection for animals that received transplants of either untreated or 

ZFN-treated CD34 + HSPCs (Fig. 6a), although the levels were significantly 

lower (P = 0.03) in the ZFN cohort. By 8 weeks after infection, 

viral loads in both cohorts were dropping but there continued 

to be a statistically significant difference between the two groups (P 

= 0.001). Measurements of p24 levels in the blood by enzyme-linked 

immunosorbent assay (ELISA) corroborated these findings, with a 

Figure 4 HIV-1 infection selects for disrupted 

CCR5 alleles. (a) Mean ± s.d. levels of CCR5 

disruption (Cel 1 assay, black bars) in sequential 

peripheral blood samples taken from mice 

into which ZFN-treated CD34 + HSPCs were 

transplanted and which were subsequently 

infected with HIV-1. Upper limit of linearity of 

Cel 1 assay is 44% (ref. 19) and is indicated by 

the dotted line; upper limit of sensitivity of assay 

is 70–80%. White bars show the frequency of 

a common 5-bp duplication at the ZFN target 

site that typically comprises 10–30% of total 

CCR5 mutations 19 . Numbers of mice analyzed 

at each time point, and in each assay, are shown 

above the appropriate bar. (b) Mean ± s.d. levels 

of CCR5 disruption (Cel 1 assay) in indicated 

tissues from mice into which ZFN-treated CD34 + 

HSPCs were transplanted; mice were necropsied 

at 12 weeks after infection (black bars) or at 

an equivalent time point for uninfected ZFN-treated

animals (white bars). Numbers analyzed 

in each group are shown above the appropriate 

bar. One representative Cel 1 analysis from the 

large intestine (lamina propria) of uninfected 

and infected mice is shown. Animals receiving 

untreated cells gave no Cel 1 digestion products 

at any time point analyzed (data not shown). 

Asterisk indicates levels too low to quantify. 

(c) Contour FACS analyses of human CD4 + 

cells in the small intestine (lamina propria) and 

spleen of one representative animal from each 

indicated cohort are shown. Cells were gated 

on FSC/SSC to remove debris and gated on 

human CD45 and CD4. Numbers indicate the 

percentage of cells that are CCR5 + . (d) Mean ± 

s.d. numbers of human CD4 + cells (gray bars) 

and CD4 + CCR5 + cells (white bars) per 5,000 

human CD45 + cells analyzed from different 

sections of the intestine and from the indicated 

cohorts. Asterisk indicates levels too low to 

quantify. Number of animals analyzed in each 

cohort is indicated. Abbr. S, small intestine; L, 

large intestine; E, intraepithelial lymphocytes; P, 

lamina propria lymphocytes; BM, bone marrow. 

[Figure 4 panels: (a) CCR5 disruption in peripheral blood (%); (b) CCR5 disruption (%) by tissue; (c) CCR5 FACS contour plots; (d) number of cells per 5,000 human CD45 + cells.]

significant difference (P = 0.02) in antigenemia between the two 

groups observed by the 6-week time point (data not shown). 

These differences between the two cohorts are more striking when 

the levels of human CD4 + T cells are also considered (Fig. 6a), as the 

loss of CD4 + T cells in the untreated mice probably contributed to the 

lowering of overall viral levels seen as the infection progressed. The 

continued presence of virus in the blood, despite acute loss of CD4 + 

cells, also occurs during progression to AIDS, where high viral load 

measurements in serum are typically observed when T-cell death is 

rapidly occurring 41 . In contrast, CD4 + T-cell levels in the ZFN-treated 

mice rebounded after the 2-week nadir and recovered to normal levels 

by 4 weeks after infection. In contrast to these findings with

HIV-1 BAL , ZFN-treated mice challenged with a CXCR4-tropic HIV-1 strain

did not control viral levels or preserve CD4 + T cells, confirming that 

the mechanism is CCR5 specific (Supplementary Fig. 4). 

We also measured HIV-1 levels in intestinal samples. In tissues 

harvested at 8 and 9 weeks after infection, viral levels in the ZFN-treated

mice were 4 orders of magnitude lower than in the untreated 

controls. By the 10- and 12-week time points, HIV-1 RNA was undetectable 

in the ZFN-treated mice (Fig. 6b). This drop in viral load 

occurred despite the maintenance of normal numbers of human 

[Figure 4 panel labels, continued: (a) total disruptions and 5-bp duplication versus weeks post-infection; (b) thymus, lung, spleen, BM and intestinal compartments (SE, SP, LE, LP), with representative Cel 1 products (%) in large intestine of 8 (Uninf.) and 56 (HIV-1); (c) CCR5 versus CD4 for Uninf. (3), Neg. (3) and ZFN (9) cohorts; (d) CD4 + and CD4 + CCR5 + cells.]


Wild-type (1) 

gttttgtgggcaacatgctggtcatcctcatcctgataaactgcaaaaggctgaagagcatgactgaca wt 

Deletions (43) 

gttttgtgggcaacatgctggtcatcctcat-ctgataaactgcaaaaggctgaagagcatgactgaca -1 

gttttgtgggcaacatgctggtcatcctcatcctgat--actgcaaaaggctgaagagcatgactgaca -2 

gttttgtgggcaacatgctggtcatcctcatcctg--aaactgcaaaaggctgaagagcatgactgaca -2 2X 

gttttgtgggcaacatgctggtcatcc---tcctgataaactgcaaaaggctgaagagcatgactgaca -3 

gttttgtgggcaacatgctggtcatcctcatc----taaactgcaaaaggctgaagagcatgactgaca -4 

gttttgtgggcaacatgctggtcatcctcatc-----aaactgcaaaaggctgaagagcatgactgaca -5 3X 

gttttgtgggcaacatgctggAcatcctcatcctgat------caaaaggctgaagagcatgactgaca -6 

gttttgtgggcaacatgctggtcatcctcatc------aaTtgcaaaaggctgaagagcatgactgaca -6 

gttttgtgggcaacatgctggtcatcctcatcctgat-------aaaaggctgaagagcatgactgaca -7 

gttttgtgggcaacatgctggtcat-------ctgataaactgcaaaaggctgaagagcatgactgaca -7 

gttttgtgggcaacatgctggtcatcctcatc--------ctgcaaaaggctgaagagcatgactgaca -8 

gttttgtgggcaacatgctggtcatcctcatcctgat--------aaaggctgaagagcatgactgaca -8 

gttttgtgggcaacatgctggtcatcctc--------aaactgcaaaaggctgaagagcatgactgaca -8 

gttttgtgggcaacatgctggtcatcc--------ataaactgcaaaaggctAaagagcatgactgaca -8 

gttttgtgggcaacatgctggtcatcctcat---------ctgcaaaaggctgaagagcatgactgaca -9 

gttttgtgggcaacatgctggtcatcctcatcctgat----------aggctgaagagcatgactgaca -10 

gttttgtgggcaacatgctggt----------ctgataaactgcaaaaggctgaagagcatgactgaca -10 

gttttgtgggcaacatgctggtcatcctcatc-----------caaaaggctgaagagcatgactgaca -11 

gttttgtgggcaacatgctggtcatcctca-----------tgcaaaaggctgaagagcatgactgaca -11 2X 

gttttgtgggcaacatgctggtcatcctcatc------------aaaaggctgaaAagGatgactgaca -12 

gttttgtgggcaacatgctg------------ctgGtaaactgcaaaaggctgaagagcatgactgaca -12 

gttttgtgggcaacatgctggtcatcct--------------gcaaaaggctgaagagcatgactgaca -14 5X 

gttttgtgggcaacatgctggtcat---------------ctgcaaaaggctgaagagcatgactgaca -15 

gttttgtgggcaacatgctggtcatcct---------------caaaaggctgaagagcatgactgaca -15 2X 

gttttgtgggcaacatgctggtcatcctcatcctgataa----------------gagcatgactgaca -16 

gttttgtgggcaacatgctggtcatcctcatcctgat-----------------Cgagcatgactgaca -17 

gttttgtgggcaacatgctggtcatcctcatcctga-------------------gagcatgactgaca -19 

gttttgtgggcaacatgctggtcatcctcatc-------------------tgaagagcatgactgaca -19 

gttttgtgggcaacatgctggtcatcctcatcctgat--------------------gcatgactgaca -20 

gttttgtgggcaacatgctggtcatcctcatc----------------------agagcatgactgaca -22 

gttttgtgggcaacatgc--------------------------aaaaggctgaagagcatgactgaca -26 

gttttgtgggcaa------------------------------caaaaggctgaagagcatgactgaca -30 

gttttgtgggcaacatgctggtcatcctcatcctg--------------------------------ca -32 

gttttgtgggcaacatgctggt---------------------------------------------ca -45 

Insertions (16) 

gttttgtgggcaacatgctggtcatcctcatcctCTgataaactgcaaaaggctgaagagcatgactga +2 

gttttgtgggcaacatgctggtcatcctcatcctgataTAaactgcaaaaggctgaagagcatgactga +2 

gttttgtgggcaacatgctggtcatcctcatcctgatCTGATaaactgcaaaaggctgaagagcatgac +5 13X 

T lymphocytes in the intestines and other tissues (Fig. 3). These 

observations are consistent with a strong selective pressure for HIV-resistant

CCR5 −/− cells to replace CCR5-expressing cells, leading to 

control of viral replication. 
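The "4 orders of magnitude" difference in intestinal viral load reported above corresponds to a roughly 10,000-fold ratio. A minimal sketch of that conversion follows; the function name and the two input values are hypothetical and are not measurements from this study.

```python
import math


def orders_of_magnitude_lower(control, treated):
    """Return the log10 fold difference between control and treated viral loads."""
    if control <= 0 or treated <= 0:
        # Undetectable samples have no finite log ratio; the assay's limit of
        # detection would have to be substituted before calling this function.
        raise ValueError("viral loads must be positive")
    return math.log10(control / treated)


# Hypothetical loads in HIV-1 RNA copies per 10^6 cells.
print(orders_of_magnitude_lower(1e7, 1e3))
```

Because the ratio is undefined once HIV-1 RNA becomes undetectable, the 10- and 12-week samples can only be reported relative to the assay's limit of detection.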

DISCUSSION 

Despite major advances in anti-retroviral therapy, HIV-1 infection 

remains an epidemic cause of morbidity and mortality. Effective antiretroviral 

therapy often involves costly, multi-drug regimens that are 

not well tolerated by a significant percentage of patients 42 . Even

successful adherence to the therapy does not eradicate the virus, and a

rapid rebound in HIV-1 levels can occur if therapy is discontinued 43 .

An alternative approach to controlling HIV-1 replication is engineering 

of the body’s immune cells to be resistant to infection 44 . In this regard, 

the CCR5 co-receptor is an attractive target because of the HIV-resistant 

phenotype of homozygous CCR5Δ32 individuals 3 . In the present study, 

we identified conditions that allow efficient disruption of CCR5 in 

human CD34 + HSPCs and demonstrated that such modified cells 

generate CCR5 −/− , HIV-resistant progeny in a mouse model of human 

hematopoiesis and HIV-1 infection, leading to control of HIV-1 replication. 

These findings suggest that transplantation of autologous HSPCs 

modified by CCR5-specific ZFNs may provide a permanent supply of 

HIV-resistant progeny that could replace cells killed by HIV-1, reconstitute 

the immune system and control viral replication long term in the 

absence of anti-retroviral therapy. 

The high levels of CCR5 disruption that we achieved were possible 

because of an efficient gene editing technology based on ZFNs. 

ZFNs can be designed to bind to a specific genomic DNA sequence 

Figure 5 ZFN activity produces heterogeneous 

mutations in CCR5. Sequence analysis was 

performed on 60 cloned human CCR5 alleles, 

PCR amplified from intraepithelial cells from 

the large intestine of an HIV-infected mouse into 

which ZFN-treated CD34 + HSPCs were previously 

transplanted, and at 12 weeks post-infection. 

The number of nucleotides deleted or inserted 

at the ZFN target site (underlined) in each clone 

is indicated on the right of each sequence, 

together with the number of times the sequence 

was found. Dashes (–) indicate deleted bases 

compared to the wild-type sequence; uppercase 

letters are point mutations; underlined upper 

case letters are inserted bases. Some specific 

mutations of CCR5 occurred more frequently, 

in particular a 5-bp duplication at the ZFN 

target site that was identified 13 times (bottom 

sequence). No mutations in CCR5 were observed 

in a similar analysis performed on control samples 

from a mouse receiving unmodified CD34 + 

HSPCs (data not shown). 

and effect permanent knockout of the targeted 

gene 19,45–47 . Only transient expression 

of the ZFNs is required during a brief period 

of ex vivo culture, and the genetic mutation 

is present for the life of the cell and its progeny. 

Thus, a major shortcoming of other gene 

therapy technologies—the need for continued 

expression of a foreign transgene—is avoided. 

Moreover, unlike approaches based on small 

molecules, antibodies or RNA interference 44 , 

ZFN-mediated gene disruption can completely 

eliminate CCR5 from the surface of 

cells through bi-allelic modification. By using an optimized nucleofection 

procedure, we were able to overcome the technical challenges to 

ZFN-induced genome editing in CD34 + cells previously reported 21 and 

achieve, on average, disruption at 17% of the loci, which we estimate will 

produce 5–7% bi-allelically modified cells. 
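The relationship between allele-level disruption (17%) and the estimated share of bi-allelically modified cells (5–7%) can be made concrete with a simple sketch. This is not the authors' calculation, which is not described in this section. It assumes, hypothetically, that ZFNs act in only a fraction of cells and that within those cells each allele is disrupted independently with the same probability.

```python
def biallelic_fraction(allele_rate, responding_fraction=1.0):
    """Fraction of all cells with both CCR5 alleles disrupted.

    allele_rate: overall fraction of disrupted alleles (e.g. 0.17).
    responding_fraction: hypothetical fraction of cells in which ZFNs act;
        within those cells each allele is assumed to be disrupted
        independently with probability allele_rate / responding_fraction.
    """
    if not 0 < responding_fraction <= 1:
        raise ValueError("responding_fraction must be in (0, 1]")
    per_allele = allele_rate / responding_fraction
    if per_allele > 1:
        raise ValueError("allele_rate cannot exceed responding_fraction")
    return responding_fraction * per_allele ** 2


for f in (1.0, 0.6, 0.4):
    pct = 100 * biallelic_fraction(0.17, f)
    print(f"responding fraction {f:.1f}: {pct:.1f}% bi-allelic")
```

Under full independence across all cells the sketch gives about 2.9%, whereas values close to the 5–7% estimate arise when disruption is concentrated in roughly 40–60% of cells. Under these assumptions, the estimate would therefore imply that modification events cluster within a subset of the treated population.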

The safety and efficacy of T lymphocytes modified with CCR5- 

targeted ZFNs are currently being evaluated in a phase 1 clinical trial. 

In a preclinical study, investigation of the specificity of the same CCR5- 

targeted ZFNs as used in this study revealed off-target cleavage events in 

T cells at significant levels only at the homologous CCR2 locus 19 . Studies 

in mice have not detected any deleterious phenotype associated with 

loss of CCR2 (ref. 48), and human genetic studies have even suggested 

a beneficial phenotype from the loss of this gene in HIV-infected individuals 

49 . Although not analyzed here, modification of CD34 + HSPCs 

with these same CCR5 ZFN reagents is likely to result in similar, low 

levels of off-target cleavage events. Any safety concerns associated with 

nonspecific cleavage must be evaluated in larger, future studies. 

Although T lymphocytes are the primary target of HIV-1 infection, 

ZFN modification of HSPCs may allow longer-term production of 

CCR5 −/− cells in patients. The scientific rationale for CCR5 modification 

of HSPCs is supported by the recent finding that an HIV + leukemia patient 

receiving a transplant from a CCR5 −/− donor was effectively cured of his 

infection, despite discontinuing antiretroviral therapy 9 . As shown by our 

data, ZFN-modified HSPCs retained full functionality and gave rise to 

CCR5 – cells in lineages relevant to HIV-1 pathogenesis. ZFNs delivered to 

purified CD34 + cell populations by nucleofection were capable of modifying 

true SCID-repopulating stem cells, and the high levels of CCR5 editing 

were maintained after secondary transplantation. 




The experimental mouse model of HIV-1 infection used in these 

studies revealed a strong selection for CCR5 – progeny during acute 

infection with a CCR5-tropic strain of HIV-1. This suggests that 

CCR5 −/− stem cells, even if the minority, produced sufficient numbers 

of CCR5 −/− progeny to support immune reconstitution and inhibit 

HIV-1 replication. Such selection is consistent with clinical observations 

from genetic diseases such as adenosine deaminase deficiency 

(ADA)-SCID, X-linked SCID and Wiskott-Aldrich syndrome, in which 

normal hematopoietic cells have a selective advantage, so that spontaneous 

monoclonal reversions can lead to selective outgrowth of such cells 

and amelioration of symptoms 50–53 . 

The observation of almost complete replacement of human T cells in 

the intestines of the infected mice with CCR5 – cells is consistent with 

this tissue harboring the majority of the body’s CD4 + CCR5 + effector 

memory cells. A characteristic feature of HIV-1 replication in mucosal 

tissues is an ongoing cycle of T-cell death and the recruitment of replacement 

T cells, which, in an activated state, are highly permissive for HIV-1 

infection 37 . This is especially true in the gut mucosa, a key battleground 

in HIV-1 infection 54–56 . We also observed a strong selection for CCR5 – 

cells in the thymus, suggesting that CCR5 – cells would be selected at both 

a precursor stage in the thymus and at an effector stage in the mucosa. 

Ultimately, the presence of HIV-resistant CCR5 – cells in mucosal tissues 

should both protect individual cells from infection and help to break 

the cycle of immune hyperactivation that may underlie much of the 

pathology of AIDS 57 . 

[Figure 6 panels: (a) HIV-1 RNA copies/ml blood and CD4 + in blood (%) versus weeks post-infection, Neg. and ZFN; (b) HIV-1 RNA copies/10^6 cells in large intestine at 8, 9, 10 and 12 weeks post-infection, Neg. (3) and ZFN (9).]

Figure 6 Control of HIV-1 replication in mice receiving ZFN-treated CD34 +

HSPCs. (a) Mean ± s.d. levels of HIV-1 RNA (left) and percent CD4 +

human T cells (right) in peripheral blood of mice into which untreated (Neg.)

or ZFN-treated CD34 + HSPCs were transplanted, at indicated times post-infection.

Dashed line is limit of detection of assay. Asterisk indicates a

statistically significant difference between two groups (P < 0.05). (b) Mean

± s.d. HIV-1 RNA levels in small and large intestine lamina propria from

Neg. or ZFN mice, from animals necropsied between 8 and 12 weeks post-infection.

Numbers of mice analyzed at each time point are shown above the

appropriate bar. Dashed line indicates limits of detection of assay. Asterisk

indicates undetectable levels.

Although antiretroviral therapy is highly effective in many patients, the 

associated costs and potential for side effects can be considerable when 

extrapolated over a lifetime. In contrast, our approach may provide a 

one-shot treatment that would be most suited to the setting of autologous 

HSPC transplantation. Procedures for isolating and processing HSPCs 

for autologous or allogeneic transplantation are well established. The use 

of a patient’s own stem cells may remove the requirement for full ablation 

of the marrow hematopoietic compartment and the immune suppression 

that is necessary in allogeneic transplantation. Indeed, the toxicity of such 

regimens is one reason that allogeneic stem cell transplantation from 

CCR5Δ32 donors is not a realistic treatment option for HIV + patients in 

the absence of other conditions that necessitate the transplant. 

Of note, certain HIV-infected individuals, such as AIDS lymphoma 

patients, already undergo full ablation and autologous HSPC rescue 

as part of their therapy 58 and may be suitable candidates for HSPC-based

gene therapies 44 . In addition, the experience of autologous HSPC 

transplantation in gene therapy treatments for ADA-SCID 59,60 , chronic 

granulomatous disease 61 and X-linked adrenoleukodystrophy 62 is that 

nonmyeloablative conditioning can facilitate engraftment of gene-modified 

autologous HSPCs with minimal associated toxicity. It is possible 

that the use of nonmyeloablative regimens, together with the selective 

advantage conferred on CCR5 −/− progeny, could prove an effective combination 

for HIV + patients receiving ZFN-treated autologous HSPCs. 

Targeting CCR5 is not expected to provide protection against viruses 

that use alternate co-receptors such as CXCR4. Although only a handful 

of cases of HIV-1 infection of CCR5Δ32 homozygotes have been 

reported 63,64 , CXCR4-tropic viruses have been associated with accelerated 

disease progression 65 , so that selection for such strains could be an 

undesirable consequence of targeting CCR5. However, this outcome is 

not generally observed in patients treated with CCR5 inhibitors unless 

CXCR4-tropic viruses were present before therapy, and resistance to 

these drugs occurs by viral adaptation to the drug-bound form of CCR5 

(refs. 66,67). Notably, although the patient who received the CCR5Δ32 

transplant harbored CXCR4-tropic virus before the procedure, his HIV-1 

infection was still controlled long term 9,10 . Similar to the recommendations 

for CCR5 inhibitors, it may be prudent to restrict CCR5 ZFN treatment 

of HSPCs to individuals with no detectable CXCR4-tropic virus. 

In contrast to the acute HIV-1 infection modeled in this study, HIV-1 

patients usually present in a chronic phase of the disease, and their viral 

levels can be effectively controlled by antiretroviral therapy. The requirement 

for the selective pressure of active HIV-1 replication in the success 

of this, or other, anti-HIV gene therapies is at present unknown. It has 

been suggested that low-level viral replication continues in certain sanctuary 

sites, even in well-controlled patients on antiretroviral therapy 43,68 , 

which could provide a low level of selection, although drug intensification 

trials have not provided evidence of ongoing replication 69 . It is also 

possible that the high levels of CCR5 disruption we achieved without 

selection, if extrapolated to HIV + patients, could be sufficient to provide 

a therapeutic effect even in the absence of a strong selective pressure. 

Alternatively, ZFN knockout of CCR5 in HSPCs could be viewed 

as a backup strategy in the event that antiretroviral therapy fails or is 

withdrawn. It may also be possible to incorporate antiretroviral therapy 

interruptions into an overall therapeutic strategy, as recently described 

for HIV-infected individuals receiving autologous HSPCs engineered 

with anti-HIV ribozymes, where gene-marked progeny were found at 

higher levels after treatment interruptions 70 . 

In summary, our data demonstrate that transient ZFN treatment of 

human CD34 + HSPCs can efficiently disrupt CCR5 while yielding cells 

that remain competent to engraft and support hematopoiesis. In the 

presence of CCR5-tropic HIV-1, CCR5 −/− progeny rapidly replaced cells 

depleted by the virus, leading to a polyclonal population that ultimately 




preserved human immune cells in multiple tissues. Our findings indicate 

that the modification of only a minority of human CD34 + HSPCs may 

provide the same strong anti-viral benefit as was conferred by a complete 

CCR5Δ32 stem cell transplantation in a patient 9 . They further

suggest that a partially modified autologous transplant, administered

under only mildly ablative transplantation regimens, may also be effective,

opening up the treatment to many more HIV-infected individuals.

Finally, the identification of conditions that allow the efficient use of 

ZFNs in human CD34 + HSPCs suggests the use of this technology in 

other diseases for which HSPC modification may be curative. 

METHODS 





ACKNOWLEDGMENTS

We would like to thank A. Cuddihy, S. Ge, R. Hollis and N. Smiley for expert

technical assistance; C. Lutzko, V. Garcia, R. Akkina, B. Torbett and M. McCune for 

advice regarding humanized mice; and M. McCune for communicating unpublished 

data. This work was supported by funding from the California HIV/AIDS Research 

Project (P.M.C.), The Saban Research Institute (V.T.), and the National Heart, Lung, 

and Blood Institute P01 HL73104 (G.M.C., D.B.K. and P.M.C.). 


AUTHOR CONTRIBUTIONS

N.H. performed most of the experiments; J.W., K.K., G.F. and X.W. developed assays

and analyzed samples; V.T. contributed to discussions; N.H., G.M.C., D.B.K., P.D.G., 

M.C.H. and P.M.C. designed the experiments and analyzed data; N.H. and P.M.C. 

wrote the manuscript. 





Reprints and permissions information is available online at 

http://npg.nature.com/reprintsandpermissions/. 

1. Wu, L. et al. CD4-induced interaction of primary HIV-1 gp120 glycoproteins with the 

chemokine receptor CCR-5. Nature 384, 179–183 (1996). 

2. deRoda Husman, A.M., Blaak, H., Brouwer, M. & Schuitemaker, H. CC chemokine 

receptor 5 cell-surface expression in relation to CC chemokine receptor 5 genotype and 

the clinical course of HIV-1 infection. J. Immunol. 163, 4597–4603 (1999). 

3. Samson, M. et al. Resistance to HIV-1 infection in Caucasian individuals bearing mutant 

alleles of the CCR-5 chemokine receptor gene. Nature 382, 722–725 (1996). 

4. Novembre, J. et al. The geographic spread of the CCR5 Delta32 HIV-resistance allele. 

PLoS Biol. 3, e339 (2005). 

5. Glass, W.G. et al. CCR5 deficiency increases risk of symptomatic West Nile virus 

infection. J. Exp. Med. 203, 35–40 (2006). 

6. Kantarci, O.H. et al. CCR5∆32 polymorphism effects on CCR5 expression, patterns 

of immunopathology and disease course in multiple sclerosis. J. Neuroimmunol. 169, 

137–143 (2005). 

7. Rossol, M. et al. Negative association of the chemokine receptor CCR5 d32 polymorphism 

with systemic inflammatory response, extra-articular symptoms and joint 

erosion in rheumatoid arthritis. Arthritis Res. Ther. 11, R91–98 (2009). 

8. Dau, B. & Holodiny, M. Novel targets for antiretroviral therapy: clinical progress to 

date. Drugs 69, 31–50 (2009). 

9. Hutter, G. et al. Long-term control of HIV by CCR5 Delta32/Delta32 stem-cell transplantation. 

N. Engl. J. Med. 360, 692–698 (2009). 

10. Hutter, G., Schneider, T. & Thiel, E. Transplantation of selected or transgenic blood 

stem cells—a future treatment for HIV/AIDS? J. Int. AIDS Soc. 12, 10–14 (2009). 

11. Anderson, J. et al. Safety and efficacy of a lentiviral vector containing three anti-HIV 

genes–CCR5 ribozyme, tat-rev siRNA, and TAR decoy–in SCID-hu mouse-derived T 

cells. Mol. Ther. 15, 1182–1188 (2007). 

12. Bai, J. et al. Characterization of anti-CCR5 ribozyme-transduced CD34+ hematopoietic 

progenitor cells in vitro and in a SCID-hu mouse model in vivo. Mol. Ther. 1, 244–254 

(2000). 

13. Kumar, P. et al. T cell-specific siRNA delivery suppresses HIV-1 infection in humanized 

mice. Cell 134, 577–586 (2008). 

14. Swan, C.H. et al. T-cell protection and enrichment through lentiviral CCR5 intrabody 

gene delivery. Gene Ther. 13, 1480–1492 (2006). 

15. Swan, C.H. & Torbett, B.E. Can gene delivery close the door to HIV-1 entry after 

escape? J. Med. Primatol. 35, 236–247 (2006). 

16. Urnov, F.D. et al. Highly efficient endogenous human gene correction using designed 

zinc-finger nucleases. Nature 435, 646–651 (2005). 

17. Jasin, M. et al. Genetic manipulation of genomes with rare-cutting endonucleases. 

Trends Genet. 12, 224–228 (1996). 

18. Sonoda, E. et al. Differential usage of non-homologous end-joining and homologous 

recombination in double strand break repair. DNA Repair (Amst.) 5, 1021–1029 

(2006). 

19. Perez, E.E. et al. Establishment of HIV-1 resistance in CD4+ T cells by genome editing 

using zinc-finger nucleases. Nat. Biotechnol. 26, 808–816 (2008). 

20. Ishikawa, F. et al. Development of functional human blood and immune systems in NOD/SCID/IL2 

receptor γ chain(null) mice. Blood 106, 1565–1573 (2005). 

21. Lombardo, A. et al. Gene editing in human stem cells using zinc finger nucleases 

and integrase-defective lentiviral vector delivery. Nat. Biotechnol. 25, 1298–1306 

(2007). 

22. Hollis, R.P. et al. Stable gene transfer to human CD34(+) hematopoietic cells using 

the Sleeping Beauty transposon. Exp. Hematol. 34, 1333–1343 (2006). 

23. Sumiyoshi, T. et al. Stable transgene expression in primitive human CD34+ hematopoietic 

stem/progenitor cells, using the Sleeping Beauty transposon system. Hum. Gene 

Ther. 20, 1607–1626 (2009). 

24. Mátés, L. et al. Molecular evolution of a novel hyperactive Sleeping Beauty transposase 

enables robust stable gene transfer in vertebrates. Nat. Genet. 41, 753–761 

(2009). 

25. Xue, X. et al. Stable gene transfer and expression in cord blood-derived CD34+ 

hematopoietic stem and progenitor cells by a hyperactive Sleeping Beauty transposon 

system. Blood 114, 1319–1330 (2009). 

26. Basu, S. & Broxmeyer, H.E. CCR5 ligands modulate CXCL12-induced chemotaxis, 

adhesion, and Akt phosphorylation of human cord blood CD34+ cells. J. Immunol. 

183, 7478–7488 (2009). 

27. Watanabe, S. et al. Hematopoietic stem cell-engrafted NOD/SCID/IL2Rgamma null 

mice develop human lymphoid systems and induce long-lasting HIV-1 infection with 

specific humoral immune responses. Blood 109, 212–218 (2007). 

28. Brenchley, J.M. et al. CD4 + T cell depletion during all stages of HIV disease occurs 

predominantly in the gastrointestinal tract. J. Exp. Med. 200, 749–759 (2004). 

29. Brenchley, J.M. et al. HIV disease: fallout from a mucosal catastrophe? Nat. Immunol. 

7, 235–239 (2006). 

30. Guadalupe, M. et al. Severe CD4+ T-cell depletion in gut lymphoid tissue during 

primary human immunodeficiency virus type 1 infection and substantial delay in 

restoration following highly active antiretroviral therapy. J. Virol. 77, 11708–11717 

(2003). 

31. Talal, A.H. et al. Effect of HIV-1 infection on lymphocyte proliferation in gut-associated 

lymphoid tissue. J. Acquir. Immune Defic. Syndr. 26, 208–217 (2001). 

32. Li, Q. et al. Peak SIV replication in resting memory CD4 + T cells depletes gut lamina 

propria CD4 + T cells. Nature 434, 1148–1152 (2005). 

33. Mattapallil, J.J. et al. Massive infection and loss of memory CD4 + T cells in multiple 

tissues during acute SIV infection. Nature 434, 1093–1097 (2005). 

34. Veazey, R.S. et al. Gastrointestinal tract as a major site of CD4 + T cell depletion and 

viral replication in SIV infection. Science 280, 427–431 (1998). 

35. Berges, B.K. et al. HIV-1 infection and CD4 T cell depletion in the humanized 

Rag2−/−gamma c−/− (RAG-hu) mouse model. Retrovirology 3, 76–90 (2006). 

36. Appay, V. & Sauce, D. Immune activation and inflammation in HIV-1 infection: causes 

and consequences. J. Pathol. 214, 231–241 (2008). 

37. Stoddart, C.A. et al. IFN-alpha-induced upregulation of CCR5 leads to expanded HIV 

tropism in vivo. PLoS Pathog. 6, e1000766 (2010). 

38. Choudhary, S.K. et al. R5 human immunodeficiency virus type 1 infection of fetal 

thymic organ culture induces cytokine and CCR5 expression. J. Virol. 79, 458–471 

(2005). 

39. Kahn, J.O. & Walker, B.D. Acute human immunodeficiency virus type 1 infection. N. 

Engl. J. Med. 339, 33–39 (1998). 

40. Margolick, J.B. et al. Impact of inversion of the CD4/CD8 ratio on the natural history 

of HIV-1 infection. J. Acquir. Immune Defic. Syndr. 42, 620–626 (2007). 

41. Henrard, D.R. et al. Natural History of HIV-1 cell-free viremia. J. Am. Med. Assoc. 

274, 554–558 (1995). 

42. Chen, R.Y. et al. Distribution of health care expenditures for HIV-infected patients. 

Clin. Infect. Dis. 42, 1003–1010 (2006). 

43. Richman, D.D. et al. The challenge of finding a cure for HIV infection. Science 323, 

1304–1307 (2009). 

44. Rossi, J.J., June, C.H. & Kohn, D.B. Genetic therapies against HIV. Nat. Biotechnol. 

25, 1444–1454 (2007). 

45. Bibikova, M. et al. Targeted chromosomal cleavage and mutagenesis in Drosophila 

using zinc-finger nucleases. Genetics 161, 1169–1175 (2002). 

46. Doyon, Y. et al. Heritable targeted gene disruption in zebrafish using designed zinc-finger 

nucleases. Nat. Biotechnol. 26, 702–708 (2008). 

47. Santiago, Y. et al. Targeted gene knockout in mammalian cells by using engineered 

zinc-finger nucleases. Proc. Natl. Acad. Sci. USA 105, 5809–5814 (2008). 

48. Peters, W., Dupuis, M. & Charo, I.F. A mechanism for the impaired IFN-gamma production 

in C–C chemokine receptor 2 (CCR2) knockout mice: Role of CCR2 in linking 

the innate and adaptive immune responses. J. Immunol. 165, 7072–7077 (2000). 

49. Smith, M.W. et al. CCR2 chemokine receptor and AIDS progression. Nat. Med. 3, 

1052–1053 (1997). 

50. Davis, B.R. & Candotti, F. Revertant somatic mosaicism in the Wiskott-Aldrich syndrome. 

Immunol. Res. 44, 127–131 (2009). 

51. Hirschhorn, R. et al. Spontaneous in vivo reversion to normal of an inherited mutation 

in a patient with adenosine deaminase deficiency. Nat. Genet. 13, 290–295 (1996). 



52. Hirschhorn, R. et al. In vivo reversion to normal of inherited mutations in humans. 

J. Med. Genet. 40, 721–728 (2003). 

53. Stephan, V. et al. Atypical X-linked severe combined immunodeficiency due to possible 

spontaneous reversion of the genetic defect in T cells. N. Engl. J. Med. 335, 

1563–1567 (1996). 

54. Chun, T.W. et al. Persistence of HIV in gut-associated lymphoid tissue despite long-term 

antiretroviral therapy. J. Infect. Dis. 197, 714–720 (2008). 

55. Lackner, A.A. et al. The gastrointestinal tract and AIDS pathogenesis. Gastroenterology 

136, 1965–1978 (2009). 

56. Picker, L.J. Immunopathogenesis of acute AIDS virus infection. Curr. Opin. Immunol. 

18, 399–405 (2006). 

57. Veazey, R.S., Marx, P.A. & Lackner, A.A. The mucosal immune system: primary target 

for HIV infection and AIDS. Trends Immunol. 22, 626–633 (2001). 

58. Krishnan, A. et al. Autologous stem cell transplantation for HIV associated lymphoma. 

Blood 98, 3857–3859 (2001). 

59. Aiuti, A. et al. Correction of ADA-SCID by stem cell gene therapy combined with 

nonmyeloablative conditioning. Science 296, 2410–2413 (2002). 

60. Aiuti, A. et al. Gene therapy for immunodeficiency due to adenosine deaminase 

deficiency. N. Engl. J. Med. 360, 447–458 (2009). 

61. Ott, M.G. et al. Correction of X-linked chronic granulomatous disease by gene therapy, 

augmented by insertional activation of MDS1–EVI1, PRDM16 or SETBP1. Nat. Med. 

12, 401–409 (2006). 

62. Cartier, N. et al. Hematopoietic stem cell gene therapy with a lentiviral vector in 

X-linked adrenoleukodystrophy. Science 326, 818–823 (2009). 

63. Biti, R. et al. HIV-1 infection in an individual homozygous for the CCR5 deletion allele. 

Nat. Med. 3, 252–253 (1997). 

64. Oh, D.Y. et al. CCR5Delta32 genotypes in a German HIV-1 seroconverter cohort and 

report of HIV-1 infection in a CCR5Delta32 homozygous individual. PLoS ONE 3, 

e2747–2753 (2008). 

65. Weiser, B. et al. HIV-1 coreceptor usage and CXCR4-specific viral load predict clinical 

disease progression during combination antiretroviral therapy. AIDS 22, 469–479 

(2008). 

66. Ogert, R.A. et al. Mapping Resistance to the CCR5 co-receptor antagonist vicriviroc 

using heterologous chimeric HIV-1 envelope genes reveals key determinants in the 

C2–V5 domain of gp120. Virology 373, 387–399 (2008). 

67. Soulie, C. et al. Primary genotypic resistance of HIV-1 to CCR5 antagonist treatment-naïve 

patients. AIDS 22, 2212–2214 (2008). 

68. Palmer, S. et al. Low-level viremia persists for at least 7 years in patients on suppressive 

antiretroviral therapy. Proc. Natl. Acad. Sci. USA 105, 3879–3884 (2008). 

69. Dinoso, J.B. et al. Treatment intensification does not reduce residual HIV-1 viremia 

in patients on highly active antiretroviral therapy. Proc. Natl. Acad. Sci. USA 106, 

9403–9408 (2009). 

70. Mitsuyasu, R.T. et al. Phase 2 gene therapy trial of an anti-HIV ribozyme in autologous 

CD34+ cells. Nat. Med. 15, 285–292 (2009). 





Hematopoietic stem/progenitor cell isolation. Human CD34 + HSPCs were 

isolated from umbilical cord blood collected from normal deliveries at local 

hospitals, according to guidelines approved by the Children’s Hospital Los 

Angeles Committee on Clinical Investigation, or as waste cord blood material 

from StemCyte Corp. Immunomagnetic enrichment for CD34 + cells 

was performed using the magnetic-activated cell sorting (MACS) system 

(Miltenyi Biotec), per the manufacturer’s instructions, with the modification 

that the initial purified CD34 + population was put through a second 

column and washed three times with 3 ml of the supplied buffer per wash 

before the final elution. This additional step gave a > 99% pure CD34 + population, 

as measured by FACS analysis using the anti-CD34 antibody, 8G12 

(BD Biosciences). 

Nucleofection of CD34 + HSPCs with ZFN expression plasmids. Freshly 

isolated CD34 + cells were stimulated for 5–12 h in X-VIVO 10 media (Lonza) 

containing 2 mM l-glutamine, 50 ng/ml SCF, 50 ng/ml Flt-3 and 50 ng/ml 

TPO (R&D Systems). 1 × 10⁶ cells were nucleofected with 2.5 µg each of a 

plasmid pair expressing ZFNs binding upstream (ZFN-L) or downstream 

(ZFN-R) of codon Leu55 within TM1 of human CCR5 (ref. 19). The CD34 + 

cell/DNA mix was processed in an X series Amaxa Nucleofector (Lonza) 

using the U-01 setting and the human CD34 + nucleofector solution, according 

to the manufacturer’s instructions. Following nucleofection, cells were 

immediately placed in pre-warmed IMDM media (Lonza) containing 26% 

FBS (Mediatech), 0.35% BSA, 2 mM l-glutamine, 0.5% 10⁻³ mol/l hydrocortisone 

(Stem Cell Technologies), 5 ng/ml IL-3, 10 ng/ml IL-6 and 25 ng/ml 

SCF (R&D Systems). Cells were allowed to recover in this media for 2–12 h 

before injection into mice. 

Apoptosis assay. CD34 + HSPCs were collected at 24 h post-nucleofection 

and analyzed for the percent of viable cells marked for apoptosis using the 

PE apoptosis detection kit (BD Biosciences) according to the manufacturer’s 

instructions. Cells were stained with 7-AAD (detects viable cells) and annexin 

V (detects apoptotic cells) and analyzed using a FACScan flow cytometer (BD 

Biosciences). This double staining allowed the identification of cells in the 

early stages of apoptosis. 
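
The double-staining readout can be sketched as a simple two-threshold classification. This is a minimal illustration, assuming the conventional reading of annexin V/7-AAD staining, in which 7-AAD-negative events are viable and annexin V-positive, 7-AAD-negative events are in early apoptosis. The function name, the positivity cutoffs and the example intensities are hypothetical and are not taken from the study; in practice the cutoffs would come from unstained and single-stained controls.

```python
def percent_viable_cells_marked_for_apoptosis(events, annexin_cutoff, aad_cutoff):
    """Percentage of viable (7-AAD-negative) events that are annexin V-positive.

    events: iterable of (annexin_v_intensity, seven_aad_intensity) pairs.
    Cutoffs are hypothetical positivity thresholds set from controls.
    """
    # Keep only events that exclude 7-AAD, i.e. cells with an intact membrane.
    viable = [(annexin, aad) for annexin, aad in events if aad < aad_cutoff]
    if not viable:
        raise ValueError("no 7-AAD-negative events to analyze")
    # Among viable events, annexin V positivity marks early apoptosis.
    early_apoptotic = sum(1 for annexin, _ in viable if annexin >= annexin_cutoff)
    return 100.0 * early_apoptotic / len(viable)


# Made-up intensities for illustration: (annexin V, 7-AAD)
example_events = [(50, 20), (900, 30), (850, 700), (40, 25)]
print(percent_viable_cells_marked_for_apoptosis(example_events, 200, 200))
```

Using viable events as the denominator follows the wording "percent of viable cells marked for apoptosis"; whether the original analysis used exactly this denominator is an assumption.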

NSG mouse transplantation. NOD.Cg-Prkdc scid Il2rg tm1Wj/SzJ (NOD/SCID/IL2rγ null , 

NSG) mice 71 were obtained from Jackson Laboratories. 

Neonatal mice within 48 h of birth received 150 cGy radiation, then 2–4 h 

later 1 × 10⁶ ZFN-modified or mock-treated human CD34 + HSPCs in 50 µl 

PBS containing 1% heparin were injected through the facial vein. For secondary 

transplantations, bone marrow was harvested by needle aspiration 

from the upper and lower limbs of 18-week-old animals previously engrafted 

with human CD34 + HSPCs, filtered through a 70 µm nylon mesh screen 

(Fisher Scientific) and washed in PBS. The cells were transplanted into three 

8-week-old mice that had previously received 350 cGy radiation, using retro-orbital 

injection of 2 × 10⁷ bone marrow cells per mouse. Mouse cohorts are 

described in Supplementary Table 2. 

Analysis of CCR5 disruption. The percentage of CCR5 alleles disrupted by 

ZFN treatment was measured by performing PCR across the ZFN target site 

followed by digestion with the Surveyor (Cel 1) nuclease (Transgenomic), 

which detects heteroduplex formation, as previously described 19 . Briefly, 

genomic DNA was extracted from mouse tissues and subjected to nested PCR 

amplification using human CCR5-specific primers, with the resulting radiolabeled 

products digested with Cel 1 nuclease and resolved by PAGE. The 

ratio of cleaved to uncleaved products was calculated to give a measure of 

the frequency of gene disruption. The assay is sensitive enough to detect 

single-nucleotide changes and has a linear detection range between 0.69 and 

44% 19 . 
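
The cleaved-to-uncleaved calculation can be illustrated with a short sketch. The text does not state the conversion that was applied, so the code below assumes one widely used approach for mismatch-nuclease assays, in which the fraction of cleaved product f is converted to an estimated disruption frequency as 100 × (1 − √(1 − f)). The band intensities and function names are hypothetical; only the 0.69–44% linear range is taken from the text.

```python
import math

# Linear detection range of the assay as stated in the text (percent).
LINEAR_RANGE_PERCENT = (0.69, 44.0)


def fraction_cleaved(cleaved_bands, uncleaved_band):
    """Fraction of total signal found in the Cel 1 cleavage products."""
    cleaved = sum(cleaved_bands)
    total = cleaved + uncleaved_band
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return cleaved / total


def estimated_disruption_percent(frac_cleaved):
    """Assumed conversion: 100 * (1 - sqrt(1 - fraction cleaved))."""
    if not 0.0 <= frac_cleaved <= 1.0:
        raise ValueError("fraction cleaved must lie between 0 and 1")
    return 100.0 * (1.0 - math.sqrt(1.0 - frac_cleaved))


def within_linear_range(percent):
    """True if the estimate falls inside the stated linear detection range."""
    low, high = LINEAR_RANGE_PERCENT
    return low <= percent <= high


# Made-up band intensities: two cleavage products and the uncleaved amplicon.
f = fraction_cleaved([150.0, 100.0], 750.0)
estimate = estimated_disruption_percent(f)
print(round(estimate, 2), within_linear_range(estimate))
```

The square-root term reflects that heteroduplexes form by random reannealing of modified and unmodified strands, so the cleaved fraction is not linearly proportional to the fraction of modified alleles.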

In addition, a common 5-bp (pentamer) duplication that occurs 

after nonhomologous end-joining repair of ZFN-cleaved CCR5 (ref. 19) 

was detected by PCR. The first-round PCR product generated during 

Cel 1 analysis was diluted 1:5,000 and 5 µl used in a Taqman qPCR reaction 

using primers (5′-GGTCATCCTCATCCTGATCTGA-3′ and 

5′-GATGATGAAGAAGATTCCAGAGAAGAAG-3′) and probe 5′-FAM d 

(CCTTCTTACTGTCCCCTTCTGGGCTCAC) BHQ-1-3′ (Biosearch 

Technologies), and analyzed using a 7900HT real-time PCR machine 

(Applied Biosystems). At the same time, 5 µl of a 1:50,000 dilution of 

the PCR product were used in a Taqman qPCR reaction using primers 

(5′-CCAAAAAATCAATGTGAAGCAAATC-3′ and 5′-TGCCCACAAAACCAAAGATG-3′) 

and probe 5′-FAM d(CAGCCCGCCTCCTGCCTCC) 

BHQ-1-3′ to detect total copies of human CCR5. Data were analyzed using 

software supplied by the manufacturer and the frequency of pentamer insertions 

in CCR5 calculated. The assay is sensitive enough to detect a single 

pentamer insertion event in 100,000 cells (data not shown). 
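
The pentamer frequency calculation can be sketched as follows. This is a hedged illustration, not the instrument software's method: it assumes copy numbers are read from a standard curve of the form Ct = slope × log10(copies) + intercept, and that the two reactions are put on a common scale by correcting for their different template dilutions (1:5,000 for the pentamer assay and 1:50,000 for total CCR5, both using 5 µl). The slope, intercept and copy numbers in the example are hypothetical.

```python
# Template dilutions stated in the text for the two Taqman reactions.
PENTAMER_DILUTION = 5_000
TOTAL_CCR5_DILUTION = 50_000


def copies_from_ct(ct, slope, intercept):
    """Copies per reaction from a standard curve Ct = slope*log10(copies) + intercept."""
    return 10 ** ((ct - intercept) / slope)


def pentamer_frequency_percent(pentamer_copies, total_ccr5_copies):
    """Percent of CCR5 copies carrying the pentamer duplication.

    Both inputs are copies per reaction; each is scaled back by its own
    dilution factor so the ratio refers to the undiluted PCR product.
    """
    if total_ccr5_copies <= 0:
        raise ValueError("total CCR5 copies must be positive")
    pentamer_undiluted = pentamer_copies * PENTAMER_DILUTION
    total_undiluted = total_ccr5_copies * TOTAL_CCR5_DILUTION
    return 100.0 * pentamer_undiluted / total_undiluted


# Hypothetical example: 200 pentamer copies and 1,000 total CCR5 copies per reaction.
print(pentamer_frequency_percent(200, 1000))
```

Because the total-CCR5 reaction uses tenfold more dilute template, equal raw copy numbers in the two reactions would correspond to a 10% pentamer frequency, not 100%.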

ZFN-induced modifications of CCR5 were analyzed by directly sequencing 

cloned CCR5 alleles, isolated by PCR amplification as described above, and 

TOPO-TA cloning (Invitrogen). Plasmid DNA was isolated from 60 individual 

bacterial colonies for each tissue analyzed. 

HIV-1 infection and analysis. A cell-free virus stock of HIV-1 BaL and a 

molecular clone of HIV-1 NL4-3 were obtained from the AIDS Research and 

Reference Reagent Program (ARRRP), Division of AIDS, NIAID, NIH from 

material deposited by Suzanne Gartner, Mikulas Popovic, Robert Gallo and 

Malcolm Martin. HIV-1 BaL virus was propagated in PM1 cells (obtained from 

the ARRRP; deposited by Marvin Reitz) and harvested 10 d post-infection. 

HIV-1 NL4-3 viruses were generated by transient transfection of 293T 

cells (ATCC). Viruses were titrated using the Alliance HIV-1 p24 ELISA 

kit (PerkinElmer) and by TCID50 analysis on U373-MAGI cells (ARRRP, 

deposited by Michael Emerman and Adam Geballe). Mice to be infected 

with HIV-1 were anesthetized with inhalant 2.5% isoflurane and injected 

intraperitoneally with virus stocks containing 200 ng p24, 7 × 10⁴ TCID50 

units, in 100 µl total volume. 

HIV-1 levels in peripheral blood or tissues harvested at necropsy were 

determined by extracting RNA from 5 × 10⁵ cells using the MasterPure 

Complete DNA and RNA Purification Kit (Epicentre Biotechnologies) and 

performing Taqman qPCR using a primer and probe set targeting the HIV-1 

LTR region, as previously described 72 . In addition, p24 levels were measured 

in blood samples by ELISA. 

Mouse blood and tissue collection. Peripheral blood samples were collected 

every 2 weeks starting at 8 weeks of age, using retro-orbital sampling. Whole 

blood was blocked in FBS (Mediatech) for 30 min., the red blood cells were 

lysed using Pharmlyse solution (BD Biosciences) and cells were washed with 

PBS. Tissue samples were collected at necropsy and processed immediately 

for cell isolation and FACS analysis, or kept in freezing media (IMDM plus 

20% DMSO) in liquid nitrogen, for later analysis and DNA extraction. Tissue 

samples were manually agitated in PBS before filtering through a sterile 70 

µm nylon mesh screen (Fisher Scientific) and suspension cell preparations 

produced as previously described 19 . Intestinal samples were processed as 

previously described 73 , with the modification that the mononuclear cell 

population was isolated after incubation in citrate buffer and collagenase 

enzyme for 2 h, followed by nylon wool filtration (Amersham Biosciences) 

and ficoll-hypaque gradient isolation (GE Healthcare). 

Analysis of human cells in mouse tissues. FACS analysis of human cells was 

performed using a FACSCalibur instrument (BD Biosciences) with either 

BD CellQuest Pro version 5.2 (BD Biosciences) or FlowJo software version 

8.8.6 for Macintosh (Treestar). The gating strategy performed was an initial 

forward scatter versus side scatter (FSC/SSC) gate to exclude debris, followed 

by a human CD45 gate. For analysis of lymphocyte populations in peripheral 

blood, a further lymphoid gate (low side scatter) was also applied to exclude 

cells of monocytic origin 74 . All antibodies used were fluorochrome conjugated 

and human specific, and obtained from BD Biosciences: CD45 (clone 2D1), 

CD19 (clone HIB19), CD14 (clone MϕP9), CD3 (clone SK7), CD4 (clone 

SK3), CD8 (clone HIT8a), CCR5 (2D7). Gates were set using fluorescence 

minus one controls, where cells were stained with all antibodies except the one 

of interest. Specificity was also confirmed using isotype-matched nonspecific 

antibodies (BD Biosciences) (Supplementary Fig. 1) and with tissues from 

animals that had not been engrafted with human cells. 
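
The sequential gating strategy described above can be sketched as a chain of filters. This is a minimal illustration only: the event fields, the numeric thresholds and the function name are hypothetical, and real gates were drawn on scatter plots and set with fluorescence minus one controls rather than with fixed numbers.

```python
def gate_human_cells(events, fsc_min, cd45_min, ssc_lymphoid_max=None):
    """Apply the gates in the order described in the text.

    1. Scatter gate: drop low forward-scatter events (debris).
    2. Human CD45 gate: keep events positive for human CD45.
    3. Optional lymphoid gate: keep low side-scatter events, which
       excludes cells of monocytic origin (used for peripheral blood).
    """
    gated = []
    for event in events:
        if event["fsc"] < fsc_min:
            continue  # debris
        if event["cd45"] < cd45_min:
            continue  # not a human hematopoietic cell
        if ssc_lymphoid_max is not None and event["ssc"] > ssc_lymphoid_max:
            continue  # high side scatter, outside the lymphoid gate
        gated.append(event)
    return gated


# Made-up events for illustration.
example_events = [
    {"fsc": 50, "ssc": 40, "cd45": 900},    # debris
    {"fsc": 400, "ssc": 120, "cd45": 800},  # human, low side scatter
    {"fsc": 500, "ssc": 600, "cd45": 850},  # human, high side scatter
    {"fsc": 420, "ssc": 110, "cd45": 20},   # CD45-negative
]
lymphoid = gate_human_cells(example_events, fsc_min=100, cd45_min=200, ssc_lymphoid_max=300)
print(len(lymphoid))
```

Leaving `ssc_lymphoid_max` unset corresponds to the tissue analyses, where only the scatter and human CD45 gates are described.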

Immunohistochemical analysis of human CD3 and CD4 expression, 

respectively, in the small intestine and spleen tissue from HSPC-engrafted 


mice was performed on fixed paraffin-embedded tissue sections, as previously 

described 73 . Controls included isotype-matched nonspecific antibodies and 

unengrafted NSG mice. 

Statistical analysis. All statistical analysis was performed using GraphPad 

Prism version 5.0b for Mac OSX (GraphPad Software). Unpaired two-tailed 

t-tests were performed assuming equal variance to calculate P-values. A 95% 

confidence interval was used to determine significance. A minimum of three 

data points was used for each analysis. 
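
For readers who want to reproduce this kind of comparison outside GraphPad Prism, the test can be sketched with the Python standard library. This is an illustrative reimplementation under stated assumptions, not the software's own routine: the P-value is obtained by numerically integrating the Student's t density with Simpson's rule, significance is taken as P < 0.05 (matching the 95% confidence level), and the example data are made up.

```python
import math
from statistics import mean, variance


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """P(|T| >= |t|), from Simpson's rule over [-|t|, |t|] (steps must be even)."""
    limit = abs(t_stat)
    if limit == 0:
        return 1.0
    h = 2 * limit / steps
    total = t_density(-limit, df) + t_density(limit, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(-limit + i * h, df)
    central_area = total * h / 3
    return max(0.0, 1.0 - central_area)


def unpaired_t_test(group_a, group_b):
    """Unpaired two-tailed t-test assuming equal variance (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 3 or n_b < 3:
        raise ValueError("at least three data points per group are required")
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    if pooled == 0:
        raise ValueError("pooled variance is zero; the t statistic is undefined")
    std_err = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    t_stat = (mean(group_a) - mean(group_b)) / std_err
    return t_stat, df, two_tailed_p(t_stat, df)


# Made-up measurements for two groups of three animals.
t_stat, df, p_value = unpaired_t_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_stat, 3), df, round(p_value, 4), p_value < 0.05)
```

The three-point minimum in the sketch mirrors the minimum number of data points stated for each analysis.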

71. Shultz, L.D. et al. Human lymphoid and myeloid cell development in NOD/LtSz-scid 

IL2R gamma null mice engrafted with mobilized human hematopoietic stem cells. 

J. Immunol. 174, 6477–6489 (2005). 

72. Rouet, F. et al. Transfer and evaluation of an automated, low-cost real-time reverse 

transcription-PCR test for diagnosis and monitoring of human immunodeficiency 

virus type 1 infection in a West African resource-limited setting. J. Clin. Microbiol. 

43, 2709–2717 (2005). 

73. Sun, Z. et al. Intrarectal transmission, systemic infection, and CD4+ T cell depletion 

in humanized mice infected with HIV-1. J. Exp. Med. 204, 705–714 (2007). 

74. Loken, M.R. et al. Establishing lymphocyte gates for immunophenotyping by flow 

cytometry. Cytometry 11, 453–459 (1990). 


doi:10.1038/nbt.1663 


articles 

Cell type of origin influences the molecular and 

functional properties of mouse induced pluripotent 

stem cells 


Jose M Polo 1–4 , Susanna Liu 5 , Maria Eugenia Figueroa 6 , Warakorn Kulalert 1–4 , Sarah Eminli 1–4 , 

Kah Yong Tan 1,4,7 , Effie Apostolou 1–4 , Matthias Stadtfeld 1–4 , Yushan Li 6 , Toshi Shioda 2 , Sridaran Natesan 8 , 

Amy J Wagers 1,4,7 , Ari Melnick 6 , Todd Evans 5 & Konrad Hochedlinger 1–4 

Induced pluripotent stem cells (iPSCs) have been derived from various somatic cell populations through ectopic expression of defined 

factors. It remains unclear whether iPSCs generated from different cell types are molecularly and functionally similar. Here we 

show that iPSCs obtained from mouse fibroblasts, hematopoietic and myogenic cells exhibit distinct transcriptional and epigenetic 

patterns. Moreover, we demonstrate that cellular origin influences the in vitro differentiation potentials of iPSCs into embryoid bodies 

and different hematopoietic cell types. Notably, continuous passaging of iPSCs largely attenuates these differences. Our results 

suggest that early-passage iPSCs retain a transient epigenetic memory of their somatic cells of origin, which manifests as differential 

gene expression and altered differentiation capacity. These observations may influence ongoing attempts to use iPSCs for disease 

modeling and could also be exploited in potential therapeutic applications to enhance differentiation into desired cell lineages. 

iPSCs are usually obtained from fibroblasts after infection with viral constructs 

expressing the four transcription factors Oct4, Sox2, Klf4 and 

c-Myc 1–10 . In addition, other cell types, including blood 2,4,11 , stomach 

and liver cells 1 , keratinocytes 12,13 , melanocytes 14 , pancreatic β cells 7 and 

neural progenitors 3,15–17 have been reprogrammed into iPSCs. Although 

these iPSC lines have been shown to express pluripotency genes and 

support the differentiation into cell types of all three germ layers, recent 

studies detected substantial molecular and functional differences among 

iPSCs derived from distinct cell types. For example, iPSCs produced 

from various fibroblasts, stomach and liver cells showed different propensities 

to form tumors in mice, although the underlying molecular 

mechanisms remain elusive 18 . Another study identified persistent donor 

cell–specific gene expression patterns in human iPSCs produced from 

different cell types, suggesting an influence of the somatic cell of origin 

on the molecular properties of resultant iPSCs 19 . Whether cellular origin 

also affected the functional properties of iPSCs remained unexplored 

in that report. Of note, the findings of some of these studies may be 

confounded by the presence of different viral insertions in individual 

iPSC lines and by the fact that the analyzed iPSC lines were of different 

genetic background, which can affect both gene expression patterns 20 

and the functionality 9,21 of cells. Indeed, we have recently shown that 

many mouse iPSC lines derived from different somatic cell types show 

aberrant silencing of a surprisingly small set of transcripts compared with 

embryonic stem cells (ESCs) 22 . However, our study did not investigate 

whether additional cell-of-origin–specific differences may exist in iPSC 

lines derived from different cell types. 

Patient-specific iPSCs are a valuable tool for the study of disease and 

possibly for the development of therapies 20,23–26 . Thus, resolving the question 

of whether iPSCs produced from different cell types are molecularly 

and functionally equivalent is crucial for using these cells to model disease, 

which entails detecting subtle differences in the differentiation potential 

of patient-derived iPSCs 24,27 . Furthermore, the identification of somatic 

cells that influence the differentiation capacities of resultant iPSCs into 

desired cell lineages could be useful in a therapeutic setting. 

To assess whether iPSCs derived from different somatic cell types are 

distinguishable, we compared here the transcriptional and epigenetic 

patterns, as well as the in vitro differentiation potentials, of iPSCs produced 

from four genetically identical adult mouse cell types that differed 

only in the lineage from which they were derived. 

RESULTS 

Genetically matched iPSCs derived from different cell types 

Because the genetic background of ESCs can influence their transcriptional 

and functional behaviors, we used a previously described 

‘secondary system’ to generate genetically identical iPSCs 2,28 (Fig. 1a). 

Briefly, iPSCs were generated from somatic cells using doxycycline-inducible 

lentiviruses expressing Oct4, Sox2, Klf4 and c-Myc 29 , and 

then injected into blastocysts to produce isogenic chimeric mice. 

1 Howard Hughes Medical Institute and Department of Stem Cell and Regenerative Biology, Harvard University and Harvard Medical School, Cambridge, 

Massachusetts, USA. 2 Massachusetts General Hospital Cancer Center, Charlestown, Massachusetts, USA. 3 Massachusetts General Hospital Center for 

Regenerative Medicine, Boston, Massachusetts, USA. 4 Harvard Stem Cell Institute, Cambridge, Massachusetts, USA. 5 Department of Surgery, Weill Cornell 

Medical College, New York, New York, USA. 6 Department of Medicine, Hematology Oncology Division, Weill Cornell Medical College, New York, New York, USA. 

7 Joslin Diabetes Center, Boston, Massachusetts, USA. 8 Sanofi-Aventis Cambridge Genomics Center, Cambridge, Massachusetts, USA. Correspondence should be 

addressed to K.H. (khochedlinger@helix.mgh.harvard.edu). 

Received 26 March; accepted 9 July; published online 19 July 2010; doi:10.1038/nbt1667 


[Figure 1 graphic. Panel a: derivation scheme (secondary iPSC clone carrying dox-inducible copies of Oct4, Sox2, Klf4 and c-Myc; blastocyst injection; chimeras no. 1 and no. 2; somatic cells + dox give Gra-iPSC, SMP-iPSC, B-iPSC and TTF-iPSC lines; readouts: gene expression, DNA methylation, ChIP for histone modifications, in vitro differentiation). Panel b: Cxcr4, Itgb1, Gr-1 and Lysozyme expression in SMP-iPSCs and Gra-iPSCs (y axis: fold GAPDH). Panels c and d: chimera no. 1 and chimera no. 2, lines 1–3 of each iPSC type.] 

Figure 1 iPSCs derived from different cell types are transcriptionally distinguishable. (a) Flow chart explaining the derivation and analysis of genetically 

matched iPSCs from different cell types. Secondary iPSCs were first injected into blastocysts to generate chimeric mice, from which the indicated somatic 

cell types were isolated. Exposure of these cells to doxycycline (dox) then gave rise to iPSCs. ChIP, chromatin immunoprecipitation. (b) Quantification of 

the expression levels of Cxcr4, Itgb1, Gr-1 and Lysozyme by quantitative PCR in SMP-iPSCs, in red, and Gra-iPSCs, in gray. The values were normalized to 

GAPDH expression; the error bars depict the s.e.m. (n = 3). (c) Heat map showing top 104 probes with highest variance in their expression levels. Left panel, 

SMP-iPSCs and Gra-iPSCs derived from chimera no. 1. Right panel, TTF-iPSCs and B-iPSCs derived from chimera no. 2. (d) Hierarchical, unsupervised 

clustering of iPSC expression profiles using the correlation distance and the Ward method. SMP-iPSCs and Gra-iPSCs were derived from chimera no. 1 (left 

panel), TTF-iPSCs and B-iPSCs originate from chimera no. 2 (right panel). Chi no. 1, chimera no. 1; chi no. 2, chimera no. 2. 
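
Two of the computational steps named in this legend, ranking probes by variance and measuring the correlation distance between expression profiles, can be sketched with the Python standard library. This is a hedged illustration only: the data layout, function names and example values are hypothetical, the distance is assumed to be 1 minus the Pearson correlation, and the Ward linkage step itself is not implemented here.

```python
import math
from statistics import mean, variance


def top_variance_probes(expression, k):
    """Return the k probe names with the highest variance across samples.

    expression: dict mapping probe name -> list of values, one per sample.
    """
    ranked = sorted(expression, key=lambda probe: variance(expression[probe]), reverse=True)
    return ranked[:k]


def correlation_distance(profile_x, profile_y):
    """Assumed definition: 1 - Pearson correlation between two profiles."""
    mean_x, mean_y = mean(profile_x), mean(profile_y)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(profile_x, profile_y))
    ss_x = sum((x - mean_x) ** 2 for x in profile_x)
    ss_y = sum((y - mean_y) ** 2 for y in profile_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(ss_x * ss_y)


# Made-up expression values for three probes across three samples.
example = {"probe_a": [1.0, 1.0, 1.0], "probe_b": [0.0, 5.0, 10.0], "probe_c": [1.0, 2.0, 3.0]}
print(top_variance_probes(example, 2))
print(correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

A distance near 0 indicates profiles that rise and fall together, and a distance near 2 indicates opposite patterns; a clustering routine would then group samples by these pairwise distances.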

Thus, isolation of different cell types from these chimeras and their 

subsequent exposure to doxycycline gave rise to iPSCs with the same 

genetic makeup. In this study, we focused on iPSCs derived from tail 

tip–derived fibroblasts (TTFs), splenic B cells (B), bone marrow– 

derived granulocytes and skeletal muscle precursors (SMPs) 30 , which 

were continuously cultured for 2–3 weeks (passage 4 to 6) after picking. 

The pluripotency of some of these cell lines has been previously 

documented 2 , or was analyzed in this study (Supplementary Table 

1 and Supplementary Fig. 1). All cell lines grew at similar rates and 

independently of viral transgene expression (Supplementary Fig. 

2) and upregulated the endogenous pluripotency genes Nanog, 

Sox2 and Oct4, indicating successful molecular reprogramming 

(Supplementary Table 1). Moreover, all lines gave rise to differentiated 

teratomas, and all tested lines supported the development of 

chimeric animals upon blastocyst injection, demonstrating their 

pluripotency (Supplementary Table 1). We therefore concluded that 

the cell lines analyzed here qualify as bona fide iPSC lines. 

iPSCs produced from different cell types are transcriptionally 

distinguishable 

We first evaluated whether iPSCs derived from defined somatic cell 

types retain gene expression patterns indicative of their cells of origin. 

Specifically, we assessed the expression of cell lineage–specific 

candidate genes in iPSCs derived from granulocytes (Gra-iPSCs) 

and SMPs (SMP-iPSCs). As expected, the SMP markers Cxcr4 and 

Integrin B1 (Itgb1) and the granulocyte markers Lysozyme (also known as 

Lyz1 and Lyz2) and Gr-1 (also known as Ly6g) were expressed at considerably 

higher levels in the somatic cells of origin than in resultant 


[Figure 2 graphic. Panels a and b: Gra-iPSC, SMP-iPSC, B-iPSC and TTF-iPSC lines 1–3 (d = 0.02). Panel c: percent of methylation (0% to 100%, or not analyzed) at Gr-1, Lysozyme, Itgb1 and Cxcr4 in SMP-iPSC1–3, Gra-iPSC1–3 and ESC. Panel d: ChIP for H3Ac, H3K4me3, H3K27me3 and IgG at Gr-1, Cxcr4, Lysozyme and Itgb1 in Gra, SMP, Gra-iPSC and SMP-iPSC (y axis: percent input).] 

Figure 2 iPSCs derived from different cell types exhibit distinguishable epigenetic signatures. (a) Hierarchical unsupervised clustering analysis of 

HELP genome-wide methylation data from indicated iPSC lines. (b) Correspondence analysis of SMP-iPSCs and Gra-iPSCs (left panel) from chimera 

no. 1, TTF-iPSCs and B-iPSCs (right panel) from chimera no. 2. (c) Graphic representation of DNA methylation quantification of specific CpGs 

(circles) in the promoter regions of the indicated candidate genes using EpiTYPER DNA methylation analyses. Yellow indicates 0% methylation and 

blue 100% methylation. (d) Chromatin immunoprecipitation (ChIP) for H3 pan-acetylated (H3Ac, in blue), H3K4 trimethylated (H3K4me3, in green), 

H3K27 trimethylated (H3K27me3, in red) and isotype control (IgG, in light blue) of granulocytes (Gra), SMPs, Gra-iPSCs and SMP-iPSCs. Chi no. 1, 

chimera no. 1; chi no. 2, chimera no. 2. The error bars depict the s.e.m. (n = 3). 


[Figure 3 graphic. Outline labels: 5,000 cells/ml; 6 days; EBs; dissociate and plate 100,000/ml. Bar-chart axes: EB diameter (in arbitrary units) and EryP colonies, for B-iPSC, TTF-iPSC, Gra-iPSC and SMP-iPSC.]

iPSCs (Supplementary Fig. 3). Moreover, SMP-iPSCs expressed substantially 

higher levels of Cxcr4 and Itgb1 than did Gra-iPSCs (Fig. 

1b), and Gra-iPSCs showed higher expression levels of Lysozyme 

and Gr-1 compared with SMP-iPSCs (Fig. 1b). Together, these data 

suggest that iPSCs retain a transcriptional memory of their somatic 

cell of origin. 

To test this notion globally, we compared the transcriptional profiles 

of iPSC lines originating from SMPs (n = 3) with those derived from 

granulocytes (n = 3), as well as expression profiles of iPSC lines originating 

from B cells (n = 3) with those produced from TTFs (n = 3). 

Note that iPSCs were compared with each other only if they originated 

from the same chimeric mouse (SMP-iPSCs versus Gra-iPSCs and 

B-iPSCs versus TTF-iPSCs) (Fig. 1a) to eliminate potential variability 

[Figure 3 graphic, continued. Outline labels: EPO; cytokines; IL-3/M-CSF; 4 days; 7 days; 8 days; eryPs; macrophages; mixed colonies. Bar-chart axes: EryP colonies and macrophage colonies, for lines from chimera no. 1 and chimera no. 2.]

Figure 3 iPSCs derived from different cell types have distinctive in vitro differentiation potentials. (a) Experimental 

outline. iPSCs were first differentiated into embryoid bodies. At day 6, embryoid bodies were dissociated and 

plated in conditions to favor differentiation into erythrocyte progenitors (eryP) and macrophage and mixed 

hematopoietic colonies. (b) Phase contrast images showing embryoid bodies derived from B-iPSCs, TTF-iPSCs, 

Gra-iPSCs and SMP-iPSCs at same magnification. (c) Quantification of embryoid body sizes derived from B-iPSCs, 

TTF-iPSCs, Gra-iPSCs and SMP-iPSCs; the diameter of the embryoid bodies was measured using arbitrary units 

(AU). The error bars depict the s.e.m. (n = 30) (d) Representative images of erythrocyte progenitors (eryPs), 

macrophage colonies and mixed hematopoietic colonies. (e–g) Quantification of in vitro differentiation potentials 

of the different iPSCs into EryPs (e), macrophage colonies (f) and mixed hematopoietic colonies (g). Chi no. 1, 

chimera no. 1; chi no. 2, chimera no. 2. The error bars depict the s.e.m. (n = 12). 




between different experiments and 

individual animals. All iPSC lines 

analyzed were between passage (p) 

4 and 6. There were 1,388 genes differentially 

expressed (twofold, corrected 

P = 0.05) between SMP-iPSCs 

and Gra-iPSCs, and 1,090 genes 

between B-iPSCs and TTF-iPSCs 

(Supplementary Table 2). An analysis 

of the 100 genes with the greatest 

range of expression levels across 

all samples indicated that iPSCs 

with the same cell of origin clustered 

together (Fig. 1c). Consistent 

with this observation, unsupervised 

hierarchical clustering (Fig. 

1d) as well as principal component 

analysis (Supplementary Fig. 4) 

of all genes placed SMP-iPSCs and 

Gra-iPSCs, as well as B-iPSCs and 

TTF-iPSCs, into different groups 

according to their cells of origin. 

Notably, Gene Ontology (GO) 

analysis of the 100 genes with the 

greatest range of expression between 

SMP-iPSCs and Gra-iPSCs indicated 

an enrichment for genes belonging 

to the categories ‘myofibril’ (7.6-fold

enrichment), ‘contractile fiber’

(7.3-fold enrichment) and ‘muscle 

development’ (5.9-fold enrichment) 

as well as ‘B-cell activation’ 

(6.8-fold enrichment) and ‘leukocyte 

activation’ (3.7-fold enrichment) 

(when compared with the 

expected background). Together, 

these results show that genetically 

identical iPSCs obtained from four 

different somatic cell types are distinguishable 

from each other using 

genome-wide transcriptional analyses, 

further supporting the notion 

that the donor cell type influences 

the overall gene expression pattern 

of resultant iPSCs. 
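The fold-enrichment figures quoted above compare how often a GO category appears in the selected gene list with how often it appears in the background. The following is a minimal sketch of that ratio only. The function name and all counts are hypothetical, since the text reports neither the underlying counts nor the settings of the enrichment tool.

```python
def fold_enrichment(hits_in_list, list_size, hits_in_background, background_size):
    """Observed category frequency in the gene list divided by its background frequency."""
    if min(list_size, background_size, hits_in_background) <= 0:
        raise ValueError("sizes and background hits must be positive")
    observed = hits_in_list / list_size
    expected = hits_in_background / background_size
    return observed / expected


# Hypothetical counts, chosen only for illustration (not taken from the paper):
# 5 of 100 selected genes carry the annotation, versus 130 of 20,000 background genes.
example = fold_enrichment(5, 100, 130, 20000)
print(f"{example:.1f}-fold enrichment")
```

A ratio above 1 means the category is over-represented in the list. Whether that excess is statistically meaningful requires a separate test, which this sketch does not attempt.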

To determine the effect on gene 

expression patterns of deriving 

iPSCs from different animals in 

independent experiments, we compared the expression profiles of 

Gra-iPSCs derived from chimera no. 1 (n = 3) with Gra-iPSCs from 

chimera no. 2 (n = 3) as well as with SMP-iPSCs from chimera no. 1 

and TTF-iPSCs from chimera no. 2 (Fig. 1a). Hierarchical clustering 

separated Gra-iPSCs according to their origin from different animals, 

suggesting a significant contribution of this experimental variable to 

gene expression patterns (Supplementary Fig. 5). However, when the 

expression data from TTF-iPSCs and SMP-iPSCs were included in the 

analysis, we found that differences due to cell of origin were stronger 

than those arising from variations in experimental conditions or animals. 

These data reinforce the observation that iPSCs derived from 

different somatic cell types are transcriptionally distinguishable, even 

when they originate from different animals. 
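The clustering analyses described above group samples by the similarity of their expression profiles. As an illustration only, the sketch below implements agglomerative clustering in plain Python. It assumes a distance of 1 minus the Pearson correlation and average linkage; this section does not state which metric or linkage the authors used. The sample values are toy numbers, not data from the study.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles: dict mapping sample name -> list of expression values.
    Distance between samples is 1 - Pearson correlation; distance between
    clusters is the mean of all pairwise sample distances (average linkage).
    """
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = mean(dist[frozenset((a, b))] for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy log-expression values for four hypothetical genes (not data from the paper).
toy = {
    "SMP-iPSC1": [8.1, 7.9, 2.0, 2.2],
    "SMP-iPSC2": [8.0, 8.2, 2.1, 1.9],
    "Gra-iPSC1": [2.2, 2.0, 7.8, 8.1],
    "Gra-iPSC2": [1.9, 2.1, 8.0, 7.9],
}
groups = average_linkage(toy, n_clusters=2)
print(groups)
```

With these toy profiles the two SMP-derived samples and the two granulocyte-derived samples fall into separate clusters, which is the kind of separation by cell of origin reported in the text.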




Figure 4 Continuous passaging of iPSCs abrogates transcriptional, epigenetic and functional differences. (a) Hierarchical unsupervised clustering of

expression profiles from B-iPSCs, T-iPSCs, TTF-iPSCs and Gra-iPSCs from chimera no. 2. The left panel shows clustering analysis of all iPSC samples at

passage p4, the middle panel at p10 and the right panel at p16. (b) Number of differentially expressed probes between pairs of iPSC samples used in a;

iPSCs at p4 are shown in blue bars, iPSCs at p10 in orange bars and iPSCs at p16 in red bars. The number of differentially expressed probes between

iPSCs was calculated using a pairwise analysis (twofold), with t-test P = 0.05, with Benjamini and Hochberg correction (n = 3). (c) Venn diagram and GO

analysis showing overlap of genes that change from p4 to p16 in Gra-iPSCs, TTF-iPSCs and B-iPSCs. Red line marks functional GO cluster of genes shared

between all three iPSC groups. Black line marks functional GO cluster of genes shared by at least two of the iPSC groups. Functional ontology cluster

analysis was performed using the DAVID algorithm. (d) Hierarchical unsupervised clustering using HELP genome-wide methylation profiles of B-iPSCs and

TTF-iPSCs at p16. (e–g) Quantification of in vitro differentiation potentials of B-iPSCs and TTF-iPSCs at p16 into EryPs (e), macrophage colonies (f)

and mixed hematopoietic colonies (g). The error bars depict the s.e.m. (n = 9).

[Figure 4 graphic. Panel a: dendrograms of T-iPSC1–3, TTF-iPSC1–3, Gra-iPSC1–3 and B-iPSC1–3 at p4, p10 and p16. Panel b y-axis: differentially expressed probes, for the pairs T-iPSC vs. B-iPSC, T-iPSC vs. TTF-iPSC, B-iPSC vs. TTF-iPSC, T-iPSC vs. Gra-iPSC, B-iPSC vs. Gra-iPSC and Gra-iPSC vs. TTF-iPSC. Panel c: Venn diagram of genes changing between p4 and p16 in Gra-iPSC, TTF-iPSC and B-iPSC (region counts 685, 474, 125, 56, 508, 68 and 15). Panels e–g: colony counts for B-iPSC and TTF-iPSC.]

Organ development: EGLN1, EN2, AA409316, GYS1, IQGAP2, LOXL3, MGP, NDRG1, NOPE, PHF21A, BC021588, CYB5R3, SNRPD3, NM_008681, NM_030247

Functional cluster                          Enrichment score
Tube morphogenesis                          2.75
Positive regulation of cellular process     2.09
Morphogenesis of a branching structure      2.04
Response to heat                            1.98
Organ development                           1.96
mRNA metabolic process                      1.95
Cellular component assembly                 1.79
Cartilage and skeletal development          1.75
Regulation of cell cycle                    1.69
Tissue development                          1.41
Spermatogenesis                             1.40

To exclude the possibility that the observed gene expression differences 

were due to the specific secondary system used, we derived 

iPSCs from SMPs, granulocytes, B cells and peritoneal fibroblasts 

from reprogrammable mice 31 , which carry dox-inducible copies of 

all four reprogramming factors in a defined genomic locus. All iPSC 

lines grew independently of dox and gave rise to differentiated teratomas 

(Supplementary Fig. 6a). Analysis of gene expression profiles of 

these lines at p4 showed clustering according to their cells of origin, 

with the exception of peritoneal fibroblast–derived iPSCs, which 

may be a consequence of the heterogeneity of the starting population. 

Collectively, these results corroborate the notion that iPSCs 

generated from different cell types exhibit distinct transcriptional 

patterns (Supplementary Fig. 6b). 

iPSCs derived from different cell types exhibit distinguishable 

epigenetic patterns 

We next asked whether the differential gene expression patterns we 

observed correlated with differences in epigenetic marks. To this end, we 

performed a genome-wide, restriction enzyme–based methylation analysis 

of promoters termed ‘HpaII tiny fragment enrichment by ligation-mediated

PCR’ (HELP) on the same cell lines we used for expression

analysis. Unsupervised hierarchical clustering showed that Gra-iPSCs 


[Figure 5 graphic: Reprogramming (transgene-dependent phase); Reprogramming (transgene-independent phase).]

two genes. A similar pattern was observed for the 

granulocyte-specific genes in Gra-iPSCs compared 

with SMP-iPSCs, with Gr-1 and Lysozyme being 

elevated for H3K4me3 (Fig. 2d). These data show 

that the observed expression differences among 

iPSCs derived from different cell types may be predominantly 

the consequence of differences in histone 

marks, further suggesting that iPSCs retain an 

epigenetic memory of their cells of origin. 
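The ChIP results in Figure 2d are reported as percent input. The section does not describe how that value was computed, so the sketch below shows one common qPCR-based convention as an assumption, not the authors' procedure. It takes threshold cycle (Ct) values, assumes perfect doubling per cycle, and corrects for the fraction of chromatin set aside as input. The function name, parameters and Ct values are hypothetical.

```python
from math import log2


def percent_input(ct_ip, ct_input, input_fraction):
    """Percent of input recovered in a ChIP sample, from qPCR Ct values.

    input_fraction is the share of chromatin saved as input (0.01 for 1%).
    Assumes perfect doubling per cycle (100% amplification efficiency).
    """
    if not 0 < input_fraction <= 1:
        raise ValueError("input_fraction must be in (0, 1]")
    # Shift the input Ct to what it would be had all chromatin been used.
    adjusted_input_ct = ct_input + log2(input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)


# Hypothetical Ct values for illustration only.
value = percent_input(ct_ip=30.0, ct_input=25.0, input_fraction=0.01)
print(f"{value:.4f}% of input")
```

Comparing this value for a specific antibody against the IgG control at the same locus gives the enrichment plotted in bar charts of this kind.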


[Figure 5 graphic. Cell of origin. Partially reprogrammed cells: no endogenous pluripotent gene expression; no contribution to chimeras; teratoma formation. Early passage iPSC: activation of endogenous pluripotency genes; promoter demethylation; chimera contribution; transcriptionally distinguishable; transient epigenetic memory; altered differentiation.]

Continuous passaging of iPSCs abrogates transcriptional, 

epigenetic and functional differences 

Previously published data suggest that early-passage, human iPSCs 

derived from fibroblasts are transcriptionally distinct from late-passage 

iPSCs 32 . However, that study did not examine the effect of passaging on 

iPSC functionality. We therefore wondered whether continuous passaging

of the various iPSC lines would eliminate the observed differences 

in gene expression and differentiation potential. For this analysis, we 

added to the B-iPSC/TTF-iPSC group, studied before (Figs. 1 and 2a,b), a 

new set of T cell– and granulocyte-derived iPSCs, which were all derived 

from chimera no. 2. These 12 iPSC lines were subjected to several additional 

rounds of passaging under identical culture conditions, and RNA 

was harvested at p10 and p16 for expression profiling. Whereas unsupervised 

hierarchical clustering of these cell lines at early passage (p4) 

clearly separated each of the different iPSC lines according to their cells 

of origin (Fig. 4a, left panel), unsupervised clustering of these lines at p10 

showed that B-iPSCs, TTF-iPSCs and T-iPSCs were indistinguishable 

from each other, whereas the Gra-iPSCs still clustered together (Fig. 4a, 

middle panel). Further passaging of these cells until p16 entirely eliminated 

these differences (Fig. 4a, right panel). Together, these data indi-

and SMP-iPSCs, as well as B-iPSCs and TTF-iPSCs, which clustered

separately in the transcriptional assays, were also distinguishable based 

on their methylation patterns (Fig. 2a). Correspondence analysis of the 

same samples corroborated this finding (Fig. 2b), indicating that the 

donor cell type affects not only the overall transcriptional pattern but 

also the promoter methylation pattern of resultant iPSCs. 

Despite the separation of Gra-iPSCs from SMP-iPSCs and of 

TTF-iPSCs from B-iPSCs (Fig. 2a,b) by hierarchical clustering, we 

detected few loci that were differentially methylated with statistical 

significance using supervised analysis (69 genes between Gra-iPSCs

and SMP-iPSCs and 0 genes between B-iPSCs and TTF-iPSCs; 

Supplementary Table 3). To complement these results, we interrogated 

the DNA methylation status at the promoter regions of the 

previously analyzed markers Cxcr4, Itgb1, Lysozyme and Gr-1 (Fig. 

1b) using EpiTYPER DNA methylation analysis, which quantifies 

gene-specific CpG methylation. We failed to detect differences in the 

methylation levels of these candidate genes between SMP-iPSCs and 

Gra-iPSCs (Fig. 2c), further indicating that methylation differences 

are more subtle than the observed gene expression differences and 

raising the possibility that other chromatin marks may be responsible 

for the observed expression differences. 

Indeed, we observed high levels of the activating marks H3Ac and 

H3K4me3 and low levels of the repressive marks H3K27me3 at the promoters 

of Cxcr4 and Itgb1 in SMPs and at the promoters of Lysozyme and 

Gr-1 in granulocytes, respectively, consistent with their abundant expression 

in these cell types (Fig. 2d). Notably, SMP-iPSCs, which showed 

higher expression levels of Cxcr4 and Itgb1 than did Gra-iPSCs (Fig. 

1b), were enriched for H3K4me3 compared with Gra-iPSCs at these 

iPSCs derived from different cell types have 

distinctive in vitro differentiation potentials 

Because the gene expression differences we observed 

among different iPSC lines affected genes known to 

be involved in the lineage-specific differentiation 

and function of the somatic cell types from which 

they were derived, we reasoned that these differences 

might affect their capacity to differentiate 

into defined cell lineages. Thus, we evaluated the 

autonomous differentiation potential of the four 

types of iPSC lines by assessing their abilities to 

produce embryoid bodies, erythrocyte progenitors, 

macrophages and mixed hematopoietic colonies 

using established semiquantitative differentiation 

protocols (Fig. 3a). Most notably, TTF-iPSCs produced 

significantly smaller and fewer embryoid 

bodies compared with all the other iPSC lines (P 

< 0.001; Fig. 3b,c). Moreover, the embryoid bodies 

derived from TTF-iPSCs generated relatively

few erythrocyte, macrophage and mixed colony 

progenitors compared with B-iPSCs derived from 

the same animal despite equal numbers of input 

cells, indicating striking differences in the differentiation 

potentials of these iPSCs (Fig. 3d–g). In contrast, SMP-iPSCs 

and Gra-iPSCs showed equivalent abilities to produce embryoid bodies 

(Fig. 3d–g). However, Gra-iPSCs gave rise to erythrocyte, macrophage

and mixed colonies at higher efficiencies than SMP-iPSCs, suggesting 

a pattern of differentiation that reflects their cells of origin. Together, 

these data show that the cell type of origin may bias the differentiation 

potential of resultant iPSC lines. 
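The embryoid body comparison above rests on group means with s.e.m. error bars and a significance test. The sketch below shows how a mean, an s.e.m. and a two-sample t statistic can be computed. It assumes Welch's form of the t statistic, because this section does not name the test the authors applied, and it stops short of a P value. The diameters are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(sem(a) ** 2 + sem(b) ** 2)


# Hypothetical embryoid body diameters in arbitrary units (not measured data).
b_ipsc = [1500, 1620, 1580, 1710, 1490]
ttf_ipsc = [610, 700, 655, 590, 720]

for label, data in (("B-iPSC", b_ipsc), ("TTF-iPSC", ttf_ipsc)):
    print(f"{label}: {mean(data):.0f} +/- {sem(data):.0f} AU (n = {len(data)})")
print(f"Welch t = {welch_t(b_ipsc, ttf_ipsc):.1f}")
```

Converting the t statistic into a P value requires the t distribution with the Welch-Satterthwaite degrees of freedom, which a statistics package would normally supply.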

[Figure 5 graphic, continued. Late passage iPSC: activation of endogenous pluripotency genes; promoter demethylation; chimera contribution.]

Figure 5 Model summarizing the presented data. iPSCs derived from different somatic cell 

types retain a transient epigenetic and transcriptional memory of their cell type of origin at early 

passage, despite acquiring pluripotent gene expression, transgene-independent growth and the 

ability to contribute to tissues in chimeras. Continuous passaging resolves these differences, 

giving rise to iPSCs that are molecularly and functionally indistinguishable. Note the difference 

between early passage iPSCs and partially reprogrammed cells, which require continuous 

viral transgene expression and fail to activate endogenous pluripotency genes or support the 

development of viable mice. 




cate that continuous cell division resolves transcriptional differences 

among iPSC lines. Consistent with this observation, the total number 

of differentially expressed genes between various pairs of iPSC lines 

derived from different cellular origins was reduced from ~500–2,000 

in early-passage cultures to only ~50 or even 0 in late-passage cultures, 

further demonstrating that after extensive in vitro propagation, these 

iPSC lines have become very similar to each other (Fig. 4b). 
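The counts of differentially expressed genes cited here come from a pairwise comparison with a twofold cutoff and a t-test P value corrected by the Benjamini and Hochberg method (Fig. 4 legend). The sketch below illustrates that filtering step under stated assumptions: raw P values are taken as given, the cutoff is applied as adjusted P of at most 0.05, and all function names and numbers are hypothetical.

```python
from math import log2


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differentially_expressed(fold_changes, p_values, min_fold=2.0, alpha=0.05):
    """Indices of probes changing at least min_fold (up or down) with adjusted P <= alpha."""
    adjusted = benjamini_hochberg(p_values)
    cutoff = log2(min_fold)
    return [
        i
        for i, (fc, q) in enumerate(zip(fold_changes, adjusted))
        if abs(log2(fc)) >= cutoff and q <= alpha
    ]


# Hypothetical fold changes (line A / line B) and raw t-test p-values for five probes.
fold_changes = [2.8, 0.25, 8.0, 1.1, 5.7]
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
hits = differentially_expressed(fold_changes, raw_p)
print(hits)
```

In this toy example the third probe changes eightfold but is excluded, because its raw P value of 0.039 rises above 0.05 after correction. This shows why corrected and uncorrected counts can differ substantially.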

Analysis of the genes whose expression changed between p4 and p16 

in Gra-iPSCs, B-iPSCs and TTF-iPSCs showed 25% overlap with at least 

one of the other two groups of iPSC lines, suggesting that iPSCs undergo 

some common changes during passaging, irrespective of their cell of origin 

(Fig. 4c). GO analysis of these changes indicated a strong enrichment 

for developmental regulators. Moreover, the only GO cluster common to 

all three groups was ‘organ development’, indicating that the passaging of 

iPSCs results in a change of differentiation-associated gene expression 

patterns (Fig. 4c). The expression levels of the pluripotency genes Sox2 

and Oct4, which are high already at early passage (Supplementary Table 

1), increased even further during the passaging process, supporting the 

notion that the pluripotency network becomes increasingly solidified 

during culture (Supplementary Fig. 7), consistent with a previous report 

showing gradual upregulation of pluripotency-associated genes upon 

passaging of human iPSC lines 32 . 
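The overlap analysis summarized above (Fig. 4c) amounts to set operations on the lists of genes that change between p4 and p16 in each group. A minimal sketch follows, with hypothetical gene identifiers. The exact definition behind the reported 25% is not given in this section, so the fraction computed here is one plausible reading.

```python
def shared_fraction(changed, others):
    """Fraction of genes in `changed` that also changed in at least one other group."""
    if not changed:
        return 0.0
    in_any_other = set().union(*others)
    return len(changed & in_any_other) / len(changed)


# Hypothetical gene identifiers for illustration only.
gra = {"g1", "g2", "g3", "g4"}
ttf = {"g2", "g5", "g6"}
b = {"g2", "g4", "g7"}

print(f"Gra-iPSC: {shared_fraction(gra, [ttf, b]):.0%} shared")
common_to_all = gra & ttf & b
print(sorted(common_to_all))
```

The three-way intersection corresponds to the central region of the Venn diagram, from which a cluster shared by all groups, such as 'organ development', would be drawn.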

To evaluate whether the passaging of iPSCs attenuates the observed 

epigenetic differences, we performed HELP analysis on B-iPSCs and 

TTF-iPSCs at late passage. In contrast to early-passage iPSCs, the late-passage

iPSCs could not be separated by hierarchical unsupervised

clustering analysis based on their cells of origin (Fig. 4d). Accordingly, 

the methylation levels of histones at candidate genes in Gra-iPSCs and 

SMP-iPSCs became indistinguishable (Supplementary Fig. 8). Notably, 

several of the analyzed loci showed an enrichment for both H3K4me3 

and H3K27me3, indicative of bivalent domains that are characteristic of 

pluripotent stem cells 33 . Thus, continuous passaging leads to an equilibration 

of the epigenetic differences detected in early-passage iPSCs. 

Two possible mechanisms could account for the observed loss of 

epigenetic and transcriptional memory with increased passage number: 

(i) passive replication-dependent loss of somatic marks in the majority 

of iPSCs and (ii) selection of rare, preexisting, fully reprogrammed cells 

over time. Because the selection model predicts that such rare clones 

would have a growth or survival advantage, we would expect to see 

impaired growth rates of bulk iPSC cultures at early passage compared 

with late passage, which we did not observe (Supplementary Fig. 9a). 

We also did not detect significant differences when the growth rates of 

single-cell clones established from early and late passage iPSC lines were 

examined using a colorimetric assay (XTT assay) that detects metabolic 

activity (Supplementary Fig. 10) or by measuring the increase 

in cell numbers on three consecutive days (Supplementary Figs. 11 

and 12). Similarly, an analysis of the colony formation efficiency of 

single-cell-sorted iPSCs from early- and late-passage cultures did not

yield detectable differences (Supplementary Fig. 9b). Collectively, these 

data argue against the presence of rare subclones that become selected 

over time and are consistent with the notion that all iPSC lines gradually 

resolve transcriptional and epigenetic differences with increased 

passaging. However, our results do not exclude a combined model 

involving passive resolution of epigenetic marks as well as selection 

of multiple clones. 

Finally, we asked whether the similar transcriptional and epigenetic 

patterns of late-passage iPSCs derived from distinct cells of origin would 

translate into an equalization of their differentiation potentials. We first 

performed an embryoid-body formation assay at different passages for 

TTF-iPSCs and B-iPSCs, which showed a strong difference at early passage. 

TTF-iPSCs gave rise to embryoid bodies of similar size to those of B-iPSCs

around p10–p12 (Supplementary Fig. 13a,b) and were indistinguishable 

at p16 (Supplementary Fig. 13c,d). Moreover, embryoid bodies derived 

from TTF-iPSCs and B-iPSCs at p16 differentiated into similar numbers 

of erythrocyte (Fig. 4e), macrophage (Fig. 4f) and mixed-colony progenitors 

(Fig. 4g), thus proving that extensive cellular passaging eliminates 

differences in the differentiation potentials of these iPSCs. 

DISCUSSION 

Our study shows that genetically matched iPSCs retain a transient transcriptional 

and epigenetic memory of their cell of origin at early passage, 

which can substantially affect their potential to differentiate into 

embryoid bodies and different hematopoietic cell types (Fig. 5). These 

molecular and functional differences are lost upon continuous passaging, 

however, indicating that complete reprogramming is a gradual 

process that continues beyond the acquisition of a bona fide iPSC state 

as measured by the activation of endogenous pluripotency genes, viral 

transgene–independent growth and the ability to differentiate into 

cell types of all three germ layers. Notably, the previously seen silencing 

of the Dlk1-Dio3 locus in many iPSC lines 22 is not affected by the 

passaging of cells (data not shown). Of note, the early-passage iPSCs 

described here are different from “partially reprogrammed iPSCs” 34,35 , 

which depend on the continuous expression of viral transgenes and do 

not activate and demethylate pluripotency genes or contribute to the 

formation of viable chimeras (Fig. 5). 

The mechanism by which passaging eliminates the molecular and 

functional differences between iPSCs of different origins remains to 

be determined. Three key observations argue against the possibility 

of selective expansion of a rare subset of completely reprogrammed 

iPSCs: (i) both early- and late-passage iPSCs had similar proliferation 

rates; (ii) there was little variability in the growth rate of single-cell 

iPSC clones from early- and late-passage lines; and (iii) the number 

of passages required to resolve cell-of-origin differences was dependent 

upon the starting cell type. These observations suggest that the 

consolidation of the pluripotent transcriptional network upon passaging 

is a slow process, potentially facilitated by a positive feedback 

mechanism that gradually resolves the residual cell-of-origin–specific 

epigenetic marks and transcriptional patterns. In accordance with this 

idea is the finding that telomeres become gradually elongated with 

increased passage number of iPSCs 36 . Our results are also consistent 

with the previous observation that cloned embryos often retain donor 

cell–specific transcriptional patterns and do not efficiently activate 

embryonic genes over many cell divisions 37–40 , suggesting possible 

similarities in the mechanisms of reprogramming by nuclear transfer 

and induced pluripotency. 

Because of the lack of ESC lines genetically matched to the secondary 

iPSC lines used here, we did not include ESC lines in our 

comparative analysis. Nevertheless, the present results may help to 

explain some of the previously reported differences between ESCs 

and iPSCs 41,42 . Some of these studies compared late-passage ESC lines 

with iPSC lines of undefined, but presumably earlier, passage that may 

not yet have reached an ESC-equivalent ground state. It should be 

informative to revisit these studies with genetically matched, transgene-free 

late-passage iPSCs to determine whether this abrogates such 

gene expression and differentiation differences. 

The observed tendency of early-passage iPSC lines to differentiate preferentially 

into the cell lineage of origin could potentially be exploited 

in clinical settings to produce certain somatic cell types that have been 

difficult to obtain from ESCs thus far. However, these data also serve as a 

cautionary note for ongoing attempts to recapitulate disease phenotypes 

in vitro using patient-specific, early-passage iPSC lines, as the epigenetic, 

transcriptional and functional ‘immaturity’ of these cells might confound 




the data obtained from them. Further elucidation of the molecular indicators 

of fully reprogrammed iPSCs should help in the establishment of 

standardized iPSC lines that can be compared with confidence in basic 

biological and drug discovery studies. 

METHODS 



Accession code. GEO: GSE22043, GSE22827, GSE22908. 



We thank N. Maherali and R. Walsh for helpful suggestions and critical reading of 

the manuscript, B. Wittner for statistical advice, J. LaVecchio, G. Buruzula, 

K. Folz-Donahue and L. Prickett for expert cell sorting and K. Coser for technical 

assistance. J.M.P. was supported by an MGH ECOR fellowship, E.A. by a Jane 

Coffin Childs fellowship, M.S. by a Schering fellowship and K.Y.T. by the Agency 

of Science, Technology and Research Singapore. Support to A.M. was from the 

Lymphoma Society, SCOR no. 7132-08; to T.E. from National Institutes of Health 

(NIH) grant HL056182 and NYSTEM; to A.J.W. in part from the Burroughs 

Wellcome Fund, Harvard Stem Cell Institute, Peabody Foundation, and NIH 1 

DP2 OD004345-01, and the Joslin Diabetes Center DERC (P30DK036836); to 

K.H. from Howard Hughes Medical Institute, the NIH Director’s Innovator Award 

and the Harvard Stem Cell Institute. The content is solely the responsibility of the 

authors and does not necessarily represent the official views of the NIH. 


J.M.P. and K.H. conceived the study, interpreted results and wrote the manuscript; 

J.M.P. performed most of the experiments with help from W.K.; S.L. and T.E. 

performed and interpreted in vitro differentiation assays; M.E.F. and A.M.

performed and analyzed HELP methylation experiments; K.Y.T. and A.J.W. isolated 

SMPs and derived most SMP-iPSCs; T.S. and S.N. performed expression arrays; 

and S.E., E.A. and M.S. provided essential study material. All authors gave critical 

input to the manuscript draft. 







Note added in proof: We thank George Daley for sharing unpublished results, 

which show similar differences in DNA methylation patterns and differentiation 

propensity of iPSCs derived from distinctive cell types. Of note, this report 43 also 

suggests that somatic cell nuclear transfer more faithfully reprograms cells to a 

pluripotent state than transcription factor overexpression. 

1. Aoi, T. et al. Generation of pluripotent stem cells from adult mouse liver and stomach 

cells. Science 321, 699–702 (2008). 

2. Eminli, S. et al. Differentiation stage determines potential of hematopoietic cells for 

reprogramming into induced pluripotent stem cells. Nat. Genet. 41, 968–976 (2009). 

3. Eminli, S., Utikal, J., Arnold, K., Jaenisch, R. & Hochedlinger, K. Reprogramming of 

neural progenitor cells into induced pluripotent stem cells in the absence of exogenous 

Sox2 expression. Stem Cells 26, 2467–2474 (2008). 

4. Hanna, J. et al. Direct reprogramming of terminally differentiated mature B lymphocytes 

to pluripotency. Cell 133, 250–264 (2008). 

5. Lowry, W.E. et al. Generation of human induced pluripotent stem cells from dermal 

fibroblasts. Proc. Natl. Acad. Sci. USA 105, 2883–2888 (2008). 

6. Park, I.H. et al. Reprogramming of human somatic cells to pluripotency with defined 

factors. Nature 451, 141–146 (2008). 

7. Stadtfeld, M., Brennand, K. & Hochedlinger, K. Reprogramming of pancreatic beta 

cells into induced pluripotent stem cells. Curr. Biol. 18, 890–894 (2008). 

8. Takahashi, K. et al. Induction of pluripotent stem cells from adult human fibroblasts 

by defined factors. Cell 131, 861–872 (2007). 

9. Takahashi, K. & Yamanaka, S. Induction of pluripotent stem cells from mouse embryonic 

and adult fibroblast cultures by defined factors. Cell 126, 663–676 (2006). 

10. Yu, J. et al. Induced pluripotent stem cell lines derived from human somatic cells. 

Science 318, 1917–1920 (2007). 

11. Loh, Y.H. et al. Generation of induced pluripotent stem cells from human blood. Blood 

113, 5476–5479 (2009). 

12. Aasen, T. et al. Efficient and rapid generation of induced pluripotent stem cells from 

human keratinocytes. Nat. Biotechnol. 26, 1276–1284 (2008). 

13. Maherali, N. et al. A high-efficiency system for the generation and study of human 

induced pluripotent stem cells. Cell Stem Cell 3, 340–345 (2008). 

14. Utikal, J., Maherali, N., Kulalert, W. & Hochedlinger, K. Sox2 is dispensable for the 

reprogramming of melanocytes and melanoma cells into induced pluripotent stem 

cells. J. Cell Sci. 122, 3502–3510 (2009). 

15. Kim, J.B. et al. Pluripotent stem cells induced from adult neural stem cells by reprogramming 

with two factors. Nature 454, 646–650 (2008). 

16. Shi, Y. et al. A combined chemical and genetic approach for the generation of induced 

pluripotent stem cells. Cell Stem Cell 2, 525–528 (2008). 

17. Silva, J. et al. Promotion of reprogramming to ground state pluripotency by signal 

inhibition. PLoS Biol. 6, e253 (2008). 

18. Miura, K. et al. Variation in the safety of induced pluripotent stem cell lines. Nat. 

Biotechnol. 27, 743–745 (2009). 

19. Ghosh, Z. et al. Persistent donor cell gene expression among human induced pluripotent 

stem cells contributes to differences with human embryonic stem cells. PLoS 

One 5, e8975 (2010). 

20. Soldner, F. et al. Parkinson’s disease patient-derived induced pluripotent stem cells 

free of viral reprogramming factors. Cell 136, 964–977 (2009). 

21. Okita, K., Ichisaka, T. & Yamanaka, S. Generation of germline-competent induced 

pluripotent stem cells. Nature 448, 313–317 (2007). 

22. Stadtfeld, M. et al. Aberrant silencing of imprinted genes on chromosome 12qF1 in 

mouse induced pluripotent stem cells. Nature 465, 175–181 (2010). 

23. Dimos, J.T. et al. Induced pluripotent stem cells generated from patients with ALS 

can be differentiated into motor neurons. Science 321, 1218–1221 (2008). 

24. Ebert, A.D. et al. Induced pluripotent stem cells from a spinal muscular atrophy 

patient. Nature 457, 277–280 (2009). 

25. Park, I.H. et al. Disease-specific induced pluripotent stem cells. Cell 134, 877–886 

(2008). 

26. Saha, K. & Jaenisch, R. Technical challenges in using human induced pluripotent 

stem cells to model disease. Cell Stem Cell 5, 584–595 (2009). 

27. Lee, G. et al. Modelling pathogenesis and treatment of familial dysautonomia using 

patient-specific iPSCs. Nature 461, 402–406 (2009). 

28. Wernig, M. et al. A drug-inducible transgenic system for direct reprogramming of 

multiple somatic cell types. Nat. Biotechnol. 26, 916–924 (2008). 

29. Stadtfeld, M., Maherali, N., Breault, D.T. & Hochedlinger, K. Defining molecular 

cornerstones during fibroblast to iPS cell reprogramming in mouse. Cell Stem Cell 2, 

230–240 (2008). 

30. Cerletti, M. et al. Highly efficient, functional engraftment of skeletal muscle stem 

cells in dystrophic muscles. Cell 134, 37–47 (2008). 

31. Stadtfeld, M., Maherali, N., Borkent, M. & Hochedlinger, K. A reprogrammable mouse 

strain from gene-targeted embryonic stem cells. Nat. Methods 7, 53–55 (2010). 

32. Chin, M.H. et al. Induced pluripotent stem cells and embryonic stem cells are distinguished 

by gene expression signatures. Cell Stem Cell 5, 111–123 (2009). 

33. Bernstein, B.E. et al. A bivalent chromatin structure marks key developmental genes 

in embryonic stem cells. Cell 125, 315–326 (2006). 

34. Mikkelsen, T.S. et al. Dissecting direct reprogramming through integrative genomic 

analysis. Nature 454, 49–55 (2008). 

35. Sridharan, R. et al. Role of the murine reprogramming factors in the induction of 

pluripotency. Cell 136, 364–377 (2009). 

36. Marion, R.M. et al. Telomeres acquire embryonic stem cell characteristics in induced 

pluripotent stem cells. Cell Stem Cell 4, 141–154 (2009). 

37. Boiani, M., Eckardt, S., Scholer, H.R. & McLaughlin, K.J. Oct4 distribution and 

level in mouse clones: consequences for pluripotency. Genes Dev. 16, 1209–1219 

(2002). 

38. Bortvin, A. et al. Incomplete reactivation of Oct4-related genes in mouse embryos 

cloned from somatic nuclei. Development 130, 1673–1680 (2003). 

39. Ng, R.K. & Gurdon, J.B. Epigenetic memory of active gene transcription is inherited 

through somatic cell nuclear transfer. Proc. Natl. Acad. Sci. USA 102, 1957–1962 

(2005). 

40. Ng, R.K. & Gurdon, J.B. Epigenetic memory of an active gene state depends on histone 

H3.3 incorporation into chromatin in the absence of transcription. Nat. Cell Biol. 10, 

102–109 (2008). 

41. Feng, Q. et al. Hemangioblastic derivatives from human induced pluripotent stem 

cells exhibit limited expansion and early senescence. Stem Cells 28, 704–712 

(2010). 

42. Hu, B.Y. et al. Neural differentiation of human induced pluripotent stem cells follows 

developmental principles but with variable potency. Proc. Natl. Acad. Sci. USA 107, 

4335–4340 (2010). 

43. Kim, K. et al. Epigenetic memory in induced pluripotent stem cells. Nature 

doi:10.1038/nature09342 (19 July 2010). 




Generation of iPSC lines. iPSC lines were generated as described previously 

2 . Briefly, iPSC-derived somatic cells were isolated from chimeras 

by fluorescence-activated cell sorting (FACS) and plated on feeders in the

presence of cytokines in ESC culture conditions. Resultant iPSC colonies 

were picked and expanded in the absence of doxycycline and used for 

subsequent analyses. 

SMP isolation. Myofiber-associated cells were prepared from intact 

limb muscles (extensor digitorum longus, gastrocnemius, quadriceps, 

soleus, transversus abdominis and triceps brachii) as described

previously 44,45 . Briefly, intact mouse limb muscles were digested 

with collagenase II to dissociate individual myofibers. These were 

triturated and digested with collagenase II and dispase to release 

myofiber-associated cells. The myofiber-associated cells were next 

fractionated by FACS, using the following marker profiles for each

population: (i) SMPs: CD45− Sca-1− Mac-1− CXCR4+ β1-integrin+; (ii)

myoblast-containing population: CD45− Sca-1− Mac-1− CXCR4−; (iii)

Sca-1+ mesenchymal cells: CD45− Sca-1+ Mac-1−. After the initial sort, cells

were resorted by FACS using the same gating profile to increase the 

purity of the obtained population 46 . 
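The three gates above can be written as a small decision function. This is an illustrative sketch only: marker states are reduced to positive/negative booleans, and the function name and labels are hypothetical rather than part of the published sorting protocol.

```python
def classify_cell(cd45, sca1, mac1, cxcr4, beta1_integrin):
    """Assign a cell to one of the sorted populations from boolean marker calls.

    Each argument is True if the cell is positive for that marker.
    Returns a label, or None if the cell falls outside all three gates.
    """
    # All three populations exclude CD45+ and Mac-1+ cells.
    if cd45 or mac1:
        return None
    if sca1:
        return "Sca-1+ mesenchymal"      # CD45- Sca-1+ Mac-1-
    if cxcr4:
        # SMPs: CD45- Sca-1- Mac-1- CXCR4+ beta1-integrin+
        return "SMP" if beta1_integrin else None
    return "myoblast-containing"         # CD45- Sca-1- Mac-1- CXCR4-
```

In practice the positive/negative calls come from fluorescence thresholds set on the cytometer, which this sketch does not model.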

Blastocyst injections. For blastocyst injections, female BDF1 mice were 

superovulated by intraperitoneal injection of PMS and hCG and mated 

to BDF1 stud males. Zygotes were isolated from females with a vaginal 

plug 24 h after hCG injection. Zygotes for 2n injections were cultured 

for 3 d in vitro in KSOM medium; blastocysts were then identified, injected with

ESCs or iPSCs and transferred into pseudopregnant recipient females. 

Teratoma formation. iPSCs were harvested by trypsinization, preplated 

onto untreated culture plates to remove feeders as well as differentiating 

cells and injected into flanks of nonobese diabetic/severe combined 

immunodeficient (NOD/SCID) mice, using ~5 million cells per injection.

The mice were euthanized 3–5 weeks after injection, and teratomas were dissected

out and processed for histological analysis.

Cellular growth assays. To measure the clonal growth potential of iPSCs, 

SSEA1-positive cells from the different iPSC lines were sorted into 

96-well plates by FACS (BD). After 7 d, the presence of iPSC colonies 

was scored based on morphology. To establish growth rates, the different 

bulk iPSC lines or derivative subclones were plated in six gelatinized

wells of a 12-well plate, and each day the number of cells was counted

in duplicate using a Countess cell counter (Invitrogen). For colorimetric

measurement of growth, iPSC lines were subcloned into 96-well plates

and after 7 d, the cells were exposed to XTT (TOX-2) (Sigma) reagent 

overnight and the absorbance at 450 nm measured with a multiwell plate 

reader (Molecular Devices). 
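One way to turn the daily duplicate counts into a growth rate is a least-squares fit of ln(cell number) against time. The sketch below assumes exponential growth over the counting window and averages the duplicates. Neither assumption is stated in the methods, and the function names are illustrative.

```python
import math

def growth_rate(days, duplicate_counts):
    """Least-squares slope of ln(mean count) versus day, in units of 1/day."""
    ys = [math.log(sum(pair) / len(pair)) for pair in duplicate_counts]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx

def doubling_time(rate_per_day):
    """Doubling time in days for an exponential growth rate."""
    return math.log(2) / rate_per_day
```

A culture whose mean count doubles every day gives a rate of ln 2 per day and a doubling time of 1 d.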

Cell culture. ESCs and iPSCs were cultured in ESC medium (DMEM 

with 15% FBS, l-glutamine, penicillin-streptomycin, nonessential amino

acids, β-mercaptoethanol and 1,000 U/ml leukemia inhibitory factor) on

irradiated feeder cells. TTF cultures were established by trypsin digestion 

of tail-tip biopsies taken from newborn (3–8 d of age) chimeric mice 

produced by blastocyst injection of iPSCs. 

RNA isolation. ESCs and iPSCs grown on 35-mm dishes were harvested 

when they reached about 50% confluency and preplated on 

nongelatinized T25 flasks for 45 min to remove feeder cells. Cells 

were spun down and the pellet used for isolation of total RNA using 

the miRNeasy Mini Kit (Qiagen) without DNase digestion. RNA was 

eluted from the columns using 50 µl RNase-free water or TE buffer,

pH 7.5 (10 mM Tris-HCl and 0.1 mM EDTA) and quantified using a

Nanodrop (Nanodrop Technologies). 

Quantitative PCR. cDNA was produced with the First Strand cDNA 

Synthesis Kit (Roche) using 1 µg of total RNA input. Real-time quantitative

PCR reactions were set up in triplicate using 5 µl of cDNA

(1:100 dilution) with the Brilliant II SYBR Green QPCR Master Mix 

(Stratagene) and run on a Mx3000P QPCR System (Stratagene). Primer 

sequences are listed in Supplementary Table 4. 

mRNA profiling. Total RNA samples (RIN (RNA integrity number) > 9) 

were subjected to transcriptome analysis using the Affymetrix HT MG-430A

mRNA expression microarray as previously described. 

Statistical analyses. Hierarchical clustering was performed using the 

GeneSifter software (Geospiza), with correlation distance as the metric and

Ward’s method for clustering. The differentially expressed genes

(twofold) were identified using a t-test (P = 0.05) with Benjamini and

Hochberg correction. Principal component analysis was performed using 

the GeneSifter software. Gene ontology analysis was performed using the 

DAVID software 47 , with the classification stringency set to ‘high’. 
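The Benjamini and Hochberg correction mentioned above can be sketched in a few lines. The example assumes that the 0.05 cutoff is applied to the adjusted P values and that "twofold" covers changes in either direction. Both are readings of the text, not details confirmed by it, and GeneSifter's own implementation may differ.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def differentially_expressed(fold_changes, p_values, min_fold=2.0, alpha=0.05):
    """Indices of genes passing both the fold-change and adjusted-P cutoffs."""
    adjusted = benjamini_hochberg(p_values)
    return [
        i
        for i, (fc, q) in enumerate(zip(fold_changes, adjusted))
        if (fc >= min_fold or fc <= 1.0 / min_fold) and q <= alpha
    ]
```

Fold changes here are plain ratios (for example, 0.4 means a 2.5-fold decrease), so the down-regulation cutoff is the reciprocal of `min_fold`.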

Embryoid body formation. Before plating embryoid bodies, the iPSCs 

were depleted of mouse embryonic fibroblasts by splitting the cells 1:3 

onto gelatin-coated plates on each day, for 2 consecutive days. On the 3rd 

day (designated day 0), iPSCs were trypsinized and plated at a density of 

5,000 cells/ml in Iscove’s Modified Dulbecco’s Medium (IMDM) with

15% FCS (Atlanta Biologicals), 10% protein-free hybridoma medium 

(PFHM-II; Gibco), 2 mM l-glutamine (Gibco), 200 µg/ml transferrin 

(Roche), 0.5 mM ascorbic acid (Sigma) and 4.5 × 10^−4 M monothioglycerol

(MTG; Sigma). Differentiation was carried out in 60-mm ethylene 

oxide–treated Petri grade dishes (Parter Medical). The embryoid bodies 

were left to differentiate until day 6, when the cells were harvested to 

assay for hematopoietic colonies. 

Hematopoietic colony formation assays. Day 6 embryoid bodies were 

collected by gravity, dissociated with trypsin and then passed several 

times through a 20 gauge needle to ensure dissociation. For the growth 

of hematopoietic progenitors, the cells were then seeded at a density 

of 100,000 cells/ml in IMDM containing 1% methylcellulose (Fluka 

Biochemika), 15% plasma-derived serum (PDS; Animal Technologies), 

5% PFHM-II and specific cytokines as follows: primitive erythrocytes 

(erythropoietin (EPO, 2 U/ml)); macrophages (IL-3 (10 ng/ml), M-CSF

(5 ng/ml)); megakaryocytes (IL-3 (10 ng/ml), IL-11 (5 ng/ml), thrombopoietin

(TPO, 5 ng/ml)); mixed colonies (SCF (5 ng/ml), IL-3 (10 ng/ml),

G-CSF (30 ng/ml), GM-CSF (10 ng/ml), IL-11 (5 ng/ml), IL-6 (5 ng/ml),

TPO (5 ng/ml) and M-CSF (5 ng/ml)). All cytokines were purchased

from R&D Systems. Primitive erythrocyte colonies (eryPs) were counted

on day 10 (4 d after embryoid body harvest). Macrophage colonies were 

counted on day 13 (7 d after embryoid body harvest). Mixed colonies 

were counted on day 14 (8 d after embryoid body harvest) and consist of 

a layer of macrophages, a layer of granulocytes, and a central core of red 

erythroid cells. Statistical analysis was performed using the Krward software. 

P values were calculated using the nonparametric Wilcoxon test.

HELP DNA methylation analysis. High molecular weight DNA 

was isolated from iPSCs using the PureGene kit from Qiagen and 

the HELP (HpaII tiny fragment enrichment by ligation-mediated 

PCR) assay was carried out as previously described 1,2 . Briefly, 1 µg 

of genomic DNA was digested overnight with either HpaII or MspI 

(New England Biolabs). On the following day, the reactions were 

extracted once with phenol-chloroform and resuspended in 11 µl of 

10 mM Tris-HCl pH 8.0 and the digested DNA was used to set up an 

overnight ligation of the JHpaII adaptor using T4 DNA ligase. The 

adaptor-ligated DNA was used to carry out the PCR amplification 

of the HpaII- and MspI-digested DNA as previously described 48 . 

All samples for microarray hybridization were processed at the 

Roche-NimbleGen Service Laboratory. Samples were labeled using 

Cy-labeled random primers (9 mers) and then hybridized onto a 

mouse custom-designed oligonucleotide array (50-mers) covering 

25,720 HpaII amplifiable fragments (HAF) (>50,000 CpGs), annotated 

to 15,465 unique gene symbols (Roche NimbleGen, Design 

name: 2006-10-26_MM5_HELP_Promoter Design ID = 4803). 

HpaII-amplifiable fragments are defined as genomic sequences contained 

between two flanking HpaII sites found within 200–2,000 bp 

from each other; each is represented on the array by 15 individual

probes, randomly distributed across the microarray slide. HAF were 

first realigned to the MM9 July 2007 build of the mouse genome and 

then annotated to the nearest transcription start site (TSS), allowing 

for a maximum distance of 5 kb from the TSS. Scanning was 

performed using a GenePix 4000B scanner (Axon Instruments) as 

previously described 49 . Quality control and data analysis of HELP 

microarrays were performed as described 50 .

Signal intensities at each HpaII-amplifiable fragment were calculated 

as a robust (25% trimmed) mean of their component probe-level signal 

intensities. Any fragments found within the level of background MspI 

signal intensity, measured as 2.5 mean-absolute-differences (MAD) above 

the median of random probe signals, were categorized as ‘failed’. These 

failed loci therefore represent the population of fragments that did not 

amplify by PCR, whatever the biological (e.g., genomic deletions and 

other sequence errors) or experimental cause. On the other hand, ‘methylated’ 

loci were so designated when the level of HpaII signal intensity 

was similarly indistinguishable from background. PCR-amplifying fragments 

(those not flagged as either methylated or failed) were normalized 

using an intra-array quantile approach wherein HpaII/MspI ratios are 

aligned across density-dependent sliding windows of fragment size–sorted 

data. DNA methylation was therefore measured as the log2(HpaII/MspI)

ratio, where HpaII reflects the hypomethylated fraction of the genome 

and MspI represents the whole genome reference. Analysis of normalized 

data revealed the presence of a bimodal distribution. For each sample, 

a cutoff was selected at the point that more clearly separated these two 

populations and the data were centered around this point. Each fragment 

was then categorized as either methylated, if the centered log HpaII/MspI 

ratio < 0, or hypomethylated if on the other hand the log ratio > 0. 
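The fragment-level calls described above can be sketched as follows. Several details are assumptions made for illustration: "25% trimmed" is read as dropping 25% of probes from each tail, MAD is computed as the median absolute deviation from the median, the quantile normalization step is omitted, and the centering point is passed in rather than estimated from the bimodal distribution. All function names are hypothetical.

```python
import math
import statistics

def trimmed_mean(values, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

def background_threshold(random_probe_signals, n_mad=2.5):
    """Median of random-probe signals plus n_mad absolute deviations."""
    med = statistics.median(random_probe_signals)
    mad = statistics.median([abs(s - med) for s in random_probe_signals])
    return med + n_mad * mad

def classify_fragment(hpaii_probes, mspi_probes, threshold, center=0.0):
    """Return (call, centered log2 ratio) for one HpaII-amplifiable fragment."""
    hpaii = trimmed_mean(hpaii_probes)
    mspi = trimmed_mean(mspi_probes)
    if mspi <= threshold:
        return "failed", None        # fragment did not amplify
    if hpaii <= threshold:
        return "methylated", None    # HpaII signal indistinguishable from background
    ratio = math.log2(hpaii / mspi) - center
    # The text leaves a ratio of exactly 0 undefined; it is called hypomethylated here.
    return ("methylated" if ratio < 0 else "hypomethylated"), ratio
```

The order of the checks matters: a fragment is tested for failure on the MspI channel before its HpaII signal is interpreted.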

HELP data analysis. Statistical analysis was performed using R 2.9 and 

BioConductor 51 . Unsupervised hierarchical clustering of HELP data was 

performed using the subset of probe sets (n = 3745) with s.d. > 1 across 

all cases. We used 1 − Pearson correlation distance, followed by a Lingoes

transformation of the distance matrix to a Euclidean one and subsequent 

clustering using Ward’s method. Correspondence analysis was performed 

using the BioConductor package MADE4. The top 100 genes whose 

methylation status varied the most across the different groups were identified 

as those with the greatest s.d. across all samples. 
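Two of the steps above, ranking loci by s.d. and computing the 1 − Pearson correlation distance, are simple enough to sketch in plain Python. The Lingoes transformation, Ward clustering and correspondence analysis are not reproduced here. The use of the sample (n − 1) standard deviation and the function names are assumptions.

```python
import math
import statistics

def most_variable(profiles, top_n):
    """Return the `top_n` loci with the greatest s.d. across samples.

    `profiles` maps a locus name to its list of values, one per sample.
    """
    ranked = sorted(profiles, key=lambda k: statistics.stdev(profiles[k]), reverse=True)
    return ranked[:top_n]

def pearson_distance(x, y):
    """1 - Pearson correlation between two equally long profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)
```

The distance ranges from 0 for perfectly correlated profiles to 2 for perfectly anticorrelated ones, which is why a further transformation is needed before a Euclidean method such as Ward's is applied.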

Quantitative DNA methylation analysis by MassARRAY EpiTyping. 

Validation of HELP findings was performed by matrix-assisted laser 

desorption ionization/time-of-flight (MALDI-TOF) mass spectrometry 

using EpiTyper by MassARRAY (Sequenom) on bisulfite-converted 

DNA following manufacturer’s instructions 52 but using the Fast Start 

High Fidelity Taq polymerase from Roche for the PCR amplification 

of the bisulfite-converted DNA. MassArray primers were designed to 

cover the promoter regions of the indicated genes. (Primer sequences 

available as Supplementary Table 5). 

Chromatin immunoprecipitation (ChIP). Cells were fixed in 1% 

formaldehyde for 10 min, quenched with glycine and washed three 

times with PBS. Cells were then resuspended in lysis buffer and 

sonicated 10 × 30 s in a Bioruptor (Diagenode) to shear the chromatin 

to an average length of 600 bp. Supernatants were precleared 

using protein-A agarose beads (Roche) and 10% input was collected. 

Immunoprecipitations were performed using polyclonal antibodies 

to H3K4trimethylated, H3K27trimethylated, H3 pan-acetylation and 

normal rabbit serum (Upstate). DNA-protein complexes were pulled 

down using protein-A agarose beads and washed. DNA was recovered 

by overnight incubation at 65 °C to reverse cross-links and purified 

using QIAquick PCR purification columns (Qiagen). Enrichment of 

the modified histones in different genes was detected by quantitative 

real-time PCR using the primers listed in Supplementary Table 4.
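The methods do not state how enrichment was calculated from the qPCR data. One common convention, shown here purely as an assumption, expresses the immunoprecipitated signal as a percentage of input, correcting the input Ct for the fraction of chromatin that was set aside (10% in this protocol) and assuming 100% PCR efficiency.

```python
import math

def percent_input(ct_ip, ct_input, input_fraction=0.10):
    """Percent-of-input enrichment from Ct values, assuming 100% PCR efficiency."""
    # Shift the input Ct to the value a 100% input sample would have given.
    ct_input_adjusted = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (ct_input_adjusted - ct_ip)
```

With this convention, an immunoprecipitate that reaches threshold at the same cycle as the 10% input sample corresponds to 10% of input.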

44. Conboy, I.M., Conboy, M.J., Smythe, G.M. & Rando, T.A. Notch-mediated restoration 

of regenerative potential to aged muscle. Science 302, 1575–1577 (2003). 

45. Sherwood, R.I. et al. Isolation of adult mouse myogenic progenitors: functional heterogeneity

of cells within and engrafting skeletal muscle. Cell 119, 543–554 (2004). 

46. Cheshier, S.H., Morrison, S.J., Liao, X. & Weissman, I.L. In vivo proliferation and cell

cycle kinetics of long-term self-renewing hematopoietic stem cells. Proc. Natl. Acad. 

Sci. USA 96, 3120–3125 (1999). 

47. Huang, D.W. et al. Systematic and integrative analysis of large gene lists using DAVID 

Bioinformatics Resources. Nat. Protoc. 4, 44–57 (2009). 

48. Figueroa, M.E., Melnick, A. & Greally, J.M. Genome-wide determination of DNA methylation 

by Hpa II tiny fragment enrichment by ligation-mediated PCR (HELP) for the 

study of acute leukemias. Methods Mol. Biol. 538, 395–407 (2009). 

49. Selzer, R.R. et al. Analysis of chromosome breakpoints in neuroblastoma at sub-kilobase

resolution using fine-tiling oligonucleotide array CGH. Genes Chromosom. Cancer 

44, 305–319 (2005). 

50. Thompson, R.F. et al. An analytical pipeline for genomic representations used for 

cytosine methylation studies. Bioinformatics 24, 1161–1167 (2008). 

51. Culhane, A.C., Thioulouse, J., Perriere, G. & Higgins, D.G. MADE4: an R package 

for multivariate analysis of gene expression data. Bioinformatics 21, 2789–2790 

(2005). 

52. Ehrich, M. et al. Quantitative high-throughput analysis of DNA methylation patterns 

by base-specific cleavage and mass spectrometry. Proc. Natl. Acad. Sci. USA 102, 

15785–15790 (2005). 

doi:10.1038/nbt.1667 



Rapid profiling of a microbial genome using mixtures 

of barcoded oligonucleotides 

Joseph R Warner 1 , Philippa J Reeder 1 , Anis Karimpour-Fard 2 , Lauren B A Woodruff 1 & Ryan T Gill 1 


A fundamental goal in biotechnology and biology is the development of approaches to better understand the genetic basis of 

traits. Here we report a versatile method, trackable multiplex recombineering (TRMR), whereby thousands of specific genetic 

modifications are created and evaluated simultaneously. To demonstrate TRMR, in a single day we modified the expression of 

>95% of the genes in Escherichia coli by inserting synthetic DNA cassettes and molecular barcodes upstream of each gene. 

Barcode sequences and microarrays were then used to quantify population dynamics. Within a week we mapped thousands of 

genes that affect E. coli growth in various media (rich, minimal and cellulosic hydrolysate) and in the presence of several growth 

inhibitors (β-glucoside, D-fucose, valine and methylglyoxal). This approach can be applied to a broad range of traits to identify

targets for future genome-engineering endeavors. 

Microbial genomes hold the potential for tremendous combinatorial 

diversity, comprising a sequence space of 4^(4,600,000). Researchers’ ability

to search this diversity for genetic features that affect pertinent traits 

remains limited by the number of individuals that can be tested, which 

is a small fraction of all possibilities. Thus, there is a demand for strategies 

for first defining relevant genetic variation and then thoroughly 

searching that space. This issue has been studied in great depth at the 

level of individual genes 1,2 , where high-throughput protein engineering 

methods are available for introducing specific mutations and then 

mapping the effects of such mutations onto protein activity. Advances 

in genomics 3 , and more recently multiplex DNA synthesis 4–8 and 

homologous recombination (or recombineering) 9–11 , now enable the 

extension of such a strategy to the genome scale. 
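To give a sense of the scale of the sequence space quoted above, the size of 4 raised to the genome length can be estimated with logarithms instead of being computed outright. The constant names below are illustrative. The genome length is the approximate figure used in the text.

```python
import math

GENOME_LENGTH_BP = 4_600_000  # approximate genome length used in the text
ALPHABET_SIZE = 4             # A, C, G, T

def decimal_digits_of_sequence_space(length_bp=GENOME_LENGTH_BP, alphabet=ALPHABET_SIZE):
    """Number of decimal digits in alphabet**length_bp, computed via log10."""
    return math.floor(length_bp * math.log10(alphabet)) + 1
```

The result is a number with roughly 2.8 million decimal digits, which is why only a vanishing fraction of possible genomes can ever be tested.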

Advances in genomics have resulted in several methods for highly 

parallel mapping of genes to traits, such as profiling of gene-knockout 

and plasmid-based libraries 12–20 . In some instances, microarray 

technology has been used to enable parallel tracking of genetically 

distinct individuals throughout growth in selective environments. 

One such tool, molecular barcoding 12,17 , involves the replacement 

of every gene in Saccharomyces cerevisiae with a specific DNA 

sequence that could be tracked via microarray. Although these tools 

are a powerful way to profile the effect of mutation, the difficulty 

of specifically creating new mutations limits these studies to one of 

two types of mutations that have previously been introduced (insertions 

or increases in copy number). These limitations have challenged 

efforts to apply these methods for dissecting phenotypes and reengineering 

phenotypes that rely upon the coordinated action of multiple 

genes and mutations. 

Research over the past decade has resulted in recombination-based 

methods (recombineering) that make it easier to specifically modify 

the E. coli genome using synthetic DNA (synDNA) 9–11,21–23 . Recently, 

a recombineering-based method, called MAGE, was reported 24 , 

whereby the expression levels of 24 genes were optimized in parallel 

to improve lycopene production more than all previously reported 

efforts, in considerably less time. This demonstration was enabled by 

a priori knowledge of which genes to modify, which is not available in

many genome-engineering efforts, such as engineering growth and 

tolerance. Here we describe TRMR, a complementary method for 

simultaneously mapping genetic modifications that affect a trait of 

interest. The method combines parallel DNA synthesis, recombineering 

and molecular barcode technology to enable rapid modification of 

all E. coli genes (Fig. 1 and Supplementary Fig. 1). We demonstrate 

this general approach through the construction of two comprehensive 

E. coli genomic libraries comprising 8,000 distinct mutations and 

gene-trait mapping of these cells in seven environments. 

Results 

Synthetic DNA cassettes for promoter replacement 

We designed a comprehensive library of synDNA cassettes that 

have predictable effects when inserted into the genome of E. coli. 

Although various genetic features could have been incorporated into 

the cassettes (such as point mutations or sequences affecting mRNA 

stability, translational efficiency and other processes), we chose to 

demonstrate TRMR using functional modifications that either generally 

increase the expression of a target gene, called ‘up’, or generally 

decrease the gene’s expression, called ‘down’. The up cassette contains 

a strong and repressible PLtetO-1 promoter 25 and ribosome binding

site (RBS) 26 sequences, which in general will increase downstream 

gene transcription and translation (Fig. 2). The down cassette was 

designed to replace the native RBS with an inert sequence that will 

generally cause a decrease in translation initiation. Both cassette 

designs include a blasticidin-S resistance gene 27 , allowing for selection 

of recombinant alleles. Molecular barcodes 12 (also called ‘tags’) 

were incorporated to track the presence of each synDNA oligo and to

track each allele (engineered cell) within the

mixed population on a barcode microarray 28

(Supplementary Notes).

1 Department of Chemical and Biological Engineering, University of Colorado, Boulder, Colorado, USA. 2 School of Medicine, University of Colorado at Health Science

Center, Denver, Colorado, USA. Correspondence should be addressed to R.T.G. (rtg@colorado.edu).

Received 4 February; accepted 8 June; published online 18 July 2010; doi:10.1038/nbt.1653

Because the length of the synDNA cassettes 

used here is beyond the current 

capabilities of commercially available oligo 

library synthesis, we developed a strategy 

for multiplex cassette construction that 

involves the ligation of sequences shared by 

all cassettes to a mixture of shorter oligos 

specific to each targeted gene. Construction 

of this library was complicated by the fact 

that each synDNA cassette must contain 

unique sequences in the flanking positions 

that are homologous to the chromosome 

where the cassette is to be inserted. This is 

traditionally accomplished by using PCR to 

amplify a DNA cassette with primers that 

contain the flanking homology regions 21,29 . 

Using such a method to construct thousands 

of alleles is resource- and time-intensive 3,11,19 , thus limiting the number

and type of allelic libraries that can be investigated. 

To address these issues, we developed a procedure to generate thousands 

of synDNAs containing multiple desirable sequence features 

(such as homology regions and expression modulators) that can be 

carried out in a complex mixture. Briefly, ‘targeting oligos’ were first 

synthesized on a microarray. Then, we ligated these to the cassette 

that modifies gene function, amplified the resulting product with 

rolling-circle amplification and then cleaved the long amplified DNA 

molecule into the synDNAs (Fig. 2a–c). 
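The rearrangement achieved by ligation, rolling-circle amplification and cleavage can be illustrated with a toy string model. Segment names follow the Figure 2b schematic, but the sketch uses placeholder tokens instead of real sequences, omits the P1 priming site on the assumption that it is removed before ligation, and is not a description of the actual laboratory protocol.

```python
def build_syndna(h1, h2, tag, shared, copies=3, cut="|CUT|", p2="P2", p3="P3"):
    """Toy model: circularize a target oligo with shared DNA, amplify, cleave."""
    # Target oligo layout: H2 - cut site - H1 - P3 - Tag - P2.
    target = h2 + cut + h1 + p3 + tag + p2
    # Ligation to the shared cassette closes the circle; one turn reads:
    circle_unit = target + shared
    # Rolling-circle amplification copies the circle into a linear concatemer.
    concatemer = circle_unit * copies
    # Cleavage at the repeating cut site releases the individual units.
    fragments = concatemer.split(cut)
    # The first and last fragments are partial; interior ones are full length.
    return fragments[1:-1]
```

Because the cut site lies between the two homology regions, each full-length fragment starts with H1 and ends with H2, with the barcode and shared cassette between them, which is the layout needed for insertion by homologous recombination.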

Targeting oligos were designed for every protein-coding gene in the 

E. coli MG1655 genome (Supplementary Table 1 and Supplementary 

Notes). In all, 8,154 targeting oligos were designed to create two possible 

expression alleles for 4,077 genes. Targeting regions were chosen 

such that DNA cassettes would insert upstream of genes, replace the 

translation start codon and account for gene overlap. Once designed, 

the set of targeting oligos, each 189 nucleotides long, was purchased 

through limited access at a cost of roughly $1 per unique oligo 

(Oligonucleotide Library Synthesis, Agilent). 

To test cassette design and construction and to optimize the procedure 

for allele production, we attempted promoter replacement 

for the lacZ and galK genes. After optimizing design, we were able 

to efficiently generate these alleles using the procedures outlined in 

Figure 2. Alleles were isolated as colonies and all showed the expected 

change in regulation and expression of the lacZ gene (Fig. 2d,e) or 

the galK gene. Furthermore, in PCR confirmations and sequencing, 

30 of 30 alleles tested showed the correct site of insertion. By counting 

colonies we estimated that we were able to routinely generate at 

least 75 alleles per microliter of cells transformed and determined 

that yields increased linearly with transformation volumes from 

40 µl up to the 400 µl tested. With increases in scale, it is conceivable that

one could generate 10^5–10^7 alleles in a single day, enough to profile

several modifications of every E. coli gene. 
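The yield figures above imply a simple linear scaling, sketched below. Extrapolating the linear relationship beyond the 400 µl that was tested is an assumption, and the names are illustrative.

```python
ALLELES_PER_UL = 75  # lower-bound yield reported in the text

def expected_alleles(volume_ul, alleles_per_ul=ALLELES_PER_UL):
    """Alleles expected from a given transformation volume, assuming linearity."""
    return volume_ul * alleles_per_ul

def volume_needed_ul(target_alleles, alleles_per_ul=ALLELES_PER_UL):
    """Transformation volume needed to reach a target number of alleles."""
    return target_alleles / alleles_per_ul
```

At this yield, the tested range of 40 to 400 µl corresponds to about 3,000 to 30,000 alleles, and reaching 10^5 alleles would take a little over 1.3 ml of transformed cells.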

Efficient construction of genome-scale allele libraries 

Using a library of 8,154 targeting oligos, we attempted to construct 

4,077 up synDNA oligos and 4,077 down synDNA oligos 

in separate pools. Both oligo pools were constructed in 1 week 

and resulted in enough material for several rounds of multiplex 

recombineering. The synDNA oligos were then used in a day of 

Figure 1 panel labels: mixture of ≈8,000 unique oligomers (functional and targeting regions); (iv) enrichment of improved cells from engineered genomes; (v) multiplex identification by microarray, frequency of designed mutation F_x = C_x/C_tot (shown for geneC to geneF); (vi) genome mapping of improved genomes, fitness conferred by mutation W′ = F_x,f/F_x,i, shown as a genome plot.

Figure 1 TRMR method. (i) Design DNA cassettes encoding the suite of mutations of interest. 

(ii) Synthesize those cassettes, along with associated molecular barcodes, in a single pool. 

(iii) Introduce cassettes into recombination-proficient E. coli 46 and produce thousands of variants, 

each with a distinct region of the chromosome that is engineered. (iv) Perform selections or screens 

on the mixture of variants to enrich for those possessing a desired trait. (v) Quantify changes in 

allele frequency using molecular barcode technology 47 . (vi) Use these frequency measurements 

to map specific genetic changes onto the trait of interest. C_x, concentration of allele x; C_tot,

total concentration; F_x,f and F_x,i, final and initial allele frequencies (see equations in Results).

recombineering experiments, separately generating thousands of 

up and down recombinant colonies. Colonies were scraped from 

plates and frozen in aliquots for subsequent experiments. 
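The quantities defined in the Figure 1 legend, allele frequency F_x = C_x/C_tot and fitness W′ = F_x,f/F_x,i, can be computed directly from barcode concentrations. The sketch below reads the fitness expression as the ratio of final to initial frequency. The dictionary layout and function names are illustrative.

```python
def allele_frequencies(concentrations):
    """F_x = C_x / C_tot for every allele x in a dict of barcode concentrations."""
    total = sum(concentrations.values())
    return {x: c / total for x, c in concentrations.items()}

def fitness(initial, final):
    """W' = F_x,f / F_x,i for every allele detected in the initial population."""
    f_initial = allele_frequencies(initial)
    f_final = allele_frequencies(final)
    # Alleles absent from the final population are given a final frequency of 0.
    return {x: f_final.get(x, 0.0) / f_initial[x] for x in f_initial if f_initial[x] > 0}
```

Values above 1 mark alleles that were enriched during the selection, and values below 1 mark alleles that were depleted.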

To confirm that desired mutant alleles were generated, we PCR 

amplified and sequenced barcode tags from 390 colonies. Sequencing 

of the cassette and neighboring chromosome DNA indicated that in 

34 of 34 distinct alleles, the cassettes had inserted into the correct 

location of the genome. Sequencing also provided an estimate of the 

number of alleles containing an error in DNA sequence. Outside 

of the barcode sequences, DNA errors were observed in only three 

of 34 alleles, two of which had errors in regions of the cassette that 

should not affect allele identification or function. The barcode tag 

sequences provide an estimate of DNA errors present in the initial 

oligo libraries because barcodes are not subject to the experimental 

bias (such as selection for correct sequences during PCR

amplification and during homologous recombination) that would 

filter out incorrect sequences. High fidelity of the molecular barcode 

sequences is also required to accurately detect the presence of each 

allele in cell mixtures. Only 5% of the 390 sequenced tags showed 

an error, usually substitution or loss of a single nucleotide. The 

high percentage of correct alleles observed here is a first indication 

that complex oligonucleotide mixtures may be used to engineer and 

identify thousands of distinct genomic loci with high fidelity. 

To assess our ability to make complete and uniform libraries in 

multiplex, we used Affymetrix Geneflex TAG4 arrays 28 to measure 

the concentration of each barcode tag in the synDNA mixture (before 

recombineering) and in genomic DNA from cell mixtures (after 

recombineering). We observed microarray signals from hybridization 

of each of the 8,154 library tags, ten positive-control tags that 

we spiked into the samples to calculate tag concentrations (see 

Supplementary Fig. 2), and 1,642 negative-control tags used to provide 

a measure of background hybridization and noise. The barcode 

signals from the synDNA mixtures indicated that 8,016 of the oligos 

were present (detected above background). Therefore, we successfully 

generated nearly complete (98%) up and down oligo libraries. 

Microarray analysis of the cell mixtures indicated successful generation 

of at least 7,829 unique alleles (96% of designed alleles; Fig. 3a 

and Supplementary Table 2). We found that the concentration of 

each unique allele depended on the concentration of synDNA used 



Figure 2 panel labels: (a) two mixtures of target oligos (geneX, geneY and geneZ; up and down) are joined to shared DNA in steps i–iv to give two mixtures of 4,077 synDNA oligos. (b) Target oligo (189 nucleotides): P1, H2x, cut site, H1x, P3, Tagx, P2; synDNA oligo (~800 base pairs): H1x, P3, Tagx, P2, antibiotic R, Up/Down, H2x. (c) E. coli cell with chromosome and recombineering enzymes; 4,077 genes targeted simultaneously. (d) Up allele (761 bp insertion): Tag, blasticidin R, PLtetO-1 and RBS, lacZ+; down allele (703 bp insertion): Tag, blasticidin R, no RBS, lacZ−. (e) Wild type, lacZ up and lacZ down cells on glucose + X-gal and on IPTG + X-gal.

Figure 2 Multiplex strategy to rapidly generate cell mixtures with defined genetic modifications. (a) Construction of synDNA library. (i) ‘Target’ oligos 

that contain chromosome homology and barcodes are synthesized on a chip, cleaved from the chip, amplified by two rounds of PCR and modified 

with (ligation) sequences by uracil excision 48 . (ii) This pool of target oligos is ligated with oligos containing a selectable marker and promoter and 

RBS variants (Shared DNA), resulting in a pool of DNA circles. (iii) DNA circles are copied into a pool of linear concatemers by rolling-circle 

amplification 49 . (iv) Concatemers are cleaved at a repeating site linking the homology regions to provide a pool of synDNA ready for multiplex 

recombineering. (b) Schematic of target oligos and synDNA oligos for gene x. Red, unique regions; black, shared regions; P, PCR priming site; 

H, chromosome targeting region; Tag, barcode tag sequence; Up/Down, functional region. Sequence is shown for amplifying barcode tags and for 

functional regions (promoter sequence in italic, RBS in bold, start codon underlined). (c) Pool of synDNA oligos is inserted into electrocompetent E. coli 

cells. Recombineering enzymes catalyze the insertion of the synDNA oligos at thousands of unique loci in the genome. (d) Schematic of lacZ alleles 

used to test the method. Up allele is designed to increase gene transcription and translation. Down allele is designed to decrease translation. (e) LacZ 

up and down alleles yield the intended phenotypes. Up mutation of the lacZ gene causes cells to turn blue on the surface of agar containing glucose and 

X-gal. Down mutation of the lacZ gene causes cells to remain colorless on the surface of agar containing IPTG and X-gal. 


to construct that allele (Supplementary Fig. 3). After normalization 

of the concentration of each allele for differences in synDNA concentrations 

used in recombineering, the s.d. for generating each mutant 

was ± 65% of the average, distributed uniformly around the genome 

(Fig. 3b). We also observed a modest dependence of recombineering 

frequency on the hybridization free energy 30 of the homology regions 

(Supplementary Fig. 4). 
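The normalization described in this paragraph can be sketched as follows. The efficiency definition (allele concentration divided by synDNA concentration) follows the Figure 3 caption; the use of the sample standard deviation, the function names and the toy concentrations are assumptions made for illustration only.

```python
from statistics import mean, stdev


def recombineering_efficiency(allele_conc, syndna_conc):
    """Allele concentration divided by the concentration of its synDNA oligo.

    Alleles whose synDNA oligo was absent (zero or missing) are skipped,
    because no efficiency can be computed for them.
    """
    return {
        allele: conc / syndna_conc[allele]
        for allele, conc in allele_conc.items()
        if syndna_conc.get(allele, 0) > 0
    }


def relative_sd(values):
    """Standard deviation expressed as a fraction of the mean."""
    vals = list(values)
    return stdev(vals) / mean(vals)


# Hypothetical concentrations in arbitrary units.
alleles = {"x": 2.0, "y": 3.0, "z": 1.0, "w": 0.5}
syndna = {"x": 4.0, "y": 2.0, "z": 1.0, "w": 0.0}

efficiency = recombineering_efficiency(alleles, syndna)
print(efficiency)
print(relative_sd(efficiency.values()))
```

A relative s.d. of 0.65 computed this way would correspond to the "± 65% of the average" reported above.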

A small percentage of the alleles were not detected (4%), and in all 

these cases the preceding synDNA was either absent or found in low 

concentrations. In subsequent attempts to create allele libraries, most 

of these missing alleles were detected, suggesting that the alleles were 

initially not detected because of low concentrations of the synDNA 

oligos. These results indicate that the uniformity of cell mixtures in 

future multiplex recombineering experiments may easily be improved 

by supplementation with synDNA oligos that are initially present in 

low concentrations. Improvements in the uniformity of the initial 

mixture should enable the more efficient identification of cells with 

improved traits. 

Notably, a single researcher was able to create these two genome-scale

up and down allele libraries in a single day, demonstrating that 

multiplex recombineering is a rapid strategy for reprogramming 

thousands of genes. 

Genome-scale mapping of alleles to selectable traits 

To illustrate the potential of TRMR to rapidly generate and identify 

cells with new traits, we plated the cell mixtures on agar medium 

supplemented to create four different conditions (salicin, D-fucose,

methylglyoxal and valine) in which wild-type E. coli typically do not 

grow. Colonies representing resistant mutants arose from our allele 

mixtures at frequencies >100-fold greater than from unmodified control 

cells that relied on spontaneous mutation to generate resistance 

(Supplementary Table 3). We characterized individual colonies (83 

total) by sequencing the barcode tags. Additionally, we used TAG4 

microarrays to characterize the populations obtained by scraping all 

colonies off of the surfaces of selection plates. Using microarray data, 

we ranked each allele in each condition according to fitness (fitness of 



[Figure 3a axes: number of probes versus allele tag signals (0 to 8,400); inset, number of probes versus unassigned tag signals (20 to 260); detection threshold marked.]

Figure 3 Analysis of synDNA and cell library. (a) Histogram showing 

the distribution of barcode signals of the up and down allele libraries 

detected by the TAG4 microarray. The unassigned tag signals (shown 

in gray) provide a measure of the background signal for each probe on 

the microarray. Probes that are assigned to unique alleles are shown in 

green. The unassigned tag signals have a low signal distribution (inset), 

and the threshold is shown for signals that are significantly above the 

background signal. The threshold for detection was such that the rate 

of false positives would be less than 2.2%. (b) TAG4 microarray results 

showing the distribution of synDNA oligos and alleles plotted by genomic 

location on the circular E. coli genome. Blue, up library; red, down 

library; inner circles, the concentration of each unique synDNA oligo 

before recombineering; outer circles, efficiency of generating each allele, 

calculated by dividing allele concentration by synDNA concentration. 


[Figure 3b label: up library, pop. 3,869.]

allele x = $W'_x = F_{x,f}/F_{x,i}$, which is the ratio of the final allele frequency

($F_x$ = concentration of x/total concentration) after growth to the initial

allele frequency). The allele fitness determined by microarray agreed

well with the results from picking and sequencing individual colonies 

(Fig. 4 and Supplementary Table 4). 
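The fitness measure defined above, the ratio of an allele's final frequency to its initial frequency, can be written out as a minimal sketch. The allele names and concentrations below are hypothetical, and the treatment of alleles missing from the final sample (fitness of zero) is an assumption.

```python
def allele_frequencies(concentrations):
    """Convert allele concentrations to frequencies (concentration / total)."""
    total = sum(concentrations.values())
    if total <= 0:
        raise ValueError("total concentration must be positive")
    return {allele: c / total for allele, c in concentrations.items()}


def fitness(initial_conc, final_conc):
    """W' for each allele: final frequency divided by initial frequency.

    Alleles absent from the final sample get W' = 0 (assumed convention);
    alleles with zero initial frequency are skipped.
    """
    f_initial = allele_frequencies(initial_conc)
    f_final = allele_frequencies(final_conc)
    return {
        allele: f_final.get(allele, 0.0) / f_i
        for allele, f_i in f_initial.items()
        if f_i > 0
    }


def rank_by_fitness(fitness_by_allele):
    """Allele names sorted from highest to lowest W'."""
    return sorted(fitness_by_allele, key=fitness_by_allele.get, reverse=True)


# Hypothetical barcode signals before and after selection.
before = {"allele_a": 1.0, "allele_b": 1.0, "allele_c": 2.0}
after = {"allele_a": 6.0, "allele_b": 1.0, "allele_c": 1.0}

w = fitness(before, after)
print(w)
print(rank_by_fitness(w))
```

Because W′ is a ratio of frequencies rather than of raw signals, it is unaffected by differences in total signal between the two microarray measurements.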

Constructing mutants with beneficial traits and identifying the 

genetic cause has traditionally been a slow and laborious process. 

Using TRMR, we were able to rapidly identify traits present in our 

cell mixtures that are consistent with previous 

studies and identify unexpected genetic 

modifications that could be used in future 

metabolic engineering. The allele(s) that 

conferred the highest frequency or fitness 

from these selections were reconstructed 

separately to confirm that improved growth is 

due to the insertion of the identified cassette. 

These alleles are summarized in Figure 4 and 

described in detail below. 

Salicin is a carbon source that E. coli normally 

cannot metabolize owing to repression 

of the enzymes BglF and BglB. We identified 

the hns down mutation, using both array 

[Figure 3b label: down library, pop. 3,960. Figure 4a,b labels: frozen cell mixture; salicin; hns.]

results and sequencing, as having the greatest effect on fitness in 

medium supplemented with salicin. Mutations in the hns (histone-like

nucleoid structuring protein) regulator 31 are known to confer 

improved growth on salicin. Its identification here confirms that the 

TRMR method can effectively uncover gene-trait relationships. 

D-fucose is a nonmetabolizable analog of arabinose that inhibits the

ability of E. coli to use arabinose as a carbon source by inhibiting induction

of the L-arabinose operon. We identified the xylA up allele, which

causes overexpression of xylA and xylB, as conferring the ability to grow

in the presence of D-fucose. Notably, these results suggest that E. coli

xylose isomerase (XylA) may have in vivo L-arabinose isomerase activity.

This discovery is corroborated by the observation that overexpression

of E. coli xylAB in Pseudomonas putida confers the ability to metabolize

both xylose and L-arabinose 32 . Such a trait is of potential value for the

efficient use of cellulosic biomass as a renewable feedstock. 

Methylglyoxal is an important intracellular metabolite because it 

can be used as an intermediate for production of commodity chemicals 

and because, when metabolism is disrupted, it can accumulate, 

[Figure 4a,b panel labels: recovered cells; growth on selective agars (D-fucose, methylglyoxal, valine); microarray analysis, allele sequencing and reconstruction, phenotype validation; labeled alleles xylA, sodC and ilvN.]

Figure 4 Trait-conferring genotypes identified 

in four selective environments. (a) Up and down 

alleles were recovered from frozen cultures and 

spread on agar medium in conditions where 

wild-type cells would not grow (indicated as 

column headings). (b) Fitness (W′) calculated 

by microarray detection of barcode tags was 

plotted for each allele by genomic location. 

Blue, up allele; red, down allele. (c) Known 

or hypothesized mechanisms whereby the 

identified genomic modifications confer 

the ability to grow. High-fitness alleles were 

detected on microarrays, except for leuL down, 

which was identified by sequencing of barcode 

tags within colonies. 

[Figure 4c panel labels: hns down (salicin, H-NS, BglF, BglB, glycolysis); xylA up (D-fucose, L-arabinose, D-xylose, XylA, XylB, pentose phosphate pathway); sodC down (methylglyoxal, toxic oxygen radicals, SodC); leuL down (valine, 2-KIV, LeuABCD overexpression, leucine and isoleucine); ilvN down (valine, IlvB, pyruvate + 2-ketobutyrate, acetohydroxybutyrate, isoleucine).]




Figure 5 Alleles identified during pooled growth in 

media and cellulosic hydrolysate. (a) TRMR alleles 

were recovered from frozen cultures and allowed 

to grow in a rich medium, minimal medium or 

cellulosic hydrolysate. (b) Allele frequencies after 

growth in media plotted by genomic location. Inner 

circle, rich; outer circle, minimal; blue, up allele; 

red, down allele; black, control allele frequency × 10. 

(c) Allele fitness in minimal medium plotted 

against fitness of the same allele in rich medium. 

Shapes describe the affected gene function as 

determined by clusters of orthologous groups: 

◊, information storage and processing; , cellular 

processes; , metabolism; ×, poorly characterized; 

blue, up allele; red, down allele; black, control 

allele. Fitness trend was fit to a line shown in 

black ($R^2$ = 0.748). (d) The fitness of down

alleles compared with the corresponding up 

alleles. Brown , rich medium; green , minimal 

medium. For alleles that cluster toward either 

the x or the y axis, the up allele and the down 

allele report opposite effects. Inset shows fitness 

benefits (W′ > 1) of top 40 alleles for growth in 

minimal medium, and the fitness effects (usually 

detrimental, W′ < 1) of the orthogonal alleles. 

(e) Fitness (lnW′) plotted by genomic location of 

alleles isolated after growth in hydrolysate. Inner 

circle, 15–17% hydrolysate; outer circle, 18–20% 

hydrolysate; blue, up allele; red, down allele. 

Some alleles conferring high fitness are labeled. 

(f) Growth curves of isolated variants in cellulosic 

hydrolysate. Each growth curve is the average of 

three replicates. Curves are fit with a Gompertz 

function 50 (black). Alleles are denoted with roman 

numerals, as follows: (i) puuE down (pale blue), 

(ii) yciV down (purple), (iii) ygaZ up (green), (iv) lpp 

down (pink), (v) ugpE down (blue), (vi) ptsI down 

(pale green), (vii) wild-type MG1655 (red), 

(viii) ahpC up (blue). Error bars are minimal and are 

not shown for clarity. $A_{600}$, absorbance at 600 nm.

(g) Percent change in biomass productivity and 

maximum growth rate for isolated variants grown 

in hydrolysate relative to E. coli MG1655 grown 

in hydrolysate. Biomass productivity (gray bars) is 

the area under each growth curve. Growth rate (red 

bars) is the maximum growth rate as calculated 

from the Gompertz function. Values are the average 

of three replicates; error bars denote s.d. 

[Figure 5 panel labels. (a) Freezer stock; growth in rich nutrients; growth in minimal nutrients; growth in 15–17% cellulosic hydrolysate; growth in 18–20% cellulosic hydrolysate. (c) Axes: minimal medium allele fitness (W′) versus rich medium allele fitness (W′). (d) Axes: down allele fitness (W′) versus up allele fitness (W′); inset, fitness gain and loss. (e) Labeled alleles: cyaA, ygjQ, ahpC up, ilvM, eutL, ptsI, moeA, ybaB, ydjG, lsrA, yciV. (f) Axes: number of cells ($A_{600}$) versus time (h); curves i–viii. (g) Axis: % change in biomass productivity and growth rate relative to wild type; bars for puuE down, ptsI down, lpp down, yciV down, ygaZ up, ugpE down and ahpC up; annotated values 248 ± 18% and 233 ± 22%.]

resulting in oxidative damage and eventual cell death 33 . We used 

TRMR to discover a previously unknown phenotype: decreased 

expression of sodC, which produces a superoxide-mediating enzyme 34 , 

confers resistance to exogenous methylglyoxal, possibly by affecting 

superoxide concentrations in the periplasm. 

Excess valine causes feedback inhibition of leucine and isoleucine 

biosynthesis, leading to inhibition of cell growth as these amino 

acids become scarce. Microarray results identified ilvN down as the 

allele conferring the best growth, and this genomic region has been

implicated in several previous studies 35,36 . Unexpectedly, sequencing

showed that cells carrying the leuL down allele also grew well on valine plates.

The leuL down mutation would cause increased expression of the

leucine biosynthesis operon leuABCD by circumventing the proposed

transcription attenuation caused by leuL 37 . Mutations of this operon 

have not previously been associated with valine resistance. However, a 

recent attempt to increase production of noncanonical amino acids in 

engineered E. coli cells demonstrated that overexpression of leuABCD 

shifts metabolite pools from valine toward isoleucine and leucine 38 . 

Genome-scale quantitative growth phenotypes 

To further demonstrate that TRMR performs well at the genome scale, 

we combined the up and down allele libraries and measured fitness in 

liquid cultures that contained rich or minimal nutrients (Fig. 5a). The 

liquid cultures were allowed to grow for an average of eight generations, 

before and after which aliquots of cells were plated for analysis of 

individuals or frozen for microarray analysis. Additionally, an aliquot 

of control cells (barcoded and kanamycin resistant; Supplementary 

Notes) was spiked into the culture at the start of selections. A known 

concentration of these control cells was used to assess the ability of 

barcode technology to measure allele concentrations during pooled 

growth. The control cells also serve as a wild-type standard with which 

the fitness of alleles can be compared. 
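One plausible way to use the spiked-in control as a wild-type standard is to divide each allele's W′ by the W′ of the control cells. The text does not state the exact calculation, so the normalization below, the function names and the toy values are all assumptions. The log transform is included because fitness is plotted as lnW′ in Figure 5e.

```python
import math


def relative_to_control(fitness_by_allele, control_name):
    """Express each allele's W' relative to the W' of the control strain.

    Values above 1 mean the allele outgrew the control; values below 1
    mean it grew more slowly. (Assumed normalization, for illustration.)
    """
    w_control = fitness_by_allele[control_name]
    if w_control <= 0:
        raise ValueError("control fitness must be positive")
    return {
        allele: w / w_control
        for allele, w in fitness_by_allele.items()
        if allele != control_name
    }


def log_fitness(w):
    """Natural log of W'; alleles lost from the culture map to -infinity."""
    return math.log(w) if w > 0 else float("-inf")


# Hypothetical W' values; "control" stands in for the kanamycin-resistant strain.
w_values = {"control": 2.0, "allele_a": 3.0, "allele_b": 1.0, "allele_c": 0.0}

relative = relative_to_control(w_values, "control")
print(relative)
print({allele: log_fitness(w) for allele, w in relative.items()})
```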

Using barcode microarrays, we simultaneously tracked all of 

the alleles, which were reduced to approximately 2,500 alleles after 

growth selections (Fig. 5b). The numbers of control cells in the

populations determined by microarray were not substantially different

from estimates of control-cell numbers obtained from counting 




kanamycin-resistant colonies. Microarrays revealed that the majority 

of alleles had similar growth phenotypes in both rich and minimal 

media (Fig. 5c, x-y diagonal). Noteworthy alleles that do not fit this 

trend are those that allow growth in the rich medium but are no 

longer observed in the minimal medium (Fig. 5c, alleles along x axis). 

Consistent with previous observations 19 , many of these alleles consist 

of changes in the expression of genes involved in metabolism. Also 

of interest are those alleles that confer faster growth than that of the 

control cells in the minimal medium (a list of fitness values can be 

found in Supplementary Table 5). 

These experiments also offer the first genome-wide glimpse of 

generally orthogonal expression alleles grown competitively in the 

same culture. We anticipated that if a particular up allele shows a 

fitness benefit, then the down allele is likely to show a negative effect 

on fitness, possibly being lost from the culture, and vice versa. This 

is often the case (see Fig. 5d, allele clustering toward the axes), providing 

further evidence that our synthetic cassettes are generally 

causing the intended effects at genome-wide loci. Exceptions such as 

improved growth resulting from both up and down expression alleles 

in the same environment may be due to secondary effects (such as 

increased transcription of multiple downstream genes) and require 

further investigation. 
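The comparison of up and down alleles for the same gene can be sketched as a simple classification. The cutoff W′ = 1 (benefit above, detriment below) follows the Figure 5 caption; the category labels, function names and example values are hypothetical.

```python
def classify_pair(w_up, w_down, neutral=1.0):
    """Classify the up/down allele pair of one gene by their W' values."""
    if w_up == neutral or w_down == neutral:
        return "neutral"
    up_gain = w_up > neutral
    down_gain = w_down > neutral
    if up_gain != down_gain:
        return "opposite"  # one allele helps, the other hurts
    return "both beneficial" if up_gain else "both detrimental"


def summarize_pairs(pairs):
    """Count how many genes fall into each category.

    pairs maps a gene name to a (w_up, w_down) tuple.
    """
    counts = {}
    for w_up, w_down in pairs.values():
        label = classify_pair(w_up, w_down)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical W' values for three genes.
example = {
    "gene_1": (2.4, 0.3),  # up helps, down hurts
    "gene_2": (0.2, 1.8),  # up hurts, down helps
    "gene_3": (1.6, 1.4),  # both help: a candidate for follow-up
}
print(summarize_pairs(example))
```

Pairs classified as "both beneficial" correspond to the exceptions noted above, which may reflect secondary effects and would need separate investigation.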

Mapping tolerance to lignocellulosic hydrolysate 

We next applied TRMR to identify genes that improve tolerance to 

lignocellulosic hydrolysate derived from corn stover (provided by the 

US National Renewable Energy Laboratory). This class of feedstocks 

contains a variable array of growth inhibitors (known inhibitors 

include organic acids, aldehydes and phenolic-based compounds) 39,40 . 

To take hydrolysate variability into account, we measured growth of 

variants bearing our alleles in several mixtures of hydrolysate and 

minimal medium. 

Microarray analysis of the alleles indicated that only a small 

subset of the population remained after each selection (Fig. 5e; 

see Supplementary Table 6 for fitness values and gene ontology 

analysis). Many of the modifications that improved growth in lower 

concentrations of corn stover hydrolysate affected genes known to be 

involved in primary metabolism (pgi up, eno up and tdcG up), RNA 

metabolism (rlmG down, rimM up, rsmE down and rrmA down) and 

transport of sugars (ptsI down, ptsI up and directly downstream crr). 

Growth in higher concentrations of hydrolysate selected alleles related 

to secondary metabolism (ispF up and dxs up), vitamin metabolic 

processes (nadD up, menD up, apbE up, pabC up, dxs up and ribB 

up) and antioxidant activity (ahpC up, tpx up and bcp up). The down 

mutation of the adenylate cyclase gene (cyaA) conferred a growth 

advantage in every selection. 

To confirm that the mutations conferred fitness advantages, we 

isolated seven alleles after the selections and characterized growth 

in hydrolysate relative to unmutated E. coli (Fig. 5f,g). All seven

alleles (ahpC up, ugpE down, puuE down, ptsI down, ygaZ up, yciV 

down and lpp down) yielded improvement in either growth rate or 

biomass productivity relative to the wild-type strain. Notably, the 

up allele of ahpC resulted in a large improvement. The ahpC gene 

and its downstream counterpart ahpF have not previously been 

identified as important for growth in hydrolysate. However, they 

have been implicated in resistance to organic solvents 41 and various 

oxidants 42,43 , possibly indicating that during growth in cellulosic 

hydrolysate, reactive oxygen species in the form of peroxides and 

other oxidants are present or forming as a result of imbalances in 

metabolism 44 . In addition to identifying several important targets for 

future genome-engineering endeavors, many of which would have 

been difficult to predict a priori, these profiling studies shed light on 

general mechanisms of hydrolysate toxicity (such as the presence of 

oxidants) and growth advantage in hydrolysate (such as metabolism 

of preferred carbon sources). 
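The two growth metrics reported for the isolated variants (Fig. 5g) can be illustrated with a short sketch. According to the Figure 5 caption, biomass productivity is the area under each growth curve and growth rate is the maximum growth rate calculated from a Gompertz function. The specific parameterization used here (a commonly used reparameterized form with asymptote A, maximum rate mu and lag time lam) is an assumption, as are the trapezoid integration, the function names and all parameter values.

```python
import math


def gompertz(t, asymptote, mu, lam):
    """Reparameterized Gompertz curve.

    asymptote: upper plateau of the curve
    mu:        maximum growth rate (slope at the inflection point)
    lam:       lag time
    """
    return asymptote * math.exp(
        -math.exp(mu * math.e / asymptote * (lam - t) + 1.0)
    )


def inflection_time(asymptote, mu, lam):
    """Time at which the slope of the curve equals mu."""
    return lam + asymptote / (mu * math.e)


def trapezoid_area(times, values):
    """Area under a sampled curve by the trapezoid rule."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


def percent_change(variant, wild_type):
    """Percent change of a variant's metric relative to wild type."""
    return (variant - wild_type) / wild_type * 100.0


# Hypothetical fitted parameters for a wild-type and a variant culture.
times = [0.25 * i for i in range(57)]  # 0 to 14 h
wild_type = [gompertz(t, 1.0, 0.15, 3.0) for t in times]
variant = [gompertz(t, 1.4, 0.30, 2.0) for t in times]

print(percent_change(trapezoid_area(times, variant),
                     trapezoid_area(times, wild_type)))
print(percent_change(0.30, 0.15))
```

In this parameterization the maximum growth rate is a fitted parameter, so it can be read directly from the fit rather than estimated from noisy differences between time points.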

Discussion 

We have described a new method for the genome-scale mapping 

of genes to traits and have shown that this method can increase 

the throughput of genetic studies by several orders of magnitude. 

Although some of the trait-conferring modifications we identified 

correspond to previously identified genomic regions, the majority 

would have been difficult to predict. Such unanticipated outcomes 

provide insight into many uncharacterized genes and, in some cases, 

into known genes with uncharacterized functions. We have already 

begun applying this method toward understanding a range of traits of 

importance in biotechnology, including improved growth in industrially 

relevant conditions and enhanced product formation. 

We have designed TRMR to be easy to use and versatile. The 

molecular cloning procedures were accomplished within a week by 

a single researcher, with two additional days providing enough cells 

for 60 genome-wide selection and screening studies. Notably, data 

acquisition and analysis from TRMR is similar to genomics methods 

currently used by the yeast community and is amenable to a range 

of freely and commercially available software packages. The primary 

challenge to the broad dissemination of this method is the acquisition 

of oligonucleotide libraries, which will be overcome as DNA synthesis 

technologies continue to improve. 

We envision that a broad range of additional studies could be performed 

using the basic TRMR platform described here by changing 

the targeting, functional or tracking design. For example, although the 

functional regions we used were promoters and translation sites, one 

might conceivably use sites associated with additional functions such 

as switches, oscillators or sensors 45 . Moreover, the TRMR approach 

is not limited to engineering or examining the E. coli genome. The 

design could be adapted for rapidly engineering yeast and a range 

of Gram-negative bacteria 23 , provided the host has sufficient transformation 

and recombination capabilities. Additionally, TRMR may 

be carried out recursively, allowing for the accumulation of multiple 

beneficial mutations within a genome. Researchers could produce 

second- and third-generation recombinant cells by removing the antibiotic 

cassette between rounds of recombineering to allow isolation 

of cells containing an additional mutation, by using different antibiotic 

cassettes in the modular construction of the synDNA oligos so 

that different antibiotics could be used to isolate recombinants after 

each round of TRMR, or by eliminating altogether the need to isolate 

recombinants by relying on the increased efficiency of recombineering 

strategies such as those used in MAGE 24 . Integration of TRMR 

into directed-evolution programs would provide genome-scale construction 

and tracking of combinations of mutations, which would 

improve both the understanding and engineering of complex traits. 

Methods 





Acknowledgments

We thank D. Court (Center for Cancer Research, National Cancer Institute at

Frederick, Maryland) for sharing plasmid pSIM5, C. Nislow and G. Giaever 

(University of Toronto, Ontario) for help with microarray analysis, A. Mohagheghi 

and M. Zhang (US National Renewable Energy Laboratory) for hydrolysate

samples, M. O’Donnell for help in preparation of selective agar plates, Agilent for 




access to the Oligonucleotide Library Synthesis product, and H. Marshall and the 

University of Colorado Microarray Facility for molecular barcode genotyping. 

The authors appreciate financial support provided by Shell, the Colorado Center 

for Biorefining and Biofuels (http://www.C2B2web.org) and the Colorado Energy 

Initiative (http://rasei.colorado.edu). 

Author Contributions 

J.R.W. and R.T.G. conceived the study; J.R.W. designed and performed all 

experiments except for growth selections and allele confirmations in hydrolysate, 

which were conducted by P.J.R.; A.K.-F. aided J.R.W. in selection of targeting 

sequences and selection of barcode tags; A.K.-F. and P.J.R. assigned gene ontology 

terms; L.B.A.W. aided J.R.W. in selection design and microarray analysis; L.B.A.W. 

constructed circle plots; P.J.R., A.K.-F. and L.B.A.W. helped in manuscript 

preparation; J.R.W. and R.T.G. wrote the manuscript; R.T.G. supervised all aspects 

of the study. 






1. Fox, R.J. et al. Improving catalytic function by ProSAR-driven enzyme evolution. 


2. Turner, N.J. Directed evolution drives the next generation of biocatalysts. Nat. Chem. 

Biol. 5, 567–573 (2009). 

3. Winzeler, E.A. et al. Functional characterization of the S. cerevisiae genome by

gene deletion and parallel analysis. Science 285, 901–906 (1999).

4. Fodor, S. et al. Light-directed, spatially addressable parallel chemical synthesis. 

Science 251, 767–773 (1991). 

5. Blanchard, A.P., Kaiser, R.J. & Hood, L.E. High-density oligonucleotide arrays. 

Biosens. Bioelectron. 11, 687–690 (1996). 

6. Singh-Gasson, S. et al. Maskless fabrication of light-directed oligonucleotide microarrays 

using a digital micromirror array. Nat. Biotechnol. 17, 974–978 (1999). 

7. Cleary, M.A. et al. Production of complex nucleic acid libraries using highly parallel 

in situ oligonucleotide synthesis. Nat. Methods 1, 241–248 (2004). 

8. Ghindilis, A. et al. CombiMatrix oligonucleotide arrays: genotyping and gene 

expression assays employing electrochemical detection. Biosens. Bioelectron. 22, 

1853–1860 (2007). 

9. Yu, D. et al. An efficient recombination system for chromosome engineering in 

Escherichia coli. Proc. Natl. Acad. Sci. USA 97, 5978–5983 (2000). 

10. Murphy, K. Use of bacteriophage lambda recombination functions to promote gene 

replacement in Escherichia coli. J. Bacteriol. 180, 2063–2071 (1998). 

11. Zhang, Y., Buchholz, F., Muyrers, J. & Stewart, A.F. A new logic for DNA engineering 

using recombination in Escherichia coli. Nat. Genet. 20, 123–128 (1998). 

12. Shoemaker, D.D., Lashkari, D.A., Morris, D., Mittmann, M. & Davis, R.W. Quantitative 

phenotypic analysis of yeast deletion mutants using a highly parallel molecular 

bar-coding strategy. Nat. Genet. 14, 450–456 (1996). 

13. Cho, R.J. et al. Parallel analysis of genetic selections using whole genome 

oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 95, 3752–3757 (1998). 

14. Gill, R.T. et al. Genome wide screening for trait conferring genes using DNA microarrays. 


15. Lynch, M.D., Warnecke, T. & Gill, R.T. SCALEs: multiscale analysis of library 

enrichment. Nat. Methods 4, 87–93 (2007). 

16. Badarinarayana, V. et al. Selection analyses of insertional mutants using subgenic 

resolution arrays. Nat. Biotechnol. 19, 1060–1065 (2001). 

17. Giaever, G. et al. Functional profiling of the Saccharomyces cerevisiae genome. 


18. Ho, C.H. et al. A molecular barcoded yeast ORF library enables mode-of-action 

analysis of bioactive compounds. Nat. Biotechnol. 27, 369–377 (2009). 

19. Baba, T. et al. Construction of Escherichia coli K-12 in-frame, single-gene knockout 

mutants: the Keio collection. Mol. Syst. Biol. 2, 1–11 (2006). 

20. Kitagawa, M. et al. Complete set of ORF clones of Escherichia coli ASKA library 

(a complete set of E. coli K-12 ORF archive): unique resources for biological 

research. DNA Res. 12, 291–299 (2006). 

21. Datsenko, K. & Wanner, B. One-step inactivation of chromosomal genes in 

E. coli K12 using PCR products. Proc. Natl. Acad. Sci. USA 97, 6640–6645 

(2000). 

22. Ellis, H.M., Yu, D., DiTizio, T. & Court, D.L. High efficiency mutagenesis, repair, 

and engineering of chromosomal DNA using single-stranded oligonucleotides. 


23. Datta, S., Costantino, N., Zhou, X. & Court, D.L. Identification and analysis of 

recombineering functions from Gram-negative and Gram-positive bacteria and their 

phages. Proc. Natl. Acad. Sci. USA 105, 1626–1631 (2008). 

24. Wang, H. et al. Programming cells by multiplex genome engineering and accelerated

evolution. Nature 460, 894–898 (2009). 

25. Lutz, R. & Bujard, H. Independent and tight regulation of transcriptional units in 

Escherichia coli via the LacR/O, the TetR/O and AraC/I1–I2 regulatory elements.

Nucleic Acids Res. 25, 1203–1210 (1997). 

26. Shine, J. & Dalgarno, L. The 3′-terminal sequence of Escherichia coli 16S ribosomal 

RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. 

Acad. Sci. USA 71, 1342–1346 (1974). 

27. Kimura, M., Takatsuki, A. & Yamaguchi, I. Blasticidin S deaminase gene from

Aspergillus terreus (BSD): a new drug resistance gene for transfection of mammalian 

cells. Biochim. Biophys. Acta 1219, 653–659 (1994). 

28. Pierce, S.E. et al. A unique and universal molecular barcode array. Nat. Methods 

3, 601–603 (2006). 

29. Baudin, A., Ozier-Kalogeropoulos, O., Denouel, A., Lacroute, F. & Cullin, C. A simple

and efficient method for direct gene deletion in Saccharomyces cerevisiae. 

Nucleic Acids Res. 21, 3329–3330 (1993). 

30. Markham, N.R. & Zuker, M. DINAMelt web server for nucleic acid melting prediction. 

Nucleic Acids Res. 33, W577–W581 (2005). 

31. Defez, R. & de Felice, M. Cryptic operon for beta-glucoside metabolism in 

Escherichia coli K12: genetic evidence for a regulatory protein. Genetics 97, 11–25 

(1981). 

32. Meijnen, J.P., de Winde, J.H. & Ruijssenaars, H.J. Engineering Pseudomonas putida 

S12 for efficient utilization of D-xylose and L-arabinose. Appl. Environ. Microbiol.

74, 5031–5037 (2008). 

33. Zhu, M.M., Skraly, F.A. & Cameron, D.C. Accumulation of methylglyoxal in 

anaerobically grown Escherichia coli and its detoxification by expression of the 

Pseudomonas putida glyoxalase I gene. Metab. Eng. 3, 218–225 (2001).

34. Gort, A.S., Ferber, D.M. & Imlay, J.A. The regulation and role of the periplasmic copper, 

zinc superoxide dismutase of Escherichia coli. Mol. Microbiol. 32, 179–191 (1999). 

35. Sutton, A., Newman, T., Francis, M. & Freundlich, M. Valine-resistant Escherichia 

coli K-12 strains with mutations in the ilvB operon. J. Bacteriol. 148, 998–1001 

(1981). 

36. Weinstock, O., Sella, C., Chipman, D.M. & Barak, Z. Properties of subcloned 

subunits of bacterial acetohydroxy acid synthases. J. Bacteriol. 174, 5560–5566 

(1992). 

37. Wessler, S.R. & Calvo, J.M. Control of leu operon expression in Escherichia coli by 

a transcription attenuation mechanism. J. Mol. Biol. 149, 579–597 (1981). 

38. Sycheva, E.V. et al. Overproduction of noncanonical amino acids by Escherichia 

coli cells. Microbiology 76, 712–718 (2007). 

39. Chen, S.F., Mowery, R.A., Castleberry, V.A., van Walsum, G.P. & Chambliss, C.K. 

High-performance liquid chromatography method for simultaneous determination 

of aliphatic acid, aromatic acid and neutral degradation products in biomass 

pretreatment hydrolysates. J. Chromatogr. A 1104, 54–61 (2006). 

40. Mohagheghi, A. & Schell, D.J. Impact of recycling stillage on conversion of dilute sulfuric 

acid pretreated corn stover to ethanol. Biotechnol. Bioeng. 105, 992–996 (2010). 

41. Ferrante, A.A., Augliera, J., Lewis, K. & Klibanov, A.M. Cloning of an organic 

solvent-resistance gene in Escherichia coli: the unexpected role of alkylhydroperoxide 

reductase. Proc. Natl. Acad. Sci. USA 92, 7617–7621 (1995). 

42. Poole, L.B. Bacterial defenses against oxidants: mechanistic features of cysteine-based

peroxidases and their flavoprotein reductases. Arch. Biochem. Biophys. 433, 

240–254 (2005). 

43. Seaver, L.C. & Imlay, J.A. Alkyl hydroperoxide reductase is the primary scavenger 

of endogenous hydrogen peroxide in Escherichia coli. J. Bacteriol. 183, 7173–7181 

(2001). 

44. Kohanski, M.A., Dwyer, D.J., Hayete, B., Lawrence, C.A. & Collins, J.J. A common 

mechanism of cellular death induced by bactericidal antibiotics. Cell 130, 797–810 

(2007). 

45. Lu, T.K., Khalil, A.S. & Collins, J.J. Next-generation synthetic gene networks. 


46. Datta, S., Costantino, N. & Court, D.L. A set of recombineering plasmids for

gram-negative bacteria. Gene 379, 109–115 (2006). 

47. Pierce, S.E., Davis, R.W., Nislow, C. & Giaever, G. Genome-wide analysis of barcoded 

Saccharomyces cerevisiae gene-deletion mutants in pooled cultures. Nat. Protoc. 

2, 2958–2974 (2007). 

48. Nour-Eldin, H.H., Hansen, B.G., Norholm, M.H.H., Jensen, J.K. & Halkier, B.A. 

Advancing uracil-excision based cloning towards an ideal technique for cloning PCR 

fragments. Nucleic Acids Res. 34, e122 (2006). 

49. Dean, F.B., Nelson, J.R., Giesler, T.L. & Lasken, R.S. Rapid amplification of plasmid 

and phage DNA using Phi29 DNA polymerase and multiply-primed rolling circle 

amplification. Genome Res. 11, 1095–1099 (2001). 

50. Perni, S., Andrew, P.W. & Shama, G. Estimating the maximum growth rate from microbial 

growth curves: definition is everything. Food Microbiol. 22, 491–495 (2005). 




Strains, DNA and reagents. Escherichia coli MG1655 (wild type) was obtained 

from ATCC 700926. Genomic sequences were obtained from GenBank 

U00096.2, and gene annotation was from the Ecogene database version 2.20 

(http://www.ecogene.org/). Pseudogenes and insertion elements were excluded 

from the protein-coding genes that were targeted. The kanamycin-resistant 

control strain (also called JWKAN) was constructed from E. coli ATCC 700926, 

with nucleotide 3,909,796 replaced with a barcoded kanamycin cassette 21 

(Supplementary Notes). Up and down DNA cassettes were constructed using 

PCR and cloned into the pEM7/BSD plasmid (Invitrogen, Supplementary 

Notes). Oligonucleotide libraries were purchased from Agilent; all other oligonucleotides 

were purchased from Integrated DNA Technologies with standard 

desalting except where noted. The pSIM5 plasmid 46 was a gift from D. Court. 

All reagents were obtained from common commercial sources. All enzymes 

were from New England Biolabs except where noted. All sequencing was performed 

by Macrogen USA or Eurofins MWG Operon. Recipes and additional 

information can be found in Supplementary Notes. 

Preparation of synthetic DNA and recombineering. A portion of the oligonucleotide 

library provided by Agilent (8,154 unique 189-mers) was amplified 

by two rounds of PCR. Products were treated with the USER enzymes (New 

England Biolabs), purified and ligated to the up cassette. Rolling-circle amplification, 

nuclease treatment and purification resulted in 8–10 μg synDNA. This 

procedure was also carried out in parallel to separately generate TRMR down 

synDNA. More details are available in Supplementary Notes. 

E. coli cells containing the recombineering plasmid pSIM5 were grown 

in 800 ml SOB cultures at 30 °C and made recombineering proficient with 

minor modifications to reported methods 46 . Briefly, when cells reached an 

optical density at 600 nm of 0.7, flasks were transferred to water baths at 42 °C 

to induce the λRed enzymes for 15 min. Flasks were then transferred to an 

ice-water bath and cells were kept close to 4 °C for the remaining steps. Cells 

were collected by centrifugation and suspended with cold deionized water. Cell 

collection and washing was repeated once more, then cells were suspended to a 

final volume of 6.4 ml in water. Aliquots of cells (400 μl) were transformed in 

a 0.2-cm electrocuvette with approximately 1 μg of up or down synDNA and a 

pulse of 12.5 kV cm −1 . Transformation was carried out eight times to generate 

the up allele library and eight times to generate the down allele library. The 

cells from each transformation were recovered in 12 ml SOC medium for 1 h 

at 37 °C. Cells were collected by centrifugation and resuspended in 30 ml MA 

salts (Supplementary Notes). Centrifugation and resuspension was repeated 

twice more, with the final resuspension to a volume of 2 ml in MA salts. The 

up and down allele libraries were separately spread onto a total of 40 low-salt 

LB agar plates containing blasticidin-S (90 μg ml −1 ) and allowed to grow at 

37 °C for 22 h. Colonies were scraped from the agar plates and up and down 

allele libraries were each suspended in a total of 35 ml LB. Cells were collected 

by centrifugation and suspended to 3 × 10 9 cells per milliliter in LB medium 

containing 16% (vol/vol) glycerol and blasticidin-S (90 μg ml −1 ). Aliquots of 

the up or down cell mixtures were stored at −80 °C. 
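
As a quick consistency check on the electroporation step above, the following minimal sketch tallies aliquots and synDNA. It assumes that a single 6.4 ml cell suspension serves both libraries, which the text does not state explicitly; the function and constant names are illustrative only.

```python
# Bookkeeping for the electroporation step; every number comes from the text above.
FINAL_CELL_VOLUME_UL = 6400         # washed cells suspended to 6.4 ml
ALIQUOT_VOLUME_UL = 400             # cells per 0.2-cm electrocuvette
TRANSFORMATIONS_PER_LIBRARY = 8     # eight for the up library, eight for the down library
SYNDNA_PER_TRANSFORMATION_UG = 1.0  # "approximately 1 ug" of synDNA per cuvette


def aliquots_available(total_ul: int, aliquot_ul: int) -> int:
    """Number of whole aliquots that can be drawn from the cell suspension."""
    return total_ul // aliquot_ul


def syndna_required_ug(transformations: int, ug_each: float) -> float:
    """Approximate synDNA mass consumed in building one library."""
    return transformations * ug_each


available = aliquots_available(FINAL_CELL_VOLUME_UL, ALIQUOT_VOLUME_UL)
used = 2 * TRANSFORMATIONS_PER_LIBRARY
per_library_ug = syndna_required_ug(TRANSFORMATIONS_PER_LIBRARY, SYNDNA_PER_TRANSFORMATION_UG)
print(f"aliquots available: {available}, aliquots used: {used}")
print(f"synDNA consumed per library: about {per_library_ug} ug")
```

Under these assumptions the 16 aliquots are exactly consumed by the 16 transformations, and each library uses roughly 8 μg of synDNA, which is within the 8–10 μg yield reported for the preparation.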

Screens and selections. Freezer stocks were used to inoculate 50 ml low-salt 

LB medium containing 80 μg ml −1 blasticidin-S with 5 × 10 8 TRMR up cells 

and 5 × 10 8 TRMR down cells. This culture was allowed to grow with shaking 

at 37 °C to an optical density at 600 nm of 0.8. The cells were centrifuged 

at 4,500g for 6 min, decanted and suspended in 30 ml of MA salts. The cells 

were collected once more by centrifugation and suspended in MA salts to a 

concentration of 5 × 10 8 cells per milliliter. The JWKAN cells were added to a 

final concentration of 7.7 × 10 4 cells per milliliter. A 1.7 ml aliquot of the cell 

library (called the recovery culture) was frozen for microarray analysis, and 

the remainder was used for various growth selections. 

Liquid selections were carried out with shaking at 37 °C in 600 ml of MOPS 

minimal medium containing 2 mM phosphate and 4% (wt/vol) glucose or in 

600 ml LB medium. Each medium was inoculated with 2.4 × 10 8 cells from a 

recovery culture and allowed to grow to an optical density at 600 nm of 1.0–1.2. 

Cells were collected from each culture by centrifugation of 10-ml aliquots 

at 4,500g for 6 min, decanted and stored at −80 °C for microarray analysis. 

Growth results are the average of three array hybridizations. 
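
The cell-count arithmetic in the two paragraphs above can be sketched as follows. This is an illustration only: it assumes the recovery culture density is effectively 5 × 10 8 cells per milliliter (the JWKAN spike adds a negligible amount), and the variable names are not from the original work.

```python
# Cell densities in cells per milliliter, taken from the text above.
LIBRARY_DENSITY = 5e8        # TRMR library in MA salts
JWKAN_DENSITY = 7.7e4        # kanamycin-resistant control spiked into the library
LIQUID_INOCULUM_CELLS = 2.4e8
LIQUID_CULTURE_ML = 600.0


def control_fraction(control_density: float, library_density: float) -> float:
    """Approximate ratio of control cells to library cells in the recovery culture."""
    return control_density / library_density


def inoculum_volume_ml(cells_needed: float, density: float) -> float:
    """Volume of recovery culture that contains the requested number of cells."""
    return cells_needed / density


def starting_density(cells: float, culture_ml: float) -> float:
    """Cells per milliliter immediately after inoculation."""
    return cells / culture_ml


print(f"JWKAN : library ratio = {control_fraction(JWKAN_DENSITY, LIBRARY_DENSITY):.2e}")
print(f"inoculum volume = {inoculum_volume_ml(LIQUID_INOCULUM_CELLS, LIBRARY_DENSITY):.2f} ml")
print(f"starting density = {starting_density(LIQUID_INOCULUM_CELLS, LIQUID_CULTURE_ML):.1e} cells per ml")
```

With these inputs the control strain is present at roughly 1.5 cells per 10,000 library cells, each 600 ml selection receives about 0.48 ml of recovery culture, and the selection starts at about 4 × 10 5 cells per milliliter.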

Hydrolysate growth selections were carried out in various dilutions of 

hydrolysate in minimal media (15%, 16%, 17%, 18%, 19% and 20%). During 

selections, cell samples were taken for microarray analysis of populations, and 

cells were plated to isolate and identify individual alleles growing as colonies. 

Unique alleles from selections were identified and confirmed by PCR and 

studied for growth characteristics in hydrolysate. All growth curves were done 

in complete triplicate. More details are available in Supplementary Notes. 

Growth on various selective agars was carried out by spreading a total of 0.7 × 

10 8 cells of the allele mixtures recovered from freezer aliquots on five plates 

for each selective condition (salicin, d-fucose plus l-arabinose, valine, and 

methylglyoxal; plate recipes in Supplementary Notes). Plates were incubated 

at 37 °C until colonies were visible (1–3 d). Selection for galK down alleles was 

carried out on plates containing 2-deoxygalactose 9 , and screens were carried 

out on MacConkey agar containing 1% (wt/vol) d-galactose. Screens of lacZ 

up alleles were carried out on LB agar plates containing 0.2% (wt/vol) glucose 

and 40 μg ml −1 X-gal. Screens of lacZ down alleles were carried out on LB 

agar plates containing 0.05% (wt/vol) IPTG and 40 μg ml −1 X-gal. Selections 

for control cells were carried out on LB agar plates containing kanamycin 

(30 μg ml −1 ). 

Microarray tracking. Genomic DNA was extracted from ~10 9 E. coli cells 

using the PureLink Genomic Mini kit (Invitrogen). Barcode tags were amplified in 

300 μl PCR reactions (final concentrations: 1× PCR buffer, 2.5 mM MgCl 2 , 

0.2 mM each dNTP, 1 μM each primer 5′-GTAGCACACGAGGTCTCT-3′ and 

Biotin-5′-TACGACTCACTATAGGGAGA-3′, 0.6 U μl −1 Taq polymerase and 

0.5 μg genomic DNA or 30 pg synDNA). Reactions were cycled 25 times with 

an annealing temperature of 55 °C. Barcode tags were purified by agarose gel 

electrophoresis and extraction using the QIAquick gel extraction protocols 

(Qiagen, substitute buffer QX1 for QG). Tag purification was shown to reduce 

background hybridization. Microarray hybridizations to the Geneflex Tag4 

16K V2 array (Affymetrix) were carried out according to published procedures 

47 with the following modifications: 600 ng of purified tags (combined 

up tags and down tags) were hybridized along with ten tags (amplified and 

purified as above) included at known concentrations (0.5 pM to 10 nM). 

Intensity values were calculated for each tag after removal of replicate outliers 

and averaging of unmasked replicates using software (raw_file_maker.pl) that 

can be downloaded from http://chemogenomics.stanford.edu/supplements/ 

04tag/download.html. Background hybridization was calculated from the 

average intensity of 1,642 unused tag probes; threshold intensity was set to 

background hybridization plus 2 s.d. The intensities of the ten spiked tags 

were used to calculate allele concentrations from array signals and correct for 

array saturation (Supplementary Fig. 2). Barcode frequencies were calculated 

by dividing barcode concentrations by the total concentration of all barcodes 

detected on the array. 
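
The thresholding and frequency steps described above can be illustrated with a minimal sketch. It is not the raw_file_maker.pl software and it does not reproduce the spiked-tag calibration, whose exact form is not given here; it assumes that tag concentrations have already been calibrated, that the sample standard deviation is used, and that a tag counts as detected when its intensity exceeds the threshold. All names and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def detection_threshold(unused_probe_intensities):
    """Background (mean of unused tag probes) plus 2 standard deviations."""
    background = mean(unused_probe_intensities)
    return background + 2 * stdev(unused_probe_intensities)  # sample s.d. is an assumption


def barcode_frequencies(intensities, concentrations, threshold):
    """Divide each detected barcode's concentration by the total over detected barcodes."""
    detected = {
        tag: concentrations[tag]
        for tag, intensity in intensities.items()
        if intensity > threshold  # strict comparison is an assumption
    }
    total = sum(detected.values())
    return {tag: conc / total for tag, conc in detected.items()}


# Toy data, invented for illustration only.
unused = [90, 100, 110]
intensities = {"tagA": 500, "tagB": 1500, "tagC": 110}
concentrations = {"tagA": 1.0, "tagB": 3.0, "tagC": 0.1}  # already calibrated, arbitrary units

threshold = detection_threshold(unused)
frequencies = barcode_frequencies(intensities, concentrations, threshold)
print(threshold, frequencies)
```

In this toy case tagC falls below the threshold and is excluded, so the frequencies of the two remaining tags sum to 1.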

doi:10.1038/nbt.1653 


l e t t e r s 

Implications of the presence of N-glycolylneuraminic 

acid in recombinant therapeutic glycoproteins 

Darius Ghaderi 1,2 , Rachel E Taylor 1 , Vered Padler-Karavani 1 , Sandra Diaz 1 & Ajit Varki 1 


Recombinant glycoprotein therapeutics produced in nonhuman 

mammalian cell lines and/or with animal serum are often 

modified with the nonhuman sialic acid N-glycolylneuraminic 

acid (Neu5Gc; refs. 1,2). This documented contamination 

has generally been ignored in drug development because 

healthy individuals were not thought to react to Neu5Gc 

(ref. 2). However, recent findings indicate that all humans 

have Neu5Gc-specific antibodies, sometimes at high levels 3,4 . 

Working with two monoclonal antibodies in clinical use, 

we demonstrate the presence of covalently bound Neu5Gc 

in cetuximab (Erbitux) but not panitumumab (Vectibix). 

Anti-Neu5Gc antibodies from healthy humans interact with 

cetuximab in a Neu5Gc-specific manner and generate immune 

complexes in vitro. Mice with a human-like defect in Neu5Gc 

synthesis generate antibodies to Neu5Gc after injection with 

cetuximab, and circulating anti-Neu5Gc antibodies can 

promote drug clearance. Finally, we show that the Neu5Gc 

content of cultured human and nonhuman cell lines and their 

secreted glycoproteins can be reduced by adding a human 

sialic acid to the culture medium. Our findings may be relevant 

to improving the half-life, efficacy and immunogenicity of 

glycoprotein therapeutics. 

Therapeutic glycoproteins, including antibodies, growth factors, 

cytokines, hormones and clotting factors, generate sales with annual 

double-digit growth rates 5 . They must often be produced in mammalian 

expression systems because of the crucial influence of the location, 

number and structure of N-glycans on their yields, bioactivity, solubility, 

stability against proteolysis, immunogenicity and rate of clearance 

from the bloodstream 6–8 . 

Two differences between the protein glycosylation apparatus of 

humans and rodents account for major potential differences between 

the N-glycans on glycoproteins made in cultured human cells and 

those made using rodent cell lines. First, humans cannot synthesize a 

terminal Galα1-3Gal motif (known as alpha-Gal) on N-glycans. As a 

consequence, they express antibodies against this structure 9 . Second, 

unlike other mammals, humans cannot biosynthesize the sialic acid 

Neu5Gc because the human gene CMAH, encoding CMP-N-acetylneuraminic 

acid hydroxylase, the enzyme responsible for producing 

CMP-Neu5Gc from CMP-N-acetylneuraminic acid (CMP-Neu5Ac), 

is irreversibly mutated 10 . The use of cultured human cells to address 

this issue is not a solution, as Neu5Gc can be taken up from animal 

products present in the culture medium and then metabolically incorporated 

into secreted glycoproteins 11 . 

Owing largely to limitations of the assays originally used to detect 

anti-Neu5Gc antibodies, including the fact that only a small number 

of possible Neu5Gc-containing epitopes were tested, healthy humans 

were long believed to show no immune reaction to Neu5Gc (ref. 2). 

Subsequent reports that all humans possess anti-Neu5Gc antibodies 3 , 

sometimes at high levels, approaching 0.1–0.2% of circulating IgG 3,4 , 

have led to re-evaluation of the potential significance of Neu5Gc 

contamination 7,8 . Especially in light of trends toward administering 

increasingly higher amounts of certain biotherapeutics over longer 

periods of time, some biopharmaceutical companies are exploring 

steps to reduce levels of Neu5Gc in their products 12 . 

Given that they are produced using nonhuman cell lines, animal 

serum or serum-derived factors, or a combination of these, it is likely 

that most recombinant therapeutic glycoproteins carry some Neu5Gc. 

However, given the diversity of products and production protocols, 

it is difficult to make generalizations. Thus, we chose to compare 

two US Food and Drug Administration (FDA)-approved monoclonal 

antibodies with the same therapeutic target, the EGF receptor. The 

first, Erbitux (cetuximab, obtained from the University of California, 

San Diego Pharmacy), is a chimeric antibody produced in mouse 

myeloma cells 13,14 . The second, Vectibix (panitumumab, obtained 

from Amgen), is a fully human antibody produced in Chinese 

hamster ovary (CHO) cells 15 . The samples studied were preparations 

that would normally be administered to patients. 

We first performed enzyme-linked immunosorbent assays (ELISAs) 

using an affinity-purified polyclonal chicken Neu5Gc-specific antibody 

preparation that is highly monospecific for Neu5Gc (ref. 16, 

alongside a nonreactive control IgY). Bound Neu5Gc was easily 

detectable on cetuximab but not on panitumumab (Fig. 1a). Sialidase 

pretreatment abolished binding, confirming specificity. Western blot 

analysis also showed sialidase-sensitive anti-Neu5Gc IgY reactivity 

on the heavy chains of cetuximab but not those of panitumumab 

(Fig. 1b). The specificity of anti-Neu5Gc IgY binding was reaffirmed 

by pretreatment with mild sodium periodate under conditions that 

selectively cleave sialic acid side chains (Fig. 1c) and abolish reactivity 

of such antibodies 3,16 . Finally, we quantified sialic acids on the therapeutic 

antibodies, as described in Online Methods. Panitumumab 

carries 0.22 mol of sialic acids per mole of protein, with


[Figure 1 graphic, panels a–f. Panels a, c and d plot A 495 for Cet and Pan probed with anti-Neu5Gc IgY or control IgY (a, c) or with human anti-Neu5Gc IgG or control human IgG (d), after active or heat-inactivated sialidase (a) or after periodate or mock treatment (c, d). Panels b and e are blots of Cet and Pan with (+) or without (−) sialidase, shown with Coomassie staining (b). Panel f plots concentration (ng µl −1 ) for Cet, Pan and no antibody in sera S34 and S30.]

Figure 1 ELISA and western blot detection of Neu5Gc on biotherapeutic antibodies by Neu5Gc IgY antibodies from chickens or IgG antibodies from 

normal human serum. Cetuximab (Cet) and panitumumab (Pan) were treated with active sialidase to eliminate sialic acid epitopes or with heat-inactivated 

sialidase as control. (a,b) Samples were used for ELISA (a) or western blot (b), in which Neu5Gc was detected using an affinity-purified 

chicken anti-Neu5Gc IgY or control IgY. A 495 , absorbance at 495 nm. ***P < 0.001, paired two-tailed t-test. (c) In an additional ELISA, Cet and Pan 

were used for coating, then blocked, and sialic acid epitopes were modified chemically using mild sodium metaperiodate pretreatment. The reaction 

was stopped using sodium borohydride. As a control, periodate and borohydride were mixed and then added to the wells (the borohydride inactivates the 

periodate). ELISA samples were studied at least in triplicate and data shown are means ± s.d. ***P < 0.001, paired two-tailed t-test. (d) Cet and Pan 

were pretreated with mild periodate as in c and used to coat ELISA wells before blocking and incubation with human anti-Neu5Gc IgG that had been 

purified from the serum of healthy humans and biotinylated as previously described 4 . Samples were studied in triplicate and data shown are means ± 

s.d. ***P < 0.001, paired two-tailed t-test. (e) Cet and Pan (1 μg each) were treated with sialidase or heat-inactivated sialidase as in a, separated by 

SDS-PAGE, Coomassie-stained, blotted (see b), and Neu5Gc detected using biotinylated human anti-Neu5Gc IgG. (f) Immune complex formation with 

Cet or Pan in whole human serum was detected using the CIC (C1Q) ELISA Kit (Buehlmann) as described in the manufacturer’s guidelines. Absorbance 

was measured at 405 nm. Samples were studied in triplicate and data shown are means ± s.d. **P < 0.01, paired two-tailed t-test. Gels in b and e were 

cropped for clarity of presentation. Full-length blots and gels are presented in Supplementary Figures 1–4. 

In contrast, cetuximab carries 1.84 mol of sialic acids per mole of 

protein, mostly as Neu5Gc (see Supplementary Table 1). The differences 

probably reflect different cell-expression systems. For example, 

in contrast to CHO cells, murine myeloma cell lines express a greater 

proportion of sialic acids as Neu5Gc (ref. 17; see Supplementary 

Tables 2 and 3 for a listing of other potential examples). Pull-down 

assays of cetuximab with SNA-agarose (modified with the lectin 

Sambucus nigra agglutinin, which recognizes α2-6-linked sialic acids), 

followed by ELISAs of unbound proteins, showed that only about half 

of cetuximab molecules actually carry bound sialic acids and Neu5Gc 

(data not shown). Such heterogeneity is typical for glycoproteins. 

To address the potential significance of the high levels of anti- 

Neu5Gc antibodies found in certain humans, we affinity purified anti- 

Neu5Gc antibodies from normal human sera and biotinylated them 

exactly as previously described 4 , before their analysis using ELISA 

and western blotting assays (Fig. 1d,e). As with Neu5Gc-specific 

chicken IgY, these affinity-purified human Neu5Gc-specific antibodies 

reacted with cetuximab but not with panitumumab. Again, 

reactivity was abrogated by pretreatment with mild sodium periodate 

(Fig. 1d) or sialidase (Fig. 1e). 

To further address potential clinical relevance, we studied whether 

addition of cetuximab to normal human sera is capable of promoting 

the formation of immune complexes (Fig. 1f). Cetuximab formed 

immune complexes in a human serum with high levels of anti-Neu5Gc 

antibodies (serum S34; ref. 4) but not in a low-titer serum (serum S30; 

ref. 4). In contrast, we detected no formation of immune complexes 

with either serum in the presence of panitumumab. Assuming that 

similar interactions occur between cetuximab and circulating anti- 

Neu5Gc antibodies in humans, these complexes could potentially fix 

complement and cause untoward reactions in some patients and/or 

affect half-life, possibly explaining some reported clinical differences 

between cetuximab and panitumumab 13,15 . 

We next evaluated whether Neu5Gc affects clearance rate when circulating 

anti-Neu5Gc antibodies are present. To mimic the situation 

in humans, we used mice with a human-like defect in the Cmah 

gene, which encodes the enzyme that generates activated Neu5Gc 

(CMP-Neu5Gc) 18 . Such mice can make anti-Neu5Gc antibodies upon 

immunization with glycosidically bound, but not free, Neu5Gc 19–21 . 

However, the previous studies reporting these mouse anti-Neu5Gc 

antibodies used whole rodent or chimpanzee cells for immunization 

19,20 , an artificial approach. In contrast, feeding of Neu5Gc (which 

is present in mouse chow) does not induce a human-like immune 

response in the mutant mice 21 . We could not immunize the mice with 

cetuximab itself, as other antibodies directed against the partly human 

IgG protein backbone would confound any results. To most closely 

mimic the situation in humans, we therefore immunized with Neu5Gc-loaded 

Haemophilus influenzae (see Online Methods and ref. 21; 

this is very similar to the mechanism by which human Neu5Gc-specific 

antibodies appear to be generated naturally 21 ). Given the great variability 

in isotypes and affinities of the naturally occurring human 

anti-Neu5Gc antibodies, as well as their different relative reactivities 

against various Neu5Gc-containing antigens 4 , it is impractical to 

model all possible human conditions. We therefore chose to mimic 

a situation in a human with relatively high levels of the IgG antibodies 

against the kind of Neu5Gc epitope (Neu5Gcα2-6Galβ1-4Glc-) 

found in cetuximab 22 . It also happens that this epitope is commonly 

recognized by human anti-Neu5Gc antibodies 4 . 

Each of the therapeutic antibodies, cetuximab and panitumumab, 

was injected intravenously at levels estimated to ensure a concentration 

of 1 μg ml −1 in the extracellular fluid volume according to mouse 

body weight 23 . Next, sera pooled from naïve, control-immunized or 

Neu5Gc-immunized syngeneic mice were passively transferred via 

intraperitoneal injection, ensuring equal starting concentrations of 

circulating Neu5Gc-specific antibodies. Anti-Neu5Gc IgG levels in 

the pooled sera from Neu5Gc-immunized mice were quantified using 

ELISA with a Neu5Gcα2-6Galβ1-4Glc-conjugate as a target, as previously 

described 4 (97.5 μg ml −1 , data not shown). The amount of 

pooled antibody injected was then calculated to achieve an approximate 




Figure 2 Effects of Neu5Gc-specific antibodies on the kinetics of 

therapeutic antibodies in mice with a human-like Neu5Gc deficiency, 

levels of anti-Neu5Gc IgG in mice after injections of the therapeutic 

antibodies, and binding of IgG Neu5Gc-specific antibodies from whole 

human serum to Neu5Gc on the Fab fragment of cetuximab. (a) Cmah-null 

mice were first injected intravenously with either of the therapeutic 

antibodies, cetuximab (Cet) or panitumumab (Pan). Serum from Cmah-null 

mice containing anti-Neu5Gc antibodies (or serum from naïve mice or 

control-immunized mice) was then passively transferred by intraperitoneal 

injection. Mice were bled periodically after the passive transfer of serum. 

Concentrations of Cet or Pan in the isolated sera were determined by 

sandwich ELISA. Absorbance was measured at 495 nm. The y axis starts 

at 60% to better display the difference in kinetics. Error bars, s.d.; 

***P < 0.001, unpaired two-tailed t-test. (b) Cmah-null mice were injected 

intravenously with Cet, Pan or mouse IgG weekly and were bled initially 

and after the third intravenous injection. To detect Neu5Gc-specific 

antibodies by ELISA, we coated wells with human (Neu5Gc-deficient) 

or chimpanzee (Neu5Gc-positive) serum glycoproteins (upper chart), or 

alternatively with human or bovine fibrinogen (lower chart). Data were 

obtained in triplicate. (c) Fab fragments of Cet and Pan were isolated using 

the Pierce Fab Preparation Kit according to the manufacturer’s manual for use as target molecules in ELISA (1 μg per well). Sialic acid–specific binding 

was determined using mild sodium metaperiodate pretreatment. Wells were then blocked and incubated with human sera (S30 and S34, with low and 

high anti-Neu5Gc IgG titers, respectively 4 ). Binding of human IgG was detected using anti-human IgG-Fc. Absorbance was measured at 490 nm and 

ELISA samples were studied in triplicate. Error bars, s.d.; *P < 0.05, paired two-tailed t-test. 

starting concentration of 4 μg ml −1 IgG in the extracellular fluid 

volumes of the mice, which is about a fourfold excess of anti-Neu5Gc 

antibodies compared to the injected drug in the mice, and similar to 

levels found in some humans 4 . 
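
The dosing logic for the passive transfer can be sketched as below. The target concentrations and the pooled-serum IgG concentration are from the text; the extracellular fluid volume per gram of body weight is a placeholder assumption, because the value taken from ref. 23 is not stated here, and the function names are illustrative.

```python
DRUG_TARGET_UG_PER_ML = 1.0        # cetuximab or panitumumab in extracellular fluid
ANTIBODY_TARGET_UG_PER_ML = 4.0    # anti-Neu5Gc IgG in extracellular fluid
POOLED_SERUM_IGG_UG_PER_ML = 97.5  # anti-Neu5Gc IgG measured in pooled immune serum

# Hypothetical value for illustration only; the study derived this from ref. 23.
ASSUMED_ECF_ML_PER_G = 0.2


def ecf_volume_ml(body_weight_g: float, ecf_ml_per_g: float = ASSUMED_ECF_ML_PER_G) -> float:
    """Estimated extracellular fluid volume for a mouse of the given weight."""
    return body_weight_g * ecf_ml_per_g


def serum_volume_ml(body_weight_g: float) -> float:
    """Volume of pooled serum needed to reach the antibody target concentration."""
    igg_needed_ug = ANTIBODY_TARGET_UG_PER_ML * ecf_volume_ml(body_weight_g)
    return igg_needed_ug / POOLED_SERUM_IGG_UG_PER_ML


def fold_excess() -> float:
    """Ratio of anti-Neu5Gc IgG to drug, both expressed in ug per ml."""
    return ANTIBODY_TARGET_UG_PER_ML / DRUG_TARGET_UG_PER_ML


weight_g = 25.0  # example mouse weight, also an assumption
print(f"serum to transfer: {serum_volume_ml(weight_g):.3f} ml")
print(f"antibody : drug ratio = {fold_excess():.0f}")
```

The fourfold excess follows directly from the two target concentrations and does not depend on the assumed fluid volume; only the absolute serum volume does.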

Clearance was monitored by a sandwich ELISA specific for human 

IgG-Fc. Although both drugs had a similar clearance rate in mice 

pre-injected with serum from naïve or control-immunized mice, 

circulating levels of cetuximab decreased significantly (P < 0.001) 

when Neu5Gc-specific antibodies were pre-injected (Fig. 2a). 

Assuming that a similar interaction between cetuximab and 

circulating anti-Neu5Gc antibodies occurs in patients, there could 

be relevant effects on clearance rate and efficacy. This might help to 

explain the wide range of half-life values reported for such antibodies 

in clinical studies 14,15 . 

To further simulate the clinical situation, we injected equal 

amounts of cetuximab or panitumumab intravenously into Neu5Gc-deficient 

Cmah −/− mice in typical human dosages (4 μg per gram of 

body weight) at weekly intervals. To exclude any effect of the human 

portion of the protein (cetuximab) or of the fully human protein 

(panitumumab) in mice, we also injected murine IgG as a positive 

control, as it happens to carry Neu5Gc as the predominant sialic 

acid (Supplementary Table 1). Notably, cetuximab and murine IgG 

(but never panitumumab) induced a Neu5Gc-specific IgG immune 

response (Fig. 2b). As with humans, responses of individual mice 

varied greatly, and more positive signals were obtained with the 

Neu5Gc epitope mixture found in chimp serum than that in bovine 

fibrinogen. Thus, even patients without pre-existing high levels of 

anti-Neu5Gc antibodies may be at risk of developing them after injection 

of Neu5Gc-carrying agents, potentially affecting the outcome 

of subsequent injections. Moreover, repeated injections of Neu5Gc-carrying 

agents could result in the accumulation of this nonhuman 

sugar in human tissues. Together with Neu5Gc-specific antibodies, 

accumulation of Neu5Gc in tissues can mediate chronic inflammation 

and potentially facilitate progression of diseases such as cancer 19 and 

atherosclerosis 24 . Thus, chronic use of Neu5Gc-bearing therapeutics 

might increase future risk of such diseases. 

[Figure 2 graphic, panels a–c. Panel a plots circulating Cet or Pan (normalized) against hours for mice given naïve mouse serum, serum of control-immunized mice or serum of Neu5Gc-immunized mice. Panel b plots change in A 495 for mIgG, Pan and Cet. Panel c plots A 495 for Fab Cet and Fab Pan after periodate or mock treatment, with sera S34 and S30.]

Finally, we studied direct binding of anti-Neu5Gc antibodies from 

whole human sera to both cetuximab and panitumumab. To avoid 

excessive cross-reactivity involving the secondary reagent, we prepared 

Fab fragments of both of the agents, used them to coat ELISA 

plate wells, exposed them to human sera and then detected serum 

antibody binding with a human IgG-Fc–specific secondary antibody 

(note that cetuximab is known to have an additional glycosylation 

site in the V-region 21 ). We detected mild periodate–sensitive binding 

of serum IgG from a high–anti-Neu5Gc titer serum (S34, ref. 4, 

which had >15 μg ml −1 IgG antibodies against Neu5Gcα2-6Galβ1- 

4Glc-) to the Fab fragments of cetuximab and not to those of panitumumab 

(Fig. 2c). In contrast, incubation with another human serum 

containing very low Neu5Gc-antibodies (serum S30, ref. 4, which had 


[Figure 3 graphic, panels a–e. Panels a–d plot Neu5Gc (% of total sialic acids) for control and 5 mM Neu5Ac cultures: ethanol-soluble fraction (a) and ethanol-precipitable fraction (b) from day 0 to day 5, and secreted protein (c) and membrane fraction (d) at days 3, 5 and 7. Panel e is a blot with (+) or without (−) 5 mM Neu5Ac, with size markers at 197, 125 and 83 kDa.]

Figure 3 An approach to reducing Neu5Gc contamination in biotherapeutic products. (a,b) Human 293T cells were grown in the presence of 5 mM Neu5Gc 

for 3 d. The cells were then washed with PBS and split into two identical cultures, and 5 mM Neu5Ac was added to one of the cultures. Cells were harvested 

as described in Online Methods, and the Neu5Gc and Neu5Ac content of both the ethanol-soluble (a) and ethanol-precipitable proteins (b) was analyzed 

by HPLC. (c–e) Feeding of CHO cells with free Neu5Ac reduced Neu5Gc in the whole-cell membranes and in secreted glycoproteins. Stably transfected 

CHO-K1 cells expressing a recombinant soluble IgG-Fc fusion protein were grown in the absence or presence of 5 mM Neu5Ac. The individually collected 

medium was centrifuged to remove cell debris and adjusted to 5 mM Tris-HCl pH 8. The fusion protein was purified using protein A–Sepharose, and sialic 

acid content was determined by DMB-HPLC analysis as described in Online Methods (c). Total cell membranes from the same CHO cells were prepared and 

used for DMB-HPLC analysis (d). CHO membrane proteins from d were separated by SDS-PAGE and transferred onto nitrocellulose membranes. Expression 

of Neu5Gc (e) was detected by incubating with polyclonal affinity-purified chicken anti-Neu5Gc IgY, as described in Online Methods. 



small amounts of Neu5Gc in recombinant glycoproteins produced in 

CHO cells 1,16 , we next asked whether feeding Neu5Ac could reduce total 

glycoprotein Neu5Gc levels in CHO cells. This was successful for all membrane 

glycoproteins and for a secreted recombinant protein (Fig. 3c–e). 

Similar feeding of murine myeloma cells with Neu5Ac did not substantially 

reduce the higher initial Neu5Gc content (~70–80% of sialic acids), 

most probably because of the higher baseline levels of Cmah in these cells. 

Regardless, given that the CHO cell expresses its own Cmah enzyme, these 

data suggest a novel mechanism, in addition to Neu5Ac competing out 

recycled Neu5Gc. Whatever that mechanism, reduction of Neu5Gc content 

of a recombinant glycoprotein can be achieved even in a nonhuman 

Cmah-positive cell line that starts with low levels of Neu5Gc. 

Despite their successful use for a variety of indications, infusion-related 

reactions, immunogenicity and accelerated clearance remain 

important concerns for many therapeutic glycoproteins 7,25 . The incidence 

and severity of an immune reaction depend on the interplay 

of infused agents with the immune system and can vary greatly from 

patient to patient. Understanding the underlying nature of these 

events will help to identify patients at risk with the use of specific 

markers. Humanized and fully human antibodies have been developed 

to reduce immunogenicity due to peptide epitopes 5 . However, the 

potential immunogenicity of the glycans they carry has not been as 

well considered. It is known that immune reactions can be mediated 

by binding of pre-existing IgEs against the nonhuman alpha-Gal epitope 

carried by some agents, such as cetuximab 13 . However, in our studies 

alpha-Gal residues are not an issue, as Cmah-null mice already express 

this sequence and do not have antibodies against it. 

A further concern arises here because pre-existing antibodies 

against a glycan on a glycoprotein can secondarily enhance antibody 

reactivity against the underlying protein backbone 26 , perhaps because 

immune complexes are cleared efficiently by Fc receptors into dendritic 

cells and other antigen-presenting cells 27,28 . Such a mechanism 

might help explain why patients’ immunogenicity to some glycoprotein 

therapeutics sometimes increases over time 26,29,30 . If this were 

true, it would likely have a further impact in long-term replacement 

therapy with recombinant therapeutic glycoproteins. 

Our findings suggest that the potential significance of the presence 

of Neu5Gc on glycoprotein biotherapeutics should be revisited. 

Despite a natural tendency to downplay potential new problems 

involving currently useful drugs, it is worthwhile to consider lessons 

from other fields, where initial enthusiasm was not balanced by full 

appreciation of immunological implications 31 . With this in mind, we 

have also suggested that Neu5Gc contamination of stem cells and 

other cell types intended for human therapy could pose risks 32,33 . In 

addition, others have recently reported that Cmah-null mice can reject 

Neu5Gc-positive wild-type organ transplants via complement-fixing 

Neu5Gc-specific antibodies 20 . 

For new drugs, it may be possible to avoid Neu5Gc contamination 

from the outset by using Neu5Gc-deficient cells and media. 

Meanwhile, as an immediate practical solution, we have also demonstrated 

a nontoxic way to reduce the Neu5Gc content of some currently 

used expression systems and their secreted glycoproteins, by simply 

adding Neu5Ac to the culture media. This could bypass the need 

to establish new Neu5Gc-deficient cell lines for already approved 

drugs. The addition of Neu5Ac to the media could also potentially 

increase total sialylation of a glycoprotein biotherapeutic agent. But 

if anything, such an increase would only be beneficial—for example, 

leading to a longer half-life of the agent in vivo. 

Methods 





This work was supported by US National Institutes of Health grants R01-GM32373 

and R01-CA38701 to A.V. and The International Sephardic Education Foundation 

for V.P.-K. Haemophilus influenzae strain 2019 was a generous gift from M. Apicella, 

Department of Microbiology, University of Iowa. 


All authors helped design the studies; D.G. and S.D. performed the research; R.E.T. 

and V.P.-K. generated crucial reagents; D.G. and A.V. wrote the paper; and all 

authors read the paper. 







1. Hokke, C.H. et al. Sialylated carbohydrate chains of recombinant human 

glycoproteins expressed in Chinese hamster ovary cells contain traces of 

N-glycolylneuraminic acid. FEBS Lett. 275, 9–14 (1990). 




2. Noguchi, A., Mukuria, C.J., Suzuki, E. & Naiki, M. Failure of human immunoresponse 

to N-glycolylneuraminic acid epitope contained in recombinant human erythropoietin. 

Nephron 72, 599–603 (1996). 

3. Tangvoranuntakul, P. et al. Human uptake and incorporation of an immunogenic 

nonhuman dietary sialic acid. Proc. Natl. Acad. Sci. USA 100, 12045–12050 

(2003). 

4. Padler-Karavani, V. et al. Diversity in specificity, abundance, and composition of 

anti-Neu5Gc antibodies in normal humans: potential implications for disease. 

Glycobiology 18, 818–830 (2008). 

5. Aggarwal, S. What′s fueling the biotech engine—2007. Nat. Biotechnol. 26, 

1227–1233 (2008). 

6. Arnold, J.N., Wormald, M.R., Sim, R.B., Rudd, P.M. & Dwek, R.A. The impact of 

glycosylation on the biological function and structure of human immunoglobulins. 

Annu. Rev. Immunol. 25, 21–50 (2007). 

7. Durocher, Y. & Butler, M. Expression systems for therapeutic glycoprotein production. 

Curr. Opin. Biotechnol. 20, 700–707 (2009). 

8. Higgins, E. Carbohydrate analysis throughout the development of a protein 

therapeutic. Glycoconj. J. 27, 211–225 (2009). 

9. Galili, U. Immune response, accommodation, and tolerance to transplantation 

carbohydrate antigens. Transplantation 78, 1093–1098 (2004). 

10. Varki, A. Glycan-based interactions involving vertebrate sialic-acid-recognizing 

proteins. Nature 446, 1023–1029 (2007). 

11. Bardor, M., Nguyen, D.H., Diaz, S. & Varki, A. Mechanism of uptake and incorporation 

of the non-human sialic acid N-glycolylneuraminic acid into human cells. J. Biol. 

Chem. 280, 4228–4237 (2005). 

12. Borys, M.C. et al. Effects of culture conditions on N-glycolylneuraminic acid 

(Neu5Gc) content of a recombinant fusion protein produced in CHO cells. Biotechnol. 

Bioeng. 105, 1048–1057 (2009). 

13. Chung, C.H. et al. Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose.

N. Engl. J. Med. 358, 1109–1117 (2008). 

14. Delbaldo, C. et al. Pharmacokinetic profile of cetuximab (Erbitux) alone and in 

combination with irinotecan in patients with advanced EGFR-positive adenocarcinoma. 

Eur. J. Cancer 41, 1739–1745 (2005). 

15. Saadeh, C.E. & Lee, H.S. Panitumumab: a fully human monoclonal antibody with 

activity in metastatic colorectal cancer. Ann. Pharmacother. 41, 606–613 

(2007). 

16. Diaz, S.L. et al. Sensitive and specific detection of the non-human sialic acid 

N-glycolylneuraminic acid in human tissues and biotherapeutic products. PLoS ONE 

4, e4241 (2009). 

17. Muchmore, E.A., Milewski, M., Varki, A. & Diaz, S. Biosynthesis of N-glycolyneuraminic 

acid. The primary site of hydroxylation of N-acetylneuraminic acid 

is the cytosolic sugar nucleotide pool. J. Biol. Chem. 264, 20216–20223 

(1989). 

18. Hedlund, M. et al. N-glycolylneuraminic acid deficiency in mice: implications for 

human biology and evolution. Mol. Cell. Biol. 27, 4340–4346 (2007). 

19. Hedlund, M., Padler-Karavani, V., Varki, N.M. & Varki, A. Evidence for a human-specific

mechanism for diet and antibody-mediated inflammation in carcinoma 

progression. Proc. Natl. Acad. Sci. USA 105, 18936–18941 (2008). 

20. Tahara, H. et al. Immunological property of antibodies against N-glycolylneuraminic 

acid epitopes in cytidine monophospho-N-acetylneuraminic acid hydroxylase-deficient

mice. J. Immunol. 184, 3269–3275 (2010). 

21. Taylor, R.E. et al. Novel mechanism for the generation of human xeno-autoantibodies 

against the non-human sialic acid N-glycolylneuraminic acid. J. Exp. 

Med. published online, doi: 10.1084/jem.20100575 (12 July 2010). 

22. Qian, J. et al. Structural characterization of N-linked oligosaccharides on monoclonal 

antibody cetuximab by the combination of orthogonal matrix-assisted laser 

desorption/ionization hybrid quadrupole-quadrupole time-of-flight tandem mass 

spectrometry and sequential enzymatic digestion. Anal. Biochem. 364, 8–18 

(2007). 

23. Axworthy, D.B. et al. Cure of human carcinoma xenografts by a single dose of 

pretargeted yttrium-90 with negligible toxicity. Proc. Natl. Acad. Sci. USA 97, 

1802–1807 (2000). 

24. Pham, T. et al. Evidence for a novel human-specific xeno-auto-antibody response 

against vascular endothelium. Blood 114, 5225–5235 (2009). 

25. Jahn, E.M. & Schneider, C.K. How to systematically evaluate immunogenicity of 

therapeutic proteins—regulatory considerations. New Biotechnol. 25, 280–286 

(2009). 

26. Galili, U. et al. Enhancement of antigen presentation of influenza virus hemagglutinin 

by the natural human anti-Gal antibody. Vaccine 14, 321–328 (1996). 

27. Benatuil, L. et al. The influence of natural antibody specificity on antigen 

immunogenicity. Eur. J. Immunol. 35, 2638–2647 (2005). 

28. Abdel-Motal, U.M., Wigglesworth, K. & Galili, U. Mechanism for increased 

immunogenicity of vaccines that form in vivo immune complexes with the natural 

anti-Gal antibody. Vaccine 27, 3072–3082 (2009). 

29. Koren, E. et al. Recommendations on risk-based strategies for detection and 

characterization of antibodies against biotechnology products. J. Immunol. Methods 

333, 1–9 (2008). 

30. Shankar, G., Pendley, C. & Stein, K.E. A risk-based bioanalytical strategy for the 

assessment of antibody immune responses against biological drugs. Nat. Biotechnol. 

25, 555–561 (2007). 

31. Wilson, J.M. Medicine. A history lesson for stem cells. Science 324, 727–728 

(2009). 

32. Martin, M.J., Muotri, A., Gage, F. & Varki, A. Human embryonic stem cells express 

an immunogenic nonhuman sialic acid. Nat. Med. 11, 228–232 (2005). 

33. Martin, M.J., Muotri, A., Gage, F. & Varki, A. Response to Cerdan et al.: Complement 

targeting of nonhuman sialic acid does not mediate cell death of human embryonic 

stem cells. Nat. Med. 12, 1115 (2006). 

34. Van Hoeyveld, E. & Bossuyt, X. Evaluation of seven commercial ELISA kits compared 

with the C1q solid-phase binding RIA for detection of circulating immune complexes. 

Clin. Chem. 46, 283–285 (2000). 

35. Campagnari, A.A., Gupta, M.R., Dudas, K.C., Murphy, T.F. & Apicella, M.A. Antigenic 

diversity of lipooligosaccharides of nontypable Haemophilus influenzae. Infect. 

Immun. 55, 882–887 (1987). 

36. Greiner, L.L. et al. Nontypeable Haemophilus influenzae strain 2019 produces a 

biofilm containing N-acetylneuraminic acid that may mimic sialylated O-linked 

glycans. Infect. Immun. 72, 4249–4260 (2004). 

37. Gagneux, P. et al. Proteomic comparison of human and great ape blood plasma 

reveals conserved glycosylation and differences in thyroid hormone metabolism. 

Am. J. Phys. Anthropol. 115, 99–109 (2001). 

38. Debeire, P., Montreuil, J., Moczar, E., van Halbeek, H. & Vliegenthart, J.F.G. Primary 

structure of two major glycans of bovine fibrinogen. Eur. J. Biochem. 151, 607–611 

(1985). 




Mice. The Cmah-null mice used for this study have been described previously 18 

and were backcrossed to C57Bl/6 mice for over ten generations. All experiments 

were approved by the University of California, San Diego Institutional 

Review Board committee responsible for approving animal experiments. 

Sialidase treatment of therapeutic antibodies. One milligram each of cetuximab

and panitumumab (obtained from the University of California, San Diego

pharmacy or the manufacturer) was treated with 50 mU of active or heat-inactivated

Arthrobacter ureafaciens sialidase (EY Laboratories) in 100 mM 

sodium acetate pH 5.5, at 37 °C for 24 h. Samples were used for ELISA or 

western blots. 

Periodate treatment of therapeutic antibodies on ELISA plate. Untreated 

cetuximab and panitumumab (1 μg per well) were used for coating, then 

blocked with PBST for 2 h and incubated with freshly made 2 mM sodium 

metaperiodate in PBS for 20 min at 4 °C in the dark. The reaction was stopped 

by addition of 200 mM sodium borohydride to a final concentration of 20 mM. 

As a control, periodate and borohydride were premixed and then added to the 

wells (the borohydride inactivates the periodate). To remove resulting borates, 

wells were then washed three times with 100 mM sodium acetate with 100 mM 

NaCl pH 5.5 before further analysis. 
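
As a rough worked example of the quench step, the volume of 200 mM sodium borohydride needed to reach 20 mM final follows from a simple mass balance. The sketch below assumes a hypothetical 100 μl of periodate solution per well (the well volume is not stated in the text); the function name is illustrative only.

```python
def stock_volume_to_add(initial_volume_ul, stock_mM, final_mM):
    """Volume of stock to add to a solute-free volume to reach final_mM.

    Mass balance: stock_mM * v = final_mM * (initial_volume_ul + v)
    """
    if not 0 < final_mM < stock_mM:
        raise ValueError("final concentration must lie between 0 and the stock concentration")
    return initial_volume_ul * final_mM / (stock_mM - final_mM)


# Assumed well volume (hypothetical; not given in the protocol).
well_volume_ul = 100.0
borohydride_ul = stock_volume_to_add(well_volume_ul, stock_mM=200.0, final_mM=20.0)
print(f"Add {borohydride_ul:.1f} ul of 200 mM borohydride per {well_volume_ul:.0f} ul well")
```

Under this assumption the stock is added at one-ninth of the well volume, which gives a tenfold dilution of the borohydride in the final mixture.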

ELISA detection of Neu5Gc on therapeutic antibodies. For the ELISA, wells 

were coated with 1 μg of cetuximab or panitumumab (either before sialidase 

treatment or after periodate treatment), blocked with TBST for 2 h and then 

incubated with affinity-purified chicken anti-Neu5Gc IgY or control IgY for 1 h 

(1:20,000 in TBST). Binding of IgY was detected using horseradish peroxidase 

(HRP)-conjugated donkey anti-chicken IgY (1:50,000 in TBST) and developed 

with O-phenylenediamine in citrate-phosphate buffer, pH 5.5, with absorbance 

measured at 495 nm. ELISA samples were studied at least in triplicate. Similar 

to the ELISA with the anti-Neu5Gc chicken IgY, human anti-Neu5Gc IgG that 

had been purified from the serum of healthy humans and biotinylated (exactly 

as described in ref. 4) was also used as the primary antibody (1:100 in TBST). 

Binding of the human antibodies to the therapeutic antibodies was detected 

using HRP-conjugated streptavidin (1:10,000) followed by development as 

described above. Samples were studied in triplicate. 

Western blot detection of Neu5Gc on therapeutic antibodies. For western 

blot detection, cetuximab or panitumumab (1 μg per lane) was separated 

by 12.5% SDS-PAGE and Coomassie-stained or blotted on nitrocellulose 

membranes. Blotted membranes were blocked with TBST containing 0.5% 

cold-water fish-skin gelatin overnight at 4 °C and subsequently incubated 

with affinity-purified chicken anti-Neu5Gc IgY for 4 h at room temperature 

(1:100,000 in TBST). Binding of the chicken anti-Neu5Gc IgY was detected 

using HRP-conjugated donkey anti-chicken IgY for 1 h (1:50,000 in TBST), 

followed by incubation with SuperSignal West Pico Substrate (Pierce) as per 

the manufacturer’s recommendation, exposure to X-ray film and development 

of the film. Similar to the western blot with the chicken anti-Neu5Gc IgY, 

purified biotinylated human anti-Neu5Gc IgG was also used as the primary 

antibody (1:100 in TBST). Binding of the human antibodies to the therapeutic 

antibodies was detected using HRP-conjugated streptavidin (1:10,000 in 

TBST) followed by development as described above. 

CIC-C1q binding assay. Immune complex formation was detected using 

the CIC (C1Q) ELISA Kit (Buehlmann) as described in the manufacturer’s 

guidelines 34 . Briefly, 100 μl of human serum with low or high anti-Neu5Gc 

(S30 and S34, respectively 4 ) was incubated with 40 μg of cetuximab or panitumumab 

for 14 h at 4 °C. We applied 1:50 dilutions of the mix to human 

C1q–coated ELISA wells and incubated for 1 h at 25 °C. Binding was detected 

using alkaline phosphatase–conjugated protein A. After another washing step, 

the enzyme substrate (para-nitrophenylphosphate) was added, followed by a 

stopping step. The absorbance was measured at 405 nm. Samples were studied 

in triplicate. 

Generation of murine Neu5Gc-specific antibodies. Haemophilus influenzae 

strain 2019 (ref. 35) was a generous gift from M. Apicella, Department of 

Microbiology, University of Iowa. Bacteria were grown to mid log phase in 

sialic acid–free media 36 with or without addition of 1 mM Neu5Gc 21 , heat-killed

and injected intraperitoneally (200 μl of culture at an absorbance at

600 nm of 0.4) into Cmah-null mice.

Effects of anti-Neu5Gc antibodies on in vivo kinetics of therapeutic antibodies. 

Cetuximab or panitumumab in PBS (0.24 μg per gram mouse body 

weight) was injected intravenously, and 14 h later, mouse serum pooled from

syngeneic Cmah-null mice containing anti-Neu5Gc antibodies (or pooled 

serum from syngeneic naïve or control-immunized mice) was passively transferred 

via intraperitoneal injection into syngeneic Cmah-null mice that were 

prescreened for the absence of pre-existing antibodies against human IgG. 

Mice were bled 0, 2, 8, 32, 56 and 80 h after the passive transfer of mouse 

serum. For quantification of therapeutic antibody concentrations in the sera, 

wells of ELISA plates were coated with 1 μg of anti-human IgG (Biorad), then 

blocked with TBST for 2 h and incubated with 1:500 dilutions of the sera in 

each well. Captured therapeutic antibodies were detected by HRP-conjugated 

anti-human Fc (Jackson; 1:10,000), with development by O-phenylenediamine 

in citrate-phosphate buffer, pH 5.5, and absorbance measured at 495 nm 

(n = 5 for injections of both control sera groups; n = 10 for injections of anti-Neu5Gc

serum groups).

Quantification of Neu5Gc-specific IgG antibodies in Neu5Gc-immunized 

mice. A Neu5Gcα2-6Galβ1-4Glc-conjugate 4 (1 μg per well) and serial dilutions 

of mouse IgG as standards (0.625–20 ng per well) were used for coating 

overnight, then blocked with PBST for 2 h and incubated with pooled serum 

from Neu5Gc-immunized mice (1:250 dilution) for 2 h at 25 °C. Binding 

of mouse IgG was detected using HRP-conjugated goat anti-mouse IgG-Fc 

(Jackson; 1:10,000 in PBST) and developed with O-phenylenediamine in 

citrate-phosphate buffer, pH 5.5, with absorbance measured at 490 nm. ELISA 

samples were studied in triplicate. 
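
The quantification against mouse IgG standards can be sketched as a lookup on a standard curve. The example below assumes a twofold dilution series (which spans 0.625–20 ng in six steps) and piecewise-linear interpolation; both are assumptions, and the absorbance readings are hypothetical values chosen for illustration, not data from this study.

```python
def twofold_series(top, n_points):
    """Twofold serial dilution starting at `top`, returned in ascending order."""
    return sorted(top / 2 ** i for i in range(n_points))


def interpolate_amount(absorbance, standards):
    """Linearly interpolate an amount (ng) from (amount, absorbance) standards.

    Standards must be sorted by strictly increasing absorbance.
    Readings outside the range of the curve raise ValueError.
    """
    for (x0, y0), (x1, y1) in zip(standards, standards[1:]):
        if y0 <= absorbance <= y1:
            return x0 + (absorbance - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("absorbance outside the range of the standard curve")


amounts_ng = twofold_series(20.0, 6)  # 0.625, 1.25, 2.5, 5, 10, 20 ng per well
# Hypothetical absorbance readings for the standards (illustration only).
readings = [0.05, 0.10, 0.20, 0.40, 0.80, 1.60]
standards = list(zip(amounts_ng, readings))

sample_ng = interpolate_amount(0.60, standards)
print(f"Interpolated IgG in well: {sample_ng:.2f} ng")
```

Refusing to extrapolate beyond the standards is a deliberate choice here, since readings outside the curve are usually re-run at a different serum dilution.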

Levels of anti-Neu5Gc IgG after injections of the antibodies. Cmah-null mice 

were injected intravenously with 4 μg antibody per gram of mouse body weight 

in PBS weekly for 3 weeks. Mice were bled initially, and again 1 week after the 

third intravenous injection. Wells of ELISA plates were coated with 1:1,000 

dilutions of human (Neu5Gc-deficient) or chimpanzee (Neu5Gc-positive) 

serum glycoproteins (note that the only major difference between human 

and chimp serum glycosylation is the absence or presence of Neu5Gc; 

ref. 37). Alternatively, wells were coated with human or bovine fibrinogen, 

which carry Neu5Ac or Neu5Gc on otherwise identical N-glycans 38 . Wells 

were then blocked with TBST for 2 h followed by incubation with 1:100 

dilutions of the mouse sera. Binding of the mouse antibodies was detected 

using HRP-conjugated goat anti-mouse IgG Fc fragment (1:10,000 in TBST). 

Neu5Gc-specific binding (change in absorbance at 495 nm) was determined 

by subtracting the background signal of the wells coated with human serum or 

human fibrinogen (no Neu5Gc) from the signal of chimpanzee serum–coated 

or bovine fibrinogen–coated wells (containing Neu5Gc). Data were obtained 

in triplicate (n = 5 for injection of mouse IgG; n = 4 for injection of panitumumab; 

n = 6 for injection of cetuximab).
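
The background subtraction used to obtain Neu5Gc-specific binding can be written out as follows. This is a minimal sketch: the triplicate readings are hypothetical, and averaging the replicates before subtracting is an assumption about how the triplicates were combined.

```python
from statistics import mean


def neu5gc_specific_signal(positive_wells, negative_wells):
    """Mean A495 on the Neu5Gc-positive coating minus the mean on the Neu5Gc-negative coating."""
    return mean(positive_wells) - mean(negative_wells)


# Hypothetical triplicate readings for one mouse serum (illustration only).
chimp_serum_wells = [0.62, 0.58, 0.60]  # Neu5Gc-positive coating
human_serum_wells = [0.11, 0.09, 0.10]  # Neu5Gc-deficient coating (background)

delta_a495 = neu5gc_specific_signal(chimp_serum_wells, human_serum_wells)
print(f"Neu5Gc-specific change in A495: {delta_a495:.2f}")
```

The same function applies unchanged to the bovine versus human fibrinogen pair.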

An approach to reduce Neu5Gc contamination in biotherapeutic products. 

Human 293T kidney cells were grown in DME supplemented with 10% (vol/vol) 

fetal calf serum. Cells were lifted from the culture plate using 20 mM EDTA in 

PBS and allowed to grow to 50% confluence. At this point, buffered 100 mM 

Neu5Gc was added to the culture in duplicate for a final 5 mM concentration, 

and the cells were grown in this supplemented media for 3 d. At the end of 

this Neu5Gc pulse, the cells were once again lifted using 20 mM EDTA in 

PBS, pelleted, washed once with PBS to remove any excess Neu5Gc and then 

suspended in 30 ml of growth medium. We added 5 ml of this cell suspension 

to each of five P-100 dishes. We immediately harvested the last aliquot of cell 

suspension, at time 0, by pelleting the cells, washing once with PBS, suspending 

them in 1 ml of PBS and transferring them to a 1.5-ml microcentrifuge 

tube. The cells were repelleted and frozen until all time points were collected. 

Buffered 100 mM Neu5Ac was added to each of the other five plates for the 

‘Neu5Ac chase’ and an equivalent amount of media added to the ‘minus chase’ 

samples. We harvested cells at days 1, 2, 3, 4 and 5 by scraping them into the 



culture media, collecting by pelleting, washing once with PBS, transferring 

them to a 1.5-ml microcentrifuge tube, pelleting and freezing the cell pellet. 

At the end of the 5 d of chase, all collected cell pellets were homogenized in 

300 μl of ice-cold 20 mM potassium phosphate pH 7 using a 3- to 20-s burst 

with a Fisher Sonicator. We precipitated glycoconjugate-bound sialic acids by 

adding 700 μl of 100% ice-cold ethanol (final 70% (vol/vol) ethanol) and

incubating at −20 °C overnight. The samples were spun at 20,000g for 15 min 

and the supernatants transferred to clean tubes and dried on a speed vac. 

The precipitated glycoconjugates and dried ethanol supernatants were each 

suspended in 100 μl of 20 mM potassium phosphate pH 7 by sonication. Sialic 

acids were released from both fractions by acid hydrolysis with 2 M acetic 

acid (final) and incubation at 80 °C for 3 h. Samples were passed through 

a Microcon-10 filter and the filtrate derivatized with DMB (1,2-diamino-4, 

5-methylenedioxybenzene) reagent for analysis of sialic acids by HPLC. 
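
Two of the volumes in this protocol can be checked with simple arithmetic: the final ethanol percentage of the precipitation step and the number of aliquots obtained from the cell suspension. The sketch below assumes ideal volume additivity (ethanol and water contract slightly on mixing, which is ignored), and the helper name is illustrative.

```python
def final_percent(volume_a_ul, percent_a, volume_b_ul, percent_b):
    """Volume-weighted percentage after mixing two solutions (ideal mixing assumed)."""
    total = volume_a_ul + volume_b_ul
    return (volume_a_ul * percent_a + volume_b_ul * percent_b) / total


# 300 ul of aqueous homogenate (0% ethanol) plus 700 ul of 100% ethanol.
ethanol_percent = final_percent(300.0, 0.0, 700.0, 100.0)

# 30 ml of suspension in 5 ml aliquots: five chase dishes plus the time-0 sample.
n_aliquots = 30 // 5

print(f"Final ethanol: {ethanol_percent:.0f}% (vol/vol); aliquots: {n_aliquots}")
```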

A similar approach was taken with CHO cells stably expressing a Siglec-Fc 

protein in the medium, except that the Neu5Gc pulse was omitted and the 

secreted glycoproteins were captured on protein A–Sepharose beads. The 

cells were also processed similarly, except that total cell membranes were 

pelleted by centrifugation. The sialic acid content of the secreted proteins and 

cell membranes was determined by acid hydrolysis, DMB derivatization and 

HPLC. The cell membranes were also studied by western blotting with the 

chicken anti-Neu5Gc IgY, as described above. 


doi:10.1038/nbt.1651 


l e t t e r s 

Global analysis of lysine ubiquitination by ubiquitin 

remnant immunoaffinity profiling 

Guoqiang Xu, Jeremy S Paige & Samie R Jaffrey 


Protein ubiquitination is a post-translational modification 

(PTM) that regulates various aspects of protein function by 

different mechanisms. Characterization of ubiquitination has 

lagged behind that of smaller PTMs, such as phosphorylation, 

largely because of the difficulty of isolating and identifying 

peptides derived from the ubiquitinated portion of proteins. 

To address this issue, we generated a monoclonal antibody 

that enriches for peptides containing lysine residues modified 

by diglycine, an adduct left at sites of ubiquitination after 

trypsin digestion. We use mass spectrometry to identify 374 

diglycine-modified lysines on 236 ubiquitinated proteins from 

HEK293 cells, including 80 proteins containing multiple 

sites of ubiquitination. Seventy-two percent of these proteins 

and 92% of the ubiquitination sites do not appear to have 

been reported previously. Ubiquitin remnant profiling of the 

multi-ubiquitinated proteins proliferating cell nuclear antigen 

(PCNA) and tubulin α-1A reveals differential regulation of

ubiquitination at specific sites by microtubule inhibitors, 

demonstrating the effectiveness of our method to characterize 

the dynamics of lysine ubiquitination. 

Protein ubiquitination occurs on a wide variety of eukaryotic proteins 

and affects processes ranging from protein degradation and subcellular 

localization to gene expression and DNA repair 1 . Ubiquitination 

involves the transfer of ubiquitin to a target protein using E1 ubiquitin– 

activating enzymes, E2 ubiquitin–conjugating enzymes and E3 ubiquitin 

ligases 1 . This process typically leads to the formation of an amide 

linkage comprising the ε-amine of lysine of the target protein and the 

C terminus of ubiquitin, and can involve ubiquitination at distinct sites 

within the same protein, although the roles of ubiquitination at distinct 

sites are incompletely understood. The human genome is predicted to 

encode 16 E1, 53 E2 and 527 E3 proteins 2 , which underscores the likely 

importance of ubiquitination in molecular signaling. 

In most cases, proteins suspected to be ubiquitinated have been 

identified based on their susceptibility to proteasome-mediated degradation, 

as evidenced by their increased levels following application 

of proteasome inhibitors. These proteins are immunopurified and 

ubiquitin adducts are confirmed by anti-ubiquitin immunoblotting 3 . 

Mutagenesis experiments can identify ubiquitination sites 4 . Global 

identification of ubiquitinated proteins has been performed by 

purifying ubiquitinated proteins, using ubiquitin-binding proteins 

such as anti-ubiquitin antibodies 5 , or by purifying hexahistidine 

(His 6 )-tagged ubiquitin-protein conjugates 6 . The enriched set of proteins 

is then proteolyzed and subjected to tandem mass spectrometry

(MS/MS). However, as only one or a few lysines are typically modified 

in any ubiquitinated protein, most peptides do not exhibit any 

ubiquitin-derived modifications 7 . This introduces uncertainty 

as to whether they are derived from the nonubiquitinated portion of a

protein or from coprecipitated proteins. 

Alternatively, proteolytic digests can be screened for peptides that 

contain remnants of ubiquitin modification. Digestion of ubiquitin-conjugated

proteins results in peptides that contain a ubiquitin remnant 

derived from the ubiquitin C terminus. The three C-terminal 

residues of ubiquitin are Arg-Gly-Gly, with the C-terminal glycine 

conjugated to a lysine residue in the target. After digestion with 

trypsin, ubiquitin is cleaved after arginine, leaving a Gly-Gly dipeptide 

remnant on the conjugated lysine. Therefore, tryptic digests 

will include peptides that contain a diglycine-modified lysine, 

indicating the prior conjugation of ubiquitin to that region of the 

target protein. The diglycine-modified lysine serves as a signature 

of ubiquitination and also identifies the specific site of modification. 

Sequencing of ubiquitin remnant–containing peptides in tryptic 

digests has been used to identify 110 ubiquitination sites from 

yeast expressing His 6 -ubiquitin 7 . Despite the availability of these 

approaches for several years, analysis of the Swiss-Prot database 

indicates that only 255 mammalian proteins have been reported to 

be ubiquitinated based on experimental evidence. In most cases, the 

ubiquitination sites have not been identified. Direct enrichment of 

ubiquitin remnant–containing peptides would facilitate the high-throughput

identification of ubiquitination sites. 
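
The origin of the diglycine remnant can be illustrated with a simplified in silico digest. The sketch below cleaves after every lysine or arginine and ignores the usual proline exception and missed cleavages. The six-residue tail is used for illustration, and only its final Arg-Gly-Gly is taken from the text above. The code models only the ubiquitin chain, not the isopeptide bond that keeps the Gly-Gly attached to the target lysine.

```python
def trypsin_digest(sequence):
    """Split a sequence after every K or R (simplified rule; no proline exception)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# Illustrative C-terminal tail of ubiquitin ending in Arg-Gly-Gly.
ubiquitin_tail = "LRLRGG"
fragments = trypsin_digest(ubiquitin_tail)
remnant = fragments[-1]  # the piece left on the conjugated lysine
print(fragments, remnant)
```

Because cleavage occurs after the last arginine, the fragment that stays attached to the target lysine is the Gly-Gly dipeptide.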

To identify ubiquitinated proteins and simultaneously report their 

sites of ubiquitination, we generated an antibody that recognizes 

peptides containing the ubiquitin remnant left after trypsin digestion 

of ubiquitinated proteins. To prepare a protein antigen containing 

diglycine-modified lysines, we first reacted purified lysine-rich histone 

III-S protein with t-butyloxycarbonyl-Gly-Gly-N-hydroxysuccinimide 

(Boc-Gly-Gly-NHS) to introduce amide-linked Boc-Gly-Gly adducts 

on all amines (Fig. 1a). Nearly complete modification of the amines was 

confirmed by the reduction in labeling of the Boc-Gly-Gly–modified 

protein by the lysine-modifying reagent biotin-NHS, as assessed by 

anti-biotin immunoblotting (Fig. 1b). The modified protein was treated 

with trifluoroacetic acid (TFA) to remove the Boc moiety. Quantitative 

Department of Pharmacology, Weill Medical College, Cornell University, New York, New York, USA. Correspondence should be addressed to S.R.J. 

(srj2003@med.cornell.edu). 

Received 25 March; accepted 11 June; published online 18 July 2010; doi:10.1038/nbt.1654 




Figure 1 Generation of monoclonal antibodies that selectively recognize 

diglycine-modified lysines. (a) The antigen used to raise antibodies was 

synthesized by modifying the ε-amines of all lysines in a histone with 

t-butyloxycarbonyl-Gly-Gly-N-hydroxysuccinimide (Boc-Gly-Gly-NHS) 

and then removing the Boc group by treatment with TFA. The lysines 

in the final protein contain Gly-Gly adducts on the ε-amine of all lysine 

residues. (b) To validate the synthesis of Gly-Gly–modified histone, 

we monitored the reaction of the histone with Boc-Gly-Gly-NHS by 

detecting amines, such as those in unmodified lysine, through the 

reaction of the protein with the amine-modifying agent biotin-NHS, 

and subsequent western blot analysis with an anti-biotin antibody. 

Amines in the histone were completely lost after treatment with 

Boc-Gly-Gly-NHS, indicating complete modification of all the 

lysines in the histone. Removal of the Boc protecting group with 

TFA resulted in the formation of an amine at the N terminus of the 

Gly-Gly adduct. This step was essentially complete, as the TFA-treated 

protein exhibited nearly complete recovery of amine reactivity. 

The position of the bands in the different samples is slightly 

shifted due to the different molecular weights and number of 

positive charges in the modified and unmodified samples. The 

bands above 50 kDa represent impurities in the histone sample. 

(c) We evaluated the specificity of the GX41 monoclonal antibody by western blot analysis of β-lactoglobulin, lysozyme or rat brain lysate, in which 

the lysines were either unmodified (A), or modified with Boc-Gly-Gly (B) or Gly-Gly- (C) adducts, respectively. 

conversion of the Boc-Gly-Gly adduct, which does not contain an 

amine, to Gly-Gly, which contains an amine, was confirmed by the 

reactivity of the TFA-treated protein with biotin-NHS (Fig. 1b). 

We injected the diglycine-modified histone into mice, and screened 

hybridoma lines for antibodies that specifically recognize proteins 

containing diglycine-modified lysines. Hybridoma line GX41 generated 

monoclonal antibodies that exhibited pronounced specificity for 

proteins containing the diglycine-modified lysines. The antibodies 

failed to interact with unmodified lysozyme or lactoglobulin (Fig. 1c), 

or either of these proteins after they have been modified with Boc-Gly- 

Gly. However, the antibody recognized Gly-Gly–modified lysozyme 

and lactoglobulin obtained after removal of the Boc group. 

These results indicate that the antibody recognizes Gly-Gly–modified 

lysines, and suggest that the antibody only recognizes Gly-Gly adducts 

that contain an unmodified primary amine. Similarly, the antibody 

exhibits negligible reactivity with rat brain lysate (Fig. 1c), or brain 

lysate modified with Boc-Gly-Gly, but exhibits substantial reactivity 

with Gly-Gly–modified proteins from brain lysate. Notably, the brain 

lysate includes highly abundant proteins containing internal Gly- 

Gly peptide sequences, such as β-actin, glyceraldehyde-3-phosphate 

dehydrogenase and α-tubulin, as well as histone H2A, which contains 

an internal Gly-Gly-Lys sequence. This indicates that internal Gly-Gly 

sequences are not recognized by the antibody. Additionally, peptides 

that contain Gly-Gly as the first two amino acids are not recognized 

(Supplementary Fig. 1). Together, these data indicate that the antibody 

recognizes Gly-Gly sequences that are present as an adduct on 

the ε-amine of lysine. 

We next investigated whether the anti–diglycyl-lysine antibody was 

able to immunoprecipitate peptides containing Gly-Gly–modified lysine. 

A flow chart for sample preparation, immunoprecipitation, and MS/MS 

analysis is shown in Figure 2a. We prepared a peptide containing an 

N-terminal Gly-Gly sequence (GGDRVYIHPFHL), and a peptide containing 

a diglycyl adduct on lysine (Ac-SYSMEHFRWGK*PV-NH 2 ; K* 

and Ac represent Gly-Gly–modified lysine and an acetyl group, respectively). 

An equimolar mixture of the peptides was immunoprecipitated 

with the anti–diglycyl-lysine antibody, resulting in selective enrichment 

(≥50×) of the peptide containing the Gly-Gly–modified lysine (Fig. 2b). 

Additionally, this peptide was quantitatively immunoprecipitated 

with a nearly 100% yield (Supplementary Fig. 2). These experiments 

[Figure 1a,b panel labels: NH2; Boc-Gly-Gly-NHS; TFA; Boc-Gly-Gly-; Gly-Gly-; Histone; Biotin-NHS]

demonstrated that the GX41 antibody is capable of enriching peptides 

containing diglycine-modified lysines and does not immunoprecipitate 

peptides containing a Gly-Gly sequence at their N termini. 
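
An enrichment factor of this kind can be estimated by comparing the signal ratio of the two peptides before and after immunoprecipitation. The sketch below is one plausible way to compute it, not necessarily the authors' calculation, and the peak intensities are hypothetical. When the control peptide falls below the detection limit after purification, substituting the noise floor for its signal gives only a lower bound, which matches the "at least" wording.

```python
def enrichment_factor(target_before, control_before, target_after, control_after):
    """Target:control signal ratio after purification divided by the same ratio before."""
    if min(target_before, control_before, control_after) <= 0:
        raise ValueError("signals used as denominators must be positive")
    return (target_after / control_after) / (target_before / control_before)


# Hypothetical MALDI-TOF peak intensities (arbitrary units), for illustration only.
factor = enrichment_factor(
    target_before=1000.0, control_before=1000.0,
    target_after=900.0, control_after=15.0,
)
print(f"Enrichment factor: {factor:.0f}x")
```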

We next sought to assess the diversity of lysine ubiquitination in 

cultured cells. To distinguish diglycine remnants derived from ubiquitin 

from those originating from less common ubiquitin-like proteins 

(such as ISG15 and NEDD8, which also leave a diglycine remnant 

on lysines after trypsinization 8 ), we used HEK293 cells expressing 

His 6 -tagged ubiquitin. Ubiquitinated proteins were purified 

by immobilized metal-affinity chromatography, before proteolysis 

and anti–diglycyl-lysine immunopurification. Ubiquitin remnant– 

containing peptides were subjected to liquid chromatography (LC)- 

MS/MS followed by database searching and spectral validation. To 

minimize alterations in ubiquitination levels after cell lysis, 5 mM 

chloroacetamide was included in lysis buffer to inhibit deubiquitinase 

and ubiquitin ligase activity 9 . To measure post-lysis ubiquitination, 

we spiked a lysate with excess glutathione S-transferase. This protein 

showed no detectable level of ubiquitination (Supplementary Fig. 3), 

suggesting that negligible ubiquitination occurred after cell lysis. 

MS/MS spectra of ubiquitin remnant–containing peptides exhibited 

normal y- and b-ion series, typically with a pair of ions separated by 

a mass of 242.14 Da, consistent with the masses of a lysine residue 

(128.09) and a Gly-Gly adduct (114.04 Da) on the ε-amine of lysine. 

Whereas most peptides contained a single diglycine-modified lysine 

(Fig. 2c), 17 peptides contained two diglycine-modified lysines. The 

majority (>92%) of ubiquitin remnant–containing peptides have a +3 or 

+4 charge (Supplementary Fig. 4), which reflects the additional charge 

from the N-terminal amine on the Gly-Gly adduct. Gly-Gly–modified 

lysines as the C-terminal residue of peptides were also detected (~2% of 

total) (Supplementary Fig. 5), and reflect use of the Gly-Gly–modified 

lysine as a substrate for trypsin, as described previously 10 . 
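
The 242.14 Da spacing can be reproduced from elemental compositions. The sketch below uses standard monoisotopic atomic masses and the residue formulas for lysine (C6H12N2O) and glycine (C2H3NO); the function and variable names are illustrative.

```python
# Monoisotopic atomic masses (Da).
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(composition):
    """Sum monoisotopic masses for an elemental composition such as {'C': 6, 'H': 12}."""
    return sum(MASS[element] * count for element, count in composition.items())


lysine_residue = monoisotopic_mass({"C": 6, "H": 12, "N": 2, "O": 1})  # C6H12N2O
glycine_residue = monoisotopic_mass({"C": 2, "H": 3, "N": 1, "O": 1})  # C2H3NO
diglycine_adduct = 2 * glycine_residue
modified_lysine = lysine_residue + diglycine_adduct

print(f"{lysine_residue:.2f} + {diglycine_adduct:.2f} = {modified_lysine:.2f} Da")
```

To two decimal places the computed values agree with the residue and adduct masses quoted above.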

In total, we identified 374 diglycine-modified lysines on 236 ubiquitinated 

mammalian proteins. Analysis of the Swiss-Prot database 

suggests that 72% of these proteins were not previously known to 

be ubiquitinated. Similarly, 92% of the ubiquitination sites that we 

identified were not previously known. Among the identified proteins, 

156 proteins have one ubiquitination site and 80 have two or more 

ubiquitination sites (Supplementary Table 1 and Supplementary 

Fig. 6). To validate the ubiquitination detected using the ubiquitin 

[Figure 1b,c panel labels: WB: anti-biotin; lanes A, B and C for β-Lact, Lysozyme and Brain lysate; WB: anti–diglycyl-lysine; molecular weight markers in kDa]




Figure 2 Profiling immunopurified ubiquitin 

remnant–containing peptides to identify 

ubiquitinated proteins. (a) Strategy to identify 

ubiquitinated proteins by immunoprecipitation 

of peptides containing diglycyl-lysine, 

followed by MS analysis. (b) Confirmation 

of antibody specificity using two peptides, 

GGDRVYIHPFHL and Ac-SYSMEHFRWGK*PV-NH₂. Equimolar amounts (0.3 nmol) of the two 

peptides were mixed and immunoprecipitated 

with immobilized anti–diglycyl-lysine 

monoclonal antibody. Matrix-assisted laser 

desorption ionization/time-of-flight 

(MALDI-TOF)-MS analysis of the starting material 

and the antibody-purified material suggests 

an enrichment factor of at least 50, based on 

the comparison of the MS signals of the two 

peptides before and after immunoprecipitation. 

(c) Representative annotated MS/MS spectra 

of two ubiquitin remnant–containing peptides 

obtained by immunoprecipitation from a 

HEK293 cell lysate. The sequence of 

the ubiquitinated peptide, including the 

diglycine-modified lysine (K*), is indicated 

and the fragment ions are labeled. The 

symbols \, / and | represent b-ions, y-ions, 

and both b-ions and y-ions, respectively. 

(d) Biochemical verification of the 

ubiquitination of six proteins. Proteins were 

immunoprecipitated using target-specific 

antibodies and the immunoprecipitate 

was detected by western blotting using an 

anti-ubiquitin antibody. IgG was used as a 

control for nonspecific immunoprecipitation. 

The proteasome inhibitor 

N-acetyl-Leu-Leu-norleucinal (LLnL) was added to allow 

accumulation of the ubiquitinated protein. 

remnant–profiling approach, we selected a subset of six proteins identified 

by MS and assessed whether they were ubiquitinated in cells. 

Lysates from HEK293 cells were immunoprecipitated with antibodies 

specific for the protein under investigation and immunoblotted 

using an anti-ubiquitin antibody (Fig. 2d). In these experiments, the 

HEK293 cells were not transfected with plasmids expressing 

His₆-tagged ubiquitin. In each case, the immunopurified protein exhibits 

anti-ubiquitin immunoreactivity consistent with the endogenous 

ubiquitination of these proteins. 

The ubiquitination targets include disease-related proteins, such 

as 14-3-3ε, ataxin, β-catenin, BRCA1-associated protein and TTRAP 

(TRAF and TNF receptor-associated protein). The proteins identified 

by ubiquitin remnant profiling have roles in numerous biological 

processes, of which the largest number involve metabolism, cell cycle/ 

apoptosis and signal transduction (Fig. 3a). Additionally, we identified 

proteins that influence the trafficking, localization and structure of 

proteins, as well as regulate the immune system, consistent with previously 

reported roles for ubiquitination 11–14 . Ubiquitination of many 

ubiquitin-conjugating enzymes, ubiquitin ligases and 26S proteasome 

regulatory subunits also supports previous studies that reported the 

prevalence of ubiquitination of proteins involved in proteasome 

degradation pathways 15,16 . Some of the proteins found to be ubiquitinated 

extend earlier findings regarding the role of ubiquitination in 

certain cellular processes. For example, although histone H2 ubiquitination 

has been described 3 , we found that histone H1, H3 and H4 isoforms 

are also ubiquitinated, as are subunits of histone acetyltransferases and 

histone deacetylases. These findings support the idea that ubiquitin 

[Figure 2, panels a–d (image not recoverable; see caption above). (a) Workflow: trypsin digestion, immunopurification of diglycyl-lysine peptides, nLC-MS/MS analysis. (b) MALDI-TOF spectra (relative intensity versus m/z, 1,000–2,000): before purification, GGDRVYIHPFHL and Ac-SYSMEHFRWGK*PV-NH₂ are both labeled; after purification, only Ac-SYSMEHFRWGK*PV-NH₂ is labeled (Met-Ox and +Na species also marked). (c) MS/MS spectra (m/z axis) of GALAK*LVEAIR (60S ribosomal protein L7a, K217) and DIEDVFYK*YGAIR (splicing factor, arginine/serine-rich 1, K38). (d) Western blotting for ubiquitin after IP of β-14-3-3, vimentin, NAP1L1, PARP1, HSP70 and β-catenin, each alongside an IgG control, with and without LLnL.] 

contributes to epigenetic gene regulation through multiple pathways. 

Many heat shock proteins, such as HSP70, HSP105, and 

HSC71, are ubiquitinated, linking ubiquitination to stress responses. 

Ubiquitination of several heterogeneous nuclear ribonucleoproteins 

reveals a role for ubiquitination in mRNA processing, metabolism, 

transport and splicing. Our studies also identify numerous transcription 

factors, splicing factors, DNA repair proteins and kinases. This 

supports the well-characterized role for ubiquitination in regulating 

cellular signal transduction. 

The subcellular distribution of the detected proteins is likely to 

reflect, in part, the subcellular fractions that were used for MS/MS 

analysis. Subcellular localization analysis of the identified proteins 

indicates that the largest share of the ubiquitinated proteins (48.2%) is cytosolic 

(Fig. 3a, right panel), which is consistent with the general observation 

that ubiquitination occurs primarily in the cytosolic compartment of the 

cell 12 . Many of the identified proteins are localized to the nucleus, and 

several proteins are localized to the mitochondria, suggesting a role for 

ubiquitination in regulating aspects of mitochondrial function. 

We next wanted to gain insight into how lysine ubiquitination 

might be regulated at the level of primary and secondary structure. 

Interestingly, ubiquitin remnant–modified lysines have a slight 

tendency to be localized in regions enriched in small hydrophobic 

residues, such as alanine, leucine, isoleucine, glycine, proline and 

valine (Supplementary Fig. 7a). Examination of a six-amino-acid 

window adjacent to ubiquitinated lysines in the human proteome 

revealed that cysteine, histidine and lysine are found at a ~40% 

lower frequency than when they are adjacent to lysines in general 



Figure 3 Bioinformatic analysis of ubiquitinated 

proteins and ubiquitin-modified lysines. (a) Pie 

charts of biological processes and subcellular 

localization of ubiquitinated proteins analyzed 

using the PANTHER and PENCE Proteome 

Analyst databases, respectively. Proteins were 

designated ‘other’ if their localizations or 

functions were not annotated in the database. 

(b) Backbone amino acid sequence analysis 

of ubiquitinated peptides. A density map of 

the ratios of the frequencies of each of the 

20 amino acids adjacent to the ubiquitinated 

lysines and adjacent to lysines in general was 

plotted using MATLAB. Several amino acids 

are slightly enriched at certain positions, 

such as leucine at +2, valine at −2, alanine 

at −5, glycine at +6, and tyrosine at −1 and 

+1, determined by Rosner’s test with a 95% 

confidence. (c) Ubiquitinated lysines (Ub 

Lys) possess an increased solvent accessible 

area (SAA) relative to lysines in general. The 

distribution of SAA of both populations of lysines indicates an increase in SAA among ubiquitinated lysines. The two distributions are significantly 

different (Student’s t test, P < 0.001). The results were obtained from an analysis of 89 PDB structures (140 ubiquitinated lysines, 3,970 total lysines). 

(d) Distribution of secondary structures of all lysines and ubiquitinated lysines obtained from an analysis of 89 PDB structures. The disordered region 

was predicted by DisEMBL for all ubiquitinated proteins identified by our MS experiments. χ² test: P < 0.001. 

(Supplementary Fig. 7a). Analysis involving Motif-x 17 identified 

K*XL as a potential consensus ubiquitination site. This motif appears 

to be ~1.8 times more common among ubiquitinated lysines than 

lysines in general (Supplementary Fig. 7b). To compare all 20 amino 

acids for their propensity to be found at specific residues adjacent to 

ubiquitinated lysines, we prepared a density map that indicates the 

frequency of each amino acid at any of the ten proximal positions on 

either side of the ubiquitinated lysines, compared to the frequency of 

that amino acid next to lysines in general, as assessed by surveying 

the human proteome (Fig. 3b). This analysis shows that there is only 

a subtle enrichment for specific residues at some positions, such as 

leucine at the +2 position, valine at the −2 position, alanine at the −5 

position, glycine at the +6 position, and tyrosine at the −1 and +1 

positions. In contrast, an analysis of ubiquitinated proteins in yeast 7 

indicates a significant enrichment of aspartic acid, glutamic acid, 

histidine and proline at some positions (Supplementary Fig. 7c). 
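
The density-map comparison described above amounts to a ratio of position-specific amino acid frequencies. The sketch below is one way to compute such ratios, assuming sequence windows centred on each lysine have already been extracted; the function names and the toy windows are hypothetical, and the paper's own map was plotted in MATLAB from proteome-wide data.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(windows, flank):
    """Per-offset amino acid frequencies for windows centred on a lysine.

    Each window is a string of length 2*flank + 1 with the lysine in the
    middle; any character outside the 20 amino acids (for example '-' for
    positions beyond the protein terminus) is ignored.
    """
    freqs = {}
    for offset in range(-flank, flank + 1):
        if offset == 0:
            continue  # the central lysine itself is not scored
        counts = Counter(
            w[offset + flank] for w in windows if w[offset + flank] in AMINO_ACIDS
        )
        total = sum(counts.values())
        freqs[offset] = (
            {aa: counts[aa] / total for aa in AMINO_ACIDS} if total else {}
        )
    return freqs


def enrichment_map(modified_windows, background_windows, flank=10):
    """Frequency ratio (modified / background) per offset and amino acid."""
    mod = position_frequencies(modified_windows, flank)
    bg = position_frequencies(background_windows, flank)
    ratios = {}
    for offset in mod:
        ratios[offset] = {}
        for aa in AMINO_ACIDS:
            b = bg.get(offset, {}).get(aa, 0.0)
            m = mod.get(offset, {}).get(aa, 0.0)
            ratios[offset][aa] = m / b if b > 0 else float("nan")
    return ratios


# Toy windows with two flanking residues on each side (illustration only).
modified = ["AVKAL", "GVKSL", "SAKEL", "AVKDG"]
background = ["AVKAL", "GSKSG", "SAKEA", "LEKDG",
              "AVKTL", "GGKAA", "TAKEL", "SVKDG"]

ratios = enrichment_map(modified, background, flank=2)
print(f"Leu at +2: {ratios[2]['L']:.2f}")
print(f"Val at -1: {ratios[-1]['V']:.2f}")
```

A ratio near 1 indicates no preference; values above or below 1 correspond to the enriched or depleted cells of the map.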

To determine whether the sequence of the immunogen affected the 

specificity of the immunoprecipitated peptides, we generated a similar 

density map to present the frequency of each amino acid adjacent to 

the Gly-Gly–modified lysines in the immunogen. Although there are 

[Figure 3, panels a–d (image not recoverable; see caption above). (a) Biological process: metabolism 49.6%, cell cycle/apoptosis 13.0%, signal transduction 9.1%, other/unclassified 9.1%, protein trafficking/localization 8.3%, structure 4.4%, immunity/defence 4.2%, small-molecule transport 2.3%. Subcellular localization: cytoplasm 48.2%, nucleus 28.7%, other 15.8%, mitochondria 3.6%, endoplasmic reticulum 2.4%, Golgi 2.4%, plasma membrane 1.2%. (b) Density map of the 20 amino acids at positions −10 to +10 (scale 0–2.5). (c) Percentage of All Lys and Ub Lys versus relative solvent accessible area (0–20, 20–40, 40–60, 60–80 and 80–100%). (d) Fraction of All Lys and Ub Lys in helix, strand, coil and disordered regions.] 

marked amino acid preferences adjacent to lysine in the immunogen 

(Supplementary Fig. 7d), these preferences are not seen in peptides 

pulled down by the anti–diglycyl-lysine antibody (Supplementary 

Fig. 7d). This suggests that the sequence of the immunogen used to 

generate our immunoaffinity reagent does not substantially bias the 

sequences of the immunoprecipitated peptides. 

We found that ubiquitinated lysines have a slight tendency to appear 

on protein surfaces in preferred structural contexts. Structural information 

is available in Protein Data Bank (PDB) for 89 of the proteins 

identified in this study. Measurements of the solvent-accessible area 

of lysines in these proteins indicate that ubiquitinated lysines tend 

to be exposed slightly more to solvent than other lysines (Fig. 3c, 

Student’s t test, P < 0.001). If lysines with >50% surface exposure are 

considered solvent exposed 18 , 60% of the ubiquitinated lysines are 

exposed, which is more than for lysines in general (45%). Overall, 

ubiquitinated lysines are ~6.5% more exposed than all the lysines. 

This is in agreement with a ubiquitination site survey for yeast 19 . 

Interestingly, in some cases, the ubiquitinated lysine is fully buried 

(e.g., Supplementary Fig. 8). In these proteins, ubiquitination may 

be regulated by stimuli that induce the exposure of the lysine to the 

[Figure 4 spectra (image not recoverable; see caption below). Fraction versus m/z for the Lys0 and Lys8 forms of DLSHIGDAVVISCAK*DGVK (K164; Lys0:Lys8 = 1.08 ± 0.05; m/z 524–534) and YYLAPK*IEDEEGS (K254; Lys0:Lys8 = 1.47 ± 0.09; m/z 814–822).] 

Figure 4 Colchicine differentially regulates the ubiquitination of two 

lysines in PCNA. HEK293 cells were grown in SILAC medium containing 

either light (Lys0) or heavy (Lys8) lysine, and transfected with a plasmid 

expressing His₆-ubiquitin. Whereas Lys0-labeled cells were treated with 

10 μM colchicine, Lys8-labeled cells were treated with vehicle for 16 h. 

Identical amounts of cells from each treatment were mixed and processed 

for MS analysis of ubiquitin remnant–containing peptides. The relative 

ratio of MS signals between Lys0- and Lys8-labeled peptides was used 

for relative quantification of the change in ubiquitination at K164 and 

K254. The observed ratio was normalized to the change in PCNA protein 

abundance in the two samples by measuring two unmodified PCNA 

peptides in the initial mixed cell lysate (Supplementary Fig. 11). The 

observation that the ion intensity of the novel ubiquitination site (K254) 

is about 20% of that of K164 suggests that its ubiquitination may be less 

common or more transient than K164. This may explain why it was not 

detected previously in mutagenesis studies 33 . All data are the averages 

of experiments repeated three times. Note that the peptide ubiquitinated 

at K254 is the C-terminal tryptic peptide of the protein so that the last 

amino acid is neither K nor R, and the charge state of this peptide is +2. 




surface. Analysis of the local secondary structure surrounding all 

lysines and ubiquitinated lysines indicates that ubiquitinated lysines 

prefer helical structures compared to all lysines, although ubiquitination 

sites can also be found in other structural contexts (Fig. 3d). 

Additional crystal structures of proteins that are susceptible to ubiquitination 

are needed to fully assess the solvent exposure and structural 

contexts of ubiquitinated lysines. 
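
The solvent-exposure comparison above rests on a simple threshold rule: a lysine counts as exposed when its relative solvent-accessible area exceeds 50%. The sketch below illustrates that rule on hypothetical values; the numbers are invented for illustration and are not the measurements from the 89 PDB structures.

```python
from statistics import mean


def exposed_fraction(relative_saa, threshold=50.0):
    """Fraction of lysines whose relative SAA (%) exceeds the threshold."""
    if not relative_saa:
        raise ValueError("no lysines supplied")
    return sum(value > threshold for value in relative_saa) / len(relative_saa)


# Hypothetical relative SAA values (%), for illustration only.
ub_lys = [72.0, 55.0, 12.0, 64.0, 81.0, 38.0, 58.0, 47.0, 90.0, 66.0]
all_lys = ub_lys + [20.0, 33.0, 41.0, 9.0, 52.0, 28.0, 61.0, 44.0, 15.0, 36.0]

print(f"Ub Lys exposed:  {exposed_fraction(ub_lys):.0%}")
print(f"All Lys exposed: {exposed_fraction(all_lys):.0%}")
print(f"Difference in mean SAA: {mean(ub_lys) - mean(all_lys):.1f} percentage points")
```

The same per-residue values could then be passed to a two-sample test to compare the two distributions, as done in Figure 3c.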

Recently, a large number of lysine acetylation sites have been discovered 

by proteomic approaches 20–23 . Although only 0.6% of lysines 

are predicted to be acetylated based on yeast studies 24 , >20% of the 

lysines that we found to be ubiquitinated are also sites of acetylation. 

For example, all the ubiquitinated lysines in H2B, H3.1 and H4 were 

reported to be acetylated. In the case of tubulin α-1A, four of the six 

ubiquitinated lysines were reported to be acetylated. The surprisingly 

high degree of concordance of lysine ubiquitination and acetylation 

sites suggests that acetylation of a specific lysine residue could serve 

as a means to prevent lysine ubiquitination 25 , or vice versa. A BLAST 

analysis of ubiquitination sites in human proteins against mouse, rat 

and yeast revealed that modified lysines are statistically more conserved 

between these species than lysines in general (Supplementary 

Fig. 9). This suggests that the pathways leading to the ubiquitination 

of these sites may be evolutionarily conserved. 
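
To put the acetylation overlap in perspective, the shared fraction can be compared with the 0.6% background rate quoted above. The following sketch assumes sites are represented as (protein, position) pairs; the identifiers are hypothetical and only illustrate the arithmetic.

```python
def overlap_with_acetylation(ub_sites, ac_sites, background_rate):
    """Return (fraction of ubiquitinated sites also acetylated, fold over background)."""
    if not ub_sites:
        raise ValueError("no ubiquitination sites supplied")
    if background_rate <= 0:
        raise ValueError("background rate must be positive")
    fraction = len(ub_sites & ac_sites) / len(ub_sites)
    return fraction, fraction / background_rate


# Hypothetical site identifiers (protein, lysine position), for illustration only.
ub = {("P1", 10), ("P1", 42), ("P2", 7), ("P3", 120), ("P4", 88)}
ac = {("P1", 42), ("P5", 3)}

fraction, fold = overlap_with_acetylation(ub, ac, background_rate=0.006)
print(f"{fraction:.0%} of sites shared, {fold:.0f}-fold over the 0.6% background")
```

With an overlap of 20%, the lower bound reported in the text, the shared fraction is roughly 33 times the predicted background rate.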

In cases where a protein is ubiquitinated at more than one site, it is 

particularly challenging to monitor how the ubiquitination at the individual 

sites is independently regulated. We therefore examined two 

proteins exhibiting multi-ubiquitination: tubulin α-1A and PCNA, 

a protein that regulates cell cycle progression 26 and has been linked 

to tumorigenesis 27 . We labeled His₆-ubiquitin–expressing HEK293T 

cells with either light (Lys0) or heavy (Lys8) lysine to quantify ubiquitination 

using the SILAC (stable isotope labeling by amino acids in 

cell culture) approach 28 (Supplementary Fig. 10). We treated cells for 

16 h with either vehicle (Lys8) or 10 μM colchicine (Lys0), an inhibitor 

of microtubule polymerization that affects progression through the 

cell cycle 29 , before mixing, lysing and processing cells as described in 

the Online Methods. We then analyzed the samples by nanoLC-MS 

to quantify ubiquitination at the PCNA ubiquitination sites that we 

had previously identified using MS/MS based on their retention time, 

mass-to-charge ratio (m/z) and charge states. We quantified relative 

ubiquitination at each modification site by normalization using protein 

abundance, as measured by the averaged light-to-heavy ratio of 

unmodified peptides detected from initial mixed cell lysate before any 

affinity purification 30 (Supplementary Fig. 11). Interestingly, whereas 

the ubiquitination of K164 was unaffected by colchicine treatment, 

the ubiquitination of K254 was increased by 47% (Fig. 4). 
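
The normalization step described above divides each site's light-to-heavy ratio by the protein-level light-to-heavy ratio. A minimal sketch follows, assuming peak intensities have already been reduced to ratios; the raw values below are hypothetical and chosen only to show the calculation.

```python
from statistics import mean


def normalized_site_ratio(site_light_heavy, unmodified_light_heavy):
    """Divide a site's light:heavy ratio by the protein-level light:heavy ratio.

    The protein-level ratio is estimated as the mean ratio of unmodified
    peptides measured in the mixed lysate before affinity purification.
    """
    protein_ratio = mean(unmodified_light_heavy)
    if protein_ratio <= 0:
        raise ValueError("protein-level ratio must be positive")
    return site_light_heavy / protein_ratio


def percent_change(ratio):
    """Express a treated:control ratio as a percentage change."""
    return (ratio - 1.0) * 100.0


# Hypothetical raw values, for illustration only.
raw_site_ratio = 1.62
unmodified_ratios = [1.05, 1.15]   # two unmodified peptides of the same protein

ratio = normalized_site_ratio(raw_site_ratio, unmodified_ratios)
print(f"normalized Lys0:Lys8 = {ratio:.2f} ({percent_change(ratio):+.0f}%)")
```

Under this convention, the reported Lys0:Lys8 ratio of 1.47 for K254 corresponds to the 47% increase stated in the text, and 1.08 for K164 is read as essentially unchanged.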

We also examined the multi-ubiquitination of tubulin α-1A. 

Treatment with colchicine resulted in a similar ~80% decrease in 

the ubiquitination of K326, K336 and K370. Surprisingly, treatment 

with vinblastine, which also disrupts microtubules, albeit through a 

distinct mechanism 31,32 , resulted in an opposite effect on ubiquitination, 

with a ~40% increase in ubiquitination at each of these sites 

(Supplementary Figs. 12 and 13). These results highlight how some 

ubiquitination sites may be ubiquitinated in a dynamic manner, for 

example, in response to specific signals, whereas other ubiquitination 

sites may be ‘constitutive’. In the case of both PCNA and tubulin α-1A, 

ubiquitin remnant profiling provided insights into how distinct 

ubiquitination sites respond to different experimental treatments in 

a manner not readily achievable with existing approaches. 

The ubiquitin remnant–profiling approach described here provides 

a simple and robust strategy to identify and quantify sites of ubiquitination 

in cells. It could be used to identify ubiquitination patterns 

in cells and tissues with altered expression of ubiquitin ligases or 

deubiquitinating enzymes, as well as to profile changes in ubiquitination 

elicited by various signaling molecules, drugs and disease states. 

Although the present study used cells expressing His₆-tagged ubiquitin 

to reduce the likelihood of obtaining diglycine-modified peptides from 

ISG15- and NEDD8-modified proteins, ubiquitin-modified proteins 

could readily be enriched using immobilized ubiquitin-binding 

proteins, such as S5a, or ubiquitin antibodies 5 in cells and tissues not 

amenable to transfection. 

Methods 



Accession code. MS/MS data and the identifications are deposited 

in the open access public repository PRIDE (http://www.ebi.ac.uk/pride/) 

with the accession code 12018. 



We thank T. Neubert and G. Zhang (New York University) for useful suggestions, 

P. Zhou (Weill Cornell Medical College, WCMC) for the His₆-ubiquitin plasmid, 

U. Hengst, A. Deglincerti, R. Almeida and B. Derakhshan for the assistance 

during initial cell culturing, S. Gross and Y. Ma (WCMC Mass Spectrometry Core 

Facility) for helpful discussion of MS/MS analysis, and F. Campagne, L. Skrabanek and 

J. Sun (WCMC Institute for Computational Biomedicine) for instructions and 

assistance in bioinformatic analysis. The mass spectrometry work was performed 

at the WCMC Mass Spectrometry Core Facility using instrumentation supported 

by US National Institutes of Health (NIH) RR19355 and RR22615. This work 

was supported by grants from Weill Cornell, NIH (MH086128) (S.R.J.), and 

a pharmacology cancer training grant from the National Cancer Institute 

(T32CA062948) (G.X. and J.S.P.). 


S.R.J. and G.X. conceived and designed the study. G.X. and J.S.P. conducted 

the experiments, and G.X. and S.R.J. analyzed the data. S.R.J. and G.X. wrote 

the manuscript. 






1. Hershko, A. & Ciechanover, A. The ubiquitin system. Annu. Rev. Biochem. 67, 

425–479 (1998). 

2. Xu, P. & Peng, J. Dissecting the ubiquitin pathway by mass spectrometry. Biochim. 

Biophys. Acta 1764, 1940–1947 (2006). 

3. Ericsson, C., Goldknopf, I.L. & Daneholt, B. Inhibition of transcription does not 

affect the total amount of ubiquitinated histone 2A in chromatin. Exp. Cell Res. 

167, 127–134 (1986). 

4. Galluzzi, L., Paiardini, M., Lecomte, M.C. & Magnani, M. Identification of the main 

ubiquitination site in human erythroid alpha-spectrin. FEBS Lett. 489, 254–258 

(2001). 

5. Tomlinson, E., Palaniyappan, N., Tooth, D. & Layfield, R. Methods for the purification 

of ubiquitinated proteins. Proteomics 7, 1016–1022 (2007). 

6. Beers, E.P. & Callis, J. Utility of polyhistidine-tagged ubiquitin in the purification 

of ubiquitin-protein conjugates and as an affinity ligand for the purification of 

ubiquitin-specific hydrolases. J. Biol. Chem. 268, 21645–21649 (1993). 

7. Peng, J. et al. A proteomics approach to understanding protein ubiquitination. 


8. Srikumar, T., Jeram, S.M., Lam, H. & Raught, B. A ubiquitin and ubiquitin-like 

protein spectral library. Proteomics 10, 337–342 (2010). 

9. Hershko, A., Heller, H., Elias, S. & Ciechanover, A. Components of ubiquitin-protein 

ligase system. Resolution, affinity purification, and role in protein breakdown. 

J. Biol. Chem. 258, 8206–8214 (1983). 

10. Denis, N.J., Vasilescu, J., Lambert, J.P., Smith, J.C. & Figeys, D. Tryptic digestion 

of ubiquitin standards reveals an improved strategy for identifying ubiquitinated 

proteins by mass spectrometry. Proteomics 7, 868–874 (2007). 

11. Rechsteiner, M. Ubiquitin-mediated pathways for intracellular proteolysis. Annu. 

Rev. Cell Biol. 3, 1–30 (1987). 

12. Bonifacino, J.S. & Weissman, A.M. Ubiquitin and the control of protein fate in the 

secretory and endocytic pathways. Annu. Rev. Cell Dev. Biol. 14, 19–57 (1998). 



13. Kirkpatrick, D.S., Denison, C. & Gygi, S.P. Weighing in on ubiquitin: the expanding 

role of mass-spectrometry-based proteomics. Nat. Cell Biol. 7, 750–757 (2005). 

14. Sun, L. & Chen, Z.J. The novel functions of ubiquitination in signaling. Curr. Opin. 

Cell Biol. 16, 119–126 (2004). 

15. Etlinger, J.D., Li, S.X., Guo, G.G. & Li, N. Phosphorylation and ubiquitination of 

the 26S proteasome complex. Enzyme Protein 47, 325–329 (1993). 

16. Peters, J.M. Subunits and substrates of the anaphase-promoting complex. Exp. Cell 

Res. 248, 339–349 (1999). 

17. Schwartz, D. & Gygi, S.P. An iterative statistical approach to the identification of 

protein phosphorylation motifs from large-scale data sets. Nat. Biotechnol. 23, 

1391–1398 (2005). 

18. Ahmad, S. & Gromiha, M.M. NETASA: neural network based prediction of solvent 

accessibility. Bioinformatics 18, 819–824 (2002). 

19. Catic, A., Collins, C., Church, G.M. & Ploegh, H.L. Preferred in vivo ubiquitination 

sites. Bioinformatics 20, 3302–3307 (2004). 

20. Choudhary, C. et al. Lysine acetylation targets protein complexes and co-regulates 

major cellular functions. Science 325, 834–840 (2009). 

21. Gnad, F. et al. PHOSIDA (phosphorylation site database): management, structural 

and evolutionary investigation, and prediction of phosphosites. Genome Biol. 8, 

R250 (2007). 

22. Kim, S.C. et al. Substrate and functional diversity of lysine acetylation revealed by 

a proteomics survey. Mol. Cell 23, 607–618 (2006). 

23. Zhao, S. et al. Regulation of cellular metabolism by protein lysine acetylation. 

Science 327, 1000–1004 (2010). 

24. Basu, A. et al. Proteome-wide prediction of acetylation substrates. Proc. Natl. Acad. 

Sci. USA 106, 13785–13790 (2009). 

25. Yang, X.J. & Seto, E. Lysine acetylation: codified crosstalk with other posttranslational 

modifications. Mol. Cell 31, 449–461 (2008). 

26. Prosperi, E. Multiple roles of the proliferating cell nuclear antigen: DNA replication, 

repair and cell cycle control. Prog. Cell Cycle Res. 3, 193–210 (1997). 

27. Mayer, A. et al. The prognostic significance of proliferating cell nuclear antigen, 

epidermal growth factor receptor, and mdr gene expression in colorectal cancer. 

Cancer 71, 2454–2460 (1993). 

28. Ong, S.E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as 

a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 

1, 376–386 (2002). 

29. Jordan, M.A. Mechanism of action of antitumor drugs that interact with microtubules 

and tubulin. Curr. Med. Chem. Anticancer Agents 2, 1–17 (2002). 

30. Wisniewski, J.R. et al. Constitutive and dynamic phosphorylation and acetylation 

sites on NUCKS, a hypermodified nuclear protein, studied by quantitative 

proteomics. Proteins 73, 710–718 (2008). 

31. Gigant, B. et al. Structural basis for the regulation of tubulin by vinblastine. Nature 

435, 519–522 (2005). 

32. Ravelli, R.B. et al. Insight into tubulin regulation from a complex with colchicine 

and a stathmin-like domain. Nature 428, 198–202 (2004). 

33. Unk, I. et al. Human SHPRH is a ubiquitin ligase for Mms2-Ubc13-dependent 

polyubiquitylation of proliferating cell nuclear antigen. Proc. Natl. Acad. Sci. USA 

103, 18107–18112 (2006). 





Antigen synthesis and antibody production. Lysine-rich histone from 

calf thymus (type III-S, Sigma) was dissolved in 100 mM NaHCO₃ buffer 

(10 ml) at pH 10. A 500-μl aliquot of t-butyloxycarbonyl-Gly-Gly-N-hydroxysuccinimide 

(50 mM, Boc-Gly-Gly-NHS, ref. 34) in DMSO was added to the histone solution 

and the reaction was carried out at 25 °C for 1 h with constant shaking on a 

plate rotator. This step was repeated three additional times and sample B 

was obtained. For deprotection of the Boc group, neat trifluoroacetic acid 

(6 ml, TFA, Sigma) was added and the solution was shaken for 2 h at 25 °C. 

The reaction was stopped by neutralizing with 10 M NaOH dropwise on 

ice (sample C). Samples A, B and C were dialyzed four times against 20 mM 

acetic acid followed by lyophilization. The degree of the reaction was assessed 

by anti-biotin (Sigma) western blot analysis after samples A, B and C were 

reacted with 5 mM biotin-NHS (Sigma) for 10 min. The same protocol 

was used to prepare Boc-Gly-Gly– and Gly-Gly–modified β-lactoglobulin, 

hen egg white lysozyme, rat brain lysate and peptides (DRVYIHPFHL and 

Ac-SYSMEHFRWGKPV-NH₂) for antibody evaluation. 
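
The reagent quantities in the derivatization step can be tallied as a quick check. This sketch assumes that "repeated three additional times" means four additions in total, each of 500 μl of 50 mM Boc-Gly-Gly-NHS; the helper name is illustrative.

```python
def micromoles(volume_ul, concentration_mM):
    """Amount in micromoles from a volume in µl and a concentration in mM."""
    return volume_ul * concentration_mM / 1000.0


additions = 1 + 3                        # first addition plus three repeats (assumed reading)
per_addition = micromoles(500, 50)       # 500 µl of 50 mM Boc-Gly-Gly-NHS
total_reagent = additions * per_addition
total_dmso_ml = additions * 500 / 1000.0  # DMSO volume added to the 10-ml histone solution

print(f"{per_addition:.0f} µmol per addition, {total_reagent:.0f} µmol in total")
print(f"{total_dmso_ml:.1f} ml DMSO added to 10 ml of buffer")
```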

The antigen was injected into mice for antibody production, and hybridoma 

clones were made by Promab. Cells of monoclonal clones were grown 

in MegaCell Dulbecco’s Modified Eagle’s Medium (MegaCell DMEM, pH 7.2, 

Sigma) supplemented with 10% fetal bovine serum (FBS), 50 μg/ml kanamycin and 1 mM 

glutamine; cells were split and cell culture supernatant was collected 

every week. 

Hybridoma clone GX41 was obtained after screening a panel of hybridomas 

to assess their utility in detecting diglycine-modified lysines. Antibodies 

from each hybridoma clone were first evaluated by western blot analysis using 

Gly-Gly–modified β-lactoglobulin, lysozyme and rat brain lysate. Clones were 

selected based on the absence of reactivity with unmodified protein and lysates, 

absence of reactivity with proteins and lysate modified with Boc-Gly-Gly, and 

reactivity with Gly-Gly–modified proteins and lysate. The top five clones, chosen 

for their ability to recognize the largest 

number of bands in the Gly-Gly–modified rat brain lysate, were further characterized. Antibodies from 

these clones were purified and used for immunoprecipitation of ubiquitin 

remnant–containing peptides from His₆-ubiquitin–expressing HEK293 cells, 

and tandem MS identification of tryptic ubiquitinated peptides to assess the 

degeneracy of antibodies. Only clone GX41 pulled down peptides that contained 

each of the 20 amino acids N-terminal to the modified lysine and each 

of the 20 amino acids C-terminal to the modified lysine, suggesting that the 

antibody can bind peptides which contain the diglycyl-lysine in a wide range 

of sequence contexts, which was supported by subsequent characterization of 

the amino acid context of the diglycyl-lysine obtained from a larger data set of 

ubiquitin remnant peptides (Fig. 3b and Supplementary Fig. 7a). The GX41 

anti–diglycyl-lysine monoclonal antibody was found to be IgG1κ isotype. This 

antibody was used for all the experiments in this study. 
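
The degeneracy criterion used to select clone GX41 (all 20 amino acids observed on both sides of the modified lysine) can be expressed as a coverage check. The sketch below assumes peptides are written in the notation of the text, with `*` following the modified lysine; the function names are hypothetical and this is not the authors' screening code.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def flanking_residues(peptides, marker="*"):
    """Collect residues immediately N- and C-terminal to each modified lysine.

    Peptides use the notation of the text, e.g. 'GALAK*LVEAIR', where the
    marker follows the modified lysine.
    """
    n_side, c_side = set(), set()
    tag = "K" + marker
    for pep in peptides:
        idx = pep.find(tag)
        while idx != -1:
            if idx > 0:
                n_side.add(pep[idx - 1])
            after = idx + len(tag)
            if after < len(pep):
                c_side.add(pep[after])
            idx = pep.find(tag, idx + 1)
    return n_side, c_side


def covers_all_contexts(peptides):
    """True if all 20 amino acids occur at both the -1 and +1 positions."""
    n_side, c_side = flanking_residues(peptides)
    return AMINO_ACIDS <= n_side and AMINO_ACIDS <= c_side


# The two peptides annotated in Figure 2c cover only a few contexts.
example = ["GALAK*LVEAIR", "DIEDVFYK*YGAIR"]
print(flanking_residues(example))
print(covers_all_contexts(example))
```

A clone would pass this check only when its pulled-down peptide set spans every residue at both flanking positions, which in practice requires a data set of many peptides.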

Antibody purification and coupling. Gly-Gly–modified β-lactoglobulin was 

coupled to Affi-Gel 10 resin (Bio-Rad) at a concentration of 5 mg protein/ml 

resin in a pH 8 HEPES buffer overnight at 4 °C. The resin was quenched with 1 M 

Tris-HCl (pH 8), washed with three volumes of 10 mM citric acid (pH 3) 

and PBS. Cell culture supernatant (50 ml) from monoclonal cell lines was 

loaded six times into an 8-cm column with 1 ml Affi-Gel 10 resin coupled with 

Gly-Gly–modified β-lactoglobulin at 4 °C using a peristaltic pump. The resin 

was washed three times with 6 ml of 2× PBS and three times with 6 ml of PBS. 

The antibody was eluted four times with 0.5 ml 10 mM citric acid (pH 3) and 

immediately neutralized by 50 μl of 1 M HEPES (pH 8). The pH was adjusted 

to 8.5 and the antibody was concentrated by a 15 ml filter device (30 kDa 

molecular weight cutoff, Millipore). The antibody concentration was measured 

by Bradford protein assay (Bio-Rad). Typically, 0.1–0.2 mg of antibody was 

coupled to 20 μl Affi-Gel 10 resin according to the method described above. 

The antibody resin was stored in PBS buffer with 0.1% sodium azide at 4 °C. 

Cell culture and sample preparation. Human embryonic kidney (HEK) 293 cells 

were cultured in Dulbecco’s Modified Eagle’s Medium (DMEM, Invitrogen) 

supplemented with 4.5 g/l glucose, 10% FBS, 100 units/ml penicillin G and 

100 μg/ml streptomycin. When the confluence reached ~50%, cells were 

transfected with 10 μg of a His₆-tagged ubiquitin plasmid per 10-cm Petri 

dish using the calcium phosphate transfection method. Cells were used 1 d 

after transfection and treated with vehicle or the proteasome inhibitor 

LLnL (25 μM; Calbiochem) in DMSO and incubated for 16 h before harvest. 

His₆-ubiquitin is expressed at a fraction of the level of endogenous ubiquitin 

(Supplementary Fig. 14) suggesting that it is unlikely to perturb endogenous 

ubiquitin pathways. The expression of tagged ubiquitin has been widely used 

in proteomics studies of protein ubiquitination 7,35,36 . 

Twenty 10-cm Petri dishes were cultured and cells were washed twice with 

ice-cold PBS. The cells were detached, collected and centrifuged at 1,000g for 

5 min at 4 °C. To increase coverage of ubiquitinated proteins, crude lysates, as 

well as subcellular fractions, including nuclear, membrane and cytosolic fractions, 

were prepared for analysis. For the crude lysate, the cell pellet was lysed 

and His₆-tagged proteins were purified by Ni-NTA resin (Qiagen) in native 

and denaturing conditions according to the manufacturer’s protocol. The lysis 

buffer contained 5 mM chloroacetamide to alkylate cysteines and to inhibit 

ubiquitin ligases and deubiquitinases 9 . The membrane fraction was obtained 

by centrifuging at 100,000g for 60 min after removing the nuclear pellet in the 

presence of 250 mM sucrose. The pellets from the nuclear and membrane fractions 

were dissolved in 8 M urea with 1% Triton X-100 and 0.1% SDS and the proteins 

were purified by Ni-NTA resin in the presence of 10 mM β-mercaptoethanol. 

After immobilized metal affinity purification, ubiquitinated proteins are significantly 

enriched (Supplementary Fig. 15). 

All the samples after Ni-NTA purification were concentrated on an Amicon 

YM10 filter device (Millipore) and separated by SDS-PAGE. Gel pieces were 

treated with 10 mM dithiothreitol at 50 °C for 30 min, followed by 55 mM 

chloroacetamide at 25 °C for 45 min, using methods described previously 37 , 

except that chloroacetamide was used in place of iodoacetamide. In-gel digestion 

and peptide extraction were performed as described 37 . 

The lyophilized peptide mixture was dissolved in 300 μl of buffer containing 

150 mM NaCl, 50 mM Tris-HCl (pH 7.4) and 2 mM EDTA. The sample 

was boiled in a water bath for 10 min to inactivate residual trypsin. 

The peptide mixture was incubated with 20 μl antibody resin for 4 h at 4 °C, 

loaded on a micro-spin column (Pierce) six times, washed three times with 

2× PBS and three times with PBS, and eluted six times with 20 μl 10 mM 

citric acid (pH 3). The eluted peptide mixture was concentrated to 20 μl for 

tandem MS analysis. 

For the MALDI-TOF-MS experiment, a sample containing ~0.3 nmol of 

each peptide, GGDRVYIHPFHL and Ac-SYSMEHFRWGK*PV-NH₂, was prepared

and subjected to immunoprecipitation using the agarose-immobilized 

antibody described above. 

For SILAC quantification, five 10-cm dishes of HEK293T cells were grown 

in media containing either light lysine (Lys0: ¹²C₆¹⁴N₂-Lys) or heavy lysine

(Lys8: ¹³C₆¹⁵N₂-Lys) (Cambridge Isotope Labs) using previously described

procedures for SILAC experiments 38 . The cells were transfected with

His₆-ubiquitin plasmid as described above, and treated with vehicle or drugs

(10 μM colchicine or 1 μM vinblastine, Sigma) in the presence of LLnL (PCNA: 

25 μM for 16 h; tubulin α-1A: 50 μM for 30 min). The cells were mixed and 

purified under denaturing conditions as described above without fractionation.

To normalize the ubiquitinated peptides by unmodified peptides in the 

cell lysate, a small amount of initial mixed cell lysate was digested by trypsin 

followed by tandem MS analysis 30 . 
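
As a rough illustration of the SILAC labeling described above, the sketch below computes the mass shift between Lys0 and Lys8 and the expected m/z of the heavy partner of a light peptide ion. The isotope mass differences are standard values; the function name and arguments are hypothetical and are not taken from the authors' software.

```python
# Mass differences between heavy and light isotopes, in daltons.
DELTA_13C = 1.0033548  # 13C minus 12C
DELTA_15N = 0.9970349  # 15N minus 14N

# Lys8 carries six 13C and two 15N atoms relative to Lys0.
LYS8_SHIFT = 6 * DELTA_13C + 2 * DELTA_15N


def heavy_mz(light_mz: float, charge: int, n_lysines: int) -> float:
    """Expected m/z of the Lys8-labeled partner of a light peptide ion.

    n_lysines counts every lysine in the peptide, including a
    diglycine-modified one, because each lysine is isotopically labeled.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return light_mz + n_lysines * LYS8_SHIFT / charge


if __name__ == "__main__":
    print(f"Lys8 shift: {LYS8_SHIFT:.4f} Da")
    print(f"Heavy partner of m/z 500.0 (2+, one Lys): {heavy_mz(500.0, 2, 1):.4f}")
```

A doubly charged peptide with one lysine is therefore expected to show a light/heavy pair separated by about 4 m/z units.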

Mass spectrometric analysis. For MALDI-TOF-MS, samples were desalted 

by Millipore C18 ZipTip according to the manufacturer’s protocol and eluted

in 2 μl of solvent containing 50% acetonitrile and 0.1% TFA in the presence of

10 mg/ml α-cyano-4-hydroxycinnamic acid (Sigma). The masses of the samples 

were analyzed in the reflector mode by Voyager-DE PRO MALDI-TOF-MS 

(Applied Biosystems). 

The samples purified from cell lysate were analyzed by nanoLC Q-TOF 

MS/MS (Agilent) to obtain peptide sequence information using settings as 

described previously 39 . Briefly, 8 μl of peptide mixtures were loaded onto 

an enrichment column with 97% solvent A and 3% solvent B with a flow rate 

of 3 μl/min. Solvent A consists of 0.1% formic acid (Fluka) and solvent B of 

90% acetonitrile (Fisher) and 0.1% formic acid. Peptides were eluted with a 

gradient from 3% to 40% solvent B in 20 min, followed by a steep gradient 

to 90% solvent B in 5 min at a flow rate of 0.3 μl/min. Mass spectra were 

acquired in the positive-ion mode with automated data-dependent MS/MS 

on the five most intense ions from precursor MS scans and every selected 


doi:10.1038/nbt.1654


precursor peak was analyzed twice within 3 min. In some runs, a list of previously

identified peptides was excluded from MS/MS fragmentation.
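
The elution program above can be written as a piecewise function of time. This is a minimal sketch that assumes both ramps are linear and that the composition is held at 90% solvent B afterwards; neither assumption is stated in the text, and the function name is hypothetical.

```python
def percent_b(t_min: float) -> float:
    """Percentage of solvent B at t_min minutes after the gradient starts.

    Assumes linear ramps: 3% to 40% B over 20 min, then 40% to 90% B
    over the next 5 min, followed by a hold at 90% B.
    """
    if t_min <= 0:
        return 3.0
    if t_min <= 20:
        return 3.0 + (40.0 - 3.0) * t_min / 20.0
    if t_min <= 25:
        return 40.0 + (90.0 - 40.0) * (t_min - 20.0) / 5.0
    return 90.0


if __name__ == "__main__":
    for t in (0, 10, 20, 22.5, 25):
        print(f"t = {t:>4} min: {percent_b(t):.1f}% B")
```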

Database search of MS/MS spectra for peptide and protein identification. 

Analysis of MS/MS spectra for peptide and protein identification was 

performed by protein database searching with Spectrum Mill software 

(Rev A.03.02, Agilent) against the Swiss-Prot database (v57.2, May 5, 2009) 

containing a concatenated reverse database with the same entries and the 

same length for each protein, as described 40 . The use of a decoy database 

to evaluate the false-positive rate for modified peptides may underestimate 

the false identifications as protein modifications can greatly expand the 

search space. Raw spectra were first extracted to MS/MS spectra that could 

be assigned to at least four y- or b-series ions. Scans with the same precursor 

within a mass window of ±0.4 m/z were merged within a time frame of ±15 s, 

charges up to a maximum of 7 were assigned to the precursor ion and the 

¹²C peak was determined by the Data Extractor. Key search parameters were

a minimum matched peak intensity of 50%, a precursor mass tolerance of 

±20 p.p.m., and a product mass tolerance of ±40 p.p.m. A fixed modification 

was carbamidomethylation (same modification as chloroacetamide) for 

cysteines and variable modifications were Gly-Gly modification for lysines 

and oxidation for methionines. It should be noted that there are potentially 

a large number of naturally occurring sequence variants in mammals, but 

very limited data in the databases on these sequences. These variants may 

be missed or misidentified if the sequence variation lies in the same peptide 

that contains the diglycine-modified lysine. The maximal number of

diglycine modifications was set to two. Trypsin was selected as the enzyme for

sample digestion and four missed cleavages were allowed during the database 

search. The threshold used for peptide identification was a Spectrum Mill 

score of ≥ 9, an SPI% (the percentage of the scored peak intensity) of ≥ 50% 

and the difference between forward and reverse scores of ≥2. Under these 

criteria, the false-positive rate is 1, there is a commensurately 

higher likelihood for Pro at the −1 position to be adjacent to a ubiquitinated 

lysine. The highest relative ratio detected was 2.3 and the range of the color 

map was set from 0 to 2.5. The density map was prepared by MATLAB. The 

enriched amino acids were obtained by determining the outliers with a 95% 

confidence using the Rosner’s test 46 . 
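
The peptide acceptance criteria stated in the database search section (Spectrum Mill score ≥ 9, SPI ≥ 50%, forward-reverse score difference ≥ 2, precursor tolerance ±20 p.p.m.) can be expressed as a simple filter. The sketch below is illustrative only: the `PeptideMatch` record and all function names are hypothetical, and the decoy-based error estimate (twice the decoy hits divided by all accepted hits) is one common convention for concatenated target-decoy searches, not necessarily the calculation used by the authors.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PeptideMatch:
    score: float          # Spectrum Mill score
    spi_percent: float    # scored peak intensity, in percent
    fwd_rev_delta: float  # forward score minus reverse score
    is_decoy: bool        # True if the hit maps to the reversed database


def passes_thresholds(m: PeptideMatch) -> bool:
    """Apply the three acceptance thresholds quoted in the text."""
    return m.score >= 9 and m.spi_percent >= 50 and m.fwd_rev_delta >= 2


def within_ppm(observed_mz: float, theoretical_mz: float, tol_ppm: float = 20.0) -> bool:
    """Check a mass error against a tolerance given in parts per million."""
    return abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6 <= tol_ppm


def estimated_false_positive_rate(matches: Iterable[PeptideMatch]) -> float:
    """Estimate the error rate among accepted hits as 2 * decoys / accepted."""
    accepted = [m for m in matches if passes_thresholds(m)]
    if not accepted:
        return 0.0
    decoys = sum(1 for m in accepted if m.is_decoy)
    return 2 * decoys / len(accepted)
```

Because variable modifications enlarge the search space, an estimate of this kind is best read as a lower bound for modified peptides, as the text itself cautions.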

To assess the structural features of ubiquitinated lysine residues in human

proteins, we searched for crystal structures of all the ubiquitinated proteins in

the Protein Data Bank (PDB). In total, 89 PDB structures (Supplementary

Table 2) contained lysines that we found to be susceptible to ubiquitination

(140 modified lysines and 3,970 total lysines). In cases where multiple PDB

structures for a ubiquitinated protein were reported, the structure with the best

quality was used. The secondary structure types for lysines were determined

using the program DSSP 47 . H and G were considered to be helix, E and B to

be strand, and S, T and all other codes to be coil. The fraction of each secondary structure

type of modified lysines was compared to that of all the lysine residues in

the 89 PDB structures. Disordered regions were predicted by DisEMBL 48

for all identified ubiquitinated proteins and the information for modified

lysines and all lysines was extracted. The relative solvent-accessible area

for the modified and all lysines in the 89 crystal structures was calculated

using NACCESS 49 with a probe of 1.4 Å, which corresponds to the size of a 

water molecule. 
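
The grouping of DSSP codes into three classes, and the comparison of class fractions between modified lysines and all lysines, can be sketched as follows. The function names and the example input are hypothetical; only the code-to-class mapping is taken from the text.

```python
from collections import Counter
from typing import Dict, Iterable

CLASSES = ("helix", "strand", "coil")


def dssp_class(code: str) -> str:
    """Map a one-letter DSSP code to helix, strand or coil as described above."""
    if code in ("H", "G"):
        return "helix"
    if code in ("E", "B"):
        return "strand"
    # S, T and every other code (including a blank) are treated as coil.
    return "coil"


def class_fractions(codes: Iterable[str]) -> Dict[str, float]:
    """Fraction of residues falling in each secondary structure class."""
    counts = Counter(dssp_class(c) for c in codes)
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in CLASSES}


if __name__ == "__main__":
    # Made-up DSSP codes for a handful of lysines, for illustration only.
    modified = ["H", "T", "S", "E"]
    all_lysines = ["H", "H", "G", "E", "B", "T", "S", " "]
    print("modified:", class_fractions(modified))
    print("all:     ", class_fractions(all_lysines))
```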



34. Derrien, D. et al. Muramyl dipeptide bound to poly-l-lysine substituted with mannose 

and gluconoyl residues as macrophage activators. Glycoconj. J. 6, 241–255 (1989). 

35. Kirkpatrick, D.S., Weldon, S.F., Tsaprailis, G., Liebler, D.C. & Gandolfi, A.J. 

Proteomic identification of ubiquitinated proteins from human cells expressing 

His-tagged ubiquitin. Proteomics 5, 2104–2111 (2005). 

36. Xu, P. et al. Quantitative proteomics reveals the function of unconventional ubiquitin 

chains in proteasomal degradation. Cell 137, 133–145 (2009). 

37. Shevchenko, A., Wilm, M., Vorm, O. & Mann, M. Mass spectrometric sequencing 

of proteins from silver-stained polyacrylamide gels. Anal. Chem. 68, 850–858 (1996). 

38. de Godoy, L.M. et al. Status of complete proteome analysis by mass spectrometry: 

SILAC labeled yeast as a model system. Genome Biol. 7, R50 (2006). 

39. Xu, G., Shin, S.B. & Jaffrey, S.R. Global profiling of protease cleavage sites by 

chemoselective labeling of protein N-termini. Proc. Natl. Acad. Sci. USA 106, 

19310–19315 (2009). 

40. Elias, J.E. & Gygi, S.P. Target-decoy search strategy for increased confidence in 

large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 

(2007). 

41. Silva, J.C. et al. Quantitative proteomic analysis by accurate mass retention time 

pairs. Anal. Chem. 77, 2187–2200 (2005). 

42. Mortensen, P. et al. MSQuant, an open source platform for mass spectrometry-based 

quantitative proteomics. J. Proteome Res. 9, 393–403 (2010). 

43. Thomas, P.D. et al. PANTHER: a library of protein families and subfamilies indexed 

by function. Genome Res. 13, 2129–2141 (2003). 

44. Dennis, G. Jr. et al. DAVID: database for annotation, visualization, and integrated 

discovery. Genome Biol. 4, 3 (2003). 

45. Lu, Z. et al. Predicting subcellular localization of proteins using machine-learned 

classifiers. Bioinformatics 20, 547–556 (2004). 

46. Rosner, J. Test of auditory analysis skills (TAAS) in helping children overcome 

learning difficulties: a step-by-step guide for parents and teachers (Academic 

Therapy, New York, 1979). 

47. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern 

recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 

2577–2637 (1983). 

48. Linding, R. et al. Protein disorder prediction: implications for structural proteomics. 

Structure 11, 1453–1459 (2003). 

49. Hubbard, S.J., Campbell, S.F. & Thornton, J.M. Molecular recognition. Conformational 

analysis of limited proteolytic sites and serine proteinase protein inhibitors. J. Mol. 

Biol. 220, 507–530 (1991). 




careers and recruitment 

Second quarter biotech job picture 

Michael Francisco 


In the second quarter of 2010, biotech and pharmaceutical postings on

the three representative job databases tracked by Nature Biotechnology

(Tables 1 and 2) largely stayed the same from the previous quarter

(Nat. Biotechnol. 28, 527, 2010). Noteworthy increases in job openings

were seen from instrument systems and consumables manufacturers Life

Technologies (Carlsbad, CA, USA), PerkinElmer (Waltham, MA, USA),

Bio-Rad Laboratories (Hercules, CA, USA) and Illumina (San Diego).

Table 3 shows selected downsizings within the life science industry.

Nature Biotechnology will continue to follow hiring and firing trends

throughout 2010.

Michael Francisco is Senior Editor, Nature Biotechnology

Table 1 Who’s hiring? Advertised openings at the 25 largest biotech companies

                          Number of    Number of advertised openings b
Company a                 employees    Monster   Biospace   Naturejobs
Monsanto                     21,700          0          0           31
Amgen                        16,800         29         29            1
Genentech                    11,186          6         26          100
Genzyme                      11,000         63          0            0
Life Technologies             9,700         71         89            0
PerkinElmer                   7,900         52          0            0
Bio-Rad Laboratories          6,600         12         17            0
Biomerieux                    6,140          9          0            0
Millipore                     5,900         15         11            0
IDEXX Laboratories            4,700         14          0            0
Biogen Idec                   4,700         38        104            1
Gilead Sciences               3,441          0         26            0
WuXi PharmaTech               3,172          0          0            0
Qiagen                        3,041          0          0            1
Cephalon                      2,780          0          0            0
Biocon                        2,772          0          0            0
Celgene                       2,441         13         10            0
Biotest                       2,108          0         11            0
Actelion                      2,054          4          1            0
Amylin Pharmaceuticals        1,800          9          3            0
Elan                          1,687          7          2            0
Illumina                      1,536          2         27            9
Albany Molecular Research     1,357          0          0            0
Vertex Pharmaceuticals        1,322         40         58            2
CK Life Sciences              1,315          0          0            0

a As defined in Nature Biotechnology’s survey of public companies (27, 710–721, 2009).
b As searched on Monster.com, Biospace.com and Naturejobs.com, 21 July 2010. Jobs may overlap.

Table 2 Advertised job openings at the ten largest pharma companies

                          Number of    Number of advertised openings b
Company a                 employees    Monster   Biospace   Naturejobs
Johnson & Johnson           119,200        522          8           19
Bayer                       106,200         78         27            3
GlaxoSmithKline             103,483          5          1            2
Sanofi-Aventis               99,495         12          1            4
Novartis                     98,200        144         94           20
Pfizer                       86,600          2         81           88
Roche                        78,604         35         31           20
Abbott Laboratories          68,697         67         42            1
AstraZeneca                  67,400         71          7            4
Merck & Co.                  59,800          0         10            0

a Data obtained from MedAdNews.
b As searched on Monster.com, Biospace.com and Naturejobs.com, 21 July 2010. Jobs may overlap.

Table 3 Selected biotech and pharma downsizings

                          Number of
                          employees
Company                   cut        Details
Albany Molecular Research 80         Restructured its US operations, including reducing head
                                     count by about 10% and suspending operations at one of
                                     its research laboratories in Rensselaer, New York.
Cell Therapeutics         36         Reduced head count by 29% to 88 to conserve cash,
                                     with the cuts coming mostly from sales and marketing.
GTC Biotherapeutics       50         Will restructure and reduce head count by 46% to 59 to
                                     save cash.
Helicos BioSciences       40         Reduced head count by 50% to 40 and plans to refocus
                                     its business on molecular diagnostics development.
InterMune                 60         Reduced head count by 40% to 85, with the cuts
                                     coming predominantly in the commercial and discovery
                                     research areas.
Lonza Group               193        Reducing head count by 6% to 2,899 at its R&D and
                                     production site in Visp, Switzerland, to save cash.
Myriad Pharmaceuticals    21         Restructured and reduced head count by 13% to about
                                     140 to focus on its cancer pipeline.
Novartis                  383        Will restructure Novartis Pharmaceuticals and reduce head
                                     count at the US unit. Thirty-five percent of the cuts will
                                     be achieved by not filling vacant positions. The cuts will
                                     primarily come from “headquarter-based functions,” with
                                     minimal impact on the commercial sales organization.
Pfizer                    6,000      Announced plans to restructure its global manufacturing
                                     plant network and reduce manufacturing head count by
                                     18% to 27,000 over the next five years. Plans to close
                                     eight sites in Puerto Rico, Ireland and the US and reduce
                                     operations at another six sites.
Sanofi-Aventis            400        Cuts will primarily come from US sales force, which
                                     previously had 5,700 employees.
Takeda Pharmaceutical     ~1,900     Will reduce head count by about 10% to reduce costs in
                                     its fiscal year 2010 ending March 31, 2011.

Source: BioCentury.


people 


Biogen Idec (Cambridge, MA, USA) has named George Scangos

as its new CEO as well as a member of the board of

directors, replacing the recently retired Jim Mullen. Scangos 

joins Biogen Idec from Exelixis, where he has served as president 

and CEO since 1996. Previously, he spent 10 years at Bayer, 

leaving as president of Bayer Biotechnology. 

“George’s appointment is the culmination of the board’s 

comprehensive selection process to identify the best leader to 

take Biogen Idec to the next level,” says chairman William D. 

Young. “Science is at the heart of our business, and George has 

an exceptional scientific background, as well as significant operational expertise and a 

strong leadership track record.” 

Nile Therapeutics (San Francisco) has appointed 

Richard B. Brewer as executive chairman. 

Brewer brings over 35 years of operational, 

financial and business development expertise 

to Nile. He currently serves as chairman of Arca 

Biopharma and was previously CEO and president 

of Scios, COO of Heartport and senior vice 

president of US marketing at Genentech. 

BioVex (Woburn, MA, USA) has appointed 

Kapil Dhingra to its board of directors. 

Dhingra spent nearly ten years at 

Hoffmann-La Roche, culminating in his 

appointment as vice president and head of 

oncology clinical development. 

Myriad Genetics (Salt Lake City, UT, USA) has 

announced the appointment of Gary A. King 

to the newly created position of executive vice 

president of international operations. King has 

over 25 years of life sciences experience, most 

recently as CEO of AverDx. Prior to AverDx, 

he was vice president, international operations 

at Biosite. 

Dean Mitchell has been named president and CEO of Lux

Biosciences (Jersey City, NJ, USA). Mitchell was formerly

president and CEO of Alpharma and Guilford Pharmaceuticals.

He is also a nonexecutive board member of

ISTA Pharmaceuticals, Intrexon and Talecris

Biotherapeutics. 

Diagnostic kit developer Ingen Biosciences 

(Chilly-Mazarin, France) has appointed 

Karine Mignon-Godefroy as director of 

research and development. She joins Ingen 

from the blood virus division of Bio-Rad, 

where she was director of international projects. 

Before Bio-Rad, she held the post of R&D 

manager at BMD. 

Frank Morich, CEO of NOXXON Pharma 

(Berlin), has announced his intention to leave 

the company effective August 15 to take up 

the position of executive vice president, international 

operations of Takeda Pharmaceutical 

Company. Iain Buchanan, a director of 

NOXXON, will assume the role of interim 

CEO and will support the board during its 

search for a permanent replacement. Buchanan 

has over 30 years of experience in the pharma 

and biotech industry, most recently as CEO 

of Novexel. 

Marine biotechnology company Aquapharm 

Biodiscovery (Oban, UK) has named Tim 

Morley as CSO. Morley has over 20 years of experience

in the pharmaceutical industry, including 

previous positions as research and strategic 

project director at Quotient Biodiagnostics, 

vice president preclinical sciences at Ardana 

Bioscience and senior director molecular and 

cellular pharmacology at Vernalis. 

Exelixis (S. San Francisco, CA, USA) has 

announced the appointment of Michael 

Morrissey as president and CEO, succeeding 

George Scangos. Morrissey will also become 

a member of the board of directors. He joined 

Exelixis in 2000 and served as executive vice 

president, discovery before his appointment 

as president of research and development in 

January 2007. 

Illumina (San Diego) has announced the 

appointment of Nicholas J. Naclerio to the position 

of senior vice president, corporate development. 

Naclerio formerly served as cofounder 

and executive chairman of Quanterix, raising 

$15 million in venture financing to launch the 

company. In addition, Illumina has named 

to its board of directors Gerald Möller, who 

currently serves as an advisor at HBM Bio 

Ventures, a Swiss investment firm. Previously, 

Möller spent 23 years at Boehringer Mannheim 

and Roche, where he held a number of leadership 

positions including CEO of the worldwide 

Boehringer Mannheim Group and head 

of global development and strategic marketing, 

pharmaceuticals for Roche. 

BrainStorm Cell Therapeutics (New York and 

Petach Tikva, Israel) has named Liat Sossover 

as CFO. Sossover has served in senior financial 

positions at a number of publicly traded and 

private companies, most recently as vice president, 

finance at ForeScout Technologies. 

James F. Young has been appointed to the board 

of directors of 3-V Biosciences (Menlo Park, 

CA, USA). He currently serves on the board of 

directors of Novavax. Previously, he served as 

head of MedImmune’s R&D organization and 

was directly involved in the development of 

approximately 20 clinical programs. 

Patrick J. Zenner has been elected to the board 

of directors of Par Pharmaceutical (Woodcliff 

Lake, NJ, USA). Zenner retired in January 2001 

from Hoffmann-La Roche, where he served as 

president and CEO since 1993. He currently 

serves as chairman of the board of ArQule 

and Exact Sciences and as a director of West 

Pharmaceutical Services.
