Peter Fischer Hallin | 2009

Computational tools and Interoperability in Comparative Genomics 

[Cover figure: BLAST matrix of ten Campylobacter genomes, showing homology between and within proteomes as percentages and counts of shared protein families. Genomes compared: C. concisus 13826; C. curvus 525.92; C. fetus subsp. fetus 82-40; C. hominis ATCC BAA-381; C. jejuni RM1221; C. jejuni subsp. doylei 269.97; C. jejuni subsp. jejuni 81-176; C. jejuni subsp. jejuni 81116; C. jejuni subsp. jejuni NCTC 11168; C. lari RM2100.]

PhD thesis | Peter Fischer Hallin | 2009 

Center for Biological Sequence Analysis 

Department of Systems Biology 

Technical University of Denmark 



To my family. Thank you Susanne for your endless support and for giving us two 

wonderful boys, Oliver and Victor.

Preface 

This Ph.D. thesis was written for the Department of Systems Biology, Technical University of Denmark, as part of the Life Science programme and as a requirement for obtaining the Ph.D. degree.

The work was supported through the EMBRACE project, which is funded by the European Commission within the Sixth Framework Programme under the area of “Life sciences, genomics and biotechnology for health”, contract number LSGH-CT-2004-512092.

Part of the work was supported through a grant from the Danish Natural Science Research Council, contract number 26-06-0349, entitled “Comparative Genomics of Campylobacter jejuni”.

The work was carried out at the Center for Biological Sequence Analysis (CBS), Department of Systems Biology, under the supervision of Associate Professor David W. Ussery.

The work on bacterial promoters was carried out during an external stay at the University of California, Davis (UC Davis Genome Center), under the supervision of Professor Craig J. Benham, and was supported through an NSF Research Grant, contract number DBI-0416764.

Lyngby, 28 September, 2009 

Peter Fischer Hallin 

Cover illustration 

The background of the cover shows a “BLAST atlas” of Burkholderia pseudomallei strain 1710b compared with 22 other Burkholderia genomes. The top panel, under the title, shows the P1/P2 rrnB promoter region of E. coli, mapped to different DNA properties. The panel below is a “BLAST matrix” of 10 different Campylobacter strains, showing the overall proteome similarity.


Abstract 

The scientific community is witnessing an explosion in both the number and the complexity of DNA sequencing projects. As sequencing equipment becomes more reliable, faster and less expensive, new possibilities for applying the technology are opening up. The early genome sequencing projects, dating back almost 15 years, presented only individual microbial strains, and the large efforts and scientific achievements of that time qualified for publication in high-ranking journals. Today, however, projects like the Human Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) take sequencing into a new era, studying the genomes and ecological niches of entire populations consisting of thousands of microorganisms. These initiatives create a demand for new analysis tools to process and derive knowledge from the wealth of genomic information.

This thesis describes the development of new tools and methods to study these types of data. When the genomes of characterized strains and environmental samples are sequenced, the ribosomal RNA genes are commonly chosen as a starting point to describe phylogeny and diversity. The rRNA genes are often interpreted as an ‘evolutionary chronometer’, and the RNAmmer software was developed as a tool to quickly and consistently identify the rRNA genes, allowing for large-scale phylogenetic analysis of complex data sets. RNAmmer solved the earlier problem of inaccurate gene boundaries that is observed when BLAST approaches are used to map rRNA genes. The ability to accurately map the start of rRNA transcripts has allowed the investigation of the promoter structures of these highly expressed operons, and a promoter analysis in E. coli K12 is demonstrated by applying a mathematical model of the energetics involved in DNA helix opening.

But a single gene, such as the 16S rRNA, cannot by its nature describe the phenotype or the full coding potential of an organism. This thesis describes the development of BLASTatlas, a visualization tool that gives an overview of similarities and differences between any number of genomes, metagenomic samples or sequence databases from the viewpoint of a reference genome. This software has proved to be a powerful tool for studying the localization and gain or loss of gene clusters, such as pathogenicity islands in virulent organisms. The tool has been used in several research projects and collaborations; it was described in a cover article in Molecular BioSystems in 2008 and highlighted in the journal Chemical Biology. Despite the usefulness of this tool, it became obvious that a web-based version, more “biologist friendly” and with zooming capability, was needed. This led to the GeneWiz browser, which was developed in a joint effort with the IT staff at CBS. The tool enables the user to zoom interactively from a global chromosomal scale down to the nucleotide while maintaining an overview of all data presented in the plot. It features disproportional zooming, as known from Google Maps. At the time of writing this thesis, the work is being published in the second issue of the SIGS journal (Standards in Genomic Sciences).

Since I started my Ph.D. project, a total of 630 prokaryotic genomes have been sequenced and published. This represents on average about four genomes per week! As we gain knowledge from this vast amount of data, new prediction methods become available, allowing for the generation of even more data; examples include the prediction of sigma factor genes, chromosomal replication starts, and secretion systems. This combination of new sequence data and new predictions squares the problem: How do we deal with the challenge that more and more genomic material must be processed through more and more bioinformatic tools? And how is this flow of information formalized and automated so that bioinformaticians can programmatically submit comparisons of any genome to any prediction method anywhere in the world? The need for interoperable and programmable interfaces to these resources is now widely recognized, and machine-to-machine communication through Web Services has gained acceptance. But challenges lie ahead in the transition from web-browser-centric thinking towards interoperability and service-oriented architecture (SOA). During my Ph.D. work, a number of significant contributions to both implementations and server infrastructure have provided remote users with access to CBS prediction servers and databases. This work has been presented both at the general meetings of the EU project that initiated these efforts (EMBRACE) and at various workshops teaching the use of Web Services and Comparative Genomics.


Resumé 

The scientific community is witnessing an explosion in both the number and the complexity of genome sequencing projects. As sequencing equipment becomes faster, more reliable, and also cheaper, new possibilities for applying the technology are opening up. The first genome projects, dating back almost 15 years, presented only single bacterial strains, and the large effort together with the scientific results led to publications in high-ranking journals. Today, projects such as the Human Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) have brought genome sequencing into a new era by characterizing thousands of reference genomes and entire ecosystems consisting of thousands of species. These initiatives will call for new analysis tools to process this flood of information and turn it into knowledge.

This thesis describes methods and tools for studying these types of data. When characterized strains and samples are sequenced, ribosomal RNA is often chosen as the starting point for describing phylogeny and diversity. Ribosomal RNA is often used as an ‘evolutionary chronometer’, and the program RNAmmer was developed as a tool to identify rRNA genes quickly and consistently, which enables more comprehensive phylogenetic analyses of complex data sets. RNAmmer has solved earlier problems with determining the exact annotation of the genes, which have been seen with BLAST-based methods. The ability to map rRNA genes accurately has allowed the promoter structures of these highly expressed operons to be investigated. Subsequently, an existing mathematical energy model of DNA opening was applied to carry out a promoter analysis of the P1/P2 system in E. coli K12.

But a single gene, such as the 16S rRNA, is by its very nature unable to describe an entire organism's phenotype or its full coding potential. This thesis describes the BLASTatlas method, a visualization tool that gives an overview of the similarity between any number of genomes, metagenomic samples or sequence databases, taking a reference genome as the starting point. This software has proved to be an effective instrument for studying single genes or groups of genes that are conserved or have been lost in, for example, disease-causing microorganisms. The tool has been used in several research projects and collaborations, and the method was published as the cover article of the May 2008 issue of Environmental Microbiology. It became clear, however, that the lack of an interactive aspect made the tool difficult for biologists to use. This led to the development of the program GeneWiz Browser, which was developed in collaboration with IT staff at CBS. The tool allows the user to zoom interactively from the global genome down to the single nucleotide while keeping an overview of all data presented in the diagram. The program uses disproportional scaling, as known from, for example, Google Maps. The work is currently being published in Standards in Genomic Sciences.

Since the start of my three-year Ph.D. project, a total of 630 prokaryotic organisms have been fully sequenced and published. This corresponds on average to three genomes per week! As we gain new knowledge from these large amounts of data, new prediction methods are published, for example for sigma factors, chromosomal replication, and secretion systems. This duality underlines the problem: How do we respond to the challenge that more and more genomic material must be processed by more and more bioinformatic tools? And how can this stream of information be formalized and automated in such a way that bioinformaticians and biologists can programmatically run comparisons of any genome against any prediction method anywhere in the world? The need for interoperable and programmable interfaces to these resources is now generally recognized, and computer-to-computer communication through Web Services has gained ground. But challenges lie ahead in the transition from a web-browser-focused way of thinking towards interoperability and Service Oriented Architecture, known as SOA. In my Ph.D. work, a number of significant contributions in the form of implementations and infrastructure have given external users of various CBS tools and databases programmable access via Web Services. These contributions have been presented both at general meetings of the EMBRACE EU project and at various workshops on the use of Web Services.


Acknowledgments 

I would like to express my deep gratitude to my supervisor Prof. David Ussery for his support during my Ph.D. project. It has been a great pleasure to work with him during my time at CBS, and I will miss the time spent organizing workshops and preparing for conferences. Thanks to Prof. and center director Søren Brunak for creating a unique and inspiring environment at CBS, which enabled this project.

I would like to extend my heartfelt gratitude to Craig and Marcia Benham for their incredible hospitality and openness towards our family during my research visit at the University of California, Davis in 2007.

I would like to thank a great colleague and friend of mine, Tim T. Binnewies, for support during conferences, manuscript preparations and our daily collaborations; it has been a pleasure to work with Tim. Thanks to Karin Lagesen for great research collaboration during the development of RNAmmer, and to Hanni Willenbrock for great collaboration and for driving numerous publications. I would also like to thank all the people I worked with during the development of the ENCODE pipeline: Ramneek Gupta, Thomas Blicher, Haakan Svensson, Henrik Nielsen, Rasmus Wernersson, Morten Bo Johansen and Eleonora Kulberkyte.

A special thanks to Hans-Henrik Stærfeldt for valuable feedback and for all the inspiring and productive sessions spent finalizing GeneWiz Browser and composing web services software. A special thanks to Kristoffer Rapacki for being a great travel companion, for always finding solutions, and for the many fruitful discussions we have had; I hope there will be more. I would like to thank the numerous people with whom I have had the pleasure of working during research projects and courses.

I thank former center administrators Johanne Keiding and Anne Christensen, current center administrator Dorthe Kjærsgaard, Lone Boesen and Malene Beck for their extraordinary efforts to keep the CBS engine running efficiently. Lone Boesen deserves special praise for smoothly arranging and handling the travel details of my many trips abroad, spanning five continents.


Publications and manuscripts 

Publications included in this thesis are listed in the order in which they appear. All other articles are sorted by publication date, in descending order. For papers with five or more citations, the number of citations is indicated.

Paper I 

Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas - a GeneWiz extension 

for visualization of whole-genome homology. Mol Biosyst 4:363-71 (2008). 

Paper II 

Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, La T, Hampson DJ, Bellgard M, Wassenaar TM, Ussery DW. Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6:165-85 (2006) - 56 citations.

Paper III 

Reva ON, Hallin PF, Willenbrock H, Sicheritz-Ponten T, Tummler B, Ussery DW. Global features of the Alcanivorax borkumensis SK2 genome. Environ Microbiol 10:614-25 (2008).

Paper IV 

Vesth T, Hallin PF, Snipen L, Lagesen K, Wassenaar TM, Ussery DW. The origins of 

Vibrio species. Microbial Ecology (2009) doi:10.1007/s00248-009-9596-7 

Paper V 

Wassenaar TM, Binnewies TT, Hallin PF, and Ussery DW Tools for comparison of 

bacterial genomes. Book chapter, Microbiology of Hydrocarbons, Oils, Lipids, and 

Derived Compounds, Springer-Verlag, Heidelberg, Germany, 2009. 


Paper VI 

[Lagesen K, Hallin P] 1 , Rodland EA, Stærfeldt HH, Rognes T, Ussery DW. RNAmmer: 

consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 

35:3100-8 (2007) - 8 citations 2 

Paper VII 

Hallin PF, Stærfeldt H, Rotenberg E, Binnewies TT, Benham CJ, and Ussery DW. GeneWiz 

browser: An Interactive Tool for Visualizing Sequenced Chromosomes. 

Standards in Genomic Sciences 1:204-215 (2009) doi:10.4056/sigs.28177. 

Papers not included 

Contributions have been made to the following papers during my PhD project. 

• Miller WG, Parker CT, Rubenfield M, Mendz GL, Wosten MM, Ussery DW, 

Stolz JF, Binnewies TT, Hallin PF, Wang G, Malek JA, Rogosin A, Stanker 

LH, Mandrell RE. The complete genome sequence and analysis of the 

human pathogen Arcobacter butzleri. PLoS ONE 2:e1358 (2007) 

• Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW Characterization of 

probiotic Escherichia coli isolates with a novel pan-genome microarray. 

Genome Biol 8:R267 (2007) 

Earlier papers, 2004–2006 

• Worning P, Jensen LJ, Hallin PF, Stærfeldt HH, Ussery DW Origin of replication 

in circular prokaryotic chromosomes. Environ Microbiol 8:353-61 

(2006) - 28 citations 

• Kiil K, Binnewies TT, Sicheritz-Ponten T, Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW Genome update: sigma factors in 240 bacterial genomes. Microbiology 151:3147-50 (2005)

• Bendtsen JD, Binnewies TT, Hallin PF, Ussery DW Genome update: prediction 

of membrane proteins in prokaryotic genomes. Microbiology 

151:2119-21 (2005) 

• Bendtsen JD, Binnewies TT, Hallin PF, Sicheritz-Ponten T, Ussery DW Genome 

update: prediction of secreted proteins in 225 bacterial proteomes. 

Microbiology 151:1725-7 (2005) 

• Binnewies TT, Bendtsen JD, Hallin PF, Nielsen N, Wassenaar TM, Pedersen 

MB, Klemm P, Ussery DW Genome Update: Protein secretion systems 

in 225 bacterial genomes. Microbiology 151:1013-6 (2005) 

• Hallin PF, Nielsen N, Devine KM, Binnewies TT, Willenbrock H, Ussery DW 

Genome update: base skews in 200+ bacterial chromosomes. Microbiology 

151:633-7 (2005) 

1 Both authors contributed equally 

2 Additionally 8 citations for the first 8 GEBA genomes published in SIGS journal; being part of a 

standard pipeline, RNAmmer will be cited for future GEBA articles. 


• Willenbrock H, Binnewies TT, Hallin PF, Ussery DW Genome update: 2D 

clustering of bacterial genomes. Microbiology 151:333-6 (2005) 

• Binnewies TT, Hallin PF, Stærfeldt HH, Ussery DW Genome Update: proteome 

comparisons. Microbiology 151:1-4 (2005) 

• Hallin PF, Ussery DW CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20:3682-6 (2004) - 37 citations

• Hallin PF, Coenye T, Binnewies TT, Jarmer H, Stærfeldt HH, Ussery DW 

Genome update: correlation of bacterial genomic properties. Microbiology 

150:3899-903 (2004) 

• Ussery DW, Binnewies TT, Gouveia-Oliveira R, Jarmer H, Hallin PF Genome 

update: DNA repeats in bacterial genomes. Microbiology 150:3519-21 

(2004) - 11 citations 

• Hallin PF, Binnewies TT, Ussery DW Genome update: chromosome atlases. 

Microbiology 150:3091-3 (2004) 

• Ussery DW, Tindbaek N, Hallin PF Genome update: promoter profiles. 

Microbiology 150:2791-3 (2004) 

• Ussery DW, Jensen MS, Poulsen TR, Hallin PF Genome update: alignment 

of bacterial chromosomes. Microbiology 150:2491-3 (2004) 

• Ussery DW, Hallin PF Genome Update: annotation quality in sequenced 

microbial genomes. Microbiology 150:2015-7 (2004) - 8 citations 

• Ussery DW, Hallin PF, Lagesen K, Wassenaar TM Genome update: tRNAs in sequenced microbial genomes. Microbiology 150:1603-6 (2004)

• Ussery DW, Hallin PF, Lagesen K, Coenye T Genome update: rRNAs in 

sequenced microbial genomes. Microbiology 150:1113-5 (2004) 

• Ussery DW, Hallin PF Genome Update: AT content in sequenced prokaryotic 

genomes. Microbiology 150:749-52 (2004) - 8 citations 

• Ussery DW, Hallin PF Genome update: Length distributions of sequenced 

prokaryotic genomes. Microbiology 150:513-6 (2004) 


Contents 

List of Figures
1 Introduction
2 Comparative Genomics
2.1 Introduction
2.2 The genome annotation pipeline
2.2.1 fetchgbk: Obtaining existing public genomes from GenBank
2.2.2 Other ways to acquire genome information
2.2.3 Tools contigsort and contigmap
2.2.4 Finding protein-encoding genes in prokaryotes
2.2.5 Finding tRNA and rRNA genes
2.3 Genome Comparisons
2.3.1 Box-and-whiskers plot
2.3.2 heatmap - 2D clustering
2.3.3 Codon usage and chromosomal base composition
2.3.4 CodonPlot: visualizing codon usage
2.3.5 Base composition and DNA repair
2.3.6 BLASTmatrix - proteome comparison
2.3.7 BLASTatlas - visualizing whole-genome homology
2.3.8 CorePlot - plotting the core- and pan-genomes of species
2.4 Summary
2.5 Instant insight: Reading the genetic atlas
2.6 Paper I: The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology
2.7 Paper II: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries
2.8 Paper III: Global features of the Alcanivorax borkumensis SK2 genome
2.9 Paper IV: The origins of Vibrio species
2.10 Paper V: Tools for comparison of bacterial genomes
3 rRNA operons and promoter analysis
3.1 Introduction
3.2 P1 and P2 promoters in E. coli
3.3 Conservation of regulatory elements
3.3.1 Modeling the P1 and P2 in selected enterics
3.3.2 Iterating weight matrix frequencies
3.3.3 Refining E. coli and Shigella models
3.4 DNA melting and SIDD energy
3.4.1 codesearch: Mapping numerical data to genome annotations
3.5 The genomic context: visualizing operons and DNA properties
3.6 Visualizing sequencing quality using gwBrowser
3.6.1 Visualizing the P1 and P2 structure using gwBrowser
3.7 Summary
3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences
3.9 Paper VII: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes
4 Web Services and Interoperability in Genomics
4.1 Introduction
4.2 Interoperability
4.2.1 SOAP based Web Services
4.3 EMBRACE: An EU initiative to enhance interoperability
4.3.1 Quasi - a light-weight SOAP server
4.3.2 quasi mktemp - From template to Web Service
4.4 ENCODE pipeline: applying Web Services
4.4.1 Collecting Web Services clients in EPipe
4.4.2 Mapping Pfam annotations to protein structure: mecA
5 Conclusion and perspectives
A Appendix: Workshops, teaching, and conferences
A.1 Lectures and Presentations
A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences
A.1.2 Comparative Microbial Genomics Workshop
A.1.3 Comparative Microbial Genomics and Taxonomy
A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services
A.1.5 EMBRACE Workshop on Bioinformatics of Immunology
A.1.6 EMBRACE 3rd AGM: Implementation of web services
A.1.7 EMBRACE Workshop on Perl, SQL and Web Services
A.2 Workshops and meetings
A.2.1 EMBRACE Workshop: SOAP web services
A.2.2 EUCOMM Bioinformatics Training Course
A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences
A.2.4 EMBRACE 3rd Annual General Meeting
A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation
A.2.6 EMBRACE 4th Annual General Meeting
A.2.7 Technical discussion of EMBRACE registry
A.2.8 EMBRACE meeting: Discussion of standard data types
A.3 Conferences
A.3.1 Conference: Metagenomics, July 2007, San Diego, U.S.A.
A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington, U.S.A.
B Appendix: Ph.D. study plan
C Appendix: Courses
C.1 Global regulatory networks in microorganisms
C.2 Protein Structure and Computational Biology
C.3 Biological Sequence Analysis
C.4 Comparative Genome Analysis
C.5 Doctoral seminar on business economics for academic entrepreneurs
C.6 ECTS summary
D Appendix: Software
D.1 fetchgbk manual
D.2 Sample output from queryGenomes
D.3 BLASTatlas configurations
D.3.1 file blast.cfg
D.3.2 file custom.cfg
D.4 BLASTmatrix example
D.5 iscan source code
D.6 quasi mktemp manual
Bibliography


List of Figures 

2.1 Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 

11168 is used as backbone for mapping contigs C. jejuni str. 260.94. Blue 

and red blocks represent direct and reverse hits, respectively. Panel (a) 

shows un-mapped whereas panel (b) shows mapped contigs. . . . . . . . . 6 

2.2 Construction of a box-and-whiskers plot. Notches is an estimate of the 95% 

confidence interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 

2.3 Genome size of all public prokaryotic. . . . . . . . . . . . . . . . . . . . . . 10 

2.4 Average AT content of all public prokaryotic. . . . . . . . . . . . . . . . . 10 

2.5 2D-clustering showing 87 Enterobacteriaceae. . . . . . . . . . . . . . . . . . 12 

2.6 Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella 

pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 49.2% AT. 

Rightmost column shows the nucleotide bias of the three codon positions. . 14 

2.7 AT content profile 400 bp upstream and downstram of annotated translation 

starts in Buchnera aphidicola Cc. . . . . . . . . . . . . . . . . . . . . . . . 15 

2.8 Deamination of cytosine (C) into uracil (U) . . . . . . . . . . . . . . . . . . 16 

2.9 Construction of the BLASTmatrix diagram. Proteome similarity between 

three E. coli genomes. Lower part of the diagram corresponds to intraproteome 

similarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 

2.10 Proteome similarity between ten Campylobacter species. Color encoding 

corresponds to percentage of shared protein families. . . . . . . . . . . . . 17 

2.11 Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae 

strains lacking the cholera enterotoxin genes are highlighted in bright green, 

whilst pathogenic V. cholerae strains genomes are shown in dark green. . . 18 

2.12 Mapping of pairwise alignment to a reference genome. Mismatches, conservative 

mismatches and perfect matches contrubute to the overall map 0.0, 

0.5, and 1.0, respectively. Gaps within the reference protein, corresponding 

to missing features of the reference protein, cannot be mapped and are 

hence excluded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 

2.13 Inclusion of multiple organisms using the BLASTatlas method. Each track 

correspond to a pairwise comparison against the reference chromosome. . . 19 

2.14 Comparison of B. pseudomallei 1710b chomosome I and II against all public 

Burkholderia genomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 

2.15 A phylome atlas of Alcanivorax borkumensis, comparing the proteome against 

all γ-, α-, β-, δ, and ɛ-proteobacteria available at the time of publishing. . 22 

2.16 Count of genomes and species divided by genera. Source: CBS Genome 

Atlas Database as of 2009-09-11. . . . . . . . . . . . . . . . . . . . . . . . . 23 

xvii

xviii 

2.17 Pan- and core-genome plot of 10 Campylobacter genomes. For the data 

currently available, there seem to exist an equilibrium at close to 600 protein 

families. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 

2.18 CorePlot output for 32 Vibrio genomes. . . . . . . . . . . . . . . . . . . . . 24 

3.1 The transcription of bacterial genes. . . . . . . . . . . . . . . . . . . . . . . 106 

3.2 The promoter structure of the rrnB operon in E. coli. . . . . . . . . . . . . 107 

3.3 The –10 and –35 hexamers of the E. coli σ70 promoter correspond to the 

motifs being located on opposite side of the DNA helix. Deletions or insertions 

in the spacing cause a shift of approx. 36° per nucleotide. . . . . . 107 

3.4 Logo plots showing the initial weight matrices used for searching E. coli 

and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), 

and FIS binding motif (d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 

3.5 Neighbor-joining tree of first 1k bases of all 16S rRNA genes of Yersinia, 

Salmonella, Shigella, and E. coli . . . . . . . . . . . . . . . . . . . . . . . . 110 

3.6 Profiles showing the maximum Ri(tot) scores of the initial weight matrices 

applied to E. coli and Shigella: Unadjusted P1 scores (a), Adjusted P1 

scores (b), Unadjusted P2 scores (c), and Adjusted P2 scores (d) . . . . . . 112 

3.7 Logos showing the base composition of P1 and P2 of E. coli genomes, as 

identified by initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer 

(b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 

–35 hexamer (f), P2 UP element (g) . . . . . . . . . . . . . . . . . . . . . . 113 

3.8 Average profiles of SIDD energy calculated at five different helix densities 

-0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation 

start. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 

3.9 E. coli and Shigella rrnB energy landscape visualized using the heatmap 

function. Each vertical column corresponds to a promoter sequence, whereas 

the horizontal rows represent average values over 10 bp within each sequence. 

Coordinates labeled on the horizontal rows are relative to the 16S 

rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps 

show P2. Leftmost heatmaps show P1/P2 model scores in green, whereas 

rightmost heatmaps show the SIDD energy in blue. . . . . . . . . . . . . . 116 

3.10 Principle workflow of gwBrowser data exchange. . . . . . . . . . . . . . . . 118 

3.11 Mapping qualities of sequencing reads to a reference genome while accounting 

for the uniqueness of the read. . . . . . . . . . . . . . . . . . . . . . . . 118 

3.12 A zoom of the P1 P2 tandem promoter system upstream of the rrnB operon 

of E. coli K12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 

4.1 Screen shot of NCBI Entrez Genome projects web page . . . . . . . . . . . 146 

4.2 Schematic layout of a simple SOAP resource, where WSDL and schemas 

reside on the same server. WSDL and schemas are read and interpreted 

by the SOAP client in order to compose the outgoing request and parse the 

incoming server response. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 

4.3 Schematic layout of the ENCODE pipeline, EPipe. The main program 

ensures that as much as possible is dispatched in parallel. Modules may 

either be alignment dependent or not. If the alignment is required to predict 

the protein features, the module is not launched until the alignment 

algorithm has finished. Modules may either return global features of the 

entire protein (e.g. cellular localization), or return positional features (e.g. 

phosphorylation sites). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

4.4 The input web page of EPipe: Upper part defines sequence upload and 

alignment method, and lower part selects which modules / methods to 

run. When applicable, gene ontologies have been added to each feature and 

feature values (light green boxes). . . . . . . . . . . . . . . . . . . . . . . . 153 

4.5 The mecA encoded protein (EEV85461) shows homology to PDB entry 

1VQQ (Lim & Strynadka, 2002). Top panel shows the EPipe structure 

browser, which allows rotation in steps of 90 degrees. Lower panel shows a 

post-processing of the PyMol script, generated by EPipe. . . . . . . . . . . 154 


Chapter 1 

Introduction 


Since the publication of the first complete bacterial genome sequence in 1995, close to a thousand prokaryotes have been fully sequenced and made publicly available. These data represent large efforts by many scientists and technicians, closing gaps in the chromosomal sequences and providing detailed gene annotations. These genome projects constitute a valuable collection of prokaryotic diversity, and they serve as an indispensable resource for comparative studies when novel features of newly discovered organisms are identified. 

We are, however, witnessing a transition phase in which genome sequencing becomes a trivial step carried out by any researcher or company in need of a better characterization of an organism. Sequencing equipment and the capability of assembling an entire genome will likely follow the same path as any other technological advance the world has seen. Telephones, cars, aeroplanes, and computers all started as costly and clumsy attempts and ended up as mainstream, affordable, and efficient products that are taken for granted. Nothing will prevent sequencing technology from following the same path, and it will likely end up as a tiny desktop instrument on a doctor’s table next to the blood pressure measuring device. But the decreasing novelty of presenting a new genome sequence could cause a decline in the number of published genomes in the near future, leading to less control and organization of these data, with fewer demands on data integrity and on sequencing and annotation quality. 

Some major issues arise as massive amounts of genomic data become a reality. There are signs that our ability to process and analyze genomic data is being overtaken by the technological developments of the sequencing equipment. For example, over the past twenty-five years GenBank has grown roughly 100,000-fold, whereas computer processing power, following Moore’s law, has grown “only” 1,000-fold. The overwhelming amounts of data generated by modern sequencing machines constitute tough challenges for most biologists, and although efforts are constantly being made to improve gene prediction and genome assembly software, these steps do not yet function in a scalable and unsupervised fashion. Further, the post-annotation steps that derive knowledge from predicted genes remain one of the biggest challenges. How do we transform contigs of nucleotide sequences into knowledge from which the phenotype of the organism can be derived? 

As more prokaryotic genomes are being sequenced, there are now a number of species for which multiple strains have been sequenced. Roughly one fourth of all prokaryotic projects belong to species for which five or more strains are available. As this coverage of diversity increases, we may begin to answer some key questions with better confidence. How do we define core sets of genes? Can we estimate the size of the pan-genome? Which features are novel in selected strains, and are these features regionally conserved within the chromosomes? To answer these questions, there is a fundamental need to visualize and obtain an overview of the similarities and differences between larger numbers of genomes. Obtaining such an overview allows some questions concerning gene acquisition and chromosomal organization to be answered. The development and refinement of the BLASTatlas method during this Ph.D. project is an essential step forward in enabling these types of analysis, and the method is now offered as an online service by CBS. This work led to a publication in 2008 describing the BLASTatlas method. 

In chapter 2 a number of tools are described which can assist in the rapid analysis of genomes, genomic contigs, and larger collections of genomes in order to assess their similarity. Making local and web-based genome analysis tools accessible to the novice user remains a critical point for the success of future sequencing projects. In chapter 3 the RNAmmer tool is used as a starting point to study the E. coli rrn tandem promoters. This work presents useful tools to model and visualize promoter conservation in genomes. The exchange of genomic data between users, sequencing centers, repositories, and tool providers currently lacks standardization and interoperability. The lack of a formal way to exchange genomic data limits how we may exploit the wave of new genomic material being generated. Chapter 4 of this thesis describes a number of efforts made during this Ph.D. project to provide interoperability and programmatic access to prediction methods and genomic visualization methods, as well as management of data standards. The outcome of this work has led CBS to adapt its tools and server infrastructure, thereby sharing its many tools in a way that allows programmers to insert sophisticated prediction methods directly into their own programming environment. 


Chapter 2 


2.1 Introduction 


This chapter covers work for five publications. The first paper (I) describes the BLASTatlas method, developed to compare and visualize the homology between a reference genome and any number of other genomes, collections of genomes, metagenomic sequences, or databases as a single graphic. The method has been used in connection with various research projects, including the publication of the Arcobacter butzleri RM4018 genome (Miller et al., 2007), in computer exercises (see chapter 4 and appendix A.1), and as an analysis tool for publications made during the project (papers II-V). 

A number of smaller unpublished methods, including BLASTmatrix, CorePlot, and CodonPlot, have been written and used as in-house tools. The BLASTmatrix software derives unique and shared protein families for any number of proteomes. This enables the viewer to obtain the similarity between any pair of organisms included in the comparison. The tool was first used in Jensen et al. (2005) and has also been used in other papers, including paper II. An improved version of the BLASTmatrix tool is used in paper IV. The BLASTmatrix software generates an all-against-all BLAST (Basic Local Alignment Search Tool, Altschul et al. (1997)) comparison of a number of selected proteomes. When comparing multiple species of the same genus, these BLAST results can be reused by the CorePlot program to estimate the size of the core- and pan-genome. Finally, the CodonPlot program was written to visualize the codon and amino acid usage of an organism. The CodonPlot results contributed to papers II, III, and V. 

The development of an interactive web-based genome browser (gwBrowser) has allowed a broader application of the atlas visualization method, including analysis of sequencing reads and promoter regions. This work is described in chapter 3. 

2.2 The genome annotation pipeline 

Having assembled the reads of a sequencing project, the biologist is often presented with 

an incomplete mapping of the chromosome, with gaps and a large number of contigs 

(contiguous pieces of DNA). The quality of the assembly originating from most modern 

high-throughput techniques can be negatively affected by a number of factors such as 

short or insufficient reads, elevated error rates near the end of the reads, DNA repeats on 

the chromosome, inadequate assembly tools etc. This section describes tools to analyze 

both complete genome data (single-contig) as well as preliminary data generated by pyrosequencing 

machines (multiple contigs). Most tools that are presented here are stored 

on the CBS servers at /home/people/pfh/scripts/. 


2.2.1 fetchgbk: Obtaining existing public genomes from GenBank 

Without robust access to prior knowledge about existing genomes, it is hard to draw conclusions about a novel genome sequence. The tool fetchgbk was made to download the most recent GenBank entries via NCBI using individual accession numbers (GenBank and RefSeq), ranges thereof, or the NCBI project ID, whereby all replicons of an organism can be obtained. Listing 2.1 shows common usage of the program and appendix D.1 includes the manual. 

Listing 2.1: Usage of fetchgbk 

# download a single GenBank record
fetchgbk -a CP000896
# download a single RefSeq entry
fetchgbk -a NZ_ABIZ00000000
# download a range of RefSeq entries
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038
# just list the RefSeq accession numbers of a project
fetchgbk -p 12997 -d refseq -l
# download all replicons of a project (RefSeq)
fetchgbk -p 19391 -d refseq
# download all replicons of a project (GenBank)
fetchgbk -p 19391 -d genbank
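The range syntax accepted by the -a option can be pictured as an expansion into individual accession numbers. The following Python sketch illustrates one way such an expansion could work. It is not the fetchgbk implementation: the function name expand_accession_range is hypothetical, and it assumes that both ends of a range share the same prefix followed by a zero-padded number of equal width.

```python
import re
from typing import List

# Hypothetical helper, not part of fetchgbk: splits an accession into
# a non-numeric prefix and its trailing digits.
_ACCESSION = re.compile(r"^(.*?)(\d+)$")

def expand_accession_range(spec: str) -> List[str]:
    """Expand 'PREFIX0001-PREFIX0003' into a list of single accessions."""
    if "-" not in spec:
        return [spec.strip()]  # a single accession, nothing to expand
    first, last = (part.strip() for part in spec.split("-", 1))
    m_first, m_last = _ACCESSION.match(first), _ACCESSION.match(last)
    if not m_first or not m_last:
        raise ValueError(f"cannot parse accession range: {spec!r}")
    prefix, start_digits = m_first.groups()
    last_prefix, end_digits = m_last.groups()
    # Assumption: both ends use the same prefix and the same digit width.
    if prefix != last_prefix or len(start_digits) != len(end_digits):
        raise ValueError(f"range ends do not match: {spec!r}")
    start, end = int(start_digits), int(end_digits)
    if end < start:
        raise ValueError(f"range is reversed: {spec!r}")
    width = len(start_digits)
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]

if __name__ == "__main__":
    for accession in expand_accession_range("AANK01000001-AANK01000010"):
        print(accession)
```

Keeping the zero-padding width from the input means the generated identifiers have the same form as the accession numbers shown in the listing.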

2.2.2 Other ways to acquire genome information 

The GenBank records maintained in the CBS Genome Atlas Database (Hallin & Ussery, 2004) are regularly synchronized against NCBI Entrez (see http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). The raw sequence data can be downloaded from this database using the Web Services client scripts getSeq, getOrfs, and getProt. Example scripts can be downloaded and run as separate commands (listing 2.2) or integrated into larger workflows, in other programming languages if needed. 

Listing 2.2: Accessing Genome Atlas Database through Web Services. 

# download prerequisites
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getseq.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getprot.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getorfs.pl

# obtain full genome sequence of GenBank entry
perl getseq.pl CP000550 > CP000550.fsa

# obtain translations of GenBank entry
perl getprot.pl CP000550 > CP000550.proteins.fsa

# obtain open reading frames of GenBank entry
perl getorfs.pl CP000550 > CP000550.orfs.fsa

The CBS Genome Atlas Database contains an index of genome meta-data, such as organism name, NCBI Project ID, replicon, genome size, number of coding genes, tRNA genes, rRNA genes, the base composition, and average values of various DNA properties such as intrinsic curvature (Bolshoy et al., 1991) and stacking energy (Satchwell et al., 1986). For more information on the Web Services implementation, see section 4.2.1, and for full documentation please refer to http://www.cbs.dtu.dk/ws/GenomeAtlas. Listing 2.3 shows an example of how to use queryGenomes to obtain AT content and gene count for the publicly available Vibrio genomes. The output of the command is listed in appendix D.2. 

Listing 2.3: Using queryGenomes to obtain genome meta-data. 

# download client script
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/querygenomes.pl

# download XML::Compile helper script

# extract AT content and number of genes for all Vibrio genomes
perl querygenomes.pl -hideMerged -organism vibrio -output ATCONTENT,NGENES

2.2.3 Tools contigsort and contigmap 

For some analyses of unfinished or partially sequenced genomes, it is desirable to obtain approximate coordinates of the contigs within the complete chromosome. The contigsort program was written for this purpose. It accepts any number of entries (contigs) in one FASTA file together with a single-contig backbone sequence in a second FASTA file. The entries of the contig file are then mapped to the backbone sequence using nucleotide BLAST, assuming at least one significant hit. The tool then sorts all contigs by the backbone coordinate of the center-point of each alignment. Contigs spanning the origin of circular backbones are automatically split in two. 
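The ordering step can be sketched as follows. This is a minimal Python illustration rather than the contigsort code: the Hit record, the choice of the highest-scoring hit per contig, and the placement of contigs without any hit at the end are all assumptions made for the example, and splitting of contigs that span the origin is not shown.

```python
from typing import Dict, Iterable, List, NamedTuple

class Hit(NamedTuple):
    """One BLAST alignment of a contig against the backbone (hypothetical record)."""
    contig: str
    start: int      # alignment start on the backbone
    end: int        # alignment end on the backbone (may be < start for reverse hits)
    bitscore: float

def order_contigs(contigs: Iterable[str], hits: Iterable[Hit]) -> List[str]:
    """Sort contigs by the backbone coordinate of the centre of their best hit."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        # Assumption: the highest-scoring alignment represents the contig.
        if hit.contig not in best or hit.bitscore > best[hit.contig].bitscore:
            best[hit.contig] = hit

    def centre(name: str) -> float:
        hit = best[name]
        return (hit.start + hit.end) / 2

    names = list(contigs)
    mapped = sorted((n for n in names if n in best), key=centre)
    unmapped = [n for n in names if n not in best]  # assumption: keep them last
    return mapped + unmapped

if __name__ == "__main__":
    example_hits = [
        Hit("contig1", 5000, 6000, 900.0),
        Hit("contig2", 100, 900, 800.0),
        Hit("contig3", 3000, 2000, 700.0),   # reverse-strand hit
        Hit("contig2", 9000, 9100, 50.0),    # weaker secondary hit, ignored
    ]
    print(order_contigs(["contig1", "contig2", "contig3", "contig4"], example_hits))
```

Using the centre-point instead of the start coordinate gives the same ordering for direct and reverse hits, since the midpoint does not depend on strand.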

The tool genomemap was written to visualize homology between two genome sequences. Each genome may consist of one or more contigs, and all contigs are aligned using BLASTN. This tool allows a user to validate the output of the backbone mapping from contigsort. The plot generated has similarities to that produced by the Artemis Comparison Tool (ACT) (Rutherford et al., 2000); however, the output of genomemap is a vector graphic file (PostScript), and multiple sequence entries are allowed within each of the two compared sequences. 

Example: Campylobacter jejuni str. 260.94 

The 10 contigs of the currently unpublished sequence of Campylobacter jejuni str. 260.94 (GenBank accession no. AANK01000001-AANK01000010) were downloaded and converted into a FASTA format file. The program saco_convert is an in-house program at CBS, which converts between different sequence formats. In the example provided, Campylobacter jejuni str. NCTC 11168 (Parkhill et al., 2000) is used as the backbone (see listing 2.4). 

Listing 2.4: Using contigsort to map assembled contigs to a backbone. 

set path = (~pfh/scripts/contigsort ~pfh/scripts/fetchgbk $path)
fetchgbk -a AANK01000001-AANK01000010 > AANK.gbk
saco_convert -I genbank -O fasta AANK.gbk > AANK.fsa
fetchgbk -a AL111168 > AL111168.gbk
saco_convert -I genbank -O fasta AL111168.gbk > AL111168.fsa
contigsort -c -i AANK.fsa -b AL111168.fsa > mapped.fsa

To visualize the result of the contig mapping the mapped and un-mapped contigs were 

processed by contigmap. The output from the comparison is a PostScript document (figure 

2.1 and listing 2.5). 


Figure 2.1: Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used as backbone for mapping the contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and reverse hits, respectively. Panel (a) shows un-mapped contigs, whereas panel (b) shows mapped contigs. 

Listing 2.5: Using contigmap to draw homology between contigs and reference genome 

set path = (~pfh/scripts/contigmap $path)
contigmap AL111168.fsa AANK.fsa > AANK-raw.ps
contigmap AL111168.fsa mapped.fsa > AANK-mapped.ps

2.2.4 Finding protein encoding genes in prokaryotes 

A crucial step in implementing any genome pipeline is gene finding. Successful gene calling enables a number of downstream analyses, such as translation of ORFs into protein sequences, finding potentially novel genes, annotation of protein function by homology searches, assignment of functional domains, and detection of signal peptides to derive the secretome. To both reveal novel protein sequences and to draw conclusions about the overall proteome, it is therefore essential that the gene calling can be trusted. There are several public prokaryotic gene predictors available, such as Glimmer3 (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi, Delcher et al. (1999)), GeneMarkS (http://exon.biology.gatech.edu/, Besemer et al. (2001)), EasyGene (http://www.cbs.dtu.dk/services/EasyGene/, Larsen & Krogh (2003)), and Prodigal (unpublished, http://compbio.ornl.gov/prodigal). Prodigal is a recent development, and despite its high speed and simplicity it provides promising results. It has been implemented as part of the CBS Genome Atlas Database Web Services. Code examples are provided showing the usage of the Prodigal client scripts (listing 2.6). 

Listing 2.6: Using Prodigal for ORF prediction. Note that 6pack is an internal CBS tool used for translation of ORFs. 

wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/prodigal.pl
perl prodigal.pl -ta 11 -fasta < mapped.fsa > mapped.orfs.fsa
6pack -1 < mapped.orfs.fsa > mapped.proteins.fsa

Assessing annotation quality 

All four gene finders listed above were applied to the latest version of the E. coli strain K-12 isolate MG1655 genome sequence (U00096, 28 July 2009, Blattner et al. (1997)). These predictions, together with an older annotation of the same GenBank entry (U00096 from 2004), were compared pairwise to the latest version of the GenBank entry. The number of unique genes in both reference and query genome was derived and, for each overlapping pair of ORFs, the average inaccuracy of the 3’ and 5’ ends was calculated (table 2.1). In addition, the encoded proteins were compared using the BLASTmatrix tool, described in section 2.3.6. This allows estimation of the number of protein families shared between the reference and the query genomes. 

source              CDS total     TP    FP    FN   3’ off   5’ off   sens.   shared
U00096 (present)        4,321      -     -     -        -        -       -        -
U00096 (2004)           4,254  4,172    82   109     1.02    -4.07    0.97      93%
Glimmer 3.02            4,476  4,174   302   125    -0.6    -24.09    0.97      87%
GeneMark-S 2.6          4,377  4,207   170    90     1.94   -20.17    0.98      91%
EasyGene 1.2            4,056  4,017    39   256    -0.28   -19.07    0.94      91%
Prodigal 1.1            4,332  4,200   132    97     0.54   -20.07    0.98      92%

Table 2.1: Performance of prokaryotic gene finders. An older GenBank record for E. coli K12 (U00096, 2002) has been included, and the reference for all comparisons is the most recent record, shown at the top. The 3’ and 5’ off columns give the number of base pairs by which a query coordinate lies downstream (positive number) or upstream (negative number) of the reference coordinate. The sensitivity is estimated by binary classification as $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, where TP is the number of proteins shared between reference and query and FN is the number of proteins unique to the reference, not found in the query. Calculating specificity (which requires a true negative count) is difficult, as it is hard to identify regions of the chromosome that for certain do not contain protein coding genes (Larsen & Krogh, 2003). The rightmost column contains an estimate of the percentage of protein families shared between the query and the reference genome. The number is derived using the BLASTmatrix tool. 
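The sensitivity and false-positive columns of table 2.1 can be recomputed from the counts in the table. The short Python sketch below does this for the tabulated values. The function and variable names are illustrative only, and taking FP as the CDS total minus TP is an assumption, although it agrees with every row of the table.

```python
from typing import Dict, Tuple

def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity of a query annotation: TP / (TP + FN)."""
    return tp / (tp + fn)

# (CDS total, TP, FN) per query annotation, copied from table 2.1
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "U00096 (2004)":  (4254, 4172, 109),
    "Glimmer 3.02":   (4476, 4174, 125),
    "GeneMark-S 2.6": (4377, 4207, 90),
    "EasyGene 1.2":   (4056, 4017, 256),
    "Prodigal 1.1":   (4332, 4200, 97),
}

def summarise(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, Tuple[int, float]]:
    """Return (FP, sensitivity rounded to two decimals) for each source."""
    summary = {}
    for source, (total, tp, fn) in counts.items():
        fp = total - tp  # assumption: predictions without a reference match
        summary[source] = (fp, round(sensitivity(tp, fn), 2))
    return summary

if __name__ == "__main__":
    for source, (fp, sens) in summarise(COUNTS).items():
        print(f"{source:<16} FP={fp:>4} sens={sens:.2f}")
```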

2.2.5 Finding tRNA and rRNA genes 

The tool tRNAscan-SE (Lowe & Eddy, 1997) has been implemented in the CBS Genome 

Atlas Database Web Service, and it predicts tRNA genes in contigs or genomes: 

wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/trnascan.pl
perl trnascan.pl < mapped.fsa > mapped.trna.fsa

The RNAmmer method (Paper VI, chapter 3) can be used to consistently annotate rRNA genes in contigs and full genome sequences. This tool is implemented as a separate Web Service at CBS. Please refer to http://www.cbs.dtu.dk/ws/RNAmmer for full documentation. In listing 2.7 an example is provided showing the usage of the RNAmmer client script. 

Listing 2.7: Running RNAmmer on a genome sequence 

wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
wget http://www.cbs.dtu.dk/ws/RNAmmer/examples/rnammer.pl
perl rnammer.pl bac < mapped.fsa > mapped.rrna.fsa

2.3 Genome Comparisons 

The previous section described some initial steps for annotating a bacterial genome, which is required for further comparative studies. In this section emphasis is placed on comparing annotated genomes, both at the proteome level and using meta-data. 


[Figure 2.2 diagram labels: Q1, median, Q3; IQR; 1.5 × IQR on either side of the box; 95% confidence interval (notch); each whisker ends at an observed data point, not exceeding 1.5 × IQR; mild outliers lie between 1.5 and 3.0 × IQR and extreme outliers more than 3 × IQR away from Q1 and Q3.] 

Figure 2.2: Construction of a box-and-whiskers plot. Notches are an estimate of the 95% confidence interval. 

The tools presented here have all been used widely during course activities and research 

projects. 

2.3.1 Box-and-whiskers plot 

As the number of sequenced bacterial genomes grew from only two in 1995 to close to a thousand at the time of writing, enough data became available to sample various genomic properties amongst the different phylogenetic groups. The box-and-whiskers plot (Tukey, 1977) is a useful tool for visualizing such differences. The plot shows a box between the first and the third quartile (figure 2.2). The distance between Q1 and Q3 is called the interquartile range (IQR), and whiskers are drawn out to the most extreme observations that lie no more than 1.5 × IQR from the box. A line is drawn within the box representing the median. Observations between 1.5 × IQR and 3.0 × IQR from the box are denoted "mild" outliers, whereas observations more than 3.0 × IQR away are extreme outliers. Notches are sometimes drawn to denote the confidence interval. In the R implementation of the box-and-whiskers plot, the 95% confidence interval of the median is approximated by $\pm 1.58 \times \mathrm{IQR}/\sqrt{N}$. When comparing two or more distributions, non-overlapping notches mark significant differences. 
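The construction can be made concrete with a small calculation. The Python sketch below derives the box, whisker ends, outlier classes, and notch half-width for a list of values. It is an illustration under stated assumptions, not the R code: quartiles are computed with linear interpolation (the "inclusive" method of Python's statistics module), which can differ slightly from the hinges R uses, and the notch factor 1.58 is the value given in the R documentation for boxplot.stats. The function name box_stats and the example values are made up.

```python
import math
import statistics
from typing import Dict, List

def box_stats(values: List[float]) -> Dict[str, object]:
    """Summary values needed to draw a notched box-and-whiskers plot."""
    data = sorted(values)
    # Assumption: quartiles by linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # mild/extreme boundary
    inside = [x for x in data if inner_low <= x <= inner_high]
    mild = [x for x in data
            if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in data if x < outer_low or x > outer_high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # whiskers end at observed data points within 1.5 x IQR of the box
        "whiskers": (inside[0], inside[-1]),
        "mild_outliers": mild,
        "extreme_outliers": extreme,
        # approximate 95% confidence interval of the median: median +/- this value
        "notch_half_width": 1.58 * iqr / math.sqrt(len(data)),
    }

if __name__ == "__main__":
    example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 30]  # made-up values
    for key, value in box_stats(example).items():
        print(key, value)
```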

Distribution of genome size and base composition in prokaryotes 

To examine the base composition and genome size of different phylogenetic groups, the CBS Genome Atlas Database can be queried, grouping replicons into projects and summarizing or averaging within each project. Although this is only possible from within CBS, the commands are listed below. 



# total genome length per project (all replicons summed)
mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#',color), ord, sum(length), concat(organism_name,'/',segment_name,'/',genbank) from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' and ph.phyla = p.grp group by s.pid" > length.tbl
# number of projects, used in the plot title
set N = `wc -l < length.tbl`
~pfh/scripts/boxplot -main "Size distribution of Prokaryotic genomes (N = $N)" < length.tbl > length.ps
# length-weighted average AT content per project
mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#',color), ord, sum(atcontent*length)/sum(length), concat(organism_name,'/',segment_name,'/',genbank) from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' and ph.phyla = p.grp group by s.pid" > atcontent.tbl
~pfh/scripts/boxplot -main "AT content distribution of Prokaryotic genomes (N = $N)" < atcontent.tbl > atcontent.ps
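The second query computes a length-weighted AT content per project, sum(atcontent*length)/sum(length), so that a small plasmid does not count as much as the chromosome. The Python sketch below performs the same aggregation outside the database. The function name and the replicon values are made up for illustration and are not taken from the CBS Genome Atlas Database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One replicon: (project id, length in nt, AT fraction between 0 and 1)
Replicon = Tuple[int, int, float]

def summarise_projects(replicons: Iterable[Replicon]) -> Dict[int, Tuple[int, float]]:
    """Return total length and length-weighted AT fraction per project."""
    total_length: Dict[int, int] = defaultdict(int)
    at_bases: Dict[int, float] = defaultdict(float)
    for project_id, length, at_fraction in replicons:
        total_length[project_id] += length
        at_bases[project_id] += at_fraction * length  # expected number of A/T bases
    return {
        project_id: (total_length[project_id],
                     at_bases[project_id] / total_length[project_id])
        for project_id in total_length
    }

if __name__ == "__main__":
    # Hypothetical project 1: a chromosome plus a smaller, more AT-rich replicon.
    example = [
        (1, 3_000_000, 0.50),
        (1, 1_000_000, 0.60),
        (2, 2_000_000, 0.70),
    ]
    for project_id, (length, at) in summarise_projects(example).items():
        print(project_id, length, round(at, 3))
```

In this made-up example an unweighted mean over the two replicons of project 1 would give 0.55, whereas the length-weighted value is 0.525.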

The tables generated by the MySQL queries can be read by the boxplot program, which is a Perl wrapper for the R command boxplot, and a PostScript document is generated. Figure 2.3 shows the total genome length (including all replicons) of all published prokaryotic genomes, divided into phyla. The confidence interval appears wide for many groups, reflecting a high intra-phylum variation. However, for a number of phyla the difference is significant. The β-proteobacteria tend to have longer chromosomes than, for example, the firmicutes, the α-proteobacteria, and the cyanobacteria. It is also evident that the δ-proteobacterium Sorangium cellulosum Soce56 represents the longest genome (13,033,779 nt, Schneiker et al. (2007)), but that this is an outlier not representative of the entire phylum. The shortest bacterial genome published so far is that of the α-proteobacterium Candidatus Hodgkinia cicadicola Dsem (143,795 nt, McCutcheon et al. (2009)). Thus, the difference between the smallest and the largest genome is roughly 90-fold. The plot in figure 2.4 shows the fraction of AT for the prokaryotic genomes, ranging from 25% for the δ-proteobacterium Anaeromyxobacter dehalogenans 2CP-C (Sanford et al., 2002) to 83% for Candidatus Carsonella ruddii PV (Nakabachi et al., 2006). 

2.3.2 heatmap - 2D clustering 

One way to increase the dimensionality when visualizing genomic properties is to use a so-called heatmap, or 2D clustering. Instead of looking at a single property at a time (e.g. length or AT content), multiple features may be included in the same plot. The axis is replaced with a color transformation of the data, and different normalization methods may be applied. In the example below a comparison is made for 87 Enterobacteriaceae, covering among others the genera Escherichia, Salmonella, Yersinia, Shigella, Buchnera, and Klebsiella. The CBS Genome Atlas Database is queried for features such as tRNA and rRNA gene count, total coding genes, genome size, AT content, simple genomic repeats, local direct repeats, base pairs per gene, and coding fraction of the genome. The plot is shown in figure 2.5 and the R code for producing the plot is shown below in listing 2.8. The data have been normalized to allow for comparison. Features and organisms are hierarchically clustered to group organisms with similar properties and to group properties that correlate within the organisms. 
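Normalization is needed because the features live on very different scales: genome length is counted in millions of nucleotides, whereas AT content is a fraction. The text does not state which normalization was applied, so the Python sketch below uses column-wise z-scores purely as an assumed example. The function name zscore_columns and the three sets of values are made up, and the clustering step itself is not shown.

```python
import statistics
from typing import Dict, List

def zscore_columns(table: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale every feature column to zero mean and unit standard deviation."""
    scaled: Dict[str, List[float]] = {}
    for feature, column in table.items():
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        if sd == 0:
            # A constant feature carries no information; map it to zeros.
            scaled[feature] = [0.0 for _ in column]
        else:
            scaled[feature] = [(value - mean) / sd for value in column]
    return scaled

if __name__ == "__main__":
    # Made-up values for three organisms, one list entry per organism.
    features = {
        "LENGTH": [4.6e6, 5.5e6, 0.64e6],
        "ATCONTENT": [0.49, 0.50, 0.74],
    }
    for feature, column in zscore_columns(features).items():
        print(feature, [round(value, 2) for value in column])
```

After scaling, a color gradient can be applied uniformly across features, because each column is expressed in standard deviations from its own mean.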

[Figure 2.3: box-and-whiskers plot titled "Size distribution of Prokaryotic genomes (N = 932)", one box per group, axis from 0.0e+00 to 1.2e+07; E. coli, Salmonella, Yersinia, and Buchnera are marked. Groups: Crenarchaeota (n=23), Euryarchaeota (n=39), Nanoarchaeota (n=1), Acidobacteria (n=3), Actinobacteria (n=68), Aquificae (n=5), Bacteroidetes/Chlorobi (n=26), Chlamydiae/Verrucomicrobia (n=14), Chloroflexi (n=10), Cyanobacteria (n=36), Deinococcus-Thermus (n=5), Firmicutes (n=191), Fusobacteria (n=1), Planctomycetes (n=1), Alphaproteobacteria (n=114), Betaproteobacteria (n=70), Gammaproteobacteria (n=226), Deltaproteobacteria (n=29), Epsilonproteobacteria (n=25), Spirochaetes (n=18), Thermotogae (n=10), Other Archaea (n=1), Other Bacteria (n=16).] 

Figure 2.3: Genome size of all public prokaryotic genomes. 

[Figure 2.4: box-and-whiskers plot titled "AT content distribution of Prokaryotic genomes (N = 932)", same groups as figure 2.3, axis from 0.3 to 0.8; E. coli, Salmonella, and Buchnera are marked.] 

Figure 2.4: Average AT content of all public prokaryotic genomes. 

Listing 2.8: R code to generate a 2D clustering graphic 

library(gplots)
postscript(file='output.ps')
data


[Figure: heat map (2D clustering). Columns: TRNA_SCAN_COUNT, LENGTH, NGENES, RNAMMER_SSU_COUNT, ATCONTENT, LOC_DIR_REPEAT, LOC_INV_REPEAT, SR_PERCENT, CODING_FRACTION, BPPRGENE. Rows: 87 Enterobacteriaceae genomes, among them strains of Escherichia coli, Shigella, Salmonella enterica, Klebsiella pneumoniae, Yersinia and Buchnera aphidicola. Color key: value from -1 to 1.]

Figure 2.5: 2D-clustering showing 87 Enterobacteriaceae.

1st                 2nd position                   3rd
          U          C          A          G
U       56 Phe     31 Ser     41 Tyr     12 Cys     U
         2 Phe      1 Ser      2 Tyr      1 Cys     C
        79 Leu     22 Ser      3 Stop     0 Stop    A
         5 Leu      1 Ser      0 Stop     8 Trp     G
C        7 Leu     13 Pro     17 His      7 Arg     U
         0 Leu      1 Pro      1 His      0 Arg     C
         5 Leu     12 Pro     25 Gln      5 Arg     A
         0 Leu      2 Pro      2 Gln      0 Arg     G
A       79 Ile     18 Thr     75 Asn     12 Ser     U
         4 Ile      1 Thr      6 Asn      1 Ser     C
        51 Ile     20 Thr    131 Lys     18 Arg     A
        18 Met      1 Thr      6 Lys      1 Arg     G
G       18 Val     16 Ala     33 Asp     18 Gly     U
         1 Val      1 Ala      2 Asp      1 Gly     C
        18 Val     15 Ala     41 Glu     27 Gly     A
         1 Val      1 Ala      2 Glu      2 Gly     G

Table 2.2: Codon usage in Buchnera aphidicola Cc. Frequencies are measured per thousand. A total of 354,219 base pairs are examined in 360 ORFs (5 ORFs rejected due to possible frameshifts).

codons may be replaced to encode both identical and similar amino acids to adjust the 

overall base composition. 

2.3.4 CodonPlot: visualizing codon usage 

A rose plot diagram (Ussery et al., 2004; Binnewies et al., 2006) may be used to make a graphical representation of codon and amino acid usage. In the codon rose plot, all 64 codons are listed along the perimeter and the frequency of each codon is drawn on a radial scale. The 64 codons are sorted in the order AUGC, first by the last letter (XX[AUGC]), then by the second letter (X[AUGC]X), and finally by the first letter ([AUGC]XX). The result is four quadrants, with codons ending with A or U in the right half, and codons ending with C or G in the left half. This allows an easy overview of biases in the third position. For the amino acid rose plot, all 20 amino acids are drawn along the perimeter with their frequencies shown radially. Here, the amino acids are grouped according to their chemical properties. In addition to the rose plot, information content can be used to measure the bias within each of the three positions of the codon. These codon analyses are shown in figure 2.6 for three different enteric genomes: the AT-rich Buchnera aphidicola Cc (79.8% AT), an E. coli strain K-12 (49.2% AT), and the somewhat GC-rich Klebsiella pneumoniae NTUH-K2044 (42.3% AT). The bias in B. aphidicola is striking, with a strong preference for A and U at the third position. This variation results in a periodic fluctuation of AT content when aligning all open reading frames (ORFs) at the translation start and extracting 400 base pairs upstream and downstream, as shown in figure 2.7. The red line represents a 3-point running average, which quickly approaches zero in the coding region. Gray lines represent the raw average values.
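The codon ordering and the per-position bias described above can be sketched in a few lines of Python. This is only an illustration, not the CodonPlot implementation: the function names are hypothetical, and the information content is assumed here to be the usual sequence-logo measure (2 bits minus the Shannon entropy of the four bases), which the text does not spell out.

```python
import math
from collections import Counter
from itertools import product

BASES = "AUGC"  # ordering stated in the text


def rose_plot_order():
    """Return the 64 codons sorted by last, then second, then first letter."""
    rank = {base: i for i, base in enumerate(BASES)}
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    return sorted(codons, key=lambda c: (rank[c[2]], rank[c[1]], rank[c[0]]))


def codon_counts(orf):
    """Count in-frame codons of one RNA-alphabet ORF (trailing bases ignored)."""
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def per_thousand(counts):
    """Convert raw codon counts to frequencies per thousand codons."""
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


def position_information(counts):
    """Assumed measure: 2 bits minus the base entropy at each codon position."""
    total = sum(counts.values())
    bits = []
    for pos in range(3):
        base_totals = Counter()
        for codon, n in counts.items():
            base_totals[codon[pos]] += n
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in base_totals.values() if n)
        bits.append(2.0 - entropy)
    return bits
```

With this ordering the first 32 codons end in A or U and the last 32 end in G or C, which corresponds to the two halves of the rose plot.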

[Figure: 3 x 3 panel grid. Rows: Buchnera_aphidicola_Cc (a-c), Ecoli_K12 (d-f), Klebsiella_pneumoniae_NTUH-K2044 (g-i). Columns: amino acid usage rose plot (a, d, g), codon usage rose plot (b, e, h), and nucleotide bias at codon positions 1 to 3, in bits (c, f, i). Radial axis of the rose plots: frequency.]

Figure 2.6: Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 (49.2% AT). Rightmost column shows the nucleotide bias of the three codon positions.


1st                 2nd position                   3rd
          U          C          A          G
U       19 Phe      4 Ser     14 Tyr      3 Cys     U
        19 Phe     11 Ser     13 Tyr      8 Cys     C
         6 Leu      4 Ser      2 Stop     1 Stop    A
         7 Leu     12 Ser      0 Stop    16 Trp     G
C        8 Leu      5 Pro     12 His     13 Arg     U
        16 Leu      8 Pro     11 His     31 Arg     C
         3 Leu      4 Pro      7 Gln      3 Arg     A
        72 Leu     30 Pro     38 Gln     10 Arg     G
A       20 Ile      5 Thr     12 Asn      4 Ser     U
        33 Ile     31 Thr     22 Asn     22 Ser     C
         3 Ile      3 Thr     24 Lys      2 Arg     A
        27 Met     13 Thr     13 Lys      1 Arg     G
G       10 Val     10 Ala     26 Asp     13 Gly     U
        21 Val     44 Ala     24 Asp     43 Gly     C
         7 Val      8 Ala     27 Glu      6 Gly     A
        33 Val     43 Ala     27 Glu     14 Gly     G

Table 2.3: Codon usage in Klebsiella pneumoniae NTUH-K2044. Frequencies are measured per thousand. A total of 4,697,097 base pairs are examined in 5,006 ORFs.

[Figure: plot titled "Buchnera_aphidicola_Cc: AT content". Horizontal axis: distance from translation start, -400 to 400. Vertical axis: Z-score, -2.0 to 0.0.]

Figure 2.7: AT content profile 400 bp upstream and downstream of annotated translation starts in Buchnera aphidicola Cc.
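A profile of the kind shown in figure 2.7 can be approximated as follows. This is a minimal sketch under stated assumptions: the windows are taken to be equal-length DNA strings already aligned at the translation start, the Z-score is assumed to be computed over the whole profile, and all function names are hypothetical rather than taken from the thesis software.

```python
from statistics import mean, pstdev


def at_profile(windows):
    """Fraction of A/T at each column of equal-length, start-aligned windows."""
    length = len(windows[0])
    return [sum(w[i] in "AT" for w in windows) / len(windows)
            for i in range(length)]


def running_average(values, width=3):
    """Centered running average; the window shrinks at both ends."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def z_scores(values):
    """Standardize a profile (assumed normalization, population std. dev.)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]
```

Because codons are three bases long, a 3-point running average spans exactly one codon, so it averages out a bias that is confined to a single codon position.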



Figure 2.8: Deamination of cytosine (C) into uracil (U) 

2.3.5 Base composition and DNA repair 

Klebsiella is often found in plant products, on root surfaces and living trees, in fresh vegetables, and in foods with a high content of sugars and acids, such as frozen orange juice concentrate. Klebsiella pneumoniae can cause urinary tract infections, and the NTUH-K2044 strain was isolated from a patient with liver abscess and meningitis. The broad range of ecological niches in which Klebsiella lives share the property of being rich in energy and nitrogen. Nitrogen-fixing aerobic bacteria are known to have higher chromosomal GC content (McEwan et al., 1998), explained by the nitrogen requirement to replicate the chromosome: an AT base pair contains 7 nitrogen atoms, whereas a GC pair contains 8 nitrogen atoms. Cytosine is prone to mutation caused by spontaneous deamination into uracil (Visnes et al., 2009) (figure 2.8). In E. coli the two enzymes uracil N-glycosylase and apurinic (AP) endonuclease are responsible for the repair of this mutation. However, in Buchnera aphidicola Cc, which has a small, reduced genome, these two enzymes are absent (confirmed by protein BLAST). Negative selection is likely to act on organisms that combine a high chromosomal GC content with the lack of a functional repair mechanism. Hence, the base composition of the bacterial genome is by no means random, and adjusting the overall GC content through evolution may be yet another way to adapt to the environment.
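The nitrogen argument can be made concrete with a small calculation. The sketch below only counts the nitrogen atoms in the bases (7 per AT pair, 8 per GC pair, as stated above) and uses the AT contents quoted in section 2.3.4; the function name is hypothetical.

```python
N_PER_AT_PAIR = 7  # adenine (5) + thymine (2)
N_PER_GC_PAIR = 8  # guanine (5) + cytosine (3)


def nitrogen_per_base_pair(at_fraction):
    """Average number of base nitrogen atoms per base pair."""
    return N_PER_AT_PAIR * at_fraction + N_PER_GC_PAIR * (1.0 - at_fraction)


buchnera = nitrogen_per_base_pair(0.798)    # B. aphidicola Cc, 79.8% AT
klebsiella = nitrogen_per_base_pair(0.423)  # K. pneumoniae NTUH-K2044, 42.3% AT
```

Under these assumptions the Klebsiella chromosome needs roughly 5% more base nitrogen per base pair than the Buchnera chromosome (about 7.58 versus 7.20 atoms).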

2.3.6 BLASTmatrix - proteome comparison 

The BLASTmatrix tool allows for visualization of proteome similarity between larger numbers of organisms. For each pairwise combination of proteomes, a BLAST search is performed. Two proteins are declared homologous when at least 50% of the protein is aligned and at least 50% of the residues within the alignment are conserved. For a report of proteome A against proteome B, all homologous proteins are then grouped into families, and the similarity between A and B is calculated as the number of families in which both organism A and organism B are represented. The BLAST report is cached, based on MD5 checksums of the proteomes. This enables the tool to reuse previous results efficiently when organisms are added to a comparison. This is repeated for all $\sum_{j=1}^{N} j = N(N+1)/2$ combinations, and for each combination a square is drawn containing the following information: the similarity as a percentage of all families of A and B, the number of shared families, and the total number of families. A small example matrix is shown in figure 2.9. The percentage is used to color-code the square to allow for an easier overview of larger comparisons.
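The family-based similarity measure described above can be sketched as follows. This is an illustration under explicit assumptions, not the BLASTmatrix source: the `Hit` record and all function names are hypothetical, coverage is assumed to be measured against the longer of the two proteins, "conserved" is taken as a residue count supplied with each hit, and families are assumed to be formed by single linkage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str        # protein id in one proteome
    subject: str      # protein id in the other (or the same) proteome
    query_len: int
    subject_len: int
    aln_len: int      # number of aligned residues
    conserved: int    # conserved residues within the alignment


def is_homolog(hit, min_cover=0.5, min_conserved=0.5):
    """50/50 rule; coverage relative to the longer protein is an assumption."""
    longest = max(hit.query_len, hit.subject_len)
    return (hit.aln_len >= min_cover * longest
            and hit.conserved >= min_conserved * hit.aln_len)


def build_families(proteins, hits):
    """Group proteins into families by single linkage (union-find)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for hit in hits:
        if hit.query != hit.subject and is_homolog(hit):
            parent[find(hit.query)] = find(hit.subject)
    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())


def similarity(proteins, hits, organisms=("A", "B")):
    """proteins maps protein id -> organism; returns (shared, total, percent)."""
    families = build_families(proteins, hits)
    wanted = set(organisms)
    shared = sum(1 for fam in families
                 if wanted <= {proteins[p] for p in fam})
    return shared, len(families), 100.0 * shared / len(families)
```

The three returned numbers correspond to the contents of one square in the matrix: the count of shared families, the total number of families, and the percentage used for the color code.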

The software requires a configuration in XML as its first argument. In appendix D.4, a Perl script is provided which automatically constructs a configuration that compares all published Campylobacter proteomes by querying the Genome Atlas Database. The output of this BLASTmatrix configuration is shown in figure 2.10.

The software has been used in different publications (Binnewies et al., 2005, 2006) and has been updated a number of times since. The older versions contained both BLAST directions and showed the number of shared proteins, which made the diagram redundant. The recent version avoids this by plotting the shared families instead, which renders the plot symmetrical across the diagonal. This allows the lower triangle to be removed.
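The number of squares in the triangular matrix and the idea of checksum-based caching can be illustrated as below. The cache key layout and the function names are assumptions made for this sketch; the text only states that cached BLAST reports are identified through MD5 checksums of the proteomes.

```python
import hashlib
from itertools import combinations_with_replacement


def proteome_md5(fasta_text):
    """MD5 checksum of a proteome given as FASTA text."""
    return hashlib.md5(fasta_text.encode("utf-8")).hexdigest()


def cache_key(query_fasta, subject_fasta):
    """Hypothetical key for one BLAST run; changes only if a proteome changes."""
    return proteome_md5(query_fasta) + "_" + proteome_md5(subject_fasta)


def matrix_cells(names):
    """Upper triangle including the diagonal (the intra-proteome squares)."""
    return list(combinations_with_replacement(names, 2))
```

For N proteomes `matrix_cells` yields N(N+1)/2 pairs, for example 55 squares for ten proteomes. When an eleventh proteome is added, only the 11 new pairs produce cache keys that have not been seen before.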

[Figure: BLASTmatrix of Escherichia coli strain K-12, substrains DH10B, W3110 and MG1655. Each square gives a percentage and the number of shared families over the total number of families. Pairwise squares: 95.3 % (3,843 / 4,034), 91.5 % (3,685 / 4,027) and 93.1 % (3,742 / 4,020). Intra-proteome squares: 4.3 % (167 / 3,912), 4.3 % (170 / 3,965) and 6.4 % (242 / 3,797).]

Figure 2.9: Construction of the BLASTmatrix diagram. Proteome similarity between three E. coli genomes. Lower part of the diagram corresponds to intra-proteome similarity.

[Figure: BLASTmatrix of Campylobacter proteomes. Species labels: lari, jejuni, concisus, curvus, fetus. Strain labels: ATCC BAA-381, RM1221, RM2100, 13826, 525.92. Each square gives a percentage and the number of shared families over the total number of families. Pairwise squares range from 21.2 % (595 / 2,802) to 84.7 % (1,448 / 1,709); intra-proteome squares reach up to 4.0 % (66 / 1,650).]

Figure 2.10: Proteome similarity between ten Campylobacter species. Color encoding corresponds to percentage of shared protein families.


[Figure: BLASTmatrix of Vibrionaceae proteomes. Labels: A.salmonicida LFI1238, V.species Ex25, V.campbellii AND4, V.harveyi BAA1116, V.shilonii AK1, P.profundum SS9, V.parahaemolyticus 2210633, V.vulnificus CMCP6, V.vulnificus YJ016, V.species MED222, V.splendidus LGP32, V.fischeri ES114, V.fischeri MJ11, and V.cholerae strains MO10, BX330286, RC9, MJ1236, B33VCE, 2740-80, AM-19226, MZO-2, 12129, TM11079-80, TMA21, VL426, N16961, 0395 TEDA, 0395 TIGR, V52 and M66-2. Each square gives a percentage and the number of shared families over the total number of families. Color scale labels: 0.0 %, 6.0 %, 30.0 % and 90.0 %.]

Figure 2.11: Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae strains are shown in dark green.

Large similarities between environmental and pathogenic V. cholerae 

The BLAST matrix shown in figure 2.11 includes environmental and pathogenic strains of V. cholerae. The figure shows that, both within and between these two groups, the V. cholerae strains share a large number of genes.

Intra- vs. inter-proteome similarity 

The lower row of the diagram shows the special case of organism A versus itself, that is, the intra-proteome similarity. If not dealt with separately, this part would appear as 100% similar, since the proteome is BLASTed against itself. However, all self-matching proteins are excluded, leaving this part to reflect the paralogs of the organism. This part also has a separate color encoding (red), whereas the inter-proteome comparison is coded green (see figure 2.10).

2.3.7 BLASTatlas - visualizing whole-genome homology

The BLASTmatrix tool described earlier condenses the similarity between two proteomes into a single number. This simplification allows for an all-against-all comparison, but lacks detailed information on which genes are conserved and where they are located. The BLASTatlas method overcomes these issues by comparing the proteomes to a single reference chromosome. When a single representative chromosome has been selected, all ORFs or proteins of that reference are BLASTed against each of the proteomes to be included in the comparison. The best alignment in each proteome, regardless of its significance, is mapped back to the reference genome. A numerical value of zero is mapped at mismatches or gaps, 0.5 at conservative mismatches, and one at matches. This method has proved powerful because it answers several questions in one diagram: Which reference proteins are found in which query genomes? How well are they conserved? And is there any correlation between the conservation of neighboring genes, such as within larger genomic islands? Figure 2.12 depicts the remapping of a protein-protein alignment back to the reference genome.

Figure 2.12: Mapping of a pairwise alignment to a reference genome. Mismatches, conservative mismatches and perfect matches contribute 0.0, 0.5, and 1.0 to the overall map, respectively. Gaps within the reference protein, corresponding to features missing from the reference protein, cannot be mapped and are hence excluded.

Figure 2.13: Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds to a pairwise comparison against the reference chromosome.
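The remapping of one alignment onto the reference protein (figure 2.12) can be sketched as follows. This is an assumption-laden illustration rather than the BLASTatlas code: the function name and arguments are hypothetical, the input is assumed to be the aligned reference sequence together with a BLAST-style midline (a letter for an identity, `+` for a conservative substitution, a space otherwise), and residues outside the alignment are assumed to score zero.

```python
def map_alignment(ref_aln, midline, ref_start, ref_length):
    """Return one conservation value per residue of the reference protein.

    ref_aln    -- aligned reference sequence, '-' marks a gap in the reference
    midline    -- BLAST-style midline of the same length as ref_aln
    ref_start  -- 1-based position in the reference where the alignment begins
    ref_length -- length of the reference protein
    """
    values = [0.0] * ref_length  # unaligned residues stay at zero (assumption)
    pos = ref_start - 1
    for ref_char, mid_char in zip(ref_aln, midline):
        if ref_char == "-":
            continue  # gap in the reference: nothing to map onto, excluded
        if mid_char == "+":
            values[pos] = 0.5  # conservative mismatch
        elif mid_char != " ":
            values[pos] = 1.0  # perfect match
        # a space (mismatch, or gap in the other protein) leaves 0.0
        pos += 1
    return values
```

To obtain a list as long as the reference genome, each protein value would still have to be spread over the three nucleotides of its codon at the position of the gene, which is left out of this sketch.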

The result of the mapping step is a list of the same length as the reference genome. BLASTatlas then uses the GeneWiz software (Pedersen et al., 2000) to visualize this numerical data. GeneWiz applies smoothing, and each bin is then encoded into a color using a range that is either fixed or dynamic, the latter given as n standard deviations around the average. Each genome included in the comparison is plotted as an individual track. The tool is offered as a Web Service (see chapter 4). A general client script can be obtained from the online documentation at http://www.cbs.dtu.dk/ws/BLASTatlas. The client script produces a PostScript plot as output. In the next sections, examples are provided that demonstrate the flexibility of the tool.
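The difference between a fixed and a dynamic color range can be illustrated with a short sketch. The binning by plain averaging and all function names are assumptions for illustration only; the actual smoothing performed by GeneWiz is not described here.

```python
from statistics import mean, pstdev


def bin_average(values, bin_size):
    """Average the per-position values in consecutive bins (assumed smoothing)."""
    return [mean(values[i:i + bin_size])
            for i in range(0, len(values), bin_size)]


def dynamic_range(binned, n_sd):
    """Color range given as n standard deviations around the average."""
    mu, sd = mean(binned), pstdev(binned)
    return mu - n_sd * sd, mu + n_sd * sd


def to_intensity(value, low, high):
    """Scale a bin value to a color intensity in [0, 1], clipping at the ends."""
    if high <= low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))
```

A fixed range simply passes constant limits to `to_intensity`, for example 0.0 and 0.8, whereas a dynamic range takes its limits from `dynamic_range` and therefore adapts to each track.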

Gene loss in Burkholderia species 

A comparative study aimed at mapping pathogenicity islands or gene losses among different bacterial genomes can benefit from the graphical representation provided by the BLASTatlas method. The genus Burkholderia covers a number of important animal and human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection in cystic fibrosis (CF) patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei, rarely gives rise to disease in humans (Brett et al., 1998; Smith et al., 1997). All publicly available and fully sequenced Burkholderia genomes are compared to chromosomes I and II of B. pseudomallei 1710b. The code listing below describes how the comparison was made and demonstrates the flexibility of the tool, as it allows for easy automation by reading simple configuration files, in this case generated by a MySQL query. The output configuration file is listed in appendix D.3.

# let mysql construct the blast configuration file
mysql --raw -B -N -e 'select concat("legend:",replace(organism_name,"Burkholderia","B."),"\nprogram:blastp\ncolor:",if(organism_name like "%pseudomal%","101010_000009",if(organism_name like "%mallei%","101010_000900",if(organism_name like "%cenocep%","101010_080000",if(organism_name like "%ambi%","101010_020002",if(organism_name like "%thailand%","101010_000900","101010_050505"))))),"\nrange:0.0,0.8\nsource:files/",pid,".fsa\n") from genomeatlas3_cur.genbank_complete_prj where organism_name like "burkhold%" and organism_name not like "%1710b%" order by organism_name;' > blast.cfg

# copy genbank files of chr I and II
foreach acc ( CP000124 CP000125 )
  cp /home/databases/genomeatlasdb-3.0_cur/data/$acc/$acc.gbk .
  saco_convert -I genbank -O annotation $acc.gbk > $acc.ann
  saco_extract -I genbank -O fasta -t $acc.gbk > $acc.proteins.fsa
  saco_convert -I genbank -O fasta $acc.gbk > $acc.fsa
end

# run the BLASTatlas client script on both chromosomes
perl BLASTatlas -modus circle -ref CP000124.fsa -proteins CP000124.proteins.fsa -ann CP000124.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" -title "B. pseudomallei 1710b, chr I" > burkholderia_chrI.ps
perl BLASTatlas -modus circle -ref CP000125.fsa -proteins CP000125.proteins.fsa -ann CP000125.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" -title "B. pseudomallei 1710b, chr II" > burkholderia_chrII.ps
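For readers without access to the MySQL database, the same kind of configuration file can be assembled from any list of genomes. The Python sketch below mirrors the five fields emitted by the query above (legend, program, color, range, source). It is an illustration only: the function names are made up for this example, the exact whitespace expected by the client script is assumed from the query string rather than from appendix D.3, and the project identifier in the usage example is hypothetical.

```python
def config_entry(legend, source, color="101010_050505",
                 program="blastp", value_range=(0.0, 0.8)):
    """Format one track entry in the field order used by the MySQL query."""
    low, high = value_range
    return (
        f"legend:{legend}\n"
        f"program:{program}\n"
        f"color:{color}\n"
        f"range:{low},{high}\n"
        f"source:{source}\n"
    )


def build_config(tracks):
    """Join track entries, separated by one blank line per entry."""
    return "\n".join(config_entry(**track) for track in tracks)


if __name__ == "__main__":
    # Hypothetical track list; "12345" is not a real project identifier.
    tracks = [
        {"legend": "B. mallei SAVP1", "source": "files/12345.fsa",
         "color": "101010_000900"},
    ]
    print(build_config(tracks), end="")
```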

The plots of the two chromosomes are shown in figure 2.14. The other B. pseudomallei genomes stand out as three dark blue tracks, representing high homology within the species. Both B. thailandensis and B. mallei display large chromosomal deletions when compared to B. pseudomallei. However, the more scattered nature of the gene loss observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through the loss of larger regions (Ong et al., 2004). These deletions are evident from the atlases in figure 2.14, and they show a strong preference for chromosome II. Ong and co-workers report that deletions in chromosome II account for 70% and 61% of the total gene loss in B. mallei and B. thailandensis, respectively.

The Alcanivorax phylome BLASTatlas 

Tracks on the BLASTatlas are not limited to single genomes or proteomes. The sequence files specified for a given track are converted into a BLAST database, and the reference genome is searched against the database of each track. A track may therefore just as well be a collection of genomes, an entire phylum or even SwissProt. In Paper III a ‘phylome’ atlas was constructed for the oil-degrading marine bacterium Alcanivorax borkumensis (Reva et al., 2008). Here, tracks were constructed by collecting all proteins of all published bacterial genomes, of all proteobacteria, and of all γ-, α-, β-, δ-, and ε-proteobacteria (see figure 2.15). The phylome atlas reveals no or very few homologs in δ- and ε-proteobacteria and some homologs in α- and β-proteobacteria, whereas the highest sequence homology was identified among γ-proteobacteria.

Figure 2.14: Comparison of B. pseudomallei 1710b chromosome I (4,126,292 bp) and chromosome II (3,181,762 bp) against all public Burkholderia genomes. Each BLAST track uses a fixed color scale from 0.00 to 0.80. Track legends recoverable from the plot: B. ambifaria AMMD and MC40-6; B. cenocepacia AU 1054, HI2424, J2315 and MC0-3; B. glumae BGR1; B. mallei ATCC 23344, NCTC 10229 and SAVP1; B. multivorans ATCC 17616; B. phymatum STM815; B. phytofirmans PsJN; B. pseudomallei 1106a, 668 and K96243; B. sp. 383; B. thailandensis E264; B. vietnamiensis G4; B. xenovorans LB400. Additional lanes show Percent AT (0.21 to 0.42), GC Skew (-0.09 to 0.09) and the annotations (CDS +, CDS -, rRNA, tRNA).


Figure 2.15: A phylome atlas of Alcanivorax borkumensis (3,120,143 bp), comparing the proteome against all γ-, α-, β-, δ-, and ε-proteobacteria available at the time of publishing. Lanes and color ranges recoverable from the plot: Bacteria, Proteobacteria and gamma (0.00 to 0.50); alpha, beta, delta and epsilon (0.00 to 0.30); Percent AT (0.40 to 0.51). Annotations: CDS +, CDS -, rRNA, tRNA.

Figure 2.16: Count of genomes (strains) and species divided by genera, shown on a scale from 0 to 50 for Streptococcus, Escherichia, Bacillus, Clostridium, Burkholderia, Mycobacterium, Candidatus, Staphylococcus, Shewanella and Mycoplasma. Source: CBS Genome Atlas Database as of 2009-09-11.

2.3.8 CorePlot - plotting the core- and pan-genomes of species 

There are a number of bacterial genera for which numerous strains and species are fully sequenced. Streptococcus (43 strains), Escherichia (29 strains), and Bacillus (25 strains) are the most highly represented genera among the Bacteria (Genome Atlas Database, 2009-09-11). Figure 2.16 shows the genome and species counts of the 10 most sampled genera. The increased depth at which bacterial genera are sequenced has previously been used to estimate the core- and pan-genome by fitting an exponentially decaying function. An often used approach is to perform either a limited or a full permutation of the genome order (Lefebure & Stanhope, 2007; Tettelin et al., 2005). This provides an error estimate for every step at which a genome is added. An alternative method was developed during the Ph.D. project, which derives the protein families by grouping homologous proteins but uses a fixed order of genomes. Homologs are identified by pairwise protein BLAST between proteomes followed by a grouping of all significant alignments (50% alignment length and 50% conservation within the alignment). The method can re-use cached BLAST reports from the BLASTmatrix method. The example below uses the same proteome files as were generated in the BLASTmatrix example (section 2.3.6 and appendix D.4), and it demonstrates how a MySQL query can be used as configuration for the CorePlot program.

mysql -N -B -e "select organism_name, concat(pid,'.proteins.fsa') from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name" > table.dat
perl ~pfh/scripts/coregenome/coregenome-2.3 < table.dat > core.ps

Both the BLASTmatrix and the coregenome scripts access the same MySQL caching databases, so the user does not have to worry about how results are cached and shared between the two programs. Figure 2.17 shows the core- and pan-genome plot generated by the program.
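The grouping of homologs into protein families can be sketched in Python. This is an illustration under stated assumptions rather than the coregenome script itself: the text gives the thresholds (50% alignment length and 50% conservation) but not how they are measured, so the sketch assumes that alignment length is taken relative to the longer of the two proteins and that conservation is the fraction of identical positions. The single-linkage grouping with a union-find structure is likewise an assumption.

```python
def is_significant(aln_len, identities, len_a, len_b,
                   min_coverage=0.5, min_conservation=0.5):
    """Apply the 50% / 50% rule to one pairwise alignment.

    ASSUMPTION: coverage is measured against the longer protein and
    conservation is identities divided by alignment length.
    """
    coverage = aln_len / max(len_a, len_b)
    conservation = identities / aln_len
    return coverage >= min_coverage and conservation >= min_conservation


def group_families(proteins, significant_pairs):
    """Group proteins into families by single linkage (union-find)."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the family representative,
        # shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in significant_pairs:
        parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())
```

With single linkage, two proteins end up in the same family as soon as a chain of significant alignments connects them, even if the two are not directly alignable.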

By using a fixed genome order, it is possible to compare multiple species within the same plot and to reveal varying slopes of the pan- and core-genome graphs. Figure 2.17 shows that the first 5 strains come from distinct species, giving rise to a steep increase of the pan genome and a reduction of the core genome. The next four genomes also come from C. jejuni (the tenth is C. lari), and the curves appear to flatten out at a core size of 600 proteins and a pan-genome size of 5,200 proteins. In figure 2.18 a larger core- and pan-genome plot for Vibrio species is shown (paper IV).
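Once every genome is reduced to the set of protein families it contains, the curves for a fixed genome order follow from simple set operations. The Python sketch below is a minimal illustration, not the CorePlot code; the function name and the input format (a list of name and family-set pairs) are assumptions.

```python
def core_pan_curve(genomes):
    """Accumulate core- and pan-genome sizes over a fixed genome order.

    genomes: list of (name, set_of_family_ids) in the order to be plotted.
    Returns one row per genome: (name, new_families, core_size, pan_size).
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new_families = len(families - pan)   # families not seen before
        pan |= families                      # union: everything seen so far
        # intersection: families present in every genome so far
        core = set(families) if core is None else core & families
        rows.append((name, new_families, len(core), len(pan)))
    return rows
```

Because the order is fixed, adding a genome from a new species shows up directly as a large number of new families and a drop in the core, which is the behaviour described for the first five Campylobacter genomes.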

Figure 2.17: Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently available, there seems to exist an equilibrium at close to 600 protein families. The plot shows new genes, new gene families, the core genome and the pan genome (scale 0 to 7000) for the genomes in the following order:
1: Campylobacter concisus 13826
2: Campylobacter curvus 525.92
3: Campylobacter fetus subsp. fetus 82-40
4: Campylobacter hominis ATCC BAA-381
5: Campylobacter jejuni RM1221
6: Campylobacter jejuni subsp. doylei 269.97
7: Campylobacter jejuni subsp. jejuni 81-176
8: Campylobacter jejuni subsp. jejuni 81116
9: Campylobacter jejuni subsp. jejuni NCTC 11168
10: Campylobacter lari RM2100

Figure 2.18: CorePlot output for 32 Vibrio genomes. The figure reproduces an excerpt from paper IV, including its Figure 3 (pan genome and core genome on a scale from 0 to 25000, one step per genome). The excerpt reads:

[...] pan-genome (blue line) increases, and the number of conserved gene families (red line) in the core genome decreases, albeit at a lower rate. This is because every genome can add many novel (and frequently different) genes to the pan-genome but only decreases the core genome with a few genes that are absent in that particular strain but that were conserved in the previously given genomes. The pan-genome curve increases with a relative steep slope when a novel species is added, as is obvious when one V. parahaemolyticus genome is added after the 18th V. cholerae. A stable plateau can be seen for pan genome of the V. cholerae genomes around 6500 genes, whereas the core genome steadily decreases to approximately 1000 genes for these 32 genomes. A. salmonicida, although not a member of the Vibrio genus, does not add significantly more genes to the pan genome than the other Vibrio species do, in contrast to P. profundum which produces a sharp increase in the pan genome, as does, interestingly, V. shilonii. Note that there are approximately 20,000 total gene families within the 30 sequenced Vibrionaceae genomes.

In fact, the small jump seen in the pan genome of V. cholerae when adding the 11th genome (figure 3) is caused by the difference between the two subclusters of V. cholerae seen in the pan-genome family tree (figure 2). Note that the 10th strain (V. cholerae 2740-80) behaves as an outlier in all the figures shown; although documented as an environmental isolate, this appears closer to the clinical isolates, in terms of overall genomic properties.

Figure 3. Pan- and core-genome plot of the 32 Vibrionaceae genomes. V. cholerae strains that do not cause cholera are highlighted in bright green. Colours are the same as in Figure 2.

BLAST comparison visualized in a BLAST matrix

A BLAST matrix provides a visual overview of reciprocal pairwise whole genome comparisons (figure 4). The stronger a matrix cell is colored, the more similarity was [...]

2.4 Summary 


This chapter presents a number of comparative genomics and visualization tools used in a genome annotation and analysis pipeline. Visualization methods have been shown to help draw biological conclusions about adaptation to environmental niches, pathogenic properties, and many other genomic properties, including proteome similarity. Keeping an overview of the large amount of genomic data is a constant challenge that will need more attention in the future as sequencing becomes more and more common. Soon there will be a need to compare sets of thousands of genomes: how can one visualize such a comparison?



2.5 Instant insight: Reading the genetic atlas



2.6 Paper I: The genome BLASTatlas - a GeneWiz extension 

for visualization of whole-genome homology 

Cover of Molecular BioSystems, Volume 4, Number 5, May 2008, pages 353–444 (ISSN 1742-206X, www.molecularbiosystems.org), listing this paper as a Highlight by Peter F. Hallin et al.

HIGHLIGHT www.rsc.org/molecularbiosystems | Molecular BioSystems 

The genome BLASTatlas—a GeneWiz 

extension for visualization of whole-genome 

homology 

Peter F. Hallin, Tim T. Binnewies* and David W. Ussery 

DOI: 10.1039/b717118h 

The development of fast and inexpensive methods for sequencing bacterial genomes 

has led to a wealth of data, often with many genomes being sequenced of the same 

species or closely related organisms. Thus, there is a need for visualization methods that 

will allow easy comparison of many sequenced genomes to a defined reference strain. 

The BLASTatlas is one such tool that is useful for mapping and visualizing whole 

genome homology of genes and proteins within a reference strain compared to other 

strains or species of one or more prokaryotic organisms. We provide examples of 

BLASTatlases, including the Clostridium tetani plasmid p88, where homologues for toxin 

genes can be easily visualized in other sequenced Clostridium genomes, and for a 

Clostridium botulinum genome, compared to 14 other Clostridium genomes. DNA 

structural information is also included in the atlas to visualize the DNA chromosomal 

context of regions. Additional information can be added to these plots, and as an 

example we have added circles showing the probability of the DNA helix opening up 

under superhelical tension. The tool is SOAP compliant and WSDL (web services description language) files are located on our website (http://www.cbs.dtu.dk/ws/BLASTatlas), where programming examples are available in Perl. By providing an

interoperable method to carry out whole genome visualization of homology, 

this service offers bioinformaticians as well as biologists an easy-to-adopt workflow 

that can be directly called from the programming language of the user, hence 

enabling automation of repeated tasks. This tool can be relevant in many pangenomic 

as well as in metagenomic studies, by giving a quick overview of clusters of 

insertion sites, genomic islands and overall homology between a reference 

sequence and a data set. 

Center for Biological Sequence Analysis, 

Department of Systems Biology, The 

Technical University of Denmark, 2800 

Lyngby, Denmark. E-mail: pfh@cbs.dtu.dk. 

E-mail: tim@cbs.dtu.dk. E-mail: 

dave@cbs.dtu.dk 

Peter F. Hallin was born in Odense, Denmark, and is currently a PhD student at CBS, DTU. Tim T. Binnewies grew up in Kiel, Germany, and obtained his PhD from the Technical University of Denmark; he is currently working for Roche Diagnostics AG in Switzerland. David W. Ussery was born and raised in Springdale, Arkansas. Since 1998, he has been leader for the Comparative Genomics group at CBS.

This journal is © The Royal Society of Chemistry 2008. Mol. BioSyst., 2008, 4, 363–371.

Background

It has been more than 10 years since the sequencing of the first bacterial genome (ref. 1, US patent number 6,528,289), and currently sequence data are available for more than a thousand sequenced genomes. With so many genome sequences, for several bacterial species multiple genome sequences exist; for example, at the time of writing, 10 different Escherichia coli genomes have been fully sequenced and published, and draft sequences for another 31 genomes are available, adding up to a total of 41 different E. coli genomes (according to the National Center for Biotechnology Information, NCBI Entrez, 12-Feb-2008). Table 1 lists

the top 20 represented prokaryotic 

genera in terms of numbers of fully 

sequenced genomes based on recent 

counting in Entrez Genome Projects, 

although these numbers will change 

quickly as more genomes are being 

added on a regular basis. Thus, analysis 

of multiple genomes of the same organism 

(the ‘‘pangenome’’) is now possible, 

and as more metagenomic datasets are 

published (see for example the projects

listed on the GOLD web pages, ref. 24), there

is a need for a graphical representation 

of how these new data compare to existing 

reference strains or model organisms. 

We have developed a visualization 

method, called ‘‘BLASTatlas’’, for showing 

mapped alignments of BLAST 

searches of a reference sequence against 

one or more databases, onto the reference 

genome. Early implementations of a similar method (refs. 2–4) accounted for the statistical significance (E-value) of each hit, by color coding the expectation values [-log(E)] of the alignment. This method gives a uniform color throughout the alignment (gene or protein) but shows no information about the amino acid conservation within regions of the alignment. At the level of a bacterial chromosome, this makes little difference, although when one zooms in at the level of individual genes, the older method of shading the entire gene based on the E-value gives no information about regions within a gene (such as functional domains) which might be strongly conserved, whilst other parts of the gene have little sequence homology within other genomes. We have refined the BLASTatlas method to map each individual amino acid residue or nucleotide back to the reference genome sequence from which the coding sequence was derived. Instead of colour-coding the significance of the entire hit, this method maps the conservation of the individual bases or amino acids. Tools such as the Artemis Comparison Tool (ACT; ref. 5) allow detailed viewing of complete

BLAST results, and this is an 

excellent graphical method for comparison 

of two genomes. ACT can also be 

extended to compare two genomes to a 

reference, placed in the middle. In 

contrast, the BLASTatlas method can 

compare many genomes to the same 

reference, and can provide a quick overview 

of chromosomal regions of gene 

conservation across many genomes. 

Table 1 The number of species and NCBI Entrez Project IDs of the 20 most represented genera in the Entrez Genome Projects Database (ref. 13), as accessed on 21 October 2007. The first number in each column counts completed projects only; the number in brackets counts both ongoing and completed projects. Candidate genera have been excluded from this counting.

Genus             Projects    Species
Streptococcus     26 [63]     8 [15]
Burkholderia      15 [55]     8 [15]
Bacillus          16 [48]     9 [16]
Clostridium       14 [43]     9 [22]
Vibrio            7 [35]      5 [14]
Mycobacterium     16 [30]     9 [14]
Salmonella        5 [30]      2 [3]
Listeria          4 [29]      3 [6]
Escherichia       10 [27]     1 [1]
Mycoplasma        13 [25]     11 [17]
Shewanella        14 [24]     10 [15]
Pseudomonas       13 [23]     7 [8]
Yersinia          9 [23]      3 [7]
Haemophilus       6 [23]      3 [4]
Staphylococcus    17 [22]     4 [5]
Synechococcus     10 [21]     2 [2]
Campylobacter     9 [20]      5 [9]
Francisella       7 [16]      1 [2]
Lactobacillus     11 [15]     10 [12]
Rickettsia        10 [15]     9 [12]

As can be seen from Table 1, for many of the heavily sampled genera, there are further genome projects in the pipeline which will produce even more sequences than are currently available, and there is a need for methods for efficient comparison of these genomes, giving an overview of general trends in the data. The BLASTatlas allows the comparison of many genomes to a reference sequence.

The current limit is about 60 genomes. 

There are two levels of comparison: the first is a one-page map of the whole chromosome, and the second zooms in on a particular region of interest, allowing the visualization of regions of conservation within individual genes.

The color-coding represents identical 

amino acids (or nucleic acids), based on 

a pairwise alignment of all protein coding 

regions, with the best matches for 

each gene in the reference genome 

shown. Thus, combining both levels, it 

is possible to get a global overview of the 

whole chromosome, and to then quickly 

identify gene conservation (or lack thereof) 

in regions of interest, at the level of 

conservation of individual amino acid 

residues. 

Clostridium botulinum is an important 

human pathogen which is the causative 

agent of botulism, giving rise to fatal 

paralysis of the respiratory muscles, 

caused by botulinum neurotoxin (BoNT) 

which disrupts nerve functions. The 

genes encoding BoNT components are 

clustered on the bacterial chromosome 

(group I + II strains), on prophages 

(group III strains) or on plasmids (group 

IV strains). Group I strains encode type A, B and F toxins, group II strains produce type B, E and F toxins and group III strains encode type C and D toxins, whereas group IV strains produce type G toxin (ref. 6). We use the

BLASTatlas method to show the overall 

genome homology of the C. botulinum 

strain F Langeland, compared to all 

currently available and fully sequenced 

strains of the Clostridium genus. 

Methods 

The BLASTatlas method uses all the 

provided annotated coding sequences 

(or proteins) of a reference genome, and 

compares each of those with one or more 

genomes. The total genome sequence for 

each organism is represented by a database 

and can contain any number of 

DNA or protein sequences. BLAST 

searches with a non-stringent E-value 

cut-off of 0.01 are used to identify the 

best alignments between the reference 

sequence protein and the database 

(genome) in question. Once identified, 

the single best pairwise alignment for 


each of the reference sequences is 

obtained and included in the map. 
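The selection of a single best alignment per reference sequence can be illustrated with a short Python sketch. It assumes that the BLAST results are available in the common 12-column tab-separated format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score) and that "best" means the highest bit score among hits passing the E-value cut-off of 0.01. Both points are assumptions for the purpose of illustration, and the function name is made up; the actual service works on full alignments, which this format does not contain.

```python
def best_hits(lines, max_evalue=0.01):
    """Keep the highest-scoring hit per query from tabular BLAST output.

    Returns a dict: query id -> (subject id, E-value, bit score).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bit_score = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # fails the E-value cut-off
        if query not in best or bit_score > best[query][2]:
            best[query] = (subject, evalue, bit_score)
    return best
```

Reference sequences without any hit below the cut-off are simply absent from the result, which corresponds to a region of zeros in the lane.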

The reference genome of a given 

comparison has a fixed size, whereas 

the sequences to be compared can be thought of as simply a “pile of proteins”, ranging in size from a small phage, through a single genome or an entire metagenomic sample, to existing large BLAST databases such as UniProt. It is important to emphasize

that each protein in the reference genome 

is compared to all the proteins in the 

query set—regardless of orientation or 

location. The BLASTatlas method uses 

the software BLASTALL v. 2.2.11 for 

the search, and in BLAST terminology, 

the reference genome constitutes the 

‘query’ whereas each other genome 

(e.g., a lane or circle in the atlas) in the 

comparison corresponds to the ‘database’. 

We define a lane as a visual representation 

of mapped database hits 

(individual residue matches) on to the 

reference genome. A lane can have a 

boxfilter (smoothing) applied within 

each of the smallest visible units of the 

atlas (the resolution of the graphical 

representation). A single BLASTatlas 

may contain several lanes; currently 

around 60 circles is the upper limit. 

The input requires a file containing the 

genome sequence, including all annotated 

coding sequences (comprising protein-start, 

-stop and -direction) for the 

reference genome. The four programs 

‘BLASTp’, ‘BLASTn’, ‘BLASTx’, and 

‘tBLASTn’ can be used for each lane of 

the BLASTatlas, although of course the 

appropriate sequences (DNA or protein) 

must be provided. For example, when using ‘BLASTn’ or ‘tBLASTn’ in a lane, the required DNA sequence can be a set

of open reading frames (ORFs), chromosomal 

contigs, entire genome sequences 

or even environmental (metagenomic) 

samples. In a pairwise fashion, the sequence 

of the reference is BLASTed 

against each database defined by the 

user, employing the specified BLAST 

algorithm. 

Interpretation of BLAST alignments 

For each of the sequences defined in the 

reference, only the best hit in each database 

is stored. For these hits, the alignments 

are mapped on to the reference 

genome. When aligning two DNA 

sequences, the map shows one of four 

possible states for each position: match, 

mismatch, gap in query (reference genome), 

and gap in database (lane). Only 

the match contributes to the overall score 

with a value of 1, whereas mismatches 

and gaps in the database get a score 

value of zero. When aligning two protein 

sequences, an additional state is introduced 

for conservative mismatches, indicating 

that two amino acids have similar 

physical–chemical properties; such a 

state will receive a score of 0.5. Match 

and gap states of protein alignments are 

defined similar to those of the DNA 

alignments. Gaps in the reference sequence do not get a corresponding coordinate and are therefore ignored (see Fig. 1). In the BLASTatlas context, a map is an array of match scores. The array has the same length as the reference genome, with each position along a gene having a value of 0, 0.5 or 1. It should be noted that intergenic

regions (and ncRNAs, including 

tRNAs and rRNAs) have values of 0, 

because BLASTatlases only compare 

protein encoding genes. We use this as 

a control, checking to make sure that the 

rRNA operons are visualized as ‘‘gaps’’ 

throughout all the lanes, for example. 

For each database defined, there will be 

a corresponding BLAST map within the 

atlas (see Fig. 2). Each database entry of 

the BLAST searches must contain a 

legend text for the lane, a colour code 

range and a scaling method. For the 

colours, an upper and lower colour is 

required, whereas the middle colour 

is usually grey; all colours are defined 

in RGB integers ranging from 0 to 10. 

The scale can be either fixed, such as 

ranging from 0 to 1, or scaled using any 

number of standard deviations around 

the average. 
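How per-residue scores become a genome-length map can be sketched in Python. The sketch is illustrative and rests on assumptions not stated in the text: gene coordinates are taken as 1-based and inclusive, each amino acid score is spread over the three bases of its codon, genes on the reverse strand are laid out from their end coordinate downwards, and where genes overlap the gene processed last wins. The function name is hypothetical.

```python
def place_on_genome(genome_len, genes):
    """Build a genome-length array of match scores.

    genes: iterable of (start, end, strand, residue_scores) with 1-based
    inclusive coordinates and strand "+" or "-".
    Positions not covered by any scored residue keep the value 0.0,
    so intergenic regions and RNA genes appear as gaps in the lane.
    """
    track = [0.0] * genome_len
    for start, end, strand, residue_scores in genes:
        # One score per residue becomes three per-base values.
        per_base = [s for s in residue_scores for _ in range(3)]
        for k, value in enumerate(per_base):
            # Forward genes run upwards from start, reverse genes
            # run downwards from end.
            pos = start - 1 + k if strand == "+" else end - 1 - k
            if start - 1 <= pos <= end - 1 and 0 <= pos < genome_len:
                track[pos] = value
    return track
```

In this sketch the stop codon, which has no residue in the protein, is left at zero along with the intergenic sequence.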

DNA properties 

The BLASTatlas method allows users to 

add structural as well as base composition 

information to the atlas by using the 

‘DNAparameters’ element in the request. 

These properties can be, for example, DNA structural properties (ref. 7) such as intrinsic curvature (ref. 8), global or local repeats (ref. 9) or other measures of base composition (ref. 10). A list of possible different

properties currently pre-computed can 

be obtained via the online documentation 

and type declarations of the web 

services description. The DNA property 

lanes are usually added near the center 

(or at the lowest part when seen from the 

outermost circle) of the atlas. 

Custom properties 

In addition to the standard DNA properties 

and BLAST maps, the web service 

provides a method for adding custom data, for example gene expression values, to the atlas, using the ‘customMap’ element in the request. Data

must be provided in the form of comma 

separated strings, with each position in 

the list corresponding to the genomic 

position. When defining custom data 

lanes, the colour ranges, scaling method, 

and legend text must be provided. 
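As an illustration of the comma-separated format, the Python sketch below expands per-gene values (for example expression levels) into one value per genomic position and joins them into a string. The function name, the use of 1-based inclusive intervals, the default value for uncovered positions and the number formatting are all assumptions; the service documentation should be consulted for the exact format expected in the ‘customMap’ element.

```python
def custom_map_string(genome_len, intervals, default=0.0):
    """Expand (start, end, value) intervals to one value per position.

    Coordinates are 1-based and inclusive; positions outside every
    interval receive the default value.
    """
    values = [default] * genome_len
    for start, end, value in intervals:
        for pos in range(max(start, 1) - 1, min(end, genome_len)):
            values[pos] = value
    # Compact number formatting keeps the string short for long genomes.
    return ",".join(f"{v:g}" for v in values)
```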

Visualization 

Details such as the atlas title and the 

geometry (linear or circle representation) 

are necessary for the final visualization. 

Once the BLAST searches are carried out and remapped to the reference genome, and custom data and DNA properties are collected, an XML configuration file is composed which contains all these data and the layout of the atlas. This file is then sent to the GeneWiz software (ref. 7), which produces a PostScript document; this document is then base64 encoded to allow transport via XML. This part of the process takes place on the server and requires no user interaction. An example atlas of a plasmid is shown in Fig. 3, and will be discussed in more detail below.

Fig. 1 Mapping of protein–protein alignment to DNA. Panel A: mismatches and perfect matches are assigned a score of 0 and 1, respectively. Conservative mismatches are assigned a score of 0.5. In the case of DNA alignment, only scores of 0 and 1 are possible. Panel B: gaps in the database sequence will be rendered as being non-conserved areas (filled with zeros). Panel C: gaps in the reference sequence will be neglected, since they have no corresponding region in the reference genome into which they can be mapped.

Fig. 2 Genes (or segments) from each genome are compared with a reference gene, as shown in the left panel; a pairwise comparison is made using one of the BLAST algorithms. On the right is shown the “remapping”, or the representation of each of the BLAST runs on the left, mapped onto the chromosomal sequence. Note that gaps in the reference gene (grey) are not included in the colored maps of the atlas.
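On the client side, the base64-encoded PostScript document has to be decoded before it can be viewed or printed. A minimal Python sketch using the standard library is shown below. The function name is made up for this example, and the check for the conventional "%!" PostScript header is an added safeguard that is not part of the described service.

```python
import base64


def decode_postscript(encoded):
    """Decode a base64 string and check that it looks like PostScript."""
    data = base64.b64decode(encoded)
    if not data.startswith(b"%!"):
        raise ValueError("decoded payload does not look like PostScript")
    return data


if __name__ == "__main__":
    # Round trip with a tiny stand-in document.
    sample = base64.b64encode(b"%!PS-Adobe-3.0\nshowpage\n").decode("ascii")
    with open("atlas.ps", "wb") as handle:
        handle.write(decode_postscript(sample))
```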

Web services implementation 

A WSDL (web services description language) 

file is written which describes the 

operations (runAtlas, pollQueue, fetch- 

AtlasResult) and the input requirements 

for them. The file can be downloaded. 

All input/output objects are defined in a 

separate XSD file (XML schema definition) 

within the WSDL file, which comprises 

information and type restrictions 

applicable in the request. This serves as 

documentation of the objects as well as a 

way to validate a request before it is 

submitted. Unfortunately, validation support 

in the Perl modules is not yet optimal, 

whereas it is 

well implemented in tools like soapUI 

(http://www.soapui.org/). It should be 

stressed that users should, until better 

validation support can be implemented, 

be careful to correctly format the input 

parameters before sending the request. 
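Until schema validation is available on the client side, a simple pre-submission check can catch the omissions mentioned in the text (atlas title, geometry, and the colour ranges, scaling method and legend text of custom lanes). The sketch below is an assumption-laden illustration: the dictionary keys are hypothetical and do not correspond to the element names in the service's XSD.

```python
def validate_request(request):
    """Return a list of human-readable problems; an empty list means no problem found.

    All key names are hypothetical stand-ins for the real request fields.
    """
    problems = []
    if not request.get("title"):
        problems.append("atlas title is missing")
    if request.get("geometry") not in ("linear", "circle"):
        problems.append("geometry must be 'linear' or 'circle'")
    for index, lane in enumerate(request.get("custom_lanes", [])):
        # Custom lanes need colour ranges, a scaling method and legend text.
        for field in ("colour_ranges", "scaling", "legend"):
            if not lane.get(field):
                problems.append(f"custom lane {index}: {field} is missing")
    return problems


if __name__ == "__main__":
    request = {
        "title": "pE88",
        "geometry": "circle",
        "custom_lanes": [{"colour_ranges": [(0.0, 1.0)], "scaling": "fixed"}],
    }
    for problem in validate_request(request):
        print(problem)
```

Such a check is no substitute for validating against the XSD, but it reports every problem at once instead of failing on the first.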

Fig. 3 BLASTatlas of pE88—a small plasmid of Clostridium tetani strain E88, GenBank accession number AF528097. DNA parameters percent AT, 

GC skew, global direct repeats, and global inverted repeats are included in the innermost lanes. BLAST lanes of all complete genome sequences of the 

Clostridium genomes (see Table 1), including plasmids, are included in the outermost lanes. As examples of custom lanes, the free energy (G, blue, kcal 

mol⁻¹) and the probability (P, red) measures of stress-induced DNA duplex destabilization (SIDD) sites are included in the lanes between the DNA 

properties and the BLAST lanes. 23 SIDD calculations were obtained from the SIDDbase WebService (http://www.cbs.dtu.dk/ws/SIDDbase). The 

request XML used to construct this plot can be downloaded from the example section of the service homepage, http://www.cbs.dtu.dk/ws/BLASTatlas. 

As expected, there is full homology of all coding regions between the plasmids and all replicons of C. tetani E88 (black lane just outside of the 

annotations); however there appears to be limited conservation of these pE88 genes throughout the genomes for other Clostridium strains. 


Table 2 A list of all strains and their accession numbers used in this comparison. Each row represents the NCBI Entrez sequencing project. The 

numbers of base pairs and protein-coding genes are the sums within each project. C. botulinum str. F Langeland is used as the 

reference for the comparison. 

Species (ref.) | Segments | Size (bp) | Proteins 

C. acetobutylicum ATCC 824 (ref. 14) | Entrez Project 77: Chromosome: AE001437, Plasmid pSOL1: AE001438 | 4,132,880 | 3,848 

C. beijerinckii NCIMB 8052 (unpublished) | Entrez Project 12637: Chromosome: CP000721 | 6,000,632 | 5,020 

C. botulinum A str. ATCC 19397 (unpublished) | Entrez Project 19517: Chromosome: CP000726 | 3,863,450 | 3,552 

C. botulinum A str. ATCC 3502 (ref. 6) | Entrez Project 193: Chromosome: AM412317, Plasmid pBOT3502: AM412318 | 3,903,260 | 3,671 

C. botulinum A str. Hall (unpublished) | Entrez Project 19521: Chromosome: CP000727 | 3,760,560 | 3,407 

C. botulinum F str. (unpublished) | Entrez Project 19519: Chromosome: CP000728, Plasmid pCLI: CP000729 | 4,012,918 | 3,659 

C. difficile 630 (ref. 15) | Entrez Project 78: Chromosome: AM180355, Plasmid pCD630: AM180356 | 4,298,133 | 3,787 

C. kluyveri DSM 555 (unpublished) | Entrez Project 19065: Chromosome: CP000673, Plasmid pCKL555A: CP000674 | 4,023,800 | 3,913 

C. novyi NT (ref. 16) | Entrez Project 16820: Chromosome: CP000382 | 2,547,720 | 2,325 

C. perfringens ATCC 13124 (ref. 25) | Entrez Project 304: Chromosome: CP000246 | 3,256,683 | 2,876 

C. perfringens SM101 (ref. 17) | Entrez Project 12521: Chromosome: CP000312, Plasmid 1: CP000313, Plasmid 2: CP000314, Viral segment phage phiSM101: CP000315 | 2,960,088 | 2,631 

C. perfringens str. 13 (ref. 18) | Entrez Project 79: Chromosome: BA000016, Plasmid pCP13: AP003515 | 3,085,740 | 2,723 

C. tetani E88 (ref. 19) | Entrez Project 81: Chromosome: AE015927, Plasmid pE88: AF528097 | 2,873,333 | 2,432 

C. thermocellum ATCC 27405 (unpublished) | Entrez Project 314: Chromosome: CP000568 | 3,843,301 | 3,191 

Clostridium phage (ref. 20) | Phage c-st: AP008983 | 185,683 | 198 

Web services workflow 

A workflow was written in Perl (v5.8.7), 

employing SOAP::Lite (v0.69), which 

reads the FASTA files of the database 

strains listed in Table 2 and produces a 

BLASTatlas using the C. botulinum 

strain F Langeland as reference. The 

script uses the online web service (see 

Fig. 4). The BLASTatlas figure produced 

by this workflow is seen in Fig. 5. 
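The submit, poll and fetch pattern of this workflow can be sketched independently of any SOAP library. The Python function below is not the authors' Perl script: it takes three caller-supplied callables standing in for the runAtlas, pollQueue and fetch operations, so no real network call or client API is assumed. The polling interval and the maximum number of polls are arbitrary choices of ours; the 32-character hexadecimal job identifier and the "FINISHED" status come from the service description.

```python
import re
import time

# The service identifies a job by a 32-character hexadecimal string.
JOB_ID_PATTERN = re.compile(r"^[0-9a-fA-F]{32}$")


def run_atlas_job(submit, poll, fetch, request,
                  interval=10.0, max_polls=360, sleep=time.sleep):
    """Submit a request, wait for status FINISHED, then return the result.

    submit, poll and fetch are placeholders for the web service operations;
    interval and max_polls are illustrative defaults, not service settings.
    """
    job_id = submit(request)
    if not JOB_ID_PATTERN.match(job_id):
        raise ValueError(f"unexpected job id: {job_id!r}")
    for _ in range(max_polls):
        if poll(job_id) == "FINISHED":
            return fetch(job_id)
        sleep(interval)  # wait before asking the queue again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in callables that simulate a job finishing on the third poll.
    statuses = iter(["QUEUED", "ACTIVE", "FINISHED"])
    result = run_atlas_job(
        submit=lambda req: "0123456789abcdef0123456789abcdef",
        poll=lambda job_id: next(statuses),
        fetch=lambda job_id: b"%!PS-Adobe-3.0",
        request={"title": "demo"},
        sleep=lambda seconds: None,
    )
    print(result)
```

Passing the operations in as callables keeps the control flow testable offline; in a real client each callable would wrap the corresponding SOAP call.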

Results 

Fig. 3 represents a BLASTatlas for plasmid 

pE88 from Clostridium tetani strain 

Fig. 4 Workflow description: a Perl script was written for handling the assembly of the SOAP 

envelope and contacting various other web services operations: (A) Obtaining the genome sequence: 

using the getSeq operation of the GenomeAtlas Web Services (v.3.3), the genome sequence of the 

reference genome is obtained as one continuous string. (B) Obtaining atlas annotations: 

annotated CDS, rRNA, and tRNA features of the GenBank record of the reference genome are obtained 

using the getFeatures operation; these are the features which will be printed in a separate lane 

on the atlas. (C) Obtaining ORF annotations of the reference genome: again using the getFeatures 

operation, all codon sequences and their translations are obtained. (D) Obtaining databases: FASTA 

files containing proteins and ORFs of the database genomes to be added as lanes are read. The 

outputs of the preceding steps are assembled into a single SOAP request, including the configuration of the atlas. 

(E) Polling the queue: once the job has been submitted, a 32-character hex string is returned for 

identifying the job, which can be used by the pollQueue operation to see the status of the job. 

(F + G) Obtaining the result: once the status “FINISHED” is obtained from pollQueue, the job id 

can be submitted to fetchResult and the resulting PostScript image is returned. 

E88. The homology for genes in the 

plasmid to other sequenced genomes is 

shown in the circles; additional “custom 

lanes” represent chromosomal regions 

predicted to open under superhelical 

stress. The chromosomal location of the 

genes encoding colT and tetR are labelled 

in the figure. Notice that these two proteins 

contain regions of homology that 

are found in most of the Clostridium 

proteomes searched. Since the C. tetani 

plasmid is included in the genome sequence 

(black circle in the figure), all 

the genes are found in this genome (solid 

black), and most of the other Clostridium 

proteomes contain some weak homology 

but in general lack most of the plasmid-encoded 

genes. Thus, this is a quick overview 

of gene conservation of a plasmid 

compared to many sequenced genomes of 

the same genus. 

To demonstrate this for an entire bacterial 

genome (which is millions of bp in 

size, compared to the small, ~75 000 bp 

plasmid shown in Fig. 3), we have used 

the genome sequence of C. botulinum 

strain F Langeland, the largest of the 

C. botulinum genomes, to build a protein 

BLASTatlas of all publicly available 

fully sequenced Clostridia genomes, including 

all chromosomes, plasmids and 

phages (see Fig. 5). Each lane of the atlas 

corresponds to a sequencing project that 

contains the main chromosome plus any 


Fig. 5 BLASTatlas of Clostridium botulinum F strain Langeland: Lanes show genome homology of (starting from the outermost lane): 

C. acetobutylicum ATCC 824, C. beijerinckii NCIMB 8052, C. botulinum A str. ATCC 19397, C. botulinum A ATCC 3502, C. botulinum A str. Hall, 

C. difficile 630, C. kluyveri DSM 555, C. novyi NT, C. perfringens ATCC 13124, C. perfringens SM101, C. perfringens str. 13, C. tetani E88, 

C. thermocellum ATCC 27405, and Clostridium phage c-st genome. Inside of the annotation circle are shown global direct repeats, global inverted 

repeats, stacking energy, and percent AT. Blue and red annotations are coding sequences on plus and minus strand, whereas green and turquoise 

are rRNA and tRNA genes, respectively. The two toxin components NTNH and BoNT/A1 that are identified on phage c-st are present in the 

reference genome at positions 880 kb and 883 kb, respectively (marked ‘cst’). The presence of the two is visible as a thin blue band on the c-st BLAST 

lane. The lower part of the figure shows a zoom of the region around 2635 kb, providing an example of a gene cluster which appears to be 

conserved throughout the C. botulinum strains and partly within the C. difficile 630. 

phages or plasmids present in the genome. 

The proteins encoded by the 185 kb 

neurotoxin-converting bacteriophage 

c-st are labelled, as well as a region which 

is zoomed in the second panel in Fig. 5. 

The accession numbers, total size and 

total number of genes within each lane 

can be seen in Table 2. 

There are several items of interest which 

can be seen in Fig. 5. First, the rRNA 

operons can be quite readily seen, near the 

top part of the chromosome map, labeled 

turquoise; these rRNA operons are more 

GC rich (hence less red in the inner-most 

lane), have direct and inverted repeats (the 

next two lanes), and are not shown in the 

proteome comparison lanes (since these 

genes do not encode proteins). 

As expected, the circle representing 

the c-st phage shows little match for most 

of the C. botulinum genome, at the 

protein level. In general, the two other 

C. botulinum genomes (both in blue) have 

the highest similarity to the reference 

C. botulinum genome (also shown as a 

circle). In this case it is used as an internal 

control: all of the proteins should show a 

match for this lane, since the reference 

genome is blasted against itself. Another 

interesting observation is the upper-left-hand 

part of the genome, which seems to 

have more homology to other Clostridium 

genomes, in particular showing 

many matches to the C. perfringens 

genomes (green circles), compared to the 

rest of the genome. 

Application in metagenomics 

The genus Prochlorococcus belongs 

to the cyanobacteria and is one of the 

most abundant photosynthetic organisms 

of the ocean. It plays an important 

role in the planet’s carbon cycle and has 

adapted to the various light and oxygen 

conditions present at the various 

depths. 11 As of the end of January 

2008, eleven Prochlorococcus marinus 

genomes are publicly available and we 

have included all encoded proteins of 

these data with the seven metagenomic 

read collections from the ALOHA 

station near Hawaii, 12 as shown in 

Table 3. P. marinus strain 

MIT 9303 has the largest genome of all 


Table 3 A list of all strains/sample names and their accession numbers used in the metagenomic comparison. The list is sorted by sampling depth 

Source | Size | Origin | Accession/sample | Ref. | Depth 

P. marinus str. MIT 9515 | 1 704 176 (1906 proteins) | Tropical Pacific | CP000552 | Unpublished | Surface 

P. marinus str. MIT 9215 | 1 738 790 (1983 proteins) | Equatorial Pacific | CP000825 | Unpublished | Surface 

P. marinus str. MED4 | 1 657 990 (1936 proteins) | Mediterranean Sea | BX548174 | 21 | 4 m 

JGI_SMPL_HF10_10-07-02 | 7 482 668 (7842 contigs) | North Pacific Subtropical Gyre | — | 12 | 10 m 

P. marinus str. NATL1A | 1 864 731 (2193 proteins) | North Atlantic | CP000553 | Unpublished | 30 m 

P. marinus str. NATL2A | 1 842 899 (2163 proteins) | North Atlantic | CP000095 | Unpublished | 30 m 

P. marinus str. AS9601 | 1 669 886 (1921 proteins) | Arabian Sea | CP000551 | Unpublished | 50 m 

JGI_SMPL_HF70_10-07-02 | 10 828 386 (10 999 contigs) | North Pacific Subtropical Gyre | — | 12 | 70 m 

P. marinus str. MIT 9211 | 1 688 963 (1855 proteins) | Equatorial Pacific | CP000878 | 21 | 83 m 

P. marinus str. MIT 9301 | 1 641 879 (1907 proteins) | Sargasso Sea | CP000576 | Unpublished | 90 m 

P. marinus str. MIT 9303 | 2 682 675 (2997 proteins) | Sargasso Sea | CP000554 | Unpublished | 100 m 

P. marinus str. SS120 | 1 751 080 (1882 proteins) | Sargasso Sea | AE017126 | 22 | 120 m 

P. marinus str. MIT 9312 | 1 709 204 (1962 proteins) | Equatorial Pacific | CP000111 | Unpublished | 135 m 

P. marinus str. MIT 9313 | 2 410 873 (2273 proteins) | Gulf Stream | BX548175 | 21 | 135 m 





currently available sequences (2.7 Mb) 

and was therefore used as reference in 

this comparison. BLAST hits between 

the reference and the encoded proteins 

of all the P. marinus genomes included 

were generated with the BLASTp 

algorithm, whereas hits between the 

reference proteins and the DNA reads 

of the metagenomic samples were 

generated using the tBLASTn algorithm. 

tBLASTn was used to avoid the 

gene prediction step of the metagenomic 

samples and to allow a rough estimate 

of the coding potential of these samples. 

All lanes are sorted according to 

the water depth at which the samples 

were collected (see Fig. 6). The Perl 

code for constructing this plot using 

web services is provided on the service 

homepage. 
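The two decisions described above, which BLAST program to use for a lane and in which order to draw the lanes, can be sketched as follows. This is an illustration in Python rather than the Perl code on the service homepage. The lane tuples are a subset of Table 3; the "proteins"/"dna_reads" labels, the function names and the convention that "Surface" means a depth of 0 m are assumptions made for the example.

```python
def depth_in_metres(label):
    """Convert a Table 3 depth label such as 'Surface' or '83 m' to a number."""
    text = label.strip().lower()
    if text == "surface":
        return 0.0  # assumption: surface samples are treated as 0 m
    return float(text.replace("m", "").strip())


def blast_program(kind):
    """Proteins are searched with BLASTp; raw DNA reads with tBLASTn,
    which avoids a gene prediction step on the metagenomic samples."""
    return {"proteins": "blastp", "dna_reads": "tblastn"}[kind]


def plan_lanes(lanes):
    """Return (name, program, depth) tuples ordered from shallow to deep."""
    ordered = sorted(lanes, key=lambda lane: depth_in_metres(lane[2]))
    return [(name, blast_program(kind), depth_in_metres(depth))
            for name, kind, depth in ordered]


if __name__ == "__main__":
    lanes = [
        ("P. marinus str. SS120", "proteins", "120 m"),
        ("JGI_SMPL_HF10_10-07-02", "dna_reads", "10 m"),
        ("P. marinus str. MIT 9515", "proteins", "Surface"),
        ("P. marinus str. MED4", "proteins", "4 m"),
    ]
    for name, program, depth in plan_lanes(lanes):
        print(f"{depth:6.1f} m  {program:8s} {name}")
```

Because Python's sort is stable, lanes sampled at the same depth keep their input order.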

Discussion 

The BLASTatlas method can assist biologists 

in finding regions along the chromosome 

which are conserved (or not). 

This information is useful for several 

Fig. 6 BLASTatlas showing fully sequenced Prochlorococcus genomes (green) and the seven ALOHA metagenomic samples (blue). Outermost 

lanes represent samples closer to the ocean surface. 


different applications, such as identifying 

phage insertion sites and loss of important 

genetic material. This method is 

even able to scale down to each individual 

nucleotide or amino acid residue. 

However, it is unable to deal with sequences 

(or parts thereof) that are not 

found in the reference genome. A good 

compromise when dealing with this issue 

is often to use the largest chromosome of 

a species as reference; in addition, it can 

be useful to rebuild the maps using different 

reference genomes. Besides this 

limitation, the fact that all coordinates 

are mapped back to the reference causes 

the coordinates of the database genomes 

to “get lost” in that only the best match 

is displayed, regardless of the chromosomal 

location in the database genomes. 

Other aspects of genome homology like 

gene synteny cannot effectively be 

answered by this tool. However, it is 

possible to use an additional circle to 

plot gene order conservation along the 

chromosome. 
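The loss of database coordinates noted above follows directly from how a lane is assembled. The Python sketch below assumes, for illustration, that each hit is reduced to a start position on the reference plus a list of per-position scores, and that overlapping hits are combined by keeping the highest score at each position; both the data layout and the per-position maximum are our assumptions, not a description of the BLASTatlas code.

```python
def best_match_lane(reference_length, hits):
    """Collapse hits into one score per reference position.

    hits is a list of (reference_start, scores) pairs. Nothing about the
    location of a hit in the database genome is stored, which is why that
    information cannot be recovered from the finished lane.
    """
    lane = [0.0] * reference_length  # positions without any hit stay at zero
    for start, scores in hits:
        for offset, score in enumerate(scores):
            position = start + offset
            if 0 <= position < reference_length and score > lane[position]:
                lane[position] = score  # keep only the best match
    return lane


if __name__ == "__main__":
    hits = [(0, [1.0, 0.5]), (1, [1.0, 1.0, 0.5]), (4, [0.5])]
    print(best_match_lane(6, hits))
```

A sequence present only in a database genome never produces a reference position, so it cannot appear in the lane at all; this is the first limitation discussed in this section.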

Currently, we see the BLASTatlas as 

an intermediate stage in analysis of many 

genomes of similar species. Soon there 

will be a need to compare hundreds or 

thousands of genome sequences, so the 

development of new methods 

for comparing such large numbers 

of genomes is becoming 

ever more important. 

Acknowledgements 

The authors would like to thank Hans 

Henrik Stærfeldt for assistance with server 

side programs and Kristoffer Rapacki 

for assistance on web services 

data types. The work was supported by 

a grant from the European Union 

through the EMBRACE Network of Excellence, 

contract number LSHG-CT-2004-512092, 

and a grant from the Danish 

Center for Scientific Computing 

(DCSC). 

References 

1 R. D. Fleischmann, M. D. Adams, O. 

White, R. A. Clayton, E. F. Kirkness, A. 

R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. 

Dougherty, J. M. Merrick, J. McKenney, 

G. Sutton, W. FitzHugh, C. Fields, J. D. 

Gocayne, J. Scott, R. Shirley, L. I. Liu, A. 

Glodek, J. M. Kelley, J. F. Weidman, C. 

A. Phillips, T. Spriggs, E. Hedblom, M. D. 

Cotton, T. R. Utterback, M. C. Hanna, D. 

T. Nguyen, D. M. Saudek, R. C. Brandon, 

L. D. Fine, J. L. Fritchman, J. L. Fuhrmann, 

N. S. M. Geoghagen, C. L. Gnehm, 

L. A. McDonald, K. V. Small, C. M. 

Fraser, H. O. Smith and J. C. Venter, 

Whole-Genome Random Sequencing and 

Assembly of Haemophilus influenzae Rd, 

Science, 1995, 269(5223), 496–512. 

2 L. J. Jensen, M. Skovgaard, T. Sicheritz- 

Ponten, M. K. Jorgensen, C. Lundegaard, 

C. C. Pedersen, N. Petersen and D. Ussery, 

Analysis of two large functionally uncharacterized 

regions in the Methanopyrus 

kandleri AV19 genome, BMC Genomics, 

2003, 4, 12. 

3 L. J. Jensen, M. Skovgaard, T. Sicheritz- 

Ponten, N. T. Hansen, H. Johansson, 

M. K. Jørgensen, K. Kiil, P. F. Hallin 

and D. Ussery, Comparative genomics of 

four Pseudomonas species, in The Pseudomonads 

Vol. I. Genomics, Life Style 

and Molecular Architecture, ed. J. L. 

Ramos, Kluwer Academic/Plenum 

Publishers, New York, 2004, ch. 5, 

pp. 139–164. 

4 P. F. Hallin, T. T. Binnewies and D. W. 

Ussery, Genome update: chromosome atlases, 

Microbiology (Reading, U. K.), 

2004, 150, 3091–3093. 

5 T. J. Carver, K. M. Rutherford, M. Berriman, 

M. A. Rajandream, B. G. Barrell and 

J. Parkhill, ACT: the Artemis Comparison 

Tool, Bioinformatics, 2005, 21, 3422–3423. 

6 M. Sebaihia, M. W. Peck, N. P. Minton, 

N. R. Thomson, M. T. Holden, W. J. 

Mitchell, A. T. Carter, S. D. Bentley, D. 

R. Mason, L. Crossman, C. J. Paul, A. 

Ivens, M. H. Wells-Bennik, I. J. Davis, A. 

M. Cerdeno-Tarraga, C. Churcher, M. A. 

Quail, T. Chillingworth, T. Feltwell, A. 

Fraser, I. Goodhead, Z. Hance, K. Jagels, 

N. Larke, M. Maddison, S. Moule, K. 

Mungall, H. Norbertczak, E. Rabbinowitsch, 

M. Sanders, M. Simmonds, B. 

White, S. Whitehead and J. Parkhill, Genome 

sequence of a proteolytic (Group I) 

Clostridium botulinum strain Hall A and 

comparative analysis of the clostridial genomes, 

Genome Res., 2007, 17, 1082–1092. 

7 A. G. Pedersen, L. J. Jensen, S. Brunak, H. 

H. Staerfeldt and D. W. Ussery, A DNA 

structural atlas for Escherichia coli, J. Mol. 

Biol., 2000, 299, 907–930. 

8 E. S. Shpigelman, E. N. Trifonov and 

A. Bolshoy, Curvature: software for the 

analysis of curved DNA, CABIOS, Comput. 

Appl. Biosci., 1993, 9, 435–440. 

9 M. Skovgaard, L. J. Jensen, C. Friis, H. H. 

Stærfeldt, P. Worning, S. Brunak and D. 

Ussery, The Atlas Visualisation of Genome-wide 

Information, Methods Microbiol., 

2002, 33, 49–63. 

10 L. J. Jensen, C. Friis and D. W. Ussery, 

Three Views of Microbial Genomes, Res. 

Microbiol., 1999, 150, 773–777. 

11 M. B. Sullivan, M. L. Coleman, P. Weigele, 

F. Rohwer and S. W. Chisholm, 

Three Prochlorococcus cyanophage Genomes: 

Signature Features and Ecological 

Interpretations, PLoS Biol., 2005, 3, e144. 

12 E. F. DeLong, C. M. Preston, T. Mincer, 

V. Rich, S. J. Hallam, N.-U. Frigaard, A. 

Martinez, M. B. Sullivan, R. Edwards, B. 

R. Brito, S. W. Chisholm and D. M. Karl, 

Community Genomics Among Stratified 

Microbial Assemblages in the Ocean’s Interior, 

Science, 2006, 311(5760), 496–503. 

13 D. L. Wheeler, T. Barrett, D. A. Benson, 

S. H. Bryant, K. Canese, V. Chetvernin, 

D. M. Church, M. DiCuccio, R. Edgar, S. 

Federhen, L. Y. Geer, Y. Kapustin, O. 

Khovayko, D. Landsman, D. J. Lipman, 

T. L. Madden, D. R. Maglott, J. Ostell, V. 

Miller, K. D. Pruitt, G. D. Schuler, E. 

Sequeira, S. T. Sherry, K. Sirotkin, A. 

Souvorov, G. Starchenko, R. L. Tatusov, 

T. A. Tatusova, L. Wagner and E. 

Yaschenko, Database Resources of the 

National Center for Biotechnology Information, 

Nucleic Acids Res., 2007, 35, 

D5–D12. 

14 J. Nolling, G. Breton, M. V. Omelchenko, 

K. S. Makarova, Q. Zeng, R. Gibson, H. 

M. Lee, J. Dubois, D. Qiu, J. Hitti, Y. I. 

Wolf, R. L. Tatusov, F. Sabathe, L. Doucette-Stamm, 

P. Soucaille, M. J. Daly, G. 

N. Bennett, E. V. Koonin and D. R. 

Smith, Genome Sequence and Comparative 

Analysis of the Solvent-producing 

Bacterium Clostridium acetobutylicum, J. 

Bacteriol., 2001, 183, 4823–4838. 

15 M. Sebaihia, B. W. Wren, P. Mullany, N. 

F. Fairweather, N. Minton, R. Stabler, N. 

R. Thomson, A. P. Roberts, A. M. Cerdeno-Tarraga, 

H. Wang, M. T. Holden, A. 

Wright, C. Churcher, M. A. Quail, S. 

Baker, N. Bason, K. Brooks, T. Chillingworth, 

A. Cronin, P. Davis, L. Dowd, A. 

Fraser, T. Feltwell, Z. Hance, S. Holroyd, 

K. Jagels, S. Moule, K. Mungall, C. Price, 

E. Rabbinowitsch, S. Sharp, M. Simmonds, 

K. Stevens, L. Unwin, S. Whitehead, 

B. Dupuy, G. Dougan, B. Barrell 

and J. Parkhill, The Multidrug-resistant 

Human Pathogen Clostridium difficile has 

a Highly Mobile, Mosaic Genome, Nat. 

Genet., 2006, 38, 779–786. 

16 C. Bettegowda, X. Huang, J. Lin, I. 

Cheong, M. Kohli, S. A. Szabo, X. Zhang, 

L. A. Diaz, Jr, V. E. Velculescu, G. Parmigiani, 

K. W. Kinzler, B. Vogelstein and 

S. Zhou, The Genome and Transcriptomes 

of the Anti-tumor Agent Clostridium novyi-NT, 

Nat. Biotechnol., 2006, 24, 

1573–1580. 

17 G. S. Myers, D. A. Rasko, J. K. Cheung, J. 

Ravel, R. Seshadri, R. T. DeBoy, Q. Ren, 

J. Varga, M. M. Awad, L. M. Brinkac, S. 

C. Daugherty, D. H. Haft, R. J. Dodson, 

R. Madupu, W. C. Nelson, N. J. Rosovitz, 

S. A. Sullivan, H. Khouri, G. I. Dimitrov, 

K. L. Watkins, S. Mulligan, J. Benton, D. 

Radune, D. J. Fisher, H. S. Atkins, T. 

Hiscox, B. H. Jost, S. J. Billington, J. G. 

Songer, B. A. McClane, R. W. Titball, J. I. 

Rood, S. B. Melville and I. T. Paulsen, 

Skewed Genomic Variability in Strains of 

the Toxigenic Bacterial Pathogen, 

Clostridium perfringens, Genome Res., 

2006, 16, 1031–1040. 

18 T. Shimizu, K. Ohtani, H. Hirakawa, K. 

Ohshima, A. Yamashita, T. Shiba, N. 

Ogasawara, M. Hattori, S. Kuhara and 

H. Hayashi, Complete Genome Sequence 

of Clostridium perfringens, an Anaerobic 

Flesh-eater, Proc. Natl. Acad. Sci. 

U. S. A., 2002, 99, 996–1001. 

19 H. Bruggemann, S. Baumer, W. F. Fricke, 

A. Wiezer, H. Liesegang, I. Decker, 


C. Herzberg, R. Martinez-Arias, R. Merkl, 

A. Henne and G. Gottschalk, The Genome 

Sequence of Clostridium tetani, the 

Causative Agent of Tetanus Disease, Proc. 

Natl. Acad. Sci. U. S. A., 2003, 100, 

1316–1321. 

20 Y. Sakaguchi, T. Hayashi, K. Kurokawa, 

K. Nakayama, K. Oshima, Y. Fujinaga, M. 

Ohnishi, E. Ohtsubo, M. Hattori and K. 

Oguma, The Genome Sequence of 

Clostridium botulinum Type C Neurotoxin 

Converting Phage and the Molecular Mechanisms 

of Unstable Lysogeny, Proc. Natl. 

Acad. Sci. U. S. A., 2005, 102, 17472–17477. 

21 G. Rocap, F. W. Larimer, J. Lamerdin, S. 

Malfatti, P. Chain, N. A. Ahlgren, A. 

Arellano, M. Coleman, L. Hauser, W. R. 

Hess, Z. I. Johnson, M. Land, D. Lindell, 

A. F. Post, W. Regala, M. Shah, S. L. 

Shaw, C. Steglich, M. B. Sullivan, C. S. 

Ting, A. Tolonen, E. A. Webb, E. R. 

Zinser and S. W. Chisholm, Genome Divergence 

in Two Prochlorococcus ecotypes 

Reflects Oceanic Niche Differentiation, 

Nature, 2003, 424, 1042–1047. 

22 A. Dufresne, M. Salanoubat, F. Partensky, 

F. Artiguenave, I. M. Axmann, V. 

Barbe, S. Duprat, M. Y. Galperin, E. V. 

Koonin, F. Le Gall, K. S. Makarova, M. 

Ostrowski, S. Oztas, C. Robert, I. B. Rogozin, 

D. J. Scanlan, N. Tandeau de Marsac, 

J. Weissenbach, P. Wincker, Y. I. 

Wolf and W. R. Hess, Genome Sequence 

of the Cyanobacterium Prochlorococcus 

marinus SS120, a Nearly Minimal Oxyphototrophic 

Genome, Proc. Natl. Acad. 

Sci. U. S. A., 2003, 100, 9647–9649. 

23 C. J. Benham and C. Bi, The Analysis of 

Stress-induced Duplex Destabilization in 

Long Genomic DNA Sequences, J. 

Comput. Biol., 2004, 11, 519–543. 

24 K. Liolios, N. Tavernarakis, P. 

Hugenholtz and N. C. Kyrpides, The 

Genomes On Line Database (GOLD) 

v.2: a monitor of genome projects worldwide, 

Nucleic Acids Res., 2006, 34, 

D332–D334. 

25 J. I. Rood and S. T. Cole, Molecular 

genetics and pathogenesis of Clostridium 

perfringens, Microbiol. Rev., 1991, 55, 

621–648. 




2.7 Paper II: Ten years of bacterial genome sequencing: 

comparative–genomics–based discoveries

Funct Integr Genomics (2006) 6: 165–185 

DOI 10.1007/s10142-006-0027-2 

REVIEW 

Tim T. Binnewies · Yair Motro · Peter F. Hallin · 

Ole Lund · David Dunn · Tom La · David J. Hampson · 

Matthew Bellgard · Trudy M. Wassenaar · 

David W. Ussery 

Ten years of bacterial genome sequencing: 

comparative-genomics-based discoveries 

Received: 20 January 2006 / Revised: 24 February 2006 / Accepted: 7 March 2006 / Published online: 12 May 2006 

© Springer-Verlag 2006 

Abstract It has been more than 10 years since the first 

bacterial genome sequence was published. Hundreds of 

bacterial genome sequences are now available for comparative 

genomics, and searching a given protein against 

more than a thousand genomes will soon be possible. The 

subject of this review will address a relatively straightforward 

question: “What have we learned from this vast 

amount of new genomic data?” Perhaps one of the most 

important lessons has been that genetic diversity, at the 

level of large-scale variation amongst even genomes of the 

same species, is far greater than was thought. The classical 

textbook view of evolution relying on the relatively slow 

accumulation of mutational events at the level of individual 

bases scattered throughout the genome has changed. One 

of the most obvious conclusions from examining the 

sequences from several hundred bacterial genomes is the 

enormous amount of diversity—even in different genomes 

from the same bacterial species. This diversity is generated 

by a variety of mechanisms, including mobile genetic 

elements and bacteriophages. An examination of the 20 

Escherichia coli genomes sequenced so far dramatically 

illustrates this, with the genome size ranging from 4.6 to 

5.5 Mbp; much of the variation appears to be of phage 

origin. This review also addresses mobile genetic elements, 

T. T. Binnewies · P. F. Hallin · O. Lund · D. W. Ussery (*) 


Technical University of Denmark, 

2800 Lyngby, Denmark 

e-mail: dave@cbs.dtu.dk 

Y. Motro · D. Dunn · M. Bellgard 

Center for Bioinformatics and Biological Computing, 

Murdoch University, 

Murdoch, Western Australia 6150, Australia 

T. La · D. J. Hampson 

School of Veterinary and Biomedical Sciences, 

Murdoch University, 

Murdoch, Western Australia 6150, Australia 

T. M. Wassenaar 

Molecular Microbiology and Genomics Consultants, 

Zotzenheim, Germany 

including pathogenicity islands and the structure of 

transposable elements. There are at least 20 different 

methods available to compare bacterial genomes. Metagenomics 

offers the chance to study genomic sequences 

found in ecosystems, including genomes of species that are 

difficult to culture. It has become clear that a genome 

sequence represents more than just a collection of gene 

sequences for an organism and that information concerning 

the environment and growth conditions for the organism 

are important for interpretation of the genomic data. The 

newly proposed Minimal Information about a Genome 

Sequence standard has been developed to obtain this 

information. 

Keywords Bacterial genomics · Comparative genomics · 

Bioinformatics · Genomic diversity · 

Molecular evolution 

Introduction 

The year 1995 marked the publication of two human 

pathogenic bacterial genome sequences: Haemophilus 

influenzae (Fleischmann et al. 1995, US patent number 

6,528,289) and Mycoplasma genitalium (Fraser et al. 

1995, US patent number 6,537,773). Since then, more than 

300 bacterial genomes have been fully sequenced and 

become publicly available, including the sequence of a 

virulent form of H. influenzae (Harrison et al. 2005); the 

original H. influenzae strain sequenced in 1995 was from 

an isolate that does not cause disease. Although the 

majority of these several hundred genomes are from 

pathogenic organisms, some environmental bacterial genome 

sequences have also become available. This review 

article will provide a brief overview of sequenced bacterial 

genomes, their genomic diversity and some of the insights 

gained from analysis of this vast amount of data. 

Bacteria are microscopic unicellular prokaryotes that 

inhabit a wide variety of environmental niches, broadly 

distributed in three ecosystems: the soil, marine environments 

and other living organisms. Although there are


literally millions of bacterial species, only a small proportion 

of these can be grown in the laboratory (Handelsman 

2004). Bacteria (and Archaea) can be found almost 

anywhere in the environment: in the air, even in the 

International Space Station (Novikova et al. 2006), in 

thermal ducts found at great depths in the oceans (Alain et 

al. 2002; Vezzi et al. 2005), in the intestinal tracts of 

animals (Yan and Polk 2004; Backhed et al. 2005) and in 

soil and rocks, even thousands of meters deep (Torsvik et 

al. 1990). Bacteria live within unicellular eukaryotes, 

algae, plants or animals. This diversity is reflected in their 

physiology, morphology, metabolism and ecosystems. For 

example, from a physiological perspective, most intestinal 

bacteria such as Escherichia coli are motile by means of 

flagella, to overcome the peristalsis of the gut, whilst the 

soil bacterium Clostridium perfringens does not possess 

such motility machinery (Shimizu et al. 2002). From a 

metabolic perspective, the versatile Burkholderia cepacia 

(formerly Pseudomonas cepacia) can utilise approximately 

100 different organic compounds as a sole energy source 

(Goldmann and Klinger 1986) compared to the strictly 

intracellular Mycobacterium tuberculosis which is dependent 

on only a few carbon sources produced by its 

involuntary host. From an inter-bacterial interaction 

perspective, sometimes bacteria cooperate. For example, 

Enterobacter cloacae and Pseudomonas mendocina positively 

interact to stimulate plant growth (Duponnois et al. 

1999). On the other hand, there are also bacteria which not 

only “do not cooperate” but exhibit predatory behavior, 

such as Bdellovibrio bacteriovorus (Rendulic et al. 2004). 

As for bacteria–host interactions, for a given bacterial 

species both pathogenic and non-pathogenic strains can 

exist (Dobrindt and Hacker 2001; Penyalver and Lopez 

1999), while other species may be exclusively parasitic 

(Goebel and Gross 2001), truly symbiotic (Gil et al. 2004) 

or commensal (Yan and Polk 2004) for their host. It is 

interesting to note that this diversity is somehow captured 

in the relatively small bacterial genomes. 

The first complete viral genome (φX174) was published 

in 1977 (Sanger et al. 1977). To put this into perspective, at the 

1977 rate of about a thousand base pairs (bp) per year, 

sequencing the 4.6-Mbp E. coli K-12 genome 

would have taken more than a thousand years, 

and sequencing the human genome would have taken 

more than a million years. The automation of 

sequencing methods, the invention of polymerase chain 

reaction (PCR) (Mullis et al. 1986) and the shotgun cloning 

procedure reduced costs and time, and provided the 

capability for large-scale sequencing. These developments 

together have led to the sequencing of the first complete 

bacterial genome (Fleischmann et al. 1995) almost 20 years 

after the sequencing of φX174. The choice of the first 

bacterium to be completely sequenced (H. influenzae Rd 

KW20) was based on the following reasons: (1) the 

genome size was thought to be ‘typical’ among bacteria 

(1.8 Mbp), (2) the G + C base composition was close to that 

of the human genome (38%) and (3) the bacterium had 

important human health implications. In the absence of 

procedures to produce a genetic map for the species, 

genome sequencing proved to be a powerful 

alternative for genetic characterisation. This landmark 

work initiated the influx of genome sequence data which 

is now updated frequently and is publicly available. As of 

November 2005, there are more than 300 fully sequenced, 

publicly available bacterial genomes. Figure 1 shows this 

increase of sequence data over the past decade. 1 
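The sequencing-time estimate given in this paragraph is easy to verify. The short Python calculation below uses the rate of about a thousand base pairs per year and the 4.6-Mbp E. coli K-12 genome size quoted in the text; the human genome size of roughly 3 billion base pairs is our own round-number assumption and is not stated in the review.

```python
RATE_BP_PER_YEAR = 1_000          # approximate 1977 throughput quoted in the text
E_COLI_K12_BP = 4_600_000         # 4.6 Mbp, as quoted in the text
HUMAN_GENOME_BP = 3_000_000_000   # assumption: roughly 3 Gbp


def years_to_sequence(genome_bp, rate_bp_per_year=RATE_BP_PER_YEAR):
    """Years needed to read a genome once at a constant rate."""
    return genome_bp / rate_bp_per_year


if __name__ == "__main__":
    print(years_to_sequence(E_COLI_K12_BP))    # 4600.0 years
    print(years_to_sequence(HUMAN_GENOME_BP))  # 3000000.0 years
```

Both figures agree with the statements "more than a thousand years" and "more than a million years".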

The total number of completed bacterial genome 

sequences has more than doubled over the past 2 years 

and, at the time of writing, there are 855 publicly listed 

bacterial and archaeal genome projects that are in various 

stages of progress. 2 In addition to new species, multiple 

strains of the same bacterial species are being sequenced. 

The amount of genomic data currently available has 

provided significant advances in our understanding of a 

number of important themes, including bacterial diversity, 

population characteristics, operon structure, mobile genetic 

elements (MGE) and horizontal gene transfer (HGT). It has 

also provided a number of challenges in understanding the 

ecology of, as yet, undiscovered bacterial worlds. The 

availability of whole genome sequences for pathogenic and 

commensal bacterial species has allowed a more detailed 

analysis of the complex interactions that occur with their 

plant or animal hosts. Figure 2a is a phylogenetic tree of 

300 sequenced bacterial genomes (available at the time of 

writing). Many of these genomes are from pathogenic 

bacteria living in complex ecosystems, such as the 

spirochaete Brachyspira pilosicoli labelled in red in the 

phylogenetic tree shown in Fig. 2b. This bacterium attaches 

to enterocytes to form a “false brush border” in the colon. 

Most genome sequencing projects are currently carried 

out using automated applications of the sequencing 

technique developed by Sanger et al. (1973), but newly 

developed methodologies may enable even more rapid 

sequencing in the future. Two papers have been published 

about two different methods for high-throughput sequencing 

of bacterial genomes (Pennisi 2005). One method is 

essentially a “do-it-yourself kit”, which uses a laser 

confocal microscope and other “off-the-shelf” components 

to build a sequencing machine capable of sequencing an E. 

coli genome in less than a day (Shendure et al. 2005). The 

second method is a commercial machine, based on 

pyrosequencing methodologies to generate many short 

pieces of DNA; this method was used to sequence a 

bacterial genome within a few hours (Margulies et al. 

2005). Although there are still some technical problems 

with both of these methods, it is clear that, in the near 

future, it will be possible to quickly sequence a bacterial 

genome at considerably lower cost.

1 Completed genome statistics obtained from the CBS atlas web 

pages http://www.cbs.dtu.dk/services/GenomeAtlas 

2 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj

Fig. 1 Cumulative number of 

complete published sequenced 

bacterial genomes (bars) and 

total number of basepairs (line) 

over the past decade 

(1995–2005) 

Genomic information 

DNA codes for more than just proteins 

The quality of annotation of bacterial genomes varies, 

although a survey based on three different methods to 

predict the expected number of genes in a genome has 

found that it is likely that, for most bacterial genomes, 

around 20% of the genes annotated might not be “real” 

(Skovgaard et al. 2001). Furthermore, proteomics experiments

have detected some “real” genes that were not originally

predicted, highlighting the dynamic nature of annotation

and the fact that genes are missed

(Jaffe et al. 2004). Over-annotation of bacterial genomes is

a problem but, unfortunately, this cannot be easily avoided. 

On the one hand, no one wants to miss a gene and, on the 

other hand, small genes can be quite difficult to predict, as a 

short open reading frame could easily occur by statistical 

chance (Skovgaard et al. 2001). 
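The point that a short open reading frame can arise by chance can be made concrete with a rough calculation. The sketch below is an illustration only and is not the method of Skovgaard et al. (2001): it assumes a random sequence with independent bases, equal A and T frequencies, equal G and C frequencies, and the three standard stop codons (TAA, TAG, TGA). The function names are hypothetical.

```python
def stop_codon_probability(gc_fraction: float) -> float:
    """Chance that one random codon is TAA, TAG or TGA (assumed base model)."""
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    p_at = (1.0 - gc_fraction) / 2.0  # probability of A, and also of T
    p_gc = gc_fraction / 2.0          # probability of G, and also of C
    # P(TAA) + P(TAG) + P(TGA)
    return p_at * p_at * p_at + p_at * p_at * p_gc + p_at * p_gc * p_at


def chance_of_open_frame(n_codons: int, gc_fraction: float = 0.5) -> float:
    """Probability that n_codons consecutive random codons contain no stop codon."""
    return (1.0 - stop_codon_probability(gc_fraction)) ** n_codons
```

Under these assumptions, a stretch of 100 codons remains open with a probability of a little under 1% at 50% GC, and the probability rises with GC content because all three stop codons are AT-rich. Short frames are therefore much harder to distinguish from chance than long ones.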

Several automated annotation systems are currently available;

the BaSys system (Van Domselaar et al. 2005), for example,

provides a comprehensive annotation of a DNA sequence 

file. To conduct comparative genomics with several 

hundred genomes, quality databases are essential and the 

“GenomeAtlas” database, which was originally developed 

to store DNA structural information about the various 

sequenced genomes, is one example (Hallin and Ussery 

2004). Approximately a hundred different features for each 

genome (such as percent AT, coding skew bias, length of 

genome and number of genes) are currently made available 

through http://www.cbs.dtu.dk/services/GenomeAtlas/. 

Duplication of essentials 

One of the features of genomic sequences that can be easily 

recognised is the presence of repeat sequences. The most 

obvious and extensive repeats present in many bacterial 

genomes are the operons encoding the ribosomal RNA 

genes. These rRNA operons typically encode 16S and 23S 

rRNA separated by a short spacer, often followed by the 5S 

rRNA gene. All sequenced bacterial genomes possess at 

least one rRNA operon, and many (215 of 300) have two or 

more copies; the number of operons tends to correlate with 

bacterial division time. Thus, species that divide quickly 

(such as Bacillus cereus) have more copies of rRNA genes, 

so as to enable rapid production of ribosomes. In addition, 

species containing multiple rRNA operons appear to be 

more adaptable to changing environmental conditions 

(Acinas et al. 2004). The rRNA genes are a valuable tool 

for the estimation of taxonomic relationships (see Fig. 2a).

These genes evolve slowly, presumably because they play 

an essential role as the backbone of ribosomes while 

interacting with multiple proteins. Any changes in the 

shape (sequence) of rRNA would most likely be fatal. 

Multiple copies per genome of tRNA genes can also be 

found in some genomes, again tending to correlate with 

division time. However, for tRNAs, the duplication 

number is also dictated by the frequency with which 

particular codons are used (or vice versa, as cause and 

effect cannot be distinguished here). This enables a less 

obvious level of regulating gene activity: a gene using 

many codons for which only one tRNA gene is available 

will probably be translated at a limited rate, whereas

abundant proteins are more likely to use tRNAs for which 

multiple gene copies are available. This is the basis for the 

codon adaptation index, which is a measure of the adaptation

of a gene’s codon usage towards the optimal tRNA pool 

(Sharp and Li 1987). 
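As an illustration of the index just mentioned, the sketch below computes a codon adaptation index as the geometric mean of per-codon relative adaptiveness values, following the general definition of Sharp and Li (1987). Several details are assumptions made for this example rather than anything stated in the text: a user-supplied reference set of highly expressed genes stands in for the optimal tRNA pool, the standard genetic code is used, codons for Met, Trp and stop are ignored, and codons never seen in the reference set are given a count of 0.5. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    """Split a coding sequence into complete codons (a trailing partial codon is dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(
        codon for gene in reference_genes for codon in split_codons(gene)
        if codon in CODON_TABLE
    )
    weights = {}
    for aa in set(AMINO_ACIDS) - set("MW*"):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for codon in family:
            # Assumed convention: unobserved codons count as 0.5.
            weights[codon] = max(counts[codon], 0.5) / top
    return weights


def codon_adaptation_index(gene, weights):
    """Geometric mean of the weights of the scorable codons in a gene."""
    logs = [math.log(weights[c]) for c in split_codons(gene) if c in weights]
    if not logs:
        raise ValueError("gene contains no scorable codons")
    return math.exp(sum(logs) / len(logs))
```

A gene that uses only the most frequent synonymous codon of the reference set scores 1.0; any use of rarer codons lowers the value towards 0.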

There are of course other duplications in bacterial 

genomes, some of which might appear at first glance to be 

less essential. For example, the ‘REP’ repetitive sequences 

frequently found in Enterobacteriaceae can be used as

unique identifiers of bacterial genomes (Tobes and Ramos 

2005). It has been speculated that these repeats are 

meaningless, resulting from errors in replication, or that

they may be a part of mobile elements that are able to
translocate and duplicate themselves. These could alternatively
be non-functional ‘molecular fossils’ of previous
insertion events. Finally, it could well be that these repeats
serve some as yet undiscovered useful purpose. It is
possible, for example, that repetitive sequences and
insertion sequence elements (ISs) contribute to genome
plasticity through structural changes based on homologous
recombination (Kennedy et al. 2001; Fraser-Liggett 2005).

Fig. 2 a Phylogenetic tree of 287 sequenced bacterial genomes,
based on alignments from the 16S rRNA gene sequence. The phyla
are colour-coded; a more detailed view, with names of all the
organisms can be found in the supplemental information:
http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/FIG10yr/.
b Photomicrograph showing a dense fringe of anaerobic spirochaetes
(B. pilosicoli) attached by one cell end to the luminal surface of
human colonic enterocytes, forming a “false brush border”. Besides
that of humans, B. pilosicoli colonises the large intestine of a
variety of mammals and birds, causing diarrhoea and reduced growth
rates. Genomic sequence from B. pilosicoli is being analysed to
assist in understanding the genetic basis of this dense colonisation,
including patterns of gene expression underlying the complex
interactions that occur between individual bacterial cells and the
colonised enterocytes. The photograph is courtesy of Dr. W. Bastiaan
DeBoer, University of Western Australia, Perth, Western Australia

A brief history of bacterial operons 

Much of the early classical work in microbiology has been 

done with E. coli, as this bacterium is relatively easy to 

culture in the laboratory. As more and more genetic 

information was gathered, it was considered a ‘typical’ 

bacterium, although E. coli is no more typical of bacteria

than a rabbit is of all eukaryotic organisms. More than

40 years ago, a model was proposed for gene regulation of 

the catabolism of lactose in E. coli (Jacob et al. 1960; Jacob 

and Monod 1961). The model described an operon as a 

cluster of genes with related functions (encoding, in this 

case, enzymes required for lactose degradation). This 

operon structure neatly allows regulation of gene expression 

by the concentration of lactose (Lewis et al. 1996; 

Reznikoff 1992). With the continuous expression of one 

small protein (a repressor), wasteful expression of several 

other catabolic enzymes in the absence of lactose is 

prevented. 

Since the discovery of the lac operon, many more 

catabolic operons have been discovered, with positive and 

negative feedback strategies, and these illustrate the 

biological need to use resources as efficiently as possible. 

Many, if not all, bacterial genomes indeed display clusters 

of genes involved in a single process (be it jointly

transcribed and regulated, as in classical operons, or with 

separate promoters and regulators), but the degree of 

operon gene organisation and gene clustering differs 

between species. In some bacteria, such as in Helicobacter 

pylori, operons are relatively unconserved, and genes 

involved in one cellular process can be dispersed 

throughout the genome (Tomb et al. 1997; Alm and Trust 

1999), although more recent work suggests that perhaps

there are more operons in H. pylori than previously thought 

(Price et al. 2005). There are currently many resources for 

prediction of operons (Rogozin et al. 2004; Rosenfeld et al. 

2004; Alm et al. 2005; Janga et al. 2005; Nishi et al. 2005; 

Price et al. 2005; Vallenet et al. 2006), including several 

databases, such as the Operon Database (Okuda et al. 

2006), RegulonDB (Salgado et al. 2006a,b) and GeneChords

(Zheng et al. 2005).

How did the first operon evolve? There have been 

historically three models proposed for the origins of gene 

clusters. The first model, which dates back to 1945, 

proposed the clustering of genes to be the direct result of 

gene duplication and evolution (Horowitz 1945, 1965). 

Gene duplication can occur during replication and, as a 

duplicated gene has more freedom to mutate, this is 

believed to be a classical mechanism for novel enzymes to 

evolve (Lazcano et al. 1995). However, although all genes 

within an operon may be involved in a single metabolic 

process, their function and structure can vary considerably, 

and a phylogenetic relationship between them is not always

likely. 

The second model proposed for the evolution of operons 

is that coregulation of genes under a common promoter 

could provide selective advantage (Jacob et al. 1960). 

However, we now know that, in fact, it is possible to have 

coregulation of genes that are not physically linked 

together. Furthermore, this model does not really provide 

a gradual step-by-step mechanism for the evolution of 

operons. 

The third model for the evolution of an operon is that 

pre-existing genes moved together due to selective 

advantages of having genes involved in the same 

biochemical pathways or processes being physically 

close to each other. This hypothesis allows for structurally 

distinct genes to be part of one operon. This model requires 

both variation and frequent recombination and has been 

proposed as an explanation of clustering of genes in 

bacteriophage genomes (Stahl and Murray 1966; Juhala et 

al. 2000). 

In addition to these three views, there are other 

alternatives. Gene clustering may be of selective advantage 

in the case of horizontal gene transfer (see section below) 

and, based on this idea, a fourth mechanism, the ‘selfish

operon’ model, was proposed (Lawrence and Roth 1996).

This view has been recently called into question, based on 

the physical clustering of essential genes in the E. coli K-12 

genome (Pal and Hurst 2004). Two other alternatives for 

operon evolution deal with chromatin structure and the 

physical location of genes in bacterial chromosomes, where 

transcription and translation are coupled (Pal and Hurst 

2004). It is quite possible that, in fact, there is no one 

“correct” mechanism, but perhaps different mechanisms are 

involved at the same time. For example, the selective 

advantage of gene clustering during horizontal gene transfer 

is exemplified by the clustering of multiple antibiotic

resistance genes on mobile genetic elements (Carattoli 

2001). In the era of antibiotic use, such genes are under 

strong selective pressure and are frequently passed on 

between bacteria by means of mobile elements. Whether 

these have directly contributed to the spread of catabolic and 

other operons between bacterial species is currently not 

known. 

What separates genes in a genome? 

In comparison to genes, the non-coding part of genomes 

receives far less attention. Some genomes are more

densely packed than others. The average coding

density is about 90%, ranging from 95% for Pelagibacter 

ubique (Giovannoni et al. 2005) to 51% for Sodalis 

glossinidius (Toh et al. 2006). Bacterial genes are not 

spliced as they are in eukaryotes; that is, introns are absent 

from nearly all bacterial genes. The sequences separating 

genes (intergenic regions) can be thought of as spacers 

where information on regulation of transcription can be 

stored, although sometimes these intergenic regions can 

also be more than regulatory and spacer domains. 
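Coding density, as used above, is the fraction of the genome covered by annotated genes. A minimal sketch of the calculation follows. It assumes genes are given as 1-based, inclusive (start, end) coordinates on a single replicon and that no gene wraps around the origin; overlapping genes are merged so that shared bases are counted once. The function name and input format are hypothetical.

```python
def coding_density(genome_length, genes):
    """Fraction of genome positions covered by at least one gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length
```

For example, three genes at positions 1 to 40, 31 to 60 and 71 to 90 on a 100 bp sequence cover 80 positions, giving a coding density of 0.8.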

Intergenic regions in the E. coli K-12 chromosome have 

been suggested to contain the sequences for several 

hundreds of small RNA genes which are transcribed but do 

not code for proteins (Chen et al. 2002). Many of these
small RNAs act as regulators (Gottesman 2005).

In general, the intergenic regions of bacterial genomes
are more AT-rich, will melt more readily, are more curved
and are more rigid than the chromosomal average
(Pedersen et al. 2000; Hallin and Ussery 2004). This is
true for nearly all of the several hundreds of bacterial
genomes sequenced, regardless of AT content. These
characteristics make sense in terms of mechanical properties
needed for initiating transcription.

Generation of genomic diversity in bacteria

Genomic diversity is far greater than expected

The view in many textbooks of biological diversity and
evolution often envisions clonal bacteria which slowly
evolve through the gradual accumulation of single-nucleotide
changes. There might occasionally be a rare event
where a new gene is duplicated but, in general, it has been
commonly thought that if one were to sequence two
different strains of a common bacterium like E. coli, the
sequences would, for the most part, be similar and the two
strains would share most (perhaps 90% or more) of their
genes. At the time of writing, there are 20 different E. coli
genomes which have been either completely sequenced or
at least with an expected coverage of greater than 99% of
the genome. Table 1 lists these genomes, and one of the
surprising observations is the diversity just in size of the
main chromosome, ranging from 5.5 to 4.6 Mbp (that is,
close to a million base pairs present in some E. coli strains
which are missing in others). Furthermore, if one were to
pick any one of these 20 strains, there would be more than a
hundred genes which are unique to that strain and are not
found in the other 19 E. coli genomes. Studies have
indicated that much of this diversity comes from
bacteriophages (Ohnishi et al. 2001).

Table 1 Current E. coli genomes sequenced or in progress

Strain            Length (bp)  Number of genes  Number of tRNAs  Number of rRNAs  Number of contigs  Accession number
O157_EDL93        5,528,445    5,349            100              7                1                  AE005174
E22               5,516,16     4,788            NA               NA               109                AAJV00000000
O157_RIMD0509952  5,498,450    5,361            103              7                1                  BA000007
E110019           5,384,084    4,746            NA               NA               119                AAJW00000000
B171              5,299,753    4,467            NA               NA               159                AAJX00000000
53638             5,289,471    4,783            NA               NA               119                AAKB00000000
042               5,241,977    4,899            93               7                2                  Sanger Institute (unpublished)
CFT073            5,231,428    5,379            89               7                1                  AE014075
H10407            ~5,208,000   ~5,000           NA               NA               225                Sanger Institute (unpublished)
F11               5,206,906    4,467            NA               NA               88                 AAJU00000000
B7A               5,202,558    4,637            NA               NA               198                AAJT00000000
NMEC RS218        5,089,235    ~4,900           NA               NA               1                  Uni. Wisc. (unpublished)
E2348             5,072,200    4,594            71               7                4                  Sanger Institute (unpublished)
E24377A           4,980,187    4,254            97               6                1                  AAJZ00000000
UPEC 536          ~4,900,000   ~4800            NA               NA               1                  Uni. Würzburg (unpublished)
101NA1            4,880,380    4,238            NA               NA               70                 AAMK00000000
HS                4,643,538    3,689            89               6                1                  AAJY00000000
K-12_W3110        4,641,433    4,390            88               7                1                  AP009048
K-12_MG1655       4,639,675    4,254            88               7                1                  U00096
B03               4,629,810    4,387            86               6                1                  CNRS France (unpublished)

NA Currently not annotated

Gene order conservation 

When comparing bacterial genomes, two features are 

frequently analysed: gene presence and gene order. The 

presence or absence of genes is particularly interesting 

when two closely related species or strains that have 

different phenotypes, such as a pathogenic and a commensal 

strain of the same species, are compared (Hayashi et al. 

2001). As for the actual process leading to the difference, 

the direction of the insertion/deletion event is not always 

clear; the neutral term indel (INsertion/DELetion) is

therefore generally used.
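Once the genes of each strain have been grouped into families, a presence/absence comparison reduces to set operations. The sketch below assumes that each strain is represented by a set of gene family identifiers; how families are defined (for example, by a sequence similarity threshold) is outside its scope, and the function name and the identifiers in the usage note are made up for illustration.

```python
def shared_and_unique(strains):
    """Return (families present in every strain, families unique to each strain).

    strains: dict mapping strain name -> set of gene family identifiers.
    """
    if not strains:
        raise ValueError("at least one strain is required")
    names = list(strains)
    shared = set.intersection(*(set(strains[n]) for n in names))
    unique = {}
    for name in names:
        # Families seen in any other strain.
        elsewhere = set().union(*(strains[o] for o in names if o != name))
        unique[name] = set(strains[name]) - elsewhere
    return shared, unique
```

With `{"A": {"f1", "f2", "f3"}, "B": {"f1", "f2", "f4"}, "C": {"f1", "f5"}}`, the shared set is `{"f1"}` and each strain has exactly one unique family. The comparison says nothing about whether a difference arose by insertion or by deletion.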

There are various models of how the gene order within
operons may have changed throughout evolution. It may be
that the gene order in ancient ancestral operons has been
maintained, such that all (or many) of the operons in
studied genomes would be expected to have a similar gene
structure. However, this view has been contradicted by data
from whole genome studies. Examining the stability of
operon structures over evolutionary distance shows that the
majority of the gene orders within operons could be
shuffled frequently during evolution, with the ribosomal
protein operons as an exception (Itoh et al. 1999). Such
observations support the alternative possibility that operons
are multiple evolutionary inventions. A more recent
study has examined the evolution of the histidine operon in
Proteobacteria and found evidence for indeed a gradual
merging of genes with similar function into operons, at
least in this case (Fani et al. 2005).

Comparisons of gene order can also be informative of
chromosomal translocations and inversions, which frequently
happen in bacterial genomes (Kuwahara et al.
2004). Such events are mostly neutral in terms of
evolution, as they do not change the total genetic content
of the cell, but translocations and inversions frequently
coincide with insertions or deletions. Any of these
processes can result from inaccurate excision of mobile
genetic elements and, as such elements are frequently
involved in generating diversity in bacteria, they deserve to
be treated in a separate section.

Table 2 Types of mobile genetic elements found in bacterial genomes

MGE: Plasmids
Description: Circular, self-replicating DNA molecules that exist in
  cells as extra-chromosomal replicons. Some plasmids can insert into
  the chromosome.
References: (Dobrindt et al. 2004)

MGE: Transposons
Description: DNA molecules that frequently change their chromosomal
  localisation, either within or between replicons. They usually code
  for a transposase and some other genes (such as antibiotic resistance
  genes), and are flanked by inverted repeat DNA sequences.
References: -

MGE: Conjugative transposons
Description: Transposons that also carry genes related to
  plasmid-encoded conjugation, thus, providing the ability to transfer
  between cells via conjugation
References: (Dobrindt et al. 2004)

MGE: Bacteriophages
Description: Prokaryote-infecting viruses, which can modify the host
  genome by coding new functions or by modifying existing functions.
  They are also capable of inserting into the genome (prophages). These
  are also agents of HGT.
References: -

MGE: Integrons
Description: Genetic elements composed of a gene encoding an integrase
  (int gene; excises and integrates the gene cassettes from and into the
  integron), gene cassettes (become part of the integron upon
  integration; consist of a promoterless gene and a recombination site
  termed attC) and an integration site for the gene cassettes (attI
  gene)
References: (Fluit and Schmitz 2004; Holmes et al. 2003; Peters et al. 2001)

MGE: Insertion sequence elements
Description: Small, genetically compact DNA sequences, generally less
  than 2.5 kbp in length, encoding functions involved in their
  translocation, and transpose both within and between genomes. IS
  elements are a subset of a general group of elements named
  transposable elements. These transposable elements are defined as
  elements of DNA segments that carry the genes required for this
  process (and, in some cases, other genes), and consequently move about
  chromosomes and, more generally, genomes.
References: (Mahillon et al. 1999; Ou et al. 2006)

MGE: Genomic islands
Description: Large chromosomal regions that contain a cluster of
  functionally related genes, an operon or a number of operons, flanked
  by direct repeat sequences, and located near an integrase or
  transposase gene and a tRNA gene.
References: (Dobrindt et al. 2004)
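The gene order comparisons discussed in this section can be sketched as a count of conserved gene adjacencies. The example below is an illustration under simplifying assumptions, not a method taken from the studies cited: each genome is an ordered list of gene family identifiers, every identifier occurs at most once per genome, gene orientation is ignored, and the chromosome is treated as circular by default. The function names are hypothetical.

```python
def adjacent_pairs(gene_order, circular=True):
    """Set of unordered neighbouring gene pairs along a chromosome."""
    n = len(gene_order)
    if n < 2:
        return set()
    last = n if circular else n - 1  # a circular genome joins the last gene to the first
    return {
        frozenset((gene_order[i], gene_order[(i + 1) % n]))
        for i in range(last)
    }


def conserved_adjacency(order_a, order_b, circular=True):
    """Fraction of neighbouring pairs in genome A that are also neighbours in genome B."""
    pairs_a = adjacent_pairs(order_a, circular)
    pairs_b = adjacent_pairs(order_b, circular)
    if not pairs_a:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a)
```

Comparing `["a", "b", "c", "d", "e"]` with `["a", "d", "c", "b", "e"]`, in which the segment b-c-d is inverted, gives 0.6: the inversion leaves gene content unchanged but breaks the two adjacencies at its end points.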

Mobile genetic elements 

MGEs are genomic elements that are capable of translocating 

themselves within or between genomes. When moving 

to a new genome, they may confer a new characteristic on 

the recipient. Their size ranges from hundreds of base pairs 

to more than 100 kbp. Plasmids, transposons, conjugative 

transposons, bacteriophages, integrons, insertion sequence 

elements and genomic islands (GEIs) are all considered 

MGEs (Table 2). Bacteriophages are the most sophisticated, 

as they produce their own protein coat to protect the 

genetic material (which can be DNA or RNA). Conjugative 

transposons induce conjugation between cells, a process in 

which cellular membranes merge to produce a bridge 

through which the transposon can move. Some plasmids 

can also induce conjugation (a transposon always encodes 

transposase whereas a conjugative plasmid replicates 

without integration in the chromosome). Some of the 

definitions for the various MGEs partly overlap, as indeed 

these terms are flexible. For instance, transposons can 

integrate in plasmids, and bacteriophages may contain 

insertion sequence elements (Burrus and Waldor 2004). 

MGEs constitute potentially foreign DNA located in a 

conceptual ‘flexible’ gene pool, from where ‘donated’ 

DNA is made available for recipient cells. Once the MGE 

is transferred into the recipient cell, the DNA will either 

insert into a region on the chromosome or it will start to

use its own replication machinery. If the MGE is

integrated into the genome, for example, like a pathogenicity 

island (PAI), the genes (or operon) will start to be 

expressed, thus adding a new characteristic to the cell. The 

MGE may later initiate ‘donation’ of DNA either to a new

recipient (for which the trigger is as yet unknown) or to the

flexible gene pool, perhaps taking with it a ‘new’ or 

additional gene or function. The integrated MGE may also 

become immobile as a result of chromosomal re-arrangements, 

duplications or sequence insertions/deletions. In the 

case of such rendered immobility, the integrated MGE 

becomes a permanent genomic element or genomic island. 

At a later stage, the genomic island may be modified and 

rendered mobile again, making it available for transfer to 

the flexible gene pool once again. 

As the subject of all MGEs listed in Table 2 would

suffice for a review paper on its own, this review focuses on

two, namely, insertion sequence elements and GEIs. These 

two MGEs are of particular interest because our knowledge 

of them has improved dramatically as a direct result of 

genome sequence availability and due also to their impact 

on the diversity of bacteria. 

Insertion sequence elements 

IS elements are small DNA sequences, generally less than 

2.5 kb in length, that encode functions involved in their own

translocation and can transpose both within and between

genomes (Mahillon et al. 1999). IS elements were 

originally described as a subset of transposable elements 

(Prescott et al. 1999). IS elements are the simplest form of 

MGE and a key component of a majority of the more 

complex transposable elements, found both in bacterial and 

eukaryotic genomes. A number of reviews deal with IS 

elements in greater depth (van Belkum et al. 1998; 

Mahillon et al. 1999; Galun 2003). 

An IS contains a transposase gene, flanked by terminal 

inverted repeats (the sequence of one flank is encoded on 

the opposite strand of the other flank). One of these repeats 

classically contains the promoter for the transposase gene 

(Fig. 3; Galun 2003). The IS elements are also flanked by 

short, directly repeated sequences, which are generated in 

the recipient DNA as a result of insertion. 
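The statement that one flank is encoded on the opposite strand of the other can be checked directly: the left end of the element should read the same as the reverse complement of the right end. The sketch below measures the length of a perfect terminal inverted repeat. It is a simplification, since it assumes an unambiguous DNA alphabet and does not allow the mismatches that real inverted repeats may contain; the function names and the `max_len` cap are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element, max_len=50):
    """Length of the longest perfect terminal inverted repeat of an element."""
    element = element.upper()
    flipped = reverse_complement(element)
    # The repeats cannot extend past the middle of the element.
    limit = min(max_len, len(element) // 2)
    length = 0
    while length < limit and element[length] == flipped[length]:
        length += 1
    return length
```

For an element built as `"GGTACTG" + "AAAAAAAAAA" + reverse_complement("GGTACTG")` the function returns 7, the length of each flank.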

The activity of transposable elements in genomes was 

first noted by McClintock (1950) in maize, although at that 

time the mechanism behind the observed genetic changes 

was not understood. Starlinger and Saedler (1976) provided 

the first review of IS elements in bacterial genomes. As 

noted by Lupski and Weinstock (1992), the first ISs were 

classified before their function, origin and dispersion 

mechanisms were understood. The present genomic era 

has resulted in advances in their classification, understanding 

of mechanisms of dispersion and identification of 

their role in evolution (van Belkum et al. 1998; Mahillon et 

al. 1999). Although the classical ISs are considered to be 

evolutionarily neutral, as each can only translocate its own

transposase, they are the means by which genomic islands

(for example PAIs and metabolic islands) are transferred, 

and they also play a role in plasmid integration (Rocha et al. 

1999). Variation in the excision of ISs promotes genome 

rearrangements (including deletions, inversions and replicon 

fusions; Mahillon et al. 1999). Antibiotic resistance 

genes are frequently spread within bacterial populations 

with the aid of ISs, which gives these simple elements 

clinical relevance. Finally, in special cases, IS elements can 

indirectly cause antigenic variation, a process in which a 

gene is switched off and on in a reversible manner within a 

bacterial population (Talarico et al. 2005). IS sequences that 

are present in the first part of a gene can cause slippage
during replication, as DNA polymerase has difficulties with
correct replication of short multiple repeats. The result can
be a frame shift with consequential inactivation, but the
next frame shift can restore gene function. Such slippage
can also vary the distance and, thus, activity of a promoter
and its gene. Examples involving genes with a role in
pathogenicity, with antigenic variation of surface exposed
proteins, and environmental adaptation have been described
(van Belkum et al. 1998; Rocha et al. 1999).

Fig. 3 Organisation of a typical insertion sequence. The IS is
represented as an open box in which the terminal inverted repeats are
shown as blue boxes labelled IRL (left IR) and IRR (right IR). An
open reading frame encoding the transposase (grey box) is located in
the IS. WXY boxes flanking the IS represent short directly repeated
sequences generated in the target DNA as a consequence of
insertion. The transposase promoter is localised in IRL

Monitoring of these elements has provided insights into 

bacterial genome molecular processes and the nature of IS 

elements. For example, understanding the regulatory 

mechanisms of IS elements has provided insights into the 

importance of the compromises adopted by IS elements 

(and MGEs, in general) between maintaining a stable host genome

and endangering the survival of the host through too much

transposition activity (Nagy and Chandler 2004). It has 

also been suggested that IS expansion occurs during an 

evolutionary bottleneck, which reduces effective population 

size and the degree of intraspecies competition 

(Parkhill et al. 2003). 

Genomic islands 

GEIs, also referred to as integrative and conjugative 

elements or ICElands (van der Meer and Sentchilo 2003), 

are large chromosomal regions that cluster functionally 

related genes, are flanked by direct repeat sequences and 

are located near an integrase or transposase gene and often 

also near a tRNA. Furthermore, GEIs must have a GC 

composition different from the rest of the genome. GEIs 

include pathogenicity islands, symbiosis islands (SYIs), 

metabolic islands (MEIs), antibiotic resistance islands 

(REIs) and secretion system islands (SEIs) (Zhang and 

Zhang 2004). This remarkable variety of GEIs demonstrates 

the power of horizontal gene transfer, as they are 

believed to be the result of interspecies DNA transfer. With 

multiple genes neatly clustered in functional groups 

including all necessary regulatory and secretory genes, 

the power of transferring such ‘adaptive genetic bombs’ 

can be easily imagined. 
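The GC composition criterion mentioned above lends itself to a simple sliding-window scan. The sketch below flags windows whose GC fraction deviates from the genome-wide mean by more than a chosen threshold. The window size, step and threshold are arbitrary illustrative values, the function names are hypothetical, and a deviating window is only a candidate region: the other features named in the text (flanking direct repeats, an integrase or transposase gene, a nearby tRNA gene) are not checked here.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(genome, window=5000, step=1000, threshold=0.05):
    """Return (start, gc) for windows deviating from the genome mean GC.

    start is a 0-based offset; threshold is an absolute difference in GC fraction.
    """
    mean_gc = gc_fraction(genome)
    hits = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - mean_gc) > threshold:
            hits.append((start, gc))
    return hits
```

On a toy sequence such as `"AT" * 50 + "GC" * 10 + "AT" * 50`, scanned with `window=20, step=20, threshold=0.5`, only the window starting at offset 100 is reported.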

Genome sequences have revealed that GEIs are common 

in bacteria as a result of successful horizontal transfers of 

DNA from a donor genome to a recipient genome. In most
cases, the nature of the donor is unfortunately unknown.
Even when an identified GEI bears a high resemblance to a
section of another sequenced organism, one should not
assume (though frequently this mistake has been made)
that the GEI was directly received from that other
organism. The transfer could well have involved a third
unidentified species, serving either as an intermediate
between the first two or as the donor for the others. These
possibilities are frequently not recognised, as people can be
misled by the available genome sequences and are not
sufficiently aware of all those bacterial genomes for which
we are currently lacking sequence information.

The discovery of abundant genomic islands is strengthening
the concept of a bacterial genome being quite
dynamic and consisting of a backbone genome supplemented
with adaptive genome modules, which may or may
not be present in a given strain of the species
(Fraser-Liggett 2005). All modules available to the species (but
never all present in one strain) would comprise the gene
pool of that organism. This concept clearly does not apply
to strictly clonal species, in which case all isolates or strains
closely resemble each other (as is the case, for instance,
with Bacillus anthracis), but it better describes the situation
for frequently observed highly diverse species, such as E.
coli or Streptomyces. Nevertheless, the timescale at which
these events take place should not be ignored. Genomes are
the sum of thousands of years of evolution. Observations of
evolutionary events taking place in ‘real time’ are still
relatively seldom.

Pathogenicity islands

PAIs are now considered a subtype of genomic islands but
were among the earliest islands to be described. PAIs
harbour pathogenicity-related genes, thus potentially conferring
a pathogenic phenotype on a recipient genome.
Figure 4 illustrates a generalised model of a PAI. As with
other GEIs, PAIs are commonly inserted into tRNA genes,
which may be preferred sites of insertion due to their
relative conservation and redundancy (Dobrindt et al.
2004). PAIs are flanked by direct repeat sequences
allowing for insertion into the recipient DNA and contain
an integrase gene that enables the integration into the
recipient DNA. A feature observed for many PAIs (and
originally included in their definition although not always
present) is the presence of a type III secretion system, a set
of genes building an apparatus to specifically inject
virulence factors into the host cell (Jores et al. 2004).
Numerous investigations have identified and analysed PAIs
(McGillivary et al. 2005; Middendorf et al. 2004; Paulsen
et al. 2003; Schneider et al. 2004; Zubrzycki 2004; Schmidt
and Hensel 2004).

Fig. 4 Generalised diagrammatic representation of a pathogenicity
island. Commonly inserted into a tRNA gene sequence, flanked by
direct repeat sequences, containing an integrase (int) gene,
commonly containing insertion sequence elements, and harbouring
functional genes (with virulence associated properties), which may
be organised into an operon structure. Sometimes, a type III
secretion system is also present

Horizontal gene transfer and restriction modification 

systems 

Evidence of HGT (also referred to as lateral gene transfer,

LGT) dates back more than 30 years (Falkow 1975), with

the finding of transposable elements. Although such events 

were considered only exceptional cases at that time, it is 

now evident that HGT events can make a substantial 

contribution to the generation of genetic diversity. As with 

all other features, the degree of horizontal transfer varies 

amongst species. Ochman et al. (2000) assessed 19 

completely sequenced bacterial genomes and reported 

that the proportion of foreign proteins varies from 0%

(Mycoplasma genitalium) to about 17% (Synechocystis 

spp). These findings were supported by others including 

Dufraigne et al. (2005). Ortutay et al. (2003) undertook a 

genomic-scale phylogenetic analysis of protein-encoding 

genes from five closely related Chlamydia spp and 

identified a set of sequences that have arisen via HGT since

the divergence of the Chlamydia lineage. These data

illustrate the significant role of HGT in the evolution of 

particular bacterial species. It is not surprising that obligate 

intracellular pathogens show less evidence of recent HGT: 

they will not easily encounter other bacterial species with 

which to share DNA. 

Doolittle (1999a) listed three observations that can only 

be explained by HGT. The first observation is that 

phylogenetic trees based on individual protein-coding 

genes frequently differ substantially from the rRNA tree 

or from each other. The second observation comes from 

analysis, within a genome, of variation in G + C content, 

codon usage and gene order. The third observation is a 

result of between-genome comparisons, which show that 

all genomes contain particular genes that are more similar 

to homologues in distant genomes than to homologues in 

closer relatives or indeed that are absent from all known 

genomes of closer relatives. Combining this evidence,

Doolittle (1999b) proposed an alternative to the tree of life 

to describe the evolutionary history of living organisms. 

His model of a web-like structure takes into account the 

influence of HGT, where interactions occur between 

ancestral organisms and descendants (branches) as well 

as between branches. A similar concept of a biological 

network has been further explored by Kunin et al. (2005). 

Such a concept is difficult to work with, and currently 

many microbiologists still accept a tree-like phylogenetic 

relationship, at least for an artificial ‘backbone’ of the 

species. Independent of the source (strain or species) of the 

genes, phylogenetic trees can indeed be correctly produced 

for many genes and gene families and may describe 

evolutionary relationships that do not date back very far. 

Going back further in time, the vertical lineages become 

weaker and the phylogenetic trees are less meaningful. The 

paradoxical conclusion is that, by elucidating more of the

evolutionary history of bacteria, their history has become 

less clear. 
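The second observation above (within-genome variation in G + C content) can be illustrated with a simple sliding-window scan. This is only a minimal sketch: the window size, step and deviation cut-off are arbitrary assumptions, not values taken from the studies cited, and real analyses would also consider codon usage and gene order.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, gc) for windows whose G+C fraction deviates from the
    genome-wide value by more than `threshold` (an arbitrary cut-off)."""
    overall = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - overall) > threshold:
            flagged.append((start, gc))
    return flagged
```

Windows flagged in this way are only candidates for horizontally acquired DNA; an atypical composition can have other causes.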

If it is really true that horizontal gene transfer is so 

general, how is it still possible to recognise bacterial 

species? First, HGT is not so frequent that it can be easily 

observed as DNA exchange in ‘real time’ (other than the 

uptake of plasmids, spread of antibiotic resistance genes or 

transfection of phages). Evidence for past HGT events can 

be seen in many bacterial genomes and exemplifies its 

importance in evolution but, without a time scale, the 

frequency of such events cannot be estimated. Second, 

there are barriers that restrict HGT. It is obvious that not all 

bacteria share the same gene pool and only bacteria that 

share an ecological niche are likely to encounter and share 

each other’s DNA. Even under circumstances that favour 

DNA exchange, internal factors restrict the success of 

HGT, notably bacteriophage specificity, plasmid incompatibility, 

and the activity of restriction modification (RM) 

systems. Finally, not all putatively HGT genes from E. coli 

are actually translated into proteins, perhaps because of 

incompatibility of translational machinery (Taoka et al.

2004). 

The discovery of restriction enzymes which could cleave 

specific DNA sequences provided the basis for driving the 

“biotechnology revolution” in the 1970s. RM systems are 

popular in molecular genetics and are routinely used by 

most molecular biology laboratories throughout the world. 

The RM systems encode a modification enzyme that 

chemically modifies a specific short DNA sequence and a 

restriction endonuclease that will digest the DNA at that 

same specific recognition sequence unless the sequence has 

been modified (usually by methylation). Bacterial species 

(and frequently strains within a species) all have their own 

combination of RM systems (Roberts et al. 2005). 

Incoming DNA with a different modification pattern will 

be recognised by the endonuclease of the recipient strain, 

and the fate of such DNA is to be degraded. This is seen as 

a serious restriction for the spread of DNA through 

populations unless their RM systems are compatible. 
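The logic of an RM system described above (cut at the recognition sequence unless that site has been modified) can be sketched as follows. This is a simplified, single-strand illustration: the function names, the `protected` set of modified site positions and the cut offset are assumptions made for the example, not part of any published tool.

```python
def restriction_sites(dna, site):
    """Start positions of every occurrence of `site` in `dna` (overlaps allowed)."""
    positions = []
    pos = dna.find(site)
    while pos != -1:
        positions.append(pos)
        pos = dna.find(site, pos + 1)
    return positions


def digest(dna, site, cut_offset, protected=frozenset()):
    """Cut `dna` at each unprotected recognition site.

    `protected` holds start positions of sites that carry the host's
    modification (e.g. methylation) and are therefore left intact.
    """
    fragments, last = [], 0
    for pos in restriction_sites(dna, site):
        if pos in protected:
            continue
        fragments.append(dna[last:pos + cut_offset])
        last = pos + cut_offset
    fragments.append(dna[last:])
    return fragments
```

For instance, with the EcoRI recognition sequence GAATTC (cut after the first G, so `cut_offset=1`), unmodified incoming DNA is fragmented, whereas DNA whose sites are all listed in `protected` is returned intact.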

The analysis of RM systems at a comparative genomics 

level (particularly the type II restriction endonucleases) has

shown the dynamic state of the respective genes (Lin et al. 

2001) and posed a number of questions to the view that RM 

genes restrict gene flow. For example, H. pylori and 

Campylobacter jejuni are competent to take up DNA and 

have a large set of genes to maintain this property. The 

dynamic nature of the H. pylori genome and its natural 

competence is consistent with the weakly clonal population 

structure of H. pylori. Nevertheless, studies on H. pylori 

identified at least eight type II RM systems across two 

strains with an active restriction endonuclease and 

methylase (Kong et al. 2000; Lin et al. 2001). In addition, 

there were several active methylase genes without an active

endonuclease. The occurrence of RM systems that are not 

shared between the strains suggests that new RM systems 

are readily acquired and subsequently lost as a result of 

mutation or recombination (Lin et al. 2001). That these

would pose restriction barriers to gene flow is difficult to

envisage given the dynamic population structure. RM genes

possibly have other advantages to the cell. For methylation 

genes missing their matching restriction gene, it has been 

suggested that they may be used for regulating gene 

expression (as for DAM methylation in E. coli; Lobner- 

Olesen et al. 2005; Robbins-Manke et al. 2005) and for 

keeping track of which parts of the chromosome have been 

recently replicated (Maas 2004). 

Methods for comparing bacterial genomes 

There are at least 20 methods to compare bacterial 

genomes, as shown in Table 3. Some methods are more 

commonly used than the others, and it is beyond the scope 

of this review to provide a detailed analysis of each 

method. A few of these methods are discussed in this 

section. 

Chromosome alignment and size comparison 

Perhaps one of the easiest ways to compare genomes is by 

their sizes, as shown in Fig. 5. Although different phyla 

have different average sizes, it must be kept in mind that 

many of the phyla have currently few representatives and 

that there is a strong economic bias towards sequencing the 

smallest genome, so the size distributions shown here for 

the sequenced genomes could well be smaller than what

exists in natural ecosystems. Another way of comparing

chromosomes is to do a simple alignment of the DNA

sequences. Alignment programmes come in two kinds.

One kind is downloaded and run on a local computer,

such as the Sanger Centre’s (Cambridge, UK) Artemis

Comparison Tool (ACT, Carver et al. 2005); the other is

web-based, such as “WebACT”, a web-based version of

ACT with precomputed comparisons between several

hundred bacterial genomes. The latter might be easier to

use for those biologists who are less computationally

inclined (Abbott et al. 2005).

Table 3 Approaches to comparing bacterial genomes

Level          Method                                           Reference
Genome         Chromosome alignment                             Carver et al. 2005
               AT content in the genome and upstream of genes   Ussery and Hallin 2004a
               Oligomer bias on leading or lagging strands      Worning et al. 2006
               Repeats (local and global)                       Ussery et al. 2004a
               Periodicity of DNA structural properties         Worning et al. 2000
               Length comparison                                Ussery and Hallin 2004b
               Promoter analysis                                Ussery et al. 2004d
Transcriptome  Organisation of rRNA operons                     Ussery et al. 2004b
               tRNAs and codon usage                            Ussery et al. 2004c
               Third nucleotide position bias in codon usage    Ussery et al. 2004c
               Annotation quality                               Skovgaard et al. 2001
Proteome       Amino acid usage                                 Ussery et al. 2004c
               BLAST atlases                                    Hallin et al. 2004a
               BLAST matrices                                   Binnewies et al. 2004
               Sigma factors                                    Kiil et al. 2005a
               Transcription factors                            Kummerfeld 2006
               Secreted proteins                                Bendtsen et al. 2005a
               Membrane proteins                                Bendtsen et al. 2005b
               2-D correlation of properties                    Willenbrock et al. 2005
               Two component signal transduction systems        Kiil et al. 2005b

AT content in genomes and promoter analysis

Another relatively easy method to compare genomes is by

their AT content, which ranges from 78% (Wigglesworthia

glossinidia) to 27% (Clavibacter michiganensis) for the

300 genomes sequenced at the time of writing. In addition

to the average AT content for a whole genome, if the

variation of the AT content within a given genome is

examined, two general trends can be seen for nearly all of

the bacterial genomes. First, on a more global chromosomal

level, there is a tendency for the region around the

origin of DNA replication to be more GC rich (i.e. less AT

rich) and the region around the replication terminus to be

more AT rich (Hallin et al. 2004b). Second, the average AT

content for DNA about 400 bp upstream of the translation

start site for all the genes in a genome is higher than that of

the 400 bp downstream (Hallin et al. 2004b). This makes sense

in that the DNA will need to melt more easily in order for

transcription to start.
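The comparison of AT content upstream and downstream of translation start sites, as described for the AT content method, can be sketched in a few lines. The sketch below is an illustration only and makes simplifying assumptions: all genes are taken to lie on the forward strand, start sites are given as 0-based positions, and genes too close to the sequence ends are skipped. The function names are hypothetical.

```python
def at_fraction(seq):
    """Fraction of A and T bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def upstream_downstream_at(genome, start_sites, flank=400):
    """Mean AT fraction of `flank` bp upstream and downstream of start sites.

    Returns (upstream_mean, downstream_mean), or None if no start site has
    a full flank on both sides.
    """
    up, down = [], []
    for s in start_sites:
        if s - flank < 0 or s + flank > len(genome):
            continue  # not enough sequence on one side
        up.append(at_fraction(genome[s - flank:s]))
        down.append(at_fraction(genome[s:s + flank]))
    if not up:
        return None
    return sum(up) / len(up), sum(down) / len(down)
```

A higher first value than second value would correspond to the trend reported in the text.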

Fig. 5 Genome length distribution for 287 bacterial chromosomes,

shown as a box-and-whiskers plot for each phylum. The number of

chromosomes in each phylum is shown on the axis. Most of the

bacterial genomes shown are either Proteobacteria (156 genomes) or

Firmicutes (70). At the time of writing, the largest complete bacterial

genome sequenced is that of Burkholderia xenovorans, which

consists of 9,703,676 bp within two chromosomes, and the smallest

is the M. genitalium genome of 580,074 bp

tRNAs, codon usage and amino acid

As mentioned above, the 200 bp upstream of translation

start sites is more AT rich, on average, than the 200 bp

downstream. However, if the unsmoothed data is examined

(the grey lines in Fig. 6, panel a), there is much “noise” in

the coding sequence, compared to the upstream, noncoding

DNA. This is due to bias in codon usage, as shown in

Fig. 6, panel b. The genome for a given organism will tend

to show a preference towards certain codons, which can be

seen as a bias in the third codon position (Fig. 6, panel c).

Finally, these codon biases also are in part affected by

which amino acids an organism uses, as shown in panel d

of Fig. 6. The amino acid usage for different E. coli

proteomes differs: for example, E. coli K-12 shows the same

amino acid usage as Salmonella enterica LT2, while the

usage in E. coli O157 resembles that of Shigella flexneri.

Thus, two different E. coli genomes can have quite

different amino acid usage (which might not be that

surprising in view of the differences between strains of this

species, see Table 1).

BLAST atlases

The GenomeAtlas is a method to visualise structural

features of an entire bacterial genome sequence as one plot.

The plots are created using the “GeneWiz” programme,

developed at CBS (Pedersen et al. 2000). A more recent

extension of this method is the development of the 

“genome BLAST atlas”, in which genes from different 

genomes are blasted against a reference genome and 

visualised using an atlas plot. BLAST atlases can provide 

additional contextual information about regions which 

contain few conserved genes. For example, a new genome 

might have a few small islands of unique proteins, and 

these regions might be more AT rich or might be expected 

to be potentially highly expressed, based on chromosomal 

structural information also provided in the plots. As 

mentioned above, when the 20 sequenced E. coli genomes

in Table 1 are compared, an enormous amount of diversity

is found. A BLAST atlas for E. coli O157 is shown in Fig. 7a.

Several regions of the chromosome have “holes” representing 

large segments of missing genes in some organisms, 

compared to the reference genome. In a sense, this 

information is somewhat similar to that obtained by the 

ACT plots mentioned above, although now the comparisons 

are being made at the level of presence/absence of 

clusters of proteins. In Fig. 7b, some of the regions 

containing gaps are more AT rich, some contain repeats and 

a few (marked) contain genes that might be highly 

expressed, based on chromatin properties. Thus, this tool 

can give a quick overview of the comparison of many 

genomes. 
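The “holes” in a BLAST atlas correspond to runs of neighbouring reference genes without a hit in the compared genome. A minimal sketch of how such runs could be located is given below; it assumes that the BLAST results have already been reduced to one presence/absence flag per reference gene in chromosomal order, and it ignores the circularity of the chromosome. The function name and the minimum run length are hypothetical choices for illustration.

```python
def absent_regions(hits, min_run=3):
    """Return (first_index, last_index) for runs of at least `min_run`
    consecutive reference genes with no BLAST hit in the query genome.

    `hits` is a list of booleans in chromosomal gene order.
    """
    regions, run_start = [], None
    for i, present in enumerate(hits):
        if not present and run_start is None:
            run_start = i  # a new gap begins here
        elif present and run_start is not None:
            if i - run_start >= min_run:
                regions.append((run_start, i - 1))
            run_start = None
    # close a gap that extends to the last gene
    if run_start is not None and len(hits) - run_start >= min_run:
        regions.append((run_start, len(hits) - 1))
    return regions
```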

In Fig. 7a, the gaps correspond to regions of missing

genes in the E. coli O157 genome. Similar patterns can be

seen for many other bacterial genomes. For example, in

Fig. 7b, there are four large gaps in the C. jejuni RM1221

genome compared to other epsilon Proteobacteria. These

correspond to phage insertion sites in C. jejuni RM1221, as

described in the original genome sequence publication

(Fouts et al. 2005). Similar results have been observed for

Streptococcus (Hallin et al. 2004a). In all three of these

cases, there are large regions which contain many genes

which are missing in other genomes of the same species.

These clusters of genes often contain evidence that they

came from phages, which appears to be an efficient method

of bringing new DNA into a genome.

Fig. 6 Genomic properties of Streptomyces coelicolor A3. a Comparison

of AT content upstream and downstream of all 7,825 genes; the

genes are all oriented in the same direction and aligned such that the

translation start site is in the middle. Z-scores of standard deviations

from the chromosomal average are plotted, as described previously

(Ussery and Hallin 2004a). b Codon usage of the same set of 7,825

genes. The frequency of occurrence of each of the 64 codons is plotted

in a star plot; note that most codons have a relatively low frequency of

usage. c Bias in the codon position is plotted as frequencies; note that

there is a strong tendency for Cs and Gs in third position. d Amino acid

usage of each of the 20 amino acids for the entire S. coelicolor

proteome is plotted as frequency of the total; the amino acids in this plot

are grouped according to their properties; for example, all the aliphatic

amino acids (A, V, L, I and G) are together and, in general, there is a

trend for this proteome to favour aliphatic amino acids, with the

exception of isoleucine. The three star plots are as described previously

(Ussery et al. 2004c)
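The codon usage and third-position statistics plotted in Fig. 6 (panels b and c) rest on simple counts. As a rough sketch, assuming a coding sequence that is read in frame from its first base, they could be computed as follows; the function names are hypothetical and this is not the code used to produce the figure.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in a coding sequence read in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def third_position_gc(cds):
    """Fraction of codons that have G or C in the third position."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    gc3 = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc3 / total
```

Applied to all genes of a GC-rich genome such as S. coelicolor, the second function would be expected to return a value well above 0.5, in line with the tendency noted in the caption.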

Fig. 7 Genome BLAST atlases. The outer circles represent BLAST

hits of a given genome (named in the legend) to the reference

genome (named in the center of the atlas). The colours are scaled

such that good BLAST hits (E = 10^-40) are darkly shaded, whilst

regions containing no hits are shown in light grey, as described

previously (Hallin et al. 2004a). a Genome BLAST atlas of E. coli

O157 EDL933 vs four other sequenced E. coli strains (the four

outermost circles; the genomes are, going from the outermost

towards the center, E. coli K-12 MG1655, E. coli K-12 W3110, E.

coli CFT073 and E. coli O157 RIMD0509952). b Genome BLAST

atlas of C. jejuni vs other epsilon Proteobacteria

BLAST matrices 

Figure 7a,b illustrates the use of BLAST atlases to compare 

genome sequences. However, with several hundred 

genomes available, there is a need for a faster way of 

getting an overview of genome similarity. One method is 

the use of reciprocal hits—that is, to BLAST all the 

proteins encoded in a genome of interest against those in 

another genome (Binnewies et al. 2004). First, the genomes 

of interest are selected (e.g. all genomes of Proteobacteria), 

then a BLAST matrix can be displayed from this selection. 

The results are pre-generated and the system keeps track of 

sequence updates by generating MD5 checksums of all 

sequences and the combinations in which they have been 

BLASTed. The MD5 (also termed a message digest) will

produce a 32-character hexadecimal string that is effectively

unique to an input string, e.g. a genomic sequence. The

system maintains an all-against-all BLAST database,

updating only the missing comparisons; that is, changing

the sequence of a record or inserting a new record will

cause a BLAST run of the sequence against all the existing

sequences of the database. By having multiple genomes in

a given selection, an all-against-all BLAST matrix can be

presented showing the percentage of genes that are shared

between sequences, both at the protein and at the

nucleotide level. Each such percentage is supplied with a

link to give a full listing from the BLAST report. Fig. 8

shows an example of such a BLAST matrix, with the

diagonal (in red) reflecting the internal homologues of a

given genome. The boxes are colour-coded such that the

intensity represents the fraction of hits (Binnewies et al.

2004).

Fig. 8 The BLAST table shows the overall protein homology

between all combinations of the five available Vibrio sequences.

Only hits containing at least 80% of the length of the gene

and with an E-value of 1×10 or better are counted. The diagonal

(red/pink) indicates the fraction of proteins that have homologous

hits within the proteome itself; the fraction is similar in

all genomes, and the intensity is shown by the red colour, scaled

from ~24% (grey) to ~27% (red). Note that the largest genome

also has the highest fraction of internal homologs. The

green area for the rest of the table, on each side of the

diagonal, shows the number of proteins that have homologous

hits between different Vibrio genomes. As before, the

fraction is indicated by the intensity of the colour (green)

scaled from ~57% (grey) to ~83% (green). In general, it is clear

that these organisms share a high percentage of their genes

with the other Vibrio species, which should be expected

because they are from the same genus
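The checksum bookkeeping described for the BLAST matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual system: sequences are held in a dictionary, completed comparisons are stored as a set of (query MD5, subject MD5) pairs, and the names `md5_hex`, `pending_comparisons`, `records` and `done` are hypothetical.

```python
import hashlib
from itertools import product


def md5_hex(sequence):
    """32-character hexadecimal MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def pending_comparisons(records, done):
    """Return (query, subject) name pairs whose checksum pair is not in `done`.

    `records` maps a record name to its sequence; `done` is a set of
    (query_md5, subject_md5) pairs that have already been BLASTed.
    """
    digests = {name: md5_hex(seq) for name, seq in records.items()}
    return [(q, s) for q, s in product(digests, repeat=2)
            if (digests[q], digests[s]) not in done]
```

Because the key is derived from the sequence itself, editing a record changes its checksum, so every comparison involving that record is reported as missing again, while untouched pairs are skipped.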

Meta-genomics: comparison of all the genomes 

in an ecosystem 

The term “metagenomics” is used for genome sequencing 

projects in which many organisms are sequenced at once 

by shotgun cloning of all DNA present in a sample 

(Handelsman 2004). This enables microbial ecosystems 

containing microbes that are not (presently) culturable in 

pure form to be investigated (Handelsman 2004). The

reasons why organisms remain uncultured can be practical 

(e.g. thermophilic bacteria grow at a temperature above the 

melting point of agar), physiological (e.g. extremophiles 

that grow in pure culture can have very different properties

from those observed in their true environment) or biological

(symbiotic life forms cannot be cultured in microbiologically

pure form). The first genome sequence obtained

from a non-culturable bacterium was indeed that of 

Buchnera aphidicola, a symbiont of aphids. This sequence 

was not obtained by meta-genomics at the total genome 

DNA level but rather at the rRNA level. Cell counts 

compared to plate counts showed that the latter can be 

orders of magnitude wrong: many viable bacteria refuse to 

grow on solid culture medium. The isolation of bulk RNA 

and the subsequent determination of rRNA sequences 

using specific primers allowed qualitative analysis to be 

performed for identifying novel bacterial species or 

ribotypes present in an ecosystem (Olsen et al. 1986). 

The application of PCR improved the sensitivity of such 

approaches but the limitation to rRNA sequences confined 

analyses to phylogenetic information only and little further 

knowledge was obtained about the new species. Metagenomics 

can be used to generate complete or fragmented 

genome sequences of organisms that might be abundant in 

nature but are not easily culturable. 

The acid mine drainage sequencing project has shown 

the potential of meta-genomics (Tyson et al. 2004). The 

mine water of the Richmond mine is covered with a biofilm 

of bacteria despite its hostile environment: an extreme acid 

pH (between 0 and 1), high concentrations of metal ions, 

including copper, zinc and arsenic, and the absence of 

carbon or nitrogen sources (other than from air). The 

biofilm was composed of relatively few organisms, 

enabling the sequencing of shotgun-cloned DNA and the 

sorting of fragments according to their G + C content into 

nearly complete bacterial genomes. A dominant bacterial 

genus was identified, Leptospirillum, and a less abundant 

Sulfobacillus spp and some Archaea were also present. The 

findings greatly improved understanding of this ecosystem. 

The predominant bacteria were responsible for nitrogen 

and carbon fixation (Leptospirillum group III), whereas 

several species were able to generate energy from iron 

oxidation (Ferroplasma and Leptospirillum spp). As in this 

approach, each sequenced DNA fragment is obtained from 

a different individual (whereas in classical genome 

sequencing all DNA is obtained from one clone); 

information on polymorphisms also becomes available. 

As more complex ecosystems are studied, the puzzle of 

genome assembly becomes more difficult due to the 

presence of more species, genomic rearrangements and 

horizontal gene transfer events. 
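Sorting shotgun fragments by G + C content, as was done for the acid mine drainage biofilm, amounts to a simple binning step. The sketch below illustrates the idea only; the cut-off values passed as `boundaries` are hypothetical inputs that would have to be chosen from the data, and the function names are not taken from the cited study.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def bin_by_gc(fragments, boundaries):
    """Assign each fragment to a bin according to its G+C fraction.

    `boundaries` is an ascending list of cut-offs; n cut-offs give n + 1 bins.
    """
    bins = [[] for _ in range(len(boundaries) + 1)]
    for frag in fragments:
        gc = gc_fraction(frag)
        index = sum(gc >= b for b in boundaries)  # number of cut-offs passed
        bins[index].append(frag)
    return bins
```

Such binning works best when, as in that biofilm, the community contains few organisms with clearly different base compositions.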

The largest attempt so far at metagenomics was initiated 

by C. Venter to sequence the microbial ecosystem in the 

Sargasso Sea (Venter et al. 2004). Seawater was sampled 

by filtering to specifically recover bacterial (and not viral or 

amoebal) DNA. Over 1 billion base pairs of sequence were 

generated, which was attributed to at least 1,800 species. 

As the abundance of individual species determines their 

coverage in shotgun cloning, this coverage (or rather the 

mean of their Poisson distribution) was used to sort out 

DNA scaffolds (a scaffold is a reconstructed genomic 

region), and oligonucleotide frequencies were used to 

refine this sorting. Although the complexity of the 

investigated ecosystem did not allow complete assembly 

of individual genomes, the scaffolds belonging to the most 

abundant species could be attributed to Burkholderia and 

Shewanella-like species. As with the acid mine drainage

project, polymorphisms were detected with varying 

frequencies. In fact, the dataset ranged from organisms 

belonging to a single species and clonal (few polymorphisms) 

to a population continuum in which some clonal 

complexes could be recognised. These observations 

illustrate the ‘unnatural’ approach of studying only pure 

bacterial cultures that have a strict clonal structure in 

contrast to natural environments where the population 

structure is much more fluid and the concept of clones or 

species is more elusive. The most impressive output of the 

Sargasso Sea study is the number of individual genes that

were identified (69,901). Among the surprising findings 

was that rhodopsin (a bacterial light-driven proton pump

used to harvest light energy) was abundant outside the proteobacteria

where it had previously been identified. The finding of 

many genes involved in phosphate uptake and utilisation of 

poly- and pyrophosphates is puzzling, as the marine 

environment is extremely phosphate-limited. 
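The two signals used to sort the Sargasso Sea scaffolds, sequencing coverage and oligonucleotide frequencies, can be illustrated with a short sketch. It is not the procedure of Venter et al. (2004): the choice of tetranucleotides (k = 4), the Euclidean distance and all function names are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k=4):
    """Relative frequency of each overlapping k-mer in `seq`."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def frequency_distance(freq_a, freq_b):
    """Euclidean distance between two k-mer frequency profiles."""
    kmers = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2 for m in kmers))


def mean_coverage(read_lengths, scaffold_length):
    """Average depth: total sequenced bases divided by scaffold length."""
    return sum(read_lengths) / scaffold_length
```

Scaffolds with similar mean coverage and a small profile distance would be candidates for belonging to the same organism.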

The challenge to analyse the complex communities of a 

nutrient-rich environment was taken up by Tringe and 

Rubin (2005). One sample that was analysed was derived 

from agricultural soil and three were from marine whale 

carcasses. First, rRNA libraries were generated by PCR to 

investigate the microbial diversity. The soil sample (DNA 

obtained from 5 g of surface clay loam from land that had 

been used for livestock) was extremely rich in species with 

at least 847 ribotypes detected representing over 12 phyla. 

The whale samples (two bone parts and one biofilm 

covering a whale carcass) were less diverse but still 

contained between 25 and 150 ribotypes. Although the 

assembly of sequences obtained from shotgun libraries was 

not possible, the genes that were identified on the

sequenced library clones demonstrated that approximately

half of the predicted proteins had homologues

in existing gene databases. Plotting the number of novel

gene families against the amount of generated sequences 

suggested that, for the soil sample, few novel orthologues 

were found after sequencing 25 Mbp. The functions of 

predicted proteins from the sequences were naturally 

diverse, but for the soil sample, potassium channelling 

systems were overrepresented, whereas for the whale 

samples sodium ion exporters were abundant—which fit 

with the abundance of these two ions in the two 

environments, respectively. 
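Plotting novel gene families against sequencing effort, as described above, is a cumulative count of families seen so far. A minimal sketch follows; it assumes each predicted gene has already been assigned a family identifier, and the function name is hypothetical.

```python
def novel_family_curve(family_per_gene):
    """Cumulative number of distinct gene families after each sequenced gene.

    `family_per_gene` lists family identifiers in the order in which the
    genes were found.
    """
    seen, curve = set(), []
    for family in family_per_gene:
        seen.add(family)
        curve.append(len(seen))
    return curve
```

A curve that flattens indicates that further sequencing yields few new families.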

Metagenomic analyses will continue to expand the databases,

with the interpretation and assembly of

raw data becoming more complete. The human gastrointestinal

tract, for example, is the target of a metagenomics 

sequencing project (Mongodin et al. 2005). It is apparent 

that each individual carries a large variety of microflora, 

probably acquired early in life (and which may have health

consequences even though these organisms are not pathogenic) 

as well as bacterial microheterogeneity that was not 

recognised previously. Against the common belief that 

Firmicutes and Bacteroides would be the most abundant 

microbes present in the human gut, it appears that 

Actinobacteria and Archaea may be more prominent 

(Mongodin et al. 2005). The intestinal microflora of 

obese mice differs considerably from that of lean animals,

an observation in support of the view that the microbiota of 

mammals are good indicators (be it cause or effect) of their 

health status (Ley et al. 2005). There are clearly many 

microbial communities to be analysed and compared using 

metagenomics. 

Application: computational vaccine development 

Vaccines remain an extremely important tool for controlling 

infectious diseases of humans and animals, although 

they are only available for about 10% of the microorganisms

known to be harmful to humans (Lund et al. 2005). 

Traditional vaccines typically have incorporated whole live 

attenuated or killed microorganisms, but, particularly for 

use in humans, such vaccines now have limited application 

due to concerns about safety, efficacy and/or ease of 

production. Much recent work, therefore, has focused on 

developing vaccines composed of prominent immunogenic 

parts of microorganisms (subunit vaccines) or genes 

encoding these components (genetic vaccines, Ellis 

1999). For bacterial vaccine discovery, these newer 

approaches have been greatly assisted by the recent 

availability of whole genomic sequence data, which has

allowed a new approach to vaccine development called 

“reverse vaccinology” (Rappuoli 2001). 

In reverse vaccinology, bioinformatics tools are used to 

undertake comprehensive in silico screening of genomic 

sequence to identify genes encoding proteins that have 

desirable characteristics. The power of this process has 

increased as more and more genomic sequences that 

encode proteins of known function become available in the 

databases for comparative analysis. Targets for consideration 

for use in vaccines include genes encoding outer 

membrane proteins or lipoproteins, transmembrane domains 

or export signal peptides, and proteins with 

homologies to bacterial factors already known to be 

involved in virulence or pathogenicity. Surface-exposed 

or secreted proteins as well as virulence factors such as 

toxins or adhesive factors are likely to induce an immune 

response that may be protective (Zagursky and Russell 

2001). In this way, large numbers of potential vaccine 

components can be identified from a whole (or partial) 

genome sequence. This approach was first taken for the 

human pathogen Neisseria meningitidis serogroup B, with 

600 open reading frames (ORFs) of potential interest 

initially being identified (Pizza et al. 2000). Recombinant 

proteins from 350 ORFs were eventually produced and, 

after screening for distribution in different serotypes,

stability, immunogenicity and cross-protection, 15 were 

selected as potential subunit vaccine candidates. This same 

approach to vaccine discovery is now being taken for a 

number of important human and animal pathogens (Serruto 

et al. 2004). Reverse vaccinology allows rapid identification 

of a large number of potential subunit vaccine 

candidates, many of which would not have been recognised 

by more traditional approaches. It is complemented by the 

use of microarrays to analyse gene expression and of 

proteomic approaches to study protein expression and 

distribution and can be focused further by the use of 

computer algorithms that scan and identify sequences

encoding specific epitopes involved in immunogenicity

(reviewed in Lund et al. 2002; see also, for a review,

Theoretical Biology and Biophysics Group, Los Alamos

National Laboratory [http://www.hiv.lanl.gov/content/immunology/pdf/2002/1/Lund2002.pdf]). These algorithms

have been strengthened by the availability of full 

genomic sequences for many pathogens. 
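The in silico screening step of reverse vaccinology can be thought of as a filter over annotated ORFs. The sketch below is purely illustrative: the `OrfAnnotation` fields stand in for the output of separate prediction tools (signal peptides, transmembrane domains, homology to virulence factors) and are hypothetical, as is the simple "any criterion" rule.

```python
from dataclasses import dataclass


@dataclass
class OrfAnnotation:
    """Hypothetical per-ORF predictions gathered from other tools."""
    name: str
    has_signal_peptide: bool
    transmembrane_helices: int
    virulence_homologue: bool


def candidate_antigens(orfs):
    """Names of ORFs predicted to be exported, membrane-associated or
    similar to known virulence factors."""
    return [o.name for o in orfs
            if o.has_signal_peptide
            or o.transmembrane_helices > 0
            or o.virulence_homologue]
```

In practice the resulting list would only be a starting point for the expression and screening steps described for N. meningitidis.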

Prediction methods for the three main types of epitopes (B

cell, helper T lymphocyte and cytotoxic T lymphocyte)

have been developed, and improved methods are constantly

being developed. Thus, it is possible to take a genome 

sequence, use some predictors as described above and 

select potential peptide sequences for construction of 

vaccines. These vaccines can be either chemically 

synthesised peptide based or DNA based. With regard to

peptides, these can be used directly or used to construct a 

“polytope”, which is a composite protein made from 

individual epitopes. 

Intellectual property rights: who owns the genome 

sequence? 

This review started by giving the US patent numbers for the 

first two genomes sequenced. This final section will briefly 

discuss some of the issues facing researchers working with 

genomic data. At the time of writing, ten whole genome 

patents have been granted, with more patents being applied 

for (O’Malley et al. 2005). Some of these patents include 

the use of the sequence in silico and clearly raise a number 

of issues related to freedom to operate in research. In 

addition, the enforcement of the patents could be difficult, 

with many bioinformatic tools being developed in the 

public domain. 

Another related difficulty has to do with using or 

analysing genome sequences before they are presented in 

scientific publications. Now that it is possible to sequence a 

bacterial genome in an afternoon and have a GenBank file a 

day or two later, the time gap between having the sequence 

publicly available and having the paper in print can be 

several years. Some public granting agencies have pushed 

hard for the data to be made available as soon as possible 

for people to search for their particular gene of interest. On 

the other hand, it is also understandable that the individuals 

who have actually sequenced the genomes need some lead 

time to analyse their data. With high-throughput bioinformatic 

techniques, it is possible, for example, for some 

groups to do in a few days what would take other groups 

months (or years) to complete.

A final problem has to do with obtaining basic 

information about the strain used for sequencing a genome. 

For example, what was the strain isolated from? What was 

the growth temperature or culture medium pH for the 

culture that the genomic DNA was derived from? What is 

the doubling time of this organism under these conditions? 

These are all important pieces of data, but they are often 

missing in genome publications. A recent “minimal 

information about a genome sequence” standard has been 

proposed (Field and Hughes 2005), which is in the same 

spirit as the MIAME standard for microarray experiments (see footnote 3). 

In the future, it could well be that something resembling a 

GenBank file with additional biological information will be 

the “publication” for a bacterial genome sequence, as 

genome sequencing becomes ever cheaper and easier to 

perform. Overall, it is important that genome sequence 

information is released into the public domain in a timely 

manner so that global scientific progress can be maintained. 
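
The kind of record the preceding paragraph asks for can be sketched as a small data structure that reports which items are missing. The field names below are hypothetical and follow only the examples given in the text (isolation source, growth temperature, culture medium pH, doubling time); they are not the schema proposed by Field and Hughes (2005).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class StrainRecord:
    """Basic information about a sequenced strain (illustrative fields only)."""
    organism: str
    isolation_source: Optional[str] = None        # what the strain was isolated from
    growth_temperature_c: Optional[float] = None  # growth temperature, degrees Celsius
    medium_ph: Optional[float] = None             # pH of the culture medium
    doubling_time_min: Optional[float] = None     # doubling time under these conditions


def missing_fields(record: StrainRecord) -> List[str]:
    """List the names of fields that have not been filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]
```

A record created with only the organism name and a growth temperature would be reported as missing the isolation source, the medium pH and the doubling time, which is the situation the text describes for many genome publications.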

Acknowledgements DWU, PFH and TTB are supported by grants 

from the Danish Research Foundation. We are grateful to the Sanger 

Center for allowing prepublication access to the sequences for the E. 

coli 042 genome (the DNA sequence and annotation files were 

downloaded from the Sanger web site http://www.sanger.ac.uk/). 

References 

Abbott JC, Aanensen DM, Rutherford K, Butcher S, Spratt BG 

(2005) WebACT—an online companion for the Artemis 

Comparison Tool. Bioinformatics 21(18):3665–3666 

Acinas SG, Marcelino LA, Klepac-Ceraj V, Polz MF (2004) 

Divergence and redundancy of 16S rRNA sequences in genomes 

with multiple rrn operons. J Bacteriol 186(9):2629–2635 

Alain K, Querellou J, Lesongeur F, Pignet P, Crassous P, Raguenes G, 

Cueff V, Cambon-Bonavita M-A (2002) Caminibacter hydrogeniphilus 

gen. nov., sp. nov., a novel thermophilic, hydrogen-oxidizing 

bacterium isolated from an East Pacific Rise 

hydrothermal vent. Int J Syst Evol Microbiol 52:1317–1323 

Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL, 

Arkin AP (2005) The MicrobesOnline Web site for comparative 

genomics. Genome Res 15(7):1015–1022 

Alm RA, Trust TJ (1999) Analysis of the genetic diversity of 

Helicobacter pylori: the tale of two genomes. J Mol Med 77 

(12):834–846 (Review) 

Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI (2005) 

Host–bacterial mutualism in the human intestine. Science 307 

(5717):1915–1920 

Bendtsen JD, Binnewies TT, Hallin PF, Sicheritz-Ponten T, Ussery 

DW (2005a) Genome update: prediction of secreted proteins in 

225 bacterial proteomes. Microbiology 151(Pt 6):1725–1727 

Bendtsen JD, Binnewies TT, Hallin PF, Ussery DW (2005b) 

Genome update: prediction of membrane proteins in prokaryotic 

genomes. Microbiology 151(Pt 7):2119–2121 

Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2004) Genome 

update: proteome comparisons. Microbiology 151(Pt 1):1–4 

Burrus V, Waldor MK (2004) Shaping bacterial genomes with 

integrative and conjugative elements. Res Microbiol 155 

(5):376–386 

Carattoli A (2001) Importance of integrons in the diffusion of 

resistance. Vet Res 32(3–4):243–259 

Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell 

BG, Parkhill J (2005) ACT: the Artemis Comparison Tool. 

Bioinformatics 21(16):3422–3423 

Footnote 3: http://www.ucl.ac.uk/wibr/services/docs/miamiv1.doc 

Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ, 

Blyn LB (2002) A bioinformatics based approach to discover 

small RNA genes in the Escherichia coli genome. Biosystems 

65(2–3):157–177 

Dobrindt U, Hacker J (2001) Whole genome plasticity in pathogenic 

bacteria. Curr Opin Microbiol 5(4):550–557 

Dobrindt U, Hochhut B, Hentschel U, Hacker J (2004) Genomic 

islands in pathogenic and environmental microorganisms. Nat 

Rev Microbiol (2):414–424 

Doolittle WF (1999a) Lateral genomics. Trends Cell Biol 9(12): 

M5–M8 

Doolittle WF (1999b) Phylogenetic classification and the universal 

tree. Science 284(5423):2124–2129 

Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P (2005) 

Detection and characterisation of horizontal transfers in 

prokaryotes using genomic signature. Nucleic Acids Res 33(1):e6 

Duponnois R, Ba AM, Mateille T (1999) Beneficial effects of 

Enterobacter cloacae and Pseudomonas mendocina for biocontrol 

of Meloidogyne incognita with the endospore-forming 

bacterium Pasteuria penetrans. Nematology 1(1):95–101 

Ellis RW (1999) New technologies for making vaccines. Vaccine 17 

(13–14):1596–1604 

Falkow S (1975) Infectious multiple drug resistance. Pion Limited, 

London, England 

Fani R, Brilli M, Lio P (2005) The origin and evolution of operons: 

the piecewise building of the proteobacterial histidine operon. 

J Mol Evol 60(3):378–390 

Field D, Hughes J (2005) Cataloguing our current genome 

collection. Microbiology 151(Pt 4):1016–1019 

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, 

Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, 

McKenney K, Sutton G, FitzHugh W, Fields C, Gocayne JD, Scott 

J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips 

CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna 

MC, Nguyen DT, Saudek DM, Brandon RC, Fine LD, Fritchman 

JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA, 

Small KV, Fraser CM, Smith HO, Venter JC (1995) Whole-genome 

random sequencing and assembly of Haemophilus 

influenzae Rd. Science 269(5223):496–498, 507–512 

Fluit AC, Schmitz F-J (2004) Resistance integrons and superintegrons. 

Clin Microbiol Infect 10:272–288 

Fouts DE, Mongodin EF, Mandrell RE, Miller WG, Rasko DA, 

Ravel J, Brinkac LM, DeBoy RT, Parker CT, Daugherty SC, 

Dodson RJ, Durkin AS, Madupu R, Sullivan SA, Shetty JU, 

Ayodeji MA, Shvartsbeyn A, Schatz MC, Badger JH, Fraser 

CM, Nelson KE (2005) Major structural differences and novel 

potential virulence mechanisms from the genomes of multiple 

Campylobacter species. PLoS Biol 3(1):e15 

Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, 

Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, 

Fritchman RD, Weidman JF, Small KV, Sandusky M, 

Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips 

CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, 

Lucier TS, Peterson SN, Smith HO, Hutchison CA 3rd, Venter 

JC (1995) The minimal gene complement of Mycoplasma 

genitalium. Science 270(5235):397–403 

Fraser-Liggett CM (2005) Insights on biology and evolution from 

microbial genome sequencing. Genome Res 15:1603–1610 

Galun E (2003) Transposable elements: a guide to the perplexed and 

the novice. Kluwer Academic, Dordrecht, The Netherlands, pp 

25–73 

Gil R, Latorre A, Moya A (2004) Bacterial endosymbionts of insects: 

insights from comparative genomics. Environ Microbiol 6 

(11):1109–1122 

Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, Baptista D, 

Bibbs L, Eads J, Richardson TH, Noordewier M, Rappe MS, 

Short JM, Carrington JC, Mathur EJ (2005) Genome 

streamlining in a cosmopolitan oceanic bacterium. Science 

309(5738):1242–1245

Goebel W, Gross R (2001) Intracellular survival strategies of mutualistic 

and parasitic prokaryotes. Trends Microbiol 9(6):267–273 

Goldmann DA, Klinger JD (1986) Pseudomonas cepacia: 

biology, mechanisms of virulence, epidemiology. J Pediatr 

108(5 Pt 2):806–812 

Gottesman S (2005) Micros for microbes: non-coding regulatory 

RNAs in bacteria. Trends Genet 7:399–404 

Hallin PF, Ussery DW (2004) CBS genome atlas database: a dynamic 

storage for bioinformatic results and sequence data. Bioinformatics 

20(18):3682–3686 

Hallin PF, Binnewies TT, Ussery DW (2004a) Genome update: 

chromosome atlases. Microbiology 150(Pt 10):3091–3093 

Hallin PF, Coenye T, Binnewies TT, Jarmer H, Staerfeldt HH, Ussery 

DW (2004b) Genome update: correlation of bacterial genomic 

properties. Microbiology 150(Pt 12):3899–3903 

Handelsman J (2004) Metagenomics: application of genomics to 

uncultured microorganisms. Microbiol Mol Biol Rev 68:669–685 

Harrison A, Dyer DW, Gillaspy A, Ray WC, Mungur R, Carson 

MB, Zhong H, Gipson J, Gipson M, Johnson LS, Lewis L, 

Bakaletz LO, Munson RS Jr (2005) Genomic sequence of an 

otitis media isolate of nontypeable Haemophilus influenzae: 

comparative study with H. influenzae serotype d, strain KW20. 

J Bacteriol 187(13):4627–4636 

Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama 

K, Han CG, Ohtsubo E, Nakayama K, Murata T, Tanaka M, 

Tobe T, Iida T, Takami H, Honda T, Sasakawa C, Ogasawara N, 

Yasunaga T, Kuhara S, Shiba T, Hattori M, Shinagawa H 

(2001) Complete genome sequence of enterohemorrhagic 

Escherichia coli O157:H7 and genomic comparison with a 

laboratory strain K-12. DNA Res 8:11–22 

Holmes AJ, Gillings MR, Nield BS, Mabbutt BC, Nevalainen KM, 

Stokes HW (2003) The gene cassette metagenome is a basic 

resource for bacterial genome evolution. Environ Microbiol 5 

(5):383–394 

Horowitz NH (1945) On the evolution of biochemical synthesis. 

Proc Natl Acad Sci U S A 31:153–157 

Horowitz NH (1965) The evolution of biochemical synthesis— 

retrospect and prospect. In: Bryson V, Vogel HJ (eds) Evolving 

genes and proteins. Academic, New York, pp 15–23 

Itoh T, Takemoto K, Mori H, Gojobori T (1999) Evolutionary 

instability of operon structures disclosed by sequence comparisons 

of complete microbial genomes. Mol Biol Evol 3:332–346 

Jacob F, Monod J (1961) Genetic regulatory mechanisms in the 

synthesis of proteins. J Mol Biol 3:318–356 

Jacob F, Perrin D, Sanchez C, Monod J (1960) Operon: a group of 

genes with the expression coordinated by an operator. C R 

Hebd Seances Acad Sci 250:1727–1729 

Jaffe JD, Stange-Thomann N, Smith C, DeCaprio D, Fisher S, 

Butler J, Calvo S, Elkins T, FitzGerald MG, Hafez N, Kodira 

CD, Major J, Wang S, Wilkinson J, Nicol R, Nusbaum C, 

Birren B, Berg HC, Church GM (2004) The complete genome 

and proteome of Mycoplasma mobile. Genome Res 14 

(8):1447–1461 

Janga SC, Collado-Vides J, Moreno-Hagelsieb G (2005) Nebulon: a 

system for the inference of functional relationships of gene 

products from the rearrangement of predicted operons. Nucleic 

Acids Res 33(8):2521–2530 

Jores J, Rumer L, Wieler LH (2004) Impact of the locus of enterocyte 

effacement pathogenicity island on the evolution of pathogenic 

Escherichia coli. Int J Med Microbiol 294(2–3):103–113 

(Review) 

Juhala RJ, Ford ME, Duda RL, Youlton A, Hatfull GF, Hendrix RW 

(2000) Genomic sequences of bacteriophages HK97 and 

HK022: pervasive genetic mosaicism in the lambdoid bacteriophages. 

J Mol Biol 299(1):27–51 

Kennedy SP, Ng WV, Salzberg SL, Hood L, DasSarma S (2001) 

Understanding the adaptation of Halobacterium species NRC-1 

to its extreme environment through computational analysis of 

its genome sequence. Genome Res 11:1641–1650 

Kiil K, Binnewies TT, Sicheritz-Ponten T, Willenbrock H, Hallin PF, 

Wassenaar TM, Ussery DW (2005a) Genome update: sigma factors 

in 240 bacterial genomes. Microbiology 151(Pt 10):3147–3150 

Kiil K, Ferchaud JB, David C, Binnewies TT, Wu H, Sicheritz- 

Ponten T, Willenbrock H, Ussery DW (2005b) Genome update: 

distribution of two-component transduction systems in 250 

bacterial genomes. Microbiology 151(Pt 11):3447–3452 

Kong H, Lin L-F, Porter N, Stickel S, Byrd D, Posfai J, Roberts RJ 

(2000) Functional analysis of putative restriction–modification 

system genes in the Helicobacter pylori J99 genome. Nucleic 

Acids Res 28:3216–3223 

Kummerfeld SK, Teichmann SA (2006) DBD: a transcription factor 

prediction database. Nucleic Acids Res 34(Database issue): 

D74–D81 

Kunin V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net 

of life: reconstructing the microbial phylogenetic network. 

Genome Res 15(7):954–959 

Kuwahara T, Yamashita A, Hirakawa H, Nakayama H, Toh H, 

Okada N, Kuhara S, Hattori M, Hayashi T, Ohnishi Y (2004) 

Genomic analysis of Bacteroides fragilis reveals extensive 

DNA inversions regulating cell surface adaptation. Proc Natl 

Acad Sci U S A 101(41):14919–14924 

Lawrence JG, Roth JR (1996) Selfish operons: horizontal transfer 

may drive the evolution of gene clusters. Genetics 143 

(4):1843–1860 

Lazcano A, Diaz-Villagomez E, Mills T, Oro J (1995) On the levels of 

enzymatic substrate specificity: implications for the early 

evolution of metabolic pathways. Adv Space Res 15(3):345–356 

Lewis M, Chang G, Horton NC, Kercher MA, Pace HC, 

Schumacher MA, Brennan RG, Lu P (1996) Crystal structure 

of the lactose operon repressor and its complexes with DNA 

and inducer. Science 271(5253):1247–1254 

Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD, 

Gordon JI (2005) Obesity alters gut microbial ecology. Proc 

Natl Acad Sci U S A 102(31):11070–11075 

Lin L-F, Posfai J, Roberts RJ, Kong H (2001) Comparative 

genomics of the restriction–modification systems in Helicobacter 

pylori. Proc Natl Acad Sci U S A 98:2740–2745 

Lobner-Olesen A, Skovgaard O, Marinus MG (2005) Dam methylation: 

coordinating cellular processes. Curr Opin Microbiol 8 

(2):154–160 

Lund O, Nielsen M, Kesmir C, Christensen JK, Lundegaard C, 

Worning P, Brunak S (2002) Web-based tools for vaccine 

design. In: Korber BT, Brander C, Haynes BF, Koup R, Kuiken 

C, Moore JP, Walker BD, Watkins D (eds) HIV molecular 

immunology. Los Alamos, NM, pp 45–51 

Lund O, Nielsen M, Lundegaard C, Kesmir C, Brunak S (2005) 

Immunological bioinformatics. MIT, Cambridge, Massachusetts 

Lupski JR, Weinstock GM (1992) Short, interspersed repetitive 

DNA sequences in prokaryotic genomes. J Bacteriol 174 

(14):4525–4529 

Maas R (2004) Prereplicative purine methylation and postreplicative 

demethylation in each DNA duplication of the Escherichia coli 

replication cycle. J Biol Chem 279(49):51568–51573 

Mahillon J, Leonard C, Chandler M (1999) IS elements as 

constituents of bacterial genomes. Res Microbiol 150:675–687 

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben 

LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du 

L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho 

CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, 

Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei 

M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, 

McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, 

Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson 

JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer 

GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, 

Rothberg JM (2005) Genome sequencing in microfabricated 

high-density picolitre reactors. Nature 437(7057):376–380 

McClintock B (1950) The origin and behavior of mutable loci in 

maize. Proc Natl Acad Sci U S A 36(6):344–355 

McGillivary G, Tomaras AP, Rhodes ER, Actis LA (2005) Cloning 

and sequencing of a genomic island found in the Brazilian 

purpuric fever clone of Haemophilus influenzae biogroup 

aegyptius. Infect Immun 73(4):1927–1938

Middendorf B, Hochhut B, Leipold K, Dobrindt U, Blum-Oehler G, 

Hacker J (2004) Instability of pathogenicity islands in 

uropathogenic Escherichia coli 536. J Bacteriol 186 

(10):3086–3096 

Mongodin EF, Emerson JB, Nelson KE (2005) Microbial metagenomics. 

Genome Biol 6(10):347 

Mullis K, Faloona F, Scharf S, Saiki R, Horn G, Erlich H (1986) 

Specific enzymatic amplification of DNA in vitro: the 

polymerase chain reaction. Cold Spring Harb Symp Quant 

Biol 51(Pt 1):263–273 

Nagy Z, Chandler M (2004) Regulation of transposition in bacteria. 

Res Microbiol 155:387–398 

Nishi T, Ikemura T, Kanaya S (2005) GeneLook: a novel ab initio 

gene identification system suitable for automated annotation of 

prokaryotic sequences. Gene 346:115–125 

Novikova N, De Boever P, Poddubko S, Deshevaya E, Polikarpov 

N, Rakova N, Coninx I, Mergeay M (2006) Survey of 

environmental biocontamination on board the International 

Space Station. Res Microbiol 157(1):5–12 

Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer 

and the nature of bacterial evolution. Nature 405:299–304 

Ohnishi M, Kurokawa K, Hayashi T (2001) Diversification of 

Escherichia coli genomes: are bacteriophages the major 

contributors? Trends Microbiol 9:481–485 

Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa M (2006) 

ODB: a database of operons accumulating known operons 

across multiple genomes. Nucleic Acids Res 34(Database 

issue):D358–362 

Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA (1986) 

Microbial ecology and evolution: a ribosomal RNA approach. 

Annu Rev Microbiol 40:337–365 

O’Malley MA, Bostanci A, Calvert J (2005) Whole-genome 

patenting. Nat Rev Genet 6(6):502–506 

Ortutay C, Gaspari Z, Toth G, Jager E, Vida G, Orosz L, Vellai T 

(2003) Speciation in Chlamydia: genome-wide phylogenetic 

analyses identified a reliable set of acquired genes. J Mol Evol 

57:672–680 

Ou HY, Chen LL, Lonnen J, Chaudhuri RR, Thani AB, Smith R, 

Garton NJ, Hinton J, Pallen M, Barer MR, Rajakumar K (2006) 

A novel strategy for the identification of genomic islands by 

comparative analysis of the contents and contexts of tRNA sites 

in closely related bacteria. Nucleic Acids Res 34(1):e3 

Pal C, Hurst LD (2004) Evidence against the selfish operon theory. 

Trends Genet 20(6):232–234 

Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, Harris 

DE, Holden MT, Churcher CM, Bentley SD, Mungall KL, 

Cerdeno-Tarraga AM, Temple L, James K, Harris B, Quail MA, 

Achtman M, Atkin R, Baker S, Basham D, Bason N, 

Cherevach I, Chillingworth T, Collins M, Cronin A, Davis P, 

Doggett J, Feltwell T, Goble A, Hamlin N, Hauser H, Holroyd 

S, Jagels K, Leather S, Moule S, Norberczak H, O’Neil S, 

Ormond D, Price C, Rabbinowitsch E, Rutter S, Sanders M, 

Saunders D, Seeger K, Sharp S, Simmonds M, Skelton J, 

Squares R, Squares S, Stevens K, Unwin L, Whitehead S, 

Barrell BG, Maskell DJ (2003) Comparative analysis of the 

genome sequences of Bordetella pertussis, Bordetella parapertussis 

and Bordetella bronchiseptica. Nat Genet 35(1):32–40 

Paulsen IT, Banerjei L, Myers GSA, Nelson KE, Seshadri R, Read TD, 

Fouts DE, Eisen JA, Gill SR, Heidelberg JF, Tettelin H, Dodson 

RJ, Umayam L, Brinkac L, Beanan M, Daugherty S, DeBoy RT, 

Durkin S, Kolonay J, Madupu R, Nelson W, Vamathevan J, Tran 

B, Upton J, Hansen T, Shetty J, Khouri H, Utterback T, Radune D, 

Ketchum KA, Dougherty BA, Fraser CM (2003) Role of mobile 

DNA in the evolution of vancomycin-resistant Enterococcus 

faecalis. Science 299(5615):2071–2074 

Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW 

(2000) A DNA structural atlas for Escherichia coli. J Mol Biol 

299(4):907–930 

Pennisi E (2005) Biochemistry. Cut-rate genomes on the horizon? 

Science 309(5736):862 

Penyalver R, Lopez MM (1999) Cocolonization of the rhizosphere 

by pathogenic Agrobacterium strains and nonpathogenic strains 

K84 and K1026, used for crown gall biocontrol. Appl Environ 

Microbiol 65(5):1936–1940 

Peters EDJ, Leverstein-Van Hall MA, Box ATA, Verhoef J, Fluit AC 

(2001) Novel gene cassettes and integrons. Antimicrob Agents 

Chemother 45(10):2961–2964 

Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B, 

Comanducci M, Jennings GT, Baldi L, Bartolini E, Capecchi 

B, Galeotti CL, Luzzi E, Manetti R, Marchetti E, Mora M, Nuti 

S, Ratti G, Santini L, Savino S, Scarselli M, Storni E, Zuo P, 

Broeker M, Hundt E, Knapp B, Blair E, Mason T, Tettelin H, 

Hood DW, Jeffries AC, Saunders NJ, Granoff DM, Venter JC, 

Moxon ER, Grandi G, Rappuoli R (2000) Identification of 

vaccine candidates against serogroup B meningococcus by 

whole-genome sequencing. Science 287:1816–1820 

Prescott L, Harvey JP, Klein DA (1999) Microbiology, 4th edn. 

McGraw-Hill, New York, USA 

Price MN, Huang KH, Alm EJ, Arkin AP (2005) A novel method 

for accurate operon predictions in all sequenced prokaryotes. 

Nucleic Acids Res 33(3):880–892 

Rappuoli R (2001) Reverse vaccinology, a genome-based approach 

to vaccine development. Vaccine 19:2688–2691 

Rendulic S, Jagtap P, Rosinus A, Eppinger M, Baar C, Lanz C, 

Keller H, Lambert C, Evans KJ, Goesmann A, Meyer F, 

Sockett RE, Schuster SC (2004) A predator unmasked: life 

cycle of Bdellovibrio bacteriovorus from a genomic perspective. 

Science 303(5658):689–692 

Reznikoff WS (1992) The lactose operon-controlling elements: 

a complex paradigm. Mol Microbiol 6(17):2419–2422 

Robbins-Manke JL, Zdraveski ZZ, Marinus M, Essigmann JM 

(2005) Analysis of global gene expression and double-strand-break 

formation in DNA adenine methyltransferase- and 

mismatch repair-deficient Escherichia coli. J Bacteriol 187 

(20):7027–7037 

Roberts RJ, Vincze T, Posfai J, Macelis D (2005) REBASE— 

restriction enzymes and DNA methyl transferases. Nucleic 

Acids Res 33:D230–D232 

Rocha EPC, Danchin A, Viari A (1999) Functional and evolutionary 

role of long repeats in prokaryotes. Res Microbiol 150:725–733 

Rogozin IB, Makarova KS, Wolf YI, Koonin EV (2004) Computational 

approaches for the analysis of gene neighbourhoods in 

prokaryotic genomes. Brief Bioinform 5(2):131–149 

Rosenfeld JA, Sarkar IN, Planet PJ, Figurski DH, DeSalle R (2004) 

ORFcurator: molecular curation of genes and gene clusters in 

prokaryotic organisms. Bioinformatics 20(18):3462–3465 

Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez- 

Solano F, Santos-Zavaleta A, Martinez-Flores I, Jimenez-Jacinto 

V, Bonavides-Martinez C, Segura-Salazar J, Martinez-Antonio 

A, Collado-Vides J (2006a) RegulonDB (version 5.0): Escherichia 

coli K-12 transcriptional regulatory network, operon 

organization, and growth conditions. Nucleic Acids Res 34 

(Database issue):D394–D397 

Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M, 

Penaloza-Spinola MI, Martinez-Antonio A, Karp PD, Collado- 

Vides J (2006b) The comprehensive updated regulatory 

network of Escherichia coli K-12. BMC Bioinformatics 7(1):5 

Sanger F, Donelson JE, Coulson AR, Kossel H, Fischer D (1973) 

Use of DNA polymerase I primed by a synthetic oligonucleotide 

to determine a nucleotide sequence in phage f1 DNA. Proc 

Natl Acad Sci U S A 70(4):1209–1213 

Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, 

Hutchison CA, Slocombe PM, Smith M (1977) Nucleotide 

sequence of bacteriophage phi X174 DNA. Nature 265 

(5596):687–695

Schmidt H, Hensel M (2004) Pathogenicity islands in bacterial 

pathogenesis. Clin Microbiol Rev 17(1):14–56 

Schneider G, Dobrindt U, Bruggemann H, Nagy G, Janke B, Blum- 

Oehler G, Buchrieser C, Gottschalk G, Emody L, Hacker J 

(2004) The pathogenicity island-associated K15 capsule determinant 

exhibits a novel genetic structure and correlates with 

virulence in uropathogenic Escherichia coli strain 536. Infect 

Immun 72(10):5993–6001 

Serruto D, Adu-Bobie J, Capecchi B, Rappuoli R, Pizza M, 

Masignani V (2004) Biotechnology and vaccines: application 

of functional genomics to Neisseria meningitidis and other 

bacterial pathogens. J Biotechnol 113:15–32 

Sharp PM, Li WH (1987) The codon adaptation index—a measure 

of directional synonymous codon usage bias, and its potential 

applications. Nucleic Acids Res 15(3):1281–1295 

Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, 

Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM 

(2005) Accurate multiplex polony sequencing of an evolved 

bacterial genome. Science 309(5741):1728–1732 

Shimizu T, Ohtani K, Hirakawa H, Ohshima K, Yamashita A, Shiba 

T, Ogasawara N, Hattori M, Kuhara S, Hayashi H (2002) 

Complete genome sequence of Clostridium perfringens, an 

anaerobic flesh-eater. Proc Natl Acad Sci U S A 99(2):996–1001 

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On 

the total number of genes and their length distribution in 

complete microbial genomes. Trends Genet 17(8):425–428 

Stahl FW, Murray NE (1966) The evolution of gene clusters and 

genetic circularity in microorganisms. Genetics 53(3):569–576 

Starlinger P, Saedler H (1976) IS-elements in microorganisms. Curr 

Top Microbiol Immunol 75:111–152 

Talarico S, Cave MD, Marrs CF, Foxman B, Zhang L, Yang Z (2005) 

Variation of the Mycobacterium tuberculosis PE_PGRS 33 gene 

among clinical isolates. J Clin Microbiol 43(10):4954–4960 

Taoka M, Yamauchi Y, Shinkawa T, Kaji H, Motohashi W, 

Nakayama H, Takahashi N, Isobe T (2004) Only a small 

subset of the horizontally transferred chromosomal genes in 

Escherichia coli are translated into proteins. Mol Cell 

Proteomics 3(8):780–787 

Tobes R, Ramos JL (2005) REP code: defining bacterial identity in 

extragenic space. Environ Microbiol 7(2):225–228 

Toh H, Weiss BL, Perkin SA, Yamashita A, Oshima K, Hattori M, 

Aksoy S (2006) Massive genome erosion and functional 

adaptations provide insights into the symbiotic lifestyle of 

Sodalis glossinidius in the tsetse host. Genome Res 16:149–156 

Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, 

Fleischmann RD, Ketchum KA, Klenk HP, Gill S, Dougherty 

BA, Nelson K, Quackenbush J, Zhou L, Kirkness EF, Peterson 

S, Loftus B, Richardson D, Dodson R, Khalak HG, Glodek A, 

McKenney K, Fitzgerald LM, Lee N, Adams MD, Hickey EK, 

Berg DE, Gocayne JD, Utterback TR, Peterson JD, Kelley JM, 

Cotton MD, Weidman JM, Fujii C, Bowman C, Watthey L, 

Wallin E, Hayes WS, Borodovsky M, Karp PD, Smith HO, 

Fraser CM, Venter JC (1997) The complete genome sequence 

of the gastric pathogen Helicobacter pylori. Nature 388 

(6642):539–547 

Torsvik V, Salte K, Sorheim R, Goksoyr J (1990) Comparison of 

phenotypic diversity and DNA heterogeneity in a population of 

soil bacteria. Appl Environ Microbiol 56:776–781 

Tringe SG, Rubin EM (2005) Metagenomics: DNA sequencing of 

environmental samples. Nat Rev Genet 6(11):805–814 

Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, 

Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, 

Banfield JF (2004) Community structure and metabolism 

through reconstruction of microbial genomes from the environment. 

Nature 428(6978):37–43 

Ussery DW, Hallin PF (2004a) Genome update: AT content in 

sequenced prokaryotic genomes. Microbiology 150(Pt 4):749–752 

Ussery DW, Hallin PF (2004b) Genome update: length distributions of 

sequenced prokaryotic genomes. Microbiology 150(Pt 3):513–516 

Ussery DW, Binnewies TT, Gouveia-Oliveira R, Jarmer H, Hallin 

PF (2004a) Genome update: DNA repeats in bacterial genomes. 

Microbiology 150(Pt 11):3519–3521 

Ussery DW, Hallin PF, Lagesen K, Coenye T (2004b) Genome 

update: rRNAs in sequenced microbial genomes. Microbiology 

150(Pt 5):1113–1115 

Ussery DW, Hallin PF, Lagesen K, Wassenaar TM (2004c) Genome 

update: tRNAs in sequenced microbial genomes. Microbiology 

150(Pt 6):1603–1606 

Ussery DW, Tindbaek N, Hallin PF (2004d) Genome update: 

promoter profiles. Microbiology 150(Pt 9):2791–2793 

Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, Cruveiller S, Lajus 

A, Pascal G, Scarpelli C, Medigue C (2006) MaGe: a microbial 

genome annotation system supported by synteny results. 

Nucleic Acids Res 34(1):53–65 

van Belkum A, Scherer S, van Alphen L, Verbrugh H (1998) Short 

sequence DNA repeats in prokaryotic genomes. Microbiol Mol 

Biol Rev 62(2):275–293 

van der Meer JR, Sentchilo V (2003) Genomic islands and the 

evolution of catabolic pathways in bacteria. Curr Opin 

Biotechnol 14:248–254 

Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong 

X, Lu P, Szafron D, Greiner R, Wishart DS (2005) BASys: a web 

server for automated bacterial genome annotation. Nucleic 

Acids Res 33(Web Server issue):W455–W459 

Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, 

Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, 

Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson 

J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, 

Rogers YH, Smith HO (2004) Environmental genome shotgun 

sequencing of the Sargasso Sea. Science 304(5667):66–74 

Vezzi A, Campanaro S, D’Angelo M, Simonato F, Vitulo N, Lauro 

FM, Cestaro A, Malacrida G, Simionati B, Cannata N, 

Romualdi C, Bartlett DH, Valle G (2005) Life at depth: 

Photobacterium profundum genome sequence and expression 

analysis. Science 307(5714):1459–1461 

Willenbrock H, Binnewies TT, Hallin PF, Ussery DW (2005) Genome 

update: 2D clustering of bacterial genomes. Microbiology 151 

(Pt 2):333–336 

Worning P, Jensen LJ, Nelson KE, Brunak S, Ussery DW (2000) 

Structural analysis of DNA sequence: evidence for lateral gene 

transfer in Thermotoga maritima. Nucleic Acids Res 28 

(3):706–709 

Worning P, Jensen LJ, Hallin PF, Stærfeldt H-H, Ussery DW (2006) 

Origin of replication in circular prokaryotic chromosomes. 

Environ Microbiol (In press) 

Yan F, Polk DB (2004) Commensal bacteria in the gut: learning who 

our friends are. Curr Opin Gastroenterol 20(6):565–571 

Zagursky RJ, Russell D (2001) Bioinformatics: use in bacterial 

vaccine discovery. Biotechniques 31:636–659 

Zhang R, Zhang CT (2004) A systematic method to identify 

genomic islands and its applications in analyzing the genomes 

of Corynebacterium glutamicum and Vibrio vulnificus CMCP6 

chromosome I. Bioinformatics 20(5):612–622 

Zheng Y, Anton BP, Roberts RJ, Kasif S (2005) Phylogenetic 

detection of conserved gene clusters in microbial genomes. 

BMC Bioinformatics 6:243 

Zubrzycki IZ (2004) Analysis of the products of genes encompassed 

by the theoretically predicted pathogenicity islands of Mycobacterium 

tuberculosis and Mycobacterium bovis. Proteins: 

Struct, Funct, Bioinf 54:563–568


2.8 Paper III: Global features of the Alcanivorax borkumensis 

SK2 genome

Environmental Microbiology (2007) doi:10.1111/j.1462-2920.2007.01483.x 

Global features of the Alcanivorax borkumensis 

SK2 genome 

Oleg N. Reva, 1,3 Peter F. Hallin, 2 Hanni Willenbrock, 2 

Thomas Sicheritz-Ponten, 2 Burkhard Tümmler 1 and 

David W. Ussery 2 

1 Klinische Forschergruppe, OE6711, Medizinische 

Hochschule Hannover, Carl-Neuberg-Strasse 1, 

D-30625 Hannover, Germany. 

2 Center for Biological Sequence Analysis, Technical 

University of Denmark, Lyngby, Denmark. 

3 Biochemistry Department, University of Pretoria, 

Lynnwood Road, Hillcrest, 0002 Pretoria, South Africa. 

Summary 

The global feature of the completely sequenced 

Alcanivorax borkumensis SK2 type strain chromosome 

is its symmetry and homogeneity. The origin 

and terminus of replication are located opposite 

to each other in the chromosome and are discerned 

with high signal-to-noise ratios by maximal oligonucleotide 

usage biases on the leading and lagging 

strand. Genomic DNA structure is rather uniform 

throughout the chromosome with respect to intrinsic 

curvature, position preference or base 

stacking energy. The orthologs and paralogs of 

A. borkumensis genes with the highest sequence 

homology were found in most cases among 

γ-Proteobacteria, with Acinetobacter and P. aeruginosa

as closest relatives. A. borkumensis shares 

a similar oligonucleotide usage and promoter 

structure with the Pseudomonadales. A comparatively 

low number of only 18 genome islands with 

atypical oligonucleotide usage was detected in the 

A. borkumensis chromosome. The gene clusters that 

confer the assimilation of aliphatic hydrocarbons are

localized in two genome islands which were probably 

acquired from an ancestor of the Yersinia lineage, 

whereas the alk genes of Pseudomonas putida still 

exhibit the typical Alcanivorax oligonucleotide signature 

indicating a complex evolution of this major 

hydrocarbonoclastic trait. 

Received 8 August, 2007; accepted 26 September, 2007. 

*For correspondence. E-mail tuemmler.burkhard@mh-hannover.de; 

Tel. (+49) 511 5322920; Fax (+49) 511 5326723. 

Introduction 

Alcanivorax borkumensis strain SK2 is a cosmopolitan 

oil-degrading oligotrophic marine γ-proteobacterium

(Yakimov et al., 1998). The SK2 strain is the paradigm for 

hydrocarbonoclastic bacteria that are specialized for 

hydrocarbon degradation but have an otherwise highly 

restricted substrate spectrum, being capable of utilizing 

only a few organic acids such as pyruvate, but not simple 

sugars, for growth (Yakimov et al., 1998; Sabirova et al., 

2006). A. borkumensis is present in low abundance in 

unpolluted environments, but it rapidly becomes the dominant 

bacterium in oil-polluted open ocean and coastal 

waters, where it can constitute 80–90% of the oil-degrading

microbial community (Harayama et al., 1999; 

Kasai et al., 2001; 2002; Syutsubo et al., 2001; Röling 

et al., 2002; Hara et al., 2003; McKew et al., 2007a,b). 

The genome of A. borkumensis was recently 

sequenced and annotated (Schneiker et al., 2006). In this 

paper, we perform a genome wide comparative genomics 

analysis and a detailed characterization of the global 

features of the A. borkumensis strain SK2 genome. This 

work on A. borkumensis strain SK2 aimed to visualize the 

prospective potential of genome linguistic approaches 

for functional and comparative analysis of bacterial 

genomes. 

Results and discussion 

© 2007 The Authors

Journal compilation © 2007 Society for Applied Microbiology and Blackwell Publishing Ltd 

DNA structure and highly expressed genes 

The genome atlas (Fig. 1) shows a combination of some 

general informative properties of the chromosome. 

These are structural features (intrinsic curvature, stacking 

energy and position preference), repeat properties (global 

direct and inverted repeats) and the main base composition 

features (GC skew and percent AT). Stacking energy 

measures helix rigidity and position preference is a 

flexibility measure (Jensen et al., 1999; Pedersen et al., 

2000). Regions that exhibit low position preference correlate 

with an enrichment of highly expressed genes (Dlakic 

et al., 2004; Willenbrock and Ussery, 2007). Examples in 

A. borkumensis are the rrn operons, the genes encoding 

ribosomal proteins and the gene cluster labelled rpoC on 

the atlas which among others encodes RNA polymerase 

subunits. Low position preference was found to correlate 

with high codon adaptation indices as the common


Fig. 1. Genome Atlas of A. borkumensis SK2 showing different structural parameters and the distribution of global repeats, GC skew and 

A + T contents. Colour intensity increases with the deviation from the average. Values close to the average are shaded very light grey; values 

with more than 3 standard deviations from the average are most strongly coloured. 

measure for highly expressed genes (Willenbrock et al., 

2006) indicating that the local DNA structure is an important 

determinant of codon usage and gene expression. 

Moreover, intrinsic curvature is often encountered 

upstream of highly expressed genes (Skovgaard et al., 

2002) which correlates well with the fact that promoter 

DNA tends to be more curved than DNA in coding regions 

(Pedersen et al., 2000). 

The chromosome is rather homogeneous in all analysed 

structural features. The number of repeats is low, and 

the terminus of replication is opposite to the origin of 

replication as indicated by GC skew (Ussery et al., 2002). 

The three rRNA operons organized in the order 

16S-23S-5S are located in three areas with low position 

preference (green marks in the 3rd circle) and possible 

upstream regions with high intrinsic curvature (blue in the 

1st circle) near 0.4–0.5 Mb (two regions) and

2.25 Mb (one region).

Phylogenomics by sequence homology 

The genome of A. borkumensis was compared with existing 

sequence information in other Proteobacteria by constructing

phylogenetic trees for each amino acid

sequence and organisms for which a similar gene existed. 

By extracting the phylogenomic information of the resulting 

1919 phylogenetic trees a phylome atlas could be 

constructed (Fig. 2). In most cases the orthologs and 

paralogs with the highest sequence homology were found 

among γ-Proteobacteria. A substantial proportion of

A. borkumensis genes had their closest homologues in

α- and β-Proteobacteria, but no closest homologue was

detected in δ- and ε-Proteobacteria. Inspection of the collected

phylogenetic connections revealed that the 

most closely related organisms are Acinetobacter sp. 

and Pseudomonas aeruginosa, although in trees where 

both Pseudomonas and Acinetobacter are present, 

A. borkumensis tends to cluster more often with the latter 

one. No obvious horizontal gene transfers seem to have 

taken place. Regions around 350 000 and 450 000 bp are

very ‘pure’ γ-proteobacterial regions.

Genome analysis of oligonucleotide usage 

Oligonucleotide usage (OU) has been shown to be a 

genome specific signature (Pride et al., 2003; Reva and 



Tümmler, 2004). Genomic regions termed the ‘core 

sequences’ are characterized by OU patterns being 

similar to the global pattern of the chromosome. However, 

many loci with alternative OU patterns typically make up

more than 10% of a bacterial genome in total. These

loci with atypical OU patterns comprise heterogeneous 

subsets of parasitic and recent foreign DNA, ancient 

genes for ribosomal constituents (RNAs and proteins), 

multidomain genes and non-coding sequences with multiple 

tandem repeats (Reva and Tümmler, 2005). Hence 

laterally transferred gene islands can be reliably identified 

in complete genomes by their atypical oligonucleotide 

usage (Reva and Tümmler, 2005; Chen et al., 2007; 

Klockgether et al., 2007). Here, we focused on tetranucleotide 

usage (TU) parameters because the 256 different 

tetranucleotide words are optimal to differentiate bacterial 

genome sequences by the frequency and informativeness 

of the individual element. TU patterns represent the deviations 

of tetranucleotide word counts in a given sequence 

from an equiprobable distribution. Selection and counterselection 

of the oligonucleotide words are driven by their 


Fig. 2. Phylome Atlas of A. borkumensis SK2 genes indicating their closest bacterial homologues. Each of the concentric circles represents a 

taxonomic group as described in the figure legend on the right, with the outermost circle corresponding to the top-most feature, and the 

innermost circle corresponding to the bottom-most feature. Light bands indicate A. borkumensis SK2 genes with no homologue in the 

respective taxonomic group. 

stereochemical properties such as base stacking energy, 

propeller twist angle, protein deformability, bendability 

and position preference (Reva and Tümmler, 2004). By 

permutation analysis, the 256 tetranucleotides were 

assigned to 39 equivalence classes, each of which is characterized

by the same values for the five properties mentioned 

above (Baldi and Baisnee, 2000). Words of the 

same equivalence class tend to occur at similar frequencies 

in a nucleotide sequence (Reva and Tümmler, 2004). 

Oligonucleotide usage conservation reflects to some 

extent the phylogeny of microorganisms (Pride et al., 

2003; Teeling et al., 2004). 

Phylogenomics by tetranucleotide usage analysis 

TU patterns were calculated for all sequenced genomes 

of γ-Proteobacteria. Four examples of TU patterns determined

for A. borkumensis SK2, Pseudomonas putida 

KT2440, Escherichia coli K-12 and Shewanella oneidensis 

MR-1 are shown in Fig. 3. Tetranucleotide words were 

grouped by the equivalence classes and sorted in order of 




decrease of the base stacking energy. Figure 4 visualizes 

the phylogenetic relationships differentiated by TU patterns 

of 29 γ-Proteobacterial taxa, each of which is represented

by not more than a single sequenced strain. 

A. borkumensis forms a cluster with Pseudomonas, 

Methylococcus, Xanthomonas and Xylella (Fig. 4). 

Despite the variation in GC-content, from 52 to 54% in 

Xylella and Alcanivorax to more than 65% in Xanthomonas 

and Pseudomonas, the TU patterns of these 

Fig. 3. Tetranucleotide usage patterns of 

A. borkumensis SK2, P. putida KT2440, E. coli 

K12 MG1655 and S. oneidensis MR-1. The 

deviation Δw of observed from expected

counts is shown for all 256 tetranucleotide

words (16 × 16 cells) by colour code (right

bar). Tetranucleotides are grouped into 39 

classes of equivalent structural features (Baldi 

and Baisnee, 2000) and sorted by decreasing 

base stacking energy row-by-row starting at 

the upper left corner (class 39). The words 

corresponding to the cells in colour plots are 

shown in the table in lower part of the figure. 

microorganisms are similar and separated from other 

γ-Proteobacteria. There is an abundance of GC-rich tetranucleotides

with high base stacking energy in the 

sequence of A. borkumensis SK2 (words belonging to 

equivalence classes 37–39, 30 and 27) that is similar to 

the TU pattern of P. putida KT2440 (Fig. 3). Words of the 

AT-rich classes 7, 10, 13 and 32 are significantly underrepresented 

in both species. The major difference 

between TU patterns is the abundance of poly A and poly 



T stretches (words of class 1) in A. borkumensis in correspondence 

with its lower GC-content of 54.7%. Although 

E. coli and S. oneidensis share a similar GC content with

A. borkumensis, their tetranucleotide usage is different

from Alcanivorax. The parity of GC with AT in the genome 

correlates with a balanced use of GC-rich and AT-rich 

words with high and low base stacking energy. In contrast, 

words with intermediate values of the base stacking 

energy (classes 25, 31, 36 and 29) are mostly underrepresented 

(Fig. 3). The data suggests that oligonucleotide 

usage drives GC-content and not vice versa. To give 

another example: the GC-rich words of class 21 are 

rare in all γ-Proteobacteria irrespective of their

GC-content (Fig. 3), but these words are overrepresented

in α-Proteobacteria (Agrobacterium, Bordetella, Caulobacter,

Rhizobium). 

Anomalous local TU patterns in the 

A. borkumensis genome 

A. borkumensis shares a common taxonomic group 

with Pseudomonas, Methylococcus, Xanthomonas and 

Xylella. Although the TU patterns are genome specific 

signatures, the oligonucleotide usage may vary locally in 

segments made up by horizontally acquired elements, 

phylogenetically ancient genes such as rRNAs or genes 


Fig. 4. Tree of the similarity of TU patterns of 

completely sequenced γ-Proteobacteria

strains. Distance D-values (see Experimental 

procedures) between two TU patterns were 

calculated, and the tree was constructed from 

the distance matrix of all D-values by the 

minimum evolution neighbour-joining method 

(Saitou and Nei, 1987). 

with peculiar codon usage (Reva and Tümmler, 2004; 

2005). In other words, anomalous local TU patterns can 

be expected for the most recent and the most ancient 

genes. Local TU patterns were calculated in 8 kbp long 

overlapping sliding windows in steps of 2 kbp. Distances 

D between local and global TU patterns are shown in 

Fig. 5. The 18 regions with D-values above the 95% confidence 

interval are listed in Table 1. 
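The window scan described above can be sketched as follows. This is a minimal illustration only, assuming a linear treatment of the sequence (no wrap-around at the replication origin) and a caller-supplied scoring function; the names `sliding_windows` and `scan` are hypothetical and do not come from the original analysis.

```python
def sliding_windows(length, window=8000, step=2000):
    """Yield (start, end) coordinates of overlapping windows along a sequence.

    Defaults follow the text: 8 kbp windows moved in steps of 2 kbp.
    """
    for start in range(0, length - window + 1, step):
        yield start, start + window


def scan(seq, score, window=8000, step=2000):
    """Apply a scoring function to every window and return (start, end, score)."""
    return [
        (start, end, score(seq[start:end]))
        for start, end in sliding_windows(len(seq), window, step)
    ]
```

In an analysis like the one described here, `score` would return the distance D between the local TU pattern of the window and the global pattern of the chromosome.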

Three clusters with anomalous D-values encode ribosomal 

RNAs that belong to the most ancient and conserved 

elements of all bacterial genomes. All the other 15 

regions with atypical TU most likely were recently 

acquired, three of which contain transposase genes. 

In total 11 transposases were annotated in the 

A. borkumensis SK2 genome but for five of them no significant 

deviations of the local TU patterns were detected 

in adjacent regions. If inserted mobile elements have lost

their mobility due to disruptive mutations, they undergo an 

amelioration process smoothing the differences in oligonucleotide 

usage between inserts and the host genome 

and thus cannot be detected by anomalous TU patterns 

anymore (Pride et al., 2003). 

Five regions with high D-values (Fig. 5) only encode 

hypothetical proteins (Table 1). One further region contains 

genes of the type II secretion system and two 

regions encode type IV pili biogenesis proteins the latter 




of which are known to have spread among proteobacteria 

by horizontal transfer with the original codon usage and 

GC content being retained (Spangenberg et al., 1997). 

The most extended region with high D-values encodes 

a cluster of genes for glycosyltransferases and polysaccharide 

biosynthesis proteins (Abo_858-Abo_880: 

1 018 000–1 060 000 bp) characterized by the second 

largest D-value and low GC-content (minimum 45% GC). 

The region terminates abruptly after Abo_880 at an Asn-tRNA

gene. The TU pattern of the locus was compared 

with those of 177 sequenced bacterial chromosomes, 316 

plasmids and 104 phages (Reva and Tümmler, 2004). 

The pattern was distant from all analysed sequences. The 

best hit of D = 34.9% was observed for the 5833 bp large 

bacteriophage Pf3 that infects P. aeruginosa harbouring 

the RP1 plasmid (Luiten et al., 1985). A stretch of 1550 bp 

Table 1. Chromosomal regions of A. borkumensis with atypical TU patterns. 

Coordinates 

Left Right 

D a (%) Annotation 

Fig. 5. Deviations of TU patterns in local 

regions of A. borkumensis SK2 chromosome. 

Local TU patterns were determined in 8 kbp 

sliding window in steps of 2 kbp. D, the 

distance between local and chromosomal

tetranucleotide patterns as defined in

Experimental procedures, is plotted versus

the coordinates of the chromosome starting

from the putative replication origin. The upper

border of the 95% confidence interval of 

D-values is shown by the horizontal line. 

upstream of the tRNA gene is 48% identical in nucleotide 

sequence with the Pf3 sequence (2344-4078 bp). 

According to this in silico finding we propose that this 

gene island was captured from a phage that typically 

targets the 3′-end of a tRNA gene (Dobrindt et al.,

2004). 

The alkB genes encoding the degradation of alkanes 

which is the prominent name-giving feature of the taxon 

Alcanivorax, are located in two islands (Schneiker et al., 

2006) with anomalous TU patterns (Table 1). Very close 

homologues were identified in marine bacteria and 

Pseudomonas species (Schneiker et al., 2006). The 

alkane hydroxylase gene cluster is widely distributed 

among hydrocarbon-utilizing γ-Proteobacteria due to its

possible horizontal transfer (van Beilen et al., 2001; 

2004). The role of these genes in the degradation of 

Left (bp) | Right (bp) | D^a (%) | Annotation
126 000 | 140 000 | 42.20 | Abo_114–120: lysR transcriptional regulator, haloacid dehalogenase hydrolase, amiC amidase, gntR transcriptional regulator, alkB2 alkane monooxygenase, type I pili biogenesis proteins
190 000 | 198 000 | 40.47 | Abo_172–178: ilvD-1 dihydroxy-acid dehydratase, conserved hypothetical proteins, long-chain-fatty-acid-CoA ligase, acyl-CoA dehydrogenases
234 000 | 245 000 | 47.95 | Abo_209–214: conserved hypothetical proteins, transposase, type II secretion system proteins
400 000 | 408 000 | 49.42 | first operon for rRNAs
502 000 | 510 000 | 46.26 | Abo_439–446: ispA lipoprotein signal peptidase, fkpB peptidyl-prolyl cis-trans isomerase, ispH hydroxymethylbutenyl pyrophosphate reductase, type IV pili biogenesis proteins, conserved hypothetical proteins
526 000 | 534 000 | 43.41 | second operon for rRNAs
670 000 | 678 000 | 40.29 | Abo_581–583: type IV pili biogenesis proteins
792 000 | 800 000 | 43.00 | Abo_2680–2681: hypothetical proteins
1 020 000 | 1 056 000 | 50.43 | Abo_859–878: polysaccharide biosynthesis proteins
1 742 000 | 1 750 000 | 40.88 | Abo_1439: periplasmic binding domain/transglycosylase SLT domain fusion
1 892 000 | 1 900 000 | 46.32 | Abo_2841–2847: hypothetical proteins
2 026 000 | 2 034 000 | 41.90 | Abo_1668–1671: conserved hypothetical proteins, 3 transposases, siderophore biosynthesis protein, glycosyl transferase
2 088 000 | 2 096 000 | 40.65 | Abo_1707–1708: conserved hypothetical proteins
2 146 000 | 2 154 000 | 47.05 | Abo_2897–2905: iscA iron-binding protein IscA, metal-sulfur cluster biosynthetic enzyme, sufE Fe-S metabolism associated domain protein, iscS cysteine desulfurase, rrf2 family protein, hypothetical proteins, SIR2-like transcriptional silencer
2 254 000 | 2 262 000 | 49.71 | third operon for rRNAs
2 364 000 | 2 372 000 | 52.56 | Abo_1942: penicillin-binding protein, hypothetical proteins, 2 transposases
2 632 000 | 2 640 000 | 40.17 | Abo_2979–2984: hypothetical proteins
3 060 000 | 3 076 000 | 42.94 | Abo_2516–3066: Na+/H+ antiporter, alkS alkB1GHJ regulator, alkB1 alkane monooxygenase, alkG rubredoxin, aldH aldehyde dehydrogenase, hypothetical proteins

a. D, distance between local and chromosomal TU patterns as defined in Experimental procedures.



short-chain n-alkanes by A. borkumensis SK2 and AP1 

was experimentally proven (Smits et al., 2002; Hara et al., 

2004; Sabirova et al., 2006). Interestingly, the two regions 

comprising alkS, alkB1, alkG and aldH alkane-degradation

genes, and alkB2 and transcriptional

regulators, respectively (Table 1), are as similar to each 

other in their TU patterns (D = 34.3%) as each of them 

is to Yersinia pestis (D = 32.2% for alkB1, D = 33.4% 

for alkB2), Yersinia enterocolitica (D = 29.5% for alkB1, 

D = 34.4% for alkB2) and Shewanella oneidensis MR-1 

(D = 32.5% for alkB1, D = 42.4% for alkB2). This data 

suggests that the alkB1 and alkB2 genes were delivered 

to A. borkumensis from an ancestor of the Yersinia 

lineage. The AlkB1 amino acid sequences of A. borkumensis 

strains AP1 and SK2 are highly homologous to 

that of P. putida strains P1 and GPO1 (van Beilen et al., 

2001; 2004; Smits et al., 2002; Hara et al., 2004), but their 

TU patterns are not that similar (D = 37.1%). Surprisingly,

the TU pattern of the alkB cluster of P. putida 

is significantly more similar to the global TU pattern of

the whole A. borkumensis chromosome (16.7%, strain 

GPO1, 19%, strain P1), but more distant from the 

P. putida KT2440 chromosome (30.1% and 30.3%). 

D-values of 17 or 19% are within the first quartile (0–26%) 

far below the median value of 28.4% for local TU patterns 

of the A. borkumensis chromosome (Fig. 5) indicating 

that the P. putida alkB gene behaves as if it were part of

the Alcanivorax core genome. We note the striking phenomenon 

that there was convergent evolution of the

coding sequence of the catabolic alk transposon in

Alcanivorax and Pseudomonas, but that the genes

retained the oligonucleotide signature of their donors,

most likely Alcanivorax for Pseudomonas and Yersinia-like

organisms for Alcanivorax.


Origin of replication 

The GC skew plotted in the seventh circle of the genome 

atlas (Fig. 1) reflects a general bias of purines towards the 

leading strand of DNA replication, however, it has almost 

no correlation to the structural properties of DNA 

(Skovgaard et al., 2002). The GC skew is often useful 

when locating the origin and terminus of replication 

(Jensen et al., 1999). 
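As an illustration of how GC skew can point to the origin and terminus, the sketch below computes the skew (G − C)/(G + C) in consecutive windows and accumulates it along the sequence. It relies on the widely used heuristic that the cumulative skew tends to reach its minimum near the origin and its maximum near the terminus; this is an assumption of the sketch and is not the oligonucleotide-based origin plot of Worning and colleagues used in this paper. The function names and the 1 kbp window are hypothetical.

```python
def gc_skew(window_seq):
    """GC skew (G - C) / (G + C) of one window; 0.0 if the window has no G or C."""
    g = window_seq.count("G")
    c = window_seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def cumulative_gc_skew(seq, window=1000):
    """Return (window_start, cumulative_skew) for consecutive non-overlapping windows."""
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq) - window + 1, window):
        total += gc_skew(seq[start:start + window])
        curve.append((start, total))
    return curve


def skew_extremes(seq, window=1000):
    """Window starts at the minimum and maximum of the cumulative skew curve.

    Under the heuristic named above these are candidate origin and terminus
    positions, respectively.
    """
    curve = cumulative_gc_skew(seq, window)
    min_pos = min(curve, key=lambda point: point[1])[0]
    max_pos = max(curve, key=lambda point: point[1])[0]
    return min_pos, max_pos
```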

The circle is blue on the right side and purple on the left 

side. The two big gaps of colours in the top and in the 

bottom of the circle may be the origin and the terminus of 

replication. This may also be visualized more clearly in the 

origin plot (Fig. 6) (Worning et al., 2006). Here, the difference 

between hypothetical leading and lagging strand is 

plotted (red) for various positions on the chromosome. 

The peaks indicating maximal oligonucleotide skew correspond 

to origin and terminus. The terminus was identified 

as the peak showing low G/C weighted strand bias

at position 1 502 000 bp. The origin was identified as the

other peak at position 3 118 000 bp. The signal-to-noise ratio of

14.0 was among the top 10% of sequenced Proteobacteria, 

indicating a big difference between leading and 

lagging strand making the prediction of origin very 

confident. 

Structural analysis of promoter regions 

Structural features of the genomic DNA may indicate promoter 

regions, as promoters normally have high curvature, 

melt easily and are more rigid. The DNA structural 

parameters mentioned earlier (position preference, stacking 

energy, and intrinsic curvature) together with AT 

content and DNAse sensitivity (Brukner et al., 1995) were 

Fig. 6. Localization of the origin and the 

terminus of replication in the A. borkumensis 

SK2 chromosome derived from strand bias 

curves: the median oligonucleotide skew 

curve (red), the GC weighted median (green) 

and the AT weighted median (blue) (Worning 

et al., 2006). 




compiled into a structural profile of all upstream regions of 

A. borkumensis (see section Experimental procedures). 

The profile uses z-scores to measure how the average

value of each property varies from −400 bp to +400 bp

around the translation start (Fig. 7). A. borkumensis has

a coding density of only 87%, which leaves wider intergenic

regions, and this appears to give rise to a

larger and wider peak of curvature, stacking energy and 

AT content (Fig. 7A). For comparison we also analysed 

the promoter profile of another ocean bacterium, Candidatus 

Pelagibacter ubique HTCC1062 (Giovannoni et al., 

2005), an example of a highly streamlined genome with a 

coding density of 96%. Here we observed a much weaker 

curvature signal, and the distribution of stacking energy 

and AT content was more narrow and had higher maxima 

(Fig. 7B). 
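The construction of such a profile can be sketched as follows: a per-base property is averaged over all genes at each offset from the translation start, and the averages are expressed as z-scores against the mean and standard deviation of the whole sequence. This is a simplified illustration, assuming forward-strand genes only, circular coordinates and AT content as the single property; the function names are hypothetical, and smoothing is omitted.

```python
from statistics import mean, pstdev


def at_track(seq):
    """Per-base AT indicator: 1.0 for A or T, 0.0 otherwise."""
    return [1.0 if base in "AT" else 0.0 for base in seq.upper()]


def start_profile(track, starts, flank=400):
    """Average a per-base track from -flank to +flank around each start site.

    Averages are converted to z-scores using the mean and standard deviation
    of the entire track. Coordinates wrap around (circular sequence).
    """
    mu, sigma = mean(track), pstdev(track)
    n = len(track)
    profile = []
    for offset in range(-flank, flank + 1):
        avg = mean([track[(start + offset) % n] for start in starts])
        profile.append((avg - mu) / sigma if sigma else 0.0)
    return profile
```

A real analysis would also handle genes on the reverse strand by reading the track in the opposite direction, and would repeat the calculation for each structural property.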

Next, the probability of opening during stress-induced 

DNA duplex destabilization was computed by using the 

program SIDD (Wang et al., 2004), covering five different 

values of the super-helical density σ = {−0.025, −0.035,

−0.045, −0.055, −0.065}. As super-coiling becomes stronger,

i.e. at lower (more negative) super-helical densities, the

probability of opening increases in A. borkumensis (Fig. 7C). In

contrast, a narrower SIDD profile that exhibits only 

minor dependence on super-helical density (Fig. 7D), 

was calculated for the Candidatus Pelagibacter ubique 

HTCC1062 genome. 

The structural profile for the promoter regions of 

A. borkumensis was compared with that of closely related 

species as found above (see Fig. 4). Generally, it looked 

more like the promoter profile of members of the 

Pseudomonadales than the general comparison organism, 

E. coli. Moreover, the promoter profile was very different 

compared with the promoter profile of X. fastidiosa 

strains, even though they were very similar with regard

to their TU profile (see Fig. 4). The promoter profiles for 

the above mentioned organisms may be found at our 

website (http://www.cbs.dtu.dk/services/GenomeAtlas/). 

Amino acid and codon usage 

We have examined the codon and amino acid usage of 

A. borkumensis and compared this with both the usage of 

bacteria in general and of 16 oceanic bacteria (Entrez 

project IDs 230, 10 645, 12 530, 13 233, 13 239, 13 282, 

13 642, 13 643, 13 654, 13 655, 13 902, 13 906, 13 910, 

13 911, 13 989, 15 660) (Willenbrock et al., 2006). In

Fig. 8, the codon usage plot of A. borkumensis is 

superimposed on the cumulative plot of all completely 

sequenced bacteria in public databases (N = 518, 

Fig. 8A) or of that of 16 oceanic bacteria (Fig. 8B). 

A few codons are differentially utilized in A. borkumensis 

(GUC, CUG), but all values are within the range of three 

standard deviations. In other words, codon usage of 

A. borkumensis resides within the typical range of 

eubacteria. 

Interestingly, the sequenced oceanic bacteria share a 

very similar amino acid usage (Fig. 8D), whereas broad 

variations thereof were noted amongst all sequenced 

bacteria that represent the whole spectrum of habitats 

(Fig. 8C). A. borkumensis roughly follows the profile of the 

oceanic bacteria, although cysteine, tryptophan, leucine, 

proline, arginine, serine are under-utilized, and glutamic 

acid, lysine, phenylalanine, histidine, methionine, and 

tyrosine are over-utilized – all exceeding the three-standard-deviation

boundaries.

Conclusion 

Fig. 7. Profile of structural properties of 

promoter regions (A and B) and probabilities 

of opening during stress-induced DNA duplex 

destabilization at various super-helical 

densities (C and D) in the A. borkumensis 

SK2 (A and C) and Candidatus Pelagibacter 

ubique HTCC1062 (B and D) chromosomes. 

Each annotated gene was aligned at the 

translation start site and the average values 

for the SIDD probabilities, AT-content, position 

preference, stacking energy, intrinsic 

curvature and DNase sensitivity were 

calculated at each position in the alignment. 

The values were subsequently converted into 

z-scores, using the average and standard 

deviation of the entire chromosome. Values 

are smoothed over a 5 bp window. 

Inspection of the collected phylogenetic connections 

revealed that the most closely related organisms are 

Acinetobacter sp. and Pseudomonas aeruginosa, 



although in trees where both Pseudomonas and Acinetobacter 

are present, A. borkumensis tends to cluster more 

often with the latter one. 

The major structural feature of the A. borkumensis 

chromosome is its symmetry and homogeneity. The 

genome contains only very few regions with extraordinarily 

low or high curvature, position preference or base 

stacking energy. The chromosomal frame is symmetric: 

The origin and the terminus of replication are located 

opposite to each other in the chromosome and are clearly 

discerned by maxima of oligonucleotide usage biases 

between leading and lagging strand. 

The genetic repertoire of A. borkumensis is most similar 

to that of Acinetobacter and P. aeruginosa. Moreover, 


Fig. 8. Codon usage (A and B) and amino acid usage (C and D) of A. borkumensis SK2 compared with those of 518 completely sequenced 

bacteria (A and C) or with those of 16 sequenced oceanic bacteria (B and D). Frequencies of amino acids and codons were counted for each

genome and normalized. Mean value (grey line) and three standard deviations (grey solid area) represent the global usage of individual 

codons (A and B) and amino acids (C and D) in the 518 (A and C) or 16 (B and D) reference genomes. The red line (A and B) shows the 

codon usage and the blue line (C and D) shows the amino acid usage of A. borkumensis. 

A. borkumensis shares a similar oligonucleotide usage 

with the Xanthomonadales and Pseudomonadales indicating 

close phylogenetic relationships with these orders 

in accordance with 16S rDNA sequence relatedness 

(Schneiker et al., 2006). Amongst this subgroup of completely 

sequenced genomes, the A. borkumensis chromosome 

harbours the relatively lowest number of genome 

islands with atypical tetranucleotide usage. P. putida 

KT2440, for example, carries threefold more islands per 

Megabase in its chromosome (Weinel et al., 2002). Interestingly, 

one of the three enzyme systems that are 

upregulated in alkane-grown cells (Sabirova et al., 2006), 

the well-known alkB1 cluster, is encoded by genome 

islands. The molecular evolution of the alk genes that are 




encoded by a catabolic transposon (van Beilen et al., 

2001) is remarkable: the Alcanivorax genes were probably 

acquired from the Yersinia lineage, whereas the 

P. putida genes exhibit the typical Alcanivorax tetranucleotide 

signature. Horizontal gene transfer was relevant to 

confer the – probably – most important metabolic trait to 

A. borkumensis, but otherwise the stable seawater habitat 

apparently did not favour the shuffling and exchange 

of genes with other taxa. Instead a symmetric and 

structurally homogeneous chromosome evolved that 

lacks numerous metabolic traits (Yakimov et al., 1998; 

Schneiker et al., 2006) found in their versatile Pseudomonas 

relatives which are endowed with twofold larger chromosomes 

(Stover et al., 2000; Nelson et al., 2002). 

Experimental procedures 

Genomic sequence 

The comparative genomics analyses were based on the 

genomic sequence of A. borkumensis SK2 (Golyshin et al., 

2003) and its annotation (Schneiker et al., 2006). 

Atlas visualization 

Atlases, developed in house, make it possible to visualize 

correlations between position dependent information contained 

within a chromosome. Circular graphical representations 

of the entire A. borkumensis genome were created 

using the atlas visualization tool, GeneWiz. Each feature, 

such as AT content is represented by a separate circle in the 

atlas. Typically, mean values are pictured in grey and extreme 

values are highlighted in a user defined colour (Pedersen 

et al., 2000). 

Phylome atlas. For each amino acid sequence, phylogenetic 

trees were automatically constructed as described in 

Sicheritz-Ponten and Andersson (2001). The phylogenomic 

information of the resulting 1919 phylogenetic trees was 

extracted and analysed in the PyPhy system. 

Genome atlas. The genome atlas is a combination of some 

general informative properties. These are some structural 

features (intrinsic curvature, stacking energy and position 

preference), some repeat properties (global direct and 

inverted repeats) and the main base composition features 

(GC skew and percent AT). 

Intrinsic curvature was calculated using the CURVATURE 

software (Shpigelman et al., 1993). Stacking energy of a 

DNA segment was determined by the method of Ornstein and 

colleagues (1978). Position preference was based on a trinucleotide 

model that estimates the helix flexibility (Satchwell 

et al., 1986). Base composition is generally divided into AT 

content and GC skews. Both were calculated from the nucleotide 

sequence. Global direct and inverted repeats were 

found using variations of an algorithm that finds the highest 

degree of homology for a 15 bp repeat within a window of 

length 100 bp (Jensen et al., 1999). 

Codon and amino acid usage 

Codon and amino acid usage were calculated from all coding 

regions in the genome as annotated in the GenBank entries. 

The relative synonymous codon usage was calculated by 

comparing the codon distribution from a set of highly 

expressed genes with a background distribution estimated 

from the codon usage of all coding regions in the genome 

(Willenbrock et al., 2006). In order to identify a set of constitutively 

highly expressed genes in A. borkumensis, the reference 

set of 27 very highly expressed Escherichia coli genes 

originally compiled by Sharp and Li (1986) was aligned at the 

protein level against all genes annotated in the GenBank 

entry using BLASTP version 2.2.9 (Altschul et al., 1997). For 

each of these very highly expressed genes, the gene with the 

best alignment was added to a set of very highly expressed 

genes if it had an E-value below 10^-6.

TU patterns 

Overlapping tetranucleotide words were counted in the bacterial 

nucleotide sequences by shifting the window in steps of 

1 nucleotide. The total word number in a circular sequence 

equals the sequence length. The observed counts of words

(Co) were compared with the expected counts of words (Ce). 

Assuming the same distribution frequency for all words irrespective 

of their composition and sequence mononucleotide 

content, Ce matches the ratio of the sequence length to the 

number of different tetranucleotide words Nw (256 for 

tetranucleotides). 

The deviation Δw of observed from expected counts is
given by

$$\Delta_w = (C_o - C_e) \times C_o^{-1}$$

For the comparison of sequences by TU patterns, the words
in each sequence were ranked by Δw values. Rank numbers

instead of word counts were used to simplify pattern comparison 

and to remove sequence length bias. 

The distance D between two patterns was calculated as 

the sum of absolute distances between ranks of identical 

words in patterns i and j as follows and expressed as a 

percent of the possible maximal distance: 

$$D(\%) = 100 \times \frac{\sum_{w} \left| \mathrm{rank}_{w,i} - \mathrm{rank}_{w,j} \right|}{D_{\max}}$$

where

$$D_{\max} = \frac{N_w (N_w - 1)}{2}$$

Dmax is the maximal distance that is theoretically possible 

between two patterns. For TU patterns Nw is 256. For more 

information about methods of oligonucleotide usage statistics 

see Reva and Tümmler (2004; 2005). 
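The TU pattern comparison described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, words that are never observed are given a deviation of minus infinity (an assumption, since the formula divides by Co), and ties in rank are broken alphabetically (also an assumption). Dmax is taken exactly as given in the text.

```python
from itertools import product

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]  # Nw = 256


def tetra_counts(seq):
    """Count overlapping tetranucleotides in a circular sequence (step of 1 nt)."""
    seq = seq.upper()
    ext = seq + seq[:3]  # wrap around so the number of words equals len(seq)
    counts = dict.fromkeys(WORDS, 0)
    for i in range(len(seq)):
        word = ext[i:i + 4]
        if word in counts:  # words containing ambiguous bases are skipped
            counts[word] += 1
    return counts


def deviations(counts, seq_len):
    """Deviation (Co - Ce) / Co per word, with Ce = sequence length / Nw."""
    ce = seq_len / len(WORDS)
    # Assumption: unobserved words (Co = 0) get -inf so that they rank last.
    return {w: (co - ce) / co if co else float("-inf") for w, co in counts.items()}


def ranks(dev):
    """Rank 1 is the word with the largest deviation; ties broken alphabetically."""
    ordered = sorted(WORDS, key=lambda w: (-dev[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


def pattern_distance(rank_i, rank_j):
    """Sum of absolute rank differences as a percentage of Dmax = Nw(Nw - 1)/2."""
    nw = len(WORDS)
    d_max = nw * (nw - 1) / 2
    total = sum(abs(rank_i[w] - rank_j[w]) for w in WORDS)
    return 100.0 * total / d_max


seq = "ACGTACGT"
counts = tetra_counts(seq)
pattern = ranks(deviations(counts, len(seq)))
```

Because only ranks enter the distance, two sequences of very different length can be compared directly, which is the length-bias argument made in the text.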

Origin plot 

The origin plot was constructed as described in Worning 

and colleagues (2006). In brief, the difference between a 

hypothetical leading and lagging strand is plotted for various 

positions on the chromosome. The frequencies of all 



oligonucleotides from 2-mers to 8-mers on the leading and 

lagging strands in a 60% window were counted and the information

content was calculated and summarized over all 

oligos for every putative origin. The G/C and A/T weighted 

strand bias were included to distinguish between origin and 

terminus. 

Structural profile of the promoter region 

Each annotated gene was aligned at the translation start site 

and the average values for five DNA structural features 

(AT content, position preference, stacking energy, intrinsic 

curvature, DNase sensitivity; see chapter on Genome Atlas) 

were calculated at each position in the alignment. The values 

were subsequently centered and scaled and smoothed within

a 5 bp window using Gaussian smoothing. 
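The centring, scaling and smoothing step can be sketched as below. The text gives only the window width (5 bp); the standard deviation of the Gaussian kernel (`sigma`), the use of the sample standard deviation for scaling, and the renormalisation of the kernel at the ends of the profile are assumptions made for this illustration.

```python
import math


def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return [0.0] * n  # a flat profile carries no signal to scale
    return [(v - mean) / sd for v in values]


def gaussian_smooth(values, window=5, sigma=1.0):
    """Weighted moving average with a Gaussian kernel of `window` positions."""
    half = window // 2
    offsets = range(-half, half + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):  # truncate the kernel at the profile ends
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed


profile = gaussian_smooth(center_and_scale([0.2, 0.4, 0.9, 0.5, 0.3, 0.1, 0.4]))
```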


The analysis has been performed within the frame of the 

‘Task Force Genome Linguistics’ of the competence 

network ‘Genome Research on Bacteria Relevant for Agriculture, 

Environment and Biotechnology’ funded by the 

Federal Ministry of Education and Research (BMBF), 

Germany (Contracts 031U213D and 031U113D). We thank 

Peter Golyshin, Vitor Martins dos Santos and Kenneth N. 

Timmis, Helmholtz Center for Infection Research, Braunschweig,

for stimulating discussions during the initiation of the 

study and Olaf Kaiser, Lehrstuhl für Genetik, Universität 

Bielefeld, for the provision of sequence data at an early 

stage of the sequencing project. O.R. has been a recipient 

of a postdoctoral stipend of the DFG-sponsored International 

Training Group ‘Pseudomonas: Pathogenicity and 

Biotechnology’. 

References 

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., 

Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped 

BLAST and PSI-BLAST: a new generation of protein 

database search programs. Nucleic Acids Res 25: 3389– 

3402. 

Baldi, P., and Baisnee, P.F. (2000) Sequence analysis by 

additive scales: DNA structure for sequences and repeats 

of all lengths. Bioinformatics 16: 865–889. 

van Beilen, J.B., Panke, S., Lucchini, S., Franchini, A.G., 

Rothlisberger, M., and Witholt, B. (2001) Analysis of 

Pseudomonas putida alkane-degradation gene clusters 

and flanking insertion sequences: evolution and regulation 

of the alk genes. Microbiology 147: 1621–1630. 

van Beilen, J.B., Marin, M.M., Smits, T.H.M., Röthlisberger, 

M., Franchini, A.G., Witholt, B., and Rojo, F. (2004) 

Characterization of two alkane hydroxylase genes from 

the marine hydrocarbonoclastic bacterium Alcanivorax 

borkumensis. Environ Microbiol 6: 264–273. 

Brukner, I., Sanchez, R., Suck, D., and Pongor, S. (1995) 

Sequence-dependent bending propensity of DNA as 

revealed by DNase I: parameters for trinucleotides. EMBO 

J 14: 1812–1818. 


Chen, X.-H., Koumoutsi, A., Scholz, R., Eisenreich, A., 

Schneider, K., Schneider, I., et al. (2007) Comparative 

analysis of the complete genome sequence of the plant 

growth promoting Bacillus amyloliquefaciens FZB42. 

Nat Biotechnol 25: 1007–1014. 

Dlakic, M., Ussery, D., and Brunak, S. (2004) DNA bendability 

and nucleosome positioning in transcriptional 

regulation. In DNA Conformation and Transcription. 

Ohyama, T. (ed.). Austin, TX: Landes Bioscience, pp. 198– 

211. 

Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J. 

(2004) Genomic islands in pathogenic and environmental 

microorganisms. Nat Rev Microbiol 2: 414–424. 

Giovannoni, S.J., Tripp, H.J., Givan, S., Podar, M., Vergin, 

K.L., Baptista, D., et al. (2005) Genome streamlining in a 

cosmopolitan oceanic bacterium. Science 309: 1242– 

1245. 

Golyshin, P.N., Martins Dos Santos, V.A., Kaiser, O., Ferrer, 

M., Sabirova, Y.S., Lunsdorf, H., et al. (2003) Genome 

sequence completed of Alcanivorax borkumensis, a 

hydrocarbon-degrading bacterium that plays a global role 

in oil removal from marine systems. J Biotechnol 106: 

215–220. 

Hara, A., Syutsubo, K., and Harayama, S. (2003) Alcanivorax 

which prevails in oil-contaminated seawater exhibits broad 

substrate specificity for alkane degradation. Environ 

Microbiol 5: 746–753. 

Hara, A., Baik, S.H., Syutsubo, K., Misawa, N., Smits, T.H., 

van Beilen, J.B., and Harayama, S. (2004) Cloning and 

functional analysis of alkB genes in Alcanivorax borkumensis 

SK2. Environ Microbiol 6: 191–197. 

Harayama, S., Kishira, H., Kasai, Y., and Shutsubo, K. (1999) 

Petroleum biodegradation in marine environments. J Mol

Microbiol Biotechnol 1: 63–70. 

Jensen, L.J., Friis, C., and Ussery, D.W. (1999) Three 

views of microbial genomes. Res Microbiol 150: 773– 

777. 

Kasai, Y., Kishira, H., Sasaki, I., Syutsubo, K., Watanabe, K., 

and Harayama, S. (2002) Predominant growth of Alcanivorax

strains in oil-contaminated and nutrient-supplemented sea 

water. Environ Microbiol 4: 141–147. 

Kasai, Y., Kishira, H., Syutsubo, K., and Harayama, S. (2001) 

Molecular detection of marine bacterial populations on 

beaches contaminated by the Nakhodka tanker oilaccident. 

Environ Microbiol 3: 246–255. 

Klockgether, J., Würdemann, D., Reva, O., Wiehlmann, L., 

and Tümmler, B. (2007) Diversity of the abundant 

pKLC102/PAGI-2 family of genomic islands in Pseudomonas 

aeruginosa. J Bacteriol 189: 2443–2459. 

Luiten, R.G., Putterman, D.G., Schoenmakers, J.G., 

Konings, R.N., and Day, L.A. (1985) Nucleotide sequence 

of the genome of Pf3, an IncP-1 plasmid-specific filamentous 

bacteriophage of Pseudomonas aeruginosa. J Virol 

56: 268–276. 

McKew, B.A., Coulon, F., Osborn, A.M., Timmis, K.N., and 

McGenity, T.J. (2007a) Determining the identity and roles 

of oil-metabolizing marine bacteria from the Thames 

estuary, UK. Environ Microbiol 9: 165–176. 

McKew, B.A., Coulon, F., Yakimov, M.M., Denaro, R., Genovese, 

M., Smith, C.J., et al. (2007b) Efficacy of intervention 

strategies for bioremediation of crude oil in marine 




systems and effects on indigenous hydrocarbonoclastic 

bacteria. Environ Microbiol 9: 1562–1571. 

Nelson, K.E., Weinel, C., Paulsen, I.T., Dodson, R.J., Hilbert, 

H., Martins dos Santos, V.A., et al. (2002) Complete 

genome sequence and comparative analysis of the metabolically 

versatile Pseudomonas putida KT2440. Environ 

Microbiol 4: 799–808. 

Ornstein, R., Rein, R., Breen, D., and MacElroy, R. (1978) 

An optimized potential function for the calculation of 

nucleic acid interaction energies. Biopolymers 17: 2341– 

2360. 

Pedersen, A.G., Jensen, L.J., Brunak, S., Staerfeldt, H.H., 

and Ussery, D.W. (2000) A DNA structural atlas for 

Escherichia coli. J Mol Biol 299: 907–930. 

Pride, D.T., Meinersmann, R.J., Wassenaar, T.M., and 

Blaser, M.J. (2003) Evolutionary implications of microbial 

genome tetranucleotide frequency biases. Genome Res 

13: 145–158. 

Reva, O.N., and Tümmler, B. (2004) Global features of 

sequences of bacterial chromosomes, plasmids and 

phages revealed by analysis of oligonucleotide usage 

patterns. BMC Bioinformatics 5: 90. 

Reva, O.N., and Tümmler, B. (2005) Differentiation of regions 

with atypical oligonucleotide composition in bacterial 

genomes. BMC Bioinformatics 6: 251. 

Röling, W.F., Milner, M.G., Jones, D.M., Lee, K., Daniel, F., 

Swannell, R.J., et al. (2002) Robust hydrocarbon degradation 

and dynamics of bacterial communities during nutrient-enhanced
oil spill bioremediation. Appl Environ Microbiol

68: 5537–5548. 

Sabirova, J.S., Ferrer, M., Regenhardt, D., Timmis, K.N., and 

Golyshin, P.N. (2006) Proteomic insights into metabolic 

adaptations in Alcanivorax borkumensis induced by alkane 

utilization. J Bacteriol 188: 3763–3773. 

Saitou, N., and Nei, M. (1987) The neighbor-joining method: 

a new method for reconstructing phylogenetic trees. Mol 

Biol Evol 4: 406–425. 

Satchwell, S.C., Drew, H.R., and Travers, A.A. (1986) 

Sequence periodicities in chicken nucleosome core DNA. 

J Mol Biol 191: 659–675. 

Schneiker, S., Martins dos Santos, V.A., Bartels, D., Bekel, 

T., Brecht, M., Buhrmester, J., et al. (2006) Genome 

sequence of the ubiquitous hydrocarbon-degrading marine 

bacterium Alcanivorax borkumensis. Nat Biotechnol 24: 

997–1004. 

Sharp, P.M., and Li, W.H. (1986) Codon usage in regulatory 

genes in Escherichia coli does not reflect selection for ‘rare’ 

codons. Nucleic Acids Res 14: 7737–7749. 

Shpigelman, E.S., Trifonov, E.N., and Bolshoy, A. (1993) 

CURVATURE: software for the analysis of curved DNA. 

Comput Appl Biosci 9: 435–440. 

Sicheritz-Ponten, T., and Andersson, S.G. (2001) A phylogenomic
approach to microbial evolution. Nucleic Acids Res

29: 545–552. 

Skovgaard, M., Jensen, L.J., Friis, C., Stærfeldt, H.H., 

Worning, P., Brunak, S., and Ussery, D.W. (2002) The 

atlas visualisation of genome-wide information. In Methods 

in Microbiology. Wren, B., and Dorrell, N. (eds). London, 

UK: Academic Press, pp. 49–63. 

Smits, T.H., Balada, S.B., Witholt, B., and van Beilen, J.B. 

(2002) Functional analysis of alkane hydroxylases from 

gram-negative and gram-positive bacteria. J Bacteriol 184: 

1733–1742. 

Spangenberg, C., Fislage, R., Römling, U., and Tümmler, B. 

(1997) Disrespectful type IV pilins. Mol Microbiol 25: 203– 

204. 

Stover, C.K., Pham, X.Q., Erwin, A.L., Mizoguchi, S.D., Warrener, 

P., Hickey, M.J., et al. (2000) Complete genome 

sequence of Pseudomonas aeruginosa PAO1, an opportunistic

pathogen. Nature 406: 959–964. 

Syutsubo, K., Kishira, H., and Harayama, S. (2001) Development 

of specific oligonucleotide probes for the identification
and in situ detection of hydrocarbon-degrading

Alcanivorax strains. Environ Microbiol 3: 371–379. 

Teeling, H., Meyerdierks, A., Bauer, M., Amann, R., and 

Glockner, F.O. (2004) Application of tetranucleotide 

frequencies for the assignment of genomic fragments. 

Environ Microbiol 6: 938–947. 

Ussery, D., Soumpasis, D.M., Brunak, S., Staerfeldt, H.H., 

Worning, P., and Krogh, A. (2002) Bias of purine stretches 

in sequenced chromosomes. Comput Chem 26: 531–541. 

Wang, H., Noordewier, M., and Benham, C.J. (2004) Stress- 

Induced DNA Duplex destabilization (SIDD) in the E. coli 

genome: SIDD sites are closely associated with promoters. 

Genome Res 14: 1575–1584. 

Weinel, C., Nelson, K.E., and Tümmler, B. (2002) Global 

features of the Pseudomonas putida KT2440 genome 

sequence. Environ Microbiol 4: 809–818. 

Willenbrock, H., and Ussery, D.W. (2007) Prediction of highly 

expressed genes in microbes based on chromatin 

accessibility. BMC Mol Biol 8: 11. 

Willenbrock, H., Friis, C., Juncker, A.S., and Ussery, D.W. 

(2006) An environmental signature for 323 microbial 

genomes based on codon adaptation indices. Genome Biol 

7: R114. 

Worning, P., Jensen, L.J., Hallin, P.F., Staerfeldt, H.H., and 

Ussery, D.W. (2006) Origin of replication in circular 

prokaryotic chromosomes. Environ Microbiol 8: 353– 

361. 

Yakimov, M.M., Golyshin, P.N., Lang, S., Moore, E.R., 

Abraham, W.R., Lunsdorf, H., and Timmis, K.N. (1998) 

Alcanivorax borkumensis gen. nov., sp. nov., a new,

hydrocarbon-degrading and surfactant-producing marine 

bacterium. Int J Syst Bacteriol 48: 339–348. 



Paper III: Global features of the Alcanivorax borkumensis SK2 genome


2.9 Paper IV: The origins of Vibrio species 


Microb Ecol 

DOI 10.1007/s00248-009-9596-7 

MINIREVIEWS 

On the Origins of a Vibrio Species 

Tammi Vesth & Trudy M. Wassenaar & Peter F. Hallin & 

Lars Snipen & Karin Lagesen & David W. Ussery 

Received: 3 July 2009 /Accepted: 17 September 2009 

© The Author(s) 2009. This article is published with open access at Springerlink.com

Abstract Thirty-two genome sequences of various Vibrionaceae 

members are compared, with emphasis on what 

makes V. cholerae unique. As few as 1,000 gene families 

are conserved across all the Vibrionaceae genomes analysed; 

this fraction roughly doubles for gene families 

conserved within the species V. cholerae. Of these, 

approximately 200 gene families that cluster on various 

locations of the genome are not found in other sequenced 

Vibrionaceae; these are possibly unique to the V. cholerae 

species. By comparing gene family content of the analysed 

genomes, the relatedness to a particular species is identified 

for two unspeciated genomes. Conversely, two genomes
presumably belonging to the same species have suspiciously
dissimilar gene family content. We are able to identify a
number of genes that are conserved in, and unique to, V.
cholerae. Some of these genes may be crucial to the niche
adaptation of this species.

T. Vesth · T. M. Wassenaar · P. F. Hallin · L. Snipen ·
K. Lagesen · D. W. Ussery (*)
Department of Systems Biology,
The Technical University of Denmark,
Building 208,
2800 Kgs. Lyngby, Denmark
e-mail: dave@cbs.dtu.dk

T. M. Wassenaar
Molecular Microbiology and Genomics Consultants,
Zotzenheim, Germany

P. F. Hallin
Novozymes A/S,
Krogshøjvej 36,
2880 Bagsværd, Denmark

L. Snipen
Biostatistics, Department of Chemistry, Biotechnology,
and Food Sciences, Norwegian University of Life Sciences,
Ås, Norway

K. Lagesen
Centre for Molecular Biology and Neuroscience and Institute
of Medical Microbiology, University of Oslo,
Oslo, Norway

Introduction 

The species concept for bacteria has long been under siege 

from several angles, and now with thousands of bacterial 

genomes being sequenced, the disputes have intensified [8]. 

One frequently used definition of a bacterial species is “a 

category that circumscribes a (preferably) genomically 

coherent group of individual isolates/strains sharing a high 

degree of similarity in (many) independent features, 

comparatively tested under highly standardized conditions” 

[12]. Such independent features are usually phenotypes that 

can easily be tested. For a new species to be defined, 

amongst other criteria, inter-species DNA–DNA hybridisation 

has to be below 70%, although this rule is not 

without its limitations [18]. In the late 1970s and 1980s, the 

16S rRNA gene sequence was introduced as a molecular 

clock that could be used to infer phylogenetic relationships 

[50]. Ideally, isolates belonging to the same species have 

identical or nearly identical 16S rRNA genes, and these 

differ from isolates belonging to different species [32, 44]. 

In practice, this is not always the case. Examples exist of 

different species sharing identical rRNA genes (for 

instance, E. coli and Shigella [37] that are even placed in 

different genera); in addition, isolates of one species can 

have different rRNA genes beyond the 97% that is 

considered to demarcate species [4]. Lateral transfer of 

genetic material (to which ribosomal genes are believed to 

be resistant) destroys the phylogenetic relationship, so that

phylogenies based on alternative housekeeping genes can 

differ from a 16S rRNA tree and frequently are not even in 

accordance with each other. Such observations question the

validity of a phylogenetic tree as the most suitable model 

for bacterial ancestry, when multiple genetic transfers 

would produce a network-like evolutionary structure [6]. 

On the other hand, it is observed that lateral gene transfer is 

most frequent between genetically related members sharing 

a similar base content and occupying the same ecological 

niche [29]. Nevertheless, a core of genes can be recognised 

that produce coherent phylogenetic trees, though these may 

not represent the species’ complete evolutionary history as 

they comprise only a minor fraction of the genetic content 

of the organism [35]. 

Whether a tree or a network is more accurate to describe 

phylogeny, in either case bacterial species may be considered 

as a cloud of isolates having a higher level of genetic 

similarity to each other than to organisms belonging to a 

different species. When such clouds have fuzzy and 

overlapping borders, the species concept falls apart but that 

will only apply to certain cases [7]. Since 16S rRNA genes 

are not informative on the level of diversity within a 

species, the 'density' of a cloud of isolates making up a 

species cannot be determined by this gene. Those genes 

shared by all isolates belonging to one species comprise the 

core genome of that species [39], and the degree of 

diversity in the remaining non-core genes determines the 

density of the species cloud. 

We hypothesised that certain genes can be recognised as 

specific to a particular species, to be conserved in that 

species but not present in related species. We tested our 

hypothesis with complete genome sequences of the bacterial 

family Vibrionaceae, which belongs to the γ-Proteobacteria
and comprises eight genera. Most available

genome sequences belong to the genus Vibrio. This genus 

contains 51 recognised species [10, 46] which are mainly 

found in marine environments, frequently living in association 

with marine organisms such as corals, fish, squid or 

zooplankton. Most of them are symbionts and only a few 

are human pathogens, notably particular serotypes of V. 

cholerae producing cholera, Vibrio parahaemolyticus 

(causing gastroenteritis) and V. vulnificus (causing wound

infections) [46]. Other Vibrionaceae, including V. vulnificus, 

Aliivibrio salmonicida and V. harveyi, are fish or 

shellfish pathogens and have major economic impact. 

Photobacterium profundum, representing another genus 

within the Vibrionaceae, was also included. 

The gene content of 32 available sequenced Vibrionaceae 

genomes was compared and the results were analysed in 

various ways. The data allowed us to identify possible V. 

cholerae-specific genes, since this species was represented 

by 18 genomes, a sufficient number to test

conservation both within the species and across species. 

We found that a two-component signal transduction pathway 

is uniquely conserved in V. cholerae but is not found outside 

this species. Our findings further indicated that possibly a 

relatively small set of genes could confer niche specialisation 

allowing V. cholerae to adapt to a unique environment,
so that over time V. cholerae has become a distinct species.

Materials and Methods 

Genomes and Gene Annotations Used 

Publicly available genome sequences of Vibrionaceae were 

selected that were provided in less than 300 contigs and in 

which full-length 16S rRNA sequence could be found using 

the rRNA gene finder RNAmmer [19]. The 32 genome 

sequences included are shown in Table 1. 

The gene annotations as provided in GenBank were 

used, except for those genomes marked “Easygene” in 

Table 1 where protein annotation was not available in the 

RefSeq file at the time of analysis, and we used EasyGene 

[20] to identify the genes. As a control, an available 

GenBank annotation was compared to a generated Easygene 

annotation to confirm that the number of identified 

genes was comparable. 

Ribosomal RNA Analysis 

RNAmmer [19] was used to identify 16S rRNA sequences 

within the 32 genomes. Sequences were considered reliable 

if they were between 1,400 and 1,700 nucleotides long and 

had an RNAmmer score above 1,800. In cases where the 

program found multiple and variable 16S sequences within 

a genome, one of these (with satisfactory RNAmmer 

scores) was arbitrarily chosen. The sequences were aligned 

using PRANK [23, 24], and the program MEGA4 was used 

to elucidate a phylogenetic tree [45]. Within MEGA4, the 

tree was created using the Neighbor-Joining method with 

the uniform rate Jukes–Cantor distance measure and the 

complete-delete option. Five hundred resamplings were 

done to find the bootstrap values. 
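The reliability filter and the distance measure named above can be sketched as follows. This is only an illustration of the stated criteria and of the standard Jukes–Cantor formula, d = −(3/4) ln(1 − 4p/3), with the complete-deletion option taken to mean that any alignment column containing a gap or ambiguous base in any sequence is dropped before distances are computed. The function names are hypothetical, the length limits are assumed to be inclusive, and MEGA4 itself is not being reproduced.

```python
import math


def is_reliable_16s(length, score):
    """Criteria from the text: 1,400 to 1,700 nt long and RNAmmer score above 1,800."""
    return 1400 <= length <= 1700 and score > 1800


def complete_deletion(alignment):
    """Keep only the columns in which every sequence has an unambiguous base."""
    keep = [i for i in range(len(alignment[0]))
            if all(s[i] in "ACGT" for s in alignment)]
    return ["".join(s[i] for i in keep) for s in alignment]


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned, gap-free sequences."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    # The formula is undefined for p >= 0.75 (saturated sequences).
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


aligned = complete_deletion(["ACGTAC-T", "ACGTTCGT", "ACCTACGT"])
d_01 = jukes_cantor(aligned[0], aligned[1])
```

A neighbour-joining tree would then be built from the full matrix of such pairwise distances, and the procedure repeated on resampled alignment columns to obtain bootstrap values.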

Table 1 Vibrionaceae genomes used in this analysis

| GPID | Organism | Contigs | Accession/GenBank | Status | No. of genes | Ref. |
|---|---|---|---|---|---|---|
| 36 | V. cholerae N16961^a | 2 | AE003852.1 | Fully sequenced | 3,828 | [15] |
| 15667 | V. cholerae O395 TIGR^a | 2 | CP000626.1 | Fully sequenced | 3,875 | [11] |
| 32853 | V. cholerae O395 TEDA^a | | | | | |
| 33555 | V. cholerae MJ-1236^a | | | | | |
| 15666 | V. cholerae MO10^a | 153 | NZ_AAKF00000000 | Unfinished (Easygene) | 3,421 | [5] |
| 15670 | V. cholerae V52^a | 268 | NZ_AAKJ00000000 | Unfinished (NCBI) | 3,815 | [16] |
| 33559 | V. cholerae BX330286^a | 8 | NZ_ACIA00000000 | Unfinished (NCBI) | 3,632 | [31] |
| 33557 | V. cholerae B33^a | 17 | NZ_ACHZ00000000 | Unfinished (NCBI) | 3,748 | [31] |
| 33553 | V. cholerae RC9^a | 11 | NZ_ACHX00000000 | Unfinished (NCBI) | 3,811 | [31] |
| 32851 | V. cholerae M66-2 | 2 | CP001233.1 | Fully sequenced | 3,693 | [49] |
| 18495 | V. cholerae MZO-2 | 162 | NZ_AAWF00000000 | Unfinished (NCBI) | 3,425 | [16] |
| 18265 | V. cholerae 1587 | 254 | NZ_AAUR00000000 | Unfinished (NCBI) | 3,758 | [16] |
| 18253 | V. cholerae 2740-80 | 257 | NZ_AAUT00000000 | Unfinished (NCBI) | 3,771 | [16] |
| 17723 | V. cholerae AM-19226 | 154 | NZ_AATY00000000 | Unfinished (Easygene) | 3,407 | [33] |
| 33561 | V. cholerae 12129 | 12 | NZ_ACFQ00000000 | Unfinished (NCBI) | 3,574 | [31] |
| 33549 | V. cholerae VL426 | 5 | NZ_ACHV00000000 | Unfinished (NCBI) | 3,461 | [31] |
| 33579 | V. cholerae TM 11079-80 | 35 | NZ_ACHW00000000 | Unfinished (NCBI) | 3,621 | [31] |
| 33551 | V. cholerae TMA 21 | 20 | NZ_ACHY00000000 | Unfinished (NCBI) | 3,600 | [31] |
| 13564 | V. campbellii AND4 | 143 | NZ_ABGR00000000 | Unfinished (NCBI) | 3,935 | [13] |
| 19857 | V. harveyi BAA-1116 | 3 | CP000789.1 | Fully sequenced | 6,064 | [1] |
| 349 | V. vulnificus CMCP6 | 2 | AE016795.2 | Fully sequenced | 4,538 | [38] |
| 1430 | V. vulnificus YJ016 | 3 | BA000037.2 | Fully sequenced | 5,028 | [3] |
| 19397 | V. shilonii AK1 | 158 | NZ_ABCH00000000 | Unfinished (NCBI) | 5,360 | [41] |
| 15693 | Vibrio sp. Ex25 | 222 | NZ_AAKK00000000 | Unfinished (Easygene) | 4,004 | [16] |
| 13616 | Vibrio sp. MED222 | 99 | NZ_AAND00000000 | Unfinished (NCBI) | 4,590 | [36] |
| 32815 | V. splendidus LGP32 | 2 | FM954973.1 | Fully sequenced | 4,434 | [27] |
| 19395 | V. parahaemolyticus 16 | 78 | NZ_ACCV00000000 | Unfinished (Easygene) | 3,780 | [9] |
| 360 | V. parahaemolyticus 2210633 | 2 | BA000031.2 | Fully sequenced | 4,832 | [25] |
| 12986 | A. fischeri ES114 | 3 | CP000020.1 | Fully sequenced | 3,823 | [42] |
| 19393 | A. fischeri MJ11 | 3 | CP001133.1 | Fully sequenced | 4,039 | [26] |
| 30703 | A. salmonicida LFI1238 | 6 | FM178379.1 | Fully sequenced | 4,284 | [17] |
| 13128 | P. profundum SS9 | 3 | CR354531.1 | Fully sequenced | 5,480 | [48] |

GPID: genome project identifier at NCBI. Contigs: the number of contiguous sequences, which for a completely sequenced genome is at least two (for two chromosomes) and can be up to six when plasmids are present. Unfinished sequences are represented by multiple contigs per chromosome.
^a Strains containing the genes encoding the cholera enterotoxin subunits are indicated.

Pan-Genome Family Clustering

Clustering based on shared gene families from the Vibrio
pan-genome was constructed, based on BLASTP similarity
using default settings. A BLASTP hit was considered
significant if the alignment produced at least 50% identity
for at least 50% of the length of the longest gene (either
query or subject). Using this criterion, each pair of genes
producing a significant reciprocal best hit was scored as
belonging to the same gene family. A genome matrix was
constructed, containing one row for each genome and one
column for each gene family. Cell (i, j) in this matrix is 1 if
genome i has a member in gene family j, 0 otherwise. A

hierarchical clustering, with average linkage based on the 

Manhattan distance between genomes was then performed. 

Two trees were made, one with more weight given to gene 

families present in most (90%, or between 27 and 30) 

Vibrio genomes (“stabilome”), and the other with more 

weight given to gene families present in only a few (two, 

three, or four) genomes (“mobilome”). Thus, the original 

Boolean matrix is now scaled differently, depending on the 

number of genomes in each gene family [44]. For both 

trees, singletons (families which are only found in one 

genome) have been excluded. 
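The clustering procedure described in this section can be sketched as below. The text does not give the actual weighting scheme, so the `boost` factor of 5 for the emphasised class of gene families is purely hypothetical, as are all function names. Distances are divided by the sum of the weights to give a relative Manhattan distance, which is also an assumption, and the average-linkage step is a naive implementation intended for small examples only.

```python
def family_weights(matrix, emphasise=None, boost=5.0):
    """Per-family weights: 0 for singletons, `boost` for the emphasised class."""
    n_genomes = len(matrix)
    sizes = [sum(col) for col in zip(*matrix.values())]
    weights = []
    for size in sizes:
        if size <= 1:
            weights.append(0.0)  # singletons are excluded from both trees
        elif emphasise == "stabilome" and size >= 0.9 * n_genomes:
            weights.append(boost)
        elif emphasise == "mobilome" and 2 <= size <= 4:
            weights.append(boost)
        else:
            weights.append(1.0)
    return weights


def manhattan(matrix, weights):
    """Weighted Manhattan distance between every pair of genome rows."""
    total = sum(weights)
    names = sorted(matrix)
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = sum(w * abs(x - y)
                    for w, x, y in zip(weights, matrix[a], matrix[b]))
            dist[frozenset((a, b))] = d / total
    return dist


def average_linkage(names, dist):
    """Repeatedly merge the two clusters with the smallest mean pairwise distance."""
    def mean_dist(c1, c2):
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        height, i, j = min((mean_dist(c1, c2), i, j)
                           for i, c1 in enumerate(clusters)
                           for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy presence/absence matrix: one row per genome, one column per gene family.
matrix = {"A": [1, 1, 0, 1], "B": [1, 1, 0, 0], "C": [1, 0, 1, 0]}
dist = manhattan(matrix, family_weights(matrix))
merges = average_linkage(sorted(matrix), dist)
```

In the toy example, genomes A and B share every non-singleton family and merge at distance 0, while C joins them at 0.5.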

Figure 1 Phylogenetic tree of the 16S rRNA gene extracted from 32 sequenced Vibrio genomes listed in Table 1. Environmental V. cholerae lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae genomes are in dark green. Further colouring was used for species for which two genomes are represented

Pan- and Core Genome Analysis

The results of the BLAST analysis were also used to
construct a pan- and core genome plot as follows. Based on
clusterings from the pan-genome family tree, an ordered set
of genomes was constructed with V. cholerae genomes at
the start. For the first chosen genome, all BLAST hits found
in the second genome were recorded and the accumulative
number of gene families (as defined above) now recognised in

total was plotted for the pan-genome. The number of gene 

families with at least one representative gene in both genomes 

was plotted for the core genome. A running total is plotted for 

the pan-genome which increases as more genomes are added, 

whilst the core genome representing conserved gene families 

slowly decreases with the addition of more genomes. 
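The running totals behind such a plot can be sketched as below, assuming that each genome has already been reduced to the set of gene families it contains. The function name and the toy family identifiers are hypothetical.

```python
def pan_core_curve(ordered_genomes, families):
    """Return (genome, pan size, core size) after adding each genome in order.

    `families` maps a genome name to the set of gene family ids found in it.
    """
    pan = set()
    core = None
    curve = []
    for name in ordered_genomes:
        fams = families[name]
        pan |= fams                                     # union grows or stays equal
        core = set(fams) if core is None else core & fams  # intersection shrinks
        curve.append((name, len(pan), len(core)))
    return curve


families = {"g1": {1, 2, 3}, "g2": {2, 3, 4}, "g3": {3, 5}}
curve = pan_core_curve(["g1", "g2", "g3"], families)
```

The order of the genomes affects the shape of both curves but not their final values, which is why the ordering taken from the pan-genome family tree matters for the plot.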

Whole-Genome BLAST Analysis and Construction 

of a BLAST Matrix 

The predicted genes of every genome (annotated or found 

by Easygene) were translated and every gene was compared
by BLASTP against every other genome and its own

genome. In the latter case, the hit to self was ignored. The 

50/50 rule for BLAST hits as described above was used. If 

these requirements were met, genes were combined in a 

gene family. The BLAST results were visualised in a 

BLAST matrix [2], which summarises the results of 

genomic pairwise comparisons and reports, both as percentage 

and as absolute numbers, the number of reciprocal 

BLAST hits as a fraction of the total number of gene 

families found in the two genomes. For easier visual 

inspection, the cells in the matrix are coloured darker as 


the fraction of similarity increases. Hits identified within a 

genome are differently coloured. 
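A minimal sketch of the 50/50 rule and of a single cell of the BLAST matrix is given below. It reflects one reading of the rule, in which the alignment must show at least 50% identity and span at least half of the longer of the two proteins. The cell is computed from per-genome sets of gene family identifiers, which is a simplification of counting reciprocal hits directly. Function and parameter names are hypothetical.

```python
def passes_50_50(identity_pct, align_len, query_len, subject_len):
    """At least 50% identity over at least 50% of the longer sequence."""
    longest = max(query_len, subject_len)
    return identity_pct >= 50.0 and align_len >= 0.5 * longest


def blast_matrix_cell(families_a, families_b):
    """Shared families, total families in the two genomes, and the percentage."""
    shared = len(families_a & families_b)
    total = len(families_a | families_b)
    return shared, total, 100.0 * shared / total


cell = blast_matrix_cell({1, 2, 3}, {2, 3, 4})
```

Each cell is thus reported both as an absolute count pair (shared / total) and as a percentage, matching the two forms mentioned in the text.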

BLAST Atlas 

BLAST results were also visualised in a BLAST atlas, this 

time visualising, for all genes in the reference genome V.
cholerae N16961, their best hit in all other genomes, again
with a threshold of 50% identity over at least 50% of the
length of the query protein. The atlas displays the hits as they
are located in the reference strain [14]. The BLAST scores
obtained for each queried gene are plotted, so that conserved

and variable regions are located with respect to the reference 

genome. Note that genes absent in the reference genome are 

not shown in the lanes of the query genomes. 

Results 


Ribosomal RNA Analysis 

A phylogenetic tree based on the 16S rRNA gene extracted 

from the 32 analysed Vibrionaceae genomes is shown in 

Fig. 1. The 18 V. cholerae genomes build a tight subcluster, 


quite distanced from the other species. Above this in the 

figure, another subcluster comprising eight genomes representing 

at least six species is recognised, and within this 

cluster the two V. parahaemolyticus genes are not found on 

the same branch. A third cluster, a bit further removed, 

includes Aliivibrio fischeri and A. salmonicida as well as V.
splendidus and Vibrio sp. MED222; the gene of

Photobacterium profundum is the most distant. 

Pan-Genome Family Trees 


Starting with a database containing the total set of all Vibrio 

gene families, a profile of matching gene families was 

constructed for each individual genome. This was stored as 

a matrix, containing a column for each gene family, and a
row for each genome. The rows contain a 1 or 0
representing the presence or absence of the gene family.

This matrix was weighted to emphasise either the genes 

found in most genomes (the “stabilome”) or in only a few 

genomes (the “mobilome”); from these weighted matrices, 

clustering of gene families yielded the resulting trees shown 

in Fig. 2. Shorter distances represent genomes with many 

gene families in common, and larger distances reflect 

genomes with fewer gene families in common. As 

expected, in both trees, genomes from the same species 

cluster together, whereby the depth of resolution within a 

species is considerably better than can be seen in the 16S 

rRNA tree in Fig. 1. Similarity between the unspeciated 

Vibrio isolate MED222 and V. splendidus is suggested by
their close clustering; this is a connection also suggested by
others [21]. Note that the unspeciated Vibrio isolate Ex25
and V. parahaemolyticus 2210633 cluster together in the
mobilome tree, but are more distant in the stabilome. This
implies that the genes shared between these two genomes
are less common genes within the Vibrio genomes
examined here. As already indicated by the 16S rRNA
tree, the two V. parahaemolyticus isolates are quite
dissimilar, and appear on separate branches. The Aliivibrio
cluster is placed within Vibrio genomes in both the
stabilome and the mobilome, as was the case for their 16S
rRNA gene. P. profundum is not such an outlier as in the
16S rRNA tree, and in the stabilome it is even positioned
close to the Aliivibrio genomes. Zooming in at the genomes
of V. cholerae, a division into two subclusters can be seen;
these clusters correspond to environmental vs. clinical
isolates (with the exception of V52 in the stabilome).

Figure 2 Pan-genome family clustering of the 32 Vibrio genome sequences. The two plots represent weighted values for genes present in at least 90% of the genomes (stabilome) or genes found in only a few (two to four) genomes (mobilome). The colours highlighting the species are the same as in Fig. 1
[Figure 2 panels: stabilome and mobilome trees; horizontal axis: relative Manhattan distance]

Pan- and Core Genome Plot

BLAST results were analysed to construct a pan-genome,
which is a hypothetical collection of all the gene families
that are found in the investigated genomes [28]. The core
genome was constructed from all gene families that were
represented at least once in every genome. Thus, the gene
families conserved in all genomes represent their core
genome; adding the remaining gene families produces the

pan-genome. The resulting pan- and core genome plot is shown in Fig. 3. The genomes start with the documented clinical isolates of V. cholerae and then follow the order suggested by the pan-genome family clustering (Fig. 2), although genomes from the same species were kept together (the two V. parahaemolyticus genomes were split in the trees). As more genomes are added to the plot, the number of gene families in the pan-genome (blue line) increases, and the number of conserved gene families in the core genome (red line) decreases, albeit at a lower rate. This is because every genome can add many novel (and frequently different) genes to the pan-genome but only reduces the core genome by the few genes that are absent in that particular strain but were conserved in the previously analysed genomes. The pan-genome curve increases with a relatively steep slope when a novel species is added, as is obvious when a V. parahaemolyticus genome is added after the last V. cholerae. A stable plateau can be seen for the pan-genome of V. cholerae at around 6,500 genes. Nevertheless, a small increase occurs when adding V. cholerae 11587; this is caused by the difference between the two subclusters of V. cholerae seen in Fig. 2. V. cholerae strain 2740-80 behaves atypically in all the figures shown; although documented as an environmental isolate, it appears closer to the clinical isolates in terms of overall genomic properties.

Figure 3 Pan- and core genome plot of the 32 Vibrionaceae genomes. The colours highlighting species are the same as in Fig. 1. [Plot omitted; y-axis ticks run from 0 to 25,000; legend: pan genome, core genome; the x-axis lists the genomes.]


Figure 4 BLAST matrix of the 32 Vibrionaceae genomes. The colours highlighting the species are the same as in Fig. 1. Since the reciprocal similarity (reported as percent) is not readable at this resolution, every matrix cell is coloured using the scales as indicated. The bottom row identifies hits (other than hits-to-self) found within a genome. Four matrix cells reporting high pairwise similarities are outlined; their numbers are specified in the text. [Matrix omitted; each cell is printed as a percentage with two counts, for example 96.0 % (3,531 / 3,678); colour scale labels: 30.0 % to 90.0 % and 0.0 % to 6.0 %.]

[Figure 5 graphic omitted. Labels: V. cholerae El Tor N16961 chromosome 1 (2,961,149 bp) and chromosome 2 (1,072,310 bp); gaps A to G and the superintegron are marked. The legend, from outer circle to inner circle, lists the query genomes (among them V. campbellii AND4, Vibrio sp. Ex25, A. salmonicida LFI1238 and V. cholerae O395 TEDA), genes on the positive strand, genes on the negative strand, stacking energy, position preference, global direct repeats and GC skew.]

T. Vesth et al.


When the first genome of A. fischeri is added, which is 

not a member of the Vibrio genus, it does not add 

significantly more novel genes to the pan-genome than 

Vibrio genomes did. This contrasts with P. profundum 

which produces a sharp increase in the pan-genome, as 

does, interestingly, V. shilonii. Note that there are approximately 

20,200 total gene families within the 32 sequenced 

Vibrionaceae genomes, whereas the core genome decreases 

to approximately 1,000 gene families. 
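The way the pan- and core genome curves are built up can be illustrated with a short sketch. It assumes that each genome has already been reduced to a set of gene family identifiers; the function name `pan_core_curve`, the strain names and the family labels are hypothetical and are not taken from the original analysis. Genomes are processed in plot order: the pan-genome is the running union of families and the core genome is the running intersection.

```python
from typing import List, Set, Tuple


def pan_core_curve(genomes: List[Tuple[str, Set[str]]]) -> List[Tuple[str, int, int]]:
    """Return (genome name, pan-genome size, core genome size) after each addition."""
    pan: Set[str] = set()
    core: Set[str] = set()
    curve = []
    for index, (name, families) in enumerate(genomes):
        # The pan-genome grows by every family not seen before.
        pan |= families
        # The core genome keeps only families present in every genome so far.
        core = set(families) if index == 0 else core & families
        curve.append((name, len(pan), len(core)))
    return curve


# Hypothetical toy data: three genomes described by their gene families.
toy_genomes = [
    ("strain_1", {"famA", "famB", "famC", "famD"}),
    ("strain_2", {"famA", "famB", "famC", "famE"}),
    ("strain_3", {"famA", "famB", "famF", "famG", "famH"}),
]

for name, pan_size, core_size in pan_core_curve(toy_genomes):
    print(name, pan_size, core_size)
```

In this toy input the third genome adds three new families to the pan-genome but removes only one from the core, which mirrors the asymmetry between the two curves described above.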

BLAST Comparison Visualised in a BLAST Matrix 

A BLAST matrix provides a visual overview of reciprocal 

pairwise whole-genome comparisons, as shown in Fig. 4. 

The stronger a matrix cell is coloured, the more similarity 

was detected between the gene content of two genomes. As 

can be seen in the lower right triangle, all V. cholerae 

genomes are highly similar, with similarity ranging between 

64% and 93% for any given pair of genomes. No statistical 

difference was observed when comparing clinical isolates 

to environmental isolates. The two A. fischeri and the two 

V. vulnificus genomes also share a high degree of identity 

within their species (75% and 67%, respectively), visible at 

the bottom of the matrix. In contrast, the two V. parahaemolyticus 

genomes only share 35% identity, which is 

not higher than the similarity detected between genomes of 

different species. With 72% similarity, isolate MED222 

most closely matches V. splendidus, and with 65% isolate 

Ex25 again shares most similarity with V. parahaemolyticus 

2210633. 
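The cells of Fig. 4 are printed as a percentage together with two counts. The following sketch shows one way such a cell could be computed; it assumes that the first count is the number of gene families shared by the two genomes and the second is the total number of families in the pair, which the text does not state explicitly. The function names and the toy family labels are hypothetical.

```python
from typing import Set, Tuple


def matrix_cell(families_a: Set[str], families_b: Set[str]) -> Tuple[int, int, float]:
    """Return (shared families, total families, percent shared) for one genome pair."""
    shared = len(families_a & families_b)
    total = len(families_a | families_b)  # assumption: denominator is the union
    percent = 100.0 * shared / total if total else 0.0
    return shared, total, percent


def format_cell(shared: int, total: int) -> str:
    """Format a cell in the 'percent (shared / total)' style used in the matrix."""
    return f"{100.0 * shared / total:.1f} % ({shared:,} / {total:,})"


# Hypothetical toy genomes.
genome_x = {"famA", "famB", "famC"}
genome_y = {"famB", "famC", "famD"}
print(matrix_cell(genome_x, genome_y))

# Re-deriving the percentage of one printed cell from its two counts.
print(format_cell(3531, 3678))
```

Applied to the counts 3,531 and 3,678 taken from one of the printed cells, the formatter reproduces the 96.0 % shown in that cell, which is consistent with the percentage being the ratio of the two counts.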

BLAST Atlas 

A BLAST atlas was constructed using V. cholerae N16961 (O1, El Tor) as the reference genome, shown in Fig. 5. The best BLAST hits identified in the query genomes are plotted in the lanes around the reference genome, with different colours for different species. In general, chromosome 1 is more strongly conserved than chromosome 2. A large part of chromosome 2 of N16961 displays very little conservation in the other genomes; this area represents a superintegron [40] that contains the V. cholerae-specific repeat (VCR) sequences, as well as a high number of gene cassettes. The repeat sequences are visible as black boxes in the repeat lane of the reference genome (second inner lane). Although all V. cholerae genomes contain a superintegron, its genes are very diverse between isolates [34], which explains the lack of BLAST hits in this region.

Figure 5 BLAST atlas with V. cholerae strain N16961 as a reference strain, showing chromosomes 1 (top) and 2 (bottom). The best BLAST hits identified with genes from N16961 in the other V. cholerae genomes are represented in dark red, for the location as it appears in N16961. BLAST hits in the other genomes are shown in various colours as indicated to the right. Major areas conserved in V. cholerae but not in other Vibrionaceae are identified as gap B, gap C, gap D and gap F in green; areas that are found in toxigenic V. cholerae only are marked black as gap A, gap E and gap G. The superintegron on chromosome 2 of V. cholerae is also indicated.

Several regions of the atlas have been highlighted. Gaps 

B, C, D and F on chromosome 1 (indicated in green) 

contain genes that are conserved in the represented 

genomes of V. cholerae but not in the other Vibrionaceae. 

The gaps marked A, E and G indicate regions that are 

specific to the toxigenic, clinical isolates only. Annotated, 

V. cholerae-specific genes present in all these regions are 

listed in Table 2 (hypothetical genes are excluded). Genes 

specific for toxigenic V. cholerae identified in gap A 

include, amongst others, biosynthesis genes for the toxin 

co-regulated pilus (which is required for transmission of the 

prophage CTXΦ carrying the enterotoxin genes), as well as 

genes encoding citrate lyase. Note that the genes in gap A 

are also found in the environmental isolate V. cholerae 

2740-80. 

Gap B contains a number of outer membrane protein 

genes involved in sugar modification that are found in all V. 

cholerae genomes. Genes from gap C encoding a histidine 

kinase two-component signal transduction regulatory system 

are also conserved within the species, as are genes in gaps 

D and F, involved in chemotaxis and possible multidrug 

resistance. 

Gap E, containing genes conserved in toxigenic strains 

only, holds the prophage CTXΦ that contains the genes 

encoding cholera enterotoxin subunits A and B; this 

enterotoxin is responsible for the excessive, watery diarrhoea 

typical of cholera. Upon binding to target cell GM1 

gangliosides, enterotoxin enters the cell and stimulates 

adenylate cyclase by ADP ribosylation. The resultant 

increased cyclic AMP levels induce excessive electrolyte 

movement and sodium plus water secretion [43]. Strain 

M66-2 is believed to be a precursor of the seventh 

pandemic V. cholerae that lacks the prophage CTXΦ and 

the enterotoxin genes [11]. Gap E also bears the RTX toxin 

operon, which encodes a pore-forming cytotoxin [22]. An 

RTX toxin is also present in environmental isolate 2740-80 

and in V. vulnificus. 

Gap G on chromosome 2 consists of a set of five genes, all in the same orientation, in a putative operon, flanked by genes on the complementary strand. This appears to be a remnant of a mobile element, as these genes are flanked by a transposase gene on the 3′ end, and there is a small global repeat on the 5′ end. Only the first two of the five genes have an assigned function, with the first gene being a GMP reductase, and the second a putative DNA methyltransferase. The remaining three genes are hypothetical, but their strikingly strong conservation in all pathogenic strains and complete absence of homologues in the other Vibrio genomes strongly point towards a potential biological significance.

Table 2 A selection of genes located in the gaps marked in Fig. 5

Coordinates       Annotation

Gap A (850000–913000)
852903–851557     Citrate/sodium symporter
853165–854235     Citrate (pro-3S)-lyase ligase
854287–854583     Citrate lyase subunit gamma
854565–855455     Citrate lyase, beta subunit
855391–856995     Citrate lyase, alpha subunit
856992–857528     citX protein
857506–858447     citG protein
869812–866873     Helicase-related protein
870391–869813     Tellurite resistance protein-related
871298–870819     Transcriptional regulator, putative
873242–874225     Transposase, putative
876974–880015     ToxR-activated gene A protein
881390–884728     Inner membrane protein, putative
885773–886267     tagD protein
888405–886543     Toxin co-regulated pilus biosynthesis
890449–891123     Toxin co-regulated pilin
899896–900726     TCP pilus virulence regulatory protein
900726–901487     Leader peptidase TcpJ
901494–903374     Accessory colonization factor AcfB
903380–904150     Accessory colonization factor AcfC
904648–905556     tagE protein
906206–905559     Accessory colonization factor AcfA
914124–912856     Phage family integrase

Gap B (975000–1010000)
978644–979144     Phosphotyrosine protein phosphatase
981833–982387     Serine acetyltransferase-related protein
982384–983532     Exopolysacch. biosynth protein EpsF
983529–984938     Polysacch. export protein, putative (gfcE)
986166–986597     Serine acetyltransferase-related protein
986597–987937     capK protein, putative
987913–989010     Polysaccharide biosynthesis protein, putative
1001910–1002437   Polysaccharide export-related protein (gfcE)
1002462–1004675   Putative exopolysacch. biosynth protein

Gap C (1130000–1160000)
1139646–1142912   Chitinase, putative
1147856–1148998   Response regulator
1149990–1151309   Sensory box sensor histidine kinase
1151321–1152625   Sensor histidine kinase
1157228–1155624   Sensor histidine kinase
1158044–1157232   Periplasmic binding protein-related

Gap D (1478000–1520000)
2086826–2087584   CDP-diacylglycerol-glyc.-3phosph-3-phosphatidyltransferase
2087587–2088519   Phosphatidate cytidylyltransferase
2094741–2095604   PvcB protein
2098112–2097183   LysR family transcriptional regulator
2098432–2100258   pvcA protein
2117923–2119977   Methyl-accepting chemotaxis protein
2120575–2120030   Transcriptional regulator
2120663–2121826   Benzoate transport protein

Gap E (1537000–1587500)
1541452–1543170   Sensor histidine kinase/response regulator
1545396–1543231   Toxin secretion transporter, putative
1546802–1545399   RTX toxin transporter
1548919–1546757   RTX toxin transporter
1549662–1550123   RTX toxin activating protein
1550108–1563784   RTX toxin RtxA
1564376–1564152   RstC protein
1564844–1564470   RstB1 protein
1565901–1564822   RstA1 protein
1566027–1566365   Transcriptional repressor RstR
1567341–1566967   Cholera enterotoxin, B subunit
1568114–1567338   Cholera enterotoxin, A subunit
1569412–1568213   Zona occludens toxin
1569702–1569409   Accessory cholera enterotoxin
1571241–1570993   Colonization factor
1571760–1571377   RstB2 protein
1572817–1571738   RstA1 protein
1572943–1573281   Transcriptional repressor RstR
1577272–1575704   Phage replication protein Cri
1582123–1580555   Phage replication protein Cri
1583160–1583513   Transposase OrfAB, subunit A
1583510–1584382   Transposase OrfAB, subunit B

Gap F (1896000–1956000)
1896092–1897327   Phage family integrase
1900831–1898009   Helicase, putative
1903632–1902898   Chemotaxis protein MotB-related
1908858–1905790   Type I restriction enzyme HsdR
1916009–1913628   DNA methylase HsdM, putative
1933231–1935654   Neuraminidase
1936007–1935801   Transcriptional regulator
1936121–1936597   DNA repair protein RadC, putative
1938391–1937519   Transposase OrfAB, subunit B
1938732–1938388   Transposase OrfAB, subunit A
1941671–1941351   Transcriptional regulator, putative
1942032–1941658   Middle operon regulator-related
1944457–1943306   eha protein

Gap G (chromosome II, 21300–223000)
213207–214250     GMP reductase
214574–215725     DNA methyltransferase
220262–219825     IS1004 transposase

All gene annotations are taken from the reference genome V. cholerae strain N16961. Hypothetical proteins were excluded. Gaps A, E and G are conserved in pathogenic strains, whereas gaps B, C, D and F are conserved in all V. cholerae genomes analysed (Figure 1)

Discussion 

The recent availability of many Vibrionaceae genomes, 

including a substantial number of V. cholerae genomes, 

makes it possible to take a closer look at the similarities 

and differences of species within the genus Vibrio and to 

examine, on a genome scale, what distinguishes V. cholerae 

from the other Vibrio species. Since not all V. cholerae 

isolates are pathogenic, the presence of the prophage bearing 

cholera enterotoxin, the main virulence factor for 

cholera, is not a suitable marker for this species. We 

attempted to identify a set of V. cholerae-specific genes, 

and also explored the internal diversity within the V. 

cholerae genomes that have been sequenced to date. 

On a phylogenetic tree based on the 16S ribosomal RNA 

gene, those isolates that do not belong to the genus Vibrio 

were positioned as outliers, as expected. This tree further 

indicated the closest resembling 16S rRNA sequence for 

the two sequenced Vibrio strains that are currently not 

assigned to a species. It was observed that the two 

sequenced V. parahaemolyticus strains were not placed 

together. The complete gene content of each genome was 

next compared by BLAST and the results were pooled into 

gene families which were subjected to cluster analysis. This 

provided evidence that the 18 V. cholerae genomes fall into 

two subclusters, one mainly containing clinical isolates and 

the other environmental isolates. 
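The cluster analysis of gene families referred to above starts from a presence/absence profile per genome. The sketch below builds such a profile and computes a relative Manhattan distance between two genomes. Several points are assumptions made for illustration only: the distance is taken to be the weighted Manhattan distance divided by the sum of the weights, and the stabilome and mobilome weightings are simplified to hard 0/1 filters (families in at least 90% of genomes, or in two to four genomes). The function names and toy data are hypothetical, and the tree-building step is not shown.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def presence_absence(genomes: Dict[str, Set[str]]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build one 0/1 row per genome with one column per gene family."""
    families = sorted(set().union(*genomes.values()))
    matrix = {name: [1 if fam in fams else 0 for fam in families]
              for name, fams in genomes.items()}
    return families, matrix


def frequency_weights(matrix: Dict[str, List[int]],
                      keep: Callable[[int, int], bool]) -> List[float]:
    """Give weight 1.0 to families whose genome count passes `keep`, else 0.0."""
    n_genomes = len(matrix)
    counts = [sum(column) for column in zip(*matrix.values())]
    return [1.0 if keep(count, n_genomes) else 0.0 for count in counts]


def relative_manhattan(row_a: List[int], row_b: List[int],
                       weights: Optional[List[float]] = None) -> float:
    """Weighted Manhattan distance divided by the total weight (assumed scaling)."""
    if weights is None:
        weights = [1.0] * len(row_a)
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * abs(a - b) for w, a, b in zip(weights, row_a, row_b)) / total


# Hypothetical toy genomes.
toy = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famA", "famB", "famD"},
    "g3": {"famA", "famE", "famF"},
}
families, matrix = presence_absence(toy)
mobilome = frequency_weights(matrix, lambda count, n: 2 <= count <= 4)
stabilome = frequency_weights(matrix, lambda count, n: count >= 0.9 * n)

print(relative_manhattan(matrix["g1"], matrix["g2"]))            # unweighted
print(relative_manhattan(matrix["g1"], matrix["g3"], mobilome))  # mobilome-weighted
```

A pairwise distance table computed this way could then be passed to any hierarchical clustering routine to obtain trees comparable in spirit to those in Fig. 2.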

The gene family clustering, subsequent pan-genome 

analysis and the pairwise BLAST results, as summarised 

in the BLAST matrix, all supported the relatedness of 

Vibrio species Ex25 to V. parahaemolyticus 2210633 but 

not to V. parahaemolyticus 16. This latter genome was quite 

different from V. parahaemolyticus 2210633 in all analyses. 

Although it is possible that the species V. parahaemolyticus 

is far more genetically diverse than V. cholerae, A. fischeri 

or V. vulnificus, an alternative explanation is that one of the 

sequenced isolates is perhaps incorrectly named as V. 

parahaemolyticus. The similarity between Vibrio species 

MED222 and V. splendidus based on gene families is in 

agreement with their related 16S rRNA genes and published 

data [21]. However, in contrast to what the ribosomal 

gene suggests, our whole-genome comparison indicates that 

the three Aliivibrio genomes (A. salmonicida and two A. 

fischeri) are not so different from Vibrio after all. Their 

recent placement in the genus Aliivibrio, a decision based 

on five genes (the 16S rRNA gene and four housekeeping 

genes) and phenotypical characteristics [47], appears not to 

be reflective of the whole genome picture presented here. 

The BLAST results were graphically summarised in a 

BLAST atlas, which visualised V. cholerae-specific gene 

clusters. These coded for polysaccharide biosynthesis 

enzymes, response regulators and chemotaxis proteins, 

amongst others. In addition, a V. cholerae-specific, histidine 

kinase two-component signal transduction regulatory system 

was identified. The two-component signal transduction 

pathway is a powerful regulating system for bacteria to 

adapt to a particular ecological niche. There is a precedent 

for this claim, as the introduction of a single regulatory 

protein in Vibrio fischeri strain MJ11 has been shown to 

specifically enable colonization of the squid Euprymna 

scolopes [26]. 

As expected, the main differences observed between V. 

cholerae clinical isolates and the environmental strains are 

due to genes related to virulence. Two exceptions are the 

presence of a number of virulence genes in the environmental 

strain V. cholerae 2740-80 and the absence of 

enterotoxin genes in clinical isolate M66-2. It has already 

been suggested that M66-2 might be a predecessor of 

pandemic, enterotoxic V. cholerae [11]. From sequence 

comparison of four housekeeping genes, it was concluded 

that V. cholerae 2740-80 is intermediary between toxigenic 

and non-toxigenic isolates [30]. This view is confirmed by 

the data presented here, although we propose to consider 

the possibility that the isolate arose from a pandemic clone 

that has lost the CTXΦ prophage, rather than being a 

precursor of a pathogen. 

In conclusion, several different methods of genome
comparison have yielded a picture of V. cholerae genomes

as forming a distinct cluster, compared to related species, 

and a relatively small number of genes might be responsible 

for environmental niche adaptation and hence for generation 

of this distinct species. Likely candidates include 

multiple two-component signal transduction regulatory 

proteins as well as chemotaxis proteins. 

Acknowledgements We would like to thank Tim Binnewies for
early work on this project, and the Danish Research Councils
and the DTU Globalization funds for financial support.

Open Access This article is distributed under the terms of the 

Creative Commons Attribution Noncommercial License which permits 

any noncommercial use, distribution, and reproduction in any 

medium, provided the original author(s) and source are credited. 

References 

1. Bassler B et al. (2007) CP000789.1: Direct submission to 

GenBank 

2. Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) 

Genome update: proteome comparisons. Microbiol 151:1–4 

3. Chen CY, Wu KM, Chang YC, Chang CH, Tsai HC, Liao TL, Liu 

YM, Chen HJ, Shen AB, Li JC, Su TL, Shao CP, Lee CT, Hor LI, 

Tsai SF (2003) Comparative genome analysis of Vibrio vulnificus, 

a marine pathogen. Genome Res 13:2577–2587 

4. Clayton RA, Sutton G, Hinkle PS, Bult C, Fields C (1995) 

Intraspecific variation in small-subunit rRNA sequences in 

GenBank: why single sequences may not adequately represent 

prokaryotic taxa. Int J Syst Bacteriol 45:595–599 

5. Colwell R, Grim CJ, Young S, Jaffe D, Gnerre S, Berlin A, 

Heiman D, Hepburn T, Shea T, Sykes S, Alvarado L, Kodira C, 

Heidelberg J, Lander E, Galagan J, Nusbaum C, Birren B (2008) 

NZ_AAKF00000000: Direct submission to GenBank 

6. Doolittle WF (1999) Phylogenetic classification and the universal

tree. Science 284:2124–2129 

7. Doolittle WF, Papke RT (2006) Genomics and the bacterial 

species problem. Genome Biol 7:116 

8. Doolittle WF, Zhaxybayeva O (2009) On the origin of prokaryotic 

species. Genome Res 19:744–756 

9. Edwards R, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, 

Rogers Y-H, Friedman R, Frazier M, Venter JC (2008) 

NZ_ACCV00000000: Direct submission to GenBank 

10. Farmer JJ, Janda JM (2005) Vibrionaceae. In: Bergey’s 

manual of systematic bacteriology, 2nd edn, vol 2 part B. 

Springer, New York, pp 491–546 

11. Feng L, Reeves PR, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Cheng 

J, Wang W, Wang J, Qian W, Li D, Wang L (2008) A recalibrated 

molecular clock and independent origins for the cholera pandemic 

clones. PLoS ONE 3:e4053 

12. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ,

Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, 

Swings J (2005) Re-evaluating prokaryotic species. Nat Rev 

Microbiol 3:733–739 

13. Hagstrom A, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton 

G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2007) 

NZ_ABGR00000000: Direct submission to GenBank 

14. Hallin PF, Binnewies TT, Ussery DW (2008) The genome
BLASTatlas—a GeneWiz extension for visualization of whole-genome
homology. Mol Biosyst 4:363–371

15. Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, 

Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, Gill 

SR, Nelson KE, Read TD, Tettelin H, Richardson D, Ermolaeva 

MD, Vamathevan J, Bass S, Qin H, Dragoi I, Sellers P, McDonald 

L, Utterback T, Fleishmann RD, Nierman WC, White O, Salzberg 

SL, Smith HO, Colwell RR, Mekalanos JJ, Venter JC, Fraser CM 

(2000) DNA sequence of both chromosomes of the cholera 

pathogen Vibrio cholerae. Nature 406:477–483 

16. Heidelberg J, Sebastian Y. NZ_AAKJ00000000, NZ_AAUT00000000, 

NZ_AAKK00000000, NZ_AAUR00000000, NZ_AAWF00000000: 

Direct submission to GenBank 

17. Hjerde E, Lorentzen MS, Holden MT, Seeger K, Paulsen S, Bason 

N, Churcher C, Harris D, Norbertczak H, Quail MA, Sanders S, 

Thurston S, Parkhill J, Willassen NP, Thomson NR (2008) The 

genome sequence of the fish pathogen Aliivibrio salmonicida 


strain LFI1238 shows extensive evidence of gene decay. BMC 

Genomics 9:616 

18. Konstantinidis T, Ramette A, Tiedje JA (2006) The bacterial 

species definition in the genomic era. Phil Trans R Soc B 

361:1929–1940 

19. Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, 

Ussery DW (2007) RNAmmer: consistent and rapid annotation of 

ribosomal RNA genes. Nucleic Acids Res 35:3100–3108 

20. Larsen TS, Krogh A (2003) EasyGene—a prokaryotic gene finder 

that ranks ORFs by statistical significance. BMC Bioinformatics 

4:29 

21. Le Roux F, Zouine M, Chakroun N, Binesse J, Saulnier D, 

Bouchier C, Zidane N, Ma L, Rusniok C, Lajus A, Buchrieser C, 

Médigue C, Polz MF, Mazel D (2009) Genome sequence of Vibrio 

splendidus: an abundant planctonic marine species with a large 

genotypic diversity. Environ Microbiol 11:1959–1970 

22. Lin W, Fullner KJ, Clayton R, Sexton JA, Rogers MB, Calia KE, 

Calderwood SB, Fraser C, Mekalanos JJ (1999) Identification of 

a Vibrio cholerae RTX toxin gene cluster that is tightly linked to 

the cholera toxin prophage. Proc Natl Acad Sci U S A 96:1071– 

1076 

23. Loytynoja A, Goldman N (2005) An algorithm for progressive 

multiple alignment of sequences with insertions. Proc Natl Acad 

Sci U S A 102:10557–10562 

24. Loytynoja A, Goldman N (2008) Phylogeny-aware gap placement 

prevents errors in sequence alignment and evolutionary analysis. 

Science 320:1632–1635 

25. Makino K, Oshima K, Kurokawa K, Yokoyama K, Uda T, 

Tagomori K, Iijima Y, Najima M, Nakano M, Yamashita A, 

Kubota Y, Kimura S, Yasunaga T, Honda T, Shinagawa H, Hattori 

M, Iida T (2003) Genome sequence of Vibrio parahaemolyticus: a 

pathogenic mechanism distinct from that of V. cholerae. Lancet 

361:743–749 

26. Mandel MJ, Wollenberg MS, Stabb EV, Visick KL, Ruby EG 

(2009) A single regulatory gene is sufficient to alter bacterial host 

range. Nature 458:215–218 

27. Mazel D, Le Roux F (2008) FM954973.1: Direct submission to 

GenBank 

28. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R 

(2005) The microbial pan-genome. Curr Opin Genet Dev 

15:589–594 

29. Medrano-Soto A, Moreno-Hagelsieb G, Vinuesa P, Christen JA, 

Collado-Vides J (2001) Successful lateral transfer requires codon

usage compatibility between foreign genes and recipient genomes. 

Mol Biol Evol 21:1884–1894 

30. Mohapatra SS, Ramachandran D, Mantri CK, Colwell RR, Singh 

DV (2009) Determination of relationships among non-toxigenic 

Vibrio cholerae O1 biotype El Tor strains from housekeeping 

gene sequences and ribotype patterns. Res Microbiol 160: 

57–62 

31. Munk A, Tapia R, Green L, Rogers Y, Detter JC, Bruce D, Brettin TS, 

Colwell R, Grim C, Vonstein V, Bartels D. CP001485.1, 

NZ_ACHV00000000, NZ_ACHY00000000, NZ_ACHW00000000, 

NZ_ACHX00000000, NZ_ACHZ00000000, NZ_ACIA00000000, 

NZ_ACFQ00000000: Direct submission to GenBank 

32. Murray RG, Stackebrandt E (1995) Taxonomic note: implementation 

of the provisional status Candidatus for incompletely 

described procaryotes. Int J Syst Bacteriol 45:186–187 

33. Nierman WC (2006) NZ_AATY00000000: Direct submission to 

GenBank 

34. Pang B, Yan M, Cui Z, Ye X, Diao B, Ren Y, Gao S, Zhang L, 

Kan B (2007) Genetic diversity of toxigenic and nontoxigenic 

Vibrio cholerae serogroups O1 and O139 revealed by array-based 

comparative genomic hybridization. J Bacteriol 189:4837–4879 

35. Philippe H, Douady CJ (2003) Horizontal gene transfer and 

phylogenetics. Curr Opin Microbiol 6:498–505


36. Pinhassi J, Pedros-Alio C, Ferriera S, Johnson J, Kravitz S, 

Halpern A, Remington K, Beeson K, Tran B, Rogers Y-H, 

Friedman R, Venter JC (2006) NZ_AAND00000000: Direct 

submission to GenBank 

37. Pupo GM, Lan R, Reeves PR (2000) Multiple independent origins 

of Shigella clones of Escherichia coli and convergent evolution of 

many of their characteristics. Proc Natl Acad Sci U S A 

97:10567–10572 

38. Rhee JH, Kim SY, Chung SS, Lee SE, Choy HE (2002) 

AE016795.2: Direct submission to GenBank 

39. Riley MA, Lizotte-Waniewski M (2009) Population genomics and 

the bacterial species concept. Methods Mol Biol 532:367–377 

40. Rowe-Magnus DA, Guérout AM, Mazel D (1999) Superintegrons. 

Res Microbiol 150:641–651 

41. Rosenberg E, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton 

G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2006)

NZ_ABCH00000000: Direct submission to GenBank 

42. Ruby EG, Urbanowski M, Campbell J, Dunn A, Faini M, Gunsalus

R, Lostroh P, Lupp C, McCann J, Millikan D, Schaefer A, Stabb E, 

Stevens A, Visick K, Whistler C, Greenberg EP (2005) Complete 

genome sequence of Vibrio fischeri: a symbiotic bacterium with 

pathogenic congeners. Proc Natl Acad Sci U S A 102:3004–3009 

43. Sánchez J, Holmgren J (2005) Virulence factors, pathogenesis and 

vaccine protection in cholera and ETEC diarrhoea. Curr Opin 

Immunol 17:388–398 

44. Stackebrandt E, Frederiksen W, Garrity GM, Grimont PA, 

Kämpfer P, Maiden MC, Nesme X, Rosselló-Mora R, Swings J, 

Trüper HG, Vauterin L, Ward AC, Whitman WB (2002) Report of 

the ad hoc committee for the re-evaluation of the species definition 

in bacteriology. Int J Syst Evol Microbiol 52:1043–1047 

45. Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: Molecular 

Evolutionary Genetics Analysis (MEGA) software version 4.0. 

Mol Biol Evol 24:1596–1599 

46. Thompson FL, Iida T, Swings J (2004) Biodiversity of vibrios. 

Microbiol Mol Biol Rev 68:403–431 

47. Urbanczyk H, Ast JC, Higgins MJ, Carson J, Dunlap PV (2007) 

Reclassification of Vibrio fischeri, Vibrio logei, Vibrio salmonicida 

and Vibrio wodanis as Aliivibrio fischeri gen. nov., comb. 

nov., Aliivibrio logei comb. nov., Aliivibrio salmonicida comb. 

nov. and Aliivibrio wodanis comb. nov. Int J Syst Evol Microbiol 

57:2823–2829 

48. Vezzi A, Campanaro S, D'Angelo M, Simonato F, Vitulo N, Lauro 

FM, Cestaro A, Malacrida G, Simionati B, Cannata N, Romualdi 

C, Bartlett DH, Valle G (2005) Life at depth: Photobacterium 

profundum genome sequence and expression analysis. Science 

307:1459–1461

49. Wang L, Feng L, Reeves P, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, 

Wang W (2008) CP001233.1, CP001235.1: Direct submission to

GenBank 

50. Woese CR (1987) Bacterial evolution. Microbiol Rev 51:221–271



2.10 Paper V: Tools for comparison of bacterial genomes

74 Tools for Comparison of 

Bacterial Genomes 

T. M. Wassenaar 1,2 . T. T. Binnewies 1,3 . P. F. Hallin 1 . D. W. Ussery 1, * 

1 

Center for Biological Sequence Analysis, Technical University of 

Denmark, Kgs. Lyngby, Denmark 

*dave@cbs.dtu.dk

2 

Molecular Microbiology and Genomics Consultants, Zotzenheim, 

Germany 

3 

Roche Diagnostics Ltd., Advanced Systems Group, Global Platforms & 

Support, Rotkreuz, Switzerland 

1 Introduction
2 Genomic DNA Sequence Comparisons
3 Visualization of Genomic Data: The Genome Atlas
4 Whole Genome Alignment Methods
5 Comparing the Coding Fraction of Genomes
6 Codon Usage Comparisons
7 Protein Sequence Comparisons
8 Gene Synteny and Genome Islands
9 Minimal Information About a Genome Sequence
10 Research Needs

K. N. Timmis (ed.), Handbook of Hydrocarbon and Lipid Microbiology, DOI 10.1007/978-3-540-77587-4_337,
© Springer-Verlag Berlin Heidelberg, 2010


Abstract: Of the plethora of bioinformatical tools available, some useful tools that allow 

complete genome sequences to be compared are described here. Comparisons of genome 

length, base composition, gene density, numbers of tRNA and rRNA genes, and codon usage 

can provide useful biological insights. Examples are provided of a Genome Atlas plot, to 

summarize many features of a single genome, and a BLAST Atlas, in which multiple genomes 

can be combined. A table of web-services for useful tools is provided. 

1 Introduction 

Presently, there are about 900 bacterial and archaeal genomes that have been fully sequenced 

and become publicly available 1 and their number more than doubled last year. Approximately 

40% of the sequenced genomes are obtained from environmental (terrestrial and marine) 

organisms. In addition, metagenomic projects are now producing a vast amount of sequences. 

Here we provide a brief overview of methods to compare sequenced bacterial genomes. Of the
many methods available to compare bacterial genomes (Binnewies et al., 2006), Table 1
lists several that we find useful. It is beyond the scope of this review to provide a detailed

analysis of these methods, and the list is far from complete. The tools discussed here provide 

some interesting information on fundamental biological features and can be used to compare 

a few or large numbers of genomes. The tools are easy to use and produce results that are easy 

to interpret and can be graphically represented. The latter is an important quality determinant 

of any sequence analysis tool when dealing with genomes, as the complexity of input data is 

so large. 

2 Genomic DNA Sequence Comparisons 

A genome can be more than one DNA molecule. Approximately 10% of the bacterial genomes 

sequenced so far have more than one chromosome. By definition a genome includes all 

chromosomes (and plasmids) that constitute an organism’s total DNA. Chromosomes are 

essential, single-copy, independently replicating DNA molecules present in each member of 

the species. Some species contain plasmids; these are frequently strain-specific and sometimes 

(incorrectly, in our opinion) omitted from a genome sequence. 

At the time of writing, the largest bacterial genome sequenced is that of Solibacter usitatus 

(strain Ellin 6076), a soil bacterium belonging to the Acidobacteria. It consists of a single 

chromosome of 9.97 mega basepairs (Mbp). The smallest bacterial genome known is 

that of Carsonella ruddii (PV), an endosymbiont of a plant sap-feeding insect with a mere 

159,662 bp. Genome size is a rough indicator of biological adaptive potential so it is no 

surprise that soil bacteria have bigger genomes, as they have to adapt to environmental 

variation, whereas the protective niche of an endosymbiont allows for a small genome. 

The genome size of an organism is easy to calculate and tabulate. Figure 1a gives
a graphical representation for genome size variation within bacterial phyla. A “box and
whiskers” plot as shown in Fig. 1 visualizes the distribution of a property that can be

1 Completed genome statistics obtained from the NCBI Genome Project web pages: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Table 1
Methods for comparison of bacterial genomes

Method | URL | References
Length, %GC | http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi | Wheeler et al. (2007)
Chromosome alignment (ACT) | http://www.sanger.ac.uk/Software/ACT/ and http://www.webact.org/WebACT/home | Carver et al. (2005)
Chromosome alignment (MUMMER) | http://mummer.sourceforge.net | Kurtz et al. (2004)
Repeats – various | http://www.cbs.dtu.dk/services/GenomeAtlas | Ussery et al. (2004)
Repeats – tetranucleotides | http://www.megx.net/tetra | Teeling et al. (2004)
Repeats – short, tandem | http://minisatellites.u-psud.fr/GPMS/default.php | Denoeud and Vergnaud (2004)
Repeats – VNTRs | http://vntr.csie.ntu.edu.tw | Chang et al. (2007)
Replication Origins | http://www.cbs.dtu.dk/services/GenomeAtlas | Worning et al. (2006)
Noncoding RNAs | http://rfam.sanger.ac.uk | Griffiths-Jones et al. (2005)
rRNAs | http://www.cbs.dtu.dk/services/RNAmmer | Lagesen et al. (2007)
Genome Atlas | http://www.cbs.dtu.dk/services/GenomeAtlas | Hallin and Ussery (2004)
BLAST Atlas (zoomable) | http://www.cbs.dtu.dk/services/gwBrowser | Hallin and Ussery (2004)
“Genome Properties” | http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi | Selengut et al. (2007)

expressed as a numerical value, such as length, %GC, number of genes, etc. Such plots show
the spread of the data and are made as follows: the values are sorted and divided into two equal
parts, separated by the median, which is marked as a bar in the middle of the distribution. A
box is drawn to cover the range where the middle 50% of the data are (excluding the first 25%
and the last 25% of the data). The “whiskers” are the hatched lines connecting the lowest (left)
and highest (right) values, with the exception of outlier points, which are shown as individual
dots. Outliers are defined as data that lie more than 1.5 times the range of the box away from the box.
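The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to draw Fig. 1: the function name `box_whisker_stats` is hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is only one of several common conventions.

```python
from statistics import median


def box_whisker_stats(values):
    """Return median, quartiles, whisker ends and outliers for a list of numbers.

    Assumption: quartiles are the medians of the lower and upper halves of the
    sorted data; plotting packages may use slightly different definitions.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower_half = data[:half]
    upper_half = data[half + (n % 2):]  # skip the middle value when n is odd
    q1, q2, q3 = median(lower_half), median(data), median(upper_half)
    box_range = q3 - q1
    low_fence = q1 - 1.5 * box_range
    high_fence = q3 + 1.5 * box_range
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": q2,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # lowest value that is not an outlier
        "whisker_high": inside[-1],  # highest value that is not an outlier
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Made-up genome sizes in Mbp, for illustration only.
    print(box_whisker_stats([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]))
```

With the made-up values in the example, the last value falls outside the upper fence and would be drawn as an individual dot.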

The base composition of genomes, i.e., their %GC content (or %AT; the two together make
100%), can also be compared, as shown in Fig. 1b. The GC content of a genome can range
from 17% in C. ruddii to 75% in Anaeromyxobacter dehalogenans. The smallest genome is
also the most AT rich, and many of the larger genomes are quite GC rich. It is not clear if there
is a biological force in play behind this correlation, although it has been observed that the
ecological niche an organism occupies roughly correlates with both genome size and GC content
(Foerstner et al., 2005; Musto et al., 2006).
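As a small illustration of the quantity being compared, the average %GC of a sequence can be computed as below. The function name `gc_percent` is hypothetical, and the choice to ignore ambiguous symbols such as N is an assumption of this sketch, not something stated in the text.

```python
def gc_percent(sequence):
    """Return the percentage of G and C among the unambiguous bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are left out of the total
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


if __name__ == "__main__":
    # Toy sequence; %AT is simply 100 minus %GC.
    toy = "ATGCGCATTA"
    print(gc_percent(toy), 100.0 - gc_percent(toy))
```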

In addition to the average GC content for a whole genome, local variation within a given 

genome can be examined, and this reveals two general trends for almost all bacterial genomes. 

First, on a more global, chromosomal level a large region flanking the origin of DNA

[Figure 1 image: two box-and-whisker panels, “Size distribution of prokaryotic genomes (N = 779)” with axis “Genome size (Mbp)” and “AT content distribution of prokaryotic genomes (N = 779)” with axis “AT content (percent)”. Groups shown, with number of chromosomes: Crenarchaeota (16), Euryarchaeota (35), Nanoarchaeota (1), Acidobacteria (2), Actinobacteria (55), Aquificae (3), Bacteroidetes/Chlorobi (26), Chlamydiae/Verrucomicrobia (13), Chloroflexi (7), Cyanobacteria (33), Deinococcus/Thermus (4), Firmicutes (155), Fusobacteria (1), Planctomycetes (1), Alphaproteobacteria (94), Betaproteobacteria (61), Gammaproteobacteria (191), Deltaproteobacteria (21), Epsilonproteobacteria (22), Spirochaetes (16), Thermotogae (8), Other archaea (1), Other bacteria (13).]

Figure 1
(a) Box and Whisker plot of genome length distribution for 779 bacterial chromosomes, grouped by phyla. The phylum and the number of chromosomes
included are indicated at the left. Each phylum is colored according to our GenomeAtlas website. (b) The distribution of average chromosomal AT content
for the same set of bacterial genomes.

replication tends to be more GC rich, and the region around the replication terminus usually
is more AT rich. AT-rich sequences melt more easily than GC-rich sequences, due in part to the
extra hydrogen bond present in a GC base pair. Counterintuitively, this would make the origin
of replication the least likely place for replication to start. However, within the “large region” around
the origin, covering approximately 5% of the chromosome, there is a short stretch of more AT-rich
basepairs, where the replication origin bubble opens up. Second, zooming in on genes, the
average GC content of intergenic regions is generally lower than that of coding sequences.
These regions melt more readily and are more curved and more rigid than the chromosomal
average, which enables gene expression (Pedersen et al., 2000; Ussery and Hallin, 2004).
This is true for nearly all of the bacterial genomes sequenced, regardless of GC content. In order
to calculate relative or local %GC, a window has to be defined (say, 100 basepairs)
for which the %GC is calculated. This window is then moved along the genome in single-nucleotide
steps, and the %GC is assigned to the middle of each window. These scores can
then be graphically represented. A web-based tool for this is available at the Genome Atlas
Website 2 in which local %GC can be visualized by color codes as discussed below.
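The sliding-window calculation just described can be sketched as follows. This is an illustrative sketch, not the code behind the Genome Atlas server: the function name `local_gc` is hypothetical, the sequence is treated as linear (windows do not wrap around a circular chromosome), and every symbol other than G or C counts as non-GC.

```python
def local_gc(sequence, window=100):
    """Return (centre_index, percent_GC) for each full window, stepping one base at a time.

    Indices are 0-based; the score of a window starting at `start` is assigned
    to position start + window // 2.
    """
    seq = sequence.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    is_gc = [base in ("G", "C") for base in seq]
    count = sum(is_gc[:window])  # number of G/C in the first window
    results = [(window // 2, 100.0 * count / window)]
    for start in range(1, len(seq) - window + 1):
        # Update the running count: one base enters the window, one leaves it.
        count += is_gc[start + window - 1] - is_gc[start - 1]
        results.append((start + window // 2, 100.0 * count / window))
    return results


if __name__ == "__main__":
    # Toy example with a tiny window so the output is easy to follow.
    for centre, gc in local_gc("GGAATT", window=2):
        print(centre, gc)
```

Keeping a running count avoids recounting the whole window at every step, which matters when the window is moved along several megabases one nucleotide at a time.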

3 Visualization of Genomic Data: The Genome Atlas 

Genome atlases are circular plots of chromosomes or plasmids (a linear version is available 

when applicable) on which general properties of the DNA molecule are plotted as colors. 

Genome atlases are available from our web server 2 for many of the currently sequenced 

bacterial genomes. Figure 2 shows a Genome Atlas for the chromosome of Geobacillus

kaustophilus strain HTA426 (a thermophilic Firmicute that also contains a plasmid of 4.8 kb). 

This isolate was obtained from a deep sea sediment of the Mariana Trench in the Pacific Ocean 

(Takami et al., 2004a, b). Its genome is 3.5 Mbp long and contains 52.1% GC. G. kaustophilus 

has been suggested to provide a possible solution for paraffin deposition problems with oil 

production (Sood and Lal, 2008). A Genome Atlas maps four different aspects of the 

chromosomal DNA sequence in various lanes in a standard manner: DNA structural features 

are represented in the three outer lanes, all coding sequences are indicated in the next lane, two 

kinds of repeats are mapped in the next two lanes, and base composition properties are plotted 

in the two innermost lanes (Jensen et al., 1999). The scale in the center corresponds with the 

sequence numbering in GenBank. The DNA structural features of the three outermost circles 

are based on the physical chemical properties of the DNA helix. The annotated genes are given 

in blue for protein-coding genes oriented clockwise, and red for genes on the other strand 

(counterclockwise). The tRNA and rRNA genes have their own color. The clockwise strand 

corresponds with the sequence stored in GenBank (genes on the other strand are annotated as 

“complement” there). To identify global repeats (sequences that are repeated somewhere

else on the chromosome) we search for the best match of a 100 bp window against the entire 

chromosome. Searching on the positive strand results in direct repeats (both sequences run in 

the same direction) whilst searching on the negative strand gives inverted repeats (the two 

repeat units run in opposite directions). For most of these general properties summarized in a 

Genome Atlas (structural properties, repeats, base composition) dedicated atlases are also 

available, where more features are given (such as local and simple repeats in a Repeat Atlas, or 

2 http://www.cbs.dtu.dk/services/GenomeAtlas/ 


[Figure 2 image: Genome Atlas of G. kaustophilus HTA426, main chromosome (3,544,776 bp); legend includes Intrinsic curvature, Annotations (CDS +, CDS –, rRNA, tRNA), Global inverted repeats, GC skew and Percent AT.]

Figure 2
Genome atlas of the main chromosome of Geobacillus kaustophilus. See text for further explanation.


base composition in a Base Atlas). Such specialized atlases are explained in detail in a book that 

we recently produced (Ussery et al., 2008). 
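The distinction between direct and inverted global repeats made above can be illustrated with a deliberately simplified sketch. The atlas method searches for the best match of each 100 bp window; the code below instead looks only for exact matches, which is an assumption made to keep the example short, and the function names are hypothetical. It also stores every window in memory, so it is suited to toy sequences rather than whole chromosomes.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_exact_repeats(sequence, window=100):
    """Return (direct, inverted) lists of window start positions (0-based).

    direct:   the window occurs again, in the same orientation, elsewhere.
    inverted: the reverse complement of the window occurs elsewhere.
    Simplification: only exact matches are found, not best matches.
    """
    seq = sequence.upper()
    positions = {}
    for i in range(len(seq) - window + 1):
        positions.setdefault(seq[i:i + window], []).append(i)

    direct, inverted = [], []
    for i in range(len(seq) - window + 1):
        word = seq[i:i + window]
        if len(positions[word]) > 1:
            direct.append(i)
        hits = positions.get(reverse_complement(word), [])
        # Ignore a palindromic window that only matches itself.
        if any(j != i for j in hits):
            inverted.append(i)
    return direct, inverted


if __name__ == "__main__":
    print(find_exact_repeats("ACCGTACCG", window=4))    # one direct repeat
    print(find_exact_repeats("AACCTTTGGTT", window=4))  # one inverted repeat
```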

As can be seen in Fig. 2, the genes in this chromosome strongly favor one strand:
the positive strand for the first (right) half and the negative strand for the second (left) half of
the chromosome. In both halves this is the leading strand during replication. Replication starts
at the origin (the 12 o’clock position here) and proceeds in both directions along the circle, with
both a leading and a lagging strand, until the bubble reaches the terminus, at 6 o’clock, and the
ends are joined. The positive strand represented by a genome sequence is the leading
strand only for the first half, up to the terminus. Reading across the terminus along the
sequence on the same strand, one enters the lagging strand. Gene preference for the leading
strand is a general feature of Firmicutes and of some other bacteria.
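The rule described in this paragraph can be written down as a short sketch. It assumes, as in Fig. 2, that the sequence numbering starts at the replication origin and that the terminus lies at a known coordinate (by default halfway around the chromosome); both assumptions must be checked for any real genome, and the function names are hypothetical.

```python
def is_leading(position, strand, genome_length, terminus=None):
    """Return True if a gene at `position` on `strand` ('+' or '-') is on the leading strand.

    Assumptions: the origin of replication is at coordinate 0 and the terminus
    at `terminus` (default: half the genome length).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if terminus is None:
        terminus = genome_length / 2
    first_half = (position % genome_length) < terminus
    # '+' is leading before the terminus, '-' is leading after it.
    return (strand == "+") == first_half


def leading_fraction(genes, genome_length, terminus=None):
    """Return the fraction of (position, strand) genes located on the leading strand."""
    flags = [is_leading(pos, strand, genome_length, terminus) for pos, strand in genes]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Made-up gene list on a 1,000 bp toy chromosome.
    toy_genes = [(100, "+"), (400, "-"), (600, "-"), (900, "+")]
    print(leading_fraction(toy_genes, genome_length=1000))
```

A fraction well above 0.5 would correspond to the strand preference visible in the annotation lane of Fig. 2.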

In Fig. 2 the two outward lanes identify some regions with strong structural properties

(for instance the region around 2 o’clock, indicated by a black line). The observed strong 

curvature (blue in the outward lane) where the DNA would easily melt (red in the second lane) 

suggests this region contains genes that are highly expressed. 

There are a number of global repeats, notably in the first quarter of the chromosome. Note 

that the ribosomal RNA genes (light blue in the annotation lane) are located here, as indicated 

by the arrows, and these are picked up as global repeats, as indeed they are repeated genes. 

The GC skew lane shows the bias of G’s towards one strand or the other, averaged over a 

10,000 bp window. In contrast to many Firmicutes with a strong GC skew, this genome only 

has a weak GC skew (the right half is light blue and the left half is light pink). The innermost 

circle colors the local AT content when it is more than three standard deviations distant from 

the global average. Note a light red color around the 2 o’clock region: this local deviation in AT 

content is related to the structural features located here. 
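The two innermost lanes can be approximated with the sketch below. It rests on several assumptions not spelled out in the text: GC skew is taken as (G − C)/(G + C), a commonly used definition; windows are non-overlapping rather than sliding; and the three-standard-deviation test is applied to the list of per-window values. The function names are hypothetical.

```python
from statistics import mean, pstdev


def gc_skew(sequence, window=10_000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows.

    Assumption: this common definition of GC skew is used; a window without
    any G or C gets a skew of 0.0.
    """
    seq = sequence.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def flag_deviations(values, n_sd=3.0):
    """Return True for each value lying more than n_sd standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    return [abs(v - mu) > n_sd * sd for v in values]


if __name__ == "__main__":
    # Toy sequence with a tiny window; real use would keep the 10,000 bp default.
    print(gc_skew("GGGC" + "CCCG" + "ATAT", window=4))
    # Made-up local AT fractions: one window stands out from the rest.
    print(flag_deviations([0.5] * 19 + [0.9]))
```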

The Genome Atlas of the archaeon Methanosarcina acetivorans, shown in Fig. 3, tells a
different story. This strictly anaerobic organism so efficiently produces methane that it is held
responsible for virtually all biogenic methane. It can also oxidize CO to CO2 (Lessner et al.,
2006). Strain C2A (the type strain of the species) was isolated from a marine sediment
(Galagan et al., 2002). Its genome is 5.7 Mbp long and contains 42.7% GC. The Genome
Atlas shows that its genes are evenly distributed over the two strands, and a GC skew is absent.
Instead, the lower quarter of the genome contains many strong structural features. The genome

only contains three rRNA gene copies (indicated by arrows) one of which is located on the 

negative strand (but as discussed above, this is actually the leading strand, as is preferred for 

nearly all bacterial rRNA genes). Many other global repeats are visible, notably in the region 

around 1.2 Mbp, which is strongly curved and easily melted, and is slightly more AT rich than 

the rest of the genome. Here, the important carbon-monoxide dehydrogenase gene locus is 

present, as are multiple transposases, which could be an indication of horizontally acquired 

DNA. The genome is relatively poorly annotated, with many genes given as “predicted
protein” only, which is not uncommon for archaeal genomes.

In conclusion, a Genome Atlas combines a number of features in one single figure that

summarizes a very long and detailed story about a chromosome or plasmid. 

4 Whole Genome Alignment Methods 


Another way to compare genomes is based on alignment of nucleotide or amino acid 

sequences. Sequence alignment is a common tool to identify similarities, with BLAST, for

[Figure 3 image: Genome Atlas of M. acetivorans C2A (5,751,492 bp); legend includes Intrinsic curvature, Annotations (CDS +, CDS –, rRNA, tRNA), Global inverted repeats, GC skew and Percent AT.]

Figure 3
Genome atlas of the main chromosome of the archaeon Methanosarcina acetivorans.

Basic Local Alignment Search Tool, the most common (Altschul et al., 1990). However,
BLAST is not automatically suitable for large DNA input segments such as complete
genomes. A more suitable program to align sequences in the range of megabases is MUMmer,
developed at TIGR, of which version 3 is now publicly available (Kurtz et al., 2004).
This method has recently been extended to include the average nucleotide identity in the
conserved core genes of a set of genomes (Deloger et al., 2009). Graphical representation
of the resulting alignment also becomes an issue, and specific tools have been designed to align
genome sequences and visualize the alignments. One worth mentioning is the Artemis Comparison
Tool (ACT), of which two versions are available: a downloadable version to be used on a local
computer (Carver et al., 2005) and a web-based version with pre-computed comparisons
between several hundred bacterial genomes. 3 BLAST results of entire bacterial chromosomes
against each other have also been used to construct phylogenetic trees (Henz et al., 2005). BLAST
comparisons are treated in Section 7 of this chapter.

5 Comparing the Coding Fraction of Genomes 

The typical coding density for a bacterial genome is about 90%, ranging from 95%
for Pelagibacter ubique (an alphaproteobacterial marine bacterium that is among the most numerous
bacteria in the world) (Giovannoni et al., 2005) to around 75% for M. acetivorans.
Intracellular bacteria can have a coding density as low as 50%. This means that the majority
of bacterial DNA codes for genes, which are generally not spliced, as introns are absent
(with very few exceptions). However, not every open reading frame is a gene, and it
appears that many bacterial genomes are over-annotated, predicting 10–15% more genes
than are real (Skovgaard et al., 2001). These over-annotated genes are frequently short
open reading frames. In addition, genes can be missed in the annotation. A frequent mistake
is that genes are annotated on the wrong strand, which can happen if the reading frame is
open in either direction. The intergenic regions separating genes regulate transcription,
and in intracellular bacteria they frequently contain pseudogenes or repeats. Genes not coding
for proteins include tRNA and rRNA genes, and some parts of intergenic regions are
transcribed into stable RNAs that do not code for proteins. E. coli
contains several hundred small non-coding RNA (ncRNA) genes (Chen et al., 2002) that
can act as regulators (Gottesman, 2005). Their role in environmental bacteria is virtually
unexplored.
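The coding density quoted at the start of this section can be estimated from a list of annotated gene coordinates, as in the sketch below. It assumes 1-based inclusive (start, end) coordinates with start ≤ end, counts bases covered by overlapping genes only once, and does not handle genes that span the origin of a circular chromosome; the function name `coding_density` is hypothetical.

```python
def coding_density(genes, genome_length):
    """Return the percentage of the genome covered by at least one gene.

    `genes` is an iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping genes (also on opposite strands) are merged before counting.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # A new, non-overlapping stretch begins: close the previous one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / genome_length


if __name__ == "__main__":
    # Made-up annotation on a 1,000 bp toy genome.
    print(coding_density([(1, 300), (250, 400), (501, 600)], 1000))
```

Because over-annotated short open reading frames add to the covered length, a density computed this way inherits any errors in the underlying annotation.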

Although tRNA and rRNA genes are essential to life, they are sometimes missed in the 

annotation of a genome, a rather embarrassing omission, or occasionally annotated on 

the wrong strand (Lagesen et al., 2007). The number and location of rRNA operons in a 

genome can say something about an organism. It appears that organisms with short doubling 

times have larger numbers of rRNA and tRNA genes. Comparing > Figs. 2 and 3 it is 

likely that G. kaustophilus with 9 rRNA copies, nearly all located close to the origin of

replication (which boosts expression during replication as their copy number increases) can 

divide more quickly than M. acetivorans which only has three copies. Some really fast-growing 

bacteria can have 14 or more rRNA copies, as can be viewed from our list of genomes. 4 

3 http://www.webact.org/WebACT/home 

4 www.cbs.dtu.dk/services/GenomeAtlas/ 




6 Codon Usage Comparisons 

Once the genes of a given genome have been defined, their codon usage can be analyzed. Since 

the genetic code is redundant, with up to 6 codons per amino acid, variable codons are used at 

different frequencies. Much of the redundancy in the genetic code is due to third base 

variation. > Figure 4 displays the amino acid usage for three prokaryotic genomes: Methanosphaera 

stadtmanae (27.6% GC), an archaeal methanogen that uses methanol and hydrogen to 

produce methane; Desulfitobacterium hafniense (47.4% GC), a Firmicute that efficiently 

dehalogenates tetrachloroethene and polychloroethanes; and Anaeromyxobacter dehalogenans 

(75% GC). This species, the first myxobacterium to be grown as a pure culture, can use ortho-substituted

mono- and dichlorinated phenols. The frequency of each possible codon is plotted 

in a wheel plot in the upper part of the figure, arranged such that their third base is conserved 

in each quarter. The bias in codon usage towards the third position can also be seen in the 

sequence logo plots in the lower part of > Fig. 4. From both graphics it is evident that genomic 

GC content highly affects codon use (or the other way round). Based on a genome’s bias in 

codon usage, it is possible to predict its likely environmental niche (Willenbrock et al., 2006). 

Moreover, it is known that amino acid usage (not shown here) depends on environment, based 

on analysis of metagenomic samples (Musto et al., 2006, Foerstner et al., 2005). 
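As an illustration of the codon usage analysis described in this section, the sketch below counts codons in a coding sequence and reports the GC fraction at each of the three codon positions, which is where the third-position bias shows up. It is a minimal example under stated assumptions: the input is a single in-frame coding sequence given as DNA, and the function names are hypothetical.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in an in-frame coding sequence (DNA input, RNA codons out)."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def positional_gc(cds):
    """Return the GC fraction at codon positions 1, 2 and 3."""
    dna = cds.upper()
    usable = len(dna) - len(dna) % 3
    fractions = []
    for pos in range(3):
        bases = dna[pos:usable:3]  # every third base, starting at this position
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


if __name__ == "__main__":
    toy_cds = "ATGGCCGCGTAA"  # hypothetical four-codon example
    print(codon_counts(toy_cds))
    print(positional_gc(toy_cds))
```

Applied to all annotated genes of a genome and normalized per amino acid, counts like these are the raw material for plots of the kind shown in Figure 4.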

7 Protein Sequence Comparisons 

One can compare each individual gene in a given genome by BLAST against a set of genomes. 

This produces a huge amount of data that can be graphically represented in a BLAST Matrix 

(Binnewies et al., 2005, Ussery et al., 2009). A BLAST Matrix is not symmetrical, as the 

outcome is determined by which genome is used as query sequence. The diagonal of a BLAST 

matrix represents a BLAST of a genome against itself. The self-match (the gene finding itself) is

discarded, thus the reported scores reflect internal homologues present in a given genome. 

Most of these have been derived from gene duplication and are thus paralogs. 
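The asymmetry of a BLAST Matrix can be illustrated with a small sketch. It assumes the BLAST searches have already been run and summarized as, for every ordered (query, subject) genome pair, the set of query proteins with a significant hit, self-matches excluded. The data layout and function name are assumptions made for this example only.

```python
def blast_matrix(proteomes, hits):
    """Fraction of each query proteome that has a hit in each subject genome.

    proteomes: dict mapping genome name -> set of protein identifiers.
    hits: dict mapping (query, subject) -> set of query proteins with a
          significant hit in the subject (self-matches already removed).
    Both layouts are hypothetical, chosen for illustration.
    """
    matrix = {}
    for query, proteins in proteomes.items():
        for subject in proteomes:
            matched = hits.get((query, subject), set()) & proteins
            matrix[(query, subject)] = len(matched) / len(proteins) if proteins else 0.0
    return matrix


if __name__ == "__main__":
    toy_proteomes = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}}
    toy_hits = {
        ("A", "B"): {"a1", "a2"},
        ("B", "A"): {"b1", "b2"},
        ("A", "A"): {"a3"},  # internal homologue on the diagonal
    }
    for cell, value in sorted(blast_matrix(toy_proteomes, toy_hits).items()):
        print(cell, value)
```

Because the denominator is the size of the query proteome, the cell for (A, B) generally differs from the cell for (B, A), and the diagonal reports only internal homologues.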

When more information needs to be visualized, a BLAST Atlas is helpful. Such an atlas uses

one genome as a reference against which the gene conservation of other genomes is plotted 

(Hallin and Ussery, 2004, Skovgaard et al., 2002). In this case gene location only refers to the 

location in the reference genome, which of course can be varied in multiple BLAST Atlases. 

A BLAST Atlas is also a suitable platform to visualize metagenomic data. So far, we have 

not dealt with metagenomics extensively, mainly because this approach very rarely results in 

completely assembled microbiological genomes. But for a BLAST Atlas, that is not a problem, 

as one can combine all the metagenomic DNA in one lane, thereby ignoring from which 

organism the detected genes originated. All obtained BLAST hits are plotted around a 

reference genome. An example of a BLAST Atlas is given in > Fig. 5, centered around 

Pelotomaculum thermopropionicum, a thermophilic, syntrophic Firmicute that can utilize

1-butanol, 1-propanol, 1-pentanol or 1,3-propanediol as a carbon source. Note that despite 

the high number of lanes, conserved and variable genes can still be easily visually inspected. 

From compacting a single genome into a Genome Atlas, we have now moved several levels up

and compact multiple genomes into a single atlas. In > Fig. 5, the P. thermopropionicum 

genome is compared to many species of Clostridia, as well as other bacteria. Unfortunately, 

very few BLAST hits were found with the metagenomics samples so there is very little color in 

those three lanes. Compared to well characterized genomes (like E. coli), relatively few hits are

Figure 4

Frequency wheel plots of codon usage (top) and sequence logo plots (bottom) of Anaeromyxobacter dehalogenans (left), Desulfitobacterium hafniense (middle) and Methanosphaera stadtmanae (right). Panel labels in the figure: Methanosphaera stadtmanae DSM 3091 (72% AT), Desulfitobacterium hafniense Y51 (53% AT) and Anaeromyxobacter dehalogenans 2CP-C (25% AT). The logo plots show base frequencies at the 1st, 2nd and 3rd codon positions.

Legend of Figure 5: reference genome P. thermopropionicum SI (3,025,375 bp); lanes for 2 Alkaliphilus species, Bacillus fragilis, 17 Clostridium species, 4 Desulfitobacterium species, E. coli K-12 and 6 other species belonging to Clostridia.

. Figure 5 

BLAST Atlas with Pelotomaculum thermopropionicum as the reference genome. Around this the

BLAST hits of 31 genomes of other bacteria are added as listed to the right, from the outermost 

circle (top in the legend), to the innermost circle of the bacterial genomes (bottom of legend). 

The outermost lane shows the hits of P. thermopropionicum in the UniProt database (which 

does not contain all annotated genes as it requires biological evidence of a gene product). 

The next three lanes are metagenomic DNA samples from...[Dave specify] and next follow 

30 genomes of other bacteria as listed to the right. 

found in other genomes, indicated by lack of strong colour in most of the lanes in Figure 5. 

This is probably a reflection of the huge diversity in DNA content in such samples, reducing 

the chance of a BLAST hit. It is a sobering thought that there is still so little we know, and so 

much that remains to be discovered in the microbial world. 

There are many methods being developed which utilize sets of conserved genes and gene

families in related organisms to cluster organisms into groups; these groups can represent 

known taxonomic relationships. For example, certain genes might be common to a set of 

organisms growing in a particular ecological niche. Some examples of such regions along the 

chromosome can be seen in the BLAST atlas plots where genomes of related organisms of 

different species are compared.

8 Gene Synteny and Genome Islands 

A comparison of genes present, absent or diverged between genomes usually ignores gene synteny: 

the position at which such genes are found. The term was coined for eukaryotes to describe genes 

that were located on the same chromosome; in bacterial genomes the local neighboring genes, 

their order and direction are usually compared. The closer two organisms are, the more likely is 

gene synteny to be conserved (between genomes of the same genus, or species, subspecies or 

phylogenetic clade, in increasing order). Gene synteny is destroyed by inversions (changing the

direction of one or several genes), translocations (changing the position of genes) and insertion 

and deletion events. All of these can result from mistakes during replication, or be the result of 

self-replicating mobile elements, such as bacteriophages, integrons, transposons etc. 

The events that affect gene synteny, combined with point mutations accumulating during 

replication are the two major forces that increase genetic diversity; selection of those organisms 

that are fittest to survive particular conditions decreases diversity. Evolution further 

depends on the change of such selective conditions. With a slow but steady re-shuffling of 

genes by evolutionary processes, a pattern emerges of a genetic ‘‘backbone’’ of genes whose 

location is relatively conserved between genomes of reasonable genetic distance, and groups of 

‘‘cluttered’’ genes that are far more variable, in what have been termed ‘‘genome islands.’’ 

Genome islands usually contain genes that are all involved in a particular phenotypic process. 

Examples are pathogenicity islands, symbiosis islands, metabolic islands or magnetosome 

islands. Specific cases include sulfur metabolism islands discovered in metagenomic sequences from

marine sediments (Mussmann et al., 2005) or the magnetosome island containing all genes 

that produce the intracellular organelle enabling magnetotactic bacteria to orient themselves 

along magnetic field lines (Richter et al., 2007). The evolutionary advantage of genome islands 

is obvious. They can be regarded as genetic ‘‘building blocks’’; when transferred from one 

organism to the next, they can confer a complete phenotypic trait to the acceptor, enabling, 

for instance, adaptation to a novel ecological niche. 

9 Minimal Information About a Genome Sequence 

Genome sequences are stored in public databases such as GenBank under their biological 

names (preceded by ‘‘candidatus’’ for undecided taxonomic position), or by a code of 

numbers and letters for unculturable organisms that have not been classified. Unfortunately, 

other relevant information is often lacking. It has become apparent that biological and 

environmental data are important, and a recent standard for ‘‘Minimal Information about a 

Genome Sequence’’ has been proposed (Field et al., 2008). The Genomic Standards Consortium 

(GSC, http://gensc.org) promotes the standardization of genome sequencing descriptions

and their exchange and integration in the scientific community. Overall, it is important 

that genome sequence information is released into the public domain in a timely manner so 

that global scientific progress can be maintained. 

10 Research Needs 



For very few environmental species multiple genome sequences are available. From genomic 

intra-species comparisons of pathogenic bacteria we know that these provide an extra layer of


information, as genetic diversity within a bacterial species can be enormous. When multiple 

genomes are available for a species we can define its core genome (all genes that are present in 

all genomes of that species), its pan-genome (all genes that have been found in that species) 

and its dispensable genes that are responsible for the variation between isolates. Multiple 

genomes per species, together with more metagenomic data and more archaeal genome 

sequences, comprise our most urgent data gaps. The research tools for analysis of the 

genomes are available. Generate the sequences and the feast can begin. 
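The definitions of core genome, pan-genome and dispensable genes given above map directly onto set operations. The sketch below assumes that genes have already been clustered into families, so that each isolate is represented by a set of family identifiers. The function name and the toy data are hypothetical.

```python
def core_pan_dispensable(genomes):
    """Return (core, pan, dispensable) gene-family sets for one species.

    genomes: dict mapping isolate name -> set of gene-family identifiers
             (assumed to come from a prior clustering step).
    """
    family_sets = list(genomes.values())
    if not family_sets:
        return set(), set(), set()
    core = set.intersection(*family_sets)  # present in every isolate
    pan = set.union(*family_sets)          # present in at least one isolate
    return core, pan, pan - core           # dispensable = pan minus core


if __name__ == "__main__":
    toy_isolates = {
        "isolate1": {"f1", "f2", "f3"},
        "isolate2": {"f1", "f2", "f4"},
        "isolate3": {"f1", "f5"},
    }
    core, pan, dispensable = core_pan_dispensable(toy_isolates)
    print(sorted(core), sorted(pan), sorted(dispensable))
```

As more isolates are added, the core can only shrink or stay the same while the pan-genome can only grow or stay the same, which is why multiple genomes per species are needed to estimate either.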

References 


Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ 

(1990) Basic local alignment search tool. J Mol Biol 

215: 403–410. 

Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW 

(2005) Genome update: proteome comparisons. 

Microbiology 151: 1–4. 

Binnewies TT, et al. (2006) Ten years of bacterial genome 

sequencing: comparative-genomics-based discoveries. 

Funct Integr Genomics 6: 165–185. 

Carver TJ, Rutherford KM, Berriman M, Rajandream 

MA, Barrell BG, Parkhill J (2005) ACT: the Artemis 

Comparison Tool. Bioinformatics 21: 3422–3423. 

Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY 

(2007) VNTRDB: a bacterial variable number tandem 

repeat locus database. Nucleic Acids Res 35: 

D416–D421. 

Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, 

Ecker DJ, Blyn LB (2002) A bioinformatics based 

approach to discover small RNA genes in the Escherichia 

coli genome. Biosystems 65: 157–177. 

Deloger M, El Karoui M, Petit MA (2009) A genomic 

distance based on MUM indicates discontinuity between 

most bacterial species and genera. J Bacteriol 

191: 91–99. 

Denoeud F, Vergnaud G (2004) Identification of polymorphic 

tandem repeats by direct comparison of 

genome sequence from different bacterial strains: a 

web-based resource. BMC Bioinformatics 5: 4. 

Field D, et al. (2008) The minimum information about a 

genome sequence (MIGS) specification. Nature Biotechnol 

26:541–547. 

Foerstner KU, von Mering C, Hooper SD, Bork P (2005) 

Environments shape the nucleotide composition of 

genomes. EMBO Rep 6: 1208–1213. 

Galagan JE, et al. (2002) The genome of M. acetivorans 

reveals extensive metabolic and physiological diversity. 

Genome Res 12: 532–542. 

Giovannoni SJ, et al. (2005) Genome streamlining in a 

cosmopolitan oceanic bacterium. Science 309: 

1242–1245. 

Gottesman S (2005) Micros for microbes: non-coding 

regulatory RNAs in bacteria. Trends Genet 21: 

399–404. 

Griffiths-Jones S, Moxon S, Marshall M, Khanna A, 

Eddy SR, Bateman A (2005) Rfam: annotating 

non-coding RNAs in complete genomes. Nucleic 

Acids Res 33: D121–D124. 

Hallin PF, Binnewies TT, Ussery DW (2008) The genome 

BLAST atlas - a GeneWiz extension for visualization 

of whole-genome homology. Mol Biosyst 4: 363–371. 

Hallin PF, Ussery DW (2004) CBS Genome Atlas 

Database: a dynamic storage for bioinformatic results 

and sequence data. Bioinformatics 20: 3682–3686. 

Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, 

Schuster SC (2005) Whole-genome prokaryotic 

phylogeny. Bioinformatics 21: 2329–2335. 

Jensen LJ, Friis C, Ussery DW (1999) Three views of 

microbial genomes. Res Microbiol 150: 773–777. 

Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M, 

Antonescu C, Salzberg SL (2004) Versatile and open 

software for comparing large genomes. Genome Biol 

5: R12. 

Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, 

Rognes T, Ussery DW (2007) RNAmmer: consistent 

and rapid annotation of ribosomal RNA genes. 

Nucleic Acids Res 35: 3100–3108. 

Lessner DJ, et al. (2006) An unconventional pathway for 

reduction of CO 2 to methane in CO-grown Methanosarcina 

acetivorans revealed by proteomics. Proc 

Natl Acad Sci USA 103: 17921–17926. 

Mussmann M, Richter M, Lombardot T, Meyerdierks A, 

Kuever J, Kube M, Glöckner FO, Amann R (2005) 

Clustered genes related to sulfate respiration in uncultured 

prokaryotes support the theory of their 

concomitant horizontal transfer. J Bacteriol. 187: 

7126–7137. 

Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, 

Bernardi G (2006) Genomic GC level, optimal 

growth temperature, and genome size in prokaryotes. 

Biochem Biophys Res Commun 347: 1–3. 

Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, 

Ussery DW (2000) A DNA structural atlas for 

Escherichia coli. J Mol Biol 299: 907–930. 

Richter M, Kube M, Bazylinski DA, Lombardot T, 

Glöckner FO, Reinhardt R, Schüler D (2007) Comparative 

genome analysis of four magnetotactic

bacteria reveals a complex set of group-specific 

genes implicated in magnetosome biomineralization 

and function. J Bacteriol 189: 4899–4910. 

Selengut JD, et al. (2007) TIGRFAMs and Genome Properties: 

tools for the assignment of molecular function 

and biological process in prokaryotic genomes. 

Nucleic Acids Res 35: D260–D264. 

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A 

(2001) On the total number of genes and their 

length distribution in complete microbial genomes. 

Trends Genet 17: 425–428. 

Skovgaard M, Jensen LJ, Friis C, Stærfeldt HH, Worning P, 

Brunak S, Ussery D (2002) The atlas visualisation of 

genome-wide information. Meth Microbiol. 33: 

49–63. 

Sood N, Lal B. (2008). Isolation and characterization of a 

potential paraffin-wax degrading thermophilic bacterial 

strain Geobacillus kaustophilus TERI NSM for 

application in oil wells with paraffin deposition 

problems. Chemosphere 70: 1445–1451. 

Takami H, et al. (2004a) Genomic characterization of 

thermophilic Geobacillus species isolated from the 

deepest sea mud of the Mariana Trench. Extremophiles 

8: 351–356. 

Takami H, et al. (2004b) Thermoadaptation trait 

revealed by the genome sequence of thermophilic 



Geobacillus kaustophilus. Nucl Acids Res 32: 

6292–6303. 

Teeling H, Waldmann J, Lombardot T, Bauer M, 

Glöckner FO (2004) TETRA: a web-service and a 

stand-alone program for the analysis and comparison 

of tetranucleotide usage patterns in DNA 

sequences. BMC Bioinformatics 5: 163. 

Ussery DW, Hallin PF (2004) Genome update: AT content 

in sequenced prokaryotic genomes. Microbiology 

150: 749–752. 

Ussery DW, Borini S, Wassenaar TM (2009) Computing 

for Comparative Microbial Genomics: Bioinformatics 

for Microbiologists (Computational series) 

London, Verlag: Springer. 

Wheeler DL, et al. (2007) Database resources of the 

National Center for Biotechnology Information. 

Nucleic Acids Res 35: D5–D12. 

Willenbrock H, Friis C, Friis AS, Ussery DW (2006) An 

environmental signature for 323 microbial genomes 

based on codon adaptation indices. Genome Biol 7: 

R114. 

Worning P, Jensen LJ, Hallin PF, Staerfeldt HH, 

Ussery DW (2006) Origin of replication in circular 

prokaryotic chromosomes. Environ Microbiol 8: 

353–361.

Chapter 3 

rRNA operons and promoter analysis 



This chapter covers two papers (VI and VII), dealing with rRNA localization within the 

genome, and analysis of the promoter region upstream of rRNA operons. The RNAmmer 

tool (Lagesen et al., 2007) presented in paper VI was motivated by the lack of a software 

tool that was able to accurately and consistently annotate ribosomal RNA (rRNA) genes

in prokaryotes. BLAST strategies are widely used for this as the rRNA genes are highly 

conserved. However, homology search methods often produce less accurate gene boundaries

as they fail to account for the observed variation in some regions. Hidden Markov 

Model (HMM) strategies, such as RNAmmer, can take into account conserved stem loop 

structures, greatly improving the accuracy of prediction of the full length rRNA genes. 

Particular detail will be given to the E. coli rRNA operons in terms of promoter predictions, 

since much experimental information is known about this system. An application 

of the gwBrowser as a tool for visualization of promoter regions upstream of the rRNA 

operons in E. coli concludes the chapter. The gwBrowser effort is currently being published 

in the Standards In Genomic Sciences journal. The P1 and P2 prediction tools are 

still under development, and have not been published.

Encoding the central structure of the ribosome, the 5S, 16S, and 23S rRNA genes are 

essential for protein synthesis and are transcribed at high levels. In E. coli the rrn operons 

are regulated by a tandem promoter system. With abundant transcription, the system is

favorable for studying the mechanisms of highly expressed genes and establishing connections

to the physical properties of the DNA. In this work, the SIDD energy (Wang et al., 2004;

Wang & Benham, 2008) was used to measure the energy requirement to melt the DNA

helix near the promoter region. The work was carried out during my visit to Professor

Craig Benham's lab at UC Davis in fall 2007.

3.2 P1 and P2 promoters in E. coli 

The seven rRNA operons of E. coli are regulated by the two promoters P1 and P2,

where P1 is active predominantly during exponential growth whereas P2 is active during

stationary phase (Hirvonen et al., 2001; Murray & Gourse, 2004). Apart from the –10 and

–35 hexamers, the P1 site contains between 3 and 5 FIS (Factor for Inversion Stimulation) 

binding sites and an UP element. FIS has been reported to increase the transcription in 

vivo by 4-10 fold in this system (Bokal et al., 1995). 

Figure 3.1: The transcription of bacterial genes. (Diagram labels: –35 and –10 elements, σ factor, two α subunits, ββ′ subunits, transcription start site +1, CDS.)

The first step in transcription occurs when the sigma factor first binds to the -10 and 

-35 region, followed by a wrap of the DNA template around the large RNA polymerase 

holoenzyme complex, causing a bend of the DNA molecule (figure 3.1). Roughly 150 bp of 

DNA is wrapped around the polymerase, forming a constrained supercoil. The wrapping 

interaction with the two α-subunits is particularly important for the right orientation

of DNA with respect to the promoter sites and transcription initiation. 

Binding of the FIS protein can strongly bend the DNA, and if properly spaced, greatly 

facilitate the wrapping of the DNA around the alpha subunits. FIS binds the DNA

through a helix-turn-helix structure and recognizes a 15-nucleotide symmetric motif

(Hengen et al., 1997). The stress induced when FIS binds to the DNA helix

causes a bend that destabilizes the helix, lowering the energy required for melting further

downstream (Wang & Benham, 2008; Bokal et al., 1995). While being highly expressed 

during exponential phase FIS ensures an increased activity of P1 compared with P2. In an 

E. coli strain lacking the FIS protein the P2 promoter is more active during exponential

growth. The same study suggests that FIS has a repressive effect on P2 (Liebig & Wagner,

1995). Both P1 and P2 contain an UP element that binds the RNA polymerase α C-terminal

domain (αCTD). This work aims at applying an information content method to 

the P1 and P2 system, accounting for helical spacing between these regulatory elements as 

well as the conservation of the motifs. The tandem promoter system is depicted in figure

3.2. 

3.3 Conservation of regulatory elements 

Information content is widely used in bioinformatics to find and rank independent motifs 

as an alternative to machine learning approaches. Shultzaberger and co-workers have expanded 

earlier applications of information content by describing the helical facing between 

regulatory elements on the DNA strand (Shultzaberger et al., 2007). This framework allows 

for an additive combination of both aligned weight matrices and their spacing to 

produce a final score of the entire structure. In the σ70 promoter, consisting

of the –10 and –35 hexamers, the spacing corresponds to each box being located on opposite

sides of the DNA helix (see figure 3.3). 

Changing the spacing will likely cause a disruption of the binding by RNA polymerase. 

This is accounted for by applying a cosine function to the distance score (see equation 3.2). 

Shultzaberger’s equations were used to model the P1 and P2 system. 

To score a given query sequence of length L against a weight matrix, a b × p matrix 

is first generated by aligning the query sequence and the matrix. This provides all Rb,p 

Figure 3.2: The promoter structure of the rrnB operon in E. coli. (Diagram labels: P1 with Fis III, Fis II, Fis I, UP, –35 and –10 elements; P2 with –35 and –10 elements; the 16S, tRNA Glu, 23S and 5S genes; flanking labels tuB, murI and murB. Spacer windows shown: min –4 nt, center 2 nt, max 4 nt; min 0 nt, center 3 nt, max 6 nt; center 16 nt, max 19 nt.)

Figure 3.3: The –10 and –35 hexamers of the E. coli σ70 promoter are located on opposite sides of the DNA helix. Deletions or insertions in the spacer cause a shift of approx. 36° per nucleotide.


values. 

$$R_{b,p} = \log_2(4) + \log_2\left(\frac{n_{b,p}}{N}\right), \qquad R_{tot} = \sum_{p=1}^{L} R_{B,p} \qquad (3.1)$$

–where b ∈ {A, T, G, C} iterates through the four bases, p denotes the position in the

alignment, L is the length of the alignment (or width of the matrix), and nb,p is the 

number of bases b at position p, and B denotes the nucleotide at position p in the query 

sequence. Shultzaberger and co-workers account for the helical facing by introducing the 

accessibility, n(d) (equation 3.2) and the gap surprisal, GS(d) (see equation 3.3). 

$$n(d) = 1 + \cos\left[\frac{2\pi}{w}(d - c)\right] \qquad (3.2)$$

–where c is the center distance between two binding sites (i.e. the optimal spacing), d is

the query distance, and w = 10.6 is the length in base pairs of one helical turn of B-form DNA. Finally,

this gives GS(d) as follows: 

$$GS(d) = \log_2\left(\frac{n(d)}{N}\right) \qquad (3.3)$$

–where N is the sum of all n(d) (see equation 3.4). The sign of the GS(d) was changed 

from the original equation described by Shultzaberger and co-workers to allow for combining 

all scores by addition. 

$$N = \sum_{d=min}^{max} n(d) \qquad (3.4)$$

–where min and max are the boundaries of a given window examined. Finally, summing

all Ri and GS(d) values gives the total information of all motifs and all spacers (see

equation 3.5)

$$R_i(tot) = R_i(m_1) + GS(d, m_1) + R_i(m_2) + \dots + GS(d, m_{n-1}) + R_i(m_n) \qquad (3.5)$$
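Equations 3.1 to 3.5 can be sketched in a few lines of Python. This is an illustration only, not the thesis implementation. It assumes that N in equation 3.1 is the number of aligned sites, that bases never observed at a position score minus infinity (no small-sample correction is applied), and that the gap surprisal uses the sign convention adopted here so that scores add. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def weight_matrix(sites):
    """Equation 3.1: R[p][b] = log2(4) + log2(n_bp / N) from aligned sites.

    Assumption: N is the number of aligned sites; unobserved bases get -inf.
    """
    n_sites = len(sites)
    matrix = []
    for p in range(len(sites[0])):
        column = {}
        for b in BASES:
            count = sum(1 for site in sites if site[p] == b)
            column[b] = 2.0 + math.log2(count / n_sites) if count else float("-inf")
        matrix.append(column)
    return matrix


def motif_score(matrix, segment):
    """Equation 3.1, R_tot: sum the matrix entries selected by the query bases."""
    return sum(column[base] for column, base in zip(matrix, segment))


def accessibility(d, center, w=10.6):
    """Equation 3.2: n(d) = 1 + cos(2*pi/w * (d - c))."""
    return 1.0 + math.cos(2.0 * math.pi / w * (d - center))


def gap_surprisal(d, center, d_min, d_max, w=10.6):
    """Equations 3.3 and 3.4: GS(d) = log2(n(d) / N), N summed over the window."""
    total = sum(accessibility(x, center, w) for x in range(d_min, d_max + 1))
    return math.log2(accessibility(d, center, w) / total)


def total_information(motif_scores, gap_scores):
    """Equation 3.5: add all motif scores and all gap surprisals."""
    return sum(motif_scores) + sum(gap_scores)


if __name__ == "__main__":
    # Hypothetical toy alignment of -10 hexamers.
    minus10 = weight_matrix(["TATAAT", "TATAAT", "TATGAT", "TACAAT"])
    print(motif_score(minus10, "TATAAT"))
    print(gap_surprisal(16, center=16, d_min=13, d_max=19))
```

Because GS(d) is the logarithm of a normalized weight, it is always zero or negative: a spacer at the optimal distance costs the least, and one that places the boxes on the wrong face of the helix is penalized most.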

3.3.1 Modeling the P1 and P2 in selected enterics 

Existing experimentally verified –10 and –35 hexamers (Huerta & Collado-Vides, 2003) 

were converted into Rb,p matrices together with data for known UP elements (Estrem 

et al., 1998) and FIS binding sites (Hengen et al., 1997). Figure 3.4 shows logo plots of 

the information content of these studies. The initial weight matrices formed the basis

for iteratively building the final information model of the P1 and P2 promoter structure,

using the following procedure: 

1. Collect E. coli and Shigella genomes

2. Find rRNA genes and extract upstream sequences

3. Apply models based on literature weight matrices

4. Refine weight matrices according to observations

5. Formulate final model

Figure 3.4: Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d). (Axes: information in bits, 0.0 to 2.0, against position. Sequences legible in the panels: TATAAT, TTGACA, TGAAATTTTTTTTTGAAAAGTA and ATTGGTYAAAWTTTRACCAAT.)

The 16S rRNA genes of all E. coli and Shigella genomes were annotated using RNAmmer. 

For the list of genomes, see table 3.1. All 16S rRNA genes were aligned using clustalw 

(Thompson et al., 1994) and a neighbor-joining tree was constructed (see figure 3.5). The 

figure shows additional Salmonella and Yersinia genomes for comparison. 




Escherichia coli APEC O1 

Escherichia coli CFT073 

Shigella sonnei Ss046 

Shigella boydii Sb227 

Shigella flexneri 2a str. 301 

Shigella flexneri 2a str. 2457T 

Escherichia coli UTI89 

Escherichia coli K12 

Escherichia coli O157:H7 EDL933 

Escherichia coli O157:H7 str. Sakai 

Escherichia coli W3110 

Shigella dysenteriae Sd197 

Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 

Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 

Salmonella enterica subsp. enterica serovar Typhi Ty2 

Salmonella enterica subsp. enterica serovar Typhi str. CT18 

Salmonella typhimurium LT2 

Yersinia pestis Antiqua 

Yersinia pestis CO92 

Yersinia pestis KIM 

Yersinia pestis Nepal516 

Yersinia pestis Pestoides F 

Yersinia pestis biovar Microtus str. 91001 


Figure 3.5: Neighbor-joining tree of first 1k bases of all 16S rRNA genes of Yersinia, Salmonella, 

Shigella, and E. coli 


Organism                             Accession     Reference
Escherichia coli 101-1               AAMK00000000  (unpublished)
Escherichia coli 53638               AAKB00000000  (unpublished)
Escherichia coli 536                 CP000247      (Brzuszkiewicz et al., 2006)
Escherichia coli APEC O1             CP000468      (Johnson et al., 2007)
Escherichia coli B171                AAJX00000000  (unpublished)
Escherichia coli B7A                 AAJT00000000  (unpublished)
Escherichia coli B                   AAWW00000000  (unpublished)
Escherichia coli CFT073              AE014075      (Welch et al., 2002)
Escherichia coli E110019             AAJW00000000  (unpublished)
Escherichia coli E22                 AAJV00000000  (unpublished)
Escherichia coli F11                 AAJU00000000  (unpublished)
Escherichia coli K12                 U00096        (Blattner et al., 1997)
Escherichia coli O157:H7 EDL933      AE005174      (Perna et al., 2001)
Escherichia coli O157:H7 str. Sakai  BA000007      (Hayashi et al., 2001)
Escherichia coli SECEC SMS-3-5       ABAQ00000000  (unpublished)
Escherichia coli UTI89               CP000243      (Chen et al., 2006)
Escherichia coli W3110               AP009048      (Hayashi et al., 2006)
Shigella boydii CDC 3083-94          AAKA00000000  (unpublished)
Shigella boydii Sb227                CP000036      (Yang et al., 2005)
Shigella dysenteriae 1012            AAMJ00000000  (unpublished)
Shigella dysenteriae Sd197           CP000034      (Yang et al., 2005)
Shigella flexneri 2a str. 2457T      AE014073      (Liao et al., 2003)
Shigella flexneri 2a str. 301        AE005674      (Jin et al., 2002)
Shigella sonnei Ss046                CP000038      (Yang et al., 2005)

Table 3.1: Escherichia coli and Shigella genomes available at the time of the work (October 2007)

Figure 3.6: Profiles showing the maximum Ri(tot) scores of the initial weight matrices applied to E. coli and Shigella: Unadjusted P1 scores (a), Adjusted P1 scores (b), Unadjusted P2 scores (c), and Adjusted P2 scores (d). (Panel titles: "Raw combined scores, −10, −35, UP (E. coli) (N=63)" and "Adjusted combined scores, −10, −35, UP (E. coli) (N=63)", each for P1 and P2. Axes: Ri against position relative to the 16S gene start, −500 to 0.)

3.3.2 Iterating weight matrix frequencies 

The program iscan was developed to query a given DNA sequence and for every position in 

this sequence calculate the maximum Ri(tot) that can be obtained by trying out different 

spacing configurations within a specified window. The iscan algorithm aligns the first

matrix with the query (in this case the –10 hexamer) and tries all distances between 13 

and 19 nucleotides towards the –35 hexamer, using 16 nucleotides as the center. Then 

the program locks the optimal of those distances, and continues with the next box (in 

this case the UP element) until all elements have been included. For source code, see

appendix D.5. The spacing configuration of the two models is shown in figure ??. 
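The greedy search described above can be sketched as follows. This is not the iscan source from appendix D.5. It is a hypothetical illustration that assumes weight matrices are lists of per-position score dictionaries, that boxes are added outward from the anchor in the upstream direction, and that the distance d counts the nucleotides between two neighbouring boxes. The helper toy_matrix and all function names are invented for this example.

```python
import math


def motif_score(matrix, segment):
    """Sum the per-position scores selected by the bases of the segment."""
    return sum(col.get(base, float("-inf")) for col, base in zip(matrix, segment))


def gap_surprisal(d, center, d_min, d_max, w=10.6):
    """GS(d) = log2(n(d) / N) with n(d) = 1 + cos(2*pi/w * (d - center))."""
    def n(x):
        return 1.0 + math.cos(2.0 * math.pi / w * (x - center))
    return math.log2(n(d) / sum(n(x) for x in range(d_min, d_max + 1)))


def best_total(seq, anchor_start, anchor_matrix, upstream_boxes):
    """Greedy maximum Ri(tot) for one anchor position.

    upstream_boxes: list of (matrix, d_min, center, d_max), ordered from the
    box nearest the anchor outward. Returns (total, chosen distances), or
    None if a box does not fit inside the sequence.
    """
    total = motif_score(anchor_matrix, seq[anchor_start:anchor_start + len(anchor_matrix)])
    left_edge = anchor_start
    chosen = []
    for matrix, d_min, center, d_max in upstream_boxes:
        best = None
        for d in range(d_min, d_max + 1):
            start = left_edge - d - len(matrix)
            if start < 0:
                continue
            score = motif_score(matrix, seq[start:start + len(matrix)])
            score += gap_surprisal(d, center, d_min, d_max)
            if best is None or score > best[0]:
                best = (score, d, start)
        if best is None:
            return None
        total += best[0]         # lock the best distance for this box
        chosen.append(best[1])
        left_edge = best[2]      # the next box is placed relative to this one
    return total, chosen


def toy_matrix(consensus, match=2.0, mismatch=-2.0):
    """Hypothetical stand-in for a real weight matrix."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"} for c in consensus]


if __name__ == "__main__":
    seq = "GGG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
    boxes = [(toy_matrix("TTGACA"), 13, 16, 19)]
    print(best_total(seq, 26, toy_matrix("TATAAT"), boxes))
```

Locking each distance before moving to the next box keeps the search linear in the number of boxes, at the cost of possibly missing a combination that is only optimal jointly.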

The maximum Ri(tot) values of all operons were stacked and average and standard 

deviation values were plotted as function of position. Because the distance between P1/P2 

and the 16S gene varies slightly, the unadjusted plots appear noisy. Shifting the plots

slightly to align them at the local maxima around P1 and P2 renders the P1 and P2 model scores

sharper (see figure 3.6). 
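A minimal sketch of this adjustment, assuming each operon is represented by a list of scores indexed by position. The function names, the search window and the padding of shifted-out positions with None are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def align_to_peak(profile, expected, window):
    """Shift `profile` so that its local maximum within `window` positions of
    `expected` lands exactly on `expected`; vacated positions become None."""
    lo = max(0, expected - window)
    hi = min(len(profile), expected + window + 1)
    peak = max(range(lo, hi), key=lambda i: profile[i])
    shift = expected - peak
    n = len(profile)
    return [profile[i - shift] if 0 <= i - shift < n else None for i in range(n)]


def stack(profiles):
    """Per-position (mean, standard deviation) over all profiles, ignoring None."""
    columns = []
    for column in zip(*profiles):
        values = [v for v in column if v is not None]
        columns.append((mean(values), pstdev(values)) if values else (None, None))
    return columns


a = align_to_peak([0.0, 0.0, 5.0, 0.0, 0.0, 0.0], expected=3, window=2)
b = align_to_peak([0.0, 0.0, 0.0, 0.0, 7.0, 0.0], expected=3, window=2)
stacked = stack([a, b])
print(stacked[3])   # (6.0, 1.0)
```

Before the shift the two peaks sit at different positions and would blur the average; after the shift both contribute to the same column.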

3.3.3 Refining E. coli and Shigella models 

All peaks of Ri(tot) around the regions of P1 and P2 were collected, and the P1 and P2 models were refined by adjusting the matrix parameters according to the base frequencies observed in the hits obtained. The resulting logo plots are shown in figure 3.7.
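One way such a refinement could look is sketched below. It assumes that a weight is the individual-information term 2 + log2 f(b, l) for base b at position l, without a small-sample correction, and it adds a pseudocount so that unseen bases do not give minus infinity. Both the pseudocount and the function name are assumptions; the thesis does not state how the matrix parameters were recomputed.

```python
import math


def weight_matrix(sites, pseudocount=1.0):
    """Build per-position weights (bits) from aligned hit sequences of equal length."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Individual-information style weight: 2 + log2(frequency).
        matrix.append({b: 2.0 + math.log2(c / total) for b, c in counts.items()})
    return matrix


refined = weight_matrix(["TATAAT", "TATAAT", "TATAAT", "TATAAT"])
uniform = weight_matrix(["A", "C", "G", "T"])
```

With four identical hits and a pseudocount of 1, the consensus base gets a positive weight and every other base gets -1 bit, while equal base frequencies give a weight of zero everywhere.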

[Figure 3.7: sequence logo panels (a) to (g). x-axis: Position; y-axis: Bits (0.0 to 2.0). Letter strings surviving from the logo panels: TATAAT, TATTAT, TTGTCA, TTGACT, TCAAAAAATTATTTAAAATTTC, TTTGCTTGAAAAATGAGCGGT, TCAGAAAAAGAAAGCAAAAAAA.]

Figure 3.7: Logos showing the base composition of P1 and P2 of E. coli genomes, as identified by the initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g).

[Figure 3.8 plot: "U00096: SIDD measure - free energy". x-axis: distance from translation start (-400 to 400); y-axis: Z-score (-0.8 to 0.0); curves for s=-0.025, s=-0.035, s=-0.045 and s=-0.055.]

Figure 3.8: Average profiles of SIDD energy calculated at four different helix densities: -0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation start.

3.4 DNA melting and SIDD energy 

An algorithm developed by Benham and co-workers (Wang & Benham, 2008; Wang et al., 2004) estimates the SIDD energy, which is the free energy required to open the DNA helix under different superhelix densities. When observing the SIDD energy 400 nucleotides on each side of the translation start of all coding sequences in E. coli K12 (accession U00096), a clear drop in the energy requirement is visible. The drop originates from the transcription start rather than the translation start, which explains the broad appearance of the curve. Figure 3.8 plots the SIDD energy values at different helix densities (-0.025, -0.035, -0.045, and -0.055). The graph shows z-scores, which express how the average SIDD energy at a given relative position compares with the average and standard deviation of the entire chromosome. A z-score below zero corresponds to a SIDD energy lower than the chromosome average, i.e. to DNA that melts more easily.
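The z-score described above can be sketched as follows, assuming the SIDD energies are available as one value per nucleotide and the translation starts as 0-based indices. The function name is hypothetical, and for brevity the sketch treats all genes as lying on the forward strand; reverse-strand genes would need their offsets mirrored.

```python
from statistics import mean, pstdev


def zscore_profile(energy, starts, flank):
    """z-score of the mean energy at each offset from the gene starts,
    relative to the mean and standard deviation of the whole chromosome."""
    mu = mean(energy)
    sigma = pstdev(energy)
    profile = []
    for offset in range(-flank, flank + 1):
        values = [energy[s + offset] for s in starts
                  if 0 <= s + offset < len(energy)]
        profile.append((mean(values) - mu) / sigma if values else None)
    return profile


# Toy chromosome: energy dips to 4 at the two gene starts, 10 elsewhere.
energy = [10.0, 10.0, 10.0, 4.0, 10.0, 10.0, 10.0, 4.0, 10.0, 10.0]
profile = zscore_profile(energy, starts=[3, 7], flank=1)
print([round(z, 2) for z in profile])   # [0.5, -2.0, 0.5]
```

The negative value at offset 0 is the analogue of the dip seen in figure 3.8.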

3.4.1 codesearch: Mapping numerical data to genome annotations

The codesearch tool was written to enable searches for various annotation patterns in a genome and to map numerical data relative to these annotations. The tool requires a pre-generated code file which condenses all annotations of the genome into a single string with one character per nucleotide position (see table 3.2). The user provides a regular expression that is searched for in the pre-generated code file.

A list of numerical data pertaining to the individual nucleotides of the genome can then be included. When it is defined, codesearch extracts the numerical values corresponding to the regions matching the pattern. The output of codesearch is divided into two tab-separated columns: the first column contains the genomic region where the pattern has matched, and the other column contains either the sequence as a string (when running in

Code  Meaning                             Example
C     Coding                              CCCCCCCCCCCCC
>     Annotation start on forward strand  .....>CCCC...
<     Annotation start on reverse strand  ...CCCCTTT.....
t     5S rRNA                             ..tttssss......
l     23S rRNA                            ...lllll

    codesearch -cod U00096.cod.gz -seq U00096.fsa -pat '(.{5,5}>s{1,1})'
    223773..223779      AAATTGA
    3939833..3939839    AAATTGA
    4033556..4033562    AAATTGA
    4164684..4164690    AAATTGA
    4206172..4206178    AAATTGA
    3426782..3426776    ATTGAAG
    2729177..2729171    ATTGAAG

    codesearch -cod U00096.cod.gz -dat U00096.sidd35.gz:1,4 -pat '(.{5,5}>s{1,1})' \
        -format '%0.2f' | tab2tbl --window='-5,2' -org 'E. coli K12' -col blue
    def                 org          col   -5    -4    -3    -2    -1    1     2
    223773..223779      E. coli K12  blue  7.93  7.93  7.94  8.00  8.26  8.28  8.37
    3939833..3939839    E. coli K12  blue  7.91  7.90  7.92  7.99  8.25  8.28  8.36
    4033556..4033562    E. coli K12  blue  7.83  7.83  7.85  7.92  8.19  8.22  8.32
    4164684..4164690    E. coli K12  blue  7.85  7.85  7.87  7.95  8.21  8.25  8.34
    4206172..4206178    E. coli K12  blue  7.91  7.91  7.92  7.99  8.26  8.28  8.37
    3426782..3426776    E. coli K12  blue  7.91  7.93  7.99  8.26  8.28  8.37  8.73
    2729177..2729171    E. coli K12  blue  7.91  7.93  7.99  8.26  8.28  8.37  8.72
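The idea behind the code file and the pattern search can be illustrated with a short sketch. It is not the codesearch implementation: the function names are hypothetical, only forward-strand annotations are handled, the start marker is assumed to replace the first character of a feature, and the letter s is assumed to stand for 16S rRNA, as the pattern in the listing suggests.

```python
import re


def build_codefile(length, annotations):
    """One character per nucleotide: '.' for unannotated positions, the feature
    symbol inside a feature, and '>' at a forward-strand feature start (assumed)."""
    code = ["."] * length
    for start, end, symbol in annotations:        # 1-based, inclusive coordinates
        code[start - 1:end] = [symbol] * (end - start + 1)
        code[start - 1] = ">"
    return "".join(code)


def codesearch(code, pattern, seq=None, data=None, fmt="%0.2f"):
    """Return 'region<TAB>payload' rows for every regular-expression match."""
    rows = []
    for m in re.finditer(pattern, code):
        region = f"{m.start() + 1}..{m.end()}"
        if data is not None:
            payload = "\t".join(fmt % v for v in data[m.start():m.end()])
        elif seq is not None:
            payload = seq[m.start():m.end()]
        else:
            payload = ""
        rows.append(f"{region}\t{payload}")
    return rows


code = build_codefile(20, [(8, 12, "s")])
seq = "ACGTACGTACGTACGTACGT"
data = [0.5 * i for i in range(20)]
rows_seq = codesearch(code, r".{5}>s", seq=seq)
rows_dat = codesearch(code, r".{5}>s", data=data)
print(rows_seq)   # ['3..9\tGTACGTA']
```

Because annotations and coordinates share one string index, any per-nucleotide track of the same length can be sliced with the match boundaries, which is what makes the regular-expression approach convenient.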

Using heatmap to generate energy landscape 

The R function heatmap, described in chapter 2, was used to compare both the SIDD profiles and the profiles of the P1/P2 model scores. All promoter sequences were aligned first according to the peak score of the P1 model (near the expected site of P1) and second according to the peak score of the P2 model (near the expected site of P2). In figure 3.9 the model scores are visualized in the green heatmaps on the left, whereas the rightmost heatmaps contain the SIDD energies (blue) of the aligned promoter sequences. This analysis shows that a deep drop in the SIDD energy occurs near the P1 site for approximately half of the promoter sequences.

[Figure 3.9: four heatmaps. Upper pair: P1, aligned at the P1 -10 box; lower pair: P2, aligned at the P2 -10 box. Rows: positions from about -500 to +50 relative to 16S rRNA +1, in steps of 10; columns: promoter sequences. Color scales: model score (bits), -22 to 34; SIDD energy (kcal/mol), 5.8 to 10.0. Note in figure: gaps are appended to each promoter region to adjust to the maxima of the P1/P2 model scores.]

Figure 3.9: E. coli and Shigella rrnB energy landscape visualized using the heatmap function. Each vertical column corresponds to a promoter sequence, whereas the horizontal rows represent average values over 10 bp within each sequence. Coordinates labeled on the horizontal rows are relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps show P2. Leftmost heatmaps show P1/P2 model scores in green, whereas rightmost heatmaps show the SIDD energy in blue.


3.5 The genomic context: visualizing operons and DNA properties

During the thesis work, this author has been involved in the development of a next-generation genome browser to replace the older GeneWiz software developed at CBS (Pedersen et al., 2000; Jensen et al., 1999). The old GeneWiz is still used by the BLASTatlas service to generate the static atlas graphic. The goal of the new version was to create an interactive and platform-independent program that would allow the user to zoom from a global genomic scale down to the nucleotide level. The basic principles of transforming numerical data into a color-coded representation remained identical to the GeneWiz method. However, the old GeneWiz software required several minutes to regenerate a plot, and the challenge was to provide an efficient data flow that would allow this regeneration in fractions of a second. Eva Rotenberg and Hans Henrik Stærfeldt from CBS authored the gwBrowser Java code which handles the plotting, whereas this author has been responsible for the server-side software.

For fast visualization to be possible, all numerical data that are plotted must be pre-binned and accessible at all zoom levels. A system was established which could contain these pre-binned data for a number of genomes using a MySQL database. The first solution involved a single large table, with fields corresponding to genome id, position, zoom level, field, and value. It quickly proved unfeasible. Since storing all zoom levels for a genome of length N requires 2×N records, a rough estimate shows that 1,000 genomes of 3 Mb and 20 different DNA properties (field) require 120 billion database records (2 × 3,000,000 × 1,000 × 20). Maintaining search indexes of this size and preventing table locks during updates made this solution impossible. A different solution, splitting each genome into its own table, solved many speed issues but did not perform satisfactorily. Instead, data are stored in binary files, one file per genome and zoom level. All values are written as fixed-width data, and using memory mapping the server can quickly obtain data within the file knowing the coordinates of the window. The listing below shows how the client retrieves data for the genome id AL111168GENOMEatlas, from position 1 to 37,473 bp, at zoom level 5. Figure 3.10 shows the workflow of the gwBrowser software. For further details on this tool, please refer to paper VII. The software is now available via http://www.cbs.dtu.dk/services/gwBrowser.

    set server = http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi
    curl $server"?d=AL111168GENOMEatlas&m=d&f=dnap0&b=1&e=37473&l=5&z=false"
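The storage scheme of one fixed-width binary file per genome and zoom level, read through memory mapping, can be sketched as below. This is an illustration, not the gwBrowser server code: the 32-bit float record, the bin size of 2 to the power of the zoom level, the file name and the function names are all assumptions.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<f")     # assumed record: one little-endian 32-bit float


def bin_values(values, level):
    """Average the per-nucleotide values in bins of 2**level positions."""
    size = 2 ** level
    return [sum(values[i:i + size]) / len(values[i:i + size])
            for i in range(0, len(values), size)]


def write_level(path, bins):
    """Write one fixed-width record per bin."""
    with open(path, "wb") as fh:
        for value in bins:
            fh.write(RECORD.pack(value))


def read_window(path, begin, end, level):
    """Return the bins covering 1-based positions begin..end at this zoom level.
    Fixed-width records turn the window coordinates directly into byte offsets."""
    size = 2 ** level
    first, last = (begin - 1) // size, (end - 1) // size
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[first * RECORD.size:(last + 1) * RECORD.size]
    return [record[0] for record in RECORD.iter_unpack(chunk)]


values = [float(v) for v in range(16)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.level2.bin")    # hypothetical file name
    write_level(path, bin_values(values, 2))
    window = read_window(path, 5, 12, 2)
print(window)   # [5.5, 9.5]
```

If every zoom level halves the number of bins, the levels together hold N + N/2 + N/4 + ... records, which stays below 2×N and matches the estimate used in the text.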

3.6 Visualizing sequencing quality using gwBrowser 

Modern high-throughput sequencing techniques currently lack sufficient read lengths to span many repetitive elements of genomes, especially the rRNA genes mentioned above. To assess how well a given set of reads can close a genome sequence, a method was developed which accounts for both the quality scores and the uniqueness of the reads. The concept of the method is to map the qualities of all reads back to a reference genome and to weight the qualities according to the uniqueness of the reads. Reads that have multiple hits throughout the genome contribute little, whereas reads that map to a single specific location contribute fully. Figure 3.11 shows the principle of the method, which was integrated into the gwBrowser software.
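The weighting idea can be sketched as follows. The thesis does not give the weight formula in this section, so the sketch assumes that each hit of a read is weighted by its alignment score divided by the sum of the scores of all hits of that read, which gives a weight of 1 to a uniquely mapping read. The data layout and the function name are likewise hypothetical.

```python
def weighted_quality_track(genome_length, reads):
    """Per-position maximum of the uniqueness-weighted base qualities.

    Each read is a dict with "qual" (list of base qualities) and
    "hits" (list of (0-based start, alignment score) tuples)."""
    track = [0.0] * genome_length
    for read in reads:
        total = sum(score for _, score in read["hits"])
        if total <= 0:
            continue
        for start, score in read["hits"]:
            weight = score / total                 # assumed uniqueness weight
            for offset, quality in enumerate(read["qual"]):
                pos = start + offset
                if 0 <= pos < genome_length:
                    track[pos] = max(track[pos], weight * quality)
    return track


reads = [
    {"qual": [30, 30], "hits": [(0, 50)]},            # unique read
    {"qual": [40, 40], "hits": [(4, 50), (8, 50)]},   # read with two equal hits
]
track = weighted_quality_track(10, reads)
print(track)   # [30.0, 30.0, 0.0, 0.0, 20.0, 20.0, 0.0, 0.0, 20.0, 20.0]
```

The repeated read has the higher raw quality, yet it contributes less per position than the unique read, which is the behaviour the method aims for in repetitive regions such as rRNA operons.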

[Figure 3.10: workflow diagram. Client side: configure and submit atlas (reference genome, annotations, sequencing reads, query genomes, custom numerical data), wait for processing, then use the browser applet, including editing of the atlas layout (XML). Server side: main server with XML configuration, data binning of zoom levels, and browser server holding the binned data. The applet sends requests (atlas ID, zoom level, window, field name ...) and receives the returned data.]

Figure 3.10: Principle workflow of gwBrowser data exchange.

[Figure 3.11: diagram in steps. Aligning the read sequence to the genome gives hits H1, H2, H3, ..., Hr with scores S1, S2, S3, ..., Sr. The quality scores qr(i) are mapped to the genome and weighted, giving q'r(i). From all positions in the genome, the maximum uniqueness value derived from the mapped reads is obtained. Finally, all maximum values are plotted on the reference genome using GeneWiz Browser; the marked band in the example shows a region with low uniqueness. Lanes in the example: Weighted coverage, Sequence agreement, Max unique qual, Information Content, Read absence, Annotations (CDS+, CDS-, rRNA, tRNA), Intrinsic Curvature, Stacking Energy, Position Preference, Global Direct Repeats, Global Inverted Repeats, GC Skew, Percent AT.]

Figure 3.11: Mapping qualities of sequencing reads to a reference genome while accounting for the uniqueness of the read.

[Figure 3.12: gwBrowser view of E. coli K12 MG1655 with the rrn operons (rrnA, rrnB, rrnC, rrnD, rrnE, rrnG, rrnH) marked and a zoom on rrnB. Annotated elements: P1 (-10, -35, UP), P2 (-10, -35, UP) and three FIS sites. Lanes: SIDD (s: -0.055, -0.045, -0.035), Annotations (CDS+, CDS-, rRNA, tRNA), Intrinsic Curvature, Stacking Energy, Position Preference, GC Skew, Percent AT.]

Figure 3.12: A zoom of the P1/P2 tandem promoter system upstream of the rrnB operon of E. coli K12.

3.6.1 Visualizing the P1 and P2 structure using gwBrowser 

The gwBrowser tool allows the user to append various types of annotations, such as TSS marks, boxes, and arrows, once the binning step has finished. This makes it possible to visualize promoter structures like the P1/P2 system and to integrate them with various DNA properties. The gwBrowser tool was applied to the E. coli rrnB promoter system to correlate the annotated regulatory elements with the SIDD energy (Wang et al., 2004; Wang & Benham, 2008) (see figure 3.12).

The plot in figure 3.12 shows a drop in free energy upstream of P1 and P2, which from an energetic viewpoint explains the high transcription rate. The transcription factor FIS stimulates transcription at several promoters; for example, the binding of FIS at the leuV promoter (Ross et al., 1999) has been suggested to transmit the superhelical destabilization downstream to the point where the RNAP twists and opens the helix (Wang et al., 2004). This model may also be valid for the rrnB P1 promoter, as the activities of leuV and rrnB P1 are comparable (Bauer et al., 1988).

3.7 Summary 

Ribosomal RNA genes play an important role in cells and can be highly transcribed: often more than 90% of the total transcripts in rapidly growing bacterial cells are from rRNA genes. rRNA genes are also important in determining taxonomy. Correctly locating the start and stop positions of the rRNA genes is difficult with BLAST searches, so we have developed RNAmmer to find the rRNA genes. Once the genes are mapped, further studies, such as promoter profiling, can be done. The gwBrowser allows one to zoom in on particular areas of the chromosome and, in the case of rRNA promoters, to map important structural properties of the DNA in the promoter region.



3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences

3100–3108 Nucleic Acids Research, 2007, Vol. 35, No. 9 Published online 22 April 2007 

doi:10.1093/nar/gkm160 

RNAmmer: consistent and rapid annotation 

of ribosomal RNA genes 

Karin Lagesen 1,2, *, Peter Hallin 3 , Einar Andreas Rødland 1,2,4,5 , Hans-Henrik Stærfeldt 3 , 

Torbjørn Rognes 1,2,4 and David W. Ussery 1,2,3 

1 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo, 

NO-0027 Oslo, Norway, 2 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, 

Rikshospitalet-Radiumhospitalet Medical Centre, NO-0027 Oslo, Norway, 3 Center for Biological Sequence 

Analysis, Biocentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark, 4 Department of 

Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway and 5 Norwegian Computing 

Center, PO Box 114 Blindern, NO-0314 Oslo, Norway 

Received December 1, 2006; Revised and Accepted March 2, 2007 

ABSTRACT 

The publication of a complete genome sequence is 

usually accompanied by annotations of its genes. 

In contrast to protein coding genes, genes for 

ribosomal RNA (rRNA) are often poorly or inconsistently 

annotated. This makes comparative 

studies based on rRNA genes difficult. We have 

therefore created computational predictors for the 

major rRNA species from all kingdoms of life and 

compiled them into a program called RNAmmer. 

The program uses hidden Markov models trained on 

data from the 5S ribosomal RNA database and 

the European ribosomal RNA database project. 

A pre-screening step makes the method fast with 

little loss of sensitivity, enabling the analysis of 

a complete bacterial genome in less than a minute. 

Results from running RNAmmer on a large set of 

genomes indicate that the location of rRNAs can be 

predicted with a very high level of accuracy. Novel, 

unannotated rRNAs are also predicted in many 

genomes. The software as well as the genome 

analysis results are available at the CBS web server. 

INTRODUCTION 

Ribosomes are the molecular machines which form the 

connection between nucleic acids and proteins in all living 

organisms. The ribosome’s dependence on ribosomal 

RNAs (rRNAs) for its function has caused them to be 

conserved at both the sequence and the structure level. 

Because of this, rRNAs are often used in comparative 

studies such as phylogenetic inference. Comparative 

studies have become more popular as more genomes 

have been completely sequenced, but can potentially become complicated when some of the genes they are based on are poorly annotated or not annotated at all.

*To whom correspondence should be addressed. Tel: +4722844786; Email: karin.lagesen@medisin.uio.no

Unfortunately, this is often a problem with rRNAs as 

genome annotation pipelines usually do not include tools 

specific for rRNA detection. Instead, rRNAs are often 

located by sequence similarity searches such as BLAST. 

Although such searches may give reasonable answers due 

to the high level of sequence conservation in the core 

regions of the genes, using such results for annotation 

purposes can be problematic. The validity of the search 

results depends on the program and database used. 

Changing one or both of these can drastically change 

the results. Genomic databases have grown exponentially 

over the past two decades and search programs have as a 

consequence had to undergo constant revisions in order to 

meet the requirements of the research community. Thus, 

the results of a search done today are probably very 

different from those produced several years ago. An added 

complication is that the most commonly used database 

search methods have poor performance for noncoding 

RNAs. A recent study comparing several different 

methods for predicting noncoding RNAs, including 

rRNAs, found that the most commonly used methods 

gave the most inaccurate results (1). 

Through our work on the GenomeAtlas database (2), 

we have seen the results of poor annotation of rRNAs. 

Some genomes do not have any rRNAs annotated at all, 

whereas other genomes seem to have rRNAs annotated 

on the wrong strand. We initially tried to do systematic 

BLAST (3) searches, but it proved difficult to maintain 

consistency throughout this process. The high level of 

sequence conservation among the rRNAs enabled us to 

create hidden Markov models (HMMs) from structural 

alignments. Such models are more capable of capturing 

the sequence variation that is inherently present in 

the rRNA gene families than simple BLAST searches. 

© 2007 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Using HMMs also simplifies the use of common criteria 

for prediction assessment. A library of HMMs was 

constructed and the program RNAmmer was developed 

to make use of this library. RNAmmer is available 

through the CBS web site, as a web service or as a 

stand-alone package. It has been tested on all published 

genomes and gives accurate predictions of rRNAs. The 

program also has the added benefit of producing results 

that are comparable between genomes. 

Our work has focused on three of the major rRNA 

species. The ribosome consists of two subunits, the small 

and the large subunit, which pair up to form the 

functional ribosome. The rRNAs present in prokaryotes 

are the 5S and 23S in the large subunit, and the 16S in the 

small subunit. In eukaryotes, 5S, 5.8S and 28S rRNA exist 

in the large subunit, and 18S rRNA in the small subunit. 

The 5.8S is not considered in this work. There are 

substantial sequence and secondary structure similarities 

between eukaryotic and prokaryotic rRNAs; however, 

the eukaryotic rRNAs commonly have longer stems and 

larger loops than those of the prokaryotes. The subunits 

are composed of both RNAs and proteins. Since their 

discovery in the early 1950s, it has been debated whether 

ribosomal function should be credited to the rRNAs or 

the proteins. Recent crystal studies have revealed that 

protein synthesis to a large extent is dependent on the 

rRNAs (4–7) and this has most likely been instrumental 

for their high level of conservation. 

In prokaryotes, the 16S, 23S and 5S rRNAs are 

commonly transcribed together, while the 18S, 28S and 

5.8S rRNAs form a transcriptional unit in eukaryotes. 

Eukaryotic 5S rRNAs commonly appear in highly duplicated

tandem repeats (8). In most organisms, there are 

several copies of the rRNA transcription unit, and 

although as much as 11% sequence divergence has been 

observed between units within the same genome, the 

difference is usually less than 1% (9). In several cases, 

segments are also edited out of the transcribed rRNA. 

These segments may be introns that after splicing leave 

a continuous rRNA, or they can be intervening sequences 

(IVS) that leave a fragmented rRNA which is still 

functional within the ribosome structure (10). Introns 

are most prevalent in eukaryotes and archaea, while

intervening sequences have been seen in eukaryotes and 

bacteria. Introns are predominantly found within conserved 

sequences close to tRNA and mRNA-binding 

sites (10), whereas intervening sequences are ordinarily 

seen in hypervariable regions (11). 

METHODS AND MATERIALS 

Using HMMs to find new members of a sequence family 

requires reliable multiple alignments. The 16S/18S and 

23S/28S rRNA alignments were retrieved from the 

European ribosomal RNA database (ERRD) (12). 

In this database, annotated large and small subunit 

ribosomal RNA sequences from the EMBL nucleotide 

database with a length of at least 70% of their estimated 

full length have been aligned. Multiple alignments of 5S 

rRNAs were retrieved from the 5S Ribosomal RNA 


Database (13). Data from both databases were downloaded 

on October 27, 2005. The alignments are 

all structural alignments, i.e. aligned using secondary 

structure information gained from comparative sequence 

analysis. The 5S alignments were already divided 

into separate alignments for archaeal, bacterial and 

eukaryotic sequences, whereas the ERRD data were not. 

The alignments for 16/18S and 23/28S rRNAs were 

divided into the same groups as the 5S data to provide 

kingdom-specific predictors. The data was stored in 

a MySQL database for easier handling. 

The ERRD data contained sequences from ‘environmental 

samples’. These were excluded since there was little 

information about them. The 5S were generally around 

120 nt long, the 16/18S around 1500 nt and the 23/28S 

around 3000 nt long, all with no obvious outliers. The 

length of the eukaryotic rRNAs varied substantially, 

more than those of bacterial and archaeal rRNAs, but no 

sequences in the alignments seemed obviously wrong. 

The sequences were divided into phylogenetic groups to 

help with further analysis. Due to sequencing bias, some 

phylogenetic groups dominated the data sets. Such a skew 

could potentially cause the predictors to be less sensitive 

on underrepresented phylogenetic groups. Among 

the bacteria, 82% of the sequences were from three 

phyla: Actinobacteria, Firmicutes and Proteobacteria. 

Around 70% of the archaeal sequences were from 

Euryarchaeota; among the eukaryotes, the Streptophyta 

comprised 15% of the data. Several of the sequences also 

proved to be very similar. Therefore, redundancy reduction 

inspired by Hobohm's second algorithm (14) was

performed. This algorithm starts with a sorted list of the 

number of neighbors each sequence has. An all-against-all 

comparison between the sequences is performed and 

neighborship is judged by the level of similarity found. 

Similarity was measured by $\mathrm{Score} = \sum_{i,j} n_{ij} S_{ij} / (N - g)$, where i and j sum over the four nucleotides, $n_{ij}$ counts the number of aligned nucleotide pairs (i, j), N is the length of the sequence and g is the number of gap-only positions; $S_{ij}$ refers to the scoring matrix EDNAFULL created by Todd

Lowe. The maximum similarity level allowed was set to 

ensure that each phylum was represented. Similarity 

graphs were formed for each group, with the sequences 

as vertices and edges between similar sequences. The 

sequence with the highest connectivity and its edges were 

deleted from the graph, and this was repeated until no 

edges remained. At the end, all removed sequences were 

checked to see if they had any edges to vertices in the 

remaining set. If not, they were reinstated. This procedure 

was implemented as a C program. 
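The similarity score and the graph-based reduction can be sketched as follows. The authors implemented the procedure as a C program; this sketch is a reading of the prose description only. The function names are hypothetical, the 5/-4 match/mismatch matrix is a toy stand-in for EDNAFULL, pairs with a gap on one side are assumed to contribute nothing to the sum, and ties in connectivity are broken alphabetically.

```python
def similarity(a, b, matrix):
    """Score = sum over aligned nucleotide pairs of S_ij, divided by (N - g),
    where g counts positions that are gaps in both sequences."""
    gap_only = sum(1 for x, y in zip(a, b) if x == "-" and y == "-")
    total = sum(matrix[x][y] for x, y in zip(a, b) if x != "-" and y != "-")
    return total / (len(a) - gap_only)


def reduce_redundancy(names, similar_pairs):
    """Repeatedly delete the most connected sequence until no edges remain,
    then reinstate removed sequences that have no neighbour left in the set."""
    neighbours = {n: set() for n in names}
    for a, b in similar_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    original = {n: set(v) for n, v in neighbours.items()}
    kept, removed = set(names), []
    while any(neighbours[n] for n in kept):
        worst = max(sorted(kept), key=lambda n: len(neighbours[n]))
        for other in neighbours[worst]:
            neighbours[other].discard(worst)
        neighbours[worst] = set()
        kept.remove(worst)
        removed.append(worst)
    for n in removed:
        if not (original[n] & kept):
            kept.add(n)
    return sorted(kept)


toy = {x: {y: (5 if x == y else -4) for y in "ACGT"} for x in "ACGT"}
score = similarity("AC-GT", "AC-GA", toy)
star = reduce_redundancy(["a", "b", "c", "d"], [("c", "a"), ("c", "b"), ("c", "d")])
chain = reduce_redundancy(["h", "x", "x1", "y", "y1"],
                          [("h", "x"), ("h", "y"), ("x", "x1"), ("y", "y1")])
print(score, star, chain)   # 2.75 ['a', 'b', 'd'] ['h', 'x1', 'y1']
```

In the second example the sequence h is removed first, but both of its neighbours are removed later, so the final check reinstates it.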

Sequences in ERRD may contain ambiguous nucleotide 

symbols representing nucleotides that have not been 

uniquely determined. These occur more frequently in 

bacteria and eukaryotes than in archaea, and primarily at 

both ends of the alignment: in 16/18S, predominantly 

at the end; in 23/28S, predominantly at the beginning. 

In the latter case, this was mostly due the high prevalence 

of gaps at the end of the alignment. As we found that 

ambiguous nucleotides at the ends reduced the ability to 

predict start and stop positions accurately, we decided to 

remove all sequences with five or more ambiguous nucleotides in either end of the sequence. A summary of the number of sequences removed during curation of the alignments is shown in Table 1.

Table 1. The initial number of rRNA sequences and the number of sequences excluded for different reasons.

Kingdom     Type  Initial count  Environmental samples  Incomplete sequences  Redundancy reduction  Total in HMM
Archaea     5S    58             0                      0                     10                    48
Archaea     16S   589            239                    471                   287                   76
Archaea     23S   37             0                      18                    8                     15
Bacteria    5S    461            0                      0                     101                   360
Bacteria    16S   12 107         1429                   10 723                2485                  743
Bacteria    23S   398            0                      155                   130                   127
Eukaryotes  5S    316            0                      0                     33                    283
Eukaryotes  18S   6585           24                     5222                  836                   979
Eukaryotes  28S   157            0                      91                    8                     58

Environmental samples were excluded due to lack of phylogenetic information. Sequences with too many unknown nucleotides in either end of the sequence were excluded to improve HMM accuracy. Redundancy reduction was performed to reduce bias. Note that these groups may overlap. The last column indicates the number of sequences used to build each HMM.

The software package HMMer (15) version 2.3.2 was 

used to create HMMs from alignments where all columns 

containing only gaps had been removed. It was configured 

for nucleotides, and to compensate for skews in the 

nucleotide distribution a custom null model for each 

alignment was used. Although redundancy reduction had 

been performed, the Henikoff position-based weighting

scheme (16) was used to reduce any remaining biases. 

When using the HMMs to search genome sequences, 

the default alignment method was used: a match must 

span the entire model, and several matches may be found 

within one sequence. 

With the aim of increasing the search speed, we 

determined the 75 most conserved consecutive columns 

in each alignment, as illustrated in Figure 1, and produced 

‘spotter’ HMMs based on these. Since searches with the 

smaller spotter models would be considerably faster, 

we wanted to investigate the possibility of using the 

spotter to pre-screen for candidates, using the full HMMs 

only on regions surrounding the spotter hits. Spotter and 

full model searches were done separately. Spotter and full 

model predictions were matched based on whether they 

had overlapping nucleotides on the same strand. A linear 

regression was used to express spotter score in terms of 

full model score. Variation was estimated as linear in full 

model score with non-positive regression coefficients. 

Least squares estimates were used in both cases. Spotter 

scores were assumed to be missing when negative and, 

hence, assumed to follow a truncated normal distribution; 

expected scores and square deviations were used to replace 

missing values in the two regressions. From this model, we 

computed the lowest full model score, T99, for which there 

was at least a 99% likelihood of getting a corresponding 

spotter hit, and the likelihood, Pmin, that a full model hit 

with the lowest found score should have a corresponding 

spotter hit. 
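How the most conserved stretch of consecutive columns might be located is sketched below. It assumes that conservation is measured as per-column information content (2 bits minus the Shannon entropy over A, C, G and T), in line with the axis of Figure 1, and that gaps and ambiguous symbols are simply ignored. The function names and the toy alignment are illustrative and not taken from RNAmmer.

```python
import math
from collections import Counter


def column_information(column):
    """Information content in bits of one alignment column (gaps ignored)."""
    counts = Counter(base for base in column if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 - entropy


def most_conserved_window(alignment, width=75):
    """Start index and summed information of the best run of `width` columns."""
    info = [column_information(col) for col in zip(*alignment)]
    if len(info) < width:
        raise ValueError("alignment is shorter than the window")
    window = sum(info[:width])
    best_sum, best_start = window, 0
    for start in range(1, len(info) - width + 1):
        # Slide the window one column to the right.
        window += info[start + width - 1] - info[start - 1]
        if window > best_sum:
            best_sum, best_start = window, start
    return best_start, best_sum


alignment = ["ACGTTTAC", "CGTTTTCA", "GTATTTGT"]
best_start, best_sum = most_conserved_window(alignment, width=3)
print(best_start)   # 3
```

The columns of the selected window would then be cut out of the alignment and used to build the smaller spotter model.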

Both the full HMMs and the spotter HMMs were run 

on all fully sequenced genomes found in the Genome Atlas 

database (listed in Supplementary Table S1). All predictions 

with non-negative score and E-value at most 100 

were reported. Only full model hits with E-value <0.01

were accepted as reliable hits, but none with E-value 

between 0.01 and 100 were reported. As rRNAs within a 

genome tend to be very similar, usually with at least 99% 

identity, different full model hits within a genome 

corresponding to actual rRNAs should be expected to 

have similar scores. However, we found a substantial 

number of hits with far lower scores which we assume to 

be pseudogenes, truncated rRNAs or otherwise nonfunctional 

rRNA copies. To ensure that these did not have 

an adverse effect on the analyses, we excluded full model 

hits having a score less than 80% of the maximal score 

in that genome. These are listed in Supplementary 

Table S2. 
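The two filters described above, the E-value cut-off and the 80% rule, can be expressed compactly. In this sketch the hit records are plain dictionaries with hypothetical keys, and the maximal score is taken per genome and per rRNA type; the text only says "in that genome", so the grouping by type is an assumption.

```python
def filter_hits(hits, max_evalue=0.01, min_fraction=0.8):
    """Keep reliable hits scoring at least `min_fraction` of the best hit
    of the same rRNA type in the same genome."""
    reliable = [h for h in hits if h["evalue"] < max_evalue]
    best = {}
    for h in reliable:
        key = (h["genome"], h["rrna"])
        best[key] = max(best.get(key, h["score"]), h["score"])
    return [h for h in reliable
            if h["score"] >= min_fraction * best[(h["genome"], h["rrna"])]]


hits = [
    {"genome": "g1", "rrna": "16S", "score": 1900.0, "evalue": 1e-50},
    {"genome": "g1", "rrna": "16S", "score": 1880.0, "evalue": 1e-50},
    {"genome": "g1", "rrna": "16S", "score": 900.0, "evalue": 1e-20},   # below 80%
    {"genome": "g1", "rrna": "16S", "score": 50.0, "evalue": 0.5},      # unreliable
]
kept_scores = [h["score"] for h in filter_hits(hits)]
print(kept_scores)   # [1900.0, 1880.0]
```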

Annotations of rRNAs were obtained from GenBank. 

Unfortunately, rRNAs have not been annotated in a 

uniform manner and it was often unclear exactly what 

was annotated. In some cases, both the separate rRNAs 

and the full operon were annotated. In all such cases, the

operons were longer than 5000 nt, and all annotations 

longer than that were thus excluded. In our experience, 

this affected only operons. In other cases, different pieces 

of the same gene had been annotated as separate entities. 

Thus, some predictions matched several annotation 

entries; these are listed in Supplementary Table S3. A 

prediction was considered to match an annotation if they 

were on the same strand and the length of their overlap 

was at least half the length of the shorter of the two; it was 

considered to be annotated if it matched at least one 

annotation. The deviation between annotated and predicted 

start and stop positions was also examined, but 

predictions with multiple matching annotations were 

excluded from this comparison. 
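The matching rule above (same strand, and an overlap of at least half the length of the shorter feature) can be written down compactly. The sketch below is an illustration only; the tuple layout (start, stop, strand) with 1-based inclusive coordinates and the function names are assumptions, not part of the original method description.

```python
def overlap_length(start_a, stop_a, start_b, stop_b):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(stop_a, stop_b) - max(start_a, start_b) + 1)


def is_match(prediction, annotation):
    """True if the features are on the same strand and overlap by at
    least half the length of the shorter of the two."""
    p_start, p_stop, p_strand = prediction
    a_start, a_stop, a_strand = annotation
    if p_strand != a_strand:
        return False
    shorter = min(p_stop - p_start + 1, a_stop - a_start + 1)
    shared = overlap_length(p_start, p_stop, a_start, a_stop)
    return 2 * shared >= shorter


def matching_annotations(prediction, annotations):
    """All annotations matched by a prediction; the prediction counts
    as annotated when this list is non-empty."""
    return [a for a in annotations if is_match(prediction, a)]


# Hypothetical coordinates for illustration.
prediction = (100, 1600, "+")
annotations = [(110, 1610, "+"), (110, 1610, "-"), (1500, 3000, "+")]
matches = matching_annotations(prediction, annotations)
```

A prediction with more than one entry in `matches` corresponds to the multiple-annotation cases that were excluded from the start and stop comparison.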

Additional analyses were performed for experimentally 

verified 16S in Anaplasma marginale St. Maries (M60313), 

Chlamydia muridarum Nigg (D85718), Escherichia coli 

K12 MG1655 (J01695), Sulfolobus tokodaii St. 7 

(AB022438), Thermus thermophilus HB8 (X07998) and 

Nitrobacter hamburgensis X14 (L11663). Computational 

speed was assessed on Mycoplasma capricolum ATCC 27343 

(CP000123), Solibacter usitatus Ellin6076 (CP000473) and 

Sargasso Sea data (AACY01000001-AACY01811372). 

All test searches reported were performed on an 

SGI Altix 3000 machine using one 1.3 GHz Itanium 2 

processor.

RESULTS 

The predictions of the full HMM models have been 

compared first against annotations, then against the 

spotter models. 

Full model predictions versus annotation 

As Table 2 shows, the predictors appeared to be better 

at detecting bacterial rRNAs and less powerful for 

eukaryotic rRNAs. The highest accuracy was seen for 

the 16/18S rRNAs followed by the 23/28S. Two groups of 

rRNAs were particularly difficult to locate: the archaeal 

5S and the eukaryotic 18S. The missing archaeal 5S were 

all from four euryarchaeotic genomes which are all 

anaerobic methane producers. The eukaryotic 18S that 

the predictors could not find were all from two genomes, 

Guillardia theta and Plasmodium falciparum. 

Closer evaluation revealed that several annotated 

rRNAs that lacked a matching prediction had actually 

been detected, but on the opposite strand. In eukaryotes, 

this was only seen with Arabidopsis thaliana 5S. 

In bacteria, most of the reverse predictions were 5S; in 

archaea, they were predominantly 16S and 23S. It should 

be noted that for all the reverse strand predictions 

the predicted start and stop positions agreed well 

with the annotation, indicating that they have been 

annotated on the wrong strand. Annotated rRNAs 

that lacked matching predictions in either direction are 

listed in Supplementary Table S4. 

Table 2 gives the number of predicted rRNAs that did 

not have a corresponding annotation: putative novel 

rRNAs. About 70% of them were 5S rRNAs, and only a 

few were archaeal. In bacteria, most of the novel rRNAs 

were found in Firmicutes and Gammaproteobacteria, 

although it should be noted that these two phyla are 

the two dominant groups and contain the bulk of the 

currently sequenced bacterial genomes. Among the 

eukaryotes, only A. thaliana had novel rRNAs. The 

scores of the new rRNA predictions did not significantly 

differ from those that were annotated, indicating that 

these are true rRNAs not yet annotated. The 5S is often 

omitted in the rRNA annotation; since the eukaryotic 5S 

is usually separated from the 18-28S sequence, they might 

be less visible to annotators. 

Start and stop deviations 

Figure 1. The graphs show conservation in the alignments as measured by information content: $C = \sum_i f_i \log_2(f_i/q_i)$, where $i$ sums over the four nucleotides, $f_i$ is the frequency of nucleotide $i$ in the column and $q_i = 1/4$ is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding $f_i$, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment. Each panel plots information content (0 to 2) against position in the alignment. Panels: A, 5S, Archaea (n = 48); B, 16S, Archaea (n = 76); C, 23S, Archaea (n = 15); D, 5S, Bacteria (n = 360); E, 16S, Bacteria (n = 743); F, 23S, Bacteria (n = 127); G, 8S, Eukaryotes (n = 283); H, 18S, Eukaryotes (n = 979); I, 28S, Eukaryotes (n = 58). 
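The information content used in Figure 1 can be computed per alignment column as sketched below. This is a minimal illustration under stated assumptions: the ambiguity table, the function names and the handling of the running average at the alignment ends (the window is simply truncated) are choices made for the example and are not taken from the original analysis.

```python
import math

# IUPAC symbols mapped to the nucleotides they may represent; gaps are
# spread over all four nucleotides, as described in the caption.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "ACGT", ".": "ACGT",
}


def column_information(column, background=0.25):
    """Information content C = sum_i f_i * log2(f_i / q_i) of one
    non-empty alignment column, given as a string of symbols."""
    counts = dict.fromkeys("ACGT", 0.0)
    for symbol in column.upper():
        targets = AMBIGUITY[symbol]
        for base in targets:
            counts[base] += 1.0 / len(targets)
    total = sum(counts.values())
    info = 0.0
    for count in counts.values():
        f = count / total
        if f > 0:
            info += f * math.log2(f / background)
    return info


def running_average(values, window=75):
    """Centered running mean; the window is truncated at both ends."""
    half = window // 2
    result = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        result.append(sum(values[lo:hi]) / (hi - lo))
    return result


def most_conserved_window(values, window=75):
    """Start index of the window with the highest summed information.
    Requires len(values) >= window."""
    starts = range(len(values) - window + 1)
    return max(starts, key=lambda s: sum(values[s:s + window]))
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four nucleotides at equal frequency (or consisting only of gaps) has an information content of 0.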

Figure 2 illustrates the differences between predicted and annotated start and stop positions and shows that they agree well. In most groups, the median start and stop deviations were zero or very close to zero, with more than half of the predictions within 10 nucleotides of the annotation. This was not the case for the eukaryotes. 

For eukaryotic 5S, only five genomes contained 

predictions with matching annotations. The predictions 

were uniform in length, whereas the annotations 

were more variable. The predictions that indicated a 

substantially shorter 5S than annotated were all in 

Schizosaccharomyces pombe: the average length of the 

annotations was 170 nt, whereas the corresponding 

predictions were all 114 nt. For eukaryotic 18S, however, 

predicted start and stop positions were very accurate, 

although many annotated 18S were missed.


Table 2. The number of rRNAs annotated and predicted in the genomes that were examined. 

Kingdom               Type   Annotated     Same strand   Other strand   Not found   Full model predictions   Novel 
Archaea (n = 27)      5S     56 (24)       43 (21)       1 (1)          12 (8)      47 (23)                  4 (3) 
                      16S    47 (25)       45 (25)       2 (2)          0 (0)       47 (27)                  2 (2) 
                      23S    47 (25)       44 (24)       2 (2)          1 (1)       46 (26)                  2 (2) 
Bacteria (n = 321)    5S     1205 (285)    1166 (285)    30 (16)        9 (5)       1339 (320)               173 (69) 
                      16S    1172 (299)    1146 (299)    22 (12)        4 (4)       1237 (320)               91 (34) 
                      23S    1197 (297)    1154 (291)    22 (13)        21 (12)     1248 (313)               94 (36) 
Eukaryotes (n = 13)   5S     65 (7)        46 (6)        19 (1)         0 (0)       324 (9)                  278 (5) 
                      18S    13 (4)        6 (4)         0 (0)          7 (2)       13 (6)                   7 (3) 
                      28S    13 (5)        12 (4)        0 (0)          1 (1)       19 (7)                   7 (3) 

The table gives the number of annotations, and splits this into those matching predictions on the same strand, on the other strand, and not found. The total number of full model predictions is given. Novel predictions are full model predictions not matching any annotation on the same strand, and include those annotated on the other strand. Numbers in parentheses indicate the number of genomes. It should be noted that the eukaryotic annotated count is somewhat uncertain due to ambiguous rRNA annotations. The genomes which were analyzed were from the GenomeAtlas database, a database of all available fully sequenced genomes. 

For eukaryotic 28S, only two genomes had predictions with matching annotations. One of them, Encephalitozoon cuniculi, had stop positions predicted once 1112 nt and twice 4797 nt downstream of the annotation, whereas the start position was accurately predicted. In the other genome, Guillardia theta, the start positions were uniformly predicted 110 nt upstream of the annotated position, but with the stop position quite accurately predicted. 

Figure 2. Deviation of start and stop positions between predicted and annotated RNA is presented as pairs of panels (Start and Stop) for archaea, bacteria and eukaryotes, with deviations shown on an axis from −1000 to 1000 nt. The number of predictions among the archaea, bacteria and eukaryotes are denoted beneath the panel group heading: 5S (43/1163/46), 16/18S (44/1146/6) and 23/28S (42/1150/9). The zero position in each panel corresponds to the annotation start or stop position with predicted positions presented relative to these. The yellow dot indicates the median deviation and the black box the quartile range. The hinges on the side of the box extend from the side of the box to the data point that is closest to, but does not exceed, 1.5 times the interquartile range. The curves show the density of the distribution. 

Since rRNAs tend to be very similar within a genome, 

predictions within each genome generally had similar 

lengths. This similarity within genomes as well as within 

groups of closely related genomes caused multiple peaks 

in the distributions of endpoint deviations. An example 

of this can be seen in the bacterial 16S predictions where 

some of the predicted start and stop positions were 

clustered downstream of the annotation and where some 

of the predicted start positions were clustered upstream 

of the annotation. Some of the major contributors to 

the upstream peak in the start positions were different 

Streptococcus pyogenes strains, Bacillus genomes and 

Yersinia pestis genomes. These, in addition to 

Streptococcus agalactiae strains and Vibrio parahaemolyticus, 

were also prevalent in the stop position downstream 

peak. There was also a downstream peak in the 

start positions, and the genomes causing this peak were 

mainly Staphylococcus aureus, Bacillus cereus and several 

Escherichia coli relatives. 

Most of the start and stop deviations did not exceed 

100 nt. However, there were a few cases of deviations 

exceeding 1000 nt, and these are not shown in the figure. 

This was the case for eukaryotic 28S and was mainly due 

to the three previously described stop positions predicted 

considerably downstream of the annotated stop position. 

In the two longer predictions from E. cuniculi, this was 

due to the HMM placing the latter 100 nt of the prediction 

further downstream to achieve a better score. Such inserts 

would most likely not appear when the spotter model is 

used first, since the inserted sequence would be too long. 

To test this, a truncated version of the sequence was run 

through the predictor. The stop position was then 

accurately predicted. This phenomenon also explains 

some cases among the bacterial 16S predictions where the 

start position was placed very far upstream of the 

annotation. There were 27 rRNAs that had a start 

position predicted to start anywhere from 13 000 to 

40 000 nt upstream of the annotated start position. All 

but one of these were Firmicutes, mostly Streptococci and 

Staphylococci. Closer study of the sequences revealed that 

the misplaced start position predictions were again due to 

long sequences being inserted near the start of the rRNA, 

indicating that the first part of the HMM had been 

misplaced in the same manner as for the E. cuniculi stop 

predictions. To test if these were the same kind of inserts, 

a region ending in the same place as the predictions but 

starting 10 000 nt earlier was run through the full model 

predictor. This led to the bacterial 16S rRNAs being 

predicted with a deviation in start and stop positions on 

par with what was otherwise seen. 

Comparison to experimentally verified rRNAs 

Annotations were often ambiguous and considered 

unreliable. For discrepancies between annotations and 

RNAmmer predictions, it is not a priori clear which of the 

two is correct. However, some genomes with experimentally 

verified rRNAs were selected to further assess the 

accuracy of start and stop predictions. The genomes 

we examined were Anaplasma marginale Str. Maries, 

Chlamydia muridarum Nigg, Escherichia coli K12 

MG1655, Sulfolobus tokodaii Str. 7, Thermus thermophilus 

HB8 and Nitrobacter hamburgensis X14. These genomes 

all had complete 16S sequences according to the NCBI 

database and had accompanying literature which said that 

they were experimentally determined. When checking 

the positions of these rRNAs with BLAST against the 

genome, some discrepancies were found. Due to this we 

used the BLAST results when comparing annotated 

rRNAs to predictions. 


In total, there were 14 copies of the six 16S sequences, 

and all of them were found by our predictions. Stop 

predictions were more accurate than start predictions. 

In all but four cases, the start position was predicted 

to be 7 nt downstream of the annotated start position. 

In A. marginale and S. tokodaii, the start position was 

predicted to be the same as annotation, and both of the 

two entries from C. muridarum were predicted to start 3 nt 

downstream of annotated start position. In N. hamburgensis 

the start position was, in contrast to the other cases, 

predicted to start 7 nt upstream of annotated start 

position. The stop positions in all but three predictions 

ended on the same position as the annotation. In N. 

hamburgensis the predicted stop was 9 nt downstream, 

whereas in S. tokodaii and A. marginale the predicted 

stop was 1 nt downstream of the annotation. Thus, 

all predictions were within 10 nt of the annotated start 

and stop positions. 

Comparison to RFAM 

RFAM is a database of RNA families which incorporates 

secondary structure in its analyses. We have made a 

comparison with the 5S rRNA predictions of 

RFAM (17,18) for a selection of twenty prokaryotic 

genomes listed in Supplementary Table S5. There were a 

total of 55 5S annotated in these genomes. RNAmmer 

found 53 of them, while 54 were found in RFAM. In three 

of the genomes, both methods predicted a 5S to within a 

few nucleotides of the annotated position, but both placed 

it on the other strand. Both predictors identified three new 

5S rRNAs within these genomes, and at approximately the 

same positions. Two of these new 5S rRNAs followed 

another annotated 5S rRNA, looking like a tandem 

repeat. In most cases, both methods placed the start 

position a few nucleotides downstream of the annotation, 

whereas the stop position was more evenly distributed 

around the annotated position. RNAmmer generally 

predicted rRNAs to be shorter by a nucleotide or two 

than RFAM, usually at the start of the genes. 

Spotter pre-screening 

Table 3 shows that, with the exception of archaeal 5S, 

no full model hits were missed by the spotter model. 

Also, the spotter produced relatively few false positives, 

except for the eukaryotic 5S. 

Minimum, maximum, quantile and median scores for 

all the full model predictions are shown in Table 3, giving 

some indication of the range of scores that rRNAs can be 

expected to have. The table also includes the threshold T99 

and the likelihood Pmin which indicate that all full model 

predictions were expected to have corresponding spotter 

model predictions except some among the archaeal 5S. 

Based on the relatively stable lengths of the different 

types of rRNAs and the corresponding full model hits and 

the position of the spotter hit within them, we decided on 

window sizes around spotter model hits to use when the 

spotter model is used first. These were chosen to be 300 nt 

for the 5S rRNA, 5000 nt for the 16/18S and 9000 nt for 

the 23/28S. As these windows are roughly three times the length of the corresponding rRNAs, we consider rRNA sequences unlikely to extend beyond them. 

Table 3. Evaluation of spotter and full model predictions. 

                     Number of model predictions   Full model scores 
Kingdom      Type    Full    Spotter   FPS         Min      Q1       Med      Q3       Max      T99    Pmin 
Archaea      5S      47      35        7           2.9      12.7     20.0     35.3     50.6     34.9   0.69 
             16S     47      47        0           1180.8   1891.9   1937.9   2004.0   2096.5   <0     1.0 
             23S     46      46        1           2240.7   2714.1   2870.7   3155.3   3267.3   <0     1.0 
Bacteria     5S      1339    1339      123         39.9     77.7     89.5     94.6     109.6    14.0   1.0 
             16S     1237    1237      31          721.9    1905.5   1989.4   2058.7   2148.5   <0     1.0 
             23S     1248    1248      20          2502.8   3267.8   3586.5   3690.7   3876.1   <0     1.0 
Eukaryotes   5S      324     324       251         43.9     51.1     53.9     74.3     82.2     <0     1.0 
             18S     13      13        14          625.3    625.3    1733.1   1777.5   1777.6   <0     1.0 
             28S     19      19        5           1434.2   2904.7   3225.0   3335.9   3380.9   <0     1.0 

The table shows the total number of full model predictions, the number of spotter predictions that had matching full model predictions and the number of false positive spotter model predictions (FPS). The characteristics of the full model prediction score distributions are shown. T99 refers to the lowest score a full model hit could have while still being detected with 99% probability by a spotter model with positive score. Pmin is the probability that a spotter with positive score would find a full model hit with the minimum score indicated. The lowest full model score can be used as a lower limit on which results could be expected to be real. 

Computational speed 

Searching Mycoplasma capricolum ATCC 27343, about 1 Mbp, for bacterial 16S took 14 minutes using the full HMM. Using the spotter to screen the sequence, then the full model on the spotter hits, reduced the time to 16 seconds. Search times are expected to increase proportionally to the genome size; when using the spotter model to screen the sequence, search time will also increase with increasing number of spotter hits. 

Time differences between searching long and short sequences were examined by searching through the complete sequence of Solibacter usitatus Ellin6076, and through the Sargasso Sea environmental samples (19). Searching the S. usitatus genome, about 10 Mbp, took 48 seconds per Mbp. Two copies from each rRNA family were found. The Sargasso Sea samples consisted of 811 372 entries totaling over 800 Mbp. On this set the search speed was 407 seconds per Mbp. The article (19) accompanying this set indicated 1164 small subunit rRNA genes (16/18S) or fragments of genes; we found only 332, but our predictors are not able to find fragments of rRNAs. In addition, we found 562 5S and 68 23S sequences. 

DISCUSSION 

Our aim has been to enable high-throughput searches for 

rRNA while producing accurate and consistent predictions 

suitable for comparative analyses. For this purpose, 

we have developed the RNAmmer package which relies on 

HMMs for both speed and accuracy. HMMs were made 

using HMMer (15), which from a multiple alignment 

produces an HMM where match states represent columns 

with a specific nucleotide distribution, corresponding 

deletion states represent the possibility of gaps, and 

insertion states represent columns with large numbers of 

gaps; transition probabilities between the states indicate 

how likely each of the states is. HMMs thus differ from 

sequence alignments in that the likelihood of insertions 

and deletions may vary along the sequence. When 

searching a sequence with an HMM, the score indicates 

how well the sequence segment matches the model. The 

information content of a position, which reflects the 

nucleotide distribution and the likelihood of gaps, 

indicates how well that position is conserved. A good 

match to the HMM may come either from a highly 

conserved region which may well be short, or from a 

longer region with only weak conservation. We find both 

these cases. Bacterial 16S are detected despite almost half 

of the nucleotides being assigned to insert states, as other 

regions are highly conserved. For archaeal 23S, however, 

the information content of each position is low, but the 

sequence is long and there are few allowed insert states. 

These aspects can also explain cases of poor performance, 

both of the full model and of the spotter model. 

The low information content in the eukaryotic 5S and 

18S alignments indicates that these sequences are more 

divergent than archaeal and bacterial 5S and 16S. 

In addition, 40% of the 5S and 75% of the 18S alignment 

give rise to insert states in the HMM. Thus, there is little 

for the HMM to recognize. In addition, many of the 

missed 18S rRNAs were from Cryptophyta, a phylum 

which makes up only 0.6% of the alignment data. 

The archaeal 5S show the same characteristics as the 

eukaryotic 5S and 18S, which most likely explains the low 

performance for these rRNAs. The scores for archaeal 5S 

hits were generally low, and the spotter score comes only 

from a 75 nt part of the sequence, giving it an even lower score 

and causing it to miss 12 of the full model hits. It is notable, 

however, that these were the only cases missed by the 

spotter model: with the exception of archaeal 5S, our 

analyses show that the spotter should be able to detect 

rRNAs unless they are much further diverged than what 

we find in our data. 

Columns at the beginning and end of the multiple 

alignments often have low conservation and many gaps. 

Such columns are generally accommodated into the 

HMM as insert states, but HMMer ignores them at the 

beginning and end of the alignment. An example is the 5S,

where match states stop around 10 columns from the 

end of the alignments effectively causing the HMM to 

predict the last conserved nucleotide of the consensus 

sequence rather than the stop of the rRNAs. Hence, it is 

not uncommon for the stop position of the 5S to be 

predicted up to 10 nt downstream of the annotated stop 

position. 

These effects can also explain the endpoint accuracy 

that was seen when we compared our results to 

experimentally determined 16S sequences. We tried to 

find sequences where the ends had been experimentally 

verified by RACE or PCR, but such rRNAs proved 

difficult to find. All the ones we selected were sequenced, 

but it is uncertain to what extent the authors had 

tried to determine the ends. These experimentally 

found rRNAs did show better agreement with annotation 

than predictions in general, although this is not sufficient 

to conclude that our predictions are more accurate. Our 

stop predictions were very accurate, but more deviation 

was seen in the start predictions. These results could reflect 

more variation in the beginning of the alignments, which 

as in the 5S case could effectively cause the HMM to 

predict the last conserved nucleotide of the consensus 

sequence rather than the end of the rRNAs. 

In some cases, larger endpoint deviations occur. This 

can happen when one of the ends of the model finds a 

better match in a different part of the sequence. Insertion 

states sometimes allow the HMM to insert long gap 

regions and thus find a matching stop position far from 

the rest of the sequence. As shown for the bacterial 16S 

sequences that displayed this phenomenon, this is less of a 

problem when the spotter model is employed. The window 

searched around the spotter hit would most likely be too 

short to accommodate such an insert, and the model 

would match with the proper sequence. 
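The effect of the search window can be illustrated with a small sketch. The window sizes (300, 5000 and 9000 nt) are those given in the Results; how the window is positioned relative to the spotter hit is not stated in the text, so centering it on the hit and clipping it at the sequence ends are assumptions made for this example, as are the function and variable names.

```python
# Window sizes in nt, as chosen in the Results section.
WINDOW_SIZE = {"5S": 300, "16/18S": 5000, "23/28S": 9000}


def spotter_window(hit_start, hit_stop, rrna_type, sequence_length):
    """Return (start, stop) of the window searched by the full model,
    in 0-based half-open coordinates.

    Assumption: the window is centered on the spotter hit and shifted
    inwards when it would run past either end of the sequence.
    """
    size = WINDOW_SIZE[rrna_type]
    center = (hit_start + hit_stop) // 2
    start = max(0, center - size // 2)
    stop = min(sequence_length, start + size)
    start = max(0, stop - size)  # re-anchor if clipped at the right end
    return start, stop


# Hypothetical spotter hit in a 1 Mbp sequence.
window = spotter_window(10_000, 10_075, "16/18S", 1_000_000)
```

Because the full model only sees the window, an inserted region of 13 000 to 40 000 nt, as observed for some bacterial 16S predictions, cannot fit inside a 5000 nt window.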

For fragmented rRNAs, long gap regions may be 

correctly predicted. This was seen for Coxiella burnetii 23S 

where our prediction has the same start position 

as annotated, but where the predicted stop position 

is 1884 nt downstream of GenBank’s stop position. 

However, according to Entrez Gene, this rRNA appears 

in four pieces and with the same stop position as ours, 

suggesting that in some cases ‘too long’ predictions might 

actually be correct. These cases should normally not be 

masked when using the spotter unless inserts between the 

fragments would make it exceed the window size. 

The HMM produced by HMMer requires time of order 

O(NM) to search a sequence of length N using a model 

with M states, M being proportional to the length of the 

multiple alignment. However, the speed is increased by 

using a 75 nt long spotter model to pre-screen the 

sequence, which requires time of order O(N), and then 

running the full HMM on windows around each spotter 

hit which requires time of order O(KM^2) for K spotter 

hits, and window size proportional to M. The benefit of 

using the spotter is clearly illustrated in the M. capricolum 

searches. However, the time difference between the 

S. usitatus and the Sargasso Sea data searches shows 

that the spotter might lose its advantage when dealing with 

many shorter sequences. 
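The complexity argument and the reported timings can be put side by side in a short worked example. The cost functions below are abstract operation counts with all constant factors set to one, which is an assumption for illustration only; the model size M = 1500 and hit count K = 7 are hypothetical values, whereas the timings are those quoted in the Results.

```python
def full_scan_cost(n, m):
    """Abstract cost of scanning N nucleotides with an M-state HMM: O(NM)."""
    return n * m


def spotter_scan_cost(n, m, k, spotter_len=75):
    """Abstract cost of a spotter pre-screen followed by full-model runs.

    The spotter pass is O(N) for a fixed spotter length; each of the K
    windows has a length proportional to M, giving O(K M^2) in total.
    """
    return n * spotter_len + k * m * m


# Hypothetical example: 1 Mbp genome, 1500-state model, 7 spotter hits.
cost_full = full_scan_cost(1_000_000, 1500)
cost_spotter = spotter_scan_cost(1_000_000, 1500, 7)

# Timings quoted in the text for M. capricolum (about 1 Mbp).
full_seconds = 14 * 60
spotter_seconds = 16
speedup = full_seconds / spotter_seconds

# Search speed in seconds per Mbp for S. usitatus and the Sargasso Sea set.
genome_rate = 48
metagenome_rate = 407
slowdown = metagenome_rate / genome_rate
```

With these numbers the spotter gives a 52.5-fold speedup on M. capricolum, while the Sargasso Sea search is roughly 8.5 times slower per Mbp than the S. usitatus search.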


There are other approaches to predicting non-coding 

RNA. One commonly used method is sequence alignment, 

e.g. BLAST (3), Paralign (20) or FASTA (21). Another is 

based on structure-sensitive Stochastic Context Free 

Grammars (SCFG) (22) which form the basis of the 

tRNA prediction program tRNAscan-SE (23) and of 

Infernal (24), which is used when creating RFAM. While 

the sequence alignment methods are very fast, they are not 

particularly suited for prediction of non-coding RNA (1). 

Infernal, however, has a general worst case running time 

of order O(MN^3), which is prohibitive. The RFAM 

database (17,18), which includes 5S and the 5′ domain 

of 16S, uses BLAST to pre-screen genome sequences, 

followed by Infernal; despite a more efficient approach 

than the general SCFG, it does not analyze the entire 16S. 

A search for 5S in a 1 Mbp genome using Infernal took 

4 hours 45 minutes: almost 1000 times as much as the 

16 seconds used by RNAmmer for the much larger 16S 

model. A time-saving approach to SCFGs could be to use 

the RaveNna (25) package which can convert an RFAM 

SCFG to an HMM. This drastically reduces the running 

time; however, its usefulness would be limited since no 

models for the larger rRNAs are available. Another factor 

is that the 5S found by RaveNna (26) which were not 

already in RFAM were all in organellar sequences, 

sequences not analyzed by RNAmmer. For further 

comparisons and comments on these different methods, 

we refer to (1). 

The RNAmmer program is available as a traditional 

HTML-based prediction server at http://www.cbs.dtu.dk/ 

services/RNAmmer as well as through a SOAP-based 

web service. It is also available for download through 

the same site. 

SUPPLEMENTARY DATA 

Supplementary Data is available at NAR online. 

ACKNOWLEDGEMENTS 

We are grateful for funding from EMBIO at the 

University of Oslo, the Research Council of Norway 

and the Danish Center for Scientific Computing. It was 

also supported by a grant from the European Union 

through the EMBRACE Network of Excellence, contract 

number LSHG-CT-2004-512092. We would also like to 

thank our colleagues for critical reading of the manuscript. 

Funding to pay the Open Access publication charge 

was provided by Research Council of Norway. 

Conflict of interest statement. None declared. 

REFERENCES 

1. Freyhult,E., Bollback,J. and Gardner,P. (2007) Exploring genomic 

dark matter: a critical assessment of the performance of homology 

search methods on noncoding RNA. Genome Res., 17, 117–125. 

2. Pedersen,A., Jensen,L., Brunak,S., Staerfeldt,H. and Ussery,D. 

(2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol., 

299, 907–930. 

3. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) 

Basic local alignment search tool. J. Mol. Biol., 215, 403–10.


4. Wimberly,B., Brodersen,D., Clemons,W. Jr., Morgan-Warren,R., 

Carter,A., Vonrhein,C., Hartsch,T. and Ramakrishnan,V. (2000) 

Structure of the 30s ribosomal subunit. Nature, 407, 327–339. 

5. Schluenzen,F., Tocilj,A., Zarivach,R., Harms,J., Gluehmann,M., 

Janell,D., Bashan,A., Bartels,H., Agmon,I. et al. (2000) Structure 

of functionally activated small ribosomal subunit at 3.3 angstroms 

resolution. Cell, 102, 615–623. 

6. Nissen,P., Hansen,J., Ban,N., Moore,P. and Steitz,T. (2000) 

The structural basis of ribosome activity in peptide bond synthesis. 

Science, 289, 920–930. 

7. Yusupov,M., Yusupova,G., Baucom,A., Lieberman,K., Earnest,T., 

Cate,J. and Noller,H. (2001) Crystal structure of the ribosome at 

5.5 Å resolution. Science, 292, 883–896. 

8. Srivastava,A. and Schlessinger,D. (1991) Structure and organization 

of ribosomal DNA. Biochimie, 73, 631–638. 

9. Acinas,S., Marcelino,L., Klepac-Ceraj,V. and Polz,M. (2004) 

Divergence and redundancy of 16s rRNA sequences in genomes 

with multiple rrn operons. J Bacteriol, 186, 2629–2635. 

10. Jackson,S., Cannone,J., Lee,J., Gutell,R. and Woodson,S. (2002) 

Distribution of rRNA introns in the three-dimensional structure 

of the ribosome. J Mol Biol, 323, 35–52. 

11. Evguenieva-Hackenberg,E. (2005) Bacterial ribosomal RNA in 

pieces. Mol Microbiol, 57, 318–325. 

12. Wuyts,J., Perriere,G. and Van De Peer,Y. (2004) The European 

ribosomal RNA database. Nucleic Acids Res, 32 Database issue, 

D101–D103. 

13. Szymanski,M., Barciszewska,M., Erdmann,V. and Barciszewski,J. 

(2002) 5s Ribosomal RNA database. Nucleic Acids Res., 30, 176–178. 

14. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection 

of representative protein data sets. Protein Sci., 1, 409–417. 

15. Eddy,S. (1998) Profile hidden markov models. Bioinformatics, 14, 

755–763. 

16. Henikoff,S. and Henikoff,J. (1994) Position-based sequence weights. 

J. Mol. Biol., 243, 574–578. 

17. Griffiths-Jones,S., Moxon,S., Marshall,M., Khanna,A., Eddy,S. 

and Bateman,A. (2005) Rfam: annotating non-coding RNAs in 

complete genomes. Nucleic Acids Res., 33 Database Issue, 

D121–D124. 

18. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and 

Eddy,S. (2003) Rfam: an RNA family database. Nucleic Acids Res., 

31, 439–441. 

19. Venter,J., Remington,K., Heidelberg,J., Halpern,A., Rusch,D., 

Eisen,J., Wu,D., Paulsen,I., Nelson,K. et al. (2004) Environmental 

genome shotgun sequencing of the Sargasso Sea. Science, 304, 

66–74. 

20. Rognes,T. (2001) ParAlign: a parallel sequence alignment algorithm 

for rapid and sensitive database searches. Nucleic Acids Res, 29, 

1647–1652. 

21. Pearson,W. and Lipman,D. (1988) Improved tools for biological 

sequence comparison. Proc. Natl. Acad. Sci. USA, 85, 2444–2448. 

22. Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (2000) 

Biological Sequence Analysis: Probabilistic Models of Proteins and 

Nucleic Acids. Cambridge University Press. 

23. Lowe,T. and Eddy,S. (1997) tRNAscan-SE: a program for 

improved detection of transfer RNA genes in genomic sequence. 

Nucleic Acids Res., 25, 955–964. 

24. Eddy,S. (2002) A memory-efficient dynamic programming algorithm 

for optimal alignment of a sequence to an RNA secondary 

structure. BMC Bioinformatics, 3, 18. 

25. Weinberg,Z. and Ruzzo,W. (2006) Sequence-based heuristics for 

faster annotation of non-coding RNA families. Bioinformatics, 22(1). 

26. Weinberg,Z. and Ruzzo,W.L. (2004) In RECOMB 04: Proceedings of 

the Eighth Annual International Conference on Computational 

Molecular Biology, ACM Press, pp. 243–251.


3.9 Paper VII: GeneWiz browser: An Interactive Tool for 

Visualizing Sequenced Chromosomes 


Standards in Genomic Sciences (2009) 1: 204-215 DOI:10.4056/sigs.28177 

GeneWiz browser: An Interactive Tool for Visualizing 

Sequenced Chromosomes 

Peter F. Hallin¹, Hans-Henrik Stærfeldt¹, Eva Rotenberg¹,², Tim T. Binnewies¹,³, Craig J. Benham⁴ and David W. Ussery¹ 

¹ Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, 2800 Kgs. Lyngby, Denmark 

² Lersoe Parkalle 37, 2TV, 2100 Copenhagen, Denmark 

³ Roche Diagnostics Ltd., CH-6343 Rotkreuz, Switzerland 

⁴ UC Davis Genome Center, University of California, Davis, California, U.S.A. 

We present an interactive web application for visualizing genomic data of prokaryotic chromosomes. 

The tool (GeneWiz browser) allows users to carry out various analyses such as 

mapping alignments of homologous genes to other genomes, mapping of short sequencing 

reads to a reference chromosome, and calculating DNA properties such as curvature or stacking 

energy along the chromosome. The GeneWiz browser produces an interactive graphic 

that enables zooming from a global scale down to single nucleotides, without changing the 

size of the plot. Its ability to disproportionally zoom provides optimal readability and increased 

functionality compared to other browsers. The tool allows the user to select the display 

of various genomic features, color settings and data ranges. Custom numerical data can 

be added to the plot allowing, for example, visualization of gene expression and regulation 

data. Further, standard atlases are pre-generated for all prokaryotic genomes available in 

GenBank, providing a fast overview of all available genomes, including recently deposited 

genome sequences. The tool is available online from 

http://www.cbs.dtu.dk/services/gwBrowser. Supplemental material including interactive atlases 

is available online at http://www.cbs.dtu.dk/services/gwBrowser/suppl/. 

Introduction 

The development of fast and inexpensive genome sequencing technologies has led to the generation of vast amounts of genomic information. As genomic sequencing becomes both more powerful and affordable, the handling and analysis of the generated data produce novel challenges and shift the focus away from the discovery process towards technical considerations of handling, storing and analyzing sequence data. An important step when exploring a new genome is to compare it to existing sequences, in order to identify both novel and conserved features. Many automated computational methods are available that attempt to derive protein function from sequence [1-3]. In a metagenomic study by Harrington and co-workers it was estimated that 76% of the examined protein coding genes could be assigned a function. However, to assess predictions for individual genes, visualization remains critical to provide the biologist with an overview of the genomic context. Are genes of interest situated in clusters? In operons? How are they regulated? How does their DNA base composition compare with that of the rest of the genome? In order to display such features both on a genome scale and in close-up down to the level of nucleotides, we developed the GeneWiz browser, which is based on the ‘Genome Atlas’ concept [4,5]. This tool can also display local DNA structural properties, so that regulatory or repeat regions can easily be identified and interpreted in a chromosomal context.

During development of the GeneWiz browser, it became apparent that novel sequencing technology creates a further demand. The current generation of sequencing instruments utilizes primed synthesis in flow cells to simultaneously obtain the sequences of millions of different DNA templates, an approach that changed the field of DNA sequencing [6,7]. Flow sequencing, also known as sequencing by synthesis (SBS) on a solid surface, tracks nucleotides as they are added to a growing DNA strand [8]. SBS is used by high-throughput sequencing systems which have become commercially available in the past two years. Examples include the GS Titanium sequencer (commercialized by 454/Roche), the Genome Analyser GA-II (Solexa/Illumina), and the SOLiD 3 system (Applied Biosystems).

The Genomic Standards Consortium

These developments have increased the speed of sequencing while significantly reducing its cost [9,10]. This much higher throughput provides greater coverage, but at the cost of much shorter read lengths: from 50 bases with SOLiD 3 to 75 bases with Illumina GA II. Even reads of 500 bases obtained with the 454-Titanium are still shorter than read lengths typically obtained using the Sanger method [9,11]. The output from modern high-throughput sequencing equipment challenges the assembly software by generating shorter and ambiguous reads. Processing of this flood of sequence data has rapidly become a bottleneck, and developing the necessary skills and tools will most likely be a driving factor in the execution of second-generation sequencing [12]. As a first step in this development, it needs to be determined to what extent assembly of short-read sequences can be trusted, an assessment for which the GeneWiz browser can also be used.

Methods 

Our method of visualization is based on color-encoded lanes to display numerical information on a genome atlas similar to GeneWiz [4,5]. The color encoding can be done either using a linear scale with a fixed minimum and maximum range, or a dynamic scale of standard deviations. Using the latter, color intensity decreases as data approach average values, thereby emphasizing regions of significant variation. The web interface is divided into four optional sections, to address various biological viewpoints of chromosomes: 1) DNA properties; 2) mapping of homologous genes by BLAST; 3) mapping of short sequencing reads; 4) custom lanes such as Single Nucleotide Polymorphism (SNP) or microarray data. The output of each method is a numerical vector of length corresponding to that of the reference sequence, and the methods used for this construction are described in detail below.
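The two color-encoding modes can be illustrated with a minimal Python sketch. It is not the GeneWiz implementation: the function names, the clamping behavior and the cap of three standard deviations (`max_sd`) are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def linear_intensity(value, lo, hi):
    """Map a value to [0, 1] on a fixed linear scale, clamping outside [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def stddev_intensity(value, avg, sd, max_sd=3.0):
    """Signed intensity in [-1, 1]: 0 at the average, +/-1 at max_sd deviations.

    max_sd is a hypothetical cap chosen for this sketch.
    """
    if sd == 0:
        return 0.0
    z = (value - avg) / sd
    return max(-1.0, min(1.0, z / max_sd))


# A toy lane: mostly average values with one outlier at index 4.
lane = [0.48, 0.50, 0.52, 0.50, 0.90, 0.50, 0.49, 0.51]
avg, sd = mean(lane), pstdev(lane)
linear = [linear_intensity(v, 0.0, 1.0) for v in lane]
dynamic = [stddev_intensity(v, avg, sd) for v in lane]
```

On the dynamic scale the outlier receives by far the strongest intensity while values near the average fade towards zero, which is the behavior the text describes for emphasizing regions of significant variation.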

Read quality assessment 

Gene duplications, rRNA operons and other repetitive chromosomal regions are known to cause difficulties during the assembly of short reads [13]. To assess the degree of ambiguity of sequencing reads, a method was developed that derives the uniqueness of all reads, accounting for both the read quality and the match to the reference genome.

Sequence reads from Illumina and 454 are reported with base qualities: a per-nucleotide measure that denotes the credibility of the base calls. A method was derived which condenses these qualities into values per position in the reference genome and calculates the following information: uniqueness-weighted quality, information content, sequence agreement, and repeat-weighted coverage (see methods). These estimates provide a preliminary overview of regions that may appear problematic to assemble. In general, low uniqueness is found in the gaps between the assembled contigs generated by the default assembly tools from a given sequence dataset, as will be demonstrated below. A high score of uniqueness-weighted quality indicates that the base is uniquely identified by a read and that it has a high base quality in that read. The approach is illustrated in Figure 1.

From the mapping, five different parameters were calculated, which together summarize the trustworthiness of the reads given the assembly:

Weighted coverage. Under the assumption that all reads would map only once (H_r = 1), the coverage c(i) can be calculated as the number of alignments R mapped at position i. A weighted coverage c'(i), derived from the weights w_r,h (see equation below), is used to correct for higher coverage artificially introduced by repeats:

http://standardsingenomics.org 205


Figure 1 | Mapping reads to a reference genome accounting for uniqueness. In step 1, each read is aligned against the reference genome. In step 2, the quality of each read is weighted according to the uniqueness of the hit. A read giving rise to two hits with scores S_1 and S_2 in the reference genome is weighted in proportion to the relative alignment scores; if the scores are identical, the two mappings are each given a weight of w = 0.5 (see equation below). Step 3 maps the weighted qualities back to the reference genome so that each genomic position contains an array of weighted qualities. Once all reads are mapped, in step 4 only the maximum weighted quality value is kept and, in step 5, the maximum weighted quality scores are color coded to reveal regions of low uniqueness.
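The weighting described in the caption can be written out as a short worked equation. This is a reconstruction under stated assumptions, not the published formula: normalizing each hit score by the sum of the scores of all accepted hits of the same read is assumed because it reproduces the stated weight of w = 0.5 for two identical scores, and taking the weighted coverage as the sum of weights over all alignments covering a position is likewise assumed. The symbols S_h, H_r, q_r(i), q'_r(i) and c'(i) follow the text.

```latex
% Assumed reconstruction; symbols follow the surrounding text.
% Weight of hit h of read r among its H_r accepted hits:
w_{r,h} = \frac{S_h}{\sum_{k=1}^{H_r} S_k}

% Uniqueness-weighted quality of base i of read r at hit h:
q'_r(i) = w_{r,h} \, q_r(i)

% Weighted coverage at genome position i (sum over alignments covering i):
c'(i) = \sum_{(r,h)\ \text{covering}\ i} w_{r,h}

% Worked example: two hits with identical scores S_1 = S_2 = S
w_{r,1} = w_{r,2} = \frac{S}{S + S} = 0.5
```

With H_r = 1 the weight is 1, so c'(i) reduces to the plain coverage c(i), in agreement with the unweighted case.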

Uniqueness-weighted quality. This measure corresponds to the base qualities obtained from the reads that are mapped to the reference genome, weighted by the uniqueness of the read. Consider read r, which has a quality profile q_r(i), where i is the position in the read. The read is aligned to the reference genome by BLAST, and all H_r hits are included when the following criteria are met: the BLAST score S_h of hit h is greater than or equal to S_0 (optionally provided by the user); S_h ≥ S_1 · x, where S_1 is the score of the first/best hit and x ∈ [0;1] is a constant provided by the user; and the E-value is equal to or less than a threshold specified by the user. The following formula is used to derive the weighted quality q'_r(i):

From all the q'_r(i) values obtained at each position in the genome, the maximum uniqueness-weighted quality is chosen when all reads have been mapped.

Information content provides a number in bits of information [14] representing to what degree the reads agree: zero bits means equal distribution of A, T, G and C at a given position and 2 bits means complete conservation of a single base. The value is plotted on a color scale whereby low information (random distribution, least expected) is given in dark colors, and high information (high conservation, most expected) as light or neutral color. This measure may be useful for visualizing single nucleotide polymorphisms.


Read absence. A boolean where ‘one’ indicates 

complete absence of aligned reads. 
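The per-position measures listed above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the score-proportional weights, the unweighted base counts used for the information content, and all function and variable names are assumptions. Information content is computed as 2 minus the Shannon entropy of the observed base frequencies, which gives 0 bits for an equal distribution of A, T, G and C and 2 bits for complete conservation.

```python
import math
from collections import Counter


def hit_weights(scores):
    """Weight each hit of one read by its share of the summed scores (assumed)."""
    total = sum(scores)
    return [s / total for s in scores]


def information_content(bases):
    """Bits of information for one alignment column: 2 - Shannon entropy."""
    counts = Counter(bases)
    n = len(bases)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy


def summarize(genome_length, hits):
    """hits: (start, weight, qualities, bases) per accepted alignment, 0-based start.

    Returns weighted coverage, maximum uniqueness-weighted quality,
    information content (None where no read aligns) and read absence.
    """
    coverage = [0.0] * genome_length
    max_quality = [0.0] * genome_length
    columns = [[] for _ in range(genome_length)]
    for start, weight, qualities, bases in hits:
        for offset, (quality, base) in enumerate(zip(qualities, bases)):
            pos = start + offset
            if pos >= genome_length:
                break
            coverage[pos] += weight
            max_quality[pos] = max(max_quality[pos], weight * quality)
            columns[pos].append(base)
    info = [information_content(col) if col else None for col in columns]
    absence = [0 if col else 1 for col in columns]
    return coverage, max_quality, info, absence


# One unique read and one read with two equal-scoring hits (a repeat).
w1, w2 = hit_weights([60, 60])
hits = [
    (0, 1.0, [30, 30, 30], "ACG"),
    (3, w1, [40, 40], "TT"),
    (5, w2, [40, 40], "TT"),
]
coverage, max_quality, info, absence = summarize(8, hits)
```

In this toy case the repeated read contributes only half of its base quality and half a unit of coverage at each of its two locations, so the repeat shows up as a region of lower uniqueness-weighted quality even though its raw base qualities are higher.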

Visualization of whole-genome homology 

The BLASTatlas method [15] derives a map of per-nucleotide numbers on a reference genome to visualize the matches in the alignment between the reference genome and a query. The query can constitute any number of genomic contigs, scaffolds, full genomes, or collections thereof. This provides a method to identify regions of a reference genome that are conserved throughout multiple samples, as well as those that are unique. The BLASTatlas method is integrated into the GeneWiz browser software to facilitate a user-friendly interface. According to the BLAST algorithm chosen, DNA or protein sequences of the reference are aligned with the best match in the query (using either blastp, blastn, tblastn, or blastx). The alignment is then mapped back to the reference genome. A match adds a 'one' whereas a mismatch adds a 'zero' at each position along the chromosome. These ones and zeros translate into smooth color zones due to binning.
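The mapping of matches and mismatches back to the reference, followed by binning, can be sketched as below. The sketch covers only a nucleotide-level alignment given as a pair of gapped strings; the input format, the function names and the use of a plain mean per bin are assumptions for illustration, not the BLASTatlas implementation.

```python
def alignment_to_vector(ref_length, alignments):
    """Build a per-nucleotide 0/1 vector on the reference.

    alignments: (ref_start, ref_aligned, query_aligned) tuples with a 0-based
    start and equal-length gapped strings ('-' marks a gap).
    """
    vector = [0] * ref_length
    for ref_start, ref_aligned, query_aligned in alignments:
        pos = ref_start
        for ref_base, query_base in zip(ref_aligned, query_aligned):
            if ref_base == "-":
                continue  # insertion in the query: no reference position used
            if ref_base == query_base:
                vector[pos] = 1
            pos += 1
    return vector


def bin_means(vector, bin_size):
    """Average consecutive positions so that 0/1 values become smooth shades."""
    bins = []
    for i in range(0, len(vector), bin_size):
        window = vector[i:i + bin_size]
        bins.append(sum(window) / len(window))
    return bins


# One alignment starting at reference position 2 with a single mismatch.
vector = alignment_to_vector(12, [(2, "ACGTAC", "ACCTAC")])
shades = bin_means(vector, 4)
```

Binning turns isolated mismatches into intermediate values, which is one plausible reading of why the lanes appear as smooth color zones at low zoom levels.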

DNA properties and DNA destabilization 

Through the web interface it is currently possible to select from 36 different nucleotide composition and DNA structural properties [4,5,16-22]. In addition, calculations of so-called SIDD (stress-induced duplex destabilization) energy estimates are provided, offering an approximation of promoter regions. This method estimates the free energy required to open the DNA helix, calculated at superhelix densities of sigma = -0.035, -0.044, and -0.055, using the SIDD algorithm [23]. All of these parameters can be applied in any combination to any of the prokaryotic genomes available from the web interface, or to a custom sequence provided by the user. Alternatively, the parameters may be applied as collections forming 8 standard atlases: the Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA- and Repeat-atlas, and finally the SIDD atlas, which is introduced in this manuscript (Figure 3).

Figure 3 Configuration and references for pre-defined groups of DNA sequence- and structural 

properties: Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, Repeat-, and SIDD-atlas. 

Custom data 

A designated section of the GeneWiz browser is assigned for custom data. It allows the user to provide a per-nucleotide list of numerical values along with a desired color and data range. Although not presented here, this allows for visualization of additional information such as microarray data that has been pre-processed by the user, by mapping gene expression, regulation change, or p-values back to genomic coordinates. In addition to the main genome annotation covering CDSs, tRNAs, and rRNAs, the user may specify miscellaneous and pseudogene annotations separately. A button allows the query of selected reference genomes against a replicate of pseudogene.org [24]. Other annotations of possible pseudogenes can be added, such as GenePRIMP output (geneprimp.jgi-psf.org/).
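Pre-processing per-gene values into a per-nucleotide lane could look like the following sketch. The coordinate convention (1-based, inclusive, as in GenBank feature tables), the default value for intergenic positions and the function name are assumptions; the upload format expected by the browser is not specified here.

```python
def genes_to_lane(genome_length, gene_values, default=0.0):
    """Spread per-gene values over genomic coordinates.

    gene_values: (start, end, value) with 1-based inclusive coordinates.
    Where genes overlap, the gene listed last overwrites earlier values.
    """
    lane = [default] * genome_length
    for start, end, value in gene_values:
        for pos in range(start - 1, min(end, genome_length)):
            lane[pos] = value
    return lane


# Two hypothetical genes with log-ratio expression values on a 10 bp toy genome.
lane = genes_to_lane(10, [(2, 4, 1.5), (7, 9, -2.0)])
```

The resulting list has one value per nucleotide of the reference, matching the vector form that every lane in the browser is built from.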

Dynamic visualization 

The GeneWiz browser allows dynamic disproportional zooming, meaning that zooming occurs nearly instantly when requested by the user, by redrawing all the components like tracks, legends, marks and text for every view. This allows the browser to scale the plot to make use of the entire plotting area, by not rescaling all parts of the plot equally. For example, zooming 10x will stretch a data lane 10-fold along the genome position axis, while the lane height and the distance to the neighboring lane remain constant. The dynamic nature of the GeneWiz browser requires pre-binning of data for each zoom level, all of which are stored on a central server; for improved efficiency only data requested by the user are sent. Storing per-nucleotide information as table records in a database (e.g. MySQL) proved unfeasible, as the number of records per genome runs into the millions and the construction of indexes would be very time consuming. Instead, a memory mapping technique was chosen that allows the server to directly obtain the values from binary files when provided with the zoom window and level, for any chromosome in the database. (Examples are provided as supplemental data, http://www.cbs.dtu.dk/services/gwBrowser/suppl/).
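The combination of pre-binning per zoom level and memory-mapped retrieval can be sketched in Python. The actual server uses a compiled C program and its file layout is not described, so everything specific here is an assumption: little-endian 32-bit floats, one file per level, bins of 2**level positions, and the file and function names.

```python
import mmap
import os
import struct
import tempfile

VALUE = struct.Struct("<f")  # assumed layout: one little-endian float per bin


def write_zoom_levels(values, directory, levels):
    """Write one binary file per zoom level; level l averages 2**l positions."""
    paths = {}
    for level in range(levels):
        size = 2 ** level
        path = os.path.join(directory, f"lane.level{level}.bin")
        with open(path, "wb") as handle:
            for i in range(0, len(values), size):
                window = values[i:i + size]
                handle.write(VALUE.pack(sum(window) / len(window)))
        paths[level] = path
    return paths


def read_window(path, level, begin, end):
    """Return the bins overlapping genome positions [begin, end), 0-based."""
    size = 2 ** level
    first, last = begin // size, (end - 1) // size
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            chunk = mapped[first * VALUE.size:(last + 1) * VALUE.size]
    return [item[0] for item in VALUE.iter_unpack(chunk)]


values = [float(i) for i in range(16)]
with tempfile.TemporaryDirectory() as directory:
    paths = write_zoom_levels(values, directory, levels=3)
    fine = read_window(paths[0], 0, 3, 6)      # single positions 3 to 5
    coarse = read_window(paths[2], 2, 4, 12)   # two bins of four positions
```

Because every bin has a fixed width in bytes, the byte offset of any window follows directly from the window coordinates and the zoom level, so no index is needed. This is the property that makes flat binary files attractive compared with database records in this setting.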

The client is written as a Java applet that obtains the data remotely from the server (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi). The browser server is written in Perl/CGI, while a compiled C program handles the access to the binary data files. The options currently supported are listed in Table 2.

Table 2 GeneWiz Browser server options.

Option                  Description
d                       The unique identifier for the atlas
ft                      Feature type (e.g. CDS, rRNA, tRNA) when returning annotations
f                       Data field to return
b                       Begin of window
e                       End of window
l                       Zoom level
z                       Enable zlib compression of output
m=i                     Return the genome length
m=avg/stddev/min/max    Return aggregate data for window/genome
m=d                     Return data values provided field, window and zoom level
m=c                     Return colors provided two or three-step ranges
m=n                     Return nucleotides provided the window
m=a                     Return annotations (used together with option ‘ft’)
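As a hedged illustration of how the options in Table 2 might be combined into one request, the sketch below assembles a URL without contacting any server. That the options are passed as an ordinary query string is an assumption, as are the field name `gc_skew` (hypothetical) and the use of `AL111168GENOMEatlas` as the atlas identifier for option `d`; only the option letters and the server address come from the text.

```python
from urllib.parse import urlencode

SERVER = "http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi"


def build_request(atlas_id, mode, **options):
    """Combine Table 2 options into one URL (query-string form is assumed)."""
    params = {"d": atlas_id, "m": mode}
    params.update(options)
    return f"{SERVER}?{urlencode(params)}"


# Data values of a hypothetical field for window 1-10000 at zoom level 3.
url = build_request("AL111168GENOMEatlas", "d", f="gc_skew", b=1, e=10000, l=3)
```

A client would issue one such request per lane and per view, which fits the description that only the data requested by the user are sent.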

These options (Table 2) can be incorporated into a single URL. For example, one could request all [...] (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-[...]). [...] are described in the XML record, which can be downloaded from the web (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/fetchxml.cgi?AL111168GENOMEatlas). Further examples are provided in the supplemental data section.

The GeneWiz workflow and data displayed

The GeneWiz browser plots and provides disproportional zooming for data pertaining to features and genes as well as numerical data associated with each nucleotide. The disproportional capability of the GeneWiz browser implies that all components (legends, tracks, marks, etc.) are regenerated for every view requested by the user. Figure 4 outlines the GeneWiz browser workflow.

When submitting a job via the web interface, the request is assigned a job identifier, under which all data lanes and configurations are kept. After the job has been processed the user may alter lane order, colors, ranges, and append various types of marks to the plot. The layout of a given browser instance is governed by an XML file, located on the server. When generating the graphical representation of the genome, the client Java program will make requests to the server to acquire aggregated values, such as the averages, standard deviations, minima, and maxima as well as lane data and annotations.



Figure 4 | The dataflow of the GeneWiz browser service. 1) The selected reference genome and the 

lanes to be included are defined via the web interface. 2) The request is sent to the analysis server 

that handles the calculations. 3) When the job is finished, the web page redirects to the applet 

viewer that allows the user to navigate and edit the plot layout. 

Premade atlases 

The genome sequences stored in the CBS Genome Atlas Database [25] are synchronized with NCBI Entrez genome projects and have been pre-processed for all of the eight standard atlases mentioned above. This allows the user to select from currently 1,636 pre-binned replicons from 864 prokaryotic sequencing projects, searchable by replicon name, GenBank accession number, or organism name (http://www.cbs.dtu.dk/services/gwBrowser/precalc/).

Results 

Evaluation of re-sequencing quality 

Three re-sequenced bacterial genomes were examined: one genome sequence was generated using the Illumina GA technology, whereas two genome sequences were generated utilizing the 454-Titanium technology (Table 3). The public sequence was selected as reference for mapping the re-sequencing reads using the GeneWiz browser tool. The randomness in fragmentation was estimated by comparing the experimental data with in-silico digestions, generated at 40X coverage using read lengths between 30 and 5,000 bp. A good correspondence between the in-silico and experimental reads suggests little bias towards certain chromosomal regions (Figure 5, panel A). The assembled contigs provided by 454 (C. jejuni and E. coli) are mapped to the reference genome using BLAST and annotated in the perimeter of the atlases (two leftmost atlases in Figure 5, panel A+B). The detailed atlases of the experimental data (true reads) are shown in Figure 5, panel B. Panel C shows quality/count of reads plotted as a function of read position. Note that the read quality decreases with increasing distance from the beginning of the read.
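An in-silico digestion at a target coverage can be approximated with a short simulation. The text does not state how the digestions were generated, so the sketch assumes uniformly distributed read start positions, a fixed read length, a circular chromosome and a seeded random generator; all names are hypothetical.

```python
import random


def simulate_reads(genome_length, read_length, coverage, seed=0):
    """Draw uniformly placed read starts until the target coverage is reached."""
    rng = random.Random(seed)
    n_reads = round(coverage * genome_length / read_length)
    return [rng.randrange(genome_length) for _ in range(n_reads)]


def per_base_coverage(genome_length, starts, read_length):
    """Count reads covering each position, wrapping around a circular genome."""
    depth = [0] * genome_length
    for start in starts:
        for offset in range(read_length):
            depth[(start + offset) % genome_length] += 1
    return depth


# 40X coverage of a 10 kb toy chromosome with 50-base reads.
starts = simulate_reads(10_000, 50, 40)
depth = per_base_coverage(10_000, starts, 50)
mean_depth = sum(depth) / len(depth)
```

Simulated reads of this kind give the coverage and uniqueness profile expected from unbiased fragmentation, against which the profile of the experimental reads can be compared.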



Table 3 Sequencing details of three bacterial genomes, two of which were re-sequenced using 454-Titanium and one with Illumina GA technology.

                                    E. coli K12 MG1655   C. jejuni NCTC11168   S. typhi Ty2
Strain id                           ATCC: 700926D-5      ATCC: 700819D-5       ERA000001
Technology                          454-Titanium         454-Titanium          Illumina GA II
Read count                          538,784              502,438               1,650,370
Avg read length (std. dev)          522 (53)             598 (75)              51 (0)
Truncated length                    600                  600                   35
Coverage                            61X                  183X                  18X
Genome size                         4,639,675 bp         1,641,481 bp          4,791,961 bp
Accession and original reference    U00096 [26]          AL111168 [27]         AE014613 [28]
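The coverage row of Table 3 appears consistent with the usual estimate of read count times average read length divided by genome size. Whether the values were actually derived this way is an assumption; the short check below simply applies that estimate to the numbers in the table.

```python
# (read count, average read length, genome size in bp) taken from Table 3
datasets = {
    "E. coli K12 MG1655": (538_784, 522, 4_639_675),
    "C. jejuni NCTC11168": (502_438, 598, 1_641_481),
    "S. typhi Ty2": (1_650_370, 51, 4_791_961),
}

coverage = {
    name: reads * length / genome
    for name, (reads, length, genome) in datasets.items()
}
rounded = {name: round(value) for name, value in coverage.items()}
```

Under this estimate the three datasets round to 61X, 183X and 18X, matching the tabulated values when the average (not the truncated) read length is used.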

Figure 5 | Panel A: The maximum uniqueness quality is shown for the actual reads (green-to-blue 

lane) plotted in the outermost lanes, using the published genome as a reference. The following 

lanes show in-silico digestions at 40 X coverage (red-to-blue lane), using read lengths 30, 50, 70, 

200, 500, 1,000, 1,000, and 5,000 bases. Panel B shows the weighted coverage, agreement with 

reference, maximum uniqueness quality, information content, read absence, and AT content. All 

six plots can be accessed for zooming via the supplemental data section. Panel C displays the read 

count (green, secondary ordinate) and read quality (red, primary ordinate) as a function of read 

length. Note that read counts differ within the three datasets, resulting in different scales on the 

secondary ordinate. For the two 454-Titanium sets (C. jejuni and E. coli K12), an assembly was 

provided which allows a mapping of contigs to the reference genome. These marks are shown in 

gray in the perimeter of these plots. Red marks indicate contigs with two or more hits in the reference. 


Genome homology: Comparing multiple 

Burkholderia species 

A comparative study aimed at mapping, for example, pathogenic islands or gene losses among different bacterial genomes can benefit from a graphical representation provided by the BLASTatlas method. The genus Burkholderia covers a number of important animal and human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection in cystic fibrosis (CF) patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei, rarely gives rise to disease in humans [29,30]. Both B. thailandensis and B. mallei display large chromosomal deletions when compared to B. pseudomallei. However, the more scattered nature of the gene loss observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through the loss of larger regions [31]. These deletions are evident from the atlas shown in Figure 6, where the two chromosomes of Burkholderia pseudomallei 1710b are used as BLASTatlas reference in a comparison with 14 publicly available Burkholderia genomes (B. thailandensis plus all species having two or more strains sequenced, see supplemental data). In addition, it is evident that a strong preference for deletions exists in chromosome II. Ong and co-workers report that deletions in chromosome II account for 70% and 61% of the total gene loss in B. mallei and B. thailandensis, respectively.

Figure 6 | BLASTatlas of Burkholderia pseudomallei 1710b chromosomes I+II compared with 14 

Burkholderia species. Showing from the outermost circles: B. ambifaria (2, purple), B. cenocepacia 

(4, red) B. thailandensis (1, green) 10774, B. mallei (4, green), and B. pseudomallei (3, blue). Innermost 

circles show percent AT, and CG skew. Note, that to allow visual comparison between B. 

thailandensis and B. mallei, both species are colored green: the outermost green lane corresponds 

to the single B. thailandensis, whereas the remaining four green lanes are all B. mallei. GenBank 

accession numbers as well as interactive plots are available through the supplemental data section. 



The SIDD atlas: Annotation of regulatory 

elements 

The browser application enables the user to append various annotation marks such as transcription start site arrows, gene labels, and boxes. A final example illustrates how these marks can be used to integrate known regulatory elements with DNA properties and gene annotations to draw a more complete picture of a promoter region. The regulatory elements of the E. coli K12 MG1655 rrn operons [32] have been annotated in a standard SIDD atlas, providing a visualization of the P1/P2 promoter structure (Figure 7). A zoom of the promoter region reveals a strong SIDD site near the predominant P1 promoter, approximately 40 bp upstream of the P1 transcription start site. The transcription factor FIS stimulates transcription at several promoters; for example, the binding of FIS at the leuV promoter [33] has been suggested to transmit the superhelical destabilization downstream to the point where the RNAP twists and opens the helix [34]. This model may also be valid for the rrnB P1 promoter, as the activities of leuV and rrnB P1 are comparable [35].

Figure 7 | A zoom upstream of the E. coli K12 MG1655 rrnB operon. The three outer-most lanes 

show SIDD at three superhelix densities of sigma=-0.055, -0.045, and -0.035. The lower free energy 

required to melt the helix can be observed near the UP element of P1, for the SIDD lane at sigma 

= -0.045. The atlas is available for zooming on the supplemental data section. 

Discussion 

Visualization of the multidimensional information that is represented by a single genome sequence remains complex. An indispensable property of a genome visualization tool is that it must be zoomable, so that information can be interpreted at varying scales. Two recently published methods, the DNAPlotter [36] and the Genome Projector [37], both enable the user to build circular plots of numerical data related to genes as well as graphs of numerical data pertaining to the nucleotides. These tools create static graphics and allow only for proportional zooming, hence making the plot hard to interpret when zooming too deep. Both of these tools allow for visualization of individual genomes, but do not allow easy comparison across multiple genomes. With new genome sequences becoming available so readily, it is essential to be able to quickly compare other genomes to a reference.

A number of other tools approach genome visualization from different angles: Genome Diagram [38] and Circos [39] are command line programs generating publication quality static images and vector graphics. Although these tools allow comparison of other genomes, are flexible and allow visualization of numerical data, they lack an interactive layer.

The GeneWiz browser described here uses disproportional zooming to overcome this. From a technical perspective, the choice of programming language for writing graphical browsers is of importance. There are obvious advantages of providing platform-independent Java software like that of the GeneWiz browser, but often this is at the cost of performance. Nevertheless, our tool demonstrates the usefulness of a genome browser that relies on interactive, true disproportional zooming to visualize annotated genes and features as well as numerical data provided at single nucleotide resolution. By building a comprehensive tool that is both scalable and flexible, we have shown how different types of genomic data can be integrated into a single, easily navigated graphic that can be annotated further by the user.

Author contributions

P.F.H. wrote the paper and composed the web interfaces, as well as most parts of the server back end. H.H.S. wrote the C code of the data binning and retrieval software and contributed to the Java applet; E.R. wrote the majority of the Java applet code and formulated the XML configurations. T.T.B. provided source data and analysis of C. jejuni and E. coli sequencing reads, and C.J.B. assisted in writing the paper (paragraphs on SIDD energy). D.W.U. assisted in writing the paper, supervised the project and provided ideas for figures and analysis. All authors have read and made corrections to the manuscript.

This work is funded in part by grants from the Danish Center for Scientific Computing, NSF Research Grant DBI-0416764, The Danish Research Council grant 26-06-0349, and the EU EMBRACE network of Excellence, contract number LSHG-CT-2004-512092. We thank Mark Driscoll and Marcel Margulies from 454 Life Sciences for providing the data for C. jejuni and E. coli and Julian Parkhill at the Sanger Institute for providing the S. typhi sequencing data. We also thank Dr. Trudy Wassenaar and Dr. Lars Juhl Jensen for making suggestions to the manuscript.

References

1. Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci USA 2007; 104:13913-13918. PubMed doi:10.1073/pnas.0702636104

2. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C et al. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002; 319:1257-1265. PubMed doi:10.1016/S0022-2836(02)00379-0

3. Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform 2006; 7:225. PubMed doi:10.1093/bib/bbl004

4. Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol 1999; 150:773-777. PubMed doi:10.1016/S0923-2508(99)00116-3

5. Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol 2000; 299:907-930. PubMed doi:10.1006/jmbi.2000.3787

6. Hall N. Advanced sequencing technologies and their wider impact in microbiology. J Exp Biol 2007; 210:1518-1525. PubMed doi:10.1242/jeb.001370

7. Holt RA, Jones SJ. The new paradigm of flow cell 

sequencing. Genome Res 2008; 18:839-846. 

PubMed doi:10.1101/gr.073262.107 

8. Käller M, Lundeberg J, Ahmadian A. Arrayed 

identification of DNA signatures. Expert Rev Mol 

Diagn 2007; 7:65-76. PubMed 

doi:10.1586/14737159.7.1.65 

9. Gupta PK. Single-molecule DNA sequencing 

technologies for future genomics research. Trends 

Biotechnol 2008; 26:602-611. PubMed 

doi:10.1016/j.tibtech.2008.07.003 

10. Shendure J, Ji H. Next-generation DNA sequencing. 

Nat Biotechnol 2008; 26:1135-1145. 

PubMed doi:10.1038/nbt1486 

11. Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP et al. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res 2008; 18:1638-1642. PubMed doi:10.1101/gr.077776.108

12. Lin F, Schröder H, Schmidt B. Solving the Bottleneck 

Problem in Bioinformatics Computing: An 

Architectural Perspective. J VLSI Signal Process 

2007; 48:185-188. doi:10.1007/s11265-007- 

0088-z 

13. Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 2008; 9:R55. PubMed doi:10.1186/gb-2008-9-3-r55

14. Tolstrup N, Rouzé P, Brunak S. A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res 1997; 25:3159-3163. PubMed doi:10.1093/nar/25.15.3159

15. Hallin PF, Binnewies TT, Ussery DW. The genome 

BLASTatlas-a GeneWiz extension for visualization 

of whole-genome homology. Mol Biosyst 

2008; 4:363-371. PubMed 

doi:10.1039/b717118h 

16. Bolshoy A, McNamara P, Harrington RE, Trifonov 

EN. Curved DNA without A-A: experimental estimation 

of all 16 DNA wedge angles. Proc Natl 

Acad Sci USA 1991; 88:2312-2316. PubMed 

doi:10.1073/pnas.88.6.2312 

17. Brukner I, Sánchez R, Suck D, Pongor S. Sequence-dependent 

bending propensity of DNA as 

revealed by DNase I: parameters for trinucleotides. 

EMBO J 1995; 14:1812-1818. PubMed 

18. van Noort V, Worning P, Ussery DW, Rosche 

WA, Sinden RR. Strand misalignments lead to quasipalindrome 

correction. Trends Genet 2003; 

19:365-369. PubMed doi:10.1016/S0168- 

9525(03)00136-7 

19. Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin 

VB. DNA sequence-dependent deformability deduced 

from protein-DNA crystal complexes. Proc 

Natl Acad Sci USA 1998; 95:11163-11168. 

PubMed doi:10.1073/pnas.95.19.11163 

20. Ornstein RL, Rein R, Breen DL, MacElroy RD. An 

optimized potential function for the calculation of 

nucleic acid interaction energies. I- Base stacking. 

Biopolymers 1978; 17:2341-2360. 

doi:10.1002/bip.1978.360171005 

21. Satchwell SC, Drew HR, Travers AA. Sequence 

periodicities in chicken nucleosome core DNA. J 

Mol Biol 1986; 191:659-675. PubMed 

doi:10.1016/0022-2836(86)90452-3 

22. Ussery D, Soumpasis DM, Brunak S, Staerfeldt 

HH, Worning P, Krogh A. Bias of purine stretches 

in sequenced chromosomes. Comput Chem 

2002; 26:531-541. PubMed doi:10.1016/S0097- 

8485(02)00013-X 

23. Wang H, Benham CJ. Superhelical destabilization 

in regulatory regions of stress response genes. 

PLOS Comput Biol 2008; 4:e17. PubMed 

doi:10.1371/journal.pcbi.0040017 

24. Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrison P, Gerstein M. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 2007; 35:D55-D60. PubMed doi:10.1093/nar/gkl851

25. Hallin PF, Ussery DW. CBS Genome Atlas Database: 

a dynamic storage for bioinformatic results 

and sequence data. Bioinformatics 2004; 

20:3682-3686. PubMed 

doi:10.1093/bioinformatics/bth423 

26. Blattner FR, Plunkett G, Bloch CA, Perna NT, 

Burland V, Riley M, Collado-Vides J, Glasner JD, 

Rode CK, Mayhew GF et al. The complete genome 

sequence of Escherichia coli K-12. Science 

1997; 277:1453-1462. PubMed 

doi:10.1126/science.277.5331.1453 

27. Parkhill J, Wren BW, Mungall K, Ketley JM, 

Churcher C, Basham D, Chillingworth T, Davies 

RM, Feltwell T, Holroyd S et al. The genome sequence 

of the food-borne pathogen Campylobacter 

jejuni reveals hypervariable sequences. Nature 

2000; 403:665-668. PubMed 

doi:10.1038/35001088 

28. Deng W, Liou SR, Plunkett G, Mayhew GF, Rose 

DJ, Burland V, Kodoyianni V, Schwartz DC, 

Blattner FR. Comparative genomics of Salmonella 

enterica serovar Typhi strains Ty2 and CT18. J 

Bacteriol 2003; 185:2330-2337. PubMed 

doi:10.1128/JB.185.7.2330-2337.2003 

29. Brett PJ, DeShazer D, Woods DE. Burkholderia 

thailandensis sp. nov., a Burkholderia pseudomallei-like 

species. Int J Syst Bacteriol 1998; 48:317-320. PubMed

30. Smith MD, Angus BJ, Wuthiekanun V, White NJ. 

Arabinose assimilation defines a nonvirulent biotype 

of Burkholderia pseudomallei. Infect Immun 

1997; 65:4319-4321. PubMed 

31. Ong C, Ooi CH, Wang D, Chong H, Ng KC, Rodrigues 

F, Lee MA, Tan P. Patterns of large-scale 

genomic variation in virulent and avirulent Burkholderia 

species. Genome Res 2004; 14:2295-2307. PubMed doi:10.1101/gr.1608904

32. Hirvonen CA, Ross W, Wozniak CE, Marasco E, 

Anthony JR, Aiyar SE, Newburn VH, Gourse RL. 

Contributions of UP elements and the transcription 

factor FIS to expression from the seven rrn P1 

promoters in Escherichia coli. J Bacteriol 2001; 

183:6305-6314. PubMed 

doi:10.1128/JB.183.21.6305-6314.2001 

33. Ross W, Salomon J, Holmes WM, Gourse RL. 

Activation of Escherichia coli leuV transcription 

by FIS. J Bacteriol 1999; 181:3864-3868. PubMed 


34. Wang H, Noordewier M, Benham CJ. Stress-induced

DNA duplex destabilization (SIDD) in 

the E. coli genome: SIDD sites are closely associated 

with promoters. Genome Res 2004; 

14:1575-1584. PubMed doi:10.1101/gr.2080004 

35. Bauer BF, Kar EG, Elford RM, Holmes WM. Sequence 

determinants for promoter strength in the 

leuV operon of Escherichia coli. Gene 1988; 

63:123-134. PubMed doi:10.1016/0378-1119(88)90551-3

36. Carver T, Thomson N, Bleasby A, Berriman M, 

Parkhill J. DNAPlotter: circular and linear interactive 

genome visualization. Bioinformatics 2009; 

25:119-120. PubMed 

doi:10.1093/bioinformatics/btn578 


37. Arakawa K, Tamaki S, Kono N, Kido N, Ikegami 

K, Ogawa R, Tomita M. Genome Projector: 

zoomable genome map with multiple views. BMC 

Bioinformatics 2009; 10:31. PubMed 

doi:10.1186/1471-2105-10-31 

38. Pritchard L, White JA, Birch PR, Toth IK. GenomeDiagram: 

a python package for the visualization 

of large-scale genomic data. Bioinformatics 

2006; 22:616-617. PubMed 

doi:10.1093/bioinformatics/btk021 

39. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne 

R, Horsman D, Jones SJ, Marra MA. Circos: 

an information aesthetic for comparative genomics. 

Genome Res 2009; 19:1639-1645. PubMed 

doi:10.1101/gr.092759.109 



Chapter 4 

Web Services and Interoperability in Genomics 


This chapter describes work done in connection with the EU project EMBRACE. The deliverables defined for CBS carried both outreach obligations and implementation tasks, namely providing tools and databases through Web Services. This author’s contributions reflect this duality: they comprise developing the server infrastructure for hosting Web Services as well as teaching the use and design concepts of Web Services on several occasions (see appendix A.1). CBS is now using this work to integrate all major prediction servers under the same Web Services umbrella. There are currently 17 services offered using this technology (see footnote 1). The work on Web Services has laid the foundation for creating an online resource like BLASTatlas (paper I). Further, the RNAmmer tool (paper VI) is offered both as a traditional web interface and through Web Services, and these implementations demonstrate the usefulness of programmatic access to tools.


Over the past decade, the internet has revolutionized the way information is exchanged in modern society. Bank transactions, digital road maps and satellite images, email, news articles, and social networks are now hard to imagine without a digitally connected world. Biological and bioinformatic information is no exception, as it relies on the internet for the transport of sequence data, experimental results, scientific articles etc. Both the amount and the complexity of biological information increase day by day. As new experimental techniques become available, new types of data, as well as new ways of combining them, are introduced. For decades, biological information has been exchanged over the internet in the form of human-readable HTML documents (HyperText Markup Language) or flat files residing on FTP servers (File Transfer Protocol). HTML was designed to host static information presented by a server to a human being using a browser. Today, computers are required to digest huge amounts of information with less human involvement, and more advanced technologies are needed. To successfully integrate the vast amounts of data provided by the life science community, interoperability remains a key issue. It may seem unrealistic to reach a point where every biologist and bioinformatician has the world’s biological databases and tools available through programmatic access from their favorite programming language. However, with the current technologies in Web Services, an interoperable life science community may not be far away. When connected, the communities will be able to exchange not only data but also services such as tools for predicting protein function, performing sequence alignments, or gene finding.

1 BLASTatlas, EasyGene, EPipe, GeneWiz, GenomeAtlas, hERG, MaxAlign, NetChop, NetCTL, NetGlycate, NetNGlyc, NetOGlyc, NetPhos, RNAmmer, SIDDbase, SignalP, and TMHMM

Figure 4.1: Screen shot of NCBI Entrez Genome projects web page

4.2 Interoperability 

The term ’interoperability’ is defined as the ability of two or more systems to exchange and make use of information (IEEE, http://www.ieee.org). Whether systems can be said to be ’interoperable’ depends on how one interprets ’make use of’. Consider the list of complete prokaryotic genome sequences maintained by NCBI at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, shown in figure 4.1.

To retrieve this list automatically, one may write a parser that transforms the HTML into computer-readable text. Apart from being overly sensitive to changes in the HTML document, such a parser lacks knowledge of the meaning of the data, since the format is neither typed nor structured. It is only when interpreted by an internet browser and presented graphically to a human that this information makes any sense. In other words, both sender and receiver must have knowledge about the information that is exchanged before they can be said to be interoperable. There are two aspects of interoperability. First, there must be agreement on the format in which data is exchanged. Whether this is structured XML or any arbitrary format, the server must return the format expected by the client upon a request. Second, a description and understanding of the content of the data being exchanged is required when building client-side code and objects in Web Services. Without knowledge of the exact data types, the programming environment (e.g. C, Java, Perl) cannot declare the objects with proper variable types.
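The sensitivity of an HTML parser to layout changes can be illustrated with a small Python sketch. It is not taken from the CBS code base: the table layout and the XML element names are hypothetical, and only the organism name and accession number are borrowed from the Genome Atlas example used in this chapter. A positional parser silently returns swapped fields when two columns change place, whereas a parser that addresses fields by element name is unaffected by their order.

```python
import xml.etree.ElementTree as ET

ORGANISM = "Campylobacter jejuni subsp. jejuni NCTC 11168"
ACCESSION = "AL111168"

# Toy HTML table row (hypothetical layout), before and after the page
# maintainer swaps the two columns.
HTML_ROW = f"<tr><td>{ORGANISM}</td><td>{ACCESSION}</td></tr>"
HTML_ROW_SWAPPED = f"<tr><td>{ACCESSION}</td><td>{ORGANISM}</td></tr>"

# The same record as structured XML (hypothetical element names), again
# with the two fields in either order.
XML_RECORD = (
    f"<genome><organism>{ORGANISM}</organism>"
    f"<accession>{ACCESSION}</accession></genome>"
)
XML_RECORD_SWAPPED = (
    f"<genome><accession>{ACCESSION}</accession>"
    f"<organism>{ORGANISM}</organism></genome>"
)


def scrape_row(html_row):
    """Positional parser: assumes cell 0 is the organism, cell 1 the accession."""
    cells = [cell.text for cell in ET.fromstring(html_row).findall("td")]
    return {"organism": cells[0], "accession": cells[1]}


def read_record(xml_record):
    """Named parser: looks each field up by element name, whatever the order."""
    root = ET.fromstring(xml_record)
    return {
        "organism": root.findtext("organism"),
        "accession": root.findtext("accession"),
    }


expected = {"organism": ORGANISM, "accession": ACCESSION}
print("positional, original layout:", scrape_row(HTML_ROW) == expected)
print("positional, swapped layout: ", scrape_row(HTML_ROW_SWAPPED) == expected)
print("named, swapped order:       ", read_record(XML_RECORD_SWAPPED) == expected)
```

Addressing fields by name covers only the first aspect of interoperability (an agreed format); declaring the type of each field is the job of a schema, as discussed for XSD in section 4.2.1.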



Listing 4.1: Abbreviated input to the queryGenomes operation of the Genome Atlas Database 3.0 web service (XML markup not preserved; surviving element values: AL111168, yes)

4.2.1 SOAP based Web Services 

SOAP (originally an acronym for Simple Object Access Protocol; the expansion was dropped with version 1.2) is a widely agreed-upon protocol for exchanging information in structured XML messages (eXtensible Markup Language). Version 1.2 became a W3C (World Wide Web Consortium) recommendation in 2003. The protocol describes the messaging format between a client and a server, and the messages are in most cases transported over HTTP. Listings 4.1 and 4.2 show an example request and response from the CBS Genome Atlas Database 3.0 Web Service, using the operation queryGenomes to query the database for a GenBank accession number.

A SOAP message is an XML structure consisting of a SOAP envelope, which in turn consists of a header (not included here) and a body. The CBS services use a special envelope style called ’wrapped’, meaning that the content of both request and response is wrapped in an element named according to the operation issued (here queryGenomes). This enables the server to easily dispatch the message to the proper internal code. The SOAP protocol forms the basic language for exchanging messages over HTTP, but it neither describes the structure of the messages exchanged by a given resource nor explains its functionality. The WSDL (Web Services Description Language) file closes this gap by defining the information that enables a user or computer to communicate with the resource. The WSDL declares all the operations supported by a resource and the composition of the XML structures allowed by the operations. Finally, the WSDL defines the endpoint URL to which the SOAP request message is submitted. The essential part of the WSDL is the description of the XML structures, formulated in the XSD language (XML Schema Definition). The schema for the request of the queryGenomes operation is shown in listing 4.3. Figure 4.2 shows a schematic drawing of a SOAP resource.
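The wrapped style and the dispatching it enables can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CBS implementation: the SOAP 1.1 envelope namespace is assumed, the service namespace and the parameter name accession are hypothetical, and the handler merely echoes the parameters it receives.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 (assumed)
SERVICE_NS = "http://example.org/ws/GenomeAtlas"        # hypothetical namespace


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.split("}", 1)[-1]


def build_request(operation, parameters):
    """Wrap the parameters in an element named after the operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    wrapper = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in parameters.items():
        ET.SubElement(wrapper, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


def handle_query_genomes(wrapper):
    """Hypothetical handler: returns the parameters found in the wrapper."""
    return {local_name(child.tag): child.text for child in wrapper}


HANDLERS = {"queryGenomes": handle_query_genomes}


def dispatch(soap_message):
    """Pick the handler from the name of the first element in the body."""
    body = ET.fromstring(soap_message).find(f"{{{SOAP_ENV}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("SOAP body is missing or empty")
    wrapper = body[0]
    operation = local_name(wrapper.tag)
    if operation not in HANDLERS:
        raise KeyError(f"unsupported operation: {operation}")
    return operation, HANDLERS[operation](wrapper)


request = build_request("queryGenomes", {"accession": "AL111168"})
print(request)
print(dispatch(request))
```

Because the operation name is the first element inside the body, the server can choose a handler before it has interpreted any of the parameters.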

4.3 EMBRACE: An EU initiative to enhance interoperability

The EMBRACE Network of Excellence is a project funded by the European Commission under the Sixth Framework Programme (FP6). One aim of the EMBRACE project was to integrate the major tools and databases within the life science communities. A technology recommendation workgroup within EMBRACE investigated which current technologies could form the basis of the integration, and it recommended SOAP based Web Services described by WSDL files where data structures are typed using the XSD format.

Listing 4.2: Abbreviated output from the queryGenomes operation of the Genome Atlas Database 3.0 web service (XML markup not preserved; surviving element values: Bacteria, Epsilonproteobacteria, 8, Campylobacter jejuni subsp. jejuni NCTC 11168, AL111168, NC_002163, Chromosome)

Listing 4.3: XSD entry of the queryGenomes request message (XML markup not preserved)

Figure 4.2: Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose the outgoing request and parse the incoming server response. (Diagram labels: client user / computer; SOAP client; SOAP request and response; HTTP server with endpoint, WSDL, and schemas; WSDL and schema files downloaded by client in XML.)
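How a client obtains the operations and the endpoint URL from a WSDL file can be sketched as follows. The WSDL document below is a stripped-down, hypothetical fragment (it omits messages, types, and bindings and is therefore not a complete WSDL), and the endpoint URL is a placeholder; only the WSDL 1.1 namespaces and element names are standard.

```python
import xml.etree.ElementTree as ET

# Standard WSDL 1.1 and SOAP binding namespaces.
NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}

# Hypothetical, heavily abbreviated WSDL used only for illustration.
WSDL_TEXT = """<?xml version="1.0"?>
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
                  name="ExampleService">
  <wsdl:portType name="ExamplePortType">
    <wsdl:operation name="queryGenomes"/>
  </wsdl:portType>
  <wsdl:service name="ExampleService">
    <wsdl:port name="ExamplePort" binding="ExampleBinding">
      <soap:address location="http://example.org/ws/endpoint"/>
    </wsdl:port>
  </wsdl:service>
</wsdl:definitions>
"""


def list_operations(wsdl_text):
    """Return the names of all operations declared in the port types."""
    root = ET.fromstring(wsdl_text)
    return [op.get("name") for op in root.findall("wsdl:portType/wsdl:operation", NS)]


def endpoint_url(wsdl_text):
    """Return the URL to which SOAP requests should be submitted."""
    root = ET.fromstring(wsdl_text)
    address = root.find("wsdl:service/wsdl:port/soap:address", NS)
    if address is None:
        raise ValueError("no soap:address element found")
    return address.get("location")


print(list_operations(WSDL_TEXT))
print(endpoint_url(WSDL_TEXT))
```

A full SOAP client would additionally read the XSD schemas referenced by the WSDL in order to build typed request objects; that step is left out here.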

4.3.1 Quasi - a light-weight SOAP server 

One of the main obstacles for many SOAP servers and clients is the computational overhead and memory consumption involved in parsing large and complex XML structures. For the BLASTatlas service, this was a limiting factor. With the conventional Perl package SOAP::Lite, preparing a message for submission required more memory than is available in a modern desktop computer and took around 20 minutes. Once the message was submitted, the server incurred the same overhead to parse the incoming XML. The XML::Compile package for Perl proved superior as a client framework. On the server side, however, there was a demand for speed, flexibility, and custom adjustment, which led to the development of a light-weight SOAP server called ’quasi’ (’QUite A Soap Implementation’ or ’QUAsi Soap Implementation’). Apart from speed, it has further advantages:

• The server can be launched both remotely and locally. The latter allows quick and easy testing of services by reading a SOAP message from STDIN

• The XML parsing method (e.g. XML::Simple or XML::Twig) may be chosen independently for each operation, and parsing may even be postponed until after the job has been placed in the queue and the job id has been returned. This is an advantage for very large messages

• Control over the code stack enables much faster implementation of custom functionality.
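The idea of postponing the XML parsing until after the job has been queued can be sketched in Python as shown below. This is not the quasi source code (quasi is written in Perl); the in-memory queue, the job id format, and the function names are assumptions made for illustration. The message is read from any file-like object, so passing sys.stdin would mimic local testing from STDIN.

```python
import io
import uuid
import xml.etree.ElementTree as ET
from collections import deque

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 (assumed)
JOB_QUEUE = deque()  # stand-in for a real job queue


def accept_message(stream):
    """Read a raw SOAP message, queue it unparsed, and return a job id."""
    raw = stream.read()
    job_id = uuid.uuid4().hex
    JOB_QUEUE.append((job_id, raw))
    return job_id


def process_next_job():
    """Parse the oldest queued message only now and report its operation."""
    job_id, raw = JOB_QUEUE.popleft()
    body = ET.fromstring(raw).find(f"{{{SOAP_ENV}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("SOAP body is missing or empty")
    operation = body[0].tag.split("}", 1)[-1]
    return job_id, operation


# Hypothetical request; the parameter name 'accession' is an assumption.
MESSAGE = (
    f'<soap:Envelope xmlns:soap="{SOAP_ENV}"><soap:Body>'
    "<queryGenomes><accession>AL111168</accession></queryGenomes>"
    "</soap:Body></soap:Envelope>"
)

job_id = accept_message(io.StringIO(MESSAGE))  # use sys.stdin for local testing
print("queued job:", job_id)
print("processed:", process_next_job())
```

The client receives its job id as soon as the bytes have been stored, and the cost of parsing a very large message is paid later by the queue worker.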

4.3.2 quasi mktemp - From template to Web Service 

To make implementation easier still, a template creator was written which reads an example Web Service from a standard CBS template. The user provides the name and version of the service, and the tool prepares an entire installation of the service on the servers. The template creator provides the following:

• WSDL and XSD files created automatically for the name and version of the service and placed in the proper location of the file system

• An example directory with a working Perl example using the service

• Built-in templates for both synchronous and asynchronous access

• The proper entry in the central services database table

• A web page, available once the template creator has run, describing the service and providing links to the WSDL and XSD files as well as WSDL-embedded documentation

When designing Web Services, it is not a trivial task to keep track of namespaces, declarations of input/output objects, operation names etc. The feedback received so far for this tool indicates that functioning examples clearly reduce the chance of mistakes. The manual for the software is found in appendix D.6.
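The principle of generating service files from a template, given only a name and a version, can be sketched as follows. The template text, the namespace pattern, and the operation name run are hypothetical and far simpler than a real WSDL; the actual CBS template is described in appendix D.6 and is not reproduced here.

```python
import re
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, minimal WSDL skeleton; not the CBS template.
WSDL_TEMPLATE = Template("""<?xml version="1.0"?>
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  name="${name}"
                  targetNamespace="http://example.org/ws/${name}/${version}">
  <wsdl:portType name="${name}PortType">
    <wsdl:operation name="run"/>
  </wsdl:portType>
</wsdl:definitions>
""")


def render_wsdl(name, version):
    """Fill in service name and version, rejecting values unsafe for XML names."""
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", name):
        raise ValueError(f"invalid service name: {name!r}")
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        raise ValueError(f"invalid version: {version!r}")
    return WSDL_TEMPLATE.substitute(name=name, version=version)


wsdl = render_wsdl("ExampleService", "1.0")
root = ET.fromstring(wsdl)  # parsing confirms the result is well-formed XML
print(root.get("name"), root.get("targetNamespace"))
```

Deriving the namespace and the port type name from the same two inputs keeps them consistent, which is the kind of bookkeeping that is easy to get wrong by hand.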



4.4 ENCODE pipeline: applying Web Services 

ENCODE (the Encyclopedia Of DNA Elements) was launched in September 2003 by the National Human Genome Research Institute. The goal was to identify all functional elements in the human genome sequence. In the pilot phase, 1 percent (30 Mb) of the human genome, drawn from 44 selected regions, was analysed by ENCODE consortium researchers (Birney et al., 2007).

GENCODE is a sub-project of ENCODE, which seeks to identify all protein-coding 

genes in the ENCODE selected regions. For each protein coding gene this means the 

delineation of a complete mRNA sequence for at least one splice isoform, and often for 

a number of additional alternative splice forms. The contributions from the BioSapiens 

partners are focused on information from a protein annotation perspective. Special attention 

is given to the potential aspect of alternative splicing and the putative effect it has 

on functional diversification of genes. 

In the pilot phase of the BioSapiens project, the properties of the coding sequences for the 44 regions were analyzed separately by the BioSapiens partners. The results from the individual groups were collected and the main findings were published (Tress et al., 2007). Furthermore, the entire collection of annotations created by all partners was made available as supplementary material for the publication.

In the current phase of the BioSapiens project, the goal is to scale up the annotation approach applied to the pilot ENCODE sequences to cover 100% of the human genome, including all isoforms. For this scale-up, the ENCODE Pipeline (EPipe) was constructed (a BioSapiens deliverable). EPipe is a WWW service that allows researchers to compare functional annotations for all splice variants of a given gene in an automatic way, or alternatively to analyse mutated sequence variants containing SNPs. This author has been responsible for the development of the main parts of the EPipe software as well as for implementing a large part of the modules (feature predictors). The EPipe project is an ongoing effort which has involved a number of people during its development.

4.4.1 Collecting Web Services clients in EPipe 

EPipe uses a number of local and remote resources for protein feature prediction. The ability of EPipe to connect to remote resources via Web Services is incorporated within the individual modules. This gives a great deal of flexibility as to which resources to support (e.g. BioMoby, SOAP etc). The pipeline is shown in figure 4.3.

EPipe itself is offered both as a SOAP web service (http://www.cbs.dtu.dk/ws/EPipe) and as a traditional web interface (http://www.cbs.dtu.dk/services/EPipe). The input web page of EPipe is shown in figure 4.4.
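Comparing positional features between sequence variants requires that feature coordinates are expressed in alignment columns rather than in residue positions (the step labelled "Map feature coordinates to alignment" in figure 4.3). The sketch below shows one simple way to do this in Python; it is an assumption about the general technique, not the EPipe code, and the aligned sequence is a made-up toy example.

```python
def residue_to_column(aligned_sequence, gap="-"):
    """Return a list whose entry i is the alignment column of residue i (0-based)."""
    return [col for col, char in enumerate(aligned_sequence) if char != gap]


def remap_feature(start, end, aligned_sequence):
    """Map an inclusive, 0-based residue range onto alignment columns."""
    columns = residue_to_column(aligned_sequence)
    if not 0 <= start <= end < len(columns):
        raise IndexError("feature lies outside the sequence")
    return columns[start], columns[end]


# Toy aligned sequence with a two-column gap after the second residue.
ALIGNED = "MK--TLV"

# A feature covering residues 1..3 (K, T, L) of the ungapped sequence.
print(remap_feature(1, 3, ALIGNED))
```

Once the features of all variants share the alignment coordinate system, positions where the variants differ in their predicted features can be found by a column-wise comparison.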

4.4.2 Mapping Pfam annotations to protein structure: mecA 

In Staphylococcus aureus the mecA gene encodes a penicillin-binding protein (PBP2a), which confers methicillin resistance (Ender et al., 2009). The EPipe software can be used to map a range of relevant features onto the protein structure in order to visualize differences between homologs of this protein. In this example, however, a single MecR1 protein from Staphylococcus aureus strain A5937, GenBank accession no. EEV85461, is processed. Figure 4.5 shows the structure browser of EPipe, which allows the user to browse the predicted features by showing their mapping onto the protein structure. Here, the three Pfam domains Transpeptidase, MecA_N, and PBP_dimer appear as significant hits.


Figure 4.3: Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that as much as possible is dispatched in parallel. Modules may either be alignment dependent or not. If the alignment is required to predict the protein features, the module is not launched until the alignment algorithm has finished. Modules may either return global features of the entire protein (e.g. cellular localization), or return positional features (e.g. phosphorylation sites). (Diagram labels: input sequences; cache filters; alignment; BLAST against PDB individually; modules I, II, III, IV, and X returning positional, non-positional, or alignment-dependent features; map feature coordinates to alignment; map features onto best structure; XML of all results; render images in parallel and present to output pages. Outputs: table of non-positional features; conclusion table; plot of alignment and positions having different feature configuration; plot of alignment and features with remapped coordinates; similarity in feature space.)
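The scheduling rule described in the caption of figure 4.3 (start independent modules at once, hold back alignment-dependent modules until the alignment is ready) can be sketched with Python's standard thread pool. The alignment function and the two modules are trivial placeholders invented for this illustration; they do not correspond to actual EPipe modules.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sequences):
    """Placeholder alignment: right-pads sequences with gaps to equal length."""
    width = max(len(seq) for seq in sequences)
    return [seq.ljust(width, "-") for seq in sequences]


def module_length(sequences):
    """Alignment-independent module returning one global value per sequence."""
    return [len(seq) for seq in sequences]


def module_gap_columns(alignment):
    """Alignment-dependent module returning positional values per sequence."""
    return [[i for i, char in enumerate(row) if char == "-"] for row in alignment]


def run_pipeline(sequences):
    with ThreadPoolExecutor() as pool:
        # The alignment and the independent modules start immediately.
        alignment_future = pool.submit(align, sequences)
        futures = {"length": pool.submit(module_length, sequences)}
        # Dependent modules are submitted only once the alignment is available.
        alignment = alignment_future.result()
        futures["gap_columns"] = pool.submit(module_gap_columns, alignment)
        return {name: future.result() for name, future in futures.items()}


print(run_pipeline(["MKT", "MKTLV"]))
```

With modules that call remote Web Services, most of the run time is spent waiting for responses, which is why dispatching them concurrently pays off.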



Figure 4.4: The input web page of EPipe: Upper part defines sequence upload and alignment 

method, and lower part selects which modules / methods to run. When applicable, gene ontologies 

have been added to each feature and feature values (light green boxes). 


Figure 4.5: The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim & Strynadka, 2002). Top panel shows the EPipe structure browser, which allows rotation in steps of 90 degrees. Lower panel shows a post-processed rendering of the PyMOL script generated by EPipe.


Chapter 5 

Conclusion and perspectives 


This thesis has presented a number of comparative genomics tools that have been used throughout different research projects and peer-reviewed publications. The aim has been to provide methods that enable scientists to keep up with the increasing speed at which genome sequences are published. Visualization plays a key role, and finding better ways to present sequence information in a condensed and intuitive way is essential for deriving knowledge from the large number of bacterial strains being sequenced.

Information content has previously been used to quantify conservation of DNA motifs, and a recent extension of this information framework has made it possible to model complete promoters such as the P1/P2 system described in this work. The models shown here are to a large extent specific to E. coli P1/P2 sites. However, the design of the matrix and spacing configuration format of the iscan tool allows a much broader application. The tool may be used to test different hypotheses about promoter configurations across a broader range of organisms by estimating promoter conservation as a single comparable measure. Work remains to be done to implement benchmarking and to examine other promoter systems.

Since the start of the Human Genome Project (HGP) in 1990, there have been large investments to develop and improve sequencing technology. The present stage, where a bacterial genome can be sequenced for a few thousand dollars within a few hours, is the result of years of competition and investment in genome projects. There are no signs that progress in sequencing technology will stop here. Sequencing single DNA molecules in real time has long been an ultimate goal within genomics and DNA sequencing. It has been demonstrated that a DNA synthesis reaction can be monitored in real time by immobilizing a DNA polymerase within a small (20 zeptoliter) well (Eid et al., 2009). If the technology reaches a final product, it may well start a new era in comparative genomics. Once it is possible to obtain a genome sequence at the same rate as DNA replication itself, and at superior read lengths, sophisticated software must be implemented for the downstream processing. The technology can give a boost to the quality of metagenomic sequencing and solve the current issues of proper assembly of these data sets.

The BLASTatlas tool presented in this thesis incorporates a number of software components to calculate different DNA properties as well as scripts for mapping sequence alignments to a reference genome. The number of dependencies makes it difficult to package the software and install it on other computer systems. For sharing such complex tools among scientists, Web Services play an important role, and it has been demonstrated how analysis and visualization methods can be offered using this technology. At first glance, traditional web interfaces seem more user-friendly. However, implementing interoperable methods such as BLASTatlas forces a process in which the communication is formalized and defined in every detail. This allows direct integration into the user’s programming environment, which scales significantly better. Making one or two comparisons using a web interface will in most cases be faster than using the Web Services counterpart. The true advantages are achieved when analyses are repeated, possibly hundreds of times, and when linking input/output between different remote resources. Integration of biological data using SOAP based Web Services is gaining acceptance. When the technology has matured, it will undoubtedly enhance the way biological information is exploited by allowing seamless flow between, for example, public sequence databases, repositories of experimental data, and bioinformatic prediction servers.


Appendix A 

Appendix: Workshops, teaching, and conferences 


A.1 Lectures and Presentations 

A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences

Taught autumn 2008 by Prof. David Ussery, this course featured weekly computer exercises throughout the semester and projects requiring computer work. I planned and supervised the exercises and assisted the students with their project work. See also: http://www.cbs.dtu.dk/dtucourse/genomics27101.php

A.1.2 Comparative Microbial Genomics Workshop 

Held June 2nd - 6th 2008, Bangkok, Thailand. I assisted in planning the workshop, lectured on rRNA operon structure, web services, and genome visualization methods, and was responsible for computer exercises. Web page: http://www.cbs.dtu.dk/courses/thaiworkshop08/programme.php

A.1.3 Comparative Microbial Genomics and Taxonomy 

Held August 14th - 18th 2006, Petropolis, Brazil. I assisted in planning the workshop and was responsible for computer exercises. See also: http://www.cbs.dtu.dk/courses/brazilworkshop/programme.php

A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services 

Work package D5.2.X2. Held February 6th - 8th 2008, CBS. Responsible for computer exercises and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2008-02-06/

A.1.5 EMBRACE Workshop on Bioinformatics of Immunology 

Work package D5.2.6. Held January 24th - 26th 2007, CBS. Responsible for computer exercises and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2007-01-24/

A.1.6 EMBRACE 3rd AGM: Implementation of web services

Presentation held April 23rd 2007 at CNRS Institute of Biology and Chemistry of Proteins in Lyon, France.


A.1.7 EMBRACE Workshop on Perl, SQL and Web Services 

Scheduled for November 16th - 20th 2009. See also: http://www.cbs.dtu.dk/courses/embrace/2009-11-16/

A.2 Workshops and meetings 

A.2.1 EMBRACE Workshop: SOAP web services 

April 2006, Bergen, Norway. 

A.2.2 EUCOMM Bioinformatics Training Course 

February 2007, Hinxton, United Kingdom 

A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences 

March 2007, Uppsala, Sweden 

A.2.4 EMBRACE 3rd Annual General Meeting 

April 2007, Lyon, France 

A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation

May 2007, Geneva, Switzerland 

A.2.6 EMBRACE 4th Annual General Meeting 

April 2008, Heidelberg, Germany 

A.2.7 Technical discussion of EMBRACE registry 

June 2008, Amsterdam, Holland 

A.2.8 EMBRACE meeting: Discussion of standard data types 

January 2009, Bergen, Norway

A.3 Conferences 

A.3.1 Conference: Metagenomics, July 2007, San Diego U.S.A. 

Binnewies TT, Hallin PF, Sellami N, Ussery DW Prediction of Pathogenicity Networks in 

Bacterial Genomes 

A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington U.S.A.

Poster: Hallin PF and Binnewies TT. Gene organization of RNA genes and secretion 

system components of the Sargasso Sea environmental samples 


Appendix B 

Appendix: Ph.D. study plan 


Technical University of Denmark, AFI, PhD programme

September 2005

The study plan below has been accepted by the student and the supervisor

Principal supervisor's signature, extension no. Student's signature

PhD study plan

Name of PhD student: Peter Fischer Hallin

Cpr.-nr.: 160877 2053 

PhD programme: Bioinformatics

Department: BioCentrum

Start date: March 1 2006

End date: February 2009

Principal supervisor: Associate professor David W. Ussery

(Title, name, department, phone)

BioCentrum-DTU, Technical University of Denmark,

Building 301, DK-2800 Lyngby, Denmark

E-mail address: dave@cbs.dtu.dk

Phone (direct): (+45) 45 25 24 88

Co-supervisor: Guest Researcher Gertrude Maria Wassenaar

(Title, name, institution/company)

BioCentrum-DTU, Technical University of Denmark,

Building 301, DK-2800 Lyngby, Denmark

E-mail address: trudy@cbs.dtu.dk

Phone (direct): (+45) 45 25 24 77

Date: 18-11-2007

Title of the study: DNA Structural Analysis and Transcript Prediction in Prokaryotic

genomes 


Main topic of the study:

The goal of this project is to obtain better understanding about the structural 

mechanisms that are involved in the initiation of transcription of DNA in 

Prokaryotic genomes and to use this information to make better and consistent 

transcript predictions. We have presented a database (Hallin and Ussery 2004) 

which holds several kinds of information for each of the over 300 fully 

sequenced Prokaryotic genomes that are currently available. Different research 

groups have made efforts to gather sequence data and analysis of the fully 

sequenced microbial genomes that are being published. 

Currently we rely on the authors' annotation of genome sequences when 

comparative genomics are applied to our data sets. However, different authors 

use different tools, approaches and criteria during the annotation process. There 

are examples of genomes that are predicted to be 50-100% over annotated 

(Skovgaard et al. 2001). Once reliable and automated processes for predicting 

transcriptomes are established, comparative analysis can be applied on the entire 

collection of organisms. It is envisioned that the users of our website can 

interactively be able to browse any piece of DNA to look for structural properties 

and repeats. 

_________________ 

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A On the 

total number of genes and their length distribution in complete 

microbial genomes (2001) Trends Genet.17:425-8. 

Peter F. Hallin and David W. Ussery CBS Genome Atlas 

Database: A dynamic storage for bioinformatic results and 

sequence data (2004). Bioinformatics 20:3682-3686. 

(Describe here the content of the scientific part of the project, together with its aims and means. If the description exceeds one A4 page, give a short summary here with reference to the full description, which is enclosed as an appendix).

The external research stay

Professor Craig John Benham, University of California, Davis. Benham's research focuses on mathematical modelling of DNA destabilization and prediction of the opening of the DNA molecule during a transcription event. His strong mathematical approach is novel and would contribute significantly to our prediction methods, and could possibly help explain biological / experimental results. The idea is that Craig Benham's calculations will be integrated into the prediction algorithms that are a major topic of my project.

A 12-week internship in Craig Benham's lab is scheduled for October-December, to integrate SIDD predictions (Stress Induced DNA Duplex Destabilization) with CBS databases and to prepare 1-2 manuscripts on SIDD measures on a global prokaryotic scale.


(List here the research environments outside DTU where the PhD student is planned to stay. If specific agreements have been made, state this. For each stay, give the estimated time spent (e.g. in weeks), and state the total time spent on external stays).

Course component:

Courses at DTU

External courses

Courses credited by transfer in connection with enrolment:

Biological Sequence Analysis, PhD, 12 ECTS [OK]

27802 Metabolic Engineering and Systems Biology, PhD, 5 ECTS, F1A

27725 Global regulatory networks in microorganisms, MSc, 5 ECTS, F2B

27617 Protein structure and computational biology, MSc, 5 ECTS, F5A

27041 Introduction to Systems Biology, MSc, 5 ECTS, E3A

(For courses not listed in the study handbook, a description of the academic content must be enclosed. List here the expected course/educational activities of the study. For each part, give the estimated number of ECTS points, which in total must correspond to approx. 30 ECTS points. 30 ECTS points correspond to approx. 840 hours of work).

Formidlingsdelen ( inkl. 

pligtarbejde): 

I have spent a total of about a month preparing and assisting in computer exercises for the CBS course Comparative Microbial Genomics and Taxonomy (Petropolis, Brazil, Aug. 2006, http://www.cbs.dtu.dk/courses/brazilworkshop) and preparing and giving talks at several meetings:

Exercises in the course ”Biological Sequence Analysis” (CBS – DTU).

One-hour presentation, Modern computer tools for the biosciences (Uppsala, Sweden).

Presentation: EMBRACE workshop on bioinformatics of immunology (CBS – DTU).

Presentation: Web Services implementation on CBS, Third Annual General Meeting of EMBRACE (Lyon, France).

I plan to put in an additional month of work preparing and giving presentations and lectures for a one-week workshop to be held at CBS in February 2008: http://www.cbs.dtu.dk/courses/embrace/2008-02-06/programme.php. Lectures and exercises will be adjusted to cover promoter analysis using the EMBRACE technology. We intend to use graphical as well as statistical approaches to characterize promoter signatures of prokaryotic genomes. These are core topics of the thesis.

Poster presentation at Metagenomics 2007, San Diego: “Gene 

organization of RNA genes and secretion system components of 

the Sargasso Sea environmental samples” 

(The expected dissemination activities of the study and the imposed obligatory work are listed here. For each part the estimated time (e.g. in weeks) is given; in total they must correspond to 3 months.)

Time schedule:

1st half-year (March 06 – August 06)

Publication on the rRNA gene predictor (RNAmmer). Comparative Microbial Genomics workshop in Brazil. Meetings and work for CBS in connection with EMBRACE.

2nd half-year (September 06 – February 07)

Lactococcus microarray project with Chr. Hansen. Book chapter on comparative genomics (editor: Dawn Field). EMBRACE meetings and workshops.

3rd half-year (March 07 – August 07)

Follow-up article on RNAmmer and rRNA/tRNA operons.

4th half-year (September 07 – February 08)

(October-December) Internship with Craig Benham, Davis, California. Include work from Craig Benham's lab in the RNAmmer follow-up manuscript, and prepare a SIDDbase application note and an article on SIDD measures in prokaryotic promoter sequences.

Prepare manuscripts.

5th half-year (March 08 – August 08)

Course: Globale regulatoriske netværk i mikroorganismer (F2B)

Course: Protein structure and computational biology (F5A)

Course (1 week, May/June): 27802 Metabolic Engineering and Systems Biology

Thesis writing and manuscript preparation

6th half-year (September 08 – February 09)

Course: Introduction to Systems Biology

Thesis writing

(The time schedule should include dates or periods for all significant activities in connection with the PhD programme. It is important that the time schedule is complete. It may be attached as an appendix.)

Brief description of the form of supervision:

It may, for example, be agreed how often supervision takes place in the form of meetings or written feedback.

Patents/innovation: Is it likely that technologies or software that can be patented will be developed during the project?

If yes

Yes x No

Brief account of the methods used to train the PhD student in the innovation-related aspects

Other:

(Other matters of importance for the assessment of the study plan may be stated here.)

Appendix C 

Appendix: Courses 

C.1 Global regulatory networks in microorganisms 

DTU course 27725, ECTS 5, MSc level.

C.2 Protein Structure and Computational Biology

DTU course 27617, ECTS 5, MSc level.

C.3 Biological Sequence Analysis 

DTU course 27803, ECTS 12.5, PhD level. 

C.4 Comparative Genome Analysis 

Copenhagen University, Department of Biology, ECTS 5. 

C.5 Doctoral seminar on business economics for academic entrepreneurs

Aarhus School of Business, University of Aarhus, ECTS 3, PhD level.

C.6 ECTS summary

Total ECTS is 30.5, of which 15.5 at PhD level.

Appendix D 

Appendix: Software 

D.1 fetchgbk manual 

SYNOPSIS
    fetchgbk - downloads genbank/refseq records in GenBank format, specifying either accession numbers, accession ranges, or project id.

    fetchgbk (-h) (-p [PROJECT_ID]) (-a [ACCESSION/RANGE]) (-d [DATABASE])

DESCRIPTION
    When the project id is defined with the -p option, option -a is ignored and all accession numbers for all segments of that project are fetched from the project. When using the -p option, the -d option is in effect, allowing you to control which database to use (refseq/genbank).

    When using the -a option, the program retrieves only that accession (or range of accessions) and ignores the -d option. The program prints GenBank-format data to stdout. Option -l shows only a TAB-separated list of accession and segment name.

VERSION
    2008-08-15: version 1.0 created / pfh

    -p [number]
        The NCBI Genome Project number, such as those listed at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. This option overrules the -a option.

    -a [accession no. or accession number range]
        Instructs the program to download only this record (or these records, if a range is defined). The -d option is ignored.

    -d [genbank/refseq]
        Choice of database. Only has an effect when using option -p.

    -l
        Boolean, instructing the program not to show GenBank records but only to list segment names for each accession.

    -h
        Show this help page.

EXAMPLES
    fetchgbk -p 19391 -d refseq | grep LOCUS
    fetchgbk -p 19391 -d genbank | grep LOCUS
    fetchgbk -a NZ_ABIZ00000000 | grep LOCUS
    fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038 | grep LOCUS
    fetchgbk -a CP000896 | grep LOCUS
    fetchgbk -p 12997 -d refseq -l

AUTHOR
    Peter Fischer Hallin, August 2008, pfh@cbs.dtu.dk
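The manual says that -a accepts either a single accession or a range such as NZ_ABIH01000001-NZ_ABIH01000038, but it does not show how a range is turned into individual records. The following is a minimal Python sketch of one plausible expansion, assuming both endpoints share the same prefix and have numeric suffixes of equal width. The function name `expand_accession_range` is hypothetical and is not part of fetchgbk.

```python
import re

# Hypothetical helper, not part of fetchgbk: expands an accession range of the
# form PREFIX<digits>-PREFIX<digits> into the individual accession numbers.
_ACCESSION = re.compile(r"^(.*?)(\d+)$")


def expand_accession_range(spec):
    """Return the list of accessions covered by `spec`.

    A single accession (no hyphen) is returned as a one-element list.
    """
    parts = spec.split("-")
    if len(parts) == 1:
        return [spec]
    if len(parts) != 2:
        raise ValueError(f"cannot parse accession range: {spec!r}")

    first, last = (_ACCESSION.match(part) for part in parts)
    if first is None or last is None:
        raise ValueError(f"range endpoints must end in digits: {spec!r}")

    prefix, start_digits = first.groups()
    last_prefix, end_digits = last.groups()
    # Assumption: both endpoints belong to the same series of accessions.
    if prefix != last_prefix or len(start_digits) != len(end_digits):
        raise ValueError(f"range endpoints do not match: {spec!r}")

    start, end = int(start_digits), int(end_digits)
    if start > end:
        raise ValueError(f"range is descending: {spec!r}")

    width = len(start_digits)
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


contigs = expand_accession_range("NZ_ABIH01000001-NZ_ABIH01000038")
print(len(contigs), contigs[0], contigs[-1])
```

Zero-padding to the width of the first endpoint keeps the generated names in the same form as the endpoints given on the command line.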

D.2 Sample output from queryGenomes 

As output from listing 2.3. 


#kingdom	phyla	pid	organism	genbank	refseq	segment	color	ATCONTENT	NGENES
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178379	NC_011312	Chromosome 1	ffdd44	0.6077	3069
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178380	NC_011313	Chromosome 2	ffdd44	0.6176	1105
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178382	NC_011314	Plasmid pVSAL320	ffdd44	0.6271	32
pVSAL840	ffdd44	0.5993	72
pVAL43	ffdd44	0.6193
pVSAL43	ffdd44	0.6439	3
Bacteria	Deltaproteobacteria	9637	Bdellovibrio bacteriovorus HD100	BX842601	NC_005363	Chromosome	ffdd44	0.4935	3583
Bacteria	Gammaproteobacteria	28329	Cellvibrio japonicus Ueda107	CP000934	NC_010995
Bacteria	Bacteroidetes/Chlorobi	12607	Chlorobium phaeovibrioides DSM 265	CP000607	NC_009337	Chromosome	ffbb55	0.4701	1753
Bacteria	Deltaproteobacteria	29493	Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774	CP001358	NC_011883	Chromosome	ffdd44	0.4193	2356
Bacteria	Deltaproteobacteria	329	Desulfovibrio desulfuricans subsp. desulfuricans str. G20	CP000112	NC_007519	Chromosome	ffdd44	0.4216	3775
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010906	NC_012795	Plasmid pDMC2	ffdd44	0.6283	10
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010904	NC_012796
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010905	NC_012797	Plasmid pDMC1	ffdd44	0.4197	65
Bacteria	Deltaproteobacteria	29541	Desulfovibrio salexigens DSM 2638	CP001649	NC_012881
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000528	NC_008741	Plasmid pDVUL01	ffdd44	0.3431	150
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000527	NC_008751	Chromosome	ffdd44	0.3699	2941
Bacteria	Deltaproteobacteria	27731	Desulfovibrio vulgaris str. Miyazaki F	CP001197	NC_011769
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017285	NC_002937
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017286	NC_005863	Megaplasmid	ffdd44	0.3432	152
Bacteria	Other Bacteria	30733	Thermodesulfovibrio yellowstonii DSM 11347	CP001147	NC_011296	Chromosome	888888	0.6587	2033
Bacteria	Gammaproteobacteria	29177	Thioalkalivibrio sp. HL-EbGR7	CP001339	NC_011901
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001233	NC_012578	Chromosome I	ffdd44	0.5217	2650
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001234	NC_012580	Chromosome II	ffdd44	0.5296	1043
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001485	NC_012668	Chromosome 1	ffdd44	0.5248	2770
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001486	NC_012667	Chromosome 2	ffdd44	0.5325	1004
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003852	NC_002505	Chromosome I	ffdd44	0.523	2736
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003853	NC_002506	Chromosome II	ffdd44	0.5309	1092
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000626	NC_009456	Chromosome 1	ffdd44	0.5312	1133
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000627	NC_009457	Chromosome 2	ffdd44	0.5222	2742
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000020	NC_006840	Chromosome I	ffdd44	0.6104	2575
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000021	NC_006841	Chromosome II	ffdd44	0.6298	1172
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000022	NC_006842	Plasmid pES100	ffdd44	0.6158	55
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001133	NC_011186	Chromosome II	ffdd44	0.6275	1254
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001134	NC_011185	Plasmid pMJ100	ffdd44	0.652	195
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001139	NC_011184	Chromosome I	ffdd44	0.6112	2590
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000791	NC_009777	Plasmid pVIBHAR	ffdd44	0.5621	120
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000789	NC_009783	Chromosome I	ffdd44	0.5445	3570
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000790	NC_009784	Chromosome II	ffdd44	0.5473	2374
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000031	NC_004603	Chromosome I	ffdd44	0.5461	3080
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000032	NC_004605	Chromosome II	ffdd44	0.5465	1752
Bacteria	Gammaproteobacteria	32815	Vibrio splendidus LGP32	FM954973	NC_011744	Chromosome 2	ffdd44	0.5636	1486
Bacteria	Gammaproteobacteria	32815	Vibrio splendidus LGP32	FM954972	NC_011753	Chromosome 1	ffdd44	0.5596	2950
Bacteria	Gammaproteobacteria	349	Vibrio vulnificus CMCP6	AE016795	NC_004459	Chromosome I	ffdd44	0.5355	2973
Bacteria	Gammaproteobacteria	349	Vibrio vulnificus CMCP6	AE016796	NC_004460	Chromosome II	ffdd44	0.5288	1565
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	BA000037	NC_005139	Chromosome I	ffdd44	0.5359	3262
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	BA000038	NC_005140	Chromosome II	ffdd44	0.5279	1697
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	AP005352	NC_005128	Plasmid pYJ016	ffdd44	0.5507	69
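Because the queryGenomes output is TAB-separated with a `#`-prefixed header line, it is straightforward to post-process. The following is a minimal Python sketch, assuming the ten column names shown in the header above. The helper names `parse_query_genomes` and `genes_per_project` are hypothetical, and the three sample rows are copied from the listing.

```python
def parse_query_genomes(text):
    """Parse TAB-separated queryGenomes output into a list of dicts.

    The header line starts with '#'. Rows whose field count does not match
    the header are skipped rather than guessed at.
    """
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#"):
            header = [name.lstrip("#") for name in fields]
            continue
        if header is None or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        row["ATCONTENT"] = float(row["ATCONTENT"])
        row["NGENES"] = int(row["NGENES"])
        rows.append(row)
    return rows


def genes_per_project(rows):
    """Sum NGENES over all segments (chromosomes, plasmids) of each pid."""
    totals = {}
    for row in rows:
        totals[row["pid"]] = totals.get(row["pid"], 0) + row["NGENES"]
    return totals


sample = "\n".join([
    "#kingdom\tphyla\tpid\torganism\tgenbank\trefseq\tsegment\tcolor\tATCONTENT\tNGENES",
    "Bacteria\tGammaproteobacteria\t30703\tAliivibrio salmonicida LFI1238\tFM178379\tNC_011312\tChromosome 1\tffdd44\t0.6077\t3069",
    "Bacteria\tGammaproteobacteria\t30703\tAliivibrio salmonicida LFI1238\tFM178380\tNC_011313\tChromosome 2\tffdd44\t0.6176\t1105",
    "Bacteria\tGammaproteobacteria\t30703\tAliivibrio salmonicida LFI1238\tFM178382\tNC_011314\tPlasmid pVSAL320\tffdd44\t0.6271\t32",
])

rows = parse_query_genomes(sample)
print(genes_per_project(rows))
```

Splitting on TAB only (rather than on any whitespace) matters here, since organism and segment names contain spaces.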

D.3 BLASTatlas configurations 

D.3.1 file blast.cfg 

legend: B. ambifaria AMMD
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/13490.fsa

legend: B. ambifaria MC40-6
color: 101010_020002
range: 0.0,0.8

legend: B. cenocepacia AU 1054
color: 101010_080000
range: 0.0,0.8

legend: B. cenocepacia HI2424
color: 101010_080000
range: 0.0,0.8

legend: B. cenocepacia J2315
color: 101010_080000
range: 0.0,0.8
source: files/339.fsa

legend: B. cenocepacia MC0-3
color: 101010_080000
range: 0.0,0.8

legend: B. glumae BGR1
color: 101010_050505
range: 0.0,0.8

......

D.3.2 file custom.cfg 

legend: SIDD @ -0.035
color: 000010_101010
range: 9:10
boxfilter: 5000
source: gunzip -c BX571966-57a2f2c2e11ca0dd8cd74493d667d4d6-3173005.sidd--0.035-c-10-c.out.gz | cut -f 4 |
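Both configuration files above use the same layout: one `key: value` pair per line, with a blank line between lanes. The following is a minimal Python sketch of a reader for that layout. It assumes that blank lines are the only lane separator and that a value may itself contain colons (as in `range: 9:10`). The function name `parse_atlas_config` is hypothetical and is not part of BLASTatlas.

```python
def parse_atlas_config(text):
    """Split a blast.cfg/custom.cfg style text into one dict per lane."""
    lanes = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the lane collected so far, if any.
            if current:
                lanes.append(current)
                current = {}
            continue
        # Split on the first colon only, so 'range: 9:10' keeps its value.
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a key/value line, e.g. an elision marker
        current[key.strip()] = value.strip()
    if current:
        lanes.append(current)
    return lanes


example = """
legend: B. ambifaria AMMD
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/13490.fsa

legend: SIDD @ -0.035
color: 000010_101010
range: 9:10
boxfilter: 5000
"""

for lane in parse_atlas_config(example):
    print(lane["legend"], lane["range"])
```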

D.4 BLASTmatrix example 

This Perl script constructs an XML configuration file by looking up the Genome Atlas 

Database through MySQL. It queries for all Campylobacter strains currently available. 

#!/usr/bin/perl
use strict;

my $SACO_EXTRACT = "/usr/cbs/bio/bin/linux64/saco_extract";
my %colors = (lari => '0,104,139', jejuni => '0,139,69', hominis => '66,66,111', fetus => '139,101,8', curvus => '140,23,23', concisus => '205,173,0');

my $sources = ""; # holds the sources part of the configuration - replace into DATA section

# one row per genome project: project id and organism name
open ORGANISM, "mysql -N -B -e \"select pid,organism_name from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name\" |" or die $!;
while (<ORGANISM>) {
  chomp;
  my ($pid, $organism_name) = split /\t/;
  warn "$organism_name (pid $pid)\n";
  my ($genus, $species, $strain) = ($1, $2, $3) if $organism_name =~ /(\S+) (\S+) (.*)/;
  my $color = "100,100,100";
  $color = $colors{$species} if defined $colors{$species};
  $sources .= "
<entry>
<source>./$pid.proteins.fsa</source>
<title>$genus $species</title>
<subtitle>$strain</subtitle>
<group>$species</group>
<color>$color</color>
</entry>
";
  # collect the proteins of all segments of this project in one FASTA file
  open PID, "> $pid.proteins.fsa" or die $!;
  open ACCESSION, "mysql -N -B -e \"select genbank,segment_name from genomeatlas3_cur.genbank_complete_seq where pid = $pid and segment_name not like 'genome%'\" |";
  while (<ACCESSION>) {
    chomp;
    my ($genbank, $segment_name) = split /\t/;
    chomp $genbank;
    warn "adding $segment_name (accession $genbank)\n";
    my $gbk = "/home/databases/genomeatlasdb-3.0_cur/data/$genbank/$genbank.gbk";
    open PROT, "$SACO_EXTRACT -I genbank -O fasta -t < $gbk 2> /dev/null |" or die $!;
    while (<PROT>) {
      print PID;
    }
    close PROT;
  }
  close ACCESSION;
  close PID;
}
close ORGANISM;
warn "dumping xml config on stdout ...\n";
while (<DATA>) {
  s//$sources/g;
  print;
}

__DATA__

Proteome comparison of Campylobacter species
-

auto
auto

0.9
0.9
0.9

0.975
0
0

auto
auto

0.9
0.9
0.9

0
0.975
0
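The per-genome step of the script above (split the organism name into genus, species and strain, pick a colour by species, emit one `<entry>` block) can be illustrated in isolation. The following is a minimal Python sketch under stated assumptions. The function name `make_entry` and the project ids used in the example are hypothetical, and the XML escaping of names is an addition that the Perl listing does not perform.

```python
import re
from xml.sax.saxutils import escape

# Species colours as given in the Perl listing; anything else falls back to grey.
COLORS = {
    "lari": "0,104,139",
    "jejuni": "0,139,69",
    "hominis": "66,66,111",
    "fetus": "139,101,8",
    "curvus": "140,23,23",
    "concisus": "205,173,0",
}
DEFAULT_COLOR = "100,100,100"


def make_entry(pid, organism_name, colors=COLORS):
    """Build one <entry> block for a genome project.

    The name is split as 'genus species strain...', mirroring the regular
    expression /(\\S+) (\\S+) (.*)/ used in the listing.
    """
    match = re.match(r"(\S+) (\S+) (.*)", organism_name)
    if match is None:
        raise ValueError(f"cannot split organism name: {organism_name!r}")
    genus, species, strain = match.groups()
    color = colors.get(species, DEFAULT_COLOR)
    return "\n".join([
        "<entry>",
        f"<source>./{pid}.proteins.fsa</source>",
        f"<title>{escape(genus)} {escape(species)}</title>",
        f"<subtitle>{escape(strain)}</subtitle>",
        f"<group>{escape(species)}</group>",
        f"<color>{color}</color>",
        "</entry>",
    ])


# The pid values here are placeholders for illustration only.
print(make_entry(1, "Campylobacter jejuni subsp. jejuni 81-176"))
```

Grouping and colouring by the second word of the name is what places all C. jejuni subspecies in the same group in the resulting matrix.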

D.5 iscan source code 

1 #! / u s r / bin / p e r l 

2 u s e s t r i c t ; 

3 

4 m y $ p w m ; 

5 m y %m a t r i x ; 

6 m y $ s p a c e r ; 

7 m y @ P W M ; 

8 m y $ p i = 3 . 1 4 1 5 9 2 6 5 ; 

9 

10 # read the model f i l e s # i n c l u d e s u p p o r t e d r e c u r s i v e l y (NO CHECK FOR LOOPS ! ) 

11 m y %s e t u p ; 

12 m y @ L I N E S ; 

169

iscan source code 

13 i f ( d e f i n e d $ A R G V [ 0 ] ) { 

14 @ L I N E S = r e a d _ m o d ( $ A R G V [ 0 ] ) ; 

15 } e l s e { 

16 w h i l e (< D A T A >) { 

17 p r i n t ; 

18 } 

19 c l o s e D A T A ; 

20 d i e " n o m o d e l p r o v i d e d . t e m p l a t e m o d e l d u m p e d \ n " ; 

21 } 

22 

23 m y $ p w m i d = −1; 

24 p r i n t " # t h i s i s t h e m o d e l : \ n " ; 

25 f o r e a c h ( @ L I N E S ) { 

26 p r i n t " # $ _ \ n " ; 

27 i f ( / ˆ \ [ p w m \ ] \ s∗=\s ∗ ( . ∗ ) /) { 

28 $ p w m i d ++; 

29 p u s h @ P W M , " $ p w m i d : $ 1 " ; 

30 } 

31 m y $ p w m = $ P W M [$# P W M ] ; 

32 $ s e t u p { $ p w m }{ $ 1 } = $ 2 i f /ˆ(\ w+)\ s∗=\s ∗([\.\ −0 −9]+) / ; 

33 n e x t u n l e s s / ˆ \ [ ( [ A T G C ]+) \ ] / ; 

34 m y @ F = s p l i t / [ \ s \ t ] + / ; 

35 s h i f t @ F ; 

36 e r r ( " p w m n o t d e f i n e d " ) u n l e s s d e f i n e d $ p w m ; 

37 @ { $ m a t r i x { $ p w m }{ $ 1 }} = @ F ; 

38 $ m a t r i x { $ p w m }{ c o u n t } [ $ _ ] += $ F [ $ _ ] f o r e a c h ( 0 . . $#F ) ; 

39 } 

40 

41 # make a lookup t a b l e o f d i s t a n c e i n f o r m a t i o n measure 

42 m y %S P A C E R _ L O O K U P ; 

43 f o r e a c h m y $ s p a c e r ( k e y s %s e t u p ) { 

44 m y $ m i n = $ s e t u p { $ s p a c e r }{ m i n } ; 

45 m y $ m a x = $ s e t u p { $ s p a c e r }{ m a x } ; 

46 m y $ c e n t e r = $ s e t u p { $ s p a c e r }{ c e n t e r } ; 

47 p r i n t f " # p a r s i n g a c c e s s i b i l i t y f o r $ s p a c e r ( m i n = $ m i n , m a x = $ m a x , c e n t e r = $ c e n t e r ) \ n " ; 

48 m y $ n = 0 ; 

49 $ n += 1 + c o s ( ( ( 2 ∗ $ p i ) / 1 0 . 6 ) ∗ ( $ _ − $ c e n t e r ) ) f o r e a c h ( $ m i n . . $ m a x ) ; 

50 f o r e a c h m y $ d ( $ m i n . . $ m a x ) { 

51 i f ( $ c e n t e r e q " " ) { 

52 $ S P A C E R _ L O O K U P { $ d }{ $ m i n }{ $ m a x }{ $ c e n t e r } = 0 ; 

53 } e l s e { 

54 $ S P A C E R _ L O O K U P { $ d }{ $ m i n }{ $ m a x }{ $ c e n t e r } = −(−l o g ( ( 1 + c o s ( ( ( 2 ∗ $ p i ) / 1 0 . 6 ) ∗ ( $ d 

− $ c e n t e r ) ) ) / $ n ) / l o g ( 2 ) ) ; 

55 } 

56 p r i n t f " # d = % d , s c o r e = % 0 . 2 f \ n " , $d , $ S P A C E R _ L O O K U P { $ d }{ $ m i n }{ $ m a x }{ $ c e n t e r } ; 

57 } 

58 } 

59 

60 # compute matrix based o f f r e q u e n c i e s 

61 f o r e a c h m y $ p w m ( k e y s %m a t r i x ) { 

62 p r i n t " # p r e p a r i n g m a t r i x ’ $ p w m ’\ n " ; ; 

63 f o r e a c h m y $ l e t t e r ( q w / A T G C /) { 

64 p r i n t " # [ $ l e t t e r ] " ; 

65 f o r e a c h m y $ i ( 0 . . $#{ $ m a t r i x { $ p w m }{ A }} ) { 

66 m y $ i 1 = " - " ; 

67 m y $ i 2 = s p r i n t f ( ’ % 5 s ’ , ’ - ’ ) ; 

68 i f ( $ m a t r i x { $ p w m }{ $ l e t t e r } [ $ i ] > 0 ) { 

69 $ i 1 = 2 + l o g ( $ m a t r i x { $ p w m }{ $ l e t t e r } [ $ i ] / $ m a t r i x { $ p w m }{ c o u n t } [ $ i ] ) / l o g ( 2 ) − 0 ; 

70 $ i 2 = s p r i n t f ( ’ % 5 s ’ , s p r i n t f ( ’ % 0 . 2 f ’ , $ i 1 ) ) ; 

71 } 

72 $ m a t r i x { $ p w m }{ $ l e t t e r } [ $ i ] = $ i 1 i f $ i 1 n e " - " ; 

73 p r i n t " \ t $ i 2 " ; 

74 } 

75 p r i n t " \ n " ; 

76 } 

77 } 

78 

79 # l o o p o v e r a l l s e q u e n c e s i n i n p u t 

80 m y @ i n p = &r e a d _ f a s t a ; 

81 f o r e a c h m y $ s ( 0 . . $#i n p ) { 

82 m y $ s e q = $ i n p [ $ s ]−>{ s e q } ; 

83 p r i n t f " # S E Q U E N C E % s \ n " , $ i n p [ $ s ]−>{ i d } ; 

84 p r i n t f " # % d b p \ n " , l e n g t h ( $ s e q ) ; 

85 m y %L E N ; 

86 m y %B I T ; 

87 f o r e a c h m y $ p w m ( @ P W M ) { 

88 p r i n t " # g e n e r a t i n g b i t s c o r e s f o r m a t r i x ’ $ p w m ’\ n " ; 

89 @ { $ B I T { $ p w m }} = &s c a n ( $ s e q ,%{ $ m a t r i x { $ p w m }}) ; 

90 $ L E N { $ p w m } = s c a l a r ( @ { $ m a t r i x { $ p w m }{ A }}) ; 

91 p r i n t f " # % d e l e m e n t s i n a r r a y \ n " , s c a l a r ( @ { $ B I T { $ p w m }} ) ; 

92 } 

93 f o r e a c h m y $ p ( 0 . . ( l e n g t h ( $ s e q ) − $ L E N { $ P W M [ 0 ] } ) ) { 

94 p r i n t f " # c o n s i d e r i n g p o s i t i o n % d ( r o o t m o d e l ) \ n " , $ p + 1 ; 

95 # f i n d the s c o r e o f the i n i t i a l matrix , f o r t h i s g i v e n p o s i t i o n 

96 m y $ w = $ s e t u p { $ P W M [ 0 ] } { w e i g h t } ; 

97 m y $ f s i = $ B I T { $ P W M [ 0 ] } [ $ p ] ∗ $ w ; 

98 m y $ o f f s e t = $ p ; 

99 m y $ s i g n a l = s u b s t r ( $ s e q , $ o f f s e t , $ L E N { $ P W M [ 0 ] } ) ; 

100 m y $ s = s p r i n t f " % s \ t % 0 . 2 f " , $ s i g n a l , $ f s i ; 

101 

102 f o r e a c h m y $ p w m _ i n d e x (1 . . $#P W M ) { 

103 m y $ p w m = $ P W M [ $ p w m _ i n d e x ] ; 

104 m y $ w = $ s e t u p { $ p w m }{ w e i g h t } ; 

105 

106 # g e t the s p a c i n g d e t a i l s f o r the upstream s p a c e r 

107 m y $ p r e v _ p w m = $ P W M [ $ p w m _ i n d e x − 1 ] ; 

108 

170


109 m y ( $ m i n , $ m a x , $ c e n t e r ) = ( $ s e t u p { $ p r e v _ p w m }{ m i n } , 

110 $ s e t u p { $ p r e v _ p w m }{ m a x } , $ s e t u p { $ p r e v _ p w m }{ c e n t e r }) ; 

111 

112 m y $ o p t _ s p a c e r ; 

113 m y $ o p t _ u n i t _ s c o r e ; 

114 

115 # c a l c u l a t e u n i t s c o r e s f o r each o f the s p a c i n g c o n f i g u r a t i o n s 

116 # A u n i t i s the s p a c e r and the f o l l o w i n g matrix . We s e a r c h f o r the 

117 # s p a c e r g i v i n g r i s e t o the h i g h e s t u n i t s c o r e 

118 

119 p r i n t f " # a d j u s t i n g s p a c e r d o w n s t r a m o f ’ $ p w m ’\ n " ; 

120 

121 f o r e a c h m y $ s p a c e r ( $ m i n . . $ m a x ) { 

122 # don ’ t c o n t i n u e , o f the o f f s e t g o e s beyond z e r o . . . 

123 l a s t i f $ o f f s e t − $ L E N { $ p w m } − $ s p a c e r < 0 ; 

124 n e x t i f $ B I T { $ p w m } [ $ o f f s e t − $ L E N { $ p w m } − $ s p a c e r ] ∗ $ w < $ s e t u p { $ p w m }{ t h r e s h o l d } a n d 

d e f i n e d $ s e t u p { $ p w m }{ t h r e s h o l d } ; 

125 

126 # i f no o p t i m a l s p a c e r i s d e c l a r e d y e t ( e . g . b e c a u s e t h i s i s 

127 # the f i r s t round ) then do i t now 

128 $ o p t _ s p a c e r = $ s p a c e r u n l e s s d e f i n e d $ o p t _ s p a c e r ; 

129 m y $ t e s t _ u n i t _ s c o r e = $ B I T { $ p w m } [ $ o f f s e t − $ L E N { $ p w m } − $ s p a c e r ] ∗ $ w + $ S P A C E R _ L O O K U P { 

$ s p a c e r }{ $ m i n }{ $ m a x }{ $ c e n t e r } ; 

130 p r i n t f " # s p a c e r : % d , s c o r e : % 0 . 1 f ( % 0 . 1 f + % 0 . 1 f ) \ n " , $ s p a c e r , $ t e s t _ u n i t _ s c o r e , 

$ B I T { $ p w m } [ $ o f f s e t − $ L E N { $ p w m } − $ s p a c e r ] , $ S P A C E R _ L O O K U P { $ s p a c e r }{ $ m i n }{ $ m a x }{ 

$ c e n t e r } ; 

131 $ o p t _ u n i t _ s c o r e = $ t e s t _ u n i t _ s c o r e u n l e s s d e f i n e d $ o p t _ u n i t _ s c o r e ; 

132 i f ( $ t e s t _ u n i t _ s c o r e > $ o p t _ u n i t _ s c o r e ) { 

133 $ o p t _ s p a c e r = $ s p a c e r ; 

134 $ o p t _ u n i t _ s c o r e = $ t e s t _ u n i t _ s c o r e ; 

135 } 

136 } # f o r e a c h my $ s p a c e r 

137 

138 # o f f s e t i s where the c u r r e n t pwm s t a r t s 

139 $ o f f s e t = $ o f f s e t − $ L E N { $ p w m } − $ o p t _ s p a c e r ; 

140 

141 p r i n t f " # n e w o f f s e t % d \ n " , $ o f f s e t ; 

142 

143 i f ( ! d e f i n e d $ o p t _ u n i t _ s c o r e ) { 

144 p r i n t f " # u n a b l e t o d e t e r m i n e s p a c e r \ n " ; 

145 $ s .= s p r i n t f " \ t - \ t % s \ t - " , ( ’ - ’ x $ L E N { $ p w m }) ; 

146 n e x t ; 

147 } e l s e { 

148 p r i n t f " # s p a c e r $ o p t _ s p a c e r c h o s e n , u n i t ’% s ’ g i v e s s c o r e % 0 . 1 f \ n " , $ p w m , 

$ o p t _ u n i t _ s c o r e ; 

149 $ f s i += $ o p t _ u n i t _ s c o r e ; 

150 m y $ s i g n a l = s u b s t r ( $ s e q , $ o f f s e t , $ L E N { $ p w m }) ; 

151 $ s .= s p r i n t f " \ t % d \ t % s \ t % 0 . 2 f " , $ o p t _ s p a c e r , $ s i g n a l , $ f s i ; 

152 } 

153 } # f o r e a c h my $pwm index 

154 # p r i n t the f i n a l b i t s c o r e 

155 p r i n t f " % d \ t % 0 . 2 f \ t % s \ t \ n " , ( $ p +1) , $ f s i , $ s ; 

156 } # my $p = 0 

157 } # f o r ( $s = 0 . . . . 

158 

159 

160 ####################################### 

161 # HELPER FUNCTIONS 

162 ####################################### 

163 

164 

165 # scan u s i n g a matrix o f i n f o r m a t i o n 

166 s u b s c a n { 

167 m y @ a ; 

168 m y ( $s ,% m ) = @ _ ; 

169 m y $ m a = $#{$ m { A } } ; 

170 f o r e a c h m y $ p ( 0 . . ( l e n g t h ( $ s )−$#{$ m { A }} −1 ) ) { 

171 m y $ R i = 0 ; 

172 $ R i += $ m { s u b s t r ( $s , $ p+$_ , 1 ) } [ $ _ ] f o r e a c h ( 0 . . $ m a ) ; 

173 p u s h @ a , $ R i ; 

174 } 

175 # r e t u r n a l i s t having n−l +1 e l e m e n t s e h e r e n i s the s e q u e n c e l e n g t h , 

176 # n i s the matrix s i z e ( f o r −10 ( hexamer , n=6) 

177 r e t u r n @ a ; 

178 } 

179 

180 ############################################################### 

181 # s p a c e r b i t s c o r e c a l c u l a t i o n s c o o r d i n a t e s a r e s h i f t e d 6bp 

182 ############################################################### 

183 

184 s u b r e a d _ m o d { 

185 m y @ r e t ; 

186 m y $ f n = $ _ [ 0 ] ; 

187 m y $ i ; 

188 o p e n $ i , $ f n o r e r r ( " u n a b l e t o o p e n f i l e ’ $ f n ’: $ ! \ n " ) ; 

189 w h i l e ( r e a d l i n e ( $ i ) ) { 

190 c h o m p ; 

191 i f (/ˆ#\ s ∗ i n c l u d e \ s ∗ ( . ∗ ) /) { 

192 m y @ a = r e a d _ m o d ( $ 1 ) ; 

193 p u s h @ r e t , @ a ; 

194 } e l s e { 

195 n e x t i f / ˆ [ \ s \#] + / ; 

196 n e x t u n l e s s /ˆ\ S +/; 

197 p u s h @ r e t , $ _ ; 

198 } 

199 } 

200 c l o s e $ i ; 

171

quasi mktemp manual 

201 r e t u r n @ r e t ; 

202 } 

sub read_fasta {
  my @fasta; # contains all
  my $id = -1;
  # reads FASTA records from STDIN (handle assumed; it is missing from the printed listing)
  while (<STDIN>) {
    chomp;
    if (/^>(.*)/) {
      $id++;
      $fasta[$id]->{id} = $1;
    } elsif (/^([A-Za-z]+)/) {
      $fasta[$id]->{seq} .= $1;
    }
  }
  return @fasta;
}

sub err {
  print $_[0];
  exit 1;
}
exit 0;

__DATA__
[pwm]=-10 region
weight=1
[A] 0 63 0 63 63 0
[T] 63 0 63 0 0 63
[G] 0 0 0 0 0 0
[C] 0 0 0 0 0 0
[spacer]
min=13
center=16
max=19
[pwm]=-35 region
weight=1
[A] 0 0 0 0 0 36
[T] 63 63 0 54 0 9
[G] 0 0 63 0 18 9
[C] 0 0 0 9 45 9
[spacer]
min=0
center=3
max=6
[pwm]=UP
weight=0.5
[A] 18 0 45 27 45 54 54 54 18 9 45 9 2 9 18 45 54 45 9 2 0 9
[T] 45 11 0 0 18 0 9 9 36 45 18 54 45 45 27 9 9 18 54 54 63 17
[G] 0 9 18 36 0 0 0 0 9 9 0 0 0 9 9 0 0 0 0 7 0 0
[C] 0 43 0 0 0 9 0 0 0 0 0 0 16 0 9 9 0 0 0 0 0 37
[spacer]
min=-4
center=2
max=4
[pwm]=FIS
weight=0.5
threshold=0
[A] 26 27 16 0 18 9 0 29 54 54 54 45 42 3 2 36 7 2 18 22 16
[T] 36 36 45 0 0 38 43 0 0 0 9 0 18 45 0 0 0 0 1 0 45
[G] 1 0 2 63 18 7 20 34 9 9 0 18 3 13 45 0 54 0 44 41 0
[C] 0 0 0 0 27 9 0 0 0 0 0 0 0 2 16 27 2 61 0 0 2
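As a usage illustration of the scan helper in the listing, the following minimal sketch scores a short sequence against the -10 region matrix from the __DATA__ section. The example sequence and the parse_rows helper are assumptions made for this illustration; the model parser of the thesis script is not part of this section of the listing and may work differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same logic as the scan helper in the listing: slide the matrix along
# the sequence and sum one matrix cell per position.
sub scan {
    my ($s, %m) = @_;
    my $ma = $#{ $m{A} };                 # last column index of the matrix
    my @a;
    foreach my $p (0 .. length($s) - $ma - 1) {
        my $Ri = 0;
        $Ri += $m{ substr($s, $p + $_, 1) }[$_] foreach (0 .. $ma);
        push @a, $Ri;
    }
    return @a;                            # n - l + 1 window scores
}

# Hypothetical helper (not from the thesis): turn "[A] 0 63 ..." rows
# into a hash of array references keyed by nucleotide.
sub parse_rows {
    my %m;
    foreach my $line (@_) {
        next unless $line =~ /^\[([ATGC])\]\s+(.*)/;
        $m{$1} = [ split ' ', $2 ];
    }
    return %m;
}

# The -10 region rows, copied from the __DATA__ section.
my %minus10 = parse_rows(
    '[A] 0 63 0 63 63 0',
    '[T] 63 0 63 0 0 63',
    '[G] 0 0 0 0 0 0',
    '[C] 0 0 0 0 0 0',
);

my $seq    = 'GGTATAATGC';                # made-up example sequence
my @scores = scan($seq, %minus10);

# Report each window with its 1-based start position.
printf "%d\t%s\t%d\n", $_ + 1, substr($seq, $_, 6), $scores[$_]
    for 0 .. $#scores;
```

The ten-base sequence yields five window scores (n - l + 1 = 10 - 6 + 1). The window TATAAT at position 3 matches the highest-valued cell in every column and therefore reaches the maximum for this matrix, 6 x 63 = 378.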

D.6 quasi_mktemp manual

NAME
    quasi_mktemp - create a template CBS Web Service implementation

SYNOPSIS
    perl quasi_mktemp [-n SERVICENAME] [-v VERSION] [-w WSNUMBER] (-f) (-remove) (-t TEMPLATENAME)

DESCRIPTION
    This script creates a functional template SOAP Web Service implementation under Quasi,
    including a working example. The object types this service receives/generates are the
    CBS standard sequence data object / annotation data object.

    The following elements are created by the program:

    * WSDL file, with proper namespaces and operation(s)
    * An XSD included by the WSDL
    * A directory in /usr/opt/www/cgi-bin/CBS/soap/ws/quasi/ containing the Perl module (module.pm)
    * A directory in /usr/opt/www/pub/CBS/ws/ containing the XSD, WSDL and example files.
    * An entry in mysql.WebServices.services
    * An index.php and include.html located in /usr/opt/www/pub/CBS/ws/[SERVICENAME]

    To-do list, once you have created the template:

    [ ] Alter the WSDL so it contains the operations you need
    [ ] Alter the XSD so all operation data types are defined
    [ ] Alter the file module.pm and possibly wrapper.pl, located in /usr/opt/www/cgi-bin/soap/ws/quasi/[SERVICE]/[WS]/
    [ ] Alter the example so that it contains a relevant example for your service.
    [ ] Alter the include.html so that it describes the usage of the example script
    [ ] Once you are happy with the implementation, remove the flag "internal_only" from mysql.WebServices.services
        and change the desired description for your service (in field 'description')


OPTIONS
    -n SERVICENAME
        Case-sensitive service name, e.g. SignalP

    -v VERSION
        The version of the service in the form X.Y, e.g. 1.2

    -w WSNUMBER
        This is the implementation number for this service and version. The number
        starts at zero.

    -f
        Forces overwriting existing files

    -remove
        Removes all files pertaining to this service/version/implementation - be careful!

    -t TEMPLATE
        New templates can be installed. Use option -t list to list all templates

AUTHOR
    Peter Fischer Hallin, pfh@cbs.dtu.dk, September 2008

SEE ALSO
    /usr/opt/quaq/
    /usr/opt/www/cgi-bin/CBS/soap/ws/quasi.cgi

AUTHOR
    Peter Hallin 2008-09-15, pfh@cbs.dtu.dk
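The option set documented in the manual could be parsed along the lines sketched below. This is not the quasi_mktemp source, which is not reproduced here: the parse_options function, the layout of the returned hash, and the validation rules (treating -n, -v and -w as required unless "-t list" is given) are assumptions derived only from the SYNOPSIS and OPTIONS text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option parser mirroring the documented switches.
sub parse_options {
    my @args = @_;
    my %opt = (f => 0, remove => 0);
    GetOptionsFromArray(
        \@args,
        'n=s'    => \$opt{n},       # case-sensitive service name
        'v=s'    => \$opt{v},       # version in the form X.Y
        'w=i'    => \$opt{w},       # implementation number, from zero
        'f'      => \$opt{f},       # overwrite existing files
        'remove' => \$opt{remove},  # delete the service files
        't=s'    => \$opt{t},       # template name, or "list"
    ) or die "invalid options\n";

    # "-t list" only lists templates, so nothing else is required (assumed).
    return %opt if defined $opt{t} && $opt{t} eq 'list';

    defined $opt{$_} or die "missing required option -$_\n" for qw(n v w);
    $opt{v} =~ /^\d+\.\d+$/ or die "version must have the form X.Y\n";
    $opt{w} >= 0            or die "WS number starts at zero\n";
    return %opt;
}

# Example call using the values given as examples in the manual.
my %opt = parse_options(qw(-n SignalP -v 1.2 -w 0 -f));
print join(' ', map { "$_=" . ($opt{$_} // '') } sort keys %opt), "\n";
```

Getopt::Long accepts single-dash long options such as -remove in its default configuration, which is why no extra configuration is needed for the switch spelling used in the manual.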

Bibliography 

S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, & D. J. 

Lipman (1997). ‘Gapped blast and psi–blast: a new generation of protein database 

search programs.’ Nucleic Acids Res 25:3389–402.

B. F. Bauer, E. G. Kar, R. M. Elford, & W. M. Holmes (1988). ‘Sequence determinants 

for promoter strength in the leuv operon of Escherichia coli.’ Gene 63:123–34. 

J. Besemer, A. Lomsadze, & M. Borodovsky (2001). ‘GeneMarks: a self–training method 

for prediction of gene starts in microbial genomes. Implications for finding sequence 

motifs in regulatory regions.’ Nucleic Acids Res 29:2607–18. 

T. T. Binnewies, P. F. Hallin, H.-H. Staerfeldt, & D. W. Ussery (2005). ‘Genome Update: 

proteome comparisons.’ Microbiology 151:1–4. 

T. T. Binnewies, Y. Motro, P. F. Hallin, O. Lund, D. Dunn, T. La, D. J. Hampson, 

M. Bellgard, T. M. Wassenaar, & D. W. Ussery (2006). ‘Ten years of bacterial genome 

sequencing: comparative–genomics–based discoveries.’ Funct Integr Genomics 6:165–85.

E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies, 

Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M. 

Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum, 

R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland, S. Davis, 

N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy, 

M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson, 

T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri, 

S. C. J. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox, 

M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev, 

W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky, 

D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker, 

R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde, 

J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. Lindemeyer, 

K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd, 

R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T. 

Weirauch, J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Srinivasan, W.-K. Sung, H. S. 

Ooi, K. P. Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia, 

S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano, C. Wyss, E. Cheung, T. G. Clark, 

J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai, 

J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel, 

J. S. Mattick, P. Carninci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers, 

J. Rogers, P. F. Stadler, T. M. Lowe, C.-L. Wei, Y. Ruan, K. Struhl, M. Gerstein, S. E. 

Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A. 

Wetterstrand, P. J. Good, E. A. Feingold, M. S. Guyer, G. M. Cooper, G. Asimenos, 

C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan, 

F. Pardi, T. Massingham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullikin, A. Ureta- 

Vidal, B. Paten, M. Seringhaus, D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone, 

S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D. 

Trinklein, Z. D. Zhang, L. Barrera, R. Stuart, D. C. King, A. Ameur, S. Enroth, M. C. 

Bieda, J. Kim, A. A. Bhinge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. H. Lee, 

P. Ng, A. Shahab, A. Yang, Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley, 

D. Inman, M. A. Singer, T. A. Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman, 

J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F. 

Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar, 

N. Heintzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld, 

S. F. Aldred, S. J. Cooper, A. Halees, J. M. Lin, H. P. Shulha, X. Zhang, M. Xu, 

J. N. S. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green, C. Wadelius, P. J. Farnham, 

B. Ren, R. A. Harte, A. S. Hinrichs, H. Trumbower, H. Clawson, J. Hillman-Jackson, 

A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol, 

C. P. Bird, P. I. W. de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Martin, B. E. 

Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert, 

M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X. Guan, N. F. Hansen, 

J. R. Idol, V. V. B. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C. 

Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley, 

H. Jiang, G. M. Weinstock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K. 

Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. Lindblad-Toh, E. S. 

Lander, M. Koriabine, M. Nefedov, K. Osoegawa, Y. Yoshinaga, B. Zhu, & P. J. de Jong 

(2007). ‘Identification and analysis of functional elements in 1% of the human genome by

the encode pilot project.’ Nature 447:799–816. 

F. R. Blattner, G. r. Plunkett, C. A. Bloch, N. T. Perna, V. Burland, M. Riley, J. Collado- 

Vides, J. D. Glasner, C. K. Rode, G. F. Mayhew, J. Gregor, N. W. Davis, H. A. 

Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, & Y. Shao (1997). ‘The complete 

genome sequence of Escherichia coli k–12.’ Science 277:1453–62. 

A. J. t. Bokal, W. Ross, & R. L. Gourse (1995). ‘The transcriptional activator protein fis: 

Dna interactions and cooperative interactions with rna polymerase at the Escherichia

coli rrnbp1 promoter.’ J Mol Biol 245:197–207. 

A. Bolshoy, P. McNamara, R. E. Harrington, & E. N. Trifonov (1991). ‘Curved dna 

without a–a: experimental estimation of all 16 dna wedge angles.’ Proc Natl Acad Sci U

S A 88:2312–6. 

P. J. Brett, D. DeShazer, & D. E. Woods (1998). ‘Burkholderia thailandensis sp. nov., a 

Burkholderia pseudomallei–like species.’ Int J Syst Bacteriol 48:317–20.

E. Brzuszkiewicz, H. Bruggemann, H. Liesegang, M. Emmerth, T. Olschlager, G. Nagy, 

K. Albermann, C. Wagner, C. Buchrieser, L. Emody, G. Gottschalk, J. Hacker, & U. Dobrindt 

(2006). ‘How to become a uropathogen: comparative genomic analysis of extraintestinal

pathogenic Escherichia coli strains.’ Proc Natl Acad Sci U S A 103:12879–84. 

S. L. Chen, C.-S. Hung, J. Xu, C. S. Reigstad, V. Magrini, A. Sabo, D. Blasiar, T. Bieri, 

R. R. Meyer, P. Ozersky, J. R. Armstrong, R. S. Fulton, J. P. Latreille, J. Spieth, T. M. 

Hooton, E. R. Mardis, S. J. Hultgren, & J. I. Gordon (2006). ‘Identification of genes 

subject to positive selection in uropathogenic strains of Escherichia coli: a comparative

genomics approach.’ Proc Natl Acad Sci U S A 103:5977–82. 

A. L. Delcher, D. Harmon, S. Kasif, O. White, & S. L. Salzberg (1999). ‘Improved microbial 

gene identification with glimmer.’ Nucleic Acids Res 27:4636–41. 

J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan, 

B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark, 

R. Dalal, A. Dewinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, 

K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist, 

C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy, 

R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener, 

D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, & S. Turner (2009). 

‘Real–time dna sequencing from single polymerase molecules.’ Science 323:133–8. 

M. Ender, B. Berger-Bachi, & N. McCallum (2009). ‘A novel dna–binding protein modulating 

methicillin resistance in Staphylococcus aureus.’ BMC Microbiol 9:15. 

S. T. Estrem, T. Gaal, W. Ross, & R. L. Gourse (1998). ‘Identification of an up element 

consensus sequence for bacterial promoters.’ Proc Natl Acad Sci U S A 95:9761–6.

P. F. Hallin & D. W. Ussery (2004). ‘Cbs Genome Atlas Database: a dynamic storage for 

bioinformatic results and sequence data.’ Bioinformatics 20:3682–6. 

K. Hayashi, N. Morooka, Y. Yamamoto, K. Fujita, K. Isono, S. Choi, E. Ohtsubo, T. Baba, 

B. L. Wanner, H. Mori, & T. Horiuchi (2006). ‘Highly accurate genome sequences of 

Escherichia coli k–12 strains mg1655 and w3110.’ Mol Syst Biol 2:2006.0007.

T. Hayashi, K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han, 

E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami, 

T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori, 

& H. Shinagawa (2001). ‘Complete genome sequence of enterohemorrhagic Escherichia 

coli o157:h7 and genomic comparison with a laboratory strain k–12.’ DNA Res 8:11–22.

P. N. Hengen, S. L. Bartram, L. E. Stewart, & T. D. Schneider (1997). ‘Information 

analysis of Fis binding sites.’ Nucleic Acids Res 25:4994–5002. 

C. A. Hirvonen, W. Ross, C. E. Wozniak, E. Marasco, J. R. Anthony, S. E. Aiyar, V. H. 

Newburn, & R. L. Gourse (2001). ‘Contributions of up elements and the transcription 

factor fis to expression from the seven rrn p1 promoters in Escherichia coli.’ J Bacteriol

183:6305–14. 

A. M. Huerta & J. Collado-Vides (2003). ‘Sigma70 promoters in Escherichia coli: specific 

transcription in dense regions of overlapping promoter–like signals.’ J Mol Biol 333:261–78.

L. J. Jensen, C. Friis, & D. W. Ussery (1999). ‘Three views of microbial genomes.’ Res 

Microbiol 150:773–7. 

L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten, N. T. Hansen, H. Johansson, M. K. 

Joergensen, K. Kiil, P. F. Hallin, & D. Ussery (2005). THE PSEUDOMONADS VOL 

I. GENOMICS, LIFE STYLE AND MOLECULAR ARCHITECTURE, vol. 1, chap. 

Chapter 5: Comparative genomics of four Pseudomonas species, pp. 139–164. Kluwer 

Academic / Plenum Publishers, New York. 

Q. Jin, Z. Yuan, J. Xu, Y. Wang, Y. Shen, W. Lu, J. Wang, H. Liu, J. Yang, F. Yang, 

X. Zhang, J. Zhang, G. Yang, H. Wu, D. Qu, J. Dong, L. Sun, Y. Xue, A. Zhao, Y. Gao, 

J. Zhu, B. Kan, K. Ding, S. Chen, H. Cheng, Z. Yao, B. He, R. Chen, D. Ma, B. Qiang, 

Y. Wen, Y. Hou, & J. Yu (2002). ‘Genome sequence of Shigella flexneri 2a: insights 

into pathogenicity through comparison with genomes of Escherichia coli k12 and o157.’

Nucleic Acids Res 30:4432–41. 

T. J. Johnson, S. Kariyawasam, Y. Wannemuehler, P. Mangiamele, S. J. Johnson, 

C. Doetkott, J. A. Skyberg, A. M. Lynne, J. R. Johnson, & L. K. Nolan (2007). ‘The 

genome sequence of avian pathogenic Escherichia coli strain o1:k1:h7 shares strong similarities

with human extraintestinal pathogenic e. coli genomes.’ J Bacteriol 189:3228–36.

J. Kyte & R. F. Doolittle (1982). ‘A simple method for displaying the hydropathic character 

of a protein.’ J Mol Biol 157:105–32. 

K. Lagesen, P. Hallin, E. A. Rodland, H.-H. Staerfeldt, T. Rognes, & D. W. Ussery (2007). 

‘RNAmmer: consistent and rapid annotation of ribosomal rna genes.’ Nucleic Acids Res 

35:3100–8. 

T. S. Larsen & A. Krogh (2003). ‘EasyGene–a prokaryotic gene finder that ranks ORFs 

by statistical significance.’ BMC Bioinformatics 4:21. 

T. Lefebure & M. J. Stanhope (2007). ‘Evolution of the core and pan–genome of Streptococcus: 

positive selection, recombination, and genome composition.’ Genome Biol 

8:R71. 

X. Liao, T. Ying, H. Wang, J. Wang, Z. Shi, E. Feng, K. Wei, Y. Wang, X. Zhang, 

L. Huang, G. Su, & P. Huang (2003). ‘A two–dimensional proteome map of Shigella 

flexneri.’ Electrophoresis 24:2864–82. 

B. Liebig & R. Wagner (1995). ‘Effects of different growth conditions on the in vivo 

activity of the tandem Escherichia coli ribosomal rna promoters p1 and p2.’ Mol Gen

Genet 249:328–35. 

D. Lim & N. C. J. Strynadka (2002). ‘Structural basis for the beta lactam resistance of 

pbp2a from methicillin–resistant Staphylococcus aureus.’ Nat Struct Biol 9:870–6. 

T. M. Lowe & S. R. Eddy (1997). ‘tRNAscan–se: a program for improved detection of 

transfer rna genes in genomic sequence.’ Nucleic Acids Res 25:955–64.

J. P. McCutcheon, B. R. McDonald, & N. A. Moran (2009). ‘Origin of an alternative 

genetic code in the extremely small and gc–rich genome of a bacterial symbiont.’ PLoS 

Genet 5:e1000565. 

C. E. McEwan, D. Gatherer, & N. R. McEwan (1998). ‘Nitrogen–fixing aerobic bacteria 

have higher genomic gc content than non–fixing species within the same genus.’ 

Hereditas 128:173–8. 

W. G. Miller, C. T. Parker, M. Rubenfield, G. L. Mendz, M. M. S. M. Wosten, D. W. 

Ussery, J. F. Stolz, T. T. Binnewies, P. F. Hallin, G. Wang, J. A. Malek, A. Rogosin, 

L. H. Stanker, & R. E. Mandrell (2007). ‘The complete genome sequence and analysis 

of the epsilonproteobacterium Arcobacter butzleri.’ PLoS One 2:e1358.

H. D. Murray & R. L. Gourse (2004). ‘Unique roles of the rrn p2 rrna promoters in 

Escherichia coli.’ Mol Microbiol 52:1375–87. 

A. Nakabachi, A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, & M. Hattori 

(2006). ‘The 160–kilobase genome of the bacterial endosymbiont Carsonella.’ Science 

314:267. 

C. Ong, C. H. Ooi, D. Wang, H. Chong, K. C. Ng, F. Rodrigues, M. A. Lee, & P. Tan 

(2004). ‘Patterns of large–scale genomic variation in virulent and avirulent Burkholderia

species.’ Genome Res 14:2295–307. 

J. Parkhill, B. W. Wren, K. Mungall, J. M. Ketley, C. Churcher, D. Basham, T. Chillingworth, 

R. M. Davies, T. Feltwell, S. Holroyd, K. Jagels, A. V. Karlyshev, S. Moule, 

M. J. Pallen, C. W. Penn, M. A. Quail, M. A. Rajandream, K. M. Rutherford, A. H. van 

Vliet, S. Whitehead, & B. G. Barrell (2000). ‘The genome sequence of the food–borne 

pathogen Campylobacter jejuni reveals hypervariable sequences.’ Nature 403:665–8.

A. G. Pedersen, L. J. Jensen, S. Brunak, H. H. Staerfeldt, & D. W. Ussery (2000). ‘A dna 

structural atlas for Escherichia coli.’ J Mol Biol 299:907–30. 

V. Perez-Brocal, R. Gil, S. Ramos, A. Lamelas, M. Postigo, J. M. Michelena, F. J. Silva, 

A. Moya, & A. Latorre (2006). ‘A small microbial genome: the end of a long symbiotic 

relationship?’ Science 314:312–3. 

N. T. Perna, G. r. Plunkett, V. Burland, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew, 

P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Klink, A. Boutin, 

Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D. 

Potamousis, J. Apodaca, T. S. Anantharaman, J. Lin, G. Yen, D. C. Schwartz, R. A. 

Welch, & F. R. Blattner (2001). ‘Genome sequence of enterohaemorrhagic Escherichia 

coli o157:h7.’ Nature 409:529–33. 

O. N. Reva, P. F. Hallin, H. Willenbrock, T. Sicheritz-Ponten, B. Tummler, & D. W. 

Ussery (2008). ‘Global features of the Alcanivorax borkumensis sk2 genome.’ Environ 

Microbiol 10:614–25. 

E. P. C. Rocha (2004). ‘Codon usage bias from trna’s point of view: redundancy, specialization,

and efficient decoding for translation optimization.’ Genome Res 14:2279–86. 

W. Ross, J. Salomon, W. M. Holmes, & R. L. Gourse (1999). ‘Activation of Escherichia 

coli leuv transcription by fis.’ J Bacteriol 181:3864–8. 

K. Rutherford, J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Rajandream, & B. Barrell 

(2000). ‘Artemis: sequence visualization and annotation.’ Bioinformatics 16:944–5. 

R. A. Sanford, J. R. Cole, & J. M. Tiedje (2002). ‘Characterization and description of 

Anaeromyxobacter dehalogenans gen. nov., sp. nov., an aryl–halorespiring facultative 

anaerobic myxobacterium.’ Appl Environ Microbiol 68:893–900. 

S. C. Satchwell, H. R. Drew, & A. A. Travers (1986). ‘Sequence periodicities in chicken 

nucleosome core dna.’ J Mol Biol 191:659–75. 

S. Schneiker, O. Perlova, O. Kaiser, K. Gerth, A. Alici, M. O. Altmeyer, D. Bartels, 

T. Bekel, S. Beyer, E. Bode, H. B. Bode, C. J. Bolten, J. V. Choudhuri, S. Doss, 

Y. A. Elnakady, B. Frank, L. Gaigalat, A. Goesmann, C. Groeger, F. Gross, L. Jelsbak, 

L. Jelsbak, J. Kalinowski, C. Kegler, T. Knauber, S. Konietzny, M. Kopp, L. Krause, 

D. Krug, B. Linke, T. Mahmud, R. Martinez-Arias, A. C. McHardy, M. Merai, F. Meyer, 

S. Mormann, J. Munoz-Dorado, J. Perez, S. Pradella, S. Rachid, G. Raddatz, F. Rosenau, 

C. Ruckert, F. Sasse, M. Scharfe, S. C. Schuster, G. Suen, A. Treuner-Lange, G. J. 

Velicer, F.-J. Vorholter, K. J. Weissman, R. D. Welch, S. C. Wenzel, D. E. Whitworth, 

S. Wilhelm, C. Wittmann, H. Blocker, A. Puhler, & R. Muller (2007). ‘Complete genome 

sequence of the myxobacterium Sorangium cellulosum.’ Nat Biotechnol 25:1281–9. 

R. K. Shultzaberger, Z. Chen, K. A. Lewis, & T. D. Schneider (2007). ‘Anatomy of 

Escherichia coli sigma70 promoters.’ Nucleic Acids Res 35:771–88. 

M. D. Smith, B. J. Angus, V. Wuthiekanun, & N. J. White (1997). ‘Arabinose assimilation 

defines a nonvirulent biotype of Burkholderia pseudomallei.’ Infect Immun 65:4319–21.

H. Tettelin, V. Masignani, M. J. Cieslewicz, C. Donati, D. Medini, N. L. Ward, S. V. 

Angiuoli, J. Crabtree, A. L. Jones, A. S. Durkin, R. T. Deboy, T. M. Davidsen, M. Mora, 

M. Scarselli, I. Margarit y Ros, J. D. Peterson, C. R. Hauser, J. P. Sundaram, W. C. 

Nelson, R. Madupu, L. M. Brinkac, R. J. Dodson, M. J. Rosovitz, S. A. Sullivan, 

S. C. Daugherty, D. H. Haft, J. Selengut, M. L. Gwinn, L. Zhou, N. Zafar, H. Khouri, 

D. Radune, G. Dimitrov, K. Watkins, K. J. B. O’Connor, S. Smith, T. R. Utterback, 

O. White, C. E. Rubens, G. Grandi, L. C. Madoff, D. L. Kasper, J. L. Telford, M. R. 

Wessels, R. Rappuoli, & C. M. Fraser (2005). ‘Genome analysis of multiple pathogenic 

isolates of Streptococcus agalactiae: implications for the microbial “pan–genome”.’ Proc

Natl Acad Sci U S A 102:13950–5. 

J. D. Thompson, D. G. Higgins, & T. J. Gibson (1994). ‘Clustal w: improving the sensitivity 

of progressive multiple sequence alignment through sequence weighting, position–specific

gap penalties and weight matrix choice.’ Nucleic Acids Res 22:4673–80.

H. Toh, B. L. Weiss, S. A. H. Perkin, A. Yamashita, K. Oshima, M. Hattori, & S. Aksoy 

(2006). ‘Massive genome erosion and functional adaptations provide insights into the 

symbiotic lifestyle of Sodalis glossinidius in the tsetse host.’ Genome Res 16:149–56. 

M. L. Tress, P. L. Martelli, A. Frankish, G. A. Reeves, J. J. Wesselink, C. Yeats, P. I. Olason, 

M. Albrecht, H. Hegyi, A. Giorgetti, D. Raimondo, J. Lagarde, R. A. Laskowski, 

G. Lopez, M. I. Sadowski, J. D. Watson, P. Fariselli, I. Rossi, A. Nagy, W. Kai, Z. Storling, 

M. Orsini, Y. Assenov, H. Blankenburg, C. Huthmacher, F. Ramirez, A. Schlicker, 

F. Denoeud, P. Jones, S. Kerrien, S. Orchard, S. E. Antonarakis, A. Reymond, E. Birney, 

S. Brunak, R. Casadio, R. Guigo, J. Harrow, H. Hermjakob, D. T. Jones, T. Lengauer, 

C. A. Orengo, L. Patthy, J. M. Thornton, A. Tramontano, & A. Valencia (2007). ‘The 

implications of alternative splicing in the encode protein complement.’ Proc Natl Acad 

Sci U S A 104:5495–500. 

J. W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley. 

D. W. Ussery, P. F. Hallin, K. Lagesen, & T. M. Wassenaar (2004). ‘Genome update: 

tRNAs in sequenced microbial genomes.’ Microbiology 150:1603–6. 

T. Visnes, B. Doseth, H. S. Pettersen, L. Hagen, M. M. L. Sousa, M. Akbari, M. Otterlei, 

B. Kavli, G. Slupphaug, & H. E. Krokan (2009). ‘Uracil in dna and its processing by 

different dna glycosylases.’ Philos Trans R Soc Lond B Biol Sci 364:563–8. 

H. Wang & C. J. Benham (2008). ‘Superhelical destabilization in regulatory regions of 

stress response genes.’ PLoS Comput Biol 4:e17.

H. Wang, M. Noordewier, & C. J. Benham (2004). ‘Stress–induced dna duplex destabilization 

(sidd) in the e. coli genome: sidd sites are closely associated with promoters.’

Genome Res 14:1575–84. 

R. A. Welch, V. Burland, G. r. Plunkett, P. Redford, P. Roesch, D. Rasko, E. L. Buckles, 

S.-R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C. 

Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, & F. R. Blattner (2002). 

‘Extensive mosaic structure revealed by the complete genome sequence of uropathogenic

Escherichia coli.’ Proc Natl Acad Sci U S A 99:17020–4. 

H. Willenbrock, C. Friis, A. S. Juncker, & D. W. Ussery (2006). ‘An environmental 

signature for 323 microbial genomes based on codon adaptation indices.’ Genome Biol 

7:R114. 

K.-M. Wu, L.-H. Li, J.-J. Yan, N. Tsao, T.-L. Liao, H.-C. Tsai, C.-P. Fung, H.-J. Chen, 

Y.-M. Liu, J.-T. Wang, C.-T. Fang, S.-C. Chang, H.-Y. Shu, T.-T. Liu, Y.-T. Chen, Y.- 

R. Shiau, T.-L. Lauderdale, I.-J. Su, R. Kirby, & S.-F. Tsai (2009). ‘Genome sequencing 

and comparative analysis of Klebsiella pneumoniae ntuh–k2044, a strain causing liver 

abscess and meningitis.’ J Bacteriol 191:4492–501. 

F. Yang, J. Yang, X. Zhang, L. Chen, Y. Jiang, Y. Yan, X. Tang, J. Wang, Z. Xiong, 

J. Dong, Y. Xue, Y. Zhu, X. Xu, L. Sun, S. Chen, H. Nie, J. Peng, J. Xu, Y. Wang, 

Z. Yuan, Y. Wen, Z. Yao, Y. Shen, B. Qiang, Y. Hou, J. Yu, & Q. Jin (2005). ‘Genome 

dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery.’

Nucleic Acids Res 33:6445–58. 
