Computational tools and Interoperability in Comparative ... - CBS
Peter Fischer Hallin | 2009<br />
Computational tools and Interoperability in Comparative Genomics<br />
PhD thesis | Peter Fischer Hallin | 2009<br />
Center for Biological Sequence Analysis<br />
Department of Systems Biology<br />
Technical University of Denmark<br />
[Cover figure: BLAST matrix of 10 Campylobacter proteomes. Each cell gives a homology percentage with its underlying count ratio. Homology between proteomes ranges from 21.2 % (595 / 2,802) to 84.7 % (1,448 / 1,709); homology within proteomes ranges from 1.5 % (24 / 1,585) to 4.0 % (66 / 1,650). Proteomes shown:<br />
Campylobacter concisus 13826: 2,080 proteins, 1,972 families<br />
Campylobacter curvus 525.92: 1,931 proteins, 1,885 families<br />
Campylobacter fetus subsp. fetus 82-40: 1,719 proteins, 1,665 families<br />
Campylobacter hominis ATCC BAA-381: 1,687 proteins, 1,623 families<br />
Campylobacter jejuni RM1221: 1,838 proteins, 1,780 families<br />
Campylobacter jejuni subsp. doylei 269.97: 1,731 proteins, 1,650 families<br />
Campylobacter jejuni subsp. jejuni 81-176: 1,758 proteins, 1,702 families<br />
Campylobacter jejuni subsp. jejuni 81116: 1,626 proteins, 1,585 families<br />
Campylobacter jejuni subsp. jejuni NCTC 11168: 1,624 proteins, 1,581 families<br />
Campylobacter lari RM2100: 1,546 proteins, 1,494 families]
To my family. Thank you Susanne for your endless support and for giving us two<br />
wonderful boys, Oliver and Victor.
Preface<br />
This Ph.D. thesis is written for the Department of Systems Biology, Technical University<br />
of Denmark, as part of the Life Science programme, as a requirement for obtaining the<br />
Ph.D. degree.<br />
The work was supported through the EMBRACE project, which is funded by the European<br />
Commission within the Sixth Framework Programme, under the area of “Life sciences,<br />
genomics and biotechnology for health”, contract number LSGH-CT-2004-512092.<br />
Part of the work was supported through a grant from the Danish Natural Science Research<br />
Council, contract number 26-06-0349, entitled “Comparative Genomics of Campylobacter<br />
jejuni”.<br />
The work was carried out at the Center for Biological Sequence Analysis (CBS), Department<br />
of Systems Biology, under the supervision of Associate Professor David W. Ussery.<br />
The work on bacterial promoters was carried out during an external stay at the University<br />
of California, Davis (UC Davis Genome Center), under the supervision of Professor Craig J.<br />
Benham, and was supported through an NSF Research Grant, contract number DBI-0416764.<br />
Lyngby, 28 September, 2009<br />
Peter Fischer Hallin<br />
Cover illustration<br />
The background of the cover shows a “BLAST atlas” of Burkholderia pseudomallei strain<br />
1710b compared with 22 other Burkholderia genomes. The top panel, under the title,<br />
shows the P1/P2 rrnB promoter region of E. coli, mapped to different DNA properties.<br />
The panel below is a “BLAST matrix” of 10 different Campylobacter strains, showing the<br />
overall proteome similarity.<br />
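To make the matrix cells concrete: each cell on the cover pairs a percentage with a count ratio, for example 84.7 % with 1,448 / 1,709 between two proteomes and 2.3 % with 34 / 1,494 within one. The Python sketch below reproduces those percentages from the printed counts. It assumes that the numerator is the number of protein families with a homolog and the denominator is the total number of families considered; the function name homology_percent is hypothetical and is not part of any CBS tool.

```python
def homology_percent(shared_families: int, total_families: int) -> float:
    """Return shared/total as a percentage rounded to one decimal place.

    Assumption (not confirmed by the thesis text shown here): a BLAST
    matrix cell reports families with a homolog (numerator) over all
    families considered (denominator).
    """
    if total_families <= 0:
        raise ValueError("total_families must be positive")
    return round(100.0 * shared_families / total_families, 1)


# Count ratios as printed in cells of the cover figure.
cells = {
    "between proteomes, highest": (1448, 1709),
    "between proteomes, lowest": (595, 2802),
    "within C. lari RM2100": (34, 1494),
}

for label, (shared, total) in cells.items():
    print(f"{label}: {homology_percent(shared, total)} %")
```

Rounding to one decimal place matches the precision printed in the figure; any other precision would be an equally arbitrary choice for this illustration.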
Abstract<br />
The scientific community is witnessing an explosion in both the number and the complexity<br />
of DNA sequencing projects. As sequencing equipment becomes more reliable,<br />
faster and less expensive, new possibilities for applying the technology are opening up.<br />
The early genome sequencing projects, dating back almost 15 years, presented only individual<br />
microbial strains, and the large efforts and scientific achievements of that time<br />
merited publication in high-ranking journals. Today, however, projects like the Human<br />
Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic<br />
Encyclopedia of Bacteria and Archaea (GEBA) take sequencing into a new era, studying<br />
the genomes and ecological niches of entire populations consisting of thousands of microorganisms.<br />
These initiatives create a demand for new analysis tools to process and derive<br />
knowledge from the wealth of genomic information.<br />
This thesis describes the development of new tools and methods to study these types<br />
of data. When the genomes of characterized strains and environmental samples are sequenced,<br />
the ribosomal RNA genes are commonly chosen as a starting point to describe<br />
phylogeny and diversity. The rRNA genes are often interpreted as an ‘evolutionary<br />
chronometer’, and the RNAmmer software was developed as a tool to quickly and<br />
consistently identify the rRNA genes, allowing large-scale phylogenetic analysis of complex<br />
data sets. RNAmmer resolved the problems with gene boundary accuracy that<br />
arise when BLAST-based approaches are used to map rRNA genes. The ability to<br />
accurately map the start of rRNA transcripts has allowed the investigation of promoter<br />
structures of these highly expressed operons, and a promoter analysis in E. coli K12 is<br />
demonstrated by applying a mathematical model of the energetics involved in DNA helix<br />
opening.<br />
But a single gene, such as the 16S rRNA, cannot by its nature describe the phenotype<br />
or the full coding potential of an organism. This thesis describes the development of<br />
the BLASTatlas tool, a visualization tool that gives an overview of similarities and differences<br />
between any number of genomes, metagenomic samples or sequence databases from the<br />
viewpoint of a reference genome. This software has proved to be a powerful tool to study<br />
the localization and gain or loss of gene clusters, such as pathogenicity islands in virulent<br />
organisms. The tool has been used in several research projects and collaborations,<br />
was described in a cover article in Molecular BioSystems in 2008, and was highlighted in the<br />
journal Chemical Biology. Despite the usefulness of this tool, it became obvious that a more<br />
“biologist friendly” web-based version with zooming capability was needed. This led<br />
to the GeneWiz browser, which was developed in a joint effort with the IT staff at CBS.<br />
The tool enables the user to interactively zoom from a global chromosomal scale down<br />
to the nucleotide, while maintaining the overview of all data being presented in the plot. It<br />
features disproportional zooming as known from Google Maps. At the time of writing this<br />
thesis, the work is just being published in the second issue of the SIGS journal (Standards<br />
in Genomic Sciences).<br />
Since I started my Ph.D. project, a total of 630 prokaryotic genomes have been sequenced<br />
and published. This represents on average about four genomes per week! As we<br />
gain knowledge from this vast amount of data, new prediction methods become available,<br />
allowing for the generation of even more data; examples include predicting sigma factor<br />
genes, chromosomal replication starts, and secretion systems. This combination of new<br />
sequence data and new predictions multiplies the problem: How do we deal with the<br />
challenge that more and more genomic material must be processed through more and more<br />
bioinformatic tools? And how is this flow of information formalized and automated, allowing<br />
bioinformaticians to programmatically submit comparisons of any genome to any<br />
prediction method anywhere in the world? The need for interoperable and programmable<br />
interfaces for these resources is now widely recognized, and machine-to-machine communication<br />
through Web Services has gained acceptance. But challenges lie ahead during the<br />
transition from web-browser-centric thinking towards interoperability and service-orientated<br />
architecture (SOA). During my Ph.D. work, a number of significant contributions to<br />
both implementations and server infrastructure have provided remote users with access to CBS<br />
prediction servers and databases. This work has been presented both during the general<br />
meetings of the EU project (EMBRACE) that initiated these efforts and during various<br />
workshops teaching the usage of Web Services and Comparative Genomics.<br />
Resumé<br />
The scientific community is witnessing an explosion in both the number and the complexity<br />
of genome sequencing projects. As sequencing equipment becomes faster, more<br />
reliable and also cheaper, new possibilities for applying the technology open up.<br />
The first genome projects, dating back almost 15 years, presented only individual<br />
bacterial strains, and the large effort, together with the scientific results, led<br />
to publications in high-ranking journals. Today, projects such as the Human Microbiome<br />
Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia<br />
of Bacteria and Archaea (GEBA) have brought genome sequencing into a new era by<br />
characterizing thousands of reference genomes and entire ecosystems consisting of thousands of<br />
species. These initiatives will demand new analysis tools to process this flood of<br />
information and turn it into knowledge.<br />
This thesis describes methods and tools for studying these types of data.<br />
When characterized strains and samples are sequenced, ribosomal RNA is often<br />
chosen as the starting point for describing phylogeny and diversity. Ribosomal RNA is often<br />
used as an ‘evolutionary chronometer’, and the RNAmmer program was developed as<br />
a tool to identify rRNA genes quickly and consistently, which enables<br />
more extensive phylogenetic analyses of complex data sets. RNAmmer has solved<br />
earlier problems in establishing the exact annotation of the genes, which occurred<br />
with BLAST-based methods. The ability to map rRNA genes accurately<br />
has allowed investigation of the promoter structures of these highly expressed operons.<br />
Subsequently, an existing mathematical energy model of DNA opening was applied<br />
to perform a promoter analysis of the P1/P2 system in E. coli K12.<br />
But a single gene, such as the 16S rRNA, is by its very nature unable to<br />
describe the phenotype of an entire organism or its full coding potential. This thesis<br />
describes the BLASTatlas method, a visualization tool that gives an overview<br />
of the similarity between any number of genomes, metagenomic samples or sequence databases,<br />
taking a reference genome as the starting point. This software has proved to be an<br />
effective means of studying individual genes or groups of genes that are conserved or<br />
lost in, for example, pathogenic microorganisms. The tool has been used<br />
in several research projects and collaborations, and the method was published<br />
as the cover article of the May 2008 issue of Environmental Microbiology. It became<br />
clear, however, that the lack of an interactive aspect made the tool difficult for biologists to use.<br />
This led to the development of the GeneWiz Browser program, which was developed in<br />
collaboration with IT staff at CBS. The tool lets the user zoom interactively<br />
from the global genome down to the individual nucleotide while retaining<br />
an overview of all data presented in the plot. The program uses disproportional<br />
scaling, as known from, for example, Google Maps. The work is currently<br />
being published in Standards in Genomic Sciences.<br />
Since the start of my three-year Ph.D. project, a total of 630 prokaryotic organisms have been fully<br />
sequenced and published. This corresponds to an average of three genomes per week! As<br />
we gain new knowledge from these large amounts of data, new prediction methods are published<br />
for, for example, sigma factors, chromosomal replication and secretion systems.<br />
This duality underlines the problem: How do we respond to the challenge that<br />
more and more genomic material must be processed by more and more bioinformatic<br />
tools? And how can this flow of information be formalized and automated<br />
so that bioinformaticians and biologists can programmatically<br />
run comparisons of any genome against any prediction method anywhere in the world?<br />
The need for interoperable and programmable interfaces to these resources is now widely<br />
recognized, and computer-to-computer communication through Web Services has<br />
gained ground. But challenges lie ahead in the transition from web-browser-focused<br />
thinking towards interoperability and Service Orientated Architecture (SOA).<br />
During my Ph.D. work, a number of significant contributions in the form of implementations and infrastructure<br />
have given external users of various CBS tools and databases programmatic<br />
access via Web Services. These contributions have been presented both at general meetings of<br />
the EMBRACE EU project and at various workshops on the use of Web Services.<br />
Acknowledgments<br />
I would like to express my deep gratitude to my supervisor Prof. David Ussery for his support<br />
during my Ph.D. project. It has been a great pleasure to work with him during my time<br />
at CBS and I will miss the time of organizing workshops and preparing for conferences.<br />
Thanks to Prof. and center director Søren Brunak for creating a unique and inspiring<br />
environment at CBS, which enabled this project.<br />
I would like to extend my heartfelt gratitude to Craig and Marcia Benham for their<br />
incredible hospitality and openness towards our family during my research visit at the University<br />
of California, Davis in 2007.<br />
I would like to thank a great colleague and friend of mine, Tim T. Binnewies, for support<br />
during conferences, manuscript preparations and our daily collaborations; it has been a<br />
pleasure to work with Tim. Thanks to Karin Lagesen for great research collaboration<br />
during the development of RNAmmer and to Hanni Willenbrock for great collaboration and<br />
for driving numerous publications. I would also like to thank all the people I worked with<br />
during the development of the ENCODE pipeline: Ramneek Gupta, Thomas Blicher,<br />
Haakan Svensson, Henrik Nielsen, Rasmus Wernersson, Morten Bo Johansen and Eleonora<br />
Kulberkyte.<br />
Special thanks to Hans-Henrik Stærfeldt for valuable feedback and all the inspiring<br />
and productive sessions of finalizing GeneWiz Browser and composing web services software.<br />
Special thanks to Kristoffer Rapacki for being a great travel companion, for always<br />
finding solutions, and for the many fruitful discussions we have had; I hope there will be<br />
more. I would like to thank the numerous people with whom I have had the pleasure of<br />
working during research projects and courses.<br />
Thanks to former center administrators Johanne Keiding and Anne Christensen, current center<br />
administrator Dorthe Kjærsgaard, Lone Boesen and Malene Beck for your extraordinary<br />
efforts in keeping the CBS engine running efficiently. Lone Boesen deserves special praise<br />
for smoothly arranging and handling travel details for my many trips abroad, spanning<br />
five continents.<br />
Publications and manuscripts<br />
Publications included in this thesis are listed in the order they appear. All other articles<br />
are sorted by publication date, descending. For papers with five or more citations, the<br />
citation count is indicated.<br />
Paper I<br />
Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas - a GeneWiz extension<br />
for visualization of whole-genome homology. Mol Biosyst 4:363-71 (2008).<br />
Paper II<br />
Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, La T, Hampson DJ, Bellgard M,<br />
Wassenaar TM, Ussery DW. Ten years of bacterial genome sequencing: comparative-genomics-based<br />
discoveries. Funct Integr Genomics 6:165-85 (2006) - 56 citations.<br />
Paper III<br />
Reva ON, Hallin PF, Willenbrock H, Sicheritz-Ponten T, Tummler B, Ussery DW. Global<br />
features of the Alcanivorax borkumensis SK2 genome. Environ Microbiol 10:614-25<br />
(2008).<br />
Paper IV<br />
Vesth T, Hallin PF, Snipen L, Lagesen K, Wassenaar TM, Ussery DW. The origins of<br />
Vibrio species. Microbial Ecology (2009) doi:10.1007/s00248-009-9596-7<br />
Paper V<br />
Wassenaar TM, Binnewies TT, Hallin PF, Ussery DW. Tools for comparison of<br />
bacterial genomes. Book chapter, Microbiology of Hydrocarbons, Oils, Lipids, and<br />
Derived Compounds, Springer-Verlag, Heidelberg, Germany, 2009.<br />
Paper VI<br />
[Lagesen K, Hallin P] [1], Rodland EA, Stærfeldt HH, Rognes T, Ussery DW. RNAmmer:<br />
consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res<br />
35:3100-8 (2007) - 8 citations [2]<br />
[1] Both authors contributed equally.<br />
[2] Additionally 8 citations for the first 8 GEBA genomes published in the SIGS journal; being part of a<br />
standard pipeline, RNAmmer will be cited for future GEBA articles.<br />
Paper VII<br />
Hallin PF, Stærfeldt H, Rotenberg E, Binnewies TT, Benham CJ, Ussery DW. GeneWiz<br />
browser: An Interactive Tool for Visualizing Sequenced Chromosomes.<br />
Standards in Genomic Sciences 1:204-215 (2009) doi:10.4056/sigs.28177.<br />
Papers not included<br />
Contributions have been made to the following papers during my PhD project.<br />
• Miller WG, Parker CT, Rubenfield M, Mendz GL, Wosten MM, Ussery DW,<br />
Stolz JF, Binnewies TT, Hallin PF, Wang G, Malek JA, Rogosin A, Stanker<br />
LH, Mandrell RE. The complete genome sequence and analysis of the<br />
human pathogen Arcobacter butzleri. PLoS ONE 2:e1358 (2007)<br />
• Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW. Characterization of<br />
probiotic Escherichia coli isolates with a novel pan-genome microarray.<br />
Genome Biol 8:R267 (2007)<br />
Earlier papers, 2004–2006<br />
• Worning P, Jensen LJ, Hallin PF, Stærfeldt HH, Ussery DW. Origin of replication<br />
in circular prokaryotic chromosomes. Environ Microbiol 8:353-61<br />
(2006) - 28 citations<br />
• Kiil K, Binnewies TT, Sicheritz-Ponten T, Willenbrock H, Hallin PF, Wassenaar<br />
TM, Ussery DW. Genome update: sigma factors in 240 bacterial<br />
genomes. Microbiology 151:3147-50 (2005)<br />
• Bendtsen JD, Binnewies TT, Hallin PF, Ussery DW. Genome update: prediction<br />
of membrane proteins in prokaryotic genomes. Microbiology<br />
151:2119-21 (2005)<br />
• Bendtsen JD, Binnewies TT, Hallin PF, Sicheritz-Ponten T, Ussery DW. Genome<br />
update: prediction of secreted proteins in 225 bacterial proteomes.<br />
Microbiology 151:1725-7 (2005)<br />
• Binnewies TT, Bendtsen JD, Hallin PF, Nielsen N, Wassenaar TM, Pedersen<br />
MB, Klemm P, Ussery DW. Genome Update: Protein secretion systems<br />
in 225 bacterial genomes. Microbiology 151:1013-6 (2005)<br />
• Hallin PF, Nielsen N, Devine KM, Binnewies TT, Willenbrock H, Ussery DW.<br />
Genome update: base skews in 200+ bacterial chromosomes. Microbiology<br />
151:633-7 (2005)<br />
• Willenbrock H, Binnewies TT, Hallin PF, Ussery DW. Genome update: 2D<br />
clustering of bacterial genomes. Microbiology 151:333-6 (2005)<br />
• Binnewies TT, Hallin PF, Stærfeldt HH, Ussery DW. Genome Update: proteome<br />
comparisons. Microbiology 151:1-4 (2005)<br />
• Hallin PF, Ussery DW. CBS Genome Atlas Database: a dynamic storage<br />
for bioinformatic results and sequence data. Bioinformatics 20:3682-6<br />
(2004) - 37 citations<br />
• Hallin PF, Coenye T, Binnewies TT, Jarmer H, Stærfeldt HH, Ussery DW.<br />
Genome update: correlation of bacterial genomic properties. Microbiology<br />
150:3899-903 (2004)<br />
• Ussery DW, Binnewies TT, Gouveia-Oliveira R, Jarmer H, Hallin PF. Genome<br />
update: DNA repeats in bacterial genomes. Microbiology 150:3519-21<br />
(2004) - 11 citations<br />
• Hallin PF, Binnewies TT, Ussery DW. Genome update: chromosome atlases.<br />
Microbiology 150:3091-3 (2004)<br />
• Ussery DW, Tindbaek N, Hallin PF. Genome update: promoter profiles.<br />
Microbiology 150:2791-3 (2004)<br />
• Ussery DW, Jensen MS, Poulsen TR, Hallin PF. Genome update: alignment<br />
of bacterial chromosomes. Microbiology 150:2491-3 (2004)<br />
• Ussery DW, Hallin PF. Genome Update: annotation quality in sequenced<br />
microbial genomes. Microbiology 150:2015-7 (2004) - 8 citations<br />
• Ussery DW, Hallin PF, Lagesen K, Wassenaar TM. Genome update: tRNAs<br />
in sequenced microbial genomes. Microbiology 150:1603-6 (2004)<br />
• Ussery DW, Hallin PF, Lagesen K, Coenye T. Genome update: rRNAs in<br />
sequenced microbial genomes. Microbiology 150:1113-5 (2004)<br />
• Ussery DW, Hallin PF. Genome Update: AT content in sequenced prokaryotic<br />
genomes. Microbiology 150:749-52 (2004) - 8 citations<br />
• Ussery DW, Hallin PF. Genome update: Length distributions of sequenced<br />
prokaryotic genomes. Microbiology 150:513-6 (2004)<br />
Contents<br />
List of Figures<br />
1 Introduction<br />
2 Comparative Genomics<br />
2.1 Introduction<br />
2.2 The genome annotation pipeline<br />
2.2.1 fetchgbk: Obtaining existing public genomes from GenBank<br />
2.2.2 Other ways to acquire genome information<br />
2.2.3 Tools contigsort and contigmap<br />
2.2.4 Finding protein encoding genes in prokaryotes<br />
2.2.5 Finding tRNA and rRNA genes<br />
2.3 Genome Comparisons<br />
2.3.1 Box-and-whiskers plot<br />
2.3.2 heatmap - 2D clustering<br />
2.3.3 Codon usage and chromosomal base composition<br />
2.3.4 CodonPlot: visualizing codon usage<br />
2.3.5 Base composition and DNA repair<br />
2.3.6 BLASTmatrix - proteome comparison<br />
2.3.7 BLASTatlas - visualizing whole-genome homology<br />
2.3.8 CorePlot - plotting the core- and pan-genomes of species<br />
2.4 Summary<br />
2.5 Instant insight: Reading the genetic atlas<br />
2.6 Paper I: The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology<br />
2.7 Paper II: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries<br />
2.8 Paper III: Global features of the Alcanivorax borkumensis SK2 genome<br />
2.9 Paper IV: The origins of Vibrio species<br />
2.10 Paper V: Tools for comparison of bacterial genomes<br />
3 rRNA operons and promoter analysis<br />
3.1 Introduction<br />
3.2 P1 and P2 promoters in E. coli<br />
3.3 Conservation of regulatory elements<br />
3.3.1 Modeling the P1 and P2 in selected enterics<br />
3.3.2 Iterating weight matrix frequencies<br />
3.3.3 Refining E. coli and Shigella models<br />
3.4 DNA melting and SIDD energy<br />
3.4.1 codesearch: Mapping numerical data to genome annotations<br />
3.5 The genomic context: visualizing operons and DNA properties<br />
3.6 Visualizing sequencing quality using gwBrowser<br />
3.6.1 Visualizing the P1 and P2 structure using gwBrowser<br />
3.7 Summary<br />
3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences<br />
3.9 Paper VII: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes<br />
4 Web Services and Interoperability in Genomics<br />
4.1 Introduction<br />
4.2 Interoperability<br />
4.2.1 SOAP based Web Services<br />
4.3 EMBRACE: An EU initiative for enhanced interoperability<br />
4.3.1 Quasi - a light-weight SOAP server<br />
4.3.2 quasi mktemp - From template to Web Service<br />
4.4 ENCODE pipeline: applying Web Services<br />
4.4.1 Collecting Web Services clients in EPipe<br />
4.4.2 Mapping Pfam annotations to protein structure: mecA<br />
5 Conclusion and perspectives<br />
A Appendix: Workshops, teaching, and conferences<br />
A.1 Lectures and Presentations<br />
A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences<br />
A.1.2 Comparative Microbial Genomics Workshop<br />
A.1.3 Comparative Microbial Genomics and Taxonomy<br />
A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services<br />
A.1.5 EMBRACE Workshop on Bioinformatics of Immunology<br />
A.1.6 EMBRACE 3rd AGM: Implementation of web services<br />
A.1.7 EMBRACE Workshop on Perl, SQL and Web Services<br />
A.2 Workshops and meetings<br />
A.2.1 EMBRACE Workshop: SOAP web services<br />
A.2.2 EUCOMM Bioinformatics Training Course<br />
A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences<br />
A.2.4 EMBRACE 3rd Annual General Meeting<br />
A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation<br />
A.2.6 EMBRACE 4th Annual General Meeting<br />
A.2.7 Technical discussion of EMBRACE registry<br />
A.2.8 EMBRACE meeting: Discussion of standard data types<br />
A.3 Conferences<br />
A.3.1 Conference: Metagenomics, July 2007, San Diego U.S.A.<br />
A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington U.S.A.<br />
B Appendix: Ph.D. study plan<br />
C Appendix: Courses<br />
C.1 Global regulatory networks in microorganisms<br />
C.2 Protein Structure and Computational Biology<br />
C.3 Biological Sequence Analysis<br />
C.4 Comparative Genome Analysis<br />
C.5 Doctoral seminar on business economics for academic entrepreneurs<br />
C.6 ECTS summary<br />
D Appendix: Software<br />
D.1 fetchgbk manual<br />
D.2 Sample output from queryGenomes<br />
D.3 BLASTatlas configurations<br />
D.3.1 file blast.cfg<br />
D.3.2 file custom.cfg<br />
D.4 BLASTmatrix example<br />
D.5 iscan source code<br />
D.6 quasi mktemp manual<br />
Bibliography<br />
List of Figures<br />
2.1 Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used as backbone for mapping contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and reverse hits, respectively. Panel (a) shows unmapped whereas panel (b) shows mapped contigs.<br />
2.2 Construction of a box-and-whiskers plot. The notches are an estimate of the 95% confidence interval.<br />
2.3 Genome size of all public prokaryotic genomes.<br />
2.4 Average AT content of all public prokaryotic genomes.<br />
2.5 2D-clustering showing 87 Enterobacteriaceae.<br />
2.6 Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 (49.2% AT). Rightmost column shows the nucleotide bias of the three codon positions.<br />
2.7 AT content profile 400 bp upstream and downstream of annotated translation starts in Buchnera aphidicola Cc.<br />
2.8 Deamination of cytosine (C) into uracil (U)<br />
2.9 Construction of the BLASTmatrix diagram. Proteome similarity between three E. coli genomes. Lower part of the diagram corresponds to intraproteome similarity.<br />
2.10 Proteome similarity between ten Campylobacter species. Color encoding corresponds to percentage of shared protein families.<br />
2.11 Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains lacking the cholera enterotoxin genes are highlighted in bright green, whilst genomes of pathogenic V. cholerae strains are shown in dark green.<br />
2.12 Mapping of pairwise alignment to a reference genome. Mismatches, conservative mismatches, and perfect matches contribute 0.0, 0.5, and 1.0 to the overall map, respectively. Gaps within the reference protein, corresponding to missing features of the reference protein, cannot be mapped and are hence excluded.<br />
2.13 Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds to a pairwise comparison against the reference chromosome.<br />
2.14 Comparison of B. pseudomallei 1710b chromosomes I and II against all public Burkholderia genomes.<br />
2.15 A phylome atlas of Alcanivorax borkumensis, comparing the proteome against all γ-, α-, β-, δ-, and ɛ-proteobacteria available at the time of publishing.<br />
2.16 Count of genomes and species divided by genera. Source: CBS Genome Atlas Database as of 2009-09-11.<br />
2.17 Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently available, there seems to be an equilibrium at close to 600 protein families.<br />
2.18 CorePlot output for 32 Vibrio genomes.<br />
3.1 The transcription of bacterial genes.<br />
3.2 The promoter structure of the rrnB operon in E. coli.<br />
3.3 The –10 and –35 hexamers of the E. coli σ70 promoter correspond to motifs located on opposite sides of the DNA helix. Deletions or insertions in the spacing cause a shift of approx. 36° per nucleotide.<br />
3.4 Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d).<br />
3.5 Neighbor-joining tree of first 1k bases of all 16S rRNA genes of Yersinia, Salmonella, Shigella, and E. coli<br />
3.6 Profiles showing the maximum Ri(tot) scores of the initial weight matrices applied to E. coli and Shigella: Unadjusted P1 scores (a), Adjusted P1 scores (b), Unadjusted P2 scores (c), and Adjusted P2 scores (d)<br />
3.7 Logos showing the base composition of P1 and P2 of E. coli genomes, as identified by initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g)<br />
3.8 Average profiles of SIDD energy calculated at five different helix densities -0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation start.<br />
3.9 E. coli and Shigella rrnB energy landscape visualized using the heatmap function. Each vertical column corresponds to a promoter sequence, whereas the horizontal rows represent average values over 10 bp within each sequence. Coordinates labeled on the horizontal rows are relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps show P2. Leftmost heatmaps show P1/P2 model scores in green, whereas rightmost heatmaps show the SIDD energy in blue.<br />
3.10 Principal workflow of gwBrowser data exchange.<br />
3.11 Mapping qualities of sequencing reads to a reference genome while accounting for the uniqueness of the read.<br />
3.12 A zoom of the P1 P2 tandem promoter system upstream of the rrnB operon of E. coli K12.<br />
4.1 Screen shot of NCBI Entrez Genome projects web page<br />
4.2 Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose the outgoing request and parse the incoming server response.<br />
4.3 Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that as much as possible is dispatched in parallel. Modules may either be alignment dependent or not. If the alignment is required to predict the protein features, the module is not launched until the alignment algorithm has finished. Modules may either return global features of the entire protein (e.g. cellular localization), or return positional features (e.g. phosphorylation sites).<br />
4.4 The input web page of EPipe: Upper part defines sequence upload and alignment method, and lower part selects which modules / methods to run. When applicable, gene ontologies have been added to each feature and feature values (light green boxes).<br />
4.5 The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim & Strynadka, 2002). Top panel shows the EPipe structure browser, which allows rotation in steps of 90 degrees. Lower panel shows a post-processing of the PyMol script generated by EPipe.<br />
Chapter 1<br />
Introduction<br />
Since the publication of the first complete bacterial genome sequence in 1995, close to a<br />
thousand prokaryotes have been fully sequenced and made publicly available. These data<br />
represent large efforts by many scientists and technicians, closing gaps in the chromosomal<br />
sequences and providing detailed gene annotations. These genome projects constitute a<br />
valuable collection of prokaryotic diversity and serve as an indispensable resource for<br />
comparative studies when novel features of newly discovered organisms are identified.<br />
We are, however, witnessing a transition phase as genome sequencing becomes a trivial<br />
step carried out by any researcher or company in need of a better characterization of an<br />
organism. Sequencing equipment and the capability of assembling an entire genome will<br />
likely follow the same path as other technological advances. Telephones,<br />
cars, aeroplanes, and computers all started as costly and clumsy attempts<br />
and ended up as mainstream, affordable, and efficient products that are taken for granted. Nothing<br />
will prevent sequencing technology from following the same path, and it will likely end up as a<br />
tiny desktop instrument on a doctor's table next to the blood pressure measuring device.<br />
But the decreasing novelty of presenting a new genome sequence could cause a decline in<br />
the number of published genomes in the near future, leading to less control and organization<br />
of these data, with fewer demands on data integrity and on sequencing and annotation quality.<br />
Some major issues arise as massive amounts of genomic data become a reality. There<br />
are signs that our ability to process and analyze genomic data is being overtaken by the<br />
technological developments of the sequencing equipment. For example, over the past<br />
twenty-five years, GenBank has grown roughly 100,000-fold, whereas computer processing<br />
power, following Moore's law, has grown "only" 1,000-fold. The overwhelming amounts of<br />
data generated by modern sequencing machines constitute tough challenges for most biologists,<br />
and although efforts are constantly being made to improve gene prediction and<br />
genome assembly software, these steps do not yet function in a scalable and unsupervised<br />
fashion. Further, the post-annotation steps that derive knowledge from predicted genes<br />
remain one of the biggest challenges. How do we transform contigs of nucleotide sequences<br />
into knowledge from which the phenotype of the organism can be derived?<br />
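As a rough illustration of the growth figures quoted above (a back-of-the-envelope sketch, not a<br />
calculation taken from this thesis), the fold changes can be converted into implied doubling times,<br />
assuming steady exponential growth over the 25 years. The function name is invented for the example.<br />

```python
import math

def doubling_time(fold_change, years):
    """Years per doubling, assuming steady exponential growth."""
    return years / math.log2(fold_change)

genbank = doubling_time(100_000, 25)  # GenBank: roughly 100,000-fold in 25 years
compute = doubling_time(1_000, 25)    # processing power: roughly 1,000-fold in 25 years

print(f"GenBank doubles roughly every {genbank:.1f} years")
print(f"Processing power doubles roughly every {compute:.1f} years")
```

Under these assumptions the sequence data double about every 1.5 years and the processing power about<br />
every 2.5 years, so the gap between the two keeps widening.<br />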
As more prokaryotic genomes are sequenced, there are now a number of species<br />
for which multiple strains have been sequenced. Roughly one fourth of all prokaryotic projects<br />
belong to species for which five or more strains are available. As this coverage of diversity<br />
increases, we may begin to answer some key questions with better confidence. How do<br />
we define core sets of genes? Can we estimate the size of the pan-genome? Which<br />
features are novel in selected strains, and are these features regionally conserved within<br />
the chromosomes? To answer these questions, there is a fundamental need to visualize<br />
and overview the similarities and differences between larger numbers of genomes. Obtaining<br />
such an overview allows some questions concerning gene acquisition and chromosomal<br />
organization to be answered. The development and refinement of the BLASTatlas method<br />
during this Ph.D. project is an essential step forward in enabling these types of analysis,<br />
and the method is now offered as an online service by CBS. This work led to a publication<br />
in 2008 describing the BLASTatlas method.<br />
Chapter 2 describes a number of tools that assist rapid analysis of genomes,<br />
genomic contigs, and larger collections of genomes in order to assess their similarity. Making<br />
local and web-based genome analysis tools available to the novice user remains critical for<br />
the success of future sequencing projects. In chapter 3, the RNAmmer tool is used as<br />
a starting point to study the E. coli rrn tandem promoters. This work presents useful<br />
tools to model and visualize promoter conservation in genomes. The exchange of genomic<br />
data between users, sequencing centers, repositories, and tool providers currently lacks<br />
standardization and interoperability. The lack of a formal way to exchange genomic data is<br />
a limiting factor for how we may in the future exploit the wave of new genomic material<br />
being generated. Chapter 4 of this thesis describes a number of efforts made during this<br />
Ph.D. project to provide interoperability and programmatic access to prediction<br />
methods and genomic visualization methods, as well as management of data standards. The<br />
outcome of this work has led CBS to adapt its tools and server infrastructure, thereby sharing<br />
its many tools in a way that allows programmers to insert sophisticated prediction methods<br />
directly into their own programming environment.<br />
Chapter 2<br />
Comparative Genomics<br />
2.1 Introduction<br />
This chapter covers work for five publications. The first paper (I) describes the BLASTatlas<br />
method, developed to compare and visualize the homology between a reference genome<br />
and any number of other genomes, collections of genomes, metagenomic sequences, or<br />
databases as a single graphic. The method has been used in connection with various<br />
research projects, including the publication of the Arcobacter butzleri RM4018 genome<br />
(Miller et al., 2007), in computer exercises (see chapter 4 and appendix A.1), and as an analysis<br />
tool for publications made during the project (papers II-V).<br />
A number of smaller unpublished methods, including the BLASTmatrix, CorePlot,<br />
and CodonPlot, have been written and used as in-house tools. The BLASTmatrix software<br />
derives unique and shared protein families for any number of proteomes. This enables the<br />
viewer to read off the similarity between any pair of organisms included in the comparison.<br />
The tool was first used in Jensen et al. (2005) and has also been used in other papers, including<br />
paper II. An improved version of the BLASTmatrix tool is used in paper IV. The<br />
BLASTmatrix software generates an all-against-all BLAST (Basic Local Alignment Search<br />
Tool; Altschul et al., 1997) comparison of a number of selected proteomes. When comparing multiple<br />
species of the same genus, these BLAST results can be reused by the CorePlot program<br />
to estimate the size of the core- and pan-genome. Finally, the CodonPlot program was<br />
written to visualize the codon and amino acid usage of an organism. The CodonPlot<br />
results contributed to papers II, III, and V.<br />
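The set logic behind these comparisons can be sketched as follows. This is a minimal illustration, not<br />
the BLASTmatrix or CorePlot code: it assumes that the all-against-all BLAST hits have already been<br />
clustered, so that every proteome is represented as a set of protein family identifiers. The function<br />
names and toy identifiers are invented for the example, and taking the pairwise similarity as shared<br />
families divided by all families found in the two proteomes is an assumption about the definition.<br />

```python
def shared_fraction(families_a, families_b):
    """Shared families divided by all families seen in the two proteomes."""
    union = families_a | families_b
    return len(families_a & families_b) / len(union) if union else 0.0

def core_pan_sizes(proteomes):
    """Sizes of the running core (intersection) and pan (union) family sets."""
    core, pan, sizes = None, set(), []
    for families in proteomes:
        core = set(families) if core is None else core & families
        pan |= families
        sizes.append((len(core), len(pan)))
    return sizes

# Toy family identifiers (hypothetical), one set per genome.
genome_1 = {"famA", "famB", "famC", "famD"}
genome_2 = {"famA", "famB", "famC", "famE"}
genome_3 = {"famA", "famB", "famF"}

print(shared_fraction(genome_1, genome_2))             # 3 shared of 5 in total -> 0.6
print(core_pan_sizes([genome_1, genome_2, genome_3]))  # [(4, 4), (3, 5), (2, 6)]
```

As genomes are added, the core can only shrink and the pan-genome can only grow. The intermediate<br />
values depend on the order in which the genomes are added, whereas the final pair does not.<br />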
The development of an interactive web-based genome browser (gwBrowser) has allowed<br />
a broader application of the atlas visualization method, including analysis of sequencing<br />
reads and promoter regions. This work is described in chapter 3.<br />
2.2 The genome annotation pipeline<br />
Having assembled the reads of a sequencing project, the biologist is often presented with<br />
an incomplete mapping of the chromosome, with gaps and a large number of contigs<br />
(contiguous pieces of DNA). The quality of an assembly originating from most modern<br />
high-throughput techniques can be negatively affected by a number of factors, such as<br />
short or insufficient reads, elevated error rates near the ends of the reads, DNA repeats on<br />
the chromosome, and inadequate assembly tools. This section describes tools to analyze<br />
both complete genome data (single contig) and preliminary data generated by pyrosequencing<br />
machines (multiple contigs). Most tools presented here are stored<br />
on the CBS servers at /home/people/pfh/scripts/.<br />
2.2.1 fetchgbk: Obtaining existing public genomes from GenBank<br />
Without robust access to prior knowledge about existing genomes, it is hard to draw<br />
conclusions about a novel genome sequence. The tool fetchgbk was made to download the<br />
most recent GenBank entries from NCBI using individual accession numbers (GenBank<br />
and RefSeq), ranges thereof, or the NCBI project ID, whereby all replicons of an organism<br />
can be obtained. Listing 2.1 shows common usage of the program, and appendix D.1<br />
includes the manual.<br />
Listing 2.1: Usage of fetchgbk<br />
# download a single GenBank record<br />
fetchgbk -a CP000896<br />
# download a single RefSeq entry<br />
fetchgbk -a NZ_ABIZ00000000<br />
# download a range of RefSeq entries<br />
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038<br />
# just list the RefSeq accession numbers of a project<br />
fetchgbk -p 12997 -d refseq -l<br />
# download all replicons of a project (RefSeq)<br />
fetchgbk -p 19391 -d refseq<br />
# download all replicons of a project (GenBank)<br />
fetchgbk -p 19391 -d genbank<br />
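The accession range in the listing can be thought of as shorthand for a list of individual records.<br />
The sketch below shows one way such a range could be expanded and turned into request URLs for the<br />
NCBI E-utilities efetch service. It is not the fetchgbk implementation, which is documented in<br />
appendix D.1; the function names are invented, and the sketch assumes that both ends of a range<br />
share the same prefix and number of digits.<br />

```python
import re
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def expand_range(first, last):
    """Expand a range such as NZ_ABIH01000001..NZ_ABIH01000038 into accessions."""
    m1 = re.fullmatch(r"(.*?)(\d+)", first)
    m2 = re.fullmatch(r"(.*?)(\d+)", last)
    if (not m1 or not m2 or m1.group(1) != m2.group(1)
            or len(m1.group(2)) != len(m2.group(2))):
        raise ValueError("range ends must share prefix and digit count")
    prefix, width = m1.group(1), len(m1.group(2))
    start, stop = int(m1.group(2)), int(m2.group(2))
    # Zero-pad each number back to the original width.
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]

def efetch_url(accession):
    """URL requesting a full GenBank flat file for one nucleotide accession."""
    query = urlencode({"db": "nuccore", "id": accession,
                       "rettype": "gbwithparts", "retmode": "text"})
    return f"{EFETCH}?{query}"

accessions = expand_range("NZ_ABIH01000001", "NZ_ABIH01000038")
print(len(accessions), accessions[0], accessions[-1])
print(efetch_url("CP000896"))
```

The block only builds the URLs; fetching them (for example with urllib.request.urlopen) is left out<br />
so that the example runs without network access.<br />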
2.2.2 Other ways to acquire genome information<br />
The GenBank records maintained in the CBS Genome Atlas Database (Hallin & Ussery,<br />
2004) are regularly synchronized against NCBI Entrez (see<br />
http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). The raw sequence data can be downloaded from this database<br />
using the Web Services client scripts getSeq, getOrfs, and getProt. Example scripts can be<br />
downloaded and run as separate commands (listing 2.2) or integrated into larger workflows,<br />
in other programming languages if needed.<br />
Listing 2.2: Accessing Genome Atlas Database through Web Services.<br />
# download prerequisites<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getseq.pl<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getprot.pl<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getorfs.pl<br />
<br />
# obtain full genome sequence of GenBank entry<br />
perl getseq.pl CP000550 > CP000550.fsa<br />
<br />
# obtain translations of GenBank entry<br />
perl getprot.pl CP000550 > CP000550.proteins.fsa<br />
<br />
# obtain open reading frames of GenBank entry<br />
perl getorfs.pl CP000550 > CP000550.orfs.fsa<br />
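Once a sequence file has been retrieved as in listing 2.2, basic properties such as length and AT<br />
content can be computed directly. The following sketch assumes that the downloaded .fsa files are in<br />
FASTA format (as the extension suggests); the helper names are invented, and the example record is<br />
made up so that the block runs without network access.<br />

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def at_content(sequence):
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return at / (at + gc) if at + gc else 0.0

# Made-up record standing in for a file such as CP000550.fsa.
example = io.StringIO(">toy_record\nATATGC\nGGAT\n")
for name, seq in read_fasta(example):
    print(name, len(seq), f"{at_content(seq):.1%}")
```

For a real file, the StringIO object would be replaced by an open file handle, for example<br />
open("CP000550.fsa").<br />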
The CBS Genome Atlas Database contains an index of genome metadata, such as<br />
organism name, NCBI Project ID, replicon, genome size, number of coding genes, tRNA<br />
genes, rRNA genes, base composition, and average values of various DNA properties<br />
such as intrinsic curvature (Bolshoy et al., 1991) and stacking energy (Satchwell et al., 1986).<br />
For more information on the Web Services implementation, see section 4.2.1; for<br />
full documentation, please refer to http://www.cbs.dtu.dk/ws/GenomeAtlas. Listing 2.3<br />
shows an example of how to use queryGenomes to obtain AT content and gene count for<br />
4
<strong>Comparative</strong> Genomics<br />
the publicly available Vibrio genomes. Output the comm<strong>and</strong> is listed <strong>in</strong> appendix D.2.<br />
Listing 2.3: Using queryGenomes to obtain genome meta data.<br />
# download client script<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/querygenomes.pl<br />
<br />
# download XML::Compile helper script<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl<br />
<br />
# extract AT content and number of genes for all Vibrio genomes<br />
perl querygenomes.pl -hideMerged -organism vibrio -output ATCONTENT,NGENES<br />
2.2.3 Tools contigsort and contigmap<br />
In some analyses of unfinished or partially sequenced genomes, it is desirable<br />
to obtain approximate coordinates of the contigs within the complete chromosome. The<br />
contigsort program was written for this purpose. It accepts any number of entries (contigs)<br />
in one FASTA file, together with a backbone sequence as a single contig in a second FASTA file.<br />
The entries of the contig file are then mapped to the backbone sequence using nucleotide<br />
BLAST, assuming at least one significant hit. The tool then sorts all contigs by the<br />
backbone coordinate of the center-point of each alignment. Contigs spanning the<br />
origin of circular backbones are automatically split in two.<br />
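The sorting step described above can be illustrated with a minimal sketch. It is not the contigsort implementation: the Hit record, its field names, and the rule of keeping only the best-scoring hit per contig are assumptions made for illustration, and the splitting of origin-spanning contigs is left out.<br />

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment of a contig against the backbone (hypothetical record)."""
    contig: str
    backbone_start: int
    backbone_end: int
    score: float


def sort_contigs(hits):
    """Order contig names by the backbone midpoint of their best-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.contig)
        # Assumption: when a contig has several hits, the highest score wins.
        if current is None or hit.score > current.score:
            best[hit.contig] = hit

    def midpoint(hit):
        # Start and end may be reversed for hits on the opposite strand;
        # the midpoint is the same either way.
        return (hit.backbone_start + hit.backbone_end) / 2

    return [h.contig for h in sorted(best.values(), key=midpoint)]


# Hypothetical alignments, not taken from the C. jejuni example.
hits = [
    Hit("contig_A", 900_000, 950_000, 8000.0),
    Hit("contig_B", 120_000, 100_000, 9500.0),  # reverse-strand hit
    Hit("contig_C", 400_000, 460_000, 7000.0),
    Hit("contig_C", 10_000, 11_000, 300.0),     # weaker secondary hit, ignored
]
order = sort_contigs(hits)
print(order)
```

Sorting on the midpoint rather than the start coordinate makes the order independent of the strand on which a contig aligns.<br />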
The tool genomemap was written to visualize homology between two genome<br />
sequences. Each genome may consist of one or more contigs, and all contigs are aligned<br />
using BLASTN. This tool allows a user to validate the output of the backbone mapping from<br />
contigsort. The plot generated is similar to that produced by the Artemis Comparison<br />
Tool (ACT) (Rutherford et al., 2000); however, the output of genomemap is a vector<br />
graphic file (PostScript), and multiple sequence entries are allowed within each of the two<br />
compared sequences.<br />
Example: Campylobacter jejuni str. 260.94<br />
The 10 contigs of the currently unpublished sequence of Campylobacter jejuni str. 260.94<br />
(GenBank accession no. AANK01000001-AANK01000010) were downloaded and converted<br />
into a FASTA format file. The program saco_convert is an in-house program at CBS<br />
that converts between different sequence formats. In the example provided, Campylobacter<br />
jejuni str. NCTC 11168 (Parkhill et al., 2000) is used as the backbone (see listing<br />
2.4).<br />
Listing 2.4: Using contigsort to map assembled contigs to a backbone.<br />
set path = (~pfh/scripts/contigsort ~pfh/scripts/fetchgbk $path)<br />
fetchgbk -a AANK01000001-AANK01000010 > AANK.gbk<br />
saco_convert -I genbank -O fasta AANK.gbk > AANK.fsa<br />
fetchgbk -a AL111168 > AL111168.gbk<br />
saco_convert -I genbank -O fasta AL111168.gbk > AL111168.fsa<br />
contigsort -c -i AANK.fsa -b AL111168.fsa > mapped.fsa<br />
To visualize the result of the contig mapping, the mapped and un-mapped contigs were<br />
processed by contigmap. The output from the comparison is a PostScript document (figure<br />
2.1 and listing 2.5).<br />
Figure 2.1: Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used<br />
as backbone for mapping the contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and<br />
reverse hits, respectively. Panel (a) shows the un-mapped contigs, whereas panel (b) shows the mapped contigs.<br />
Listing 2.5: Using contigmap to draw homology between contigs and reference genome<br />
set path = (~pfh/scripts/contigmap $path)<br />
contigmap AL111168.fsa AANK.fsa > AANK-raw.ps<br />
contigmap AL111168.fsa mapped.fsa > AANK-mapped.ps<br />
2.2.4 Finding protein encoding genes in prokaryotes<br />
Gene finding is a crucial step in any genome annotation pipeline. Successful<br />
gene calling enables a number of downstream analyses, such as<br />
translation of ORFs into protein sequences, discovery of potentially novel genes, annotation<br />
of protein function by homology searches, assignment of functional domains, and detection<br />
of signal peptides to derive the secretome. Both to reveal novel protein sequences and<br />
to draw conclusions about the overall proteome, it is therefore essential that the gene<br />
calling can be trusted. Several public prokaryotic gene predictors are available,<br />
such as Glimmer3 (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi,<br />
Delcher et al. (1999)), GeneMarkS (http://exon.biology.gatech.edu/, Besemer et al.<br />
(2001)), EasyGene (http://www.cbs.dtu.dk/services/EasyGene/, Larsen & Krogh (2003)),<br />
and Prodigal (unpublished, http://compbio.ornl.gov/prodigal). Prodigal is a recent<br />
development that is fast and simple, yet provides promising results. It<br />
has been implemented as part of the CBS Genome Atlas Database Web Services. Code<br />
examples showing the usage of the Prodigal client scripts are provided in listing 2.6.<br />
Listing 2.6: Using Prodigal for ORF prediction. Note that 6pack is an internal CBS tool used for<br />
translation of ORFs.<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/prodigal.pl<br />
perl prodigal.pl -ta 11 -fasta < mapped.fsa > mapped.orfs.fsa<br />
6pack -1 < mapped.orfs.fsa > mapped.proteins.fsa<br />
Assessing annotation quality<br />
All four gene finders listed above were applied to the latest version of the E. coli<br />
strain K-12 isolate MG1655 genome sequence (U00096, 28 July, 2009, Blattner et al.<br />
(1997)). These predictions, together with an older annotation of the same GenBank entry<br />
(U00096 from 2004), were compared pairwise to the latest version of the GenBank entry.<br />
The number of unique genes in both reference and query genome was derived and, for each<br />
overlapping pair of ORFs, the average inaccuracy of the 3' and 5' ends was calculated<br />
(table 2.1). In addition, the encoded proteins were compared using the BLASTmatrix<br />
tool, described in section 2.3.6. This allows estimation of the number of protein families<br />
shared between the reference and the query genomes.<br />
Source | CDS total | TP | FP | FN | 3' off (bp) | 5' off (bp) | Sensitivity | Shared families<br />
U00096 (present) | 4,321 | - | - | - | - | - | - | -<br />
U00096 (2004) | 4,254 | 4,172 | 82 | 109 | 1.02 | -4.07 | 0.97 | 93%<br />
Glimmer 3.02 | 4,476 | 4,174 | 302 | 125 | -0.6 | -24.09 | 0.97 | 87%<br />
GeneMark-S 2.6 | 4,377 | 4,207 | 170 | 90 | 1.94 | -20.17 | 0.98 | 91%<br />
EasyGene 1.2 | 4,056 | 4,017 | 39 | 256 | -0.28 | -19.07 | 0.94 | 91%<br />
Prodigal 1.1 | 4,332 | 4,200 | 132 | 97 | 0.54 | -20.07 | 0.98 | 92%<br />
Table 2.1: Performance of prokaryotic gene finders. An older GenBank record for E. coli K12<br />
(U00096, 2004) has been included, and the reference for all comparisons is the most recent record, shown<br />
at the top. The 3' and 5' off values give the number of base pairs by which a query coordinate lies<br />
downstream (positive number) or upstream (negative number) of the reference coordinate.<br />
The sensitivity is estimated by binary classification as $\frac{TP}{TP + FN}$,<br />
where $TP$ is the number of proteins<br />
shared between reference and query and $FN$ is the number of proteins unique to the reference, not found in<br />
the query. Calculating specificity (which requires a true negative count) is difficult, as it is hard<br />
to identify regions of the chromosome that with certainty do not contain protein coding genes<br />
(Larsen & Krogh, 2003). The rightmost column contains an estimate of the percentage of protein<br />
families shared between the query and the reference genome. The number is derived using the<br />
BLASTmatrix tool.<br />
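As a worked check of the sensitivity definition used in table 2.1, the short sketch below recomputes TP / (TP + FN) from the TP and FN counts printed in the table. The function and variable names are chosen here for illustration only.<br />

```python
def sensitivity(tp, fn):
    """Fraction of reference genes that the query also contains."""
    return tp / (tp + fn)


# (TP, FN) pairs copied from table 2.1.
counts = {
    "U00096 (2004)": (4172, 109),
    "Glimmer 3.02": (4174, 125),
    "GeneMark-S 2.6": (4207, 90),
    "EasyGene 1.2": (4017, 256),
    "Prodigal 1.1": (4200, 97),
}

table = {name: round(sensitivity(tp, fn), 2) for name, (tp, fn) in counts.items()}
for name, value in table.items():
    print(f"{name:15s} {value:.2f}")
```

The rounded values agree with the sensitivity column of the table.<br />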
2.2.5 Finding tRNA and rRNA genes<br />
The tool tRNAscan-SE (Lowe & Eddy, 1997) has been implemented in the CBS Genome<br />
Atlas Database Web Service, and it predicts tRNA genes in contigs or genomes:<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/trnascan.pl<br />
perl trnascan.pl < mapped.fsa > mapped.trna.fsa<br />
The RNAmmer method (Paper VI, chapter 3) can be used to consistently annotate<br />
rRNA genes in contigs and full genome sequences. This tool is implemented as a separate<br />
Web Service at CBS. Please refer to http://www.cbs.dtu.dk/ws/RNAmmer for full documentation.<br />
Listing 2.7 provides an example showing the usage of the RNAmmer<br />
client script.<br />
Listing 2.7: Running RNAmmer on a genome sequence<br />
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl<br />
wget http://www.cbs.dtu.dk/ws/RNAmmer/examples/rnammer.pl<br />
perl rnammer.pl bac < mapped.fsa > mapped.rrna.fsa<br />
2.3 Genome Comparisons<br />
The previous section described some initial steps for annotating a bacterial genome,<br />
which is required for further comparative studies. In this section, emphasis is placed<br />
on comparing annotated genomes, both at the proteome level and using meta-data.<br />
The tools presented here have all been used widely during course activities and research<br />
projects.<br />
[Figure 2.2 annotations: median; Q1, IQR, Q3; notch marking the 95% confidence interval; each whisker spans at most 1.5 x IQR and ends at an observed data point; mild outliers lie between 1.5 and 3.0 IQR, and extreme outliers more than 3 IQR, away from Q1 and Q3]<br />
Figure 2.2: Construction of a box-and-whiskers plot. Notches are an estimate of the 95% confidence<br />
interval.<br />
2.3.1 Box-and-whiskers plot<br />
As the number of sequenced bacterial genomes grew from only two in 1995 to close to a<br />
thousand at the time of writing, enough data became available to sample various genomic<br />
properties among the different phylogenetic groups. The box-and-whiskers plot (Tukey,<br />
1977) is a useful tool for visualizing such differences. The plot shows a box between the<br />
first and the third quartile (figure 2.2). The distance between Q1 and Q3 is called the<br />
interquartile range (IQR), and whiskers extend to the most extreme observations lying no more than<br />
1.5 × IQR from the box. A line is drawn within the box representing the median. Data between<br />
1.5 × IQR and 3.0 × IQR from the box are denoted "mild" outliers, whereas observations exceeding<br />
3.0 × IQR are extreme outliers. Notches are sometimes drawn to denote the confidence<br />
interval. In the R implementation of the box-and-whiskers plot, the 95% confidence interval<br />
of the median is approximated by $\pm 1.58 \times \mathrm{IQR} / \sqrt{N}$. When comparing two or more distributions, non-overlapping<br />
notches indicate that the medians differ significantly.<br />
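The quantities that make up the plot can be computed directly, as in the sketch below. It rests on a few assumptions: quartiles are taken with the "inclusive" interpolation of Python's statistics module, which can differ slightly from the hinges that R's boxplot uses, and the notch factor 1.58 is the value given in the R documentation for boxplot.stats. The function name box_stats and the sample data are hypothetical.<br />

```python
import math
import statistics


def box_stats(values):
    """Summarize a sample the way a box-and-whiskers plot does."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Inner fences bound the whiskers; outer fences separate mild from extreme outliers.
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    inside = [v for v in data if inner_lo <= v <= inner_hi]
    mild = [v for v in data
            if (outer_lo <= v < inner_lo) or (inner_hi < v <= outer_hi)]
    extreme = [v for v in data if v < outer_lo or v > outer_hi]
    # Approximate 95% confidence interval of the median (notch half-width).
    half_notch = 1.58 * iqr / math.sqrt(len(data))
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "mild": mild, "extreme": extreme,
        "notch": (median - half_notch, median + half_notch),
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 30])
print(stats)
```

For this sample the whiskers end at the observed values 1 and 9, while 18 falls in the mild and 30 in the extreme outlier range.<br />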
Distribution of genome size and base composition in prokaryotes<br />
To examine the base composition and genome size for different phylogenetic groups, the<br />
CBS Genome Atlas Database can be queried, grouping replicons into projects<br />
and summarizing or averaging within each project. Although this is only possible from within CBS,<br />
the commands are listed below.<br />
mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#', color), ord, sum(length), \<br />
  concat(organism_name, '/', segment_name, '/', genbank) \<br />
  from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph \<br />
  where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' \<br />
  and ph.phyla = p.grp group by s.pid" > length.tbl<br />
set N = `wc -l < length.tbl`<br />
~pfh/scripts/boxplot -main "Size distribution of Prokaryotic genomes (N = $N)" < length.tbl > length.ps<br />
mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#', color), ord, \<br />
  sum(atcontent * length) / sum(length), \<br />
  concat(organism_name, '/', segment_name, '/', genbank) \<br />
  from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph \<br />
  where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' \<br />
  and ph.phyla = p.grp group by s.pid" > atcontent.tbl<br />
~pfh/scripts/boxplot -main "AT content distribution of Prokaryotic genomes (N = $N)" < atcontent.tbl > atcontent.ps<br />
The tables generated by the MySQL queries can be read by the boxplot program, which<br />
is a Perl wrapper for the R command boxplot, and a PostScript document is generated.<br />
Figure 2.3 shows the total genome length (including all replicons) of all published prokaryotic<br />
genomes, divided into phyla. The confidence interval appears wide for many groups,<br />
reflecting a high intra-phylum variation. However, for a number of phyla the difference<br />
is significant. The β-proteobacteria tend to have longer chromosomes than, for example,<br />
the firmicutes, the α-proteobacteria, and the cyanobacteria. It is also evident that the<br />
δ-proteobacterium Sorangium cellulosum Soce56 has the longest genome (13,033,779<br />
nt, Schneiker et al. (2007)) but that this is an outlier not representative of the entire phylum.<br />
The shortest bacterial genome published so far is that of the α-proteobacterium Candidatus<br />
Hodgkinia cicadicola Dsem (143,795 nt, McCutcheon et al. (2009)). Thus, the difference<br />
between the smallest and the largest is about 90-fold. The plot in figure 2.4 shows the<br />
fraction of AT for the prokaryotic genomes, ranging from 25% for the δ-proteobacterium<br />
Anaeromyxobacter dehalogenans 2CP-C (Sanford et al., 2002) to 83% for Candidatus Carsonella<br />
ruddii PV (Nakabachi et al., 2006).<br />
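The per-project aggregation performed by the two queries (total length as the sum over replicons, AT content as a length-weighted mean) can be sketched as follows. The function name, the tuple layout, and the replicon values are hypothetical; only the two genome lengths in the final ratio are taken from the text.<br />

```python
from collections import defaultdict


def summarize_projects(replicons):
    """Map project id -> (total length in nt, length-weighted AT fraction).

    replicons: iterable of (project_id, length_nt, at_fraction) tuples.
    """
    total_len = defaultdict(int)
    at_bases = defaultdict(float)
    for pid, length, at_fraction in replicons:
        total_len[pid] += length
        at_bases[pid] += at_fraction * length
    return {pid: (total_len[pid], at_bases[pid] / total_len[pid])
            for pid in total_len}


# Hypothetical projects: one with a chromosome and a plasmid, one with a single replicon.
replicons = [
    ("project_1", 4_000_000, 0.50),
    ("project_1", 1_000_000, 0.60),
    ("project_2", 2_000_000, 0.70),
]
summary = summarize_projects(replicons)
print(summary)

# Ratio between the longest and the shortest genome quoted in the text.
fold = 13_033_779 / 143_795
print(f"{fold:.1f}-fold")
```

Weighting by length keeps a small, AT-rich plasmid from shifting the project average as much as an unweighted mean over replicons would.<br />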
2.3.2 heatmap - 2D clustering<br />
One way to visualize several genomic properties at once is a so-called<br />
heatmap, or 2D clustering. Instead of looking at a single property at a time (e.g.<br />
length or AT content), multiple features may be included in the same plot. The value axis is<br />
replaced with a color transformation of the data, and different normalization methods may<br />
be applied. In the example below, a comparison is made for 87 Enterobacteriaceae, covering<br />
among others the genera Escherichia, Salmonella, Yersinia, Shigella, Buchnera, and<br />
Klebsiella. The CBS Genome Atlas Database is queried for features such as tRNA and<br />
rRNA gene count, total coding genes, genome size, AT content, simple genomic repeats,<br />
local direct repeats, base pairs per gene, and coding fraction of the genome. The plot<br />
is shown in figure 2.5, and the R code for producing the plot is shown in listing<br />
2.8. The data have been normalized to allow for comparison. Features and organisms are<br />
hierarchically clustered to group organisms with similar properties and to group properties<br />
that correlate within the organisms.<br />
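Normalization is what makes features on very different scales (for example genome length in nucleotides and AT content as a fraction) comparable in one color scale. The text does not state which method was used, so the sketch below assumes a column-wise z-score purely for illustration; the function name and the three rows of values are hypothetical, and the hierarchical clustering step is not reproduced.<br />

```python
import statistics


def zscore_columns(rows):
    """Scale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - mean) / sd if sd > 0 else 0.0
         for value, mean, sd in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical values for three genomes: (LENGTH in nt, ATCONTENT as a fraction).
features = [
    (4_600_000, 0.49),
    (5_000_000, 0.48),
    (600_000, 0.74),
]
scaled = zscore_columns(features)
for row in scaled:
    print([round(v, 2) for v in row])
```

After scaling, the small AT-rich genome in the last row has a negative length score and a positive AT score, which is the kind of contrast a heatmap color key would show.<br />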
[Box plot: Size distribution of Prokaryotic genomes (N = 932); x-axis from 0.0e+00 to 1.2e+07; highlighted genera: E. coli, Buchnera, Salmonella, Yersinia. Groups: Crenarchaeota (n=23), Euryarchaeota (n=39), Nanoarchaeota (n=1), Acidobacteria (n=3), Actinobacteria (n=68), Aquificae (n=5), Bacteroidetes/Chlorobi (n=26), Chlamydiae/Verrucomicrobia (n=14), Chloroflexi (n=10), Cyanobacteria (n=36), Deinococcus-Thermus (n=5), Firmicutes (n=191), Fusobacteria (n=1), Planctomycetes (n=1), Alphaproteobacteria (n=114), Betaproteobacteria (n=70), Gammaproteobacteria (n=226), Deltaproteobacteria (n=29), Epsilonproteobacteria (n=25), Spirochaetes (n=18), Thermotogae (n=10), Other Archaea (n=1), Other Bacteria (n=16)]<br />
Figure 2.3: Genome size of all public prokaryotic genomes.<br />
[Box plot: AT content distribution of Prokaryotic genomes (N = 932); x-axis from 0.3 to 0.8; highlighted genera: E. coli, Salmonella, Buchnera, Yersinia]<br />
Figure 2.4: Average AT content of all public prokaryotic genomes.<br />
Listing 2.8: R code to generate a 2D clustering graphic<br />
library(gplots)<br />
postscript(file='output.ps')<br />
data<br />
[Heatmap columns: TRNA_SCAN_COUNT, LENGTH, NGENES, RNAMMER_SSU_COUNT, ATCONTENT, LOC_DIR_REPEAT, LOC_INV_REPEAT, SR_PERCENT, CODING_FRACTION, BPPRGENE; rows: the 87 genomes; color key: Value from -1 to 1]<br />
Figure 2.5: 2D-clustering showing 87 Enterobacteriaceae.<br />
1st | 2nd position: U | C | A | G | 3rd<br />
U | 56 Phe | 31 Ser | 41 Tyr | 12 Cys | U<br />
U | 2 Phe | 1 Ser | 2 Tyr | 1 Cys | C<br />
U | 79 Leu | 22 Ser | 3 Stop | 0 Stop | A<br />
U | 5 Leu | 1 Ser | 0 Stop | 8 Trp | G<br />
C | 7 Leu | 13 Pro | 17 His | 7 Arg | U<br />
C | 0 Leu | 1 Pro | 1 His | 0 Arg | C<br />
C | 5 Leu | 12 Pro | 25 Gln | 5 Arg | A<br />
C | 0 Leu | 2 Pro | 2 Gln | 0 Arg | G<br />
A | 79 Ile | 18 Thr | 75 Asn | 12 Ser | U<br />
A | 4 Ile | 1 Thr | 6 Asn | 1 Ser | C<br />
A | 51 Ile | 20 Thr | 131 Lys | 18 Arg | A<br />
A | 18 Met | 1 Thr | 6 Lys | 1 Arg | G<br />
G | 18 Val | 16 Ala | 33 Asp | 18 Gly | U<br />
G | 1 Val | 1 Ala | 2 Asp | 1 Gly | C<br />
G | 18 Val | 15 Ala | 41 Glu | 27 Gly | A<br />
G | 1 Val | 1 Ala | 2 Glu | 2 Gly | G<br />
Table 2.2: Codon usage in Buchnera aphidicola Cc. Frequencies are measured per thousand codons. A<br />
total of 354,219 base pairs are examined in 360 ORFs (5 ORFs rejected due to possible frame shifts).<br />
codons may be replaced to encode both identical and similar amino acids to adjust the<br />
overall base composition.<br />
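Per-thousand codon frequencies of the kind reported in table 2.2 can be computed along the following lines. This is a sketch under stated assumptions rather than the procedure used for the table: an ORF is rejected here simply when its length is not a multiple of three, which stands in for the unspecified frame shift check, and the function name and the three toy sequences are hypothetical.<br />

```python
from collections import Counter


def codon_usage_per_thousand(orfs):
    """Return codon -> frequency per 1,000 codons over all accepted ORFs."""
    counts = Counter()
    for orf in orfs:
        seq = orf.upper().replace("T", "U")
        if len(seq) % 3 != 0:
            # Assumption: a length that is not a multiple of 3 marks a possible frame shift.
            continue
        counts.update(seq[i:i + 3] for i in range(0, len(seq), 3))
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


# Toy sequences; the last one is rejected because of its length.
orfs = ["ATGAAAAAATTTTAA", "ATGAAATTATAA", "ATGAA"]
usage = codon_usage_per_thousand(orfs)
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 1))
```

By construction the frequencies sum to 1,000, as do the 64 entries of table 2.2.<br />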
2.3.4 CodonPlot: visualizing codon usage
A rose plot diagram (Ussery et al., 2004; Binnewies et al., 2006) may be used to give a
graphical representation of codon and amino acid usage. In the codon rose plot, all 64
codons are listed along the perimeter and the frequency of each codon is drawn on a radial
scale. The 64 codons are sorted in the order AUGC, first by the last letter (XX[AUGC]),
then by the second letter (X[AUGC]X), and finally by the first letter ([AUGC]XX). The
result is four quadrants, with codons ending in A or U in the right half and codons
ending in G or C in the left half. This gives an easy overview of biases in the third position.
For the amino acid rose plot, all 20 amino acids are drawn along the perimeter with their
frequencies shown radially. Here, the amino acids are grouped according to their chemical
properties. In addition to the rose plot, information content can be used to measure the
bias within each of the three positions of the codon. These codon analyses are shown in
figure 2.6 for three different enteric genomes: the AT-rich Buchnera aphidicola Cc (79.8%
AT), E. coli strain K-12 (49.2% AT), and the somewhat GC-rich Klebsiella pneumoniae
NTUH-K2044 (42.3% AT). The bias in B. aphidicola is striking, with a strong preference for A
and U at the third position. This bias results in a periodic fluctuation of AT content
when all open reading frames (ORFs) are aligned at the translation start and 400
base pairs are extracted up- and downstream, as shown in figure 2.7. The red line represents
a 3-point running average, which quickly approaches zero in the coding region. Gray lines
represent the raw average values.
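The codon ordering and the per-position bias described above can be sketched in a few lines of Python. This is a minimal illustration, not the CodonPlot implementation: the function names are hypothetical, and the information content is assumed to take the plain Shannon form I = 2 - H (in bits, without any small-sample correction), which the text does not specify.

```python
from collections import Counter
from itertools import product
from math import log2

ORDER = "AUGC"  # letter order stated in the text


def rose_plot_codon_order():
    """Return the 64 codons sorted by 3rd, then 2nd, then 1st letter."""
    # product() varies its first factor slowest, so the third codon letter
    # becomes the outermost sort key and the first letter the innermost.
    return [first + second + third
            for third, second, first in product(ORDER, repeat=3)]


def position_information(codon_counts):
    """Information content in bits at codon positions 1, 2 and 3.

    Assumption: I = 2 - H, where H is the Shannon entropy of the
    nucleotide distribution observed at that position.
    """
    total = sum(codon_counts.values())
    info = []
    for pos in range(3):
        letters = Counter()
        for codon, count in codon_counts.items():
            letters[codon[pos]] += count
        entropy = -sum((n / total) * log2(n / total)
                       for n in letters.values() if n)
        info.append(2.0 - entropy)
    return info


if __name__ == "__main__":
    order = rose_plot_codon_order()
    print(order[:4])  # first codons of the quadrant ending in A
    print(position_information({"AAA": 1, "UUA": 1, "GGA": 1, "CCA": 1}))
```

With this ordering the first 32 codons end in A or U and the last 32 end in G or C, which corresponds to the two halves of the rose plot. A uniform codon distribution gives zero bits at every position, while a genome that only used codons ending in A would reach the maximum of two bits at the third position.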
[Figure 2.6: nine panels. Panels (a), (d), (g): amino acid usage rose plots; panels (b), (e), (h): codon usage rose plots; panels (c), (f), (i): nucleotide bias at codon positions 1, 2 and 3, in bits. Genomes: Buchnera_aphidicola_Cc (a to c), Ecoli_K12 (d to f), Klebsiella_pneumoniae_NTUH-K2044 (g to i). Radial axis of the rose plots: frequency.]
Figure 2.6: Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella
pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 (49.2% AT). Rightmost column shows the
nucleotide bias of the three codon positions.
1st   2nd position                               3rd
      U         C         A          G
U     19 Phe     4 Ser    14 Tyr      3 Cys      U
U     19 Phe    11 Ser    13 Tyr      8 Cys      C
U      6 Leu     4 Ser     2 Stop     1 Stop     A
U      7 Leu    12 Ser     0 Stop    16 Trp      G
C      8 Leu     5 Pro    12 His     13 Arg      U
C     16 Leu     8 Pro    11 His     31 Arg      C
C      3 Leu     4 Pro     7 Gln      3 Arg      A
C     72 Leu    30 Pro    38 Gln     10 Arg      G
A     20 Ile     5 Thr    12 Asn      4 Ser      U
A     33 Ile    31 Thr    22 Asn     22 Ser      C
A      3 Ile     3 Thr    24 Lys      2 Arg      A
A     27 Met    13 Thr    13 Lys      1 Arg      G
G     10 Val    10 Ala    26 Asp     13 Gly      U
G     21 Val    44 Ala    24 Asp     43 Gly      C
G      7 Val     8 Ala    27 Glu      6 Gly      A
G     33 Val    43 Ala    27 Glu     14 Gly      G
Table 2.3: Codon usage in Klebsiella pneumoniae NTUH-K2044. Frequencies are measured per
thousand. A total of 4,697,097 base pairs are examined in 5,006 ORFs.
[Figure 2.7: plot titled "Buchnera_aphidicola_Cc: AT content". X-axis: distance from translation start, −400 to 400. Y-axis: Z-score, −2.0 to 0.0.]
Figure 2.7: AT content profile 400 bp upstream and downstream of annotated translation starts in
Buchnera aphidicola Cc.
Figure 2.8: Deamination of cytosine (C) into uracil (U)
2.3.5 Base composition and DNA repair
Klebsiella is often found in plant products, on root surfaces and living trees, on fresh
vegetables, and in foods with a high content of sugars and acids, such as frozen orange
juice concentrate. Klebsiella pneumoniae can cause urinary tract infections, and the
NTUH-K2044 strain was isolated from a patient with liver abscess and meningitis. The
ecological niches in which Klebsiella lives are diverse, but they share the property of
being rich in energy and nitrogen.
Nitrogen-fixing aerobic bacteria are known to have higher chromosomal GC content (McEwan
et al., 1998), which is explained by the nitrogen required to replicate the chromosome: an
AT base pair contains 7 nitrogen atoms whereas a GC pair contains 8 nitrogen atoms.
Cytosine is prone to mutation caused by spontaneous deamination into uracil
(Visnes et al., 2009) (figure 2.8). In E. coli the two enzymes uracil N-glycosylase and
apurinic (AP) endonuclease are responsible for the repair of this mutation. However, in
Buchnera aphidicola Cc, which has a small, reduced genome, these two enzymes are absent
(confirmed by protein BLAST). Negative selection is likely to act against organisms that
combine a high chromosomal GC content with the lack of a functional repair mechanism. Hence,
the base composition of a bacterial genome is by no means random, and adjusting the overall
GC content through evolution may be yet another way to adapt to the environment.
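The nitrogen argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it simply averages the 7 and 8 nitrogen atoms per AT and GC pair quoted above, weighted by the AT fractions given for the three genomes in section 2.3.4.

```python
N_PER_AT_PAIR = 7  # adenine (5 N) + thymine (2 N)
N_PER_GC_PAIR = 8  # guanine (5 N) + cytosine (3 N)


def nitrogen_per_base_pair(at_fraction):
    """Average number of nitrogen atoms per base pair for a given AT fraction."""
    if not 0.0 <= at_fraction <= 1.0:
        raise ValueError("at_fraction must lie between 0 and 1")
    return N_PER_AT_PAIR * at_fraction + N_PER_GC_PAIR * (1.0 - at_fraction)


if __name__ == "__main__":
    # AT fractions as quoted in the text for the three genomes.
    genomes = {
        "Buchnera aphidicola Cc": 0.798,
        "E. coli K-12": 0.492,
        "Klebsiella pneumoniae NTUH-K2044": 0.423,
    }
    for name, at in genomes.items():
        print(f"{name}: {nitrogen_per_base_pair(at):.3f} N atoms per base pair")
```

Under these assumptions the difference between the AT-rich Buchnera genome and the GC-rich Klebsiella genome is a little under 0.4 nitrogen atoms per base pair, a small amount per pair that accumulates over millions of base pairs.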
2.3.6 BLASTmatrix - proteome comparison
The BLASTmatrix tool allows for visualization of proteome similarity between larger
numbers of organisms. For each of the pairwise combinations of proteomes, a BLAST
is performed. Two proteins are declared homologous when at least 50% of the protein is
aligned and at least 50% of the residues within the alignment are conserved. For a report
of proteome A against proteome B, all homologous proteins are then grouped into families,
and the similarity between A and B is calculated as the number of families in which both
organism A and organism B are represented. The BLAST report is cached, based on MD5
checksums of the proteomes. This enables the tool to efficiently reuse previous results
when organisms are added to a comparison. The procedure is repeated for all
$\sum_{j=1}^{N} j = N(N+1)/2$ combinations of the N proteomes (each proteome against itself
included), and for each combination a square is drawn containing the following information:
the similarity as a percentage of all families of A and B, the number of shared families,
and the total number of families. A small example matrix is shown in figure 2.9. The
percentage is used to color-code the square to allow for easier overview of larger comparisons.
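The homology criterion and the family-based similarity described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BLASTmatrix code: all function names are hypothetical, the 50% thresholds are treated as inclusive minimums, and families are formed by single-linkage grouping (a union-find over homologous pairs), which the text does not specify.

```python
def is_homologous(aligned_fraction, conserved_fraction, threshold=0.5):
    """50/50 rule: enough of the protein aligned, enough residues conserved."""
    return aligned_fraction >= threshold and conserved_fraction >= threshold


def group_into_families(organism_of, homolog_pairs):
    """Group proteins into families (assumed: single linkage via union-find)."""
    parent = {protein: protein for protein in organism_of}

    def find(protein):
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]  # path halving
            protein = parent[protein]
        return protein

    for a, b in homolog_pairs:
        parent[find(b)] = find(a)

    families = {}
    for protein in organism_of:
        families.setdefault(find(protein), set()).add(protein)
    return list(families.values())


def proteome_similarity(organism_of, homolog_pairs):
    """Return (shared families, total families, percentage shared)."""
    families = group_into_families(organism_of, homolog_pairs)
    shared = sum(1 for family in families
                 if len({organism_of[p] for p in family}) > 1)
    total = len(families)
    percent = 100.0 * shared / total if total else 0.0
    return shared, total, percent


if __name__ == "__main__":
    # Toy data: three proteins from organism A, two from organism B.
    organism_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
    pairs = [("a1", "b1"), ("a1", "a2")]  # pairs that passed is_homologous()
    print(proteome_similarity(organism_of, pairs))
```

In the toy data the proteins a1, a2 and b1 form one family containing both organisms, while a3 and b2 remain singletons, so one of three families is shared. The same ratio reproduces the cells of the figures: 3,843 shared families out of 4,034 corresponds to 95.3%.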
The software requires a configuration in XML as its first argument. In appendix D.4
a Perl script is provided which automatically constructs a configuration that compares
all published Campylobacter proteomes by querying the Genome Atlas Database. The
output of this BLASTmatrix configuration is shown in figure 2.10.
The software has been used in different publications (Binnewies et al., 2005, 2006) and
has been updated a number of times since. The older versions contained both BLAST
directions and showed the number of shared proteins, which made the diagram redundant. The
recent version avoids this by instead plotting the shared families, which renders the plot
symmetrical across the diagonal. This allows the lower triangle to be removed.
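The pairwise scheme of this section, with the lower triangle removed and BLAST reports reused through MD5 checksums, can be sketched as below. The function names, the in-memory dictionary used as a cache, and the `run_blast` callback are hypothetical stand-ins; the actual tool's cache layout and BLAST invocation are not described in the text.

```python
import hashlib
from itertools import combinations_with_replacement


def proteome_digest(fasta_text):
    """MD5 checksum of a proteome, used to identify it in the cache."""
    return hashlib.md5(fasta_text.encode("utf-8")).hexdigest()


def comparison_pairs(names):
    """Upper triangle plus diagonal: N(N+1)/2 combinations for N proteomes."""
    return list(combinations_with_replacement(names, 2))


def cached_report(cache, query_fasta, subject_fasta, run_blast):
    """Return a cached BLAST report, running BLAST only for unseen input."""
    key = (proteome_digest(query_fasta), proteome_digest(subject_fasta))
    if key not in cache:
        cache[key] = run_blast(query_fasta, subject_fasta)
    return cache[key]


if __name__ == "__main__":
    print(len(comparison_pairs(range(3))))   # three E. coli proteomes
    print(len(comparison_pairs(range(10))))  # ten Campylobacter proteomes
    print(len(comparison_pairs(range(32))))  # 32 Vibrionaceae proteomes
```

Because the key depends only on the content of the two proteomes, adding an organism to an existing comparison triggers BLAST runs only for the combinations that involve the new proteome.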
[Figure 2.9: BLASTmatrix of three Escherichia coli strain K-12 proteomes: substrain DH10B (4,126 proteins, 3,797 families), substrain W3110 (4,226 proteins, 3,965 families), and substrain MG1655 (4,150 proteins, 3,912 families). Cells between proteomes: 95.3% (3,843 / 4,034), 93.1% (3,742 / 4,020), 91.5% (3,685 / 4,027). Cells within proteomes: DH10B 6.4% (242 / 3,797), W3110 4.3% (170 / 3,965), MG1655 4.3% (167 / 3,912).]
Figure 2.9: Construction of the BLASTmatrix diagram. Proteome similarity between three E.
coli genomes. Lower part of the diagram corresponds to intra-proteome similarity.
[Figure 2.10: BLASTmatrix of ten Campylobacter proteomes: C. concisus 13826 (2,080 proteins, 1,972 families); C. curvus 525.92 (1,931 proteins, 1,885 families); C. fetus subsp. fetus 82-40 (1,719 proteins, 1,665 families); C. hominis ATCC BAA-381 (1,687 proteins, 1,623 families); C. jejuni RM1221 (1,838 proteins, 1,780 families); C. jejuni subsp. doylei 269.97 (1,731 proteins, 1,650 families); C. jejuni subsp. jejuni 81-176 (1,758 proteins, 1,702 families); C. jejuni subsp. jejuni 81116 (1,626 proteins, 1,585 families); C. jejuni subsp. jejuni NCTC 11168 (1,624 proteins, 1,581 families); C. lari RM2100 (1,546 proteins, 1,494 families). Color scales: homology between proteomes, 21.2% to 84.7%; homology within proteomes, 1.5% to 4.0%.]
Figure 2.10: Proteome similarity between ten Campylobacter species. Color encoding corresponds
to percentage of shared protein families.
[Figure 2.11: BLASTmatrix of 32 Vibrionaceae proteomes: A. salmonicida LFI1238; V. species Ex25; V. campbellii AND4; V. harveyi BAA1116; V. shilonii AK1; P. profundum SS9; V. parahaemolyticus 2210633 and 16; V. vulnificus CMCP6 and YJ016; V. species MED222; V. splendidus LGP32; V. fischeri ES114 and MJ11; V. cholerae MO10, BX330286, RC9, MJ1236, B33VCE, 2740-80, AM-19226, MZO-2, 12129, TM11079-80, TMA21, VL426, 1587, N16961, 0395 TEDA, 0395 TIGR, V52, and M66-2. Color scales: homology between proteomes, 30.0% to 90.0%; homology within proteomes, 0.0% to 6.0%.]
Figure 2.11: Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains
lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae
strain genomes are shown in dark green.
Large similarities between environmental and pathogenic V. cholerae<br />
The BLAST matrix shown in figure 2.11 includes environmental and pathogenic strains<br />
of V. cholerae. The figure shows that, both within and between these two groups, the V. cholerae<br />
strains share a large number of genes.<br />
Intra- vs. inter-proteome similarity<br />
The lower row of the diagram shows the special case of an organism compared against itself, that is,<br />
the intra-proteome similarity. If not dealt with separately, this part would appear<br />
as 100% similar, since the proteome is BLASTed against itself. Therefore, all self-matching<br />
proteins are excluded, leaving this part to reflect the paralogs of the organism. This<br />
part also has a separate color encoding (red), whereas the inter-proteome comparison is coded<br />
green (see figure 2.10).<br />
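The exclusion of self-matches described above can be illustrated with a minimal Python sketch. It assumes that the intra-proteome ratio counts proteins with at least one hit to a different protein of the same proteome; the function name and the input format (pairs of query and subject identifiers) are hypothetical and not taken from the BLASTmatrix code.<br />

```python
def intra_proteome_similarity(hits, n_proteins):
    """Return (count, fraction) of proteins with at least one non-self hit.

    hits       -- iterable of (query_id, subject_id) pairs from a proteome
                  BLASTed against itself (hypothetical input format)
    n_proteins -- total number of proteins in the proteome
    """
    # Self-matches (query == subject) are dropped; what remains are
    # candidate paralogs.
    with_paralog = {query for query, subject in hits if query != subject}
    return len(with_paralog), len(with_paralog) / n_proteins


hits = [("a", "a"), ("a", "b"), ("b", "a"), ("c", "c")]
count, fraction = intra_proteome_similarity(hits, n_proteins=4)
print(count, f"{fraction:.1%}")
```

Without the self-match filter every protein would count and the fraction would always be 100%. As a matter of arithmetic, 243 hits out of 4,897 proteins gives 5.0%.<br />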
2.3.7 BLASTatlas - visualizing whole-genome homology<br />
The BLASTmatrix tool described earlier condenses the similarity between two proteomes<br />
into a single number. This simplification allows for an all-against-all comparison, but lacks<br />
detailed information on which genes are conserved and where they are located. The BLASTatlas<br />
method overcomes these issues by comparing the proteomes to a single reference chromosome.<br />
When a single representative chromosome has been selected, all ORFs or proteins<br />
of that reference are BLASTed against each of the proteomes to be included in the comparison.<br />
The best alignment for each proteome, regardless of its significance, is<br />
mapped back to the reference genome. A numerical value of zero is mapped at mismatches<br />
or gaps, 0.5 at conservative mismatches, and one at matches. This method has<br />
proved powerful because it answers several questions in one diagram: Which reference<br />
proteins are found in which query genomes? How well are they conserved? And is there<br />
any correlation between the conservation of neighboring genes, such as within larger genomic<br />
islands? Figure 2.12 depicts the remapping of a protein-protein alignment back to<br />
the reference genome.<br />
Figure 2.12: Mapping of a pairwise alignment to a reference genome. Mismatches, conservative<br />
mismatches and perfect matches contribute 0.0, 0.5, and 1.0 to the overall map, respectively. Gaps<br />
within the reference protein, corresponding to missing features of the reference protein, cannot be<br />
mapped and are hence excluded.<br />
Figure 2.13: Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds<br />
to a pairwise comparison against the reference chromosome.<br />
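The mapping of an alignment onto the reference (0.0 for mismatches and gaps, 0.5 for conservative mismatches, 1.0 for matches) can be sketched in Python as follows. This is an illustration only, not the BLASTatlas implementation: it works at the protein level, and it assumes a BLAST-style midline in which a letter marks an identity, "+" a conservative substitution and a space a mismatch or gap. The function and parameter names are hypothetical.<br />

```python
def map_alignment(ref_aln, midline, ref_start, ref_len):
    """Map one pairwise alignment onto reference protein coordinates.

    ref_aln   -- aligned reference sequence, '-' marks a gap in the reference
    midline   -- BLAST-style midline of the same length as ref_aln
    ref_start -- 0-based position of the first aligned reference residue
    ref_len   -- length of the reference protein
    """
    scores = [0.0] * ref_len          # unaligned residues stay at 0.0
    pos = ref_start
    for ref_char, mid_char in zip(ref_aln, midline):
        if ref_char == "-":
            continue                  # gap in the reference: nothing to map
        if mid_char == "+":
            scores[pos] = 0.5         # conservative mismatch
        elif mid_char != " ":
            scores[pos] = 1.0         # identical residue
        pos += 1                      # mismatch or subject gap keeps 0.0
    return scores


# Reference MKVLA aligned to a subject with one insertion and one deletion.
print(map_alignment("MKV-LA", "M+V L ", ref_start=0, ref_len=6))
```

In the sketch the gap column in the reference is skipped, so the returned list always has the length of the reference protein, in line with the exclusion of reference gaps noted for figure 2.12.<br />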
The result of the mapping step is a list of the same length as the reference genome. BLASTatlas<br />
then uses the GeneWiz software (Pedersen et al., 2000) to visualize this numerical<br />
data. GeneWiz applies smoothing, and each bin is then encoded into a color representation<br />
on a scale that is either fixed or dynamic, the latter given as n standard deviations around the average. Each<br />
genome included in the comparison is plotted as an individual track. The tool is offered<br />
as a Web Service (see chapter 4). A general client script can be obtained from the online<br />
documentation at http://www.cbs.dtu.dk/ws/BLASTatlas. The client script produces<br />
a PostScript plot as output. The next sections provide examples demonstrating<br />
the flexibility of the tool.<br />
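The difference between a fixed and a dynamic color scale can be illustrated with a short Python sketch. The smoothing and scaling that GeneWiz actually uses are not specified in this text, so the moving average and the linear scaling below are assumptions, and all function names are hypothetical.<br />

```python
from statistics import mean, pstdev


def smooth(values, window):
    """Moving average; the window is truncated at both ends of the list."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def color_fraction(values, fixed=None, n_dev=3.0):
    """Scale values to [0, 1], where 0 and 1 are the ends of the color ramp.

    fixed -- (low, high) for a fixed scale; if None, a dynamic scale of
             n_dev standard deviations around the average is used
    """
    if fixed is not None:
        low, high = fixed
    else:
        avg, dev = mean(values), pstdev(values)
        low, high = avg - n_dev * dev, avg + n_dev * dev
    if high == low:
        return [0.0 for _ in values]
    return [min(1.0, max(0.0, (v - low) / (high - low))) for v in values]


track = smooth([0.0, 0.0, 3.0, 0.0, 0.0], window=3)
print(track)
print(color_fraction([0.0, 0.4, 0.8, 1.0], fixed=(0.0, 0.8)))
```

A fixed scale such as 0.0 to 0.8 keeps tracks comparable with each other, whereas a dynamic scale adapts to the spread of the values within a single track.<br />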
Gene loss in Burkholderia species<br />
A comparative study aimed at mapping pathogenicity islands or gene losses among different<br />
bacterial genomes can benefit from the graphical representation provided by the BLASTatlas<br />
method. The genus Burkholderia covers a number of important animal and<br />
human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection<br />
in cystic fibrosis (CF) patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei,<br />
rarely gives rise to disease in humans (Brett et al., 1998; Smith et al., 1997). All<br />
publicly available and fully sequenced Burkholderia genomes are compared to chromosomes<br />
I and II of B. pseudomallei 1710b. The code listing below describes how the comparison<br />
was made, and it demonstrates the flexibility of the tool, as it allows for easy automation<br />
by reading simple configuration files, in this case generated by a MySQL query. The<br />
output configuration file is listed in appendix D.3.<br />
# let mysql construct the blast configuration file<br />
mysql --raw -B -N -e 'select concat("legend:",replace(organism_name,"Burkholderia","B."),"\nprogram:blastp\ncolor:",if(organism_name like "%pseudomal%","101010_000009",if(organism_name like "%mallei%","101010_000900",if(organism_name like "%cenocep%","101010_080000",if(organism_name like "%ambi%","101010_020002",if(organism_name like "%thailand%","101010_000900","101010_050505"))))),"\nrange:0.0,0.8\nsource:files/",pid,".fsa\n") from genomeatlas3_cur.genbank_complete_prj where organism_name like "burkhold%" and organism_name not like "%1710b%" order by organism_name;' > blast.cfg<br />
# copy genbank files of chr I and II<br />
foreach acc ( CP000124 CP000125 )<br />
  cp /home/databases/genomeatlasdb-3.0_cur/data/$acc/$acc.gbk .<br />
  saco_convert -I genbank -O annotation $acc.gbk > $acc.ann<br />
  saco_extract -I genbank -O fasta -t $acc.gbk > $acc.proteins.fsa<br />
  saco_convert -I genbank -O fasta $acc.gbk > $acc.fsa<br />
end<br />
<br />
# run the BLASTatlas client script on both chromosomes<br />
perl BLASTatlas -modus circle -ref CP000124.fsa -proteins CP000124.proteins.fsa -ann CP000124.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" -title "B. pseudomallei 1710b, chr I" > burkholderia_chrI.ps<br />
perl BLASTatlas -modus circle -ref CP000125.fsa -proteins CP000125.proteins.fsa -ann CP000125.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" -title "B. pseudomallei 1710b, chr II" > burkholderia_chrII.ps<br />
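For readers who prefer not to build the configuration inside a SQL string, the logic of the MySQL query above can be sketched in Python. The field names (legend, program, color, range, source), the color codes and the rule order are copied from the query; the function names, the input format (organism name and project id pairs) and the example id are hypothetical, and the authoritative file format is the one listed in appendix D.3.<br />

```python
# Color rules copied from the MySQL query; the first matching rule wins,
# so "pseudomal" must be tested before "mallei".
COLOR_RULES = [
    ("pseudomal", "101010_000009"),
    ("mallei", "101010_000900"),
    ("cenocep", "101010_080000"),
    ("ambi", "101010_020002"),
    ("thailand", "101010_000900"),
]
DEFAULT_COLOR = "101010_050505"


def track_entry(organism_name, pid):
    """Build the configuration entry for one BLASTatlas track."""
    lowered = organism_name.lower()
    color = next((c for key, c in COLOR_RULES if key in lowered), DEFAULT_COLOR)
    legend = organism_name.replace("Burkholderia", "B.")
    return (f"legend:{legend}\nprogram:blastp\ncolor:{color}\n"
            f"range:0.0,0.8\nsource:files/{pid}.fsa\n")


def build_config(rows):
    """rows: (organism_name, pid) pairs; the reference strain 1710b is skipped."""
    kept = sorted(r for r in rows if "1710b" not in r[0].lower())
    # One entry per row, each followed by the newline that ends the row.
    return "".join(track_entry(name, pid) + "\n" for name, pid in kept)


# The project id 123 is made up for illustration.
print(track_entry("Burkholderia mallei SAVP1", 123))
```

Keeping the rules in an ordered list makes the precedence explicit, which the nested if() calls in the query express only implicitly.<br />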
The plots of the two chromosomes are shown in figure 2.14. The other B. pseudomallei<br />
genomes are evident as three dark blue tracks, representing high homology within the<br />
species. Both B. thailandensis and B. mallei display large chromosomal deletions<br />
when compared to B. pseudomallei. However, the more scattered nature of the gene loss<br />
observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through<br />
the loss of larger regions (Ong et al., 2004). These deletions are evident from the atlases<br />
shown in figure 2.14, which also show a strong preference for deletions in chromosome<br />
II. Ong and co-workers report that deletions in chromosome II account for 70% and 61%<br />
of the total gene loss in B. mallei and B. thailandensis, respectively.<br />
The Alcanivorax phylome BLASTatlas<br />
Tracks on the BLASTatlas are not limited to single genomes or proteomes. The sequence files<br />
specified for a given track are converted into a BLAST database, and the reference genome is<br />
searched against the database of each track. A track may therefore just as well be<br />
a collection of genomes, entire phyla or even SwissProt. In Paper III a ‘phylome’ atlas<br />
was constructed for the oil-degrading marine bacterium Alcanivorax borkumensis (Reva<br />
et al., 2008). Here, tracks were constructed by collecting all proteins of all published bacterial<br />
genomes, all proteobacteria, and all γ-, α-, β-, δ-, and ε-proteobacteria (see figure 2.15). The<br />
phylome atlas reveals no or very few homologs in δ- and ε-proteobacteria and some homologs<br />
in α- and β-proteobacteria, whereas the highest sequence homology was identified among<br />
γ-proteobacteria.<br />
[BLASTatlas plots of B. pseudomallei 1710b, chr I (4,126,292 bp) and chr II (3,181,762 bp). One BLAST track per genome, each with range 0.00 to 0.80: B. ambifaria AMMD, B. ambifaria MC40-6, B. cenocepacia AU 1054, B. cenocepacia HI2424, B. cenocepacia J2315, B. cenocepacia MC0-3, B. glumae BGR1, B. mallei ATCC 23344, B. mallei NCTC 10229, B. mallei NCTC 10247, B. mallei SAVP1, B. multivorans ATCC 17616, B. phymatum STM815, B. phytofirmans PsJN, B. pseudomallei 1106a, B. pseudomallei 668, B. pseudomallei K96243, B. sp. 383, B. thailandensis E264, B. vietnamiensis G4, B. xenovorans LB400. Additional tracks: annotations (CDS +, CDS -, rRNA, tRNA), Percent AT (0.21 to 0.42) and GC Skew (-0.09 to 0.09). Resolution: 1273.]<br />
Figure 2.14: Comparison of B. pseudomallei 1710b chromosome I and II against all public<br />
Burkholderia genomes.<br />
[Phylome atlas of A. borkumensis (3,120,143 bp). BLAST tracks: Bacteria, Proteobacteria and gamma (range 0.00 to 0.50); alpha, beta, delta and epsilon (range 0.00 to 0.30). Additional tracks: annotations (CDS +, CDS -, rRNA, tRNA) and Percent AT (0.40 to 0.51). Resolution: 1249.]<br />
Figure 2.15: A phylome atlas of Alcanivorax borkumensis, comparing the proteome against all γ-,<br />
α-, β-, δ-, and ε-proteobacteria available at the time of publishing.<br />
[Plot of strain and species counts (axis 0 to 50) for the genera Streptococcus, Escherichia, Bacillus, Clostridium, Burkholderia, Mycobacterium, Candidatus, Staphylococcus, Shewanella and Mycoplasma.]<br />
Figure 2.16: Count of genomes and species divided by genera. Source: CBS Genome Atlas<br />
Database as of 2009-09-11.<br />
2.3.8 CorePlot - plotting the core- and pan-genomes of species<br />
There are a number of bacterial genera for which numerous strains and species are fully<br />
sequenced. Streptococcus (43 strains), Escherichia (29 strains), and Bacillus (25 strains)<br />
are the most highly represented genera among the Bacteria (Genome Atlas Database,<br />
2009-09-11). Figure 2.16 shows the genome and species counts of the 10 most sampled<br />
genera. The increased depth at which bacterial genera are sequenced has previously been<br />
used to estimate the core- and pan-genome by fitting an exponentially decaying function.<br />
An often used approach is to perform either a limited or a full permutation of the genome<br />
order (Lefebure & Stanhope, 2007; Tettelin et al., 2005). This provides an error estimate<br />
for every step at which a genome is added. An alternative method was developed during the Ph.D.<br />
project, which derives the protein families by grouping homologous proteins but uses<br />
a fixed order of genomes. Homologs are identified by pairwise protein BLAST between<br />
proteomes, followed by a grouping of all significant alignments (50% alignment length and<br />
50% conservation within the alignment). The method can re-use cached BLAST reports<br />
from the BLASTmatrix method. The example below uses the same proteome files as<br />
were generated in the BLASTmatrix example (section 2.3.6 and appendix D.4), and it<br />
demonstrates how a MySQL query can be used as configuration for the CorePlot program.<br />
mysql -N -B -e "select organism_name, concat(pid,'.proteins.fsa') from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name" > table.dat<br />
perl ~pfh/scripts/coregenome/coregenome-2.3 < table.dat > core.ps<br />
Both the BLASTmatrix and the coregenome scripts access the same MySQL caching<br />
databases, so the user does not have to worry about how results are cached and shared<br />
between the two programs. Figure 2.17 shows the core- and pan-genome plot<br />
generated by the program.<br />
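The grouping of significant alignments into protein families can be sketched in Python with a union-find structure. This is an illustration under stated assumptions, not the coregenome script: the text gives the thresholds (50% alignment length, 50% conservation) but not the reference length, so the sketch assumes the alignment must cover half of the longer protein, and it treats conservation as the fraction of identical positions. All names are hypothetical.<br />

```python
def is_significant(aln_len, identities, query_len, subject_len):
    """50/50 rule; measuring against the longer protein is an assumption."""
    longest = max(query_len, subject_len)
    return aln_len >= 0.5 * longest and identities >= 0.5 * aln_len


def group_families(proteins, hits):
    """Single-linkage grouping of proteins connected by significant hits.

    proteins -- iterable of protein ids
    hits     -- iterable of (query_id, subject_id) pairs already filtered
                with is_significant
    """
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    for query, subject in hits:
        parent[find(query)] = find(subject)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())


families = group_families(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(sorted(sorted(f) for f in families))
```

With single linkage, proteins a and c end up in the same family through b even without a direct hit between them; proteins without any significant hit form families of size one.<br />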
By using a fixed genome order, it is possible to compare multiple species within the<br />
same plot and to reveal varying slopes of the pan- and core-genome graphs. Figure 2.17<br />
shows that the first 5 strains come from distinct species, giving rise to a steep increase<br />
of the pan genome and a reduction of the core genome. The following four genomes are<br />
further C. jejuni strains, and the curves appear to flatten out at a core size of 600 proteins and a pan-genome size of 5,200<br />
proteins. Figure 2.18 shows a larger core- and pan-genome plot for Vibrio species<br />
(paper IV).<br />
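How the core and pan genome develop along a fixed genome order can be sketched in Python. The sketch assumes that each genome has already been reduced to a set of protein family identifiers; the function name and the toy data are hypothetical and do not reproduce the CorePlot program.<br />

```python
def core_pan_curve(genomes):
    """Accumulate core and pan genome sizes in the given, fixed order.

    genomes -- list of (name, set_of_family_ids)
    Returns one (name, new_families, core_size, pan_size) row per genome.
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new = families - pan                  # families not seen so far
        pan |= families                       # pan genome: union
        core = set(families) if core is None else core & families  # core: intersection
        rows.append((name, len(new), len(core), len(pan)))
    return rows


toy = [("g1", {1, 2, 3}), ("g2", {2, 3, 4}), ("g3", {3, 5})]
for row in core_pan_curve(toy):
    print(row)
```

Because the order is fixed, a change of species shows up directly as a jump in the number of new families, at the cost of the per-step error estimate that the permutation approaches provide.<br />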
[Core- and pan-genome plot, axis 0 to 7000. Series: New genes, New gene families, Core genome, Pan genome. Genomes in plotting order: 1: Campylobacter concisus 13826; 2: Campylobacter curvus 525.92; 3: Campylobacter fetus subsp. fetus 82-40; 4: Campylobacter hominis ATCC BAA-381; 5: Campylobacter jejuni RM1221; 6: Campylobacter jejuni subsp. doylei 269.97; 7: Campylobacter jejuni subsp. jejuni 81-176; 8: Campylobacter jejuni subsp. jejuni 81116; 9: Campylobacter jejuni subsp. jejuni NCTC 11168; 10: Campylobacter lari RM2100.]<br />
Figure 2.17: Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently<br />
available, there seems to exist an equilibrium at close to 600 protein families.<br />
[Figure 2.18 reproduces a page excerpt from paper IV, with the following text and plot.]<br />
pan-genome (blue line) increases, and the number of conserved gene families (red<br />
line) in the core genome decreases, albeit at a lower rate. This is because every<br />
genome can add many novel (and frequently different) genes to the pan-genome but<br />
only decreases the core genome with a few genes that are absent in that particular<br />
strain but that were conserved in the previously given genomes. The pan-genome<br />
curve increases with a relative steep slope when a novel species is added, as is<br />
obvious when one V. parahaemolyticus genome is added after the 18th V. cholerae. A<br />
stable plateau can be seen for pan genome of the V. cholerae genomes around 6500<br />
genes, whereas the core genome steadily decreases to approximately 1000 genes for<br />
these 32 genomes. A. salmonicida, although not a member of the Vibrio genus, does<br />
not add significantly more genes to the pan genome than the other Vibrio species do, in<br />
contrast to P. profundum which produces a sharp increase in the pan genome, as does,<br />
interestingly, V. shilonii. Note that there are approximately 20,000 total gene families<br />
within the 30 sequenced Vibrionaceae genomes.<br />
In fact, the small jump seen in the pan genome of V. cholerae when adding the 11th<br />
genome (figure 3) is caused by the difference between the two subclusters of V.<br />
cholerae seen in the pan-genome family tree (figure 2). Note that the 10th strain (V.<br />
cholerae 2740-80) behaves as an outlier in all the figures shown; although documented<br />
as an environmental isolate, this appears closer to the clinical isolates, in terms of<br />
overall genomic properties.<br />
[Core- and pan-genome plot, axis 0 to 25000. Series: Pan genome, Core genome, New gene families.]<br />
Figure 3. Pan- and core-genome plot of the 32 Vibrionaceae genomes. V. cholerae<br />
strains that do not cause cholera are highlighted in bright green. Colours are the same<br />
as in Figure 2.<br />
BLAST comparison visualized in a BLAST matrix<br />
A BLAST matrix provides a visual overview of reciprocal pairwise whole genome<br />
comparisons (figure 4). The stronger a matrix cell is colored, the more similarity was<br />
Figure 2.18: CorePlot output for 32 Vibrio genomes.<br />
2.4 Summary<br />
This chapter presents a number of comparative genomics and visualization tools used in<br />
a genome annotation and analysis pipeline. Visualization methods have been shown to<br />
help draw biological conclusions about adaptation to environmental niches, pathogenic<br />
properties, and many other genomic properties, including proteome similarity.<br />
Keeping an overview of the large amount of genomic data is a constant challenge that<br />
will need more attention in the future as sequencing technology becomes more and more<br />
common. How can one visualize a comparison of a thousand genomes? Soon there will be<br />
a need to compare sets of thousands of genomes.<br />
2.5 Instant insight: Reading the genetic atlas<br />
2.6 Paper I: The genome BLASTatlas - a GeneWiz extension<br />
for visualization of whole-genome homology<br />
Molecular BioSystems, Volume 4, Number 5, May 2008, Pages 353–444, ISSN 1742-206X, www.molecularbiosystems.org<br />
HIGHLIGHT www.rsc.org/molecularbiosystems | Molecular BioSystems<br />
The genome BLASTatlas—a GeneWiz<br />
extension for visualization of whole-genome<br />
homology<br />
Peter F. Hallin, Tim T. Binnewies* and David W. Ussery<br />
DOI: 10.1039/b717118h<br />
The development of fast and inexpensive methods for sequencing bacterial genomes<br />
has led to a wealth of data, often with many genomes being sequenced of the same<br />
species or closely related organisms. Thus, there is a need for visualization methods that<br />
will allow easy comparison of many sequenced genomes to a defined reference strain.<br />
The BLASTatlas is one such tool that is useful for mapping and visualizing whole<br />
genome homology of genes and proteins within a reference strain compared to other<br />
strains or species of one or more prokaryotic organisms. We provide examples of<br />
BLASTatlases, including the Clostridium tetani plasmid p88, where homologues for toxin<br />
genes can be easily visualized in other sequenced Clostridium genomes, and for a<br />
Clostridium botulinum genome, compared to 14 other Clostridium genomes. DNA<br />
structural information is also included in the atlas to visualize the DNA chromosomal<br />
context of regions. Additional information can be added to these plots, and as an<br />
example we have added circles showing the probability of the DNA helix opening up<br />
under superhelical tension. The tool is SOAP compliant and WSDL (web services<br />
description language) files are located on our website (http://www.cbs.dtu.dk/ws/BLASTatlas),<br />
where programming examples are available in Perl. By providing an<br />
interoperable method to carry out whole genome visualization of homology,<br />
this service offers bioinformaticians as well as biologists an easy-to-adopt workflow<br />
that can be directly called from the programming language of the user, hence<br />
enabling automation of repeated tasks. This tool can be relevant in many pangenomic<br />
as well as in metagenomic studies, by giving a quick overview of clusters of<br />
insertion sites, genomic islands and overall homology between a reference<br />
sequence and a data set.<br />
Center for Biological Sequence Analysis,<br />
Department of Systems Biology, The<br />
Technical University of Denmark, 2800<br />
Lyngby, Denmark. E-mail: pfh@cbs.dtu.dk.<br />
E-mail: tim@cbs.dtu.dk. E-mail:<br />
dave@cbs.dtu.dk<br />
Background<br />
It has been more than 10 years since the<br />
sequencing of the first bacterial genome<br />
(ref. 1, US patent number 6,528,289), and<br />
currently sequence data are available for<br />
more than a thousand sequenced genomes.<br />
With so many genome sequences, for<br />
several bacterial species multiple genome<br />
sequences exist; for example, at the time<br />
of writing, 10 different Escherichia coli<br />
genomes have been fully sequenced and<br />
published, and draft sequences for another<br />
31 genomes are available, adding<br />
up to a total of 41 different E. coli<br />
genomes (according to the National Center<br />
for Biotechnology Information,<br />
NCBI Entrez, 12-Feb-2008). Table 1 lists<br />
the top 20 represented prokaryotic<br />
genera in terms of numbers of fully<br />
sequenced genomes based on recent<br />
counting in Entrez Genome Projects,<br />
although these numbers will change<br />
quickly as more genomes are being<br />
added on a regular basis. Thus, analysis<br />
of multiple genomes of the same organism<br />
(the “pangenome”) is now possible,<br />
and as more metagenomic datasets are<br />
published (see for example the projects<br />
listed on the GOLD web pages, ref. 24), there<br />
is a need for a graphical representation<br />
of how these new data compare to existing<br />
reference strains or model organisms.<br />
Peter F. Hallin was born in Odense, Denmark, and is currently a PhD student at CBS, DTU.<br />
Tim T. Binnewies grew up in Kiel, Germany, and obtained his PhD from the Technical University of Denmark; he is currently working for Roche Diagnostics AG in Switzerland.<br />
David W. Ussery was born and raised in Springdale, Arkansas. Since 1998, he has been leader for the Comparative Genomics group at CBS.<br />
This journal is © The Royal Society of Chemistry 2008. Mol. BioSyst., 2008, 4, 363–371<br />
We have developed a visualization<br />
method, called “BLASTatlas”, for showing<br />
mapped alignments of BLAST<br />
searches of a reference sequence against<br />
one or more databases, onto the reference<br />
genome. Early implementations of a<br />
similar method 2–4 accounted for the statistical<br />
significance (E-value) of each hit<br />
by color coding the expectation value<br />
[−log(E)] of the alignment. That method<br />
gives a uniform color throughout the<br />
alignment (gene or protein) but shows<br />
no information about the amino acid<br />
conservation within regions of the alignment.<br />
At the level of a bacterial chromosome<br />
this makes little difference,<br />
but when one zooms in to the level<br />
of individual genes, the older method of<br />
shading the entire gene based on the E-value<br />
gives no information about regions<br />
within a gene (such as functional domains)<br />
which might be strongly conserved,<br />
whilst other parts of the gene<br />
have little sequence homology in<br />
other genomes. We have refined the<br />
BLASTatlas method to map each<br />
individual amino acid residue or<br />
nucleotide back to the reference genome<br />
sequence from which the coding sequence<br />
was derived. Instead of colour-coding<br />
the significance of the entire hit,<br />
this method maps the conservation of the<br />
individual bases or amino acids. Tools<br />
such as the Artemis Comparison Tool<br />
(ACT) 5 allow detailed viewing of complete<br />
BLAST results, and this is an<br />
excellent graphical method for comparison<br />
of two genomes. ACT can also be<br />
extended to compare two genomes to a<br />
reference, placed in the middle. In<br />
contrast, the BLASTatlas method can<br />
compare many genomes to the same<br />
reference, and can provide a quick overview<br />
of chromosomal regions of gene<br />
conservation across many genomes.<br />
Table 1 The number of species and NCBI Entrez Project IDs of the 20 most represented genera<br />
in the Entrez Genome Projects Database, 13 as accessed on 21 October 2007. The numbers in<br />
brackets count both ongoing and completed projects, whereas the first number<br />
reflects only the completed projects. Candidate genera have been excluded from this count<br />
Genus Projects Species<br />
Streptococcus 26 [63] 8 [15]<br />
Burkholderia 15 [55] 8 [15]<br />
Bacillus 16 [48] 9 [16]<br />
Clostridium 14 [43] 9 [22]<br />
Vibrio 7 [35] 5 [14]<br />
Mycobacterium 16 [30] 9 [14]<br />
Salmonella 5 [30] 2 [3]<br />
Listeria 4 [29] 3 [6]<br />
Escherichia 10 [27] 1 [1]<br />
Mycoplasma 13 [25] 11 [17]<br />
Shewanella 14 [24] 10 [15]<br />
Pseudomonas 13 [23] 7 [8]<br />
Yersinia 9 [23] 3 [7]<br />
Haemophilus 6 [23] 3 [4]<br />
Staphylococcus 17 [22] 4 [5]<br />
Synechococcus 10 [21] 2 [2]<br />
Campylobacter 9 [20] 5 [9]<br />
Francisella 7 [16] 1 [2]<br />
Lactobacillus 11 [15] 10 [12]<br />
Rickettsia 10 [15] 9 [12]<br />
As can be seen from Table 1, for many<br />
of the heavily sampled genera there are<br />
further genome projects in the pipeline<br />
which will produce even more sequences<br />
than are currently available, and there is<br />
a need for methods for efficient comparison<br />
of these genomes, giving an overview<br />
of general trends in the data. The<br />
BLASTatlas allows the comparison of<br />
many genomes to a reference sequence.<br />
The current limit is about 60 genomes.<br />
There are two levels of comparison: the<br />
first is a one-page map of the<br />
whole chromosome, and the second<br />
zooms in on a particular region of interest,<br />
allowing the visualization of regions<br />
of conservation within individual genes.<br />
The color-coding represents identical<br />
amino acids (or nucleotides), based on<br />
a pairwise alignment of all protein-coding<br />
regions, with the best match for<br />
each gene in the reference genome<br />
shown. Thus, combining both levels, it<br />
is possible to get a global overview of the<br />
whole chromosome, and then to quickly<br />
identify gene conservation (or lack thereof)<br />
in regions of interest, at the level of<br />
individual amino acid<br />
residues.<br />
Clostridium botulinum is an important<br />
human pathogen and the causative<br />
agent of botulism, which gives rise to fatal<br />
paralysis of the respiratory muscles,<br />
caused by botulinum neurotoxin (BoNT),<br />
which disrupts nerve functions. The<br />
genes encoding BoNT components are<br />
clustered on the bacterial chromosome<br />
(group I + II strains), on prophages<br />
(group III strains) or on plasmids (group<br />
IV strains). Group I strains encode type<br />
A, B and F toxins, group II strains<br />
produce type B, E and F toxins and<br />
group III strains encode type C and<br />
D toxins, whereas group IV strains<br />
produce type G toxin. 6 We use the<br />
BLASTatlas method to show the overall<br />
genome homology of the C. botulinum<br />
strain F Langeland, compared to all<br />
currently available and fully sequenced<br />
strains of the Clostridium genus.<br />
Methods<br />
The BLASTatlas method uses all the<br />
provided annotated coding sequences<br />
(or proteins) of a reference genome, and<br />
compares each of those with one or more<br />
genomes. The total genome sequence for<br />
each organism is represented by a database<br />
and can contain any number of<br />
DNA or protein sequences. BLAST<br />
searches with a non-stringent E-value<br />
cut-off of 0.01 are used to identify the<br />
best alignments between the reference<br />
sequence protein and the database<br />
(genome) in question. Once identified,<br />
the single best pairwise alignment for<br />
each of the reference sequences is<br />
obtained and included in the map.<br />
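The best-hit selection described above can be sketched as follows. This is a minimal illustration, not the BLASTatlas implementation: the `Hit` record and the `best_hits` function are hypothetical names, and ranking hits by bit score is an assumption, since the text only says that the single best alignment per reference sequence is kept.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout, for illustration only)."""
    query_id: str     # identifier of the reference protein (the BLAST query)
    subject_id: str   # identifier of the matching database sequence
    evalue: float
    bit_score: float


def best_hits(hits: Iterable[Hit], max_evalue: float = 0.01) -> Dict[str, Hit]:
    """Keep, for each reference protein, the best hit that passes the cut-off.

    Assumption: "best" means highest bit score; the cut-off of 0.01 is the
    E-value threshold quoted in the text.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # hit is not significant enough to be considered
        current = best.get(hit.query_id)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query_id] = hit
    return best
```

Reference proteins with no hit below the cut-off simply do not appear in the returned dictionary, which corresponds to an empty stretch in the lane.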
The reference genome of a given<br />
comparison has a fixed size, whereas<br />
the sequences to be compared can be<br />
thought of as simply a “pile of proteins”,<br />
ranging in size from that of a<br />
small phage, to a single genome, an<br />
entire metagenomic sample or even existing<br />
large BLAST databases, such as<br />
UniProt. It is important to emphasize<br />
that each protein in the reference genome<br />
is compared to all the proteins in the<br />
comparison set (the database), regardless<br />
of orientation or location. The BLASTatlas<br />
method uses the software BLASTALL<br />
v. 2.2.11 for the search, and in BLAST<br />
terminology the reference genome constitutes<br />
the ‘query’ whereas each other genome<br />
in the comparison (e.g., a lane or circle<br />
in the atlas) corresponds to the ‘database’.<br />
We define a lane as a visual representation<br />
of mapped database hits<br />
(individual residue matches) on to the<br />
reference genome. A lane can have a<br />
box filter (smoothing) applied within<br />
each of the smallest visible units of the<br />
atlas (the resolution of the graphical<br />
representation). A single BLASTatlas<br />
may contain several lanes; currently<br />
the upper limit is around 60 circles.<br />
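As a rough illustration of the box filter mentioned above, the sketch below averages per-position scores within equal-width bins, one bin per smallest visible unit of the plot. The function name `box_filter` and the equal-width binning are assumptions; the text does not specify how the smoothing is performed internally.

```python
from typing import List, Sequence


def box_filter(scores: Sequence[float], n_bins: int) -> List[float]:
    """Average per-position scores within n_bins equal-width bins.

    Each bin stands for one smallest visible unit of the atlas
    (an assumption about how the resolution is applied).
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    length = len(scores)
    smoothed = []
    for b in range(n_bins):
        start = b * length // n_bins
        end = (b + 1) * length // n_bins
        window = scores[start:end]
        # An empty window can occur when there are more bins than positions.
        smoothed.append(sum(window) / len(window) if window else 0.0)
    return smoothed
```

For a genome of several million positions drawn at a few thousand visible units, each plotted value would then be the mean conservation over roughly a thousand positions.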
The input requires a file containing the<br />
genome sequence, including all annotated<br />
coding sequences (comprising protein<br />
start, stop and direction) for the<br />
reference genome. The four programs<br />
‘BLASTp’, ‘BLASTn’, ‘BLASTx’, and<br />
‘tBLASTn’ can be used for each lane of<br />
the BLASTatlas, although of course the<br />
appropriate sequences (DNA or protein)<br />
must be provided. For example, when<br />
using ‘BLASTn’ or ‘tBLASTn’ in a lane,<br />
the required DNA sequence can be a set<br />
of open reading frames (ORFs), chromosomal<br />
contigs, entire genome sequences<br />
or even environmental (metagenomic)<br />
samples. In a pairwise fashion, the sequence<br />
of the reference is BLASTed<br />
against each database defined by the<br />
user, employing the specified BLAST<br />
algorithm.<br />
Interpretation of BLAST alignments<br />
For each of the sequences defined in the<br />
reference, only the best hit in each database<br />
is stored. For these hits, the alignments<br />
are mapped on to the reference<br />
genome. When aligning two DNA<br />
sequences, the map shows one of four<br />
possible states for each position: match,<br />
mismatch, gap in query (reference genome),<br />
and gap in database (lane). Only<br />
a match contributes to the overall score,<br />
with a value of 1, whereas mismatches<br />
and gaps in the database get a score<br />
of zero. When aligning two protein<br />
sequences, an additional state is introduced<br />
for conservative mismatches, indicating<br />
that two amino acids have similar<br />
physical–chemical properties; such a<br />
state receives a score of 0.5. Match<br />
and gap states of protein alignments are<br />
defined similarly to those of the DNA<br />
alignments. Gaps in the reference<br />
sequence do not get a corresponding<br />
coordinate and are therefore<br />
ignored (see Fig. 1). In the BLASTatlas<br />
context, a map is an array of match<br />
scores. The array has the same length<br />
as the reference genome, with each position<br />
along a gene having a value of 0,<br />
0.5 or 1. It should be noted that intergenic<br />
regions (and ncRNAs, including<br />
tRNAs and rRNAs) have values of 0,<br />
because BLASTatlases only compare<br />
protein-encoding genes. We use this as<br />
a control, checking for example that the<br />
rRNA operons are visualized as “gaps”<br />
throughout all the lanes.<br />
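The scoring scheme above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: conservative mismatches are taken from the `+` symbol of a BLAST protein midline, coordinates are 0-based with an exclusive end, and each residue is assumed to cover three bases of the reference. The names `score_alignment` and `place_on_map` are hypothetical.

```python
from typing import List


def score_alignment(query: str, midline: str, subject: str) -> List[float]:
    """Score each reference (query) residue of one pairwise alignment.

    1.0 = identical, 0.5 = conservative mismatch ('+' in the midline),
    0.0 = mismatch or gap in the database sequence. Alignment columns with
    a gap in the reference have no reference coordinate and are skipped.
    """
    scores = []
    for q, m, s in zip(query, midline, subject):
        if q == "-":
            continue          # gap in reference: nothing to map on to
        if s == "-":
            scores.append(0.0)  # gap in database: non-conserved
        elif q.upper() == s.upper():
            scores.append(1.0)
        elif m == "+":
            scores.append(0.5)
        else:
            scores.append(0.0)
    return scores


def place_on_map(genome_map: List[float], scores: List[float],
                 cds_start: int, cds_end: int, strand: int,
                 first_residue: int = 0, width: int = 3) -> None:
    """Write residue scores into the genome-length map, in place.

    first_residue is the index of the first aligned residue within the
    protein; use width=1 for nucleotide alignments.
    """
    for i, value in enumerate(scores, start=first_residue):
        if strand >= 0:
            low = cds_start + i * width
        else:
            low = cds_end - (i + 1) * width  # minus strand runs backwards
        for pos in range(low, low + width):
            if cds_start <= pos < cds_end:
                genome_map[pos] = value
```

Starting from `genome_map = [0.0] * genome_length`, positions outside any coding sequence keep the value 0, which matches the behaviour described for intergenic regions and RNA genes.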
For each database defined, there will be<br />
a corresponding BLAST map within the<br />
atlas (see Fig. 2). Each database entry of<br />
the BLAST searches must contain a<br />
legend text for the lane, a colour code<br />
range and a scaling method. For the<br />
colours, an upper and a lower colour are<br />
required, whereas the middle colour<br />
is usually grey; all colours are defined<br />
as RGB integers ranging from 0 to 10.<br />
The scale can be either fixed, such as<br />
ranging from 0 to 1, or scaled using any<br />
number of standard deviations around<br />
the average.<br />
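The two scaling options and the three-colour range can be illustrated as follows. This is a sketch only: linear interpolation between the lower, middle and upper colours is an assumption about how the colour range is applied, and the function names are hypothetical. Colours are triples of integers from 0 to 10, as stated in the text.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence, Tuple

Colour = Tuple[int, int, int]  # RGB components, each an integer from 0 to 10


def scale_limits(values: Sequence[float],
                 n_std: Optional[float] = None) -> Tuple[float, float]:
    """Return (low, high) limits: fixed 0..1, or mean +/- n_std deviations."""
    if n_std is None:
        return (0.0, 1.0)
    mu = mean(values)
    sd = pstdev(values)
    return (mu - n_std * sd, mu + n_std * sd)


def to_colour(value: float, limits: Tuple[float, float],
              lower: Colour, middle: Colour, upper: Colour) -> Colour:
    """Interpolate linearly from lower via middle to upper colour."""
    low, high = limits
    if high <= low:
        return middle  # degenerate scale, e.g. all values identical
    t = min(1.0, max(0.0, (value - low) / (high - low)))
    if t < 0.5:
        start, end, frac = lower, middle, t / 0.5
    else:
        start, end, frac = middle, upper, (t - 0.5) / 0.5
    return tuple(round(a + (b - a) * frac) for a, b in zip(start, end))
```

A fixed 0 to 1 scale suits BLAST lanes, whose scores are bounded, whereas scaling by standard deviations is more natural for unbounded values such as DNA properties or custom data.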
DNA properties<br />
The BLASTatlas method allows users to<br />
add structural as well as base composition<br />
information to the atlas by using the<br />
‘DNAparameters’ element in the request.<br />
These properties can be, for example,<br />
DNA structural properties, 7 such as<br />
intrinsic curvature, 8 global or local<br />
repeats 9 or other measures of base<br />
composition. 10 A list of the different<br />
properties currently pre-computed can<br />
be obtained via the online documentation<br />
and type declarations of the web<br />
services description. The DNA property<br />
lanes are usually added near the center<br />
(or at the lowest part when seen from the<br />
outermost circle) of the atlas.<br />
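Two of the simpler base composition measures, percent AT and GC skew, can be computed as sketched below. These are the commonly used textbook definitions, with GC skew taken as (G − C)/(G + C); whether the service computes them in exactly this way, and over which window size, is not stated in the text, so both are assumptions here.

```python
from typing import Callable, List


def percent_at(seq: str) -> float:
    """Percentage of A and T bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("A") + seq.count("T")) / len(seq)


def gc_skew(seq: str) -> float:
    """GC skew, defined here as (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed(seq: str, size: int,
             measure: Callable[[str], float]) -> List[float]:
    """Apply a measure to consecutive, non-overlapping windows."""
    return [measure(seq[i:i + size]) for i in range(0, len(seq), size)]
```

The resulting list of window values could be drawn as one lane, in the same way as the pre-computed properties.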
Custom properties<br />
In addition to the standard DNA properties<br />
and BLAST maps, the web service<br />
provides a method for adding individual<br />
custom data, for example gene expression<br />
values, to the atlas, using the ‘customMap’<br />
element in the request. Data<br />
must be provided in the form of<br />
comma-separated strings, with each position in<br />
the list corresponding to the genomic<br />
position. When defining custom data<br />
lanes, the colour ranges, scaling method,<br />
and legend text must be provided.<br />
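As an illustration of the comma-separated format, the sketch below expands gene-level values (for example expression levels) into one value per genomic position. The function name, the 0-based end-exclusive coordinates and the default value for positions outside any gene are all assumptions made for the example; only the one-value-per-position, comma-separated layout comes from the text.

```python
from typing import Iterable, Tuple


def custom_map_string(genome_length: int,
                      genes: Iterable[Tuple[int, int, float]],
                      default: float = 0.0) -> str:
    """Build a comma-separated string with one value per genomic position.

    genes holds (start, end, value) triples, 0-based and end-exclusive.
    Positions not covered by any gene receive the default value.
    """
    values = [default] * genome_length
    for start, end, value in genes:
        for pos in range(max(0, start), min(end, genome_length)):
            values[pos] = value
    return ",".join(f"{v:g}" for v in values)
```

For a real genome the string has millions of entries, so it is worth generating it programmatically rather than by hand.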
Fig. 1 Mapping of protein–protein alignment to DNA. Panel A: mismatches and perfect matches are assigned a score of 0 and 1, respectively.<br />
Conservative mismatches are assigned a score of 0.5. In the case of DNA alignment, only scores of 0 and 1 are possible. Panel B: gaps in the<br />
database sequence will be rendered as being non-conserved areas (filled with zeros). Panel C: gaps in the reference sequence will be neglected, since<br />
they have no corresponding region in the reference genome into which they can be mapped.<br />
Fig. 2 Genes (or segments) from each genome are compared with a reference gene, as shown in<br />
the left panel; a pairwise comparison is made using one of the BLAST algorithms. On the right is<br />
shown the “remapping”, or the representation of each of the BLAST runs on the left, mapped<br />
onto the chromosomal sequence. Note that gaps in the reference gene (grey) are not included in<br />
the colored maps of the atlas.<br />
Visualization<br />
Details such as the atlas title and the<br />
geometry (linear or circular representation)<br />
are necessary for the final visualization.<br />
Once the BLAST searches have been carried<br />
out and remapped to the reference<br />
genome, and custom data and DNA<br />
properties have been collected, an XML<br />
configuration file is composed which contains<br />
all these data and the layout of the atlas.<br />
This file is then sent to the GeneWiz 7<br />
software, which produces a PostScript<br />
document; this is then base64 encoded to<br />
allow transport via XML. This part of<br />
the process takes place on the server and<br />
requires no user interaction. An example<br />
atlas of a plasmid is shown in Fig. 3, and<br />
will be discussed in more detail below.<br />
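On the client side, the base64 transport step described in the Visualization section has to be reversed before the PostScript document can be viewed. A minimal sketch using the Python standard library is given below; the function name is hypothetical, and the check for a leading `%!PS` marker is a conventional sanity test for PostScript files, not something prescribed by the service.

```python
import base64


def decode_postscript(encoded: str) -> bytes:
    """Decode a base64-encoded PostScript document returned in XML."""
    data = base64.b64decode(encoded)
    # PostScript files conventionally begin with "%!PS"; treat anything
    # else as a sign that the payload is not the expected document.
    if not data.startswith(b"%!PS"):
        raise ValueError("decoded payload does not look like PostScript")
    return data
```

The returned bytes can be written to a `.ps` file in binary mode and opened with any PostScript viewer.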
Web services implementation<br />
A WSDL (web services description language)<br />
file has been written which describes the<br />
operations (runAtlas, pollQueue,<br />
fetchAtlasResult) and their input requirements.<br />
The file can be downloaded.<br />
All input/output objects are defined in a<br />
separate XSD (XML schema definition) file<br />
referenced within the WSDL file, which comprises<br />
information and type restrictions<br />
applicable to the request. This serves as<br />
documentation of the objects as well as a<br />
way to validate a request before it is<br />
submitted. Unfortunately, validation<br />
support in the Perl modules is not yet<br />
optimal, whereas this option is<br />
well implemented in tools like soapUI<br />
(http://www.soapui.org/). It should be<br />
stressed that, until better<br />
validation support can be implemented, users<br />
should be careful to format the input<br />
parameters correctly before sending the request.<br />
Fig. 3 BLASTatlas of pE88, a small plasmid of Clostridium tetani strain E88, GenBank accession number AF528097. The DNA parameters percent AT,<br />
GC skew, global direct repeats, and global inverted repeats are included in the innermost lanes. BLAST lanes of all complete genome sequences of the<br />
Clostridium genomes (see Table 2), including plasmids, are included in the outermost lanes. As examples of custom lanes, the free energy (G, blue, kcal<br />
mol−1) and the probability (P, red) measures of stress-induced DNA duplex destabilization (SIDD) sites are included in the lanes between the DNA<br />
properties and the BLAST lanes. 23 SIDD calculations were obtained from the SIDDbase WebService (http://www.cbs.dtu.dk/ws/SIDDbase). The<br />
request XML used to construct this plot can be downloaded from the example section of the service homepage, http://www.cbs.dtu.dk/ws/BLASTatlas.<br />
As expected, there is full homology of all coding regions between the plasmid and all replicons of C. tetani E88 (black lane just outside of the<br />
annotations); however, there appears to be limited conservation of these pE88 genes throughout the genomes of other Clostridium strains.<br />
Table 2 A list of all strains and their accession numbers used in this comparison. Each row represents one NCBI Entrez sequencing project. The<br />
numbers of base pairs and protein-coding genes are the sums within each project. C. botulinum str. F Langeland is used as the<br />
reference of the comparison<br />
Species (ref.); Segments; Size (bp); Proteins<br />
C. acetobutylicum ATCC 824 14; Entrez Project 77: Chromosome: AE001437, Plasmid pSOL1: AE001438; 4.132.880; 3.848<br />
C. beijerinckii NCIMB 8052 (unpublished); Entrez Project 12637: Chromosome: CP000721; 6.000.632; 5.020<br />
C. botulinum A str. ATCC 19397 (unpublished); Entrez Project 19517: Chromosome: CP000726; 3.863.450; 3.552<br />
C. botulinum A str. ATCC 3502 6; Entrez Project 193: Chromosome: AM412317, Plasmid pBOT3502: AM412318; 3.903.260; 3.671<br />
C. botulinum A str. Hall (unpublished); Entrez Project 19521: Chromosome: CP000727; 3.760.560; 3.407<br />
C. botulinum F str. (unpublished); Entrez Project 19519: Chromosome: CP000728, Plasmid pCLI: CP000729; 4.012.918; 3.659<br />
C. difficile 630 15; Entrez Project 78: Chromosome: AM180355, Plasmid pCD630: AM180356; 4.298.133; 3.787<br />
C. kluyveri DSM 555 (unpublished); Entrez Project 19065: Chromosome: CP000673, Plasmid pCKL555A: CP000674; 4.023.800; 3.913<br />
C. novyi NT 16; Entrez Project 16820: Chromosome: CP000382; 2.547.720; 2.325<br />
C. perfringens ATCC 13124 25; Entrez Project 304: Chromosome: CP000246; 3.256.683; 2.876<br />
C. perfringens SM101 17; Entrez Project 12521: Chromosome: CP000312, Plasmid 1: CP000313, Plasmid 2: CP000314, Viral segment phage phiSM101: CP000315; 2.960.088; 2.631<br />
C. perfringens str. 13 18; Entrez Project 79: Chromosome: BA000016, Plasmid pCP13: AP003515; 3.085.740; 2.723<br />
C. tetani E88 19; Entrez Project 81: Chromosome: AE015927, Plasmid pE88: AF528097; 2.873.333; 2.432<br />
C. thermocellum ATCC 27405 (unpublished); Entrez Project 314: Chromosome: CP000568; 3.843.301; 3.191<br />
Clostridium phage 20; Phage c-st: AP008983; 185.683; 198<br />
Web services workflow<br />
A workflow was written in Perl (v5.8.7),<br />
employing SOAP::Lite (v0.69), which<br />
reads the FASTA files of the database<br />
strains listed in Table 2 and produces a<br />
BLASTatlas using C. botulinum<br />
strain F Langeland as reference. The<br />
script uses the online web service (see<br />
Fig. 4). The BLASTatlas figure produced<br />
by this workflow is shown in Fig. 5.<br />
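The submit, poll and fetch pattern used by the workflow can be sketched independently of any SOAP library. In the example below, `poll` and `fetch` are placeholder callables standing in for the service's polling and result operations; their signatures, the polling interval and the retry limit are assumptions. Only the "FINISHED" status string is taken from the workflow description, and every other status is simply treated as "not ready yet".

```python
import time
from typing import Callable


def wait_for_result(job_id: str,
                    poll: Callable[[str], str],
                    fetch: Callable[[str], str],
                    interval: float = 10.0,
                    max_polls: int = 360,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a queued job until it reports FINISHED, then fetch its result.

    poll(job_id) returns a status string; fetch(job_id) returns the result
    (for BLASTatlas, the base64-encoded PostScript document).
    """
    for _ in range(max_polls):
        if poll(job_id) == "FINISHED":
            return fetch(job_id)
        sleep(interval)  # wait before asking again, to avoid flooding the queue
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the service calls in as functions keeps the loop testable without network access, and the `sleep` parameter allows the waiting time to be stubbed out.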
Fig. 4 Workflow description: a Perl script was written for handling the assembly of the SOAP<br />
envelope and contacting various other web services operations. (A) Obtaining the genome sequence:<br />
using the getSeq operation of the GenomeAtlas Web Services (v.3.3), the genome sequence of the<br />
reference genome is obtained as one continuous string. (B) Obtaining atlas annotations:<br />
annotated CDS, rRNA, and tRNA features of the GenBank record of the reference genome are obtained<br />
using the getFeatures operation; these are the features which will be printed in a separate lane<br />
on the atlas. (C) Obtaining ORF annotations of the reference genome: again, using the getFeatures<br />
operation, all coding sequences and their translations are obtained. (D) Obtaining databases: FASTA<br />
files containing proteins and ORFs of the database genomes to be added as lanes are read. The<br />
outputs of A–D are assembled into a single SOAP request, including configurations of the atlas.<br />
(E) Polling the queue: once the job has been submitted, a 32 character hex string is returned for<br />
identifying the job, which can be used by the operation pollQueue to see the status of the job.<br />
(F + G) Obtaining the result: once a status “FINISHED” is obtained from pollQueue, the job id<br />
can be submitted to fetchResult and the resulting PostScript image is returned.<br />
Results<br />
Fig. 3 represents a BLASTatlas for plasmid<br />
pE88 from Clostridium tetani strain<br />
E88. The homology for genes in the<br />
plasmid to other sequenced genomes is<br />
shown in the circles; additional “custom<br />
lanes” represent chromosomal regions<br />
predicted to open under superhelical<br />
stress. The chromosomal location of the<br />
genes encoding colT and tetR is labelled<br />
in the figure. Notice that these two proteins<br />
contain regions of homology that<br />
are found in most of the Clostridium<br />
proteomes searched. Since the C. tetani<br />
plasmid is included in the genome sequence<br />
(black circle in the figure), all<br />
the genes are found in this genome (solid<br />
black), and most of the other Clostridium<br />
proteomes contain some weak homology<br />
but in general lack most of the<br />
plasmid-encoded genes. Thus, this is a quick overview<br />
of gene conservation of a plasmid<br />
compared to many sequenced genomes of<br />
the same genus.<br />
To demonstrate this for an entire bacterial<br />
genome (which is millions of bp in<br />
size, compared to a small, ~75 000 bp<br />
plasmid, shown in Fig. 3), we have used<br />
the genome sequence of C. botulinum<br />
strain F Langeland, the largest of the<br />
C. botulinum genomes, to build a protein<br />
BLASTatlas of all publicly available<br />
fully sequenced Clostridia genomes, including<br />
all chromosomes, plasmids and<br />
phages (see Fig. 5). Each lane of the atlas<br />
corresponds to a sequencing project that<br />
contains the main chromosome plus any<br />
phages or plasmids present in the genome.<br />
The proteins encoded by the 185 kb<br />
neurotoxin-converting bacteriophage<br />
c-st are labelled, as well as a region which<br />
is zoomed in the second panel in Fig. 5.<br />
The accession numbers, total size and<br />
total number of genes within each lane<br />
can be seen in Table 2.<br />
Fig. 5 BLASTatlas of Clostridium botulinum F strain Langeland: Lanes show genome homology of (starting from the outermost lane):<br />
C. acetobutylicum ATCC 824, C. beijerinckii NCIMB 8052, C. botulinum A str. ATCC 19397, C. botulinum A ATCC 3502, C. botulinum A str. Hall,<br />
C. difficile 630, C. kluyveri DSM 555, C. novyi NT, C. perfringens ATCC 13124, C. perfringens SM101, C. perfringens str. 13, C. tetani E88,<br />
C. thermocellum ATCC 27405, and Clostridium phage c-st genome. Inside of the annotation circle are shown global direct repeats, global inverted<br />
repeats, stacking energy, and percent AT. Blue and red annotations are coding sequences on the plus and minus strands, whereas green and turquoise<br />
are rRNA and tRNA genes, respectively. The two toxin components NTNH and BoNT/A1 that are identified on phage c-st are present in the<br />
reference genome at positions 880 kb and 883 kb, respectively (marked ‘cst’). The presence of the two is visible as a thin blue band on the c-st BLAST<br />
lane. The lower part of the figure shows a zoom of the region around 2635 kb, providing an example of a gene cluster which appears to be<br />
conserved throughout the C. botulinum strains and partly within C. difficile 630.<br />
There are several items of <strong>in</strong>terest which<br />
can be seen <strong>in</strong> Fig. 5. First, the rRNA<br />
operons can be quite readily seen, near the<br />
top part of the chromosome map, labeled<br />
turquoise; these rRNA operons are more<br />
GC rich (hence less red <strong>in</strong> the <strong>in</strong>ner-most<br />
lane), have direct <strong>and</strong> <strong>in</strong>verted repeats (the<br />
next two lanes), <strong>and</strong> are not shown <strong>in</strong> the<br />
proteome comparison lanes (s<strong>in</strong>ce these<br />
genes do not encode prote<strong>in</strong>s).<br />
As expected, the circle representing<br />
the c-st phage shows little match to most<br />
of the C. botulinum genome at the<br />
protein level. In general, the two other<br />
C. botulinum genomes (both in blue) have<br />
the highest similarity to the reference<br />
C. botulinum genome (also shown as a<br />
circle). The reference circle serves as an internal<br />
control: all of the proteins should show a<br />
match in this lane, since the reference<br />
genome is blasted against itself. Another<br />
interesting observation is that the upper left-hand<br />
part of the genome seems to<br />
have more homology to other Clostridium<br />
genomes than the rest of the genome does,<br />
in particular showing<br />
many matches to the C. perfringens<br />
genomes (green circles).<br />
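The internal control described above can be checked mechanically. The following is a minimal sketch in Python, assuming each lane has already been reduced to the set of reference protein identifiers that received at least one hit. The function name lane_coverage and the toy identifiers are hypothetical and are not part of the BLASTatlas software.<br />

```python
def lane_coverage(reference_ids, matched_ids):
    """Fraction of reference proteins with at least one hit in a lane."""
    reference_ids = set(reference_ids)
    if not reference_ids:
        raise ValueError("reference proteome is empty")
    return len(reference_ids & set(matched_ids)) / len(reference_ids)


# Toy data (invented): four reference proteins and two lanes.
reference = ["p1", "p2", "p3", "p4"]
lanes = {
    "reference vs itself": ["p1", "p2", "p3", "p4"],
    "phage": ["p3"],
}

coverage = {name: lane_coverage(reference, ids) for name, ids in lanes.items()}

# The self-comparison lane should cover every reference protein;
# anything less would point to a problem in the pipeline.
control_ok = coverage["reference vs itself"] == 1.0
```

A lane such as the phage is expected to cover only a small fraction of the reference, whereas a self-comparison lane below full coverage would suggest an error upstream.<br />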
Application in metagenomics<br />
The genus Prochlorococcus belongs<br />
to the cyanobacteria and is one of the<br />
most abundant photosynthetic organisms<br />
of the ocean. It plays an important<br />
role in the planet’s carbon cycle and has<br />
adapted to the various light and oxygen<br />
conditions present at different<br />
depths. 11 As of the end of January<br />
2008, eleven Prochlorococcus marinus<br />
genomes are publicly available, and we<br />
have included all encoded proteins of<br />
these genomes, together with the seven metagenomic<br />
read collections from the ALOHA<br />
station near Hawaii, 12 as shown in<br />
Table 3. P. marinus strain<br />
MIT 9303 has the largest genome of all<br />
368 | Mol. BioSyst., 2008, 4, 363–371 This journal is © The Royal Society of Chemistry 2008
Table 3 A list of all strains/sample names and their accession numbers used in the metagenomic comparison. The list is sorted by sampling depth<br />
Source | Size (bp) | Origin | Accession/sample | Ref. | Depth<br />
P. marinus str. MIT 9515 | 1 704 176 (1906 proteins) | Tropical Pacific | CP000552 | Unpublished | Surface<br />
P. marinus str. MIT 9215 | 1 738 790 (1983 proteins) | Equatorial Pacific | CP000825 | Unpublished | Surface<br />
P. marinus str. MED4 | 1 657 990 (1936 proteins) | Mediterranean Sea | BX548174 | 21 | 4 m<br />
JGI_SMPL_HF10_10-07-02 | 7 482 668 (7842 contigs) | North Pacific Subtropical Gyre | - | 12 | 10 m<br />
P. marinus str. NATL1A | 1 864 731 (2193 proteins) | North Atlantic | CP000553 | Unpublished | 30 m<br />
P. marinus str. NATL2A | 1 842 899 (2163 proteins) | North Atlantic | CP000095 | Unpublished | 30 m<br />
P. marinus str. AS9601 | 1 669 886 (1921 proteins) | Arabian Sea | CP000551 | Unpublished | 50 m<br />
JGI_SMPL_HF70_10-07-02 | 10 828 386 (10 999 contigs) | North Pacific Subtropical Gyre | - | 12 | 70 m<br />
P. marinus str. MIT 9211 | 1 688 963 (1855 proteins) | Equatorial Pacific | CP000878 | 21 | 83 m<br />
P. marinus str. MIT 9301 | 1 641 879 (1907 proteins) | Sargasso Sea | CP000576 | Unpublished | 90 m<br />
P. marinus str. MIT 9303 | 2 682 675 (2997 proteins) | Sargasso Sea | CP000554 | Unpublished | 100 m<br />
P. marinus str. SS120 | 1 751 080 (1882 proteins) | Sargasso Sea | AE017126 | 22 | 120 m<br />
JGI_SMPL_HF130_10-06-02 | 6 091 784 (6812 contigs) | North Pacific Subtropical Gyre | - | 12 | 130 m<br />
P. marinus str. MIT 9312 | 1 709 204 (1962 proteins) | Equatorial Pacific | CP000111 | Unpublished | 135 m<br />
P. marinus str. MIT 9313 | 2 410 873 (2273 proteins) | Gulf Stream | BX548175 | 21 | 135 m<br />
JGI_SMPL_HF200_10-06-02 | 7 829 659 (8286 contigs) | North Pacific Subtropical Gyre | - | 12 | 200 m<br />
JGI_SMPL_HF500_10-06-02 | 8 764 642 (9027 contigs) | North Pacific Subtropical Gyre | - | 12 | 500 m<br />
JGI_SMPL_HF770_12-21-03 | 11 811 597 (11 479 contigs) | North Pacific Subtropical Gyre | - | 12 | 770 m<br />
JGI_SMPL_HF4000_12-21-03 | 11 028 821 (11 229 contigs) | North Pacific Subtropical Gyre | - | 12 | 4000 m<br />
currently available sequences (2.7 Mb)<br />
and was therefore used as reference in<br />
this comparison. BLAST hits between<br />
the reference and the encoded proteins<br />
of all the P. marinus genomes included<br />
were generated with the BLASTp<br />
algorithm, whereas hits between the<br />
reference proteins and the DNA reads<br />
of the metagenomic samples were generated<br />
using the tBLASTn algorithm.<br />
tBLASTn was used to avoid the<br />
gene prediction step of the metagenomic<br />
samples and to allow a rough estimate<br />
of the coding potential of these samples.<br />
All lanes are sorted according to<br />
the water depth at which the samples<br />
were collected (see Fig. 6). The Perl<br />
code for constructing this plot using<br />
web services is provided on the service<br />
homepage.<br />
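As a rough illustration of the set-up described above, the sketch below picks the largest genome as reference, chooses a BLAST program per lane and orders the lanes by sampling depth. It is written in Python and is not the Perl code mentioned in the text. The class Lane, the helper names and the encoding of "Surface" as a depth of 0 m are assumptions made for this example, and the rows are a hand-copied subset of Table 3.<br />

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str
    size_bp: int
    depth_m: float  # assumption: "Surface" is encoded as 0 m
    kind: str       # "genome" (predicted proteins) or "metagenome" (DNA reads)


# Hand-copied subset of Table 3, for illustration only.
LANES = [
    Lane("P. marinus str. SS120", 1_751_080, 120, "genome"),
    Lane("JGI_SMPL_HF4000_12-21-03", 11_028_821, 4000, "metagenome"),
    Lane("P. marinus str. MIT 9515", 1_704_176, 0, "genome"),
    Lane("P. marinus str. MIT 9303", 2_682_675, 100, "genome"),
    Lane("JGI_SMPL_HF10_10-07-02", 7_482_668, 10, "metagenome"),
    Lane("P. marinus str. MED4", 1_657_990, 4, "genome"),
]


def pick_reference(lanes):
    """Use the largest fully sequenced genome as the reference."""
    genomes = [lane for lane in lanes if lane.kind == "genome"]
    return max(genomes, key=lambda lane: lane.size_bp)


def blast_program(lane):
    """Protein vs protein for genomes; protein vs translated DNA for reads."""
    return "blastp" if lane.kind == "genome" else "tblastn"


def lane_order(lanes):
    """Shallowest sample first, i.e. the outermost lane of the atlas."""
    return sorted(lanes, key=lambda lane: lane.depth_m)


reference = pick_reference(LANES)
plan = [(lane.name, blast_program(lane)) for lane in lane_order(LANES)]
```

Keeping the choice of program tied to the kind of lane mirrors the reasoning in the text: no gene prediction is needed for the metagenomic reads because the protein query is compared against translated DNA.<br />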
Discussion<br />
The BLASTatlas method can assist biologists<br />
in finding regions along the chromosome<br />
which are conserved (or not).<br />
This <strong>in</strong>formation is useful for several<br />
Fig. 6 BLASTatlas showing fully sequenced Prochlorococcus genomes (green) and the seven ALOHA metagenomic samples (blue). Outermost<br />
lanes represent samples closer to the ocean surface.<br />
different applications, such as identifying<br />
phage insertion sites and loss of important<br />
genetic material. This method is<br />
even able to scale down to each individual<br />
nucleotide or amino acid residue.<br />
However, it is unable to deal with sequences<br />
(or parts thereof) that are not<br />
found in the reference genome. A good<br />
compromise when dealing with this issue<br />
is often to use the largest chromosome of<br />
a species as reference; in addition, it can<br />
be useful to rebuild the maps using different<br />
reference genomes. Besides this<br />
limitation, the fact that all coordinates<br />
are mapped back to the reference causes<br />
the coordinates of the database genomes<br />
to “get lost” in that only the best match<br />
is displayed, regardless of the chromosomal<br />
location in the database genomes.<br />
Other aspects of genome homology, like<br />
gene synteny, cannot effectively be<br />
addressed by this tool. However, it is<br />
possible to use an additional circle to<br />
plot gene order conservation along the<br />
chromosome.<br />
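The loss of database coordinates described above follows directly from keeping only the best hit per reference protein. The following is a minimal Python sketch of that reduction, assuming hits are available as simple records. The record layout, the function name best_hits and all values are invented for illustration and do not reflect the actual BLASTatlas implementation.<br />

```python
from collections import namedtuple

# Hypothetical hit record: db_start is the position in the database genome.
Hit = namedtuple("Hit", "ref_protein db_genome db_start score")


def best_hits(hits):
    """Keep the highest-scoring hit per (reference protein, database genome)."""
    best = {}
    for hit in hits:
        key = (hit.ref_protein, hit.db_genome)
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    # Only the score is plotted at the reference coordinate; the position
    # in the database genome (db_start) is dropped at this point.
    return {key: hit.score for key, hit in best.items()}


hits = [
    Hit("refA", "genome1", 10_000, 250.0),
    Hit("refA", "genome1", 900_000, 310.0),  # better hit, far away in genome1
    Hit("refB", "genome1", 20_000, 120.0),
]

lane = best_hits(hits)
```

Because the result is keyed by reference protein, a gene that exists only in a database genome has no key and cannot appear in the plot, and the order of genes in the database genome (synteny) is not recoverable from it.<br />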
Currently, we see the BLASTatlas as<br />
an intermediate stage in the analysis of many<br />
genomes of similar species. Soon there<br />
will be a need to compare hundreds or<br />
thousands of genome sequences, and<br />
developing new methods<br />
for comparing such large numbers<br />
of genomes is becoming<br />
ever more important.<br />
Acknowledgements<br />
The authors would like to thank Hans<br />
Henrik Stærfeldt for assistance with server-side<br />
programs and Kristoffer Rapacki<br />
for assistance with web services<br />
data types. The work was supported by<br />
a grant from the European Union<br />
through the EMBRACE Network of Excellence,<br />
contract number LSHG-CT-2004-512092,<br />
and a grant from the Danish<br />
Center for Scientific Computing<br />
(DCSC).<br />
References<br />
1 R. D. Fleischmann, M. D. Adams, O.<br />
White, R. A. Clayton, E. F. Kirkness, A.<br />
R. Kerlavage, C. J. Bult, J. F. Tomb, B. A.<br />
Dougherty, J. M. Merrick, J. McKenney,<br />
G. Sutton, W. FitzHugh, C. Fields, J. D.<br />
Gocayne, J. Scott, R. Shirley, L. I. Liu, A.<br />
Glodek, J. M. Kelley, J. F. Weidman, C.<br />
A. Phillips, T. Spriggs, E. Hedblom, M. D.<br />
Cotton, T. R. Utterback, M. C. Hanna, D.<br />
T. Nguyen, D. M. Saudek, R. C. Brandon,<br />
L. D. Fine, J. L. Fritchman, J. L. Fuhrmann,<br />
N. S. M. Geoghagen, C. L. Gnehm,<br />
L. A. McDonald, K. V. Small, C. M.<br />
Fraser, H. O. Smith and J. C. Venter,<br />
Whole-Genome Random Sequencing and<br />
Assembly of Haemophilus influenzae Rd,<br />
Science, 1995, 269(5223), 496–512.<br />
2 L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten,<br />
M. K. Jorgensen, C. Lundegaard,<br />
C. C. Pedersen, N. Petersen and D. Ussery,<br />
Analysis of two large functionally uncharacterized<br />
regions in the Methanopyrus<br />
kandleri AV19 genome, BMC Genomics,<br />
2003, 4, 12.<br />
3 L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten,<br />
N. T. Hansen, H. Johansson,<br />
M. K. Jørgensen, K. Kiil, P. F. Hallin<br />
and D. Ussery, Comparative genomics of<br />
four Pseudomonas species, in The Pseudomonads<br />
Vol. I. Genomics, Life Style<br />
and Molecular Architecture, ed. J. L.<br />
Ramos, Kluwer Academic/Plenum<br />
Publishers, New York, 2004, ch. 5,<br />
pp. 139–164.<br />
4 P. F. Hallin, T. T. Binnewies and D. W.<br />
Ussery, Genome update: chromosome atlases,<br />
Microbiology (Reading, U. K.),<br />
2004, 150, 3091–3093.<br />
5 T. J. Carver, K. M. Rutherford, M. Berriman,<br />
M. A. Rajandream, B. G. Barrell and<br />
J. Parkhill, ACT: the Artemis Comparison<br />
Tool, Bioinformatics, 2005, 21, 3422–3423.<br />
6 M. Sebaihia, M. W. Peck, N. P. Minton,<br />
N. R. Thomson, M. T. Holden, W. J.<br />
Mitchell, A. T. Carter, S. D. Bentley, D.<br />
R. Mason, L. Crossman, C. J. Paul, A.<br />
Ivens, M. H. Wells-Bennik, I. J. Davis, A.<br />
M. Cerdeno-Tarraga, C. Churcher, M. A.<br />
Quail, T. Chillingworth, T. Feltwell, A.<br />
Fraser, I. Goodhead, Z. Hance, K. Jagels,<br />
N. Larke, M. Maddison, S. Moule, K.<br />
Mungall, H. Norbertczak, E. Rabbinowitsch,<br />
M. Sanders, M. Simmonds, B.<br />
White, S. Whithead and J. Parkhill, Genome<br />
sequence of a proteolytic (Group I)<br />
Clostridium botulinum strain Hall A and<br />
comparative analysis of the clostridial genomes,<br />
Genome Res., 2007, 17, 1082–1092.<br />
7 A. G. Pedersen, L. J. Jensen, S. Brunak, H.<br />
H. Staerfeldt and D. W. Ussery, A DNA<br />
structural atlas for Escherichia coli, J. Mol.<br />
Biol., 2000, 299, 907–930.<br />
8 E. S. Shpigelman, E. N. Trifonov and<br />
A. Bolshoy, Curvature: software for the<br />
analysis of curved DNA, CABIOS, Comput.<br />
Appl. Biosci., 1993, 9, 435–440.<br />
9 M. Skovgaard, L. J. Jensen, C. Friis, H. H.<br />
Stærfeldt, P. Worning, S. Brunak and D.<br />
Ussery, The Atlas Visualisation of Genome-wide<br />
Information, Methods Microbiol.,<br />
2002, 33, 49–63.<br />
10 L. J. Jensen, C. Friis and D. W. Ussery,<br />
Three Views of Microbial Genomes, Res.<br />
Microbiol., 1999, 150, 773–777.<br />
11 M. B. Sullivan, M. L. Coleman, P. Weigele,<br />
F. Rohwer and S. W. Chisholm,<br />
Three Prochlorococcus cyanophage Genomes:<br />
Signature Features and Ecological<br />
Interpretations, PLoS Biol., 2005, 3, e144;<br />
PMID: 15828858.<br />
12 E. F. DeLong, C. M. Preston, T. Mincer,<br />
V. Rich, S. J. Hallam, N.-U. Frigaard, A.<br />
Martinez, M. B. Sullivan, R. Edwards, B.<br />
R. Brito, S. W. Chisholm and D. M. Karl,<br />
Community Genomics Among Stratified<br />
Microbial Assemblages in the Ocean’s Interior,<br />
Science, 2006, 311(5760), 496–503.<br />
13 D. L. Wheeler, T. Barrett, D. A. Benson,<br />
S. H. Bryant, K. Canese, V. Chetvernin,<br />
D. M. Church, M. DiCuccio, R. Edgar, S.<br />
Federhen, L. Y. Geer, Y. Kapustin, O.<br />
Khovayko, D. Landsman, D. J. Lipman,<br />
T. L. Madden, D. R. Maglott, J. Ostell, V.<br />
Miller, K. D. Pruitt, G. D. Schuler, E.<br />
Sequeira, S. T. Sherry, K. Sirotkin, A.<br />
Souvorov, G. Starchenko, R. L. Tatusov,<br />
T. A. Tatusova, L. Wagner and E.<br />
Yaschenko, Database Resources of the<br />
National Center for Biotechnology Information,<br />
Nucleic Acids Res., 2007, 35,<br />
D5–D12.<br />
14 J. Nolling, G. Breton, M. V. Omelchenko,<br />
K. S. Makarova, Q. Zeng, R. Gibson, H.<br />
M. Lee, J. Dubois, D. Qiu, J. Hitti, Y. I.<br />
Wolf, R. L. Tatusov, F. Sabathe, L. Doucette-Stamm,<br />
P. Soucaille, M. J. Daly, G.<br />
N. Bennett, E. V. Koonin and D. R.<br />
Smith, Genome Sequence and Comparative<br />
Analysis of the Solvent-producing<br />
Bacterium Clostridium acetobutylicum, J.<br />
Bacteriol., 2001, 183, 4823–4838.<br />
15 M. Sebaihia, B. W. Wren, P. Mullany, N.<br />
F. Fairweather, N. Minton, R. Stabler, N.<br />
R. Thomson, A. P. Roberts, A. M. Cerdeno-Tarraga,<br />
H. Wang, M. T. Holden, A.<br />
Wright, C. Churcher, M. A. Quail, S.<br />
Baker, N. Bason, K. Brooks, T. Chillingworth,<br />
A. Cronin, P. Davis, L. Dowd, A.<br />
Fraser, T. Feltwell, Z. Hance, S. Holroyd,<br />
K. Jagels, S. Moule, K. Mungall, C. Price,<br />
E. Rabbinowitsch, S. Sharp, M. Simmonds,<br />
K. Stevens, L. Unwin, S. Whithead,<br />
B. Dupuy, G. Dougan, B. Barrell<br />
and J. Parkhill, The Multidrug-resistant<br />
Human Pathogen Clostridium difficile has<br />
a Highly Mobile, Mosaic Genome, Nat.<br />
Genet., 2006, 38, 779–786.<br />
16 C. Bettegowda, X. Huang, J. Lin, I.<br />
Cheong, M. Kohli, S. A. Szabo, X. Zhang,<br />
L. A. Diaz, Jr, V. E. Velculescu, G. Parmigiani,<br />
K. W. Kinzler, B. Vogelstein and<br />
S. Zhou, The Genome and Transcriptomes<br />
of the Anti-tumor Agent Clostridium novyi-NT,<br />
Nat. Biotechnol., 2006, 24,<br />
1573–1580.<br />
17 G. S. Myers, D. A. Rasko, J. K. Cheung, J.<br />
Ravel, R. Seshadri, R. T. DeBoy, Q. Ren,<br />
J. Varga, M. M. Awad, L. M. Brinkac, S.<br />
C. Daugherty, D. H. Haft, R. J. Dodson,<br />
R. Madupu, W. C. Nelson, N. J. Rosovitz,<br />
S. A. Sullivan, H. Khouri, G. I. Dimitrov,<br />
K. L. Watkins, S. Mulligan, J. Benton, D.<br />
Radune, D. J. Fisher, H. S. Atkins, T.<br />
Hiscox, B. H. Jost, S. J. Billington, J. G.<br />
Songer, B. A. McClane, R. W. Titball, J. I.<br />
Rood, S. B. Melville and I. T. Paulsen,<br />
Skewed Genomic Variability in Strains of<br />
the Toxigenic Bacterial Pathogen,<br />
Clostridium perfringens, Genome Res.,<br />
2006, 16, 1031–1040.<br />
18 T. Shimizu, K. Ohtani, H. Hirakawa, K.<br />
Ohshima, A. Yamashita, T. Shiba, N.<br />
Ogasawara, M. Hattori, S. Kuhara and<br />
H. Hayashi, Complete Genome Sequence<br />
of Clostridium perfringens, an Anaerobic<br />
Flesh-eater, Proc. Natl. Acad. Sci.<br />
U. S. A., 2002, 99, 996–1001.<br />
19 H. Bruggemann, S. Baumer, W. F. Fricke,<br />
A. Wiezer, H. Liesegang, I. Decker,<br />
C. Herzberg, R. Martinez-Arias, R. Merkl,<br />
A. Henne and G. Gottschalk, The Genome<br />
Sequence of Clostridium tetani, the<br />
Causative Agent of Tetanus Disease, Proc.<br />
Natl. Acad. Sci. U. S. A., 2003, 100,<br />
1316–1321.<br />
20 Y. Sakaguchi, T. Hayashi, K. Kurokawa,<br />
K. Nakayama, K. Oshima, Y. Fujinaga, M.<br />
Ohnishi, E. Ohtsubo, M. Hattori and K.<br />
Oguma, The Genome Sequence of<br />
Clostridium botulinum Type C Neurotoxin-Converting<br />
Phage and the Molecular Mechanisms<br />
of Unstable Lysogeny, Proc. Natl.<br />
Acad. Sci. U. S. A., 2005, 102, 17472–17477.<br />
21 G. Rocap, F. W. Larimer, J. Lamerdin, S.<br />
Malfatti, P. Chain, N. A. Ahlgren, A.<br />
Arellano, M. Coleman, L. Hauser, W. R.<br />
Hess, Z. I. Johnson, M. Land, D. Lindell,<br />
A. F. Post, W. Regala, M. Shah, S. L.<br />
Shaw, C. Steglich, M. B. Sullivan, C. S.<br />
Ting, A. Tolonen, E. A. Webb, E. R.<br />
Zinser and S. W. Chisholm, Genome Divergence<br />
in Two Prochlorococcus ecotypes<br />
Reflects Oceanic Niche Differentiation,<br />
Nature, 2003, 424, 1042–1047.<br />
22 A. Dufresne, M. Salanoubat, F. Partensky,<br />
F. Artiguenave, I. M. Axmann, V.<br />
Barbe, S. Duprat, M. Y. Galperin, E. V.<br />
Koonin, F. Le Gall, K. S. Makarova, M.<br />
Ostrowski, S. Oztas, C. Robert, I. B. Rogozin,<br />
D. J. Scanlan, N. Tandeau de Marsac,<br />
J. Weissenbach, P. Wincker, Y. I.<br />
Wolf and W. R. Hess, Genome Sequence<br />
of the Cyanobacterium Prochlorococcus<br />
marinus SS120, a Nearly Minimal Oxyphototrophic<br />
Genome, Proc. Natl. Acad.<br />
Sci. U. S. A., 2003, 100, 9647–9649.<br />
23 C. J. Benham and C. Bi, The Analysis of<br />
Stress-induced Duplex Destabilization in<br />
Long Genomic DNA Sequences, J.<br />
Comput. Biol., 2004, 11, 519–543.<br />
24 K. Liolios, N. Tavernarakis, P.<br />
Hugenholtz and N. C. Kyrpides, The<br />
Genomes On Line Database (GOLD)<br />
v.2: a monitor of genome projects worldwide,<br />
Nucleic Acids Res., 2006, 34,<br />
D332–D334.<br />
25 J. I. Rood and S. T. Cole, Molecular<br />
genetics and pathogenesis of Clostridium<br />
perfringens, Microbiol. Rev., 1991, 55,<br />
621–648.<br />
2.7 Paper II: Ten years of bacterial genome sequencing:<br />
comparative-genomics-based discoveries
Funct Integr Genomics (2006) 6: 165–185<br />
DOI 10.1007/s10142-006-0027-2<br />
REVIEW<br />
Tim T. Binnewies · Yair Motro · Peter F. Hallin ·<br />
Ole Lund · David Dunn · Tom La · David J. Hampson ·<br />
Matthew Bellgard · Trudy M. Wassenaar ·<br />
David W. Ussery<br />
Ten years of bacterial genome sequencing:<br />
comparative-genomics-based discoveries<br />
Received: 20 January 2006 / Revised: 24 February 2006 / Accepted: 7 March 2006 / Published online: 12 May 2006<br />
© Springer-Verlag 2006<br />
T. T. Binnewies · P. F. Hallin · O. Lund · D. W. Ussery (*)<br />
Center for Biological Sequence Analysis,<br />
Technical University of Denmark,<br />
2800 Lyngby, Denmark<br />
e-mail: dave@cbs.dtu.dk<br />
Y. Motro · D. Dunn · M. Bellgard<br />
Center for Bioinformatics and Biological Computing,<br />
Murdoch University,<br />
Murdoch, Western Australia 6150, Australia<br />
T. La · D. J. Hampson<br />
School of Veterinary and Biomedical Sciences,<br />
Murdoch University,<br />
Murdoch, Western Australia 6150, Australia<br />
T. M. Wassenaar<br />
Molecular Microbiology and Genomics Consultants,<br />
Zotzenheim, Germany<br />
Abstract It has been more than 10 years since the first<br />
bacterial genome sequence was published. Hundreds of<br />
bacterial genome sequences are now available for comparative<br />
genomics, and searching a given protein against<br />
more than a thousand genomes will soon be possible. This<br />
review addresses a relatively straightforward<br />
question: “What have we learned from this vast<br />
amount of new genomic data?” Perhaps one of the most<br />
important lessons has been that genetic diversity, at the<br />
level of large-scale variation amongst even genomes of the<br />
same species, is far greater than was thought. The classical<br />
textbook view of evolution relying on the relatively slow<br />
accumulation of mutational events at the level of individual<br />
bases scattered throughout the genome has changed. One<br />
of the most obvious conclusions from examining the<br />
sequences from several hundred bacterial genomes is the<br />
enormous amount of diversity, even in different genomes<br />
from the same bacterial species. This diversity is generated<br />
by a variety of mechanisms, including mobile genetic<br />
elements and bacteriophages. An examination of the 20<br />
Escherichia coli genomes sequenced so far dramatically<br />
illustrates this, with the genome size ranging from 4.6 to<br />
5.5 Mbp; much of the variation appears to be of phage<br />
origin. This review also addresses mobile genetic elements,<br />
including pathogenicity islands and the structure of<br />
transposable elements. There are at least 20 different<br />
methods available to compare bacterial genomes. Metagenomics<br />
offers the chance to study genomic sequences<br />
found in ecosystems, including genomes of species that are<br />
difficult to culture. It has become clear that a genome<br />
sequence represents more than just a collection of gene<br />
sequences for an organism and that information concerning<br />
the environment and growth conditions for the organism<br />
is important for interpretation of the genomic data. The<br />
newly proposed Minimal Information about a Genome<br />
Sequence standard has been developed to obtain this<br />
information.<br />
Keywords Bacterial genomics · Comparative genomics ·<br />
Bioinformatics · Genomic diversity ·<br />
Molecular evolution<br />
Introduction<br />
The year 1995 marked the publication of two human<br />
pathogenic bacterial genome sequences: Haemophilus<br />
influenzae (Fleischmann et al. 1995, US patent number<br />
6,528,289) and Mycoplasma genitalium (Fraser et al.<br />
1995, US patent number 6,537,773). Since then, more than<br />
300 bacterial genomes have been fully sequenced and<br />
become publicly available, including the sequence of a<br />
virulent form of H. influenzae (Harrison et al. 2005); the<br />
original H. influenzae strain sequenced in 1995 was from<br />
an isolate that does not cause disease. Although the<br />
majority of these several hundred genomes are from<br />
pathogenic organisms, some environmental bacterial genome<br />
sequences have also become available. This review<br />
article will provide a brief overview of sequenced bacterial<br />
genomes, their genomic diversity and some of the insights<br />
gained from analysis of this vast amount of data.<br />
Bacteria are microscopic unicellular prokaryotes that<br />
inhabit a wide variety of environmental niches, broadly<br />
distributed in three ecosystems: the soil, marine environments<br />
and other living organisms. Although there are
literally millions of bacterial species, only a small proportion<br />
of these can be grown in the laboratory (Handelsman<br />
2004). Bacteria (and Archaea) can be found almost<br />
anywhere in the environment: in the air, even in the<br />
International Space Station (Novikova et al. 2006), in<br />
thermal ducts found at great depths in the oceans (Alain et<br />
al. 2002; Vezzi et al. 2005), in the intestinal tracts of<br />
animals (Yan and Polk 2004; Backhed et al. 2005) and in<br />
soil and rocks, even thousands of meters deep (Torsvik et<br />
al. 1990). Bacteria also live within unicellular eukaryotes,<br />
algae, plants or animals. This diversity is reflected in their<br />
physiology, morphology, metabolism and ecosystems. For<br />
example, from a physiological perspective, most intestinal<br />
bacteria such as Escherichia coli are motile by means of<br />
flagella, to overcome the peristalsis of the gut, whilst the<br />
soil bacterium Clostridium perfringens does not possess<br />
such motility machinery (Shimizu et al. 2002). From a<br />
metabolic perspective, the versatile Burkholderia cepacia<br />
(formerly Pseudomonas cepacia) can utilise approximately<br />
100 different organic compounds as a sole energy source<br />
(Goldmann and Klinger 1986) compared to the strictly<br />
intracellular Mycobacterium tuberculosis which is dependent<br />
on only a few carbon sources produced by its<br />
involuntary host. From an inter-bacterial interaction<br />
perspective, sometimes bacteria cooperate. For example,<br />
Enterobacter cloacae and Pseudomonas mendocina positively<br />
interact to stimulate plant growth (Duponnois et al.<br />
1999). On the other hand, there are also bacteria which not<br />
only “do not cooperate” but exhibit predatory behavior,<br />
such as Bdellovibrio bacteriovorus (Rendulic et al. 2004).<br />
As for bacteria–host interactions, for a given bacterial<br />
species both pathogenic and non-pathogenic strains can<br />
exist (Dobrindt and Hacker 2001; Penyalver and Lopez<br />
1999), while other species may be exclusively parasitic<br />
(Goebel and Gross 2001), truly symbiotic (Gil et al. 2004)<br />
or commensal (Yan and Polk 2004) for their host. It is<br />
interesting to note that this diversity is somehow captured<br />
in the relatively small bacterial genomes.<br />
The first complete viral genome (φX174) was published<br />
in 1977 (Sanger et al. 1977). To put this into perspective,<br />
sequencing the 4.6-Mbp E. coli K-12 genome at that time<br />
(about a thousand base pairs (bp) could be sequenced per<br />
year in 1977) would have taken more than a thousand years,<br />
and sequencing the human genome would have taken<br />
more than a million years. The automation of<br />
sequencing methods, the invention of polymerase chain<br />
reaction (PCR) (Mullis et al. 1986) and the shotgun cloning<br />
procedure reduced costs and time, and provided the<br />
capability for large-scale sequencing. These developments<br />
together led to the sequencing of the first complete<br />
bacterial genome (Fleischmann et al. 1995) almost 20 years<br />
after the sequencing of φX174. The choice of the first<br />
bacterium to be completely sequenced (H. influenzae Rd<br />
KW20) was based on the following reasons: (1) the<br />
genome size was thought to be ‘typical’ among bacteria<br />
(1.8 Mbp), (2) the G + C base composition was close to that<br />
of the human genome (38%) and (3) the bacterium had<br />
important human health implications. In the absence of<br />
procedures to produce a genetic map for the species,<br />
genome sequencing proved to be a powerful<br />
alternative for genetic characterisation. This landmark<br />
work initiated the influx of genome sequence data which<br />
is now updated frequently and is publicly available. As of<br />
November 2005, there are more than 300 fully sequenced,<br />
publicly available bacterial genomes. Figure 1 shows this<br />
increase of sequence data over the past decade. 1<br />
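The sequencing-time estimate earlier in this paragraph is a simple division, which the short Python sketch below makes explicit. The rate and the E. coli K-12 genome size are taken from the text; the human genome size of roughly 3 Gbp is an assumption that the text does not state, and the function name is hypothetical.<br />

```python
BP_PER_YEAR_1977 = 1_000          # "about a thousand base pairs" per year (from the text)
E_COLI_K12_BP = 4_600_000         # 4.6 Mbp (from the text)
HUMAN_GENOME_BP = 3_000_000_000   # assumption: roughly 3 Gbp, not stated in the text


def years_to_sequence(genome_bp, bp_per_year=BP_PER_YEAR_1977):
    """Years needed to read a genome at a constant sequencing rate."""
    return genome_bp / bp_per_year


ecoli_years = years_to_sequence(E_COLI_K12_BP)    # 4600.0 years
human_years = years_to_sequence(HUMAN_GENOME_BP)  # 3000000.0 years
```

Under these assumptions both figures are consistent with the text: several thousand years for E. coli K-12 and a few million years for the human genome.<br />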
The total number of completed bacterial genome<br />
sequences has more than doubled over the past 2 years<br />
and, at the time of writing, there are 855 publicly listed<br />
bacterial and archaeal genome projects that are in various<br />
stages of progress. 2 In addition to new species, multiple<br />
strains of the same bacterial species are being sequenced.<br />
The amount of genomic data currently available has<br />
provided significant advances in our understanding of a<br />
number of important themes, including bacterial diversity,<br />
population characteristics, operon structure, mobile genetic<br />
elements (MGE) and horizontal gene transfer (HGT). It has<br />
also provided a number of challenges in understanding the<br />
ecology of, as yet, undiscovered bacterial worlds. The<br />
availability of whole genome sequences for pathogenic and<br />
commensal bacterial species has allowed a more detailed<br />
analysis of the complex interactions that occur with their<br />
plant or animal hosts. Figure 2a is a phylogenetic tree of<br />
300 sequenced bacterial genomes (available at the time of<br />
writing). Many of these genomes are from pathogenic<br />
bacteria living in complex ecosystems, such as the<br />
spirochaete Brachyspira pilosicoli labelled in red in the<br />
phylogenetic tree shown in Fig. 2b. This bacterium attaches<br />
to enterocytes to form a “false brush border” in the colon.<br />
Most genome sequencing projects are currently carried<br />
out using automated applications of the sequencing<br />
technique developed by Sanger et al. (1973), but newly<br />
developed methodologies may enable even more rapid<br />
sequencing in the future. Two papers have been published<br />
describing different methods for high-throughput sequencing<br />
of bacterial genomes (Pennisi 2005). One method is<br />
essentially a “do-it-yourself kit”, which uses a laser<br />
confocal microscope and other “off-the-shelf” components<br />
to build a sequencing machine capable of sequencing an E.<br />
coli genome in less than a day (Shendure et al. 2005). The<br />
second method is a commercial machine, based on<br />
pyrosequencing methodologies that generate many short<br />
pieces of DNA; this method was used to sequence a<br />
bacterial genome within a few hours (Margulies et al.<br />
2005). Although there are still some technical problems<br />
with both of these methods, it is clear that, in the near<br />
future, it will be possible to quickly sequence a bacterial<br />
genome at a considerably lower cost.<br />
1 Completed genome statistics obtained from the CBS atlas web pages http://www.cbs.dtu.dk/services/GenomeAtlas<br />
2 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj
Fig. 1 Cumulative number of complete published sequenced bacterial genomes (bars) and total number of basepairs (line) over the past decade (1995–2005)<br />
Genomic information<br />
DNA codes for more than just proteins<br />
The quality of annotation of bacterial genomes varies.<br />
A survey based on three different methods to<br />
predict the expected number of genes in a genome<br />
found that, for most bacterial genomes,<br />
around 20% of the annotated genes might not be “real”<br />
(Skovgaard et al. 2001). Conversely, proteomics experiments<br />
have detected some “real” genes that were not<br />
originally predicted, which highlights the<br />
dynamic nature of annotation and shows that genes are also missed<br />
(Jaffe et al. 2004). Over-annotation of bacterial genomes is<br />
a problem but, unfortunately, it cannot be easily avoided.<br />
On the one hand, no one wants to miss a gene and, on the<br />
other hand, small genes can be quite difficult to predict, as a<br />
short open reading frame could easily occur by statistical<br />
chance (Skovgaard et al. 2001).<br />
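The statistical argument about short open reading frames can be made concrete with a small calculation. The sketch below is only an illustration under a simplifying assumption that is not taken from Skovgaard et al.: a random sequence with equal base frequencies, so that 3 of the 64 codons are stop codons. The function name is hypothetical.<br />

```python
def prob_orf_at_least(n_codons: int, p_stop: float = 3 / 64) -> float:
    """Probability that n_codons consecutive random codons contain no stop codon.

    Assumes independent codons and a fixed per-codon stop probability
    (3/64 for equal base frequencies); real genomes deviate from this.
    """
    return (1.0 - p_stop) ** n_codons


# Short stop-free stretches are common by chance, long ones are not.
for length in (50, 100, 300):
    print(f"{length} codons: {prob_orf_at_least(length):.2e}")
```

Under this model the chance of a stop-free stretch falls off exponentially with length, which is why gene finders can trust long open reading frames far more than short ones.<br />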
Several automated annotation systems are currently available;<br />
one of them, the BaSys system (Van Domselaar et al. 2005),<br />
provides a comprehensive annotation of a DNA sequence<br />
file. To conduct comparative genomics with several<br />
hundred genomes, quality databases are essential, and the<br />
“GenomeAtlas” database, which was originally developed<br />
to store DNA structural information about the various<br />
sequenced genomes, is one example (Hallin and Ussery<br />
2004). Approximately a hundred different features for each<br />
genome (such as percent AT, coding skew bias, length of<br />
genome and number of genes) are currently made available<br />
through http://www.cbs.dtu.dk/services/GenomeAtlas/.<br />
Duplication of essentials<br />
One of the features of genomic sequences that can be easily<br />
recognised is the presence of repeat sequences. The most<br />
obvious and extensive repeats present in many bacterial<br />
genomes are the operons encoding the ribosomal RNA<br />
genes. These rRNA operons typically encode 16S and 23S<br />
rRNA separated by a short spacer, often followed by the 5S<br />
rRNA gene. All sequenced bacterial genomes possess at<br />
least one rRNA operon, and many (215 of 300) have two or<br />
more copies; the number of operons tends to correlate with<br />
bacterial division time. Thus, species that divide quickly<br />
(such as Bacillus cereus) have more copies of rRNA genes,<br />
so as to enable rapid production of ribosomes. In addition,<br />
species containing multiple rRNA operons appear to be<br />
more adaptable to changing environmental conditions<br />
(Acinas et al. 2004). The rRNA genes are a valuable tool<br />
for the estimation of taxonomic relationships (see Fig. 2a).<br />
These genes evolve slowly, presumably because they play<br />
an essential role as the backbone of ribosomes while<br />
interacting with multiple proteins. Any changes in the<br />
shape (sequence) of rRNA would most likely be fatal.<br />
Multiple copies per genome of tRNA genes can also be<br />
found in some genomes, again tending to correlate with<br />
division time. However, for tRNAs, the duplication<br />
number is also dictated by the frequency with which<br />
particular codons are used (or vice versa, as cause and<br />
effect cannot be distinguished here). This enables a less<br />
obvious level of regulating gene activity: a gene using<br />
many codons for which only one tRNA gene is available<br />
will probably be translated slowly, with those codons acting<br />
as a rate-limiting step, whereas<br />
abundant proteins are more likely to use tRNAs for which<br />
multiple gene copies are available. This is the basis for the<br />
codon adaptation index, which is a measure of the adaptation<br />
of a gene’s codon usage towards the optimal tRNA pool<br />
(Sharp and Li 1987).<br />
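As an illustration of how a codon adaptation index can be computed, the sketch below follows the general form of Sharp and Li's definition: each codon gets a relative adaptiveness w (its count divided by the count of the most used synonymous codon in a reference set of highly expressed genes), and the index of a gene is the geometric mean of w over its codons. This is a simplified sketch, not the published implementation: codons absent from the reference set are skipped here rather than given a pseudo-count, and all function names are hypothetical.<br />

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons, dropping any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w for each observed codon: count / count of the best synonymous codon."""
    counts = Counter(c for g in reference_genes for c in codons(g) if c in CODON_TABLE)
    best = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def cai(gene, w):
    """Geometric mean of w over the gene's codons (Met, Trp and stops excluded)."""
    logs = [math.log(w[c]) for c in codons(gene)
            if c in w and CODON_TABLE[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (made up): leucine codon CTG is used three times, CTA once.
w = relative_adaptiveness(["CTGCTGCTGCTA"])
print(cai("CTGCTG", w), cai("CTGCTA", w))
```

A gene that uses only the preferred codons scores 1, and every use of a rarer synonymous codon pulls the geometric mean down.<br />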
There are of course other duplications in bacterial<br />
genomes, some of which might appear at first glance to be<br />
less essential. For example, the ‘REP’ repetitive sequences<br />
frequently found in Enterobacteriaceae can be used as<br />
unique identifiers of bacterial genomes (Tobes and Ramos<br />
2005). It has been speculated that these repeats are<br />
meaningless, resulting from errors in replication, or that<br />
they may be a part of mobile elements that are able to<br />
translocate and duplicate themselves. Alternatively, they<br />
could be non-functional ‘molecular fossils’ of previous<br />
insertion events. Finally, it could well be that these repeats<br />
serve some as yet undiscovered useful purpose. It is<br />
possible, for example, that repetitive sequences and<br />
insertion sequence elements (ISs) contribute to genome<br />
plasticity through structural changes based on homologous<br />
recombination (Kennedy et al. 2001; Fraser-Liggett 2005).<br />
Fig. 2 a Phylogenetic tree of 287 sequenced bacterial genomes,<br />
based on alignments of the 16S rRNA gene sequence. The phyla<br />
are colour-coded; a more detailed view, with names of all the<br />
organisms, can be found in the supplemental information:<br />
http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/FIG10yr/. b Photomicrograph<br />
showing a dense fringe of anaerobic spirochaetes (B.<br />
pilosicoli) attached by one cell end to the luminal surface of human<br />
colonic enterocytes, forming a “false brush border”. Besides that of<br />
humans, B. pilosicoli colonises the large intestine of a variety of<br />
mammals and birds, causing diarrhoea and reduced growth rates.<br />
Genomic sequence from B. pilosicoli is being analysed to assist in<br />
understanding the genetic basis of this dense colonisation, including<br />
patterns of gene expression underlying the complex interactions that<br />
occur between individual bacterial cells and the colonised<br />
enterocytes. The photograph is courtesy of Dr. W. Bastiaan DeBoer,<br />
University of Western Australia, Perth, Western Australia<br />
A brief history of bacterial operons<br />
Much of the early classical work in microbiology was<br />
done with E. coli, as this bacterium is relatively easy to<br />
culture in the laboratory. As more and more genetic<br />
information was gathered, it was considered a ‘typical’<br />
bacterium, although E. coli is no more typical of bacteria<br />
than a rabbit is of all eukaryotic organisms. More than<br />
40 years ago, a model was proposed for gene regulation of<br />
the catabolism of lactose in E. coli (Jacob et al. 1960; Jacob<br />
and Monod 1961). The model described an operon as a<br />
cluster of genes with related functions (encoding, in this<br />
case, enzymes required for lactose degradation). This<br />
operon structure neatly allows regulation of gene expression<br />
by the concentration of lactose (Lewis et al. 1996;<br />
Reznikoff 1992). With the continuous expression of one<br />
small protein (a repressor), wasteful expression of several<br />
other catabolic enzymes in the absence of lactose is<br />
prevented.<br />
Since the discovery of the lac operon, many more<br />
catabolic operons have been discovered, with positive and<br />
negative feedback strategies, and these illustrate the<br />
biological need to use resources as efficiently as possible.<br />
Many, if not all, bacterial genomes indeed display clusters<br />
of genes involved in a single process (whether jointly<br />
transcribed and regulated, as in classical operons, or with<br />
separate promoters and regulators), but the degree of<br />
operon gene organisation and gene clustering differs<br />
between species. In some bacteria, such as Helicobacter<br />
pylori, operons are relatively poorly conserved, and genes<br />
involved in one cellular process can be dispersed<br />
throughout the genome (Tomb et al. 1997; Alm and Trust<br />
1999), although more recent work suggests that perhaps<br />
there are more operons in H. pylori than previously thought<br />
(Price et al. 2005). There are currently many resources for<br />
prediction of operons (Rogozin et al. 2004; Rosenfeld et al.<br />
2004; Alm et al. 2005; Janga et al. 2005; Nishi et al. 2005;<br />
Price et al. 2005; Vallenet et al. 2006), including several<br />
databases, such as the Operon Database (Okuda et al.<br />
2006), RegulonDB (Salgado et al. 2006a,b) and<br />
GeneChords (Zheng et al. 2005).<br />
How did the first operon evolve? Historically, three<br />
models have been proposed for the origins of gene<br />
clusters. The first model, which dates back to 1945,<br />
proposed the clustering of genes to be the direct result of<br />
gene duplication and evolution (Horowitz 1945, 1965).<br />
Gene duplication can occur during replication and, as a<br />
duplicated gene has more freedom to mutate, this is<br />
believed to be a classical mechanism for novel enzymes to<br />
evolve (Lazcano et al. 1995). However, although all genes<br />
within an operon may be involved in a single metabolic<br />
process, their function and structure can vary considerably,<br />
and a phylogenetic relationship between them is not always<br />
likely.<br />
The second model proposed for the evolution of operons<br />
is that coregulation of genes under a common promoter<br />
could provide a selective advantage (Jacob et al. 1960).<br />
However, we now know that it is possible to have<br />
coregulation of genes that are not physically linked<br />
together. Furthermore, this model does not really provide<br />
a gradual step-by-step mechanism for the evolution of<br />
operons.<br />
The third model for the evolution of an operon is that<br />
pre-existing genes moved together because of the selective<br />
advantage of having genes involved in the same<br />
biochemical pathways or processes physically<br />
close to each other. This hypothesis allows for structurally<br />
distinct genes to be part of one operon. This model requires<br />
both variation and frequent recombination and has been<br />
proposed as an explanation of clustering of genes in<br />
bacteriophage genomes (Stahl and Murray 1966; Juhala et<br />
al. 2000).<br />
In addition to these three views, there are other<br />
alternatives. Gene clustering may be of selective advantage<br />
in the case of horizontal gene transfer (see section below)<br />
and, based on this idea, a fourth mechanism, the ‘selfish<br />
operon’ model, was proposed (Lawrence and Roth 1996).<br />
This view has been recently called into question, based on<br />
the physical clustering of essential genes in the E. coli K-12<br />
genome (Pal and Hurst 2004). Two other alternatives for<br />
operon evolution deal with chromatin structure and the<br />
physical location of genes in bacterial chromosomes, where<br />
transcription and translation are coupled (Pal and Hurst<br />
2004). It is quite possible that there is no one<br />
“correct” mechanism, and that different mechanisms are<br />
involved at the same time. For example, the selective<br />
advantage of gene clustering during horizontal gene transfer<br />
is exemplified by the clustering of multiple antibiotic<br />
resistance genes on mobile genetic elements (Carattoli<br />
2001). In the era of antibiotic use, such genes are under<br />
strong selective pressure and are frequently passed on<br />
between bacteria by means of mobile elements. Whether<br />
these have directly contributed to the spread of catabolic and<br />
other operons between bacterial species is currently not<br />
known.<br />
What separates genes in a genome?<br />
In comparison to genes, the non-coding part of genomes<br />
receives far less attention. Some genomes are more<br />
densely packed than others. The average coding<br />
density is about 90%, ranging from 95% for Pelagibacter<br />
ubique (Giovannoni et al. 2005) to 51% for Sodalis<br />
glossinidius (Toh et al. 2006). Bacterial genes are not<br />
spliced as they are in eukaryotes; that is, introns are absent<br />
from nearly all bacterial genes. The sequences separating<br />
genes (intergenic regions) can be thought of as spacers<br />
where information on regulation of transcription can be<br />
stored, although sometimes these intergenic regions can<br />
also be more than regulatory and spacer domains.<br />
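Coding density, as used above, is the fraction of a genome that is covered by annotated genes. A minimal sketch of that calculation is given below; it assumes gene coordinates are given as 1-based inclusive (start, end) pairs on a linear coordinate system, counts overlapping genes only once, and ignores genes that span the origin of a circular chromosome. The function name and the toy coordinates are hypothetical.<br />

```python
def coding_density(genome_length, genes):
    """Fraction of the genome covered by at least one gene interval.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping or nested genes are counted once.
    """
    covered = 0
    last_end = 0  # right-most position already counted
    for start, end in sorted(genes):
        start = max(start, last_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / genome_length


# Toy annotation: two overlapping genes and one separate gene on a 1,000 bp genome.
print(coding_density(1000, [(1, 100), (50, 150), (201, 300)]))
```

The intergenic fraction is simply one minus this value.<br />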
Intergenic regions in the E. coli K-12 chromosome have<br />
been suggested to contain the sequences for several<br />
hundreds of small RNA genes which are transcribed but do<br />
not code for proteins (Chen et al. 2002). Many of these<br />
small RNAs act as regulators (Gottesman 2005).<br />
In general, the intergenic regions of bacterial genomes<br />
are more AT-rich, will melt more readily, are more curved<br />
and are more rigid than the chromosomal average<br />
(Pedersen et al. 2000; Hallin and Ussery 2004). This is<br />
true for nearly all of the several hundreds of bacterial<br />
genomes sequenced, regardless of AT content. These<br />
characteristics make sense in terms of mechanical properties<br />
needed for initiating transcription.<br />
Generation of genomic diversity in bacteria<br />
Genomic diversity is far greater than expected<br />
The view in many textbooks of biological diversity and<br />
evolution often envisions clonal bacteria which slowly<br />
evolve through the gradual accumulation of single-nucleotide<br />
changes. There might occasionally be a rare event<br />
where a new gene is duplicated but, in general, it has been<br />
commonly thought that if one were to sequence two<br />
different strains of a common bacterium like E. coli, the<br />
sequences would, for the most part, be similar and the two<br />
strains would share most (perhaps 90% or more) of their<br />
genes. At the time of writing, there are 20 different E. coli<br />
genomes which have either been completely sequenced or<br />
have an expected coverage of greater than 99% of<br />
the genome. Table 1 lists these genomes, and one of the<br />
surprising observations is the diversity just in size of the<br />
main chromosome, ranging from 5.5 to 4.6 Mbp; that is,<br />
close to a million base pairs present in some E. coli strains<br />
are missing in others. Furthermore, if one were to<br />
pick any one of these 20 strains, there would be more than a<br />
hundred genes which are unique to that strain and are not<br />
found in the other 19 E. coli genomes. Studies have<br />
indicated that much of this diversity comes from<br />
bacteriophages (Ohnishi et al. 2001).<br />
Table 1 Current E. coli genomes sequenced or in progress<br />
Escherichia coli strain | Length (bp) | Number of genes | Number of tRNAs | Number of rRNAs | Number of contigs | Accession number<br />
O157_EDL933 | 5,528,445 | 5,349 | 100 | 7 | 1 | AE005174<br />
E22 | 5,516,16 | 4,788 | NA | NA | 109 | AAJV00000000<br />
O157_RIMD0509952 | 5,498,450 | 5,361 | 103 | 7 | 1 | BA000007<br />
E110019 | 5,384,084 | 4,746 | NA | NA | 119 | AAJW00000000<br />
B171 | 5,299,753 | 4,467 | NA | NA | 159 | AAJX00000000<br />
53638 | 5,289,471 | 4,783 | NA | NA | 119 | AAKB00000000<br />
042 | 5,241,977 | 4,899 | 93 | 7 | 2 | Sanger Institute (unpublished)<br />
CFT073 | 5,231,428 | 5,379 | 89 | 7 | 1 | AE014075<br />
H10407 | ~5,208,000 | ~5,000 | NA | NA | 225 | Sanger Institute (unpublished)<br />
F11 | 5,206,906 | 4,467 | NA | NA | 88 | AAJU00000000<br />
B7A | 5,202,558 | 4,637 | NA | NA | 198 | AAJT00000000<br />
NMEC RS218 | 5,089,235 | ~4,900 | NA | NA | 1 | Uni. Wisc. (unpublished)<br />
E2348 | 5,072,200 | 4,594 | 71 | 7 | 4 | Sanger Institute (unpublished)<br />
E24377A | 4,980,187 | 4,254 | 97 | 6 | 1 | AAJZ00000000<br />
UPEC 536 | ~4,900,000 | ~4,800 | NA | NA | 1 | Uni. Würzburg (unpublished)<br />
101NA1 | 4,880,380 | 4,238 | NA | NA | 70 | AAMK00000000<br />
HS | 4,643,538 | 3,689 | 89 | 6 | 1 | AAJY00000000<br />
K-12_W3110 | 4,641,433 | 4,390 | 88 | 7 | 1 | AP009048<br />
K-12_MG1655 | 4,639,675 | 4,254 | 88 | 7 | 1 | U00096<br />
B03 | 4,629,810 | 4,387 | 86 | 6 | 1 | CNRS France (unpublished)<br />
NA: currently not annotated<br />
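The statement that each strain carries genes found in none of the others amounts to a set difference over gene families. The sketch below illustrates this with invented data; it assumes that genes have already been grouped into families by some sequence-similarity procedure, which is not shown, and the strain and family names are hypothetical placeholders rather than real annotations.<br />

```python
def unique_genes(strain_to_families):
    """For each strain, the gene families present in that strain and in no other."""
    result = {}
    for strain, families in strain_to_families.items():
        # Union of the families of every other strain.
        others = set().union(
            *(f for s, f in strain_to_families.items() if s != strain)
        )
        result[strain] = set(families) - others
    return result


def core_genes(strain_to_families):
    """Gene families shared by all strains."""
    sets = [set(f) for f in strain_to_families.values()]
    return set.intersection(*sets) if sets else set()


# Invented example with three strains and five gene families.
toy = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam5"},
}
print(unique_genes(toy))
print(core_genes(toy))
```

With real data, the size of each strain's unique set depends strongly on how gene families are defined, so counts such as those quoted in the text hold only for the stated comparison.<br />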
Gene order conservation<br />
When comparing bacterial genomes, two features are<br />
frequently analysed: gene presence and gene order. The<br />
presence or absence of genes is particularly interesting<br />
when two closely related species or strains that have<br />
different phenotypes, such as a pathogenic and a commensal<br />
strain of the same species, are compared (Hayashi et al.<br />
2001). Because the direction of the event that produced a<br />
difference (insertion or deletion) is not always<br />
clear, such a difference is generally described neutrally<br />
as an indel (INsertion/DELetion).<br />
There are various models of how the gene order within<br />
operons may have changed throughout evolution. It may be<br />
that the gene order in ancient ancestral operons has been<br />
maintained, such that all (or many) of the operons in<br />
studied genomes would be expected to have a similar gene<br />
structure. However, this view has been contradicted by data<br />
from whole genome studies. Examining the stability of<br />
operon structures over evolutionary distance shows that the<br />
majority of the gene orders within operons could be<br />
shuffled frequently during evolution, with the ribosomal<br />
protein operons as an exception (Itoh et al. 1999). Such<br />
observations support the alternative possibility that operons<br />
are the result of multiple evolutionary inventions. A more recent<br />
study has examined the evolution of the histidine operon in<br />
Proteobacteria and did find evidence for a gradual<br />
merging of genes with similar function into operons, at<br />
least in this case (Fani et al. 2005).<br />
Comparisons of gene order can also be informative of<br />
chromosomal translocations and inversions, which frequently<br />
happen in bacterial genomes (Kuwahara et al.<br />
2004). Such events are mostly neutral in terms of<br />
evolution, as they do not change the total genetic content<br />
of the cell, but translocations and inversions frequently<br />
coincide with insertions or deletions. Any of these<br />
processes can result from inaccurate excision of mobile<br />
genetic elements and, as such elements are frequently<br />
involved in generating diversity in bacteria, they deserve to<br />
be treated in a separate section.<br />
Table 2 Types of mobile genetic elements found in bacterial genomes<br />
MGE | Description | References<br />
Plasmids | Circular, self-replicating DNA molecules that exist in cells as extra-chromosomal replicons. Some plasmids can insert into the chromosome. | (Dobrindt et al. 2004)<br />
Transposons | DNA molecules that frequently change their chromosomal localisation, either within or between replicons. They usually code for a transposase and some other genes (such as antibiotic resistance genes), and are flanked by inverted repeat DNA sequences. | (Dobrindt et al. 2004)<br />
Conjugative transposons | Transposons that also carry genes related to plasmid-encoded conjugation, thus providing the ability to transfer between cells via conjugation | (Dobrindt et al. 2004)<br />
Bacteriophages | Prokaryote-infecting viruses, which can modify the host genome by coding new functions or by modifying existing functions. They are also capable of inserting into the genome (prophages). These are also agents of HGT. | (Dobrindt et al. 2004)<br />
Integrons | Genetic elements composed of a gene encoding an integrase (int gene; excises and integrates the gene cassettes from and into the integron), gene cassettes (become part of the integron upon integration; consist of a promoterless gene and a recombination site termed attC) and an integration site for the gene cassettes (attI gene) | (Fluit and Schmitz 2004; Holmes et al. 2003; Peters et al. 2001)<br />
Insertion sequence elements | Small, genetically compact DNA sequences, generally less than 2.5 kbp in length, that encode functions involved in their translocation and transpose both within and between genomes. IS elements are a subset of a general group of elements named transposable elements. These transposable elements are defined as DNA segments that carry the genes required for this process (and, in some cases, other genes), and consequently move about chromosomes and, more generally, genomes. | (Mahillon et al. 1999; Ou et al. 2006)<br />
Genomic islands | Large chromosomal regions that contain a cluster of functionally related genes, an operon or a number of operons, flanked by direct repeat sequences, and located near an integrase or transposase gene and a tRNA gene. | (Dobrindt et al. 2004)<br />
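One simple way to quantify the gene order conservation discussed in this section is to ask what fraction of neighbouring gene pairs in one genome are also neighbours in another. The sketch below is an illustrative measure chosen for this example, not the method of the studies cited above; it assumes each genome is given as a linear list of ortholog identifiers with no duplicated genes, ignores gene orientation, and uses hypothetical gene names.<br />

```python
def adjacent_pairs(order):
    """Unordered pairs of neighbouring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def gene_order_conservation(order_a, order_b):
    """Fraction of neighbouring pairs in A that are also neighbours in B.

    Only genes present in both genomes are considered, so that indels
    do not count as rearrangements.
    """
    shared = set(order_a) & set(order_b)
    a = [g for g in order_a if g in shared]
    b = [g for g in order_b if g in shared]
    pairs_a = adjacent_pairs(a)
    if not pairs_a:
        return 0.0
    return len(pairs_a & adjacent_pairs(b)) / len(pairs_a)


# Invented example: genome B carries an inversion of g3-g4 relative to genome A.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g2", "g4", "g3", "g5"]
print(gene_order_conservation(genome_a, genome_b))
```

An inversion breaks only the two adjacencies at its end points, so this measure stays high for genomes that differ by a few large rearrangements and drops as gene order becomes shuffled.<br />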
Mobile genetic elements<br />
MGEs are genomic elements that are capable of translocating<br />
themselves within or between genomes. When moving<br />
to a new genome, they may confer a new characteristic on<br />
the recipient. Their size ranges from hundreds of base pairs<br />
to more than 100 kbp. Plasmids, transposons, conjugative<br />
transposons, bacteriophages, integrons, insertion sequence<br />
elements and genomic islands (GEIs) are all considered<br />
MGEs (Table 2). Bacteriophages are the most sophisticated,<br />
as they produce their own protein coat to protect the<br />
genetic material (which can be DNA or RNA). Conjugative<br />
transposons induce conjugation between cells, a process in<br />
which cellular membranes merge to produce a bridge<br />
through which the transposon can move. Some plasmids<br />
can also induce conjugation (a transposon always encodes<br />
a transposase whereas a conjugative plasmid replicates<br />
without integration in the chromosome). Some of the<br />
definitions for the various MGEs partly overlap, as<br />
these terms are flexible. For instance, transposons can<br />
integrate in plasmids, and bacteriophages may contain<br />
insertion sequence elements (Burrus and Waldor 2004).<br />
MGEs constitute potentially foreign DNA located in a<br />
conceptual ‘flexible’ gene pool, from where ‘donated’<br />
DNA is made available for recipient cells. Once the MGE<br />
is transferred into the recipient cell, the DNA will either<br />
insert into a region on the chromosome or it will start to<br />
engage its own replication machinery. If the MGE is<br />
integrated into the genome, for example, like a pathogenicity<br />
island (PAI), the genes (or operon) will start to be<br />
expressed, thus adding a new characteristic to the cell. The<br />
MGE may later initiate ‘donation’ of DNA either to a next<br />
recipient (for which the trigger is as yet unknown) or to the<br />
flexible gene pool, perhaps taking with it a ‘new’ or<br />
additional gene or function. The integrated MGE may also<br />
become immobile as a result of chromosomal re-arrangements,<br />
duplications or sequence insertions/deletions. In<br />
that case, the integrated MGE<br />
becomes a permanent genomic element or genomic island.<br />
At a later stage, the genomic island may be modified and<br />
rendered mobile again, making it available for transfer to<br />
the flexible gene pool once again.<br />
As the subject of all MGEs listed <strong>in</strong> Table 2 would<br />
suffice a review paper on its own, this review focuses on<br />
two, namely, <strong>in</strong>sertion sequence elements <strong>and</strong> GEIs. These<br />
two MGEs are of particular <strong>in</strong>terest because our knowledge<br />
of them has improved dramatically as a direct result of<br />
genome sequence availability <strong>and</strong> due also to their impact<br />
on the diversity of bacteria.<br />
Insertion sequence elements<br />
IS elements are small DNA sequences, generally less than<br />
2.5 kb in length, that encode functions involved in their own<br />
translocation and can transpose both within and between<br />
genomes (Mahillon et al. 1999). IS elements were<br />
originally described as a subset of transposable elements<br />
(Prescott et al. 1999). They are the simplest form of<br />
MGE and a key component of the majority of the more<br />
complex transposable elements, found in both bacterial and<br />
eukaryotic genomes. A number of reviews deal with IS<br />
elements in greater depth (van Belkum et al. 1998;<br />
Mahillon et al. 1999; Galun 2003).<br />
An IS contains a transposase gene, flanked by terminal<br />
inverted repeats (the sequence of one flank is encoded on<br />
the opposite strand of the other flank). One of these repeats<br />
classically contains the promoter for the transposase gene<br />
(Fig. 3; Galun 2003). The IS elements are also flanked by<br />
short, directly repeated sequences, which are generated in<br />
the recipient DNA as a result of insertion.<br />
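The two structural signatures just described (terminal inverted repeats and flanking direct repeats) can be checked on a sequence with a few lines of code. The following is a minimal sketch, not a published IS-finding method: the function names, the toy sequences and the requirement of perfect (mismatch-free) repeats are all assumptions made for illustration.

```python
# Minimal sketch: checks for the two repeat signatures of an IS element.
# All names and the toy sequences are hypothetical; real IS annotation
# tolerates mismatches and uses curated transposase families.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def terminal_inverted_repeat_length(element: str, max_len: int = 50) -> int:
    """Length of the longest perfect terminal inverted repeat.

    The left end is an inverted repeat of the right end when it equals
    the reverse complement of the right end, i.e. when the element and
    its reverse complement share a common prefix.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


def has_target_site_duplication(host: str, start: int, end: int, dr_len: int) -> bool:
    """True if the dr_len bases before `start` equal the dr_len bases after `end`.

    `start` and `end` delimit the inserted element as host[start:end].
    """
    if start - dr_len < 0 or end + dr_len > len(host):
        return False
    return host[start - dr_len:start].upper() == host[end:end + dr_len].upper()


# Toy element: IRL + a short stand-in for the transposase ORF + IRR.
irl = "GGAAGGTG"
toy_is = irl + "ATGAAACCCGGGTTTTAA" + reverse_complement(irl)
toy_host = "CCC" + "ACG" + toy_is + "ACG" + "TTT"

print(terminal_inverted_repeat_length(toy_is))
print(has_target_site_duplication(toy_host, 6, 6 + len(toy_is), 3))
```

In the toy host, the 3 bp string "ACG" plays the role of the short direct repeat generated in the recipient DNA on insertion.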
The activity of transposable elements in genomes was<br />
first noted by McClintock (1950) in maize, although at that<br />
time the mechanism behind the observed genetic changes<br />
was not understood. Starlinger and Saedler (1976) provided<br />
the first review of IS elements in bacterial genomes. As<br />
noted by Lupski and Weinstock (1992), the first ISs were<br />
classified before their function, origin and dispersion<br />
mechanisms were understood. The present genomic era<br />
has resulted in advances in their classification, understanding<br />
of mechanisms of dispersion and identification of<br />
their role in evolution (van Belkum et al. 1998; Mahillon et<br />
al. 1999). Although the classical ISs are considered to be<br />
evolutionarily neutral, as each can only translocate its own<br />
transposase, they are the means by which genomic islands<br />
(for example PAIs and metabolic islands) are transferred,<br />
and they also play a role in plasmid integration (Rocha et al.<br />
1999). Variation in the excision of ISs promotes genome<br />
rearrangements (including deletions, inversions and replicon<br />
fusions; Mahillon et al. 1999). Antibiotic resistance<br />
genes are frequently spread within bacterial populations<br />
with the aid of ISs, which gives these simple elements<br />
clinical relevance. Finally, in special cases, IS elements can<br />
indirectly cause antigenic variation, a process in which a<br />
gene is switched off and on in a reversible manner within a<br />
bacterial population (Talarico et al. 2005). IS sequences that<br />
are present in the first part of a gene can cause slippage<br />
during replication, as DNA polymerase has difficulties with<br />
correct replication of short multiple repeats. The result can<br />
be a frame shift with consequential inactivation, but the<br />
next frame shift can restore gene function. Such slippage<br />
can also vary the distance and, thus, the activity of a promoter<br />
and its gene. Examples involving genes with a role in<br />
pathogenicity, with antigenic variation of surface exposed<br />
proteins, and environmental adaptation have been described<br />
(van Belkum et al. 1998; Rocha et al. 1999).<br />
Fig. 3 Organisation of a typical insertion sequence. The IS is<br />
represented as an open box in which the terminal inverted repeats are<br />
shown as blue boxes labelled IRL (left IR) and IRR (right IR). An<br />
open reading frame encoding the transposase (grey box) is located in<br />
the IS. WXY boxes flanking the IS represent short directly repeated<br />
sequences generated in the target DNA as a consequence of<br />
insertion. The transposase promoter is localised in IRL<br />
Monitoring of these elements has provided insights into<br />
bacterial genome molecular processes and the nature of IS<br />
elements. For example, understanding the regulatory<br />
mechanisms of IS elements has provided insights into the<br />
importance of the compromise struck by IS elements<br />
(and MGEs, in general) between maintaining a stable host genome and<br />
endangering the survival of the host through too much<br />
transposition activity (Nagy and Chandler 2004). It has<br />
also been suggested that IS expansion occurs during an<br />
evolutionary bottleneck, which reduces effective population<br />
size and the degree of intraspecies competition<br />
(Parkhill et al. 2003).<br />
Genomic islands<br />
GEIs, also referred to as integrative and conjugative<br />
elements or ICElands (van der Meer and Sentchilo 2003),<br />
are large chromosomal regions that cluster functionally<br />
related genes, are flanked by direct repeat sequences and<br />
are located near an integrase or transposase gene and often<br />
also near a tRNA. Furthermore, GEIs must have a GC<br />
composition different from the rest of the genome. GEIs<br />
include pathogenicity islands, symbiosis islands (SYIs),<br />
metabolic islands (MEIs), antibiotic resistance islands<br />
(REIs) and secretion system islands (SEIs) (Zhang and<br />
Zhang 2004). This remarkable variety of GEIs demonstrates<br />
the power of horizontal gene transfer, as they are<br />
believed to be the result of interspecies DNA transfer. With<br />
multiple genes neatly clustered in functional groups<br />
including all necessary regulatory and secretory genes,<br />
the power of transferring such ‘adaptive genetic bombs’<br />
can be easily imagined.<br />
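The criterion that a GEI has a GC composition different from the rest of the genome lends itself to a simple sliding-window scan. The sketch below is an illustration only: the window size, step and deviation threshold are arbitrary assumptions, and a real island search would combine this signal with the other criteria listed above (direct repeats, an integrase or transposase gene, a nearby tRNA).

```python
# Minimal sketch of a GC-deviation scan. Window, step and threshold
# are illustrative assumptions, not values taken from the review.

def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA string (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome: str, window: int = 1000, step: int = 500,
                     threshold: float = 0.10):
    """Return (start, end, gc) for windows whose GC fraction deviates
    from the genome-wide GC fraction by at least `threshold`."""
    genome_gc = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_fraction(genome[start:start + window])
        if abs(window_gc - genome_gc) >= threshold:
            hits.append((start, start + window, window_gc))
    return hits


# Toy genome: a GC-rich backbone with a 200 bp AT-only insert.
toy_genome = "GC" * 500 + "AT" * 100 + "GC" * 500
for start, end, gc in atypical_windows(toy_genome, window=200, step=200,
                                       threshold=0.2):
    print(start, end, gc)
```

Adjacent flagged windows would normally be merged into one candidate region before checking the flanks for repeats and integrase genes.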
Genome sequences have revealed that GEIs are common<br />
in bacteria as a result of successful horizontal transfers of<br />
DNA from a donor genome to a recipient genome. In most<br />
cases, the nature of the donor is unfortunately unknown.<br />
Even when an identified GEI bears a high resemblance to a<br />
section of another sequenced organism, one should not<br />
assume (though frequently this mistake has been made)<br />
that the GEI was directly received from that other<br />
organism. The transfer could well have involved a third<br />
unidentified species, serving either as an intermediate<br />
between the first two or as the donor for the others. These<br />
possibilities are frequently not recognised, as people can be<br />
misled by the available genome sequences and are not<br />
sufficiently aware of all those bacterial genomes for which<br />
we are currently lacking sequence information.<br />
The discovery of abundant genomic islands is strengthening<br />
the concept of a bacterial genome being quite<br />
dynamic and consisting of a backbone genome supplemented<br />
with adaptive genome modules, which may or may<br />
not be present in a given strain of the species (Fraser-Liggett<br />
2005). All modules available to the species (but<br />
never all present in one strain) would comprise the gene<br />
pool of that organism. This concept clearly does not apply<br />
to strictly clonal species, in which case all isolates or strains<br />
closely resemble each other (as is the case, for instance,<br />
with Bacillus anthracis), but it better describes the situation<br />
for frequently observed highly diverse species, such as E.<br />
coli or Streptomyces. Nevertheless, the timescale at which<br />
these events take place should not be ignored. Genomes are<br />
the sum of thousands of years of evolution. Observations of<br />
evolutionary events taking place in ‘real time’ are still<br />
relatively rare.<br />
Pathogenicity islands<br />
PAIs are now considered a subtype of genomic islands but<br />
were among the earliest islands to be described. PAIs<br />
harbour pathogenicity-related genes, thus potentially conferring<br />
a pathogenic phenotype on a recipient genome.<br />
Figure 4 illustrates a generalised model of a PAI. As with<br />
other GEIs, PAIs are commonly inserted into tRNA genes,<br />
which may be preferred sites of insertion due to their<br />
relative conservation and redundancy (Dobrindt et al.<br />
2004). PAIs are flanked by direct repeat sequences<br />
allowing for insertion into the recipient DNA and contain<br />
an integrase gene that enables the integration into the<br />
recipient DNA. A feature observed for many PAIs (and<br />
originally included in their definition although not always<br />
present) is the presence of a type III secretion system, a set<br />
of genes building an apparatus to specifically inject<br />
virulence factors into the host cell (Jores et al. 2004).<br />
Numerous investigations have identified and analysed PAIs<br />
(McGillivary et al. 2005; Middendorf et al. 2004; Paulsen<br />
et al. 2003; Schneider et al. 2004; Zubrzycki 2004; Schmidt<br />
and Hensel 2004).<br />
Fig. 4 Generalised diagrammatic representation of a pathogenicity<br />
island. Commonly inserted into a tRNA gene sequence, flanked by<br />
direct repeat sequences, containing an integrase (int) gene,<br />
commonly containing insertion sequence elements, and harbouring<br />
functional genes (with virulence associated properties), which may<br />
be organised into an operon structure. Sometimes, a type III<br />
secretion system is also present<br />
Horizontal gene transfer and restriction modification systems<br />
Evidence of HGT (also referred to as lateral gene transfer,<br />
LGT) dates back more than 30 years (Falkow 1975), with<br />
the finding of transposable elements. Although such events<br />
were considered only exceptional cases at that time, it is<br />
now evident that HGT events can make a substantial<br />
contribution to the generation of genetic diversity. As with<br />
all other features, the degree of horizontal transfer varies<br />
amongst species. Ochman et al. (2000) assessed 19<br />
completely sequenced bacterial genomes and reported<br />
that the proportion of foreign proteins varies from 0%<br />
(Mycoplasma genitalium) to about 17% (Synechocystis<br />
spp). These findings were supported by others including<br />
Dufraigne et al. (2005). Ortutay et al. (2003) undertook a<br />
genomic-scale phylogenetic analysis of protein-encoding<br />
genes from five closely related Chlamydia spp and<br />
identified a set of sequences that have arisen via HGT since<br />
the divergence of the Chlamydia lineage. These data<br />
illustrate the significant role of HGT in the evolution of<br />
particular bacterial species. It is not surprising that obligate<br />
intracellular pathogens show less evidence of recent HGT:<br />
they will not easily encounter other bacterial species with<br />
which to share DNA.<br />
Doolittle (1999a) listed three observations that can only<br />
be explained by HGT. The first observation is that<br />
phylogenetic trees based on individual protein-coding<br />
genes frequently differ substantially from the rRNA tree<br />
or from each other. The second observation comes from<br />
analysis, within a genome, of variation in G + C content,<br />
codon usage and gene order. The third observation is a<br />
result of between-genome comparisons, which show that<br />
all genomes contain particular genes that are more similar<br />
to homologues in distant genomes than to homologues in<br />
closer relatives or indeed that are absent from all known<br />
genomes of closer relatives. Combining this evidence,<br />
Doolittle (1999b) proposed an alternative to the tree of life<br />
to describe the evolutionary history of living organisms.<br />
His model of a web-like structure takes into account the<br />
influence of HGT, where interactions occur between<br />
ancestral organisms and descendants (branches) as well<br />
as between branches. A similar concept of a biological<br />
network has been further explored by Kunin et al. (2005).<br />
Such a concept is difficult to work with, and currently<br />
many microbiologists still accept a tree-like phylogenetic<br />
relationship, at least for an artificial ‘backbone’ of the<br />
species. Independent of the source (strain or species) of the<br />
genes, phylogenetic trees can indeed be correctly produced<br />
for many genes and gene families and may describe<br />
evolutionary relationships that do not date back very far.<br />
Going back further in time, the vertical lineages become<br />
weaker and the phylogenetic trees are less meaningful. The<br />
paradoxical conclusion is that, by elucidating more of the<br />
evolutionary history of bacteria, their history has become<br />
less clear.<br />
If it is really true that horizontal gene transfer is so<br />
general, how is it still possible to recognise bacterial<br />
species? First, HGT is not so frequent that it can be easily<br />
observed as DNA exchange in ‘real time’ (other than the<br />
uptake of plasmids, spread of antibiotic resistance genes or<br />
transfection of phages). Evidence for past HGT events can<br />
be seen in many bacterial genomes and exemplifies its<br />
importance in evolution but, without a time scale, the<br />
frequency of such events cannot be estimated. Second,<br />
there are barriers that restrict HGT. It is obvious that not all<br />
bacteria share the same gene pool and only bacteria that<br />
share an ecological niche are likely to encounter and share<br />
each other’s DNA. Even under circumstances that favour<br />
DNA exchange, internal factors restrict the success of<br />
HGT, notably bacteriophage specificity, plasmid incompatibility,<br />
and the activity of restriction modification (RM)<br />
systems. Finally, not all putatively horizontally transferred genes from E. coli<br />
are actually translated into proteins, perhaps because of<br />
incompatibility of translational machinery (Taoka et al.<br />
2004).<br />
The discovery of restriction enzymes which could cleave<br />
specific DNA sequences provided the basis for driving the<br />
“biotechnology revolution” in the 1970s. RM systems are<br />
popular in molecular genetics and are routinely used by<br />
most molecular biology laboratories throughout the world.<br />
The RM systems encode a modification enzyme that<br />
chemically modifies a specific short DNA sequence and a<br />
restriction endonuclease that will digest the DNA at that<br />
same specific recognition sequence unless the sequence has<br />
been modified (usually by methylation). Bacterial species<br />
(and frequently strains within a species) all have their own<br />
combination of RM systems (Roberts et al. 2005).<br />
Incoming DNA with a different modification pattern will<br />
be recognised by the endonuclease of the recipient strain,<br />
and the fate of such DNA is to be degraded. This is seen as<br />
a serious restriction for the spread of DNA through<br />
populations unless their RM systems are compatible.<br />
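The logic of an RM system (cleave at the recognition sequence unless that site carries the matching modification) can be written as a toy model. This is a deliberately simplified sketch: the function names are hypothetical, methylation is represented only as a set of protected site start positions rather than as base-level marks on both strands, and the well-known EcoRI recognition sequence GAATTC is used purely as an example.

```python
# Toy model of restriction versus modification. Names are hypothetical;
# methylation is simplified to a set of protected site start positions.

def cut_sites(dna: str, site: str, methylated=frozenset()):
    """Start positions of recognition sites that are NOT protected
    by methylation and would therefore be cleaved."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        if start not in methylated:
            positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def survives_restriction(dna: str, site: str, methylated=frozenset()) -> bool:
    """True if no unprotected recognition site remains in the DNA."""
    return not cut_sites(dna, site, methylated)


incoming = "AAGAATTCGGGAATTCTT"   # two GAATTC sites, at positions 2 and 10

# DNA from a donor without the matching modification enzyme: degraded.
print(survives_restriction(incoming, "GAATTC"))

# DNA from a donor with a compatible RM system: both sites protected.
print(survives_restriction(incoming, "GAATTC", methylated={2, 10}))
```

A recipient carrying several RM systems could be modelled by requiring `survives_restriction` to hold for every recognition sequence in its repertoire.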
The analysis of RM systems at a comparative genomics<br />
level (particularly the type II restriction endonucleases) has<br />
shown the dynamic state of the respective genes (Lin et al.<br />
2001) and raised a number of questions about the view that RM<br />
genes restrict gene flow. For example, H. pylori and<br />
Campylobacter jejuni are competent to take up DNA and<br />
have a large set of genes to maintain this property. The<br />
dynamic nature of the H. pylori genome and its natural<br />
competence are consistent with the weakly clonal population<br />
structure of H. pylori. Nevertheless, studies on H. pylori<br />
identified at least eight type II RM systems across two<br />
strains with an active restriction endonuclease and<br />
methylase (Kong et al. 2000; Lin et al. 2001). In addition,<br />
there were several active methylase genes without an active<br />
endonuclease. The occurrence of RM systems that are not<br />
shared between the strains suggests that new RM systems<br />
are readily acquired and subsequently lost as a result of<br />
mutation or recombination (Lin et al. 2001). That these<br />
would pose restriction barriers to gene flow is difficult to<br />
envisage given the dynamic population structure. RM genes<br />
possibly have other advantages to the cell. For methylation<br />
genes missing their matching restriction gene, it has been<br />
suggested that they may be used for regulating gene<br />
expression (as for DAM methylation in E. coli; Lobner-Olesen<br />
et al. 2005; Robbins-Manke et al. 2005) and for<br />
keeping track of which parts of the chromosome have been<br />
recently replicated (Maas 2004).<br />
Methods for comparing bacterial genomes<br />
There are at least 20 methods to compare bacterial<br />
genomes, as shown in Table 3. Some methods are more<br />
commonly used than the others, and it is beyond the scope<br />
of this review to provide a detailed analysis of each<br />
method. A few of these methods are discussed in this<br />
section.<br />
Chromosome alignment and size comparison<br />
Perhaps one of the easiest ways to compare genomes is by<br />
their sizes, as shown in Fig. 5. Although different phyla<br />
have different average sizes, it must be kept in mind that<br />
many of the phyla currently have few representatives and<br />
that there is a strong economic bias towards sequencing the<br />
smallest genomes, so the sizes shown here for<br />
the sequenced genomes could well be smaller than those that<br />
exist in natural ecosystems. Another way of comparing<br />
chromosomes is to do a simple alignment of the DNA<br />
sequences. There are two kinds of alignment<br />
programmes. One involves downloading some scripts<br />
and running them on a local computer, such as the Sanger<br />
Centre’s (Cambridge, UK) Artemis Comparison Tool<br />
(ACT, Carver et al. 2005), and the other is web-based,<br />
such as “WebACT”, a web-based version of ACT with precomputed<br />
comparisons between several hundred bacterial<br />
genomes. The latter might be easier to use for those<br />
biologists who are less computationally inclined (Abbott et<br />
al. 2005).<br />
AT content in genomes and promoter analysis<br />
Another relatively easy method to compare genomes is by<br />
their AT content, which ranges from 78% (Wigglesworthia<br />
glossinidia) to 27% (Clavibacter michiganensis) for the<br />
300 genomes sequenced at the time of writing. In addition<br />
to the average AT content for a whole genome, if the<br />
variation of the AT content within a given genome is<br />
examined, two general trends can be seen for nearly all of<br />
the bacterial genomes. First, on a more global chromosomal<br />
level, there is a tendency for the region around the<br />
origin of DNA replication to be more GC rich (i.e. less AT<br />
rich) and the region around the replication terminus to be<br />
more AT rich (Hallin et al. 2004b). Second, the average AT<br />
content for DNA about 400 bp upstream of the translation<br />
start site for all the genes in a genome is higher than 400 bp<br />
downstream (Hallin et al. 2004b). This makes sense in that<br />
the DNA will need to melt more easily in order for<br />
transcription to start.<br />
Table 3 Approaches to comparing bacterial genomes<br />
Level | Method | Reference<br />
Genome | Chromosome alignment | Carver et al. 2005<br />
 | AT content in the genome and upstream of genes | Ussery and Hallin 2004a<br />
 | Oligomer bias on leading or lagging strands | Worning et al. 2006<br />
 | Repeats (local and global) | Ussery et al. 2004a<br />
 | Periodicity of DNA structural properties | Worning et al. 2000<br />
 | Length comparison | Ussery and Hallin 2004b<br />
 | Promoter analysis | Ussery et al. 2004d<br />
Transcriptome | Organisation of rRNA operons | Ussery et al. 2004b<br />
 | tRNAs and codon usage | Ussery et al. 2004c<br />
 | Third nucleotide position bias in codon usage | Ussery et al. 2004c<br />
 | Annotation quality | Skovgaard et al. 2001<br />
Proteome | Amino acid usage | Ussery et al. 2004c<br />
 | BLAST atlases | Hallin et al. 2004a<br />
 | BLAST matrices | Binnewies et al. 2004<br />
 | Sigma factors | Kiil et al. 2005a<br />
 | Transcription factors | Kummerfeld 2006<br />
 | Secreted proteins | Bendtsen et al. 2005a<br />
 | Membrane proteins | Bendtsen et al. 2005b<br />
 | 2-D correlation of properties | Willenbrock et al. 2005<br />
 | Two component signal transduction systems | Kiil et al. 2005b<br />
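The comparison of AT content upstream and downstream of translation start sites, described in the AT content section, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, start sites are given as 0-based coordinates, all genes are assumed to lie on the forward strand (reverse-strand genes would first need reverse complementing), and plain averages are returned rather than the Z-scores used in the published plots.

```python
# Minimal sketch: mean AT fraction in the flank upstream and downstream
# of each translation start site. Forward-strand genes only (assumption).

def at_fraction(seq: str) -> float:
    """Fraction of A and T bases in a DNA string (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def upstream_downstream_at(genome: str, start_sites, flank: int = 400):
    """Return (mean upstream AT, mean downstream AT) over all start sites
    whose flanks fit entirely inside the sequence."""
    upstream, downstream = [], []
    for pos in start_sites:
        if pos - flank < 0 or pos + flank > len(genome):
            continue  # skip genes too close to the sequence ends
        upstream.append(at_fraction(genome[pos - flank:pos]))
        downstream.append(at_fraction(genome[pos:pos + flank]))
    if not upstream:
        raise ValueError("no start site has complete flanks")
    return sum(upstream) / len(upstream), sum(downstream) / len(downstream)


# Toy example: an AT-only 'promoter' followed by a GC-only 'gene'.
toy = "AT" * 5 + "GC" * 5
print(upstream_downstream_at(toy, [10], flank=10))
```

For a circular chromosome, flanks that cross the sequence ends could be handled by wrapping coordinates instead of skipping those genes.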
Fig. 5 Genome length distribution for 287 bacterial chromosomes,<br />
shown as box and whiskers plot for each phylum. The number of<br />
chromosomes in each phylum is shown on the axis. Most of the<br />
bacterial genomes shown are either Proteobacteria (156 genomes) or<br />
Firmicutes (70). At the time of writing, the largest complete bacterial<br />
genome sequenced is that of Burkholderia xenovorans, which<br />
consists of 9,703,676 bp within two chromosomes, and the smallest<br />
is the M. genitalium genome of 580,074 bp<br />
tRNAs, codon usage and amino acid<br />
As mentioned above, the 200 bp upstream of translation<br />
start sites is more AT rich, on average, than the 200 bp<br />
downstream. However, if the unsmoothed data are examined<br />
(the grey lines in Fig. 6, panel a), there is much “noise” in<br />
the coding sequence, compared to the upstream, noncoding<br />
DNA. This is due to bias in codon usage, as shown in<br />
Fig. 6, panel b. The genome for a given organism will tend<br />
to show a preference towards certain codons, which can be<br />
seen as a bias in the third codon position (Fig. 6, panel c).<br />
Finally, these codon biases are also in part affected by<br />
which amino acids an organism uses, as shown in panel d<br />
of Fig. 6. The amino acid usage for different E. coli<br />
proteomes differs: for example, E. coli K-12 shows the same<br />
amino acid usage as Salmonella enterica LT2, while the<br />
usage in E. coli O157 resembles that of Shigella flexneri.<br />
Thus, two different E. coli genomes can have quite<br />
different amino acid usage (which might not be that<br />
surprising in view of the differences between strains of this<br />
species, see Table 1).<br />
BLAST atlases<br />
The GenomeAtlas is a method to visualise structural<br />
features of an entire bacterial genome sequence as one plot.<br />
The plots are created using the “GeneWiz” programme,<br />
developed at CBS (Pedersen et al. 2000). A more recent<br />
extension of this method is the development of the<br />
“genome BLAST atlas”, in which genes from different<br />
genomes are blasted against a reference genome and<br />
visualised using an atlas plot. BLAST atlases can provide<br />
additional contextual information about regions which<br />
contain few conserved genes. For example, a new genome<br />
might have a few small islands of unique proteins, and<br />
these regions might be more AT rich or might be expected<br />
to be potentially highly expressed, based on chromosomal<br />
structural information also provided in the plots. As<br />
mentioned above, when the 20 E. coli sequenced genomes<br />
in Table 1 are compared, an enormous amount of diversity<br />
is found. A BLAST atlas for E. coli O157 is shown in Fig. 7a.<br />
Several regions of the chromosome have “holes” representing<br />
large segments of missing genes in some organisms,<br />
compared to the reference genome. In a sense, this<br />
information is somewhat similar to that obtained by the<br />
ACT plots mentioned above, although now the comparisons<br />
are being made at the level of presence/absence of<br />
clusters of proteins. In Fig. 7b, some of the regions<br />
containing gaps are more AT rich, some contain repeats and<br />
a few (marked) contain genes that might be highly<br />
expressed, based on chromatin properties. Thus, this tool<br />
can give a quick overview of the comparison of many<br />
genomes.<br />
In Fig. 7a, the gaps correspond to regions of missing<br />
genes in the E. coli O157 genome. Similar patterns can be<br />
seen for many other bacterial genomes. For example, in<br />
Fig. 7b, there are four large gaps in the C. jejuni RM1221<br />
genome compared to other epsilon Proteobacteria. These<br />
correspond to phage insertion sites in C. jejuni RM1221, as<br />
described in the original genome sequence publication<br />
(Fouts et al. 2005). Similar results have been observed for<br />
Streptococcus (Hallin et al. 2004a). In all three of these<br />
cases, there are large regions which contain many genes<br />
which are missing in other genomes of the same species.<br />
These clusters of genes often contain evidence that they<br />
came from phages, which appears to be an efficient method<br />
of bringing new DNA into a genome.<br />
Fig. 6 Genomic properties of Streptomyces coelicolor A3. a Comparison<br />
of AT content upstream and downstream of all 7,825 genes; the<br />
genes are all oriented in the same direction and aligned such that the<br />
translation start site is in the middle. Z-scores of standard deviations<br />
from the chromosomal average are plotted, as described previously<br />
(Ussery and Hallin 2004a). b Codon usage of the same set of 7,825<br />
genes. The frequency of occurrence of each of the 64 codons is plotted<br />
in a star plot; note that most codons have a relatively low frequency of<br />
usage. c Bias in the codon position is plotted as frequencies; note that<br />
there is a strong tendency for Cs and Gs in third position. d Amino acid<br />
usage of each of the 20 amino acids for the entire S. coelicolor<br />
proteome is plotted as frequency of the total; the amino acids in this plot<br />
are grouped according to their properties; for example, all the aliphatic<br />
amino acids (A, V, L, I and G) are together and, in general, there is a<br />
trend for this proteome to favour aliphatic amino acids, with the<br />
exception of isoleucine. The three star plots are as described previously<br />
(Ussery et al. 2004c)<br />
178
Fig. 7 Genome BLAST atlases. The outer circles represent BLAST
hits of a given genome (named in the legend) to the reference
genome (named in the center of the atlas). The colours are scaled
such that good BLAST hits (E = 10^-40) are darkly shaded, whilst
regions containing no hits are shown in light grey, as described
previously (Hallin et al. 2004a). a Genome BLAST atlas of E. coli
O157 EDL933 vs four other sequenced E. coli strains (the four
outermost circles; the genomes are, going from the outermost
towards the center, E. coli K-12 MG1655, E. coli K-12 W3110, E.
coli CFT073 and E. coli O157 RIMD0509952). b Genome BLAST
atlas of C. jejuni vs other epsilon Proteobacteria
BLAST matrices
Figure 7a,b illustrates the use of BLAST atlases to compare
genome sequences. However, with several hundred
genomes available, there is a need for a faster way of
getting an overview of genome similarity. One method is
the use of reciprocal hits, that is, to BLAST all the
proteins encoded in a genome of interest against those in
another genome (Binnewies et al. 2004). First, the genomes
of interest are selected (e.g. all genomes of Proteobacteria),
then a BLAST matrix can be displayed from this selection.
The results are pre-generated and the system keeps track of
sequence updates by generating MD5 checksums of all
sequences and of the combinations in which they have been
BLASTed. The MD5 checksum (also termed a message digest) is
a 32-digit hexadecimal string that is, for practical purposes,
unique to an input string, e.g. a genomic sequence. The system
maintains an all-against-all BLAST database, updating only the
missing comparisons; that is, changing the sequence of a record or
inserting a new record will cause a BLAST run of that
sequence against all the existing sequences of the database.
By having multiple genomes in a given selection, an
all-against-all BLAST matrix can be presented showing the
percentage of genes that are shared between sequences,
both on a protein and on a nucleotide level. Each such
percentage is supplied with a link to give a full listing from
the BLAST report. Fig. 8 shows an example of such a
BLAST matrix, with the diagonal (in red) reflecting the
internal homologues of a given genome. The boxes are
colour-coded such that the intensity represents the fraction
of hits (Binnewies et al. 2004).
Fig. 8 The BLAST table shows the overall protein homology
between all combinations of the five available Vibrio sequences.
Only hits covering at least 80% of the length of the gene
and with an E-value of 1×10 or better are counted. The diagonal
(red/pink) indicates the fraction of proteins that have homologous
hits within the proteome itself; the fraction is similar in
all genomes, and the intensity is shown by the red colour, scaled
from ~24% (grey) to ~27% (red). Note that the largest genome
also has the highest fraction of internal homologs. The
green area for the rest of the table, on each side of the
diagonal, shows the number of proteins that have homologous
hits between different Vibrio genomes. As before, the
fraction is indicated by the intensity of the colour (green),
scaled from ~57% (grey) to ~83% (green). In general, it is clear
that these organisms share a high percentage of their genes
with the other Vibrio species, which is to be expected
because they are from the same genus
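The checksum bookkeeping described in the BLAST matrices text can be
sketched in a few lines. This is a minimal illustration only, not the
authors' implementation: it assumes Python's standard hashlib module, and
the names md5_digest, pending_comparisons and done_pairs are hypothetical
stand-ins for whatever the real system stores in its database.

```python
import hashlib
from itertools import product


def md5_digest(sequence: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a sequence."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def pending_comparisons(sequences, done_pairs):
    """List the (query, subject) record names that still need a BLAST run.

    sequences:  dict mapping a record name to its sequence string.
    done_pairs: set of (query_md5, subject_md5) tuples already BLASTed
                (hypothetical bookkeeping table).
    """
    digests = {name: md5_digest(seq) for name, seq in sequences.items()}
    return [
        (query, subject)
        for query, subject in product(sorted(digests), repeat=2)
        if (digests[query], digests[subject]) not in done_pairs
    ]
```

Because the lookup key is the digest rather than the record name, editing a
sequence changes its digest, so every comparison involving that record is
reported as missing again, while untouched pairs are skipped.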
Meta-genomics: comparison of all the genomes
in an ecosystem
The term “metagenomics” is used for genome sequencing
projects in which many organisms are sequenced at once
by shotgun cloning of all DNA present in a sample
(Handelsman 2004). This enables microbial ecosystems
containing microbes that are not (presently) culturable in
pure form to be investigated (Handelsman 2004). The
reasons why organisms remain uncultured can be practical
(e.g. thermophilic bacteria grow at a temperature above the
melting point of agar), physiological (e.g. extremophiles
that grow in pure culture can have very different properties
from those observed in their true environment) or biological
(symbiotic life forms cannot be cultured in microbiologically
pure form). The first genome sequence obtained
from a non-culturable bacterium was indeed that of
Buchnera aphidicola, a symbiont of aphids. This sequence
was not obtained by metagenomics at the total genome
DNA level but rather at the rRNA level. Cell counts
compared to plate counts showed that the latter can be
wrong by orders of magnitude: many viable bacteria refuse to
grow on solid culture medium. The isolation of bulk RNA
and the subsequent determination of rRNA sequences
using specific primers allowed qualitative analysis to be
performed for identifying novel bacterial species or
ribotypes present in an ecosystem (Olsen et al. 1986).
The application of PCR improved the sensitivity of such
approaches, but the limitation to rRNA sequences confined
analyses to phylogenetic information only, and little further
knowledge was obtained about the new species. Metagenomics
can be used to generate complete or fragmented
genome sequences of organisms that might be abundant in
nature but are not easily culturable.
The acid mine drainage sequencing project has shown
the potential of metagenomics (Tyson et al. 2004). The
mine water of the Richmond mine is covered with a biofilm
of bacteria despite its hostile environment: an extremely acidic
pH (between 0 and 1), high concentrations of metal ions,
including copper, zinc and arsenic, and the absence of
carbon or nitrogen sources (other than from air). The
biofilm was composed of relatively few organisms,
enabling the sequencing of shotgun-cloned DNA and the
sorting of fragments according to their G + C content into
nearly complete bacterial genomes. A dominant bacterial
genus was identified, Leptospirillum, and a less abundant
Sulfobacillus spp. and some Archaea were also present. The
findings greatly improved understanding of this ecosystem.
The predominant bacteria were responsible for nitrogen
and carbon fixation (Leptospirillum group III), whereas
several species were able to generate energy from iron
oxidation (Ferroplasma and Leptospirillum spp.). Because, in this
approach, each sequenced DNA fragment is obtained from
a different individual (whereas in classical genome
sequencing all DNA is obtained from one clone),
information on polymorphisms also becomes available.
As more complex ecosystems are studied, the puzzle of
genome assembly becomes more difficult due to the
presence of more species, genomic rearrangements and
horizontal gene transfer events.
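Sorting fragments by G + C content, as mentioned for the acid mine drainage
biofilm, can be illustrated with a small sketch. It is a simplification
under stated assumptions, not the procedure of Tyson et al.: the function
names and the 5% bin width are hypothetical, and real binning would combine
G + C content with other evidence such as coverage.

```python
def gc_counts(sequence):
    """Return (number of G or C bases, number of A, C, G or T bases)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc, total


def bin_by_gc(fragments, bin_width_percent=5):
    """Group fragment names by the G + C percentage bin they fall in.

    fragments: dict mapping a fragment name to its DNA sequence.
    Returns a dict keyed by the lower edge of each bin (in percent).
    Integer arithmetic avoids floating-point edge effects at bin borders.
    """
    bins = {}
    for name, seq in fragments.items():
        gc, total = gc_counts(seq)
        if total == 0:
            continue  # nothing but ambiguous bases; cannot be placed
        lower_edge = (100 * gc) // (total * bin_width_percent) * bin_width_percent
        bins.setdefault(lower_edge, []).append(name)
    return bins
```

Fragments from organisms with clearly different base composition end up in
different bins; organisms with similar G + C content would need a second
criterion to be separated.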
The largest attempt so far at metagenomics was initiated
by C. Venter to sequence the microbial ecosystem in the
Sargasso Sea (Venter et al. 2004). Seawater was sampled
by filtering to specifically recover bacterial (and not viral or
amoebal) DNA. Over 1 billion base pairs of sequence were
generated, which were attributed to at least 1,800 species.
As the abundance of individual species determines their
coverage in shotgun cloning, this coverage (or rather the
mean of its Poisson distribution) was used to sort
DNA scaffolds (a scaffold is a reconstructed genomic
region), and oligonucleotide frequencies were used to
refine this sorting. Although the complexity of the
investigated ecosystem did not allow complete assembly
of individual genomes, the scaffolds belonging to the most
abundant species could be attributed to Burkholderia and
Shewanella-like species. As with the acid mine drainage
project, polymorphisms were detected with varying
frequencies. In fact, the dataset ranged from organisms
belonging to a single, clonal species (few polymorphisms)
to a population continuum in which some clonal
complexes could be recognised. These observations
illustrate the ‘unnatural’ approach of studying only pure
bacterial cultures that have a strict clonal structure, in
contrast to natural environments where the population
structure is much more fluid and the concept of clones or
species is more elusive. The most impressive output of the
Sargasso Sea study is the number of individual genes that
were identified (69,901). Among the surprising findings
was that rhodopsin (a light-driven proton pump that allows
bacteria to harvest light energy) was abundant outside the
proteobacteria where it had previously been identified. The finding of
many genes involved in phosphate uptake and utilisation of
poly- and pyrophosphates is puzzling, as the marine
environment is extremely phosphate-limited.
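The idea of sorting scaffolds by the mean of a Poisson coverage distribution
can be sketched as a maximum-likelihood assignment. This is an illustrative
assumption about how such sorting could work, not the method of Venter et
al.: the group labels, their mean coverages and the function names are
hypothetical.

```python
import math


def poisson_log_pmf(count, mean):
    """Natural log of the Poisson probability of `count` given `mean` > 0."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def assign_by_coverage(scaffold_depths, group_means):
    """Assign each scaffold to the coverage group that best explains it.

    scaffold_depths: dict mapping scaffold name to observed read depth (int).
    group_means:     dict mapping group label to its mean coverage (float).
    """
    assignment = {}
    for scaffold, depth in scaffold_depths.items():
        assignment[scaffold] = max(
            group_means,
            key=lambda group: poisson_log_pmf(depth, group_means[group]),
        )
    return assignment
```

A scaffold sequenced to a depth of 18 is far better explained by a group
with mean coverage 20 than by one with mean coverage 3, so it is placed with
the abundant organism; depths between the two means are ambiguous, which is
why a second signal such as oligonucleotide frequency is useful.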
The challenge of analysing the complex communities of a
nutrient-rich environment was taken up by Tringe and
Rubin (2005). One of the samples analysed was derived
from agricultural soil and three were from marine whale
carcasses. First, rRNA libraries were generated by PCR to
investigate the microbial diversity. The soil sample (DNA
obtained from 5 g of surface clay loam from land that had
been used for livestock) was extremely rich in species, with
at least 847 ribotypes detected, representing over 12 phyla.
The whale samples (two bone parts and one biofilm
covering a whale carcass) were less diverse but still
contained between 25 and 150 ribotypes. Although the
assembly of sequences obtained from shotgun libraries was
not possible, the genes that were identified on the
sequenced library clones demonstrated that approximately
half of the predicted proteins had homologs
in existing gene databases. Plotting the number of novel
gene families against the amount of generated sequence
suggested that, for the soil sample, few novel orthologues
were found after sequencing 25 Mbp. The functions of
predicted proteins from the sequences were naturally
diverse, but for the soil sample, potassium channelling
systems were overrepresented, whereas for the whale
samples sodium ion exporters were abundant, which fits
with the abundance of these two ions in the two
environments, respectively.
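The plot of novel gene families against the amount of sequence generated is
an accumulation curve, which can be computed as in the following sketch. The
input layout (batches of sequenced base pairs with the gene family
identifiers found in them) and the function names are hypothetical and are
not taken from Tringe and Rubin.

```python
def accumulation_curve(batches):
    """Return (cumulative bp, cumulative distinct families) after each batch.

    batches: iterable of (sequenced_bp, family_ids) tuples in sequencing order.
    """
    seen = set()
    total_bp = 0
    curve = []
    for sequenced_bp, family_ids in batches:
        total_bp += sequenced_bp
        seen.update(family_ids)
        curve.append((total_bp, len(seen)))
    return curve


def novel_per_batch(curve):
    """Number of families first seen in each batch (the slope of the curve)."""
    previous = 0
    gains = []
    for _, families in curve:
        gains.append(families - previous)
        previous = families
    return gains
```

When the per-batch gain approaches zero, additional sequencing mostly
rediscovers families that are already known, which is the flattening
described above for the soil sample.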
Metagenomics analyses will continue to expand the
databases, with the interpretation and assembly of
raw data becoming more complete. The human gastrointestinal
tract, for example, is the target of a metagenomics
sequencing project (Mongodin et al. 2005). It is apparent
that each individual carries a large variety of microflora,
probably acquired early in life (and which may have health
consequences even though these organisms are not pathogenic),
as well as bacterial microheterogeneity that was not
recognised previously. Contrary to the common belief that
Firmicutes and Bacteroides would be the most abundant
microbes present in the human gut, it appears that
Actinobacteria and Archaea may be more prominent
(Mongodin et al. 2005). The intestinal microflora of
obese mice differs considerably from that of lean animals,
an observation in support of the view that the microbiota of
mammals are good indicators (be it cause or effect) of their
health status (Ley et al. 2005). There are clearly many
microbial communities to be analysed and compared using
metagenomics.
Application: computational vaccine development
Vaccines remain an extremely important tool for controlling
infectious diseases of humans and animals, although
they are only available for about 10% of the microorganisms
known to be harmful to humans (Lund et al. 2005).
Traditional vaccines have typically incorporated whole live
attenuated or killed microorganisms, but, particularly for
use in humans, such vaccines now have limited application
due to concerns about safety, efficacy and/or ease of
production. Much recent work, therefore, has focused on
developing vaccines composed of prominent immunogenic
parts of microorganisms (subunit vaccines) or genes
encoding these components (genetic vaccines; Ellis
1999). For bacterial vaccine discovery, these newer
approaches have been greatly assisted by the recent
availability of whole genomic sequence data, which has
allowed a new approach to vaccine development called
“reverse vaccinology” (Rappuoli 2001).
In reverse vaccinology, bioinformatics tools are used to
undertake comprehensive in silico screening of genomic
sequence to identify genes encoding proteins that have
desirable characteristics. The power of this process has
increased as more and more genomic sequences that
encode proteins of known function become available in the
databases for comparative analysis. Targets for consideration
for use in vaccines include genes encoding outer
membrane proteins or lipoproteins, transmembrane domains
or export signal peptides, and proteins with
homologies to bacterial factors already known to be
involved in virulence or pathogenicity. Surface-exposed
or secreted proteins as well as virulence factors such as
toxins or adhesive factors are likely to induce an immune
response that may be protective (Zagursky and Russell
2001). In this way, large numbers of potential vaccine
components can be identified from a whole (or partial)
genome sequence. This approach was first taken for the
human pathogen Neisseria meningitidis serogroup B, with
600 open reading frames (ORFs) of potential interest
initially being identified (Pizza et al. 2000). Recombinant
proteins from 350 ORFs were eventually produced and,
after screening for distribution in different serotypes,
stability, immunogenicity and cross-protection, 15 were
selected as potential subunit vaccine candidates. This same
approach to vaccine discovery is now being taken for a
number of important human and animal pathogens (Serruto
et al. 2004). Reverse vaccinology allows rapid identification
of a large number of potential subunit vaccine
candidates, many of which would not have been recognised
by more traditional approaches. It is complemented by the
use of microarrays to analyse gene expression and of
proteomic approaches to study protein expression and
distribution, and can be focused further by the use of
computer algorithms that scan and identify sequences
encoding specific epitopes involved in immunogenicity
(reviewed in Lund et al. 2002; see also, for a review,
Theoretical Biology and Biophysics Group, Los Alamos
National Laboratory
[http://www.hiv.lanl.gov/content/immunology/pdf/2002/1/Lund2002.pdf]).
These algorithms have been strengthened by the availability of full
genomic sequences for many pathogens.
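The first in silico screening step of reverse vaccinology amounts to
filtering predicted ORFs on a handful of annotated features. The sketch below
shows the shape of such a filter; the OrfAnnotation record, its field names
and the localisation labels are hypothetical, and in practice each field
would be filled in by a separate prediction tool.

```python
from dataclasses import dataclass

# Hypothetical localisation labels treated as surface-exposed or secreted.
SURFACE_LOCALISATIONS = {"outer membrane", "lipoprotein", "secreted"}


@dataclass
class OrfAnnotation:
    name: str
    localisation: str           # predicted localisation, e.g. "cytoplasm"
    signal_peptide: bool        # export signal peptide predicted
    transmembrane_domains: int  # number of predicted transmembrane domains
    virulence_homolog: bool     # homology to a known virulence factor


def is_candidate(orf: OrfAnnotation) -> bool:
    """True if the ORF shows any of the features listed in the text."""
    return (
        orf.localisation in SURFACE_LOCALISATIONS
        or orf.signal_peptide
        or orf.transmembrane_domains > 0
        or orf.virulence_homolog
    )


def select_candidates(orfs):
    """Return the names of ORFs that pass the in silico screen, in order."""
    return [orf.name for orf in orfs if is_candidate(orf)]
```

A permissive "any feature" rule is used here because the computational
screen is only meant to shortlist ORFs; the later laboratory screens for
distribution, stability, immunogenicity and cross-protection narrow the list.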
Prediction methods have been developed for the three main
types of epitope, those targeting B cells, helper T lymphocytes
and cytotoxic T lymphocytes, and improved methods are constantly
being developed. Thus, it is possible to take a genome
sequence, use some of the predictors described above and
select potential peptide sequences for construction of
vaccines. These vaccines can be based either on chemically
synthesised peptides or on DNA. Peptides
can be used directly or used to construct a
“polytope”, which is a composite protein made from
individual epitopes.
Intellectual property rights: who owns the genome
sequence?
This review started by giving the US patent numbers for the
first two genomes sequenced. This final section will briefly
discuss some of the issues facing researchers working with
genomic data. At the time of writing, ten whole genome
patents have been granted, with more patents being applied
for (O’Malley et al. 2005). Some of these patents include
the use of the sequence in silico and clearly raise a number
of issues related to freedom to operate in research. In
addition, the enforcement of the patents could be difficult,
with many bioinformatic tools being developed in the
public domain.
Another related difficulty has to do with using or
analysing genome sequences before they are presented in
scientific publications. Now that it is possible to sequence a
bacterial genome in an afternoon and have a GenBank file a
day or two later, the time gap between having the sequence
publicly available and having the paper in print can be
several years. Some public granting agencies have pushed
hard for the data to be made available as soon as possible
for people to search for their particular gene of interest. On
the other hand, it is also understandable that the individuals
who have actually sequenced the genomes need some lead
time to analyse their data. With high-throughput bioinformatic
techniques, it is possible, for example, for some
groups to do in a few days what would take other groups
months (or years) to complete.
A final problem has to do with obtaining basic
information about the strain used for sequencing a genome.
For example, what was the strain isolated from? What was
the growth temperature or culture medium pH for the
culture that the genomic DNA was derived from? What is
the doubling time of this organism under these conditions?
These are all important pieces of data, but they are often
missing in genome publications. A “minimal
information about a genome sequence” standard has recently been
proposed (Field and Hughes 2005), which is in the same
spirit as the MIAME standard for microarray experiments (see footnote 3).
In the future, it could well be that something resembling a
GenBank file with additional biological information will be
the “publication” for a bacterial genome sequence, as
genome sequencing becomes ever cheaper and easier to
perform. Overall, it is important that genome sequence
information is released into the public domain in a timely
manner so that global scientific progress can be maintained.
Acknowledgements DWU, PFH and TTB are supported by grants
from the Danish Research Foundation. We are grateful to the Sanger
Center for allowing prepublication access to the sequences for the E.
coli 042 genome (the DNA sequence and annotation files were
downloaded from the Sanger web site http://www.sanger.ac.uk/).
References
Abbott JC, Aanensen DM, Rutherford K, Butcher S, Spratt BG
(2005) WebACT—an online companion for the Artemis
Comparison Tool. Bioinformatics 21(18):3665–3666
Acinas SG, Marcelino LA, Klepac-Ceraj V, Polz MF (2004)
Divergence and redundancy of 16S rRNA sequences in genomes
with multiple rrn operons. J Bacteriol 186(9):2629–2635
Alain K, Querellou J, Lesongeur F, Pignet P, Crassous P, Raguenes G,
Cueff V, Cambon-Bonavita M-A (2002) Caminibacter hydrogeniphilus
gen. nov., sp. nov., a novel thermophilic, hydrogen-oxidizing
bacterium isolated from an East Pacific Rise
hydrothermal vent. Int J Syst Evol Microbiol 52:1317–1323
Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL,
Arkin AP (2005) The MicrobesOnline Web site for comparative
genomics. Genome Res 15(7):1015–1022
Alm RA, Trust TJ (1999) Analysis of the genetic diversity of
Helicobacter pylori: the tale of two genomes. J Mol Med
77(12):834–846 (Review)
Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI (2005)
Host–bacterial mutualism in the human intestine. Science
307(5717):1915–1920
Bendtsen JD, Binnewies TT, Hallin PF, Sicheritz-Ponten T, Ussery
DW (2005a) Genome update: prediction of secreted proteins in
225 bacterial proteomes. Microbiology 151(Pt 6):1725–1727
Bendtsen JD, Binnewies TT, Hallin PF, Ussery DW (2005b)
Genome update: prediction of membrane proteins in prokaryotic
genomes. Microbiology 151(Pt 7):2119–2121
Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2004) Genome
update: proteome comparisons. Microbiology 151(Pt 1):1–4
Burrus V, Waldor MK (2004) Shaping bacterial genomes with
integrative and conjugative elements. Res Microbiol
155(5):376–386
Carattoli A (2001) Importance of integrons in the diffusion of
resistance. Vet Res 32(3–4):243–259
Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell
BG, Parkhill J (2005) ACT: the Artemis Comparison Tool.
Bioinformatics 21(16):3422–3423
Footnote 3: http://www.ucl.ac.uk/wibr/services/docs/miamiv1.doc
Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ,
Blyn LB (2002) A bioinformatics based approach to discover
small RNA genes in the Escherichia coli genome. Biosystems
65(2–3):157–177
Dobrindt U, Hacker J (2001) Whole genome plasticity in pathogenic
bacteria. Curr Opin Microbiol 4(5):550–557
Dobrindt U, Hochhut B, Hentschel U, Hacker J (2004) Genomic
islands in pathogenic and environmental microorganisms. Nat
Rev Microbiol 2(5):414–424
Doolittle WF (1999a) Lateral genomics. Trends Cell Biol
9(12):M5–M8
Doolittle WF (1999b) Phylogenetic classification and the universal
tree. Science 284(5423):2124–2129
Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P (2005)
Detection and characterisation of horizontal transfers in
prokaryotes using genomic signature. Nucleic Acids Res
33(1):e6
Duponnois R, Ba AM, Mateille T (1999) Beneficial effects of
Enterobacter cloacae and Pseudomonas mendocina for biocontrol
of Meloidogyne incognita with the endospore-forming
bacterium Pasteuria penetrans. Nematology 1(1):95–101
Ellis RW (1999) New technologies for making vaccines. Vaccine
17(13–14):1596–1604
Falkow S (1975) Infectious multiple drug resistance. Pion Limited,
London, England
Fani R, Brilli M, Lio P (2005) The origin and evolution of operons:
the piecewise building of the proteobacterial histidine operon.
J Mol Evol 60(3):378–390
Field D, Hughes J (2005) Cataloguing our current genome
collection. Microbiology 151(Pt 4):1016–1019
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF,
Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM,
McKenney K, Sutton G, FitzHugh W, Fields C, Gocayne JD, Scott
J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips
CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna
MC, Nguyen DT, Saudek DM, Brandon RC, Fine LD, Fritchman
JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA,
Small KV, Fraser CM, Smith HO, Venter JC (1995) Whole-genome
random sequencing and assembly of Haemophilus
influenzae Rd. Science 269(5223):496–498, 507–512
Fluit AC, Schmitz F-J (2004) Resistance integrons and super-integrons.
Clin Microbiol Infect 10:272–288
Fouts DE, Mongodin EF, Mandrell RE, Miller WG, Rasko DA,
Ravel J, Brinkac LM, DeBoy RT, Parker CT, Daugherty SC,
Dodson RJ, Durkin AS, Madupu R, Sullivan SA, Shetty JU,
Ayodeji MA, Shvartsbeyn A, Schatz MC, Badger JH, Fraser
CM, Nelson KE (2005) Major structural differences and novel
potential virulence mechanisms from the genomes of multiple
Campylobacter species. PLoS Biol 3(1):e15
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA,
Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM,
Fritchman RD, Weidman JF, Small KV, Sandusky M,
Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips
CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC,
Lucier TS, Peterson SN, Smith HO, Hutchison CA 3rd, Venter
JC (1995) The minimal gene complement of Mycoplasma
genitalium. Science 270(5235):397–403
Fraser-Liggett CM (2005) Insights on biology and evolution from
microbial genome sequencing. Genome Res 15:1603–1610
Galun E (2003) Transposable elements: a guide to the perplexed and
the novice. Kluwer Academic, Dordrecht, The Netherlands, pp
25–73
Gil R, Latorre A, Moya A (2004) Bacterial endosymbionts of insects:
insights from comparative genomics. Environ Microbiol
6(11):1109–1122
Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, Baptista D,
Bibbs L, Eads J, Richardson TH, Noordewier M, Rappe MS,
Short JM, Carrington JC, Mathur EJ (2005) Genome
streaml<strong>in</strong><strong>in</strong>g <strong>in</strong> a cosmopolitan oceanic bacterium. Science<br />
309(5738):1242–1245
Goebel W, Gross R (2001) Intracellularsurvivalstrategiesofmutualistic<br />
<strong>and</strong> parasitic prokaryotes. Trends Microbiol 9(6):267–273<br />
Goldmann DA, Kl<strong>in</strong>ger JD (1986) Pseudomonas cepacia:<br />
biology, mechanisms of virulence, epidemiology. J Pediatr<br />
108(5 Pt 2):806–812<br />
Gottesman S (2005) Micros for microbes: non-cod<strong>in</strong>g regulatory<br />
RNAs <strong>in</strong> bacteria. Trends Genet 7:399–404<br />
Hall<strong>in</strong> PF, Ussery DW (2004) <strong>CBS</strong> genome atlas database: a dynamic<br />
storage for bio<strong>in</strong>formatic results <strong>and</strong> sequence data. Bio<strong>in</strong>formatics<br />
20(18):3682–3686<br />
Hall<strong>in</strong> PF, B<strong>in</strong>newies TT, Ussery DW (2004a) Genome update:<br />
chromosome atlases. Microbiology 150(Pt 10):3091–3093<br />
Hall<strong>in</strong> PF, Coenye T, B<strong>in</strong>newies TT, Jarmer H, Saerfeldt HH, Ussery<br />
DW (2004b) Genome update: correlation of bacterial genomic<br />
properties. Microbiology 150(Pt 12):3899–3903<br />
H<strong>and</strong>elsman J (2004) Metagenomics: application of genomics to<br />
uncultured microorganisms. Microbiol Mol Biol Rev 68:669–685<br />
Harrison A, Dyer DW, Gillaspy A, Ray WC, Mungur R, Carson<br />
MB, Zhong H, Gipson J, Gipson M, Johnson LS, Lewis L,<br />
Bakaletz LO, Munson RS Jr (2005) Genomic sequence of an<br />
otitis media isolate of nontypeable Haemophilus <strong>in</strong>fluenzae:<br />
comparative study with H. <strong>in</strong>fluenzae serotype d, stra<strong>in</strong> KW20.<br />
J Bacteriol 187(13):4627–4636<br />
Hayashi T, Mak<strong>in</strong>o K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama<br />
K, Han CG, Ohtsubo E, Nakayama K, Murata T, Tanaka M,<br />
Tobe T, Iida T, Takami H, Honda T, Sasakawa C, Ogasawara N,<br />
Yasunaga T, Kuhara S, Shiba T, Hattori M, Sh<strong>in</strong>agawa H<br />
(2001) Complete genome sequence of enterohemorrhagic<br />
Escherichia coli O157:H7 <strong>and</strong> genomic comparison with a<br />
laboratory stra<strong>in</strong> K-12. DNA Res 8:11–22<br />
Holmes AJ, Gill<strong>in</strong>gs MR, Nield BS, Mabbutt BC, Nevala<strong>in</strong>en KM,<br />
Stokes HW (2003) The gene cassette metagenome is a basic<br />
resource for bacterial genome evolution. Environ Microbiol 5<br />
(5):383–394<br />
Horowitz NH (1945) On the evolution of biochemical synthesis.<br />
Proc Natl Acad Sci U S A 31:153–157<br />
Horowitz NH (1965) The evolution of biochemical synthesis—<br />
retrospect <strong>and</strong> prospect. In: Bryson V, Vogel HJ (eds) Evolv<strong>in</strong>g<br />
genes <strong>and</strong> prote<strong>in</strong>s. Academic, New York, pp 15–23<br />
Itoh T, Takemoto K, Mori H, Gojobori T (1999) Evolutionary<br />
<strong>in</strong>stability of operon structures disclosed by sequence comparisons<br />
of complete microbial genomes. Mol Biol Evol 3:332–346<br />
Jacob F, Monod J (1961) Genetic regulatory mechanisms <strong>in</strong> the<br />
synthesis of prote<strong>in</strong>s. J Mol Biol 3:318–356<br />
Jacob F, Perr<strong>in</strong> D, Sanchez C, Monod J (1960) Operon: a group of<br />
genes with the expression coord<strong>in</strong>ated by an operator. C R<br />
Hebd Seances Acad Sci 250:1727–1729<br />
Jaffe JD, Stange-Thomann N, Smith C, DeCaprio D, Fisher S,<br />
Butler J, Calvo S, Elk<strong>in</strong>s T, FitzGerald MG, Hafez N, Kodira<br />
CD, Major J, Wang S, Wilk<strong>in</strong>son J, Nicol R, Nusbaum C,<br />
Birren B, Berg HC, Church GM (2004) The complete genome<br />
<strong>and</strong> proteome of Mycoplasma mobile. Genome Res 14<br />
(8):1447–1461<br />
Janga SC, Collado-Vides J, Moreno-Hagelsieb G (2005) Nebulon: a<br />
system for the <strong>in</strong>ference of functional relationships of gene<br />
products from the rearrangement of predicted operons. Nucleic<br />
Acids Res 33(8):2521–2530<br />
Jores J, Rumer L, Wieler LH (2004) Impact of the locus of enterocyte<br />
effacement pathogenicity isl<strong>and</strong> on the evolution of pathogenic<br />
Escherichia coli. Int J Med Microbiol 294(2–3):103–113<br />
(Review)<br />
Juhala RJ, Ford ME, Duda RL, Youlton A, Hatfull GF, Hendrix RW<br />
(2000) Genomic sequences of bacteriophages HK97 <strong>and</strong><br />
HK022: pervasive genetic mosaicism <strong>in</strong> the lambdoid bacteriophages.<br />
J Mol Biol 299(1):27–51<br />
Kennedy SP, Ng WV, Salzberg SL, Hood L, DasSarma S (2001)<br />
Underst<strong>and</strong><strong>in</strong>g the adaptation of Halobacterium species NRC-1<br />
to its extreme environment through computational analysis of<br />
its genome sequence. Genome Res 11:1641–1650<br />
Kiil K, B<strong>in</strong>newies TT, Sicheritz-Ponten T, Willenbrock H, Hall<strong>in</strong> PF,<br />
Wassenaar TM, Ussery DW (2005a) Genome update: sigma factors<br />
<strong>in</strong> 240 bacterial genomes. Microbiology 151(Pt 10):3147–3150<br />
183<br />
Kiil K, Ferchaud JB, David C, B<strong>in</strong>newies TT, Wu H, Sicheritz-<br />
Ponten T, Willenbrock H, Ussery DW (2005b) Genome update:<br />
distribution of two-component transduction systems <strong>in</strong> 250<br />
bacterial genomes. Microbiology 151(Pt 11):3447–3452<br />
Kong H, L<strong>in</strong> L-F, Porter N, Stickel S, Byrd D, Posfai J, Roberts RJ<br />
(2000) Functional analysis of putative restriction–modification<br />
system genes <strong>in</strong> the Helicobacter pylori J99 genome. Nucleic<br />
Acids Res 28:3216–3223<br />
Kummerfeld SK, Teichmann SA (2006) DBD: a transcription factor<br />
prediction database. Nucleic Acids Res 34(Database issue):<br />
D74–D81<br />
Kun<strong>in</strong> V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net<br />
of life: reconstruct<strong>in</strong>g the microbial phylogenetic network.<br />
Genome Res 15(7):954–959<br />
Kuwahara T, Yamashita A, Hirakawa H, Nakayama H, Toh H,<br />
Okada N, Kuhara S, Hattori M, Hayashi T, Ohnishi Y (2004)<br />
Genomic analysis of Bacteroides fragilis reveals extensive<br />
DNA <strong>in</strong>versions regulat<strong>in</strong>g cell surface adaptation. Proc Natl<br />
Acad Sci U S A 101(41):14919–14924<br />
Lawrence JG, Roth JR (1996) Selfish operons: horizontal transfer<br />
may drive the evolution of gene clusters. Genetics 143<br />
(4):1843–1860<br />
Lazcano A, Diaz-Villagomez E, Mills T, Oro J (1995) On the levels of<br />
enzymatic substrate specificity: implications for the early<br />
evolution of metabolic pathways. Adv Space Res 15(3):345–356<br />
Lewis M, Chang G, Horton NC, Kercher MA, Pace HC,<br />
Schumacher MA, Brennan RG, Lu P (1996) Crystal structure<br />
of the lactose operon repressor <strong>and</strong> its complexes with DNA<br />
<strong>and</strong> <strong>in</strong>ducer. Science 271(5253):1247–1254<br />
Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD,<br />
Gordon JI (2005) Obesity alters gut microbial ecology. Proc<br />
Natl Acad Sci U S A 102(31):11070–11075<br />
L<strong>in</strong> L-F, Posfai J, Roberts RJ, Kong H (2001) <strong>Comparative</strong><br />
genomics of the restriction–modification systems <strong>in</strong> Helicobacter<br />
pylori. Proc Natl Acad Sci U S A 98:2740–2745<br />
Lobner-Olesen A, Skovgaard O, Mar<strong>in</strong>us MG (2005) Dam methylation:<br />
coord<strong>in</strong>at<strong>in</strong>g cellular processes. Curr Op<strong>in</strong> Microbiol 8<br />
(2):154–160<br />
Lund O, Nielsen M, Kesmir C, Christensen JK, Lundegaard C,<br />
Worn<strong>in</strong>g P, Brunak C (2002) Web-based <strong>tools</strong> for vacc<strong>in</strong>e<br />
design. In: Korber BT, Br<strong>and</strong>er C, Haynes BF, Koup R, Kuiken<br />
C, Moore JP, Walker BD, Watk<strong>in</strong>s D (eds) HIV molecular<br />
immunology. Los Alamos, NM, pp 45–51<br />
Lund O, Nielsen M, Lundegaard C, Kesmit C, Brunak S (2005)<br />
Immunological bio<strong>in</strong>formatics. MIT, Cambridge, Massachusetts<br />
Lupski JR, We<strong>in</strong>stock GM (1992) Short, <strong>in</strong>terspersed repetitive<br />
DNA sequences <strong>in</strong> prokaryotic genomes. J Bacteriol 174<br />
(14):4525–4529<br />
Maas R (2004) Prereplicative pur<strong>in</strong>e methylation <strong>and</strong> postreplicative<br />
demethylation <strong>in</strong> each DNA duplication of the Escherichia coli<br />
replication cycle. J Biol Chem 279(49):51568–51573<br />
Mahillon J, Leonard C, Ch<strong>and</strong>ler M (1999) IS elements as<br />
constituents of bacterial genomes. Res Microbiol 150:675–687<br />
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben<br />
LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du<br />
L, Fierro JM, Gomes XV, Godw<strong>in</strong> BC, He W, Helgesen S, Ho<br />
CH, Irzyk GP, J<strong>and</strong>o SC, Alenquer ML, Jarvie TP, Jirage KB,<br />
Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei<br />
M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE,<br />
McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R,<br />
Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson<br />
JW, Sr<strong>in</strong>ivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer<br />
GA, Wang SH, Wang Y, We<strong>in</strong>er MP, Yu P, Begley RF,<br />
Rothberg JM (2005) Genome sequenc<strong>in</strong>g <strong>in</strong> microfabricated<br />
high-density picolitre reactors. Nature 437(7057):376–380<br />
McCl<strong>in</strong>tock B (1950) The orig<strong>in</strong> <strong>and</strong> behavior of mutable loci <strong>in</strong><br />
maize. Proc Natl Acad Sci U S A 36(6):344–355<br />
McGillivary G, Tomaras AP, Rhodes ER, Actis LA (2005) Clon<strong>in</strong>g<br />
<strong>and</strong> sequenc<strong>in</strong>g of a genomic isl<strong>and</strong> found <strong>in</strong> the Brazilian<br />
purpuric fever clone of Haemophilus <strong>in</strong>fluenzae biogroup<br />
aegyptius. Infect Immun 73(4):1927–1938
184<br />
Middendorf B, Hochhut B, Leipold K, Dobr<strong>in</strong>dt U, Blum-Oehler G,<br />
Hacker J (2004) Instability of pathogenicity isl<strong>and</strong>s <strong>in</strong><br />
uropathogenic Escherichia coli 536. J Bacteriology 186<br />
(10):3086–3096<br />
Mongod<strong>in</strong> EF, Emerson JB, Nelson KE (2005) Microbial metagenomics.<br />
Genome Biol 6(10):347<br />
Mullis K, Faloona F, Scharf S, Saiki R, Horn G, Erlich H (1986)<br />
Specific enzymatic amplification of DNA <strong>in</strong> vitro: the<br />
polymerase cha<strong>in</strong> reaction. Cold Spr<strong>in</strong>g Harb Symp Quant<br />
Biol 51(Pt 1):263–273<br />
Nagy Z, Ch<strong>and</strong>ler M (2004) Regulation of transposition <strong>in</strong> bacteria.<br />
Res Microbiol 155:387–398<br />
Nishi T, Ikemura T, Kanaya S (2005) GeneLook: a novel ab <strong>in</strong>itio<br />
gene identification system suitable for automated annotation of<br />
prokaryotic sequences. Gene 346:115–125<br />
Novikova N, De Boever P, Poddubko S, Deshevaya E, Polikarpov<br />
N, Rakova N, Con<strong>in</strong>x I, Mergeay M (2006) Survey of<br />
environmental biocontam<strong>in</strong>ation on board the International<br />
Space Station. Res Microbiol 157(1):5–12<br />
Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer<br />
<strong>and</strong> the nature of bacterial evolution. Nature 405:299–304<br />
Ohnishi M, Kurokawa K, Hayashi T (2001) Diversification of<br />
Escherichia coli genomes: are bacteriophages the major<br />
contributors? Trends Microbiol 9:481–485<br />
Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa (2006)<br />
MODB: a database of operons accumulat<strong>in</strong>g known operons<br />
across multiple genomes. Nucleic Acids Res 34(Database<br />
issue):D358–362<br />
Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA (1986)<br />
Microbial ecology <strong>and</strong> evolution: a ribosomal RNA approach.<br />
Annu Rev Microbiol 40:337–365<br />
O’Malley MA, Bostanci A, Calvert J (2005) Whole-genome<br />
patent<strong>in</strong>g. Nat Rev Genet 6(6):502–506<br />
Ortutay C, Gaspari Z, Toth G, Jager E, Vida G, Orosz L, Vellai T<br />
(2003) Speciation <strong>in</strong> Chlamydia: genome-wide phylogenetic<br />
analyses identified a reliable set of acquired genes. J Mol Evol<br />
57:672–680<br />
Ou HY, Chen LL, Lonnen J, Chaudhuri RR, Thani AB, Smith R,<br />
Garton NJ, H<strong>in</strong>ton J, Pallen M, Barer MR, Rajakumar K (2006)<br />
A novel strategy for the identification of genomic isl<strong>and</strong>s by<br />
comparative analysis of the contents <strong>and</strong> contexts of tRNA sites<br />
<strong>in</strong> closely related bacteria. Nucleic Acids Res 34(1):e3<br />
Pal C, Hurst LD (2004) Evidence aga<strong>in</strong>st the selfish operon theory.<br />
Trends Genet 20(6):232–234<br />
Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, Harris<br />
DE, Holden MT, Churcher CM, Bentley SD, Mungall KL,<br />
Cerdeno-Tarraga AM, Temple L, James K, Harris B, Quail MA,<br />
Achtman M, Atk<strong>in</strong> R, Baker S, Basham D, Bason N,<br />
Cherevach I, Chill<strong>in</strong>gworth T, Coll<strong>in</strong>s M, Cron<strong>in</strong> A, Davis P,<br />
Doggett J, Feltwell T, Goble A, Haml<strong>in</strong> N, Hauser H, Holroyd<br />
S, Jagels K, Leather S, Moule S, Norberczak H, O’Neil S,<br />
Ormond D, Price C, Rabb<strong>in</strong>owitsch E, Rutter S, S<strong>and</strong>ers M,<br />
Saunders D, Seeger K, Sharp S, Simmonds M, Skelton J,<br />
Squares R, Squares S, Stevens K, Unw<strong>in</strong> L, Whitehead S,<br />
Barrell BG, Maskell DJ (2003) <strong>Comparative</strong> analysis of the<br />
genome sequences of Bordetella pertussis, Bordetella parapertussis<br />
<strong>and</strong> Bordetella bronchiseptica. Nat Genet 35(1):32–40<br />
Paulsen IT, Banerjei L, Myers GSA, Nelson KE, Seshadri R, Read TD,<br />
Fouts, DE, Eisen JA, Gill SR, Heidelberg JF, Tettel<strong>in</strong> H, Dodson<br />
RJ, Umayam L, Br<strong>in</strong>kac L, Beanan M, Daugherty S, DeBoy RT,<br />
Durk<strong>in</strong> S, Kolonay J, Madupu R, Nelson W, Vamathevan J, Tran<br />
B, Upton J, Hansen T, Shetty J, Khouri H, Utterback T, Radune D,<br />
Ketchum KA, Dougherty BA, Fraser CM (2003) Role of mobile<br />
DNA <strong>in</strong> the evolution of vancomyc<strong>in</strong>-resistant Enterococcus<br />
faecalis. Science 299(5615):2071–2074<br />
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW<br />
(2000) A DNA structural atlas for Escherichia coli. J Mol Biol<br />
299(4):907–930<br />
Pennisi E (2005) Biochemistry. Cut-rate genomes on the horizon?<br />
Science 309(5736):862<br />
Penyalver R, Lopez MM (1999) Cocolonization of the rhizosphere<br />
by pathogenic agrobacterium stra<strong>in</strong>s <strong>and</strong> nonpathogenic stra<strong>in</strong>s<br />
K84 <strong>and</strong> K1026, used for crown gall biocontrol. Appl Environ<br />
Microbiol 65(5):1936–1940<br />
Peters EDJ, Leverste<strong>in</strong>-Van Hall MA, Box ATA, Verhoef J, Fluit AC<br />
(2001) Novel gene cassettes <strong>and</strong> <strong>in</strong>tegrons. Antimicrob Agents<br />
Chemother 45(10):2961–2964<br />
Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B,<br />
Com<strong>and</strong>ucci M, Jenn<strong>in</strong>gs GT, Baldi L, Bartol<strong>in</strong>i E, Capecchi<br />
B, Galeotti CL, Luzzi E, Manetti R, Marchetti E, Mora M, Nuti<br />
S, Ratti G, Sant<strong>in</strong>i L, Sav<strong>in</strong>o S, Scarselli M, Storni E, Zuo P,<br />
Broeker M, Hundt E, Knapp B, Blair E, Mason T, Tettel<strong>in</strong> H,<br />
Hood DW, Jeffries AC, Saunders NJ, Granoff DM, Venter JC,<br />
Moxon ER, Gr<strong>and</strong>i G, Rappuoli R (2000) Identification of<br />
vacc<strong>in</strong>e c<strong>and</strong>idates aga<strong>in</strong>st serogroup B men<strong>in</strong>gococcus by<br />
whole-genome sequenc<strong>in</strong>g. Science 287:1816–1820<br />
Prescott L, Harvey JP, Kle<strong>in</strong> DA (1999) Microbiology, 4th edn.<br />
McGraw-Hill, New York, USA<br />
Price MN, Huang KH, Alm EJ, Ark<strong>in</strong> AP (2005) A novel method<br />
for accurate operon predictions <strong>in</strong> all sequenced prokaryotes.<br />
Nucleic Acids Res 33(3):880–892<br />
Rappuoli R (2001) Reverse vacc<strong>in</strong>ology, a genome-based approach<br />
to vacc<strong>in</strong>e development. Vacc<strong>in</strong>e 19:2688–2691<br />
Rendulic S, Jagtap P, Ros<strong>in</strong>us A, Epp<strong>in</strong>ger M, Baar C, Lanz C,<br />
Keller H, Lambert C, Evans KJ, Goesmann A, Meyer F,<br />
Sockett RE, Schuster SC (2004) A predator unmasked: life<br />
cycle of Bdellovibrio bacteriovorus from a genomic perspective.<br />
Science 303(5658):689–692<br />
Reznikoff WS (1992) The lactose operon-controll<strong>in</strong>g elements:<br />
a complex paradigm. Mol Microbiol 6(17):2419–2422<br />
Robb<strong>in</strong>s-Manke JL, Zdraveski ZZ, Mar<strong>in</strong>us M, Essigmann JM<br />
(2005) Analysis of global gene expression <strong>and</strong> double-str<strong>and</strong>break<br />
formation <strong>in</strong> DNA aden<strong>in</strong>e methyltransferase- <strong>and</strong><br />
mismatch repair-deficient Escherichia coli. J Bacteriol 187<br />
(20):7027–7037<br />
Roberts RJ, V<strong>in</strong>cze T, Psfai J, Macelis D (2005) REBASE—<br />
restriction enzymes <strong>and</strong> DNA methyl transferases. Nucleic<br />
Acids Res 33:D230–D232<br />
Rocha EPC, Danch<strong>in</strong> A, Viari A (1999) Functional <strong>and</strong> evolutionary<br />
role of long repeats <strong>in</strong> prokaryotes. Res Microbiol 150:725–733<br />
Rogoz<strong>in</strong> IB, Makarova KS, Wolf YI, Koon<strong>in</strong> EV (2004) <strong>Computational</strong><br />
approaches for the analysis of gene neighbourhoods <strong>in</strong><br />
prokaryotic genomes. Brief Bio<strong>in</strong>form 5(2):131–149<br />
Rosenfeld JA, Sarkar IN, Planet PJ, Figurski DH, DeSalle R (2004)<br />
ORFcurator: molecular curation of genes <strong>and</strong> gene clusters <strong>in</strong><br />
prokaryotic organisms. Bio<strong>in</strong>formatics 20(18):3462–3465<br />
Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-<br />
Solano F, Santos-Zavaleta A, Mart<strong>in</strong>ez-Flores I, Jimenez-Jac<strong>in</strong>to<br />
V, Bonavides-Mart<strong>in</strong>ez C, Segura-Salazar J, Mart<strong>in</strong>ez-Antonio<br />
A, Collado-Vides J (2006a) RegulonDB (version 5.0): Escherichia<br />
coli K-12 transcriptional regulatory network, operon<br />
organization, <strong>and</strong> growth conditions. Nucleic Acids Res 34<br />
(Database issue):D394–D397<br />
Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M,<br />
Penaloza-Sp<strong>in</strong>ola MI, Mart<strong>in</strong>ez-Antonio A, Karp PD, Collado-<br />
Vides J (2006b) The comprehensive updated regulatory<br />
network of Escherichia coli K-12. BMC Bio<strong>in</strong>formatics 7(1):5<br />
Sanger F, Donelson JE, Coulson AR, Kossel H, Fischer D (1973)<br />
Use of DNA polymerase I primed by a synthetic oligonucleotide<br />
to determ<strong>in</strong>e a nucleotide sequence <strong>in</strong> phage fl DNA. Proc<br />
Natl Acad Sci U S A 70(4):1209–1213<br />
Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA,<br />
Hutchison CA, Slocombe PM, Smith M (1977) Nucleotide<br />
sequence of bacteriophage phi X174 DNA. Nature 265<br />
(5596):687–695
Schmidt H, Hensel M (2004) Pathogenicity isl<strong>and</strong>s <strong>in</strong> bacterial<br />
pathogenesis. Cl<strong>in</strong> Microbiol Rev 17(1):14–56<br />
Schneider G, Dobr<strong>in</strong>dt U, Bruggemann H, Nagy G, Janke B, Blum-<br />
Oehler G, Buchrieser C, Gottschalk G, Emody L, Hacker J<br />
(2004) The pathogenicity isl<strong>and</strong>-associated K15 capsule determ<strong>in</strong>ant<br />
exhibits a novel genetic structure <strong>and</strong> correlates with<br />
virulence <strong>in</strong> uropathogenic Escherichia coli stra<strong>in</strong> 536. Infect<br />
Immun 72(10):5993–6001<br />
Serruto D, Adu-Bobie J, Capecchi B, Rappuoli R, Pizza M,<br />
Masignani V (2004) Biotechnology <strong>and</strong> vacc<strong>in</strong>es: application<br />
of functional genomics to Neisseria men<strong>in</strong>gitidis <strong>and</strong> other<br />
bacterial pathogens. J Biotechnol 113:15–32<br />
Sharp PM, Li WH (1987) The codon adaptation <strong>in</strong>dex—a measure<br />
of directional synonymous codon usage bias, <strong>and</strong> its potential<br />
applications. Nucleic Acids Res 15(3):1281–1295<br />
Shendure J, Porreca GJ, Reppas NB, L<strong>in</strong> X, McCutcheon JP,<br />
Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM<br />
(2005) Accurate multiplex polony sequenc<strong>in</strong>g of an evolved<br />
bacterial genome. Science 309(5741):1728–1732<br />
Shimizu T, Ohtani K, Hirakawa H, Ohshima K, Yamashita A, Shiba<br />
T, Ogasawara N, Hattori M, Kuhara, Hayashi H (2002)<br />
Complete genome sequence of Clostridium perfr<strong>in</strong>gens, an<br />
anaerobic flesh-eater. Proc Natl Acad Sci U S A 99(2):996–1001<br />
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On<br />
the total number of genes <strong>and</strong> their length distribution <strong>in</strong><br />
complete microbial genomes. Trends Genet 17(8):425–428<br />
Stahl FW, Murray NE (1966) The evolution of gene clusters <strong>and</strong><br />
genetic circularity <strong>in</strong> microorganisms. Genetics 53(3):569–576<br />
Starl<strong>in</strong>ger P, Saedler H (1976) IS-elements <strong>in</strong> microorganisms. Curr<br />
Top Microbiol Immunol 75:111–152<br />
Talarico S, Cave MD, Marrs CF, Foxman B, Zhang L, Yang Z (2005)<br />
Variation of the Mycobacterium tuberculosis PE_PGRS 33 gene<br />
among cl<strong>in</strong>ical isolates. J Cl<strong>in</strong> Microbiol 43(10):4954–4960<br />
Taoka M, Yamauchi Y, Sh<strong>in</strong>kawa T, Kaji H, Motohashi W,<br />
Nakayama H, Takahashi N, Isobe T (2004) Only a small<br />
subset of the horizontally transferred chromosomal genes <strong>in</strong><br />
Escherichia coli are translated <strong>in</strong>to prote<strong>in</strong>s. Mol Cell<br />
Proteomics 3(8):780–787<br />
Tobes R, Ramos JL (2005) REP code: def<strong>in</strong><strong>in</strong>g bacterial identity <strong>in</strong><br />
extragenic space. Environ Microbiol 7(2):225–228<br />
Toh H, Weiss BL, Perk<strong>in</strong> SA, Yamashita A, Oshima K, Hattori M,<br />
Aksoy S (2006) Massive genome erosion <strong>and</strong> functional<br />
adaptations provide <strong>in</strong>sights <strong>in</strong>to the symbiotic lifestyle of<br />
Sodalis gloss<strong>in</strong>idius <strong>in</strong> the tsetse host. Genome Res 16:149–156<br />
Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG,<br />
Fleischmann RD, Ketchum KA, Klenk HP, Gill S, Dougherty<br />
BA, Nelson K, Quackenbush J, Zhou L, Kirkness EF, Peterson<br />
S, Loftus B, Richardson D, Dodson R, Khalak HG, Glodek A,<br />
McKenney K, Fitzegerald LM, Lee N, Adams MD, Hickey EK,<br />
Berg DE, Gocayne JD, Utterback TR, Peterson JD, Kelley JM,<br />
Cotton MD, Weidman JM, Fujii C, Bowman C, Watthey L,<br />
Wall<strong>in</strong> E, Hayes WS, Borodovsky M, Karp PD, Smith HO,<br />
Fraser CM, Venter JC (1997) The complete genome sequence<br />
of the gastric pathogen Helicobacter pylori. Nature 388<br />
(6642):539–547<br />
Torsvik V, Salte K, Sorheim R, Goksoyr J (1990) Comparison of<br />
phenotypic diversity <strong>and</strong> DNA heterogeneity <strong>in</strong> a population of<br />
soil bacteria. Appl Environ Microbiol 56:776–781<br />
Tr<strong>in</strong>ge SG, Rub<strong>in</strong> EM (2005) Metagenomics: DNA sequenc<strong>in</strong>g of<br />
environmental samples. Nat Rev Genet 6(11):805–814<br />
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ,<br />
Richardson PM, Solovyev VV, Rub<strong>in</strong> EM, Rokhsar DS,<br />
Banfield JF (2004) Community structure <strong>and</strong> metabolism<br />
through reconstruction of microbial genomes from the environment.<br />
Nature 428(6978):37–43<br />
185<br />
Ussery DW, Hall<strong>in</strong> PF (2004a) Genome update: AT content <strong>in</strong><br />
sequenced prokaryotic genomes. Microbiology 150(Pt 4):749–752<br />
Ussery DW, Hall<strong>in</strong> PF (2004b) Genome update: length distributions of<br />
sequenced prokaryotic genomes. Microbiology 150(Pt 3):513–516<br />
Ussery DW, B<strong>in</strong>newies TT, Gouveia-Oliveira R, Jarmer H, Hall<strong>in</strong><br />
PF (2004a) Genome update: DNA repeats <strong>in</strong> bacterial genomes.<br />
Microbiology 150(Pt 11):3519–3521<br />
Ussery DW, Hall<strong>in</strong> PF, Lagesen K, Coenye T (2004b) Genome<br />
update: rRNAs <strong>in</strong> sequenced microbial genomes. Microbiology<br />
150(Pt 5):1113–1115<br />
Ussery DW, Hall<strong>in</strong> PF, Lagesen K, Wassenaar TM (2004c) Genome<br />
update: tRNAs <strong>in</strong> sequenced microbial genomes. Microbiology<br />
150(Pt 6):1603–1606<br />
Ussery DW, T<strong>in</strong>dbaek N, Hall<strong>in</strong> PF (2004d) Genome update:<br />
promoter profiles. Microbiology 150(Pt 9):2791–2793<br />
Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, Cruveiller S, Lajus<br />
A, Pascal G, Scarpelli C, Medigue C (2006) MaGe: a microbial<br />
genome annotation system supported by synteny results.<br />
Nucleic Acids Res 34(1):53–65<br />
van Belkum A, Scherer S, van Alphen L, Verbrugh H (1998) Short<br />
sequence DNA repeats <strong>in</strong> prokaryotic genomes. Microbiol Mol<br />
Biol Rev 62(2):275–293<br />
van der Meer JR, Sentchilo V (2003) Genomic isl<strong>and</strong>s <strong>and</strong> the<br />
evolution of catabolic pathways <strong>in</strong> bacteria. Curr Op<strong>in</strong><br />
Biotechnol 14:248–254<br />
Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong<br />
X, Lu P, Szafron D, Gre<strong>in</strong>er R, Wishart DS (2005) BASys: a web<br />
server for automated bacterial genome annotation. Nucleic<br />
Acids Res 33(Web Server issue):W455–W459<br />
Venter JC, Rem<strong>in</strong>gton K, Heidelberg JF, Halpern AL, Rusch D,<br />
Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE,<br />
Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson<br />
J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C,<br />
Rogers YH, Smith HO (2004) Environmental genome shotgun<br />
sequenc<strong>in</strong>g of the Sargasso Sea. Science 304(5667):66–74<br />
Vezzi A, Campanaro S, D’Angelo M, Simonato F, Vitulo N, Lauro<br />
FM, Cestaro A, Malacrida G, Simionati B, Cannata N,<br />
Romualdi C, Bartlett DH, Valle G (2005) Life at depth:<br />
Photobacterium profundum genome sequence <strong>and</strong> expression<br />
analysis. Science 307(5714):1459–1461<br />
Willenbrock H, B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Ussery DW (2005) Genome<br />
update: 2D cluster<strong>in</strong>g of bacterial genomes. Microbiology 151<br />
(Pt 2):333–336<br />
Worn<strong>in</strong>g P, Jensen LJ, Nelson KE, Brunak S, Ussery DW (2000)<br />
Structural analysis of DNA sequence: evidence for lateral gene<br />
transfer <strong>in</strong> Thermotoga maritima. Nucleic Acids Res 28<br />
(3):706–709<br />
Worn<strong>in</strong>g P, Jensen LJ, Hall<strong>in</strong> PF, Stærfeldt H-H, Ussery DW (2006)<br />
Orig<strong>in</strong> of replication <strong>in</strong> circular prokaryotic chromosomes.<br />
Environ Microbiol (In press)<br />
Yan F, Polk DB (2004) Commensal bacteria <strong>in</strong> the gut: learn<strong>in</strong>g who<br />
our friends are. Curr Op<strong>in</strong> Gastroenterol 20(6):565–571<br />
Zagursky RJ, Russell D (2001) Bio<strong>in</strong>formatics: use <strong>in</strong> bacterial<br />
vacc<strong>in</strong>e discovery. Biotechniques 31:636–659<br />
Zhang R, Zhang CT (2004) A systematic method to identify<br />
genomic isl<strong>and</strong>s <strong>and</strong> its applications <strong>in</strong> analyz<strong>in</strong>g the genomes<br />
of Corynebacterium glutamicum <strong>and</strong> Vibrio vulnificus CMCP6<br />
chromosome I. Bio<strong>in</strong>formatics 20(5):612–622<br />
Zheng Y, Anton BP, Roberts RJ, Kasif S (2005) Phylogenetic<br />
detection of conserved gene clusters <strong>in</strong> microbial genomes.<br />
BMC Bio<strong>in</strong>formatics 6:243<br />
Zubrzycki IZ (2004) Analysis of the products of genes encompassed<br />
by the theoretically predicted pathogenicity isl<strong>and</strong>s of Mycobacterium<br />
tuberculosis <strong>and</strong> Mycobacterium bovis. Prote<strong>in</strong>s:<br />
Struct, Funct, Bio<strong>in</strong>f 54:563–568
2.8 Paper III: Global features of the Alcanivorax borkumensis<br />
SK2 genome
Environmental Microbiology (2007) doi:10.1111/j.1462-2920.2007.01483.x<br />
Global features of the Alcanivorax borkumensis<br />
SK2 genome<br />
Oleg N. Reva, 1,3 Peter F. Hallin, 2 Hanni Willenbrock, 2<br />
Thomas Sicheritz-Ponten, 2 Burkhard Tümmler 1 and<br />
David W. Ussery 2<br />
1 Klinische Forschergruppe, OE6711, Medizinische<br />
Hochschule Hannover, Carl-Neuberg-Strasse 1,<br />
D-30625 Hannover, Germany.<br />
2 Center for Biological Sequence Analysis, Technical<br />
University of Denmark, Lyngby, Denmark.<br />
3 Biochemistry Department, University of Pretoria,<br />
Lynnwood Road, Hillcrest, 0002 Pretoria, South Africa.<br />
Summary<br />
The global feature of the completely sequenced<br />
Alcanivorax borkumensis SK2 type strain chromosome<br />
is its symmetry and homogeneity. The origin<br />
and terminus of replication are located opposite<br />
each other in the chromosome and are discerned<br />
with high signal-to-noise ratios by maximal oligonucleotide<br />
usage biases on the leading and lagging<br />
strand. Genomic DNA structure is rather uniform<br />
throughout the chromosome with respect to intrinsic<br />
curvature, position preference or base<br />
stacking energy. The orthologs and paralogs of<br />
A. borkumensis genes with the highest sequence<br />
homology were found in most cases among<br />
γ-Proteobacteria, with Acinetobacter and P. aeruginosa<br />
as closest relatives. A. borkumensis shares<br />
a similar oligonucleotide usage and promoter<br />
structure with the Pseudomonadales. A comparatively<br />
low number of only 18 genome islands with<br />
atypical oligonucleotide usage was detected in the<br />
A. borkumensis chromosome. The gene clusters that<br />
confer the assimilation of aliphatic hydrocarbons are<br />
localized in two genome islands which were probably<br />
acquired from an ancestor of the Yersinia lineage,<br />
whereas the alk genes of Pseudomonas putida still<br />
exhibit the typical Alcanivorax oligonucleotide signature,<br />
indicating a complex evolution of this major<br />
hydrocarbonoclastic trait.<br />
Received 8 August, 2007; accepted 26 September, 2007.<br />
*For correspondence. E-mail tuemmler.burkhard@mh-hannover.de;<br />
Tel. (+49) 511 5322920; Fax (+49) 511 5326723.<br />
Introduction<br />
Alcanivorax borkumensis strain SK2 is a cosmopolitan<br />
oil-degrading oligotrophic marine γ-proteobacterium<br />
(Yakimov et al., 1998). The SK2 strain is the paradigm for<br />
hydrocarbonoclastic bacteria that are specialized for<br />
hydrocarbon degradation but have an otherwise highly<br />
restricted substrate spectrum, being capable of utilizing<br />
only a few organic acids such as pyruvate, but not simple<br />
sugars, for growth (Yakimov et al., 1998; Sabirova et al.,<br />
2006). A. borkumensis is present in low abundance in<br />
unpolluted environments, but it rapidly becomes the dominant<br />
bacterium in oil-polluted open ocean and coastal<br />
waters, where it can constitute 80–90% of the oil-degrading<br />
microbial community (Harayama et al., 1999;<br />
Kasai et al., 2001; 2002; Syutsubo et al., 2001; Röling<br />
et al., 2002; Hara et al., 2003; McKew et al., 2007a,b).<br />
The genome of A. borkumensis was recently<br />
sequenced and annotated (Schneiker et al., 2006). In this<br />
paper, we perform a genome-wide comparative genomics<br />
analysis and a detailed characterization of the global<br />
features of the A. borkumensis strain SK2 genome. This<br />
work on A. borkumensis strain SK2 aimed to illustrate the<br />
potential of genome linguistic approaches<br />
for functional and comparative analysis of bacterial<br />
genomes.<br />
Results and discussion<br />
© 2007 The Authors<br />
Journal compilation © 2007 Society for Applied Microbiology and Blackwell Publishing Ltd<br />
DNA structure and highly expressed genes<br />
The genome atlas (Fig. 1) shows a combination of some<br />
general informative properties of the chromosome.<br />
These are structural features (intrinsic curvature, stacking<br />
energy and position preference), repeat properties (global<br />
direct and inverted repeats) and the main base composition<br />
features (GC skew and percent AT). Stacking energy<br />
measures helix rigidity and position preference is a<br />
flexibility measure (Jensen et al., 1999; Pedersen et al.,<br />
2000). Regions that exhibit low position preference correlate<br />
with an enrichment of highly expressed genes (Dlakic<br />
et al., 2004; Willenbrock and Ussery, 2007). Examples in<br />
A. borkumensis are the rrn operons, the genes encoding<br />
ribosomal proteins and the gene cluster labelled rpoC on<br />
the atlas, which among others encodes RNA polymerase<br />
subunits. Low position preference was found to correlate<br />
with high codon adaptation indices, the common<br />
measure for highly expressed genes (Willenbrock et al.,<br />
2006), indicating that the local DNA structure is an important<br />
determinant of codon usage and gene expression.<br />
Moreover, intrinsic curvature is often encountered<br />
upstream of highly expressed genes (Skovgaard et al.,<br />
2002), which correlates well with the fact that promoter<br />
DNA tends to be more curved than DNA in coding regions<br />
(Pedersen et al., 2000).<br />
Fig. 1. Genome Atlas of A. borkumensis SK2 showing different structural parameters and the distribution of global repeats, GC skew and<br />
A + T content. Colour intensity increases with the deviation from the average. Values close to the average are shaded very light grey; values<br />
more than 3 standard deviations from the average are most strongly coloured.<br />
The chromosome is rather homogeneous in all analysed<br />
structural features. The number of repeats is low, and<br />
the terminus of replication is opposite to the origin of<br />
replication as indicated by GC skew (Ussery et al., 2002).<br />
The three rRNA operons, organized in the order<br />
16S-23S-5S, are located in three areas with low position<br />
preference (green marks in the 3rd circle) and possible<br />
upstream regions with high intrinsic curvature (blue in the<br />
1st circle) near 0.4–0.5 Mb (two regions) and<br />
2.25 Mb (one region).<br />
Phylogenomics by sequence homology<br />
The genome of A. borkumensis was compared with existing<br />
sequence information in other Proteobacteria by constructing<br />
a phylogenetic tree for each amino acid sequence together<br />
with the organisms in which a similar gene existed.<br />
By extracting the phylogenomic information of the resulting<br />
1919 phylogenetic trees a phylome atlas could be<br />
constructed (Fig. 2). In most cases the orthologs and<br />
paralogs with the highest sequence homology were found<br />
among γ-Proteobacteria. A substantial proportion of<br />
A. borkumensis genes had their closest homologues in<br />
α- and β-Proteobacteria, but no closest homologue was<br />
detected in δ- and ε-Proteobacteria. Inspection of the collected<br />
phylogenetic connections revealed that the<br />
most closely related organisms are Acinetobacter sp.<br />
and Pseudomonas aeruginosa, although in trees where<br />
both Pseudomonas and Acinetobacter are present,<br />
A. borkumensis tends to cluster more often with the latter.<br />
No obvious horizontal gene transfers seem to have<br />
taken place. Regions around 350 000 and 450 000 bp are<br />
very ‘pure’ γ-proteobacterial regions.<br />
Fig. 2. Phylome Atlas of A. borkumensis SK2 genes indicating their closest bacterial homologues. Each of the concentric circles represents a<br />
taxonomic group as described in the figure legend on the right, with the outermost circle corresponding to the top-most feature, and the<br />
innermost circle corresponding to the bottom-most feature. Light bands indicate A. borkumensis SK2 genes with no homologue in the<br />
respective taxonomic group.<br />
Genome analysis of oligonucleotide usage<br />
Oligonucleotide usage (OU) has been shown to be a<br />
genome-specific signature (Pride et al., 2003; Reva and<br />
Tümmler, 2004). Genomic regions termed the ‘core<br />
sequences’ are characterized by OU patterns<br />
similar to the global pattern of the chromosome. However,<br />
loci with alternative OU patterns typically make up<br />
more than 10% of a bacterial genome in total. These<br />
loci with atypical OU patterns comprise heterogeneous<br />
subsets of parasitic and recent foreign DNA, ancient<br />
genes for ribosomal constituents (RNAs and proteins),<br />
multidomain genes and non-coding sequences with multiple<br />
tandem repeats (Reva and Tümmler, 2005). Hence<br />
laterally transferred gene islands can be reliably identified<br />
in complete genomes by their atypical oligonucleotide<br />
usage (Reva and Tümmler, 2005; Chen et al., 2007;<br />
Klockgether et al., 2007). Here, we focused on tetranucleotide<br />
usage (TU) parameters because the 256 different<br />
tetranucleotide words are optimal to differentiate bacterial<br />
genome sequences by the frequency and informativeness<br />
of the individual element. TU patterns represent the deviations<br />
of tetranucleotide word counts in a given sequence<br />
from an equiprobable distribution. Selection and counterselection<br />
of the oligonucleotide words are driven by their<br />
stereochemical properties such as base stacking energy,<br />
propeller twist angle, protein deformability, bendability<br />
and position preference (Reva and Tümmler, 2004). By<br />
permutation analysis, the 256 tetranucleotides were<br />
assigned to 39 equivalence classes, each of which is characterized<br />
by the same values for the five properties mentioned<br />
above (Baldi and Baisnee, 2000). Words of the<br />
same equivalence class tend to occur at similar frequencies<br />
in a nucleotide sequence (Reva and Tümmler, 2004).<br />
Oligonucleotide usage conservation reflects to some<br />
extent the phylogeny of microorganisms (Pride et al.,<br />
2003; Teeling et al., 2004).<br />
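To make the notion of a TU pattern concrete, the following Python sketch counts overlapping tetranucleotides and reports, for each of the 256 words, how far the observed count deviates from the count expected under an equiprobable distribution.<br />
The function name tu_pattern and the normalization (observed - expected) / expected are assumptions made for this illustration; the measure actually used in the paper is defined in its Experimental procedures, which are not reproduced here.<br />

```python
from itertools import product

# All 256 tetranucleotide words over the DNA alphabet.
WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def tu_pattern(seq):
    """Return {word: (observed - expected) / expected} for overlapping 4-mers.

    Expected counts assume that all 256 words are equally likely.
    This normalization is an illustrative assumption, not the paper's formula.
    """
    seq = seq.upper()
    counts = dict.fromkeys(WORDS, 0)
    total = 0
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid tetranucleotides")
    expected = total / len(WORDS)
    return {w: (c - expected) / expected for w, c in counts.items()}


# Toy usage: list the three most over-represented words of a short sequence.
pattern = tu_pattern("ACGTACGTACGTTTTTGGGGCCCCAAAA")
top_words = sorted(pattern, key=pattern.get, reverse=True)[:3]
print(top_words)
```

Words that never occur receive a deviation of -1, and the deviations of all 256 words sum to zero, so over- and under-representation can be read directly from the sign.<br />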
Phylogenomics by tetranucleotide usage analysis<br />
TU patterns were calculated for all sequenced genomes<br />
of γ-Proteobacteria. Four examples of TU patterns determined<br />
for A. borkumensis SK2, Pseudomonas putida<br />
KT2440, Escherichia coli K-12 and Shewanella oneidensis<br />
MR-1 are shown in Fig. 3. Tetranucleotide words were<br />
grouped by the equivalence classes and sorted in order of<br />
decreasing base stacking energy. Figure 4 visualizes<br />
the phylogenetic relationships differentiated by TU patterns<br />
of 29 γ-proteobacterial taxa, each of which is represented<br />
by not more than a single sequenced strain.<br />
A. borkumensis forms a cluster with Pseudomonas,<br />
Methylococcus, Xanthomonas and Xylella (Fig. 4).<br />
Despite the variation in GC-content, from 52 to 54% in<br />
Xylella and Alcanivorax to more than 65% in Xanthomonas<br />
and Pseudomonas, the TU patterns of these<br />
microorganisms are similar and separated from other<br />
γ-Proteobacteria. There is an abundance of GC-rich tetranucleotides<br />
with high base stacking energy in the<br />
sequence of A. borkumensis SK2 (words belonging to<br />
equivalence classes 37–39, 30 and 27) that is similar to<br />
the TU pattern of P. putida KT2440 (Fig. 3). Words of the<br />
AT-rich classes 7, 10, 13 and 32 are significantly underrepresented<br />
in both species. The major difference<br />
between the TU patterns is the abundance of poly A and poly<br />
T stretches (words of class 1) in A. borkumensis in correspondence<br />
with its lower GC-content of 54.7%. Although<br />
E. coli and S. oneidensis share a similar GC content with<br />
A. borkumensis, their tetranucleotide usage is different<br />
from that of Alcanivorax. The parity of GC with AT in the genome<br />
correlates with a balanced use of GC-rich and AT-rich<br />
words with high and low base stacking energy. In contrast,<br />
words with intermediate values of the base stacking<br />
energy (classes 25, 31, 36 and 29) are mostly underrepresented<br />
(Fig. 3). The data suggest that oligonucleotide<br />
usage drives GC-content and not vice versa. To give<br />
another example: the GC-rich words of class 21 are<br />
rare in all γ-Proteobacteria irrespective of their<br />
GC-content (Fig. 3), but these words are overrepresented<br />
in α-Proteobacteria (Agrobacterium, Bordetella, Caulobacter,<br />
Rhizobium).<br />
Fig. 3. Tetranucleotide usage patterns of<br />
A. borkumensis SK2, P. putida KT2440, E. coli<br />
K12 MG1655 and S. oneidensis MR-1. The<br />
deviation Δw of observed from expected<br />
counts is shown for all 256 tetranucleotide<br />
words (16 × 16 cells) by colour code (right<br />
bar). Tetranucleotides are grouped into 39<br />
classes of equivalent structural features (Baldi<br />
and Baisnee, 2000) and sorted by decreasing<br />
base stacking energy row-by-row starting at<br />
the upper left corner (class 39). The words<br />
corresponding to the cells in the colour plots are<br />
shown in the table in the lower part of the figure.<br />
Fig. 4. Tree of the similarity of TU patterns of<br />
completely sequenced γ-Proteobacteria<br />
strains. Distance D-values (see Experimental<br />
procedures) between two TU patterns were<br />
calculated, and the tree was constructed from<br />
the distance matrix of all D-values by the<br />
minimum evolution neighbour-joining method<br />
(Saitou and Nei, 1987).<br />
Anomalous local TU patterns in the<br />
A. borkumensis genome<br />
A. borkumensis shares a common taxonomic group<br />
with Pseudomonas, Methylococcus, Xanthomonas and<br />
Xylella. Although the TU patterns are genome-specific<br />
signatures, the oligonucleotide usage may vary locally in<br />
segments made up of horizontally acquired elements,<br />
phylogenetically ancient genes such as rRNAs or genes<br />
with peculiar codon usage (Reva and Tümmler, 2004;<br />
2005). In other words, anomalous local TU patterns can<br />
be expected for the most recent and the most ancient<br />
genes. Local TU patterns were calculated in 8 kbp long<br />
overlapping sliding windows in steps of 2 kbp. Distances<br />
D between local and global TU patterns are shown in<br />
Fig. 5. The 18 regions with D-values above the 95% confidence<br />
interval are listed in Table 1.<br />
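The sliding-window scan described above (8 kbp windows moved in 2 kbp steps, each compared with the whole chromosome) can be sketched in Python as follows.<br />
Two parts of this sketch are stand-ins and not the paper's method: distance_percent uses half the L1 difference between tetranucleotide frequency tables instead of the D-value defined in Experimental procedures, and atypical_windows uses mean + 1.645 standard deviations as an assumed one-sided 95% cut-off. All function names and the toy genome are hypothetical.<br />

```python
import random
from collections import Counter


def word_freqs(seq, k=4):
    """Relative frequencies of overlapping k-mers made of A/C/G/T only."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = {w: c for w, c in counts.items() if set(w) <= set("ACGT")}
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def distance_percent(p, q):
    """Stand-in distance: half the L1 difference of two frequency tables, in %."""
    words = set(p) | set(q)
    return 50.0 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


def local_distances(genome, window=8000, step=2000):
    """Yield (start, D) for each sliding window compared with the whole sequence."""
    genome = genome.upper()
    global_freqs = word_freqs(genome)
    for start in range(0, len(genome) - window + 1, step):
        local = word_freqs(genome[start:start + window])
        yield start, distance_percent(local, global_freqs)


def atypical_windows(distances, z=1.645):
    """Return windows whose D exceeds mean + z * sd (assumed 95% cut-off)."""
    values = [d for _, d in distances]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(s, d) for s, d in distances if d > mean + z * sd]


# Toy genome: random background with an 8 kbp "foreign" AT-repeat island.
rng = random.Random(0)
background = "".join(rng.choice("ACGT") for _ in range(60000))
genome = background[:20000] + "AT" * 4000 + background[28000:]

distances = list(local_distances(genome))
flagged = atypical_windows(distances)
print([start for start, _ in flagged])
```

The sketch treats the sequence as linear; for a circular chromosome the windows would also need to wrap around the end of the sequence.<br />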
Three clusters with anomalous D-values encode ribosomal<br />
RNAs that belong to the most ancient and conserved<br />
elements of all bacterial genomes. All the other 15<br />
regions with atypical TU most likely were recently<br />
acquired, three of which contain transposase genes.<br />
In total 11 transposases were annotated in the<br />
A. borkumensis SK2 genome but for five of them no significant<br />
deviations of the local TU patterns were detected<br />
in adjacent regions. If inserted mobile elements have lost<br />
their mobility due to disruptive mutations, they undergo an<br />
amelioration process that smooths the differences in oligonucleotide<br />
usage between inserts and the host genome<br />
and thus can no longer be detected by anomalous TU patterns<br />
(Pride et al., 2003).<br />
Five regions with high D-values (Fig. 5) only encode<br />
hypothetical proteins (Table 1). One further region contains<br />
genes of the type II secretion system and two<br />
regions encode type IV pili biogenesis proteins, the latter<br />
of which are known to have spread among proteobacteria<br />
by horizontal transfer with the original codon usage and<br />
GC content being retained (Spangenberg et al., 1997).<br />
The most extended region with high D-values encodes<br />
a cluster of genes for glycosyltransferases and polysaccharide<br />
biosynthesis proteins (Abo_858-Abo_880:<br />
1 018 000–1 060 000 bp) characterized by the second<br />
largest D-value and low GC-content (minimum 45% GC).<br />
The region terminates abruptly after Abo_880 at an Asn-tRNA<br />
gene. The TU pattern of the locus was compared<br />
with those of 177 sequenced bacterial chromosomes, 316<br />
plasmids and 104 phages (Reva and Tümmler, 2004).<br />
The pattern was distant from all analysed sequences. The<br />
best hit of D = 34.9% was observed for the 5833 bp large<br />
bacteriophage Pf3 that infects P. aeruginosa harbouring<br />
the RP1 plasmid (Luiten et al., 1985). A stretch of 1550 bp<br />
upstream of the tRNA gene is 48% identical in nucleotide<br />
sequence with the Pf3 sequence (2344-4078 bp).<br />
According to this in silico finding we propose that this<br />
gene island was captured from a phage that typically<br />
targets the 3′-end of a tRNA gene (Dobrindt et al.,<br />
2004).<br />
The alkB genes encoding the degradation of alkanes,<br />
which is the prominent name-giving feature of the taxon<br />
Alcanivorax, are located in two islands (Schneiker et al.,<br />
2006) with anomalous TU patterns (Table 1). Very close<br />
homologues were identified in marine bacteria and<br />
Pseudomonas species (Schneiker et al., 2006). The<br />
alkane hydroxylase gene cluster is widely distributed<br />
among hydrocarbon-utilizing γ-Proteobacteria due to its<br />
possible horizontal transfer (van Beilen et al., 2001;<br />
2004). The role of these genes in the degradation of<br />
short-chain n-alkanes by A. borkumensis SK2 and AP1<br />
was experimentally proven (Smits et al., 2002; Hara et al.,<br />
2004; Sabirova et al., 2006). Interestingly, the two regions<br />
comprising the alkS, alkB1, alkG and aldH alkane-degradation<br />
genes and alkB2 and transcriptional<br />
regulators, respectively (Table 1), are as similar to each<br />
other in their TU patterns (D = 34.3%) as each of them<br />
is to Yersinia pestis (D = 32.2% for alkB1, D = 33.4%<br />
for alkB2), Yersinia enterocolitica (D = 29.5% for alkB1,<br />
D = 34.4% for alkB2) and Shewanella oneidensis MR-1<br />
(D = 32.5% for alkB1, D = 42.4% for alkB2). These data<br />
suggest that the alkB1 and alkB2 genes were delivered<br />
to A. borkumensis from an ancestor of the Yersinia<br />
lineage. The AlkB1 amino acid sequences of A. borkumensis<br />
strains AP1 and SK2 are highly homologous to<br />
those of P. putida strains P1 and GPO1 (van Beilen et al.,<br />
2001; 2004; Smits et al., 2002; Hara et al., 2004), but their<br />
TU patterns are not that similar (D = 37.1%). Surprisingly,<br />
the TU pattern of the alkB cluster of P. putida<br />
is significantly more similar to the global TU pattern of<br />
the whole A. borkumensis chromosome (16.7%, strain<br />
GPO1; 19%, strain P1) than to that of the<br />
P. putida KT2440 chromosome (30.1% and 30.3%).<br />
D-values of 17 or 19% are within the first quartile (0–26%),<br />
far below the median value of 28.4% for local TU patterns<br />
of the A. borkumensis chromosome (Fig. 5), indicating<br />
that the P. putida alkB gene behaves as if it were part of<br />
the Alcanivorax core genome. We note the striking phenomenon<br />
that there was converging evolution of the<br />
coding sequence of the catabolic alk transposon in<br />
Alcanivorax and Pseudomonas, but that the genes<br />
retained the oligonucleotide signature of their donors,<br />
most likely Alcanivorax for Pseudomonas and Yersinia-like<br />
organisms for Alcanivorax.<br />
Fig. 5. Deviations of TU patterns in local<br />
regions of the A. borkumensis SK2 chromosome.<br />
Local TU patterns were determined in 8 kbp<br />
sliding windows in steps of 2 kbp. D, the<br />
distance between local and chromosomal<br />
tetranucleotide patterns as defined in<br />
Experimental procedures, is plotted versus<br />
the coordinates of the chromosome starting<br />
from the putative replication origin. The upper<br />
border of the 95% confidence interval of<br />
D-values is shown by the horizontal line.<br />
Table 1. Chromosomal regions of A. borkumensis with atypical TU patterns.<br />
<table>
<tr><th>Left coordinate (bp)</th><th>Right coordinate (bp)</th><th>D<sup>a</sup> (%)</th><th>Annotation</th></tr>
<tr><td>126 000</td><td>140 000</td><td>42.20</td><td>Abo_114–120: lysR transcriptional regulator, haloacid dehalogenase hydrolase, amiC amidase, gntR transcriptional regulator, alkB2 alkane monooxygenase, type I pili biogenesis proteins</td></tr>
<tr><td>190 000</td><td>198 000</td><td>40.47</td><td>Abo_172–178: ilvD-1 dihydroxy-acid dehydratase, conserved hypothetical proteins, long-chain-fatty-acid-CoA ligase, acyl-CoA dehydrogenases</td></tr>
<tr><td>234 000</td><td>245 000</td><td>47.95</td><td>Abo_209–214: conserved hypothetical proteins, transposase, type II secretion system proteins</td></tr>
<tr><td>400 000</td><td>408 000</td><td>49.42</td><td>first operon for rRNAs</td></tr>
<tr><td>502 000</td><td>510 000</td><td>46.26</td><td>Abo_439–446: ispA lipoprotein signal peptidase, fkpB peptidyl-prolyl cis-trans isomerase, ispH hydroxymethylbutenyl pyrophosphate reductase, type IV pili biogenesis proteins, conserved hypothetical proteins</td></tr>
<tr><td>526 000</td><td>534 000</td><td>43.41</td><td>second operon for rRNAs</td></tr>
<tr><td>670 000</td><td>678 000</td><td>40.29</td><td>Abo_581–583: type IV pili biogenesis proteins</td></tr>
<tr><td>792 000</td><td>800 000</td><td>43.00</td><td>Abo_2680–2681: hypothetical proteins</td></tr>
<tr><td>1 020 000</td><td>1 056 000</td><td>50.43</td><td>Abo_859–878: polysaccharide biosynthesis proteins</td></tr>
<tr><td>1 742 000</td><td>1 750 000</td><td>40.88</td><td>Abo_1439: periplasmic binding domain/transglycosylase SLT domain fusion</td></tr>
<tr><td>1 892 000</td><td>1 900 000</td><td>46.32</td><td>Abo_2841–2847: hypothetical proteins</td></tr>
<tr><td>2 026 000</td><td>2 034 000</td><td>41.90</td><td>Abo_1668–1671: conserved hypothetical proteins, 3 transposases, siderophore biosynthesis protein, glycosyl transferase</td></tr>
<tr><td>2 088 000</td><td>2 096 000</td><td>40.65</td><td>Abo_1707–1708: conserved hypothetical proteins</td></tr>
<tr><td>2 146 000</td><td>2 154 000</td><td>47.05</td><td>Abo_2897–2905: iscA iron-binding protein IscA, metal-sulfur cluster biosynthetic enzyme, sufE Fe-S metabolism associated domain protein, iscS cysteine desulfurase, rrf2 family protein, hypothetical proteins, SIR2-like transcriptional silencer</td></tr>
<tr><td>2 254 000</td><td>2 262 000</td><td>49.71</td><td>third operon for rRNAs</td></tr>
<tr><td>2 364 000</td><td>2 372 000</td><td>52.56</td><td>Abo_1942: penicillin-binding protein, hypothetical proteins, 2 transposases</td></tr>
<tr><td>2 632 000</td><td>2 640 000</td><td>40.17</td><td>Abo_2979–2984: hypothetical proteins</td></tr>
<tr><td>3 060 000</td><td>3 076 000</td><td>42.94</td><td>Abo_2516–3066: Na+/H+ antiporter, alkS alkB1GHJ regulator, alkB1 alkane monooxygenase, alkG rubredoxin, aldH aldehyde dehydrogenase, hypothetical proteins</td></tr>
</table>
a. D, distance between local and chromosomal TU patterns as defined in Experimental procedures.<br />
Origin of replication<br />
The GC skew plotted in the seventh circle of the genome<br />
atlas (Fig. 1) reflects a general bias of purines towards the<br />
leading strand of DNA replication; however, it has almost<br />
no correlation to the structural properties of DNA<br />
(Skovgaard et al., 2002). The GC skew is often useful<br />
when locating the origin and terminus of replication<br />
(Jensen et al., 1999).<br />
The circle is blue on the right side and purple on the left<br />
side. The two large breaks in colour at the top and at the<br />
bottom of the circle may be the origin and the terminus of<br />
replication. This can be seen more clearly in the<br />
origin plot (Fig. 6) (Worning et al., 2006). Here, the difference<br />
between hypothetical leading and lagging strand is<br />
plotted (red) for various positions on the chromosome.<br />
The peaks indicating maximal oligonucleotide skew correspond<br />
to origin and terminus. The terminus was identified<br />
as the peak showing low G/C weighted strand bias<br />
at position 1 502 000 bp. The origin was identified as the<br />
other peak at position 3 118 000 bp. The signal-to-noise ratio of<br />
14.0 was among the top 10% of sequenced Proteobacteria,<br />
indicating a large difference between leading and<br />
lagging strand and making the prediction of the origin very<br />
confident.<br />
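As a simplified illustration of how strand bias can point to the origin and terminus, the Python sketch below uses the cumulative GC skew, (G - C) / (G + C) summed over consecutive windows, and takes its minimum and maximum as estimates of origin and terminus.<br />
This is a commonly used stand-in and not the oligonucleotide skew method of Worning et al. (2006) applied in the paper; the function names, the window size and the toy sequence are assumptions made for this example.<br />

```python
def gc_skew_windows(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def origin_and_terminus(seq, window=1000):
    """Estimate origin/terminus as minimum/maximum of the cumulative GC skew.

    Positions are the end coordinates of the windows at which the
    cumulative skew reaches its extremes.
    """
    cumulative, total = [], 0.0
    for skew in gc_skew_windows(seq, window):
        total += skew
        cumulative.append(total)
    ori = (cumulative.index(min(cumulative)) + 1) * window
    ter = (cumulative.index(max(cumulative)) + 1) * window
    return ori, ter


# Toy chromosome: C-rich (lagging-like) DNA, then G-rich (leading-like) DNA,
# then C-rich again, so the skew changes sign at 20 kb and 60 kb.
toy = "CCGAT" * 4000 + "GGCAT" * 8000 + "CCGAT" * 4000
print(origin_and_terminus(toy))
```

The sketch reads the sequence linearly from its first base; on a real circular chromosome the result depends only weakly on the start point, because the positions of the extremes are set by where the skew changes sign.<br />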
Structural analysis of promoter regions<br />
Structural features of the genomic DNA may <strong>in</strong>dicate promoter<br />
regions, as promoters normally have high curvature,<br />
melt easily <strong>and</strong> are more rigid. The DNA structural<br />
parameters mentioned earlier (position preference, stack<strong>in</strong>g<br />
energy, <strong>and</strong> <strong>in</strong>tr<strong>in</strong>sic curvature) together with AT<br />
content <strong>and</strong> DNAse sensitivity (Brukner et al., 1995) were<br />
Fig. 6. Localization of the orig<strong>in</strong> <strong>and</strong> the<br />
term<strong>in</strong>us of replication <strong>in</strong> the A. borkumensis<br />
SK2 chromosome derived from str<strong>and</strong> bias<br />
curves: the median oligonucleotide skew<br />
curve (red), the GC weighted median (green)<br />
<strong>and</strong> the AT weighted median (blue) (Worn<strong>in</strong>g<br />
et al., 2006).<br />
©2007TheAuthors<br />
Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology
8 O. N. Reva et al.<br />
compiled into a structural profile of all upstream regions of<br />
A. borkumensis (see section Experimental procedures).<br />
The profile uses z-scores to measure how the average<br />
value of each property varies from -400 bp to +400 bp<br />
around the translation start (Fig. 7). A. borkumensis has<br />
a coding density of only 87%, which results in wider<br />
intergenic spacers, and this appears to give rise to a<br />
larger and wider peak of curvature, stacking energy and<br />
AT content (Fig. 7A). For comparison we also analysed<br />
the promoter profile of another ocean bacterium, Candidatus<br />
Pelagibacter ubique HTCC1062 (Giovannoni et al.,<br />
2005), an example of a highly streamlined genome with a<br />
coding density of 96%. Here we observed a much weaker<br />
curvature signal, and the distributions of stacking energy<br />
and AT content were narrower and had higher maxima<br />
(Fig. 7B).<br />
Next, the probability of opening during stress-induced<br />
DNA duplex destabilization was computed using the<br />
program SIDD (Wang et al., 2004), covering five different<br />
values of the super-helical density σ = {-0.025, -0.035,<br />
-0.045, -0.055, -0.065}. As supercoiling becomes stronger,<br />
that is, at lower (more negative) super-helical densities,<br />
the probability of opening increases in A. borkumensis (Fig. 7C). In<br />
contrast, a narrower SIDD profile that exhibits only<br />
minor dependence on super-helical density (Fig. 7D)<br />
was calculated for the Candidatus Pelagibacter ubique<br />
HTCC1062 genome.<br />
The structural profile for the promoter regions of<br />
A. borkumensis was compared with those of the closely related<br />
species identified above (see Fig. 4). Generally, it looked<br />
more like the promoter profile of members of the<br />
Pseudomonadales than that of the general comparison organism,<br />
E. coli. Moreover, the promoter profile was very different<br />
from the promoter profile of the X. fastidiosa<br />
strains, even though they were very similar with regard<br />
to their TU profile (see Fig. 4). The promoter profiles for<br />
the above-mentioned organisms may be found at our<br />
website (http://www.cbs.dtu.dk/services/GenomeAtlas/).<br />
Amino acid and codon usage<br />
We have examined the codon and amino acid usage of<br />
A. borkumensis and compared it with the usage of<br />
bacteria in general and of 16 oceanic bacteria (Entrez<br />
project IDs 230, 10 645, 12 530, 13 233, 13 239, 13 282,<br />
13 642, 13 643, 13 654, 13 655, 13 902, 13 906, 13 910,<br />
13 911, 13 989, 15 660; Willenbrock et al., 2006). In<br />
Fig. 8, the codon usage plot of A. borkumensis is<br />
superimposed on the cumulative plot of all completely<br />
sequenced bacteria in public databases (N = 518,<br />
Fig. 8A) or on that of 16 oceanic bacteria (Fig. 8B).<br />
A few codons are differentially utilized in A. borkumensis<br />
(GUC, CUG), but all values are within the range of three<br />
standard deviations. In other words, the codon usage of<br />
A. borkumensis lies within the typical range of<br />
eubacteria.<br />
Interestingly, the sequenced oceanic bacteria share a<br />
very similar amino acid usage (Fig. 8D), whereas broad<br />
variations were noted amongst all sequenced<br />
bacteria, which represent the whole spectrum of habitats<br />
(Fig. 8C). A. borkumensis roughly follows the profile of the<br />
oceanic bacteria, although cysteine, tryptophan, leucine,<br />
proline, arginine and serine are under-utilized, and glutamic<br />
acid, lysine, phenylalanine, histidine, methionine and<br />
tyrosine are over-utilized, all exceeding the three standard<br />
deviation boundaries.<br />
Conclusion<br />
Fig. 7. Profile of structural properties of<br />
promoter regions (A and B) and probabilities<br />
of opening during stress-induced DNA duplex<br />
destabilization at various super-helical<br />
densities (C and D) in the A. borkumensis<br />
SK2 (A and C) and Candidatus Pelagibacter<br />
ubique HTCC1062 (B and D) chromosomes.<br />
Each annotated gene was aligned at the<br />
translation start site and the average values<br />
for the SIDD probabilities, AT-content, position<br />
preference, stacking energy, intrinsic<br />
curvature and DNase sensitivity were<br />
calculated at each position in the alignment.<br />
The values were subsequently converted into<br />
z-scores, using the average and standard<br />
deviation of the entire chromosome. Values<br />
are smoothed over a 5 bp window.<br />
Inspection of the collected phylogenetic connections<br />
revealed that the most closely related organisms are<br />
Acinetobacter sp. and Pseudomonas aeruginosa,<br />
although in trees where both Pseudomonas and Acinetobacter<br />
are present, A. borkumensis tends to cluster more<br />
often with the latter.<br />
The major structural feature of the A. borkumensis<br />
chromosome is its symmetry and homogeneity. The<br />
genome contains only very few regions with extraordinarily<br />
low or high curvature, position preference or base<br />
stacking energy. The chromosomal frame is symmetric:<br />
the origin and the terminus of replication are located<br />
opposite each other in the chromosome and are clearly<br />
discerned by maxima of oligonucleotide usage biases<br />
between leading and lagging strand.<br />
The genetic repertoire of A. borkumensis is most similar<br />
to that of Acinetobacter and P. aeruginosa. Moreover,<br />
<strong>Comparative</strong> genomics of Alcanivorax borkumensis 9<br />
Fig. 8. Codon usage (A and B) and amino acid usage (C and D) of A. borkumensis SK2 compared with those of 518 completely sequenced<br />
bacteria (A and C) or with those of 16 sequenced oceanic bacteria (B and D). Frequencies of amino acids and codons were counted for each<br />
genome and normalized. Mean value (grey line) and three standard deviations (grey solid area) represent the global usage of individual<br />
codons (A and B) and amino acids (C and D) in the 518 (A and C) or 16 (B and D) reference genomes. The red line (A and B) shows the<br />
codon usage and the blue line (C and D) shows the amino acid usage of A. borkumensis.<br />
A. borkumensis shares a similar oligonucleotide usage<br />
with the Xanthomonadales and Pseudomonadales, indicating<br />
close phylogenetic relationships with these orders,<br />
in accordance with 16S rDNA sequence relatedness<br />
(Schneiker et al., 2006). Amongst this subgroup of completely<br />
sequenced genomes, the A. borkumensis chromosome<br />
harbours the lowest relative number of genome<br />
islands with atypical tetranucleotide usage. P. putida<br />
KT2440, for example, carries threefold more islands per<br />
megabase in its chromosome (Weinel et al., 2002). Interestingly,<br />
one of the three enzyme systems that are<br />
upregulated in alkane-grown cells (Sabirova et al., 2006),<br />
the well-known alkB1 cluster, is encoded by genome<br />
islands. The molecular evolution of the alk genes that are<br />
encoded by a catabolic transposon (van Beilen et al.,<br />
2001) is remarkable: the Alcanivorax genes were probably<br />
acquired from the Yersinia lineage, whereas the<br />
P. putida genes exhibit the typical Alcanivorax tetranucleotide<br />
signature. Horizontal gene transfer thus conferred<br />
what is probably the most important metabolic trait on<br />
A. borkumensis, but otherwise the stable seawater habitat<br />
apparently did not favour the shuffling and exchange<br />
of genes with other taxa. Instead, a symmetric and<br />
structurally homogeneous chromosome evolved that<br />
lacks numerous metabolic traits (Yakimov et al., 1998;<br />
Schneiker et al., 2006) found in its versatile Pseudomonas<br />
relatives, which are endowed with twofold larger chromosomes<br />
(Stover et al., 2000; Nelson et al., 2002).<br />
Experimental procedures<br />
Genomic sequence<br />
The comparative genomics analyses were based on the<br />
genomic sequence of A. borkumensis SK2 (Golyshin et al.,<br />
2003) and its annotation (Schneiker et al., 2006).<br />
Atlas visualization<br />
Atlases, developed in house, make it possible to visualize<br />
correlations between position-dependent information contained<br />
within a chromosome. Circular graphical representations<br />
of the entire A. borkumensis genome were created<br />
using the atlas visualization tool, GeneWiz. Each feature,<br />
such as AT content, is represented by a separate circle in the<br />
atlas. Typically, mean values are pictured in grey and extreme<br />
values are highlighted in a user-defined colour (Pedersen<br />
et al., 2000).<br />
Phylome atlas. For each amino acid sequence, phylogenetic<br />
trees were automatically constructed as described in<br />
Sicheritz-Ponten and Andersson (2001). The phylogenomic<br />
information of the resulting 1919 phylogenetic trees was<br />
extracted and analysed in the PyPhy system.<br />
Genome atlas. The genome atlas combines several generally<br />
informative properties: structural<br />
features (intrinsic curvature, stacking energy and position<br />
preference), repeat properties (global direct and<br />
inverted repeats) and the main base composition features<br />
(GC skew and percent AT).<br />
Intrinsic curvature was calculated using the CURVATURE<br />
software (Shpigelman et al., 1993). Stacking energy of a<br />
DNA segment was determined by the method of Ornstein and<br />
colleagues (1978). Position preference was based on a trinucleotide<br />
model that estimates the helix flexibility (Satchwell<br />
et al., 1986). Base composition is generally divided into AT<br />
content and GC skew. Both were calculated from the nucleotide<br />
sequence. Global direct and inverted repeats were<br />
found using variations of an algorithm that finds the highest<br />
degree of homology for a 15 bp repeat within a window of<br />
length 100 bp (Jensen et al., 1999).<br />
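The two base composition features can be illustrated with a short Python sketch. It is not the code used for the atlases: the function names are hypothetical, the GC skew is taken in its common form (G - C)/(G + C), which the text does not spell out, and the window and step sizes are left to the caller.

```python
def at_content(seq: str) -> float:
    """Fraction of A and T bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def gc_skew(seq: str) -> float:
    """GC skew, assumed here to be (G - C) / (G + C); 0.0 if no G or C is present."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed(seq: str, window: int, step: int, stat) -> list:
    """Apply a per-window statistic along a linear sequence.

    Only complete windows are evaluated; wrapping around a circular
    chromosome is left out to keep the sketch short.
    """
    return [stat(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

For example, `windowed(genome, 10000, 1000, gc_skew)` would give one skew value per 10 kb window, where `genome` is a placeholder for the chromosome sequence as a string.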
Codon and amino acid usage<br />
Codon and amino acid usage were calculated from all coding<br />
regions in the genome as annotated in the GenBank entries.<br />
The relative synonymous codon usage was calculated by<br />
comparing the codon distribution from a set of highly<br />
expressed genes with a background distribution estimated<br />
from the codon usage of all coding regions in the genome<br />
(Willenbrock et al., 2006). In order to identify a set of constitutively<br />
highly expressed genes in A. borkumensis, the reference<br />
set of 27 very highly expressed Escherichia coli genes<br />
originally compiled by Sharp and Li (1986) was aligned at the<br />
protein level against all genes annotated in the GenBank<br />
entry using BLASTP version 2.2.9 (Altschul et al., 1997). For<br />
each of these very highly expressed genes, the gene with the<br />
best alignment was added to the set of very highly expressed<br />
A. borkumensis genes if it had an E-value below 10<sup>-6</sup>.<br />
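The counting step, and the comparison against a three standard deviation band used in Fig. 8, might be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the coding sequences are assumed to be in-frame DNA strings, and the relative synonymous codon usage calculation is not reproduced.

```python
from collections import Counter
from statistics import mean, stdev


def codon_frequencies(cds_list):
    """Normalized codon frequencies over a list of in-frame coding sequences.

    Codons are reported in RNA notation (U instead of T), as in the text.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("T", "U")
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        counts.update(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def outside_three_sd(query, references):
    """Codons whose frequency in `query` lies more than three standard
    deviations from the mean of the reference genomes.

    `references` is a list of frequency dictionaries (at least two genomes).
    """
    flagged = []
    for codon, value in query.items():
        ref_values = [ref.get(codon, 0.0) for ref in references]
        mu, sd = mean(ref_values), stdev(ref_values)
        if abs(value - mu) > 3 * sd:
            flagged.append(codon)
    return sorted(flagged)
```

The same two functions apply to amino acid usage if the input strings are split into single residues rather than triplets.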
TU patterns<br />
Overlapping tetranucleotide words were counted in the bacterial<br />
nucleotide sequences by shifting the window in steps of<br />
1 nucleotide. The total word number in a circular sequence<br />
equals the sequence length. The observed counts of words<br />
(Co) were compared with the expected counts of words (Ce).<br />
Assuming the same distribution frequency for all words irrespective<br />
of their composition and of the mononucleotide<br />
content of the sequence, Ce equals the ratio of the sequence length to the<br />
number of different tetranucleotide words Nw (256 for<br />
tetranucleotides).<br />
The deviation Δw of observed from expected counts is<br />
given by<br />
\[ \Delta_w = (C_o - C_e) \times C_o^{-1} \]<br />
For the comparison of sequences by TU patterns, the words<br />
in each sequence were ranked by Δw values. Rank numbers<br />
instead of word counts were used to simplify pattern comparison<br />
and to remove sequence length bias.<br />
The distance D between two patterns was calculated as<br />
the sum of absolute distances between ranks of identical<br />
words in patterns i and j, expressed as a<br />
percentage of the maximal possible distance:<br />
\[ D(\%) = 100 \times \frac{\sum_w \left| \mathrm{rank}_{w,i} - \mathrm{rank}_{w,j} \right|}{D_{\max}} \]<br />
where<br />
\[ D_{\max} = \frac{N_w (N_w - 1)}{2} \]<br />
Dmax is the maximal distance that is theoretically possible<br />
between two patterns. For TU patterns Nw is 256. For more<br />
information about methods of oligonucleotide usage statistics<br />
see Reva and Tümmler (2004; 2005).<br />
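The counting, ranking and distance steps of this section can be sketched in Python. This is a minimal illustration, not the published implementation: the function names are hypothetical, ties between equally frequent words are broken alphabetically (an assumption), words containing characters other than A, C, G and T are skipped, and the deviation is normalized by the expected count to avoid dividing by zero for unobserved words. Because Ce is the same for every word, the ranks depend only on the observed counts. Dmax is taken as Nw(Nw - 1)/2, as stated in the text.

```python
from itertools import product

# All 256 tetranucleotide words in alphabetical order.
WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_counts(seq):
    """Count overlapping 4-mers in a circular sequence, window step 1."""
    seq = seq.upper()
    circular = seq + seq[:3]  # wrap around so every position starts a word
    counts = dict.fromkeys(WORDS, 0)
    for i in range(len(seq)):
        word = circular[i:i + 4]
        if word in counts:  # skips words with ambiguous bases such as N
            counts[word] += 1
    return counts


def rank_pattern(seq):
    """Rank the 256 words by their deviation from the expected count."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = tetranucleotide_counts(seq)
    expected = len(seq) / len(WORDS)  # Ce: equal frequency for all words
    deviation = {w: (counts[w] - expected) / expected for w in WORDS}
    # Rank 1 is the most over-represented word; ties broken alphabetically.
    ordered = sorted(WORDS, key=lambda w: (-deviation[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


def pattern_distance(seq_i, seq_j):
    """Distance D (%) between the rank patterns of two sequences."""
    ranks_i, ranks_j = rank_pattern(seq_i), rank_pattern(seq_j)
    n_w = len(WORDS)
    d_max = n_w * (n_w - 1) / 2
    total = sum(abs(ranks_i[w] - ranks_j[w]) for w in WORDS)
    return 100 * total / d_max
```

Comparing a sequence with itself gives a distance of zero, and working on ranks rather than raw counts means that two sequences of different length can be compared directly.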
Origin plot<br />
The origin plot was constructed as described in Worning<br />
and colleagues (2006). In brief, the difference between a<br />
hypothetical leading and lagging strand is plotted for various<br />
positions on the chromosome. The frequencies of all<br />
oligonucleotides from 2-mers to 8-mers on the leading and<br />
lagging strands in a 60% window were counted, and the information<br />
content was calculated and summarized over all<br />
oligonucleotides for every putative origin. The G/C and A/T weighted<br />
strand biases were included to distinguish between origin and<br />
terminus.<br />
Structural profile of the promoter region<br />
Each annotated gene was aligned at the translation start site<br />
and the average values for five DNA structural features<br />
(AT content, position preference, stacking energy, intrinsic<br />
curvature, DNase sensitivity; see section Genome atlas)<br />
were calculated at each position in the alignment. The values<br />
were subsequently centered and scaled, then smoothed within<br />
a 5 bp window using Gaussian smoothing.<br />
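The alignment, averaging and scaling steps can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than part of the method: the function names are hypothetical, only forward-strand genes are handled, the property values are supplied as one number per chromosome position, and a plain moving average stands in for the Gaussian smoothing named in the text.

```python
from statistics import mean, pstdev


def promoter_profile(values, starts, flank=400):
    """Average a per-position property around translation starts and
    convert it to z-scores using the mean and standard deviation of the
    whole chromosome.

    values: one property value per chromosome position (0-based).
    starts: translation start positions of forward-strand genes.
    Returns 2 * flank + 1 z-scores for offsets -flank .. +flank.
    """
    n = len(values)
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("property is constant along the chromosome")
    profile = []
    for offset in range(-flank, flank + 1):
        # Modulo indexing treats the chromosome as circular.
        column = [values[(s + offset) % n] for s in starts]
        profile.append((mean(column) - mu) / sd)
    return profile


def smooth(profile, window=5):
    """Moving average over `window` positions (shorter at the ends).

    This replaces the Gaussian kernel of the original method.
    """
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        chunk = profile[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Reverse-strand genes would need their windows read in the opposite direction before averaging, which is omitted here.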
Acknowledgements<br />
The analysis has been performed within the frame of the<br />
‘Task Force Genome Linguistics’ of the competence<br />
network ‘Genome Research on Bacteria Relevant for Agriculture,<br />
Environment and Biotechnology’ funded by the<br />
Federal Ministry of Education and Research (BMBF),<br />
Germany (Contracts 031U213D and 031U113D). We thank<br />
Peter Golyshin, Vitor Martins dos Santos and Kenneth N.<br />
Timmis, Helmholtz Center for Infection Research, Braunschweig,<br />
for stimulating discussions during the initiation of the<br />
study and Olaf Kaiser, Lehrstuhl für Genetik, Universität<br />
Bielefeld, for the provision of sequence data at an early<br />
stage of the sequencing project. O.R. has been a recipient<br />
of a postdoctoral stipend of the DFG-sponsored International<br />
Training Group ‘Pseudomonas: Pathogenicity and<br />
Biotechnology’.<br />
References<br />
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.,<br />
Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped<br />
BLAST and PSI-BLAST: a new generation of protein<br />
database search programs. Nucleic Acids Res 25: 3389–<br />
3402.<br />
Baldi, P., and Baisnee, P.F. (2000) Sequence analysis by<br />
additive scales: DNA structure for sequences and repeats<br />
of all lengths. Bioinformatics 16: 865–889.<br />
van Beilen, J.B., Panke, S., Lucchini, S., Franchini, A.G.,<br />
Röthlisberger, M., and Witholt, B. (2001) Analysis of<br />
Pseudomonas putida alkane-degradation gene clusters<br />
and flanking insertion sequences: evolution and regulation<br />
of the alk genes. Microbiology 147: 1621–1630.<br />
van Beilen, J.B., Marin, M.M., Smits, T.H.M., Röthlisberger,<br />
M., Franchini, A.G., Witholt, B., and Rojo, F. (2004)<br />
Characterization of two alkane hydroxylase genes from<br />
the marine hydrocarbonoclastic bacterium Alcanivorax<br />
borkumensis. Environ Microbiol 6: 264–273.<br />
Brukner, I., Sanchez, R., Suck, D., and Pongor, S. (1995)<br />
Sequence-dependent bending propensity of DNA as<br />
revealed by DNase I: parameters for trinucleotides. EMBO<br />
J 14: 1812–1818.<br />
Chen, X.-H., Koumoutsi, A., Scholz, R., Eisenreich, A.,<br />
Schneider, K., Schneider, I., et al. (2007) Comparative<br />
analysis of the complete genome sequence of the plant<br />
growth promoting Bacillus amyloliquefaciens FZB42.<br />
Nat Biotechnol 25: 1007–1014.<br />
Dlakic, M., Ussery, D., and Brunak, S. (2004) DNA bendability<br />
and nucleosome positioning in transcriptional<br />
regulation. In DNA Conformation and Transcription.<br />
Ohyama, T. (ed.). Austin, TX: Landes Bioscience, pp. 198–<br />
211.<br />
Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J.<br />
(2004) Genomic islands in pathogenic and environmental<br />
microorganisms. Nat Rev Microbiol 2: 414–424.<br />
Giovannoni, S.J., Tripp, H.J., Givan, S., Podar, M., Vergin,<br />
K.L., Baptista, D., et al. (2005) Genome streamlining in a<br />
cosmopolitan oceanic bacterium. Science 309: 1242–<br />
1245.<br />
Golyshin, P.N., Martins Dos Santos, V.A., Kaiser, O., Ferrer,<br />
M., Sabirova, Y.S., Lunsdorf, H., et al. (2003) Genome<br />
sequence completed of Alcanivorax borkumensis, a<br />
hydrocarbon-degrading bacterium that plays a global role<br />
in oil removal from marine systems. J Biotechnol 106:<br />
215–220.<br />
Hara, A., Syutsubo, K., and Harayama, S. (2003) Alcanivorax<br />
which prevails in oil-contaminated seawater exhibits broad<br />
substrate specificity for alkane degradation. Environ<br />
Microbiol 5: 746–753.<br />
Hara, A., Baik, S.H., Syutsubo, K., Misawa, N., Smits, T.H.,<br />
van Beilen, J.B., and Harayama, S. (2004) Cloning and<br />
functional analysis of alkB genes in Alcanivorax borkumensis<br />
SK2. Environ Microbiol 6: 191–197.<br />
Harayama, S., Kishira, H., Kasai, Y., and Shutsubo, K. (1999)<br />
Petroleum biodegradation in marine environments. J Mol<br />
Microbiol Biotechnol 1: 63–70.<br />
Jensen, L.J., Friis, C., and Ussery, D.W. (1999) Three<br />
views of microbial genomes. Res Microbiol 150: 773–<br />
777.<br />
Kasai, Y., Kishira, H., Sasaki, I., Syutsubo, K., Watanabe, K.,<br />
and Harayama, S. (2002) Predominant growth of Alcanivorax<br />
strains in oil-contaminated and nutrient-supplemented sea<br />
water. Environ Microbiol 4: 141–147.<br />
Kasai, Y., Kishira, H., Syutsubo, K., and Harayama, S. (2001)<br />
Molecular detection of marine bacterial populations on<br />
beaches contaminated by the Nakhodka tanker oil accident.<br />
Environ Microbiol 3: 246–255.<br />
Klockgether, J., Würdemann, D., Reva, O., Wiehlmann, L.,<br />
and Tümmler, B. (2007) Diversity of the abundant<br />
pKLC102/PAGI-2 family of genomic islands in Pseudomonas<br />
aeruginosa. J Bacteriol 189: 2443–2459.<br />
Luiten, R.G., Putterman, D.G., Schoenmakers, J.G.,<br />
Konings, R.N., and Day, L.A. (1985) Nucleotide sequence<br />
of the genome of Pf3, an IncP-1 plasmid-specific filamentous<br />
bacteriophage of Pseudomonas aeruginosa. J Virol<br />
56: 268–276.<br />
McKew, B.A., Coulon, F., Osborn, A.M., Timmis, K.N., and<br />
McGenity, T.J. (2007a) Determining the identity and roles<br />
of oil-metabolizing marine bacteria from the Thames<br />
estuary, UK. Environ Microbiol 9: 165–176.<br />
McKew, B.A., Coulon, F., Yakimov, M.M., Denaro, R., Genovese,<br />
M., Smith, C.J., et al. (2007b) Efficacy of intervention<br />
strategies for bioremediation of crude oil in marine<br />
systems and effects on indigenous hydrocarbonoclastic<br />
bacteria. Environ Microbiol 9: 1562–1571.<br />
Nelson, K.E., Weinel, C., Paulsen, I.T., Dodson, R.J., Hilbert,<br />
H., Martins dos Santos, V.A., et al. (2002) Complete<br />
genome sequence and comparative analysis of the metabolically<br />
versatile Pseudomonas putida KT2440. Environ<br />
Microbiol 4: 799–808.<br />
Ornstein, R., Rein, R., Breen, D., and MacElroy, R. (1978)<br />
An optimized potential function for the calculation of<br />
nucleic acid interaction energies. Biopolymers 17: 2341–<br />
2360.<br />
Pedersen, A.G., Jensen, L.J., Brunak, S., Staerfeldt, H.H.,<br />
and Ussery, D.W. (2000) A DNA structural atlas for<br />
Escherichia coli. J Mol Biol 299: 907–930.<br />
Pride, D.T., Meinersmann, R.J., Wassenaar, T.M., and<br />
Blaser, M.J. (2003) Evolutionary implications of microbial<br />
genome tetranucleotide frequency biases. Genome Res<br />
13: 145–158.<br />
Reva, O.N., and Tümmler, B. (2004) Global features of<br />
sequences of bacterial chromosomes, plasmids and<br />
phages revealed by analysis of oligonucleotide usage<br />
patterns. BMC Bioinformatics 5: 90.<br />
Reva, O.N., and Tümmler, B. (2005) Differentiation of regions<br />
with atypical oligonucleotide composition in bacterial<br />
genomes. BMC Bioinformatics 6: 251.<br />
Röling, W.F., Milner, M.G., Jones, D.M., Lee, K., Daniel, F.,<br />
Swannell, R.J., et al. (2002) Robust hydrocarbon degradation<br />
and dynamics of bacterial communities during<br />
nutrient-enhanced oil spill bioremediation. Appl Environ Microbiol<br />
68: 5537–5548.<br />
Sabirova, J.S., Ferrer, M., Regenhardt, D., Timmis, K.N., and<br />
Golyshin, P.N. (2006) Proteomic insights into metabolic<br />
adaptations in Alcanivorax borkumensis induced by alkane<br />
utilization. J Bacteriol 188: 3763–3773.<br />
Saitou, N., and Nei, M. (1987) The neighbor-joining method:<br />
a new method for reconstructing phylogenetic trees. Mol<br />
Biol Evol 4: 406–425.<br />
Satchwell, S.C., Drew, H.R., and Travers, A.A. (1986)<br />
Sequence periodicities in chicken nucleosome core DNA.<br />
J Mol Biol 191: 659–675.<br />
Schneiker, S., Martins dos Santos, V.A., Bartels, D., Bekel,<br />
T., Brecht, M., Buhrmester, J., et al. (2006) Genome<br />
sequence of the ubiquitous hydrocarbon-degrading marine<br />
bacterium Alcanivorax borkumensis. Nat Biotechnol 24:<br />
997–1004.<br />
Sharp, P.M., and Li, W.H. (1986) Codon usage in regulatory<br />
genes in Escherichia coli does not reflect selection for ‘rare’<br />
codons. Nucleic Acids Res 14: 7737–7749.<br />
Shpigelman, E.S., Trifonov, E.N., and Bolshoy, A. (1993)<br />
CURVATURE: software for the analysis of curved DNA.<br />
Comput Appl Biosci 9: 435–440.<br />
Sicheritz-Ponten, T., and Andersson, S.G. (2001) A phylogenomic<br />
approach to microbial evolution. Nucleic Acids Res<br />
29: 545–552.<br />
Skovgaard, M., Jensen, L.J., Friis, C., Stærfeldt, H.H.,<br />
Worning, P., Brunak, S., and Ussery, D.W. (2002) The<br />
atlas visualisation of genome-wide information. In Methods<br />
in Microbiology. Wren, B., and Dorrell, N. (eds). London,<br />
UK: Academic Press, pp. 49–63.<br />
Smits, T.H., Balada, S.B., Witholt, B., and van Beilen, J.B.<br />
(2002) Functional analysis of alkane hydroxylases from<br />
gram-negative and gram-positive bacteria. J Bacteriol 184:<br />
1733–1742.<br />
Spangenberg, C., Fislage, R., Römling, U., and Tümmler, B.<br />
(1997) Disrespectful type IV pilins. Mol Microbiol 25: 203–<br />
204.<br />
Stover, C.K., Pham, X.Q., Erwin, A.L., Mizoguchi, S.D., Warrener,<br />
P., Hickey, M.J., et al. (2000) Complete genome<br />
sequence of Pseudomonas aeruginosa PAO1, an opportunistic<br />
pathogen. Nature 406: 959–964.<br />
Syutsubo, K., Kishira, H., and Harayama, S. (2001) Development<br />
of specific oligonucleotide probes for the identification<br />
and in situ detection of hydrocarbon-degrading<br />
Alcanivorax strains. Environ Microbiol 3: 371–379.<br />
Teeling, H., Meyerdierks, A., Bauer, M., Amann, R., and<br />
Glockner, F.O. (2004) Application of tetranucleotide<br />
frequencies for the assignment of genomic fragments.<br />
Environ Microbiol 6: 938–947.<br />
Ussery, D., Soumpasis, D.M., Brunak, S., Staerfeldt, H.H.,<br />
Worning, P., and Krogh, A. (2002) Bias of purine stretches<br />
in sequenced chromosomes. Comput Chem 26: 531–541.<br />
Wang, H., Noordewier, M., and Benham, C.J. (2004) Stress-<br />
Induced DNA Duplex destabilization (SIDD) in the E. coli<br />
genome: SIDD sites are closely associated with promoters.<br />
Genome Res 14: 1575–1584.<br />
Weinel, C., Nelson, K.E., and Tümmler, B. (2002) Global<br />
features of the Pseudomonas putida KT2440 genome<br />
sequence. Environ Microbiol 4: 809–818.<br />
Willenbrock, H., and Ussery, D.W. (2007) Prediction of highly<br />
expressed genes in microbes based on chromatin<br />
accessibility. BMC Mol Biol 8: 11.<br />
Willenbrock, H., Friis, C., Juncker, A.S., and Ussery, D.W.<br />
(2006) An environmental signature for 323 microbial<br />
genomes based on codon adaptation indices. Genome Biol<br />
7: R114.<br />
Worning, P., Jensen, L.J., Hallin, P.F., Staerfeldt, H.H., and<br />
Ussery, D.W. (2006) Origin of replication in circular<br />
prokaryotic chromosomes. Environ Microbiol 8: 353–<br />
361.<br />
Yakimov, M.M., Golyshin, P.N., Lang, S., Moore, E.R.,<br />
Abraham, W.R., Lunsdorf, H., and Timmis, K.N. (1998)<br />
Alcanivorax borkumensis gen. nov., sp. nov., a new,<br />
hydrocarbon-degrading and surfactant-producing marine<br />
bacterium. Int J Syst Bacteriol 48: 339–348.<br />
© 2007 The Authors<br />
Journal compilation © 2007 Society for Applied Microbiology and Blackwell Publishing Ltd, Environmental Microbiology
Paper III: Global features of the Alcanivorax borkumensis SK2 genome
2.9 Paper IV: The origins of Vibrio species<br />
Microb Ecol<br />
DOI 10.1007/s00248-009-9596-7<br />
MINIREVIEWS<br />
On the Origins of a Vibrio Species<br />
Tammi Vesth & Trudy M. Wassenaar & Peter F. Hallin &<br />
Lars Snipen & Karin Lagesen & David W. Ussery<br />
Received: 3 July 2009 / Accepted: 17 September 2009<br />
© The Author(s) 2009. This article is published with open access at Springerlink.com<br />
Abstract Thirty-two genome sequences of various Vibrionaceae members are compared, with emphasis on what makes V. cholerae unique. As few as 1,000 gene families are conserved across all the Vibrionaceae genomes analysed; this fraction roughly doubles for gene families conserved within the species V. cholerae. Of these, approximately 200 gene families that cluster on various locations of the genome are not found in other sequenced Vibrionaceae; these are possibly unique to the V. cholerae species. By comparing gene family content of the analysed genomes, the relatedness to a particular species is identified for two unspeciated genomes. Conversely, two genomes presumably belonging to the same species have suspiciously dissimilar gene family content. We are able to identify a number of genes that are conserved in, and unique to, V. cholerae. Some of these genes may be crucial to the niche adaptation of this species.<br />
T. Vesth, T. M. Wassenaar, P. F. Hallin, L. Snipen, K. Lagesen, D. W. Ussery (*)<br />
Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, 2800 Kgs. Lyngby, Denmark<br />
e-mail: dave@cbs.dtu.dk<br />
T. M. Wassenaar<br />
Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany<br />
P. F. Hallin<br />
Novozymes A/S, Krogshøjvej 36, 2880 Bagsværd, Denmark<br />
L. Snipen<br />
Biostatistics, Department of Chemistry, Biotechnology, and Food Sciences, Norwegian University of Life Sciences, Ås, Norway<br />
K. Lagesen<br />
Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo, Oslo, Norway<br />
Introduction<br />
The species concept for bacteria has long been under siege from several angles, and now with thousands of bacterial genomes being sequenced, the disputes have intensified [8]. One frequently used definition of a bacterial species is “a category that circumscribes a (preferably) genomically coherent group of individual isolates/strains sharing a high degree of similarity in (many) independent features, comparatively tested under highly standardized conditions” [12]. Such independent features are usually phenotypes that can easily be tested. For a new species to be defined, amongst other criteria, inter-species DNA–DNA hybridisation has to be below 70%, although this rule is not without its limitations [18]. In the late 1970s and 1980s, the 16S rRNA gene sequence was introduced as a molecular clock that could be used to infer phylogenetic relationships [50]. Ideally, isolates belonging to the same species have identical or nearly identical 16S rRNA genes, and these differ from those of isolates belonging to different species [32, 44]. In practice, this is not always the case. Examples exist of different species sharing identical rRNA genes (for instance, E. coli and Shigella [37], which are even placed in different genera); in addition, isolates of one species can have rRNA genes that differ beyond the 97% identity that is considered to demarcate species [4]. Lateral transfer of genetic material (to which ribosomal genes are believed to be resistant) destroys the phylogenetic relationship, so that phylogenies based on alternative housekeeping genes can differ from a 16S rRNA tree and frequently are not even in accordance with each other. Such observations question the validity of a phylogenetic tree as the most suitable model for bacterial ancestry, when multiple genetic transfers would produce a network-like evolutionary structure [6]. On the other hand, it is observed that lateral gene transfer is most frequent between genetically related members sharing a similar base content and occupying the same ecological niche [29]. Nevertheless, a core of genes can be recognised that produce coherent phylogenetic trees, though these may not represent the species’ complete evolutionary history as they comprise only a minor fraction of the genetic content of the organism [35].<br />
Whether a tree or a network describes phylogeny more accurately, in either case a bacterial species may be considered as a cloud of isolates having a higher level of genetic similarity to each other than to organisms belonging to a different species. When such clouds have fuzzy and overlapping borders, the species concept falls apart, but this applies only to certain cases [7]. Since 16S rRNA genes are not informative on the level of diversity within a species, the 'density' of a cloud of isolates making up a species cannot be determined by this gene. Those genes shared by all isolates belonging to one species comprise the core genome of that species [39], and the degree of diversity in the remaining non-core genes determines the density of the species cloud.<br />
We hypothesised that certain genes can be recognised as specific to a particular species, being conserved in that species but not present in related species. We tested our hypothesis with complete genome sequences of the bacterial family Vibrionaceae, which belongs to the γ-Proteobacteria and comprises eight genera. Most available genome sequences belong to the genus Vibrio. This genus contains 51 recognised species [10, 46] which are mainly found in marine environments, frequently living in association with marine organisms such as corals, fish, squid or zooplankton. Most of them are symbionts and only a few are human pathogens, notably particular serotypes of V. cholerae (causing cholera), Vibrio parahaemolyticus (causing gastroenteritis) and V. vulnificus (causing wound infections) [46]. Other Vibrionaceae, including V. vulnificus, Aliivibrio salmonicida and V. harveyi, are fish or shellfish pathogens and have major economic impact. Photobacterium profundum, representing another genus within the Vibrionaceae, was also included.<br />
The gene content of 32 available sequenced Vibrionaceae genomes was compared and the results were analysed in various ways. The data allowed us to identify possible V. cholerae-specific genes, since this species was represented by 18 genomes, a sufficient number to test conservation both within the species and across species. We found that a two-component signal transduction pathway is conserved in V. cholerae but is not found outside this species. Our findings further indicated that a relatively small set of genes could possibly confer niche specialisation, allowing V. cholerae to adapt to a unique environment, so that over time V. cholerae has become a distinct species.<br />
Materials and Methods<br />
Genomes and Gene Annotations Used<br />
Publicly available genome sequences of Vibrionaceae were selected that were provided in fewer than 300 contigs and in which a full-length 16S rRNA sequence could be found using the rRNA gene finder RNAmmer [19]. The 32 genome sequences included are shown in Table 1.<br />
The gene annotations as provided in GenBank were used, except for those genomes marked “Easygene” in Table 1, for which protein annotation was not available in the RefSeq file at the time of analysis; for these, we used EasyGene [20] to identify the genes. As a control, an available GenBank annotation was compared to a generated Easygene annotation to confirm that the number of identified genes was comparable.<br />
Ribosomal RNA Analysis<br />
RNAmmer [19] was used to identify 16S rRNA sequences within the 32 genomes. Sequences were considered reliable if they were between 1,400 and 1,700 nucleotides long and had an RNAmmer score above 1,800. In cases where the program found multiple and variable 16S sequences within a genome, one of these (with satisfactory RNAmmer scores) was arbitrarily chosen. The sequences were aligned using PRANK [23, 24], and the program MEGA4 was used to elucidate a phylogenetic tree [45]. Within MEGA4, the tree was created using the Neighbor-Joining method with the uniform rate Jukes–Cantor distance measure and the complete-delete option. Five hundred resamplings were done to find the bootstrap values.<br />
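The distance step described above can be illustrated with a minimal sketch. It assumes that "complete-delete" means dropping every alignment column in which any sequence has a gap or an ambiguous base, and it uses the standard Jukes–Cantor correction d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The function names and the toy alignment are hypothetical; this is not the MEGA4 implementation.<br />

```python
import math

VALID_BASES = set("ACGT")


def complete_deletion(alignment):
    """Drop every column in which any sequence has a gap or ambiguous base.

    Assumption: this mirrors what "complete deletion" is taken to mean here.
    """
    seqs = [s.upper().replace("U", "T") for s in alignment]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] in VALID_BASES for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance between two gap-free aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    # p is the observed proportion of sites that differ.
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy alignment (hypothetical data, not real 16S sequences).
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACG-ACGTAC"]
cleaned = complete_deletion(toy)
for i in range(len(cleaned)):
    for j in range(i + 1, len(cleaned)):
        print(i, j, round(jukes_cantor(cleaned[i], cleaned[j]), 4))
```

A pairwise distance matrix built this way is the usual input to a neighbor-joining routine; bootstrap support would come from repeating the calculation on resampled alignment columns.<br />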
Table 1 Vibrionaceae genomes used in this analysis<br />
GPID | Organism | Contigs | Accession/GenBank | Status | No. of genes | Ref.<br />
36 | V. cholerae N16961 (a) | 2 | AE003852.1 | Fully sequenced | 3,828 | [15]<br />
15667 | V. cholerae O395 TIGR (a) | 2 | CP000626.1 | Fully sequenced | 3,875 | [11]<br />
32853 | V. cholerae O395 TEDA (a) | 2 | CP001235.1 | Fully sequenced | 3,934 | [49]<br />
33555 | V. cholerae MJ-1236 (a) | 2 | CP001485.1 | Fully sequenced | 3,774 | [31]<br />
15666 | V. cholerae MO10 (a) | 153 | NZ_AAKF00000000 | Unfinished (Easygene) | 3,421 | [5]<br />
15670 | V. cholerae V52 (a) | 268 | NZ_AAKJ00000000 | Unfinished (NCBI) | 3,815 | [16]<br />
33559 | V. cholerae BX330286 (a) | 8 | NZ_ACIA00000000 | Unfinished (NCBI) | 3,632 | [31]<br />
33557 | V. cholerae B33 (a) | 17 | NZ_ACHZ00000000 | Unfinished (NCBI) | 3,748 | [31]<br />
33553 | V. cholerae RC9 (a) | 11 | NZ_ACHX00000000 | Unfinished (NCBI) | 3,811 | [31]<br />
32851 | V. cholerae M66-2 | 2 | CP001233.1 | Fully sequenced | 3,693 | [49]<br />
18495 | V. cholerae MZO-2 | 162 | NZ_AAWF00000000 | Unfinished (NCBI) | 3,425 | [16]<br />
18265 | V. cholerae 1587 | 254 | NZ_AAUR00000000 | Unfinished (NCBI) | 3,758 | [16]<br />
18253 | V. cholerae 2740-80 | 257 | NZ_AAUT00000000 | Unfinished (NCBI) | 3,771 | [16]<br />
17723 | V. cholerae AM-19226 | 154 | NZ_AATY00000000 | Unfinished (Easygene) | 3,407 | [33]<br />
33561 | V. cholerae 12129 | 12 | NZ_ACFQ00000000 | Unfinished (NCBI) | 3,574 | [31]<br />
33549 | V. cholerae VL426 | 5 | NZ_ACHV00000000 | Unfinished (NCBI) | 3,461 | [31]<br />
33579 | V. cholerae TM 11079-80 | 35 | NZ_ACHW00000000 | Unfinished (NCBI) | 3,621 | [31]<br />
33551 | V. cholerae TMA 21 | 20 | NZ_ACHY00000000 | Unfinished (NCBI) | 3,600 | [31]<br />
13564 | V. campbellii AND4 | 143 | NZ_ABGR00000000 | Unfinished (NCBI) | 3,935 | [13]<br />
19857 | V. harveyi BAA-1116 | 3 | CP000789.1 | Fully sequenced | 6,064 | [1]<br />
349 | V. vulnificus CMCP6 | 2 | AE016795.2 | Fully sequenced | 4,538 | [38]<br />
1430 | V. vulnificus YJ016 | 3 | BA000037.2 | Fully sequenced | 5,028 | [3]<br />
19397 | V. shilonii AK1 | 158 | NZ_ABCH00000000 | Unfinished (NCBI) | 5,360 | [41]<br />
15693 | Vibrio sp. Ex25 | 222 | NZ_AAKK00000000 | Unfinished (Easygene) | 4,004 | [16]<br />
13616 | Vibrio sp. MED222 | 99 | NZ_AAND00000000 | Unfinished (NCBI) | 4,590 | [36]<br />
32815 | V. splendidus LGP32 | 2 | FM954973.1 | Fully sequenced | 4,434 | [27]<br />
19395 | V. parahaemolyticus 16 | 78 | NZ_ACCV00000000 | Unfinished (Easygene) | 3,780 | [9]<br />
360 | V. parahaemolyticus 2210633 | 2 | BA000031.2 | Fully sequenced | 4,832 | [25]<br />
12986 | A. fischeri ES114 | 3 | CP000020.1 | Fully sequenced | 3,823 | [42]<br />
19393 | A. fischeri MJ11 | 3 | CP001133.1 | Fully sequenced | 4,039 | [26]<br />
30703 | A. salmonicida LFI1238 | 6 | FM178379.1 | Fully sequenced | 4,284 | [17]<br />
13128 | P. profundum SS9 | 3 | CR354531.1 | Fully sequenced | 5,480 | [48]<br />
GPID: genome project identifier at NCBI. Contigs: the number of contiguous sequences, which for a completely sequenced genome is at least two (for two chromosomes) and can be up to six when plasmids are present. Unfinished sequences are represented by multiple contigs per chromosome.<br />
(a) Strains containing the genes encoding the cholera enterotoxin subunits.<br />
Pan-Genome Family Clustering<br />
A clustering of the genomes by shared gene families from the Vibrio pan-genome was constructed, based on BLASTP similarity using default settings. A BLASTP hit was considered significant if the alignment produced at least 50% identity for at least 50% of the length of the longest gene (either query or subject). Using this criterion, each pair of genes producing a significant reciprocal best hit was scored as belonging to the same gene family. A genome matrix was constructed, containing one row for each genome and one column for each gene family. Cell (i, j) in this matrix is 1 if genome i has a member in gene family j, 0 otherwise. A hierarchical clustering with average linkage, based on the Manhattan distance between genomes, was then performed. Two trees were made, one with more weight given to gene families present in most (90%, or between 27 and 30) Vibrio genomes (“stabilome”), and the other with more weight given to gene families present in only a few (two, three, or four) genomes (“mobilome”). Thus, the original Boolean matrix is scaled differently, depending on the number of genomes in each gene family [44]. For both trees, singletons (families which are only found in one genome) have been excluded.<br />
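The matrix and clustering steps above can be sketched as follows. This is a minimal, unweighted illustration: the optional `weights` argument only marks where a stabilome or mobilome weighting could enter, because the exact weighting scheme is not given in the text. All function names and the toy family sets are hypothetical, and singleton families are not filtered out here.<br />

```python
from itertools import combinations


def presence_matrix(genome_families, all_families):
    """One row per genome, one column per family: 1 if present, else 0."""
    return {genome: [1 if fam in fams else 0 for fam in all_families]
            for genome, fams in genome_families.items()}


def manhattan(row_a, row_b, weights=None):
    """Manhattan distance; `weights` (one per family) is an assumed hook."""
    if weights is None:
        weights = [1] * len(row_a)
    return sum(w * abs(a - b) for a, b, w in zip(row_a, row_b, weights))


def average_linkage(rows, weights=None):
    """Naive agglomerative clustering with average linkage.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in rows]
    merges = []
    while len(clusters) > 1:
        best = None
        for ca, cb in combinations(clusters, 2):
            # Average of all pairwise distances between the two clusters.
            dist = sum(manhattan(rows[a], rows[b], weights)
                       for a in ca for b in cb) / (len(ca) * len(cb))
            if best is None or dist < best[0]:
                best = (dist, ca, cb)
        dist, ca, cb = best
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
        merges.append((ca, cb, dist))
    return merges


# Toy input: three hypothetical genomes and five hypothetical families.
families = ["f1", "f2", "f3", "f4", "f5"]
toy = {"A": {"f1", "f2", "f3"}, "B": {"f1", "f2", "f4"}, "C": {"f1", "f5"}}
rows = presence_matrix(toy, families)
for members_a, members_b, dist in average_linkage(rows):
    print(members_a, members_b, dist)
```

The naive search over all cluster pairs is adequate for 32 genomes; a library routine would be the natural choice for larger collections.<br />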
Figure 1 Phylogenetic tree of the 16S rRNA gene extracted from 32 sequenced Vibrio genomes listed in Table 1. Environmental V. cholerae lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae genomes are in dark green. Further colouring was used for species for which two genomes are represented<br />
Pan- and Core Genome Analysis<br />
The results of the BLAST analysis were also used to construct a pan- and core genome plot as follows. Based on clusterings from the pan-genome family tree, an ordered set of genomes was constructed with V. cholerae genomes at the start. For the first chosen genome, all BLAST hits found in the second genome were recorded, and the cumulative number of gene families (as defined above) now recognised in total was plotted for the pan-genome. The number of gene families with at least one representative gene in both genomes was plotted for the core genome. A running total is plotted for the pan-genome, which increases as more genomes are added, whilst the core genome representing conserved gene families slowly decreases with the addition of more genomes.<br />
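The running totals described above amount to a cumulative union (pan-genome) and a cumulative intersection (core genome) over the ordered genomes. A minimal sketch, assuming each genome has already been reduced to a set of gene family identifiers; the function name and toy data are hypothetical:<br />

```python
def pan_core_curve(ordered_genomes):
    """Running pan- and core-genome sizes over an ordered list of genomes.

    ordered_genomes: list of (name, set_of_family_ids) in plotting order.
    Returns a list of (name, pan_size, core_size, new_families).
    """
    pan = set()
    core = None
    curve = []
    for name, fams in ordered_genomes:
        new_families = len(fams - pan)  # families not seen in earlier genomes
        pan |= fams                     # union grows as genomes are added
        core = set(fams) if core is None else core & fams  # intersection shrinks
        curve.append((name, len(pan), len(core), new_families))
    return curve


# Toy input: three hypothetical genomes with integer family identifiers.
toy = [("g1", {1, 2, 3, 4}), ("g2", {1, 2, 3, 5}), ("g3", {1, 2, 6, 7})]
for point in pan_core_curve(toy):
    print(point)
```

Because the pan-genome is a union and the core genome an intersection, the final sizes do not depend on genome order; only the shape of the curves does.<br />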
Whole-Genome BLAST Analysis and Construction of a BLAST Matrix<br />
The predicted genes of every genome (annotated or found by Easygene) were translated, and every gene was compared by BLASTP against every other genome and against its own genome. In the latter case, the hit to self was ignored. The 50/50 rule for BLAST hits as described above was used. If these requirements were met, genes were combined in a gene family. The BLAST results were visualised in a BLAST matrix [2], which summarises the results of genomic pairwise comparisons and reports, both as percentage and as absolute numbers, the number of reciprocal BLAST hits as a fraction of the total number of gene families found in the two genomes. For easier visual inspection, the cells in the matrix are coloured darker as the fraction of similarity increases. Hits identified within a genome are differently coloured.<br />
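The 50/50 rule and the value reported in a single BLAST matrix cell can be sketched as follows. Two assumptions are made: the alignment length is taken as the BLAST-reported alignment length, and each genome is represented by the set of gene family identifiers it contains. Function names and numbers are hypothetical.<br />

```python
def passes_50_50(identity_pct, alignment_len, query_len, subject_len):
    """At least 50% identity over at least 50% of the longer of the two genes."""
    longest = max(query_len, subject_len)
    return identity_pct >= 50.0 and alignment_len >= 0.5 * longest


def matrix_cell(families_a, families_b):
    """Shared families as a count and as a percentage of all families
    found in the two genomes together."""
    shared = len(families_a & families_b)
    total = len(families_a | families_b)
    pct = 100.0 * shared / total if total else 0.0
    return shared, total, pct


# Toy values (hypothetical).
print(passes_50_50(62.0, 180, 300, 340))   # alignment covers over half of 340
print(passes_50_50(62.0, 160, 300, 340))   # alignment too short
print(matrix_cell({1, 2, 3, 4}, {3, 4, 5}))
```

Reporting both the count pair and the percentage mirrors how the matrix cells are described in the text.<br />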
BLAST Atlas<br />
BLAST results were also visualised in a BLAST atlas, this time visualising, for all genes in the reference genome V. cholerae N16961, their best hit in all other genomes, again with a threshold of 50% identity over at least 50% of the length of the query protein. The atlas displays the hits as they are located in the reference strain [14]. The BLAST scores obtained for each queried gene are plotted, so that conserved and variable regions are located with respect to the reference genome. Note that genes absent in the reference genome are not shown in the lanes of the query genomes.<br />
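The atlas lanes described above can be approximated by keeping, for every reference gene and every compared genome, the best-scoring hit that passes the threshold. This is a sketch under assumptions: the hit tuple layout, the use of 0.0 for "no qualifying hit", and all names are hypothetical and do not describe the actual atlas software.<br />

```python
def atlas_lanes(ref_genes, hits):
    """Best qualifying BLAST score per reference gene, per compared genome.

    ref_genes: list of (gene_id, protein_length) in chromosomal order.
    hits: iterable of (gene_id, genome, identity_pct, alignment_len, score).
    Returns {genome: [score per reference gene, 0.0 where nothing qualifies]}.
    """
    length = dict(ref_genes)
    lanes = {}
    for gene, genome, identity_pct, alignment_len, score in hits:
        lane = lanes.setdefault(genome, {})
        # Threshold: at least 50% identity over at least 50% of the query protein.
        if (gene in length and identity_pct >= 50.0
                and alignment_len >= 0.5 * length[gene]):
            lane[gene] = max(score, lane.get(gene, 0.0))
    return {genome: [lane.get(gene, 0.0) for gene, _ in ref_genes]
            for genome, lane in lanes.items()}


# Toy input with hypothetical gene and genome names.
ref = [("geneA", 200), ("geneB", 100)]
toy_hits = [
    ("geneA", "G1", 90.0, 190, 400.0),
    ("geneA", "G1", 60.0, 120, 150.0),
    ("geneB", "G1", 80.0, 40, 70.0),    # fails the length criterion
    ("geneA", "G2", 45.0, 200, 100.0),  # fails the identity criterion
]
print(atlas_lanes(ref, toy_hits))
```

Genes that exist only in a compared genome never appear, because the lanes are indexed by reference genes alone; this matches the note at the end of the paragraph above.<br />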
Results<br />
Ribosomal RNA Analysis<br />
A phylogenetic tree based on the 16S rRNA gene extracted from the 32 analysed Vibrionaceae genomes is shown in Fig. 1. The 18 V. cholerae genomes form a tight subcluster, quite distant from the other species. Above this in the figure, another subcluster comprising eight genomes representing at least six species is recognised, and within this cluster the 16S rRNA genes of the two V. parahaemolyticus genomes are not found on the same branch. A third cluster, a bit further removed, includes Aliivibrio fischeri and A. salmonicida as well as V. splendidus and Vibrio sp. MED222; the gene of Photobacterium profundum is the most distant.<br />
Pan-Genome Family Trees<br />
Starting with a database containing the total set of all Vibrio gene families, a profile of matching gene families was constructed for each individual genome. This was stored as a matrix, containing a column for each gene family and a row for each genome. The rows contain a 1 or 0 representing the presence or absence of the gene family. This matrix was weighted to emphasise either the genes found in most genomes (the “stabilome”) or in only a few genomes (the “mobilome”); from these weighted matrices, clustering yielded the trees shown in Fig. 2. Shorter distances represent genomes with many gene families in common, and larger distances reflect genomes with fewer gene families in common. As expected, in both trees, genomes from the same species cluster together, whereby the depth of resolution within a species is considerably better than can be seen in the 16S rRNA tree in Fig. 1. Similarity between the unspeciated Vibrio isolate MED222 and V. splendidus is suggested by their close clustering; this is a connection also suggested by others [21]. Note that the unspeciated Vibrio isolate Ex25 and V. parahaemolyticus 2210633 cluster together in the mobilome tree, but are more distant in the stabilome. This implies that the genes shared between these two genomes are less common genes within the Vibrio genomes examined here. As already indicated by the 16S rRNA tree, the two V. parahaemolyticus isolates are quite dissimilar, and appear on separate branches. The Aliivibrio cluster is placed within Vibrio genomes in both the stabilome and the mobilome, as was the case for their 16S rRNA gene. P. profundum is not such an outlier as in the 16S rRNA tree, and in the stabilome it is even positioned close to the Aliivibrio genomes. Zooming in on the genomes of V. cholerae, a division into two subclusters can be seen; these clusters correspond to environmental vs. clinical isolates (with the exception of V52 in the stabilome).<br />
Figure 2 Pan-genome family clustering of the 32 Vibrio genome sequences. The two plots represent weighted values for genes present in at least 90% of the genomes (stabilome) or genes found in only a few (two to four) genomes (mobilome). The colours highlighting the species are the same as in Fig. 1<br />
[Figure 2 panels: “Vibrio, stabilome” and “Vibrio, mobilome”. Horizontal axis in both panels: relative Manhattan distance, 0.20 to 0.00.]<br />
Pan- and Core Genome Plot<br />
BLAST results were analysed to construct a pan-genome, which is a hypothetical collection of all the gene families that are found in the investigated genomes [28]. The core genome was constructed from all gene families that were represented at least once in every genome. Thus, the gene families conserved in all genomes represent their core genome; adding the remaining gene families produces the pan-genome. The resulting pan- and core genome plot is shown in Fig. 3. The genomes start with the documented clinical isolates of V. cholerae and then follow the order suggested by the pan-genome family clustering (Fig. 2), although genomes from the same species were kept together (the two V. parahaemolyticus genomes were split in the trees). As more genomes are added in the plot, the number of gene families in the pan-genome (blue line) increases, and the number of conserved gene families (red line) in the core genome decreases, albeit at a lower rate. This is because every genome can add many novel (and frequently different) genes to the pan-genome but only decreases the core genome by the few genes that are absent in that particular strain but were conserved in the previously analysed genomes. The pan-genome curve increases with a relatively steep slope when a novel species is added, as is obvious when a V. parahaemolyticus genome is added after the last V. cholerae. A stable plateau can be seen for the pan-genome of V. cholerae around 6,500 genes. Nevertheless, a small increase occurs when adding V. cholerae 1587; this is caused by the difference between the two subclusters of V. cholerae seen in Fig. 2. V. cholerae strain 2740-80 behaves atypically in all the figures shown; although documented as an environmental isolate, it appears closer to the clinical isolates in terms of overall genomic properties.<br />
Figure 3 Pan- and core genome plot of the 32 Vibrionaceae genomes. The colours highlighting species are the same as in Fig. 1<br />
[Figure 3 plotted series: pan genome, core genome, new gene families. Vertical axis: 0 to 25,000. Horizontal axis: the 32 genomes in order of addition.]<br />
Origins of V. cholerae<br />
Figure 4 BLAST matrix of the 32 Vibrionaceae genomes. The colours highlighting the species are the same as in Fig. 1.<br />
Since the reciprocal similarity (reported as percent) is not readable at this resolution, every matrix cell is coloured using the<br />
scales as indicated. The bottom row identifies hits (other than hits-to-self) found within a genome. Four matrix cells reporting<br />
high pairwise similarities are outlined; their numbers are specified in the text<br />
[Figure 4 image not reproduced. Colour scale end points: homology between proteomes, 30.0 % and 90.0 %; homology within proteomes, 0.0 % and 6.0 %.]
[Figure 5 image not reproduced. Reference: V. cholerae O1 El Tor N16961, chromosome 1 (2,961,149 bp) and chromosome 2 (1,072,310 bp), with gaps A to G and the superintegron marked. Lanes from the outer circle inwards: the 32 Vibrionaceae genomes (P. profundum SS9 outermost, V. cholerae N16961 innermost), genes on the positive strand, genes on the negative strand, stacking energy, position preference, global direct repeats and GC skew (inner circle).]<br />
When the first genome of A. fischeri, which is not a member<br />
of the Vibrio genus, is added, it does not add<br />
significantly more novel genes to the pan-genome than<br />
the Vibrio genomes did. This contrasts with P. profundum,<br />
which produces a sharp increase in the pan-genome, as<br />
does, interestingly, V. shilonii. Note that there are approximately<br />
20,200 gene families in total within the 32 sequenced<br />
Vibrionaceae genomes, whereas the core genome decreases<br />
to approximately 1,000 gene families.<br />
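The accumulation behind a pan- and core genome plot can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: it assumes that each genome has already been reduced to a set of gene-family identifiers, and the function name `pan_core_curve` and the toy family labels are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def pan_core_curve(genomes: Iterable[Set[str]]) -> List[Tuple[int, int]]:
    """Return (pan, core) gene-family counts after each genome is added.

    Assumption: every genome is given as a set of gene-family identifiers,
    in the order in which it should appear on the plot.
    """
    pan: Set[str] = set()
    core: Set[str] = set()
    curve: List[Tuple[int, int]] = []
    for index, families in enumerate(genomes):
        pan |= families                  # union: any family seen so far
        if index == 0:
            core = set(families)         # the first genome defines the core
        else:
            core &= families             # intersection: families in all genomes
        curve.append((len(pan), len(core)))
    return curve


# Toy data (hypothetical family labels, not taken from the paper).
toy_genomes = [
    {"f1", "f2", "f3", "f4"},
    {"f1", "f2", "f3", "f5"},
    {"f1", "f2", "f6", "f7", "f8"},
]
curve = pan_core_curve(toy_genomes)
for n, (pan_size, core_size) in enumerate(curve, start=1):
    print(f"genomes={n} pan={pan_size} core={core_size}")
```

In this toy run the third genome adds three families to the pan-genome but removes only one from the core, which is the asymmetry described for Fig. 3.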
BLAST Comparison Visualised in a BLAST Matrix<br />
A BLAST matrix provides a visual overview of reciprocal<br />
pairwise whole-genome comparisons, as shown in Fig. 4.<br />
The more strongly a matrix cell is coloured, the more similarity<br />
was detected between the gene content of two genomes. As<br />
can be seen in the lower right triangle, all V. cholerae<br />
genomes are highly similar, with similarity ranging between<br />
64% and 93% for any given pair of genomes. No statistical<br />
difference was observed when comparing clinical isolates<br />
with environmental isolates. The two A. fischeri and the two<br />
V. vulnificus genomes also share a high degree of similarity<br />
within their species (75% and 67%, respectively), visible at<br />
the bottom of the matrix. In contrast, the two V. parahaemolyticus<br />
genomes share only 35% similarity, which is<br />
not higher than the similarity detected between genomes of<br />
different species. With 72% similarity, isolate MED222<br />
most closely matches V. splendidus, and with 65%, isolate<br />
Ex25 again shares most similarity with V. parahaemolyticus<br />
2210633.<br />
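As a rough illustration of how a single matrix cell could be derived, the Python sketch below computes a percentage from two sets of gene families. It assumes that a cell reports the families shared by two proteomes divided by all families found in either one; this section does not spell out the definition, so the formula is an assumption, and `matrix_cell` and `percent_from_counts` are hypothetical helpers.

```python
from typing import Set, Tuple


def matrix_cell(families_a: Set[str], families_b: Set[str]) -> Tuple[int, int, float]:
    """Return (shared, total, percent) for one pairwise comparison.

    Assumption: similarity = shared families / families in either proteome.
    """
    shared = len(families_a & families_b)
    total = len(families_a | families_b)
    percent = 100.0 * shared / total if total else 0.0
    return shared, total, round(percent, 1)


def percent_from_counts(shared: int, total: int) -> float:
    """Recompute the percentage of a cell from its two printed counts."""
    return round(100.0 * shared / total, 1)


# Toy proteomes (hypothetical family labels).
a = {"f1", "f2", "f3", "f4"}
b = {"f2", "f3", "f4", "f5", "f6"}
print(matrix_cell(a, b))

# Counts printed in one cell of Fig. 4 (3,489 / 3,754, shown there as 92.9 %).
print(percent_from_counts(3489, 3754))
```

Recomputing a printed cell from its counts is a cheap consistency check when the percentages are transcribed by hand.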
BLAST Atlas<br />
A BLAST atlas was constructed using V. cholerae N16961<br />
(O1, El Tor) as the reference genome, shown in Fig. 5. The<br />
best BLAST hits identified in the query genomes are<br />
plotted in the lanes around the reference genome, with<br />
different colours for different species. In general,<br />
chromosome 1 is more strongly conserved than chromosome<br />
2. A large part of chromosome 2 of N16961<br />
displays very little conservation in the other genomes;<br />
this area represents a superintegron [40] that contains<br />
the V. cholerae-specific repeat (VCR) sequences, as well<br />
as a high number of gene cassettes. The repeat sequences<br />
are visible as black boxes in the repeat lane of the<br />
reference genome (second inner lane). Although all V.<br />
cholerae genomes contain a superintegron, its genes are<br />
very diverse between isolates [34], which explains the lack<br />
of BLAST hits in this region.<br />
Figure 5 BLAST atlas with V. cholerae strain N16961 as a reference<br />
strain, showing chromosomes 1 (top) and 2 (bottom). The best<br />
BLAST hits identified with genes from N16961 in the other V.<br />
cholerae genomes are represented in dark red, for the location as it<br />
appears in N16961. BLAST hits in the other genomes are shown in<br />
various colours as indicated to the right. Major areas conserved in V.<br />
cholerae but not in other Vibrionaceae are identified as gap B, gap C,<br />
gap D and gap F in green; areas that are found in toxigenic V. cholerae<br />
only are marked black as gap A, gap E and gap G. The superintegron<br />
on chromosome 2 of V. cholerae is also indicated<br />
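Regions without conservation, such as the superintegron, show up in an atlas lane as runs of reference genes that lack a hit in the query genome. The Python sketch below shows one simple way such runs could be located. It is an illustration only: the input is assumed to be a list of booleans in reference gene order, the threshold `min_genes` is arbitrary, and `find_gaps` is a hypothetical helper, not part of the atlas software.

```python
from typing import List, Optional, Tuple


def find_gaps(has_hit: List[bool], min_genes: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of runs without hits.

    has_hit[i] is True when reference gene i has a BLAST hit in the query
    genome. Runs shorter than min_genes are ignored.
    """
    gaps: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    for i, hit in enumerate(has_hit):
        if not hit and run_start is None:
            run_start = i                          # a run of missing genes begins
        elif hit and run_start is not None:
            if i - run_start >= min_genes:
                gaps.append((run_start, i - 1))
            run_start = None
    if run_start is not None and len(has_hit) - run_start >= min_genes:
        gaps.append((run_start, len(has_hit) - 1))  # run reaches the last gene
    return gaps


# Toy lane (hypothetical): True = hit found, False = no hit.
lane = [True, True, False, False, False, False,
        True, False, True, False, False, False]
print(find_gaps(lane))
```

The sketch treats the chromosome as linear; on a circular chromosome a run that spans the origin would need to be merged by hand.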
Several regions of the atlas have been highlighted. Gaps<br />
B, C, D and F on chromosome 1 (indicated in green)<br />
contain genes that are conserved in the represented<br />
genomes of V. cholerae but not in the other Vibrionaceae.<br />
The gaps marked A, E and G indicate regions that are<br />
specific to the toxigenic, clinical isolates. Annotated,<br />
V. cholerae-specific genes present in all these regions are<br />
listed in Table 2 (hypothetical genes are excluded). Genes<br />
specific for toxigenic V. cholerae identified in gap A<br />
include, amongst others, biosynthesis genes for the toxin<br />
co-regulated pilus (which is required for transmission of the<br />
prophage CTXΦ carrying the enterotoxin genes), as well as<br />
genes encoding citrate lyase. Note that the genes in gap A<br />
are also found in the environmental isolate V. cholerae<br />
2740-80.<br />
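The two kinds of highlighted region differ only in the set of genomes in which their genes are found, so the distinction can be expressed as a small rule on presence patterns. The Python sketch below is a hedged illustration of that rule; the genome labels are hypothetical, the strict all-or-none criteria are an assumption, and `classify_region` is not a function from the paper.

```python
from typing import Set


def classify_region(present_in: Set[str], toxigenic: Set[str],
                    other_cholerae: Set[str], other_species: Set[str]) -> str:
    """Label a gene or region by the genomes in which it has a BLAST hit."""
    if present_in & other_species:
        return "shared with other Vibrionaceae"
    in_all_toxigenic = toxigenic <= present_in
    if in_all_toxigenic and other_cholerae <= present_in:
        return "V. cholerae-specific"          # the green gaps in the atlas
    if in_all_toxigenic and not (present_in & other_cholerae):
        return "toxigenic-specific"            # the black gaps in the atlas
    return "variable"


# Hypothetical genome labels, for illustration only.
TOXIGENIC = {"tox1", "tox2"}
OTHER_CHOLERAE = {"env1", "env2"}
OTHER_SPECIES = {"sp1"}

labels = [
    classify_region({"tox1", "tox2", "env1", "env2"}, TOXIGENIC, OTHER_CHOLERAE, OTHER_SPECIES),
    classify_region({"tox1", "tox2"}, TOXIGENIC, OTHER_CHOLERAE, OTHER_SPECIES),
    classify_region({"tox1", "tox2", "env1"}, TOXIGENIC, OTHER_CHOLERAE, OTHER_SPECIES),
]
print(labels)
```

The third case mirrors the gap A genes that also occur in the environmental isolate 2740-80: a strict rule labels them "variable", so a practical analysis would probably tolerate a small number of exceptions.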
Gap B contains a number of outer membrane protein<br />
genes involved in sugar modification that are found in all V.<br />
cholerae genomes. Genes from gap C encoding a histidine<br />
kinase two-component signal transduction regulatory system<br />
are also conserved within the species, as are genes in gaps<br />
D and F, involved in chemotaxis and possible multidrug<br />
resistance.<br />
Gap E, containing genes conserved in toxigenic strains<br />
only, holds the prophage CTXΦ that contains the genes<br />
encoding cholera enterotoxin subunits A and B; this<br />
enterotoxin is responsible for the excessive, watery diarrhoea<br />
typical of cholera. Upon binding to target cell GM1<br />
gangliosides, the enterotoxin enters the cell and stimulates<br />
adenylate cyclase by ADP ribosylation. The resultant<br />
increased cyclic AMP levels induce excessive electrolyte<br />
movement and secretion of sodium and water [43]. Strain<br />
M66-2, which lacks the prophage CTXΦ and the enterotoxin<br />
genes, is believed to be a precursor of the seventh<br />
pandemic V. cholerae [11]. Gap E also bears the RTX toxin<br />
operon, which encodes a pore-forming cytotoxin [22]. An<br />
RTX toxin is also present in environmental isolate 2740-80<br />
and in V. vulnificus.<br />
Gap G on chromosome 2 consists of a set of five genes,<br />
all in the same orientation, in a putative operon, flanked by<br />
genes on the complementary strand. This appears to be a<br />
remnant of a mobile element, as these genes are flanked by<br />
a transposase gene on the 3′ end, and there is a small global<br />
repeat on the 5′ end. Only the first two of the five genes have<br />
an assigned function, the first being a GMP<br />
reductase and the second a putative DNA methyltransferase.<br />
The remaining three genes are hypothetical, but their<br />
strikingly strong conservation in all pathogenic strains and
Table 2 A selection of genes located in the gaps marked in Fig. 5
Gap A (850000–913000)
852903–851557 Citrate/sodium symporter
853165–854235 Citrate (pro-3S)-lyase ligase
854287–854583 Citrate lyase subunit gamma
854565–855455 Citrate lyase, beta subunit
855391–856995 Citrate lyase, alpha subunit
856992–857528 citX protein
857506–858447 citG protein
869812–866873 Helicase-related protein
870391–869813 Tellurite resistance protein-related
871298–870819 Transcriptional regulator, putative
873242–874225 Transposase, putative
876974–880015 ToxR-activated gene A protein
881390–884728 Inner membrane protein, putative
885773–886267 tagD protein
888405–886543 Toxin co-regulated pilus biosynthesis
888846–889511 Toxin co-regulated pilus biosynthesis
889496–889906 Toxin co-regulated pilus biosynthesis
890449–891123 Toxin co-regulated pilin
891203–892495 Toxin co-regulated pilus biosynthesis
892495–892947 Toxin co-regulated pilus biosynthesis
892950–894419 Toxin co-regulated pilus biosynthesis
894412–894867 Toxin co-regulated pilus biosynthesis
894855–895691 Toxin co-regulated pilus biosynthesis
895707–896165 Toxin co-regulated pilus biosynthesis
896155–897666 Toxin co-regulated pilus biosynthesis
897641–898663 Toxin co-regulated pilus biosynthesis
898673–899689 Toxin co-regulated pilus biosynthesis
899896–900726 TCP pilus virulence regulatory protein
900726–901487 Leader peptidase TcpJ
901494–903374 Accessory colonization factor AcfB
903380–904150 Accessory colonization factor AcfC
904648–905556 tagE protein
906206–905559 Accessory colonization factor AcfA
914124–912856 Phage family integrase
Gap B (975000–1010000)
978644–979144 Phosphotyrosine protein phosphatase
981833–982387 Serine acetyltransferase-related protein
982384–983532 Exopolysaccharide biosynthesis protein EpsF
983529–984938 Polysaccharide export protein, putative (gfcE)
986166–986597 Serine acetyltransferase-related protein
986597–987937 capK protein, putative
987913–989010 Polysaccharide biosynthesis protein, putative
1001910–1002437 Polysaccharide export-related protein (gfcE)
1002462–1004675 Putative exopolysaccharide biosynthesis protein
Gap C (1130000–1160000)
1139646–1142912 Chitinase, putative
1147856–1148998 Response regulator
1149033–1149398 Response regulator
1149990–1151309 Sensory box sensor histidine kinase
1151321–1152625 Sensor histidine kinase
1152625–1154235 Response regulator
1154252–1155595 Response regulator
1157228–1155624 Sensor histidine kinase
1158044–1157232 Periplasmic binding protein-related
Gap D (1478000–1520000)
2086826–2087584 CDP-diacylglycerol-glycerol-3-phosphate 3-phosphatidyltransferase
2087587–2088519 Phosphatidate cytidylyltransferase
2094741–2095604 PvcB protein
2098112–2097183 LysR family transcriptional regulator
2098432–2100258 pvcA protein
2117923–2119977 Methyl-accepting chemotaxis protein
2120575–2120030 Transcriptional regulator
2120663–2121826 Benzoate transport protein
Gap E (1537000–1587500)
1541452–1543170 Sensor histidine kinase/response regulator
1545396–1543231 Toxin secretion transporter, putative
1546802–1545399 RTX toxin transporter
1548919–1546757 RTX toxin transporter
1549662–1550123 RTX toxin activating protein
1550108–1563784 RTX toxin RtxA
1564376–1564152 RstC protein
1564844–1564470 RstB1 protein
1565901–1564822 RstA1 protein
1566027–1566365 Transcriptional repressor RstR
1567341–1566967 Cholera enterotoxin, B subunit
1568114–1567338 Cholera enterotoxin, A subunit
1569412–1568213 Zona occludens toxin
1569702–1569409 Accessory cholera enterotoxin
1571241–1570993 Colonization factor
1571760–1571377 RstB2 protein
1572817–1571738 RstA1 protein
1572943–1573281 Transcriptional repressor RstR
1577272–1575704 Phage replication protein Cri
1582123–1580555 Phage replication protein Cri
1583160–1583513 Transposase OrfAB, subunit A
1583510–1584382 Transposase OrfAB, subunit B
Gap F (1896000–1956000)
1896092–1897327 Phage family integrase
1900831–1898009 Helicase, putative
1903632–1902898 Chemotaxis protein MotB-related
1908858–1905790 Type I restriction enzyme HsdR
1916009–1913628 DNA methylase HsdM, putative
1933231–1935654 Neuraminidase
1936007–1935801 Transcriptional regulator
1936121–1936597 DNA repair protein RadC, putative
1938391–1937519 Transposase OrfAB, subunit B
1938732–1938388 Transposase OrfAB, subunit A
1941671–1941351 Transcriptional regulator, putative
1942032–1941658 Middle operon regulator-related
1944457–1943306 eha protein
Gap G (chromosome II, 21300–223000)
213207–214250 GMP reductase
214574–215725 DNA methyltransferase
220262–219825 IS1004 transposase
All gene annotations are taken from the reference genome V. cholerae strain N16961. Hypothetical proteins were excluded. Gaps A, E and G are conserved in pathogenic strains, whereas gaps B, C, D and F are conserved in all V. cholerae genomes analysed (Figure 1).
complete absence of homologues in the other Vibrio genomes
strongly point towards a potential biological significance.
Discussion<br />
The recent availability of many Vibrionaceae genomes, including a substantial number of V. cholerae genomes, makes it possible to take a closer look at the similarities and differences between species within the genus Vibrio, and to examine, on a genome scale, what distinguishes V. cholerae from the other Vibrio species. Since not all V. cholerae isolates are pathogenic, the presence of the prophage-borne cholera enterotoxin, the main virulence factor for cholera, is not a suitable marker for this species. We attempted to identify a set of V. cholerae-specific genes, and also explored the internal diversity within the V. cholerae genomes that have been sequenced to date.
On a phylogenetic tree based on the 16S ribosomal RNA gene, those isolates that do not belong to the genus Vibrio were positioned as outliers, as expected. This tree further indicated the most closely resembling 16S rRNA sequence for each of the two sequenced Vibrio strains that are currently not assigned to a species. The two sequenced V. parahaemolyticus strains were not placed together. The complete gene content of each genome was next compared by BLAST, and the results were pooled into gene families, which were subjected to cluster analysis. This provided evidence that the 18 V. cholerae genomes fall into two subclusters, one mainly containing clinical isolates and the other environmental isolates.
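The pooling of pairwise BLAST hits into gene families can be pictured with a small sketch. This is only an illustration under stated assumptions, not the procedure used for the analysis: the hit list is assumed to have been filtered beforehand by some similarity cutoff (none is specified in this section), families are formed by simple single linkage, and the names `build_families`, `shared_fraction` and the toy protein identifiers are hypothetical.

```python
def build_families(proteins, hits):
    """Group proteins into families by single linkage over pairwise hits.

    `hits` is an iterable of (protein_a, protein_b) pairs that are assumed
    to have passed a similarity cutoff chosen elsewhere (an assumption).
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the family representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())


def shared_fraction(families, genome_of, g1, g2):
    """Fraction of the families present in g1 or g2 that occur in both."""
    in_either = in_both = 0
    for family in families:
        genomes = {genome_of[p] for p in family}
        if g1 in genomes or g2 in genomes:
            in_either += 1
            if g1 in genomes and g2 in genomes:
                in_both += 1
    return in_both / in_either if in_either else 0.0


# Toy data (hypothetical): three proteins from genome A, two from genome B.
genome_of = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
hits = [("A1", "B1"), ("A2", "A3")]  # A1~B1 across genomes, A2~A3 paralogs
families = build_families(list(genome_of), hits)
print(len(families), shared_fraction(families, genome_of, "A", "B"))
```

A matrix of such pairwise fractions over all genomes is the kind of input on which a cluster analysis could then be run; single linkage is used here only because it is the simplest grouping rule to state.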
The gene family clustering, subsequent pan-genome analysis and the pairwise BLAST results, as summarised in the BLAST matrix, all supported the relatedness of Vibrio species Ex25 to V. parahaemolyticus 2210633 but not to V. parahaemolyticus 16. This latter genome was quite different from V. parahaemolyticus 2210633 in all analyses. Although it is possible that the species V. parahaemolyticus is far more genetically diverse than V. cholerae, A. fischeri or V. vulnificus, an alternative explanation is that one of the sequenced isolates is incorrectly named as V. parahaemolyticus. The similarity between Vibrio species MED222 and V. splendidus based on gene families is in agreement with their related 16S rRNA genes and published data [21]. However, in contrast to what the ribosomal gene suggests, our whole-genome comparison indicates that the three Aliivibrio genomes (A. salmonicida and two A. fischeri) are not so different from Vibrio after all. Their recent placement in the genus Aliivibrio, a decision based on five genes (the 16S rRNA gene and four housekeeping genes) and phenotypical characteristics [47], appears not to reflect the whole-genome picture presented here.
The BLAST results were graphically summarised in a BLAST atlas, which visualised V. cholerae-specific gene clusters. These coded for polysaccharide biosynthesis enzymes, response regulators and chemotaxis proteins, amongst others. In addition, a V. cholerae-specific histidine kinase two-component signal transduction regulatory system was identified. Two-component signal transduction pathways are powerful regulatory systems that allow bacteria to adapt to a particular ecological niche. There is a precedent for this claim, as the introduction of a single regulatory protein in Vibrio fischeri strain MJ11 has been shown to specifically enable colonization of the squid Euprymna scolopes [26].
As expected, the main differences observed between V. cholerae clinical isolates and the environmental strains are due to genes related to virulence. Two exceptions are the presence of a number of virulence genes in the environmental strain V. cholerae 2740-80 and the absence of enterotoxin genes in clinical isolate M66-2. It has already been suggested that M66-2 might be a predecessor of pandemic, enterotoxic V. cholerae [11]. From sequence comparison of four housekeeping genes, it was concluded that V. cholerae 2740-80 is intermediary between toxigenic and non-toxigenic isolates [30]. This view is confirmed by the data presented here, although we propose to consider the possibility that the isolate arose from a pandemic clone that has lost the CTXΦ prophage, rather than being a precursor of a pathogen.
In conclusion, several different methods of genome comparison have yielded a picture of V. cholerae genomes as forming a distinct cluster, compared to related species, and a relatively small number of genes might be responsible for environmental niche adaptation and hence for generation of this distinct species. Likely candidates include multiple two-component signal transduction regulatory proteins as well as chemotaxis proteins.
Acknowledgements We would like to thank Tim Binnewies for early work on this project, and the Danish Research Councils and the DTU Globalization funds for financial support.
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
1. Bassler B et al. (2007) CP000789.1: Direct submission to GenBank
2. Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) Genome update: proteome comparisons. Microbiol 151:1–4
3. Chen CY, Wu KM, Chang YC, Chang CH, Tsai HC, Liao TL, Liu YM, Chen HJ, Shen AB, Li JC, Su TL, Shao CP, Lee CT, Hor LI, Tsai SF (2003) Comparative genome analysis of Vibrio vulnificus, a marine pathogen. Genome Res 13:2577–2587
4. Clayton RA, Sutton G, Hinkle PS, Bult C, Fields C (1995) Intraspecific variation in small-subunit rRNA sequences in GenBank: why single sequences may not adequately represent prokaryotic taxa. Int J Syst Bacteriol 45:595–599
5. Colwell R, Grim CJ, Young S, Jaffe D, Gnerre S, Berlin A, Heiman D, Hepburn T, Shea T, Sykes S, Alvarado L, Kodira C, Heidelberg J, Lander E, Galagan J, Nusbaum C, Birren B (2008) NZ_AAKF00000000: Direct submission to GenBank
6. Doolittle WF (1995) Phylogenetic classification and the universal tree. Science 284:2124–2129
7. Doolittle WF, Papke RT (2006) Genomics and the bacterial species problem. Genome Biol 7:116
8. Doolittle WF, Zhaxybayeva O (2009) On the origin of prokaryotic species. Genome Res 19:744–756
9. Edwards R, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2008) NZ_ACCV00000000: Direct submission to GenBank
10. Farmer JJ, Janda JM (2005) Vibrionaceae. In: Bergey’s manual of systematic bacteriology, 2nd edn, vol 2 part B. Springer, New York, pp 491–546
11. Feng L, Reeves PR, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Cheng J, Wang W, Wang J, Qian W, Li D, Wang L (2008) A recalibrated molecular clock and independent origins for the cholera pandemic clones. PLoS ONE 3:e4053
12. Gevers D, Cohan FM, Lawrence JG, Sprat BG, Coeyne T, Feil EJ, Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J (2005) Re-evaluating prokaryotic species. Nat Rev Microbiol 3:733–739
13. Hagstrom A, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2007) NZ_ABGR00000000: Direct submission to GenBank
14. Hallin PF, Binnewies TT, Ussery DW (2008) The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4:363–371
15. Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, Gill SR, Nelson KE, Read TD, Tettelin H, Richardson D, Ermolaeva MD, Vamathevan J, Bass S, Qin H, Dragoi I, Sellers P, McDonald L, Utterback T, Fleishmann RD, Nierman WC, White O, Salzberg SL, Smith HO, Colwell RR, Mekalanos JJ, Venter JC, Fraser CM (2000) DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477–483
16. Heidelberg J, Sebastian Y. NZ_AAKJ00000000, NZ_AAUT00000000, NZ_AAKK00000000, NZ_AAUR00000000, NZ_AAWF00000000: Direct submission to GenBank
17. Hjerde E, Lorentzen MS, Holden MT, Seeger K, Paulsen S, Bason N, Churcher C, Harris D, Norbertczak H, Quail MA, Sanders S, Thurston S, Parkhill J, Willassen NP, Thomson NR (2008) The genome sequence of the fish pathogen Aliivibrio salmonicida strain LFI1238 shows extensive evidence of gene decay. BMC Genomics 9:616
18. Konstantinidis T, Ramette A, Tiedje JA (2006) The bacterial species definition in the genomic era. Phil Trans R Soc B 361:1929–1940
19. Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100–3108
20. Larsen TS, Krogh A (2003) EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4:29
21. Le Roux F, Zouine M, Chakroun N, Binesse J, Saulnier D, Bouchier C, Zidane N, Ma L, Rusniok C, Lajus A, Buchrieser C, Médigue C, Polz MF, Mazel D (2009) Genome sequence of Vibrio splendidus: an abundant planctonic marine species with a large genotypic diversity. Environ Microbiol 11:1959–1970
22. Lin W, Fullner KJ, Clayton R, Sexton JA, Rogers MB, Calia KE, Calderwood SB, Fraser C, Mekalanos JJ (1999) Identification of a Vibrio cholerae RTX toxin gene cluster that is tightly linked to the cholera toxin prophage. Proc Natl Acad Sci U S A 96:1071–1076
23. Loytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557–10562
24. Loytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635
25. Makino K, Oshima K, Kurokawa K, Yokoyama K, Uda T, Tagomori K, Iijima Y, Najima M, Nakano M, Yamashita A, Kubota Y, Kimura S, Yasunaga T, Honda T, Shinagawa H, Hattori M, Iida T (2003) Genome sequence of Vibrio parahaemolyticus: a pathogenic mechanism distinct from that of V. cholerae. Lancet 361:743–749
26. Mandel MJ, Wollenberg MS, Stabb EV, Visick KL, Ruby EG (2009) A single regulatory gene is sufficient to alter bacterial host range. Nature 458:215–218
27. Mazel D, Le Roux F (2008) FM954973.1: Direct submission to GenBank
28. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R (2005) The microbial pan-genome. Curr Opin Genet Dev 15:589–594
29. Medrano-Soto A, Moreno-Hagelsieb G, Vinuesa P, Christen JA, Collado-Vides J (2001) Successful lateral transfer requires codon usage compatibility between foreign genes and recipient genomes. Mol Biol Evol 21:1884–1894
30. Mohapatra SS, Ramachandran D, Mantri CK, Colwell RR, Singh DV (2009) Determination of relationships among non-toxigenic Vibrio cholerae O1 biotype El Tor strains from housekeeping gene sequences and ribotype patterns. Res Microbiol 160:57–62
31. Munk A, Tapia R, Green L, Rogers Y, Detter JC, Bruce D, Brettin TS, Colwell R, Grim C, Vonstein V, Bartels D. CP001485.1, NZ_ACHV00000000, NZ_ACHY00000000, NZ_ACHW00000000, NZ_ACHX00000000, NZ_ACHZ00000000, NZ_ACIA00000000, NZ_ACFQ00000000: Direct submission to GenBank
32. Murray RG, Stackebrandt E (1995) Taxonomic note: implementation of the provisional status Candidatus for incompletely described procaryotes. Int J Syst Bacteriol 45:186–187
33. Nierman WC (2006) NZ_AATY00000000: Direct submission to GenBank
34. Pang B, Yan M, Cui Z, Ye X, Diao B, Ren Y, Gao S, Zhang L, Kan B (2007) Genetic diversity of toxigenic and nontoxigenic Vibrio cholerae serogroups O1 and O139 revealed by array-based comparative genomic hybridization. J Bacteriol 189:4837–4879
35. Philippe H, Douady CJ (2003) Horizontal gene transfer and phylogenetics. Curr Opin Microbiol 6:498–505
36. Pinhassi J, Pedros-Alio C, Ferriera S, Johnson J, Kravitz S, Halpern A, Remington K, Beeson K, Tran B, Rogers Y-H, Friedman R, Venter JC (2006) NZ_AAND00000000: Direct submission to GenBank
37. Pupo GM, Lan R, Reeves PR (2000) Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci U S A 97:10567–10572
38. Rhee JH, Kim SY, Chung SS, Lee SE, Choy HE (2002) AE016795.2: Direct submission to GenBank
39. Riley MA, Lizotte-Waniewski M (2009) Population genomics and the bacterial species concept. Methods Mol Biol 532:367–377
40. Rowe-Magnus DA, Guérout AM, Mazel D (1999) Superintegrons. Res Microbiol 150:641–651
41. Rosenberg E, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2006) NZ_ABCH00000000: Direct submission to GenBank
42. Ruby EG, Urbanowski M, Campbell J, Dunn A, Faini M, Gunsalus R, Lostroh P, Lupp C, McCann J, Millikan D, Schaefer A, Stabb E, Stevens A, Visick K, Whistler C, Greenberg EP (2005) Complete genome sequence of Vibrio fischeri: a symbiotic bacterium with pathogenic congeners. Proc Natl Acad Sci U S A 102:3004–3009
43. Sánchez J, Holmgren J (2005) Virulence factors, pathogenesis and vaccine protection in cholera and ETEC diarrhoea. Curr Opin Immunol 17:388–398
44. Stackebrandt E, Frederiksen W, Garrity GM, Grimont PA, Kämpfer P, Maiden MC, Nesme X, Rosselló-Mora R, Swings J, Trüper HG, Vauterin L, Ward AC, Whitman WB (2002) Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043–1047
45. Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24:1596–1599
46. Thompson FL, Iida T, Swings J (2004) Biodiversity of vibrios. Microbiol Mol Biol Rev 68:403–431
47. Urbanczyk H, Ast JC, Higgins MJ, Carson J, Dunlap PV (2007) Reclassification of Vibrio fischeri, Vibrio logei, Vibrio salmonicida and Vibrio wodanis as Aliivibrio fischeri gen. nov., comb. nov., Aliivibrio logei comb. nov., Aliivibrio salmonicida comb. nov. and Aliivibrio wodanis comb. nov. Int J Syst Evol Microbiol 57:2823–2829
48. Vezzi A, Campanaro S, D'Angelo M, Simonato F, Vitulo N, Lauro FM, Cestaro A, Malacrida G, Simionati B, Cannata N, Romualdi C, Bartlett DH, Valle G (2005) Life at depth: Photobacterium profundum genome sequence and expression analysis. Science 30:1459–1461
49. Wang L, Feng L, Reeves P, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Wang W (2008) CP001233.1, CP001235.1: Direct submission to GenBank
50. Woese CR (1987) Bacterial evolution. Microbiol Rev 51:221–271
2.10 Paper V: Tools for comparison of bacterial genomes
74 Tools for Comparison of Bacterial Genomes
T. M. Wassenaar (1,2) · T. T. Binnewies (1,3) · P. F. Hallin (1) · D. W. Ussery (1,*)
1 Center for Biological Sequence Analysis, Technical University of Denmark, Kgs. Lyngby, Denmark
2 Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany
3 Roche Diagnostics Ltd., Advanced Systems Group, Global Platforms & Support, Rotkreuz, Switzerland
* dave@cbs.dtv.dk
1 Introduction
2 Genomic DNA Sequence Comparisons
3 Visualization of Genomic Data: The Genome Atlas
4 Whole Genome Alignment Methods
5 Comparing the Coding Fraction of Genomes
6 Codon Usage Comparisons
7 Protein Sequence Comparisons
8 Gene Synteny and Genome Islands
9 Minimal Information About a Genome Sequence
10 Research Needs
K. N. Timmis (ed.), Handbook of Hydrocarbon and Lipid Microbiology, DOI 10.1007/978-3-540-77587-4_337,
© Springer-Verlag Berlin Heidelberg, 2010
Abstract: Of the plethora of bioinformatics tools available, some that allow complete genome sequences to be compared are described here. Comparisons of genome length, base composition, gene density, numbers of tRNA and rRNA genes, and codon usage can provide useful biological insights. Examples are provided of a Genome Atlas plot, which summarizes many features of a single genome, and a BLAST Atlas, in which multiple genomes can be combined. A table of web services for useful tools is provided.
1 Introduction
Presently, about 900 bacterial and archaeal genomes have been fully sequenced and made publicly available (footnote 1), and their number more than doubled last year. Approximately 40% of the sequenced genomes were obtained from environmental (terrestrial and marine) organisms. In addition, metagenomic projects are now producing a vast amount of sequence data. Here we provide a brief overview of methods to compare sequenced bacterial genomes. Of the many methods available to compare bacterial genomes (Binnewies et al., 2006), Table 1 lists several that we find useful. It is beyond the scope of this review to provide a detailed analysis of these methods, and the list is far from complete. The tools discussed here provide information on fundamental biological features and can be used to compare a few or large numbers of genomes. The tools are easy to use and produce results that are easy to interpret and can be graphically represented. The latter is an important quality determinant of any sequence analysis tool when dealing with genomes, as the complexity of the input data is so large.
2 Genomic DNA Sequence Comparisons
A genome can be more than one DNA molecule. Approximately 10% of the bacterial genomes sequenced so far have more than one chromosome. By definition a genome includes all chromosomes (and plasmids) that constitute an organism’s total DNA. Chromosomes are essential, single-copy, independently replicating DNA molecules present in each member of the species. Some species contain plasmids; these are frequently strain-specific and sometimes (incorrectly, in our opinion) omitted from a genome sequence.
At the time of writing, the largest bacterial genome sequenced is that of Solibacter usitatus (strain Ellin 6076), a soil bacterium belonging to the Acidobacteria. It consists of a single chromosome of 9.97 megabase pairs (Mbp). The smallest bacterial genome known is that of Carsonella ruddii (PV), an endosymbiont of a plant sap-feeding insect, with a mere 159,662 bp. Genome size is a rough indicator of biological adaptive potential, so it is no surprise that soil bacteria have bigger genomes, as they have to adapt to environmental variation, whereas the protective niche of an endosymbiont allows for a small genome.
The genome size of an organism is easy to calculate <strong>and</strong> tabulate. > Figure 1a gives<br />
a graphical representation for genome size variation with<strong>in</strong> bacterial phyla. A ‘‘box <strong>and</strong><br />
whiskers’’ plot as shown <strong>in</strong> > Fig. 1 visualizes the distribution of a property that can be<br />
1 Completed genome statistics obta<strong>in</strong>ed from the NCBI Genome Project web pages: http://www.ncbi.nlm.nih.gov/<br />
genomes/lproks.cgi<br />
for Comparison of Bacterial Genomes
Table 1<br />
Methods for comparison of bacterial genomes<br />
Method | URL | References<br />
Length, %GC | http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi | Wheeler et al. (2007)<br />
Chromosome alignment (ACT) | http://www.sanger.ac.uk/Software/ACT/ ; http://www.webact.org/WebACT/home | Carver et al. (2005)<br />
Chromosome alignment (MUMMER) | http://mummer.sourceforge.net | Kurtz et al. (2004)<br />
Repeats – various | http://www.cbs.dtu.dk/services/GenomeAtlas | Ussery et al. (2004)<br />
Repeats – tetranucleotides | http://www.megx.net/tetra | Teeling et al. (2004)<br />
Repeats – short, tandem | http://minisatellites.u-psud.fr/GPMS/default.php | Denoeud and Vergnaud (2004)<br />
Repeats – VNTRs | http://vntr.csie.ntu.edu.tw | Chang et al. (2007)<br />
Replication Origins | http://www.cbs.dtu.dk/services/GenomeAtlas | Worning et al. (2006)<br />
Noncoding RNAs | http://rfam.sanger.ac.uk | Griffiths-Jones et al. (2005)<br />
rRNAs | http://www.cbs.dtu.dk/services/RNAmmer | Lagesen et al. (2007)<br />
Genome Atlas | http://www.cbs.dtu.dk/services/GenomeAtlas | Hallin and Ussery (2004)<br />
BLAST Atlas (zoomable) | http://www.cbs.dtu.dk/services/gwBrowser UPDATE! | Hallin and Ussery (2004)<br />
“Genome Properties” | http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi | Selengut et al. (2007)<br />
expressed as a numerical value, such as length, %GC, number of genes, etc. Such plots show<br />
the spread of the data and are made as follows: the values are sorted and divided into two equal<br />
parts, separated by the median, which is marked as a bar in the middle of the distribution. A<br />
box is drawn to cover the middle 50% of the data (excluding the lowest 25%<br />
and the highest 25% of the data). The “whiskers” are the hatched lines connecting the lowest (left)<br />
and highest (right) values, with the exception of outlier points, which are shown as individual<br />
dots. Outliers are defined as data points that are distant by more than 1.5 times the range of the box.<br />
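The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `box_whisker_summary` is hypothetical, the quartiles are taken with the "inclusive" method of the standard library, and outliers are measured from the edges of the box (the usual convention, which the text does not state explicitly).

```python
import statistics


def box_whisker_summary(values):
    """Return the numbers needed to draw one box-and-whiskers plot."""
    data = sorted(values)
    # Quartiles: the box spans Q1..Q3 (middle 50%), the bar marks the median.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_range = q3 - q1
    # Assumption: outliers lie more than 1.5 box ranges beyond the box edges.
    low_fence = q1 - 1.5 * box_range
    high_fence = q3 + 1.5 * box_range
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # lowest value that is not an outlier
        "whisker_high": inside[-1],  # highest value that is not an outlier
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Made-up genome sizes in Mbp, for illustration only.
    print(box_whisker_summary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0]))
```

Applied per phylum to a list of genome sizes or %GC values, the returned dictionary holds everything needed to draw one row of a plot like Fig. 1.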
The base composition of genomes, i.e., their %GC content (or %AT; the two together make<br />
100%), can also be compared, as shown in Fig. 1b. The GC content of a genome can range<br />
from 17% in C. ruddii to 75% in Anaeromyxobacter dehalogenans. The smallest genome is<br />
also the most AT rich, and many of the larger genomes are quite GC rich. It is not clear if there<br />
is a biological force in play behind this correlation, although it has been observed that the<br />
ecological niche an organism occupies roughly correlates with both genome size and GC content<br />
(Foerstner et al., 2005, Musto et al., 2006).<br />
In addition to the average GC content for a whole genome, local variation within a given<br />
genome can be examined, and this reveals two general trends for almost all bacterial genomes.<br />
First, on a more global, chromosomal level a large region flanking the origin of DNA
[Figure 1 image. Panel titles: (a) Size distribution of prokaryotic genomes (N = 779); (b) AT content distribution of prokaryotic genomes (N = 779). Axes: Genome size (Mbp), 0 to 12; AT content (percent), 30 to 80.]<br />
Phyla shown (number of chromosomes): Crenarchaeota (16), Euryarchaeota (35), Nanoarchaeota (1), Acidobacteria (2), Actinobacteria (55), Aquificae (3), Bacteroidetes/Chlorobi (26), Chlamydiae/Verrucomicrobia (13), Chloroflexi (7), Cyanobacteria (33), Deinococcus/Thermus (4), Firmicutes (155), Fusobacteria (1), Planctomycetes (1), Alphaproteobacteria (94), Betaproteobacteria (61), Gammaproteobacteria (191), Deltaproteobacteria (21), Epsilonproteobacteria (22), Spirochaetes (16), Thermotogae (8), Other archaea (1), Other bacteria (13).<br />
Figure 1<br />
(a) Box and whisker plot of genome length distribution for 779 bacterial chromosomes, grouped by phyla. The phylum and the number of chromosomes<br />
included are indicated at the left. Each phylum is colored according to our GenomeAtlas website. (b) The distribution of average chromosomal AT content<br />
for the same set of bacterial genomes.<br />
replication tends to be more GC rich, and the region around the replication terminus usually<br />
is more AT rich. AT-rich sequences melt more easily than GC-rich sequences, due in part to the<br />
extra hydrogen bond present in a GC base pair. Counterintuitively, this would make the origin<br />
of replication the least likely place for replication to start. However, within the “large region” around<br />
the origin, which covers approximately 5% of the chromosome, there is a short stretch of more AT-rich<br />
base pairs, where the replication origin bubble opens up. Second, zooming in to the level of genes, the<br />
average GC content of intergenic regions is generally lower than that of coding sequences.<br />
These regions melt more readily and are more curved and more rigid than the chromosomal<br />
average, which facilitates gene expression (Pedersen et al., 2000, Ussery and Hallin, 2004).<br />
This is true for nearly all of the bacterial genomes sequenced, regardless of GC content. In order<br />
to calculate relative or local %GC, a window has to be defined (say, 100 base pairs)<br />
for which the %GC is calculated. This window is then moved along the genome in single-nucleotide<br />
steps, and the %GC is assigned to the middle of each window. These scores can<br />
then be graphically represented. A web-based tool for this is available at the Genome Atlas<br />
website (see footnote 2), in which local %GC can be visualized by color codes as discussed below.<br />
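The sliding-window calculation just described can be sketched as follows. The function names `gc_percent` and `sliding_gc` are hypothetical, and the sketch treats the sequence as linear; on a circular chromosome the windows would also have to wrap around the end, which is left out here. It is not the code behind the Genome Atlas server.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA string."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=100, step=1):
    """Return (midpoint, %GC) for each window moved along the sequence.

    Assumption: linear sequence, 0-based coordinates, no wrap-around.
    """
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        # The score is assigned to the middle position of the window.
        scores.append((start + window // 2, gc_percent(chunk)))
    return scores


if __name__ == "__main__":
    # Tiny toy sequence and window so the output is easy to follow.
    for midpoint, score in sliding_gc("GGGGAAAA", window=4):
        print(midpoint, score)
```

With the default `window=100` and `step=1` this follows the procedure in the text; larger steps trade resolution for speed on megabase-sized genomes.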
3 Visualization of Genomic Data: The Genome Atlas<br />
Genome atlases are circular plots of chromosomes or plasmids (a linear version is available<br />
when applicable) on which general properties of the DNA molecule are plotted as colors.<br />
Genome atlases are available from our web server (see footnote 2) for many of the currently sequenced<br />
bacterial genomes. Figure 2 shows a Genome Atlas for the chromosome of Geobacillus<br />
kaustophilus strain HTA426 (a thermophilic Firmicute that also contains a plasmid of 4.8 kb).<br />
This isolate was obtained from a deep-sea sediment of the Mariana Trench in the Pacific Ocean<br />
(Takami et al., 2004a, b). Its genome is 3.5 Mbp long and contains 52.1% GC. G. kaustophilus<br />
has been suggested as a possible solution for paraffin deposition problems in oil<br />
production (Sood and Lal, 2008). A Genome Atlas maps four different aspects of the<br />
chromosomal DNA sequence in various lanes in a standard manner: DNA structural features<br />
are represented in the three outer lanes, all coding sequences are indicated in the next lane, two<br />
kinds of repeats are mapped in the next two lanes, and base composition properties are plotted<br />
in the two innermost lanes (Jensen et al., 1999). The scale in the center corresponds to the<br />
sequence numbering in GenBank. The DNA structural features of the three outermost circles<br />
are based on the physical-chemical properties of the DNA helix. The annotated genes are given<br />
in blue for protein-coding genes oriented clockwise, and red for genes on the other strand<br />
(counterclockwise). The tRNA and rRNA genes have their own colors. The clockwise strand<br />
corresponds to the sequence stored in GenBank (genes on the other strand are annotated as<br />
“complement” there). To identify global repeats (sequences that are repeated somewhere<br />
else on the chromosome) we search for the best match of a 100 bp window against the entire<br />
chromosome. Searching on the positive strand results in direct repeats (both sequences run in<br />
the same direction) whilst searching on the negative strand gives inverted repeats (the two<br />
repeat units run in opposite directions). For most of the general properties summarized in a<br />
Genome Atlas (structural properties, repeats, base composition) dedicated atlases are also<br />
available, where more features are given (such as local and simple repeats in a Repeat Atlas, or<br />
2 http://www.cbs.dtu.dk/services/GenomeAtlas/<br />
[Figure 2 image. Lanes as listed in the legend, with color scale ranges: Intrinsic curvature (0.17 to 0.22); Stacking energy (–9.03 to –7.55); Position preference (0.14 to 0.17); Annotations (CDS +, CDS –, rRNA, tRNA); Global direct repeats (5.00 to 7.50); Global inverted repeats (5.00 to 7.50); GC skew (–0.15 to 0.14); Percent AT (0.20 to 0.80). Center label: G. kaustophilus HTA426, main chromosome, 3,544,776 bp. Resolution: 1418. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]<br />
Figure 2<br />
Genome atlas of the main chromosome of Geobacillus kaustophilus. See text for further explanation.<br />
base composition in a Base Atlas). Such specialized atlases are explained in detail in a book that<br />
we recently produced (Ussery et al., 2008).<br />
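The distinction between direct and inverted global repeats can be illustrated with a deliberately simplified sketch. Instead of scoring the best match of every 100 bp window, as described above, it only reports exact copies of short k-mers, which is an assumption made to keep the example small. The names `reverse_complement` and `find_exact_repeats` are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact_repeats(seq, k=8):
    """Return (direct, inverted) lists of 0-based start-position pairs.

    Simplification: exact k-mer matches only, not best-match scoring.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    direct, inverted = [], []
    for kmer, starts in positions.items():
        # Direct repeat: the same k-mer occurs again on the same strand.
        for index, a in enumerate(starts):
            for b in starts[index + 1:]:
                direct.append((a, b))
        # Inverted repeat: the reverse complement occurs elsewhere.
        for a in starts:
            for b in positions.get(reverse_complement(kmer), []):
                if a < b:  # report each pair once
                    inverted.append((a, b))
    return sorted(direct), sorted(inverted)


if __name__ == "__main__":
    print(find_exact_repeats("GATTACACCGATTACA", k=7))
    print(find_exact_repeats("GATTACACCTGTAATC", k=7))
```

In the first toy sequence the 7-mer at position 0 recurs at position 9 in the same orientation (a direct repeat); in the second, position 9 carries its reverse complement (an inverted repeat).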
As can be seen in Fig. 2, the genes in this chromosome strongly favor one strand:<br />
the positive strand for the first (right) half and the negative strand for the second (left) half of<br />
the chromosome. These happen to be the leading strand during replication. Replication starts<br />
at the origin (the 12 o’clock position here) and proceeds on either side along the circle with<br />
both a leading and a lagging strand until the bubble reaches the terminus, at 6 o’clock, and the<br />
ends are combined. The positive strand represented by a genome sequence is the leading<br />
strand, but only for the first half, up to the terminus. Reading across the terminus along the<br />
sequence on the same strand one enters the lagging strand. Gene preference for the leading<br />
strand is a general feature of Firmicutes and of some other bacteria.<br />
In Fig. 2 the two outermost lanes identify some regions with strong structural properties<br />
(for instance the region around 2 o’clock, indicated by a black line). The observed strong<br />
curvature (blue in the outermost lane) where the DNA would easily melt (red in the second lane)<br />
suggests this region contains genes that are highly expressed.<br />
There are a number of global repeats, notably in the first quarter of the chromosome. Note<br />
that the ribosomal RNA genes (light blue in the annotation lane) are located here, as indicated<br />
by the arrows, and these are picked up as global repeats, as indeed they are repeated genes.<br />
The GC skew lane shows the bias of G’s towards one strand or the other, averaged over a<br />
10,000 bp window. In contrast to many Firmicutes with a strong GC skew, this genome only<br />
has a weak GC skew (the right half is light blue and the left half is light pink). The innermost<br />
circle colors the local AT content when it is more than three standard deviations distant from<br />
the global average. Note a light red color around the 2 o’clock region: this local deviation in AT<br />
content is related to the structural features located here.<br />
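A GC skew profile of the kind shown in the GC skew lane can be approximated with the short sketch below. It assumes the commonly used definition (G - C) / (G + C) per window; the text above does not spell out the exact formula used for the atlas, so this is an assumption, as are the function name `gc_skew` and the choice of non-overlapping windows by default.

```python
def gc_skew(seq, window=10000, step=None):
    """Return (midpoint, skew) per window, with skew = (G - C) / (G + C).

    Assumptions: linear sequence, 0-based coordinates, and
    non-overlapping windows unless a smaller step is given.
    """
    step = step or window
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without any G or C has no defined skew; report 0.0.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        profile.append((start + window // 2, skew))
    return profile


if __name__ == "__main__":
    # Toy sequence: G-rich first window, C-rich last window.
    print(gc_skew("GGGCAAAACCCG", window=4))
```

A positive value means an excess of G over C on the strand being read; in a chromosome with strong skew, the sign typically switches at the origin and terminus of replication.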
The Genome Atlas of the archaeon Methanosarcina acetivorans, shown in Fig. 3, tells a<br />
different story. This strictly anaerobic organism produces methane so efficiently that it is held<br />
responsible for virtually all biogenic methane. It can also oxidize CO to CO2 (Lessner et al.,<br />
2006). Strain C2A (the type strain of the species) was isolated from a marine sediment<br />
(Galagan et al., 2002). Its genome is 5.7 Mbp long and contains 42.7% GC. The Genome<br />
Atlas shows that its genes are evenly distributed over the two strands, and a GC skew is absent.<br />
Instead, the lower quarter of the genome contains many strong structural features. The genome<br />
only contains three rRNA gene copies (indicated by arrows), one of which is located on the<br />
negative strand (but as discussed above, this is actually the leading strand, as is preferred for<br />
nearly all bacterial rRNA genes). Many other global repeats are visible, notably in the region<br />
around 1.2 Mbp, which is strongly curved and easily melted, and is slightly more AT rich than<br />
the rest of the genome. Here, the important carbon-monoxide dehydrogenase gene locus is<br />
present, as are multiple transposases, which could be an indication of horizontally acquired<br />
DNA. The genome is relatively poorly annotated, with many genes given as “predicted<br />
protein” only, which is not uncommon for archaeal genomes.<br />
In conclusion, a Genome Atlas combines a number of features in one single figure that<br />
summarizes a very long and detailed story about a chromosome or plasmid.<br />
4 Whole Genome Alignment Methods<br />
Another way to compare genomes is based on alignment of nucleotide or amino acid<br />
sequences. Sequence alignment is a common tool to identify similarities, with BLAST, for
[Figure 3 image. Lanes as listed in the legend, with color scale ranges: Intrinsic curvature (0.18 to 0.24); Stacking energy (–8.10 to –7.21); Position preference (0.13 to 0.15); Annotations (CDS +, CDS –, rRNA, tRNA); Global direct repeats (5.00 to 7.50); Global inverted repeats (5.00 to 7.50); GC skew (–0.03 to 0.02); Percent AT (0.20 to 0.80). Center label: M. acetivorans C2A, 5,751,492 bp. Resolution: 2301. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]<br />
Figure 3<br />
Genome atlas of the main chromosome of the archaeon Methanosarcina acetivorans.<br />
Basic Local Alignment Search Tool, the most common (Altschul et al., 1990). However,<br />
BLAST is not automatically suitable for large DNA input segments such as complete<br />
genomes. A more suitable program to align sequences in the range of megabases is MUMmer,<br />
developed at TIGR, of which version 3 is now publicly available (Kurtz et al., 2004). Further,<br />
this method has been recently extended to include the average nucleotide identity in the<br />
conserved core genes of a set of genomes (Deloger et al., 2009). Moreover, graphical representation<br />
of the resulting alignment becomes an issue. Specific tools have been designed to align<br />
genome sequences and visualize the alignments. The Artemis Comparison Tool (ACT) is worth<br />
mentioning, of which two versions are available: a downloadable version to be used on a local<br />
computer (Carver et al., 2005) and a web-based version with pre-computed comparisons<br />
between several hundred bacterial genomes (see footnote 3). BLAST results of entire bacterial chromosomes<br />
against each other have also been used to construct phylogenetic trees (Henz et al., 2005). BLAST<br />
comparisons will be treated in Section 7 of this chapter.<br />
5 Comparing the Coding Fraction of Genomes<br />
The typical coding density for a bacterial genome is about 90%, ranging from 95%<br />
for Pelagibacter ubique (an alphaproteobacterial marine bacterium that is among the most numerous<br />
bacteria in the world) (Giovannoni et al., 2005) to around 75% for M. acetivorans.<br />
Intracellular bacteria can have a coding density as low as 50%. This means the majority<br />
of bacterial DNA codes for genes, which mostly are not spliced, so that introns are absent<br />
(with very few exceptions). However, not every open reading frame is a gene, and it<br />
appears that many bacterial genomes are over-annotated, predicting 10–15% more genes<br />
than are real (Skovgaard et al., 2001). These over-annotated genes are frequently short<br />
open reading frames. In addition, genes can be missed in the annotation. A frequent mistake<br />
is that genes are annotated on the wrong strand, which can happen if the reading frame is<br />
open in either direction. The intergenic regions separating genes regulate transcription,<br />
and in intracellular bacteria frequently contain pseudogenes or repeats. Genes not coding<br />
for proteins include tRNA and rRNA genes, and some parts of intergenic regions can<br />
be transcribed into stable RNAs that do not code for proteins. E. coli<br />
contains several hundred small non-coding RNA genes (ncRNA) (Chen et al., 2002) that<br />
can act as regulators (Gottesman, 2005). Their role in environmental bacteria is virtually<br />
unexplored.<br />
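Coding density as used above is simply the fraction of the chromosome covered by annotated genes. A minimal sketch, assuming genes are given as 1-based inclusive (start, end) coordinates as in a GenBank feature table, and ignoring genes that span the start of a circular sequence; the function name `coding_density` is hypothetical:

```python
def coding_density(gene_intervals, genome_length):
    """Fraction of positions covered by at least one annotated gene.

    Overlapping genes are counted once. Coordinates are assumed to be
    1-based and inclusive; genes wrapping around the origin of a
    circular chromosome are not handled in this sketch.
    """
    covered = 0
    last_covered = 0  # highest position already counted
    for start, end in sorted(gene_intervals):
        if end <= last_covered:
            continue  # gene lies entirely inside an earlier one
        covered += end - max(start, last_covered + 1) + 1
        last_covered = end
    return covered / genome_length


if __name__ == "__main__":
    # Made-up coordinates: two overlapping genes and one separate gene.
    genes = [(1, 300), (250, 600), (701, 900)]
    print(coding_density(genes, genome_length=1000))
```

Because over-annotated short open reading frames inflate the covered fraction, the value is only as reliable as the annotation it is computed from.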
Although tRNA and rRNA genes are essential to life, they are sometimes missed in the<br />
annotation of a genome, a rather embarrassing omission, or occasionally annotated on<br />
the wrong strand (Lagesen et al., 2007). The number and location of rRNA operons in a<br />
genome can say something about an organism. It appears that organisms with short doubling<br />
times have larger numbers of rRNA and tRNA genes. Comparing Figs. 2 and 3, it is<br />
likely that G. kaustophilus, with 9 rRNA copies, nearly all located close to the origin of<br />
replication (which boosts expression during replication as their copy number increases), can<br />
divide more quickly than M. acetivorans, which only has three copies. Some really fast-growing<br />
bacteria can have 14 or more rRNA copies, as can be seen in our list of genomes (see footnote 4).<br />
3 http://www.webact.org/WebACT/home<br />
4 www.cbs.dtu.dk/services/GenomeAtlas/<br />
6 Codon Usage Comparisons<br />
Once the genes of a given genome have been defined, their codon usage can be analyzed. Since<br />
the genetic code is redundant, with up to 6 codons per amino acid, synonymous codons are used at<br />
different frequencies. Much of the redundancy in the genetic code is due to third base<br />
variation. Figure 4 displays the codon usage for three prokaryotic genomes: Methanosphaera<br />
stadtmanae (27.6% GC), an archaeal methanogen that uses methanol and hydrogen to<br />
produce methane; Desulfitobacterium hafniense (47.4% GC), a Firmicute that efficiently<br />
dehalogenates tetrachloroethene and polychloroethanes; and Anaeromyxobacter dehalogenans<br />
(75% GC). The latter species, the first myxobacterium to be grown as a pure culture, can use ortho-substituted<br />
mono- and dichlorinated phenols. The frequency of each possible codon is plotted<br />
in a wheel plot in the upper part of the figure, arranged such that the third base is conserved<br />
in each quarter. The bias in codon usage towards the third position can also be seen in the<br />
sequence logo plots in the lower part of Fig. 4. From both graphics it is evident that genomic<br />
GC content strongly affects codon use (or the other way round). Based on a genome’s bias in<br />
codon usage, it is possible to predict its likely environmental niche (Willenbrock et al., 2006).<br />
Moreover, it is known that amino acid usage (not shown here) depends on environment, based<br />
on analysis of metagenomic samples (Musto et al., 2006, Foerstner et al., 2005).<br />
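The codon frequencies behind a wheel plot like the one described above can be tabulated with a few lines of code. The sketch below is an illustration only: the function names `codon_frequencies` and `third_position_gc` are hypothetical, input is assumed to be a list of in-frame coding sequences, and codons are written with U to match the figure labels.

```python
from collections import Counter


def codon_frequencies(coding_sequences):
    """Relative frequency of each codon over a set of in-frame genes."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper().replace("T", "U")  # RNA alphabet, as in Fig. 4
        usable = len(cds) - len(cds) % 3     # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def third_position_gc(frequencies):
    """Summed frequency of codons ending in G or C."""
    return sum(f for codon, f in frequencies.items() if codon[2] in "GC")


if __name__ == "__main__":
    # Made-up four-codon gene, for illustration only.
    freqs = codon_frequencies(["ATGGCCGCCTAA"])
    print(freqs)
    print(third_position_gc(freqs))
```

Comparing `third_position_gc` between an AT-rich and a GC-rich genome gives a single-number view of the third-position bias that the wheel and logo plots show graphically.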
7 Protein Sequence Comparisons<br />
One can compare each individual gene in a given genome by BLAST against a set of genomes.<br />
This produces a huge amount of data that can be graphically represented in a BLAST Matrix<br />
(Binnewies et al., 2005, Ussery et al., 2009). A BLAST Matrix is not symmetrical, as the<br />
outcome is determined by which genome is used as query sequence. The diagonal of a BLAST<br />
Matrix represents a BLAST of a genome against itself. The self-match (the gene finding itself) is<br />
discarded; thus the reported scores reflect internal homologues present in a given genome.<br />
Most of these have been derived from gene duplication and are thus paralogs.<br />
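Why such a matrix is asymmetric, and what the diagonal reports, can be made concrete with a small sketch. It assumes the BLAST searches have already been run and filtered, and that each retained hit is available as a tuple (query genome, query protein, subject genome, subject protein). The scoring shown, the fraction of query proteins with at least one hit, is an assumption made for illustration and not necessarily the measure used in the published BLAST Matrix; `blast_matrix` is a hypothetical name.

```python
def blast_matrix(proteomes, hits):
    """Return matrix[query][subject] = fraction of query proteins with a hit.

    proteomes: dict mapping genome name -> set of protein identifiers
    hits: iterable of (query_genome, query_protein,
                       subject_genome, subject_protein) tuples
    """
    hits = list(hits)
    matrix = {}
    for query_genome, proteins in proteomes.items():
        matrix[query_genome] = {}
        for subject_genome in proteomes:
            matched = set()
            for q_genome, q_protein, s_genome, s_protein in hits:
                if q_genome != query_genome or s_genome != subject_genome:
                    continue
                if q_genome == s_genome and q_protein == s_protein:
                    continue  # discard the self-match on the diagonal
                matched.add(q_protein)
            matrix[query_genome][subject_genome] = len(matched) / len(proteins)
    return matrix


if __name__ == "__main__":
    # Made-up genomes and hits, for illustration only.
    proteomes = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}}
    hits = [
        ("A", "a1", "B", "b1"),
        ("B", "b1", "A", "a1"),
        ("B", "b2", "A", "a1"),
        ("A", "a1", "A", "a1"),  # self-match, ignored
        ("A", "a1", "A", "a3"),  # internal homologue (paralog candidate)
        ("A", "a3", "A", "a1"),
        ("B", "b1", "B", "b1"),  # self-match, ignored
    ]
    print(blast_matrix(proteomes, hits))
```

In this toy example genome A finds a match in B for one of its four proteins, whereas both proteins of B find a match in A, so the two off-diagonal cells differ.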
When more information needs to be visualized, a BLAST Atlas is helpful. Such an atlas uses<br />
one genome as a reference against which the gene conservation of other genomes is plotted<br />
(Hallin and Ussery, 2004, Skovgaard et al., 2002). In this case gene location only refers to the<br />
location in the reference genome, which of course can be varied in multiple BLAST Atlases.<br />
A BLAST Atlas is also a suitable platform to visualize metagenomic data. So far, we have<br />
not dealt with metagenomics extensively, mainly because this approach very rarely results in<br />
completely assembled microbial genomes. But for a BLAST Atlas, that is not a problem,<br />
as one can combine all the metagenomic DNA in one lane, thereby ignoring from which<br />
organism the detected genes originated. All obtained BLAST hits are plotted around a<br />
reference genome. An example of a BLAST Atlas is given in Fig. 5, centered around<br />
Pelotomaculum thermopropionicum, a thermophilic, syntrophic Firmicute that can utilize<br />
1-butanol, 1-propanol, 1-pentanol or 1,3-propanediol as a carbon source. Note that despite<br />
the high number of lanes, conserved and variable genes can still be easily inspected visually.<br />
From compacting a single genome into a Genome Atlas, we have now moved several levels up<br />
and compact multiple genomes into a single atlas. In Fig. 5, the P. thermopropionicum<br />
genome is compared to many species of Clostridia, as well as other bacteria. Unfortunately,<br />
very few BLAST hits were found with the metagenomics samples, so there is very little color in<br />
those three lanes. Compared to well-characterized genomes (like E. coli), relatively few hits are
[Figure 4 image. Panel titles: Methanosphaera stadtmanae DSM 3091 (72% AT); Desulfitobacterium hafniense Y51 (53% AT); Anaeromyxobacter dehalogenans 2CP-C (25% AT). Each wheel plot is labeled with the 64 codons; axis ticks run from 0.1 to 0.6; the sequence logos show the 1st, 2nd and 3rd codon positions.]<br />
Figure 4<br />
Frequency wheel plots of codon usage (top) and sequence logo plots (bottom) of Anaeromyxobacter dehalogenans (left), Desulfitobacterium hafniense<br />
(middle) and Methanosphaera stadtmanae (right).<br />
[Figure 5 image. Center label: P. thermopropionicum SI, 3,025,375 bp. Legend: 2 Alkaliphilus species; Bacillus fragilis; 17 Clostridium species; 4 Desulfitobacterium species; E. coli K-12; 6 other species belonging to Clostridia.]<br />
Figure 5<br />
BLAST Atlas with Pelotomaculum thermopropionicum as the reference genome. Around this the<br />
BLAST hits of 31 genomes of other bacteria are added as listed to the right, from the outermost<br />
circle (top in the legend), to the innermost circle of the bacterial genomes (bottom of legend).<br />
The outermost lane shows the hits of P. thermopropionicum in the UniProt database (which<br />
does not contain all annotated genes as it requires biological evidence of a gene product).<br />
The next three lanes are metagenomic DNA samples from...[Dave specify] and next follow<br />
30 genomes of other bacteria as listed to the right.<br />
found in other genomes, indicated by the lack of strong color in most of the lanes in Fig. 5.<br />
This is probably a reflection of the huge diversity in DNA content in such samples, reducing<br />
the chance of a BLAST hit. It is a sobering thought that there is still so little we know, and so<br />
much that remains to be discovered in the microbial world.<br />
Many methods are being developed that utilize sets of conserved genes and gene<br />
families in related organisms to cluster organisms into groups; these groups can represent<br />
known taxonomic relationships. For example, certain genes might be common to a set of<br />
organisms growing in a particular ecological niche. Some examples of such regions along the<br />
chromosome can be seen in the BLAST Atlas plots where genomes of related organisms of<br />
different species are compared.<br />
8 Gene Synteny <strong>and</strong> Genome Isl<strong>and</strong>s<br />
A comparison of genes present, absent or diverged between genomes usually ignores gene synteny:<br />
the position at which such genes are found. The term was co<strong>in</strong>ed for eukaryotes to describe genes<br />
that were located on the same chromosome; <strong>in</strong> bacterial genomes the local neighbor<strong>in</strong>g genes,<br />
their order and direction are usually compared. The more closely related two organisms are, the more likely<br />
gene synteny is to be conserved (between genomes of the same genus, species, subspecies or<br />
phylogenetic clade, in increasing order). Gene synteny is destroyed by inversions (changing the<br />
direction of one or several genes), translocations (chang<strong>in</strong>g the position of genes) <strong>and</strong> <strong>in</strong>sertion<br />
<strong>and</strong> deletion events. All of these can result from mistakes dur<strong>in</strong>g replication, or be the result of<br />
self-replicat<strong>in</strong>g mobile elements, such as bacteriophages, <strong>in</strong>tegrons, transposons etc.<br />
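As a rough illustration of how conservation of local gene order might be quantified, the sketch below counts neighbouring gene pairs that keep their adjacency and relative orientation in two genomes. The `(name, strand)` representation, the toy gene names and the functions `adjacencies` and `conserved_adjacencies` are hypothetical and are not taken from any tool described here; the chromosome is treated as linear for simplicity.

```python
def adjacencies(genome):
    """Return the set of oriented neighbour pairs in an ordered gene list.

    `genome` is a list of (gene_name, strand) tuples with strand +1 or -1.
    Each pair is stored in a canonical form, so that the same pair read on
    the opposite strand (reversed order, flipped strands) compares equal.
    """
    pairs = set()
    for (g1, s1), (g2, s2) in zip(genome, genome[1:]):
        forward = ((g1, s1), (g2, s2))
        reverse = ((g2, -s2), (g1, -s1))
        pairs.add(min(forward, reverse))
    return pairs


def conserved_adjacencies(genome_a, genome_b):
    """Fraction of neighbour pairs of genome_a that also occur in genome_b."""
    pairs_a = adjacencies(genome_a)
    pairs_b = adjacencies(genome_b)
    if not pairs_a:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a)


# Toy example: genome_b carries an inversion of the segment B-C.
genome_a = [("A", 1), ("B", 1), ("C", 1), ("D", 1)]
genome_b = [("A", 1), ("C", -1), ("B", -1), ("D", 1)]
print(conserved_adjacencies(genome_a, genome_b))
```

In this toy case the inversion breaks the two adjacencies at its borders, while the B-C adjacency inside the inverted segment is still counted as conserved.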
The events that affect gene synteny, comb<strong>in</strong>ed with po<strong>in</strong>t mutations accumulat<strong>in</strong>g dur<strong>in</strong>g<br />
replication are the two major forces that <strong>in</strong>crease genetic diversity; selection of those organisms<br />
that are fittest to survive particular conditions decreases diversity. Evolution further<br />
depends on the change of such selective conditions. With a slow but steady re-shuffl<strong>in</strong>g of<br />
genes by evolutionary processes, a pattern emerges of a genetic ‘‘backbone’’ of genes whose<br />
location is relatively conserved between genomes of reasonable genetic distance, <strong>and</strong> groups of<br />
‘‘cluttered’’ genes that are far more variable, <strong>in</strong> what have been termed ‘‘genome isl<strong>and</strong>s.’’<br />
Genome isl<strong>and</strong>s usually conta<strong>in</strong> genes that are all <strong>in</strong>volved <strong>in</strong> a particular phenotypic process.<br />
Examples are pathogenicity isl<strong>and</strong>s, symbiosis isl<strong>and</strong>s, metabolic isl<strong>and</strong>s or magnetosome<br />
islands. Specific cases are the sulfur metabolism islands discovered in metagenomic sequences from<br />
mar<strong>in</strong>e sediments (Mussmann et al., 2005) or the magnetosome isl<strong>and</strong> conta<strong>in</strong><strong>in</strong>g all genes<br />
that produce the <strong>in</strong>tracellular organelle enabl<strong>in</strong>g magnetotactic bacteria to orient themselves<br />
along magnetic field l<strong>in</strong>es (Richter et al., 2007). The evolutionary advantage of genome isl<strong>and</strong>s<br />
is obvious. They can be regarded as genetic ‘‘build<strong>in</strong>g blocks’’; when transferred from one<br />
organism to the next, they can confer a complete phenotypic trait to the acceptor, enabl<strong>in</strong>g,<br />
for <strong>in</strong>stance, adaptation to a novel ecological niche.<br />
9 M<strong>in</strong>imal Information About a Genome Sequence<br />
Genome sequences are stored <strong>in</strong> public databases such as GenBank under their biological<br />
names (preceded by ‘‘c<strong>and</strong>idatus’’ for undecided taxonomic position), or by a code of<br />
numbers <strong>and</strong> letters for unculturable organisms that have not been classified. Unfortunately,<br />
other relevant <strong>in</strong>formation is often lack<strong>in</strong>g. It has become apparent that biological <strong>and</strong><br />
environmental data are important, <strong>and</strong> a recent st<strong>and</strong>ard for ‘‘M<strong>in</strong>imal Information about a<br />
Genome Sequence’’ has been proposed (Field et al., 2008). The Genomic St<strong>and</strong>ards Consortium<br />
(GSC, http://gensc.org) promotes the standardization of genome sequencing descriptions<br />
<strong>and</strong> their exchange <strong>and</strong> <strong>in</strong>tegration <strong>in</strong> the scientific community. Overall, it is important<br />
that genome sequence <strong>in</strong>formation is released <strong>in</strong>to the public doma<strong>in</strong> <strong>in</strong> a timely manner so<br />
that global scientific progress can be ma<strong>in</strong>ta<strong>in</strong>ed.<br />
10 Research Needs<br />
Tools for Comparison of Bacterial Genomes 74<br />
For very few environmental species multiple genome sequences are available. From genomic<br />
<strong>in</strong>tra-species comparisons of pathogenic bacteria we know that these provide an extra layer of
<strong>in</strong>formation, as genetic diversity with<strong>in</strong> a bacterial species can be enormous. When multiple<br />
genomes are available for a species we can def<strong>in</strong>e its core genome (all genes that are present <strong>in</strong><br />
all genomes of that species), its pan-genome (all genes that have been found <strong>in</strong> that species)<br />
<strong>and</strong> its dispensable genes that are responsible for the variation between isolates. Multiple<br />
genomes per species, together with more metagenomic data <strong>and</strong> more archaeal genome<br />
sequences, comprise our most urgent data gaps. The research <strong>tools</strong> for analysis of the<br />
genomes are available. Generate the sequences <strong>and</strong> the feast can beg<strong>in</strong>.<br />
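The definitions of core genome, pan-genome and dispensable genes map directly onto set operations. The following minimal sketch assumes that genes have already been grouped into families (for example by sequence similarity) and that each genome is given as a set of family identifiers; the isolate names, family labels and the function `core_and_pan_genome` are invented for illustration.

```python
def core_and_pan_genome(genomes):
    """Compute core genome, pan-genome and dispensable genes.

    `genomes` maps a genome name to the set of gene-family identifiers
    found in that genome.
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)   # families present in every genome
    pan = set.union(*gene_sets)           # families present in any genome
    dispensable = pan - core              # families missing from some genome
    return core, pan, dispensable


isolates = {
    "isolate_1": {"fam1", "fam2", "fam3"},
    "isolate_2": {"fam1", "fam2", "fam4"},
    "isolate_3": {"fam1", "fam2", "fam3", "fam5"},
}
core, pan, dispensable = core_and_pan_genome(isolates)
print(sorted(core))
print(sorted(pan))
print(sorted(dispensable))
```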
References<br />
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ<br />
(1990) Basic local alignment search tool. J Mol Biol<br />
215: 403–410.<br />
B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Staerfeldt HH, Ussery DW<br />
(2005) Genome update: proteome comparisons.<br />
Microbiology 151: 1–4.<br />
B<strong>in</strong>newies TT, et al. (2006) Ten years of bacterial genome<br />
sequenc<strong>in</strong>g: comparative-genomics-based discoveries.<br />
Funct Integr Genomics 6: 165–185.<br />
Carver TJ, Rutherford KM, Berriman M, Raj<strong>and</strong>ream<br />
MA, Barrell BG, Parkhill J (2005) ACT: the Artemis<br />
Comparison Tool. Bio<strong>in</strong>formatics 21: 3422–3423.<br />
Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY<br />
(2007) VNTRDB: a bacterial variable number t<strong>and</strong>em<br />
repeat locus database. Nucleic Acids Res 35:<br />
D416–D421.<br />
Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH,<br />
Ecker DJ, Blyn LB (2002) A bio<strong>in</strong>formatics based<br />
approach to discover small RNA genes <strong>in</strong> the Escherichia<br />
coli genome. Biosystems 65: 157–177.<br />
Deloger M, El Karoui M, Petit MA (2009) A genomic<br />
distance based on MUM <strong>in</strong>dicates discont<strong>in</strong>uity between<br />
most bacterial species <strong>and</strong> genera. J Bacteriol<br />
191: 91–99.<br />
Denoeud F, Vergnaud G (2004) Identification of polymorphic<br />
t<strong>and</strong>em repeats by direct comparison of<br />
genome sequence from different bacterial stra<strong>in</strong>s: a<br />
web-based resource. BMC Bio<strong>in</strong>formatics 5: 4.<br />
Field D, et al. (2008) The m<strong>in</strong>imum <strong>in</strong>formation about a<br />
genome sequence (MIGS) specification. Nature Biotechnol<br />
26:541–547.<br />
Foerstner KU, von Mer<strong>in</strong>g C, Hooper SD, Bork P (2005)<br />
Environments shape the nucleotide composition of<br />
genomes. EMBO Rep 6: 1208–1213.<br />
Galagan JE, et al. (2002) The genome of M. acetivorans<br />
reveals extensive metabolic <strong>and</strong> physiological diversity.<br />
Genome Res 12: 532–542.<br />
Giovannoni SJ, et al. (2005) Genome streaml<strong>in</strong><strong>in</strong>g <strong>in</strong> a<br />
cosmopolitan oceanic bacterium. Science 309:<br />
1242–1245.<br />
Gottesman S (2005) Micros for microbes: non-cod<strong>in</strong>g<br />
regulatory RNAs <strong>in</strong> bacteria. Trends Genet 21:<br />
399–404.<br />
Griffiths-Jones S, Moxon S, Marshall M, Khanna A,<br />
Eddy SR, Bateman A (2005) Rfam: annotat<strong>in</strong>g<br />
non-cod<strong>in</strong>g RNAs <strong>in</strong> complete genomes. Nucleic<br />
Acids Res 33: D121–D124.<br />
Hall<strong>in</strong> PF, B<strong>in</strong>newies TT, Ussery DW (2008) The genome<br />
BLAST atlas - a GeneWiz extension for visualization<br />
of whole-genome homology. Mol Biosyst 4: 363–371.<br />
Hall<strong>in</strong> PF, Ussery DW (2004) <strong>CBS</strong> Genome Atlas<br />
Database: a dynamic storage for bio<strong>in</strong>formatic results<br />
<strong>and</strong> sequence data. Bio<strong>in</strong>formatics 20: 3682–3686.<br />
Henz SR, Huson DH, Auch AF, Nieselt-Struwe K,<br />
Schuster SC (2005) Whole-genome prokaryotic<br />
phylogeny. Bio<strong>in</strong>formatics 21: 2329–2335.<br />
Jensen LJ, Friis C, Ussery DW (1999) Three views of<br />
microbial genomes. Res Microbiol 150: 773–777.<br />
Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M,<br />
Antonescu C, Salzberg SL (2004) Versatile <strong>and</strong> open<br />
software for compar<strong>in</strong>g large genomes. Genome Biol<br />
5: R12.<br />
Lagesen K, Hall<strong>in</strong> P, Rodl<strong>and</strong> EA, Staerfeldt HH,<br />
Rognes T, Ussery DW (2007) RNAmmer: consistent<br />
<strong>and</strong> rapid annotation of ribosomal RNA genes.<br />
Nucleic Acids Res 35: 3100–3108.<br />
Lessner DJ, et al. (2006) An unconventional pathway for<br />
reduction of CO 2 to methane <strong>in</strong> CO-grown Methanosarc<strong>in</strong>a<br />
acetivorans revealed by proteomics. Proc<br />
Natl Acad Sci USA 103: 17921–17926.<br />
Mussmann M, Richter M, Lombardot T, Meyerdierks A,<br />
Kuever J, Kube M, Glöckner FO, Amann R (2005)<br />
Clustered genes related to sulfate respiration <strong>in</strong> uncultured<br />
prokaryotes support the theory of their<br />
concomitant horizontal transfer. J Bacteriol 187:<br />
7126–7137.<br />
Musto H, Naya H, Zavala A, Romero H, Alvarez-Val<strong>in</strong> F,<br />
Bernardi G (2006) Genomic GC level, optimal<br />
growth temperature, <strong>and</strong> genome size <strong>in</strong> prokaryotes.<br />
Biochem Biophys Res Commun 347: 1–3.<br />
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH,<br />
Ussery DW (2000) A DNA structural atlas for<br />
Escherichia coli. J Mol Biol 299: 907–930.<br />
Richter M, Kube M, Bazyl<strong>in</strong>ski DA, Lombardot T,<br />
Glöckner FO, Re<strong>in</strong>hardt R, Schüler D (2007) <strong>Comparative</strong><br />
genome analysis of four magnetotactic<br />
bacteria reveals a complex set of group-specific<br />
genes implicated <strong>in</strong> magnetosome biom<strong>in</strong>eralization<br />
<strong>and</strong> function. J Bacteriol 189: 4899–4910.<br />
Selengut JD, et al. (2007) TIGRFAMs <strong>and</strong> Genome Properties:<br />
<strong>tools</strong> for the assignment of molecular function<br />
<strong>and</strong> biological process <strong>in</strong> prokaryotic genomes.<br />
Nucleic Acids Res 35: D260–D264.<br />
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A<br />
(2001) On the total number of genes <strong>and</strong> their<br />
length distribution <strong>in</strong> complete microbial genomes.<br />
Trends Genet 17: 425–428.<br />
Skovgaard M, Jensen LJ, Friis C, Stærfeldt HH, Worn<strong>in</strong>g P,<br />
Brunak S, Ussery D (2002) The atlas visualisation of<br />
genome-wide <strong>in</strong>formation. Meth Microbiol. 33:<br />
49–63.<br />
Sood N, Lal B (2008) Isolation and characterization of a<br />
potential paraff<strong>in</strong>-wax degrad<strong>in</strong>g thermophilic bacterial<br />
stra<strong>in</strong> Geobacillus kaustophilus TERI NSM for<br />
application <strong>in</strong> oil wells with paraff<strong>in</strong> deposition<br />
problems. Chemosphere 70: 1445–1451.<br />
Takami H, et al. (2004a) Genomic characterization of<br />
thermophilic Geobacillus species isolated from the<br />
deepest sea mud of the Mariana Trench. Extremophiles<br />
8: 351–356.<br />
Takami H, et al. (2004b) Thermoadaptation trait<br />
revealed by the genome sequence of thermophilic<br />
Geobacillus kaustophilus. Nucl Acids Res 32:<br />
6292–6303.<br />
Teel<strong>in</strong>g H, Waldmann J, Lombardot T, Bauer M,<br />
Glöckner FO (2004) TETRA: a web-service and a<br />
st<strong>and</strong>-alone program for the analysis <strong>and</strong> comparison<br />
of tetranucleotide usage patterns <strong>in</strong> DNA<br />
sequences. BMC Bio<strong>in</strong>formatics 5: 163.<br />
Ussery DW, Hall<strong>in</strong> PF (2004) Genome update: AT content<br />
<strong>in</strong> sequenced prokaryotic genomes. Microbiology<br />
150: 749–752.<br />
Ussery DW, Bor<strong>in</strong>i S, Wassenaar TM (2009) Comput<strong>in</strong>g<br />
for <strong>Comparative</strong> Microbial Genomics: Bio<strong>in</strong>formatics<br />
for Microbiologists (<strong>Computational</strong> series)<br />
London, Verlag: Spr<strong>in</strong>ger.<br />
Wheeler DL, et al. (2007) Database resources of the<br />
National Center for Biotechnology Information.<br />
Nucleic Acids Res 35: D5–D12.<br />
Willenbrock H, Friis C, Friis AS, Ussery DW (2006) An<br />
environmental signature for 323 microbial genomes<br />
based on codon adaptation <strong>in</strong>dices. Genome Biol 7:<br />
R114.<br />
Worn<strong>in</strong>g P, Jensen LJ, Hall<strong>in</strong> PF, Staerfeldt HH,<br />
Ussery DW (2006) Orig<strong>in</strong> of replication <strong>in</strong> circular<br />
prokaryotic chromosomes. Environ Microbiol 8:<br />
353–361.
Chapter 3<br />
rRNA operons <strong>and</strong> promoter analysis<br />
3.1 Introduction<br />
This chapter covers two papers (VI <strong>and</strong> VII), deal<strong>in</strong>g with rRNA localization with<strong>in</strong> the<br />
genome, <strong>and</strong> analysis of the promoter region upstream of rRNA operons. The RNAmmer<br />
tool (Lagesen et al., 2007) presented in paper VI was motivated by the lack of a software<br />
tool able to annotate ribosomal RNA (rRNA) genes accurately and consistently<br />
in prokaryotes. BLAST strategies are widely used for this as the rRNA genes are highly<br />
conserved. However, homology search methods often produce less accurate gene boundaries<br />
as they fail to account for the observed variation in some regions. Hidden Markov<br />
Model (HMM) strategies, such as RNAmmer, can take <strong>in</strong>to account conserved stem loop<br />
structures, greatly improv<strong>in</strong>g the accuracy of prediction of the full length rRNA genes.<br />
Particular attention is given to promoter predictions for the E. coli rRNA operons,<br />
since much experimental information is available about this system. An application<br />
of the gwBrowser as a tool for visualization of promoter regions upstream of the rRNA<br />
operons in E. coli concludes the chapter. The gwBrowser effort is currently being published<br />
in the journal Standards in Genomic Sciences. The P1 and P2 prediction tools are<br />
still under development and have not been published.<br />
Encod<strong>in</strong>g the central structure of the ribosome, the 5S, 16S, <strong>and</strong> 23S rRNA genes are<br />
essential for prote<strong>in</strong> synthesis <strong>and</strong> are transcribed at high levels. In E. coli the rrn operons<br />
are regulated by a tandem promoter system. With abundant transcription, the system is<br />
well suited for studying the mechanisms of highly expressed genes and for establishing a connection<br />
to the physical properties of the DNA. In this work, the SIDD (stress-induced duplex destabilization) energy (Wang et al., 2004;<br />
Wang & Benham, 2008) was used to measure the energy required to melt the DNA<br />
helix near the promoter region. The work was carried out during my visit to Professor<br />
Craig Benham's lab at UC Davis in fall 2007.<br />
3.2 P1 <strong>and</strong> P2 promoters <strong>in</strong> E. coli<br />
The seven rRNA operons of E. coli are regulated by the two promoters P1 and P2:<br />
P1 is active predominantly during exponential growth, whereas P2 is active during<br />
stationary phase (Hirvonen et al., 2001; Murray & Gourse, 2004). Apart from the –10 and<br />
–35 hexamers, the P1 site contains between 3 and 5 FIS (Factor for Inversion Stimulation)<br />
binding sites and an UP element. FIS has been reported to increase transcription in<br />
vivo 4- to 10-fold in this system (Bokal et al., 1995).<br />
[Figure 3.1 labels: -35, -10, σ, α, α, ββ′ subunit, +1, CDS]<br />
Figure 3.1: The transcription of bacterial genes.<br />
The first step in transcription occurs when the sigma factor binds to the -10 and<br />
-35 regions, followed by wrapping of the DNA template around the large RNA polymerase<br />
holoenzyme complex, which bends the DNA molecule (figure 3.1). Roughly 150 bp of<br />
DNA is wrapped around the polymerase, forming a constrained supercoil. The wrapping<br />
interaction with the two α-subunits is particularly important for the correct orientation<br />
of the DNA with respect to the promoter sites and for transcription initiation.<br />
Binding of the FIS protein can strongly bend the DNA and, if properly spaced, greatly<br />
facilitate the wrapping of the DNA around the alpha subunits. FIS binds the DNA<br />
via a helix-turn-helix structure and recognizes a 15-nucleotide symmetric motif<br />
(Hengen et al., 1997). The stress induced when FIS binds to the DNA helix<br />
causes a bend which destabilizes the helix, lowering the energy required for melting further<br />
downstream (Wang & Benham, 2008; Bokal et al., 1995). Because FIS is highly expressed<br />
during exponential phase, it ensures an increased activity of P1 compared with P2. In an<br />
E. coli strain lacking the FIS protein, the P2 promoter is more active during exponential<br />
growth. The same study suggests that FIS has a repressing effect on P2 (Liebig & Wagner,<br />
1995). Both P1 and P2 contain an UP element that binds the RNA polymerase α C-terminal<br />
domain (αCTD). This work aims at applying an information content method to<br />
the P1 and P2 system, accounting for helical spacing between these regulatory elements as<br />
well as the conservation of the motifs. The tandem promoter system is depicted in figure<br />
3.2.<br />
3.3 Conservation of regulatory elements<br />
Information content is widely used <strong>in</strong> bio<strong>in</strong>formatics to f<strong>in</strong>d <strong>and</strong> rank <strong>in</strong>dependent motifs<br />
as an alternative to mach<strong>in</strong>e learn<strong>in</strong>g approaches. Shultzaberger <strong>and</strong> co-workers have exp<strong>and</strong>ed<br />
earlier applications of <strong>in</strong>formation content by describ<strong>in</strong>g the helical fac<strong>in</strong>g between<br />
regulatory elements on the DNA str<strong>and</strong> (Shultzaberger et al., 2007). This framework allows<br />
for an additive comb<strong>in</strong>ation of both aligned weight matrices <strong>and</strong> their spac<strong>in</strong>g to<br />
produce a final score of the entire structure. In the σ70 promoter, consisting<br />
of the –10 and –35 hexamers, the spacing corresponds to the two boxes being located on opposite<br />
sides of the DNA helix (see figure 3.3).<br />
Chang<strong>in</strong>g the spac<strong>in</strong>g will likely cause a disruption of the b<strong>in</strong>d<strong>in</strong>g by RNA polymerase.<br />
This is accounted for by apply<strong>in</strong>g a cos<strong>in</strong>e function to the distance score (see equation 3.2).<br />
Shultzaberger’s equations were used to model the P1 <strong>and</strong> P2 system.<br />
[Figure 3.2 schematic labels: flanking genes tuB, murI and murB; operon genes 16S, tRNA Glu, 23S, 5S; P1 elements Fis III, Fis II, Fis I, UP, -35, -10; P2 elements -35, -10. Spacer windows given in the figure (min / center / max): -4 / 2 / 4 nt; 0 / 3 / 6 nt; 13 / 16 / 19 nt.]<br />
Figure 3.2: The promoter structure of the rrnB operon in E. coli.<br />
Figure 3.3: The –10 and –35 hexamers of the E. coli σ70 promoter correspond to the motifs being<br />
located on opposite sides of the DNA helix. Deletions or insertions in the spacer cause a shift of<br />
approx. 36° per nucleotide.<br />
To score a given query sequence of length L against a weight matrix, a b × p matrix<br />
is first generated by aligning the query sequence and the matrix. This provides all Rb,p<br />
values.<br />
\[ R_{b,p} = \log_2(4) + \log_2\frac{n_{b,p}}{N}, \qquad R_{tot} = \sum_{p=1}^{L} R_{B,p} \tag{3.1} \]<br />
where b ∈ {A, T, G, C} iterates through the four bases, p denotes the position in the<br />
alignment, L is the length of the alignment (or width of the matrix), nb,p is the<br />
number of bases b at position p, and B denotes the nucleotide at position p in the query<br />
sequence. Shultzaberger and co-workers account for the helical facing by introducing the<br />
accessibility, n(d) (equation 3.2), and the gap surprisal, GS(d) (equation 3.3).<br />
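Before turning to the spacing terms, the weight-matrix score of equation 3.1 can be illustrated with a short Python sketch. It is not the thesis code: the function names and the four toy binding sites are invented, and N is assumed here to be the number of aligned sites, since the text does not define N for equation 3.1. Bases never observed at a position receive a score of minus infinity; a pseudocount would avoid this but is not described in the text.

```python
import math

BASES = "ACGT"


def information_matrix(sites):
    """Build R[b][p] = log2(4) + log2(n_bp / N) from aligned binding sites.

    `sites` is a list of equal-length DNA strings. N is assumed to be the
    number of aligned sites. Unobserved bases get -inf.
    """
    n_sites = len(sites)
    width = len(sites[0])
    matrix = {b: [] for b in BASES}
    for p in range(width):
        column = [site[p] for site in sites]
        for b in BASES:
            count = column.count(b)
            if count == 0:
                matrix[b].append(float("-inf"))
            else:
                matrix[b].append(math.log2(4) + math.log2(count / n_sites))
    return matrix


def score(matrix, query):
    """R_tot: sum over positions p of R[B][p], B being the query base at p."""
    return sum(matrix[base][p] for p, base in enumerate(query))


# Hypothetical aligned -10 hexamers, for illustration only.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
matrix = information_matrix(sites)
print(score(matrix, "TATAAT"))
print(score(matrix, "TATGAT"))
```

A fully conserved position contributes the maximum of 2 bits, and every deviation from the most frequent base lowers the total.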
\[ n(d) = 1 + \cos\left[\frac{2\pi}{w}(d - c)\right] \tag{3.2} \]<br />
where c is the center distance between two binding sites (i.e. the optimal spacing), d is<br />
the query distance, and w = 10.6 is the length, in nucleotides, of one helical turn of B-form DNA. Finally,<br />
this gives GS(d) as follows:<br />
\[ GS(d) = \log_2\frac{n(d)}{N} \tag{3.3} \]<br />
where N is the sum of all n(d) (see equation 3.4). The sign of GS(d) was changed<br />
from the original equation described by Shultzaberger and co-workers to allow for combining<br />
all scores by addition.<br />
\[ N = \sum_{d=\min}^{\max} n(d) \tag{3.4} \]<br />
where min and max are the boundaries of the window examined. Finally, summing<br />
all Ri and GS(d) values gives the total information of all motifs and all spacers (see<br />
equation 3.5):<br />
\[ R_i(tot) = R_i(m_1) + GS(d, m_1) + R_i(m_2) + \dots + GS(d, m_{n-1}) + R_i(m_n) \tag{3.5} \]<br />
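Equations 3.2 to 3.5 can be tried out numerically with the sketch below. It uses the sign convention of this chapter (GS(d) is added, not subtracted) and the 13 to 19 nt window centered on 16 nt that is used between the hexamers. The function names and the two motif scores in the final call are hypothetical values chosen for illustration.

```python
import math

HELIX_TURN = 10.6  # w: nucleotides per helical turn, as given in the text


def accessibility(d, center, w=HELIX_TURN):
    """Equation 3.2: n(d) = 1 + cos(2*pi/w * (d - c))."""
    return 1.0 + math.cos(2.0 * math.pi / w * (d - center))


def gap_surprisal(d, center, d_min, d_max, w=HELIX_TURN):
    """Equations 3.3 and 3.4: GS(d) = log2(n(d) / N), N summed over the window."""
    total = sum(accessibility(x, center, w) for x in range(d_min, d_max + 1))
    return math.log2(accessibility(d, center, w) / total)


def total_information(motif_scores, gap_scores):
    """Equation 3.5: add the motif scores and the gap scores between neighbours."""
    if len(gap_scores) != len(motif_scores) - 1:
        raise ValueError("expected one gap score between each pair of motifs")
    return sum(motif_scores) + sum(gap_scores)


# Gap surprisal across the 13..19 nt window, centered on 16 nt.
for d in range(13, 20):
    print(d, round(gap_surprisal(d, 16, 13, 19), 3))

# Hypothetical motif scores (in bits) for a -10 and a -35 hexamer hit.
print(total_information([6.1, 5.4], [gap_surprisal(17, 16, 13, 19)]))
```

Because n(d)/N sums to one over the window, GS(d) is never positive: the optimal spacing costs the least, and the penalty grows as the two elements rotate away from their preferred relative helical facing.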
3.3.1 Model<strong>in</strong>g the P1 <strong>and</strong> P2 <strong>in</strong> selected enterics<br />
Exist<strong>in</strong>g experimentally verified –10 <strong>and</strong> –35 hexamers (Huerta & Collado-Vides, 2003)<br />
were converted <strong>in</strong>to Rb,p matrices together with data for known UP elements (Estrem<br />
et al., 1998) <strong>and</strong> FIS b<strong>in</strong>d<strong>in</strong>g sites (Hengen et al., 1997). Figure 3.4 shows logo plots of<br />
the information content derived from these studies. The initial weight matrices formed the basis<br />
for iteratively building the final information model of the P1 and P2 promoter structure,<br />
us<strong>in</strong>g the follow<strong>in</strong>g procedure:<br />
1. Collect E. coli and Shigella genomes<br />
2. Find the rRNA genes and extract the upstream sequences<br />
3. Apply models based on literature weight matrices<br />
4. Refine weight matrices according to observations<br />
5. Formulate final model<br />
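Step 2 of the procedure, extracting the sequence upstream of each annotated rRNA gene, could look roughly like the sketch below. It assumes that gene coordinates and strand have already been parsed from the gene finder output (the output format is not reproduced here), uses 0-based, end-exclusive coordinates on a linear sequence, and takes 500 nt as a default because the profiles in this chapter are plotted from -500 to 0 relative to the 16S gene start. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def upstream_region(genome, gene_start, gene_end, strand, length=500):
    """Return up to `length` bases upstream of a gene, read 5'->3' on its strand.

    Coordinates are 0-based and end-exclusive (Python slice convention).
    The sequence is treated as linear, so a region is truncated at the
    sequence ends instead of wrapping around the origin.
    """
    if strand == "+":
        return genome[max(0, gene_start - length):gene_start].upper()
    if strand == "-":
        return reverse_complement(genome[gene_end:gene_end + length])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "AACCGTTTGA"
print(upstream_region(toy_genome, 4, 7, "+", length=3))
print(upstream_region(toy_genome, 2, 4, "-", length=3))
```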
[Figure 3.4: four sequence logo panels; y-axis: Bits (0.0 to 2.0); x-axis: Position. Sequence labels recovered from the panels: T A T A A T; T T G A C A; T G A A A T T T T T T T T T G A A A A G T A; A T T G G T Y A A A W T T T R A C C A A T]<br />
Figure 3.4: Logo plots show<strong>in</strong>g the <strong>in</strong>itial weight matrices used for search<strong>in</strong>g E. coli <strong>and</strong> Shigella<br />
genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), <strong>and</strong> FIS b<strong>in</strong>d<strong>in</strong>g motif (d).<br />
The 16S rRNA genes of all E. coli <strong>and</strong> Shigella genomes were annotated us<strong>in</strong>g RNAmmer.<br />
For the list of genomes, see table 3.1. All 16S rRNA genes were aligned us<strong>in</strong>g clustalw<br />
(Thompson et al., 1994) <strong>and</strong> a neighbor-jo<strong>in</strong><strong>in</strong>g tree was constructed (see figure 3.5). The<br />
figure shows additional Salmonella <strong>and</strong> Yers<strong>in</strong>ia genomes for comparison.<br />
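A neighbor-joining tree is built from a matrix of pairwise distances between the aligned sequences. The distance measure used for figure 3.5 is not stated, so the sketch below uses the simple proportion of differing sites (p-distance) purely as an assumption; the function names and the toy alignment are hypothetical, and the tree construction itself is left to a phylogeny program.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping sequence name -> aligned sequence."""
    names = sorted(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y]) for x in names for y in names}


toy_alignment = {
    "seq1": "ACGT-A",
    "seq2": "ACTTTA",
    "seq3": "ACGTTA",
}
for pair, dist in sorted(distance_matrix(toy_alignment).items()):
    print(pair, dist)
```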
Escherichia coli 536<br />
Escherichia coli APEC O1<br />
Escherichia coli CFT073<br />
Shigella sonnei Ss046<br />
Shigella boydii Sb227<br />
Shigella flexneri 2a str. 301<br />
Shigella flexneri 2a str. 2457T<br />
Escherichia coli UTI89<br />
Escherichia coli K12<br />
Escherichia coli O157:H7 EDL933<br />
Escherichia coli O157:H7 str. Sakai<br />
Escherichia coli W3110<br />
Shigella dysenteriae Sd197<br />
Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67<br />
Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150<br />
Salmonella enterica subsp. enterica serovar Typhi Ty2<br />
Salmonella enterica subsp. enterica serovar Typhi str. CT18<br />
Salmonella typhimurium LT2<br />
Yers<strong>in</strong>ia pestis Antiqua<br />
Yers<strong>in</strong>ia pestis CO92<br />
Yers<strong>in</strong>ia pestis KIM<br />
Yers<strong>in</strong>ia pestis Nepal516<br />
Yers<strong>in</strong>ia pestis Pestoides F<br />
Yers<strong>in</strong>ia pestis biovar Microtus str. 91001<br />
Yers<strong>in</strong>ia pseudotuberculosis IP 32953<br />
Figure 3.5: Neighbor-jo<strong>in</strong><strong>in</strong>g tree of first 1k bases of all 16S rRNA genes of Yers<strong>in</strong>ia, Salmonella,<br />
Shigella, <strong>and</strong> E. coli<br />
Organism | Accession | Reference<br />
Escherichia coli 101-1 | AAMK00000000 | (unpublished)<br />
Escherichia coli 53638 | AAKB00000000 | (unpublished)<br />
Escherichia coli 536 | CP000247 | (Brzuszkiewicz et al., 2006)<br />
Escherichia coli APEC O1 | CP000468 | (Johnson et al., 2007)<br />
Escherichia coli B171 | AAJX00000000 | (unpublished)<br />
Escherichia coli B7A | AAJT00000000 | (unpublished)<br />
Escherichia coli B | AAWW00000000 | (unpublished)<br />
Escherichia coli CFT073 | AE014075 | (Welch et al., 2002)<br />
Escherichia coli E110019 | AAJW00000000 | (unpublished)<br />
Escherichia coli E22 | AAJV00000000 | (unpublished)<br />
Escherichia coli F11 | AAJU00000000 | (unpublished)<br />
Escherichia coli K12 | U00096 | (Blattner et al., 1997)<br />
Escherichia coli O157:H7 EDL933 | AE005174 | (Perna et al., 2001)<br />
Escherichia coli O157:H7 str. Sakai | BA000007 | (Hayashi et al., 2001)<br />
Escherichia coli SECEC SMS-3-5 | ABAQ00000000 | (unpublished)<br />
Escherichia coli UTI89 | CP000243 | (Chen et al., 2006)<br />
Escherichia coli W3110 | AP009048 | (Hayashi et al., 2006)<br />
Shigella boydii CDC 3083-94 | AAKA00000000 | (unpublished)<br />
Shigella boydii Sb227 | CP000036 | (Yang et al., 2005)<br />
Shigella dysenteriae 1012 | AAMJ00000000 | (unpublished)<br />
Shigella dysenteriae Sd197 | CP000034 | (Yang et al., 2005)<br />
Shigella flexneri 2a str. 2457T | AE014073 | (Liao et al., 2003)<br />
Shigella flexneri 2a str. 301 | AE005674 | (Jin et al., 2002)<br />
Shigella sonnei Ss046 | CP000038 | (Yang et al., 2005)<br />
Table 3.1: Escherichia coli and Shigella genomes available at the time of the work<br />
(October 2007)<br />
[Figure 3.6 panels: P1: Raw combined scores, −10, −35, UP (E. coli) (N=63) (a); P1: Adjusted combined scores, −10, −35, UP (E. coli) (N=63) (b); P2: Raw combined scores, −10, −35, UP (E. coli) (N=63) (c); P2: Adjusted combined scores, −10, −35, UP (E. coli) (N=63) (d). x-axis: Position relative to 16S gene start (−500 to 0); y-axis: Ri]<br />
Figure 3.6: Profiles show<strong>in</strong>g the maximum Ri(tot) scores of the <strong>in</strong>itial weight matrices applied to<br />
E. coli <strong>and</strong> Shigella: Unadjusted P1 scores (a), Adjusted P1 scores (b), Unadjusted P2 scores (c),<br />
<strong>and</strong> Adjusted P2 scores (d)<br />
3.3.2 Iterating weight matrix frequencies<br />
The program iscan was developed to query a given DNA sequence and, for every position in<br />
this sequence, calculate the maximum Ri(tot) that can be obtained by trying out different<br />
spacing configurations within a specified window. The iscan algorithm aligns the first<br />
matrix with the query (in this case the –10 hexamer) and tries all distances between 13<br />
and 19 nucleotides towards the –35 hexamer, using 16 nucleotides as the center. The<br />
program then locks the optimal of those distances and continues with the next box (in<br />
this case the UP element) until all elements have been included. For source code, see<br />
appendix D.5. The spacing configuration of the two models is shown in figure ??.<br />
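The greedy spacing search described above can be sketched as follows. This is a minimal illustration and not the iscan source from appendix D.5: the names `consensus_matrix`, `box_score`, `best_spacing` and `scan_profile` are hypothetical, the toy weight matrices are not the real P1/P2 matrices, and no penalty for deviating from the central spacing is applied (an assumption made to keep the sketch short).<br />

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[Dict[str, float]]  # one {base: weight in bits} column per motif position
Box = Tuple[Matrix, int, int]    # (matrix, smallest gap, largest gap) to the previous box


def consensus_matrix(consensus: str, hit: float = 2.0, miss: float = -2.0) -> Matrix:
    """Toy weight matrix for illustration only: `hit` bits for the consensus base, `miss` otherwise."""
    return [{base: (hit if base == c else miss) for base in "ACGT"} for c in consensus]


def box_score(matrix: Matrix, seq: str, start: int) -> float:
    """Sum the weights of `matrix` aligned at seq[start:]; -inf if the box does not fit."""
    if start < 0 or start + len(matrix) > len(seq):
        return float("-inf")
    return sum(column.get(seq[start + i], float("-inf")) for i, column in enumerate(matrix))


def best_spacing(seq: str, anchor: int, boxes: List[Box]) -> Tuple[float, List[int]]:
    """Place the first box at `anchor`, then add each further box upstream of the
    previous one, locking the gap that gives the highest score before moving on."""
    total = box_score(boxes[0][0], seq, anchor)
    left_edge = anchor            # start of the most recently placed box
    gaps: List[int] = []
    for matrix, min_gap, max_gap in boxes[1:]:
        best_gap: Optional[int] = None
        best = float("-inf")
        for gap in range(min_gap, max_gap + 1):
            s = box_score(matrix, seq, left_edge - gap - len(matrix))
            if s > best:
                best_gap, best = gap, s
        if best_gap is None:      # no spacing fits inside the sequence
            return float("-inf"), gaps
        total += best
        left_edge -= best_gap + len(matrix)
        gaps.append(best_gap)
    return total, gaps


def scan_profile(seq: str, boxes: List[Box]) -> List[float]:
    """Maximum total score for every possible position of the first box."""
    return [best_spacing(seq, anchor, boxes)[0] for anchor in range(len(seq))]
```

With a toy −10 matrix for TATAAT and a toy −35 matrix for TTGACA allowed 13 to 19 nucleotides upstream, `best_spacing` returns the total score together with the locked gap, and `scan_profile` gives one value per position, which is the kind of profile plotted in figure 3.6.<br />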
The maximum Ri(tot) values of all operons were stacked, and average and standard<br />
deviation values were plotted as a function of position. Because the distance between P1/P2<br />
and the 16S gene varies slightly, the unadjusted plots appear noisy. Shifting the plots<br />
slightly, by aligning them to local maxima around P1 and P2, renders the P1 and P2 model<br />
scores sharper (see figure 3.6).<br />
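The adjustment can be illustrated with a short sketch: each profile is re-centred on its local maximum near the expected promoter position before the per-position mean and standard deviation are computed. The function names and the `search` and `flank` parameters are assumptions made for this example, not part of the original analysis scripts.<br />

```python
import statistics
from typing import List, Optional, Sequence, Tuple


def align_to_peak(profile: Sequence[float], expected: int,
                  search: int = 10, flank: int = 20) -> List[Optional[float]]:
    """Re-centre a profile on its local maximum within `search` positions of `expected`.
    Positions falling outside the profile are padded with None (a gap)."""
    lo = max(0, expected - search)
    hi = min(len(profile), expected + search + 1)
    peak = max(range(lo, hi), key=lambda i: profile[i])
    return [profile[i] if 0 <= i < len(profile) else None
            for i in range(peak - flank, peak + flank + 1)]


def stack(profiles: Sequence[Sequence[Optional[float]]]) -> Tuple[List[Optional[float]], List[float]]:
    """Per-position mean and sample standard deviation over stacked profiles, ignoring gaps."""
    means: List[Optional[float]] = []
    sds: List[float] = []
    for column in zip(*profiles):
        values = [v for v in column if v is not None]
        means.append(statistics.mean(values) if values else None)
        sds.append(statistics.stdev(values) if len(values) > 1 else 0.0)
    return means, sds
```

Two profiles whose peaks lie a few nucleotides apart average to a flattened, noisy peak when stacked directly; after `align_to_peak` the peaks coincide and the standard deviation at the peak shrinks.<br />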
3.3.3 Refining E. coli and Shigella models<br />
All peaks of Ri(tot) around the regions of P1 and P2 were collected, and the P1 and<br />
P2 models were refined by adjusting matrix parameters according to the observed base<br />
frequencies in the hits obtained. The resulting logo plots are shown in figure 3.7.<br />
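As an illustration of such a refinement, the sketch below rebuilds a weight matrix from a set of aligned hits using the individual-information form w(b, l) = 2 + log2 f(b, l). It assumes an equiprobable background of 2 bits per position, adds a pseudocount so that unseen bases keep a finite weight, and leaves out the small-sample correction; the function name and the pseudocount value are hypothetical and not taken from the thesis code.<br />

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def weight_matrix(hits: Sequence[str], pseudocount: float = 0.5) -> List[Dict[str, float]]:
    """Return one {base: weight in bits} column per position of the aligned hits."""
    if not hits:
        raise ValueError("at least one hit is required")
    length = len(hits[0])
    if any(len(hit) != length for hit in hits):
        raise ValueError("all hits must have the same length")
    total = len(hits) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        counts = Counter(hit[pos] for hit in hits)
        # Frequent bases approach +2 bits; rare bases become negative.
        matrix.append({base: 2.0 + math.log2((counts[base] + pseudocount) / total)
                       for base in "ACGT"})
    return matrix
```

Scoring a candidate site then amounts to summing the weights of its bases, and the refined matrix can be fed back into the scan to collect a new set of hits.<br />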
rRNA operons and promoter analysis<br />
[Figure 3.7, panels (a) to (g): sequence logos. Vertical axis: bits (0.0 to 2.0); horizontal axis: position. Sequences printed beneath the logos: TATAAT, TCAAAAAATTATTTAAAATTTC, TTTGCTTGAAAAATGAGCGGT, TATTAT, TCAGAAAAAGAAAGCAAAAAAA, TTGTCA, TTGACT.]<br />
Figure 3.7: Logos showing the base composition of P1 and P2 of E. coli genomes, as identified<br />
by the initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS<br />
binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g)<br />
DNA melting and SIDD energy<br />
[Figure 3.8, panel title: U00096: SIDD measure, free energy. Vertical axis: Z-score (−0.8 to 0.0); horizontal axis: distance from translation start (−400 to 400). One curve per superhelix density: s=−0.025, s=−0.035, s=−0.045, s=−0.055.]<br />
Figure 3.8: Average profiles of SIDD energy calculated at different superhelix densities (-0.025,<br />
-0.035, -0.045, and -0.055). All genes have been aligned at the translation start.<br />
3.4 DNA melting and SIDD energy<br />
An algorithm developed by Benham and co-workers (Wang & Benham, 2008; Wang et al.,<br />
2004) estimates the SIDD energy, which is the free energy required to open the DNA helix<br />
under different superhelix densities. When observing the SIDD energy 400 nucleotides on<br />
each side of the translation start of all coding sequences in E. coli K12 (accession U00096), a<br />
clear drop in the energy requirement is visible. The drop originates from the transcription<br />
start rather than the translation start, which explains the broad appearance of the curve.<br />
Figure 3.8 plots the SIDD energy values at different superhelix densities (-0.025, -0.035, -0.045,<br />
and -0.055). The graph shows z-scores, which express how the average SIDD energy at<br />
a given relative position compares with the average and standard deviation of the entire<br />
chromosome. z-scores below zero correspond to SIDD energies lower than the chromosome<br />
average, that is, to regions that melt more easily.<br />
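A minimal sketch of such a z-score profile is given below. It assumes that the z-score at a relative position is the mean energy over all genes at that offset, minus the chromosome mean, divided by the chromosome standard deviation; the exact normalisation used for figure 3.8 is not stated in the text. Strand handling is left out (all starts are treated as forward-strand), and the function name is hypothetical.<br />

```python
import statistics
from typing import List, Optional, Sequence


def zscore_profile(energy: Sequence[float], starts: Sequence[int],
                   flank: int) -> List[Optional[float]]:
    """z-score of the mean energy at each offset in [-flank, +flank] around the
    given 0-based start positions, relative to the whole-chromosome distribution."""
    mu = statistics.mean(energy)
    sigma = statistics.pstdev(energy)
    profile: List[Optional[float]] = []
    for offset in range(-flank, flank + 1):
        values = [energy[s + offset] for s in starts if 0 <= s + offset < len(energy)]
        if values and sigma > 0:
            profile.append((statistics.mean(values) - mu) / sigma)
        else:
            profile.append(None)  # no data at this offset, or a flat chromosome
    return profile
```

A negative value at an offset means that, averaged over the aligned genes, the helix opens more easily there than at a typical position of the chromosome.<br />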
3.4.1 codesearch: Mapping numerical data to genome annotations<br />
The codesearch tool was written to enable searches for various annotation patterns of a<br />
genome and to map numerical data relative to these annotations. The tool requires a<br />
pregenerated code file which condenses all annotations of the genome into a single string,<br />
with one character per nucleotide position (see table 3.2). The tool allows the<br />
user to provide a regular expression to search for in the pregenerated code file.<br />
A list of numerical data pertaining to the individual nucleotides of the genome can<br />
then be included. When such data are defined, codesearch extracts the numerical values<br />
corresponding to the regions matching the pattern. The output of codesearch is divided into<br />
two tab-separated columns: the first column contains the genomic region where the pattern<br />
matched, and the other column contains either the sequence as a string (when running in<br />
Code Meaning Example<br />
C Coding CCCCCCCCCCCCC<br />
> Annotation start on forward strand .....>CCCC...<br />
< Annotation start on reverse strand ...CCCCTTT.....<br />
t 5S rRNA ..tttssss......<br />
l 23S rRNA ...lllll<br />
>codesearch -cod U00096.cod.gz -seq U00096.fsa -pat '(.{5,5}>s{1,1})'<br />
223773..223779 AAATTGA<br />
3939833..3939839 AAATTGA<br />
4033556..4033562 AAATTGA<br />
4164684..4164690 AAATTGA<br />
4206172..4206178 AAATTGA<br />
3426782..3426776 ATTGAAG<br />
2729177..2729171 ATTGAAG<br />
>codesearch -cod U00096.cod.gz -dat U00096.sidd35.gz:1,4 -pat '(.{5,5}>s{1,1})' \<br />
-format '%0.2f' | tab2tbl --window='-5,2' -org 'E. coli K12' -col blue<br />
def org col -5 -4 -3 -2 -1 1 2<br />
223773..223779 E. coli K12 blue 7.93 7.93 7.94 8.00 8.26 8.28 8.37<br />
3939833..3939839 E. coli K12 blue 7.91 7.90 7.92 7.99 8.25 8.28 8.36<br />
4033556..4033562 E. coli K12 blue 7.83 7.83 7.85 7.92 8.19 8.22 8.32<br />
4164684..4164690 E. coli K12 blue 7.85 7.85 7.87 7.95 8.21 8.25 8.34<br />
4206172..4206178 E. coli K12 blue 7.91 7.91 7.92 7.99 8.26 8.28 8.37<br />
3426782..3426776 E. coli K12 blue 7.91 7.93 7.99 8.26 8.28 8.37 8.73<br />
2729177..2729171 E. coli K12 blue 7.91 7.93 7.99 8.26 8.28 8.37 8.72<br />
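The idea behind the listing above can be sketched in a few lines of Python: a regular expression is matched against the one-character-per-nucleotide code string, and the matching coordinates are used to slice either the sequence or a per-nucleotide data track. This is only an illustration of the principle, not the codesearch implementation; the function below is hypothetical, reports forward-strand matches only, and uses a toy code string rather than the real U00096 code file.<br />

```python
import re
from typing import List, Optional, Sequence, Tuple


def codesearch(code: str, pattern: str,
               seq: Optional[str] = None,
               data: Optional[Sequence[float]] = None,
               fmt: str = "%0.2f") -> List[Tuple[str, str]]:
    """Return (region, payload) rows for every match of `pattern` in `code`.
    Regions are 1-based and inclusive. The payload is the formatted numerical
    data when `data` is given, otherwise the matching sequence when `seq` is given."""
    rows = []
    for match in re.finditer(pattern, code):
        start, end = match.start(), match.end()
        region = f"{start + 1}..{end}"
        if data is not None:
            payload = "\t".join(fmt % value for value in data[start:end])
        elif seq is not None:
            payload = seq[start:end]
        else:
            payload = ""
        rows.append((region, payload))
    return rows
```

For a toy code string `.....>ssss...` and the pattern from the listing, the single match covers positions 1 to 7: five upstream positions, the annotation start, and the first 16S rRNA position.<br />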
Using heatmap to generate energy landscape<br />
The R function heatmap, described in chapter 2, was used to compare both SIDD profiles<br />
and the profiles of P1/P2 model scores. All promoter sequences were aligned first according<br />
to the peak score of the P1 model (near the expected site of P1) and second according to<br />
the peak score of the P2 model (near the expected site of P2). In figure 3.9 the model scores<br />
are visualized in green in the leftmost heatmaps, whereas the<br />
rightmost heatmaps contain the SIDD energies (blue) of the aligned promoter sequences.<br />
This analysis shows that a deep drop in the SIDD energy occurs near the P1 site for<br />
approximately half of the promoter sequences.<br />
[Figure 3.9: four heatmaps, the upper pair for P1 and the lower pair for P2, each marked with the −10 box and the 16S rRNA +1 position. Columns: promoter sequences; rows: positions −500 to +50 in steps of 10 bp. Color scales: model score (bits), −22 to 34; SIDD energy (kcal/mol), 5.8 to 10.0. Note in figure: gaps are appended to each promoter region to adjust to maxima of the P1/P2 model scores.]<br />
Figure 3.9: E. coli and Shigella rrnB energy landscape visualized using the heatmap function.<br />
Each vertical column corresponds to a promoter sequence, whereas the horizontal rows represent<br />
average values over 10 bp within each sequence. Coordinates labeled on the horizontal rows are<br />
relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps<br />
show P2. Leftmost heatmaps show P1/P2 model scores in green, whereas rightmost heatmaps<br />
show the SIDD energy in blue.<br />
3.5 The genomic context: visualizing operons and DNA<br />
properties<br />
During the thesis work, this author has been involved in the development of a next-generation<br />
genome browser to replace the older GeneWiz software developed at CBS (Pedersen<br />
et al., 2000; Jensen et al., 1999). The old GeneWiz is still used by the BLASTatlas service<br />
to generate the static atlas graphic. The goal of the new version was to create an<br />
interactive and platform-independent program that would allow the user to zoom from a<br />
global genomic scale down to the nucleotide level. The basic principles of transforming<br />
numerical data into a color-coded representation remained identical to the GeneWiz<br />
method. However, the old GeneWiz software required several minutes to regenerate a plot, and<br />
the challenge was to provide an efficient data flow that would allow this regeneration in<br />
fractions of a second. Eva Rotenberg and Hans Henrik Stærfeldt from CBS authored the<br />
gwBrowser Java code which handles the plotting, whereas this author has been responsible<br />
for the server-side software. For fast visualization to be possible, all numerical data<br />
that are plotted must be pre-binned and accessible at all zoom levels. A system was<br />
established which could hold these pre-binned data for a number of genomes using a<br />
MySQL database. The first solution involved a single large table, with fields corresponding<br />
to genome id, position, zoom level, field, and value. It quickly proved unfeasible. Since<br />
storing all zoom levels for a genome of length N requires 2×N records, a rough estimate<br />
shows that 1,000 genomes of 3 Mb and 20 different DNA properties (fields) require<br />
1,000 × 3,000,000 × 2 × 20 = 120 billion database records. Maintaining such large search<br />
indexes and preventing table locks during updates made this solution impossible. A second<br />
solution, splitting each genome into its own table, solved many speed issues but still did<br />
not perform satisfactorily. Instead, data are stored in binary files, one file per genome and<br />
zoom level. All values are written as fixed-width data, and using memory mapping the server<br />
can quickly obtain data within the file knowing the coordinates of the window. The listing below<br />
shows how the client retrieves data for the genome id AL111168GENOMEatlas, from position<br />
1 to 37,473 bp, at zoom level 5. Figure 3.10 shows the workflow of the gwBrowser<br />
software. For further details on this tool, please refer to paper VII. The software is now<br />
available via http://www.cbs.dtu.dk/services/gwBrowser.<br />
set server = http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi<br />
curl $server"?d=AL111168GENOMEatlas&m=d&f=dnap0&b=1&e=37473&l=5&z=false"<br />
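The storage scheme described in this section (pre-binned zoom levels written as fixed-width records and read back through memory mapping) can be sketched as follows. The details are assumptions made for illustration and are not taken from the gwBrowser server code: each zoom level is assumed to halve the number of bins by averaging neighbours (which gives roughly 2×N records in total), each record is assumed to be one little-endian 32-bit float, and the file name is hypothetical.<br />

```python
import mmap
import os
import struct
import tempfile
from typing import List, Sequence

RECORD = struct.Struct("<f")  # assumption: one little-endian 32-bit float per bin


def zoom_levels(values: Sequence[float]) -> List[List[float]]:
    """Level 0 holds the raw values; each further level averages pairs of bins."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) / len(prev[i:i + 2])
                       for i in range(0, len(prev), 2)])
    return levels


def write_level(path: str, values: Sequence[float]) -> None:
    """Write one zoom level as consecutive fixed-width records."""
    with open(path, "wb") as handle:
        for value in values:
            handle.write(RECORD.pack(value))


def read_window(path: str, first_bin: int, last_bin: int) -> List[float]:
    """Fetch bins first_bin..last_bin (0-based, inclusive) without reading the whole file.
    Because records have a fixed width, the byte offsets follow directly from the bin numbers."""
    with open(path, "rb") as handle, \
            mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        lo = first_bin * RECORD.size
        hi = (last_bin + 1) * RECORD.size
        return [value for (value,) in RECORD.iter_unpack(mapped[lo:hi])]


def demo() -> List[float]:
    """Bin a toy track of eight values, store level 1 on disk and read back a window."""
    levels = zoom_levels([float(v) for v in range(8)])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_genome.level1.bin")
        write_level(path, levels[1])
        return read_window(path, 1, 2)
```

The design choice illustrated here is that a window request translates into a single offset calculation and one contiguous read, with no index to maintain and no table to lock.<br />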
3.6 Visualizing sequencing quality using gwBrowser<br />
Modern high-throughput sequencing techniques currently lack sufficient read lengths to<br />
span many repetitive elements of genomes, especially the rRNA genes mentioned above. To<br />
assess how well a given set of reads can close a genome sequence, a method was developed<br />
which accounts for both the quality scores and the uniqueness of the reads. The<br />
concept of the method is to map the qualities of all reads back to a reference genome and<br />
apply a weight to the qualities according to the uniqueness of the reads. Reads that have<br />
multiple hits throughout the genome will contribute little, whereas reads that are specific<br />
will contribute fully. Figure 3.11 shows the principle of the method, which was integrated<br />
into the gwBrowser software.<br />
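A minimal sketch of this weighting is given below. The weight function is not specified in the text, so the sketch assumes the simplest choice, one divided by the number of hits of the read, and it assumes gapless hits; for every genome position the maximum weighted quality over all mapped reads is kept. The function name and the input layout are hypothetical.<br />

```python
from typing import List, Sequence, Tuple

Read = Tuple[Sequence[float], Sequence[int]]  # (per-base qualities, 0-based hit start positions)


def weighted_max_quality(genome_length: int, reads: Sequence[Read]) -> List[float]:
    """Per-position maximum of uniqueness-weighted read qualities."""
    track = [0.0] * genome_length
    for qualities, hit_starts in reads:
        if not hit_starts:
            continue                      # unmapped read: contributes nothing
        weight = 1.0 / len(hit_starts)    # assumption: unique reads weigh 1, repeats less
        for start in hit_starts:
            for i, quality in enumerate(qualities):
                pos = start + i
                if 0 <= pos < genome_length:
                    track[pos] = max(track[pos], weight * quality)
    return track
```

Regions covered only by reads with many hits, such as rRNA operons, end up with low values in the resulting track, which is what makes them stand out when the track is displayed along the genome.<br />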
[Figure 3.10: workflow diagram. Client side: configure and submit atlas (reference genome, annotations, sequencing reads, query genomes, custom numerical data), wait for processing, browser applet, editing of atlas layout (XML). Server side: main server, XML configuration, data binning of zoom levels, binned data, browser server. The applet sends a request (atlas ID, zoom level, window, field name ...) and receives the returned data.]<br />
Figure 3.10: Principle workflow of gwBrowser data exchange.<br />
[Figure 3.11: schematic. Read sequences are aligned to the genome (hits H1 ... Hr with scores S1 ... Sr). The quality scores qr(i) are mapped to the genome and a weight is applied, giving q'r(i). From all positions in the genome, the maximum uniqueness value derived from the mapped reads is obtained. Finally, all maximum values are plotted on the reference genome using GeneWiz Browser; the marked band in the example shows a region with low uniqueness. Lanes shown: Weighted coverage, Sequence agreement, Max unique qual, Information Content, Annotations (CDS+, CDS-, rRNA, tRNA), Intrinsic Curvature, Stacking Energy, Position Preference, Global Direct Repeats, Global Inverted Repeats, GC Skew, Percent AT.]<br />
Figure 3.11: Mapping qualities of sequencing reads to a reference genome while accounting for<br />
the uniqueness of the read.<br />
[Figure 3.12: gwBrowser zoom on E. coli K12 MG1655. Marked elements: P1 and P2, each with −10, −35 and UP boxes, and FIS sites; operon labels rrnA, rrnB, rrnC, rrnD, rrnE, rrnG, rrnH. Lanes: SIDD (s: −0.055), SIDD (s: −0.045), SIDD (s: −0.035), Annotations (CDS+, CDS-, rRNA, tRNA), Intrinsic Curvature, Stacking Energy, Position Preference, GC Skew, Percent AT.]<br />
Figure 3.12: A zoom of the P1 P2 tandem promoter system upstream of the rrnB operon of E.<br />
coli K12.<br />
3.6.1 Visualizing the P1 and P2 structure using gwBrowser<br />
The gwBrowser tool allows the user to append various types of annotations, such as TSS marks,<br />
boxes, and arrows, once the binning step has finished. This makes it possible to visualize promoter<br />
structures like the P1/P2 system and to integrate them with various DNA properties.<br />
The gwBrowser tool was applied to the E. coli rrnB promoter system to correlate<br />
the annotated regulatory elements with the SIDD energy (Wang et al., 2004; Wang &<br />
Benham, 2008) (see figure 3.12).<br />
The plot in figure 3.12 shows a drop in free energy upstream of P1 and P2, which<br />
from an energetic viewpoint explains the high transcription rate. The transcription factor<br />
FIS stimulates transcription at several promoters; for example, the binding of FIS<br />
at the leuV promoter (Ross et al., 1999) has been suggested to transmit the superhelical<br />
destabilization downstream to the point where the RNAP twists and opens the helix (Wang<br />
et al., 2004). This model may also be valid for the rrnB P1 promoter, as the activities of<br />
leuV and rrnB P1 are comparable (Bauer et al., 1988).<br />
3.7 Summary<br />
Ribosomal RNA genes play an important role in the cell and can be highly transcribed:<br />
often more than 90% of the total transcripts in rapidly growing bacterial cells are from<br />
rRNA genes. rRNA genes are also important in determining taxonomy. Because<br />
correctly finding the start and stop positions of the rRNA genes is difficult to<br />
do with BLAST searches, we have developed RNAmmer to find the rRNA genes. Once the<br />
genes are mapped, further studies, such as promoter profiling, can be done. The gwBrowser<br />
allows one to zoom in on particular areas of the chromosome and, in the case of rRNA<br />
promoters, to map important structural properties of the DNA in the promoter region.<br />
3.8 Paper VI: RNAmmer: Fast two-level HMM prediction<br />
of rRNA in prokaryotic genome sequences<br />
3100–3108 Nucleic Acids Research, 2007, Vol. 35, No. 9 Published online 22 April 2007<br />
doi:10.1093/nar/gkm160<br />
RNAmmer: consistent and rapid annotation<br />
of ribosomal RNA genes<br />
Karin Lagesen 1,2,*, Peter Hallin 3, Einar Andreas Rødland 1,2,4,5, Hans-Henrik Stærfeldt 3,<br />
Torbjørn Rognes 1,2,4 and David W. Ussery 1,2,3<br />
1 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo,<br />
NO-0027 Oslo, Norway, 2 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology,<br />
Rikshospitalet-Radiumhospitalet Medical Centre, NO-0027 Oslo, Norway, 3 Center for Biological Sequence<br />
Analysis, Biocentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark, 4 Department of<br />
Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway and 5 Norwegian Computing<br />
Center, PO Box 114 Blindern, NO-0314 Oslo, Norway<br />
Received December 1, 2006; Revised and Accepted March 2, 2007<br />
ABSTRACT<br />
The publication of a complete genome sequence is<br />
usually accompanied by annotations of its genes.<br />
In contrast to protein coding genes, genes for<br />
ribosomal RNA (rRNA) are often poorly or inconsistently<br />
annotated. This makes comparative<br />
studies based on rRNA genes difficult. We have<br />
therefore created computational predictors for the<br />
major rRNA species from all kingdoms of life and<br />
compiled them into a program called RNAmmer.<br />
The program uses hidden Markov models trained on<br />
data from the 5S ribosomal RNA database and<br />
the European ribosomal RNA database project.<br />
A pre-screening step makes the method fast with<br />
little loss of sensitivity, enabling the analysis of<br />
a complete bacterial genome in less than a minute.<br />
Results from running RNAmmer on a large set of<br />
genomes indicate that the location of rRNAs can be<br />
predicted with a very high level of accuracy. Novel,<br />
unannotated rRNAs are also predicted in many<br />
genomes. The software as well as the genome<br />
analysis results are available at the CBS web server.<br />
*To whom correspondence should be addressed. Tel: +4722844786; Email: karin.lagesen@medisin.uio.no<br />
© 2007 The Author(s)<br />
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.<br />
INTRODUCTION<br />
Ribosomes are the molecular machines which form the<br />
connection between nucleic acids and proteins in all living<br />
organisms. The ribosome’s dependence on ribosomal<br />
RNAs (rRNAs) for its function has caused them to be<br />
conserved at both the sequence and the structure level.<br />
Because of this, rRNAs are often used in comparative<br />
studies such as phylogenetic inference. Comparative<br />
studies have become more popular as more genomes<br />
have been completely sequenced, but can potentially<br />
become complicated when some of the genes they are<br />
based on are poorly annotated or not annotated at all.<br />
Unfortunately, this is often a problem with rRNAs as<br />
genome annotation pipelines usually do not include tools<br />
specific for rRNA detection. Instead, rRNAs are often<br />
located by sequence similarity searches such as BLAST.<br />
Although such searches may give reasonable answers due<br />
to the high level of sequence conservation in the core<br />
regions of the genes, using such results for annotation<br />
purposes can be problematic. The validity of the search<br />
results depends on the program and database used.<br />
Changing one or both of these can drastically change<br />
the results. Genomic databases have grown exponentially<br />
over the past two decades and search programs have as a<br />
consequence had to undergo constant revisions in order to<br />
meet the requirements of the research community. Thus,<br />
the results of a search done today are probably very<br />
different from those produced several years ago. An added<br />
complication is that the most commonly used database<br />
search methods have poor performance for noncoding<br />
RNAs. A recent study comparing several different<br />
methods for predicting noncoding RNAs, including<br />
rRNAs, found that the most commonly used methods<br />
gave the most inaccurate results (1).<br />
Through our work on the GenomeAtlas database (2),<br />
we have seen the results of poor annotation of rRNAs.<br />
Some genomes do not have any rRNAs annotated at all,<br />
whereas other genomes seem to have rRNAs annotated<br />
on the wrong strand. We initially tried to do systematic<br />
BLAST (3) searches, but it proved difficult to maintain<br />
consistency throughout this process. The high level of<br />
sequence conservation among the rRNAs enabled us to<br />
create hidden Markov models (HMMs) from structural<br />
alignments. Such models are more capable of capturing<br />
the sequence variation that is inherently present in<br />
the rRNA gene families than simple BLAST searches.<br />
Using HMMs also simplifies the use of common criteria<br />
for prediction assessment. A library of HMMs was<br />
constructed and the program RNAmmer was developed<br />
to make use of this library. RNAmmer is available<br />
through the CBS web site, as a web service or as a<br />
stand-alone package. It has been tested on all published<br />
genomes and gives accurate predictions of rRNAs. The<br />
program also has the added benefit of producing results<br />
that are comparable between genomes.<br />
Our work has focused on three of the major rRNA<br />
species. The ribosome consists of two subunits, the small<br />
<strong>and</strong> the large subunit, which pair up to form the<br />
functional ribosome. The rRNAs present <strong>in</strong> prokaryotes<br />
are the 5S <strong>and</strong> 23S <strong>in</strong> the large subunit, <strong>and</strong> the 16S <strong>in</strong> the<br />
small subunit. In eukaryotes, 5S, 5.8S <strong>and</strong> 28S rRNA exist<br />
<strong>in</strong> the large subunit, <strong>and</strong> 18S rRNA <strong>in</strong> the small subunit.<br />
The 5.8S is not considered <strong>in</strong> this work. There are<br />
substantial sequence <strong>and</strong> secondary structure similarities<br />
between eukaryotic <strong>and</strong> prokaryotic rRNAs; however,<br />
the eukaryotic rRNAs commonly have longer stems <strong>and</strong><br />
larger loops than those of the prokaryotes. The subunits<br />
are composed of both RNAs <strong>and</strong> prote<strong>in</strong>s. S<strong>in</strong>ce their<br />
discovery <strong>in</strong> the early 1950s, it has been debated whether<br />
ribosomal function should be credited to the rRNAs or<br />
the prote<strong>in</strong>s. Recent crystal studies have revealed that<br />
prote<strong>in</strong> synthesis to a large extent is dependent on the<br />
rRNAs (4–7) <strong>and</strong> this has most likely been <strong>in</strong>strumental<br />
for their high level of conservation.<br />
In prokaryotes, the 16S, 23S <strong>and</strong> 5S rRNAs are<br />
commonly transcribed together, while the 18S, 28S <strong>and</strong><br />
5.8S rRNAs form a transcriptional unit <strong>in</strong> eukaryotes.<br />
Eukaryotic 5S rRNAs commonly appear in highly duplicated<br />
t<strong>and</strong>em repeats (8). In most organisms, there are<br />
several copies of the rRNA transcription unit, <strong>and</strong><br />
although as much as 11% sequence divergence has been<br />
observed between units with<strong>in</strong> the same genome, the<br />
difference is usually less than 1% (9). In several cases,<br />
segments are also edited out of the transcribed rRNA.<br />
These segments may be <strong>in</strong>trons that after splic<strong>in</strong>g leave<br />
a cont<strong>in</strong>uous rRNA, or they can be <strong>in</strong>terven<strong>in</strong>g sequences<br />
(IVS) that leave a fragmented rRNA which is still<br />
functional with<strong>in</strong> the ribosome structure (10). Introns<br />
are most prevalent in eukaryotes and archaea, while<br />
<strong>in</strong>terven<strong>in</strong>g sequences have been seen <strong>in</strong> eukaryotes <strong>and</strong><br />
bacteria. Introns are predom<strong>in</strong>antly found with<strong>in</strong> conserved<br />
sequences close to tRNA <strong>and</strong> mRNA-b<strong>in</strong>d<strong>in</strong>g<br />
sites (10), whereas <strong>in</strong>terven<strong>in</strong>g sequences are ord<strong>in</strong>arily<br />
seen <strong>in</strong> hypervariable regions (11).<br />
METHODS AND MATERIALS<br />
Us<strong>in</strong>g HMMs to f<strong>in</strong>d new members of a sequence family<br />
requires reliable multiple alignments. The 16S/18S <strong>and</strong><br />
23S/28S rRNA alignments were retrieved from the<br />
European ribosomal RNA database (ERRD) (12).<br />
In this database, annotated large <strong>and</strong> small subunit<br />
ribosomal RNA sequences from the EMBL nucleotide<br />
database with a length of at least 70% of their estimated<br />
full length have been aligned. Multiple alignments of 5S<br />
rRNAs were retrieved from the 5S Ribosomal RNA<br />
Nucleic Acids Research, 2007, Vol. 35, No. 9 3101<br />
Database (13). Data from both databases were downloaded<br />
on October 27, 2005. The alignments are<br />
all structural alignments, i.e. aligned us<strong>in</strong>g secondary<br />
structure <strong>in</strong>formation ga<strong>in</strong>ed from comparative sequence<br />
analysis. The 5S alignments were already divided<br />
<strong>in</strong>to separate alignments for archaeal, bacterial <strong>and</strong><br />
eukaryotic sequences, whereas the ERRD data were not.<br />
The alignments for 16/18S <strong>and</strong> 23/28S rRNAs were<br />
divided <strong>in</strong>to the same groups as the 5S data to provide<br />
kingdom-specific predictors. The data were stored in<br />
a MySQL database for easier h<strong>and</strong>l<strong>in</strong>g.<br />
The ERRD data conta<strong>in</strong>ed sequences from ‘environmental<br />
samples’. These were excluded s<strong>in</strong>ce there was little<br />
<strong>in</strong>formation about them. The 5S were generally around<br />
120 nt long, the 16/18S around 1500 nt <strong>and</strong> the 23/28S<br />
around 3000 nt long, all with no obvious outliers. The<br />
length of the eukaryotic rRNAs varied substantially,<br />
more than those of bacterial <strong>and</strong> archaeal rRNAs, but no<br />
sequences <strong>in</strong> the alignments seemed obviously wrong.<br />
The sequences were divided <strong>in</strong>to phylogenetic groups to<br />
help with further analysis. Due to sequenc<strong>in</strong>g bias, some<br />
phylogenetic groups dom<strong>in</strong>ated the data sets. Such a skew<br />
could potentially cause the predictors to be less sensitive<br />
on underrepresented phylogenetic groups. Among<br />
the bacteria, 82% of the sequences were from three<br />
phyla: Act<strong>in</strong>obacteria, Firmicutes <strong>and</strong> Proteobacteria.<br />
Around 70% of the archaeal sequences were from<br />
Euryarchaeota; among the eukaryotes, the Streptophyta<br />
comprised 15% of the data. Several of the sequences also<br />
proved to be very similar. Therefore, redundancy reduction<br />
inspired by Hobohm's second algorithm (14) was<br />
performed. This algorithm starts with a sorted list of the<br />
number of neighbors each sequence has. An all-aga<strong>in</strong>st-all<br />
comparison between the sequences is performed <strong>and</strong><br />
neighborship is judged by the level of similarity found.<br />
Similarity was measured by $\mathrm{Score} = \sum_{i,j} n_{ij} S_{ij} / (N - g)$,<br />
where $i$ and $j$ sum over the four nucleotides, $n_{ij}$ counts the<br />
number of aligned nucleotide pairs $(i, j)$, $N$ is the length of<br />
the sequence and $g$ is the number of gap-only positions; $S_{ij}$<br />
refers to the scoring matrix EDNAFULL created by Todd<br />
Lowe. The maximum similarity level allowed was set to<br />
ensure that each phylum was represented. Similarity<br />
graphs were formed for each group, with the sequences<br />
as vertices <strong>and</strong> edges between similar sequences. The<br />
sequence with the highest connectivity <strong>and</strong> its edges were<br />
deleted from the graph, <strong>and</strong> this was repeated until no<br />
edges rema<strong>in</strong>ed. At the end, all removed sequences were<br />
checked to see if they had any edges to vertices <strong>in</strong> the<br />
rema<strong>in</strong><strong>in</strong>g set. If not, they were re<strong>in</strong>stated. This procedure<br />
was implemented as a C program.<br />
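The redundancy reduction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' C program: the function names are hypothetical, the scoring function is a simplified stand-in for EDNAFULL restricted to unambiguous nucleotides (+5 for a match, -4 for a mismatch), and the order in which removed sequences are checked for reinstatement is an assumption.<br />

```python
from itertools import combinations

NUCLEOTIDES = set("ACGT")


def s(a, b):
    # Simplified stand-in for EDNAFULL on unambiguous nucleotides (assumption).
    return 5 if a == b else -4


def pair_score(x, y):
    """Score = sum_ij n_ij * S_ij / (N - g) for two rows of one alignment."""
    n = len(x)
    # g: positions where both rows carry a gap
    g = sum(1 for a, b in zip(x, y) if a == "-" and b == "-")
    total = sum(s(a, b) for a, b in zip(x, y)
                if a in NUCLEOTIDES and b in NUCLEOTIDES)
    return total / (n - g) if n > g else 0.0


def reduce_redundancy(seqs, threshold):
    """seqs: dict name -> aligned sequence. Returns the set of names kept."""
    # All-against-all comparison: an edge joins two sequences scoring above threshold.
    neighbours = {name: set() for name in seqs}
    for a, b in combinations(seqs, 2):
        if pair_score(seqs[a], seqs[b]) > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Repeatedly delete the most connected vertex until no edges remain.
    graph = {name: set(adj) for name, adj in neighbours.items()}
    removed = []
    while any(graph.values()):
        worst = max(graph, key=lambda name: len(graph[name]))
        for other in graph.pop(worst):
            graph[other].discard(worst)
        removed.append(worst)

    # Reinstate removed sequences that have no neighbour in the remaining set
    # (checked in removal order; this ordering is an assumption).
    kept = set(graph)
    for name in removed:
        if not (neighbours[name] & kept):
            kept.add(name)
    return kept
```

Calling `reduce_redundancy(seqs, threshold)` on the rows of one kingdom-specific alignment returns the names that would be passed on to model building; the threshold itself would have to be tuned per group, as the text describes.<br />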
Sequences <strong>in</strong> ERRD may conta<strong>in</strong> ambiguous nucleotide<br />
symbols represent<strong>in</strong>g nucleotides that have not been<br />
uniquely determ<strong>in</strong>ed. These occur more frequently <strong>in</strong><br />
bacteria <strong>and</strong> eukaryotes than <strong>in</strong> archaea, <strong>and</strong> primarily at<br />
both ends of the alignment: <strong>in</strong> 16/18S, predom<strong>in</strong>antly<br />
at the end; <strong>in</strong> 23/28S, predom<strong>in</strong>antly at the beg<strong>in</strong>n<strong>in</strong>g.<br />
In the latter case, this was mostly due to the high prevalence<br />
of gaps at the end of the alignment. As we found that<br />
ambiguous nucleotides at the ends reduced the ability to<br />
predict start <strong>and</strong> stop positions accurately, we decided to<br />
remove all sequences with five or more ambiguous
Table 1. The <strong>in</strong>itial number of rRNA sequences <strong>and</strong> the number of sequences excluded for different reasons.<br />
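<table>
<tr><th>Kingdom</th><th>Type</th><th>Initial count</th><th>Environmental samples</th><th>Incomplete sequences</th><th>Redundancy reduction</th><th>Total in HMM</th></tr>
<tr><td>Archaea</td><td>5S</td><td>58</td><td>0</td><td>0</td><td>10</td><td>48</td></tr>
<tr><td>Archaea</td><td>16S</td><td>589</td><td>239</td><td>471</td><td>287</td><td>76</td></tr>
<tr><td>Archaea</td><td>23S</td><td>37</td><td>0</td><td>18</td><td>8</td><td>15</td></tr>
<tr><td>Bacteria</td><td>5S</td><td>461</td><td>0</td><td>0</td><td>101</td><td>360</td></tr>
<tr><td>Bacteria</td><td>16S</td><td>12 107</td><td>1429</td><td>10 723</td><td>2485</td><td>743</td></tr>
<tr><td>Bacteria</td><td>23S</td><td>398</td><td>0</td><td>155</td><td>130</td><td>127</td></tr>
<tr><td>Eukaryotes</td><td>5S</td><td>316</td><td>0</td><td>0</td><td>33</td><td>283</td></tr>
<tr><td>Eukaryotes</td><td>18S</td><td>6585</td><td>24</td><td>5222</td><td>836</td><td>979</td></tr>
<tr><td>Eukaryotes</td><td>28S</td><td>157</td><td>0</td><td>91</td><td>8</td><td>58</td></tr>
</table>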
Environmental samples were excluded due to lack of phylogenetic <strong>in</strong>formation. Sequences with too many unknown nucleotides <strong>in</strong> either end of the<br />
sequence were excluded to improve HMM accuracy. Redundancy reduction was performed to reduce bias. Note that these groups may overlap. The<br />
last column <strong>in</strong>dicates the number of sequences used to build each HMM.<br />
nucleotides at either end of the sequence. A summary of<br />
the number of sequences removed dur<strong>in</strong>g curation of the<br />
alignments is shown <strong>in</strong> Table 1.<br />
The software package HMMer (15) version 2.3.2 was<br />
used to create HMMs from alignments where all columns<br />
conta<strong>in</strong><strong>in</strong>g only gaps had been removed. It was configured<br />
for nucleotides, <strong>and</strong> to compensate for skews <strong>in</strong> the<br />
nucleotide distribution a custom null model for each<br />
alignment was used. Although redundancy reduction had<br />
been performed, the Henikoff position-based weighting<br />
scheme (16) was used to reduce any rema<strong>in</strong><strong>in</strong>g biases.<br />
When us<strong>in</strong>g the HMMs to search genome sequences,<br />
the default alignment method was used: a match must<br />
span the entire model, <strong>and</strong> several matches may be found<br />
with<strong>in</strong> one sequence.<br />
With the aim of <strong>in</strong>creas<strong>in</strong>g the search speed, we<br />
determ<strong>in</strong>ed the 75 most conserved consecutive columns<br />
<strong>in</strong> each alignment, as illustrated <strong>in</strong> Figure 1, <strong>and</strong> produced<br />
‘spotter’ HMMs based on these. S<strong>in</strong>ce searches with the<br />
smaller spotter models would be considerably faster,<br />
we wanted to <strong>in</strong>vestigate the possibility of us<strong>in</strong>g the<br />
spotter to pre-screen for c<strong>and</strong>idates, us<strong>in</strong>g the full HMMs<br />
only on regions surround<strong>in</strong>g the spotter hits. Spotter <strong>and</strong><br />
full model searches were done separately. Spotter <strong>and</strong> full<br />
model predictions were matched based on whether they<br />
had overlapp<strong>in</strong>g nucleotides on the same str<strong>and</strong>. A l<strong>in</strong>ear<br />
regression was used to express spotter score <strong>in</strong> terms of<br />
full model score. Variation was estimated as l<strong>in</strong>ear <strong>in</strong> full<br />
model score with non-positive regression coefficients.<br />
Least squares estimates were used <strong>in</strong> both cases. Spotter<br />
scores were assumed to be miss<strong>in</strong>g when negative <strong>and</strong>,<br />
hence, assumed to follow a truncated normal distribution;<br />
expected scores <strong>and</strong> square deviations were used to replace<br />
miss<strong>in</strong>g values <strong>in</strong> the two regressions. From this model, we<br />
computed the lowest full model score, T99, for which there<br />
was at least a 99% likelihood of gett<strong>in</strong>g a correspond<strong>in</strong>g<br />
spotter hit, <strong>and</strong> the likelihood, Pm<strong>in</strong>, that a full model hit<br />
with the lowest found score should have a correspond<strong>in</strong>g<br />
spotter hit.<br />
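The choice of the 75 most conserved consecutive columns can be illustrated with a short Python sketch. It computes the per-column information content used in Figure 1 (background frequency 1/4, ambiguous symbols divided evenly between the nucleotides they may represent, gaps divided between all four) and then slides a window over the alignment. The function names and the IUPAC lookup table are introduced here for illustration only, and treating U as T is an assumption.<br />

```python
import math

# IUPAC symbols mapped to the nucleotides they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "ACGT", ".": "ACGT",  # gaps are spread over all four nucleotides
}


def information_content(column):
    """C = sum_i f_i * log2(f_i / q_i) with q_i = 1/4, for one alignment column."""
    counts = dict.fromkeys("ACGT", 0.0)
    for symbol in column.upper():
        targets = IUPAC.get(symbol, "ACGT")
        for nuc in targets:
            counts[nuc] += 1.0 / len(targets)
    total = sum(counts.values())
    return sum((c / total) * math.log2((c / total) / 0.25)
               for c in counts.values() if c > 0)


def most_conserved_window(alignment, width=75):
    """alignment: list of equal-length rows.
    Returns (start column, mean information content) of the best window."""
    ncols = len(alignment[0])
    ic = [information_content("".join(row[i] for row in alignment))
          for i in range(ncols)]
    width = min(width, ncols)
    window = sum(ic[:width])
    best_start, best_sum = 0, window
    for start in range(1, ncols - width + 1):
        # Slide the window one column to the right.
        window += ic[start + width - 1] - ic[start - 1]
        if window > best_sum + 1e-12:
            best_start, best_sum = start, window
    return best_start, best_sum / width
```

The columns `best_start` to `best_start + width - 1` would then be cut out of the alignment and used to build the spotter HMM.<br />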
Both the full HMMs <strong>and</strong> the spotter HMMs were run<br />
on all fully sequenced genomes found <strong>in</strong> the Genome Atlas<br />
database (listed <strong>in</strong> Supplementary Table S1). All predictions<br />
with non-negative score <strong>and</strong> E-value at most 100<br />
were reported. Only full model hits with E-value below 0.01<br />
were accepted as reliable hits; the searches returned no hits with E-value<br />
between 0.01 and 100. As rRNAs within a<br />
genome tend to be very similar, usually with at least 99%<br />
identity, different full model hits with<strong>in</strong> a genome<br />
correspond<strong>in</strong>g to actual rRNAs should be expected to<br />
have similar scores. However, we found a substantial<br />
number of hits with far lower scores which we assume to<br />
be pseudogenes, truncated rRNAs or otherwise nonfunctional<br />
rRNA copies. To ensure that these did not have<br />
an adverse effect on the analyses, we excluded full model<br />
hits hav<strong>in</strong>g a score less than 80% of the maximal score<br />
<strong>in</strong> that genome. These are listed <strong>in</strong> Supplementary<br />
Table S2.<br />
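The two filters described above (the E-value cut-off and the exclusion of hits scoring below 80% of the best hit in the same genome) can be sketched as follows. The dictionary keys `genome`, `score` and `evalue` are hypothetical, and the sketch assumes that all hits passed in come from a single model, so that scores are comparable.<br />

```python
def reliable_hits(hits, max_evalue=0.01, min_fraction=0.8):
    """Keep hits that pass the E-value cut-off and reach at least
    min_fraction of the best score found in the same genome."""
    passed = [h for h in hits if h["score"] >= 0 and h["evalue"] < max_evalue]

    # Best score per genome among the hits that passed the E-value cut-off.
    best = {}
    for h in passed:
        genome = h["genome"]
        best[genome] = max(best.get(genome, h["score"]), h["score"])

    return [h for h in passed if h["score"] >= min_fraction * best[h["genome"]]]
```

Hits dropped by the second filter correspond to the putative pseudogenes and truncated copies mentioned in the text.<br />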
Annotations of rRNAs were obta<strong>in</strong>ed from GenBank.<br />
Unfortunately, rRNAs have not been annotated <strong>in</strong> a<br />
uniform manner <strong>and</strong> it was often unclear exactly what<br />
was annotated. In some cases, both the separate rRNAs<br />
and the full operon were annotated. In all such cases, the<br />
operons were longer than 5000 nt, <strong>and</strong> all annotations<br />
longer than that were thus excluded. In our experience,<br />
this affected only operons. In other cases, different pieces<br />
of the same gene had been annotated as separate entities.<br />
Thus, some predictions matched several annotation<br />
entries; these are listed <strong>in</strong> Supplementary Table S3. A<br />
prediction was considered to match an annotation if they<br />
were on the same str<strong>and</strong> <strong>and</strong> the length of their overlap<br />
was at least half the length of the shorter of the two; it was<br />
considered to be annotated if it matched at least one<br />
annotation. The deviation between annotated <strong>and</strong> predicted<br />
start <strong>and</strong> stop positions was also exam<strong>in</strong>ed, but<br />
predictions with multiple match<strong>in</strong>g annotations were<br />
excluded from this comparison.<br />
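The matching rule above (same strand, overlap of at least half the length of the shorter feature, annotations longer than 5000 nt ignored) can be written down compactly. The following is a sketch with hypothetical names; coordinates are assumed to be 1-based and inclusive.<br />

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int   # 1-based, inclusive
    stop: int    # 1-based, inclusive
    strand: str  # "+" or "-"


def length(f):
    return f.stop - f.start + 1


def overlap(a, b):
    """Number of positions shared by two features (0 if they are disjoint)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start) + 1)


def matches(pred, annot):
    """Same strand and overlap of at least half the shorter feature."""
    if pred.strand != annot.strand:
        return False
    return 2 * overlap(pred, annot) >= min(length(pred), length(annot))


def is_annotated(pred, annotations, max_len=5000):
    """A prediction is annotated if it matches at least one annotation
    that is not longer than max_len (longer entries are taken to be operons)."""
    return any(matches(pred, a) for a in annotations if length(a) <= max_len)
```

The same `matches` test, without the length limit, also expresses how spotter and full model predictions were paired: overlapping nucleotides on the same strand.<br />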
Additional analyses were performed for experimentally<br />
verified 16S <strong>in</strong> Anaplasma marg<strong>in</strong>ale St. Maries (M60313),<br />
Chlamydia muridarum Nigg (D85718), Escherichia coli<br />
K12 MG1655 (J01695), Sulfolobus tokodaii St. 7<br />
(AB022438), Thermus thermophilus HB8 (X07998) <strong>and</strong><br />
Nitrobacter hamburgensis X14 (L11663). <strong>Computational</strong><br />
speed was assessed on Mycoplasma capricolum ATCC 27343<br />
(CP000123), Solibacter usitatus Ellin6076 (CP000473) and<br />
Sargasso Sea data (AACY01000001-AACY01811372).<br />
All test searches reported were performed on an<br />
SGI Altix 3000 mach<strong>in</strong>e us<strong>in</strong>g one 1.3 GHz Itanium 2<br />
processor.
RESULTS<br />
The predictions of the full HMM models have been<br />
compared first aga<strong>in</strong>st annotations, then aga<strong>in</strong>st the<br />
spotter models.<br />
Full model predictions versus annotation<br />
As Table 2 shows, the predictors appeared to be better<br />
at detect<strong>in</strong>g bacterial rRNAs <strong>and</strong> less powerful for<br />
eukaryotic rRNAs. The highest accuracy was seen for<br />
the 16/18S rRNAs followed by the 23/28S. Two groups of<br />
rRNAs were particularly difficult to locate: the archaeal<br />
5S <strong>and</strong> the eukaryotic 18S. The miss<strong>in</strong>g archaeal 5S were<br />
all from four euryarchaeotic genomes which are all<br />
anaerobic methane producers. The eukaryotic 18S that<br />
the predictors could not f<strong>in</strong>d were all from two genomes,<br />
Guillardia theta <strong>and</strong> Plasmodium falciparum.<br />
Closer evaluation revealed that several annotated<br />
rRNAs that lacked a match<strong>in</strong>g prediction had actually<br />
been detected, but on the opposite str<strong>and</strong>. In eukaryotes,<br />
this was only seen with Arabidopsis thaliana 5S.<br />
In bacteria, most of the reverse predictions were 5S; <strong>in</strong><br />
archaea, they were predom<strong>in</strong>antly 16S <strong>and</strong> 23S. It should<br />
be noted that for all the reverse str<strong>and</strong> predictions<br />
the predicted start <strong>and</strong> stop positions agreed well<br />
with the annotation, <strong>in</strong>dicat<strong>in</strong>g that they have been<br />
annotated on the wrong str<strong>and</strong>. Annotated rRNAs<br />
that lacked match<strong>in</strong>g predictions <strong>in</strong> either direction are<br />
listed <strong>in</strong> Supplementary Table S4.<br />
Table 2 gives the number of predicted rRNAs that did<br />
not have a correspond<strong>in</strong>g annotation: putative novel<br />
rRNAs. About 70% of them were 5S rRNAs, <strong>and</strong> only a<br />
few were archaeal. In bacteria, most of the novel rRNAs<br />
were found in Firmicutes and Gammaproteobacteria,<br />
although it should be noted that these are<br />
the two dominant groups and contain the bulk of the<br />
currently sequenced bacterial genomes. Among the<br />
eukaryotes, only A. thaliana had novel rRNAs. The<br />
scores of the new rRNA predictions did not significantly<br />
differ from those that were annotated, indicating that<br />
these are true rRNAs not yet annotated. The 5S is often<br />
omitted in the rRNA annotation; since the eukaryotic 5S<br />
is usually separated from the 18-28S sequence, it might<br />
be less visible to annotators.<br />
Figure 1. The graphs show conservation in the alignments as measured by information content: $C = \sum_i f_i \log_2(f_i/q_i)$, where $i$ sums over the four<br />
nucleotides, $f_i$ is the frequency of nucleotide $i$ in the column and $q_i = 1/4$ is used as the background frequency. Ambiguous nucleotide symbols were<br />
evenly divided between the corresponding $f_i$, gaps between all four nucleotides. The grey line represents the value for each position in the alignment,<br />
the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt<br />
region of the alignment. Each panel plots information content (axis from 0.0 to 2.0) against position in the alignment.<br />
Panels: A, 5S, Archaea (n = 48); B, 16S, Archaea (n = 76); C, 23S, Archaea (n = 15);<br />
D, 5S, Bacteria (n = 360); E, 16S, Bacteria (n = 743); F, 23S, Bacteria (n = 127);<br />
G, 5S, Eukaryotes (n = 283); H, 18S, Eukaryotes (n = 979); I, 28S, Eukaryotes (n = 58).<br />
Start and stop deviations<br />
The differences between predicted and annotated start<br />
and stop positions are illustrated in Figure 2, which shows<br />
that they agree well. The medians of the start and stop<br />
prediction deviations were in most groups zero or very<br />
close to zero, with more than half of the deviations within 10 nucleotides.<br />
This was not the case for the eukaryotes.<br />
For eukaryotic 5S, only five genomes conta<strong>in</strong>ed<br />
predictions with match<strong>in</strong>g annotations. The predictions<br />
were uniform <strong>in</strong> length, whereas the annotations<br />
were more variable. The predictions that <strong>in</strong>dicated a<br />
substantially shorter 5S than annotated were all <strong>in</strong><br />
Schizosaccharomyces pombe: the average length of the<br />
annotations was 170 nt, whereas the correspond<strong>in</strong>g<br />
predictions were all 114 nt. For eukaryotic 18S, however,<br />
predicted start <strong>and</strong> stop positions were very accurate,<br />
although many annotated 18S were missed.
Table 2. The number of rRNAs annotated <strong>and</strong> predicted <strong>in</strong> the genomes that were exam<strong>in</strong>ed.<br />
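<table>
<tr><th>Kingdom</th><th>Type</th><th>Annotated</th><th>Same strand</th><th>Other strand</th><th>Not found</th><th>Full model predictions</th><th>Novel</th></tr>
<tr><td>Archaea (n = 27)</td><td>5S</td><td>56 (24)</td><td>43 (21)</td><td>1 (1)</td><td>12 (8)</td><td>47 (23)</td><td>4 (3)</td></tr>
<tr><td>Archaea (n = 27)</td><td>16S</td><td>47 (25)</td><td>45 (25)</td><td>2 (2)</td><td>0 (0)</td><td>47 (27)</td><td>2 (2)</td></tr>
<tr><td>Archaea (n = 27)</td><td>23S</td><td>47 (25)</td><td>44 (24)</td><td>2 (2)</td><td>1 (1)</td><td>46 (26)</td><td>2 (2)</td></tr>
<tr><td>Bacteria (n = 321)</td><td>5S</td><td>1205 (285)</td><td>1166 (285)</td><td>30 (16)</td><td>9 (5)</td><td>1339 (320)</td><td>173 (69)</td></tr>
<tr><td>Bacteria (n = 321)</td><td>16S</td><td>1172 (299)</td><td>1146 (299)</td><td>22 (12)</td><td>4 (4)</td><td>1237 (320)</td><td>91 (34)</td></tr>
<tr><td>Bacteria (n = 321)</td><td>23S</td><td>1197 (297)</td><td>1154 (291)</td><td>22 (13)</td><td>21 (12)</td><td>1248 (313)</td><td>94 (36)</td></tr>
<tr><td>Eukaryotes (n = 13)</td><td>5S</td><td>65 (7)</td><td>46 (6)</td><td>19 (1)</td><td>0 (0)</td><td>324 (9)</td><td>278 (5)</td></tr>
<tr><td>Eukaryotes (n = 13)</td><td>18S</td><td>13 (4)</td><td>6 (4)</td><td>0 (0)</td><td>7 (2)</td><td>13 (6)</td><td>7 (3)</td></tr>
<tr><td>Eukaryotes (n = 13)</td><td>28S</td><td>13 (5)</td><td>12 (4)</td><td>0 (0)</td><td>1 (1)</td><td>19 (7)</td><td>7 (3)</td></tr>
</table>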
The table gives the number of annotations, <strong>and</strong> splits this <strong>in</strong>to those match<strong>in</strong>g predictions on the same str<strong>and</strong>, on the other str<strong>and</strong>, <strong>and</strong> not found.<br />
The total number of full model predictions is given. Novel predictions are full model predictions not match<strong>in</strong>g any annotation on the same str<strong>and</strong>,<br />
<strong>and</strong> <strong>in</strong>clude those annotated on the other str<strong>and</strong>. Numbers <strong>in</strong> parentheses <strong>in</strong>dicate the number of genomes. It should be noted that the eukaryotic<br />
annotated count is somewhat uncertain due to ambiguous rRNA annotations. The analyzed genomes were taken from the Genome Atlas<br />
database, a database of all available fully sequenced genomes.<br />
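As a reading aid for Table 2, the fraction of annotated rRNAs recovered on the same strand can be computed per group. The counts below are copied from the table; the ratio itself is only an illustrative summary and is not a measure defined in the paper.<br />

```python
# (annotated, same strand, other strand, not found), copied from Table 2
TABLE2 = {
    ("Archaea", "5S"): (56, 43, 1, 12),
    ("Archaea", "16S"): (47, 45, 2, 0),
    ("Archaea", "23S"): (47, 44, 2, 1),
    ("Bacteria", "5S"): (1205, 1166, 30, 9),
    ("Bacteria", "16S"): (1172, 1146, 22, 4),
    ("Bacteria", "23S"): (1197, 1154, 22, 21),
    ("Eukaryotes", "5S"): (65, 46, 19, 0),
    ("Eukaryotes", "18S"): (13, 6, 0, 7),
    ("Eukaryotes", "28S"): (13, 12, 0, 1),
}


def same_strand_fraction(row):
    """Share of annotated rRNAs with a matching prediction on the same strand."""
    annotated, same, _other, _not_found = row
    return same / annotated


for (kingdom, rrna), row in TABLE2.items():
    # The three outcome columns partition the annotated count.
    assert row[0] == sum(row[1:])
    print(f"{kingdom:10s} {rrna:4s} {same_strand_fraction(row):.3f}")
```

Under this summary, the bacterial 16S group has the highest recovered fraction and the eukaryotic 18S group the lowest, in line with the description in the text.<br />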
For eukaryotic 28S, only two genomes had predictions<br />
with matching annotations. One of them, Encephalitozoon<br />
cuniculi, had stop positions predicted once 1112 nt and<br />
twice 4797 nt downstream of the annotation, whereas<br />
the start position was accurately predicted. In the<br />
other genome, Guillardia theta, the start positions were<br />
uniformly predicted 110 nt upstream of the annotated<br />
position, but with the stop position quite accurately<br />
predicted.<br />
Figure 2. Deviation of start and stop positions between predicted and annotated RNA is presented as pairs of panels. The number of predictions<br />
among the archaea, bacteria and eukaryotes are denoted beneath the panel group heading: 5S (43/1163/46), 16/18S (44/1146/6) and 23/28S (42/1150/9). The zero position in each panel corresponds to the<br />
annotation start or stop position with predicted positions presented relative to these. The yellow dot indicates the median deviation and the black<br />
box the quartile range. The hinges on the side of the box extend from the side of the box to the data point that is closest to, but does not exceed, 1.5<br />
times the interquartile range. The curves show the density of the distribution.<br />
S<strong>in</strong>ce rRNAs tend to be very similar with<strong>in</strong> a genome,<br />
predictions with<strong>in</strong> each genome generally had similar<br />
lengths. This similarity with<strong>in</strong> genomes as well as with<strong>in</strong><br />
groups of closely related genomes caused multiple peaks<br />
<strong>in</strong> the distributions of endpo<strong>in</strong>t deviations. An example<br />
of this can be seen <strong>in</strong> the bacterial 16S predictions where<br />
some of the predicted start <strong>and</strong> stop positions were<br />
clustered downstream of the annotation <strong>and</strong> where some<br />
of the predicted start positions were clustered upstream<br />
of the annotation. Some of the major contributors to<br />
the upstream peak <strong>in</strong> the start positions were different<br />
Streptococcus pyogenes stra<strong>in</strong>s, Bacillus genomes <strong>and</strong><br />
Yers<strong>in</strong>ia pestis genomes. These, <strong>in</strong> addition to<br />
Streptococcus agalactiae stra<strong>in</strong>s <strong>and</strong> Vibrio parahaemolyticus,<br />
were also prevalent <strong>in</strong> the stop position downstream<br />
peak. There was also a downstream peak <strong>in</strong> the<br />
start positions, <strong>and</strong> the genomes caus<strong>in</strong>g this peak were<br />
ma<strong>in</strong>ly Staphylococcus aureus, Bacillus cereus <strong>and</strong> several<br />
Escherichia coli relatives.<br />
Most of the start <strong>and</strong> stop deviations did not exceed<br />
100 nt. However, there were a few cases of deviations<br />
exceed<strong>in</strong>g 1000 nt, <strong>and</strong> these are not shown <strong>in</strong> the figure.<br />
This was the case for eukaryotic 28S and was mainly due<br />
to the three previously described stop positions predicted<br />
considerably downstream of the annotated stop position.<br />
In the two longer predictions from E. cuniculi, this was<br />
due to the HMM plac<strong>in</strong>g the latter 100 nt of the prediction<br />
further downstream to achieve a better score. Such <strong>in</strong>serts<br />
would most likely not appear when the spotter model is<br />
used first, s<strong>in</strong>ce the <strong>in</strong>serted sequence would be too long.<br />
To test this, a truncated version of the sequence was run<br />
through the predictor. The stop position was then<br />
accurately predicted. This phenomenon also expla<strong>in</strong>s<br />
some cases among the bacterial 16S predictions where the<br />
start position was placed very far upstream of the<br />
annotation. There were 27 rRNAs whose start<br />
position was predicted anywhere from 13 000 to<br />
40 000 nt upstream of the annotated start position. All<br />
but one of these were Firmicutes, mostly Streptococci <strong>and</strong><br />
Staphylococci. Closer study of the sequences revealed that<br />
the misplaced start position predictions were aga<strong>in</strong> due to<br />
long sequences be<strong>in</strong>g <strong>in</strong>serted near the start of the rRNA,<br />
<strong>in</strong>dicat<strong>in</strong>g that the first part of the HMM had been<br />
misplaced in the same manner as for Encephalitozoon cuniculi’s stop<br />
predictions. To test if these were the same k<strong>in</strong>d of <strong>in</strong>serts,<br />
a region end<strong>in</strong>g <strong>in</strong> the same place as the predictions but<br />
start<strong>in</strong>g 10 000 nt earlier was run through the full model<br />
predictor. This led to the bacterial 16S rRNAs be<strong>in</strong>g<br />
predicted with a deviation <strong>in</strong> start <strong>and</strong> stop positions on<br />
par with what was otherwise seen.<br />
Comparison to experimentally verified rRNAs<br />
Annotations were often ambiguous <strong>and</strong> considered<br />
unreliable. For discrepancies between annotations <strong>and</strong><br />
RNAmmer predictions, it is not a priori clear which of the<br />
two is correct. However, some genomes with experimentally<br />
verified rRNAs were selected to further assess the<br />
accuracy of start <strong>and</strong> stop predictions. The genomes<br />
we exam<strong>in</strong>ed were Anaplasma marg<strong>in</strong>ale Str. Maries,<br />
Chlamydia muridarum Nigg, Escherichia coli K12<br />
MG1655, Sulfolobus tokodaii Str. 7, Thermus thermophilus<br />
HB8 <strong>and</strong> Nitrobacter hamburgensis X14. These genomes<br />
all had complete 16S sequences accord<strong>in</strong>g to the NCBI<br />
database <strong>and</strong> had accompany<strong>in</strong>g literature which said that<br />
they were experimentally determ<strong>in</strong>ed. When check<strong>in</strong>g<br />
the positions of these rRNAs with BLAST aga<strong>in</strong>st the<br />
genome, some discrepancies were found. Due to this we<br />
used the BLAST results when compar<strong>in</strong>g annotated<br />
rRNAs to predictions.<br />
In total, there were 14 copies of the six 16S sequences,<br />
<strong>and</strong> all of them were found by our predictions. Stop<br />
predictions were more accurate than start predictions.<br />
In all but four cases, the start position was predicted<br />
to be 7 nt downstream of the annotated start position.<br />
In A. marginale and S. tokodaii, the start position was<br />
predicted to be the same as the annotation, and both of the<br />
two entries from C. muridarum were predicted to start 3 nt<br />
downstream of the annotated start position. In N. hamburgensis<br />
the start position was, in contrast to the other cases,<br />
predicted 7 nt upstream of the annotated start<br />
position. In all but three predictions, the stop position<br />
coincided with the annotation. In N.<br />
hamburgensis the predicted stop was 9 nt downstream,<br />
whereas in S. tokodaii and A. marginale the predicted<br />
stop was 1 nt downstream of the annotation. Thus,<br />
all predictions were with<strong>in</strong> 10 nt of the annotated start<br />
<strong>and</strong> stop positions.<br />
Comparison to RFAM<br />
RFAM is a database of RNA families which <strong>in</strong>corporates<br />
secondary structure <strong>in</strong> its analyses. We have made a<br />
comparison with the 5S rRNA predictions of<br />
RFAM (17,18) for a selection of twenty prokaryotic<br />
genomes listed <strong>in</strong> Supplementary Table S5. There were a<br />
total of 55 annotated 5S rRNAs in these genomes. RNAmmer<br />
found 53 of them, while 54 were found <strong>in</strong> RFAM. In three<br />
of the genomes, both methods predicted a 5S to with<strong>in</strong> a<br />
few nucleotides of the annotated position, but both placed<br />
it on the other str<strong>and</strong>. Both predictors identified three new<br />
5S rRNAs with<strong>in</strong> these genomes, <strong>and</strong> at approximately the<br />
same positions. Two of these new 5S rRNAs followed<br />
another annotated 5S rRNA, look<strong>in</strong>g like a t<strong>and</strong>em<br />
repeat. In most cases, both methods placed the start<br />
position a few nucleotides downstream of the annotation,<br />
whereas the stop position was more evenly distributed<br />
around the annotated position. RNAmmer generally<br />
predicted rRNAs to be shorter by a nucleotide or two<br />
than RFAM, usually at the start of the genes.<br />
Spotter pre-screening<br />
Table 3 shows that, with the exception of archaeal 5S,<br />
no full model hits were missed by the spotter model.<br />
Also, the spotter produced relatively few false positives,<br />
except for the eukaryotic 5S.<br />
Minimum, maximum, quantile and median scores for<br />
all the full model predictions are shown in Table 3, giving<br />
some indication of the range of scores that rRNAs can be<br />
expected to have. The table also includes the threshold T99<br />
and the likelihood Pmin which indicate that all full model<br />
predictions were expected to have corresponding spotter<br />
model predictions except some among the archaeal 5S.<br />
Based on the relatively stable lengths of the different<br />
types of rRNAs and the corresponding full model hits and<br />
the position of the spotter hit within them, we decided on<br />
window sizes around spotter model hits to use when the<br />
spotter model is used first. These were chosen to be 300 nt<br />
for the 5S rRNA, 5000 nt for the 16/18S and 9000 nt for<br />
the 23/28S. As these windows are roughly three times the length of the<br />
corresponding rRNAs, we consider rRNA sequences to be<br />
unlikely to extend beyond them.<br />
3106 Nucleic Acids Research, 2007, Vol. 35, No. 9<br />
Table 3. Evaluation of spotter and full model predictions.<br />
Kingdom | Type | Full | Spotter | FPS | Min | Q1 | Med | Q3 | Max | T99 | Pmin<br />
Archaea | 5S | 47 | 35 | 7 | 2.9 | 12.7 | 20.0 | 35.3 | 50.6 | 34.9 | 0.69<br />
Archaea | 16S | 47 | 47 | 0 | 1180.8 | 1891.9 | 1937.9 | 2004.0 | 2096.5 | 50 | 1.0<br />
Archaea | 23S | 46 | 46 | 1 | 2240.7 | 2714.1 | 2870.7 | 3155.3 | 3267.3 | 50 | 1.0<br />
Bacteria | 5S | 1339 | 1339 | 123 | 39.9 | 77.7 | 89.5 | 94.6 | 109.6 | 14.0 | 1.0<br />
Bacteria | 16S | 1237 | 1237 | 31 | 721.9 | 1905.5 | 1989.4 | 2058.7 | 2148.5 | 50 | 1.0<br />
Bacteria | 23S | 1248 | 1248 | 20 | 2502.8 | 3267.8 | 3586.5 | 3690.7 | 3876.1 | 50 | 1.0<br />
Eukaryotes | 5S | 324 | 324 | 251 | 43.9 | 51.1 | 53.9 | 74.3 | 82.2 | 50 | 1.0<br />
Eukaryotes | 18S | 13 | 13 | 14 | 625.3 | 625.3 | 1733.1 | 1777.5 | 1777.6 | 50 | 1.0<br />
Eukaryotes | 28S | 19 | 19 | 5 | 1434.2 | 2904.7 | 3225.0 | 3335.9 | 3380.9 | 50 | 1.0<br />
The columns Full, Spotter and FPS give the number of model predictions; Min, Q1, Med, Q3 and Max describe the full model scores.<br />
This table shows the total number of full model predictions, the number of spotter predictions that had matching full model predictions and the number of false<br />
positive spotter model predictions. The characteristics of the full model prediction score distributions are shown. FPS denotes the number of false<br />
positive spotter predictions. T99 refers to the lowest score a full model could have while still being detected with 99% probability by a spotter model<br />
with positive score. Pmin is the probability that a spotter with positive score would find a full model with the minimum score indicated. The lowest<br />
full model score can be used as a lower limit on which results could be expected to be real.<br />
Computational speed<br />
Searching Mycoplasma capricolum ATCC27343, about<br />
1 Mbp, for bacterial 16S took 14 minutes using the full<br />
HMM. Using the spotter to screen the sequence, then the<br />
full model on the spotter hits, reduced the time to<br />
16 seconds. Search times are expected to increase<br />
proportionally to the genome size; when using the spotter<br />
model to screen the sequence, search time will also<br />
increase with increasing number of spotter hits.<br />
Time differences between searching long and short<br />
sequences were examined by searching through the<br />
complete sequence of Solibacter usitatus Ellin6076, and<br />
through the Sargasso Sea environmental samples (19).<br />
Searching the S. usitatus genome, about 10 Mbp, took 48<br />
seconds per Mbp. Two copies from each rRNA family<br />
were found. The Sargasso Sea samples consisted of<br />
811 372 entries totaling over 800 Mbp. On this set the<br />
search speed was 407 seconds per Mbp. The article (19)<br />
accompanying this set indicated 1164 small subunit rRNA<br />
genes (16/18S) or fragments of genes; we found only 332,<br />
but our predictors are not able to find fragments of<br />
rRNAs. In addition, we found 562 5S and 68 23S<br />
sequences.<br />
DISCUSSION<br />
Our aim has been to enable high-throughput searches for<br />
rRNA while producing accurate and consistent predictions<br />
suitable for comparative analyses. For this purpose,<br />
we have developed the RNAmmer package which relies on<br />
HMMs for both speed and accuracy. HMMs were made<br />
using HMMer (15), which from a multiple alignment<br />
produces an HMM where match states represent columns<br />
with a specific nucleotide distribution, corresponding<br />
deletion states represent the possibility of gaps, and<br />
insertion states represent columns with large numbers of<br />
gaps; transition probabilities between the states indicate<br />
how likely each of the states is. HMMs thus differ from<br />
sequence alignments in that the likelihood of insertions<br />
and deletions may vary along the sequence. When<br />
searching a sequence with an HMM, the score indicates<br />
how well the sequence segment matches the model. The<br />
information content of a position, which reflects the<br />
nucleotide distribution and the likelihood of gaps,<br />
indicates how well that position is conserved. A good<br />
match to the HMM may come either from a highly<br />
conserved region which may well be short, or from a<br />
longer region with only weak conservation. We find both<br />
these cases. Bacterial 16S are detected despite almost half<br />
of the nucleotides being assigned to insert states, as other<br />
regions are highly conserved. For archaeal 23S, however,<br />
the information content of each position is low, but the<br />
sequence is long and there are few allowed insert states.<br />
These aspects can also explain cases of poor performance,<br />
both of the full model and of the spotter model.<br />
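The relation between column conservation, gaps and state type described above can be illustrated with a small sketch. It is not the HMMer procedure: the function names, the scaling of information content by column occupancy, and the 0.5 gap-fraction cutoff for calling a column an insert column are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # U is treated as T below (assumption for mixed RNA/DNA input)


def column_information(column):
    """Return (information content in bits, gap fraction) for one alignment column.

    Information is log2(4) minus the Shannon entropy of the observed
    nucleotides, scaled by the fraction of non-gap characters (illustrative choice).
    """
    residues = [c for c in column.upper().replace("U", "T") if c in ALPHABET]
    gap_fraction = 1.0 - len(residues) / len(column)
    if not residues:
        return 0.0, gap_fraction
    counts = Counter(residues)
    entropy = -sum((n / len(residues)) * math.log2(n / len(residues))
                   for n in counts.values())
    info = math.log2(len(ALPHABET)) - entropy
    return info * (1.0 - gap_fraction), gap_fraction


def alignment_profile(sequences, max_gap_fraction=0.5):
    """Label each column 'match' or 'insert' using a hypothetical gap cutoff."""
    profile = []
    for column in zip(*sequences):
        info, gaps = column_information("".join(column))
        state = "insert" if gaps > max_gap_fraction else "match"
        profile.append((info, gaps, state))
    return profile
```

A fully conserved, gap-free column scores 2 bits, while a column with all four nucleotides equally represented scores 0 bits, which mirrors the contrast drawn above between short, highly conserved regions and long, weakly conserved ones.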
The low information content in the eukaryotic 5S and<br />
18S alignments indicates that these sequences are more<br />
divergent than archaeal and bacterial 5S and 16S.<br />
In addition, 40% of the 5S and 75% of the 18S alignment<br />
give rise to insert states in the HMM. Thus, there is little<br />
for the HMM to recognize. In addition, many of the<br />
missed 18S rRNAs were from Cryptophyta, a phylum<br />
which makes up only 0.6% of the alignment data.<br />
The archaeal 5S show the same characteristics as the<br />
eukaryotic 5S and 18S, which most likely explains the low<br />
performance for these rRNAs. The scores for archaeal 5S<br />
hits were generally low, and the spotter score comes only<br />
from a 75 nt part of the sequence, giving it an even lower score<br />
and causing it to miss 12 of the full model hits. It is notable,<br />
however, that these were the only cases missed by the<br />
spotter model: with the exception of archaeal 5S, our<br />
analyses show that the spotter should be able to detect<br />
rRNAs unless they are much further diverged than what<br />
we find in our data.<br />
Columns at the beginning and end of the multiple<br />
alignments often have low conservation and many gaps.<br />
Such columns are generally accommodated into the<br />
HMM as insert states, but HMMer ignores them at the<br />
beginning and end of the alignment. An example is the 5S,<br />
where match states stop around 10 columns from the<br />
end of the alignments, effectively causing the HMM to<br />
predict the last conserved nucleotide of the consensus<br />
sequence rather than the stop of the rRNAs. Hence, it is<br />
not uncommon for the stop position of the 5S to be<br />
predicted up to 10 nt downstream of the annotated stop<br />
position.<br />
These effects can also explain the endpoint accuracy<br />
that was seen when we compared our results to<br />
experimentally determined 16S sequences. We tried to<br />
find sequences where the ends had been experimentally<br />
verified by RACE or PCR, but such rRNAs proved<br />
difficult to find. All the ones we selected were sequenced,<br />
but it is uncertain to what extent the authors had<br />
tried to determine the ends. These experimentally<br />
found rRNAs did show better agreement with annotation<br />
than predictions in general, although this is not sufficient<br />
to conclude that our predictions are more accurate. Our<br />
stop predictions were very accurate, but more deviation<br />
was seen in the start predictions. These results could reflect<br />
more variation in the beginning of the alignments, which<br />
as in the 5S case could effectively cause the HMM to<br />
predict the first conserved nucleotide of the consensus<br />
sequence rather than the start of the rRNAs.<br />
In some cases, larger endpoint deviations occur. This<br />
can happen when one of the ends of the model finds a<br />
better match in a different part of the sequence. Insertion<br />
states sometimes allow the HMM to insert long gap<br />
regions and thus find a matching stop position far from<br />
the rest of the sequence. As shown for the bacterial 16S<br />
sequences that displayed this phenomenon, this is less of a<br />
problem when the spotter model is employed. The window<br />
searched around the spotter hit would most likely be too<br />
short to accommodate such an insert, and the model<br />
would match with the proper sequence.<br />
For fragmented rRNAs, long gap regions may be<br />
correctly predicted. This was seen for Coxiella burnetii 23S<br />
where our prediction has the same start position<br />
as annotated, but where the predicted stop position<br />
is 1884 nt downstream of GenBank’s stop position.<br />
However, according to Entrez Gene, this rRNA appears<br />
in four pieces and with the same stop position as ours,<br />
suggesting that in some cases ‘too long’ predictions might<br />
actually be correct. These cases should normally not be<br />
masked when using the spotter unless inserts between the<br />
fragments would make it exceed the window size.<br />
The HMM produced by HMMer requires time of order<br />
O(NM) to search a sequence of length N using a model<br />
with M states, M being proportional to the length of the<br />
multiple alignment. However, the speed is increased by<br />
using a 75 nt long spotter model to pre-screen the<br />
sequence, which requires time of order O(N), and then<br />
running the full HMM on windows around each spotter<br />
hit, which requires time of order O(KM²) for K spotter<br />
hits and window size proportional to M. The benefit of<br />
using the spotter is clearly illustrated in the M. capricolum<br />
searches. However, the time difference between the<br />
S. usitatus and the Sargasso Sea data searches shows<br />
that the spotter might lose its advantage when dealing with<br />
many shorter sequences.<br />
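The two-stage search described above can be sketched as follows. This is an illustration only, not RNAmmer's code: the window sizes are those given in the Results, but centring each window on the spotter hit, merging overlapping windows, the function names and the simple cost model (one unit of work per sequence position per model state) are assumptions.

```python
# Window sizes in nt around spotter hits, as stated in the Results.
WINDOW_NT = {"5S": 300, "16S": 5000, "23S": 9000}


def windows_around_hits(hit_centers, seq_len, window_nt):
    """Return merged half-open (start, end) windows centred on spotter hits.

    Centring and merging are illustrative assumptions.
    """
    half = window_nt // 2
    spans = sorted((max(0, c - half), min(seq_len, c + half)) for c in hit_centers)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def relative_cost(seq_len, n_hits, model_states, window_nt, spotter_states=75):
    """Return (full-search cost, spotter-then-full cost) in arbitrary work units.

    Full search: N * M. Screened search: N * 75 + K * window * M.
    """
    full = seq_len * model_states
    screened = seq_len * spotter_states + n_hits * window_nt * model_states
    return full, screened
```

With few hits in one long sequence the second term stays small, so the screened search is dominated by the cheap O(N) pass; with many short sequences that each produce a hit, the window term can approach the cost of searching everything with the full model.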
There are other approaches to predicting non-coding<br />
RNA. One commonly used method is sequence alignment,<br />
e.g. BLAST (3), Paralign (20) or FASTA (21). Another is<br />
based on structure-sensitive Stochastic Context Free<br />
Grammars (SCFG) (22) which form the basis of the<br />
tRNA prediction program tRNAscan-SE (23) and of<br />
Infernal (24), which is used when creating RFAM. While<br />
the sequence alignment methods are very fast, they are not<br />
particularly suited for prediction of non-coding RNA (1).<br />
Infernal, however, has a general worst case running time<br />
of order O(MN³), which is prohibitive. The RFAM<br />
database (17,18), which includes 5S and the 5′ domain<br />
of 16S, uses BLAST to pre-screen genome sequences,<br />
followed by Infernal; despite a more efficient approach<br />
than the general SCFG, it does not analyze the entire 16S.<br />
A search for 5S in a 1 Mbp genome using Infernal took<br />
4 hours 45 minutes: almost 1000 times as much as the<br />
16 seconds used by RNAmmer for the much larger 16S<br />
model. A time-saving approach to SCFGs could be to use<br />
the RaveNna (25) package which can convert an RFAM<br />
SCFG to an HMM. This drastically reduces the running<br />
time; however, its usefulness would be limited since no<br />
models for the larger rRNAs are available. Another factor<br />
is that the 5S found by RaveNna (26) which were not<br />
already in RFAM were all in organellar sequences,<br />
sequences not analyzed by RNAmmer. For further<br />
comparisons and comments on these different methods,<br />
we refer to (1).<br />
The RNAmmer program is available as a traditional<br />
HTML-based prediction server at<br />
http://www.cbs.dtu.dk/services/RNAmmer as well as through a SOAP-based<br />
web service. It is also available for download through<br />
the same site.<br />
SUPPLEMENTARY DATA<br />
Supplementary Data is available at NAR online.<br />
ACKNOWLEDGEMENTS<br />
We are grateful for funding from EMBIO at the<br />
University of Oslo, the Research Council of Norway<br />
and the Danish Center for Scientific Computing. This work was<br />
also supported by a grant from the European Union<br />
through the EMBRACE Network of Excellence, contract<br />
number LSHG-CT-2004-512092. We would also like to<br />
thank our colleagues for critical reading of the manuscript.<br />
Funding to pay the Open Access publication charge<br />
was provided by the Research Council of Norway.<br />
Conflict of interest statement. None declared.<br />
REFERENCES<br />
1. Freyhult,E., Bollback,J. and Gardner,P. (2007) Exploring genomic<br />
dark matter: a critical assessment of the performance of homology<br />
search methods on noncoding RNA. Genome Res., 17, 117–125.<br />
2. Pedersen,A., Jensen,L., Brunak,S., Staerfeldt,H. and Ussery,D.<br />
(2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol.,<br />
299, 907–930.<br />
3. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990)<br />
Basic local alignment search tool. J. Mol. Biol., 215, 403–10.<br />
4. Wimberly,B., Brodersen,D., Clemons,W. Jr., Morgan-Warren,R.,<br />
Carter,A., Vonrhein,C., Hartsch,T. and Ramakrishnan,V. (2000)<br />
Structure of the 30s ribosomal subunit. Nature, 407, 327–339.<br />
5. Schluenzen,F., Tocilj,A., Zarivach,R., Harms,J., Gluehmann,M.,<br />
Janell,D., Bashan,A., Bartels,H., Agmon,I. et al. (2000) Structure<br />
of functionally activated small ribosomal subunit at 3.3 angstroms<br />
resolution. Cell, 102, 615–623.<br />
6. Nissen,P., Hansen,J., Ban,N., Moore,P. and Steitz,T. (2000)<br />
The structural basis of ribosome activity in peptide bond synthesis.<br />
Science, 289, 920–930.<br />
7. Yusupov,M., Yusupova,G., Baucom,A., Lieberman,K., Earnest,T.,<br />
Cate,J. and Noller,H. (2001) Crystal structure of the ribosome at<br />
5.5 Å resolution. Science, 292, 883–896.<br />
8. Srivastava,A. and Schlessinger,D. (1991) Structure and organization<br />
of ribosomal DNA. Biochimie, 73, 631–638.<br />
9. Acinas,S., Marcelino,L., Klepac-Ceraj,V. and Polz,M. (2004)<br />
Divergence and redundancy of 16s rRNA sequences in genomes<br />
with multiple rrn operons. J Bacteriol, 186, 2629–2635.<br />
10. Jackson,S., Cannone,J., Lee,J., Gutell,R. and Woodson,S. (2002)<br />
Distribution of rRNA introns in the three-dimensional structure<br />
of the ribosome. J Mol Biol, 323, 35–52.<br />
11. Evguenieva-Hackenberg,E. (2005) Bacterial ribosomal RNA in<br />
pieces. Mol Microbiol, 57, 318–325.<br />
12. Wuyts,J., Perriere,G. and Van De Peer,Y. (2004) The European<br />
ribosomal RNA database. Nucleic Acids Res, 32 Database issue,<br />
D101–D103.<br />
13. Szymanski,M., Barciszewska,M., Erdmann,V. and Barciszewski,J.<br />
(2002) 5s Ribosomal RNA database. Nucleic Acids Res., 30, 176–178.<br />
14. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection<br />
of representative protein data sets. Protein Sci., 1, 409–417.<br />
15. Eddy,S. (1998) Profile hidden Markov models. Bioinformatics, 14,<br />
755–763.<br />
16. Henikoff,S. and Henikoff,J. (1994) Position-based sequence weights.<br />
J. Mol. Biol., 243, 574–578.<br />
17. Griffiths-Jones,S., Moxon,S., Marshall,M., Khanna,A., Eddy,S.<br />
and Bateman,A. (2005) Rfam: annotating non-coding RNAs in<br />
complete genomes. Nucleic Acids Res., 33 Database Issue,<br />
D121–D124.<br />
18. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and<br />
Eddy,S. (2003) Rfam: an RNA family database. Nucleic Acids Res.,<br />
31, 439–441.<br />
19. Venter,J., Remington,K., Heidelberg,J., Halpern,A., Rusch,D.,<br />
Eisen,J., Wu,D., Paulsen,I., Nelson,K. et al. (2004) Environmental<br />
genome shotgun sequencing of the Sargasso Sea. Science, 304,<br />
66–74.<br />
20. Rognes,T. (2001) ParAlign: a parallel sequence alignment algorithm<br />
for rapid and sensitive database searches. Nucleic Acids Res, 29,<br />
1647–1652.<br />
21. Pearson,W. and Lipman,D. (1988) Improved tools for biological<br />
sequence comparison. Proc. Natl. Acad. Sci. USA, 85, 2444–2448.<br />
22. Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (2000)<br />
Biological Sequence Analysis: Probabilistic Models of Proteins and<br />
Nucleic Acids. Cambridge University Press.<br />
23. Lowe,T. and Eddy,S. (1997) tRNAscan-SE: a program for<br />
improved detection of transfer RNA genes in genomic sequence.<br />
Nucleic Acids Res., 25, 955–964.<br />
24. Eddy,S. (2002) A memory-efficient dynamic programming algorithm<br />
for optimal alignment of a sequence to an RNA secondary<br />
structure. BMC Bioinformatics, 3, 18.<br />
25. Weinberg,Z. and Ruzzo,W. (2006) Sequence-based heuristics for<br />
faster annotation of non-coding RNA families. Bioinformatics, 22(1).<br />
26. Weinberg,Z. and Ruzzo,W.L. (2004) In RECOMB 04: Proceedings of<br />
the Eighth Annual International Conference on Computational<br />
Molecular Biology, ACM Press, pp. 243–251.<br />
rRNA operons and promoter analysis<br />
3.9 Paper VII: GeneWiz browser: An Interactive Tool for<br />
Visualizing Sequenced Chromosomes<br />
Standards in Genomic Sciences (2009) 1: 204-215 DOI:10.4056/sigs.28177<br />
GeneWiz browser: An Interactive Tool for Visualizing<br />
Sequenced Chromosomes<br />
Peter F. Hallin 1, Hans-Henrik Stærfeldt 1, Eva Rotenberg 1,2, Tim T. Binnewies 1,3, Craig J.<br />
Benham 4, and David W. Ussery 1<br />
1 Center for Biological Sequence Analysis, Department of Systems Biology, The Technical<br />
University of Denmark, 2800 Kgs. Lyngby, Denmark.<br />
2 Lersoe Parkalle 37, 2TV, 2100 Copenhagen, Denmark<br />
3 Roche Diagnostics Ltd., CH-6343 Rotkreuz, Switzerland<br />
4 UC Davis Genome Center, University of California, Davis, California, U.S.A.<br />
We present an interactive web application for visualizing genomic data of prokaryotic chromosomes.<br />
The tool (GeneWiz browser) allows users to carry out various analyses such as<br />
mapping alignments of homologous genes to other genomes, mapping of short sequencing<br />
reads to a reference chromosome, and calculating DNA properties such as curvature or stacking<br />
energy along the chromosome. The GeneWiz browser produces an interactive graphic<br />
that enables zooming from a global scale down to single nucleotides, without changing the<br />
size of the plot. Its ability to disproportionally zoom provides optimal readability and increased<br />
functionality compared to other browsers. The tool allows the user to select the display<br />
of various genomic features, color setting and data ranges. Custom numerical data can<br />
be added to the plot allowing, for example, visualization of gene expression and regulation<br />
data. Further, standard atlases are pre-generated for all prokaryotic genomes available in<br />
GenBank, providing a fast overview of all available genomes, including recently deposited<br />
genome sequences. The tool is available online from<br />
http://www.cbs.dtu.dk/services/gwBrowser. Supplemental material including interactive atlases<br />
is available online at http://www.cbs.dtu.dk/services/gwBrowser/suppl/.<br />
Introduction<br />
The development of fast and inexpensive genome<br />
sequencing technologies has led to the generation<br />
of vast amounts of genomic information. As genomic<br />
sequencing becomes both more powerful<br />
and affordable, the handling and analysis of the<br />
generated data produce novel challenges and<br />
shift the focus away from the discovery process<br />
towards technical considerations of handling,<br />
storing and analyzing sequence data. An important<br />
step when exploring a new genome is to compare<br />
it to existing sequences, in order to identify<br />
both novel and conserved features. Many automated<br />
computational methods are available that<br />
attempt to derive protein function from sequence<br />
[1-3]. In a metagenomic study by Harrington and<br />
co-workers it was estimated that 76% of the examined<br />
protein coding genes could be assigned a<br />
function. However, to assess predictions for individual<br />
genes, visualization remains critical to<br />
provide the biologist with an overview of the genomic<br />
context. Are genes of interest situated in<br />
clusters? In operons? How are they regulated?<br />
How does their DNA base composition compare<br />
with that of the rest of the genome? In order to<br />
display such features both on a genome scale and<br />
in close-up down to the level of nucleotides, we<br />
developed the GeneWiz browser which is based<br />
on the ‘Genome Atlas’ concept [4,5]. This tool can<br />
also display local DNA structural properties, so<br />
that regulatory or repeat regions can easily be<br />
identified and interpreted in a chromosomal context.<br />
During development of the GeneWiz browser, it<br />
became apparent that novel sequencing technology<br />
creates a further demand. The current generation<br />
of sequencing instruments utilizes primed<br />
synthesis in flow cells to simultaneously obtain<br />
the sequences of millions of different DNA templates,<br />
an approach that changed the field of DNA<br />
sequencing [6,7]. Flow sequencing, also known as<br />
sequencing by synthesis (SBS) on a solid surface,<br />
tracks nucleotides as they are added to a growing<br />
DNA strand [8]. SBS is used by high-throughput<br />
sequencing systems which have become commercially<br />
available in the past two years. Examples<br />
include the GS Titanium sequencer (commercialized<br />
by 454/Roche); the Genome Analyser GA-II<br />
(Solexa/Illumina); and the SOLiD 3 system (Applied<br />
Biosystems).<br />
These developments have increased the speed of<br />
sequencing while significantly reducing its cost<br />
[9,10]. This much higher throughput provides<br />
greater coverage, but at the cost of much shorter<br />
read-lengths: from 50 bases with SOLiD 3 to 75<br />
bases with Illumina GA II. Even reads of 500 bases<br />
obtained with the 454-Titanium are still shorter<br />
than read lengths typically obtained using the<br />
Sanger method [9,11]. The output from modern<br />
high-throughput sequencing equipment challenges<br />
the assembly software by generating shorter and<br />
ambiguous reads. Processing of this flood of sequence<br />
data has rapidly become a bottleneck, and<br />
developing the necessary skills and tools will most<br />
likely be a driving factor in the execution of<br />
second-generation sequencing [12]. As a first step<br />
in this development, it needs to be determined to<br />
what extent assembly of short-read sequences can<br />
be trusted, an assessment for which the GeneWiz<br />
browser can also be used.<br />
Methods<br />
Our method of visualization uses color-encoded<br />
lanes to display numerical information<br />
on a genome atlas similar to GeneWiz [4,5]. The<br />
color encoding can be done either using a linear<br />
scale with a fixed minimum and maximum range,<br />
or a dynamic scale of standard deviations. Using<br />
the latter, color intensity decreases as data approach<br />
average values, thereby emphasizing regions<br />
of significant variation. The web interface is<br />
divided into four optional sections, to address<br />
various biological viewpoints of chromosomes: 1)<br />
DNA properties; 2) mapping of homologous genes<br />
by BLAST; 3) mapping of short sequencing reads; 4)<br />
custom lanes such as Single Nucleotide Polymorphism<br />
(SNP) or microarray data. The output of<br />
each method is a numerical vector of length corresponding<br />
to that of the reference sequence, and<br />
the methods used for this construction are described<br />
in detail below.<br />
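The two color-encoding modes described above can be sketched as follows. This is an illustration under stated assumptions, not the GeneWiz browser's implementation: the function names, the signed intensity range of -1 to 1, and the cap of three standard deviations (`max_sd`) are hypothetical choices.

```python
import statistics


def linear_intensity(values, vmin, vmax):
    """Map values to intensities in [0, 1] on a fixed linear scale."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    span = vmax - vmin
    return [min(1.0, max(0.0, (v - vmin) / span)) for v in values]


def stddev_intensity(values, max_sd=3.0):
    """Map values to signed intensities in [-1, 1] by distance from the mean.

    Values near the average get intensity close to 0; values at or beyond
    max_sd standard deviations (an assumed cap) saturate at -1 or 1.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [max(-1.0, min(1.0, (v - mean) / (sd * max_sd))) for v in values]
```

On the dynamic scale a lane of nearly constant values renders almost blank, so only positions that deviate strongly from the chromosomal average stand out.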
Read quality assessment<br />
Gene duplications, rRNA operons and other repetitive<br />
chromosomal regions are known to cause<br />
difficulties during the assembly of short reads [13].<br />
To assess the degree of ambiguity of sequencing<br />
reads, a method was developed that derives the<br />
uniqueness of all reads, accounting for both the<br />
read quality and the match to the reference genome.<br />
Sequence reads from Illumina and 454 are reported<br />
with base qualities: a per-nucleotide measure<br />
that denotes the credibility of the base calls. A<br />
method was derived which condenses these qualities<br />
into values per position in the reference genome<br />
and calculates the following information:<br />
uniqueness-weighted quality, information content,<br />
sequence agreement, and repeat-weighted coverage<br />
(see methods). These estimates provide a<br />
preliminary overview of regions that may appear<br />
problematic to assemble. In general, low uniqueness<br />
is found in the gaps between the assembled<br />
contigs generated by the default assembly tools<br />
from a given sequence dataset, as will be demonstrated<br />
below. A high score of uniqueness-weighted<br />
quality indicates that the base is uniquely<br />
identified by a read and that it has a high base<br />
quality in that read. The approach is illustrated in<br />
Figure 1.<br />
From the mapping, five different parameters were calculated, which together summarize the trustworthiness of the reads given the assembly:<br />
Weighted coverage. Under the assumption that all reads map only once (Hr=1), the coverage c(i) can be calculated as the number of alignments R mapped at position i. A weighted coverage c’(i)=wr,h (see equation below) is used to correct for the higher coverage artificially introduced by repeats:<br />
http://st<strong>and</strong>ards<strong>in</strong>genomics.org 205
GeneWiz browser<br />
Figure 1 | Mapping reads to a reference genome accounting for uniqueness. In step 1, each read is aligned against the reference genome. In step 2, the quality of each read is weighted according to the uniqueness of the hit. A read giving rise to two hits S1 and S2 in the reference genome will be weighted proportionally to the relative alignment scores; if the scores are identical, the mappings of S1 and S2 will each be given a weight of w=0.5 (see equation below). Step 3 maps the weighted qualities back to the reference genome so that each genomic position contains an array of weighted qualities. Once all reads are mapped, in step 4 only the maximum weighted quality value is kept and, in step 5, the maximum weighted quality scores are color coded to reveal regions of low uniqueness.<br />
Uniqueness-weighted quality. This measure corresponds to the base qualities obtained from the reads that are mapped to the reference genome, weighted by the uniqueness of the read. Consider read r, which has a quality profile qr(i), where i is the position in the read. The read is aligned to the reference genome by BLAST, and all Hr hits are included when the following criteria are met: the BLAST score Sh of hit h is greater than or equal to S0 (optionally provided by the user); Sh ≥ S1 · x, where S1 is the score of the first/best hit and x ∈ [0;1] is a constant provided by the user; and the E-value is equal to or less than a threshold specified by the user. The following formula is used to derive the weighted quality q’r(i):<br />
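The equation itself did not survive in this copy. A reconstruction that is consistent with Figure 1 (weights proportional to the relative alignment scores, so that two hits with identical scores each receive w=0.5) is sketched below. Normalizing by the sum of scores over the Hr accepted hits is an assumption, not a quotation of the original formula, and aln(r,h) is a hypothetical symbol for the set of reference positions covered by hit h of read r:<br />

```latex
w_{r,h} = \frac{S_h}{\sum_{k=1}^{H_r} S_k}, \qquad
q'_r(i) = w_{r,h}\, q_r(i), \qquad
c'(i) = \sum_{(r,h)\,:\; i \,\in\, \mathrm{aln}(r,h)} w_{r,h}
```

Under this reading, a read with a single hit (Hr=1) has w=1, so that c’(i) reduces to the plain coverage c(i).<br />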
From all the q’r(i) values obtained at each position in the genome, the maximum uniqueness-weighted quality is chosen when all reads have been mapped.<br />
Information content provides a number in bits of information [14] representing to what degree the reads agree: zero bits means equal distribution of A, T, G and C at a given position and 2 bits means complete conservation of a single base. The value is plotted on a color scale whereby low information (random distribution, least expected) is given in dark colors, and high information (high conservation, most expected) as light or neutral color. This measure may be useful for visualizing single nucleotide polymorphisms.<br />
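As an illustration of the information content measure, the following minimal sketch computes the value for one reference position as 2 bits minus the Shannon entropy of the observed bases. This matches the two end points stated in the text (0 bits for an even mix of A, T, G and C, 2 bits for a single conserved base), but the function name, the handling of positions without reads and the absence of any small-sample correction are assumptions rather than the published implementation:<br />

```python
import math
from collections import Counter


def information_content(bases):
    """Return the information in bits (0 to 2) for the bases seen at one position.

    `bases` is a string with one character per read covering the position.
    Characters other than A, C, G and T are ignored.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        # Assumption: positions without usable reads carry no information.
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - entropy


if __name__ == "__main__":
    print(information_content("AAAAAAAA"))  # fully conserved position
    print(information_content("ACGTACGT"))  # even mix of all four bases
    print(information_content("AAAAAAAG"))  # possible SNP or sequencing error
```

A position where most reads agree but a minority carries another base scores between the two extremes, which is why the lane can help to spot candidate polymorphisms.<br />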
206 St<strong>and</strong>ards <strong>in</strong> Genomic Sciences
Read absence. A boolean where ‘one’ indicates complete absence of aligned reads.<br />
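The per-position lanes described above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes ungapped alignments given as hypothetical (start, score) pairs that have already passed the score and E-value filters, and it assumes that the weight of a hit is its score divided by the sum of scores of all accepted hits of the same read, which agrees with the w=0.5 example in Figure 1:<br />

```python
def hit_weights(scores):
    """Weight each hit by its share of the summed alignment scores."""
    total = sum(scores)
    return [s / total for s in scores]


def build_lanes(genome_length, reads):
    """Compute weighted coverage, maximum weighted quality and read absence.

    `reads` is a list of (qualities, hits) tuples, where `qualities` holds one
    base quality per read position and `hits` is a list of (start, score)
    pairs with 0-based start coordinates on the reference.
    """
    coverage = [0.0] * genome_length
    max_quality = [0.0] * genome_length
    for qualities, hits in reads:
        weights = hit_weights([score for _, score in hits])
        for (start, _), w in zip(hits, weights):
            for offset, q in enumerate(qualities):
                pos = start + offset
                if pos >= genome_length:
                    break  # ignore bases that run past the reference end
                coverage[pos] += w
                max_quality[pos] = max(max_quality[pos], w * q)
    absence = [1 if c == 0 else 0 for c in coverage]
    return coverage, max_quality, absence


if __name__ == "__main__":
    reads = [
        ([30, 30, 30], [(0, 50.0)]),           # unique read, weight 1
        ([40, 40], [(2, 20.0), (6, 20.0)]),    # repeat read, weight 0.5 per hit
    ]
    cov, qual, absent = build_lanes(10, reads)
    print(cov)
    print(qual)
    print(absent)
```

In the example, the read that maps to two places contributes only half a unit of coverage and half of its base quality at each location, so repeats no longer inflate the coverage lane.<br />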
Visualization of whole-genome homology<br />
The BLASTatlas method [15] derives a map of per-nucleotide numbers on a reference genome to visualize the matches in the alignment between the reference genome and a query. The query can constitute any number of genomic contigs, scaffolds, full genomes, or collections thereof. This provides a method to identify regions of a reference genome that are conserved throughout multiple samples, as well as those that are unique. The BLASTatlas method is integrated into the GeneWiz browser software to provide a user-friendly interface. According to the BLAST algorithm chosen (blastp, blastn, tblastn, or blastx), DNA or protein sequences of the reference are aligned with the best match in the query. The alignment is then mapped back to the reference genome. A match adds a 'one' whereas a mismatch adds a 'zero' at each position along the chromosome. These ones and zeros translate into smooth color zones due to binning.<br />
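The effect of binning on the match lane can be illustrated with a small sketch. It assumes, for simplicity, that the alignment has already been reduced to hypothetical half-open intervals of matching reference positions; the function names and the use of a plain mean per bin are illustrative choices, not the published implementation:<br />

```python
def match_vector(genome_length, matched_intervals):
    """Return a 0/1 list with 1 at every reference position inside a match.

    `matched_intervals` holds half-open (start, end) pairs, 0-based.
    """
    vector = [0] * genome_length
    for start, end in matched_intervals:
        for pos in range(max(start, 0), min(end, genome_length)):
            vector[pos] = 1
    return vector


def bin_means(values, bin_size):
    """Average consecutive bins; the last bin may be shorter than bin_size."""
    means = []
    for i in range(0, len(values), bin_size):
        chunk = values[i:i + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means


if __name__ == "__main__":
    lane = match_vector(10, [(2, 6)])
    print(lane)
    print(bin_means(lane, 4))
```

Each bin then holds the fraction of matching positions it covers, so a bin straddling the edge of a conserved region receives an intermediate value and hence an intermediate color.<br />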
DNA properties and DNA destabilization<br />
Through the web interface it is currently possible to select from 36 different nucleotide composition and DNA structural properties [4,5,16-22]. In addition, calculations of so-called SIDD energy estimates are provided, offering an approximation of promoter regions. This method estimates the free energy required to open the DNA helix, calculated at superhelix densities of sigma = -0.035, -0.044, and -0.055 using the SIDD algorithm [23]. All of these parameters can be applied in any combination to any of the prokaryotic genomes available from the web interface, or to a custom sequence provided by the user. Alternatively, the parameters may be applied as collections forming 8 standard atlases: Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, the Repeat-atlas, and finally the SIDD atlas, which is introduced in this manuscript (Figure 3).<br />
Figure 3 | Configuration and references for pre-defined groups of DNA sequence and structural properties: Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, Repeat-, and SIDD-atlas.<br />
Custom data<br />
A designated section of the GeneWiz browser is assigned for custom data. It allows the user to provide a per-nucleotide list of numerical values along with a desired color and data range. Although not presented here, this allows for visualization of additional information such as microarray data that have been pre-processed by the user, by mapping gene expression, regulation change, or p-values back to genomic coordinates. In addition to the main genome annotation covering CDSs, tRNAs, and rRNAs, the user may specify miscellaneous and pseudogene annotations separately. A button allows the selected reference genomes to be queried against a replicate of pseudogenes.org [24]. Other annotations of possible pseudogenes can be added, such as GenePRIMP output (geneprimp.jgi-psf.org/).<br />
Dynamic visualization<br />
The GeneWiz browser allows dynamic disproportional zooming, meaning that zooming occurs nearly instantly when requested by the user, by redrawing all components (tracks, legends, marks and text) for every view. This allows the browser to scale the plot to make use of the entire plotting area, by not rescaling all parts of the plot equally. For example, zooming 10x will stretch a data lane 10-fold along the genome position axis, whereas the lane height and the distance to the neighboring lane remain constant. The dynamic nature of the GeneWiz browser requires pre-binning of data for each zoom level, all of which are stored on a central server; for improved efficiency, only the data requested by the user are sent. Storing per-nucleotide information as table records in a database (e.g. MySQL) proved unfeasible, as the number of records per genome runs into the millions and the construction of indexes would be very time consuming. Instead, a memory mapping technique was chosen that allows the server to obtain the values directly from binary files when provided with the zoom window and level, for any chromosome in the database. (Examples are provided as supplemental data, http://www.cbs.dtu.dk/services/gwBrowser/suppl/).<br />
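The idea of pre-binned binary files read through memory mapping can be sketched in a few lines. The file layout used here (one little-endian 32-bit float per bin, one file per lane and zoom level) and all function names are assumptions made for illustration; the actual server uses a compiled C program whose format is not described in this section:<br />

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<f")  # hypothetical layout: one float32 per bin


def write_binned(path, values, bin_size):
    """Pre-bin per-nucleotide values by averaging and store them as raw floats."""
    with open(path, "wb") as fh:
        for i in range(0, len(values), bin_size):
            chunk = values[i:i + bin_size]
            fh.write(RECORD.pack(sum(chunk) / len(chunk)))


def read_window(path, begin, end, bin_size):
    """Return the binned values covering genome positions [begin, end)."""
    first = begin // bin_size
    last = (end - 1) // bin_size
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the requested records are touched; no index is needed
            # because the byte offset follows directly from the bin number.
            return [RECORD.unpack_from(mm, k * RECORD.size)[0]
                    for k in range(first, last + 1)]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "lane_zoom4.bin")
    write_binned(path, list(range(16)), bin_size=4)
    print(read_window(path, begin=4, end=12, bin_size=4))
```

Because every record has a fixed size, locating a window is a matter of arithmetic on the offset, which is the property that makes this approach cheaper than building database indexes over millions of rows.<br />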
The client is written as a Java applet that obtains the data remotely from the server (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi). The browser server is written in Perl/CGI, while a compiled C program handles the access to the binary data files. The options currently supported are listed in Table 2.<br />
Table 2 GeneWiz Browser server options.<br />
Option | Description<br />
d | The unique identifier for the atlas<br />
ft | Feature type (e.g. CDS, rRNA, tRNA) when returning annotations<br />
f | Data field to return<br />
b | Beginning of window<br />
e | End of window<br />
l | Zoom level<br />
z | Enable zlib compression of output<br />
m=i | Return the genome length<br />
m=avg/stddev/min/max | Return aggregate data for window/genome<br />
m=d | Return data values given the field, window and zoom level<br />
m=c | Return colors given two- or three-step ranges<br />
m=n | Return nucleotides given the window<br />
m=a | Return annotations (used together with option ‘ft’)<br />
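As a rough illustration of how the options in Table 2 might be combined, the sketch below assembles a request URL for the server address given in the text. It assumes that the options are passed as ordinary CGI query parameters and that an atlas identifier such as AL111168GENOMEatlas is accepted for option d; neither is confirmed by this section, the field name used in the example is a placeholder, and the service may no longer be reachable, so the code only builds the string and does not contact the server:<br />

```python
from urllib.parse import urlencode

# Server address as given in the text.
SERVER = "http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi"


def build_request(atlas_id, mode, **options):
    """Assemble a query URL from Table 2 options (assumed query-string syntax)."""
    params = {"d": atlas_id, "m": mode}
    params.update(options)
    return SERVER + "?" + urlencode(params)


if __name__ == "__main__":
    # Genome length of one atlas (m=i).
    print(build_request("AL111168GENOMEatlas", "i"))
    # Data values for a window at a given zoom level (m=d);
    # "example_field" is a hypothetical field name.
    print(build_request("AL111168GENOMEatlas", "d",
                        f="example_field", b=1, e=10000, l=3))
```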
These options (Table 2) can be incorporated into a single URL. For example, one could request all [...] http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-[...] [...]ations are described in the xml record, which can be downloaded from the web (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/fetchxml.cgi?AL111168GENOMEatlas). Further examples are provided in the supplemental data section.<br />
The GeneWiz workflow and data displayed<br />
The GeneWiz browser plots and provides disproportional zooming for data pertaining to features and genes as well as numerical data associated with each nucleotide. The disproportional capability of the GeneWiz browser implies that all components (legends, tracks, marks, etc.) are regenerated for every view requested by the user. Figure 4 outlines the GeneWiz browser workflow.<br />
When submitting a job via the web interface, the request is assigned a job identifier, under which all data lanes and configurations are kept. After the job has been processed the user may alter lane order, colors, ranges, and append various types of marks to the plot. The layout of a given browser instance is governed by an XML file, located on the server. When generating the graphical representation of the genome, the client Java program will make requests to the server to acquire aggregated values, such as the averages, standard deviations, minima, and maxima as well as lane data and annotations.<br />
Figure 4 | The dataflow of the GeneWiz browser service. 1) The selected reference genome and the lanes to be included are defined via the web interface. 2) The request is sent to the analysis server that handles the calculations. 3) When the job is finished, the web page redirects to the applet viewer that allows the user to navigate and edit the plot layout.<br />
Premade atlases<br />
The genome sequences stored in the CBS Genome Atlas Database [25] are synchronized with NCBI Entrez genome projects and have been pre-processed for all of the eight standard atlases mentioned above. This currently allows the user to select from 1,636 pre-binned replicons from 864 prokaryotic sequencing projects, searchable by replicon name, GenBank accession number, or organism name (http://www.cbs.dtu.dk/services/gwBrowser/precalc/).<br />
Results<br />
Evaluation of re-sequencing quality<br />
Three re-sequenced bacterial genomes were examined: one genome sequence was generated using the Illumina GA technology, whereas two were generated using the 454-Titanium technology (Table 3). The public sequence was selected as the reference for mapping the re-sequencing reads using the GeneWiz browser tool. The randomness of fragmentation was estimated by comparing the experimental data with in-silico digestions, generated at 40X coverage using read lengths from 30 to 5,000 bp. A good correspondence between the in-silico and experimental reads suggests little bias towards certain chromosomal regions (Figure 5, panel A). The assembled contigs provided by 454 (C. jejuni and E. coli) are mapped to the reference genome using BLAST and annotated in the perimeter of the atlases (two leftmost atlases in Figure 5, panels A and B). The detailed atlases of the experimental data (true reads) are shown in Figure 5, panel B. Panel C shows the quality and count of reads plotted as a function of read position. Note that the read quality decreases with distance from the beginning of the read.<br />
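An in-silico digestion of the kind described above could be simulated as in the following sketch. It assumes uniformly random start positions on a linear sequence, error-free reads of fixed length and a read count chosen so that the total number of bases equals the requested fold coverage; these choices and the function name are illustrative, since the section does not state how the digestions were generated:<br />

```python
import random


def in_silico_reads(genome, read_length, coverage, seed=0):
    """Sample fixed-length reads at uniformly random positions.

    The number of reads is chosen so that reads * read_length is about
    coverage * genome length. The genome is treated as linear.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)  # seeded for reproducible examples
    n_reads = int(coverage * len(genome) / read_length)
    last_start = len(genome) - read_length
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_length] for s in starts]


if __name__ == "__main__":
    toy_genome = "ACGT" * 250  # 1,000 bp toy sequence
    reads = in_silico_reads(toy_genome, read_length=50, coverage=40)
    print(len(reads), len(reads[0]))
```

Mapping such simulated reads with the same procedure as the experimental reads gives a baseline for how much of the low-uniqueness signal is explained by read length alone.<br />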
Table 3 Sequencing details of three bacterial genomes, two of which were re-sequenced using 454-Titanium and one with Illumina GA technology.<br />
Genome | E. coli K12 MG1655 | C. jejuni NCTC11168 | S. typhi Ty2<br />
Strain id | ATCC: 700926D-5 | ATCC: 700819D-5 | ERA000001<br />
Technology | 454-Titanium | 454-Titanium | Illumina GA II<br />
Read count | 538,784 | 502,438 | 1,650,370<br />
Avg read length (std. dev) | 522 (53) | 598 (75) | 51 (0)<br />
Truncated length | 600 | 600 | 35<br />
Coverage | 61X | 183X | 18X<br />
Genome size | 4,639,675 bp | 1,641,481 bp | 4,791,961 bp<br />
Accession and original reference | U00096 [26] | AL111168 [27] | AE014613 [28]<br />
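The coverage row of Table 3 can be checked with simple arithmetic. The sketch below assumes that coverage is defined as read count times average read length divided by genome size; the table does not state the definition, but this one reproduces the reported values after rounding:<br />

```python
def fold_coverage(read_count, mean_read_length, genome_size):
    """Expected fold coverage: total sequenced bases divided by genome size."""
    return read_count * mean_read_length / genome_size


# Values copied from Table 3: (read count, average read length, genome size in bp).
DATASETS = {
    "E. coli K12 MG1655": (538_784, 522, 4_639_675),
    "C. jejuni NCTC11168": (502_438, 598, 1_641_481),
    "S. typhi Ty2": (1_650_370, 51, 4_791_961),
}

if __name__ == "__main__":
    for name, (reads, length, size) in DATASETS.items():
        print(f"{name}: {fold_coverage(reads, length, size):.1f}X")
```

The results, about 60.6X, 183.0X and 17.6X, round to the 61X, 183X and 18X listed in the table, which suggests that the average rather than the truncated read length was used.<br />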
Figure 5 | Panel A: The maximum uniqueness quality is shown for the actual reads (green-to-blue lane) plotted in the outermost lanes, using the published genome as a reference. The following lanes show in-silico digestions at 40X coverage (red-to-blue lane), using read lengths 30, 50, 70, 200, 500, 1,000, 1,000, and 5,000 bases. Panel B shows the weighted coverage, agreement with reference, maximum uniqueness quality, information content, read absence, and AT content. All six plots can be accessed for zooming via the supplemental data section. Panel C displays the read count (green, secondary ordinate) and read quality (red, primary ordinate) as a function of read length. Note that read counts differ among the three datasets, resulting in different scales on the secondary ordinate. For the two 454-Titanium sets (C. jejuni and E. coli K12), an assembly was provided, which allows a mapping of contigs to the reference genome. These marks are shown in gray in the perimeter of these plots. Red marks indicate contigs with two or more hits in the reference.<br />
Genome homology: Comparing multiple Burkholderia species<br />
A comparative study aimed at mapping, for example, pathogenic islands or gene losses among different bacterial genomes can benefit from the graphical representation provided by the BLASTatlas method. The genus Burkholderia covers a number of important animal and human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection in cystic fibrosis (CF) patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei, rarely gives rise to disease in humans [29,30]. Both B. thailandensis and B. mallei display large chromosomal deletions when compared to B. pseudomallei. However, the more scattered nature of the gene loss observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through the loss of larger regions [31]. These deletions are evident from the atlas shown in Figure 6, where the two chromosomes of Burkholderia pseudomallei 1710b are used as BLASTatlas reference in a comparison with 14 publicly available Burkholderia genomes (B. thailandensis plus all species having two or more strains sequenced, see supplemental data). In addition, it is evident that deletions occur preferentially in chromosome II. Ong and co-workers report that deletions in chromosome II account for 70% and 61% of the total gene loss in B. mallei and B. thailandensis, respectively.<br />
Figure 6 | BLASTatlas of Burkholderia pseudomallei 1710b chromosomes I+II compared with 14 Burkholderia species. Showing from the outermost circles: B. ambifaria (2, purple), B. cenocepacia (4, red), B. thailandensis (1, green) 10774, B. mallei (4, green), and B. pseudomallei (3, blue). Innermost circles show percent AT and CG skew. Note that to allow visual comparison between B. thailandensis and B. mallei, both species are colored green: the outermost green lane corresponds to the single B. thailandensis, whereas the remaining four green lanes are all B. mallei. GenBank accession numbers as well as interactive plots are available through the supplemental data section.<br />
The SIDD atlas: Annotation of regulatory<br />
elements<br />
The browser application enables the user to append various annotation marks such as transcription start site arrows, gene labels, and boxes. A final example illustrates how these marks can be used to integrate known regulatory elements with DNA properties and gene annotations to draw a more complete picture of a promoter region. The regulatory elements of the E. coli K12 MG1655 rrn operons [32] have been annotated in a standard SIDD atlas, providing a visualization of the P1/P2 promoter structure (Figure 7). A zoom of the promoter region reveals a strong SIDD site near the predominant P1 promoter, approximately 40 bp upstream of the P1 transcription start site. The transcription factor FIS stimulates transcription at several promoters; for example, the binding of FIS at the leuV promoter [33] has been suggested to transmit the superhelical destabilization downstream to the point where the RNAP twists and opens the helix [34]. This model may also be valid for the rrnB P1 promoter, as the activities of leuV and rrnB P1 are comparable [35].<br />
Figure 7 | A zoom upstream of the E. coli K12 MG1655 rrnB operon. The three outermost lanes show SIDD at three superhelix densities of sigma = -0.055, -0.045, and -0.035. The lower free energy required to melt the helix can be observed near the UP element of P1, for the SIDD lane at sigma = -0.045. The atlas is available for zooming in the supplemental data section.<br />
Discussion<br />
Visualization of the multidimensional information that is represented by a single genome sequence remains complex. An indispensable property of a genome visualization tool is that it must be zoomable, so that information can be interpreted at varying scales. Two recently published methods, the DNAPlotter [36] and the Genome Projector [37], both enable the user to build circular plots of numerical data related to genes as well as graphs of numerical data pertaining to the nucleotides. These tools create static graphics and allow only proportional zooming, which makes the plot hard to interpret when zooming in too deeply. Both tools allow visualization of individual genomes, but do not allow easy comparison across multiple genomes. As new genome sequences become available with increasing ease, it is essential to be able to quickly compare other genomes to a reference.<br />
A number of other tools approach genome visualization from different angles: Genome Diagram [38] and Circos [39] are command line programs generating publication quality static images and vector graphics. Although these tools allow comparison of other genomes, are flexible and allow visualization of numerical data, they lack an interactive layer.<br />
The GeneWiz browser described here uses disproportional zooming to overcome this. From a technical perspective, the choice of programming language for writing graphical browsers is of importance. There are obvious advantages to providing platform-independent Java software like the GeneWiz browser, but this often comes at the cost of performance. Nevertheless, our tool demonstrates the usefulness of a genome browser that relies on interactive, true disproportional zooming to visualize annotated genes and features as well as numerical data provided at single nucleotide resolution. By building a comprehensive tool that is both scalable and flexible, we have shown how different types of genomic data can be integrated into a single, easily navigated graphic that can be annotated further by the user.<br />
Author contributions<br />
P.F.H. wrote the paper and composed the web interfaces, as well as most parts of the server back end. H.H.S. wrote the C code of the data binning and retrieval software and contributed to the Java applet; E.R. wrote the majority of the Java applet code and formulated the XML configurations. T.T.B. provided source data and analysis of C. jejuni and E. coli sequencing reads and C.J.B. assisted in writing the paper (paragraphs on SIDD energy). D.W.U. assisted in writing the paper, supervised the project and provided ideas for figures and analysis. All authors have read and made corrections to the manuscript.<br />
Acknowledgements<br />
This work is funded in part by grants from the Danish Center for Scientific Computing, NSF Research Grant DBI-0416764, The Danish Research Council grant 26-06-0349, and the EU EMBRACE network of Excellence, contract number LSHG-CT-2004-512092. We thank Mark Driscoll and Marcel Margulies from 454 Life Sciences for providing the data for C. jejuni and E. coli and Julian Parkhill at the Sanger Institute for providing the S. typhi sequencing data. We also thank Dr. Trudy Wassenaar and Dr. Lars Juhl Jensen for making suggestions to the manuscript.<br />
References<br />
1. Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci USA 2007; 104:13913-13918. PubMed doi:10.1073/pnas.0702636104<br />
2. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C et al. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002; 319:1257-1265. PubMed doi:10.1016/S0022-2836(02)00379-0<br />
3. Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform 2006; 7:225. PubMed doi:10.1093/bib/bbl004<br />
4. Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol 1999; 150:773-777. PubMed doi:10.1016/S0923-2508(99)00116-3<br />
5. Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol 2000; 299:907-930. PubMed doi:10.1006/jmbi.2000.3787<br />
6. Hall N. Advanced sequencing technologies and their wider impact in microbiology. J Exp Biol 2007; 210:1518-1525. PubMed doi:10.1242/jeb.001370<br />
7. Holt RA, Jones SJ. The new paradigm of flow cell<br />
sequenc<strong>in</strong>g. Genome Res 2008; 18:839-846.<br />
PubMed doi:10.1101/gr.073262.107<br />
8. Käller M, Lundeberg J, Ahmadian A. Arrayed<br />
identification of DNA signatures. Expert Rev Mol<br />
Diagn 2007; 7:65-76. PubMed<br />
doi:10.1586/14737159.7.1.65<br />
9. Gupta PK. S<strong>in</strong>gle-molecule DNA sequenc<strong>in</strong>g<br />
technologies for future genomics research. Trends<br />
Biotechnol 2008; 26:602-611. PubMed<br />
doi:10.1016/j.tibtech.2008.07.003<br />
10. Shendure J, Ji H. Next-generation DNA sequenc<strong>in</strong>g.<br />
Nat Biotechnol 2008; 26:1135-1145.<br />
PubMed doi:10.1038/nbt1486<br />
11. Smith DR, Qu<strong>in</strong>lan AR, Peckham HE, Makowsky<br />
K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem<br />
N, Stromberg MP et al. Rapid wholegenome<br />
mutational profil<strong>in</strong>g us<strong>in</strong>g nextgeneration<br />
sequenc<strong>in</strong>g technologies. Genome Res<br />
2008; 18:1638-1642. PubMed<br />
doi:10.1101/gr.077776.108<br />
12. L<strong>in</strong> F, Schröder H, Schmidt B. Solv<strong>in</strong>g the Bottleneck<br />
Problem <strong>in</strong> Bio<strong>in</strong>formatics Comput<strong>in</strong>g: An<br />
Architectural Perspective. J VLSI Signal Process<br />
2007; 48:185-188. doi:10.1007/s11265-007-<br />
0088-z<br />
13. Phillippy AM, Schatz MC, Pop M. Genome assembly<br />
forensics: f<strong>in</strong>d<strong>in</strong>g the elusive mis-<br />
http://st<strong>and</strong>ards<strong>in</strong>genomics.org 213
GeneWiz browser<br />
assembly. Genome Biol 2008; 9:R55. PubMed<br />
doi:10.1186/gb-2008-9-3-r55<br />
14. Tolstrup N, Rouzé P, Brunak S. A branch po<strong>in</strong>t<br />
consensus from Arabidopsis found by noncircular<br />
analysis allows for better prediction of<br />
acceptor sites. Nucleic Acids Res 1997; 25:3159-<br />
3163. PubMed doi:10.1093/nar/25.15.3159<br />
15. Hall<strong>in</strong> PF, B<strong>in</strong>newies TT, Ussery DW. The genome<br />
BLASTatlas-a GeneWiz extension for visualization<br />
of whole-genome homology. Mol Biosyst<br />
2008; 4:363-371. PubMed<br />
doi:10.1039/b717118h<br />
16. Bolshoy A, McNamara P, Harr<strong>in</strong>gton RE, Trifonov<br />
EN. Curved DNA without A-A: experimental estimation<br />
of all 16 DNA wedge angles. Proc Natl<br />
Acad Sci USA 1991; 88:2312-2316. PubMed<br />
doi:10.1073/pnas.88.6.2312<br />
17. Brukner I, Sánchez R, Suck D, Pongor S. Sequence-dependent<br />
bend<strong>in</strong>g propensity of DNA as<br />
revealed by DNase I: parameters for tr<strong>in</strong>ucleotides.<br />
EMBO J 1995; 14:1812-1818. PubMed<br />
18. van Noort V, Worn<strong>in</strong>g P, Ussery DW, Rosche<br />
WA, S<strong>in</strong>den RR. Str<strong>and</strong> misalignments lead to quasipal<strong>in</strong>drome<br />
correction. Trends Genet 2003;<br />
19:365-369. PubMed doi:10.1016/S0168-<br />
9525(03)00136-7<br />
19. Olson WK, Gor<strong>in</strong> AA, Lu XJ, Hock LM, Zhurk<strong>in</strong><br />
VB. DNA sequence-dependent deformability deduced<br />
from prote<strong>in</strong>-DNA crystal complexes. Proc<br />
Natl Acad Sci USA 1998; 95:11163-11168.<br />
PubMed doi:10.1073/pnas.95.19.11163<br />
20. Ornste<strong>in</strong> RL, Re<strong>in</strong> R, Breen DL, MacElroy RD. An<br />
optimized potential function for the calculation of<br />
nucleic acid <strong>in</strong>teraction energies. I- Base stack<strong>in</strong>g.<br />
Biopolymers 1978; 17:2341-2360.<br />
doi:10.1002/bip.1978.360171005<br />
21. Satchwell SC, Drew HR, Travers AA. Sequence<br />
periodicities <strong>in</strong> chicken nucleosome core DNA. J<br />
Mol Biol 1986; 191:659-675. PubMed<br />
doi:10.1016/0022-2836(86)90452-3<br />
22. Ussery D, Soumpasis DM, Brunak S, Staerfeldt<br />
HH, Worn<strong>in</strong>g P, Krogh A. Bias of pur<strong>in</strong>e stretches<br />
<strong>in</strong> sequenced chromosomes. Comput Chem<br />
2002; 26:531-541. PubMed doi:10.1016/S0097-<br />
8485(02)00013-X<br />
23. Wang H, Benham CJ. Superhelical destabilization<br />
<strong>in</strong> regulatory regions of stress response genes.<br />
PLOS Comput Biol 2008; 4:e17. PubMed<br />
doi:10.1371/journal.pcbi.0040017<br />
24. Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N,<br />
Cayt<strong>in</strong>g P, Harrrison P, Gerste<strong>in</strong> M. Pseudo-<br />
gene.org: a comprehensive database <strong>and</strong> comparison<br />
platform for pseudogene annotation. Nucleic<br />
Acids Res 2007; 35:D55-D60. PubMed<br />
doi:10.1093/nar/gkl851<br />
25. Hall<strong>in</strong> PF, Ussery DW. <strong>CBS</strong> Genome Atlas Database:<br />
a dynamic storage for bio<strong>in</strong>formatic results<br />
<strong>and</strong> sequence data. Bio<strong>in</strong>formatics 2004;<br />
20:3682-3686. PubMed<br />
doi:10.1093/bio<strong>in</strong>formatics/bth423<br />
26. Blattner FR, Plunkett G, Bloch CA, Perna NT,<br />
Burl<strong>and</strong> V, Riley M, Collado-Vides J, Glasner JD,<br />
Rode CK, Mayhew GF et al. The complete genome<br />
sequence of Escherichia coli K-12. Science<br />
1997; 277:1453-1462. PubMed<br />
doi:10.1126/science.277.5331.1453<br />
27. Parkhill J, Wren BW, Mungall K, Ketley JM,<br />
Churcher C, Basham D, Chill<strong>in</strong>gworth T, Davies<br />
RM, Feltwell T, Holroyd S et al. The genome sequence<br />
of the food-borne pathogen Campylobacter<br />
jejuni reveals hypervariable sequences. Nature<br />
2000; 403:665-668. PubMed<br />
doi:10.1038/35001088<br />
28. Deng W, Liou SR, Plunkett G, Mayhew GF, Rose<br />
DJ, Burl<strong>and</strong> V, Kodoyianni V, Schwartz DC,<br />
Blattner FR. <strong>Comparative</strong> genomics of Salmonella<br />
enterica serovar Typhi stra<strong>in</strong>s Ty2 <strong>and</strong> CT18. J<br />
Bacteriol 2003; 185:2330-2337. PubMed<br />
doi:10.1128/JB.185.7.2330-2337.2003<br />
29. Brett PJ, DeShazer D, Woods DE. Burkholderia<br />
thail<strong>and</strong>ensis sp. nov., a Burkholderia pseudomallei-like<br />
species. Int J Syst Bacteriol 1998; 48:317-<br />
320. PubMed<br />
30. Smith MD, Angus BJ, Wuthiekanun V, White NJ.<br />
Arab<strong>in</strong>ose assimilation def<strong>in</strong>es a nonvirulent biotype<br />
of Burkholderia pseudomallei. Infect Immun<br />
1997; 65:4319-4321. PubMed<br />
31. Ong C, Ooi CH, Wang D, Chong H, Ng KC, Rodrigues<br />
F, Lee MA, Tan P. Patterns of large-scale<br />
genomic variation <strong>in</strong> virulent <strong>and</strong> avirulent Burkholderia<br />
species. Genome Res 2004; 14:2295-<br />
2307. PubMed doi:10.1101/gr.1608904<br />
32. Hirvonen CA, Ross W, Wozniak CE, Marasco E,<br />
Anthony JR, Aiyar SE, Newburn VH, Gourse RL.<br />
Contributions of UP elements <strong>and</strong> the transcription<br />
factor FIS to expression from the seven rrn P1<br />
promoters <strong>in</strong> Escherichia coli. J Bacteriol 2001;<br />
183:6305-6314. PubMed<br />
doi:10.1128/JB.183.21.6305-6314.2001<br />
33. Ross W, Salomon J, Holmes WM, Gourse RL.<br />
Activation of Escherichia coli leuV transcription<br />
by FIS. J Bacteriol 1999; 181:3864-3868. PubMed<br />
214 St<strong>and</strong>ards <strong>in</strong> Genomic Sciences
34. Wang H, Noordewier M, Benham CJ. Stress<strong>in</strong>duced<br />
DNA duplex destabilization (SIDD) <strong>in</strong><br />
the E. coli genome: SIDD sites are closely associated<br />
with promoters. Genome Res 2004;<br />
14:1575-1584. PubMed doi:10.1101/gr.2080004<br />
35. Bauer BF, Kar EG, Elford RM, Holmes WM. Sequence<br />
determ<strong>in</strong>ants for promoter strength <strong>in</strong> the<br />
leuV operon of Escherichia coli. Gene 1988;<br />
63:123-134. PubMed doi:10.1016/0378-<br />
1119(88)90551-3<br />
36. Carver T, Thomson N, Bleasby A, Berriman M,<br />
Parkhill J. DNAPlotter: circular <strong>and</strong> l<strong>in</strong>ear <strong>in</strong>teractive<br />
genome visualization. Bio<strong>in</strong>formatics 2009;<br />
25:119-120. PubMed<br />
doi:10.1093/bio<strong>in</strong>formatics/btn578<br />
Hall<strong>in</strong>, et al.<br />
37. Arakawa K, Tamaki S, Kono N, Kido N, Ikegami<br />
K, Ogawa R, Tomita M. Genome Projector:<br />
zoomable genome map with multiple views. BMC<br />
Bio<strong>in</strong>formatics 2009; 10:31. PubMed<br />
doi:10.1186/1471-2105-10-31<br />
38. Pritchard L, White JA, Birch PR, Toth IK. GenomeDiagram:<br />
a python package for the visualization<br />
of large-scale genomic data. Bio<strong>in</strong>formatics<br />
2006; 22:616-617. PubMed<br />
doi:10.1093/bio<strong>in</strong>formatics/btk021<br />
39. Krzyw<strong>in</strong>ski M, Sche<strong>in</strong> J, Birol I, Connors J, Gascoyne<br />
R, Horsman D, Jones SJ, Marra MA. Circos:<br />
an <strong>in</strong>formation aesthetic for comparative genomics.<br />
Genome Res 2009; 19:1639-1645. PubMed<br />
doi:10.1101/gr.092759.109<br />
http://st<strong>and</strong>ards<strong>in</strong>genomics.org 215
Paper VII: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes<br />
Chapter 4<br />
Web Services and Interoperability in Genomics<br />
This chapter describes work done in connection with the EU project EMBRACE. The deliverables<br />
defined for CBS included both outreach obligations and implementation<br />
tasks: providing tools and databases through Web Services. This author’s contributions<br />
reflect this duality. They comprised developing the server infrastructure for<br />
hosting Web Services and teaching the use and design of Web Services on several occasions<br />
(see appendix A.1). CBS is now using this work to integrate all major prediction<br />
servers under the same Web Services umbrella. There are currently 17 services offered<br />
using this technology (see footnote 1). The work on Web Services has laid the foundation for<br />
an online resource like BLASTatlas (paper I). Further, the RNAmmer tool (paper VI) is offered<br />
both through a traditional web interface and through Web Services, and these implementations<br />
demonstrate the usefulness of programmatic access to tools.<br />
4.1 Introduction<br />
Over the past decade, the internet has revolutionized the way information<br />
is exchanged in modern society. Bank transactions, digital road maps and<br />
satellite images, email, news articles, and social networks are now hard<br />
to imagine without a digitally connected world. Biological and bioinformatic information<br />
is no exception, as it relies on the internet for the transport of sequence data,<br />
experimental results, scientific articles, etc. Both the volume and the complexity of biological<br />
information increase day by day. As new experimental techniques become available, new<br />
types of data, as well as new ways of combining them, are introduced. For decades, the<br />
exchange of biological information over the internet has taken the form of human-readable<br />
HTML documents (HyperText Markup Language) or flat files residing on FTP servers<br />
(File Transfer Protocol). HTML was designed to carry static information<br />
presented by a server to a human being using a browser. Today, computers are required<br />
to digest huge amounts of information with less human involvement, and more<br />
advanced technologies are needed. To successfully integrate the vast amounts of<br />
data provided by the life science community, interoperability remains a key issue. It<br />
may seem unrealistic to reach a point where every biologist and bioinformatician has<br />
programmatic access to the world’s biological databases and tools from<br />
their favorite programming language. However, with the current technologies in Web<br />
Services, an interoperable life science community may not be far away. When connected,<br />
the communities will be able to exchange not only data but also services such as tools<br />
for predicting protein function, performing sequence alignments, or finding genes.<br />
1 BLASTatlas, EasyGene, EPipe, GeneWiz, GenomeAtlas, hERG, MaxAlign, NetChop, NetCTL, NetGlycate,<br />
NetNGlyc, NetOGlyc, NetPhos, RNAmmer, SIDDbase, SignalP, and TMHMM<br />
Figure 4.1: Screen shot of the NCBI Entrez Genome Projects web page<br />
4.2 Interoperability<br />
The term ’interoperability’ is defined as the ability of two or more systems to exchange<br />
and make use of information (IEEE, http://www.ieee.org). Whether systems can be<br />
said to be ’interoperable’ depends on how one interprets ’make use of’. Consider the list of<br />
complete prokaryotic genome sequences maintained by NCBI at<br />
http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, as shown in figure 4.1.<br />
To retrieve this list automatically, one may write a parser that transforms the HTML<br />
into computer-readable text. Apart from being overly sensitive to changes in the HTML<br />
document, such a parser lacks the knowledge behind the data, since the format is<br />
neither typed nor structured. Only when interpreted by a web browser and presented<br />
graphically to a human does this information make sense. In other words, both sender and receiver<br />
must have knowledge about the information that is exchanged before they<br />
can be said to be interoperable. There are two aspects of interoperability. First, there must<br />
be agreement on the format in which data is exchanged. Whether this is structured<br />
XML or some arbitrary format, the server must return the format expected by the client<br />
in response to a request. Second, a description and understanding of the content of the data being<br />
exchanged is required when building client-side code and objects in Web Services.<br />
Without knowledge of the exact data types, the programming environment (e.g. C, Java,<br />
Perl) cannot declare the objects with proper variable types.<br />
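As an illustration of the second aspect, the following Python sketch parses a small XML record into an object with declared field types.<br />
The element names (genome, organism, accession, segments) are hypothetical and are not taken from any CBS schema; the point is only that the client must know the type of each field before it can build the object.<br />

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Hypothetical record; element names are illustrative, not a real CBS schema.
RECORD = """
<genome>
  <organism>Campylobacter jejuni</organism>
  <accession>AL111168</accession>
  <segments>1</segments>
</genome>
"""


@dataclass
class Genome:
    organism: str
    accession: str
    segments: int  # known to be an integer only because the format is described


def parse_genome(xml_text: str) -> Genome:
    """Build a typed object from XML, converting each field to its declared type."""
    root = ET.fromstring(xml_text)
    return Genome(
        organism=root.findtext("organism"),
        accession=root.findtext("accession"),
        segments=int(root.findtext("segments")),
    )


genome = parse_genome(RECORD)
```

Without a description of the record, the client could not tell that segments should be converted to an integer while accession must stay a string.<br />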
Listing 4.1: Abbreviated input to the queryGenomes operation of the Genome Atlas Database 3.0 web service<br />
(XML markup not preserved; surviving element values: AL111168, yes)<br />
4.2.1 SOAP based Web Services<br />
SOAP (the name was an acronym for Simple Object Access Protocol prior to version 1.2) is to a large<br />
extent an agreed-upon protocol for exchanging information in structured<br />
XML messages (eXtensible Markup Language). The protocol became a<br />
W3C (World Wide Web Consortium) recommendation in 2003. It describes the messaging format between<br />
a client and a server; the messages are in most cases transported over HTTP. Listings 4.1<br />
and 4.2 show an example request to and response from the CBS Genome Atlas Database 3.0 Web<br />
Service, using the operation queryGenomes to query the database for a GenBank<br />
accession number.<br />
The SOAP messages are XML structures consisting of a SOAP envelope, which in turn<br />
consists of a header (not included here) and a body. A special envelope style called<br />
’wrapped’ is used for the CBS services, meaning that the content of both request and response<br />
is wrapped in an element named after the operation invoked (here queryGenomes).<br />
This enables the server to easily dispatch the message to the proper internal code. The<br />
SOAP protocol forms the basic language for exchanging messages over HTTP, but it neither<br />
describes the structure of the messages exchanged by a given resource nor explains<br />
its functionality. The WSDL (Web Services Description Language) file closes this gap by<br />
defining the information that enables a user or computer to communicate with the resource.<br />
The WSDL declares all the operations supported by a resource and the composition of the<br />
XML structures allowed by the operations. Finally, the WSDL defines the endpoint URL<br />
to which the SOAP request message is submitted. The essential data of the WSDL are<br />
the descriptions of the XML structures, formulated in the XSD language (XML Schema<br />
Definition). The schema for the request of the queryGenomes operation is shown in<br />
listing 4.3. Figure 4.2 shows a schematic drawing of a SOAP resource.<br />
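The wrapped style described above can be sketched in Python with the standard library.<br />
The envelope namespace is that of SOAP 1.1; the service namespace and the child element name (accession) are assumptions made for illustration and are not taken from the Genome Atlas schema.<br />

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:genomeatlas"  # hypothetical service namespace


def build_request(operation: str, params: dict) -> bytes:
    """Wrap the parameters in an element named after the operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    wrapper = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(wrapper, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def operation_name(message: bytes) -> str:
    """Return the local name of the first element inside the SOAP body."""
    body = ET.fromstring(message).find(f"{{{SOAP_ENV}}}Body")
    wrapper = body[0]
    return wrapper.tag.split("}", 1)[1]


request = build_request("queryGenomes", {"accession": "AL111168"})
```

A server could call operation_name(request) to pick the handler for the message without inspecting anything below the wrapper element.<br />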
4.3 EMBRACE: An EU initiative to enhance interoperability<br />
The EMBRACE Network of Excellence is a project funded by the European Commission under<br />
the Sixth Framework Programme (FP6). One aim of the EMBRACE project was<br />
to integrate the major tools and databases within the life science communities. A<br />
technology recommendation workgroup within EMBRACE investigated which current<br />
technologies could form the basis of the integration and recommended SOAP-based<br />
Web Services described by WSDL files, with data structures typed using the XSD<br />
format.<br />
Listing 4.2: Abbreviated output from the queryGenomes operation of the Genome Atlas Database 3.0 web service<br />
(XML markup not preserved; surviving element values: Bacteria; Epsilonproteobacteria; 8; Campylobacter jejuni subsp. jejuni NCTC 11168; AL111168; NC_002163; Chromosome)<br />
Listing 4.3: XSD entry of the queryGenomes request message<br />
(XML markup not preserved)<br />
Figure 4.2: Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the<br />
same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose<br />
the outgoing request and parse the incoming server response.<br />
4.3.1 Quasi - a light-weight SOAP server<br />
One of the main obstacles for many SOAP servers and clients is the computational overhead<br />
and memory consumption involved in parsing large and complex XML structures.<br />
For the BLASTatlas service, this was a limiting factor. With a conventional<br />
package, SOAP::Lite, submitting a request required more memory than is available<br />
in a modern desktop computer and took around 20 minutes just to prepare the message<br />
before submission. Once the message was submitted, the server incurred the same overhead parsing the incoming<br />
XML. The XML::Compile package for Perl proved superior as a client framework.<br />
On the server side, however, there was a demand for speed, flexibility and custom adjustment,<br />
which led to the development of a light-weight SOAP server called ’quasi’ (’QUite<br />
A Soap Implementation’ or ’QUAsi Soap Implementation’). Apart from its speed, it has<br />
further advantages:<br />
• The server can be launched both remotely and locally. The latter allows quick and<br />
easy testing of services by reading a SOAP message from STDIN.<br />
• The XML parsing method (e.g. XML::Simple or XML::Twig) may be chosen independently<br />
for each operation, and parsing may even be postponed until after the job has been placed in the<br />
queue and the job id has been returned. This is an advantage for very large messages.<br />
• Control over the code stack enables much faster implementation of custom<br />
functionality.<br />
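The second point, postponing the XML parsing until the job has been queued, can be sketched as follows.<br />
This is not the quasi implementation; the function names, the runJob message, and the in-memory queue are assumptions chosen to keep the example self-contained.<br />

```python
import queue
import uuid
import xml.etree.ElementTree as ET

jobs = queue.Queue()  # stand-in for a real job queue


def submit(raw_message: bytes) -> str:
    """Queue the unparsed message and return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, raw_message))  # no XML parsing has happened yet
    return job_id


def work_one() -> tuple:
    """Take one job off the queue and only now pay the cost of parsing it."""
    job_id, raw_message = jobs.get()
    root = ET.fromstring(raw_message)
    return job_id, root.tag, len(root)


job_id = submit(b"<runJob><seq>ACGT</seq><seq>TTGA</seq></runJob>")
done_id, tag, n_children = work_one()
```

The trade-off in this sketch is that a malformed message is detected only when the worker parses it, not when the client submits it.<br />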
4.3.2 quasi mktemp - From template to Web Service<br />
To make implementation easier still, a template creator was written that<br />
builds an example Web Service from a standard CBS template. The user provides the<br />
name and version of the service, and the tool prepares an entire installation of the service<br />
on the servers. The template creator:<br />
• automatically creates WSDL and XSD files for the given name and version of the service,<br />
placed in the proper location in the file system<br />
• creates an example directory with a working Perl example that uses the service<br />
• provides built-in templates for both synchronous and asynchronous access<br />
• creates the proper entry in the central services database table<br />
• makes available, once it has run, a web page describing the<br />
service and providing links to the WSDL and XSD files as well as WSDL-embedded<br />
documentation<br />
When designing Web Services, it is not a trivial task to keep track of namespaces,<br />
declarations of input/output objects, operation names, etc. The feedback received so far<br />
on this tool indicates that functioning examples clearly reduce the chance of mistakes. The<br />
manual for the software is found in appendix D.6.<br />
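The idea of generating service files from a template can be illustrated with Python's string.Template.<br />
The template text, placeholder names, namespace, and file naming below are invented for this sketch and do not reproduce the CBS template or the files that the tool writes.<br />

```python
from string import Template

# Invented, heavily shortened WSDL skeleton; not the CBS template.
WSDL_TEMPLATE = Template(
    '<definitions name="$name" '
    'targetNamespace="urn:example:$name:$version">\n'
    '  <!-- operations and messages for $name $version go here -->\n'
    '</definitions>\n'
)


def render_service(name: str, version: str) -> dict:
    """Return a mapping of file name to file content for a new service."""
    wsdl = WSDL_TEMPLATE.substitute(name=name, version=version)
    return {f"{name}_{version}.wsdl": wsdl}


files = render_service("MyTool", "1.0")
```

Because the service name and version are substituted in one place, the namespace and the file name cannot drift apart, which is the kind of mistake a working template is meant to prevent.<br />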
4.4 ENCODE pipeline: applying Web Services<br />
ENCODE (the Encyclopedia Of DNA Elements) was launched in September 2003 by<br />
the National Human Genome Research Institute. The goal was to identify all functional<br />
elements in the human genome sequence. In the pilot phase, 1 percent (30 Mb) of the human genome, drawn from<br />
44 selected regions, was analysed by ENCODE consortium<br />
researchers (Birney et al., 2007).<br />
GENCODE is a sub-project of ENCODE that seeks to identify all protein-coding<br />
genes in the ENCODE selected regions. For each protein-coding gene this means the<br />
delineation of a complete mRNA sequence for at least one splice isoform, and often for<br />
a number of additional alternative splice forms. The contributions from the BioSapiens<br />
partners focus on information from a protein annotation perspective. Special attention<br />
is given to alternative splicing and the putative effect it has<br />
on the functional diversification of genes.<br />
In the pilot phase of the BioSapiens project, the properties of the coding sequences<br />
for the 44 regions were analyzed by the BioSapiens partners separately. The results<br />
from the individual groups were collected and the main findings were published (Tress et al., 2007).<br />
Furthermore, the entire collection of annotations created by all partners was made available<br />
as supplementary material for the publication.<br />
In the current phase of the BioSapiens project, the goal is to scale up the<br />
annotation approach applied to the pilot ENCODE sequences to cover 100% of the human<br />
genome, including all isoforms. For the scale-up, the ENCODE Pipeline (EPipe)<br />
was constructed (a BioSapiens deliverable). EPipe is a WWW service that allows researchers<br />
to compare functional annotations for all splice variants of a given gene in an<br />
automatic way, or alternatively to analyse mutated sequence variants containing<br />
SNPs. This author has been responsible for the development<br />
of the main parts of the EPipe software as well as for implementing a large part of the<br />
modules (feature predictors). The EPipe project is an ongoing effort that has involved<br />
a number of people during its development.<br />
4.4.1 Collecting Web Services clients in EPipe<br />
EPipe uses a number of local and remote resources for protein feature prediction. The<br />
ability of EPipe to connect to remote resources via Web Services is incorporated within<br />
the individual modules. This gives a great deal of flexibility as to which resources to support<br />
(e.g. BioMoby, SOAP, etc.). The pipeline is shown in figure 4.3.<br />
EPipe itself is offered both as a SOAP web service (http://www.cbs.dtu.dk/ws/EPipe)<br />
and through a traditional web interface (http://www.cbs.dtu.dk/services/EPipe). A<br />
schematic overview of the workflow in EPipe is shown in figure 4.4.<br />
4.4.2 Mapping Pfam annotations to protein structure: mecA<br />
In Staphylococcus aureus the mecA gene encodes a penicillin-binding protein (PBP2a),<br />
resulting in methicillin resistance (Ender et al., 2009). The EPipe software can be used to<br />
map a range of relevant features onto the protein structure in order to visualize<br />
differences between homologs of this protein. In this example, however, a single MecR1<br />
protein from Staphylococcus aureus strain A5937, GenBank accession no. EEV85461, is<br />
processed. Figure 4.5 shows the structure browser of EPipe, which allows the user to<br />
browse the predicted features by showing their mapping onto the protein<br />
structure. Here, the three Pfam domains Transpeptidase, MecA_N, and PBP_dimer appear<br />
as significant hits.<br />
Figure 4.3: Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that<br />
as much as possible is dispatched in parallel. Modules may either be alignment dependent or not.<br />
If the alignment is required to predict the protein features, the module is not launched until the<br />
alignment algorithm has finished. Modules may either return global features of the entire protein<br />
(e.g. cellular localization), or return positional features (e.g. phosphorylation sites).<br />
Diagram labels: input sequences; cache filters; BLAST against PDB individually; alignment; modules I to IV and X; positional features; non-positional features; alignment dependent; map feature coordinates to alignment; map features onto best structure; XML of all results; render images in parallel and present to output pages; table of non-positional features; conclusion table; plot alignment and positions having different feature configuration; plot alignment and features with remapped coordinates; similarity in feature space.<br />
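The dispatch rule in the caption can be sketched with Python's concurrent.futures: independent modules start at once, while alignment-dependent modules wait for the alignment result.<br />
The module functions below are placeholders invented for this sketch; they do not correspond to real EPipe modules.<br />

```python
from concurrent.futures import ThreadPoolExecutor


def align(seqs):
    """Placeholder alignment: right-pad sequences with gaps to equal length."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def length_module(seqs):
    """Alignment-independent module returning a global feature per sequence."""
    return [len(s) for s in seqs]


def gap_module(alignment):
    """Alignment-dependent module: counts gap characters per aligned row."""
    return [row.count("-") for row in alignment]


def run_pipeline(seqs):
    with ThreadPoolExecutor() as pool:
        alignment_future = pool.submit(align, seqs)
        # Independent modules are submitted right away and run alongside the alignment.
        lengths_future = pool.submit(length_module, seqs)
        # Dependent modules are launched only once the alignment has finished.
        alignment = alignment_future.result()
        gaps_future = pool.submit(gap_module, alignment)
        return {"length": lengths_future.result(), "gaps": gaps_future.result()}


results = run_pipeline(["MKT", "MKTAY"])
```

Here length_module plays the role of a module returning a global feature, and gap_module that of a module that cannot start before the alignment exists.<br />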
Figure 4.4: The input web page of EPipe. The upper part defines sequence upload and alignment<br />
method, and the lower part selects which modules / methods to run. When applicable, gene ontologies<br />
have been added to each feature and feature values (light green boxes).<br />
Figure 4.5: The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim<br />
& Strynadka, 2002). The top panel shows the EPipe structure browser, which allows rotation in steps of 90<br />
degrees. The lower panel shows a post-processed rendering of the PyMOL script generated by EPipe.<br />
Chapter 5<br />
Conclusion and perspectives<br />
This thesis has presented a number of comparative genomics tools that have been used<br />
in different research projects and peer-reviewed publications. The aim has been to<br />
provide methods that enable the scientist to keep up with the increasing pace at which<br />
genome sequences are published. Visualization plays a key role, and finding better ways<br />
to present sequence information in a condensed and intuitive way is essential for deriving<br />
knowledge from the large number of bacterial strains being sequenced.<br />
Information content has previously been used to quantify the conservation of DNA motifs,<br />
and a recent extension of this information framework has made it possible to model complete<br />
promoters such as the P1/P2 system described in this work. The models shown here<br />
are to a large extent specific to E. coli P1/P2 sites. However, the design of the<br />
matrix and spacing configuration format of the iscan tool allows for much broader<br />
application. The tool may be used to test different hypotheses about promoter configurations<br />
across a broader range of organisms by expressing promoter conservation as a single<br />
comparable measure. Work remains to be done on benchmarking and on<br />
examining other promoter systems.<br />
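As a reminder of what "information content" measures for a DNA motif, the sketch below computes the per-position value 2 - H (in bits) from a set of aligned sites, where H is the Shannon entropy of the observed bases. It is a simplified illustration only: it assumes a uniform background, applies no small-sample correction, and does not model the matrix and spacing configuration used by iscan. The function name is hypothetical.<br />

```python
import math
from collections import Counter

def information_content(sites):
    """Per-position information content (bits) of equal-length aligned DNA sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column)
        # Shannon entropy of the base frequencies observed in this column.
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        # 2 bits is the maximum for a four-letter alphabet with uniform background.
        bits.append(2.0 - entropy)
    return bits
```

A fully conserved position scores 2 bits and a position where all four bases are equally frequent scores 0 bits; summing over positions gives one comparable number per motif.<br />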
Since the start of the Human Genome Project (HGP) in 1990, large<br />
investments have been made to develop and improve sequencing technology. The present stage, where a<br />
bacterial genome can be sequenced for a few thousand dollars within a few hours, is the result<br />
of years of competition and investment in genome projects. There are no signs that<br />
progress in sequencing technology will stop here. Sequencing single DNA<br />
molecules in real time has long been an ultimate goal within genomics and DNA sequencing.<br />
It has been demonstrated that a DNA synthesis reaction can be monitored in real time by<br />
immobilizing a DNA polymerase within a small (20 zeptoliter) well (Eid et al., 2009). If the<br />
technology reaches a final product, it may well start a new era in comparative genomics.<br />
Once it is possible to obtain a genome sequence at the same rate as DNA replication<br />
itself, and at superior read lengths, sophisticated software will be needed for the<br />
downstream processing. The technology could boost the quality of metagenomic<br />
sequencing and resolve the current difficulties in assembling these data sets properly.<br />
The BLASTatlas tool presented in this thesis incorporates a number of programs to<br />
calculate different DNA properties, as well as scripts for mapping sequence alignments to a<br />
reference genome. The number of dependencies makes it difficult to package the software<br />
and install it on other computer systems. Web Services play an important role in sharing these<br />
more complex tools among scientists, and it has been demonstrated how<br />
analysis and visualization methods can be offered using this technology. At first glance,<br />
traditional web interfaces seem more user-friendly. However, implementing interoperable<br />
methods such as BLASTatlas forces a process in which the communication<br />
is formalized and defined in every detail. This allows direct integration into the user's<br />
programming environment, which scales significantly better. Making one or two comparisons<br />
using a web interface will in most cases be faster than using the Web Services counterpart.<br />
The true advantages are achieved when analyses are repeated, possibly hundreds of<br />
times, and when linking input and output between different remote resources. Integration of<br />
biological data using SOAP-based Web Services is gaining acceptance. When the technology<br />
has matured, it will undoubtedly enhance the way biological information is exploited<br />
by allowing seamless flow between, for example, public sequence databases, repositories of<br />
experimental data, and bioinformatic prediction servers.<br />
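The scaling argument can be illustrated with a short client-side sketch. It assumes two remote operations exposed as ordinary callables; `fetch_sequence` and `predict` are hypothetical placeholders for Web Service calls, and the local stand-ins exist only so that the example runs without network access.<br />

```python
def run_batch(accessions, fetch_sequence, predict):
    """Chain two remote operations for every accession and collect the results."""
    results = {}
    for accession in accessions:
        sequence = fetch_sequence(accession)    # output of the first service ...
        results[accession] = predict(sequence)  # ... becomes input to the second
    return results

# Local stand-ins (assumptions) replacing real Web Service client calls.
def fake_fetch(accession):
    return "ATGC" * len(accession)

def fake_predict(sequence):
    return {"length": len(sequence)}

batch = run_batch(["CP000896", "CP000020"], fake_fetch, fake_predict)
```

The same loop serves two accessions or two hundred, which is where a scripted client overtakes a web form.<br />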
Appendix A<br />
Appendix: Workshops, teaching, and conferences<br />
A.1 Lectures and Presentations<br />
A.1.1 DTU Course 27101: Framework Course in Biotechnology and<br />
Food Sciences<br />
Taught in autumn 2008 by Prof. David Ussery, this course featured weekly computer exercises<br />
throughout the semester and projects requiring computer work. I planned and supervised<br />
the exercises and assisted the students with their project work. See also:<br />
http://www.cbs.dtu.dk/dtucourse/genomics27101.php<br />
A.1.2 Comparative Microbial Genomics Workshop<br />
Held June 2nd - 6th 2008, Bangkok, Thailand. I assisted in planning the workshop,<br />
lectured on rRNA operon structure, web services, and genome visualization methods, and<br />
was responsible for the computer exercises. Web page:<br />
http://www.cbs.dtu.dk/courses/thaiworkshop08/programme.php<br />
A.1.3 Comparative Microbial Genomics and Taxonomy<br />
Held August 14th - 18th 2006, Petropolis, Brazil. I assisted in planning the workshop<br />
and was responsible for the computer exercises. See also:<br />
http://www.cbs.dtu.dk/courses/brazilworkshop/programme.php<br />
A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services<br />
Work package D5.2.X2. Held February 6th - 8th 2008, CBS. Responsible for computer exercises<br />
and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2008-02-06/<br />
A.1.5 EMBRACE Workshop on Bioinformatics of Immunology<br />
Work package D5.2.6. Held January 24th - 26th 2007, CBS. Responsible for computer exercises<br />
and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2007-01-24/<br />
A.1.6 EMBRACE 3rd AGM: Implementation of web services<br />
Presentation held April 23rd 2007 at the CNRS Institute of Biology and Chemistry of Proteins<br />
in Lyon, France.<br />
A.1.7 EMBRACE Workshop on Perl, SQL and Web Services<br />
Scheduled for November 16th - 20th 2009. See also:<br />
http://www.cbs.dtu.dk/courses/embrace/2009-11-16/<br />
A.2 Workshops and meetings<br />
A.2.1 EMBRACE Workshop: SOAP web services<br />
April 2006, Bergen, Norway<br />
A.2.2 EUCOMM Bioinformatics Training Course<br />
February 2007, Hinxton, United Kingdom<br />
A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences<br />
March 2007, Uppsala, Sweden<br />
A.2.4 EMBRACE 3rd Annual General Meeting<br />
April 2007, Lyon, France<br />
A.2.5 EMBRACE Workshop: Deploying Web Services for Biological<br />
Sequence Annotation<br />
May 2007, Geneva, Switzerland<br />
A.2.6 EMBRACE 4th Annual General Meeting<br />
April 2008, Heidelberg, Germany<br />
A.2.7 Technical discussion of EMBRACE registry<br />
June 2008, Amsterdam, Holland<br />
A.2.8 EMBRACE meeting: Discussion of standard data types<br />
January 2009, Bergen, Norway<br />
A.3 Conferences<br />
A.3.1 Conference: Metagenomics, July 2007, San Diego, U.S.A.<br />
Binnewies TT, Hallin PF, Sellami N, Ussery DW. Prediction of Pathogenicity Networks in<br />
Bacterial Genomes<br />
A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington,<br />
U.S.A.<br />
Poster: Hallin PF and Binnewies TT. Gene organization of RNA genes and secretion<br />
system components of the Sargasso Sea environmental samples<br />
Appendix B<br />
Appendix: Ph.D. study plan<br />
Technical University of Denmark (Danmarks Tekniske Universitet), AFI, PhD programme<br />
September 2005<br />
The study plan below has been accepted by the student and the supervisor<br />
Principal supervisor's signature, extension no., student's signature<br />
PhD study plan<br />
Name of PhD student: Peter Fischer Hallin<br />
Cpr.-nr.: 160877 2053<br />
PhD programme: Bioinformatics<br />
Department: BioCentrum<br />
Start date: March 1 2006<br />
End date: February 2009<br />
Principal supervisor: Associate professor David W. Ussery<br />
(Title, name, department, phone)<br />
BioCentrum-DTU, Technical University of Denmark,<br />
Building 301, DK-2800 Lyngby, Denmark<br />
E-mail address: dave@cbs.dtu.dk<br />
Phone (direct): (+45) 45 25 24 88<br />
Co-supervisor: Guest Researcher Gertrude Maria Wassenaar<br />
(Title, name, institution/company)<br />
BioCentrum-DTU, Technical University of Denmark,<br />
Building 301, DK-2800 Lyngby, Denmark<br />
E-mail address: trudy@cbs.dtu.dk<br />
Phone (direct): (+45) 45 25 24 77<br />
Date: 18-11-2007<br />
Title of the study: DNA Structural Analysis and Transcript Prediction in Prokaryotic<br />
genomes<br />
Main topic of the study:<br />
The goal of this project is to obtain a better understanding of the structural<br />
mechanisms that are involved in the initiation of transcription of DNA in<br />
prokaryotic genomes and to use this information to make better and more consistent<br />
transcript predictions. We have presented a database (Hallin and Ussery 2004)<br />
which holds several kinds of information for each of the more than 300 fully<br />
sequenced prokaryotic genomes that are currently available. Different research<br />
groups have made efforts to gather sequence data and analyses of the fully<br />
sequenced microbial genomes that are being published.<br />
Currently we rely on the authors' annotation of genome sequences when<br />
comparative genomics is applied to our data sets. However, different authors<br />
use different tools, approaches and criteria during the annotation process. There<br />
are examples of genomes that are predicted to be 50-100% over-annotated<br />
(Skovgaard et al. 2001). Once reliable and automated processes for predicting<br />
transcriptomes are established, comparative analysis can be applied to the entire<br />
collection of organisms. It is envisioned that the users of our website will be able to<br />
browse any piece of DNA interactively to look for structural properties<br />
and repeats.<br />
_________________<br />
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001). On the<br />
total number of genes and their length distribution in complete<br />
microbial genomes. Trends Genet. 17:425-8.<br />
Hallin PF, Ussery DW (2004). CBS Genome Atlas<br />
Database: A dynamic storage for bioinformatic results and<br />
sequence data. Bioinformatics 20:3682-3686.<br />
(Describe here the content, aims, and means of the scientific part of the project. If the description exceeds one A4 page,<br />
give a short summary here with reference to the full description, which is enclosed as an appendix.)<br />
External<br />
research stay<br />
Professor Craig John Benham, University of California, Davis.<br />
Benham's research focuses on mathematical modelling of DNA<br />
destabilization and prediction of the opening of the DNA molecule<br />
during a transcription event. His strong mathematical approach is<br />
novel and would contribute significantly to our prediction methods<br />
and could possibly help explain biological and experimental<br />
results. The idea is that Craig Benham's calculations will be<br />
integrated into the prediction algorithms that are a major topic of my<br />
project.<br />
A 12-week internship in Craig Benham's lab is scheduled for October-December<br />
to integrate SIDD predictions (Stress Induced DNA<br />
Duplex Destabilization) with CBS databases and to prepare 1-2<br />
manuscripts on SIDD measures on a global prokaryotic scale.<br />
(List here the research environments outside DTU where the PhD student is planned to stay. If<br />
specific agreements have been made, state this. For each stay, give the estimated duration (e.g. in weeks), and state the<br />
total time spent on external stays.)<br />
Course part:<br />
Courses at DTU<br />
External courses<br />
Courses credited upon<br />
enrolment:<br />
Biological Sequence Analysis, PhD, 12 ECTS [OK]<br />
27802 Metabolic Engineering and Systems Biology, PhD, 5 ECTS, F1A<br />
27725 Globale regulatoriske netværk i mikroorganismer (Global regulatory networks in microorganisms), MSc, 5 ECTS, F2B<br />
27617 Protein structure and computational biology, MSc, 5 ECTS, F5A<br />
27041 Introduction to Systems Biology, MSc, 5 ECTS, E3A<br />
(For courses not listed in the course catalogue, a description of the academic content must be enclosed. List here<br />
the expected course and educational activities of the study. For each part, state the estimated number of ECTS points, which<br />
in total must correspond to approx. 30 ECTS points. 30 ECTS points correspond to approx. 840 hours of work.)<br />
Dissemination part (incl.<br />
mandatory work):<br />
I have spent a total of about a month preparing and assisting<br />
in computer exercises for the CBS course Comparative Microbial<br />
Genomics and Taxonomy (Petropolis, Brazil, Aug. 2006,<br />
http://www.cbs.dtu.dk/courses/brazilworkshop) and preparing<br />
and giving talks at several meetings.<br />
Exercises in the course "Biological Sequence Analysis" (CBS-DTU)<br />
1 hr presentation: Modern computer tools for the biosciences<br />
(Uppsala, Sweden)<br />
Presentation: EMBRACE workshop on<br />
bioinformatics of immunology (CBS-DTU)<br />
Presentation: Web Services implementation at CBS, Third Annual General Meeting of<br />
EMBRACE (Lyon, France).<br />
I plan to put in an additional month of work preparing and giving<br />
presentations and lectures for a one-week workshop to be<br />
held at CBS in February 2008:<br />
http://www.cbs.dtu.dk/courses/embrace/2008-02-06/programme.php<br />
Lectures and exercises will be adjusted to cover<br />
promoter analysis using the EMBRACE technology. We intend to<br />
use graphical as well as statistical approaches to characterize<br />
promoter signatures of prokaryotic genomes. These are core topics<br />
of the thesis.<br />
Poster presentation at Metagenomics 2007, San Diego: "Gene<br />
organization of RNA genes and secretion system components of<br />
the Sargasso Sea environmental samples"<br />
(List here the expected dissemination activities of the study as well as the mandatory work imposed. For each part, state<br />
the estimated time (e.g. in weeks), which in total must correspond to 3 months.)<br />
Time schedule:<br />
1st half year (March 06 – August 06)<br />
Publication on rRNA gene predictor (RNAmmer). Comparative Microbial Genomics workshop in<br />
Brazil. Meetings and work for CBS in connection with EMBRACE.<br />
2nd half year (September 06 – Feb 07)<br />
Lactococcus microarray project with Chr. Hansen. Book chapter on Comparative Genomics, editor<br />
Dawn Field. EMBRACE meetings and workshops.<br />
3rd half year (March 07 – August 07)<br />
Follow-up article on RNAmmer and rRNA/tRNA operons.<br />
4th half year (September 07 – Feb 08)<br />
(Oct-Dec) Internship, Craig Benham: Davis, California.<br />
Include work from Craig Benham's lab in the RNAmmer follow-up manuscript and prepare SIDDbase<br />
application note and article on SIDD measures in prokaryotic promoter sequences.<br />
Prepare manuscripts<br />
5th half year (March 08 – August 08)<br />
Course: Globale regulatoriske netværk i mikroorganismer (F2B)<br />
Course: Protein structure and computational biology (F5A)<br />
Course: 1 week May/June: 27802 Metabolic Engineering and Systems Biology<br />
Thesis writing and manuscript preparation<br />
6th half year (September 08 – Feb 09)<br />
Course: Introduction to Systems Biology<br />
Thesis writing<br />
(The time schedule should contain dates/periods for all significant activities in connection with the PhD programme.<br />
It is important that the time schedule is complete. It may be enclosed as an appendix.)<br />
Brief description of the<br />
form of supervision:<br />
Among other things, it may be agreed how often supervision takes place in the form of meetings or written feedback<br />
Patents/innovation: Is it likely that technologies or software that can be patented<br />
will be developed during the project?<br />
If yes<br />
Yes x No<br />
Brief account of the methods used to train the PhD student in the innovation-related<br />
aspects<br />
Other:<br />
(Other matters of relevance to the assessment of the study plan may be stated here.)<br />
Appendix C<br />
Appendix: Courses<br />
C.1 Global regulatory networks in microorganisms<br />
DTU course 27725, ECTS 5, MSc level.<br />
C.2 Protein Structure and Computational Biology<br />
DTU course 27617, ECTS 5, MSc level.<br />
C.3 Biological Sequence Analysis<br />
DTU course 27803, ECTS 12.5, PhD level.<br />
C.4 Comparative Genome Analysis<br />
Copenhagen University, Department of Biology, ECTS 5.<br />
C.5 Doctoral seminar on business economics for academic<br />
entrepreneurs<br />
Aarhus School of Business, University of Aarhus, ECTS 3, PhD level.<br />
C.6 ECTS summary<br />
Total ECTS is 30.5, of which 15.5 are at PhD level.<br />
Appendix D<br />
Appendix: Software<br />
D.1 fetchgbk manual<br />
SYNOPSIS<br />
fetchgbk - downloads genbank/refseq records in genbank format, specifying either<br />
accession number, accession ranges, or project id.<br />
fetchgbk (-h) (-p [PROJECT_ID]) (-a [ACCESSION/RANGE]) (-d [DATABASE])<br />
DESCRIPTION<br />
When defining the project id, using the -p option, option -a is ignored and all<br />
accession numbers for all segments of that project are fetched from the project.<br />
When using the -p option, the -d option is in effect, allowing you to control which<br />
database to use (refseq/genbank).<br />
When using the -a option, the program will retrieve only that accession (or range<br />
of accessions). It will ignore the -d option. The program prints genbank format<br />
data to stdout. Option -l is used to show only a TAB separated list showing accession<br />
and segment name.<br />
VERSION<br />
2008-08-15: version 1.0 created / pfh<br />
-p [number]<br />
The NCBI Genome Project number, like what can be found here:<br />
http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. This option overrules the -a option.<br />
-a [accession no. or accession number range]<br />
When using this option, the program is instructed to download only this record (or<br />
these records, if a range is defined). The -d option is ignored.<br />
-d [genbank/refseq]<br />
Choice of database. Has effect only when using option -p.<br />
-l<br />
Boolean, instructing the program not to show genbank records, but only list segment<br />
names for each accession.<br />
-h<br />
Showing this help page<br />
EXAMPLES<br />
fetchgbk -p 19391 -d refseq | grep LOCUS<br />
fetchgbk -p 19391 -d genbank | grep LOCUS<br />
fetchgbk -a NZ_ABIZ00000000 | grep LOCUS<br />
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038 | grep LOCUS<br />
fetchgbk -a CP000896 | grep LOCUS<br />
fetchgbk -p 12997 -d refseq -l<br />
AUTHOR<br />
Peter Fischer Hallin, August 2008, pfh@cbs.dtu.dk<br />
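To make the accession range notation of the -a option concrete, the sketch below expands a range such as the one in the examples into individual accession numbers. It is an illustration only and not taken from fetchgbk: the function name is hypothetical, and it assumes that both ends of a range share the same prefix and a zero-padded numeric suffix of equal width.<br />

```python
import re

def expand_accession_range(spec):
    """Expand 'START-END' into individual accessions; a single accession is returned unchanged."""
    if "-" not in spec:
        return [spec]
    start, end = spec.split("-", 1)
    pattern = re.compile(r"^(.*?)(\d+)$")  # prefix followed by the trailing run of digits
    m_start, m_end = pattern.match(start), pattern.match(end)
    if not m_start or not m_end:
        raise ValueError(f"cannot interpret range: {spec}")
    prefix, first, last = m_start.group(1), m_start.group(2), m_end.group(2)
    # Assumption: same prefix, same zero-padded width, ascending order.
    if prefix != m_end.group(1) or len(first) != len(last) or int(first) > int(last):
        raise ValueError(f"inconsistent range: {spec}")
    width = len(first)
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]

records = expand_accession_range("NZ_ABIH01000001-NZ_ABIH01000038")
```

Under these assumptions the range from the examples expands to 38 accessions, from NZ_ABIH01000001 to NZ_ABIH01000038.<br />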
D.2 Sample output from queryGenomes<br />
As output from listing 2.3.<br />
#kingdom	phyla	pid	organism	genbank	refseq	segment	color	ATCONTENT	NGENES<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178379	NC_011312	Chromosome 1	ffdd44	0.6077	3069<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178380	NC_011313	Chromosome 2	ffdd44	0.6176	1105<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178382	NC_011314	Plasmid pVSAL320	ffdd44	0.6271	32<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178381	NC_011311	Plasmid pVSAL840	ffdd44	0.5993	72<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178383	NC_011315	Plasmid pVAL43	ffdd44	0.619	3<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178384	NC_011316	Plasmid pVSAL43	ffdd44	0.6439	3<br />
Bacteria	Deltaproteobacteria	9637	Bdellovibrio bacteriovorus HD100	BX842601	NC_005363	Chromosome	ffdd44	0.4935	3583<br />
Bacteria	Gammaproteobacteria	28329	Cellvibrio japonicus Ueda107	CP000934	NC_010995	Chromosome	ffdd44	0.4801	3754<br />
Bacteria	Bacteroidetes/Chlorobi	12607	Chlorobium phaeovibrioides DSM 265	CP000607	NC_009337	Chromosome	ffbb55	0.4701	1753<br />
Bacteria	Deltaproteobacteria	29493	Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774	CP001358	NC_011883	Chromosome	ffdd44	0.4193	2356<br />
Bacteria	Deltaproteobacteria	329	Desulfovibrio desulfuricans subsp. desulfuricans str. G20	CP000112	NC_007519	Chromosome	ffdd44	0.4216	3775<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010906	NC_012795	Plasmid pDMC2	ffdd44	0.6283	10<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010904	NC_012796	Chromosome	ffdd44	0.3723	4629<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010905	NC_012797	Plasmid pDMC1	ffdd44	0.4197	65<br />
Bacteria	Deltaproteobacteria	29541	Desulfovibrio salexigens DSM 2638	CP001649	NC_012881	Chromosome	ffdd44	0.5291	3807<br />
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000528	NC_008741	Plasmid pDVUL01	ffdd44	0.3431	150<br />
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000527	NC_008751	Chromosome	ffdd44	0.3699	2941<br />
Bacteria	Deltaproteobacteria	27731	Desulfovibrio vulgaris str. Miyazaki F	CP001197	NC_011769	Chromosome	ffdd44	0.3289	3180<br />
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017285	NC_002937	Chromosome	ffdd44	0.3686	3379<br />
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017286	NC_005863	Megaplasmid	ffdd44	0.3432	152<br />
Bacteria	Other Bacteria	30733	Thermodesulfovibrio yellowstonii DSM 11347	CP001147	NC_011296	Chromosome	888888	0.6587	2033<br />
Bacteria	Gammaproteobacteria	29177	Thioalkalivibrio sp. HL-EbGR7	CP001339	NC_011901	Chromosome	ffdd44	0.3494	3283<br />
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001233	NC_012578	Chromosome I	ffdd44	0.5217	2650<br />
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001234	NC_012580	Chromosome II	ffdd44	0.5296	1043<br />
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001485	NC_012668	Chromosome 1	ffdd44	0.5248	2770<br />
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001486	NC_012667	Chromosome 2	ffdd44	0.5325	1004<br />
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003852	NC_002505	Chromosome I	ffdd44	0.523	2736<br />
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003853	NC_002506	Chromosome II	ffdd44	0.5309	1092<br />
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000626	NC_009456	Chromosome 1	ffdd44	0.5312	1133<br />
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000627	NC_009457	Chromosome 2	ffdd44	0.5222	2742<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000020	NC_006840	Chromosome I	ffdd44	0.6104	2575<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000021	NC_006841	Chromosome II	ffdd44	0.6298	1172<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000022	NC_006842	Plasmid pES100	ffdd44	0.6158	55<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001133	NC_011186	Chromosome II	ffdd44	0.6275	1254<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001134	NC_011185	Plasmid pMJ100	ffdd44	0.652	195<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001139	NC_011184	Chromosome I	ffdd44	0.6112	2590<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000791	NC_009777	Plasmid pVIBHAR	ffdd44	0.5621	120<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000789	NC_009783	Chromosome I	ffdd44	0.5445	3570<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000790	NC_009784	Chromosome II	ffdd44	0.5473	2374<br />
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000031	NC_004603<br />
C h r o m o s o m e I f f d d 4 4 0 . 5 4 6 1 3080<br />
42 B a c t e r i a G a m m a p r o t e o b a c t e r i a 360 V i b r i o p a r a h a e m o l y t i c u s R I M D 2210633 B A 0 0 0 0 3 2 N C _ 0 0 4 6 0 5<br />
C h r o m o s o m e I I f f d d 4 4 0 . 5 4 6 5 1752<br />
43 B a c t e r i a G a m m a p r o t e o b a c t e r i a 32815 V i b r i o s p l e n d i d u s L G P 3 2 F M 9 5 4 9 7 3 N C _ 0 1 1 7 4 4 C h r o m o s o m e 2<br />
f f d d 4 4 0 . 5 6 3 6 1486<br />
44 B a c t e r i a G a m m a p r o t e o b a c t e r i a 32815 V i b r i o s p l e n d i d u s L G P 3 2 F M 9 5 4 9 7 2 N C _ 0 1 1 7 5 3 C h r o m o s o m e 1<br />
f f d d 4 4 0 . 5 5 9 6 2950<br />
45 B a c t e r i a G a m m a p r o t e o b a c t e r i a 349 V i b r i o v u l n i f i c u s C M C P 6 A E 0 1 6 7 9 5 N C _ 0 0 4 4 5 9 C h r o m o s o m e I<br />
f f d d 4 4 0 . 5 3 5 5 2973<br />
46 B a c t e r i a G a m m a p r o t e o b a c t e r i a 349 V i b r i o v u l n i f i c u s C M C P 6 A E 0 1 6 7 9 6 N C _ 0 0 4 4 6 0 C h r o m o s o m e I I<br />
f f d d 4 4 0 . 5 2 8 8 1565<br />
47  Bacteria  Gammaproteobacteria  1430  Vibrio vulnificus YJ016  BA000037  NC_005139  Chromosome I  ffdd44  0.5359  3262
48  Bacteria  Gammaproteobacteria  1430  Vibrio vulnificus YJ016  BA000038  NC_005140  Chromosome II  ffdd44  0.5279  1697
49  Bacteria  Gammaproteobacteria  1430  Vibrio vulnificus YJ016  AP005352  NC_005128  Plasmid pYJ016  ffdd44  0.5507  69
D.3 BLASTatlas configurations
D.3.1 file blast.cfg
legend: B. ambifaria AMMD
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/13490.fsa

legend: B. ambifaria MC40-6
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/17411.fsa

legend: B. cenocepacia AU 1054
program: blastp
color: 101010_080000
range: 0.0,0.8
source: files/13919.fsa

legend: B. cenocepacia HI2424
program: blastp
color: 101010_080000
range: 0.0,0.8
source: files/13918.fsa

legend: B. cenocepacia J2315
program: blastp
color: 101010_080000
range: 0.0,0.8
source: files/339.fsa

legend: B. cenocepacia MC0-3
program: blastp
color: 101010_080000
range: 0.0,0.8
source: files/17929.fsa

legend: B. glumae BGR1
program: blastp
color: 101010_050505
range: 0.0,0.8
source: files/33901.fsa

......
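The listing above is a sequence of blocks of key: value lines, one block per comparison genome, separated by blank lines. A minimal Perl sketch of how such a file could be read into one hash per block is given here. It is not the parser used by BLASTatlas; the function name parse_atlas_cfg is hypothetical, and the only assumptions are the key: value syntax and the blank-line separator visible in the listing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BLASTatlas): split a blast.cfg-style
# text into one hash reference per block. Assumes "key: value" lines and
# blank lines between blocks, as in the listing.
sub parse_atlas_cfg {
    my ($text) = @_;
    my @blocks;
    my %cur;
    foreach my $line ( split /\n/, $text ) {
        if ( $line =~ /^\s*$/ ) {
            # a blank line closes the current block
            push @blocks, +{%cur} if %cur;
            %cur = ();
            next;
        }
        # split on the first colon only, so values may contain colons
        if ( $line =~ /^\s*(\w+)\s*:\s*(.*?)\s*$/ ) {
            $cur{$1} = $2;
        }
    }
    push @blocks, +{%cur} if %cur;
    return @blocks;
}

my $cfg = <<'END';
legend: B. ambifaria AMMD
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/13490.fsa

legend: B. ambifaria MC40-6
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/17411.fsa
END

my @blocks = parse_atlas_cfg($cfg);
printf "%d blocks; first source: %s\n", scalar(@blocks), $blocks[0]{source};
```

Because the pattern splits on the first colon only, a value such as the 9:10 range in custom.cfg would be kept intact.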
D.3.2 file custom.cfg
legend: SIDD @ -0.035
color: 000010_101010
range: 9:10
boxfilter: 5000
source: gunzip -c BX571966-57a2f2c2e11ca0dd8cd74493d667d4d6-3173005.sidd--0.035-c-10-c.out.gz | cut -f4 |
D.4 BLASTmatrix example
This Perl script constructs an XML configuration file by looking up the Genome Atlas
Database through MySQL. It queries for all Campylobacter strains currently available.
#!/usr/bin/perl
use strict;

my $SACO_EXTRACT = "/usr/cbs/bio/bin/linux64/saco_extract";
my %colors = ( lari => '0,104,139', jejuni => '0,139,69', hominis => '66,66,111', fetus => '139,101,8', curvus => '140,23,23', concisus => '205,173,0' );

my $sources = ""; # holds the sources part of the configuration - replace into DATA section

open ORGANISM, "mysql -N -B -e \"select pid,organism_name from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name\" |" or die $!;
while (<ORGANISM>) {
    chomp;
    my ( $pid, $organism_name ) = split /\t/;
    warn "$organism_name (pid $pid)\n";
    my ( $genus, $species, $strain ) = ( $1, $2, $3 ) if $organism_name =~ /(\S+) (\S+) (.*)/;
    my $color = "100,100,100";
    $color = $colors{$species} if defined $colors{$species};
    $sources .= "
<entry>
<source>./$pid.proteins.fsa</source>
<title>$genus $species</title>
<subtitle>$strain</subtitle>
<group>$species</group>
<color>$color</color>
</entry>
";
    open PID, "> $pid.proteins.fsa" or die $!;
    open ACCESSION, "mysql -N -B -e \"select genbank,segment_name from genomeatlas3_cur.genbank_complete_seq where pid = $pid and segment_name not like 'genome%'\" |";
    while (<ACCESSION>) {
        chomp;
        my ( $genbank, $segment_name ) = split /\t/;
        chomp $genbank;
        warn "adding $segment_name (accession $genbank)\n";
        my $gbk = "/home/databases/genomeatlasdb-3.0_cur/data/$genbank/$genbank.gbk";
        open PROT, "$SACO_EXTRACT -I genbank -O fasta -t < $gbk 2> /dev/null |" or die $!;
        while (<PROT>) {
            print PID;
        }
        close PROT;
    }
    close ACCESSION;
    close PID;
}
close ORGANISM;
warn "dumping xml config on stdout ...\n";
while (<DATA>) {
    s//$sources/g;
    print;
}

__DATA__

Proteome comparison of Campylobacter species
-

auto
auto

0.9
0.9
0.9

0.975
0
0

auto
auto

0.9
0.9
0.9

0
0.975
0

D.5 iscan source code
#!/usr/bin/perl
use strict;

my $pwm;
my %matrix;
my $spacer;
my @PWM;
my $pi = 3.14159265;

# read the model files # include supported recursively (NO CHECK FOR LOOPS!)
my %setup;
my @LINES;
if ( defined $ARGV[0] ) {
    @LINES = read_mod( $ARGV[0] );
} else {
    while (<DATA>) {
        print;
    }
    close DATA;
    die "no model provided. template model dumped\n";
}

my $pwmid = -1;
print "# this is the model:\n";
foreach (@LINES) {
    print "# $_\n";
    if ( /^\[pwm\]\s*=\s*(.*)/ ) {
        $pwmid++;
        push @PWM, "$pwmid:$1";
    }
    my $pwm = $PWM[$#PWM];
    $setup{$pwm}{$1} = $2 if /^(\w+)\s*=\s*([\.\-0-9]+)/;
    next unless /^\[([ATGC]+)\]/;
    my @F = split /[\s\t]+/;
    shift @F;
    err("pwm not defined") unless defined $pwm;
    @{ $matrix{$pwm}{$1} } = @F;
    $matrix{$pwm}{count}[$_] += $F[$_] foreach ( 0 .. $#F );
}

# make a lookup table of distance information measure
my %SPACER_LOOKUP;
foreach my $spacer ( keys %setup ) {
    my $min = $setup{$spacer}{min};
    my $max = $setup{$spacer}{max};
    my $center = $setup{$spacer}{center};
    printf "# parsing accessibility for $spacer (min=$min, max=$max, center=$center)\n";
    my $n = 0;
    $n += 1 + cos( ( ( 2 * $pi ) / 10.6 ) * ( $_ - $center ) ) foreach ( $min .. $max );
    foreach my $d ( $min .. $max ) {
        if ( $center eq "" ) {
            $SPACER_LOOKUP{$d}{$min}{$max}{$center} = 0;
        } else {
            $SPACER_LOOKUP{$d}{$min}{$max}{$center} = -( -log( ( 1 + cos( ( ( 2 * $pi ) / 10.6 ) * ( $d - $center ) ) ) / $n ) / log(2) );
        }
        printf "# d=%d, score=%0.2f\n", $d, $SPACER_LOOKUP{$d}{$min}{$max}{$center};
    }
}

# compute matrix based on frequencies
foreach my $pwm ( keys %matrix ) {
    print "# preparing matrix '$pwm'\n";
    foreach my $letter (qw/A T G C/) {
        print "# [$letter]";
        foreach my $i ( 0 .. $#{ $matrix{$pwm}{A} } ) {
            my $i1 = "-";
            my $i2 = sprintf( '%5s', '-' );
            if ( $matrix{$pwm}{$letter}[$i] > 0 ) {
                $i1 = 2 + log( $matrix{$pwm}{$letter}[$i] / $matrix{$pwm}{count}[$i] ) / log(2) - 0;
                $i2 = sprintf( '%5s', sprintf( '%0.2f', $i1 ) );
            }
            $matrix{$pwm}{$letter}[$i] = $i1 if $i1 ne "-";
            print "\t$i2";
        }
        print "\n";
    }
}

# loop over all sequences in input
my @inp = &read_fasta;
foreach my $s ( 0 .. $#inp ) {
    my $seq = $inp[$s]->{seq};
    printf "# SEQUENCE %s\n", $inp[$s]->{id};
    printf "# %d bp\n", length($seq);
    my %LEN;
    my %BIT;
    foreach my $pwm (@PWM) {
        print "# generating bit scores for matrix '$pwm'\n";
        @{ $BIT{$pwm} } = &scan( $seq, %{ $matrix{$pwm} } );
        $LEN{$pwm} = scalar( @{ $matrix{$pwm}{A} } );
        printf "# %d elements in array\n", scalar( @{ $BIT{$pwm} } );
    }
    foreach my $p ( 0 .. ( length($seq) - $LEN{ $PWM[0] } ) ) {
        printf "# considering position %d (root model)\n", $p + 1;
        # find the score of the initial matrix, for this given position
        my $w = $setup{ $PWM[0] }{weight};
        my $fsi = $BIT{ $PWM[0] }[$p] * $w;
        my $offset = $p;
        my $signal = substr( $seq, $offset, $LEN{ $PWM[0] } );
        my $s = sprintf "%s\t%0.2f", $signal, $fsi;

        foreach my $pwm_index ( 1 .. $#PWM ) {
            my $pwm = $PWM[$pwm_index];
            my $w = $setup{$pwm}{weight};

            # get the spacing details for the upstream spacer
            my $prev_pwm = $PWM[ $pwm_index - 1 ];

            my ( $min, $max, $center ) = ( $setup{$prev_pwm}{min},
                $setup{$prev_pwm}{max}, $setup{$prev_pwm}{center} );

            my $opt_spacer;
            my $opt_unit_score;

            # calculate unit scores for each of the spacing configurations
            # A unit is the spacer and the following matrix. We search for the
            # spacer giving rise to the highest unit score

            printf "# adjusting spacer downstram of '$pwm'\n";

            foreach my $spacer ( $min .. $max ) {
                # don't continue if the offset goes below zero ...
                last if $offset - $LEN{$pwm} - $spacer < 0;
                next if $BIT{$pwm}[ $offset - $LEN{$pwm} - $spacer ] * $w < $setup{$pwm}{threshold} and defined $setup{$pwm}{threshold};

                # if no optimal spacer is declared yet (e.g. because this is
                # the first round) then do it now
                $opt_spacer = $spacer unless defined $opt_spacer;
                my $test_unit_score = $BIT{$pwm}[ $offset - $LEN{$pwm} - $spacer ] * $w + $SPACER_LOOKUP{$spacer}{$min}{$max}{$center};
                printf "# spacer: %d, score: %0.1f (%0.1f + %0.1f)\n", $spacer, $test_unit_score, $BIT{$pwm}[ $offset - $LEN{$pwm} - $spacer ], $SPACER_LOOKUP{$spacer}{$min}{$max}{$center};
                $opt_unit_score = $test_unit_score unless defined $opt_unit_score;
                if ( $test_unit_score > $opt_unit_score ) {
                    $opt_spacer = $spacer;
                    $opt_unit_score = $test_unit_score;
                }
            } # foreach my $spacer

            # offset is where the current pwm starts
            $offset = $offset - $LEN{$pwm} - $opt_spacer;

            printf "# new offset %d\n", $offset;

            if ( !defined $opt_unit_score ) {
                printf "# unable to determine spacer\n";
                $s .= sprintf "\t-\t%s\t-", ( '-' x $LEN{$pwm} );
                next;
            } else {
                printf "# spacer $opt_spacer chosen, unit '%s' gives score %0.1f\n", $pwm, $opt_unit_score;
                $fsi += $opt_unit_score;
                my $signal = substr( $seq, $offset, $LEN{$pwm} );
                $s .= sprintf "\t%d\t%s\t%0.2f", $opt_spacer, $signal, $fsi;
            }
        } # foreach my $pwm_index
        # print the final bit score
        printf "%d\t%0.2f\t%s\t\n", ( $p + 1 ), $fsi, $s;
    } # my $p = 0
} # for ($s = 0 ....


#######################################
# HELPER FUNCTIONS
#######################################


# scan using a matrix of information
sub scan {
    my @a;
    my ( $s, %m ) = @_;
    my $ma = $#{ $m{A} };
    foreach my $p ( 0 .. ( length($s) - $#{ $m{A} } - 1 ) ) {
        my $Ri = 0;
        $Ri += $m{ substr( $s, $p + $_, 1 ) }[$_] foreach ( 0 .. $ma );
        push @a, $Ri;
    }
    # return a list having n-l+1 elements, where n is the sequence length and
    # l is the matrix size (for the -10 hexamer, l=6)
    return @a;
}

###############################################################
# spacer bit score calculations coordinates are shifted 6bp
###############################################################

sub read_mod {
    my @ret;
    my $fn = $_[0];
    my $i;
    open $i, $fn or err("unable to open file '$fn': $!\n");
    while ( readline($i) ) {
        chomp;
        if ( /^#\s*include\s*(.*)/ ) {
            my @a = read_mod($1);
            push @ret, @a;
        } else {
            next if /^[\s\#]+/;
            next unless /^\S+/;
            push @ret, $_;
        }
    }
    close $i;
    return @ret;
}

sub read_fasta {
    my @fasta; # contains all
    my $id = -1;
    while (<STDIN>) {
        chomp;
        if ( /^>(.*)/ ) {
            $id++;
            $fasta[$id]->{id} = $1;
        } elsif ( /^([A-Za-z]+)/ ) {
            $fasta[$id]->{seq} .= $1;
        }
    }
    return @fasta;
}

sub err {
    print $_[0];
    exit 1;
}
exit 0;

__DATA__
[pwm]=-10 region
weight=1
[A] 0 63 0 63 63 0
[T] 63 0 63 0 0 63
[G] 0 0 0 0 0 0
[C] 0 0 0 0 0 0
[spacer]
min=13
center=16
max=19
[pwm]=-35 region
weight=1
[A] 0 0 0 0 0 36
[T] 63 63 0 54 0 9
[G] 0 0 63 0 18 9
[C] 0 0 0 9 45 9
[spacer]
min=0
center=3
max=6
[pwm]=UP
weight=0.5
[A] 18 0 45 27 45 54 54 54 18 9 45 9 2 9 18 45 54 45 9 2 0 9
[T] 45 11 0 0 18 0 9 9 36 45 18 54 45 45 27 9 9 18 54 54 63 17
[G] 0 9 18 36 0 0 0 0 9 9 0 0 0 9 9 0 0 0 0 7 0 0
[C] 0 43 0 0 0 9 0 0 0 0 0 0 16 0 9 9 0 0 0 0 0 37
[spacer]
min=-4
center=2
max=4
[pwm]=FIS
weight=0.5
threshold=0
[A] 26 27 16 0 18 9 0 29 54 54 54 45 42 3 2 36 7 2 18 22 16
[T] 36 36 45 0 0 38 43 0 0 0 9 0 18 45 0 0 0 0 1 0 45
[G] 1 0 2 63 18 7 20 34 9 9 0 18 3 13 45 0 54 0 44 41 0
[C] 0 0 0 0 27 9 0 0 0 0 0 0 0 2 16 27 2 61 0 0 2
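The two scoring ingredients of the iscan listing can be isolated in a few lines. The sketch below is a simplified illustration rather than a replacement for the script: it turns the -10 region counts of the template model into bit scores (2 + log2 of the column frequency, with zero-count cells left at 0 as in the listing) and computes the spacer term as the log2 of a raised cosine with a 10.6 bp period, normalised over the allowed spacer lengths. The function names window_score and spacer_scores are illustrative and do not appear in iscan.

```perl
use strict;
use warnings;

# Counts for the -10 region, copied from the template model.
my %counts = (
    A => [ 0,  63, 0,  63, 63, 0 ],
    T => [ 63, 0,  63, 0,  0,  63 ],
    G => [ 0,  0,  0,  0,  0,  0 ],
    C => [ 0,  0,  0,  0,  0,  0 ],
);
my $width = scalar @{ $counts{A} };

# Column totals over the four bases.
my @total = (0) x $width;
foreach my $base ( keys %counts ) {
    $total[$_] += $counts{$base}[$_] foreach 0 .. $width - 1;
}

# bits = 2 + log2(count / column total); cells with a zero count stay at 0,
# which mirrors what the listing does.
my %bits;
foreach my $base ( keys %counts ) {
    foreach my $i ( 0 .. $width - 1 ) {
        my $c = $counts{$base}[$i];
        $bits{$base}[$i] = $c > 0 ? 2 + log( $c / $total[$i] ) / log(2) : 0;
    }
}

# Sum the per-position bits of the window starting at $start (0-based).
sub window_score {
    my ( $seq, $start ) = @_;
    my $score = 0;
    $score += $bits{ substr( $seq, $start + $_, 1 ) }[$_] foreach 0 .. $width - 1;
    return $score;
}

# Spacer term: log2 of a raised cosine with a 10.6 bp period,
# normalised over all allowed spacer lengths min..max.
sub spacer_scores {
    my ( $min, $max, $center ) = @_;
    my $pi = 4 * atan2( 1, 1 );
    my ( %raw, %score );
    my $n = 0;
    foreach my $d ( $min .. $max ) {
        $raw{$d} = 1 + cos( 2 * $pi / 10.6 * ( $d - $center ) );
        $n += $raw{$d};
    }
    $score{$_} = log( $raw{$_} / $n ) / log(2) foreach $min .. $max;
    return %score;
}

printf "TATAAT window: %.2f bits\n", window_score( "GGTATAATGG", 2 );
my %sp = spacer_scores( 13, 19, 16 );
printf "spacer %d: %.2f bits\n", $_, $sp{$_} foreach 13 .. 19;
```

With these counts the consensus hexamer TATAAT scores 12 bits (six positions at 2 bits each). The spacer scores are all negative and peak at the centre length, so a unit score in the listing is the weighted matrix score reduced by a penalty that grows as the spacer departs from its preferred length.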
D.6 quasi_mktemp manual
NAME
    quasi_mktemp - create a template CBS Web Service implementation

SYNOPSIS
    perl quasi_mktempl [-n SERVICENAME] [-v VERSION] [-w WSNUMBER] (-f) (-remove) (-t TEMPLATENAME)

DESCRIPTION
    This script creates a functional template SOAP Web Service implementation under Quasi including
    a working example. The object types this service receives/generates are the CBS standard sequence
    data object / annotation data object.

    The following elements are created by the program:

    * WSDL file, with proper namespaces and operation(s)
    * An XSD included by the WSDL
    * A directory in /usr/opt/www/cgi-bin/CBS/soap/ws/quasi/ containing the Perl module (module.pm)
    * A directory in /usr/opt/www/pub/CBS/ws/ containing the XSD, WSDL and example files.
    * An entry in mysql.WebServices.services
    * An index.php and include.html located in /usr/opt/www/pub/CBS/ws/[SERVICENAME]

    To-do list, once you have created the template:

    [ ] Alter the WSDL so it contains the operations you need
    [ ] Alter the XSD so all operation data types are defined
    [ ] Alter the file module.pm and possibly wrapper.pl, located in /usr/opt/www/cgi-bin/soap/ws/quasi/[SERVICE]/[WS]/
    [ ] Alter the example so that it contains a relevant example for your service.
    [ ] Alter the include.html so that it describes the usage of the example script
    [ ] Once you are happy with the implementation, remove the flag "internal_only" from mysql.WebServices.services
        and change the desired description for your service (in field 'description')

OPTIONS
    -n SERVICENAME
        Case-sensitive service name, e.g. SignalP

    -v VERSION
        The version of the service in the form X.Y, e.g. 1.2

    -w WSNUMBER
        This is the implementation number for this service and version. The number
        starts at zero.

    -f
        Forces overwriting existing files

    -remove
        Removes all files pertaining to this service/version/implementation - be careful!

    -t TEMPLATE
        New templates can be installed. Use option -t list to list all templates

AUTHOR
    Peter Fischer Hallin, pfh@cbs.dtu.dk, September 2008

SEE ALSO
    /usr/opt/quaq/
    /usr/opt/www/cgi-bin/CBS/soap/ws/quasi.cgi

AUTHOR
    Peter Hallin 2008-09-15, pfh@cbs.dtu.dk
Bibliography
S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, & D. J.
Lipman (1997). ‘Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs.’ Nucleic Acids Res 25:3389–402.
B. F. Bauer, E. G. Kar, R. M. Elford, & W. M. Holmes (1988). ‘Sequence determinants
for promoter strength in the leuV operon of Escherichia coli.’ Gene 63:123–34.
J. Besemer, A. Lomsadze, & M. Borodovsky (2001). ‘GeneMarkS: a self-training method
for prediction of gene starts in microbial genomes. Implications for finding sequence
motifs in regulatory regions.’ Nucleic Acids Res 29:2607–18.
T. T. Binnewies, P. F. Hallin, H.-H. Staerfeldt, & D. W. Ussery (2005). ‘Genome Update:
proteome comparisons.’ Microbiology 151:1–4.
T. T. Binnewies, Y. Motro, P. F. Hallin, O. Lund, D. Dunn, T. La, D. J. Hampson,
M. Bellgard, T. M. Wassenaar, & D. W. Ussery (2006). ‘Ten years of bacterial genome
sequencing: comparative-genomics-based discoveries.’ Funct Integr Genomics 6:165–85.
E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies,
Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M.
Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum,
R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland, S. Davis,
N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy,
M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson,
T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri,
S. C. J. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox,
M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev,
W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky,
D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker,
R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde,
J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. Lindemeyer,
K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd,
R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T.
Weirauch, J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Srinivasan, W.-K. Sung, H. S.
Ooi, K. P. Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia,
S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano, C. Wyss, E. Cheung, T. G. Clark,
J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai,
J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel,
J. S. Mattick, P. Carninci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers,
J. Rogers, P. F. Stadler, T. M. Lowe, C.-L. Wei, Y. Ruan, K. Struhl, M. Gerste<strong>in</strong>, S. E.<br />
Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A.<br />
Wetterstr<strong>and</strong>, P. J. Good, E. A. Fe<strong>in</strong>gold, M. S. Guyer, G. M. Cooper, G. Asimenos,<br />
C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan,<br />
F. Pardi, T. Mass<strong>in</strong>gham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullik<strong>in</strong>, A. Ureta-<br />
Vidal, B. Paten, M. Ser<strong>in</strong>ghaus, D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone,<br />
S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D.<br />
Tr<strong>in</strong>kle<strong>in</strong>, Z. D. Zhang, L. Barrera, R. Stuart, D. C. K<strong>in</strong>g, A. Ameur, S. Enroth, M. C.<br />
Bieda, J. Kim, A. A. Bh<strong>in</strong>ge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. H. Lee,<br />
P. Ng, A. Shahab, A. Yang, Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley,<br />
D. Inman, M. A. S<strong>in</strong>ger, T. A. Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman,<br />
J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F.<br />
Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar,<br />
N. Heintzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld,<br />
S. F. Aldred, S. J. Cooper, A. Halees, J. M. Lin, H. P. Shulha, X. Zhang, M. Xu,<br />
J. N. S. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green, C. Wadelius, P. J. Farnham,<br />
B. Ren, R. A. Harte, A. S. Hinrichs, H. Trumbower, H. Clawson, J. Hillman-Jackson,<br />
A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol,<br />
C. P. Bird, P. I. W. de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Martin, B. E.<br />
Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert,<br />
M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X. Guan, N. F. Hansen,<br />
J. R. Idol, V. V. B. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C.<br />
Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley,<br />
H. Jiang, G. M. Weinstock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K.<br />
Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. Lindblad-Toh, E. S.<br />
Lander, M. Koriabine, M. Nefedov, K. Osoegawa, Y. Yoshinaga, B. Zhu, & P. J. de Jong<br />
(2007). ‘Identification and analysis of functional elements in 1% of the human genome by<br />
the encode pilot project.’ Nature 447:799–816.<br />
F. R. Blattner, G. r. Plunkett, C. A. Bloch, N. T. Perna, V. Burland, M. Riley, J. Collado-Vides,<br />
J. D. Glasner, C. K. Rode, G. F. Mayhew, J. Gregor, N. W. Davis, H. A.<br />
Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, & Y. Shao (1997). ‘The complete<br />
genome sequence of Escherichia coli k–12.’ Science 277:1453–62.<br />
A. J. t. Bokal, W. Ross, & R. L. Gourse (1995). ‘The transcriptional activator protein fis:<br />
Dna interactions and cooperative interactions with rna polymerase at the Escherichia<br />
coli rrnbp1 promoter.’ J Mol Biol 245:197–207.<br />
A. Bolshoy, P. McNamara, R. E. Harrington, & E. N. Trifonov (1991). ‘Curved dna<br />
without a–a: experimental estimation of all 16 dna wedge angles.’ Proc Natl Acad Sci U<br />
S A 88:2312–6.<br />
P. J. Brett, D. DeShazer, & D. E. Woods (1998). ‘Burkholderia thailandensis sp. nov., a<br />
Burkholderia pseudomallei–like species.’ Int J Syst Bacteriol 48:317–20.<br />
E. Brzuszkiewicz, H. Bruggemann, H. Liesegang, M. Emmerth, T. Olschlager, G. Nagy,<br />
K. Albermann, C. Wagner, C. Buchrieser, L. Emody, G. Gottschalk, J. Hacker, & U. Dobrindt<br />
(2006). ‘How to become a uropathogen: comparative genomic analysis of extraintestinal<br />
pathogenic Escherichia coli strains.’ Proc Natl Acad Sci U S A 103:12879–84.<br />
S. L. Chen, C.-S. Hung, J. Xu, C. S. Reigstad, V. Magrini, A. Sabo, D. Blasiar, T. Bieri,<br />
R. R. Meyer, P. Ozersky, J. R. Armstrong, R. S. Fulton, J. P. Latreille, J. Spieth, T. M.<br />
Hooton, E. R. Mardis, S. J. Hultgren, & J. I. Gordon (2006). ‘Identification of genes<br />
subject to positive selection in uropathogenic strains of Escherichia coli: a comparative<br />
genomics approach.’ Proc Natl Acad Sci U S A 103:5977–82.<br />
A. L. Delcher, D. Harmon, S. Kasif, O. White, & S. L. Salzberg (1999). ‘Improved microbial<br />
gene identification with glimmer.’ Nucleic Acids Res 27:4636–41.<br />
J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan,<br />
B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark,<br />
R. Dalal, A. Dewinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner,<br />
K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist,<br />
C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy,<br />
R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener,<br />
D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, & S. Turner (2009).<br />
‘Real–time dna sequencing from single polymerase molecules.’ Science 323:133–8.<br />
M. Ender, B. Berger-Bachi, & N. McCallum (2009). ‘A novel dna–binding protein modulating<br />
methicillin resistance in Staphylococcus aureus.’ BMC Microbiol 9:15.<br />
S. T. Estrem, T. Gaal, W. Ross, & R. L. Gourse (1998). ‘Identification of an up element<br />
consensus sequence for bacterial promoters.’ Proc Natl Acad Sci U S A 95:9761–6.<br />
P. F. Hallin & D. W. Ussery (2004). ‘Cbs Genome Atlas Database: a dynamic storage for<br />
bioinformatic results and sequence data.’ Bioinformatics 20:3682–6.<br />
K. Hayashi, N. Morooka, Y. Yamamoto, K. Fujita, K. Isono, S. Choi, E. Ohtsubo, T. Baba,<br />
B. L. Wanner, H. Mori, & T. Horiuchi (2006). ‘Highly accurate genome sequences of<br />
Escherichia coli k–12 strains mg1655 and w3110.’ Mol Syst Biol 2:2006.0007.<br />
T. Hayashi, K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han,<br />
E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami,<br />
T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori,<br />
& H. Shinagawa (2001). ‘Complete genome sequence of enterohemorrhagic Escherichia<br />
coli o157:h7 and genomic comparison with a laboratory strain k–12.’ DNA Res 8:11–22.<br />
P. N. Hengen, S. L. Bartram, L. E. Stewart, & T. D. Schneider (1997). ‘Information<br />
analysis of Fis binding sites.’ Nucleic Acids Res 25:4994–5002.<br />
C. A. Hirvonen, W. Ross, C. E. Wozniak, E. Marasco, J. R. Anthony, S. E. Aiyar, V. H.<br />
Newburn, & R. L. Gourse (2001). ‘Contributions of up elements and the transcription<br />
factor fis to expression from the seven rrn p1 promoters in Escherichia coli.’ J Bacteriol<br />
183:6305–14.<br />
A. M. Huerta & J. Collado-Vides (2003). ‘Sigma70 promoters in Escherichia coli: specific<br />
transcription in dense regions of overlapping promoter–like signals.’ J Mol Biol 333:261–78.<br />
L. J. Jensen, C. Friis, & D. W. Ussery (1999). ‘Three views of microbial genomes.’ Res<br />
Microbiol 150:773–7.<br />
L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten, N. T. Hansen, H. Johansson, M. K.<br />
Joergensen, K. Kiil, P. F. Hallin, & D. Ussery (2005). THE PSEUDOMONADS VOL<br />
I. GENOMICS, LIFE STYLE AND MOLECULAR ARCHITECTURE, vol. 1,<br />
chap. 5: Comparative genomics of four Pseudomonas species, pp. 139–164. Kluwer<br />
Academic / Plenum Publishers, New York.<br />
Q. Jin, Z. Yuan, J. Xu, Y. Wang, Y. Shen, W. Lu, J. Wang, H. Liu, J. Yang, F. Yang,<br />
X. Zhang, J. Zhang, G. Yang, H. Wu, D. Qu, J. Dong, L. Sun, Y. Xue, A. Zhao, Y. Gao,<br />
J. Zhu, B. Kan, K. Ding, S. Chen, H. Cheng, Z. Yao, B. He, R. Chen, D. Ma, B. Qiang,<br />
Y. Wen, Y. Hou, & J. Yu (2002). ‘Genome sequence of Shigella flexneri 2a: insights<br />
into pathogenicity through comparison with genomes of Escherichia coli k12 and o157.’<br />
Nucleic Acids Res 30:4432–41.<br />
T. J. Johnson, S. Kariyawasam, Y. Wannemuehler, P. Mangiamele, S. J. Johnson,<br />
C. Doetkott, J. A. Skyberg, A. M. Lynne, J. R. Johnson, & L. K. Nolan (2007). ‘The<br />
genome sequence of avian pathogenic Escherichia coli strain o1:k1:h7 shares strong similarities<br />
with human extraintestinal pathogenic e. coli genomes.’ J Bacteriol 189:3228–36.<br />
J. Kyte & R. F. Doolittle (1982). ‘A simple method for displaying the hydropathic character<br />
of a protein.’ J Mol Biol 157:105–32.<br />
K. Lagesen, P. Hallin, E. A. Rodland, H.-H. Staerfeldt, T. Rognes, & D. W. Ussery (2007).<br />
‘RNAmmer: consistent and rapid annotation of ribosomal rna genes.’ Nucleic Acids Res<br />
35:3100–8.<br />
T. S. Larsen & A. Krogh (2003). ‘EasyGene–a prokaryotic gene finder that ranks ORFs<br />
by statistical significance.’ BMC Bioinformatics 4:21.<br />
T. Lefebure & M. J. Stanhope (2007). ‘Evolution of the core and pan–genome of Streptococcus:<br />
positive selection, recombination, and genome composition.’ Genome Biol<br />
8:R71.<br />
X. Liao, T. Ying, H. Wang, J. Wang, Z. Shi, E. Feng, K. Wei, Y. Wang, X. Zhang,<br />
L. Huang, G. Su, & P. Huang (2003). ‘A two–dimensional proteome map of Shigella<br />
flexneri.’ Electrophoresis 24:2864–82.<br />
B. Liebig & R. Wagner (1995). ‘Effects of different growth conditions on the in vivo<br />
activity of the tandem Escherichia coli ribosomal rna promoters p1 and p2.’ Mol Gen<br />
Genet 249:328–35.<br />
D. Lim & N. C. J. Strynadka (2002). ‘Structural basis for the beta lactam resistance of<br />
pbp2a from methicillin–resistant Staphylococcus aureus.’ Nat Struct Biol 9:870–6.<br />
T. M. Lowe & S. R. Eddy (1997). ‘tRNAscan–se: a program for improved detection of<br />
transfer rna genes in genomic sequence.’ Nucleic Acids Res 25:955–64.<br />
J. P. McCutcheon, B. R. McDonald, & N. A. Moran (2009). ‘Origin of an alternative<br />
genetic code in the extremely small and gc–rich genome of a bacterial symbiont.’ PLoS<br />
Genet 5:e1000565.<br />
C. E. McEwan, D. Gatherer, & N. R. McEwan (1998). ‘Nitrogen–fixing aerobic bacteria<br />
have higher genomic gc content than non–fixing species within the same genus.’<br />
Hereditas 128:173–8.<br />
W. G. Miller, C. T. Parker, M. Rubenfield, G. L. Mendz, M. M. S. M. Wosten, D. W.<br />
Ussery, J. F. Stolz, T. T. Binnewies, P. F. Hallin, G. Wang, J. A. Malek, A. Rogosin,<br />
L. H. Stanker, & R. E. Mandrell (2007). ‘The complete genome sequence and analysis<br />
of the epsilonproteobacterium Arcobacter butzleri.’ PLoS One 2:e1358.<br />
H. D. Murray & R. L. Gourse (2004). ‘Unique roles of the rrn p2 rrna promoters in<br />
Escherichia coli.’ Mol Microbiol 52:1375–87.<br />
A. Nakabachi, A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, & M. Hattori<br />
(2006). ‘The 160–kilobase genome of the bacterial endosymbiont Carsonella.’ Science<br />
314:267.<br />
C. Ong, C. H. Ooi, D. Wang, H. Chong, K. C. Ng, F. Rodrigues, M. A. Lee, & P. Tan<br />
(2004). ‘Patterns of large–scale genomic variation in virulent and avirulent Burkholderia<br />
species.’ Genome Res 14:2295–307.<br />
J. Parkhill, B. W. Wren, K. Mungall, J. M. Ketley, C. Churcher, D. Basham, T. Chillingworth,<br />
R. M. Davies, T. Feltwell, S. Holroyd, K. Jagels, A. V. Karlyshev, S. Moule,<br />
M. J. Pallen, C. W. Penn, M. A. Quail, M. A. Rajandream, K. M. Rutherford, A. H. van<br />
Vliet, S. Whitehead, & B. G. Barrell (2000). ‘The genome sequence of the food–borne<br />
pathogen Campylobacter jejuni reveals hypervariable sequences.’ Nature 403:665–8.<br />
A. G. Pedersen, L. J. Jensen, S. Brunak, H. H. Staerfeldt, & D. W. Ussery (2000). ‘A dna<br />
structural atlas for Escherichia coli.’ J Mol Biol 299:907–30.<br />
V. Perez-Brocal, R. Gil, S. Ramos, A. Lamelas, M. Postigo, J. M. Michelena, F. J. Silva,<br />
A. Moya, & A. Latorre (2006). ‘A small microbial genome: the end of a long symbiotic<br />
relationship?’ Science 314:312–3.<br />
N. T. Perna, G. r. Plunkett, V. Burland, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew,<br />
P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Klink, A. Boutin,<br />
Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D.<br />
Potamousis, J. Apodaca, T. S. Anantharaman, J. Lin, G. Yen, D. C. Schwartz, R. A.<br />
Welch, & F. R. Blattner (2001). ‘Genome sequence of enterohaemorrhagic Escherichia<br />
coli o157:h7.’ Nature 409:529–33.<br />
O. N. Reva, P. F. Hallin, H. Willenbrock, T. Sicheritz-Ponten, B. Tummler, & D. W.<br />
Ussery (2008). ‘Global features of the Alcanivorax borkumensis sk2 genome.’ Environ<br />
Microbiol 10:614–25.<br />
E. P. C. Rocha (2004). ‘Codon usage bias from trna’s point of view: redundancy, specialization,<br />
and efficient decoding for translation optimization.’ Genome Res 14:2279–86.<br />
W. Ross, J. Salomon, W. M. Holmes, & R. L. Gourse (1999). ‘Activation of Escherichia<br />
coli leuv transcription by fis.’ J Bacteriol 181:3864–8.<br />
K. Rutherford, J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Rajandream, & B. Barrell<br />
(2000). ‘Artemis: sequence visualization and annotation.’ Bioinformatics 16:944–5.<br />
R. A. Sanford, J. R. Cole, & J. M. Tiedje (2002). ‘Characterization and description of<br />
Anaeromyxobacter dehalogenans gen. nov., sp. nov., an aryl–halorespiring facultative<br />
anaerobic myxobacterium.’ Appl Environ Microbiol 68:893–900.<br />
S. C. Satchwell, H. R. Drew, & A. A. Travers (1986). ‘Sequence periodicities in chicken<br />
nucleosome core dna.’ J Mol Biol 191:659–75.<br />
S. Schneiker, O. Perlova, O. Kaiser, K. Gerth, A. Alici, M. O. Altmeyer, D. Bartels,<br />
T. Bekel, S. Beyer, E. Bode, H. B. Bode, C. J. Bolten, J. V. Choudhuri, S. Doss,<br />
Y. A. Elnakady, B. Frank, L. Gaigalat, A. Goesmann, C. Groeger, F. Gross, L. Jelsbak,<br />
L. Jelsbak, J. Kalinowski, C. Kegler, T. Knauber, S. Konietzny, M. Kopp, L. Krause,<br />
D. Krug, B. Linke, T. Mahmud, R. Martinez-Arias, A. C. McHardy, M. Merai, F. Meyer,<br />
S. Mormann, J. Munoz-Dorado, J. Perez, S. Pradella, S. Rachid, G. Raddatz, F. Rosenau,<br />
C. Ruckert, F. Sasse, M. Scharfe, S. C. Schuster, G. Suen, A. Treuner-Lange, G. J.<br />
Velicer, F.-J. Vorholter, K. J. Weissman, R. D. Welch, S. C. Wenzel, D. E. Whitworth,<br />
S. Wilhelm, C. Wittmann, H. Blocker, A. Puhler, & R. Muller (2007). ‘Complete genome<br />
sequence of the myxobacterium Sorangium cellulosum.’ Nat Biotechnol 25:1281–9.<br />
R. K. Shultzaberger, Z. Chen, K. A. Lewis, & T. D. Schneider (2007). ‘Anatomy of<br />
Escherichia coli sigma70 promoters.’ Nucleic Acids Res 35:771–88.<br />
M. D. Smith, B. J. Angus, V. Wuthiekanun, & N. J. White (1997). ‘Arabinose assimilation<br />
defines a nonvirulent biotype of Burkholderia pseudomallei.’ Infect Immun 65:4319–21.<br />
H. Tettelin, V. Masignani, M. J. Cieslewicz, C. Donati, D. Medini, N. L. Ward, S. V.<br />
Angiuoli, J. Crabtree, A. L. Jones, A. S. Durkin, R. T. Deboy, T. M. Davidsen, M. Mora,<br />
M. Scarselli, I. Margarit y Ros, J. D. Peterson, C. R. Hauser, J. P. Sundaram, W. C.<br />
Nelson, R. Madupu, L. M. Brinkac, R. J. Dodson, M. J. Rosovitz, S. A. Sullivan,<br />
S. C. Daugherty, D. H. Haft, J. Selengut, M. L. Gwinn, L. Zhou, N. Zafar, H. Khouri,<br />
D. Radune, G. Dimitrov, K. Watkins, K. J. B. O’Connor, S. Smith, T. R. Utterback,<br />
O. White, C. E. Rubens, G. Grandi, L. C. Madoff, D. L. Kasper, J. L. Telford, M. R.<br />
Wessels, R. Rappuoli, & C. M. Fraser (2005). ‘Genome analysis of multiple pathogenic<br />
isolates of Streptococcus agalactiae: implications for the microbial “pan–genome”.’ Proc<br />
Natl Acad Sci U S A 102:13950–5.<br />
J. D. Thompson, D. G. Higgins, & T. J. Gibson (1994). ‘Clustal w: improving the sensitivity<br />
of progressive multiple sequence alignment through sequence weighting, position–specific<br />
gap penalties and weight matrix choice.’ Nucleic Acids Res 22:4673–80.<br />
H. Toh, B. L. Weiss, S. A. H. Perkin, A. Yamashita, K. Oshima, M. Hattori, & S. Aksoy<br />
(2006). ‘Massive genome erosion and functional adaptations provide insights into the<br />
symbiotic lifestyle of Sodalis glossinidius in the tsetse host.’ Genome Res 16:149–56.<br />
M. L. Tress, P. L. Martelli, A. Frankish, G. A. Reeves, J. J. Wesselink, C. Yeats, P. I. Olason,<br />
M. Albrecht, H. Hegyi, A. Giorgetti, D. Raimondo, J. Lagarde, R. A. Laskowski,<br />
G. Lopez, M. I. Sadowski, J. D. Watson, P. Fariselli, I. Rossi, A. Nagy, W. Kai, Z. Storling,<br />
M. Orsini, Y. Assenov, H. Blankenburg, C. Huthmacher, F. Ramirez, A. Schlicker,<br />
F. Denoeud, P. Jones, S. Kerrien, S. Orchard, S. E. Antonarakis, A. Reymond, E. Birney,<br />
S. Brunak, R. Casadio, R. Guigo, J. Harrow, H. Hermjakob, D. T. Jones, T. Lengauer,<br />
C. A. Orengo, L. Patthy, J. M. Thornton, A. Tramontano, & A. Valencia (2007). ‘The<br />
implications of alternative splicing in the encode protein complement.’ Proc Natl Acad<br />
Sci U S A 104:5495–500.<br />
J. W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.<br />
D. W. Ussery, P. F. Hallin, K. Lagesen, & T. M. Wassenaar (2004). ‘Genome update:<br />
tRNAs in sequenced microbial genomes.’ Microbiology 150:1603–6.<br />
T. Visnes, B. Doseth, H. S. Pettersen, L. Hagen, M. M. L. Sousa, M. Akbari, M. Otterlei,<br />
B. Kavli, G. Slupphaug, & H. E. Krokan (2009). ‘Uracil in dna and its processing by<br />
different dna glycosylases.’ Philos Trans R Soc Lond B Biol Sci 364:563–8.<br />
H. Wang & C. J. Benham (2008). ‘Superhelical destabilization in regulatory regions of<br />
stress response genes.’ PLoS Comput Biol 4:e17.<br />
H. Wang, M. Noordewier, & C. J. Benham (2004). ‘Stress–induced dna duplex destabilization<br />
(sidd) in the e. coli genome: sidd sites are closely associated with promoters.’<br />
Genome Res 14:1575–84.<br />
R. A. Welch, V. Burland, G. r. Plunkett, P. Redford, P. Roesch, D. Rasko, E. L. Buckles,<br />
S.-R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C.<br />
Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, & F. R. Blattner (2002).<br />
‘Extensive mosaic structure revealed by the complete genome sequence of uropathogenic<br />
Escherichia coli.’ Proc Natl Acad Sci U S A 99:17020–4.<br />
H. Willenbrock, C. Friis, A. S. Juncker, & D. W. Ussery (2006). ‘An environmental<br />
signature for 323 microbial genomes based on codon adaptation indices.’ Genome Biol<br />
7:R114.<br />
K.-M. Wu, L.-H. Li, J.-J. Yan, N. Tsao, T.-L. Liao, H.-C. Tsai, C.-P. Fung, H.-J. Chen,<br />
Y.-M. Liu, J.-T. Wang, C.-T. Fang, S.-C. Chang, H.-Y. Shu, T.-T. Liu, Y.-T. Chen,<br />
Y.-R. Shiau, T.-L. Lauderdale, I.-J. Su, R. Kirby, & S.-F. Tsai (2009). ‘Genome sequencing<br />
and comparative analysis of Klebsiella pneumoniae ntuh–k2044, a strain causing liver<br />
abscess and meningitis.’ J Bacteriol 191:4492–501.<br />
F. Yang, J. Yang, X. Zhang, L. Chen, Y. Jiang, Y. Yan, X. Tang, J. Wang, Z. Xiong,<br />
J. Dong, Y. Xue, Y. Zhu, X. Xu, L. Sun, S. Chen, H. Nie, J. Peng, J. Xu, Y. Wang,<br />
Z. Yuan, Y. Wen, Z. Yao, Y. Shen, B. Qiang, Y. Hou, J. Yu, & Q. Jin (2005). ‘Genome<br />
dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery.’<br />
Nucleic Acids Res 33:6445–58.<br />