29.07.2013 Views

Computational tools and Interoperability in Comparative ... - CBS


Computational tools and Interoperability in Comparative Genomics

PhD thesis | Peter Fischer Hallin | 2009
Center for Biological Sequence Analysis
Department of Systems Biology
Technical University of Denmark

[Cover figure: BLAST matrix of 10 Campylobacter strains. Each cell gives a homology percentage with the underlying counts, for example 57.2 % (1,123 / 1,965). Legend ranges: homology between proteomes, 21.2 % to 84.7 %; homology within proteomes, 1.5 % to 4.0 %. Axis labels with proteome sizes and within-proteome homology:]

Campylobacter concisus 13826: 2,080 proteins, 1,972 families; within-proteome homology 3.5 % (69 / 1,972)
Campylobacter curvus 525.92: 1,931 proteins, 1,885 families; within-proteome homology 1.8 % (34 / 1,885)
Campylobacter fetus subsp. fetus 82-40: 1,719 proteins, 1,665 families; within-proteome homology 1.5 % (25 / 1,665)
Campylobacter hominis ATCC BAA-381: 1,687 proteins, 1,623 families; within-proteome homology 2.0 % (33 / 1,623)
Campylobacter jejuni RM1221: 1,838 proteins, 1,780 families; within-proteome homology 2.3 % (41 / 1,780)
Campylobacter jejuni subsp. doylei 269.97: 1,731 proteins, 1,650 families; within-proteome homology 4.0 % (66 / 1,650)
Campylobacter jejuni subsp. jejuni 81-176: 1,758 proteins, 1,702 families; within-proteome homology 2.3 % (39 / 1,702)
Campylobacter jejuni subsp. jejuni 81116: 1,626 proteins, 1,585 families; within-proteome homology 1.5 % (24 / 1,585)
Campylobacter jejuni subsp. jejuni NCTC 11168: 1,624 proteins, 1,581 families; within-proteome homology 1.7 % (27 / 1,581)
Campylobacter lari RM2100: 1,546 proteins, 1,494 families; within-proteome homology 2.3 % (34 / 1,494)


To my family. Thank you Susanne for your endless support and for giving us two wonderful boys, Oliver and Victor.


Preface

This Ph.D. thesis is written for the Department of Systems Biology, Technical University of Denmark, as part of the Life Science programme and as a requirement for obtaining the Ph.D. degree.

The work was supported through the EMBRACE project, which is funded by the European Commission within the Sixth Framework Programme, under the area of “Life sciences, genomics and biotechnology for health”, contract number LSGH-CT-2004-512092.

Part of the work was supported through a grant from the Danish Natural Science Research Council, contract number 26-06-0349, entitled “Comparative Genomics of Campylobacter jejuni”.

The work was carried out at the Center for Biological Sequence Analysis (CBS), Department of Systems Biology, under the supervision of Associate Professor David W. Ussery.

The work on bacterial promoters was carried out during an external stay at the University of California, Davis (UC Davis Genome Center), under the supervision of Professor Craig J. Benham, and was supported through an NSF Research Grant, contract number DBI-0416764.

Lyngby, 28 September 2009

Peter Fischer Hallin

Cover illustration

The background of the cover shows a “BLAST atlas” of Burkholderia pseudomallei strain 1710b, compared with 22 other Burkholderia genomes. The top panel, under the title, shows the P1/P2 rrnB promoter region of E. coli, mapped to different DNA properties. The panel below is a “BLAST matrix” of 10 different Campylobacter strains, showing the overall proteome similarity.
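Each cell of the BLAST matrix on the cover prints a percentage together with a ratio of two counts, for example 57.2 % with 1,123 / 1,965. The following is a minimal sketch of how the two relate, assuming the percentage is simply the first count divided by the second, rounded to one decimal. The function name `homology_percent` is illustrative and is not part of any CBS tool.

```python
def homology_percent(shared: int, total: int) -> float:
    """Return shared/total as a percentage rounded to one decimal.

    Assumption: the percentage in a matrix cell is the plain ratio of the
    two counts printed with it.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * shared / total, 1)


# Cell values copied from the cover figure: (first count, second count).
cells = [(1123, 1965), (1448, 1709), (34, 1494)]
percentages = [homology_percent(s, t) for s, t in cells]
print(percentages)
```

For these three cells the ratio reproduces the percentages printed on the cover (57.2 %, 84.7 % and 2.3 %), which supports the assumption at least for the cells checked.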



Abstract

The scientific community is witnessing an explosion in both the number and the complexity of DNA sequencing projects. As sequencing equipment becomes more reliable, faster and less expensive, new possibilities for applying the technology are opening up. The early genome sequencing projects, dating back almost 15 years, presented only individual microbial strains, and the large efforts and scientific achievements of that time qualified for publication in high-ranking journals. Today, however, projects like the Human Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) take sequencing into a new era, studying the genomes and ecological niches of entire populations consisting of thousands of microorganisms. These initiatives create a demand for new analysis tools to process and derive knowledge from the wealth of genomic information.

This thesis describes the development of new tools and methods to study these types of data. When the genomes of characterized strains and environmental samples are sequenced, the ribosomal RNA genes are commonly chosen as a starting point to describe phylogeny and diversity. The rRNA genes are often interpreted as an ‘evolutionary chronometer’, and the RNAmmer software was developed as a tool to identify the rRNA genes quickly and consistently, allowing large-scale phylogenetic analysis of complex data sets. RNAmmer solved earlier problems with gene boundary accuracy that are observed when BLAST approaches are used to map rRNA genes. The ability to map the start of rRNA transcripts accurately has allowed the promoter structures of these highly expressed operons to be investigated, and a promoter analysis in E. coli K12 is demonstrated by applying a mathematical model of the energetics involved in DNA helix opening.

But a single gene, such as the 16S rRNA, cannot by its nature describe the phenotype or the full coding potential of an organism. This thesis describes the development of the BLASTatlas tool, a visualization tool that gives an overview of similarities and differences between any number of genomes, metagenomic samples or sequence databases from the viewpoint of a reference genome. This software has proved to be a powerful tool for studying the localization and the gain or loss of gene clusters, such as pathogenicity islands in virulent organisms. The tool has been used in several research projects and collaborations, was described in a cover article in Molecular BioSystems in 2008, and was highlighted in the journal Chemical Biology. Despite the usefulness of this tool, it became obvious that a web-based version, more “biologist friendly” and with zooming capability, was needed. This led to the GeneWiz browser, which was developed in a joint effort with the IT staff at CBS. The tool enables the user to zoom interactively from a global chromosomal scale down to the nucleotide level, while maintaining an overview of all data presented in the plot. It features disproportional zooming, as known from Google Maps. At the time of writing this thesis, the work is being published in the second issue of the SIGS journal (Standards in Genomic Sciences).

Since I started my Ph.D. project, a total of 630 prokaryotic genomes have been sequenced and published. This represents on average about four genomes per week! As we gain knowledge from this vast amount of data, new prediction methods become available, allowing the generation of even more data; examples include the prediction of sigma factor genes, chromosomal replication starts, and secretion systems. This combination of new sequence data and new predictions squares the problem: how do we deal with the challenge that more and more genomic material must be processed through more and more bioinformatic tools? And how is this flow of information formalized and automated, allowing bioinformaticians to programmatically submit comparisons of any genome to any prediction method anywhere in the world? The need for interoperable and programmable interfaces to these resources is now widely recognized, and machine-to-machine communication through Web Services has gained acceptance. But challenges lie ahead during the transition from web-browser-centric thinking towards interoperability and service-oriented architecture (SOA). During my Ph.D. work, a number of significant contributions to both implementations and server infrastructure have given remote users access to CBS prediction servers and databases. This work has been presented both at the general meetings of the EU project (EMBRACE) that initiated these efforts and at various workshops teaching the use of Web Services and Comparative Genomics.
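The figures in the paragraph above can be checked with a few lines of arithmetic. The sketch below assumes a project duration of three years (the duration given in the Danish resumé) and 52 weeks per year. The genome and tool counts passed to `job_count` are hypothetical and only illustrate why the workload grows with the product of the two.

```python
WEEKS_PER_YEAR = 52


def genomes_per_week(genomes: int, years: float) -> float:
    """Average number of genomes published per week over the period."""
    return genomes / (years * WEEKS_PER_YEAR)


def job_count(genomes: int, tools: int) -> int:
    """Number of (genome, tool) runs when every genome goes through every tool."""
    return genomes * tools


# 630 genomes is from the text; 3 years is an assumption taken from the resumé.
rate = genomes_per_week(630, 3)
print(f"{rate:.1f} genomes per week")

# Hypothetical counts: doubling both genomes and tools quadruples the runs.
print(job_count(630, 10), job_count(1260, 20))
```

Under these assumptions the average comes out at roughly four genomes per week, in line with the figure quoted above.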



Resumé

The scientific community is witnessing an explosion in both the number and the complexity of genome sequencing projects. As sequencing equipment becomes faster, more reliable and also cheaper, new possibilities for applying the technology open up. The first genome projects, which date back almost 15 years, presented only individual bacterial strains, and the great effort, together with the scientific results, led to publications in high-ranking journals. Today, projects such as the Human Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) have brought genome sequencing into a new era by characterizing thousands of reference genomes and entire ecosystems consisting of thousands of species. These initiatives will demand new analysis tools to process this flood of information and turn it into knowledge.

This thesis describes methods and tools for studying these types of data. When characterized strains and samples are sequenced, the ribosomal RNA is often chosen as the starting point for describing phylogeny and diversity. Ribosomal RNA is often used as an ‘evolutionary chronometer’, and the program RNAmmer was developed as a tool to identify rRNA genes quickly and consistently, which allows more comprehensive phylogenetic analyses of complex data sets. RNAmmer has solved earlier problems with determining the exact annotation of the genes, problems that have been seen with BLAST-based methods. The ability to map rRNA genes accurately has allowed the promoter structures of these highly expressed operons to be investigated. Subsequently, an existing mathematical energy model of DNA opening has been applied to perform a promoter analysis of the P1/P2 system in E. coli K12.

But a single gene, such as the 16S rRNA, is by its very nature unable to describe the phenotype of a whole organism or its full coding potential. This thesis describes the BLASTatlas method, a visualization tool that gives an overview of the similarity between an arbitrary number of genomes, metagenomic samples or sequence databases, taking a reference genome as the starting point. This software has proved to be an effective tool for studying individual genes or groups of genes that are conserved or lost in, for example, pathogenic microorganisms. The tool has been used in several research projects and collaborations, and the method was published as the cover article of the May 2008 issue of Environmental Microbiology. It became clear, however, that the lack of an interactive aspect made the tool difficult for biologists to use. This led to the development of the GeneWiz Browser program, which was developed in collaboration with IT staff at CBS. The tool lets the user zoom interactively from the global genome down to the individual nucleotide while keeping an overview of all the data presented in the plot. The program uses disproportional scaling, as known from, for example, Google Maps. The work is currently being published in Standards in Genomic Sciences.

Since the start of my three-year Ph.D. project, a total of 630 prokaryotic organisms have been fully sequenced and published. This corresponds to an average of about four genomes per week! As we gain new knowledge from these large amounts of data, new prediction methods are published for, for example, sigma factors, chromosomal replication, and secretion systems. This duality underlines the problem: how do we respond to the challenge that more and more genomic material must be processed with more and more bioinformatic tools? And how can this flow of information be formalized and automated so that bioinformaticians and biologists can programmatically run comparisons of any genome against any prediction method anywhere in the world? The need for interoperable and programmable interfaces to these resources is now widely recognized, and computer-to-computer communication through Web Services has gained ground. But challenges lie ahead in the transition from a web-browser-focused mindset towards interoperability and service-oriented architecture (SOA). In my Ph.D. work, a number of significant contributions in the form of implementations and infrastructure have given external users of various CBS tools and databases programmable access via Web Services. These contributions have been presented both at general meetings of the EMBRACE EU project and at various workshops on the use of Web Services.



Acknowledgments

I would like to express my deep gratitude to my supervisor, Prof. David Ussery, for his support during my Ph.D. project. It has been a great pleasure to work with him during my time at CBS, and I will miss the time spent organizing workshops and preparing for conferences. Thanks to Prof. and center director Søren Brunak for creating a unique and inspiring environment at CBS, which enabled this project.

I would like to extend my heartfelt gratitude to Craig and Marcia Benham for their incredible hospitality and openness towards our family during my research visit at the University of California, Davis in 2007.

I would like to thank a great colleague and friend of mine, Tim T. Binnewies, for support during conferences, manuscript preparations and our daily collaborations; it has been a pleasure to work with Tim. Thanks to Karin Lagesen for great research collaboration during the development of RNAmmer, and to Hanni Willenbrock for great collaboration and for driving numerous publications. I would also like to thank all the people I worked with during the development of the ENCODE pipeline: Ramneek Gupta, Thomas Blicher, Haakan Svensson, Henrik Nielsen, Rasmus Wernersson, Morten Bo Johansen and Eleonora Kulberkyte.

A special thanks to Hans-Henrik Stærfeldt for valuable feedback and for all the inspiring and productive sessions finalizing GeneWiz Browser and composing web services software. A special thanks to Kristoffer Rapacki for being a great travel companion, for always finding solutions, and for the many fruitful discussions we have had; I hope there will be more. I would like to thank the numerous people with whom I have had the pleasure of working during research projects and courses.

Thanks to former center administrators Johanne Keiding and Anne Christensen, current center administrator Dorthe Kjærsgaard, Lone Boesen and Malene Beck for your extraordinary efforts in keeping the CBS engine running efficiently. Lone Boesen deserves special praise for smoothly arranging and handling the travel details for my many trips abroad, spanning five continents.



Publications and manuscripts

Publications included in this thesis are listed in the order in which they appear. All other articles are sorted by publication date, descending. For papers with five or more citations, the number of citations is indicated.

Paper I
Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4:363-71 (2008).

Paper II
Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, La T, Hampson DJ, Bellgard M, Wassenaar TM, Ussery DW. Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6:165-85 (2006) - 56 citations.

Paper III
Reva ON, Hallin PF, Willenbrock H, Sicheritz-Ponten T, Tummler B, Ussery DW. Global features of the Alcanivorax borkumensis SK2 genome. Environ Microbiol 10:614-25 (2008).

Paper IV
Vesth T, Hallin PF, Snipen L, Lagesen K, Wassenaar TM, Ussery DW. The origins of Vibrio species. Microbial Ecology (2009) doi:10.1007/s00248-009-9596-7

Paper V
Wassenaar TM, Binnewies TT, Hallin PF, Ussery DW. Tools for comparison of bacterial genomes. Book chapter, Microbiology of Hydrocarbons, Oils, Lipids, and Derived Compounds, Springer-Verlag, Heidelberg, Germany, 2009.



Paper VI
[Lagesen K, Hallin P]¹, Rodland EA, Stærfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-8 (2007) - 8 citations²

Paper VII
Hallin PF, Stærfeldt H, Rotenberg E, Binnewies TT, Benham CJ, Ussery DW. GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes. Standards in Genomic Sciences 1:204-215 (2009) doi:10.4056/sigs.28177.

¹ Both authors contributed equally.
² Additionally, 8 citations for the first 8 GEBA genomes published in the SIGS journal; as RNAmmer is part of a standard pipeline, it will be cited in future GEBA articles.

Papers not included

Contributions have been made to the following papers during my PhD project.

• Miller WG, Parker CT, Rubenfield M, Mendz GL, Wosten MM, Ussery DW, Stolz JF, Binnewies TT, Hallin PF, Wang G, Malek JA, Rogosin A, Stanker LH, Mandrell RE. The complete genome sequence and analysis of the human pathogen Arcobacter butzleri. PLoS ONE 2:e1358 (2007)

• Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW. Characterization of probiotic Escherichia coli isolates with a novel pan-genome microarray. Genome Biol 8:R267 (2007)

Earlier papers, 2004–2006

• Worning P, Jensen LJ, Hallin PF, Stærfeldt HH, Ussery DW. Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8:353-61 (2006) - 28 citations

• Kiil K, Binnewies TT, Sicheritz-Ponten T, Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW. Genome update: sigma factors in 240 bacterial genomes. Microbiology 151:3147-50 (2005)

• Bendtsen JD, Binnewies TT, Hallin PF, Ussery DW. Genome update: prediction of membrane proteins in prokaryotic genomes. Microbiology 151:2119-21 (2005)

• Bendtsen JD, Binnewies TT, Hallin PF, Sicheritz-Ponten T, Ussery DW. Genome update: prediction of secreted proteins in 225 bacterial proteomes. Microbiology 151:1725-7 (2005)

• Binnewies TT, Bendtsen JD, Hallin PF, Nielsen N, Wassenaar TM, Pedersen MB, Klemm P, Ussery DW. Genome Update: Protein secretion systems in 225 bacterial genomes. Microbiology 151:1013-6 (2005)

• Hallin PF, Nielsen N, Devine KM, Binnewies TT, Willenbrock H, Ussery DW. Genome update: base skews in 200+ bacterial chromosomes. Microbiology 151:633-7 (2005)

• Willenbrock H, Binnewies TT, Hallin PF, Ussery DW. Genome update: 2D clustering of bacterial genomes. Microbiology 151:333-6 (2005)

• Binnewies TT, Hallin PF, Stærfeldt HH, Ussery DW. Genome Update: proteome comparisons. Microbiology 151:1-4 (2005)

• Hallin PF, Ussery DW. CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20:3682-6 (2004) - 37 citations

• Hallin PF, Coenye T, Binnewies TT, Jarmer H, Stærfeldt HH, Ussery DW. Genome update: correlation of bacterial genomic properties. Microbiology 150:3899-903 (2004)

• Ussery DW, Binnewies TT, Gouveia-Oliveira R, Jarmer H, Hallin PF. Genome update: DNA repeats in bacterial genomes. Microbiology 150:3519-21 (2004) - 11 citations

• Hallin PF, Binnewies TT, Ussery DW. Genome update: chromosome atlases. Microbiology 150:3091-3 (2004)

• Ussery DW, Tindbaek N, Hallin PF. Genome update: promoter profiles. Microbiology 150:2791-3 (2004)

• Ussery DW, Jensen MS, Poulsen TR, Hallin PF. Genome update: alignment of bacterial chromosomes. Microbiology 150:2491-3 (2004)

• Ussery DW, Hallin PF. Genome Update: annotation quality in sequenced microbial genomes. Microbiology 150:2015-7 (2004) - 8 citations

• Ussery DW, Hallin PF, Lagesen K, Wassenaar TM. Genome update: tRNAs in sequenced microbial genomes. Microbiology 150:1603-6 (2004)

• Ussery DW, Hallin PF, Lagesen K, Coenye T. Genome update: rRNAs in sequenced microbial genomes. Microbiology 150:1113-5 (2004)

• Ussery DW, Hallin PF. Genome Update: AT content in sequenced prokaryotic genomes. Microbiology 150:749-52 (2004) - 8 citations

• Ussery DW, Hallin PF. Genome update: Length distributions of sequenced prokaryotic genomes. Microbiology 150:513-6 (2004)



Contents

List of Figures

1 Introduction

2 Comparative Genomics
2.1 Introduction
2.2 The genome annotation pipeline
2.2.1 fetchgbk: Obtaining existing public genomes from GenBank
2.2.2 Other ways to acquire genome information
2.2.3 Tools contigsort and contigmap
2.2.4 Finding protein encoding genes in prokaryotes
2.2.5 Finding tRNA and rRNA genes
2.3 Genome Comparisons
2.3.1 Box-and-whiskers plot
2.3.2 heatmap - 2D clustering
2.3.3 Codon usage and chromosomal base composition
2.3.4 CodonPlot: visualizing codon usage
2.3.5 Base composition and DNA repair
2.3.6 BLASTmatrix - proteome comparison
2.3.7 BLASTatlas - visualizing whole-genome homology
2.3.8 CorePlot - plotting the core- and pan-genomes of species
2.4 Summary
2.5 Instant insight: Reading the genetic atlas
2.6 Paper I: The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology
2.7 Paper II: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries
2.8 Paper III: Global features of the Alcanivorax borkumensis SK2 genome
2.9 Paper IV: The origins of Vibrio species
2.10 Paper V: Tools for comparison of bacterial genomes

3 rRNA operons and promoter analysis
3.1 Introduction
3.2 P1 and P2 promoters in E. coli
3.3 Conservation of regulatory elements
3.3.1 Modeling the P1 and P2 in selected enterics
3.3.2 Iterating weight matrix frequencies
3.3.3 Refining E. coli and Shigella models
3.4 DNA melting and SIDD energy
3.4.1 codesearch: Mapping numerical data to genome annotations
3.5 The genomic context: visualizing operons and DNA properties
3.6 Visualizing sequencing quality using gwBrowser
3.6.1 Visualizing the P1 and P2 structure using gwBrowser
3.7 Summary
3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences
3.9 Paper VII: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes

4 Web Services and Interoperability in Genomics
4.1 Introduction
4.2 Interoperability
4.2.1 SOAP based Web Services
4.3 EMBRACE: An EU initiative to enhance interoperability
4.3.1 Quasi - a light-weight SOAP server
4.3.2 quasi mktemp - From template to Web Service
4.4 ENCODE pipeline: applying Web Services
4.4.1 Collecting Web Services clients in EPipe
4.4.2 Mapping Pfam annotations to protein structure: mecA

5 Conclusion and perspectives

A Appendix: Workshops, teaching, and conferences
A.1 Lectures and Presentations
A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences
A.1.2 Comparative Microbial Genomics Workshop
A.1.3 Comparative Microbial Genomics and Taxonomy
A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services
A.1.5 EMBRACE Workshop on Bioinformatics of Immunology
A.1.6 EMBRACE 3rd AGM: Implementation of web services
A.1.7 EMBRACE Workshop on Perl, SQL and Web Services
A.2 Workshops and meetings
A.2.1 EMBRACE Workshop: SOAP web services
A.2.2 EUCOMM Bioinformatics Training Course
A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences
A.2.4 EMBRACE 3rd Annual General Meeting
A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation
A.2.6 EMBRACE 4th Annual General Meeting
A.2.7 Technical discussion of EMBRACE registry
A.2.8 EMBRACE meeting: Discussion of standard data types
A.3 Conferences
A.3.1 Conference: Metagenomics, July 2007, San Diego U.S.A.
A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington U.S.A.

B Appendix: Ph.D. study plan

C Appendix: Courses
C.1 Global regulatory networks in microorganisms
C.2 Protein Structure and Computational Biology
C.3 Biological Sequence Analysis
C.4 Comparative Genome Analysis
C.5 Doctorial seminar on business economics for academic entrepreneurs
C.6 ECTS summary

D Appendix: Software
D.1 fetchgbk manual
D.2 Sample output from queryGenomes
D.3 BLASTatlas configurations
D.3.1 file blast.cfg
D.3.2 file custom.cfg
D.4 BLASTmatrix example
D.5 iscan source code
D.6 quasi mktemp manual

Bibliography



List of Figures

2.1 Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used as backbone for mapping contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and reverse hits, respectively. Panel (a) shows un-mapped whereas panel (b) shows mapped contigs.
2.2 Construction of a box-and-whiskers plot. The notches are an estimate of the 95% confidence interval.
2.3 Genome size of all public prokaryotic genomes.
2.4 Average AT content of all public prokaryotic genomes.
2.5 2D-clustering showing 87 Enterobacteriaceae.
2.6 Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 (49.2% AT). The rightmost column shows the nucleotide bias of the three codon positions.
2.7 AT content profile 400 bp upstream and downstream of annotated translation starts in Buchnera aphidicola Cc.
2.8 Deamination of cytosine (C) into uracil (U)
2.9 Construction of the BLASTmatrix diagram. Proteome similarity between three E. coli genomes. The lower part of the diagram corresponds to intraproteome similarity.
2.10 Proteome similarity between ten Campylobacter species. Color encoding corresponds to percentage of shared protein families.
2.11 Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains lacking the cholera enterotoxin genes are highlighted in bright green, whilst the genomes of pathogenic V. cholerae strains are shown in dark green.
2.12 Mapping of pairwise alignment to a reference genome. Mismatches, conservative mismatches and perfect matches contribute 0.0, 0.5, and 1.0 to the overall map, respectively. Gaps within the reference protein, corresponding to missing features of the reference protein, cannot be mapped and are hence excluded.
2.13 Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds to a pairwise comparison against the reference chromosome.
2.14 Comparison of B. pseudomallei 1710b chromosomes I and II against all public Burkholderia genomes.
2.15 A phylome atlas of Alcanivorax borkumensis, comparing the proteome against all γ-, α-, β-, δ-, and ε-proteobacteria available at the time of publishing.
2.16 Count of genomes and species divided by genera. Source: CBS Genome Atlas Database as of 2009-09-11.
2.17 Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently available, there seems to be an equilibrium at close to 600 protein families.
2.18 CorePlot output for 32 Vibrio genomes.
3.1 The transcription of bacterial genes.
3.2 The promoter structure of the rrnB operon in E. coli.
3.3 The –10 and –35 hexamers of the E. coli σ70 promoter correspond to the motifs being located on opposite sides of the DNA helix. Deletions or insertions in the spacing cause a shift of approx. 36° per nucleotide.
3.4 Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d).
3.5 Neighbor-joining tree of the first 1,000 bases of all 16S rRNA genes of Yersinia, Salmonella, Shigella, and E. coli
3.6 Profiles showing the maximum Ri(tot) scores of the initial weight matrices applied to E. coli and Shigella: Unadjusted P1 scores (a), Adjusted P1 scores (b), Unadjusted P2 scores (c), and Adjusted P2 scores (d)
3.7 Logos showing the base composition of P1 and P2 of E. coli genomes, as identified by the initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g)
3.8 Average profiles of SIDD energy calculated at five different helix densities -0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation start.
3.9 E. coli and Shigella rrnB energy landscape visualized using the heatmap function. Each vertical column corresponds to a promoter sequence, whereas the horizontal rows represent average values over 10 bp within each sequence. Coordinates labeled on the horizontal rows are relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps show P2. The leftmost heatmaps show P1/P2 model scores in green, whereas the rightmost heatmaps show the SIDD energy in blue.
3.10 Principal workflow of gwBrowser data exchange.
3.11 Mapping qualities of sequencing reads to a reference genome while accounting for the uniqueness of the read.
3.12 A zoom of the P1/P2 tandem promoter system upstream of the rrnB operon of E. coli K12.
4.1 Screen shot of NCBI Entrez Genome projects web page
4.2 Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose the outgoing request and parse the incoming server response.
4.3 Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that as much as possible is dispatched in parallel. Modules may either be alignment dependent or not. If the alignment is required to predict the protein features, the module is not launched until the alignment algorithm has finished. Modules may either return global features of the entire protein (e.g. cellular localization), or return positional features (e.g. phosphorylation sites).
4.4 The input web page of EPipe: The upper part defines sequence upload and alignment method, and the lower part selects which modules / methods to run. When applicable, gene ontologies have been added to each feature and feature values (light green boxes).
4.5 The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim & Strynadka, 2002). The top panel shows the EPipe structure browser, which allows rotation in steps of 90 degrees. The lower panel shows a post-processing of the PyMol script generated by EPipe.


Chapter 1

Introduction

Since the publication of the first complete bacterial genome sequence in 1995, close to a thousand prokaryotes have been fully sequenced and made publicly available. These data represent large efforts by many scientists and technicians, closing gaps in the chromosomal sequences and providing detailed gene annotations. These genome projects constitute a valuable collection of prokaryotic diversity, and they serve as an indispensable resource for comparative studies when novel features of newly discovered organisms are identified.

We are, however, witnessing a transition phase as genome sequencing becomes a trivial step carried out by any researcher or company in need of a better characterization of an organism. Sequencing equipment and the capability of assembling an entire genome will likely follow the same path as any other technological advance the world has seen. Telephones, cars, aeroplanes, and computers all started as costly and clumsy attempts and ended up as mainstream, affordable, and efficient products that are taken for granted. Nothing will prevent sequencing technology from following the same path, and it will likely end up as a tiny desktop instrument on a doctor’s table next to the blood pressure measuring device. But the decreasing novelty of presenting a new genome sequence could cause a decline in the number of published genomes in the near future, leading to less control and organization of these data, with fewer demands on data integrity, sequencing and annotation quality.

Some major issues arise as massive amounts of genomic data become a reality. There are signs that our ability to process and analyze genomic data is being overtaken by the technological developments of the sequencing equipment. For example, over the past twenty-five years, GenBank has grown roughly 100,000-fold, whereas computer processing power, following Moore’s law, has grown “only” 1,000-fold. The overwhelming amount of data generated by modern sequencing machines constitutes a tough challenge for most biologists, and although efforts are constantly being made to improve gene prediction and genome assembly software, these steps do not yet function in a scalable and unsupervised fashion. Further, post-annotation steps that derive knowledge from predicted genes remain one of the biggest challenges. How do we transform contigs of nucleotide sequences into knowledge from which the phenotype of the organism can be derived?
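The two growth factors quoted above can be converted into implied doubling times, which makes the gap easier to judge. The following is a minimal sketch that assumes steady exponential growth over the whole 25-year period; the function name is illustrative, and the only inputs are the figures stated in the text.

```python
import math

def doubling_time(fold_increase: float, years: float) -> float:
    """Doubling time (in years) implied by a fold increase over a period,
    assuming constant exponential growth (an assumption, not a measurement)."""
    if fold_increase <= 1 or years <= 0:
        raise ValueError("fold_increase must exceed 1 and years must be positive")
    return years * math.log(2) / math.log(fold_increase)

# Figures quoted in the text: 25 years, 100,000-fold versus 1,000-fold.
genbank = doubling_time(100_000, 25)
compute = doubling_time(1_000, 25)
print(f"GenBank doubling time:   {genbank:.2f} years")
print(f"Computing doubling time: {compute:.2f} years")
```

Under these assumptions the quoted figures correspond to a doubling of GenBank roughly every 1.5 years, against roughly 2.5 years for processing power.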

As more prokaryotic genomes are being sequenced, there are now a number of species for which multiple strains have been sequenced. Roughly one fourth of all prokaryotic projects belong to species for which five or more strains are available. As this coverage of diversity increases, we may begin to answer some key questions with better confidence. How do we define core sets of genes? Can we estimate the size of the pan-genome? Which features are novel in selected strains, and are these features regionally conserved within the chromosomes? To answer these questions, there is a fundamental need to visualize and overview the similarities and differences between larger numbers of genomes. Obtaining such an overview allows some questions concerning gene acquisition and chromosomal organization to be answered. The development and refinement of the BLASTatlas method during this Ph.D. project is an essential step forward in enabling these types of analysis, and the method is now offered as an online service by CBS. This work led to a publication in 2008 describing the BLASTatlas method.

In chapter 2 a number of tools are described which can assist rapid analysis of genomes, genomic contigs, and larger collections of genomes in order to assess their similarity. Enabling local and web-based genome analysis tools for the novice user remains critical for the success of future sequencing projects. In chapter 3 the RNAmmer tool was used as a starting point to study the E. coli rrn tandem promoters. This work presents useful tools to model and visualize promoter conservation in genomes. The exchange of genomic data between users, sequencing centers, repositories, and tool providers currently lacks standardization and interoperability. The lack of a formal way to exchange genomic data limits how we may in the future exploit the wave of new genomic material being generated. Chapter 4 of this thesis describes a number of efforts made during this Ph.D. project to provide interoperability and programmatic access to prediction methods and genomic visualization methods, as well as management of data standards. The outcome of this work has led CBS to adapt its tools and server infrastructure, thereby sharing its many tools in a way that allows programmers to insert sophisticated prediction methods directly into their own programming environment.


Chapter 2

Comparative Genomics

2.1 Introduction

This chapter covers work for five publications. The first paper (I) describes the BLASTatlas method, developed to compare and visualize, as a single graphic, the homology between a reference genome and any number of other genomes, collections of genomes, metagenomic sequences, or databases. The method has been used in connection with various research projects, including the publication of the Arcobacter butzleri RM4018 genome (Miller et al., 2007), in computer exercises (see chapter 4 and appendix A.1), and as an analysis tool for publications made during the project (papers II-V).
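The caption of figure 2.12 (see the List of Figures) states how a pairwise alignment is mapped onto the reference: mismatches, conservative mismatches, and perfect matches contribute 0.0, 0.5, and 1.0, and gaps within the reference protein are excluded. The sketch below illustrates that rule only; it is not the BLASTatlas implementation. It assumes a BLAST-style match line (a letter for an identity, '+' for a conservative substitution, a space otherwise), and the function name and the choice to score a gap in the compared protein as 0.0 are assumptions.

```python
from typing import List

def map_alignment_to_reference(reference: str, midline: str) -> List[float]:
    """Return one score per reference residue from a pairwise alignment.

    `reference` is the aligned reference sequence ('-' marks a gap) and
    `midline` is a BLAST-style match line of equal length: a letter for an
    identity, '+' for a conservative substitution, a space otherwise.
    """
    if len(reference) != len(midline):
        raise ValueError("aligned reference and midline must have equal length")
    scores = []
    for ref_char, mid_char in zip(reference, midline):
        if ref_char == "-":
            # Gap in the reference: there is no reference position to map to.
            continue
        if mid_char == "+":
            scores.append(0.5)   # conservative mismatch
        elif mid_char.isalpha():
            scores.append(1.0)   # perfect match
        else:
            scores.append(0.0)   # mismatch (assumed: also a gap in the other protein)
    return scores

# Made-up alignment fragment, for illustration only.
print(map_alignment_to_reference("MKT-LV", "M+  LV"))
```

The returned list has one value per reference residue, so it could be laid out along the reference chromosome once the protein coordinates are known.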

A number of smaller unpublished methods, including BLASTmatrix, CorePlot, and CodonPlot, have been written and used as in-house tools. The BLASTmatrix software derives unique and shared protein families for any number of proteomes. This enables the viewer to obtain the similarity between any pair of organisms included in the comparison. The tool was first used in Jensen et al. (2005) and was also used in other papers, including paper II. An improved version of the BLASTmatrix tool is used in paper IV. The BLASTmatrix software performs an all-against-all BLAST (Basic Local Alignment Search Tool, Altschul et al. (1997)) of a number of selected proteomes. When comparing multiple species of the same genus, these BLAST results can be reused by the CorePlot program to estimate the size of the core- and pan-genome. Finally, the CodonPlot program was written to visualize the codon and amino acid usage of an organism. The CodonPlot results contributed to papers II, III, and V.
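Once protein families have been derived, the quantities described above reduce to set operations. The following is a minimal sketch, not the BLASTmatrix or CorePlot code: it assumes that each proteome has already been reduced to a set of family identifiers, that the shared percentage is taken relative to all families found in either proteome, and that the pan- and core-genome are accumulated in a fixed genome order. All names and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def shared_percentage(a: Set[str], b: Set[str]) -> float:
    """Percentage of families shared by two proteomes, relative to all
    families found in either (the choice of denominator is an assumption)."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0

def pan_core_curve(families: Dict[str, Set[str]],
                   order: List[str]) -> List[Tuple[int, int]]:
    """Accumulate pan-genome (union) and core-genome (intersection) sizes
    as genomes are added in the given order."""
    pan: Set[str] = set()
    core: Set[str] = set()
    curve = []
    for i, name in enumerate(order):
        pan |= families[name]
        core = set(families[name]) if i == 0 else core & families[name]
        curve.append((len(pan), len(core)))
    return curve

# Toy data with made-up family identifiers, for illustration only.
toy = {
    "genome_A": {"f1", "f2", "f3", "f4"},
    "genome_B": {"f1", "f2", "f3", "f5"},
    "genome_C": {"f1", "f2", "f6"},
}
print(shared_percentage(toy["genome_A"], toy["genome_B"]))
print(pan_core_curve(toy, ["genome_A", "genome_B", "genome_C"]))
```

The shape of such a curve depends on the order in which genomes are added, so any estimate read from it should be interpreted with that in mind.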

The development of an interactive web-based genome browser (gwBrowser) has allowed a broader application of the atlas visualization method, including analysis of sequencing reads and promoter regions. This work is described in chapter 3.

2.2 The genome annotation pipeline

Having assembled the reads of a sequencing project, the biologist is often presented with an incomplete mapping of the chromosome, with gaps and a large number of contigs (contiguous pieces of DNA). The quality of an assembly originating from most modern high-throughput techniques can be negatively affected by a number of factors, such as short or insufficient reads, elevated error rates near the ends of the reads, DNA repeats on the chromosome, and inadequate assembly tools. This section describes tools to analyze both complete genome data (single contig) and preliminary data generated by pyrosequencing machines (multiple contigs). Most tools presented here are stored on the CBS servers at /home/people/pfh/scripts/.


2.2.1 fetchgbk: Obtaining existing public genomes from GenBank

Without robust access to prior knowledge about existing genomes, it is hard to draw conclusions about a novel genome sequence. The tool fetchgbk was made to download the most recent GenBank entries from NCBI using individual accession numbers (GenBank and RefSeq), ranges thereof, or the NCBI project ID, whereby all replicons of an organism can be obtained. Listing 2.1 shows common usage of the program and appendix D.1 includes the manual.

Listing 2.1: Usage of fetchgbk

# download a single genbank record
fetchgbk -a CP000896
# download a single refseq entry
fetchgbk -a NZ_ABIZ00000000
# download a range of RefSeq entries
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038
# just listing refseq accession numbers of a project
fetchgbk -p 12997 -d refseq -l
# download all replicons of a project (RefSeq)
fetchgbk -p 19391 -d refseq
# download all replicons of a project (GenBank)
fetchgbk -p 19391 -d genbank
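The accession range in the listing above can be expanded into individual accession numbers, each of which maps to one download request. The sketch below is not the fetchgbk source, which is not shown here. It assumes that both ends of a range share a prefix and a zero-padded number, and it builds request URLs for the public NCBI E-utilities efetch service; the function names are hypothetical, and no network request is made.

```python
import re
from typing import List
from urllib.parse import urlencode

# NCBI E-utilities endpoint (assumed here; not taken from the fetchgbk manual).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def expand_accession_range(first: str, last: str) -> List[str]:
    """Expand e.g. NZ_ABIH01000001..NZ_ABIH01000038 into all accessions,
    assuming both share a prefix and end in a zero-padded number."""
    pattern = re.compile(r"(.*?)(\d+)")
    m_first, m_last = pattern.fullmatch(first), pattern.fullmatch(last)
    if not m_first or not m_last:
        raise ValueError("accessions must end in digits")
    prefix, digits = m_first.groups()
    width = len(digits)
    if m_last.group(1) != prefix or len(m_last.group(2)) != width:
        raise ValueError("accessions must share a prefix and number width")
    start, stop = int(digits), int(m_last.group(2))
    if start > stop:
        raise ValueError("range start must not exceed range end")
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]

def efetch_url(accession: str) -> str:
    """Build an efetch URL requesting a full GenBank flat file as text."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "gbwithparts", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"

accessions = expand_accession_range("NZ_ABIH01000001", "NZ_ABIH01000038")
print(len(accessions), accessions[0], accessions[-1])
print(efetch_url("CP000896"))
```

Each URL could then be retrieved with any HTTP client, for instance urllib.request.urlopen, ideally with a pause between requests to respect the usage limits of the service.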

2.2.2 Other ways to acquire genome information

The GenBank records maintained in the CBS Genome Atlas Database (Hallin & Ussery, 2004) are regularly synchronized against NCBI Entrez (see http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). The raw sequence data can be downloaded from this database using the Web Services client scripts getSeq, getOrfs, and getProt. The example scripts can be downloaded and run as separate commands (listing 2.2) or integrated into larger workflows, in other programming languages if needed.

Listing 2.2: Accessing Genome Atlas Database through Web Services.

# download prerequisites
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getseq.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getprot.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getorfs.pl

# obtain full genome sequence of genbank entry
perl getseq.pl CP000550 > CP000550.fsa

# obtain translations of genbank entry
perl getprot.pl CP000550 > CP000550.proteins.fsa

# obtain open reading frames of genbank entry
perl getorfs.pl CP000550 > CP000550.orfs.fsa

The CBS Genome Atlas Database contains an index of genome meta-data, such as organism name, NCBI Project ID, replicon, genome size, number of coding genes, tRNA genes, rRNA genes, the base composition, and average values of various DNA properties such as intrinsic curvature (Bolshoy et al., 1991) and stacking energy (Satchwell et al., 1986). For more information on the Web Services implementation, see section 4.2.1; for full documentation, please refer to http://www.cbs.dtu.dk/ws/GenomeAtlas. Listing 2.3 shows an example of how to use queryGenomes to obtain AT content and gene count for the publicly available Vibrio genomes. The output of the command is listed in appendix D.2.

Listing 2.3: Using queryGenomes to obtain genome meta-data.

# download client script
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/querygenomes.pl

# download XML::Compile helper script
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl

# extract AT content and number of genes for all Vibrio genomes
perl querygenomes.pl -hideMerged -organism vibrio -output ATCONTENT,NGENES

2.2.3 Tools contigsort and contigmap

For some analyses of unfinished or partially sequenced genomes, it is desirable to obtain approximate coordinates of the contigs within the complete chromosome. The contigsort program was written for this purpose. It accepts any number of entries (contigs) in one FASTA file, together with a backbone sequence given as a single contig in a second FASTA file. The entries of the contig file are then mapped to the backbone sequence using nucleotide BLAST, assuming at least one significant hit. The tool then sorts all contigs by the backbone coordinate of the center point of each alignment. Contigs spanning the origin of circular backbones are automatically split in two.
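The sorting step can be illustrated with a short Python sketch. It assumes that the best backbone hit for each contig has already been extracted from the BLAST output. The BackboneHit type, the order_contigs function, and the coordinates are hypothetical and chosen for illustration; this is not the contigsort source, and the splitting of contigs that span the origin is not covered.

```python
from typing import NamedTuple


class BackboneHit(NamedTuple):
    """Best alignment of one contig on the backbone (hypothetical record)."""
    contig: str
    start: int  # backbone coordinate where the alignment starts
    end: int    # backbone coordinate where it ends (start > end for reverse hits)


def order_contigs(hits: list[BackboneHit]) -> list[str]:
    """Return contig names sorted by the center point of their alignment."""
    def midpoint(hit: BackboneHit) -> float:
        # The center point is the same for direct and reverse hits.
        return (hit.start + hit.end) / 2

    return [hit.contig for hit in sorted(hits, key=midpoint)]


example_hits = [
    BackboneHit("contig_A", 900_000, 1_200_000),
    BackboneHit("contig_B", 50_000, 10_000),   # reverse hit
    BackboneHit("contig_C", 400_000, 650_000),
]
print(order_contigs(example_hits))
```

Sorting on the center point rather than on the start coordinate keeps the order independent of the orientation of each hit.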

The tool genomemap was written to visualize homology between two genome sequences. Each genome may consist of one or more contigs, and all contigs are aligned using BLASTN. This tool allows a user to validate the output of the backbone mapping from contigsort. The plot resembles that produced by the Artemis Comparison Tool (ACT) (Rutherford et al., 2000); however, the output of genomemap is a vector graphics file (PostScript), and multiple sequence entries are allowed within each of the two compared sequences.

Example: Campylobacter jejuni str. 260.94

The 10 contigs of the currently unpublished sequence of Campylobacter jejuni str. 260.94 (GenBank accession no. AANK01000001-AANK01000010) were downloaded and converted into a FASTA format file. The program saco_convert is an in-house program at CBS, which converts between different sequence formats. In the example provided, Campylobacter jejuni str. NCTC 11168 (Parkhill et al., 2000) is used as the backbone (see listing 2.4).

Listing 2.4: Using contigsort to map assembled contigs to a backbone.

set path = (~pfh/scripts/contigsort ~pfh/scripts/fetchgbk $path)
fetchgbk -a AANK01000001-AANK01000010 > AANK.gbk
saco_convert -I genbank -O fasta AANK.gbk > AANK.fsa
fetchgbk -a AL111168 > AL111168.gbk
saco_convert -I genbank -O fasta AL111168.gbk > AL111168.fsa
contigsort -c -i AANK.fsa -b AL111168.fsa > mapped.fsa

To visualize the result of the contig mapping, the mapped and unmapped contigs were processed by contigmap. The output from the comparison is a PostScript document (figure 2.1 and listing 2.5).

[Figure 2.1: two-panel homology plot, (a) and (b), of the backbone AL111168 against the contigs AANK01000001 to AANK01000010.]

Figure 2.1: Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used as backbone for mapping the contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and reverse hits, respectively. Panel (a) shows unmapped contigs, whereas panel (b) shows mapped contigs.

Listing 2.5: Using contigmap to draw homology between contigs and a reference genome.

set path = (~pfh/scripts/contigmap $path)
contigmap AL111168.fsa AANK.fsa > AANK-raw.ps
contigmap AL111168.fsa mapped.fsa > AANK-mapped.ps

2.2.4 Finding protein encoding genes in prokaryotes

A crucial step in implementing any genome pipeline is gene finding. Successful gene calling enables a number of downstream analyses, such as translation of ORFs into protein sequences, finding potentially novel genes, annotation of protein function by homology searches, assignment of functional domains, and detection of signal peptides to derive the secretome. To reveal novel protein sequences and to draw conclusions about the overall proteome, it is therefore essential that the gene calling can be trusted. Several public prokaryotic gene predictors are available, such as Glimmer3 (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi, Delcher et al. (1999)), GeneMarkS (http://exon.biology.gatech.edu/, Besemer et al. (2001)), EasyGene (http://www.cbs.dtu.dk/services/EasyGene/, Larsen & Krogh (2003)), and Prodigal (unpublished, http://compbio.ornl.gov/prodigal). Prodigal is a recent development, and despite its high speed and simplicity it provides promising results. It has been implemented as part of the CBS Genome Atlas Database Web Services. Listing 2.6 provides code examples showing the usage of the Prodigal client scripts.

Listing 2.6: Using Prodigal for ORF prediction. Note that 6pack is an internal CBS tool used for translation of ORFs.

wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/prodigal.pl
perl prodigal.pl -ta 11 -fasta < mapped.fsa > mapped.orfs.fsa
6pack -1 < mapped.orfs.fsa > mapped.proteins.fsa

Assessing annotation quality

All four gene finders listed above were applied to the latest version of the E. coli strain K-12 isolate MG1655 genome sequence (U00096, 28 July 2009, Blattner et al. (1997)). These predictions, together with an older annotation of the same GenBank entry (U00096 from 2004), were compared pairwise to the latest version of the GenBank entry. The number of unique genes in both the reference and the query genome was derived, and for each overlapping pair of ORFs the average inaccuracy of the 3' and 5' ends was calculated (table 2.1). In addition, the encoded proteins were compared using the BLASTmatrix tool, described in section 2.3.6. This allows estimation of the number of protein families shared between the reference and the query genomes.

source              CDS total  TP     FP   FN   3' off  5' off  sens.  shared
U00096 (present)    4,321      -      -    -    -       -       -      -
U00096 (2004)       4,254      4,172  82   109  1.02    -4.07   0.97   93%
Glimmer 3.02        4,476      4,174  302  125  -0.6    -24.09  0.97   87%
GeneMark-S 2.6      4,377      4,207  170  90   1.94    -20.17  0.98   91%
EasyGene 1.2        4,056      4,017  39   256  -0.28   -19.07  0.94   91%
Prodigal 1.1        4,332      4,200  132  97   0.54    -20.07  0.98   92%

Table 2.1: Performance of prokaryotic gene finders. An older GenBank record for E. coli K-12 (U00096, 2004) has been included, and the reference for all comparisons is the most recent record, shown at the top. The 3' and 5' offsets give the number of base pairs by which a query coordinate lies downstream (positive number) or upstream (negative number) of the reference. The sensitivity is estimated by binary classification as $TP/(TP+FN)$, where $TP$ is the number of proteins shared between reference and query and $FN$ is the number of proteins unique to the reference, not found in the query. Calculating specificity (which requires a true negative count) is difficult, as it is hard to identify regions of the chromosome that certainly do not contain protein coding genes (Larsen & Krogh, 2003). The rightmost column contains an estimate of the percentage of protein families shared between the query and the reference genome. The number is derived using the BLASTmatrix tool.
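As a worked check of the sensitivity column in table 2.1, the following Python sketch recomputes TP / (TP + FN) from the TP and FN counts given in the table. Only the counts come from the table; the variable and function names are chosen for illustration.

```python
# (TP, FN) counts per annotation source, copied from table 2.1.
COUNTS = {
    "U00096 (2004)": (4172, 109),
    "Glimmer 3.02": (4174, 125),
    "GeneMark-S 2.6": (4207, 90),
    "EasyGene 1.2": (4017, 256),
    "Prodigal 1.1": (4200, 97),
}


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of reference proteins that were recovered by the query."""
    return true_positives / (true_positives + false_negatives)


for source, (tp, fn) in COUNTS.items():
    print(f"{source:16s} {sensitivity(tp, fn):.2f}")
```

Rounded to two decimals, the values agree with the sens. column of the table (0.97, 0.97, 0.98, 0.94, and 0.98).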

2.2.5 Finding tRNA and rRNA genes

The tool tRNAscan-SE (Lowe & Eddy, 1997) has been implemented in the CBS Genome Atlas Database Web Service, and it predicts tRNA genes in contigs or genomes:

wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/trnascan.pl
perl trnascan.pl < mapped.fsa > mapped.trna.fsa

The RNAmmer method (Paper VI, chapter 3) can be used to consistently annotate rRNA genes in contigs and full genome sequences. This tool is implemented as a separate Web Service at CBS. Please refer to http://www.cbs.dtu.dk/ws/RNAmmer for full documentation. Listing 2.7 provides an example showing the usage of the RNAmmer client script.

Listing 2.7: Running RNAmmer on a genome sequence.

wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
wget http://www.cbs.dtu.dk/ws/RNAmmer/examples/rnammer.pl
perl rnammer.pl bac < mapped.fsa > mapped.rrna.fsa

2.3 Genome Comparisons

The previous section described some initial steps for annotating a bacterial genome, which is required for further comparative studies. In this section, emphasis is placed on comparing annotated genomes, both at the proteome level and using meta-data. The tools presented here have all been used widely during course activities and research projects.

[Figure 2.2 labels: box from Q1 to Q3 (the IQR) with the median marked; notch showing the 95% confidence interval; each whisker ends at an observed data point not exceeding 1.5 × IQR; mild outliers lie between 1.5 and 3.0 IQR, and extreme outliers more than 3 IQR, away from Q1 and Q3.]

Figure 2.2: Construction of a box-and-whiskers plot. Notches are an estimate of the 95% confidence interval.

2.3.1 Box-and-whiskers plot

As the number of sequenced bacterial genomes grew from only two in 1995 to close to a thousand at the time of writing, enough data became available to sample various genomic properties among the different phylogenetic groups. The box-and-whiskers plot (Tukey, 1977) is a useful tool for visualizing such differences. The plot shows a box between the first and the third quartile (figure 2.2). The distance between Q1 and Q3 is called the interquartile range (IQR), and whiskers are drawn out to the most extreme observations that lie no more than 1.5 × IQR from the box. A line is drawn within the box representing the median. Data between 1.5 × IQR and 3.0 × IQR from the box are denoted "mild" outliers, whereas observations exceeding 3.0 × IQR are extreme outliers. Notches are sometimes drawn to denote the confidence interval. In the R implementation of the box-and-whiskers plot, the 95% confidence interval is approximated by $\pm 1.58 \times \mathrm{IQR}/\sqrt{N}$ around the median. When comparing two or more distributions, non-overlapping notches mark significant differences.
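The construction described above can be sketched in Python as follows. The sketch uses Tukey's hinges for Q1 and Q3 and the factor 1.58 that the R documentation for boxplot.stats gives for the notch; the function names and the example data are illustrative assumptions, not part of the boxplot wrapper used in this chapter.

```python
import math


def fivenum(values):
    """Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one observation")
    n4 = math.floor((n + 3) / 2) / 2
    depths = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # A fractional depth means the value is the mean of two neighbouring observations.
    return [0.5 * (x[math.floor(d) - 1] + x[math.ceil(d) - 1]) for d in depths]


def boxplot_summary(values):
    """Box, whiskers, outlier classes, and notch half-width for one group."""
    _, q1, median, q3, _ = fivenum(values)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    inside = [v for v in values if inner_low <= v <= inner_high]
    mild = [v for v in values
            if outer_low <= v < inner_low or inner_high < v <= outer_high]
    extreme = [v for v in values if v < outer_low or v > outer_high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # whiskers end at observed data points
        "mild": mild,
        "extreme": extreme,
        "notch_half_width": 1.58 * iqr / math.sqrt(len(values)),
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100])
print(summary)
```

In this made-up data set the value 20 falls between the 1.5 × IQR and 3.0 × IQR fences and is a mild outlier, whereas 100 lies beyond 3.0 × IQR and is an extreme outlier.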

Distribution of genome size and base composition in prokaryotes

To examine the base composition and genome size for different phylogenetic groups, the CBS Genome Atlas Database can be queried, grouping replicons into projects and summing or averaging within each project. Although this is only possible from within CBS, the commands are listed below.

mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#', color), ord, sum(length), concat(organism_name, '/', segment_name, '/', genbank) from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' and ph.phyla = p.grp group by s.pid" > length.tbl
set N = `wc -l < length.tbl`
~pfh/scripts/boxplot -main "Size distribution of Prokaryotic genomes (N = $N)" < length.tbl > length.ps
mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#', color), ord, sum(atcontent * length) / sum(length), concat(organism_name, '/', segment_name, '/', genbank) from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' and ph.phyla = p.grp group by s.pid" > atcontent.tbl
~pfh/scripts/boxplot -main "AT content distribution of Prokaryotic genomes (N = $N)" < atcontent.tbl > atcontent.ps
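The two queries aggregate per project: replicon lengths are summed, and the AT content is averaged with each replicon weighted by its length, that is sum(atcontent × length) / sum(length). The following Python sketch reproduces that aggregation outside the database. The function name and the replicon values are made up for illustration and do not come from the Genome Atlas Database.

```python
from collections import defaultdict


def summarize_projects(replicons):
    """Return {project_id: (total_length, length_weighted_AT)}.

    Each replicon is a (project_id, length, at_fraction) tuple.
    """
    total_length = defaultdict(int)
    at_bases = defaultdict(float)
    for project_id, length, at_fraction in replicons:
        total_length[project_id] += length
        at_bases[project_id] += at_fraction * length
    return {
        project_id: (total_length[project_id],
                     at_bases[project_id] / total_length[project_id])
        for project_id in total_length
    }


# Hypothetical project with one chromosome and one smaller replicon,
# plus a second project with a single replicon.
example = [
    (1, 4_000_000, 0.50),
    (1, 1_000_000, 0.60),
    (2, 2_000_000, 0.70),
]
print(summarize_projects(example))
```

Weighting by length ensures that a small plasmid with an unusual base composition does not shift the genome-wide average as much as the chromosome does.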

The tables generated by the MySQL queries can be read by the boxplot program, which is a Perl wrapper for the R command boxplot, and a PostScript document is generated. Figure 2.3 shows the total genome length (including all replicons) of all published prokaryotic genomes, divided into phyla. The confidence interval appears wide for many groups, reflecting high intra-phylum variation. However, for a number of phyla the difference is significant. The β-proteobacteria tend to have longer chromosomes than, for example, the firmicutes, the α-proteobacteria, and the cyanobacteria. It is also evident that the δ-proteobacterium Sorangium cellulosum Soce56 has the longest genome (13,033,779 nt, Schneiker et al. (2007)), but that this is an outlier not representative of the entire phylum. The shortest bacterial genome published so far is that of the α-proteobacterium Candidatus Hodgkinia cicadicola Dsem (143,795 nt, McCutcheon et al. (2009)). Thus, the difference between the smallest and the largest genome is about 90-fold (13,033,779 / 143,795 ≈ 90.6). The plot in figure 2.4 shows the fraction of AT for the prokaryotic genomes, ranging from 25% for the δ-proteobacterium Anaeromyxobacter dehalogenans 2CP-C (Sanford et al., 2002) to 83% for Candidatus Carsonella ruddii PV (Nakabachi et al., 2006).

2.3.2 heatmap - 2D clustering

A way to increase the dimensionality when visualizing genomic properties is to use a so-called heatmap, or 2D clustering. Instead of looking at a single property at a time (e.g. length or AT content), multiple features may be included in the same plot. The axis is replaced with a color transformation of the data, and different normalization methods may be applied. In the example below, a comparison is made for 87 Enterobacteriaceae, covering among others the genera Escherichia, Salmonella, Yersinia, Shigella, Buchnera, and Klebsiella. The CBS Genome Atlas Database is queried for features such as tRNA and rRNA gene count, total coding genes, genome size, AT content, simple genomic repeats, local direct repeats, base pairs per gene, and coding fraction of the genome. The plot is shown in figure 2.5, and the R code for producing the plot is shown in listing 2.8. The data have been normalized to allow for comparison. Features and organisms are hierarchically clustered to group organisms with similar properties and to group properties that correlate within the organisms.
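The text does not state which normalization was applied before plotting. Purely as an illustration of why normalization is needed when features have very different units (for example genome length in base pairs versus AT content as a fraction), the following Python sketch centers each feature on its mean and scales it to the range -1 to 1. The function name and the choice of scaling are assumptions and may differ from what was used for figure 2.5.

```python
def scale_features(rows):
    """Center each column on its mean and scale it to lie within [-1, 1].

    rows: one list of feature values per organism, all of equal length.
    """
    scaled_columns = []
    for column in zip(*rows):
        mean = sum(column) / len(column)
        centered = [value - mean for value in column]
        largest = max(abs(value) for value in centered)
        # A constant feature carries no information and is mapped to zero.
        scaled_columns.append(
            [value / largest if largest else 0.0 for value in centered]
        )
    # Transpose back so that each row again describes one organism.
    return [list(row) for row in zip(*scaled_columns)]


# Three hypothetical organisms, two features each.
print(scale_features([[1, 10], [2, 10], [3, 10]]))
```

After such a transformation, all features share one color scale, so that clustering is not dominated by the feature with the largest numerical range.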

[Figure 2.3: box-and-whiskers plot titled "Size distribution of Prokaryotic genomes (N = 932)", one box per phylum; x-axis from 0.0e+00 to 1.2e+07; E. coli, Buchnera, Salmonella, and Yersinia are marked.]

Figure 2.3: Genome size of all public prokaryotic genomes.

[Figure 2.4: box-and-whiskers plot titled "AT content distribution of Prokaryotic genomes (N = 932)", one box per phylum; x-axis from 0.3 to 0.8; E. coli, Salmonella, Buchnera, and Yersinia are marked.]

Figure 2.4: Average AT content of all public prokaryotic genomes.

Number of genomes per group in figures 2.3 and 2.4: Crenarchaeota 23, Euryarchaeota 39, Nanoarchaeota 1, Acidobacteria 3, Actinobacteria 68, Aquificae 5, Bacteroidetes/Chlorobi 26, Chlamydiae/Verrucomicrobia 14, Chloroflexi 10, Cyanobacteria 36, Deinococcus-Thermus 5, Firmicutes 191, Fusobacteria 1, Planctomycetes 1, Alphaproteobacteria 114, Betaproteobacteria 70, Gammaproteobacteria 226, Deltaproteobacteria 29, Epsilonproteobacteria 25, Spirochaetes 18, Thermotogae 10, Other Archaea 1, Other Bacteria 16.


Listing 2.8: R code to generate a 2D clustering graphic

library(gplots)
postscript(file='output.ps')
data


[Figure 2.5: heatmap with one row per genome (87 Enterobacteriaceae) and one column per feature: TRNA_SCAN_COUNT, LENGTH, NGENES, RNAMMER_SSU_COUNT, ATCONTENT, LOC_DIR_REPEAT, LOC_INV_REPEAT, SR_PERCENT, CODING_FRACTION, BPPRGENE. Color key: Value, from -1 to 1.]

Figure 2.5: 2D-clustering showing 87 Enterobacteriaceae.


1st   2nd position                                3rd
      U          C          A          G
U     56 Phe     31 Ser     41 Tyr     12 Cys     U
      2 Phe      1 Ser      2 Tyr      1 Cys      C
      79 Leu     22 Ser     3 Stop     0 Stop     A
      5 Leu      1 Ser      0 Stop     8 Trp      G
C     7 Leu      13 Pro     17 His     7 Arg      U
      0 Leu      1 Pro      1 His      0 Arg      C
      5 Leu      12 Pro     25 Gln     5 Arg      A
      0 Leu      2 Pro      2 Gln      0 Arg      G
A     79 Ile     18 Thr     75 Asn     12 Ser     U
      4 Ile      1 Thr      6 Asn      1 Ser      C
      51 Ile     20 Thr     131 Lys    18 Arg     A
      18 Met     1 Thr      6 Lys      1 Arg      G
G     18 Val     16 Ala     33 Asp     18 Gly     U
      1 Val      1 Ala      2 Asp      1 Gly      C
      18 Val     15 Ala     41 Glu     27 Gly     A
      1 Val      1 Ala      2 Glu      2 Gly      G

Table 2.2: Codon usage in Buchnera aphidicola Cc. Frequencies are measured per thousand. A total of 354,219 base pairs are examined in 360 ORFs (5 ORFs rejected due to possible frame shifts).

codons may be replaced to encode both identical and similar amino acids to adjust the overall base composition.
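The third-position bias can be read directly from table 2.2 by summing the per-thousand frequencies for each third base. The following Python sketch does this with the values copied from the table; the data layout and function name are chosen for illustration.

```python
BASES = "UCAG"

# Per-thousand codon frequencies from table 2.2.
# Outer key: first base. Inner rows: third base (U, C, A, G).
# Columns within a row: second base (U, C, A, G).
CODON_USAGE = {
    "U": [[56, 31, 41, 12], [2, 1, 2, 1], [79, 22, 3, 0], [5, 1, 0, 8]],
    "C": [[7, 13, 17, 7], [0, 1, 1, 0], [5, 12, 25, 5], [0, 2, 2, 0]],
    "A": [[79, 18, 75, 12], [4, 1, 6, 1], [51, 20, 131, 18], [18, 1, 6, 1]],
    "G": [[18, 16, 33, 18], [1, 1, 2, 1], [18, 15, 41, 27], [1, 1, 2, 2]],
}


def third_position_totals(usage):
    """Sum the per-thousand frequencies of all codons sharing a third base."""
    totals = dict.fromkeys(BASES, 0)
    for rows in usage.values():
        for third_base, row in zip(BASES, rows):
            totals[third_base] += sum(row)
    return totals


totals = third_position_totals(CODON_USAGE)
at_third = totals["A"] + totals["U"]
print(totals, at_third)
```

The frequencies sum to 1,000, and codons ending in A or U account for 925 of them, that is 92.5% of all codons in Buchnera aphidicola Cc.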

2.3.4 CodonPlot: visualizing codon usage

A rose plot diagram (Ussery et al., 2004; Binnewies et al., 2006) may be used to make a graphical representation of codon and amino acid usage. In the codon rose plot, all 64 codons are listed along the perimeter, and the frequency of each codon is drawn on a radial scale. The 64 codons are sorted in the order AUGC, first by the last letter (XX[AUGC]), then by the second letter (X[AUGC]X), and finally by the first letter ([AUGC]XX). The result is four quadrants, with codons ending with A or U in the right half and codons ending with C or G in the left half. This gives an easy overview of biases in the third position.

For the am<strong>in</strong>o acid rose plot, all 20 am<strong>in</strong>o acids are drawn <strong>in</strong> the perimeter with their<br />

frequencies show radially. Here, the am<strong>in</strong>o acids are grouped accord<strong>in</strong>g to their chemical<br />

properties. In addition to the rose plot, <strong>in</strong>formation content can be applied to measure the<br />

bias with<strong>in</strong> each of the three positions of the codon. These codon analysis are shown <strong>in</strong><br />

figure 2.6 for three different enteric genomes: the AT rich Buchnera aphidicola Cc (79.8%<br />

AT), an E. coli stra<strong>in</strong> K-12 (49.2% AT), <strong>and</strong> a somewhat GC rich Klebsiella pneumoniae<br />

NTUH-K2044 (42.3%). The bias <strong>in</strong> B. aphidicola is strik<strong>in</strong>g with a strong preference of A<br />

<strong>and</strong> U at the third position. This variation results <strong>in</strong> a periodic fluctuation of AT content<br />

when align<strong>in</strong>g all open read<strong>in</strong>g frames (ORFs) to the translation start, <strong>and</strong> extract<strong>in</strong>g 400<br />

base pairs up- <strong>and</strong> down-stream, as shown <strong>in</strong> figure 2.7. The red l<strong>in</strong>e represents a 3 po<strong>in</strong>t<br />

runn<strong>in</strong>g average which quickly approaches zero <strong>in</strong> the cod<strong>in</strong>g region. Gray l<strong>in</strong>es represent<br />

the raw average values.<br />
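The codon ordering used along the perimeter of the rose plot can be illustrated with a short sketch. It assumes the base ranking AUGC at all three positions (the text above also writes AUCG for the last letter, so the ranking at that position is an assumption), and the names `ORDER`, `RANK` and `rose_plot_order` are hypothetical.

```python
from itertools import product

ORDER = "AUGC"  # assumed ranking of the bases at every codon position
RANK = {base: i for i, base in enumerate(ORDER)}


def rose_plot_order():
    """List all 64 codons sorted by third, then second, then first letter."""
    codons = ["".join(p) for p in product(ORDER, repeat=3)]
    return sorted(codons, key=lambda c: (RANK[c[2]], RANK[c[1]], RANK[c[0]]))


codons = rose_plot_order()
# Four blocks of 16 codons, one per third-position base.
quadrants = [codons[i:i + 16] for i in range(0, 64, 16)]
```

With this key, the first two quadrants hold the codons ending in A and U, and the last two hold those ending in G and C, which reproduces the right-half/left-half split described above.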


[Figure 2.6 panels: amino acid usage (a, d, g), codon usage (b, e, h) and nucleotide bias at codon positions 1 to 3 in bits (c, f, i), for Buchnera_aphidicola_Cc (a-c), Ecoli_K12 (d-f) and Klebsiella_pneumoniae_NTUH-K2044 (g-i). The radial axes of the rose plots show frequency.]

Figure 2.6: Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella pneumoniae NTUH-K2044 (42.3% AT), and E. coli K-12 (49.2% AT). The rightmost column shows the nucleotide bias of the three codon positions.
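The per-position nucleotide bias in figure 2.6 is given in bits. One common way to obtain such a value is the Shannon information content relative to a uniform base distribution, I = 2 − H, where H is the entropy of the base frequencies at that codon position. The sketch below assumes this definition; the exact formula used for the figure is not stated here, and the function name `position_information` is hypothetical.

```python
import math
from collections import Counter


def position_information(codon_counts):
    """Information content (bits) at each of the three codon positions.

    codon_counts maps codons to counts. For each position the base
    frequencies p are estimated and I = 2 - H(p) is returned, where H is
    the Shannon entropy in bits; 2 bits is the maximum for four bases.
    """
    info = []
    for pos in range(3):
        base_counts = Counter()
        for codon, n in codon_counts.items():
            base_counts[codon[pos]] += n
        total = sum(base_counts.values())
        if total == 0:
            info.append(0.0)  # no data: report no bias
            continue
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in base_counts.values() if n > 0)
        info.append(2.0 - entropy)
    return info


# Toy input: positions 1 and 2 are uniform, position 3 is always A.
bits = position_information({"AAA": 1, "CCA": 1, "GGA": 1, "UUA": 1})
```

A position where all four bases are equally frequent scores 0 bits, and a position that always carries the same base scores 2 bits.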



1st  2nd position                                3rd
     U          C          A          G
U    19 Phe     4 Ser      14 Tyr     3 Cys      U
U    19 Phe     11 Ser     13 Tyr     8 Cys      C
U    6 Leu      4 Ser      2 Stop     1 Stop     A
U    7 Leu      12 Ser     0 Stop     16 Trp     G
C    8 Leu      5 Pro      12 His     13 Arg     U
C    16 Leu     8 Pro      11 His     31 Arg     C
C    3 Leu      4 Pro      7 Gln      3 Arg      A
C    72 Leu     30 Pro     38 Gln     10 Arg     G
A    20 Ile     5 Thr      12 Asn     4 Ser      U
A    33 Ile     31 Thr     22 Asn     22 Ser     C
A    3 Ile      3 Thr      24 Lys     2 Arg      A
A    27 Met     13 Thr     13 Lys     1 Arg      G
G    10 Val     10 Ala     26 Asp     13 Gly     U
G    21 Val     44 Ala     24 Asp     43 Gly     C
G    7 Val      8 Ala      27 Glu     6 Gly      A
G    33 Val     43 Ala     27 Glu     14 Gly     G

Table 2.3: Codon usage in Klebsiella pneumoniae NTUH-K2044. Frequencies are measured per thousand. A total of 4,697,097 base pairs are examined in 5,006 ORFs.

[Figure 2.7 plot: Buchnera_aphidicola_Cc AT content. x-axis: distance from translation start (−400 to 400); y-axis: Z-score (−2.0 to 0.0).]

Figure 2.7: AT content profile 400 bp upstream and downstream of annotated translation starts in Buchnera aphidicola Cc.
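A profile like the one in figure 2.7 can be approximated by averaging the AT content column by column over sequence windows aligned at the translation start, followed by a 3-point running average. The sketch below assumes the windows have already been extracted as equally long strings; it does not reproduce the Z-score scaling of the figure, and the function names are hypothetical.

```python
def at_profile(windows):
    """Fraction of A/T at each column of equally long, start-aligned windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    for i in range(length):
        at = sum(1 for w in windows if w[i].upper() in "AT")
        profile.append(at / len(windows))
    return profile


def running_average(values, width=3):
    """Centered running average (odd width); incomplete end windows are dropped."""
    half = width // 2
    return [sum(values[i - half:i + half + 1]) / width
            for i in range(half, len(values) - half)]


# Three toy windows of six bases each.
windows = ["ATGAAT", "ATGGCT", "TTGACA"]
raw = at_profile(windows)        # per-column AT fraction (gray lines)
smooth = running_average(raw)    # 3-point running average (red line)
```

Because the running average spans exactly one codon, it cancels the three-base periodicity that the raw values show inside coding regions.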


Figure 2.8: Deamination of cytosine (C) into uracil (U)

2.3.5 Base composition and DNA repair

Klebsiella is often found in plant products, on root surfaces and living trees, on fresh vegetables, and in foods with a high content of sugars and acids, such as frozen orange juice concentrate. Klebsiella pneumoniae can cause urinary tract infections, and the NTUH-K2044 strain was isolated from a patient with liver abscess and meningitis. The broad range of ecological niches in which Klebsiella lives share the property of being rich in energy and nitrogen. Nitrogen-fixing aerobic bacteria are known to have higher chromosomal GC content (McEwan et al., 1998), explained by the nitrogen requirement to replicate the chromosome; an AT base pair contains 7 nitrogen atoms whereas a GC pair contains 8 nitrogen atoms.

Cytosine is prone to mutation caused by spontaneous deamination into uracil (Visnes et al., 2009) (figure 2.8). In E. coli the two enzymes uracil N-glycosylase and apurinic (AP) endonuclease are responsible for the repair of this mutation. However, in Buchnera aphidicola Cc, which has a small, reduced genome, these two enzymes are absent (confirmed by protein BLAST). Negative selection is likely to occur in organisms with high chromosomal GC content that lack a functional repair mechanism. Hence, the base composition of the bacterial genome is by no means random, and adjusting the overall GC content through evolution may be yet another way to adapt to the environment.
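The nitrogen argument can be made concrete with a small calculation. With 7 nitrogen atoms per AT pair and 8 per GC pair, the average number of base nitrogen atoms per base pair is 7·AT + 8·(1 − AT). The sketch below applies this to the AT contents quoted for the three genomes in section 2.3.4; the function name is hypothetical and only the nitrogen in the bases themselves is counted.

```python
N_PER_AT_PAIR = 7  # adenine (5 N) + thymine (2 N)
N_PER_GC_PAIR = 8  # guanine (5 N) + cytosine (3 N)


def mean_nitrogen_per_base_pair(at_fraction):
    """Average number of base nitrogen atoms per base pair."""
    if not 0.0 <= at_fraction <= 1.0:
        raise ValueError("at_fraction must lie between 0 and 1")
    return N_PER_AT_PAIR * at_fraction + N_PER_GC_PAIR * (1.0 - at_fraction)


# AT fractions as quoted in the text for the three example genomes.
for name, at in [("B. aphidicola Cc", 0.798),
                 ("E. coli K-12", 0.492),
                 ("K. pneumoniae NTUH-K2044", 0.423)]:
    print(f"{name}: {mean_nitrogen_per_base_pair(at):.3f} N atoms per base pair")
```

The difference between the extremes is small per base pair (about 7.20 versus 7.58 nitrogen atoms) but applies to every base pair of the chromosome at each replication.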

2.3.6 BLASTmatrix - proteome comparison

The BLASTmatrix tool allows for visualization of proteome similarity between larger numbers of organisms. For each of the pairwise combinations of proteomes, a BLAST is performed. Two proteins are declared homologous when at least 50% of the protein is aligned and at least 50% of the residues within the alignment are conserved. For a report of proteome A against proteome B, all homologous proteins are then grouped into families, and the similarity between A and B is calculated as the number of families having both organisms A and B represented. The BLAST report is cached, based on MD5 checksums of the proteomes. This enables the tool to efficiently reuse previous results when organisms are added to a comparison. This is repeated for all $\sum_{j=1}^{N} j = N(N+1)/2$ combinations of the N proteomes, and for each combination a square is drawn containing the following information: the similarity as a percentage of all families of A and B, the number of shared families, and the total number of families. A small example matrix is shown in figure 2.9. The percentage is used to color-code the square to allow for easier overview of larger comparisons.
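The family construction and the similarity measure described above can be sketched as follows. This is an illustration under stated assumptions, not the BLASTmatrix implementation: alignment coverage is measured against the longer of the two proteins, families are formed by single-linkage grouping of homologous pairs, only comparisons between two different proteomes are covered, and all function names are hypothetical.

```python
def is_homologous(aligned_len, conserved, query_len, subject_len):
    """50/50 rule: >= 50% of the (longer) protein aligned, >= 50% conserved."""
    longest = max(query_len, subject_len)
    return aligned_len >= 0.5 * longest and conserved >= 0.5 * aligned_len


def build_families(proteins, homologous_pairs):
    """Single-linkage grouping of proteins into families (union-find)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in homologous_pairs:
        parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())


def similarity(families, organism_of, org_a, org_b):
    """Return (percent, shared, total) over families containing A and/or B."""
    shared = total = 0
    for fam in families:
        orgs = {organism_of[p] for p in fam}
        if org_a in orgs or org_b in orgs:
            total += 1
            if org_a in orgs and org_b in orgs:
                shared += 1
    percent = 100.0 * shared / total if total else 0.0
    return percent, shared, total


# Toy data: three proteins in organism A, two in organism B.
organism_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
pairs = [("a1", "b1"), ("a2", "a3")]
families = build_families(list(organism_of), pairs)
percent, shared, total = similarity(families, organism_of, "A", "B")
```

In the toy data there are three families, {a1, b1}, {a2, a3} and {b2}, of which one contains both organisms, so the square would show one shared family out of three.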

The software requires a configuration in XML as its first argument. In appendix D.4 a Perl script is provided which automatically constructs a configuration that compares all published Campylobacter proteomes by querying the Genome Atlas Database. The output of the BLASTmatrix configuration is shown in figure 2.10.

The software has been used in different publications (Binnewies et al., 2005, 2006) and has been updated a number of times since. The older versions contained both BLAST directions and showed the number of shared proteins, making the diagram redundant. The recent version avoids this by instead plotting the shared families, which renders the plot symmetrical across the diagonal. This allows the lower triangle to be removed.
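The MD5-based caching and the triangular layout can be combined in a short sketch that enumerates the matrix cells and marks which BLAST reports are already available. The cache key format, the function names and the use of FASTA text as input are assumptions made for illustration; they are not taken from the BLASTmatrix code.

```python
import hashlib
from itertools import combinations_with_replacement


def proteome_md5(fasta_text):
    """MD5 checksum of a proteome given as FASTA text."""
    return hashlib.md5(fasta_text.encode("utf-8")).hexdigest()


def comparison_plan(proteomes, cache):
    """Yield (name_a, name_b, cache_key, is_cached) for every matrix cell.

    proteomes maps a name to its FASTA text; cache is any container of
    keys for BLAST reports that already exist. Each unordered pair,
    including a proteome against itself, is visited exactly once.
    """
    checksums = {name: proteome_md5(text) for name, text in proteomes.items()}
    for a, b in combinations_with_replacement(sorted(proteomes), 2):
        key = f"{checksums[a]}_{checksums[b]}"  # hypothetical key format
        is_cached = key in cache
        yield a, b, key, is_cached


proteomes = {"A": ">a1\nMKV\n", "B": ">b1\nMLL\n", "C": ">c1\nMAA\n"}
cache = set()
first_run = list(comparison_plan(proteomes, cache))
cache.update(key for _, _, key, _ in first_run)  # pretend all BLAST jobs ran

# Adding a fourth proteome only requires the cells that involve it.
proteomes["D"] = ">d1\nMGG\n"
second_run = list(comparison_plan(proteomes, cache))
to_compute = [(a, b) for a, b, _, cached in second_run if not cached]
```

Three proteomes give 6 cells and four give 10, so after the first run only the 4 cells involving the new proteome remain to be computed.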



[Figure 2.9 matrix: Escherichia coli strain K-12, substrains DH10B (4,126 proteins, 3,797 families), W3110 (4,226 proteins, 3,965 families) and MG1655 (4,150 proteins, 3,912 families). Each cell shows the percentage of shared families and the number of shared families out of the total.]

Figure 2.9: Construction of the BLASTmatrix diagram. Proteome similarity between three E. coli genomes. Lower part of the diagram corresponds to intra-proteome similarity.

[Figure 2.10 matrix, with panels "Homology between proteomes" and "Homology within proteomes". Proteomes: Campylobacter concisus 13826 (2,080 proteins, 1,972 families); Campylobacter curvus 525.92 (1,931 proteins, 1,885 families); Campylobacter fetus subsp. fetus 82-40 (1,719 proteins, 1,665 families); Campylobacter hominis ATCC BAA-381 (1,687 proteins, 1,623 families); Campylobacter jejuni RM1221 (1,838 proteins, 1,780 families); Campylobacter jejuni subsp. doylei 269.97 (1,731 proteins, 1,650 families); Campylobacter jejuni subsp. jejuni 81-176 (1,758 proteins, 1,702 families); Campylobacter jejuni subsp. jejuni 81116 (1,626 proteins, 1,585 families); Campylobacter jejuni subsp. jejuni NCTC 11168 (1,624 proteins, 1,581 families); Campylobacter lari RM2100 (1,546 proteins, 1,494 families).]

Figure 2.10: Proteome similarity between ten Campylobacter species. Color encoding corresponds to percentage of shared protein families.


[Proteome comparison matrix (figure residue). Row and column labels: A.salmonicida LFI1238; V.species Ex25; V.campbellii AND4; V.harveyi BAA1116; V.shilonii AK1; P.profundum SS9; V.parahaemolyticus 2210633 and 16; V.vulnificus CMCP6 and YJ016; V.species MED222; V.splendidus LGP32; V.fischeri ES114 and MJ11; V.cholerae MO10, BX330286, RC9, MJ1236, B33VCE, 2740-80, AM-19226, MZO-2, 12129, TM11079-80, TMA21, VL426, 1587, N16961, 0395 TEDA, 0395 TIGR, V52 and M66-2. Each cell gives the percentage of shared families and the number of shared families out of the total.]

83 / 3,442<br />

V.shilonii AK1<br />

V.harveyi BAA1116<br />

2.1 %<br />

72 / 3,454<br />

V.campbellii AND4<br />

V.species Ex25<br />

2.2 %<br />

73 / 3,311<br />

A.salmonicida LFI1238<br />

V.fischeri MJ11<br />

2.5 %<br />

84 / 3,305<br />

V.fischeri ES114<br />

V.splendidus LGP32<br />

2.8 %<br />

99 / 3,586<br />

V.species MED222<br />

V.vulnificus YJ016<br />

3.5 %<br />

125 / 3,567<br />

V.vulnificus CMCP6<br />

V.parahaemolyticus 16<br />

2.6 %<br />

92 / 3,593<br />

V.parahaemolyticus 2210633<br />

V.cholerae VL426<br />

3.0 %<br />

109 / 3,575<br />

V.cholerae TMA21<br />

V.cholerae TM11079-80<br />

2.8 %<br />

102 / 3,619<br />

V.cholerae 12129<br />

V.cholerae MZO-2<br />

2.9 %<br />

100 / 3,429<br />

V.cholerae AM-19226<br />

V.cholerae 1587<br />

1.8 %<br />

59 / 3,353<br />

30.0 %<br />

V.cholerae 2740-80<br />

V.cholerae B33VCE<br />

2.8 %<br />

99 / 3,560<br />

0.0 %<br />

Homology between proteomes<br />

Homology with<strong>in</strong> proteomes<br />

V.cholerae MJ1236<br />

V.cholerae RC9<br />

3.3 %<br />

120 / 3,599<br />

V.cholerae BX330286<br />

V.cholerae MO10<br />

4.3 %<br />

157 / 3,665<br />

V.cholerae M66-2<br />

V.cholerae V52<br />

4.2 %<br />

155 / 3,729<br />

V.cholerae 0395 TIGR<br />

V.cholerae 0395 TEDA<br />

3.0 %<br />

110 / 3,665<br />

90.0 %<br />

6.0 %<br />

V.cholerae N16961<br />

Figure 2.11: Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae strains are shown in dark green.

Large similarities between environmental and pathogenic V. cholerae

The BLAST matrix shown in figure 2.11 includes environmental and pathogenic strains of V. cholerae. The figure shows that the V. cholerae strains share a large number of genes, both within and between these two groups.

Intra- vs. inter-proteome similarity

The lower row of the diagram shows the special case of organism A versus itself, that is, the intra-proteome similarity. If not dealt with separately, this part would appear as 100% similar, since the proteome is BLASTed against itself. However, all self-matching proteins are excluded, leaving this part to reflect the paralogs of the organism. This part also has a separate color encoding (red), whereas the inter-proteome comparison is coded green (see figure 2.10).
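The exclusion of self-matches can be illustrated with a short Python sketch. It is not the BLASTmatrix implementation: the hit tuples, the function name `paralog_fraction`, and the rule that a protein is counted once if it has any non-self hit are assumptions made for the illustration (the matrix itself reports counts of protein families).

```python
def paralog_fraction(proteins, hits):
    """Fraction of proteins with at least one significant hit to a
    different protein of the same proteome (hypothetical helper)."""
    protein_set = set(proteins)
    with_paralog = set()
    for query, subject in hits:
        if query == subject:
            continue  # the trivial self-match is excluded
        if query in protein_set and subject in protein_set:
            with_paralog.add(query)
    return len(with_paralog) / len(protein_set)


# Toy proteome: every protein hits itself, p1 and p3 also hit each other.
proteins = ["p1", "p2", "p3", "p4"]
hits = [("p1", "p1"), ("p2", "p2"), ("p3", "p3"), ("p4", "p4"),
        ("p1", "p3"), ("p3", "p1")]
print(f"{100 * paralog_fraction(proteins, hits):.1f} %")
```

Without the self-match filter every protein would count as a hit and the cell would read 100%; with the filter only the two paralogous proteins remain.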

2.3.7 BLASTatlas - visualizing whole-genome homology

The BLASTmatrix tool described earlier condenses the similarity between two proteomes into a single number. This simplification allows for an all-against-all comparison, but lacks detailed information on which genes are conserved and where they are located. The BLASTatlas method overcomes these issues by comparing the proteomes to a single reference chromosome. When a single representative chromosome has been selected, all ORFs or proteins of that reference are BLASTed against each of the proteomes to be included in the comparison. The best alignment against each proteome, regardless of its significance, is mapped back to the reference genome. A numerical value of zero is mapped at mismatches or gaps, 0.5 at conservative mismatches, and one at matches. This method has proved powerful because it answers several questions in one diagram: Which reference proteins are found in which query genomes? How well are they conserved? And is there any correlation between the conservation of neighboring genes, such as within larger genomic islands? Figure 2.12 depicts the remapping of a protein-protein alignment back to the reference genome.

Figure 2.12: Mapping of pairwise alignment to a reference genome. Mismatches, conservative mismatches and perfect matches contribute 0.0, 0.5, and 1.0 to the overall map, respectively. Gaps within the reference protein, corresponding to missing features of the reference protein, cannot be mapped and are hence excluded.

Figure 2.13: Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds to a pairwise comparison against the reference chromosome.
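The scoring rule of the mapping step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the BLASTatlas code: the function name `map_alignment` is hypothetical, and the alignment is assumed to be given as the aligned reference sequence plus the BLAST midline, in which a letter marks an identity, `+` a conservative substitution, and a space a mismatch or a gap in the hit.

```python
def map_alignment(ref_aligned, midline, ref_start, scores):
    """Write per-residue conservation scores onto the reference protein.

    ref_aligned : aligned reference sequence, '-' marks a gap in the reference
    midline     : BLAST midline for the same alignment columns
    ref_start   : 0-based position of the first aligned reference residue
    scores      : list with one value per reference residue, updated in place
    """
    pos = ref_start
    for ref_char, mid_char in zip(ref_aligned, midline):
        if ref_char == "-":
            continue  # gap in the reference: nothing to map, column excluded
        if mid_char == "+":
            scores[pos] = 0.5  # conservative mismatch
        elif mid_char == " ":
            scores[pos] = 0.0  # mismatch or gap in the hit
        else:
            scores[pos] = 1.0  # identical residue
        pos += 1
    return scores


# Reference MKTLVE aligned to a hit with one insertion and one deletion:
#   reference  MKT-LVE
#   midline    M+T L +
#   hit        MRTAL-D
scores = map_alignment("MKT-LVE", "M+T L +", 0, [0.0] * 6)
print(scores)
```

The column where the reference carries a gap is skipped, so the result always has exactly one value per reference residue, which is what allows several query proteomes to be stacked as tracks on the same coordinate system.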

The result of the mapping step is a list of the same length as the reference genome. BLASTatlas then uses the GeneWiz software (Pedersen et al., 2000) to visualize this numerical data. GeneWiz applies a smoothing, and each bin is then encoded into a color representation, using either a fixed range or a dynamic range given as n standard deviations around the average. Each genome included in the comparison is plotted as an individual track. The tool is offered as a Web Service (see chapter 4). A general client script can be obtained from the online documentation at http://www.cbs.dtu.dk/ws/BLASTatlas. The client script produces a PostScript plot as output. The next sections provide examples that demonstrate the flexibility of the tool.
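The two color-scaling modes can be illustrated with a small Python sketch. It does not reproduce GeneWiz: the window averaging used as smoothing, the function names, the mode names `fix` and `dev`, and the default of three standard deviations are all assumptions chosen for the illustration.

```python
from statistics import mean, pstdev


def bin_means(values, n_bins):
    """Average a per-position list into n_bins windows of near-equal size."""
    if not 0 < n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    n = len(values)
    return [mean(values[i * n // n_bins:(i + 1) * n // n_bins])
            for i in range(n_bins)]


def color_range(binned, mode="fix", fixed=(0.0, 0.8), n_dev=3.0):
    """Return the (low, high) values that map to the ends of the color scale."""
    if mode == "fix":
        return fixed
    avg, dev = mean(binned), pstdev(binned)
    return (avg - n_dev * dev, avg + n_dev * dev)


def intensity(value, low, high):
    """Scale a bin value to 0..1 within the color range, clipping outliers."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


per_position = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
binned = bin_means(per_position, 4)
low, high = color_range(binned, mode="fix")
print([round(intensity(v, low, high), 2) for v in binned])
```

A fixed range makes tracks directly comparable with each other, whereas a range derived from the average and standard deviation adapts the contrast to each data set.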

Gene loss in Burkholderia species

A comparative study aimed at mapping pathogenicity islands or gene losses among different bacterial genomes can benefit from the graphical representation provided by the BLASTatlas method. The genus Burkholderia covers a number of important animal and human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection in CF patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei, rarely gives rise to disease in humans (Brett et al., 1998; Smith et al., 1997). All publicly available and fully sequenced Burkholderia genomes are compared to chromosomes I and II of B. pseudomallei 1710b. The code listing below describes how the comparison was made. It demonstrates the flexibility of the tool, which allows for easy automation by reading simple configuration files, in this case generated by a MySQL query. The output configuration file is listed in appendix D.3.

# let mysql construct the blast configuration file (csh syntax throughout)
mysql --raw -B -N -e 'select concat("legend:",replace(organism_name,"Burkholderia","B."),"\nprogram:blastp\ncolor:",if(organism_name like "%pseudomal%","101010_000009",if(organism_name like "%mallei%","101010_000900",if(organism_name like "%cenocep%","101010_080000",if(organism_name like "%ambi%","101010_020002",if(organism_name like "%thailand%","101010_000900","101010_050505"))))),"\nrange:0.0,0.8\nsource:files/",pid,".fsa\n") from genomeatlas3_cur.genbank_complete_prj where organism_name like "burkhold%" and organism_name not like "%1710b%" order by organism_name;' > blast.cfg

# copy genbank files of chr I and II and derive annotation, protein and DNA files
foreach acc ( CP000124 CP000125 )
  cp /home/databases/genomeatlasdb-3.0_cur/data/$acc/$acc.gbk .
  saco_convert -I genbank -O annotation $acc.gbk > $acc.ann
  saco_extract -I genbank -O fasta -t $acc.gbk > $acc.proteins.fsa
  saco_convert -I genbank -O fasta $acc.gbk > $acc.fsa
end

# run the BLASTatlas client script on both chromosomes
perl BLASTatlas -modus circle -ref CP000124.fsa -proteins CP000124.proteins.fsa -ann CP000124.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" -title "B. pseudomallei 1710b, chr I" > burkholderia_chrI.ps
perl BLASTatlas -modus circle -ref CP000125.fsa -proteins CP000125.proteins.fsa -ann CP000125.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" -title "B. pseudomallei 1710b, chr II" > burkholderia_chrII.ps
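For readers less familiar with nested SQL `if()` expressions, the logic of the query can be restated as a Python sketch. The entry layout (`legend`, `program`, `color`, `range`, `source`) is inferred from the query above rather than from appendix D.3, and the function name `track_entry` and the project IDs in the example are made up for the illustration.

```python
# (substring of the organism name, color code); the first match wins,
# mirroring the order of the nested if() calls in the query.
COLOR_RULES = [
    ("pseudomal", "101010_000009"),
    ("mallei", "101010_000900"),
    ("cenocep", "101010_080000"),
    ("ambi", "101010_020002"),
    ("thailand", "101010_000900"),
]
DEFAULT_COLOR = "101010_050505"


def track_entry(organism_name, pid):
    """Build one track entry of the BLAST configuration file."""
    name = organism_name.lower()  # MySQL LIKE is case-insensitive by default
    color = next((c for key, c in COLOR_RULES if key in name), DEFAULT_COLOR)
    legend = organism_name.replace("Burkholderia", "B.")
    return (f"legend:{legend}\nprogram:blastp\ncolor:{color}\n"
            f"range:0.0,0.8\nsource:files/{pid}.fsa\n")


# Hypothetical rows (organism_name, pid) as the database query might return them.
rows = [("Burkholderia mallei SAVP1", 101),
        ("Burkholderia pseudomallei K96243", 102),
        ("Burkholderia glumae BGR1", 103)]
config = "".join(track_entry(name, pid) for name, pid in rows)
print(config)
```

Because the rules are tested in order, a B. pseudomallei genome receives its own color even though its name also contains "mallei".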

The plots of the two chromosomes are shown in figure 2.14. The other B. pseudomallei genomes stand out as three dark blue tracks, representing high homology within the species. Both B. thailandensis and B. mallei display large chromosomal deletions when compared to B. pseudomallei. However, the more scattered nature of the gene loss observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through the loss of larger regions (Ong et al., 2004). These deletions are evident from the atlases, which also show a strong preference for deletions in chromosome II. Ong and co-workers report that deletions in chromosome II account for 70% and 61% of the total gene loss in B. mallei and B. thailandensis, respectively.

The Alcanivorax phylome BLASTatlas

Tracks on the BLASTatlas are not limited to single genomes or proteomes. The sequence files specified for a given track are converted into a BLAST database, and the reference genome is searched against the database of each track. A track may therefore just as well be a collection of genomes, entire phyla or even SwissProt. In Paper III a ‘phylome’ atlas was constructed for the oil-degrading marine bacterium Alcanivorax borkumensis (Reva et al., 2008). Here, tracks were constructed by collecting all proteins of all published bacterial genomes, all proteobacteria, and all γ-, α-, β-, δ-, and ε-proteobacteria (see figure 2.15). The phylome atlas reveals no or very few homologs in δ- and ε-proteobacteria and some homologs in α- and β-proteobacteria, whereas the highest sequence homology was identified among γ-proteobacteria.

[Image: two circular BLAST atlases, B. pseudomallei 1710b, chr I (4,126,292 bp) and B. pseudomallei 1710b, chr II (3,181,762 bp). BLAST tracks, each with a fixed color range of 0.00 to 0.80: B. ambifaria AMMD, B. ambifaria MC40-6, B. cenocepacia AU 1054, B. cenocepacia HI2424, B. cenocepacia J2315, B. cenocepacia MC0-3, B. glumae BGR1, B. mallei ATCC 23344, B. mallei NCTC 10229, B. mallei NCTC 10247, B. mallei SAVP1, B. multivorans ATCC 17616, B. phymatum STM815, B. phytofirmans PsJN, B. pseudomallei 1106a, B. pseudomallei 668, B. pseudomallei K96243, B. sp. 383, B. thailandensis E264, B. vietnamiensis G4, B. xenovorans LB400. Further tracks: annotations (CDS +, CDS -, rRNA, tRNA), Percent AT (0.21 to 0.42) and GC Skew (-0.09 to 0.09). Resolution: 1273.]

Figure 2.14: Comparison of B. pseudomallei 1710b chromosome I and II against all public Burkholderia genomes.


[Image: circular phylome atlas of A. borkumensis (3,120,143 bp). Tracks: Bacteria, Proteobacteria and gamma (fixed color range 0.00 to 0.50); alpha, beta, delta and epsilon (fixed color range 0.00 to 0.30); annotations (CDS +, CDS -, rRNA, tRNA); Percent AT (0.40 to 0.51). Resolution: 1249.]

Figure 2.15: A phylome atlas of Alcanivorax borkumensis, comparing the proteome against all γ-, α-, β-, δ-, and ε-proteobacteria available at the time of publishing.


[Image: bar chart of the number of strains and species (axis 0 to 50) for the genera Streptococcus, Escherichia, Bacillus, Clostridium, Burkholderia, Mycobacterium, Candidatus, Staphylococcus, Shewanella and Mycoplasma.]

Figure 2.16: Count of genomes and species divided by genera. Source: CBS Genome Atlas Database as of 2009-09-11.

2.3.8 CorePlot - plotting the core- and pan-genomes of species

There are a number of bacterial genera for which numerous strains and species are fully sequenced. Streptococcus (43 strains), Escherichia (29 strains), and Bacillus (25 strains) are the most highly represented genera among the Bacteria (Genome Atlas Database, 2009-09-11). Figure 2.16 shows the genome and species counts of the 10 most sampled genera. The increased depth at which bacterial genera are sequenced has previously been used to estimate the core- and pan-genome by fitting an exponentially decaying function. An often used approach is to perform either a limited or a full permutation of the genome order (Lefebure & Stanhope, 2007; Tettelin et al., 2005). This provides an error estimate for every step at which a genome is added. An alternative method was developed during the Ph.D. project. It derives the protein families by grouping homologous proteins, but uses a fixed order of genomes. Homologs are identified by pairwise protein BLAST between proteomes, followed by a grouping of all significant alignments (50% alignment length and 50% conservation within the alignment). The method can re-use cached BLAST reports from the BLASTmatrix method. The example below uses the same proteome files as were generated in the BLASTmatrix example (section 2.3.6 and appendix D.4), and it demonstrates how a MySQL query can be used as configuration for the CorePlot program.

# let mysql write the CorePlot input table: organism name and proteome file
mysql -N -B -e "select organism_name, concat(pid,'.proteins.fsa') from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name" > table.dat
# run the coregenome script on the table
perl ~pfh/scripts/coregenome/coregenome-2.3 < table.dat > core.ps

Both the BLASTmatrix and the coregenome scripts access the same MySQL caching databases, so the user does not have to worry about how results are cached and shared between the two programs. Figure 2.17 shows the core- and pan-genome plot generated by the program.
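The homology criterion and the grouping into protein families described above can be sketched in Python. This is an illustration under assumptions, not the coregenome script: coverage is taken relative to the longer of the two proteins, conservation is taken as the fraction of identical residues in the alignment, and families are formed by single-linkage grouping with a union-find structure. All function names are hypothetical.

```python
def is_homolog(aln_len, identities, query_len, subject_len):
    """Apply the 50% length / 50% conservation rule to one alignment."""
    coverage = aln_len / max(query_len, subject_len)
    identity = identities / aln_len
    return coverage >= 0.5 and identity >= 0.5


def group_families(proteins, homolog_pairs):
    """Group proteins connected by homolog pairs into families."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving keeps trees shallow
            p = parent[p]
        return p

    for a, b in homolog_pairs:
        parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())


# a~b and b~c are significant, so a, b and c end up in one family.
families = group_families(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(sorted(sorted(f) for f in families))
```

With single-linkage grouping, two proteins can share a family without a significant alignment between them, as long as a chain of homologs connects them.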

By using a fixed genome order, it is possible to compare multiple species within the same plot and to reveal varying slopes of the pan- and core-genome graphs. Figure 2.17 shows that the first 5 strains come from distinct species, giving rise to a steep increase of the pan-genome and a reduction of the core genome. The following four genomes (6 to 9) also come from C. jejuni, and the curves appear to flatten out at a core size of 600 proteins and a pan-genome size of 5,200 proteins. In figure 2.18 a larger core- and pan-genome plot for Vibrio species is shown (paper IV).
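How the curves arise from a fixed genome order can be shown with a short Python sketch. It assumes that each genome has already been reduced to a set of protein family identifiers; the function name `core_pan_curve` and the toy data are hypothetical.

```python
def core_pan_curve(genomes):
    """Accumulate core- and pan-genome sizes over an ordered genome list.

    genomes : list of (name, set of family ids) in the fixed plotting order
    returns : list of (name, new families, core size, pan size)
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new = families - pan            # families not seen in earlier genomes
        pan |= families                 # the pan-genome can only grow
        core = set(families) if core is None else core & families
        rows.append((name, len(new), len(core), len(pan)))
    return rows


genomes = [
    ("A", {1, 2, 3, 4}),
    ("B", {2, 3, 4, 5}),
    ("C", {3, 4, 6}),
]
for row in core_pan_curve(genomes):
    print(row)
```

Because no permutation is involved, the result depends on the order of the input, which is what makes the jump at each newly added species visible in the plot.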

[Image: CorePlot output with the series New genes, New gene families, Core genome and Pan genome (axis 0 to 7,000), plotted for the following genome order:
1: Campylobacter concisus 13826
2: Campylobacter curvus 525.92
3: Campylobacter fetus subsp. fetus 8240
4: Campylobacter hominis ATCC BAA381
5: Campylobacter jejuni RM1221
6: Campylobacter jejuni subsp. doylei 269.97
7: Campylobacter jejuni subsp. jejuni 81176
8: Campylobacter jejuni subsp. jejuni 81116
9: Campylobacter jejuni subsp. jejuni NCTC 11168
10: Campylobacter lari RM2100]

Figure 2.17: Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently available, there seems to exist an equilibrium at close to 600 protein families.

[Image: page excerpt from Paper IV, reproduced as figure 2.18. The excerpt starts and ends mid-sentence.]

pan-genome (blue line) increases, and the number of conserved gene families (red line) in the core genome decreases, albeit at a lower rate. This is because every genome can add many novel (and frequently different) genes to the pan-genome but only decreases the core genome with a few genes that are absent in that particular strain but that were conserved in the previously given genomes. The pan-genome curve increases with a relatively steep slope when a novel species is added, as is obvious when one V. parahaemolyticus genome is added after the 18th V. cholerae. A stable plateau can be seen for the pan genome of the V. cholerae genomes around 6500 genes, whereas the core genome steadily decreases to approximately 1000 genes for these 32 genomes. A. salmonicida, although not a member of the Vibrio genus, does not add significantly more genes to the pan genome than the other Vibrio species do, in contrast to P. profundum which produces a sharp increase in the pan genome, as does, interestingly, V. shilonii. Note that there are approximately 20,000 total gene families within the 30 sequenced Vibrionaceae genomes.

In fact, the small jump seen in the pan genome of V. cholerae when adding the 11th genome (figure 3) is caused by the difference between the two subclusters of V. cholerae seen in the pan-genome family tree (figure 2). Note that the 10th strain (V. cholerae 2740-80) behaves as an outlier in all the figures shown; although documented as an environmental isolate, this appears closer to the clinical isolates, in terms of overall genomic properties.

[Plot with the series Pan genome, Core genome and New gene families (axis 0 to 25,000) for the 32 Vibrionaceae genomes.]

Figure 3. Pan- and core-genome plot of the 32 Vibrionaceae genomes. V. cholerae strains that do not cause cholera are highlighted in bright green. Colours are the same as in Figure 2.

BLAST comparison visualized in a BLAST matrix

A BLAST matrix provides a visual overview of reciprocal pairwise whole genome comparisons (figure 4). The stronger a matrix cell is colored, the more similarity was

Figure 2.18: CorePlot output for 32 Vibrio genomes.


2.4 Summary

This chapter presents a number of comparative genomics and visualization tools used in a genome annotation and analysis pipeline. Visualization methods have been shown to help draw biological conclusions about adaptation to environmental niches, pathogenic properties, and many other genomic properties, including proteome similarity. Keeping an overview of the large amount of genomic data is a constant challenge that will need more attention in the future as sequencing becomes more and more common. How can one visualize a comparison of a thousand genomes? Soon there will be a need to compare sets of thousands of genomes.

2.5 Instant insight: Reading the genetic atlas



2.6 Paper I: The genome BLASTatlas - a GeneWiz extension<br />

for visualization of whole-genome homology<br />

Molecular BioSystems, Volume 4, Number 5, May 2008, Pages 353–444
ISSN 1742-206X, www.molecularbiosystems.org
Indexed in MEDLINE

HIGHLIGHT: Peter F. Hallin et al., The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology
REVIEW: Eric C. Greene et al., The importance of surfaces in single-molecule bioscience


HIGHLIGHT www.rsc.org/molecularbiosystems | Molecular BioSystems<br />

The genome BLASTatlas—a GeneWiz<br />

extension for visualization of whole-genome<br />

homology<br />

Peter F. Hallin, Tim T. Binnewies* and David W. Ussery

DOI: 10.1039/b717118h<br />

The development of fast <strong>and</strong> <strong>in</strong>expensive methods for sequenc<strong>in</strong>g bacterial genomes<br />

has led to a wealth of data, often with many genomes be<strong>in</strong>g sequenced of the same<br />

species or closely related organisms. Thus, there is a need for visualization methods that<br />

will allow easy comparison of many sequenced genomes to a def<strong>in</strong>ed reference stra<strong>in</strong>.<br />

The BLASTatlas is one such tool that is useful for mapp<strong>in</strong>g <strong>and</strong> visualiz<strong>in</strong>g whole<br />

genome homology of genes <strong>and</strong> prote<strong>in</strong>s with<strong>in</strong> a reference stra<strong>in</strong> compared to other<br />

stra<strong>in</strong>s or species of one or more prokaryotic organisms. We provide examples of<br />

BLASTatlases, <strong>in</strong>clud<strong>in</strong>g the Clostridium tetani plasmid p88, where homologues for tox<strong>in</strong><br />

genes can be easily visualized <strong>in</strong> other sequenced Clostridium genomes, <strong>and</strong> for a<br />

Clostridium botul<strong>in</strong>um genome, compared to 14 other Clostridium genomes. DNA<br />

structural <strong>in</strong>formation is also <strong>in</strong>cluded <strong>in</strong> the atlas to visualize the DNA chromosomal<br />

context of regions. Additional <strong>in</strong>formation can be added to these plots, <strong>and</strong> as an<br />

example we have added circles show<strong>in</strong>g the probability of the DNA helix open<strong>in</strong>g up<br />

under superhelical tension. The tool is SOAP compliant <strong>and</strong> WSDL (web services<br />

description language) files are located on our website (http://www.cbs.dtu.dk/ws/BLASTatlas),<br />

where programming examples are available in Perl. By providing an<br />

<strong>in</strong>teroperable method to carry out whole genome visualization of homology,<br />

this service offers bio<strong>in</strong>formaticians as well as biologists an easy-to-adopt workflow<br />

that can be directly called from the programm<strong>in</strong>g language of the user, hence<br />

enabl<strong>in</strong>g automation of repeated tasks. This tool can be relevant <strong>in</strong> many pangenomic<br />

as well as <strong>in</strong> metagenomic studies, by giv<strong>in</strong>g a quick overview of clusters of<br />

<strong>in</strong>sertion sites, genomic isl<strong>and</strong>s <strong>and</strong> overall homology between a reference<br />

sequence <strong>and</strong> a data set.<br />

Center for Biological Sequence Analysis,<br />

Department of Systems Biology, The<br />

Technical University of Denmark, 2800<br />

Lyngby, Denmark. E-mail: pfh@cbs.dtu.dk.<br />

E-mail: tim@cbs.dtu.dk. E-mail:<br />

dave@cbs.dtu.dk<br />

Background<br />

It has been more than 10 years s<strong>in</strong>ce the<br />

sequenc<strong>in</strong>g of the first bacterial genome<br />

(ref. 1, US patent number 6,528,289), <strong>and</strong><br />

currently sequence data are available for<br />

more than a thous<strong>and</strong> sequenced genomes.<br />

With so many genomes sequenced,<br />

multiple genome sequences now exist for several<br />

bacterial species; for example, at the time<br />

of writ<strong>in</strong>g, 10 different Escherichia coli<br />

genomes have been fully sequenced <strong>and</strong><br />

published, <strong>and</strong> draft sequences for another<br />

31 genomes are available, add<strong>in</strong>g<br />

Peter F. Hall<strong>in</strong> was born <strong>in</strong><br />

Odense, Denmark, <strong>and</strong> is currently<br />

a PhD student at <strong>CBS</strong>,<br />

DTU. Tim T. B<strong>in</strong>newies grew<br />

up <strong>in</strong> Kiel, Germany, <strong>and</strong> obta<strong>in</strong>ed<br />

his PhD from the Technical<br />

University of Denmark,<br />

he is currently work<strong>in</strong>g for<br />

Roche Diagnostics AG <strong>in</strong> Switzerl<strong>and</strong>.<br />

David W. Ussery was<br />

born <strong>and</strong> raised <strong>in</strong> Spr<strong>in</strong>gdale,<br />

Arkansas. S<strong>in</strong>ce 1998, he has<br />

been leader for the <strong>Comparative</strong><br />

Genomics group at <strong>CBS</strong>.<br />

This journal is © The Royal Society of Chemistry 2008 Mol. BioSyst., 2008, 4, 363–371 | 363


up to a total of 41 different E. coli<br />

genomes (accord<strong>in</strong>g to the National Center<br />

for Biotechnology Information,<br />

NCBI Entrez, 12-Feb-2008). Table 1 lists<br />

the top 20 represented prokaryotic<br />

genera <strong>in</strong> terms of numbers of fully<br />

sequenced genomes based on recent<br />

count<strong>in</strong>g <strong>in</strong> Entrez Genome Projects,<br />

although these numbers will change<br />

quickly as more genomes are be<strong>in</strong>g<br />

added on a regular basis. Thus, analysis<br />

of multiple genomes of the same organism<br />

(the ‘‘pangenome’’) is now possible,<br />

<strong>and</strong> as more metagenomic datasets are<br />

published (see for example the projects<br />

listed on the GOLD web pages 24 ), there<br />

is a need for a graphical representation<br />

of how these new data compare to exist<strong>in</strong>g<br />

reference stra<strong>in</strong>s or model organisms.<br />

We have developed a visualization<br />

method, called ‘‘BLASTatlas’’, for show<strong>in</strong>g<br />

mapped alignments of BLAST<br />

searches of a reference sequence aga<strong>in</strong>st<br />

one or more databases, onto the reference<br />

genome. Early implementation of a<br />

similar method 2–4 accounted for the statistical<br />

significance (E-value) of each hit,<br />

by color cod<strong>in</strong>g the expectation values<br />

[−log(E)] of the alignment. This method<br />

gives a uniform color throughout the<br />

alignment (gene or prote<strong>in</strong>) but shows<br />

no <strong>in</strong>formation about the am<strong>in</strong>o acid<br />

conservation with<strong>in</strong> regions of the alignment.<br />

At the level of a bacterial chromosome,<br />

this makes little difference,<br />

although when one zooms <strong>in</strong> at the level<br />

of <strong>in</strong>dividual genes, the older method of<br />

shading the entire gene based on the E-value<br />

gives no <strong>in</strong>formation about regions<br />

with<strong>in</strong> a gene (such as functional doma<strong>in</strong>s)<br />

which might be strongly conserved,<br />

whilst other parts of the gene<br />

have little sequence homology with<strong>in</strong><br />

other genomes. We have ref<strong>in</strong>ed the<br />

BLASTatlas method to map each<br />

<strong>in</strong>dividual am<strong>in</strong>o acid residue or<br />

nucleotide back to the reference genome<br />

sequence from which the cod<strong>in</strong>g sequence<br />

was derived. Instead of colour-coding<br />

the significance of the entire hit,<br />

this method maps the conservation of the<br />

<strong>in</strong>dividual bases or am<strong>in</strong>o acids. Tools<br />

such as the Artemis Comparison Tool<br />

(ACT) 5 allow detailed view<strong>in</strong>g of complete<br />

BLAST results, <strong>and</strong> this is an<br />

excellent graphical method for comparison<br />

of two genomes. ACT can also be<br />

extended to compare two genomes to a<br />

reference, placed <strong>in</strong> the middle. In<br />

contrast, the BLASTatlas method can<br />

compare many genomes to the same<br />

reference, <strong>and</strong> can provide a quick overview<br />

of chromosomal regions of gene<br />

conservation across many genomes.<br />

As can be seen from Table 1, for many<br />

of the heavily sampled genera, there are<br />

further genome projects <strong>in</strong> the pipel<strong>in</strong>e<br />

which will produce even more sequences<br />

than are currently available, <strong>and</strong> there is<br />

a need for methods for efficient comparison<br />

of these genomes, giv<strong>in</strong>g an overview<br />

of general trends <strong>in</strong> the data. The<br />

Table 1 The number of species <strong>and</strong> NCBI Entrez Project IDs of the 20 most represented genera<br />

<strong>in</strong> the Entrez Genome Projects Database, 13 as accessed on 21 October 2007. The numbers <strong>in</strong><br />

brackets show the count<strong>in</strong>g of both ongo<strong>in</strong>g <strong>and</strong> completed projects, whereas the first number<br />

reflects only the completed projects. C<strong>and</strong>idate genera have been excluded from this count<strong>in</strong>g<br />

Genus Projects Species<br />

Streptococcus 26 [63] 8 [15]<br />

Burkholderia 15 [55] 8 [15]<br />

Bacillus 16 [48] 9 [16]<br />

Clostridium 14 [43] 9 [22]<br />

Vibrio 7 [35] 5 [14]<br />

Mycobacterium 16 [30] 9 [14]<br />

Salmonella 5 [30] 2 [3]<br />

Listeria 4 [29] 3 [6]<br />

Escherichia 10 [27] 1 [1]<br />

Mycoplasma 13 [25] 11 [17]<br />

Shewanella 14 [24] 10 [15]<br />

Pseudomonas 13 [23] 7 [8]<br />

Yers<strong>in</strong>ia 9 [23] 3 [7]<br />

Haemophilus 6 [23] 3 [4]<br />

Staphylococcus 17 [22] 4 [5]<br />

Synechococcus 10 [21] 2 [2]<br />

Campylobacter 9 [20] 5 [9]<br />

Francisella 7 [16] 1 [2]<br />

Lactobacillus 11 [15] 10 [12]<br />

Rickettsia 10 [15] 9 [12]<br />

BLASTatlas allows the comparison of<br />

many genomes to a reference sequence.<br />

The current limit is about 60 genomes.<br />

There are two levels of comparison: the<br />

first is a one-page map of the<br />

whole chromosome, and the second<br />

zooms in on a particular region of interest,<br />

allowing the visualization of regions<br />

of conservation within individual genes.<br />

The color-cod<strong>in</strong>g represents identical<br />

am<strong>in</strong>o acids (or nucleic acids), based on<br />

a pairwise alignment of all prote<strong>in</strong> cod<strong>in</strong>g<br />

regions, with the best matches for<br />

each gene <strong>in</strong> the reference genome<br />

shown. Thus, comb<strong>in</strong><strong>in</strong>g both levels, it<br />

is possible to get a global overview of the<br />

whole chromosome, <strong>and</strong> to then quickly<br />

identify gene conservation (or lack thereof)<br />

<strong>in</strong> regions of <strong>in</strong>terest, at the level of<br />

conservation of <strong>in</strong>dividual am<strong>in</strong>o acid<br />

residues.<br />

Clostridium botul<strong>in</strong>um is an important<br />

human pathogen which is the causative<br />

agent of botulism, giv<strong>in</strong>g rise to fatal<br />

paralysis of the respiratory muscles,<br />

caused by botul<strong>in</strong>um neurotox<strong>in</strong> (BoNT)<br />

which disrupts nerve functions. The<br />

genes encod<strong>in</strong>g BoNT components are<br />

clustered on the bacterial chromosome<br />

(group I + II stra<strong>in</strong>s), on prophages<br />

(group III stra<strong>in</strong>s) or on plasmids (group<br />

IV stra<strong>in</strong>s). Group I stra<strong>in</strong>s encode type<br />

A, B <strong>and</strong> F type tox<strong>in</strong>s, group II stra<strong>in</strong>s<br />

produce type B, E <strong>and</strong> F tox<strong>in</strong>s <strong>and</strong><br />

group III stra<strong>in</strong>s encode for type C <strong>and</strong><br />

D tox<strong>in</strong>s, whereas group IV stra<strong>in</strong>s<br />

produce type G tox<strong>in</strong>. 6 We use the<br />

BLASTatlas method to show the overall<br />

genome homology of the C. botul<strong>in</strong>um<br />

stra<strong>in</strong> F Langel<strong>and</strong>, compared to all<br />

currently available <strong>and</strong> fully sequenced<br />

stra<strong>in</strong>s of the Clostridium genus.<br />

Methods<br />

The BLASTatlas method uses all the<br />

provided annotated cod<strong>in</strong>g sequences<br />

(or prote<strong>in</strong>s) of a reference genome, <strong>and</strong><br />

compares each of those with one or more<br />

genomes. The total genome sequence for<br />

each organism is represented by a database<br />

<strong>and</strong> can conta<strong>in</strong> any number of<br />

DNA or prote<strong>in</strong> sequences. BLAST<br />

searches with a non-str<strong>in</strong>gent E-value<br />

cut-off of 0.01 are used to identify the<br />

best alignments between the reference<br />

sequence prote<strong>in</strong> <strong>and</strong> the database<br />

(genome) <strong>in</strong> question. Once identified,<br />

the s<strong>in</strong>gle best pairwise alignment for<br />

each of the reference sequences is<br />

obta<strong>in</strong>ed <strong>and</strong> <strong>in</strong>cluded <strong>in</strong> the map.<br />
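The selection step described above can be sketched as follows. This is a minimal illustration in Python, not the service's own code: the `Hit` record and the `best_hits` function are hypothetical names, and ranking hits by bit score is an assumption, since the text only states that the single best alignment is kept after applying the E-value cut-off of 0.01.

```python
from typing import Dict, Iterable, NamedTuple

E_VALUE_CUTOFF = 0.01  # the non-stringent cut-off quoted in the text


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record, fields chosen for illustration)."""
    query_id: str    # reference protein (the BLAST "query")
    subject_id: str  # sequence found in the database (the lane)
    e_value: float
    bit_score: float


def best_hits(hits: Iterable[Hit], cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Hit]:
    """Keep the single best hit per reference protein for one database."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.e_value > cutoff:
            continue  # not significant enough to be considered
        current = best.get(hit.query_id)
        # Assumption: "best" means highest bit score; the paper does not
        # state which criterion the service uses to rank hits.
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query_id] = hit
    return best


hits = [
    Hit("geneA", "db1", 1e-50, 200.0),
    Hit("geneA", "db2", 1e-10, 60.0),
    Hit("geneB", "db3", 0.5, 20.0),  # fails the cut-off, so geneB has no hit
]
selected = best_hits(hits)
print(selected["geneA"].subject_id)
```

Run once per database (lane), this leaves at most one alignment per reference protein, which is what is subsequently mapped onto the reference genome.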

The reference genome of a given<br />

comparison has a fixed size, whereas<br />

the sequences to be compared can be<br />

thought of as simply a ‘‘pile of prote<strong>in</strong>s’’,<br />

ranging in size from that of a<br />

small phage, to a s<strong>in</strong>gle genome, or an<br />

entire metagenomic sample or even exist<strong>in</strong>g<br />

large BLAST databases, such as<br />

UniProt. It is important to emphasize<br />

that each prote<strong>in</strong> <strong>in</strong> the reference genome<br />

is compared to all the prote<strong>in</strong>s <strong>in</strong> the<br />

comparison set, regardless of orientation or<br />

location. The BLASTatlas method uses<br />

the software BLASTALL v. 2.2.11 for<br />

the search, <strong>and</strong> <strong>in</strong> BLAST term<strong>in</strong>ology,<br />

the reference genome constitutes the<br />

‘query’ whereas each other genome<br />

(e.g., a lane or circle <strong>in</strong> the atlas) <strong>in</strong> the<br />

comparison corresponds to the ‘database’.<br />

We def<strong>in</strong>e a lane as a visual representation<br />

of mapped database hits<br />

(<strong>in</strong>dividual residue matches) on to the<br />

reference genome. A lane can have a<br />

boxfilter (smooth<strong>in</strong>g) applied with<strong>in</strong><br />

each of the smallest visible units of the<br />

atlas (the resolution of the graphical<br />

representation). A s<strong>in</strong>gle BLASTatlas<br />

may conta<strong>in</strong> several lanes; currently<br />

around 60 circles is the upper limit.<br />
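One plausible reading of the box filter is a plain average of the per-position scores that fall inside each drawable unit. The Python sketch below assumes exactly that; the function name `box_filter` and the choice of equally sized windows are illustrative assumptions and are not taken from the GeneWiz implementation.

```python
from typing import List, Sequence


def box_filter(scores: Sequence[float], n_bins: int) -> List[float]:
    """Average per-position scores into n_bins consecutive windows.

    n_bins plays the role of the plot resolution: the number of smallest
    visible units along the lane (an assumed interpretation).
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    length = len(scores)
    binned = []
    for i in range(n_bins):
        start = i * length // n_bins
        end = (i + 1) * length // n_bins
        window = scores[start:end]
        # An empty window can occur when n_bins exceeds the sequence length.
        binned.append(sum(window) / len(window) if window else 0.0)
    return binned


smoothed = box_filter([1, 1, 0, 0, 0.5, 0.5], n_bins=3)
print(smoothed)
```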

The <strong>in</strong>put requires a file conta<strong>in</strong><strong>in</strong>g the<br />

genome sequence, <strong>in</strong>clud<strong>in</strong>g all annotated<br />

cod<strong>in</strong>g sequences (compris<strong>in</strong>g prote<strong>in</strong>-start,<br />

-stop <strong>and</strong> -direction) for the<br />

reference genome. The four programs<br />

‘BLASTp’, ‘BLASTn’, ‘BLASTx’, <strong>and</strong><br />

‘tBLASTn’ can be used for each lane of<br />

the BLASTatlas, although of course the<br />

appropriate sequences (DNA or prote<strong>in</strong>)<br />

must be provided. For example, when<br />

using ‘BLASTn’ or ‘tBLASTn’ in a lane,<br />

the required DNA sequence can be a set<br />

of open read<strong>in</strong>g frames (ORFs), chromosomal<br />

contigs, entire genome sequences<br />

or even environmental (metagenomic)<br />

samples. In a pairwise fashion, the sequence<br />

of the reference is BLASTed<br />

aga<strong>in</strong>st each database def<strong>in</strong>ed by the<br />

user, employ<strong>in</strong>g the specified BLAST<br />

algorithm.<br />
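The choice of program follows from the types of the reference sequences and of the lane's database, according to the standard BLAST conventions. The small Python sketch below makes the pairing explicit; the function name and the "protein"/"dna" labels are illustrative and are not part of the web service interface.

```python
def choose_blast_program(reference_type: str, database_type: str) -> str:
    """Return the BLAST program matching a (reference, database) type pair."""
    programs = {
        ("protein", "protein"): "blastp",  # protein query, protein database
        ("dna", "dna"): "blastn",          # nucleotide query, nucleotide database
        ("dna", "protein"): "blastx",      # translated query, protein database
        ("protein", "dna"): "tblastn",     # protein query, translated database
    }
    try:
        return programs[(reference_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {reference_type!r} vs {database_type!r}"
        ) from None


print(choose_blast_program("protein", "dna"))
```

This matches the remark above that lanes built with BLASTn or tBLASTn need a DNA database, which may be ORFs, contigs, whole genomes or metagenomic reads.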

Interpretation of BLAST alignments<br />

For each of the sequences def<strong>in</strong>ed <strong>in</strong> the<br />

reference, only the best hit <strong>in</strong> each database<br />

is stored. For these hits, the alignments<br />

are mapped on to the reference<br />

genome. When align<strong>in</strong>g two DNA<br />

sequences, the map shows one of four<br />

possible states for each position: match,<br />

mismatch, gap <strong>in</strong> query (reference genome),<br />

<strong>and</strong> gap <strong>in</strong> database (lane). Only<br />

the match contributes to the overall score<br />

with a value of 1, whereas mismatches<br />

<strong>and</strong> gaps <strong>in</strong> the database get a score<br />

value of zero. When align<strong>in</strong>g two prote<strong>in</strong><br />

sequences, an additional state is <strong>in</strong>troduced<br />

for conservative mismatches, <strong>in</strong>dicat<strong>in</strong>g<br />

that two am<strong>in</strong>o acids have similar<br />

physical–chemical properties; such a<br />

state will receive a score of 0.5. Match<br />

<strong>and</strong> gap states of prote<strong>in</strong> alignments are<br />

defined similarly to those of the DNA<br />

alignments. Gaps in<br />

the reference sequence do not get a corresponding<br />

coordinate and are therefore<br />

ignored (see Fig. 1). In the BLASTatlas<br />

context, a map is an array of match<br />

scores. The array has the same length<br />

as the reference genome, with each position<br />

along the gene hav<strong>in</strong>g a value of 0,<br />

0.5 or 1. It should be noted that intergenic<br />

regions (<strong>and</strong> ncRNAs, <strong>in</strong>clud<strong>in</strong>g<br />

tRNAs <strong>and</strong> rRNAs) have values of 0,<br />

because BLASTatlases only compare<br />

prote<strong>in</strong> encod<strong>in</strong>g genes. We use this as<br />

a control, check<strong>in</strong>g to make sure that the<br />

rRNA operons are visualized as ‘‘gaps’’<br />

throughout all the lanes, for example.<br />
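The scoring rules can be illustrated with a short Python sketch. It assumes the three aligned strings of a standard BLAST pairwise report (reference line, midline, database line), with '-' marking gaps and '+' on the midline marking a conservative substitution; the function name `score_alignment` is hypothetical, and the projection of residue scores onto nucleotide coordinates is left out.

```python
from typing import List


def score_alignment(ref_aln: str, midline: str, db_aln: str) -> List[float]:
    """Return one score per reference residue: 1, 0.5 or 0."""
    if not (len(ref_aln) == len(midline) == len(db_aln)):
        raise ValueError("aligned strings must have equal length")
    scores: List[float] = []
    for ref_char, mid_char, db_char in zip(ref_aln, midline, db_aln):
        if ref_char == "-":
            continue            # gap in reference: no coordinate, ignored
        if db_char == "-":
            scores.append(0.0)  # gap in database: rendered as non-conserved
        elif ref_char == db_char:
            scores.append(1.0)  # identical residue or base
        elif mid_char == "+":
            scores.append(0.5)  # conservative mismatch (protein only)
        else:
            scores.append(0.0)  # plain mismatch
    return scores


example = score_alignment("MKT-LV", "M+  +V", "MR-AIV")
print(example)
```

The returned list has one entry per residue of the reference, so it can be written into the genome-length array at the position of the gene; all positions outside annotated genes simply keep their initial value of 0.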

For each database def<strong>in</strong>ed, there will be<br />

a correspond<strong>in</strong>g BLAST map with<strong>in</strong> the<br />

atlas (see Fig. 2). Each database entry of<br />

the BLAST searches must conta<strong>in</strong> a<br />

legend text for the lane, a colour code<br />

range <strong>and</strong> a scal<strong>in</strong>g method. For the<br />

colours, an upper <strong>and</strong> lower colour is<br />

required, whereas the middle colour<br />

is usually grey; all colours are def<strong>in</strong>ed<br />

<strong>in</strong> RGB <strong>in</strong>tegers rang<strong>in</strong>g from 0 to 10.<br />

The scale can be either fixed, such as<br />

rang<strong>in</strong>g from 0 to 1, or scaled us<strong>in</strong>g any<br />

number of st<strong>and</strong>ard deviations around<br />

the average.<br />
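As an illustration of the colour settings, the following Python sketch maps a lane value to an RGB triple on the 0 to 10 scale described above. The function names, the linear interpolation through the middle colour, and the particular grey value (7, 7, 7) are assumptions made for this example; the actual rendering is done by GeneWiz and may differ.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def scale_bounds(values: Sequence[float], mode: str = "fixed",
                 low: float = 0.0, high: float = 1.0,
                 n_sd: float = 3.0) -> Tuple[float, float]:
    """Return the (lower, upper) limits of the colour scale."""
    if mode == "fixed":
        return low, high
    if mode == "stddev":
        mu, sd = mean(values), pstdev(values)
        return mu - n_sd * sd, mu + n_sd * sd
    raise ValueError(f"unknown scaling mode: {mode!r}")


def value_to_rgb(value: float, bounds: Tuple[float, float],
                 lower_rgb: RGB, upper_rgb: RGB,
                 middle_rgb: RGB = (7, 7, 7)) -> RGB:
    """Interpolate lower -> middle -> upper colour; components run 0 to 10."""
    lo, hi = bounds
    if hi <= lo:
        return middle_rgb  # degenerate scale, e.g. a constant lane
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    if frac < 0.5:
        start, end, t = lower_rgb, middle_rgb, frac / 0.5
    else:
        start, end, t = middle_rgb, upper_rgb, (frac - 0.5) / 0.5
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


bounds = scale_bounds([], mode="fixed")
print(value_to_rgb(1.0, bounds, lower_rgb=(10, 10, 10), upper_rgb=(0, 0, 10)))
```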

DNA properties<br />

The BLASTatlas method allows users to<br />

add structural as well as base composition<br />

<strong>in</strong>formation to the atlas by us<strong>in</strong>g the<br />

‘DNAparameters’ element <strong>in</strong> the request.<br />

These properties can be for example<br />

DNA structural properties, 7 such as<br />

intrinsic curvature, 8 global or local<br />

repeats 9 or other measures of base composition. 10<br />

A list of possible different<br />

properties currently pre-computed can<br />

be obta<strong>in</strong>ed via the onl<strong>in</strong>e documentation<br />

<strong>and</strong> type declarations of the web<br />

services description. The DNA property<br />

lanes are usually added near the center<br />

(or at the lowest part when seen from the<br />

outermost circle) of the atlas.<br />

Custom properties<br />

In addition to the st<strong>and</strong>ard DNA properties<br />

<strong>and</strong> BLAST maps, the web service<br />

provides a method for add<strong>in</strong>g <strong>in</strong>dividual<br />

custom data, for example gene expression<br />

values to the atlas, us<strong>in</strong>g the ‘customMap’<br />

element <strong>in</strong> the request. Data<br />

must be provided <strong>in</strong> the form of comma<br />

separated str<strong>in</strong>gs, with each position <strong>in</strong><br />

the list correspond<strong>in</strong>g to the genomic<br />

position. When def<strong>in</strong><strong>in</strong>g custom data<br />

lanes, the colour ranges, scal<strong>in</strong>g method,<br />

<strong>and</strong> legend text must be provided.<br />
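A custom lane could be prepared along these lines. The Python sketch below spreads per-gene values (for example expression levels) over genomic positions and serialises them as one comma-separated string; the helper names and the 1-based, inclusive gene coordinates are assumptions for illustration, and the exact layout of the 'customMap' element should be taken from the WSDL.

```python
from typing import Iterable, List, Sequence, Tuple


def expand_gene_values(genes: Iterable[Tuple[int, int, float]],
                       genome_length: int,
                       background: float = 0.0) -> List[float]:
    """Assign each gene's value to every position it covers.

    genes holds (start, end, value) with 1-based inclusive coordinates
    (an assumed convention).
    """
    values = [background] * genome_length
    for start, end, value in genes:
        if not 1 <= start <= end <= genome_length:
            raise ValueError(f"gene {start}..{end} lies outside the genome")
        for pos in range(start - 1, end):
            values[pos] = value
    return values


def to_comma_separated(values: Sequence[float]) -> str:
    """Serialise one value per genomic position as a comma-separated string."""
    return ",".join(format(v, "g") for v in values)


lane = expand_gene_values([(2, 4, 1.5)], genome_length=6)
print(to_comma_separated(lane))
```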

Visualization<br />

Details such as the atlas title <strong>and</strong> the<br />

geometry (l<strong>in</strong>ear or circle representation)<br />

are necessary for the f<strong>in</strong>al visualization.<br />

Once the BLAST searches are carried<br />

out <strong>and</strong> remapped to the reference<br />

Fig. 1 Mapp<strong>in</strong>g of prote<strong>in</strong>–prote<strong>in</strong> alignment to DNA. Panel A: mismatches <strong>and</strong> perfect matches are assigned a score of 0 <strong>and</strong> 1, respectively.<br />

Conservative mismatches are assigned a score of 0.5. In the case of DNA alignment, only scores of 0 <strong>and</strong> 1 are possible. Panel B: gaps <strong>in</strong> the<br />

database sequence will be rendered as be<strong>in</strong>g non-conserved areas (filled with zeros). Panel C: gaps <strong>in</strong> the reference sequence will be neglected, s<strong>in</strong>ce<br />

they have no correspond<strong>in</strong>g region <strong>in</strong> the reference genome <strong>in</strong>to which they can be mapped.<br />

Fig. 2 Genes (or segments) from each genome are compared with a reference gene, as shown <strong>in</strong><br />

the left panel; a pairwise comparison is made us<strong>in</strong>g one of the BLAST algorithms. On the right is<br />

shown the ‘‘remapp<strong>in</strong>g’’, or the representation of each of the BLAST runs on the left, mapped<br />

onto the chromosomal sequence. Note that gaps <strong>in</strong> the reference gene (grey) are not <strong>in</strong>cluded <strong>in</strong><br />

the colored maps of the atlas.<br />

genome <strong>and</strong> custom data <strong>and</strong> DNA<br />

properties are collected, an XML configuration<br />

file is composed which conta<strong>in</strong>s<br />

all these data <strong>and</strong> the layout of the atlas.<br />

This file is then sent to the GeneWiz 7<br />

software, which produces a PostScript<br />

document that is then base64 encoded to<br />

allow transport via XML. This part of<br />

the process takes place on the server <strong>and</strong><br />

requires no user-<strong>in</strong>teraction. An example<br />

atlas of a plasmid is shown <strong>in</strong> Fig. 3, <strong>and</strong><br />

will be discussed <strong>in</strong> more detail below.<br />
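On the client side, the returned document has to be decoded before it can be viewed. A minimal Python sketch, assuming the encoded PostScript has already been extracted from the SOAP response as a string (the function name and the sanity check on the '%!' header are additions for illustration):

```python
import base64


def decode_postscript(encoded: str) -> bytes:
    """Decode a base64 payload and check that it looks like PostScript."""
    data = base64.b64decode(encoded)
    # PostScript documents start with the two characters "%!".
    if not data.startswith(b"%!"):
        raise ValueError("decoded payload does not look like PostScript")
    return data


# Round trip with a tiny stand-in document instead of a real atlas.
original = b"%!PS-Adobe-3.0\nshowpage\n"
encoded = base64.b64encode(original).decode("ascii")
decoded = decode_postscript(encoded)
print(len(decoded), "bytes decoded")
```

The decoded bytes can then be written to a `.ps` file in binary mode and opened with any PostScript viewer.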

Web services implementation<br />

A WSDL (web services description language)<br />

file is written which describes the<br />

operations (runAtlas, pollQueue,<br />

fetchAtlasResult) and the input requirements<br />

for them. The file can be downloaded.<br />

All <strong>in</strong>put/output objects are def<strong>in</strong>ed <strong>in</strong> a<br />

separate XSD file (XML schema definition)<br />

with<strong>in</strong> the WSDL file, which comprises<br />

<strong>in</strong>formation <strong>and</strong> type restrictions<br />

applicable <strong>in</strong> the request. This serves as<br />

documentation of the objects as well as a<br />

way to validate a request before it is<br />

submitted. Unfortunately, validation<br />

support in the Perl modules is<br />

not yet optimal, whereas this option is<br />

well implemented <strong>in</strong> <strong>tools</strong> like soapUI<br />

(http://www.soapui.org/). It should be<br />

stressed that users should, until better<br />

validation support can be implemented,<br />

be careful to correctly format the <strong>in</strong>put<br />

parameters before send<strong>in</strong>g the request.<br />

Fig. 3 BLASTatlas of pE88—a small plasmid of Clostridium tetani stra<strong>in</strong> E88, GenBank accession number AF528097. DNA parameters percent AT,<br />

GC skew, global direct repeats, <strong>and</strong> global <strong>in</strong>verted repeats are <strong>in</strong>cluded <strong>in</strong> the <strong>in</strong>ner most lanes. BLAST lanes of all complete genome sequences of the<br />

Clostridium genomes (see Table 1), including plasmids, are included in the outermost lanes. As examples of custom lanes, the free energy (G, blue, kcal<br />

mol−1) and the probability (P, red) measures of stress induced DNA duplex destabilization (SIDD) sites are included in the lanes between the DNA<br />

properties <strong>and</strong> the BLAST lanes. 23 SIDD calculations were obta<strong>in</strong>ed from the SIDDbase WebService (http://www.cbs.dtu.dk/ws/SIDDbase). The<br />

request XML used to construct this plot can be downloaded from the example section of the service homepage, http://www.cbs.dtu.dk/ws/BLASTatlas.<br />

As expected, there is full homology of all cod<strong>in</strong>g regions between the plasmids <strong>and</strong> all replicons of C. tetani E88 (black lane just outside of the<br />

annotations); however there appears to be limited conservation of these pE88 genes throughout the genomes for other Clostridium stra<strong>in</strong>s.<br />

Table 2 A list of all stra<strong>in</strong>s <strong>and</strong> their accession numbers used <strong>in</strong> this comparison. Each row represents the NCBI Entrez sequenc<strong>in</strong>g project. The<br />

numbers of base pairs and protein coding genes are the sums within each project. C. botulinum str. F Langeland is used as<br />

the reference for the comparison<br />

Species | Segments | Size (bp) | Proteins<br />

C. acetobutylicum ATCC 824 (ref. 14) | Entrez Project 77: Chromosome: AE001437, Plasmid pSOL1: AE001438 | 4.132.880 | 3.848<br />

C. beijerinckii NCIMB 8052 (unpublished) | Entrez Project 12637: Chromosome: CP000721 | 6.000.632 | 5.020<br />

C. botulinum A str. ATCC 19397 (unpublished) | Entrez Project 19517: Chromosome: CP000726 | 3.863.450 | 3.552<br />

C. botulinum A str. ATCC 3502 (ref. 6) | Entrez Project 193: Chromosome: AM412317, Plasmid pBOT3502: AM412318 | 3.903.260 | 3.671<br />

C. botulinum A str. Hall (unpublished) | Entrez Project 19521: Chromosome: CP000727 | 3.760.560 | 3.407<br />

C. botulinum F str. (unpublished) | Entrez Project 19519: Chromosome: CP000728, Plasmid pCLI: CP000729 | 4.012.918 | 3.659<br />

C. difficile 630 (ref. 15) | Entrez Project 78: Chromosome: AM180355, Plasmid pCD630: AM180356 | 4.298.133 | 3.787<br />

C. kluyveri DSM 555 (unpublished) | Entrez Project 19065: Chromosome: CP000673, Plasmid pCKL555A: CP000674 | 4.023.800 | 3.913<br />

C. novyi NT (ref. 16) | Entrez Project 16820: Chromosome: CP000382 | 2.547.720 | 2.325<br />

C. perfringens ATCC 13124 (ref. 25) | Entrez Project 304: Chromosome: CP000246 | 3.256.683 | 2.876<br />

C. perfringens SM101 (ref. 17) | Entrez Project 12521: Chromosome: CP000312, Plasmid 1: CP000313, Plasmid 2: CP000314, Viral segment phage phiSM101: CP000315 | 2.960.088 | 2.631<br />

C. perfringens str. 13 (ref. 18) | Entrez Project 79: Chromosome: BA000016, Plasmid pCP13: AP003515 | 3.085.740 | 2.723<br />

C. tetani E88 (ref. 19) | Entrez Project 81: Chromosome: AE015927, Plasmid pE88: AF528097 | 2.873.333 | 2.432<br />

C. thermocellum ATCC 27405 (unpublished) | Entrez Project 314: Chromosome: CP000568 | 3.843.301 | 3.191<br />

Clostridium phage (ref. 20) | Phage c-st: AP008983 | 185.683 | 198<br />

Web services workflow<br />

A workflow was written <strong>in</strong> Perl (v5.8.7),<br />

employing SOAP::Lite (v0.69), which<br />

reads the FASTA files of the database<br />

stra<strong>in</strong>s listed <strong>in</strong> Table 3 <strong>and</strong> produces a<br />

BLASTatlas us<strong>in</strong>g the C. botul<strong>in</strong>um<br />

stra<strong>in</strong> F Langel<strong>and</strong> as reference. The<br />

script uses the onl<strong>in</strong>e web service (see<br />

Fig. 4). The BLASTatlas figure produced<br />

by this workflow is seen <strong>in</strong> Fig. 5.<br />
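The submit, poll and fetch pattern used by the workflow can be sketched independently of any SOAP library. In the Python example below, `poll` and `fetch` are placeholder callables standing in for the pollQueue and result-fetching operations; their signatures, the polling interval and the time-out are assumptions, and only the "FINISHED" status is taken from the description of the service.

```python
import time
from typing import Callable


def wait_for_result(
    job_id: str,
    poll: Callable[[str], str],
    fetch: Callable[[str], bytes],
    interval: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll a queued job until it reports FINISHED, then fetch its result."""
    for _ in range(max_polls):
        if poll(job_id) == "FINISHED":
            return fetch(job_id)
        sleep(interval)  # wait before asking the queue again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Stand-ins for the real operations, so the sketch runs without a network.
# "QUEUED" and "RUNNING" are hypothetical status names.
_statuses = iter(["QUEUED", "RUNNING", "FINISHED"])
result = wait_for_result(
    "0123456789abcdef0123456789abcdef",  # a 32 character hex job id
    poll=lambda job: next(_statuses),
    fetch=lambda job: b"%!PS-Adobe-3.0\n",
    sleep=lambda seconds: None,          # skip real waiting in the demo
)
print(result[:4])
```

Passing the operations in as callables keeps the loop testable and leaves the choice of SOAP client open.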

Results<br />

Fig. 3 represents a BLASTatlas for plasmid<br />

pE88 from Clostridium tetani stra<strong>in</strong><br />

Fig. 4 Workflow description: a Perl script was written for h<strong>and</strong>l<strong>in</strong>g the assembly of the SOAP<br />

envelope and contacting various other web services operations: (A) obtaining genome sequence:<br />

us<strong>in</strong>g the getSeq operation of the GenomeAtlas Web Services (v.3.3), the genome sequence of the<br />

reference genome is obta<strong>in</strong>ed as one cont<strong>in</strong>uous str<strong>in</strong>g. (B) Obta<strong>in</strong><strong>in</strong>g atlas annotations:<br />

annotated CDS, rRNA, <strong>and</strong> tRNA features of the GenBank record of the reference genome<br />

us<strong>in</strong>g the getFeatures operation—these are the features which will be pr<strong>in</strong>ted <strong>in</strong> a separate lane<br />

on the atlas. (C) Obta<strong>in</strong><strong>in</strong>g ORF annotations of the reference genome: aga<strong>in</strong>, us<strong>in</strong>g the getFeatures<br />

operation, all coding sequences and their translations are obtained. (D) Obtain databases: read<br />

FASTA files conta<strong>in</strong><strong>in</strong>g prote<strong>in</strong>s <strong>and</strong> ORFs of the database genomes to be added as lanes. The<br />

output of A–F are assembled <strong>in</strong>to a s<strong>in</strong>gle SOAP request, <strong>in</strong>clud<strong>in</strong>g configurations of the atlas.<br />

(E) Poll<strong>in</strong>g the queue: once the job has been submitted, a 32 character hex str<strong>in</strong>g is returned for<br />

identify<strong>in</strong>g the job, which can be used by operation pollQueue to see the status of the job.<br />

(F + G) Obta<strong>in</strong><strong>in</strong>g result: once a status ‘‘FINISHED’’ is obta<strong>in</strong>ed from pollQueue, the job id<br />

can be submitted to fetchResult and the resulting PostScript image is returned.<br />

E88. The homology for genes <strong>in</strong> the<br />

plasmid to other sequenced genomes is<br />

shown in the circles; additional ‘‘custom<br />

lanes’’ represent chromosomal regions<br />

predicted to open under superhelical<br />

stress. The locations of the<br />

genes encod<strong>in</strong>g colT <strong>and</strong> tetR are labelled<br />

<strong>in</strong> the figure. Notice that these two prote<strong>in</strong>s<br />

conta<strong>in</strong> regions of homology that<br />

are found <strong>in</strong> most of the Clostridium<br />

proteomes searched. S<strong>in</strong>ce the C. tetani<br />

plasmid is <strong>in</strong>cluded <strong>in</strong> the genome sequence<br />

(black circle <strong>in</strong> the figure), all<br />

the genes are found <strong>in</strong> this genome (solid<br />

black), <strong>and</strong> most of the other Clostridium<br />

proteomes conta<strong>in</strong> some weak homology<br />

but in general lack most of the plasmid-encoded<br />

genes. Thus, this is a quick overview<br />

of gene conservation of a plasmid<br />

compared to many sequenced genomes of<br />

the same genus.<br />

To demonstrate this for an entire bacterial<br />

genome (which is millions of bp <strong>in</strong><br />

size, compared to a small, ~75 000 bp<br />

plasmid, shown <strong>in</strong> Fig. 3), we have used<br />

the genome sequence of C. botul<strong>in</strong>um<br />

stra<strong>in</strong> F Langel<strong>and</strong>, the largest of the<br />

C. botul<strong>in</strong>um genomes, to build a prote<strong>in</strong><br />

BLASTatlas of all publicly available<br />

fully sequenced Clostridia genomes, <strong>in</strong>clud<strong>in</strong>g<br />

all chromosomes, plasmids <strong>and</strong><br />

phages (see Fig. 5). Each lane of the atlas<br />

corresponds to a sequenc<strong>in</strong>g project that<br />

conta<strong>in</strong>s the ma<strong>in</strong> chromosome plus any<br />

This journal is c The Royal Society of Chemistry 2008 Mol. BioSyst., 2008, 4, 363–371 | 367


Fig. 5 BLASTatlas of Clostridium botul<strong>in</strong>um F stra<strong>in</strong> Langel<strong>and</strong>: Lanes show genome homology of (start<strong>in</strong>g from the outermost lane):<br />

C. acetobutylicum ATCC 824, C. beijer<strong>in</strong>ckii NCIMB 8052, C. botul<strong>in</strong>um A str. ATCC 19397, C. botul<strong>in</strong>um A ATCC 3502, C. botul<strong>in</strong>um A str. Hall,<br />

C. difficile 630, C. kluyveri DSM 555, C. novyi NT, C. perfr<strong>in</strong>gens ATCC 13124, C. perfr<strong>in</strong>gens SM101, C. perfr<strong>in</strong>gens str. 13, C. tetani E88,<br />

C. thermocellum ATCC 27405, <strong>and</strong> Clostridium phage c-st genome. Inside of the annotation circle are shown global direct repeats, global <strong>in</strong>verted<br />

repeats, stack<strong>in</strong>g energy, <strong>and</strong> percent AT. Blue <strong>and</strong> red annotations are cod<strong>in</strong>g sequences on plus <strong>and</strong> m<strong>in</strong>us str<strong>and</strong>, whereas green <strong>and</strong> turquoise<br />

are rRNA and tRNA genes, respectively. The two toxin components NTNH and BoNT/A1 that are identified on phage c-st are present in the<br />

reference genome at positions 880 kb <strong>and</strong> 883 kb, respectively (marked ‘cst’). The presence of the two is visible as a th<strong>in</strong> blue b<strong>and</strong> on the c-st blast<br />

lane. The lower part of the figure shows a zoom of the region around 2635 kb, provid<strong>in</strong>g an example of a gene cluster which appears to be<br />

conserved throughout the C. botul<strong>in</strong>um stra<strong>in</strong>s <strong>and</strong> partly with<strong>in</strong> the C. difficile 630.<br />

phages or plasmids present <strong>in</strong> the genome.<br />

The prote<strong>in</strong>s encoded by the 185 kb<br />

neurotox<strong>in</strong>-convert<strong>in</strong>g bacteriophage<br />

c-st are labelled, as well as a region which<br />

is zoomed <strong>in</strong> the second panel <strong>in</strong> Fig. 5.<br />

The accession numbers, total size <strong>and</strong><br />

total number of genes with<strong>in</strong> each lane<br />

can be seen <strong>in</strong> Table 2.<br />

There are several items of <strong>in</strong>terest which<br />

can be seen <strong>in</strong> Fig. 5. First, the rRNA<br />

operons can be quite readily seen, near the<br />

top part of the chromosome map, labeled<br />

turquoise; these rRNA operons are more<br />

GC rich (hence less red <strong>in</strong> the <strong>in</strong>ner-most<br />

lane), have direct <strong>and</strong> <strong>in</strong>verted repeats (the<br />

next two lanes), <strong>and</strong> are not shown <strong>in</strong> the<br />

proteome comparison lanes (s<strong>in</strong>ce these<br />

genes do not encode prote<strong>in</strong>s).<br />

As expected, the circle represent<strong>in</strong>g<br />

the c-st phage shows little match for most<br />

of the C. botul<strong>in</strong>um genome, at the<br />

prote<strong>in</strong> level. In general, the two other<br />

C. botul<strong>in</strong>um genomes (both <strong>in</strong> blue) have<br />

the highest similarity to the reference<br />

C. botul<strong>in</strong>um genome (also shown as a<br />

circle). The reference lane serves as an internal<br />

control: all of the proteins should show a<br />

match in this lane, since the reference<br />

genome is compared against itself with BLAST. Another<br />

interesting observation is the upper-left-hand<br />

part of the genome, which seems to<br />

have more homology to other Clostridium<br />

genomes, <strong>in</strong> particular show<strong>in</strong>g<br />

many matches to the C. perfr<strong>in</strong>gens<br />

genomes (green circles), compared to the<br />

rest of the genome.<br />
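The internal control described above can be checked programmatically. The following is a minimal sketch, assuming the self-comparison lane is available as a mapping from reference protein identifier to a score; the function name, identifiers and score values are hypothetical and are not taken from the BLASTatlas service.

```python
def missing_self_hits(reference_ids, self_lane_hits):
    """Return reference proteins that have no hit in the self-comparison lane.

    reference_ids  -- iterable of protein identifiers in the reference genome
    self_lane_hits -- dict mapping protein identifier -> score (hypothetical layout)
    """
    return sorted(set(reference_ids) - set(self_lane_hits))


# Hypothetical identifiers: prot_003 lacks a self-hit, which would flag a problem.
reference_ids = ["prot_001", "prot_002", "prot_003"]
self_lane_hits = {"prot_001": 100.0, "prot_002": 100.0}
print(missing_self_hits(reference_ids, self_lane_hits))
```

An empty result is what the control predicts; any protein listed would suggest a problem in the search or mapping step rather than a biological difference.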

Application <strong>in</strong> metagenomics<br />

The genus Prochlorococcus belongs<br />

to the cyanobacteria <strong>and</strong> is one of the<br />

most abundant photosynthetic organisms<br />

of the ocean. It plays an important<br />

role <strong>in</strong> the planet’s carbon cycle <strong>and</strong> has<br />

adapted to the various light <strong>and</strong> oxygen<br />

conditions present at the various<br />

depths. 11 As of the end of January<br />

2008, eleven Prochlorococcus mar<strong>in</strong>us<br />

genomes are publicly available <strong>and</strong> we<br />

have <strong>in</strong>cluded all encoded prote<strong>in</strong>s of<br />

these data with the seven metagenomic<br />

read collections from the ALOHA<br />

station near Hawaii, 12 as shown <strong>in</strong><br />

Table 3. P. marinus strain<br />

MIT 9303 has the largest genome of all<br />



Table 3 A list of all stra<strong>in</strong>s/sample names <strong>and</strong> their accession numbers used <strong>in</strong> the metagenomic comparison. The list is sorted by sampl<strong>in</strong>g depth<br />

Source Size (bp) Origin Accession/sample Ref. Depth<br />

P. mar<strong>in</strong>us str. MIT 9515 1 704 176 (1906 prote<strong>in</strong>s) Tropical Pacific CP000552 Unpublished Surface<br />

P. mar<strong>in</strong>us str. MIT 9215 1 738 790 (1983 prote<strong>in</strong>s) Equatorial Pacific CP000825 Unpublished Surface<br />

P. mar<strong>in</strong>us str. MED4 1 657 990 (1936 prote<strong>in</strong>s) Mediterranean Sea BX548174 21 4 m<br />

JGI_SMPL_HF10_10-07-02 7 482 668 (7842 contigs) North Pacific Subtropical Gyre — 12 10 m<br />

P. mar<strong>in</strong>us str. NATL1A 1 864 731 (2193 prote<strong>in</strong>s) North Atlantic CP000553 Unpublished 30 m<br />

P. mar<strong>in</strong>us str. NATL2A 1 842 899 (2163 prote<strong>in</strong>s) North Atlantic CP000095 Unpublished 30 m<br />

P. mar<strong>in</strong>us str. AS9601 1 669 886 (1921 prote<strong>in</strong>s) Arabian Sea CP000551 Unpublished 50 m<br />

JGI_SMPL_HF70_10-07-02 10 828 386 (10 999 contigs) North Pacific Subtropical Gyre — 12 70 m<br />

P. mar<strong>in</strong>us str. MIT 9211 1 688 963 (1855 prote<strong>in</strong>s) Equatorial Pacific CP000878 21 83 m<br />

P. mar<strong>in</strong>us str. MIT 9301 1 641 879 (1907 prote<strong>in</strong>s) Sargasso Sea CP000576 Unpublished 90 m<br />

P. mar<strong>in</strong>us str. MIT 9303 2 682 675 (2997 prote<strong>in</strong>s) Sargasso Sea CP000554 Unpublished 100 m<br />

P. mar<strong>in</strong>us str. SS120 1 751 080 (1882 prote<strong>in</strong>s) Sargasso Sea AE017126 22 120 m<br />

JGI_SMPL_HF130_10-06-02 6 091 784 (6812 contigs) North Pacific Subtropical Gyre — 12 130 m<br />

P. mar<strong>in</strong>us str. MIT 9312 1 709 204 (1962 prote<strong>in</strong>s) Equatorial Pacific CP000111 Unpublished 135 m<br />

P. marinus str. MIT 9313 2 410 873 (2273 proteins) Gulf Stream BX548175 21 135 m<br />

JGI_SMPL_HF200_10-06-02 7 829 659 (8286 contigs) North Pacific Subtropical Gyre — 12 200 m<br />

JGI_SMPL_HF500_10-06-02 8 764 642 (9027 contigs) North Pacific Subtropical Gyre — 12 500 m<br />

JGI_SMPL_HF770_12-21-03 11 811 597 (11 479 contigs) North Pacific Subtropical Gyre — 12 770 m<br />

JGI_SMPL_HF4000_12-21-03 11 028 821 (11 229 contigs) North Pacific Subtropical Gyre — 12 4000 m<br />

currently available sequences (2.7 Mb)<br />

<strong>and</strong> was therefore used as reference <strong>in</strong><br />

this comparison. BLAST hits between<br />

the reference <strong>and</strong> the encoded prote<strong>in</strong>s<br />

of all the P. mar<strong>in</strong>us genomes <strong>in</strong>cluded<br />

were generated with the BLASTp<br />

algorithm, whereas hits between the<br />

reference prote<strong>in</strong>s <strong>and</strong> the DNA reads<br />

of the metagenomic samples were generated<br />

using the tBLASTn algorithm.<br />

tBLASTn was used to avoid the<br />

gene prediction step of the metagenomic<br />

samples <strong>and</strong> to allow a rough estimate<br />

of the cod<strong>in</strong>g potential of these samples.<br />
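The choice between the two algorithms follows directly from the type of the subject sequences, and can be sketched as below. This is an illustration only: it assumes the NCBI BLAST+ command-line executables (`blastp`, `tblastn`), whereas the text does not state which BLAST distribution, E-value cutoff or output format was used. The file and database names are hypothetical.

```python
def choose_blast_program(subject_type):
    """Pick the BLAST program for a protein query, given the subject type."""
    if subject_type == "protein":
        return "blastp"    # protein query against protein database
    if subject_type == "nucleotide":
        return "tblastn"   # protein query against translated nucleotide database
    raise ValueError(f"unknown subject type: {subject_type!r}")


def build_command(query_fasta, database, subject_type, evalue=1e-5):
    """Build an argument list for a BLAST+ run (cutoff is an assumed default)."""
    return [
        choose_blast_program(subject_type),
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",    # tabular output, one hit per line
    ]


# Hypothetical file and database names for the two cases described in the text.
print(build_command("MIT9303_proteins.faa", "MED4_proteins", "protein"))
print(build_command("MIT9303_proteins.faa", "HF10_reads", "nucleotide"))
```

The returned list can be passed to `subprocess.run` once the databases exist; building the command separately from running it keeps the selection logic easy to test.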

All lanes are sorted accord<strong>in</strong>g to<br />

the water depth at which the samples<br />

were collected (see Fig. 6). The Perl<br />

code for construct<strong>in</strong>g this plot us<strong>in</strong>g<br />

web services is provided on the service<br />

homepage.<br />
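The depth ordering of the lanes can be reproduced from the depth column of Table 3. The sketch below is not the Perl code provided on the service homepage; it assumes that "Surface" is treated as 0 m and that every other depth is given as a number followed by "m". The function names are hypothetical.

```python
def depth_in_metres(depth_label):
    """Convert a Table 3 depth label such as 'Surface' or '70 m' to a number."""
    label = depth_label.strip().lower()
    if label == "surface":
        return 0.0          # assumption: surface samples are treated as 0 m
    return float(label.split()[0])


def order_lanes(samples):
    """Return sample names sorted from shallowest to deepest."""
    ordered = sorted(samples, key=lambda sample: depth_in_metres(sample[1]))
    return [name for name, _depth in ordered]


# A subset of Table 3, deliberately given out of order.
samples = [
    ("JGI_SMPL_HF4000_12-21-03", "4000 m"),
    ("P. marinus str. MED4", "4 m"),
    ("P. marinus str. MIT 9515", "Surface"),
    ("JGI_SMPL_HF10_10-07-02", "10 m"),
]
for name in order_lanes(samples):
    print(name)
```

Because Python's sort is stable, samples collected at the same depth keep their input order.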

Discussion<br />

The BLASTatlas method can assist biologists<br />

<strong>in</strong> f<strong>in</strong>d<strong>in</strong>g regions along the chromosome<br />

which are conserved (or not).<br />

This <strong>in</strong>formation is useful for several<br />

Fig. 6 BLASTatlas show<strong>in</strong>g fully sequenced Prochlorococcus genomes (green) <strong>and</strong> the seven ALOHA metagenomic samples (blue). Outermost<br />

lanes represent samples closer to the ocean surface.<br />



different applications, such as identify<strong>in</strong>g<br />

phage <strong>in</strong>sertion sites <strong>and</strong> loss of important<br />

genetic material. This method is<br />

even able to scale down to each <strong>in</strong>dividual<br />

nucleotide or am<strong>in</strong>o acid residue.<br />

However, it is unable to deal with sequences<br />

(or parts thereof) that are not<br />

found <strong>in</strong> the reference genome. A good<br />

compromise when deal<strong>in</strong>g with this issue<br />

is often to use the largest chromosome of<br />

a species as reference; <strong>in</strong> addition, it can<br />

be useful to rebuild the maps us<strong>in</strong>g different<br />

reference genomes. Besides this<br />

limitation, the fact that all coord<strong>in</strong>ates<br />

are mapped back to the reference causes<br />

the coord<strong>in</strong>ates of the database genomes<br />

to ‘‘get lost’’ <strong>in</strong> that only the best match<br />

is displayed, regardless of the chromosomal<br />

location <strong>in</strong> the database genomes.<br />

Other aspects of genome homology, such as<br />

gene synteny, cannot effectively be<br />

addressed by this tool. However, it is<br />

possible to use an additional circle to<br />

plot gene order conservation along the<br />

chromosome.<br />
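The loss of database coordinates described above can be illustrated with a small sketch. It assumes each hit is a tuple of reference protein, database protein, start position in the database genome, and a score. The use of the bit score as the ranking criterion is an assumption, since the text only says that the best match is displayed. All names and values are hypothetical.

```python
def best_hits(hits):
    """Keep only the best score per reference protein.

    hits -- iterable of (ref_protein, db_protein, db_start, bit_score) tuples.
    The database protein and its position are discarded, which mirrors how
    the database coordinates "get lost" when mapping back to the reference.
    """
    best = {}
    for ref_protein, _db_protein, _db_start, bit_score in hits:
        if ref_protein not in best or bit_score > best[ref_protein]:
            best[ref_protein] = bit_score
    return best


# Hypothetical hits: refA matches two distant loci in the database genome.
hits = [
    ("refA", "db1", 10_000, 250.0),
    ("refA", "db7", 900_000, 310.0),
    ("refB", "db2", 50_000, 120.0),
]
print(best_hits(hits))
```

After this reduction, nothing records that the best match to refA lies at 900 kb in the database genome. This is why gene order in the database genomes cannot be read from the atlas.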

Currently, we see the BLASTatlas as<br />

an intermediate stage in the analysis of many<br />

genomes of similar species. Soon there<br />

will be a need to compare hundreds or<br />

thousands of genome sequences, which makes<br />

the development of new methods for<br />

comparing such large numbers of genomes<br />

ever more important.<br />

Acknowledgements<br />

The authors would like to thank Hans<br />

Henrik Stærfeldt for assistance with server-side<br />

programs and Kristoffer Rapacki<br />

for assistance on web services<br />

data types. The work was supported by<br />

a grant from the European Union<br />

through the EMBRACE network of Excellence,<br />

contract number LSHG-CT-<br />

2004-512092 <strong>and</strong> a grant from the Danish<br />

Center for Scientific Comput<strong>in</strong>g<br />

(DCSC).<br />

References<br />

1 R. D. Fleischmann, M. D. Adams, O.<br />

White, R. A. Clayton, E. F. Kirkness, A.<br />

R. Kerlavage, C. J. Bult, J. F. Tomb, B. A.<br />

Dougherty, J. M. Merrick, J. McKenney,<br />

G. Sutton, W. FitzHugh, C. Fields, J. D.<br />

Gocyne, J. Scott, R. Shirley, L. I. Liu, A.<br />

Glodek, J. M. Kelley, J. F. Weidman, C.<br />

A. Phillips, T. Spriggs, E. Hedblom, M. D.<br />

Cotton, T. R. Utterback, M. C. Hanna, D.<br />

T. Nguyen, D. M. Saudek, R. C. Br<strong>and</strong>on,<br />

L. D. F<strong>in</strong>e, J. L. Fritchman, J. L. Fuhrmann,<br />

N. S. M. Geoghagen, C. L. Gnehm,<br />

L. A. McDonald, K. V. Small, C. M.<br />

Fraser, H. O. Smith <strong>and</strong> J. C. Venter,<br />

Whole-Genome R<strong>and</strong>om Sequenc<strong>in</strong>g <strong>and</strong><br />

Assembly of Haemophilus Influenzae Rd.,<br />

Science, 1995, 269(5223), 496–512.<br />

2 L. J. Jensen, M. Skovgaard, T. Sicheritz-<br />

Ponten, M. K. Jorgensen, C. Lundegaard,<br />

C. C. Pedersen, N. Petersen <strong>and</strong> D. Ussery,<br />

Analysis of two large functionally uncharacterized<br />

regions <strong>in</strong> the Methanopyrus<br />

k<strong>and</strong>leri AV19 genome, BMC Genomics,<br />

2003, 4, 12.<br />

3 L. J. Jensen, M. Skovgaard, T. Sicheritz-<br />

Ponten, N. T. Hansen, H. Johansson,<br />

M. K. Jørgensen, K. Kiil, P. F. Hall<strong>in</strong><br />

<strong>and</strong> D. Ussery, <strong>Comparative</strong> genomics of<br />

four Pseudomonas species, <strong>in</strong> The Pseudomonads<br />

Vol. I. Genomics, Life Style<br />

<strong>and</strong> Molecular Architecture, ed. J. L.<br />

Ramos, Kluwer Academic/Plenum<br />

Publishers, New York, 2004, ch. 5,<br />

pp. 139–164.<br />

4 P. F. Hall<strong>in</strong>, T. T. B<strong>in</strong>newies <strong>and</strong> D. W.<br />

Ussery, Genome update: chromosome atlases,<br />

Microbiology (Read<strong>in</strong>g, U. K.),<br />

2004, 150, 3091–3093.<br />

5 T. J. Carver, K. M. Rutherford, M. Berriman,<br />

M. A. Raj<strong>and</strong>ream, B. G. Barrell <strong>and</strong><br />

J. Parkhill, ACT: the Artemis Comparison<br />

Tool, Bio<strong>in</strong>formatics, 2005, 21, 3422–3423.<br />

6 M. Sebaihia, M. W. Peck, N. P. M<strong>in</strong>ton,<br />

N. R. Thomson, M. T. Holden, W. J.<br />

Mitchell, A. T. Carter, S. D. Bentley, D.<br />

R. Mason, L. Crossman, C. J. Paul, A.<br />

Ivens, M. H. Wells-Bennik, I. J. Davis, A.<br />

M. Cerdeno-Tarraga, C. Churcher, M. A.<br />

Quail, T. Chill<strong>in</strong>gworth, T. Feltwell, A.<br />

Fraser, I. Goodhead, Z. Hance, K. Jagels,<br />

N. Larke, M. Maddison, S. Moule, K.<br />

Mungall, H. Norbertczak, E. Rabb<strong>in</strong>owitsch,<br />

M. S<strong>and</strong>ers, M. Simmonds, B.<br />

White, S. Whithead <strong>and</strong> J. Parkhill, Genome<br />

sequence of a proteolytic (Group I)<br />

Clostridium botul<strong>in</strong>um stra<strong>in</strong> Hall A <strong>and</strong><br />

comparative analysis of the clostridial genomes,<br />

Genome Res., 2007, 17, 1082–1092.<br />

7 A. G. Pedersen, L. J. Jensen, S. Brunak, H.<br />

H. Staerfeldt <strong>and</strong> D. W. Ussery, A DNA<br />

structural atlas for Escherichia coli, J. Mol.<br />

Biol., 2000, 299, 907–930.<br />

8 E. S. Shpigelman, E. N. Trifonov and<br />

A. Bolshoy, Curvature: software for the<br />

analysis of curved DNA, CABIOS, Comput.<br />

Appl. Biosci., 1993, 9, 435–440.<br />

9 M. Skovgaard, L. J. Jensen, C. Friis, H. H.<br />

Stærfeldt, P. Worn<strong>in</strong>g, S. Brunak <strong>and</strong> D.<br />

Ussery, The Atlas Visualisation of Genome-wide<br />

Information, Methods Microbiol.,<br />

2002, 33, 49–63.<br />

10 L. J. Jensen, C. Friis <strong>and</strong> D. W. Ussery,<br />

Three Views of Microbial Genomes, Res.<br />

Microbiol., 1999, 150, 773–777.<br />

11 M. B. Sullivan, M. L. Coleman, P. Weigele,<br />

F. Rohwer <strong>and</strong> S. W. Chisholm,<br />

Three Prochlorococcus cyanophage Genomes:<br />

Signature Features <strong>and</strong> Ecological<br />

Interpretations, PLoS Biol., 2005, 3, e144.<br />

12 E. F. DeLong, C. M. Preston, T. M<strong>in</strong>cer,<br />

V. Rich, S. J. Hallam, N.-U. Frigaard, A.<br />

Mart<strong>in</strong>ez, M. B. Sullivan, R. Edwards, B.<br />

R. Brito, S. W. Chisholm <strong>and</strong> D. M. Karl,<br />

Community Genomics Among Stratified<br />

Microbial Assemblages <strong>in</strong> the Ocean’s Interior,<br />

Science, 2006, 311(5760), 496–503.<br />

13 D. L. Wheeler, T. Barrett, D. A. Benson,<br />

S. H. Bryant, K. Canese, V. Chetvern<strong>in</strong>,<br />

D. M. Church, M. DiCuccio, R. Edgar, S.<br />

Federhen, L. Y. Geer, Y. Kapust<strong>in</strong>, O.<br />

Khovayko, D. L<strong>and</strong>sman, D. J. Lipman,<br />

T. L. Madden, D. R. Maglott, J. Ostell, V.<br />

Miller, K. D. Pruitt, G. D. Schuler, E.<br />

Sequeira, S. T. Sherry, K. Sirotk<strong>in</strong>, A.<br />

Souvorov, G. Starchenko, R. L. Tatusov,<br />

T. A. Tatusova, L. Wagner <strong>and</strong> E.<br />

Yaschenko, Database Resources of the<br />

National Center for Biotechnology Information,<br />

Nucleic Acids Res., 2007, 35,<br />

D5–D12.<br />

14 J. Noll<strong>in</strong>g, G. Breton, M. V. Omelchenko,<br />

K. S. Makarova, Q. Zeng, R. Gibson, H.<br />

M. Lee, J. Dubois, D. Qiu, J. Hitti, Y. I.<br />

Wolf, R. L. Tatusov, F. Sabathe, L. Doucette-Stamm,<br />

P. Soucaille, M. J. Daly, G.<br />

N. Bennett, E. V. Koon<strong>in</strong> <strong>and</strong> D. R.<br />

Smith, Genome Sequence <strong>and</strong> <strong>Comparative</strong><br />

Analysis of the Solvent-produc<strong>in</strong>g<br />

Bacterium Clostridium acetobutylicum, J.<br />

Bacteriol., 2001, 183, 4823–4838.<br />

15 M. Sebaihia, B. W. Wren, P. Mullany, N.<br />

F. Fairweather, N. M<strong>in</strong>ton, R. Stabler, N.<br />

R. Thomson, A. P. Roberts, A. M. Cerdeno-Tarraga,<br />

H. Wang, M. T. Holden, A.<br />

Wright, C. Churcher, M. A. Quail, S.<br />

Baker, N. Bason, K. Brooks, T. Chill<strong>in</strong>gworth,<br />

A. Cron<strong>in</strong>, P. Davis, L. Dowd, A.<br />

Fraser, T. Feltwell, Z. Hance, S. Holroyd,<br />

K. Jagels, S. Moule, K. Mungall, C. Price,<br />

E. Rabb<strong>in</strong>owitsch, S. Sharp, M. Simmonds,<br />

K. Stevens, L. Unw<strong>in</strong>, S. Whithead,<br />

B. Dupuy, G. Dougan, B. Barrell<br />

<strong>and</strong> J. Parkhill, The Multidrug-resistant<br />

Human Pathogen Clostridium difficile has<br />

a Highly Mobile, Mosaic Genome, Nat.<br />

Genet., 2006, 38, 779–786.<br />

16 C. Bettegowda, X. Huang, J. L<strong>in</strong>, I.<br />

Cheong, M. Kohli, S. A. Szabo, X. Zhang,<br />

L. A. Diaz, Jr, V. E. Velculescu, G. Parmigiani,<br />

K. W. K<strong>in</strong>zler, B. Vogelste<strong>in</strong> <strong>and</strong><br />

S. Zhou, The Genome <strong>and</strong> Transcriptomes<br />

of the Anti-tumor Agent Clostridium novyi-NT,<br />

Nat. Biotechnol., 2006, 24,<br />

1573–1580.<br />

17 G. S. Myers, D. A. Rasko, J. K. Cheung, J.<br />

Ravel, R. Seshadri, R. T. DeBoy, Q. Ren,<br />

J. Varga, M. M. Awad, L. M. Br<strong>in</strong>kac, S.<br />

C. Daugherty, D. H. Haft, R. J. Dodson,<br />

R. Madupu, W. C. Nelson, N. J. Rosovitz,<br />

S. A. Sullivan, H. Khouri, G. I. Dimitrov,<br />

K. L. Watk<strong>in</strong>s, S. Mulligan, J. Benton, D.<br />

Radune, D. J. Fisher, H. S. Atk<strong>in</strong>s, T.<br />

Hiscox, B. H. Jost, S. J. Bill<strong>in</strong>gton, J. G.<br />

Songer, B. A. McClane, R. W. Titball, J. I.<br />

Rood, S. B. Melville <strong>and</strong> I. T. Paulsen,<br />

Skewed Genomic Variability <strong>in</strong> Stra<strong>in</strong>s of<br />

the Toxigenic Bacterial Pathogen,<br />

Clostridium perfr<strong>in</strong>gens, Genome Res.,<br />

2006, 16, 1031–1040.<br />

18 T. Shimizu, K. Ohtani, H. Hirakawa, K.<br />

Ohshima, A. Yamashita, T. Shiba, N.<br />

Ogasawara, M. Hattori, S. Kuhara <strong>and</strong><br />

H. Hayashi, Complete Genome Sequence<br />

of Clostridium perfr<strong>in</strong>gens, an Anaerobic<br />

Flesh-eater, Proc. Natl. Acad. Sci.<br />

U. S. A., 2002, 99, 996–1001.<br />

19 H. Bruggemann, S. Baumer, W. F. Fricke,<br />

A. Wiezer, H. Liesegang, I. Decker,<br />



C. Herzberg, R. Mart<strong>in</strong>ez-Arias, R. Merkl,<br />

A. Henne <strong>and</strong> G. Gottschalk, The Genome<br />

Sequence of Clostridium tetani, the<br />

Causative Agent of Tetanus Disease, Proc.<br />

Natl. Acad. Sci. U. S. A., 2003, 100,<br />

1316–1321.<br />

20 Y. Sakaguchi, T. Hayashi, K. Kurokawa,<br />

K. Nakayama, K. Oshima, Y. Fuj<strong>in</strong>aga, M.<br />

Ohnishi, E. Ohtsubo, M. Hattori <strong>and</strong> K.<br />

Oguma, The Genome Sequence of<br />

Clostridium botul<strong>in</strong>um Type C Neurotox<strong>in</strong><br />

Convert<strong>in</strong>g Phage <strong>and</strong> the Molecular Mechanisms<br />

of Unstable Lysogeny, Proc. Natl.<br />

Acad. Sci. U. S. A., 2005, 102, 17472–17477.<br />

21 G. Rocap, F. W. Larimer, J. Lamerd<strong>in</strong>, S.<br />

Malfatti, P. Cha<strong>in</strong>, N. A. Ahlgren, A.<br />

Arellano, M. Coleman, L. Hauser, W. R.<br />

Hess, Z. I. Johnson, M. L<strong>and</strong>, D. L<strong>in</strong>dell,<br />

A. F. Post, W. Regala, M. Shah, S. L.<br />

Shaw, C. Steglich, M. B. Sullivan, C. S.<br />

T<strong>in</strong>g, A. Tolonen, E. A. Webb, E. R.<br />

Z<strong>in</strong>ser <strong>and</strong> S. W. Chisholm, Genome Divergence<br />

<strong>in</strong> Two Prochlorococcus ecotypes<br />

Reflects Oceanic Niche Differentiation,<br />

Nature, 2003, 424, 1042–1047.<br />

22 A. Dufresne, M. Salanoubat, F. Partensky,<br />

F. Artiguenave, I. M. Axmann, V.<br />

Barbe, S. Duprat, M. Y. Galper<strong>in</strong>, E. V.<br />

Koon<strong>in</strong>, F. Le Gall, K. S. Makarova, M.<br />

Ostrowski, S. Oztas, C. Robert, I. B. Rogoz<strong>in</strong>,<br />

D. J. Scanlan, N. T<strong>and</strong>eau de Marsac,<br />

J. Weissenbach, P. W<strong>in</strong>cker, Y. I.<br />

Wolf <strong>and</strong> W. R. Hess, Genome Sequence<br />

of the Cyanobacterium Prochlorococcus<br />

mar<strong>in</strong>us SS120, a Nearly M<strong>in</strong>imal Oxyphototrophic<br />

Genome, Proc. Natl. Acad.<br />

Sci. U. S. A., 2003, 100, 9647–9649.<br />

23 C. J. Benham <strong>and</strong> C. Bi, The Analysis of<br />

Stress-<strong>in</strong>duced Duplex Destabilization <strong>in</strong><br />

Long Genomic DNA Sequences, J.<br />

Comput. Biol., 2004, 11, 519–543.<br />

24 K. Liolios, N. Tavernarakis, P.<br />

Hugenholtz <strong>and</strong> N. C. Kyrpides, The<br />

Genomes On L<strong>in</strong>e Database (GOLD)<br />

v.2: a monitor of genome projects worldwide,<br />

Nucleic Acids Res., 2006, 34,<br />

D332–D334.<br />

25 J. I. Rood <strong>and</strong> S. T. Cole, Molecular<br />

genetics <strong>and</strong> pathogenesis of Clostridium<br />

perfr<strong>in</strong>gens, Microbiol. Rev., 1991, 55,<br />

621–648.<br />



1<br />

<strong>Comparative</strong> Genomics<br />

2.7 Paper II: Ten years of bacterial genome sequenc<strong>in</strong>g:<br />

comparative–genomics–based discoveries


Funct Integr Genomics (2006) 6: 165–185<br />

DOI 10.1007/s10142-006-0027-2<br />

REVIEW<br />

Tim T. B<strong>in</strong>newies . Yair Motro . Peter F. Hall<strong>in</strong> .<br />

Ole Lund . David Dunn . Tom La . David J. Hampson .<br />

Matthew Bellgard . Trudy M. Wassenaar .<br />

David W. Ussery<br />

Ten years of bacterial genome sequenc<strong>in</strong>g:<br />

comparative-genomics-based discoveries<br />

Received: 20 January 2006 / Revised: 24 February 2006 / Accepted: 7 March 2006 / Published onl<strong>in</strong>e: 12 May 2006<br />

# Spr<strong>in</strong>ger-Verlag 2006<br />

Abstract It has been more than 10 years s<strong>in</strong>ce the first<br />

bacterial genome sequence was published. Hundreds of<br />

bacterial genome sequences are now available for comparative<br />

genomics, <strong>and</strong> search<strong>in</strong>g a given prote<strong>in</strong> aga<strong>in</strong>st<br />

more than a thousand genomes will soon be possible. This<br />

review addresses a relatively straightforward<br />

question: “What have we learned from this vast<br />

amount of new genomic data?” Perhaps one of the most<br />

important lessons has been that genetic diversity, at the<br />

level of large-scale variation amongst even genomes of the<br />

same species, is far greater than was thought. The classical<br />

textbook view of evolution rely<strong>in</strong>g on the relatively slow<br />

accumulation of mutational events at the level of <strong>in</strong>dividual<br />

bases scattered throughout the genome has changed. One<br />

of the most obvious conclusions from exam<strong>in</strong><strong>in</strong>g the<br />

sequences from several hundred bacterial genomes is the<br />

enormous amount of diversity—even <strong>in</strong> different genomes<br />

from the same bacterial species. This diversity is generated<br />

by a variety of mechanisms, <strong>in</strong>clud<strong>in</strong>g mobile genetic<br />

elements <strong>and</strong> bacteriophages. An exam<strong>in</strong>ation of the 20<br />

Escherichia coli genomes sequenced so far dramatically<br />

illustrates this, with the genome size rang<strong>in</strong>g from 4.6 to<br />

5.5 Mbp; much of the variation appears to be of phage<br />

orig<strong>in</strong>. This review also addresses mobile genetic elements,<br />

T. T. B<strong>in</strong>newies . P. F. Hall<strong>in</strong> . O. Lund . D. W. Ussery (*)<br />

Center for Biological Sequence Analysis,<br />

Technical University of Denmark,<br />

2800 Lyngby, Denmark<br />

e-mail: dave@cbs.dtu.dk<br />

Y. Motro . D. Dunn . M. Bellgard<br />

Center for Bio<strong>in</strong>formatics <strong>and</strong> Biological Comput<strong>in</strong>g,<br />

Murdoch University,<br />

Murdoch, Western Australia 6150, Australia<br />

T. La . D. J. Hampson<br />

School of Veter<strong>in</strong>ary <strong>and</strong> Biomedical Sciences,<br />

Murdoch University,<br />

Murdoch, Western Australia 6150, Australia<br />

T. M. Wassenaar<br />

Molecular Microbiology <strong>and</strong> Genomics Consultants,<br />

Zotzenheim, Germany<br />

<strong>in</strong>clud<strong>in</strong>g pathogenicity isl<strong>and</strong>s <strong>and</strong> the structure of<br />

transposable elements. There are at least 20 different<br />

methods available to compare bacterial genomes. Metagenomics<br />

offers the chance to study genomic sequences<br />

found <strong>in</strong> ecosystems, <strong>in</strong>clud<strong>in</strong>g genomes of species that are<br />

difficult to culture. It has become clear that a genome<br />

sequence represents more than just a collection of gene<br />

sequences for an organism <strong>and</strong> that <strong>in</strong>formation concern<strong>in</strong>g<br />

the environment <strong>and</strong> growth conditions for the organism<br />

are important for <strong>in</strong>terpretation of the genomic data. The<br />

newly proposed M<strong>in</strong>imal Information about a Genome<br />

Sequence st<strong>and</strong>ard has been developed to obta<strong>in</strong> this<br />

<strong>in</strong>formation.<br />

Keywords Bacterial genomics . <strong>Comparative</strong> genomics .<br />

Bio<strong>in</strong>formatics . Genomic diversity .<br />

Molecular evolution<br />

Introduction<br />

The year 1995 marked the publication of two human<br />

pathogenic bacterial genome sequences: Haemophilus<br />

<strong>in</strong>fluenzae (Fleischmann et al. 1995, US patent number<br />

6,528,289) and Mycoplasma genitalium (Fraser et al.<br />

1995, US patent number 6,537,773). S<strong>in</strong>ce then, more than<br />

300 bacterial genomes have been fully sequenced <strong>and</strong><br />

become publicly available, <strong>in</strong>clud<strong>in</strong>g the sequence of a<br />

virulent form of H. <strong>in</strong>fluenzae (Harrison et al. 2005); the<br />

orig<strong>in</strong>al H. <strong>in</strong>fluenzae stra<strong>in</strong> sequenced <strong>in</strong> 1995 was from<br />

an isolate that does not cause disease. Although the<br />

majority of these several hundred genomes are from<br />

pathogenic organisms, some environmental bacterial genome<br />

sequences have also become available. This review<br />

article will provide a brief overview of sequenced bacterial<br />

genomes, their genomic diversity <strong>and</strong> some of the <strong>in</strong>sights<br />

ga<strong>in</strong>ed from analysis of this vast amount of data.<br />

Bacteria are microscopic unicellular prokaryotes that<br />

<strong>in</strong>habit a wide variety of environmental niches, broadly<br />

distributed <strong>in</strong> three ecosystems: the soil, mar<strong>in</strong>e environments<br />

<strong>and</strong> other liv<strong>in</strong>g organisms. Although there are



literally millions of bacterial species, only a small proportion<br />

of these can be grown <strong>in</strong> the laboratory (H<strong>and</strong>elsman<br />

2004). Bacteria (<strong>and</strong> Archaea) can be found almost<br />

anywhere <strong>in</strong> the environment: <strong>in</strong> the air, even <strong>in</strong> the<br />

International Space Station (Novikova et al. 2006), <strong>in</strong><br />

thermal ducts found at great depths <strong>in</strong> the oceans (Ala<strong>in</strong> et<br />

al. 2002; Vezzi et al. 2005), <strong>in</strong> the <strong>in</strong>test<strong>in</strong>al tracts of<br />

animals (Yan <strong>and</strong> Polk 2004; Backhed et al. 2005) <strong>and</strong> <strong>in</strong><br />

soil <strong>and</strong> rocks, even thous<strong>and</strong>s of meters deep (Torsvik et<br />

al. 1990). Bacteria live with<strong>in</strong> unicellular eukaryotes,<br />

algae, plants or animals. This diversity is reflected <strong>in</strong> their<br />

physiology, morphology, metabolism <strong>and</strong> ecosystems. For<br />

example, from a physiological perspective, most <strong>in</strong>test<strong>in</strong>al<br />

bacteria such as Escherichia coli are motile by means of<br />

flagella, to overcome the peristalsis of the gut, whilst the<br />

soil bacterium Clostridium perfringens does not possess<br />

such motility mach<strong>in</strong>ery (Shimizu et al. 2002). From a<br />

metabolic perspective, the versatile Burkholderia cepacia<br />

(formerly Pseudomonas cepacia) can utilise approximately<br />

100 different organic compounds as a sole energy source<br />

(Goldmann <strong>and</strong> Kl<strong>in</strong>ger 1986) compared to the strictly<br />

<strong>in</strong>tracellular Mycobacterium tuberculosis which is dependent<br />

on only a few carbon sources produced by its<br />

<strong>in</strong>voluntary host. From an <strong>in</strong>ter-bacterial <strong>in</strong>teraction<br />

perspective, sometimes bacteria cooperate. For example,<br />

Enterobacter cloacae <strong>and</strong> Pseudomonas mendoc<strong>in</strong>a positively<br />

<strong>in</strong>teract to stimulate plant growth (Duponnois et al.<br />

1999). On the other h<strong>and</strong>, there are also bacteria which not<br />

only “do not cooperate” but exhibit predatory behavior,<br />

such as Bdellovibrio bacteriovorus (Rendulic et al. 2004).<br />

As for bacteria–host <strong>in</strong>teractions, for a given bacterial<br />

species both pathogenic <strong>and</strong> non-pathogenic stra<strong>in</strong>s can<br />

exist (Dobr<strong>in</strong>dt <strong>and</strong> Hacker 2001; Penyalver <strong>and</strong> Lopez<br />

1999), while other species may be exclusively parasitic<br />

(Goebel <strong>and</strong> Gross 2001), truly symbiotic (Gil et al. 2004)<br />

or commensal (Yan <strong>and</strong> Polk 2004) for their host. It is<br />

<strong>in</strong>terest<strong>in</strong>g to note that this diversity is somehow captured<br />

<strong>in</strong> the relatively small bacterial genomes.<br />

The first complete viral genome (φX174) was published<br />

<strong>in</strong> 1977 (Sanger et al. 1977). To put this <strong>in</strong>to perspective, to<br />

sequence the 4.6-Mbp E. coli K-12 genome at that time<br />

(about a thous<strong>and</strong> base pairs (bp) could be sequenced per<br />

year <strong>in</strong> 1977) would take more than a thous<strong>and</strong> years to<br />

f<strong>in</strong>ish, <strong>and</strong> to sequence the human genome would take<br />

more than a million years to complete. The automation of<br />

sequenc<strong>in</strong>g methods, the <strong>in</strong>vention of polymerase cha<strong>in</strong><br />

reaction (PCR) (Mullis et al. 1986) <strong>and</strong> the shotgun clon<strong>in</strong>g<br />

procedure reduced costs <strong>and</strong> time, <strong>and</strong> provided the<br />

capability for large-scale sequenc<strong>in</strong>g. These developments<br />

together led to the sequenc<strong>in</strong>g of the first complete<br />

bacterial genome (Fleischmann et al. 1995) almost 20 years<br />

after the sequenc<strong>in</strong>g of φX174. The choice of the first<br />

bacterium to be completely sequenced (H. <strong>in</strong>fluenzae Rd<br />

KW20) was based on the follow<strong>in</strong>g reasons: (1) the<br />

genome size was thought to be ‘typical’ among bacteria<br />

(1.8 Mbp), (2) the G + C base composition was close to that<br />

of the human genome (38%) <strong>and</strong> (3) the bacterium had<br />

important human health implications. In the absence of<br />

procedures to produce a genetic map for the species,<br />

genome sequenc<strong>in</strong>g proved to be a powerful<br />

alternative for genetic characterisation. This l<strong>and</strong>mark<br />

work <strong>in</strong>itiated the <strong>in</strong>flux of genome sequence data which<br />

is now updated frequently <strong>and</strong> is publicly available. As of<br />

November 2005, there are more than 300 fully sequenced,<br />

publicly available bacterial genomes. Figure 1 shows this<br />

<strong>in</strong>crease of sequence data over the past decade. 1<br />

The total number of completed bacterial genome<br />

sequences has more than doubled over the past 2 years<br />

<strong>and</strong>, at the time of writ<strong>in</strong>g, there are 855 publicly listed<br />

bacterial <strong>and</strong> archaeal genome projects that are <strong>in</strong> various<br />

stages of progress. 2 In addition to new species, multiple<br />

stra<strong>in</strong>s of the same bacterial species are be<strong>in</strong>g sequenced.<br />

The amount of genomic data currently available has<br />

provided significant advances <strong>in</strong> our underst<strong>and</strong><strong>in</strong>g of a<br />

number of important themes, <strong>in</strong>clud<strong>in</strong>g bacterial diversity,<br />

population characteristics, operon structure, mobile genetic<br />

elements (MGE) <strong>and</strong> horizontal gene transfer (HGT). It has<br />

also provided a number of challenges <strong>in</strong> underst<strong>and</strong><strong>in</strong>g the<br />

ecology of, as yet, undiscovered bacterial worlds. The<br />

availability of whole genome sequences for pathogenic <strong>and</strong><br />

commensal bacterial species has allowed a more detailed<br />

analysis of the complex <strong>in</strong>teractions that occur with their<br />

plant or animal hosts. Figure 2a is a phylogenetic tree of<br />

300 sequenced bacterial genomes (available at the time of<br />

writ<strong>in</strong>g). Many of these genomes are from pathogenic<br />

bacteria liv<strong>in</strong>g <strong>in</strong> complex ecosystems, such as the<br />

spirochaete Brachyspira pilosicoli labelled <strong>in</strong> red <strong>in</strong> the<br />

phylogenetic tree shown <strong>in</strong> Fig. 2b. This bacterium attaches<br />

to enterocytes to form a “false brush border” <strong>in</strong> the colon.<br />

Most genome sequenc<strong>in</strong>g projects are currently carried<br />

out us<strong>in</strong>g automated applications of the sequenc<strong>in</strong>g<br />

technique developed by Sanger et al. (1973), but newly<br />

developed methodologies may enable even more rapid<br />

sequenc<strong>in</strong>g <strong>in</strong> the future. Two papers have been published<br />

about two different methods for high-throughput sequenc<strong>in</strong>g<br />

of bacterial genomes (Pennisi 2005). One method is<br />

essentially a “do-it-yourself kit”, which uses a laser<br />

confocal microscope <strong>and</strong> other “off-the-shelf” components<br />

to build a sequenc<strong>in</strong>g mach<strong>in</strong>e capable of sequenc<strong>in</strong>g an E.<br />

coli genome <strong>in</strong> less than a day (Shendure et al. 2005). The<br />

second method is a commercial mach<strong>in</strong>e, based on<br />

pyrosequenc<strong>in</strong>g methodologies to generate many short<br />

pieces of DNA; this method was used to sequence a<br />

bacterial genome with<strong>in</strong> a few hours (Margulies et al.<br />

2005). Although there are still some technical problems<br />

with both of these methods, it is clear that, <strong>in</strong> the near<br />

future, it will be possible to quickly sequence a bacterial<br />

genome at considerably lower cost.<br />
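
To make the scale of this change concrete, the throughput figures quoted in this section can be compared directly. The sketch below is a back-of-envelope calculation, not a measurement: the 1977 rate of about 1,000 bp per year and the 4.6-Mbp E. coli K-12 genome are taken from the text, the 3-Gbp human genome size is an assumed round figure, and "less than a day" is treated as exactly one day, which gives a lower bound on the speed-up.

```python
# Back-of-envelope throughput comparison (all figures are rounded assumptions).
BP_PER_YEAR_1977 = 1_000        # "about a thousand base pairs" per year (text)
ECOLI_K12_BP = 4_600_000        # 4.6 Mbp (text)
HUMAN_BP = 3_000_000_000        # assumed round figure of about 3 Gbp


def years_to_sequence(genome_bp, bp_per_year=BP_PER_YEAR_1977):
    """Years needed to read genome_bp bases once at a constant rate."""
    return genome_bp / bp_per_year


ecoli_years = years_to_sequence(ECOLI_K12_BP)
human_years = years_to_sequence(HUMAN_BP)

# Treat "less than a day" as one full day: a lower bound on the speed-up.
speedup = ecoli_years * 365

print(f"E. coli K-12 at 1977 rates: {ecoli_years:,.0f} years")
print(f"Human genome at 1977 rates: {human_years:,.0f} years")
print(f"One E. coli genome per day is at least a {speedup:,.0f}-fold speed-up")
```

The calculation ignores the redundancy (coverage) that any real sequencing project needs, so both estimates are optimistic for 1977 and for 2005 alike.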

1 Completed genome statistics obta<strong>in</strong>ed from the <strong>CBS</strong> atlas web<br />

pages http://www.cbs.dtu.dk/services/GenomeAtlas<br />

2 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj


Fig. 1 Cumulative number of<br />

complete published sequenced<br />

bacterial genomes (bars) <strong>and</strong><br />

total number of basepairs (l<strong>in</strong>e)<br />

over the past decade<br />

(1995–2005)<br />

Genomic <strong>in</strong>formation<br />

DNA codes for more than just prote<strong>in</strong>s<br />

The quality of annotation of bacterial genomes varies,<br />

although a survey based on three different methods to<br />

predict the expected number of genes <strong>in</strong> a genome has<br />

found that it is likely that, for most bacterial genomes,<br />

around 20% of the genes annotated might not be “real”<br />

(Skovgaard et al. 2001). Furthermore, proteomics experiments<br />

have detected some “real” genes that were not orig<strong>in</strong>ally<br />

predicted, highlight<strong>in</strong>g the dynamic nature of annotation<br />

<strong>and</strong> show<strong>in</strong>g that genes can be missed<br />

(Jaffe et al. 2004). Over-annotation of bacterial genomes is<br />

a problem but, unfortunately, this cannot be easily avoided.<br />

On the one h<strong>and</strong>, no one wants to miss a gene <strong>and</strong>, on the<br />

other h<strong>and</strong>, small genes can be quite difficult to predict, as a<br />

short open read<strong>in</strong>g frame could easily occur by statistical<br />

chance (Skovgaard et al. 2001).<br />
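
The point about short open reading frames arising by chance can be illustrated with a simple probability estimate. The sketch below assumes a random sequence in which all 64 codons are equally likely and independent, so that each codon has a 61/64 chance of not being a stop codon; real genomes deviate from this (AT-rich genomes contain more stop codons), so the numbers are indicative only. The function name is illustrative.

```python
# Chance that a random run of codons stays "open" (contains no stop codon).
# Assumes uniform, independent codons: 3 of the 64 codons are stops.
STOP_CODONS = 3
TOTAL_CODONS = 64


def prob_open(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    p_sense = (TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS
    return p_sense ** n_codons


for n in (30, 100, 300):
    print(f"{n:>3} codons: P(no stop) = {prob_open(n):.2e}")
```

Under these assumptions roughly one random 30-codon window in four is free of stop codons, whereas fewer than one in a hundred 100-codon windows is, which is why short predicted genes are much harder to trust than long ones.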

Several automated annotation systems are currently available;<br />

the BaSys system (Van Domselaar et al. 2005), for <strong>in</strong>stance,<br />

provides a comprehensive annotation of a DNA sequence<br />

file. To conduct comparative genomics with several<br />

hundred genomes, quality databases are essential <strong>and</strong> the<br />

“GenomeAtlas” database, which was orig<strong>in</strong>ally developed<br />

to store DNA structural <strong>in</strong>formation about the various<br />

sequenced genomes, is one example (Hall<strong>in</strong> <strong>and</strong> Ussery<br />

2004). Approximately a hundred different features for each<br />

genome (such as percent AT, cod<strong>in</strong>g skew bias, length of<br />

genome <strong>and</strong> number of genes) are currently made available<br />

through http://www.cbs.dtu.dk/services/GenomeAtlas/.<br />
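
As an illustration of the kind of per-genome feature mentioned above, the sketch below computes genome length and percent AT from a raw sequence string. It is a minimal example and not the GenomeAtlas implementation; the function names, the dictionary keys and the toy sequence are assumptions made for illustration, and ambiguous bases such as N are simply excluded from the percentage.

```python
from collections import Counter


def percent_at(sequence):
    """Percentage of A and T among the unambiguous bases of a sequence."""
    counts = Counter(sequence.upper())
    unambiguous = sum(counts[base] for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (counts["A"] + counts["T"]) / unambiguous


def genome_features(sequence, n_genes):
    """Collect a few summary features; n_genes comes from the annotation."""
    return {
        "length_bp": len(sequence),
        "percent_at": percent_at(sequence),
        "n_genes": n_genes,
    }


toy_genome = "ATGCGATATTAACGNN"   # hypothetical toy sequence, not real data
print(genome_features(toy_genome, n_genes=1))
```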

Duplication of essentials<br />

One of the features of genomic sequences that can be easily<br />

recognised is the presence of repeat sequences. The most<br />

obvious <strong>and</strong> extensive repeats present <strong>in</strong> many bacterial<br />

genomes are the operons encod<strong>in</strong>g the ribosomal RNA<br />

genes. These rRNA operons typically encode 16S <strong>and</strong> 23S<br />

rRNA separated by a short spacer, often followed by the 5S<br />

rRNA gene. All sequenced bacterial genomes possess at<br />

least one rRNA operon, <strong>and</strong> many (215 of 300) have two or<br />

more copies; the number of operons tends to correlate with<br />

bacterial division time. Thus, species that divide quickly<br />

(such as Bacillus cereus) have more copies of rRNA genes,<br />

so as to enable rapid production of ribosomes. In addition,<br />

species conta<strong>in</strong><strong>in</strong>g multiple rRNA operons appear to be<br />

more adaptable to chang<strong>in</strong>g environmental conditions<br />

(Ac<strong>in</strong>as et al. 2004). The rRNA genes are a valuable tool<br />

for the estimation of taxonomic relationships (see Fig. 2a).<br />

These genes evolve slowly, presumably because they play<br />

an essential role as the backbone of ribosomes while<br />

<strong>in</strong>teract<strong>in</strong>g with multiple prote<strong>in</strong>s. Any changes <strong>in</strong> the<br />

shape (sequence) of rRNA would most likely be fatal.<br />

Multiple copies per genome of tRNA genes can also be<br />

found <strong>in</strong> some genomes, aga<strong>in</strong> tend<strong>in</strong>g to correlate with<br />

division time. However, for tRNAs, the duplication<br />

number is also dictated by the frequency with which<br />

particular codons are used (or vice versa, as cause <strong>and</strong><br />

effect cannot be dist<strong>in</strong>guished here). This enables a less<br />

obvious level of regulat<strong>in</strong>g gene activity: a gene us<strong>in</strong>g<br />

many codons for which only one tRNA gene is available<br />

will probably be translated at a limited rate, whereas<br />

abundant prote<strong>in</strong>s are more likely to use tRNAs for which<br />

multiple gene copies are available. This is the basis for the<br />

codon adaptation <strong>in</strong>dex, which is a measure of the adaptation<br />

of a gene’s codon usage towards the optimal tRNA pool<br />

(Sharp <strong>and</strong> Li 1987).<br />
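
The index can be sketched in a few lines. The version below follows the commonly described form of the Sharp and Li (1987) measure: each codon receives a relative adaptiveness w, its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon, and the index of a gene is the geometric mean of w over its codons. Codon usage in the reference set is used here as a proxy for the tRNA pool. The reference genes, the pseudo-count of 0.5 for unseen codons and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing bases."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_genes):
    """w for each codon: its count divided by the top synonymous count."""
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:   # skip stops, Met and Trp
            continue
        top = max(counts[c] for c in family)
        if top == 0:                        # amino acid absent from reference
            continue
        for c in family:
            # Assumed pseudo-count of 0.5 keeps the logarithm defined.
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(gene, weights):
    """Geometric mean of w over the scorable codons of a gene."""
    logs = [math.log(weights[c]) for c in split_codons(gene) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in gene")
    return math.exp(sum(logs) / len(logs))


reference = ["CTGCTGCTGCTA"]          # hypothetical "highly expressed" gene
w = relative_adaptiveness(reference)
print(cai("CTGCTGCTG", w))            # uses only the preferred Leu codon
print(cai("CTGCTACTA", w))            # mixes in a rarer Leu codon
```

A gene built only from the preferred codon scores 1.0; every rarer codon pulls the geometric mean down.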

There are of course other duplications <strong>in</strong> bacterial<br />

genomes, some of which might appear at first glance to be<br />

less essential. For example, the ‘REP’ repetitive sequences<br />

frequently found <strong>in</strong> Enterobacteriaceae can be used as<br />

unique identifiers of bacterial genomes (Tobes <strong>and</strong> Ramos<br />

2005). It has been speculated that these repeats are<br />

mean<strong>in</strong>gless, result<strong>in</strong>g from errors <strong>in</strong> replication, or that



Fig. 2 a Phylogenetic tree of 287 sequenced bacterial genomes,<br />

based on alignments from the 16S rRNA gene sequence. The phyla<br />

are colour-coded; a more detailed view, with names of all the<br />

organisms can be found <strong>in</strong> the supplemental <strong>in</strong>formation:<br />

http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/FIG10yr/. b Photomicrograph<br />

show<strong>in</strong>g a dense fr<strong>in</strong>ge of anaerobic spirochaetes (B.<br />

pilosicoli) attached by one cell end to the lum<strong>in</strong>al surface of human<br />

colonic enterocytes, form<strong>in</strong>g a “false brush border”. Besides that of<br />

humans, B. pilosicoli colonises the large <strong>in</strong>test<strong>in</strong>e of a variety of<br />

mammals <strong>and</strong> birds, caus<strong>in</strong>g diarrhoea <strong>and</strong> reduced growth rates.<br />

Genomic sequence from B. pilosicoli is be<strong>in</strong>g analysed to assist <strong>in</strong><br />

underst<strong>and</strong><strong>in</strong>g the genetic basis of this dense colonisation, <strong>in</strong>clud<strong>in</strong>g<br />

patterns of gene expression underly<strong>in</strong>g the complex <strong>in</strong>teractions that<br />

occur between <strong>in</strong>dividual bacterial cells <strong>and</strong> the colonised<br />

enterocytes. The photograph is courtesy of Dr. W. Bastiaan DeBoer,<br />

University of Western Australia, Perth, Western Australia<br />

they may be a part of mobile elements that are able to<br />

translocate <strong>and</strong> duplicate themselves. These could alternatively<br />

be non-functional ‘molecular fossils’ of previous<br />

<strong>in</strong>sertion events. F<strong>in</strong>ally, it could well be that these repeats<br />

serve some as yet undiscovered useful purpose. It is<br />

possible, for example, that repetitive sequences <strong>and</strong><br />

<strong>in</strong>sertion sequence elements (ISs) contribute to genome<br />

plasticity through structural changes based on homologous<br />

recomb<strong>in</strong>ation (Kennedy et al. 2001; Fraser-Liggett 2005).<br />

A brief history of bacterial operons<br />

Much of the early classical work <strong>in</strong> microbiology has been<br />

done with E. coli, as this bacterium is relatively easy to<br />

culture <strong>in</strong> the laboratory. As more <strong>and</strong> more genetic<br />

<strong>in</strong>formation was gathered, it was considered a ‘typical’<br />

bacterium, although E. coli is no more typical of bacteria<br />

than a rabbit is of all eukaryotic organisms. More than<br />

40 years ago, a model was proposed for gene regulation of<br />

the catabolism of lactose <strong>in</strong> E. coli (Jacob et al. 1960; Jacob<br />

<strong>and</strong> Monod 1961). The model described an operon as a<br />

cluster of genes with related functions (encod<strong>in</strong>g, <strong>in</strong> this<br />

case, enzymes required for lactose degradation). This<br />

operon structure neatly allows regulation of gene expression<br />

by the concentration of lactose (Lewis et al. 1996;<br />

Reznikoff 1992). With the cont<strong>in</strong>uous expression of one<br />

small prote<strong>in</strong> (a repressor), wasteful expression of several<br />

other catabolic enzymes <strong>in</strong> the absence of lactose is<br />

prevented.<br />

S<strong>in</strong>ce the discovery of the lac operon, many more<br />

catabolic operons have been discovered, with positive <strong>and</strong><br />

negative feedback strategies, <strong>and</strong> these illustrate the<br />

biological need to use resources as efficiently as possible.<br />

Many, if not all, bacterial genomes <strong>in</strong>deed display clusters<br />

of genes <strong>in</strong>volved <strong>in</strong> a s<strong>in</strong>gle process (be they jo<strong>in</strong>tly<br />

transcribed <strong>and</strong> regulated, as <strong>in</strong> classical operons, or with<br />

separate promoters <strong>and</strong> regulators), but the degree of<br />

operon gene organisation <strong>and</strong> gene cluster<strong>in</strong>g differs<br />

between species. In some bacteria, such as <strong>in</strong> Helicobacter<br />

pylori, operons are relatively unconserved, <strong>and</strong> genes<br />

<strong>in</strong>volved <strong>in</strong> one cellular process can be dispersed<br />

throughout the genome (Tomb et al. 1997; Alm <strong>and</strong> Trust<br />

1999), although more recent work suggests that perhaps<br />

there are more operons <strong>in</strong> H. pylori than previously thought<br />

(Price et al. 2005). There are currently many resources for<br />

prediction of operons (Rogoz<strong>in</strong> et al. 2004; Rosenfeld et al.<br />

2004; Alm et al. 2005; Janga et al. 2005; Nishi et al. 2005;<br />

Price et al. 2005; Vallenet et al. 2006), <strong>in</strong>clud<strong>in</strong>g several<br />

databases, such as the Operon Database (Okuda et al.<br />

2006), RegulonDB (Salgado et al. 2006a,b) <strong>and</strong><br />

GeneChords (Zheng et al. 2005).<br />

How did the first operon evolve? Historically, three models<br />

have been proposed for the orig<strong>in</strong>s of gene<br />

clusters. The first model, which dates back to 1945,<br />

proposed the cluster<strong>in</strong>g of genes to be the direct result of<br />

gene duplication <strong>and</strong> evolution (Horowitz 1945, 1965).<br />

Gene duplication can occur dur<strong>in</strong>g replication <strong>and</strong>, as a<br />

duplicated gene has more freedom to mutate, this is<br />

believed to be a classical mechanism for novel enzymes to<br />

evolve (Lazcano et al. 1995). However, although all genes<br />

with<strong>in</strong> an operon may be <strong>in</strong>volved <strong>in</strong> a s<strong>in</strong>gle metabolic<br />

process, their function <strong>and</strong> structure can vary considerably,<br />

<strong>and</strong> a phylogenetic relationship between them is not always<br />

likely.<br />

The second model proposed for the evolution of operons<br />

is that coregulation of genes under a common promoter<br />

could provide selective advantage (Jacob et al. 1960).<br />

However, we now know that, <strong>in</strong> fact, it is possible to have<br />

coregulation of genes that are not physically l<strong>in</strong>ked<br />

together. Furthermore, this model does not really provide<br />

a gradual step-by-step mechanism for the evolution of<br />

operons.<br />

The third model for the evolution of an operon is that<br />

pre-exist<strong>in</strong>g genes moved together due to selective<br />

advantages of hav<strong>in</strong>g genes <strong>in</strong>volved <strong>in</strong> the same<br />

biochemical pathways or processes be<strong>in</strong>g physically<br />

close to each other. This hypothesis allows for structurally<br />

dist<strong>in</strong>ct genes to be part of one operon. This model requires<br />

both variation <strong>and</strong> frequent recomb<strong>in</strong>ation <strong>and</strong> has been<br />

proposed as an explanation of cluster<strong>in</strong>g of genes <strong>in</strong><br />

bacteriophage genomes (Stahl <strong>and</strong> Murray 1966; Juhala et<br />

al. 2000).<br />

In addition to these three views, there are other<br />

alternatives. Gene cluster<strong>in</strong>g may be of selective advantage<br />

<strong>in</strong> the case of horizontal gene transfer (see section below)<br />

<strong>and</strong>, based on this idea, a fourth mechanism, the ‘selfish<br />

operon’ model, was proposed (Lawrence <strong>and</strong> Roth 1996).<br />

This view has been recently called <strong>in</strong>to question, based on<br />

the physical cluster<strong>in</strong>g of essential genes <strong>in</strong> the E. coli K-12<br />

genome (Pal <strong>and</strong> Hurst 2004). Two other alternatives for<br />

operon evolution deal with chromat<strong>in</strong> structure <strong>and</strong> the<br />

physical location of genes <strong>in</strong> bacterial chromosomes, where<br />

transcription <strong>and</strong> translation are coupled (Pal <strong>and</strong> Hurst<br />

2004). It is quite possible that, <strong>in</strong> fact, there is no one<br />

“correct” mechanism, but perhaps different mechanisms are<br />

<strong>in</strong>volved at the same time. For example, the selective<br />

advantage of gene cluster<strong>in</strong>g dur<strong>in</strong>g horizontal gene transfer<br />

is exemplified by the cluster<strong>in</strong>g of multiple antibiotic


resistance genes on mobile genetic elements (Carattoli<br />

2001). In the era of antibiotic use, such genes are under<br />

strong selective pressure <strong>and</strong> are frequently passed on<br />

between bacteria by means of mobile elements. Whether<br />

these have directly contributed to the spread of catabolic <strong>and</strong><br />

other operons between bacterial species is currently not<br />

known.<br />

What separates genes <strong>in</strong> a genome?<br />

In comparison to genes, the non-cod<strong>in</strong>g part of genomes<br />

receives far less attention. Some genomes are more<br />

densely packed than others. The average cod<strong>in</strong>g<br />

density is about 90%, rang<strong>in</strong>g from 95% for Pelagibacter<br />

ubique (Giovannoni et al. 2005) to 51% for Sodalis<br />

gloss<strong>in</strong>idius (Toh et al. 2006). Bacterial genes are not<br />

spliced as they are <strong>in</strong> eukaryotes; that is, <strong>in</strong>trons are absent<br />

from nearly all bacterial genes. The sequences separat<strong>in</strong>g<br />

genes (<strong>in</strong>tergenic regions) can be thought of as spacers<br />

where <strong>in</strong>formation on regulation of transcription can be<br />

stored, although sometimes these <strong>in</strong>tergenic regions can<br />

also be more than regulatory <strong>and</strong> spacer doma<strong>in</strong>s.<br />
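
Coding density as quoted above can be computed from gene coordinates alone. The following sketch assumes genes are given as 1-based inclusive (start, end) pairs on a linear coordinate system, merges overlapping genes so that no base is counted twice, and divides the covered length by the genome length. Genes that span the origin of a circular chromosome are not handled, and the function name and toy coordinates are illustrative assumptions.

```python
def coding_density(genes, genome_length):
    """Fraction of the genome covered by at least one annotated gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping gene: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


# Hypothetical toy annotation: two overlapping genes and one separate gene.
toy_genes = [(1, 300), (250, 600), (701, 1000)]
print(f"coding density: {coding_density(toy_genes, 1000):.0%}")
```

In the toy annotation, only positions 601 to 700 are intergenic, so the density is 90%.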

Intergenic regions <strong>in</strong> the E. coli K-12 chromosome have<br />

been suggested to conta<strong>in</strong> the sequences for several<br />

hundreds of small RNA genes which are transcribed but do<br />

not code for prote<strong>in</strong>s (Chen et al. 2002). Many of these<br />

small RNAs act as regulators (Gottesman 2005).<br />

In general, the <strong>in</strong>tergenic regions of bacterial genomes<br />

are more AT-rich, will melt more readily, are more curved<br />

<strong>and</strong> are more rigid than the chromosomal average<br />

(Pedersen et al. 2000; Hall<strong>in</strong> <strong>and</strong> Ussery 2004). This is<br />

true for nearly all of the several hundreds of bacterial<br />

genomes sequenced, regardless of AT content. These<br />

characteristics make sense <strong>in</strong> terms of mechanical properties<br />

needed for <strong>in</strong>itiat<strong>in</strong>g transcription.<br />

Generation of genomic diversity <strong>in</strong> bacteria<br />

Genomic diversity is far greater than expected<br />

The view <strong>in</strong> many textbooks of biological diversity <strong>and</strong><br />

evolution often envisions clonal bacteria which slowly<br />

evolve through the gradual accumulation of s<strong>in</strong>gle-nucleotide<br />

changes. There might occasionally be a rare event<br />

where a new gene is duplicated but, <strong>in</strong> general, it has been<br />

commonly thought that if one were to sequence two<br />

different stra<strong>in</strong>s of a common bacterium like E. coli, the<br />

sequences would, for the most part, be similar <strong>and</strong> the two<br />

stra<strong>in</strong>s would share most (perhaps 90% or more) of their<br />

genes. At the time of writ<strong>in</strong>g, there are 20 different E. coli<br />

genomes which have been either completely sequenced or<br />

at least with an expected coverage of greater than 99% of<br />

the genome. Table 1 lists these genomes, <strong>and</strong> one of the<br />

surpris<strong>in</strong>g observations is the diversity just <strong>in</strong> size of the<br />

ma<strong>in</strong> chromosome, rang<strong>in</strong>g from 5.5 to 4.6 Mbp; that is,<br />

close to a million base pairs present <strong>in</strong> some E. coli stra<strong>in</strong>s<br />

which are miss<strong>in</strong>g <strong>in</strong> others. Furthermore, if one were to<br />

pick any one of these 20 stra<strong>in</strong>s, there would be more than a<br />

hundred genes which are unique to that stra<strong>in</strong> <strong>and</strong> are not<br />

found <strong>in</strong> the other 19 E. coli genomes. Studies have<br />

<strong>in</strong>dicated that much of this diversity comes from<br />

bacteriophages (Ohnishi et al. 2001).<br />

Table 1 Current E. coli genomes sequenced or <strong>in</strong> progress<br />

Escherichia coli stra<strong>in</strong> | Length (bp) | Number of genes | Number of tRNAs | Number of rRNAs | Number of contigs | Accession number<br />

O157_EDL93 | 5,528,445 | 5,349 | 100 | 7 | 1 | AE005174<br />

E22 | 5,516,16 | 4,788 | NA | NA | 109 | AAJV00000000<br />

O157_RIMD0509952 | 5,498,450 | 5,361 | 103 | 7 | 1 | BA000007<br />

E110019 | 5,384,084 | 4,746 | NA | NA | 119 | AAJW00000000<br />

B171 | 5,299,753 | 4,467 | NA | NA | 159 | AAJX00000000<br />

53638 | 5,289,471 | 4,783 | NA | NA | 119 | AAKB00000000<br />

042 | 5,241,977 | 4,899 | 93 | 7 | 2 | Sanger Institute (unpublished)<br />

CFT073 | 5,231,428 | 5,379 | 89 | 7 | 1 | AE014075<br />

H10407 | ~5,208,000 | ~5,000 | NA | NA | 225 | Sanger Institute (unpublished)<br />

F11 | 5,206,906 | 4,467 | NA | NA | 88 | AAJU00000000<br />

B7A | 5,202,558 | 4,637 | NA | NA | 198 | AAJT00000000<br />

NMEC RS218 | 5,089,235 | ~4,900 | NA | NA | 1 | Uni. Wisc. (unpublished)<br />

E2348 | 5,072,200 | 4,594 | 71 | 7 | 4 | Sanger Institute (unpublished)<br />

E24377A | 4,980,187 | 4,254 | 97 | 6 | 1 | AAJZ00000000<br />

UPEC 536 | ~4,900,000 | ~4,800 | NA | NA | 1 | Uni. Würzburg (unpublished)<br />

101NA1 | 4,880,380 | 4,238 | NA | NA | 70 | AAMK00000000<br />

HS | 4,643,538 | 3,689 | 89 | 6 | 1 | AAJY00000000<br />

K-12_W3110 | 4,641,433 | 4,390 | 88 | 7 | 1 | AP009048<br />

K-12_MG1655 | 4,639,675 | 4,254 | 88 | 7 | 1 | U00096<br />

B03 | 4,629,810 | 4,387 | 86 | 6 | 1 | CNRS France (unpublished)<br />

NA: currently not annotated<br />

Gene order conservation<br />

When compar<strong>in</strong>g bacterial genomes, two features are<br />

frequently analysed: gene presence <strong>and</strong> gene order. The<br />

presence or absence of genes is particularly <strong>in</strong>terest<strong>in</strong>g<br />

when two closely related species or stra<strong>in</strong>s that have<br />

different phenotypes, such as a pathogenic <strong>and</strong> a commensal<br />

stra<strong>in</strong> of the same species, are compared (Hayashi et al.<br />

2001). As for the actual process lead<strong>in</strong>g to the difference,<br />

the direction of the <strong>in</strong>sertion/deletion event is not always<br />

clear; the nature of the <strong>in</strong>del (INsertion/DELetion) is<br />

generally kept neutral.<br />
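
A presence/absence comparison of this kind reduces to set operations once each genome has been summarised as a set of gene family identifiers. The sketch below assumes that such a grouping into families has already been done by some upstream method; the strain names, family identifiers and function names are hypothetical and carry no biological meaning.

```python
def core_genes(strains):
    """Gene families present in every strain."""
    return set.intersection(*strains.values())


def unique_genes(strains):
    """For each strain, the gene families found in no other strain."""
    result = {}
    for name, genes in strains.items():
        others = set().union(
            *(g for other, g in strains.items() if other != name)
        )
        result[name] = genes - others
    return result


# Hypothetical gene-family content of two strains of one species.
strains = {
    "commensal_strain": {"fam1", "fam2", "fam3"},
    "pathogenic_strain": {"fam1", "fam2", "fam4", "fam5"},
}

print("core:", sorted(core_genes(strains)))
for name, genes in unique_genes(strains).items():
    print(name, "unique:", sorted(genes))
```

As noted above, a family found in only one strain says nothing about direction: it may have been gained by that strain or lost by the other.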

Table 2 Types of mobile genetic elements found <strong>in</strong> bacterial genomes<br />

There are various models of how the gene order with<strong>in</strong><br />

operons may have changed throughout evolution. It may be<br />

that the gene order <strong>in</strong> ancient ancestral operons has been<br />

ma<strong>in</strong>ta<strong>in</strong>ed, such that all (or many) of the operons <strong>in</strong><br />

studied genomes would be expected to have a similar gene<br />

structure. However, this view has been contradicted by data<br />

from whole genome studies. Exam<strong>in</strong><strong>in</strong>g the stability of<br />

operon structures over evolutionary distance shows that the<br />

majority of the gene orders with<strong>in</strong> operons could be<br />

shuffled frequently dur<strong>in</strong>g evolution, with the ribosomal<br />

prote<strong>in</strong> operons as an exception (Itoh et al. 1999). Such<br />

observations support the alternative possibility that operons<br />

are multiple evolutionary <strong>in</strong>ventions. A more recent<br />

study has exam<strong>in</strong>ed the evolution of the histid<strong>in</strong>e operon <strong>in</strong><br />

Proteobacteria <strong>and</strong> found evidence for <strong>in</strong>deed a gradual<br />

merg<strong>in</strong>g of genes with similar function <strong>in</strong>to operons, at<br />

least <strong>in</strong> this case (Fani et al. 2005).<br />

Comparisons of gene order can also be <strong>in</strong>formative of<br />

chromosomal translocations <strong>and</strong> <strong>in</strong>versions, which frequently<br />

happen <strong>in</strong> bacterial genomes (Kuwahara et al.<br />

2004). Such events are mostly neutral <strong>in</strong> terms of<br />

evolution, as they do not change the total genetic content<br />

of the cell, but translocations <strong>and</strong> <strong>in</strong>versions frequently<br />

co<strong>in</strong>cide with <strong>in</strong>sertions or deletions. Any of these<br />

processes can result from <strong>in</strong>accurate excision of mobile<br />

genetic elements <strong>and</strong>, as such elements are frequently<br />

MGE Description References<br />

Plasmids Circular, self-replicat<strong>in</strong>g DNA molecules that exist <strong>in</strong> cells as extra-chromosomal<br />

replicons. Some plasmids can <strong>in</strong>sert <strong>in</strong>to the chromosome.<br />

(Dobr<strong>in</strong>dt et al. 2004)<br />

Transposons DNA molecules that frequently change their chromosomal localisation, either<br />

with<strong>in</strong> or between replicons. They usually code for a transposase <strong>and</strong> some other<br />

genes (such as antibiotic resistance genes), <strong>and</strong> are flanked by <strong>in</strong>verted repeat<br />

DNA sequences.<br />

(Dobr<strong>in</strong>dt et al. 2004)<br />

Conjugative Transposons that also carry genes related to plasmid-encoded conjugation, thus, (Dobr<strong>in</strong>dt et al. 2004)<br />

transposons provid<strong>in</strong>g the ability to transfer between cells via conjugation<br />

Bacteriophages | Prokaryote-infecting viruses, which can modify the host genome by coding new functions or by modifying existing functions. They are also capable of inserting into the genome (prophages). These are also agents of HGT. | Dobrindt et al. 2004

Integrons | Genetic elements composed of a gene encoding an integrase (int gene; excises and integrates the gene cassettes from and into the integron), gene cassettes (become part of the integron upon integration; consist of a promoterless gene and a recombination site termed attC) and an integration site for the gene cassettes (attI gene) | Fluit and Schmitz 2004; Holmes et al. 2003; Peters et al. 2001

Insertion sequence elements | Small, genetically compact DNA sequences, generally less than 2.5 kbp in length, encoding functions involved in their translocation, and transpose both within and between genomes. IS elements are a subset of a general group of elements named transposable elements. These transposable elements are defined as elements of DNA segments that carry the genes required for this process (and, in some cases, other genes), and consequently move about chromosomes and, more generally, genomes. | Mahillon et al. 1999; Ou et al. 2006

Genomic islands | Large chromosomal regions that contain a cluster of functionally related genes, an operon or a number of operons, flanked by direct repeat sequences, and located near an integrase or transposase gene and a tRNA gene. | Dobrindt et al. 2004


<strong>in</strong>volved <strong>in</strong> generat<strong>in</strong>g diversity <strong>in</strong> bacteria, they deserve to<br />

be treated <strong>in</strong> a separate section.<br />

Mobile genetic elements<br />

MGEs are genomic elements that are capable of translocat<strong>in</strong>g<br />

themselves with<strong>in</strong> or between genomes. When mov<strong>in</strong>g<br />

to a new genome, they may confer a new characteristic on<br />

the recipient. Their size ranges from hundreds of base pairs<br />

to more than 100 kbp. Plasmids, transposons, conjugative<br />

transposons, bacteriophages, <strong>in</strong>tegrons, <strong>in</strong>sertion sequence<br />

elements <strong>and</strong> genomic isl<strong>and</strong>s (GEIs) are all considered<br />

MGEs (Table 2). Bacteriophages are the most sophisticated,<br />

as they produce their own prote<strong>in</strong> coat to protect the<br />

genetic material (which can be DNA or RNA). Conjugative<br />

transposons <strong>in</strong>duce conjugation between cells, a process <strong>in</strong><br />

which cellular membranes merge to produce a bridge<br />

through which the transposon can move. Some plasmids<br />

can also <strong>in</strong>duce conjugation (a transposon always encodes<br />

transposase whereas a conjugative plasmid replicates<br />

without <strong>in</strong>tegration <strong>in</strong> the chromosome). Some of the<br />

def<strong>in</strong>itions for the various MGEs partly overlap, as <strong>in</strong>deed<br />

these terms are flexible. For <strong>in</strong>stance, transposons can<br />

<strong>in</strong>tegrate <strong>in</strong> plasmids, <strong>and</strong> bacteriophages may conta<strong>in</strong><br />

<strong>in</strong>sertion sequence elements (Burrus <strong>and</strong> Waldor 2004).<br />

MGEs constitute potentially foreign DNA located <strong>in</strong> a<br />

conceptual ‘flexible’ gene pool, from where ‘donated’<br />

DNA is made available for recipient cells. Once the MGE<br />

is transferred into the recipient cell, the DNA will either insert into a region on the chromosome or it will start to invoke its own replication machinery. If the MGE is

<strong>in</strong>tegrated <strong>in</strong>to the genome, for example, like a pathogenicity<br />

isl<strong>and</strong> (PAI), the genes (or operon) will start to be<br />

expressed, thus add<strong>in</strong>g a new characteristic to the cell. The<br />

MGE may later <strong>in</strong>itiate ‘donation’ of DNA either to a next<br />

receptor (for which the trigger is as yet unknown) or to the<br />

flexible gene pool, perhaps tak<strong>in</strong>g with it a ‘new’ or<br />

additional gene or function. The <strong>in</strong>tegrated MGE may also<br />

become immobile as a result of chromosomal re-arrangements,<br />

duplications or sequence <strong>in</strong>sertions/deletions. In the<br />

case of such rendered immobility, the <strong>in</strong>tegrated MGE<br />

becomes a permanent genomic element or genomic isl<strong>and</strong>.<br />

At a later stage, the genomic isl<strong>and</strong> may be modified <strong>and</strong><br />

rendered mobile aga<strong>in</strong>, mak<strong>in</strong>g it available for transfer to<br />

the flexible gene pool once aga<strong>in</strong>.<br />

As each of the MGEs listed in Table 2 would merit a review paper of its own, this review focuses on two, namely, insertion sequence elements and GEIs. These

two MGEs are of particular <strong>in</strong>terest because our knowledge<br />

of them has improved dramatically as a direct result of<br />

genome sequence availability <strong>and</strong> due also to their impact<br />

on the diversity of bacteria.<br />

Insertion sequence elements<br />

IS elements are small DNA sequences, generally less than 2.5 kb in length, that encode functions involved in their own translocation and can transpose both within and between genomes (Mahillon et al. 1999). IS elements were

orig<strong>in</strong>ally described as a subset of transposable elements<br />

(Prescott et al. 1999). IS elements are the simplest form of<br />

MGE <strong>and</strong> a key component of a majority of the more<br />

complex transposable elements, found both <strong>in</strong> bacterial <strong>and</strong><br />

eukaryotic genomes. A number of reviews deal with IS<br />

elements <strong>in</strong> greater depth (van Belkum et al. 1998;<br />

Mahillon et al. 1999; Galun 2003).<br />

An IS conta<strong>in</strong>s a transposase gene, flanked by term<strong>in</strong>al<br />

<strong>in</strong>verted repeats (the sequence of one flank is encoded on<br />

the opposite str<strong>and</strong> of the other flank). One of these repeats<br />

classically conta<strong>in</strong>s the promoter for the transposase gene<br />

(Fig. 3; Galun 2003). The IS elements are also flanked by<br />

short, directly repeated sequences, which are generated <strong>in</strong><br />

the recipient DNA as a result of <strong>in</strong>sertion.<br />
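The terminal inverted repeats described above can be checked computationally: the left end of the element should match the reverse complement of its right end. The following is a minimal sketch under that assumption only. The function names, the `max_len` cut-off and the toy sequence are illustrative and do not come from the review, and real IS annotation tools also tolerate mismatches, which this sketch does not.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def terminal_inverted_repeat_length(element, max_len=50):
    """Length of the longest perfect terminal inverted repeat of an element.

    The left terminus is compared, base by base, with the reverse
    complement of the right terminus. max_len is an arbitrary cap
    chosen for illustration.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


# Toy element (hypothetical): a 7 bp left repeat, an internal region
# standing in for the transposase gene, and the matching right repeat.
toy_is = "GGTCTAC" + "ATGAAACCCGGGTTTAA" + "GTAGACC"
print(terminal_inverted_repeat_length(toy_is))
```

The same comparison could be extended to the short direct repeats that flank the element, which are identical in sequence and orientation rather than reverse-complemented.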

The activity of transposable elements <strong>in</strong> genomes was<br />

first noted by McCl<strong>in</strong>tock (1950) <strong>in</strong> maize, although at that<br />

time the mechanism beh<strong>in</strong>d the observed genetic changes<br />

was not understood. Starl<strong>in</strong>ger <strong>and</strong> Saedler (1976) provided<br />

the first review of IS elements <strong>in</strong> bacterial genomes. As<br />

noted by Lupski <strong>and</strong> We<strong>in</strong>stock (1992), the first ISs were<br />

classified before their function, orig<strong>in</strong> <strong>and</strong> dispersion<br />

mechanisms were understood. The present genomic era<br />

has resulted <strong>in</strong> advances <strong>in</strong> their classification, underst<strong>and</strong><strong>in</strong>g<br />

of mechanisms of dispersion <strong>and</strong> identification of<br />

their role <strong>in</strong> evolution (van Belkum et al. 1998; Mahillon et<br />

al. 1999). Although the classical ISs are considered to be evolutionarily neutral, as each can only translocate its own transposase, they are the means by which genomic islands

(for example PAIs <strong>and</strong> metabolic isl<strong>and</strong>s) are transferred,<br />

<strong>and</strong> they also play a role <strong>in</strong> plasmid <strong>in</strong>tegration (Rocha et al.<br />

1999). Variation <strong>in</strong> the excision of ISs promotes genome<br />

rearrangements (<strong>in</strong>clud<strong>in</strong>g deletions, <strong>in</strong>versions <strong>and</strong> replicon<br />

fusions; Mahillon et al. 1999). Antibiotic resistance<br />

genes are frequently spread with<strong>in</strong> bacterial populations<br />

with the aid of ISs, which gives these simple elements<br />

cl<strong>in</strong>ical relevance. F<strong>in</strong>ally, <strong>in</strong> special cases, IS elements can<br />

<strong>in</strong>directly cause antigenic variation, a process <strong>in</strong> which a<br />

gene is switched off <strong>and</strong> on <strong>in</strong> a reversible manner with<strong>in</strong> a<br />

bacterial population (Talarico et al. 2005). IS sequences that are present in the first part of a gene can cause slippage during replication, as DNA polymerase has difficulties with correct replication of short multiple repeats. The result can be a frame shift with consequential inactivation, but the next frame shift can restore gene function. Such slippage can also vary the distance and, thus, activity of a promoter and its gene. Examples involving genes with a role in pathogenicity, with antigenic variation of surface exposed proteins, and environmental adaptation have been described (van Belkum et al. 1998; Rocha et al. 1999).

Fig. 3 Organisation of a typical insertion sequence. The IS is represented as an open box in which the terminal inverted repeats are shown as blue boxes labelled IRL (left IR) and IRR (right IR). An open reading frame encoding the transposase (grey box) is located in the IS. WXY boxes flanking the IS represent short directly repeated sequences generated in the target DNA as a consequence of insertion. The transposase promoter is localised in IRL

Monitor<strong>in</strong>g of these elements has provided <strong>in</strong>sights <strong>in</strong>to<br />

bacterial genome molecular processes <strong>and</strong> the nature of IS<br />

elements. For example, underst<strong>and</strong><strong>in</strong>g the regulatory<br />

mechanisms of IS elements has provided <strong>in</strong>sights <strong>in</strong>to the<br />

importance of the compromises adopted by IS elements (and MGEs, in general) between maintaining a stable host genome and endangering the survival of the host through too much transposition activity (Nagy and Chandler 2004). It has

also been suggested that IS expansion occurs dur<strong>in</strong>g an<br />

evolutionary bottleneck, which reduces effective population<br />

size <strong>and</strong> the degree of <strong>in</strong>traspecies competition<br />

(Parkhill et al. 2003).<br />

Genomic isl<strong>and</strong>s<br />

GEIs, also referred to as <strong>in</strong>tegrative <strong>and</strong> conjugative<br />

elements or ICEl<strong>and</strong>s (van der Meer <strong>and</strong> Sentchilo 2003),<br />

are large chromosomal regions that cluster functionally<br />

related genes, are flanked by direct repeat sequences <strong>and</strong><br />

are located near an <strong>in</strong>tegrase or transposase gene <strong>and</strong> often<br />

also near a tRNA. Furthermore, GEIs must have a GC<br />

composition different from the rest of the genome. GEIs<br />

<strong>in</strong>clude pathogenicity isl<strong>and</strong>s, symbiosis isl<strong>and</strong>s (SYIs),<br />

metabolic isl<strong>and</strong>s (MEIs), antibiotic resistance isl<strong>and</strong>s<br />

(REIs) <strong>and</strong> secretion system isl<strong>and</strong>s (SEIs) (Zhang <strong>and</strong><br />

Zhang 2004). This remarkable variety of GEIs demonstrates<br />

the power of horizontal gene transfer, as they are<br />

believed to be the result of <strong>in</strong>terspecies DNA transfer. With<br />

multiple genes neatly clustered <strong>in</strong> functional groups<br />

<strong>in</strong>clud<strong>in</strong>g all necessary regulatory <strong>and</strong> secretory genes,<br />

the power of transferr<strong>in</strong>g such ‘adaptive genetic bombs’<br />

can be easily imag<strong>in</strong>ed.<br />
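The GC criterion mentioned above (a GEI has a GC composition different from the rest of the genome) lends itself to a simple sliding-window scan. The sketch below is illustrative only: the window size, step and deviation threshold are arbitrary assumptions, not values from the review, and published island predictors combine GC content with other signals such as codon usage, flanking repeats and nearby tRNA or integrase genes.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, end, gc) for windows whose GC fraction differs from
    the genome-wide GC fraction by at least `threshold`.

    All three parameters are assumed defaults chosen for illustration.
    """
    genome_gc = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_fraction(genome[start:start + window])
        if abs(window_gc - genome_gc) >= threshold:
            hits.append((start, start + window, window_gc))
    return hits


# Toy genome (hypothetical): a GC-rich backbone with an AT-rich insert.
toy_genome = "GC" * 500 + "AT" * 100 + "GC" * 500
for start, end, gc in atypical_gc_windows(toy_genome, window=200, step=100, threshold=0.2):
    print(start, end, round(gc, 2))
```

Overlapping flagged windows would normally be merged into candidate regions before inspecting their boundaries for direct repeats.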

Genome sequences have revealed that GEIs are common in bacteria as a result of successful horizontal transfers of DNA from a donor genome to a recipient genome. In most cases, the nature of the donor is unfortunately unknown. Even when an identified GEI bears a high resemblance to a section of another sequenced organism, one should not assume (though this mistake has frequently been made) that the GEI was directly received from that other organism. The transfer could well have involved a third unidentified species, serving either as an intermediate between the first two or as the donor for the others. These possibilities are frequently not recognised, as people can be misled by the available genome sequences and are not sufficiently aware of all those bacterial genomes for which we currently lack sequence information.

The discovery of abundant genomic islands is strengthening the concept of a bacterial genome being quite dynamic and consisting of a backbone genome supplemented with adaptive genome modules, which may or may not be present in a given strain of the species (Fraser-Liggett 2005). All modules available to the species (but never all present in one strain) would comprise the gene pool of that organism. This concept clearly does not apply to strictly clonal species, in which all isolates or strains closely resemble each other (as is the case, for instance, with Bacillus anthracis), but it better describes the situation for the frequently observed highly diverse species, such as E. coli or Streptomyces. Nevertheless, the timescale at which these events take place should not be ignored. Genomes are the sum of thousands of years of evolution. Observations of evolutionary events taking place in ‘real time’ are still relatively rare.

Pathogenicity islands

PAIs are now considered a subtype of genomic islands but were among the earliest islands to be described. PAIs harbour pathogenicity-related genes, thus potentially conferring a pathogenic phenotype on a recipient genome. Figure 4 illustrates a generalised model of a PAI. As with other GEIs, PAIs are commonly inserted into tRNA genes, which may be preferred sites of insertion due to their relative conservation and redundancy (Dobrindt et al. 2004). PAIs are flanked by direct repeat sequences allowing for insertion into the recipient DNA and contain an integrase gene that enables the integration into the recipient DNA. A feature observed for many PAIs (and originally included in their definition although not always present) is the presence of a type III secretion system, a set of genes building an apparatus to specifically inject virulence factors into the host cell (Jores et al. 2004). Numerous investigations have identified and analysed PAIs (McGillivary et al. 2005; Middendorf et al. 2004; Paulsen et al. 2003; Schneider et al. 2004; Zubrzycki 2004; Schmidt and Hensel 2004).

Fig. 4 Generalised diagrammatic representation of a pathogenicity island. Commonly inserted into a tRNA gene sequence, flanked by direct repeat sequences, containing an integrase (int) gene, commonly containing insertion sequence elements, and harbouring functional genes (with virulence associated properties), which may be organised into an operon structure. Sometimes, a type III secretion system is also present

Horizontal gene transfer <strong>and</strong> restriction modification<br />

systems<br />

Evidence of HGT (also referred to as lateral gene transfer, LGT) dates back more than 30 years (Falkow 1975), with the finding of transposable elements. Although such events

were considered only exceptional cases at that time, it is<br />

now evident that HGT events can make a substantial<br />

contribution to the generation of genetic diversity. As with<br />

all other features, the degree of horizontal transfer varies<br />

amongst species. Ochman et al. (2000) assessed 19 completely sequenced bacterial genomes and reported that the proportion of foreign proteins varies from 0% (Mycoplasma genitalium) to about 17% (Synechocystis spp.). These findings were supported by others, including Dufraigne et al. (2005). Ortutay et al. (2003) undertook a genomic-scale phylogenetic analysis of protein-encoding genes from five closely related Chlamydia spp. and identified a set of sequences that have arisen via HGT since the divergence of the Chlamydia lineage. These data

illustrate the significant role of HGT <strong>in</strong> the evolution of<br />

particular bacterial species. It is not surpris<strong>in</strong>g that obligate<br />

<strong>in</strong>tracellular pathogens show less evidence of recent HGT:<br />

they will not easily encounter other bacterial species with<br />

which to share DNA.<br />

Doolittle (1999a) listed three observations that can only<br />

be expla<strong>in</strong>ed by HGT. The first observation is that<br />

phylogenetic trees based on <strong>in</strong>dividual prote<strong>in</strong>-cod<strong>in</strong>g<br />

genes frequently differ substantially from the rRNA tree<br />

or from each other. The second observation comes from<br />

analysis, with<strong>in</strong> a genome, of variation <strong>in</strong> G + C content,<br />

codon usage <strong>and</strong> gene order. The third observation is a<br />

result of between-genome comparisons, which show that<br />

all genomes conta<strong>in</strong> particular genes that are more similar<br />

to homologues <strong>in</strong> distant genomes than to homologues <strong>in</strong><br />

closer relatives or <strong>in</strong>deed that are absent from all known<br />

genomes of closer relatives. Combining this evidence,

Doolittle (1999b) proposed an alternative to the tree of life<br />

to describe the evolutionary history of liv<strong>in</strong>g organisms.<br />

His model of a web-like structure takes <strong>in</strong>to account the<br />

<strong>in</strong>fluence of HGT, where <strong>in</strong>teractions occur between<br />

ancestral organisms <strong>and</strong> descendants (branches) as well<br />

as between branches. A similar concept of a biological<br />

network has been further explored by Kun<strong>in</strong> et al. (2005).<br />

Such a concept is difficult to work with, <strong>and</strong> currently<br />

many microbiologists still accept a tree-like phylogenetic<br />

relationship, at least for an artificial ‘backbone’ of the<br />

species. Independent of the source (stra<strong>in</strong> or species) of the<br />

genes, phylogenetic trees can <strong>in</strong>deed be correctly produced<br />

for many genes <strong>and</strong> gene families <strong>and</strong> may describe<br />

evolutionary relationships that do not date back very far.<br />

Go<strong>in</strong>g back further <strong>in</strong> time, the vertical l<strong>in</strong>eages become<br />

weaker <strong>and</strong> the phylogenetic trees are less mean<strong>in</strong>gful. The<br />

paradoxical conclusion is that, by elucidating more of the

evolutionary history of bacteria, their history has become<br />

less clear.<br />

If it is really true that horizontal gene transfer is so<br />

general, how is it still possible to recognise bacterial<br />

species? First, HGT is not so frequent that it can be easily<br />

observed as DNA exchange <strong>in</strong> ‘real time’ (other than the<br />

uptake of plasmids, spread of antibiotic resistance genes or<br />

transfection of phages). Evidence for past HGT events can<br />

be seen <strong>in</strong> many bacterial genomes <strong>and</strong> exemplifies its<br />

importance <strong>in</strong> evolution but, without a time scale, the<br />

frequency of such events cannot be estimated. Second,<br />

there are barriers that restrict HGT. It is obvious that not all<br />

bacteria share the same gene pool <strong>and</strong> only bacteria that<br />

share an ecological niche are likely to encounter <strong>and</strong> share<br />

each other’s DNA. Even under circumstances that favour<br />

DNA exchange, <strong>in</strong>ternal factors restrict the success of<br />

HGT, notably bacteriophage specificity, plasmid <strong>in</strong>compatibility,<br />

<strong>and</strong> the activity of restriction modification (RM)<br />

systems. Finally, not all putatively horizontally transferred genes in E. coli are actually translated into proteins, perhaps because of incompatibility with the translational machinery (Taoka et al. 2004).

The discovery of restriction enzymes which could cleave<br />

specific DNA sequences provided the basis for driv<strong>in</strong>g the<br />

“biotechnology revolution” <strong>in</strong> the 1970s. RM systems are<br />

popular <strong>in</strong> molecular genetics <strong>and</strong> are rout<strong>in</strong>ely used by<br />

most molecular biology laboratories throughout the world.<br />

The RM systems encode a modification enzyme that<br />

chemically modifies a specific short DNA sequence <strong>and</strong> a<br />

restriction endonuclease that will digest the DNA at that<br />

same specific recognition sequence unless the sequence has<br />

been modified (usually by methylation). Bacterial species<br />

(<strong>and</strong> frequently stra<strong>in</strong>s with<strong>in</strong> a species) all have their own<br />

comb<strong>in</strong>ation of RM systems (Roberts et al. 2005).<br />

Incom<strong>in</strong>g DNA with a different modification pattern will<br />

be recognised by the endonuclease of the recipient stra<strong>in</strong>,<br />

<strong>and</strong> the fate of such DNA is to be degraded. This is seen as<br />

a serious restriction for the spread of DNA through<br />

populations unless their RM systems are compatible.<br />
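The restriction barrier described above can be illustrated with a toy model: incoming DNA is at risk wherever it carries a recognition sequence that the recipient's endonuclease cleaves and that the donor's methylases did not protect. The sketch below rests on simplifying assumptions that are not part of the review: each RM system is reduced to a single exact recognition sequence, methylation by the donor is treated as complete, and only one strand is scanned (adequate for the palindromic sites used here). GAATTC and GGATCC are the well-known EcoRI and BamHI recognition sites, used purely as examples.

```python
def find_sites(dna, site):
    """Return 0-based start positions of every occurrence of `site` in `dna`."""
    dna, site = dna.upper(), site.upper()
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)  # step by one so overlapping hits are found
    return positions


def exposed_sites(dna, donor_methylated, recipient_cleaved):
    """Map each recognition sequence that the recipient cleaves, and that the
    donor does not methylate, to its positions in the incoming DNA.

    An empty result means the toy model predicts no cleavage.
    """
    exposed = {}
    for site in sorted(recipient_cleaved):
        if site in donor_methylated:
            continue  # protected by the donor's modification enzyme
        positions = find_sites(dna, site)
        if positions:
            exposed[site] = positions
    return exposed


# Hypothetical incoming fragment carrying one GAATTC and one GGATCC site.
incoming = "TTGAATTCAAGGATCCTT"
print(exposed_sites(incoming, {"GAATTC"}, {"GAATTC", "GGATCC"}))
```

In this model the fragment survives only when donor and recipient have compatible RM systems, which is the condition stated in the text.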

The analysis of RM systems at a comparative genomics level (particularly the type II restriction endonucleases) has shown the dynamic state of the respective genes (Lin et al. 2001) and has called into question the view that RM genes restrict gene flow. For example, H. pylori and

Campylobacter jejuni are competent to take up DNA <strong>and</strong><br />

have a large set of genes to ma<strong>in</strong>ta<strong>in</strong> this property. The<br />

dynamic nature of the H. pylori genome <strong>and</strong> its natural<br />

competence is consistent with the weakly clonal population<br />

structure of H. pylori. Nevertheless, studies on H. pylori<br />

identified at least eight type II RM systems across two<br />

stra<strong>in</strong>s with an active restriction endonuclease <strong>and</strong><br />

methylase (Kong et al. 2000; L<strong>in</strong> et al. 2001). In addition,<br />

there were several active methylase genes without an active


endonuclease. The occurrence of RM systems that are not<br />

shared between the stra<strong>in</strong>s suggests that new RM systems<br />

are readily acquired <strong>and</strong> subsequently lost as a result of<br />

mutation or recombination (Lin et al. 2001). However, it is difficult to envisage these systems posing barriers to gene flow given the dynamic population structure. RM genes may have other advantages for the cell. For methylation

genes miss<strong>in</strong>g their match<strong>in</strong>g restriction gene, it has been<br />

suggested that they may be used for regulat<strong>in</strong>g gene<br />

expression (as for DAM methylation <strong>in</strong> E. coli; Lobner-<br />

Olesen et al. 2005; Robb<strong>in</strong>s-Manke et al. 2005) <strong>and</strong> for<br />

keep<strong>in</strong>g track of which parts of the chromosome have been<br />

recently replicated (Maas 2004).<br />

Methods for compar<strong>in</strong>g bacterial genomes<br />

There are at least 20 methods to compare bacterial<br />

genomes, as shown <strong>in</strong> Table 3. Some methods are more<br />

commonly used than the others, <strong>and</strong> it is beyond the scope<br />

of this review to provide a detailed analysis of each<br />

method. A few of these methods are discussed <strong>in</strong> this<br />

section.<br />

Table 3 Approaches to comparing bacterial genomes

Level | Method | Reference
Genome | Chromosome alignment | Carver et al. 2005
Genome | AT content in the genome and upstream of genes | Ussery and Hallin 2004a
Genome | Oligomer bias on leading or lagging strands | Worning et al. 2006
Genome | Repeats (local and global) | Ussery et al. 2004a
Genome | Periodicity of DNA structural properties | Worning et al. 2000
Genome | Length comparison | Ussery and Hallin 2004b
Genome | Promoter analysis | Ussery et al. 2004d
Transcriptome | Organisation of rRNA operons | Ussery et al. 2004b
Transcriptome | tRNAs and codon usage | Ussery et al. 2004c
Transcriptome | Third nucleotide position bias in codon usage | Ussery et al. 2004c
Transcriptome | Annotation quality | Skovgaard et al. 2001
Proteome | Amino acid usage | Ussery et al. 2004c
Proteome | BLAST atlases | Hallin et al. 2004a
Proteome | BLAST matrices | Binnewies et al. 2004
Proteome | Sigma factors | Kiil et al. 2005a
Proteome | Transcription factors | Kummerfeld 2006
Proteome | Secreted proteins | Bendtsen et al. 2005a
Proteome | Membrane proteins | Bendtsen et al. 2005b
Proteome | 2-D correlation of properties | Willenbrock et al. 2005
Proteome | Two component signal transduction systems | Kiil et al. 2005b

Chromosome alignment and size comparison

Perhaps one of the easiest ways to compare genomes is by their sizes, as shown in Fig. 5. Although different phyla have different average sizes, it must be kept in mind that many of the phyla currently have few representatives and that there is a strong economic bias towards sequencing the smallest genomes, so the size distributions shown here for the sequenced genomes could well be shorter than what exists in natural ecosystems. Another way of comparing chromosomes is to do a simple alignment of the DNA sequences. There are two kinds of alignment programme. One involves downloading some scripts and running them on a local computer, such as the Sanger Centre’s (Cambridge, UK) Artemis Comparison Tool (ACT, Carver et al. 2005), and the other is web-based, such as “WebACT”, a web-based version of ACT with precomputed comparisons between several hundred bacterial genomes. The latter might be easier to use for those biologists who are less computationally inclined (Abbott et al. 2005).

AT content in genomes and promoter analysis

Another relatively easy method to compare genomes is by their AT content, which ranges from 78% (Wigglesworthia glossinidia) to 27% (Clavibacter michiganensis) for the 300 genomes sequenced at the time of writing. In addition to the average AT content for a whole genome, if the variation of the AT content within a given genome is examined, two general trends can be seen for nearly all of the bacterial genomes. First, on a more global chromosomal level, there is a tendency for the region around the origin of DNA replication to be more GC rich (i.e. less AT rich) and the region around the replication terminus to be more AT rich (Hallin et al. 2004b). Second, the average AT content for DNA about 400 bp upstream of the translation start site for all the genes in a genome is higher than 400 bp downstream (Hallin et al. 2004b). This makes sense in that the DNA will need to melt more easily in order for transcription to start.
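The AT content comparisons described in this section can be sketched in a few lines. The code below is a hedged illustration, not the method used in the cited work: it computes the AT fraction of a sequence and the mean AT fraction in fixed-length windows upstream and downstream of translation start sites. It assumes all genes lie on the forward strand and that start positions are supplied as 0-based coordinates; the function names and the toy input are hypothetical.

```python
def at_fraction(seq):
    """Fraction of A and T bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def flank_at_content(genome, start_sites, flank=400):
    """Mean AT fraction in the `flank` bases upstream and downstream of each
    translation start site.

    Assumes forward-strand genes only; start sites too close to either end
    of the sequence are skipped. Returns (upstream_mean, downstream_mean),
    or None if no start site could be used.
    """
    upstream, downstream = [], []
    for start in start_sites:
        if start - flank < 0 or start + flank > len(genome):
            continue
        upstream.append(at_fraction(genome[start - flank:start]))
        downstream.append(at_fraction(genome[start:start + flank]))
    if not upstream:
        return None
    return sum(upstream) / len(upstream), sum(downstream) / len(downstream)


# Toy sequence (hypothetical): an AT-rich stretch followed by a GC-rich one,
# with a single start site at the junction.
toy = "AT" * 5 + "GC" * 5
print(at_fraction(toy))
print(flank_at_content(toy, [10], flank=10))
```

Handling reverse-strand genes would require taking the reverse complement of the region around each start site before slicing.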

175


176<br />

Fig. 5 Genome length distribution for 287 bacterial chromosomes,<br />

shown as a box-and-whiskers plot for each phylum. The number of<br />

chromosomes in each phylum is shown on the axis. Most of the<br />

bacterial genomes shown are either Proteobacteria (156 genomes) or<br />

Firmicutes (70). At the time of writing, the largest complete bacterial<br />

genome sequenced is that of Burkholderia xenovorans, which<br />

consists of 9,703,676 bp within two chromosomes, and the smallest<br />

is that of M. genitalium, at 580,074 bp<br />

tRNAs, codon usage and amino acid<br />

As mentioned above, the 200 bp upstream of translation<br />

start sites is more AT rich, on average, than the 200 bp<br />

downstream. However, if the unsmoothed data is examined<br />

(the grey lines in Fig. 6, panel a), there is much “noise” in<br />

the coding sequence, compared to the upstream, noncoding<br />

DNA. This is due to bias in codon usage, as shown in<br />

Fig. 6, panel b. The genome of a given organism will tend<br />

to show a preference towards certain codons, which can be<br />

seen as a bias in the third codon position (Fig. 6, panel c).<br />

Finally, these codon biases are also in part affected by<br />

which amino acids an organism uses, as shown in panel d<br />

of Fig. 6. The amino acid usage of different E. coli<br />

proteomes differs: for example, E. coli K-12 shows the same<br />

amino acid usage as Salmonella enterica LT2, while the<br />

usage in E. coli O157 resembles that of Shigella flexneri.<br />

Thus, two different E. coli genomes can have quite<br />

different amino acid usage (which might not be that<br />

surprising in view of the differences between strains of this<br />

species, see Table 1).<br />

BLAST atlases<br />

The GenomeAtlas is a method to visualise structural<br />

features of an entire bacterial genome sequence as one plot.<br />

The plots are created using the “GeneWiz” programme,<br />

developed at CBS (Pedersen et al. 2000). A more recent<br />

extension of this method is the development of the<br />

“genome BLAST atlas”, in which genes from different<br />

genomes are BLASTed against a reference genome and<br />

visualised using an atlas plot. BLAST atlases can provide<br />

additional contextual information about regions which<br />

contain few conserved genes. For example, a new genome<br />

might have a few small islands of unique proteins, and<br />

these regions might be more AT rich or might be expected<br />

to be highly expressed, based on chromosomal<br />

structural information also provided in the plots. As<br />

mentioned above, when the 20 sequenced E. coli genomes<br />

in Table 1 are compared, an enormous amount of diversity<br />

is found. A BLAST atlas for E. coli O157 is shown in Fig. 7a.<br />

Several regions of the chromosome have “holes” representing<br />

large segments of missing genes in some organisms,<br />

compared to the reference genome. In a sense, this<br />

information is somewhat similar to that obtained by the<br />

ACT plots mentioned above, although now the comparisons<br />

are being made at the level of presence/absence of<br />

clusters of proteins. In Fig. 7b, some of the regions<br />

containing gaps are more AT rich, some contain repeats and<br />

a few (marked) contain genes that might be highly<br />

expressed, based on chromatin properties. Thus, this tool<br />

can give a quick overview of the comparison of many<br />

genomes.<br />
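The “holes” described above can be located programmatically once each reference gene has been marked as having, or lacking, a BLAST hit in a compared genome. The Python sketch below is an illustration only and not the atlas software itself: the function name find_gaps, the boolean input list and the min_run cut-off are assumptions introduced here.<br />

```python
def find_gaps(hits, min_run=3):
    """Return (start, end) index pairs, inclusive, for runs of missing genes.

    hits    -- list of booleans in reference gene order; True means the
               reference gene has a BLAST hit in the compared genome
    min_run -- hypothetical cut-off: shortest run of misses reported as a gap
    """
    gaps = []
    start = None
    for i, hit in enumerate(hits):
        if not hit and start is None:
            start = i                      # a run of missing genes begins
        elif hit and start is not None:
            if i - start >= min_run:       # run ended; keep it if long enough
                gaps.append((start, i - 1))
            start = None
    # Handle a run that extends to the last gene of the reference genome.
    if start is not None and len(hits) - start >= min_run:
        gaps.append((start, len(hits) - 1))
    return gaps


# Toy example: 11 reference genes, two long runs without hits.
hits = [True, False, False, False, True, False,
        True, False, False, False, False]
print(find_gaps(hits))
```

Each reported pair gives the first and last reference gene of a candidate island; short runs (here a single missing gene) are ignored.<br />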

In Fig. 7a, the gaps correspond to regions of missing<br />

genes in the E. coli O157 genome. Similar patterns can be<br />


seen for many other bacterial genomes. For example, in<br />

Fig. 7b, there are four large gaps in the C. jejuni RM1221<br />

genome compared to other epsilon Proteobacteria. These<br />

correspond to phage insertion sites in C. jejuni RM1221, as<br />

described in the original genome sequence publication<br />

(Fouts et al. 2005). Similar results have been observed for<br />

Streptococcus (Hallin et al. 2004a). In all three of these<br />

cases, there are large regions which contain many genes<br />

which are missing in other genomes of the same species.<br />

These clusters of genes often contain evidence that they<br />

came from phages, which appears to be an efficient method<br />

of bringing new DNA into a genome.<br />

Fig. 6 Genomic properties of Streptomyces coelicolor A3. a Comparison<br />

of AT content upstream and downstream of all 7,825 genes; the<br />

genes are all oriented in the same direction and aligned such that the<br />

translation start site is in the middle. Z-scores of standard deviations<br />

from the chromosomal average are plotted, as described previously<br />

(Ussery and Hallin 2004a). b Codon usage of the same set of 7,825<br />

genes. The frequency of occurrence of each of the 64 codons is plotted<br />

in a star plot; note that most codons have a relatively low frequency of<br />

usage. c Biases in codon position are plotted as frequencies; note that<br />

there is a strong tendency for Cs and Gs in the third position. d Amino acid<br />

usage of each of the 20 amino acids for the entire S. coelicolor<br />

proteome is plotted as frequency of the total; the amino acids in this plot<br />

are grouped according to their properties; for example, all the aliphatic<br />

amino acids (A, V, L, I and G) are together and there is a<br />

general trend for this proteome to favour aliphatic amino acids, with the<br />

exception of isoleucine. The three star plots are as described previously<br />

(Ussery et al. 2004c)<br />
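The codon statistics summarised in Fig. 6 (codon frequencies and the bias at each codon position) are simple to compute from a set of coding sequences. The Python sketch below is a minimal illustration under stated assumptions: the input sequences are in-frame DNA strings, and the function names codon_usage and position_gc as well as the two toy genes are hypothetical and not part of the tools described in this review.<br />

```python
from collections import Counter


def codon_usage(cds_list):
    """Count codons over a list of in-frame coding sequences (DNA strings)."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        # Step through one codon (3 bases) at a time, ignoring any
        # trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def position_gc(cds_list):
    """Fraction of G or C at codon positions 1, 2 and 3."""
    gc = [0, 0, 0]
    total = 0
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            for pos in range(3):
                if seq[i + pos] in "GC":
                    gc[pos] += 1
            total += 1
    return [g / total for g in gc] if total else [0.0, 0.0, 0.0]


# Two toy genes (assumed data); every third position is G or C.
genes = ["ATGGCCGGC", "ATGCTGACC"]
print(codon_usage(genes).most_common(3))
print(position_gc(genes))  # [0.5, 0.5, 1.0]
```

A high value at the third position, as in this toy input, corresponds to the preference for Cs and Gs noted for S. coelicolor in Fig. 6, panel c.<br />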




Fig. 7 Genome BLAST atlases. The outer circles represent BLAST<br />

hits of a given genome (named in the legend) to the reference<br />

genome (named in the center of the atlas). The colours are scaled<br />

such that good BLAST hits (E=10<sup>−40</sup>) are darkly shaded, whilst<br />

regions containing no hits are shown in light grey, as described<br />

previously (Hallin et al. 2004a). a Genome BLAST atlas of E. coli<br />

O157 EDL933 vs four other sequenced E. coli strains (the four<br />

outermost circles; the genomes are, going from the outermost<br />

towards the center, E. coli K-12 MG1655, E. coli K-12 W3110, E.<br />

coli CFT1076 and E. coli O157 RIMD0509952). b Genome BLAST<br />

atlas of C. jejuni vs other epsilon Proteobacteria<br />

BLAST matrices<br />

Figure 7a,b illustrates the use of BLAST atlases to compare<br />

genome sequences. However, with several hundred<br />

genomes available, there is a need for a faster way of<br />

getting an overview of genome similarity. One method is<br />

the use of reciprocal hits, that is, to BLAST all the<br />

proteins encoded in a genome of interest against those in<br />

another genome (Binnewies et al. 2004). First, the genomes<br />

of interest are selected (e.g. all genomes of Proteobacteria),<br />

then a BLAST matrix can be displayed from this selection.<br />

The results are pre-generated and the system keeps track of<br />

sequence updates by generating MD5 checksums of all<br />

sequences and the combinations in which they have been<br />

BLASTed. The MD5 checksum (also termed a message digest) will<br />

produce a 32-digit hexadecimal string that is, for practical purposes,<br />

unique to an input string, e.g. a genomic sequence. The system<br />

maintains an all-against-all BLAST database, updating only the missing<br />

comparisons; that is, changing the sequence of a record or<br />

inserting a new record will cause a BLAST run of the<br />

sequence against all the existing sequences of the database.<br />

By having multiple genomes in a given selection, an all-against-all<br />

BLAST matrix can be presented showing the<br />

percentage of genes that are shared between sequences,<br />

both on a protein and on a nucleotide level. Each such<br />

percentage is supplied with a link to give a full listing from<br />

the BLAST report. Fig. 8 shows an example of such a<br />

BLAST matrix, with the diagonal (in red) reflecting the<br />

internal homologues of a given genome. The boxes are<br />

colour-coded such that the intensity represents the fraction<br />

of hits (Binnewies et al. 2004).<br />

Fig. 8 The BLAST table shows the overall protein homology<br />

between all combinations of the five available Vibrio sequences.<br />

Only hits containing at least 80% of the length of the gene<br />

and with an E-value of 1×10 or better are counted. The diagonal<br />

(red/pink) indicates the fraction of proteins that have homologous<br />

hits within the proteome itself; the fraction is similar in<br />

all genomes, and the intensity is shown by the red colour, scaled<br />

from ~24% (grey) to ~27% (red). Note that the largest genome<br />

also has the highest fraction of internal homologs. The<br />

green area for the rest of the table, on each side of the<br />

diagonal, shows the number of proteins that have homologous<br />

hits between different Vibrio genomes. As before, the<br />

fraction is indicated by the intensity of the colour (green)<br />

scaled from ~57% (grey) to ~83% (green). In general, it is clear<br />

that these organisms share a high percentage of their genes<br />

with the other Vibrio species, which should be expected<br />

because they are from the same genus<br />
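The checksum bookkeeping described for the BLAST matrix can be sketched as follows. This is a minimal illustration and not the actual system: the function names md5_of, missing_comparisons and record_done, the in-memory set of finished digest pairs, and the toy sequences are all assumptions made here; only the use of MD5 digests to detect changed sequences comes from the text.<br />

```python
import hashlib
from itertools import product


def md5_of(sequence):
    """Return the 32-character hexadecimal MD5 digest of a sequence."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def missing_comparisons(sequences, done):
    """List (query, subject) name pairs that still need a BLAST run.

    sequences -- dict mapping a record name to its sequence
    done      -- set of (query_md5, subject_md5) pairs already BLASTed
    """
    digests = {name: md5_of(seq) for name, seq in sequences.items()}
    # All ordered pairs, including self-comparisons (the matrix diagonal).
    return [(q, s) for q, s in product(digests, repeat=2)
            if (digests[q], digests[s]) not in done]


def record_done(sequences, pairs, done):
    """Mark pairs as finished, keyed by the digests of their sequences."""
    for q, s in pairs:
        done.add((md5_of(sequences[q]), md5_of(sequences[s])))


sequences = {"genomeA": "ATGCATGC", "genomeB": "GGCCGGCC"}
done = set()
first = missing_comparisons(sequences, done)   # everything is missing at first
record_done(sequences, first, done)

sequences["genomeB"] = "GGCCGGCA"              # a sequence update
second = missing_comparisons(sequences, done)  # only pairs involving genomeB
print(second)
```

Because the pairs are keyed by digest rather than by record name, an edited sequence automatically counts as new, while unchanged records are never re-run.<br />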

Meta-genomics: comparison of all the genomes in an ecosystem<br />

The term “metagenomics” is used for genome sequencing<br />

projects in which many organisms are sequenced at once<br />

by shotgun cloning of all DNA present in a sample<br />

(Handelsman 2004). This enables microbial ecosystems<br />

containing microbes that are not (presently) culturable in<br />

pure form to be investigated (Handelsman 2004). The<br />

reasons why organisms remain uncultured can be practical<br />

(e.g. thermophilic bacteria grow at a temperature above the<br />

melting point of agar), physiological (e.g. extremophiles<br />

that grow in pure culture can have very different properties<br />

from those observed in their true environment) or biological<br />

(symbiotic life forms cannot be cultured in microbiologically<br />

pure form). The first genome sequence obtained<br />

from a non-culturable bacterium was indeed that of<br />

Buchnera aphidicola, a symbiont of aphids. This sequence<br />

was not obtained by meta-genomics at the total genome<br />

DNA level but rather at the rRNA level. Cell counts<br />

compared to plate counts showed that the latter can be<br />

wrong by orders of magnitude: many viable bacteria refuse to<br />

grow on solid culture medium. The isolation of bulk RNA<br />

and the subsequent determination of rRNA sequences<br />

using specific primers allowed qualitative analysis to be<br />

performed for identifying novel bacterial species or<br />

ribotypes present in an ecosystem (Olsen et al. 1986).<br />

The application of PCR improved the sensitivity of such<br />

approaches but the limitation to rRNA sequences confined<br />

analyses to phylogenetic information only and little further<br />

knowledge was obtained about the new species. Metagenomics<br />

can be used to generate complete or fragmented<br />

genome sequences of organisms that might be abundant in<br />

nature but are not easily culturable.<br />

The acid mine drainage sequencing project has shown<br />

the potential of meta-genomics (Tyson et al. 2004). The<br />

mine water of the Richmond mine is covered with a biofilm<br />

of bacteria despite its hostile environment: an extremely acidic<br />

pH (between 0 and 1), high concentrations of metal ions,<br />

including copper, zinc and arsenic, and the absence of<br />

carbon or nitrogen sources (other than from air). The<br />

biofilm was composed of relatively few organisms,<br />

enabling the sequencing of shotgun-cloned DNA and the<br />

sorting of fragments according to their G + C content into<br />

nearly complete bacterial genomes. A dominant bacterial<br />

genus was identified, Leptospirillum, and a less abundant<br />

Sulfobacillus spp and some Archaea were also present. The<br />

findings greatly improved understanding of this ecosystem.<br />

The predominant bacteria were responsible for nitrogen<br />

and carbon fixation (Leptospirillum group III), whereas<br />

several species were able to generate energy from iron<br />

oxidation (Ferroplasma and Leptospirillum spp). Because, in this<br />

approach, each sequenced DNA fragment is obtained from<br />

a different individual (whereas in classical genome<br />

sequencing all DNA is obtained from one clone),<br />

information on polymorphisms also becomes available.<br />

As more complex ecosystems are studied, the puzzle of<br />

genome assembly becomes more difficult due to the<br />

presence of more species, genomic rearrangements and<br />

horizontal gene transfer events.<br />
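Sorting fragments by their G + C content, as described for the acid mine drainage data, can be illustrated with a few lines of Python. This is a simplified sketch and not the procedure used by Tyson et al.: the function names gc_content and bin_by_gc, the single 0.5 cut-off and the toy fragments are assumptions chosen for illustration.<br />

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def bin_by_gc(fragments, boundaries):
    """Assign fragments to bins delimited by sorted G + C cut-offs.

    fragments  -- dict mapping a fragment name to its sequence
    boundaries -- sorted list of cut-offs (hypothetical values);
                  n cut-offs give n + 1 bins, numbered from low to high G + C
    """
    bins = {i: [] for i in range(len(boundaries) + 1)}
    for name, seq in fragments.items():
        gc = gc_content(seq)
        # The bin index is the number of cut-offs at or below the value.
        index = sum(gc >= b for b in boundaries)
        bins[index].append(name)
    return bins


fragments = {"frag1": "ATATATGC", "frag2": "GCGCGCAT"}
print(bin_by_gc(fragments, [0.5]))  # {0: ['frag1'], 1: ['frag2']}
```

Such a split works only when the community members differ clearly in base composition, which is why it suited a biofilm composed of relatively few organisms.<br />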

The largest attempt so far at metagenomics was initiated<br />

by C. Venter to sequence the microbial ecosystem in the<br />

Sargasso Sea (Venter et al. 2004). Seawater was sampled<br />

by filtering to specifically recover bacterial (and not viral or<br />

amoebal) DNA. Over 1 billion base pairs of sequence were<br />

generated, which was attributed to at least 1,800 species.<br />

As the abundance of individual species determines their<br />

coverage in shotgun cloning, this coverage (or rather the<br />

mean of their Poisson distribution) was used to sort out<br />

DNA scaffolds (a scaffold is a reconstructed genomic<br />

region), and oligonucleotide frequencies were used to<br />

refine this sorting. Although the complexity of the<br />

investigated ecosystem did not allow complete assembly<br />

of individual genomes, the scaffolds belonging to the most<br />

abundant species could be attributed to Burkholderia and<br />

Shewanella-like species. As with the acid mine drainage<br />

project, polymorphisms were detected with varying<br />

frequencies. In fact, the dataset ranged from organisms<br />

belonging to a single species and clonal (few polymorphisms)<br />

to a population continuum in which some clonal<br />

complexes could be recognised. These observations<br />

illustrate the ‘unnatural’ approach of studying only pure<br />

bacterial cultures that have a strict clonal structure in<br />

contrast to natural environments where the population<br />

structure is much more fluid and the concept of clones or<br />

species is more elusive. The most impressive output of the<br />

Sargasso Sea study is the number of individual genes that<br />

were identified (69,901). Among the surprising findings<br />

was that rhodopsin (a light-driven proton pump that lets<br />

bacteria harvest light energy) was abundant outside the proteobacteria<br />

where it had previously been identified. The finding of<br />

many genes involved in phosphate uptake and utilisation of<br />

poly- and pyrophosphates is puzzling, as the marine<br />

environment is extremely phosphate-limited.<br />
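The two signals used to sort the Sargasso Sea scaffolds, mean coverage and oligonucleotide frequencies, can each be reduced to a simple calculation. The Python sketch below is only a schematic illustration and not the method of Venter et al.: the choice of tetranucleotides (k = 4), the Euclidean distance and the function names kmer_profile, profile_distance and mean_coverage are assumptions introduced here.<br />

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Relative frequency of every k-mer over A, C, G, T in a sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases are not in the list and are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_coverage(read_lengths, scaffold_length):
    """Average number of reads covering each base of a scaffold."""
    return sum(read_lengths) / scaffold_length


scaffold_a = "ACGTACGTACGTACGT"
scaffold_b = "AAAAAAAATTTTTTTT"
print(profile_distance(kmer_profile(scaffold_a), kmer_profile(scaffold_b)))
print(mean_coverage([500, 700, 800], 1000))  # 2.0
```

Scaffolds with similar coverage and a small profile distance are candidates for belonging to the same organism; real data would require longer sequences than these toy strings for the frequencies to be meaningful.<br />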

The challenge to analyse the complex communities of a<br />

nutrient-rich environment was taken up by Tringe and<br />

Rubin (2005). One sample that was analysed was derived<br />

from agricultural soil and three were from marine whale<br />

carcasses. First, rRNA libraries were generated by PCR to<br />

investigate the microbial diversity. The soil sample (DNA<br />

obtained from 5 g of surface clay loam from land that had<br />

been used for livestock) was extremely rich in species, with<br />

at least 847 ribotypes detected representing over 12 phyla.<br />

The whale samples (two bone parts and one biofilm<br />

covering a whale carcass) were less diverse but still<br />

contained between 25 and 150 ribotypes. Although the<br />

assembly of sequences obtained from shotgun libraries was<br />

not possible, the genes that were identified on the<br />

sequenced library clones demonstrated that approximately<br />

half of the predicted proteins had similarities (homologs)<br />

in existing gene databases. Plotting the number of novel<br />

gene families against the amount of generated sequences<br />

suggested that, for the soil sample, few novel orthologues<br />

were found after sequencing 25 Mbp. The functions of<br />

predicted proteins from the sequences were naturally<br />

diverse, but for the soil sample, potassium channelling<br />

systems were overrepresented, whereas for the whale<br />

samples sodium ion exporters were abundant, which fits<br />

with the abundance of these two ions in the two<br />

environments, respectively.<br />

Metagenomic analyses will continue to expand the databases,<br />

with the interpretation and assembly of<br />

raw data becoming more complete. The human gastrointestinal<br />

tract, for example, is the target of a metagenomics<br />

sequencing project (Mongodin et al. 2005). It is apparent<br />

that each individual carries a large variety of microflora,<br />

probably acquired early in life (and which may have health<br />

consequences even though these organisms are not pathogenic)<br />

as well as bacterial microheterogeneity that was not<br />

recognised previously. Against the common belief that<br />

Firmicutes and Bacteroides would be the most abundant<br />

microbes present in the human gut, it appears that<br />

Actinobacteria and Archaea may be more prominent<br />

(Mongodin et al. 2005). The intestinal microflora of<br />

obese mice differs considerably from that of lean animals,<br />

an observation in support of the view that the microbiota of<br />

mammals are good indicators (be it cause or effect) of their<br />

health status (Ley et al. 2005). There are clearly many<br />

microbial communities to be analysed and compared using<br />

metagenomics.<br />

Application: computational vaccine development<br />

Vaccines remain an extremely important tool for controlling<br />

infectious diseases of humans and animals, although<br />

they are only available for about 10% of the microorganisms<br />

known to be harmful to humans (Lund et al. 2005).<br />

Traditional vaccines typically have incorporated whole live<br />

attenuated or killed microorganisms, but, particularly for<br />

use in humans, such vaccines now have limited application<br />

due to concerns about safety, efficacy and/or ease of<br />

production. Much recent work, therefore, has focused on<br />

developing vaccines composed of prominent immunogenic<br />

parts of microorganisms (subunit vaccines) or genes<br />

encoding these components (genetic vaccines, Ellis<br />

1999). For bacterial vaccine discovery, these newer<br />

approaches have been greatly assisted by the recent<br />

availability of whole genomic sequence data, which has<br />

allowed a new approach to vaccine development called<br />

“reverse vaccinology” (Rappuoli 2001).<br />

In reverse vaccinology, bioinformatics tools are used to<br />

undertake comprehensive in silico screening of genomic<br />

sequence to identify genes encoding proteins that have<br />

desirable characteristics. The power of this process has<br />

increased as more and more genomic sequences that<br />

encode proteins of known function become available in the<br />

databases for comparative analysis. Targets for consideration<br />

for use in vaccines include genes encoding outer<br />

membrane proteins or lipoproteins, transmembrane domains<br />

or export signal peptides, and proteins with<br />

homologies to bacterial factors already known to be<br />

involved in virulence or pathogenicity. Surface-exposed<br />

or secreted proteins as well as virulence factors such as<br />

toxins or adhesive factors are likely to induce an immune<br />

response that may be protective (Zagursky and Russell<br />

2001). In this way, large numbers of potential vaccine<br />

components can be identified from a whole (or partial)<br />

genome sequence. This approach was first taken for the<br />

human pathogen Neisseria meningitidis serogroup B, with<br />

600 open reading frames (ORFs) of potential interest<br />

initially being identified (Pizza et al. 2000). Recombinant<br />

proteins from 350 ORFs were eventually produced and,<br />

after screening for distribution in different serotypes,<br />

stability, immunogenicity and cross-protection, 15 were<br />

selected as potential subunit vaccine candidates. This same<br />

approach to vaccine discovery is now being taken for a<br />

number of important human and animal pathogens (Serruto<br />

et al. 2004). Reverse vaccinology allows rapid identification<br />

of a large number of potential subunit vaccine<br />

candidates, many of which would not have been recognised<br />

by more traditional approaches. It is complemented by the<br />

use of microarrays to analyse gene expression and of<br />

proteomic approaches to study protein expression and<br />

distribution and can be focused further by the use of<br />

computer algorithms that scan and identify sequences<br />

encoding specific epitopes involved in immunogenicity<br />

(reviewed in Lund et al. 2002; see also, for a review,<br />

Theoretical Biology and Biophysics Group, Los Alamos<br />

National Laboratory<br />

[http://www.hiv.lanl.gov/content/immunology/pdf/2002/1/Lund2002.pdf]).<br />

These algorithms have been strengthened by the availability of full<br />

genomic sequences for many pathogens.<br />
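The in silico screening step of reverse vaccinology amounts to filtering annotated ORFs on predicted properties. The Python sketch below is purely illustrative: the OrfAnnotation record, its field names and the three example ORFs are hypothetical, and the predictions themselves (signal peptides, transmembrane domains, homology to virulence factors) are assumed to have been produced beforehand by separate prediction tools.<br />

```python
from dataclasses import dataclass


@dataclass
class OrfAnnotation:
    """Hypothetical record holding predictions for one open reading frame."""
    orf_id: str
    signal_peptide: bool = False       # predicted export signal peptide
    lipoprotein: bool = False          # predicted lipoprotein
    outer_membrane: bool = False       # predicted outer membrane protein
    transmembrane_domains: int = 0     # number of predicted TM domains
    virulence_homolog: bool = False    # homology to a known virulence factor


def is_candidate(orf):
    """True if the ORF matches any of the target classes named in the text."""
    surface_or_secreted = (
        orf.signal_peptide
        or orf.lipoprotein
        or orf.outer_membrane
        or orf.transmembrane_domains > 0
    )
    return surface_or_secreted or orf.virulence_homolog


def select_candidates(orfs):
    """Return the identifiers of all ORFs that pass the screen."""
    return [orf.orf_id for orf in orfs if is_candidate(orf)]


orfs = [
    OrfAnnotation("orf0001", signal_peptide=True),
    OrfAnnotation("orf0002"),                        # cytoplasmic, no homology
    OrfAnnotation("orf0003", virulence_homolog=True),
]
print(select_candidates(orfs))  # ['orf0001', 'orf0003']
```

Such a filter only produces a shortlist; as described above, candidates still have to be expressed and tested experimentally for distribution, stability, immunogenicity and cross-protection.<br />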

Prediction methods for the three main types of epitopes, those<br />

targeting B cells, helper T lymphocytes and cytotoxic T lymphocytes,<br />

have been made available, and improved methods are constantly<br />

being developed. Thus, it is possible to take a genome<br />

sequence, use some predictors as described above and<br />

select potential peptide sequences for construction of<br />

vaccines. These vaccines can be either chemically<br />

synthesised peptide based or DNA based. With regards to<br />

peptides, these can be used directly or used to construct a<br />

“polytope”, which is a composite protein made from<br />

individual epitopes.<br />

Intellectual property rights: who owns the genome<br />

sequence?<br />


This review started by giving the US patent numbers for the<br />

first two genomes sequenced. This final section will briefly<br />

discuss some of the issues facing researchers working with<br />

genomic data. At the time of writing, ten whole genome<br />

patents have been granted, with more patents being applied<br />

for (O’Malley et al. 2005). Some of these patents include<br />

the use of the sequence in silico and clearly raise a number<br />

of issues related to freedom to operate in research. In<br />

addition, the enforcement of the patents could be difficult,<br />

with many bioinformatic tools being developed in the<br />

public domain.<br />

Another related difficulty has to do with using or<br />

analysing genome sequences before they are presented in<br />

scientific publications. Now that it is possible to sequence a<br />

bacterial genome in an afternoon and have a GenBank file a<br />

day or two later, the time gap between having the sequence<br />

publicly available and having the paper in print can be<br />

several years. Some public granting agencies have pushed<br />

hard for the data to be made available as soon as possible<br />

for people to search for their particular gene of interest. On<br />

the other hand, it is also understandable that the individuals<br />

who have actually sequenced the genomes need some lead<br />

time to analyse their data. With high-throughput bioinformatic<br />

techniques, it is possible, for example, for some<br />

groups to do in a few days what would take other groups<br />

months (or years) to complete.<br />



A final problem has to do with obtaining basic<br />

information about the strain used for sequencing a genome.<br />

For example, what was the strain isolated from? What was<br />

the growth temperature or culture medium pH for the<br />

culture that the genomic DNA was derived from? What is<br />

the doubling time of this organism under these conditions?<br />

These are all important pieces of data, but they are often<br />

missing in genome publications. A recent “minimal<br />

information about a genome sequence” standard has been<br />

proposed (Field and Hughes 2005), which is in the same<br />

spirit as the MIAME standard for microarray experiments. 3<br />

In the future, it could well be that something resembling a<br />

GenBank file with additional biological information will be<br />

the “publication” for a bacterial genome sequence, as<br />

genome sequencing becomes ever cheaper and easier to<br />

perform. Overall, it is important that genome sequence<br />

information is released into the public domain in a timely<br />

manner so that global scientific progress can be maintained.<br />

Acknowledgements DWU, PFH and TTB are supported by grants<br />

from the Danish Research Foundation. We are grateful to the Sanger<br />

Center for allowing prepublication access to the sequences for the E.<br />

coli 042 genome (the DNA sequence and annotation files were<br />

downloaded from the Sanger web site http://www.sanger.ac.uk/).<br />

References<br />

Abbott JC, Aanensen DM, Rutherford K, Butcher S, Spratt BG<br />

(2005) WebACT—an onl<strong>in</strong>e companion for the Artemis<br />

Comparison Tool. Bio<strong>in</strong>formatics 21(18):3665–3666<br />

Ac<strong>in</strong>as SG, Marcel<strong>in</strong>o LA, Klepac-Ceraj V, Polz MF (2004)<br />

Divergence <strong>and</strong> redundancy of 16S rRNA sequences <strong>in</strong> genomes<br />

with multiple rrn operons. J Bacteriol 186(9):2629–2635<br />

Ala<strong>in</strong> K, Querellou J, Lesongeur F, Pignet P, Crassous P, Raguenes G,<br />

Cueff V, Cambon-Bonavita M-A (2002) Caminibacter hydrogeniphilus<br />

gen. nov., sp. nov., a novel thermophilic, hydrogen-oxidizing<br />

bacterium isolated from an East Pacific Rise<br />

hydrothermal vent. Int J Syst Evol Microbiol 52:1317–1323<br />

Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL,<br />

Ark<strong>in</strong> AP (2005) The MicrobesOnl<strong>in</strong>e Web site for comparative<br />

genomics. Genome Res 15(7):1015–1022<br />

Alm RA, Trust TJ (1999) Analysis of the genetic diversity of<br />

Helicobacter pylori: the tale of two genomes. J Mol Med 77<br />

(12):834–846 (Review)<br />

Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI (2005)<br />

Host–bacterial mutualism <strong>in</strong> the human <strong>in</strong>test<strong>in</strong>e. Science 307<br />

(5717):1915–1920<br />

Bendtsen JD, B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Sicheritz-Ponten T, Ussery<br />

DW (2005a) Genome update: prediction of secreted prote<strong>in</strong>s <strong>in</strong><br />

225 bacterial proteomes. Microbiology 151(Pt 6):1725–1727<br />

Bendtsen JD, B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Ussery DW (2005b)<br />

Genome update: prediction of membrane prote<strong>in</strong>s <strong>in</strong> prokaryotic<br />

genomes. Microbiology 151(Pt 7):2119–2121<br />

B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Staerfeldt HH, Ussery DW (2004) Genome<br />

update: proteome comparisons. Microbiology 151(Pt 1):1–4<br />

Burrus V, Waldor MK (2004) Shap<strong>in</strong>g bacterial genomes with<br />

<strong>in</strong>tegrative <strong>and</strong> conjugative elements. Res Microbiol 155<br />

(5):376–386<br />

Carattoli A (2001) Importance of <strong>in</strong>tegrons <strong>in</strong> the diffusion of<br />

resistance. Vet Res 32(3–4):243–259<br />

Carver TJ, Rutherford KM, Berriman M, Raj<strong>and</strong>ream MA, Barrell<br />

BG, Parkhill J (2005) ACT: the Artemis Comparison Tool.<br />

Bio<strong>in</strong>formatics 21(16):3422–3423<br />

3 http://www.ucl.ac.uk/wibr/services/docs/miamiv1.doc<br />

Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ,<br />

Blyn LB (2002) A bio<strong>in</strong>formatics based approach to discover<br />

small RNA genes <strong>in</strong> the Escherichia coli genome. Biosystems<br />

65(2–3):157–177<br />

Dobr<strong>in</strong>dt U, Hacker J (2001) Whole genome plasticity <strong>in</strong> pathogenic<br />

bacteria. Curr Opin Microbiol 4(5):550–557<br />

Dobr<strong>in</strong>dt U, Hochhut B, Hentschel U, Hacker J (2004) Genomic<br />

isl<strong>and</strong>s <strong>in</strong> pathogenic <strong>and</strong> environmental microorganisms. Nat<br />

Rev Microbiol (2):414–424<br />

Doolittle WF (1999a) Lateral genomics. Trends Cell Biol 9(12):<br />

M5–M8<br />

Doolittle WF (1999b) Phylogenetic classification <strong>and</strong> the universal<br />

tree. Science 284(5423):2124–2129<br />

Dufraigne C, Fertil B, Lesp<strong>in</strong>ats S, Giron A, Deschavanne P (2005)<br />

Detection <strong>and</strong> characterisation of horizontal transfers <strong>in</strong><br />

prokaryotes using genomic signature. Nucleic Acids Res 33<br />

(1):e6<br />

Duponnois R, Ba AM, Mateille T (1999) Beneficial effects of<br />

Enterobacter cloacae <strong>and</strong> Pseudomonas mendoc<strong>in</strong>a for biocontrol<br />

of Meloidogyne <strong>in</strong>cognita with the endospore-form<strong>in</strong>g<br />

bacterium Pasteuria penetrans. Nematology 1(1):95–101<br />

Ellis RW (1999) New technologies for mak<strong>in</strong>g vacc<strong>in</strong>es. Vacc<strong>in</strong>e 17<br />

(13–14):1596–1604<br />

Falkow S (1975) Infectious multiple drug resistance. Pion Limited,<br />

London, Engl<strong>and</strong><br />

Fani R, Brilli M, Lio P (2005) The orig<strong>in</strong> <strong>and</strong> evolution of operons:<br />

the piecewise build<strong>in</strong>g of the proteobacterial histid<strong>in</strong>e operon.<br />

J Mol Evol 60(3):378–390<br />

Field D, Hughes J (2005) Catalogu<strong>in</strong>g our current genome<br />

collection. Microbiology 151(Pt 4):1016–1019<br />

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF,<br />

Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM,<br />

McKenney K, Sutton G, FitzHugh W, Fields C, Gocayne JD, Scott<br />

J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips<br />

CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna<br />

MC, Nguyen DT, Saudek DM, Br<strong>and</strong>on RC, F<strong>in</strong>e LD, Fritchman<br />

JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA,<br />

Small KV, Fraser CM, Smith HO, Venter JC (1995) Whole-genome<br />

random sequencing and assembly of Haemophilus<br />

influenzae Rd. Science 269(5223):496–498, 507–512<br />

Fluit AC, Schmitz F-J (2004) Resistance <strong>in</strong>tegrons <strong>and</strong> super<strong>in</strong>tegrons.<br />

Cl<strong>in</strong> Microbiol Infect 10:272–288<br />

Fouts DE, Mongod<strong>in</strong> EF, M<strong>and</strong>rell RE, Miller WG, Rasko DA,<br />

Ravel J, Br<strong>in</strong>kac LM, DeBoy RT, Parker CT, Daugherty SC,<br />

Dodson RJ, Durk<strong>in</strong> AS, Madupu R, Sullivan SA, Shetty JU,<br />

Ayodeji MA, Shvartsbeyn A, Schatz MC, Badger JH, Fraser<br />

CM, Nelson KE (2005) Major structural differences <strong>and</strong> novel<br />

potential virulence mechanisms from the genomes of multiple<br />

campylobacter species. PLoS Biol 3(1):e15<br />

Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA,<br />

Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM,<br />

Fritchman RD, Weidman JF, Small KV, S<strong>and</strong>usky M,<br />

Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips<br />

CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC,<br />

Lucier TS, Peterson SN, Smith HO, Hutchison CA 3rd, Venter<br />

JC (1995) The m<strong>in</strong>imal gene complement of Mycoplasma<br />

genitalium. Science 270(5235):397–403<br />

Fraser-Liggett CM (2005) Insights on biology <strong>and</strong> evolution from<br />

microbial genome sequenc<strong>in</strong>g. Genome Res 15:1603–1610<br />

Galun E (2003) Transposable elements: a guide to the perplexed <strong>and</strong><br />

the novice. Kluwer Academic, Dordrecht, The Netherl<strong>and</strong>s, pp<br />

25–73<br />

Gil R, Latorre A, Moya A (2004) Bacterial endosymbionts of <strong>in</strong>sects:<br />

<strong>in</strong>sights from comparative genomics. Environ Microbiol 6<br />

(11):1109–1122<br />

Giovannoni SJ, Tripp HJ, Givan S, Podar M, Verg<strong>in</strong> KL, Baptista D,<br />

Bibbs L, Eads J, Richardson TH, Noordewier M, Rappe MS,<br />

Short JM, Carr<strong>in</strong>gton JC, Mathur EJ (2005) Genome<br />

streaml<strong>in</strong><strong>in</strong>g <strong>in</strong> a cosmopolitan oceanic bacterium. Science<br />

309(5738):1242–1245


Goebel W, Gross R (2001) Intracellular survival strategies of mutualistic<br />

and parasitic prokaryotes. Trends Microbiol 9(6):267–273<br />

Goldmann DA, Kl<strong>in</strong>ger JD (1986) Pseudomonas cepacia:<br />

biology, mechanisms of virulence, epidemiology. J Pediatr<br />

108(5 Pt 2):806–812<br />

Gottesman S (2005) Micros for microbes: non-cod<strong>in</strong>g regulatory<br />

RNAs <strong>in</strong> bacteria. Trends Genet 7:399–404<br />

Hall<strong>in</strong> PF, Ussery DW (2004) <strong>CBS</strong> genome atlas database: a dynamic<br />

storage for bio<strong>in</strong>formatic results <strong>and</strong> sequence data. Bio<strong>in</strong>formatics<br />

20(18):3682–3686<br />

Hall<strong>in</strong> PF, B<strong>in</strong>newies TT, Ussery DW (2004a) Genome update:<br />

chromosome atlases. Microbiology 150(Pt 10):3091–3093<br />

Hallin PF, Coenye T, Binnewies TT, Jarmer H, Staerfeldt HH, Ussery<br />

DW (2004b) Genome update: correlation of bacterial genomic<br />

properties. Microbiology 150(Pt 12):3899–3903<br />

H<strong>and</strong>elsman J (2004) Metagenomics: application of genomics to<br />

uncultured microorganisms. Microbiol Mol Biol Rev 68:669–685<br />

Harrison A, Dyer DW, Gillaspy A, Ray WC, Mungur R, Carson<br />

MB, Zhong H, Gipson J, Gipson M, Johnson LS, Lewis L,<br />

Bakaletz LO, Munson RS Jr (2005) Genomic sequence of an<br />

otitis media isolate of nontypeable Haemophilus <strong>in</strong>fluenzae:<br />

comparative study with H. <strong>in</strong>fluenzae serotype d, stra<strong>in</strong> KW20.<br />

J Bacteriol 187(13):4627–4636<br />

Hayashi T, Mak<strong>in</strong>o K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama<br />

K, Han CG, Ohtsubo E, Nakayama K, Murata T, Tanaka M,<br />

Tobe T, Iida T, Takami H, Honda T, Sasakawa C, Ogasawara N,<br />

Yasunaga T, Kuhara S, Shiba T, Hattori M, Sh<strong>in</strong>agawa H<br />

(2001) Complete genome sequence of enterohemorrhagic<br />

Escherichia coli O157:H7 <strong>and</strong> genomic comparison with a<br />

laboratory stra<strong>in</strong> K-12. DNA Res 8:11–22<br />

Holmes AJ, Gill<strong>in</strong>gs MR, Nield BS, Mabbutt BC, Nevala<strong>in</strong>en KM,<br />

Stokes HW (2003) The gene cassette metagenome is a basic<br />

resource for bacterial genome evolution. Environ Microbiol 5<br />

(5):383–394<br />

Horowitz NH (1945) On the evolution of biochemical synthesis.<br />

Proc Natl Acad Sci U S A 31:153–157<br />

Horowitz NH (1965) The evolution of biochemical synthesis—<br />

retrospect <strong>and</strong> prospect. In: Bryson V, Vogel HJ (eds) Evolv<strong>in</strong>g<br />

genes <strong>and</strong> prote<strong>in</strong>s. Academic, New York, pp 15–23<br />

Itoh T, Takemoto K, Mori H, Gojobori T (1999) Evolutionary<br />

<strong>in</strong>stability of operon structures disclosed by sequence comparisons<br />

of complete microbial genomes. Mol Biol Evol 3:332–346<br />

Jacob F, Monod J (1961) Genetic regulatory mechanisms <strong>in</strong> the<br />

synthesis of prote<strong>in</strong>s. J Mol Biol 3:318–356<br />

Jacob F, Perr<strong>in</strong> D, Sanchez C, Monod J (1960) Operon: a group of<br />

genes with the expression coord<strong>in</strong>ated by an operator. C R<br />

Hebd Seances Acad Sci 250:1727–1729<br />

Jaffe JD, Stange-Thomann N, Smith C, DeCaprio D, Fisher S,<br />

Butler J, Calvo S, Elk<strong>in</strong>s T, FitzGerald MG, Hafez N, Kodira<br />

CD, Major J, Wang S, Wilk<strong>in</strong>son J, Nicol R, Nusbaum C,<br />

Birren B, Berg HC, Church GM (2004) The complete genome<br />

<strong>and</strong> proteome of Mycoplasma mobile. Genome Res 14<br />

(8):1447–1461<br />

Janga SC, Collado-Vides J, Moreno-Hagelsieb G (2005) Nebulon: a<br />

system for the <strong>in</strong>ference of functional relationships of gene<br />

products from the rearrangement of predicted operons. Nucleic<br />

Acids Res 33(8):2521–2530<br />

Jores J, Rumer L, Wieler LH (2004) Impact of the locus of enterocyte<br />

effacement pathogenicity isl<strong>and</strong> on the evolution of pathogenic<br />

Escherichia coli. Int J Med Microbiol 294(2–3):103–113<br />

(Review)<br />

Juhala RJ, Ford ME, Duda RL, Youlton A, Hatfull GF, Hendrix RW<br />

(2000) Genomic sequences of bacteriophages HK97 <strong>and</strong><br />

HK022: pervasive genetic mosaicism <strong>in</strong> the lambdoid bacteriophages.<br />

J Mol Biol 299(1):27–51<br />

Kennedy SP, Ng WV, Salzberg SL, Hood L, DasSarma S (2001)<br />

Underst<strong>and</strong><strong>in</strong>g the adaptation of Halobacterium species NRC-1<br />

to its extreme environment through computational analysis of<br />

its genome sequence. Genome Res 11:1641–1650<br />

Kiil K, B<strong>in</strong>newies TT, Sicheritz-Ponten T, Willenbrock H, Hall<strong>in</strong> PF,<br />

Wassenaar TM, Ussery DW (2005a) Genome update: sigma factors<br />

<strong>in</strong> 240 bacterial genomes. Microbiology 151(Pt 10):3147–3150<br />

Kiil K, Ferchaud JB, David C, B<strong>in</strong>newies TT, Wu H, Sicheritz-<br />

Ponten T, Willenbrock H, Ussery DW (2005b) Genome update:<br />

distribution of two-component transduction systems <strong>in</strong> 250<br />

bacterial genomes. Microbiology 151(Pt 11):3447–3452<br />

Kong H, L<strong>in</strong> L-F, Porter N, Stickel S, Byrd D, Posfai J, Roberts RJ<br />

(2000) Functional analysis of putative restriction–modification<br />

system genes <strong>in</strong> the Helicobacter pylori J99 genome. Nucleic<br />

Acids Res 28:3216–3223<br />

Kummerfeld SK, Teichmann SA (2006) DBD: a transcription factor<br />

prediction database. Nucleic Acids Res 34(Database issue):<br />

D74–D81<br />

Kun<strong>in</strong> V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net<br />

of life: reconstruct<strong>in</strong>g the microbial phylogenetic network.<br />

Genome Res 15(7):954–959<br />

Kuwahara T, Yamashita A, Hirakawa H, Nakayama H, Toh H,<br />

Okada N, Kuhara S, Hattori M, Hayashi T, Ohnishi Y (2004)<br />

Genomic analysis of Bacteroides fragilis reveals extensive<br />

DNA <strong>in</strong>versions regulat<strong>in</strong>g cell surface adaptation. Proc Natl<br />

Acad Sci U S A 101(41):14919–14924<br />

Lawrence JG, Roth JR (1996) Selfish operons: horizontal transfer<br />

may drive the evolution of gene clusters. Genetics 143<br />

(4):1843–1860<br />

Lazcano A, Diaz-Villagomez E, Mills T, Oro J (1995) On the levels of<br />

enzymatic substrate specificity: implications for the early<br />

evolution of metabolic pathways. Adv Space Res 15(3):345–356<br />

Lewis M, Chang G, Horton NC, Kercher MA, Pace HC,<br />

Schumacher MA, Brennan RG, Lu P (1996) Crystal structure<br />

of the lactose operon repressor <strong>and</strong> its complexes with DNA<br />

<strong>and</strong> <strong>in</strong>ducer. Science 271(5253):1247–1254<br />

Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD,<br />

Gordon JI (2005) Obesity alters gut microbial ecology. Proc<br />

Natl Acad Sci U S A 102(31):11070–11075<br />

L<strong>in</strong> L-F, Posfai J, Roberts RJ, Kong H (2001) <strong>Comparative</strong><br />

genomics of the restriction–modification systems <strong>in</strong> Helicobacter<br />

pylori. Proc Natl Acad Sci U S A 98:2740–2745<br />

Lobner-Olesen A, Skovgaard O, Mar<strong>in</strong>us MG (2005) Dam methylation:<br />

coord<strong>in</strong>at<strong>in</strong>g cellular processes. Curr Op<strong>in</strong> Microbiol 8<br />

(2):154–160<br />

Lund O, Nielsen M, Kesmir C, Christensen JK, Lundegaard C,<br />

Worning P, Brunak S (2002) Web-based tools for vaccine<br />

design. In: Korber BT, Br<strong>and</strong>er C, Haynes BF, Koup R, Kuiken<br />

C, Moore JP, Walker BD, Watk<strong>in</strong>s D (eds) HIV molecular<br />

immunology. Los Alamos, NM, pp 45–51<br />

Lund O, Nielsen M, Lundegaard C, Kesmir C, Brunak S (2005)<br />

Immunological bio<strong>in</strong>formatics. MIT, Cambridge, Massachusetts<br />

Lupski JR, We<strong>in</strong>stock GM (1992) Short, <strong>in</strong>terspersed repetitive<br />

DNA sequences <strong>in</strong> prokaryotic genomes. J Bacteriol 174<br />

(14):4525–4529<br />

Maas R (2004) Prereplicative pur<strong>in</strong>e methylation <strong>and</strong> postreplicative<br />

demethylation <strong>in</strong> each DNA duplication of the Escherichia coli<br />

replication cycle. J Biol Chem 279(49):51568–51573<br />

Mahillon J, Leonard C, Ch<strong>and</strong>ler M (1999) IS elements as<br />

constituents of bacterial genomes. Res Microbiol 150:675–687<br />

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben<br />

LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du<br />

L, Fierro JM, Gomes XV, Godw<strong>in</strong> BC, He W, Helgesen S, Ho<br />

CH, Irzyk GP, J<strong>and</strong>o SC, Alenquer ML, Jarvie TP, Jirage KB,<br />

Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei<br />

M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE,<br />

McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R,<br />

Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson<br />

JW, Sr<strong>in</strong>ivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer<br />

GA, Wang SH, Wang Y, We<strong>in</strong>er MP, Yu P, Begley RF,<br />

Rothberg JM (2005) Genome sequenc<strong>in</strong>g <strong>in</strong> microfabricated<br />

high-density picolitre reactors. Nature 437(7057):376–380<br />

McCl<strong>in</strong>tock B (1950) The orig<strong>in</strong> <strong>and</strong> behavior of mutable loci <strong>in</strong><br />

maize. Proc Natl Acad Sci U S A 36(6):344–355<br />

McGillivary G, Tomaras AP, Rhodes ER, Actis LA (2005) Clon<strong>in</strong>g<br />

<strong>and</strong> sequenc<strong>in</strong>g of a genomic isl<strong>and</strong> found <strong>in</strong> the Brazilian<br />

purpuric fever clone of Haemophilus <strong>in</strong>fluenzae biogroup<br />

aegyptius. Infect Immun 73(4):1927–1938


Middendorf B, Hochhut B, Leipold K, Dobr<strong>in</strong>dt U, Blum-Oehler G,<br />

Hacker J (2004) Instability of pathogenicity isl<strong>and</strong>s <strong>in</strong><br />

uropathogenic Escherichia coli 536. J Bacteriol 186<br />

(10):3086–3096<br />

Mongod<strong>in</strong> EF, Emerson JB, Nelson KE (2005) Microbial metagenomics.<br />

Genome Biol 6(10):347<br />

Mullis K, Faloona F, Scharf S, Saiki R, Horn G, Erlich H (1986)<br />

Specific enzymatic amplification of DNA <strong>in</strong> vitro: the<br />

polymerase cha<strong>in</strong> reaction. Cold Spr<strong>in</strong>g Harb Symp Quant<br />

Biol 51(Pt 1):263–273<br />

Nagy Z, Ch<strong>and</strong>ler M (2004) Regulation of transposition <strong>in</strong> bacteria.<br />

Res Microbiol 155:387–398<br />

Nishi T, Ikemura T, Kanaya S (2005) GeneLook: a novel ab <strong>in</strong>itio<br />

gene identification system suitable for automated annotation of<br />

prokaryotic sequences. Gene 346:115–125<br />

Novikova N, De Boever P, Poddubko S, Deshevaya E, Polikarpov<br />

N, Rakova N, Con<strong>in</strong>x I, Mergeay M (2006) Survey of<br />

environmental biocontam<strong>in</strong>ation on board the International<br />

Space Station. Res Microbiol 157(1):5–12<br />

Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer<br />

<strong>and</strong> the nature of bacterial evolution. Nature 405:299–304<br />

Ohnishi M, Kurokawa K, Hayashi T (2001) Diversification of<br />

Escherichia coli genomes: are bacteriophages the major<br />

contributors? Trends Microbiol 9:481–485<br />

Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa M (2006)<br />

ODB: a database of operons accumulating known operons<br />

across multiple genomes. Nucleic Acids Res 34(Database<br />

issue):D358–362<br />

Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA (1986)<br />

Microbial ecology <strong>and</strong> evolution: a ribosomal RNA approach.<br />

Annu Rev Microbiol 40:337–365<br />

O’Malley MA, Bostanci A, Calvert J (2005) Whole-genome<br />

patent<strong>in</strong>g. Nat Rev Genet 6(6):502–506<br />

Ortutay C, Gaspari Z, Toth G, Jager E, Vida G, Orosz L, Vellai T<br />

(2003) Speciation <strong>in</strong> Chlamydia: genome-wide phylogenetic<br />

analyses identified a reliable set of acquired genes. J Mol Evol<br />

57:672–680<br />

Ou HY, Chen LL, Lonnen J, Chaudhuri RR, Thani AB, Smith R,<br />

Garton NJ, H<strong>in</strong>ton J, Pallen M, Barer MR, Rajakumar K (2006)<br />

A novel strategy for the identification of genomic isl<strong>and</strong>s by<br />

comparative analysis of the contents <strong>and</strong> contexts of tRNA sites<br />

<strong>in</strong> closely related bacteria. Nucleic Acids Res 34(1):e3<br />

Pal C, Hurst LD (2004) Evidence aga<strong>in</strong>st the selfish operon theory.<br />

Trends Genet 20(6):232–234<br />

Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, Harris<br />

DE, Holden MT, Churcher CM, Bentley SD, Mungall KL,<br />

Cerdeno-Tarraga AM, Temple L, James K, Harris B, Quail MA,<br />

Achtman M, Atk<strong>in</strong> R, Baker S, Basham D, Bason N,<br />

Cherevach I, Chill<strong>in</strong>gworth T, Coll<strong>in</strong>s M, Cron<strong>in</strong> A, Davis P,<br />

Doggett J, Feltwell T, Goble A, Haml<strong>in</strong> N, Hauser H, Holroyd<br />

S, Jagels K, Leather S, Moule S, Norberczak H, O’Neil S,<br />

Ormond D, Price C, Rabb<strong>in</strong>owitsch E, Rutter S, S<strong>and</strong>ers M,<br />

Saunders D, Seeger K, Sharp S, Simmonds M, Skelton J,<br />

Squares R, Squares S, Stevens K, Unw<strong>in</strong> L, Whitehead S,<br />

Barrell BG, Maskell DJ (2003) <strong>Comparative</strong> analysis of the<br />

genome sequences of Bordetella pertussis, Bordetella parapertussis<br />

<strong>and</strong> Bordetella bronchiseptica. Nat Genet 35(1):32–40<br />

Paulsen IT, Banerjei L, Myers GSA, Nelson KE, Seshadri R, Read TD,<br />

Fouts DE, Eisen JA, Gill SR, Heidelberg JF, Tettelin H, Dodson<br />

RJ, Umayam L, Br<strong>in</strong>kac L, Beanan M, Daugherty S, DeBoy RT,<br />

Durk<strong>in</strong> S, Kolonay J, Madupu R, Nelson W, Vamathevan J, Tran<br />

B, Upton J, Hansen T, Shetty J, Khouri H, Utterback T, Radune D,<br />

Ketchum KA, Dougherty BA, Fraser CM (2003) Role of mobile<br />

DNA <strong>in</strong> the evolution of vancomyc<strong>in</strong>-resistant Enterococcus<br />

faecalis. Science 299(5615):2071–2074<br />

Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW<br />

(2000) A DNA structural atlas for Escherichia coli. J Mol Biol<br />

299(4):907–930<br />

Pennisi E (2005) Biochemistry. Cut-rate genomes on the horizon?<br />

Science 309(5736):862<br />

Penyalver R, Lopez MM (1999) Cocolonization of the rhizosphere<br />

by pathogenic agrobacterium stra<strong>in</strong>s <strong>and</strong> nonpathogenic stra<strong>in</strong>s<br />

K84 <strong>and</strong> K1026, used for crown gall biocontrol. Appl Environ<br />

Microbiol 65(5):1936–1940<br />

Peters EDJ, Leverste<strong>in</strong>-Van Hall MA, Box ATA, Verhoef J, Fluit AC<br />

(2001) Novel gene cassettes <strong>and</strong> <strong>in</strong>tegrons. Antimicrob Agents<br />

Chemother 45(10):2961–2964<br />

Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B,<br />

Com<strong>and</strong>ucci M, Jenn<strong>in</strong>gs GT, Baldi L, Bartol<strong>in</strong>i E, Capecchi<br />

B, Galeotti CL, Luzzi E, Manetti R, Marchetti E, Mora M, Nuti<br />

S, Ratti G, Sant<strong>in</strong>i L, Sav<strong>in</strong>o S, Scarselli M, Storni E, Zuo P,<br />

Broeker M, Hundt E, Knapp B, Blair E, Mason T, Tettel<strong>in</strong> H,<br />

Hood DW, Jeffries AC, Saunders NJ, Granoff DM, Venter JC,<br />

Moxon ER, Gr<strong>and</strong>i G, Rappuoli R (2000) Identification of<br />

vacc<strong>in</strong>e c<strong>and</strong>idates aga<strong>in</strong>st serogroup B men<strong>in</strong>gococcus by<br />

whole-genome sequenc<strong>in</strong>g. Science 287:1816–1820<br />

Prescott L, Harvey JP, Kle<strong>in</strong> DA (1999) Microbiology, 4th edn.<br />

McGraw-Hill, New York, USA<br />

Price MN, Huang KH, Alm EJ, Ark<strong>in</strong> AP (2005) A novel method<br />

for accurate operon predictions <strong>in</strong> all sequenced prokaryotes.<br />

Nucleic Acids Res 33(3):880–892<br />

Rappuoli R (2001) Reverse vacc<strong>in</strong>ology, a genome-based approach<br />

to vacc<strong>in</strong>e development. Vacc<strong>in</strong>e 19:2688–2691<br />

Rendulic S, Jagtap P, Ros<strong>in</strong>us A, Epp<strong>in</strong>ger M, Baar C, Lanz C,<br />

Keller H, Lambert C, Evans KJ, Goesmann A, Meyer F,<br />

Sockett RE, Schuster SC (2004) A predator unmasked: life<br />

cycle of Bdellovibrio bacteriovorus from a genomic perspective.<br />

Science 303(5658):689–692<br />

Reznikoff WS (1992) The lactose operon-controll<strong>in</strong>g elements:<br />

a complex paradigm. Mol Microbiol 6(17):2419–2422<br />

Robb<strong>in</strong>s-Manke JL, Zdraveski ZZ, Mar<strong>in</strong>us M, Essigmann JM<br />

(2005) Analysis of global gene expression <strong>and</strong> double-str<strong>and</strong>break<br />

formation <strong>in</strong> DNA aden<strong>in</strong>e methyltransferase- <strong>and</strong><br />

mismatch repair-deficient Escherichia coli. J Bacteriol 187<br />

(20):7027–7037<br />

Roberts RJ, Vincze T, Posfai J, Macelis D (2005) REBASE—<br />

restriction enzymes <strong>and</strong> DNA methyl transferases. Nucleic<br />

Acids Res 33:D230–D232<br />

Rocha EPC, Danch<strong>in</strong> A, Viari A (1999) Functional <strong>and</strong> evolutionary<br />

role of long repeats <strong>in</strong> prokaryotes. Res Microbiol 150:725–733<br />

Rogoz<strong>in</strong> IB, Makarova KS, Wolf YI, Koon<strong>in</strong> EV (2004) <strong>Computational</strong><br />

approaches for the analysis of gene neighbourhoods <strong>in</strong><br />

prokaryotic genomes. Brief Bio<strong>in</strong>form 5(2):131–149<br />

Rosenfeld JA, Sarkar IN, Planet PJ, Figurski DH, DeSalle R (2004)<br />

ORFcurator: molecular curation of genes <strong>and</strong> gene clusters <strong>in</strong><br />

prokaryotic organisms. Bio<strong>in</strong>formatics 20(18):3462–3465<br />

Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-<br />

Solano F, Santos-Zavaleta A, Mart<strong>in</strong>ez-Flores I, Jimenez-Jac<strong>in</strong>to<br />

V, Bonavides-Mart<strong>in</strong>ez C, Segura-Salazar J, Mart<strong>in</strong>ez-Antonio<br />

A, Collado-Vides J (2006a) RegulonDB (version 5.0): Escherichia<br />

coli K-12 transcriptional regulatory network, operon<br />

organization, <strong>and</strong> growth conditions. Nucleic Acids Res 34<br />

(Database issue):D394–D397<br />

Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M,<br />

Penaloza-Sp<strong>in</strong>ola MI, Mart<strong>in</strong>ez-Antonio A, Karp PD, Collado-<br />

Vides J (2006b) The comprehensive updated regulatory<br />

network of Escherichia coli K-12. BMC Bio<strong>in</strong>formatics 7(1):5<br />

Sanger F, Donelson JE, Coulson AR, Kossel H, Fischer D (1973)<br />

Use of DNA polymerase I primed by a synthetic oligonucleotide<br />

to determine a nucleotide sequence in phage f1 DNA. Proc<br />

Natl Acad Sci U S A 70(4):1209–1213<br />

Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA,<br />

Hutchison CA, Slocombe PM, Smith M (1977) Nucleotide<br />

sequence of bacteriophage phi X174 DNA. Nature 265<br />

(5596):687–695


Schmidt H, Hensel M (2004) Pathogenicity isl<strong>and</strong>s <strong>in</strong> bacterial<br />

pathogenesis. Cl<strong>in</strong> Microbiol Rev 17(1):14–56<br />

Schneider G, Dobr<strong>in</strong>dt U, Bruggemann H, Nagy G, Janke B, Blum-<br />

Oehler G, Buchrieser C, Gottschalk G, Emody L, Hacker J<br />

(2004) The pathogenicity isl<strong>and</strong>-associated K15 capsule determ<strong>in</strong>ant<br />

exhibits a novel genetic structure <strong>and</strong> correlates with<br />

virulence <strong>in</strong> uropathogenic Escherichia coli stra<strong>in</strong> 536. Infect<br />

Immun 72(10):5993–6001<br />

Serruto D, Adu-Bobie J, Capecchi B, Rappuoli R, Pizza M,<br />

Masignani V (2004) Biotechnology <strong>and</strong> vacc<strong>in</strong>es: application<br />

of functional genomics to Neisseria men<strong>in</strong>gitidis <strong>and</strong> other<br />

bacterial pathogens. J Biotechnol 113:15–32<br />

Sharp PM, Li WH (1987) The codon adaptation <strong>in</strong>dex—a measure<br />

of directional synonymous codon usage bias, <strong>and</strong> its potential<br />

applications. Nucleic Acids Res 15(3):1281–1295<br />

Shendure J, Porreca GJ, Reppas NB, L<strong>in</strong> X, McCutcheon JP,<br />

Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM<br />

(2005) Accurate multiplex polony sequenc<strong>in</strong>g of an evolved<br />

bacterial genome. Science 309(5741):1728–1732<br />

Shimizu T, Ohtani K, Hirakawa H, Ohshima K, Yamashita A, Shiba<br />

T, Ogasawara N, Hattori M, Kuhara S, Hayashi H (2002)<br />

Complete genome sequence of Clostridium perfr<strong>in</strong>gens, an<br />

anaerobic flesh-eater. Proc Natl Acad Sci U S A 99(2):996–1001<br />

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On<br />

the total number of genes <strong>and</strong> their length distribution <strong>in</strong><br />

complete microbial genomes. Trends Genet 17(8):425–428<br />

Stahl FW, Murray NE (1966) The evolution of gene clusters <strong>and</strong><br />

genetic circularity <strong>in</strong> microorganisms. Genetics 53(3):569–576<br />

Starl<strong>in</strong>ger P, Saedler H (1976) IS-elements <strong>in</strong> microorganisms. Curr<br />

Top Microbiol Immunol 75:111–152<br />

Talarico S, Cave MD, Marrs CF, Foxman B, Zhang L, Yang Z (2005)<br />

Variation of the Mycobacterium tuberculosis PE_PGRS 33 gene<br />




2.8 Paper III: Global features of the Alcanivorax borkumensis SK2 genome


Environmental Microbiology (2007) doi:10.1111/j.1462-2920.2007.01483.x

Global features of the Alcanivorax borkumensis SK2 genome

Oleg N. Reva,1,3 Peter F. Hallin,2 Hanni Willenbrock,2 Thomas Sicheritz-Ponten,2 Burkhard Tümmler1 and David W. Ussery2

1 Klinische Forschergruppe, OE6711, Medizinische Hochschule Hannover, Carl-Neuberg-Strasse 1, D-30625 Hannover, Germany.
2 Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark.
3 Biochemistry Department, University of Pretoria, Lynnwood Road, Hillcrest, 0002 Pretoria, South Africa.

Summary

The global feature of the completely sequenced Alcanivorax borkumensis SK2 type strain chromosome is its symmetry and homogeneity. The origin and terminus of replication are located opposite to each other in the chromosome and are discerned with high signal-to-noise ratios by maximal oligonucleotide usage biases on the leading and lagging strand. Genomic DNA structure is rather uniform throughout the chromosome with respect to intrinsic curvature, position preference or base stacking energy. The orthologs and paralogs of A. borkumensis genes with the highest sequence homology were found in most cases among γ-Proteobacteria, with Acinetobacter and P. aeruginosa as closest relatives. A. borkumensis shares a similar oligonucleotide usage and promoter structure with the Pseudomonadales. A comparatively low number of only 18 genome islands with atypical oligonucleotide usage was detected in the A. borkumensis chromosome. The gene clusters that confer the assimilation of aliphatic hydrocarbons are localized in two genome islands which were probably acquired from an ancestor of the Yersinia lineage, whereas the alk genes of Pseudomonas putida still exhibit the typical Alcanivorax oligonucleotide signature, indicating a complex evolution of this major hydrocarbonoclastic trait.

Received 8 August, 2007; accepted 26 September, 2007.
*For correspondence. E-mail tuemmler.burkhard@mh-hannover.de; Tel. (+49) 511 5322920; Fax (+49) 511 5326723.

Introduction

Alcanivorax borkumensis strain SK2 is a cosmopolitan oil-degrading oligotrophic marine γ-proteobacterium (Yakimov et al., 1998). The SK2 strain is the paradigm for hydrocarbonoclastic bacteria that are specialized for hydrocarbon degradation but have an otherwise highly restricted substrate spectrum, being capable of utilizing only a few organic acids such as pyruvate, but not simple sugars, for growth (Yakimov et al., 1998; Sabirova et al., 2006). A. borkumensis is present in low abundance in unpolluted environments, but it rapidly becomes the dominant bacterium in oil-polluted open ocean and coastal waters, where it can constitute 80–90% of the oil-degrading microbial community (Harayama et al., 1999; Kasai et al., 2001; 2002; Syutsubo et al., 2001; Röling et al., 2002; Hara et al., 2003; McKew et al., 2007a,b).

The genome of A. borkumensis was recently sequenced and annotated (Schneiker et al., 2006). In this paper, we perform a genome-wide comparative genomics analysis and a detailed characterization of the global features of the A. borkumensis strain SK2 genome. This work on A. borkumensis strain SK2 aimed to demonstrate the potential of genome linguistic approaches for functional and comparative analysis of bacterial genomes.

Results and discussion

© 2007 The Authors
Journal compilation © 2007 Society for Applied Microbiology and Blackwell Publishing Ltd

DNA structure and highly expressed genes

The genome atlas (Fig. 1) combines several generally informative properties of the chromosome: structural features (intrinsic curvature, stacking energy and position preference), repeat properties (global direct and inverted repeats) and the main base composition features (GC skew and percent AT). Stacking energy measures helix rigidity, and position preference is a flexibility measure (Jensen et al., 1999; Pedersen et al., 2000). Regions that exhibit low position preference correlate with an enrichment of highly expressed genes (Dlakic et al., 2004; Willenbrock and Ussery, 2007). Examples in A. borkumensis are the rrn operons, the genes encoding ribosomal proteins and the gene cluster labelled rpoC on the atlas, which among others encodes RNA polymerase subunits. Low position preference was found to correlate with high codon adaptation indices, the common measure for highly expressed genes (Willenbrock et al., 2006), indicating that the local DNA structure is an important determinant of codon usage and gene expression. Moreover, intrinsic curvature is often encountered upstream of highly expressed genes (Skovgaard et al., 2002), which correlates well with the fact that promoter DNA tends to be more curved than DNA in coding regions (Pedersen et al., 2000).

Fig. 1. Genome Atlas of A. borkumensis SK2 showing different structural parameters and the distribution of global repeats, GC skew and A + T contents. Colour intensity increases with the deviation from the average. Values close to the average are shaded very light grey; values with more than 3 standard deviations from the average are most strongly coloured.

The chromosome is rather homogeneous in all analysed structural features. The number of repeats is low, and the terminus of replication is opposite to the origin of replication as indicated by GC skew (Ussery et al., 2002). The three rRNA operons, organized in the order 16S-23S-5S, are located in three areas with low position preference (green marks in the 3rd circle) and possible upstream regions with high intrinsic curvature (blue in the 1st circle) near 0.4–0.5 Mb (two regions) and 2.25 Mb (one region).
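How GC skew can point to the origin and terminus may be illustrated with a minimal sketch. It assumes the widely used definition of GC skew, (G − C)/(G + C) per window, and takes the minimum and maximum of the cumulative skew as rough estimates of origin and terminus. The function names, the window size and the toy sequence are illustrative assumptions and are not taken from the paper or from the genome atlas software.

```python
def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a neutral skew of 0.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_extremes(skews):
    """Return the window indices of the minimum and maximum cumulative skew."""
    total, running = 0.0, []
    for s in skews:
        total += s
        running.append(total)
    return running.index(min(running)), running.index(max(running))


# Toy sequence (assumption): a C-rich half followed by a G-rich half.
toy = "ACCC" * 500 + "AGGG" * 500
skews = gc_skew(toy, window=500)
origin_idx, terminus_idx = cumulative_extremes(skews)
```

On a real circular chromosome the two extremes would be expected roughly half a genome apart, which is the symmetry described above; the sketch treats the sequence as linear for simplicity.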

Phylogenomics by sequence homology

The genome of A. borkumensis was compared with existing sequence information in other Proteobacteria by constructing phylogenetic trees for each amino acid sequence and the organisms for which a similar gene existed. By extracting the phylogenomic information of the resulting 1919 phylogenetic trees, a phylome atlas could be constructed (Fig. 2). In most cases the orthologs and paralogs with the highest sequence homology were found among γ-Proteobacteria. A substantial proportion of A. borkumensis genes had their closest homologues in α- and β-Proteobacteria, but no closest homologue was detected in δ- and ε-Proteobacteria. Inspection of the collected phylogenetic connections revealed that the most closely related organisms are Acinetobacter sp. and Pseudomonas aeruginosa, although in trees where both Pseudomonas and Acinetobacter are present, A. borkumensis tends to cluster more often with the latter. No obvious horizontal gene transfers seem to have taken place. Regions around positions 350 000 and 450 000 are very 'pure' γ-proteobacterial regions.

Fig. 2. Phylome Atlas of A. borkumensis SK2 genes indicating their closest bacterial homologues. Each of the concentric circles represents a taxonomic group as described in the figure legend on the right, with the outermost circle corresponding to the top-most feature, and the innermost circle corresponding to the bottom-most feature. Light bands indicate A. borkumensis SK2 genes with no homologue in the respective taxonomic group.

Genome analysis of oligonucleotide usage

Oligonucleotide usage (OU) has been shown to be a genome-specific signature (Pride et al., 2003; Reva and Tümmler, 2004). Genomic regions termed the 'core sequences' are characterized by OU patterns that are similar to the global pattern of the chromosome. However, loci with alternative OU patterns typically make up more than 10% of a bacterial genome in total. These loci with atypical OU patterns comprise heterogeneous subsets of parasitic and recent foreign DNA, ancient genes for ribosomal constituents (RNAs and proteins), multidomain genes and non-coding sequences with multiple tandem repeats (Reva and Tümmler, 2005). Hence, laterally transferred gene islands can be reliably identified in complete genomes by their atypical oligonucleotide usage (Reva and Tümmler, 2005; Chen et al., 2007; Klockgether et al., 2007). Here, we focused on tetranucleotide usage (TU) parameters because the 256 different tetranucleotide words are optimal to differentiate bacterial genome sequences by the frequency and informativeness of the individual element. TU patterns represent the deviations of tetranucleotide word counts in a given sequence from an equiprobable distribution. Selection and counterselection of the oligonucleotide words are driven by their stereochemical properties such as base stacking energy, propeller twist angle, protein deformability, bendability and position preference (Reva and Tümmler, 2004). By permutation analysis, the 256 tetranucleotides were assigned to 39 equivalence classes, each of which is characterized by the same values for the five properties mentioned above (Baldi and Baisnee, 2000). Words of the same equivalence class tend to occur at similar frequencies in a nucleotide sequence (Reva and Tümmler, 2004). Oligonucleotide usage conservation reflects to some extent the phylogeny of microorganisms (Pride et al., 2003; Teeling et al., 2004).
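The idea of a TU pattern as a deviation of word counts from an equiprobable distribution can be sketched as follows. This is only an illustration under stated assumptions: the expected count of each of the 256 words is taken as the total number of counted words divided by 256, and the deviation is expressed as (observed − expected)/expected. The exact statistic used in the paper is defined in its Experimental procedures and may differ; the function names below are hypothetical.

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotide words over the alphabet A, C, G, T.
WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_counts(seq):
    """Count overlapping 4-mers; words containing non-ACGT symbols are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    return {w: counts.get(w, 0) for w in WORDS}


def deviation_from_equiprobable(counts):
    """Relative deviation (observed - expected) / expected for every word.

    Assumption: under an equiprobable distribution every word is
    expected total / 256 times.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no countable tetranucleotides")
    expected = total / len(WORDS)
    return {w: (n - expected) / expected for w, n in counts.items()}


counts = tetranucleotide_counts("AAAAA")
pattern = deviation_from_equiprobable(counts)
```

A word that never occurs gets a deviation of −1, and strongly over-represented words get large positive values, so the 256 values together form a signature that can be compared between sequences.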

Phylogenomics by tetranucleotide usage analysis

TU patterns were calculated for all sequenced genomes of γ-Proteobacteria. Four examples of TU patterns determined for A. borkumensis SK2, Pseudomonas putida KT2440, Escherichia coli K-12 and Shewanella oneidensis MR-1 are shown in Fig. 3. Tetranucleotide words were grouped by the equivalence classes and sorted in order of decreasing base stacking energy. Figure 4 visualizes the phylogenetic relationships differentiated by TU patterns of 29 γ-Proteobacterial taxa, each of which is represented by not more than a single sequenced strain. A. borkumensis forms a cluster with Pseudomonas, Methylococcus, Xanthomonas and Xylella (Fig. 4). Despite the variation in GC-content, from 52 to 54% in Xylella and Alcanivorax to more than 65% in Xanthomonas and Pseudomonas, the TU patterns of these microorganisms are similar and separated from other γ-Proteobacteria.

There is an abundance of GC-rich tetranucleotides with high base stacking energy in the sequence of A. borkumensis SK2 (words belonging to equivalence classes 37–39, 30 and 27) that is similar to the TU pattern of P. putida KT2440 (Fig. 3). Words of the AT-rich classes 7, 10, 13 and 32 are significantly underrepresented in both species. The major difference between the TU patterns is the abundance of poly A and poly T stretches (words of class 1) in A. borkumensis, in correspondence with its lower GC-content of 54.7%. Although E. coli and S. oneidensis share a similar GC content with A. borkumensis, their tetranucleotide usage is different from that of Alcanivorax. The parity of GC with AT in the genome correlates with a balanced use of GC-rich and AT-rich words with high and low base stacking energy. In contrast, words with intermediate values of the base stacking energy (classes 25, 31, 36 and 29) are mostly underrepresented (Fig. 3). The data suggest that oligonucleotide usage drives GC-content and not vice versa. To give another example: the GC-rich words of class 21 are rare in all γ-Proteobacteria irrespective of their GC-content (Fig. 3), but these words are overrepresented in α-Proteobacteria (Agrobacterium, Bordetella, Caulobacter, Rhizobium).

Fig. 3. Tetranucleotide usage patterns of A. borkumensis SK2, P. putida KT2440, E. coli K12 MG1655 and S. oneidensis MR-1. The deviation Δw of observed from expected counts is shown for all 256 tetranucleotide words (16 × 16 cells) by colour code (right bar). Tetranucleotides are grouped into 39 classes of equivalent structural features (Baldi and Baisnee, 2000) and sorted by decreasing base stacking energy row-by-row starting at the upper left corner (class 39). The words corresponding to the cells in the colour plots are shown in the table in the lower part of the figure.

Fig. 4. Tree of the similarity of TU patterns of completely sequenced γ-Proteobacteria strains. Distance D-values (see Experimental procedures) between two TU patterns were calculated, and the tree was constructed from the distance matrix of all D-values by the minimum evolution neighbour-joining method (Saitou and Nei, 1987).

Anomalous local TU patterns in the A. borkumensis genome

A. borkumensis shares a common taxonomic group with Pseudomonas, Methylococcus, Xanthomonas and Xylella. Although the TU patterns are genome-specific signatures, the oligonucleotide usage may vary locally in segments made up of horizontally acquired elements, phylogenetically ancient genes such as rRNAs or genes with peculiar codon usage (Reva and Tümmler, 2004; 2005). In other words, anomalous local TU patterns can be expected for the most recent and the most ancient genes. Local TU patterns were calculated in 8 kbp long overlapping sliding windows in steps of 2 kbp. Distances D between local and global TU patterns are shown in Fig. 5. The 18 regions with D-values above the 95% confidence interval are listed in Table 1.
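The sliding-window scan described above can be sketched as follows. The window length (8 kbp) and step (2 kbp) are taken from the text; everything else is an assumption made for illustration. In particular, `stand_in_distance` is a hypothetical placeholder (half the summed absolute difference of word frequencies, in percent) and is not the distance D defined in Experimental procedures, and the sequence is treated as linear rather than circular.

```python
from collections import Counter


def word_frequencies(seq, k=4):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def stand_in_distance(local, global_):
    """Placeholder distance in percent (assumption, not the paper's D).

    Half the summed absolute frequency difference: 0 for identical
    patterns, 100 for patterns that share no word.
    """
    words = set(local) | set(global_)
    diff = sum(abs(local.get(w, 0.0) - global_.get(w, 0.0)) for w in words)
    return 100.0 * diff / 2.0


def sliding_windows(length, window=8000, step=2000):
    """Yield (start, end) of overlapping windows along a linear sequence."""
    for start in range(0, length - window + 1, step):
        yield start, start + window


def local_distances(seq, window=8000, step=2000):
    """Distance of each window's pattern to the whole-sequence pattern."""
    global_freq = word_frequencies(seq)
    return [
        (start, end, stand_in_distance(word_frequencies(seq[start:end]), global_freq))
        for start, end in sliding_windows(len(seq), window, step)
    ]


# Toy input (assumption): a perfectly periodic 12 kbp sequence.
profile = local_distances("ACGT" * 3000)
```

Windows whose distance exceeds a chosen threshold would then be candidates for regions with atypical usage; how the 95% confidence bound was derived is not restated here and is left out of the sketch.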

Three clusters with anomalous D-values encode ribosomal RNAs, which belong to the most ancient and conserved elements of all bacterial genomes. The other 15 regions with atypical TU were most likely acquired recently; three of them contain transposase genes. In total 11 transposases were annotated in the A. borkumensis SK2 genome, but for five of them no significant deviations of the local TU patterns were detected in adjacent regions. If inserted mobile elements have lost their mobility due to disruptive mutations, they undergo an amelioration process that smooths the differences in oligonucleotide usage between inserts and the host genome, and thus can no longer be detected by anomalous TU patterns (Pride et al., 2003).

Five regions with high D-values (Fig. 5) encode only hypothetical proteins (Table 1). One further region contains genes of the type II secretion system, and two regions encode type IV pili biogenesis proteins, the latter of which are known to have spread among proteobacteria by horizontal transfer with the original codon usage and GC content being retained (Spangenberg et al., 1997).

The most extended region with high D-values encodes<br />

a cluster of genes for glycosyltransferases <strong>and</strong> polysaccharide<br />

biosynthesis prote<strong>in</strong>s (Abo_858-Abo_880:<br />

1 018 000–1 060 000 bp) characterized by the second<br />

largest D-value <strong>and</strong> low GC-content (m<strong>in</strong>imum 45% GC).<br />

The region term<strong>in</strong>ates abruptly after Abo_880 at an AsntRNA<br />

gene. The TU pattern of the locus was compared<br />

with those of 177 sequenced bacterial chromosomes, 316<br />

plasmids <strong>and</strong> 104 phages (Reva <strong>and</strong> Tümmler, 2004).<br />

The pattern was distant from all analysed sequences. The<br />

best hit of D = 34.9% was observed for the 5833 bp large<br />

bacteriophage Pf3 that <strong>in</strong>fects P. aerug<strong>in</strong>osa harbour<strong>in</strong>g<br />

the RP1 plasmid (Luiten et al., 1985). A stretch of 1550 bp<br />

Table 1. Chromosomal regions of A. borkumensis with atypical TU patterns.<br />

Coord<strong>in</strong>ates<br />

Left Right<br />

D a (%) Annotation<br />

Fig. 5. Deviations of TU patterns <strong>in</strong> local<br />

regions of A. borkumensis SK2 chromosome.<br />

Local TU patterns were determ<strong>in</strong>ed <strong>in</strong> 8 kbp<br />

slid<strong>in</strong>g w<strong>in</strong>dow <strong>in</strong> steps of 2 kbp. D, the<br />

distance betweeen local <strong>and</strong> chromosomal<br />

tetranucleotide patterns as def<strong>in</strong>ed <strong>in</strong><br />

Experimental procedures, is plotted versus<br />

the coord<strong>in</strong>ates of the chromosome start<strong>in</strong>g<br />

from the putative replication orig<strong>in</strong>.The upper<br />

border of the 95% confidence <strong>in</strong>terval of<br />

D-values is shown by the horizontal l<strong>in</strong>e.<br />

upstream of the tRNA gene is 48% identical <strong>in</strong> nucleotide<br />

sequence with the Pf3 sequence (2344-4078 bp).<br />

Accord<strong>in</strong>g to this <strong>in</strong> silico f<strong>in</strong>d<strong>in</strong>g we propose that this<br />

gene isl<strong>and</strong> was captured from a phage that typically<br />

target the 3′-end of a tRNA gene (Dobr<strong>in</strong>dt et al.,<br />

2004).<br />

The alkB genes encod<strong>in</strong>g the degradation of alkanes<br />

which is the prom<strong>in</strong>ent name-giv<strong>in</strong>g feature of the taxon<br />

Alcanivorax, are located <strong>in</strong> two isl<strong>and</strong>s (Schneiker et al.,<br />

2006) with anomalous TU patterns (Table 1). Very close<br />

homologues were identified <strong>in</strong> mar<strong>in</strong>e bacteria <strong>and</strong><br />

Pseudomonas species (Schneiker et al., 2006). The<br />

alkane hydroxylase gene cluster is widely distributed<br />

among hydrocarbon-utiliz<strong>in</strong>g g-Proteobacteria due to its<br />

possible horizontal transfer (van Beilen et al., 2001;<br />

2004). The role of these genes <strong>in</strong> the degradation of<br />

126 000 140 000 42.20 Abo_114–120: lysR transcriptional regulator, haloacid dehalogenase hydrolase, amiC amidase, gntR<br />

transcriptional regulator, alkB2 alkane monooxygenase, type I pili biogenesis prote<strong>in</strong>s<br />

190 000 198 000 40.47 Abo_172–178: ilvD-1 dihydroxy-acid dehydratase, conserved hypothetical prote<strong>in</strong>s,<br />

long-cha<strong>in</strong>-fatty-acid-CoA ligase, acyl-CoA dehydrogenases<br />

234 000 245 000 47.95 Abo_209–214: conserved hypothetical prote<strong>in</strong>s, transposase, type II secretion system prote<strong>in</strong>s<br />

400 000 408 000 49.42 first operon for rRNAs<br />

502 000 | 510 000 | 46.26 | Abo_439–446: ispA lipoprotein signal peptidase, fkpB peptidyl-prolyl cis-trans isomerase, ispH hydroxymethylbutenyl pyrophosphate reductase, type IV pili biogenesis proteins, conserved hypothetical proteins
526 000 | 534 000 | 43.41 | second operon for rRNAs
670 000 | 678 000 | 40.29 | Abo_581–583: type IV pili biogenesis proteins
792 000 | 800 000 | 43.00 | Abo_2680–2681: hypothetical proteins
1 020 000 | 1 056 000 | 50.43 | Abo_859–878: polysaccharide biosynthesis proteins
1 742 000 | 1 750 000 | 40.88 | Abo_1439: periplasmic binding domain/transglycosylase SLT domain fusion
1 892 000 | 1 900 000 | 46.32 | Abo_2841–2847: hypothetical proteins
2 026 000 | 2 034 000 | 41.90 | Abo_1668–1671: conserved hypothetical proteins, 3 transposases, siderophore biosynthesis protein, glycosyl transferase
2 088 000 | 2 096 000 | 40.65 | Abo_1707–1708: conserved hypothetical proteins
2 146 000 | 2 154 000 | 47.05 | Abo_2897–2905: iscA iron-binding protein IscA, metal-sulfur cluster biosynthetic enzyme, sufE Fe-S metabolism associated domain protein, iscS cysteine desulfurase, rrf2 family protein, hypothetical proteins, SIR2-like transcriptional silencer
2 254 000 | 2 262 000 | 49.71 | third operon for rRNAs
2 364 000 | 2 372 000 | 52.56 | Abo_1942: penicillin-binding protein, hypothetical proteins, 2 transposases
2 632 000 | 2 640 000 | 40.17 | Abo_2979–2984: hypothetical proteins
3 060 000 | 3 076 000 | 42.94 | Abo_2516–3066: Na+/H+ antiporter, alkS alkB1GHJ regulator, alkB1 alkane monooxygenase, alkG rubredoxin, aldH aldehyde dehydrogenase, hypothetical proteins

a. D, distance between local and chromosomal TU patterns as defined in Experimental procedures.

©2007TheAuthors<br />

Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology


short-chain n-alkanes by A. borkumensis SK2 and AP1 was experimentally proven (Smits et al., 2002; Hara et al., 2004; Sabirova et al., 2006). Interestingly, the two regions comprising the alkS, alkB1, alkG and aldH alkane-degradation genes and alkB2 with transcriptional regulators, respectively (Table 1), are as similar to each other in their TU patterns (D = 34.3%) as each of them is to Yersinia pestis (D = 32.2% for alkB1, D = 33.4% for alkB2), Yersinia enterocolitica (D = 29.5% for alkB1, D = 34.4% for alkB2) and Shewanella oneidensis MR-1 (D = 32.5% for alkB1, D = 42.4% for alkB2). These data suggest that the alkB1 and alkB2 genes were delivered to A. borkumensis from an ancestor of the Yersinia lineage. The AlkB1 amino acid sequences of A. borkumensis strains AP1 and SK2 are highly homologous to those of P. putida strains P1 and GPO1 (van Beilen et al., 2001; 2004; Smits et al., 2002; Hara et al., 2004), but their TU patterns are not that similar (D = 37.1%). Surprisingly, the TU pattern of the alkB cluster of P. putida is significantly more similar to the global TU pattern of the whole A. borkumensis chromosome (D = 16.7% for strain GPO1, 19% for strain P1) than to that of the P. putida KT2440 chromosome (30.1% and 30.3%). D-values of 17% or 19% are within the first quartile (0–26%), far below the median value of 28.4% for local TU patterns of the A. borkumensis chromosome (Fig. 5), indicating that the P. putida alkB gene behaves as if it were part of the Alcanivorax core genome. We note the striking phenomenon that the coding sequences of the catabolic alk transposon in Alcanivorax and Pseudomonas evolved convergently, but that the genes retained the oligonucleotide signature of their donors, most likely Alcanivorax for Pseudomonas and Yersinia-like organisms for Alcanivorax.

<strong>Comparative</strong> genomics of Alcanivorax borkumensis 7<br />

Origin of replication

The GC skew plotted in the seventh circle of the genome atlas (Fig. 1) reflects a general bias of purines towards the leading strand of DNA replication; however, it has almost no correlation with the structural properties of DNA (Skovgaard et al., 2002). The GC skew is often useful for locating the origin and terminus of replication (Jensen et al., 1999).
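As an illustration of how a GC skew can point to the origin and terminus, the following minimal sketch computes a windowed skew (G − C)/(G + C) and looks for the extrema of its running sum. This is the simple cumulative GC skew approach, not the oligonucleotide skew method used for the origin plot; the function names and the window size are assumptions chosen for the example.

```python
def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C carries no skew information.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_extrema(skews):
    """Return the window indices of the minimum and maximum cumulative skew.

    With G enriched on the leading strand, the minimum is a candidate
    origin and the maximum a candidate terminus.
    """
    total = 0.0
    cumulative = []
    for value in skews:
        total += value
        cumulative.append(total)
    i_min = min(range(len(cumulative)), key=cumulative.__getitem__)
    i_max = max(range(len(cumulative)), key=cumulative.__getitem__)
    return i_min, i_max
```

On a real chromosome the sequence is circular, so the reported window indices depend on where the sequence file happens to start.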

The circle is blue on the right side and purple on the left side. The two breaks between the colours at the top and at the bottom of the circle may be the origin and the terminus of replication. This can be visualized more clearly in the origin plot (Fig. 6) (Worning et al., 2006). Here, the difference between hypothetical leading and lagging strands is plotted (red) for various positions on the chromosome. The peaks indicating maximal oligonucleotide skew correspond to origin and terminus. The terminus was identified as the peak showing low G/C weighted strand bias at position 1 502 000 bp. The origin was identified as the other peak at position 3 118 000 bp. The signal-to-noise ratio of 14.0 was among the top 10% of sequenced Proteobacteria, indicating a large difference between leading and lagging strand and making the prediction of the origin highly confident.

Structural analysis of promoter regions

Structural features of the genomic DNA may indicate promoter regions, as promoters normally have high curvature, melt easily and are more rigid. The DNA structural parameters mentioned earlier (position preference, stacking energy and intrinsic curvature) together with AT content and DNase sensitivity (Brukner et al., 1995) were

Fig. 6. Localization of the origin and the terminus of replication in the A. borkumensis SK2 chromosome derived from strand bias curves: the median oligonucleotide skew curve (red), the GC weighted median (green) and the AT weighted median (blue) (Worning et al., 2006).



8 O. N. Reva et al.<br />

compiled into a structural profile of all upstream regions of A. borkumensis (see section Experimental procedures). The profile uses z-scores to measure how the average values of the properties vary from −400 bp to +400 bp around the translation start (Fig. 7). A. borkumensis has a coding density of only 87%, which results in wider intergenic spacers, and this appears to give rise to a larger and wider peak of curvature, stacking energy and AT content (Fig. 7A). For comparison we also analysed the promoter profile of another ocean bacterium, Candidatus Pelagibacter ubique HTCC1062 (Giovannoni et al., 2005), an example of a highly streamlined genome with a coding density of 96%. Here we observed a much weaker curvature signal, and the distributions of stacking energy and AT content were narrower and had higher maxima (Fig. 7B).

Next, the probability of opening during stress-induced DNA duplex destabilization was computed using the program SIDD (Wang et al., 2004), covering five different values of the super-helical density σ = {−0.025, −0.035, −0.045, −0.055, −0.065}. In A. borkumensis, the probability of opening increases as the super-helical density becomes lower (Fig. 7C). In contrast, a narrower SIDD profile that exhibits only minor dependence on super-helical density (Fig. 7D) was calculated for the Candidatus Pelagibacter ubique HTCC1062 genome.

The structural profile for the promoter regions of A. borkumensis was compared with that of closely related species as found above (see Fig. 4). Generally, it looked more like the promoter profile of members of the Pseudomonadales than that of the general comparison organism, E. coli. Moreover, the promoter profile was very different from that of X. fastidiosa strains, even though they were very similar with regard to their TU profile (see Fig. 4). The promoter profiles for the above-mentioned organisms may be found at our website (http://www.cbs.dtu.dk/services/GenomeAtlas/).

Amino acid and codon usage

We have examined the codon and amino acid usage of A. borkumensis and compared it with the usage of bacteria in general and of 16 oceanic bacteria (Entrez project IDs 230, 10 645, 12 530, 13 233, 13 239, 13 282, 13 642, 13 643, 13 654, 13 655, 13 902, 13 906, 13 910, 13 911, 13 989, 15 660; Willenbrock et al., 2006). In Fig. 8, the codon usage plot of A. borkumensis is superimposed on the cumulative plot of all completely sequenced bacteria in public databases (N = 518, Fig. 8A) or on that of 16 oceanic bacteria (Fig. 8B). A few codons are differentially utilized in A. borkumensis (GUC, CUG), but all values are within the range of three standard deviations. In other words, the codon usage of A. borkumensis resides within the typical range of eubacteria.

Interestingly, the sequenced oceanic bacteria share a very similar amino acid usage (Fig. 8D), whereas broad variations were noted amongst all sequenced bacteria, which represent the whole spectrum of habitats (Fig. 8C). A. borkumensis roughly follows the profile of the oceanic bacteria, although cysteine, tryptophan, leucine, proline, arginine and serine are under-utilized, and glutamic acid, lysine, phenylalanine, histidine, methionine and tyrosine are over-utilized, all exceeding the three-standard-deviation boundaries.
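The three-standard-deviation criterion used above can be sketched as follows. The example assumes that the usage frequencies have already been normalized per genome; the function name, the input layout (one dictionary per data set) and the use of the sample standard deviation are assumptions made for illustration only.

```python
from statistics import mean, stdev


def usage_outliers(target, reference, n_sd=3.0):
    """Flag codons or amino acids whose usage lies outside mean +/- n_sd * SD.

    target:    {symbol: frequency in the genome of interest}
    reference: {symbol: [frequency in each reference genome]}
    """
    flagged = {}
    for symbol, values in reference.items():
        centre = mean(values)
        spread = stdev(values)  # sample standard deviation (assumption)
        if target[symbol] < centre - n_sd * spread:
            flagged[symbol] = "under"
        elif target[symbol] > centre + n_sd * spread:
            flagged[symbol] = "over"
    return flagged
```

The same function applies to codons and to amino acids; only the keys of the two dictionaries differ.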

Conclusion<br />

Fig. 7. Profile of structural properties of promoter regions (A and B) and probabilities of opening during stress-induced DNA duplex destabilization at various super-helical densities (C and D) in the A. borkumensis SK2 (A and C) and Candidatus Pelagibacter ubique HTCC1062 (B and D) chromosomes. Each annotated gene was aligned at the translation start site and the average values for the SIDD probabilities, AT-content, position preference, stacking energy, intrinsic curvature and DNase sensitivity were calculated at each position in the alignment. The values were subsequently converted into z-scores, using the average and standard deviation of the entire chromosome. Values are smoothed over a 5 bp window.

Inspection of the collected phylogenetic connections revealed that the most closely related organisms are Acinetobacter sp. and Pseudomonas aeruginosa, although in trees where both Pseudomonas and Acinetobacter are present, A. borkumensis tends to cluster more often with the latter.

The major structural feature of the A. borkumensis chromosome is its symmetry and homogeneity. The genome contains only very few regions with extraordinarily low or high curvature, position preference or base stacking energy. The chromosomal frame is symmetric: the origin and the terminus of replication are located opposite each other in the chromosome and are clearly discerned by maxima of oligonucleotide usage biases between leading and lagging strand.

The genetic repertoire of A. borkumensis is most similar to that of Acinetobacter and P. aeruginosa. Moreover,

Fig. 8. Codon usage (A and B) and amino acid usage (C and D) of A. borkumensis SK2 compared with those of 518 completely sequenced bacteria (A and C) or with those of 16 sequenced oceanic bacteria (B and D). Frequencies of amino acids and codons were counted for each genome and normalized. Mean value (grey line) and three standard deviations (grey solid area) represent the global usage of individual codons (A and B) and amino acids (C and D) in the 518 (A and C) or 16 (B and D) reference genomes. The red line (A and B) shows the codon usage and the blue line (C and D) shows the amino acid usage of A. borkumensis.

A. borkumensis shares a similar oligonucleotide usage with the Xanthomonadales and Pseudomonadales, indicating close phylogenetic relationships with these orders in accordance with 16S rDNA sequence relatedness (Schneiker et al., 2006). Amongst this subgroup of completely sequenced genomes, the A. borkumensis chromosome harbours the lowest relative number of genome islands with atypical tetranucleotide usage. P. putida KT2440, for example, carries threefold more islands per megabase in its chromosome (Weinel et al., 2002). Interestingly, one of the three enzyme systems that are upregulated in alkane-grown cells (Sabirova et al., 2006), the well-known alkB1 cluster, is encoded by genome islands. The molecular evolution of the alk genes that are encoded by a catabolic transposon (van Beilen et al., 2001) is remarkable: the Alcanivorax genes were probably acquired from the Yersinia lineage, whereas the P. putida genes exhibit the typical Alcanivorax tetranucleotide signature. Horizontal gene transfer conferred what is probably the most important metabolic trait of A. borkumensis, but otherwise the stable seawater habitat apparently did not favour the shuffling and exchange of genes with other taxa. Instead, a symmetric and structurally homogeneous chromosome evolved that lacks numerous metabolic traits (Yakimov et al., 1998; Schneiker et al., 2006) found in its versatile Pseudomonas relatives, which are endowed with twofold larger chromosomes (Stover et al., 2000; Nelson et al., 2002).

Experimental procedures<br />

Genomic sequence<br />

The comparative genomics analyses were based on the genomic sequence of A. borkumensis SK2 (Golyshin et al., 2003) and its annotation (Schneiker et al., 2006).

Atlas visualization<br />

Atlases, developed in-house, make it possible to visualize correlations between position-dependent information contained within a chromosome. Circular graphical representations of the entire A. borkumensis genome were created using the atlas visualization tool GeneWiz. Each feature, such as AT content, is represented by a separate circle in the atlas. Typically, mean values are pictured in grey and extreme values are highlighted in a user-defined colour (Pedersen et al., 2000).

Phylome atlas. For each amino acid sequence, phylogenetic trees were automatically constructed as described in Sicheritz-Ponten and Andersson (2001). The phylogenomic information of the resulting 1919 phylogenetic trees was extracted and analysed in the PyPhy system.

Genome atlas. The genome atlas combines several generally informative properties: structural features (intrinsic curvature, stacking energy and position preference), repeat properties (global direct and inverted repeats) and the main base composition features (GC skew and percent AT).

Intrinsic curvature was calculated using the CURVATURE software (Shpigelman et al., 1993). Stacking energy of a DNA segment was determined by the method of Ornstein and colleagues (1978). Position preference was based on a trinucleotide model that estimates the helix flexibility (Satchwell et al., 1986). Base composition is generally divided into AT content and GC skew; both were calculated from the nucleotide sequence. Global direct and inverted repeats were found using variations of an algorithm that finds the highest degree of homology for a 15 bp repeat within a window of length 100 bp (Jensen et al., 1999).

Codon and amino acid usage

Codon and amino acid usage were calculated from all coding regions in the genome as annotated in the GenBank entries. The relative synonymous codon usage was calculated by comparing the codon distribution from a set of highly expressed genes with a background distribution estimated from the codon usage of all coding regions in the genome (Willenbrock et al., 2006). In order to identify a set of constitutively highly expressed genes in A. borkumensis, the reference set of 27 very highly expressed Escherichia coli genes originally compiled by Sharp and Li (1986) was aligned at the protein level against all genes annotated in the GenBank entry using BLASTP version 2.2.9 (Altschul et al., 1997). For each of these very highly expressed genes, the A. borkumensis gene with the best alignment was added to the set of very highly expressed genes if it had an E-value below 10⁻⁶.
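To make the codon counting step concrete, the sketch below tallies codons in a set of coding sequences and derives the relative synonymous codon usage in its conventional form (observed count divided by the count expected if all synonymous codons of an amino acid were used equally). It assumes the standard genetic code and in-frame sequences; the function names are hypothetical, and the comparison of a highly expressed gene set against the genomic background described above would be built on top of two such tallies.

```python
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_counts(coding_seqs):
    """Count codons over in-frame coding sequences (DNA or RNA alphabet)."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper().replace("U", "T")
        for pos in range(0, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    return counts


def rscu(counts):
    """Return {codon: observed / expected under equal synonymous usage}."""
    families = {}
    for codon, amino_acid in CODON_TABLE.items():
        families.setdefault(amino_acid, []).append(codon)
    values = {}
    for amino_acid, codons in families.items():
        total = sum(counts[c] for c in codons)
        if amino_acid == "*" or total == 0:
            continue  # stop codons and unobserved amino acids are left out
        expected = total / len(codons)
        for codon in codons:
            values[codon] = counts[codon] / expected
    return values
```

Running `rscu(codon_counts(highly_expressed))` and `rscu(codon_counts(all_genes))` on two hypothetical lists of sequences gives the two distributions that can then be compared codon by codon.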

TU patterns

Overlapping tetranucleotide words were counted in the bacterial nucleotide sequences by shifting the window in steps of 1 nucleotide. The total word number in a circular sequence equals the sequence length. The observed counts of words (Co) were compared with the expected counts of words (Ce). Assuming the same distribution frequency for all words irrespective of their composition and sequence mononucleotide content, Ce equals the ratio of the sequence length to the number of different tetranucleotide words Nw (256 for tetranucleotides).

The deviation Δw of observed from expected counts is given by

\[ \Delta_w = (C_o - C_e) \times C_o^{-1} \]

For the comparison of sequences by TU patterns, the words in each sequence were ranked by Δw values. Rank numbers instead of word counts were used to simplify pattern comparison and to remove sequence length bias.

The distance D between two patterns was calculated as the sum of absolute distances between ranks of identical words in patterns i and j and expressed as a percentage of the maximal possible distance:

\[ D(\%) = 100 \times \frac{\sum_{w} \left| \mathrm{rank}_{w,i} - \mathrm{rank}_{w,j} \right|}{D_{\max}} \]

where

\[ D_{\max} = \frac{N_w (N_w - 1)}{2} \]

Dmax is the maximal distance that is theoretically possible between two patterns. For TU patterns Nw is 256. For more information about methods of oligonucleotide usage statistics see Reva and Tümmler (2004; 2005).
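The counting, ranking and distance steps of this subsection can be sketched as follows. The example treats each sequence as circular and ranks the words by observed count; because Ce is the same for every word, this gives the same order as ranking by Δw. The alphabetical tie-breaking rule and all function names are assumptions made for this illustration, not part of the published method.

```python
from itertools import product

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetranucleotides


def tetra_counts(seq):
    """Count overlapping tetranucleotides in a circular sequence."""
    seq = seq.upper()
    extended = seq + seq[:3]  # wrap around so every position starts a word
    counts = dict.fromkeys(WORDS, 0)
    for i in range(len(seq)):
        word = extended[i:i + 4]
        if word in counts:  # words with ambiguous bases are skipped
            counts[word] += 1
    return counts


def ranks(counts):
    """Rank words from most to least frequent (ties broken alphabetically)."""
    ordered = sorted(WORDS, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def tu_distance(seq_i, seq_j):
    """Distance D (%) between the TU patterns of two sequences."""
    rank_i = ranks(tetra_counts(seq_i))
    rank_j = ranks(tetra_counts(seq_j))
    n_w = len(WORDS)
    d_max = n_w * (n_w - 1) / 2
    total = sum(abs(rank_i[w] - rank_j[w]) for w in WORDS)
    return 100.0 * total / d_max
```

A local pattern, such as one of the 8 kb regions in Table 1, would be compared with the chromosomal pattern by passing the two corresponding sequences to `tu_distance`.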

Origin plot

The origin plot was constructed as described in Worning and colleagues (2006). In brief, the difference between a hypothetical leading and lagging strand is plotted for various positions on the chromosome. The frequencies of all oligonucleotides from 2-mers to 8-mers on the leading and lagging strands in a 60% window were counted, and the information content was calculated and summed over all oligonucleotides for every putative origin. The G/C and A/T weighted strand biases were included to distinguish between origin and terminus.

Structural profile of the promoter region

Each annotated gene was aligned at the translation start site and the average values for five DNA structural features (AT content, position preference, stacking energy, intrinsic curvature, DNase sensitivity; see section on the Genome atlas) were calculated at each position in the alignment. The values were subsequently centred and scaled, and smoothed within a 5 bp window using Gaussian smoothing.
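The profile calculation can be sketched in three steps: averaging each aligned position over all genes, converting the averages into z-scores, and smoothing. The sketch assumes that the per-gene values (for example, curvature at each position from −400 bp to +400 bp) are already available as equally long lists, and it uses a plain moving average as a stand-in for the Gaussian smoothing; the function names are hypothetical.

```python
from statistics import mean


def position_means(profiles):
    """Average a structural property at each aligned position over all genes."""
    return [mean(column) for column in zip(*profiles)]


def z_scores(values, chrom_mean, chrom_sd):
    """Centre and scale values with the chromosome-wide mean and SD."""
    return [(v - chrom_mean) / chrom_sd for v in values]


def smooth(values, window=5):
    """Moving average over `window` positions (truncated at the ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed
```

A profile for one property would then be `smooth(z_scores(position_means(profiles), chrom_mean, chrom_sd))`, where `chrom_mean` and `chrom_sd` are computed over the entire chromosome.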

Acknowledgements<br />

The analysis has been performed within the framework of the 'Task Force Genome Linguistics' of the competence network 'Genome Research on Bacteria Relevant for Agriculture, Environment and Biotechnology' funded by the Federal Ministry of Education and Research (BMBF), Germany (Contracts 031U213D and 031U113D). We thank Peter Golyshin, Vitor Martins dos Santos and Kenneth N. Timmis, Helmholtz Center for Infection Research, Braunschweig, for stimulating discussions during the initiation of the study and Olaf Kaiser, Lehrstuhl für Genetik, Universität Bielefeld, for the provision of sequence data at an early stage of the sequencing project. O.R. has been a recipient of a postdoctoral stipend of the DFG-sponsored International Training Group 'Pseudomonas: Pathogenicity and Biotechnology'.

References<br />

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.,<br />

Zhang, Z., Miller, W., <strong>and</strong> Lipman, D.J. (1997) Gapped<br />

BLAST <strong>and</strong> PSI-BLAST: a new generation of prote<strong>in</strong><br />

database search programs. Nucleic Acids Res 25: 3389–<br />

3402.<br />

Baldi, P., <strong>and</strong> Baisnee, P.F. (2000) Sequence analysis by<br />

additive scales: DNA structure for sequences <strong>and</strong> repeats<br />

of all lengths. Bio<strong>in</strong>formatics 16: 865–889.<br />

van Beilen, J.B., Panke, S., Lucch<strong>in</strong>i, S., Franch<strong>in</strong>i, A.G.,<br />

Rothlisberger, M., <strong>and</strong> Witholt, B. (2001) Analysis of<br />

Pseudomonas putida alkane-degradation gene clusters<br />

<strong>and</strong> flank<strong>in</strong>g <strong>in</strong>sertion sequences: evolution <strong>and</strong> regulation<br />

of the alk genes. Microbiology 147: 1621–1630.<br />

van Beilen, J.B., Mar<strong>in</strong>, M.M., Smits, T.H.M., Röthlisberger,<br />

M., Franch<strong>in</strong>i, A.G., Witholt, B., <strong>and</strong> Rojo, F. (2004)<br />

Characterization of two alkane hydroxylase genes from<br />

the mar<strong>in</strong>e hydrocarbonoclastic bacterium Alcanivorax<br />

borkumensis. Environ Microbiol 6: 264–273.<br />

Brukner, I., Sanchez, R., Suck, D., <strong>and</strong> Pongor, S. (1995)<br />

Sequence-dependent bend<strong>in</strong>g propensity of DNA as<br />

revealed by DNase I: parameters for tr<strong>in</strong>ucleotides. EMBO<br />

J 14: 1812–1818.<br />

<strong>Comparative</strong> genomics of Alcanivorax borkumensis 11<br />

Chen, X.-H., Koumoutsi, A., Scholz, R., Eisenreich, A.,<br />

Schneider, K., Schneider, I., et al. (2007) <strong>Comparative</strong><br />

analysis of the complete genome sequence of the plant<br />

growth promot<strong>in</strong>g Bacillus amyloliquefaciens FZB42.<br />

Nat Biotechnol 25: 1007–1014.<br />

Dlakic, M., Ussery, D., <strong>and</strong> Brunak, S. (2004) DNA bendability<br />

<strong>and</strong> nucleosome position<strong>in</strong>g <strong>in</strong> transcriptional<br />

regulation. In DNA Conformation <strong>and</strong> Transcription.<br />

Ohyama, T. (ed.). Aust<strong>in</strong>, TX: L<strong>and</strong>es Bioscience, pp. 198–<br />

211.<br />

Dobr<strong>in</strong>dt, U., Hochhut, B., Hentschel, U., <strong>and</strong> Hacker, J.<br />

(2004) Genomic isl<strong>and</strong>s <strong>in</strong> pathogenic <strong>and</strong> environmental<br />

microorganisms. Nat Rev Microbiol 2: 414–424.<br />

Giovannoni, S.J., Tripp, H.J., Givan, S., Podar, M., Verg<strong>in</strong>,<br />

K.L., Baptista, D., et al. (2005) Genome streaml<strong>in</strong><strong>in</strong>g <strong>in</strong> a<br />

cosmopolitan oceanic bacterium. Science 309: 1242–<br />

1245.<br />

Golysh<strong>in</strong>, P.N., Mart<strong>in</strong>s Dos Santos, V.A., Kaiser, O., Ferrer,<br />

M., Sabirova, Y.S., Lunsdorf, H., et al. (2003) Genome<br />

sequence completed of Alcanivorax borkumensis, a<br />

hydrocarbon-degrad<strong>in</strong>g bacterium that plays a global role<br />

<strong>in</strong> oil removal from mar<strong>in</strong>e systems. J Biotechnol 106:<br />

215–220.<br />

Hara, A., Syutsubo, K., <strong>and</strong> Harayama, S. (2003) Alcanivorax<br />

which prevails <strong>in</strong> oil-contam<strong>in</strong>ated seawater exhibits broad<br />

substrate specificity for alkane degradation. Environ<br />

Microbiol 5: 746–753.<br />

Hara, A., Baik, S.H., Syutsubo, K., Misawa, N., Smits, T.H.,<br />

van Beilen, J.B., <strong>and</strong> Harayama, S. (2004) Clon<strong>in</strong>g <strong>and</strong><br />

functional analysis of alkB genes <strong>in</strong> Alcanivorax borkumensis<br />

SK2. Environ Microbiol 6: 191–197.<br />

Harayama, S., Kishira, H., Kasai, Y., <strong>and</strong> Shutsubo, K. (1999)<br />

Petroteum biodegradation <strong>in</strong> mar<strong>in</strong>e environments. J Mol<br />

Microbiol Biotechnol 1: 63–70.<br />

Jensen, L.J., Friis, C., <strong>and</strong> Ussery, D.W. (1999) Three<br />

views of microbial genomes. Res Microbiol 150: 773–<br />

777.<br />

Kasai, Y., Kishira, H., Sasaki, I., Syutsubo, K., Watanabe, K.,<br />

<strong>and</strong> Harama, S. (2002) Prodom<strong>in</strong>ant growth of Alcanivorax<br />

stra<strong>in</strong>s <strong>in</strong> oil-contam<strong>in</strong>ated <strong>and</strong> nutrient-supplemented sea<br />

water. Environ Microbiol 4: 141–147.<br />

Kasai, Y., Kishira, H., Syutsubo, K., <strong>and</strong> Harayama, S. (2001)<br />

Molecular detection of mar<strong>in</strong>e bacterial populations on<br />

beaches contam<strong>in</strong>ated by the Nakhodka tanker oilaccident.<br />

Environ Microbiol 3: 246–255.<br />

Klockgether, J., Würdemann, D., Reva, O., Wiehlmann, L.,<br />

<strong>and</strong> Tümmler, B. (2007) Diversity of the abundant<br />

pKLC102/PAGI-2 family of genomic isl<strong>and</strong>s <strong>in</strong> Pseudomonas<br />

aerug<strong>in</strong>osa. J Bacteriol 189: 2443–2459.<br />

Luiten, R.G., Putterman, D.G., Schoenmakers, J.G.,<br />

Kon<strong>in</strong>gs, R.N., <strong>and</strong> Day, L.A. (1985) Nucleotide sequence<br />

of the genome of Pf3, an IncP-1 plasmid-specific filamentous<br />

bacteriophage of Pseudomonas aerug<strong>in</strong>osa. J Virol<br />

56: 268–276.<br />

McKew, B.A., Coulon, F., Osborn, A.M., Timmis, K.N., <strong>and</strong><br />

McGenity, T.J. (2007a) Determ<strong>in</strong><strong>in</strong>g the identity <strong>and</strong> roles<br />

of oil-metaboliz<strong>in</strong>g mar<strong>in</strong>e bacteria from the Thames<br />

estuary, UK. Environ Microbiol 9: 165–176.<br />

McKew, B.A., Coulon, F., Yakimov, M.M., Denaro, R., Genovese,<br />

M., Smith, C.J., et al. (2007b) Efficacy of <strong>in</strong>tervention<br />

strategies for bioremediation of crude oil <strong>in</strong> mar<strong>in</strong>e<br />


systems <strong>and</strong> effects on <strong>in</strong>digenous hydrocarbonoclastic<br />

bacteria. Environ Microbiol 9: 1562–1571.<br />

Nelson, K.E., We<strong>in</strong>el, C., Paulsen, I.T., Dodson, R.J., Hilbert,<br />

H., Mart<strong>in</strong>s dos Santos, V.A., et al. (2002) Complete<br />

genome sequence <strong>and</strong> comparative analysis of the metabolically<br />

versatile Pseudomonas putida KT2440. Environ<br />

Microbiol 4: 799–808.<br />

Ornste<strong>in</strong>, R., Re<strong>in</strong>, R., Breen, D., <strong>and</strong> MacElroy, R. (1978)<br />

An optimized potential function for the calculation of<br />

nucleic acid <strong>in</strong>teraction energies. Biopolymers 17: 2341–<br />

2360.<br />

Pedersen, A.G., Jensen, L.J., Brunak, S., Staerfeldt, H.H.,<br />

<strong>and</strong> Ussery, D.W. (2000) A DNA structural atlas for<br />

Escherichia coli. J Mol Biol 299: 907–930.<br />

Pride, D.T., Me<strong>in</strong>ersmann, R.J., Wassenaar, T.M., <strong>and</strong><br />

Blaser, M.J. (2003) Evolutionary implications of microbial<br />

genome tetranucleotide frequency biases. Genome Res<br />

13: 145–158.<br />

Reva, O.N., <strong>and</strong> Tümmler, B. (2004) Global features of<br />

sequences of bacterial chromosomes, plasmids <strong>and</strong><br />

phages revealed by analysis of oligonucleotide usage<br />

patterns. BMC Bio<strong>in</strong>formatics 5: 90.<br />

Reva, O.N., <strong>and</strong> Tümmler, B. (2005) Differentiation of regions<br />

with atypical oligonucleotide composition <strong>in</strong> bacterial<br />

genomes. BMC Bio<strong>in</strong>formatics 6: 251.<br />

Röl<strong>in</strong>g, W.F., Milner, M.G., Jones, D.M., Lee, K., Daniel, F.,<br />

Swannell, R.J., et al. (2002) Robust hydrocarbon degradation<br />

and dynamics of bacterial communities during nutrient-enhanced<br />

oil spill bioremediation. Appl Environ Microbiol<br />

68: 5537–5548.<br />

Sabirova, J.S., Ferrer, M., Regenhardt, D., Timmis, K.N., <strong>and</strong><br />

Golysh<strong>in</strong>, P.N. (2006) Proteomic <strong>in</strong>sights <strong>in</strong>to metabolic<br />

adaptations <strong>in</strong> Alcanivorax borkumensis <strong>in</strong>duced by alkane<br />

utilization. J Bacteriol 188: 3763–3773.<br />

Saitou, N., <strong>and</strong> Nei, M. (1987) The neighbor-jo<strong>in</strong><strong>in</strong>g method:<br />

a new method for reconstruct<strong>in</strong>g phylogenetic trees. Mol<br />

Biol Evol 4: 406–425.<br />

Satchwell, S.C., Drew, H.R., <strong>and</strong> Travers, A.A. (1986)<br />

Sequence periodicities <strong>in</strong> chicken nucleosome core DNA.<br />

J Mol Biol 191: 659–675.<br />

Schneiker, S., Mart<strong>in</strong>s dos Santos, V.A., Bartels, D., Bekel,<br />

T., Brecht, M., Buhrmester, J., et al. (2006) Genome<br />

sequence of the ubiquitous hydrocarbon-degrad<strong>in</strong>g mar<strong>in</strong>e<br />

bacterium Alcanivorax borkumensis. Nat Biotechnol 24:<br />

997–1004.<br />

Sharp, P.M., <strong>and</strong> Li, W.H. (1986) Codon usage <strong>in</strong> regulatory<br />

genes <strong>in</strong> Escherichia coli does not reflect selection for ‘rare’<br />

codons. Nucleic Acids Res 14: 7737–7749.<br />

Shpigelman, E.S., Trifonov, E.N., <strong>and</strong> Bolshoy, A. (1993)<br />

CURVATURE: software for the analysis of curved DNA.<br />

Comput Appl Biosci 9: 435–440.<br />

Sicheritz-Ponten, T., and Andersson, S.G. (2001) A phylogenomic<br />

approach to microbial evolution. Nucleic Acids Res<br />

29: 545–552.<br />

Skovgaard, M., Jensen, L.J., Friis, C., Stærfeldt, H.H.,<br />

Worn<strong>in</strong>g, P., Brunak, S., <strong>and</strong> Ussery, D.W. (2002) The<br />

atlas visualisation of genome-wide <strong>in</strong>formation. In Methods<br />

<strong>in</strong> Microbiology. Wren, B., <strong>and</strong> Dorrell, N. (eds). London,<br />

UK: Academic Press, pp. 49–63.<br />

Smits, T.H., Balada, S.B., Witholt, B., <strong>and</strong> van Beilen, J.B.<br />

(2002) Functional analysis of alkane hydroxylases from<br />

gram-negative <strong>and</strong> gram-positive bacteria. J Bacteriol 184:<br />

1733–1742.<br />

Spangenberg, C., Fislage, R., Röml<strong>in</strong>g, U., <strong>and</strong> Tümmler, B.<br />

(1997) Disrespectful type IV pil<strong>in</strong>s. Mol Microbiol 25: 203–<br />

204.<br />

Stover, C.K., Pham, X.Q., Erw<strong>in</strong>, A.L., Mizoguchi, S.D., Warrener,<br />

P., Hickey, M.J., et al. (2000) Complete genome<br />

sequence of Pseudomonas aeruginosa PAO1, an opportunistic<br />

pathogen. Nature 406: 959–964.<br />

Syutsubo, K., Kishira, H., <strong>and</strong> Harayama, S. (2001) Development<br />

of specific oligonucleotide probes for the identification<br />

and in situ detection of hydrocarbon-degrading<br />

Alcanivorax stra<strong>in</strong>s. Environ Microbiol 3: 371–379.<br />

Teel<strong>in</strong>g, H., Meyerdierks, A., Bauer, M., Amann, R., <strong>and</strong><br />

Glockner, F.O. (2004) Application of tetranucleotide<br />

frequencies for the assignment of genomic fragments.<br />

Environ Microbiol 6: 938–947.<br />

Ussery, D., Soumpasis, D.M., Brunak, S., Staerfeldt, H.H.,<br />

Worn<strong>in</strong>g, P., <strong>and</strong> Krogh, A. (2002) Bias of pur<strong>in</strong>e stretches<br />

<strong>in</strong> sequenced chromosomes. Comput Chem 26: 531–541.<br />

Wang, H., Noordewier, M., <strong>and</strong> Benham, C.J. (2004) Stress-<br />

Induced DNA Duplex destabilization (SIDD) <strong>in</strong> the E. coli<br />

genome: SIDD sites are closely associated with promoters.<br />

Genome Res 14: 1575–1584.<br />

We<strong>in</strong>el, C., Nelson, K.E., <strong>and</strong> Tümmler, B. (2002) Global<br />

features of the Pseudomonas putida KT2440 genome<br />

sequence. Environ Microbiol 4: 809–818.<br />

Willenbrock, H., <strong>and</strong> Ussery, D.W. (2007) Prediction of highly<br />

expressed genes <strong>in</strong> microbes based on chromat<strong>in</strong><br />

accessibility. BMC Mol Biol 8: 11.<br />

Willenbrock, H., Friis, C., Juncker, A.S., <strong>and</strong> Ussery, D.W.<br />

(2006) An environmental signature for 323 microbial<br />

genomes based on codon adaptation <strong>in</strong>dices. Genome Biol<br />

7: R114.<br />

Worn<strong>in</strong>g, P., Jensen, L.J., Hall<strong>in</strong>, P.F., Staerfeldt, H.H., <strong>and</strong><br />

Ussery, D.W. (2006) Orig<strong>in</strong> of replication <strong>in</strong> circular<br />

prokaryotic chromosomes. Environ Microbiol 8: 353–<br />

361.<br />

Yakimov, M.M., Golysh<strong>in</strong>, P.N., Lang, S., Moore, E.R.,<br />

Abraham, W.R., Lunsdorf, H., <strong>and</strong> Timmis, K.N. (1998)<br />

Alcanivorax borkumensis gen. nov., sp. nov., a new,<br />

hydrocarbon-degrad<strong>in</strong>g <strong>and</strong> surfactant-produc<strong>in</strong>g mar<strong>in</strong>e<br />

bacterium. Int J Syst Bacteriol 48: 339–348.<br />

©2007TheAuthors<br />

Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology


Paper III: Global features of the Alcanivorax borkumensis SK2 genome



2.9 Paper IV: The orig<strong>in</strong>s of Vibrio species<br />



Microb Ecol<br />

DOI 10.1007/s00248-009-9596-7<br />

MINIREVIEWS<br />

On the Orig<strong>in</strong>s of a Vibrio Species<br />

Tammi Vesth & Trudy M. Wassenaar & Peter F. Hall<strong>in</strong> &<br />

Lars Snipen & Kar<strong>in</strong> Lagesen & David W. Ussery<br />

Received: 3 July 2009 /Accepted: 17 September 2009<br />

© The Author(s) 2009. This article is published with open access at Springerlink.com<br />

Abstract Thirty-two genome sequences of various Vibrionaceae<br />

members are compared, with emphasis on what<br />

makes V. cholerae unique. As few as 1,000 gene families<br />

are conserved across all the Vibrionaceae genomes analysed;<br />

this fraction roughly doubles for gene families<br />

conserved with<strong>in</strong> the species V. cholerae. Of these,<br />

approximately 200 gene families that cluster on various<br />

locations of the genome are not found <strong>in</strong> other sequenced<br />

Vibrionaceae; these are possibly unique to the V. cholerae<br />

species. By compar<strong>in</strong>g gene family content of the analysed<br />

genomes, the relatedness to a particular species is identified<br />

for two unspeciated genomes. Conversely, two genomes<br />

presumably belonging to the same species have suspiciously<br />

dissimilar gene family content. We are able to identify a<br />

number of genes that are conserved in, and unique to, V.<br />

cholerae. Some of these genes may be crucial to the niche<br />

adaptation of this species.<br />

T. Vesth : T. M. Wassenaar : P. F. Hallin : L. Snipen :<br />

K. Lagesen : D. W. Ussery (*)<br />

Center for Biological Sequence Analysis,<br />

Department of Systems Biology,<br />

The Technical University of Denmark,<br />

Building 208,<br />

2800 Kgs. Lyngby, Denmark<br />

e-mail: dave@cbs.dtu.dk<br />

T. M. Wassenaar<br />

Molecular Microbiology and Genomics Consultants,<br />

Zotzenheim, Germany<br />

P. F. Hallin<br />

Novozymes A/S,<br />

Krogshøjvej 36,<br />

2880 Bagsværd, Denmark<br />

L. Snipen<br />

Biostatistics, Department of Chemistry, Biotechnology,<br />

and Food Sciences, Norwegian University of Life Sciences,<br />

Ås, Norway<br />

K. Lagesen<br />

Centre for Molecular Biology and Neuroscience and Institute<br />

of Medical Microbiology, University of Oslo,<br />

Oslo, Norway<br />

Introduction<br />

The species concept for bacteria has long been under siege<br />

from several angles, <strong>and</strong> now with thous<strong>and</strong>s of bacterial<br />

genomes be<strong>in</strong>g sequenced, the disputes have <strong>in</strong>tensified [8].<br />

One frequently used def<strong>in</strong>ition of a bacterial species is “a<br />

category that circumscribes a (preferably) genomically<br />

coherent group of <strong>in</strong>dividual isolates/stra<strong>in</strong>s shar<strong>in</strong>g a high<br />

degree of similarity <strong>in</strong> (many) <strong>in</strong>dependent features,<br />

comparatively tested under highly st<strong>and</strong>ardized conditions”<br />

[12]. Such <strong>in</strong>dependent features are usually phenotypes that<br />

can easily be tested. For a new species to be def<strong>in</strong>ed,<br />

amongst other criteria, <strong>in</strong>ter-species DNA–DNA hybridisation<br />

has to be below 70%, although this rule is not<br />

without its limitations [18]. In the late 1970s <strong>and</strong> 1980s, the<br />

16S rRNA gene sequence was <strong>in</strong>troduced as a molecular<br />

clock that could be used to <strong>in</strong>fer phylogenetic relationships<br />

[50]. Ideally, isolates belong<strong>in</strong>g to the same species have<br />

identical or nearly identical 16S rRNA genes, <strong>and</strong> these<br />

differ from isolates belong<strong>in</strong>g to different species [32, 44].<br />

In practice, this is not always the case. Examples exist of<br />

different species shar<strong>in</strong>g identical rRNA genes (for<br />

instance, E. coli and Shigella [37], which are even placed in<br />

different genera); in addition, isolates of one species can<br />

have rRNA genes whose identity falls below the 97% that is<br />

considered to demarcate species [4]. Lateral transfer of<br />

genetic material (to which ribosomal genes are believed to<br />

be resistant) destroys the phylogenetic relationship, so that


phylogenies based on alternative housekeep<strong>in</strong>g genes can<br />

differ from a 16S rRNA tree <strong>and</strong> frequently are not even <strong>in</strong><br />

accordance with each other. Such observations question the<br />

validity of a phylogenetic tree as the most suitable model<br />

for bacterial ancestry, when multiple genetic transfers<br />

would produce a network-like evolutionary structure [6].<br />

On the other h<strong>and</strong>, it is observed that lateral gene transfer is<br />

most frequent between genetically related members shar<strong>in</strong>g<br />

a similar base content <strong>and</strong> occupy<strong>in</strong>g the same ecological<br />

niche [29]. Nevertheless, a core of genes can be recognised<br />

that produce coherent phylogenetic trees, though these may<br />

not represent the species’ complete evolutionary history as<br />

they comprise only a m<strong>in</strong>or fraction of the genetic content<br />

of the organism [35].<br />

Whether a tree or a network is more accurate to describe<br />

phylogeny, <strong>in</strong> either case bacterial species may be considered<br />

as a cloud of isolates hav<strong>in</strong>g a higher level of genetic<br />

similarity to each other than to organisms belong<strong>in</strong>g to a<br />

different species. When such clouds have fuzzy <strong>and</strong><br />

overlapp<strong>in</strong>g borders, the species concept falls apart but that<br />

will only apply to certa<strong>in</strong> cases [7]. S<strong>in</strong>ce 16S rRNA genes<br />

are not <strong>in</strong>formative on the level of diversity with<strong>in</strong> a<br />

species, the 'density' of a cloud of isolates mak<strong>in</strong>g up a<br />

species cannot be determ<strong>in</strong>ed by this gene. Those genes<br />

shared by all isolates belong<strong>in</strong>g to one species comprise the<br />

core genome of that species [39], <strong>and</strong> the degree of<br />

diversity <strong>in</strong> the rema<strong>in</strong><strong>in</strong>g non-core genes determ<strong>in</strong>es the<br />

density of the species cloud.<br />

We hypothesised that certa<strong>in</strong> genes can be recognised as<br />

specific to a particular species, to be conserved <strong>in</strong> that<br />

species but not present <strong>in</strong> related species. We tested our<br />

hypothesis with complete genome sequences of the bacterial<br />

family Vibrionaceae, which belongs to the γ-Proteobacteria<br />

and comprises eight genera. Most available<br />

genome sequences belong to the genus Vibrio. This genus<br />

conta<strong>in</strong>s 51 recognised species [10, 46] which are ma<strong>in</strong>ly<br />

found <strong>in</strong> mar<strong>in</strong>e environments, frequently liv<strong>in</strong>g <strong>in</strong> association<br />

with mar<strong>in</strong>e organisms such as corals, fish, squid or<br />

zooplankton. Most of them are symbionts <strong>and</strong> only a few<br />

are human pathogens, notably particular serotypes of V.<br />

cholerae produc<strong>in</strong>g cholera, Vibrio parahaemolyticus<br />

(causing gastroenteritis) and V. vulnificus (causing wound<br />

<strong>in</strong>fections) [46]. Other Vibrionaceae, <strong>in</strong>clud<strong>in</strong>g V. vulnificus,<br />

Aliivibrio salmonicida <strong>and</strong> V. harveyi, are fish or<br />

shellfish pathogens <strong>and</strong> have major economic impact.<br />

Photobacterium profundum, represent<strong>in</strong>g another genus<br />

with<strong>in</strong> the Vibrionaceae, was also <strong>in</strong>cluded.<br />

The gene content of 32 available sequenced Vibrionaceae<br />

genomes was compared <strong>and</strong> the results were analysed <strong>in</strong><br />

various ways. The data allowed us to identify possible V.<br />

cholerae-specific genes, since this species was represented<br />

by 18 genomes, a sufficient number to test<br />

conservation both within the species and across species.<br />

We found that a two-component signal transduction pathway<br />

is conserved in V. cholerae and is not found outside<br />

this species. Our findings further indicate that a<br />

relatively small set of genes could confer the niche specialisation<br />

that allowed V. cholerae to adapt to a unique environment,<br />

so that over time V. cholerae has become a distinct species.<br />

Materials <strong>and</strong> Methods<br />

Genomes <strong>and</strong> Gene Annotations Used<br />

Publicly available genome sequences of Vibrionaceae were<br />

selected that were provided in fewer than 300 contigs and in<br />

which a full-length 16S rRNA sequence could be found using<br />

the rRNA gene f<strong>in</strong>der RNAmmer [19]. The 32 genome<br />

sequences <strong>in</strong>cluded are shown <strong>in</strong> Table 1.<br />

The gene annotations as provided <strong>in</strong> GenBank were<br />

used, except for those genomes marked “Easygene” <strong>in</strong><br />

Table 1 where prote<strong>in</strong> annotation was not available <strong>in</strong> the<br />

RefSeq file at the time of analysis, <strong>and</strong> we used EasyGene<br />

[20] to identify the genes. As a control, an available<br />

GenBank annotation was compared to a generated Easygene<br />

annotation to confirm that the number of identified<br />

genes was comparable.<br />

Ribosomal RNA Analysis<br />

RNAmmer [19] was used to identify 16S rRNA sequences<br />

with<strong>in</strong> the 32 genomes. Sequences were considered reliable<br />

if they were between 1,400 <strong>and</strong> 1,700 nucleotides long <strong>and</strong><br />

had an RNAmmer score above 1,800. In cases where the<br />

program found multiple <strong>and</strong> variable 16S sequences with<strong>in</strong><br />

a genome, one of these (with satisfactory RNAmmer<br />

scores) was arbitrarily chosen. The sequences were aligned<br />

us<strong>in</strong>g PRANK [23, 24], <strong>and</strong> the program MEGA4 was used<br />

to elucidate a phylogenetic tree [45]. With<strong>in</strong> MEGA4, the<br />

tree was created us<strong>in</strong>g the Neighbor-Jo<strong>in</strong><strong>in</strong>g method with<br />

the uniform rate Jukes–Cantor distance measure <strong>and</strong> the<br />

complete-delete option. Five hundred resampl<strong>in</strong>gs were<br />

done to f<strong>in</strong>d the bootstrap values.<br />
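
The reliability filter and the distance measure named above can be illustrated with a minimal Python sketch. It is not the RNAmmer or MEGA4 implementation: the function names are hypothetical, the length bounds are assumed to be inclusive, and gaps are dropped pairwise here, whereas a complete-deletion treatment removes an alignment column as soon as any sequence in the data set has a gap in it. The Jukes-Cantor distance itself is the standard formula d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites.

```python
import math


def is_reliable_16s(length_nt, rnammer_score):
    """Criteria stated in the text: 1,400 to 1,700 nt (assumed inclusive)
    and an RNAmmer score above 1,800."""
    return 1400 <= length_nt <= 1700 and rnammer_score > 1800


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) for two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped; this
    pairwise treatment is an assumption made for this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    # p is the observed proportion of differing sites
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The resulting pairwise distances are what a neighbour-joining method would take as input; the tree construction and the bootstrap resampling are not sketched here.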

Table 1 Vibrionaceae genomes used in this analysis<br />

GPID | Organism | Contigs | Accession/GenBank | Status | No. of genes | Ref.<br />
36 | V. cholerae N16961 (a) | 2 | AE003852.1 | Fully sequenced | 3,828 | [15]<br />
15667 | V. cholerae O395 TIGR (a) | 2 | CP000626.1 | Fully sequenced | 3,875 | [11]<br />
32853 | V. cholerae O395 TEDA (a) | 2 | CP001235.1 | Fully sequenced | 3,934 | [49]<br />
33555 | V. cholerae MJ-1236 (a) | 2 | CP001485.1 | Fully sequenced | 3,774 | [31]<br />
15666 | V. cholerae MO10 (a) | 153 | NZ_AAKF00000000 | Unfinished (Easygene) | 3,421 | [5]<br />
15670 | V. cholerae V52 (a) | 268 | NZ_AAKJ00000000 | Unfinished (NCBI) | 3,815 | [16]<br />
33559 | V. cholerae BX330286 (a) | 8 | NZ_ACIA00000000 | Unfinished (NCBI) | 3,632 | [31]<br />
33557 | V. cholerae B33 (a) | 17 | NZ_ACHZ00000000 | Unfinished (NCBI) | 3,748 | [31]<br />
33553 | V. cholerae RC9 (a) | 11 | NZ_ACHX00000000 | Unfinished (NCBI) | 3,811 | [31]<br />
32851 | V. cholerae M66-2 | 2 | CP001233.1 | Fully sequenced | 3,693 | [49]<br />
18495 | V. cholerae MZO-2 | 162 | NZ_AAWF00000000 | Unfinished (NCBI) | 3,425 | [16]<br />
18265 | V. cholerae 1587 | 254 | NZ_AAUR00000000 | Unfinished (NCBI) | 3,758 | [16]<br />
18253 | V. cholerae 2740-80 | 257 | NZ_AAUT00000000 | Unfinished (NCBI) | 3,771 | [16]<br />
17723 | V. cholerae AM-19226 | 154 | NZ_AATY00000000 | Unfinished (Easygene) | 3,407 | [33]<br />
33561 | V. cholerae 12129 | 12 | NZ_ACFQ00000000 | Unfinished (NCBI) | 3,574 | [31]<br />
33549 | V. cholerae VL426 | 5 | NZ_ACHV00000000 | Unfinished (NCBI) | 3,461 | [31]<br />
33579 | V. cholerae TM 11079-80 | 35 | NZ_ACHW00000000 | Unfinished (NCBI) | 3,621 | [31]<br />
33551 | V. cholerae TMA 21 | 20 | NZ_ACHY00000000 | Unfinished (NCBI) | 3,600 | [31]<br />
13564 | V. campbellii AND4 | 143 | NZ_ABGR00000000 | Unfinished (NCBI) | 3,935 | [13]<br />
19857 | V. harveyi BAA-1116 | 3 | CP000789.1 | Fully sequenced | 6,064 | [1]<br />
349 | V. vulnificus CMCP6 | 2 | AE016795.2 | Fully sequenced | 4,538 | [38]<br />
1430 | V. vulnificus YJ016 | 3 | BA000037.2 | Fully sequenced | 5,028 | [3]<br />
19397 | V. shilonii AK1 | 158 | NZ_ABCH00000000 | Unfinished (NCBI) | 5,360 | [41]<br />
15693 | Vibrio sp. Ex25 | 222 | NZ_AAKK00000000 | Unfinished (Easygene) | 4,004 | [16]<br />
13616 | Vibrio sp. MED222 | 99 | NZ_AAND00000000 | Unfinished (NCBI) | 4,590 | [36]<br />
32815 | V. splendidus LGP32 | 2 | FM954973.1 | Fully sequenced | 4,434 | [27]<br />
19395 | V. parahaemolyticus 16 | 78 | NZ_ACCV00000000 | Unfinished (Easygene) | 3,780 | [9]<br />
360 | V. parahaemolyticus 2210633 | 2 | BA000031.2 | Fully sequenced | 4,832 | [25]<br />
12986 | A. fischeri ES114 | 3 | CP000020.1 | Fully sequenced | 3,823 | [42]<br />
19393 | A. fischeri MJ11 | 3 | CP001133.1 | Fully sequenced | 4,039 | [26]<br />
30703 | A. salmonicida LFI1238 | 6 | FM178379.1 | Fully sequenced | 4,284 | [17]<br />
13128 | P. profundum SS9 | 3 | CR354531.1 | Fully sequenced | 5,480 | [48]<br />

GPID: genome project identifier at NCBI. Contigs: the number of contiguous sequences, which for a completely sequenced genome is at least two<br />

(for two chromosomes) and can be up to six when plasmids are present. Unfinished sequences are represented by multiple contigs per<br />

chromosome<br />

(a) Strains containing the genes encoding the cholera enterotoxin subunits are indicated<br />

Pan-Genome Family Clustering<br />

A clustering of genomes by shared gene families in the Vibrio<br />

pan-genome was constructed from BLASTP similarities computed<br />

with default settings. A BLASTP hit was considered<br />

significant if the alignment produced at least 50% identity<br />

for at least 50% of the length of the longer gene (either<br />

query or subject). Using this criterion, each pair of genes<br />

producing a significant reciprocal best hit was scored as<br />

belonging to the same gene family. A genome matrix was<br />

constructed, containing one row for each genome and one<br />

column for each gene family. Cell (i, j) in this matrix is 1 if<br />

genome i has a member <strong>in</strong> gene family j, 0 otherwise. A<br />

hierarchical clustering with average linkage, based on the<br />

Manhattan distance between genomes, was then performed.<br />

Two trees were made, one with more weight given to gene<br />

families present <strong>in</strong> most (90%, or between 27 <strong>and</strong> 30)<br />

Vibrio genomes (“stabilome”), <strong>and</strong> the other with more<br />

weight given to gene families present <strong>in</strong> only a few (two,<br />

three, or four) genomes (“mobilome”). Thus, the orig<strong>in</strong>al<br />

Boolean matrix is now scaled differently, depend<strong>in</strong>g on the<br />

number of genomes <strong>in</strong> each gene family [44]. For both<br />

trees, s<strong>in</strong>gletons (families which are only found <strong>in</strong> one<br />

genome) have been excluded.<br />
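
The unweighted core of this procedure (presence/absence matrix, removal of singletons, Manhattan distance, average linkage) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used for the analysis: the function names and the toy matrix are hypothetical, and the stabilome and mobilome weighting is not reproduced because the text does not give the weights.

```python
from itertools import combinations


def manhattan(row_a, row_b):
    """Manhattan distance between two 0/1 rows: the number of gene
    families present in exactly one of the two genomes."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))


def drop_singletons(matrix):
    """Remove gene-family columns that are present in only one genome."""
    keep = [j for j in range(len(matrix[0]))
            if sum(row[j] for row in matrix) > 1]
    return [[row[j] for j in keep] for row in matrix]


def average_linkage(names, matrix):
    """Agglomerative clustering with average linkage.

    Returns the merges in order, each as
    (members of cluster 1, members of cluster 2, mean pairwise distance).
    """
    dist = {(i, j): manhattan(matrix[i], matrix[j])
            for i, j in combinations(range(len(names)), 2)}

    def mean_distance(c1, c2):
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        # merge the two clusters with the smallest mean pairwise distance
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: mean_distance(*pair))
        merges.append((tuple(names[i] for i in c1),
                       tuple(names[i] for i in c2),
                       mean_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy data (hypothetical): three genomes, five gene families
toy_names = ["A", "B", "C"]
toy_matrix = [[1, 1, 0, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 1, 0, 1]]
toy_merges = average_linkage(toy_names, drop_singletons(toy_matrix))
```

In the toy data, three of the five families occur in a single genome and are dropped, so genomes A and B become identical and merge first; C joins last.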

Figure 1 Phylogenetic tree of the 16S rRNA gene extracted from 32 sequenced Vibrio genomes listed in Table 1. Environmental V. cholerae lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae genomes are in dark green. Further colouring was used for species for which two genomes are represented<br />

Pan- and Core Genome Analysis<br />

The results of the BLAST analysis were also used to<br />

construct a pan- and core genome plot as follows. Based on<br />

the clustering in the pan-genome family tree, an ordered set<br />

of genomes was constructed with V. cholerae genomes at<br />

the start. For the first chosen genome, all BLAST hits found<br />

in the second genome were recorded and the cumulative<br />

number of gene families (as defined above) now recognised in<br />

total was plotted for the pan-genome. The number of gene<br />

families with at least one representative gene in both genomes<br />

was plotted for the core genome. A running total is plotted for<br />

the pan-genome, which increases as more genomes are added,<br />

whilst the core genome, representing conserved gene families,<br />

slowly decreases with the addition of more genomes.<br />
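
The accumulation described under Pan- and Core Genome Analysis amounts to a running union (pan-genome) and a running intersection (core genome) over the ordered genomes. A minimal Python sketch follows; it assumes that each genome has already been reduced to the set of gene-family identifiers it contains, and the function name and toy data are hypothetical.

```python
def pan_core_curve(genome_families):
    """Return (pan-genome size, core genome size) after each genome is added.

    genome_families: an ordered list of sets, one per genome, each holding
    the identifiers of the gene families found in that genome.
    """
    pan = set()
    core = None
    curve = []
    for families in genome_families:
        pan |= families                      # families seen in any genome so far
        if core is None:
            core = set(families)             # the first genome starts the core
        else:
            core &= families                 # families present in every genome so far
        curve.append((len(pan), len(core)))
    return curve


# Toy data (hypothetical family identifiers)
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
toy_curve = pan_core_curve(toy_genomes)
```

The pan-genome count can only grow or stay level and the core count can only shrink or stay level, which is why the order of the genomes changes the shape of the curves but not their end points.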

Whole-Genome BLAST Analysis <strong>and</strong> Construction<br />

of a BLAST Matrix<br />

The predicted genes of every genome (annotated or found<br />

by Easygene) were translated, and every gene was compared<br />

by BLASTP against every other genome and its own<br />

genome. In the latter case, the hit to self was ignored. The<br />

50/50 rule for BLAST hits as described above was used. If<br />

these requirements were met, genes were comb<strong>in</strong>ed <strong>in</strong> a<br />

gene family. The BLAST results were visualised <strong>in</strong> a<br />

BLAST matrix [2], which summarises the results of<br />

genomic pairwise comparisons <strong>and</strong> reports, both as percentage<br />

<strong>and</strong> as absolute numbers, the number of reciprocal<br />

BLAST hits as a fraction of the total number of gene<br />

families found <strong>in</strong> the two genomes. For easier visual<br />

inspection, the cells in the matrix are coloured darker as<br />

the fraction of similarity increases. Hits identified within a<br />

genome are differently coloured.<br />
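
As an illustration, the 50/50 rule and the value reported in one cell of the BLAST matrix can be written as two small Python functions. The argument names are hypothetical, the numbers in the example are invented, and parsing of BLASTP output is left out; the sketch only encodes the thresholds and the fraction stated in the text.

```python
def passes_50_50(percent_identity, alignment_length, query_length, subject_length):
    """50/50 rule: at least 50% identity over at least 50% of the length
    of the longer of the two genes (query or subject)."""
    longer = max(query_length, subject_length)
    return percent_identity >= 50.0 and alignment_length >= 0.5 * longer


def blast_matrix_cell(shared_families, total_families):
    """Percentage shown in a BLAST matrix cell: gene families shared by two
    genomes as a fraction of all gene families found in the two genomes."""
    return 100.0 * shared_families / total_families


# Invented example values
example_hit_ok = passes_50_50(62.0, 180, 300, 320)     # 180 >= 160
example_hit_short = passes_50_50(62.0, 150, 300, 320)  # 150 < 160
example_cell = blast_matrix_cell(50, 200)
```

Measuring coverage against the longer gene makes the criterion symmetric, so the outcome does not depend on which gene of the pair is used as the query.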

BLAST Atlas<br />

BLAST results were also visualised in a BLAST atlas, which<br />

shows, for every gene in the reference genome V.<br />

cholerae N16961, its best hit in each of the other genomes, again<br />

with a threshold of 50% identity over at least 50% of the<br />

length of the query protein. The atlas displays the hits as they<br />

are located in the reference strain [14]. The BLAST scores<br />

obtained for each queried gene are plotted, so that conserved<br />

<strong>and</strong> variable regions are located with respect to the reference<br />

genome. Note that genes absent <strong>in</strong> the reference genome are<br />

not shown <strong>in</strong> the lanes of the query genomes.<br />
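
The selection of one value per reference gene for a single atlas lane can be sketched as follows. This is a hypothetical illustration: the dictionary keys, the function name and the example hits are assumptions, and the drawing of the atlas itself is not covered.

```python
def best_hits_for_reference(reference_genes, hits):
    """For each reference gene, keep the highest-scoring hit in one other
    genome that has at least 50% identity over at least 50% of the query
    length. Genes without a qualifying hit map to None.

    hits: a list of dicts with the (assumed) keys 'query', 'identity',
    'aln_len', 'query_len' and 'score'.
    """
    best = {gene: None for gene in reference_genes}
    for hit in hits:
        gene = hit["query"]
        if gene not in best:
            continue  # hits to genes outside the reference are not shown
        if hit["identity"] < 50.0 or hit["aln_len"] < 0.5 * hit["query_len"]:
            continue  # fails the threshold
        if best[gene] is None or hit["score"] > best[gene]:
            best[gene] = hit["score"]
    return best


# Invented example: two reference genes, three hits in one other genome
example_hits = [
    {"query": "geneA", "identity": 80.0, "aln_len": 90, "query_len": 100, "score": 150.0},
    {"query": "geneA", "identity": 95.0, "aln_len": 100, "query_len": 100, "score": 200.0},
    {"query": "geneB", "identity": 40.0, "aln_len": 100, "query_len": 100, "score": 60.0},
]
example_lane = best_hits_for_reference(["geneA", "geneB"], example_hits)
```

Because the lane is indexed by reference genes only, a gene that is missing from the reference never appears, which matches the note above about genes absent in the reference genome.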

Results<br />


Ribosomal RNA Analysis<br />

A phylogenetic tree based on the 16S rRNA gene extracted<br />

from the 32 analysed Vibrionaceae genomes is shown <strong>in</strong><br />

Fig. 1. The 18 V. cholerae genomes form a tight subcluster,<br />

quite distant from the other species. Above this in the<br />

figure, another subcluster compris<strong>in</strong>g eight genomes represent<strong>in</strong>g<br />

at least six species is recognised, <strong>and</strong> with<strong>in</strong> this<br />

cluster the two V. parahaemolyticus genes are not found on<br />

the same branch. A third cluster, a bit further removed,<br />

includes Aliivibrio fischeri and A. salmonicida as well as V.<br />

splendidus and Vibrio sp. MED222; the gene of<br />

Photobacterium profundum is the most distant.<br />

Pan-Genome Family Trees<br />


Figure 2 Pan-genome family clustering of the 32 Vibrio genome sequences. The two plots represent weighted values for genes present in at least 90% of the genomes (stabilome) or genes found in only a few genomes (mobilome). Panels: Vibrio, stabilome; Vibrio, mobilome. Horizontal axis: relative Manhattan distance (0.20 to 0.00)<br />

Starting with a database containing the total set of all Vibrio<br />

gene families, a profile of matching gene families was<br />

constructed for each individual genome. This was stored as<br />

a matrix containing a column for each gene family and a<br />

row for each genome. Each cell contains a 1 or a 0,<br />

representing the presence or absence of the gene family.<br />

This matrix was weighted to emphasise either the genes<br />

found in most genomes (the “stabilome”) or in only a few<br />

genomes (the “mobilome”); from these weighted matrices,<br />

clustering of the genomes yielded the trees shown<br />

in Fig. 2. Shorter distances represent genomes with many<br />

gene families in common, and larger distances reflect<br />

genomes with fewer gene families in common. As<br />

expected, in both trees, genomes from the same species<br />

cluster together, and the resolution within a<br />

species is considerably better than in the 16S<br />

rRNA tree in Fig. 1. Similarity between the unspeciated<br />

Vibrio isolate MED222 and V. splendidus is suggested by<br />

their close cluster<strong>in</strong>g; this is a connection also suggested by<br />

others [21]. Note that the unspeciated Vibrio isolate Ex25<br />

<strong>and</strong> V. parahaemolyticus 2210633 cluster together <strong>in</strong> the<br />

mobilome tree, but are more distant <strong>in</strong> the stabilome. This<br />

implies that the genes shared between these two genomes<br />

are less common genes with<strong>in</strong> the Vibrio genomes<br />

exam<strong>in</strong>ed here. As already <strong>in</strong>dicated by the 16S rRNA<br />

tree, the two V. parahaemolyticus isolates are quite<br />

dissimilar, <strong>and</strong> appear on separate branches. The Aliivibrio<br />

cluster is placed with<strong>in</strong> Vibrio genomes <strong>in</strong> both the<br />

stabilome <strong>and</strong> the mobilome, as was the case for their 16S<br />

rRNA gene. P. profundum is not such an outlier as <strong>in</strong> the<br />

16S rRNA tree, <strong>and</strong> <strong>in</strong> the stabilome. It is even positioned<br />

close to the Aliivibrio genomes. Zoom<strong>in</strong>g <strong>in</strong> at the genomes<br />

of V. cholerae, a division <strong>in</strong>to two subclusters can be seen;<br />

these clusters correspond to environmental vs. cl<strong>in</strong>ical<br />

isolates (with the exception of V52 <strong>in</strong> the stabilome).<br />
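The profile matrix and distance measure described above can be illustrated with a minimal sketch. The text does not give the weighting formula, so a simple column filter stands in for it here: that substitution, the function names, the thresholds and the toy family identifiers are all assumptions made for illustration, not the authors' implementation.

```python
def presence_matrix(genome_families):
    """Build 0/1 rows (one per genome) with one column per gene family."""
    families = sorted(set().union(*genome_families.values()))
    rows = {
        genome: [1 if fam in fams else 0 for fam in families]
        for genome, fams in genome_families.items()
    }
    return families, rows


def keep_columns(rows, predicate):
    """Keep the columns whose genome count (column sum) satisfies predicate."""
    n_cols = len(next(iter(rows.values())))
    counts = [sum(row[j] for row in rows.values()) for j in range(n_cols)]
    kept = [j for j, count in enumerate(counts) if predicate(count)]
    return {genome: [row[j] for j in kept] for genome, row in rows.items()}


def relative_manhattan(a, b):
    """Manhattan distance between two 0/1 profiles, scaled by their length."""
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy input (hypothetical family identifiers, not data from the study).
toy = {
    "genome_A": {"f1", "f2", "f3", "f4"},
    "genome_B": {"f1", "f2", "f3", "f4"},
    "genome_C": {"f1", "f2", "f5"},
    "genome_D": {"f1", "f3", "f5", "f6"},
}
families, rows = presence_matrix(toy)
n = len(toy)

# Assumed stand-ins for the two weightings: widespread vs. rare families.
common = keep_columns(rows, lambda count: count >= 0.75 * n)
rare = keep_columns(rows, lambda count: count == 2)

for g1, g2 in [("genome_A", "genome_C"), ("genome_C", "genome_D")]:
    print(g1, g2,
          round(relative_manhattan(common[g1], common[g2]), 3),
          round(relative_manhattan(rare[g1], rare[g2]), 3))
```

In this toy set, genome_C and genome_D are identical on the rare families but differ on the widespread ones, which mirrors how two genomes can sit close together in one tree and further apart in the other. For the real data, the thresholds would presumably follow the caption of Fig. 2 (at least 90% of genomes, or two to four genomes), and the resulting distances would be passed to a hierarchical clustering routine.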

Pan- and Core Genome Plot

BLAST results were analysed to construct a pan-genome, which is a hypothetical collection of all the gene families that are found in the investigated genomes [28]. The core genome was constructed from all gene families that were represented at least once in every genome. Thus, the gene families conserved in all genomes represent their core genome; adding the remaining gene families produces the
pan-genome. The resulting pan- and core genome plot is shown in Fig. 3. The genomes start with the documented clinical isolates of V. cholerae and then follow the order suggested by the pan-genome family clustering (Fig. 2), although genomes from the same species were kept together (the two V. parahaemolyticus genomes were split in the trees). As more genomes are added to the plot, the number of gene families in the pan-genome (blue line) increases, and the number of conserved gene families in the core genome (red line) decreases, albeit at a lower rate. This is because every genome can add many novel (and frequently different) genes to the pan-genome but reduces the core genome only by the few genes that are absent in that particular strain but were conserved in the previously analysed genomes. The pan-genome curve increases with a relatively steep slope when a novel species is added, as is obvious when a V. parahaemolyticus genome is added after the last V. cholerae. A stable plateau can be seen for the pan-genome of V. cholerae at around 6,500 genes. Nevertheless, a small increase occurs when adding V. cholerae 1587; this is caused by the difference between the two subclusters of V. cholerae seen in Fig. 2. V. cholerae strain 2740-80 behaves atypically in all the figures shown; although documented as an environmental isolate, it appears closer to the clinical isolates in terms of overall genomic properties.

Figure 3 Pan- and core genome plot of the 32 Vibrionaceae genomes. The colours highlighting species are the same as in Fig. 1. Plotted series: pan genome, core genome, new gene families; vertical axis from 0 to 25,000.

T. Vesth et al.


Orig<strong>in</strong>s of V. cholerae<br />

Figure 4 BLAST matrix of the 32 Vibrionaceae genomes. The colours highlighting the species are the same as in Fig. 1. Since the reciprocal similarity (reported as percent) is not readable at this resolution, every matrix cell is coloured using the scales as indicated. The bottom row identifies hits (other than hits-to-self) found within a genome. Four matrix cells reporting high pairwise similarities are outlined; their numbers are specified in the text. Colour scales: homology between proteomes, 30.0% to 90.0%; homology within proteomes, 0.0% to 6.0%.


Figure 5 labels: V. cholerae O1 El Tor N16961 chromosome 1 (2,961,149 bp) and chromosome 2 (1,072,310 bp); marked regions are gaps A to G and the superintegron. Lanes, from outer to inner circle: P. profundum SS9; V. shilonii AK1; V. harveyi BAA-1116; V. campbellii AND4; V. parahaemolyticus 16; V. parahaemolyticus 2210633; Vibrio sp. Ex25; A. salmonicida LFI1238; A. fischeri MJ11; A. fischeri ES114; V. splendidus LGP32; Vibrio sp. MED222; V. vulnificus YJ016; V. vulnificus CMCP6; V. cholerae VL426, 12129, TMA21, TM11079-80, 1587, AM-19226, MZO-2, 2740-80, BX330286, B33VCE, RC9, MJ1236, M66-2, V52, MO10, O395 TEDA, O395 TIGR and N16961; genes on the positive strand; genes on the negative strand; stacking energy; position preference; global direct repeats; GC skew.


When the first genome of A. fischeri, which is not a member of the Vibrio genus, is added, it does not contribute significantly more novel genes to the pan-genome than the Vibrio genomes did. This contrasts with P. profundum, which produces a sharp increase in the pan-genome, as does, interestingly, V. shilonii. Note that there are approximately 20,200 gene families in total within the 32 sequenced Vibrionaceae genomes, whereas the core genome decreases to approximately 1,000 gene families.
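The way the pan- and core genome curves of Fig. 3 accumulate can be sketched as follows. This is a minimal illustration that assumes each genome has already been reduced to a set of gene-family identifiers; the function name and the toy data are hypothetical and are not taken from the study.

```python
def pan_core_curve(ordered_genomes):
    """Accumulate pan- and core genome sizes over genomes in plotting order.

    ordered_genomes: list of (name, set_of_family_ids).
    Returns a list of (name, pan_size, core_size, new_families).
    """
    pan = set()
    core = None
    curve = []
    for name, families in ordered_genomes:
        new = families - pan            # families not seen in earlier genomes
        pan |= families                 # the pan-genome is the running union
        # the core genome is the running intersection
        core = set(families) if core is None else core & families
        curve.append((name, len(pan), len(core), len(new)))
    return curve


# Toy input (hypothetical family identifiers, not data from the study).
toy_order = [
    ("genome_A", {"f1", "f2", "f3", "f4"}),
    ("genome_B", {"f1", "f2", "f3", "f4"}),
    ("genome_C", {"f1", "f2", "f5"}),
    ("genome_D", {"f1", "f3", "f5", "f6"}),
]
curve = pan_core_curve(toy_order)
for name, pan_size, core_size, new in curve:
    print(name, pan_size, core_size, new)
```

Because the pan-genome is a union and the core genome an intersection, the final totals do not depend on the order of the genomes, but the shape of the curves does. This is why the plateau for V. cholerae and the jumps at new species are visible only when genomes of the same species are kept together.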

BLAST Comparison Visualised in a BLAST Matrix<br />

A BLAST matrix provides a visual overview of reciprocal pairwise whole-genome comparisons, as shown in Fig. 4. The more strongly a matrix cell is coloured, the more similarity was detected between the gene content of the two genomes. As can be seen in the lower right triangle, all V. cholerae genomes are highly similar, with similarity ranging between 64% and 93% for any given pair of genomes. No statistically significant difference was observed when comparing clinical isolates to environmental isolates. The two A. fischeri and the two V. vulnificus genomes also share a high degree of identity within their species (75% and 67%, respectively), visible at the bottom of the matrix. In contrast, the two V. parahaemolyticus genomes share only 35% identity, which is no higher than the similarity detected between genomes of different species. With 72% similarity, isolate MED222 most closely matches V. splendidus, and with 65%, isolate EX25 again shares most similarity with V. parahaemolyticus 2210633.<br />
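As an illustration of how a single cell of such a matrix might be computed, the sketch below assumes that each genome is represented by a set of gene family identifiers and that similarity is expressed as the number of shared families divided by the total number of families in the pair. This denominator is an assumption made for the example; the exact definition behind Fig. 4 may differ. All names and data are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_fraction(a: Set[str], b: Set[str]) -> Tuple[int, int, float]:
    """Return (shared families, total families in the pair, percentage shared)."""
    shared = len(a & b)
    total = len(a | b)
    percent = 100.0 * shared / total if total else 0.0
    return shared, total, percent


def blast_matrix(genomes: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Tuple[int, int, float]]:
    """Compute one cell for every unordered pair of genomes."""
    cells = {}
    for (name1, fam1), (name2, fam2) in combinations(sorted(genomes.items()), 2):
        cells[(name1, name2)] = shared_fraction(fam1, fam2)
    return cells


# Hypothetical toy data.
toy = {
    "genome_A": {"f1", "f2", "f3", "f4"},
    "genome_B": {"f1", "f2", "f3", "f5"},
    "genome_C": {"f1", "f2", "f6"},
}

cells = blast_matrix(toy)
for pair, (shared, total, percent) in cells.items():
    print(pair, f"{percent:.1f} %", f"{shared} / {total}")
```

Each cell carries both the percentage and the underlying counts, so the colour intensity of a cell can be derived from the percentage while the counts remain available for inspection.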

BLAST Atlas<br />

A BLAST atlas was constructed using V. cholerae N16961 (O1, El Tor) as the reference genome, shown in Fig. 5. The best BLAST hits identified in the query genomes are plotted in the lanes around the reference genome, with different colours for different species. In general, chromosome 1 is more strongly conserved than chromosome 2. A large part of chromosome 2 of N16961 displays very little conservation in the other genomes; this area represents a superintegron [40] that contains the V. cholerae-specific repeat (VCR) sequences, as well as a high number of gene cassettes. The repeat sequences are visible as black boxes in the repeat lane of the reference genome (second inner lane). Although all V. cholerae genomes contain a superintegron, its genes are very diverse between isolates [34], which explains the lack of BLAST hits in this region.<br />

Figure 5 BLAST atlas with V. cholerae strain N16961 as a reference strain, showing chromosomes 1 (top) and 2 (bottom). The best BLAST hits identified with genes from N16961 in the other V. cholerae genomes are represented in dark red, for the location as it appears in N16961. BLAST hits in the other genomes are shown in various colours as indicated to the right. Major areas conserved in V. cholerae but not in other Vibrionaceae are identified as gap B, gap C, gap D and gap F in green; areas that are found in toxigenic V. cholerae only are marked black as gap A, gap E and gap G. The superintegron on chromosome 2 of V. cholerae is also indicated.<br />
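The step of keeping only the best hit per reference gene and query genome can be sketched as follows. The record layout, the use of bit score as the ranking criterion and all identifiers are assumptions made for illustration; they are not taken from the BLASTatlas implementation.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    ref_gene: str        # gene in the reference genome (hypothetical label)
    query_genome: str    # genome in which the hit was found
    bit_score: float     # assumed ranking criterion


def best_hits(hits: Iterable[Hit]) -> Dict[Tuple[str, str], Hit]:
    """Keep the highest-scoring hit for each (reference gene, query genome) pair."""
    best: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        key = (hit.ref_gene, hit.query_genome)
        if key not in best or hit.bit_score > best[key].bit_score:
            best[key] = hit
    return best


# Hypothetical toy hits.
toy_hits = [
    Hit("ref_gene_1", "genome_X", 50.0),
    Hit("ref_gene_1", "genome_X", 120.0),
    Hit("ref_gene_1", "genome_Y", 80.0),
    Hit("ref_gene_2", "genome_X", 30.0),
]

best = best_hits(toy_hits)
for (gene, genome), hit in sorted(best.items()):
    print(gene, genome, hit.bit_score)
```

A reference gene with no entry for a given query genome corresponds to an empty position in that genome's lane, which is how regions such as the superintegron show up as gaps.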

Several regions of the atlas have been highlighted. Gaps B, C, D and F on chromosome 1 (indicated in green) contain genes that are conserved in the represented genomes of V. cholerae but not in the other Vibrionaceae. The gaps marked A, E and G indicate regions that are specific to the toxigenic, clinical isolates only. Annotated, V. cholerae-specific genes present in all these regions are listed in Table 2 (hypothetical genes are excluded). Genes specific for toxigenic V. cholerae identified in gap A include, amongst others, biosynthesis genes for the toxin co-regulated pilus (which is required for transmission of the prophage CTXΦ carrying the enterotoxin genes), as well as genes encoding citrate lyase. Note that the genes in gap A are also found in the environmental isolate V. cholerae 2740-80.<br />
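The distinction between species-specific and toxigenic-specific regions can be expressed as a simple rule over presence and absence of hits. The sketch below assumes that, for every reference gene, the set of query genomes with a hit is known, that genes are sorted by start coordinate with start smaller than end, and that strict presence in every member of a group is required. These rules, the labels and the toy data are hypothetical; as noted above, real regions tolerate exceptions such as isolate 2740-80.

```python
from itertools import groupby
from typing import List, Set, Tuple

Gene = Tuple[int, int, Set[str]]  # (start, end, query genomes with a hit)


def classify(present_in: Set[str], cholerae: Set[str], toxigenic: Set[str]) -> str:
    """Label one reference gene by where its hits were found."""
    if present_in - cholerae:
        return "shared"               # also found outside V. cholerae
    if cholerae.issubset(present_in):
        return "species-specific"     # in all V. cholerae, nowhere else
    if toxigenic.issubset(present_in):
        return "toxigenic-specific"   # in all toxigenic isolates only
    return "variable"


def regions(genes: List[Gene], cholerae: Set[str], toxigenic: Set[str]) -> List[Tuple[int, int, str]]:
    """Merge runs of consecutive genes that carry the same label."""
    labelled = [(s, e, classify(p, cholerae, toxigenic)) for s, e, p in genes]
    merged = []
    for label, run in groupby(labelled, key=lambda g: g[2]):
        run = list(run)
        merged.append((run[0][0], run[-1][1], label))
    return merged


# Hypothetical toy data.
cholerae = {"vc_tox1", "vc_tox2", "vc_env1"}
toxigenic = {"vc_tox1", "vc_tox2"}
toy_genes = [
    (100, 200, {"vc_tox1", "vc_tox2", "vc_env1", "other1"}),
    (250, 400, {"vc_tox1", "vc_tox2", "vc_env1"}),
    (450, 600, {"vc_tox1", "vc_tox2", "vc_env1"}),
    (650, 800, {"vc_tox1", "vc_tox2"}),
    (850, 900, {"vc_tox1"}),
]

result = regions(toy_genes, cholerae, toxigenic)
for start, end, label in result:
    print(start, end, label)
```

In practice a tolerance threshold would probably replace the strict subset tests, so that a region is still reported when one or two genomes deviate.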

Gap B contains a number of outer membrane protein genes involved in sugar modification that are found in all V. cholerae genomes. Genes from gap C encoding a histidine kinase two-component signal transduction regulatory system are also conserved within the species, as are genes in gaps D and F, which are involved in chemotaxis and possible multidrug resistance.<br />

Gap E, containing genes conserved in toxigenic strains only, holds the prophage CTXΦ that contains the genes encoding cholera enterotoxin subunits A and B; this enterotoxin is responsible for the excessive, watery diarrhoea typical of cholera. Upon binding to target cell GM1 gangliosides, enterotoxin enters the cell and stimulates adenylate cyclase by ADP ribosylation. The resultant increased cyclic AMP levels induce excessive electrolyte movement and secretion of sodium and water [43]. Strain M66-2, which lacks the prophage CTXΦ and the enterotoxin genes, is believed to be a precursor of the seventh pandemic V. cholerae [11]. Gap E also bears the RTX toxin operon, which encodes a pore-forming cytotoxin [22]. An RTX toxin is also present in environmental isolate 2740-80 and in V. vulnificus.<br />

Gap G on chromosome 2 consists of a set of five genes, all in the same orientation, in a putative operon, flanked by genes on the complementary strand. This appears to be a remnant of a mobile element, as these genes are flanked by a transposase gene on the 3′ end, and there is a small global repeat on the 5′ end. Only the first two of the five genes have an assigned function, with the first gene being a GMP reductase, and the second a putative DNA methyltransferase. The remaining three genes are hypothetical, but their strikingly strong conservation in all pathogenic strains and complete absence of homologues in the other Vibrio genomes strongly point towards a potential biological significance.<br />

Table 2 A selection of genes located in the gaps marked in Fig. 5<br />

Gap A (850000–913000)<br />
852903–851557 Citrate/sodium symporter<br />
853165–854235 Citrate (pro-3S)-lyase ligase<br />
854287–854583 Citrate lyase subunit gamma<br />
854565–855455 Citrate lyase, beta subunit<br />
855391–856995 Citrate lyase, alpha subunit<br />
856992–857528 citX protein<br />
857506–858447 citG protein<br />
869812–866873 Helicase-related protein<br />
870391–869813 Tellurite resistance protein-related<br />
871298–870819 Transcriptional regulator, putative<br />
873242–874225 Transposase, putative<br />
876974–880015 ToxR-activated gene A protein<br />
881390–884728 Inner membrane protein, putative<br />
885773–886267 tagD protein<br />
888405–886543 Toxin co-regulated pilus biosynthesis<br />
888846–889511 Toxin co-regulated pilus biosynthesis<br />
889496–889906 Toxin co-regulated pilus biosynthesis<br />
890449–891123 Toxin co-regulated pilin<br />
891203–892495 Toxin co-regulated pilus biosynthesis<br />
892495–892947 Toxin co-regulated pilus biosynthesis<br />
892950–894419 Toxin co-regulated pilus biosynthesis<br />
894412–894867 Toxin co-regulated pilus biosynthesis<br />
894855–895691 Toxin co-regulated pilus biosynthesis<br />
895707–896165 Toxin co-regulated pilus biosynthesis<br />
896155–897666 Toxin co-regulated pilus biosynthesis<br />
897641–898663 Toxin co-regulated pilus biosynthesis<br />
898673–899689 Toxin co-regulated pilus biosynthesis<br />
899896–900726 TCP pilus virulence regulatory protein<br />
900726–901487 Leader peptidase TcpJ<br />
901494–903374 Accessory colonization factor AcfB<br />
903380–904150 Accessory colonization factor AcfC<br />
904648–905556 tagE protein<br />
906206–905559 Accessory colonization factor AcfA<br />
914124–912856 Phage family integrase<br />

Gap B (975000–1010000)<br />
978644–979144 Phosphotyrosine protein phosphatase<br />
981833–982387 Serine acetyltransferase-related protein<br />
982384–983532 Exopolysacch. biosynth protein EpsF<br />
983529–984938 Polysacch. export protein, putative (gfcE)<br />
986166–986597 Serine acetyltransferase-related protein<br />
986597–987937 capK protein, putative<br />
987913–989010 Polysaccharide biosynthesis protein, putative<br />
1001910–1002437 Polysaccharide export-related protein (gfcE)<br />
1002462–1004675 Putative exopolysacch. biosynth protein<br />

Gap C (1130000–1160000)<br />
1139646–1142912 Chitinase, putative<br />
1147856–1148998 Response regulator<br />
1149033–1149398 Response regulator<br />
1149990–1151309 Sensory box sensor histidine kinase<br />
1151321–1152625 Sensor histidine kinase<br />
1152625–1154235 Response regulator<br />
1154252–1155595 Response regulator<br />
1157228–1155624 Sensor histidine kinase<br />
1158044–1157232 Periplasmic binding protein-related<br />

Gap D (1478000–1520000)<br />
2086826–2087584 CDP-diacylglycerol-glyc.-3phosph-3-phosphatidyltransferase<br />
2087587–2088519 Phosphatidate cytidylyltransferase<br />
2094741–2095604 PvcB protein<br />
2098112–2097183 LysR family transcriptional regulator<br />
2098432–2100258 pvcA protein<br />
2117923–2119977 Methyl-accepting chemotaxis protein<br />
2120575–2120030 Transcriptional regulator<br />
2120663–2121826 Benzoate transport protein<br />

Gap E (1537000–1587500)<br />
1541452–1543170 Sensor histidine kinase/response regulator<br />
1545396–1543231 Toxin secretion transporter, putative<br />
1546802–1545399 RTX toxin transporter<br />
1548919–1546757 RTX toxin transporter<br />
1549662–1550123 RTX toxin activating protein<br />
1550108–1563784 RTX toxin RtxA<br />
1564376–1564152 RstC protein<br />
1564844–1564470 RstB1 protein<br />
1565901–1564822 RstA1 protein<br />
1566027–1566365 Transcriptional repressor RstR<br />
1567341–1566967 Cholera enterotoxin, B subunit<br />
1568114–1567338 Cholera enterotoxin, A subunit<br />
1569412–1568213 Zona occludens toxin<br />
1569702–1569409 Accessory cholera enterotoxin<br />
1571241–1570993 Colonization factor<br />
1571760–1571377 RstB2 protein<br />
1572817–1571738 RstA1 protein<br />
1572943–1573281 Transcriptional repressor RstR<br />
1577272–1575704 Phage replication protein Cri<br />
1582123–1580555 Phage replication protein Cri<br />
1583160–1583513 Transposase OrfAB, subunit A<br />
1583510–1584382 Transposase OrfAB, subunit B<br />

Gap F (1896000–1956000)<br />
1896092–1897327 Phage family integrase<br />
1900831–1898009 Helicase, putative<br />
1903632–1902898 Chemotaxis protein MotB-related<br />
1908858–1905790 Type I restriction enzyme HsdR<br />
1916009–1913628 DNA methylase HsdM, putative<br />
1933231–1935654 Neuraminidase<br />
1936007–1935801 Transcriptional regulator<br />
1936121–1936597 DNA repair protein RadC, putative<br />
1938391–1937519 Transposase OrfAB, subunit B<br />
1938732–1938388 Transposase OrfAB, subunit A<br />
1941671–1941351 Transcriptional regulator, putative<br />
1942032–1941658 Middle operon regulator-related<br />
1944457–1943306 eha protein<br />

Gap G (chromosome II, 21300–223000)<br />
213207–214250 GMP reductase<br />
214574–215725 DNA methyltransferase<br />
220262–219825 IS1004 transposase<br />

All gene annotations are taken from the reference genome V. cholerae strain N16961. Hypothetical proteins were excluded. Gaps A, E and G are conserved in pathogenic strains, whereas gaps B, C, D and F are conserved in all V. cholerae genomes analysed (Figure 1).<br />

Discussion<br />

The recent availability of many Vibrionaceae genomes, including a substantial number of V. cholerae genomes, makes it possible to take a closer look at the similarities and differences of species within the genus Vibrio, and to examine, on a genome scale, what distinguishes V. cholerae from the other Vibrio species. Since not all V. cholerae isolates are pathogenic, the presence of the prophage bearing cholera enterotoxin, the main virulence factor for cholera, is not a suitable marker for this species. We attempted to identify a set of V. cholerae-specific genes, and also explored the internal diversity within the V. cholerae genomes that have been sequenced to date.<br />

On a phylogenetic tree based on the 16S ribosomal RNA gene, those isolates that do not belong to the genus Vibrio were positioned as outliers, as expected. This tree further indicated the most closely resembling 16S rRNA sequence for the two sequenced Vibrio strains that are currently not assigned to a species. It was observed that the two sequenced V. parahaemolyticus strains were not placed together. The complete gene content of each genome was next compared by BLAST, and the results were pooled into gene families, which were subjected to cluster analysis. This provided evidence that the 18 V. cholerae genomes fall into two subclusters, one mainly containing clinical isolates and the other environmental isolates.<br />

The gene family clustering, subsequent pan-genome analysis and the pairwise BLAST results, as summarised in the BLAST matrix, all supported the relatedness of Vibrio species Ex25 to V. parahaemolyticus 2210633 but not to V. parahaemolyticus 16. This latter genome was quite different from V. parahaemolyticus 2210633 in all analyses. Although it is possible that the species V. parahaemolyticus is far more genetically diverse than V. cholerae, A. fischeri or V. vulnificus, an alternative explanation is that one of the sequenced isolates is incorrectly named as V. parahaemolyticus. The similarity between Vibrio species MED222 and V. splendidus based on gene families is in agreement with their related 16S rRNA genes and published data [21]. However, in contrast to what the ribosomal gene suggests, our whole-genome comparison indicates that the three Aliivibrio genomes (A. salmonicida and two A. fischeri) are not so different from Vibrio after all. Their recent placement in the genus Aliivibrio, a decision based on five genes (the 16S rRNA gene and four housekeeping genes) and phenotypic characteristics [47], appears not to reflect the whole-genome picture presented here.<br />

The BLAST results were graphically summarised in a BLAST atlas, which visualised V. cholerae-specific gene clusters. These coded for polysaccharide biosynthesis enzymes, response regulators and chemotaxis proteins, amongst others. In addition, a V. cholerae-specific histidine kinase two-component signal transduction regulatory system was identified. The two-component signal transduction pathway is a powerful regulatory system that allows bacteria to adapt to a particular ecological niche. There is a precedent for this claim, as the introduction of a single regulatory protein in Vibrio fischeri strain MJ11 has been shown to specifically enable colonization of the squid Euprymna scolopes [26].<br />

As expected, the main differences observed between V. cholerae clinical isolates and the environmental strains are due to genes related to virulence. Two exceptions are the presence of a number of virulence genes in the environmental strain V. cholerae 2740-80 and the absence of enterotoxin genes in clinical isolate M66-2. It has already been suggested that M66-2 might be a predecessor of pandemic, enterotoxic V. cholerae [11]. From sequence comparison of four housekeeping genes, it was concluded that V. cholerae 2740-80 is intermediary between toxigenic and non-toxigenic isolates [30]. This view is confirmed by the data presented here, although we propose to consider the possibility that the isolate arose from a pandemic clone that has lost the CTXΦ prophage, rather than being a precursor of a pathogen.<br />

In conclusion, several different methods of genome comparison have yielded a picture of V. cholerae genomes as forming a distinct cluster compared to related species, and a relatively small number of genes might be responsible for environmental niche adaptation and hence for the generation of this distinct species. Likely candidates include multiple two-component signal transduction regulatory proteins as well as chemotaxis proteins.<br />

Acknowledgements We would like to thank Tim Binnewies for early work on this project, and also the Danish Research Councils and the DTU Globalization funds for financial support.


Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.<br />

References<br />

1. Bassler B et al. (2007) CP000789.1: Direct submission to GenBank<br />

2. Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) Genome update: proteome comparisons. Microbiol 151:1–4<br />

3. Chen CY, Wu KM, Chang YC, Chang CH, Tsai HC, Liao TL, Liu YM, Chen HJ, Shen AB, Li JC, Su TL, Shao CP, Lee CT, Hor LI, Tsai SF (2003) Comparative genome analysis of Vibrio vulnificus, a marine pathogen. Genome Res 13:2577–2587<br />

4. Clayton RA, Sutton G, Hinkle PS, Bult C, Fields C (1995) Intraspecific variation in small-subunit rRNA sequences in GenBank: why single sequences may not adequately represent prokaryotic taxa. Int J Syst Bacteriol 45:595–599<br />

5. Colwell R, Grim CJ, Young S, Jaffe D, Gnerre S, Berlin A, Heiman D, Hepburn T, Shea T, Sykes S, Alvarado L, Kodira C, Heidelberg J, Lander E, Galagan J, Nusbaum C, Birren B (2008) NZ_AAKF00000000: Direct submission to GenBank<br />

6. Doolittle WF (1995) Phylogenetic classification and the universal tree. Science 284:2124–2129<br />

7. Doolittle WF, Papke RT (2006) Genomics and the bacterial species problem. Genome Biol 7:116<br />

8. Doolittle WF, Zhaxybayeva O (2009) On the origin of prokaryotic species. Genome Res 19:744–756<br />

9. Edwards R, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2008) NZ_ACCV00000000: Direct submission to GenBank<br />

10. Farmer JJ, Janda JM (2005) Vibrionaceae. In: Bergey’s manual of systematic bacteriology, 2nd edn, vol 2 part B. Springer, New York, pp 491–546<br />

11. Feng L, Reeves PR, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Cheng J, Wang W, Wang J, Qian W, Li D, Wang L (2008) A recalibrated molecular clock and independent origins for the cholera pandemic clones. PLoS ONE 3:e4053<br />

12. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ, Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J (2005) Re-evaluating prokaryotic species. Nat Rev Microbiol 3:733–739<br />

13. Hagstrom A, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2007) NZ_ABGR00000000: Direct submission to GenBank<br />

14. Hallin PF, Binnewies TT, Ussery DW (2008) The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4:363–371<br />

15. Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, Gill SR, Nelson KE, Read TD, Tettelin H, Richardson D, Ermolaeva MD, Vamathevan J, Bass S, Qin H, Dragoi I, Sellers P, McDonald L, Utterback T, Fleishmann RD, Nierman WC, White O, Salzberg SL, Smith HO, Colwell RR, Mekalanos JJ, Venter JC, Fraser CM (2000) DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477–483<br />

16. Heidelberg J, Sebastian Y. NZ_AAKJ00000000, NZ_AAUT00000000, NZ_AAKK00000000, NZ_AAUR00000000, NZ_AAWF00000000: Direct submission to GenBank<br />

17. Hjerde E, Lorentzen MS, Holden MT, Seeger K, Paulsen S, Bason N, Churcher C, Harris D, Norbertczak H, Quail MA, Sanders S, Thurston S, Parkhill J, Willassen NP, Thomson NR (2008) The genome sequence of the fish pathogen Aliivibrio salmonicida strain LFI1238 shows extensive evidence of gene decay. BMC Genomics 9:616<br />

18. Konstantinidis T, Ramette A, Tiedje JA (2006) The bacterial species definition in the genomic era. Phil Trans R Soc B 361:1929–1940<br />

19. Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100–3108<br />

20. Larsen TS, Krogh A (2003) EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4:29<br />

21. Le Roux F, Zouine M, Chakroun N, Binesse J, Saulnier D, Bouchier C, Zidane N, Ma L, Rusniok C, Lajus A, Buchrieser C, Médigue C, Polz MF, Mazel D (2009) Genome sequence of Vibrio splendidus: an abundant planctonic marine species with a large genotypic diversity. Environ Microbiol 11:1959–1970<br />

22. Lin W, Fullner KJ, Clayton R, Sexton JA, Rogers MB, Calia KE, Calderwood SB, Fraser C, Mekalanos JJ (1999) Identification of a Vibrio cholerae RTX toxin gene cluster that is tightly linked to the cholera toxin prophage. Proc Natl Acad Sci U S A 96:1071–1076<br />

23. Loytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557–10562<br />

24. Loytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635<br />

25. Makino K, Oshima K, Kurokawa K, Yokoyama K, Uda T, Tagomori K, Iijima Y, Najima M, Nakano M, Yamashita A, Kubota Y, Kimura S, Yasunaga T, Honda T, Shinagawa H, Hattori M, Iida T (2003) Genome sequence of Vibrio parahaemolyticus: a pathogenic mechanism distinct from that of V. cholerae. Lancet 361:743–749<br />

26. Mandel MJ, Wollenberg MS, Stabb EV, Visick KL, Ruby EG (2009) A single regulatory gene is sufficient to alter bacterial host range. Nature 458:215–218<br />

27. Mazel D, Le Roux F (2008) FM954973.1: Direct submission to GenBank<br />

28. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R (2005) The microbial pan-genome. Curr Opin Genet Dev 15:589–594<br />

29. Medrano-Soto A, Moreno-Hagelsieb G, Vinuesa P, Christen JA, Collado-Vides J (2001) Successful lateral transfer requires codon usage compatibility between foreign genes and recipient genomes. Mol Biol Evol 21:1884–1894<br />

30. Mohapatra SS, Ramachandran D, Mantri CK, Colwell RR, Singh DV (2009) Determination of relationships among non-toxigenic Vibrio cholerae O1 biotype El Tor strains from housekeeping gene sequences and ribotype patterns. Res Microbiol 160:57–62<br />

31. Munk A, Tapia R, Green L, Rogers Y, Detter JC, Bruce D, Brettin TS, Colwell R, Grim C, Vonstein V, Bartels D. CP001485.1, NZ_ACHV00000000, NZ_ACHY00000000, NZ_ACHW00000000, NZ_ACHX00000000, NZ_ACHZ00000000, NZ_ACIA00000000, NZ_ACFQ00000000: Direct submission to GenBank<br />

32. Murray RG, Stackebrandt E (1995) Taxonomic note: implementation of the provisional status Candidatus for incompletely described procaryotes. Int J Syst Bacteriol 45:186–187<br />

33. Nierman WC (2006) NZ_AATY00000000: Direct submission to GenBank<br />

34. Pang B, Yan M, Cui Z, Ye X, Diao B, Ren Y, Gao S, Zhang L, Kan B (2007) Genetic diversity of toxigenic and nontoxigenic Vibrio cholerae serogroups O1 and O139 revealed by array-based comparative genomic hybridization. J Bacteriol 189:4837–4879<br />

35. Philippe H, Douady CJ (2003) Horizontal gene transfer and phylogenetics. Curr Opin Microbiol 6:498–505



36. Pinhassi J, Pedros-Alio C, Ferriera S, Johnson J, Kravitz S, Halpern A, Remington K, Beeson K, Tran B, Rogers Y-H, Friedman R, Venter JC (2006) NZ_AAND00000000: Direct submission to GenBank<br />

37. Pupo GM, Lan R, Reeves PR (2000) Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci U S A 97:10567–10572<br />

38. Rhee JH, Kim SY, Chung SS, Lee SE, Choy HE (2002) AE016795.2: Direct submission to GenBank<br />

39. Riley MA, Lizotte-Waniewski M (2009) Population genomics and the bacterial species concept. Methods Mol Biol 532:367–377<br />

40. Rowe-Magnus DA, Guérout AM, Mazel D (1999) Superintegrons. Res Microbiol 150:641–651<br />

41. Rosenberg E, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2006) NZ_ABCH00000000: Direct submission to GenBank<br />

42. Ruby EG, Urbanowski M, Campbell J, Dunn A, Faini M, Gunsalus R, Lostroh P, Lupp C, McCann J, Millikan D, Schaefer A, Stabb E, Stevens A, Visick K, Whistler C, Greenberg EP (2005) Complete genome sequence of Vibrio fischeri: a symbiotic bacterium with pathogenic congeners. Proc Natl Acad Sci U S A 102:3004–3009<br />

43. Sánchez J, Holmgren J (2005) Virulence factors, pathogenesis and vaccine protection in cholera and ETEC diarrhoea. Curr Opin Immunol 17:388–398<br />

44. Stackebrandt E, Frederiksen W, Garrity GM, Grimont PA, Kämpfer P, Maiden MC, Nesme X, Rosselló-Mora R, Swings J, Trüper HG, Vauterin L, Ward AC, Whitman WB (2002) Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043–1047<br />

45. Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24:1596–1599<br />

46. Thompson FL, Iida T, Swings J (2004) Biodiversity of vibrios. Microbiol Mol Biol Rev 68:403–431<br />

47. Urbanczyk H, Ast JC, Higgins MJ, Carson J, Dunlap PV (2007) Reclassification of Vibrio fischeri, Vibrio logei, Vibrio salmonicida and Vibrio wodanis as Aliivibrio fischeri gen. nov., comb. nov., Aliivibrio logei comb. nov., Aliivibrio salmonicida comb. nov. and Aliivibrio wodanis comb. nov. Int J Syst Evol Microbiol 57:2823–2829<br />

48. Vezzi A, Campanaro S, D'Angelo M, Simonato F, Vitulo N, Lauro FM, Cestaro A, Malacrida G, Simionati B, Cannata N, Romualdi C, Bartlett DH, Valle G (2005) Life at depth: Photobacterium profundum genome sequence and expression analysis. Science 30:1459–1461<br />

49. Wang L, Feng L, Reeves P, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Wang W (2008) CP001233.1, CP001235.1: Direct submission to GenBank<br />

50. Woese CR (1987) Bacterial evolution. Microbiol Rev 51:221–271



2.10 Paper V: Tools for comparison of bacterial genomes


74 Tools for Comparison of Bacterial Genomes

T. M. Wassenaar (1,2), T. T. Binnewies (1,3), P. F. Hallin (1), D. W. Ussery (1,*)

1 Center for Biological Sequence Analysis, Technical University of Denmark, Kgs. Lyngby, Denmark
2 Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany
3 Roche Diagnostics Ltd., Advanced Systems Group, Global Platforms & Support, Rotkreuz, Switzerland
* dave@cbs.dtu.dk

1 Introduction
2 Genomic DNA Sequence Comparisons
3 Visualization of Genomic Data: The Genome Atlas
4 Whole Genome Alignment Methods
5 Comparing the Coding Fraction of Genomes
6 Codon Usage Comparisons
7 Protein Sequence Comparisons
8 Gene Synteny and Genome Islands
9 Minimal Information About a Genome Sequence
10 Research Needs

K. N. Timmis (ed.), Handbook of Hydrocarbon and Lipid Microbiology, DOI 10.1007/978-3-540-77587-4_337, © Springer-Verlag Berlin Heidelberg, 2010



Abstract: Of the many bioinformatics tools available, this chapter describes some that are useful for comparing complete genome sequences. Comparisons of genome length, base composition, gene density, numbers of tRNA and rRNA genes, and codon usage can provide useful biological insights. Examples are provided of a Genome Atlas plot, which summarizes many features of a single genome, and a BLAST Atlas, in which multiple genomes can be combined. A table of web services for useful tools is provided.

1 Introduction<br />

Presently, about 900 bacterial and archaeal genomes have been fully sequenced and made publicly available,¹ and their number more than doubled last year. Approximately 40% of the sequenced genomes were obtained from environmental (terrestrial and marine) organisms. In addition, metagenomic projects are now producing vast amounts of sequence data. Here we provide a brief overview of methods to compare sequenced bacterial genomes. Of the many methods available to compare bacterial genomes (Binnewies et al., 2006), Table 1 lists several that we find useful. It is beyond the scope of this review to provide a detailed analysis of these methods, and the list is far from complete. The tools discussed here provide information on fundamental biological features and can be used to compare a few or large numbers of genomes. They are easy to use and produce results that are easy to interpret and can be represented graphically. Graphical representation is an important quality of any sequence analysis tool that deals with genomes, because the input data are so large and complex.

2 Genomic DNA Sequence Comparisons<br />

A genome can comprise more than one DNA molecule. Approximately 10% of the bacterial genomes sequenced so far have more than one chromosome. By definition, a genome includes all chromosomes (and plasmids) that constitute an organism's total DNA. Chromosomes are essential, single-copy, independently replicating DNA molecules present in each member of the species. Some species contain plasmids; these are frequently strain-specific and are sometimes (incorrectly, in our opinion) omitted from a genome sequence.

At the time of writing, the largest bacterial genome sequenced is that of Solibacter usitatus (strain Ellin 6076), a soil bacterium belonging to the Acidobacteria. It consists of a single chromosome of 9.97 megabase pairs (Mbp). The smallest bacterial genome known is that of Carsonella ruddii (PV), an endosymbiont of a plant sap-feeding insect, with a mere 159,662 bp. Genome size is a rough indicator of biological adaptive potential, so it is no surprise that soil bacteria have bigger genomes, as they have to adapt to environmental variation, whereas the protective niche of an endosymbiont allows for a small genome.

¹ Completed genome statistics obtained from the NCBI Genome Project web pages: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Table 1. Methods for comparison of bacterial genomes

Method | URL | References
Length, %GC | http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi | Wheeler et al. (2007)
Chromosome alignment (ACT) | http://www.sanger.ac.uk/Software/ACT/ and http://www.webact.org/WebACT/home | Carver et al. (2005)
Chromosome alignment (MUMMER) | http://mummer.sourceforge.net | Kurtz et al. (2004)
Repeats – various | http://www.cbs.dtu.dk/services/GenomeAtlas | Ussery et al. (2004)
Repeats – tetranucleotides | http://www.megx.net/tetra | Teeling et al. (2004)
Repeats – short, tandem | http://minisatellites.u-psud.fr/GPMS/default.php | Denoeud and Vergnaud (2004)
Repeats – VNTRs | http://vntr.csie.ntu.edu.tw | Chang et al. (2007)
Replication Origins | http://www.cbs.dtu.dk/services/GenomeAtlas | Worning et al. (2006)
Noncoding RNAs | http://rfam.sanger.ac.uk | Griffiths-Jones et al. (2005)
rRNAs | http://www.cbs.dtu.dk/services/RNAmmer | Lagesen et al. (2007)
Genome Atlas | http://www.cbs.dtu.dk/services/GenomeAtlas | Hallin and Ussery (2004)
BLAST Atlas (zoomable) | http://www.cbs.dtu.dk/services/gwBrowser | Hallin and Ussery (2004)
"Genome Properties" | http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi | Selengut et al. (2007)

The genome size of an organism is easy to calculate and tabulate. Figure 1a gives a graphical representation of genome size variation within bacterial phyla. A "box and whiskers" plot as shown in Fig. 1 visualizes the distribution of a property that can be expressed as a numerical value, such as length, %GC, or number of genes. Such plots show the spread of the data and are made as follows: the values are sorted and divided into two equal parts, separated by the median, which is marked as a bar in the middle of the distribution. A box is drawn to cover the range that holds the middle 50% of the data (excluding the lowest 25% and the highest 25%). The "whiskers" are the dashed lines connecting the lowest (left) and highest (right) values, with the exception of outlier points, which are shown as individual dots. Outliers are defined as data points that lie more than 1.5 times the range of the box away from the box.
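The construction of a box and whiskers plot described above can be sketched in a few lines of Python. This is only an illustration: the function name `box_whisker_stats` and the example values are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one of several common conventions and not necessarily the one used for Fig. 1.

```python
from statistics import median


def box_whisker_stats(values, whisker_factor=1.5):
    """Return the numbers needed to draw one box-and-whiskers row."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    mid = n // 2
    lower_half = data[:mid]
    upper_half = data[mid + (n % 2):]  # skip the middle value when n is odd
    med = median(data)
    # With a single value both halves are empty; fall back to the median.
    q1 = median(lower_half) if lower_half else med
    q3 = median(upper_half) if upper_half else med
    box_range = q3 - q1
    # Points further than whisker_factor * box_range from the box are outliers.
    low_fence = q1 - whisker_factor * box_range
    high_fence = q3 + whisker_factor * box_range
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical genome sizes in Mbp, for illustration only.
stats = box_whisker_stats([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 30.0])
print(stats)
```

Applied once per phylum, the returned values give the bar (median), the box (q1 to q3), the whisker ends, and the individual outlier dots.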

The base composition of genomes, i.e., their %GC content (or %AT; the two add up to 100%), can also be compared, as shown in Fig. 1b. The GC content of a genome can range from 17% in C. ruddii to 75% in Anaeromyxobacter dehalogenans. The smallest genome is also the most AT rich, and many of the larger genomes are quite GC rich. It is not clear whether a biological force lies behind this correlation, although it has been observed that the ecological niche an organism occupies roughly correlates with both genome size and GC content (Foerstner et al., 2005; Musto et al., 2006).

Figure 1. (a) Box and whisker plot of genome length distribution for 779 bacterial chromosomes, grouped by phylum. The phylum and the number of chromosomes included are indicated at the left. Each phylum is colored according to our GenomeAtlas website. (b) The distribution of average chromosomal AT content for the same set of bacterial genomes.
[Panel titles: "Size distribution of prokaryotic genomes (N = 779)", axis: Genome size (Mbp); "AT content distribution of prokaryotic genomes (N = 779)", axis: AT content (percent). Groups shown: Crenarchaeota (n = 16), Euryarchaeota (n = 35), Nanoarchaeota (n = 1), Acidobacteria (n = 2), Actinobacteria (n = 55), Aquificae (n = 3), Bacteroidetes/Chlorobi (n = 26), Chlamydiae/Verrucomicrobia (n = 13), Chloroflexi (n = 7), Cyanobacteria (n = 33), Deinococcus/Thermus (n = 4), Firmicutes (n = 155), Fusobacteria (n = 1), Planctomycetes (n = 1), Alphaproteobacteria (n = 94), Betaproteobacteria (n = 61), Gammaproteobacteria (n = 191), Deltaproteobacteria (n = 21), Epsilonproteobacteria (n = 22), Spirochaetes (n = 16), Thermotogae (n = 8), Other archaea (n = 1), Other bacteria (n = 13).]

In addition to the average GC content for a whole genome, local variation within a given genome can be examined, and this reveals two general trends for almost all bacterial genomes. First, on a more global, chromosomal level, a large region flanking the origin of DNA replication tends to be more GC rich, and the region around the replication terminus is usually more AT rich. AT-rich sequences melt more easily than GC-rich sequences, due in part to the extra hydrogen bond present in a GC base pair. Counterintuitively, this would make the origin of replication the least likely place for replication to start. However, within this "large region" around the origin, which spans approximately 5% of the chromosome, there is a short stretch of more AT-rich base pairs where the replication bubble opens up. Second, zooming in on genes, the average GC content of intergenic regions is generally lower than that of coding sequences. These regions melt more readily and are more curved and more rigid than the chromosomal average, which facilitates gene expression (Pedersen et al., 2000; Ussery and Hallin, 2004). This is true for nearly all of the bacterial genomes sequenced, regardless of GC content.

To calculate relative or local %GC, a window has to be defined (say, 100 base pairs) for which the %GC is calculated. This window is then moved along the genome in single-nucleotide steps, and the %GC is assigned to the middle position of each window. These scores can then be represented graphically. A web-based tool for this is available at the Genome Atlas website,² in which local %GC can be visualized by color codes, as discussed below.
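The sliding-window calculation of local %GC described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Genome Atlas implementation: the function name `local_gc_percent` is hypothetical, positions are 0-based, and the sequence is treated as linear, so windows do not wrap around the origin as they would on a circular chromosome.

```python
def local_gc_percent(sequence, window=100):
    """Slide a window along the sequence in single-nucleotide steps.

    Returns a list of (middle_position, percent_gc) tuples.
    """
    seq = sequence.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    is_gc = [1 if base in "GC" else 0 for base in seq]
    count = sum(is_gc[:window])  # G+C count in the first window
    half = window // 2
    profile = [(half, 100.0 * count / window)]
    for start in range(1, len(seq) - window + 1):
        # Update the running count: add the base entering the window,
        # subtract the base that just left it.
        count += is_gc[start + window - 1] - is_gc[start - 1]
        profile.append((start + half, 100.0 * count / window))
    return profile


# Toy sequence with a GC-rich left half and an AT-rich right half.
for position, gc in local_gc_percent("GGGGAAAA", window=4):
    print(position, gc)
```

Keeping a running count makes the cost linear in the genome length, which matters when the window is moved one nucleotide at a time over several megabases.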

3 Visualization of Genomic Data: The Genome Atlas<br />

Genome atlases are circular plots of chromosomes or plasmids (a linear version is available when applicable) on which general properties of the DNA molecule are plotted as colors. Genome atlases are available from our web server² for many of the currently sequenced bacterial genomes. Figure 2 shows a Genome Atlas for the chromosome of Geobacillus kaustophilus strain HTA426 (a thermophilic Firmicute that also contains a plasmid of 4.8 kb). This isolate was obtained from deep-sea sediment of the Mariana Trench in the Pacific Ocean (Takami et al., 2004a, b). Its genome is 3.5 Mbp long and contains 52.1% GC. G. kaustophilus has been suggested as a possible solution for paraffin deposition problems in oil production (Sood and Lal, 2008).

A Genome Atlas maps four different aspects of the chromosomal DNA sequence in various lanes in a standard manner: DNA structural features are represented in the three outer lanes, all coding sequences are indicated in the next lane, two kinds of repeats are mapped in the next two lanes, and base composition properties are plotted in the two innermost lanes (Jensen et al., 1999). The scale in the center corresponds to the sequence numbering in GenBank. The DNA structural features of the three outermost circles are based on the physical-chemical properties of the DNA helix. The annotated genes are shown in blue for protein-coding genes oriented clockwise and in red for genes on the other strand (counterclockwise). The tRNA and rRNA genes have their own colors. The clockwise strand corresponds to the sequence stored in GenBank (genes on the other strand are annotated there as "complement"). To identify global repeats (sequences that are repeated elsewhere on the chromosome), we search for the best match of a 100 bp window against the entire chromosome. Searching on the positive strand yields direct repeats (both sequences run in the same direction), whereas searching on the negative strand yields inverted repeats (the two repeat units run in opposite directions). For most of the general properties summarized in a Genome Atlas (structural properties, repeats, base composition), dedicated atlases are also available in which more features are given (such as local and simple repeats in a Repeat Atlas, or base composition in a Base Atlas). Such specialized atlases are explained in detail in a book that we recently produced (Ussery et al., 2008).

² http://www.cbs.dtu.dk/services/GenomeAtlas/

Figure 2. Genome atlas of the main chromosome of Geobacillus kaustophilus HTA426 (3,544,776 bp). See text for further explanation.
[Legend, with color scale ranges: Intrinsic curvature 0.17 to 0.22; Stacking energy −9.03 to −7.55; Position preference 0.14 to 0.17; Annotations: CDS +, CDS −, rRNA, tRNA; Global direct repeats 5.00 to 7.50; Global inverted repeats 5.00 to 7.50; GC skew −0.15 to 0.14; Percent AT 0.20 to 0.80. Resolution: 1418. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

As can be seen in Fig. 2, the genes in this chromosome strongly favor one strand: the positive strand for the first (right) half and the negative strand for the second (left) half of the chromosome. In both halves this happens to be the leading strand during replication. Replication starts at the origin (the 12 o'clock position here) and proceeds on either side along the circle, with both a leading and a lagging strand, until the replication bubble reaches the terminus at 6 o'clock and the ends are joined. The positive strand represented by a genome sequence is the leading strand, but only for the first half, up to the terminus. Reading across the terminus along the sequence on the same strand, one enters the lagging strand. Gene preference for the leading strand is a general feature of Firmicutes and of some other bacteria.

In Fig. 2 the two outermost lanes identify some regions with strong structural properties (for instance the region around 2 o'clock, indicated by a black line). The observed strong curvature (blue in the outermost lane) where the DNA would easily melt (red in the second lane) suggests that this region contains genes that are highly expressed.

There are a number of global repeats, notably in the first quarter of the chromosome. Note that the ribosomal RNA genes (light blue in the annotation lane) are located here, as indicated by the arrows; these are picked up as global repeats because they are indeed repeated genes.

The GC skew lane shows the bias of G's towards one strand or the other, averaged over a 10,000 bp window. In contrast to many Firmicutes with a strong GC skew, this genome has only a weak GC skew (the right half is light blue and the left half is light pink). The innermost circle colors the local AT content when it is more than three standard deviations away from the global average. Note the light red color around the 2 o'clock region: this local deviation in AT content is related to the structural features located here.
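A windowed GC skew of the kind plotted in the GC skew lane can be sketched as below. The text does not give the formula used by the Genome Atlas; the sketch assumes the commonly used definition (G − C)/(G + C) per window. The function name `gc_skew`, the non-overlapping default step, and the toy sequence are hypothetical choices for illustration.

```python
def gc_skew(sequence, window=10_000, step=None):
    """Return (window_midpoint, skew) tuples along a linear sequence.

    Skew is computed as (G - C) / (G + C) per window; this definition
    is an assumption, not taken from the Genome Atlas documentation.
    """
    seq = sequence.upper()
    step = step or window  # default: non-overlapping windows
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # A window without any G or C has no defined skew; report 0.0.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        skews.append((start + window // 2, skew))
    return skews


# Toy sequence: G-rich first half, C-rich second half.
toy = "GGGC" * 5 + "GCCC" * 5
print(gc_skew(toy, window=20))
```

On a real chromosome with a strong skew, the sign of the value typically changes near the origin and the terminus of replication; the weak skew of G. kaustophilus would show up as values close to zero in both halves.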

The Genome Atlas of the archaeon Methanosarcina acetivorans, shown in Fig. 3, tells a different story. This strictly anaerobic organism produces methane so efficiently that it is held responsible for virtually all biogenic methane. It can also oxidize CO to CO2 (Lessner et al., 2006). Strain C2A (the type strain of the species) was isolated from a marine sediment (Galagan et al., 2002). Its genome is 5.7 Mbp long and contains 42.7% GC. The Genome Atlas shows that its genes are evenly distributed over the two strands, and a GC skew is absent. Instead, the lower quarter of the genome contains many strong structural features. The genome contains only three rRNA gene copies (indicated by arrows), one of which is located on the negative strand (but, as discussed above, this is actually the leading strand, as is preferred for nearly all bacterial rRNA genes). Many other global repeats are visible, notably in the region around 1.2 Mbp, which is strongly curved and easily melted, and is slightly more AT rich than the rest of the genome. The important carbon-monoxide dehydrogenase gene locus is present here, as are multiple transposases, which could be an indication of horizontally acquired DNA. The genome is relatively poorly annotated, with many genes given as "predicted protein" only, which is not uncommon for archaeal genomes.

In conclusion, a Genome Atlas combines a number of features in one single figure that summarizes a very long and detailed story about a chromosome or plasmid.

4 Whole Genome Alignment Methods<br />

Another way to compare genomes is based on alignment of nucleotide or amino acid sequences. Sequence alignment is a common tool to identify similarities, with BLAST (Basic Local Alignment Search Tool) being the most common (Altschul et al., 1990). However, BLAST is not automatically suitable for large DNA input segments such as complete genomes. A more suitable program to align sequences in the range of megabases is MUMmer, developed at TIGR, of which version 3 is now publicly available (Kurtz et al., 2004). This method has recently been extended to include the average nucleotide identity in the conserved core genes of a set of genomes (Deloger et al., 2009). Graphical representation of the resulting alignment is a further issue, and specific tools have been designed to align genome sequences and visualize the results. The Artemis Comparison Tool (ACT) is worth mentioning; two versions are available: a downloadable version to be used on a local computer (Carver et al., 2005) and a web-based version with pre-computed comparisons between several hundred bacterial genomes.³ BLAST results of entire bacterial chromosomes against each other have also been used to construct phylogenetic trees (Henz et al., 2005). BLAST comparisons are treated in Section 7 of this chapter.

Figure 3. Genome atlas of the main chromosome of the archaeon Methanosarcina acetivorans C2A (5,751,492 bp).
[Legend, with color scale ranges: Intrinsic curvature 0.18 to 0.24; Stacking energy −8.10 to −7.21; Position preference 0.13 to 0.15; Annotations: CDS +, CDS −, rRNA, tRNA; Global direct repeats 5.00 to 7.50; Global inverted repeats 5.00 to 7.50; GC skew −0.03 to 0.02; Percent AT 0.20 to 0.80. Resolution: 2301. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

5 Compar<strong>in</strong>g the Cod<strong>in</strong>g Fraction of Genomes<br />

The typical coding density for a bacterial genome is about 90%, ranging from 95% for Pelagibacter ubique (an alphaproteobacterial marine bacterium that is among the most numerous bacteria in the world) (Giovannoni et al., 2005) to around 75% for M. acetivorans. Intracellular bacteria can have a coding density as low as 50%. This means that the majority of bacterial DNA codes for genes, which are mostly not spliced, so that introns are absent (with very few exceptions). However, not every open reading frame is a gene, and it appears that many bacterial genomes are over-annotated, predicting 10–15% more genes than are real (Skovgaard et al., 2001). These over-annotated genes are frequently short open reading frames. In addition, genes can be missed in the annotation. A frequent mistake is that genes are annotated on the wrong strand, which can happen if the reading frame is open in either direction. The intergenic regions separating genes regulate transcription, and in intracellular bacteria they frequently contain pseudogenes or repeats. Genes not coding for proteins include tRNA and rRNA genes, and some parts of intergenic regions can be transcribed into stable RNAs that do not code for proteins. E. coli contains several hundred small non-coding RNA (ncRNA) genes (Chen et al., 2002) that can act as regulators (Gottesman, 2005). Their role in environmental bacteria is virtually unexplored.
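Coding density as discussed above can be estimated from a list of annotated gene coordinates. The sketch below is an assumption-laden illustration: the function name `coding_density` is hypothetical, coordinates are taken as 1-based inclusive (start, end) pairs as in a GenBank feature table, strand is ignored, and overlapping genes are merged so that shared bases are counted once.

```python
def coding_density(genes, genome_length):
    """Percentage of the genome covered by at least one annotated gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # The previous merged block is finished; add its length.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping gene: extend the current block if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / genome_length


# Hypothetical annotation: two overlapping genes and one separate gene.
example_genes = [(1, 300), (250, 500), (601, 900)]
print(coding_density(example_genes, genome_length=1000))
```

Because the result depends directly on the annotation, an over-annotated genome with many spurious short open reading frames would appear denser than it really is.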

Although tRNA and rRNA genes are essential to life, they are sometimes missed in the annotation of a genome, a rather embarrassing omission, or occasionally annotated on the wrong strand (Lagesen et al., 2007). The number and location of rRNA operons in a genome can say something about an organism. It appears that organisms with short doubling times have larger numbers of rRNA and tRNA genes. Comparing Figs. 2 and 3, it is likely that G. kaustophilus, with nine rRNA copies, nearly all located close to the origin of replication (which boosts expression during replication as their copy number increases), can divide more quickly than M. acetivorans, which has only three copies. Some really fast-growing bacteria can have 14 or more rRNA copies, as can be seen in our list of genomes.⁴

³ http://www.webact.org/WebACT/home

⁴ http://www.cbs.dtu.dk/services/GenomeAtlas/


6 Codon Usage Comparisons<br />

Once the genes of a given genome have been defined, their codon usage can be analyzed. Since<br />

the genetic code is redundant, with up to 6 codons per amino acid, synonymous codons are used at<br />

different frequencies. Much of the redundancy in the genetic code is due to third base<br />

variation. Figure 4 displays the codon usage for three prokaryotic genomes: Methanosphaera<br />

stadtmanae (27.6% GC), an archaeal methanogen that uses methanol and hydrogen to<br />

produce methane; Desulfitobacterium hafniense (47.4% GC), a Firmicute that efficiently<br />

dehalogenates tetrachloroethene and polychloroethanes; and Anaeromyxobacter dehalogenans<br />

(75% GC). The latter species, the first myxobacterium to be grown as a pure culture, can use ortho-substituted<br />

mono- and dichlorinated phenols. The frequency of each possible codon is plotted<br />

in a wheel plot in the upper part of the figure, arranged such that the third base is conserved<br />

in each quarter. The bias in codon usage at the third position can also be seen in the<br />

sequence logo plots in the lower part of Fig. 4. From both graphics it is evident that genomic<br />

GC content strongly affects codon use (or the other way round). Based on a genome’s bias in<br />

codon usage, it is possible to predict its likely environmental niche (Willenbrock et al., 2006).<br />

Moreover, it is known that amino acid usage (not shown here) depends on environment, based<br />

on analysis of metagenomic samples (Musto et al., 2006; Foerstner et al., 2005).<br />
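As a minimal sketch of how codon frequencies of this kind could be tallied (it is not the pipeline used to produce the figure), the Python snippet below counts the codons of one in-frame coding sequence and reports the GC fraction at the third codon position. The function names and the toy sequence are assumptions made for illustration only.<br />

```python
from collections import Counter


def codon_usage(cds):
    """Relative frequency of each codon in one in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def third_position_gc(cds):
    """Fraction of G or C among the third bases of all complete codons."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    thirds = cds[2:usable:3]                  # every third base, starting at index 2
    return sum(base in "GC" for base in thirds) / len(thirds)


# Toy sequence (hypothetical): ATG GCG GCG TAA
toy = "ATGGCGGCGTAA"
usage = codon_usage(toy)
gc3 = third_position_gc(toy)
```

Summing such counts over all genes of a genome gives the per-codon frequencies that a wheel plot displays; comparing `gc3` with the overall GC content shows how strongly the third position follows the genomic base composition.<br />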

7 Protein Sequence Comparisons<br />

One can compare each individual gene in a given genome by BLAST against a set of genomes.<br />

This produces a huge amount of data that can be graphically represented in a BLAST Matrix<br />

(Binnewies et al., 2005; Ussery et al., 2009). A BLAST Matrix is not symmetrical, as the<br />

outcome is determined by which genome is used as query sequence. The diagonal of a BLAST<br />

Matrix represents a BLAST of a genome against itself. The self-match (the gene finding itself) is<br />

discarded, thus the reported scores reflect internal homologues present in a given genome.<br />

Most of these have been derived from gene duplication and are thus paralogs.<br />
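The asymmetry and the handling of the diagonal can be illustrated with a small sketch. It assumes that BLAST has already been run and filtered, and that each accepted hit is available as a tuple (query genome, query gene, subject genome, subject gene); this tuple layout, the function name `blast_matrix` and the use of "fraction of query genes with a hit" as the cell value are all assumptions for illustration, and the published BLAST Matrix may count homologs differently.<br />

```python
def blast_matrix(hits, gene_counts):
    """Fraction of genes in each query genome that have a hit in each subject genome.

    hits        -- iterable of (query_genome, query_gene, subject_genome, subject_gene)
    gene_counts -- dict mapping genome name to its total number of genes
    """
    matched = {}
    for q_genome, q_gene, s_genome, s_gene in hits:
        if q_genome == s_genome and q_gene == s_gene:
            continue                      # discard the self-match on the diagonal
        matched.setdefault((q_genome, s_genome), set()).add(q_gene)
    genomes = sorted(gene_counts)
    return {
        q: {s: len(matched.get((q, s), set())) / gene_counts[q] for s in genomes}
        for q in genomes
    }


# Toy data (hypothetical): genome A has 4 genes, genome B has 2.
toy_hits = [
    ("A", "a1", "A", "a1"),   # self-match, ignored
    ("A", "a1", "A", "a2"),   # internal homologue within A
    ("A", "a1", "B", "b1"),
    ("B", "b1", "A", "a1"),
]
matrix = blast_matrix(toy_hits, {"A": 4, "B": 2})
```

Here `matrix["A"]["B"]` and `matrix["B"]["A"]` differ because the denominators are the gene counts of different query genomes, which is one reason such a matrix is not symmetrical.<br />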

When more information should be visualized, a BLAST Atlas is helpful. Such an atlas uses<br />

one genome as a reference against which the gene conservation of other genomes is plotted<br />

(Hallin and Ussery, 2004; Skovgaard et al., 2002). In this case gene location only refers to the<br />

location in the reference genome, which of course can be varied in multiple BLAST Atlases.<br />

A BLAST Atlas is also a suitable platform to visualize metagenomic data. So far, we have<br />

not dealt with metagenomics extensively, mainly because this approach very rarely results in<br />

completely assembled microbial genomes. But for a BLAST Atlas, that is not a problem,<br />

as one can combine all the metagenomic DNA in one lane, thereby ignoring from which<br />

organism the detected genes originated. All obtained BLAST hits are plotted around a<br />

reference genome. An example of a BLAST Atlas is given in Fig. 5, centered around<br />

Pelotomaculum thermopropionicum, a thermophilic, syntrophic Firmicute that can utilize<br />

1-butanol, 1-propanol, 1-pentanol or 1,3-propanediol as a carbon source. Note that despite<br />

the high number of lanes, conserved and variable genes can still be easily inspected visually.<br />

From compacting a single genome into a Genome Atlas, we have now moved several levels up<br />

and compact multiple genomes into a single atlas. In Fig. 5, the P. thermopropionicum<br />

genome is compared to many species of Clostridia, as well as other bacteria. Unfortunately,<br />

very few BLAST hits were found with the metagenomic samples, so there is very little color in<br />

those three lanes. Compared to well characterized genomes (like E. coli), relatively few hits are


[Figure 4, panel labels: Methanosphaera stadtmanae DSM 3091 (72% AT); Desulfitobacterium hafniense Y51 (53% AT); Anaeromyxobacter dehalogenans 2CP-C (25% AT). Each panel shows a codon wheel plot above a sequence logo for the 1st, 2nd and 3rd codon positions.]<br />

Figure 4<br />

Frequency wheel plots of codon usage (top) and sequence logo plots (bottom) of Anaeromyxobacter dehalogenans (left), Desulfitobacterium hafniense<br />

(middle) and Methanosphaera stadtmanae (right).


[Figure 5, labels: reference genome P. thermopropionicum SI, 3,025,375 bp (scale 0M to 2.5M). Legend: 2 Alkaliphilus species; Bacillus fragilis; 17 Clostridium species; 4 Desulfitobacterium species; E. coli K-12; 6 other species belonging to Clostridia.]<br />

Figure 5<br />

BLAST Atlas with Pelotomaculum thermopropionicum as the reference genome. Around this the<br />

BLAST hits of 31 genomes of other bacteria are added as listed to the right, from the outermost<br />

circle (top in the legend), to the innermost circle of the bacterial genomes (bottom of legend).<br />

The outermost lane shows the hits of P. thermopropionicum in the UniProt database (which<br />

does not contain all annotated genes as it requires biological evidence of a gene product).<br />

The next three lanes are metagenomic DNA samples from...[Dave specify] and next follow<br />

30 genomes of other bacteria as listed to the right.<br />

found in other genomes, indicated by lack of strong colour in most of the lanes in Figure 5.<br />

This is probably a reflection of the huge diversity in DNA content in such samples, reducing<br />

the chance of a BLAST hit. It is a sobering thought that there is still so little we know, and so<br />

much that remains to be discovered in the microbial world.<br />

Many methods are being developed that use sets of conserved genes and gene<br />

families in related organisms to cluster organisms into groups; these groups can represent<br />

known taxonomic relationships. For example, certain genes might be common to a set of<br />

organisms growing in a particular ecological niche. Some examples of such regions along the<br />

chromosome can be seen in the BLAST Atlas plots where genomes of related organisms of<br />

different species are compared.


8 Gene Synteny and Genome Islands<br />

A comparison of genes present, absent or diverged between genomes usually ignores gene synteny:<br />

the position at which such genes are found. The term was coined for eukaryotes to describe genes<br />

that were located on the same chromosome; in bacterial genomes the local neighboring genes,<br />

their order and direction are usually compared. The closer two organisms are, the more likely<br />

gene synteny is to be conserved (between genomes of the same genus, or species, subspecies or<br />

phylogenetic clade, in increasing order). Gene synteny is destroyed by inversions (changing the<br />

direction of one or several genes), translocations (changing the position of genes) and insertion<br />

and deletion events. All of these can result from mistakes during replication, or be the result of<br />

self-replicating mobile elements, such as bacteriophages, integrons, transposons etc.<br />
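One simple way to put a number on local synteny, sketched here under strong simplifying assumptions, is to ask which fraction of neighboring gene pairs in one genome are also neighbors in another. The sketch assumes that orthologous genes already carry the same identifier in both genomes, treats the chromosome as linear, and ignores gene direction; the function names and the toy gene orders are hypothetical.<br />

```python
def adjacency_set(order):
    """Unordered pairs of genes that sit next to each other in a gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def synteny_conservation(order_a, order_b):
    """Fraction of neighboring gene pairs of genome A that are also neighbors in B."""
    pairs_a = adjacency_set(order_a)
    pairs_b = adjacency_set(order_b)
    if not pairs_a:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a)


# Toy gene orders (hypothetical): genome B carries an inversion of g3-g4.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g2", "g4", "g3", "g5"]
conserved = synteny_conservation(genome_a, genome_b)
```

In the toy example the inversion breaks the two adjacencies at its borders but keeps g3 and g4 next to each other, so half of the neighboring pairs remain conserved.<br />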

The events that affect gene synteny and the point mutations that accumulate during<br />

replication are the two major forces that increase genetic diversity; selection of those organisms<br />

that are fittest to survive particular conditions decreases diversity. Evolution further<br />

depends on the change of such selective conditions. With a slow but steady re-shuffling of<br />

genes by evolutionary processes, a pattern emerges of a genetic “backbone” of genes whose<br />

location is relatively conserved between genomes of reasonable genetic distance, and groups of<br />

“cluttered” genes that are far more variable, in what have been termed “genome islands.”<br />

Genome islands usually contain genes that are all involved in a particular phenotypic process.<br />

Examples are pathogenicity islands, symbiosis islands, metabolic islands or magnetosome<br />

islands. Specific cases include the sulfur metabolism islands discovered in metagenomic sequences from<br />

marine sediments (Mussmann et al., 2005) and the magnetosome island containing all genes<br />

that produce the intracellular organelle enabling magnetotactic bacteria to orient themselves<br />

along magnetic field lines (Richter et al., 2007). The evolutionary advantage of genome islands<br />

is obvious. They can be regarded as genetic “building blocks”; when transferred from one<br />

organism to the next, they can confer a complete phenotypic trait to the acceptor, enabling,<br />

for instance, adaptation to a novel ecological niche.<br />

9 Minimal Information About a Genome Sequence<br />

Genome sequences are stored in public databases such as GenBank under their biological<br />

names (preceded by “candidatus” for undecided taxonomic position), or by a code of<br />

numbers and letters for unculturable organisms that have not been classified. Unfortunately,<br />

other relevant information is often lacking. It has become apparent that biological and<br />

environmental data are important, and a recent standard for “Minimal Information about a<br />

Genome Sequence” has been proposed (Field et al., 2008). The Genomic Standards Consortium<br />

(GSC, http://gensc.org) promotes the standardization of genome sequencing descriptions<br />

and their exchange and integration in the scientific community. Overall, it is important<br />

that genome sequence information is released into the public domain in a timely manner so<br />

that global scientific progress can be maintained.<br />

10 Research Needs<br />

Multiple genome sequences are available for very few environmental species. From genomic<br />

intra-species comparisons of pathogenic bacteria we know that these provide an extra layer of<br />

information, as genetic diversity within a bacterial species can be enormous. When multiple<br />

genomes are available for a species we can define its core genome (all genes that are present in<br />

all genomes of that species), its pan-genome (all genes that have been found in that species)<br />

and its dispensable genes that are responsible for the variation between isolates. Multiple<br />

genomes per species, together with more metagenomic data and more archaeal genome<br />

sequences, comprise our most urgent data gaps. The research tools for analysis of the<br />

genomes are available. Generate the sequences and the feast can begin.<br />
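The three definitions above map directly onto set operations. The following sketch assumes that the genes of each isolate have already been assigned to gene families with shared identifiers (the clustering step itself is not shown); the function name `core_and_pan` and the toy isolates are hypothetical.<br />

```python
def core_and_pan(genomes):
    """Return (core, pan, dispensable) gene-family sets for a dict of isolates.

    genomes -- dict mapping isolate name to an iterable of gene-family identifiers
    """
    gene_sets = [set(families) for families in genomes.values()]
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)   # present in every genome
    pan = set.union(*gene_sets)           # found in at least one genome
    dispensable = pan - core              # variable between isolates
    return core, pan, dispensable


# Toy isolates (hypothetical gene-family identifiers).
isolates = {
    "isolate_1": ["famA", "famB", "famC"],
    "isolate_2": ["famA", "famB", "famD"],
    "isolate_3": ["famA", "famB"],
}
core, pan, dispensable = core_and_pan(isolates)
```

As more isolates are added, the core set can only shrink or stay the same while the pan set can only grow or stay the same, which is why additional genomes per species are so informative.<br />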

References<br />

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.<br />

Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) Genome update: proteome comparisons. Microbiology 151: 1–4.<br />

Binnewies TT, et al. (2006) Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.<br />

Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J (2005) ACT: the Artemis Comparison Tool. Bioinformatics 21: 3422–3423.<br />

Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY (2007) VNTRDB: a bacterial variable number tandem repeat locus database. Nucleic Acids Res 35: D416–D421.<br />

Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ, Blyn LB (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems 65: 157–177.<br />

Deloger M, El Karoui M, Petit MA (2009) A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 191: 91–99.<br />

Denoeud F, Vergnaud G (2004) Identification of polymorphic tandem repeats by direct comparison of genome sequence from different bacterial strains: a web-based resource. BMC Bioinformatics 5: 4.<br />

Field D, et al. (2008) The minimum information about a genome sequence (MIGS) specification. Nature Biotechnol 26: 541–547.<br />

Foerstner KU, von Mering C, Hooper SD, Bork P (2005) Environments shape the nucleotide composition of genomes. EMBO Rep 6: 1208–1213.<br />

Galagan JE, et al. (2002) The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res 12: 532–542.<br />

Giovannoni SJ, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.<br />

Gottesman S (2005) Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet 21: 399–404.<br />

Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33: D121–D124.<br />

Hallin PF, Binnewies TT, Ussery DW (2008) The genome BLAST atlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4: 363–371.<br />

Hallin PF, Ussery DW (2004) CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20: 3682–3686.<br />

Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC (2005) Whole-genome prokaryotic phylogeny. Bioinformatics 21: 2329–2335.<br />

Jensen LJ, Friis C, Ussery DW (1999) Three views of microbial genomes. Res Microbiol 150: 773–777.<br />

Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) Versatile and open software for comparing large genomes. Genome Biol 5: R12.<br />

Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35: 3100–3108.<br />

Lessner DJ, et al. (2006) An unconventional pathway for reduction of CO2 to methane in CO-grown Methanosarcina acetivorans revealed by proteomics. Proc Natl Acad Sci USA 103: 17921–17926.<br />

Mussmann M, Richter M, Lombardot T, Meyerdierks A, Kuever J, Kube M, Glöckner FO, Amann R (2005) Clustered genes related to sulfate respiration in uncultured prokaryotes support the theory of their concomitant horizontal transfer. J Bacteriol 187: 7126–7137.<br />

Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, Bernardi G (2006) Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochem Biophys Res Commun 347: 1–3.<br />

Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW (2000) A DNA structural atlas for Escherichia coli. J Mol Biol 299: 907–930.<br />

Richter M, Kube M, Bazylinski DA, Lombardot T, Glöckner FO, Reinhardt R, Schüler D (2007) Comparative genome analysis of four magnetotactic bacteria reveals a complex set of group-specific genes implicated in magnetosome biomineralization and function. J Bacteriol 189: 4899–4910.<br />

Selengut JD, et al. (2007) TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 35: D260–D264.<br />

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17: 425–428.<br />

Skovgaard M, Jensen LJ, Friis C, Stærfeldt HH, Worning P, Brunak S, Ussery D (2002) The atlas visualisation of genome-wide information. Meth Microbiol 33: 49–63.<br />

Sood N, Lal B (2008) Isolation and characterization of a potential paraffin-wax degrading thermophilic bacterial strain Geobacillus kaustophilus TERI NSM for application in oil wells with paraffin deposition problems. Chemosphere 70: 1445–1451.<br />

Takami H, et al. (2004a) Genomic characterization of thermophilic Geobacillus species isolated from the deepest sea mud of the Mariana Trench. Extremophiles 8: 351–356.<br />

Takami H, et al. (2004b) Thermoadaptation trait revealed by the genome sequence of thermophilic Geobacillus kaustophilus. Nucleic Acids Res 32: 6292–6303.<br />

Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5: 163.<br />

Ussery DW, Hallin PF (2004) Genome update: AT content in sequenced prokaryotic genomes. Microbiology 150: 749–752.<br />

Ussery DW, Borini S, Wassenaar TM (2009) Computing for Comparative Microbial Genomics: Bioinformatics for Microbiologists (Computational series). London: Springer.<br />

Wheeler DL, et al. (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35: D5–D12.<br />

Willenbrock H, Friis C, Friis AS, Ussery DW (2006) An environmental signature for 323 microbial genomes based on codon adaptation indices. Genome Biol 7: R114.<br />

Worning P, Jensen LJ, Hallin PF, Staerfeldt HH, Ussery DW (2006) Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8: 353–361.<br />


Chapter 3<br />

rRNA operons and promoter analysis<br />

3.1 Introduction<br />

This chapter covers two papers (VI and VII), dealing with rRNA localization within the<br />

genome, and analysis of the promoter region upstream of rRNA operons. The RNAmmer<br />

tool (Lagesen et al., 2007) presented in paper VI was motivated by the lack of a software<br />

tool able to accurately and consistently annotate ribosomal RNA (rRNA) genes<br />

in prokaryotes. BLAST strategies are widely used for this as the rRNA genes are highly<br />

conserved. However, homology search methods often produce less accurate gene boundaries<br />

as they fail to account for the observed variation in some regions. Hidden Markov<br />

Model (HMM) strategies, such as RNAmmer, can take into account conserved stem loop<br />

structures, greatly improving the accuracy of prediction of the full length rRNA genes.<br />

Particular detail will be given to the E. coli rRNA operons in terms of promoter predictions,<br />

since much experimental information is known about this system. An application<br />

of the gwBrowser as a tool for visualization of promoter regions upstream of the rRNA<br />

operons in E. coli concludes the chapter. The gwBrowser effort is currently being published<br />

in the Standards in Genomic Sciences journal. The P1 and P2 prediction tools are<br />

still under development, and have not been published.<br />

Encoding the central structure of the ribosome, the 5S, 16S, and 23S rRNA genes are<br />

essential for protein synthesis and are transcribed at high levels. In E. coli the rrn operons<br />

are regulated by a tandem promoter system. With abundant transcription, the system is<br />

favorable for studying the mechanisms of highly expressed genes and for establishing connections<br />

to the physical properties of the DNA. In this work, the SIDD (stress-induced duplex destabilization) energy (Wang et al., 2004;<br />

Wang & Benham, 2008) was used to measure the energy required to melt the DNA<br />

helix near the promoter region. The work was carried out during my visit to Professor<br />

Craig Benham’s lab at UC Davis, fall 2007.<br />

3.2 P1 and P2 promoters in E. coli<br />

The seven rRNA operons of E. coli are regulated by the two promoters P1 and P2,<br />

where P1 is active predominantly during exponential growth whereas P2 is active during<br />

stationary phase (Hirvonen et al., 2001; Murray & Gourse, 2004). Apart from the –10 and<br />

–35 hexamers, the P1 site contains between 3 and 5 FIS (Factor for Inversion Stimulation)<br />

binding sites and an UP element. FIS has been reported to increase the transcription in<br />

vivo by 4-10 fold in this system (Bokal et al., 1995).<br />

[Figure 3.1, labels: –35 and –10 elements, σ factor, two α subunits, ββ′ subunits, +1, CDS.]<br />

Figure 3.1: The transcription of bacterial genes.<br />

The first step in transcription occurs when the sigma factor binds to the –10 and<br />

–35 regions, followed by wrapping of the DNA template around the large RNA polymerase<br />

holoenzyme complex, causing a bend in the DNA molecule (figure 3.1). Roughly 150 bp of<br />

DNA is wrapped around the polymerase, forming a constrained supercoil. The wrapping<br />

interaction with the two α-subunits is particularly important for the right orientation<br />

of the DNA with respect to the promoter sites and for transcription initiation.<br />

Binding of the FIS protein can strongly bend the DNA and, if properly spaced, greatly<br />

facilitate the wrapping of the DNA around the alpha subunits. The DNA bending takes<br />

place via a helix-turn-helix structure, and FIS recognizes a 15 nucleotide symmetric motif<br />

(Hengen et al., 1997). The stress that is induced when FIS binds to the DNA helix<br />

causes a bend which destabilizes the helix, lowering the energy required for melting further<br />

downstream (Wang & Benham, 2008; Bokal et al., 1995). Being highly expressed<br />

during exponential phase, FIS ensures an increased activity of P1 compared with P2. In an<br />

E. coli strain lacking the FIS protein the P2 promoter is more active during exponential<br />

growth. The same study suggests that FIS has a repressive effect on P2 (Liebig & Wagner,<br />

1995). Both P1 and P2 contain an UP element binding to the RNA polymerase α C-terminal<br />

domain (αCTD). This work aims at applying an information content method to<br />

the P1 and P2 system, accounting for helical spacing between these regulatory elements as<br />

well as the conservation of the motifs. The tandem promoter system is depicted in figure<br />

3.2.<br />

3.3 Conservation of regulatory elements<br />

Information content is widely used in bioinformatics to find and rank independent motifs<br />

as an alternative to machine learning approaches. Shultzaberger and co-workers have expanded<br />

earlier applications of information content by describing the helical facing between<br />

regulatory elements on the DNA strand (Shultzaberger et al., 2007). This framework allows<br />

for an additive combination of both aligned weight matrices and their spacing to<br />

produce a final score of the entire structure. In the σ70 promoter, consisting<br />

of the –10 and –35 hexamers, the spacing corresponds to the two boxes being located on opposite<br />

sides of the DNA helix (see figure 3.3).<br />

Changing the spacing will likely disrupt binding by RNA polymerase.<br />

This is accounted for by applying a cosine function to the distance score (see equation 3.2).<br />

Shultzaberger’s equations were used to model the P1 and P2 system.<br />
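The effect of the cosine term can be sketched numerically. The snippet below implements the accessibility n(d) and the gap surprisal GS(d) as they are written in equations 3.2 and 3.3 of this section, with w = 10.6. Two points are assumptions rather than statements from the text: the normalising sum N is taken over all integer distances between a chosen minimum and maximum, and the example range of 13 to 19 nt with a center of 16 nt is borrowed from the spacing labels of figure 3.2 purely for illustration. The function names are hypothetical, and the sign change of GS(d) mentioned later is not applied.<br />

```python
import math

HELIX_TURN = 10.6  # w: length of one helical turn of B-form DNA, as given in the text


def accessibility(d, center, w=HELIX_TURN):
    """n(d) = 1 + cos[(2*pi / w) * (d - c)]; maximal (2.0) at the optimal spacing."""
    return 1.0 + math.cos(2.0 * math.pi / w * (d - center))


def gap_surprisal(d, center, d_min, d_max, w=HELIX_TURN):
    """GS(d) = log2(n(d) / N), with N summed over d_min..d_max (an assumption)."""
    total = sum(accessibility(x, center, w) for x in range(d_min, d_max + 1))
    n_d = accessibility(d, center, w)
    if n_d <= 0.0:
        return float("-inf")
    return math.log2(n_d / total)


# Illustrative spacing range: min 13 nt, center 16 nt, max 19 nt.
scores = {d: gap_surprisal(d, 16, 13, 19) for d in range(13, 20)}
```

In this sketch the score is highest at the center distance and falls off symmetrically as the spacer is shortened or lengthened, which mirrors the rotation of one box away from the face of the helix occupied by the other.<br />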

[Figure 3.2 diagram: the rrnB operon (16S, tRNA-Glu, 23S, 5S) with its upstream P1 and P2 promoters. Element labels: FIS III, FIS II, FIS I, UP, –35 and –10. Each spacer is annotated with a min/center/max window in nucleotides, e.g. 13/16/19 nt.]<br />

Figure 3.2: The promoter structure of the rrnB operon in E. coli.<br />

[Figure 3.3 diagram: the –10 and –35 hexamers drawn on the DNA helix.]<br />

Figure 3.3: The –10 and –35 hexamers of the E. coli σ70 promoter are located on opposite sides of the DNA helix. Deletions or insertions in the spacer cause a shift of approx. 36° per nucleotide.<br />

To score a given query sequence of length L against a weight matrix, a b × p matrix is first generated by aligning the query sequence and the matrix. This provides all $R_{b,p}$ values.<br />

\[ R_{b,p} = \log_2(4) + \log_2\!\left(\frac{n_{b,p}}{N}\right), \qquad R_{tot} = \sum_{p=1}^{L} R_{B,p} \tag{3.1} \]<br />

where $b \in \{A, T, G, C\}$ iterates through the four bases, p denotes the position in the alignment, L is the length of the alignment (or width of the matrix), $n_{b,p}$ is the number of occurrences of base b at position p, N (in this equation) is the number of aligned sequences, and B denotes the nucleotide at position p in the query sequence. Shultzaberger and co-workers account for the helical facing by introducing the accessibility, n(d) (equation 3.2), and the gap surprisal, GS(d) (equation 3.3).<br />

\[ n(d) = 1 + \cos\!\left[\frac{2\pi}{w}(d - c)\right] \tag{3.2} \]<br />

where c is the center distance between two binding sites (i.e. the optimal spacing), d is the query distance, and w = 10.6 is the number of base pairs in one helical turn of B-form DNA. Finally, this gives GS(d) as follows:<br />

\[ GS(d) = \log_2\!\left(\frac{n(d)}{N}\right) \tag{3.3} \]<br />

where N is here the sum of all n(d) within the window (see equation 3.4). The sign of GS(d) was changed relative to the original equation of Shultzaberger and co-workers so that all scores can be combined by addition.<br />

\[ N = \sum_{d=min}^{max} n(d) \tag{3.4} \]<br />

where min and max are the boundaries of the window examined. Finally, summing all $R_i$ and GS(d) values gives the total information of all motifs and all spacers (see equation 3.5):<br />

\[ R_i(tot) = R_i(m_1) + GS(d, m_1) + R_i(m_2) + \dots + GS(d, m_{n-1}) + R_i(m_n) \tag{3.5} \]<br />
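Equations 3.1 to 3.4 can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy –10 sites, and the choice of assigning minus infinity to unobserved bases (rather than using a pseudocount) are assumptions made only for this example.<br />

```python
import math

BASES = "ATGC"

def weight_matrix(sites):
    """Return one {base: R_bp} dict per column, with R_bp = log2(4) + log2(n_bp / N)."""
    n_seqs = len(sites)
    matrix = []
    for p in range(len(sites[0])):
        column = {}
        for b in BASES:
            n_bp = sum(1 for site in sites if site[p] == b)
            # Assumption: an unobserved base scores -inf; a pseudocount is a common alternative.
            column[b] = 2.0 + math.log2(n_bp / n_seqs) if n_bp else float("-inf")
        matrix.append(column)
    return matrix

def r_tot(matrix, query):
    """Equation 3.1: sum R_{B,p} over the query bases B aligned to the matrix."""
    return sum(column[base] for column, base in zip(matrix, query))

def accessibility(d, c, w=10.6):
    """Equation 3.2: n(d) = 1 + cos(2*pi/w * (d - c))."""
    return 1.0 + math.cos(2.0 * math.pi * (d - c) / w)

def gap_surprisal(d, c, d_min, d_max, w=10.6):
    """Equations 3.3 and 3.4, with the sign convention used in the text."""
    total = sum(accessibility(x, c, w) for x in range(d_min, d_max + 1))
    return math.log2(accessibility(d, c, w) / total)

# Toy alignment of four hypothetical -10 sites (not data from the thesis).
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
minus10 = weight_matrix(sites)
print(round(r_tot(minus10, "TATAAT"), 2))
print(round(gap_surprisal(16, 16, 13, 19), 2))
```

With this sign convention the gap surprisal is always negative and is least negative at the center distance, so a well-spaced structure loses the fewest bits.<br />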

3.3.1 Model<strong>in</strong>g the P1 <strong>and</strong> P2 <strong>in</strong> selected enterics<br />

Existing experimentally verified –10 and –35 hexamers (Huerta & Collado-Vides, 2003) were converted into $R_{b,p}$ matrices, together with data for known UP elements (Estrem et al., 1998) and FIS binding sites (Hengen et al., 1997). Figure 3.4 shows logo plots of the information content of these data sets. The initial weight matrices formed the basis for iteratively building the final information model of the P1 and P2 promoter structure, using the following procedure:<br />

1. Collect the E. coli and Shigella genomes<br />

2. Find the rRNA genes and extract the upstream sequences<br />

3. Apply models based on the literature weight matrices<br />

4. Refine the weight matrices according to the observations<br />

5. Formulate the final model<br />


[Figure 3.4 panels: four sequence logos, each plotting information content in bits (0.0 to 2.0) against position in the motif.]<br />

Figure 3.4: Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d).<br />

The 16S rRNA genes of all E. coli and Shigella genomes were annotated using RNAmmer. For the list of genomes, see table 3.1. All 16S rRNA genes were aligned using ClustalW (Thompson et al., 1994) and a neighbor-joining tree was constructed (see figure 3.5). The figure also includes Salmonella and Yersinia genomes for comparison.<br />


Escherichia coli 536<br />

Escherichia coli APEC O1<br />

Escherichia coli CFT073<br />

Shigella sonnei Ss046<br />

Shigella boydii Sb227<br />

Shigella flexneri 2a str. 301<br />

Shigella flexneri 2a str. 2457T<br />

Escherichia coli UTI89<br />

Escherichia coli K12<br />

Escherichia coli O157:H7 EDL933<br />

Escherichia coli O157:H7 str. Sakai<br />

Escherichia coli W3110<br />

Shigella dysenteriae Sd197<br />

Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67<br />

Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150<br />

Salmonella enterica subsp. enterica serovar Typhi Ty2<br />

Salmonella enterica subsp. enterica serovar Typhi str. CT18<br />

Salmonella typhimurium LT2<br />

Yers<strong>in</strong>ia pestis Antiqua<br />

Yers<strong>in</strong>ia pestis CO92<br />

Yers<strong>in</strong>ia pestis KIM<br />

Yers<strong>in</strong>ia pestis Nepal516<br />

Yers<strong>in</strong>ia pestis Pestoides F<br />

Yers<strong>in</strong>ia pestis biovar Microtus str. 91001<br />

Yers<strong>in</strong>ia pseudotuberculosis IP 32953<br />

Figure 3.5: Neighbor-joining tree of the first 1,000 bases of all 16S rRNA genes of Yersinia, Salmonella, Shigella, and E. coli.<br />


Organism Accession Reference<br />

Escherichia coli 101-1 AAMK00000000 (unpublished)<br />

Escherichia coli 53638 AAKB00000000 (unpublished)<br />

Escherichia coli 536 CP000247 (Brzuszkiewicz et al., 2006)<br />

Escherichia coli APEC O1 CP000468 (Johnson et al., 2007)<br />

Escherichia coli B171 AAJX00000000 (unpublished)<br />

Escherichia coli B7A AAJT00000000 (unpublished)<br />

Escherichia coli B AAWW00000000 (unpublished)<br />

Escherichia coli CFT073 AE014075 (Welch et al., 2002)<br />

Escherichia coli E110019 AAJW00000000 (unpublished)<br />

Escherichia coli E22 AAJV00000000 (unpublished)<br />

Escherichia coli F11 AAJU00000000 (unpublished)<br />

Escherichia coli K12 U00096 (Blattner et al., 1997)<br />

Escherichia coli O157:H7 EDL933 AE005174 (Perna et al., 2001)<br />

Escherichia coli O157:H7 str. Sakai BA000007 (Hayashi et al., 2001)<br />

Escherichia coli SECEC SMS-3-5 ABAQ00000000 (unpublished)<br />

Escherichia coli UTI89 CP000243 (Chen et al., 2006)<br />

Escherichia coli W3110 AP009048 (Hayashi et al., 2006)<br />

Shigella boydii CDC 3083-94 AAKA00000000 (unpublished)<br />

Shigella boydii Sb227 CP000036 (Yang et al., 2005)<br />

Shigella dysenteriae 1012 AAMJ00000000 (unpublished)<br />

Shigella dysenteriae Sd197 CP000034 (Yang et al., 2005)<br />

Shigella flexneri 2a str. 2457T AE014073 (Liao et al., 2003)<br />

Shigella flexneri 2a str. 301 AE005674 (J<strong>in</strong> et al., 2002)<br />

Shigella sonnei Ss046 CP000038 (Yang et al., 2005)<br />

Table 3.1: Escherichia coli and Shigella genomes available at the time of the work (October 2007).<br />

[Figure 3.6 panels: $R_i$ plotted against position relative to the 16S gene start (−500 to 0). Panel titles: "P1: Raw combined scores, −10, −35, UP (E. coli) (N=63)" (a), "P1: Adjusted combined scores, −10, −35, UP (E. coli) (N=63)" (b), "P2: Raw combined scores, −10, −35, UP (E. coli) (N=63)" (c), and "P2: Adjusted combined scores, −10, −35, UP (E. coli) (N=63)" (d).]<br />

Figure 3.6: Profiles showing the maximum $R_i(tot)$ scores of the initial weight matrices applied to E. coli and Shigella: unadjusted P1 scores (a), adjusted P1 scores (b), unadjusted P2 scores (c), and adjusted P2 scores (d).<br />

3.3.2 Iterat<strong>in</strong>g weight matrix frequencies<br />

The program iscan was developed to query a given DNA sequence and, for every position in this sequence, calculate the maximum $R_i(tot)$ that can be obtained by trying different spacing configurations within a specified window. The iscan algorithm aligns the first matrix with the query (in this case the –10 hexamer) and tries all distances between 13 and 19 nucleotides towards the –35 hexamer, using 16 nucleotides as the center. The program then locks the optimal distance and continues with the next box (in this case the UP element) until all elements have been included. For source code, see appendix D.5. The spacing configuration of the two models is shown in figure ??.<br />
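The greedy search described above can be sketched as follows. This is not the iscan source code (see appendix D.5 for that): the function names, the toy consensus matrices with +2/−2 weights, and the definition of the distance as the number of nucleotides between two neighbouring boxes are all assumptions of this example.<br />

```python
import math

def score(matrix, seq, start):
    """Sum the per-position weights; None if the window falls outside the sequence."""
    if start < 0 or start + len(matrix) > len(seq):
        return None
    return sum(col.get(seq[start + p], float("-inf")) for p, col in enumerate(matrix))

def gap_surprisal(d, c, d_min, d_max, w=10.6):
    """GS(d) = log2(n(d) / N), with n(d) = 1 + cos(2*pi/w * (d - c))."""
    n = lambda x: 1.0 + math.cos(2.0 * math.pi * (x - c) / w)
    return math.log2(n(d) / sum(n(x) for x in range(d_min, d_max + 1)))

def greedy_scan(seq, pos, anchor, upstream):
    """Place `anchor` at `pos`, then add each upstream element in turn.

    `upstream` is a list of (matrix, d_min, center, d_max), nearest element first.
    For each element the best distance is chosen and locked before moving on.
    Returns (total score, chosen distances), or None if an element does not fit.
    """
    total = score(anchor, seq, pos)
    if total is None:
        return None
    left_edge, chosen = pos, []
    for matrix, d_min, center, d_max in upstream:
        best = None
        for d in range(d_min, d_max + 1):
            start = left_edge - d - len(matrix)
            s = score(matrix, seq, start)
            if s is None:
                continue
            candidate = s + gap_surprisal(d, center, d_min, d_max)
            if best is None or candidate > best[0]:
                best = (candidate, d, start)
        if best is None:
            return None
        total += best[0]
        chosen.append(best[1])
        left_edge = best[2]          # lock this element before adding the next one
    return total, chosen

def consensus_matrix(consensus):
    """Toy matrix for the example only: +2 for the consensus base, -2 otherwise."""
    return [{b: (2.0 if b == x else -2.0) for b in "ATGC"} for x in consensus]

seq = "GG" + "TTGACA" + "ACGTACGTACGTACGTA" + "TATAAT" + "GG"
minus10, minus35 = consensus_matrix("TATAAT"), consensus_matrix("TTGACA")
print(greedy_scan(seq, 25, minus10, [(minus35, 13, 16, 19)]))
```

A full scan would call `greedy_scan` for every position of the query and record the score at each one. Because each distance is locked before the next element is added, the search is greedy and is not guaranteed to find the globally best combination of spacings.<br />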

The maximum $R_i(tot)$ values of all operons were stacked, and the average and standard deviation were plotted as a function of position. Because the distance between P1/P2 and the 16S gene varies slightly, the unadjusted plots appear noisy. Shifting each profile slightly, so that the local maxima around P1 and P2 are aligned, renders the P1 and P2 model scores sharper (see figure 3.6).<br />
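The adjustment step can be pictured with a short sketch. The function names, the search radius, and the use of the population standard deviation are assumptions for illustration; the thesis does not state these details.<br />

```python
from statistics import mean, pstdev

def align_to_peak(profile, expected, radius):
    """Return the offset that moves the local maximum near `expected` onto `expected`."""
    lo = max(0, expected - radius)
    hi = min(len(profile), expected + radius + 1)
    peak = max(range(lo, hi), key=lambda i: profile[i])
    return expected - peak

def shift(profile, offset):
    """Shift a profile by `offset` positions, padding with None."""
    out = [None] * len(profile)
    for i, value in enumerate(profile):
        j = i + offset
        if 0 <= j < len(profile):
            out[j] = value
    return out

def stack(profiles):
    """Per-position (mean, standard deviation), ignoring the None padding."""
    stats = []
    for column in zip(*profiles):
        values = [v for v in column if v is not None]
        stats.append((mean(values), pstdev(values)) if values else (None, None))
    return stats

# Two toy score profiles whose peaks are one position apart.
raw = [[0, 0, 5, 0, 0, 0], [0, 0, 0, 5, 0, 0]]
adjusted = [shift(p, align_to_peak(p, expected=2, radius=2)) for p in raw]
print(stack(raw)[2], stack(adjusted)[2])
```

In the toy example the unadjusted stack has a lower mean and a non-zero spread at the expected peak position, whereas the adjusted stack recovers the full peak height with no spread.<br />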

3.3.3 Ref<strong>in</strong><strong>in</strong>g E. coli <strong>and</strong> Shigella models<br />

All peaks of $R_i(tot)$ around the regions of P1 and P2 were collected, and the P1 and P2 models were refined by adjusting the matrix parameters according to the base frequencies observed in the hits obtained. The resulting logo plots are shown in figure 3.7.<br />

[Figure 3.7 panels: seven sequence logos, each plotting information content in bits (0.0 to 2.0) against position in the motif.]<br />

Figure 3.7: Logos showing the base composition of P1 and P2 of E. coli genomes, as identified by the initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g).<br />


[Figure 3.8 plot: "U00096: SIDD measure, free energy". Z-score (−0.8 to 0.0) against distance from translation start (−400 to 400), with one curve per superhelix density: s = −0.025, −0.035, −0.045 and −0.055.]<br />

Figure 3.8: Average profiles of SIDD energy calculated at four different superhelix densities: -0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation start.<br />

3.4 DNA melt<strong>in</strong>g <strong>and</strong> SIDD energy<br />

An algorithm developed by Benham and co-workers (Wang & Benham, 2008; Wang et al., 2004) estimates the SIDD energy, which is the free energy required to open the DNA helix under different superhelix densities. When the SIDD energy is examined 400 nucleotides on each side of the translation start of all coding sequences in E. coli K12 (accession U00096), a clear drop in the energy requirement is visible. The drop originates from the transcription start rather than the translation start, which explains the broad appearance of the curve. Figure 3.8 plots the SIDD energy values at different superhelix densities (-0.025, -0.035, -0.045, and -0.055). The graph shows z-scores, which indicate how the average SIDD energy at a given relative position compares with the average and standard deviation of the entire chromosome. A z-score below zero corresponds to a SIDD energy lower than the chromosomal average, i.e. to DNA that melts more easily.<br />
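One way to read this z-score is sketched below: the SIDD energy is averaged over all genes at each relative position, and the result is expressed in units of the chromosome-wide standard deviation. The exact normalization used in the thesis is not spelled out, so this formula, the function name, and the restriction to forward-strand genes are assumptions.<br />

```python
from statistics import mean, pstdev

def zscore_profile(energy, starts, flank=400):
    """Z-score of the mean energy at each offset from the gene starts.

    energy: per-nucleotide SIDD energies for the whole chromosome.
    starts: 0-based translation start positions (forward strand only in this sketch).
    """
    mu, sigma = mean(energy), pstdev(energy)   # chromosome-wide mean and standard deviation
    profile = []
    for offset in range(-flank, flank + 1):
        values = [energy[s + offset] for s in starts if 0 <= s + offset < len(energy)]
        profile.append((mean(values) - mu) / sigma if values else None)
    return profile

# Toy chromosome with an energy dip at each of two gene starts.
energy = [10, 10, 4, 10, 10, 10, 4, 10, 10, 10]
print(zscore_profile(energy, starts=[2, 6], flank=1))
```

Genes on the reverse strand would need their offsets mirrored before averaging, which is left out here for brevity.<br />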

3.4.1 codesearch: Mapping numerical data to genome annotations<br />

The codesearch tool was written to enable searches for various annotation patterns in a genome and to map numerical data relative to these annotations. The tool requires a pregenerated code file which condenses all annotations of the genome into a single string, with one character per nucleotide position (see table 3.2). The user provides a regular expression, which is searched for in the pregenerated code file.<br />

A list of numerical data pertaining to the individual nucleotides of the genome can then be included. When such data are defined, codesearch extracts the numerical values corresponding to the regions matching the pattern. The output of codesearch is divided into two tab-separated columns: the first column contains the genomic region where the pattern matched, and the other column contains either the sequence as a string (when a sequence file is supplied with -seq) or the extracted numerical values (when a data file is supplied with -dat), as in the two example runs of the listing below.<br />

Code Meaning Example<br />

C Coding CCCCCCCCCCCCC<br />

> Annotation start on forward strand .....>CCCC...<br />

< Annotation start on reverse strand ...CCCCTTT.....<br />

t 5S rRNA ..tttssss......<br />

l 23S rRNA ...lllll<br />

>codesearch -cod U00096.cod.gz -seq U00096.fsa -pat '(.{5,5}>s{1,1})'<br />

223773..223779 AAATTGA<br />

3939833..3939839 AAATTGA<br />

4033556..4033562 AAATTGA<br />

4164684..4164690 AAATTGA<br />

4206172..4206178 AAATTGA<br />

3426782..3426776 ATTGAAG<br />

2729177..2729171 ATTGAAG<br />

>codesearch -cod U00096.cod.gz -dat U00096.sidd35.gz:1,4 -pat '(.{5,5}>s{1,1})' \<br />

-format '%0.2f' | tab2tbl --window='-5,2' -org 'E. coli K12' -col blue<br />

def org col -5 -4 -3 -2 -1 1 2<br />

223773..223779 E. coli K12 blue 7.93 7.93 7.94 8.00 8.26 8.28 8.37<br />

3939833..3939839 E. coli K12 blue 7.91 7.90 7.92 7.99 8.25 8.28 8.36<br />

4033556..4033562 E. coli K12 blue 7.83 7.83 7.85 7.92 8.19 8.22 8.32<br />

4164684..4164690 E. coli K12 blue 7.85 7.85 7.87 7.95 8.21 8.25 8.34<br />

4206172..4206178 E. coli K12 blue 7.91 7.91 7.92 7.99 8.26 8.28 8.37<br />

3426782..3426776 E. coli K12 blue 7.91 7.93 7.99 8.26 8.28 8.37 8.73<br />

2729177..2729171 E. coli K12 blue 7.91 7.93 7.99 8.26 8.28 8.37 8.72<br />
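The core idea of codesearch, a regular expression applied to a one-character-per-nucleotide annotation string, can be sketched in a few lines of Python. This is an illustration and not the codesearch implementation: the function name and the toy code string, sequence and data are hypothetical, and the handling of reverse-strand hits (reported with descending coordinates in the listing) is omitted.<br />

```python
import re

def code_search(code, pattern, seq=None, data=None):
    """Yield (region, payload) for every match of `pattern` in the code string.

    region uses 1-based inclusive coordinates, as in the listing.
    payload is the matching slice of `data` if given, else of `seq`, else None.
    """
    for match in re.finditer(pattern, code):
        start, end = match.start(), match.end()
        region = f"{start + 1}..{end}"
        if data is not None:
            payload = data[start:end]
        elif seq is not None:
            payload = seq[start:end]
        else:
            payload = None
        yield region, payload

# Toy genome: five unannotated positions, an annotation start, then a feature coded 's'.
code = "....." + ">" + "sssss" + "...."
seq = "AAATTGAAGTTTGAT"
values = [float(i) for i in range(len(code))]

pattern = r"(.{5,5}>s{1,1})"      # same pattern as in the listing
print(list(code_search(code, pattern, seq=seq)))
print(list(code_search(code, pattern, data=values)))
```

Because the code string, the sequence and the numerical data share the same coordinate system, a match in the code string can be used directly as a slice into either of the other two.<br />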

Us<strong>in</strong>g heatmap to generate energy l<strong>and</strong>scape<br />

The R function heatmap, described in chapter 2, was used to compare the SIDD profiles with the profiles of the P1/P2 model scores. All promoter sequences were aligned first according to the peak score of the P1 model (near the expected site of P1) and second according to the peak score of the P2 model (near the expected site of P2). In figure 3.9 the model scores are visualized with the heatmap function in the green heatmaps on the left, whereas the heatmaps on the right contain the SIDD energies (blue) of the aligned promoter sequences. This analysis shows that a deep drop in the SIDD energy occurs near the P1 site for approximately half of the promoter sequences.<br />

[Figure 3.9 panels: four heatmaps with promoter sequences as columns and positions from −500 to about +50 relative to the 16S rRNA +1 as rows, with the –10 boxes of P1 and P2 marked. Color scales: model score, −22 to 34 bits; SIDD energy, 5.8 to 10.0 kcal/mol. Gaps are appended to each promoter region to adjust to the maxima of the P1/P2 model scores.]<br />

Figure 3.9: E. coli and Shigella rrnB energy landscape visualized using the heatmap function. Each vertical column corresponds to a promoter sequence, whereas the horizontal rows represent average values over 10 bp within each sequence. Coordinates labeled on the horizontal rows are relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps show P2. The heatmaps on the left show P1/P2 model scores in green, whereas the heatmaps on the right show the SIDD energy in blue.<br />


3.5 The genomic context: visualiz<strong>in</strong>g operons <strong>and</strong> DNA<br />

properties<br />

During the thesis work, this author was involved in the development of a next-generation genome browser to replace the older GeneWiz software developed at CBS (Pedersen et al., 2000; Jensen et al., 1999). The old GeneWiz is still used by the BLASTatlas service to generate the static atlas graphic. The goal of the new version was to create an interactive and platform-independent program that allows the user to zoom from a global genomic scale down to the nucleotide level. The basic principle of transforming numerical data into a color-coded representation remained identical to the GeneWiz method. However, the old GeneWiz software required several minutes to regenerate a plot, and the challenge was to provide a data flow efficient enough to allow this regeneration in fractions of a second. Eva Rotenberg and Hans Henrik Stærfeldt from CBS authored the gwBrowser Java code which handles the plotting, whereas this author was responsible for the server-side software.<br />

For fast visualization to be possible, all numerical data to be plotted must be pre-binned and accessible at all zoom levels. A system was first established which stored these pre-binned data for a number of genomes in a MySQL database. The first solution involved a single large table, with fields corresponding to genome id, position, zoom level, field, and value. It quickly proved unfeasible. Since storing all zoom levels for a genome of length N requires 2×N records, a rough estimate shows that 1,000 genomes of 3 Mb with 20 different DNA properties (fields) require 2 × 3,000,000 × 1,000 × 20 = 120 billion database records. Maintaining such large search indexes and preventing table locks during updates made this solution impossible. A second solution, in which each genome was given its own table, solved many of the speed issues but still did not perform satisfactorily.<br />

Instead, data are stored in binary files, one file per genome and zoom level. All values are written as fixed-width data, and by memory mapping the files the server can quickly obtain the data within a file knowing only the coordinates of the window. The listing below shows how the client retrieves data for the genome id AL111168GENOMEatlas, from position 1 to 37,473 bp, at zoom level 5. Figure 3.10 shows the workflow of the gwBrowser software. For further details on this tool, please refer to paper VII. The software is now available via http://www.cbs.dtu.dk/services/gwBrowser.<br />

set server = http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi<br />

curl $server"?d=AL111168GENOMEatlas&m=d&f=dnap0&b=1&e=37473&l=5&z=false"<br />
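The storage scheme of fixed-width values read through a memory map can be illustrated as follows. This is a sketch under stated assumptions and not the gwBrowser server code: the 4-byte little-endian float record, the file name, and the rule that a bin at zoom level l covers 2^l nucleotides are all hypothetical. The halving rule is chosen because it is consistent with the 2×N record estimate above (N + N/2 + N/4 + ... < 2N).<br />

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<f")   # assumption: one 4-byte little-endian float per bin

def bin_levels(values):
    """Level 0 holds one value per nucleotide; each further level averages pairs of bins."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) / len(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels

def write_level(path, values):
    """Write one zoom level as consecutive fixed-width records."""
    with open(path, "wb") as fh:
        for value in values:
            fh.write(RECORD.pack(value))

def read_window(path, begin, end, level):
    """Return the bins covering 1-based positions begin..end at the given zoom level."""
    first = (begin - 1) >> level
    last = (end - 1) >> level
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Fixed-width records let the byte offsets be computed directly from the coordinates.
        chunk = mm[first * RECORD.size:(last + 1) * RECORD.size]
    return [record[0] for record in RECORD.iter_unpack(chunk)]

levels = bin_levels([1, 2, 3, 4, 5, 6, 7, 8])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.level1.bin")   # hypothetical file name
    write_level(path, levels[1])
    print(read_window(path, 3, 6, level=1))
```

Because every record has the same width, no index is needed: the window coordinates translate directly into a byte range, which is what makes the lookup fast compared with a database query.<br />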

3.6 Visualiz<strong>in</strong>g sequenc<strong>in</strong>g quality us<strong>in</strong>g gwBrowser<br />

Modern high-throughput sequencing techniques currently lack sufficient read length to span many repetitive elements of genomes, especially the rRNA genes mentioned above. To assess how well a given set of reads can close a genome sequence, a method was developed which accounts for both the quality scores and the uniqueness of the reads. The concept of the method is to map the qualities of all reads back to a reference genome and to weight the qualities according to the uniqueness of the reads. Reads that have multiple hits throughout the genome contribute little, whereas reads that are specific to a single location contribute fully. Figure 3.11 shows the principle of the method, which was integrated into the gwBrowser software.<br />

[Figure 3.10 diagram, divided into a client side and a server side. Labelled elements: configure and submit atlas (reference genome, annotations, sequencing reads, query genomes, custom numerical data); wait for processing; main server; data binning of zoom levels; binned data; browser server; browser applet; editing of atlas layout; atlas layout (XML); XML configuration; request (atlas ID, zoom level, window, field name, ...); returned data.]<br />

Figure 3.10: Principle workflow of gwBrowser data exchange.<br />

[Figure 3.11 diagram: each read is aligned to the genome, giving hits H1 ... Hr with scores S1 ... Sr. The quality scores q_r(i) of the read are mapped to the genome and weighted, giving q'_r(i). For every position in the genome, the maximum uniqueness value derived from the mapped reads is obtained. Finally, all maximum values are plotted on the reference genome using the GeneWiz browser; the marked band in the example shows a region with low uniqueness.]<br />

Figure 3.11: Mapping qualities of sequencing reads to a reference genome while accounting for the uniqueness of the read.<br />
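The principle of figure 3.11 can be sketched as below. The text does not give the weighting formula, so the weight used here (the score of a hit divided by the summed scores of all hits of the same read) is purely an assumption for illustration, as are the function name and the input layout.<br />

```python
def weighted_quality(genome_length, reads):
    """Return, per genome position, the maximum uniqueness-weighted read quality.

    reads: list of dicts with
      "quals": per-base quality scores of the read,
      "hits":  list of (start, score) alignments, 0-based, forward strand only.
    Assumed weight of a hit: its score divided by the sum of scores of all hits of the read.
    """
    track = [0.0] * genome_length
    for read in reads:
        total = sum(score for _, score in read["hits"])
        if total <= 0:
            continue
        for start, score in read["hits"]:
            weight = score / total       # 1.0 for a unique read, smaller for repeats
            for offset, quality in enumerate(read["quals"]):
                i = start + offset
                if 0 <= i < genome_length:
                    track[i] = max(track[i], quality * weight)
    return track

reads = [
    {"quals": [30, 30], "hits": [(0, 50.0)]},              # unique read
    {"quals": [40, 40], "hits": [(4, 50.0), (8, 50.0)]},   # read matching two locations
]
print(weighted_quality(10, reads))
```

In the toy example the repeated read contributes only half of its quality at each of its two locations, so repetitive regions such as rRNA operons show up as bands of low values even when the raw read qualities are high.<br />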

118


rRNA operons and promoter analysis<br />

[Figure 3.12 labels: promoters P1 and P2, each with -10, -35 and UP elements; three FIS labels; rRNA operons rrnA, rrnB, rrnC, rrnD, rrnE, rrnG and rrnH of E. coli K12 MG1655; lanes: SIDD (s: -0.055, -0.045, -0.035), Annotations (CDS+, CDS-, rRNA, tRNA), Intrinsic Curvature, Stacking Energy, Position Preference, GC Skew, Percent AT]<br />

Figure 3.12: A zoom of the P1/P2 tandem promoter system upstream of the rrnB operon of E. coli K12.<br />

3.6.1 Visualizing the P1 and P2 structure using gwBrowser<br />

The gwBrowser tool allows the user to append various types of annotations, such as TSS marks, boxes and arrows, once the binning step has finished. This makes it possible to visualize promoter structures like the P1/P2 system and to integrate them with various DNA properties. The gwBrowser tool was applied to the E. coli rrnB promoter system to correlate the annotated regulatory elements with the SIDD energy (Wang et al., 2004; Wang & Benham, 2008) (see figure 3.12).<br />

The plot in figure 3.12 shows a drop in free energy upstream of P1 and P2, which from an energetic viewpoint explains the high transcription rate. The transcription factor FIS stimulates transcription at several promoters; for example, the binding of FIS at the leuV promoter (Ross et al., 1999) has been suggested to transmit the superhelical destabilization downstream to the point where the RNAP twists and opens the helix (Wang et al., 2004). This model may also be valid for the rrnB P1 promoter, as the activities of leuV and rrnB P1 are comparable (Bauer et al., 1988).<br />

3.7 Summary<br />

Ribosomal RNA genes play an important role in the cell and can be highly transcribed: often more than 90% of the total transcripts in rapidly growing bacterial cells are from rRNA genes. rRNA genes are also important in determining taxonomy. However, correctly finding the start and stop positions of the rRNA genes is difficult to do with BLAST searches; we have therefore developed RNAmmer to find the rRNA genes. Once the genes are mapped, further studies, such as promoter profiling, can be done. The gwBrowser allows one to zoom in on particular areas of the chromosome and, in the case of rRNA promoters, to map important structural properties of the DNA in the promoter region.<br />


3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences<br />



3100–3108 Nucleic Acids Research, 2007, Vol. 35, No. 9. Published online 22 April 2007<br />

doi:10.1093/nar/gkm160<br />

RNAmmer: consistent and rapid annotation of ribosomal RNA genes<br />

Karin Lagesen 1,2,*, Peter Hallin 3, Einar Andreas Rødland 1,2,4,5, Hans-Henrik Stærfeldt 3, Torbjørn Rognes 1,2,4 and David W. Ussery 1,2,3<br />

1 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo, NO-0027 Oslo, Norway, 2 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, Rikshospitalet-Radiumhospitalet Medical Centre, NO-0027 Oslo, Norway, 3 Center for Biological Sequence Analysis, Biocentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark, 4 Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway and 5 Norwegian Computing Center, PO Box 114 Blindern, NO-0314 Oslo, Norway<br />

Received December 1, 2006; Revised and Accepted March 2, 2007<br />

ABSTRACT<br />

The publication of a complete genome sequence is usually accompanied by annotations of its genes. In contrast to protein coding genes, genes for ribosomal RNA (rRNA) are often poorly or inconsistently annotated. This makes comparative studies based on rRNA genes difficult. We have therefore created computational predictors for the major rRNA species from all kingdoms of life and compiled them into a program called RNAmmer. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy. Novel, unannotated rRNAs are also predicted in many genomes. The software as well as the genome analysis results are available at the CBS web server.<br />

INTRODUCTION<br />

Ribosomes are the molecular machines which form the connection between nucleic acids and proteins in all living organisms. The ribosome’s dependence on ribosomal RNAs (rRNAs) for its function has caused them to be conserved at both the sequence and the structure level. Because of this, rRNAs are often used in comparative studies such as phylogenetic inference. Comparative studies have become more popular as more genomes have been completely sequenced, but can potentially become complicated when some of the genes they are based on are poorly annotated or not annotated at all. Unfortunately, this is often a problem with rRNAs as genome annotation pipelines usually do not include tools specific for rRNA detection. Instead, rRNAs are often located by sequence similarity searches such as BLAST. Although such searches may give reasonable answers due to the high level of sequence conservation in the core regions of the genes, using such results for annotation purposes can be problematic. The validity of the search results depends on the program and database used. Changing one or both of these can drastically change the results. Genomic databases have grown exponentially over the past two decades and search programs have as a consequence had to undergo constant revisions in order to meet the requirements of the research community. Thus, the results of a search done today are probably very different from those produced several years ago. An added complication is that the most commonly used database search methods have poor performance for noncoding RNAs. A recent study comparing several different methods for predicting noncoding RNAs, including rRNAs, found that the most commonly used methods gave the most inaccurate results (1).<br />

*To whom correspondence should be addressed. Tel: +4722844786; Email: karin.lagesen@medisin.uio.no<br />

Through our work on the GenomeAtlas database (2), we have seen the results of poor annotation of rRNAs. Some genomes do not have any rRNAs annotated at all, whereas other genomes seem to have rRNAs annotated on the wrong strand. We initially tried to do systematic BLAST (3) searches, but it proved difficult to maintain consistency throughout this process. The high level of sequence conservation among the rRNAs enabled us to create hidden Markov models (HMMs) from structural alignments. Such models are more capable of capturing the sequence variation that is inherently present in the rRNA gene families than simple BLAST searches. Using HMMs also simplifies the use of common criteria for prediction assessment. A library of HMMs was constructed and the program RNAmmer was developed to make use of this library. RNAmmer is available through the CBS web site, as a web service or as a stand-alone package. It has been tested on all published genomes and gives accurate predictions of rRNAs. The program also has the added benefit of producing results that are comparable between genomes.<br />

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.<br />

Our work has focused on three of the major rRNA species. The ribosome consists of two subunits, the small and the large subunit, which pair up to form the functional ribosome. The rRNAs present in prokaryotes are the 5S and 23S in the large subunit, and the 16S in the small subunit. In eukaryotes, 5S, 5.8S and 28S rRNA exist in the large subunit, and 18S rRNA in the small subunit. The 5.8S is not considered in this work. There are substantial sequence and secondary structure similarities between eukaryotic and prokaryotic rRNAs; however, the eukaryotic rRNAs commonly have longer stems and larger loops than those of the prokaryotes. The subunits are composed of both RNAs and proteins. Since their discovery in the early 1950s, it has been debated whether ribosomal function should be credited to the rRNAs or the proteins. Recent crystal studies have revealed that protein synthesis to a large extent is dependent on the rRNAs (4–7) and this has most likely been instrumental for their high level of conservation.<br />

In prokaryotes, the 16S, 23S and 5S rRNAs are commonly transcribed together, while the 18S, 28S and 5.8S rRNAs form a transcriptional unit in eukaryotes. Eukaryotic 5S rRNAs commonly appear in highly duplicated tandem repeats (8). In most organisms, there are several copies of the rRNA transcription unit, and although as much as 11% sequence divergence has been observed between units within the same genome, the difference is usually less than 1% (9). In several cases, segments are also edited out of the transcribed rRNA. These segments may be introns that after splicing leave a continuous rRNA, or they can be intervening sequences (IVS) that leave a fragmented rRNA which is still functional within the ribosome structure (10). Introns are most prevalent in eukaryotes and archaea, while intervening sequences have been seen in eukaryotes and bacteria. Introns are predominantly found within conserved sequences close to tRNA and mRNA-binding sites (10), whereas intervening sequences are ordinarily seen in hypervariable regions (11).<br />

METHODS AND MATERIALS<br />

Using HMMs to find new members of a sequence family requires reliable multiple alignments. The 16S/18S and 23S/28S rRNA alignments were retrieved from the European ribosomal RNA database (ERRD) (12). In this database, annotated large and small subunit ribosomal RNA sequences from the EMBL nucleotide database with a length of at least 70% of their estimated full length have been aligned. Multiple alignments of 5S rRNAs were retrieved from the 5S Ribosomal RNA Database (13). Data from both databases were downloaded on October 27, 2005. The alignments are all structural alignments, i.e. aligned using secondary structure information gained from comparative sequence analysis. The 5S alignments were already divided into separate alignments for archaeal, bacterial and eukaryotic sequences, whereas the ERRD data were not. The alignments for 16/18S and 23/28S rRNAs were divided into the same groups as the 5S data to provide kingdom-specific predictors. The data were stored in a MySQL database for easier handling.<br />

The ERRD data contained sequences from ‘environmental samples’. These were excluded since there was little information about them. The 5S were generally around 120 nt long, the 16/18S around 1500 nt and the 23/28S around 3000 nt long, all with no obvious outliers. The length of the eukaryotic rRNAs varied substantially, more than those of bacterial and archaeal rRNAs, but no sequences in the alignments seemed obviously wrong.<br />

The sequences were divided into phylogenetic groups to help with further analysis. Due to sequencing bias, some phylogenetic groups dominated the data sets. Such a skew could potentially cause the predictors to be less sensitive on underrepresented phylogenetic groups. Among the bacteria, 82% of the sequences were from three phyla: Actinobacteria, Firmicutes and Proteobacteria. Around 70% of the archaeal sequences were from Euryarchaeota; among the eukaryotes, the Streptophyta comprised 15% of the data. Several of the sequences also proved to be very similar. Therefore, redundancy reduction inspired by Hobohm's second algorithm (14) was performed. This algorithm starts with a sorted list of the number of neighbors each sequence has. An all-against-all comparison between the sequences is performed and neighborship is judged by the level of similarity found. Similarity was measured by<br />

$$\mathrm{Score} = \frac{\sum_{i,j} n_{ij} S_{ij}}{N - g}$$

where $i$ and $j$ sum over the four nucleotides, $n_{ij}$ counts the number of aligned nucleotide pairs $(i, j)$, $N$ is the length of the sequence and $g$ is the number of gap-only positions; $S_{ij}$ refers to the scoring matrix EDNAFULL created by Todd Lowe. The maximum similarity level allowed was set to ensure that each phylum was represented. Similarity graphs were formed for each group, with the sequences as vertices and edges between similar sequences. The sequence with the highest connectivity and its edges were deleted from the graph, and this was repeated until no edges remained. At the end, all removed sequences were checked to see if they had any edges to vertices in the remaining set. If not, they were reinstated. This procedure was implemented as a C program.<br />
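The graph pruning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C program: the function name `reduce_redundancy`, the alphabetical tie-breaking and the choice to reinstate removed sequences one at a time (so that a reinstated sequence counts as part of the remaining set for later checks) are all hypothetical. The similarity decisions are assumed to be precomputed and passed in as pairs.<br />

```python
from collections import defaultdict


def reduce_redundancy(names, similar_pairs):
    """Greedy graph pruning in the spirit of Hobohm's second algorithm.

    names: sequence identifiers (the vertices).
    similar_pairs: (a, b) pairs judged too similar (the edges).
    Returns the set of identifiers that are kept.
    """
    neighbours = defaultdict(set)
    for a, b in similar_pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Working copy of the graph; it shrinks as vertices are deleted.
    live = {n: set(neighbours[n]) for n in names}
    removed = []
    while True:
        # Most connected remaining vertex; ties are broken alphabetically
        # (an arbitrary choice made for this sketch).
        worst = max(sorted(live), key=lambda n: len(live[n]), default=None)
        if worst is None or not live[worst]:
            break  # no edges remain
        for other in live.pop(worst):
            live[other].discard(worst)
        removed.append(worst)

    kept = set(live)
    # Reinstate removed sequences that have no edge to anything kept.
    for n in removed:
        if not (neighbours[n] & kept):
            kept.add(n)
    return kept
```

With the hypothetical edges a-b, a-c, b-c and c-d plus an isolated sequence e, the most connected vertex c is removed first, then a, leaving b, d and e.<br />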

Sequences in ERRD may contain ambiguous nucleotide symbols representing nucleotides that have not been uniquely determined. These occur more frequently in bacteria and eukaryotes than in archaea, and primarily at both ends of the alignment: in 16/18S, predominantly at the end; in 23/28S, predominantly at the beginning. In the latter case, this was mostly due to the high prevalence of gaps at the end of the alignment. As we found that ambiguous nucleotides at the ends reduced the ability to predict start and stop positions accurately, we decided to remove all sequences with five or more ambiguous nucleotides in either end of the sequence. A summary of the number of sequences removed during curation of the alignments is shown in Table 1.<br />

Table 1. The initial number of rRNA sequences and the number of sequences excluded for different reasons.<br />

Kingdom | Type | Initial count | Environmental samples | Incomplete sequences | Redundancy reduction | Total in HMM<br />
Archaea | 5S | 58 | 0 | 0 | 10 | 48<br />
Archaea | 16S | 589 | 239 | 471 | 287 | 76<br />
Archaea | 23S | 37 | 0 | 18 | 8 | 15<br />
Bacteria | 5S | 461 | 0 | 0 | 101 | 360<br />
Bacteria | 16S | 12 107 | 1429 | 10 723 | 2485 | 743<br />
Bacteria | 23S | 398 | 0 | 155 | 130 | 127<br />
Eukaryotes | 5S | 316 | 0 | 0 | 33 | 283<br />
Eukaryotes | 18S | 6585 | 24 | 5222 | 836 | 979<br />
Eukaryotes | 28S | 157 | 0 | 91 | 8 | 58<br />

Environmental samples were excluded due to lack of phylogenetic information. Sequences with too many unknown nucleotides in either end of the sequence were excluded to improve HMM accuracy. Redundancy reduction was performed to reduce bias. Note that these groups may overlap. The last column indicates the number of sequences used to build each HMM.<br />
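The rule of removing sequences with five or more ambiguous nucleotides in either end can be sketched as follows. The text does not say how long an "end" is, so the `window` of 50 residues is purely a hypothetical choice, as are the function names; gaps are stripped first so that the ends refer to the sequence rather than to the alignment row.<br />

```python
UNAMBIGUOUS = set("ACGTU")


def ambiguous_in_ends(aligned_seq, window=50):
    """Count ambiguous symbols in the first and last `window` residues.

    Gap characters ('-' and '.') are dropped before counting. The window
    size is an assumption of this sketch, not a value from the paper.
    """
    residues = [c for c in aligned_seq.upper() if c not in "-."]
    is_ambiguous = [c not in UNAMBIGUOUS for c in residues]
    return sum(is_ambiguous[:window]), sum(is_ambiguous[-window:])


def keep_sequence(aligned_seq, max_ambiguous=4, window=50):
    """True if neither end has five or more ambiguous nucleotides."""
    head, tail = ambiguous_in_ends(aligned_seq, window)
    return head <= max_ambiguous and tail <= max_ambiguous
```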

The software package HMMer (15) version 2.3.2 was used to create HMMs from alignments where all columns containing only gaps had been removed. It was configured for nucleotides, and to compensate for skews in the nucleotide distribution a custom null model for each alignment was used. Although redundancy reduction had been performed, the Henikoff position-based weighting scheme (16) was used to reduce any remaining biases. When using the HMMs to search genome sequences, the default alignment method was used: a match must span the entire model, and several matches may be found within one sequence.<br />
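To illustrate what position-based weighting does, here is a minimal sketch of the Henikoff scheme as it is commonly described. It is not HMMer's implementation; the function name `henikoff_weights` is hypothetical, and treating a gap as an ordinary symbol is a simplification made for this example.<br />

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for equal-length aligned rows.

    In each column a row receives 1 / (k * n), where k is the number of
    distinct symbols in the column and n is the number of rows sharing
    this row's symbol. Contributions are summed over all columns and
    normalised so that the weights sum to 1.
    """
    n_cols = len(alignment[0])
    weights = [0.0] * len(alignment)
    for col in range(n_cols):
        column = [row[col] for row in alignment]
        counts = Counter(column)
        k = len(counts)
        for idx, symbol in enumerate(column):
            weights[idx] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]
```

Two identical rows and one divergent row, for example `["AAAA", "AAAA", "CCCC"]`, receive weights 0.25, 0.25 and 0.5, so the duplicated sequence does not count twice.<br />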

With the aim of increasing the search speed, we determined the 75 most conserved consecutive columns in each alignment, as illustrated in Figure 1, and produced ‘spotter’ HMMs based on these. Since searches with the smaller spotter models would be considerably faster, we wanted to investigate the possibility of using the spotter to pre-screen for candidates, using the full HMMs only on regions surrounding the spotter hits. Spotter and full model searches were done separately. Spotter and full model predictions were matched based on whether they had overlapping nucleotides on the same strand. A linear regression was used to express spotter score in terms of full model score. Variation was estimated as linear in full model score with non-positive regression coefficients. Least squares estimates were used in both cases. Spotter scores were assumed to be missing when negative and, hence, assumed to follow a truncated normal distribution; expected scores and square deviations were used to replace missing values in the two regressions. From this model, we computed the lowest full model score, $T_{99}$, for which there was at least a 99% likelihood of getting a corresponding spotter hit, and the likelihood, $P_{\min}$, that a full model hit with the lowest found score should have a corresponding spotter hit.<br />
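The selection of the spotter region can be sketched as a sliding-window search over per-column conservation. The sketch assumes conservation is measured by the information content defined in the Figure 1 caption, with a uniform background of 1/4, and that the per-column nucleotide frequencies have already been computed; the function names are hypothetical.<br />

```python
import math


def information_content(column_freqs):
    """C = sum_i f_i * log2(f_i / q_i) with uniform background q_i = 1/4."""
    return sum(f * math.log2(f / 0.25) for f in column_freqs if f > 0)


def most_conserved_window(info, width=75):
    """Return (start, mean) of the contiguous window with the highest mean.

    info: per-column information content, in alignment order.
    start is a 0-based column index; the first best window wins ties.
    """
    if len(info) < width:
        raise ValueError("alignment is shorter than the window")
    window_sum = sum(info[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(info) - width + 1):
        # Slide by one column: add the entering value, drop the leaving one.
        window_sum += info[start + width - 1] - info[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / width
```

A fully conserved column scores 2 bits and a column with all four nucleotides at equal frequency scores 0, which matches the 0.0 to 2.0 range of the Figure 1 axes.<br />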

Both the full HMMs and the spotter HMMs were run on all fully sequenced genomes found in the Genome Atlas database (listed in Supplementary Table S1). All predictions with non-negative score and E-value at most 100 were reported. Only full model hits with E-value below 0.01 were accepted as reliable hits, but none with E-value between 0.01 and 100 were reported. As rRNAs within a genome tend to be very similar, usually with at least 99% identity, different full model hits within a genome corresponding to actual rRNAs should be expected to have similar scores. However, we found a substantial number of hits with far lower scores which we assume to be pseudogenes, truncated rRNAs or otherwise nonfunctional rRNA copies. To ensure that these did not have an adverse effect on the analyses, we excluded full model hits having a score less than 80% of the maximal score in that genome. These are listed in Supplementary Table S2.<br />
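The two acceptance criteria, an E-value below 0.01 and a score of at least 80% of the best score in the genome, can be sketched as a small filter. The dictionary layout of a hit and the function name are hypothetical. The sketch also assumes that the maximal score is taken per rRNA type within each genome, since scores of 5S and 23S hits are not on the same scale; the text itself only says "in that genome".<br />

```python
from collections import defaultdict


def filter_hits(hits, max_evalue=0.01, min_fraction=0.8):
    """Keep reliable hits scoring at least min_fraction of the group maximum.

    hits: list of dicts with keys 'genome', 'type', 'score', 'evalue'.
    Grouping by (genome, type) is an assumption of this sketch.
    """
    reliable = [h for h in hits
                if h["evalue"] < max_evalue and h["score"] >= 0]
    best = defaultdict(float)
    for h in reliable:
        key = (h["genome"], h["type"])
        best[key] = max(best[key], h["score"])
    return [h for h in reliable
            if h["score"] >= min_fraction * best[(h["genome"], h["type"])]]
```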

Annotations of rRNAs were obtained from GenBank. Unfortunately, rRNAs have not been annotated in a uniform manner and it was often unclear exactly what was annotated. In some cases, both the separate rRNAs and the full operon were annotated. In all such cases, the operons were longer than 5000 nt, and all annotations longer than that were thus excluded. In our experience, this affected only operons. In other cases, different pieces of the same gene had been annotated as separate entities. Thus, some predictions matched several annotation entries; these are listed in Supplementary Table S3. A prediction was considered to match an annotation if they were on the same strand and the length of their overlap was at least half the length of the shorter of the two; it was considered to be annotated if it matched at least one annotation. The deviation between annotated and predicted start and stop positions was also examined, but predictions with multiple matching annotations were excluded from this comparison.<br />
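The matching rule translates directly into interval arithmetic. The sketch below assumes inclusive, 1-based coordinates and represents each feature as a hypothetical `(start, end, strand)` tuple; the function names are illustrative only.<br />

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Length of the overlap of two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def matches(pred, annot):
    """Same strand, and overlap of at least half the shorter feature.

    pred and annot are (start, end, strand) tuples with start <= end.
    """
    if pred[2] != annot[2]:
        return False
    shorter = min(pred[1] - pred[0] + 1, annot[1] - annot[0] + 1)
    return 2 * overlap_length(pred[0], pred[1], annot[0], annot[1]) >= shorter


def usable_annotations(annotations, max_length=5000):
    """Drop annotations longer than 5000 nt, which were treated as operons."""
    return [a for a in annotations if a[1] - a[0] + 1 <= max_length]


def is_annotated(pred, annotations):
    """A prediction is annotated if it matches at least one annotation."""
    return any(matches(pred, a) for a in usable_annotations(annotations))
```

Comparing twice the overlap with the shorter length avoids a division and keeps the test exact for integer coordinates.<br />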

Additional analyses were performed for experimentally verified 16S in Anaplasma marginale St. Maries (M60313), Chlamydia muridarum Nigg (D85718), Escherichia coli K12 MG1655 (J01695), Sulfolobus tokodaii St. 7 (AB022438), Thermus thermophilus HB8 (X07998) and Nitrobacter hamburgensis X14 (L11663). Computational speed was assessed on M. capricolum ATCC 27343 (CP000123), Solibacter usitatus Ellin6076 (CP000473) and Sargasso Sea data (AACY01000001-AACY01811372). All test searches reported were performed on an SGI Altix 3000 machine using one 1.3 GHz Itanium 2 processor.<br />


RESULTS<br />

The predictions of the full HMM models have been compared first against annotations, then against the spotter models.<br />

Full model predictions versus annotation<br />

As Table 2 shows, the predictors appeared to be better at detecting bacterial rRNAs and less powerful for eukaryotic rRNAs. The highest accuracy was seen for the 16/18S rRNAs followed by the 23/28S. Two groups of rRNAs were particularly difficult to locate: the archaeal 5S and the eukaryotic 18S. The missing archaeal 5S were all from four euryarchaeotic genomes which are all anaerobic methane producers. The eukaryotic 18S that the predictors could not find were all from two genomes, Guillardia theta and Plasmodium falciparum.<br />

Closer evaluation revealed that several annotated rRNAs that lacked a matching prediction had actually been detected, but on the opposite strand. In eukaryotes, this was only seen with Arabidopsis thaliana 5S. In bacteria, most of the reverse predictions were 5S; in archaea, they were predominantly 16S and 23S. It should be noted that for all the reverse strand predictions the predicted start and stop positions agreed well with the annotation, indicating that they have been annotated on the wrong strand. Annotated rRNAs that lacked matching predictions in either direction are listed in Supplementary Table S4.<br />

Table 2 gives the number of predicted rRNAs that did not have a corresponding annotation: putative novel rRNAs. About 70% of them were 5S rRNAs, and only a few were archaeal. In bacteria, most of the novel rRNAs were found in Firmicutes and Gammaproteobacteria, although it should be noted that these two phyla are the two dominant groups and contain the bulk of the currently sequenced bacterial genomes. Among the eukaryotes, only A. thaliana had novel rRNAs. The scores of the new rRNA predictions did not significantly differ from those that were annotated, indicating that these are true rRNAs not yet annotated. The 5S is often omitted in the rRNA annotation; since the eukaryotic 5S is usually separated from the 18-28S sequence, they might be less visible to annotators.<br />

[Figure 1: nine panels plotting Information content (0.0 to 2.0) against Position in Alignment. A: 5S, Archaea (n = 48); B: 16S, Archaea (n = 76); C: 23S, Archaea (n = 15); D: 5S, Bacteria (n = 360); E: 16S, Bacteria (n = 743); F: 23S, Bacteria (n = 127); G: 5S, Eukaryotes (n = 283); H: 18S, Eukaryotes (n = 979); I: 28S, Eukaryotes (n = 58)]<br />

Figure 1. The graphs show conservation in the alignments as measured by information content: $C = \sum_i f_i \log_2(f_i / q_i)$, where $i$ sums over the four nucleotides, $f_i$ is the frequency of nucleotide $i$ in the column and $q_i = 1/4$ is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding $f_i$, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment.<br />

Start and stop deviations<br />

The differences between predicted and annotated start and stop positions are illustrated in Figure 2, which shows that they agree well. The medians of the start and stop prediction deviations were in most groups zero or very close to zero, with more than half within 10 nucleotides. This was not the case for the eukaryotes.<br />
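The two summary statistics quoted here, the median deviation and the share of deviations within 10 nucleotides, can be computed as in the sketch below. The function name and the input layout, a list of (predicted, annotated) coordinate pairs for one rRNA group, are hypothetical.<br />

```python
from statistics import median


def deviation_summary(pairs, tolerance=10):
    """Median signed deviation and fraction within `tolerance` nucleotides.

    pairs: list of (predicted_position, annotated_position) for matched
    predictions, for either the start or the stop coordinate.
    """
    deviations = [pred - annot for pred, annot in pairs]
    within = sum(abs(d) <= tolerance for d in deviations) / len(deviations)
    return median(deviations), within
```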

For eukaryotic 5S, only five genomes contained predictions with matching annotations. The predictions were uniform in length, whereas the annotations were more variable. The predictions that indicated a substantially shorter 5S than annotated were all in Schizosaccharomyces pombe: the average length of the annotations was 170 nt, whereas the corresponding predictions were all 114 nt. For eukaryotic 18S, however, predicted start and stop positions were very accurate, although many annotated 18S were missed.<br />


3104 Nucleic Acids Research, 2007, Vol. 35, No. 9<br />

Table 2. The number of rRNAs annotated and predicted in the genomes that were examined.

| Kingdom | Type | Annotated | Same strand | Other strand | Not found | Full model predictions | Novel |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Archaea (n = 27) | 5S | 56 (24) | 43 (21) | 1 (1) | 12 (8) | 47 (23) | 4 (3) |
| | 16S | 47 (25) | 45 (25) | 2 (2) | 0 (0) | 47 (27) | 2 (2) |
| | 23S | 47 (25) | 44 (24) | 2 (2) | 1 (1) | 46 (26) | 2 (2) |
| Bacteria (n = 321) | 5S | 1205 (285) | 1166 (285) | 30 (16) | 9 (5) | 1339 (320) | 173 (69) |
| | 16S | 1172 (299) | 1146 (299) | 22 (12) | 4 (4) | 1237 (320) | 91 (34) |
| | 23S | 1197 (297) | 1154 (291) | 22 (13) | 21 (12) | 1248 (313) | 94 (36) |
| Eukaryotes (n = 13) | 5S | 65 (7) | 46 (6) | 19 (1) | 0 (0) | 324 (9) | 278 (5) |
| | 18S | 13 (4) | 6 (4) | 0 (0) | 7 (2) | 13 (6) | 7 (3) |
| | 28S | 13 (5) | 12 (4) | 0 (0) | 1 (1) | 19 (7) | 7 (3) |

The table gives the number of annotations, and splits this into those matching predictions on the same strand, on the other strand, and not found. The total number of full model predictions is given. Novel predictions are full model predictions not matching any annotation on the same strand, and include those annotated on the other strand. Numbers in parentheses indicate the number of genomes. It should be noted that the eukaryotic annotated count is somewhat uncertain due to ambiguous rRNA annotations. The genomes analyzed were taken from the GenomeAtlas database, a database of all available fully sequenced genomes.
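The column definitions in the caption imply two identities that every row of Table 2 should satisfy: the annotated count is the sum of the same-strand, other-strand and not-found counts, and the novel count is the number of full model predictions minus those matching an annotation on the same strand. The sketch below checks this using the rRNA counts from the table (genome counts in parentheses are left out); the variable and function names are illustrative only.

```python
# Rows of Table 2 as
# (kingdom, type, annotated, same strand, other strand, not found, full model, novel).
TABLE2 = [
    ("Archaea", "5S", 56, 43, 1, 12, 47, 4),
    ("Archaea", "16S", 47, 45, 2, 0, 47, 2),
    ("Archaea", "23S", 47, 44, 2, 1, 46, 2),
    ("Bacteria", "5S", 1205, 1166, 30, 9, 1339, 173),
    ("Bacteria", "16S", 1172, 1146, 22, 4, 1237, 91),
    ("Bacteria", "23S", 1197, 1154, 22, 21, 1248, 94),
    ("Eukaryotes", "5S", 65, 46, 19, 0, 324, 278),
    ("Eukaryotes", "18S", 13, 6, 0, 7, 13, 7),
    ("Eukaryotes", "28S", 13, 12, 0, 1, 19, 7),
]


def check_row(row):
    """Return (annotated identity holds, novel identity holds) for one table row."""
    _, _, annotated, same, other, not_found, full, novel = row
    return (annotated == same + other + not_found, novel == full - same)


checks = {(row[0], row[1]): check_row(row) for row in TABLE2}
all_consistent = all(all(result) for result in checks.values())
print("Table 2 internally consistent:", all_consistent)
```

Both identities hold for all nine rows, which supports reading "Novel" as including predictions whose only annotation lies on the other strand.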

[Figure 2, panel group 5S (43/1163/46), Start panel. Rows: Archaea, Bacteria, Eukaryotes. Axis: deviation from the annotated position in nt, from −1000 to 1000.]

For eukaryotic 28S, only two genomes had predictions with matching annotations. In one of them, Encephalitozoon cuniculi, the stop position was predicted 1112 nt downstream of the annotation once and 4797 nt downstream twice, whereas the start position was accurately predicted. In the other genome, Guillardia theta, the start positions were uniformly predicted 110 nt upstream of the annotated position, whereas the stop position was predicted quite accurately.

[Figure 2, remaining panels: 5S Stop; 16/18S (44/1146/6) Start and Stop; 23/28S (42/1150/9) Start and Stop. Axis: deviation from the annotated position in nt, from −1000 to 1000.]

Figure 2. Deviation of start and stop positions between predicted and annotated RNA, presented as pairs of panels. The numbers of predictions among the archaea, bacteria and eukaryotes are given beneath each panel group heading. The zero position in each panel corresponds to the annotated start or stop position, with predicted positions presented relative to it. The yellow dot indicates the median deviation and the black box the interquartile range. The whiskers on each side of the box extend from the side of the box to the data point that is closest to, but does not exceed, 1.5 times the interquartile range. The curves show the density of the distribution.
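The summary statistics drawn in Figure 2 can be computed as follows. This is a sketch under stated assumptions rather than the plotting code behind the figure: forward-strand coordinates are assumed, positive deviations are taken to mean "predicted downstream of annotated", the quartiles use the inclusive method of Python's `statistics.quantiles` (which needs at least two data points), and the function names are hypothetical.

```python
import statistics


def endpoint_deviations(predicted, annotated):
    """Pair up (start, stop) tuples and return start and stop deviations in nt.

    Positive values mean the prediction lies downstream of the annotation
    (forward-strand coordinates are assumed).
    """
    starts = [p[0] - a[0] for p, a in zip(predicted, annotated)]
    stops = [p[1] - a[1] for p, a in zip(predicted, annotated)]
    return starts, stops


def boxplot_summary(deviations):
    """Median, quartiles and whisker ends (data points within 1.5 x IQR of the box)."""
    data = sorted(deviations)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [d for d in data if lower_fence <= d <= upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
    }


# Made-up deviations for illustration: one far outlier at 110 nt.
summary = boxplot_summary([-3, 0, 0, 0, 1, 2, 7, 110])
```

In this toy data set the outlier at 110 nt falls outside the whiskers, in the same way that a few large deviations fall outside the boxes in Figure 2.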

Since rRNAs tend to be very similar within a genome, predictions within each genome generally had similar lengths. This similarity within genomes as well as within groups of closely related genomes caused multiple peaks in the distributions of endpoint deviations. An example of this can be seen in the bacterial 16S predictions where some of the predicted start and stop positions were clustered downstream of the annotation and where some of the predicted start positions were clustered upstream of the annotation. Some of the major contributors to the upstream peak in the start positions were different Streptococcus pyogenes strains, Bacillus genomes and Yersinia pestis genomes. These, in addition to Streptococcus agalactiae strains and Vibrio parahaemolyticus, were also prevalent in the stop position downstream peak. There was also a downstream peak in the start positions, and the genomes causing this peak were mainly Staphylococcus aureus, Bacillus cereus and several Escherichia coli relatives.

Most of the start and stop deviations did not exceed 100 nt. However, there were a few cases of deviations exceeding 1000 nt, and these are not shown in the figure. This was the case for eukaryotic 28S and was mainly due to the three previously described stop positions predicted considerably downstream of the annotated stop position. In the two longer predictions from E. cuniculi, this was due to the HMM placing the latter 100 nt of the prediction further downstream to achieve a better score. Such inserts would most likely not appear when the spotter model is used first, since the inserted sequence would be too long to fit within the search window. To test this, a truncated version of the sequence was run through the predictor. The stop position was then accurately predicted. This phenomenon also explains some cases among the bacterial 16S predictions where the start position was placed very far upstream of the annotation. There were 27 rRNAs whose start position was predicted anywhere from 13 000 to 40 000 nt upstream of the annotated start position. All but one of these were Firmicutes, mostly Streptococci and Staphylococci. Closer study of the sequences revealed that the misplaced start position predictions were again due to long sequences being inserted near the start of the rRNA, indicating that the first part of the HMM had been misplaced in the same manner as the last part was for the E. cuniculi stop predictions. To test if these were the same kind of inserts, a region ending in the same place as the predictions but starting 10 000 nt earlier was run through the full model predictor. This led to the bacterial 16S rRNAs being predicted with a deviation in start and stop positions on par with what was otherwise seen.

Comparison to experimentally verified rRNAs<br />

Annotations were often ambiguous and considered unreliable. For discrepancies between annotations and RNAmmer predictions, it is not a priori clear which of the two is correct. However, some genomes with experimentally verified rRNAs were selected to further assess the accuracy of start and stop predictions. The genomes we examined were Anaplasma marginale Str. Maries, Chlamydia muridarum Nigg, Escherichia coli K12 MG1655, Sulfolobus tokodaii Str. 7, Thermus thermophilus HB8 and Nitrobacter hamburgensis X14. These genomes all had complete 16S sequences according to the NCBI database and had accompanying literature which said that they were experimentally determined. When checking the positions of these rRNAs with BLAST against the genome, some discrepancies were found. Due to this we used the BLAST results when comparing annotated rRNAs to predictions.


In total, there were 14 copies of the six 16S sequences, and all of them were found by our predictions. Stop predictions were more accurate than start predictions. In all but four cases, the start position was predicted to be 7 nt downstream of the annotated start position. In A. marginale and S. tokodaii, the start position was predicted to be the same as the annotation, and both of the two entries from C. muridarum were predicted to start 3 nt downstream of the annotated start position. In N. hamburgensis the start position was, in contrast to the other cases, predicted 7 nt upstream of the annotated start position. The stop positions in all but three predictions coincided with the annotation. In N. hamburgensis the predicted stop was 9 nt downstream, whereas in S. tokodaii and A. marginale the predicted stop was 1 nt downstream of the annotation. Thus, all predictions were within 10 nt of the annotated start and stop positions.

Comparison to RFAM<br />

RFAM is a database of RNA families which incorporates secondary structure in its analyses. We have made a comparison with the 5S rRNA predictions of RFAM (17,18) for a selection of twenty prokaryotic genomes listed in Supplementary Table S5. A total of 55 5S rRNAs were annotated in these genomes. RNAmmer found 53 of them, while 54 were found in RFAM. In three of the genomes, both methods predicted a 5S to within a few nucleotides of the annotated position, but both placed it on the other strand. Both predictors identified three new 5S rRNAs within these genomes, and at approximately the same positions. Two of these new 5S rRNAs followed another annotated 5S rRNA, looking like a tandem repeat. In most cases, both methods placed the start position a few nucleotides downstream of the annotation, whereas the stop position was more evenly distributed around the annotated position. RNAmmer generally predicted rRNAs to be shorter by a nucleotide or two than RFAM, usually at the start of the genes.

Spotter pre-screening

Table 3 shows that, with the exception of archaeal 5S, no full model hits were missed by the spotter model. Also, the spotter produced relatively few false positives, except for the eukaryotic 5S.

Minimum, maximum, quartile and median scores for all the full model predictions are shown in Table 3, giving some indication of the range of scores that rRNAs can be expected to have. The table also includes the threshold T99 and the likelihood Pmin, which indicate that all full model predictions were expected to have corresponding spotter model predictions except some among the archaeal 5S.

Based on the relatively stable lengths of the different types of rRNAs and the corresponding full model hits and the position of the spotter hit within them, we decided on window sizes around spotter model hits to use when the spotter model is used first. These were chosen to be 300 nt for the 5S rRNA, 5000 nt for the 16/18S and 9000 nt for the 23/28S. Being roughly three times the length of the corresponding rRNAs, we consider rRNA sequences to be unlikely to extend beyond these windows.

Table 3. Evaluation of spotter and full model predictions.

| Kingdom | Type | Predictions: Full | Predictions: Spotter | Predictions: FPS | Score: Min | Score: Q1 | Score: Med | Score: Q3 | Score: Max | T99 | Pmin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Archaea | 5S | 47 | 35 | 7 | 2.9 | 12.7 | 20.0 | 35.3 | 50.6 | 34.9 | 0.69 |
| | 16S | 47 | 47 | 0 | 1180.8 | 1891.9 | 1937.9 | 2004.0 | 2096.5 | 50 | 1.0 |
| | 23S | 46 | 46 | 1 | 2240.7 | 2714.1 | 2870.7 | 3155.3 | 3267.3 | 50 | 1.0 |
| Bacteria | 5S | 1339 | 1339 | 123 | 39.9 | 77.7 | 89.5 | 94.6 | 109.6 | 14.0 | 1.0 |
| | 16S | 1237 | 1237 | 31 | 721.9 | 1905.5 | 1989.4 | 2058.7 | 2148.5 | 50 | 1.0 |
| | 23S | 1248 | 1248 | 20 | 2502.8 | 3267.8 | 3586.5 | 3690.7 | 3876.1 | 50 | 1.0 |
| Eukaryotes | 5S | 324 | 324 | 251 | 43.9 | 51.1 | 53.9 | 74.3 | 82.2 | 50 | 1.0 |
| | 18S | 13 | 13 | 14 | 625.3 | 625.3 | 1733.1 | 1777.5 | 1777.6 | 50 | 1.0 |
| | 28S | 19 | 19 | 5 | 1434.2 | 2904.7 | 3225.0 | 3335.9 | 3380.9 | 50 | 1.0 |

This table shows the total number of full model predictions, the number of spotter predictions that had matching full model predictions and the number of false positive spotter model predictions. The characteristics of the full model prediction score distributions are shown. FPS denotes the number of false positive spotter predictions. T99 refers to the lowest score a full model could have while still being detected with 99% probability by a spotter model with positive score. Pmin is the probability that a spotter with positive score would find a full model with the minimum score indicated. The lowest full model score can be used as a lower limit for which results could be expected to be real.

Computational speed

Searching Mycoplasma capricolum ATCC27343, about 1 Mbp, for bacterial 16S took 14 minutes using the full HMM. Using the spotter to screen the sequence, then the full model on the spotter hits, reduced the time to 16 seconds. Search times are expected to increase proportionally to the genome size; when using the spotter model to screen the sequence, search time will also increase with increasing number of spotter hits.

Time differences between searching long and short sequences were examined by searching through the complete sequence of Solibacter usitatus Ellin6076, and through the Sargasso Sea environmental samples (19). Searching the S. usitatus genome, about 10 Mbp, took 48 seconds per Mbp. Two copies of each rRNA family were found. The Sargasso Sea samples consisted of 811 372 entries totaling over 800 Mbp. On this set the search speed was 407 seconds per Mbp. The article (19) accompanying this set indicated 1164 small subunit rRNA genes (16/18S) or fragments of genes; we found only 332, but our predictors are not able to find fragments of rRNAs. In addition, we found 562 5S and 68 23S sequences.

DISCUSSION

Our aim has been to enable high-throughput searches for rRNA while producing accurate and consistent predictions suitable for comparative analyses. For this purpose, we have developed the RNAmmer package which relies on HMMs for both speed and accuracy. HMMs were made using HMMer (15), which from a multiple alignment produces an HMM where match states represent columns with a specific nucleotide distribution, corresponding deletion states represent the possibility of gaps, and insertion states represent columns with large numbers of gaps; transition probabilities between the states indicate how likely each of the states is. HMMs thus differ from sequence alignments in that the likelihood of insertions and deletions may vary along the sequence. When searching a sequence with an HMM, the score indicates how well the sequence segment matches the model. The information content of a position, which reflects the nucleotide distribution and the likelihood of gaps, indicates how well that position is conserved. A good match to the HMM may come either from a highly conserved region which may well be short, or from a longer region with only weak conservation. We find both these cases. Bacterial 16S are detected despite almost half of the nucleotides being assigned to insert states, as other regions are highly conserved. For archaeal 23S, however, the information content of each position is low, but the sequence is long and there are few allowed insert states. These aspects can also explain cases of poor performance, both of the full model and of the spotter model.
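The trade-off between a short, highly conserved region and a long, weakly conserved one can be illustrated with a deliberately simplified score. The sketch below is not a profile HMM: it ignores insert and delete states and transition probabilities, and only sums per-column log-odds for an ungapped segment against hypothetical match-state emission probabilities. The probabilities 0.97 and 0.33 and the column counts are made up for the example.

```python
import math


def match_score(column_probs, observed, background=0.25):
    """Log-odds score in bits of an ungapped segment against per-column probabilities."""
    return sum(
        math.log2(probs[nucleotide] / background)
        for probs, nucleotide in zip(column_probs, observed)
    )


def uniform_columns(length, p_consensus):
    """Columns where the consensus 'A' has probability p_consensus; the rest is shared equally."""
    rest = (1.0 - p_consensus) / 3
    return [{"A": p_consensus, "C": rest, "G": rest, "U": rest} for _ in range(length)]


# 20 strongly conserved columns versus 100 weakly conserved columns,
# both scored on a segment that matches the consensus everywhere.
short_conserved = match_score(uniform_columns(20, 0.97), "A" * 20)
long_weak = match_score(uniform_columns(100, 0.33), "A" * 100)
print(f"short, conserved: {short_conserved:.1f} bits")
print(f"long, weak:       {long_weak:.1f} bits")
```

Under these assumptions the two segments reach nearly the same total score, which is the sense in which a good match may come from either kind of region.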

The low information content in the eukaryotic 5S and 18S alignments indicates that these sequences are more divergent than archaeal and bacterial 5S and 16S. In addition, 40% of the 5S and 75% of the 18S alignment give rise to insert states in the HMM. Thus, there is little for the HMM to recognize. Furthermore, many of the missed 18S rRNAs were from Cryptophyta, a phylum which makes up only 0.6% of the alignment data. The archaeal 5S show the same characteristics as the eukaryotic 5S and 18S, which most likely explains the low performance for these rRNAs. The scores for archaeal 5S hits were generally low, and since the spotter score comes only from a 75 nt part of the sequence, it is lower still, causing the spotter to miss 12 of the full model hits. It is notable, however, that these were the only cases missed by the spotter model: with the exception of archaeal 5S, our analyses show that the spotter should be able to detect rRNAs unless they are much further diverged than what we find in our data.

Columns at the beginning and end of the multiple alignments often have low conservation and many gaps. Such columns are generally accommodated into the HMM as insert states, but HMMer ignores them at the beginning and end of the alignment. An example is the 5S, where match states stop around 10 columns from the end of the alignments, effectively causing the HMM to predict the last conserved nucleotide of the consensus sequence rather than the stop of the rRNAs. Hence, it is not uncommon for the stop position of the 5S to be predicted up to 10 nt downstream of the annotated stop position.

These effects can also explain the endpoint accuracy that was seen when we compared our results to experimentally determined 16S sequences. We tried to find sequences where the ends had been experimentally verified by RACE or PCR, but such rRNAs proved difficult to find. All the ones we selected were sequenced, but it is uncertain to what extent the authors had tried to determine the ends. For these experimentally determined rRNAs, predictions and annotations agreed better than they did for predictions in general, although this is not sufficient to conclude that our predictions are more accurate. Our stop predictions were very accurate, but more deviation was seen in the start predictions. These results could reflect more variation in the beginning of the alignments, which as in the 5S case could effectively cause the HMM to predict the first conserved nucleotide of the consensus sequence rather than the start of the rRNAs.

In some cases, larger endpoint deviations occur. This can happen when one of the ends of the model finds a better match in a different part of the sequence. Insertion states sometimes allow the HMM to insert long gap regions and thus find a matching stop position far from the rest of the sequence. As shown for the bacterial 16S sequences that displayed this phenomenon, this is less of a problem when the spotter model is employed. The window searched around the spotter hit would most likely be too short to accommodate such an insert, and the model would match with the proper sequence.

For fragmented rRNAs, long gap regions may be correctly predicted. This was seen for Coxiella burnetii 23S where our prediction has the same start position as annotated, but where the predicted stop position is 1884 nt downstream of GenBank’s stop position. However, according to Entrez Gene, this rRNA appears in four pieces and with the same stop position as ours, suggesting that in some cases ‘too long’ predictions might actually be correct. These cases should normally not be masked when using the spotter unless inserts between the fragments would make it exceed the window size.

The HMM produced by HMMer requires time of order $O(NM)$ to search a sequence of length $N$ using a model with $M$ states, $M$ being proportional to the length of the multiple alignment. However, the speed is increased by using a 75 nt long spotter model to pre-screen the sequence, which requires time of order $O(N)$, and then running the full HMM on windows around each spotter hit, which requires time of order $O(KM^2)$ for $K$ spotter hits and window size proportional to $M$. The benefit of using the spotter is clearly illustrated in the M. capricolum searches. However, the time difference between the S. usitatus and the Sargasso Sea data searches shows that the spotter might lose its advantage when dealing with many shorter sequences.
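The two-stage search can be sketched as a window computation plus a rough operation count. The window sizes are those given in the Results (300 nt, 5000 nt and 9000 nt); everything else is an assumption made for illustration: windows are centered on the spotter hit, overlapping windows are merged, the function names are hypothetical, and the cost is counted as sequence positions times model length with no constant factors.

```python
# Window sizes in nt around a spotter hit, as chosen in the Results.
WINDOW_NT = {"5S": 300, "16/18S": 5000, "23/28S": 9000}


def windows_around_hits(hit_centers, window, seq_len):
    """Return merged half-open (start, end) windows of `window` nt centered on each hit."""
    half = window // 2
    intervals = sorted(
        (max(0, center - half), min(seq_len, center + half)) for center in hit_centers
    )
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlapping windows are searched once, as a single region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def relative_cost(seq_len, model_len, windows, spotter_len=75):
    """Rough operation counts: full model everywhere versus spotter plus windows."""
    full_only = seq_len * model_len
    two_stage = seq_len * spotter_len + sum(end - start for start, end in windows) * model_len
    return full_only, two_stage


# Three hypothetical spotter hits in a 1 Mbp sequence, searched with a 1500-state model.
windows = windows_around_hits([1000, 3000, 100000], WINDOW_NT["16/18S"], 1_000_000)
full_only, two_stage = relative_cost(1_000_000, 1500, windows)
```

For a single long genome with few hits the two-stage count is dominated by the spotter pass. For many short entries, each window can cover most of an entry, so the second stage approaches a full search of every entry with a hit, which is one plausible reading of the Sargasso Sea timing.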


There are other approaches to predicting non-coding RNA. One commonly used method is sequence alignment, e.g. BLAST (3), Paralign (20) or FASTA (21). Another is based on structure-sensitive Stochastic Context Free Grammars (SCFG) (22) which form the basis of the tRNA prediction program tRNAscan-SE (23) and of Infernal (24), which is used when creating RFAM. While the sequence alignment methods are very fast, they are not particularly suited for prediction of non-coding RNA (1). Infernal, however, has a general worst case running time of order $O(MN^3)$, which is prohibitive. The RFAM database (17,18), which includes 5S and the 5′ domain of 16S, uses BLAST to pre-screen genome sequences, followed by Infernal; despite a more efficient approach than the general SCFG, it does not analyze the entire 16S. A search for 5S in a 1 Mbp genome using Infernal took 4 hours 45 minutes: roughly 1000 times as much as the 16 seconds used by RNAmmer for the much larger 16S model. A time-saving approach to SCFGs could be to use the RaveNna (25) package which can convert an RFAM SCFG to an HMM. This drastically reduces the running time; however, its usefulness would be limited since no models for the larger rRNAs are available. Another factor is that the 5S found by RaveNna (26) which were not already in RFAM were all in organellar sequences, which are not analyzed by RNAmmer. For further comparisons and comments on these different methods, we refer to (1).
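The timing figures quoted in the text can be put side by side with a short calculation. The numbers are taken from the Computational speed section and the paragraph above; only the helper name `seconds` is introduced for the example.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


full_hmm_16s = seconds(minutes=14)            # full HMM, M. capricolum, bacterial 16S
two_stage_16s = seconds(secs=16)              # spotter followed by full model
infernal_5s = seconds(hours=4, minutes=45)    # Infernal, 5S, 1 Mbp genome

spotter_speedup = full_hmm_16s / two_stage_16s
infernal_ratio = infernal_5s / two_stage_16s
sargasso_slowdown = 407 / 48                  # seconds per Mbp, Sargasso Sea vs S. usitatus

print(f"spotter speed-up:          {spotter_speedup:.1f}x")
print(f"Infernal vs RNAmmer:       {infernal_ratio:.0f}x")
print(f"Sargasso vs single genome: {sargasso_slowdown:.1f}x slower per Mbp")
```

The spotter reduces the 16S search time about 50-fold on a single genome, whereas the per-Mbp cost on the fragmented Sargasso Sea data is roughly eight times higher than on the S. usitatus genome.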

The RNAmmer program is available as a traditional HTML-based prediction server at http://www.cbs.dtu.dk/services/RNAmmer as well as through a SOAP-based web service. It is also available for download through the same site.

SUPPLEMENTARY DATA<br />

Supplementary Data is available at NAR online.

ACKNOWLEDGEMENTS

We are grateful for funding from EMBIO at the University of Oslo, the Research Council of Norway and the Danish Center for Scientific Computing. This work was also supported by a grant from the European Union through the EMBRACE Network of Excellence, contract number LSHG-CT-2004-512092. We would also like to thank our colleagues for critical reading of the manuscript. Funding to pay the Open Access publication charge was provided by the Research Council of Norway.

Conflict of interest statement. None declared.

REFERENCES<br />

1. Freyhult,E., Bollback,J. and Gardner,P. (2007) Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res., 17, 117–125.

2. Pedersen,A., Jensen,L., Brunak,S., Staerfeldt,H. and Ussery,D. (2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol., 299, 907–930.

3. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

4. Wimberly,B., Brodersen,D., Clemons,W. Jr., Morgan-Warren,R., Carter,A., Vonrhein,C., Hartsch,T. and Ramakrishnan,V. (2000) Structure of the 30S ribosomal subunit. Nature, 407, 327–339.

5. Schluenzen,F., Tocilj,A., Zarivach,R., Harms,J., Gluehmann,M., Janell,D., Bashan,A., Bartels,H., Agmon,I. et al. (2000) Structure of functionally activated small ribosomal subunit at 3.3 angstroms resolution. Cell, 102, 615–623.

6. Nissen,P., Hansen,J., Ban,N., Moore,P. and Steitz,T. (2000) The structural basis of ribosome activity in peptide bond synthesis. Science, 289, 920–930.

7. Yusupov,M., Yusupova,G., Baucom,A., Lieberman,K., Earnest,T., Cate,J. and Noller,H. (2001) Crystal structure of the ribosome at 5.5 Å resolution. Science, 292, 883–896.

8. Srivastava,A. and Schlessinger,D. (1991) Structure and organization of ribosomal DNA. Biochimie, 73, 631–638.

9. Acinas,S., Marcelino,L., Klepac-Ceraj,V. and Polz,M. (2004) Divergence and redundancy of 16S rRNA sequences in genomes with multiple rrn operons. J. Bacteriol., 186, 2629–2635.

10. Jackson,S., Cannone,J., Lee,J., Gutell,R. and Woodson,S. (2002) Distribution of rRNA introns in the three-dimensional structure of the ribosome. J. Mol. Biol., 323, 35–52.

11. Evguenieva-Hackenberg,E. (2005) Bacterial ribosomal RNA in pieces. Mol. Microbiol., 57, 318–325.

12. Wuyts,J., Perriere,G. and Van De Peer,Y. (2004) The European ribosomal RNA database. Nucleic Acids Res., 32 Database issue, D101–D103.

13. Szymanski,M., Barciszewska,M., Erdmann,V. and Barciszewski,J. (2002) 5S ribosomal RNA database. Nucleic Acids Res., 30, 176–178.

14. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.

15. Eddy,S. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

16. Henikoff,S. and Henikoff,J. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.

17. Griffiths-Jones,S., Moxon,S., Marshall,M., Khanna,A., Eddy,S. and Bateman,A. (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res., 33 Database issue, D121–D124.

18. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.

19. Venter,J., Remington,K., Heidelberg,J., Halpern,A., Rusch,D., Eisen,J., Wu,D., Paulsen,I., Nelson,K. et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304, 66–74.

20. Rognes,T. (2001) ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res., 29, 1647–1652.

21. Pearson,W. and Lipman,D. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85, 2444–2448.

22. Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (2000) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

23. Lowe,T. and Eddy,S. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.

24. Eddy,S. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.

25. Weinberg,Z. and Ruzzo,W. (2006) Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics, 22(1).

26. Weinberg,Z. and Ruzzo,W.L. (2004) In RECOMB 04: Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, ACM Press, pp. 243–251.



3.9 Paper VII: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes



Standards in Genomic Sciences (2009) 1: 204-215 DOI:10.4056/sigs.28177

GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes

Peter F. Hallin 1, Hans-Henrik Stærfeldt 1, Eva Rotenberg 1,2, Tim T. Binnewies 1,3, Craig J. Benham 4, and David W. Ussery 1

1 Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
2 Lersoe Parkalle 37, 2TV, 2100 Copenhagen, Denmark
3 Roche Diagnostics Ltd., CH-6343 Rotkreuz, Switzerland
4 UC Davis Genome Center, University of California, Davis, California, U.S.A.

We present an interactive web application for visualizing genomic data of prokaryotic chromosomes. The tool (GeneWiz browser) allows users to carry out various analyses such as mapping alignments of homologous genes to other genomes, mapping of short sequencing reads to a reference chromosome, and calculating DNA properties such as curvature or stacking energy along the chromosome. The GeneWiz browser produces an interactive graphic that enables zooming from a global scale down to single nucleotides, without changing the size of the plot. Its ability to zoom disproportionally provides optimal readability and increased functionality compared to other browsers. The tool allows the user to select the display of various genomic features, color settings and data ranges. Custom numerical data can be added to the plot, allowing, for example, visualization of gene expression and regulation data. Further, standard atlases are pre-generated for all prokaryotic genomes available in GenBank, providing a fast overview of all available genomes, including recently deposited genome sequences. The tool is available online from http://www.cbs.dtu.dk/services/gwBrowser. Supplemental material including interactive atlases is available online at http://www.cbs.dtu.dk/services/gwBrowser/suppl/.

Introduction

The development of fast and inexpensive genome sequencing technologies has led to the generation of vast amounts of genomic information. As genomic sequencing becomes both more powerful and more affordable, the handling and analysis of the generated data produce novel challenges and shift the focus away from the discovery process towards technical considerations of handling, storing and analyzing sequence data. An important step when exploring a new genome is to compare it to existing sequences, in order to identify both novel and conserved features. Many automated computational methods are available that attempt to derive protein function from sequence [1-3]. In a metagenomic study by Harrington and co-workers it was estimated that 76% of the examined protein coding genes could be assigned a function. However, to assess predictions for individual genes, visualization remains critical to provide the biologist with an overview of the genomic context. Are genes of interest situated in clusters? In operons? How are they regulated? How does their DNA base composition compare with that of the rest of the genome? In order to display such features both on a genome scale and in close-up down to the level of nucleotides, we developed the GeneWiz browser, which is based on the 'Genome Atlas' concept [4,5]. This tool can also display local DNA structural properties, so that regulatory or repeat regions can easily be identified and interpreted in a chromosomal context.

During development of the GeneWiz browser, it became apparent that novel sequencing technology creates a further demand. The current generation of sequencing instruments utilizes primed synthesis in flow cells to simultaneously obtain the sequences of millions of different DNA templates, an approach that changed the field of DNA sequencing [6,7]. Flow sequencing, also known as sequencing by synthesis (SBS) on a solid surface, tracks nucleotides as they are added to a growing DNA strand [8]. SBS is used by high-throughput sequencing systems which have become commercially available in the past two years. Examples include the sequencer GS Titanium (commercialized by 454/Roche); Genome Analyser GA-II (Solexa/Illumina); and SOLiD 3 system (Applied Biosystems).

The Genomic Standards Consortium

These developments have increased the speed of sequencing while significantly reducing its cost [9,10]. This much higher throughput provides greater coverage, but at the cost of much shorter read lengths: from 50 bases with SOLiD 3 to 75 bases with Illumina GA II. Even reads of 500 bases obtained with the 454-Titanium are still shorter than read lengths typically obtained using the Sanger method [9,11]. The output from modern high-throughput sequencing equipment challenges the assembly software by generating shorter and ambiguous reads. Processing of this flood of sequence data has rapidly become a bottleneck, and developing the necessary skills and tools will most likely be a driving factor in the execution of second-generation sequencing [12]. As a first step in this development, it needs to be determined to what extent assembly of short-read sequences can be trusted, an assessment for which the GeneWiz browser can also be used.

Methods

Our method of visualization is based on color-encoded lanes to display numerical information on a genome atlas similar to GeneWiz [4,5]. The color encoding can be done either using a linear scale with a fixed minimum and maximum range, or a dynamic scale of standard deviations. Using the latter, color intensity decreases as data approach average values, thereby emphasizing regions of significant variation. The web interface is divided into four optional sections, to address various biological viewpoints of chromosomes: 1) DNA properties; 2) Mapping of homologous genes by BLAST; 3) Mapping of short sequencing reads; 4) Custom lanes such as Single Nucleotide Polymorphism (SNP) or microarray data. The output of each method is a numerical vector of length corresponding to that of the reference sequence, and the methods used for this construction are described in detail below.
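As an illustration of the two color scales mentioned above, the following Python sketch maps lane values to a color intensity. It is not taken from the GeneWiz source: the function names, the saturation point of three standard deviations (`max_dev`), and the signed output range of -1 to 1 are assumptions made for the example.

```python
import statistics


def linear_intensity(value, lo, hi):
    """Map value onto [0, 1] using a fixed minimum and maximum range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)
    return (clipped - lo) / (hi - lo)


def stddev_intensity(value, mean, stddev, max_dev=3.0):
    """Map value onto [-1, 1] by its distance from the mean.

    The distance is measured in standard deviations; max_dev (assumed
    to be 3 here) is the deviation at which the color saturates.
    Values near the mean give intensities near zero.
    """
    if stddev == 0:
        return 0.0
    z = (value - mean) / stddev
    return max(-1.0, min(1.0, z / max_dev))


# A lane that is flat except for one outlier at index 4.
lane = [0.30, 0.31, 0.29, 0.30, 0.55, 0.30, 0.28, 0.31]
mu = statistics.mean(lane)
sd = statistics.pstdev(lane)
intensities = [stddev_intensity(v, mu, sd) for v in lane]
```

With the standard-deviation scale the outlier receives by far the strongest intensity, while the values close to the lane average fade towards a neutral color.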

Read quality assessment

Gene duplications, rRNA operons and other repetitive chromosomal regions are known to cause difficulties during the assembly of short reads [13]. To assess the degree of ambiguity of sequencing reads, a method was developed that derives the uniqueness of all reads, accounting for both the read quality and the match to the reference genome.

Sequence reads from Illumina and 454 are reported with base qualities: a per-nucleotide measure that denotes the credibility of the base calls. A method was derived which condenses these qualities into values per position in the reference genome and calculates the following information: uniqueness-weighted quality, information content, sequence agreement, and repeat-weighted coverage (see methods). These estimates provide a preliminary overview of regions that may appear problematic to assemble. In general, low uniqueness is found in the gaps between the assembled contigs generated by the default assembly tools from a given sequence dataset, as will be demonstrated below. A high score of uniqueness-weighted quality indicates that the base is uniquely identified by a read and that it has a high base quality in that read. The approach is illustrated in Figure 1.

From the mapping, five different parameters were calculated, which together summarize the trustworthiness of the reads given the assembly:

Weighted coverage. Under the assumption that all reads would map only once (H_r=1), the coverage c(i) can be calculated as the number of alignments R mapped at position i. A weighted coverage c'(i), based on the per-hit weights w_{r,h} (see uniqueness-weighted quality below), is used to correct for the higher coverage artificially introduced by repeats.

http://standardsingenomics.org



Figure 1 | Mapping reads to a reference genome accounting for uniqueness. In step 1, each read is aligned against the reference genome. In step 2, the quality of each read is weighted according to the uniqueness of the hit. A read giving rise to two hits S_1 and S_2 in the reference genome will be weighted proportionally to the relative alignment scores; if the scores are identical, the mappings of S_1 and S_2 will each be assigned a weight of w=0.5. Step 3 maps the weighted qualities back to the reference genome so that each genomic position contains an array of weighted qualities. Once all reads are mapped, in step 4 only the maximum weighted quality value is kept and, in step 5, the maximum weighted quality scores are color coded to reveal regions of low uniqueness.
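The weighting steps outlined in Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight of a hit is taken to be its share of the summed alignment scores of the read (which gives w=0.5 for two identical scores, as in the caption), the weighted coverage is taken to be the sum of these weights at each position, and alignments are treated as ungapped and forward-strand. All function and variable names are hypothetical.

```python
def hit_weights(scores):
    """Weight each hit of one read by its share of the summed scores."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("alignment scores must sum to a positive value")
    return [s / total for s in scores]


def max_weighted_quality(genome_length, reads):
    """Return (maximum weighted quality, weighted coverage) per position.

    reads: iterable of (qualities, hits); hits is a list of
    (start, score) pairs, where start is the 0-based genome position
    of the first base of an ungapped alignment (a simplification).
    """
    best = [0.0] * genome_length
    coverage = [0.0] * genome_length
    for qualities, hits in reads:
        weights = hit_weights([score for _, score in hits])
        for (start, _), w in zip(hits, weights):
            for offset, q in enumerate(qualities):
                pos = start + offset
                if 0 <= pos < genome_length:
                    # Step 4 of Figure 1: keep only the maximum value.
                    best[pos] = max(best[pos], w * q)
                    coverage[pos] += w
    return best, coverage


reads = [
    ([30, 30, 20], [(0, 50.0)]),             # read with a unique hit
    ([40, 40, 40], [(2, 60.0), (6, 60.0)]),  # read hitting a repeat twice
]
best, coverage = max_weighted_quality(10, reads)
```

In this toy example the repeat read contributes only half of its base quality and half a unit of coverage at each of its two locations, so repeated regions score lower than uniquely covered ones.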

Uniqueness-weighted quality. This measure corresponds to the base qualities obtained from the reads that are mapped to the reference genome, weighted by the uniqueness of the read. Consider read r, which has a quality profile q_r(i), where i is the position in the read. The read is aligned to the reference genome by BLAST, and all H_r hits are included when the following criteria are met: the BLAST score S_h of hit h is greater than or equal to S_0 (optionally provided by the user); S_h ≥ S_1 · x, where S_1 is the score of the first/best hit and x ∈ [0;1] is a constant provided by the user; and the E-value is equal to or less than a threshold specified by the user. The weighted quality q'_r(i) is derived by weighting q_r(i) with the weight w_{r,h} of each hit, which is proportional to the relative alignment score of that hit (Figure 1). From all the q'_r(i) values obtained at each position in the genome, the maximum uniqueness-weighted quality is chosen when all reads have been mapped.

Information content provides a number in bits of information [14] representing to what degree the reads agree: zero bits means equal distribution of A, T, G and C at a given position and 2 bits means complete conservation of a single base. The value is plotted on a color scale whereby low information (random distribution, least expected) is given in dark colors, and high information (high conservation, most expected) as light or neutral color. This measure may be useful for visualizing single nucleotide polymorphisms.
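The information content of one reference position can be illustrated with the standard Shannon formulation, I = 2 - H, where H is the entropy in bits of the base frequencies observed in the mapped reads. The Python sketch below assumes this uncorrected form (no small-sample correction) and returns zero for positions without any reads; both choices, and the function name, are assumptions for the example.

```python
import math
from collections import Counter


def information_content(bases):
    """Return the information content in bits of one alignment column.

    bases: string of the nucleotides (A, C, G, T) that the mapped reads
    place at one reference position. Other characters are ignored.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no reads at this position (assumed convention)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return 2.0 - entropy
```

A column in which all reads agree, such as "AAAA", yields 2 bits; a column with equal numbers of A, C, G and T yields 0 bits; a column with a minority base, as might be seen at a polymorphic site, falls in between.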



Read absence. A boolean where 'one' indicates complete absence of aligned reads.

Visualization of whole-genome homology

The BLASTatlas method [15] derives a map of per-nucleotide numbers on a reference genome to visualize the matches in the alignment between the reference genome and a query. The query can constitute any number of genomic contigs, scaffolds, full genomes, or collections thereof. This provides a method to identify regions of a reference genome that are conserved throughout multiple samples, as well as those that are unique. The BLASTatlas method is integrated into the GeneWiz browser software to facilitate a user-friendly interface. According to the BLAST algorithm chosen, DNA or protein sequences of the reference are aligned with the best match in the query (using either blastp, blastn, tblastn, or blastx). The alignment is then mapped back to the reference genome. A match adds a 'one' whereas a mismatch adds a 'zero' at each position along the chromosome. These ones and zeros translate into smooth color zones due to binning.
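How a vector of ones and zeros turns into graded color values can be shown with a short Python sketch. It assumes that a bin is represented by the mean of its per-nucleotide values; the source does not state which aggregate is used, so this choice and the function name are illustrative only.

```python
def bin_means(values, bin_size):
    """Average consecutive windows of bin_size values.

    The last bin may be shorter than bin_size.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    binned = []
    for i in range(0, len(values), bin_size):
        window = values[i:i + bin_size]
        binned.append(sum(window) / len(window))
    return binned


# 1 = reference position matched in the alignment, 0 = mismatch
per_nucleotide = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1]
binned = bin_means(per_nucleotide, 4)
```

Each bin then holds the fraction of matching positions, a value between 0 and 1 that can be mapped onto a continuous color scale.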

DNA properties and DNA destabilization

Through the web interface it is currently possible to select from 36 different nucleotide composition and DNA structural properties [4,5,16-22]. In addition, calculations of so-called SIDD energy estimates are provided, offering an approximation of promoter regions. This method estimates the free energy required to open the DNA helix, calculated at superhelical densities of -0.035, -0.044, and -0.055, using the SIDD algorithm [23]. All of these parameters can be applied in any combination to any of the prokaryotic genomes available from the web interface, or to a custom sequence provided by the user. Alternatively, the parameters may be applied as collections forming 8 standard atlases: the Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, and Repeat-atlas, and finally the SIDD atlas, which is introduced in this manuscript (Figure 3).

Figure 3 Configuration and references for pre-defined groups of DNA sequence- and structural properties: Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, Repeat-, and SIDD-atlas.

Custom data

A designated section of the GeneWiz browser is assigned for custom data. It allows the user to provide a per-nucleotide list of numerical values along with a desired color and data range. Although not presented here, this allows for visualization of additional information such as microarray data that has been pre-processed by the user, by mapping gene expression, regulation change, or p-values back to genomic coordinates. In addition to the main genome annotation covering CDSs, tRNAs, and rRNAs, the user may specify miscellaneous and pseudo-gene annotations separately. A button allows the query of selected reference genomes against a replicate of pseudogenes.org [24]. Other annotations of possible pseudogenes can be added, such as GenePRIMP output (geneprimp.jgi-psf.org/).

Dynamic visualization

The GeneWiz browser allows dynamic disproportional zooming, meaning that zooming occurs nearly instantly when requested by the user, by redrawing all the components like tracks, legends, marks and text for every view. This allows the browser to scale the plot to make use of the entire plotting area, by not rescaling all parts of the plot equally. For example, zooming 10× will stretch a data lane 10× along the genome position axis; however, the lane height and the distance to the neighboring lane will remain constant. The dynamic nature of the GeneWiz browser requires pre-binning of data for each zoom level, all of which are stored on a central server; for improved efficiency only data requested by the user are sent. Storing per-nucleotide information as table records in a database (e.g. MySQL) proved unfeasible, as the number of records per genome runs into the millions, and the construction of indexes would be very time consuming. Instead, a memory mapping technique was chosen that allows the server to obtain the values directly from binary files when provided with the zoom window and level, for any chromosome in the database. (Examples are provided as supplemental data, http://www.cbs.dtu.dk/services/gwBrowser/suppl/).
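The idea of serving a zoom window straight from a binary file can be sketched with Python's standard `mmap` module. The file layout used here (one little-endian 32-bit float per pre-binned value, one file per lane and zoom level) is an assumption for illustration; the actual GeneWiz file format and its compiled C reader are not described in the text, and the function names are hypothetical.

```python
import mmap
import struct
import tempfile

# Assumed layout: one little-endian float32 per pre-binned value.
VALUE = struct.Struct("<f")


def write_lane(path, values):
    """Write a lane as a flat binary file of float32 values."""
    with open(path, "wb") as handle:
        for v in values:
            handle.write(VALUE.pack(v))


def read_window(path, begin, end):
    """Return the values of bins begin..end (0-based, end exclusive).

    Only the requested byte range is touched; the file is never
    parsed or indexed as a whole.
    """
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            n_bins = len(mapped) // VALUE.size
            begin, end = max(0, begin), min(end, n_bins)
            return [
                VALUE.unpack_from(mapped, i * VALUE.size)[0]
                for i in range(begin, end)
            ]


with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    lane_path = tmp.name
write_lane(lane_path, [0.0, 0.25, 0.5, 0.75, 1.0])
window = read_window(lane_path, 1, 4)
```

Because the position of every value follows directly from its index, no database index is needed: a window request translates into a fixed byte offset and length.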

The client is written as a Java applet that obtains the data remotely from the server (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi). The browser server is written in Perl/CGI, while a compiled C program handles the access to the binary data files. The options currently supported are listed in Table 2.

Table 2 GeneWiz Browser server options.

Option                  Description
d                       The unique identifier for the atlas
ft                      Feature type (e.g. CDS,rRNA,tRNA) when returning annotations
f                       Data field to return
b                       Begin of window
e                       End of window
l                       Zoom level
z                       Enable zlib compression of output
m=i                     Return the genome length
m=avg/stddev/min/max    Return aggregate data for window/genome
m=d                     Return data values provided field, window and zoom level
m=c                     Return colors provided two or three-step ranges
m=n                     Return nucleotides provided the window
m=a                     Return annotations (used together with option 'ft')
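As a hedged illustration of how the options in Table 2 might be combined into a single request, the Python sketch below assembles a query URL with the standard library. The option letters and the server address come from the text; everything else is assumed: that the server accepts ordinary `key=value` query strings, the use of `AL111168GENOMEatlas` as atlas identifier, the field name `hypothetical_field`, and the window and zoom values. The sketch only builds the string and does not contact the server.

```python
from urllib.parse import urlencode

SERVER = "http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi"


def build_request(atlas_id, mode, **options):
    """Assemble a query URL from the single-letter options in Table 2.

    atlas_id fills option 'd' and mode fills option 'm'; any further
    options (f, b, e, l, z, ft) are passed through as given.
    """
    params = {"d": atlas_id, "m": mode}
    params.update(options)
    return SERVER + "?" + urlencode(params)


# Hypothetical request: data values (m=d) of one field in a window.
url = build_request(
    "AL111168GENOMEatlas", "d",
    f="hypothetical_field", b=1, e=10000, l=2,
)
```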

These options (Table 2) can be incorporated into a single URL. For example, one could request all [...] http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-[...] [...]ations are described in the XML record, which can be downloaded from the web (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/fetchxml.cgi?AL111168GENOMEatlas). Further examples are provided in the supplemental data section.

The GeneWiz workflow and data displayed

The GeneWiz browser plots and provides disproportional zooming for data pertaining to features and genes as well as numerical data associated with each nucleotide. The disproportional capability of the GeneWiz browser implies that all components (legends, tracks, marks, etc.) are regenerated for every view requested by the user. Figure 4 outlines the GeneWiz browser workflow.

When submitting a job via the web interface, the request is assigned a job identifier, under which all data lanes and configurations are kept. After the job has been processed the user may alter lane order, colors, ranges, and append various types of marks to the plot. The layout of a given browser instance is governed by an XML file, located on the server. When generating the graphical representation of the genome, the client Java program will make requests to the server to acquire aggregated values, such as the averages, standard deviations, minima, and maxima as well as lane data and annotations.


Figure 4 | The dataflow of the GeneWiz browser service. 1) The selected reference genome and the lanes to be included are defined via the web interface. 2) The request is sent to the analysis server that handles the calculations. 3) When the job is finished, the web page redirects to the applet viewer that allows the user to navigate and edit the plot layout.

Premade atlases

The genome sequences stored in the CBS Genome Atlas Database [25] are synchronized with NCBI Entrez genome projects and have been pre-processed for all of the eight standard atlases mentioned above. This currently allows the user to select from 1,636 pre-binned replicons from 864 prokaryotic sequencing projects, searchable by replicon name, GenBank accession number, or organism name (http://www.cbs.dtu.dk/services/gwBrowser/precalc/).

Results

Evaluation of re-sequencing quality

Three re-sequenced bacterial genomes were examined: one genome sequence was generated using the Illumina GA technology, whereas two genome sequences were generated utilizing the 454-Titanium technology (Table 3). The public sequence was selected as reference for mapping the re-sequencing reads using the GeneWiz browser tool. The randomness in fragmentation was estimated by comparing the experimental data with in-silico digestions, generated at 40X coverage using read lengths between 30 and 5,000 bp. A good correspondence between the in-silico and experimental reads suggests little bias towards certain chromosomal regions (Figure 5, panel A). The assembled contigs provided by 454 (C. jejuni and E. coli) are mapped to the reference genome using BLAST and annotated in the perimeter of the atlases (two leftmost atlases in Figure 5, panels A and B). The detailed atlases of the experimental data (true reads) are shown in Figure 5, panel B. Panel C shows quality/count of reads plotted as a function of read position. Note that the read quality decreases with increasing distance from the beginning of the read.
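An in-silico digestion of the kind referred to above can be sketched as uniform random sampling of fixed-length fragments from the reference until the target coverage is reached. The Python example below is an assumption-laden illustration rather than the procedure used in the paper: reads are error-free, of identical length, drawn from a linear (not circular) sequence, and the function name and seed parameter are hypothetical.

```python
import math
import random


def in_silico_reads(genome, read_length, coverage, seed=0):
    """Sample uniformly placed fixed-length reads for a target coverage.

    Returns a list of (start, sequence) pairs. The number of reads is
    coverage * genome length / read length, rounded up.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)
    n_reads = math.ceil(coverage * len(genome) / read_length)
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]


# Toy 2,000 bp reference digested into 50 bp reads at 40X coverage.
genome = "".join(random.Random(1).choice("ACGT") for _ in range(2000))
reads = in_silico_reads(genome, read_length=50, coverage=40)
```

Mapping such simulated reads back to the reference with the same procedure as the experimental reads gives the expected uniqueness profile for an unbiased fragmentation at a given read length.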


Table 3 Sequencing details of three bacterial genomes, two of which were re-sequenced using 454-Titanium and one with Illumina GA technology.

                                  E. coli K12 MG1655   C. jejuni NCTC11168   S. typhi Ty2
Strain id                         ATCC: 700926D-5      ATCC: 700819D-5       ERA000001
Technology                        454-Titanium         454-Titanium          Illumina GA II
Read count                        538,784              502,438               1,650,370
Avg read length (std. dev)        522 (53)             598 (75)              51 (0)
Truncated length                  600                  600                   35
Coverage                          61X                  183X                  18X
Genome size                       4,639,675 bp         1,641,481 bp          4,791,961 bp
Accession and original reference  U00096 [26]          AL111168 [27]         AE014613 [28]
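The coverage values in Table 3 appear consistent with the usual estimate of read count times average read length divided by genome size. That this is how the reported values were obtained is an assumption; the short Python check below simply reproduces the arithmetic from the table entries.

```python
def coverage(read_count, avg_read_length, genome_size):
    """Estimate fold coverage as total sequenced bases over genome size."""
    return read_count * avg_read_length / genome_size


# (read count, average read length, genome size in bp) from Table 3
datasets = {
    "E. coli K12 MG1655": (538_784, 522, 4_639_675),
    "C. jejuni NCTC11168": (502_438, 598, 1_641_481),
    "S. typhi Ty2": (1_650_370, 51, 4_791_961),
}
rounded = {name: round(coverage(*values)) for name, values in datasets.items()}
```

Rounded to whole numbers, the three estimates are 61, 183 and 18, matching the coverage row of the table.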

Figure 5 | Panel A: The maximum uniqueness quality is shown for the actual reads (green-to-blue lane) plotted in the outermost lanes, using the published genome as a reference. The following lanes show in-silico digestions at 40 X coverage (red-to-blue lane), using read lengths 30, 50, 70, 200, 500, 1,000, 1,000, and 5,000 bases. Panel B shows the weighted coverage, agreement with reference, maximum uniqueness quality, information content, read absence, and AT content. All six plots can be accessed for zooming via the supplemental data section. Panel C displays the read count (green, secondary ordinate) and read quality (red, primary ordinate) as a function of read length. Note that read counts differ within the three datasets, resulting in different scales on the secondary ordinate. For the two 454-Titanium sets (C. jejuni and E. coli K12), an assembly was provided which allows a mapping of contigs to the reference genome. These marks are shown in gray in the perimeter of these plots. Red marks indicate contigs with two or more hits in the reference.



Genome homology: Comparing multiple Burkholderia species

A comparative study aimed at mapping, for example, pathogenic islands or gene losses among different bacterial genomes can benefit from the graphical representation provided by the BLASTatlas method. The genus Burkholderia covers a number of important animal and human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection in cystic fibrosis (CF) patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei, rarely gives rise to disease in humans [29,30]. Both B. thailandensis and B. mallei display large chromosomal deletions when compared to B. pseudomallei. However, the more scattered nature of the gene loss observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through the loss of larger regions [31]. These deletions are evident from the atlas shown in Figure 6, where the two chromosomes of Burkholderia pseudomallei 1710b are used as BLASTatlas reference in a comparison with 14 publicly available Burkholderia genomes (B. thailandensis plus all species having two or more strains sequenced, see supplemental data). In addition, it is evident that a strong preference for deletions exists in chromosome II. Ong and co-workers report that deletions in chromosome II account for 70% and 61% of the total gene loss in B. mallei and B. thailandensis, respectively.

Figure 6 | BLASTatlas of Burkholderia pseudomallei 1710b chromosomes I+II compared with 14 Burkholderia genomes. Shown from the outermost circles: B. ambifaria (2, purple), B. cenocepacia (4, red), B. thailandensis (1, green) 10774, B. mallei (4, green), and B. pseudomallei (3, blue). The innermost circles show percent AT and CG skew. Note that, to allow visual comparison between B. thailandensis and B. mallei, both species are colored green: the outermost green lane corresponds to the single B. thailandensis, whereas the remaining four green lanes are all B. mallei. GenBank accession numbers as well as interactive plots are available through the supplemental data section.

http://st<strong>and</strong>ards<strong>in</strong>genomics.org 211


GeneWiz browser<br />

The SIDD atlas: Annotation of regulatory elements

The browser application enables the user to append various annotation marks such as transcription start site arrows, gene labels, and boxes. A final example illustrates how these marks can be used to integrate known regulatory elements with DNA properties and gene annotations to draw a more complete picture of a promoter region. The regulatory elements of the E. coli K12 MG1655 rrn operons [32] have been annotated in a standard SIDD atlas, providing a visualization of the P1/P2 promoter structure (Figure 7). A zoom of the promoter region reveals a strong SIDD site near the predominant P1 promoter, approximately 40 bp upstream of the P1 transcription start site. The transcription factor FIS stimulates transcription at several promoters; for example, the binding of FIS at the leuV promoter [33] has been suggested to transmit the superhelical destabilization downstream to the point where the RNAP twists and opens the helix [34]. This model may also be valid for the rrnB P1 promoter, as the activities of leuV and rrnB P1 are comparable [35].

Figure 7 | A zoom upstream of the E. coli K12 MG1655 rrnB operon. The three outermost lanes show SIDD at three superhelix densities, sigma = -0.055, -0.045, and -0.035. The lower free energy required to melt the helix can be observed near the UP element of P1, for the SIDD lane at sigma = -0.045. The atlas is available for zooming via the supplemental data section.

Discussion<br />

Visualization of the multidimensional information that is represented by a single genome sequence remains complex. An indispensable property of a genome visualization tool is that it must be zoomable, so that information can be interpreted at varying scales. Two recently published methods, DNAPlotter [36] and Genome Projector [37], both enable the user to build circular plots of numerical data related to genes as well as graphs of numerical data pertaining to the nucleotides. These tools create static graphics and allow only proportional zooming, which makes the plot hard to interpret when zooming in too far. Both tools allow visualization of individual genomes, but do not allow easy comparison across multiple genomes. With new genome sequences becoming available so readily, it is essential to be able to quickly compare other genomes to a reference.

A number of other tools approach genome visualization from different angles: GenomeDiagram [38] and Circos [39] are command line programs generating publication quality static images and vector graphics. Although these tools allow comparison of other genomes, are flexible, and allow visualization of numerical data, they lack an interactive layer.

The GeneWiz browser described here uses disproportional zooming to overcome these limitations. From a technical perspective, the choice of programming language for writing graphical browsers is of importance. There are obvious advantages to providing platform-independent Java software like that of the GeneWiz browser, but this often comes at the cost of performance. Nevertheless, our tool demonstrates the usefulness of a genome browser that relies on interactive, true disproportional zooming to visualize annotated genes and features as well as numerical data provided at single nucleotide resolution. By building a comprehensive tool that is both scalable and flexible, we have shown how different types of genomic data can be integrated into a single, easily navigated graphic that can be annotated further by the user.

Author contributions

P.F.H. wrote the paper and composed the web interfaces, as well as most parts of the server back end. H.H.S. wrote the C code of the data binning and retrieval software and contributed to the Java Applet; E.R. wrote the majority of the Java Applet code and formulated the XML configurations. T.T.B. provided source data and analysis of C. jejuni and E. coli sequencing reads, and C.J.B. assisted in writing the paper (paragraphs on SIDD energy). D.W.U. assisted in writing the paper, supervised the project and provided ideas for figures and analysis. All authors have read and made corrections to the manuscript.

Acknowledgements

This work is funded in part by grants from the Danish Center for Scientific Computing, NSF Research Grant DBI-0416764, The Danish Research Council grant 26-06-0349, and the EU EMBRACE network of Excellence, contract number LSHG-CT-2004-512092. We thank Mark Driscoll and Marcel Margulies from 454 Life Sciences for providing the data for C. jejuni and E. coli and Julian Parkhill at the Sanger Institute for providing the S. typhi sequencing data. We also thank Dr. Trudy Wassenaar and Dr. Lars Juhl Jensen for making suggestions to the manuscript.

References

1. Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci USA 2007; 104:13913-13918. PubMed doi:10.1073/pnas.0702636104

2. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C et al. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002; 319:1257-1265. PubMed doi:10.1016/S0022-2836(02)00379-0

3. Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform 2006; 7:225. PubMed doi:10.1093/bib/bbl004

4. Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol 1999; 150:773-777. PubMed doi:10.1016/S0923-2508(99)00116-3

5. Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol 2000; 299:907-930. PubMed doi:10.1006/jmbi.2000.3787

6. Hall N. Advanced sequencing technologies and their wider impact in microbiology. J Exp Biol 2007; 210:1518-1525. PubMed doi:10.1242/jeb.001370

7. Holt RA, Jones SJ. The new paradigm of flow cell sequencing. Genome Res 2008; 18:839-846. PubMed doi:10.1101/gr.073262.107

8. Käller M, Lundeberg J, Ahmadian A. Arrayed identification of DNA signatures. Expert Rev Mol Diagn 2007; 7:65-76. PubMed doi:10.1586/14737159.7.1.65

9. Gupta PK. Single-molecule DNA sequencing technologies for future genomics research. Trends Biotechnol 2008; 26:602-611. PubMed doi:10.1016/j.tibtech.2008.07.003

10. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008; 26:1135-1145. PubMed doi:10.1038/nbt1486

11. Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP et al. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res 2008; 18:1638-1642. PubMed doi:10.1101/gr.077776.108

12. Lin F, Schröder H, Schmidt B. Solving the Bottleneck Problem in Bioinformatics Computing: An Architectural Perspective. J VLSI Signal Process 2007; 48:185-188. doi:10.1007/s11265-007-0088-z

13. Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 2008; 9:R55. PubMed doi:10.1186/gb-2008-9-3-r55

14. Tolstrup N, Rouzé P, Brunak S. A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res 1997; 25:3159-3163. PubMed doi:10.1093/nar/25.15.3159

15. Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas-a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 2008; 4:363-371. PubMed doi:10.1039/b717118h

16. Bolshoy A, McNamara P, Harrington RE, Trifonov EN. Curved DNA without A-A: experimental estimation of all 16 DNA wedge angles. Proc Natl Acad Sci USA 1991; 88:2312-2316. PubMed doi:10.1073/pnas.88.6.2312

17. Brukner I, Sánchez R, Suck D, Pongor S. Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J 1995; 14:1812-1818. PubMed

18. van Noort V, Worning P, Ussery DW, Rosche WA, Sinden RR. Strand misalignments lead to quasipalindrome correction. Trends Genet 2003; 19:365-369. PubMed doi:10.1016/S0168-9525(03)00136-7

19. Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin VB. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc Natl Acad Sci USA 1998; 95:11163-11168. PubMed doi:10.1073/pnas.95.19.11163

20. Ornstein RL, Rein R, Breen DL, MacElroy RD. An optimized potential function for the calculation of nucleic acid interaction energies. I- Base stacking. Biopolymers 1978; 17:2341-2360. doi:10.1002/bip.1978.360171005

21. Satchwell SC, Drew HR, Travers AA. Sequence periodicities in chicken nucleosome core DNA. J Mol Biol 1986; 191:659-675. PubMed doi:10.1016/0022-2836(86)90452-3

22. Ussery D, Soumpasis DM, Brunak S, Staerfeldt HH, Worning P, Krogh A. Bias of purine stretches in sequenced chromosomes. Comput Chem 2002; 26:531-541. PubMed doi:10.1016/S0097-8485(02)00013-X

23. Wang H, Benham CJ. Superhelical destabilization in regulatory regions of stress response genes. PLOS Comput Biol 2008; 4:e17. PubMed doi:10.1371/journal.pcbi.0040017

24. Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrison P, Gerstein M. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 2007; 35:D55-D60. PubMed doi:10.1093/nar/gkl851

25. Hallin PF, Ussery DW. CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 2004; 20:3682-3686. PubMed doi:10.1093/bioinformatics/bth423

26. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF et al. The complete genome sequence of Escherichia coli K-12. Science 1997; 277:1453-1462. PubMed doi:10.1126/science.277.5331.1453

27. Parkhill J, Wren BW, Mungall K, Ketley JM, Churcher C, Basham D, Chillingworth T, Davies RM, Feltwell T, Holroyd S et al. The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 2000; 403:665-668. PubMed doi:10.1038/35001088

28. Deng W, Liou SR, Plunkett G, Mayhew GF, Rose DJ, Burland V, Kodoyianni V, Schwartz DC, Blattner FR. Comparative genomics of Salmonella enterica serovar Typhi strains Ty2 and CT18. J Bacteriol 2003; 185:2330-2337. PubMed doi:10.1128/JB.185.7.2330-2337.2003

29. Brett PJ, DeShazer D, Woods DE. Burkholderia thailandensis sp. nov., a Burkholderia pseudomallei-like species. Int J Syst Bacteriol 1998; 48:317-320. PubMed

30. Smith MD, Angus BJ, Wuthiekanun V, White NJ. Arabinose assimilation defines a nonvirulent biotype of Burkholderia pseudomallei. Infect Immun 1997; 65:4319-4321. PubMed

31. Ong C, Ooi CH, Wang D, Chong H, Ng KC, Rodrigues F, Lee MA, Tan P. Patterns of large-scale genomic variation in virulent and avirulent Burkholderia species. Genome Res 2004; 14:2295-2307. PubMed doi:10.1101/gr.1608904

32. Hirvonen CA, Ross W, Wozniak CE, Marasco E, Anthony JR, Aiyar SE, Newburn VH, Gourse RL. Contributions of UP elements and the transcription factor FIS to expression from the seven rrn P1 promoters in Escherichia coli. J Bacteriol 2001; 183:6305-6314. PubMed doi:10.1128/JB.183.21.6305-6314.2001

33. Ross W, Salomon J, Holmes WM, Gourse RL. Activation of Escherichia coli leuV transcription by FIS. J Bacteriol 1999; 181:3864-3868. PubMed



34. Wang H, Noordewier M, Benham CJ. Stress-induced DNA duplex destabilization (SIDD) in the E. coli genome: SIDD sites are closely associated with promoters. Genome Res 2004; 14:1575-1584. PubMed doi:10.1101/gr.2080004

35. Bauer BF, Kar EG, Elford RM, Holmes WM. Sequence determinants for promoter strength in the leuV operon of Escherichia coli. Gene 1988; 63:123-134. PubMed doi:10.1016/0378-1119(88)90551-3

36. Carver T, Thomson N, Bleasby A, Berriman M, Parkhill J. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics 2009; 25:119-120. PubMed doi:10.1093/bioinformatics/btn578

37. Arakawa K, Tamaki S, Kono N, Kido N, Ikegami K, Ogawa R, Tomita M. Genome Projector: zoomable genome map with multiple views. BMC Bioinformatics 2009; 10:31. PubMed doi:10.1186/1471-2105-10-31

38. Pritchard L, White JA, Birch PR, Toth IK. GenomeDiagram: a python package for the visualization of large-scale genomic data. Bioinformatics 2006; 22:616-617. PubMed doi:10.1093/bioinformatics/btk021

39. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA. Circos: an information aesthetic for comparative genomics. Genome Res 2009; 19:1639-1645. PubMed doi:10.1101/gr.092759.109



Paper VII: GeneWiz browser: An Interactive Tool for Visualiz<strong>in</strong>g Sequenced Chromosomes<br />



Chapter 4<br />

Web Services <strong>and</strong> <strong>Interoperability</strong> <strong>in</strong> Genomics<br />


This chapter describes work done in connection with the EU project EMBRACE. The deliverables defined for CBS have included both outreach obligations and implementation tasks of providing tools and databases through Web Services. This author's contributions reflect this duality: there was a responsibility for developing the server infrastructure for hosting Web Services, while also teaching the use and design concepts of Web Services on several occasions (see appendix A.1). CBS is now using this work to integrate all major prediction servers under the same Web Services umbrella. There are currently 17 services offered using this technology (see footnote 1). The work on Web Services has laid the foundation for creating an online resource like BLASTatlas (paper I). Further, the RNAmmer tool (paper VI) is offered both as a traditional web interface and through Web Services, and these implementations demonstrate the usefulness of programmatic access to tools.

4.1 Introduction<br />

Over the past decade, the internet has undoubtedly revolutionized the way information is exchanged in modern society. Bank transactions, digital road maps and satellite images, email, news articles, and social networks are now hard to imagine without a digitally connected world. Biological and bioinformatic information is no exception, as it relies on the internet to transport sequence data, experimental results, scientific articles, etc. Both the amount and the complexity of biological information increase day by day. As new experimental techniques become available, new types of data, as well as new ways of combining them, are introduced. For decades, the exchange of biological information over the internet has been in the form of human-readable HTML documents (HyperText Markup Language) or flat files residing on FTP servers (File Transfer Protocol). When designed, HTML was intended to host static information presented by a server to a human being using a browser. Today, computers are required to digest huge amounts of information with less involvement of humans, and more advanced technologies are required. To successfully integrate the vast amounts of data provided by the life science community, interoperability remains a key issue. It may seem unrealistic to reach a point where every biologist and bioinformatician has the world's biological databases and tools accessible through programmatic access from their favorite programming language. However, with the current technologies in Web Services, an interoperable life science community may not be far away. When connected, the communities will be able to exchange not only data but also services such as tools for predicting protein function, performing sequence alignments, or gene finding.

1 BLASTatlas, EasyGene, EPipe, GeneWiz, GenomeAtlas, hERG, MaxAlign, NetChop, NetCTL, NetGlycate, NetNGlyc, NetOGlyc, NetPhos, RNAmmer, SIDDbase, SignalP, and TMHMM

Figure 4.1: Screen shot of NCBI Entrez Genome projects web page

4.2 Interoperability

The term 'interoperability' is defined as the ability of two or more systems to exchange and make use of information (IEEE, http://www.ieee.org). Whether systems can be said to be 'interoperable' depends on how one interprets 'make use of'. Consider the list of complete prokaryotic genome sequences maintained by NCBI at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, as shown in figure 4.1.

To automatically retrieve this list, one may write a parser to transform the HTML into computer-readable text. Apart from being overly sensitive to changes in the HTML document, such a parser will lack the knowledge behind the data, since the format is neither typed nor structured. It is only when interpreted by an internet browser and presented graphically to a human that this information makes any sense. In other words, both sender and receiver must have knowledge about the information that is exchanged before they can be said to be interoperable. There are two aspects of interoperability. First, there must be agreement on the format in which data is exchanged. Whether this is structured XML or any arbitrary format, the server must return the format expected by the client upon a request. Second, a description and understanding of the content of the data being exchanged is required when building client-side code and objects in Web Services. Without knowledge of the exact data types, the programming environment (e.g. C, Java, Perl) cannot declare the objects with proper variable types.
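The limitation of scraping HTML can be illustrated with a minimal sketch. The HTML fragment below is a hypothetical stand-in, not the actual NCBI markup, and the class name `TableScraper` is introduced here only for illustration; the accession numbers and genome sizes are those of the E. coli and C. jejuni reference genomes used in the preceding paper. The point is that every scraped value arrives as an untyped string, so the client has to guess which column holds an integer.

```python
from html.parser import HTMLParser

# Hypothetical fragment standing in for a genome listing page (not real NCBI markup).
HTML_PAGE = """
<table>
  <tr><th>Organism</th><th>Accession</th><th>Size (bp)</th></tr>
  <tr><td>Escherichia coli K-12</td><td>U00096</td><td>4,639,675</td></tr>
  <tr><td>Campylobacter jejuni NCTC 11168</td><td>AL111168</td><td>1,641,481</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


scraper = TableScraper()
scraper.feed(HTML_PAGE)
header, *records = scraper.rows

# Nothing in the HTML says that the third column is an integer written with
# thousands separators; that knowledge has to be hard-coded in the client.
sizes = [int(record[2].replace(",", "")) for record in records]
print(records[0])
print(sizes)
```

Any change to the table layout (an added column, a different number format) silently breaks such a client, which is the sensitivity described above.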


Listing 4.1: Abbreviated input to the queryGenomes operation of the Genome Atlas Database 3.0 web service. [XML listing not recoverable; surviving element values: AL111168; yes]

4.2.1 SOAP based Web Services<br />

SOAP (originally an acronym for Simple Object Access Protocol, an expansion dropped as of version 1.2) is to a large extent an agreed-upon standard describing a protocol for exchanging information in structured XML messages (eXtensible Markup Language). The protocol was recommended by the W3C (World Wide Web Consortium) in 2003 and describes the format of the messages exchanged between a client and a server, which in most cases are transported over HTTP. Listings 4.1 and 4.2 provide an example request and response from the CBS Genome Atlas Database 3.0 Web Service, using the operation queryGenomes to query the database for a GenBank accession number.

SOAP messages are XML structures consisting of a SOAP envelope, which in turn consists of a header (not included here) and a body. A special envelope style called 'wrapped' is used for the CBS services, meaning that the content of both request and response is wrapped by an element named according to the operation issued (here queryGenomes). This enables the server to easily dispatch the message to the proper internal code. The SOAP protocol forms the basic language for exchanging messages over HTTP, but it neither describes the structure of the messages exchanged by a given resource nor explains its functionality. The WSDL (Web Services Description Language) file closes this gap by defining the information that enables a user or computer to communicate with the resource. The WSDL declares all the operations supported by a resource and the composition of the XML structures allowed by the operations. Finally, the WSDL defines the endpoint URL to which the request SOAP message is submitted. The essential data of the WSDL are the descriptions of the XML structure, formulated in the XSD language (XML Schema Definition). The schema for the request of the queryGenomes operation can be seen in listing 4.3. Figure 4.2 shows a schematic drawing of a SOAP resource.
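The 'wrapped' style and the dispatching it enables can be sketched in a few lines. This is an illustration under stated assumptions, not the CBS implementation: the service namespace `urn:example:genome-atlas`, the child element name `accession`, and the handler function are all hypothetical, and the SOAP 1.1 envelope namespace is assumed because the text does not state which SOAP version the services use. Only the operation name queryGenomes and the accession AL111168 come from the text.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace (assumed version)
SERVICE_NS = "urn:example:genome-atlas"                 # hypothetical service namespace


def qname(namespace, local):
    """Build an ElementTree-style qualified name: {namespace}local."""
    return f"{{{namespace}}}{local}"


def build_request(operation, params):
    """Wrap the parameters in an element named after the operation."""
    envelope = ET.Element(qname(SOAP_ENV, "Envelope"))
    body = ET.SubElement(envelope, qname(SOAP_ENV, "Body"))
    wrapper = ET.SubElement(body, qname(SERVICE_NS, operation))
    for name, value in params.items():
        ET.SubElement(wrapper, qname(SERVICE_NS, name)).text = value
    return ET.tostring(envelope, encoding="unicode")


def local_name(element):
    """Strip the {namespace} prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def query_genomes(params):
    # Hypothetical handler; a real server would query the database here.
    return f"looking up accession {params['accession']}"


HANDLERS = {"queryGenomes": query_genomes}


def dispatch(message):
    """Server side: the name of the wrapper element selects the handler."""
    body = ET.fromstring(message).find(qname(SOAP_ENV, "Body"))
    wrapper = body[0]
    params = {local_name(child): child.text for child in wrapper}
    return HANDLERS[local_name(wrapper)](params)


request = build_request("queryGenomes", {"accession": "AL111168"})
print(dispatch(request))
```

Because the first child of the body carries the operation name, the server needs no further inspection of the payload to route the message.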

4.3 EMBRACE: An EU <strong>in</strong>itiative for enhance <strong>in</strong>teroperability<br />

EMBRACE Network of Excellence is a project funded by the European Commission under<br />

the sixth framework programme (FP6). The <strong>in</strong>tention of the EMBRACE projects was<br />

partly to <strong>in</strong>tegrate the major <strong>tools</strong> <strong>and</strong> databases with<strong>in</strong> the life science communities. A<br />

technology recommendation workgroup with<strong>in</strong> EMBRACE has <strong>in</strong>vestigated which current<br />

technologies could form the basis of the <strong>in</strong>tegration <strong>and</strong> it has recommended SOAP based<br />

147


EMBRACE: An EU <strong>in</strong>itiative for enhance <strong>in</strong>teroperability<br />

Listing 4.2: Abbreviated output from the queryGenomes operation of the Genome Atlas Database 3.0 web service
[The XML markup of the listing was lost; the surviving element values are: Bacteria; Epsilonproteobacteria; 8; Campylobacter jejuni subsp. jejuni NCTC 11168; AL111168; NC_002163; Chromosome]


Listing 4.3: XSD entry of the queryGenomes request message
[Listing content not recoverable.]

[Figure 4.2 labels: client user / computer; SOAP client; SOAP request and response; HTTP server with endpoint, WSDL and schemas; WSDL and schema files downloaded by client in XML]
Figure 4.2: Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the
same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose
the outgoing request and parse the incoming server response.


Web Services described by WSDL files, where data structures are typed using the XSD
format.

4.3.1 Quasi - a light-weight SOAP server

One of the main obstacles for many SOAP servers and clients is the computational overhead
and memory consumption involved in parsing large and complex XML structures.
For the BLASTatlas service, this was a limiting factor. With a conventional package,
SOAP::Lite, the submit process required more memory than is available in
a modern desktop computer and took around 20 minutes just to prepare the message
before submission. Once submitted, the server required the same overhead to parse the incoming
XML. The XML::Compile package for Perl proved superior as a client framework.
On the server side, however, there was a demand for speed, flexibility and custom adjustment,
which led to the development of a light-weight SOAP server called ’quasi’ (’QUite
A Soap Implementation’ or ’QUAsi Soap Implementation’). Apart from speed, it has
further advantages:

• The server can be launched both remotely and locally. The latter allows quick and
easy testing of services by reading the SOAP message from STDIN.
• The XML parsing method (e.g. XML::Simple or XML::Twig) may be chosen independently
for each operation, and parsing may even be postponed until after the job has been placed in the
queue and the job id has been returned. This is an advantage for very big messages.
• Control over the code stack enables much faster implementation of custom functionality.
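The idea of postponing XML parsing until after the job id has been returned can be sketched as follows. This is a minimal Python illustration, not the quasi code itself: the function names are hypothetical, an in-memory queue stands in for the real job queue, and `xml.etree.ElementTree` stands in for the Perl parsers named above.

```python
import queue
import uuid
import xml.etree.ElementTree as ET

jobs = queue.Queue()  # pending (job_id, raw_message) pairs


def submit(raw_message: bytes) -> str:
    """Queue the unparsed message and return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, raw_message))
    return job_id


def process_next() -> tuple:
    """Worker step: the XML is parsed only now, after the client has its id."""
    job_id, raw_message = jobs.get()
    root = ET.fromstring(raw_message)
    return job_id, root.tag


job = submit(b"<request><sequence>ACGT</sequence></request>")
done_id, root_tag = process_next()
```

The client waits only for the cheap `submit` step; the expensive parse happens in the worker, which is where the benefit for very large messages comes from.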

4.3.2 quasi mktemp - From template to Web Service

To make implementation easier still, a template creator was written which
generates an example Web Service from a standard CBS template. The user provides the
name and version of the service, and the tool prepares an entire installation of the service
on the servers. The template creator does the following:

• WSDL and XSD files are created automatically for the name and version of the service
and placed in the proper location of the file system
• An example directory is created with a working Perl example that uses the service
• Built-in templates are provided for both synchronous and asynchronous access
• The proper entry is created in the central services database table
• Once the template creator has run, a web page is available describing the
service and providing links to the WSDL and XSD files as well as the WSDL-embedded
documentation

When designing Web Services, it is not a trivial task to keep track of namespaces,
declarations of input/output objects, operation names etc. The feedback received so far
for this tool indicates that functioning examples clearly reduce the chance of mistakes. The
manual for the software is found in appendix D.6.
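How a template creator can turn a service name and version into consistently named files is sketched below in Python. Everything here is an assumption for illustration: the directory layout, the file naming, the example.org target namespace and the heavily shortened WSDL skeleton are hypothetical and do not reproduce the CBS template; only the WSDL 1.1 namespace is standard.

```python
from string import Template

# Hypothetical, heavily shortened WSDL skeleton; a real template would carry
# the full types, messages, port types and bindings.
WSDL_TEMPLATE = Template(
    '<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="$name" '
    'targetNamespace="http://example.org/ws/$name/$version">'
    '<!-- operations for $name $version go here -->'
    '</definitions>'
)


def instantiate(name: str, version: str) -> dict:
    """Return a mapping of relative file paths to generated content."""
    base = f"{name}/{name}_{version.replace('.', '_')}"
    return {
        f"{base}.wsdl": WSDL_TEMPLATE.substitute(name=name, version=version),
        f"{base}.xsd": f"<!-- schema stub for {name} {version} -->",
    }


files = instantiate("MyService", "1.0")
```

Deriving every name and namespace from the same two inputs is what removes the opportunity for the mismatches described above.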


4.4 ENCODE pipeline: applying Web Services

ENCODE (the Encyclopedia Of DNA Elements) was launched in September 2003 by
the National Human Genome Research Institute. The goal was to identify all functional
elements in the human genome sequence. In the pilot phase, 1 percent (30 Mb) of the genome, from
44 selected regions, was analysed by ENCODE consortium
researchers (Birney et al., 2007).

GENCODE is a sub-project of ENCODE, which seeks to identify all protein-coding
genes in the ENCODE selected regions. For each protein-coding gene this means the
delineation of a complete mRNA sequence for at least one splice isoform, and often for
a number of additional alternative splice forms. The contributions from the BioSapiens
partners are focused on information from a protein annotation perspective. Special attention
is given to alternative splicing and the putative effect it has
on functional diversification of genes.

In the pilot phase of the BioSapiens project the properties of the coding sequences
for the 44 regions were analyzed by the BioSapiens partners separately. The results
from the individual groups were collected and the main findings were published (Tress et al., 2007).
Furthermore, the entire collection of annotations created by all partners was made available
as supplementary material for the publication.

In the current phase of the BioSapiens project the goal is to scale up the
annotation approach applied to the pilot ENCODE sequences to cover 100% of the human
genome, including all the isoforms. For the scale-up, the ENCODE Pipeline (EPipe)
was constructed (this BioSapiens deliverable), which is a WWW service that allows researchers
to compare functional annotations for all splice variants of a given gene in an
automatic way, or alternatively to use it for analysis of mutated sequence variants containing
SNPs. The author of this thesis has been responsible for the development
of the main parts of the EPipe software as well as for implementing a large part of the
modules (feature predictors). The EPipe project is an ongoing effort which has involved
a number of people during its development.

4.4.1 Collecting Web Services clients in EPipe

EPipe uses a number of local and remote resources for protein feature prediction. The
ability of EPipe to connect to remote resources via Web Services is incorporated within
the individual modules. This gives a great deal of flexibility as to which resources to support
(e.g. BioMoby, SOAP etc). The pipeline is shown in figure 4.3.

EPipe itself is offered both as a SOAP web service (http://www.cbs.dtu.dk/ws/EPipe)
and a traditional web interface (http://www.cbs.dtu.dk/services/EPipe). A
schematic overview of the workflow in EPipe is shown in figure 4.4.

4.4.2 Mapping Pfam annotations to protein structure: mecA

In Staphylococcus aureus the mecA gene encodes a penicillin-binding protein (PBP2a),
resulting in methicillin resistance (Ender et al., 2009). The EPipe software can be used to
map a range of relevant features onto the protein structure, in order to visualize
differences between homologs of this protein. In this example, however, a single MecR1
protein from Staphylococcus aureus strain A5937, GenBank accession no. EEV85461, is
processed. Figure 4.5 shows the structure browser of EPipe, which allows the user to
browse the different predicted features by showing their mapping onto the protein
structure. Here, the three Pfam domains Transpeptidase, MecA_N, and PBP_dimer appear
as significant hits.
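Mapping a positional feature, such as a Pfam domain, from one sequence onto an aligned homolog or structure sequence comes down to translating residue positions into alignment columns. The Python sketch below shows one way to do this, assuming a gapped alignment row with '-' as the gap character and 1-based positions. The function names and the toy sequence are hypothetical and are not taken from EPipe.

```python
def residue_to_column(aligned_seq: str, gap: str = "-") -> dict:
    """Map 1-based residue positions to 1-based alignment columns."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != gap:
            position += 1
            mapping[position] = column
    return mapping


def remap_feature(start: int, end: int, aligned_seq: str) -> tuple:
    """Translate a feature's residue range into alignment columns."""
    columns = residue_to_column(aligned_seq)
    return columns[start], columns[end]


# Toy example: residues 2-4 of "MKTAY" as they appear in the row "M-KT-AY"
feature_columns = remap_feature(2, 4, "M-KT-AY")
```

Once all sequences share alignment coordinates, a feature predicted on one sequence can be compared with, or drawn on, any other row of the alignment.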


[Figure 4.3 labels: input sequences; BLAST against PDB individually; alignment; module I; module II; module III; module IV; alignment dependent module X; cache filters; positional features; non-positional features; map feature coordinates to alignment; map features onto best structure; XML of all results; render images in parallel and present to output pages; table of non-positional features; conclusion table; plot alignment and positions having different feature configuration; plot alignment and features with remapped coordinates; similarity in feature space]
Figure 4.3: Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that
as much as possible is dispatched in parallel. Modules may either be alignment dependent or not.
If the alignment is required to predict the protein features, the module is not launched until the
alignment algorithm has finished. Modules may either return global features of the entire protein
(e.g. cellular localization), or return positional features (e.g. phosphorylation sites).
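The dispatch rule in the caption, where independent modules start immediately and alignment-dependent modules wait for the alignment, can be sketched with Python's `concurrent.futures`. This is an illustration of the scheduling idea only: the three functions are trivial stand-ins with hypothetical names, and the sketch says nothing about how EPipe itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sequences):
    """Stand-in for the alignment step (hypothetical)."""
    return [s.upper() for s in sequences]


def independent_module(sequences):
    """Stand-in for a module that needs only the raw sequences."""
    return [len(s) for s in sequences]


def dependent_module(alignment):
    """Stand-in for a module that needs the finished alignment."""
    return len(alignment)


def run_pipeline(sequences):
    with ThreadPoolExecutor() as pool:
        # The alignment and the alignment-independent module start at once.
        alignment_future = pool.submit(align, sequences)
        independent_future = pool.submit(independent_module, sequences)
        # The alignment-dependent module is launched only after the alignment.
        alignment = alignment_future.result()
        dependent_future = pool.submit(dependent_module, alignment)
        return independent_future.result(), dependent_future.result()


results = run_pipeline(["mkt", "mkta"])
```

Blocking on the alignment result before submitting the dependent module is the simplest way to express the dependency; the independent module keeps running in the meantime.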


Figure 4.4: The input web page of EPipe: the upper part defines sequence upload and alignment
method, and the lower part selects which modules / methods to run. When applicable, gene ontologies
have been added to each feature and feature value (light green boxes).


Figure 4.5: The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim
& Strynadka, 2002). The top panel shows the EPipe structure browser, which allows rotation
in steps of 90 degrees. The lower panel shows a post-processing of the PyMOL script generated by EPipe.



Chapter 5

Conclusion and perspectives

This thesis has presented a number of comparative genomics tools that have been used
throughout different research projects and peer-reviewed publications. The aim has been to
provide methods that enable the scientist to keep up with the increasing speed at which
genome sequences are published. Visualization plays a key role, and finding better ways
to present sequence information in a condensed and intuitive way is essential for deriving
knowledge from the large number of bacterial strains being sequenced.

Information content has previously been used to quantify conservation of DNA motifs,
and a recent extension of this information framework has made it possible to model complete
promoters such as the P1/P2 system described in this work. The models shown here
are to a large extent specific to E. coli P1/P2 sites. However, the design of the
matrix and spacing configuration format of the iscan tool allows a much broader
application. The tool may be used to test different hypotheses about promoter configurations
across a broader range of organisms by expressing promoter conservation as a single
comparable measure. Efforts are still needed to implement benchmarking and to
examine other promoter systems.
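As a reminder of what information content measures, the Python sketch below computes the per-position information content of a set of aligned DNA sites using the usual Shannon-based definition: 2 bits minus the entropy of the observed base frequencies at that position. It is an illustration only. It omits small-sample corrections and background base composition, the example sites are made up, and it does not reproduce the matrix and spacing model of the iscan tool.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information content (bits) of aligned DNA sites."""
    length = len(sites[0])
    bits = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(2.0 - entropy)  # 2 bits is the maximum for four bases
    return bits


# Made-up example sites, not taken from the P1/P2 data set
profile = information_content(["TATAAT", "TATGAT", "TACAAT", "TATACT"])
```

A fully conserved position scores 2 bits and a position with uniformly distributed bases scores 0, so summing over positions gives a single number by which sites or whole promoter models can be compared.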

Since the start of the Human Genome Project (HGP) in 1990 there have been large
investments to develop and improve sequencing technology. The present stage, where a
bacterial genome can be sequenced for a few thousand dollars within a few hours, is a result
of years of competition and investments in genome projects. There are no signs that
advances in sequencing technology will stop here. Sequencing single DNA
molecules in real time has long been an ultimate goal within genomics and DNA sequencing.
It has been demonstrated how a DNA synthesis reaction can be monitored in real time by
immobilizing a DNA polymerase within a small (20 zeptoliter) well (Eid et al., 2009). If the
technology reaches a final product, it may well start a new era in comparative genomics.
Once it is possible to obtain a genome sequence at the same rate as DNA replication
itself, and at superior read lengths, sophisticated software must be implemented for the
downstream processing. The technology could boost the quality of metagenomic
sequencing and solve the current issues with proper assembly of these data sets.

The BLASTatlas tool presented in this thesis incorporates a number of software packages to
calculate different DNA properties as well as scripts for mapping sequence alignments to a
reference genome. The number of dependencies makes it difficult to package the software
and install it on other computer systems. For sharing such complex tools
among scientists, Web Services play an important role, and it has been demonstrated how
analysis and visualization methods can be offered using this technology. At first glance the
traditional web interfaces seem more user-friendly. However, implementing interoperable
methods like BLASTatlas forces a process in which the communication
is formalized and defined in every detail. This allows direct integration into the user’s
programming environment, which scales significantly better. Making one or two comparisons
using a web interface will in most cases be faster than using the Web Services counterpart.
The true advantages are achieved when analyses are repeated, possibly hundreds of
times, and when linking input/output between different remote resources. Integration of
biological data using SOAP based Web Services is gaining acceptance. When the technology
has matured it will undoubtedly enhance the way biological information is exploited
by allowing seamless flow between, for example, public sequence databases, repositories of
experimental data and bioinformatic prediction servers.



Appendix A

Appendix: Workshops, teaching, and conferences

A.1 Lectures and Presentations

A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences

Taught autumn 2008 by Prof. David Ussery, this course featured weekly computer exercises
throughout the semester and projects requiring computer work. I planned and supervised
the exercises and assisted the students with their project work. See also:
http://www.cbs.dtu.dk/dtucourse/genomics27101.php

A.1.2 Comparative Microbial Genomics Workshop

Held June 2nd - 6th 2008, Bangkok, Thailand. I assisted in planning the workshop,
lectured on rRNA operon structure, web services, and genome visualization methods, and
was responsible for computer exercises. Web page:
http://www.cbs.dtu.dk/courses/thaiworkshop08/programme.php

A.1.3 Comparative Microbial Genomics and Taxonomy

Held August 14th - 18th 2006, Petropolis, Brazil. I assisted in planning the workshop
and was responsible for computer exercises. See also:
http://www.cbs.dtu.dk/courses/brazilworkshop/programme.php

A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services

Work package D5.2.X2. Held February 6th - 8th 2008, CBS. Responsible for computer exercises
and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2008-02-06/

A.1.5 EMBRACE Workshop on Bioinformatics of Immunology

Work package D5.2.6. Held January 24th - 26th 2007, CBS. Responsible for computer exercises
and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2007-01-24/

A.1.6 EMBRACE 3rd AGM: Implementation of web services

Presentation held April 23rd 2007 at CNRS Institute of Biology and Chemistry of Proteins
in Lyon, France.


A.1.7 EMBRACE Workshop on Perl, SQL and Web Services

Scheduled for November 16th - 20th 2009. See also:
http://www.cbs.dtu.dk/courses/embrace/2009-11-16/

A.2 Workshops and meetings

A.2.1 EMBRACE Workshop: SOAP web services

April 2006, Bergen, Norway

A.2.2 EUCOMM Bioinformatics Training Course

February 2007, Hinxton, United Kingdom

A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences

March 2007, Uppsala, Sweden

A.2.4 EMBRACE 3rd Annual General Meeting

April 2007, Lyon, France

A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation

May 2007, Geneva, Switzerland

A.2.6 EMBRACE 4th Annual General Meeting

April 2008, Heidelberg, Germany

A.2.7 Technical discussion of EMBRACE registry

June 2008, Amsterdam, Holland

A.2.8 EMBRACE meeting: Discussion of standard data types

January 2009, Bergen, Norway

A.3 Conferences

A.3.1 Conference: Metagenomics, July 2007, San Diego U.S.A.

Binnewies TT, Hallin PF, Sellami N, Ussery DW. Prediction of Pathogenicity Networks in
Bacterial Genomes

A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington U.S.A.

Poster: Hallin PF and Binnewies TT. Gene organization of RNA genes and secretion
system components of the Sargasso Sea environmental samples



Appendix B

Appendix: Ph.D. study plan



Technical University of Denmark, AFI, PhD programme
September 2005

The study plan below has been accepted by the student and the supervisor

Principal supervisor's signature / extension no. / student's signature

PhD study plan

Name of PhD student: Peter Fischer Hallin

Cpr.-nr.: 160877 2053<br />

PhD programme: Bioinformatics
Department: BioCentrum
Start date: March 1 2006
End date: February 2009
Principal supervisor: Associate professor David W. Ussery
(Title, name, department, phone)
BioCentrum-DTU, Technical University of Denmark,
Building 301, DK-2800 Lyngby, Denmark

E-mail address: dave@cbs.dtu.dk<br />

Phone (direct): (+45) 45 25 24 88<br />

Co-supervisor: Guest Researcher Gertrude Maria Wassenaar
(Title, name, institution/company)
BioCentrum-DTU, Technical University of Denmark,
Building 301, DK-2800 Lyngby, Denmark

E-mail address: trudy@cbs.dtu.dk<br />

Phone (direct): (+45) 45 25 24 77<br />

Date: 18-11-2007

Title of the study: DNA Structural Analysis and Transcript Prediction in Prokaryotic
genomes


Main subject of the study:

The goal of this project is to obtain a better understanding of the structural
mechanisms that are involved in the initiation of transcription of DNA in
prokaryotic genomes and to use this information to make better and more consistent
transcript predictions. We have presented a database (Hallin and Ussery 2004)
which holds several kinds of information for each of the over 300 fully
sequenced prokaryotic genomes that are currently available. Different research
groups have made efforts to gather sequence data and analyses of the fully
sequenced microbial genomes that are being published.

Currently we rely on the authors' annotation of genome sequences when
comparative genomics is applied to our data sets. However, different authors
use different tools, approaches and criteria during the annotation process. There
are examples of genomes that are predicted to be 50-100% over-annotated
(Skovgaard et al. 2001). Once reliable and automated processes for predicting
transcriptomes are established, comparative analysis can be applied to the entire
collection of organisms. It is envisioned that the users of our website will
be able to interactively browse any piece of DNA to look for structural properties
and repeats.

_________________
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On the
total number of genes and their length distribution in complete
microbial genomes. Trends Genet. 17:425-8.

Hallin PF, Ussery DW (2004) CBS Genome Atlas
Database: A dynamic storage for bioinformatic results and
sequence data. Bioinformatics 20:3682-3686.

(Describe here the content of the scientific part of the project as well as its goals and means. If the description exceeds one A4 page,
give a short overview here with reference to the full description, which is attached as an appendix.)

The external research stay

Professor Craig John Benham, University of California, Davis.
Benham's research focuses on mathematical modelling of DNA
destabilization and prediction of the opening of the DNA molecule
during a transcription event. His strong mathematical approach is
novel and would contribute significantly to our prediction methods
and could possibly help explain biological / experimental
results. The idea is that Craig Benham's calculations will be
integrated into the prediction algorithms that are a major topic of my
project.

A 12-week internship in Craig Benham's lab is scheduled for October-December
to integrate SIDD predictions (Stress Induced DNA
Duplex Destabilization) with CBS databases and to prepare 1-2
manuscripts on SIDD measures on a global prokaryotic scale.


(List here the research environments outside DTU where the PhD student is planned to stay. If
concrete agreements have been made, state this. For each stay, give the estimated time (e.g. in weeks), and
state the total time spent on external stays.)

Course part:
Courses at DTU
External courses
Courses credit-transferred in connection with enrolment:

Biological Sequence Analysis, PhD, 12 ECTS [OK]
27802 Metabolic Engineering and Systems Biology, PhD, 5 ECTS, F1A
27725 Globale regulatoriske netværk i mikroorganismer (Global regulatory networks in microorganisms), MSc, 5 ECTS, F2B
27617 Protein structure and computational biology, MSc, 5 ECTS, F5A
27041 Introduction to Systems Biology, MSc, 5 ECTS, E3A

(For courses not found in the course catalogue, a description of the academic content must be attached. List here
the expected course and educational activities of the study. For each part, give the estimated number of ECTS points, which
in total should correspond to approx. 30 ECTS points. 30 ECTS points correspond to approx. 840 hours of work.)

Dissemination part (incl. compulsory work):

I have spent a total of about a month's time preparing and assisting
in computer exercises for the CBS course Comparative Microbial
Genomics and Taxonomy (Petropolis, Brazil, Aug. 2006,
http://www.cbs.dtu.dk/courses/brazilworkshop) and in preparing
and giving talks at several meetings.

Exercises in the course ”Biological Sequence Analysis” (CBS-DTU)
1 hr presentation: Modern computer tools for the biosciences (Uppsala, Sweden)
Presentation: EMBRACE workshop on Bioinformatics of Immunology (CBS-DTU)
Presentation: Web Services implementation at CBS, Third Annual General Meeting of
EMBRACE (Lyon, France)

I plan to put in an additional month of work giving and
preparing presentations and lectures for a one-week workshop to be
held at CBS in February 2008:
http://www.cbs.dtu.dk/courses/embrace/2008-02-06/programme.php.
Lectures and exercises will be adjusted to cover
promoter analysis using the EMBRACE technology. We intend to
use graphical as well as statistical approaches to characterize
promoter signatures of prokaryotic genomes. These are core topics
of the thesis.

Poster presentation at Metagenomics 2007, San Diego: “Gene<br />

organization of RNA genes <strong>and</strong> secretion system components of<br />

the Sargasso Sea environmental samples”<br />

(State here both the expected dissemination activities of the study and the imposed compulsory work. For each part, give the estimated time spent (e.g. in weeks), which in total must correspond to 3 months).<br />

Time schedule:<br />

1st half year (March 06 – August 06)<br />
Publication on rRNA gene predictor (RNAmmer). Comparative Microbial Genomics workshop in Brazil. Meetings and work for CBS in connection with EMBRACE.<br />

2nd half year (September 06 – Feb 07)<br />
Lactococcus microarray project with Chr. Hansen. Book chapter on Comparative Genomics, editor Dawn Field. EMBRACE meetings and workshops.<br />

3rd half year (March 07 – August 07)<br />
Follow-up article on RNAmmer and rRNA/tRNA operons.<br />

4th half year (September 07 – Feb 08)<br />
(Oct-Dec) Internship with Craig Benham, Davis, California.<br />
Include work from Craig Benham's lab in the RNAmmer follow-up manuscript, and prepare a SIDDbase application note and an article on SIDD measures in prokaryotic promoter sequences.<br />

Prepare manuscripts<br />

5th half year (March 08 – August 08)<br />
Course: Global regulatory networks in microorganisms (F2B)<br />
Course: Protein structure and computational biology (F5A)<br />
Course: 1 week May/June: 27802 Metabolic Engineering and Systems Biology<br />
Thesis writing and preparation of manuscripts<br />

6th half year (September 08 – Feb 09)<br />
Course: Introduction to Systems Biology<br />
Thesis writing<br />

(The time schedule should contain dates/periods for all significant activities in connection with the PhD programme. It is important that the time schedule is complete. It can be attached as an appendix).<br />

Brief description of the form of supervision:<br />

Among other things, it can be agreed how often supervision takes place in the form of meetings or by written feedback<br />


Patents/innovation: Is it likely that technologies or software that can be patented will be developed during the project?<br />

If yes<br />

Yes x No<br />

Brief account of the methods used to train the PhD student in the innovation-related aspects<br />

Other:<br />

(Other matters of significance for the assessment of the study plan may be stated here).<br />



Appendix C<br />

Appendix: Courses<br />

C.1 Global regulatory networks in microorganisms<br />

DTU course 27725, ECTS 5, M.Sc. level.<br />

C.2 Protein Structure and Computational Biology<br />

DTU course 27617, ECTS 5, M.Sc. level.<br />

C.3 Biological Sequence Analysis<br />

DTU course 27803, ECTS 12.5, PhD level.<br />

C.4 Comparative Genome Analysis<br />

Copenhagen University, Department of Biology, ECTS 5.<br />

C.5 Doctoral seminar on business economics for academic entrepreneurs<br />

Aarhus School of Business, University of Aarhus, ECTS 3, PhD level.<br />

C.6 ECTS summary<br />

Total ECTS is 30.5, of which 15.5 at PhD level.<br />



Appendix D<br />

Appendix: Software<br />

D.1 fetchgbk manual<br />

SYNOPSIS<br />
fetchgbk - downloads GenBank/RefSeq records in GenBank format, specifying either accession numbers, accession ranges, or project ID.<br />
fetchgbk (-h) (-p [PROJECT_ID]) (-a [ACCESSION/RANGE]) (-d [DATABASE])<br />

DESCRIPTION<br />
When the project ID is defined using the -p option, option -a is ignored and the accession numbers for all segments of that project are fetched from the project. With the -p option, the -d option is in effect, allowing you to control which database to use (refseq/genbank).<br />
When using the -a option, the program retrieves only that accession (or range of accessions) and ignores the -d option.<br />
The program prints GenBank format data to stdout. Option -l is used to show only a TAB-separated list of accession and segment name.<br />

VERSION<br />
2008-08-15: version 1.0 created / pfh<br />

-p [number]<br />
The NCBI Genome Project number, like what can be found here: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. This option overrules the -a option.<br />

-a [accession no. or accession number range]<br />
When using this option, the program is instructed to download only this record (or these records, if a range is defined). The -d option is ignored.<br />

-d [genbank/refseq]<br />
Choice of database. Only has an effect when using option -p.<br />

-l<br />
Boolean, instructing the program not to show GenBank records, but only to list segment names for each accession.<br />

-h<br />
Shows this help page.<br />

EXAMPLES<br />
fetchgbk -p 19391 -d refseq | grep LOCUS<br />
fetchgbk -p 19391 -d genbank | grep LOCUS<br />
fetchgbk -a NZ_ABIZ00000000 | grep LOCUS<br />
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038 | grep LOCUS<br />
fetchgbk -a CP000896 | grep LOCUS<br />
fetchgbk -p 12997 -d refseq -l<br />

AUTHOR<br />
Peter Fischer Hallin, August 2008, pfh@cbs.dtu.dk<br />
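The manual does not say how an accession range given to -a is turned into individual records. The following Python sketch shows one plausible interpretation, assuming that both ends of the range share a prefix and differ only in a zero-padded numeric suffix. The function name `expand_accession_range` is hypothetical and is not part of fetchgbk.

```python
import re

# Splits an accession into a non-numeric prefix and a trailing run of digits.
_ACCESSION = re.compile(r"^(.*?)(\d+)$")


def expand_accession_range(spec):
    """Expand 'PREFIX0001-PREFIX0038' into individual accessions.

    Assumed semantics (not taken from the fetchgbk source): both ends share
    the same prefix, and the numeric suffix keeps its zero-padded width.
    A single accession without '-' is returned unchanged.
    """
    if "-" not in spec:
        return [spec]
    first, last = spec.split("-", 1)
    m_first, m_last = _ACCESSION.match(first), _ACCESSION.match(last)
    if not (m_first and m_last) or m_first.group(1) != m_last.group(1):
        raise ValueError(f"cannot interpret accession range: {spec!r}")
    prefix, width = m_first.group(1), len(m_first.group(2))
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    if stop < start:
        raise ValueError(f"range end precedes range start: {spec!r}")
    return [f"{prefix}{number:0{width}d}" for number in range(start, stop + 1)]


if __name__ == "__main__":
    for accession in expand_accession_range("NZ_ABIH01000001-NZ_ABIH01000038")[:3]:
        print(accession)
```

Under this assumption, the range used in the examples above would expand to 38 contig accessions, each of which could then be fetched separately.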



D.2 Sample output from queryGenomes<br />

As output from listing 2.3.<br />

#kingdom	phyla	pid	organism	genbank	refseq	segment	color	ATCONTENT	NGENES<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178379	NC_011312	Chromosome 1	ffdd44	0.6077	3069<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178380	NC_011313	Chromosome 2	ffdd44	0.6176	1105<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178382	NC_011314	Plasmid pVSAL320	ffdd44	0.6271	32<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178381	NC_011311	Plasmid pVSAL840	ffdd44	0.5993	72<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178383	NC_011315	Plasmid pVAL43	ffdd44	0.6193<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178384	NC_011316	Plasmid pVSAL43	ffdd44	0.6439	3<br />
Bacteria	Deltaproteobacteria	9637	Bdellovibrio bacteriovorus HD100	BX842601	NC_005363	Chromosome	ffdd44	0.4935	3583<br />
Bacteria	Gammaproteobacteria	28329	Cellvibrio japonicus Ueda107	CP000934	NC_010995	Chromosome	ffdd44	0.4801	3754<br />
Bacteria	Bacteroidetes/Chlorobi	12607	Chlorobium phaeovibrioides DSM 265	CP000607	NC_009337	Chromosome	ffbb55	0.4701	1753<br />
Bacteria	Deltaproteobacteria	29493	Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774	CP001358	NC_011883	Chromosome	ffdd44	0.4193	2356<br />
Bacteria	Deltaproteobacteria	329	Desulfovibrio desulfuricans subsp. desulfuricans str. G20	CP000112	NC_007519	Chromosome	ffdd44	0.4216	3775<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010906	NC_012795	Plasmid pDMC2	ffdd44	0.6283	10<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010904	NC_012796	Chromosome	ffdd44	0.3723	4629<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010905	NC_012797	Plasmid pDMC1	ffdd44	0.4197	65<br />
Bacteria	Deltaproteobacteria	29541	Desulfovibrio salexigens DSM 2638	CP001649	NC_012881	Chromosome	ffdd44	0.5291	3807<br />
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000528	NC_008741	Plasmid pDVUL01	ffdd44	0.3431	150<br />
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000527	NC_008751	Chromosome	ffdd44	0.3699	2941<br />
Bacteria	Deltaproteobacteria	27731	Desulfovibrio vulgaris str. Miyazaki F	CP001197	NC_011769	Chromosome	ffdd44	0.3289	3180<br />
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017285	NC_002937	Chromosome	ffdd44	0.3686	3379<br />
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017286	NC_005863	Megaplasmid	ffdd44	0.3432	152<br />
Bacteria	Other Bacteria	30733	Thermodesulfovibrio yellowstonii DSM 11347	CP001147	NC_011296	Chromosome	888888	0.6587	2033<br />
Bacteria	Gammaproteobacteria	29177	Thioalkalivibrio sp. HL-EbGR7	CP001339	NC_011901	Chromosome	ffdd44	0.3494	3283<br />
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001233	NC_012578	Chromosome I	ffdd44	0.5217	2650<br />
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001234	NC_012580	Chromosome II	ffdd44	0.5296	1043<br />
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001485	NC_012668	Chromosome 1	ffdd44	0.5248	2770<br />
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001486	NC_012667	Chromosome 2	ffdd44	0.5325	1004<br />
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003852	NC_002505	Chromosome I	ffdd44	0.523	2736<br />
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003853	NC_002506	Chromosome II	ffdd44	0.5309	1092<br />
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000626	NC_009456	Chromosome 1	ffdd44	0.5312	1133<br />
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000627	NC_009457	Chromosome 2	ffdd44	0.5222	2742<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000020	NC_006840	Chromosome I	ffdd44	0.6104	2575<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000021	NC_006841	Chromosome II	ffdd44	0.6298	1172<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000022	NC_006842	Plasmid pES100	ffdd44	0.6158	55<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001133	NC_011186	Chromosome II	ffdd44	0.6275	1254<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001134	NC_011185	Plasmid pMJ100	ffdd44	0.652	195<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001139	NC_011184	Chromosome I	ffdd44	0.6112	2590<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000791	NC_009777	Plasmid pVIBHAR	ffdd44	0.5621	120<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000789	NC_009783	Chromosome I	ffdd44	0.5445	3570<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000790	NC_009784	Chromosome II	ffdd44	0.5473	2374<br />
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000031	NC_004603	Chromosome I	ffdd44	0.5461	3080<br />
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000032	NC_004605	Chromosome II	ffdd44	0.5465	1752<br />
Bacteria	Gammaproteobacteria	32815	Vibrio splendidus LGP32	FM954973	NC_011744	Chromosome 2	ffdd44	0.5636	1486<br />
Bacteria	Gammaproteobacteria	32815	Vibrio splendidus LGP32	FM954972	NC_011753	Chromosome 1	ffdd44	0.5596	2950<br />
Bacteria	Gammaproteobacteria	349	Vibrio vulnificus CMCP6	AE016795	NC_004459	Chromosome I	ffdd44	0.5355	2973<br />
Bacteria	Gammaproteobacteria	349	Vibrio vulnificus CMCP6	AE016796	NC_004460	Chromosome II	ffdd44	0.5288	1565<br />
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	BA000037	NC_005139	Chromosome I	ffdd44	0.5359	3262<br />
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	BA000038	NC_005140	Chromosome II	ffdd44	0.5279	1697<br />
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	AP005352	NC_005128	Plasmid pYJ016	ffdd44	0.5507	69<br />
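Because the output is a TAB-separated table with a commented header line, it is easy to post-process. The Python sketch below is an illustration only: it assumes the ten columns named in the header, skips rows where NGENES is absent, and sums the gene counts of all segments of each organism. The function name `genes_per_organism` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Column names as given in the commented header line of the output.
COLUMNS = ["kingdom", "phyla", "pid", "organism", "genbank",
           "refseq", "segment", "color", "ATCONTENT", "NGENES"]


def genes_per_organism(text):
    """Sum NGENES over all segments (chromosomes, plasmids) of each organism."""
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        record = dict(zip(COLUMNS, row))
        ngenes = record.get("NGENES", "").strip()
        if ngenes:  # rows without a gene count are ignored
            totals[record["organism"]] += int(ngenes)
    return dict(totals)


# Two rows copied from the sample output, joined with TAB characters.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["#kingdom", "phyla", "pid", "organism", "genbank", "refseq",
     "segment", "color", "ATCONTENT", "NGENES"],
    ["Bacteria", "Gammaproteobacteria", "30703", "Aliivibrio salmonicida LFI1238",
     "FM178379", "NC_011312", "Chromosome 1", "ffdd44", "0.6077", "3069"],
    ["Bacteria", "Gammaproteobacteria", "30703", "Aliivibrio salmonicida LFI1238",
     "FM178380", "NC_011313", "Chromosome 2", "ffdd44", "0.6176", "1105"],
])

if __name__ == "__main__":
    print(genes_per_organism(SAMPLE))
```

Grouping by the `pid` column instead of the organism name would work equally well, since all segments of one genome project share the same project ID.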

D.3 BLASTatlas configurations<br />

D.3.1 file blast.cfg<br />

legend: B. ambifaria AMMD<br />
program: blastp<br />
color: 101010_020002<br />
range: 0.0,0.8<br />
source: files/13490.fsa<br />
<br />
legend: B. ambifaria MC40-6<br />
program: blastp<br />
color: 101010_020002<br />
range: 0.0,0.8<br />
source: files/17411.fsa<br />
<br />
legend: B. cenocepacia AU 1054<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/13919.fsa<br />
<br />
legend: B. cenocepacia HI2424<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/13918.fsa<br />
<br />
legend: B. cenocepacia J2315<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/339.fsa<br />
<br />
legend: B. cenocepacia MC0-3<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/17929.fsa<br />
<br />
legend: B. glumae BGR1<br />
program: blastp<br />
color: 101010_050505<br />
range: 0.0,0.8<br />
source: files/33901.fsa<br />
<br />
......<br />

D.3.2 file custom.cfg<br />

<br />
legend: SIDD @ -0.035<br />
color: 000010_101010<br />
range: 9:10<br />
boxfilter: 5000<br />
source: gunzip -c BX571966-57a2f2c2e11ca0dd8cd74493d667d4d6-3173005.sidd--0.035-c-10-c.out.gz | cut -f 4 |<br />
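Both configuration files follow the same layout: one block per lane, each block a run of `key: value` lines, with blocks separated by blank lines. As a rough illustration of that layout (not the BLASTatlas parser itself, whose behaviour is not shown here), the Python sketch below reads such a file into a list of dictionaries. The function name `parse_atlas_cfg` is hypothetical, and splitting on the first colon only is an assumption made so that values such as `9:10` survive intact.

```python
def parse_atlas_cfg(text):
    """Parse blank-line separated blocks of 'key: value' lines.

    Returns one dictionary per block. Lines without a colon (for example
    the '......' elision marker) are ignored.
    """
    blocks, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            if current:          # a blank line closes the current block
                blocks.append(current)
                current = {}
            continue
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once: values may contain ':'
        current[key.strip()] = value.strip()
    if current:
        blocks.append(current)
    return blocks


EXAMPLE = """\
legend: B. ambifaria AMMD
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/13490.fsa

legend: SIDD @ -0.035
color: 000010_101010
range: 9:10
boxfilter: 5000
"""

if __name__ == "__main__":
    for block in parse_atlas_cfg(EXAMPLE):
        print(block["legend"], "->", block["range"])
```

Keeping each lane as a plain dictionary means that lane-specific keys such as `program` or `boxfilter` need no special handling in the reader.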

D.4 BLASTmatrix example<br />

This Perl script constructs an XML configuration file by looking up the Genome Atlas Database through MySQL. It queries for all Campylobacter strains currently available.<br />

#!/usr/bin/perl<br />
use strict;<br />
<br />
my $SACO_EXTRACT = "/usr/cbs/bio/bin/linux64/saco_extract";<br />
my %colors = (lari => '0,104,139', jejuni => '0,139,69', hominis => '66,66,111', fetus => '139,101,8', curvus => '140,23,23', concisus => '205,173,0');<br />
<br />
my $sources = ""; # holds the sources part of the configuration - replace into DATA section<br />
<br />
open ORGANISM, "mysql -N -B -e \"select pid,organism_name from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name\" |" or die $!;<br />
while (<ORGANISM>) {<br />
    chomp;<br />
    my ($pid, $organism_name) = split /\t/;<br />
    warn "$organism_name (pid $pid)\n";<br />
    my ($genus, $species, $strain) = ($1, $2, $3) if $organism_name =~ /(\S+) (\S+) (.*)/;<br />
    my $color = "100,100,100";<br />
    $color = $colors{$species} if defined $colors{$species};<br />
    $sources .= "<br />
<entry><br />
<source>./$pid.proteins.fsa</source><br />
<title>$genus $species</title><br />
<subtitle>$strain</subtitle><br />
<group>$species</group><br />
<color>$color</color><br />
</entry><br />
";<br />
    open PID, "> $pid.proteins.fsa" or die $!;<br />
    open ACCESSION, "mysql -N -B -e \"select genbank,segment_name from genomeatlas3_cur.genbank_complete_seq where pid = $pid and segment_name not like 'genome%'\" |";<br />
    while (<ACCESSION>) {<br />
        chomp;<br />
        my ($genbank, $segment_name) = split /\t/;<br />
        chomp $genbank;<br />
        warn "adding $segment_name (accession $genbank)\n";<br />
        my $gbk = "/home/databases/genomeatlasdb-3.0_cur/data/$genbank/$genbank.gbk";<br />
        open PROT, "$SACO_EXTRACT -I genbank -O fasta -t < $gbk 2> /dev/null |" or die $!;<br />
        while (<PROT>) {<br />
            print PID;<br />
        }<br />
        close PROT;<br />
    }<br />
    close ACCESSION;<br />
    close PID;<br />
}<br />
close ORGANISM;<br />
warn "dumping xml config on stdout ...\n";<br />
while (<DATA>) {<br />
    s//$sources/g;<br />
    print;<br />
}<br />
<br />
__DATA__<br />
<br />
<br />
Proteome comparison of Campylobacter species<br />
-<br />
<br />
<br />
auto<br />
auto<br />
<br />
0.9<br />
0.9<br />
0.9<br />
<br />
<br />
0.975<br />
0<br />
0<br />
<br />
<br />
<br />
auto<br />
auto<br />
<br />
0.9<br />
0.9<br />
0.9<br />
<br />
<br />
0<br />
0.975<br />
0<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
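The central step of the script is turning one organism name into an `<entry>` block: the name is split into genus, species and strain, and the species selects a lane colour with grey as fallback. The Python sketch below restates only that step, as an illustration. The function name `entry_xml` and the project ID used in the example are hypothetical, and XML escaping is an addition that the Perl listing does not perform.

```python
import re
from xml.sax.saxutils import escape

# Species-to-colour table copied from the Perl listing (RGB triplets).
COLORS = {
    "lari": "0,104,139", "jejuni": "0,139,69", "hominis": "66,66,111",
    "fetus": "139,101,8", "curvus": "140,23,23", "concisus": "205,173,0",
}
DEFAULT_COLOR = "100,100,100"


def entry_xml(pid, organism_name):
    """Build one <entry> block for the BLASTmatrix configuration."""
    match = re.match(r"(\S+) (\S+) (.*)", organism_name)
    if match is None:
        raise ValueError(f"expected 'Genus species strain': {organism_name!r}")
    genus, species, strain = match.groups()
    color = COLORS.get(species, DEFAULT_COLOR)
    return "\n".join([
        "<entry>",
        f"<source>./{pid}.proteins.fsa</source>",
        f"<title>{escape(genus)} {escape(species)}</title>",
        f"<subtitle>{escape(strain)}</subtitle>",
        f"<group>{escape(species)}</group>",
        f"<color>{color}</color>",
        "</entry>",
    ])


if __name__ == "__main__":
    # 12345 is a made-up project ID used only for illustration.
    print(entry_xml(12345, "Campylobacter jejuni RM1221"))
```

Grouping by species rather than by strain is what lets all C. jejuni strains share one colour in the resulting matrix.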

D.5 iscan source code<br />

#!/usr/bin/perl<br />
use strict;<br />
<br />
my $pwm;<br />
my %matrix;<br />
my $spacer;<br />
my @PWM;<br />
my $pi = 3.14159265;<br />
<br />
# read the model files # include supported recursively (NO CHECK FOR LOOPS!)<br />
my %setup;<br />
my @LINES;<br />


if (defined $ARGV[0]) {<br />
    @LINES = read_mod($ARGV[0]);<br />
} else {<br />
    while (<DATA>) {<br />
        print;<br />
    }<br />
    close DATA;<br />
    die "no model provided. template model dumped\n";<br />
}<br />
<br />
my $pwmid = -1;<br />
print "# this is the model:\n";<br />
foreach (@LINES) {<br />
    print "# $_\n";<br />
    if (/^\[pwm\]\s*=\s*(.*)/) {<br />
        $pwmid++;<br />
        push @PWM, "$pwmid:$1";<br />
    }<br />
    my $pwm = $PWM[$#PWM];<br />
    $setup{$pwm}{$1} = $2 if /^(\w+)\s*=\s*([\.\-0-9]+)/;<br />
    next unless /^\[([ATGC]+)\]/;<br />
    my @F = split /[\s\t]+/;<br />
    shift @F;<br />
    err("pwm not defined") unless defined $pwm;<br />
    @{$matrix{$pwm}{$1}} = @F;<br />
    $matrix{$pwm}{count}[$_] += $F[$_] foreach (0 .. $#F);<br />
}<br />
<br />

# make a lookup table of distance information measure<br />
my %SPACER_LOOKUP;<br />
foreach my $spacer (keys %setup) {<br />
    my $min = $setup{$spacer}{min};<br />
    my $max = $setup{$spacer}{max};<br />
    my $center = $setup{$spacer}{center};<br />
    printf "# parsing accessibility for $spacer (min=$min, max=$max, center=$center)\n";<br />
    my $n = 0;<br />
    $n += 1 + cos(((2*$pi)/10.6) * ($_ - $center)) foreach ($min .. $max);<br />
    foreach my $d ($min .. $max) {<br />
        if ($center eq "") {<br />
            $SPACER_LOOKUP{$d}{$min}{$max}{$center} = 0;<br />
        } else {<br />
            $SPACER_LOOKUP{$d}{$min}{$max}{$center} = -(-log((1 + cos(((2*$pi)/10.6) * ($d - $center))) / $n) / log(2));<br />
        }<br />
        printf "# d=%d, score=%0.2f\n", $d, $SPACER_LOOKUP{$d}{$min}{$max}{$center};<br />
    }<br />
}<br />
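The lookup table built above assigns each allowed spacer length d the weight 1 + cos(2π(d − center)/10.6), normalises the weights over min..max, and stores the base-2 logarithm of the normalised value, so the preferred length gets the score closest to zero. The Python sketch below reproduces only that arithmetic for one spacer. The function name `spacer_scores` is hypothetical, the reading of 10.6 as a helical period is an assumption, and the guard for a zero weight is an addition that the listing does not have.

```python
import math

# Period used in the listing; presumably the DNA helical repeat in bp (assumption).
PERIOD = 10.6


def spacer_scores(min_len, max_len, center):
    """Return {spacer length: log2 of its normalised cosine weight}."""
    weights = {
        d: 1 + math.cos((2 * math.pi / PERIOD) * (d - center))
        for d in range(min_len, max_len + 1)
    }
    total = sum(weights.values())
    return {
        # A weight of exactly zero has no finite logarithm.
        d: math.log2(w / total) if w > 0 else float("-inf")
        for d, w in weights.items()
    }


if __name__ == "__main__":
    for d, score in spacer_scores(15, 19, 17).items():
        print(f"d={d}, score={score:0.2f}")
```

Since the normalised weights sum to one, every score is at most zero, and adding it to a matrix bit score acts as a penalty that grows as the spacer moves away from the preferred length.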

<br />
# compute matrix based on frequencies<br />
foreach my $pwm (keys %matrix) {<br />
    print "# preparing matrix '$pwm'\n";<br />
    foreach my $letter (qw/A T G C/) {<br />
        print "# [$letter]";<br />
        foreach my $i (0 .. $#{$matrix{$pwm}{A}}) {<br />
            my $i1 = "-";<br />
            my $i2 = sprintf('%5s', '-');<br />
            if ($matrix{$pwm}{$letter}[$i] > 0) {<br />
                $i1 = 2 + log($matrix{$pwm}{$letter}[$i] / $matrix{$pwm}{count}[$i]) / log(2) - 0;<br />
                $i2 = sprintf('%5s', sprintf('%0.2f', $i1));<br />
            }<br />
            $matrix{$pwm}{$letter}[$i] = $i1 if $i1 ne "-";<br />
            print "\t$i2";<br />
        }<br />
        print "\n";<br />
    }<br />
}<br />
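The conversion above replaces each count by 2 + log2(count / column total), which equals the log-odds score in bits against a uniform background of 0.25 per nucleotide. A minimal Python sketch of the same arithmetic follows. The function name `information_matrix` is hypothetical, and zero counts are returned as `None` here, whereas the listing leaves such entries unchanged and prints "-".

```python
import math

BASES = "ATGC"


def information_matrix(counts):
    """Convert per-position nucleotide counts into bit scores.

    counts maps each base to a list of counts, one per matrix position.
    Each score is 2 + log2(count / column total).
    """
    length = len(counts["A"])
    totals = [sum(counts[b][i] for b in BASES) for i in range(length)]
    return {
        b: [
            2 + math.log2(counts[b][i] / totals[i]) if counts[b][i] > 0 else None
            for i in range(length)
        ]
        for b in BASES
    }


if __name__ == "__main__":
    example = {"A": [10, 5], "T": [0, 5], "G": [0, 5], "C": [0, 5]}
    for base, row in information_matrix(example).items():
        print(base, row)
```

A fully conserved position scores 2 bits for the conserved base, while a position with four equally frequent bases scores 0 for each of them.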

<br />
# loop over all sequences in input<br />
my @inp = &read_fasta;<br />
foreach my $s (0 .. $#inp) {<br />
    my $seq = $inp[$s]->{seq};<br />
    printf "# SEQUENCE %s\n", $inp[$s]->{id};<br />
    printf "# %d bp\n", length($seq);<br />
    my %LEN;<br />
    my %BIT;<br />
    foreach my $pwm (@PWM) {<br />
        print "# generating bit scores for matrix '$pwm'\n";<br />
        @{$BIT{$pwm}} = &scan($seq, %{$matrix{$pwm}});<br />
        $LEN{$pwm} = scalar(@{$matrix{$pwm}{A}});<br />
        printf "# %d elements in array\n", scalar(@{$BIT{$pwm}});<br />
    }<br />
    foreach my $p (0 .. (length($seq) - $LEN{$PWM[0]})) {<br />
        printf "# considering position %d (root model)\n", $p + 1;<br />
        # find the score of the initial matrix, for this given position<br />
        my $w = $setup{$PWM[0]}{weight};<br />
        my $fsi = $BIT{$PWM[0]}[$p] * $w;<br />
        my $offset = $p;<br />
        my $signal = substr($seq, $offset, $LEN{$PWM[0]});<br />
        my $s = sprintf "%s\t%0.2f", $signal, $fsi;<br />
<br />
        foreach my $pwm_index (1 .. $#PWM) {<br />
            my $pwm = $PWM[$pwm_index];<br />
            my $w = $setup{$pwm}{weight};<br />
<br />
            # get the spacing details for the upstream spacer<br />
            my $prev_pwm = $PWM[$pwm_index - 1];<br />
<br />


            my ($min, $max, $center) = ($setup{$prev_pwm}{min},<br />
                $setup{$prev_pwm}{max}, $setup{$prev_pwm}{center});<br />
<br />
            my $opt_spacer;<br />
            my $opt_unit_score;<br />
<br />
            # calculate unit scores for each of the spacing configurations<br />
            # A unit is the spacer and the following matrix. We search for the<br />
            # spacer giving rise to the highest unit score<br />
<br />
            printf "# adjusting spacer downstream of '$pwm'\n";<br />
<br />
            foreach my $spacer ($min .. $max) {<br />
                # don't continue if the offset goes beyond zero ...<br />
                last if $offset - $LEN{$pwm} - $spacer < 0;<br />
                next if $BIT{$pwm}[$offset - $LEN{$pwm} - $spacer] * $w < $setup{$pwm}{threshold} and defined $setup{$pwm}{threshold};<br />
<br />
                # if no optimal spacer is declared yet (e.g. because this is<br />
                # the first round) then do it now<br />
                $opt_spacer = $spacer unless defined $opt_spacer;<br />
                my $test_unit_score = $BIT{$pwm}[$offset - $LEN{$pwm} - $spacer] * $w + $SPACER_LOOKUP{$spacer}{$min}{$max}{$center};<br />
                printf "# spacer: %d, score: %0.1f (%0.1f + %0.1f)\n", $spacer, $test_unit_score, $BIT{$pwm}[$offset - $LEN{$pwm} - $spacer], $SPACER_LOOKUP{$spacer}{$min}{$max}{$center};<br />
                $opt_unit_score = $test_unit_score unless defined $opt_unit_score;<br />
                if ($test_unit_score > $opt_unit_score) {<br />
                    $opt_spacer = $spacer;<br />
                    $opt_unit_score = $test_unit_score;<br />
                }<br />
            } # foreach my $spacer<br />
<br />
            # offset is where the current pwm starts<br />
            $offset = $offset - $LEN{$pwm} - $opt_spacer;<br />
<br />
            printf "# new offset %d\n", $offset;<br />
<br />
            if (!defined $opt_unit_score) {<br />
                printf "# unable to determine spacer\n";<br />
                $s .= sprintf "\t-\t%s\t-", ('-' x $LEN{$pwm});<br />
                next;<br />
            } else {<br />
                printf "# spacer $opt_spacer chosen, unit '%s' gives score %0.1f\n", $pwm, $opt_unit_score;<br />
                $fsi += $opt_unit_score;<br />
                my $signal = substr($seq, $offset, $LEN{$pwm});<br />
                $s .= sprintf "\t%d\t%s\t%0.2f", $opt_spacer, $signal, $fsi;<br />
            }<br />
        } # foreach my $pwm_index<br />
        # print the final bit score<br />
        printf "%d\t%0.2f\t%s\t\n", ($p+1), $fsi, $s;<br />
    } # my $p = 0<br />
} # for ($s = 0 ....<br />

#######################################
# HELPER FUNCTIONS
#######################################

# scan using a matrix of information
sub scan {
    my @a;
    my ($s, %m) = @_;
    my $ma = $#{ $m{A} };    # index of the last matrix column
    foreach my $p (0 .. (length($s) - $#{ $m{A} } - 1)) {
        my $Ri = 0;
        $Ri += $m{ substr($s, $p + $_, 1) }[$_] foreach (0 .. $ma);
        push @a, $Ri;
    }
    # return a list having n-l+1 elements, where n is the sequence length and
    # l is the matrix size (for the -10 hexamer, l=6)
    return @a;
}

###############################################################
# spacer bit score calculations: coordinates are shifted 6bp
###############################################################

sub read_mod {
    my @ret;
    my $fn = $_[0];
    my $i;
    # three-argument open, so the file name is never parsed as a mode
    open $i, '<', $fn or err("unable to open file '$fn': $!\n");
    while (readline($i)) {
        chomp;
        if (/^#\s*include\s*(.*)/) {
            # an include directive: read the named file recursively
            my @a = read_mod($1);
            push @ret, @a;
        } else {
            next if /^[\s\#]+/;    # skip comments and indented lines
            next unless /^\S+/;    # skip empty lines
            push @ret, $_;
        }
    }
    close $i;
    return @ret;
}

sub read_fasta {
    my @fasta;    # contains all
    my $id = -1;
    # the input operator is missing from the source listing;
    # <> (STDIN, or the files named in @ARGV) is assumed here
    while (<>) {
        chomp;
        if (/^>(.*)/) {
            $id++;
            $fasta[$id]->{id} = $1;
        } elsif (/^([A-Za-z]+)/) {
            $fasta[$id]->{seq} .= $1;
        }
    }
    return @fasta;
}

sub err {
    print $_[0];
    exit 1;
}
exit 0;

__DATA__
[pwm]=-10 region
weight=1
[A] 0 63 0 63 63 0
[T] 63 0 63 0 0 63
[G] 0 0 0 0 0 0
[C] 0 0 0 0 0 0
[spacer]
min=13
center=16
max=19
[pwm]=-35 region
weight=1
[A] 0 0 0 0 0 36
[T] 63 63 0 54 0 9
[G] 0 0 63 0 18 9
[C] 0 0 0 9 45 9
[spacer]
min=0
center=3
max=6
[pwm]=UP
weight=0.5
[A] 18 0 45 27 45 54 54 54 18 9 45 9 2 9 18 45 54 45 9 2 0 9
[T] 45 11 0 0 18 0 9 9 36 45 18 54 45 45 27 9 9 18 54 54 63 17
[G] 0 9 18 36 0 0 0 0 9 9 0 0 0 9 9 0 0 0 0 7 0 0
[C] 0 43 0 0 0 9 0 0 0 0 0 0 16 0 9 9 0 0 0 0 0 37
[spacer]
min=-4
center=2
max=4
[pwm]=FIS
weight=0.5
threshold=0
[A] 26 27 16 0 18 9 0 29 54 54 54 45 42 3 2 36 7 2 18 22 16
[T] 36 36 45 0 0 38 43 0 0 0 9 0 18 45 0 0 0 0 1 0 45
[G] 1 0 2 63 18 7 20 34 9 9 0 18 3 13 45 0 54 0 44 41 0
[C] 0 0 0 0 27 9 0 0 0 0 0 0 0 2 16 27 2 61 0 0 2
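The spacer search in the listing above can be illustrated in isolation. The following is a minimal sketch, not the original program: it scores every window of a sequence against the -35 matrix from the `__DATA__` block and then picks the spacer that maximises the unit score upstream of a known -10 position. Two things are assumptions made for illustration only: the raw matrix values are used directly as scores, and `spacer_penalty` is a hypothetical stand-in for `%SPACER_LOOKUP`, which is built outside this part of the listing. The names `scan_windows` and `best_spacer` are likewise hypothetical.

```perl
use strict;
use warnings;

# -35 region matrix, copied from the __DATA__ block.
# Assumption: the raw values are used directly as scores.
my %m35 = (
    A => [ 0,  0,  0,  0,  0, 36],
    T => [63, 63,  0, 54,  0,  9],
    G => [ 0,  0, 63,  0, 18,  9],
    C => [ 0,  0,  0,  9, 45,  9],
);

# Score every window of $seq against the matrix (same idea as scan()).
sub scan_windows {
    my ($seq, %m) = @_;
    my $last = $#{ $m{A} };    # index of the last matrix column
    my @scores;
    for my $p (0 .. length($seq) - $last - 1) {
        my $sum = 0;
        $sum += $m{ substr($seq, $p + $_, 1) }[$_] for 0 .. $last;
        push @scores, $sum;
    }
    return @scores;
}

# Hypothetical penalty: 0 at the preferred spacer length,
# -10 for every base away from it.
sub spacer_penalty {
    my ($spacer, $center) = @_;
    return -10 * abs($spacer - $center);
}

# Try every spacer in min..max and keep the one whose unit score
# (matrix score at the resulting start plus spacer penalty) is highest.
sub best_spacer {
    my ($scores, $offset, $len, $min, $max, $center) = @_;
    my ($opt_spacer, $opt_score);
    for my $spacer ($min .. $max) {
        my $start = $offset - $len - $spacer;
        last if $start < 0;    # ran off the start of the sequence
        my $unit = $scores->[$start] + spacer_penalty($spacer, $center);
        ($opt_spacer, $opt_score) = ($spacer, $unit)
            if !defined $opt_score || $unit > $opt_score;
    }
    return ($opt_spacer, $opt_score);
}

# Toy promoter: -35 hexamer, 16 bp spacer, -10 hexamer.
my $seq    = 'TTGACA' . ('GC' x 8) . 'TATAAT';
my @scores = scan_windows($seq, %m35);
my $offset = 22;               # start of the -10 hexamer in $seq
my ($spacer, $score) = best_spacer(\@scores, $offset, 6, 13, 19, 16);
printf "spacer %d chosen, unit score %d\n", $spacer, $score;
```

With the spacer limits of the -10 region entry (min 13, center 16, max 19), the search stops at spacer 16 because a longer spacer would place the -35 window before the start of this toy sequence.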

D.6 quasi mktemp manual

NAME
    quasi_mktemp - create a template CBS Web Service implementation

SYNOPSIS
    perl quasi_mktempl [-n SERVICENAME] [-v VERSION] [-w WSNUMBER] (-f) (-remove) (-t TEMPLATENAME)

DESCRIPTION
    This script creates a functional template SOAP Web Service implementation under Quasi,
    including a working example. The object types this service receives/generates are the
    CBS standard sequence data object / annotation data object.

    The following elements are created by the program:

    * WSDL file, with proper namespaces and operation(s)
    * An XSD included by the WSDL
    * A directory in /usr/opt/www/cgi-bin/CBS/soap/ws/quasi/ containing the Perl module (module.pm)
    * A directory in /usr/opt/www/pub/CBS/ws/ containing the XSD, WSDL and example files
    * An entry in mysql.WebServices.services
    * An index.php and include.html located in /usr/opt/www/pub/CBS/ws/[SERVICENAME]

    To-do list, once you have created the template:

    [ ] Alter the WSDL so it contains the operations you need
    [ ] Alter the XSD so all operation data types are defined
    [ ] Alter the file module.pm and possibly wrapper.pl, located in
        /usr/opt/www/cgi-bin/soap/ws/quasi/[SERVICE]/[WS]/
    [ ] Alter the example so that it contains a relevant example for your service
    [ ] Alter the include.html so that it describes the usage of the example script
    [ ] Once you are happy with the implementation, remove the flag "internal_only" from
        mysql.WebServices.services and set the desired description for your service
        (in the field 'description')

OPTIONS
    -n SERVICENAME
        Case-sensitive service name, e.g. SignalP

    -v VERSION
        The version of the service in the form X.Y, e.g. 1.2

    -w WSNUMBER
        This is the implementation number for this service and version. The number
        starts at zero.

    -f
        Forces overwriting existing files

    -remove
        Removes all files pertaining to this service/version/implementation - be careful!

    -t TEMPLATE
        New templates can be installed. Use option -t list to list all templates

AUTHOR
    Peter Fischer Hallin, pfh@cbs.dtu.dk, September 2008

SEE ALSO
    /usr/opt/quaq/
    /usr/opt/www/cgi-bin/CBS/soap/ws/quasi.cgi

AUTHOR
    Peter Hallin 2008-09-15, pfh@cbs.dtu.dk
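The option handling of quasi_mktemp itself is not reproduced in this appendix. As a rough sketch of how the options named in the SYNOPSIS could be read, the example below uses the standard Getopt::Long module. The helper name `parse_args`, the hash keys and the two validation checks (version of the form X.Y, implementation number starting at zero) are assumptions derived from the OPTIONS text, not the script's actual code.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse the options listed in the SYNOPSIS.
sub parse_args {
    my @args = @_;
    my %opt  = (force => 0, remove => 0);
    GetOptionsFromArray(
        \@args,
        'n=s'    => \$opt{name},        # -n SERVICENAME
        'v=s'    => \$opt{version},     # -v VERSION
        'w=i'    => \$opt{ws},          # -w WSNUMBER
        'f'      => \$opt{force},       # -f
        'remove' => \$opt{remove},      # -remove
        't=s'    => \$opt{template},    # -t TEMPLATE
    ) or die "invalid options\n";

    # Checks taken from the OPTIONS text
    die "version must have the form X.Y\n"
        if defined $opt{version} && $opt{version} !~ /^\d+\.\d+$/;
    die "the implementation number starts at zero\n"
        if defined $opt{ws} && $opt{ws} < 0;
    return %opt;
}

my %opt = parse_args(qw(-n SignalP -v 1.2 -w 0 -f));
printf "service %s, version %s, implementation %d, force=%d\n",
    @opt{qw(name version ws force)};
```

Getopt::Long accepts long option names with a single dash in its default configuration, which is why `-remove` can be declared as an ordinary option here.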

173


BIBLIOGRAPHY<br />

Bibliography<br />

S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, & D. J.<br />

Lipman (1997). ‘Gapped blast <strong>and</strong> psi–blast: a new generation of prote<strong>in</strong> database<br />

searchprograms.’ Nucleic Acids Res 25:3389–402.<br />

B. F. Bauer, E. G. Kar, R. M. Elford, & W. M. Holmes (1988). ‘Sequence determ<strong>in</strong>ants<br />

for promoter strength <strong>in</strong> the leuv operon of Escherichia coli.’ Gene 63:123–34.<br />

J. Besemer, A. Lomsadze, & M. Borodovsky (2001). ‘GeneMarks: a self–tra<strong>in</strong><strong>in</strong>g method<br />

for prediction of gene starts <strong>in</strong> microbial genomes. Implications for f<strong>in</strong>d<strong>in</strong>g sequence<br />

motifs <strong>in</strong> regulatory regions.’ Nucleic Acids Res 29:2607–18.<br />

T. T. B<strong>in</strong>newies, P. F. Hall<strong>in</strong>, H.-H. Staerfeldt, & D. W. Ussery (2005). ‘Genome Update:<br />

proteome comparisons.’ Microbiology 151:1–4.<br />

T. T. B<strong>in</strong>newies, Y. Motro, P. F. Hall<strong>in</strong>, O. Lund, D. Dunn, T. La, D. J. Hampson,<br />

M. Bellgard, T. M. Wassenaar, & D. W. Ussery (2006). ‘Ten years of bacterial genome<br />

sequenc<strong>in</strong>g: comparative–genomics–baseddiscoveries.’ Funct Integr Genomics 6:165–85.<br />

E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. G<strong>in</strong>geras, E. H. Margulies,<br />

Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M.<br />

Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum,<br />

R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clell<strong>and</strong>, S. Davis,<br />

N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy,<br />

M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson,<br />

T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri,<br />

S. C. J. Parker, P. J. Sabo, R. S<strong>and</strong>strom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox,<br />

M. Yu, F. S. Coll<strong>in</strong>s, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev,<br />

W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky,<br />

D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. S<strong>and</strong>el<strong>in</strong>, I. L. Hofacker,<br />

R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sek<strong>in</strong>ger, J. Lagarde,<br />

J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. L<strong>in</strong>demeyer,<br />

K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd,<br />

R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T.<br />

Weirauch, J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Sr<strong>in</strong>ivasan, W.-K. Sung, H. S.<br />

Ooi, K. P. Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia,<br />

S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano, C. Wyss, E. Cheung, T. G. Clark,<br />

J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai,<br />

J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel,<br />

J. S. Mattick, P. Carn<strong>in</strong>ci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers,<br />

174


BIBLIOGRAPHY<br />

J. Rogers, P. F. Stadler, T. M. Lowe, C.-L. Wei, Y. Ruan, K. Struhl, M. Gerste<strong>in</strong>, S. E.<br />

Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A.<br />

Wetterstr<strong>and</strong>, P. J. Good, E. A. Fe<strong>in</strong>gold, M. S. Guyer, G. M. Cooper, G. Asimenos,<br />

C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan,<br />

F. Pardi, T. Mass<strong>in</strong>gham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullik<strong>in</strong>, A. Ureta-<br />

Vidal, B. Paten, M. Ser<strong>in</strong>ghaus, D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone,<br />

S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D.<br />

Tr<strong>in</strong>kle<strong>in</strong>, Z. D. Zhang, L. Barrera, R. Stuart, D. C. K<strong>in</strong>g, A. Ameur, S. Enroth, M. C.<br />

Bieda, J. Kim, A. A. Bh<strong>in</strong>ge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. H. Lee,<br />

P. Ng, A. Shahab, A. Yang, Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley,<br />

D. Inman, M. A. S<strong>in</strong>ger, T. A. Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman,<br />

J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F.<br />

Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar,<br />

N. He<strong>in</strong>tzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld,<br />

S. F. Aldred, S. J. Cooper, A. Halees, J. M. L<strong>in</strong>, H. P. Shulha, X. Zhang, M. Xu,<br />

J. N. S. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green, C. Wadelius, P. J. Farnham,<br />

B. Ren, R. A. Harte, A. S. H<strong>in</strong>richs, H. Trumbower, H. Clawson, J. Hillman-Jackson,<br />

A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol,<br />

C. P. Bird, P. I. W. de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Mart<strong>in</strong>, B. E.<br />

Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert,<br />

M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X. Guan, N. F. Hansen,<br />

J. R. Idol, V. V. B. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C.<br />

Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley,<br />

H. Jiang, G. M. We<strong>in</strong>stock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K.<br />

Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. L<strong>in</strong>dblad-Toh, E. S.<br />

L<strong>and</strong>er, M. Koriab<strong>in</strong>e, M. Nefedov, K. Osoegawa, Y. Yosh<strong>in</strong>aga, B. Zhu, & P. J. de Jong<br />

(2007). ‘Identification <strong>and</strong> analysis of functional elements <strong>in</strong> 1of the human genome by<br />

the encode pilot project.’ Nature 447:799–816.<br />

F. R. Blattner, G. r. Plunkett, C. A. Bloch, N. T. Perna, V. Burl<strong>and</strong>, M. Riley, J. Collado-<br />

Vides, J. D. Glasner, C. K. Rode, G. F. Mayhew, J. Gregor, N. W. Davis, H. A.<br />

Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, & Y. Shao (1997). ‘The complete<br />

genome sequence of Escherichia coli k–12.’ Science 277:1453–62.<br />

A. J. t. Bokal, W. Ross, & R. L. Gourse (1995). ‘The transcriptional activator prote<strong>in</strong> fis:<br />

Dna <strong>in</strong>teractions <strong>and</strong>cooperative <strong>in</strong>teractions with rna polymerase at the Escherichia<br />

coli rrnbp1 promoter.’ J Mol Biol 245:197–207.<br />

A. Bolshoy, P. McNamara, R. E. Harr<strong>in</strong>gton, & E. N. Trifonov (1991). ‘Curved dna<br />

without a–a: experimental estimation of all 16 dna wedgeangles.’ Proc Natl Acad Sci U<br />

S A 88:2312–6.<br />

P. J. Brett, D. DeShazer, & D. E. Woods (1998). ‘Burkholderia thail<strong>and</strong>ensis sp. nov., a<br />

Burkholderia pseudomallei–likespecies.’ Int J Syst Bacteriol 48:317–20.<br />

E. Brzuszkiewicz, H. Bruggemann, H. Liesegang, M. Emmerth, T. Olschlager, G. Nagy,<br />

K. Albermann, C. Wagner, C. Buchrieser, L. Emody, G. Gottschalk, J. Hacker, & U. Dobr<strong>in</strong>dt<br />

(2006). ‘How to become a uropathogen: comparative genomic analysis ofextra<strong>in</strong>test<strong>in</strong>al<br />

pathogenic Escherichia coli stra<strong>in</strong>s.’ Proc Natl Acad Sci U S A 103:12879–84.<br />

S. L. Chen, C.-S. Hung, J. Xu, C. S. Reigstad, V. Magr<strong>in</strong>i, A. Sabo, D. Blasiar, T. Bieri,<br />

R. R. Meyer, P. Ozersky, J. R. Armstrong, R. S. Fulton, J. P. Latreille, J. Spieth, T. M.<br />

175


BIBLIOGRAPHY<br />

Hooton, E. R. Mardis, S. J. Hultgren, & J. I. Gordon (2006). ‘Identification of genes<br />

subject to positive selection <strong>in</strong> uropathogenicstra<strong>in</strong>s of Escherichia coli: a comparative<br />

genomics approach.’ Proc Natl Acad Sci U S A 103:5977–82.<br />

A. L. Delcher, D. Harmon, S. Kasif, O. White, & S. L. Salzberg (1999). ‘Improved microbial<br />

gene identification with glimmer.’ Nucleic Acids Res 27:4636–41.<br />

J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan,<br />

B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark,<br />

R. Dalal, A. Dew<strong>in</strong>ter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. He<strong>in</strong>er,<br />

K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. L<strong>in</strong>, P. Lundquist,<br />

C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy,<br />

R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener,<br />

D. Wu, A. Yang, D. Zaccar<strong>in</strong>, P. Zhao, F. Zhong, J. Korlach, & S. Turner (2009).<br />

‘Real–time dna sequenc<strong>in</strong>g from s<strong>in</strong>gle polymerase molecules.’ Science 323:133–8.<br />

M. Ender, B. Berger-Bachi, & N. McCallum (2009). ‘A novel dna–b<strong>in</strong>d<strong>in</strong>g prote<strong>in</strong> modulat<strong>in</strong>g<br />

methicill<strong>in</strong> resistance <strong>in</strong> Staphylococcus aureus.’ BMC Microbiol 9:15.<br />

S. T. Estrem, T. Gaal, W. Ross, & R. L. Gourse (1998). ‘Identification of an up element<br />

consensus sequence for bacterialpromoters.’ Proc Natl Acad Sci U S A 95:9761–6.<br />

P. F. Hall<strong>in</strong> & D. W. Ussery (2004). ‘Cbs Genome Atlas Database: a dynamic storage for<br />

bio<strong>in</strong>formatic results <strong>and</strong> sequence data.’ Bio<strong>in</strong>formatics 20:3682–6.<br />

K. Hayashi, N. Morooka, Y. Yamamoto, K. Fujita, K. Isono, S. Choi, E. Ohtsubo, T. Baba,<br />

B. L. Wanner, H. Mori, & T. Horiuchi (2006). ‘Highly accurate genome sequences of<br />

Escherichia coli k–12 stra<strong>in</strong>s mg1655<strong>and</strong> w3110.’ Mol Syst Biol 2:2006.0007.<br />

T. Hayashi, K. Mak<strong>in</strong>o, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han,<br />

E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami,<br />

T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori,<br />

& H. Sh<strong>in</strong>agawa (2001). ‘Complete genome sequence of enterohemorrhagic Escherichia<br />

coli o157:h7 <strong>and</strong>genomic comparison with a laboratory stra<strong>in</strong> k–12.’ DNA Res 8:11–22.<br />

P. N. Hengen, S. L. Bartram, L. E. Stewart, & T. D. Schneider (1997). ‘Information<br />

analysis of Fis b<strong>in</strong>d<strong>in</strong>g sites.’ Nucleic Acids Res 25:4994–5002.<br />

C. A. Hirvonen, W. Ross, C. E. Wozniak, E. Marasco, J. R. Anthony, S. E. Aiyar, V. H.<br />

Newburn, & R. L. Gourse (2001). ‘Contributions of up elements <strong>and</strong> the transcription<br />

factor fis toexpression from the seven rrn p1 promoters <strong>in</strong> Escherichia coli.’ J Bacteriol<br />

183:6305–14.<br />

A. M. Huerta & J. Collado-Vides (2003). ‘Sigma70 promoters <strong>in</strong> Escherichia coli: specific<br />

transcription <strong>in</strong> denseregions of overlapp<strong>in</strong>g promoter–like signals.’ J Mol Biol 333:261–<br />

78.<br />

L. J. Jensen, C. Friis, & D. W. Ussery (1999). ‘Three views of microbial genomes.’ Res<br />

Microbiol 150:773–7.<br />

L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten, N. T. Hansen, H. Johansson, M. K.<br />

Joergensen, K. Kiil, P. F. Hall<strong>in</strong>, & D. Ussery (2005). THE PSEUDOMONADS VOL<br />

I. GENOMICS, LIFE STYLE AND MOLECULAR ARCHITECTURE, vol. 1, chap.<br />

Chapter 5: <strong>Comparative</strong> genomics of four Pseudomonas species, pp. 139–164. Kluwer<br />

Academic / Plenum Publishers, New York.<br />

176


BIBLIOGRAPHY<br />

Q. J<strong>in</strong>, Z. Yuan, J. Xu, Y. Wang, Y. Shen, W. Lu, J. Wang, H. Liu, J. Yang, F. Yang,<br />

X. Zhang, J. Zhang, G. Yang, H. Wu, D. Qu, J. Dong, L. Sun, Y. Xue, A. Zhao, Y. Gao,<br />

J. Zhu, B. Kan, K. D<strong>in</strong>g, S. Chen, H. Cheng, Z. Yao, B. He, R. Chen, D. Ma, B. Qiang,<br />

Y. Wen, Y. Hou, & J. Yu (2002). ‘Genome sequence of Shigella flexneri 2a: <strong>in</strong>sights<br />

<strong>in</strong>to pathogenicitythrough comparison with genomes of Escherichia coli k12 <strong>and</strong> o157.’<br />

Nucleic Acids Res 30:4432–41.<br />

T. J. Johnson, S. Kariyawasam, Y. Wannemuehler, P. Mangiamele, S. J. Johnson,<br />

C. Doetkott, J. A. Skyberg, A. M. Lynne, J. R. Johnson, & L. K. Nolan (2007). ‘The<br />

genome sequence of avian pathogenic Escherichia coli stra<strong>in</strong> o1:k1:h7shares strong similarities<br />

with human extra<strong>in</strong>test<strong>in</strong>al pathogenic e. coligenomes.’ J Bacteriol 189:3228–36.<br />

J. Kyte & R. F. Doolittle (1982). ‘A simple method for display<strong>in</strong>g the hydropathic character<br />

of a prote<strong>in</strong>.’ J Mol Biol 157:105–32.<br />

K. Lagesen, P. Hall<strong>in</strong>, E. A. Rodl<strong>and</strong>, H.-H. Staerfeldt, T. Rognes, & D. W. Ussery (2007).<br />

‘RNAmmer: consistent <strong>and</strong> rapid annotation of ribosomal rna genes.’ Nucleic Acids Res<br />

35:3100–8.<br />

T. S. Larsen & A. Krogh (2003). ‘EasyGene–a prokaryotic gene f<strong>in</strong>der that ranks ORFs<br />

by statistical significance.’ BMC Bio<strong>in</strong>formatics 4:21.<br />

T. Lefebure & M. J. Stanhope (2007). ‘Evolution of the core <strong>and</strong> pan–genome of Streptococcus:<br />

positive selection, recomb<strong>in</strong>ation, <strong>and</strong> genome composition.’ Genome Biol<br />

8:R71.<br />

X. Liao, T. Y<strong>in</strong>g, H. Wang, J. Wang, Z. Shi, E. Feng, K. Wei, Y. Wang, X. Zhang,<br />

L. Huang, G. Su, & P. Huang (2003). ‘A two–dimensional proteome map of Shigella<br />

flexneri.’ Electrophoresis 24:2864–82.<br />

B. Liebig & R. Wagner (1995). ‘Effects of different growth conditions on the <strong>in</strong> vivo<br />

activity of thet<strong>and</strong>em Escherichia coli ribosomal rna promoters p1 <strong>and</strong> p2.’ Mol Gen<br />

Genet 249:328–35.<br />

D. Lim & N. C. J. Strynadka (2002). ‘Structural basis for the beta lactam resistance of<br />

pbp2a from methicill<strong>in</strong>–resistant Staphylococcus aureus.’ Nat Struct Biol 9:870–6.<br />

T. M. Lowe & S. R. Eddy (1997). ‘tRNAscan–se: a program for improved detection of<br />

transfer rna genes <strong>in</strong>genomic sequence.’ Nucleic Acids Res 25:955–64.<br />

J. P. McCutcheon, B. R. McDonald, & N. A. Moran (2009). ‘Orig<strong>in</strong> of an alternative<br />

genetic code <strong>in</strong> the extremely small <strong>and</strong> gc–rich genome of a bacterial symbiont.’ PLoS<br />

Genet 5:e1000565.<br />

C. E. McEwan, D. Gatherer, & N. R. McEwan (1998). ‘Nitrogen–fix<strong>in</strong>g aerobic bacteria<br />

have higher genomic gc content than non–fix<strong>in</strong>g species with<strong>in</strong> the same genus.’<br />

Hereditas 128:173–8.<br />

W. G. Miller, C. T. Parker, M. Rubenfield, G. L. Mendz, M. M. S. M. Wosten, D. W.<br />

Ussery, J. F. Stolz, T. T. B<strong>in</strong>newies, P. F. Hall<strong>in</strong>, G. Wang, J. A. Malek, A. Rogos<strong>in</strong>,<br />

L. H. Stanker, & R. E. M<strong>and</strong>rell (2007). ‘The complete genome sequence <strong>and</strong> analysis<br />

of the epsilonproteobacteriumArcobacter butzleri.’ PLoS One 2:e1358.<br />

H. D. Murray & R. L. Gourse (2004). ‘Unique roles of the rrn p2 rrna promoters <strong>in</strong><br />

Escherichia coli.’ Mol Microbiol 52:1375–87.<br />

177


BIBLIOGRAPHY<br />

A. Nakabachi, A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, & M. Hattori<br />

(2006). ‘The 160–kilobase genome of the bacterial endosymbiont Carsonella.’ Science<br />

314:267.<br />

C. Ong, C. H. Ooi, D. Wang, H. Chong, K. C. Ng, F. Rodrigues, M. A. Lee, & P. Tan<br />

(2004). ‘Patterns of large–scale genomic variation <strong>in</strong> virulent <strong>and</strong> avirulentBurkholderia<br />

species.’ Genome Res 14:2295–307.<br />

J. Parkhill, B. W. Wren, K. Mungall, J. M. Ketley, C. Churcher, D. Basham, T. Chill<strong>in</strong>gworth,<br />

R. M. Davies, T. Feltwell, S. Holroyd, K. Jagels, A. V. Karlyshev, S. Moule,<br />

M. J. Pallen, C. W. Penn, M. A. Quail, M. A. Raj<strong>and</strong>ream, K. M. Rutherford, A. H. van<br />

Vliet, S. Whitehead, & B. G. Barrell (2000). ‘The genome sequence of the food–borne<br />

pathogen Campylobacter jejunireveals hypervariable sequences.’ Nature 403:665–8.<br />

A. G. Pedersen, L. J. Jensen, S. Brunak, H. H. Staerfeldt, & D. W. Ussery (2000). ‘A dna<br />

structural atlas for Escherichia coli.’ J Mol Biol 299:907–30.<br />

V. Perez-Brocal, R. Gil, S. Ramos, A. Lamelas, M. Postigo, J. M. Michelena, F. J. Silva,<br />

A. Moya, & A. Latorre (2006). ‘A small microbial genome: the end of a long symbiotic<br />

relationship?’ Science 314:312–3.<br />

N. T. Perna, G. r. Plunkett, V. Burl<strong>and</strong>, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew,<br />

P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Kl<strong>in</strong>k, A. Bout<strong>in</strong>,<br />

Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D.<br />

Potamousis, J. Apodaca, T. S. Anantharaman, J. L<strong>in</strong>, G. Yen, D. C. Schwartz, R. A.<br />

Welch, & F. R. Blattner (2001). ‘Genome sequence of enterohaemorrhagic Escherichia<br />

coli o157:h7.’ Nature 409:529–33.<br />

O. N. Reva, P. F. Hall<strong>in</strong>, H. Willenbrock, T. Sicheritz-Ponten, B. Tummler, & D. W.<br />

Ussery (2008). ‘Global features of the Alcanivorax borkumensis sk2 genome.’ Environ<br />

Microbiol 10:614–25.<br />

E. P. C. Rocha (2004). ‘Codon usage bias from trna‘s po<strong>in</strong>t of view: redundancy, specialization,<br />

<strong>and</strong> efficient decod<strong>in</strong>g for translation optimization.’ Genome Res 14:2279–86.<br />

W. Ross, J. Salomon, W. M. Holmes, & R. L. Gourse (1999). ‘Activation of Escherichia<br />

coli leuv transcription by fis.’ J Bacteriol 181:3864–8.<br />

K. Rutherford, J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Raj<strong>and</strong>ream, & B. Barrell<br />

(2000). ‘Artemis: sequence visualization <strong>and</strong> annotation.’ Bio<strong>in</strong>formatics 16:944–5.<br />

R. A. Sanford, J. R. Cole, & J. M. Tiedje (2002). ‘Characterization <strong>and</strong> description of<br />

Anaeromyxobacter dehalogenans gen. nov., sp. nov., an aryl–halorespir<strong>in</strong>g facultative<br />

anaerobic myxobacterium.’ Appl Environ Microbiol 68:893–900.<br />

S. C. Satchwell, H. R. Drew, & A. A. Travers (1986). ‘Sequence periodicities <strong>in</strong> chicken<br />

nucleosome core dna.’ J Mol Biol 191:659–75.<br />

S. Schneiker, O. Perlova, O. Kaiser, K. Gerth, A. Alici, M. O. Altmeyer, D. Bartels,<br />

T. Bekel, S. Beyer, E. Bode, H. B. Bode, C. J. Bolten, J. V. Choudhuri, S. Doss,<br />

Y. A. Elnakady, B. Frank, L. Gaigalat, A. Goesmann, C. Groeger, F. Gross, L. Jelsbak,<br />

L. Jelsbak, J. Kal<strong>in</strong>owski, C. Kegler, T. Knauber, S. Konietzny, M. Kopp, L. Krause,<br />

D. Krug, B. L<strong>in</strong>ke, T. Mahmud, R. Mart<strong>in</strong>ez-Arias, A. C. McHardy, M. Merai, F. Meyer,<br />

S. Mormann, J. Munoz-Dorado, J. Perez, S. Pradella, S. Rachid, G. Raddatz, F. Rosenau,<br />

C. Ruckert, F. Sasse, M. Scharfe, S. C. Schuster, G. Suen, A. Treuner-Lange, G. J.<br />

178


BIBLIOGRAPHY<br />

Velicer, F.-J. Vorholter, K. J. Weissman, R. D. Welch, S. C. Wenzel, D. E. Whitworth,<br />

S. Wilhelm, C. Wittmann, H. Blocker, A. Puhler, & R. Muller (2007). ‘Complete genome<br />

sequence of the myxobacterium Sorangium cellulosum.’ Nat Biotechnol 25:1281–9.<br />

R. K. Shultzaberger, Z. Chen, K. A. Lewis, & T. D. Schneider (2007). ‘Anatomy of<br />

Escherichia coli sigma70 promoters.’ Nucleic Acids Res 35:771–88.<br />

M. D. Smith, B. J. Angus, V. Wuthiekanun, & N. J. White (1997). ‘Arab<strong>in</strong>ose assimilation<br />

def<strong>in</strong>es a nonvirulent biotype of Burkholderiapseudomallei.’ Infect Immun 65:4319–21.<br />

H. Tettel<strong>in</strong>, V. Masignani, M. J. Cieslewicz, C. Donati, D. Med<strong>in</strong>i, N. L. Ward, S. V.<br />

Angiuoli, J. Crabtree, A. L. Jones, A. S. Durk<strong>in</strong>, R. T. Deboy, T. M. Davidsen, M. Mora,<br />

M. Scarselli, I. Margarit y Ros, J. D. Peterson, C. R. Hauser, J. P. Sundaram, W. C.<br />

Nelson, R. Madupu, L. M. Br<strong>in</strong>kac, R. J. Dodson, M. J. Rosovitz, S. A. Sullivan,<br />

S. C. Daugherty, D. H. Haft, J. Selengut, M. L. Gw<strong>in</strong>n, L. Zhou, N. Zafar, H. Khouri,<br />

D. Radune, G. Dimitrov, K. Watk<strong>in</strong>s, K. J. B. O’Connor, S. Smith, T. R. Utterback,<br />

O. White, C. E. Rubens, G. Gr<strong>and</strong>i, L. C. Madoff, D. L. Kasper, J. L. Telford, M. R.<br />

Wessels, R. Rappuoli, & C. M. Fraser (2005). ‘Genome analysis of multiple pathogenic<br />

isolates of Streptococcus agalactiae: implications for the microbial “pan–genome“.’ Proc<br />

Natl Acad Sci U S A 102:13950–5.<br />

J. D. Thompson, D. G. Higg<strong>in</strong>s, & T. J. Gibson (1994). ‘Clustal w: improv<strong>in</strong>g the sensitivity<br />

of progressive multiple sequencealignment through sequence weight<strong>in</strong>g, position–<br />

specific gap penalties <strong>and</strong>weight matrix choice.’ Nucleic Acids Res 22:4673–80.<br />

H. Toh, B. L. Weiss, S. A. H. Perkin, A. Yamashita, K. Oshima, M. Hattori, & S. Aksoy<br />

(2006). ‘Massive genome erosion and functional adaptations provide insights into the<br />

symbiotic lifestyle of Sodalis glossinidius in the tsetse host.’ Genome Res 16:149–56.<br />

M. L. Tress, P. L. Martelli, A. Frankish, G. A. Reeves, J. J. Wesselink, C. Yeats, P. I. Olason,<br />

M. Albrecht, H. Hegyi, A. Giorgetti, D. Raimondo, J. Lagarde, R. A. Laskowski,<br />

G. Lopez, M. I. Sadowski, J. D. Watson, P. Fariselli, I. Rossi, A. Nagy, W. Kai, Z. Storling,<br />

M. Orsini, Y. Assenov, H. Blankenburg, C. Huthmacher, F. Ramirez, A. Schlicker,<br />

F. Denoeud, P. Jones, S. Kerrien, S. Orchard, S. E. Antonarakis, A. Reymond, E. Birney,<br />

S. Brunak, R. Casadio, R. Guigo, J. Harrow, H. Hermjakob, D. T. Jones, T. Lengauer,<br />

C. A. Orengo, L. Patthy, J. M. Thornton, A. Tramontano, & A. Valencia (2007). ‘The<br />

implications of alternative splicing in the ENCODE protein complement.’ Proc Natl Acad<br />

Sci U S A 104:5495–500.<br />

J. W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.<br />

D. W. Ussery, P. F. Hallin, K. Lagesen, & T. M. Wassenaar (2004). ‘Genome update:<br />

tRNAs in sequenced microbial genomes.’ Microbiology 150:1603–6.<br />

T. Visnes, B. Doseth, H. S. Pettersen, L. Hagen, M. M. L. Sousa, M. Akbari, M. Otterlei,<br />

B. Kavli, G. Slupphaug, & H. E. Krokan (2009). ‘Uracil in DNA and its processing by<br />

different DNA glycosylases.’ Philos Trans R Soc Lond B Biol Sci 364:563–8.<br />

H. Wang & C. J. Benham (2008). ‘Superhelical destabilization in regulatory regions of<br />

stress response genes.’ PLoS Comput Biol 4:e17.<br />

H. Wang, M. Noordewier, & C. J. Benham (2004). ‘Stress–induced DNA duplex destabilization<br />

(SIDD) in the E. coli genome: SIDD sites are closely associated with promoters.’<br />

Genome Res 14:1575–84.<br />




R. A. Welch, V. Burland, G. Plunkett III, P. Redford, P. Roesch, D. Rasko, E. L. Buckles,<br />

S.-R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C.<br />

Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, & F. R. Blattner (2002).<br />

‘Extensive mosaic structure revealed by the complete genome sequence of uropathogenic<br />

Escherichia coli.’ Proc Natl Acad Sci U S A 99:17020–4.<br />

H. Willenbrock, C. Friis, A. S. Juncker, & D. W. Ussery (2006). ‘An environmental<br />

signature for 323 microbial genomes based on codon adaptation indices.’ Genome Biol<br />

7:R114.<br />

K.-M. Wu, L.-H. Li, J.-J. Yan, N. Tsao, T.-L. Liao, H.-C. Tsai, C.-P. Fung, H.-J. Chen,<br />

Y.-M. Liu, J.-T. Wang, C.-T. Fang, S.-C. Chang, H.-Y. Shu, T.-T. Liu, Y.-T. Chen,<br />

Y.-R. Shiau, T.-L. Lauderdale, I.-J. Su, R. Kirby, & S.-F. Tsai (2009). ‘Genome sequencing<br />

and comparative analysis of Klebsiella pneumoniae NTUH–K2044, a strain causing liver<br />

abscess and meningitis.’ J Bacteriol 191:4492–501.<br />

F. Yang, J. Yang, X. Zhang, L. Chen, Y. Jiang, Y. Yan, X. Tang, J. Wang, Z. Xiong,<br />

J. Dong, Y. Xue, Y. Zhu, X. Xu, L. Sun, S. Chen, H. Nie, J. Peng, J. Xu, Y. Wang,<br />

Z. Yuan, Y. Wen, Z. Yao, Y. Shen, B. Qiang, Y. Hou, J. Yu, & Q. Jin (2005). ‘Genome<br />

dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery.’<br />

Nucleic Acids Res 33:6445–58.<br />

