An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>An</strong> <strong>Integrated</strong> <strong>Data</strong> <strong>An</strong>alysis <strong>Suite</strong> <strong>and</strong> <strong>Programming</strong> Framework<br />
for High-Throughput DNA Sequencing<br />
Dissertation<br />
der Mathematisch-Naturwissenschaftlichen Fakultät<br />
der Eberhard Karls Universität Tübingen<br />
zur Erlangung des Grades eines Doktors der Naturwissenschaften<br />
(Dr. rer. nat.)<br />
vorgelegt von<br />
Dipl.-Inform. Felix Ott<br />
aus Basel<br />
Tübingen<br />
2013
Tag der mündlichen Qualikation: 18.12.2013<br />
Dekan:<br />
Prof. Dr. Wolfgang Rosenstiel<br />
1. Berichterstatter: Prof. Dr. Detlef Weigel<br />
2. Berichterstatter: Prof. Dr. Daniel Huson
Erklärung<br />
Hiermit erkläre ich, dass ich diese Arbeit selbstständig und nur mit den angegebenen Hilfsmitteln<br />
angefertigt habe, und dass alle Stellen, die im Wortlaut oder dem Sinn nach <strong>and</strong>eren Werken<br />
entnommen sind, durch <strong>An</strong>gabe der Quellen kenntlich gemacht sind.<br />
Beiträge Dritter, auf die im Text eingegangen wird, sind auf der Seite Contributions<br />
aufgelistet. Des Weiteren wird auf thematische Überschneidungen mit Publikationen, an denen<br />
ich mitgewirkt habe, im Text explizit hingewiesen. Weitere Überschneidungen mit zukünftigen<br />
Publikationen sind möglich.<br />
Tübingen, 19. Dezember 2013<br />
Felix Ott
Contributions<br />
The high-throughput DNA sequencing data analysis pipeline SHORE that this work is built<br />
around is the work of a variety of contributors. The foundations to the software were laid by<br />
Stephan Ossowski <strong>and</strong> Korbinian Schneeberger, <strong>and</strong> are meticulously described in Stephan Ossowski's<br />
Ph. D. thesis Computational Approaches for Next Generation Sequencing <strong>An</strong>alysis <strong>and</strong><br />
MiRNA Target Search [1]. Concerned with a direct continuation of this work are sections 2.1, 2.3<br />
<strong>and</strong> 2.6. The prerequisites are however discussed explicitly, <strong>and</strong> further delineation is provided<br />
in section 1.7. Jörg Hagmann has supported the development with ideas <strong>and</strong> has contributed a<br />
multitude of improvements as well as modules for RNA editing analysis, bisulte sequencing data<br />
analysis <strong>and</strong> multi-reference variation calling, which are however not subject of this work. Jonas<br />
Müller has contributed RAD-Seq motivated extensions to the read demultiplexing code, which<br />
were incorporated into the methods described in section 2.4. Alf Scotl<strong>and</strong> has supported parts<br />
of the coding eort, <strong>and</strong> Sebastian Bender has contributed further extensions to the software<br />
pipeline not discussed in this work.<br />
In accordance with st<strong>and</strong>ard scientic protocol, rst-person plural pronouns will be used to refer to<br />
the reader <strong>and</strong> the writer, or to my scientic collaborators <strong>and</strong> myself.
Acknowledgment<br />
First of all I would like to thank my supervisors Detlef Weigel <strong>and</strong> Daniel Huson for supporting<br />
my thesis <strong>and</strong> the opportunity to work on countless interesting <strong>and</strong> challenging questions at the<br />
Max-Planck-Institut Tübingen. Further I am grateful to Steen Schmidt <strong>and</strong> Karsten Borgwardt,<br />
for helpful <strong>and</strong> encouraging advice given over the years in their role as members of my Thesis<br />
Advisory Committee.<br />
Jörg Hagmann, Stephan Ossowski <strong>and</strong> Korbinian Schneeberger must be singled out, but I<br />
am grateful for the excellent collaboration with all other contributors to the SHORE software.<br />
Norman Warthmann, Jun Cao, Sang-Tae Kim, Stefan Henz out of many others have been of<br />
great help by using, testing <strong>and</strong> suggesting improvements to the application in various areas.<br />
Markus Schmid, Levi Yant, David Posé Padilla, François Parcy, Edwige Moyroud, Stephan<br />
Wenkel, Ronny Br<strong>and</strong>t, Richard Immink <strong>and</strong> others deserve my gratitude for excellent ChIP-Seq<br />
collaborations, discussions <strong>and</strong> explanations.<br />
There are plenty of other people I would like to thank for outst<strong>and</strong>ing professional collaborations,<br />
advice or interesting discussions <strong>and</strong> input. Among them are Lisa Smith, Br<strong>and</strong>on Gaut,<br />
Jesse Hollister, Xi Wang, Marco Todesco, Felipe Fenselau de Felippes, Pablo Manavella, Ignacio<br />
Rubio Somoza <strong>and</strong> Christa Lanz. To Rebecca Schwab, Jörg Hagmann <strong>and</strong> Jonas Müller I am<br />
immensely grateful for doing a great job proofreading this thesis.<br />
Finally, I apologize to those I forgot.
Zusammenfassung<br />
Die Dideoxynukleotid-Kettenabbruchmethode wird seit dem Jahr 2005 durch eine<br />
Vielzahl an parallelen DNA-Sequenzierungsmethoden ergänzt. Diese haben sich als eine<br />
Schlüsseltechnologie mit einer groÿen <strong>An</strong>zahl von Einsatzmöglichkeiten in der Genetik<br />
erwiesen, stellen <strong>and</strong>ererseits mit einer Flut an erzeugten Daten auch eine Herausforderung<br />
dar. Eine Vielfalt an Algorithmen, die sich mit unterschiedlichen Aspekten der<br />
Sequenzierdaten-Verarbeitung und <strong>An</strong>alyse befassen, wurde bereits erarbeitet und implementiert.<br />
Zur routinemäÿigen <strong>An</strong>alyse dieser Daten benötigt es jedoch zudem eine optimal<br />
aufein<strong>and</strong>er abgestimmte Sammlung von <strong>An</strong>alysewerkzeugen, die alle notwendigen Schritte<br />
von der initialen Rohdatenverarbeitung bis zum Erlangen der primären <strong>An</strong>alyseergebnisse<br />
abdeckt.<br />
Mit der Software SHORE wird eine modulare, vielfältig nutzbare Datenanalyse-Lösung<br />
für die parallele DNA-Sequenzierung bereitgestellt. In der vorliegenden Arbeit werden<br />
die SHORE zugrunde liegenden Konzepte erweitert und verallgemeinert, um den Entwicklungen<br />
der neuen Sequenziermethoden Rechnung zu tragen. Diese Entwicklungen<br />
schlieÿen unter <strong>and</strong>erem einen deutlich erhöhten Datenertrag ein, sowie die Verbreitung<br />
von Protokollen, die das Multiplexen von Proben mittels spezieller Kennzeichnungssequenzen<br />
ermöglichen. Um eine breitgefächerte Unterstützung grundlegender Funktionen<br />
wie Datenkompression und indexierter Suchalgorithmen zu ermöglichen, wurde eine<br />
allgemein nutzbare C++ -Programmierumgebung <strong>lib</strong>shore entwickelt, die als gemeinsame<br />
Basis aller Datenverarbeitungsmodule in SHORE fungiert. Ein weiteres Ziel war hierbei,<br />
Programmierschnittstellen bereitzustellen, die modulares Design sowie Parallelisierung von<br />
Sequenzierdaten-Verarbeitungsalgorithmen unterstützen.<br />
Ein weiterer Schwerpunkt dieser Arbeit war die Entwicklung von <strong>An</strong>alysealgorithmen<br />
für mittels der Chromatin-Immunopräzipitation gewonnener Daten. Eine Schlüsselrolle in<br />
der Regulierung der DNA-Transkription wird von einer Klasse von DNA-bindenden Proteinen,<br />
den Transkriptionsfaktoren, eingenommen. ChIP-Seq-Studien erlauben die Darstellung<br />
von Transkriptionsfaktor-Bindestellen für das gesamte Genom, und besitzen daher<br />
groÿes Potential, zum Verständnis der Funktionsweise der komplexen regulatorischen Netzwerke<br />
beizutragen. In dieser Arbeit wird ein <strong>An</strong>alysemodul SHORE peak vorgestellt, welches<br />
zur Verarbeitung von Transkriptionsfaktor-Immunopräzipitationsdaten dient. Das Computerprogramm<br />
vereint statistische Detektion angereicherter Sequenzabschnitte mit empirischen<br />
Regeln zur Auslterung von Artefakten, um so die zuverlässige Erkennung der entscheidenden<br />
Bereiche des Genoms zu gewährleisten. Da Wiederholungsexperimente, die durch<br />
fallende Sequenzierungskosten verstärkt ermöglicht werden, ein wichtiges Mittel zur Identizierung<br />
der biologisch relevanten Sequenzbereiche darstellen, wurde zudem besonderer<br />
Wert auf die simultane Auswertung dieser Datensätze gelegt.<br />
Die entwickelten Algorithmen wurden zudem auf weitere <strong>An</strong>alysemodule übertragen,<br />
die mit dem Ziel der möglichst exiblen Kongurierbarkeit entwickelt wurden. In Kombination<br />
mit der bereits enthaltenen Funktionalität zur Detektion genomischer Varianten<br />
sollte die Software SHORE von Nutzen für einen groÿen Teil des <strong>An</strong>wendungsbereiches der<br />
neuen DNA-Sequenzierungstechnologien sein. Mit der allgemein nutzbaren Funktionalität<br />
der <strong>lib</strong>shore-Programmierumgebung sollte zudem die einfache Erweiterbarkeit im Hinblick<br />
auf zukünftige <strong>An</strong>sätze zur Datenanalyse gewährleistet sein.
Abstract<br />
The various parallel DNA sequencing methods that have been introduced since 2005 to<br />
complement the established dideoxynucleotide chain termination method have proved a key<br />
technology with a wide range of applications in genetic research, but also challenge with<br />
a constant ood of data. A multitude of algorithms have been proposed <strong>and</strong> implemented<br />
addressing dierent aspects of sequencing data processing <strong>and</strong> data analysis for a variety<br />
of high-throughput sequencing applications. Routine data analysis however requires a consistently<br />
assembled tool chain to ensure smooth end-to-end processing starting from raw<br />
sequencing read data up to primary analysis results.<br />
The software SHORE aims to be a modular, general-purpose data analysis suite for highthroughput<br />
DNA sequencing. With this work, we extend <strong>and</strong> generalize the concepts of the<br />
SHORE analysis pipeline to accommodate increased throughput of sequencing devices <strong>and</strong><br />
development of novel sequencing protocols such as bar-coded sample multiplexing. To provide<br />
universal support of basic features such as data set compression or indexing <strong>and</strong> query<br />
mechanisms, we develop a generic C++ programming framework <strong>lib</strong>shore to form the foundation<br />
of all of SHORE's modules. Furthermore, we aim to provide a generic application programming<br />
interface facilitating modular design as well as parallelization of high-throughput<br />
sequencing data processing <strong>and</strong> analysis algorithms.<br />
A further focus of this work was on the development of data analysis algorithms for chromatin<br />
immunoprecipitation <strong>and</strong> other sequence enrichment <strong>and</strong> expression proling studies.<br />
Through genome-wide binding-site proling for transcription factors, DNA-binding proteins<br />
occupying a key role in transcription regulation, ChIP-Seq studies have the potential to<br />
greatly improve the ability to underst<strong>and</strong> the functioning of complex regulatory networks.<br />
We present an analysis module SHORE peak oriented towards processing of transcription<br />
factor immunoprecipitation data. Our program combines statistical enrichment detection<br />
with empirical artifact removal rules to ensure robust identication of the most relevant<br />
sites. As replicate experiments are an important factor for the identication of biologically<br />
relevant sites <strong>and</strong> with dropping sequencing cost are rendered more feasible, emphasis was<br />
put on the simultaneous processing of such data sets.<br />
By transferring enrichment detection <strong>and</strong> expression analysis algorithms to further auxiliary<br />
modules emphasizing exibility of conguration, as well as maintaining previously<br />
available variation analysis functionality, SHORE should represent a valuable resource for<br />
the majority of high-throughput sequencing applications. Furthermore, in combination with<br />
the generic functionality of the <strong>lib</strong>shore framework, we hope to ensure extensibility to readily<br />
accommodate future analysis strategies.
Contents<br />
1 Introduction 1<br />
1.1 High-Throughput DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . 1<br />
1.2 Approaches to Parallel DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . 2<br />
1.3 Applications of High-Throughput Sequencing . . . . . . . . . . . . . . . . . . . . 3<br />
1.3.1 Genome <strong>and</strong> Transcriptome Sequencing . . . . . . . . . . . . . . . . . . . 4<br />
1.3.2 ChIP-Seq <strong>and</strong> further Immunoprecipitation Protocols . . . . . . . . . . . 5<br />
1.4 Properties of Sequencing <strong>Data</strong> . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6<br />
1.5 Sequencing <strong>Data</strong> <strong>An</strong>alysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8<br />
1.5.1 Base Calling <strong>and</strong> Read Quality Assessment . . . . . . . . . . . . . . . . . 8<br />
1.5.2 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9<br />
1.5.3 Short Read Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9<br />
1.5.4 Variation <strong>and</strong> Genotype Calling . . . . . . . . . . . . . . . . . . . . . . . . 10<br />
1.5.5 Quantication by Deep Sequencing . . . . . . . . . . . . . . . . . . . . . . 11<br />
1.5.6 ChIP-Seq <strong>An</strong>alysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />
1.6 Storage <strong>and</strong> Representation of Sequencing <strong>Data</strong> . . . . . . . . . . . . . . . . . . . 13<br />
1.6.1 Storage Formats for Next-Generation Sequencing <strong>Data</strong> . . . . . . . . . . . 13<br />
1.6.2 Accelerated Queries on Sequencing <strong>Data</strong> . . . . . . . . . . . . . . . . . . . 15<br />
1.7 Contributions of this Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
2 A High-Throughput DNA Sequencing <strong>Data</strong> <strong>An</strong>alysis <strong>Suite</strong> 19<br />
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.2 Ecient Storage of High-Throughput Sequencing <strong>Data</strong> Using Text-Based File Formats<br />
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21<br />
2.2.2 Widely Compatible Indexed Block-Wise Compression . . . . . . . . . . . 23<br />
2.2.3 Ecient Queries on Text Files . . . . . . . . . . . . . . . . . . . . . . . . 25<br />
2.2.4 Improved Compression of Read Mapping <strong>Data</strong> . . . . . . . . . . . . . . . 27<br />
2.2.5 Compression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29<br />
2.2.6 <strong>Data</strong> Storage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 32<br />
2.3 A Non-Destructive Read Filtering <strong>and</strong> Partitioning Framework . . . . . . . . . . 33<br />
2.4 A Flexible Sequencing Read Demultiplexing System . . . . . . . . . . . . . . . . 34<br />
2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
2.4.2 A Format for Description of Multiplexing Setups . . . . . . . . . . . . . . 36<br />
2.4.3 Barcode Recognition <strong>and</strong> Sample Resolution . . . . . . . . . . . . . . . . 38<br />
2.5 Versatile Oligomer Detection <strong>and</strong> Read Clipping . . . . . . . . . . . . . . . . . . 38<br />
2.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
2.5.2 Dynamic <strong>Programming</strong> Alignment <strong>and</strong> Backtracing Pipeline . . . . . . . 41<br />
xi
xii<br />
CONTENTS<br />
2.6 A Parallelization Front-End for Short Read Alignment Tools . . . . . . . . . . . 43<br />
2.7 Robust Detection of ChIP-Seq Enrichment . . . . . . . . . . . . . . . . . . . . . . 45<br />
2.7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45<br />
2.7.2 Correcting for Duplicated Sequences . . . . . . . . . . . . . . . . . . . . . 45<br />
2.7.3 Detection Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47<br />
2.7.4 Recognition of Read Mapping Artifacts . . . . . . . . . . . . . . . . . . . 47<br />
2.7.5 Ranking <strong>and</strong> Assessment of Peak Signicance . . . . . . . . . . . . . . . . 49<br />
2.7.6 Experimental Relevance of Peak Signicance . . . . . . . . . . . . . . . . 51<br />
2.8 Visualization of Sequencing Read <strong>and</strong> Alignment <strong>Data</strong> . . . . . . . . . . . . . . . 52<br />
2.8.1 Gathering Run Quality <strong>and</strong> Sequence Composition Statistics . . . . . . . 52<br />
2.8.2 Visualization of Read Mapping <strong>Data</strong> . . . . . . . . . . . . . . . . . . . . . 55<br />
2.8.3 Visualization of Local or Genome-Scale Depth of Coverage . . . . . . . . 56<br />
2.8.4 Quantile Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />
3 A C++ Framework for High-Throughput DNA Sequencing 61<br />
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61<br />
3.2 A Modular Signal-Slot Processing Framework . . . . . . . . . . . . . . . . . . . . 64<br />
3.2.1 Reader <strong>and</strong> Writer Concepts . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />
3.2.2 Denition of Processing Network Topology . . . . . . . . . . . . . . . . . 64<br />
3.2.3 Module Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67<br />
3.2.4 In-Place <strong>Data</strong> Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . 69<br />
3.3 A Simplied Parallelization Interface . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
3.3.1 Parallelization Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72<br />
3.3.2 Parallel Pipeline Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 73<br />
4 Closing Remarks 77<br />
Bibliography 81
1 Introduction<br />
Next-Generation Sequencing has evolved into a powerful tool for many areas of biological<br />
science, but has also introduced many new challenges related to data analysis<br />
<strong>and</strong> computing infrastructure into the eld. Following a brief recapitulation of highthroughput<br />
DNA sequencing technology, this chapter highlights important areas of<br />
application while discussing previous <strong>and</strong> related work. We outline general sequencing<br />
data analysis methodology <strong>and</strong> conclude by establishing the motivation for this<br />
work.<br />
1.1 High-Throughput DNA Sequencing<br />
Widely deployed for less than a decade, the impact of parallel next-generation sequencing technology<br />
on genetic research has already been considerable. The novel high-throughput sequencing<br />
approaches are characterized by massive parallelism implemented in a single device. From this<br />
design results the key advantage of a signicantly reduced cost per sequenced base, which has<br />
allowed to quickly displace the previously predominant Sanger sequencing method in many areas<br />
of application.<br />
Sanger sequencing, the prevalent sequencing method for the majority of time since its introduction<br />
in 1977 [2], needed to rely on extensive automation in dedicated sequencing centers to<br />
achieve time <strong>and</strong> cost eective readout of large amounts of sequence. Most prominently, this was<br />
put into eect during the eort of generating the rst draft sequence of the human genome [36].<br />
By contrast, high-throughput sequencing instruments are designed to analyze thous<strong>and</strong>s to millions<br />
of DNA molecules simultaneously, <strong>and</strong> thus enable even smaller institutions to produce<br />
large quantities of sequence data on site.<br />
Secondary to elucidation of DNA primary structure, sequencing utilized as a r<strong>and</strong>om sampling<br />
device delivers quantitative clues on sample composition. In this capacity, it has been<br />
used for diverse purposes such as inference of DNA methylation levels [7] or the analysis of environmental<br />
samples [8]. In concert with the economical advantage over the dideoxynucleotide<br />
chain-termination method, parallel DNA sequencing becoming widely accessible to researchers<br />
has served to promote such deep sequencing approaches that open up whole new areas of application<br />
beyond the domain previously occupied by Sanger sequencing.<br />
As an alternative to microarray technologies, for example in gene expression proling or<br />
chromatin immunoprecipitation assays, deep sequencing overcomes fundamental technological<br />
restrictions such as probe resolution <strong>and</strong> probe saturation. Furthermore, without the inherent<br />
requirement of a-priori knowledge of sequences to be detected or quantied, development of novel<br />
protocols of application has been furthered to transform high-throughput sequencing methods<br />
into a versatile tool for research.<br />
1
2 CHAPTER 1. INTRODUCTION<br />
1.2 Approaches to Parallel DNA Sequencing<br />
<strong>An</strong> early parallel sequencing method, massively parallel signature sequencing (MPSS), was published<br />
in the year 2000 [9] <strong>and</strong> had already been available as a centralized, commercial service.<br />
However, as its short read length of only 17 to 20 nucleotides proved to be prohibitive in most<br />
potential areas of application, MPSS mainly found use as a gene expression proling assay.<br />
A technological <strong>and</strong> commercial breakthrough was marked by the deployment of integrated<br />
bench top devices through a variety of dierent providers. The rst high-throughput sequencing<br />
instrument brought to the market was the 454 Life Sciences Genome Sequencer FLX in 2005 [10].<br />
Soon after, in 2006, followed Illumina's Genome <strong>An</strong>alyzer [11] instrument <strong>and</strong> the Life Technology<br />
SOLiD system [12] in 2008.<br />
Over a short period of time, considerable technological maturation has ensued, <strong>and</strong> vendors<br />
have started to broaden their portfolios by tailoring their respective solutions to dierent operational<br />
scenarios, exemplied by the Illumina HiSeq <strong>and</strong> MiSeq devices or the 454 Life Sciences<br />
bench top model GS Junior. On the other h<strong>and</strong>, diverse alternative approaches have been proposed<br />
<strong>and</strong> introduced into the market, notably a semiconductor-based solution marketed as Ion<br />
Torrent by Life Technology [13] <strong>and</strong> the Pacic Biosciences single molecule sequencing system<br />
PacBio RS [14].<br />
As their dening feature, all high-throughput sequencing systems share the parallel interrogation<br />
of large numbers of DNA molecules. Sample DNA to be analyzed is typically applied to<br />
an expendable ow cell or ow chip, a specialized glass slide or microtiter plate particular to<br />
the respective technology. Aside from such fundamental analogies, the respective methods of sequence<br />
interrogation dier in key aspects. The basis of most current technologies is formed by the<br />
sequencing-by-synthesis (SBS) principle, with the notable exception of the SOLiD sequencingby-ligation<br />
method. Sequencing-by-synthesis decodes the sequence of DNA molecules by keeping<br />
track of nucleotides incorporated by a DNA polymerase during complementary str<strong>and</strong> synthesis,<br />
with the distinguishing feature of the dierent technological implementations being the manifold<br />
methods employed for detecting events of nucleotide incorporation.<br />
The 454 family of sequencers realize a pyrosequencing approach. The four deoxyribonucleoside<br />
triphosphates (dNTP) adenine, cytosine, guanine <strong>and</strong> thymine are owed cyclically<br />
over microtiter wells holding clonally amplied DNA templates, where each ow of reagents<br />
delivers a specic type of dNTP. The amount of pyrophosphate released during nucleotide<br />
incorporation, which is determined by the number of nucleotides incorporated in each ow, is<br />
converted into light intensity through an enzymatic reaction <strong>and</strong> recorded by a CCD camera. [10]<br />
The Illumina systems are based on reversible chain termination chemistry. A mixture of<br />
all four dNTPs is owed over a slide with r<strong>and</strong>omly distributed clusters of clonally amplied<br />
templates. The dNTPs are modied by a chain terminator, limiting elongation of the synthesized<br />
str<strong>and</strong> to a single nucleotide. Furthermore, each type of dNTP is distinguished by fusion to a<br />
specic uorophore, allowing for identication of the respective incorporated base by means<br />
of laser excitation <strong>and</strong> a CCD detector. Reagents subsequently applied displace the reversible<br />
terminator <strong>and</strong> dye to enable further str<strong>and</strong> elongation in the following cycle. [11]<br />
The Ion Torrent design adopts the homopolymer detection approach introduced by pyrosequencing.<br />
However, instead of pyrophosphate, the release of hydrogen ions serves as a proxy for<br />
str<strong>and</strong> elongation. The individual pH measurements are realized non-optically on an expendable<br />
semiconductor chip. [13]<br />
PacBio RS sequencers implement a technology termed single molecule real-time sequencing<br />
(SMRT). The method is enabled by zero-mode waveguides (ZMW), nanostructures that enable<br />
probing of volumes smaller than the wave length of light [15]. In contrast to the other current<br />
technologies described, the SMRT approach therefore does not need to rely on clonally amplied
1.3. APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING 3<br />
DNA templates, <strong>and</strong> instead single DNA molecules <strong>and</strong> polymerases suce for obtaining a recognizable<br />
signal. DNA polymerases are immobilized in ZMW detection volumes, creating a slide<br />
suitable for the parallel observation of polymerase activity. Fluorescently labeled dNTPs are employed<br />
as the substrate, where nucleotides being incorporated are detected via laser excitation<br />
<strong>and</strong> a CCD detector. The uorophore itself is displaced upon nucleotide incorporation. [14]<br />
As opposed to sequencing-by-synthesis reliant on DNA polymerase, SOLiD sequencing constitutes<br />
a sequencing-by-ligation technique that exploits DNA ligase enzymes for sequence decoding.<br />
Probe oligomers partially consisting of degenerate <strong>and</strong> universal bases [16] are ligated to interrogate<br />
di-nucleotides at a spacing of three base pairs. When reaching the full read length, the<br />
template is reset <strong>and</strong> the ligation process is repeated ve times at dierent osets, <strong>and</strong> thus<br />
each of the template's bases is measured twice through dierent probe oligomers. Four dierent<br />
uorophores serve to label the 16 types of probe to obtain an ambiguous color space code. The<br />
properties of the dibase color encoding are such that for any base, the ambiguity may be resolved<br />
as long as the identity of the preceding base is known. [12]<br />
The following discussion centers on Illumina sequencing technology as the primary focus of<br />
the present work, although applicable to varying extent to multiple or all of the currently<br />
available high-throughput sequencing devices.<br />
1.3 Applications of High-Throughput Sequencing<br />
While high-throughput sequencing devices in general excel in the cost-eective generation of massive<br />
amounts of sequence data, the new technologies are still associated with weaknesses regarding<br />
error rates <strong>and</strong> maximal achievable sequencing read length (section 1.4). These weaknesses<br />
limit the capacity to generate base-by-base readouts of entire genome sequences. Nonetheless,<br />
concerted application of complementary technologies including Sanger sequencing has served to<br />
greatly accelerate the pace at which novel reference assemblies for manifold model organisms are<br />
being generated.<br />
Certainly, despite the reduction in cost-per-base that has been achieved, decoding larger,<br />
repeat-rich genome sequences remains a massive task with current DNA sequencing technology.<br />
Genomic inter-individual <strong>and</strong> inter-species variation may however be assessed circumventing the<br />
issue of global-scale sequence topology (section 1.3.1), which has thus evolved as one of the<br />
primary applications of high-throughput sequencing. In turn, interpretation of data generated<br />
with the aim of such consensus-based characterization of sequence variation is greatly facilitated<br />
by the availability of a suitable reference genome assembly.<br />
Similarly, the resequencing scenario facilitates data interpretation for an extensive family of<br />
deep sequencing <strong>and</strong> genome-wide screening applications. As throughput of sequencing devices<br />
has grown to enable routine sequencing even of large genomes to multiple depths of coverage, its<br />
potential for quantitative measurement has been leveraged for the assessment of diverse genomic<br />
properties. With stock genome sequencing protocol, deep sequencing allows estimation of quantiable<br />
statistics such as gene copy number [1719], application to forward genetic screens [20, 21]<br />
or dissection of environmental samples [2224]. To exploit its potential for quantication in a<br />
wide range of further research settings, high-throughput sequencing is complemented by a host<br />
of dierent sample <strong>and</strong> <strong>lib</strong>rary preparation protocols acting as adapters between the property of<br />
interest <strong>and</strong> DNA sequencing technology.<br />
RNA conversion protocols facilitate investigation of transcript structure <strong>and</strong> abundance<br />
[2529] <strong>and</strong> address structure, expression <strong>and</strong> distribution of specic types of transcript<br />
such as long non-coding RNA or small RNA [27, 3033]. In ChIP-Seq, MeDIP-Seq or HITS-<br />
CLIP, immunoprecipitation-based sequence enrichment protocols provide insight into diverse
4 CHAPTER 1. INTRODUCTION<br />
genetic <strong>and</strong> epigenetic mechanisms including transcription factor binding events, histone <strong>and</strong><br />
DNA modications or specicity of RNA binding proteins [17, 3438].<br />
Conversely, depletion protocols such as MNase-Seq are for example applicable to investigation<br />
of nucleosome positioning [39]. <strong>An</strong> additional option for the study of epigenetic features by highthroughput<br />
sequencing is available in sequence conversion protocols utilizing bisulte treatment<br />
of the sample DNA (BS-Seq) [27, 40, 41].<br />
Further protocols aiming at a reduction of <strong>lib</strong>rary complexity can supply signicantly reduced<br />
cost of sequencing for certain applications, such as retrieval of genetic markers through sequencing<br />
of restriction site associated DNA (RAD-Seq) [42, 43].<br />
By st<strong>and</strong>ard sample preparation protocols, a molecular mixture retrieved from millions of<br />
cells is assessed. This property is exploited de<strong>lib</strong>erately for example by genomic enrichment or<br />
epigenetic sampling protocols such as ChIP-Seq as well as BS-Seq. On the other h<strong>and</strong>, single-cell<br />
sequencing protocols are being developed to enable the maximally ne-grained assessment of<br />
cellular populations [44].<br />
While each sequencing protocol has on its own furthered our underst<strong>and</strong>ing of the molecular<br />
makeup of cells, integration <strong>and</strong> correlation of the dierent types <strong>and</strong> sources of data now available<br />
shows great promise to provide detailed insight into the complex interplay of the various<br />
mechanisms governing the cellular machinery.<br />
1.3.1 Genome <strong>and</strong> Transcriptome Sequencing<br />
Molecular genetics studies how an organism's genome interacts with its cellular environment,<br />
ultimately shaping the organism's phenotype inside its habitat. Learning how this interaction<br />
can be inuenced may ultimately help to develop new capabilities to counteract diseases or<br />
enhance traits like disease resistance or crop yield. <strong>An</strong>alysis of the DNA-level dierences between<br />
organisms, regardless of whether they are of the same species or not, presents an entry point<br />
to obtaining clues on this interaction. With the emergence of next-generation sequencing, for<br />
the rst time researchers are able to employ this approach on the very large scale, exemplied<br />
by a variety of major sequencing projects both on the intra- <strong>and</strong> inter-species level like the<br />
1000 genomes project [45, 46] for the human genome, the 1001 genomes project [19] for the<br />
Arabidopsis thaliana genome, the Genome 10K project [47] with the goal of analyzing 10000<br />
vertebrate genomes or the metagenomic human microbiome project [48, 49].<br />
Due to the short lengths of the sequence reads recovered by most high-throughput sequencers,<br />
most studies circumvent the complex task of whole genome assembly focusing on small scale sequence<br />
variants or localized larger scale rearrangements (section 1.5.4). These genetic markers<br />
thus recovered can be processed further allowing application to a wide range of scientic questions.<br />
Recovered on population scale, they provide insight into a species' population structure<br />
<strong>and</strong> the evolution of its genomes <strong>and</strong> of its subpopulations. When integrated with phenotype<br />
data, functional clues may further be obtained by means of genome wide association studies<br />
(GWAS) (e. g. [50]).<br />
Knowledge of gene expression proles proves valuable in many dierent contexts, <strong>and</strong> may on<br />
the one h<strong>and</strong> provide a means for the classication of entire cells, as well as on the other h<strong>and</strong> of<br />
certain genes or groups of genes with respect to their involvement in specic cellular pathways or<br />
function. While gene expression has been routinely assessed on the large scale using microarray<br />
technology, the method suers from non-linear measurement, probe saturation <strong>and</strong> suboptimal<br />
signal-to-noise ratio.<br />
High-throughput sequencing by contrast allows basically unlimited dynamic range only<br />
bounded by allocatable eort <strong>and</strong> amounts of cellular material. Other than microarrays,
1.3. APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING 5<br />
sequencing further enables recovery <strong>and</strong> quantication of unexpected transcripts <strong>and</strong> splice<br />
forms.<br />
A variety of dierent transcriptome sequencing protocols are available, distinguished by their<br />
suitability for assessment of transcript structure <strong>and</strong> quantication as well as capabilities to<br />
detect certain types of transcript <strong>and</strong> to recover transcript directionality. As with all examples<br />
of quantication by sequencing, data interpretation may be intricate (section 1.5.5).<br />
A special class of transcript is given by small RNA, short cellular RNA molecules ∼2025<br />
nucleotides in length ubiquitous in almost all well-studied eukaryotes. These have been identied<br />
as another vital factor for transcript regulation, <strong>and</strong> are further implicated in various important<br />
cellular processes such as DNA methylation <strong>and</strong> pathogen response [51, 52]. Transcript levels<br />
are aected by small RNA through RNA interference (RNAi). As the specicity-determining<br />
component of RNA induced silencing complexes (RISC), they direct the degrading enzymatic<br />
activity of ARGONAUTE proteins at the targeted, roughly complementary transcripts.<br />
Small RNA molecules are attributed to multiple cellular pathways in plants <strong>and</strong> animals generally<br />
involving double str<strong>and</strong>ed RNA or RNA stem-loop structures. According to the generating<br />
biochemical pathway, small RNA can be categorized into dierent classes, notably microRNA<br />
(miRNA) <strong>and</strong> small interfering RNA (siRNA), linked with overlapping, but distinct cellular roles.<br />
High-throughput small RNA proling by sequencing facilitates genome-wide discovery, categorization<br />
<strong>and</strong> relative quantication for the various small RNA species. However, small RNA<br />
quantication is sensitive to sequence specic bias [53], with their short length likely introducing<br />
increased variance to base composition bias (section 1.4) <strong>and</strong> emphasizing the importance of ligation<br />
biases. Furthermore, relative quantication with reference to <strong>lib</strong>rary size is often deemed<br />
unsuitable for an investigation, <strong>and</strong> consensus on appropriate normalizing transformations is currently<br />
lacking (cf. [54]). Nonetheless, small RNA sequencing data prove a valuable quantitative<br />
resource on the genome scale, e. g. as a proxy for RNA-directed DNA methylation [55] or in<br />
correlation with further transcriptome or epigenetic data.<br />
1.3.2 ChIP-Seq <strong>and</strong> further Immunoprecipitation Protocols<br />
Expression of genes is thought to be connected in complex regulatory networks of positive <strong>and</strong><br />
negative feedback. A key role in these networks is occupied by transcription factors, DNA-binding<br />
proteins involved with initiation of gene transcription at the promoter sequences. ChIP-Seq, the<br />
successor of the microarray-based ChIP-chip technology featuring improved signal-to-noise ratio<br />
<strong>and</strong> sharper signal bounds, facilitates the identication of putative transcription factor targets<br />
in a relevant in-vivo context. By elucidation of individual transcription factors' targets, our<br />
underst<strong>and</strong>ing of the complex transcriptional interplay can be gradually improved.<br />
Immunoprecipitation (IP) methods exploit the antibody-antigen anity to enrich specic<br />
molecules in a solution. In the course of the procedure, antibodies for the antigen to be concentrated<br />
are incubated with the solution to promote formation of antibody-antigen complexes.<br />
<strong>An</strong>tibodies are captured on solid-phase beads, thus enabling precipitation to enrich for the formed<br />
complexes.<br />
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) employs<br />
the IP technique for in-vivo detection of regions of co-localization of a specic molecule, usually a<br />
protein, with the genomic DNA. During the procedure, live tissue is prepared with formaldehyde<br />
or an appropriate alternative cross-linking reagent. Formaldehyde undergoes a reaction that<br />
induces cross-linking of amino groups in close vicinity, thereby promoting formation of DNAprotein<br />
complexes (DPC) (cf. Barker et al. [56]). Following lysis of the tissue, the DNA is<br />
sonicated, yielding DPC associated with only short fragments of genomic DNA. The DPC are<br />
immunoprecipitated <strong>and</strong>, by purication <strong>and</strong> amplication of the contained DNA, prepared as
6 CHAPTER 1. INTRODUCTION<br />
Instrument<br />
Max. Average Read Pairs / Run Time<br />
Read Length Flowcell<br />
Illumina MiSeq 2x250bp 16M 39h<br />
Illumina Genome <strong>An</strong>alyzer IIx 2x150bp 300M 14 days<br />
Illumina HiSeq 2x100bp 1400M 11 days<br />
454 Life Sciences GS Junior ∼400bp 100000 10h<br />
454 Life Sciences GS FLX+ ∼700bp 1M 23h<br />
Life Technologies Ion PGM up to 400bp 1000005M 3.7h7.3h<br />
Life Technologies Ion Proton up to 200bp ∼70M 2h4h<br />
Life Technologies 5500W (SOLiD) 1x75bp / 2x50bp ∼1600M 6 days<br />
Pacic Biosciences PacBio RS ∼4.5kbp ∼22000 2h<br />
Table 1.1: DNA Sequencer Specications According to Device Vendors (3/2013)<br />
a sequencing <strong>lib</strong>rary. Following deep sequencing, thus precipitated tags facilitate identication<br />
<strong>and</strong> localization of putative protein binding sites.<br />
Immunoprecipitation is complemented by techniques that seek to deplete segments of the<br />
DNA that are not protein-bound, e. g. using exonucleases like micrococcal nuclease (MNase) to<br />
digest the unbound parts of the DNA fragments following cross-linking <strong>and</strong> sonication. MNase-<br />
Seq is used on its own to investigate nucleosome positioning [39, 57], but can also be combined<br />
with ChIP-Seq for improved binding site resolution.<br />
When available, immunoprecipitation of the DPC is achieved using a specic antibody raised<br />
against the protein of interest. Alternatively, the experiment is performed using transgenic<br />
organisms where the protein is tagged by an appropriate adapter protein, e. g. GFP [58], for<br />
which a high quality antibody is available. However, this fusion protein approach comes with<br />
the drawback of possible tag-induced alteration of protein function or binding anity.<br />
Apart from its application to the identication of transcription factor targets [34, 35], ChIP-<br />
Seq is most prominently applied for genome-wide screening of histone modications [36]. Using<br />
5-methylcytosine-specic antibodies, immunoprecipitation is further applicable for generating<br />
genome-wide maps of DNA methylation (MeDIP-Seq) [37].<br />
RNA analogues to the ChIP-Seq strategy, high-throughput sequencing of RNA isolated<br />
by crosslinking immunoprecipitation (HITS-CLIP) [38] <strong>and</strong> photoactivatable-ribonucleosideenhanced<br />
crosslinking <strong>and</strong> immunoprecipitation (PAR-CLIP) [59] are used in a similar<br />
fashion to investigate in-vivo binding of RNA associates like e. g. the miRNA binding protein<br />
ARGONAUTE.<br />
1.4 Properties of Sequencing <strong>Data</strong><br />
Modications to both the instruments themselves as well as to the sequencing chemistry utilized<br />
have led to vast improvements in per-run output since the rst iteration of next generation sequencers,<br />
which has frequently drawn comparisons to the famous Moore's Law of the development<br />
of integrated circuits [60, 61].<br />
A further reduction of overall sequencing cost remains a primary objective in the further<br />
development of the technologies, often illustrated by the catch phrase 1000 dollar [human]<br />
genome [13, 62]. Eorts directed at this goal include the ongoing pursuit of new sequencing<br />
technologies as well as streamlining key operational aspects such as ease of <strong>lib</strong>rary preparation,<br />
run time <strong>and</strong> quantization of throughput i. e. the minimal amount of sequence that must<br />
be generated per sequencing run to achieve acceptable cost-per-base. However, with dropping
1.4. PROPERTIES OF SEQUENCING DATA 7<br />
cost-per-base further dierentiated considerations including data h<strong>and</strong>ling <strong>and</strong> analysis cost as<br />
well as the unique capabilities of either respective technology are increasingly moving into focus.<br />
Dierences between the various technologies also imply dierent behavior regarding important<br />
parameters such as read length, frequency, type <strong>and</strong> r<strong>and</strong>omness of sequencing errors or sequence<br />
dependent depth-of-coverage biases. With the targeted eld of application, the impact of either<br />
property varies. Table 1.1 provides a comparison of read length, throughput <strong>and</strong> run time of<br />
sequencing instruments. Despite specications of the young technologies being in constant ux,<br />
the overview highlights the technology's strengths <strong>and</strong> weaknesses discussed in the following.<br />
Reversible chain terminator as well as sequencing-by-ligation approaches enforce in-phase<br />
measurement of all DNA templates being sequenced. Therefore, these devices provide a xed,<br />
user selectable read length. Albeit currently providing the highest level of parallelism <strong>and</strong> of<br />
per-run throughput in terms of sequenced bases, they oer a signicantly shorter maximal read<br />
length compared to the other methods discussed.<br />
Sequencing-by-synthesis methods that as in the case of 454 <strong>and</strong> Ion Torrent instruments<br />
imply measurement of the homopolymer sequence of the respective DNA templates generate sequence<br />
reads of variable length determined by the actual nucleotide composition of the molecules.<br />
Whereas minimum read length equals the number of sequencing cycles, i. e. one quarter of the<br />
number of dNTP ows, the average read length thus depends on the type of DNA being sequenced.<br />
By contrast, no phasing of str<strong>and</strong> elongation is enforced by the single molecule real time<br />
sequencing approach, neither on single base nor on the homopolymer level. The method instead<br />
attempts near-continuous tracking of the str<strong>and</strong> elongation process. SMRT devices produce<br />
variable length reads, with the read length largely dependent on the binding anity of the<br />
polymerase enzyme utilized by the technology. The method currently achieves the longest average<br />
read length, with however a broad spectrum of read lengths <strong>and</strong> below-average level of parallelism.<br />
The dideoxynucleotide chain-termination sequencing method has undergone a considerable<br />
time span of continued renement. Therefore, although next-generation instruments are improving<br />
steadily with more <strong>and</strong> more sophisticated chemistry <strong>and</strong> signal processing, Sanger sequencing<br />
technology still denes the gold st<strong>and</strong>ard in terms of single read accuracy. Apart from stochastic<br />
measurement uctuations, all present technologies feature an error prole with an increase in<br />
error frequency towards the end of the read. Typical sources of systematic decline in sequence<br />
quality include deterioration of the sequencing reagents over time or loss of phasing for methods<br />
relying on clonally amplied DNA. Orthogonally, overall read quality is inuenced by factors like<br />
cross-talk between adjacent spots, mixed clusters or optical distortion. [63]<br />
Due to phased probing of the template DNA, with the Illumina <strong>and</strong> SOLiD instruments<br />
measurement error mostly induces base substitution errors. Conversely, 454 <strong>and</strong> Ion Torrent<br />
instruments are primarily susceptible to over- or underestimation of homopolymer lengths, which<br />
results in an elevated frequency of base insertion <strong>and</strong> deletion errors. SMRT suers from an<br />
elevated frequency of nucleotide deletion errors.<br />
All platforms including Sanger sequencing are furthermore subject to sequence specic biases,<br />
where both the probability of sampling specic sequences as well as single base accuracy can be<br />
strongly inuenced by the local sequence context. Besides the implications for sequencing data<br />
analysis <strong>and</strong> interpretation, uneven error rates <strong>and</strong> sampling probabilities result in an increase<br />
of required average sequence coverage <strong>and</strong> thus elevated cost-per-base.<br />
Sampling bias may be introduced not only during the actual steps of the sequencing procedure,<br />
but also the weakly st<strong>and</strong>ardized sequencing <strong>lib</strong>rary preparation, with e. g. PCR amplication an<br />
important <strong>and</strong> frequently cited source of bias [64]. Sequencing bias is classied either according<br />
to the introducing reaction or step of the <strong>lib</strong>rary preparation, such as ligation or fragmentation<br />
bias, or according to the sequence properties it is attributed to, such as base composition or end
8 CHAPTER 1. INTRODUCTION<br />
biases.<br />
Since quality <strong>and</strong> strength of these biases are dependent on an only vaguely understood<br />
interplay of variable factors like fragmentation size <strong>and</strong> exact PCR conditions, they are not<br />
easily corrected for <strong>and</strong> must be considered during data analysis <strong>and</strong> interpretation of results.<br />
1.5 Sequencing <strong>Data</strong> <strong>An</strong>alysis<br />
While data analysis strategies for dierent applications of high-throughput sequencing naturally<br />
diverge at some point following the initial base calling of the read sequences, they frequently share<br />
common data processing tasks or generic approaches such as reference genome-based resequencing.<br />
In addition to software solutions oered by device vendors or other commercial providers,<br />
countless computational utilities are developed by the scientic community implementing specic<br />
analysis strategies or prerequisites of generic analysis approaches such as read mapping.<br />
Following the base calling step in which nucleotide sequences are deduced from signal intensities<br />
delivered by the sequencing device, raw sequence data undergo quality control <strong>and</strong> further<br />
processing according to parameters dictated by the respective application <strong>and</strong> analysis strategy.<br />
Subsequent strategies diverge into dierent classes such as reference sequence-guided resequencing<br />
approaches, de-novo sequence analysis such as genome assembly as well as reference-free<br />
multi-sample comparison for diverse purposes such as e. g. assessment of variation or microRNA<br />
content. Primarily motivated by the relatively short read lengths oered by current sequencing<br />
technology, reference sequence-guided resequencing remains the most widely used <strong>and</strong> practical<br />
method (section 1.5.3).<br />
1.5.1 Base Calling <strong>and</strong> Read Quality Assessment<br />
Base calling is the initial step of deducing an actual nucleotide sequence from the raw signal<br />
measurements delivered by sequencing machines, e. g. the brightness levels measured by the<br />
CCD detector for uorescence-based techniques. Parallel sequencers operate by sampling the<br />
entire measurement area supporting large numbers of DNA molecules (e. g. ow cell) at certain<br />
intervals (section 1.2). Instrument measurements initially are stored as a series of images, where<br />
each image represents the signal intensities across the measurement area at a single sampling<br />
interval. The stack of images generated is subsequently processed in software.<br />
In short, base calling software is required to locate spots within the measurement area that<br />
represent valid DNA templates, <strong>and</strong> further analyze the run of signal intensities at each spot<br />
to provide a nucleotide sequence that likely explains the observed intensities. In addition to<br />
this maximum likelihood estimate of a sequencing read's nucleotide sequence, a measure of<br />
reliability for the issued base calls is desirable. Due to their conceptual simplicity, PHREDlike<br />
per-base quality scores [65, 66] have become the de facto st<strong>and</strong>ard across all platforms.<br />
Additional measures that capture the peculiarities of the respective sequencing process, like e. g.<br />
individual base likelihood in the case of reversible chain terminator sequencing or homopolymer<br />
accuracy in the case of pyrosequencing, can be calculated by specic base callers but are widely<br />
disregarded by downstream analysis software.<br />
Base calling software is usually highly device specic <strong>and</strong> thus provided by the instrument<br />
vendors; alternative software solutions exist in some cases [6771], but have not become widely<br />
adopted.<br />
Quality of high-throughput sequencing reads can be highly variable <strong>and</strong> is dependent on many<br />
dierent factors (section 1.4). Therefore, prior to further analysis it is often desirable to assess<br />
the quality of the generated sequencing reads, <strong>and</strong> to pre-lter the raw read data accordingly.<br />
Furthermore, depending on the sequencing protocol <strong>and</strong> application, the relevant sequences are
1.5. SEQUENCING DATA ANALYSIS 9<br />
often fused to utility oligomers like bar codes, sequencing adapters or linker sequences that are<br />
still present in the sequencer's output <strong>and</strong> interfere with direct application of the appropriate<br />
analysis algorithm. Countless tools <strong>and</strong> scripts have been developed for these purposes <strong>and</strong> are<br />
in some cases made freely available.<br />
FastQC [72] for example is a simple Java application providing visualization for average read<br />
quality <strong>and</strong> various other properties of the data set. The FASTX Toolkit [73] provides tools for<br />
creating sequence quality charts, adapter removal <strong>and</strong> demultiplexing operations. TagDust [74]<br />
is a tool that identies <strong>lib</strong>rary artifacts such as sequencing primer dimers. Similar capabilities<br />
are oered by the tool TagCleaner [75], with additional read clipping, trimming <strong>and</strong> splitting<br />
options. A collection of Perl scripts for read ltering, quality-based trimming <strong>and</strong> creating read<br />
quality charts is provided by the NGS QC Toolkit [76].<br />
1.5.2 Assembly<br />
While the classical domain of large-scale DNA sequencing is the assembly of entire genomes,<br />
due to their short read length most high-throughput sequencing technologies do not produce<br />
data optimally suited to this approach. Computational tools based on alternative assembly<br />
algorithms <strong>and</strong> data structures such as De Bruijn graphs have been developed that can better<br />
cope with large amounts of short reads, for example the Velvet [77], Oases [78], SOAPdenovo [79],<br />
ALLPATHS [80, 81] <strong>and</strong> ALLPATHS-LG [82, 83] assemblers. Despite these improvements, stock<br />
short read assemblies still do not reach the quality of typical laboriously generated reference<br />
genomes [84]. Generating high quality reference assemblies continues to require enormous eort<br />
often involving construction of BAC <strong>lib</strong>raries <strong>and</strong> large amounts of man-hours both on the wet<br />
lab <strong>and</strong> data analysis sides for closing gaps in the assembly, ordering the assembled scaolds or<br />
identication of mis-assembled regions.<br />
These remaining issues prevent large-scale multiple whole-genome comparison approaches,<br />
<strong>and</strong> as a consequence, solutions to data representation <strong>and</strong> methodology for this type of analysis<br />
are not yet well established. In the medium term, sequencing technology can be expected to<br />
reach a sucient mixture of sequencing cost, throughput, read length <strong>and</strong> accuracy to render<br />
whole-genome assembly a rather routine task. For genome-wide variation analysis, this should<br />
bring about a shift towards such large-scale assembly approaches.<br />
1.5.3 Short Read Alignment<br />
While direct genome assembly remains impractical, most applications of next-generation sequencing<br />
are analyzed through resequencing approaches that rely on the availability of a suitable<br />
reference genome sequence. Relating all data to a single assembled genome circumvents the issue<br />
of assembly, facilitates comparisons by establishing a common coordinate system <strong>and</strong> allows to<br />
examine analysis results in the context of a fully rened genome annotation.<br />
Mapping millions of short reads to reference genomes up to multiple gigabases in size constitutes<br />
one of the computationally most dem<strong>and</strong>ing tasks of the resequencing strategy. In the<br />
presence of sequencing errors, erroneous reference sequences <strong>and</strong> possibly genomic variation such<br />
as SNPs, indels <strong>and</strong> rearrangements, perfect match queries between read <strong>and</strong> reference sequence<br />
cannot in general deliver satisfactory results.<br />
The read alignment problem presents a string matching task that has quickly been met<br />
by a variety of proposed algorithms <strong>and</strong> software solutions, confronting the scientic community<br />
with a multitude of freely available alignment tool options like BWA [85], Bowtie [86],<br />
Bowtie2 [87], SOAP [88], SOAP2 [89], SHRiMP [90], SHRiMP2 [91], MAQ [92], GenomeMapper<br />
[93], PALMapper [94] or CloudBurst [95].
10 CHAPTER 1. INTRODUCTION<br />
To achieve satisfactory performance, alignment tools rely upon precomputed, static index<br />
data structures usually generated for the reference genome, ranging from simple k-mer hashes<br />
or spaced-seed strategies to sux trees, sux arrays or equivalent compressed full text indexes<br />
such as the Burrows-Wheeler transform (BWT) based FM-index [96]. The genome indexing<br />
approach has implications regarding both performance <strong>and</strong> hardware requirements of an alignment<br />
algorithm. Disk space <strong>and</strong> index memory required for k-mer indexes, sux trees <strong>and</strong> plain<br />
sux arrays amount to a multiple of the size of the indexed nucleotide sequence. In contrast,<br />
compressed BWT-based indexes have the advantage of an inherent representation of the genome<br />
sequence, with their size approaching that of the compressed sequence.<br />
Dierent index data structures aside, read mapping algorithms are set apart by their choice of<br />
trade-os between mapping speed, accuracy, sensitivity <strong>and</strong> supported alignment parameters. At<br />
the same time, mapping accuracy is partially inuenced by mapping sensitivity, since a missed<br />
mapping location for a read may lead to spurious mappings to a dierent location of similar<br />
sequence. Read mapping algorithms commence by identication of potential origins of seeds,<br />
short substrings of the sequence read matching the reference genome sequence at high sequence<br />
identity. The tradeo between mapping speed on the one side <strong>and</strong> accuracy <strong>and</strong> sensitivity on the<br />
other is greatly inuenced by seed selection, seed length as well as the required level of sequence<br />
identity. Tools may decide to ignore seeds that occur with very high frequency in the reference<br />
genome, further accelerating the mapping process at the cost of sensitivity.<br />
Following seed placement, follow-up alignment algorithms establish the actual base pairings<br />
between sequencing read <strong>and</strong> reference sequence, in the process identifying the best possible<br />
match for the entire read according to the scoring measure dened by the mapping tool. Furthermore,<br />
many mapping utilities in addition allow the explicit or implicit retrieval of the next-best<br />
mapping location, permitting to infer the reliability of the assigned location (mapping quality)<br />
under the assumption of completeness <strong>and</strong> correctness of the provided reference genome<br />
sequence.<br />
The set of features supported by mapping <strong>and</strong> alignment algorithms decides over their suitability<br />
for dierent kinds of sequencing data or application. However, omission of certain features<br />
may help with optimization of the algorithm. Gapped alignments for example come with a significant<br />
performance hit, but are indispensable for application in variation detection or for aligning<br />
sequencing reads obtained by pyrosequencing, Ion Torrent or SMRT technology. Further gaprelated<br />
features such as ane gap costs or split-read approaches permit correct placement of the<br />
read in presence of longer insertions <strong>and</strong> deletions or intron-exon boundaries.<br />
Sequencing reads with special properties such as color space or bisulte converted data furthermore<br />
require fully customized mapping <strong>and</strong> alignment strategies.<br />
1.5.4 Variation <strong>and</strong> Genotype Calling<br />
Reversing the assembly <strong>and</strong> multi-genome comparison approach (section 1.5.2), resequencing<br />
has become the current method of choice for assessing genomic variation. In the resequencing<br />
setting, all read data for certain subspecies, strains or mutants are mapped to a single reference<br />
sequence of a closely related organism. The complexity of the analysis task is therefore shifted<br />
from establishing <strong>and</strong> examining complex relations between multiple assembled genomes towards<br />
obtaining a comprehensive <strong>and</strong> reliable readout of the structural information contained in the<br />
short read alignments.<br />
Basic consensus calling is limited to the detection of conserved regions <strong>and</strong> small scale variation<br />
like single nucleotide polymorphisms <strong>and</strong> small insertions <strong>and</strong> deletions contained therein.<br />
Such regions of evident sequence conservation are directly suggested by the short read mappings.<br />
The task for analysis algorithms is therefore to assess each of the supported positions to identify
1.5. SEQUENCING DATA ANALYSIS 11<br />
the most likely combination of alleles to give rise to the observed data, along with a measure of<br />
condence for the provided solution.<br />
Identication of this most likely combination of alleles at each reference genome position is<br />
therefore the main problem to be solved by analysis algorithms. A major issue to be overcome<br />
by these algorithms is to achieve an integration of models for sequencing errors, read mapping<br />
<strong>and</strong> base alignment errors. While the probability of sequencing errors may be systematically<br />
inuenced by the local sequence context, individual error events in dierent reads may be considered<br />
as largely independent. Sources of mapping <strong>and</strong> base alignment error are however mostly<br />
systematic in nature. Reads <strong>and</strong> aligned bases therefore do not represent independent evidence<br />
for a certain allele conguration, preventing these types of error from being as readily captured<br />
by statistical modeling.<br />
The majority of consensus <strong>and</strong> genotype calling algorithms commence by examining the<br />
pileup of base calls <strong>and</strong> base qualities provided by the mapped reads overlapping a position.<br />
Given sucient depth of coverage, eventual sequencing errors can be thereby identied at high<br />
condence.<br />
However, sequencing error may not only cause invalid readout of single positions, but also<br />
spurious cross-mappings to loci with similar sequence to that of the true origin of a read. The<br />
likelihood of such spurious mappings may be assessed using measures like the mapping quality<br />
provided by some alignment tools (section 1.5.3). Furthermore, heuristic parts of the mapping<br />
algorithms used to achieve improved mapping performance may also cause systematic crossmappings<br />
that can be indistinguishable from true variants or give rise to invalid multi-allelic<br />
calls. Such pseudo multi-allelic regions may further arise due to copy number variants, underrepresentation<br />
of repetitive sequence in the reference assembly or sample contamination.<br />
<strong>An</strong>other source of systematic miscalls presents base alignment error. Base alignment error<br />
is introduced by reads, albeit mapped to the locus on the reference genome from which they<br />
originate, whose alignment provided by the read mapping tool does not suggest the correct set<br />
of base pairings. This type of error mainly occurs at transition points to insertions, deletions,<br />
copy number variants or other types of rearrangement or at exon-intron boundaries in mRNA<br />
sequencing. Reads overlapping these transition points only with a small number of positions do<br />
not contain enough information to support the correct type of variant, causing spurious alignment<br />
of terminal nucleotides.<br />
For these reasons, current consensus <strong>and</strong> genotype call algorithms employ combinations of<br />
statistical models as well as alignment correction algorithms <strong>and</strong> heuristics for removing or<br />
penalizing read alignment issues.<br />
The Genome <strong>An</strong>alysis Toolkit (GATK) [97] combines Bayesian heterozygous genotype calling<br />
based on quality score reca<strong>lib</strong>ration with indel realignment <strong>and</strong> various heuristics for penalizing<br />
mapping artifacts [98].<br />
SHORE [18] implements a exibly ca<strong>lib</strong>ratable scoring matrix approach [1, 99] that calculates<br />
SNP, indel <strong>and</strong> reference call scores based on a user-denable weighting of various properties of<br />
the read alignments overlapping a position. The method allows to impose a penalty on, or to<br />
ignore a number of terminal nucleotides towards the ends of each alignment, ruling out many<br />
cases of base alignment error at the cost of an eective loss of sequencing depth.<br />
The SAMtools package [100] implements a prole HMM for estimation of a Base Alignment<br />
Quality (BAQ) for the assessment of potential mapping artifacts [101].<br />
1.5.5 Quantication by Deep Sequencing<br />
In principle, deep sequencing constitutes statistical sampling of the population of molecules that<br />
make up the sequencing <strong>lib</strong>rary. Therefore, obtained sequencing data may be transformed into
12 CHAPTER 1. INTRODUCTION<br />
quantities that represent a frequency distribution that may be used to infer relative quantities<br />
with reference to this population.<br />
Depending on the application, these measurements may be processed further for mere classication<br />
purposes (e. g. enriched versus not enriched in chromatin immunoprecipitation (ChIP)<br />
assays) or transformed into quantities appropriate for multi-sample comparison (e. g. normalized<br />
expression level in RNA sequencing).<br />
For routinely used, but weakly dened concepts such as gene expression, denition of an appropriate<br />
reference for obtaining biologically meaningful quantities can be intricate. Depending<br />
on the application the appropriate unit of measurement might be absolute, i. e. the number of<br />
copies of a molecule in the cell or in the sample, a cellular or compartmental concentration, or<br />
relative to the total population of molecules of the same family. To a certain degree, relative measurements<br />
can be directly obtained from sequencing data, whereas calculation of absolute values<br />
is in general not possible. For identication of dierentially expressed loci in a multi-sample<br />
comparison, it may however be sucient to represent all measurements in relation to a common<br />
reference value. Given many loci contribute to the sequencing <strong>lib</strong>rary with a majority following<br />
the same distribution in all conditions examined, the problem of calculation of a common<br />
reference value can be reduced to identication of outliers in a multi-sample comparison.<br />
Next to basic compensation for sequencing <strong>lib</strong>rary size, further specialized normalization procedures<br />
are therefore often applied to the raw quantities of sequenced reads. Some of these procedures<br />
attempt inference of quantities that are better suited to the comparison of the biological<br />
conditions in question, such as estimation of absolute quantities. Further aims of normalization<br />
include stabilization of measurement variance or removal of sequence specic biases. As<br />
an example, for whole transcriptome sequencing a normalizing procedure termed trimmed mean<br />
of M-values (TMM) [102] has been shown to compare favorably with plain relative quantication.<br />
For data with dierent properties such as microRNA proles, consensus regarding data<br />
normalization is yet to be reached [103].<br />
1.5.6 ChIP-Seq <strong>An</strong>alysis<br />
While chromatin IP (section 1.3.2) enriches for sequence fragments that contain a binding site for<br />
the protein of interest, no perfect purication is achieved. Obtained sequencing data therefore<br />
tend to contain signicant amounts of background noise in the form of reads that represent<br />
DNA fragments that were sequenced purely by chance rather than due to their anity to the<br />
DNA-binding protein. Primary data analysis involves the identication of those genomic regions<br />
that potentially contain one or more binding sites, recognizable from the mapped read data by<br />
characteristic peaks in sequencing depth at the respective sites <strong>and</strong> their close vicinity.<br />
Single transcription factors are often thought to contribute to the regulation of hundreds<br />
or thous<strong>and</strong>s of dierent genes, <strong>and</strong> to take on dierent roles in various regulatory networks.<br />
This creates the need for peak calling algorithms that are able to process the ChIP-Seq data<br />
to identify all putative binding sites in a genome-wide scan. Multiple freely available computational<br />
tools including MACS [104], QuEST [105], SISSRs [106], PeakSeq [107], cisgenome [108] or<br />
FindPeaks [109] all implement dierent algorithmic approaches to the peak calling problem.<br />
MACS employs estimation of local depth of coverage background using a sliding window. By<br />
this background estimate, p-values for enrichment are calculated based on the Poisson distribution.<br />
The program is able to operate with or without control <strong>lib</strong>raries <strong>and</strong> calculates false<br />
discovery rates through a sample-control swap.<br />
SISSRs calculates net tag count as the dierence of forward <strong>and</strong> reverse str<strong>and</strong> read mappings<br />
in short jumping windows. Base line transitions of net tag count are retained as sites of putative<br />
enrichment, which are further compared to Poisson-motivated read support thresholds.
1.6. STORAGE AND REPRESENTATION OF SEQUENCING DATA 13<br />
The PeakSeq program identies c<strong>and</strong>idate enriched regions through segment-wise calculation<br />
of a depth of coverage threshold obtained by simulation of r<strong>and</strong>om read distribution, taking into<br />
account a pre-computed mappability map to capture repetitive parts of the genome. Thus<br />
obtained c<strong>and</strong>idates are assessed further using a binomial test comparing sample <strong>and</strong> scaled<br />
control data.<br />
Particular to the QuEST algorithm, peak detection relies on score proles generated by kernel<br />
density estimation over the reads' 5 ′ mapping coordinates rather than depth of coverage. Peak<br />
calling is achieved by identication of local score maximums <strong>and</strong> a combination of score <strong>and</strong> fold<br />
change thresholds.<br />
While a multitude of dierent analysis algorithms thus exist, due to the diculty of assessing<br />
the quality of the provided solution there is generally no consensus which of the available<br />
algorithms performs best in a given scenario.<br />
1.6 Storage <strong>and</strong> Representation of Sequencing <strong>Data</strong><br />
High <strong>and</strong> constantly improving throughput of sequencing instruments implies routine production<br />
of uncommonly large data sets. The amount of data produced poses challenges not only regarding<br />
data transfer, processing <strong>and</strong> analysis, but also in terms of data storage. In the desire to guarantee<br />
best possible traceability of scientic results <strong>and</strong> all potential sources of error, the benet of longterm<br />
storage of all raw data must be balanced with storage cost, further considering the likelihood<br />
to again require certain types of data <strong>and</strong> the cost of eventually regenerating experimental data.<br />
Under these considerations it has become accepted practice to require long-term storage of at<br />
least nucleotide sequences <strong>and</strong> PHRED qualities of sequencing reads, whereas the preservation<br />
of raw signal measurements or further quality measures (section 1.5.1) is deemed optional.<br />
Generally, large storage size of the data sets emphasizes the importance of the tradeo between<br />
space <strong>and</strong> access eciency. While o-line long-term storage solutions are able to mainly address<br />
space eciency, with the only restriction being that the computational eort required to convert<br />
to <strong>and</strong> from the compressed representation of the data should remain in tolerable bounds, on-line<br />
representation of the data during the data analysis phase has to provide additional guarantees<br />
regarding data access.<br />
1.6.1 Storage Formats for Next-Generation Sequencing <strong>Data</strong><br />
High-throughput sequencing data sets store elements like sequencing reads or read alignments,<br />
which dene a tuple of attributes, such as a read identier, nucleotide sequence <strong>and</strong> per-base<br />
qualities. Additional meta-data items like sequencing instrument, run or processing information<br />
may be required to be kept alongside the data set. On-line sequencing data representation has<br />
to at least guarantee ecient sequential traversal of a data set's elements, i. e. enable ecient<br />
retrieval of all of the respective next element's attributes. Furthermore, it is typically required<br />
to allow storage of the data set elements in a specic order not dictated by space eciency<br />
considerations. Finally, accelerated queries for data set elements with specic properties, such<br />
as range queries for retrieval of elements overlapping with a given region of the genome, require<br />
a certain degree of r<strong>and</strong>om access support from the storage solution.<br />
While generic solutions such as relational database management systems (RDBMS) address<br />
many of these requirements, they come with an excess of further features while impeding data<br />
set distribution. Therefore, most sequencing data analysis solutions rely on custom le-based<br />
storage solutions. File formats in the bioinformatics eld have traditionally been ASCII text<br />
les following simple formatting rules. Examples include the de facto st<strong>and</strong>ard formats for representation<br />
of biological sequences, FASTA <strong>and</strong> FASTQ [110]. For description of large amounts
14 CHAPTER 1. INTRODUCTION<br />
of features that are relative to a reference sequence, simple tab-delimited tables have emerged as<br />
the predominant storage formats. For example, SAM [100, 111] is a st<strong>and</strong>ard le format used for<br />
storing sequencing reads <strong>and</strong> read alignments. GFF [112] is a generic format for description of<br />
genomic features that is primarily used, but not restricted to, representation of genome annotation<br />
data, <strong>and</strong> its subset specication GVF [113] is targeted at description of genomic variation<br />
like SNP or structural variants. VCF [114] provides a specication for storage of multi-sample<br />
variation <strong>and</strong> genotype information. SAM, GFF <strong>and</strong> VCF are structured similarly, with each data<br />
set element being represented on a single line. Common or m<strong>and</strong>atory element attributes like<br />
reference genome coordinates are stored as m<strong>and</strong>atory columns, whereas optional attributes are<br />
added in arbitrary order in the form of key-value pairs (tags). All three formats are generic formats<br />
that allow users to specify arbitrary additional attributes, <strong>and</strong> allow inclusion of meta-data<br />
in the le in the form of lines starting with a certain character sequence serving as meta-data<br />
indicator. The SAM format comes with an equivalent binary companion format BAM with the<br />
aim of reduced le parsing overhead.<br />
While processed data like variant calls are usually moderate in size, space eciency is a<br />
concern for raw read <strong>and</strong> read alignment data, suggesting the application of data compression<br />
algorithms. The generic compression algorithm DEFLATE [115] is one of the most widely used<br />
compression methods due to a tradeo between compression eciency <strong>and</strong> compression <strong>and</strong><br />
decompression speed that is acceptable in many settings. Generic compression algorithms are<br />
also applied to sequencing data, e. g. the BAM format is routinely encoded as BGZF [111], a<br />
simple subset specication of the GZIP [116] format for DEFLATE-compressed data.<br />
However, compression eciency can be improved through use of sequencing data specic encoding<br />
methods or preprocessing. To some extent, high-throughput sequencing data compression<br />
is related to the issue of DNA sequence compression. Complete genome sequences are known<br />
to be hardly compressible beyond the typical 2-bit encoding of the four bases using common<br />
generic compression algorithms [117]. Most eorts preceding the emergence of high-throughput<br />
sequencing have thus been directed at generating ecient representations of such large biological<br />
sequences. However, algorithms that have been made available for this purpose are not readily<br />
applicable to short read sequencing data.<br />
Common approaches used by most DNA compression algorithms in the literature rely on<br />
sux arrays or similar index data structures for detecting <strong>and</strong> collapsing perfectly or imperfectly<br />
repetitive regions in a small set of large sequences. As such, these implementations are not suited<br />
to the task of compressing the huge sets of very short sequences generated by a high-throughput<br />
sequencing run. <strong>An</strong>alogous approaches are however available for short read sequences. Imposing<br />
a specic ordering on the elements of a data set potentially leads to an increase in entropy, <strong>and</strong><br />
thereby reduced compression eciency. A utility SCALCE implements a read reordering method<br />
using a similarity-based clustering approach that serves as a compression booster in combination<br />
with generic compression algorithms <strong>and</strong> formats [118]. However, as specic element order is<br />
often a requirement of analysis algorithms, SCALCE <strong>and</strong> similar reordering approaches are most<br />
suitable for o-line long-term storage <strong>and</strong> data distribution. A dierent approach is taken by<br />
the tool Quip, which implements an assembly-based short read compression scheme where reads<br />
are represented by their position on specically assembled contigs [119]. Assembled contigs must<br />
be stored along with the data set elements, but allow ecient compression of data sequenced to<br />
multiple depth.<br />
While these approaches focus on compression of short nucleotide sequences, those represent<br />
just one of the attributes to be stored in next-generation sequencing data sets, along with read<br />
identiers, per-base PHRED qualities <strong>and</strong> possibly read alignments. Various methods therefore<br />
attempt to improve compression ratio for these heterogeneous collections of data. Notably, the<br />
CRAM format [120] aims at being a drop-in replacement for the SAM/BAM format, to large extent
1.7. CONTRIBUTIONS OF THIS WORK 15<br />
supporting the same logical structure <strong>and</strong> set of features. The main features of CRAM are reduced<br />
representations of aligned read sequences <strong>and</strong> per-base qualities. For aligned sequences, the<br />
format implements reference compression. In reference compressed alignments only the lengths<br />
of read subsequences that perfectly match the reference genome are stored, or an equivalent<br />
representation. While in principle a lossy operation, all discarded information may subsequently<br />
be recovered as long as the appropriate reference sequence is available. For PHRED quality<br />
data, CRAM oers a choice of various lossy binning schemes aimed at reduction of the entropy<br />
introduced with the read qualities. Following such preprocessing operations, CRAM encodes data<br />
into a custom binary format utilizing well-known encoding schemes such as Golomb [121] <strong>and</strong><br />
Human coding [122].<br />
1.6.2 Accelerated Queries on Sequencing <strong>Data</strong><br />
Support for accelerated queries on a data set is required to enable rapid data visualization<br />
[123, 124] <strong>and</strong> data set exploration. Indexed range queries are e. g. supported by the<br />
BigWig <strong>and</strong> BigBed formats [123], the BAM format or the utility Tabix [125] using R-tree [126]<br />
based indexing [124].<br />
While stock generic compression algorithms permit streaming encoding <strong>and</strong> decoding <strong>and</strong> are<br />
thus compatible with the goal of sequential, ordered traversal of data sets, they do not comply<br />
with r<strong>and</strong>om access requirements. This disadvantage is commonly mitigated by enabling seeking<br />
to dened positions in the stream. These decoding osets are realized by periodically either<br />
saving the complete state of the encoder, or by causing an encoder reset. In both cases, the<br />
cost is an eective decrease in compression eciency, either because of the overhead of saving<br />
the series of encoder states along with the compressed data, or because the encoder needs to be<br />
retrained following each reset. Additionally, an index data structure must be integrated to allow<br />
conversion between compressed <strong>and</strong> uncompressed stream osets. As a result, there is a tradeo<br />
between compression ratio <strong>and</strong> r<strong>and</strong>om access granularity, <strong>and</strong> thus eciency.<br />
1.7 Contributions of this Work<br />
The software SHORE is a processing <strong>and</strong> analysis pipeline for high-throughput DNA sequencing<br />
data [1, 18]. The pipeline is targeted at a reference sequence-based resequencing data analysis<br />
approach, combining raw read manipulation <strong>and</strong> ltering, read mapping <strong>and</strong> several analysis<br />
modules specic to dierent sequencing applications.<br />
With this work, large parts of our pipeline were reworked for coping with increasingly large<br />
amounts of data, optimization of workows <strong>and</strong> extended analysis requirements. In the process,<br />
SHORE was embedded into a generic programming framework to provide universal support of<br />
introduced features across all processing modules (section 3.1). Increasing storage sizes of data<br />
sets produced by high-throughput sequencing devices have been addressed by implementation of a<br />
variety of data compression options. At the same time, increased processing times were countered<br />
by extended parallel processing facilities, <strong>and</strong> specically distributed memory parallelization<br />
of the read mapping process. Furthermore, with greater data set size, eciency of retrieval<br />
of specic data set elements or subsets has gained importance for data set exploration <strong>and</strong><br />
visualization. To meet these requirements, appropriate data indexing <strong>and</strong> query algorithms were<br />
incorporated. Exploiting the data set indexing facilities, within several analysis modules we<br />
address basic visualization of sequencing read alignment <strong>and</strong> depth of coverage distribution to<br />
mitigate the need for time-consuming data set copying <strong>and</strong> conversion for use with external<br />
visualization solutions.
16 CHAPTER 1. INTRODUCTION<br />
Our application programming framework <strong>lib</strong>shore was implemented as a common foundation<br />
to all SHORE processing modules. In addition to elementary functionality such as automatic<br />
input <strong>and</strong> output decoding <strong>and</strong> encoding, algorithms for indexing sequencing-related data <strong>and</strong><br />
alignment of biological sequences are provided. Furthermore, we developed a generic processing<br />
framework with the aim of facilitating breakdown <strong>and</strong> implementation of rather complex<br />
sequencing data analysis algorithms as collections of manageable <strong>and</strong> self-contained processing<br />
modules. We further extended on this framework for straightforward parallelization of basic<br />
sequencing data processing.<br />
The fundamental data analysis workow introduced with the initial version of SHORE consists<br />
of subsequent application of read ltering, read mapping <strong>and</strong> application-specic analysis<br />
modules. While this basic approach has been maintained, it has been extended in many areas<br />
for simplication of a variety of routine tasks. <strong>An</strong> overview of the modied workow is therefore<br />
given in section 2.1.<br />
Signicant modications to core pipeline modules concern raw sequencing read import <strong>and</strong><br />
ltering as well as read mapping. <strong>Data</strong> import was modied to enable recovery of raw data<br />
or iterative data set re-ltering (section 2.3), <strong>and</strong> complemented with extended functionality to<br />
address e. g. the more <strong>and</strong> more widespread use of a variety of <strong>lib</strong>rary multiplexing <strong>and</strong> further<br />
adapter ligation protocols. For this purpose, exible sequence-specic read partitioning <strong>and</strong><br />
clipping facilities have been added (sections 2.4, 2.5). SHORE's read mapping module initially<br />
constituted a basic parallelization wrapper providing a unied interface to dierent short read<br />
alignment tools. To take advantage of an increasing diversity of read mapping algorithms, features<br />
have been added to enable integration of the results obtained using dierent alignment<br />
algorithms or parameters (section 2.6).<br />
Application-specic data analysis modules initially implemented in SHORE mainly addressed<br />
the assessment of genomic variation. This functionality has largely been maintained, with only<br />
slight modications to take advantage of the <strong>lib</strong>shore infrastructure. A focus of this work however<br />
was on quantitative application of high-throughput sequencing, <strong>and</strong> primarily ChIP-Seq<br />
data analysis. Within the SHORE analysis environment, we have implemented an enrichment<br />
detection module targeted at the analysis of transcription factor immunoprecipitation data (section<br />
2.7). In comparison to other approaches, our method emphasizes robustness towards read<br />
mapping artifacts <strong>and</strong> simplies h<strong>and</strong>ling of replicate experiments. Our ChIP-Seq enrichment<br />
detection module is complemented by congurable auxiliary utilities allowing composition of custom<br />
workows applicable to various expression proling or enrichment detection applications.<br />
In conclusion, with this work the SHORE software has been developed from an analysis pipeline<br />
consisting of a set of largely independent submodules targeted primarily at genomic variation<br />
assessment into a tightly integrated, general-purpose sequencing data analysis suite supporting<br />
a set of coordinated, interdependent features. The SHORE sequencing data analysis suite is open<br />
source software freely available 1 under the GNU General Public License (GPL) version 3.<br />
1 http://shore.sf.net
2 A High-Throughput DNA Sequencing <strong>Data</strong> <strong>An</strong>alysis<br />
<strong>Suite</strong><br />
The constant pace of development of high-throughput sequencing technology <strong>and</strong> protocols<br />
is a force driving continued adaptation <strong>and</strong> improvement of related software<br />
solutions. The present chapter is concerned with the data storage solutions <strong>and</strong> retrieval<br />
algorithms, high-throughput sequencing data analysis algorithms <strong>and</strong> software<br />
utilities that were developed in the course of this work for use with the SHORE DNA<br />
sequencing data processing <strong>and</strong> analysis pipeline.<br />
2.1 Overview<br />
Amounts of data produced by high-throughput sequencing devices have from the start posed<br />
a challenge with respect to data analysis. Since their introduction, countless computational<br />
utilities have been developed to satisfy dierent data processing <strong>and</strong> analysis needs in all areas of<br />
application. Large-scale high-throughput sequencing projects however emphasize the importance<br />
of st<strong>and</strong>ardized work ows <strong>and</strong> data storage. With data analysis extending over multiple phases<br />
as well as a signicant time span <strong>and</strong> overall man-hours distributed among multiple individuals,<br />
coordinated eorts to ensure coherence <strong>and</strong> reproducibility of processing <strong>and</strong> analysis results<br />
become imperative.<br />
A basic level of st<strong>and</strong>ardization is readily achieved by reliance on whole-sale software solutions,<br />
with all required data processing <strong>and</strong> analysis algorithms based on a common framework.<br />
However, apart from commercial oers few freely available high-throughput sequencing applications<br />
provide integration to varying degrees of more than just isolated processing <strong>and</strong><br />
analysis steps, e. g. SOAP, SAMtools [100] or GATK [97].<br />
With the SHORE software, we provide an integrated high-throughput sequencing read processing<br />
<strong>and</strong> analysis framework based on consistent data storage concepts. SHORE oers an<br />
end-to-end pipeline including all relevant steps starting from initial raw read data ltering <strong>and</strong><br />
partitioning <strong>and</strong> proceeding up to primary analysis results. <strong>An</strong> overview of SHORE's general data<br />
analysis work ow is provided in gure 2.1. While in principle consistent with what has been<br />
described [1] (cf. section 1.7), we implement a variety of extensions including optional recovery<br />
of raw data, on-the-y merging, subset selection <strong>and</strong> ltering as well as data visualization, as<br />
discussed in the following.<br />
In a reference sequence-guided resequencing approach, primary data analysis proceeds in three<br />
main stages. First, raw sequencing read data obtained from the base calling software become<br />
subject to quality control <strong>and</strong> demultiplexing algorithms (gure 2.1(a) ). Thus processed reads<br />
are subsequently mapped to the appropriate reference genome sequence (b). Application-specic<br />
algorithms then interpret the read alignment data as the nal step of the primary analysis (d),<br />
such as genotype calling, assessment of structural variation, enrichment detection or dierential<br />
expression analysis. Approaches to follow-up analysis of the generated primary results are<br />
manifold <strong>and</strong> highly divergent, <strong>and</strong> therefore not currently integrated with the pipeline.<br />
19
20 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Base Calling<br />
Unprocessed<br />
Reads<br />
Import,<br />
Filtering <strong>and</strong><br />
Demultiplexing<br />
(a)<br />
Processed<br />
Reads<br />
Recovery<br />
(e)<br />
Visualization <strong>and</strong><br />
Quality Assessment<br />
(g)<br />
Read<br />
Mapping<br />
(b)<br />
Export<br />
(f )<br />
Unprocessed<br />
Reads<br />
Archiving or<br />
Distribution<br />
Read<br />
Alignments<br />
Merge,<br />
Select<br />
<strong>and</strong> Filter<br />
(c)<br />
Export<br />
(h)<br />
Processed<br />
Reads<br />
Visualization<br />
(i)<br />
<strong>An</strong>alysis<br />
(d)<br />
Read<br />
Alignments<br />
Results<br />
External <strong>An</strong>alysis<br />
or Visualization<br />
Figure 2.1: The SHORE Workow<br />
Our analysis pipeline is realized as a comm<strong>and</strong> line application organized as a collection of<br />
submodules. A module SHORE import provides initial processing of raw read data <strong>and</strong> logically<br />
organizes its output in a predened directory hierarchy, which forms the basis for the operation of<br />
further modules (section 2.3). The module provides a comprehensive set of commonly required<br />
ltering operations such as quality based ltering <strong>and</strong> read trimming, removal of sequencing<br />
adapters or demultiplexing for bar-coded samples (section 2.4). To facilitate iterative adjustment<br />
of ltering parameters or distribution of raw read data to third parties, the module further<br />
provides functionality to fully restore its original input data (e).<br />
<strong>An</strong>alysis of SHORE processed data by means of third-party analysis utilities is possible through<br />
format conversion. A module SHORE convert provides read <strong>and</strong> alignment data export to a<br />
variety of commonly supported data exchange formats such as FASTA, FASTQ, SAM, BED or<br />
GFF (f, h). To assess overall run quality, nucleotide composition <strong>and</strong> base quality distributions<br />
may optionally be visualized with a module SHORE tagstats (g, section 2.8).<br />
<strong>Data</strong> analysis methods implemented in SHORE primarily follow a reference sequence-oriented<br />
resequencing approach (section 1.5). The provided module SHORE mapowcell is responsible for<br />
the required mapping of the processed read data to a reference genome sequence. The mapping<br />
module is realized as a high-level comm<strong>and</strong> line interface <strong>and</strong> parallelization front-end to a variety<br />
of widely useful read mapping tools such as GenomeMapper, BWA or Bowtie2 (section 2.6).
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 21<br />
Read mapping data stored as a result of the SHORE mapowcell module constitute the input<br />
for data analysis modules such as variant calling or enrichment detection (d). Available analysis<br />
modules cover SNP <strong>and</strong> indel detection (SHORE qVar, consensus), ChIP-Seq analysis (SHORE peak,<br />
section 2.7), reference-based small RNA analysis as well as versatile modules SHORE coverage <strong>and</strong><br />
SHORE count applicable to a variety of expression analysis <strong>and</strong> enrichment sequencing approaches.<br />
A further module SHORE mapdisplay as well as SHORE coverage <strong>and</strong> count support data set exploration<br />
through visualization of read alignments or depth of coverage, respectively (i, section 2.8).<br />
Alignment input data to analysis <strong>and</strong> visualization modules can on-the-y be selected by<br />
region of interest, merged with data of e. g. further sequencing runs as well as passed through a<br />
variety of ltering operations such as duplicate sequence removal (c, sections 2.2.3, 2.7.2).<br />
The SHORE data analysis suite is designed for seamless integration into established Unix<br />
comm<strong>and</strong> line workows. Where appropriate, modules support streaming input <strong>and</strong> output<br />
to <strong>and</strong> from other comm<strong>and</strong>s. SHORE's data are stored in simple line based text le formats<br />
utilizing st<strong>and</strong>ards-compliant compression formats, facilitating data manipulation by stock text<br />
processing utilities. To support routine application in a specic scenario, all of SHORE's defaults<br />
may be pre-congured through user-specic conguration les.<br />
2.2 Ecient Storage of High-Throughput Sequencing <strong>Data</strong> Using<br />
Text-Based File Formats<br />
For large data sets as generated by high-throughput sequencing machines, space ecient ondisk<br />
representation is imperative. At the same time, during active analysis phases data must<br />
remain easily retrievable, requiring an appropriate compromise between compression ratio <strong>and</strong><br />
access eciency (section 1.6). This section presents data storage formats, indexing <strong>and</strong> retrieval<br />
mechanisms implemented in the SHORE pipeline.<br />
Typical binary le formats for sequencing data come with both advantages <strong>and</strong> disadvantages<br />
compared to traditional text-based alternatives. SHORE makes use of compressed text le formats<br />
for read alignment data that achieve similar or better compression ratios compared to compressed<br />
binary formats such as CRAM. SHORE's le compression formats all support streaming input<br />
<strong>and</strong> output as well as near-r<strong>and</strong>om read access to decompressed le osets. Users may exibly<br />
choose the desired tradeo between compression ratio, decoding <strong>and</strong> encoding speed <strong>and</strong> r<strong>and</strong>om<br />
access guarantees. Specialized utilities oer binary search like queries <strong>and</strong> range queries on<br />
compressed or uncompressed data sets for many types of text-based le format popular with the<br />
bioinformatics community.<br />
2.2.1 Overview<br />
Driven by the amounts of short read <strong>and</strong> alignment data generated during HTS data analysis,<br />
specialized binary storage formats have been introduced, notably the BAM format [100] <strong>and</strong><br />
the CRAM format [120] both designed for storage of aligned as well as unaligned short read<br />
data. BAM employs binary encoding of values with the aim of reduced le parsing overhead,<br />
i. e. to improve eciency of conversion between the disk <strong>and</strong> the in-memory representations of<br />
data. On the other h<strong>and</strong>, CRAM denes a specialized format with the aim of enabling increased<br />
compression ratios by supporting custom encoding for certain attributes of the data set elements.<br />
In spite of the reasons in favor of specialized binary data encoding, there are also drawbacks<br />
compared to traditional ASCII text-based storage formats. Parsing custom binary formats requires<br />
careful consideration of technical details like machine endianness <strong>and</strong> representation of<br />
oating point numbers. Additionally, in depth knowledge of the relevant data encoding <strong>and</strong><br />
compression algorithms is inevitable. Unless a parsing <strong>lib</strong>rary is provided for the programming
22 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
language of choice, readout of binary formats is therefore often too laborious or beyond basic programming<br />
skills, eectively limiting users of the format to choose from a narrow set of supported<br />
programming languages.<br />
In contrast, basic parsers are straightforward to implement for simple tab-delimited table<br />
formats. Requirements such as string tokenization <strong>and</strong> conversion between character strings <strong>and</strong><br />
numeric types are well supported by high level scripting languages popular in the bioinformatics<br />
eld such as Perl or Python, <strong>and</strong> considered a basic skill in next to any programming language.<br />
The line-based formats are furthermore accessible to shell scripting <strong>and</strong> readily ltered through<br />
widely used Unix text processing utilities such as grep, awk, sed or sort. While conversion<br />
between le <strong>and</strong> in-memory representation is less ecient compared to optimized binary formats,<br />
the introduced overhead is marginal in relation to the overall CPU consumption of data analysis<br />
algorithms.<br />
Furthermore, delimited text tables can serve as a basic framework for le format denitions<br />
such as SAM, GFF or VCF. This common foundation enables implementation of generic algorithms<br />
applicable to a large variety of dierent le formats, as demonstrated for example by the<br />
tool Tabix [125] for range indexing of genomic feature le formats. Adapting similar algorithms to<br />
dierent specialized binary formats is laborious, requiring reimplementation or implementation<br />
of sophisticated adapter mechanisms.<br />
Format issues are easily identiable with text-based le formats without requirement of expert<br />
knowledge or implementation of specialized format verication routines. Software dealing with<br />
data produced by a rapidly evolving technology as is high-throughput sequencing instrumentation<br />
is likely required to undergo constant adaptation. Besides, at high rates of data production<br />
storage systems may operate close to the limits of their capacity. In such potentially unstable<br />
settings, simplicity of le formats constitutes a signicant advantage.<br />
On grounds of these considerations, the SHORE analysis suite relies on tab-delimited table<br />
formats for storing reads, read mapping data <strong>and</strong> analysis results. Users may further choose to<br />
store les either directly as plain text, or between one of two widely used compression formats,<br />
GZIP [116] <strong>and</strong> XZ [127]. While such generic compression le formats themselves represent<br />
binary formats, decoding comm<strong>and</strong>s <strong>and</strong> routines are widely available across platforms <strong>and</strong><br />
programming languages <strong>and</strong> are employed merely as a ltering layer, thus retaining most of the<br />
advantages of a text-based format. With compression, SHORE implements block-wise encoding<br />
for both the GZIP <strong>and</strong> the XZ format to enable near-r<strong>and</strong>om read access at any decompressed<br />
le oset. Our implementation enables free adjustment of r<strong>and</strong>om access <strong>and</strong> space eciency<br />
trade-os (section 2.2.2).<br />
SHORE's text-based read alignment le format was extended to support reference compression<br />
as introduced by the binary CRAM format (section 1.6), as well as further simple preprocessing<br />
lters serving to boost compressibility in combination with generic compression algorithms.<br />
Depending on the choice of preprocessing lters, compression algorithm <strong>and</strong> r<strong>and</strong>om access resolution,<br />
our compressed text le formats enable similar or better read alignment compression<br />
ratios compared to CRAM (section 2.2.5).<br />
<strong>An</strong>other common requirement in sequencing data analysis is the retrieval of data set elements<br />
with a certain attribute value, e. g. extraction of the set of single nucleotide polymorphisms<br />
included in a certain range of the reference genome or retrieving a read with a specic identier<br />
for a short read data set. However, st<strong>and</strong>ard linear search is slow when processing large data sets.<br />
Given the data set is in sorted order with respect to the key attribute, accelerated queries are<br />
possible. SHORE sort oers text le sorting functionality similar to the st<strong>and</strong>ard sort comm<strong>and</strong><br />
line utility, optimized for use with typical high-throughput sequencing le formats. Additionally,<br />
the tool is capable of performing binary accelerated queries on arbitrary sorted tab-delimited<br />
text les.
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 23<br />
(a) GZIP<br />
(b) BGZF<br />
(c) dictzip<br />
(d) SHORE-GZIP<br />
GZIP header<br />
DEFLATE compressed data<br />
GZIP footer<br />
Indexing information embedded in header<br />
Figure 2.2: Basic Structure of the GZIP Format <strong>and</strong> its Sub-Formats<br />
Binary search is however not applicable to elements like read mappings, gene models or<br />
variant information representing two-dimensional objects with a start <strong>and</strong> an end coordinate.<br />
Fast access to such data set elements overlapping a specic region of interest is crucial for data<br />
visualization <strong>and</strong> manual data set exploration. A text-le indexing approach capable of answering<br />
various types of range queries is implemented as a utility SHORE 2dex, providing functionality<br />
similar to Tabix [125] or the BAM [100] <strong>and</strong> BigBed [123] indexing schemes. In comparison to<br />
the also-generic Tabix indexer, SHORE provides additional functionality such as exible choice of<br />
compression format <strong>and</strong> r<strong>and</strong>om access resolution, indexing of les not sorted by start coordinate<br />
<strong>and</strong> fast queries on data sets with deep sequence coverage.<br />
Both utilities SHORE sort <strong>and</strong> SHORE 2dex make use of SHORE's generic r<strong>and</strong>om read access<br />
back end <strong>and</strong> can thereby seamlessly be applied to both compressed <strong>and</strong> uncompressed data sets.<br />
2.2.2 Widely Compatible Indexed Block-Wise Compression<br />
GZIP [116] is one of the most widely used generic compression formats with a high level of<br />
support across operating systems, programming languages, comm<strong>and</strong> line utilities <strong>and</strong> many<br />
further software applications. The employed DEFLATE compression algorithm [115] oers a<br />
reasonable compromise between encoding <strong>and</strong> decoding speed <strong>and</strong> compression ratio. Therefore,<br />
the format is an obvious choice for compression of sequencing related data. However, it lacks<br />
r<strong>and</strong>om read access support, forcing decompression of all data preceding a required section of<br />
the compressed le. As this interferes with the ability to perform accelerated queries on the<br />
compressed data sets, subset specications have been developed to add this feature without<br />
breaking compatibility.<br />
The GZIP le format embeds DEFLATE compressed data between a header <strong>and</strong> a footer<br />
sequence (gure 2.2(a) ). Encoding parameters are specied within the header, whereas the<br />
footer consists solely of eight byte specifying a checksum as well as the decompressed le size<br />
modulo 2 32 . Header, compressed data <strong>and</strong> footer constitute a GZIP stream, <strong>and</strong> a GZIP compliant<br />
le consists of one ore more concatenated streams. Upon decompression, data of concatenated<br />
GZIP streams are concatenated as well.
24 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
The BGZF format developed for the BAM format <strong>and</strong> the Tabix utility [100, 111, 125] exploits<br />
the concatenation feature to implement independently compressed blocks within the GZIP speci-<br />
cation. Uncompressed data are split into 64 kilobyte (kB) blocks, compressed independently as<br />
GZIP streams <strong>and</strong> eventually concatenated (gure 2.2(b) ). Although the resulting compressed<br />
size of a block is unpredictable <strong>and</strong> depends on the compressibility of the data, with the knowledge<br />
of the start osets of all blocks in a compressed le it is possible to quickly locate the block<br />
containing a specic uncompressed le oset. Thereby excess data that have to be decompressed<br />
to access any specic position of the le are limited to at most the block size of 64 kB. BGZF<br />
however does not specify a means of storing the start oset of each of the compressed blocks. To<br />
realize accelerated queries on top of BGZF, block oset information must therefore be incorporated<br />
in external index les. This represents a source of potential error, as the contents of the<br />
compressed le may diverge from the stored index data. Furthermore, various software applications<br />
do not correctly implement the stream concatenation feature of the GZIP specication,<br />
resulting in BGZF compressed data being truncated after decompression of the rst block.<br />
The open source comm<strong>and</strong> line tool dictzip comes with a specication for an indexed GZIPcompliant<br />
format. The GZIP format specication allows up to 64 kB of arbitrary extra data<br />
to be embedded within the header sequence. This extra data is simply to be ignored by basic<br />
GZIP decoders. The dictzip format takes advantage of the extra data eld for storing block<br />
oset information (gure 2.2(c) ). During the encoding process, the DEFLATE algorithm oers<br />
the possibility of manual insertion of reset points from which decompression may be started<br />
without processing the entire le. By exploiting this feature to compress blocks of input data<br />
independently, the dictzip tool avoids reliance on the stream concatenation feature as in the<br />
case of BGZF. However, with the extra data eld in the GZIP header being limited to 64 kB by<br />
specication, the number of blocks that can be stored is limited as well, eectively prohibiting<br />
storage of data with an uncompressed size of more than 1.8 gigabyte (GB). Furthermore, the<br />
entire index data to be stored in the GZIP header only becomes available after the entire le has<br />
been encoded. Streaming output is therefore incompatible with the format, which is therefore<br />
suitable for medium size, static data sets, but is not well matched to the compression of large-scale<br />
sequencing data.<br />
Due to the shortcomings of available formats, we introduce a further subset specication.<br />
The SHORE-GZIP format compresses the entire le as a single GZIP stream to provide optimal<br />
compatibility. To enable streaming output, all indexing information is stored at the end of the<br />
le. On decompression using either SHORE or arbitrary third party tools, block osets stored<br />
in the index are discarded. Thereby, index information is guaranteed to remain consistent with<br />
the le contents. To allow ne-tuning the tradeo between compression ratio <strong>and</strong> r<strong>and</strong>om access<br />
overhead, the format enables free choice of encoding block size.<br />
SHORE-GZIP realizes block-wise encoding similar to the dictzip approach by requesting periodic<br />
resets of the DEFLATE encoder (gure 2.2(d) ). During encoding, the method keeps track<br />
of the compressed block osets. After encoding all input data, the recorded index information<br />
is split into blocks of approximately 64 kB. Each block is embedded as extra data eld into a<br />
GZIP header, which is appended to the compressed le followed by two byte signaling empty<br />
compressed data to DEFLATE decoders, as well as eight further byte forming the appropriate<br />
GZIP footer. The indexing information is hence stored as consecutive empty GZIP streams that<br />
will be ignored by st<strong>and</strong>ard decoding of the data set. The SHORE-GZIP implementation is however<br />
able to recognize, collect <strong>and</strong> decode the trailing index blocks on dem<strong>and</strong> to enable r<strong>and</strong>om<br />
read access.<br />
<strong>An</strong> important feature of plain GZIP is the ability for users to concatenate compressed les<br />
without breaking format compliance. With SHORE-GZIP, concatenation results in index records<br />
being interspersed among the data streams. Through correct recognition of such interspersed
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 25<br />
index elements by SHORE's index parsing algorithm, r<strong>and</strong>om access functionality is preserved in<br />
concatenated les.<br />
Apart from GZIP, SHORE oers le compression using the XZ [127] format. XZ employs the<br />
LempelZivMarkov chain algorithm (LZMA) to achieve considerably improved average compression<br />
ratio compared to DEFLATE. LZMA oers decompression performance close to that<br />
of DEFLATE, whereas compression requires signicantly more computational resources. XZ is<br />
modeled after the GZIP format by specication of a similar structure of concatenated streams<br />
of encoded data. The le format however natively includes terminal index records that store<br />
all positions of an eventual encoder reset <strong>and</strong> can therefore be employed within SHORE without<br />
further extensions.<br />
2.2.3 Ecient Queries on Text Files<br />
Queries for specic elements in a sorted sequence can be quickly answered by binary search<br />
in O(log n) time, where n is the length of the input sequence. Stock binary search algorithms<br />
however are not directly applicable to text-formatted data sets where the total number of elements<br />
<strong>and</strong> their exact start osets in the le are not known. We therefore implement a modied binary<br />
search for text le queries. Our algorithm bisects the data set by seeking to the central byte <strong>and</strong><br />
subsequently seeking past the next newline marker following that position. Except for border<br />
cases that must be h<strong>and</strong>led separately, the following line is compared to the user-provided search<br />
key <strong>and</strong> the algorithm is applied recursively to the appropriate subset of the le. In general,<br />
worst case time complexity of the modied binary search algorithm is O(m) on arbitrary text<br />
les with a total size of m byte, which however reduces to O(log n) for an n-element data set if<br />
a static limit for the maximal byte size per element is assumed to exist.<br />
Generic text le search functionality is made available through the SHORE sort utility. The<br />
program is applicable to arbitrary sorted tab-delimited text tables, oering the capability to<br />
retrieve specic data set elements as well as to bisect the data set using a search key.<br />
Automated data analysis can often narrow down the data relevant to the investigation to<br />
a small set of regions on the reference genome. Close examination of such individual genomic<br />
regions necessitates rapid retrieval of data associated with the appropriate range of positions. Unless<br />
consisting completely of non-overlapping entries, range queries can however not be answered<br />
by binary search.<br />
Objects associated with a specic range are two-dimensional entities that can be interpreted<br />
as points in a plane dened by their start <strong>and</strong> end coordinate. A query range q = (x q , y q )<br />
then is itself a point in the plane, which together with its reection about the 45 degree line<br />
q ′ = (y q , x q ) partitions the search space into six dierent quadrants (gure 2.3(a) ). Area A<br />
includes all objects fully included in the query range, dened by constraints x ≥ x q <strong>and</strong> y ≤ y q .<br />
Subdivision B represents all elements whose range itself includes the entire query range, with<br />
the constraints x ≤ x q <strong>and</strong> y ≥ y q . All objects intersecting the query are represented by the set<br />
union of areas A to D dened by y > x q <strong>and</strong> x < y q , whereas elements completely disjoint from<br />
the query range fall into quadrants E <strong>and</strong> F (y ≤ x q or x ≥ y q ).<br />
One of many ways to impose a spacial partitioning on multi-dimensional data represent<br />
k-d trees [128], recursively bisecting the data set with respect to alternating dimensions (gure<br />
2.3(b) ). In a perfectly balanced k-d tree, each recursion splits the current subset of the data<br />
at the median of the respective dimension's values. In that case, the tree may be represented<br />
implicitly by an array of its elements with a particular ordering [129]. k-d trees can answer<br />
region searches with a worst case time complexity of O(k · n 1−1/k + r), i. e. O( √ n + r) in the<br />
two-dimensional case, with n the total number of data set elements <strong>and</strong> r the number of elements<br />
to be retrieved [130].
26 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
B<br />
D<br />
F<br />
Start + Size<br />
C<br />
q = (x q , y q )<br />
A<br />
q ′ = (y q , x q )<br />
E<br />
Start<br />
Start<br />
(a) Range Query<br />
Figure 2.3: Range Search Using k-d Trees<br />
(b) Search Space Partitioning<br />
SHORE implements a generic algorithm to impose 2-d tree ordering on array-like objects.<br />
Given a 2-d tree ordered array object, further algorithms facilitate retrieval of elements intersecting<br />
(AD), included by (A) or including (B) a query range or entirely located upstream (E)<br />
or downstream (F) of a query position.<br />
Based on the 2-d tree algorithms, we developed a persistent indexing scheme for textformatted<br />
range data les, made available through the SHORE 2dex utility. The indexing method<br />
processes input les in blocks of a xed, user-provided byte size. For each block, we record<br />
a sequential identier as well as the sequence range spanned by the combined elements whose<br />
storage entry starts inside the respective block. If a block is not associated with any storage<br />
entry, no information is saved. Blocks records are arranged in 2-d tree order <strong>and</strong> saved to disk<br />
as the concatenation of three arrays providing the reordered identiers, genomic start positions<br />
<strong>and</strong> sizes. For answering range queries, array information is mapped into memory <strong>and</strong> supplied<br />
to the appropriate 2-d tree query algorithm, thereby providing the identiers of all blocks that<br />
potentially contain elements relevant to the query. In combination with the index block size,<br />
block identiers allow calculation of the uncompressed start osets of the blocks in the le. A<br />
subsequent linear search phase scans the blocks indicated by the 2-d tree query to assemble the<br />
nal search result.<br />
Through the block size parameter, users are enabled to variably adjust the tradeo between<br />
index le size <strong>and</strong> maximum length of the linear search phase. In sparse range data sets or<br />
data sets not ordered by sequence coordinate, a query range may be spanned by the combined<br />
elements associated with a block although none of the individual elements is of relevance to the<br />
respective query. In such cases irrelevant blocks must frequently be subjected to linear search<br />
unless the index block size is suciently small. To overcome this slight deciency, we introduce<br />
a second user parameter maxgap. If subsequent data set elements in a block are farther apart on<br />
the genomic sequence than the provided maxgap size, then the respective block is stored multiple<br />
times associated with dierent ranges that do not include any coverage gap larger than maxgap.
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 27<br />
(a)<br />
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTA<br />
||||||||||**|||||||||||||****|||||||||||||<br />
CCCTAAACCCGTAACCCTAAACCCT---TCTCTGAATCCTTA<br />
(b)<br />
(c)<br />
CCCTAAACCC[TG][AT]AACCCTAAACCCT[A-][A-][A-][CT]CTCTGAATCCTTA<br />
10[TA,GT]13[AAA,---][CT]13<br />
(d)<br />
Query CIGAR MD<br />
CCCTAAACCCGTAACCCTAAACCCTTCTCTGAATCCTTA 10=2X13=3D1X13= 10T0A13^AAA0C13<br />
Figure 2.4: Example Read Alignment <strong>and</strong> its Representations in the MapList <strong>and</strong> SAM Formats<br />
BAM indexing as well as the Tabix utility are two incarnations of a similar indexing scheme<br />
based on R-trees which was rst introduced for the UCSC Genome Browser [124]. Utilities like<br />
the BAM indexer however are tied to their specic associated le format. The more versatile<br />
Tabix utility supports many text-based le formats, but requires BGZF compressed data. By<br />
separating the r<strong>and</strong>om access <strong>and</strong> range indexing concepts, the 2d-tree index is universally applicable<br />
to all storage formats supported by the r<strong>and</strong>om access back end, currently including plain<br />
text, XZ <strong>and</strong> SHORE-GZIP. Moreover, the ability of indexing data sets not strictly required to<br />
be ordered by sequence coordinate is unique among the stated indexing methods. As described,<br />
our method indexes chunks of the data set having a xed byte size. The UCSC R-tree family of<br />
indexes dier in this respect by grouping data set elements into tiling windows on the genomic<br />
sequence at a maximum resolution of 16 kilobases. This binning approach implies the disadvantage<br />
of potential uncontrollable growth of the linear search phase for deeply covered regions<br />
of the genome. We further regard the 2d-tree index's homogeneous array layout a substantial<br />
implementation advantage.<br />
2.2.4 Improved Compression of Read Mapping <strong>Data</strong><br />
To cope with increasingly large amounts of read mapping data, we implement reference based<br />
compression, quality reduction, read relabeling as well as a simple transposition algorithm capable<br />
of further boosting compressibility of SHORE alignment les <strong>and</strong> other types of tab-delimited table.<br />
Conversion from st<strong>and</strong>ard to condensed representations is implemented as a tool SHORE coal.<br />
A read mapping by our denition is a multi-attribute entity composed of its mapping coordinates<br />
on the reference genome as well as a read alignment that species the exact set of<br />
base pairings, traditionally represented in a multi-line format (gure 2.4(a) ). The line-based,<br />
tab-delimited MapList format [1] dened by SHORE stores mapping coordinates, read alignments<br />
along with the original read information <strong>and</strong> attributes describing the reliability of the mapping<br />
as well as read pairing information. Alignments are stored in an alignment string format resembling<br />
that originally introduced by Vmatch [131], a sequence matching tool based on enhanced<br />
sux arrays [132]. In the Vmatch format, matching positions are represented literally by the<br />
matching nucleotide symbol, whereas edit operations are described by the pair of mismatching<br />
symbols enclosed in square brackets (gure 2.4(b) ).<br />
To adjust to the development of sequencing technology <strong>and</strong> read mapping tools, we have<br />
carefully amended the original alignment representation with a small set of additional features<br />
while maintaining full backwards compatibility. With the constant increase of achievable read<br />
length, it becomes possible for read mapping tools to align reads across longer stretches of
28 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
inserted, deleted or polymorphic sequence. For more concise <strong>and</strong> readable representation of such<br />
alignments, we additionally allow consecutive edit operations of the same type to be indicated by<br />
two comma-separated strings representing the sequence of the reference <strong>and</strong> the read, respectively.<br />
Reference-based compression as introduced by CRAM in our format is supported by simple<br />
substitution of all substrings representing exact matches between read <strong>and</strong> reference sequence by<br />
their length (gure 2.4(c) ).<br />
The st<strong>and</strong>ard SAM/BAM format splits read alignment information across three attributes<br />
(gure 2.4(e) ). In this format, the nucleotide sequence of a read (query) is stored unaltered.<br />
Actual alignment information is separated into a CIGAR string as well as an optional MD tag<br />
required to specify reference sequence information for mismatch <strong>and</strong> deletion positions. Omission<br />
of read information redundant with the reference sequence via reference-based compression is<br />
not supported by the format specication. In the opposite direction, the SHORE alignment<br />
string format species further enhancements such as inclusion of soft clipped sequence enclosed<br />
in angle brackets (not shown) to enable full feature compatibility with the CIGAR alignment<br />
representation.<br />
Apart from read alignments, read qualities <strong>and</strong> identiers can make up for a considerable<br />
fraction of the storage requirements of compressed read data (cf. section 2.2.5). SHORE oers a<br />
simple algorithm capable of lowering base quality resolution <strong>and</strong> thereby read quality entropy<br />
similar to the quality binning approach taken by cramtools [120]. We take a user-specied<br />
resolution threshold k to partition each read quality string into sections where the lowest <strong>and</strong><br />
highest quality value dier by at most k. Each nucleotide is then assigned the lowest quality<br />
value found in the respective section. Information on edit operations is prioritized by always<br />
storing the exact quality value for read bases representing mismatches or insertions relative to<br />
the reference sequence.<br />
Simplication of read identiers can amount to a signicant further reduction of storage<br />
requirements. St<strong>and</strong>ard read identiers are strings automatically generated by the base calling<br />
software. For example, Illumina software builds read labels from a sequencing run identier<br />
as well as the reads' coordinates on the ow cell. Due to the r<strong>and</strong>omness of the coordinates<br />
as well as the r<strong>and</strong>om order of occurrence of the reads in a typical mapping data set ordered<br />
by alignment coordinate, the sequence of identiers to be stored is of high entropy <strong>and</strong> poor<br />
compressibility. Post-mapping systematic relabeling of reads can therefore drastically improve<br />
compression ratio. Following relabeling, information content of the sequence of stored identiers<br />
is primarily determined by their actual function, which is identication of associated data set<br />
entries such as paired reads or repetitive read mappings. SHORE coal oers such relabeling<br />
combining simple enumeration by alignment coordinate with a static data set identier.<br />
In general, generic compression algorithms like DEFLATE or LZMA better pick up redundancy<br />
in homogeneously structured data. This suggests that e. g. read quality data might be<br />
compressed more eectively when grouped together rather than stored interleaved with the further<br />
attributes of the data set elements such as identiers <strong>and</strong> read alignments. Isolated storage<br />
of these attributes however prohibits data streaming. Furthermore, given presence of variablelength<br />
attributes that cannot reasonably be stored in a xed, predetermined amount of space,<br />
r<strong>and</strong>om access to the full set of attributes of a certain data set element requires additional index<br />
data structures that counter the aim of optimizing compression ratio. A compromise between<br />
such an attribute centric <strong>and</strong> the regular element centric representation constitutes block-wise<br />
coding of the data in an attribute centric representation, with block size chosen small enough to<br />
transform a block into an element centric representation on-the-y.<br />
To explore the eectiveness of such a mixed representation for boosting compressibility, we<br />
implemented a simple transposition operation applicable to tab-delimited le formats (algorithm<br />
2.1). Our algorithm takes as parameter a xed block size k for transformation of the
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 29<br />
input : Text T, block size k<br />
output : Transposed text T ′<br />
foreach block B in T of size k<br />
do<br />
rst_row ← rst row of B;<br />
remove the rst row from B;<br />
last_row ← last row of B;<br />
remove the last row from B;<br />
append rst_row to T ′ ;<br />
if B is a table then<br />
foreach column C in B<br />
do<br />
append C as row to<br />
T ′ ;<br />
end<br />
else<br />
append B to T ′ ;<br />
end<br />
append last_row to T ′ ;<br />
end<br />
Algorithm 2.1: Block-Wise Text Transposition<br />
input data. Input is subdivided into blocks of k byte. First <strong>and</strong> last row of a block are dened<br />
as the characters up to, <strong>and</strong> including the rst newline, as well as the characters following the<br />
last newline marker found in the block. For each block, rst <strong>and</strong> last row are output unaltered,<br />
preserving lines spanning block boundaries. <strong>Data</strong> in between are checked whether they represent<br />
a table with xed number of columns, <strong>and</strong> if so are transposed swapping rows for columns.<br />
When applied to data represented as a tab-delimited table with a xed number of columns,<br />
the transposition algorithm succeeds in grouping together all attribute values of the same type<br />
in a block, lines at block boundaries excluded. For tables with variable number of columns or<br />
other data, the input is left unaltered. Applied twice with the same block size parameter value,<br />
the original input is guaranteed to be restored.<br />
SHORE supports a transposed format by addition of a special header marking transposed<br />
data <strong>and</strong> specifying the transformation's block size. On le access this header is recognized <strong>and</strong><br />
input is passed through a lter applying the appropriate block-wise back-transposition. R<strong>and</strong>om<br />
access is supported by reading the appropriate block of data, back-transposing <strong>and</strong> subsequently<br />
seeking to the correct oset inside the block.<br />
2.2.5 Compression Results<br />
We used a data set consisting of approximately 6.8 million aligned 40 base pair Arabidopsis<br />
thaliana reads obtained by Illumina sequencing for evaluation of eectiveness of the measures<br />
implemented to improve space eciency.<br />
In plain MapList format, the read mapping data set occupied about 1.1 Gb of disk space. To<br />
assess the potential for reduction of total data set size by optimizing encoding <strong>and</strong> representation<br />
of certain attributes, we extracted individual data set columns (table 2.1). Read alignments <strong>and</strong>
30 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Attribute Plain Size Compressed Size Compression Ratio<br />
read alignments (ref. comp.) 349 Mb (33 Mb) 19 Mb (3.4 Mb) 5.7% (10.3%)<br />
read IDs (relabeled) 227 Mb (117 Mb) 39 Mb (3.2 Mb) 17.5% (2.7%)<br />
qualities (reduced) 341 Mb (341 Mb) 76 Mb (30 Mb) 22.3% (8.8%)<br />
other elds 163 Mb 5.4 Mb 3.3%<br />
Table 2.1: Compressibility of Individual Read Mapping Attributes<br />
qualities accounted for 349 Mb <strong>and</strong> 341 Mb, each corresponding to approximately 32% of the<br />
total storage size, followed by st<strong>and</strong>ard read identiers with 227 Mb (21%) <strong>and</strong> the collective<br />
size of all further attributes (163 Mb, 15%).<br />
Column data were compressed in XZ format at 128 kB block size. Read alignments were<br />
thereby reduced to 5.7% (19 Mb) of their original size, further elds to 3.3% (5.4 Mb), whereas<br />
read identiers <strong>and</strong> quality strings could not be compressed as eectively with compression ratios<br />
of 17.5% (39 Mb) <strong>and</strong> 22.3% (76 Mb), respectively.<br />
By application of reference-based compression, storage size of read alignments was reduced to<br />
less than ten percent (33 Mb) in plain text format. This gain was however somewhat diminished<br />
in compressed format due to the worse compressibility of the data, resulting in a compressed size<br />
of 3.4 Mb at a compression ratio of 10.3%.<br />
Relabeling of the reads resulted in a storage gain of roughly fty percent relative to the<br />
original representation owing to the shorter read identiers. Vastly improved however was the<br />
compressibility of the identier sequence, resulting in a reduction to just 2.7% (3.2 Mb) relative<br />
to plain text format.<br />
Eect of reduced quality value resolution was assessed using the cramtools quality binning<br />
conguration N40-R8 to allow direct comparison to the CRAM format, although the algorithm<br />
implemented in SHORE achieved similar results. The N40-R8 setting preserves quality values of<br />
all reference sequence mismatches, while assigning quality values of sequence matches to one of<br />
eight possible bins. Application of quality resolution reduction improved compression ratio of<br />
the quality data from 22.3% to 8.8%.<br />
We further assessed the eect of various combinations of compression <strong>and</strong> data reduction<br />
congurations on total data set size by application of the SHORE compress <strong>and</strong> SHORE coal utilities<br />
to the data set in MapList format <strong>and</strong> compare the results with storage of SAM/BAM <strong>and</strong> CRAM<br />
formatted data (table 2.2).<br />
Eect of compression block size was evaluated using SHORE's r<strong>and</strong>om access granularity presets<br />
fast, corresponding to 128 kB block size, medium, 2 MB, <strong>and</strong> slow, 64 MB. Compression<br />
of the 1.1 GB data set in SHORE-GZIP format at 128 kB block size reduced space requirements<br />
to 205 MB, or 19% of the uncompressed le size. Increasing block size to 2 MB further improved<br />
space eciency by just 1.5% to 202 MB storage size, <strong>and</strong> use of 64 MB blocks did not have a<br />
signicant advantage over 2 MB.<br />
Selecting the XZ compression format at 128 kB block size resulted in a storage size of 163 MB,<br />
15% of plain data set size <strong>and</strong> a 20% improvement over SHORE-GZIP at the same block size. Eect<br />
of increased compression block size was more pronounced for XZ than for SHORE-GZIP. At 2 MB<br />
<strong>and</strong> 64 MB block size disk usage was reduced to 140 MB <strong>and</strong> 116 MB, respectively, improvements<br />
of 14% <strong>and</strong> 17% over the respective previous block size.<br />
Block-wise transposition was applied using the block size parameter matching the respective<br />
compression block size. At the smallest block size, size of the transformed data in SHORE-GZIP<br />
format was 181 MB, a 12% improvement over untransposed storage. Block-wise transposition<br />
also augmented the eect of increased block sizes, achieving le sizes of 160 MB at 2 MB <strong>and</strong> of
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 31<br />
File Format Preprocessing Compression Format Block Size Compressed File Size<br />
SAM - - - 1.3 GB<br />
MapList - - - 1.1 GB<br />
BAM - BGZF default 237 MB<br />
SAM - GZIP 128 kB 216 MB<br />
MapList - GZIP 128 kB 205 MB<br />
MapList - GZIP 2 MB 202 MB<br />
MapList - GZIP 64 MB 202 MB<br />
MapList BT GZIP 128 kB 181 MB<br />
MapList - XZ 128 kB 163 MB<br />
MapList BT GZIP 2 MB 160 MB<br />
MapList BT GZIP 64 MB 157 MB<br />
MapList BT XZ 128 kB 150 MB<br />
MapList BT, RC GZIP 2 MB 143 MB<br />
MapList - XZ 2 MB 140 MB<br />
CRAM RC - default 135 MB<br />
MapList BT, RC XZ 128 kB 132 MB<br />
MapList BT XZ 2 MB 123 MB<br />
MapList - XZ 64 MB 116 MB<br />
MapList BT, RC, RL GZIP 2 MB 114 MB<br />
MapList BT, RC, QR GZIP 2 MB 109 MB<br />
MapList BT, RC XZ 2 MB 108 MB<br />
CRAM RC, QR - default 103 MB<br />
MapList BT, RC, QR XZ 128 kB 100 MB<br />
MapList BT XZ 64 MB 96 MB<br />
MapList BT, RC, RL XZ 128 kB 95 MB<br />
MapList BT, RC XZ 64 MB 84 MB<br />
MapList BT, RC, RL, QR GZIP 2 MB 81 MB<br />
MapList BT, RC, QR XZ 2 MB 80 MB<br />
MapList BT, RC, RL XZ 2 MB 77 MB<br />
MapList BT, RC, QR XZ 64 MB 62 MB<br />
MapList BT, RC, RL XZ 64 MB 60 MB<br />
MapList BT, RC, RL, QR XZ 64 MB 39 MB<br />
BT:<br />
RC:<br />
RL:<br />
QR:<br />
block-wise transposition<br />
reference-based compression<br />
read relabeling<br />
lossy quality resolution reduction (cramtools [120] preset N40-R8)<br />
Table 2.2: Comparison of Alignment File Compression Eciency
32 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
157 MB at the 64 MB conguration. This corresponds to advantages of 21% <strong>and</strong> 22% over the<br />
matching untransformed representations.<br />
With XZ/LZMA compression, compression ratios beneted slightly less from block-wise transposition,<br />
at the three dierent block sizes gaining 8%, 12% <strong>and</strong> 17% over untransposed data to<br />
obtain le sizes of 150 MB, 123 MB <strong>and</strong> 96 MB, respectively.<br />
We further list the eect of additionally applying various combinations of reference-based<br />
compression, read relabeling <strong>and</strong> quality resolution for transposed data encoded in GZIP format<br />
with 2 MB block size as well as encoded in XZ format at the three dierent block sizes. In<br />
all compression formats <strong>and</strong> block sizes assessed, reference-based compression could account for<br />
11% to 12% of additional storage space savings. Added relabeling of reads resulted in le sizes<br />
reduced by further 20% in SHORE-GZIP format <strong>and</strong> by over 28% in the three XZ encoded les.<br />
Reduction of quality resolution gave similar improvements, producing slightly higher savings<br />
than read relabeling in GZIP compressed data, but had somewhat lesser impact compared to<br />
relabeling in XZ encoding.<br />
For comparison of formats, table 2.2 also includes disk usage for the same data set converted<br />
to SAM, BAM <strong>and</strong> CRAM le formats. The SAM format representation of the data set entries<br />
was slightly more verbose compared to the MapList formatting, resulting in an uncompressed<br />
le size of approximately 1.3 GB. The BAM encoded data set was about 14% larger compared<br />
to MapList, <strong>and</strong> about 9% larger than the SAM formatted data compressed as SHORE-GZIP<br />
at 128 kB block size. When applying reference-based compression only, size of CRAM encoded<br />
data ranked slightly better than the MapList le compressed in XZ format at 2 MB block size<br />
or compressed as GZIP following block-wise transposition <strong>and</strong> reference-based compression, <strong>and</strong><br />
slightly worse than the 128 kB block size XZ-encoded le with block-wise transposition <strong>and</strong><br />
reference-based compression enabled. With additional reduction of quality value resolution,<br />
CRAM ranked somewhat worse than the XZ compressed MapList le for the same data encoded<br />
in 128 kB chunks with block-wise transposition.<br />
2.2.6 <strong>Data</strong> Storage Considerations<br />
Choice of sequencing data storage formats depends on three areas of application with overlapping,<br />
but distinct design goals. O-line data archiving requires primarily space-ecient representation.<br />
<strong>Data</strong> exchange formats must conform to a highly st<strong>and</strong>ardized <strong>and</strong> stable format specication <strong>and</strong><br />
allow combined distribution of all data <strong>and</strong> meta-data in a single le, while directly supporting<br />
simple analysis <strong>and</strong> processing tasks. Finally, on-line storage solutions should be optimized<br />
regarding the trade-o between space eciency <strong>and</strong> certain access guarantees to provide optimal<br />
support for the respective approaches to analysis.<br />
With the implementation of the SHORE storage formats, we demonstrate indexable, space ef-<br />
cient on-line data storage for high-throughput sequencing on the basis of simple text-based le<br />
format denitions <strong>and</strong> stock compression formats. While binary formats dedicated to sequencing<br />
data storage such as CRAM should permit further optimization for improved performance<br />
<strong>and</strong> space eciency, the outlined simple formats <strong>and</strong> pre-processing methods should provide a<br />
valuable benchmark for future eorts.<br />
The implemented indexing solution for compressed sequencing data les forms the basis for<br />
rapid data retrieval for iterative analysis <strong>and</strong> data set exploration. Furthermore, our indexing<br />
approach should facilitate the implementation of caching strategies e. g. for implementation of<br />
fully interactive visualization of large sequencing data sets.
2.3. A NON-DESTRUCTIVE READ FILTERING AND PARTITIONING FRAMEWORK 33<br />
2.3 A Non-Destructive Read Filtering <strong>and</strong> Partitioning Framework<br />
To streamline further processing <strong>and</strong> analysis, it is in general desirable to initially prune highthroughput<br />
sequencing data sets of sequence with below-par quality. Sequencing of short polymers<br />
such as small RNA, bar-coded multiplexing <strong>and</strong> diverse other protocols in addition to that<br />
require sequence context sensitive read clipping <strong>and</strong> data partitioning.<br />
With SHORE import, we provide a utility to apply such initial read ltering operations, in<br />
the process converting sequencing reads from various supported input formats into SHORE's<br />
FlatRead format <strong>and</strong> partitioning the data according to the predened SHORE directory hierarchy<br />
as described [1]. This work further develops the application into a modular <strong>and</strong> extensible read<br />
ltering framework.<br />
<strong>Data</strong> analysis frequently is an iterative process, where at advanced stages it may become clear<br />
that an initial choice of ltering parameters for a data set was suboptimal. However, retrieval of<br />
the original source data may be hampered by archiving in backup solutions, unless large amounts<br />
of duplicate data are to be supported <strong>and</strong> managed on disk. Our utility was therefore reworked to<br />
perform all sequencing read editing <strong>and</strong> ltering in a non-destructive, reversible manner, exploiting<br />
the predened SHORE directory hierarchy for preventing loss of information. It is thus able<br />
to seamlessly recover the raw unprocessed sequencing data sets for straightforward re-ltering<br />
or distribution to third parties. Processing facilities provided by the application include custom<br />
quality <strong>and</strong> chastity ltering, low complexity read removal, quality-based trimming of read ends<br />
<strong>and</strong> small RNA adapter clipping, as described [1, 18]. We have further added lters for sequence<br />
context-dependent read splitting as well as a versatile demultiplexing system (section 2.4). Defaults<br />
of the program are congured such that no ltering is applied automatically, but quality<br />
lters indicated by the input format are respected. Supported read ltering operations must be<br />
selectively enabled by the user <strong>and</strong> are applied in a dened order.<br />
SHORE import further provides optional functionality for partitioning its output into batches of<br />
xed, user denable size, as well as additionally by the length of the processed reads. As sequence<br />
length is a highly signicant property e. g. for small RNA data, length partitioning facilitates<br />
selection of subsets relevant to the respective analysis. <strong>Data</strong> partitioning may furthermore oer<br />
technical advantages. For example, large sequencing data sets may require a long period time<br />
to be written to disk, with a correspondingly high possibility of encountering system failures or<br />
other issues in the process. Reduction of the amount of data stored in a single le represents a<br />
trivial yet eective measure for limiting the signicance of such failures for processing operations<br />
that are applied on a per-le basis. As SHORE is capable of merging sorted data sets on-the-y<br />
with minor processing overhead, potential downsides of data set partitioning are minimized.<br />
The original denition of the SHORE run directory hierarchy was amended to support nondestructive<br />
read ltering, sample demultiplexing <strong>and</strong> raw data recovery. The directory hierarchy<br />
by default includes four levels (gure 2.5). The root of the hierarchy represents a single run of<br />
a sequencing instrument. This directory is named according to the respective output directory<br />
parameter, an arbitrary user provided string. Subdirectory names have dened meanings. The<br />
rst level below the root directory comprises of solely numeric names, representing the individual<br />
physically separated lanes of the sequencing run. The second level includes two dierent types<br />
of directory. Cleaned <strong>and</strong> demultiplexed read data are stored in directories corresponding to<br />
the appropriate sample identier prexed by the string sample_. On the same level, a directory<br />
filtered stores all data required to restore the original unedited data set.<br />
A further sub-level resides within the sample directories. Valid paired-end sequencing reads<br />
are partitioned into directories named according to an integer enumeration of the paired-end<br />
read indexes. Reads obtained by single end sequencing or paired-end reads losing their partner<br />
due to read ltering are stored in a separate directory single. <strong>Data</strong> les are stored either
34 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Run-EAS67.136<br />
1<br />
2<br />
ltered<br />
sample_Ler-1<br />
sample_Col-0<br />
sample_Kro-0<br />
sample_Bur-0<br />
ltered<br />
1<br />
2<br />
single<br />
1<br />
2<br />
single<br />
1<br />
2<br />
single<br />
1<br />
2<br />
single<br />
Figure 2.5: SHORE Run Directory Hierarchy<br />
directly in the single <strong>and</strong> paired-end directories, or below a further sub-level for batching or read<br />
length-dependent partitioning.<br />
On completion of a batch of reads, the associated le is sorted to enable accelerated queries<br />
on the data set (section 2.2.3) <strong>and</strong> compressed in a supported indexed compression format (section<br />
2.2.4). Output is formatted in SHORE's FlatRead format; results of read trimming <strong>and</strong><br />
clipping are on request reported via a set of optional tags rather than being applied directly<br />
to make ltering results available for further customized processing. Optionally, SHORE directory<br />
hierarchy output can be disabled to transform the program into a Unix pipeline-compliant<br />
st<strong>and</strong>alone read ltering utility. The application reports basic read trimming <strong>and</strong> ltering statistics,<br />
whereas quality <strong>and</strong> nucleotide content statistics are implemented as a separate module<br />
SHORE tagstats (section 2.8).<br />
SHORE import denes various dierent importers to be capable of dealing with a variety of<br />
input formats. For correct processing of read data, the program must be able to determine<br />
all reads belonging to the same read pairing. For pairing recognition to be possible without<br />
disproportional time or memory overhead, input data must be ordered according to adequate<br />
criteria. Currently available importers allow input where read pairs are provided as separate,<br />
correspondingly ordered les such as Illumina GAPipeline directories, sets of FASTQ les or<br />
SOLiD color space FASTA les, as well as sorted input in FlatRead, FASTQ, Illumina QSEQ or<br />
454 SFF format. Therefore, unordered input data require preprocessing, but can be conveniently<br />
passed to the application via Unix pipes using utilities SHORE convert <strong>and</strong> SHORE sort.<br />
Our utility automatically recognizes any filtered directories residing within SHORE run or<br />
lane directories that are passed to it as input. The contents of these directories are automatically<br />
parsed to restore the original state of the data prior to any SHORE specic ltering, optionally<br />
re-ltering the data set with altered settings. Filtering dened by the original input format is<br />
by default restored as well, but can be explicitly ignored. By passing the restored output to the<br />
tool SHORE convert, raw data can be obtained for distribution in a variety of widely supported<br />
formats.<br />
2.4 A Flexible Sequencing Read Demultiplexing System<br />
To minimize cost per sequenced base, next-generation sequencing instruments must be uncompromisingly<br />
optimized for maximum throughput, at the expense of generating more data in a<br />
single run than would be required for many types of research project. Therefore, most current se-
2.4. A FLEXIBLE SEQUENCING READ DEMULTIPLEXING SYSTEM 35<br />
quencers feature physically separated sequencing lanes or dierent types of ow cells or ow chips<br />
to allow either to distribute overall throughput to multiple samples or to enable reduced-output<br />
modes of operation. While physical compartmentalization allows simple sample multiplexing<br />
without any requirement for adaptation of work ows, it is tied to the sequencing hardware <strong>and</strong><br />
therefore oers very limited capabilities.<br />
Bar coded multiplexing strategies have thus quickly become the method of choice to achieve<br />
greater operational exibility with the primarily throughput-optimized high-end sequencer models.<br />
Bar coding protocols oer near-perfect scalability by allowing to distribute per-run sequencing<br />
depth to an almost arbitrary number of samples.<br />
Bar coded sequencing data require that prior to further analysis demultiplexing is performed<br />
in software to separate the respective samples as well as sample sequences from the articial bar<br />
code oligomers. The SHORE import utility integrates a exible demultiplexer capable of dealing<br />
with a wide variety of st<strong>and</strong>ard <strong>and</strong> custom bar coding protocols. The respective bar coding<br />
setup is provided by the user in a simple tab-delimited sample sheet format. Our tool supports<br />
all combinations of 5 ′ bar code ligation <strong>and</strong> index read protocols, variable length bar codes <strong>and</strong><br />
is able to take advantage of additional common subsequences such as e. g. restrictions sites in<br />
RAD-Seq applications (section 1.3).<br />
2.4.1 Overview<br />
Bar coding is realized by fusing sample DNA with index oligomers that serve as an unambiguous<br />
sample identier. Ligation of these bar codes can be achieved in many dierent ways, each coming<br />
with a dierent set of advantages <strong>and</strong> disadvantages regarding <strong>lib</strong>rary preparation, cost <strong>and</strong><br />
exibility. Therefore, various bar coding protocols as well as dierent bar coding kits provided<br />
by sequencing instrument vendors are available.<br />
Using the Illumina platform, index sequences may be associated with particular sequencing<br />
primers to be read as a separate index read. Other protocols ligate bar codes to be read in one<br />
go with the sample DNA, resulting in index oligomer <strong>and</strong> sample DNA being fused into a single<br />
sequence read. Moreover, paired-end sequencing protocols imply various possible bar coding<br />
congurations, with oligomer labels either attached to the start of just one of the sequencing<br />
reads, identical labels attached to either read, or dierent labels inserted at the start of the<br />
rst <strong>and</strong> the second read. Labeling reads individually, or even combining 5 ′ <strong>and</strong> index read bar<br />
codes, enables large numbers of samples to be discriminated using only few dierent oligomer<br />
tags, at the cost of a more elaborate sample preparation procedure <strong>and</strong> an increased overhead<br />
in sequencing cycles.<br />
Often multiple dierent labels are chosen to identify each respective sample. For example,<br />
strong bias in the sequenced sample may in certain cases interfere with sequencing instrument<br />
ca<strong>lib</strong>ration, <strong>and</strong> due to that the set of bar code oligomers must be carefully compiled to achieve<br />
a largely balanced nucleotide composition in the multiplexed DNA.<br />
Demultiplexing via tools provided by the sequencing instrument vendors is usually only possible<br />
for the vendor-specic indexing protocols. With the increasing variety in dierent bar coding<br />
protocols, we have gradually developed bar code matching integrated with the SHORE import utility<br />
into a exible demultiplexing solution that is congurable for all compositions of index read<br />
<strong>and</strong> 5 ′ ligated bar coding via a simple sample sheet format. We employ a read oriented format to<br />
avoid manual enumeration of all valid bar code combinations for multi bar code congurations.<br />
Prior to resolving tuples of bar codes to the appropriate sample identiers, bar code matching is<br />
performed individually for each read for improved performance <strong>and</strong> detection of potential mislabeling.<br />
The current bar code matching implementation is able to tolerate a xed, user-specied<br />
number of mismatches <strong>and</strong> no gaps.
36 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
1 #?sample read barcode<br />
Col-0 2 TTCACG<br />
Col-0 2 GGATGT<br />
Ler-1 2 CTAGGC<br />
5 Ler-1 2 AGACCA<br />
Listing 2.1: Example of a Simple Demultiplexing Sheet<br />
2.4.2 A Format for Description of Multiplexing Setups<br />
A SHORE demultiplexing sample sheet is a simple tab-delimited text table with named columns.<br />
The recognized column names are lane, sample, barcode_group, read, barcode, extbarcode<br />
<strong>and</strong> barcode_type.<br />
The sample sheet is parsed utilizing a generic table input API provided by the SHORE <strong>lib</strong>rary.<br />
Column names are dened in the header line, the rst non-empty line of the le that is not a line<br />
comment, or is introduced with a special comment tag #?. Columns are recognized by name<br />
<strong>and</strong> may occur in arbitrary order; further columns may be present, but are ignored.<br />
The only m<strong>and</strong>atory column, named sample, denes the sample identiers that bar codes are<br />
to be translated to. All further columns may be added in arbitrary combinations <strong>and</strong> order. A<br />
typical simple index read demultiplexing conguration is described by additionally specifying the<br />
read <strong>and</strong> barcode columns (listing 2.1), with read specifying the index of the read that contains<br />
the bar code tag <strong>and</strong> barcode the actual sequence of the bar code oligomer. The read index<br />
column may be omitted, indicating that all sequence reads of a pair are attached to identical bar<br />
codes.<br />
Index read <strong>and</strong> 5 ′ bar coding require suitable treatment of the respective bar code oligomers.<br />
While 5 ′ bar codes must be clipped, with the remaining part of the read sequence to be retained,<br />
index reads must be removed completely from the data set. The default implemented in SHORE<br />
is to apply bar code clipping if the sample sheet entry refers to the rst read or if the read<br />
index was omitted, <strong>and</strong> to lter bar-coded reads where the read index is greater than one. The<br />
barcode_type column allows to explicitly control this behavior. SHORE accepts bar code types<br />
read, 5prime <strong>and</strong> none, where the type is associated with the read index, i. e. all of a lane's<br />
entries with the same read index must also specify the same bar code type. Bar codes of type<br />
read will trigger ltering of the entire bar code associated read, whereas 5prime bar codes will<br />
be clipped.<br />
For congurations with bar code information distributed across multiple sequence reads, several<br />
entries with dierent read index values are for each sample added to the denition (listing 2.2,<br />
lines 47). The sample sheet entries will then be grouped on the sample identier, with each<br />
possible combination of bar codes with diering read indexes considered a valid bar code tuple<br />
for the respective sample. Sample sheet entries with the special sample identier * are interpreted<br />
as valid for each sample specied for the respective sequencing lane (e. g. lines 1617).<br />
For example, listing 2.2 denes two valid bar codes for the Ler-1 sample for each of the three<br />
sequencing reads, <strong>and</strong> thus 2 3 = 8 valid 3-tuples of bar codes that can be resolved to that sample.<br />
Automatic combination of all entries listed for a sample can be controlled by the user via<br />
specication of the barcode_group column. Sample sheet entries with dierent bar code group<br />
values are not able to form a valid bar code tuple (e. g. listing 2.2, lines 1013). For example, the<br />
rst read bar code specied in line 10 for the Col-0 sample may only combined with the second<br />
read entry from line 11 <strong>and</strong> not the one from line 13. Furthermore, a combination of line 10 <strong>and</strong><br />
13 would collide with the bar code tuples including entries from line 4 <strong>and</strong> 7, which are already<br />
specied to resolve to the Ler-1 sample. As with the sample column, the bar code group may be<br />
set to * to specify <strong>and</strong> entry that refers to all groups of the respective sample in the respective
2.4. A FLEXIBLE SEQUENCING READ DEMULTIPLEXING SYSTEM 37<br />
1 #?lane sample barcode_group read barcode extbarcode barcode_type<br />
# Bar codes for the Ler-1 sample.<br />
1 Ler-1 0 1 AACT TGCAG 5prime<br />
5 1 Ler-1 0 1 TAGC TGCAG 5prime<br />
1 Ler-1 0 2 AACT * read<br />
1 Ler-1 0 2 CCCT * read<br />
# Bar codes for the Col-0 sample.<br />
10 1 Col-0 0 1 AACTT GCAG 5prime<br />
1 Col-0 0 2 GGAC * read<br />
1 Col-0 1 1 TTGCT GCAG 5prime<br />
1 Col-0 1 2 CCCT * read<br />
15 # Third read has the same bar codes for all valid samples.<br />
1 * * 3 GGAC * 5prime<br />
1 * * 3 TTGC * 5prime<br />
# Lane 2 is not multiplexed, discard the index read.<br />
20 2 Bur-0 0 2 * * read<br />
Listing 2.2: Example of a Full Demultiplexing Sheet<br />
lane (e. g. lines 1617).<br />
For certain applications, the identity of several nucleotides immediately following 5 ′ bar code<br />
sequences is known, like for example restriction site sequence in RAD-Seq (section 1.3). While<br />
such sequences can be exploited to correctly assign each read to the appropriate sample, they<br />
usually should in contrast to the bar code oligomers not be removed from the output. Parts of the<br />
recognition sequence that should not be clipped from the read can be specied in the extended<br />
bar code (extbarcode) eld of the table. Internally, the sequence to be recognized is contructed<br />
by concatenation of the barcode <strong>and</strong> extbarcode elds, while the division of sequence among<br />
both elds is translated into a bar code cut position. For bar code types other than 5prime, the<br />
split into bar code <strong>and</strong> extended bar code has no eect. The bar code cut position is determined<br />
in the context of the respective bar code tuple, i. e. for the same recognition sequence dierent<br />
samples may dene a dierent split between bar code <strong>and</strong> extended bar code, as demonstrated<br />
by lines 4 <strong>and</strong> 10 of listing 2.2. If no part of the recognition sequence is to be removed from the<br />
output, then the entire oligomer can be provided as extended bar code, with column barcode<br />
either omitted or set to * .<br />
If neither bar code nor extended bar code are provided with a value other than * , then the<br />
respective sample sheet entry will match any read sequence. With bar code type either none<br />
or 5prime, such an entry can be utilized for assigning a certain sample identier to an entire<br />
sequencing lane. On the other h<strong>and</strong>, this property may be exploited to completely remove all<br />
reads with a certain read index from the output (e. g. listing 2.2, line 20).<br />
The sample sheet column lane serves to allow independent demultiplexing specications for<br />
dierent sequencing lanes in a single sample sheet le. Rows with diering sequencing lane elds<br />
are completely independent of each other. If the sequencing lane column is omitted, then the<br />
entire demultiplexing specication is considered valid for all lanes of the instrument run.
38 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
2.4.3 Barcode Recognition <strong>and</strong> Sample Resolution<br />
Despite current sequencing technologies having reached rather high levels of stability, single cycle<br />
quality dropouts can not completely ruled out. Therefore, algorithms matching reads <strong>and</strong> bar<br />
code oligomers should be able to tolerate at least limited hamming or edit distance. Rather than<br />
directly mapping each tuple of reads to one of the potentially huge number of possible bar code<br />
tuples with a certain cumulative edit distance, our algorithm proceeds by individually assigning<br />
each sequence read to a specic bar code.<br />
Sample demultiplexing is thus broken down into three successive operations. First, each<br />
sequencing read is matched to the set of allowed bar code sequences associated with the respective<br />
read index. Subsequently, given all sequences for a ow chamber spot could be matched to one<br />
of the oligomers, the resulting tuple of bar codes is mapped to the appropriate sample identier.<br />
Following successful bar code resolution, reads are then pruned of the index oligomers as required.<br />
The initial bar code recognition step tolerates up to a xed, user specied hamming distance<br />
between prex of the sequencing read <strong>and</strong> bar code sequence. We perform a staged prex<br />
matching to precomputed sorted sets of sequences, where each stage corresponds to a certain<br />
hamming distance to the exact tag sequences. The set of sequences for each stage is generated<br />
from the set of valid tags for the respective read index by recursively mutating the sequences<br />
of the respective previous set, with the mutated sequences carrying a pointer to the original<br />
unmodied bar code. Potential collisions in bar code resolution are automatically detected <strong>and</strong><br />
the corresponding entries pruned. At each stage, the read being assessed is used to query the set<br />
of tag sequences utilizing a binary relation that denes a pair of sequences as equivalent given<br />
that one is the prex of the other. If the query returns no result, the algorithm proceeds to the<br />
next stage, until the user specied number of mismatches is reached <strong>and</strong> the read is discarded<br />
as unresolvable.<br />
Given a tuple of reads could thus be successfully mapped to a tuple of valid index sequences,<br />
the set of valid bar code tuples is interrogated to resolve the bar code combination to the appropriate<br />
sample identier. Each of the reads is then either pruned of its 5 ′ end or agged as ltered<br />
according to the sample sheet's bar code type specication. Since the bar code cut position can<br />
in general only be resolved in the context of the complete bar code tuple, this clipping operation<br />
is postponed until sample identier <strong>and</strong> bar code group have been resolved.<br />
Due to its restriction to hamming distance rather than edit distance, our demultiplexer is of<br />
limited applicability to indel-prone sequencing technologies such as 454, Ion Torrent of SMRT<br />
sequencing. However, in certain cases perfect matching may be sucient, or probable indels<br />
might be represented explicitly in the sample sheet. In setups requiring very long bar code<br />
oligomers <strong>and</strong> a high mismatch tolerance, recursive generation of the mutated bar code sets<br />
threatens to become overly resource intensive. However, bar code matching is implemented as<br />
one of several isolated modules with a simple interface, such that additional matching options<br />
might easily be added in the future as required, e. g. utilizing the dynamic programming alignment<br />
algorithms provided by the SHORE <strong>lib</strong>rary. For highly customized context sensitive read clipping<br />
<strong>and</strong> splitting allowing for gaps <strong>and</strong> user dened scoring schemes, we provide a separate utility<br />
SHORE oligo-match (section 2.5).<br />
2.5 Versatile Oligomer Detection <strong>and</strong> Read Clipping<br />
In addition to sample demultiplexing, a wealth of dierent <strong>lib</strong>rary preparation protocols <strong>and</strong><br />
sequencing applications exist that involve ligation of certain adapter or linker sequences, or otherwise<br />
require sequence context aware clipping or splitting of the unprocessed read sequences.<br />
Depending on the protocol or application, the respective recognition sequences may be degener-
2.5. VERSATILE OLIGOMER DETECTION AND READ CLIPPING 39<br />
ate or truncated in various ways. Therefore, computational tools are required that are capable of<br />
detecting optimal sequence matches given dened constraints, <strong>and</strong> can utilize detected matches<br />
for pruning, splitting or otherwise manipulating sequencing read data according to user requirements.<br />
In this work, a highly congurable utility SHORE oligo-match for pair-wise sequence matching<br />
<strong>and</strong> match based read manipulation was implemented. The program utilizes dynamic programming<br />
alignment algorithms with customized alignment matrix <strong>and</strong> backtrace initialization, modes<br />
of backtracing <strong>and</strong> match ltering. Various modes of sequencing read manipulation or annotation<br />
are provided, <strong>and</strong> full match information may be retrieved for advanced customization needs.<br />
Our application supports both threaded as well as distributed memory modes of parallel operation.<br />
It is applicable to matching, manipulating <strong>and</strong> ltering even large data sets of sequencing<br />
reads by a small set of provided oligomer sequences.<br />
2.5.1 Overview<br />
Optimal pairwise sequence alignments can be computed using well known dynamic programming<br />
algorithms, given that no larger-scale sequence rearrangements need to be accounted for. Pairwise<br />
alignment algorithms are grouped into two classes, global alignment methods like the Needleman-<br />
Wunsch algorithm [133] <strong>and</strong> local alignment such as the Smith-Waterman algorithm [134]. However,<br />
matching pairs of short sequences such as high throughput sequencing reads <strong>and</strong> adapter<br />
oligomers usually implies specic sequence overlap congurations, <strong>and</strong> therefore represents neither<br />
a global nor local alignment use case.<br />
Mate pair sequencing <strong>lib</strong>raries for Roche 454 Genome Sequencer instruments are constructed<br />
through the Cre/lox system. Cre is an enzyme that catalyzes site specic recombination at<br />
pairs of lox sequences. The protocol exploits this anity for circularizing multiple kilobase DNA<br />
fragments. Circularized DNA is then cut in relative proximity to the recombination site to<br />
obtain a shorter linear fragment consisting of the inverted ends of the original fragment joined<br />
by a linker oligomer that includes a lox sequence. To obtain mate pair reads, the compound<br />
insert is sequenced <strong>and</strong> then split computationally adjacent to the detected location of the linker<br />
oligomer.<br />
Cre/lox mate pair sequencing protocols have also been introduced to Illumina technology<br />
[135]. Due to the by comparison short read lengths, both ends of the circularized fragment<br />
cannot in general be retrieved as a single sequence read <strong>and</strong> thus the insert is sequenced from<br />
both ends. The linker oligomer may be detectable in none, only one, or both sequences of a<br />
read pair, depending on read length, linker position <strong>and</strong> size of the insert.<br />
Recognition <strong>and</strong> removal of adapter oligomers is further required in all situations where there<br />
is an overlap between the distributions of insert size <strong>and</strong> sequencing read length. In small RNA<br />
sequencing, all relevant sequencing reads will contain the 3 ′ sequencing adapter due to the short<br />
length of the sequenced polymer. Alternative multiplexing protocols are enabled by this property<br />
where the bar code oligomer is ligated to the 3 ′ <strong>lib</strong>rary adapter. In total RNA <strong>lib</strong>raries adapter<br />
sequences are in contrast only picked up in a minority of reads.<br />
Adapter <strong>and</strong> linker recognition require nding the optimal pair-wise alignment with the constraint<br />
of specic sequence overlap congurations. We dene ve dierent keywords global,<br />
local, dangling_qry, dangling_ref <strong>and</strong> dangling_any for description of pair-wise sequence<br />
overlap congurations. At either end of the sequence alignment there are four dierent possible<br />
overlap congurations overhang of the reference sequence (dangling_ref), overhang of<br />
the query sequence (dangling_qry), unaligned sequence in both reference <strong>and</strong> query (local) as<br />
well as no sequence overhang (global) <strong>and</strong> therefore 16 dierent alignment congurations in<br />
total (table 2.3).
40 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Alignment Matrix initialization (left end) Backtracing (right end)<br />
1 global global<br />
2 global dangling_qry, dangling_any<br />
3 global dangling_ref, dangling_any<br />
4 global local<br />
5 dangling_qry, dangling_any global<br />
6 dangling_qry, dangling_any dangling_qry, dangling_any<br />
7 dangling_qry, dangling_any dangling_ref, dangling_any<br />
8 dangling_qry, dangling_any local<br />
9 dangling_ref, dangling_any global<br />
10 dangling_ref, dangling_any dangling_qry, dangling_any<br />
11 dangling_ref, dangling_any dangling_ref, dangling_any<br />
12 dangling_ref, dangling_any local<br />
13 local global<br />
14 local dangling_qry, dangling_any<br />
15 local dangling_ref, dangling_any<br />
16 local local<br />
Table 2.3: Dynamic <strong>Programming</strong> Alignment Modes
2.5. VERSATILE OLIGOMER DETECTION AND READ CLIPPING 41<br />
The conguration depicted in the rst column indicates the type of end overhang that should<br />
not be penalized by the alignment algorithm's scoring method, with the lower line dened as<br />
reference (ref) <strong>and</strong> the upper line as query (qry). The dierent end alignment modes form a<br />
hierarchy of constraints, with dangling_any a subset of local, dangling_qry <strong>and</strong> dangling_ref<br />
subsets of dangling_any, <strong>and</strong> global a common subset dangling_qry <strong>and</strong> dangling_ref.<br />
The term query will be used in the following to always indicate the sequencing read, <strong>and</strong><br />
reference may thus refer to a possibly much shorter oligomer sequence. Alignment of a short<br />
read to a part of a genomic reference sequence constitutes an example of overlap conguration 11,<br />
dened by the keyword pair (dangling_ref;dangling_ref). Detection of sequencing adapters<br />
or 3 ′ bar codes in small RNA sequencing data corresponds to congurations 6 or 7, depending<br />
on whether the read contains the entire or only the partial adapter sequence. This subset of<br />
congurations is dened by the pair (dangling_qry;dangling_any). Detection of bar codes in<br />
5 ′ bar-coded sequences is represented by case 2 (global;dangling_qry).<br />
For Cre/lox-type recombinant mate pair sequencing varied overlap constraints may be appropriate<br />
depending on the desired sensitivity-specicity tradeo. Conservative detection of<br />
linker sequences in valid 454 type mate pairs corresponds to conguration 6. Illumina Cre/lox<br />
mate pair protocols obtain read pairs where the linker sequence is expected towards the end<br />
of either read (conguration 6 or 7). Occasionally one of the enzymatic cuts of the circular<br />
DNA may also occur inside the linker DNA. Such cases are described by conguration<br />
10. The subset of congurations 6, 7 <strong>and</strong> 10 constitutes a case of mutually dependent end<br />
alignment congurations. Thus, it can not be accounted for by a single keyword pair, but the<br />
two pairs (dangling_qry;dangling_any) <strong>and</strong> (dangling_ref;dangling_qry) (or alternatively,<br />
(dangling_any;dangling_qry) <strong>and</strong> (dangling_qry;dangling_ref)).<br />
The SHORE oligo-match utility is capable of for each sequencing read selecting the optimal<br />
alignment or alignments out of multiple pairs of end alignment modes, multiple reference<br />
oligomers <strong>and</strong> optionally their reverse complemented sequence.<br />
By utilizing a fully customizable 16x16 scoring matrix, the program enables adjustable h<strong>and</strong>ling<br />
of ambiguous IUPAC nucleotide codes as well as asymmetric base mismatch penalties with<br />
respect to the direction of the match.<br />
A pair-wise sequence mapping is composed of two pairs of end coordinates as well as the<br />
alignment describing actual base pairings <strong>and</strong> sequence gaps. To increase specicity of read<br />
clipping <strong>and</strong> splitting operations, it is desirable to ensure that optimal pair-wise mappings feature<br />
a unique pair of end coordinates, whereas potential alternative alignments can be considered<br />
irrelevant. Our alignment algorithm is capable of either generating an exhaustive list of all<br />
possible alignments, a list featuring a representative of all alignments with a dierent pair of<br />
end coordinates, or just a single representative for each pair-wise mapping, which is optionally<br />
assessed for uniqueness of end coordinates.<br />
The utility provides lters for pair-wise mappings with respect to uniqueness of oligomer<br />
selection <strong>and</strong> end coordinates. Alignments valid following ltering may be comprehensively<br />
reported <strong>and</strong> applied to clipping or splitting sequencing reads at either, or at both ends of the<br />
detected oligomer.<br />
2.5.2 Dynamic <strong>Programming</strong> Alignment <strong>and</strong> Backtracing Pipeline<br />
Our alignment pipeline proceeds in three subsequent passes. Initially dynamic programming<br />
alignment of the sequencing read is performed to each oligomer <strong>and</strong> for each pair of provided end<br />
alignment modes. The optimal alignment or alignments are passed to the backtracing module.<br />
Finally, generated traces are ltered <strong>and</strong> may subsequently be applied to read manipulation.
42 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
While alignment scores become available after the initial stage, ltering is delayed until after<br />
backtracing for reasons of threshold calculation.<br />
The end alignment mode for the left end of the alignment determines the mode of alignment<br />
matrix initialization <strong>and</strong> alignment score calculation. With left end alignment mode local<br />
initialization <strong>and</strong> score calculation proceeds according to st<strong>and</strong>ard local alignment, with rst<br />
row <strong>and</strong> column initialized to zero <strong>and</strong> the alignment score at each eld of the matrix truncated<br />
at zero from below. Left end mode dangling_any utilizes the same matrix initialization, but<br />
does not clip alignment scores at zero. With dangling_ref <strong>and</strong> dangling_qry, only the rst row<br />
or column is zero-initialized, respectively, given the width of the matrix is determined by the<br />
size of the reference sequence. Sequence overhangs for the respective other sequence are valued<br />
with the regular gap penalty. Mode global does not perform pre-initialization of the matrix, as<br />
in st<strong>and</strong>ard global alignment.<br />
For selection of the best out of multiple permitted pairs of end alignment mode for a reference<br />
oligomer, multiple passes of dynamic programming alignment must be performed due to the<br />
distinct requirements of matrix initialization. For each distinct left end alignment mode a corresponding<br />
right end alignment mode may be dened. However, alignment is only performed if the<br />
conguration is not included by a dierent pair of modes specied, i. e. if the constraint on the<br />
right end of the alignment is weaker than that specied by the next left end mode that includes the<br />
current one. Due to the hierarchical nature of tolerated sequence overhang congurations there<br />
is nonetheless a chance that the same alignment will be produced multiple times by dierent keyword<br />
pairs. For example, (dangling_ref;dangling_qry) <strong>and</strong> (dangling_qry;dangling_ref)<br />
both dene supersets of the (global;global) conguration. Such redundancies can be eliminated<br />
prior to backtracing by determining the subset of the respective end alignment mode that<br />
has already been covered by previous alignments <strong>and</strong> excluding it by assigning corresponding<br />
elds of the alignment matrix the maximum penalty.<br />
For backtracing initialization <strong>and</strong> alignment score calculation, the alignment algorithm<br />
keeps track of values <strong>and</strong> locations of last row, last column <strong>and</strong> global score maximums for the<br />
matrix. Mirroring matrix initialization, backtracing starts at global score maximums for right<br />
end alignment mode local, at last row <strong>and</strong> last column maximums for modes dangling_ref<br />
<strong>and</strong> dangling_qry, respectively, at the combined last row <strong>and</strong> last column maximums for<br />
dangling_any, <strong>and</strong> at the lower right corner for global.<br />
Whenever the backtracing algorithm encounters multiple possible elds of origin for an alignment<br />
score, alternative paths are stacked for later completion, unless retrieval of only a single<br />
representative alignment was requested. While exhaustive tracing of all possible alignment paths<br />
can be combinatorially unfavorable, generating a representative for all distinct pairs of mapping<br />
end coordinates can be performed eciently. For this purpose, each eld of the matrix that has<br />
been reached by the backtracing algorithm from the same right end coordinates is marked as<br />
visited. If a eld marked as visited is reached from one of the stacked alternative paths, the<br />
respective path is aborted <strong>and</strong> the algorithm proceeds to assessing the next path.<br />
For read clipping purposes it is typically only relevant whether an oligomer can be assigned<br />
to unique coordinates on the sequencing read. For assessment of coordinate uniqueness, the<br />
backtracing algorithm proceeds only until alternative end coordinates have been encountered, <strong>and</strong><br />
the alternative trace is not emitted. If the backtrace completes without encountering coordinates<br />
dierent from the primary trace, its end coordinates are marked as unique.<br />
Reliability of a sequence match is determined by both the length of the match <strong>and</strong> its relative<br />
amount <strong>and</strong> type of edit operations. Alignment score ltering is therefore congured by setting<br />
slope <strong>and</strong> oset of a linear function. By default, the function's parameter is the length within the<br />
shorter sequence out of reference <strong>and</strong> query that is spanned by the match. Alignments with a<br />
score below the threshold thus calculated are not propagated to the list of results <strong>and</strong> sequencing
2.6. A PARALLELIZATION FRONT-END FOR SHORT READ ALIGNMENT TOOLS 43<br />
read manipulation. Alternatively, the score threshold may be set up to be calculated using the<br />
total length of the selected sequence as the function parameter.<br />
2.6 A Parallelization Front-End for Short Read Alignment Tools<br />
A great variety of sequencing read mapping <strong>and</strong> alignment programs are nowadays available to<br />
users. While most of these applications are similar in operation, each comes with its own specic<br />
user interface <strong>and</strong> usage peculiarities. Despite the huge selection of dierent implementations<br />
<strong>and</strong> extensive optimization of the underlying mapping <strong>and</strong> alignment algorithms, read mapping<br />
is furthermore still one of the computationally most expensive steps for resequencing applications<br />
(section 1.5.3). While parallel processing can often solve the issue of extensive run times,<br />
parallelization support is not uniform among all available solutions. Therefore, as each read<br />
alignment tool comes with its own set of trade-os, strengths <strong>and</strong> weaknesses, users are often left<br />
to decide between performance <strong>and</strong> required or desired features, <strong>and</strong> frequently need to adapt to<br />
dierent interfaces.<br />
The SHORE mapowcell application is a parallelization, pre- <strong>and</strong> post-processing wrapper for<br />
a variety of widely used <strong>and</strong> freely available short read mappers [1]. In addition to shared memory<br />
parallelization, the application has been reimplemented supporting networked distributed<br />
memory architectures through the Message Passing Interface (MPI) st<strong>and</strong>ard. Further added<br />
capabilities include automatic adjustment of parameters for h<strong>and</strong>ling of data sets with heterogeneous<br />
read lengths, as well as straightforward integration of mapping results obtained using<br />
dierent aligner back-ends or sets of mapping parameters. The program thus constitutes a<br />
meta-alignment tool able to amend mapped sequencing read data sets using dierent aligners,<br />
parameters or additional reference sequence information. Such multi-pass read mapping may<br />
especially benet computationally expensive approaches such as spliced read mapping.<br />
SHORE mapowcell currently provides a common user interface to the tools GenomeMapper,<br />
BWA, Bowtie, Bowtie2, ELAND <strong>and</strong> BLAT. User parameters are automatically transformed into<br />
the appropriate equivalent for the respective mapping back-end <strong>and</strong> results consistently reported<br />
in compressed, seekable SHORE MapList format easily convertible to various data exchange formats<br />
such as SAM.<br />
Ion or pyrosequencing read data sets as well as other types of sequencing data set following<br />
quality based read trimming (section 2.3) or sequence context specic read clipping (section 2.5)<br />
feature read length distributions with broad support range. However, many read mapping tools<br />
only support a set of static alignment parameters provided at program startup, resulting in<br />
below-par mapping specicity for reads on the lower tail of the read length distribution <strong>and</strong><br />
insucient sensitivity for reads at the upper end of the length spectrum.<br />
The SHORE mapowcell application automates read length-dependent selection of all relevant<br />
alignment parameters such as maximum tolerated edit distance, gap extension lengths or seed<br />
sizes. These user parameters may be provided as either absolute values or percentage of read<br />
length; additionally, edit distance may be bounded according to the alignment seed lemma.<br />
Given a certain read length l <strong>and</strong> edit distance d, a read alignment is guaranteed to contain a<br />
perfectly matching seed of size s = ⌊l/(d + 1)⌋, <strong>and</strong> therefore the maximum edit distance that<br />
guarantees nding a seed for each valid mapping is d = ⌊l/s⌋ − 1. We provide two modes of<br />
reducing the initially specied edit distance. A mode strict limits maximum edit distance to<br />
d max = ⌊l/s⌋ − 1, thereby ensuring that either all valid alignments with optimal edit distance<br />
are found for a read, or it is left unmapped (further seed selection heuristics implemented by<br />
the mapping tool excluded). <strong>An</strong> alternative setting on sets a weaker limit d max = ⌊l/s⌋, such<br />
that some of multiple mappings with the same edit distance may be missed, but reads are left<br />
unmapped if valid alignments at lesser edit distance can not be excluded.
44 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Sequencing of DNA that is highly diverged from available reference genome sequences as well<br />
as transcriptome sequencing due to intron exon junctions imply that many sequencing reads<br />
can not be mapped to the reference as a contiguous unit. Such data must be mapped using<br />
specialized spliced sequencing read mapping programs like PALMapper [94] or generic matching<br />
tools such as BLAT [136]. Specialized spliced mapping algorithms may however represent a<br />
suboptimal choice for mapping the entire data set, either for reasons of performance or specicity.<br />
A dierent, yet related problem pose cases where the respective reference sequence is too large<br />
to be h<strong>and</strong>led with a certain combination of mapping tool <strong>and</strong> available hardware. Our utility<br />
is able to mitigate these issues through automatic combination of the results of multiple read<br />
mapping passes using dierent alignment tools, parameters or reference sequence information.<br />
Two dierent re-mapping modes oer either full reassessment of the entire data set utilizing<br />
the information of prior read mapping passes, or processing <strong>and</strong> integrating additional mapping<br />
results for read previously not mappable to the reference genome.<br />
For robust h<strong>and</strong>ling of variable reference sequence information provided to the program,<br />
SHORE mapowcell maintains a sequence dictionary used to record reference sequence identiers,<br />
metadata <strong>and</strong> checksums along with the alignment result le. With each read mapping pass the<br />
reference dictionary is updated appropriately <strong>and</strong> possible additional sequences are assigned new<br />
unique sequence identiers as required.<br />
For reassessment of read previously not mappable to the reference genome, mapping proceeds<br />
as in single pass application, <strong>and</strong> results are the union of all newly discovered read mappings with<br />
the results of the previous pass. For full renement of entire data sets, both previously mapped<br />
<strong>and</strong> unmapped reads become subject to realignment. Previous mapping results are however<br />
factored into calculation of alignment parameters <strong>and</strong> determination of batch identity. On completion,<br />
results of the current <strong>and</strong> previous mapping run are merged while at the same time<br />
pruning read alignments rendered suboptimal by the newly obtained results from the mapping<br />
data set.<br />
Read mapping tools incorporating computationally more expensive spliced alignment algorithms<br />
further emphasize the need for ecient <strong>and</strong> capable parallelization of the mapping process.<br />
While many aligners implement simple threaded parallelization, support is not universal, <strong>and</strong><br />
only a few mapping tool-specic solutions exist for distributed memory settings or cloud environments,<br />
like e. g. CloudBurst [95]. SHORE mapowcell implements a simple batch-based parallelization<br />
approach, in which automatically generated batches of input data may be distributed<br />
to either an alignment tool running in multi-threaded mode, to multiple aligner processes instantiated<br />
on the same machine, to alignment tools running within MPI processes distributed over a<br />
network, or a hybrid combination of all of the above.<br />
SHORE mapowcell denes a binary relation on read data by which reads are equivalent if<br />
they are set to be aligned using the same combination of static mapping parameters, <strong>and</strong> divides<br />
the data set into batches accordingly. Reads are partitioned into batch-specic les in SHORE's<br />
temporary le directory <strong>and</strong>, on reaching a user specied maximum batch size, submitted for<br />
parallel processing. Parallel operations include actual read mapping as well as format conversions<br />
<strong>and</strong> sorting where required. With ne tuning of the maximum batch size parameter, a simple<br />
measure is provided for adjustment of the tradeo between minimal parallelization overhead <strong>and</strong><br />
improved load balancing. For distributed memory parallelization, the temporary le location<br />
specied is required to be shared over the relevant network, with only meta information on each<br />
batch being transmitted via MPI channels for simplicity of implementation. On completion of<br />
all batches related to a specic input le, mapping results are automatically merged back into<br />
the permanent storage location.
2.7. ROBUST DETECTION OF CHIP-SEQ ENRICHMENT 45<br />
2.7 Robust Detection of ChIP-Seq Enrichment<br />
Investigation of protein-DNA interaction by ChIP-Seq can help to further our underst<strong>and</strong>ing of<br />
cellular regulatory networks (section 1.3.2). Primary data analysis for ChIP-Seq experiments<br />
requires software tools that can pinpoint hundreds to thous<strong>and</strong>s of putative protein binding sites<br />
from data sets of high-throughput sequencing read alignments (section 1.5.6).<br />
This section discusses the peak calling pipeline SHORE peak integrated into the SHORE analysis<br />
suite. The utility is able to discriminate true enrichment from common artifact patterns using<br />
various robust heuristics, calculates false discovery rates (FDR) by interrogation of control experiments<br />
<strong>and</strong> implements a unique joint detection, separate evaluation approach to h<strong>and</strong>ling of<br />
replicate experiments. The program outputs locations of possible binding regions together with<br />
their FDR <strong>and</strong> various additional features potentially useful for ranking, ltering or ne-tuning<br />
the detection. It further enables users to generate graphical representations of peak regions,<br />
put detected loci into the context of available genome annotation or to extract binding region<br />
sequences for subsequent motif sampling <strong>and</strong> analysis.<br />
Our pipeline is pre-congured for straightforward detection of dened loci of ChIP-Seq enrichment<br />
as produced by transcription factor (TF) binding sites. Settings may however be adjusted<br />
to support the detection of clusters of enrichment that occur e. g. in data representing states<br />
of histone modication. Individual algorithms are implemented as pluggable modules <strong>and</strong> have<br />
been brought over to multi-purpose tools SHORE coverage <strong>and</strong> SHORE count, allowing ne-grained<br />
control <strong>and</strong> exible applicability of enrichment detection in diverse scenarios such as MNase-Seq<br />
assays or assessment of copy number variation (CNV).<br />
The SHORE peak pipeline has been outlined [58, 137] <strong>and</strong> further employed [138141] in multiple<br />
biological publications.<br />
2.7.1 Overview<br />
Were sequencing reads r<strong>and</strong>omly sampled from the genome, sequencing depth should be distributed<br />
according to the Poisson distribution whose parameter equals the average depth [142].<br />
Although high-throughput sequencing is known to be highly non-r<strong>and</strong>om, with biases that bring<br />
about considerable deviation from the Poisson model (section 1.4), the distribution can still<br />
present a powerful tool for the detection of irregularities in sequencing depth at high sensitivity.<br />
Our peak calling algorithm consists of two major steps. The rst pass is a detection phase that<br />
employs a relaxed Poisson assumption to the read alignment data generated through the ChIP-<br />
Seq experiment to identify regions of putative enrichment at high sensitivity. In the subsequent<br />
second pass implemented to improve specicity of the prediction, these c<strong>and</strong>idate regions are<br />
subjected to several robust empirical rules for artifact removal <strong>and</strong> further tested against control<br />
experiment data to assess signicance of the enrichment.<br />
2.7.2 Correcting for Duplicated Sequences<br />
The ChIP-Seq procedure yields quantities of DNA that are typically orders of magnitude below<br />
the amounts retrieved by other methods of DNA sampling. The excess PCR rounds required<br />
to obtain a compliant sequencing <strong>lib</strong>rary often bring about signicant amounts of duplicated<br />
sequences, reducing <strong>lib</strong>rary complexity. PCR may entail signicant sequence bias (section 1.4)<br />
with erratic rates of fragment duplication.<br />
Duplicate sequences represent the same fragment of DNA <strong>and</strong> therefore violate the assumption<br />
that each sequencing read is sampled independently from the set of sequences under examination.<br />
During detection of sequence enrichment, the presence of duplicate sequences must thus be<br />
explicitly accounted for by the predictive model, or these sequences must be pruned from the
46 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
data set prior to application of the actual detection algorithm. Most current ChIP-Seq analysis<br />
algorithms including SHORE peak implement a pruning approach.<br />
In the presence of sequencing errors or variable read length, PCR duplicates cannot be recognized<br />
by simply collapsing all reads having the same nucleotide sequence. Furthermore, at<br />
enriched loci identical sequences may be sampled independently. In the sequencer output, these<br />
identical sequences are thus indistinguishable from PCR artifacts.<br />
Duplicate recognition is therefore implemented as a ltering step that is applied after mapping<br />
the reads to the reference genome. ChIP-Seq algorithms typically restrict usage of duplicate reads<br />
by applying a static limit to the number of reads having their 5 ′ end mapped to the same position<br />
of the genome. While reliably removing duplicated sequences, the approach does not distinguish<br />
between duplication events <strong>and</strong> independent sampling, <strong>and</strong> thus suers from saturation once all<br />
reference positions are becoming occupied.<br />
We therefore adaptively remove PCR duplicates, using a heuristic based on the assumption<br />
that the number of non-duplicated reads n ∗ x that have their 5 ′ end at any position x of the<br />
genome is sampled from a Poisson distribution with unknown distribution parameter λ x , <strong>and</strong><br />
that each of these reads is represented on average by an unknown number of copies α x . Given the<br />
assumption holds, α x equals the dispersion index D x = σ 2 x/µ x for the read count distribution<br />
including duplicates. We further assume that all λ i <strong>and</strong> all α i for positions that are in close<br />
vicinity take similar values, <strong>and</strong> that the read count distribution for any position x can therefore<br />
be approximated using a short range of positions adjacent to x. Based on these assumptions,<br />
we estimate α for each position using a sliding window. In a window, the dispersion index is<br />
estimated as the ratio of the empirical variance <strong>and</strong> mean of the read counts. To increase the<br />
adaptiveness of the sliding window calculation, the ratio is calculated independently for a left<br />
window that includes all positions from x − k to x, <strong>and</strong> a right window including positions x to<br />
x + k<br />
ˆD left = s2 (x−k)..x<br />
¯n (x−k)..x<br />
ˆD right = s2 x..(x+k)<br />
¯n x..(x+k)<br />
with s 2 <strong>and</strong> ¯n the empirical variance <strong>and</strong> mean over the positions indicated, respectively. The<br />
left <strong>and</strong> right results are then averaged using the root mean square (RMS) of both values to<br />
obtain an estimate ˆα x of α x √<br />
1<br />
ˆα x =<br />
2 · ( ˆD left 2 + ˆD right 2 )<br />
The division into left <strong>and</strong> right windows should allow for sharper boundaries between enriched<br />
<strong>and</strong> background sites. RMS average over the two windows was chosen to ensure that re-evaluation<br />
using a scaled read count vector n ′ = n · 1/ˆα x as input will always yield ˆα ′ x = 1.0.<br />
In scenarios where it is desirable to retain the complete set of sequences with their associated<br />
base qualities, the estimated multiplicity factor ˆα x may be incorporated into the analysis by<br />
assigning each read starting at x a weight of 1/ˆα x during calculation of sequencing depth. If<br />
retaining all sequences is not a concern, the maximal number of reads may alternatively limited<br />
to ⌊n x /ˆα x ⌋, with n x the observed read count at a position x. The limiting approach is currently<br />
implemented in SHORE, discarding all excess alignments when the read count limit for the<br />
respective position has been reached.<br />
The duplicate correction algorithm does not provide any guarantees regarding the correctness<br />
of ˆα x . However, given that the variance of the read counts of adjacent positions is likely to<br />
exceed the Poisson assumption, average multiplicity should be overestimated. The algorithm is
2.7. ROBUST DETECTION OF CHIP-SEQ ENRICHMENT 47<br />
therefore conservative in the sense that the true number of non-duplicated reads per position<br />
will be underestimated. In contrast to imposing a static limit on the number of reads accepted<br />
at each position, the adaptive heuristic still allows to discriminate dierent levels of extreme<br />
enrichment as well as peak calling in presence of deep background coverage.<br />
2.7.3 Detection Phase<br />
The ChIP-Seq procedure enriches for short DNA fragments with above average anity to the<br />
protein of interest. One or both ends of these fragments are subsequently sequenced, allowing<br />
detection of its origin on the reference genome assembly. In the following, the enriched DNA<br />
fragments subjected to sequencing will be referred to as inserts, whereas the sequenced ends<br />
of the fragments will be referred to as reads. The number of inserts putatively overlapping a<br />
position will be termed insert depth, <strong>and</strong> the number of reads overlapping the position read depth.<br />
The primary peak detection phase, resembling that of MACS [104] (section 1.5.6), is realized<br />
as a sliding window analysis of insert depth. A window of xed, user-selectable width is shifted<br />
along the reference sequence in single base steps. In each step, the algorithm calculates the<br />
average depth of insert coverage over the current window. In the case of paired-end sequencing,<br />
insert depth is calculated from the ranges dened by read pairs. For single end sequencing, each<br />
read is extended in 3 ′ direction to match the estimated average insert size. To exclude regions<br />
not accessible to the sequencing experiment, positions with sequencing depth zero are ignored,<br />
i. e. the average depth is calculated as<br />
∑<br />
¯d ′ x∈W<br />
W =<br />
d(x)<br />
|{x ∈ W : d(x) > 0}|<br />
where W is the set of reference sequence positions included by the sliding window <strong>and</strong> d(x)<br />
signies the depth at a position x. Since the sliding window may or may not contain enriched<br />
sites, ¯d′ W may be regarded a conservative estimate of the minimum average background signal<br />
over the window. ¯d′ W is used as the average coverage depth parameter to evaluate the potential<br />
enrichment at the central position x c of the sliding window by a one-sided Poisson test. The<br />
test calculates the p-value for the respective coverage depth from the cumulative distribution<br />
function for the upper tail of the Poisson distribution<br />
⌊d(x<br />
p(x c ) = 1 − e − ¯d ∑ c)⌋<br />
′<br />
W ·<br />
i=0<br />
( ¯d ′ W )i<br />
i!<br />
Consecutive positions falling below a certain signicance threshold, by default set to 0.05, are<br />
joined to form a c<strong>and</strong>idate region.<br />
Detection is ne-tunable using two auxiliary signicance thresholds. A relaxed probing signicance<br />
threshold is accepted to automatically join c<strong>and</strong>idate regions separated only by short<br />
drops in depth of coverage into a contiguous region. Furthermore, c<strong>and</strong>idate regions may be<br />
pre-ltered using an acceptance threshold, discarding all regions that do not include at least one<br />
position that satises this more stringent signicance criterion.<br />
2.7.4 Recognition of Read Mapping Artifacts<br />
Final signicance scores for enrichment are calculated based on the number of read mappings<br />
providing evidence for a respective site (section 2.7.5). For multiple reasons, signicance of<br />
enrichment is on its own of limited value for singling out relevant protein binding sites (section<br />
2.7.6). Peak signicance is therefore primarily used as a means of imposing a ranking on<br />
the detected loci.
48 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Strong oversampling of a region of the reference sequence, for example due to collapsing of<br />
repetitive sequence onto few representations in a reference (section 1.5.3), may result in high<br />
levels of signicance for such loci even for minimal strengths of enrichment. Thus oversampling<br />
entails high-ranking loci that are of limited relevance to the analysis when compared to signicant<br />
loci sampled at regular depth.<br />
SHORE peak oers various ltering heuristics that allow for reliable removal of typical oversampling<br />
artifacts. Oversampled regions are marked by elevated depth of coverage in both<br />
experiment <strong>and</strong> control sample. By specication of a maximum coverage depth tolerated in the<br />
control for valid c<strong>and</strong>idate regions, many oversampled positions are ruled out. SHORE lters<br />
c<strong>and</strong>idate regions based on their average depth of control sample coverage, discarding regions<br />
whose depth lies above a certain threshold. The rejection threshold is calculated adaptively in<br />
terms of a user specied multiple of st<strong>and</strong>ard deviations distance from the median chromosomal<br />
depth of coverage. When analyzing multiple replicate experiments, such suspiciously high depth<br />
of coverage in any of the controls specied triggers removal of the proposed c<strong>and</strong>idate region<br />
by our algorithm. By default, regions with average control sample read depth more than six<br />
st<strong>and</strong>ard deviations above median depth are ignored.<br />
Fold change, calculated as the ratio of read counts or average depths of experiment <strong>and</strong><br />
control sample coverage at a c<strong>and</strong>idate locus, in some cases presents a more sensitive criterion for<br />
recognition of loci that satisfy a strict signicance cuto due to oversampling, but are of limited<br />
relevance to further analysis. Oversampling is then marked by large amounts of supporting<br />
evidence in the form of reads mapped to the region, but comparably weakly pronounced fold<br />
change. SHORE calculates normalized fold change values for a region as<br />
f =<br />
¯d e<br />
k e,c · ¯d c<br />
(2.1)<br />
where ¯d e <strong>and</strong> ¯d c represent the mean depth of coverage for experiment <strong>and</strong> control, respectively,<br />
<strong>and</strong> k e,c a scaling factor calculated for experiment <strong>and</strong> control over the respective chromosome to<br />
accommodate dierences in overall sequencing depth (section 2.7.5). Regions with a normalized<br />
fold change f < 2 are not considered by default. For experiments comprising multiple replicates,<br />
the user-specied fold change criterion must be satised by at least one replicate.<br />
For true sequence enrichment, mapped reads represent precipitated fragments of DNA typically<br />
larger than the sequenced read length. By contrast, oversampling artifacts arise due to<br />
over-representation solely of the part of the sequence utilized for read mapping. Given this<br />
mapping length is shorter than the actual insert length, a characteristic dierence in the str<strong>and</strong><br />
specic read depth curves ensues for enrichment compared to oversampling. In the case of enrichment,<br />
read depth curves for the forward <strong>and</strong> the reverse str<strong>and</strong> are shifted, with the shift oset<br />
representing the missing 3 ′ parts of the inserts that are not taken into account for the read depth<br />
calculation (gure 2.6). Contrarily, for oversampled regions not comprising enriched sequence,<br />
no such shift in str<strong>and</strong> specic read depth may be observed. The peak shift of single-end read<br />
depth therefore may be exploited as an acceptance criterion for enriched regions. SHORE peak<br />
estimates a peak shift value for a region based on the two areas under the read depth curves<br />
for each str<strong>and</strong>. For both these areas, the center of gravity (COG) is calculated. The estimated<br />
peak shift then equals dierence between the x-coordinate of the COG for the reverse str<strong>and</strong> <strong>and</strong><br />
the x-coordinate of the forward str<strong>and</strong> COG. In multi-replicate experiments, the user-specied<br />
peak shift threshold must be satised by at least one replicate.<br />
COG-based shift estimation may be considered robust for dened loci such as transcription<br />
factor binding sites. For larger stretches of enriched sequence representing clustered binding<br />
sites or histone methylation, the COG is easily skewed by local biases or r<strong>and</strong>om uctuations in<br />
sequencing depth, <strong>and</strong> therefore does not provide reliable estimation of peak shift.
2.7. ROBUST DETECTION OF CHIP-SEQ ENRICHMENT 49<br />
200<br />
SOC1<br />
100<br />
0<br />
Depth of Coverage [# Reads]<br />
-100<br />
-200<br />
10<br />
5<br />
0<br />
Control<br />
-5<br />
-10<br />
18810400 18810600 18810800 18811000 18811200 18811400 18811600 18811800<br />
Position [bp]<br />
Chromosome 2<br />
Figure 2.6: ChIP-seq Depth of Coverage in the Vicinity of a Transcription Factor Binding Site<br />
2.7.5 Ranking <strong>and</strong> Assessment of Peak Signicance<br />
In a nal step, the peak calling algorithm assesses detected c<strong>and</strong>idate regions to establish a<br />
ranking of putative peak sites <strong>and</strong> determine their level of signicance.<br />
For any c<strong>and</strong>idate region, given the read count for the experiment is a r<strong>and</strong>om variable X<br />
<strong>and</strong> the read count for the control is a r<strong>and</strong>om variable Y , then their joint read count is the<br />
convolution Z = X ∗ Y .<br />
Assuming the joint read count in a region takes an observed value Z = z, then the experiment<br />
read count is distributed according to a probability mass function P (X = x|Z = z). Using Bayes'<br />
theorem, it is<br />
P (Z = z|X = x) · P (X = x)<br />
P (X = x|Z = z) =<br />
P (Z = z)<br />
P (Y = (z − x)) · P (X = x)<br />
=<br />
P (Z = z)<br />
(2.2)<br />
P (Y = (z − x)) · P (X = x)<br />
= ∑ z<br />
k=0 (P (Y = (z − k)) · P (X = k)) (2.3)<br />
Were experiment <strong>and</strong> control read counts following Poisson distributions, X ∼ Pois(λ 1 )<br />
<strong>and</strong> Y ∼ Pois(λ 2 ), then their convolution is Poisson distributed as well, Z ∼ Pois(λ 3 ), where
50 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
λ 3 = λ 1 + λ 2 .<br />
When applying the Poisson assumption to equation 2.2, it follows that<br />
P (X = x|Z = z) =<br />
λ z−x<br />
2 ·e −λ 2<br />
(z−x)!<br />
· λx 1 ·e−λ 1<br />
x!<br />
λ z 3·e−λ 3<br />
z!<br />
z!<br />
=<br />
x! · (z − x)! · λx 1 · λ2 z−x · e −(λ1+λ2)<br />
λ z 3<br />
( · e−λ3<br />
z λ<br />
= ·<br />
x)<br />
x 1 · λ2<br />
z−x<br />
=<br />
( z<br />
x)<br />
·<br />
(λ 1 + λ 2 ) z<br />
( ) x λ1<br />
·<br />
λ 1 + λ 2<br />
(<br />
1 − λ 1<br />
λ 1 + λ 2<br />
) z−x<br />
(2.4)<br />
Therefore, given Poisson distributed read counts <strong>and</strong> an observed joint read count for a region<br />
Z = z, the read count of the experiment follows a binomial distribution, X ∼ B(z, p), with the<br />
probability parameter p as implied by equation 2.4<br />
p = λ 1<br />
λ 1 + λ 2<br />
(2.5)<br />
Our algorithm relies on the Poisson assumption to assess the signicance level of c<strong>and</strong>idate<br />
peaks using a one-sided binomial test, i. e. by evaluating the upper tail of the cumulative distribution<br />
function (CDF) of the binomial distribution for the experiment read count x <strong>and</strong> joint<br />
read count z<br />
z∑<br />
( z<br />
P (X ≥ ⌊x⌋) = · p<br />
k)<br />
k · (1 − p) z−k<br />
k=⌊x⌋<br />
The oor function ⌊ ⌋ is applied to x to accommodate weighted read counts as e. g. used for<br />
reads mapping to multiple positions of the reference genome in a conservative way.<br />
Normalization of experiment <strong>and</strong> control is therefore dened by calculation of an appropriate<br />
value for the probability parameter p of the binomial distribution. To obtain an estimate ˆp of the<br />
probability parameter we dene a sequencing depth scaling factor as δ = λ 1 /λ 2 . By equation 2.5,<br />
the relation between the depth scaling factor <strong>and</strong> the probability parameter is<br />
p =<br />
δ<br />
δ + 1<br />
δ =<br />
p<br />
1 − p<br />
Chromosomal, mitochondrial <strong>and</strong> plastid sequences may be present at considerably dierent<br />
number of copies in the cell, implying potentially dierent depth of coverage for these sequences.<br />
The probability parameter is therefore estimated independently for each chromosome or organelle<br />
sequence. The sequence is subdivided into xed-size non-overlapping bins. The normalization<br />
algorithm then chooses ˆδ such that the median of the read counts associated with the bins for<br />
the experiment equals the median of the control sample bins multiplied by ˆδ. This estimate is<br />
then used to replace δ in equation 2.6 to obtain a probability parameter for the binomial tests<br />
on the respective chromosome or organelle sequence.<br />
While detection of c<strong>and</strong>idate regions is applied to the joint data of all experiments, nal<br />
assessment of peak signicance is carried out separately for each replicate data set. This joint<br />
(2.6)
2.7. ROBUST DETECTION OF CHIP-SEQ ENRICHMENT 51<br />
detection, separate evaluation approach ensures that replicate results are immediately comparable<br />
without requirement for peak overlap calculation involving arbitrary thresholds. At the<br />
same time, signicance levels obtained still represent independent entities.<br />
To account for the potentially large number of binomial tests performed, we nally process<br />
the p-values calculated to obtain false discovery rates (FDR) using Benjamini-Hochberg estimation<br />
[143].<br />
Finally, to allow ne-grained custom ltering of peak lists, a variety of further peak properties<br />
are reported by our program, such as peak shift, peak height <strong>and</strong> fold enrichment ratio. Fold<br />
change values between sample <strong>and</strong> control are reported as a score F on the interval [−1, 1] using<br />
the transformation F = 4 π · arctan(f) − 1 on equation 2.1.<br />
2.7.6 Experimental Relevance of Peak Signicance<br />
Besides true DNA binding anity of the protein of interest <strong>and</strong> described artifact patterns,<br />
sources of bias may contribute to sequence enrichment in some cases indistinguishable in ChIP-<br />
Seq results.<br />
For example, between-sample variation in the strength of sequence specic sampling bias<br />
might potentially be introduced due to slight dierences in sample preparation parameters (section<br />
1.4). Modeling varying sequencing depth as a scaling factor that globally applies to entire<br />
chromosomes then has the consequence that the denition of enrichment of a sequence in a<br />
two-sample comparison may include positions with above-average sampling bias, if the specic<br />
bias is pronounced more strongly in the experiment than in the control; as well as positions with<br />
below-average bias where the bias in the experiment is weaker when compared to control. Such<br />
dierences in sampling bias strength can likely not be estimated globally, as the eects of multiple<br />
sources of bias might be superimposed. Adequate sampling bias estimation <strong>and</strong> correction<br />
therefore can be expected to require complex modeling based on specic sequence features <strong>and</strong><br />
likely further investigation into sequencing <strong>and</strong> sample preparation processes. Str<strong>and</strong> specic<br />
eects may however be ruled out by assessment of properties like peak shape, peak shift or<br />
forward-reverse fold change. Furthermore, such technical sources of enrichment can sometimes<br />
be excluded by downstream binding motif analysis.<br />
Moreover, correctly identied sites with above-average protein-DNA anity might be biologically<br />
irrelevant. With multiple co-factors contributing to transcription initiation, evolutionary<br />
forces restricting the genome-wide development of individual binding sites are likely weak. Relevant<br />
sites should therefore be dened by presence of further promoter elements. Furthermore,<br />
both transcription factor binding as well as the ChIP procedure itself are governed by complex<br />
physico-chemical dynamics. DNA fragments with above-average anity to the protein of interest<br />
might become detectable in ChIP-Seq, where the induced binding is however not suciently<br />
stable to result in transcription-regulatory relevance. Signicance of such sites is however solely<br />
determined by sucient depth of sequencing.<br />
Further biological factors such as chromatin state <strong>and</strong> accessibility may contribute to significant<br />
enrichment not directly related to protein binding events. For these reasons, statistical<br />
signicance of enrichment can on its own not provide a criterion for denition of biologically<br />
relevant binding sites.<br />
Furthermore, the interpretation of the ranking imposed by binding site signicance involves<br />
intricate considerations. Signal resulting from several dierent inuences becomes convoluted<br />
through the ChIP procedure. Quality of the employed antibody, overall quality of the precipitation<br />
procedure, prevalence of the DNA-protein interaction across tissues as well as the actual<br />
anity of the protein of interest to its target motifs all contribute to the overall signal observed.<br />
The rst two eects globally aect the ratio of enriched sites to the magnitude of background cov-
52 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
erage, <strong>and</strong> therefore impair the ability to observe global discrepancies in binding anity between<br />
samples. By contrast, tissue composition <strong>and</strong> binding anity are both locus specic eects that<br />
are indistinguishable during analysis, i. e. the procedure does not allow distinguishing between<br />
the strength of the Protein-DNA interaction <strong>and</strong> its predominance across the cells <strong>and</strong> tissue<br />
types from which the sample was retrieved. The large amount of tissue required for immunoprecipitation<br />
protocols likely contributes to a lack of controllability of the analyzed tissue states<br />
<strong>and</strong> composition, which may constitute a factor impeding experimental reproducibility.<br />
2.8 Visualization of Sequencing Read <strong>and</strong> Alignment <strong>Data</strong><br />
Exploration <strong>and</strong> iterative analysis of high-throughput sequencing data require computational<br />
tools able to rapidly retrieve <strong>and</strong> visualize certain subsets or features of a data set. After obtaining<br />
raw or ltered read data, assessment of base call quality, nucleotide composition <strong>and</strong> read length<br />
distribution is desirable to verify overall data quality prior to further analysis (section 1.5.1).<br />
At later stages of data analysis, visual inspection of read alignment data can be a valuable<br />
tool for verication of loci of identied mutation c<strong>and</strong>idates (section 1.3) or underst<strong>and</strong>ing the<br />
structure of complex genomic regions of clustered polymorphisms or rearrangements. For other<br />
applications such as mRNA-, small RNA- or ChIP-Seq, examination of local depth of coverage<br />
curves is an adequate approach to value structure or expression of individual transcripts or<br />
rule out a potential mapping artifact. Finally, dierent approaches to data visualization are<br />
required for gaining cues towards potential megabase scale or genome wide characteristics of the<br />
sequencing depth distribution in applications targeting e. g. epigenomic modications.<br />
While web browser or graphical user interface driven applications allow fully interactive navigation<br />
of data sets, they may require considerable eort in data preparation or are not easily<br />
operated remotely over a network.<br />
The SHORE pipeline implements a variety of means to ad-hoc sequencing data visualization.<br />
The SHORE tagstats utility oers a summarized view of base quality distributions, per-base<br />
nucleotide composition as well as read length distribution. We further provide an application<br />
SHORE mapdisplay capable of rapidly generating comprehensible views of read alignment data,<br />
either as a text-based representation viewable in a terminal window, or in the form of static<br />
scalable vector graphics (SVG) images readily displayed in modern web browsers. In addition<br />
to that, versatile utilities SHORE coverage <strong>and</strong> SHORE count provide options for visualizing local<br />
depth of coverage as well as larger-scale genomic distribution of read mappings, respectively.<br />
2.8.1 Gathering Run Quality <strong>and</strong> Sequence Composition Statistics<br />
The SHORE tagstats utility summarizes content of unique read sequences in a data set, <strong>and</strong><br />
additionally gathers base quality <strong>and</strong> per base nucleotide statistics. Statistics are collected in<br />
various tables <strong>and</strong> may additionally be visualized in the form of an all-in-one gure generated as<br />
a static Portable Document Format (PDF) image (gure 2.7).<br />
Read quality distribution is represented as a per-base box plot, with indicators for median<br />
values, second <strong>and</strong> third quartile as well as the range included by a single st<strong>and</strong>ard deviation<br />
distance from the mean. Median <strong>and</strong> quartile values are corrected for quantization bias (section<br />
2.8.4).<br />
Total number of reads extending beyond a certain length is indicated by the support curve.<br />
The gure further includes the frequency of either nucleotide at each cycle of the sequencing run.<br />
Nucleotide frequencies are displayed multiplied by four to allow lineup with the support curve. In<br />
presence of sequencing errors <strong>and</strong> varying read length, uniqueness of read sequences cannot serve<br />
as a means of removal of duplicate sequences arising due to PCR amplication. Moreover, the
2.8. VISUALIZATION OF SEQUENCING READ AND ALIGNMENT DATA 53<br />
Quality Score | Nucleotide Count<br />
0 8 16 24 32 40 48<br />
0 5356 10712 16067 21423 26779 32135<br />
Median Base Quality<br />
Inner Quality Quartiles<br />
Average Quality ±SD<br />
Support<br />
4A<br />
4C<br />
4G<br />
4T<br />
Length Distribution<br />
Dashed: Unique Sequences<br />
1 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 160<br />
Cycle [bp]<br />
Figure 2.7: Read <strong>and</strong> Quality Statistics for a Paired-End Sequencing Run Following Quality-<br />
Based Read Trimming<br />
likelihood of reads originating from duplicates having the same nucleotide sequence decreases with<br />
increasing read length (sections 1.4, 2.7.2). Frequency <strong>and</strong> nucleotide composition of unique read<br />
sequences can however still provide valid clues on ChIP-Seq <strong>lib</strong>rary quality <strong>and</strong> complexity <strong>and</strong><br />
reveal important sample properties in small RNA sequencing applications. Thus, support <strong>and</strong><br />
nucleotide composition after pruning of redundant read sequences is indicated by a corresponding<br />
dashed line for each property.<br />
Finally, distribution of read lengths is dicult to pick up visually from support distributions<br />
alone. For improved appraisal, the distributions are therefore additionally indicated by short<br />
vertical bars at the bottom of the gure.<br />
While the presented superimposition of quality <strong>and</strong> nucleotide composition curves may in<br />
some cases serve to obfuscate the interpretability of either individual property, it renders multivariate<br />
entities such as correlations between read quality issues <strong>and</strong> skews in nucleotide composition<br />
or the eectiveness of read trimming algorithms visually immediately appraisable.
54 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
G<br />
g<br />
G<br />
G<br />
G<br />
A<br />
G<br />
-A<br />
a<br />
a-<br />
G<br />
T<br />
C<br />
c<br />
T<br />
A<br />
G<br />
A<br />
a<br />
A<br />
G<br />
G<br />
a<br />
A<br />
c<br />
G<br />
c<br />
a<br />
A<br />
G<br />
A<br />
A<br />
A<br />
a<br />
T<br />
g<br />
C<br />
g<br />
A<br />
c<br />
A<br />
T<br />
C C<br />
A<br />
A<br />
C<br />
A<br />
A<br />
A<br />
C<br />
A<br />
a<br />
A<br />
G<br />
G<br />
A<br />
T<br />
A<br />
G<br />
t<br />
A<br />
a<br />
a<br />
G<br />
G<br />
A<br />
C<br />
G<br />
a<br />
a<br />
a<br />
T<br />
T<br />
T<br />
A<br />
5’<br />
T<br />
A<br />
t<br />
c-<br />
A<br />
A<br />
T<br />
A<br />
A<br />
t<br />
C<br />
G<br />
G<br />
C<br />
a<br />
G<br />
A<br />
TA<br />
G<br />
A<br />
A<br />
T<br />
G<br />
G<br />
g<br />
t<br />
G<br />
A<br />
T<br />
C<br />
t<br />
A<br />
c<br />
T<br />
CA<br />
t<br />
A<br />
T<br />
c<br />
G<br />
c<br />
a<br />
C<br />
G<br />
A<br />
G<br />
C<br />
5’<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
c c<br />
G G<br />
G<br />
3’<br />
G<br />
A<br />
3’<br />
c<br />
T<br />
T<br />
A<br />
g<br />
A<br />
t<br />
G<br />
T<br />
a<br />
TN<br />
a<br />
C<br />
C<br />
A<br />
G<br />
C<br />
G<br />
A<br />
a<br />
t<br />
A<br />
A<br />
A<br />
a<br />
A<br />
g<br />
-A<br />
A<br />
C<br />
t<br />
A<br />
C<br />
G<br />
A<br />
g<br />
C<br />
c<br />
a<br />
a<br />
a<br />
C<br />
A<br />
c<br />
A<br />
A<br />
C<br />
T<br />
T<br />
G<br />
T<br />
T<br />
A<br />
G<br />
C<br />
T<br />
T<br />
G<br />
G<br />
A<br />
A<br />
A<br />
-A<br />
A<br />
a<br />
A<br />
A<br />
g<br />
C<br />
g<br />
A<br />
c<br />
T<br />
c<br />
G<br />
A<br />
t<br />
G<br />
A<br />
G<br />
A<br />
A<br />
G<br />
A<br />
T<br />
A-<br />
A<br />
a<br />
a<br />
a<br />
c<br />
A<br />
a<br />
c<br />
G<br />
C<br />
A<br />
g<br />
A<br />
G<br />
A<br />
a<br />
t<br />
3’<br />
A<br />
A<br />
G<br />
c<br />
c<br />
t<br />
G<br />
A<br />
A<br />
g<br />
a<br />
G<br />
a<br />
g<br />
C<br />
A<br />
A<br />
A<br />
G<br />
a<br />
a<br />
a<br />
G<br />
G<br />
g<br />
a A<br />
T<br />
T<br />
T<br />
t<br />
t<br />
A<br />
A<br />
A<br />
A 3’<br />
A<br />
a<br />
a<br />
G<br />
G<br />
g<br />
G<br />
c<br />
5’<br />
G<br />
T<br />
G<br />
t<br />
T<br />
A-<br />
g<br />
3’<br />
t<br />
G<br />
g<br />
a<br />
A<br />
a-<br />
A<br />
A<br />
A-<br />
G<br />
C<br />
a<br />
5’<br />
C<br />
G<br />
a<br />
G<br />
G<br />
G<br />
A<br />
T<br />
T<br />
a<br />
g<br />
A-<br />
C<br />
A<br />
A<br />
a<br />
A<br />
G<br />
a<br />
c<br />
A<br />
C<br />
a<br />
a<br />
A<br />
G<br />
G<br />
T<br />
C<br />
c<br />
a<br />
g<br />
T<br />
A<br />
T<br />
C<br />
TA<br />
A<br />
t<br />
T<br />
A-At<br />
A<br />
A<br />
G<br />
c<br />
A<br />
A<br />
T<br />
A<br />
A<br />
T<br />
A<br />
g<br />
G<br />
g<br />
a<br />
-A<br />
C<br />
G<br />
c<br />
-A<br />
G<br />
A-<br />
g<br />
g<br />
A<br />
A<br />
C C<br />
T<br />
ta<br />
A<br />
C<br />
T<br />
A<br />
A<br />
C<br />
A<br />
A-<br />
G<br />
a<br />
a<br />
a<br />
T<br />
a<br />
T<br />
g<br />
A<br />
a<br />
T<br />
A<br />
A<br />
A<br />
t<br />
T<br />
-a<br />
A<br />
-a<br />
A<br />
C<br />
g<br />
a<br />
t<br />
G<br />
A<br />
T<br />
g<br />
A<br />
G<br />
T<br />
3’<br />
5’<br />
3’g<br />
A<br />
C<br />
G<br />
A<br />
A<br />
a<br />
a<br />
a<br />
A 3’<br />
G<br />
t<br />
A<br />
C<br />
G<br />
A<br />
C<br />
g<br />
C<br />
g<br />
g<br />
G<br />
a<br />
G<br />
c<br />
A<br />
A<br />
A<br />
G<br />
A<br />
c<br />
A<br />
3’A<br />
A<br />
t<br />
G<br />
g<br />
g<br />
T<br />
A<br />
A<br />
A<br />
g<br />
A<br />
a<br />
G<br />
G<br />
c<br />
A<br />
ac<br />
a<br />
a<br />
a<br />
A<br />
A<br />
a<br />
a<br />
A<br />
A<br />
A<br />
a<br />
A<br />
G<br />
c<br />
A<br />
g<br />
a<br />
g<br />
G<br />
A<br />
t<br />
A-<br />
G<br />
A-<br />
G<br />
a<br />
3’<br />
A a<br />
G<br />
C<br />
C<br />
3’<br />
A<br />
A<br />
3’A<br />
5’<br />
C<br />
3’<br />
G<br />
G<br />
t<br />
A<br />
A<br />
T<br />
A<br />
a<br />
a<br />
T<br />
a<br />
C<br />
a<br />
a<br />
C<br />
a<br />
G<br />
A<br />
a<br />
A<br />
A<br />
a<br />
T<br />
T<br />
T<br />
T<br />
a<br />
A<br />
A<br />
A A<br />
g<br />
T<br />
a<br />
G G<br />
G<br />
A<br />
g<br />
t<br />
G<br />
g<br />
C<br />
A<br />
A<br />
c<br />
a<br />
A<br />
A<br />
G G<br />
G<br />
A<br />
G<br />
T<br />
a<br />
T<br />
G<br />
T<br />
A<br />
g<br />
G<br />
G<br />
G<br />
A<br />
T<br />
T<br />
T<br />
C<br />
a<br />
g<br />
A<br />
T<br />
A<br />
t<br />
T<br />
G<br />
A<br />
t<br />
c<br />
A<br />
G<br />
c<br />
g<br />
A<br />
g<br />
t<br />
t T<br />
A<br />
-A<br />
C<br />
-a<br />
a<br />
t<br />
C<br />
t<br />
A<br />
A<br />
c<br />
T<br />
a<br />
A<br />
A<br />
g<br />
a<br />
5’G<br />
g<br />
t<br />
G<br />
G<br />
G<br />
a<br />
C<br />
a<br />
A<br />
a<br />
G<br />
C<br />
A<br />
A<br />
A<br />
c<br />
3’<br />
C<br />
C<br />
t<br />
g<br />
g<br />
g<br />
C<br />
a A<br />
C<br />
C<br />
T<br />
T<br />
A<br />
a<br />
A<br />
T<br />
A<br />
g<br />
a<br />
G<br />
A<br />
a<br />
G G<br />
a<br />
5’<br />
A<br />
G<br />
G<br />
a<br />
G<br />
T<br />
a<br />
A<br />
c<br />
a a<br />
C<br />
A<br />
5’<br />
A<br />
5’<br />
A<br />
a<br />
A<br />
A<br />
c<br />
A<br />
a<br />
C<br />
A<br />
g<br />
g<br />
g<br />
3’<br />
G<br />
G<br />
G<br />
G<br />
T<br />
A<br />
A<br />
G<br />
G<br />
A<br />
g<br />
g<br />
A<br />
A<br />
a<br />
A<br />
A<br />
A<br />
g<br />
C<br />
t<br />
a<br />
G<br />
A<br />
a<br />
a<br />
t<br />
T<br />
T<br />
5’<br />
t<br />
g<br />
t<br />
A<br />
C<br />
a<br />
A<br />
a<br />
A<br />
a<br />
3’<br />
A<br />
A<br />
A<br />
A A<br />
A<br />
3’<br />
A<br />
t<br />
t<br />
A<br />
A<br />
t<br />
T<br />
a<br />
G<br />
G<br />
A<br />
A<br />
G<br />
A<br />
A<br />
T<br />
G<br />
t<br />
T<br />
g<br />
a<br />
a A<br />
g<br />
G<br />
t<br />
G<br />
g<br />
G<br />
a<br />
G<br />
G<br />
a<br />
T<br />
A<br />
c<br />
G<br />
t<br />
T<br />
T<br />
G<br />
a<br />
G<br />
T<br />
t<br />
5’<br />
G<br />
G<br />
G<br />
G<br />
A<br />
c<br />
A<br />
t<br />
a<br />
A<br />
a<br />
3’<br />
a<br />
T<br />
C<br />
t<br />
A<br />
T<br />
g<br />
G<br />
T<br />
ca<br />
T<br />
G g<br />
g<br />
G<br />
ac<br />
T<br />
a<br />
t<br />
-A<br />
3’<br />
t<br />
a<br />
5’<br />
t<br />
A<br />
T<br />
5’<br />
G<br />
A<br />
G<br />
A<br />
t<br />
A<br />
g<br />
G<br />
-a<br />
a<br />
A<br />
g<br />
a<br />
C<br />
C<br />
A<br />
A<br />
t<br />
G<br />
t<br />
A A<br />
G<br />
C<br />
A<br />
G<br />
a<br />
C<br />
A<br />
A<br />
C<br />
C<br />
G<br />
t<br />
A<br />
a<br />
a<br />
a<br />
ga<br />
A<br />
G<br />
A<br />
A<br />
A<br />
a<br />
A<br />
A<br />
a<br />
g<br />
G<br />
a<br />
T<br />
A<br />
G<br />
G<br />
a<br />
A<br />
g<br />
t<br />
a<br />
A<br />
G<br />
T<br />
a<br />
A<br />
A<br />
C<br />
A<br />
t<br />
A<br />
A<br />
A<br />
A<br />
C<br />
c<br />
C<br />
A<br />
G<br />
C<br />
A<br />
G<br />
G G<br />
G<br />
G<br />
A<br />
g<br />
A<br />
G<br />
G<br />
G<br />
a<br />
C<br />
a<br />
A<br />
g<br />
C<br />
a<br />
A<br />
3’<br />
g<br />
T<br />
g<br />
A<br />
T<br />
A<br />
T<br />
A<br />
C<br />
T<br />
a<br />
A<br />
T<br />
A<br />
A<br />
A A<br />
A<br />
A<br />
A<br />
a<br />
G<br />
a<br />
T<br />
G<br />
g<br />
g<br />
t<br />
A<br />
T<br />
a<br />
G G<br />
A<br />
C<br />
G<br />
a<br />
A<br />
5’<br />
A<br />
-A<br />
T<br />
G<br />
G<br />
G G<br />
G<br />
c<br />
t<br />
A<br />
C<br />
T<br />
g<br />
G<br />
A<br />
a<br />
A<br />
G<br />
A<br />
t<br />
a<br />
A<br />
t<br />
C<br />
A<br />
C<br />
A<br />
T<br />
c<br />
T<br />
A<br />
g<br />
g<br />
t<br />
A<br />
g<br />
g<br />
g<br />
t<br />
T<br />
A A<br />
3’<br />
A<br />
g<br />
t<br />
G<br />
A<br />
g<br />
a<br />
G<br />
a<br />
A<br />
g G<br />
g<br />
a<br />
g<br />
A<br />
A<br />
A<br />
G<br />
A<br />
A<br />
g<br />
A<br />
A<br />
A<br />
A<br />
A<br />
ca<br />
A<br />
A<br />
C<br />
G<br />
g<br />
T<br />
G<br />
T<br />
A<br />
A<br />
G<br />
a<br />
A<br />
A<br />
C<br />
c<br />
T<br />
C<br />
t<br />
C<br />
A<br />
A<br />
A<br />
g<br />
a<br />
A<br />
a<br />
G<br />
T<br />
g<br />
g<br />
a<br />
A<br />
A<br />
t<br />
A<br />
a<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
C<br />
a<br />
a<br />
3’G<br />
c<br />
A<br />
g<br />
g<br />
g<br />
G<br />
A<br />
A<br />
G<br />
G<br />
3’<br />
t<br />
G<br />
a<br />
a<br />
A<br />
C<br />
a<br />
3’<br />
a<br />
G<br />
5’<br />
G<br />
g G<br />
c<br />
a<br />
T<br />
c<br />
T<br />
T<br />
T<br />
A<br />
a<br />
a-<br />
G<br />
A<br />
t<br />
G<br />
c<br />
G<br />
G<br />
a<br />
A<br />
A<br />
G<br />
A<br />
c<br />
G<br />
A<br />
AG<br />
T<br />
g<br />
-A<br />
G<br />
G<br />
a<br />
G<br />
T<br />
G<br />
A<br />
a<br />
G<br />
A<br />
c<br />
T<br />
G G<br />
a<br />
a<br />
c<br />
C<br />
G<br />
G<br />
T<br />
-a<br />
t<br />
G<br />
5’T<br />
T<br />
G<br />
T<br />
T<br />
a<br />
c C<br />
G<br />
g<br />
G<br />
g<br />
a<br />
C<br />
t<br />
G<br />
c<br />
ac<br />
3’<br />
c<br />
A<br />
G<br />
3’<br />
a<br />
A<br />
c<br />
T<br />
c<br />
t<br />
a<br />
a<br />
G<br />
G<br />
a<br />
A<br />
A<br />
a<br />
A<br />
3’<br />
T<br />
T<br />
T<br />
a<br />
ta<br />
C<br />
3’<br />
a<br />
c<br />
a<br />
a<br />
G<br />
C<br />
t<br />
A<br />
a<br />
A<br />
G<br />
G<br />
g<br />
A A<br />
A<br />
A<br />
A<br />
3’<br />
A<br />
a<br />
C<br />
c<br />
G<br />
A<br />
c<br />
a<br />
c<br />
g<br />
T<br />
t<br />
t<br />
G<br />
a<br />
a<br />
g<br />
ca<br />
5’<br />
A<br />
A<br />
a<br />
3’<br />
C<br />
A<br />
C<br />
c<br />
g<br />
T<br />
T<br />
T<br />
A<br />
A<br />
A<br />
-A<br />
G<br />
g<br />
-A<br />
A<br />
G<br />
t<br />
5’<br />
A<br />
A<br />
5’<br />
A<br />
T<br />
g<br />
g<br />
A<br />
a<br />
A 5’<br />
a<br />
C<br />
t<br />
a<br />
A<br />
a<br />
A<br />
G g<br />
A<br />
G<br />
a<br />
G<br />
A<br />
A<br />
g<br />
a<br />
A<br />
G<br />
3’<br />
G<br />
g<br />
T<br />
a<br />
T<br />
C<br />
G<br />
A<br />
T<br />
A<br />
A<br />
A<br />
a<br />
g<br />
G<br />
a<br />
3’<br />
t<br />
a<br />
T<br />
c<br />
T<br />
TAT<br />
a<br />
T<br />
C<br />
G<br />
c<br />
T<br />
A<br />
a<br />
a<br />
G<br />
a<br />
C<br />
A<br />
T<br />
A<br />
a<br />
g G<br />
A-<br />
G G<br />
A<br />
T<br />
t<br />
A<br />
T<br />
a<br />
a<br />
A<br />
a<br />
A<br />
a<br />
A<br />
a<br />
C<br />
a<br />
GC<br />
G<br />
A<br />
a<br />
C<br />
g<br />
A<br />
G G<br />
T<br />
T<br />
T<br />
A<br />
A<br />
T<br />
T<br />
T<br />
G<br />
A<br />
a<br />
t<br />
a<br />
c<br />
a a<br />
A<br />
A<br />
c<br />
a A<br />
T<br />
c<br />
G<br />
a<br />
a<br />
T<br />
T<br />
A<br />
a<br />
G<br />
G<br />
3’<br />
a<br />
a<br />
g<br />
g<br />
g<br />
A<br />
G<br />
A<br />
g<br />
A<br />
C<br />
c<br />
A<br />
t<br />
T<br />
A<br />
C<br />
C<br />
A<br />
A<br />
T<br />
C<br />
C<br />
A<br />
T<br />
g<br />
A<br />
C<br />
a<br />
A<br />
T<br />
G<br />
T<br />
g<br />
a<br />
A<br />
A a<br />
A<br />
A<br />
a<br />
C<br />
3’<br />
a<br />
C c<br />
a<br />
G<br />
G<br />
g<br />
t<br />
g<br />
g<br />
A<br />
a<br />
A<br />
a<br />
A<br />
g<br />
G<br />
A<br />
G<br />
T<br />
T<br />
c<br />
t<br />
a<br />
5’<br />
A<br />
a<br />
A<br />
a<br />
5’<br />
G<br />
A<br />
a<br />
G<br />
g<br />
A<br />
A<br />
T<br />
G<br />
g<br />
A<br />
A<br />
t<br />
a a<br />
A<br />
C<br />
t<br />
ACA<br />
A<br />
A<br />
A<br />
A<br />
c<br />
c<br />
A<br />
a<br />
A<br />
A<br />
a<br />
A<br />
T<br />
A<br />
g<br />
G<br />
G<br />
C<br />
a<br />
G<br />
T<br />
G<br />
g<br />
t<br />
G<br />
G<br />
5’<br />
G G<br />
A<br />
A<br />
C<br />
a<br />
A<br />
A<br />
C<br />
A<br />
t<br />
A<br />
C<br />
t<br />
C<br />
T<br />
G<br />
A a<br />
t<br />
A<br />
T<br />
a<br />
T<br />
a<br />
a<br />
A<br />
A a<br />
a<br />
a<br />
t<br />
G<br />
A<br />
A<br />
G<br />
A<br />
G<br />
C<br />
G<br />
a<br />
a<br />
a<br />
A<br />
A<br />
a<br />
a<br />
a<br />
T<br />
A<br />
C<br />
a<br />
g<br />
c<br />
g<br />
A<br />
c<br />
C<br />
a<br />
G<br />
-A<br />
A<br />
C<br />
A-<br />
G<br />
T<br />
T<br />
T<br />
g<br />
t<br />
A A<br />
G<br />
T<br />
T<br />
t<br />
A<br />
G<br />
G<br />
A<br />
A<br />
a<br />
A-<br />
T<br />
T<br />
a<br />
a<br />
a a<br />
c<br />
A<br />
T<br />
c<br />
A<br />
T<br />
A<br />
T<br />
3’<br />
A<br />
G<br />
A a<br />
G<br />
GN<br />
a<br />
A<br />
A<br />
5’<br />
G<br />
G<br />
T<br />
a<br />
A<br />
T<br />
G<br />
C<br />
a A<br />
C<br />
t<br />
G<br />
a<br />
T<br />
G<br />
A<br />
a<br />
G<br />
G<br />
G<br />
A<br />
A<br />
G<br />
G<br />
A<br />
A<br />
t<br />
G<br />
T<br />
A<br />
C<br />
a<br />
G<br />
A<br />
a<br />
A<br />
a<br />
G<br />
A<br />
C<br />
a<br />
g<br />
a<br />
G<br />
c<br />
g<br />
T<br />
A<br />
G<br />
5’<br />
G<br />
T<br />
t<br />
g<br />
g<br />
AC<br />
a a<br />
G<br />
G<br />
A<br />
G<br />
G<br />
a<br />
A<br />
A<br />
A<br />
G<br />
A<br />
5’<br />
A<br />
A<br />
A<br />
C<br />
G<br />
t<br />
a<br />
T<br />
C<br />
A<br />
G<br />
a<br />
a<br />
G<br />
A<br />
3’<br />
a<br />
A<br />
G<br />
T<br />
T<br />
G<br />
G<br />
G<br />
A<br />
G<br />
G<br />
TG<br />
a<br />
G<br />
G<br />
GN<br />
t<br />
a<br />
A<br />
A<br />
A<br />
C<br />
A<br />
A<br />
a<br />
a<br />
T<br />
A<br />
A<br />
A<br />
T<br />
T<br />
a<br />
C<br />
A<br />
t<br />
A<br />
C<br />
c<br />
C<br />
T<br />
A<br />
A<br />
A<br />
A<br />
A<br />
G<br />
A<br />
g<br />
T<br />
c<br />
A<br />
A<br />
G<br />
A<br />
G<br />
a<br />
A<br />
A<br />
G<br />
a<br />
C<br />
A-<br />
A<br />
c<br />
T<br />
A<br />
G g<br />
G<br />
A<br />
A<br />
A<br />
T<br />
5’<br />
A<br />
a<br />
t<br />
g<br />
A<br />
a<br />
A<br />
a<br />
G<br />
A<br />
A<br />
c<br />
g<br />
C<br />
a<br />
A<br />
-A<br />
A<br />
G<br />
T<br />
G<br />
5’<br />
c<br />
A<br />
T<br />
C<br />
5’<br />
g<br />
A<br />
A<br />
a<br />
G<br />
G<br />
a<br />
a<br />
g<br />
A<br />
G<br />
G<br />
G<br />
G<br />
a<br />
a<br />
A<br />
T<br />
a<br />
T<br />
a<br />
C<br />
C<br />
G<br />
A<br />
3’<br />
A<br />
g<br />
a<br />
a<br />
T<br />
T<br />
C<br />
G<br />
G<br />
T<br />
G<br />
a<br />
T<br />
G<br />
G<br />
t<br />
a<br />
c<br />
g<br />
c<br />
C<br />
5’<br />
C<br />
g<br />
A-<br />
G<br />
A A<br />
G<br />
A<br />
A A<br />
a<br />
t<br />
t<br />
G<br />
t<br />
A<br />
T<br />
G<br />
c<br />
A<br />
a<br />
C<br />
C<br />
C<br />
A<br />
G<br />
C<br />
C<br />
G<br />
a<br />
G<br />
T<br />
T<br />
A<br />
A<br />
A<br />
t<br />
g<br />
T<br />
t<br />
a<br />
T<br />
A<br />
g G<br />
g<br />
A<br />
g<br />
a<br />
t t<br />
a<br />
T<br />
A<br />
T<br />
A<br />
T<br />
t<br />
G<br />
T<br />
t<br />
G<br />
a<br />
3’<br />
G<br />
3’<br />
A<br />
A<br />
A<br />
a<br />
A<br />
a<br />
A<br />
G<br />
a<br />
A<br />
a<br />
a<br />
g<br />
T<br />
g<br />
G<br />
G<br />
C<br />
A<br />
TG<br />
C<br />
A A<br />
t T<br />
a<br />
C<br />
g<br />
t<br />
T<br />
A<br />
A<br />
G<br />
T<br />
A<br />
A<br />
T<br />
t<br />
a<br />
A<br />
G<br />
g<br />
A<br />
a<br />
A<br />
A<br />
G<br />
T<br />
A<br />
T<br />
A<br />
A A<br />
C<br />
A A<br />
a<br />
A<br />
A<br />
A<br />
G<br />
A<br />
g<br />
A<br />
A<br />
T<br />
g<br />
A<br />
A<br />
T<br />
G<br />
a<br />
A<br />
g<br />
A<br />
-A<br />
G<br />
A<br />
g<br />
g<br />
TA<br />
g<br />
C<br />
t t<br />
T<br />
A<br />
T<br />
T<br />
g<br />
g<br />
G<br />
A<br />
C<br />
c<br />
G<br />
T<br />
A<br />
T<br />
a<br />
C<br />
A<br />
a<br />
T<br />
C<br />
G<br />
G<br />
G<br />
G<br />
A<br />
a<br />
t<br />
C<br />
A<br />
C<br />
a<br />
G<br />
a<br />
c<br />
G<br />
G<br />
A<br />
t<br />
c<br />
c<br />
T<br />
a<br />
c<br />
C<br />
C<br />
G<br />
A<br />
A<br />
A<br />
A<br />
a<br />
A<br />
a<br />
A<br />
T<br />
A<br />
G<br />
a<br />
5’<br />
a<br />
A<br />
g<br />
T<br />
A<br />
A<br />
A<br />
A<br />
A<br />
a a<br />
A<br />
c<br />
C<br />
c<br />
A<br />
C<br />
C<br />
g<br />
c<br />
3’<br />
A<br />
G<br />
G<br />
t<br />
g<br />
C<br />
t<br />
G<br />
a<br />
G<br />
A<br />
A<br />
5’<br />
t<br />
g<br />
A<br />
c<br />
a<br />
gt<br />
A<br />
G<br />
T<br />
A<br />
T<br />
g<br />
T<br />
A<br />
T<br />
T<br />
A<br />
G<br />
t<br />
G<br />
A<br />
G<br />
t<br />
G<br />
g<br />
A<br />
A<br />
a<br />
g<br />
T<br />
A<br />
A<br />
A<br />
g<br />
G<br />
g<br />
c C<br />
A<br />
T<br />
A<br />
A<br />
A<br />
A<br />
A<br />
T<br />
a<br />
A-<br />
A<br />
TG<br />
a<br />
c<br />
a<br />
T<br />
t<br />
A<br />
t<br />
G<br />
g<br />
A<br />
A<br />
g<br />
A<br />
a<br />
T<br />
5’<br />
T<br />
C<br />
A-<br />
A<br />
A<br />
A<br />
A<br />
G<br />
A<br />
G<br />
t<br />
G<br />
a<br />
G<br />
T<br />
G<br />
a<br />
A<br />
T<br />
a<br />
g<br />
a<br />
G<br />
G<br />
A<br />
a<br />
a<br />
C<br />
G<br />
G<br />
t<br />
A A<br />
T<br />
-A<br />
G<br />
a A<br />
t T<br />
G<br />
A<br />
G<br />
g<br />
a<br />
A<br />
G<br />
g<br />
A<br />
A<br />
C<br />
A<br />
A<br />
a<br />
G<br />
3’<br />
G<br />
A<br />
C<br />
G<br />
t<br />
A<br />
A<br />
T<br />
A<br />
G<br />
c<br />
G<br />
C<br />
g<br />
G<br />
t<br />
C<br />
c<br />
A<br />
C<br />
a<br />
A<br />
C<br />
a<br />
t<br />
a<br />
C<br />
A<br />
t<br />
A<br />
3’<br />
G<br />
a<br />
A<br />
G<br />
T<br />
a<br />
G<br />
a<br />
A<br />
t<br />
A<br />
g<br />
a<br />
a<br />
A<br />
C<br />
A<br />
G<br />
a<br />
A<br />
5’g<br />
C<br />
G<br />
t<br />
G<br />
a<br />
G<br />
G<br />
a<br />
a<br />
A<br />
a<br />
-A<br />
A<br />
g<br />
c<br />
g<br />
G<br />
A<br />
A<br />
G<br />
g<br />
T<br />
G<br />
A<br />
T<br />
T<br />
A<br />
A<br />
G<br />
G<br />
A<br />
g<br />
A<br />
T<br />
G<br />
G<br />
A<br />
3’<br />
g<br />
A<br />
g<br />
A<br />
a<br />
3’<br />
C<br />
g<br />
a<br />
A<br />
c c<br />
C<br />
g<br />
A<br />
a<br />
A<br />
5’<br />
a<br />
A<br />
a<br />
A<br />
t<br />
A<br />
A<br />
A<br />
T<br />
A<br />
g g<br />
A<br />
T<br />
A<br />
A<br />
A<br />
G<br />
3’<br />
a-<br />
C<br />
T<br />
A<br />
A<br />
A<br />
A<br />
5’<br />
g<br />
G<br />
a<br />
g<br />
G<br />
A<br />
T<br />
G<br />
T<br />
G<br />
T<br />
A<br />
g<br />
A<br />
T<br />
A<br />
G<br />
T<br />
G<br />
a<br />
A<br />
A<br />
C<br />
a A<br />
C<br />
A<br />
A<br />
-A<br />
G<br />
A<br />
A<br />
t<br />
A-<br />
A<br />
A<br />
A<br />
t<br />
A<br />
-N<br />
A<br />
T<br />
5’<br />
T<br />
a<br />
A<br />
A<br />
g<br />
C<br />
t<br />
G<br />
3’<br />
g<br />
A-<br />
g<br />
A-<br />
A<br />
C<br />
a<br />
a<br />
A<br />
G<br />
A<br />
A<br />
G<br />
A<br />
A<br />
A<br />
C<br />
3’<br />
3’<br />
G<br />
G<br />
g<br />
T<br />
G<br />
G<br />
A<br />
t<br />
5’<br />
c<br />
T<br />
c<br />
A<br />
G<br />
C<br />
A<br />
a<br />
A<br />
C<br />
T<br />
A A<br />
a<br />
A<br />
a<br />
a<br />
C<br />
A<br />
A<br />
A<br />
g<br />
G<br />
3’<br />
g<br />
C<br />
g<br />
a<br />
T<br />
c<br />
c<br />
A<br />
A<br />
A<br />
C<br />
A<br />
C<br />
C<br />
G<br />
G<br />
G<br />
C<br />
C<br />
GT<br />
A<br />
t<br />
t<br />
T<br />
G<br />
T<br />
g<br />
a<br />
g<br />
A A<br />
A<br />
g<br />
A<br />
G<br />
5’<br />
C<br />
A<br />
A<br />
T t<br />
A<br />
5’<br />
A a<br />
t<br />
T<br />
t<br />
a a<br />
A<br />
t<br />
g<br />
g<br />
A<br />
G<br />
ac<br />
C<br />
A<br />
A<br />
T<br />
G<br />
A<br />
G<br />
g<br />
A<br />
C<br />
A<br />
A<br />
T<br />
c<br />
G<br />
c<br />
5’<br />
c<br />
G<br />
A<br />
A<br />
A<br />
A a<br />
A<br />
A<br />
A<br />
a<br />
A<br />
A<br />
A<br />
A<br />
a<br />
A<br />
g<br />
ac<br />
A<br />
G<br />
A<br />
G<br />
A<br />
G<br />
G<br />
G<br />
a<br />
G<br />
A<br />
T<br />
c<br />
c<br />
A A<br />
A<br />
TN<br />
G<br />
T<br />
G<br />
G<br />
G<br />
G<br />
G<br />
T<br />
5’<br />
G<br />
C<br />
A<br />
A<br />
5’<br />
G<br />
g<br />
A<br />
T<br />
G<br />
G<br />
A<br />
G<br />
A<br />
A<br />
c<br />
A<br />
A<br />
A-<br />
A<br />
G<br />
A<br />
G<br />
A<br />
3’<br />
G<br />
A<br />
G<br />
G<br />
a<br />
a<br />
a<br />
A a<br />
T<br />
G<br />
A<br />
A<br />
T<br />
A<br />
a-<br />
a<br />
A<br />
a<br />
g<br />
A<br />
C<br />
C<br />
a<br />
a<br />
T<br />
C<br />
T<br />
a<br />
C<br />
T<br />
G<br />
C<br />
G<br />
G<br />
T<br />
A<br />
a<br />
C<br />
c<br />
A<br />
C<br />
a<br />
A<br />
g<br />
g<br />
G<br />
G<br />
G<br />
G<br />
A<br />
A<br />
C<br />
c<br />
G<br />
a<br />
T<br />
A<br />
T<br />
g<br />
a<br />
A<br />
c<br />
A<br />
C<br />
C<br />
A<br />
5’<br />
C<br />
A<br />
C<br />
T<br />
A<br />
a<br />
A<br />
C<br />
A<br />
C<br />
a<br />
T<br />
T<br />
A<br />
G<br />
g<br />
g<br />
T<br />
A<br />
T<br />
G<br />
G<br />
g<br />
G<br />
a<br />
a<br />
a<br />
a<br />
a<br />
G<br />
-A<br />
a<br />
-A<br />
T<br />
a<br />
t<br />
a<br />
G<br />
T<br />
G<br />
T<br />
A<br />
a<br />
t<br />
T<br />
A<br />
3’<br />
a<br />
a<br />
g<br />
A<br />
G<br />
G<br />
G<br />
T<br />
G<br />
T<br />
G<br />
A<br />
G<br />
A<br />
C<br />
c<br />
A<br />
c<br />
C<br />
C<br />
T<br />
A<br />
C<br />
c<br />
A<br />
t<br />
A<br />
A<br />
a<br />
a<br />
A<br />
A<br />
a<br />
A A<br />
A<br />
A<br />
A<br />
A<br />
-A<br />
A<br />
G<br />
a<br />
G<br />
A<br />
A<br />
A<br />
A<br />
t<br />
T<br />
a<br />
T<br />
t<br />
T<br />
A<br />
C<br />
T<br />
c<br />
G<br />
G<br />
A<br />
A<br />
g<br />
A<br />
a<br />
A<br />
g<br />
A<br />
T<br />
t<br />
G<br />
G<br />
t<br />
A A<br />
C<br />
C C<br />
T<br />
C<br />
A<br />
g<br />
A a<br />
g<br />
A<br />
A<br />
a-<br />
5’<br />
t<br />
A<br />
A A<br />
A<br />
-A<br />
5’g<br />
3’<br />
g<br />
T<br />
A<br />
A<br />
3’<br />
A<br />
G<br />
c<br />
T<br />
g<br />
g<br />
g<br />
c<br />
ca<br />
C<br />
C<br />
a<br />
g<br />
3’<br />
g<br />
g<br />
g<br />
C<br />
t<br />
G<br />
A<br />
A<br />
g<br />
g<br />
A<br />
A<br />
A<br />
a<br />
A<br />
T<br />
A<br />
g<br />
A<br />
c<br />
A<br />
t<br />
G<br />
A<br />
3’<br />
A<br />
G<br />
G<br />
A<br />
G<br />
A<br />
3’<br />
g<br />
C<br />
A<br />
A<br />
C<br />
T<br />
t<br />
A A<br />
C<br />
a<br />
G<br />
a<br />
A<br />
A<br />
C<br />
A<br />
G<br />
C<br />
a<br />
3’<br />
G<br />
T<br />
A<br />
c<br />
G<br />
g<br />
G<br />
c<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
000002<br />
#?chr<br />
----------|-<br />
----------|-<br />
0001018414|C<br />
0001018391|C<br />
0001018411|T<br />
0001018408|A<br />
0001018370|A<br />
0001018372|C<br />
0001018365|G<br />
0001018403|A<br />
0001018394|A<br />
0001018388|A<br />
0001018417|T<br />
0001018378|A<br />
0001018371|A<br />
0001018367|G<br />
0001018422|A<br />
0001018369|A<br />
0001018379|A<br />
0001018430|A<br />
0001018380|G<br />
0001018397|A<br />
0001018395|A<br />
0001018431|G<br />
0001018384|G<br />
0001018373|G<br />
0001018366|A<br />
0001018386|G<br />
0001018416|T<br />
0001018361|G<br />
0001018399|G<br />
0001018387|T<br />
0001018383|G<br />
0001018401|A<br />
0001018363|A<br />
0001018368|G<br />
0001018402|A<br />
0001018421|G<br />
0001018432|T<br />
0001018404|C<br />
0001018377|G<br />
0001018396|A<br />
0001018359|G<br />
0001018362|A<br />
0001018376|A<br />
0001018358|G<br />
0001018428|C<br />
0001018435|C<br />
0001018375|G<br />
0001018434|A<br />
0001018418|A<br />
0001018405|A<br />
0001018374|T<br />
0001018415|T<br />
0001018420|G<br />
0001018423|G<br />
0001018412|G<br />
0001018429|A<br />
0001018410|T<br />
0001018409|C<br />
0001018381|A<br />
0001018389|A<br />
0001018407|A<br />
0001018398|T<br />
0001018393|A<br />
0001018425|G<br />
0001018433|A<br />
0001018390|A<br />
0001018436|T<br />
0001018406|C<br />
0001018424|G<br />
0001018427|C<br />
0001018392|A<br />
0001018385|A<br />
0001018364|A<br />
0001018382|T<br />
0001018419|T<br />
0001018426|C<br />
0001018413|A<br />
0001018400|A<br />
0001018360|T<br />
pos|refbase<br />
alignments<br />
Figure 2.8: Vertical Scrolling Text Mode Alignment View of SHORE mapdisplay
2.8. VISUALIZATION OF SEQUENCING READ AND ALIGNMENT DATA 55<br />
2:1018241 2:1018261 2:1018281 2:1018301 2:1018321 2:1018341 2:1018361 2:1018381 2:1018401 2:1018421 2:1018441 2:1018461<br />
C T C A C A A G A G C T G T T T T A T G T C C T A A A T G A A A C A A T T T A T G G A T T A T A T T T T T T T C T C G T C A T G A T T T T A T G G A G T T T A T T G A T T T G G T A A A T A A T A A G T A A C T T T T T T T G G T A A A A T G G T G A A A G A G G A A A C G T G A G A A G A T G G A G T A A A C A A A A A A T G A A A A C A C A A C T T G A C T T T A T G G A G G G C C C A A G T A A C T T T G A A A A C G G G C C A A A G A G A C A A T A C A T C C T T A A T A<br />
C<br />
G<br />
C<br />
A<br />
C<br />
A<br />
-<br />
C<br />
A<br />
C<br />
T<br />
C<br />
A<br />
-<br />
A<br />
A<br />
-<br />
C<br />
A<br />
A<br />
A<br />
-<br />
N<br />
C<br />
A<br />
-<br />
C<br />
A<br />
C<br />
A<br />
A<br />
-<br />
T<br />
C<br />
A<br />
G<br />
C<br />
A<br />
C<br />
A<br />
C<br />
A<br />
-<br />
A<br />
C<br />
C<br />
C<br />
C<br />
A C G<br />
C<br />
A<br />
-<br />
C<br />
C<br />
A<br />
G<br />
C<br />
A C<br />
C<br />
N G<br />
N N<br />
G<br />
G<br />
N<br />
T<br />
T<br />
A<br />
A<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
C<br />
T<br />
C<br />
T<br />
T<br />
T<br />
T<br />
T<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
A<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
A<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
-<br />
C<br />
-<br />
A<br />
C<br />
C<br />
C<br />
A<br />
A<br />
A A<br />
T<br />
A<br />
Figure 2.9: Vector Graphics Alignment View of SHORE mapdisplay<br />
2.8.2 Visualization of Read Mapping <strong>Data</strong><br />
By utilizing SHORE's generic indexed query mechanisms (section 2.2.3), the SHORE mapdisplay<br />
program is able to rapidly retrieve all read mappings associated with a user specied genomic<br />
region. By default, the program generates a comprehensible colored ASCII/ANSI text representation<br />
of the read alignments, which is presented as a scrollable terminal view using the Unix le<br />
pager utility less. For technical reasons, read mappings are displayed in a vertical layout (gure<br />
2.8).<br />
Insertions are represented as additional positions in the alignment view. <strong>An</strong>alogous to SHORE<br />
alignment string format (section 2.2.4), mismatching positions in the alignment are displayed<br />
with the reference base on the left <strong>and</strong> the query base on the right side. When a more compact<br />
view is required, display of the reference nucleotide is omitted. To allow distinguishing between<br />
reads mapping to the forward or to the reverse str<strong>and</strong> of the reference sequence, reverse str<strong>and</strong><br />
associated alignments are printed in lower case <strong>and</strong> forward str<strong>and</strong> sequences in upper case letters,<br />
<strong>and</strong> 3 ′ <strong>and</strong> 5 ′ ends are indicated by corresponding colored end markers. To allow identication of<br />
individual reads in the alignment view, inclusion of further information such as read identiers<br />
<strong>and</strong> paired-end sequencing information may be requested on program startup.<br />
The text mode mapping viewer provides a rapid, detailed view of read mapping data without<br />
interruption of workows. It is however less suited to obtaining a medium- to larger-scale<br />
overview of complex genomic regions. Therefore, we additionally provide the possibility of generating<br />
an alternative alignment view as an SVG image (gure 2.9).<br />
SVG les generated are easily <strong>and</strong> eciently rendered in a web browser. In contrast to<br />
the text view, read sequences are omitted in the SVG view except for mismatching alignment<br />
positions to enable improved rendering performance. Mismatching alignment positions as well<br />
as insertions <strong>and</strong> deletions are color coded <strong>and</strong> their sequence included for direct examination at<br />
close zoom levels. Directionality of read mappings is indicated both by shape <strong>and</strong> color of the<br />
corresponding horizontal bars. As layout locations for the read mappings are generally assigned<br />
by a greedy algorithm that picks the lowest possible y coordinate immediately when an alignment
56 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Depth of Coverage [# Reads]<br />
0 5 10 15 20 25<br />
Median Depth of Coverage<br />
Depth Inner Quartiles<br />
Average Depth ±SD<br />
2:1 2:10000001 3:200001 3:10200001 3:20200001<br />
Position [bp]<br />
Figure 2.10: Segment-Wise Genome-Wide Box Plot<br />
is encountered, depth of coverage can be hard to judge from the read alignments alone. Therefore,<br />
the SHORE mapdisplay program generates an overlay of depth of coverage, represented as vertical<br />
bars, <strong>and</strong> the horizontal bars of the read mappings. In addition, the SVG view allows retrieval<br />
of exact depth of coverage information for each position <strong>and</strong> detailed alignment information for<br />
each of the reads displayed in the form of mouse activated tool tips.<br />
2.8.3 Visualization of Local or Genome-Scale Depth of Coverage<br />
SHORE coverage is a multi-purpose tool for depth of coverage calculation <strong>and</strong> depth based segmentation<br />
(section 2.7). The complementary SHORE count utility gathers a wide variety of segmentrelated<br />
statistics. Like most SHORE modules, both programs are capable of rapidly retrieving<br />
<strong>and</strong> processing alignment information associated with dened regions of the genome.<br />
SHORE coverage can be utilized to obtain visual representations of local depth of coverage (gure<br />
2.6) or of kmer-phasing for small RNA data analysis (not shown). Per-base rendering of depth<br />
of coverage is however inappropriate for examination of larger scale distribution of sequencing<br />
depth. Visualization capability for such types of analysis is integrated with the SHORE count<br />
program. The software collects segment associated statistics for either user provided or xed size<br />
jumping window segments. The segment data collected may be incorporated into a continuous<br />
genome- or chromosome-wide box plot (gure 2.10).<br />
In the representation, the median line connects the median depth of coverage value for each<br />
of the segments provided, <strong>and</strong> is enclosed by boxes indicating second <strong>and</strong> third quartiles as well<br />
as the single st<strong>and</strong>ard deviation range surrounding the average depth of coverage of the segment.<br />
Red vertical bars indicate chromosome boundaries. As read depths mostly represent integer<br />
values, median <strong>and</strong> quartile values are adjusted to compensate quantization (section 2.8.4).
2.8. VISUALIZATION OF SEQUENCING READ AND ALIGNMENT DATA 57<br />
Quality Score<br />
16 24 32 40<br />
Depth of Coverage [# Reads]<br />
0 5 10<br />
Median Base Quality<br />
Inner Quality Quartiles<br />
1 11 21 31 41 51 61 71 81 1 11 21 31 41 51 61 71 81<br />
Cycle [bp]<br />
Median Depth of Coverage<br />
Depth Inner Quartiles<br />
3:9250001 3:9250001<br />
Position [bp]<br />
Figure 2.11: Quantile Interpolation Applied to Base Qualities <strong>and</strong> to Depth of Coverage<br />
2.8.4 Quantile Correction<br />
As in the case of base quality data or positional depth of coverage, quantile values of integer<br />
data distributed over a narrow range only coarsely represent the distribution of interest due to<br />
the small number of dierent values that such quantiles can assume (gure 2.11, left side).<br />
Given that the integer data in fact represent real-valued entities that have at some point<br />
been rounded, then it can be assumed that the original value of an integer s follows an unknown<br />
distribution supported on the interval [s − 0.5, s + 0.5). For an integer data set of size N, this<br />
implies for an arbitrary integer j-th c-quantile q i , that the expected value for the quantile of<br />
the original, non-rounded data q r data equals the minimal value to be considered for q r plus the<br />
k-th order statistic for n values distributed according to the unknown distribution. Here, n is<br />
the number of values in the integer data set of equal value to q i , <strong>and</strong> k = ⌈(j · N)/c⌉ − u, where<br />
u denotes the number of integer data with smaller value than the quantile q i .<br />
Given n i. i. d. samples X 1 , . . . , X n , then their k-th order statistic is a r<strong>and</strong>om variable W (n,k)<br />
with probability density function<br />
( ) n − 1<br />
f W(n,k) (x) = n · · f X (x) · F X (x) k−1 · (1 − F X (x)) n−k (2.7)<br />
k − 1<br />
Assuming X is distributed uniformly over the unit interval, then the k-th order statistic follows
58 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
a beta distribution, W (n,k) ∼ Beta(k, n − k + 1), with expected value <strong>and</strong> variance<br />
E(W (n,k) ) =<br />
k<br />
n + 1<br />
k<br />
Var(W (n,k) ) =<br />
(n + 1) · (n + 2) ·<br />
(<br />
1 − k )<br />
n + 1<br />
The assumption of rounded data is reasonable for base quality values <strong>and</strong> arguable also for<br />
observed read count data. Therefore, it can be considered valid to estimate corrected quantile<br />
values as<br />
ˆq r = q i − 0.5 +<br />
k<br />
n + 1<br />
(2.8)<br />
(2.9)<br />
(2.10)<br />
While the assumption of uniformity for the distribution of the original data to be considered<br />
may be questionable with special emphasis for read depth data with value zero, the adjusted<br />
quantiles still provide a considerable improvement for appraisal of the distribution compared<br />
to their integer counterparts. Equation 2.7 might be exploited where a better approximation<br />
of the actual distribution is available. For simplicity of implementation, we do not calculate<br />
the exact values of the ˆq r , but employ a probabilistic algorithm that adds a r<strong>and</strong>om real value<br />
uniformly sampled from the interval [−0.5, 0.5) to each integer value prior to quantile calculation<br />
(gure 2.11, right side).
3 A C ++ Framework for High-Throughput DNA<br />
Sequencing<br />
In a large, multi-purpose data analysis software solution, there are inevitably many<br />
recurring processing tasks <strong>and</strong> algorithmic challenges. To avoid large amounts of<br />
complex, redundant <strong>and</strong> error-prone code, the common subproblems must be isolated<br />
as reusable modules. With the SHORE DNA sequencing data analysis suite, we have<br />
striven to implement frequently required segments of code using consistent design patterns<br />
<strong>and</strong> interfaces. The resulting C++ programming framework <strong>lib</strong>shore constitutes<br />
the subject of this chapter.<br />
3.1 Overview<br />
High-throughput sequencing has become adopted as a versatile tool applicable to a wide range of<br />
dierent scientic questions (section 1.3). While data analysis for dierent types of application<br />
must be approached from unique angles (section 1.5), in their elementary building blocks methods<br />
<strong>and</strong> algorithms often share considerable overlap on many levels.<br />
Reading, parsing <strong>and</strong> formatted output of sequencing data in st<strong>and</strong>ard storage formats is a<br />
near universal requirement. Additionally, subsetting data sets by dened properties, achievable<br />
through dierent combinations of simple accept-reject ltering operations, is an often powerful<br />
tool for adjustment of an analysis' sensitivity-specicity tradeo. More complex operations rely<br />
on the relationship between multiple data set elements, like e. g. removal of PCR duplicate sequences<br />
from read alignment data, or alter certain properties of the data set elements themselves,<br />
<strong>and</strong> are therefore sensitive to, <strong>and</strong> potentially useful not only in various combinations, but also<br />
orders of application. Finally, entire analysis processes can be modeled as modules or series of<br />
modules transforming the type of the data, for example read alignments into depth of coverage<br />
or read alignments into positional pileups into variant calls.<br />
While isolation <strong>and</strong> encapsulation of subproblems facilitates correct implementation <strong>and</strong> comprehensibility<br />
of each individual step, the logic required to tie components into end-to-end processing<br />
pipelines can itself become extensive. For versatile modes of application, modularized<br />
components must therefore be embedded into a more generic framework capable of providing the<br />
glue between input, output <strong>and</strong> data ltering, manipulation <strong>and</strong> transformation.<br />
The <strong>lib</strong>shore C++ framework code is loosely categorized into ten dierent packages including<br />
application framework functionality, generic data processing infrastructure <strong>and</strong> sequencing specic<br />
processing modules <strong>and</strong> data structures. With gure 3.1 we present a package overview in<br />
a simplied UML-like representation, illustrating exemplary classes <strong>and</strong> associations.<br />
The base package comprises elementary low level functionality <strong>and</strong> utilities as well as machine<br />
or system-dependent blocks of code. While its functionality is required for most other packages,<br />
it is self-contained <strong>and</strong> does not by itself depend on other parts of the <strong>lib</strong>rary.<br />
The class program from the package of the same name constitutes the base class of all SHORE<br />
comm<strong>and</strong> line utilities. The base class combines comm<strong>and</strong> line interface denition, documen-<br />
61
62 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
base<br />
program<br />
stream<br />
av_parser<br />
add_option(spec,var,desc)<br />
operator()(argc,argv)<br />
uncaught_h<strong>and</strong>ler<br />
operator()(func)<br />
istreams<br />
xz_istream<br />
gzx_istream<br />
0..*<br />
std::istream<br />
program<br />
add_option(specication,variable,description)<br />
operator()(argc,argv)<br />
algo<br />
<strong>Data</strong>Iterator<br />
FunctionArrayIterator<br />
<strong>Data</strong>Iterator<br />
sux_array FunctionArrayIterator<br />
CmpX<br />
CmpY<br />
d2tree CmpXY<br />
cmp_x: CmpX<br />
cmp_y: CmpY<br />
cmp_xy : CmpXY<br />
dp_aligner_pipe<br />
fmtio<br />
Pipe<br />
Reader<br />
sam_reader<br />
Reader<br />
bam_reader<br />
Reader<br />
fastq_reader<br />
Reader<br />
s_reader<br />
sux_query<br />
line_sorter<br />
sux_index<br />
twodex<br />
Cmp<br />
sort_les(les)<br />
merge_les(les)<br />
is_sorted(le)<br />
upper_bound(le, line)<br />
Reader<br />
alignment_reader<br />
Reader<br />
maplist_reader<br />
Reader<br />
read_reader<br />
Reader<br />
atread_reader<br />
container<br />
intpack<br />
statistics<br />
Distribution<br />
binomial<br />
Distribution<br />
poisson<br />
mtc<br />
ostreams<br />
parallel<br />
thread<br />
1..*<br />
parallelizer<br />
sync<br />
datatype<br />
processing<br />
T<br />
T<br />
Pipe<br />
xz_ostream<br />
gzx_ostream<br />
alignment_lter_pipe<br />
alignment<br />
Pipe<br />
pipe_facade<br />
Writer<br />
Source<br />
feed<br />
Sink<br />
Reader<br />
extractor<br />
T<br />
T<br />
Derived<br />
read<br />
serial<br />
desync<br />
Source/Pipe/Sink<br />
Pipe/Sink<br />
parallel<br />
Pipe<br />
pipe<br />
0..*<br />
T<br />
alignment_tokenizer<br />
Source<br />
source<br />
Writer-Reader<br />
Writer<br />
Reader<br />
pipe_box<br />
Sink<br />
sink<br />
std::ostream<br />
nucleotide<br />
sequence_record<br />
Reader<br />
Pipe<br />
Writer<br />
Writer<br />
atread_writer<br />
Writer<br />
maplist_writer<br />
Writer<br />
sam_writer<br />
Reader<br />
fasta_reader<br />
Basic_Reader<br />
Reader<br />
T<br />
monolithic buer_chain<br />
0..*<br />
plugin<br />
T<br />
Figure 3.1: <strong>lib</strong>shore Overview
3.1. OVERVIEW 63<br />
tation <strong>and</strong> comm<strong>and</strong> line parsing through an auxiliary class av_parser <strong>and</strong> incorporates an<br />
object of class uncaught_h<strong>and</strong>ler to provide h<strong>and</strong>ling <strong>and</strong> reporting of fatal error conditions<br />
encountered by the application.<br />
Classes istreams <strong>and</strong> ostreams of the stream package simplify h<strong>and</strong>ling of les or other input<br />
or output destinations. Interpretation of special location strings or prexes facilitates specication<br />
of special les or other sources or destinations of input <strong>and</strong> output as option arguments or<br />
option default values, like e. g. st<strong>and</strong>ard input, output or Unix pipes. Input or output locations<br />
compressed in GZIP, BAM or XZ format are delegated to appropriate helper stream classes for<br />
automatic decoding, encoding or h<strong>and</strong>ling of r<strong>and</strong>om access requests (section 2.2.2). Both classes<br />
in addition provide input/output error checking <strong>and</strong> further stream-related functionality. Class<br />
ostreams furthermore implements simplied h<strong>and</strong>ling of temporary le destinations as well as<br />
optional temporary le compression for improved h<strong>and</strong>ling of large amounts of temporary data.<br />
Package algo comprises of dynamic programming alignment (section 2.5.2) as well as indexing,<br />
sorting <strong>and</strong> query algorithms <strong>and</strong> associated persistent data structure implementations<br />
(section 2.2.3). The platform for range indexing <strong>and</strong> queries is provided by a generic 2-d tree algorithm<br />
template d2tree, which is further utilized by the class twodex that supplies a persistent<br />
index data structure congurable for a variety of genomic data formats. Multiple algorithm templates<br />
with the prex suffix_ implement generic sux array construction <strong>and</strong> queries [144, 145].<br />
The sux array algorithms are employed by objects of class suffix_index, which provides persistent<br />
disk indexes for genomic sequences congurable <strong>and</strong> capable of answering various types<br />
of sequence match queries. To support up to 64 bit data at reasonable space eciency, persistent<br />
suffix_index <strong>and</strong> twodex index data are encoded <strong>and</strong> decoded non-byte aligned with respective<br />
exact required bit width utilizing class intpack of the container package.<br />
The <strong>lib</strong>shore data processing infrastructure is dened <strong>and</strong> implemented as a set of templates<br />
forming the processing package (section 3.2). Package parallel extends on this infrastructure<br />
to build an abstracted parallel processing interface (section 3.3).<br />
Despite growing acceptance of data exchange specications such as SAM/BAM for short read<br />
mapping data, data format issues can present a major impediment to application of algorithms<br />
to input data from diverse sources. Therefore, the <strong>lib</strong>rary includes support for a broad variety<br />
of sequencing read, mapping <strong>and</strong> further sequencing <strong>and</strong> DNA-related input <strong>and</strong> output le formats<br />
provided through the fmtio package. Format support is implemented as Reader <strong>and</strong> Writer<br />
classes with a consistent interface modeled after the requirements of the data processing infrastructure.<br />
To simplify working with various formats, for sequencing read <strong>and</strong> alignment data the<br />
<strong>lib</strong>rary provides multi-format readers read_reader <strong>and</strong> alignment_reader which automatically<br />
incorporate the adequate Reader objects for each respective input le.<br />
File format parsing <strong>and</strong> decoding requires adequate in-memory representation for the respective<br />
elements of the various types of data set, which are dened in the datatype package. Plain<br />
data structures read <strong>and</strong> alignment provide the st<strong>and</strong>ard representation of short sequencing<br />
read data, whereas large genomic sequences are stored as sequence_record objects with more<br />
sophisticated internal memory management. Additionally, processing modules <strong>and</strong> utilities specic<br />
to a certain type of data set element or certain properties of data set elements are also<br />
grouped under this package. For example, a class alignment_filter_pipe combines application<br />
of a variety of frequently required ltering <strong>and</strong> editing modules to read mapping data. Parsing<br />
<strong>and</strong> decoding the SHORE alignment string pair-wise alignment representation (section 2.2.4) into<br />
CIGAR string-equivalent tokens is provided as a class alignment_tokenizer. Class nucleotide<br />
provides decoding, encoding <strong>and</strong> ecient manipulation of IUPAC encoded DNA bases.<br />
The additional statistics package supplies the basis for statistical hypothesis tests based on<br />
a variety of distributions. The complementary class mtc re-implements various multiple hypothesis<br />
test correction procedures in C++ following the R multtest package implementation [146].
64 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
Supported procedures include false discovery rate (FDR) calculation following Benjamini <strong>and</strong><br />
Hochberg [143] as well as Benjamini <strong>and</strong> Yekutieli [147] <strong>and</strong> further family wise error rate<br />
(FWER) control via Bonferroni [148], Hochberg [149], Holm [150] as well as Sidak [151] correction<br />
algorithms.<br />
3.2 A Modular Signal-Slot Processing Framework<br />
3.2.1 Reader <strong>and</strong> Writer Concepts<br />
The requirement to support large numbers of dierent input le formats emphasizes the advantage<br />
of consistent interfaces to provide access to the data. The <strong>lib</strong>shore framework therefore<br />
denes the concept of a Reader as a class providing three methods named has_data, current<br />
<strong>and</strong> next which allow linear iteration over the data set's elements.<br />
The has_data method returns a boolean value indicating whether further data set elements<br />
are available to the respective Reader object. The method current provides access to the element<br />
of the data set at the current position of the iteration, <strong>and</strong> next indicates the current element<br />
has been processed <strong>and</strong> may be discarded. Initialization of the following element for retrieval by<br />
current may be h<strong>and</strong>led by either of the next or has_data methods.<br />
Conversely, we dene the concept of a Writer as a class supporting methods append <strong>and</strong><br />
flush. The method append is used to pass the object a data set element for processing, whereas<br />
flush serves as a notication that no more data are available for processing, triggering nalizing<br />
actions such as addition of footer sequences for compressed data sets.<br />
Intermediary processing <strong>and</strong> ltering modules can thus be realized as classes conforming to<br />
both the Writer <strong>and</strong> Reader concepts. Method flush in this case adopts the role of nishing data<br />
processing <strong>and</strong> propagating all remaining data to the object's Reader interface. For example, in a<br />
sliding window analysis output of data will be delayed until enough input data become available<br />
to cover the entire width of the window. On receiving a ush request, processing of remaining<br />
elements must however be completed without waiting for further data associated with the sliding<br />
window.<br />
3.2.2 Denition of Processing Network Topology<br />
With the Reader <strong>and</strong> Writer concepts, propagation of data through a network of processing<br />
modules might be accomplished through a series of nested processing loops (listing 3.1). The<br />
example illustrates propagation of data read from a single source of input through two cascaded<br />
ltering operations. Processing results are to be written to the application's st<strong>and</strong>ard output<br />
stream, whereas intermediate results following the rst step of ltering are additionally recorded<br />
as a separate output le. The rst set of loops of lines 827 constitute the actual propagation of<br />
input data. Read mapping data provided by the reader object are appended to the rst ltering<br />
module (line 10). As a result, the lter object may or may not produce an arbitrary amount of<br />
data (e. g. due to sliding window based removal of PCR duplicated sequences), which are passed<br />
to the intermediate le writer as well as the second lter object (lines 1224). <strong>Data</strong> produced by<br />
the second lter object must be transmitted to the nal output accordingly through a further<br />
nested loop (lines 1721). Additional sets of nested loops correspond to propagation of remaining<br />
data following appropriately stacked ush requests to each of the modules (lines 3056).<br />
The demonstrated mode of explicit data propagation is verbose <strong>and</strong> error prone. Our programming<br />
framework therefore implements a set of class templates providing a signal-slot pipeline<br />
architecture to allow more concise <strong>and</strong> comprehensible representations. A signal constitutes a<br />
function object that may be connected to an arbitrary number of function objects supporting
3.2. A MODULAR SIGNAL-SLOT PROCESSING FRAMEWORK 65<br />
1 alignment_reader r("in.bam");<br />
alignment_filter f1(filter1_config);<br />
alignment_filter f2(filter2_config);<br />
maplist_writer w1("filtered1.maplist");<br />
5 maplist_writer w2("stdout");<br />
// Process data.<br />
while(r.has_data())<br />
{<br />
10 f1.append(r.current());<br />
while(f1.has_data())<br />
{<br />
w1.append(f1.current());<br />
15 f2.append(f1.current());<br />
while(f2.has_data())<br />
{<br />
w2.append(f2.current());<br />
20 f2.next();<br />
}<br />
25<br />
}<br />
}<br />
r.next();<br />
f1.next();<br />
// Finish.<br />
30 f1.flush();<br />
while(f1.has_data())<br />
{<br />
w1.append(f1.current());<br />
35 f2.append(f1.current());<br />
while(f2.has_data())<br />
{<br />
w2.append(f2.current());<br />
40 f2.next();<br />
}<br />
45<br />
}<br />
w1.flush();<br />
f1.next();<br />
f2.flush();<br />
50 while(f2.has_data())<br />
{<br />
w2.append(f2.current());<br />
f2.next();<br />
}<br />
55<br />
w2.flush();<br />
Listing 3.1: Using Processing Modules with Nested Loops
66 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
Source<br />
data<br />
Pipe<br />
data<br />
Sink<br />
ush<br />
ush<br />
freeze<br />
freeze<br />
thaw<br />
thaw<br />
Signal<br />
Slot<br />
Figure 3.2: Pipeline Connections<br />
1 alignment_source r("in.bam");<br />
alignment_filter_pipe f1(filter1_config);<br />
alignment_filter_pipe f2(filter2_config);<br />
maplist_sink w1("filtered1.maplist");<br />
5 maplist_sink w2("stdout");<br />
// Connect modules.<br />
r | f1 | f2 | w2;<br />
10 f1 | w1;<br />
// Process data <strong>and</strong> finish.<br />
r.dump();<br />
Listing 3.2: Using Processing Modules with the <strong>lib</strong>shore Pipeline API<br />
the same types of function parameters (slots). Calling a signal function passes the respective set<br />
of function parameters to all connected slots in an unspecied order of invocation.<br />
We thereby dene the concept of a Source as a class providing signals named data as well<br />
as flush, <strong>and</strong> further slots labeled freeze <strong>and</strong> thaw. A Sink concept is conversely dened as a<br />
class providing slots named data <strong>and</strong> flush, as well as signals freeze <strong>and</strong> thaw. Furthermore, a<br />
Pipe amalgamates both the Source <strong>and</strong> the Sink concept. Thus dened Pipe processing modules<br />
immediately forward all elements of output data produced in response to ush requests or data<br />
appended the data slot to their respective data signal. A processing pipeline can thus be dened<br />
as a series of connected Pipe elements terminated by Sources <strong>and</strong> Sinks at either end, respectively<br />
(gure 3.2).<br />
A Source or Sink can be realized using respective templates source <strong>and</strong> sink capable of<br />
appropriately transforming Reader <strong>and</strong> Writer objects. A further template pipe is provided for<br />
transformation of objects conforming to both the Reader <strong>and</strong> Writer concepts. For each le<br />
format Reader <strong>and</strong> Writer class, the <strong>lib</strong>rary denes corresponding Source <strong>and</strong> Sink classes by<br />
means of the respective template, which are marked by suxes _source <strong>and</strong> _sink.<br />
Source, Pipe <strong>and</strong> Sink objects can be combined into extensive processing networks through<br />
connection of the appropriate signals <strong>and</strong> slots. To improve the clarity of denition of such<br />
networks, an overload of the pipe operator ( |) is provided for establishing pair-wise module<br />
connections. The code of the processing example can thereby be simplied to a large extent<br />
(listing 3.2). Lines 8 <strong>and</strong> 10 dene the topology of the processing network. The source template
3.2. A MODULAR SIGNAL-SLOT PROCESSING FRAMEWORK 67<br />
further denes a method dump, which causes all data provided by the incorporated Reader to be<br />
passed to the data signal before triggering the flush signal to nalize processing (line 13).<br />
For cases where application to entire data sets is not the intended use, modules must provide<br />
a means of suspending the ow of processing. Classes implementing the Reader concept may for<br />
example oer congurable st<strong>and</strong>ard ltering facilities comprising of multiple cascaded processing<br />
modules. However, while the Reader concept implies that each data set element is retrievable<br />
individually, Pipe elements may produce an arbitrary amount of output in response to input<br />
data processing, which would render such modules unsuitable for this type of reuse. We therefore<br />
introduce the additional module connections freeze <strong>and</strong> thaw serving to suspend <strong>and</strong> resume<br />
processing pipeline data ow, respectively (gure 3.2). These connections run anti-parallel to<br />
the data <strong>and</strong> flush signals. As signal invocation corresponds to a series of nested function<br />
calls transmitted along the connected modules, source <strong>and</strong> pipe templates are able to directly<br />
detect suspend <strong>and</strong> resume requests sent by downstream elements. A provided class template<br />
extractor utilizes the implemented suspend facilities to act as a bridge between the Sink <strong>and</strong><br />
Reader concepts. Thus, extractor objects may be inserted at the end of a line of processing<br />
modules to retrieve generated output data individually. A complementary template feed provides<br />
bridging between the Writer <strong>and</strong> Source concepts.<br />
3.2.3 Module Implementation<br />
Encapsulation of subproblems as modules implementing both the Writer <strong>and</strong> Reader concepts<br />
comes with the disadvantage that all data generated in response to a single invocation of one of<br />
the Writer methods must be kept in an internal buer for subsequent retrieval via the Reader<br />
interface. However, in the context of a processing pipeline each element of output data can in<br />
principle directly be transmitted to downstream modules for further processing, often avoiding<br />
the need to implement internal buering. Direct propagation can be achieved by direct implementation<br />
of the Pipe concept, thus exposing direct access to the module's data <strong>and</strong> ush<br />
signals. Implementation logic of unbuered, interruptible processing is however more complex<br />
than the buered equivalent. We therefore provide a base class template pipe_facade enforcing<br />
a restricted programming model to support correct implementation of such modules.<br />
The facade template requires distribution of processing implementation across four function<br />
calls prepare, append, next <strong>and</strong> flush. A function emit is further made accessible to subclasses<br />
to propagate generated output. The template enforces a single-emit rule by which invocation<br />
of the emit base class function is permissible exactly once during each call to either of the four<br />
implementation functions. The purpose of each of the four functions is illustrated by example<br />
(listing 3.3).<br />
Raw sequencing read input data will typically be ordered by an identier specic to the<br />
respective insert, whereas order of the individual reads retrieved from the insert is arbitrary.<br />
For some purposes further ordering of sequencing reads of the same insert by their paired-end<br />
sequencing index may however be convenient. Implementation of an appropriate reordering<br />
module demonstrates the purpose of each of the four facade functions.<br />
The reordering module collects all reads belonging to the same insert in an internal buer,<br />
<strong>and</strong> subsequently orders them by value of their paired-end index. The prepare method serves as<br />
an initial notication about next element that will be passed to the object's append method. In<br />
the case of the example module, identiers are compared to determine whether further reads are<br />
available for the insert currently being processed. If the next read does not belong to the current<br />
insert, reordering of the reads by paired-end index is triggered. Reads are then transmitted for<br />
downstream processing by invocation of emit on the rst of the collected reads. In concert with
68 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
1 class order_paired_reads_pipe<br />
:public pipe_facade<br />
{<br />
private:<br />
10 // Grants pipe_facade access to private member functions.<br />
friend class pipeline_core_access;<br />
15<br />
// Stores all reads with the same identifier.<br />
std::vector m_buffer;<br />
// Orders all reads collected in buffer by their paired-end index.<br />
void order_reads_by_index();<br />
// Tests if a read has no mates.<br />
20 static bool is_single_read(const read &);<br />
void next()<br />
{<br />
if(!m_buffer.empty())<br />
25 {<br />
m_buffer.pop_front();<br />
30 }<br />
}<br />
if(!m_buffer.empty())<br />
emit(m_buffer.front());<br />
void prepare(const read & r)<br />
{<br />
35 if(!m_buffer.empty() && (r.id != m_buffer.front().id))<br />
{<br />
order_reads_by_index();<br />
emit(m_buffer.front());<br />
}<br />
40 }<br />
void append(const read & r)<br />
{<br />
if(is_single_read(r))<br />
45 emit(r);<br />
else<br />
m_buffer.push_back(r);<br />
}<br />
50 void flush()<br />
{<br />
if(!m_buffer.empty())<br />
{<br />
order_reads_by_index();<br />
55 emit(m_buffer.front());<br />
}<br />
}<br />
};<br />
Listing 3.3: Usage Example for the pipe_facade Template
3.2. A MODULAR SIGNAL-SLOT PROCESSING FRAMEWORK 69<br />
the method next described below, the internal sequencing read buer thereby gets successively<br />
freed for processing of the following insert.<br />
Processing of new input data must be h<strong>and</strong>led by the method append. In the context of the<br />
reordering example, single-end <strong>and</strong> orphan reads i. e. all reads that for any reason are the only<br />
ones for their respective insert can be immediately transmitted downstream, whereas paired<br />
reads are appended to the internal sequencing read buer for subsequent sorting.<br />
The method next is automatically invoked through the facade class on completing propagation<br />
of an element of output data. Subclasses may implement the method with the purpose<br />
of discarding data no longer required. The emit method may be re-invoked inside the method<br />
body. In the example module, this property is exploited to iteratively emit all reads of the<br />
current insert <strong>and</strong> free the buer for processing of the following insert.<br />
By means of the division of implementation among the four processing functions in concert<br />
with the single-emit rule, the facade template is able to ensure correct data propagation <strong>and</strong><br />
h<strong>and</strong>ling of suspend <strong>and</strong> resume requests triggered through the freeze <strong>and</strong> thaw slots. Subclasses<br />
may however omit implementation of individual methods not required for the respective<br />
processing task.<br />
3.2.4 In-Place <strong>Data</strong> Manipulation<br />
In the pipeline model, processing modules are presented with individual data set elements that<br />
are invalidated as the following element becomes available. As demonstrated by the example<br />
(listing 3.3), this implies that operations that rely on a relation between multiple elements must<br />
maintain an internal buer to accumulate the relevant data. Furthermore, to avoid unintended<br />
side eects between dierent branches of the downstream processing setup, modules <strong>and</strong> data<br />
sources must prohibit direct manipulation of transmitted data. Copying of data is therefore also a<br />
prerequisite to implementing manipulation of certain properties of the data set's elements. Both<br />
restrictions may therefore result in a considerable level of unnecessary data copying overhead.<br />
To mitigate this issue with the explicit model of data propagation (section 3.2.2), Reader<br />
classes are initially implemented following a lower level concept Basic_Reader, which is dened<br />
as a class with a single method next. The method can be used to attempt reading a single data<br />
set element into an externally provided buer, <strong>and</strong> returns a boolean ag to indicate whether an<br />
element could be successfully retrieved. <strong>Data</strong> read into the provided buer can then be freely<br />
manipulated prior to further processing. The Basic_Reader variant of a class is identied by the<br />
prex basic_. The corresponding Reader classes compliant with the processing infrastructure<br />
are generated automatically via a class template monolithic (gure 3.1).<br />
The processing framework provides a more generic mechanism to avoid data copying overhead.<br />
By implementation as plug-in class, operations with identical input <strong>and</strong> output data type can be<br />
realized using in-place manipulation of an arbitrary number of data set elements determined by<br />
the class.<br />
For plug-in module support, Reader, Source or Pipe classes inherit a template class<br />
buffer_chain. The template maintains an extensible list of re-assignable buers for the<br />
respective type of data set element associated with the data source. Prior to downstream<br />
propagation of data, objects of classes inheriting a class plugin of appropriate type may be<br />
granted access to the list of maintained buers for manipulation, removal or requesting further<br />
buering of data set elements.<br />
As a simple example, we present the implementation of a plug-in for sort order verication<br />
(listing 3.4). Prior to propagation of the rst of the buered elements, plug-in modules<br />
are presented with the internal list of buers, as well as a list of unused buers to which discarded<br />
elements may be added. The module's apply method may freely manipulate, discard,
70 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
1 class sequence_sorting_check<br />
:public plugin<br />
{<br />
private:<br />
5<br />
// Generates an exception to indicate data are not sorted<br />
// according to expectation.<br />
void sorting_error();<br />
10 public:<br />
virtual int apply(std::deque & buffers,<br />
std::vector & unused,<br />
const bool flush)<br />
15 {<br />
if(buffers.size() < 2)<br />
return flush ? (PLUGIN_SUCCESS) : (PLUGIN_STARVED);<br />
if(buffers[1]->sequence < buffers[0]->sequence)<br />
20 sorting_error();<br />
};<br />
}<br />
return PLUGIN_SUCCESS;<br />
Listing 3.4: Plugin Implementation Example<br />
re-use buers of the unused buers list or allocate further buers for its data manipulation task.<br />
Given enough data set elements have been buered for the module to carry out its operation,<br />
a status value PLUGIN_SUCCESS is returned, <strong>and</strong> status PLUGIN_STARVED otherwise to cause the<br />
buffer_chain object to delay data propagation <strong>and</strong> collect further data. In case of the sort order<br />
example, the method indicates at least two elements are required to complete the operation (line<br />
16). Given the prerequisite is fullled, the module proceeds to verify the ordering of the rst two<br />
elements found in the buer (line 19).<br />
Support for plug-in modules can be readily combined with the pipe_facade approach (listing<br />
3.5). The source code example presents a simple conversion module for sequencing read<br />
data types. To allow a more direct representation of the le structure, sequences read from 454<br />
St<strong>and</strong>ard Flowgram Format (SFF) les are initially stored in a data structure distinct from the<br />
<strong>lib</strong>shore regular sequencing read representation. To be used with generic read processing modules,<br />
a conversion of read data is thus required. The append method of the conversion module requests<br />
an unused buer <strong>and</strong> uses it to perform the data type conversion (lines 21, 23). Given plug-in<br />
modules succeed, a data set element is transmitted downstream (lines 25, 26). The module's<br />
next method invalidates the rst of the buered elements <strong>and</strong> transmits further data when appropriate.<br />
When combined with the buffer_chain class, the method flush of the pipe_facade<br />
template must be implemented to nalize processing of elements that were possibly delayed by<br />
request of plug-in modules (lines 2933).
3.3. A SIMPLIFIED PARALLELIZATION INTERFACE 71<br />
1 class sff_flatten_pipe<br />
:public shore::pipe_facade,<br />
public buffer_chain<br />
{<br />
5 private:<br />
// Grants pipe_facade access to private member functions.<br />
friend class shore::pipeline_core_access;<br />
10 // Performs the conversion from sff_read to read type.<br />
static void init_read_from_sff(read & r, const sff_read & sff);<br />
void next()<br />
{<br />
15 if(buffer_chain_pop())<br />
emit(buffer_chain_front());<br />
}<br />
void append(const shore::sff_read & r)<br />
20 {<br />
shore::read & current = buffer_chain_push();<br />
init_read_from_sff(current, r);<br />
25 if(buffer_chain_prepare())<br />
emit(buffer_chain_front());<br />
}<br />
void flush()<br />
30 {<br />
if(buffer_chain_flush(false))<br />
emit(buffer_chain_front());<br />
}<br />
};<br />
Listing 3.5: Usage Example for the buffer_chain Template<br />
3.3 A Simplied Parallelization Interface<br />
While run time of simple high-throughput sequencing data processing tasks is often primarily<br />
determined by the available capacities for input <strong>and</strong> output, computationally more expensive<br />
operations such as mapping of reads to a reference genome sequence may require heavy<br />
parallelization to complete in an acceptable time span. Correctness of in general error prone<br />
implementation of parallel algorithms is dicult to assess. We build on the <strong>lib</strong>shore pipeline<br />
infrastructure (section 3.2) to realize a parallelization framework that oers a simplied parallel<br />
programming model sucient for typical processing of sequencing data. The framework provides<br />
an interface that abstracts from parallelization-related implementation details <strong>and</strong> is able<br />
to seamlessly support both shared <strong>and</strong> distributed memory modes of operation.
72 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
1 const int NUM_THREADS = 16;<br />
const int BATCH_SIZE = 100;<br />
parallelizer par(NUM_THREADS, BATCH_SIZE);<br />
5 sync syn1;<br />
sync syn2;<br />
serial r("in.bam");<br />
parallel f1(filter1_config);<br />
10 parallel f2(filter2_config);<br />
serial w1("filtered1.maplist");<br />
serial w2("stdout");<br />
// Connect modules.<br />
15 r | par | f1 | f2 | syn2 | w2;<br />
f1 | syn1 | w1;<br />
// Process data <strong>and</strong> finish.<br />
20 dump(r);<br />
Listing 3.6: Parallel Processing Example<br />
3.3.1 Parallelization Modules<br />
Our parallelization <strong>lib</strong>rary is based primarily on ve dierent C++ class templates (gure 3.1).<br />
Class templates parallelizer, sync <strong>and</strong> desync can be instantiated to provide special parallelization<br />
modules for incorporation into a processing pipeline. Instantiated objects constitute<br />
special processing modules with identical input <strong>and</strong> output data type. The respective templates<br />
are parametrized on the type of data that is processable by the module. The template<br />
parallelizer is responsible for thread creation <strong>and</strong> management as well as dispatching of data<br />
to dierent threads or processes. Objects instantiated from the class template sync can be<br />
incorporated into a parallel pipeline to protect following pipeline modules from concurrent modi-<br />
cation. The third type of parallelization module desync allows transition from pipeline modules<br />
protected by a sync object to further concurrently modiable processing modules. For creation<br />
of parallel processing pipelines, the actual processing modules must further be incorporated by<br />
class templates parallel or serial to indicate whether the respective module may be used<br />
concurrently.<br />
While the basic processing infrastructure only enforces compatibility of input <strong>and</strong> output<br />
data types of connected modules, the parallelization templates impose additional restrictions<br />
upon which classes of object can be connected. Outputs of serial objects may be connected<br />
to either other serial or to parallelizer or desync objects. Objects instantiated from the<br />
parallel template are the only objects capable of receiving input from parallelizer objects.<br />
The parallel objects can be connected downstream to either further objects instantiated from<br />
the parallel template, or to appropriate sync objects. Finally, desync outputs can only be<br />
connected to parallel objects.<br />
Using these templates, the original data processing example (listing 3.2) can be reformulated<br />
to support parallel modes of operation (listing 3.6). As indicated by the example, the level of<br />
parallelism is requested through a parallelizer object (line 4). The parallelizer distributes<br />
batches of data collected from the connected reader object to the worker threads associated with<br />
it. The two subsequent ltering operations are carried out in concurrent mode, whereas the
3.3. A SIMPLIFIED PARALLELIZATION INTERFACE 73<br />
output operations must be protected by respective sync objects to prevent garbled output.<br />
Our API abstracts from the actual implementation by which parallelization <strong>and</strong> synchronization<br />
of program ow is accomplished. The underlying framework currently supports multithreaded<br />
operation as well as distributed memory congurations such as compute clusters utilizing<br />
the Message Passing Interface (MPI) st<strong>and</strong>ard. To avoid a hard dependency on MPI<br />
implementations, support for distributed operation is disabled by default, but may be enabled<br />
at compile time without further source code modication.<br />
3.3.2 Parallel Pipeline Architecture<br />
Internally, automation of parallel processing is accomplished by introduction of further signalslot<br />
connections between the parallelization modules. The framework's mode of operation is<br />
illustrated with a block diagram depicting all possible module connections (gure 3.3(a) ). The<br />
data Source for the processing pipeline is incorporated by the leftmost serial object. The<br />
data <strong>and</strong> flush signals of the source are connected to a multiplexer object internal to the<br />
parallelizer. The parallelizer further incorporates multiple thread objects (multiplicity is<br />
indicated by three dots in the lower compartment of the respective block diagram element). <strong>Data</strong><br />
collected by the multiplexer are dispatched to threads by means of a condition variable <strong>and</strong><br />
simple exchange of buers.<br />
Pipeline initialization is controlled by an auxiliary connection numthreads oered by all parallelization<br />
classes (not shown). Objects of class parallel on initialization create multiple separate<br />
processing modules. Each of these modules is associated with one specic thread of the<br />
parallelizer, to which it is connected to through its data <strong>and</strong> flush slots.<br />
On the output side, each module incorporated by the parallel object is further connected<br />
to a unique track provided by the sync object. All of the tracks' data signals are connected<br />
to the same downstream serial object, whereas received flush requests must be integrated by<br />
the parent sync object to ensure that the downstream section of the pipeline gets ushed after<br />
processing has completed on all threads.<br />
A sync incorporates a mutex variable which is used to protect downstream parts of the<br />
pipeline from concurrent use. On activation of a track object's data or flush slots, this mutex<br />
is acquired by the respective track before conveying the signal to the connected serial object.<br />
By the nested function call property of signals passed to connected pipeline elements, it is ensured<br />
that all contiguous, serial downstream operations are completed as the signal call returns. The<br />
mutex can thus be released immediately to free the protected modules for use by other threads.<br />
The desync template is intended to enable correct resumption of parallel mode of operation<br />
downstream of a section of the processing pipeline controlled by a sync object. To accomplish<br />
that, desync objects must provide a decoupling of downstream processing modules from the<br />
upstream serial modules. For this purpose, the module receives locking information from the<br />
controlling sync object by means of four additional signal-slot connections. These connections<br />
between the sync <strong>and</strong> the desync object are established through objects lock_link provided<br />
by the serial modules composing the enclosed section of the pipeline. A signal postlock is<br />
transmitted by the sync object after the mutex variable has been acquired by a thread. In<br />
response, the desync object starts to accumulate all data that is generated by the upstream<br />
pipeline in a thread-specic buer. The signal preunlock is received immediately before the<br />
mutex is unlocked by the upstream sync. The signal causes the thread-specic buer to be<br />
detached from upstream pipeline input. Finally, a third signal postunlock indicates that the<br />
respective thread has just relinquished the sync's mutex variable, <strong>and</strong> causes the desync to dump<br />
all data collected in a buer to the channel of the downstream pipeline that is associated with
74 CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING<br />
(a)<br />
serial<br />
parallelizer<br />
parallel<br />
sync<br />
serial<br />
desync<br />
parallel<br />
Source<br />
multiplexer<br />
track<br />
Pipe<br />
multiplexer<br />
thread<br />
Pipe<br />
lock_link<br />
. . .<br />
. . .<br />
. . .<br />
. . .<br />
track<br />
Sink<br />
. . .<br />
. . .<br />
(b)<br />
serial<br />
parallelizer<br />
parallel<br />
sync<br />
serial<br />
desync<br />
parallel<br />
multiplexer<br />
track<br />
multiplexer<br />
thread<br />
Pipe<br />
lock_link<br />
. . .<br />
. . .<br />
. . .<br />
. . .<br />
track<br />
Sink<br />
. . .<br />
. . .<br />
Signal<br />
data<br />
ush<br />
Slot<br />
postlock<br />
preunlock<br />
postunlock<br />
join<br />
Figure 3.3: Connections between Parallel Processing Objects<br />
the respective thread. <strong>An</strong> auxiliary flush signal in addition conveys the thread-specic ush<br />
state across synchronized sections of the processing pipeline.<br />
Finally, a connection join is supported by all parallel processing modules to ensure correct<br />
termination of processing across all threads.<br />
For distributed memory architectures, data are distributed from a single root process to<br />
several worker processes. Worker processes must receive the data to be processed concurrently<br />
from the root process, <strong>and</strong> must subsequently return their output data for processing by serial<br />
modules (gure 3.3(b) ).<br />
Objects of parallelizer, sync <strong>and</strong> desync initialized in worker processes automatically<br />
establish connections for exchange of data with their respective counterpart in the root process,<br />
where each connection is associated with an additional thread in the root process.<br />
To prevent initialization conicts <strong>and</strong> overhead, serial objects are associated with the additional<br />
function of preventing creation of their respective incorporated processing module when
3.3. A SIMPLIFIED PARALLELIZATION INTERFACE 75<br />
not in the root process.<br />
Threads of worker processes' parallelizer objects operate by requesting data from the respective<br />
counterpart in the root process. Failure to retrieve data indicates that processing is<br />
complete, <strong>and</strong> triggers invocation of the thread's ush signal <strong>and</strong> subsequently thread termination.<br />
Otherwise, data received are transmitted to the subsequent parallel object, identical with<br />
threaded operation.<br />
Subsequent sync objects submit received data <strong>and</strong> ush signals to the corresponding object<br />
of the root process. Sections formed by serial objects however cause an interruption in the<br />
pipelines of worker processes. Therefore, sync objects must exploit the postlock signal transmitted<br />
through serial modules to trigger a downstream desync object to request data generated<br />
by serial processing from its counterpart in the root process.
4 Closing Remarks<br />
Routine analysis of high-throughput sequencing data not only requires development of an appropriate<br />
analysis strategy for the respective application as well as solving associated algorithmic<br />
challenges, but is further greatly facilitated by embedding the approach into an underlying framework<br />
optimized for h<strong>and</strong>ling large sequence data sets. With SHORE, we have implemented an<br />
end-to-end pipeline providing a powerful <strong>and</strong> modular analysis environment. Through tight<br />
integration of all required analysis modules, data ow is optimized, minimizing overhead <strong>and</strong><br />
incompatibilities at the transition points between the various steps of an analysis. Through embedding<br />
into the pipeline environment, even supplementary utility algorithms are augmented by<br />
readily available complementary functionality such as capable read data ltering.<br />
A variety of analysis modules dedicated to specic high-throughput sequencing applications<br />
are already made available with SHORE. With the SHORE peak module, this work introduced support<br />
for ChIP-Seq enrichment proling oering highly congurable peak detection <strong>and</strong> ltering<br />
mechanisms. With an approach that renders analysis results of simultaneously processed data<br />
sets immediately comparable, obtaining conclusions from replicate experiments is in our view<br />
greatly facilitated. With further general-purpose modules SHORE coverage <strong>and</strong> SHORE count providing<br />
exible depth of coverage-based sequence segmentation <strong>and</strong> comprehensive gathering of<br />
segment-associated statistics, SHORE should prove applicable to a wide range of expression pro-<br />
ling, enrichment or depletion protocols without further adaptation.<br />
With the integrated approach, we were able to introduce a common foundation in the form<br />
of the <strong>lib</strong>shore framework. This has allowed us to implement a common data storage solution<br />
allowing universal support of exible <strong>and</strong> eective data compression <strong>and</strong> accelerated data set<br />
queries across all analysis modules. Thereby, we maintain the usefulness of the analysis suite in<br />
the face of vastly increased quantities of data produced by sequencing instruments. With generalpurpose<br />
modules such as SHORE oligo-match, we further demonstrate that usability of st<strong>and</strong>-alone<br />
utilities also benets from the common framework in terms of le format independence, familiar<br />
comm<strong>and</strong> line interfaces, automatic support for compressed or streaming input <strong>and</strong> output, or<br />
capable parallelization. With the <strong>lib</strong>shore processing framework, we provide a means for designing<br />
complex analysis algorithms as a collection of manageable, self-contained modules. The ability<br />
to parallelize many such algorithms without further modication from our point of view has<br />
signicant value with the still increasing rate of sequencing data production, <strong>and</strong> might serve as a<br />
basis for the development of further abstracted parallel processing approaches targeted at specic<br />
subclasses of sequencing application. For example, parallel implementations of reference-guided<br />
variant calling algorithms require appropriate data partitioning <strong>and</strong> result merging mechanisms,<br />
distributing read mapping data to dierent processing threads according to their location on the<br />
reference sequence. Such mechanisms are potentially intricate if multi-position traits such as<br />
allelic phasing of single nucleotide polymorphisms are to be assessed.<br />
Despite next-generation sequencing technology now having been available for several years,<br />
for most applications surprisingly little consensus has been reached regarding optimal analysis<br />
77
78 CHAPTER 4. CLOSING REMARKS<br />
algorithms. With the sample preparation procedure leading up to the sequencing device being<br />
application-specic, weakly st<strong>and</strong>ardized, involving signicant amounts of manual work <strong>and</strong> thus<br />
ultimately highly variable, development of robust statistical models has proved intricate. Furthermore,<br />
strategies for result validation are usually laborious, cost-intensive or have yet to be<br />
developed. In many cases, specicity or sensitivity of predictions can be improved by intersection<br />
or union of results obtained through partially complementary algorithms, respectively. For<br />
example, with the SHORE mapowcell utility, we provide a unied interface that allows to readily<br />
obtain <strong>and</strong> combine the results of a variety of available read mapping algorithms in consistent<br />
<strong>and</strong> immediately comparable output formats. Implementation of further wrapper infrastructure<br />
to similarly integrate third-party analysis algorithms for applications such as variation calling or<br />
enrichment proling might be a future direction to obtain a valuable resource to readily determine<br />
individual algorithms' strengths <strong>and</strong> weaknesses through large-scale comparison. However,<br />
the various algorithms available for a specic high-throughput sequencing application often share<br />
similar fundamental ideas, <strong>and</strong> thus suer from corresponding systematic issues.<br />
The initially rapid pace at which innovations improving throughput, read lengths or error<br />
rates have been achieved now seems to be diminishing for the current generation of sequencing<br />
technologies. However, conceptually dierent high-throughput sequencing approaches such as<br />
nanopore sequencing are being investigated, <strong>and</strong> might in the medium term account for further<br />
drastic changes in the eld. With cost reduced by a further magnitude <strong>and</strong> improved reliability<br />
<strong>and</strong> automation of sample preparation, DNA sequencing could nd its way into routine medical<br />
application. However, other than cost-per-base, changes in the properties of the data should<br />
prove most relevant for scientic application <strong>and</strong> data analysis.<br />
Sequencing read length has long been perceived as a critical factor for genome assembly or<br />
assessment of genome structure. It seems likely that a technology able to deliver read lengths<br />
of multiple tens of kilobases at competitive accuracy <strong>and</strong> price would bring about a major shift<br />
towards multiple whole-genome comparison strategies. In addition to whole-genome topology,<br />
such approaches will implicitly yield traits like copy number variation as well as single nucleotide<br />
polymorphisms <strong>and</strong> all other types of localized, small-scale variation, <strong>and</strong> possibly render the<br />
corresponding reference-based resequencing approaches irrelevant. However, to fully leverage the<br />
advantages compared to reference sequence-guided analysis, signicant further challenges with<br />
respect to data analysis <strong>and</strong> representation will have to be overcome. Reasonable relations must<br />
be devised to capture homology, replacing the linear reference genome coordinate system. Providing<br />
manageable breakdown <strong>and</strong> visualization of such data will be essential, <strong>and</strong> likely require<br />
a variety of dierent approaches depending on the respective focus of the investigation. To assess<br />
the functional implications of primary analysis results, gene annotation or annotation transfer<br />
algorithms will have to be simplied <strong>and</strong> improved. Finally, ensuring immediate comparability<br />
of results across dierent studies should prove challenging.<br />
Technological innovations that could potentially replace or drastically improve the power<br />
of current deep sequencing approaches such as whole-genome enrichment or expression pro-<br />
ling are more dicult to envision. Single-molecule sequencing methods however may in the<br />
future permit amplication- <strong>and</strong> ligation-free sequencing of samples from small amounts of starting<br />
material. From such elimination of steps of the sample preparation procedure might follow a<br />
signicant reduction of sequence-specic biases. Furthermore, this simplication should present<br />
opportunities for st<strong>and</strong>ardization <strong>and</strong> automation of sequencing <strong>lib</strong>rary production to further reduce<br />
technical variability. With minimization of the amount of required DNA <strong>and</strong> tissue, sample<br />
composition should correspondingly be rendered more controllable. Spike-in of st<strong>and</strong>ard quantities<br />
of DNA oligomers seems to have been ab<strong>and</strong>oned by the scientic community, but might in<br />
future rened <strong>and</strong> st<strong>and</strong>ardized incarnations provide an additional resource to data normalization<br />
<strong>and</strong> estimation of measurement variance. Dropping sequencing cost <strong>and</strong> simplication of <strong>lib</strong>rary
79<br />
preparation should further induce tolerance to employ the resource sequencing less sparingly.<br />
While most current studies due to economical considerations rely on no more than two replicate<br />
experiments, gathering extensive experimental evidence in the form of many replicate data sets<br />
could become st<strong>and</strong>ard in the future. Combined, these factors may entail improved statistical<br />
power from quantitative sequencing experiments, while at the same time facilitating statistical<br />
modeling of the sequencing <strong>and</strong> sample preparation process.<br />
To this eect, the fundamental approaches to quantitative sequencing data analysis should<br />
require rather gradual modication compared to genome structure <strong>and</strong> variation analysis. Reference<br />
sequence-based analysis as implemented in our modules SHORE peak or SHORE count will<br />
probably remain the preferred strategy in the near future, although it may become more commonplace<br />
to generate appropriate reference genomes de-novo along with the quantitative data. A<br />
medium-term goal might be complementing the available algorithms for quantitative data with<br />
methods that allow to better leverage the statistical power acquired by increasing the number of<br />
replicate experiments. Short-term improvements could possibly be achieved by investing in more<br />
sophisticated models <strong>and</strong> algorithms to capture <strong>and</strong> estimate an increasing number of parameters<br />
of <strong>lib</strong>rary preparation, sequencing <strong>and</strong> data analysis. However, the characteristics of current<br />
data sets may be the result of superimposition <strong>and</strong> complex interplay of a variety of eects, <strong>and</strong><br />
are furthermore subject to constant change as protocols <strong>and</strong> sequencing chemistry evolve.<br />
With progress with respect to sequencing read length, sequence assembly <strong>and</strong> automated<br />
genome annotation algorithms, reference-based variation detection as implemented by<br />
SHORE consensus or SHORE qVar modules could be superseded by fundamentally dierent strategies.<br />
However, SHORE <strong>and</strong> <strong>lib</strong>shore may still constitute a valuable framework for implementation<br />
of such future methods. While increasing read length should automatically resolve many of<br />
the ambiguities that current analysis algorithms are confronted with, the biological questions<br />
to address by future algorithms are likely to gain complexity. They should therefore pose<br />
computationally intensive tasks that benet from application programming frameworks oriented<br />
towards the ecient processing of large amounts of biological sequence data.<br />
While the focus of this work was on the technical <strong>and</strong> algorithmic aspects of obtaining primary<br />
analysis results from various types of high-throughput DNA sequencing data set, a future<br />
challenge will lie with moving from elucidation of isolated properties of the cellular mechanism<br />
to building consistent theoretical frameworks through integration of the data made accessible by<br />
dierent sequencing applications <strong>and</strong> further complementary technologies.
Bibliography<br />
[1] S. Ossowski. Computational Approaches for Next Generation Sequencing <strong>An</strong>alysis <strong>and</strong><br />
MiRNA Target Search. PhD thesis. Wilhelmstr. 32, 72074 Tübingen: Universität Tübingen,<br />
2010.<br />
[2] F. Sanger, S. Nicklen, <strong>and</strong> A. R. Coulson. DNA sequencing with chain-terminating inhibitors.<br />
In: Proc. Natl. Acad. Sci. U.S.A. 74.12 (Dec. 1977), pp. 54635467. doi: 10.<br />
1073/pnas.74.12.5463. pmid: 271968.<br />
[3] J. C. Venter, M. D. Adams, G. G. Sutton, A. R. Kerlavage, H. O. Smith, <strong>and</strong> M.<br />
Hunkapiller. Shotgun sequencing of the human genome. In: Science 280.5369 (June<br />
1998), pp. 15401542. doi: 10.1126/science.280.5369.1540. pmid: 9644018.<br />
[4] T. S. Centre <strong>and</strong> T. W. U. G. S. Center. Toward a complete human genome sequence.<br />
In: Genome Res. 8.11 (Nov. 1998), pp. 10971108. doi: 10.1101/gr.8.11.1097. pmid:<br />
9847074.<br />
[5] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O.<br />
Smith, M. Y<strong>and</strong>ell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew,<br />
D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski,<br />
G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G.<br />
Clark, J. Nadeau, V. A. McKusick, N. Zinder, et al. The sequence of the human genome.<br />
In: Science 291.5507 (Feb. 2001), pp. 13041351. doi: 10.1126/science.1058040. pmid:<br />
11181995.<br />
[6] E. S. L<strong>and</strong>er, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon,<br />
K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howl<strong>and</strong>,<br />
L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov,<br />
C. Mir<strong>and</strong>a, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C.<br />
Sougnez, et al. Initial sequencing <strong>and</strong> analysis of the human genome. In: Nature 409.6822<br />
(Feb. 2001), pp. 860921. doi: 10.1038/35057062. pmid: 11237011.<br />
[7] M. Frommer, L. E. McDonald, D. S. Millar, C. M. Collis, F. Watt, G. W. Grigg, P. L.<br />
Molloy, <strong>and</strong> C. L. Paul. A genomic sequencing protocol that yields a positive display of<br />
5-methylcytosine residues in individual DNA str<strong>and</strong>s. In: Proc. Natl. Acad. Sci. U.S.A.<br />
89.5 (Mar. 1992), pp. 18271831. doi: 10.1073/pnas.89.5.1827. pmid: 1542678.<br />
[8] J. C. Venter, K. Remington, J. F. Heidelberg, A. L. Halpern, D. Rusch, J. A. Eisen,<br />
D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, S. Levy, A. H. Knap, M. W.<br />
Lomas, K. Nealson, O. White, J. Peterson, J. Homan, R. Parsons, H. Baden-Tillson, C.<br />
Pfannkoch, Y. H. Rogers, <strong>and</strong> H. O. Smith. Environmental genome shotgun sequencing<br />
of the Sargasso Sea. In: Science 304.5667 (Apr. 2004), pp. 6674. doi: 10.1126/science.<br />
1093857. pmid: 15001713.<br />
81
82 BIBLIOGRAPHY<br />
[9] S. Brenner, M. Johnson, J. Bridgham, G. Golda, D. H. Lloyd, D. Johnson, S. Luo, S.<br />
McCurdy, M. Foy, M. Ewan, R. Roth, D. George, S. Eletr, G. Albrecht, E. Vermaas, S. R.<br />
Williams, K. Moon, T. Burcham, M. Pallas, R. B. DuBridge, J. Kirchner, K. Fearon,<br />
J. Mao, <strong>and</strong> K. Corcoran. Gene expression analysis by massively parallel signature sequencing<br />
(MPSS) on microbead arrays. In: Nat. Biotechnol. 18.6 (June 2000), pp. 630<br />
634. doi: 10.1038/76469. pmid: 10835600.<br />
[10] M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J.<br />
Berka, M. S. Braverman, Y. J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V.<br />
Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, C. H. Ho, G. P. Irzyk, S. C. J<strong>and</strong>o,<br />
M. L. Alenquer, T. P. Jarvie, K. B. Jirage, J. B. Kim, J. R. Knight, J. R. Lanza, J. H.<br />
Leamon, S. M. Lefkowitz, M. Lei, et al. Genome sequencing in microfabricated highdensity<br />
picolitre reactors. In: Nature 437.7057 (Sept. 2005), pp. 376380. doi: 10.1038/<br />
nature03959. pmid: 16056220.<br />
[11] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown,<br />
K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J.<br />
Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J.<br />
Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger,<br />
L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, et al. Accurate whole<br />
human genome sequencing using reversible terminator chemistry. In: Nature 456.7218<br />
(Nov. 2008), pp. 5359. doi: 10.1038/nature07517. pmid: 18987734.<br />
[12] K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F. Tsung, C. R.<br />
Clouser, C. Duncan, J. K. Ichikawa, C. C. Lee, Z. Zhang, S. S. Ranade, E. T. Dimalanta,<br />
F. C. Hyl<strong>and</strong>, T. D. Sokolsky, L. Zhang, A. Sheridan, H. Fu, C. L. Hendrickson, B. Li, L.<br />
Kotler, J. R. Stuart, J. A. Malek, J. M. Manning, A. A. <strong>An</strong>tipova, D. S. Perez, M. P. Moore,<br />
K. C. Hayashibara, M. R. Lyons, R. E. Beaudoin, et al. Sequence <strong>and</strong> structural variation<br />
in a human genome uncovered by short-read, massively parallel ligation sequencing using<br />
two-base encoding. In: Genome Res. 19.9 (Sept. 2009), pp. 15271541. doi: 10.1101/gr.<br />
091868.109. pmid: 19546169.<br />
[13] J. M. Rothberg, W. Hinz, T. M. Rearick, J. Schultz, W. Mileski, M. Davey, J. H. Leamon,<br />
K. Johnson, M. J. Milgrew, M. Edwards, J. Hoon, J. F. Simons, D. Marran, J. W. Myers,<br />
J. F. Davidson, A. Branting, J. R. Nobile, B. P. Puc, D. Light, T. A. Clark, M. Huber,<br />
J. T. Branciforte, I. B. Stoner, S. E. Cawley, M. Lyons, Y. Fu, N. Homer, M. Sedova, X.<br />
Miao, B. Reed, et al. <strong>An</strong> integrated semiconductor device enabling non-optical genome<br />
sequencing. In: Nature 475.7356 (July 2011), pp. 348352. doi: 10.1038/nature10242.<br />
pmid: 21776081.<br />
[14] J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan,<br />
B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark,<br />
R. Dalal, A. Dewinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K.<br />
Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, et al. Real-time<br />
DNA sequencing from single polymerase molecules. In: Science 323.5910 (Jan. 2009),<br />
pp. 133138. doi: 10.1126/science.1162986. pmid: 19023044.<br />
[15] M. J. Levene, J. Korlach, S. W. Turner, M. Foquet, H. G. Craighead, <strong>and</strong> W. W. Webb.<br />
Zero-mode waveguides for single-molecule analysis at high concentrations. In: Science<br />
299.5607 (Jan. 2003), pp. 682686. doi: 10.1126/science.1079700. pmid: 12560545.<br />
[16] D. Loakes. Survey <strong>and</strong> summary: The applications of universal DNA base analogues.<br />
In: Nucleic Acids Res. 29.12 (June 2001), pp. 24372447. doi: 10.1093/nar/29.12.2437.<br />
pmid: 11410649.
BIBLIOGRAPHY 83<br />
[17] X. Zhang, J. Yazaki, A. Sundaresan, S. Cokus, S. W. Chan, H. Chen, I. R. Henderson,<br />
P. Shinn, M. Pellegrini, S. E. Jacobsen, <strong>and</strong> J. R. Ecker. Genome-wide high-resolution<br />
mapping <strong>and</strong> functional analysis of DNA methylation in arabidopsis. In: Cell 126.6 (Sept.<br />
2006), pp. 11891201. doi: 10.1016/j.cell.2006.08.003. pmid: 16949657.<br />
[18] S. Ossowski, K. Schneeberger, R. M. Clark, C. Lanz, N. Warthmann, <strong>and</strong> D. Weigel.<br />
Sequencing of natural strains of Arabidopsis thaliana with short reads. In: Genome Res.<br />
18.12 (Dec. 2008), pp. 20242033. doi: 10.1101/gr.080200.108. pmid: 18818371.<br />
[19] J. Cao, K. Schneeberger, S. Ossowski, T. Gunther, S. Bender, J. Fitz, D. Koenig, C.<br />
Lanz, O. Stegle, C. Lippert, X. Wang, F. Ott, J. Muller, C. Alonso-Blanco, K. Borgwardt,<br />
K. J. Schmid, <strong>and</strong> D. Weigel. Whole-genome sequencing of multiple Arabidopsis thaliana<br />
populations. In: Nat. Genet. 43.10 (Oct. 2011), pp. 956963. doi: 10.1038/ng.911.<br />
[20] K. Schneeberger, S. Ossowski, C. Lanz, T. Juul, A. H. Petersen, K. L. Nielsen, J. E.<br />
J?rgensen, D. Weigel, <strong>and</strong> S. U. <strong>An</strong>dersen. SHOREmap: simultaneous mapping <strong>and</strong> mutation<br />
identication by deep sequencing. In: Nat. Methods 6.8 (Aug. 2009), pp. 550551.<br />
doi: 10.1038/nmeth0809-550. pmid: 19644454.<br />
[21] K. Schneeberger. Whole genome analysis of Arabidopsis thaliana using Next Generation<br />
Sequencing. PhD thesis. Wilhelmstr. 32, 72074 Tübingen: Universität Tübingen, 2010.<br />
urn: urn:nbn:de:bsz:21-opus-56685.<br />
[22] E. A. Dinsdale, R. A. Edwards, D. Hall, F. <strong>An</strong>gly, M. Breitbart, J. M. Brulc, M. Furlan,<br />
C. Desnues, M. Haynes, L. Li, L. McDaniel, M. A. Moran, K. E. Nelson, C. Nilsson, R.<br />
Olson, J. Paul, B. R. Brito, Y. Ruan, B. K. Swan, R. Stevens, D. L. Valentine, R. V.<br />
Thurber, L. Wegley, B. A. White, <strong>and</strong> F. Rohwer. Functional metagenomic proling of<br />
nine biomes. In: Nature 452.7187 (Apr. 2008), pp. 629632. doi: 10.1038/nature06810.<br />
[23] J. Frias-Lopez, Y. Shi, G. W. Tyson, M. L. Coleman, S. C. Schuster, S. W. Chisholm,<br />
<strong>and</strong> E. F. Delong. Microbial community gene expression in ocean surface waters. In:<br />
Proc. Natl. Acad. Sci. U.S.A. 105.10 (Mar. 2008), pp. 38053810. doi: 10.1073/pnas.<br />
0708897105. pmid: 18316740.<br />
[24] X. Mou, S. Sun, R. A. Edwards, R. E. Hodson, <strong>and</strong> M. A. Moran. Bacterial carbon<br />
processing by generalist species in the coastal ocean. In: Nature 451.7179 (Feb. 2008),<br />
pp. 708711. doi: 10.1038/nature06513. pmid: 18223640.<br />
[25] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeer, <strong>and</strong> B. Wold. Mapping <strong>and</strong><br />
quantifying mammalian transcriptomes by RNA-Seq. In: Nat. Methods 5.7 (July 2008),<br />
pp. 621628. doi: 10.1038/nmeth.1226. pmid: 18516045.<br />
[26] B. T. Wilhelm, S. Marguerat, S. Watt, F. Schubert, V. Wood, I. Goodhead, C. J. Penkett,<br />
J. Rogers, <strong>and</strong> J. Bahler. Dynamic repertoire of a eukaryotic transcriptome surveyed<br />
at single-nucleotide resolution. In: Nature 453.7199 (June 2008), pp. 12391243. doi:<br />
10.1038/nature07002. pmid: 18488015.<br />
[27] R. Lister, R. C. O'Malley, J. Tonti-Filippini, B. D. Gregory, C. C. Berry, A. H. Millar,<br />
<strong>and</strong> J. R. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis.<br />
In: Cell 133.3 (May 2008), pp. 523536. doi: 10.1016/j.cell.2008.03.029.<br />
pmid: 18423832.<br />
[28] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, <strong>and</strong> M. Snyder. The<br />
transcriptional l<strong>and</strong>scape of the yeast genome dened by RNA sequencing. In: Science<br />
320.5881 (June 2008), pp. 13441349. doi: 10.1126/science.1158441. pmid: 18451266.
84 BIBLIOGRAPHY<br />
[29] N. Cloonan, A. R. Forrest, G. Kolle, B. B. Gardiner, G. J. Faulkner, M. K. Brown, D. F.<br />
Taylor, A. L. Steptoe, S. Wani, G. Bethel, A. J. Robertson, A. C. Perkins, S. J. Bruce,<br />
C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan, <strong>and</strong> S. M.<br />
Grimmond. Stem cell transcriptome proling via massive-scale mRNA sequencing. In:<br />
Nat. Methods 5.7 (July 2008), pp. 613619. doi: 10.1038/nmeth.1223. pmid: 18516046.<br />
[30] C. Lu, K. Kulkarni, F. F. Souret, R. MuthuValliappan, S. S. Tej, R. S. Poethig, I. R.<br />
Henderson, S. E. Jacobsen, W. Wang, P. J. Green, <strong>and</strong> B. C. Meyers. MicroRNAs <strong>and</strong><br />
other small RNAs enriched in the Arabidopsis RNA-dependent RNA polymerase-2 mutant.<br />
In: Genome Res. 16.10 (Oct. 2006), pp. 12761288. doi: 10.1101/gr.5530106.<br />
pmid: 16954541.<br />
[31] J. G. Ruby, C. Jan, C. Player, M. J. Axtell, W. Lee, C. Nusbaum, H. Ge, <strong>and</strong> D. P. Bartel.<br />
Large-scale sequencing reveals 21U-RNAs <strong>and</strong> additional microRNAs <strong>and</strong> endogenous<br />
siRNAs in C. elegans. In: Cell 127.6 (Dec. 2006), pp. 11931207. doi: 10.1016/j.cell.<br />
2006.10.040. pmid: 17174894.<br />
[32] R. D. Morin, M. D. O'Connor, M. Grith, F. Kuchenbauer, A. Delaney, A. L. Prabhu,<br />
Y. Zhao, H. McDonald, T. Zeng, M. Hirst, C. J. Eaves, <strong>and</strong> M. A. Marra. Application of<br />
massively parallel sequencing to microRNA proling <strong>and</strong> discovery in human embryonic<br />
stem cells. In: Genome Res. 18.4 (Apr. 2008), pp. 610621. doi: 10.1101/gr.7179508.<br />
pmid: 18285502.<br />
[33] P. L<strong>and</strong>graf, M. Rusu, R. Sheridan, A. Sewer, N. Iovino, A. Aravin, S. Pfeer, A. Rice,<br />
A. O. Kamphorst, M. L<strong>and</strong>thaler, C. Lin, N. D. Socci, L. Hermida, V. Fulci, S. Chiaretti,<br />
R. Foa, J. Schliwka, U. Fuchs, A. Novosel, R. U. Muller, B. Schermer, U. Bissels, J. Inman,<br />
Q. Phan, M. Chien, D. B. Weir, R. Choksi, G. De Vita, D. Frezzetti, H. I. Trompeter,<br />
et al. A mammalian microRNA expression atlas based on small RNA <strong>lib</strong>rary sequencing.<br />
In: Cell 129.7 (June 2007), pp. 14011414. doi: 10.1016/j.cell.2007.04.040. pmid:<br />
17604727.<br />
[34] D. S. Johnson, A. Mortazavi, R. M. Myers, <strong>and</strong> B. Wold. Genome-wide mapping of in<br />
vivo protein-DNA interactions. In: Science 316.5830 (June 2007), pp. 14971502. doi:<br />
10.1126/science.1141319. pmid: 17540862.<br />
[35] G. Robertson, M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Euskirchen, B.<br />
Bernier, R. Varhol, A. Delaney, N. Thiessen, O. L. Grith, A. He, M. Marra, M. Snyder,<br />
<strong>and</strong> S. Jones. Genome-wide proles of STAT1 DNA association using chromatin immunoprecipitation<br />
<strong>and</strong> massively parallel sequencing. In: Nat. Methods 4.8 (Aug. 2007),<br />
pp. 651657. doi: 10.1038/nmeth1068. pmid: 17558387.<br />
[36] A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev,<br />
<strong>and</strong> K. Zhao. High-resolution proling of histone methylations in the human genome.<br />
In: Cell 129.4 (May 2007), pp. 823837. doi: 10 . 1016 / j . cell . 2007 . 05 . 009. pmid:<br />
17512414.<br />
[37] T. A. Down, V. K. Rakyan, D. J. Turner, P. Flicek, H. Li, E. Kulesha, S. Graf, N.<br />
Johnson, J. Herrero, E. M. Tomazou, N. P. Thorne, L. Backdahl, M. Herberth, K. L.<br />
Howe, D. K. Jackson, M. M. Miretti, J. C. Marioni, E. Birney, T. J. Hubbard, R. Durbin,<br />
S. Tavare, <strong>and</strong> S. Beck. A Bayesian deconvolution strategy for immunoprecipitationbased<br />
DNA methylome analysis. In: Nat. Biotechnol. 26.7 (July 2008), pp. 779785. doi:<br />
10.1038/nbt1414. pmid: 18612301.
BIBLIOGRAPHY 85<br />
[38] D. D. Licatalosi, A. Mele, J. J. Fak, J. Ule, M. Kayikci, S. W. Chi, T. A. Clark, A. C.<br />
Schweitzer, J. E. Blume, X. Wang, J. C. Darnell, <strong>and</strong> R. B. Darnell. HITS-CLIP yields<br />
genome-wide insights into brain alternative RNA processing. In: Nature 456.7221 (Nov.<br />
2008), pp. 464469. doi: 10.1038/nature07488. pmid: 18978773.<br />
[39] D. E. Schones, K. Cui, S. Cuddapah, T. Y. Roh, A. Barski, Z. Wang, G. Wei, <strong>and</strong> K.<br />
Zhao. Dynamic regulation of nucleosome positioning in the human genome. In: Cell<br />
132.5 (Mar. 2008), pp. 887898. doi: 10.1016/j.cell.2008.02.022. pmid: 18329373.<br />
[40] S. J. Cokus, S. Feng, X. Zhang, Z. Chen, B. Merriman, C. D. Haudenschild, S. Pradhan,<br />
S. F. Nelson, M. Pellegrini, <strong>and</strong> S. E. Jacobsen. Shotgun bisulphite sequencing of the<br />
Arabidopsis genome reveals DNA methylation patterning. In: Nature 452.7184 (Mar.<br />
2008), pp. 215219. doi: 10.1038/nature06745. pmid: 18278030.<br />
[41] A. Meissner, T. S. Mikkelsen, H. Gu, M. Wernig, J. Hanna, A. Sivachenko, X. Zhang,<br />
B. E. Bernstein, C. Nusbaum, D. B. Jae, A. Gnirke, R. Jaenisch, <strong>and</strong> E. S. L<strong>and</strong>er.<br />
Genome-scale DNA methylation maps of pluripotent <strong>and</strong> dierentiated cells. In: Nature<br />
454.7205 (Aug. 2008), pp. 766770. doi: 10.1038/nature07107. pmid: 18600261.<br />
[42] N. A. Baird, P. D. Etter, T. S. Atwood, M. C. Currey, A. L. Shiver, Z. A. Lewis, E. U.<br />
Selker, W. A. Cresko, <strong>and</strong> E. A. Johnson. Rapid SNP discovery <strong>and</strong> genetic mapping using<br />
sequenced RAD markers. In: PLoS ONE 3.10 (2008), e3376. doi: 10.1371/journal.<br />
pone.0003376. pmid: 18852878.<br />
[43] E. M. Willing, M. Homann, J. D. Klein, D. Weigel, <strong>and</strong> C. Dreyer. Paired-end RADseq<br />
for de novo assembly <strong>and</strong> marker design without available reference. In: Bioinformatics<br />
27.16 (Aug. 2011), pp. 21872193. doi: 10.1093/bioinformatics/btr346. pmid:<br />
21712251.<br />
[44] C. Zong, S. Lu, A. R. Chapman, <strong>and</strong> X. S. Xie. Genome-wide detection of singlenucleotide<br />
<strong>and</strong> copy-number variations of a single human cell. In: Science 338.6114 (Dec.<br />
2012), pp. 16221626. doi: 10.1126/science.1229164. pmid: 23258894.<br />
[45] G. R. Abecasis, D. Altshuler, A. Auton, L. D. Brooks, R. M. Durbin, R. A. Gibbs, M. E.<br />
Hurles, G. A. McVean, D. Altshuler, R. M. Durbin, G. R. Abecasis, D. R. Bentley, A.<br />
Chakravarti, A. G. Clark, F. S. Collins, F. M. De La Vega, P. Donnelly, M. Egholm,<br />
P. Flicek, S. B. Gabriel, R. A. Gibbs, B. M. Knoppers, E. S. L<strong>and</strong>er, H. Lehrach, E. R.<br />
Mardis, G. A. McVean, D. A. Nickerson, L. Peltonen, A. J. Schafer, S. T. Sherry, et al. A<br />
map of human genome variation from population-scale sequencing. In: Nature 467.7319<br />
(Oct. 2010), pp. 10611073. doi: 10.1038/nature09534. pmid: 20981092.<br />
[46] G. R. Abecasis, A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. H<strong>and</strong>saker,<br />
H. M. Kang, G. T. Marth, G. A. McVean, D. M. Altshuler, R. M. Durbin, G. R. Abecasis,<br />
D. R. Bentley, A. Chakravarti, A. G. Clark, P. Donnelly, E. E. Eichler, P. Flicek, S. B.<br />
Gabriel, R. A. Gibbs, E. D. Green, M. E. Hurles, B. M. Knoppers, J. O. Korbel, E. S.<br />
L<strong>and</strong>er, C. Lee, H. Lehrach, E. R. Mardis, G. T. Marth, G. A. McVean, et al. <strong>An</strong><br />
integrated map of genetic variation from 1,092 human genomes. In: Nature 491.7422<br />
(Nov. 2012), pp. 5665. doi: 10.1038/nature11632. pmid: 23128226.<br />
[47] D. Haussler, S. J. O'Brien, O. A. Ryder, F. K. Barker, M. Clamp, A. J. Crawford, R.<br />
Hanner, O. Hanotte, W. E. Johnson, J. A. McGuire, W. Miller, R. W. Murphy, W. J.<br />
Murphy, F. H. Sheldon, B. Sinervo, B. Venkatesh, E. O. Wiley, F. W. Allendorf, G. Amato,<br />
C. S. Baker, A. Bauer, A. Beja-Pereira, E. Bermingham, G. Bernardi, C. R. Bonvicino, S.<br />
Brenner, T. Burke, J. Cracraft, M. Diekhans, S. Edwards, et al. Genome 10K: a proposal<br />
to obtain whole-genome sequence for 10,000 vertebrate species. In: J. Hered. 100.6 (2009),<br />
pp. 659674. doi: 10.1093/jhered/esp086. pmid: 19892720.
86 BIBLIOGRAPHY<br />
[48] B. A. Methe, K. E. Nelson, M. Pop, H. H. Creasy, M. G. Giglio, C. Huttenhower, D.<br />
Gevers, J. F. Petrosino, S. Abubucker, J. H. Badger, A. T. Chinwalla, A. M. Earl, M. G.<br />
FitzGerald, R. S. Fulton, K. Hallsworth-Pepin, E. A. Lobos, R. Madupu, V. Magrini,<br />
J. C. Martin, M. Mitreva, D. M. Muzny, E. J. Sodergren, J. Versalovic, A. M. Wollam,<br />
K. C. Worley, J. R. Wortman, S. K. Young, Q. Zeng, K. M. Aagaard, O. O. Abolude,<br />
et al. A framework for human microbiome research. In: Nature 486.7402 (June 2012),<br />
pp. 215221. doi: 10.1038/nature11209. pmid: 22699610.<br />
[49] C. Huttenhower, D. Gevers, R. Knight, S. Abubucker, J. H. Badger, A. T. Chinwalla,<br />
H. H. Creasy, A. M. Earl, M. G. FitzGerald, R. S. Fulton, M. G. Giglio, K. Hallsworth-<br />
Pepin, E. A. Lobos, R. Madupu, V. Magrini, J. C. Martin, M. Mitreva, D. M. Muzny,<br />
E. J. Sodergren, J. Versalovic, A. M. Wollam, K. C. Worley, J. R. Wortman, S. K. Young,<br />
Q. Zeng, K. M. Aagaard, O. O. Abolude, E. Allen-Vercoe, E. J. Alm, L. Alvarado, et al.<br />
Structure, function <strong>and</strong> diversity of the healthy human microbiome. In: Nature 486.7402<br />
(June 2012), pp. 207214. doi: 10.1038/nature11234. pmid: 22699609.<br />
[50] P. R. Burton, D. G. Clayton, L. R. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. P.<br />
Kwiatkowski, M. I. McCarthy, W. H. Ouweh<strong>and</strong>, N. J. Samani, J. A. Todd, P. Donnelly,<br />
J. C. Barrett, P. R. Burton, D. Davison, P. Donnelly, D. Easton, D. Evans, H. T. Leung,<br />
J. L. Marchini, A. P. Morris, C. C. Spencer, M. D. Tobin, L. R. Cardon, D. G. Clayton,<br />
A. P. Attwood, J. P. Boorman, B. Cant, U. Everson, J. M. Hussey, et al. Genome-wide<br />
association study of 14,000 cases of seven common diseases <strong>and</strong> 3,000 shared controls. In:<br />
Nature 447.7145 (June 2007), pp. 661678. doi: 10.1038/nature05911. pmid: 17554300.<br />
[51] R. W. Carthew <strong>and</strong> E. J. Sontheimer. Origins <strong>and</strong> Mechanisms of miRNAs <strong>and</strong> siRNAs.<br />
In: Cell 136.4 (Feb. 2009), pp. 642655. doi: 10 . 1016 / j . cell . 2009 . 01 . 035. pmid:<br />
19239886.<br />
[52] O. Voinnet. Origin, biogenesis, <strong>and</strong> activity of plant microRNAs. In: Cell 136.4 (Feb.<br />
2009), pp. 669687. doi: 10.1016/j.cell.2009.01.046. pmid: 19239888.<br />
[53] S. E. Linsen, E. de Wit, G. Janssens, S. Heater, L. Chapman, R. K. Parkin, B. Fritz, S. K.<br />
Wyman, E. de Bruijn, E. E. Voest, S. Kuersten, M. Tewari, <strong>and</strong> E. Cuppen. Limitations<br />
<strong>and</strong> possibilities of small RNA digital gene expression proling. In: Nat. Methods 6.7<br />
(July 2009), pp. 474476. doi: 10.1038/nmeth0709-474. pmid: 19564845.<br />
[54] S. U. Meyer, M. W. Pfa, <strong>and</strong> S. E. Ulbrich. Normalization strategies for microRNA pro-<br />
ling experiments: a 'normal' way to a hidden layer of complexity? In: Biotechnol. Lett.<br />
32.12 (Dec. 2010), pp. 17771788. doi: 10.1007/s10529-010-0380-z. pmid: 20703800.<br />
[55] J. D. Hollister, L. M. Smith, Y. L. Guo, F. Ott, D. Weigel, <strong>and</strong> B. S. Gaut. Transposable<br />
elements <strong>and</strong> small RNAs contribute to gene expression divergence between Arabidopsis<br />
thaliana <strong>and</strong> Arabidopsis lyrata. In: Proc. Natl. Acad. Sci. U.S.A. 108.6 (Feb. 2011),<br />
pp. 23222327. doi: 10.1073/pnas.1018222108.<br />
[56] S. Barker, M. Weinfeld, <strong>and</strong> D. Murray. DNA-protein crosslinks: their induction, repair,<br />
<strong>and</strong> biological consequences. In: Mutat. Res. 589.2 (Mar. 2005), pp. 111135. doi: 10.<br />
1016/j.mrrev.2004.11.003. pmid: 15795165.<br />
[57] A. Valouev, J. Ichikawa, T. Tonthat, J. Stuart, S. Ranade, H. Peckham, K. Zeng, J. A.<br />
Malek, G. Costa, K. McKernan, A. Sidow, A. Fire, <strong>and</strong> S. M. Johnson. A high-resolution,<br />
nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning.<br />
In: Genome Res. 18.7 (July 2008), pp. 10511063. doi: 10.1101/gr.076463.108.<br />
pmid: 18477713.
BIBLIOGRAPHY 87<br />
[58] L. Yant, J. Mathieu, T. T. Dinh, F. Ott, C. Lanz, H. Wollmann, X. Chen, <strong>and</strong> M. Schmid.<br />
Orchestration of the oral transition <strong>and</strong> oral development in Arabidopsis by the bifunctional<br />
transcription factor APETALA2. In: Plant Cell 22.7 (July 2010), pp. 2156<br />
2170. doi: 10.1105/tpc.110.075606.<br />
[59] M. Hafner, M. L<strong>and</strong>thaler, L. Burger, M. Khorshid, J. Hausser, P. Berninger, A. Rothballer,<br />
M. Ascano, A. C. Jungkamp, M. Munschauer, A. Ulrich, G. S. Wardle, S. Dewell,<br />
M. Zavolan, <strong>and</strong> T. Tuschl. Transcriptome-wide identication of RNA-binding protein<br />
<strong>and</strong> microRNA target sites by PAR-CLIP. In: Cell 141.1 (Apr. 2010), pp. 129141. doi:<br />
10.1016/j.cell.2010.03.009. pmid: 20371350.<br />
[60] M. Muers. Technology: Getting Moore from DNA sequencing. In: Nat. Rev. Genet. 12.9<br />
(Sept. 2011), p. 586. doi: 10.1038/nrg3059. pmid: 21808262.<br />
[61] G. E. Moore. Cramming More Components onto <strong>Integrated</strong> Circuits. In: Electronics<br />
38.8 (Apr. 19, 1965), pp. 114117. doi: 10.1109/JPROC.1998.658762.<br />
[62] R. F. Service. Gene sequencing. The race for the $1000 genome. In: Science 311.5767<br />
(Mar. 2006), pp. 15441546. doi: 10.1126/science.311.5767.1544. pmid: 16543431.<br />
[63] C. Ledergerber <strong>and</strong> C. Dessimoz. Base-calling for next-generation sequencing platforms.<br />
In: Brief. Bioinformatics 12.5 (Sept. 2011), pp. 489497. doi: 10 . 1093 / bib / bbq077.<br />
pmid: 21245079.<br />
[64] D. Aird, M. G. Ross, W. S. Chen, M. Danielsson, T. Fennell, C. Russ, D. B. Jae, C.<br />
Nusbaum, <strong>and</strong> A. Gnirke. <strong>An</strong>alyzing <strong>and</strong> minimizing PCR amplication bias in Illumina<br />
sequencing <strong>lib</strong>raries. In: Genome Biol. 12.2 (2011), R18. doi: 10.1186/gb-2011-12-2-<br />
r18. pmid: 21338519.<br />
[65] B. Ewing, L. Hillier, M. C. Wendl, <strong>and</strong> P. Green. Base-calling of automated sequencer<br />
traces using phred. I. Accuracy assessment. In: Genome Res. 8.3 (Mar. 1998), pp. 175<br />
185. doi: 10.1101/gr.8.3.175. pmid: 9521921.<br />
[66] B. Ewing <strong>and</strong> P. Green. Base-calling of automated sequencer traces using phred. II. Error<br />
probabilities. In: Genome Res. 8.3 (Mar. 1998), pp. 186194. doi: 10.1101/gr.8.3.186.<br />
pmid: 9521922.<br />
[67] M. Kircher, U. Stenzel, <strong>and</strong> J. Kelso. Improved base calling for the Illumina Genome<br />
<strong>An</strong>alyzer using machine learning strategies. In: Genome Biol. 10.8 (2009), R83. doi:<br />
10.1186/gb-2009-10-8-r83. pmid: 19682367.<br />
[68] W. C. Kao, K. Stevens, <strong>and</strong> Y. S. Song. BayesCall: A model-based base-calling algorithm<br />
for high-throughput short-read sequencing. In: Genome Res. 19.10 (Oct. 2009), pp. 1884<br />
1895. doi: 10.1101/gr.095299.109. pmid: 19661376.<br />
[69] J. Rougemont, A. Amzallag, C. Iseli, L. Farinelli, I. Xenarios, <strong>and</strong> F. Naef. Probabilistic<br />
base calling of Solexa sequencing data. In: BMC Bioinformatics 9 (2008), p. 431. doi:<br />
10.1186/1471-2105-9-431. pmid: 18851737.<br />
[70] Y. Erlich, P. P. Mitra, M. delaBastide, W. R. McCombie, <strong>and</strong> G. J. Hannon. Alta-Cyclic:<br />
a self-optimizing base caller for next-generation sequencing. In: Nat. Methods 5.8 (Aug.<br />
2008), pp. 679682. doi: 10.1038/nmeth.1230. pmid: 18604217.<br />
[71] A. R. Quinlan, D. A. Stewart, M. P. Stromberg, <strong>and</strong> G. T. Marth. Pyrobayes: an improved<br />
base caller for SNP discovery in pyrosequences. In: Nat. Methods 5.2 (Feb. 2008), pp. 179<br />
181. doi: 10.1038/nmeth.1172. pmid: 18193056.
88 BIBLIOGRAPHY<br />
[72] S. <strong>An</strong>drews. FastQC: A quality control tool for high throughput sequence data. May 2012.<br />
WebCite: 68W7OHlEL. url: http : / / www . bioinformatics . babraham . ac . uk / projects /<br />
fastqc/ Feb. 27, 2013.<br />
[73] A. Gordon. FASTX-Toolkit: FASTQ/A short-reads pre-processing tools. Feb. 2010. WebCite:<br />
5zWQ6Eqh6. url: http://hannonlab.cshl.edu/fastx_toolkit/index.html Feb. 27,<br />
2013.<br />
[74] T. Lassmann, Y. Hayashizaki, <strong>and</strong> C. O. Daub. TagDusta program to eliminate artifacts<br />
from next generation sequencing data. In: Bioinformatics 25.21 (Nov. 2009), pp. 2839<br />
2840. doi: 10.1093/bioinformatics/btp527. pmid: 19737799.<br />
[75] R. Schmieder, Y. W. Lim, F. Rohwer, <strong>and</strong> R. Edwards. TagCleaner: Identication <strong>and</strong><br />
removal of tag sequences from genomic <strong>and</strong> metagenomic datasets. In: BMC Bioinformatics<br />
11 (2010), p. 341. doi: 10.1186/1471-2105-11-341. pmid: 20573248.<br />
[76] R. K. Patel <strong>and</strong> M. Jain. NGS QC Toolkit: a toolkit for quality control of next generation<br />
sequencing data. In: PLoS ONE 7.2 (2012), e30619. doi: 10 . 1371 / journal . pone .<br />
0030619. pmid: 22312429.<br />
[77] D. R. Zerbino <strong>and</strong> E. Birney. Velvet: algorithms for de novo short read assembly using<br />
de Bruijn graphs. In: Genome Res. 18.5 (May 2008), pp. 821829. doi: 10.1101/gr.<br />
074492.107. pmid: 18349386.<br />
[78] M. H. Schulz, D. R. Zerbino, M. Vingron, <strong>and</strong> E. Birney. Oases: robust de novo RNAseq<br />
assembly across the dynamic range of expression levels. In: Bioinformatics 28.8 (Apr.<br />
2012), pp. 10861092. doi: 10.1093/bioinformatics/bts094. pmid: 22368243.<br />
[79] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, J.<br />
Tang, G. Wu, H. Zhang, Y. Shi, Y. Liu, C. Yu, B. Wang, Y. Lu, C. Han, D. Cheung, S.-M.<br />
Yiu, S. Peng, Z. Xiaoqian, G. Liu, X. Liao, Y. Li, H. Yang, J. Wang, T.-W. Lam, <strong>and</strong><br />
J. Wang. SOAPdenovo2: an empirically improved memory-ecient short-read de novo<br />
assembler. In: GigaScience 1.1 (2012), p. 18. doi: 10.1186/2047-217X-1-18.<br />
[80] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. L<strong>and</strong>er,<br />
C. Nusbaum, <strong>and</strong> D. B. Jae. ALLPATHS: de novo assembly of whole-genome shotgun<br />
microreads. In: Genome Res. 18.5 (May 2008), pp. 810820. doi: 10.1101/gr.7337908.<br />
pmid: 18340039.<br />
[81] I. Maccallum, D. Przybylski, S. Gnerre, J. Burton, I. Shlyakhter, A. Gnirke, J. Malek, K.<br />
McKernan, S. Ranade, T. P. Shea, L. Williams, S. Young, C. Nusbaum, <strong>and</strong> D. B. Jae.<br />
ALLPATHS 2: small genomes assembled accurately <strong>and</strong> with high continuity from short<br />
paired reads. In: Genome Biol. 10.10 (2009), R103. doi: 10.1186/gb-2009-10-10-r103.<br />
pmid: 19796385.<br />
[82] F. J. Ribeiro, D. Przybylski, S. Yin, T. Sharpe, S. Gnerre, A. Abouelleil, A. M. Berlin, A.<br />
Montmayeur, T. P. Shea, B. J. Walker, S. K. Young, C. Russ, C. Nusbaum, I. MacCallum,<br />
<strong>and</strong> D. B. Jae. Finished bacterial genomes from shotgun sequence data. In: Genome<br />
Res. 22.11 (Nov. 2012), pp. 22702277. doi: 10.1101/gr.141515.112. pmid: 22829535.<br />
[83] S. Gnerre, I. Maccallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, T.<br />
Sharpe, G. Hall, T. P. Shea, S. Sykes, A. M. Berlin, D. Aird, M. Costello, R. Daza, L.<br />
Williams, R. Nicol, A. Gnirke, C. Nusbaum, E. S. L<strong>and</strong>er, <strong>and</strong> D. B. Jae. High-quality<br />
draft assemblies of mammalian genomes from massively parallel sequence data. In: Proc.<br />
Natl. Acad. Sci. U.S.A. 108.4 (Jan. 2011), pp. 15131518. doi: 10.1073/pnas.1017351108.<br />
pmid: 21187386.
BIBLIOGRAPHY 89<br />
[84] K. Schneeberger, S. Ossowski, F. Ott, J. D. Klein, X. Wang, C. Lanz, L. M. Smith, J.<br />
Cao, J. Fitz, N. Warthmann, S. R. Henz, D. H. Huson, <strong>and</strong> D. Weigel. Reference-guided<br />
assembly of four diverse Arabidopsis thaliana genomes. In: Proc. Natl. Acad. Sci. U.S.A.<br />
108.25 (June 2011), pp. 1024910254. doi: 10.1073/pnas.1107739108.<br />
[85] H. Li <strong>and</strong> R. Durbin. Fast <strong>and</strong> accurate short read alignment with Burrows-<br />
Wheeler transform. In: Bioinformatics 25.14 (July 2009), pp. 17541760. doi:<br />
10.1093/bioinformatics/btp324. pmid: 19451168.<br />
[86] B. Langmead. Aligning short sequencing reads with Bowtie. In: Curr Protoc Bioinformatics<br />
Chapter 11 (Dec. 2010), Unit 11.7. doi: 10.1002/0471250953.bi1107s32. pmid:<br />
21154709.<br />
[87] B. Langmead <strong>and</strong> S. L. Salzberg. Fast gapped-read alignment with Bowtie 2. In: Nat.<br />
Methods 9.4 (Apr. 2012), pp. 357359. doi: 10.1038/nmeth.1923. pmid: 22388286.<br />
[88] R. Li, Y. Li, K. Kristiansen, <strong>and</strong> J. Wang. SOAP: short oligonucleotide alignment program.<br />
In: Bioinformatics 24.5 (Mar. 2008), pp. 713714. doi: 10.1093/bioinformatics/<br />
btn025. pmid: 18227114.<br />
[89] R. Li, C. Yu, Y. Li, T. W. Lam, S. M. Yiu, K. Kristiansen, <strong>and</strong> J. Wang. SOAP2: an<br />
improved ultrafast tool for short read alignment. In: Bioinformatics 25.15 (Aug. 2009),<br />
pp. 19661967. doi: 10.1093/bioinformatics/btp336. pmid: 19497933.<br />
[90] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, <strong>and</strong> M. Brudno. SHRiMP:<br />
accurate mapping of short color-space reads. In: PLoS Comput. Biol. 5.5 (May 2009),<br />
e1000386. doi: 10.1371/journal.pcbi.1000386. pmid: 19461883.<br />
[91] M. David, M. Dzamba, D. Lister, L. Ilie, <strong>and</strong> M. Brudno. SHRiMP2: sensitive yet practical<br />
SHort Read Mapping. In: Bioinformatics 27.7 (Apr. 2011), pp. 10111012. doi:<br />
10.1093/bioinformatics/btr046. pmid: 21278192.<br />
[92] H. Li, J. Ruan, <strong>and</strong> R. Durbin. Mapping short DNA sequencing reads <strong>and</strong> calling variants<br />
using mapping quality scores. In: Genome Res. 18.11 (Nov. 2008), pp. 18511858. doi:<br />
10.1101/gr.078212.108. pmid: 18714091.<br />
[93] K. Schneeberger, J. Hagmann, S. Ossowski, N. Warthmann, S. Gesing, O. Kohlbacher,<br />
<strong>and</strong> D. Weigel. Simultaneous alignment of short reads against multiple genomes. In:<br />
Genome Biol. 10.9 (2009), R98. doi: 10.1186/gb-2009-10-9-r98. pmid: 19761611.<br />
[94] G. Jean, A. Kahles, V. T. Sreedharan, F. De Bona, <strong>and</strong> G. Ratsch. RNA-Seq read<br />
alignments with PALMapper. In: Curr Protoc Bioinformatics Chapter 11 (Dec. 2010),<br />
Unit 11.6. doi: 10.1002/0471250953.bi1106s32. pmid: 21154708.<br />
[95] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. In: Bioinformatics<br />
25.11 (June 2009), pp. 13631369. doi: 10.1093/bioinformatics/btp236. pmid:<br />
19357099.<br />
[96] P. Ferragina <strong>and</strong> G. Manzini. Opportunistic data structures with applications. In: Proceedings<br />
of the 41st <strong>An</strong>nual Symposium on Foundations of Computer Science. FOCS '00.<br />
Washington, DC, USA: IEEE Computer Society, 2000, pp. 390. isbn: 0-7695-0850-2. doi:<br />
10.1109/SFCS.2000.892127.<br />
[97] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K.<br />
Garimella, D. Altshuler, S. Gabriel, M. Daly, <strong>and</strong> M. A. DePristo. The Genome <strong>An</strong>alysis<br />
Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.<br />
In: Genome Res. 20.9 (Sept. 2010), pp. 12971303. doi: 10.1101/gr.107524.110. pmid:<br />
20644199.
90 BIBLIOGRAPHY<br />
[98] M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A.<br />
Philippakis, G. del <strong>An</strong>gel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M.<br />
Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, <strong>and</strong> M. J. Daly.<br />
A framework for variation discovery <strong>and</strong> genotyping using next-generation DNA sequencing<br />
data. In: Nat. Genet. 43.5 (May 2011), pp. 491498. doi: 10.1038/ng.806. pmid:<br />
21478889.<br />
[99] S. Ossowski, K. Schneeberger, J. I. Lucas-Lledo, N. Warthmann, R. M. Clark, R. G. Shaw,<br />
D. Weigel, <strong>and</strong> M. Lynch. The rate <strong>and</strong> molecular spectrum of spontaneous mutations<br />
in Arabidopsis thaliana. In: Science 327.5961 (Jan. 2010), pp. 9294. doi: 10 . 1126 /<br />
science.1180677. pmid: 20044577.<br />
[100] H. Li, B. H<strong>and</strong>saker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis,<br />
<strong>and</strong> R. Durbin. The Sequence Alignment/Map format <strong>and</strong> SAMtools. In: Bioinformatics<br />
25.16 (Aug. 2009), pp. 20782079. doi: 10 . 1093 / bioinformatics / btp352. pmid:<br />
19505943.<br />
[101] H. Li. Improving SNP discovery by base alignment quality. In: Bioinformatics 27.8 (Apr.<br />
2011), pp. 11571158. doi: 10.1093/bioinformatics/btr076. pmid: 21320865.<br />
[102] M. D. Robinson <strong>and</strong> A. Oshlack. A scaling normalization method for dierential expression<br />
analysis of RNA-seq data. In: Genome Biol. 11.3 (2010), R25. doi: 10.1186/gb-<br />
2010-11-3-r25. pmid: 20196867.<br />
[103] L. X. Garmire <strong>and</strong> S. Subramaniam. Evaluation of normalization methods in mammalian<br />
microRNA-Seq data. In: RNA 18.6 (June 2012), pp. 12791288. doi: 10 . 1261 / rna .<br />
030916.111. pmid: 22532701.<br />
[104] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum,<br />
R. M. Myers, M. Brown, W. Li, <strong>and</strong> X. S. Liu. Model-based analysis of ChIP-Seq<br />
(MACS). In: Genome Biol. 9.9 (2008), R137. doi: 10.1186/gb-2008-9-9-r137. pmid:<br />
18798982.<br />
[105] A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. <strong>An</strong>ton, S. Batzoglou, R. M. Myers,<br />
<strong>and</strong> A. Sidow. Genome-wide analysis of transcription factor binding sites based on ChIP-<br />
Seq data. In: Nat. Methods 5.9 (Sept. 2008), pp. 829834. doi: 10.1038/nmeth.1246.<br />
pmid: 19160518.<br />
[106] R. Jothi, S. Cuddapah, A. Barski, K. Cui, <strong>and</strong> K. Zhao. Genome-wide identication of in<br />
vivo protein-DNA binding sites from ChIP-Seq data. In: Nucleic Acids Res. 36.16 (Sept.<br />
2008), pp. 52215231. doi: 10.1093/nar/gkn488. pmid: 18684996.<br />
[107] J. Rozowsky, G. Euskirchen, R. K. Auerbach, Z. D. Zhang, T. Gibson, R. Bjornson, N.<br />
Carriero, M. Snyder, <strong>and</strong> M. B. Gerstein. PeakSeq enables systematic scoring of ChIPseq<br />
experiments relative to controls. In: Nat. Biotechnol. 27.1 (Jan. 2009), pp. 6675.<br />
doi: 10.1038/nbt.1518. pmid: 19122651.<br />
[108] H. Ji, H. Jiang, W. Ma, D. S. Johnson, R. M. Myers, <strong>and</strong> W. H. Wong. <strong>An</strong> integrated<br />
software system for analyzing ChIP-chip <strong>and</strong> ChIP-seq data. In: Nat. Biotechnol. 26.11<br />
(Nov. 2008), pp. 12931300. doi: 10.1038/nbt.1505. pmid: 18978777.<br />
[109] A. P. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge, <strong>and</strong> S. J. Jones. Find-<br />
Peaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read<br />
sequencing technology. In: Bioinformatics 24.15 (Aug. 2008), pp. 17291730. doi: 10.<br />
1093/bioinformatics/btn305. pmid: 18599518.
BIBLIOGRAPHY 91<br />
[110] P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, <strong>and</strong> P. M. Rice. The Sanger FASTQ le<br />
format for sequences with quality scores, <strong>and</strong> the Solexa/Illumina FASTQ variants. In:<br />
Nucleic Acids Res. 38.6 (Apr. 2010), pp. 17671771. doi: 10.1093/nar/gkp1137. pmid:<br />
20015970.<br />
[111] T. S. F. S. W. Group. The SAM Format Specication. Sept. 2011. WebCite: 6FqvzGZJ8.<br />
url: http://samtools.sourceforge.net/SAM1.pdf Feb. 27, 2013.<br />
[112] L. Stein. Generic Feature Format Version 3. Feb. 2013. WebCite: 5R00Wxobq. url: http:<br />
//www.sequenceontology.org/gff3.shtml Mar. 4, 2013.<br />
[113] M. G. Reese, B. Moore, C. Batchelor, F. Salas, F. Cunningham, G. T. Marth, L. Stein, P.<br />
Flicek, M. Y<strong>and</strong>ell, <strong>and</strong> K. Eilbeck. A st<strong>and</strong>ard variation le format for human genome<br />
sequences. In: Genome Biol. 11.8 (2010), R88. doi: 10.1186/gb-2010-11-8-r88. pmid:<br />
20796305.<br />
[114] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E.<br />
H<strong>and</strong>saker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, R. Durbin,<br />
D. Altshuler, G. Abecasis, D. Bentley, A. Chakravarti, A. Clark, F. De La Vega, P.<br />
Donnelly, M. Dunn, P. Flicek, S. Gabriel, E. Green, R. Gibbs, B. Knoppers, E. L<strong>and</strong>er,<br />
H. Lehrach, E. Mardis, G. Marth, et al. The variant call format <strong>and</strong> VCFtools. In:<br />
Bioinformatics 27.15 (Aug. 2011), pp. 21562158. doi: 10.1093/bioinformatics/btr330.<br />
pmid: 21653522.<br />
[115] P. Deutsch. DEFLATE Compressed <strong>Data</strong> Format Specication version 1.3. RFC 1951<br />
(Informational). Internet Engineering Task Force, May 1996. WebCite: 6Fr6mTscx. url:<br />
http://www.ietf.org/rfc/rfc1951.txt.<br />
[116] P. Deutsch. GZIP le format specication version 4.3. RFC 1952 (Informational). Internet<br />
Engineering Task Force, May 1996. WebCite: 6Fr6csgTT. url: http://www.ietf.org/rfc/<br />
rfc1952.txt.<br />
[117] X. Chen, M. Li, B. Ma, <strong>and</strong> J. Tromp. DNACompress: fast <strong>and</strong> eective DNA sequence<br />
compression. In: Bioinformatics 18.12 (Dec. 2002), pp. 16961698. doi: 10.1093/<br />
bioinformatics/18.12.1696. pmid: 12490460.<br />
[118] F. Hach, I. Numanagic, C. Alkan, <strong>and</strong> S. C. Sahinalp. SCALCE: boosting sequence<br />
compression algorithms using locally consistent encoding. In: Bioinformatics 28.23 (Dec.<br />
2012), pp. 30513057. doi: 10.1093/bioinformatics/bts593. pmid: 23047557.<br />
[119] D. C. Jones, W. L. Ruzzo, X. Peng, <strong>and</strong> M. G. Katze. Compression of next-generation<br />
sequencing reads aided by highly ecient de novo assembly. In: Nucleic Acids Res. 40.22<br />
(Dec. 2012), e171. doi: 10.1093/nar/gks754. pmid: 22904078.<br />
[120] M. Hsi-Yang Fritz, R. Leinonen, G. Cochrane, <strong>and</strong> E. Birney. Ecient storage of high<br />
throughput DNA sequencing data using reference-based compression. In: Genome Res.<br />
21.5 (May 2011), pp. 734740. doi: 10.1101/gr.114819.110. pmid: 21245279.<br />
[121] S. Golomb. Run-length encodings (Corresp.) In: Information Theory, IEEE Transactions<br />
on 12.3 (1966), pp. 399401. issn: 0018-9448. doi: 10.1109/TIT.1966.1053907.<br />
[122] D. Human. A Method for the Construction of Minimum-Redundancy Codes. In: Proceedings<br />
of the IRE 40.9 (1952), pp. 10981101. issn: 0096-8390. doi: 10.1109/JRPROC.<br />
1952.273898.<br />
[123] W. J. Kent, A. S. Zweig, G. Barber, A. S. Hinrichs, <strong>and</strong> D. Karolchik. BigWig <strong>and</strong><br />
BigBed: enabling browsing of large distributed datasets. In: Bioinformatics 26.17 (Sept.<br />
2010), pp. 22042207. doi: 10.1093/bioinformatics/btq351. pmid: 20639541.
92 BIBLIOGRAPHY<br />
[124] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, <strong>and</strong><br />
D. Haussler. The human genome browser at UCSC. In: Genome Res. 12.6 (June 2002),<br />
pp. 9961006. doi: 10.1101/gr.229102. pmid: 12045153.<br />
[125] H. Li. Tabix: fast retrieval of sequence features from generic TAB-delimited les. In:<br />
Bioinformatics 27.5 (Mar. 2011), pp. 718719. doi: 10.1093/bioinformatics/btq671.<br />
pmid: 21208982.<br />
[126] A. Guttman. R-trees: a dynamic index structure for spatial searching. In: SIGMOD Rec.<br />
14.2 (June 1984), pp. 4757. issn: 0163-5808. doi: 10.1145/971697.602266.<br />
[127] L. Collin <strong>and</strong> I. Pavlov. The .xz File Format version 1.0.4. Aug. 2009. WebCite: 6FqpSZFsW.<br />
url: http://tukaani.org/xz/xz-file-format.txt Nov. 4, 2012.<br />
[128] J. L. Bentley. Multidimensional binary search trees used for associative searching. In:<br />
Commun. ACM 18.9 (Sept. 1975), pp. 509517. issn: 0001-0782. doi: 10.1145/361002.<br />
361007.<br />
[129] Y. Livnat, H.-W. Shen, <strong>and</strong> C. R. Johnson. A Near Optimal Isosurface Extraction Algorithm<br />
Using the Span Space. In: IEEE Transactions on Visualization <strong>and</strong> Computer<br />
Graphics 2.1 (Mar. 1996), pp. 7384. issn: 1077-2626. doi: 10.1109/2945.489388.<br />
[130] D. Lee <strong>and</strong> C. Wong. Worst-case analysis for region <strong>and</strong> partial region searches in multidimensional<br />
binary search trees <strong>and</strong> balanced quad trees. English. In: Acta Informatica<br />
9.1 (1977), pp. 2329. issn: 0001-5903. doi: 10.1007/BF00263763.<br />
[131] S. Kurtz. The Vmatch large scale sequence analysis software. Dec. 2011. WebCite:<br />
5EyNRAwVK. url: http://www.vmatch.de Feb. 27, 2013.<br />
[132] M. I. Abouelhoda, S. Kurtz, <strong>and</strong> E. Ohlebusch. Replacing sux trees with enhanced sux<br />
arrays. In: Journal of Discrete Algorithms 2.1 (2004), pp. 53 86. doi: 10.1016/S1570-<br />
8667(03)00065-0.<br />
[133] S. B. Needleman <strong>and</strong> C. D. Wunsch. A general method applicable to the search for<br />
similarities in the amino acid sequence of two proteins. In: J. Mol. Biol. 48.3 (Mar.<br />
1970), pp. 443453. doi: 10.1016/0022-2836(70)90057-4. pmid: 5420325.<br />
[134] T. F. Smith <strong>and</strong> M. S. Waterman. Identication of common molecular subsequences.<br />
In: J. Mol. Biol. 147.1 (Mar. 1981), pp. 195197. doi: 10.1016/0022-2836(81)90087-5.<br />
pmid: 7265238.<br />
[135] F. Van Nieuwerburgh, R. C. Thompson, J. Ledesma, D. Deforce, T. Gaasterl<strong>and</strong>, P.<br />
Ordoukhanian, <strong>and</strong> S. R. Head. Illumina mate-paired DNA sequencing-<strong>lib</strong>rary preparation<br />
using Cre-Lox recombination. In: Nucleic Acids Res. 40.3 (Feb. 2012), e24. doi:<br />
10.1093/nar/gkr1000. pmid: 22127871.<br />
[136] W. J. Kent. BLATthe BLAST-like alignment tool. In: Genome Res. 12.4 (Apr. 2002),<br />
pp. 656664. doi: 10.1101/gr.229202. pmid: 11932250.<br />
[137] E. Moyroud, E. G. Minguet, F. Ott, L. Yant, D. Pose, M. Monniaux, S. Blanchet, O.<br />
Bastien, E. Thevenon, D. Weigel, M. Schmid, <strong>and</strong> F. Parcy. Prediction of regulatory<br />
interactions from genome sequences using a biophysical model for the Arabidopsis LEAFY<br />
transcription factor. In: Plant Cell 23.4 (Apr. 2011), pp. 12931306. doi: 10.1105/tpc.<br />
111.083329.
BIBLIOGRAPHY 93<br />
[138] R. Br<strong>and</strong>t, M. Salla-Martret, J. Bou-Torrent, T. Musielak, M. Stahl, C. Lanz, F. Ott,<br />
M. Schmid, T. Greb, M. Schwarz, S. B. Choi, M. K. Barton, B. J. Reinhart, T. Liu,<br />
M. Quint, J. C. Palauqui, J. F. Martinez-Garcia, <strong>and</strong> S. Wenkel. Genome-wide bindingsite<br />
analysis of REVOLUTA reveals a link between leaf patterning <strong>and</strong> light-mediated<br />
growth responses. In: Plant J. 72.1 (Oct. 2012), pp. 3142. doi: 10 . 1111 / j . 1365 -<br />
313X.2012.05049.x.<br />
[139] R. G. Immink, D. Pose, S. Ferrario, F. Ott, K. Kaufmann, F. L. Valentim, S. de Folter, F.<br />
van der Wal, A. D. van Dijk, M. Schmid, <strong>and</strong> G. C. <strong>An</strong>genent. Characterization of SOC1's<br />
central role in owering by the identication of its upstream <strong>and</strong> downstream regulators.<br />
In: Plant Physiol. 160.1 (Sept. 2012), pp. 433449. doi: 10.1104/pp.112.202614.<br />
[140] D. Pose, L. Verhage, F. Ott, L. Yant, J. Mathieu, G. C. <strong>An</strong>genent, R. G. Immink, <strong>and</strong> M.<br />
Schmid. Temperature-dependent regulation of owering by antagonistic FLM variants.<br />
In: Nature (Sept. 2013). doi: 10.1038/nature12633.<br />
[141] P. Merelo, Y. Xie, L. Br<strong>and</strong>, F. Ott, D. Weigel, J. L. Bowman, M. G. Heisler, <strong>and</strong> S.<br />
Wenkel. Genome-Wide Identication of KANADI1 Target Genes. In: PLoS ONE 8.10<br />
(2013), e77341. doi: 10.1371/journal.pone.0077341.<br />
[142] E. S. L<strong>and</strong>er <strong>and</strong> M. S. Waterman. Genomic mapping by ngerprinting r<strong>and</strong>om clones: a<br />
mathematical analysis. In: Genomics 2.3 (Apr. 1988), pp. 231239. doi: 10.1016/0888-<br />
7543(88)90007-9. pmid: 3294162.<br />
[143] Y. Benjamini <strong>and</strong> Y. Hochberg. Controlling the false discovery rate: a practical <strong>and</strong><br />
powerful approach to multiple testing. In: Journal of the Royal Statistical Society. Series<br />
B (Methodological) (1995), pp. 289300.<br />
[144] U. Manber <strong>and</strong> G. Myers. Sux arrays: a new method for on-line string searches. In:<br />
Proceedings of the rst annual ACM-SIAM symposium on Discrete algorithms. SODA '90.<br />
San Francisco, California, USA: Society for Industrial <strong>and</strong> Applied Mathematics, 1990,<br />
pp. 319327. isbn: 0-89871-251-3.<br />
[145] G. Nong, S. Zhang, <strong>and</strong> W. H. Chan. Two Ecient Algorithms for Linear Time Sux<br />
Array Construction. In: IEEE Transactions on Computers 60.10 (2011), pp. 14711484.<br />
issn: 0018-9340. doi: 10.1109/TC.2010.188.<br />
[146] K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor, <strong>and</strong> S. Dudoit. multtest: Resampling-based<br />
multiple hypothesis testing. R package version 2.4.0.<br />
[147] Y. Benjamini <strong>and</strong> D. Yekutieli. The control of the false discovery rate in multiple testing<br />
under dependency. In: <strong>An</strong>nals of statistics (2001), pp. 11651188.<br />
[148] O. J. Dunn. Multiple comparisons among means. In: Journal of the American Statistical<br />
Association 56.293 (1961), pp. 5264.<br />
[149] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of signicance. In:<br />
Biometrika 75.4 (1988), pp. 800802. doi: 10.1093/biomet/75.4.800.<br />
[150] S. Holm. A simple sequentially rejective multiple test procedure. In: Sc<strong>and</strong>inavian journal<br />
of statistics (1979), pp. 6570.<br />
[151] P. H. Westfall <strong>and</strong> S. Young. Resampling-Based Multiple Testing: Examples <strong>and</strong> Methods<br />
for P-Value Adjustment. A Wiley-Interscience publication. Wiley, 1993. isbn: 9 780471<br />
557616.